From d199c788c4e62539b5f64ac438e6e17f9a84f1ad Mon Sep 17 00:00:00 2001
From: Mortdecai
Date: Thu, 26 Mar 2026 03:02:18 -0400
Subject: [PATCH] =?UTF-8?q?docs:=20training=20data=20analysis=20=E2=80=94?=
 =?UTF-8?q?=206=20compounding=20failure=20modes=20identified?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Root cause: 90% of system prompts exceed max_seq_len (2048 tokens) by
2.5x, so the model trained on truncated fragments with no user/assistant
content. Plus mixed paradigm (55% tool_call / 45% JSON), 6 JSON schema
variants, contaminated examples, and /no_think misuse.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 training-data-analysis.md | 135 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 135 insertions(+)
 create mode 100644 training-data-analysis.md

diff --git a/training-data-analysis.md b/training-data-analysis.md
new file mode 100644
index 0000000..f8feac4
--- /dev/null
+++ b/training-data-analysis.md
@@ -0,0 +1,135 @@
+# Training Data Analysis: merged_training_v06.jsonl
+
+**Date:** 2026-03-26
+**Source:** `/home/claude/bin/Mincecraft-AI-model/data/processed/merged_training_v06.jsonl`
+**Examples:** 7,256 total
+**Training script:** `training/scripts/train_lora.py` (Unsloth + TRL SFTTrainer)
+
+---
+
+## Dataset Structure
+
+| Metric | Value |
+|--------|-------|
+| Total examples | 7,256 |
+| Parse errors | 0 |
+| Message format | "conversations" (7,254) + "text" (2) |
+| Multi-turn (>3 msgs) | 4,006 (55.2%) |
+| Has system prompt | 100% |
+| Has /no_think prefix | 3,459 (47.7%) |
+| Avg system prompt length | 21,358 chars (~5,300 tokens) |
+| System prompts >20K chars | 6,526 (89.9%) |
+| JSON responses | 3,248 (44.8%) |
+| tool_call responses | 4,006 (55.2%) |
+| Contains `` tags | 52 |
+| Tool role messages | 12,155 |
+| Contaminated JSON (tool_call inside) | 84 |
+| Empty commands arrays | 387 |
+
+## System Prompt Variants
+
+| Variant | Count |
+|---------|-------|
+| script_writer (no /no_think) | 3,559 |
+| /no_think + paper_server | 2,942 |
+| /no_think + other | 338 |
+| paper_server (no /no_think) | 236 |
+| /no_think + server_admin | 179 |
+| god_persona | 6 |
+
+## JSON Response Schema Variants
+
+| Key Combination | Count |
+|-----------------|-------|
+| {commands, risk_level} | 1,112 |
+| {commands, reasoning, risk_level} | 911 |
+| {commands, message, risk_level} | 495 |
+| {commands, message, reasoning} | 338 |
+| {commands, message, reasoning, risk_level} | 243 |
+| {commands, reasoning} | 149 |
+
+## Tool Usage in Training Data
+
+| Tool | Calls |
+|------|-------|
+| rcon.execute | 7,220 |
+| script.validate | 1,496 |
+| script.write | 1,493 |
+| script.execute | 1,485 |
+| journal.read | 110 |
+| world.player_info | 70 |
+| journal.write | 33 |
+| world.scan_area | 29 |
+| minecraft.wiki_lookup | 28 |
+| world.nearby_entities | 25 |
+| Others (14 tools) | <25 each |
+
+---
+
+## Six Compounding Failure Modes
+
+### 1. System Prompt Truncation (The Killer)
+
+`max_seq_len = 2048 tokens`. Average system prompt = ~5,300 tokens. **In 90% of examples the system prompt alone is roughly 2.5x the entire sequence length.**
+
+With packing enabled, the trainer stuffs multiple examples into each 2048-token window. The system prompt alone doesn't fit, so the model trained on truncated system prompts with **no user input and no assistant response in most examples**. It learned system prompt fragments, not task behavior.
+
+### 2. Mixed Response Paradigm
+
+44.8% of examples teach: return clean JSON `{"commands": [...]}`.
+55.2% of examples teach: emit `{"name": "rcon.execute",...}`.
+
+No clear signal distinguishes when to use each format. The system prompts differ but get truncated (Issue 1), so the model never sees the disambiguation.
+
+### 3. Inconsistent JSON Schema
+
+6 different key combinations across the JSON responses. No single schema dominates enough to become the learned default.
+
+### 4. Contaminated Examples
+
+- 84 examples have `<tool_call>` strings inside JSON responses (pipeline leakage)
+- 387 examples have empty `commands: []` (teaches that returning nothing is acceptable)
+- 2 raw `text` format entries with literal `<|im_start|>` tokens
+
+### 5. Tool Role Incompatibility
+
+4,006 examples use custom tool names (`rcon.execute`, `script.validate`) that aren't in Qwen's pretrained vocabulary. The model needs to learn these from scratch, but with truncated sequences it never sees enough context.
+
+### 6. `/no_think` Misuse
+
+`/no_think` is a Qwen inference-time directive, not a trainable behavior. Including it in 48% of training data wastes tokens and doesn't transfer to the fine-tuned model's behavior (confirmed by probes).
+
+---
+
+## Training Pipeline Details
+
+### Data Flow
+```
+Raw sources (seed, tool, audit, self_play, distilled, etc.)
+  → merge_datasets.py (normalize, dedup, 95/5 split)
+  → merged_training_v06.jsonl
+  → train_lora.py
+      ├─ load_seed_dataset() → conversations format
+      ├─ load_tool_dataset() → messages/text format
+      └─ formatting_func() → tokenizer.apply_chat_template()
+  → SFTTrainer (Unsloth, QLoRA 4-bit)
+  → LoRA merge → GGUF → quantize → Ollama
+```
+
+### Training Config
+- **Base models:** Qwen3-8B, Qwen3.5-9B, Qwen3.5-14B
+- **Method:** QLoRA (4-bit base + FP16 LoRA)
+- **LoRA:** rank 16-64, alpha 32-128, targets q/k/v/o/gate/up/down_proj
+- **Batch:** 2 x 4 grad accum = effective 8
+- **LR:** 2e-4, cosine schedule, 0.1 warmup
+- **Epochs:** 1 (default)
+- **Packing:** enabled
+- **max_seq_len:** 2048
+
+### What the Model Actually Trained On
+
+Due to truncation + packing, most training examples were reduced to:
+```
+[truncated system prompt fragment][truncated system prompt fragment][truncated...]
+```
+The model spent compute learning to predict system prompt text, not learning the user→assistant mapping.
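
The truncation claim in the analysis above can be sanity-checked with a rough token-budget estimate before any retraining. The sketch below is a hypothetical helper, not part of `train_lora.py`. It assumes each record stores its turns under `conversations` as dicts with `role` and `content` keys (the real key names may differ), and it approximates token counts at about 4 characters per token, the ratio implied by 21,358 chars ≈ 5,300 tokens. A real check would use the model's tokenizer instead of this heuristic.

```python
import json
from typing import Iterable, Iterator

CHARS_PER_TOKEN = 4.0  # assumption: ratio implied by 21,358 chars ~ 5,300 tokens
MAX_SEQ_LEN = 2048     # value listed in the training config


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Crude token estimate based on character count only."""
    return int(len(text) / chars_per_token)


def system_prompt(example: dict) -> str:
    """Return the first system message, or '' if none is found.

    The "conversations" / "role" / "content" keys are assumptions
    about the record layout, used here for illustration only.
    """
    for msg in example.get("conversations", []):
        if msg.get("role") == "system":
            return msg.get("content", "")
    return ""


def truncation_report(examples: Iterable[dict], max_seq_len: int = MAX_SEQ_LEN) -> dict:
    """Count examples whose system prompt alone fills the sequence budget."""
    total = 0
    over = 0
    for ex in examples:
        total += 1
        if estimate_tokens(system_prompt(ex)) >= max_seq_len:
            over += 1
    return {
        "total": total,
        "over_budget": over,
        "fraction": over / total if total else 0.0,
    }


def load_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Tiny synthetic demo: one oversized system prompt, one short one.
    demo = [
        {"conversations": [{"role": "system", "content": "x" * 21358},
                           {"role": "user", "content": "hi"}]},
        {"conversations": [{"role": "system", "content": "short"},
                           {"role": "user", "content": "hi"}]},
    ]
    print(truncation_report(demo))
```

To run it against the real dataset, `truncation_report(load_jsonl(path))` with the source path from the report header should work, provided the record layout matches the assumptions above. Any example counted as over budget leaves no room for user or assistant turns within a 2048-token window.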