
Mortdecai d199c788c4 docs: training data analysis — 6 compounding failure modes identified

Root cause: 90% of system prompts exceed max_seq_len (2048 tokens)
by 2.5x, so model trained on truncated fragments with no user/assistant
content. Plus mixed paradigm (55% tool_call / 45% JSON), 6 JSON schema
variants, contaminated examples, and /no_think misuse.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-26 03:02:18 -04:00


Training Data Analysis: merged_training_v06.jsonl

Date: 2026-03-26
Source: /home/claude/bin/Mincecraft-AI-model/data/processed/merged_training_v06.jsonl
Examples: 7,256 total
Training script: training/scripts/train_lora.py (Unsloth + TRL SFTTrainer)

Dataset Structure

Metric	Value
Total examples	7,256
Parse errors	0
Message format	"conversations" (7,254) + "text" (2)
Multi-turn (>3 msgs)	4,006 (55.2%)
Has system prompt	100%
Has /no_think prefix	3,459 (47.7%)
Avg system prompt length	21,358 chars (~5,300 tokens)
System prompts >20K chars	6,526 (89.9%)
JSON responses	3,248 (44.8%)
tool_call responses	4,006 (55.2%)
Contains `<think>` tags	52
Tool role messages	12,155
Contaminated JSON (tool_call inside)	84
Empty commands arrays	387

System Prompt Variants

Variant	Count
script_writer (no /no_think)	3,559
/no_think + paper_server	2,942
/no_think + other	338
paper_server (no /no_think)	236
/no_think + server_admin	179
god_persona	6

JSON Response Schema Variants

Key Combination	Count
{commands, risk_level}	1,112
{commands, reasoning, risk_level}	911
{commands, message, risk_level}	495
{commands, message, reasoning}	338
{commands, message, reasoning, risk_level}	243
{commands, reasoning}	149

Tool Usage in Training Data

Tool	Calls
rcon.execute	7,220
script.validate	1,496
script.write	1,493
script.execute	1,485
journal.read	110
world.player_info	70
journal.write	33
world.scan_area	29
minecraft.wiki_lookup	28
world.nearby_entities	25
Others (14 tools)	<25 each
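
A tally like the one above could be reproduced with a short script. This is a minimal sketch, not the script used for this report: it assumes each example is shaped like {"conversations": [{"role": ..., "content": ...}, ...]} (the key names are an assumption) and that tool calls use the `<tool_call>{"name": ...}</tool_call>` wrapper quoted later in this analysis.

```python
import json
import re
from collections import Counter

# Matches one <tool_call>{...}</tool_call> block; DOTALL lets payloads span lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def count_tool_calls(examples):
    """Count tool names across all assistant messages.

    Assumption: examples use {"conversations": [{"role", "content"}, ...]}.
    """
    counts = Counter()
    for example in examples:
        for msg in example.get("conversations", []):
            if msg.get("role") != "assistant":
                continue
            for payload in TOOL_CALL_RE.findall(msg.get("content", "")):
                try:
                    counts[json.loads(payload)["name"]] += 1
                except (json.JSONDecodeError, KeyError):
                    # Keep malformed calls visible instead of dropping them.
                    counts["<unparseable>"] += 1
    return counts
```

Calling `count_tool_calls(examples).most_common()` on the loaded JSONL rows would give a ranking comparable to the table.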

Six Compounding Failure Modes

1. System Prompt Truncation (The Killer)

max_seq_len = 2048 tokens. The average system prompt is ~5,300 tokens, and 90% of examples have system prompts over 20K chars, roughly 2.5 times the entire sequence length.

With packing enabled, the trainer fills each 2048-token window with as many examples as fit. Here the system prompt alone overflows the window, so in most examples the model trained on a truncated system prompt with no user input and no assistant response. It learned system prompt fragments, not task behavior.
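
The token budget can be checked per example with a rough estimate. The sketch below is illustrative only: it uses the ~4 chars/token ratio implied by the averages in this report instead of the real tokenizer, ignores chat-template tokens, assumes truncation keeps the first max_seq_len tokens, and assumes the {"conversations": [{"role", "content"}, ...]} shape (hypothetical key names).

```python
MAX_SEQ_LEN = 2048     # from the training config in this report
CHARS_PER_TOKEN = 4.0  # rough ratio implied by 21,358 chars ~ 5,300 tokens


def estimate_tokens(text):
    """Crude length estimate; a real audit should use the model tokenizer."""
    return int(len(text) / CHARS_PER_TOKEN)


def truncation_report(example, max_seq_len=MAX_SEQ_LEN):
    """Estimate how many tokens of each role survive a head-truncation."""
    remaining = max_seq_len
    kept = {}
    total = 0
    for msg in example["conversations"]:
        n = estimate_tokens(msg["content"])
        keep = min(n, remaining)  # tokens of this message that still fit
        remaining -= keep
        total += n
        kept[msg["role"]] = kept.get(msg["role"], 0) + keep
    return {
        "total_tokens": total,
        "kept_by_role": kept,
        "assistant_survives": kept.get("assistant", 0) > 0,
    }
```

Under these assumptions, an example with an average-length system prompt (21,358 chars) spends the whole window on the system message, so `assistant_survives` comes back `False`.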

2. Mixed Response Paradigm

44.8% of examples teach: return clean JSON {"commands": [...]}. 55.2% of examples teach: emit <tool_call>{"name": "rcon.execute",...}</tool_call>.

No clear signal distinguishes when to use each format. The system prompts differ but get truncated (failure mode 1), so the model never sees the disambiguation.
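
One way to measure this split is to label each example by the paradigm of its first assistant message. The following is a sketch under assumptions: the {"conversations": [{"role", "content"}, ...]} shape and the function names are hypothetical, and the report's own counts may have been produced differently.

```python
import json
from collections import Counter


def response_paradigm(text):
    """Label one assistant message as 'json', 'tool_call', or 'other'."""
    stripped = text.strip()
    try:
        if isinstance(json.loads(stripped), dict):
            return "json"
    except json.JSONDecodeError:
        pass  # not a bare JSON document; fall through to the tag check
    if "<tool_call>" in stripped:
        return "tool_call"
    return "other"


def paradigm_split(examples):
    """Tally examples by the paradigm of their first assistant message."""
    counts = Counter()
    for example in examples:
        first = next(
            (m["content"] for m in example.get("conversations", [])
             if m["role"] == "assistant"),
            "",
        )
        counts[response_paradigm(first)] += 1
    return counts
```

Grouping the same labels by system prompt variant would show whether any variant maps cleanly to one paradigm.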

3. Inconsistent JSON Schema

6 different key combinations across the JSON responses. No single schema dominates enough to become the learned default.
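
The key combinations can be counted by reducing each JSON response to its sorted top-level keys. This is a minimal sketch; the function name is hypothetical and it takes the assistant response strings as input.

```python
import json
from collections import Counter


def schema_variants(responses):
    """Count distinct sets of top-level keys across JSON response strings."""
    counts = Counter()
    for text in responses:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue  # non-JSON responses (e.g. tool calls) are skipped
        if isinstance(parsed, dict):
            # Sorting makes key order irrelevant when comparing schemas.
            counts[tuple(sorted(parsed))] += 1
    return counts
```

A dataset with one consistent schema would produce a single entry here; this report found six.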

4. Contaminated Examples

84 examples have <tool_call> strings inside JSON responses (pipeline leakage)
387 examples have empty commands: [] (teaches the model that returning nothing is acceptable)
2 raw text format entries with literal <|im_start|> tokens
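
The three contamination types above can be flagged with simple checks. The sketch below is an assumption-laden illustration: it assumes "conversations" examples hold {"role", "content"} messages and "text" examples hold one raw string, and the flag names are made up for this example.

```python
import json


def contamination_flags(example):
    """Return the set of contamination types found in one example."""
    flags = set()
    # Raw text entries that carry literal chat-template tokens.
    if "<|im_start|>" in example.get("text", ""):
        flags.add("raw_chat_tokens")
    for msg in example.get("conversations", []):
        if msg.get("role") != "assistant":
            continue
        content = msg.get("content", "").strip()
        try:
            parsed = json.loads(content)
        except json.JSONDecodeError:
            continue  # only JSON-style responses are checked below
        if not isinstance(parsed, dict):
            continue
        if "<tool_call>" in content:
            flags.add("tool_call_in_json")
        if parsed.get("commands") == []:
            flags.add("empty_commands")
    return flags
```

Examples with a non-empty flag set could be dropped or routed to manual review before the next merge.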

5. Tool Role Incompatibility

4,006 examples use custom tool names (rcon.execute, script.validate) that Qwen has no pretrained knowledge of. The model needs to learn these from scratch, but with truncated sequences it never sees enough context.

6. `/no_think` Misuse

/no_think is a Qwen inference-time directive, not a trainable behavior. Including it in 48% of training data wastes tokens and doesn't transfer to the fine-tuned model's behavior (confirmed by probes).
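
If the directive is dropped from the training data, a preprocessing pass along these lines could do it. This is a hedged sketch, not part of the existing pipeline: it assumes the directive appears as a prefix of the system message and that examples use the {"conversations": [{"role", "content"}, ...]} shape (both assumptions).

```python
NO_THINK = "/no_think"


def strip_no_think(example):
    """Return a copy of the example without a leading /no_think directive.

    Only the system message is touched; the input is not mutated.
    """
    cleaned = []
    for msg in example["conversations"]:
        content = msg["content"]
        if msg["role"] == "system" and content.lstrip().startswith(NO_THINK):
            content = content.lstrip()[len(NO_THINK):].lstrip()
        cleaned.append({**msg, "content": content})
    return {**example, "conversations": cleaned}
```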

Training Pipeline Details

Data Flow

Raw sources (seed, tool, audit, self_play, distilled, etc.)
  → merge_datasets.py (normalize, dedup, 95/5 split)
  → merged_training_v06.jsonl
  → train_lora.py
    ├─ load_seed_dataset() → conversations format
    ├─ load_tool_dataset() → messages/text format
    └─ formatting_func() → tokenizer.apply_chat_template()
  → SFTTrainer (Unsloth, QLoRA 4-bit)
  → LoRA merge → GGUF → quantize → Ollama
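
The dedup and 95/5 split step in the flow above could look roughly like the following. This is an illustrative sketch and not the contents of merge_datasets.py: exact-match deduplication by content hash, the shuffle, and the function name are all assumptions.

```python
import hashlib
import json
import random


def dedup_and_split(examples, eval_fraction=0.05, seed=0):
    """Drop exact duplicates, shuffle, and return (train, eval) lists."""
    seen = set()
    unique = []
    for example in examples:
        # Hash a canonical JSON dump so key order does not affect equality.
        blob = json.dumps(example, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(blob).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(example)
    random.Random(seed).shuffle(unique)  # seeded for a reproducible split
    n_eval = int(len(unique) * eval_fraction)
    return unique[n_eval:], unique[:n_eval]
```

Exact-match hashing misses near-duplicates, such as the same conversation with a slightly different system prompt, so a real pipeline may need a looser key.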

Training Config

Base models: Qwen3-8B, Qwen3.5-9B, Qwen3.5-14B
Method: QLoRA (4-bit base + FP16 LoRA)
LoRA: rank 16-64, alpha 32-128, targets q/k/v/o/gate/up/down_proj
Batch: 2 x 4 grad accum = effective 8
LR: 2e-4, cosine schedule, warmup ratio 0.1
Epochs: 1 (default)
Packing: enabled
max_seq_len: 2048

What the Model Actually Trained On

Due to truncation + packing, most training examples were reduced to:

[truncated system prompt fragment][truncated system prompt fragment][truncated...]

The model spent compute learning to predict system prompt text, not learning the user→assistant mapping.
