mortdecai-model-analysis/training-data-analysis.md
Mortdecai d199c788c4 docs: training data analysis — 6 compounding failure modes identified
Root cause: 90% of system prompts exceed max_seq_len (2048 tokens)
by 2.5x, so model trained on truncated fragments with no user/assistant
content. Plus mixed paradigm (55% tool_call / 45% JSON), 6 JSON schema
variants, contaminated examples, and /no_think misuse.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 03:02:18 -04:00


Training Data Analysis: merged_training_v06.jsonl

Date: 2026-03-26
Source: /home/claude/bin/Mincecraft-AI-model/data/processed/merged_training_v06.jsonl
Examples: 7,256 total
Training script: training/scripts/train_lora.py (Unsloth + TRL SFTTrainer)


Dataset Structure

| Metric | Value |
| --- | --- |
| Total examples | 7,256 |
| Parse errors | 0 |
| Message format | "conversations" (7,254) + "text" (2) |
| Multi-turn (>3 msgs) | 4,006 (55.2%) |
| Has system prompt | 100% |
| Has /no_think prefix | 3,459 (47.7%) |
| Avg system prompt length | 21,358 chars (~5,300 tokens) |
| System prompts >20K chars | 6,526 (89.9%) |
| JSON responses | 3,248 (44.8%) |
| tool_call responses | 4,006 (55.2%) |
| Contains <think> tags | 52 |
| Tool role messages | 12,155 |
| Contaminated JSON (tool_call inside) | 84 |
| Empty commands arrays | 387 |

System Prompt Variants

| Variant | Count |
| --- | --- |
| script_writer (no /no_think) | 3,559 |
| /no_think + paper_server | 2,942 |
| /no_think + other | 338 |
| paper_server (no /no_think) | 236 |
| /no_think + server_admin | 179 |
| god_persona | 6 |

JSON Response Schema Variants

| Key Combination | Count |
| --- | --- |
| {commands, risk_level} | 1,112 |
| {commands, reasoning, risk_level} | 911 |
| {commands, message, risk_level} | 495 |
| {commands, message, reasoning} | 338 |
| {commands, message, reasoning, risk_level} | 243 |
| {commands, reasoning} | 149 |

Tool Usage in Training Data

| Tool | Calls |
| --- | --- |
| rcon.execute | 7,220 |
| script.validate | 1,496 |
| script.write | 1,493 |
| script.execute | 1,485 |
| journal.read | 110 |
| world.player_info | 70 |
| journal.write | 33 |
| world.scan_area | 29 |
| minecraft.wiki_lookup | 28 |
| world.nearby_entities | 25 |
| Others (14 tools) | <25 each |

Six Compounding Failure Modes

1. System Prompt Truncation (The Killer)

max_seq_len is 2048 tokens, but the average system prompt is about 5,300 tokens. In 89.9% of examples the system prompt alone runs past 20K characters, roughly 2.5x the entire sequence length.

With packing enabled, the trainer stuffs multiple examples into each 2048-token window. Because the system prompt alone does not fit, most examples reached the model as a truncated system prompt with no user input and no assistant response. The model learned system prompt fragments, not task behavior.
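The token budget can be sanity-checked with a rough sketch. It assumes about 4 characters per token, which is the ratio implied by the table's 21,358 chars ≈ 5,300 tokens. The helper names are hypothetical, and an exact check would run the model's own tokenizer instead of a character estimate.

```python
MAX_SEQ_LEN = 2048      # from the training config
CHARS_PER_TOKEN = 4.0   # assumption: rough ratio implied by the dataset table


def estimate_tokens(n_chars: int, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Approximate token count from a character count (estimate only)."""
    return int(n_chars / chars_per_token)


def overflow_ratio(prompt_tokens: int, max_seq_len: int = MAX_SEQ_LEN) -> float:
    """How many times the prompt would fill the training window."""
    return prompt_tokens / max_seq_len


def tokens_left_for_dialogue(prompt_tokens: int, max_seq_len: int = MAX_SEQ_LEN) -> int:
    """Tokens remaining for user and assistant turns after the system prompt."""
    return max(0, max_seq_len - prompt_tokens)


avg_prompt_tokens = estimate_tokens(21_358)
print(avg_prompt_tokens, round(overflow_ratio(avg_prompt_tokens), 2))
print(tokens_left_for_dialogue(avg_prompt_tokens))
```

Under this estimate, the average system prompt leaves zero tokens in the window for the user and assistant turns.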

2. Mixed Response Paradigm

44.8% of examples teach: return clean JSON {"commands": [...]}. 55.2% of examples teach: emit <tool_call>{"name": "rcon.execute",...}</tool_call>.

No clear signal distinguishes when to use each format. The system prompts differ, but they get truncated (failure mode 1), so the model never sees the text that disambiguates them.
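A split like this could be measured with a small classifier over assistant messages. This is a minimal sketch, not the analysis script actually used. `classify_response` is a hypothetical helper, and treating any parseable JSON object as the "json" paradigm (even one with a stray `<tool_call>` string inside) is an assumption.

```python
import json


def classify_response(text: str) -> str:
    """Label one assistant message as 'json', 'tool_call', or 'other'."""
    stripped = text.strip()
    try:
        parsed = json.loads(stripped)
    except json.JSONDecodeError:
        parsed = None
    if isinstance(parsed, dict):
        # A whole-message JSON object counts as the JSON paradigm.
        return "json"
    if "<tool_call>" in stripped:
        return "tool_call"
    return "other"


samples = [
    '{"commands": ["say hi"], "risk_level": "low"}',
    '<tool_call>{"name": "rcon.execute", "arguments": {}}</tool_call>',
    "plain text reply",
]
print([classify_response(s) for s in samples])
```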

3. Inconsistent JSON Schema

Six different key combinations appear across the 3,248 JSON responses. The most common, {commands, risk_level}, covers only 1,112 of them (34%), so no single schema dominates enough to become the learned default.
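Key combinations like the ones tabulated earlier can be counted with a `Counter` keyed on the sorted top-level keys of each response. This is a sketch under the assumption that each JSON response is a single top-level object. `schema_variants` is a hypothetical name.

```python
import json
from collections import Counter


def schema_variants(responses):
    """Count distinct top-level key combinations among JSON object responses."""
    counts = Counter()
    for raw in responses:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON, so not part of the schema tally
        if isinstance(obj, dict):
            counts[tuple(sorted(obj))] += 1
    return counts


demo = [
    '{"commands": ["say hi"], "risk_level": "low"}',
    '{"risk_level": "low", "commands": []}',
    '{"commands": ["time set day"], "reasoning": "requested"}',
    "not json",
]
for keys, n in schema_variants(demo).most_common():
    print(keys, n)
```

Sorting the keys makes the tally independent of key order, so the first two demo responses land in the same bucket.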

4. Contaminated Examples

  • 84 examples have <tool_call> strings inside JSON responses (pipeline leakage)
  • 387 examples have empty commands: [] (teaches that returning nothing is acceptable)
  • 2 raw text format entries with literal <|im_start|> tokens
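The three contamination patterns listed above can be flagged with a simple per-response check. This is a hedged sketch. `contamination_flags` and its flag names are hypothetical, and the rules are a plain reading of the bullets, not the detection logic used for the counts.

```python
import json


def contamination_flags(response: str) -> list[str]:
    """Return the contamination patterns found in one response string."""
    flags = []
    if "<|im_start|>" in response:
        flags.append("raw_chat_tokens")
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return flags  # not JSON, so the JSON-specific checks do not apply
    if isinstance(obj, dict):
        if "<tool_call>" in response:
            flags.append("tool_call_in_json")
        if obj.get("commands") == []:
            flags.append("empty_commands")
    return flags


print(contamination_flags('{"commands": [], "risk_level": "low"}'))
print(contamination_flags('{"commands": ["<tool_call>x</tool_call>"]}'))
```

An empty list means none of the three patterns matched, which does not by itself prove the example is clean.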

5. Tool Role Incompatibility

4,006 examples call custom tools (such as rcon.execute and script.validate) whose names Qwen did not learn during pretraining. The model has to learn them from scratch, but with truncated sequences it never sees enough context to do so.

6. /no_think Misuse

/no_think is a Qwen inference-time directive, not a trainable behavior. Including it in 48% of training data wastes tokens and doesn't transfer to the fine-tuned model's behavior (confirmed by probes).
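If the directive were stripped from system prompts before training, the change could look like the sketch below. It assumes `/no_think` appears as a literal prefix of the system prompt, as the "Has /no_think prefix" metric suggests. `strip_no_think` is a hypothetical helper and not part of the existing pipeline.

```python
NO_THINK = "/no_think"


def strip_no_think(system_prompt: str) -> str:
    """Remove a leading /no_think directive; leave other prompts unchanged."""
    stripped = system_prompt.lstrip()
    if stripped.startswith(NO_THINK):
        return stripped[len(NO_THINK):].lstrip()
    return system_prompt


print(strip_no_think("/no_think You are a server admin."))
print(strip_no_think("You are a script writer."))
```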


Training Pipeline Details

Data Flow

Raw sources (seed, tool, audit, self_play, distilled, etc.)
  → merge_datasets.py (normalize, dedup, 95/5 split)
  → merged_training_v06.jsonl
  → train_lora.py
    ├─ load_seed_dataset() → conversations format
    ├─ load_tool_dataset() → messages/text format
    └─ formatting_func() → tokenizer.apply_chat_template()
  → SFTTrainer (Unsloth, QLoRA 4-bit)
  → LoRA merge → GGUF → quantize → Ollama

Training Config

  • Base models: Qwen3-8B, Qwen3.5-9B, Qwen3.5-14B
  • Method: QLoRA (4-bit base + FP16 LoRA)
  • LoRA: rank 16-64, alpha 32-128, targets q/k/v/o/gate/up/down_proj
  • Batch: 2 x 4 grad accum = effective 8
  • LR: 2e-4, cosine schedule, 0.1 warmup
  • Epochs: 1 (default)
  • Packing: enabled
  • max_seq_len: 2048
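The batch arithmetic in the list above works out as follows. This sketch uses illustrative key names, not the actual arguments of train_lora.py, and assumes a single device with every packed sequence filled to max_seq_len.

```python
# Values copied from the training config; key names are illustrative only.
config = {
    "per_device_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "max_seq_len": 2048,
}


def effective_batch_size(cfg: dict) -> int:
    """Sequences contributing to one optimizer step (single device assumed)."""
    return cfg["per_device_batch_size"] * cfg["gradient_accumulation_steps"]


def max_tokens_per_step(cfg: dict) -> int:
    """Upper bound on tokens per optimizer step if every window is full."""
    return effective_batch_size(cfg) * cfg["max_seq_len"]


print(effective_batch_size(config), max_tokens_per_step(config))
```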

What the Model Actually Trained On

Due to truncation + packing, most training examples were reduced to:

[truncated system prompt fragment][truncated system prompt fragment][truncated...]

The model spent compute learning to predict system prompt text, not learning the user→assistant mapping.
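The effect can be illustrated with a toy model of head truncation: keep tokens in order until the window is full and drop the rest. This is a simplified sketch, not the trainer's actual packing logic. `truncate_example` is a hypothetical helper, and the user and assistant token counts are made-up illustrative values.

```python
def truncate_example(segments, max_seq_len=2048):
    """Given ordered (role, token_count) pairs, return tokens kept per role."""
    kept = {}
    remaining = max_seq_len
    for role, n_tokens in segments:
        take = min(n_tokens, remaining)  # nothing fits once the window is full
        kept[role] = kept.get(role, 0) + take
        remaining -= take
    return kept


# Average-sized system prompt (~5,300 tokens) followed by short turns.
long_prompt = [("system", 5300), ("user", 40), ("assistant", 120)]
short_prompt = [("system", 500), ("user", 40), ("assistant", 120)]

print(truncate_example(long_prompt))
print(truncate_example(short_prompt))
```

In this toy model the long-prompt example keeps 2048 system tokens and zero user or assistant tokens, while the short-prompt example survives intact.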