Root cause: 90% of system prompts exceed max_seq_len (2048 tokens) by 2.5x, so model trained on truncated fragments with no user/assistant content. Plus mixed paradigm (55% tool_call / 45% JSON), 6 JSON schema variants, contaminated examples, and /no_think misuse. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Training Data Analysis: merged_training_v06.jsonl
Date: 2026-03-26
Source: /home/claude/bin/Mincecraft-AI-model/data/processed/merged_training_v06.jsonl
Examples: 7,256 total
Training script: training/scripts/train_lora.py (Unsloth + TRL SFTTrainer)
Dataset Structure
| Metric | Value |
|---|---|
| Total examples | 7,256 |
| Parse errors | 0 |
| Message format | "conversations" (7,254) + "text" (2) |
| Multi-turn (>3 msgs) | 4,006 (55.2%) |
| Has system prompt | 100% |
| Has /no_think prefix | 3,459 (47.7%) |
| Avg system prompt length | 21,358 chars (~5,300 tokens) |
| System prompts >20K chars | 6,526 (89.9%) |
| JSON responses | 3,248 (44.8%) |
| tool_call responses | 4,006 (55.2%) |
| Contains <think> tags | 52 |
| Tool role messages | 12,155 |
| Contaminated JSON (tool_call inside) | 84 |
| Empty commands arrays | 387 |
System Prompt Variants
| Variant | Count |
|---|---|
| script_writer (no /no_think) | 3,559 |
| /no_think + paper_server | 2,942 |
| /no_think + other | 338 |
| paper_server (no /no_think) | 236 |
| /no_think + server_admin | 179 |
| god_persona | 6 |
JSON Response Schema Variants
| Key Combination | Count |
|---|---|
| {commands, risk_level} | 1,112 |
| {commands, reasoning, risk_level} | 911 |
| {commands, message, risk_level} | 495 |
| {commands, message, reasoning} | 338 |
| {commands, message, reasoning, risk_level} | 243 |
| {commands, reasoning} | 149 |
Tool Usage in Training Data
| Tool | Calls |
|---|---|
| rcon.execute | 7,220 |
| script.validate | 1,496 |
| script.write | 1,493 |
| script.execute | 1,485 |
| journal.read | 110 |
| world.player_info | 70 |
| journal.write | 33 |
| world.scan_area | 29 |
| minecraft.wiki_lookup | 28 |
| world.nearby_entities | 25 |
| Others (14 tools) | <25 each |
Six Compounding Failure Modes
1. System Prompt Truncation (The Killer)
max_seq_len = 2048 tokens. Average system prompt = ~5,300 tokens, about 2.6x the entire sequence length. Roughly 90% of examples (6,526 of 7,256) have system prompts over 20K chars (~5,000 tokens), so they overflow the window before any user or assistant content appears.
Packing is meant to fit multiple short examples into each 2048-token window, but here the system prompt alone does not fit. In most examples the model therefore trained on truncated system prompts with no user input and no assistant response. It learned system prompt fragments, not task behavior.
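The overflow arithmetic can be checked with a few lines of Python. This is a minimal sketch: the 4 chars/token ratio is an assumed heuristic (it matches the ~5,300 token estimate in the table, but it is not a tokenizer measurement), and the helper names are hypothetical.

```python
MAX_SEQ_LEN = 2048       # from the training config
CHARS_PER_TOKEN = 4.0    # assumption: rough heuristic, not measured with the Qwen tokenizer


def estimate_tokens(num_chars: int, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Rough token estimate from a character count."""
    return round(num_chars / chars_per_token)


def overflow_ratio(num_chars: int, max_seq_len: int = MAX_SEQ_LEN) -> float:
    """Estimated prompt length expressed as a multiple of the sequence window."""
    return estimate_tokens(num_chars) / max_seq_len


avg_prompt_chars = 21_358  # average system prompt length reported in the table
print("estimated tokens:", estimate_tokens(avg_prompt_chars))
print("multiple of window:", round(overflow_ratio(avg_prompt_chars), 2))
```

For an exact figure, the same check would use the real tokenizer on each system prompt instead of the character heuristic.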
2. Mixed Response Paradigm
44.8% of examples teach: return clean JSON {"commands": [...]}.
55.2% of examples teach: emit <tool_call>{"name": "rcon.execute",...}</tool_call>.
No clear signal distinguishes when to use each format. The system prompts differ but get truncated (Issue 1), so the model never sees the disambiguation.
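The split between the two paradigms can be measured with a small classifier over assistant responses. The sketch below is an assumption about how such a count could be made, not the script that produced the numbers above; classify_response is a hypothetical helper.

```python
import json
from collections import Counter


def classify_response(text: str) -> str:
    """Label an assistant response as 'tool_call', 'json', or 'other'."""
    stripped = text.strip()
    if stripped.startswith("<tool_call>"):
        return "tool_call"
    try:
        parsed = json.loads(stripped)
    except json.JSONDecodeError:
        return "other"
    # Only a JSON object counts as the clean-JSON paradigm.
    return "json" if isinstance(parsed, dict) else "other"


samples = [
    '{"commands": ["say hi"], "risk_level": 1}',
    '<tool_call>{"name": "rcon.execute", "arguments": {}}</tool_call>',
    "plain text reply",
]
print(Counter(classify_response(s) for s in samples))
```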
3. Inconsistent JSON Schema
6 different key combinations across the JSON responses. No single schema dominates enough to become the learned default.
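A key-combination count like the one in the schema table can be reproduced by taking the sorted top-level keys of each JSON response as its signature. This is a sketch with hypothetical function names; the sample inputs are illustrative, not rows from the dataset.

```python
import json
from collections import Counter
from typing import Iterable, Optional, Tuple


def schema_signature(response_text: str) -> Optional[Tuple[str, ...]]:
    """Return the sorted top-level keys of a JSON object response, or None."""
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    return tuple(sorted(parsed))


def count_schemas(responses: Iterable[str]) -> Counter:
    """Count how often each key combination appears."""
    signatures = (schema_signature(r) for r in responses)
    return Counter(sig for sig in signatures if sig is not None)


samples = [
    '{"commands": [], "risk_level": 1}',
    '{"risk_level": 2, "commands": ["say hi"]}',
    '{"commands": [], "reasoning": "x"}',
]
for signature, count in count_schemas(samples).most_common():
    print(signature, count)
```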
4. Contaminated Examples
- 84 examples have <tool_call> strings inside JSON responses (pipeline leakage)
- 387 examples have empty commands: [] (teaches that returning nothing is acceptable)
- 2 raw text format entries with literal <|im_start|> tokens
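The three contamination checks above can be expressed as a single filter. This is a minimal sketch, assuming each check runs on a response (or raw text entry) as a string; find_issues and the flag names are hypothetical.

```python
import json
from typing import List


def find_issues(text: str) -> List[str]:
    """Return contamination flags for one response or raw text entry."""
    issues = []
    # Literal chat-template tokens should never appear in stored text.
    if "<|im_start|>" in text:
        issues.append("raw_chat_tokens")
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return issues
    if not isinstance(parsed, dict):
        return issues
    # A JSON response that still carries tool_call markup is pipeline leakage.
    if "<tool_call>" in text:
        issues.append("tool_call_in_json")
    # An empty commands array teaches the model that doing nothing is fine.
    if parsed.get("commands") == []:
        issues.append("empty_commands")
    return issues


print(find_issues('{"commands": []}'))
print(find_issues('{"commands": ["<tool_call>x</tool_call>"]}'))
```

Examples with a non-empty flag list could then be dropped or routed to manual review before merging.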
5. Tool Role Incompatibility
4,006 examples use custom tool names (rcon.execute, script.validate) that Qwen has no pretrained association with. The model has to learn them from scratch, but with truncated sequences it never sees enough context to do so.
6. /no_think Misuse
/no_think is a Qwen inference-time directive, not a trainable behavior. Including it in 48% of training data wastes tokens and doesn't transfer to the fine-tuned model's behavior (confirmed by probes).
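One possible cleanup is to strip the directive from system prompts before training. The sketch below assumes /no_think appears as a prefix at the very start of the system prompt, as the dataset table suggests; strip_no_think is a hypothetical helper.

```python
NO_THINK = "/no_think"


def strip_no_think(system_prompt: str) -> str:
    """Remove a leading /no_think directive; leave other prompts unchanged."""
    stripped = system_prompt.lstrip()
    if stripped.startswith(NO_THINK):
        return stripped[len(NO_THINK):].lstrip()
    return system_prompt


print(strip_no_think("/no_think You are a Minecraft server admin."))
```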
Training Pipeline Details
Data Flow
Raw sources (seed, tool, audit, self_play, distilled, etc.)
→ merge_datasets.py (normalize, dedup, 95/5 split)
→ merged_training_v06.jsonl
→ train_lora.py
├─ load_seed_dataset() → conversations format
├─ load_tool_dataset() → messages/text format
└─ formatting_func() → tokenizer.apply_chat_template()
→ SFTTrainer (Unsloth, QLoRA 4-bit)
→ LoRA merge → GGUF → quantize → Ollama
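The dedup and 95/5 split steps in the flow above could look roughly like the following. This is an illustrative sketch only: the actual logic in merge_datasets.py is not shown in this analysis, so the hashing key, the seeded shuffle, and both function names are assumptions.

```python
import hashlib
import json
import random
from typing import Any, Dict, List, Tuple

Example = Dict[str, Any]


def dedup(examples: List[Example]) -> List[Example]:
    """Drop exact duplicates, keyed on a hash of the canonical JSON form."""
    seen = set()
    unique = []
    for example in examples:
        canonical = json.dumps(example, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(canonical).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


def split_train_eval(
    examples: List[Example], eval_fraction: float = 0.05, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Shuffle with a fixed seed, then hold out eval_fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


data = dedup([{"id": i % 100} for i in range(150)])
train, held_out = split_train_eval(data)
print(len(data), len(train), len(held_out))
```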
Training Config
- Base models: Qwen3-8B, Qwen3.5-9B, Qwen3.5-14B
- Method: QLoRA (4-bit base + FP16 LoRA)
- LoRA: rank 16-64, alpha 32-128, targets q/k/v/o/gate/up/down_proj
- Batch: 2 x 4 grad accum = effective 8
- LR: 2e-4, cosine schedule, 0.1 warmup
- Epochs: 1 (default)
- Packing: enabled
- max_seq_len: 2048
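The batch arithmetic in this config can be sketched as follows. It assumes a single device, reads "0.1 warmup" as a warmup ratio, and treats each example as one training sequence; packing and any held-out split would change the real step count, so the numbers are an approximation.

```python
import math

PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4
WARMUP_RATIO = 0.1       # assumption: "0.1 warmup" is a ratio of total steps
NUM_SEQUENCES = 7_256    # assumption: one sequence per example, no packing


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return per_device * grad_accum * num_devices


def steps_per_epoch(num_sequences: int, batch_size: int) -> int:
    """Optimizer steps needed to see every sequence once."""
    return math.ceil(num_sequences / batch_size)


batch = effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS)
steps = steps_per_epoch(NUM_SEQUENCES, batch)
warmup_steps = math.ceil(steps * WARMUP_RATIO)
print(batch, steps, warmup_steps)
```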
What the Model Actually Trained On
Due to truncation + packing, most training examples were reduced to:
[truncated system prompt fragment][truncated system prompt fragment][truncated...]
The model spent compute learning to predict system prompt text, not learning the user→assistant mapping.
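The effect can be illustrated by truncating a single example to the window and counting what survives per role. This sketch assumes simple right-truncation (keep the first max_seq_len tokens); the user and assistant token counts are illustrative, and surviving_tokens is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def surviving_tokens(
    segments: List[Tuple[str, int]], max_seq_len: int = 2048
) -> Dict[str, int]:
    """Given ordered (role, token_count) segments, return tokens kept per role."""
    kept: Dict[str, int] = {}
    remaining = max_seq_len
    for role, count in segments:
        take = min(count, remaining)
        kept[role] = kept.get(role, 0) + take
        remaining -= take
    return kept


# Average-sized system prompt followed by a short user turn and reply.
example = [("system", 5300), ("user", 40), ("assistant", 120)]
print(surviving_tokens(example))
# → {'system': 2048, 'user': 0, 'assistant': 0}
```

Under these assumptions no assistant tokens remain, so the example contributes nothing to the user→assistant mapping.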