docs: training data analysis — 6 compounding failure modes identified
Root cause: about 90% of system prompts are roughly 2.5x max_seq_len (2048 tokens), so the model trained on truncated fragments with no user/assistant content. Compounding this: a mixed response paradigm (55% tool_call / 45% JSON), 6 JSON schema variants, contaminated examples, and /no_think misuse.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,135 @@
# Training Data Analysis: merged_training_v06.jsonl

**Date:** 2026-03-26
**Source:** `/home/claude/bin/Mincecraft-AI-model/data/processed/merged_training_v06.jsonl`
**Examples:** 7,256 total
**Training script:** `training/scripts/train_lora.py` (Unsloth + TRL SFTTrainer)

---

## Dataset Structure

| Metric | Value |
|--------|-------|
| Total examples | 7,256 |
| Parse errors | 0 |
| Message format | "conversations" (7,254) + "text" (2) |
| Multi-turn (>3 msgs) | 4,006 (55.2%) |
| Has system prompt | 100% |
| Has /no_think prefix | 3,459 (47.7%) |
| Avg system prompt length | 21,358 chars (~5,300 tokens) |
| System prompts >20K chars | 6,526 (89.9%) |
| JSON responses | 3,248 (44.8%) |
| tool_call responses | 4,006 (55.2%) |
| Contains `<think>` tags | 52 |
| Tool role messages | 12,155 |
| Contaminated JSON (tool_call inside) | 84 |
| Empty commands arrays | 387 |

## System Prompt Variants

| Variant | Count |
|---------|-------|
| script_writer (no /no_think) | 3,559 |
| /no_think + paper_server | 2,942 |
| /no_think + other | 338 |
| paper_server (no /no_think) | 236 |
| /no_think + server_admin | 179 |
| god_persona | 6 |

## JSON Response Schema Variants

| Key Combination | Count |
|-----------------|-------|
| {commands, risk_level} | 1,112 |
| {commands, reasoning, risk_level} | 911 |
| {commands, message, risk_level} | 495 |
| {commands, message, reasoning} | 338 |
| {commands, message, reasoning, risk_level} | 243 |
| {commands, reasoning} | 149 |

## Tool Usage in Training Data

| Tool | Calls |
|------|-------|
| rcon.execute | 7,220 |
| script.validate | 1,496 |
| script.write | 1,493 |
| script.execute | 1,485 |
| journal.read | 110 |
| world.player_info | 70 |
| journal.write | 33 |
| world.scan_area | 29 |
| minecraft.wiki_lookup | 28 |
| world.nearby_entities | 25 |
| Others (14 tools) | <25 each |

---

## Six Compounding Failure Modes

### 1. System Prompt Truncation (The Killer)

`max_seq_len = 2048 tokens`. Average system prompt = ~5,300 tokens. **About 90% of examples (6,526 of 7,256) have system prompts longer than 20K characters, roughly 2.5x the entire sequence length.**

With packing enabled, the trainer fills each 2048-token window with as many examples as fit. The system prompt alone overflows the window, so the model trained on truncated system prompts with **no user input and no assistant response in most examples**. It learned system prompt fragments, not task behavior.
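
The arithmetic behind this failure mode can be checked with a short sketch. It assumes the rough ratio implied by the table above (21,358 chars for ~5,300 tokens, so about 4 characters per token); `estimate_tokens` and `overflow_report` are hypothetical helpers, not functions from `train_lora.py`, and a real audit would count tokens with the model's tokenizer instead.

```python
# Heuristic from the report's own numbers: ~4 characters per token.
# This is an approximation, not a tokenizer.
CHARS_PER_TOKEN = 4.0
MAX_SEQ_LEN = 2048


def estimate_tokens(text: str) -> int:
    """Approximate a token count from character length."""
    return round(len(text) / CHARS_PER_TOKEN)


def overflow_report(system_prompts: list[str], max_seq_len: int = MAX_SEQ_LEN) -> dict:
    """Summarize how many system prompts exceed the sequence budget on their own."""
    if not system_prompts:
        return {"total": 0, "over_budget": 0, "share_over": 0.0, "mean_ratio": 0.0}
    lengths = [estimate_tokens(p) for p in system_prompts]
    over = sum(1 for n in lengths if n > max_seq_len)
    return {
        "total": len(lengths),
        "over_budget": over,
        "share_over": over / len(lengths),
        # Mean estimated prompt length as a multiple of the window size.
        "mean_ratio": sum(lengths) / len(lengths) / max_seq_len,
    }


# Synthetic stand-ins for illustration, not prompts from the dataset.
report = overflow_report(["x" * 21_358, "x" * 4_000])
```

Any prompt counted in `over_budget` leaves no room in its window for a user or assistant turn.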

### 2. Mixed Response Paradigm

- 44.8% of examples teach the model to return clean JSON: `{"commands": [...]}`.
- 55.2% of examples teach it to emit `<tool_call>{"name": "rcon.execute",...}</tool_call>`.

No clear signal distinguishes when to use each format. The system prompts differ, but they get truncated (failure mode 1), so the model never sees the disambiguation.
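
One way to reproduce the split is to label each assistant response by format. The sketch below is a minimal classifier that assumes responses are available as plain strings; `classify_response` is a hypothetical helper, not part of the analysis or training scripts.

```python
import json
import re

# Matches a complete <tool_call>...</tool_call> span, including multi-line bodies.
TOOL_CALL_RE = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)


def classify_response(text: str) -> str:
    """Label a response as 'tool_call', 'json', 'mixed', or 'other'."""
    has_tool_call = bool(TOOL_CALL_RE.search(text))
    stripped = text.strip()
    is_json_object = False
    if stripped.startswith("{"):
        try:
            is_json_object = isinstance(json.loads(stripped), dict)
        except json.JSONDecodeError:
            is_json_object = False
    if has_tool_call and is_json_object:
        # A JSON response that also carries a tool_call string.
        return "mixed"
    if has_tool_call:
        return "tool_call"
    if is_json_object:
        return "json"
    return "other"
```

Counting the labels over all assistant turns gives the share of each paradigm; the `mixed` label corresponds to JSON responses that contain a `<tool_call>` string.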

### 3. Inconsistent JSON Schema

The 3,248 JSON responses use 6 different key combinations. The most common, `{commands, risk_level}`, covers only 1,112 of them (about 34%), so no single schema dominates enough to become the learned default.
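
A count like the one in the schema variants table can be produced by grouping responses on their set of top-level keys. This is a minimal sketch, assuming the JSON responses have already been extracted as strings; `schema_variants` is a hypothetical name.

```python
import json
from collections import Counter


def schema_variants(json_responses: list[str]) -> Counter:
    """Count the distinct sets of top-level keys across JSON responses."""
    counts: Counter = Counter()
    for raw in json_responses:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            counts[("<unparseable>",)] += 1
            continue
        if isinstance(obj, dict):
            # Sort the keys so key order does not create spurious variants.
            counts[tuple(sorted(obj))] += 1
        else:
            counts[("<not an object>",)] += 1
    return counts
```

Calling `schema_variants(responses).most_common()` lists the variants from most to least frequent.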

### 4. Contaminated Examples

- 84 examples have `<tool_call>` strings inside JSON responses (pipeline leakage)
- 387 examples have empty `commands: []` (teaches that returning nothing is acceptable)
- 2 raw `text` format entries with literal `<|im_start|>` tokens
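
The three checks above can be expressed as a single flagging function. The sketch below assumes each response or raw `text` entry is passed in as a string; the function name and flag labels are hypothetical.

```python
import json


def contamination_flags(text: str) -> list[str]:
    """Return the contamination patterns found in one response or text entry."""
    flags: list[str] = []
    if "<|im_start|>" in text:
        # Literal chat-template tokens stored as plain text.
        flags.append("raw_chat_template_tokens")
    stripped = text.strip()
    try:
        obj = json.loads(stripped)
    except json.JSONDecodeError:
        return flags
    if not isinstance(obj, dict):
        return flags
    if "<tool_call>" in stripped:
        # A tool_call string leaked into a JSON-format response.
        flags.append("tool_call_in_json")
    if obj.get("commands") == []:
        flags.append("empty_commands")
    return flags
```

An empty list means none of the three patterns matched; it does not show that the example is otherwise clean.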

### 5. Tool Role Incompatibility

4,006 examples use custom tool names (`rcon.execute`, `script.validate`) that Qwen did not learn during pretraining. The model has to learn them from scratch, but with truncated sequences it never sees enough context to do so.

### 6. `/no_think` Misuse

`/no_think` is a Qwen inference-time directive, not a trainable behavior. Including it in 47.7% of training examples wastes tokens and does not transfer to the fine-tuned model's behavior (confirmed by probes).
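
If the directive were dropped from the training data, a minimal sketch of the cleanup could look like the following. It assumes `/no_think` appears as a prefix of the system prompt, as the dataset table describes; `strip_no_think` is a hypothetical helper.

```python
NO_THINK = "/no_think"


def strip_no_think(system_prompt: str) -> str:
    """Remove a leading /no_think directive and the whitespace that follows it."""
    stripped = system_prompt.lstrip()
    if stripped.startswith(NO_THINK):
        rest = stripped[len(NO_THINK):]
        # Only treat it as the directive when it is a standalone token.
        if rest == "" or rest[0].isspace():
            return rest.lstrip()
    return system_prompt
```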

---

## Training Pipeline Details

### Data Flow
```
Raw sources (seed, tool, audit, self_play, distilled, etc.)
  → merge_datasets.py (normalize, dedup, 95/5 split)
  → merged_training_v06.jsonl
  → train_lora.py
      ├─ load_seed_dataset() → conversations format
      ├─ load_tool_dataset() → messages/text format
      └─ formatting_func() → tokenizer.apply_chat_template()
  → SFTTrainer (Unsloth, QLoRA 4-bit)
  → LoRA merge → GGUF → quantize → Ollama
```

### Training Config
- **Base models:** Qwen3-8B, Qwen3.5-9B, Qwen3.5-14B
- **Method:** QLoRA (4-bit base + FP16 LoRA)
- **LoRA:** rank 16-64, alpha 32-128, targets q/k/v/o/gate/up/down_proj
- **Batch:** 2 x 4 grad accum = effective 8
- **LR:** 2e-4, cosine schedule, 0.1 warmup
- **Epochs:** 1 (default)
- **Packing:** enabled
- **max_seq_len:** 2048
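
For reference, the values above can be collected in one place, which also makes the derived quantities explicit. This is a sketch only: the field names are hypothetical and do not mirror `train_lora.py` or any TRL argument names, and the per-step token count assumes a single device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values from the list above; field names are illustrative."""
    per_device_batch_size: int = 2
    grad_accum_steps: int = 4
    learning_rate: float = 2e-4
    lr_schedule: str = "cosine"
    warmup: float = 0.1
    epochs: int = 1
    packing: bool = True
    max_seq_len: int = 2048

    @property
    def effective_batch_size(self) -> int:
        # 2 sequences per step x 4 accumulation steps = 8
        return self.per_device_batch_size * self.grad_accum_steps

    @property
    def tokens_per_optimizer_step(self) -> int:
        # With packing, each sequence is assumed to fill the whole window.
        return self.effective_batch_size * self.max_seq_len


cfg = TrainConfig()
```

Under these assumptions each optimizer step sees at most `cfg.tokens_per_optimizer_step` tokens, which is about three average-length system prompts.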

### What the Model Actually Trained On

Due to truncation + packing, most training examples were reduced to:
```
[truncated system prompt fragment][truncated system prompt fragment][truncated...]
```
The model spent compute learning to predict system prompt text, not learning the user→assistant mapping.
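
The effect on a single example can be illustrated with a small simulation. It assumes truncation keeps only the first `max_seq_len` tokens of each example; whether the trainer truncates this way or instead splits examples across packed windows was not verified here. The user and assistant turn lengths are made up for illustration, and `surviving_tokens` is a hypothetical helper.

```python
def surviving_tokens(
    turn_lengths: list[tuple[str, int]], max_seq_len: int = 2048
) -> dict[str, int]:
    """Return how many tokens of each role survive when the tail is cut off."""
    remaining = max_seq_len
    kept: dict[str, int] = {}
    for role, n_tokens in turn_lengths:
        # Turns are consumed in order until the budget runs out.
        take = min(n_tokens, remaining)
        kept[role] = kept.get(role, 0) + take
        remaining -= take
    return kept


# Average-length system prompt followed by short, made-up user/assistant turns.
example = [("system", 5300), ("user", 40), ("assistant", 120)]
kept = surviving_tokens(example)
```

Here `kept` is `{"system": 2048, "user": 0, "assistant": 0}`: the whole budget goes to the system prompt and no assistant tokens are left to learn from.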