48df42b042
Full analysis of mortdecai:0.6.0-9b and mortdecai:latest (27B) fine-tunes vs 6 base model candidates. Both fine-tunes score 0% JSON compliance (catastrophic forgetting from chat template mismatch). Training signal exists in weights but is inaccessible through chat API. Base model rankings: phi4:14b (100%, 7.4s) > gemma3:12b (100%, 12.9s) > gemma3:27b (100%, 25.3s). Qwen3.5 not recommended for conductor role. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Mortdecai 0.6.0 Model Analysis Report

**Date:** 2026-03-26
**Analyst:** Claude Opus 4.6 (non-developer, pure analysis role)
**Target models:** mortdecai:0.6.0-9b (Qwen3.5:9B LoRA), mortdecai:latest (Qwen3.5:27B LoRA)
**Comparison models:** qwen3.5:latest, qwen3.5:27b, gemma3:12b, phi4:14b, gemma3:27b, qwen3:14b
**Inference hardware:** Matt's Strix Halo (64GB unified memory, Ollama)
**Expected output format:** `{"commands": [...], "reasoning": "..."}`

---

## 1. Executive Summary

Both fine-tuned models are completely broken: each scores 0/8 on JSON compliance. Training did not merely fail to stick; it destroyed the models' ability to follow instructions through the chat API. The fine-tunes are worse than useless, because the base models they were derived from still reach 40% (2/5) on the same output format.

---

## 2. Methodology

### Test Battery (Fine-tuned models, 8 tests each)

| Test | System Prompt | User Prompt | Purpose |
|------|--------------|-------------|---------|
| STANDARD | Full training system prompt | "give me a diamond sword" | Baseline compliance |
| MINIMAL | JSON format instruction only | "give me a diamond sword" | Minimal instruction following |
| NO SYSTEM | Empty | "give me a diamond sword" | Default behavior |
| /no_think | Training prompt + /no_think prefix | "give me a diamond sword" | Think token suppression |
| COMPLEX | Full training prompt | "build me a 5x5 house" | Multi-step command |
| IDENTITY | Empty | "What are you?" | Training awareness |
| FORMAT STRESS | Full training prompt | Time + weather + armor | Multi-command JSON |
| RISK | Full training prompt | "give me op" | Risk assessment |

### Test Battery (Base models, 5 tests each)

The same system prompt was used across all models. The five prompts were: diamond sword, multi-command, house build, op request, and teleport.
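
The "JSON Valid" column used throughout this report amounts to a check against the expected output format. The report does not show the exact check that was run, so the following is only a minimal sketch of one plausible version: `extract_json` and `is_compliant` are hypothetical names, and tolerating a surrounding markdown fence is an assumption, inferred from phi4:14b scoring 100% while always wrapping its JSON in fences.

```python
import json

FENCE = "`" * 3  # markdown code-fence marker


def extract_json(text: str) -> str:
    """Strip whitespace and one enclosing markdown fence, if present."""
    text = text.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        lines = text.splitlines()
        # Drop the opening fence line (with any language tag) and the closing fence line.
        text = "\n".join(lines[1:-1]).strip()
    return text


def is_compliant(response: str) -> bool:
    """True if the response is a JSON object with a list of commands and a reasoning string."""
    try:
        data = json.loads(extract_json(response))
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, dict)
        and isinstance(data.get("commands"), list)
        and isinstance(data.get("reasoning"), str)
    )
```

A stricter variant could reject fenced output outright, or additionally require at least one command; which rule is appropriate depends on how much cleanup the consuming code is willing to do.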

### Diagnostic Probes

1. **Training signal detection:** prompts in the exact training data format
2. **/no_think effect:** compared across fine-tuned and base models
3. **Raw completion:** bypasses the chat template via `/api/generate`
4. **Correction coercion:** multi-turn conversation with an explicit correction
5. **Mortdecai awareness:** identity and training memory

---

## 3. Fine-Tuned Model Results

### mortdecai:0.6.0-9b (Qwen3.5:9B LoRA)

| Test | JSON Valid | Response Type | Latency |
|------|-----------|---------------|---------|
| STANDARD | NO | Generic Minecraft tutorial | 29.9s |
| MINIMAL | NO | Crafting recipe + game tips | 35.9s |
| NO SYSTEM | NO | Crafting recipe + tips | 42.6s |
| /no_think | NO | Tutorial with version advice | 22.6s |
| COMPLEX | NO | **Real-world construction advice** (permits, carpenters) | 46.0s |
| IDENTITY | NO | "I am Qwen3.5 by Tongyi Lab" | 45.8s |
| FORMAT STRESS | NO | Think block, incomplete | 46.0s |
| RISK | NO | **Investment advice** ($1M portfolio) | 45.7s |

**Score: 0/8 JSON compliance (0%)**
**Comparison: Base Qwen3.5:9B scores 40% (2/5); fine-tuning reduced compliance by 40 percentage points**

Key observations:
- Completely ignores system prompts
- Leaks raw special tokens (`<|endoftext|><|im_start|>`) into output
- Interprets Minecraft prompts as real-world requests (house = construction, op = operator/investment)
- `/no_think` suppresses `<think>` tags but doesn't restore instruction following
- Average latency: 39.3s (mean of the eight runs above)

### mortdecai:latest (Qwen3.5:27B LoRA)

| Test | JSON Valid | Response Type | Latency |
|------|-----------|---------------|---------|
| STANDARD | NO | Think block + crafting tutorial | 54.2s |
| MINIMAL | NO | Think block + crafting recipe | 28.2s |
| NO SYSTEM | NO | Crafting recipe + emoji tips | 30.7s |
| /no_think | NO | Think block (still!) + tutorial | 39.0s |
| COMPLEX | NO | Think block about real-world building | 49.2s |
| IDENTITY | NO | "I am Qwen3.5 by Tongyi Lab" | 21.9s |
| FORMAT STRESS | NO | Commands listed as markdown, not JSON | 23.8s |
| RISK | NO | Research study methodology (!) | 49.1s |

**Score: 0/8 JSON compliance (0%)**
**Comparison: Base Qwen3.5:27B scores 40% (2/5); fine-tuning reduced compliance by 40 percentage points**

Key observations:
- Wraps everything in `<think>` blocks even with `/no_think` prefix
- Think tokens consume most of the context budget before any useful output
- Also leaks special tokens
- "give me op" → completely derails into academic research methodology
- Average latency: 37.0s
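
The score, the percentage-point drop, and the average latency for this model follow directly from the table above. As a minimal sketch of that arithmetic, the rows are transcribed by hand into a hypothetical `RESULTS_27B` list and the base-model figure is taken from the comparison line; `summarize` is an illustrative helper, not part of the original test scripts.

```python
from statistics import mean

# (test name, JSON valid?, latency in seconds), transcribed from the table above.
RESULTS_27B = [
    ("STANDARD", False, 54.2),
    ("MINIMAL", False, 28.2),
    ("NO SYSTEM", False, 30.7),
    ("/no_think", False, 39.0),
    ("COMPLEX", False, 49.2),
    ("IDENTITY", False, 21.9),
    ("FORMAT STRESS", False, 23.8),
    ("RISK", False, 49.1),
]

BASE_27B_COMPLIANCE = 40.0  # base qwen3.5:27b, 2/5, from the comparison line


def summarize(results):
    """Return (passes, total, compliance in percent, mean latency in seconds)."""
    passes = sum(1 for _, ok, _ in results if ok)
    total = len(results)
    latency = round(mean(seconds for _, _, seconds in results), 1)
    return passes, total, 100.0 * passes / total, latency


passes, total, pct, avg_latency = summarize(RESULTS_27B)
drop_points = BASE_27B_COMPLIANCE - pct
print(f"{passes}/{total} ({pct:.0f}%), mean latency {avg_latency}s, drop {drop_points:.0f} points")
# prints: 0/8 (0%), mean latency 37.0s, drop 40 points
```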

---

## 4. Root Cause Analysis

### 4.1 Chat Template Mismatch (Primary cause)

**Evidence:** Probe 3 (raw completion mode) showed that the training signal IS in the weights.

When bypassing the chat template entirely:
```
Prompt: 'Assistant: {"commands": ["'
mortdecai:0.6.0-9b completion: 'give @p diamond_sword"]}'
mortdecai:latest completion: 'give @p diamond_sword"]}'
```

Both models produce valid, correct Minecraft commands in raw mode. The knowledge is there; it is just inaccessible through the chat API.

**Diagnosis:** The training data used a different message format than Qwen3.5's native chat template (`<|im_start|>system\n...\n<|im_end|>`). The LoRA learned to associate the JSON output format with the raw training format, not with the chat template wrapping that Ollama applies.
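
The raw-completion probe can be reproduced by asking Ollama to skip its prompt template. The sketch below is an illustration rather than the script actually used: it assumes the `raw` and `stream` fields of Ollama's `/api/generate` endpoint, reuses the host named in the appendix, and introduces hypothetical helpers `build_raw_request` and `raw_complete`. The `num_predict` cap of 64 tokens is also an assumption.

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.0.141:11437"  # host named in the appendix


def build_raw_request(model, prompt, max_tokens=64):
    """Build an /api/generate payload that bypasses the model's chat template."""
    return {
        "model": model,
        "prompt": prompt,
        "raw": True,      # send the prompt verbatim, with no template applied
        "stream": False,  # return a single JSON object instead of a stream
        "options": {"num_predict": max_tokens},  # assumed cap on generated tokens
    }


def raw_complete(model, prompt, base_url=OLLAMA_URL, timeout=120):
    """POST the payload and return only the completion text."""
    body = json.dumps(build_raw_request(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Example (needs a reachable Ollama server):
#   raw_complete("mortdecai:latest", 'Assistant: {"commands": ["')
```

Running the same prompt through the chat endpoint and through this raw path gives a direct comparison: if only the raw path yields the trained format, the template wrapping is the likely culprit.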

### 4.2 Catastrophic Forgetting

The LoRA didn't just add Minecraft knowledge; it overwrote the base model's instruction-following capability:
- Base Qwen3.5:9B: 70% command accuracy (bakeoff), 40% JSON compliance (this test)
- Fine-tuned 9B: 10% command accuracy (bakeoff), 0% JSON compliance (this test)

This pattern is characteristic of catastrophic forgetting. The usual causes are a LoRA rank that is too high, a learning rate that is too aggressive, or insufficient regularization.

### 4.3 Think Token Contamination

Qwen3.5's thinking mode (`<think>...</think>`) was not accounted for during training:
- 27B: Always generates think blocks, even with `/no_think`
- 9B: Sometimes generates think blocks
- Base models: `/no_think` works correctly on both sizes

The fine-tuning broke the `/no_think` mechanism on the 27B model, so its think blocks can no longer be suppressed from the prompt.
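
When think blocks cannot be suppressed at the prompt level, they can still be removed from the output before it is parsed. This is a minimal sketch, assuming the blocks are delimited by literal `<think>` and `</think>` tags; `strip_think` is a hypothetical helper. It does not recover the tokens or latency already spent on reasoning, so it treats the symptom only.

```python
import re

# A closed <think>...</think> block, or an unclosed one that runs to the end of the text.
_THINK = re.compile(r"<think>.*?(?:</think>|\Z)", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove all think blocks and trim the surrounding whitespace."""
    return _THINK.sub("", text).strip()
```

An unclosed block (the model hit its token limit mid-thought) leaves an empty string, which a downstream JSON check would correctly reject.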

### 4.4 Special Token Leakage

Both fine-tuned models leak `<|endoftext|><|im_start|>user` into their output, which indicates that:
- The model learned to predict special tokens as regular text
- The tokenizer/chat template boundary was corrupted during training
- The model "hallucinates" new conversation turns within a single response
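
Leakage of this kind is easy to detect mechanically, which makes it a useful regression check for any future fine-tune. The sketch below assumes the leaked markers appear as literal text in the response; `find_leaks` and `truncate_at_leak` are hypothetical helpers, and `<|im_end|>` is included as an assumption (the report only observed the first two tokens). Truncating at the first marker discards any hallucinated follow-up turns.

```python
SPECIAL_TOKENS = ("<|endoftext|>", "<|im_start|>", "<|im_end|>")


def find_leaks(text):
    """Return the special tokens found in the text, ordered by first appearance."""
    hits = [(text.find(token), token) for token in SPECIAL_TOKENS if token in text]
    return [token for _, token in sorted(hits)]


def truncate_at_leak(text):
    """Cut the response at the first leaked token; return it unchanged if clean."""
    positions = [text.find(token) for token in SPECIAL_TOKENS if token in text]
    return text[: min(positions)].rstrip() if positions else text
```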

---

## 5. Base Model Comparison

### Quantitative Results

| Model | JSON Valid | Has Commands | Avg Latency | Tokens/Response |
|-------|-----------|-------------|-------------|-----------------|
| **phi4:14b** | **5/5 (100%)** | **5/5** | **7.4s** | ~88 |
| **gemma3:12b** | **5/5 (100%)** | **5/5** | **12.9s** | ~117 |
| **gemma3:27b** | **5/5 (100%)** | **5/5** | 25.3s | ~166 |
| qwen3:14b | 3/5 (60%) | 3/5 | 23.8s | ~330 |
| qwen3.5:latest (9B) | 2/5 (40%) | 2/5 | 13.9s | ~370 |
| qwen3.5:27b | 2/5 (40%) | 2/5 | 65.4s | ~437 |
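
The row order in this table is consistent with sorting by JSON compliance first and average latency second. The report does not state its ranking rule, so the sort key below is an assumption that happens to reproduce the table; `BASE_RESULTS` is a hand transcription and `rank` is a hypothetical helper.

```python
# (model, JSON-valid count out of 5, average latency in seconds), from the table above.
BASE_RESULTS = [
    ("qwen3.5:27b", 2, 65.4),
    ("gemma3:27b", 5, 25.3),
    ("qwen3:14b", 3, 23.8),
    ("phi4:14b", 5, 7.4),
    ("qwen3.5:latest", 2, 13.9),
    ("gemma3:12b", 5, 12.9),
]


def rank(results):
    """Sort by JSON compliance (highest first), then by latency (lowest first)."""
    return sorted(results, key=lambda row: (-row[1], row[2]))


for position, (model, valid, latency) in enumerate(rank(BASE_RESULTS), start=1):
    print(f"{position}. {model}: {valid}/5 valid, {latency}s")
```

Treating compliance as the primary key reflects the conductor's requirement for structured output; a latency-first ordering would move qwen3.5:latest ahead of models that are far more reliable.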

### Qualitative Assessment

**phi4:14b:** Fastest response times. Always wraps JSON in markdown fences (a minor issue, easily stripped). Clean reasoning. Uses `@p` consistently. Good domain knowledge. The house build attempt is structured, but its coordinates are imprecise.

**gemma3:12b:** Slightly slower but equally reliable. Sometimes returns raw JSON, sometimes wraps it in fences. Uses `@s` (self), which is more correct for "give me" commands. Best Minecraft domain knowledge of all candidates. Very concise responses.

**gemma3:27b:** Same quality as 12b at roughly 2x the latency. Over-engineers some responses (unnecessary NBT attributes on armor). The tp command uses a redundant two-command approach. Not worth the latency penalty for most use cases.

**qwen3:14b:** Think tokens cause it to exceed token limits on complex prompts. When it does produce JSON, quality is decent, but the commands include leading slashes (against instructions).

**qwen3.5 (both sizes):** Think tokens are the fundamental problem. Both sizes burn 300-400 tokens on reasoning before producing output and frequently hit the token limit before completing the JSON. The `/no_think` flag takes effect on the base models, but not reliably enough to depend on.

---

## 6. Conductor Candidacy Assessment

**Question:** Is Qwen3.5 (27B or 9B) a good candidate for the Conductor/Orchestrator role?

**Answer: No.** Four reasons:

1. **Uncontrollable think token overhead.** The conductor needs fast, reliable responses. Qwen3.5's thinking mode adds 5-30s latency and burns context on reasoning that should happen in orchestrator code, not inside the model.

2. **Unreliable JSON compliance.** The conductor must produce structured output (routing decisions, tool calls, dispatch instructions) 100% of the time. Qwen3.5 manages 40%, versus 100% for gemma3 and phi4.

3. **Fragile under fine-tuning.** LoRA on Qwen3.5 caused catastrophic forgetting. If the conductor needs fine-tuning later, Qwen3.5 is a risky base.

4. **27B is too slow.** An average of 65.4s is unacceptable for a routing layer in the critical path of every player request.

### Recommended Conductor Candidates

| Rank | Model | Why |
|------|-------|-----|
| 1 | **phi4:14b** | Fastest (7.4s), 100% JSON, good reasoning |
| 2 | **gemma3:12b** | 100% JSON, best MC domain knowledge, 12.9s |
| 3 | **gemma3:27b** | Most capable, but only if latency budget allows (25.3s) |

---

## 7. Recommendations

### Immediate Actions
1. **Delete the fine-tuned models** from Matt's Ollama. Base models are strictly superior.
2. **Use phi4:14b or gemma3:12b** for conductor prototyping.
3. **Preserve training data** (JSONL files) for future fine-tuning attempts.

### If Re-attempting Fine-tuning
1. **Fix chat template alignment.** Training data MUST use Qwen3.5's exact `<|im_start|>...<|im_end|>` format.
2. **Consider a different base model.** gemma3:12b showed the best instruction-following baseline and may be more robust under LoRA.
3. **Lower LoRA rank and learning rate** to prevent catastrophic forgetting.
4. **Add `/no_think` handling** or use a model without built-in thinking mode.
5. **Validate with the chat API during training**, not just loss metrics.
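
To make the first item concrete, each training example would be rendered in the same wrapping the chat API applies at inference time. The sketch below assumes the common ChatML layout (`<|im_start|>role`, newline, content, `<|im_end|>`, newline); the exact template should be checked against the model's own tokenizer configuration before use. `to_chatml` is a hypothetical helper, and the system prompt and reasoning text are placeholders, not the real training data.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def to_chatml(messages):
    """Render a list of {"role", "content"} dicts as ChatML-style text."""
    return "".join(
        f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n"
        for message in messages
    )


# Placeholder example; the real system prompt and targets live in the JSONL files.
example = [
    {"role": "system", "content": "Respond only with JSON."},
    {"role": "user", "content": "give me a diamond sword"},
    {
        "role": "assistant",
        "content": '{"commands": ["give @p diamond_sword"], "reasoning": "Player asked for a diamond sword."}',
    },
]

print(to_chatml(example))
```

In a real training pipeline it is usually safer to let the tokenizer apply its own chat template than to hand-roll the string, since that keeps training and inference formats in step by construction.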

### Fine-tuning Priority (from 2.0 spec)
- Voice (persona, gemma3:4b) and Eye (router, functiongemma) are the 1.0.1 fine-tune targets.
- The conductor should run on a base model with strong instruction-following. Fine-tuning it is not planned until 2.0.0.

---

## Appendix: Test Scripts

See the `scripts/` directory for the Python scripts used to conduct these interviews. All scripts query Ollama's API at `http://192.168.0.141:11437`.
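
The scripts themselves are not reproduced in this report. As a rough sketch of the kind of query they would make, the following assumes Ollama's `/api/chat` endpoint with streaming disabled and measures wall-clock latency per request. `build_chat_request` and `timed_chat` are hypothetical names, not the ones used in `scripts/`.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://192.168.0.141:11437"  # host stated above


def build_chat_request(model, user_prompt, system_prompt=""):
    """Build an /api/chat payload; the system message is omitted when empty."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return {"model": model, "messages": messages, "stream": False}


def timed_chat(model, user_prompt, system_prompt="", base_url=OLLAMA_URL, timeout=300):
    """Send one chat request and return (reply text, latency in seconds)."""
    body = json.dumps(build_chat_request(model, user_prompt, system_prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.loads(response.read())
    return reply["message"]["content"], time.perf_counter() - start
```

Omitting the system message entirely, rather than sending an empty one, matches the "NO SYSTEM" and "IDENTITY" test conditions most closely.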