# Mortdecai 0.6.0 Model Analysis Report

**Date:** 2026-03-26
**Analyst:** Claude Opus 4.6 (non-developer, pure analysis role)
**Target models:** mortdecai:0.6.0-9b (Qwen3.5:9B LoRA), mortdecai:latest (Qwen3.5:27B LoRA)
**Comparison models:** qwen3.5:latest, qwen3.5:27b, gemma3:12b, phi4:14b, gemma3:27b, qwen3:14b
**Inference hardware:** Matt's Strix Halo (64GB unified memory, Ollama)
**Expected output format:** `{"commands": [...], "reasoning": "..."}`

---

## 1. Executive Summary

Both fine-tuned models failed every test: each scored 0/8 on JSON compliance, against 2/5 (40%) for the base models they were derived from. Training did not merely fail to take hold; it degraded the models' ability to follow instructions through the chat API, so the base models clearly outperform the fine-tunes.

---

## 2. Methodology

### Test Battery (Fine-tuned models — 8 tests each)

| Test | System Prompt | User Prompt | Purpose |
|------|--------------|-------------|---------|
| STANDARD | Full training system prompt | "give me a diamond sword" | Baseline compliance |
| MINIMAL | JSON format instruction only | "give me a diamond sword" | Minimal instruction following |
| NO SYSTEM | Empty | "give me a diamond sword" | Default behavior |
| /no_think | Training prompt + /no_think prefix | "give me a diamond sword" | Think token suppression |
| COMPLEX | Full training prompt | "build me a 5x5 house" | Multi-step command |
| IDENTITY | Empty | "What are you?" | Training awareness |
| FORMAT STRESS | Full training prompt | Time + weather + armor | Multi-command JSON |
| RISK | Full training prompt | "give me op" | Risk assessment |

### Test Battery (Base models — 5 tests each)

Same system prompt across all models. Prompts: diamond sword, multi-command, house build, op request, teleport.
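
The "JSON Valid" column in the results tables can be checked mechanically against the expected output format. The following is a minimal sketch of such a check, not the scoring code actually used; the function name `is_compliant` is hypothetical, and this strict version does not strip markdown fences, which the base-model scoring evidently tolerated.

```python
import json


def is_compliant(raw: str) -> bool:
    """Return True if raw parses as {"commands": [...], "reasoning": "..."}."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    commands = obj.get("commands")
    reasoning = obj.get("reasoning")
    # commands must be a list of strings, reasoning must be a string
    return (
        isinstance(commands, list)
        and all(isinstance(c, str) for c in commands)
        and isinstance(reasoning, str)
    )
```

A tutorial-style reply, a bare JSON array, or an object with `commands` as a single string would all be counted as non-compliant under this check.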

### Diagnostic Probes

1. **Training signal detection** — exact training data format
2. **/no_think effect** — across fine-tuned and base models
3. **Raw completion** — bypassing chat template via /api/generate
4. **Correction coercion** — multi-turn with explicit correction
5. **Mortdecai awareness** — identity and training memory
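
Probe 3 can be sketched as follows, assuming Ollama's `/api/generate` endpoint with `raw` set to `true`, which sends the prompt without applying the model's template. The helper names `build_raw_request` and `raw_complete` and the `num_predict` default are illustrative assumptions; the actual scripts may differ. The host is the one listed in the appendix.

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.0.141:11437"  # host taken from the appendix


def build_raw_request(model: str, prompt: str, num_predict: int = 64) -> dict:
    """Build a /api/generate payload that bypasses the chat template."""
    return {
        "model": model,
        "prompt": prompt,
        "raw": True,       # do not wrap the prompt in the model's template
        "stream": False,   # return a single JSON object instead of a stream
        "options": {"num_predict": num_predict},  # assumed cap on output tokens
    }


def raw_complete(model: str, prompt: str, base_url: str = OLLAMA_URL,
                 timeout: float = 120.0) -> str:
    """Send the raw prompt and return the completion text."""
    body = json.dumps(build_raw_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `raw_complete("mortdecai:0.6.0-9b", 'Assistant: {"commands": ["')` would issue the kind of prefilled prompt described for this probe.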

---

## 3. Fine-Tuned Model Results

### mortdecai:0.6.0-9b (Qwen3.5:9B LoRA)

| Test | JSON Valid | Response Type | Latency |
|------|-----------|---------------|---------|
| STANDARD | NO | Generic Minecraft tutorial | 29.9s |
| MINIMAL | NO | Crafting recipe + game tips | 35.9s |
| NO SYSTEM | NO | Crafting recipe + tips | 42.6s |
| /no_think | NO | Tutorial with version advice | 22.6s |
| COMPLEX | NO | **Real-world construction advice** (permits, carpenters) | 46.0s |
| IDENTITY | NO | "I am Qwen3.5 by Tongyi Lab" | 45.8s |
| FORMAT STRESS | NO | Think block, incomplete | 46.0s |
| RISK | NO | **Investment advice** ($1M portfolio) | 45.7s |

**Score: 0/8 JSON compliance (0%)**
**Comparison: Base Qwen3.5:9B scores 40% (2/5) — fine-tuning reduced performance by 40 percentage points**

Key observations:
- Completely ignores system prompts
- Leaks raw special tokens (`<|endoftext|><|im_start|>`) into output
- Interprets Minecraft prompts as real-world requests (house = construction, op = operator/investment)
- `/no_think` suppresses `<think>` tags but doesn't restore instruction following
- Average latency: 39.3s

### mortdecai:latest (Qwen3.5:27B LoRA)

| Test | JSON Valid | Response Type | Latency |
|------|-----------|---------------|---------|
| STANDARD | NO | Think block + crafting tutorial | 54.2s |
| MINIMAL | NO | Think block + crafting recipe | 28.2s |
| NO SYSTEM | NO | Crafting recipe + emoji tips | 30.7s |
| /no_think | NO | Think block (still!) + tutorial | 39.0s |
| COMPLEX | NO | Think block about real-world building | 49.2s |
| IDENTITY | NO | "I am Qwen3.5 by Tongyi Lab" | 21.9s |
| FORMAT STRESS | NO | Commands listed as markdown, not JSON | 23.8s |
| RISK | NO | Research study methodology (!) | 49.1s |

**Score: 0/8 JSON compliance (0%)**
**Comparison: Base Qwen3.5:27B scores 40% (2/5) — fine-tuning reduced performance by 40 percentage points**

Key observations:
- Wraps everything in `<think>` blocks even with `/no_think` prefix
- Think tokens consume most context budget before any useful output
- Also leaks special tokens
- "give me op" → completely derails into academic research methodology
- Average latency: 37.0s

---

## 4. Root Cause Analysis

### 4.1 Chat Template Mismatch (Primary cause)

**Evidence:** Probe 3 (raw completion mode) shows that the training signal is present in the weights.

When bypassing the chat template entirely:
```
Prompt: 'Assistant: {"commands": ["'
mortdecai:0.6.0-9b completion: 'give @p diamond_sword"]}'
mortdecai:latest completion: 'give @p diamond_sword"]}'
```

Both models produce valid, correct Minecraft commands in raw mode. The knowledge is there — it's just inaccessible through the chat API.

**Diagnosis:** The training data used a different message format than Qwen3.5's native chat template (`<|im_start|>system\n...\n<|im_end|>`). The LoRA learned to associate the JSON output format with the raw training format, not with the chat template wrapping that Ollama applies.
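
To illustrate what aligned training data would look like, here is a minimal sketch that renders a message list into generic ChatML. The function `to_chatml` is hypothetical, and the exact Qwen3.5 template (including any think-tag handling) should be taken from the model's own tokenizer or Modelfile rather than from this sketch.

```python
def to_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render [{"role": ..., "content": ...}, ...] as generic ChatML text."""
    parts = []
    for m in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

If the training examples were serialized in some other layout (for example a plain `Assistant:` prefix), the rendered strings would not match what the chat API sends at inference time, which is the mismatch described above.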

### 4.2 Catastrophic Forgetting

The LoRA didn't just add Minecraft knowledge — it overwrote the base model's instruction-following capability:
- Base Qwen3.5:9B: 70% command accuracy (bakeoff), 40% JSON compliance (this test)
- Fine-tuned 9B: 10% command accuracy (bakeoff), 0% JSON compliance (this test)

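This pattern is consistent with catastrophic forgetting, which typically results from a LoRA rank that is too high, a learning rate that is too aggressive, or insufficient regularization. Which of these applies cannot be determined from inference-time tests alone.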

### 4.3 Think Token Contamination

Qwen3.5's thinking mode (`<think>...</think>`) was not accounted for during training:
- 27B: Always generates think blocks, even with `/no_think`
- 9B: Sometimes generates think blocks
- Base models: `/no_think` works correctly on both sizes

The fine-tuning broke the `/no_think` mechanism on the 27B model, making think token suppression impossible.
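
Think-block contamination can be detected and partially worked around on the client side. The sketch below is an illustration only; `split_think` is a hypothetical helper that removes closed `<think>...</think>` blocks and flags responses where a block never closed, which is what happens when the token limit is hit mid-reasoning.

```python
import re

_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def split_think(text: str) -> tuple[str, bool]:
    """Return (text without think blocks, True if a think block was left unclosed)."""
    stripped = _THINK_BLOCK.sub("", text)
    if "<think>" in stripped:
        # An opening tag survived, so the block never closed: keep only what came before it
        return stripped.split("<think>", 1)[0].strip(), True
    return stripped.strip(), False
```

This does not recover the tokens already spent on reasoning; it only separates usable output from the think block.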

### 4.4 Special Token Leakage

Both fine-tuned models leak `<|endoftext|><|im_start|>user` into their output, which means:
- The model learned to predict special tokens as regular text
- The tokenizer/chat template boundary was corrupted during training
- This causes the model to "hallucinate" new conversation turns within a single response
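
Leakage of this kind is straightforward to detect in captured output. A minimal sketch, assuming the three token strings below are the ones of interest; the helper names are hypothetical.

```python
import re

# Special tokens that should never appear as literal text in a response
SPECIAL_TOKEN = re.compile(r"<\|(?:endoftext|im_start|im_end)\|>")


def find_leaked_tokens(text: str) -> list[str]:
    """Return every special token that appears literally in the text, in order."""
    return SPECIAL_TOKEN.findall(text)


def truncate_at_leak(text: str) -> str:
    """Cut the response at the first leaked token, dropping any hallucinated turns."""
    m = SPECIAL_TOKEN.search(text)
    return text if m is None else text[:m.start()].rstrip()
```

Truncating at the first leaked token discards the hallucinated follow-up turns, but it is a mitigation for logging and analysis, not a fix for the underlying training problem.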

---

## 5. Base Model Comparison

### Quantitative Results

| Model | JSON Valid | Has Commands | Avg Latency | Tokens/Response |
|-------|-----------|-------------|-------------|-----------------|
| **phi4:14b** | **5/5 (100%)** | **5/5** | **7.4s** | ~88 |
| **gemma3:12b** | **5/5 (100%)** | **5/5** | **12.9s** | ~117 |
| **gemma3:27b** | **5/5 (100%)** | **5/5** | 25.3s | ~166 |
| qwen3:14b | 3/5 (60%) | 3/5 | 23.8s | ~330 |
| qwen3.5:latest (9B) | 2/5 (40%) | 2/5 | 13.9s | ~370 |
| qwen3.5:27b | 2/5 (40%) | 2/5 | 65.4s | ~437 |

### Qualitative Assessment

**phi4:14b** — Fastest response times. Always wraps JSON in markdown fences (minor issue, easily stripped). Clean reasoning. Uses `@p` consistently. Good domain knowledge. House build attempt is structured but coordinates are imprecise.

**gemma3:12b** — Slightly slower but equally reliable. Sometimes returns raw JSON, sometimes wraps in fences. Uses `@s` (self) which is more correct for "give me" commands. Best Minecraft domain knowledge of all candidates. Very concise responses.

**gemma3:27b** — Same quality as 12b, 2x slower. Over-engineers some responses (unnecessary NBT attributes on armor). The tp command uses a redundant two-command approach. Not worth the latency penalty for most use cases.

**qwen3:14b** — Think tokens cause it to exceed token limits on complex prompts. When it does produce JSON, quality is decent but includes leading slashes on commands (against instructions).

**qwen3.5 (both sizes)** — Think tokens are the fundamental problem. Burns 300-400 tokens on reasoning before producing output, frequently hits token limits before completing JSON. The `/no_think` flag works on base models but is unreliable.
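
Two of the minor issues noted above, markdown fences around the JSON and leading slashes on commands, can be normalized before parsing. The sketch below shows one way to do so; `strip_fence` and `normalize_commands` are hypothetical helpers, not part of the test scripts.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built here to keep this example readable

_FENCED = re.compile(
    r"^\s*" + FENCE + r"(?:json)?\s*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_fence(text: str) -> str:
    """Return the body of a single fenced block, or the trimmed text if unfenced."""
    m = _FENCED.match(text)
    return m.group(1).strip() if m else text.strip()


def normalize_commands(commands: list[str]) -> list[str]:
    """Drop one leading slash from each command, as the instructions require."""
    return [c.strip().removeprefix("/") for c in commands]
```

A fenced reply can then be handled with `json.loads(strip_fence(reply))` followed by `normalize_commands(obj["commands"])`.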

---

## 6. Conductor Candidacy Assessment

**Question:** Is Qwen3.5 (27B or 9B) a good candidate for the Conductor/Orchestrator role?

**Answer: No.** Four reasons:

1. **Uncontrollable think token overhead.** The conductor needs fast, reliable responses. Qwen3.5's thinking mode adds 5-30s latency and burns context on reasoning that should happen in orchestrator code, not inside the model.

2. **Unreliable JSON compliance.** The conductor must produce structured output (routing decisions, tool calls, dispatch instructions) 100% of the time. Qwen3.5 manages 40% vs gemma3's 100%.

3. **Fragile under fine-tuning.** LoRA on Qwen3.5 caused catastrophic forgetting. If the conductor needs fine-tuning later, Qwen3.5 is a risky base.

4. **27B is too slow.** 65s average is unacceptable for a routing layer in the critical path of every player request.

### Recommended Conductor Candidates

| Rank | Model | Why |
|------|-------|-----|
| 1 | **phi4:14b** | Fastest (7.4s), 100% JSON, good reasoning |
| 2 | **gemma3:12b** | 100% JSON, best MC domain knowledge, 12.9s |
| 3 | **gemma3:27b** | 100% JSON, same output quality as 12b; only if latency budget allows (25.3s) |
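
One selection rule that reproduces this ranking from the quantitative results is to require 100% JSON compliance and then sort by average latency. The sketch below illustrates that rule using the figures from section 5; it is not necessarily how the ranking was derived, and the `Result` class and `shortlist` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    json_valid: int       # tests with valid JSON
    total: int            # tests run
    avg_latency_s: float  # average latency in seconds


# Figures copied from the Quantitative Results table
RESULTS = [
    Result("phi4:14b", 5, 5, 7.4),
    Result("gemma3:12b", 5, 5, 12.9),
    Result("gemma3:27b", 5, 5, 25.3),
    Result("qwen3:14b", 3, 5, 23.8),
    Result("qwen3.5:latest", 2, 5, 13.9),
    Result("qwen3.5:27b", 2, 5, 65.4),
]


def shortlist(results: list[Result], min_compliance: float = 1.0) -> list[Result]:
    """Keep models meeting the compliance threshold, fastest first."""
    ok = [r for r in results if r.json_valid / r.total >= min_compliance]
    return sorted(ok, key=lambda r: r.avg_latency_s)
```

Lowering `min_compliance` would admit the Qwen models, but the reliability requirement for the conductor argues against doing so.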

---

## 7. Recommendations

### Immediate Actions
1. **Delete the fine-tuned models** from Matt's Ollama. Base models are strictly superior.
2. **Use phi4:14b or gemma3:12b** for conductor prototyping.
3. **Preserve training data** (JSONL files) for future fine-tuning attempts.

### If Re-attempting Fine-tuning
1. **Fix chat template alignment.** Training data MUST use Qwen3.5's exact `<|im_start|>...<|im_end|>` format.
2. **Consider a different base model.** gemma3:12b scored 100% JSON compliance as a base model and may be more robust under LoRA.
3. **Lower LoRA rank and learning rate** to prevent catastrophic forgetting.
4. **Add `/no_think` handling** or use a model without built-in thinking mode.
5. **Validate with the chat API during training**, not just loss metrics.
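
Item 5 could be implemented as a small compliance check run against each checkpoint through the chat endpoint. The sketch below is one possible shape, assuming Ollama's `/api/chat` payload format; `build_chat_request` and `compliance_rate` are hypothetical names, and `ask` stands in for whatever function sends a prompt to the model and returns its reply text.

```python
import json
from typing import Callable


def build_chat_request(model: str, system: str, user: str) -> dict:
    """Payload for /api/chat, which applies the model's chat template."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "stream": False,
    }


def compliance_rate(prompts: list[str], ask: Callable[[str], str]) -> float:
    """Fraction of prompts whose reply is a JSON object with a 'commands' list."""
    if not prompts:
        return 0.0
    passed = 0
    for prompt in prompts:
        try:
            obj = json.loads(ask(prompt))
        except json.JSONDecodeError:
            continue  # non-JSON replies count as failures
        if isinstance(obj, dict) and isinstance(obj.get("commands"), list):
            passed += 1
    return passed / len(prompts)
```

Tracking this rate per checkpoint, alongside the loss, would surface a template mismatch early, since loss can keep falling while chat-API compliance stays at zero.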

### Fine-tuning Priority (from 2.0 spec)
- Voice (persona, gemma3:4b) and Eye (router, functiongemma) are the 1.0.1 fine-tune targets.
- The conductor should run on a base model with strong instruction-following. Fine-tuning is not planned until 2.0.0.

---

## Appendix: Test Scripts

See `scripts/` directory for the Python scripts used to conduct these interviews. All scripts query Ollama's API at `http://192.168.0.141:11437`.