Tested gemma3n:e4b, qwen3-coder:30b, phi4-mini, qwen3:8b, qwen3.5:9b, qwen3.5:4b, and qwen3:4b on structured command generation, all running on a single Quadro RTX 4000 (8GB). The 6.9B model beat the 30B model on every metric. Includes the test harness, evaluation dataset, raw results from all rounds, and a writeup covering the token budget discovery that nearly doubled one model's score overnight.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Results Summary
Final Standings (all rounds combined)
| Rank | Model | Params | Cmd Match | Exact Match | Syntax OK | Safety | No Grat. Actions | Avg Latency | Avg Tokens |
|---|---|---|---|---|---|---|---|---|---|
| 1 | gemma3n:e4b | 6.9B | 80.6% | 19.4% | 77.4% | 100% | 100% | 5.9s | 98 |
| 2 | qwen3:8b (1500 tok) | 8B | 77.4% | 12.9% | 64.5% | 96.8% | 100% | 16.0s | 212 |
| 3 | qwen3-coder:30b | 30B MoE | 67.7% | 16.1% | 71.0% | 93.5% | 96.8% | 14.7s | 163 |
| 4 | phi4-mini | 3.8B | 61.3% | 9.7% | 80.6% | 93.5% | 100% | 4.5s | 59 |
| 5 | qwen3:8b (400 tok) | 8B | 41.9% | 19.4% | 87.1% | 100% | 96.8% | 8.7s | 297 |
| 6 | qwen3.5:9b | 9B | 29.0% | 22.6% | 96.8% | 96.8% | 100% | 22.6s | 271 |
| 7 | qwen3.5:4b | 4B | 19.4% | 19.4% | 100% | 100% | 100% | 7.7s | 377 |
| 8 | qwen3:4b | 4B | 16.1% | 16.1% | 100% | 100% | 100% | 5.7s | 400 |
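
Every percentage in the table lands on a step of roughly 3.2 points, which is consistent with an evaluation set of 31 cases. That count is an inference from the numbers, not something the results state. A minimal sketch of the arithmetic, where `pass_rate`, `TOTAL_CASES`, and the pass counts are hypothetical names and values chosen for illustration:

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 1)


# Assumption: 31 evaluation cases, inferred from the table's granularity.
TOTAL_CASES = 31

print(pass_rate(25, TOTAL_CASES))  # matches gemma3n:e4b Cmd Match
print(pass_rate(13, TOTAL_CASES))  # matches qwen3:8b (400 tok) Cmd Match
print(pass_rate(24, TOTAL_CASES))  # matches qwen3:8b (1500 tok) Cmd Match
```

If that inference holds, a single test case is worth about 3.2 points, so gaps of one step between models (for example 96.8% vs 100% on Safety) come down to one case.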
Key Observations
- Size doesn't determine quality. gemma3n:e4b (6.9B) beat qwen3-coder:30b on every metric, including latency and token count.
- Token budget matters for thinking models. qwen3:8b jumped from 41.9% to 77.4% Cmd Match just by increasing `num_predict` from 400 to 1500.
- Safety is hard. Three models (qwen3-coder, phi4-mini, qwen3.5:9b) produced dangerous commands when asked politely, and qwen3:8b also dropped to 96.8% on Safety at the 1500-token budget.
- The 4B models (qwen3.5:4b and qwen3:4b) are too small. Their perfect syntax and safety scores are misleading -- they score high by producing empty responses, and their Cmd Match stays below 20%.
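
The token budget and empty-response observations can be illustrated with a short sketch. It assumes the models are served through Ollama's HTTP API at its default local address, which the results do not state explicitly (`num_predict` is an Ollama generation option). The helper names `build_request`, `generate`, and `looks_truncated` are hypothetical and are not taken from the test harness:

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, num_predict: int) -> dict:
    """Build a non-streaming generate payload with an explicit token budget."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }


def generate(model: str, prompt: str, num_predict: int = 1500,
             timeout: float = 120.0) -> dict:
    """Send the request and return the decoded JSON response."""
    body = json.dumps(build_request(model, prompt, num_predict)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


def looks_truncated(result: dict, num_predict: int) -> bool:
    """Flag responses that are empty or that used the whole token budget."""
    text = result.get("response", "").strip()
    return not text or result.get("eval_count", 0) >= num_predict
```

A harness could use a check like `looks_truncated` to report empty or budget-exhausted responses separately, rather than letting them count as syntactically valid and safe by default.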
Round Details
- Round 1: gemma3n:e4b vs qwen3-coder:30b (400 token budget)
- Round 2: qwen3.5:4b + qwen3.5:9b + gemma3n:e4b (400 token budget)
- Round 3: qwen3:4b + qwen3:8b + phi4-mini + gemma3n:e4b (400 token budget)
- Round 4: qwen3:8b + qwen3:4b + gemma3n:e4b (1500 token budget -- the fix)