small-llm-bakeoff/results/summary.md
Seth 2189579490 Small LLM Bake-Off: 7 models, 1 GPU, 31 tasks
Tested gemma3n:e4b, qwen3-coder:30b, phi4-mini, qwen3:8b, qwen3.5:9b,
qwen3.5:4b, and qwen3:4b on structured command generation on a single
Quadro RTX 4000 (8GB). The 6.9B model beat the 30B model on every metric.

Includes the test harness, evaluation dataset, raw results from all rounds,
and a writeup covering the token budget discovery that nearly doubled one
model's score overnight.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 10:50:43 -04:00


Results Summary

Final Standings (all rounds combined)

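| Rank | Model | Params | Cmd Match | Exact Match | Syntax OK | Safety | No Grat. Actions | Avg Latency | Avg Tokens |
|------|-------|--------|-----------|-------------|-----------|--------|------------------|-------------|------------|
| 1 | gemma3n:e4b | 6.9B | 80.6% | 19.4% | 77.4% | 100% | 100% | 5.9s | 98 |
| 2 | qwen3:8b (1500 tok) | 8B | 77.4% | 12.9% | 64.5% | 96.8% | 100% | 16.0s | 212 |
| 3 | qwen3-coder:30b | 30B MoE | 67.7% | 16.1% | 71.0% | 93.5% | 96.8% | 14.7s | 163 |
| 4 | phi4-mini | 3.8B | 61.3% | 9.7% | 80.6% | 93.5% | 100% | 4.5s | 59 |
| 5 | qwen3:8b (400 tok) | 8B | 41.9% | 19.4% | 87.1% | 100% | 96.8% | 8.7s | 297 |
| 6 | qwen3.5:9b | 9B | 29.0% | 22.6% | 96.8% | 96.8% | 100% | 22.6s | 271 |
| 7 | qwen3.5:4b | 4B | 19.4% | 19.4% | 100% | 100% | 100% | 7.7s | 377 |
| 8 | qwen3:4b | 4B | 16.1% | 16.1% | 100% | 100% | 100% | 5.7s | 400 |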

Key Observations

  1. Size doesn't determine quality. The 6.9B model beat the 30B model on every metric.
  2. Token budget matters for thinking models. qwen3:8b's Cmd Match jumped from 41.9% to 77.4% just by increasing num_predict from 400 to 1500.
  3. Safety is hard. Three models (qwen3-coder, phi4-mini, qwen3.5:9b) executed dangerous commands when asked politely. qwen3:8b at the 1500-token budget also scored below 100% on Safety (96.8%).
  4. The 4B models are too small. Their perfect Syntax OK and Safety scores are misleading -- they score high by producing empty responses, and their Cmd Match is the lowest in the table (19.4% and 16.1%).
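
One way to keep empty responses from inflating Syntax OK and Safety, as described in the last observation, is to report each metric twice: once over all tasks and once over non-empty responses only. The sketch below is illustrative and is not the harness's actual scoring code. The `TaskResult` fields and the `summarize` function are hypothetical names, and it assumes an empty response passes the syntax and safety checks trivially.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One graded task. Field names are hypothetical, not the harness's schema."""
    response: str    # raw model output
    syntax_ok: bool  # grader verdict on syntax
    safe: bool       # grader verdict on safety


def rate(flags):
    """Fraction of True values, or None when there is nothing to score."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None


def summarize(results):
    """Report each metric over all tasks and over non-empty responses only."""
    answered = [r for r in results if r.response.strip()]
    return {
        "empty_rate": rate(not r.response.strip() for r in results),
        "syntax_ok_all": rate(r.syntax_ok for r in results),
        "syntax_ok_answered": rate(r.syntax_ok for r in answered),
        "safe_all": rate(r.safe for r in results),
        "safe_answered": rate(r.safe for r in answered),
    }
```

A large gap between the `_all` and `_answered` columns, together with a high `empty_rate`, would flag a model whose perfect scores come from not answering.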

Round Details

  • Round 1: gemma3n:e4b vs qwen3-coder:30b (400 token budget)
  • Round 2: qwen3.5:4b + qwen3.5:9b + gemma3n:e4b (400 token budget)
  • Round 3: qwen3:4b + qwen3:8b + phi4-mini + gemma3n:e4b (400 token budget)
  • Round 4: qwen3:8b + qwen3:4b + gemma3n:e4b (1500 token budget -- the fix)
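
The budget change between Rounds 1-3 and Round 4 comes down to a single request option. The sketch below assumes the harness talks to a local Ollama server through its `/api/generate` endpoint, which the model tags and the `num_predict` name suggest but the summary does not confirm. The helper names (`build_request`, `generate`, `hit_budget`) are hypothetical, and the budget check is only a heuristic based on the `eval_count` field of the reply.

```python
import json
import urllib.request

# Ollama's default local endpoint; an assumption about the test setup.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, num_predict):
    """Build a non-streaming generate payload with an explicit token budget."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},  # max tokens to generate
    }


def generate(model, prompt, num_predict, url=OLLAMA_URL):
    """POST the payload and return the decoded JSON reply."""
    body = json.dumps(build_request(model, prompt, num_predict)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def hit_budget(reply, num_predict):
    """Heuristic: the reply used the whole budget, so it may be truncated."""
    return reply.get("eval_count", 0) >= num_predict
```

Running the same prompt with `num_predict=400` and `num_predict=1500` and comparing `hit_budget` on each reply is one way to spot a thinking model that spends its whole budget before emitting an answer.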