Small LLM Bake-Off: 7 models, 1 GPU, 31 tasks

Tested gemma3n:e4b, qwen3-coder:30b, phi4-mini, qwen3:8b, qwen3.5:9b,
qwen3.5:4b, and qwen3:4b on structured command generation, all run on a single
Quadro RTX 4000 (8GB). The 6.9B model beat the 30B model on every metric.

Includes the test harness, evaluation dataset, raw results from all rounds,
and a writeup covering the token budget discovery that doubled one model's
score overnight.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Seth Freiberg

2026-03-18 10:50:43 -04:00

commit 2189579490

10 changed files with 8803 additions and 0 deletions

results/round4_qwen3_1500tok.json (+2300)

File diff suppressed because it is too large.