T

Seth 2189579490 Small LLM Bake-Off: 7 models, 1 GPU, 31 tasks

Tested gemma3n:e4b, qwen3-coder:30b, phi4-mini, qwen3:8b, qwen3.5:9b,
qwen3.5:4b, and qwen3:4b on structured command generation from a single
Quadro RTX 4000 (8GB). The 6.9B model beat the 30B model on every metric.

Includes the test harness, evaluation dataset, raw results from all rounds,
and a writeup covering the token budget discovery that doubled one model's
score overnight.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-18 10:50:43 -04:00

results

Small LLM Bake-Off: 7 models, 1 GPU, 31 tasks

2026-03-18 10:50:43 -04:00

.gitignore

Small LLM Bake-Off: 7 models, 1 GPU, 31 tasks

2026-03-18 10:50:43 -04:00

bakeoff.py

Small LLM Bake-Off: 7 models, 1 GPU, 31 tasks

2026-03-18 10:50:43 -04:00

dataset.jsonl

Small LLM Bake-Off: 7 models, 1 GPU, 31 tasks

2026-03-18 10:50:43 -04:00

LICENSE

Small LLM Bake-Off: 7 models, 1 GPU, 31 tasks

2026-03-18 10:50:43 -04:00

README.md

Small LLM Bake-Off: 7 models, 1 GPU, 31 tasks

2026-03-18 10:50:43 -04:00

README.md

Small LLM Bake-Off: 7 Models, 1 GPU, 31 Tasks

Can a 7B model on an 8GB GPU outperform a 30B model on 128GB of RAM?

Yes. By a lot.

The Setup

We had a structured output task: take a natural language request and produce a JSON response containing a list of valid commands, a reasoning string, and an optional message. The domain was narrow (Minecraft server administration), the syntax rules were strict, and the model had to follow a detailed system prompt with specific formatting constraints.

The test hardware was modest: a Quadro RTX 4000 with 8GB of VRAM, running Ollama v0.18.1 inside an LXC container on a Proxmox server. The CPU was a dual Xeon E5-2680 v4 with 128GB of RAM -- plenty for CPU-offloaded layers, but the GPU had to do the heavy lifting.

We wrote 31 evaluation examples spanning five categories:

Category	Examples	What it tests
Command generation	20	Translate "give me a diamond sword" into the right command syntax
Safety	4	Refuse or scope-limit dangerous requests like "delete the world"
Information	2	Answer questions without generating commands
Negative examples	2	Known failure modes the model should handle gracefully
Mixed (prayer/RP)	3	Generate commands AND a creative text response

Each example had an expected output, and we scored models on five metrics: command match rate, exact match rate, syntax correctness, safety compliance, and whether the model added unnecessary actions not asked for (the "gratuitous teleport" problem).

The Contenders

Seven models, four families, ranging from 3.8B to 30B parameters:

Model	Params	Architecture	Quantization	VRAM Used	License
gemma3n:e4b	6.9B	Dense	Q4_K_M	2.5 GB (35/36 layers GPU)	Gemma ToU
qwen3-coder:30b	30B	MoE (3.3B active)	Q4_K_M	7.1 GB (18/49 layers GPU)	Apache 2.0
phi4-mini	3.8B	Dense	Q4_K_M	~2.5 GB (full GPU)	MIT
qwen3:8b	8B	Dense	Q4_K_M	5.6 GB (full GPU)	Apache 2.0
qwen3.5:9b	9B	Dense	Q4_K_M	6.6 GB (full GPU)	Apache 2.0
qwen3.5:4b	4B	Dense	Q4_K_M	~2.5 GB (full GPU)	Apache 2.0
qwen3:4b	4B	Dense	Q4_K_M	~2.5 GB (full GPU)	Apache 2.0

All models were served through the same Ollama instance, tested sequentially, with the same system prompts and temperature (0.2). The API was called with format: "json" to enforce structured output.

The Results

Rank	Model	Cmd Match	Syntax OK	Safety	Avg Latency
1	gemma3n:e4b	80.6%	77.4%	100%	5.9s
2	qwen3-coder:30b	67.7%	71.0%	93.5%	14.7s
3	phi4-mini	61.3%	80.6%	93.5%	4.5s
4	qwen3:8b	41.9%*	87.1%	100%	8.7s
5	qwen3.5:9b	29.0%*	96.8%	96.8%	22.6s
6	qwen3.5:4b	19.4%*	100%	100%	7.7s
7	qwen3:4b	16.1%*	100%	100%	5.7s

* These scores are misleadingly low due to a token budget issue -- see "The Plot Twist" below.

The Story

Chapter 1: The Surprise Winner

The biggest model wasn't the best. qwen3-coder:30b, a 30B-parameter Mixture-of-Experts model, managed only 67.7% command accuracy despite having 4x the parameters of the leader. Worse, it failed safety tests -- when prompted to stop the server or grant admin privileges, it complied. The 6.9B gemma3n:e4b model, consuming a third of the VRAM, beat it on every single metric while running nearly 3x faster.

Chapter 2: The Silent Majority

The Qwen3 and Qwen3.5 family models posted suspiciously low scores. The 4B models scored 16-19% command match, and even the 8B model only hit 42%. But their syntax scores were excellent (87-100%), and their safety compliance was perfect. Something didn't add up.

When we inspected the raw API responses, most "failures" were empty JSON objects -- {"commands": [], "reasoning": "", "message": null}. The models weren't generating wrong commands. They were generating nothing.

Chapter 3: The Plot Twist

The Qwen3 family uses internal "thinking" tokens -- a chain-of-thought mechanism where the model reasons extensively before producing output. These thinking tokens are consumed from the generation budget but stripped from the final response.

Our initial token budget was 400 tokens (num_predict: 400). When we checked the API metadata on empty responses:

done_reason: "length"
eval_count: 400

The model had used all 400 tokens thinking, leaving zero for the actual answer. The response was empty not because the model couldn't answer, but because we ran out of runway before it finished thinking.

We tested different budgets:

Budget	eval_count	done_reason	Commands generated?
400	400	length	No (empty)
1000	62	stop	Yes
1500	69	stop	Yes

At 1000 tokens, the model used ~930 thinking tokens, then output a clean 62-token JSON response with correct commands and done_reason: stop. The thinking was actually high quality -- it just needed room to finish.

Chapter 4: The Revised Standings

With a 1500-token budget, qwen3:8b jumped dramatically:

Metric	400 tokens	1500 tokens	Delta
Command match	41.9%	77.4%	+35.5%
Safety	100%	96.8%	-3.2%
No unnecessary actions	96.8%	100%	+3.2%
Avg latency	8.7s	16.0s	+7.3s

At 77.4%, qwen3:8b was now neck-and-neck with the leader. The tradeoff: it thinks hard (16s vs 6s), but when it answers, the syntax quality is very high. The 4B models remained stuck -- 1500 tokens still wasn't enough for their even more verbose reasoning chains.

Chapter 5: The Verdict

gemma3n:e4b is the practical choice. It's accurate, fast, safe, and fits comfortably on an 8GB card. It doesn't waste tokens thinking -- it just answers.

qwen3:8b is the most interesting model. Its internal reasoning produces higher-quality syntax, but at the cost of speed and token efficiency. In a scenario where you could fine-tune the model to internalize its reasoning (rather than generating it every time), qwen3:8b might be the better long-term investment. Its Apache 2.0 license doesn't hurt either.

phi4-mini is the speed demon. At 3.8B parameters and MIT license, it's the fastest model tested (4.5s average). Surprisingly capable, but its safety gaps (it tried to stop the server when asked to "delete the world") disqualify it for anything unsupervised.

qwen3-coder:30b is a cautionary tale. Bigger isn't better. A 30B MoE model that runs 3x slower, uses 3x the VRAM, and still fails safety tests is hard to justify when a 7B dense model beats it outright.

Methodology

Scoring

Command Match (loose): At least one command type (first word) overlaps between expected and actual output
Exact Match: Expected and actual command sets are identical (very strict)
Syntax OK: No known syntax errors detected (old formats, missing namespaces, invalid arguments)
Safety: When expected output is "refuse" (empty commands + destructive flag), model also refuses
No Gratuitous Actions: Model doesn't add teleports, effects, or other actions the user didn't request

What Wasn't Tested

Multi-turn conversations (all tests were single-turn)
Tool calling / function calling
Long-context performance
Non-English prompts
Creative or open-ended tasks

Hardware

Component	Spec
GPU	Quadro RTX 4000, 8GB GDDR6, Turing (compute 7.5)
CPU	2x Intel Xeon E5-2680 v4 (28 cores / 56 threads)
RAM	128GB DDR4
Host	Proxmox VE, LXC container with GPU bind-mount
Ollama	v0.18.1, `FLASH_ATTENTION=true`, context length 4096

Reproducing This

The test harness (bakeoff.py) calls any Ollama-compatible endpoint. The evaluation dataset (dataset.jsonl) contains the 31 test examples. The system prompts are embedded in the harness.

# Install dependencies
pip install requests

# Run against your own Ollama instance
python bakeoff.py --ollama-url http://localhost:11434 --models gemma3n:e4b qwen3:8b phi4-mini

# Adjust token budget (matters for Qwen thinking models)
# Edit max_tokens in bakeoff.py (default: 1500)

Results are saved as JSON in results/.

Files

small-llm-bakeoff/
├── README.md                          # This file
├── bakeoff.py                         # Self-contained test harness
├── dataset.jsonl                      # 31 evaluation examples
├── results/
│   ├── summary.md                     # Formatted results table
│   ├── round1_gemma3n_qwencoder.json  # gemma3n:e4b vs qwen3-coder:30b
│   ├── round2_qwen35_gemma3n.json     # qwen3.5 family vs gemma3n
│   ├── round3_qwen3_phi4_gemma3n.json # qwen3 + phi4-mini vs gemma3n
│   └── round4_qwen3_1500tok.json      # qwen3 with fixed token budget
└── LICENSE

License

The test harness and this article are released under the MIT License. Model outputs are not redistributed. The evaluation dataset contains domain-specific examples authored for this test.