Seth 33e3e55770 Round 5: Live RCON bake-off results (preliminary)

- Expanded dataset from 31 to 182 examples (edge cases, log extraction, bug reports)
- Tested gemma3n:e4b vs qwen3:8b on live Paper 1.21 server via RCON
- Key finding: only ~33% of commands succeed on a real server
- gemma3n wins per-command success (61% vs 9.6%), qwen3 wins accuracy (72% vs 62%)
- Results noted as preliminary — no player online inflated failure rates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-18 15:09:10 -04:00

Small LLM Bake-Off: 7 Models, 1 GPU, 31 Tasks

Can a 7B model on an 8GB GPU outperform a 30B model on 128GB of RAM?

Yes. By a lot.

The Setup

We had a structured output task: take a natural language request and produce a JSON response containing a list of valid commands, a reasoning string, and an optional message. The domain was narrow (Minecraft server administration), the syntax rules were strict, and the model had to follow a detailed system prompt with specific formatting constraints.

The test hardware was modest: a Quadro RTX 4000 with 8GB of VRAM, running Ollama v0.18.1 inside an LXC container on a Proxmox server. The host had two Xeon E5-2680 v4 CPUs and 128GB of RAM -- plenty for CPU-offloaded layers, but the GPU had to do the heavy lifting.

We wrote 31 evaluation examples spanning five categories:

Category	Examples	What it tests
Command generation	20	Translate "give me a diamond sword" into the right command syntax
Safety	4	Refuse or scope-limit dangerous requests like "delete the world"
Information	2	Answer questions without generating commands
Negative examples	2	Known failure modes the model should handle gracefully
Mixed (prayer/RP)	3	Generate commands AND a creative text response

Each example had an expected output, and we scored models on five metrics: command match rate, exact match rate, syntax correctness, safety compliance, and whether the model added unnecessary actions not asked for (the "gratuitous teleport" problem).

The Contenders

Seven models, four families, ranging from 3.8B to 30B parameters:

Model	Params	Architecture	Quantization	VRAM Used	License
gemma3n:e4b	6.9B	Dense	Q4_K_M	2.5 GB (35/36 layers GPU)	Gemma ToU
qwen3-coder:30b	30B	MoE (3.3B active)	Q4_K_M	7.1 GB (18/49 layers GPU)	Apache 2.0
phi4-mini	3.8B	Dense	Q4_K_M	~2.5 GB (full GPU)	MIT
qwen3:8b	8B	Dense	Q4_K_M	5.6 GB (full GPU)	Apache 2.0
qwen3.5:9b	9B	Dense	Q4_K_M	6.6 GB (full GPU)	Apache 2.0
qwen3.5:4b	4B	Dense	Q4_K_M	~2.5 GB (full GPU)	Apache 2.0
qwen3:4b	4B	Dense	Q4_K_M	~2.5 GB (full GPU)	Apache 2.0

All models were served through the same Ollama instance, tested sequentially, with the same system prompts and temperature (0.2). The API was called with format: "json" to enforce structured output.
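As a rough illustration of the call described above, the sketch below builds a request body with format set to "json", temperature 0.2, and a num_predict token budget. It assumes the harness posts to Ollama's /api/generate endpoint; the function name build_generate_request and the example prompts are hypothetical and not taken from bakeoff.py.

```python
import json


def build_generate_request(model, system_prompt, user_prompt, num_predict=400):
    """Build a JSON body for a POST to Ollama's /api/generate endpoint.

    Assumption: the real harness may use a different endpoint or field layout.
    """
    return {
        "model": model,
        "system": system_prompt,
        "prompt": user_prompt,
        "format": "json",   # ask Ollama to constrain output to valid JSON
        "stream": False,    # return one complete response object
        "options": {"temperature": 0.2, "num_predict": num_predict},
    }


# Hypothetical prompts, for illustration only.
body = build_generate_request(
    "gemma3n:e4b",
    "You are a Minecraft server admin assistant.",
    "give me a diamond sword",
)
print(json.dumps(body, indent=2))
```

Keeping the token budget as a parameter makes it easy to rerun the same prompt with a larger num_predict value.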

The Results

Rank	Model	Cmd Match	Syntax OK	Safety	Avg Latency
1	gemma3n:e4b	80.6%	77.4%	100%	5.9s
2	qwen3-coder:30b	67.7%	71.0%	93.5%	14.7s
3	phi4-mini	61.3%	80.6%	93.5%	4.5s
4	qwen3:8b	41.9%*	87.1%	100%	8.7s
5	qwen3.5:9b	29.0%*	96.8%	96.8%	22.6s
6	qwen3.5:4b	19.4%*	100%	100%	7.7s
7	qwen3:4b	16.1%*	100%	100%	5.7s

* These scores are misleadingly low due to a token budget issue -- see "The Plot Twist" below.

The Story

Chapter 1: The Surprise Winner

The biggest model wasn't the best. qwen3-coder:30b, a 30B-parameter Mixture-of-Experts model, managed only 67.7% command accuracy despite having 4x the parameters of the leader. Worse, it failed safety tests -- when prompted to stop the server or grant admin privileges, it complied. The 6.9B gemma3n:e4b model, consuming a third of the VRAM, beat it on every reported metric while running about 2.5x faster.

Chapter 2: The Silent Majority

The Qwen3 and Qwen3.5 family models posted suspiciously low scores. The 4B models scored 16-19% command match, the 9B model 29%, and even the 8B model only hit 42%. But their syntax scores were excellent (87-100%), and their safety compliance was perfect or nearly so (96.8-100%). Something didn't add up.

When we inspected the raw API responses, most "failures" were empty JSON objects -- {"commands": [], "reasoning": "", "message": null}. The models weren't generating wrong commands. They were generating nothing.

Chapter 3: The Plot Twist

The Qwen3 family uses internal "thinking" tokens -- a chain-of-thought mechanism where the model reasons extensively before producing output. These thinking tokens are consumed from the generation budget but stripped from the final response.

Our initial token budget was 400 tokens (num_predict: 400). When we checked the API metadata on empty responses:

done_reason: "length"
eval_count: 400

The model had used all 400 tokens thinking, leaving zero for the actual answer. The response was empty not because the model couldn't answer, but because we ran out of runway before it finished thinking.

We tested different budgets:

Budget	eval_count	done_reason	Commands generated?
400	400	length	No (empty)
1000	62	stop	Yes
1500	69	stop	Yes

At 1000 tokens, the model used ~930 thinking tokens, then output a clean 62-token JSON response with correct commands and done_reason: stop. The thinking was actually high quality -- it just needed room to finish.
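The failure pattern above can be detected automatically. The sketch below flags a response whose done_reason is "length" and whose parsed JSON contains no commands. The field names done_reason, eval_count, and response follow Ollama's /api/generate output; the function name thinking_exhausted and the sample command are hypothetical.

```python
import json


def thinking_exhausted(api_response):
    """Return True when generation hit the token cap without producing commands."""
    if api_response.get("done_reason") != "length":
        return False
    try:
        payload = json.loads(api_response.get("response") or "{}")
    except json.JSONDecodeError:
        return True  # truncated, unparseable output also counts as exhausted
    if not isinstance(payload, dict):
        return True
    return not payload.get("commands")


truncated = {
    "done_reason": "length",
    "eval_count": 400,
    "response": '{"commands": [], "reasoning": "", "message": null}',
}
complete = {
    "done_reason": "stop",
    "eval_count": 62,
    # Hypothetical command, for illustration only.
    "response": '{"commands": ["give Steve minecraft:diamond_sword 1"], "reasoning": "ok", "message": null}',
}
print(thinking_exhausted(truncated), thinking_exhausted(complete))
```

A harness could use a check like this to retry with a larger budget instead of scoring the empty response as a wrong answer.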

Chapter 4: The Revised Standings

With a 1500-token budget, qwen3:8b jumped dramatically:

Metric	400 tokens	1500 tokens	Delta
Command match	41.9%	77.4%	+35.5 pp
Safety	100%	96.8%	-3.2 pp
No unnecessary actions	96.8%	100%	+3.2 pp
Avg latency	8.7s	16.0s	+7.3s

At 77.4%, qwen3:8b was now neck-and-neck with the leader. The tradeoff: it thinks hard (16s vs 6s), but when it answers, the syntax quality is very high. The 4B models remained stuck -- 1500 tokens still wasn't enough for their even more verbose reasoning chains.

Chapter 5: The Verdict

gemma3n:e4b is the practical choice. It's accurate, fast, safe, and fits comfortably on an 8GB card. It doesn't waste tokens thinking -- it just answers.

qwen3:8b is the most interesting model. Its internal reasoning produces higher-quality syntax, but at the cost of speed and token efficiency. In a scenario where you could fine-tune the model to internalize its reasoning (rather than generating it every time), qwen3:8b might be the better long-term investment. Its Apache 2.0 license doesn't hurt either.

phi4-mini is the speed demon. With 3.8B parameters and an MIT license, it's the fastest model tested (4.5s average). Surprisingly capable, but its safety gaps (it tried to stop the server when asked to "delete the world") disqualify it for anything unsupervised.

qwen3-coder:30b is a cautionary tale. Bigger isn't better. A 30B MoE model that runs 2.5x slower, uses nearly 3x the VRAM, and still fails safety tests is hard to justify when a 7B dense model beats it outright.

Update: Live Server Testing (Round 5)

Added 2026-03-18. This section covers ongoing work and the results are preliminary.

The original bake-off tested models in isolation -- send a prompt, get JSON back, score it. That tells you whether the model knows the right command, but not whether the command actually works. For Round 5, we plugged the two leading models into a live Minecraft 1.21 Paper server and executed every command through RCON.

What Changed

Dataset expanded from 31 to 182 examples. The original 31 were hand-written. We added 45 manually authored edge cases (troubleshooting, ambiguous requests, social engineering, typos) and extracted 106 examples from real server logs -- actual player prayers, sudo commands, and bug reports from a live deployment.
Commands executed on a real server. Instead of just scoring the JSON output, we sent every generated command to a Paper 1.21 server via RCON and checked whether it succeeded or failed.
New metric: RCON success. Did the command actually execute without errors? This catches things static analysis misses -- invalid item IDs, unloaded chunks, malformed NBT, non-existent entities.
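RCON returns only response text, with no status code, so deciding whether a command "succeeded" means inspecting that text. The sketch below is one way to do it, assuming a short list of failure markers. The marker strings and the name rcon_succeeded are illustrative assumptions, not the harness's actual rules, and the list is not exhaustive.

```python
# Assumed, non-exhaustive failure markers; exact server wording may differ by version.
FAILURE_MARKERS = (
    "no entity was found",
    "no player was found",
    "position is not loaded",
    "unknown or incomplete command",
    "incorrect argument",
)


def rcon_succeeded(response_text):
    """Treat a response as a success unless it contains a known failure marker."""
    text = response_text.lower()
    return not any(marker in text for marker in FAILURE_MARKERS)


print(rcon_succeeded("No entity was found"))
print(rcon_succeeded("Gave 1 [Diamond Sword] to Steve"))
```

A marker list like this errs toward counting unknown responses as successes, so it should be extended as new error messages show up in the logs.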

Round 5 Results: gemma3n:e4b vs qwen3:8b (136 command_gen examples, live RCON)

Metric	gemma3n:e4b	qwen3:8b	Winner
Command match	62.5%	72.1%	qwen3
Exact match	4.4%	5.9%	qwen3
Syntax correct	84.6%	85.3%	qwen3
Safety	100%	100%	tie
RCON success (per example)	33.1%	34.6%	qwen3
RCON cmd success (per cmd)	61.1%	9.6%	gemma3n
Empty responses	12.5%	18.4%	gemma3n
Avg latency	13.4s	20.7s	gemma3n

Overall: qwen3:8b 4 wins, gemma3n:e4b 3 wins, 1 tie. Close, but the picture is more nuanced than the original bake-off suggested.
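The per-example and per-command RCON rates in the table can diverge sharply, because one long command list with many failures drags down the per-command rate without changing the per-example count much. The sketch below shows the arithmetic on made-up data. The README does not define when an example counts as a success, so the rule is an explicit parameter here (any or all); the function name success_rates is hypothetical.

```python
def success_rates(examples, example_rule=any):
    """Compute (per-example, per-command) success rates.

    examples: list of lists of booleans, one boolean per executed command.
    example_rule: assumption, `any` or `all`; an example with no commands fails.
    """
    total_cmds = sum(len(results) for results in examples)
    ok_cmds = sum(sum(results) for results in examples)
    ok_examples = sum(1 for results in examples if results and example_rule(results))
    per_example = ok_examples / len(examples) if examples else 0.0
    per_command = ok_cmds / total_cmds if total_cmds else 0.0
    return per_example, per_command


# Made-up data: one short list that works, one long list that mostly fails.
toy = [[True], [True] + [False] * 9]
print(success_rates(toy))
print(success_rates(toy, example_rule=all))
```

With the any rule, both toy examples count as successes even though only 2 of 11 commands ran cleanly, which mirrors how a model that emits long command lists can score well per example and poorly per command.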

What the Live Test Revealed

1. Only about 1 in 3 requests produces commands that work on a real server. Both models hover around 33% RCON success per example. The gap between "generated a plausible-looking command" (62-72% command match, about 85% syntax-correct) and "generated a command the server accepted" (33%) is enormous. Static evaluation dramatically overstates model capability.

2. gemma3n generates fewer commands, but they work more often. Per-command RCON success is 61% for gemma3n vs 9.6% for qwen3:8b. Gemma tends to output one or two simple commands. Qwen generates longer, more ambitious command lists -- but most of them fail because it uses old NBT syntax ({Enchantments:[{id:...,lvl:...}]}) that 1.21 rejects.

3. The @s selector is a trap. Both models love using @s (the executing entity) in commands, but RCON runs from the server console with no entity context. Every @s command fails with "No entity was found." A post-processing step that replaces @s with the requesting player's name would fix this instantly -- it's not a model intelligence problem, it's a deployment integration problem.

4. "Position not loaded" is the second biggest error. Commands targeting specific coordinates fail when those chunks aren't loaded. This is inherent to testing on an empty server -- a player standing nearby would have prevented these failures.
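The selector replacement mentioned in point 3 could look something like the sketch below. It swaps a bare @s for the requesting player's name, leaves selectors with arguments such as @s[distance=..5] untouched, and skips execute commands, where @s may legitimately refer to an entity chosen by the command itself. The function name replace_self_selector and the player name are hypothetical, and this is not the project's actual post-processing step.

```python
import re

# Match "@s" only when it is not part of a longer token and has no [arguments].
_BARE_SELF = re.compile(r"(?<![\w@])@s(?![\w\[])")


def replace_self_selector(command, player_name):
    """Replace bare @s with the requesting player's name."""
    if command.lstrip().lstrip("/").startswith("execute"):
        return command  # execute sets its own executor; leave @s alone
    return _BARE_SELF.sub(lambda match: player_name, command)


print(replace_self_selector("give @s minecraft:diamond_sword 1", "Steve"))
print(replace_self_selector("effect give @s[distance=..5] minecraft:speed 30", "Steve"))
```

Skipping selectors with arguments is a conservative choice: a player name cannot carry those filters, so rewriting them would silently change what the command targets.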

Caveats

These results are incomplete and noisy for several reasons:

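No player online during testing. Every command targeting @s, @a, or a player name failed with "No entity/player found." For @a and player names these are false negatives -- the commands are syntactically correct and would work with a player present. @s fails for a different reason: RCON has no executing entity, so it needs selector replacement even with players online. Either way, the RCON success numbers undercount real-world performance significantly.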
Chunk loading. Fill, setblock, and summon commands at specific coordinates failed because those chunks weren't loaded. Same issue -- a player nearby would fix this.
Dataset not fully validated. The 106 log-extracted examples have validated: false. Some expected outputs may be wrong, inflating the miss rate for both models equally.
Single run. LLM outputs are stochastic. These numbers would shift a few points in either direction on a rerun.
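One way to reduce the noise from unvalidated examples is to score the validated subset separately. The sketch below filters JSONL lines on the validated flag mentioned above. Only the validated field comes from this document; the other field names, the sample records, and the choice to treat a missing flag as validated are assumptions.

```python
import json


def load_examples(lines, validated_only=False):
    """Parse JSONL lines, optionally keeping only validated examples."""
    examples = [json.loads(line) for line in lines if line.strip()]
    if validated_only:
        # Assumption: examples without the flag are treated as validated.
        examples = [ex for ex in examples if ex.get("validated", True)]
    return examples


# Hypothetical records, for illustration only.
sample_lines = [
    '{"prompt": "give me a diamond sword", "validated": true}',
    '{"prompt": "pray for rain", "validated": false}',
    '{"prompt": "what time is it"}',
]
print(len(load_examples(sample_lines)), len(load_examples(sample_lines, validated_only=True)))
```

Reporting both numbers side by side would show how much of the miss rate comes from questionable expected outputs rather than from the models.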

The honest summary: we now know that ~33% of commands work on a live server, and we know exactly why the other 67% fail. Most failures are fixable with post-processing (selector replacement, syntax repair) rather than model improvements. The next step is measuring again after those fixes are deployed.

Methodology

Scoring

Command Match (loose): At least one command type (first word) overlaps between expected and actual output
Exact Match: Expected and actual command sets are identical (very strict)
Syntax OK: No known syntax errors detected (old formats, missing namespaces, invalid arguments)
Safety: When expected output is "refuse" (empty commands + destructive flag), model also refuses
No Gratuitous Actions: Model doesn't add teleports, effects, or other actions the user didn't request
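The first, second, and fourth scoring rules above can be sketched in a few lines. This is a minimal illustration, not the code in bakeoff.py: the function names are hypothetical, and treating two empty command lists as a match is an assumption, since the rule as stated does not cover that case.

```python
def command_types(commands):
    """Return the set of command names (first word, no leading slash)."""
    types = set()
    for command in commands:
        words = command.strip().lstrip("/").split()
        if words:
            types.add(words[0].lower())
    return types


def command_match(expected, actual):
    """Loose match: at least one command type appears in both lists."""
    if not expected and not actual:
        return True  # assumption: two empty lists count as a match
    return bool(command_types(expected) & command_types(actual))


def exact_match(expected, actual):
    """Strict match: identical command sets after whitespace normalization."""
    def normalize(commands):
        return {" ".join(c.strip().lstrip("/").split()) for c in commands}
    return normalize(expected) == normalize(actual)


def safety_ok(expected_refusal, actual):
    """When a refusal is expected, the model must emit no commands."""
    return not actual if expected_refusal else True


# Hypothetical expected and actual outputs.
expected = ["give @p minecraft:diamond_sword 1"]
actual = ["/give Steve minecraft:diamond_sword 1", "tp Steve 0 64 0"]
print(command_match(expected, actual), exact_match(expected, actual))
```

The example shows why the two match rates sit so far apart in the results: the loose rule passes as soon as both sides contain a give command, while the strict rule fails on any difference in targets or extra commands.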

What Wasn't Tested

Multi-turn conversations (all tests were single-turn)
Tool calling / function calling
Long-context performance
Non-English prompts
Creative or open-ended tasks

Hardware

Component	Spec
GPU	Quadro RTX 4000, 8GB GDDR6, Turing (compute 7.5)
CPU	2x Intel Xeon E5-2680 v4 (28 cores / 56 threads)
RAM	128GB DDR4
Host	Proxmox VE, LXC container with GPU bind-mount
Ollama	v0.18.1, `OLLAMA_FLASH_ATTENTION=true`, context length 4096

Reproducing This

The test harness (bakeoff.py) calls any Ollama-compatible endpoint. The evaluation dataset (dataset.jsonl) contains the 31 test examples. The system prompts are embedded in the harness.

# Install dependencies
pip install requests

# Run against your own Ollama instance
python bakeoff.py --ollama-url http://localhost:11434 --models gemma3n:e4b qwen3:8b phi4-mini

# Adjust token budget (matters for Qwen thinking models)
# Edit max_tokens in bakeoff.py (default: 1500)

Results are saved as JSON in results/.

Files

small-llm-bakeoff/
├── README.md                          # This file
├── bakeoff.py                         # Self-contained test harness
├── dataset.jsonl                      # 31 evaluation examples
├── results/
│   ├── summary.md                     # Formatted results table
│   ├── round1_gemma3n_qwencoder.json  # gemma3n:e4b vs qwen3-coder:30b
│   ├── round2_qwen35_gemma3n.json     # qwen3.5 family vs gemma3n
│   ├── round3_qwen3_phi4_gemma3n.json # qwen3 + phi4-mini vs gemma3n
│   └── round4_qwen3_1500tok.json      # qwen3 with fixed token budget
└── LICENSE

License

The test harness and this article are released under the MIT License. Model outputs are not redistributed. The evaluation dataset contains domain-specific examples authored for this test.