small-llm-bakeoff/README.md

# Small LLM Bake-Off: 7 Models, 1 GPU, 31 Tasks

**Can a 7B model on an 8GB GPU outperform a 30B model on 128GB of RAM?**

Yes. By a lot.

---

## The Setup

We had a structured output task: take a natural language request and produce a JSON response containing a list of valid commands, a reasoning string, and an optional message. The domain was narrow (Minecraft server administration), the syntax rules were strict, and the model had to follow a detailed system prompt with specific formatting constraints.

The test hardware was modest: a Quadro RTX 4000 with 8GB of VRAM, running Ollama v0.18.1 inside an LXC container on a Proxmox server. The CPU was a dual Xeon E5-2680 v4 with 128GB of RAM -- plenty for CPU-offloaded layers, but the GPU had to do the heavy lifting.

We wrote 31 evaluation examples spanning five categories:

| Category | Examples | What it tests |
|----------|---------|---------------|
| Command generation | 20 | Translate "give me a diamond sword" into the right command syntax |
| Safety | 4 | Refuse or scope-limit dangerous requests like "delete the world" |
| Information | 2 | Answer questions without generating commands |
| Negative examples | 2 | Known failure modes the model should handle gracefully |
| Mixed (prayer/RP) | 3 | Generate commands AND a creative text response |

Each example had an expected output, and we scored models on five metrics: command match rate, exact match rate, syntax correctness, safety compliance, and whether the model added unnecessary actions not asked for (the "gratuitous teleport" problem).

## The Contenders

Seven models, four families, ranging from 3.8B to 30B parameters:

| Model | Params | Architecture | Quantization | VRAM Used | License |
|-------|--------|-------------|-------------|-----------|---------|
| gemma3n:e4b | 6.9B | Dense | Q4_K_M | 2.5 GB (35/36 layers GPU) | Gemma ToU |
| qwen3-coder:30b | 30B | MoE (3.3B active) | Q4_K_M | 7.1 GB (18/49 layers GPU) | Apache 2.0 |
| phi4-mini | 3.8B | Dense | Q4_K_M | ~2.5 GB (full GPU) | MIT |
| qwen3:8b | 8B | Dense | Q4_K_M | 5.6 GB (full GPU) | Apache 2.0 |
| qwen3.5:9b | 9B | Dense | Q4_K_M | 6.6 GB (full GPU) | Apache 2.0 |
| qwen3.5:4b | 4B | Dense | Q4_K_M | ~2.5 GB (full GPU) | Apache 2.0 |
| qwen3:4b | 4B | Dense | Q4_K_M | ~2.5 GB (full GPU) | Apache 2.0 |

All models were served through the same Ollama instance, tested sequentially, with the same system prompts and temperature (0.2). The API was called with `format: "json"` to enforce structured output.

## The Results

| Rank | Model | Cmd Match | Syntax OK | Safety | Avg Latency |
|:----:|-------|:---------:|:---------:|:------:|------------:|
| 1 | **gemma3n:e4b** | **80.6%** | 77.4% | **100%** | **5.9s** |
| 2 | qwen3-coder:30b | 67.7% | 71.0% | 93.5% | 14.7s |
| 3 | phi4-mini | 61.3% | 80.6% | 93.5% | 4.5s |
| 4 | qwen3:8b | 41.9%\* | 87.1% | **100%** | 8.7s |
| 5 | qwen3.5:9b | 29.0%\* | **96.8%** | 96.8% | 22.6s |
| 6 | qwen3.5:4b | 19.4%\* | **100%** | **100%** | 7.7s |
| 7 | qwen3:4b | 16.1%\* | **100%** | **100%** | 5.7s |

\* *These scores are misleadingly low due to a token budget issue -- see "The Plot Twist" below.*

## The Story

### Chapter 1: The Surprise Winner

The biggest model wasn't the best. `qwen3-coder:30b`, a 30B-parameter Mixture-of-Experts model, managed only 67.7% command accuracy despite having 4x the parameters of the leader. Worse, it **failed safety tests** -- when prompted to stop the server or grant admin privileges, it complied. The 6.9B `gemma3n:e4b` model, consuming a third of the VRAM, beat it on every single metric while running nearly 3x faster.

### Chapter 2: The Silent Majority

The Qwen3 and Qwen3.5 family models posted suspiciously low scores. The 4B models scored 16-19% command match, and even the 8B model only hit 42%. But their syntax scores were excellent (87-100%), and their safety compliance was perfect. Something didn't add up.

When we inspected the raw API responses, most "failures" were **empty JSON objects** -- `{"commands": [], "reasoning": "", "message": null}`. The models weren't generating wrong commands. They were generating *nothing*.

### Chapter 3: The Plot Twist

The Qwen3 family uses internal "thinking" tokens -- a chain-of-thought mechanism where the model reasons extensively before producing output. These thinking tokens are consumed from the generation budget but stripped from the final response.

Our initial token budget was 400 tokens (`num_predict: 400`). When we checked the API metadata on empty responses:

```
done_reason: "length"
eval_count: 400
```

The model had used all 400 tokens thinking, leaving zero for the actual answer. The response was empty not because the model couldn't answer, but because **we ran out of runway before it finished thinking**.

We tested different budgets:

| Budget | eval_count | done_reason | Commands generated? |
|--------|-----------|-------------|:-------------------:|
| 400 | 400 | length | No (empty) |
| 1000 | 62 | stop | Yes |
| 1500 | 69 | stop | Yes |

At 1000 tokens, the model used ~930 thinking tokens, then output a clean 62-token JSON response with correct commands and `done_reason: stop`. The thinking was actually high quality -- it just needed room to finish.

### Chapter 4: The Revised Standings

With a 1500-token budget, `qwen3:8b` jumped dramatically:

| Metric | 400 tokens | 1500 tokens | Delta |
|--------|:---:|:---:|:---:|
| Command match | 41.9% | **77.4%** | +35.5% |
| Safety | 100% | 96.8% | -3.2% |
| No unnecessary actions | 96.8% | **100%** | +3.2% |
| Avg latency | 8.7s | 16.0s | +7.3s |

At 77.4%, `qwen3:8b` was now neck-and-neck with the leader. The tradeoff: it thinks hard (16s vs 6s), but when it answers, the syntax quality is very high. The 4B models remained stuck -- 1500 tokens still wasn't enough for their even more verbose reasoning chains.

### Chapter 5: The Verdict

**`gemma3n:e4b` is the practical choice.** It's accurate, fast, safe, and fits comfortably on an 8GB card. It doesn't waste tokens thinking -- it just answers.

**`qwen3:8b` is the most interesting model.** Its internal reasoning produces higher-quality syntax, but at the cost of speed and token efficiency. In a scenario where you could fine-tune the model to internalize its reasoning (rather than generating it every time), qwen3:8b might be the better long-term investment. Its Apache 2.0 license doesn't hurt either.

**`phi4-mini` is the speed demon.** At 3.8B parameters and MIT license, it's the fastest model tested (4.5s average). Surprisingly capable, but its safety gaps (it tried to stop the server when asked to "delete the world") disqualify it for anything unsupervised.

**`qwen3-coder:30b` is a cautionary tale.** Bigger isn't better. A 30B MoE model that runs 3x slower, uses 3x the VRAM, and still fails safety tests is hard to justify when a 7B dense model beats it outright.

---

## Update: Live Server Testing (Round 5)

*Added 2026-03-18. This section covers ongoing work and the results are preliminary.*

The original bake-off tested models in isolation -- send a prompt, get JSON back, score it. That tells you whether the model *knows* the right command, but not whether the command actually *works*. For Round 5, we plugged the two leading models into a live Minecraft 1.21 Paper server and executed every command through RCON.

### What Changed

- **Dataset expanded from 31 to 182 examples.** The original 31 were hand-written. We added 45 manually authored edge cases (troubleshooting, ambiguous requests, social engineering, typos) and extracted 106 examples from real server logs -- actual player prayers, sudo commands, and bug reports from a live deployment.
- **Commands executed on a real server.** Instead of just scoring the JSON output, we sent every generated command to a Paper 1.21 server via RCON and checked whether it succeeded or failed.
- **New metric: RCON success.** Did the command actually execute without errors? This catches things static analysis misses -- invalid item IDs, unloaded chunks, malformed NBT, non-existent entities.

### Round 5 Results: gemma3n:e4b vs qwen3:8b (136 command_gen examples, live RCON)

| Metric | gemma3n:e4b | qwen3:8b | Winner |
|--------|:-----------:|:--------:|:------:|
| Command match | 62.5% | **72.1%** | qwen3 |
| Exact match | 4.4% | **5.9%** | qwen3 |
| Syntax correct | 84.6% | **85.3%** | qwen3 |
| Safety | **100%** | **100%** | tie |
| RCON success (per example) | 33.1% | **34.6%** | qwen3 |
| RCON cmd success (per cmd) | **61.1%** | 9.6% | gemma3n |
| Empty responses | **12.5%** | 18.4% | gemma3n |
| Avg latency | **13.4s** | 20.7s | gemma3n |

**Overall: qwen3:8b 5 wins, gemma3n:e4b 4 wins.** Close, but the picture is more nuanced than the original bake-off suggested.

### What the Live Test Revealed

**1. Only 1 in 3 commands actually works on a real server.** Both models hover around 33% RCON success per example. The gap between "generated a plausible-looking command" (70%+) and "generated a command the server accepted" (33%) is enormous. Static evaluation dramatically overstates model capability.

**2. gemma3n generates fewer commands, but they work more often.** Per-command RCON success is 61% for gemma3n vs 9.6% for qwen3:8b. Gemma tends to output one or two simple commands. Qwen generates longer, more ambitious command lists -- but most of them fail because it uses old NBT syntax (`{Enchantments:[{id:...,lvl:...}]}`) that 1.21 rejects.

**3. The `@s` selector is a trap.** Both models love using `@s` (the executing entity) in commands, but RCON runs from the server console with no entity context. Every `@s` command fails with "No entity was found." A post-processing step that replaces `@s` with the requesting player's name would fix this instantly -- it's not a model intelligence problem, it's a deployment integration problem.

**4. "Position not loaded" is the second biggest error.** Commands targeting specific coordinates fail when those chunks aren't loaded. This is inherent to testing on an empty server -- a player standing nearby would have prevented these failures.

### Caveats

These results are incomplete and noisy for several reasons:

- **No player online during testing.** Every command targeting `@s`, `@a`, or a player name failed with "No entity/player found." These are false negatives -- the commands are syntactically correct and would work with a player present. The RCON success numbers undercount real-world performance significantly.
- **Chunk loading.** Fill, setblock, and summon commands at specific coordinates failed because those chunks weren't loaded. Same issue -- a player nearby would fix this.
- **Dataset not fully validated.** The 106 log-extracted examples have `validated: false`. Some expected outputs may be wrong, inflating the miss rate for both models equally.
- **Single run.** LLM outputs are stochastic. These numbers would shift a few points in either direction on a rerun.

The honest summary: we now know that ~33% of commands work on a live server, and we know exactly *why* the other 67% fail. Most failures are fixable with post-processing (selector replacement, syntax repair) rather than model improvements. The next step is measuring again after those fixes are deployed.

---

## Methodology

### Scoring

- **Command Match (loose):** At least one command type (first word) overlaps between expected and actual output
- **Exact Match:** Expected and actual command sets are identical (very strict)
- **Syntax OK:** No known syntax errors detected (old formats, missing namespaces, invalid arguments)
- **Safety:** When expected output is "refuse" (empty commands + destructive flag), model also refuses
- **No Gratuitous Actions:** Model doesn't add teleports, effects, or other actions the user didn't request

### What Wasn't Tested

- Multi-turn conversations (all tests were single-turn)
- Tool calling / function calling
- Long-context performance
- Non-English prompts
- Creative or open-ended tasks

### Hardware

| Component | Spec |
|-----------|------|
| GPU | Quadro RTX 4000, 8GB GDDR6, Turing (compute 7.5) |
| CPU | 2x Intel Xeon E5-2680 v4 (28 cores / 56 threads) |
| RAM | 128GB DDR4 |
| Host | Proxmox VE, LXC container with GPU bind-mount |
| Ollama | v0.18.1, `FLASH_ATTENTION=true`, context length 4096 |

## Reproducing This

The test harness (`bakeoff.py`) calls any Ollama-compatible endpoint. The evaluation dataset (`dataset.jsonl`) contains the 31 test examples. The system prompts are embedded in the harness.

```bash
# Install dependencies
pip install requests

# Run against your own Ollama instance
python bakeoff.py --ollama-url http://localhost:11434 --models gemma3n:e4b qwen3:8b phi4-mini

# Adjust token budget (matters for Qwen thinking models)
# Edit max_tokens in bakeoff.py (default: 1500)
```

Results are saved as JSON in `results/`.

## Files

```
small-llm-bakeoff/
├── README.md                          # This file
├── bakeoff.py                         # Self-contained test harness
├── dataset.jsonl                      # 31 evaluation examples
├── results/
│   ├── summary.md                     # Formatted results table
│   ├── round1_gemma3n_qwencoder.json  # gemma3n:e4b vs qwen3-coder:30b
│   ├── round2_qwen35_gemma3n.json     # qwen3.5 family vs gemma3n
│   ├── round3_qwen3_phi4_gemma3n.json # qwen3 + phi4-mini vs gemma3n
│   └── round4_qwen3_1500tok.json      # qwen3 with fixed token budget
└── LICENSE
```

## License

The test harness and this article are released under the MIT License. Model outputs are not redistributed. The evaluation dataset contains domain-specific examples authored for this test.