feat: CLI coding agent bakeoff — 26b reproducibly silent-stops at write_file

Ran minimal agent loop (Ollama /api/chat + read_file/write_file/run_bash) on steel141 3090 Ti against 3 models on a broken-median-function task: - gemma4:31b-it-q4_K_M: PASS (8 iters, 1 write, 44s) — textbook trace - qwen3-coder:30b: PASS (15 iters, 1 write, 22s) — correct but chatty - gemma4:26b: FAIL (6 iters, 0 writes) — silently stops with eval=4 after reading source. Reproduced on second run. One-shot probe confirms 26b CAN produce the correct fix — failure is specifically at the write_file tool-call argument boundary. Updates GOTCHAS with a new HIGH-severity entry, SYNTHESIS model-selection table, CORPUS_cli_coding_agent.md empirical-follow-up pointer, and adds docs/reference/bakeoff-2026-04-18.md with the full writeup.
2026-04-18 13:27:50 -04:00
parent 4b9c537dda
commit a945207aab
15 changed files with 1172 additions and 1 deletions
@@ -0,0 +1,146 @@
+# CLI Coding Agent Bakeoff — 2026-04-18
+
+> Empirical follow-up to `CORPUS_cli_coding_agent.md`. Runs a minimal CLI coding
+> agent loop against three candidate models on identical hardware and an
+> identical broken-code task. **n=1 per model** (plus one re-run to check
+> reproducibility of a failure). Treat as a smoke test, not a benchmark.
+
+## Setup
+
+- **Host:** steel141 (Seth's local box)
+- **GPU:** NVIDIA RTX 3090 Ti, 24 GiB, ~22.7 GiB free
+- **Ollama:** 0.20.4
+- **Harness:** `scripts/bakeoff/harness.py` — custom minimal agent loop, **not** openclaw / open code / aider / pi / hermes. Protocol: Ollama `/api/chat` with `tools=[read_file, write_file, run_bash]`, non-streaming, `think: false`, `num_ctx: 32768`, `num_predict: 4096`, `temperature: 0.3`. Iteration cap = 15.
+- **Task:** `scripts/bakeoff/task_seed/` — Python package with buggy `median()` function. 3 of 7 pytest tests fail on even-length inputs. Fix is ~5 lines.
+- **System prompt:** generic CLI-agent template (identity + allowed tools + rules: "never modify tests", "prefer minimal edits"). Not tuned per model.
+
+All three models pulled from steel141's local Ollama, swapped in/out of GPU as each run started. First iteration per run pays the load cost; later iterations are hot.
+
+## Results
+
+| Model | Pass | Iterations | write_file | read_file | run_bash | Wall clock | Halt reason |
+|---|---|---|---|---|---|---|---|
+| `gemma4:26b` | **Fail** | 6 | **0** | 2 | 3 | 10.9s | `no_tool_calls` (silent empty response) |
+| `gemma4:26b` (retry) | **Fail** | 6 | **0** | 2 | 3 | 11.4s | `no_tool_calls` (reproduces exactly) |
+| `gemma4:31b-it-q4_K_M` | **Pass** | 8 | 1 | 2 | 4 | 44.1s | `no_tool_calls` (clean summary turn) |
+| `qwen3-coder:30b` | **Pass** | 15 (cap) | 1 | 4 | 8 | 22.6s | `no_tool_calls` (at iteration cap) |
+
+### Gemma 4 31B — clean run
+
+Textbook agent trace:
+
+1. `read_file README.md`
+2. `pytest` (exit=2, module not found — pytest needs PYTHONPATH)
+3. `ls -R`
+4. `PYTHONPATH=. pytest` → sees 3 failures
+5. `read_file calc/stats.py`
+6. `write_file calc/stats.py` (eval_count=330, 13.4s) — correct fix
+7. `PYTHONPATH=. pytest` → all green
+8. summary: *"I updated the `median` function in `calc/stats.py` to correctly calculate the average of the two middle elements..."*
+
+Zero wasted turns. One write. Minimal edit.
+
+### Qwen3-Coder 30B — correct but chatty
+
+Passed, but used all 15 iterations:
+
+- Narrated every step ("I'll help you...", "Now let's look at...")
+- Tried to read a non-existent file (`test_calc.py`) — wasted iter 2
+- Tried to `read_file` on a directory (`calc`) — wasted iter 6
+- Ran several redundant bash calls (`pwd && pytest`, etc.)
+- Emitted a ceremonial `echo "All tests pass..."` bash call at iter 14
+- Final turn was a polite summary
+
+The fix itself (iter 12) was correct on the first write. Quality is fine; efficiency isn't. Per-iteration it was fast (many 20-40 token turns) — total wall clock 22.6s beat Gemma 31B despite using nearly 2× the iterations.
+
+### Gemma 4 26B — reproducible silent stop
+
+Both runs followed an identical trajectory:
+
+1. `ls -R`
+2. `read_file README.md`
+3. `pytest` (exit=2)
+4. `PYTHONPATH=. pytest` → sees 3 failures
+5. `read_file calc/stats.py`
+6. **Empty response. `eval_count=4`. No tool calls. Loop terminates.**
+
+Zero writes. The model saw all the context it needed (failing tests + buggy source) and then silently declined to act.
+
+### Isolating the failure — one-shot probe
+
+To check whether 26B can produce the fix at all, I ran a single-turn call with no tool loop:
+
+```
+prompt: "The following function is buggy — median([1,2,3,4]) returns 3
+         but should return 2.5. Rewrite it correctly. [buggy code]"
+```
+
+Response (eval_count=81):
+
+```python
+def median(numbers):
+    s = sorted(numbers)
+    n = len(s)
+    if n % 2 == 1:
+        return s[n // 2]
+    else:
+        return (s[n // 2 - 1] + s[n // 2]) / 2
+```
+
+**Correct.** So 26B's diagnosis and code generation are intact. The failure is specifically at the **tool-call-boundary** — when the model needs to emit a `write_file(path, content)` call where the `content` argument is a several-hundred-character string, it aborts with eval=4 instead.
+
+This aligns with `GOTCHAS.md` § "Weak at Long/Nested JSON". A `write_file` tool call argument with a ~500-char string is structurally similar to a long JSON value. Gemma 4 31B handles the same surface reliably (eval=330 on that turn); the 26B MoE does not.
+
+## Interpretation
+
+### What this is evidence for
+
+- **Gemma 4 31B is a viable CLI-coding-agent backing model on this class of task.** Clean trace, minimal wasted turns, correct fix on first write.
+- **Qwen3-Coder 30B also works**, at the cost of more iterations and looser discipline. Diff quality was fine; agentic efficiency wasn't.
+- **Gemma 4 26B has a reproducible failure mode** at tool-call-argument emission. It can reason. It can code. It struggles to deliver code through a `write_file` tool call when the content is non-trivial.
+
+### What this is NOT evidence for
+
+- **This is not a representative benchmark.** n=1 per model. One task. One fix. One harness. Do not conclude "Gemma 4 26B is broken for coding agents" — conclude "Gemma 4 26B failed this specific setup reproducibly; investigate further before relying on it."
+- **This harness is not openclaw / open code / aider / pi / hermes.** Production agents wrap prompts, retries, and tool surfaces differently. The 26B failure may be avoided in a harness that:
+  - Uses a **patch/diff tool** (`apply_patch(old, new)`) instead of `write_file(full_content)` — smaller argument surface, matches the "sequential tool calls" pattern from `SYNTHESIS.md`
+  - Adds a **retry on empty response** (same as Simon's streaming-fallback pattern in `IMPLEMENTATIONS.md`)
+  - Provides fewer but richer tools (a dedicated `fix_file` that re-prompts internally)
+- **This compares agent behavior, not raw performance.** Wall clock is noisy (model load, context size, token rate all differ). Per-iteration latency is more meaningful — but that only matters for throughput, not correctness.
+
+### Recommendations
+
+1. **For a CLI coding agent on Seth's hardware:** start with `gemma4:31b-it-q4_K_M`. Clean behavior, modest wall clock (44s for a simple fix), no retry needed.
+2. **For comparison or backup:** `qwen3-coder:30b` is equally correct, roughly half the per-iteration cost, ~2× the iteration count. In a longer session those extra turns add up.
+3. **Do not default to `gemma4:26b` for this pattern.** Two tests in a row silent-stopped at the write boundary. If you want to use the 26B MoE (it's strong on `LiveCodeBench v6` at 77.1%), validate it against your specific agent framework first — especially whether the framework uses `write_file` (full content) or `apply_patch` (delta) as its edit primitive.
+4. **Test with the real harness you plan to use in production** (openclaw2, open code, etc.) before committing. A handful of this style of run takes minutes on the 3090 Ti and will tell you more than any benchmark card.
+
+## Honest caveats
+
+- **Stochasticity.** Only 26B was re-run. 31B and Qwen3-Coder might hit failure modes on a different seed or a different task. Temperature 0.3 is low but not zero.
+- **System prompt bias.** "Start by reading README.md" steered all three models similarly; a different prompt skeleton would produce different traces. I did not tune per model — deliberately — because a production agent won't either.
+- **The 26B silent-stop hypothesis (tool-arg emission failure) is inferred, not proven.** A clean confirmation would require running the same task with a smaller-surface edit tool (`apply_patch(path, old, new)` instead of `write_file(path, full_content)`) and showing 26B succeeds. That's the obvious follow-up.
+- **Ollama 0.20.4** is between the 0.20.0/0.20.1 known-broken-streaming range and whatever is current. Non-streaming tool calls worked cleanly for 31B and Qwen; 26B's failure looks model-specific, not Ollama-specific, but I didn't test on a different Ollama version.
+- **No openclaw / open code / aider runs.** Those are the frameworks named in the HF launch blog. This was a synthetic harness; transfer is plausible but unverified.
+
+## Artifacts
+
+- `scripts/bakeoff/harness.py` — the agent loop
+- `scripts/bakeoff/task_seed/` — the broken-code seed (reset between runs)
+- `scripts/bakeoff/runs/gemma4-26b/log.json` — full turn-by-turn trace
+- `scripts/bakeoff/runs/gemma4-26b-retry/log.json`
+- `scripts/bakeoff/runs/gemma4-31b/log.json`
+- `scripts/bakeoff/runs/qwen3-coder-30b/log.json`
+
+Each log records per-turn: content, tool calls, results (truncated to 800 chars), prompt/eval token counts, wall time. Final block records halt reason, pass/fail, iteration count, tool-call totals, total wall clock.
+
+## Reproducing
+
+```bash
+cd scripts/bakeoff
+python3 harness.py gemma4:31b-it-q4_K_M runs/gemma4-31b/work runs/gemma4-31b/log.json
+python3 harness.py qwen3-coder:30b runs/qwen3-coder-30b/work runs/qwen3-coder-30b/log.json
+python3 harness.py gemma4:26b runs/gemma4-26b/work runs/gemma4-26b/log.json
+```
+
+Each invocation resets the work directory from `task_seed/`, runs the loop, writes the log, and prints a one-line summary.