Round 2 tested the hypothesis that 26B's silent-stop was about write_file argument size. Result: refuted. - Patch-mode (apply_patch instead of write_file): 26B fails identically at iter 6. Tool-arg size is not the cause. - Truncation sweep on tool responses reveals the real trigger: cap at 800 or 1200 chars → 26B PASSES (1200-cap is 8.4s, fastest of any run). Cap at 1600, 2000, or unlimited → 26B silent-stops with eval=4. Revised understanding: 26B silent-stops when cumulative tool-response context crosses a shape threshold around 1200-1600 chars per response. Not a tool-arg bug, not a raw code-gen bug — 26B emits correct code fine in both one-shot and short-context settings. Production CLI agents (openclaw, open code, aider) typically truncate tool responses by default, so this failure may not surface in them. Custom harnesses should cap ≤1200 chars per tool response when targeting the 26B MoE. Updates GOTCHAS (rewritten entry with the truncation sweep table), SYNTHESIS model-selection row, CORPUS_cli_coding_agent.md pointer, docs/reference/bakeoff-2026-04-18.md with full Round 2 methodology and data. Adds harness_patch.py (apply_patch edit tool), harness_patch_truncated.py (env-configurable TOOL_RESULT_CAP), all 7 run logs, and a .secrets.baseline for detect-secrets false positives on JSON timestamps.
15 KiB
CLI Coding Agent Bakeoff — 2026-04-18
Empirical follow-up to
CORPUS_cli_coding_agent.md. Runs a minimal CLI coding agent loop against three candidate models on identical hardware and an identical broken-code task. n=1 per model (plus one re-run to check reproducibility of a failure). Treat as a smoke test, not a benchmark.
Setup
- Host: steel141 (Seth's local box)
- GPU: NVIDIA RTX 3090 Ti, 24 GiB, ~22.7 GiB free
- Ollama: 0.20.4
- Harness:
scripts/bakeoff/harness.py— custom minimal agent loop, not openclaw / open code / aider / pi / hermes. Protocol: Ollama/api/chatwithtools=[read_file, write_file, run_bash], non-streaming,think: false,num_ctx: 32768,num_predict: 4096,temperature: 0.3. Iteration cap = 15. - Task:
scripts/bakeoff/task_seed/— Python package with buggymedian()function. 3 of 7 pytest tests fail on even-length inputs. Fix is ~5 lines. - System prompt: generic CLI-agent template (identity + allowed tools + rules: "never modify tests", "prefer minimal edits"). Not tuned per model.
All three models pulled from steel141's local Ollama, swapped in/out of GPU as each run started. First iteration per run pays the load cost; later iterations are hot.
Results
| Model | Pass | Iterations | write_file | read_file | run_bash | Wall clock | Halt reason |
|---|---|---|---|---|---|---|---|
gemma4:26b |
Fail | 6 | 0 | 2 | 3 | 10.9s | no_tool_calls (silent empty response) |
gemma4:26b (retry) |
Fail | 6 | 0 | 2 | 3 | 11.4s | no_tool_calls (reproduces exactly) |
gemma4:31b-it-q4_K_M |
Pass | 8 | 1 | 2 | 4 | 44.1s | no_tool_calls (clean summary turn) |
qwen3-coder:30b |
Pass | 15 (cap) | 1 | 4 | 8 | 22.6s | no_tool_calls (at iteration cap) |
Gemma 4 31B — clean run
Textbook agent trace:
read_file README.mdpytest(exit=2, module not found — pytest needs PYTHONPATH)ls -RPYTHONPATH=. pytest→ sees 3 failuresread_file calc/stats.pywrite_file calc/stats.py(eval_count=330, 13.4s) — correct fixPYTHONPATH=. pytest→ all green- summary: "I updated the
medianfunction incalc/stats.pyto correctly calculate the average of the two middle elements..."
Zero wasted turns. One write. Minimal edit.
Qwen3-Coder 30B — correct but chatty
Passed, but used all 15 iterations:
- Narrated every step ("I'll help you...", "Now let's look at...")
- Tried to read a non-existent file (
test_calc.py) — wasted iter 2 - Tried to
read_fileon a directory (calc) — wasted iter 6 - Ran several redundant bash calls (
pwd && pytest, etc.) - Emitted a ceremonial
echo "All tests pass..."bash call at iter 14 - Final turn was a polite summary
The fix itself (iter 12) was correct on the first write. Quality is fine; efficiency isn't. Per-iteration it was fast (many 20-40 token turns) — total wall clock 22.6s beat Gemma 31B despite using nearly 2× the iterations.
Gemma 4 26B — reproducible silent stop
Both runs followed an identical trajectory:
ls -Rread_file README.mdpytest(exit=2)PYTHONPATH=. pytest→ sees 3 failuresread_file calc/stats.py- Empty response.
eval_count=4. No tool calls. Loop terminates.
Zero writes. The model saw all the context it needed (failing tests + buggy source) and then silently declined to act.
Isolating the failure — one-shot probe
To check whether 26B can produce the fix at all, I ran a single-turn call with no tool loop:
prompt: "The following function is buggy — median([1,2,3,4]) returns 3
but should return 2.5. Rewrite it correctly. [buggy code]"
Response (eval_count=81):
def median(numbers):
s = sorted(numbers)
n = len(s)
if n % 2 == 1:
return s[n // 2]
else:
return (s[n // 2 - 1] + s[n // 2]) / 2
Correct. So 26B's diagnosis and code generation are intact. The failure is specifically at the tool-call-boundary — when the model needs to emit a write_file(path, content) call where the content argument is a several-hundred-character string, it aborts with eval=4 instead.
This aligns with GOTCHAS.md § "Weak at Long/Nested JSON". A write_file tool call argument with a ~500-char string is structurally similar to a long JSON value. Gemma 4 31B handles the same surface reliably (eval=330 on that turn); the 26B MoE does not.
Interpretation
What this is evidence for
- Gemma 4 31B is a viable CLI-coding-agent backing model on this class of task. Clean trace, minimal wasted turns, correct fix on first write.
- Qwen3-Coder 30B also works, at the cost of more iterations and looser discipline. Diff quality was fine; agentic efficiency wasn't.
- Gemma 4 26B has a reproducible failure mode at tool-call-argument emission. It can reason. It can code. It struggles to deliver code through a
write_filetool call when the content is non-trivial.
What this is NOT evidence for
- This is not a representative benchmark. n=1 per model. One task. One fix. One harness. Do not conclude "Gemma 4 26B is broken for coding agents" — conclude "Gemma 4 26B failed this specific setup reproducibly; investigate further before relying on it."
- This harness is not openclaw / open code / aider / pi / hermes. Production agents wrap prompts, retries, and tool surfaces differently. The 26B failure may be avoided in a harness that:
- Uses a patch/diff tool (
apply_patch(old, new)) instead ofwrite_file(full_content)— smaller argument surface, matches the "sequential tool calls" pattern fromSYNTHESIS.md - Adds a retry on empty response (same as Simon's streaming-fallback pattern in
IMPLEMENTATIONS.md) - Provides fewer but richer tools (a dedicated
fix_filethat re-prompts internally)
- Uses a patch/diff tool (
- This compares agent behavior, not raw performance. Wall clock is noisy (model load, context size, token rate all differ). Per-iteration latency is more meaningful — but that only matters for throughput, not correctness.
Recommendations
- For a CLI coding agent on Seth's hardware: start with
gemma4:31b-it-q4_K_M. Clean behavior, modest wall clock (44s for a simple fix), no retry needed. - For comparison or backup:
qwen3-coder:30bis equally correct, roughly half the per-iteration cost, ~2× the iteration count. In a longer session those extra turns add up. - Do not default to
gemma4:26bfor this pattern. Two tests in a row silent-stopped at the write boundary. If you want to use the 26B MoE (it's strong onLiveCodeBench v6at 77.1%), validate it against your specific agent framework first — especially whether the framework useswrite_file(full content) orapply_patch(delta) as its edit primitive. - Test with the real harness you plan to use in production (openclaw2, open code, etc.) before committing. A handful of this style of run takes minutes on the 3090 Ti and will tell you more than any benchmark card.
Honest caveats
- Stochasticity. Only 26B was re-run. 31B and Qwen3-Coder might hit failure modes on a different seed or a different task. Temperature 0.3 is low but not zero.
- System prompt bias. "Start by reading README.md" steered all three models similarly; a different prompt skeleton would produce different traces. I did not tune per model — deliberately — because a production agent won't either.
- The 26B silent-stop hypothesis (tool-arg emission failure) is inferred, not proven. A clean confirmation would require running the same task with a smaller-surface edit tool (
apply_patch(path, old, new)instead ofwrite_file(path, full_content)) and showing 26B succeeds. That's the obvious follow-up. - Ollama 0.20.4 is between the 0.20.0/0.20.1 known-broken-streaming range and whatever is current. Non-streaming tool calls worked cleanly for 31B and Qwen; 26B's failure looks model-specific, not Ollama-specific, but I didn't test on a different Ollama version.
- No openclaw / open code / aider runs. Those are the frameworks named in the HF launch blog. This was a synthetic harness; transfer is plausible but unverified.
Artifacts
scripts/bakeoff/harness.py— the agent loopscripts/bakeoff/task_seed/— the broken-code seed (reset between runs)scripts/bakeoff/runs/gemma4-26b/log.json— full turn-by-turn tracescripts/bakeoff/runs/gemma4-26b-retry/log.jsonscripts/bakeoff/runs/gemma4-31b/log.jsonscripts/bakeoff/runs/qwen3-coder-30b/log.json
Each log records per-turn: content, tool calls, results (truncated to 800 chars), prompt/eval token counts, wall time. Final block records halt reason, pass/fail, iteration count, tool-call totals, total wall clock.
Reproducing
cd scripts/bakeoff
python3 harness.py gemma4:31b-it-q4_K_M runs/gemma4-31b/work runs/gemma4-31b/log.json
python3 harness.py qwen3-coder:30b runs/qwen3-coder-30b/work runs/qwen3-coder-30b/log.json
python3 harness.py gemma4:26b runs/gemma4-26b/work runs/gemma4-26b/log.json
Each invocation resets the work directory from task_seed/, runs the loop, writes the log, and prints a one-line summary.
Round 2 — isolating the 26B silent-stop
After Round 1 I hypothesized the 26B failure was about long write_file(path, full_content) tool arguments. Round 2 tests that.
What was tested
- Patch-mode harness (
harness_patch.py) — identical to the original but swapswrite_file(path, content)forapply_patch(path, old_text, new_text). Arguments are a small delta (~100-200 chars), not the full file. - Truncation-mode harness (
harness_patch_truncated.py) — same as patch-mode, but caps every tool response toTOOL_RESULT_CAPchars (env-configurable) before returning it to the model.
All else identical: same task, same system prompt, same Ollama settings, same 3090 Ti on steel141.
Results
Round 2a — patch-mode (small edit tool arguments)
| Model | Pass | Iters | patches | reads | bashes | Wall |
|---|---|---|---|---|---|---|
gemma4:31b-it-q4_K_M |
✓ | 8 | 1 | 2 | 4 | 37s |
qwen3-coder:30b |
✓ | 14 | 1 | 3 | 9 | 22s |
gemma4:26b |
✗ | 6 | 0 | 2 | 3 | 8s |
Hypothesis refuted. 26B fails identically on patch-mode: 6 iters, silent stop at iter 6 with eval=4, zero edits. The tool-call argument size is not the trigger.
Round 2b — tool-result truncation cap
Ran 26B through patch-mode with progressively smaller caps on each tool response:
| TOOL_RESULT_CAP | 26B Pass | Halt turn | prompt_eval at halt | eval_count at halt |
|---|---|---|---|---|
| 800 | ✓ | iter 15 (cap) | 3741 | 24 |
| 1200 | ✓ | iter 8 | 2294 | 27 |
| 1600 | ✗ | iter 6 | 2070 | 4 |
| 2000 | ✗ | iter 6 | 2157 | 4 |
| unlimited | ✗ | iter 6 | 2139 | 4 |
Sharp transition between 1200 and 1600. Below the line, 26B generates code (eval_count=165 on the patch turn). Above the line, eval_count=4 — effectively an EOS.
The trigger is cumulative tool-response context shape, not total tokens. The 800-cap run continued reasoning past 3741 prompt tokens without issue. The failing runs all halt at ~2070-2150 tokens — but the 1200-cap run crossed that same range (2076 at iter 7) and kept going. So "N tokens" isn't the cause — the recent-context pattern (large tool responses accumulated over 5 iterations) is.
Bonus observation: 26B at 1200-cap is the fastest passing configuration
| Run | Iters | Wall clock |
|---|---|---|
| 26B @ 1200-cap | 8 | 8.4s |
| 31B @ patch | 8 | 37s |
| Qwen3-Coder @ patch | 14 | 22s |
Same task, same correct fix. 26B's MoE (3.8B active params) is ~5× faster than 31B dense when it cooperates.
Revised interpretation
- Not "26B is broken for CLI coding agents."
- Not "long tool-call arguments break 26B."
- Yes: "26B silent-stops when the cumulative tool-response context crosses a certain shape/size threshold, at the decision-to-edit boundary." Observed threshold here: per-tool-response cap somewhere between 1200 and 1600 chars, on this task / this Ollama version / this model variant.
- The mitigation is standard. Every production CLI agent (openclaw, open code, aider, cline, continue) truncates tool responses — this is table stakes, not exotic. 26B's "failure mode" is likely already mitigated in those frameworks. What my default harness did (pass full 4-6KB pytest outputs verbatim) is probably not what those frameworks do.
- Exact mechanism is unproven. I'm observing behavior, not internals. Could be MoE expert routing, could be chat-template edge case, could be some interaction with the tool-call channel tokens. Finding the root cause would require model instrumentation beyond this scope.
Revised recommendation
- Default to
gemma4:31b-it-q4_K_Mfor general CLI coding agent use. Robust to long tool responses, no mitigation needed. - Use
gemma4:26bif you care about latency AND your agent framework truncates tool responses (most do). 5× faster than 31B when it works. - Verify by re-running against your actual agent framework. Don't trust this harness as a proxy — it's a diagnostic, not a production test.
- If you're writing a custom agent and targeting 26B, cap tool responses aggressively (≤1200 chars per response worked here; ≤800 is safer). pytest output in particular benefits from
--tb=lineor-xto shrink it.
Artifacts (Round 2)
scripts/bakeoff/harness_patch.py— patch-mode harnessscripts/bakeoff/harness_patch_truncated.py— truncation-mode harness (env varTOOL_RESULT_CAP)scripts/bakeoff/runs_patch/gemma4-26b/log.json— patch mode, unlimited (fails)scripts/bakeoff/runs_patch/gemma4-26b-truncated/log.json— cap=800 (passes)scripts/bakeoff/runs_patch/gemma4-26b-cap1200/log.json— cap=1200 (passes)scripts/bakeoff/runs_patch/gemma4-26b-cap1600/log.json— cap=1600 (fails)scripts/bakeoff/runs_patch/gemma4-26b-cap2000/log.json— cap=2000 (fails)scripts/bakeoff/runs_patch/gemma4-31b/log.json— patch mode, passes (control)scripts/bakeoff/runs_patch/qwen3-coder-30b/log.json— patch mode, passes (control)
Reproducing Round 2
cd scripts/bakeoff
# Patch-mode baseline (3 models)
python3 harness_patch.py gemma4:31b-it-q4_K_M runs_patch/gemma4-31b/work runs_patch/gemma4-31b/log.json
python3 harness_patch.py qwen3-coder:30b runs_patch/qwen3-coder-30b/work runs_patch/qwen3-coder-30b/log.json
python3 harness_patch.py gemma4:26b runs_patch/gemma4-26b/work runs_patch/gemma4-26b/log.json
# Truncation sweep on 26B
TOOL_RESULT_CAP=800 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-truncated/work runs_patch/gemma4-26b-truncated/log.json
TOOL_RESULT_CAP=1200 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap1200/work runs_patch/gemma4-26b-cap1200/log.json
TOOL_RESULT_CAP=1600 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap1600/work runs_patch/gemma4-26b-cap1600/log.json
TOOL_RESULT_CAP=2000 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap2000/work runs_patch/gemma4-26b-cap2000/log.json