c61394923c
Seth asked "was this with think=false?" Yes — and that was the only question that mattered. Everything I concluded in round 1 and round 2 was wrong. Actual cause, isolated in round 3: - At identical message state, gemma4:26b with think=false returns eval=4 (silent stop); with think unset or think=true, returns eval=165 and emits the correct tool call. - Original round-1 write_file harness + think unset: 26B passes in 8 iters, 20s. No mitigations needed. - 31B dense and qwen3-coder:30b tolerate think=false; 26B MoE does not. Red herrings (kept on-record in the bakeoff doc, not silently erased): - Round 1: "write_file tool-call argument size" — wrong - Round 2a: refuted the arg-size theory but for the wrong reason (still failed because think=false was still set) - Round 2b: "cumulative tool-response context size" — truncating did make 26B pass, but by coincidence. Shorter context at the decision turn dodged the think=false side effect. Why the existing "always think:false" guidance was misleading: it was derived from AI_Visualizer (single-turn JSON pipelines) where thinking tokens do eat num_predict invisibly. In multi-turn tool-calling agents the channels are separate and the flag has a different effect — catastrophic on 26B specifically. Doc updates: - GOTCHAS: replaced the 26B entry with the actual cause; scoped the original "Thinking Mode Eats Context" entry to single-turn pipelines - SYNTHESIS: split the "Mandatory Ollama Settings" block into single-turn vs multi-turn variants; updated anti-patterns and quick-start checklist - CORPUS_cli_coding_agent.md: revised pointer and config template - docs/reference/bakeoff-2026-04-18.md: added Round 3 section with the correction notice at the top of the file and full diagnostic methodology New artifacts: harness_no_think_flag.py, harness_write_no_think.py, and 4 new log files demonstrating all three models pass when think is left at default.
383 lines
22 KiB
Markdown
383 lines
22 KiB
Markdown
# CLI Coding Agent Bakeoff — 2026-04-18
|
||
|
||
> Empirical follow-up to `CORPUS_cli_coding_agent.md`. Runs a minimal CLI coding
|
||
> agent loop against three candidate models on identical hardware and an
|
||
> identical broken-code task. **n=1 per model** (plus one re-run to check
|
||
> reproducibility of a failure). Treat as a smoke test, not a benchmark.
|
||
|
||
> **Correction notice (Round 3):** Rounds 1 and 2 both misidentified the cause
|
||
> of Gemma 4 26B's silent-stop failure. Round 1 blamed `write_file` tool-call
|
||
> argument size. Round 2 blamed tool-response context size. **Round 3 proves
|
||
> both wrong: the actual cause is the `think: false` Ollama flag.** Remove the
|
||
> flag and 26B passes on the original Round 1 harness unmodified. Kept the
|
||
> failed hypotheses below as-recorded — Seth asked "was this with
|
||
> think=false?" and the answer exposed the confounder. Never presented as Plan A.
|
||
|
||
## Setup
|
||
|
||
- **Host:** steel141 (Seth's local box)
|
||
- **GPU:** NVIDIA RTX 3090 Ti, 24 GiB, ~22.7 GiB free
|
||
- **Ollama:** 0.20.4
|
||
- **Harness:** `scripts/bakeoff/harness.py` — custom minimal agent loop, **not** openclaw / open code / aider / pi / hermes. Protocol: Ollama `/api/chat` with `tools=[read_file, write_file, run_bash]`, non-streaming, `think: false`, `num_ctx: 32768`, `num_predict: 4096`, `temperature: 0.3`. Iteration cap = 15.
|
||
- **Task:** `scripts/bakeoff/task_seed/` — Python package with buggy `median()` function. 3 of 7 pytest tests fail on even-length inputs. Fix is ~5 lines.
|
||
- **System prompt:** generic CLI-agent template (identity + allowed tools + rules: "never modify tests", "prefer minimal edits"). Not tuned per model.
|
||
|
||
All three models pulled from steel141's local Ollama, swapped in/out of GPU as each run started. First iteration per run pays the load cost; later iterations are hot.
|
||
|
||
## Results
|
||
|
||
| Model | Pass | Iterations | write_file | read_file | run_bash | Wall clock | Halt reason |
|
||
|---|---|---|---|---|---|---|---|
|
||
| `gemma4:26b` | **Fail** | 6 | **0** | 2 | 3 | 10.9s | `no_tool_calls` (silent empty response) |
|
||
| `gemma4:26b` (retry) | **Fail** | 6 | **0** | 2 | 3 | 11.4s | `no_tool_calls` (reproduces exactly) |
|
||
| `gemma4:31b-it-q4_K_M` | **Pass** | 8 | 1 | 2 | 4 | 44.1s | `no_tool_calls` (clean summary turn) |
|
||
| `qwen3-coder:30b` | **Pass** | 15 (cap) | 1 | 4 | 8 | 22.6s | `no_tool_calls` (at iteration cap) |
|
||
|
||
### Gemma 4 31B — clean run
|
||
|
||
Textbook agent trace:
|
||
|
||
1. `read_file README.md`
|
||
2. `pytest` (exit=2, module not found — pytest needs PYTHONPATH)
|
||
3. `ls -R`
|
||
4. `PYTHONPATH=. pytest` → sees 3 failures
|
||
5. `read_file calc/stats.py`
|
||
6. `write_file calc/stats.py` (eval_count=330, 13.4s) — correct fix
|
||
7. `PYTHONPATH=. pytest` → all green
|
||
8. summary: *"I updated the `median` function in `calc/stats.py` to correctly calculate the average of the two middle elements..."*
|
||
|
||
Zero wasted turns. One write. Minimal edit.
|
||
|
||
### Qwen3-Coder 30B — correct but chatty
|
||
|
||
Passed, but used all 15 iterations:
|
||
|
||
- Narrated every step ("I'll help you...", "Now let's look at...")
|
||
- Tried to read a non-existent file (`test_calc.py`) — wasted iter 2
|
||
- Tried to `read_file` on a directory (`calc`) — wasted iter 6
|
||
- Ran several redundant bash calls (`pwd && pytest`, etc.)
|
||
- Emitted a ceremonial `echo "All tests pass..."` bash call at iter 14
|
||
- Final turn was a polite summary
|
||
|
||
The fix itself (iter 12) was correct on the first write. Quality is fine; efficiency isn't. Per-iteration it was fast (many 20-40 token turns) — total wall clock 22.6s beat Gemma 31B despite using nearly 2× the iterations.
|
||
|
||
### Gemma 4 26B — reproducible silent stop
|
||
|
||
Both runs followed an identical trajectory:
|
||
|
||
1. `ls -R`
|
||
2. `read_file README.md`
|
||
3. `pytest` (exit=2)
|
||
4. `PYTHONPATH=. pytest` → sees 3 failures
|
||
5. `read_file calc/stats.py`
|
||
6. **Empty response. `eval_count=4`. No tool calls. Loop terminates.**
|
||
|
||
Zero writes. The model saw all the context it needed (failing tests + buggy source) and then silently declined to act.
|
||
|
||
### Isolating the failure — one-shot probe
|
||
|
||
To check whether 26B can produce the fix at all, I ran a single-turn call with no tool loop:
|
||
|
||
```
|
||
prompt: "The following function is buggy — median([1,2,3,4]) returns 3
|
||
but should return 2.5. Rewrite it correctly. [buggy code]"
|
||
```
|
||
|
||
Response (eval_count=81):
|
||
|
||
```python
|
||
def median(numbers):
|
||
s = sorted(numbers)
|
||
n = len(s)
|
||
if n % 2 == 1:
|
||
return s[n // 2]
|
||
else:
|
||
return (s[n // 2 - 1] + s[n // 2]) / 2
|
||
```
|
||
|
||
**Correct.** So 26B's diagnosis and code generation are intact. The failure is specifically at the **tool-call-boundary** — when the model needs to emit a `write_file(path, content)` call where the `content` argument is a several-hundred-character string, it aborts with eval=4 instead.
|
||
|
||
This aligns with `GOTCHAS.md` § "Weak at Long/Nested JSON". A `write_file` tool call argument with a ~500-char string is structurally similar to a long JSON value. Gemma 4 31B handles the same surface reliably (eval=330 on that turn); the 26B MoE does not.
|
||
|
||
## Interpretation
|
||
|
||
### What this is evidence for
|
||
|
||
- **Gemma 4 31B is a viable CLI-coding-agent backing model on this class of task.** Clean trace, minimal wasted turns, correct fix on first write.
|
||
- **Qwen3-Coder 30B also works**, at the cost of more iterations and looser discipline. Diff quality was fine; agentic efficiency wasn't.
|
||
- **Gemma 4 26B has a reproducible failure mode** at tool-call-argument emission. It can reason. It can code. It struggles to deliver code through a `write_file` tool call when the content is non-trivial.
|
||
|
||
### What this is NOT evidence for
|
||
|
||
- **This is not a representative benchmark.** n=1 per model. One task. One fix. One harness. Do not conclude "Gemma 4 26B is broken for coding agents" — conclude "Gemma 4 26B failed this specific setup reproducibly; investigate further before relying on it."
|
||
- **This harness is not openclaw / open code / aider / pi / hermes.** Production agents wrap prompts, retries, and tool surfaces differently. The 26B failure may be avoided in a harness that:
|
||
- Uses a **patch/diff tool** (`apply_patch(old, new)`) instead of `write_file(full_content)` — smaller argument surface, matches the "sequential tool calls" pattern from `SYNTHESIS.md`
|
||
- Adds a **retry on empty response** (same as Simon's streaming-fallback pattern in `IMPLEMENTATIONS.md`)
|
||
- Provides fewer but richer tools (a dedicated `fix_file` that re-prompts internally)
|
||
- **This compares agent behavior, not raw performance.** Wall clock is noisy (model load, context size, token rate all differ). Per-iteration latency is more meaningful — but that only matters for throughput, not correctness.
|
||
|
||
### Recommendations
|
||
|
||
1. **For a CLI coding agent on Seth's hardware:** start with `gemma4:31b-it-q4_K_M`. Clean behavior, modest wall clock (44s for a simple fix), no retry needed.
|
||
2. **For comparison or backup:** `qwen3-coder:30b` is equally correct, roughly half the per-iteration cost, ~2× the iteration count. In a longer session those extra turns add up.
|
||
3. **Do not default to `gemma4:26b` for this pattern.** Two tests in a row silent-stopped at the write boundary. If you want to use the 26B MoE (it's strong on `LiveCodeBench v6` at 77.1%), validate it against your specific agent framework first — especially whether the framework uses `write_file` (full content) or `apply_patch` (delta) as its edit primitive.
|
||
4. **Test with the real harness you plan to use in production** (openclaw2, open code, etc.) before committing. A handful of this style of run takes minutes on the 3090 Ti and will tell you more than any benchmark card.
|
||
|
||
## Honest caveats
|
||
|
||
- **Stochasticity.** Only 26B was re-run. 31B and Qwen3-Coder might hit failure modes on a different seed or a different task. Temperature 0.3 is low but not zero.
|
||
- **System prompt bias.** "Start by reading README.md" steered all three models similarly; a different prompt skeleton would produce different traces. I did not tune per model — deliberately — because a production agent won't either.
|
||
- **The 26B silent-stop hypothesis (tool-arg emission failure) is inferred, not proven.** A clean confirmation would require running the same task with a smaller-surface edit tool (`apply_patch(path, old, new)` instead of `write_file(path, full_content)`) and showing 26B succeeds. That's the obvious follow-up.
|
||
- **Ollama 0.20.4** is between the 0.20.0/0.20.1 known-broken-streaming range and whatever is current. Non-streaming tool calls worked cleanly for 31B and Qwen; 26B's failure looks model-specific, not Ollama-specific, but I didn't test on a different Ollama version.
|
||
- **No openclaw / open code / aider runs.** Those are the frameworks named in the HF launch blog. This was a synthetic harness; transfer is plausible but unverified.
|
||
|
||
## Artifacts
|
||
|
||
- `scripts/bakeoff/harness.py` — the agent loop
|
||
- `scripts/bakeoff/task_seed/` — the broken-code seed (reset between runs)
|
||
- `scripts/bakeoff/runs/gemma4-26b/log.json` — full turn-by-turn trace
|
||
- `scripts/bakeoff/runs/gemma4-26b-retry/log.json`
|
||
- `scripts/bakeoff/runs/gemma4-31b/log.json`
|
||
- `scripts/bakeoff/runs/qwen3-coder-30b/log.json`
|
||
|
||
Each log records per-turn: content, tool calls, results (truncated to 800 chars), prompt/eval token counts, wall time. Final block records halt reason, pass/fail, iteration count, tool-call totals, total wall clock.
|
||
|
||
## Reproducing
|
||
|
||
```bash
|
||
cd scripts/bakeoff
|
||
python3 harness.py gemma4:31b-it-q4_K_M runs/gemma4-31b/work runs/gemma4-31b/log.json
|
||
python3 harness.py qwen3-coder:30b runs/qwen3-coder-30b/work runs/qwen3-coder-30b/log.json
|
||
python3 harness.py gemma4:26b runs/gemma4-26b/work runs/gemma4-26b/log.json
|
||
```
|
||
|
||
Each invocation resets the work directory from `task_seed/`, runs the loop, writes the log, and prints a one-line summary.
|
||
|
||
---
|
||
|
||
# Round 2 — isolating the 26B silent-stop
|
||
|
||
After Round 1 I hypothesized the 26B failure was about long `write_file(path, full_content)` tool arguments. Round 2 tests that.
|
||
|
||
## What was tested
|
||
|
||
1. **Patch-mode harness** (`harness_patch.py`) — identical to the original but swaps `write_file(path, content)` for `apply_patch(path, old_text, new_text)`. Arguments are a small delta (~100-200 chars), not the full file.
|
||
2. **Truncation-mode harness** (`harness_patch_truncated.py`) — same as patch-mode, but caps every tool response to `TOOL_RESULT_CAP` chars (env-configurable) before returning it to the model.
|
||
|
||
All else identical: same task, same system prompt, same Ollama settings, same 3090 Ti on steel141.
|
||
|
||
## Results
|
||
|
||
### Round 2a — patch-mode (small edit tool arguments)
|
||
|
||
| Model | Pass | Iters | patches | reads | bashes | Wall |
|
||
|---|---|---|---|---|---|---|
|
||
| `gemma4:31b-it-q4_K_M` | ✓ | 8 | 1 | 2 | 4 | 37s |
|
||
| `qwen3-coder:30b` | ✓ | 14 | 1 | 3 | 9 | 22s |
|
||
| `gemma4:26b` | **✗** | 6 | **0** | 2 | 3 | 8s |
|
||
|
||
**Hypothesis refuted.** 26B fails identically on patch-mode: 6 iters, silent stop at iter 6 with eval=4, zero edits. The tool-call **argument size is not the trigger.**
|
||
|
||
### Round 2b — tool-result truncation cap
|
||
|
||
Ran 26B through patch-mode with progressively smaller caps on each tool response:
|
||
|
||
| TOOL_RESULT_CAP | 26B Pass | Halt turn | prompt_eval at halt | eval_count at halt |
|
||
|---|---|---|---|---|
|
||
| **800** | ✓ | iter 15 (cap) | 3741 | 24 |
|
||
| **1200** | ✓ | iter 8 | 2294 | 27 |
|
||
| **1600** | ✗ | iter 6 | 2070 | **4** |
|
||
| **2000** | ✗ | iter 6 | 2157 | **4** |
|
||
| **unlimited** | ✗ | iter 6 | 2139 | **4** |
|
||
|
||
Sharp transition between 1200 and 1600. Below the line, 26B generates code (`eval_count=165` on the patch turn). Above the line, `eval_count=4` — effectively an EOS.
|
||
|
||
**The trigger is cumulative tool-response context shape, not total tokens.** The 800-cap run continued reasoning past 3741 prompt tokens without issue. The failing runs all halt at ~2070-2150 tokens — but the 1200-cap run crossed that same range (2076 at iter 7) and kept going. So "N tokens" isn't the cause — the recent-context pattern (large tool responses accumulated over 5 iterations) is.
|
||
|
||
### Bonus observation: 26B at 1200-cap is the fastest passing configuration
|
||
|
||
| Run | Iters | Wall clock |
|
||
|---|---|---|
|
||
| 26B @ 1200-cap | 8 | **8.4s** |
|
||
| 31B @ patch | 8 | 37s |
|
||
| Qwen3-Coder @ patch | 14 | 22s |
|
||
|
||
Same task, same correct fix. 26B's MoE (3.8B active params) is ~5× faster than 31B dense when it cooperates.
|
||
|
||
## Revised interpretation
|
||
|
||
- **Not "26B is broken for CLI coding agents."**
|
||
- **Not "long tool-call arguments break 26B."**
|
||
- **Yes: "26B silent-stops when the cumulative tool-response context crosses a certain shape/size threshold, at the decision-to-edit boundary."** Observed threshold here: per-tool-response cap somewhere between 1200 and 1600 chars, on this task / this Ollama version / this model variant.
|
||
- **The mitigation is standard.** Every production CLI agent (openclaw, open code, aider, cline, continue) truncates tool responses — this is table stakes, not exotic. 26B's "failure mode" is likely *already mitigated* in those frameworks. What my default harness did (pass full 4-6KB pytest outputs verbatim) is probably not what those frameworks do.
|
||
- **Exact mechanism is unproven.** I'm observing behavior, not internals. Could be MoE expert routing, could be chat-template edge case, could be some interaction with the tool-call channel tokens. Finding the root cause would require model instrumentation beyond this scope.
|
||
|
||
## Revised recommendation
|
||
|
||
1. **Default to `gemma4:31b-it-q4_K_M`** for general CLI coding agent use. Robust to long tool responses, no mitigation needed.
|
||
2. **Use `gemma4:26b`** if you care about latency AND your agent framework truncates tool responses (most do). 5× faster than 31B when it works.
|
||
3. **Verify by re-running against your actual agent framework.** Don't trust this harness as a proxy — it's a diagnostic, not a production test.
|
||
4. **If you're writing a custom agent and targeting 26B**, cap tool responses aggressively (≤1200 chars per response worked here; ≤800 is safer). pytest output in particular benefits from `--tb=line` or `-x` to shrink it.
|
||
|
||
## Artifacts (Round 2)
|
||
|
||
- `scripts/bakeoff/harness_patch.py` — patch-mode harness
|
||
- `scripts/bakeoff/harness_patch_truncated.py` — truncation-mode harness (env var `TOOL_RESULT_CAP`)
|
||
- `scripts/bakeoff/runs_patch/gemma4-26b/log.json` — patch mode, unlimited (fails)
|
||
- `scripts/bakeoff/runs_patch/gemma4-26b-truncated/log.json` — cap=800 (passes)
|
||
- `scripts/bakeoff/runs_patch/gemma4-26b-cap1200/log.json` — cap=1200 (passes)
|
||
- `scripts/bakeoff/runs_patch/gemma4-26b-cap1600/log.json` — cap=1600 (fails)
|
||
- `scripts/bakeoff/runs_patch/gemma4-26b-cap2000/log.json` — cap=2000 (fails)
|
||
- `scripts/bakeoff/runs_patch/gemma4-31b/log.json` — patch mode, passes (control)
|
||
- `scripts/bakeoff/runs_patch/qwen3-coder-30b/log.json` — patch mode, passes (control)
|
||
|
||
## Reproducing Round 2
|
||
|
||
```bash
|
||
cd scripts/bakeoff
|
||
|
||
# Patch-mode baseline (3 models)
|
||
python3 harness_patch.py gemma4:31b-it-q4_K_M runs_patch/gemma4-31b/work runs_patch/gemma4-31b/log.json
|
||
python3 harness_patch.py qwen3-coder:30b runs_patch/qwen3-coder-30b/work runs_patch/qwen3-coder-30b/log.json
|
||
python3 harness_patch.py gemma4:26b runs_patch/gemma4-26b/work runs_patch/gemma4-26b/log.json
|
||
|
||
# Truncation sweep on 26B
|
||
TOOL_RESULT_CAP=800 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-truncated/work runs_patch/gemma4-26b-truncated/log.json
|
||
TOOL_RESULT_CAP=1200 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap1200/work runs_patch/gemma4-26b-cap1200/log.json
|
||
TOOL_RESULT_CAP=1600 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap1600/work runs_patch/gemma4-26b-cap1600/log.json
|
||
TOOL_RESULT_CAP=2000 python3 harness_patch_truncated.py gemma4:26b runs_patch/gemma4-26b-cap2000/work runs_patch/gemma4-26b-cap2000/log.json
|
||
```
|
||
|
||
---
|
||
|
||
# Round 3 — the actual cause: `think: false`
|
||
|
||
Seth asked "was this with think=false?" That was the only question that mattered.
|
||
|
||
## The question that unstuck it
|
||
|
||
Every harness in Round 1 and Round 2 set `"think": False` in the Ollama payload —
|
||
per existing guidance in `GOTCHAS.md`: "Always pass `think: false` in the Ollama
|
||
payload. Seth has had success ONLY with thinking off." I copied that to the
|
||
harnesses without testing whether it was the right choice for a multi-turn
|
||
tool-calling agent loop (as opposed to the single-turn JSON pipeline that
|
||
guidance came from).
|
||
|
||
## The diagnostic
|
||
|
||
Replayed the exact 5-iteration failing state to `gemma4:26b` three times with
|
||
three think settings, same message history, same tool definitions:
|
||
|
||
| `think` setting | `eval_count` | tool call emitted? |
|
||
|---|---|---|
|
||
| `false` (my harness) | **4** | ✗ |
|
||
| unset (Ollama default) | 165 | ✓ `apply_patch` |
|
||
| `true` | 165 | ✓ `apply_patch` |
|
||
|
||
Sharp, reproducible. `think: false` → silent stop. Anything else → works.
|
||
|
||
## Round 3 runs — unlimited tool responses, think flag removed
|
||
|
||
| Harness | Model | Pass | Iters | Wall |
|
||
|---|---|---|---|---|
|
||
| `write_file` (Round-1 harness, think unset) | `gemma4:26b` | **✓** | 8 | 20.6s |
|
||
| `apply_patch` (Round-2a harness, think unset) | `gemma4:26b` | **✓** | 8 | 12.5s |
|
||
| `write_file`, think unset | `gemma4:31b-it-q4_K_M` | ✓ | 8 | — |
|
||
| `apply_patch`, think unset | `gemma4:31b-it-q4_K_M` | ✓ | 8 | 66.4s |
|
||
| `apply_patch`, think unset | `qwen3-coder:30b` | ✓ | 11 | 19.5s |
|
||
|
||
**26B passes cleanly on the unmodified Round 1 harness once the think flag is
|
||
removed.** No truncation, no patch-tool swap, no mitigations.
|
||
|
||
The 31B / Qwen runs confirm the flag doesn't matter for those models (pass either
|
||
way). 31B is visibly slower without the think flag (66s vs 37s) — likely
|
||
because it's actually generating hidden thinking now — but it still completes.
|
||
|
||
## What Rounds 1 and 2 got wrong
|
||
|
||
### Round 1 (wrong): "26B silent-stops at the write_file tool-call argument boundary"
|
||
|
||
The write_file tool was present. 26B failed. But 26B also fails with
|
||
`apply_patch` (Round 2a) and passes with `write_file` when think is unset
|
||
(Round 3). The tool surface was not the cause.
|
||
|
||
### Round 2a (wrong): "Refuted the write_file hypothesis"
|
||
|
||
Correctly refuted the original hypothesis, but still tested with `think: false`.
|
||
Only the positive finding (still failed) was right; the conclusion ("the edit
|
||
tool is not the cause") was right for the wrong reason. The cause wasn't the
|
||
edit tool **because** it was `think: false`.
|
||
|
||
### Round 2b (wrong): "Cumulative tool-response context size is the trigger"
|
||
|
||
The truncation sweep showed a sharp 1200-vs-1600-char boundary. That was real
|
||
behavior, but it was a *byproduct* of `think: false`. With shorter context,
|
||
`think: false` doesn't always trigger the silent-stop at every decision point
|
||
— apparently the decoding-path divergence is stochastic or state-dependent.
|
||
The underlying bug was the same (the flag); the truncation pattern was just a
|
||
workaround that happened to land on the lucky side of the dice.
|
||
|
||
The prompt_eval_count threshold I identified (~2100 tokens) was the cumulative
|
||
context size at the model's natural decision-to-edit turn. Below that many
|
||
tokens the model survived the think=false flag; above it, `think=false` killed
|
||
generation. The number was real but the causal story was wrong.
|
||
|
||
## Why the existing GOTCHAS guidance was misleading here
|
||
|
||
`GOTCHAS.md` says: *"Thinking tokens consume num_predict budget invisibly,
|
||
returning empty responses. Seth has ONLY had success with thinking off."*
|
||
|
||
That guidance was derived from `AI_Visualizer` (per `IMPLEMENTATIONS.md` §
|
||
"Project: AI Visualizer") — single-turn JSON-generation pipelines where the
|
||
model's thinking DOES eat the num_predict budget and returns an empty `content`
|
||
field.
|
||
|
||
In a **multi-turn tool-calling agent loop**, the mechanics are different:
|
||
- Ollama returns separate fields for `content` and `thinking` (when populated)
|
||
- Tool calls come out through `tool_calls`, which isn't bounded by `content`
|
||
generation the same way
|
||
- Setting `think: false` here changes the chat-template / decoding path in a
|
||
way that makes 26B specifically — probably due to MoE routing sensitivity —
|
||
prefer early EOS at tool-decision turns
|
||
- 31B and Qwen3-Coder are more robust to the same flag
|
||
|
||
So the guidance isn't wrong; it's out of scope. It applied to AI_Visualizer,
|
||
was over-generalized to "always think:false", and the agent corpus inherited
|
||
that over-generalization.
|
||
|
||
## Revised, correct recommendation for CLI coding agents
|
||
|
||
1. **Do NOT set `think: false`** in your agent payload. Leave it unset (Ollama
|
||
default) or `true`.
|
||
2. **Do manage the `content` and `thinking` fields explicitly** if they
|
||
accumulate in your message history — prune old thinking blobs before
|
||
pushing past 30K context.
|
||
3. **The model / tool-surface choices don't matter the way I said they did.**
|
||
Any of (`gemma4:26b`, `gemma4:31b-it-q4_K_M`, `qwen3-coder:30b`) × (`write_file`,
|
||
`apply_patch`) × (capped/uncapped responses) passes when `think` is unset.
|
||
4. **For single-turn JSON pipelines, the original "think: false" guidance still
|
||
applies.** This correction is scoped to multi-turn tool-calling agents.
|
||
|
||
## Round 3 artifacts
|
||
|
||
- `scripts/bakeoff/harness_no_think_flag.py` — patch-mode harness with no think key
|
||
- `scripts/bakeoff/harness_write_no_think.py` — write-file harness with no think key
|
||
- `scripts/bakeoff/runs_patch/gemma4-26b-no-think-flag/log.json` — 26B patch, no think (PASS)
|
||
- `scripts/bakeoff/runs_patch/gemma4-26b-writefile-no-think/log.json` — 26B write, no think (PASS)
|
||
- `scripts/bakeoff/runs_patch/gemma4-31b-no-think-flag/log.json` — 31B patch, no think (PASS)
|
||
- `scripts/bakeoff/runs_patch/qwen3-coder-30b-no-think-flag/log.json` — Qwen patch, no think (PASS)
|
||
|
||
## Reproducing Round 3
|
||
|
||
```bash
|
||
cd scripts/bakeoff
|
||
|
||
# The correction: same harness as Round 1, just with think flag removed
|
||
python3 harness_write_no_think.py gemma4:26b runs_patch/gemma4-26b-writefile-no-think/work runs_patch/gemma4-26b-writefile-no-think/log.json
|
||
|
||
# Patch-mode without think flag
|
||
python3 harness_no_think_flag.py gemma4:26b runs_patch/gemma4-26b-no-think-flag/work runs_patch/gemma4-26b-no-think-flag/log.json
|
||
python3 harness_no_think_flag.py gemma4:31b-it-q4_K_M runs_patch/gemma4-31b-no-think-flag/work runs_patch/gemma4-31b-no-think-flag/log.json
|
||
python3 harness_no_think_flag.py qwen3-coder:30b runs_patch/qwen3-coder-30b-no-think-flag/work runs_patch/qwen3-coder-30b-no-think-flag/log.json
|
||
```
|