feat: mort-bot think=true vs think=false bakeoff

Seth's challenge: "we experienced this context eating with every implementation that had think=true. mort-bot runs a loop. Can you do a bake-off?"

Built a harness that replicates mort-bot's /api/chat loop verbatim (num_ctx=8192, num_predict=2048, temperature=0.7, gemma4:26b, STEP_BUDGET=20, exact payload shape) but with stubbed tools and a prebuilt 15-turn fake chat history. Ran 4 tasks × 2 think settings.

Finding: on Ollama 0.20.4 the "thinking eats context" concern does NOT reproduce.

Direct evidence:
- Movies task step 2 (think=true) returned 905 chars of thinking.
- Step 3 prompt_eval_count delta: +76 tokens (think=true) vs +135 tokens (think=false). If thinking had accumulated in the prompt, think=true would have grown by +360 tokens, not shrunk.
- Ollama's chat template strips the `thinking` field when serializing assistant turns for subsequent prompts.

All 4 tasks × 2 settings produced identical step counts and tool counts. Wall clocks comparable. Gemma only actually generated thinking on 1 of 4 tasks (the one with check_sethflix verify-loop); on the others with think=true it emitted 0 thinking tokens.

Reconciled with the earlier coding-agent bakeoff: the two findings are orthogonal. Coding bakeoff was at num_ctx=32K with a different harness; mort at 8K doesn't touch the silent-stop regime either way. Seth's prior may have been correct on an older Ollama or in a different API shape (/api/generate has its own issues) but does not reproduce here.

Concrete recommendation: mort-bot THINK=False is defensible but not load-bearing; THINK=True or unset-default would also work. Keep as-is unless a different need arises.

New: docs/reference/mort-bakeoff-2026-04-18.md, scripts/mort-bakeoff/ (harness + 8 run logs). README updated with pointer.
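
The prompt_eval_count argument above can be sketched as a small check. This is a minimal illustration, not the harness in scripts/mort-bakeoff/: the function names and the 4-chars-per-token ratio are assumptions, the ratio chosen because it roughly reproduces the +360 figure (135 + 905/4 ≈ 361). Only the 76, 135, and 905 values come from the runs.

```python
# Hypothetical helper names; not taken from scripts/mort-bakeoff/.
CHARS_PER_TOKEN = 4  # assumption: rough chars-per-token heuristic, not measured


def expected_delta_if_accumulated(baseline_delta, thinking_chars,
                                  chars_per_token=CHARS_PER_TOKEN):
    """Prompt growth we would expect if the thinking text were re-sent.

    baseline_delta is the step-to-step prompt_eval_count growth observed
    with think=false; the thinking text would add its own tokens on top.
    """
    return baseline_delta + thinking_chars / chars_per_token


def thinking_accumulated(think_delta, baseline_delta, thinking_chars,
                         chars_per_token=CHARS_PER_TOKEN):
    """Return True if the think=true delta looks inflated by prior thinking.

    The 50% threshold is an arbitrary tolerance for tokenizer noise and
    for the two runs producing slightly different visible output.
    """
    extra_expected = thinking_chars / chars_per_token
    extra_observed = think_delta - baseline_delta
    return extra_observed >= 0.5 * extra_expected


# Movies task, step 2 -> step 3, using the numbers reported above.
expected = expected_delta_if_accumulated(baseline_delta=135, thinking_chars=905)
accumulated = thinking_accumulated(think_delta=76, baseline_delta=135,
                                   thinking_chars=905)
print(f"expected delta if accumulated: ~{round(expected)} tokens")
print(f"thinking accumulated: {accumulated}")
```

With the reported numbers the observed think=true delta (+76) is below the think=false baseline (+135), so the check comes out negative, which matches the finding.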
2026-04-18 18:23:43 -04:00
parent c61394923c
commit 8436a91571
12 changed files with 988 additions and 2 deletions
@@ -15,7 +15,8 @@ Research corpus and implementation guidance for Google Gemma 4, based on product
 | `CORPUS_benchmarks.md` | Full benchmark table vs Gemma 3, arena scores, agentic scores | When comparing Gemma 4 to alternatives |
 | `CORPUS_tool_calling_format.md` | Native token format + JSON API format for function calling | When implementing tool calling |
 | `CORPUS_cli_coding_agent.md` | Positioning Gemma 4 for CLI coding agent use (openclaw / open code / pi / hermes / aider style). Honest take on what Google did and didn't measure, head-to-head with `qwen3-coder:30b`, homelab setup pointer | When scoping a CLI coding agent or deciding Gemma 4 vs Qwen3-Coder |
-| `docs/reference/bakeoff-2026-04-18.md` | Raw results: CLI-coding-agent bakeoff of gemma4:26b / gemma4:31b / qwen3-coder:30b on steel141 3090 Ti. **31B clean, Qwen3-Coder correct but chatty, 26B reproducibly silent-stops at write_file.** Harness at `scripts/bakeoff/` | When deciding which model to back a CLI agent with, or debugging a similar tool-call halt |
+| `docs/reference/bakeoff-2026-04-18.md` | CLI-coding-agent bakeoff on 3090 Ti. **Rounds 1/2 misidentified the cause; Round 3 (the correct one): `think: false` silent-stops gemma4:26b at certain multi-turn states on 32K context.** 31B and Qwen3-Coder robust to the flag. Harness at `scripts/bakeoff/` | When deciding which model to back a CLI agent with, writing a custom agent payload, or debugging a silent tool-call halt |
+| `docs/reference/mort-bakeoff-2026-04-18.md` | mort-bot-specific `think=true` vs `think=false` bakeoff on mort's actual loop shape (gemma4:26b, num_ctx=8192). **Thinking does NOT accumulate in context on Ollama 0.20.4**; Ollama's chat template strips it from serialized history. Both settings behave identically on step counts and tool counts, with comparable wall clock. Harness at `scripts/mort-bakeoff/` | When deciding mort-bot's THINK env var, or when someone claims "think=true eats context" without pinning an Ollama version |
 | `tooling/` | **Canonical upstream tooling** — real scripts, notebooks, model cards, and configs pulled from Google / HF / framework maintainers (147 files). Subdirs: `google-official/`, `huggingface/`, `inference-frameworks/`, `gemma-family/`, `fine-tuning/`. See `tooling/README.md` for index and findings that update the older `CORPUS_*` docs | When you need authoritative source material — model cards, chat templates, fine-tuning recipes, serving commands for vLLM / llama.cpp / MLX, or to scope a specialized sibling (ShieldGemma, EmbeddingGemma, etc.) |

 ## Source Projects