Mortdecai 7f806e0b92 feat: round-2 bakeoff — 26b silent-stop is tool-response context size

Round 2 tested the hypothesis that 26B's silent-stop was about
write_file argument size. Result: refuted.

- Patch-mode (apply_patch instead of write_file): 26B fails identically
  at iter 6. Tool-arg size is not the cause.
- Truncation sweep on tool responses reveals the real trigger: cap at
  800 or 1200 chars → 26B PASSES (1200-cap is 8.4s, fastest of any run).
  Cap at 1600, 2000, or unlimited → 26B silent-stops with eval=4.

Revised understanding: 26B silent-stops when cumulative tool-response
context crosses a shape threshold around 1200-1600 chars per response.
Not a tool-arg bug, not a raw code-gen bug — 26B emits correct code
fine in both one-shot and short-context settings.

Production CLI agents (openclaw, open code, aider) typically truncate
tool responses by default, so this failure may not surface in them.
Custom harnesses should cap ≤1200 chars per tool response when
targeting the 26B MoE.

Updates GOTCHAS (rewritten entry with the truncation sweep table),
SYNTHESIS model-selection row, CORPUS_cli_coding_agent.md pointer,
docs/reference/bakeoff-2026-04-18.md with full Round 2 methodology
and data.

Adds harness_patch.py (apply_patch edit tool), harness_patch_truncated.py
(env-configurable TOOL_RESULT_CAP), all 7 run logs, and a
.secrets.baseline for detect-secrets false positives on JSON timestamps.

2026-04-18 13:40:18 -04:00

Gemma 4 as a CLI Coding Agent

Research pass, 2026-04-18. Positions Gemma 4 against the specific use case of driving a terminal-based coding agent (openclaw / open code / aider / pi / hermes style: read_file, write_file, bash, iterate). Separate from the existing IMPLEMENTATIONS.md chat-agent patterns (Simon) and pipeline patterns (AI_Visualizer).

Empirical follow-up: docs/reference/bakeoff-2026-04-18.md — 2 rounds of runs against a custom minimal CLI-agent harness on a fix-the-median-bug task. Round 1: 31B clean (8 iters), Qwen3-Coder correct but chatty (15 iters), 26B silently quits with zero edits. Round 2 (diagnostic): the 26B failure is NOT about edit-tool-argument size — it's about cumulative tool-response context shape. Capping tool responses ≤1200 chars makes 26B pass cleanly and in the fastest wall time of any run (8.4s). Most production CLI agents already truncate tool responses, so the issue may be invisible in them. Read when: scoping which model to point an agent at, hitting an unexpected tool-call halt, or writing a custom harness targeting the 26B MoE.
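For custom harnesses, the cap described above can be sketched as a small helper. This is a minimal illustration, not the code in harness_patch_truncated.py: the `TOOL_RESULT_CAP` variable name comes from the commit notes, while the function name, the truncation marker, and the "0 means unlimited" convention are assumptions made for the example.

```python
import os
from typing import Optional

# Default taken from the sweep: a 1200-char cap passed, 1600 did not.
DEFAULT_CAP = 1200
# Hypothetical marker so the model can see that output was cut.
MARKER = "\n[... truncated]"


def cap_tool_result(text: str, cap: Optional[int] = None) -> str:
    """Return `text` limited to at most `cap` characters, marker included."""
    if cap is None:
        cap = int(os.environ.get("TOOL_RESULT_CAP", DEFAULT_CAP))
    if cap <= 0 or len(text) <= cap:
        return text  # assumption: cap <= 0 means "unlimited"
    keep = max(cap - len(MARKER), 0)
    # Slice once more so the result never exceeds the cap, even for tiny caps.
    return (text[:keep] + MARKER)[:cap]
```

A harness would call `cap_tool_result(output)` on every read_file or bash result before appending it to the message history.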

TL;DR

Gemma 4 is Google's first Gemma with trained (not proof-of-concept) tool use. LiveCodeBench v6 = 80.0% (31B) / 77.1% (26B). Codeforces ELO = 2150 / 1718. That's frontier-open territory on the reported benchmarks.
Google/HF co-launched with four local CLI coding agents: openclaw, hermes, pi, open code (see tooling/huggingface/blog/gemma4-blog.md, § "Plug in your local agent"). All four use an OpenAI-compatible endpoint → Ollama or llama.cpp work interchangeably.
No SWE-bench or Aider polyglot number from Google. Reporting leans on competitive programming and single-file code gen. Real-world multi-file repo-scale coding is an empirical question Google didn't answer. Treat the CLI agent claim as plausible but untested, not proven.
No specialized CodeGemma-4 sibling exists (CodeGemma is still built on Gemma 1). Base Gemma 4 is the Gemma-family coding path for now.
In Seth's homelab, CT 166 openclaw2 on pve197 is the natural testbed — GPU-adjacent to CT 105 Ollama which already serves gemma4:26b and gemma4:31b-it-q4_K_M.

What Google does and doesn't claim

The HF 31B-it model card (tooling/huggingface/model-cards/gemma-4-31B-it-README.md, line 38) says:

"Enhanced Coding & Agentic Capabilities – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents."

Reported coding / agentic numbers (from CORPUS_benchmarks.md):

Benchmark	31B	26B A4B	What it tests
LiveCodeBench v6	80.0%	77.1%	Single-file code generation
Codeforces ELO	2150	1718	Competitive programming
tau2-bench	86.4%	85.5%	Agentic tool use — customer service, not coding

What's not reported and worth noting:

SWE-bench Verified / SWE-bench Lite — the standard multi-file repo-patch benchmark
Aider polyglot — the standard diff-format / edit-quality benchmark
HumanEval / MBPP — even the old single-function tests

The absence isn't necessarily bad news (Google could simply have prioritized novel benchmarks), but it means the claim "powering highly capable autonomous agents" has no agentic-coding-specific receipt. tau2-bench is the closest agentic number and it measures a different domain.

First-party supported CLI coding agents

From the HF launch blog (tooling/huggingface/blog/gemma4-blog.md, lines 505-572):

Agent	Config	Endpoint
openclaw	`openclaw onboard` — auto-detects running llama-server	OpenAI-compatible
hermes	`hermes model` — interactive model picker	OpenAI-compatible
pi	`~/.pi/agent/models.json`	`baseUrl: http://localhost:8080/v1`, `api: openai-completions`
open code	`~/.config/opencode/opencode.json` (opencode.ai)	`@ai-sdk/openai-compatible`, `baseURL: http://127.0.0.1:8080/v1`

All four are demonstrated against llama.cpp's llama-server, which ships first-party Gemma 4 GGUFs via ggml-org/gemma-4-*-it-GGUF including mmproj for vision. Ollama's /v1/chat/completions is drop-in substitutable — same protocol, different port/path (http://<host>:11434/v1).
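To illustrate the "same protocol, different port" point, here is a hedged sketch that builds an identical chat-completions request for either backend using only the standard library. The `BACKENDS` table and `build_chat_request` are names invented for this example, and the hosts are placeholders for your own setup.

```python
import json
import urllib.request

# Base URLs follow the text above; adjust host and port to your machines.
BACKENDS = {
    "llama-server": "http://127.0.0.1:8080/v1",
    "ollama": "http://127.0.0.1:11434/v1",
}


def build_chat_request(backend: str, model: str, messages: list) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-compatible chat completion request."""
    body = json.dumps(
        {"model": model, "messages": messages, "stream": False}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{BACKENDS[backend]}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it. Only the base URL differs between the two backends, which is why switching providers in an agent config is usually a one-line change.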

The blog didn't test aider / continue / cline / roo code / goose. They're all OpenAI-compatible and should work, but they're outside Google's tested set. Aider in particular relies on the model emitting a highly structured edit format cleanly, and long structured output is an area where Gemma 4 has a known weakness (long/nested JSON; see GOTCHAS.md).

vs qwen3-coder:30b (the realistic homelab alternative)

Seth's steel141 already has qwen3-coder:30b and qwen3-coder-next:79.7B. The honest comparison:

Axis	Gemma 4 26B A4B	qwen3-coder:30b
Active params	3.8B (MoE, 8-of-128 experts)	~3.3B (MoE, ~30B total)
Designed for	General-purpose + agentic tool use	Coding specifically
Vision	Native (all variants)	No
Agentic tool-call training	Yes, native tokens	Yes, native tokens
LiveCodeBench v6	77.1% (Google card)	not in this corpus — don't invent
Edit-format fidelity	Weak at long JSON (sequential-calls workaround)	Coder-tuned, strong at diffs
VRAM at 32K ctx	moderate (KV-hungry, see GOTCHAS)	moderate

Picking heuristic:

Gemma 4 if the agent does chat + tools + vision (e.g., "look at this screenshot, edit this file, re-run test"): it is the only one of the two with native vision.
qwen3-coder if the agent is pure code-edit loops where diff quality dominates.
Bakeoff before committing. Swapping an OpenAI-compatible provider URL is near-free. Two runs on one real repo task tell you more than any published benchmark.

Don't treat Google's "Enhanced Coding" framing as a head-to-head result against Qwen. It's not — they're pointing at the delta from Gemma 3, not at current coder-specialized competition.

Configuration for Ollama-backed agents

The baseline settings from SYNTHESIS.md still apply. CLI coding agent-specific adjustments:

{
  "model": "gemma4:26b",
  "think": false,
  "keep_alive": "4h",
  "options": {
    "num_ctx": 32768,
    "num_predict": 4096,
    "temperature": 0.3
  }
}

num_ctx: 32768 is the working minimum for repo-scale work. Agents interleave file reads, bash output, and edits; 4K will truncate the second read_file.
num_predict: 4096 — single edits are short but the agent may emit a bash invocation + reasoning + tool call in one turn.
temperature: 0.3 — per SYNTHESIS.md temperature table, "structured extraction" tier. Coding edits want low variance.
think: false — critical. GOTCHAS.md documents that Ollama 0.20+ thinking silently eats num_predict and drops tool calls. If an agent somehow injects think: true, you'll see empty responses.
keep_alive: 4h keeps the model loaded across the idle gaps in an agent session (the user reading or thinking), which avoids a reload penalty after each gap.
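The settings above can be kept in one place and merged into each request. The sketch below assumes the agent or harness talks to Ollama's native chat API, which accepts `think`, `keep_alive`, and `options` as top-level fields. `AGENT_DEFAULTS` and `build_ollama_payload` are illustrative names, not part of any agent.

```python
# Baseline values copied from the config block above.
AGENT_DEFAULTS = {
    "think": False,
    "keep_alive": "4h",
    "options": {"num_ctx": 32768, "num_predict": 4096, "temperature": 0.3},
}


def build_ollama_payload(model, messages, tools=None, **option_overrides):
    """Merge per-call option overrides into the baseline without mutating it."""
    payload = {
        "model": model,
        "messages": messages,
        "think": AGENT_DEFAULTS["think"],
        "keep_alive": AGENT_DEFAULTS["keep_alive"],
        "options": {**AGENT_DEFAULTS["options"], **option_overrides},
    }
    if tools:
        payload["tools"] = tools
    return payload
```

For example, `build_ollama_payload("gemma4:26b", messages, temperature=0.1)` lowers the temperature for one call while leaving `num_ctx` and `num_predict` at the documented values.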

Streaming

Non-streaming mode is required on Ollama 0.20.0-0.20.1. The tool-call parser drops calls on streaming endpoints (see GOTCHAS.md and CORPUS_tool_calling_format.md). Most CLI agents default to non-streaming for tool turns, but verify in the agent's config.
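In a custom harness, one way to enforce this is a guard that switches streaming off whenever tools are attached. This is a small sketch under that assumption, and `force_non_streaming_for_tools` is a hypothetical helper name.

```python
def force_non_streaming_for_tools(payload: dict) -> dict:
    """Return a copy of the request payload with streaming disabled on tool turns."""
    if payload.get("tools"):
        return {**payload, "stream": False}
    return dict(payload)
```

Plain chat turns without tools keep whatever `stream` value the caller chose.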

llama-server alternative

If you want to follow the HF blog exactly, swap Ollama for llama.cpp:

llama-server -hf ggml-org/gemma-4-26b-a4b-it-GGUF:Q4_K_M \
  --jinja \
  -c 32768 \
  --host 0.0.0.0 --port 8080

--jinja is the critical flag — without it, the native tool-call template (with <|tool_call> / <tool_call|> asymmetric brackets — see CORPUS_tool_calling_format.md) doesn't render correctly.

Gotchas specific to CLI coding agent use

These extend (do not replace) the general GOTCHAS.md.

1. Safety overfiltering on security-adjacent code

GOTCHAS.md documents strict alignment generally. For coding agents this bites more often: pentest tooling, CTF write-ups, auth-bypass debugging, even aggressive rm -rf-style cleanup can trigger refusals or bowdlerized edits.

Workaround: The agent's system prompt should establish authorization context — "this is an authorized security test", "this is my own machine", "this is a CTF challenge". Don't rephrase as a jailbreak; state context plainly. Stock agent system prompts typically don't set this, so it's often the first thing to add.

2. Weak long JSON → favors sequential tool calls

Gemma 4 struggles with deeply-nested schemas and long arrays (existing GOTCHAS.md finding). Agent-level implication:

Agents that drive tool-by-tool (openclaw, open code, pi, cline): good fit. Each write_file / bash / read_file is a short tool call.
Agents that expect one-giant-structured-response (some aider edit modes, any "output the entire diff as JSON"): expect parse failures on long patches. Break into smaller edits if possible.
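When writing your own tools, the "small edits" advice can be built into the tool schema itself: a flat tool that takes one snippet replacement per call, so the model never has to emit a long nested structure. The tool name `replace_in_file` and the exact-match rule are assumptions for this sketch, not the bakeoff's apply_patch tool.

```python
# Hypothetical tool definition in the OpenAI function-calling format:
# three flat string arguments, one edit per call.
REPLACE_TOOL = {
    "type": "function",
    "function": {
        "name": "replace_in_file",
        "description": "Replace one exact snippet in one file. "
                       "Call again for each further edit.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "old": {"type": "string"},
                "new": {"type": "string"},
            },
            "required": ["path", "old", "new"],
        },
    },
}


def apply_replace(source: str, old: str, new: str) -> str:
    """Apply a single edit; refuse empty, missing, or ambiguous snippets."""
    if not old:
        raise ValueError("old snippet must not be empty")
    count = source.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return source.replace(old, new, 1)
```

Rejecting ambiguous matches makes the harness return a short error the model can act on, rather than silently editing the wrong location.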

3. No code execution — that's the agent's job

Gemma 4 has no sandbox / kernel / VM. It decides when to call bash; the agent runs it. This is standard but worth stating — no CodeInterpreter-style "model runs the code" path.

4. Long-horizon context pressure

Gemma 4 supports 256K context on 26B/31B, but the KV cache is VRAM-hungry (existing GOTCHAS.md). For an agent churning through a repo:

32K ctx = comfortable on a 24GB card
128K ctx = you're feeding a lot of VRAM to cache, not weights
Prefer agent-side retrieval (grep, ripgrep, targeted file reads) over "paste the whole repo in context"
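As a rough sketch of agent-side retrieval, the helper below returns only the matching lines plus a little surrounding context instead of whole files. The function name, the `*.py` default glob, and the hit limit are illustrative choices; a real agent would more likely shell out to ripgrep.

```python
import re
from pathlib import Path


def grep_snippets(root, pattern, context=2, max_hits=20, glob="*.py"):
    """Return up to max_hits matches, each with `context` lines on either side."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob(glob)):
        try:
            lines = path.read_text(encoding="utf-8").splitlines()
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries, directories, unreadable files
        for i, line in enumerate(lines):
            if regex.search(line):
                lo = max(0, i - context)
                hi = min(len(lines), i + context + 1)
                hits.append({
                    "path": str(path),
                    "line": i + 1,  # 1-based, as editors display it
                    "snippet": "\n".join(lines[lo:hi]),
                })
                if len(hits) >= max_hits:
                    return hits
    return hits
```

Returning a handful of short snippets keeps each tool response small, which also fits the tool-response cap discussed at the top of this document.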

5. Identity drift across long sessions

Gemma 4's "ultra-compliant but doesn't know who it is" tendency (existing GOTCHA) shows up in long agent sessions as subtle drift: switching voice, adopting a different tool-call style mid-session, forgetting constraints from turn 1. The SYNTHESIS.md system-prompt template (identity + what-you-do + what-you-do-not + format) matters more for a 50-turn agent loop than for a 3-turn chat.
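The four-part template can be assembled programmatically so every session starts from the same constraints. The sketch below is only an assumption about its shape: the exact wording in SYNTHESIS.md is not reproduced here, and the optional `authorization` line reflects the workaround from gotcha 1.

```python
def build_system_prompt(identity, does, does_not, output_format, authorization=None):
    """Assemble identity + what-you-do + what-you-do-not + format into one prompt."""
    parts = [f"You are {identity}."]
    if authorization:
        # Plain statement of context, not a jailbreak-style rephrasing.
        parts.append(f"Context: {authorization}")
    parts.append("What you do:\n" + "\n".join(f"- {item}" for item in does))
    parts.append("What you do not do:\n" + "\n".join(f"- {item}" for item in does_not))
    parts.append(f"Output format:\n{output_format}")
    return "\n\n".join(parts)
```

A call might look like `build_system_prompt("a CLI coding agent working in one repository", ["read files", "run tests"], ["push to remotes"], "One tool call per turn.")`.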

6. Missing coding-specific agentic benchmark (same warning, bigger stakes)

Because Google didn't publish SWE-bench, you're operating on extrapolation from Codeforces + tau2-bench when you use Gemma 4 as a CLI coding agent. Measure on your actual repo before taking a dependency.

Homelab setup (Seth)

Natural testbed: CT 166 openclaw2 on pve197 → CT 105 Ollama on pve197.

Both are on the same host so there's no network hop. CT 105 already serves gemma4:26b and gemma4:31b-it-q4_K_M (verified in handoff + per-node inventory in /home/claude/bin/CLAUDE.md).

Verify openclaw2's current model config. If it's pointing at a different backend, switch to http://192.168.0.179:11434/v1 with gemma4:26b (or 31B if VRAM permits alongside the V100 CT 167 visualizer stack).
Set default options per the block above (num_ctx: 32768, num_predict: 4096, think: false, temperature: 0.3, keep_alive: 4h).
Run one real task (suggested: a small addition to Mortdecai-2.0 — a codebase with existing CLAUDE.md and clear conventions, good signal-to-noise).
Capture: number of tool calls, number of retries, diff quality, wall clock.
Run the same task against qwen3-coder:30b on steel141 (http://192.168.0.141:11434/v1). Don't A/B anything else: same agent, same prompt, same repo state, different backend.
If Gemma 4 dominates on plan/navigate/describe but Qwen dominates on write_file quality, the natural step is per-role model split: let openclaw2 use Gemma for "thinking" tool calls and Qwen for edit tool calls. open code's provider config supports this cleanly.
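The measurement steps above could be recorded with a small runner like the following. It is a sketch under stated assumptions: `run_task` is a hypothetical callable that drives the agent once against a backend and reports its own counts, and diff quality still has to be judged by hand. The URLs and model tags are the ones listed in the steps.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Backends from the steps above: name -> (base URL, model tag).
HOMELAB_BACKENDS: Dict[str, Tuple[str, str]] = {
    "gemma4": ("http://192.168.0.179:11434/v1", "gemma4:26b"),
    "qwen3-coder": ("http://192.168.0.141:11434/v1", "qwen3-coder:30b"),
}


@dataclass
class RunResult:
    backend: str
    model: str
    tool_calls: int
    retries: int
    passed: bool
    wall_seconds: float


def run_bakeoff(backends: Dict[str, Tuple[str, str]],
                run_task: Callable[[str, str], dict]) -> List[RunResult]:
    """Run the same task once per backend and collect comparable metrics."""
    results = []
    for name, (base_url, model) in backends.items():
        start = time.perf_counter()
        outcome = run_task(base_url, model)  # same agent, prompt, repo state
        elapsed = time.perf_counter() - start
        results.append(RunResult(
            backend=name,
            model=model,
            tool_calls=outcome["tool_calls"],
            retries=outcome["retries"],
            passed=outcome["passed"],
            wall_seconds=round(elapsed, 2),
        ))
    return results
```

Resetting the repository to the same commit inside `run_task` before each run keeps the backend as the only variable.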

What is NOT covered by this document

Concrete results from the homelab bakeoff proposed above (do the measurement, write a separate findings file). The minimal-harness runs are in docs/reference/bakeoff-2026-04-18.md.
openclaw / hermes / pi / open code feature-matrix detail (each agent has its own docs — the HF blog links to all four)
aider-specific diff-format analysis (aider wasn't in the HF blog's tested set)
Fine-tuning Gemma 4 for coding agents (see tooling/fine-tuning/ — the existing path)
CodeGemma (still Gemma 1 base — see tooling/gemma-family/codegemma.md)

Provenance

HF 31B-it model card: tooling/huggingface/model-cards/gemma-4-31B-it-README.md
HF launch blog: tooling/huggingface/blog/gemma4-blog.md
Benchmarks: CORPUS_benchmarks.md
Tool calling: CORPUS_tool_calling_format.md
Ollama variants: CORPUS_ollama_variants.md
Known issues: GOTCHAS.md
Qwen3-Coder in homelab: /home/claude/bin/CLAUDE.md § "Ollama models"
