Files
gemma4-research/GOTCHAS.md
T
Mortdecai 7f806e0b92 feat: round-2 bakeoff — 26b silent-stop is tool-response context size
Round 2 tested the hypothesis that 26B's silent-stop was about
write_file argument size. Result: refuted.

- Patch-mode (apply_patch instead of write_file): 26B fails identically
  at iter 6. Tool-arg size is not the cause.
- Truncation sweep on tool responses reveals the real trigger: cap at
  800 or 1200 chars → 26B PASSES (1200-cap is 8.4s, fastest of any run).
  Cap at 1600, 2000, or unlimited → 26B silent-stops with eval=4.

Revised understanding: 26B silent-stops when cumulative tool-response
context crosses a shape threshold around 1200-1600 chars per response.
Not a tool-arg bug, not a raw code-gen bug — 26B emits correct code
fine in both one-shot and short-context settings.

Production CLI agents (openclaw, open code, aider) typically truncate
tool responses by default, so this failure may not surface in them.
Custom harnesses should cap ≤1200 chars per tool response when
targeting the 26B MoE.

Updates GOTCHAS (rewritten entry with the truncation sweep table),
SYNTHESIS model-selection row, CORPUS_cli_coding_agent.md pointer,
docs/reference/bakeoff-2026-04-18.md with full Round 2 methodology
and data.

Adds harness_patch.py (apply_patch edit tool), harness_patch_truncated.py
(env-configurable TOOL_RESULT_CAP), all 7 run logs, and a
.secrets.baseline for detect-secrets false positives on JSON timestamps.
2026-04-18 13:40:18 -04:00

12 KiB
Raw Blame History

Gemma 4 Gotchas & Known Issues

Derived from Seth's production implementations (Simon, AI_Visualizer) and community reports. These are hard-won lessons.

CRITICAL: Thinking Mode Eats Context

Severity: HIGH — causes silent failures

Gemma 4 in Ollama 0.20+ defaults to think: true. When enabled:

  • Thinking tokens go into a hidden thinking field, NOT response
  • If num_predict is limited, thinking consumes the entire budget
  • response comes back empty — no error, just silence
  • On evaluative tasks, thinking inflates scores (31B scored a known-bad image 9/10 with thinking vs 7/10 without)

Fix: Always pass think: false in the Ollama payload. Seth has had success ONLY with thinking off.

{
  "model": "gemma4:26b",
  "think": false,
  "options": { "num_predict": 4096 }
}

CRITICAL: format=json Causes Infinite Loops

Severity: HIGH — hangs indefinitely

Ollama's server-side format: "json" enforcer causes Gemma 4 26B (Q4) to enter an infinite retry loop when the requested schema is deeply nested.

Fix: Never use format: "json". Instead:

  1. Request JSON structure in the prompt text
  2. Parse client-side with regex + json.loads + json5 fallback
# DO THIS
response = client.generate(model="gemma4:26b", prompt=prompt, format_json=False)
body = response["response"]
obj = json.loads(body[body.find("{"):body.rfind("}") + 1])

# NOT THIS
response = client.generate(model="gemma4:26b", prompt=prompt, format="json")  # HANGS

CRITICAL: Ollama Default Context is 2048

Severity: HIGH — causes truncation

Ollama defaults num_ctx to 2048 tokens. Gemma 4 supports 128K. If you don't override, your prompts get silently truncated.

Fix: Always set num_ctx explicitly:

{ "options": { "num_ctx": 8192 } }

Scale to your needs: 4096 for simple tasks, 16384 for long inputs, 32768 for complex multi-turn.

HIGH: num_predict Default is 128

Severity: HIGH — truncates output

Ollama defaults num_predict to 128 tokens. Almost any useful Gemma 4 output exceeds this.

Fix: Always set num_predict explicitly. Minimum recommended: 512. For JSON output: 2048+.

HIGH: 26B Silent-Stops When Tool Responses Accumulate (reproducible)

Severity: HIGH — silent agent-loop failure. Mitigatable.

Reproduced on 2026-04-18 against gemma4:26b via Ollama 0.20.4 on a 3090 Ti (steel141). Agent harness looped through read_file / (write_file or apply_patch) / run_bash tools to fix a failing Python test.

The observation

26B silent-stops (empty content, no tool calls, eval_count=4) at the decision-to-edit turn, regardless of which edit tool is offered — tested with both write_file(path, full_content) and apply_patch(path, old, new). Initial hypothesis (long tool-call argument) was refuted.

The actual trigger: cumulative tool-response context shape

A sweep with progressive truncation caps on tool responses (TOOL_RESULT_CAP):

Cap (chars) Result Halt eval_count
800 PASS 24 (continues, hits iteration cap)
1200 PASSfastest of any run (8.4s) 27 (clean summary)
1600 FAIL 4 (silent stop)
2000 FAIL 4 (silent stop)
unlimited FAIL 4 (silent stop)

Sharp transition between 1200 and 1600 chars-per-response. Below the line, 26B emits correct code (eval_count ~165 on the patch turn). Above, it silent-stops. Exact mechanism unproven (could be MoE expert routing, chat-template edge case, or something else). Actionable: cap tool responses ≤1200 chars.

What's NOT at fault

  • Not the edit tool surfacewrite_file and apply_patch both trigger it
  • Not raw code generation — a one-shot direct prompt asking 26B to fix the same function returned clean correct code (eval=81)
  • Not total context size alone — the 800-cap run continued past 3741 prompt tokens. Failing runs halt at ~2070-2150 tokens but the 1200-cap run crossed the same range and kept going
  • Not a Gemma-4-family issuegemma4:31b-it-q4_K_M on identical harness handles full-size tool responses cleanly (eval=330 on the write turn)

Fix

  • For 26B in an agent loop, cap tool responses ≤1200 chars. 800 is safer; this is where every production CLI agent (openclaw / open code / aider / cline) already lives by default, so the issue may not surface in those frameworks.
  • For raw pytest output specifically, use pytest -x --tb=line or a custom formatter to shrink per-test output to a few lines.
  • Alternative: use gemma4:31b-it-q4_K_M — same harness, no mitigation, just works. Trade: ~5× slower than 26B when 26B cooperates.
  • See docs/reference/bakeoff-2026-04-18.md (Round 2) for full traces and the truncation sweep methodology.

MEDIUM: Weak at Long/Nested JSON

Severity: MEDIUM — causes parse failures

Gemma 4 reliably produces short JSON (5-10 fields) but struggles with:

  • Deeply nested schemas (3+ levels)
  • Long arrays (20+ items)
  • Mixed nesting + length

Fix: Sequential tool calls. Break one large JSON request into multiple smaller calls:

  • Instead of "generate a 50-item storyboard", do "generate items 1-5", "generate items 6-10", etc.
  • Due to Gemma 4's fast speed and free local use, sequential calls are cheap

Fallback pattern (AI_Visualizer):

for attempt in range(MAX_RETRIES):
    temp = BASE_TEMP + attempt * TEMP_BUMP  # 0.4 -> 0.5 -> 0.6
    response = call_gemma(temperature=temp)
    try:
        return parse_json(response)
    except JSONDecodeError:
        continue

MEDIUM: Identity Confusion

Severity: MEDIUM — cosmetic but confusing

Gemma 4 is ultra-compliant and highly capable but does not know who it is. It may:

  • Claim to be a different model
  • Hallucinate capabilities it doesn't have
  • Respond as a generic "AI assistant" without personality

Fix: Explicit identity in system prompt:

You are [Name], a [role]. You are powered by Gemma 4.
You ONLY do [X]. You NEVER do [Y].

Gemma 4 does NOT need hand-holding on task execution — it's very capable. It needs explicit instructions about identity and boundaries.

MEDIUM: Flash Attention Hang on 31B Dense (>3-4K tokens)

Severity: MEDIUM — hardware-specific, affects RTX 3090

Community-reported: Flash Attention causes Gemma 4 31B Dense to hang indefinitely during prompt evaluation when the prompt exceeds ~3-4K tokens. The 26B MoE variant handles the same prompts fine — bug is specific to the Dense model.

Source: ollama/ollama#15350

Fix: Use 26B for long prompts, or disable Flash Attention if running 31B on affected hardware.

MEDIUM: Tool Calling Broken in Ollama v0.20.0 Streaming

Severity: MEDIUM — version-specific

As of early April 2026, Gemma 4 tool calling has issues in Ollama v0.20.0: the tool call parser fails and streaming drops tool calls entirely. Community reports include format mismatches and continuous loops in llama.cpp / LM Studio.

Source: community reports

Fix: Use non-streaming for tool calls (Simon does this). Test tool calling thoroughly when upgrading Ollama versions. Seth's implementations work reliably with non-streaming tool calls.

MEDIUM: VRAM-Hungry for Context

Severity: MEDIUM — affects hardware planning

Gemma 4 KV cache is large relative to competitors. Community reports: 31B at 262K context requires ~22GB just for KV cache on top of model weights. One user could only fit Gemma 3 27B Q4 with 20K context on a 5090, while Qwen 3.5 27B Q4 fit with 190K context on the same card.

Implication: Don't set num_ctx higher than you need. 32K is plenty for most tasks and keeps VRAM reasonable.

MEDIUM: Safety Overfiltering

Severity: MEDIUM — blocks benign prompts

Strict safety alignment occasionally blocks technical, academic, or creative prompts that superficially resemble restricted categories. One user reported jailbreaks with basic system prompts.

Fix: Rephrase blocked prompts to avoid trigger patterns. For system prompts, avoid language that sounds like you're asking the model to bypass restrictions — just state the task directly.

MEDIUM: KV Cache Config Bug (31B/26B ship with num_kv_shared_layers=0)

Severity: MEDIUM — crashes on first attention forward pass

The 31B and 26B ship with num_kv_shared_layers = 0, which causes layer_types[:-0] to collapse to zero layer slots. Crashes on first forward pass.

Fix: Patch the config. Check model card discussions for the exact fix.

LOW: vLLM Triton Fallback (~9 tok/s on RTX 4090)

Severity: LOW — vLLM-specific

Heterogeneous attention head dimensions in Gemma 4 force vLLM to fall back to a slow Triton kernel. RTX 4090 gets ~9 tok/s instead of expected ~100+.

Source: vllm-project/vllm#38887

Fix: Use Ollama instead of vLLM for now, or wait for the fix.

LOW: <unused> Token Infinite Loop (Vulkan backends)

Severity: LOW — Vulkan-specific

Gemma 4 can generate <unused> or <unused24> tokens in an infinite loop on Vulkan backends in llama.cpp.

Source: ggml-org/llama.cpp#21516

MEDIUM: google/gemma_pytorch Abandoned for Gemma 4

Severity: MEDIUM — wastes time on a dead-end path

The google/gemma_pytorch repo (last push 2025-05-30) has zero Gemma 4 support — its variants validator only accepts Gemma 1/2/3 IDs. Anyone pointing at it as "the official PyTorch reference" for Gemma 4 is wrong.

Use instead:

  • Inference: huggingface/transformers (AutoModelForMultimodalLM, v5.5.4+)
  • Reference impl: google-deepmind/gemma (JAX/Flax)
  • Serving: Ollama / vLLM / llama.cpp

See tooling/google-official/gemma-pytorch/README.md for the original repo state.

LOW: Fine-Tuning Ecosystem Issues

Severity: LOW — only relevant if fine-tuning

Day-one issues for fine-tuners:

  • HuggingFace Transformers didn't recognize gemma4 architecture (required install from source)
  • PEFT couldn't handle Gemma4ClippableLinear (new vision encoder layer type)
  • New mm_token_type_ids field required during training even for text-only data
  • E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug)
  • Flash Attention 2/4 incompatible: Gemma 4's global-attention head_dim is 512; FA2 max is 256, FA4 max is 128. Training backends fall back to SDP or Flex Attention (Axolotl hard-codes sdp_attention: true for Gemma 4). Does not affect inference runtimes that already use SDP (Ollama, vLLM).
  • Fused LoRA kernels broken (shared-KV layers). Axolotl disables lora_mlp_kernel / qkv_kernel / o_kernel for Gemma 4; Unsloth routes around it.
  • 26B A4B MoE wants ≥8-bit LoRA, not 4-bit QLoRA — MoE expert quality degrades at 4-bit during training. Axolotl's ScatterMoE + expert-LoRA config is the only validated 4-bit MoE path. (This caveat is training-only; Q4_K_M inference is fine.)
  • New tool-call / channel tokens are learned embeddings — if fine-tuning, set modules_to_save=["lm_head","embed_tokens"] + ensure_weight_tying=True in LoraConfig, or the adapter trains against frozen random vectors for them.

See tooling/fine-tuning/recipe-recommendation.md for the full training path.

LOW: Vision Validator Overrejects

Severity: LOW — specific to evaluative vision tasks

In AI_Visualizer, Gemma 4 vision was used to critique SDXL frames. It flagged images for motif-matching failures that humans rated as equal or better than passed images. The validator was queued for disable.

Pattern: Gemma 4 vision is good at description but unreliable for subjective quality scoring. Use it for "what's in this image?" not "is this image good?"

LOW: Keep-Alive Too Short

Severity: LOW — performance only

Default keep_alive is 5 minutes. If your pipeline has gaps (e.g., waiting for SDXL generation), the model gets unloaded and reloaded (~10-30s penalty).

Fix: Set keep_alive to match your pipeline duration:

{ "keep_alive": "4h" }

Or pin/unpin explicitly:

client.generate(model="gemma4:26b", prompt="", keep_alive=-1, options={"num_predict": 0})  # pin
# ... do work ...
client.generate(model="gemma4:26b", prompt="", keep_alive=0, options={"num_predict": 0})    # unpin