docs: initial Gemma 4 research corpus and synthesis

Architecture specs, benchmarks, gotchas, Ollama settings, tool calling format, and implementation patterns from Simon and AI_Visualizer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 18:14:19 -04:00
commit 5011059f5d
9 changed files with 861 additions and 0 deletions
@@ -0,0 +1,105 @@
 # Gemma 4 Architecture Reference
 > Sources: Google DeepMind blog, HuggingFace blog (huggingface.co/blog/gemma4),
 > Maarten Grootendorst visual guide, kaitchup.substack.com, wavespeed.ai
 ## Model Family
 | Variant | Total Params | Effective Params | Type | Notes |
 |---------|-------------|-----------------|------|-------|
 | E2B | ~5.1B | ~2.3B | Dense + PLE | On-device, audio+vision |
 | E4B | ~8B | ~4B | Dense + PLE | On-device, audio+vision |
 | 31B | 31B | 31B | Dense | 60 layers, widened vs Gemma 3 27B (62 layers) |
 | 26B A4B | 26B | ~4B active | MoE | 128 experts, 8 active + 1 shared |
 ## Attention Architecture
 - **Pattern:** Local (sliding window) interleaved with global attention
  - E2B: 4:1 ratio (4 local, 1 global). E4B/31B/26B: 5:1 ratio
  - Global attention is always the last layer
 - **Sliding window:** E2B/E4B = 512 tokens; 31B/26B = 1024 tokens
 - **Grouped Query Attention (GQA):**
  - Local: 2 query heads share 1 KV head
  - Global: 8 query heads share 1 KV head, doubled Key dimensions
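 A minimal sketch of the interleaving described above, assuming the pattern repeats strictly and the layer count is a multiple of the period (true for 60 layers at 5:1). `layer_types` is a hypothetical helper for illustration, not a name taken from any Gemma 4 codebase.
```python
def layer_types(num_layers: int, local_per_global: int) -> list[str]:
    """Return one attention type per layer for a given local:global ratio."""
    period = local_per_global + 1
    # Every `period`-th layer is global; all other layers use sliding-window attention.
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]

layout = layer_types(60, 5)          # 31B: 60 layers at a 5:1 ratio
print(layout[:6])                    # five local layers followed by one global layer
print(layout.count("global"), "global layers; last layer is", layout[-1])
```
 With these inputs the final layer comes out global, which matches the "always the last layer" note.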
 ## Positional Encoding: Proportional RoPE (p-RoPE)
 - Applied to global attention layers only
 - p=0.25 -> rotates only 25% of head dimensions
 - theta=1M
 - 75% of dimensions are position-independent -> better long-context extrapolation
 - Replaces Gemma 3's 8x linear frequency scaling
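 To make "rotates only 25% of head dimensions" concrete, here is a toy partial-rotary function in plain Python. Which dimensions get rotated (the first 25% here) and the adjacent-pair layout are assumptions made for illustration; real kernels may pick and pair dimensions differently.
```python
import math

def partial_rope(vec: list[float], position: int, p: float = 0.25,
                 theta: float = 1_000_000.0) -> list[float]:
    """Rotate the first p fraction of dimensions; pass the rest through unchanged."""
    dim = len(vec)
    rotary = int(dim * p) // 2 * 2           # even number of rotated dimensions
    out = list(vec)
    for i in range(0, rotary, 2):
        angle = position * theta ** (-i / rotary)   # standard RoPE frequency schedule
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out[i] = x * cos_a - y * sin_a
        out[i + 1] = x * sin_a + y * cos_a
    return out

head = [1.0] * 16
rotated = partial_rope(head, position=5)
print(rotated[:4])   # position-dependent values
print(rotated[4:])   # untouched: identical at every position
```
 The untouched dimensions carry no positional signal at all, which is the property the list above credits for better long-context extrapolation.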
 ## Per-Layer Embeddings (PLE) — E2B/E4B Only
 - Each decoder layer gets its own unique token representation
 - Parallel lower-dimensional pathway alongside main residual stream
 - PLE dimensions: 256 (E2B), 2560 (E4B)
 - Original embedding dimensions: 1536 (E2B), 2560 (E4B)
 - Applied between decoder blocks with gating function
 - This is why E2B has 5.1B total but only 2.3B effective — the PLE table is large
 ## Shared KV Cache
 - Last N layers reuse K/V tensors from earlier layers (same attention type)
 - Reported to cause no measurable quality loss in practice
 - Significant memory + compute savings for long-context generation
 ## Vision Encoder
 - Params: 150M (E2B/E4B), 550M (31B/26B)
 - Patch size: 16x16 pixels
 - 3x3 neighboring patches merged into single embedding
 - Uses 2D RoPE for variable aspect ratio
 - Token budgets: 70, 140, 280, 560, 1120 soft tokens
 - Approximate resolutions: 272x176 (70 tokens) -> 1088x704 (1120 tokens)
 ## Audio Encoder — E2B/E4B Only
 - Conformer architecture with convolutional modules
 - Mel-spectrogram feature extraction
 - Two 2D conv layers for downsampling
 - NOT available on 31B or 26B variants
 ## MoE Details (26B A4B)
 - 128 total experts
 - 8 experts activated per token
 - 1 shared expert (3x size of regular experts)
 - The other 119 experts sit idle for any given token (different tokens route to different experts, so all of them must stay loaded)
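 A rough sketch of per-token top-k routing with the numbers above. The softmax over the selected experts is an assumption borrowed from common MoE designs, and the random logits stand in for a learned router; the shared expert is not modelled because it runs for every token regardless.
```python
import math
import random

NUM_EXPERTS = 128   # routed experts
TOP_K = 8           # experts activated per token

def route(logits: list[float], k: int = TOP_K) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]   # subtract peak for stability
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # hypothetical router output
chosen = route(router_logits)
print(len(chosen), "of", NUM_EXPERTS, "routed experts run for this token")
```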
 ## Context Window
 | Variant | Context Window | MRCR v2 8-needle @ 128K |
 |---------|---------------|------------------------|
 | E2B | 128K | 19.1% |
 | E4B | 128K | 25.4% |
 | 26B A4B | 256K | 44.1% |
 | 31B | 256K | 66.4% |
 | Gemma 3 27B | 128K | 13.5% |
 - Ollama default num_ctx: 2048 (must override!)
 - Retrieval accuracy diminishes beyond ~100K tokens in repetitive/unstructured text
 ## Vocabulary
 - SentencePiece tokenizer, 262,144 tokens (2^18, written as either 256K or 262K; up from 256,000 in Gemma 1 and 2)
 ## Memory Requirements (approximate)
 | Model | BF16 | 8-bit | 4-bit |
 |-------|------|-------|-------|
 | E2B | 9.6 GB | 4.6 GB | 3.2 GB |
 | E4B | 15 GB | 7.5 GB | 5 GB |
 | 31B Dense | 58.3 GB | 30.4 GB | 17.4 GB |
 | 26B A4B (MoE) | 48 GB | 25 GB | 15.6 GB |
 Note: 26B MoE requires ALL 26B params loaded despite only activating ~4B per token.
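 A quick sanity check of the BF16 column, assuming the table is in GiB and using the parameter counts Ollama reports (31.3B and 25.8B). `weight_gib` is a hypothetical helper; it covers weights only, not KV cache or activations.
```python
GIB = 1024 ** 3

def weight_gib(params_billion: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_param / 8 / GIB

for name, params in [("31B Dense", 31.3), ("26B A4B", 25.8)]:
    print(f"{name}: BF16 ~ {weight_gib(params, 16):.1f} GiB")
```
 The 8-bit and 4-bit columns come out higher than this naive bits/8 arithmetic, so it only sanity-checks the BF16 figures.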
 ## License
 Apache 2.0 — major change from Gemma 3's proprietary "Gemma Terms of Use". No custom clauses, no redistribution restrictions.
 ## Training Data Cutoff
 January 2025
@@ -0,0 +1,40 @@
 # Gemma 4 Benchmarks
 > Source: Google DeepMind model card, HuggingFace blog, LMArena
 > Released: April 2, 2026
 ## Gemma 4 vs Gemma 3 (biggest single-version jump in Gemma family)
 | Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B A4B | Delta (31B vs G3) |
 |-----------|------------|------------|----------------|-------------------|
 | MMLU Pro | 67.6% | 85.2% | 82.6% | +17.6 |
 | AIME 2026 (no tools) | 20.8% | 89.2% | 88.3% | +68.4 |
 | GPQA Diamond | 42.4% | 84.3% | 82.3% | +41.9 |
 | BigBench Extra Hard | 19.3% | 74.4% | 64.8% | +55.1 |
 | LiveCodeBench v6 | 29.1% | 80.0% | 77.1% | +50.9 |
 | Codeforces ELO | 110 | 2150 | 1718 | +2040 |
 | MMMU Pro (vision) | 49.7% | 76.9% | 73.8% | +27.2 |
 | MATH-Vision | 46.0% | 85.6% | 82.4% | +39.6 |
 | OmniDocBench (lower=better) | 0.365 | 0.131 | 0.149 | -0.234 |
 | MRCR v2 128K | 13.5% | 66.4% | 44.1% | +52.9 |
 | MMMLU (multilingual) | 70.7% | 88.4% | 86.3% | +17.7 |
 ## Arena Scores
 | Model | LMArena Score | Rank |
 |-------|--------------|------|
 | Gemma 4 31B | 1452 | #3 |
 | Gemma 4 26B A4B | 1441 | #6 |
 ## Agentic Benchmark (tau2-bench)
 | Model | Score |
 |-------|-------|
 | 31B | 86.4% |
 | 26B A4B | 85.5% |
 | E4B | 57.5% |
 | E2B | 29.4% |
 ## Takeaway
 The jump from Gemma 3 to 4 is enormous — AIME went from 20.8% to 89.2%, Codeforces from 110 to 2150 ELO. This is not an incremental update. The 26B MoE nearly matches 31B Dense on most benchmarks while using ~4B active params.
@@ -0,0 +1,55 @@
 # Gemma 4 Capabilities Reference
 ## Modalities
 ### Text (all variants)
 - Standard instruction-following, chat, completion
 - System prompt support (critical — see synthesis)
 - 128K context window on E2B/E4B, 256K on 26B A4B and 31B
 - 262K vocabulary
 ### Vision (all variants)
 - **Tested and verified working** (Seth, 2026-04-10)
 - Accurately described colors, shapes, composition in 256x256 test image
 - ~25 tok/s, ~24s end-to-end on pve197 V100
 - Input: base64-encoded image in `images` field of Ollama API
 - Vision encoder: 16x16 patches, 2D RoPE, variable aspect ratio
 - Token budgets scale with resolution (70-1120 soft tokens)
 - Used in AI_Visualizer for SDXL frame quality criticism
 ### Audio (E2B/E4B only)
 - **Not tested by Seth** — status unknown in practice
 - Conformer architecture (~300M params), mel-spectrogram input
 - **Trained on SPEECH ONLY — not music or environmental sounds**
 - Maximum 30 seconds per clip
 - NOT available on 26B or 31B variants
 - AI_Visualizer explicitly rejected audio for music analysis (DECISIONS S2) — correct call, model wasn't trained for it
 ### Video (all variants)
 - E2B/E4B: video WITH audio (`load_audio_from_video=True`)
 - 31B/26B: video WITHOUT audio
 - Not explicitly post-trained on video but works
 - Maximum 60 seconds at 1 frame/second
 - Not tested by Seth
 ### Tool Calling / Function Calling
 - **Verified reliable** in both Simon and AI_Visualizer
 - Ollama native tool format (OpenAI-compatible function calling)
 - Simon: 6 genealogy tools, up to 12 sequential iterations
 - Supports parallel tool calls in single response
 - Weak at deeply nested JSON schemas -> prefer sequential calls
 ## Benchmark Context (vs Gemma 3)
 - 31B replaces Gemma 3 27B (60 layers vs 62, but wider)
 - MoE variant (26B) is new — no Gemma 3 equivalent
 - E-series with PLE is new — on-device focus
 - Proportional RoPE replaces linear frequency scaling -> better long-context
 - Shared KV cache is new -> more efficient inference
 ## What Gemma 4 Does NOT Do
 - No native code execution / sandboxing
 - No web browsing or retrieval
 - Audio only on E-series (not the models most people run)
 - No built-in RAG — tool calling can implement it
@@ -0,0 +1,42 @@
 # Gemma 4 on Ollama — Available Variants
 > Last verified against Seth's homelab: 2026-04-12
 ## Ollama Model Tags
 | Tag | Params | Quant | Size on Disk | VRAM | Notes |
 |-----|--------|-------|-------------|------|-------|
 | `gemma4:e4b-it-q8_0` | ~8B total / 4B effective | Q8_0 | 11.6GB | ~12GB | Vision + audio capable. ~25 tok/s on V100 |
 | `gemma4:26b` | 25.8B | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti |
 | `gemma4:31b-it-q4_K_M` | 31.3B | Q4_K_M | 19.9GB | ~24.5GB | Sharpest but 5x slower (~28 tok/s on 3090 Ti, memory pressure) |
 ## Capabilities by Variant (from `ollama show`)
 All variants support:
 - Text generation (completion, chat)
 - Vision (image input via base64 in `images` field)
 - Tool/function calling (native Ollama tool format)
 E-series (E2B, E4B) additionally support:
 - Audio input (conformer encoder)
 ## GPU Coexistence (pve197 V100 32GB)
 - gemma4:26b + SDXL Turbo: ~28.5GB peak VRAM — fits on V100-32GB
 - gemma4:31b: 24.5GB alone — memory pressure with any coexisting model
 - gemma4:e4b-it-q8_0: ~12GB — comfortable headroom
 ## Ollama API Endpoint
 - `/api/generate` (single-turn, used by AI_Visualizer)
 - `/api/chat` (multi-turn with message history, used by Simon)
 - Both accept `tools`, `images`, `stream`, `options`, `keep_alive`
 ## Important Ollama Defaults to Override
 | Parameter | Ollama Default | Recommended | Why |
 |-----------|---------------|-------------|-----|
 | `num_ctx` | 2048 | 4096-32768 | Default is absurdly small, causes truncation |
 | `num_predict` | 128 | 512-4096+ | Default truncates almost all useful output |
 | `think` | true (Ollama 0.20+) | false | See GOTCHAS doc |
 | `keep_alive` | 5m | 30m-4h | Prevents expensive model reload between calls |
@@ -0,0 +1,100 @@
 # Gemma 4 Native Tool Calling Format
 > Source: Google AI for Developers - Function Calling docs
 > https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
 ## Special Tokens (6 total)
 | Token | Purpose |
 |-------|---------|
 | `<\|tool>` / `<tool\|>` | Tool definition block |
 | `<\|tool_call>` / `<tool_call\|>` | Model's tool request |
 | `<\|tool_response>` / `<tool_response\|>` | Tool execution result |
 String delimiter: `<|"|>` (encloses all string values in native format)
 ## Native Format (raw model tokens)
 ### Tool definition in system prompt:
 ```
 <|tool>declaration:
 get_current_temperature{
  location:{type:<|"|>string<|"|>,description:<|"|>The city<|"|>},
  unit:{type:<|"|>string<|"|>,enum:[<|"|>celsius<|"|>,<|"|>fahrenheit<|"|>]}
 }<tool|>
 ```
 ### Tool call from model:
 ```
 <|tool_call>call:get_current_temperature{location:<|"|>London<|"|>}<tool_call|>
 ```
 ### Tool response:
 ```
 <|tool_response>response:get_current_temperature{temperature:15,weather:<|"|>sunny<|"|>}<tool_response|>
 ```
 ## JSON Chat Format (for Ollama / OpenAI-compatible APIs)
 This is what you actually use in practice. Ollama translates to/from native tokens.
 ### Tool definition:
 ```json
 {
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {"type": "string", "description": "The city name"}
      },
      "required": ["city"]
    }
  }
 }
 ```
 ### Model returns:
 ```json
 {
  "role": "assistant",
  "tool_calls": [{
    "function": {
      "name": "get_weather",
      "arguments": {"city": "London"}
    }
  }]
 }
 ```
 ### Tool result message:
 ```json
 {
  "role": "tool",
  "content": "{\"temperature\": 15, \"weather\": \"sunny\"}"
 }
 ```
 ## Thinking Mode + Tool Calls
 - When thinking is enabled, preserve thoughts between tool calls
 - For long agent chains, summarize thoughts as plain text to save context
 - Recommended: **disable thinking for tool-heavy workflows** (Seth's finding)
 ## Framework Flags
 | Framework | Required Flag |
 |-----------|--------------|
 | llama.cpp | `--jinja` |
 | vLLM | `--enable-auto-tool-choice` |
 | Ollama | Works via `/api/chat` endpoint with `tools` field |
 | transformers | `apply_chat_template(tools=[...])` |
 ## Known Issues
 - Ollama v0.20.0-0.20.1: tool call parser broken, streaming drops tool calls
 - llama.cpp: format mismatches and continuous loops reported
 - LM Studio: compatibility issues with tool calling
 - **Workaround:** Use non-streaming mode for tool calls (proven in Simon)
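 The non-streaming workaround and the JSON message shapes above can be sketched as a loop. `run_agent` and `fake_chat` are hypothetical names; `chat` stands for whatever function performs one non-streaming `/api/chat` call and returns the assistant message. This sketch raises at the iteration limit, whereas Simon falls back to a streamed, tool-free reply.
```python
import json
from typing import Callable

MAX_TOOL_ITERATIONS = 12   # the limit Simon uses

def run_agent(chat: Callable[[list[dict]], dict],
              tools: dict[str, Callable[..., object]],
              messages: list[dict]) -> str:
    """Drive a non-streaming tool loop until the model answers in plain text."""
    for _ in range(MAX_TOOL_ITERATIONS):
        reply = chat(messages)                 # one non-streaming round trip
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply.get("content", "")
        for call in calls:                     # parallel calls arrive in one reply
            fn = call["function"]
            result = tools[fn["name"]](**fn["arguments"])
            messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("tool iteration limit reached without a final answer")

def fake_chat(history: list[dict]) -> dict:
    # Stand-in for the HTTP call: request one tool, then answer.
    if not any(m["role"] == "tool" for m in history):
        return {"role": "assistant", "content": "",
                "tool_calls": [{"function": {"name": "get_weather",
                                             "arguments": {"city": "London"}}}]}
    return {"role": "assistant", "content": "It is 15 degrees and sunny in London."}

answer = run_agent(fake_chat,
                   {"get_weather": lambda city: {"temperature": 15, "weather": "sunny"}},
                   [{"role": "user", "content": "Weather in London?"}])
print(answer)
```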
@@ -0,0 +1,205 @@
 # Gemma 4 Gotchas & Known Issues
 > Derived from Seth's production implementations (Simon, AI_Visualizer)
 > and community reports. These are hard-won lessons.
 ## CRITICAL: Thinking Mode Eats Context
 **Severity: HIGH — causes silent failures**
 Gemma 4 in Ollama 0.20+ defaults to `think: true`. When enabled:
 - Thinking tokens go into a hidden `thinking` field, NOT `response`
 - If `num_predict` is limited, thinking consumes the entire budget
 - `response` comes back **empty** — no error, just silence
 - On evaluative tasks, thinking inflates scores (31B scored a known-bad image 9/10 with thinking vs 7/10 without)
 **Fix:** Always pass `think: false` in the Ollama payload. Seth has had success ONLY with thinking off.
 ```json
 {
  "model": "gemma4:26b",
  "think": false,
  "options": { "num_predict": 4096 }
 }
 ```
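 Since the failure is silent, it can help to check for it explicitly. A small sketch, assuming the reply is the parsed JSON from `/api/generate` with the `response` and `thinking` fields described above; `build_payload` and `check_response` are hypothetical helper names.
```python
def build_payload(model: str, prompt: str, num_predict: int = 2048) -> dict:
    """Request body with thinking disabled explicitly."""
    return {
        "model": model,
        "prompt": prompt,
        "think": False,
        "stream": False,
        "options": {"num_predict": num_predict},
    }

def check_response(reply: dict) -> str:
    """Return the answer text, or raise if thinking swallowed the output budget."""
    text = (reply.get("response") or "").strip()
    if not text and reply.get("thinking"):
        raise ValueError("empty response with non-empty thinking: "
                         "set think=False or raise num_predict")
    return text

payload = build_payload("gemma4:26b", "Summarise this track.")
print(payload["think"], payload["options"])
```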
 ## CRITICAL: format=json Causes Infinite Loops
 **Severity: HIGH — hangs indefinitely**
 Ollama's server-side `format: "json"` enforcer causes Gemma 4 26B (Q4) to enter an infinite retry loop when the requested schema is deeply nested.
 **Fix:** Never use `format: "json"`. Instead:
 1. Request JSON structure in the prompt text
 2. Parse client-side with regex + `json.loads` + json5 fallback
 ```python
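 import json
 # DO THIS: leave `format` unset (the projects' own flag for this is format_json=False)
 # and parse the reply client-side. `client` is assumed to behave like ollama.Client.
 response = client.generate(model="gemma4:26b", prompt=prompt)
 body = response["response"]
 obj = json.loads(body[body.find("{"):body.rfind("}") + 1])
 # NOT THIS: server-side JSON enforcement hangs on nested schemas
 # response = client.generate(model="gemma4:26b", prompt=prompt, format="json")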
 ```
 ## CRITICAL: Ollama Default Context is 2048
 **Severity: HIGH — causes truncation**
 Ollama defaults `num_ctx` to 2048 tokens. Gemma 4 supports 128K (E2B/E4B) or 256K (26B/31B). If you don't override the default, your prompts get silently truncated.
 **Fix:** Always set `num_ctx` explicitly:
 ```json
 { "options": { "num_ctx": 8192 } }
 ```
 Scale to your needs: 4096 for simple tasks, 16384 for long inputs, 32768 for complex multi-turn.
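 One way to pick a value instead of guessing: estimate the prompt size, add the output budget (generated tokens share the same window), and round up to the next tier. The 4-characters-per-token ratio is a crude assumption, and `pick_num_ctx` is a hypothetical helper.
```python
CTX_TIERS = (4096, 8192, 16384, 32768)   # the sizes mentioned above

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; swap in a real tokenizer if accuracy matters."""
    return int(len(text) / chars_per_token) + 1

def pick_num_ctx(prompt: str, num_predict: int) -> int:
    """Smallest tier that holds the prompt plus the requested output budget."""
    needed = estimate_tokens(prompt) + num_predict
    for tier in CTX_TIERS:
        if needed <= tier:
            return tier
    raise ValueError(f"~{needed} tokens exceeds the largest tier {CTX_TIERS[-1]}")

print(pick_num_ctx("x" * 4_000, num_predict=2048))
print(pick_num_ctx("x" * 40_000, num_predict=4096))
```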
 ## HIGH: num_predict Default is 128
 **Severity: HIGH — truncates output**
 Ollama defaults `num_predict` to 128 tokens. Almost any useful Gemma 4 output exceeds this.
 **Fix:** Always set `num_predict` explicitly. Minimum recommended: 512. For JSON output: 2048+.
 ## MEDIUM: Weak at Long/Nested JSON
 **Severity: MEDIUM — causes parse failures**
 Gemma 4 reliably produces short JSON (5-10 fields) but struggles with:
 - Deeply nested schemas (3+ levels)
 - Long arrays (20+ items)
 - Mixed nesting + length
 **Fix:** Sequential tool calls. Break one large JSON request into multiple smaller calls:
 - Instead of "generate a 50-item storyboard", do "generate items 1-5", "generate items 6-10", etc.
 - Because Gemma 4 is fast and free to run locally, sequential calls are cheap
 **Fallback pattern (AI_Visualizer):**
 ```python
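 import json

 def generate_with_retry():
     # call_gemma, parse_json and the constants come from the surrounding project
     for attempt in range(MAX_RETRIES):
         temp = BASE_TEMP + attempt * TEMP_BUMP  # 0.4 -> 0.5 -> 0.6
         response = call_gemma(temperature=temp)
         try:
             return parse_json(response)
         except json.JSONDecodeError:
             continue  # malformed JSON: retry at a higher temperature
     raise ValueError(f"no valid JSON after {MAX_RETRIES} attempts")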
 ```
 ## MEDIUM: Identity Confusion
 **Severity: MEDIUM — cosmetic but confusing**
 Gemma 4 is ultra-compliant and highly capable but does not know who it is. It may:
 - Claim to be a different model
 - Hallucinate capabilities it doesn't have
 - Respond as a generic "AI assistant" without personality
 **Fix:** Explicit identity in system prompt:
 ```
 You are [Name], a [role]. You are powered by Gemma 4.
 You ONLY do [X]. You NEVER do [Y].
 ```
 Gemma 4 does NOT need hand-holding on task execution — it's very capable.
 It needs explicit instructions about identity and boundaries.
 ## MEDIUM: Flash Attention Hang on 31B Dense (>3-4K tokens)
 **Severity: MEDIUM — hardware-specific, affects RTX 3090**
 Community-reported: Flash Attention causes Gemma 4 31B Dense to hang indefinitely during prompt evaluation when the prompt exceeds ~3-4K tokens. The 26B MoE variant handles the same prompts fine — bug is specific to the Dense model.
 **Source:** [ollama/ollama#15350](https://github.com/ollama/ollama/issues/15350)
 **Fix:** Use 26B for long prompts, or disable Flash Attention if running 31B on affected hardware.
 ## MEDIUM: Tool Calling Broken in Ollama v0.20.0 Streaming
 **Severity: MEDIUM — version-specific**
 As of early April 2026, Gemma 4 tool calling has issues in Ollama v0.20.0: the tool call parser fails and streaming drops tool calls entirely. Community reports include format mismatches and continuous loops in llama.cpp / LM Studio.
 **Source:** [community reports](https://dev.to/dentity007/-gemma-4-after-24-hours-what-the-community-found-vs-what-google-promised-3a2f)
 **Fix:** Use non-streaming for tool calls (Simon does this). Test tool calling thoroughly when upgrading Ollama versions. Seth's implementations work reliably with non-streaming tool calls.
 ## MEDIUM: VRAM-Hungry for Context
 **Severity: MEDIUM — affects hardware planning**
 Gemma 4 KV cache is large relative to competitors. Community reports: 31B at 262K context requires ~22GB just for KV cache on top of model weights. One user could only fit Gemma 3 27B Q4 with 20K context on a 5090, while Qwen 3.5 27B Q4 fit with 190K context on the same card.
 **Implication:** Don't set num_ctx higher than you need. 32K is plenty for most tasks and keeps VRAM reasonable.
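 For hardware planning, a back-of-envelope estimator shows why context size dominates. The 10 global / 50 local split follows from 60 layers at 5:1 with a 1024-token window; `kv_heads=8` and `head_dim=256` are purely hypothetical placeholders, and the sketch ignores shared KV layers and the doubled key dimensions on global layers.
```python
GIB = 1024 ** 3

def kv_cache_gib(ctx: int, global_layers: int, local_layers: int, window: int,
                 kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size: K and V for every cached token in every layer."""
    per_token = 2 * kv_heads * head_dim * bytes_per_value   # K + V, one layer
    global_tokens = ctx                    # global layers cache the full context
    local_tokens = min(ctx, window)        # sliding-window layers cap at the window
    total = per_token * (global_layers * global_tokens + local_layers * local_tokens)
    return total / GIB

for ctx in (32_768, 262_144):
    size = kv_cache_gib(ctx, global_layers=10, local_layers=50, window=1024,
                        kv_heads=8, head_dim=256)
    print(f"num_ctx={ctx}: ~{size:.1f} GiB of KV cache")
```
 Whatever the real head counts are, the global layers scale linearly with `num_ctx` while the local layers stay flat, so trimming `num_ctx` is the main lever.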
 ## MEDIUM: Safety Overfiltering
 **Severity: MEDIUM — blocks benign prompts**
 Strict safety alignment occasionally blocks technical, academic, or creative prompts that superficially resemble restricted categories. Conversely, one user reported that a basic system prompt was enough to jailbreak the model.
 **Fix:** Rephrase blocked prompts to avoid trigger patterns. For system prompts, avoid language that sounds like you're asking the model to bypass restrictions — just state the task directly.
 ## MEDIUM: KV Cache Config Bug (31B/26B ship with num_kv_shared_layers=0)
 **Severity: MEDIUM — crashes on first attention forward pass**
 The 31B and 26B ship with `num_kv_shared_layers = 0`, which causes `layer_types[:-0]` to collapse to zero layer slots. Crashes on first forward pass.
 **Fix:** Patch the config. Check model card discussions for the exact fix.
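 The root cause is ordinary Python slicing: `-0` is `0`, so `[:-0]` means `[:0]`. The snippet below only demonstrates that behaviour with a toy list; `unshared_layers` is a hypothetical guard, not the upstream fix.
```python
layer_types = ["local"] * 5 + ["global"]

def unshared_layers(types: list[str], num_shared: int) -> list[str]:
    """Layers that keep their own K/V, with the zero case handled explicitly."""
    # types[:-0] is types[:0], an empty list, so zero must be special-cased.
    return types[:-num_shared] if num_shared > 0 else list(types)

print(len(layer_types[:-0]))                  # the buggy slice
print(len(unshared_layers(layer_types, 0)))   # the guarded version
print(len(unshared_layers(layer_types, 2)))
```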
 ## LOW: vLLM Triton Fallback (~9 tok/s on RTX 4090)
 **Severity: LOW — vLLM-specific**
 Heterogeneous attention head dimensions in Gemma 4 force vLLM to fall back to a slow Triton kernel. RTX 4090 gets ~9 tok/s instead of expected ~100+.
 **Source:** [vllm-project/vllm#38887](https://github.com/vllm-project/vllm/issues/38887)
 **Fix:** Use Ollama instead of vLLM for now, or wait for the fix.
 ## LOW: `<unused>` Token Infinite Loop (Vulkan backends)
 **Severity: LOW — Vulkan-specific**
 Gemma 4 can generate `<unused>` or `<unused24>` tokens in an infinite loop on Vulkan backends in llama.cpp.
 **Source:** [ggml-org/llama.cpp#21516](https://github.com/ggml-org/llama.cpp/issues/21516)
 ## LOW: Fine-Tuning Ecosystem Issues
 **Severity: LOW — only relevant if fine-tuning**
 Day-one issues for fine-tuners:
 - HuggingFace Transformers didn't recognize gemma4 architecture (required install from source)
 - PEFT couldn't handle Gemma4ClippableLinear (new vision encoder layer type)
 - New `mm_token_type_ids` field required during training even for text-only data
 - E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug)
 ## LOW: Vision Validator Overrejects
 **Severity: LOW — specific to evaluative vision tasks**
 In AI_Visualizer, Gemma 4 vision was used to critique SDXL frames. It flagged images for motif-matching failures that humans rated as equal or better than passed images. The validator was queued for disable.
 **Pattern:** Gemma 4 vision is good at description but unreliable for subjective quality scoring. Use it for "what's in this image?" not "is this image good?"
 ## LOW: Keep-Alive Too Short
 **Severity: LOW — performance only**
 Default `keep_alive` is 5 minutes. If your pipeline has gaps (e.g., waiting for SDXL generation), the model gets unloaded and reloaded (~10-30s penalty).
 **Fix:** Set `keep_alive` to match your pipeline duration:
 ```json
 { "keep_alive": "4h" }
 ```
 Or pin/unpin explicitly:
 ```python
 client.generate(model="gemma4:26b", prompt="", keep_alive=-1, options={"num_predict": 0})  # pin
 # ... do work ...
 client.generate(model="gemma4:26b", prompt="", keep_alive=0, options={"num_predict": 0})    # unpin
 ```
@@ -0,0 +1,95 @@
 # Gemma 4 Implementation Reference
 > Patterns extracted from Seth's two production Gemma 4 projects.
 ## Project: Simon (FreibergFamily/simon/)
 **Purpose:** AI genealogy historian — multi-turn chat with tool-calling agent
 | Setting | Value |
 |---------|-------|
 | Model | `gemma4:26b` |
 | API | `/api/chat` (multi-turn) |
 | num_ctx | 32768 |
 | num_predict | 4096 |
 | temperature | 1.0 |
 | top_p | 0.95 |
 | top_k | 64 |
 | keep_alive | 4h |
 | think | (not explicitly set — should be false) |
 | format_json | not used |
 | Vision | not used |
 | Tool calling | 6 tools, max 12 iterations |
 ### Key Patterns
 1. **Aggressive system prompt:** 40+ lines defining identity, boundaries, tool usage rules, multi-step chaining requirements. Gemma 4 follows all of it.
 2. **Tool chaining instructions:** System prompt explicitly tells Gemma to chain tools (e.g., "after lookup_person, ALSO call get_historical_context"). Gemma 4 follows these multi-step chains reliably.
 3. **Parallel tool calls:** Encouraged in system prompt for multiple lookups. Gemma 4 does this.
 4. **History pruning:** Drops old tool results and tool-call messages, keeps assistant summaries. Prevents context bloat in multi-turn.
 5. **Fallback to streaming:** After 12 tool iterations, switches to stream mode (no tools) to force a text response.
 6. **Two modes (historian vs interview):** Completely different system prompts swapped at runtime. Gemma 4 stays in character for both.
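 Pattern 4 is easy to get wrong, so here is a minimal sketch of the pruning step. It assumes the Ollama chat message shapes (`role`, `content`, `tool_calls`) and drops every tool exchange; how Simon decides which results count as "old" is not documented here, and `prune_history` is a hypothetical name.
```python
def prune_history(messages: list[dict]) -> list[dict]:
    """Drop tool results and tool-call requests; keep everything else."""
    kept = []
    for msg in messages:
        if msg.get("role") == "tool":
            continue                       # raw tool output
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            continue                       # the request that triggered a tool
        kept.append(msg)
    return kept

history = [
    {"role": "system", "content": "You are Simon."},
    {"role": "user", "content": "Who was Anna?"},
    {"role": "assistant", "content": "", "tool_calls": [{"function": {"name": "lookup_person", "arguments": {}}}]},
    {"role": "tool", "content": "{\"born\": 1880}"},
    {"role": "assistant", "content": "Anna was born in 1880."},
]
pruned = prune_history(history)
print([m["role"] for m in pruned])
```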
 ---
 ## Project: AI Visualizer (AI_Visualizer/)
 **Purpose:** Music-reactive video generator with Gemma 4 as the reasoning engine across 4 pipeline stages, plus a vision validator
 | Stage | num_predict | num_ctx | temperature | Purpose |
 |-------|-------------|---------|-------------|---------|
 | Mood Analysis | 4096 | 16384 | 0.4-0.6 | Analyze CLAP descriptors -> narratives + boundary adjustments |
 | Rate Pass | 512 | 4096 | 0.3-0.5 | Choose visual pacing rate per music segment |
 | Storyboard | 2048 | 4096 | 0.6-0.8 | Generate SDXL prompts per music segment |
 | Batch Expansion | 2048 | default | 0.7 | Interpolate between scene prompts over time |
 | Vision Validator | 256 | default | 0.2 | Critique generated frames (queued for disable) |
 ### Key Patterns
 1. **No tool calling used.** All Gemma interaction is single-turn generate with JSON requested in prompt.
 2. **Client-side JSON extraction:**
   ```python
   body = response["response"]
   start = body.find("{")
   end = body.rfind("}")
   obj = json.loads(body[start:end + 1])
   ```
 3. **Temperature ramping on retry:** Base temp + bump per attempt. Conservative first, creative on retry.
 4. **think: false everywhere.** Explicitly set on every call. Critical for budget control.
 5. **format_json: false everywhere.** Causes infinite loops on nested schemas.
 6. **Model pinning:** `keep_alive=-1` to prevent GPU eviction during long SDXL pauses.
 7. **Explicit num_ctx:** Added after discovering Ollama defaults to 2048, which truncated mood analyzer prompts on long tracks.
 8. **Banned vocabulary in prompts:** List of cliche words (cinematic, dramatic, ethereal...) passed to Gemma to avoid generic output.
 9. **Vision for image critique:** Base64-encoded PNG -> structured SCORE/ISSUE/REASON output parsed by regex. Works but overrejects on subjective quality.
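 For pattern 9, a sketch of what regex parsing of a SCORE/ISSUE/REASON reply could look like. The exact line format is an assumption (the real prompt and parser are not shown in this corpus), and `parse_verdict` is a hypothetical name.
```python
import re

VERDICT_RE = re.compile(
    r"SCORE:\s*(?P<score>\d+)(?:\s*/\s*10)?.*?"
    r"ISSUE:\s*(?P<issue>.*?)\s*"
    r"REASON:\s*(?P<reason>.*)",
    re.DOTALL | re.IGNORECASE,
)

def parse_verdict(text: str) -> dict:
    """Extract score, issue and reason from a structured critique."""
    match = VERDICT_RE.search(text)
    if match is None:
        raise ValueError("reply did not contain SCORE/ISSUE/REASON")
    return {
        "score": int(match["score"]),
        "issue": match["issue"].strip(),
        "reason": match["reason"].strip(),
    }

verdict = parse_verdict("SCORE: 7/10\nISSUE: none\nREASON: Matches the motif.")
print(verdict)
```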
 ---
 ## Common Settings Across Both Projects
 ```json
 {
  "model": "gemma4:26b",
  "think": false,
  "options": {
    "num_ctx": 4096,
    "num_predict": 2048,
    "temperature": 0.5
  },
  "keep_alive": "30m"
 }
 ```
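 Adjust num_ctx/num_predict upward for your payload size. These are safe minimums, not the literal settings of either project: Simon, for example, runs with num_ctx 32768, temperature 1.0 and keep_alive 4h.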
@@ -0,0 +1,25 @@
 # gemma4-research
 Research corpus and implementation guidance for Google Gemma 4, based on production use in Seth's homelab.
 ## Files
 | File | What | When to Read |
 |------|------|-------------|
 | `SYNTHESIS.md` | **Start here.** Opinionated guide — how to build with Gemma 4 | Before any new Gemma 4 implementation |
 | `GOTCHAS.md` | Known issues and workarounds, severity-ranked | When debugging Gemma 4 issues or starting a new project |
 | `IMPLEMENTATIONS.md` | Patterns from Simon and AI_Visualizer | When designing a new Gemma 4 integration |
 | `CORPUS_architecture.md` | Model architecture details (layers, attention, PLE, MoE) | When you need to understand WHY Gemma 4 behaves a certain way |
 | `CORPUS_ollama_variants.md` | Available models, sizes, VRAM, Ollama settings | When choosing a model variant or configuring Ollama |
 | `CORPUS_capabilities.md` | Modalities (vision, audio, video, tools), what it can/can't do | When scoping what Gemma 4 can handle |
 | `CORPUS_benchmarks.md` | Full benchmark table vs Gemma 3, arena scores, agentic scores | When comparing Gemma 4 to alternatives |
 | `CORPUS_tool_calling_format.md` | Native token format + JSON API format for function calling | When implementing tool calling |
 ## Source Projects
 - **Simon** (`~/bin/FreibergFamily/simon/`) — Multi-turn chat agent with 6 tools, genealogy historian
 - **AI Visualizer** (`~/bin/AI_Visualizer/`) — Music video generator, 4-stage Gemma pipeline + vision
 ## Key Insight
 Gemma 4 is ultra-compliant and highly capable but doesn't know who it is. It needs explicit system prompts, not hand-holding. Due to fast local inference, sequential tool calls beat long JSON requests.
@@ -0,0 +1,194 @@
 # Gemma 4 Synthesis — How to Build With It
 > Opinionated guide based on two production implementations and ongoing use.
 > Seth Freiberg, 2026-04-12
 ## The One-Paragraph Summary
 Gemma 4 is an ultra-compliant, highly capable model that doesn't know who it is. It doesn't need hand-holding on tasks but needs explicit instructions in the system prompt about identity, boundaries, and output format. It needs `num_ctx` and `num_predict` raised (Ollama defaults are absurdly low), `think` set to false (thinking eats the output budget), and `format: json` avoided entirely (causes infinite loops). Because local inference is fast and free, sequential tool calls are the ideal solution to tasks that would otherwise require long structured output.
 ## Mental Model
 Think of Gemma 4 as a very competent employee on their first day. They can do the work — you don't need to explain how. But you DO need to explain:
 - Who they are and what their job is
 - What they should and should NOT do
 - Exactly what format you want the deliverable in
 - The boundaries of their role
 Get those right and Gemma 4 just works. Get them wrong and you get a generic chatbot.
 ## Mandatory Ollama Settings
 Every Gemma 4 call MUST include:
 ```json
 {
  "think": false,
  "options": {
    "num_ctx": 4096,
    "num_predict": 2048
  }
 }
 ```
 **Why each one:**
 - `think: false` — Ollama 0.20+ defaults to think:true. Thinking tokens consume num_predict budget invisibly, returning empty responses. Seth has ONLY had success with thinking off.
 - `num_ctx: 4096+` — Ollama defaults to 2048. Your system prompt alone might exceed that.
 - `num_predict: 2048+` — Ollama defaults to 128. Any structured output gets truncated.
 Scale these to your task. The values above are safe minimums, not recommendations.
 ## System Prompt Template
 ```
 You are [NAME], a [ROLE DESCRIPTION].
 ## What You Do
 - [Explicit list of responsibilities]
 - [Tools you have access to and when to use each one]
 ## What You Do NOT Do
 - [Explicit list of things to refuse or avoid]
 - [Common mistakes to prevent]
 ## Output Format
 [Exact schema, field names, example if complex]
 Respond with ONLY [format]. No prose outside the [format].
 ## Rules
 - [Behavioral constraints]
 - [Multi-step chaining instructions if using tools]
 Today's date: [DATE]
 ```
 **Key principles:**
 1. Identity first — who is this agent?
 2. Positive instructions before negative (what TO do before what NOT to do)
 3. Output format is explicit and complete — Gemma 4 follows schemas faithfully
 4. "No prose outside the JSON" prevents wrapper text that breaks parsing
 5. Date injection helps with temporal reasoning
 ## Tool Calling Strategy
 Gemma 4 is **reliable for tool calling** but **weak at structuring long JSONs**.
 ### When to use tool calling (Ollama native)
 - Multi-turn agents with 2-10 tools
 - Sequential reasoning chains (lookup A -> use A to decide B -> lookup B)
 - Any task where the model needs to gather information before responding
 ### When to use prompt-based JSON instead
 - Single-turn generation with known output structure
 - When you need specific JSON schema control
 - When the output is a payload (prompts, configs) not a conversation
 ### The Sequential Pattern
 Instead of asking Gemma 4 to produce one massive JSON:
 ```
 BAD:  "Generate a 50-scene storyboard as JSON"  -> truncated/malformed
 GOOD: "Generate scenes 1-5 as JSON" x10         -> reliable every time
 ```
 Gemma 4's inference speed makes sequential calls cheap. A 10-call chain at ~134 tok/s on a 3090 Ti costs seconds, not minutes. This is the fundamental advantage of local models — latency is predictable and network-free.
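 Splitting the request is mostly bookkeeping. A small sketch that turns "50 scenes, 5 per call" into the individual prompts; `batch_ranges` is a hypothetical helper and the prompt wording is only an example.
```python
def batch_ranges(total: int, size: int) -> list[tuple[int, int]]:
    """1-based inclusive (start, end) ranges covering `total` items."""
    return [(start, min(start + size - 1, total))
            for start in range(1, total + 1, size)]

prompts = [f"Generate scenes {a}-{b} as JSON." for a, b in batch_ranges(50, 5)]
print(len(prompts))
print(prompts[0])
print(prompts[-1])
```
 Each prompt then goes through its own generate call, and the parsed results are concatenated client-side.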
 ## JSON Extraction Pattern
 Since `format: "json"` is broken, always extract client-side:
 ```python
 # Python
 import json
 raw = response["response"]
 start = raw.find("{")
 end = raw.rfind("}")
 if start < 0 or end <= start:
     raise ValueError("no JSON object in model output")
 obj = json.loads(raw[start:end + 1])
 ```
 ```javascript
 // JavaScript
 const raw = response.message.content;
 const match = raw.match(/\{[\s\S]*\}/);
 const obj = match ? JSON.parse(match[0]) : null;  // null when no object is present
 ```
 For arrays, find `[` and `]` instead. Add json5 fallback for trailing commas.
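 A combined sketch covering objects, arrays and trailing commas. The regex repair is a stdlib stand-in for a json5 fallback (an assumption, and it can mangle string values that happen to contain `,}` or `,]`); `extract_json` is a hypothetical helper name.
```python
import json
import re

def extract_json(raw: str):
    """Parse the first {...} or [...] span found in a model reply."""
    spans = []
    for open_ch, close_ch in (("{", "}"), ("[", "]")):
        start, end = raw.find(open_ch), raw.rfind(close_ch)
        if 0 <= start < end:
            spans.append((start, end))
    if not spans:
        raise ValueError("no JSON object or array found")
    start, end = min(spans)               # whichever opener appears first
    snippet = raw[start:end + 1]
    try:
        return json.loads(snippet)
    except json.JSONDecodeError:
        # Fallback: strip trailing commas before a closing bracket and retry.
        return json.loads(re.sub(r",\s*([}\]])", r"\1", snippet))

print(extract_json('Sure! {"a": 1, "b": [1, 2,],} Hope that helps.'))
print(extract_json('Here you go: [{"n": 1}, {"n": 2}] done'))
```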
 ## Temperature Guidelines
 | Task Type | Temperature | Why |
 |-----------|-------------|-----|
 | Evaluation / scoring | 0.2 | Consistent, reproducible judgments |
 | Structured extraction | 0.3-0.4 | Faithful to schema |
 | Creative generation | 0.6-0.8 | Variety without chaos |
 | Conversation / chat | 0.7-1.0 | Natural feel |
 Retry strategy: bump temp +0.1 per retry to escape format failures.
 ## Vision Usage
 **Works for:** Describing image contents (objects, colors, composition, text)
 **Unreliable for:** Subjective quality scoring, aesthetic judgment
 ```python
 import base64
 with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("ascii")
 response = client.generate(
    model="gemma4:26b",
    prompt="Describe this image in detail.",
    images=[b64],
    think=False,
    options={"temperature": 0.2, "num_predict": 512}
 )
 ```
 Vision is on ALL Gemma 4 variants (E2B, E4B, 26B, 31B). Audio is E-series only.
 ## Context Management
 ### Multi-turn (chat agents)
 - Prune old tool results and tool-call messages
 - Keep assistant's natural-language summaries
 - Set num_ctx to 32768 for rich conversations
 - Set a tool iteration limit (12 is proven) with streaming fallback
 ### Single-turn (pipeline stages)
 - Calculate your prompt size and set num_ctx accordingly
 - For long inputs (full track analysis), use recursive splitting at natural boundaries
 - Pin model with `keep_alive=-1` if pipeline has idle gaps
 ## Model Selection
 | Use Case | Recommended | Why |
 |----------|------------|-----|
 | Production pipeline (needs GPU coexistence) | `gemma4:26b` | Best quality/speed/VRAM balance |
 | On-device / edge | `gemma4:e4b-it-q8_0` | 12GB VRAM, vision+audio |
 | Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Sharpest but slow under memory pressure |
 | Rapid prototyping / testing | `gemma4:26b` | Fast enough for interactive dev |
 ## Anti-Patterns
 1. **Don't use `format: "json"`** — infinite loops on nested schemas
 2. **Don't leave `think` at default** — eats your output budget silently
 3. **Don't leave `num_predict` at default** — 128 tokens is nothing
 4. **Don't leave `num_ctx` at default** — 2048 truncates most prompts
 5. **Don't ask for huge JSON in one call** — break into sequential calls
 6. **Don't use thinking mode for evaluation** — inflates scores, wastes context
 7. **Don't skip system prompt identity** — Gemma 4 becomes a generic chatbot
 8. **Don't use audio on 26B/31B** — only E-series has audio encoder
 ## Quick-Start Checklist
 - [ ] Set `think: false`
 - [ ] Set `num_predict` >= 512 (2048+ for JSON output)
 - [ ] Set `num_ctx` >= 4096 (scale to your prompt size)
 - [ ] Write explicit system prompt with identity + boundaries + output format
 - [ ] Extract JSON client-side (no `format: "json"`)
 - [ ] Set `keep_alive` >= 30m (or pin with -1)
 - [ ] For long structured output, use sequential calls
 - [ ] For vision, pass base64 in `images` array
 - [ ] Test with your actual prompt length — Ollama won't warn about truncation