# Gemma 4 Synthesis — How to Build With It > Opinionated guide based on two production implementations and ongoing use. > Seth Freiberg, 2026-04-12 ## The One-Paragraph Summary Gemma 4 is an ultra-compliant, highly-capable model that doesn't know who it is. It doesn't need hand-holding on tasks but needs explicit instructions in the system prompt about identity, boundaries, and output format. It needs `num_predict` increased (Ollama defaults are absurdly low), `think` set to false (thinking eats the context budget), and `format: json` avoided entirely (causes infinite loops). Due to its fast speed and free local inference, sequential tool calls are the ideal solution to tasks that would otherwise require long structured output. > **For canonical upstream source (model cards, chat templates, serving commands, > fine-tuning recipes, specialized siblings like EmbeddingGemma/ShieldGemma): see > `tooling/README.md`.** That directory is 147 files / 14 MB of first-party material > pulled from Google / Hugging Face / framework maintainers. This SYNTHESIS is the > opinionated digest; `tooling/` is the receipts. ## Mental Model Think of Gemma 4 as a very competent employee on their first day. They can do the work — you don't need to explain how. But you DO need to explain: - Who they are and what their job is - What they should and should NOT do - Exactly what format you want the deliverable in - The boundaries of their role Get those right and Gemma 4 just works. Get them wrong and you get a generic chatbot. ## Mandatory Ollama Settings Every Gemma 4 call MUST include: ```json { "think": false, "options": { "num_ctx": 4096, "num_predict": 2048 } } ``` **Why each one:** - `think: false` — Ollama 0.20+ defaults to think:true. Thinking tokens consume num_predict budget invisibly, returning empty responses. Seth has ONLY had success with thinking off. - `num_ctx: 4096+` — Ollama defaults to 2048. Your system prompt alone might exceed that. - `num_predict: 2048+` — Ollama defaults to 128. Any structured output gets truncated. Scale these to your task. The values above are safe minimums, not recommendations. ## System Prompt Template ``` You are [NAME], a [ROLE DESCRIPTION]. ## What You Do - [Explicit list of responsibilities] - [Tools you have access to and when to use each one] ## What You Do NOT Do - [Explicit list of things to refuse or avoid] - [Common mistakes to prevent] ## Output Format [Exact schema, field names, example if complex] Respond with ONLY [format]. No prose outside the [format]. ## Rules - [Behavioral constraints] - [Multi-step chaining instructions if using tools] Today's date: [DATE] ``` **Key principles:** 1. Identity first — who is this agent? 2. Positive instructions before negative (what TO do before what NOT to do) 3. Output format is explicit and complete — Gemma 4 follows schemas faithfully 4. "No prose outside the JSON" prevents wrapper text that breaks parsing 5. Date injection helps with temporal reasoning ## Tool Calling Strategy Gemma 4 is **reliable for tool calling** but **weak at structuring long JSONs**. ### When to use tool calling (Ollama native) - Multi-turn agents with 2-10 tools - Sequential reasoning chains (lookup A -> use A to decide B -> lookup B) - Any task where the model needs to gather information before responding ### When to use prompt-based JSON instead - Single-turn generation with known output structure - When you need specific JSON schema control - When the output is a payload (prompts, configs) not a conversation ### The Sequential Pattern Instead of asking Gemma 4 to produce one massive JSON: ``` BAD: "Generate a 50-scene storyboard as JSON" -> truncated/malformed GOOD: "Generate scenes 1-5 as JSON" x10 -> reliable every time ``` Gemma 4's inference speed makes sequential calls cheap. A 10-call chain at ~134 tok/s on a 3090 Ti costs seconds, not minutes. This is the fundamental advantage of local models — latency is predictable and network-free. ## JSON Extraction Pattern Since `format: "json"` is broken, always extract client-side: ```python # Python import json raw = response["response"] start = raw.find("{") end = raw.rfind("}") if start >= 0 and end > start: obj = json.loads(raw[start:end + 1]) ``` ```javascript // JavaScript const raw = response.message.content; const match = raw.match(/\{[\s\S]*\}/); if (match) obj = JSON.parse(match[0]); ``` For arrays, find `[` and `]` instead. Add json5 fallback for trailing commas. ## Temperature Guidelines | Task Type | Temperature | Why | |-----------|-------------|-----| | Evaluation / scoring | 0.2 | Consistent, reproducible judgments | | Structured extraction | 0.3-0.4 | Faithful to schema | | Creative generation | 0.6-0.8 | Variety without chaos | | Conversation / chat | 0.7-1.0 | Natural feel | Retry strategy: bump temp +0.1 per retry to escape format failures. ## Vision Usage **Works for:** Describing image contents (objects, colors, composition, text) **Unreliable for:** Subjective quality scoring, aesthetic judgment ```python import base64 with open("image.png", "rb") as f: b64 = base64.b64encode(f.read()).decode("ascii") response = client.generate( model="gemma4:26b", prompt="Describe this image in detail.", images=[b64], think=False, options={"temperature": 0.2, "num_predict": 512} ) ``` Vision is on ALL Gemma 4 variants (E2B, E4B, 26B, 31B). Audio is E-series only. ## Context Management ### Multi-turn (chat agents) - Prune old tool results and tool-call messages - Keep assistant's natural-language summaries - Set num_ctx to 32768 for rich conversations - Set a tool iteration limit (12 is proven) with streaming fallback ### Single-turn (pipeline stages) - Calculate your prompt size and set num_ctx accordingly - For long inputs (full track analysis), use recursive splitting at natural boundaries - Pin model with `keep_alive=-1` if pipeline has idle gaps ## Model Selection | Use Case | Recommended | Why | |----------|------------|-----| | Production pipeline (needs GPU coexistence) | `gemma4:26b` | MoE (3.8B active), fast, good quality/VRAM balance | | On-device / edge | `gemma4:e4b-it-q8_0` | 12GB VRAM, vision+audio (audio via llama.cpp only) | | Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Dense 31B, sharpest but 5x slower, more VRAM pressure | | Rapid prototyping / testing | `gemma4:26b` | Fast enough for interactive dev | | Retrieval / embeddings | `embeddinggemma` (308M, separate model) | Gemma 4 has no embedding mode; use the sibling | ## Anti-Patterns 1. **Don't use `format: "json"`** — infinite loops on nested schemas 2. **Don't leave `think` at default** — eats your output budget silently 3. **Don't leave `num_predict` at default** — 128 tokens is nothing 4. **Don't leave `num_ctx` at default** — 2048 truncates most prompts 5. **Don't ask for huge JSON in one call** — break into sequential calls 6. **Don't use thinking mode for evaluation** — inflates scores, wastes context 7. **Don't skip system prompt identity** — Gemma 4 becomes a generic chatbot 8. **Don't use audio on 26B/31B** — only E-series has audio encoder ## Quick-Start Checklist - [ ] Set `think: false` - [ ] Set `num_predict` >= 512 (2048+ for JSON output) - [ ] Set `num_ctx` >= 4096 (scale to your prompt size) - [ ] Write explicit system prompt with identity + boundaries + output format - [ ] Extract JSON client-side (no `format: "json"`) - [ ] Set `keep_alive` >= 30m (or pin with -1) - [ ] For long structured output, use sequential calls - [ ] For vision, pass base64 in `images` array - [ ] Test with your actual prompt length — Ollama won't warn about truncation