docs: initial Gemma 4 research corpus and synthesis
Architecture specs, benchmarks, gotchas, Ollama settings, tool calling format, and implementation patterns from Simon and AI_Visualizer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,105 @@
|
|||||||
|
# Gemma 4 Architecture Reference
|
||||||
|
|
||||||
|
> Sources: Google DeepMind blog, HuggingFace blog (huggingface.co/blog/gemma4),
|
||||||
|
> Maarten Grootendorst visual guide, kaitchup.substack.com, wavespeed.ai
|
||||||
|
|
||||||
|
## Model Family
|
||||||
|
|
||||||
|
| Variant | Total Params | Effective Params | Type | Notes |
|---------|-------------|-----------------|------|-------|
| E2B | ~5.1B | ~2.3B | Dense + PLE | On-device, audio+vision |
| E4B | ~8B | ~4B | Dense + PLE | On-device, audio+vision |
| 31B | 31B | 31B | Dense | 60 layers, widened vs Gemma 3 27B (62 layers) |
| 26B A4B | 26B | ~4B active | MoE | 128 experts, 8 active + 1 shared |
|
||||||
|
|
||||||
|
## Attention Architecture
|
||||||
|
|
||||||
|
- **Pattern:** Local (sliding window) interleaved with global attention
|
||||||
|
- E2B: 4:1 ratio (4 local, 1 global). E4B/31B/26B: 5:1 ratio
|
||||||
|
- The last layer is always a global-attention layer
|
||||||
|
- **Sliding window:** E2B/E4B = 512 tokens; 31B/26B = 1024 tokens
|
||||||
|
- **Grouped Query Attention (GQA):**
|
||||||
|
- Local: 2 query heads share 1 KV head
|
||||||
|
- Global: 8 query heads share 1 KV head, doubled Key dimensions
|
||||||
|
|
||||||
|
## Positional Encoding: Proportional RoPE (p-RoPE)
|
||||||
|
|
||||||
|
- Applied to global attention layers only
|
||||||
|
- p=0.25 -> rotates only 25% of head dimensions
|
||||||
|
- theta=1M
|
||||||
|
- 75% of dimensions are position-independent -> better long-context extrapolation
|
||||||
|
- Replaces Gemma 3's 8x linear frequency scaling
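
A rough illustration of the bullets above: the sketch builds per-pair rotation frequencies where only a fraction `p` of the pairs rotate and the rest are pinned to zero, which makes them position-independent. The function name, `head_dim=256`, and the choice to keep the highest-frequency pairs are assumptions for illustration, not taken from the Gemma 4 source.

```python
def prope_inv_freq(head_dim: int, p: float = 0.25, theta: float = 1_000_000.0) -> list[float]:
    """Per-pair inverse frequencies; 0.0 means the pair is never rotated."""
    n_pairs = head_dim // 2
    n_rotated = int(p * n_pairs)  # pairs that still carry position information
    freqs = []
    for i in range(n_pairs):
        if i < n_rotated:
            freqs.append(theta ** (-2.0 * i / head_dim))  # standard RoPE schedule
        else:
            freqs.append(0.0)  # angle = position * 0 = 0 -> position-independent
    return freqs


freqs = prope_inv_freq(head_dim=256)  # head_dim is a hypothetical value
rotated = sum(1 for f in freqs if f > 0.0)
print(f"{rotated} of {len(freqs)} pairs rotate")
```

With `p=0.25`, a quarter of the pairs get a non-zero frequency, which matches the "25% of head dimensions" bullet.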
|
||||||
|
|
||||||
|
## Per-Layer Embeddings (PLE) — E2B/E4B Only
|
||||||
|
|
||||||
|
- Each decoder layer gets its own unique token representation
|
||||||
|
- Parallel lower-dimensional pathway alongside main residual stream
|
||||||
|
- PLE dimensions: 256 (E2B), 2560 (E4B)
|
||||||
|
- Original embedding dimensions: 1536 (E2B), 2560 (E4B)
|
||||||
|
- Applied between decoder blocks with gating function
|
||||||
|
- This is why E2B has 5.1B total but only 2.3B effective — the PLE table is large
|
||||||
|
|
||||||
|
## Shared KV Cache
|
||||||
|
|
||||||
|
- Last N layers reuse K/V tensors from earlier layers (same attention type)
|
||||||
|
- No quality loss in practice
|
||||||
|
- Significant memory + compute savings for long-context generation
|
||||||
|
|
||||||
|
## Vision Encoder
|
||||||
|
|
||||||
|
- Params: 150M (E2B/E4B), 550M (31B/26B)
|
||||||
|
- Patch size: 16x16 pixels
|
||||||
|
- 3x3 neighboring patches merged into single embedding
|
||||||
|
- Uses 2D RoPE for variable aspect ratio
|
||||||
|
- Token budgets: 70, 140, 280, 560, 1120 soft tokens
|
||||||
|
- Approximate resolutions: 272x176 (70 tokens) -> 1088x704 (1120 tokens)
|
||||||
|
|
||||||
|
## Audio Encoder — E2B/E4B Only
|
||||||
|
|
||||||
|
- Conformer architecture with convolutional modules
|
||||||
|
- Mel-spectrogram feature extraction
|
||||||
|
- Two 2D conv layers for downsampling
|
||||||
|
- NOT available on 31B or 26B variants
|
||||||
|
|
||||||
|
## MoE Details (26B A4B)
|
||||||
|
|
||||||
|
- 128 total experts
|
||||||
|
- 8 experts activated per token
|
||||||
|
- 1 shared expert (3x size of regular experts)
|
||||||
|
- 119 experts idle for any given token (routing is per token, so a batch can still touch many experts)
|
||||||
|
|
||||||
|
## Context Window
|
||||||
|
|
||||||
|
| Variant | Context Window | MRCR v2 8-needle @ 128K |
|---------|---------------|------------------------|
| E2B | 128K | 19.1% |
| E4B | 128K | 25.4% |
| 26B A4B | 256K | 44.1% |
| 31B | 256K | 66.4% |
| Gemma 3 27B | 128K | 13.5% |
|
||||||
|
|
||||||
|
- Ollama default num_ctx: 2048 (must override!)
|
||||||
|
- Retrieval accuracy diminishes beyond ~100K tokens in repetitive/unstructured text
|
||||||
|
|
||||||
|
## Vocabulary
|
||||||
|
|
||||||
|
- SentencePiece tokenizer, 262,144 tokens (2^18, usually rounded to 262K; Gemma 1 and 2 used roughly 256K)
|
||||||
|
|
||||||
|
## Memory Requirements (approximate)
|
||||||
|
|
||||||
|
| Model | BF16 | 8-bit | 4-bit |
|-------|------|-------|-------|
| E2B | 9.6 GB | 4.6 GB | 3.2 GB |
| E4B | 15 GB | 7.5 GB | 5 GB |
| 31B Dense | 58.3 GB | 30.4 GB | 17.4 GB |
| 26B A4B (MoE) | 48 GB | 25 GB | 15.6 GB |
|
||||||
|
|
||||||
|
Note: 26B MoE requires ALL 26B params loaded despite only activating ~4B per token.
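
The BF16 column is close to a weights-only estimate if the table's "GB" are read as GiB (an assumption; the sources do not say). A minimal sketch, using the 31.3B and 25.8B parameter counts listed for the Ollama tags and ignoring KV cache and runtime overhead:

```python
def weight_gib(params_billions: float, bits_per_param: float) -> float:
    """Weights-only footprint in GiB; excludes KV cache, activations, overhead."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 2**30


for name, params in [("31B Dense", 31.3), ("26B A4B", 25.8)]:
    print(f"{name}: {weight_gib(params, 16):.1f} GiB at BF16")
```

The 8-bit and 4-bit columns sit above this naive estimate, presumably because quantized files keep some tensors at higher precision, so treat the function as a lower bound.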
|
||||||
|
|
||||||
|
## License
|
||||||
|
|
||||||
|
Apache 2.0 — major change from Gemma 3's proprietary "Gemma Terms of Use". No custom clauses, no redistribution restrictions.
|
||||||
|
|
||||||
|
## Training Data Cutoff
|
||||||
|
|
||||||
|
January 2025
|
||||||
@@ -0,0 +1,40 @@
|
|||||||
|
# Gemma 4 Benchmarks
|
||||||
|
|
||||||
|
> Source: Google DeepMind model card, HuggingFace blog, LMArena
|
||||||
|
> Released: April 2, 2026
|
||||||
|
|
||||||
|
## Gemma 4 vs Gemma 3 (biggest single-version jump in Gemma family)
|
||||||
|
|
||||||
|
| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B A4B | Delta (31B vs G3) |
|-----------|------------|------------|----------------|-------------------|
| MMLU Pro | 67.6% | 85.2% | 82.6% | +17.6 |
| AIME 2026 (no tools) | 20.8% | 89.2% | 88.3% | +68.4 |
| GPQA Diamond | 42.4% | 84.3% | 82.3% | +41.9 |
| BigBench Extra Hard | 19.3% | 74.4% | 64.8% | +55.1 |
| LiveCodeBench v6 | 29.1% | 80.0% | 77.1% | +50.9 |
| Codeforces ELO | 110 | 2150 | 1718 | +2040 |
| MMMU Pro (vision) | 49.7% | 76.9% | 73.8% | +27.2 |
| MATH-Vision | 46.0% | 85.6% | 82.4% | +39.6 |
| OmniDocBench (lower=better) | 0.365 | 0.131 | 0.149 | -0.234 |
| MRCR v2 128K | 13.5% | 66.4% | 44.1% | +52.9 |
| MMMLU (multilingual) | 70.7% | 88.4% | 86.3% | +17.7 |
|
||||||
|
|
||||||
|
## Arena Scores
|
||||||
|
|
||||||
|
| Model | LMArena Score | Rank |
|-------|--------------|------|
| Gemma 4 31B | 1452 | #3 |
| Gemma 4 26B A4B | 1441 | #6 |
|
||||||
|
|
||||||
|
## Agentic Benchmark (tau2-bench)
|
||||||
|
|
||||||
|
| Model | Score |
|-------|-------|
| 31B | 86.4% |
| 26B A4B | 85.5% |
| E4B | 57.5% |
| E2B | 29.4% |
|
||||||
|
|
||||||
|
## Takeaway
|
||||||
|
|
||||||
|
The jump from Gemma 3 to 4 is enormous — AIME went from 20.8% to 89.2%, Codeforces from 110 to 2150 ELO. This is not an incremental update. The 26B MoE nearly matches 31B Dense on most benchmarks while using ~4B active params.
|
||||||
@@ -0,0 +1,55 @@
|
|||||||
|
# Gemma 4 Capabilities Reference
|
||||||
|
|
||||||
|
## Modalities
|
||||||
|
|
||||||
|
### Text (all variants)
|
||||||
|
- Standard instruction-following, chat, completion
|
||||||
|
- System prompt support (critical — see synthesis)
|
||||||
|
- Context window: 128K (E2B/E4B) or 256K (26B A4B/31B), per the architecture reference
|
||||||
|
- 262K vocabulary
|
||||||
|
|
||||||
|
### Vision (all variants)
|
||||||
|
- **Tested and verified working** (Seth, 2026-04-10)
|
||||||
|
- Accurately described colors, shapes, composition in 256x256 test image
|
||||||
|
- ~25 tok/s, ~24s end-to-end on pve197 V100
|
||||||
|
- Input: base64-encoded image in `images` field of Ollama API
|
||||||
|
- Vision encoder: 16x16 patches, 2D RoPE, variable aspect ratio
|
||||||
|
- Token budgets scale with resolution (70-1120 soft tokens)
|
||||||
|
- Used in AI_Visualizer for SDXL frame quality criticism
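
As a sketch of the `images` input described above, the helper below builds a non-streaming `/api/generate` request body with one base64-encoded image. The helper name, the option values, and the localhost URL in the comment are assumptions for illustration; the placeholder bytes stand in for a real PNG.

```python
import base64
import json


def build_vision_payload(image_bytes: bytes, prompt: str, model: str = "gemma4:26b") -> dict:
    """Build an /api/generate body with one image attached as base64 text."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
        "think": False,  # keep thinking off so the output budget goes to the answer
        "options": {"num_ctx": 8192, "num_predict": 512},  # illustrative values
    }


payload = build_vision_payload(b"\x89PNG placeholder bytes", "Describe this image.")
body = json.dumps(payload)  # POST this to e.g. http://localhost:11434/api/generate
```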
|
||||||
|
|
||||||
|
### Audio (E2B/E4B only)
|
||||||
|
- **Not tested by Seth** — status unknown in practice
|
||||||
|
- Conformer architecture (~300M params), mel-spectrogram input
|
||||||
|
- **Trained on SPEECH ONLY — not music or environmental sounds**
|
||||||
|
- Maximum 30 seconds per clip
|
||||||
|
- NOT available on 26B or 31B variants
|
||||||
|
- AI_Visualizer explicitly rejected audio for music analysis (DECISIONS S2) — correct call, model wasn't trained for it
|
||||||
|
|
||||||
|
### Video (all variants)
|
||||||
|
- E2B/E4B: video WITH audio (`load_audio_from_video=True`)
|
||||||
|
- 31B/26B: video WITHOUT audio
|
||||||
|
- Not explicitly post-trained on video but works
|
||||||
|
- Maximum 60 seconds at 1 frame/second
|
||||||
|
- Not tested by Seth
|
||||||
|
|
||||||
|
### Tool Calling / Function Calling
|
||||||
|
- **Verified reliable** in both Simon and AI_Visualizer
|
||||||
|
- Ollama native tool format (OpenAI-compatible function calling)
|
||||||
|
- Simon: 6 genealogy tools, up to 12 sequential iterations
|
||||||
|
- Supports parallel tool calls in single response
|
||||||
|
- Weak at deeply nested JSON schemas -> prefer sequential calls
|
||||||
|
|
||||||
|
## Benchmark Context (vs Gemma 3)
|
||||||
|
|
||||||
|
- 31B replaces Gemma 3 27B (60 layers vs 62, but wider)
|
||||||
|
- MoE variant (26B) is new — no Gemma 3 equivalent
|
||||||
|
- E-series with PLE is new — on-device focus
|
||||||
|
- Proportional RoPE replaces linear frequency scaling -> better long-context
|
||||||
|
- Shared KV cache is new -> more efficient inference
|
||||||
|
|
||||||
|
## What Gemma 4 Does NOT Do
|
||||||
|
|
||||||
|
- No native code execution / sandboxing
|
||||||
|
- No web browsing or retrieval
|
||||||
|
- Audio only on E-series (not the models most people run)
|
||||||
|
- No built-in RAG — tool calling can implement it
|
||||||
@@ -0,0 +1,42 @@
|
|||||||
|
# Gemma 4 on Ollama — Available Variants
|
||||||
|
|
||||||
|
> Last verified against Seth's homelab: 2026-04-12
|
||||||
|
|
||||||
|
## Ollama Model Tags
|
||||||
|
|
||||||
|
| Tag | Params | Quant | Size on Disk | VRAM | Notes |
|-----|--------|-------|-------------|------|-------|
| `gemma4:e4b-it-q8_0` | ~8B total / 4B effective | Q8_0 | 11.6GB | ~12GB | Vision + audio capable. ~25 tok/s on V100 |
| `gemma4:26b` | 25.8B | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti |
| `gemma4:31b-it-q4_K_M` | 31.3B | Q4_K_M | 19.9GB | ~24.5GB | Sharpest but 5x slower (~28 tok/s on 3090 Ti, memory pressure) |
|
||||||
|
|
||||||
|
## Capabilities by Variant (from `ollama show`)
|
||||||
|
|
||||||
|
All variants support:
|
||||||
|
- Text generation (completion, chat)
|
||||||
|
- Vision (image input via base64 in `images` field)
|
||||||
|
- Tool/function calling (native Ollama tool format)
|
||||||
|
|
||||||
|
E-series (E2B, E4B) additionally support:
|
||||||
|
- Audio input (conformer encoder)
|
||||||
|
|
||||||
|
## GPU Coexistence (pve197 V100 32GB)
|
||||||
|
|
||||||
|
- gemma4:26b + SDXL Turbo: ~28.5GB peak VRAM — fits on V100-32GB
|
||||||
|
- gemma4:31b: 24.5GB alone — memory pressure with any coexisting model
|
||||||
|
- gemma4:e4b-it-q8_0: ~12GB — comfortable headroom
|
||||||
|
|
||||||
|
## Ollama API Endpoint
|
||||||
|
|
||||||
|
- `/api/generate` (single-turn, used by AI_Visualizer)
|
||||||
|
- `/api/chat` (multi-turn with message history, used by Simon)
|
||||||
|
- Both accept `tools`, `images`, `stream`, `options`, `keep_alive`
|
||||||
|
|
||||||
|
## Important Ollama Defaults to Override
|
||||||
|
|
||||||
|
| Parameter | Ollama Default | Recommended | Why |
|-----------|---------------|-------------|-----|
| `num_ctx` | 2048 | 4096-32768 | Default is absurdly small, causes truncation |
| `num_predict` | 128 | 512-4096+ | Default truncates almost all useful output |
| `think` | true (Ollama 0.20+) | false | See GOTCHAS doc |
| `keep_alive` | 5m | 30m-4h | Prevents expensive model reload between calls |
|
||||||
@@ -0,0 +1,100 @@
|
|||||||
|
# Gemma 4 Native Tool Calling Format
|
||||||
|
|
||||||
|
> Source: Google AI for Developers - Function Calling docs
|
||||||
|
> https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
|
||||||
|
|
||||||
|
## Special Tokens (6 total)
|
||||||
|
|
||||||
|
| Token | Purpose |
|-------|---------|
| `<\|tool>` / `<tool\|>` | Tool definition block |
| `<\|tool_call>` / `<tool_call\|>` | Model's tool request |
| `<\|tool_response>` / `<tool_response\|>` | Tool execution result |
|
||||||
|
|
||||||
|
String delimiter: `<\|"\|>` (encloses all string values in native format)
|
||||||
|
|
||||||
|
## Native Format (raw model tokens)
|
||||||
|
|
||||||
|
### Tool definition in system prompt:
|
||||||
|
```
|
||||||
|
<|tool>declaration:
|
||||||
|
get_current_temperature{
|
||||||
|
location:{type:<|"|>string<|"|>,description:<|"|>The city<|"|>},
|
||||||
|
unit:{type:<|"|>string<|"|>,enum:[<|"|>celsius<|"|>,<|"|>fahrenheit<|"|>]}
|
||||||
|
}<tool|>
|
||||||
|
```
|
||||||
|
|
||||||
|
### Tool call from model:
|
||||||
|
```
|
||||||
|
<|tool_call>call:get_current_temperature{location:<|"|>London<|"|>}<tool_call|>
|
||||||
|
```
|
||||||
|
|
||||||
|
### Tool response:
|
||||||
|
```
<|tool_response>response:get_current_temperature{temperature:15,weather:<|"|>sunny<|"|>}<tool_response|>
```
|
||||||
|
|
||||||
|
## JSON Chat Format (for Ollama / OpenAI-compatible APIs)
|
||||||
|
|
||||||
|
This is what you actually use in practice. Ollama translates to/from native tokens.
|
||||||
|
|
||||||
|
### Tool definition:
|
||||||
|
```json
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {"type": "string", "description": "The city name"}
      },
      "required": ["city"]
    }
  }
}
```
|
||||||
|
|
||||||
|
### Model returns:
|
||||||
|
```json
{
  "role": "assistant",
  "tool_calls": [{
    "function": {
      "name": "get_weather",
      "arguments": {"city": "London"}
    }
  }]
}
```
|
||||||
|
|
||||||
|
### Tool result message:
|
||||||
|
```json
{
  "role": "tool",
  "content": "{\"temperature\": 15, \"weather\": \"sunny\"}"
}
```
|
||||||
|
|
||||||
|
## Thinking Mode + Tool Calls
|
||||||
|
|
||||||
|
- When thinking is enabled, preserve thoughts between tool calls
|
||||||
|
- For long agent chains, summarize thoughts as plain text to save context
|
||||||
|
- Recommended: **disable thinking for tool-heavy workflows** (Seth's finding)
|
||||||
|
|
||||||
|
## Framework Flags
|
||||||
|
|
||||||
|
| Framework | Required Flag |
|-----------|--------------|
| llama.cpp | `--jinja` |
| vLLM | `--enable-auto-tool-choice` |
| Ollama | Works via `/api/chat` endpoint with `tools` field |
| transformers | `apply_chat_template(tools=[...])` |
|
||||||
|
|
||||||
|
## Known Issues
|
||||||
|
|
||||||
|
- Ollama v0.20.0-0.20.1: tool call parser broken, streaming drops tool calls
|
||||||
|
- llama.cpp: format mismatches and continuous loops reported
|
||||||
|
- LM Studio: compatibility issues with tool calling
|
||||||
|
- **Workaround:** Use non-streaming mode for tool calls (proven in Simon)
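
A minimal sketch of a non-streaming tool loop built on the JSON chat format above. The `chat` callable is a hypothetical stand-in for a non-streaming POST to `/api/chat` that returns the assistant message; `run_tool_loop`, `registry`, and `fake_chat` are illustrative names, not Simon's actual code.

```python
import json
from typing import Callable


def run_tool_loop(chat: Callable[[list, list], dict], messages: list, tools: list,
                  registry: dict, max_iters: int = 12) -> str:
    """Call `chat` until the model answers in text or max_iters is reached."""
    for _ in range(max_iters):
        reply = chat(messages, tools)  # expected to return the assistant message
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply.get("content", "")
        for call in calls:
            fn = call["function"]
            result = registry[fn["name"]](**fn["arguments"])
            messages.append({"role": "tool", "content": json.dumps(result)})
    return ""  # caller decides how to force a final text answer


def fake_chat(messages, tools):
    # Stand-in for the real API call: asks for a tool once, then answers in text.
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "tool_calls": [
            {"function": {"name": "get_weather", "arguments": {"city": "London"}}}]}
    return {"role": "assistant", "content": "It is sunny in London."}


history = [{"role": "user", "content": "Weather in London?"}]
answer = run_tool_loop(
    fake_chat, history, tools=[],
    registry={"get_weather": lambda city: {"temperature": 15, "weather": "sunny"}},
)
print(answer)
```

Injecting `chat` as a parameter keeps the loop testable without a running Ollama server.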
|
||||||
@@ -0,0 +1,205 @@
|
|||||||
|
# Gemma 4 Gotchas & Known Issues
|
||||||
|
|
||||||
|
> Derived from Seth's production implementations (Simon, AI_Visualizer)
|
||||||
|
> and community reports. These are hard-won lessons.
|
||||||
|
|
||||||
|
## CRITICAL: Thinking Mode Eats Context
|
||||||
|
|
||||||
|
**Severity: HIGH — causes silent failures**
|
||||||
|
|
||||||
|
Gemma 4 in Ollama 0.20+ defaults to `think: true`. When enabled:
|
||||||
|
- Thinking tokens go into a hidden `thinking` field, NOT `response`
|
||||||
|
- If `num_predict` is limited, thinking consumes the entire budget
|
||||||
|
- `response` comes back **empty** — no error, just silence
|
||||||
|
- On evaluative tasks, thinking inflates scores (31B scored a known-bad image 9/10 with thinking vs 7/10 without)
|
||||||
|
|
||||||
|
**Fix:** Always pass `think: false` in the Ollama payload. Seth has had success ONLY with thinking off.
|
||||||
|
|
||||||
|
```json
{
  "model": "gemma4:26b",
  "think": false,
  "options": { "num_predict": 4096 }
}
```
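
Because the failure is silent, it can help to check for it explicitly. The sketch below assumes the result dict from `/api/generate` carries the `response` and `thinking` fields described above; `extract_text` is a hypothetical helper name.

```python
def extract_text(result: dict) -> str:
    """Return the visible answer, or raise if thinking used up the whole budget."""
    text = (result.get("response") or "").strip()
    if text:
        return text
    if result.get("thinking"):
        raise RuntimeError(
            "empty response but non-empty thinking: set think=false or raise num_predict"
        )
    raise RuntimeError("empty response")


print(extract_text({"response": "A red circle on a white background.", "thinking": ""}))
```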
|
||||||
|
|
||||||
|
## CRITICAL: format=json Causes Infinite Loops
|
||||||
|
|
||||||
|
**Severity: HIGH — hangs indefinitely**
|
||||||
|
|
||||||
|
Ollama's server-side `format: "json"` enforcer causes Gemma 4 26B (Q4) to enter an infinite retry loop when the requested schema is deeply nested.
|
||||||
|
|
||||||
|
**Fix:** Never use `format: "json"`. Instead:
|
||||||
|
1. Request JSON structure in the prompt text
|
||||||
|
2. Parse client-side with regex + `json.loads` + json5 fallback
|
||||||
|
|
||||||
|
```python
import json

# DO THIS: no server-side format enforcement; pull the JSON object out client-side.
# `client` and its `format_json` flag are assumed to be the project's own wrapper.
response = client.generate(model="gemma4:26b", prompt=prompt, format_json=False)
body = response["response"]
obj = json.loads(body[body.find("{"):body.rfind("}") + 1])

# NOT THIS: Ollama's server-side JSON enforcer
response = client.generate(model="gemma4:26b", prompt=prompt, format="json")  # HANGS
```
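
The `find`/`rfind` slice breaks when the model adds trailing text that contains a brace. One possible hardening, sketched here with the standard library only (the json5 fallback is not covered, and the helper name is hypothetical), is to try `raw_decode` from each `{` in turn:

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Return the first decodable JSON object embedded in free-form model output."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


reply = 'Sure! Here is the result:\n{"mood": "calm", "tags": ["slow", "warm"]}\nHope {that} helps.'
print(extract_first_json_object(reply))
```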
|
||||||
|
|
||||||
|
## CRITICAL: Ollama Default Context is 2048
|
||||||
|
|
||||||
|
**Severity: HIGH — causes truncation**
|
||||||
|
|
||||||
|
Ollama defaults `num_ctx` to 2048 tokens. Gemma 4 supports 128K (E2B/E4B) to 256K (26B A4B/31B). If you don't override it, longer prompts are silently truncated.
|
||||||
|
|
||||||
|
**Fix:** Always set `num_ctx` explicitly:
|
||||||
|
```json
|
||||||
|
{ "options": { "num_ctx": 8192 } }
|
||||||
|
```
|
||||||
|
|
||||||
|
Scale to your needs: 4096 for simple tasks, 16384 for long inputs, 32768 for complex multi-turn.
|
||||||
|
|
||||||
|
## HIGH: num_predict Default is 128
|
||||||
|
|
||||||
|
**Severity: HIGH — truncates output**
|
||||||
|
|
||||||
|
Ollama defaults `num_predict` to 128 tokens. Almost any useful Gemma 4 output exceeds this.
|
||||||
|
|
||||||
|
**Fix:** Always set `num_predict` explicitly. Minimum recommended: 512. For JSON output: 2048+.
|
||||||
|
|
||||||
|
## MEDIUM: Weak at Long/Nested JSON
|
||||||
|
|
||||||
|
**Severity: MEDIUM — causes parse failures**
|
||||||
|
|
||||||
|
Gemma 4 reliably produces short JSON (5-10 fields) but struggles with:
|
||||||
|
- Deeply nested schemas (3+ levels)
|
||||||
|
- Long arrays (20+ items)
|
||||||
|
- Mixed nesting + length
|
||||||
|
|
||||||
|
**Fix:** Sequential tool calls. Break one large JSON request into multiple smaller calls:
|
||||||
|
- Instead of "generate a 50-item storyboard", do "generate items 1-5", "generate items 6-10", etc.
|
||||||
|
- Because Gemma 4 is fast and local inference has no per-call cost, sequential calls are cheap
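
The chunking itself is simple arithmetic. A small sketch, with a hypothetical helper name and prompt wording:

```python
def chunk_ranges(total: int, size: int):
    """Yield inclusive 1-based (start, end) ranges covering 1..total."""
    for start in range(1, total + 1, size):
        yield start, min(start + size - 1, total)


# One prompt per chunk instead of a single 12-item request.
prompts = [
    f"Generate storyboard items {a}-{b} as a JSON array."
    for a, b in chunk_ranges(12, 5)
]
```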
|
||||||
|
|
||||||
|
**Fallback pattern (AI_Visualizer):**
|
||||||
|
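```python
import json

MAX_RETRIES, BASE_TEMP, TEMP_BUMP = 3, 0.4, 0.1  # values implied by the comment below

def generate_with_retry(call_gemma, parse_json):
    """Retry with a slightly higher temperature after each JSON parse failure."""
    for attempt in range(MAX_RETRIES):
        temp = BASE_TEMP + attempt * TEMP_BUMP  # 0.4 -> 0.5 -> 0.6
        response = call_gemma(temperature=temp)
        try:
            return parse_json(response)
        except json.JSONDecodeError:
            continue
    raise ValueError("no parseable JSON after MAX_RETRIES attempts")
```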
|
||||||
|
|
||||||
|
## MEDIUM: Identity Confusion
|
||||||
|
|
||||||
|
**Severity: MEDIUM — cosmetic but confusing**
|
||||||
|
|
||||||
|
Gemma 4 is ultra-compliant and highly capable but does not know who it is. It may:
|
||||||
|
- Claim to be a different model
|
||||||
|
- Hallucinate capabilities it doesn't have
|
||||||
|
- Respond as a generic "AI assistant" without personality
|
||||||
|
|
||||||
|
**Fix:** Explicit identity in system prompt:
|
||||||
|
```
|
||||||
|
You are [Name], a [role]. You are powered by Gemma 4.
|
||||||
|
You ONLY do [X]. You NEVER do [Y].
|
||||||
|
```
|
||||||
|
|
||||||
|
Gemma 4 does NOT need hand-holding on task execution — it's very capable.
|
||||||
|
It needs explicit instructions about identity and boundaries.
|
||||||
|
|
||||||
|
## MEDIUM: Flash Attention Hang on 31B Dense (>3-4K tokens)
|
||||||
|
|
||||||
|
**Severity: MEDIUM — hardware-specific, affects RTX 3090**
|
||||||
|
|
||||||
|
Community-reported: Flash Attention causes Gemma 4 31B Dense to hang indefinitely during prompt evaluation when the prompt exceeds ~3-4K tokens. The 26B MoE variant handles the same prompts fine — bug is specific to the Dense model.
|
||||||
|
|
||||||
|
**Source:** [ollama/ollama#15350](https://github.com/ollama/ollama/issues/15350)
|
||||||
|
|
||||||
|
**Fix:** Use 26B for long prompts, or disable Flash Attention if running 31B on affected hardware.
|
||||||
|
|
||||||
|
## MEDIUM: Tool Calling Broken in Ollama v0.20.0 Streaming
|
||||||
|
|
||||||
|
**Severity: MEDIUM — version-specific**
|
||||||
|
|
||||||
|
As of early April 2026, Gemma 4 tool calling has issues in Ollama v0.20.0: the tool call parser fails and streaming drops tool calls entirely. Community reports include format mismatches and continuous loops in llama.cpp / LM Studio.
|
||||||
|
|
||||||
|
**Source:** [community reports](https://dev.to/dentity007/-gemma-4-after-24-hours-what-the-community-found-vs-what-google-promised-3a2f)
|
||||||
|
|
||||||
|
**Fix:** Use non-streaming for tool calls (Simon does this). Test tool calling thoroughly when upgrading Ollama versions. Seth's implementations work reliably with non-streaming tool calls.
|
||||||
|
|
||||||
|
## MEDIUM: VRAM-Hungry for Context
|
||||||
|
|
||||||
|
**Severity: MEDIUM — affects hardware planning**
|
||||||
|
|
||||||
|
Gemma 4's KV cache is large relative to competitors. Community reports: 31B at 262K context requires ~22GB just for KV cache on top of model weights. One user could only fit Gemma 4 31B Q4 with 20K context on a 5090, while Qwen 3.5 27B Q4 fit with 190K context on the same card.
|
||||||
|
|
||||||
|
**Implication:** Don't set num_ctx higher than you need. 32K is plenty for most tasks and keeps VRAM reasonable.
|
||||||
|
|
||||||
|
## MEDIUM: Safety Overfiltering
|
||||||
|
|
||||||
|
**Severity: MEDIUM — blocks benign prompts**
|
||||||
|
|
||||||
|
Strict safety alignment occasionally blocks technical, academic, or creative prompts that superficially resemble restricted categories. Conversely, one user reported that basic system prompts were enough to jailbreak the model.
|
||||||
|
|
||||||
|
**Fix:** Rephrase blocked prompts to avoid trigger patterns. For system prompts, avoid language that sounds like you're asking the model to bypass restrictions — just state the task directly.
|
||||||
|
|
||||||
|
## MEDIUM: KV Cache Config Bug (31B/26B ship with num_kv_shared_layers=0)
|
||||||
|
|
||||||
|
**Severity: MEDIUM — crashes on first attention forward pass**
|
||||||
|
|
||||||
|
The 31B and 26B ship with `num_kv_shared_layers = 0`, which causes `layer_types[:-0]` to collapse to zero layer slots. Crashes on first forward pass.
|
||||||
|
|
||||||
|
**Fix:** Patch the config. Check model card discussions for the exact fix.
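
The collapse comes from plain Python slicing: `-0` is `0`, so `[:-0]` means `[:0]`. The snippet below only demonstrates that behavior with hypothetical layer names; it is not the library's code, and the upstream fix may look different.

```python
layer_types = ["sliding_attention"] * 5 + ["full_attention"]  # hypothetical values
num_kv_shared_layers = 0

broken = layer_types[:-num_kv_shared_layers]  # [:-0] is [:0], i.e. an empty list
safe = layer_types[:len(layer_types) - num_kv_shared_layers]  # explicit end index

print(len(broken), len(safe))
```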
|
||||||
|
|
||||||
|
## LOW: vLLM Triton Fallback (~9 tok/s on RTX 4090)
|
||||||
|
|
||||||
|
**Severity: LOW — vLLM-specific**
|
||||||
|
|
||||||
|
Heterogeneous attention head dimensions in Gemma 4 force vLLM to fall back to a slow Triton kernel. RTX 4090 gets ~9 tok/s instead of expected ~100+.
|
||||||
|
|
||||||
|
**Source:** [vllm-project/vllm#38887](https://github.com/vllm-project/vllm/issues/38887)
|
||||||
|
|
||||||
|
**Fix:** Use Ollama instead of vLLM for now, or wait for the fix.
|
||||||
|
|
||||||
|
## LOW: `<unused>` Token Infinite Loop (Vulkan backends)
|
||||||
|
|
||||||
|
**Severity: LOW — Vulkan-specific**
|
||||||
|
|
||||||
|
Gemma 4 can generate `<unused>` or `<unused24>` tokens in an infinite loop on Vulkan backends in llama.cpp.
|
||||||
|
|
||||||
|
**Source:** [ggml-org/llama.cpp#21516](https://github.com/ggml-org/llama.cpp/issues/21516)
|
||||||
|
|
||||||
|
## LOW: Fine-Tuning Ecosystem Issues
|
||||||
|
|
||||||
|
**Severity: LOW — only relevant if fine-tuning**
|
||||||
|
|
||||||
|
Day-one issues for fine-tuners:
|
||||||
|
- HuggingFace Transformers didn't recognize gemma4 architecture (required install from source)
|
||||||
|
- PEFT couldn't handle Gemma4ClippableLinear (new vision encoder layer type)
|
||||||
|
- New `mm_token_type_ids` field required during training even for text-only data
|
||||||
|
- E2B/E4B show training loss of 13-15, which is normal for multimodal models (not a bug)
|
||||||
|
|
||||||
|
## LOW: Vision Validator Overrejects
|
||||||
|
|
||||||
|
**Severity: LOW — specific to evaluative vision tasks**
|
||||||
|
|
||||||
|
In AI_Visualizer, Gemma 4 vision was used to critique SDXL frames. It flagged images for motif-matching failures that humans rated as equal or better than passed images. The validator was queued for disable.
|
||||||
|
|
||||||
|
**Pattern:** Gemma 4 vision is good at description but unreliable for subjective quality scoring. Use it for "what's in this image?" not "is this image good?"
|
||||||
|
|
||||||
|
## LOW: Keep-Alive Too Short
|
||||||
|
|
||||||
|
**Severity: LOW — performance only**
|
||||||
|
|
||||||
|
Default `keep_alive` is 5 minutes. If your pipeline has gaps (e.g., waiting for SDXL generation), the model gets unloaded and reloaded (~10-30s penalty).
|
||||||
|
|
||||||
|
**Fix:** Set `keep_alive` to match your pipeline duration:
|
||||||
|
```json
|
||||||
|
{ "keep_alive": "4h" }
|
||||||
|
```
|
||||||
|
|
||||||
|
Or pin/unpin explicitly:
|
||||||
|
```python
client.generate(model="gemma4:26b", prompt="", keep_alive=-1, options={"num_predict": 0})  # pin
# ... do work ...
client.generate(model="gemma4:26b", prompt="", keep_alive=0, options={"num_predict": 0})  # unpin
```
|
||||||
@@ -0,0 +1,95 @@
|
|||||||
|
# Gemma 4 Implementation Reference
|
||||||
|
|
||||||
|
> Patterns extracted from Seth's two production Gemma 4 projects.
|
||||||
|
|
||||||
|
## Project: Simon (FreibergFamily/simon/)
|
||||||
|
|
||||||
|
**Purpose:** AI genealogy historian — multi-turn chat with tool-calling agent
|
||||||
|
|
||||||
|
| Setting | Value |
|---------|-------|
| Model | `gemma4:26b` |
| API | `/api/chat` (multi-turn) |
| num_ctx | 32768 |
| num_predict | 4096 |
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 64 |
| keep_alive | 4h |
| think | not explicitly set; should be false |
| format_json | not used |
| Vision | not used |
| Tool calling | 6 tools, max 12 iterations |
|
||||||
|
|
||||||
|
### Key Patterns
|
||||||
|
|
||||||
|
1. **Aggressive system prompt:** 40+ lines defining identity, boundaries, tool usage rules, multi-step chaining requirements. Gemma 4 follows all of it.
|
||||||
|
|
||||||
|
2. **Tool chaining instructions:** System prompt explicitly tells Gemma to chain tools (e.g., "after lookup_person, ALSO call get_historical_context"). Gemma 4 follows these multi-step chains reliably.
|
||||||
|
|
||||||
|
3. **Parallel tool calls:** Encouraged in system prompt for multiple lookups. Gemma 4 does this.
|
||||||
|
|
||||||
|
4. **History pruning:** Drops old tool results and tool-call messages, keeps assistant summaries. Prevents context bloat in multi-turn.
|
||||||
|
|
||||||
|
5. **Fallback to streaming:** After 12 tool iterations, switches to stream mode (no tools) to force a text response.
|
||||||
|
|
||||||
|
6. **Two modes (historian vs interview):** Completely different system prompts swapped at runtime. Gemma 4 stays in character for both.
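
The history-pruning pattern (item 4) might look roughly like this. Simon's actual rules are not shown in this corpus, so the function below is a sketch under the assumption that messages follow the `/api/chat` shape with `role`, `content`, and optional `tool_calls`.

```python
def prune_history(messages: list[dict]) -> list[dict]:
    """Drop tool results and tool-call requests; keep system, user, and text replies."""
    kept = []
    for msg in messages:
        if msg.get("role") == "tool":
            continue  # raw tool output
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            continue  # the request that triggered the tool
        kept.append(msg)
    return kept


history = [
    {"role": "system", "content": "You are a genealogy historian."},
    {"role": "user", "content": "Who was my great-grandfather?"},
    {"role": "assistant", "tool_calls": [{"function": {"name": "lookup_person", "arguments": {}}}]},
    {"role": "tool", "content": "{\"name\": \"example\"}"},
    {"role": "assistant", "content": "Here is a summary of what the records show."},
]
pruned = prune_history(history)
```

A real implementation would probably keep the most recent turn intact so the model can still see the tool results it is currently working with.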
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Project: AI Visualizer (AI_Visualizer/)
|
||||||
|
|
||||||
|
**Purpose:** Music-reactive video generator — Gemma 4 as reasoning engine across 4 pipeline stages
|
||||||
|
|
||||||
|
| Stage | num_predict | num_ctx | temperature | Purpose |
|-------|-------------|---------|-------------|---------|
| Mood Analysis | 4096 | 16384 | 0.4-0.6 | Analyze CLAP descriptors -> narratives + boundary adjustments |
| Rate Pass | 512 | 4096 | 0.3-0.5 | Choose visual pacing rate per music segment |
| Storyboard | 2048 | 4096 | 0.6-0.8 | Generate SDXL prompts per music segment |
| Batch Expansion | 2048 | default | 0.7 | Interpolate between scene prompts over time |
| Vision Validator | 256 | default | 0.2 | Critique generated frames (queued for disable) |
|
||||||
|
|
||||||
|
### Key Patterns
|
||||||
|
|
||||||
|
1. **No tool calling used.** All Gemma interaction is single-turn generate with JSON requested in prompt.
|
||||||
|
|
||||||
|
2. **Client-side JSON extraction:**
|
||||||
|
```python
import json

body = response["response"]
start = body.find("{")  # first opening brace
end = body.rfind("}")   # last closing brace
obj = json.loads(body[start:end + 1])
```
|
||||||
|
|
||||||
|
3. **Temperature ramping on retry:** Base temp + bump per attempt. Conservative first, creative on retry.
|
||||||
|
|
||||||
|
4. **think: false everywhere.** Explicitly set on every call. Critical for budget control.
|
||||||
|
|
||||||
|
5. **format_json: false everywhere.** Causes infinite loops on nested schemas.
|
||||||
|
|
||||||
|
6. **Model pinning:** `keep_alive=-1` to prevent GPU eviction during long SDXL pauses.
|
||||||
|
|
||||||
|
7. **Explicit num_ctx:** Added after discovering Ollama defaults to 2048, which truncated mood analyzer prompts on long tracks.
|
||||||
|
|
||||||
|
8. **Banned vocabulary in prompts:** List of cliche words (cinematic, dramatic, ethereal...) passed to Gemma to avoid generic output.
|
||||||
|
|
||||||
|
9. **Vision for image critique:** Base64-encoded PNG -> structured SCORE/ISSUE/REASON output parsed by regex. Works but overrejects on subjective quality.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Common Settings Across Both Projects
|
||||||
|
|
||||||
|
```json
{
  "model": "gemma4:26b",
  "think": false,
  "options": {
    "num_ctx": 4096,
    "num_predict": 2048,
    "temperature": 0.5
  },
  "keep_alive": "30m"
}
```
|
||||||
|
|
||||||
|
Adjust num_ctx/num_predict upward for your payload size. These are safe minimums.
|
||||||
@@ -0,0 +1,25 @@
|
|||||||
|
# gemma4-research
|
||||||
|
|
||||||
|
Research corpus and implementation guidance for Google Gemma 4, based on production use in Seth's homelab.
|
||||||
|
|
||||||
|
## Files
|
||||||
|
|
||||||
|
| File | What | When to Read |
|------|------|-------------|
| `SYNTHESIS.md` | **Start here.** Opinionated guide: how to build with Gemma 4 | Before any new Gemma 4 implementation |
| `GOTCHAS.md` | Known issues and workarounds, severity-ranked | When debugging Gemma 4 issues or starting a new project |
| `IMPLEMENTATIONS.md` | Patterns from Simon and AI_Visualizer | When designing a new Gemma 4 integration |
| `CORPUS_architecture.md` | Model architecture details (layers, attention, PLE, MoE) | When you need to understand WHY Gemma 4 behaves a certain way |
| `CORPUS_ollama_variants.md` | Available models, sizes, VRAM, Ollama settings | When choosing a model variant or configuring Ollama |
| `CORPUS_capabilities.md` | Modalities (vision, audio, video, tools), what it can/can't do | When scoping what Gemma 4 can handle |
| `CORPUS_benchmarks.md` | Full benchmark table vs Gemma 3, arena scores, agentic scores | When comparing Gemma 4 to alternatives |
| `CORPUS_tool_calling_format.md` | Native token format + JSON API format for function calling | When implementing tool calling |
|
||||||
|
|
||||||
|
## Source Projects
|
||||||
|
|
||||||
|
- **Simon** (`~/bin/FreibergFamily/simon/`) — Multi-turn chat agent with 6 tools, genealogy historian
|
||||||
|
- **AI Visualizer** (`~/bin/AI_Visualizer/`) — Music video generator, 4-stage Gemma pipeline + vision
|
||||||
|
|
||||||
|
## Key Insight
|
||||||
|
|
||||||
|
Gemma 4 is ultra-compliant and highly capable but doesn't know who it is. It needs explicit system prompts, not hand-holding. Due to fast local inference, sequential tool calls beat long JSON requests.
|
||||||
@@ -0,0 +1,194 @@
# Gemma 4 Synthesis: How to Build With It

> Opinionated guide based on two production implementations and ongoing use.
> Seth Freiberg, 2026-04-12

## The One-Paragraph Summary

Gemma 4 is an ultra-compliant, highly capable model that doesn't know who it is. It doesn't need hand-holding on tasks, but it does need explicit system-prompt instructions about identity, boundaries, and output format. It needs `num_predict` increased (Ollama defaults are absurdly low), `think` set to false (thinking eats the context budget), and `format: json` avoided entirely (causes infinite loops). Because local inference is fast and free, sequential tool calls are the ideal way to handle tasks that would otherwise require long structured output.

## Mental Model

Think of Gemma 4 as a very competent employee on their first day. They can do the work, so you don't need to explain how. But you DO need to explain:
- Who they are and what their job is
- What they should and should NOT do
- Exactly what format you want the deliverable in
- The boundaries of their role

Get those right and Gemma 4 just works. Get them wrong and you get a generic chatbot.

## Mandatory Ollama Settings

Every Gemma 4 call MUST include:

```json
{
  "think": false,
  "options": {
    "num_ctx": 4096,
    "num_predict": 2048
  }
}
```

**Why each one:**
- `think: false`: Ollama 0.20+ defaults to `think: true`. Thinking tokens silently consume the `num_predict` budget, so the visible response can come back empty. Seth has ONLY had success with thinking off.
- `num_ctx: 4096+`: Ollama defaults to 2048. Your system prompt alone might exceed that.
- `num_predict: 2048+`: Ollama defaults to 128. Any structured output gets truncated.

Scale these to your task. The values above are safe minimums, not recommendations.

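As a rough illustration of wiring these settings into every request, the sketch below builds a non-streaming chat payload and posts it with the standard library only. It assumes Ollama's default local endpoint (`http://localhost:11434/api/chat`); `build_payload` and `chat` are hypothetical helper names, not part of either source project.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/chat"


def build_payload(system_prompt, user_prompt, model="gemma4:26b",
                  num_ctx=4096, num_predict=2048):
    """Return a non-streaming chat request body carrying the mandatory settings."""
    return {
        "model": model,
        "think": False,
        "stream": False,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "options": {"num_ctx": num_ctx, "num_predict": num_predict},
    }


def chat(payload, url=OLLAMA_URL, timeout=300):
    """POST the payload to Ollama and return the assistant's text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["message"]["content"]
```

Building the payload in one place means a forgotten `think` or `num_predict` cannot slip into an individual call site.
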
## System Prompt Template

```
You are [NAME], a [ROLE DESCRIPTION].

## What You Do
- [Explicit list of responsibilities]
- [Tools you have access to and when to use each one]

## What You Do NOT Do
- [Explicit list of things to refuse or avoid]
- [Common mistakes to prevent]

## Output Format
[Exact schema, field names, example if complex]
Respond with ONLY [format]. No prose outside the [format].

## Rules
- [Behavioral constraints]
- [Multi-step chaining instructions if using tools]

Today's date: [DATE]
```

**Key principles:**
1. Identity first: who is this agent?
2. Positive instructions before negative (what TO do before what NOT to do)
3. Output format is explicit and complete, because Gemma 4 follows schemas faithfully
4. "No prose outside the [format]" (for example, "No prose outside the JSON") prevents wrapper text that breaks parsing
5. Date injection helps with temporal reasoning

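One way to keep the section order and the date injection consistent is to assemble the prompt in code. This is a minimal sketch; `build_system_prompt` and its parameter names are hypothetical and only mirror the template above.

```python
from datetime import date


def _bullets(items):
    """Render a list of strings as Markdown bullets."""
    return "\n".join(f"- {item}" for item in items)


def build_system_prompt(name, role, does, does_not, schema, fmt, rules, today=None):
    """Assemble the template in order: identity, do, don't, format, rules, date."""
    today = today or date.today().isoformat()
    sections = [
        f"You are {name}, a {role}.",
        "## What You Do\n" + _bullets(does),
        "## What You Do NOT Do\n" + _bullets(does_not),
        f"## Output Format\n{schema}\nRespond with ONLY {fmt}. No prose outside the {fmt}.",
        "## Rules\n" + _bullets(rules),
        f"Today's date: {today}",
    ]
    return "\n\n".join(sections)
```

Passing `today` explicitly keeps the output reproducible in tests; in production the default picks up the current date.
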
## Tool Calling Strategy

Gemma 4 is **reliable for tool calling** but **weak at producing long JSON output**.

### When to use tool calling (Ollama native)
- Multi-turn agents with 2-10 tools
- Sequential reasoning chains (lookup A -> use A to decide B -> lookup B)
- Any task where the model needs to gather information before responding

### When to use prompt-based JSON instead
- Single-turn generation with known output structure
- When you need specific JSON schema control
- When the output is a payload (prompts, configs), not a conversation

### The Sequential Pattern

Instead of asking Gemma 4 to produce one massive JSON:
```
BAD:  "Generate a 50-scene storyboard as JSON" -> truncated/malformed
GOOD: "Generate scenes 1-5 as JSON" x10 -> reliable every time
```

Gemma 4's inference speed makes sequential calls cheap. A 10-call chain at ~134 tok/s on a 3090 Ti costs seconds, not minutes. This is the fundamental advantage of local models: latency is predictable and network-free.

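The batching in the storyboard example can be sketched as follows. `generate` is a hypothetical callable (prompt text in, list of scene dicts out) standing in for whatever wraps the Ollama call and the JSON extraction; the function names are illustrative only.

```python
def scene_batches(total_scenes, batch_size=5):
    """Yield inclusive (first, last) scene ranges: (1, 5), (6, 10), ..."""
    for first in range(1, total_scenes + 1, batch_size):
        yield first, min(first + batch_size - 1, total_scenes)


def generate_storyboard(generate, total_scenes=50, batch_size=5):
    """Collect scenes from many small calls instead of one large one.

    `generate` is a hypothetical callable supplied by the caller.
    """
    scenes = []
    for first, last in scene_batches(total_scenes, batch_size):
        prompt = (
            f"Generate scenes {first}-{last} as a JSON array. "
            "No prose outside the JSON."
        )
        scenes.extend(generate(prompt))
    return scenes
```

As a rough check on the cost, 10 calls of about 400 output tokens each (an assumed size) at ~134 tok/s come to roughly 30 seconds of generation.
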
## JSON Extraction Pattern

Since `format: "json"` is broken, always extract client-side:

```python
# Python
import json

raw = response["response"]  # /api/generate field; chat replies use response["message"]["content"]
start = raw.find("{")
end = raw.rfind("}")
obj = None  # stays None when no JSON object is found
if start >= 0 and end > start:
    obj = json.loads(raw[start:end + 1])  # raises json.JSONDecodeError if malformed
```

```javascript
// JavaScript
const raw = response.message.content;
const match = raw.match(/\{[\s\S]*\}/);
const obj = match ? JSON.parse(match[0]) : null; // null when no JSON object is found
```

For arrays, find `[` and `]` instead. Add a json5 fallback for trailing commas.

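The array case and the trailing-comma fallback can be combined in one helper. The sketch below uses only the standard library: a regex strips trailing commas as a stand-in for the json5 fallback mentioned above (it is not json5 and would also rewrite a `,]` that appears inside a string value). `extract_json` is a hypothetical name.

```python
import json
import re

# Matches a comma that is followed only by whitespace and a closing bracket.
_TRAILING_COMMA = re.compile(r",\s*([}\]])")


def extract_json(raw):
    """Parse the outermost {...} or [...] span in `raw`; return None on failure."""
    candidates = []
    for open_ch, close_ch in (("{", "}"), ("[", "]")):
        start, end = raw.find(open_ch), raw.rfind(close_ch)
        if 0 <= start < end:
            candidates.append((start, raw[start:end + 1]))
    # Try the bracket type that opens first in the text, then the other one.
    for _, text in sorted(candidates):
        for attempt in (text, _TRAILING_COMMA.sub(r"\1", text)):
            try:
                return json.loads(attempt)
            except json.JSONDecodeError:
                continue
    return None
```

Returning `None` instead of raising lets the caller decide whether to retry the generation.
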
## Temperature Guidelines

| Task Type | Temperature | Why |
|-----------|-------------|-----|
| Evaluation / scoring | 0.2 | Consistent, reproducible judgments |
| Structured extraction | 0.3-0.4 | Faithful to schema |
| Creative generation | 0.6-0.8 | Variety without chaos |
| Conversation / chat | 0.7-1.0 | Natural feel |

Retry strategy: raise the temperature by 0.1 on each retry to escape format failures.

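The retry strategy might look like the loop below. Both `generate` (temperature in, raw model text out) and `parse` (raw text in, parsed value or `None` out) are hypothetical callables supplied by the caller, and the default retry count is an assumption.

```python
def generate_with_retries(generate, parse, base_temp=0.3, max_retries=3, step=0.1):
    """Call `generate(temperature)` until `parse` accepts the output.

    Returns (parsed_value, temperature_used). Raises ValueError when every
    attempt fails to parse.
    """
    for attempt in range(max_retries + 1):
        # round() keeps float drift (0.30000000000000004) out of the request
        temperature = round(base_temp + step * attempt, 2)
        parsed = parse(generate(temperature))
        if parsed is not None:
            return parsed, temperature
    raise ValueError(f"no parseable output after {max_retries + 1} attempts")
```

Capping the retries matters: each bump moves the call further from the temperature the table recommends for the task.
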
## Vision Usage

**Works for:** Describing image contents (objects, colors, composition, text)
**Unreliable for:** Subjective quality scoring, aesthetic judgment

```python
import base64

from ollama import Client  # the ollama Python package

client = Client()  # assumes a local Ollama server on the default port

with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("ascii")

response = client.generate(
    model="gemma4:26b",
    prompt="Describe this image in detail.",
    images=[b64],
    think=False,
    options={"temperature": 0.2, "num_predict": 512},
)
```

Vision is on ALL Gemma 4 variants (E2B, E4B, 26B, 31B). Audio is E-series only.

## Context Management

### Multi-turn (chat agents)
- Prune old tool results and tool-call messages
- Keep the assistant's natural-language summaries
- Set `num_ctx` to 32768 for rich conversations
- Set a tool iteration limit (12 is proven) with a streaming fallback

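A minimal sketch of the pruning step, assuming Ollama-style chat messages: dicts with a `role` key, where tool results use role `tool` and tool requests are assistant messages carrying a `tool_calls` key. `prune_history` and `keep_recent` are hypothetical names, and this is not necessarily how Simon implements it.

```python
def prune_history(messages, keep_recent=4):
    """Drop tool calls and tool results older than the last `keep_recent` messages.

    System, user, and plain assistant messages are always kept, so the
    assistant's natural-language summaries survive.
    """
    cutoff = max(len(messages) - keep_recent, 0)
    # Do not separate a recent tool result from the call that produced it.
    while 0 < cutoff < len(messages) and messages[cutoff].get("role") == "tool":
        cutoff -= 1
    pruned = []
    for index, msg in enumerate(messages):
        is_tool_traffic = msg.get("role") == "tool" or bool(msg.get("tool_calls"))
        if index < cutoff and is_tool_traffic:
            continue  # old tool traffic; the later summary already covers it
        pruned.append(msg)
    return pruned
```

Running this before each request keeps the history growing with the conversation rather than with the number of tool calls.
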
### Single-turn (pipeline stages)
- Calculate your prompt size and set `num_ctx` accordingly
- For long inputs (full track analysis), use recursive splitting at natural boundaries
- Pin the model with `keep_alive=-1` if the pipeline has idle gaps

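The first two points can be sketched as below. The 4-characters-per-token estimate is an assumed rule of thumb, not a measured Gemma 4 figure, and the separator list is only one possible definition of "natural boundaries"; all function names are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; the 4 chars/token ratio is an assumption."""
    return len(text) // chars_per_token + 1


def pick_num_ctx(prompt, num_predict=2048, floor=4096):
    """Double from `floor` until the prompt plus the output budget fits."""
    needed = estimate_tokens(prompt) + num_predict
    num_ctx = floor
    while num_ctx < needed:
        num_ctx *= 2
    return num_ctx


def split_recursive(text, max_chars, separators=("\n\n", "\n", ". ")):
    """Split into pieces of at most `max_chars`, preferring the coarsest boundary."""
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        cut = text.rfind(sep, 0, max_chars)
        if cut > 0:
            cut += len(sep)  # keep the separator with the left-hand piece
            break
    else:
        cut = max_chars  # no natural boundary in range: hard cut
    return [text[:cut]] + split_recursive(text[cut:], max_chars, separators)
```

Since Ollama gives no truncation warning, it is worth comparing the estimate against the prompt token count Ollama reports in its response for a real prompt before trusting it.
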
## Model Selection

| Use Case | Recommended | Why |
|----------|------------|-----|
| Production pipeline (needs GPU coexistence) | `gemma4:26b` | Best quality/speed/VRAM balance |
| On-device / edge | `gemma4:e4b-it-q8_0` | 12GB VRAM, vision+audio |
| Maximum quality (single-model GPU) | `gemma4:31b-it-q4_K_M` | Sharpest but slow under memory pressure |
| Rapid prototyping / testing | `gemma4:26b` | Fast enough for interactive dev |

## Anti-Patterns

1. **Don't use `format: "json"`**: infinite loops on nested schemas
2. **Don't leave `think` at default**: eats your output budget silently
3. **Don't leave `num_predict` at default**: 128 tokens is nothing
4. **Don't leave `num_ctx` at default**: 2048 truncates most prompts
5. **Don't ask for huge JSON in one call**: break into sequential calls
6. **Don't use thinking mode for evaluation**: inflates scores, wastes context
7. **Don't skip system prompt identity**: Gemma 4 becomes a generic chatbot
8. **Don't use audio on 26B/31B**: only the E-series has an audio encoder

## Quick-Start Checklist

- [ ] Set `think: false`
- [ ] Set `num_predict` >= 512 (2048+ for JSON output)
- [ ] Set `num_ctx` >= 4096 (scale to your prompt size)
- [ ] Write explicit system prompt with identity + boundaries + output format
- [ ] Extract JSON client-side (no `format: "json"`)
- [ ] Set `keep_alive` >= 30m (or pin with -1)
- [ ] For long structured output, use sequential calls
- [ ] For vision, pass base64 in `images` array
- [ ] Test with your actual prompt length, because Ollama won't warn about truncation

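The mechanical items on the checklist can be verified before a request is sent. The sketch below is one hypothetical way to do that; `check_payload` is not part of either project and only covers the settings that appear in the request body.

```python
def check_payload(payload):
    """Return a list of checklist violations for an Ollama request body."""
    options = payload.get("options", {})
    problems = []
    if payload.get("think") is not False:
        problems.append("set think: false")
    if options.get("num_predict", 0) < 512:
        problems.append("set num_predict >= 512 (2048+ for JSON output)")
    if options.get("num_ctx", 0) < 4096:
        problems.append("set num_ctx >= 4096")
    if "format" in payload:
        problems.append("remove format and extract JSON client-side")
    if "keep_alive" not in payload:
        problems.append("set keep_alive (30m or longer, or -1 to pin)")
    return problems
```

An empty list means the payload passes; anything else can be logged or raised during development.
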