# Gemma 4 on Ollama — Available Variants > Last verified against Seth's homelab: 2026-04-12 ## Ollama Model Tags | Tag | Params | Quant | Size on Disk | VRAM | Notes | |-----|--------|-------|-------------|------|-------| | `gemma4:e4b-it-q8_0` | ~8B total / 4B effective | Q8_0 | 11.6GB | ~12GB | Vision + audio capable. ~25 tok/s on V100 | | `gemma4:26b` | 25.2B total / **3.8B active (MoE)** | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti. **8 experts active of 128 + 1 shared** — runs at ~4B-speed, hence throughput. Q4_K_M inference is standard (Mixtral/DeepSeek ship same); the "MoE quality degrades at 4-bit" caveat is a **training-time** concern, not inference. See `tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md` for the full card. | | `gemma4:31b-it-q4_K_M` | 31.3B | Q4_K_M | 19.9GB | ~24.5GB | Sharpest but 5x slower (~28 tok/s on 3090 Ti, memory pressure) | ## Capabilities by Variant (from `ollama show`) All variants support: - Text generation (completion, chat) - Vision (image input via base64 in `images` field) - Tool/function calling (native Ollama tool format) - Thinking (configurable — `ollama show` lists it; Seth's finding is to leave it `false` for tool-use workloads) E-series (E2B, E4B) additionally support: - Audio input (conformer encoder) — **but not via Ollama**; requires llama.cpp with the `mmproj-*-E*B-it-*.gguf` projector, or vLLM's `input_features_padded`. See `tooling/inference-frameworks/README.md`. ## GPU Coexistence (pve197 V100 32GB) - gemma4:26b + SDXL Turbo: ~28.5GB peak VRAM — fits on V100-32GB - gemma4:31b: 24.5GB alone — memory pressure with any coexisting model - gemma4:e4b-it-q8_0: ~12GB — comfortable headroom ## Ollama API Endpoint - `/api/generate` (single-turn, used by AI_Visualizer) - `/api/chat` (multi-turn with message history, used by Simon) - Both accept `tools`, `images`, `stream`, `options`, `keep_alive` ## Important Ollama Defaults to Override | Parameter | Ollama Default | Recommended | Why | |-----------|---------------|-------------|-----| | `num_ctx` | 2048 | 4096-32768 | Default is absurdly small, causes truncation | | `num_predict` | 128 | 512-4096+ | Default truncates almost all useful output | | `think` | true (Ollama 0.20+) | false | See GOTCHAS doc | | `keep_alive` | 5m | 30m-4h | Prevents expensive model reload between calls |