gemma4-research/CORPUS_ollama_variants.md
Mortdecai 5775978899 docs: merge tooling findings into SYNTHESIS/GOTCHAS/CORPUS_* and add handoff
Patches the top-level corpus docs with the 13 findings flagged during the
2026-04-18 canonical tooling research pass. tooling/README.md now marks each
finding [merged: <file>] or [flagged] for provenance.

- CORPUS_ollama_variants.md: annotate gemma4:26b as MoE (25.2B total / 3.8B
  active, 8-of-128 experts + 1 shared). Note Q4_K_M inference is standard
  (the "MoE quality degrades at 4-bit" caveat is training-only). Add note
  that audio on E-series is NOT available via Ollama — llama.cpp mmproj
  or vLLM only.
- CORPUS_capabilities.md: native system role, configurable thinking mode,
  first trained tool use (vs Gemma 1/2/3 proof-of-concept), native object
  detection with bbox output in 1000x1000 coords, pointer to EmbeddingGemma
  for retrieval (Gemma 4 has no embedding mode).
- CORPUS_tool_calling_format.md: add Chat Template Context section
  documenting the <|turn>/<turn|> asymmetric brackets (new in Gemma 4,
  replaced <start_of_turn>/<end_of_turn>) plus <|think>, <|channel>,
  <|image>, <|audio> tokens. Add HF transformers Alternative section
  showing processor.parse_response with response_schema.
- GOTCHAS.md: add MEDIUM gotcha for abandoned google/gemma_pytorch (no
  Gemma 4 support since 2025-05-30). Expand fine-tuning section with FA2/FA4
  head_dim=512 break, fused LoRA kernel issues, 26B A4B training-quant
  guidance, new tool-call tokens as learned embeddings.
- SYNTHESIS.md: add banner pointing to tooling/ for canonical upstream
  material. Add embeddinggemma row to Model Selection table.

Also:
- Add .gitignore excluding .backup/ (local scratch per global CLAUDE.md
  convention, not needed in tracked history) and __pycache__/.
- Add .claude/handoffs/2026-04-18-canonical-tooling-research.md so future
  sessions can pick up cold — facts verified, open threads, what changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:48:26 -04:00

Gemma 4 on Ollama — Available Variants

Last verified against Seth's homelab: 2026-04-12

Ollama Model Tags

| Tag | Params | Quant | Size on Disk | VRAM | Notes |
|---|---|---|---|---|---|
| gemma4:e4b-it-q8_0 | ~8B total / 4B effective | Q8_0 | 11.6GB | ~12GB | Vision + audio capable. ~25 tok/s on V100 |
| gemma4:26b | 25.2B total / 3.8B active (MoE) | Q4_K_M (default) | 18.0GB | ~18GB | Sweet spot for quality/speed. ~134 tok/s on 3090 Ti. 8 experts active of 128 + 1 shared, so it runs at ~4B speed, hence the throughput. Q4_K_M inference is standard (Mixtral/DeepSeek ship the same); the "MoE quality degrades at 4-bit" caveat is a training-time concern, not an inference one. See tooling/huggingface/model-cards/gemma-4-26B-A4B-it-README.md for the full card. |
| gemma4:31b-it-q4_K_M | 31.3B | Q4_K_M | 19.9GB | ~24.5GB | Sharpest but 5x slower (~28 tok/s on 3090 Ti, memory pressure) |
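
The ratios behind the Notes column can be checked with a little arithmetic. The sketch below uses only figures from the table above; the function names are illustrative, not part of any tooling in this repo.

```python
def active_fraction(active_b: float, total_b: float) -> float:
    """Share of parameters exercised per token in a MoE model."""
    return active_b / total_b


def speedup(fast_tok_s: float, slow_tok_s: float) -> float:
    """Throughput ratio between two variants measured on the same GPU."""
    return fast_tok_s / slow_tok_s


# gemma4:26b: 3.8B active of 25.2B total, 8 routed experts of 128.
frac_26b = active_fraction(3.8, 25.2)
routed_share = 8 / 128

# 3090 Ti throughput: gemma4:26b (~134 tok/s) vs gemma4:31b-it-q4_K_M (~28 tok/s).
ratio = speedup(134, 28)

print(f"active params per token: {frac_26b:.1%}")
print(f"routed experts used:     {routed_share:.2%}")
print(f"26b vs 31b throughput:   {ratio:.1f}x")
```

Roughly 15% of the weights are active per token, and the measured gap is about 4.8x, which the table rounds to 5x.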

Capabilities by Variant (from ollama show)

All variants support:

  • Text generation (completion, chat)
  • Vision (image input via base64 in images field)
  • Tool/function calling (native Ollama tool format)
  • Thinking (configurable — ollama show lists it; Seth's finding is to leave it false for tool-use workloads)
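
As a rough illustration of how these capabilities combine in one request, the sketch below builds a chat payload with a base64 image, one tool definition, and thinking disabled. The `get_weather` tool, the prompt, and the placeholder image bytes are hypothetical; placing `think` at the top level of the body is an assumption about the request shape and should be checked against the installed Ollama version.

```python
import base64
import json


def build_chat_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Assemble a chat request body; nothing is sent over the network."""
    # Images travel as base64 strings in the message's "images" field.
    image_b64 = base64.b64encode(image_bytes).decode("ascii")

    # Hypothetical tool, described as a JSON-schema function definition.
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }

    return {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt, "images": [image_b64]},
        ],
        "tools": [weather_tool],
        "think": False,   # left off for tool-use workloads, per the note above
        "stream": False,
    }


payload = build_chat_payload(
    "gemma4:26b", "What is in this picture?", b"placeholder-image-bytes"
)
body = json.dumps(payload)  # JSON string ready to POST
```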

E-series (E2B, E4B) additionally support:

  • Audio input (conformer encoder) — but not via Ollama; requires llama.cpp with the mmproj-*-E*B-it-*.gguf projector, or vLLM's input_features_padded. See tooling/inference-frameworks/README.md.

GPU Coexistence (pve197 V100 32GB)

  • gemma4:26b + SDXL Turbo: ~28.5GB peak VRAM — fits on V100-32GB
  • gemma4:31b: 24.5GB alone — memory pressure with any coexisting model
  • gemma4:e4b-it-q8_0: ~12GB — comfortable headroom
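
The coexistence figures above reduce to a simple headroom calculation. The sketch below assumes peak VRAM is additive across co-resident models, which is a simplification; under that assumption the SDXL Turbo footprint is inferred as 28.5GB minus the 18GB that gemma4:26b uses alone. The inferred number is not a measurement.

```python
V100_BUDGET_GB = 32.0


def headroom(budget_gb: float, *loads_gb: float) -> float:
    """Free VRAM left after loading everything; negative means it does not fit."""
    return budget_gb - sum(loads_gb)


# ASSUMPTION: peak usage is additive, so SDXL Turbo's share is the difference
# between the combined peak (28.5GB) and gemma4:26b alone (~18GB).
sdxl_turbo_gb = 28.5 - 18.0

scenarios = {
    "gemma4:26b + SDXL Turbo": headroom(V100_BUDGET_GB, 18.0, sdxl_turbo_gb),
    "gemma4:31b alone": headroom(V100_BUDGET_GB, 24.5),
    "gemma4:31b + SDXL Turbo": headroom(V100_BUDGET_GB, 24.5, sdxl_turbo_gb),
    "gemma4:e4b-it-q8_0 alone": headroom(V100_BUDGET_GB, 12.0),
}

for name, free_gb in scenarios.items():
    status = "fits" if free_gb >= 0 else "does not fit"
    print(f"{name}: {free_gb:+.1f} GB headroom ({status})")
```

Under these assumptions gemma4:31b plus SDXL Turbo overshoots the card by about 3GB, which matches the memory-pressure warning.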

Ollama API Endpoints

  • /api/generate (single-turn, used by AI_Visualizer)
  • /api/chat (multi-turn with message history, used by Simon)
  • Both accept tools, images, stream, options, keep_alive
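
A minimal sketch of how the two endpoints differ in request shape: /api/generate takes a single `prompt`, while /api/chat takes a `messages` history. The base URL assumes Ollama's default port on localhost, and the prompts are placeholders; the requests are only constructed here, not sent.

```python
import json
import urllib.request

# ASSUMPTION: default local Ollama address; adjust for the actual host.
OLLAMA_URL = "http://localhost:11434"


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Wrap a JSON payload in a POST request for the given endpoint."""
    return urllib.request.Request(
        OLLAMA_URL + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Single-turn: one prompt string.
generate_req = build_request(
    "/api/generate",
    {"model": "gemma4:26b", "prompt": "Describe a sunset.", "stream": False},
)

# Multi-turn: the caller owns and resends the message history.
chat_req = build_request(
    "/api/chat",
    {
        "model": "gemma4:26b",
        "messages": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi, how can I help?"},
            {"role": "user", "content": "Summarize our chat so far."},
        ],
        "stream": False,
    },
)

# To actually send one: urllib.request.urlopen(chat_req)
```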

Important Ollama Defaults to Override

| Parameter | Ollama Default | Recommended | Why |
|---|---|---|---|
| num_ctx | 2048 | 4096-32768 | Default is absurdly small, causes truncation |
| num_predict | 128 | 512-4096+ | Default truncates almost all useful output |
| think | true (Ollama 0.20+) | false | See GOTCHAS doc |
| keep_alive | 5m | 30m-4h | Prevents expensive model reload between calls |
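
One way to apply these overrides consistently is a small helper that fills them in on every request body. This is a sketch under assumptions: the specific values (8192, 1024, "30m") are picks from within the recommended ranges, `with_overrides` is a hypothetical helper, and the split between `options` (num_ctx, num_predict) and top-level fields (think, keep_alive) should be confirmed against the Ollama version in use.

```python
import copy

# ASSUMPTION: example values chosen from within the recommended ranges above.
RECOMMENDED_OPTIONS = {"num_ctx": 8192, "num_predict": 1024}


def with_overrides(payload: dict, keep_alive: str = "30m") -> dict:
    """Return a copy of the payload with recommended values filled in.

    Values the caller already set are kept; only missing ones are added.
    """
    out = copy.deepcopy(payload)
    out["options"] = {**RECOMMENDED_OPTIONS, **out.get("options", {})}
    out.setdefault("think", False)
    out.setdefault("keep_alive", keep_alive)
    return out


base = {
    "model": "gemma4:26b",
    "messages": [{"role": "user", "content": "Hello"}],
    "options": {"num_ctx": 32768},  # caller-specified value wins
}
tuned = with_overrides(base)
print(tuned["options"], tuned["think"], tuned["keep_alive"])
```

Copying the payload rather than mutating it keeps the helper safe to call on shared request templates.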