Files
gemma4-research/.claude/handoffs/2026-04-20-055658-gpu-bakeoff-3090-vs-strix.md
T
Mortdecai 0f82cd71b1 docs: session handoff — GPU bakeoff (3090 Ti vs Strix Halo)
Closes out the session that produced docs/reference/gpu-bakeoff-2026-04-20.md
and the parked scripts/native-bakeoff/ scaffold. Chains (chronologically)
from the 2026-04-18 OpenWebUI setup handoff though the topic is unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 06:00:07 -04:00

17 KiB
Raw Blame History

Handoff: GPU Bakeoff — 3090 Ti vs Strix Halo (+ parked native-bakeoff scaffold)

Session Metadata

  • Created: 2026-04-20 05:56:58
  • Project: /home/claude/bin/gemma4-research
  • Branch: master (pushed to origin)
  • Session duration: ~extended session, multi-pivot (~4+ hours)

Recent Commits (for context)

  • 91842f3 docs: scrub PII/IPs from gpu-bakeoff ← latest, end of session
  • 22af597 docs: remove V100 from GPU bakeoff ← V100 column dropped
  • b619035 feat: GPU bakeoff — 3090 Ti vs V100 vs Strix Halo ← initial write-up (superseded by later scrubs)
  • df5542f feat: native-bakeoff scaffold — Ollama JSON vs native-token tool-calling ← parked research
  • 91aaaa4 docs: redact PII from persistent-correspondence findings

Handoff Chain

This session is not a continuation of the OpenWebUI doc work — it's a fresh research thread on the same repo. The link is chronological, not topical. Previous handoff is only relevant if debugging OpenWebUI-related Gemma 4 behavior.

Current State Summary

Session started on a native-vs-JSON tool-calling bakeoff question, pivoted to a cross-GPU throughput comparison mid-session, and shipped the latter. Final state: docs/reference/gpu-bakeoff-2026-04-20.md comparing gemma4:26b MoE and gemma4:31b dense decode/prefill rates on RTX 3090 Ti (steel141) vs AMD Strix Halo iGPU (strix-halo host). V100 data was initially gathered and included but removed when it turned out the V100 was 95% CPU-bound due to SDXL coresident on CT 167 — the published doc is a clean 2-host comparison. Native-bakeoff harness (the earlier thread) remains scaffolded and committed at scripts/native-bakeoff/ but not run further. Repo is clean, three commits pushed.

Codebase Understanding

Architecture Overview

The repo is a Gemma 4 research corpus. New this session:

  • scripts/native-bakeoff/ — three-arm tool-calling harness (Ollama JSON tools vs Ollama raw native tokens vs google-deepmind/gemma JAX ToolSampler). Arms A and B tested and functionally equivalent on gemma4:26b Q4 against a shared task suite lifted from mort-bakeoff. Arm C is env-gated (requires JAX + gemma PyPI package); wired but not run.
  • scripts/gpu-bakeoff/ — cross-GPU throughput harness. Takes host aliases from HOSTS dict and resolves URLs from env vars (OLLAMA_STEEL141_URL, OLLAMA_PVE197_URL, OLLAMA_STRIX_URL). Runs 1 warmup + 3 measurement calls per (host × model × prompt-length), logs Ollama's canonical timing fields, aggregates min/median/max.
  • docs/reference/gpu-bakeoff-2026-04-20.md — the finished writeup. 3090 Ti + Strix Halo only.

The docs/reference/ tier holds experimental findings; docs/ top-level holds applied how-to guides. Both bakeoffs landed in docs/reference/ which is correct.

Critical Files

File Purpose Relevance
docs/reference/gpu-bakeoff-2026-04-20.md The session's primary artifact Read this first for the session's shipped findings
scripts/gpu-bakeoff/harness.py GPU bakeoff harness, env-var-driven URL resolution Re-run the bakeoff (e.g., for isolated V100) by setting env vars + invoking
scripts/gpu-bakeoff/runs/**/*.json Raw per-call timing data Source of truth for the doc's numbers; each JSON has warmup + 3 runs with full Ollama timing fields
scripts/native-bakeoff/harness.py Parked three-arm tool-calling harness Reference if revisiting the native-vs-JSON question; arms A and B are ready, arm C needs JAX env
scripts/native-bakeoff/arms/ollama_native.py Arm B — renders the canonical HF jinja chat template directly, POSTs to /api/generate raw:true Contains a subtle fix (keep assistant content="" when it has tool_calls) that's easy to regress
tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja Canonical Gemma 4 chat template, rendered by arm B Authoritative source of Gemma 4's native tool-call wire format
~/bin/DECISIONS.md Global decision log Three new 2026-04-20 entries: MoE-preferred, 3090 Ti primary, V100 degraded

Key Patterns Discovered

  • MoE vs dense is a latency cliff, not a smooth curve. gemma4:26b (MoE, ~4B active) decodes ~4.7× faster than gemma4:31b (dense, 31.3B active) on every GPU tested, because memory bandwidth is the binding constraint and the active-parameter bill is what you pay for per token. Total parameter count doesn't predict latency.
  • Ollama's JSON↔native-token tool-call translator is faithful on gemma4:26b Q4. Arms A (JSON tools via /api/chat) and B (raw native tokens via /api/generate raw:true) produced identical behavioral shapes on the 4-task mort-bakeoff suite. Good for mort-bot's confidence in its production path.
  • Ollama's /api/generate strips matched stop tokens from the response. Arm B's initial version mis-handled this by checking done_reason == "stop" as the "already terminated" branch; the correct logic is to always re-append the stop token based on which OPEN token (<|tool_call> vs <|turn>) is present in the completion.
  • Jinja message.get('content') checks the raw string, not the strip-thinking'd version. Storing the model's <|channel>thought\n<channel|> prefix in an assistant message's content field causes the template's post-tool-response conditional to append a spurious <turn|>\n, corrupting the next step's prompt. Safe default: leave content="" when the message has tool_calls.

Work Completed

Tasks Finished

  • Researched "most native Gemma 4 engine" — concluded google-deepmind/gemma (JAX) is the canonical reference; gemma.cpp verified to still NOT support Gemma 4 on dev branch (main README "CPU-only inference for: Gemma 2-3, PaliGemma 2")
  • Scaffolded three-arm native-bakeoff harness (ollama-json, ollama-native, jax-native) at scripts/native-bakeoff/
  • Ran A+B sweep on gemma4:26b Q4 via Strix Halo host over Tailscale; debugged arm-B parser bug; concluded Ollama's JSON↔native translator is faithful
  • Probed GPU inventory across steel141 (3090 Ti), pve197 CT 105 (V100), strix-halo (Strix Halo iGPU)
  • Built scripts/gpu-bakeoff/harness.py — env-var-keyed hosts, warmup + 3 runs, canonical timing extraction
  • Ran the bakeoff; discovered V100 was 95% CPU-bound due to SDXL occupying ~31 GB of its VRAM
  • Wrote docs/reference/gpu-bakeoff-2026-04-20.md with V100 column initially included, then removed at Seth's direction
  • Scrubbed PII/IPs from the doc and harness: host alias matt-strixstrix-halo, URLs moved to env vars, runs/ dir renamed, JSONs patched
  • Updated ~/bin/DECISIONS.md with three 2026-04-20 entries
  • Added feedback memory for the PII-scrub preference
  • Updated README.md index entry for the new bakeoff doc

Files Modified

File Changes Rationale
docs/reference/gpu-bakeoff-2026-04-20.md Created (final: 3090 Ti vs Strix Halo) Session's primary artifact
scripts/gpu-bakeoff/ New dir — harness + runs Bakeoff infrastructure
scripts/native-bakeoff/ New dir — three-arm harness, parked Earlier research thread, parked but shippable
README.md One new row in the file index Discoverability for the new doc
~/bin/DECISIONS.md Three new 2026-04-20 entries MoE preference, 3090 Ti primacy, V100-SDXL contention
~/.claude/projects/-home-claude-bin-gemma4-research/memory/feedback_scrub_pii_before_publish.md New memory entry Seth's preference for scrubbing artifacts before sharing
~/.claude/projects/-home-claude-bin-gemma4-research/memory/MEMORY.md Index entry added Link to the new memory

Decisions Made

Decision Options Considered Rationale
Pivot from native-bakeoff to GPU-bakeoff mid-session Complete native-bakeoff first; park and come back Seth explicitly pivoted ("What I really want is..."); native-bakeoff was already functionally answered (A ≡ B)
Remove V100 from GPU-bakeoff doc entirely rather than keep with caveat Keep with prominent ⚠ badge; drop the column Seth directed "remove v100 from doc"; degraded data with caveat pollutes the narrative
Env-var-ize host URLs in harness source rather than config file .env file; hard-coded with placeholders; CLI-only Lightest change that accomplishes scrub; localhost default keeps steel141 path usable out of the box
Start GPU bakeoff on E4B, not 26B, for the native-bakeoff thread Go straight to 26B (production model) Actually reversed to 26B mid-session when strix-halo (Matt's host) was found reachable with gemma4:26b already pulled — production-shape became the shipped path
Don't rewrite git history to remove IPs from earlier commits Force-push a cleaned history Destructive; Seth's "remove IP/PI" was scoped to current artifacts, not a history scrub. Flagged the tradeoff and did not act
Chain this handoff to the previous OpenWebUI one chronologically even though topically unrelated Link as "continues from"; mark "supersedes"; no chain Session-handoff skill's chain field is chronological per doc conventions; the narrative separation is called out in the body

Pending Work

Immediate Next Steps

  1. (Optional) Isolated V100 re-run. Stop CT 167 (ai-visualizer / SDXL) on pve197, then OLLAMA_PVE197_URL=http://<ip>:11434 python3 scripts/gpu-bakeoff/harness.py --host pve197. Expected result: V100 lands between 3090 Ti and Strix Halo based on HBM2 ~900 GB/s spec. Add a V100 column back to the doc with isolated numbers. Judgment call — worth the ai-visualizer interruption?
  2. (Optional) Strix max-model-fit follow-up. Strix can host models neither the 3090 Ti nor V100 can. Pull a larger model (gemma4:26b-a4b-it-q8_0 at 28 GB, or something 40B+) on the Strix Halo host; re-run harness to characterize the bandwidth/capacity ceiling for that architecture.
  3. (Optional) Close the native-bakeoff thread with arm C. Set up a JAX env on steel141 or in a vast-h100 session, pip install gemma, run the JAX ToolSampler arm against the same mort-bakeoff task suite. If arm C matches arms A/B, that's definitive "Ollama's runtime is faithful to the DeepMind reference." If it diverges, the GGUF quantization / llama.cpp runtime is the variable to investigate.

Blockers/Open Questions

  • Does gemma4:31b-it-q4_K_M on the V100 still deserve its 2026-04-07 "primary model on V100" designation? The new 2026-04-20 decision noting 26B-MoE preference doesn't formally supersede it — they coexist on a speed vs quality axis that wasn't measured here. If a future session cares, a quality bakeoff (same tasks, qualitatively scored outputs) would resolve it.
  • Quantization sensitivity unmeasured. All bakeoff numbers are Q4_K_M. Q8 vs Q4 throughput ratio on the same model (especially on Strix where more headroom is available) is an open question that came up in the "open questions" section of the doc.

Deferred Items

  • Native-bakeoff arm C — env setup cost, not landing in this session.
  • Git history scrub — would require force-push; Seth's scrub request was interpreted as "current artifacts only" and he was informed of the tradeoff.
  • DECISIONS.md per-project local — considered creating a project-local decision log for the bakeoff findings but instead promoted them to the global log (~/bin/DECISIONS.md) since the hardware/model implications are cross-project.

Context for Resuming Agent

Important Context

  • The V100 caveat is in git history (commit b619035) but not the final doc. If someone greps the repo for "V100" and expects to find it in the current head, they won't — the final commit 22af597 removed it deliberately.
  • Host aliases were scrubbed this session. matt-strix was renamed to strix-halo in the repo; the SSH alias in ~/.ssh/config and ~/bin/CLAUDE.md still uses the original name. Don't "reconcile" those by renaming the alias locally — Seth uses it as-is outside the published repo.
  • Harness requires env vars for non-local hosts now. Running scripts/gpu-bakeoff/harness.py --host strix-halo without OLLAMA_STRIX_URL set will error out with a clear message. Set it from the SSH alias / Tailscale IP as needed.
  • The scrubbed URL constants are NOT in this repo. If the next session needs to re-run the bakeoff against the original hosts, pull them from ~/bin/CLAUDE.md (SSH aliases → tailscale/LAN IPs) or probe via ssh strix-halo hostname -I / equivalent.
  • gemma4:latest on steel141 is the E4B-it variant (8 GB), NOT the MoE 26B. Confirmed during smoke-testing. Other hosts may resolve gemma4:latest differently.
  • Push-on-commit is the convention for this repo (~/bin/CLAUDE.md Gitea section). Both commits this session were pushed immediately.

Assumptions Made

  • The V100 was degraded "because of SDXL" based on /api/ps showing size_vram: 1.57 GB of a 30.5 GB model + nvidia-smi showing 31.7/32.7 GB used by other processes. Not independently verified by stopping SDXL and re-running; that's the open follow-up. If SDXL wasn't actually the culprit (e.g., Ollama version bug on that host), the finding needs revisiting.
  • matt-strix's gemma4:31b tag and steel141's gemma4:31b-it-q4_K_M tag are the same weights (both Q4_K_M, both 19.9 GB, both 31.3 B params). Verified via /api/tags metadata; not by hash comparison.
  • Ollama's /api/generate canonical timing fields (prompt_eval_duration, eval_duration, etc.) are trustworthy for throughput measurements. Supported by their deterministic behavior across runs; not compared against external profiling.

Potential Gotchas

  • keep_alive: 10m in the harness keeps models resident. Running the full matrix against a host with limited VRAM can leave the model loaded after the harness exits; subsequent unrelated Ollama users may see degraded performance until keep_alive expires or another model evicts it.
  • The V100 runs are gone from scripts/gpu-bakeoff/runs/ (commit 22af597). Git history has them at b619035^. Don't write new code expecting runs/pve197/ to exist locally.
  • The native-bakeoff content="" fix is subtle. If someone "improves" arm B to preserve the model's pre-tool-call thinking text as assistant content, they'll regress the turn-termination bug. Module-level comment in scripts/native-bakeoff/arms/ollama_native.py calls this out but is easy to miss.
  • gemma.cpp status as of 2026-04-20: dev branch README still says Gemma 2/3 + PaliGemma 2 only. Don't suggest gemma.cpp as a Gemma 4 option without re-checking.
  • Arm B's raw_completion_tail/prompt_tail/prompt_head trace fields were added during debugging and left in place. They make the trace JSONs larger than strictly necessary; ok to remove if cleanliness matters, but don't delete the fix they were added to diagnose.

Environment State

Tools/Services Used

  • Local Ollama on steel141 (127.0.0.1:11434) — version and model list as of session
  • Remote Ollama on strix-halo (via Tailscale) — version 0.21.0, models: gemma4:26b, gemma4:31b
  • Remote Ollama on pve197 CT 105 — models include the Q8 MoE gemma4:26b-a4b-it-q8_0 that only fits V100
  • Git / Gitea at git.sethpc.xyz/Seth/gemma4-research
  • Python 3 with aiohttp, jinja2, urllib.request (stdlib only for gpu-bakeoff)

Active Processes

  • None started or left running by this session. The keep_alive: 10m in harness.py may still be holding models resident briefly post-session; they'll drop when the TTL expires.

Environment Variables

  • OLLAMA_STEEL141_URL — default http://127.0.0.1:11434 if unset
  • OLLAMA_PVE197_URL — no default; required if --host pve197
  • OLLAMA_STRIX_URL — no default; required if --host strix-halo
  • Optionally OLLAMA_URL for any one-off calls to a different host, though harness doesn't read this

(No values are embedded in source; none logged here per handoff security policy.)


Security Reminder: Before finalizing, run validate_handoff.py to check for accidental secret exposure.