Files

T

Mortdecai 0f82cd71b1 docs: session handoff — GPU bakeoff (3090 Ti vs Strix Halo)

Closes out the session that produced docs/reference/gpu-bakeoff-2026-04-20.md
and the parked scripts/native-bakeoff/ scaffold. Chains (chronologically)
from the 2026-04-18 OpenWebUI setup handoff though the topic is unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-20 06:00:07 -04:00

17 KiB

Raw Blame History

Handoff: GPU Bakeoff — 3090 Ti vs Strix Halo (+ parked native-bakeoff scaffold)

Session Metadata

Created: 2026-04-20 05:56:58
Project: /home/claude/bin/gemma4-research
Branch: master (pushed to origin)
Session duration: ~extended session, multi-pivot (~4+ hours)

Recent Commits (for context)

91842f3 docs: scrub PII/IPs from gpu-bakeoff ← latest, end of session
22af597 docs: remove V100 from GPU bakeoff ← V100 column dropped
b619035 feat: GPU bakeoff — 3090 Ti vs V100 vs Strix Halo ← initial write-up (superseded by later scrubs)
df5542f feat: native-bakeoff scaffold — Ollama JSON vs native-token tool-calling ← parked research
91aaaa4 docs: redact PII from persistent-correspondence findings

Handoff Chain

Continues from: 2026-04-18-233832-openwebui-setup-doc.md
- Previous title: OpenWebUI Setup Doc for Gemma 4
Supersedes: None

This session is not a continuation of the OpenWebUI doc work — it's a fresh research thread on the same repo. The link is chronological, not topical. Previous handoff is only relevant if debugging OpenWebUI-related Gemma 4 behavior.

Current State Summary

Session started on a native-vs-JSON tool-calling bakeoff question, pivoted to a cross-GPU throughput comparison mid-session, and shipped the latter. Final state: docs/reference/gpu-bakeoff-2026-04-20.md comparing gemma4:26b MoE and gemma4:31b dense decode/prefill rates on RTX 3090 Ti (steel141) vs AMD Strix Halo iGPU (strix-halo host). V100 data was initially gathered and included but removed when it turned out the V100 was 95% CPU-bound due to SDXL coresident on CT 167 — the published doc is a clean 2-host comparison. Native-bakeoff harness (the earlier thread) remains scaffolded and committed at scripts/native-bakeoff/ but not run further. Repo is clean, three commits pushed.

Codebase Understanding

Architecture Overview

The repo is a Gemma 4 research corpus. New this session:

scripts/native-bakeoff/ — three-arm tool-calling harness (Ollama JSON tools vs Ollama raw native tokens vs google-deepmind/gemma JAX ToolSampler). Arms A and B tested and functionally equivalent on gemma4:26b Q4 against a shared task suite lifted from mort-bakeoff. Arm C is env-gated (requires JAX + gemma PyPI package); wired but not run.
scripts/gpu-bakeoff/ — cross-GPU throughput harness. Takes host aliases from HOSTS dict and resolves URLs from env vars (OLLAMA_STEEL141_URL, OLLAMA_PVE197_URL, OLLAMA_STRIX_URL). Runs 1 warmup + 3 measurement calls per (host × model × prompt-length), logs Ollama's canonical timing fields, aggregates min/median/max.
docs/reference/gpu-bakeoff-2026-04-20.md — the finished writeup. 3090 Ti + Strix Halo only.

The docs/reference/ tier holds experimental findings; docs/ top-level holds applied how-to guides. Both bakeoffs landed in docs/reference/ which is correct.

Critical Files

File	Purpose	Relevance
`docs/reference/gpu-bakeoff-2026-04-20.md`	The session's primary artifact	Read this first for the session's shipped findings
`scripts/gpu-bakeoff/harness.py`	GPU bakeoff harness, env-var-driven URL resolution	Re-run the bakeoff (e.g., for isolated V100) by setting env vars + invoking
`scripts/gpu-bakeoff/runs/*/.json`	Raw per-call timing data	Source of truth for the doc's numbers; each JSON has warmup + 3 runs with full Ollama timing fields
`scripts/native-bakeoff/harness.py`	Parked three-arm tool-calling harness	Reference if revisiting the native-vs-JSON question; arms A and B are ready, arm C needs JAX env
`scripts/native-bakeoff/arms/ollama_native.py`	Arm B — renders the canonical HF jinja chat template directly, POSTs to /api/generate raw:true	Contains a subtle fix (keep assistant `content=""` when it has `tool_calls`) that's easy to regress
`tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja`	Canonical Gemma 4 chat template, rendered by arm B	Authoritative source of Gemma 4's native tool-call wire format
`~/bin/DECISIONS.md`	Global decision log	Three new 2026-04-20 entries: MoE-preferred, 3090 Ti primary, V100 degraded

Key Patterns Discovered

MoE vs dense is a latency cliff, not a smooth curve. gemma4:26b (MoE, ~4B active) decodes ~4.7× faster than gemma4:31b (dense, 31.3B active) on every GPU tested, because memory bandwidth is the binding constraint and the active-parameter bill is what you pay for per token. Total parameter count doesn't predict latency.
Ollama's JSON↔native-token tool-call translator is faithful on gemma4:26b Q4. Arms A (JSON tools via /api/chat) and B (raw native tokens via /api/generate raw:true) produced identical behavioral shapes on the 4-task mort-bakeoff suite. Good for mort-bot's confidence in its production path.
Ollama's /api/generate strips matched stop tokens from the response. Arm B's initial version mis-handled this by checking done_reason == "stop" as the "already terminated" branch; the correct logic is to always re-append the stop token based on which OPEN token (<|tool_call> vs <|turn>) is present in the completion.
Jinja message.get('content') checks the raw string, not the strip-thinking'd version. Storing the model's <|channel>thought\n<channel|> prefix in an assistant message's content field causes the template's post-tool-response conditional to append a spurious <turn|>\n, corrupting the next step's prompt. Safe default: leave content="" when the message has tool_calls.

Work Completed

Tasks Finished

Researched "most native Gemma 4 engine" — concluded google-deepmind/gemma (JAX) is the canonical reference; gemma.cpp verified to still NOT support Gemma 4 on dev branch (main README "CPU-only inference for: Gemma 2-3, PaliGemma 2")
Scaffolded three-arm native-bakeoff harness (ollama-json, ollama-native, jax-native) at scripts/native-bakeoff/
Ran A+B sweep on gemma4:26b Q4 via Strix Halo host over Tailscale; debugged arm-B parser bug; concluded Ollama's JSON↔native translator is faithful
Probed GPU inventory across steel141 (3090 Ti), pve197 CT 105 (V100), strix-halo (Strix Halo iGPU)
Built scripts/gpu-bakeoff/harness.py — env-var-keyed hosts, warmup + 3 runs, canonical timing extraction
Ran the bakeoff; discovered V100 was 95% CPU-bound due to SDXL occupying ~31 GB of its VRAM
Wrote docs/reference/gpu-bakeoff-2026-04-20.md with V100 column initially included, then removed at Seth's direction
Scrubbed PII/IPs from the doc and harness: host alias matt-strix → strix-halo, URLs moved to env vars, runs/ dir renamed, JSONs patched
Updated ~/bin/DECISIONS.md with three 2026-04-20 entries
Added feedback memory for the PII-scrub preference
Updated README.md index entry for the new bakeoff doc

Files Modified

File	Changes	Rationale
`docs/reference/gpu-bakeoff-2026-04-20.md`	Created (final: 3090 Ti vs Strix Halo)	Session's primary artifact
`scripts/gpu-bakeoff/`	New dir — harness + runs	Bakeoff infrastructure
`scripts/native-bakeoff/`	New dir — three-arm harness, parked	Earlier research thread, parked but shippable
`README.md`	One new row in the file index	Discoverability for the new doc
`~/bin/DECISIONS.md`	Three new 2026-04-20 entries	MoE preference, 3090 Ti primacy, V100-SDXL contention
`~/.claude/projects/-home-claude-bin-gemma4-research/memory/feedback_scrub_pii_before_publish.md`	New memory entry	Seth's preference for scrubbing artifacts before sharing
`~/.claude/projects/-home-claude-bin-gemma4-research/memory/MEMORY.md`	Index entry added	Link to the new memory

Decisions Made

Decision	Options Considered	Rationale
Pivot from native-bakeoff to GPU-bakeoff mid-session	Complete native-bakeoff first; park and come back	Seth explicitly pivoted ("What I really want is..."); native-bakeoff was already functionally answered (A ≡ B)
Remove V100 from GPU-bakeoff doc entirely rather than keep with caveat	Keep with prominent ⚠ badge; drop the column	Seth directed "remove v100 from doc"; degraded data with caveat pollutes the narrative
Env-var-ize host URLs in harness source rather than config file	.env file; hard-coded with placeholders; CLI-only	Lightest change that accomplishes scrub; localhost default keeps steel141 path usable out of the box
Start GPU bakeoff on E4B, not 26B, for the native-bakeoff thread	Go straight to 26B (production model)	Actually reversed to 26B mid-session when strix-halo (Matt's host) was found reachable with `gemma4:26b` already pulled — production-shape became the shipped path
Don't rewrite git history to remove IPs from earlier commits	Force-push a cleaned history	Destructive; Seth's "remove IP/PI" was scoped to current artifacts, not a history scrub. Flagged the tradeoff and did not act
Chain this handoff to the previous OpenWebUI one chronologically even though topically unrelated	Link as "continues from"; mark "supersedes"; no chain	Session-handoff skill's chain field is chronological per doc conventions; the narrative separation is called out in the body

Pending Work

Immediate Next Steps

(Optional) Isolated V100 re-run. Stop CT 167 (ai-visualizer / SDXL) on pve197, then OLLAMA_PVE197_URL=http://<ip>:11434 python3 scripts/gpu-bakeoff/harness.py --host pve197. Expected result: V100 lands between 3090 Ti and Strix Halo based on HBM2 ~900 GB/s spec. Add a V100 column back to the doc with isolated numbers. Judgment call — worth the ai-visualizer interruption?
(Optional) Strix max-model-fit follow-up. Strix can host models neither the 3090 Ti nor V100 can. Pull a larger model (gemma4:26b-a4b-it-q8_0 at 28 GB, or something 40B+) on the Strix Halo host; re-run harness to characterize the bandwidth/capacity ceiling for that architecture.
(Optional) Close the native-bakeoff thread with arm C. Set up a JAX env on steel141 or in a vast-h100 session, pip install gemma, run the JAX ToolSampler arm against the same mort-bakeoff task suite. If arm C matches arms A/B, that's definitive "Ollama's runtime is faithful to the DeepMind reference." If it diverges, the GGUF quantization / llama.cpp runtime is the variable to investigate.

Blockers/Open Questions

Does gemma4:31b-it-q4_K_M on the V100 still deserve its 2026-04-07 "primary model on V100" designation? The new 2026-04-20 decision noting 26B-MoE preference doesn't formally supersede it — they coexist on a speed vs quality axis that wasn't measured here. If a future session cares, a quality bakeoff (same tasks, qualitatively scored outputs) would resolve it.
Quantization sensitivity unmeasured. All bakeoff numbers are Q4_K_M. Q8 vs Q4 throughput ratio on the same model (especially on Strix where more headroom is available) is an open question that came up in the "open questions" section of the doc.

Deferred Items

Native-bakeoff arm C — env setup cost, not landing in this session.
Git history scrub — would require force-push; Seth's scrub request was interpreted as "current artifacts only" and he was informed of the tradeoff.
DECISIONS.md per-project local — considered creating a project-local decision log for the bakeoff findings but instead promoted them to the global log (~/bin/DECISIONS.md) since the hardware/model implications are cross-project.

Context for Resuming Agent

Important Context

The V100 caveat is in git history (commit b619035) but not the final doc. If someone greps the repo for "V100" and expects to find it in the current head, they won't — the final commit 22af597 removed it deliberately.
Host aliases were scrubbed this session. matt-strix was renamed to strix-halo in the repo; the SSH alias in ~/.ssh/config and ~/bin/CLAUDE.md still uses the original name. Don't "reconcile" those by renaming the alias locally — Seth uses it as-is outside the published repo.
Harness requires env vars for non-local hosts now. Running scripts/gpu-bakeoff/harness.py --host strix-halo without OLLAMA_STRIX_URL set will error out with a clear message. Set it from the SSH alias / Tailscale IP as needed.
The scrubbed URL constants are NOT in this repo. If the next session needs to re-run the bakeoff against the original hosts, pull them from ~/bin/CLAUDE.md (SSH aliases → tailscale/LAN IPs) or probe via ssh strix-halo hostname -I / equivalent.
gemma4:latest on steel141 is the E4B-it variant (8 GB), NOT the MoE 26B. Confirmed during smoke-testing. Other hosts may resolve gemma4:latest differently.
Push-on-commit is the convention for this repo (~/bin/CLAUDE.md Gitea section). Both commits this session were pushed immediately.

Assumptions Made

The V100 was degraded "because of SDXL" based on /api/ps showing size_vram: 1.57 GB of a 30.5 GB model + nvidia-smi showing 31.7/32.7 GB used by other processes. Not independently verified by stopping SDXL and re-running; that's the open follow-up. If SDXL wasn't actually the culprit (e.g., Ollama version bug on that host), the finding needs revisiting.
matt-strix's gemma4:31b tag and steel141's gemma4:31b-it-q4_K_M tag are the same weights (both Q4_K_M, both 19.9 GB, both 31.3 B params). Verified via /api/tags metadata; not by hash comparison.
Ollama's /api/generate canonical timing fields (prompt_eval_duration, eval_duration, etc.) are trustworthy for throughput measurements. Supported by their deterministic behavior across runs; not compared against external profiling.

Potential Gotchas

keep_alive: 10m in the harness keeps models resident. Running the full matrix against a host with limited VRAM can leave the model loaded after the harness exits; subsequent unrelated Ollama users may see degraded performance until keep_alive expires or another model evicts it.
The V100 runs are gone from scripts/gpu-bakeoff/runs/ (commit 22af597). Git history has them at b619035^. Don't write new code expecting runs/pve197/ to exist locally.
The native-bakeoff content="" fix is subtle. If someone "improves" arm B to preserve the model's pre-tool-call thinking text as assistant content, they'll regress the turn-termination bug. Module-level comment in scripts/native-bakeoff/arms/ollama_native.py calls this out but is easy to miss.
gemma.cpp status as of 2026-04-20: dev branch README still says Gemma 2/3 + PaliGemma 2 only. Don't suggest gemma.cpp as a Gemma 4 option without re-checking.
Arm B's raw_completion_tail/prompt_tail/prompt_head trace fields were added during debugging and left in place. They make the trace JSONs larger than strictly necessary; ok to remove if cleanliness matters, but don't delete the fix they were added to diagnose.

Environment State

Tools/Services Used

Local Ollama on steel141 (127.0.0.1:11434) — version and model list as of session
Remote Ollama on strix-halo (via Tailscale) — version 0.21.0, models: gemma4:26b, gemma4:31b
Remote Ollama on pve197 CT 105 — models include the Q8 MoE gemma4:26b-a4b-it-q8_0 that only fits V100
Git / Gitea at git.sethpc.xyz/Seth/gemma4-research
Python 3 with aiohttp, jinja2, urllib.request (stdlib only for gpu-bakeoff)

Active Processes

None started or left running by this session. The keep_alive: 10m in harness.py may still be holding models resident briefly post-session; they'll drop when the TTL expires.

Environment Variables

OLLAMA_STEEL141_URL — default http://127.0.0.1:11434 if unset
OLLAMA_PVE197_URL — no default; required if --host pve197
OLLAMA_STRIX_URL — no default; required if --host strix-halo
Optionally OLLAMA_URL for any one-off calls to a different host, though harness doesn't read this

(No values are embedded in source; none logged here per handoff security policy.)

docs/reference/gpu-bakeoff-2026-04-20.md — the session's primary artifact
scripts/gpu-bakeoff/ — harness + raw traces
scripts/native-bakeoff/ — parked research thread, functional A+B arms
tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja — authoritative Gemma 4 chat template, rendered by arm B of native-bakeoff
~/bin/DECISIONS.md — three new 2026-04-20 entries relating to this session
MEMORY index — updated with PII-scrub feedback
Previous handoff: 2026-04-18-233832-openwebui-setup-doc.md — chronological predecessor, topically unrelated
Gitea commits this session: df5542f, b619035, 22af597, 91842f3

Security Reminder: Before finalizing, run validate_handoff.py to check for accidental secret exposure.

17 KiB Raw Blame History Unescape Escape