diff --git a/README.md b/README.md
index 306126a..169804b 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ Research corpus and implementation guidance for Google Gemma 4, based on product
 | `docs/openwebui-setup.md` | How to configure Gemma 4 inside OpenWebUI — per-setting reference, two ready-to-bake Workspace Model profiles (chat + extract), and a symptom→cause troubleshooting table mapped back to GOTCHAS.md. Assumes Ollama + OpenWebUI are already running. | When setting up or debugging a Gemma 4 model in OpenWebUI, or handing the front-end config to someone else |
 | `docs/reference/bakeoff-2026-04-18.md` | CLI-coding-agent bakeoff on 3090 Ti. **Rounds 1/2 misidentified the cause; Round 3 (the correct one): `think: false` silent-stops gemma4:26b at certain multi-turn states on 32K context.** 31B and Qwen3-Coder robust to the flag. Harness at `scripts/bakeoff/` | When deciding which model to back a CLI agent with, writing a custom agent payload, or debugging a silent tool-call halt |
 | `docs/reference/mort-bakeoff-2026-04-18.md` | mort-bot-specific `think=true` vs `think=false` bakeoff on mort's actual loop shape (gemma4:26b, num_ctx=8192). **Thinking does NOT accumulate in context on Ollama 0.20.4** — strips it from serialized history. Both settings behave identically on step counts, tool counts, wall clock. Harness at `scripts/mort-bakeoff/` | When deciding mort-bot's THINK env var, or when someone claims "think=true eats context" without pinning an Ollama version |
-| `docs/reference/gpu-bakeoff-2026-04-20.md` | Cross-GPU throughput bakeoff: steel141 RTX 3090 Ti vs matt-strix (AMD Strix Halo). **3090 Ti wins decode decisively (128 tok/s on 26B MoE). Strix gets ~42% of that on ~25% of the bandwidth.** Also quantifies the MoE vs dense gap: 26B decodes ~4.7× faster than 31B on both cards. Harness at `scripts/gpu-bakeoff/` | When choosing which host to run a Gemma 4 workload on |
+| `docs/reference/gpu-bakeoff-2026-04-20.md` | Cross-GPU throughput bakeoff: steel141 RTX 3090 Ti vs strix-halo (AMD Strix Halo). **3090 Ti wins decode decisively (128 tok/s on 26B MoE). Strix gets ~42% of that on ~25% of the bandwidth.** Also quantifies the MoE vs dense gap: 26B decodes ~4.7× faster than 31B on both cards. Harness at `scripts/gpu-bakeoff/` | When choosing which host to run a Gemma 4 workload on |
 | `tooling/` | **Canonical upstream tooling** — real scripts, notebooks, model cards, and configs pulled from Google / HF / framework maintainers (147 files). Subdirs: `google-official/`, `huggingface/`, `inference-frameworks/`, `gemma-family/`, `fine-tuning/`. See `tooling/README.md` for index and findings that update the older `CORPUS_*` docs | When you need authoritative source material — model cards, chat templates, fine-tuning recipes, serving commands for vLLM / llama.cpp / MLX, or to scope a specialized sibling (ShieldGemma, EmbeddingGemma, etc.) |
 
 ## Source Projects
diff --git a/docs/reference/gpu-bakeoff-2026-04-20.md b/docs/reference/gpu-bakeoff-2026-04-20.md
index 2706d9d..4185180 100644
--- a/docs/reference/gpu-bakeoff-2026-04-20.md
+++ b/docs/reference/gpu-bakeoff-2026-04-20.md
@@ -1,7 +1,7 @@
 # GPU Bakeoff — Gemma 4 Throughput: 3090 Ti vs Strix Halo
 
 **Date:** 2026-04-20
-**Host matrix:** steel141 (RTX 3090 Ti) · matt-strix (AMD Strix Halo iGPU)
+**Host matrix:** steel141 (RTX 3090 Ti) · strix-halo (AMD Strix Halo iGPU)
 **Models:** `gemma4:26b` (MoE Q4_K_M) · `gemma4:31b-it-q4_K_M` (dense Q4_K_M)
 **Harness:** `scripts/gpu-bakeoff/harness.py`
 **Raw data:** `scripts/gpu-bakeoff/runs/`
@@ -13,7 +13,7 @@
 | GPU | 26B (MoE) decode | 31B (dense) decode | Long-prompt prefill (26B) |
 |-----|------------------|--------------------|-----------------------|
 | **RTX 3090 Ti** (steel141) | **128 tok/s** | **27 tok/s** | **23,849 tok/s** |
-| **AMD Strix Halo iGPU** (matt-strix) | 54 tok/s (42%) | 11 tok/s (39%) | 14,326 tok/s (60%) |
+| **AMD Strix Halo iGPU** (strix-halo) | 54 tok/s (42%) | 11 tok/s (39%) | 14,326 tok/s (60%) |
 
 ### Headline findings
 
@@ -34,8 +34,8 @@
 
 | Host | GPU | VRAM | Bandwidth | Compute cap | Notes |
 |------|-----|------|-----------|-------------|-------|
-| steel141 | RTX 3090 Ti | 24 GB GDDR6X | ~1008 GB/s | 8.6 (Ampere) | Seth's workstation. Also has a GTX 1660 SUPER as aux display card — not used for inference. Ollama on 127.0.0.1:11434. |
-| matt-strix | AMD Strix Halo (Radeon 890M iGPU + XDNA 2 NPU) | Shared LPDDR5X | ~256 GB/s | — | Unified memory lets it fit models a 24 GB card can't. Ollama on 100.117.155.64:11434 via Tailscale. |
+| steel141 | RTX 3090 Ti | 24 GB GDDR6X | ~1008 GB/s | 8.6 (Ampere) | Workstation. Also has a GTX 1660 SUPER as aux display card — not used for inference. Ollama on localhost. |
+| strix-halo | AMD Strix Halo (Radeon 890M iGPU + XDNA 2 NPU) | Shared LPDDR5X | ~256 GB/s | — | Unified memory lets it fit models a 24 GB card can't. Ollama accessed via Tailscale. |
 
 ---
 
@@ -151,7 +151,7 @@ and matches or slightly exceeds proportionally.
 
 1. **Strix max-model fit.** Strix can host models that wouldn't fit the
    3090 Ti. A follow-up would pull a larger model (70 B+ quantized) on
-   matt-strix and measure the Strix-only performance ceiling.
+   strix-halo and measure the Strix-only performance ceiling.
 
 2. **Q8 vs Q4 on Strix.** Same model, two quantizations — quality/speed
    tradeoff characterization.
@@ -166,7 +166,7 @@ runs/
 ├── steel141/
 │   ├── gemma4-26b/{short,long}.json
 │   └── gemma4-31b/{short,long}.json
-└── matt-strix/
+└── strix-halo/
     ├── gemma4-26b/{short,long}.json
     └── gemma4-31b/{short,long}.json
 ```
diff --git a/scripts/gpu-bakeoff/harness.py b/scripts/gpu-bakeoff/harness.py
index fd346a2..b07771d 100644
--- a/scripts/gpu-bakeoff/harness.py
+++ b/scripts/gpu-bakeoff/harness.py
@@ -5,7 +5,7 @@ three hosts:
 
   - steel141  : RTX 3090 Ti (24 GB GDDR6X, compute 8.6, ~1008 GB/s)
   - pve197    : Tesla V100-PCIE-32GB (32 GB HBM2, compute 7.0, ~900 GB/s)
-  - matt-strix: AMD Strix Halo iGPU (shared LPDDR5X, ~256 GB/s)
+  - strix-halo: AMD Strix Halo iGPU (shared LPDDR5X, ~256 GB/s)
 
 Per (host, model, prompt_length), runs 1 warmup + N measurement runs,
 records Ollama's canonical timing fields, and writes one JSON trace to
@@ -15,6 +15,13 @@ All three Ollama servers are polled via HTTP; no SSH required.
 All timings come from Ollama's own /api/generate response fields so wall-
 clock jitter between the harness and the server is excluded.
 
+Host URLs are resolved from environment variables so routable addresses
+don't live in source. Set these before running against non-local hosts:
+
+    OLLAMA_STEEL141_URL=http://127.0.0.1:11434
+    OLLAMA_PVE197_URL=http://:11434
+    OLLAMA_STRIX_URL=http://:11434
+
 Invocation:
     python3 harness.py --host steel141 --model gemma4:26b --prompt short
     python3 harness.py all  # runs the full planned matrix
@@ -24,6 +31,7 @@ from __future__ import annotations
 
 import argparse
 import json
+import os
 import sys
 import time
 import urllib.request
@@ -31,16 +39,30 @@ from pathlib import Path
 
 
 HOSTS = {
-    "steel141": {"url": "http://127.0.0.1:11434", "gpu": "RTX 3090 Ti", "vram_gb": 24},
-    "pve197": {"url": "http://192.168.0.179:11434", "gpu": "Tesla V100-PCIE-32GB", "vram_gb": 32},
-    "matt-strix": {"url": "http://100.117.155.64:11434", "gpu": "AMD Strix Halo iGPU", "vram_gb": None},
+    "steel141": {"url_env": "OLLAMA_STEEL141_URL", "default_url": "http://127.0.0.1:11434",
+                 "gpu": "RTX 3090 Ti", "vram_gb": 24},
+    "pve197": {"url_env": "OLLAMA_PVE197_URL", "default_url": None,
+               "gpu": "Tesla V100-PCIE-32GB", "vram_gb": 32},
+    "strix-halo": {"url_env": "OLLAMA_STRIX_URL", "default_url": None,
+                   "gpu": "AMD Strix Halo iGPU", "vram_gb": None},
 }
 
-# Per-host model tag mapping. matt-strix uses gemma4:31b, the others
+
+def _host_url(host: str) -> str:
+    cfg = HOSTS[host]
+    url = os.environ.get(cfg["url_env"]) or cfg["default_url"]
+    if not url:
+        raise RuntimeError(
+            f"host {host!r} has no URL — set ${cfg['url_env']} in env"
+        )
+    return url
+
+
+# Per-host model tag mapping. strix-halo uses gemma4:31b, the others
 # use gemma4:31b-it-q4_K_M — identical weights, different tags.
 MODEL_ALIASES = {
-    "gemma4:26b": {"steel141": "gemma4:26b", "pve197": "gemma4:26b", "matt-strix": "gemma4:26b"},
-    "gemma4:31b": {"steel141": "gemma4:31b-it-q4_K_M", "pve197": "gemma4:31b-it-q4_K_M", "matt-strix": "gemma4:31b"},
+    "gemma4:26b": {"steel141": "gemma4:26b", "pve197": "gemma4:26b", "strix-halo": "gemma4:26b"},
+    "gemma4:31b": {"steel141": "gemma4:31b-it-q4_K_M", "pve197": "gemma4:31b-it-q4_K_M", "strix-halo": "gemma4:31b"},
     # V100-only edge case — only 32 GB host has headroom for the Q8 MoE.
     "gemma4:26b-q8": {"pve197": "gemma4:26b-a4b-it-q8_0"},
 }
@@ -151,7 +173,7 @@ def run_matrix(
         return {"host": host, "model_alias": model_alias, "skipped": "model not available on host"}
 
     prompt = PROMPTS[prompt_key]
-    url = host_cfg["url"]
+    url = _host_url(host)
 
     trace = {
         "host": host,
diff --git a/scripts/gpu-bakeoff/runs/matt-strix/gemma4-26b-q8/long.json b/scripts/gpu-bakeoff/runs/strix-halo/gemma4-26b-q8/long.json
similarity index 75%
rename from scripts/gpu-bakeoff/runs/matt-strix/gemma4-26b-q8/long.json
rename to scripts/gpu-bakeoff/runs/strix-halo/gemma4-26b-q8/long.json
index 7b0d94d..c0378ad 100644
--- a/scripts/gpu-bakeoff/runs/matt-strix/gemma4-26b-q8/long.json
+++ b/scripts/gpu-bakeoff/runs/strix-halo/gemma4-26b-q8/long.json
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "model_alias": "gemma4:26b-q8",
   "skipped": "model not available on host"
 }
\ No newline at end of file
diff --git a/scripts/gpu-bakeoff/runs/matt-strix/gemma4-26b-q8/short.json b/scripts/gpu-bakeoff/runs/strix-halo/gemma4-26b-q8/short.json
similarity index 75%
rename from scripts/gpu-bakeoff/runs/matt-strix/gemma4-26b-q8/short.json
rename to scripts/gpu-bakeoff/runs/strix-halo/gemma4-26b-q8/short.json
index 7b0d94d..c0378ad 100644
--- a/scripts/gpu-bakeoff/runs/matt-strix/gemma4-26b-q8/short.json
+++ b/scripts/gpu-bakeoff/runs/strix-halo/gemma4-26b-q8/short.json
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "model_alias": "gemma4:26b-q8",
   "skipped": "model not available on host"
 }
\ No newline at end of file
diff --git a/scripts/gpu-bakeoff/runs/matt-strix/gemma4-26b/long.json b/scripts/gpu-bakeoff/runs/strix-halo/gemma4-26b/long.json
similarity index 98%
rename from scripts/gpu-bakeoff/runs/matt-strix/gemma4-26b/long.json
rename to scripts/gpu-bakeoff/runs/strix-halo/gemma4-26b/long.json
index 97ecd3c..69dd030 100644
--- a/scripts/gpu-bakeoff/runs/matt-strix/gemma4-26b/long.json
+++ b/scripts/gpu-bakeoff/runs/strix-halo/gemma4-26b/long.json
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "gpu": "AMD Strix Halo iGPU",
   "vram_gb": null,
   "model_alias": "gemma4:26b",
diff --git a/scripts/gpu-bakeoff/runs/matt-strix/gemma4-26b/short.json b/scripts/gpu-bakeoff/runs/strix-halo/gemma4-26b/short.json
similarity index 98%
rename from scripts/gpu-bakeoff/runs/matt-strix/gemma4-26b/short.json
rename to scripts/gpu-bakeoff/runs/strix-halo/gemma4-26b/short.json
index f49e2dc..48e547f 100644
--- a/scripts/gpu-bakeoff/runs/matt-strix/gemma4-26b/short.json
+++ b/scripts/gpu-bakeoff/runs/strix-halo/gemma4-26b/short.json
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "gpu": "AMD Strix Halo iGPU",
   "vram_gb": null,
   "model_alias": "gemma4:26b",
diff --git a/scripts/gpu-bakeoff/runs/matt-strix/gemma4-31b/long.json b/scripts/gpu-bakeoff/runs/strix-halo/gemma4-31b/long.json
similarity index 98%
rename from scripts/gpu-bakeoff/runs/matt-strix/gemma4-31b/long.json
rename to scripts/gpu-bakeoff/runs/strix-halo/gemma4-31b/long.json
index f9e4faf..1f684b5 100644
--- a/scripts/gpu-bakeoff/runs/matt-strix/gemma4-31b/long.json
+++ b/scripts/gpu-bakeoff/runs/strix-halo/gemma4-31b/long.json
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "gpu": "AMD Strix Halo iGPU",
   "vram_gb": null,
   "model_alias": "gemma4:31b",
diff --git a/scripts/gpu-bakeoff/runs/matt-strix/gemma4-31b/short.json b/scripts/gpu-bakeoff/runs/strix-halo/gemma4-31b/short.json
similarity index 98%
rename from scripts/gpu-bakeoff/runs/matt-strix/gemma4-31b/short.json
rename to scripts/gpu-bakeoff/runs/strix-halo/gemma4-31b/short.json
index 73af9f8..46c2b46 100644
--- a/scripts/gpu-bakeoff/runs/matt-strix/gemma4-31b/short.json
+++ b/scripts/gpu-bakeoff/runs/strix-halo/gemma4-31b/short.json
@@ -1,5 +1,5 @@
 {
-  "host": "matt-strix",
+  "host": "strix-halo",
   "gpu": "AMD Strix Halo iGPU",
   "vram_gb": null,
   "model_alias": "gemma4:31b",