gemma4-research/docs/reference/gpu-bakeoff-2026-04-20.md
GPU Bakeoff — Gemma 4 Throughput: 3090 Ti vs Strix Halo

Date: 2026-04-20
Host matrix: steel141 (RTX 3090 Ti) · strix-halo (AMD Strix Halo iGPU)
Models: gemma4:26b (MoE Q4_K_M) · gemma4:31b-it-q4_K_M (dense Q4_K_M)
Harness: scripts/gpu-bakeoff/harness.py
Raw data: scripts/gpu-bakeoff/runs/


TL;DR

| GPU | 26B (MoE) decode | 31B (dense) decode | Long-prompt prefill (26B) |
| --- | --- | --- | --- |
| RTX 3090 Ti (steel141) | 128 tok/s | 27 tok/s | 23,849 tok/s |
| AMD Strix Halo iGPU (strix-halo) | 54 tok/s (42%) | 11 tok/s (39%) | 14,326 tok/s (60%) |

Percentages are relative to the 3090 Ti.

Headline findings

  1. MoE changes everything. gemma4:26b decodes roughly 4.7× to 5.1× faster than gemma4:31b (4.7× on the 3090 Ti, 5.1× on Strix Halo), because only ~4 B of its 25.8 B parameters activate per token. Total parameter counts (26 B vs 31 B) don't predict latency; active parameters do.
  2. 3090 Ti wins decisively on decode. For inference workloads, the combination of memory bandwidth (~1008 GB/s) and compute on consumer Ampere with GDDR6X is hard to beat at this price point.
  3. Strix Halo punches above its bandwidth. It reaches 42 % of the 3090 Ti's decode speed on the MoE model with only ~25 % of the memory bandwidth (~256 GB/s vs ~1008 GB/s), which suggests good SIMD utilization, especially on the MoE model.

Hardware inventory

| Host | GPU | VRAM | Bandwidth | Compute cap | Notes |
| --- | --- | --- | --- | --- | --- |
| steel141 | RTX 3090 Ti | 24 GB GDDR6X | ~1008 GB/s | 8.6 (Ampere) | Workstation. Also has a GTX 1660 SUPER as aux display card, not used for inference. Ollama on localhost. |
| strix-halo | AMD Strix Halo (Radeon 890M iGPU + XDNA 2 NPU) | Shared LPDDR5X | ~256 GB/s | | Unified memory lets it fit models a 24 GB card can't. Ollama accessed via Tailscale. |

Methodology

  • Each (host × model × prompt-length) cell:
    • 1 warm-up call (discarded, absorbs model load time and JIT warm-up)
    • 3 measurement calls
    • temperature: 0.0, top_k: 1 (greedy), num_predict: 256, num_ctx: 4096
    • keep_alive: 10m so the model stays resident between runs
  • Two prompt lengths:
    • short (~15 tokens) — isolates decode performance, prefill time is negligible
    • long (~500 tokens) — stresses prefill (prompt evaluation)
  • All timings come from Ollama's own /api/generate response fields (prompt_eval_duration, eval_duration, etc.), so HTTP and wall-clock jitter are excluded from the rates.
  • Median of the 3 measurement runs is reported in tables; min/max are in the raw JSON.
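
A minimal sketch of how the request settings and rate calculations above could look in Python. It assumes Ollama's documented /api/generate response fields (prompt_eval_count, prompt_eval_duration, eval_count, eval_duration, with durations in nanoseconds). The function names build_payload, rates, and summarize are illustrative and are not taken from harness.py.

```python
import statistics

NS_PER_S = 1e9  # Ollama reports durations in nanoseconds


def build_payload(model: str, prompt: str) -> dict:
    """Request body for POST /api/generate with the settings listed above."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,          # one JSON response carrying the timing fields
        "keep_alive": "10m",      # keep the model resident between runs
        "options": {
            "temperature": 0.0,
            "top_k": 1,           # greedy decoding
            "num_predict": 256,
            "num_ctx": 4096,
        },
    }


def rates(resp: dict) -> dict:
    """Prefill and decode rates (tok/s) from one /api/generate response."""
    return {
        "prefill_tok_s": resp["prompt_eval_count"]
        / (resp["prompt_eval_duration"] / NS_PER_S),
        "decode_tok_s": resp["eval_count"] / (resp["eval_duration"] / NS_PER_S),
    }


def summarize(responses: list) -> dict:
    """Discard the first (warm-up) response, then take the median of the rest."""
    measured = [rates(r) for r in responses[1:]]
    return {
        key: statistics.median(m[key] for m in measured)
        for key in ("prefill_tok_s", "decode_tok_s")
    }
```

Because the rates are computed only from server-side counters, the same functions work whether the responses come from a live host or from saved JSON.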

Full results

Decode rate (tok/s, median of 3 runs)

Decode is the metric that matters most for interactive LLM use — it's the speed of token generation after the prompt has been processed.

| Model | 3090 Ti | Strix Halo |
| --- | --- | --- |
| gemma4:26b (MoE, ~4 B active) | 128.20 | 53.86 |
| gemma4:31b (dense, 31.3 B active) | 27.15 | 10.64 |
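
The ratios quoted elsewhere in this report can be re-derived from the decode table. The short sketch below does that; the dictionary layout and variable names are illustrative, and the numbers are copied from the table above.

```python
# Median decode rates (tok/s) copied from the table above.
DECODE = {
    "3090 Ti": {"gemma4:26b": 128.20, "gemma4:31b": 27.15},
    "Strix Halo": {"gemma4:26b": 53.86, "gemma4:31b": 10.64},
}

# MoE speedup over dense, per GPU.
moe_speedup = {
    gpu: r["gemma4:26b"] / r["gemma4:31b"] for gpu, r in DECODE.items()
}

# Strix Halo decode rate as a fraction of the 3090 Ti, per model.
strix_vs_3090 = {
    model: DECODE["Strix Halo"][model] / DECODE["3090 Ti"][model]
    for model in DECODE["3090 Ti"]
}

for gpu, s in moe_speedup.items():
    print(f"{gpu}: MoE is {s:.1f}x faster than dense")
for model, f in strix_vs_3090.items():
    print(f"{model}: Strix Halo reaches {f:.0%} of the 3090 Ti")
```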

Prefill rate (tok/s, long ~500-token prompt, median)

Prefill is the cost of ingesting the prompt and populating the KV cache before decode begins. Prompt tokens are processed as a batch, so on a short prompt the fixed per-call overhead dominates and the resulting rate is noisy (see the raw JSON for those numbers). The long-prompt numbers below are the ones to reason from.

| Model | 3090 Ti | Strix Halo |
| --- | --- | --- |
| gemma4:26b (long) | 23,849 | 14,326 |
| gemma4:31b (long) | 7,716 | 3,278 |

Short-prompt prefill (for reference)

On a 15-token prompt, prefill tokens/sec is not a meaningful rate: the prompt is too small to amortize fixed overhead. These numbers are included only to confirm no regression.

| Model | 3090 Ti | Strix Halo |
| --- | --- | --- |
| gemma4:26b (short) | 2,063 | 1,276 |
| gemma4:31b (short) | 661 | 292 |

Why 26B decodes 4.7× faster than 31B

gemma4:26b is the MoE variant ("A4B" in Google's naming = activated 4B). Per-token inference routes through only ~4 B of its 25.8 B total parameters. gemma4:31b is dense: every one of its 31.3 B parameters participates in every token's forward pass. Memory bandwidth is the binding constraint for decode, so the ratio of active params is what you actually pay for.

Rough math (3090 Ti, 1008 GB/s, Q4_K_M ≈ 0.5 bytes/param):

  • 26B MoE: 4 B active params × 0.5 bytes/param = 2 GB read per token. Theoretical max ≈ 1008 / 2 = 504 tok/s. Observed 128 tok/s = 25 % utilization.
  • 31B dense: 31.3 B params × 0.5 bytes/param = 15.65 GB read per token. Theoretical max ≈ 1008 / 15.65 ≈ 64 tok/s. Observed 27 tok/s = 42 % utilization.
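
The rough math above can be reproduced with a small bandwidth-ceiling calculation. This is a sketch under the same simplifying assumptions as the text: every active weight is read once per token, Q4_K_M is approximated as 0.5 bytes/param, and KV-cache traffic is ignored. The function name decode_ceiling is illustrative.

```python
def decode_ceiling(bandwidth_gb_s: float, active_params_b: float,
                   bytes_per_param: float = 0.5) -> float:
    """Upper bound on decode tok/s when memory bandwidth is the only limit.

    active_params_b is in billions of parameters, so the product with
    bytes_per_param is the number of GB read per generated token.
    """
    gb_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_per_token


# (label, active params in billions, observed median decode tok/s on the 3090 Ti)
cases = [("26B MoE", 4.0, 128.20), ("31B dense", 31.3, 27.15)]

for label, active_b, observed in cases:
    ceiling = decode_ceiling(1008, active_b)
    print(f"{label}: ceiling {ceiling:.0f} tok/s, "
          f"utilization {observed / ceiling:.0%}")
# 26B MoE: ceiling 504 tok/s, utilization 25%
# 31B dense: ceiling 64 tok/s, utilization 42%
```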

So dense workloads actually extract higher bandwidth utilization — they're less overhead-dominated per token. But in absolute terms, MoE wins by a large factor because the active-parameter bill is much smaller. For interactive chat this is decisive: Seth's mort-bot running gemma4:26b gets ~4.7× the responsiveness it would on gemma4:31b, even though the models are near-equal in total params.

Why the ratio holds on both GPUs: memory bandwidth is the bottleneck on both hosts. Strix gets 42 % of the 3090 Ti's decode rate on 26B and 39 % on 31B. The two ratios are nearly identical, and both sit well above Strix's ~25 % share of the bandwidth, so Strix converts a larger fraction of its available bandwidth into tokens than the 3090 Ti does.


When to use which GPU

Interactive chat / agent workloads (decode-heavy).

  • Primary: 3090 Ti — by a wide margin. 128 tok/s on 26B is comfortable for real-time responses.
  • Fallback: Strix Halo. 54 tok/s is usable, and its unified memory can host larger models that the 24 GB 3090 Ti can't.

Long-context / prompt-heavy workloads (prefill-heavy).

  • Primary: 3090 Ti again — 23,849 tok/s prefill means a 500-token prompt ingests in ~21 ms.
  • Strix at 14,326 tok/s is ~35 ms — still interactive.
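
The prefill times above follow directly from prompt length divided by prefill rate. The sketch below extends the same arithmetic to decode time. The function name latency_ms is illustrative, and the model is deliberately simple: it ignores model load, network, and scheduling overhead.

```python
def latency_ms(prompt_tokens: int, output_tokens: int,
               prefill_tok_s: float, decode_tok_s: float) -> tuple:
    """Return (prefill_ms, decode_ms) for one request, overheads ignored."""
    prefill_ms = 1000 * prompt_tokens / prefill_tok_s
    decode_ms = 1000 * output_tokens / decode_tok_s
    return prefill_ms, decode_ms


# gemma4:26b medians from this report: (prefill tok/s, decode tok/s)
HOSTS = {
    "3090 Ti": (23_849, 128.20),
    "Strix Halo": (14_326, 53.86),
}

for host, (prefill_rate, decode_rate) in HOSTS.items():
    prefill, decode = latency_ms(500, 256, prefill_rate, decode_rate)
    print(f"{host}: prefill {prefill:.0f} ms, decode of 256 tokens {decode:.0f} ms")
```

For a 500-token prompt and a 256-token reply, decode time dominates on both hosts, which is why the decode rate is the more important number for interactive use.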

Running models that don't fit on discrete cards.

  • Strix Halo. Unified LPDDR5X can hold 80 GB+ models that a 24 GB 3090 Ti can't — at the cost of lower bandwidth.
  • The largest model tested here (gemma4:31b Q4 at 19.9 GB) fits both. Q8 variants (28 GB+) only fit Strix in this matrix.

Fine-tuning / training.

  • Not measured here. 3090 Ti's 24 GB limits batch size on 20 B+ models.

Open questions / follow-ups

  1. Strix max-model fit. Strix can host models that wouldn't fit the 3090 Ti. A follow-up would pull a larger model (70 B+ quantized) on strix-halo and measure the Strix-only performance ceiling.
  2. Q8 vs Q4 on Strix. Same model, two quantizations — quality/speed tradeoff characterization.

Raw data

All per-run JSON traces are under scripts/gpu-bakeoff/runs/:

runs/
├── steel141/
│   ├── gemma4-26b/{short,long}.json
│   └── gemma4-31b/{short,long}.json
└── strix-halo/
    ├── gemma4-26b/{short,long}.json
    └── gemma4-31b/{short,long}.json

Each JSON contains the warmup call and all 3 measurement calls with every field Ollama's /api/generate returns (token counts, durations, loaded-at, context length), plus a summary with min/median/max for prefill and decode rates.
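
A small sketch for checking that a local copy of the runs directory has every file shown in the tree above. It relies only on the directory layout; the JSON schema is not assumed here, and the helper names expected_run_files and missing_run_files are illustrative.

```python
from itertools import product
from pathlib import Path

HOSTS = ("steel141", "strix-halo")
MODELS = ("gemma4-26b", "gemma4-31b")
PROMPTS = ("short", "long")


def expected_run_files(root) -> list:
    """One JSON path per (host, model, prompt length) cell in the matrix."""
    return [
        Path(root) / host / model / f"{prompt}.json"
        for host, model, prompt in product(HOSTS, MODELS, PROMPTS)
    ]


def missing_run_files(root) -> list:
    """Paths from the expected layout that are not present on disk."""
    return [p for p in expected_run_files(root) if not p.is_file()]


if __name__ == "__main__":
    for path in missing_run_files("scripts/gpu-bakeoff/runs"):
        print(f"missing: {path}")
```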