Files
Mortdecai df5542f7d6 feat: native-bakeoff scaffold — Ollama JSON vs native-token tool-calling
Three-arm harness under scripts/native-bakeoff/:
- arm A: /api/chat with JSON tools (current default)
- arm B: /api/generate raw:true with canonical HF jinja template rendered directly
- arm C: google-deepmind/gemma JAX ToolSampler (env-gated, JAX required)

Interim finding from A+B sweep on matt-strix gemma4:26b Q4: Ollama's
bidirectional JSON↔native tool-call translator is faithful. The "long"
multi-tool task produces identical behavior (7 steps / 6 tools) on both
arms. Earlier arm-B parser bug that looked like a divergence was a
harness issue: preserving the model's <|channel>thought\n<channel|>
prefix as assistant content tripped the jinja template's
tool_response-following conditional, appending a spurious <turn|>\n
that corrupted the next step's prompt. Fixed by dropping the channel
prefix on the assistant message.

Arm C left as scaffolded-but-not-run — the JAX/bf16 reference path
would answer "does the GGUF runtime diverge from DeepMind's
implementation" but requires a separate env with the `gemma` PyPI
package. Parked pending SDXL eviction or vast-h100 session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 05:45:12 -04:00

4.7 KiB

Native Bakeoff — Gemma 4 Inference Path Comparison

Three-arm bakeoff comparing how different inference paths handle the same Gemma 4 tool-calling workload. Isolates Ollama's JSON↔native translator and the runtime itself as variables.

The three arms

Arm Path What varies
A. ollama-json /api/chat with OpenAI-style tools:[...] Ollama translates JSON → native tokens on input, native tool-call tokens → structured JSON on output.
B. ollama-native /api/generate with raw:true + canonical HF jinja template No JSON translation. Rendered tokens go straight to the model; the harness parses <|tool_call> spans out of the completion.
C. jax-native google-deepmind/gemma reference ToolSampler No Ollama. No llama.cpp. No GGUF quant. Reference Python + JAX + bf16.

Research question

Does Ollama's JSON tools path materially diverge from the native/reference path?

  • A vs B divergence ⇒ the Ollama server-side parser is the variable.
  • B vs C divergence ⇒ llama.cpp runtime / GGUF quantization / Ollama scheduler is the variable.
  • A ≡ B ≡ C ⇒ Ollama's path is faithful to the reference, current production usage is fine.

Prerequisites

Arms A and B: local Ollama with gemma4:latest (E4B 8B) or gemma4:e4b-it-q8_0 pulled. Python 3.10+, aiohttp, jinja2.

Arm C: separate env with jax and gemma installed; HF credentials for checkpoint download (~8GB for E4B-it). See arms/jax_native.py module docstring.

Running

cd scripts/native-bakeoff

# One arm, one task:
python3 harness.py --arm ollama-json   --task memory --out runs/A/memory.json
python3 harness.py --arm ollama-native --task memory --out runs/B/memory.json
python3 harness.py --arm jax-native    --task memory --out runs/C/memory.json

# Full sweep (A + B, 4 tasks each):
for arm in ollama-json ollama-native; do
  for task in movies research memory long; do
    python3 harness.py --arm "$arm" --task "$task" \
      --out "runs/${arm}/${task}.json"
  done
done

Default model is gemma4:latest for Ollama arms (the E4B-it variant). Override with --model gemma4:26b if you want the MoE bakeoff (expect slower; 26B is 18GB GGUF).

Trace schema

Each run writes a JSON with:

  • arm, model, task, task_prompt
  • turns[] — per-step metrics: elapsed_s, prompt_eval_count, eval_count, tool_call_count, content_len, etc.
  • finalhalt_reason, steps_used, tool_calls_total, wall_clock_s, final_history_chars

Halt reasons: no_tool_calls (model produced final answer), step_budget (hit 20-step limit), error:*, env_missing (arm C only), sampler_error:* (arm C only).

Smoke test evidence

First wiring run on 2026-04-19 against gemma4:latest on steel141 (local Ollama, CPU):

Arm Task Steps Tools Halt Wall
A (ollama-json) memory 2 1 no_tool_calls 10.16s
B (ollama-native) memory 2 1 no_tool_calls 2.39s

Identical behavioral shape (one tool call, clean final answer) on this simple task. The wall-clock delta is interesting but not conclusive on a single run — could be cache warmth or could be Ollama's parser overhead. A full sweep will separate signal from noise.

Known limitations

  • Arm C system prompt handling. gm.text.ToolSampler doesn't take a pre-populated message history cleanly, so arm C folds a compact version of FAKE_HISTORY into the user message. Arms A and B feed history through proper role-tagged turns. Fidelity compromise — if a C vs A/B delta traces here, rebuild sampler.turns directly before calling .chat().
  • Arm C sampler caveat. The deepmind-gemma ToolSampler docstring notes "Gemma 1, 2 and 3 models were not specifically trained for tool use" and flags the sampler as a proof-of-concept. Gemma 4 is tool-trained, so it should do better, but if arm C underperforms A/B the sampler implementation may be the variable, not the model.
  • Quantization confounder. Ollama arms run Q8 (E4B) or Q4 (26B); arm C runs bf16. A non-trivial A vs C delta could be the quantization. Only A ≡ B ≢ C cleanly implicates the inference engine rather than the bits.
  • scripts/mort-bakeoff/harness.py — the round-3 bakeoff that established think:false kills 26B in multi-turn tool loops. Task definitions are lifted from there.
  • docs/reference/bakeoff-2026-04-18.md — round-3 writeup.
  • CORPUS_tool_calling_format.md — the native Gemma 4 tool-call token syntax this harness implements.
  • tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja — the canonical template arm B renders.