feat: native-bakeoff scaffold — Ollama JSON vs native-token tool-calling
Three-arm harness under scripts/native-bakeoff/: - arm A: /api/chat with JSON tools (current default) - arm B: /api/generate raw:true with canonical HF jinja template rendered directly - arm C: google-deepmind/gemma JAX ToolSampler (env-gated, JAX required) Interim finding from A+B sweep on matt-strix gemma4:26b Q4: Ollama's bidirectional JSON↔native tool-call translator is faithful. The "long" multi-tool task produces identical behavior (7 steps / 6 tools) on both arms. Earlier arm-B parser bug that looked like a divergence was a harness issue: preserving the model's <|channel>thought\n<channel|> prefix as assistant content tripped the jinja template's tool_response-following conditional, appending a spurious <turn|>\n that corrupted the next step's prompt. Fixed by dropping the channel prefix on the assistant message. Arm C left as scaffolded-but-not-run — the JAX/bf16 reference path would answer "does the GGUF runtime diverge from DeepMind's implementation" but requires a separate env with the `gemma` PyPI package. Parked pending SDXL eviction or vast-h100 session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,114 @@
|
||||
# Native Bakeoff — Gemma 4 Inference Path Comparison
|
||||
|
||||
Three-arm bakeoff comparing how different inference paths handle the
|
||||
same Gemma 4 tool-calling workload. Isolates Ollama's JSON↔native
|
||||
translator and the runtime itself as variables.
|
||||
|
||||
## The three arms
|
||||
|
||||
| Arm | Path | What varies |
|
||||
|-----|------|-------------|
|
||||
| A. `ollama-json` | `/api/chat` with OpenAI-style `tools:[...]` | Ollama translates JSON → native tokens on input, native tool-call tokens → structured JSON on output. |
|
||||
| B. `ollama-native` | `/api/generate` with `raw:true` + canonical HF jinja template | No JSON translation. Rendered tokens go straight to the model; the harness parses `<\|tool_call>` spans out of the completion. |
|
||||
| C. `jax-native` | `google-deepmind/gemma` reference `ToolSampler` | No Ollama. No llama.cpp. No GGUF quant. Reference Python + JAX + bf16. |
|
||||
|
||||
## Research question
|
||||
|
||||
> Does Ollama's JSON tools path materially diverge from the native/reference path?
|
||||
|
||||
- A vs B divergence ⇒ the Ollama server-side parser is the variable.
|
||||
- B vs C divergence ⇒ llama.cpp runtime / GGUF quantization / Ollama
|
||||
scheduler is the variable.
|
||||
- A ≡ B ≡ C ⇒ Ollama's path is faithful to the reference, current
|
||||
production usage is fine.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
**Arms A and B:** local Ollama with `gemma4:latest` (E4B 8B) or
|
||||
`gemma4:e4b-it-q8_0` pulled. Python 3.10+, `aiohttp`, `jinja2`.
|
||||
|
||||
**Arm C:** separate env with `jax` and `gemma` installed; HF
|
||||
credentials for checkpoint download (~8GB for E4B-it). See
|
||||
`arms/jax_native.py` module docstring.
|
||||
|
||||
## Running
|
||||
|
||||
```bash
|
||||
cd scripts/native-bakeoff
|
||||
|
||||
# One arm, one task:
|
||||
python3 harness.py --arm ollama-json --task memory --out runs/A/memory.json
|
||||
python3 harness.py --arm ollama-native --task memory --out runs/B/memory.json
|
||||
python3 harness.py --arm jax-native --task memory --out runs/C/memory.json
|
||||
|
||||
# Full sweep (A + B, 4 tasks each):
|
||||
for arm in ollama-json ollama-native; do
|
||||
for task in movies research memory long; do
|
||||
python3 harness.py --arm "$arm" --task "$task" \
|
||||
--out "runs/${arm}/${task}.json"
|
||||
done
|
||||
done
|
||||
```
|
||||
|
||||
Default model is `gemma4:latest` for Ollama arms (the E4B-it variant).
|
||||
Override with `--model gemma4:26b` if you want the MoE bakeoff
|
||||
(expect slower; 26B is 18GB GGUF).
|
||||
|
||||
## Trace schema
|
||||
|
||||
Each run writes a JSON with:
|
||||
- `arm`, `model`, `task`, `task_prompt`
|
||||
- `turns[]` — per-step metrics: `elapsed_s`, `prompt_eval_count`,
|
||||
`eval_count`, `tool_call_count`, `content_len`, etc.
|
||||
- `final` — `halt_reason`, `steps_used`, `tool_calls_total`,
|
||||
`wall_clock_s`, `final_history_chars`
|
||||
|
||||
Halt reasons: `no_tool_calls` (model produced final answer),
|
||||
`step_budget` (hit 20-step limit), `error:*`, `env_missing` (arm C
|
||||
only), `sampler_error:*` (arm C only).
|
||||
|
||||
## Smoke test evidence
|
||||
|
||||
First wiring run on 2026-04-19 against `gemma4:latest` on steel141
|
||||
(local Ollama, CPU):
|
||||
|
||||
| Arm | Task | Steps | Tools | Halt | Wall |
|
||||
|-----|------|-------|-------|------|------|
|
||||
| A (ollama-json) | memory | 2 | 1 | `no_tool_calls` | 10.16s |
|
||||
| B (ollama-native) | memory | 2 | 1 | `no_tool_calls` | 2.39s |
|
||||
|
||||
Identical *behavioral* shape (one tool call, clean final answer)
|
||||
on this simple task. The wall-clock delta is interesting but not
|
||||
conclusive on a single run — could be cache warmth or could be
|
||||
Ollama's parser overhead. A full sweep will separate signal from
|
||||
noise.
|
||||
|
||||
## Known limitations
|
||||
|
||||
- **Arm C system prompt handling.** `gm.text.ToolSampler` doesn't
|
||||
take a pre-populated message history cleanly, so arm C folds a
|
||||
compact version of `FAKE_HISTORY` into the user message. Arms A
|
||||
and B feed history through proper role-tagged turns. Fidelity
|
||||
compromise — if a C vs A/B delta traces here, rebuild
|
||||
`sampler.turns` directly before calling `.chat()`.
|
||||
- **Arm C sampler caveat.** The deepmind-gemma `ToolSampler`
|
||||
docstring notes "Gemma 1, 2 and 3 models were not specifically
|
||||
trained for tool use" and flags the sampler as a proof-of-concept.
|
||||
Gemma 4 *is* tool-trained, so it should do better, but if arm C
|
||||
underperforms A/B the sampler implementation may be the
|
||||
variable, not the model.
|
||||
- **Quantization confounder.** Ollama arms run Q8 (E4B) or Q4 (26B);
|
||||
arm C runs bf16. A non-trivial A vs C delta could be the
|
||||
quantization. Only A ≡ B ≢ C cleanly implicates the inference
|
||||
engine rather than the bits.
|
||||
|
||||
## Related artifacts
|
||||
|
||||
- `scripts/mort-bakeoff/harness.py` — the round-3 bakeoff that
|
||||
established `think:false` kills 26B in multi-turn tool loops.
|
||||
Task definitions are lifted from there.
|
||||
- `docs/reference/bakeoff-2026-04-18.md` — round-3 writeup.
|
||||
- `CORPUS_tool_calling_format.md` — the native Gemma 4 tool-call
|
||||
token syntax this harness implements.
|
||||
- `tooling/huggingface/model-cards/gemma-4-E4B-it-chat_template.jinja`
|
||||
— the canonical template arm B renders.
|
||||
Reference in New Issue
Block a user