feat: native-bakeoff scaffold — Ollama JSON vs native-token tool-calling

Three-arm harness under scripts/native-bakeoff/: - arm A: /api/chat with JSON tools (current default) - arm B: /api/generate raw:true with canonical HF jinja template rendered directly - arm C: google-deepmind/gemma JAX ToolSampler (env-gated, JAX required) Interim finding from A+B sweep on matt-strix gemma4:26b Q4: Ollama's bidirectional JSON↔native tool-call translator is faithful. The "long" multi-tool task produces identical behavior (7 steps / 6 tools) on both arms. Earlier arm-B parser bug that looked like a divergence was a harness issue: preserving the model's <|channel>thought\n<channel|> prefix as assistant content tripped the jinja template's tool_response-following conditional, appending a spurious <turn|>\n that corrupted the next step's prompt. Fixed by dropping the channel prefix on the assistant message. Arm C left as scaffolded-but-not-run — the JAX/bf16 reference path would answer "does the GGUF runtime diverge from DeepMind's implementation" but requires a separate env with the `gemma` PyPI package. Parked pending SDXL eviction or vast-h100 session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 05:45:12 -04:00
parent 91aaaa48d7
commit df5542f7d6
21 changed files with 1800 additions and 0 deletions
@@ -0,0 +1,37 @@
+{
+  "arm": "ollama-json",
+  "model": "gemma4:26b",
+  "num_ctx": 8192,
+  "num_predict": 2048,
+  "started_at": 1776600290.2718768,
+  "turns": [
+    {
+      "step": 1,
+      "elapsed_s": 0.72,
+      "prompt_eval_count": 1310,
+      "eval_count": 23,
+      "content_len": 0,
+      "tool_call_count": 1,
+      "history_chars_before_append": 2656
+    },
+    {
+      "step": 2,
+      "elapsed_s": 1.74,
+      "prompt_eval_count": 1422,
+      "eval_count": 67,
+      "content_len": 209,
+      "tool_call_count": 0,
+      "history_chars_before_append": 2928
+    }
+  ],
+  "final": {
+    "halt_reason": "no_tool_calls",
+    "steps_used": 2,
+    "tool_calls_total": 1,
+    "wall_clock_s": 2.46,
+    "final_message_count": 16,
+    "final_history_chars": 3137
+  },
+  "task": "memory",
+  "task_prompt": "What do I have stored about home automation? If anything, summarize it briefly."
+}