Mortdecai 5011059f5d docs: initial Gemma 4 research corpus and synthesis

Architecture specs, benchmarks, gotchas, Ollama settings, tool calling
format, and implementation patterns from Simon and AI_Visualizer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-12 18:14:19 -04:00


Gemma 4 Native Tool Calling Format

Source: Google AI for Developers - Function Calling docs https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4

Special Tokens (6 total)

Token	Purpose
`<|tool>` / `<tool|>`	Tool definition block
`<|tool_call>` / `<tool_call|>`	Model's tool request
`<|tool_response>` / `<tool_response|>`	Tool execution result

String delimiter: `<|"|>` (encloses all string values in native format)

Native Format (raw model tokens)

Tool definition in system prompt:

<|tool>declaration:
get_current_temperature{
  location:{type:<|"|>string<|"|>,description:<|"|>The city<|"|>},
  unit:{type:<|"|>string<|"|>,enum:[<|"|>celsius<|"|>,<|"|>fahrenheit<|"|>]}
}<tool|>

Tool call from model:

<|tool_call>call:get_current_temperature{location:<|"|>London<|"|>}<tool_call|>

Tool response:

<|tool_response>response:get_current_temperature{temperature:15,weather:<|"|>sunny<|"|>}<tool_response|>
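
If you ever have to handle the raw tokens yourself (for example when a framework's parser misbehaves), the call and response shapes above can be handled with a couple of regular expressions. The following is a minimal sketch, not a reference parser: it assumes flat arguments only (no nested objects or arrays), treats any value not wrapped in `<|"|>` as a number where possible, and the function names `parse_native_tool_call` and `format_native_tool_response` are hypothetical.

```python
import re

STRING_DELIM = '<|"|>'

# Matches one call block: <|tool_call>call:NAME{ARGS}<tool_call|>
CALL_RE = re.compile(r"<\|tool_call>call:(\w+)\{(.*?)\}<tool_call\|>", re.DOTALL)
# Matches one argument: key, then a delimited string or a bare literal.
ARG_RE = re.compile(r'(\w+):(?:<\|"\|>(.*?)<\|"\|>|([^,{}]+))', re.DOTALL)


def _parse_bare(token):
    """Convert an undelimited literal to int or float, else keep the raw text."""
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass
    return token


def parse_native_tool_call(text):
    """Return (name, args) for the first native tool call in text, or None."""
    match = CALL_RE.search(text)
    if match is None:
        return None
    name, body = match.group(1), match.group(2)
    args = {}
    for key, string_value, bare_value in ARG_RE.findall(body):
        if bare_value:
            args[key] = _parse_bare(bare_value.strip())
        else:
            args[key] = string_value
    return name, args


def format_native_tool_response(name, result):
    """Render a flat dict of strings and numbers as a native tool response block."""
    parts = []
    for key, value in result.items():
        if isinstance(value, str):
            parts.append(f"{key}:{STRING_DELIM}{value}{STRING_DELIM}")
        else:
            parts.append(f"{key}:{value}")
    return f"<|tool_response>response:{name}{{{','.join(parts)}}}<tool_response|>"
```

Passing the tool call line shown earlier to `parse_native_tool_call` yields the function name plus a one-entry dict for `location`. Anything beyond flat string and numeric values would need a real tokenizer-aware parser.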

JSON Chat Format (for Ollama / OpenAI-compatible APIs)

This is the format you use in practice; Ollama translates between it and the native tokens.

Tool definition:

{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {"type": "string", "description": "The city name"}
      },
      "required": ["city"]
    }
  }
}

Model returns:

{
  "role": "assistant",
  "tool_calls": [{
    "function": {
      "name": "get_weather",
      "arguments": {"city": "London"}
    }
  }]
}

Tool result message:

{
  "role": "tool",
  "content": "{\"temperature\": 15, \"weather\": \"sunny\"}"
}
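
The three messages above fit together into one round trip: send the tool definition, execute whatever the model requests, then send the result back. The sketch below shows that loop against Ollama's `/api/chat` endpoint in non-streaming mode. Several things are assumptions: the model tag `gemma4`, the default local address, and the stand-in `get_weather` implementation, which returns fixed data. The `send` parameter exists only so the loop can be exercised without a running server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed: default local Ollama address
MODEL = "gemma4"  # assumed tag; substitute the tag installed locally

WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "The city name"}},
            "required": ["city"],
        },
    },
}


def get_weather(city):
    """Hypothetical stand-in tool; a real one would query a weather service."""
    return {"temperature": 15, "weather": "sunny"}


TOOLS = {"get_weather": get_weather}


def build_request(messages):
    """Build a non-streaming chat payload that advertises the tool."""
    return {"model": MODEL, "messages": messages, "tools": [WEATHER_TOOL], "stream": False}


def run_tool_calls(assistant_message):
    """Execute each requested tool and return the matching role=tool messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        function = call["function"]
        arguments = function["arguments"]
        if isinstance(arguments, str):  # OpenAI-compatible endpoints send a JSON string
            arguments = json.loads(arguments)
        output = TOOLS[function["name"]](**arguments)
        results.append({"role": "tool", "content": json.dumps(output)})
    return results


def post_chat(payload):
    """POST the payload to Ollama and return the assistant message."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["message"]


def chat_with_tools(prompt, send=post_chat):
    """Run one user turn, resolving at most one round of tool calls."""
    messages = [{"role": "user", "content": prompt}]
    reply = send(build_request(messages))
    messages.append(reply)
    tool_messages = run_tool_calls(reply)
    if tool_messages:
        messages.extend(tool_messages)
        reply = send(build_request(messages))
        messages.append(reply)
    return messages
```

Calling `chat_with_tools("What is the weather in London?")` returns the full message history. A longer agent chain would repeat the tool step until the model stops requesting calls.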

Thinking Mode + Tool Calls

When thinking is enabled, preserve thoughts between tool calls
For long agent chains, summarize thoughts as plain text to save context
Recommended: disable thinking for tool-heavy workflows (Seth's finding)
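
A rough sketch of both options follows. It assumes the Ollama chat message shape, where reasoning arrives in a `thinking` field on assistant messages, and that the request-level `think` field is honoured for this model. The `compact_thinking` and `without_thinking` helpers are hypothetical, and plain truncation stands in for a real summarizer, which you would pass in as `summarize`.

```python
def truncate(thought, max_chars=200):
    """Crude stand-in for summarization: keep only the first max_chars characters."""
    if len(thought) <= max_chars:
        return thought
    return thought[:max_chars].rstrip() + " [truncated]"


def compact_thinking(messages, keep_last=1, summarize=truncate):
    """Return a copy of messages with older assistant thoughts shortened.

    The newest keep_last assistant thoughts are preserved verbatim so the
    model still sees its most recent reasoning between tool calls.
    """
    with_thoughts = [
        index
        for index, message in enumerate(messages)
        if message.get("role") == "assistant" and message.get("thinking")
    ]
    to_shorten = with_thoughts[:-keep_last] if keep_last else with_thoughts
    compacted = [dict(message) for message in messages]  # leave the input untouched
    for index in to_shorten:
        compacted[index]["thinking"] = summarize(compacted[index]["thinking"])
    return compacted


def without_thinking(payload):
    """Return a copy of a chat payload with thinking switched off."""
    return {**payload, "think": False}
```

For a tool-heavy workflow, wrapping every request in `without_thinking` is the simpler route. `compact_thinking` only matters when thinking stays on and the chain grows long.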

Framework Flags

Framework	Required Flag
llama.cpp	`--jinja`
vLLM	`--enable-auto-tool-choice` (set together with `--tool-call-parser`)
Ollama	Works via `/api/chat` endpoint with `tools` field
transformers	`apply_chat_template(tools=[...])`

Known Issues

Ollama v0.20.0-0.20.1: tool call parser broken, streaming drops tool calls
llama.cpp: format mismatches and continuous loops reported
LM Studio: compatibility issues with tool calling
Workaround: Use non-streaming mode for tool calls (proven in Simon)
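
To flag an affected Ollama install at startup, one option is to compare the server's reported version against the range listed above. This is a sketch under assumptions: it reads the version from Ollama's `/api/version` endpoint at the default local address, the helper names are hypothetical, and it only knows about the two versions named in these notes. It makes no claim about which later release fixes the parser.

```python
import json
import urllib.request

# Versions these notes list as having a broken tool call parser.
KNOWN_AFFECTED = {(0, 20, 0), (0, 20, 1)}


def parse_version(text):
    """Turn '0.20.1' or 'v0.20.1-rc1' into the tuple (0, 20, 1)."""
    core = text.strip().lstrip("v").split("-")[0].split("+")[0]
    return tuple(int(part) for part in core.split("."))


def is_known_affected(version_text):
    """True if the version is one of the releases listed as dropping tool calls."""
    return parse_version(version_text) in KNOWN_AFFECTED


def fetch_ollama_version(base_url="http://localhost:11434"):
    """Ask a running Ollama server for its version string."""
    with urllib.request.urlopen(base_url + "/api/version") as response:
        return json.load(response)["version"]
```

Sending `"stream": false` on every request that carries `tools` avoids the need for this check entirely, which is the workaround described above.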
