Mortdecai 5011059f5d docs: initial Gemma 4 research corpus and synthesis
Architecture specs, benchmarks, gotchas, Ollama settings, tool calling
format, and implementation patterns from Simon and AI_Visualizer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 18:14:19 -04:00


Gemma 4 Native Tool Calling Format

Source: Google AI for Developers - Function Calling docs https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4

Special Tokens (6 total)

Token                                   Purpose
<|tool> / <tool|>                       Tool definition block
<|tool_call> / <tool_call|>             Model's tool request
<|tool_response> / <tool_response|>     Tool execution result

String delimiter: <|"|> (encloses every string value in the native format)
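A minimal Python sketch of the value syntax, inferred only from the examples in this note: strings are wrapped in <|"|>, numbers are written bare, and objects use unquoted keys with comma-separated pairs. The helper names encode_value and encode_tool_call are hypothetical, and writing booleans as bare true/false is an assumption the examples do not show. This is not Google's chat template.

```python
STR = '<|"|>'  # native string delimiter token


def encode_value(value):
    """Render a Python value in the native syntax shown in this note's examples."""
    if isinstance(value, str):
        return f"{STR}{value}{STR}"
    if isinstance(value, bool):  # assumption: booleans are written as bare true/false
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)  # numbers are unquoted, e.g. temperature:15
    if isinstance(value, (list, tuple)):
        return "[" + ",".join(encode_value(v) for v in value) + "]"
    if isinstance(value, dict):
        # keys are bare identifiers; pairs are separated by commas without spaces
        return "{" + ",".join(f"{k}:{encode_value(v)}" for k, v in value.items()) + "}"
    raise TypeError(f"unsupported type: {type(value).__name__}")


def encode_tool_call(name, arguments):
    """Wrap encoded arguments in a <|tool_call>...<tool_call|> block."""
    return f"<|tool_call>call:{name}{encode_value(arguments)}<tool_call|>"


print(encode_tool_call("get_current_temperature", {"location": "London"}))
# <|tool_call>call:get_current_temperature{location:<|"|>London<|"|>}<tool_call|>
```

The printed line matches the raw tool call example in the native format section. In practice Ollama and the other runtimes do this encoding for you; the sketch is only useful for inspecting or hand-building raw prompts.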

Native Format (raw model tokens)

Tool definition in system prompt:

<|tool>declaration:
get_current_temperature{
  location:{type:<|"|>string<|"|>,description:<|"|>The city<|"|>},
  unit:{type:<|"|>string<|"|>,enum:[<|"|>celsius<|"|>,<|"|>fahrenheit<|"|>]}
}<tool|>

Tool call from model:

<|tool_call>call:get_current_temperature{location:<|"|>London<|"|>}<tool_call|>

Tool response:

<|tool_response>response:get_current_temperature{temperature:15,weather:<|"|>sunny<|"|>}<tool_response|>
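To read raw tool_call or tool_response blocks back on the client side, a small recursive-descent parser can turn them into a tool name plus a Python dict. This is a hypothetical sketch (parse_native is not part of any library) that assumes the value syntax shown in the examples above; true/false/null handling is an assumption, and tool declaration blocks are not covered.

```python
import re

STR = '<|"|>'  # native string delimiter token
# <|tool_call>call:NAME{...}<tool_call|> or <|tool_response>response:NAME{...}<tool_response|>
_BLOCK = re.compile(r"<\|(tool_call|tool_response)>(?:call|response):(\w+)(\{.*\})<\1\|>", re.S)
# bare scalars; true/false/null are assumed, the examples only show numbers
_SCALAR = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|true|false|null")


class _Parser:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def peek(self):
        """Skip whitespace and return the next character without consuming it."""
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.text):
            raise ValueError("unexpected end of input")
        return self.text[self.pos]

    def value(self):
        ch = self.peek()
        if ch == "{":
            return self.container("{", "}")
        if ch == "[":
            return self.container("[", "]")
        if self.text.startswith(STR, self.pos):  # delimited string
            end = self.text.index(STR, self.pos + len(STR))
            s = self.text[self.pos + len(STR):end]
            self.pos = end + len(STR)
            return s
        m = _SCALAR.match(self.text, self.pos)
        if not m:
            raise ValueError(f"unexpected input at offset {self.pos}")
        self.pos = m.end()
        tok = m.group(0)
        if tok in ("true", "false"):
            return tok == "true"
        if tok == "null":
            return None
        return float(tok) if any(c in tok for c in ".eE") else int(tok)

    def container(self, open_ch, close_ch):
        self.pos += 1  # consume the opening bracket
        items = {} if open_ch == "{" else []
        while self.peek() != close_ch:
            if open_ch == "{":
                colon = self.text.index(":", self.pos)  # keys are bare identifiers
                key = self.text[self.pos:colon].strip()
                self.pos = colon + 1
                items[key] = self.value()
            else:
                items.append(self.value())
            if self.peek() == ",":
                self.pos += 1
            elif self.text[self.pos] != close_ch:
                raise ValueError(f"expected ',' or {close_ch!r} at offset {self.pos}")
        self.pos += 1  # consume the closing bracket
        return items


def parse_native(block):
    """Return (kind, tool_name, arguments) for a native tool_call/tool_response block."""
    m = _BLOCK.fullmatch(block.strip())
    if not m:
        raise ValueError("not a native tool_call/tool_response block")
    kind, name, body = m.groups()
    parser = _Parser(body)
    args = parser.value()
    if parser.pos != len(body):
        raise ValueError("trailing data after arguments")
    return kind, name, args
```

For example, parsing the tool call above yields ("tool_call", "get_current_temperature", {"location": "London"}). Because strings are fenced by <|"|>, commas, colons, and braces inside string values do not confuse the parser.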

JSON Chat Format (for Ollama / OpenAI-compatible APIs)

This is the format you use in practice: Ollama converts these JSON messages to and from the native tokens.

Tool definition:

{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {"type": "string", "description": "The city name"}
      },
      "required": ["city"]
    }
  }
}

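Model returns (Ollama /api/chat shape; OpenAI-compatible endpoints also add an id and type to each call and send arguments as a JSON string):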

{
  "role": "assistant",
  "tool_calls": [{
    "function": {
      "name": "get_weather",
      "arguments": {"city": "London"}
    }
  }]
}

Tool result message (OpenAI-compatible endpoints also require the matching tool_call_id):

{
  "role": "tool",
  "content": "{\"temperature\": 15, \"weather\": \"sunny\"}"
}
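Putting the JSON messages together, a client reads tool_calls from the assistant message, runs the matching function, and appends one role "tool" message per call. This is a sketch: the registry and the stubbed get_weather are hypothetical, and the isinstance check covers both Ollama (arguments as an object) and OpenAI-compatible endpoints (arguments as a JSON string, results matched by tool_call_id).

```python
import json


def get_weather(city):
    # Hypothetical stub standing in for a real weather lookup.
    return {"temperature": 15, "weather": "sunny"}


TOOLS = {"get_weather": get_weather}  # hypothetical registry: tool name -> callable


def run_tool_calls(assistant_message, registry=TOOLS):
    """Execute every tool call in an assistant message; return role="tool" messages."""
    results = []
    for call in assistant_message.get("tool_calls", []):
        fn = call["function"]
        args = fn["arguments"]
        if isinstance(args, str):  # OpenAI-compatible APIs send a JSON string
            args = json.loads(args)
        tool = registry.get(fn["name"])
        if tool is None:
            output = {"error": f"unknown tool {fn['name']}"}
        else:
            output = tool(**args)
        message = {"role": "tool", "content": json.dumps(output)}
        if "id" in call:  # OpenAI-compatible endpoints match results by tool_call_id
            message["tool_call_id"] = call["id"]
        results.append(message)
    return results
```

Appending the assistant message and then these tool messages to the history, and calling the model again, completes one round of the loop.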

Thinking Mode + Tool Calls

  • When thinking is enabled, preserve thoughts between tool calls
  • For long agent chains, summarize thoughts as plain text to save context
  • Recommended: disable thinking for tool-heavy workflows (Seth's finding)

Framework Flags

Framework      Required flag / usage
llama.cpp      --jinja
vLLM           --enable-auto-tool-choice (used together with --tool-call-parser)
Ollama         No flag; pass the tools field to the /api/chat endpoint
transformers   apply_chat_template(tools=[...])

Known Issues

  • Ollama v0.20.0-0.20.1: tool-call parser broken; streaming responses drop tool calls
  • llama.cpp: reports of format mismatches and endless generation loops
  • LM Studio: compatibility issues with tool calling
  • Workaround: use non-streaming mode ("stream": false) for tool-call requests (confirmed in Simon)
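The non-streaming workaround amounts to sending "stream": false in the /api/chat request body. A stdlib-only sketch, assuming Ollama's default local address http://localhost:11434 and a placeholder model tag gemma4 (substitute whatever tag you actually pulled). The optional think field follows the disable-thinking recommendation above and only applies on Ollama versions that support it; build_chat_request and chat are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint


def build_chat_request(model, messages, tools, think=None, url=OLLAMA_URL):
    """Build a non-streaming /api/chat POST request carrying tool definitions."""
    payload = {"model": model, "messages": messages, "tools": tools, "stream": False}
    if think is not None:
        payload["think"] = think  # only sent when set; needs an Ollama build with thinking support
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(request, timeout=120):
    """Send the request; with stream=False Ollama replies with a single JSON object."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

For example, chat(build_chat_request("gemma4", messages, [weather_tool], think=False))["message"] returns the complete assistant message, including any tool_calls, in one piece instead of spread across stream chunks.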