Files
gemma4-research/docs/openwebui-setup.md
Mortdecai d9477da52a docs: OpenWebUI setup guide for Gemma 4
Applies SYNTHESIS.md + GOTCHAS.md findings to the OpenWebUI front-end:
per-setting reference, two baked-in Workspace Model profiles (chat +
extract), and a symptom→cause troubleshooting table. Front-loads the
`think: false` / gemma4:26b multi-turn footgun from Round 3 of the
2026-04-18 bakeoff since that is the shape OpenWebUI users will hit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 20:47:17 -04:00

12 KiB
Raw Permalink Blame History

Gemma 4 in OpenWebUI — Setup Guide

Derived from SYNTHESIS.md, GOTCHAS.md, and the 2026-04-18 bakeoff. Assumes: OpenWebUI is already running and gemma4:* is pulled in the backing Ollama. Covers every setting that matters and what to set it to.

TL;DR — Just Tell Me What To Toggle

Create a Workspace Model (don't edit per-chat), pick a base variant, and set:

Setting Multi-turn chat (the default OpenWebUI shape) Single-turn JSON pipeline
Base model gemma4:26b (fast) or gemma4:31b-it-q4_K_M (sharper) same
System Prompt Required — identity + boundaries + format (template below) Required
Context Length (num_ctx) 32768 409616384 (scale to prompt)
Max Tokens (num_predict) 4096 2048+
Temperature 0.7 0.20.4
Stream On for text-only chat; Off if you attach a Tool Off
Function Calling Native N/A
Reasoning / think LEAVE UNSET on 26B (do NOT force off). Unset or On on 31B. Force Off
Response Format / JSON mode Off (always) Off (always)
Keep Alive 30m or -1 match pipeline duration (4h / -1)

The single biggest failure mode in OpenWebUI: setting think: false on gemma4:26b in a chat. The model silent-stops at tool-decision turns with eval_count=4. If you see "model just stops answering after a few messages" on 26B, check this first. See GOTCHAS.md § "think: false Kills Gemma 4 26B in Multi-Turn Tool-Calling Loops".


Where Settings Live in OpenWebUI

OpenWebUI has four layers. Later layers override earlier ones:

  1. Ollama defaults — baked in, almost all wrong for Gemma 4 (num_ctx=2048, num_predict=128, keep_alive=5m).
  2. Admin Panel → Settings → Models — global defaults for all models. Touch this only to set sane fleet-wide floors.
  3. Workspace → Models → [Create/Edit] — named presets. This is where you bake Gemma 4 settings. A Workspace model = base model + system prompt + advanced params + tags + optional tool server bindings.
  4. Per-chat controls (right-hand panel / top of chat) — overrides for a single conversation. Useful for experimentation, bad for persistence.

Rule: every knob below goes in layer 3 (Workspace Model) unless noted. Per-chat overrides are for debugging only.


Step 1 — Create the Workspace Model

Workspace → Models → + Add Model (or Create a Model). Fill in:

  • Name: gemma4-26b-chat (or whatever matches your use case)
  • Base Model: pick from Ollama list. Recommended:
    • gemma4:26b — fastest, great default
    • gemma4:31b-it-q4_K_M — sharper, 5x slower, more VRAM
    • gemma4:e4b-it-q8_0 — 12GB VRAM, vision + audio (audio via llama.cpp only)
  • Description: what this preset is for. Future-you will thank you.
  • Profile Image / Tags: optional.
  • System Prompt: required (see Step 2).
  • Advanced Params: expand and configure (see Step 3).
  • Tools / Knowledge / Filters: optional — attach any tool servers here.
  • Capabilities (at bottom): toggle Vision if you want image input. Gemma 4 supports vision on all variants.

Save. The model now appears in the main chat dropdown.


Step 2 — System Prompt (Required)

Gemma 4 is ultra-compliant but doesn't know who it is. A blank or generic system prompt gets you a generic chatbot — and sporadic overfiltering.

Use the template from SYNTHESIS.md:

You are [NAME], a [ROLE DESCRIPTION]. You are powered by Gemma 4.

## What You Do
- [Explicit list of responsibilities]
- [Tools you have access to and when to use each one, if any]

## What You Do NOT Do
- [Explicit list of things to refuse or avoid]
- [Common mistakes to prevent]

## Output Format
[For free-form chat: "Respond in clear Markdown with code in fenced blocks."]
[For structured output: exact schema, field names, example if complex.]

## Rules
- [Behavioral constraints]
- [Multi-step chaining instructions if using tools]

Today's date: 2026-04-18

Principles:

  1. Identity first.
  2. Positive instructions (what to do) before negative (what not to do).
  3. Output format is explicit.
  4. Don't use language that sounds like you're asking the model to bypass restrictions — just state the task directly (safety overfilter trigger).

Step 3 — Advanced Params Reference

Expand Advanced Params in the Workspace Model editor. Every field, what to set, and why.

Sampling / Output Shape

Field Ollama default Set to Why
Temperature 0.8 0.7 (chat) / 0.3 (extraction) / 0.2 (scoring) Per SYNTHESIS.md temperature table.
Top K 40 leave default Gemma 4 is well-behaved at default.
Top P 0.9 leave default Same.
Min P 0.0 leave default Same.
Seed random leave blank Set only for A/B reproduction.
Stop Sequences none leave blank Gemma 4 emits proper EOS.
Mirostat / Eta / Tau off leave off Not needed; Min P / Top P work fine.
Frequency Penalty 0 leave 0 Any value biases style for little gain.
Repeat Penalty 1.1 leave default Fine at default.
Repeat Last N 64 leave default Fine at default.
Presence Penalty 0 leave 0 Same as frequency.

Context / Memory — these are the ones that bite

Field Ollama default Set to Why
Context Length (num_ctx) 2048 32768 chat / 409616384 pipeline Default truncates mid-system-prompt. GOTCHAS.md § "Ollama Default Context is 2048".
Max Tokens (num_predict) 128 4096 chat / 2048+ JSON Default truncates any useful reply. GOTCHAS.md § "num_predict Default is 128".
Batch Size (num_batch) 512 leave default Prompt-eval throughput; no Gemma 4 issue.
Tokens to Keep (num_keep) 4 leave default System-prompt header anchor.
Use Mmap on leave on Standard.
Use Mlock off leave off Standard.
Threads (num_thread) auto leave default Ollama picks.
Keep Alive 5m 30m for chat, 4h or -1 for pipelines Default unloads the model between messages — ~1030s reload penalty. GOTCHAS.md § "Keep-Alive Too Short".

Reasoning / Thinking — the OpenWebUI 26B killer

Field Set to Why
Reasoning / Thinking / Think toggle Leave UNSET on 26B in chat. Optional on 31B. Force Off only in single-turn JSON pipelines. Ollama 0.20+ defaults think: true. On gemma4:26b in multi-turn chat, forcing think: false causes silent stops (eval_count=4, empty content, no tool call) — the model just… stops. Verified 2026-04-18. 31B and Qwen3-Coder tolerate the flag. In single-turn JSON pipelines (AI_Visualizer shape) the old advice still applies: force off so thinking tokens don't eat num_predict. See GOTCHAS.md § "think: false Kills Gemma 4 26B" and § "Thinking Mode Eats Context".

OpenWebUI exposes this as a "Reasoning" toggle or a raw think field depending on version. If your version exposes it as tri-state (On / Off / Default), pick Default on 26B chat. If it's binary (On / Off), leave it On on 26B chat. Never Off on 26B chat.

Response Format — never use JSON mode

Field Set to Why
Response Format / Format = JSON Off / None / Text Ollama's server-side format: "json" enforcer causes Gemma 4 26B to infinite-loop on nested schemas. Ask for JSON in the prompt and parse client-side. GOTCHAS.md § "format=json Causes Infinite Loops".

Streaming & Function Calling

Field Set to Why
Stream Chat Response On for text-only chat. Off if you've attached Tools. Ollama v0.20.0 drops tool calls on streaming responses (community-reported, and matches Simon's non-streaming choice). GOTCHAS.md § "Tool Calling Broken in Ollama v0.20.0 Streaming".
Function Calling Native if you're attaching tools; otherwise Default / off. Native uses Ollama's /api/chat tool_calls field. Gemma 4 has a native tool-calling token format.

Vision

Enable the Vision capability (bottom of model editor). All Gemma 4 variants support vision. Paste or upload images in chat. Works great for description; unreliable for subjective quality scoring (see GOTCHAS.md § "Vision Validator Overrejects").

Audio (E-series only)

26B / 31B have no audio encoder. Only gemma4:e4b-it-* variants support audio, and currently only via llama.cpp — Ollama doesn't pipe audio through OpenWebUI today. Skip this in OpenWebUI for now.


Step 4 — Global Admin Defaults (Optional Floor)

Admin Panel → Settings → Models sets defaults that apply when a Workspace Model doesn't override. Set these as a safety net for ad-hoc chats against gemma4:* base models directly:

  • Default Context Length: 8192
  • Default Max Tokens: 2048
  • Default Keep Alive: 30m

These are only floors. The Workspace Model's explicit settings still take over for named presets.


Two Profiles Worth Baking

Profile A: gemma4-26b-chat (default daily driver)

  • Base: gemma4:26b
  • System Prompt: "You are a helpful assistant powered by Gemma 4. Respond in clear Markdown. Use fenced code blocks for code. Today's date: …"
  • Temp 0.7, num_ctx 32768, num_predict 4096
  • Reasoning: Default (unset) or On — never Off
  • Stream On, Format Off, Keep Alive 30m

Profile B: gemma4-26b-extract (structured output)

  • Base: gemma4:26b
  • System Prompt: explicit schema with "Respond with ONLY JSON. No prose."
  • Temp 0.3, num_ctx 8192, num_predict 2048
  • Reasoning: Off (single-turn — thinking would eat num_predict)
  • Stream Off, Format Off (still!), Keep Alive 1h
  • Parse client-side with the regex pattern in SYNTHESIS.md.

For tool-using agent chats, Profile A is correct — don't flip Reasoning off.


Troubleshooting Map

Symptom Most likely cause Fix
26B "stops answering" mid-conversation, blank reply think: false in payload Set Reasoning to Default/On in Workspace Model
Reply truncates mid-sentence num_predict too low Bump to 4096
Long prompt ignored / forgets system prompt num_ctx too low Set 32768
JSON request hangs forever Response Format = JSON Turn it off; parse client-side
Tool call not fired despite model "deciding" to call it Streaming + tool call, Ollama v0.20.0 Disable Stream when tools attached
1030s latency on first message after idle keep_alive default 5m Set 30m or -1
Model generic / no personality / confuses identity Empty or weak system prompt Use the template in Step 2
31B hangs at long prompts Flash Attention + 31B Dense + >34K tokens on 3090 class Use 26B for long prompts, or disable FA in Ollama
Chat refuses a benign technical prompt Safety overfilter Rephrase; state task directly without "bypass"/"ignore" language

What This Doc Does Not Cover

  • Installing Ollama or OpenWebUI — assumed done.
  • Pulling Gemma 4 modelsollama pull gemma4:26b outside scope.
  • Tool server development — see CORPUS_tool_calling_format.md and Simon (~/bin/FreibergFamily/simon/).
  • Embeddings / retrieval — Gemma 4 has no embedding mode; use embeddinggemma (308M) as a sibling model.
  • Fine-tuning — see GOTCHAS.md § "Fine-Tuning Ecosystem Issues" and tooling/fine-tuning/recipe-recommendation.md.
  • SYNTHESIS.md — opinionated guide this doc is derived from.
  • GOTCHAS.md — every known issue, severity-ranked.
  • CORPUS_ollama_variants.md — model inventory, VRAM, Ollama settings.
  • docs/reference/bakeoff-2026-04-18.md — the think: false / 26B evidence trail.
  • CORPUS_cli_coding_agent.md — if the OpenWebUI chat is really an agent front-end, read this for model-choice nuance.