From d9477da52ac30af9ed4bcc28abd68b4bf7888754 Mon Sep 17 00:00:00 2001
From: Mortdecai <admin@mortdec.ai>
Date: Sat, 18 Apr 2026 20:47:17 -0400
Subject: [PATCH] docs: OpenWebUI setup guide for Gemma 4
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Applies SYNTHESIS.md + GOTCHAS.md findings to the OpenWebUI front-end:
per-setting reference, two baked-in Workspace Model profiles (chat +
extract), and a symptom→cause troubleshooting table. Front-loads the
`think: false` / gemma4:26b multi-turn footgun from Round 3 of the
2026-04-18 bakeoff since that is the shape OpenWebUI users will hit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 README.md               |   1 +
 docs/openwebui-setup.md | 257 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 258 insertions(+)
 create mode 100644 docs/openwebui-setup.md

diff --git a/README.md b/README.md
index 659fbad..2cf36e4 100644
--- a/README.md
+++ b/README.md
@@ -15,6 +15,7 @@ Research corpus and implementation guidance for Google Gemma 4, based on product
 | `CORPUS_benchmarks.md` | Full benchmark table vs Gemma 3, arena scores, agentic scores | When comparing Gemma 4 to alternatives |
 | `CORPUS_tool_calling_format.md` | Native token format + JSON API format for function calling | When implementing tool calling |
 | `CORPUS_cli_coding_agent.md` | Positioning Gemma 4 for CLI coding agent use (openclaw / open code / pi / hermes / aider style). Honest take on what Google did and didn't measure, head-to-head with `qwen3-coder:30b`, homelab setup pointer | When scoping a CLI coding agent or deciding Gemma 4 vs Qwen3-Coder |
+| `docs/openwebui-setup.md` | How to configure Gemma 4 inside OpenWebUI — per-setting reference, two ready-to-bake Workspace Model profiles (chat + extract), and a symptom→cause troubleshooting table mapped back to GOTCHAS.md. Assumes Ollama + OpenWebUI are already running. | When setting up or debugging a Gemma 4 model in OpenWebUI, or handing the front-end config to someone else |
 | `docs/reference/bakeoff-2026-04-18.md` | CLI-coding-agent bakeoff on 3090 Ti. **Rounds 1/2 misidentified the cause; Round 3 (the correct one): `think: false` silent-stops gemma4:26b at certain multi-turn states on 32K context.** 31B and Qwen3-Coder robust to the flag. Harness at `scripts/bakeoff/` | When deciding which model to back a CLI agent with, writing a custom agent payload, or debugging a silent tool-call halt |
 | `docs/reference/mort-bakeoff-2026-04-18.md` | mort-bot-specific `think=true` vs `think=false` bakeoff on mort's actual loop shape (gemma4:26b, num_ctx=8192). **Thinking does NOT accumulate in context on Ollama 0.20.4** — strips it from serialized history. Both settings behave identically on step counts, tool counts, wall clock. Harness at `scripts/mort-bakeoff/` | When deciding mort-bot's THINK env var, or when someone claims "think=true eats context" without pinning an Ollama version |
 | `tooling/` | **Canonical upstream tooling** — real scripts, notebooks, model cards, and configs pulled from Google / HF / framework maintainers (147 files). Subdirs: `google-official/`, `huggingface/`, `inference-frameworks/`, `gemma-family/`, `fine-tuning/`. See `tooling/README.md` for index and findings that update the older `CORPUS_*` docs | When you need authoritative source material — model cards, chat templates, fine-tuning recipes, serving commands for vLLM / llama.cpp / MLX, or to scope a specialized sibling (ShieldGemma, EmbeddingGemma, etc.) |
diff --git a/docs/openwebui-setup.md b/docs/openwebui-setup.md
new file mode 100644
index 0000000..7ffd1e8
--- /dev/null
+++ b/docs/openwebui-setup.md
@@ -0,0 +1,257 @@
+# Gemma 4 in OpenWebUI — Setup Guide
+
+> Derived from `SYNTHESIS.md`, `GOTCHAS.md`, and the 2026-04-18 bakeoff.
+> Assumes: OpenWebUI is already running and `gemma4:*` is pulled in the
+> backing Ollama. Covers every setting that matters and what to set it to.
+
+## TL;DR — Just Tell Me What To Toggle
+
+Create a **Workspace Model** (don't edit per-chat), pick a base variant, and set:
+
+| Setting | Multi-turn chat (the default OpenWebUI shape) | Single-turn JSON pipeline |
+|---|---|---|
+| Base model | `gemma4:26b` (fast) or `gemma4:31b-it-q4_K_M` (sharper) | same |
+| System Prompt | **Required** — identity + boundaries + format (template below) | **Required** |
+| Context Length (`num_ctx`) | `32768` | `4096`–`16384` (scale to prompt) |
+| Max Tokens (`num_predict`) | `4096` | `2048`+ |
+| Temperature | `0.7` | `0.2`–`0.4` |
+| Stream | **On** for text-only chat; **Off** if you attach a Tool | Off |
+| Function Calling | Native | N/A |
+| Reasoning / `think` | **LEAVE UNSET** on 26B (do NOT force off). Unset or On on 31B. | **Force Off** |
+| Response Format / JSON mode | **Off** (always) | **Off** (always) |
+| Keep Alive | `30m` or `-1` | match pipeline duration (`4h` / `-1`) |
+
+**The single biggest failure mode in OpenWebUI:** setting `think: false` on
+`gemma4:26b` in a chat. The model silent-stops at tool-decision turns with
+`eval_count=4`. If you see "model just stops answering after a few messages"
+on 26B, check this first. See `GOTCHAS.md` § "think: false Kills Gemma 4 26B
+in Multi-Turn Tool-Calling Loops".
+
+---
+
+## Where Settings Live in OpenWebUI
+
+OpenWebUI has four layers. Later layers override earlier ones:
+
+1. **Ollama defaults** — baked in, almost all wrong for Gemma 4
+   (`num_ctx=2048`, `num_predict=128`, `keep_alive=5m`).
+2. **Admin Panel → Settings → Models** — global defaults for all models.
+   Touch this only to set sane fleet-wide floors.
+3. **Workspace → Models → [Create/Edit]** — named presets. **This is where
+   you bake Gemma 4 settings.** A Workspace model = base model + system
+   prompt + advanced params + tags + optional tool server bindings.
+4. **Per-chat controls** (right-hand panel / top of chat) — overrides for
+   a single conversation. Useful for experimentation, bad for persistence.
+
+**Rule:** every knob below goes in layer 3 (Workspace Model) unless noted.
+Per-chat overrides are for debugging only.
+
+---
+
+## Step 1 — Create the Workspace Model
+
+Workspace → Models → **+ Add Model** (or **Create a Model**). Fill in:
+
+- **Name**: `gemma4-26b-chat` (or whatever matches your use case)
+- **Base Model**: pick from Ollama list. Recommended:
+  - `gemma4:26b` — fastest, great default
+  - `gemma4:31b-it-q4_K_M` — sharper, 5x slower, more VRAM
+  - `gemma4:e4b-it-q8_0` — 12GB VRAM, vision + audio (audio via llama.cpp only)
+- **Description**: what this preset is for. Future-you will thank you.
+- **Profile Image / Tags**: optional.
+- **System Prompt**: **required** (see Step 2).
+- **Advanced Params**: expand and configure (see Step 3).
+- **Tools / Knowledge / Filters**: optional — attach any tool servers here.
+- **Capabilities** (at bottom): toggle Vision if you want image input. Gemma
+  4 supports vision on all variants.
+
+Save. The model now appears in the main chat dropdown.
+
+---
+
+## Step 2 — System Prompt (Required)
+
+Gemma 4 is ultra-compliant but doesn't know who it is. A blank or generic
+system prompt gets you a generic chatbot — and sporadic overfiltering.
+
+Use the template from `SYNTHESIS.md`:
+
+```
+You are [NAME], a [ROLE DESCRIPTION]. You are powered by Gemma 4.
+
+## What You Do
+- [Explicit list of responsibilities]
+- [Tools you have access to and when to use each one, if any]
+
+## What You Do NOT Do
+- [Explicit list of things to refuse or avoid]
+- [Common mistakes to prevent]
+
+## Output Format
+[For free-form chat: "Respond in clear Markdown with code in fenced blocks."]
+[For structured output: exact schema, field names, example if complex.]
+
+## Rules
+- [Behavioral constraints]
+- [Multi-step chaining instructions if using tools]
+
+Today's date: 2026-04-18
+```
+
+Principles:
+1. Identity first.
+2. Positive instructions (what to do) before negative (what not to do).
+3. Output format is explicit.
+4. Don't use language that sounds like you're asking the model to bypass
+   restrictions — just state the task directly (safety overfilter trigger).
+
+---
+
+## Step 3 — Advanced Params Reference
+
+Expand **Advanced Params** in the Workspace Model editor. Every field, what
+to set, and why.
+
+### Sampling / Output Shape
+
+| Field | Ollama default | Set to | Why |
+|---|---|---|---|
+| **Temperature** | 0.8 | `0.7` (chat) / `0.3` (extraction) / `0.2` (scoring) | Per `SYNTHESIS.md` temperature table. |
+| **Top K** | 40 | leave default | Gemma 4 is well-behaved at default. |
+| **Top P** | 0.9 | leave default | Same. |
+| **Min P** | 0.0 | leave default | Same. |
+| **Seed** | random | leave blank | Set only for A/B reproduction. |
+| **Stop Sequences** | none | leave blank | Gemma 4 emits proper EOS. |
+| **Mirostat / Eta / Tau** | off | leave off | Not needed; Min P / Top P work fine. |
+| **Frequency Penalty** | 0 | leave 0 | Any value biases style for little gain. |
+| **Repeat Penalty** | 1.1 | leave default | Fine at default. |
+| **Repeat Last N** | 64 | leave default | Fine at default. |
+| **Presence Penalty** | 0 | leave 0 | Same as frequency. |
+
+### Context / Memory — **these are the ones that bite**
+
+| Field | Ollama default | Set to | Why |
+|---|---|---|---|
+| **Context Length** (`num_ctx`) | **2048** | `32768` chat / `4096`–`16384` pipeline | Default truncates mid-system-prompt. `GOTCHAS.md` § "Ollama Default Context is 2048". |
+| **Max Tokens** (`num_predict`) | **128** | `4096` chat / `2048`+ JSON | Default truncates any useful reply. `GOTCHAS.md` § "num_predict Default is 128". |
+| **Batch Size** (`num_batch`) | 512 | leave default | Prompt-eval throughput; no Gemma 4 issue. |
+| **Tokens to Keep** (`num_keep`) | 4 | leave default | System-prompt header anchor. |
+| **Use Mmap** | on | leave on | Standard. |
+| **Use Mlock** | off | leave off | Standard. |
+| **Threads** (`num_thread`) | auto | leave default | Ollama picks. |
+| **Keep Alive** | 5m | `30m` for chat, `4h` or `-1` for pipelines | Default unloads the model between messages — `~10–30s` reload penalty. `GOTCHAS.md` § "Keep-Alive Too Short". |
+
+### Reasoning / Thinking — **the OpenWebUI 26B killer**
+
+| Field | Set to | Why |
+|---|---|---|
+| **Reasoning / Thinking** / **Think** toggle | **Leave UNSET** on 26B in chat. Optional on 31B. **Force Off** only in single-turn JSON pipelines. | Ollama 0.20+ defaults `think: true`. On `gemma4:26b` in multi-turn chat, forcing `think: false` causes silent stops (`eval_count=4`, empty content, no tool call) — the model just… stops. Verified 2026-04-18. 31B and Qwen3-Coder tolerate the flag. In single-turn JSON pipelines (AI_Visualizer shape) the old advice still applies: force off so thinking tokens don't eat `num_predict`. See `GOTCHAS.md` § "think: false Kills Gemma 4 26B" and § "Thinking Mode Eats Context". |
+
+> OpenWebUI exposes this as a **"Reasoning"** toggle or a raw **`think`**
+> field depending on version. If your version exposes it as tri-state
+> (On / Off / Default), pick **Default** on 26B chat. If it's binary
+> (On / Off), leave it **On** on 26B chat. **Never Off on 26B chat.**
+
+### Response Format — **never use JSON mode**
+
+| Field | Set to | Why |
+|---|---|---|
+| **Response Format** / **Format = JSON** | **Off / None / Text** | Ollama's server-side `format: "json"` enforcer causes Gemma 4 26B to infinite-loop on nested schemas. Ask for JSON in the prompt and parse client-side. `GOTCHAS.md` § "format=json Causes Infinite Loops". |
+
+### Streaming & Function Calling
+
+| Field | Set to | Why |
+|---|---|---|
+| **Stream Chat Response** | **On** for text-only chat. **Off** if you've attached Tools. | Ollama v0.20.0 drops tool calls on streaming responses (community-reported, and matches Simon's non-streaming choice). `GOTCHAS.md` § "Tool Calling Broken in Ollama v0.20.0 Streaming". |
+| **Function Calling** | `Native` if you're attaching tools; otherwise `Default` / off. | Native uses Ollama's `/api/chat` tool_calls field. Gemma 4 has a native tool-calling token format. |
+
+### Vision
+
+Enable the **Vision** capability (bottom of model editor). All Gemma 4
+variants support vision. Paste or upload images in chat. Works great for
+description; **unreliable for subjective quality scoring** (see
+`GOTCHAS.md` § "Vision Validator Overrejects").
+
+### Audio (E-series only)
+
+26B / 31B have no audio encoder. Only `gemma4:e4b-it-*` variants support
+audio, and currently only via llama.cpp — Ollama doesn't pipe audio through
+OpenWebUI today. Skip this in OpenWebUI for now.
+
+---
+
+## Step 4 — Global Admin Defaults (Optional Floor)
+
+Admin Panel → Settings → Models sets defaults that apply when a Workspace
+Model doesn't override. Set these as a safety net for ad-hoc chats against
+`gemma4:*` base models directly:
+
+- Default Context Length: **8192**
+- Default Max Tokens: **2048**
+- Default Keep Alive: **30m**
+
+These are only floors. The Workspace Model's explicit settings still take
+over for named presets.
+
+---
+
+## Two Profiles Worth Baking
+
+### Profile A: `gemma4-26b-chat` (default daily driver)
+
+- Base: `gemma4:26b`
+- System Prompt: "You are a helpful assistant powered by Gemma 4. Respond
+  in clear Markdown. Use fenced code blocks for code. Today's date: …"
+- Temp `0.7`, `num_ctx 32768`, `num_predict 4096`
+- Reasoning: **Default** (unset) or On — **never Off**
+- Stream On, Format Off, Keep Alive `30m`
+
+### Profile B: `gemma4-26b-extract` (structured output)
+
+- Base: `gemma4:26b`
+- System Prompt: explicit schema with "Respond with ONLY JSON. No prose."
+- Temp `0.3`, `num_ctx 8192`, `num_predict 2048`
+- Reasoning: **Off** (single-turn — thinking would eat `num_predict`)
+- Stream Off, Format **Off** (still!), Keep Alive `1h`
+- Parse client-side with the regex pattern in `SYNTHESIS.md`.
+
+For tool-using agent chats, Profile A is correct — don't flip Reasoning off.
+
+---
+
+## Troubleshooting Map
+
+| Symptom | Most likely cause | Fix |
+|---|---|---|
+| 26B "stops answering" mid-conversation, blank reply | `think: false` in payload | Set Reasoning to Default/On in Workspace Model |
+| Reply truncates mid-sentence | `num_predict` too low | Bump to 4096 |
+| Long prompt ignored / forgets system prompt | `num_ctx` too low | Set 32768 |
+| JSON request hangs forever | Response Format = JSON | Turn it off; parse client-side |
+| Tool call not fired despite model "deciding" to call it | Streaming + tool call, Ollama v0.20.0 | Disable Stream when tools attached |
+| 10–30s latency on first message after idle | `keep_alive` default 5m | Set `30m` or `-1` |
+| Model generic / no personality / confuses identity | Empty or weak system prompt | Use the template in Step 2 |
+| 31B hangs at long prompts | Flash Attention + 31B Dense + >3–4K tokens on 3090 class | Use 26B for long prompts, or disable FA in Ollama |
+| Chat refuses a benign technical prompt | Safety overfilter | Rephrase; state task directly without "bypass"/"ignore" language |
+
+---
+
+## What This Doc Does Not Cover
+
+- **Installing Ollama or OpenWebUI** — assumed done.
+- **Pulling Gemma 4 models** — `ollama pull gemma4:26b` outside scope.
+- **Tool server development** — see `CORPUS_tool_calling_format.md` and
+  Simon (`~/bin/FreibergFamily/simon/`).
+- **Embeddings / retrieval** — Gemma 4 has no embedding mode; use
+  `embeddinggemma` (308M) as a sibling model.
+- **Fine-tuning** — see `GOTCHAS.md` § "Fine-Tuning Ecosystem Issues" and
+  `tooling/fine-tuning/recipe-recommendation.md`.
+
+## Related Docs in This Repo
+
+- `SYNTHESIS.md` — opinionated guide this doc is derived from.
+- `GOTCHAS.md` — every known issue, severity-ranked.
+- `CORPUS_ollama_variants.md` — model inventory, VRAM, Ollama settings.
+- `docs/reference/bakeoff-2026-04-18.md` — the `think: false` / 26B
+  evidence trail.
+- `CORPUS_cli_coding_agent.md` — if the OpenWebUI chat is really an
+  agent front-end, read this for model-choice nuance.