From d9477da52ac30af9ed4bcc28abd68b4bf7888754 Mon Sep 17 00:00:00 2001
From: Mortdecai
Date: Sat, 18 Apr 2026 20:47:17 -0400
Subject: [PATCH] docs: OpenWebUI setup guide for Gemma 4
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Applies SYNTHESIS.md + GOTCHAS.md findings to the OpenWebUI front-end:
per-setting reference, two baked-in Workspace Model profiles (chat +
extract), and a symptom→cause troubleshooting table. Front-loads the
`think: false` / gemma4:26b multi-turn footgun from Round 3 of the
2026-04-18 bakeoff since that is the shape OpenWebUI users will hit.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 README.md               |   1 +
 docs/openwebui-setup.md | 257 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 258 insertions(+)
 create mode 100644 docs/openwebui-setup.md

diff --git a/README.md b/README.md
index 659fbad..2cf36e4 100644
--- a/README.md
+++ b/README.md
@@ -15,6 +15,7 @@ Research corpus and implementation guidance for Google Gemma 4, based on product
 | `CORPUS_benchmarks.md` | Full benchmark table vs Gemma 3, arena scores, agentic scores | When comparing Gemma 4 to alternatives |
 | `CORPUS_tool_calling_format.md` | Native token format + JSON API format for function calling | When implementing tool calling |
 | `CORPUS_cli_coding_agent.md` | Positioning Gemma 4 for CLI coding agent use (openclaw / open code / pi / hermes / aider style). Honest take on what Google did and didn't measure, head-to-head with `qwen3-coder:30b`, homelab setup pointer | When scoping a CLI coding agent or deciding Gemma 4 vs Qwen3-Coder |
+| `docs/openwebui-setup.md` | How to configure Gemma 4 inside OpenWebUI — per-setting reference, two ready-to-bake Workspace Model profiles (chat + extract), and a symptom→cause troubleshooting table mapped back to GOTCHAS.md. Assumes Ollama + OpenWebUI are already running. | When setting up or debugging a Gemma 4 model in OpenWebUI, or handing the front-end config to someone else |
 | `docs/reference/bakeoff-2026-04-18.md` | CLI-coding-agent bakeoff on 3090 Ti. **Rounds 1/2 misidentified the cause; Round 3 (the correct one): `think: false` silent-stops gemma4:26b at certain multi-turn states on 32K context.** 31B and Qwen3-Coder robust to the flag. Harness at `scripts/bakeoff/` | When deciding which model to back a CLI agent with, writing a custom agent payload, or debugging a silent tool-call halt |
 | `docs/reference/mort-bakeoff-2026-04-18.md` | mort-bot-specific `think=true` vs `think=false` bakeoff on mort's actual loop shape (gemma4:26b, num_ctx=8192). **Thinking does NOT accumulate in context on Ollama 0.20.4** — strips it from serialized history. Both settings behave identically on step counts, tool counts, wall clock. Harness at `scripts/mort-bakeoff/` | When deciding mort-bot's THINK env var, or when someone claims "think=true eats context" without pinning an Ollama version |
 | `tooling/` | **Canonical upstream tooling** — real scripts, notebooks, model cards, and configs pulled from Google / HF / framework maintainers (147 files). Subdirs: `google-official/`, `huggingface/`, `inference-frameworks/`, `gemma-family/`, `fine-tuning/`. See `tooling/README.md` for index and findings that update the older `CORPUS_*` docs | When you need authoritative source material — model cards, chat templates, fine-tuning recipes, serving commands for vLLM / llama.cpp / MLX, or to scope a specialized sibling (ShieldGemma, EmbeddingGemma, etc.) |
diff --git a/docs/openwebui-setup.md b/docs/openwebui-setup.md
new file mode 100644
index 0000000..7ffd1e8
--- /dev/null
+++ b/docs/openwebui-setup.md
@@ -0,0 +1,257 @@
+# Gemma 4 in OpenWebUI — Setup Guide
+
+> Derived from `SYNTHESIS.md`, `GOTCHAS.md`, and the 2026-04-18 bakeoff.
+> Assumes: OpenWebUI is already running and `gemma4:*` is pulled in the
+> backing Ollama. Covers every setting that matters and what to set it to.
+
+## TL;DR — Just Tell Me What To Toggle
+
+Create a **Workspace Model** (don't edit per-chat), pick a base variant, and set:
+
+| Setting | Multi-turn chat (the default OpenWebUI shape) | Single-turn JSON pipeline |
+|---|---|---|
+| Base model | `gemma4:26b` (fast) or `gemma4:31b-it-q4_K_M` (sharper) | same |
+| System Prompt | **Required** — identity + boundaries + format (template below) | **Required** |
+| Context Length (`num_ctx`) | `32768` | `4096`–`16384` (scale to prompt) |
+| Max Tokens (`num_predict`) | `4096` | `2048`+ |
+| Temperature | `0.7` | `0.2`–`0.4` |
+| Stream | **On** for text-only chat; **Off** if you attach a Tool | Off |
+| Function Calling | Native | N/A |
+| Reasoning / `think` | **LEAVE UNSET** on 26B (do NOT force off). Unset or On on 31B. | **Force Off** |
+| Response Format / JSON mode | **Off** (always) | **Off** (always) |
+| Keep Alive | `30m` or `-1` | match pipeline duration (`4h` / `-1`) |
+
+**The single biggest failure mode in OpenWebUI:** setting `think: false` on
+`gemma4:26b` in a chat. The model silent-stops at tool-decision turns with
+`eval_count=4`. If you see "model just stops answering after a few messages"
+on 26B, check this first. See `GOTCHAS.md` § "think: false Kills Gemma 4 26B
+in Multi-Turn Tool-Calling Loops".
+
+---
+
+## Where Settings Live in OpenWebUI
+
+OpenWebUI has four layers. Later layers override earlier ones:
+
+1. **Ollama defaults** — baked in, almost all wrong for Gemma 4
+   (`num_ctx=2048`, `num_predict=128`, `keep_alive=5m`).
+2. **Admin Panel → Settings → Models** — global defaults for all models.
+   Touch this only to set sane fleet-wide floors.
+3. **Workspace → Models → [Create/Edit]** — named presets. **This is where
+   you bake Gemma 4 settings.** A Workspace model = base model + system
+   prompt + advanced params + tags + optional tool server bindings.
+4. **Per-chat controls** (right-hand panel / top of chat) — overrides for
+   a single conversation. Useful for experimentation, bad for persistence.
+
+**Rule:** every knob below goes in layer 3 (Workspace Model) unless noted.
+Per-chat overrides are for debugging only.
+
+---
+
+## Step 1 — Create the Workspace Model
+
+Workspace → Models → **+ Add Model** (or **Create a Model**). Fill in:
+
+- **Name**: `gemma4-26b-chat` (or whatever matches your use case)
+- **Base Model**: pick from Ollama list. Recommended:
+  - `gemma4:26b` — fastest, great default
+  - `gemma4:31b-it-q4_K_M` — sharper, 5x slower, more VRAM
+  - `gemma4:e4b-it-q8_0` — 12GB VRAM, vision + audio (audio via llama.cpp only)
+- **Description**: what this preset is for. Future-you will thank you.
+- **Profile Image / Tags**: optional.
+- **System Prompt**: **required** (see Step 2).
+- **Advanced Params**: expand and configure (see Step 3).
+- **Tools / Knowledge / Filters**: optional — attach any tool servers here.
+- **Capabilities** (at bottom): toggle Vision if you want image input. Gemma
+  4 supports vision on all variants.
+
+Save. The model now appears in the main chat dropdown.
+
+---
+
+## Step 2 — System Prompt (Required)
+
+Gemma 4 is ultra-compliant but doesn't know who it is. A blank or generic
+system prompt gets you a generic chatbot — and sporadic overfiltering.
+
+Use the template from `SYNTHESIS.md`:
+
+```
+You are [NAME], a [ROLE DESCRIPTION]. You are powered by Gemma 4.
+
+## What You Do
+- [Explicit list of responsibilities]
+- [Tools you have access to and when to use each one, if any]
+
+## What You Do NOT Do
+- [Explicit list of things to refuse or avoid]
+- [Common mistakes to prevent]
+
+## Output Format
+[For free-form chat: "Respond in clear Markdown with code in fenced blocks."]
+[For structured output: exact schema, field names, example if complex.]
+
+## Rules
+- [Behavioral constraints]
+- [Multi-step chaining instructions if using tools]
+
+Today's date: 2026-04-18
+```
+
+Principles:
+1. Identity first.
+2. Positive instructions (what to do) before negative (what not to do).
+3. Output format is explicit.
+4. Don't use language that sounds like you're asking the model to bypass
+   restrictions — just state the task directly (safety overfilter trigger).
+
+---
+
+## Step 3 — Advanced Params Reference
+
+Expand **Advanced Params** in the Workspace Model editor. Every field, what
+to set, and why.
+
+### Sampling / Output Shape
+
+| Field | Ollama default | Set to | Why |
+|---|---|---|---|
+| **Temperature** | 0.8 | `0.7` (chat) / `0.3` (extraction) / `0.2` (scoring) | Per `SYNTHESIS.md` temperature table. |
+| **Top K** | 40 | leave default | Gemma 4 is well-behaved at default. |
+| **Top P** | 0.9 | leave default | Same. |
+| **Min P** | 0.0 | leave default | Same. |
+| **Seed** | random | leave blank | Set only for A/B reproduction. |
+| **Stop Sequences** | none | leave blank | Gemma 4 emits proper EOS. |
+| **Mirostat / Eta / Tau** | off | leave off | Not needed; Min P / Top P work fine. |
+| **Frequency Penalty** | 0 | leave 0 | Any value biases style for little gain. |
+| **Repeat Penalty** | 1.1 | leave default | Fine at default. |
+| **Repeat Last N** | 64 | leave default | Fine at default. |
+| **Presence Penalty** | 0 | leave 0 | Same as frequency. |
+
+### Context / Memory — **these are the ones that bite**
+
+| Field | Ollama default | Set to | Why |
+|---|---|---|---|
+| **Context Length** (`num_ctx`) | **2048** | `32768` chat / `4096`–`16384` pipeline | Default truncates mid-system-prompt. `GOTCHAS.md` § "Ollama Default Context is 2048". |
+| **Max Tokens** (`num_predict`) | **128** | `4096` chat / `2048`+ JSON | Default truncates any useful reply. `GOTCHAS.md` § "num_predict Default is 128". |
+| **Batch Size** (`num_batch`) | 512 | leave default | Prompt-eval throughput; no Gemma 4 issue. |
+| **Tokens to Keep** (`num_keep`) | 4 | leave default | System-prompt header anchor. |
+| **Use Mmap** | on | leave on | Standard. |
+| **Use Mlock** | off | leave off | Standard. |
+| **Threads** (`num_thread`) | auto | leave default | Ollama picks. |
+| **Keep Alive** | 5m | `30m` for chat, `4h` or `-1` for pipelines | Default unloads the model between messages — `~10–30s` reload penalty. `GOTCHAS.md` § "Keep-Alive Too Short". |
+
+### Reasoning / Thinking — **the OpenWebUI 26B killer**
+
+| Field | Set to | Why |
+|---|---|---|
+| **Reasoning / Thinking** / **Think** toggle | **Leave UNSET** on 26B in chat. Optional on 31B. **Force Off** only in single-turn JSON pipelines. | Ollama 0.20+ defaults `think: true`. On `gemma4:26b` in multi-turn chat, forcing `think: false` causes silent stops (`eval_count=4`, empty content, no tool call) — the model just… stops. Verified 2026-04-18. 31B and Qwen3-Coder tolerate the flag. In single-turn JSON pipelines (AI_Visualizer shape) the old advice still applies: force off so thinking tokens don't eat `num_predict`. See `GOTCHAS.md` § "think: false Kills Gemma 4 26B" and § "Thinking Mode Eats Context". |
+
+> OpenWebUI exposes this as a **"Reasoning"** toggle or a raw **`think`**
+> field depending on version. If your version exposes it as tri-state
+> (On / Off / Default), pick **Default** on 26B chat. If it's binary
+> (On / Off), leave it **On** on 26B chat. **Never Off on 26B chat.**
+
+### Response Format — **never use JSON mode**
+
+| Field | Set to | Why |
+|---|---|---|
+| **Response Format** / **Format = JSON** | **Off / None / Text** | Ollama's server-side `format: "json"` enforcer causes Gemma 4 26B to infinite-loop on nested schemas. Ask for JSON in the prompt and parse client-side. `GOTCHAS.md` § "format=json Causes Infinite Loops". |
+
+### Streaming & Function Calling
+
+| Field | Set to | Why |
+|---|---|---|
+| **Stream Chat Response** | **On** for text-only chat. **Off** if you've attached Tools. | Ollama v0.20.0 drops tool calls on streaming responses (community-reported, and matches Simon's non-streaming choice). `GOTCHAS.md` § "Tool Calling Broken in Ollama v0.20.0 Streaming". |
+| **Function Calling** | `Native` if you're attaching tools; otherwise `Default` / off. | Native uses Ollama's `/api/chat` tool_calls field. Gemma 4 has a native tool-calling token format. |
+
+### Vision
+
+Enable the **Vision** capability (bottom of model editor). All Gemma 4
+variants support vision. Paste or upload images in chat. Works great for
+description; **unreliable for subjective quality scoring** (see
+`GOTCHAS.md` § "Vision Validator Overrejects").
+
+### Audio (E-series only)
+
+26B / 31B have no audio encoder. Only `gemma4:e4b-it-*` variants support
+audio, and currently only via llama.cpp — Ollama doesn't pipe audio through
+OpenWebUI today. Skip this in OpenWebUI for now.
+
+---
+
+## Step 4 — Global Admin Defaults (Optional Floor)
+
+Admin Panel → Settings → Models sets defaults that apply when a Workspace
+Model doesn't override. Set these as a safety net for ad-hoc chats against
+`gemma4:*` base models directly:
+
+- Default Context Length: **8192**
+- Default Max Tokens: **2048**
+- Default Keep Alive: **30m**
+
+These are only floors. The Workspace Model's explicit settings still take
+over for named presets.
+
+---
+
+## Two Profiles Worth Baking
+
+### Profile A: `gemma4-26b-chat` (default daily driver)
+
+- Base: `gemma4:26b`
+- System Prompt: "You are a helpful assistant powered by Gemma 4. Respond
+  in clear Markdown. Use fenced code blocks for code. Today's date: …"
+- Temp `0.7`, `num_ctx 32768`, `num_predict 4096`
+- Reasoning: **Default** (unset) or On — **never Off**
+- Stream On, Format Off, Keep Alive `30m`
+
+### Profile B: `gemma4-26b-extract` (structured output)
+
+- Base: `gemma4:26b`
+- System Prompt: explicit schema with "Respond with ONLY JSON. No prose."
+- Temp `0.3`, `num_ctx 8192`, `num_predict 2048`
+- Reasoning: **Off** (single-turn — thinking would eat `num_predict`)
+- Stream Off, Format **Off** (still!), Keep Alive `1h`
+- Parse client-side with the regex pattern in `SYNTHESIS.md`.
+
+For tool-using agent chats, Profile A is correct — don't flip Reasoning off.
+
+---
+
+## Troubleshooting Map
+
+| Symptom | Most likely cause | Fix |
+|---|---|---|
+| 26B "stops answering" mid-conversation, blank reply | `think: false` in payload | Set Reasoning to Default/On in Workspace Model |
+| Reply truncates mid-sentence | `num_predict` too low | Bump to 4096 |
+| Long prompt ignored / forgets system prompt | `num_ctx` too low | Set 32768 |
+| JSON request hangs forever | Response Format = JSON | Turn it off; parse client-side |
+| Tool call not fired despite model "deciding" to call it | Streaming + tool call, Ollama v0.20.0 | Disable Stream when tools attached |
+| 10–30s latency on first message after idle | `keep_alive` default 5m | Set `30m` or `-1` |
+| Model generic / no personality / confuses identity | Empty or weak system prompt | Use the template in Step 2 |
+| 31B hangs at long prompts | Flash Attention + 31B Dense + >3–4K tokens on 3090 class | Use 26B for long prompts, or disable FA in Ollama |
+| Chat refuses a benign technical prompt | Safety overfilter | Rephrase; state task directly without "bypass"/"ignore" language |
+
+---
+
+## What This Doc Does Not Cover
+
+- **Installing Ollama or OpenWebUI** — assumed done.
+- **Pulling Gemma 4 models** — `ollama pull gemma4:26b` outside scope.
+- **Tool server development** — see `CORPUS_tool_calling_format.md` and
+  Simon (`~/bin/FreibergFamily/simon/`).
+- **Embeddings / retrieval** — Gemma 4 has no embedding mode; use
+  `embeddinggemma` (308M) as a sibling model.
+- **Fine-tuning** — see `GOTCHAS.md` § "Fine-Tuning Ecosystem Issues" and
+  `tooling/fine-tuning/recipe-recommendation.md`.
+
+## Related Docs in This Repo
+
+- `SYNTHESIS.md` — opinionated guide this doc is derived from.
+- `GOTCHAS.md` — every known issue, severity-ranked.
+- `CORPUS_ollama_variants.md` — model inventory, VRAM, Ollama settings.
+- `docs/reference/bakeoff-2026-04-18.md` — the `think: false` / 26B
+  evidence trail.
+- `CORPUS_cli_coding_agent.md` — if the OpenWebUI chat is really an
+  agent front-end, read this for model-choice nuance.
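Note on the two profiles, outside the patch itself: the sketch below shows how Profile A and Profile B might translate into raw Ollama `/api/chat` request bodies, and one way to do the client-side JSON parsing the guide asks for. It is illustrative only. The `PROFILES` dict and both function names are hypothetical, and the field names (`options`, `keep_alive`, `stream`, `think`) are assumed to match the Ollama chat API as it is commonly documented. The extraction helper is a stand-in that scans with `json.JSONDecoder.raw_decode`; it is not the regex pattern from `SYNTHESIS.md`, which is not reproduced here.

```python
import json
from typing import Any, Optional

# Values copied from the guide's Profile A / Profile B. The dict layout is an
# illustrative assumption, not something OpenWebUI or Ollama defines.
PROFILES = {
    "chat": {"temperature": 0.7, "num_ctx": 32768, "num_predict": 4096,
             "keep_alive": "30m", "stream": True, "think": None},
    "extract": {"temperature": 0.3, "num_ctx": 8192, "num_predict": 2048,
                "keep_alive": "1h", "stream": False, "think": False},
}


def build_chat_payload(model: str, profile: str, messages: list) -> dict:
    """Build a request body for POST /api/chat from one of the two profiles."""
    p = PROFILES[profile]
    payload = {
        "model": model,
        "messages": messages,
        "stream": p["stream"],
        "keep_alive": p["keep_alive"],
        "options": {
            "temperature": p["temperature"],
            "num_ctx": p["num_ctx"],
            "num_predict": p["num_predict"],
        },
    }
    # "Leave UNSET" means the key is absent, not false: only send `think`
    # when the profile sets it explicitly (the single-turn extract case).
    if p["think"] is not None:
        payload["think"] = p["think"]
    # `format` is deliberately never set; JSON is requested in the prompt.
    return payload


def extract_first_json(text: str) -> Optional[Any]:
    """Return the first parseable JSON object embedded in a reply, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None
```

Omitting the `think` key in the chat profile, instead of sending `false`, mirrors the "Leave UNSET" advice for 26B. Scanning from each `{` lets the parser tolerate prose or Markdown fences around the object.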