


AI / Computer Player — Design Spec

Project: blind_chess
Date: 2026-04-28
Status: Draft (awaiting user review)
Builds on: 2026-04-28-blind-chess-design.md (the deployed-MVP architecture)
Supersedes: none (this is a new feature)

Decision reversal

DECISIONS.md line "2026-04-28: Client-side AI / hint generation — explicitly out of scope. Human vs. human only" is superseded as of 2026-04-28 by Seth's directive to add a computer-opponent feature.

The reversal is partial: client-side AI / hint generation in human-vs-human games remains rejected. This spec adds AI only in the human-vs-AI path. Human-vs-human games are unchanged.

Executive summary

Add two AI opponents to blind_chess:

Casual bot — algorithmic, in-process, ~200 LoC of TypeScript. Plays legal moves with simple heuristics. Always available; no external dependencies. Plays badly but quickly.
gemma4 recon bot — multi-turn chat agent backed by gemma4:26b running on the homelab Ollama service (steel141 RTX 3090 Ti primary, pve197 V100 fallback). Maintains a private per-game chat history that persists across turns as the bot's memory, allowing it to build belief about hidden opponent positions over time. Reasoning is hidden from the human during play and revealed in a collapsible post-game panel.

Both bots play through the same view filter and finite-state machine that humans use. The architectural invariant from CLAUDE.md ("the view filter is the only egress for board state") applies to bots: a bot consumes only buildView(game, botColor) plus moderator announcements. No oracle access. The Recon bot is honestly playing blind chess, not pretending to.

The feature ships in two phases: Casual first (single-week scope, low risk), Recon second (research-flavored multi-week scope, depends on Gemma 4 prompt engineering). The shared infrastructure (BotDriver, Brain interface, in-process dispatch path) is built in Phase 1 and reused in Phase 2.

Goals

#	Goal	How we know it's met
1	"Always-available opponent" — a user can play a legal chess game alone, on demand, without a friend	Casual bot completes 100 self-play games without crashes; legal moves only
2	Showcase the blind-chess problem — demonstrate an agent reasoning under uncertainty	Recon bot wins ≥60% over 50 Recon-vs-Casual games (both colors); 10 random reasoning logs show Gemma using announcements as evidence
3	Architectural integrity — bot doesn't get oracle access	Bot input is `BoardView` (filtered) + `Announcement[]`; no test or code path bypasses the view filter
4	Mobile-first UX consistent with existing site	Two-section landing stacks on narrow viewports; AI badge fits opponent slot; thinking indicator visible
5	Honest GPU surface — user knows which hardware Gemma is running on	`aiInfo` field in protocol; persistent badge; failover updates badge
6	Graceful degradation when Ollama is unavailable	Preflight failure → 503 with friendly message; mid-game failure → failover to V100; both endpoints down → bot resigns with `endReason: 'ai_unavailable'`

Non-goals (explicit)

Strong vanilla chess play. Both bots play vanilla mode but neither uses Stockfish; vanilla is a side-effect, not a feature target.
AI vs AI spectator-able games in the public UI. The self-play harness is a CLI tool, not a UX feature.
Live token streaming during Gemma's thinking. Static "AI is thinking..." indicator only.
Difficulty slider. Two named buttons (Casual, Recon) — no continuum.
Hint generation in human-vs-human games. Still out of scope.
Mid-game GPU flap-back. Once failed over, stays on fallback for the rest of the game.
Browser E2E testing. Existing project decision (DECISIONS.md row "End-to-end browser tests") still applies.

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│ blind-chess server (CT 690, Fastify on :3000)                       │
│                                                                      │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────────────────┐ │
│  │  WS clients  │   │  REST routes │   │ BotDriver (per-game)     │ │
│  │  (humans)    │   │              │   │                          │ │
│  └──────┬───────┘   └──────┬───────┘   │  ┌────────────────────┐  │ │
│         │                  │           │  │ CasualBrain        │  │ │
│         ▼                  ▼           │  │ (algorithmic)      │  │ │
│  ┌─────────────────────────────────┐   │  └────────────────────┘  │ │
│  │ ws.ts: dispatch (commit / etc)  │◀──┤  ┌────────────────────┐  │ │
│  │ commit.ts: touch-move FSM       │   │  │ ReconBrain         │  │ │
│  │ view.ts: buildView / ownSquares │──▶│  │ (Ollama chat agent)│  │ │
│  │ translator.ts: announcements    │   │  │  - persistent      │  │ │
│  │ state.ts: in-memory game Map    │   │  │    chat history    │  │ │
│  └─────────────────────────────────┘   │  │  - private memory  │  │ │
│                                         │  └────────────────────┘  │ │
│                                         └──────────┬───────────────┘ │
│                                                    │ HTTP            │
└────────────────────────────────────────────────────┼─────────────────┘
                                                     │
            ┌────────────────────────────────────────┴──────┐
            │ Ollama endpoint priority list                  │
            │                                                │
            │ 1. http://192.168.0.141:11434  (steel141)     │
            │    RTX 3090 Ti, gemma4:26b, ~134 tok/s        │
            │ 2. http://192.168.0.179:11434  (pve197 CT 105)│
            │    Tesla V100 32GB, gemma4:26b, ~80 tok/s est.│
            └───────────────────────────────────────────────┘

Key principles:

Bots are virtual in-process players. A BotDriver is created per AI game and attached to the bot's color. The driver computes legal candidates from the bot's view and dispatches its actions through the same commit handler humans use.
Bots use the same view filter as humans. BotDriver calls buildView(game, botColor) and feeds the filtered board to the Brain. No oracle access; the Recon bot is honestly playing blind chess.
The Brain is a swappable strategy. CasualBrain and ReconBrain implement the same interface; the driver doesn't know which one it has.
Recon bot is a stateful chat agent, not a stateless mover. Each turn appends to a persistent chat history (system + alternating user/assistant). The bot's reasoning persists across turns as its private memory.
The bot has no PlayerToken, no WS connection, and no grace-period treatment. Its "session" is the lifetime of the BotDriver. The server emits peer-status: { color: <botColor>, connected: true } for the bot's slot at all times until the game ends; no grace timer applies to the bot's color.

Components

All new code lives under packages/server/src/bot/. Five modules.

`Brain` interface (shared contract)

interface Brain {
  init(args: { color: Color; mode: Mode; gameId: GameId }): Promise<void>;
  decide(input: BrainInput): Promise<BrainAction>;
  dispose?(): Promise<void>;
}

interface BrainInput {
  view: BoardView;                    // own pieces only in blind mode
  newAnnouncements: Announcement[];   // moderator events since last decide
  legalCandidates: CandidateMove[];   // pre-computed by driver
  attemptHistory?: { move: CandidateMove; rejection: ModeratorText }[];
}

type BrainAction =
  | { type: 'commit'; from: Square; to: Square; promotion?: PromotionType }
  | { type: 'resign' }
  | { type: 'offer-draw' }
  | { type: 'respond-draw'; accept: boolean };

Why this shape: the driver pre-computes legal candidates so the brain doesn't have to know chess.js. This makes both brains trivially mockable in tests, and the candidate set is computed identically to what the human-side highlighter shows.
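To illustrate the mockability point, here is a minimal sketch of a scripted test double written against a trimmed-down copy of the interface. The simplified aliases (`Square` as a plain string, the reduced `BrainInput`) and the `StubBrain` internals are assumptions for illustration, not the project's actual definitions.

```typescript
// Simplified stand-ins for the shared types (assumed shapes, not the real ones).
type Color = "white" | "black";
type Square = string; // e.g. "e2"
type CandidateMove = { from: Square; to: Square };
type BrainAction =
  | { type: "commit"; from: Square; to: Square }
  | { type: "resign" };

interface BrainInput {
  legalCandidates: CandidateMove[];
  attemptHistory?: { move: CandidateMove; rejection: string }[];
}

interface Brain {
  init(args: { color: Color }): Promise<void>;
  decide(input: BrainInput): Promise<BrainAction>;
}

// Hypothetical test double: replays a fixed script of actions, then falls
// back to committing the first legal candidate once the script is exhausted.
class StubBrain implements Brain {
  seenInputs: BrainInput[] = [];
  private script: BrainAction[];

  constructor(script: BrainAction[]) {
    this.script = [...script];
  }

  async init(_args: { color: Color }): Promise<void> {}

  async decide(input: BrainInput): Promise<BrainAction> {
    this.seenInputs.push(input); // recorded so tests can assert on what the driver sent
    const next = this.script.shift();
    if (next) return next;
    const first = input.legalCandidates[0];
    if (!first) throw new Error("no legal candidates");
    return { type: "commit", from: first.from, to: first.to };
  }
}
```

Because the stub never touches chess.js or the network, a driver test can script exact action sequences and inspect `seenInputs` afterwards.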

`BotDriver` (per-game orchestration)

Owns one Brain. Subscribes to game state-change events. Per-driver mutex enforces one in-flight decision. Bounded retry (5) on FSM rejections. Pseudocode:

on game state change:
  if game.status === 'finished': dispose brain; remove driver
  if game.toMove !== bot.color: do nothing
  if alreadyDeciding: do nothing  (mutex)
  else:
    input = buildBrainInput()
    action = await brain.decide(input)
    dispatch(action) through normal handlers
    if rejection (wont_help / illegal_move): append to attemptHistory; decide again
    cap retries at 5; on cap, resign as the bot
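
The pseudocode above could be sketched in TypeScript roughly as follows. This is a simplified illustration, not the planned implementation: the `Dispatch` hook, the reduced types, and the `onBotTurn` method name are assumptions, and the cap is read as "five rejected attempts, then resign".

```typescript
// Simplified stand-ins; all names here are illustrative assumptions.
type Move = { from: string; to: string };
type Action = { type: "commit"; from: string; to: string } | { type: "resign" };
type Rejection = "wont_help" | "illegal_move";
type Attempt = { move: Move; rejection: Rejection };

interface Brain {
  decide(input: { legalCandidates: Move[]; attemptHistory: Attempt[] }): Promise<Action>;
}

// Hypothetical dispatch hook: returns a rejection code, or null when the FSM accepts.
type Dispatch = (action: Action) => Rejection | null;

const MAX_REJECTIONS = 5;

class BotDriver {
  private deciding = false; // per-driver mutex flag
  private brain: Brain;
  private dispatch: Dispatch;

  constructor(brain: Brain, dispatch: Dispatch) {
    this.brain = brain;
    this.dispatch = dispatch;
  }

  // Intended to be called from the state-change observer when it is the bot's turn.
  async onBotTurn(legalCandidates: Move[]): Promise<void> {
    if (this.deciding) return; // a decision is already in flight
    this.deciding = true;
    try {
      const attemptHistory: Attempt[] = [];
      while (true) {
        const action = await this.brain.decide({ legalCandidates, attemptHistory });
        const rejection = this.dispatch(action);
        // Accepted actions and non-commit actions end the loop.
        if (rejection === null || action.type !== "commit") return;
        attemptHistory.push({ move: { from: action.from, to: action.to }, rejection });
        if (attemptHistory.length >= MAX_REJECTIONS) {
          this.dispatch({ type: "resign" }); // cap hit: resign as the bot
          return;
        }
      }
    } finally {
      this.deciding = false;
    }
  }
}
```

The mutex flag is set synchronously before the first `await`, so a second state-change notification arriving while the brain is thinking returns immediately.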

`CasualBrain` (Phase 1, ~200 LoC)

Pure TypeScript, no I/O, deterministic when seeded.

Scoring per candidate move:

+50 if destination is geometrically reachable but not own-occupied (likely-capture proxy in blind mode).
+30 if first 8 moves and the move develops a knight or bishop.
+25 if the move is a pawn move toward the center (e/d files preferred).
+15 if the move advances rank toward opponent.
-40 if the move would leave a queen, rook, or minor piece on its starting square while another piece could have been developed (anti-shuffling penalty).
Tiny seedable random tiebreak.

Behavior:

Picks the highest-scored candidate; when attemptHistory contains rejections, excludes the rejected moves and retries with the next-highest-scored candidate.
Promotion: defaults to queen.
Draw offer auto-response: accept at material parity, decline at material lead (computed from own view only — biased and weak by design).
Casual never resigns voluntarily.
Vanilla mode: same scoring, but candidates come from chess.js .moves({verbose: true}) (which excludes self-check) instead of geometricMoves().
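
A partial sketch of the scoring and selection logic, under stated assumptions: it implements only the +30, +25, and +15 terms (the +50 capture proxy and the -40 anti-shuffling penalty need board context and are omitted), treats "develops" as "moves off the home rank", and uses the well-known mulberry32 generator for the seedable tiebreak. The `Candidate` shape and function names are hypothetical.

```typescript
type Color = "white" | "black";
type Piece = "p" | "n" | "b" | "r" | "q" | "k";
type Candidate = { from: string; to: string; piece: Piece }; // assumed shape

// mulberry32: small seedable PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const rankOf = (sq: string): number => Number(sq.charAt(1)); // "e4" -> 4

function scoreMove(m: Candidate, color: Color, moveNumber: number): number {
  let score = 0;
  const homeRank = color === "white" ? 1 : 8;
  // +30: knight or bishop leaves its home rank within the first 8 moves.
  const minor = m.piece === "n" || m.piece === "b";
  if (moveNumber <= 8 && minor && rankOf(m.from) === homeRank) score += 30;
  // +25: pawn move landing on the d or e file.
  if (m.piece === "p" && (m.to.charAt(0) === "d" || m.to.charAt(0) === "e")) score += 25;
  // +15: move advances toward the opponent's side of the board.
  const delta = rankOf(m.to) - rankOf(m.from);
  if ((color === "white" ? delta : -delta) > 0) score += 15;
  return score;
}

function pickMove(
  candidates: Candidate[],
  color: Color,
  moveNumber: number,
  seed: number,
  rejected: { from: string; to: string }[] = [],
): Candidate {
  const rng = mulberry32(seed);
  let best: Candidate | null = null;
  let bestScore = -Infinity;
  for (const m of candidates) {
    // Skip anything the FSM already rejected this turn.
    if (rejected.some((r) => r.from === m.from && r.to === m.to)) continue;
    const s = scoreMove(m, color, moveNumber) + rng() * 0.01; // tiny tiebreak
    if (s > bestScore) {
      best = m;
      bestScore = s;
    }
  }
  if (!best) throw new Error("no candidates left");
  return best;
}
```

The tiebreak term stays below 0.01, so it only separates moves with equal heuristic scores, and the same seed reproduces the same choice.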

`ReconBrain` (Phase 2)

Wraps an OllamaClient interface (testable) + a per-game chat history (in-memory only).

State:

class ReconBrain {
  private color!: Color;
  private mode!: Mode;
  private chat: { role: 'system'|'user'|'assistant'; content: string }[];
  private endpoint: OllamaEndpoint;
  private failedOver: boolean = false;
  private moveCount: number = 0;
}

init(): push one system message that establishes identity, what the bot can see, the moderator vocabulary, the output schema, and that its reasoning is private and persistent.

decide(input): push one user message describing new view + announcements + legal candidates, call /api/chat with the full history, parse the assistant reply, append the assistant message to history, return the action.

Ollama call config (per ~/bin/gemma4-research/SYNTHESIS.md "Mandatory Ollama Settings · multi-turn tool-calling agents"):

model: 'gemma4:26b'
options: { num_ctx: 32768, num_predict: 1024, temperature: 0.4 }
keep_alive: "30m"
Do not set think: false (silently breaks 26B in multi-turn loops; documented in gemma4-research/GOTCHAS.md § "think: false Kills Gemma 4 26B in Multi-Turn Tool-Calling Loops").
Do not use format: "json" (infinite loops on nested schemas; documented in gemma4-research/SYNTHESIS.md § "Anti-Patterns"). Extract {...} from response client-side via regex per the SYNTHESIS guidance.
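
As a sketch of how these settings could translate into a request body for Ollama's `/api/chat`, plus the client-side JSON extraction: the function names are hypothetical, and `stream: false` is an assumption (the spec only rules out live token streaming in the UI).

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Builds the /api/chat request body from the settings listed above.
function buildChatRequest(messages: ChatMessage[], temperature = 0.4) {
  return {
    model: "gemma4:26b",
    messages,
    stream: false, // assumption: wait for one complete reply
    keep_alive: "30m",
    options: { num_ctx: 32768, num_predict: 1024, temperature },
    // Deliberately no `think` and no `format` fields, per the two rules above.
  };
}

// Extracts the span from the first "{" to the last "}" and parses it.
// Returns null when no braces are found or the span is not valid JSON.
function extractJson(reply: string): unknown {
  const match = reply.match(/\{[\s\S]*\}/);
  if (!match) return null;
  try {
    return JSON.parse(match[0]);
  } catch {
    return null;
  }
}
```

The greedy regex tolerates prose before and after the object, which is the main reason to extract client-side instead of constraining the model's output format.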

System prompt skeleton (final wording deferred to implementation):

You are a chess agent playing BLIND CHESS as <COLOR>.
You see only your own pieces. The moderator announces moves with a fixed vocabulary.

## Your task each turn
1. Read the new announcements and your current view.
2. Update your beliefs about where opponent pieces likely are. Show this reasoning explicitly — your reasoning persists across turns and is your private memory.
3. Pick exactly one move from the legal candidates I provide.

## Output schema
Reply with JSON only, on its own line, no prose wrapper:
{"reasoning": "<your analysis>", "move": "<from>-<to>", "promotion": "q"|"r"|"b"|"n"|null}

## Vocabulary you'll see
[full enumeration of ModeratorText]

## Important
- Your reasoning is hidden from the human player. Be honest and detailed.
- Build up belief over turns. Reference your prior notes.
- If your move is rejected (you'll see "wont_help" or "illegal_move"), I'll show you the rejection and ask again. Don't repeat the rejected move.

Per-move user message skeleton:

Turn <N>. <COLOR> to move.

Announcements since your last turn:
- <list of ModeratorText entries with any payload>

Your view (own pieces, blind mode):
<list: piece, square>

Legal candidate moves:
<list: from-to, optionally with promotion>

Reply with reasoning + chosen move (JSON).
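
A minimal sketch of a formatter that fills in this skeleton. The `TurnContext` shape, the `- (none)` placeholder for a turn with no announcements, and the `(promotion: q)` rendering are assumptions; final wording is deferred to implementation, as with the system prompt.

```typescript
type Color = "white" | "black";
type OwnPiece = { piece: string; square: string };
type Candidate = { from: string; to: string; promotion?: "q" | "r" | "b" | "n" };

// Hypothetical input bundle for one turn's user message.
interface TurnContext {
  turn: number;
  color: Color;
  announcements: string[]; // already-rendered ModeratorText entries
  ownPieces: OwnPiece[];
  candidates: Candidate[];
}

function formatTurnMessage(ctx: TurnContext): string {
  const announcements =
    ctx.announcements.length > 0
      ? ctx.announcements.map((a) => `- ${a}`)
      : ["- (none)"]; // assumed placeholder when nothing was announced
  const pieces = ctx.ownPieces.map((p) => `${p.piece} ${p.square}`);
  const moves = ctx.candidates.map((c) =>
    c.promotion ? `${c.from}-${c.to} (promotion: ${c.promotion})` : `${c.from}-${c.to}`,
  );
  return [
    `Turn ${ctx.turn}. ${ctx.color.toUpperCase()} to move.`,
    "",
    "Announcements since your last turn:",
    ...announcements,
    "",
    "Your view (own pieces, blind mode):",
    ...pieces,
    "",
    "Legal candidate moves:",
    ...moves,
    "",
    "Reply with reasoning + chosen move (JSON).",
  ].join("\n");
}
```

Keeping the formatter a pure function makes the exact prompt text easy to snapshot in the ReconBrain payload tests.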

Bot registry

Lives in state.ts. Map<gameId, BotDriver>. Created on AI game creation, removed when the game ends. Lifetime is bound to the game; restart drops both, consistent with current MVP behavior.

Touches in existing code

File	Change
`packages/shared/src/protocol.ts`	Extend `CreateGameRequest` with `vsAi?: { brain: 'casual' \| 'recon' }`. Add `'ai_unavailable'` to `EndReason`. Add optional `aiInfo` to `joined` and `update` server messages.
`packages/server/src/state.ts`	Add `Game.aiOpponent?: { brain; color }` (informational). Add bot registry. Add `Game.aiThoughtsLog?: ChatTurn[]` populated at game end for the post-game reveal.
`packages/server/src/server.ts`	`POST /api/games` handles `vsAi`, runs preflight, creates `BotDriver`.
`packages/server/src/ws.ts`	State-change observer triggers attached `BotDriver`. No special-case handling inside `ws.ts` itself.
`packages/client/`	Two-section landing layout. AI badge under opponent slot. "AI is thinking..." indicator. Post-game thoughts reveal (Recon only).

Notably NOT changed: view.ts, commit.ts, translator.ts, geometric.ts, Announcement type, ModeratorText enum. Bots flow through them identically to humans.

Data flow

Game creation (vs Casual)

User clicks "Casual bot" on landing
  → POST /api/games  body: { mode, side, highlightingEnabled, vsAi: { brain: 'casual' } }
  → server: create Game, fill creator slot with new PlayerToken
  → server: create BotDriver{CasualBrain}, attach to Game, fill opposite slot
  → server: subscribe driver to game-state-change events
  → respond 201: { gameId, creatorToken, joinUrl: null }   // no shareable link
  → client navigates to /#/g/<id> and opens WS /ws?game=<id>
  → server: on hello, sends 'joined' with view (no aiInfo for Casual)
  → if user is white, user moves first; else CasualBrain.decide() fires immediately

Game creation (vs Recon)

Same as above except:

Server synchronously preflights the GPU endpoint list before responding to POST /api/games:
1. GET http://192.168.0.141:11434/api/tags with 1.5s timeout. 200 OK + gemma4:26b listed → primary selected.
2. else GET http://192.168.0.179:11434/api/tags with 1.5s timeout. 200 OK + gemma4:26b listed → fallback selected, log warning.
3. else respond HTTP 503 { error: 'ai_offline' }.
Adds ~50–200ms to the create call when steel141 is reachable.
If primary chosen, server fires a non-blocking warmup HTTP call (/api/chat with a minimal prompt, keep_alive: "30m") so the model is in VRAM by the bot's first move.
BotDriver{ReconBrain} is attached; ReconBrain's chat history seeded with the system prompt.
Server response includes aiInfo: { model, gpu, host } so the client renders the badge.
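
The preflight order described above can be sketched as a loop over the priority list. To keep the example testable without a network, the HTTP call is abstracted behind a hypothetical `TagsProbe` function that is expected to resolve with the model names an endpoint reports and to reject on connection errors, non-200 responses, or timeout. The `Endpoint` shape is likewise an assumption.

```typescript
type Endpoint = { baseUrl: string; gpu: string; host: string };

// Hypothetical probe: wraps GET <baseUrl>/api/tags and resolves with model names.
type TagsProbe = (baseUrl: string, timeoutMs: number) => Promise<string[]>;

const ENDPOINTS: Endpoint[] = [
  { baseUrl: "http://192.168.0.141:11434", gpu: "RTX 3090 Ti", host: "steel141" },
  { baseUrl: "http://192.168.0.179:11434", gpu: "Tesla V100", host: "pve197" },
];

// Returns the first endpoint that is reachable and lists the model, or null
// when none qualifies (the caller would then respond 503 { error: 'ai_offline' }).
async function preflight(
  probe: TagsProbe,
  model = "gemma4:26b",
  timeoutMs = 1500,
): Promise<Endpoint | null> {
  for (const ep of ENDPOINTS) {
    try {
      const names = await probe(ep.baseUrl, timeoutMs);
      if (names.includes(model)) return ep;
    } catch {
      // Unreachable or timed out: fall through to the next endpoint.
    }
  }
  return null;
}
```

Probing sequentially keeps the happy path at a single round trip; the worst case is two timeouts, about 3s, before the 503.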

The bot's turn

[trigger]: game state transitions to "bot's turn" (after human commit, OR at game start if bot is white)

driver:
  if alreadyDeciding for this game: ignore (mutex)
  else mark "deciding" = true:
    1. compute BrainInput:
       - view = buildView(game, bot.color)
       - newAnnouncements = announcements added since last decide call
       - legalCandidates =
           if mode === 'vanilla': chess.js .moves({verbose: true}) for bot.color
           else (blind):           geometricMoves(piece, sq, ownSquares) over own pieces,
                                   plus promotion-required moves
       - attemptHistory = []
    2. action = await brain.decide(input)
    3. dispatch action through normal commit handler:
       - 'commit': call commit handler (same one ws.ts uses)
         - if FSM rejects with wont_help/illegal_move:
           - append to attemptHistory; goto step 2 with updated input; max 5 retries
           - on retry-cap-hit: dispatch {type: 'resign'} (loud log)
         - if FSM accepts: turn ends, observers fire (including this driver if game continues)
       - 'resign' / 'offer-draw' / 'respond-draw': pass-through
    4. mark "deciding" = false
    5. log brain reasoning to journald (Recon only)

Opponent (human) move arrives at the bot

human commits a move → ws.ts dispatches → FSM accepts → translator emits Announcement[]
  → game state-change event fires
  → driver observes
  → driver checks "is it now bot's turn?": yes → next decide() call; no → idle

The driver does NOT need a separate signal for "opponent moved." The state-change observer covers it.

Game end

state.status transitions to 'finished'
  → driver observes
  → driver copies ReconBrain.chat → Game.aiThoughtsLog (Recon only; Casual has no thoughts to copy)
  → driver disposes brain (close any in-flight HTTP for ReconBrain via AbortController)
  → driver removes itself from registry
  → janitor (existing) prunes the game after 30min idle, same as humans
  → reveal: client renders full board for both sides (existing post-game UX)
  → AI-thoughts post-game reveal (Recon only): collapsible "View gemma4's reasoning" section
    on the game-over screen, shows chat history as a chronological log of
    {ply N, view at that time, announcements heard, reasoning, move played}

Mid-game GPU failover

ReconBrain.decide() → HTTP call to current endpoint
  → connection error / 5xx / 30s timeout
  → driver:
    1. log: "<endpoint> failed mid-game, attempting failover to <other>"
    2. preflight the other endpoint (1.5s timeout)
       - 200 OK → switch ReconBrain.endpoint to fallback; mark failedOver = true
       - else → bot resigns with endReason 'ai_unavailable'
    3. retry the SAME decide() call against the new endpoint
       - same chat history, same user message, no replay
    4. on success: emit a UI-system message (NOT a moderator Announcement):
       "AI moved to V100 (steel141 unreachable). Moves may take longer." + update aiInfo badge
    5. on failure: bot resigns

Failover triggers (HTTP-layer only):

Connection refused / DNS fail
5xx status
Per-move timeout (30s normal, 90s first-move)

Does not trigger failover:

Malformed JSON in Gemma's response → existing temp-bump-retry path
Move-not-in-candidates → existing "pick from the list" retry
wont_help from the FSM → existing retry-with-attemptHistory path

One-way only: once failed over to V100, stays there for the rest of the game. No flap-back.
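
The trigger rules and the one-way transition can be expressed as two small pure functions. This is a sketch: the `Failure` union and `EndpointState` shape are hypothetical, and treating "already failed over" as terminal is one reading of the one-way rule.

```typescript
// Hypothetical classification of everything that can go wrong in a decide() call.
type Failure =
  | { kind: "connection" } // refused / DNS failure
  | { kind: "http"; status: number }
  | { kind: "timeout" }
  | { kind: "malformed_json" }
  | { kind: "not_a_candidate" }
  | { kind: "wont_help" };

// Only HTTP-layer failures trigger failover; model-output and FSM problems
// go through their own retry paths.
function triggersFailover(f: Failure): boolean {
  switch (f.kind) {
    case "connection":
    case "timeout":
      return true;
    case "http":
      return f.status >= 500;
    default:
      return false;
  }
}

// `current` indexes the endpoint priority list (0 = primary, 1 = fallback).
type EndpointState = { current: 0 | 1; failedOver: boolean };

// One-way transition: a game may switch endpoints at most once.
function failOver(state: EndpointState, otherHealthy: boolean): EndpointState | "resign" {
  if (state.failedOver || !otherHealthy) return "resign";
  return { current: state.current === 0 ? 1 : 0, failedOver: true };
}
```

Because `failedOver` never resets, a recovered primary is ignored for the rest of the game, which is the no-flap-back behavior described above.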

Disconnect / reconnect (human side, AI game)

human WS drops → existing 5-minute grace timer starts
  → BotDriver: if it's the bot's turn, the in-flight Ollama call (if any) is allowed to
    complete and the move is committed; the result is visible when the human reconnects.
  → grace expires → existing path: game ends with endReason 'abandoned', AI wins.
  → human reconnects within grace → existing path: 'joined' message with full state.

Error handling

Failure	Detection	Response
Both Ollama endpoints down at game-creation time	Preflight 1.5s timeout × 2	HTTP 503 `{ error: 'ai_offline' }`. Game never created. Client shows "AI is offline right now, try again later."
Primary down at game-creation, fallback up	Preflight	Game created on V100. `aiInfo` reflects V100 from the start. Game-start UI message: `"steel141 unreachable; playing on pve197 V100."`
Primary dies mid-game, fallback up	First failed `decide()`	Section "Mid-game GPU failover" above.
Both endpoints die mid-game	Failover attempt also fails	Bot resigns with `endReason: 'ai_unavailable'`. Moderator panel: `"AI service became unavailable. Game ended."` Human "wins" but post-game labels it `"AI unavailable"` rather than crowing.
Gemma returns malformed JSON	Client-side regex `\{[\s\S]*\}` fails OR `JSON.parse` throws	Retry once with `temperature += 0.1`. Second failure: fall back to `CasualBrain.decide()` for this turn only; chat history doesn't get the failed turn appended. Loud log.
Gemma proposes a move not in `legalCandidates`	Driver compares Gemma's `{from, to}` against the candidate list	Append corrective user message: `"That move wasn't a candidate. Pick from this list: <re-paste>."` Retry once. Second failure: same Casual fallback.
Gemma's move is in candidates but FSM rejects with `wont_help`	Standard FSM path	Append rejection to `attemptHistory`, append corrective user message to chat history, decide again. Bounded by driver retry cap (5). On cap-hit: bot resigns.
Driver retry cap (5) hit	Internal counter	Bot resigns. Moderator panel: `"AI ran out of valid moves to consider."` Human gets the win.
Per-move timeout (30s normal / 90s first-move)	`AbortController` on the HTTP call	Treat as endpoint failure → failover path.
Bot tries to commit on a finished game	Driver state-change observer race	Discard the action. The mutex + observer should prevent this in practice, but the dispatch handler returns an error which the driver swallows.
Two simultaneous `decide()` invocations on the same driver	Per-driver mutex	Second invocation is a no-op.
`BotDriver` instance leaks	Janitor sweep	Janitor double-checks: any driver whose game is `finished` or absent → dispose.
Server restart with active AI games	All in-memory state lost (existing behavior)	Same as human-vs-human: games disappear. Acceptable for MVP.
Casual / Recon encounters a position with zero legal candidates	`legalCandidates.length === 0`	Means stalemate or checkmate: the game has already ended via the FSM, so the driver shouldn't have been invoked. If it happens anyway: log loudly, bot resigns.

The "fall back to Casual for this turn only" pattern protects gameplay continuity at the cost of consistency in the bot's reasoning history. The chat history never gets the failed turn appended; Gemma never sees that Casual played on its behalf — the next user message just says "your turn N+1" as if Gemma had played the actual move. Pragmatic compromise that keeps games playable instead of crashing them on flaky LLM output.
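
A simplified sketch of this pattern, assuming hypothetical `Ask` and `casualPick` hooks. It merges the malformed-JSON and move-not-in-candidates cases into one retry for brevity (the table above gives the second case its own corrective message), and it appends to the history only when a turn succeeds.

```typescript
type Move = { from: string; to: string };
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical hook: performs one model call and returns the raw reply text.
type Ask = (history: ChatMessage[], temperature: number) => Promise<string>;

// Returns the matching candidate, or null if the reply has no usable move.
function parseMove(reply: string, candidates: Move[]): Move | null {
  const match = reply.match(/\{[\s\S]*\}/);
  if (!match) return null;
  let parsed: unknown;
  try {
    parsed = JSON.parse(match[0]);
  } catch {
    return null;
  }
  const move = (parsed as { move?: unknown }).move;
  if (typeof move !== "string") return null;
  const [from, to] = move.split("-");
  return candidates.find((c) => c.from === from && c.to === to) ?? null;
}

async function decideWithFallback(
  history: ChatMessage[],
  userMsg: ChatMessage,
  candidates: Move[],
  ask: Ask,
  casualPick: (candidates: Move[]) => Move,
): Promise<{ move: Move; source: "recon" | "casual" }> {
  // First attempt at the base temperature, one retry bumped by 0.1.
  for (const temperature of [0.4, 0.5]) {
    const reply = await ask([...history, userMsg], temperature);
    const move = parseMove(reply, candidates);
    if (move) {
      // Only a successful turn is recorded in the persistent history.
      history.push(userMsg, { role: "assistant", content: reply });
      return { move, source: "recon" };
    }
  }
  // Both attempts failed: Casual plays this turn; history is left untouched.
  return { move: casualPick(candidates), source: "casual" };
}
```

Passing a copy of the history to `ask` means a failed attempt leaves no trace, so the next turn's message can proceed as if Gemma had chosen the move itself.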

Testing

Five testing layers. Existing harness is 43 tests; this feature adds ≈30–40 new tests.

Unit tests — `CasualBrain` (≈10 tests)

Single candidate → picks it.
Multi-candidate scoring: with deterministic seed, capture-bias produces expected ranking.
Development heuristic (+30) activates only within the first 8 moves.
attemptHistory causes the previously-rejected move to be excluded.
Promotion defaults to queen.
Draw-offer auto-response: parity → accept; lead → decline.
Zero-candidate input → throws.
Vanilla candidates ≠ blind candidates.

Unit tests — `ReconBrain` with mocked Ollama (≈12 tests)

ReconBrain is wired to an OllamaClient interface; tests inject a stub. No real network in tests.

Sends correct payload shape.
Chat history seeded with system message on init().
Successful response: assistant message appended; BrainAction matches parsed JSON.
Multi-turn: 3 decide() calls produce 1 system + 6 alternating user/assistant.
Malformed JSON: stub returns garbage → second call with temperature bumped +0.1 → succeeds → only second turn appended to history.
Both retries malformed → throws ReconLLMUnavailable.
Move not in legalCandidates: corrective user message appended; second call returns valid candidate.
Endpoint failover: first call rejects with EndpointUnreachable; brain switches endpoint; assert payload to second endpoint matches.
One-way failover: after first failover, brain stays on V100.
dispose() cancels in-flight HTTP via AbortController.
System prompt isolation: fresh ReconBrain doesn't share state with disposed one.
White vs Black: system prompt parametrizes color correctly.

Unit tests — `BotDriver` (≈8 tests)

Driver is wired to a Brain interface; tests use a StubBrain with scriptable responses. Game state is real.

Mutex: two simultaneous "your turn" triggers → one decide() call.
State-change observer fires decide() only when game.toMove === bot.color.
Game finished → driver disposes brain and unsubscribes.
Retry cap (5): stub brain returns wont_help-inducing move 5 times; driver dispatches resign on attempt 6.
Casual fallback for malformed Recon turn: stub brain throws ReconLLMUnavailable; driver invokes a CasualBrain for that turn only; Recon's chat history not modified.
AI-unavailable end-state propagates endReason: 'ai_unavailable'.
Disconnect during the bot's turn: in-flight decide() allowed to complete; move committed; reconnecting human sees the move.
Janitor disposes orphaned drivers.

Real WS integration tests (≈6 tests)

Same pattern as the existing 4 WS integration tests — ephemeral port, real Fastify, real ws client. Bot driver uses a StubBrain (no Ollama).

Create AI game (Casual): server returns aiInfo: undefined; joined received; bot plays first move if white.
Create AI game (Recon stubbed): joined includes aiInfo; preflight passes (stubbed); bot plays first move via stub brain.
Full Casual vs scripted-human game: 8 moves played, integration end-to-end; capture announced correctly.
Recon failover surfaces in a server message: kill primary endpoint stub mid-game, observe aiInfo update.
Per-move timeout: stub brain's decide() hangs > 30s; driver triggers failover or resign; client observes the right end state.
AI-unavailable preflight: both endpoint stubs return errors; POST /api/games returns 503; client renders error.

Self-play harness — `scripts/selfplay.ts` (operator tool, NOT in CI)

pnpm selfplay --white casual --black casual --games 100
pnpm selfplay --white recon --black casual --games 50
pnpm selfplay --white recon --black recon --games 10

Reports: win/loss/draw breakdown, average moves per game, average per-move latency (Recon), reasoning-log archive (Recon, written to tmp/selfplay-runs/<timestamp>/).

Live smoke checklist (manual, post-deploy)

Create Casual game, play to completion. Both modes (vanilla + blind).
Create Recon game on warm steel141. Play 10 moves. Latency ≤4s/move. Inspect journald.
Create Recon game on cold steel141 (after 31min idle). First move 30–60s; UI shows "AI is starting up..." Subsequent moves fast.
Force failover: stop steel141 ollama mid-game; bot fails over to V100; UI badge updates.
Force ai-unavailable: stop both Ollama services; new Recon game returns 503 with friendly error.
Post-game thoughts reveal: collapsible section appears for Recon, absent for Casual; reasoning matches the moves played.

UX

Landing page

Two visually distinct sections, sharing the same control vocabulary (mode/side/highlight):

┌─────────────────────────────────────┐
│  Play with a friend                 │
│  ─────────────────────────          │
│  [mode] [side] [highlight]          │
│  ( Create Game )                    │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Play vs Computer                   │
│  ─────────────────────────          │
│  [mode] [side] [highlight]          │
│  ( Casual bot )  ( gemma4 recon )   │
└─────────────────────────────────────┘

Tooltip / hover help:

Casual bot — "Fast, plays simple moves, makes mistakes. Good for a quick game."
gemma4 recon — "Gemma 4 large language model. Reasons about hidden information across turns. Slower; first move may take up to a minute. Plays better in blind mode than Casual."

Opponent slot — in-game badge

State	Badge text
Casual game	`"Casual bot"`
Recon game (3090)	`"gemma4:26b · RTX 3090 Ti"`
Recon game (V100 from game start)	`"gemma4:26b · Tesla V100"`
Recon game (failover)	`"gemma4:26b · V100 (failed over)"` (badge color shifts amber)
Mobile narrow	Truncates to `"gemma4 · 3090 Ti"` etc.
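
A small sketch of how the badge strings in this table could be derived from `aiInfo`. The function name, the `AiInfo` field types, and the rule for shortening GPU names (dropping a leading "RTX" or "Tesla") are assumptions chosen to reproduce the rows above; the narrow failover case is not specified and is left unhandled.

```typescript
type AiInfo = { model: string; gpu: string; host: string };

function badgeText(ai: AiInfo | undefined, failedOver = false, narrow = false): string {
  if (!ai) return "Casual bot"; // Casual games carry no aiInfo
  const shortGpu = ai.gpu.replace(/^(RTX|Tesla)\s+/, ""); // "RTX 3090 Ti" -> "3090 Ti"
  if (narrow) return `${ai.model.split(":")[0]} · ${shortGpu}`; // drop the ":26b" tag
  if (failedOver) return `${ai.model} · ${shortGpu} (failed over)`;
  return `${ai.model} · ${ai.gpu}`;
}
```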

"AI is thinking" indicator

When it's the bot's turn:

Bot / situation	Indicator
Casual	`"Casual bot is moving..."` (rarely visible)
Recon, normal	`"gemma4 is thinking..."` with animated ellipsis
Recon, first move only	`"gemma4 is starting up..."` (cold-start framing)
Recon, just failed over	Moderator-panel-area system message: `"AI moved to V100 (steel141 unreachable). Moves may take longer."`

Moderator panel

Vocabulary unchanged. AI reasoning is never rendered here during play. Two new UI-system messages (style-distinct from Announcement entries):

Game-start (Recon only): "You are playing gemma4:26b on RTX 3090 Ti (steel141)."
Failover (Recon only): "AI moved to V100 (steel141 unreachable). Moves may take longer."

Post-game thoughts reveal (Recon only)

Below the existing game-over content:

▾ View gemma4's reasoning (32 turns)
  ┌─────────────────────────────────────┐
  │ Turn 1, your move was e2-e4         │
  │ gemma4 (Black) thought:             │
  │ "<reasoning text>"                  │
  │ → played c7-c5                      │
  ├─────────────────────────────────────┤
  │ Turn 2, ...                         │
  └─────────────────────────────────────┘

Collapsed by default on mobile. Casual games omit the section.

Resign / draw / disconnect

Resign button: same UX. Game ends, AI thoughts log is revealed (Recon only).
Offer draw: human can offer; bot responds via respond-draw. Casual: heuristic auto-response. Recon: passes the offer as a user message; Gemma decides via JSON output schema with accept: bool.
Bot resigns (retry cap / AI-unavailable): post-game labels the end appropriately rather than crowing about a human win.
Human disconnect during AI game: existing 5-min grace; bot's in-flight decide() (if any) completes; result visible on reconnect.

Things explicitly NOT in MVP UX

Live token streaming during Gemma's thinking. Static indicator only.
Difficulty slider. Two named buttons only.
Public AI vs AI spectate-able games. Self-play is CLI-only.
Hint button in human-vs-human games.
"Watch the AI think" mode.

Acceptance criteria

Phase 1 (Casual) is "done" when:

100 Casual-vs-Casual games complete with no crashes.
Median game length is between 20 and 200 moves.
Casual reliably beats a "random legal move" baseline (≥80% over 100 games).
All Phase 1 unit + integration tests pass.
Live smoke checklist for Casual passes.
AI-game creation and play work end-to-end on the live URL.

Phase 2 (Recon) is "done" when:

Recon wins ≥60% over 50 Recon-vs-Casual games, both colors.
Average per-move latency ≤8s on the 3090 Ti (≤10s on V100), with cold-start excluded.
Manual inspection of 10 random reasoning logs shows Gemma is using announcements as evidence (not just plausible-sounding text).
All Phase 2 unit + integration tests pass.
Live smoke checklist for Recon passes (warm, cold, failover, both-down).
Post-game reasoning reveal renders correctly on phone and desktop.

Decision triggers if Phase 2 misses bars

If Recon wins <60% but >40% vs Casual: prompt-engineering rabbit hole. Iterate on system prompt + per-turn message format. Try presenting candidates differently (e.g., with annotations).
If Recon wins <40%: design signal. Either 26B isn't strong enough (try 31B at 5× latency cost — would also need to revisit per-move timeout caps) OR the candidate-list framing is wrong (consider feeding Gemma a textual board representation instead of just candidate moves).
If latency is consistently >15s/move: the 32K context approach may be too expensive. Consider context compaction (summarize older turns into a "what I've inferred so far" running summary).

Risks / open questions

Recon plays at some level, but how well? This is the central research unknown. LLMs play vanilla chess poorly (they are not well trained on game positions), but the task here is different: Gemma isn't being asked to compute tactical depth; it's being asked to reason about what evidence implies about hidden state and to pick a move from a pre-computed legal list. That's much more LLM-shaped. Still, the 60% Recon-vs-Casual bar is a guess; we'll learn the real number from the self-play harness.
Cold-start UX on first move. 30–60s is long. The "AI is starting up..." copy mitigates but doesn't eliminate. If users complain we can: (a) preflight harder (an actual /api/chat warmup with keep_alive: -1), (b) offer Casual as a one-click fallback if the user gets impatient, (c) shrink to gemma4:e4b for first-move-only and switch to 26B for subsequent. None of these are MVP.
Chat history grows unboundedly. A 100-move game accumulates ~25K tokens. 32K context covers that comfortably, but a longer game (which is rare in casual play, but possible) would overflow. Mitigation: if we hit context overflow in practice, add per-turn compaction — replace the oldest 20 turns with a summary turn. Not MVP unless seen.
3090 GPU contention with mort-3090-scheduler. The scheduler is supposed to yield to other GPU users, but whether it actually does so under chess load is unmeasured. Mitigation: monitor steel141 GPU utilization during early Recon games; if mort jobs interfere we'd need explicit coordination (e.g., a held lock).
Bot proposes moves whose consequences it can't see. Casual's bias toward "geometrically reachable but not own-occupied" squares is just a heuristic; many such moves walk into traps. This is intentional for Casual (low strength is the design target), but if Recon makes the same mistakes despite reasoning, the prompt template needs tuning. Self-play exposes this clearly.
The post-game reasoning reveal could be embarrassing. Gemma might write reasoning that's confidently wrong in a way that makes the AI look dumb. Per gemma4-research/SYNTHESIS.md, Gemma is "ultra-compliant and highly capable but doesn't know who it is" — strong system prompt mitigates the worst, but reasoning logs are essentially uncurated LLM output. Mitigation: sample 10 logs early, iterate prompt to suppress overly confident bad takes.
Floating-point results differ across GPU architectures. Gemma will produce slightly different tokens on V100 vs 3090 Ti. Mitigation: none needed. We never compare outputs across calls; we only need a reasonable response from each one.
No mid-game flap-back. If steel141 recovers, we don't switch back from V100. Consequence: a recovered 3090 doesn't help an already-failed-over game. Mitigation: none in MVP. Cost is a slightly slower remainder of the game; acceptable.
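For the chat-history risk, the proposed mitigation (replace the oldest 20 turns with a summary turn) might look like the sketch below. It is illustrative only and explicitly not MVP. Assumptions: a "turn" is one user message plus one assistant reply, the system prompt sits at index 0 and is preserved, the summary is injected as a `user` message, tokens are estimated at roughly 4 characters each (a rule of thumb, not Gemma's tokenizer), and `summarize` is a caller-supplied hypothetical function.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Crude token estimate: ~4 characters per token (assumption, not a tokenizer).
function estimateTokens(messages: ChatMessage[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}

// If the history exceeds tokenBudget, fold the oldest `turnsToFold` turns
// (two messages each) into a single summary message. Otherwise return the
// history unchanged. The default budget leaves headroom under a 32K context.
function compactHistory(
  history: ChatMessage[],
  summarize: (oldMessages: ChatMessage[]) => string,
  tokenBudget = 28_000,
  turnsToFold = 20,
): ChatMessage[] {
  if (estimateTokens(history) <= tokenBudget) return history;

  const hasSystem = history.length > 0 && history[0].role === "system";
  const head = hasSystem ? history.slice(0, 1) : [];
  const body = hasSystem ? history.slice(1) : history;

  const foldCount = Math.min(turnsToFold * 2, body.length);
  const summary: ChatMessage = {
    role: "user",
    content: `Summary of earlier turns: ${summarize(body.slice(0, foldCount))}`,
  };
  return [...head, summary, ...body.slice(foldCount)];
}
```

At the spec's estimate of ~25K tokens per 100 moves (about 250 tokens per move), a 32K context would be reached at roughly 128 moves, which is why this only matters for unusually long games.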

Out of scope (deferred to post-MVP)

Difficulty slider / strength selection beyond two named buttons.
Stockfish integration (vanilla mode strength via real chess engine).
AI vs AI spectate-able public games.
Live token streaming during Gemma's thinking.
Hint button in human-vs-human games.
Per-turn context compaction for long games.
Mid-game GPU flap-back to recovered primary.
Multi-model selection (e.g., "play vs gemma4:31b" or "play vs qwen3-coder-next").
Persistent reasoning logs across game restart (would require SQLite per the existing deferred row).
Bot rating / Elo tracking across games.
Bot personalities / styles ("aggressive recon", "defensive recon").
In-game chat (player ↔ player or human ↔ Gemma). Considered 2026-04-28; deferred indefinitely. Player chat in blind mode is a side channel that bypasses the moderator-vocabulary security boundary; chat with Gemma leaks the bot's belief state and undermines the post-game reasoning reveal. See DECISIONS.md "Deferred / Rejected" for the full rationale.

Appendix A — Module layout

packages/server/src/bot/
├── index.ts             # public API: createBotDriver, BotRegistry types
├── driver.ts            # BotDriver class
├── brain.ts             # Brain interface, BrainInput, BrainAction types
├── casual-brain.ts      # CasualBrain class
├── recon-brain.ts       # ReconBrain class
├── ollama-client.ts     # OllamaClient interface + production HTTP impl
├── ollama-endpoints.ts  # endpoint priority list, preflight logic
├── prompt.ts            # system prompt template, per-turn user message builder
├── parse.ts             # extract JSON {reasoning, move, promotion} from response
└── candidates.ts        # legal candidate computation (vanilla vs blind)

packages/server/test/unit/bot/
├── casual-brain.test.ts
├── recon-brain.test.ts
├── driver.test.ts
├── ollama-endpoints.test.ts
└── parse.test.ts

packages/server/test/integration/
├── ai-game-casual.test.ts
├── ai-game-recon-stub.test.ts
└── ai-game-failover.test.ts

scripts/
└── selfplay.ts          # operator tool, not in CI
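As an illustration of what parse.ts is responsible for, here is a minimal sketch of extracting the {reasoning, move, promotion} object from a model response. The `ParsedReply` type, the `parseReply` function, and the idea of validating the move against the candidate list are assumptions for illustration; the move notation is treated as an opaque string because its format is not specified in this section.

```typescript
// Hypothetical shape of the parsed reply; field names follow the layout comment.
interface ParsedReply {
  reasoning: string;
  move: string;
  promotion?: string;
}

// Tolerates prose or code fences around the JSON by taking the span from the
// first "{" to the last "}". Returns null on any parse or validation failure
// so the caller can decide whether to retry or fall back.
function parseReply(
  raw: string,
  candidates: readonly string[],
): ParsedReply | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let data: unknown;
  try {
    data = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;

  const obj = data as Record<string, unknown>;
  const move = obj["move"];
  const reasoning = obj["reasoning"];
  const promotion = obj["promotion"];

  // The move must be one of the pre-computed legal candidates.
  if (typeof move !== "string" || !candidates.includes(move)) return null;

  const reply: ParsedReply = {
    reasoning: typeof reasoning === "string" ? reasoning : "",
    move,
  };
  if (typeof promotion === "string") reply.promotion = promotion;
  return reply;
}
```

Returning null rather than throwing keeps the retry/fallback decision in the driver, which matches the separation between parse.ts and driver.ts in the layout above.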

Appendix B — Gemma 4 prompt cookbook references

~/bin/gemma4-research/SYNTHESIS.md — opinionated guide; multi-turn settings; anti-patterns
~/bin/gemma4-research/GOTCHAS.md § "think: false Kills Gemma 4 26B in Multi-Turn Tool-Calling Loops"
~/bin/gemma4-research/CORPUS_ollama_variants.md — model selection, VRAM, defaults
~/bin/gemma4-research/docs/reference/gpu-bakeoff-2026-04-20.md — 3090 Ti vs Strix throughput, MoE vs dense
~/bin/gemma4-research/docs/reference/mort-bakeoff-2026-04-18.md — <think> tokens stripped from Ollama 0.20.4 serialized history

Appendix C — Implementation phases

Phase 1 — Casual bot (single-week scope)

Module scaffold under packages/server/src/bot/.
Brain interface, BrainInput, BrainAction types.
CasualBrain implementation + unit tests.
BotDriver implementation + unit tests (with StubBrain).
legalCandidates computation (vanilla + blind paths) + tests.
Protocol additions: vsAi request field; bot registry in state.ts.
POST /api/games handles vsAi.brain === 'casual'.
ws.ts state-change observer wires up the driver.
Client landing page two-section layout.
Client opponent-slot badge + thinking indicator.
Integration tests: ai-game-casual.test.ts.
Self-play harness: scripts/selfplay.ts Casual-vs-Casual.
Deploy to CT 690; run live smoke checklist for Casual.
Update DECISIONS.md with the phase outcome.

Phase 2 — Recon bot (multi-week scope)

OllamaClient interface + production HTTP impl with AbortController.
ollama-endpoints.ts with preflight + failover logic + tests.
prompt.ts system prompt + per-turn message builder.
parse.ts JSON extraction + unit tests.
ReconBrain implementation + unit tests (mocked Ollama).
Protocol additions: aiInfo, 'ai_unavailable' end reason, post-game reasoning fields.
POST /api/games handles vsAi.brain === 'recon' (preflight + warmup).
Driver retry/fallback paths; mid-game failover wiring.
Client GPU badge + system messages + post-game reasoning reveal.
Integration tests: ai-game-recon-stub.test.ts, ai-game-failover.test.ts.
Self-play harness: Recon-vs-Casual mode.
Iterate on prompt template based on self-play results until 60% bar met.
Deploy to CT 690; run live smoke checklist for Recon (warm, cold, failover, both-down).
Update DECISIONS.md with the phase outcome.
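The first two Phase 2 items (an HTTP client with AbortController, and endpoint failover) can be sketched together. This is an assumption-laden illustration, not the planned implementation: `chatWithFailover`, `FetchLike`, and the 30-second default timeout are hypothetical, and the fetch function is injected so the sketch can be exercised without a network. The request and response shapes follow Ollama's `/api/chat` endpoint with `stream: false`.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Minimal subset of fetch used here, so a stub can be injected in tests.
type FetchLike = (
  url: string,
  init: {
    method: string;
    headers: Record<string, string>;
    body: string;
    signal: AbortSignal;
  },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Tries each endpoint in priority order; each attempt is aborted after
// timeoutMs. Returns the first successful reply and the endpoint that
// served it. Throws the last error if every endpoint fails.
async function chatWithFailover(
  endpoints: readonly string[],
  model: string,
  messages: ChatMessage[],
  fetchImpl: FetchLike,
  timeoutMs = 30_000, // hypothetical default; the spec's real cap may differ
): Promise<{ endpoint: string; content: string }> {
  let lastError: unknown = new Error("no endpoints configured");
  for (const endpoint of endpoints) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await fetchImpl(`${endpoint}/api/chat`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model, messages, stream: false }),
        signal: controller.signal,
      });
      if (!res.ok) throw new Error(`HTTP ${res.status} from ${endpoint}`);
      const data = (await res.json()) as { message?: { content?: unknown } };
      const content = data.message?.content;
      if (typeof content !== "string") {
        throw new Error(`malformed reply from ${endpoint}`);
      }
      return { endpoint, content };
    } catch (err) {
      lastError = err; // remember the failure and try the next endpoint
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

In production the global `fetch` available in Node 18+ could be passed as `fetchImpl`. Returning the serving endpoint alongside the content lets the caller pin the rest of the game to that endpoint, consistent with the no-flap-back decision.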
