AI / Computer Player — Design Spec
Project: blind_chess
Date: 2026-04-28
Status: Draft (awaiting user review)
Builds on: 2026-04-28-blind-chess-design.md (the deployed-MVP architecture)
Supersedes: none (this is a new feature)
Decision reversal
DECISIONS.md line "2026-04-28: Client-side AI / hint generation — explicitly out of scope. Human vs. human only" is superseded as of 2026-04-28 by Seth's directive to add a computer-opponent feature.
The reversal is partial: client-side AI / hint generation in human-vs-human games remains rejected. This spec adds AI only in the human-vs-AI path. Human-vs-human games are unchanged.
Executive summary
Add two AI opponents to blind_chess:
- Casual bot: algorithmic, in-process, ~200 LoC of TypeScript. Plays legal moves with simple heuristics. Always available; no external dependencies. Plays badly but quickly.
- gemma4 recon bot: multi-turn chat agent backed by gemma4:26b running on the homelab Ollama service (steel141 RTX 3090 Ti primary, pve197 V100 fallback). Maintains a private per-game chat history that persists across turns as the bot's memory, allowing it to build belief about hidden opponent positions over time. Reasoning is hidden from the human during play and revealed in a collapsible post-game panel.
Both bots play through the same view filter and finite-state machine that humans use. The architectural invariant from CLAUDE.md ("the view filter is the only egress for board state") applies to bots: a bot consumes only buildView(game, botColor) plus moderator announcements. No oracle access. The Recon bot is honestly playing blind chess, not pretending to.
The feature ships in two phases: Casual first (single-week scope, low risk), Recon second (research-flavored multi-week scope, depends on Gemma 4 prompt engineering). The shared infrastructure (BotDriver, Brain interface, in-process dispatch path) is built in Phase 1 and reused in Phase 2.
Goals
| # | Goal | How we know it's met |
|---|---|---|
| 1 | "Always-available opponent" — a user can play a legal chess game alone, on demand, without a friend | Casual bot completes 100 self-play games without crashes; legal moves only |
| 2 | Showcase the blind-chess problem — demonstrate an agent reasoning under uncertainty | Recon bot wins ≥60% over 50 Recon-vs-Casual games (both colors); 10 random reasoning logs show Gemma using announcements as evidence |
| 3 | Architectural integrity — bot doesn't get oracle access | Bot input is BoardView (filtered) + Announcement[]; no test or code path bypasses the view filter |
| 4 | Mobile-first UX consistent with existing site | Two-section landing stacks on narrow viewports; AI badge fits opponent slot; thinking indicator visible |
| 5 | Honest GPU surface — user knows which hardware Gemma is running on | aiInfo field in protocol; persistent badge; failover updates badge |
| 6 | Graceful degradation when Ollama is unavailable | Preflight failure → 503 with friendly message; mid-game failure → failover to V100; both endpoints down → bot resigns with endReason: 'ai_unavailable' |
Non-goals (explicit)
- Strong vanilla chess play. Both bots play vanilla mode but neither uses Stockfish; vanilla is a side-effect, not a feature target.
- AI vs AI spectator-able games in the public UI. The self-play harness is a CLI tool, not a UX feature.
- Live token streaming during Gemma's thinking. Static "AI is thinking..." indicator only.
- Difficulty slider. Two named buttons (Casual, Recon) — no continuum.
- Hint generation in human-vs-human games. Still out of scope.
- Mid-game GPU flap-back. Once failed over, stays on fallback for the rest of the game.
- Browser E2E testing. Existing project decision (DECISIONS.md row "End-to-end browser tests") still applies.
Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ blind-chess server (CT 690, Fastify on :3000) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────────┐ │
│ │ WS clients │ │ REST routes │ │ BotDriver (per-game) │ │
│ │ (humans) │ │ │ │ │ │
│ └──────┬───────┘ └──────┬───────┘ │ ┌────────────────────┐ │ │
│ │ │ │ │ CasualBrain │ │ │
│ ▼ ▼ │ │ (algorithmic) │ │ │
│ ┌─────────────────────────────────┐ │ └────────────────────┘ │ │
│ │ ws.ts: dispatch (commit / etc) │◀──┤ ┌────────────────────┐ │ │
│ │ commit.ts: touch-move FSM │ │ │ ReconBrain │ │ │
│ │ view.ts: buildView / ownSquares │──▶│ │ (Ollama chat agent)│ │ │
│ │ translator.ts: announcements │ │ │ - persistent │ │ │
│ │ state.ts: in-memory game Map │ │ │ chat history │ │ │
│ └─────────────────────────────────┘ │ │ - private memory │ │ │
│ │ └────────────────────┘ │ │
│ └──────────┬───────────────┘ │
│ │ HTTP │
└────────────────────────────────────────────────────┼─────────────────┘
│
┌────────────────────────────────────────┴──────┐
│ Ollama endpoint priority list │
│ │
│ 1. http://192.168.0.141:11434 (steel141) │
│ RTX 3090 Ti, gemma4:26b, ~134 tok/s │
│ 2. http://192.168.0.179:11434 (pve197 CT 105)│
│ Tesla V100 32GB, gemma4:26b, ~80 tok/s est.│
└───────────────────────────────────────────────┘
Key principles:
- Bots are virtual in-process players. A BotDriver is created per AI game and attached to the bot's color. The driver computes legal candidates from the bot's view and dispatches its actions through the same commit handler humans use.
- Bots use the same view filter as humans. BotDriver calls buildView(game, botColor) and feeds the filtered board to the Brain. No oracle access; the Recon bot is honestly playing blind chess.
- The Brain is a swappable strategy. CasualBrain and ReconBrain implement the same interface; the driver doesn't know which one it has.
- Recon bot is a stateful chat agent, not a stateless mover. Each turn appends to a persistent chat history (system + alternating user/assistant). The bot's reasoning persists across turns as its private memory.
- The bot has no PlayerToken, no WS connection, and no grace-period treatment. Its "session" is the lifetime of the BotDriver. The server emits peer-status: { color: <botColor>, connected: true } for the bot's slot at all times until the game ends; no grace timer applies to the bot's color.
Components
All new code lives under packages/server/src/bot/. Five modules.
Brain interface (shared contract)
interface Brain {
init(args: { color: Color; mode: Mode; gameId: GameId }): Promise<void>;
decide(input: BrainInput): Promise<BrainAction>;
dispose?(): Promise<void>;
}
interface BrainInput {
view: BoardView; // own pieces only in blind mode
newAnnouncements: Announcement[]; // moderator events since last decide
legalCandidates: CandidateMove[]; // pre-computed by driver
attemptHistory?: { move: CandidateMove; rejection: ModeratorText }[];
}
type BrainAction =
| { type: 'commit'; from: Square; to: Square; promotion?: PromotionType }
| { type: 'resign' }
| { type: 'offer-draw' }
| { type: 'respond-draw'; accept: boolean };
Why this shape: the driver pre-computes legal candidates so the brain doesn't have to know chess.js. This makes both brains trivially mockable in tests, and the candidate set is computed identically to what the human-side highlighter shows.
BotDriver (per-game orchestration)
Owns one Brain. Subscribes to game state-change events. Per-driver mutex enforces one in-flight decision. Bounded retry (5) on FSM rejections. Pseudocode:
on game state change:
if game.status === 'finished': dispose brain; remove driver
if game.toMove !== bot.color: do nothing
if alreadyDeciding: do nothing (mutex)
else:
input = buildBrainInput()
action = await brain.decide(input)
dispatch(action) through normal handlers
if rejection (wont_help / illegal_move): append to attemptHistory; decide again
cap retries at 5; on cap, resign as the bot
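The pseudocode above can be sketched as a small class. This is a minimal illustration, not the project's driver: MiniBrain, MiniDriver, and the commit callback (which returns null when the FSM accepts a move) are hypothetical simplifications of the Brain interface and the shared commit handler.

```typescript
// Simplified stand-ins for the spec's types (hypothetical, for illustration only).
type Move = { from: string; to: string };
type Action = { type: 'commit'; move: Move } | { type: 'resign' };
type Rejection = 'wont_help' | 'illegal_move';
type Attempt = { move: Move; rejection: Rejection };

interface MiniBrain {
  decide(attemptHistory: Attempt[]): Promise<Action>;
}

const RETRY_CAP = 5;

class MiniDriver {
  private deciding = false; // per-driver mutex: at most one in-flight decision
  private readonly brain: MiniBrain;
  // Stand-in for the shared commit handler: null means the FSM accepted the move.
  private readonly commit: (move: Move) => Rejection | null;

  constructor(brain: MiniBrain, commit: (move: Move) => Rejection | null) {
    this.brain = brain;
    this.commit = commit;
  }

  // Resolves to the action that ended the turn, or null if a decision was already in flight.
  async onBotTurn(): Promise<Action | null> {
    if (this.deciding) return null;
    this.deciding = true;
    try {
      const attemptHistory: Attempt[] = [];
      while (attemptHistory.length < RETRY_CAP) {
        const action = await this.brain.decide(attemptHistory);
        if (action.type === 'resign') return action;
        const rejection = this.commit(action.move);
        if (rejection === null) return action; // accepted: the turn ends
        attemptHistory.push({ move: action.move, rejection });
      }
      return { type: 'resign' }; // retry cap hit: resign as the bot
    } finally {
      this.deciding = false;
    }
  }
}
```

Because the mutex flag is set before the first await, a second trigger that arrives while the brain is still thinking returns immediately without calling decide() again.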
CasualBrain (Phase 1, ~200 LoC)
Pure TypeScript, no I/O, deterministic when seeded.
Scoring per candidate move:
- +50 if destination is geometrically reachable but not own-occupied (likely-capture proxy in blind mode).
- +30 if first 8 moves and the move develops a knight or bishop.
- +25 if the move is a pawn move toward the center (e/d files preferred).
- +15 if the move advances rank toward opponent.
- -40 if the move would leave a queen, rook, or minor piece on its starting square while another piece could have been developed (anti-shuffling penalty).
- Tiny seedable random tiebreak.
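A minimal sketch of how these weights might combine, assuming each heuristic has already been reduced to a boolean flag per candidate. The ScoredCandidate shape, the flag names, and the pickMove helper are hypothetical; deriving the flags from BoardView is left out. The tiebreak uses a small mulberry32-style seedable generator scaled below one point, so it can only reorder candidates with equal scores.

```typescript
// Hypothetical per-candidate feature flags; the real brain would derive these from BoardView.
interface ScoredCandidate {
  from: string;
  to: string;
  likelyCapture: boolean;       // +50
  developsMinor: boolean;       // +30, counted only within the first 8 moves
  centralPawnPush: boolean;     // +25
  advancesRank: boolean;        // +15
  shufflesUndeveloped: boolean; // -40
}

// Small seedable PRNG (mulberry32-style); returns values in [0, 1).
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function score(c: ScoredCandidate, moveNumber: number): number {
  let s = 0;
  if (c.likelyCapture) s += 50;
  if (c.developsMinor && moveNumber <= 8) s += 30;
  if (c.centralPawnPush) s += 25;
  if (c.advancesRank) s += 15;
  if (c.shufflesUndeveloped) s -= 40;
  return s;
}

// Picks the best-scoring candidate that has not already been rejected this turn.
function pickMove(
  candidates: ScoredCandidate[],
  moveNumber: number,
  rejected: Set<string>, // "from-to" keys taken from attemptHistory
  rand: () => number,
): ScoredCandidate {
  let best: ScoredCandidate | null = null;
  let bestScore = -Infinity;
  for (const c of candidates) {
    if (rejected.has(`${c.from}-${c.to}`)) continue;
    const s = score(c, moveNumber) + rand() * 0.01; // tiebreak stays below 1 point
    if (s > bestScore) {
      best = c;
      bestScore = s;
    }
  }
  if (best === null) throw new Error('no candidates left');
  return best;
}
```

Passing the same seed reproduces the same choices, which is what makes the brain deterministic in tests.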
Behavior:
- Picks highest-scored candidate; on attemptHistory rejection, drops the top N choices and retries.
- Promotion: defaults to queen.
- Draw offer auto-response: accept at material parity, decline at material lead (computed from own view only; biased and weak by design).
- Casual never resigns voluntarily.
- Vanilla mode: same scoring, but candidates come from chess.js .moves({verbose: true}) (which excludes self-check) instead of geometricMoves().
ReconBrain (Phase 2)
Wraps an OllamaClient interface (testable) + a per-game chat history (in-memory only).
State:
class ReconBrain {
  private color!: Color;
  private mode!: Mode;
  // Full conversation so far; doubles as the bot's private memory.
  private chat: { role: 'system' | 'user' | 'assistant'; content: string }[] = [];
  // Chosen by preflight; swapped once on failover.
  private endpoint!: OllamaEndpoint;
  private failedOver = false;
  private moveCount = 0;
}
init(): push one system message that establishes identity, what the bot can see, the moderator vocabulary, the output schema, and that its reasoning is private and persistent.
decide(input): push one user message describing new view + announcements + legal candidates, call /api/chat with the full history, parse the assistant reply, append the assistant message to history, return the action.
Ollama call config (per ~/bin/gemma4-research/SYNTHESIS.md "Mandatory Ollama Settings · multi-turn tool-calling agents"):
- model: 'gemma4:26b'
- options: { num_ctx: 32768, num_predict: 1024, temperature: 0.4 }
- keep_alive: "30m"
- Do not set think: false (silently breaks 26B in multi-turn loops; documented in gemma4-research/GOTCHAS.md § "think: false Kills Gemma 4 26B in Multi-Turn Tool-Calling Loops").
- Do not use format: "json" (infinite loops on nested schemas; documented in gemma4-research/SYNTHESIS.md § "Anti-Patterns"). Extract {...} from response client-side via regex per the SYNTHESIS guidance.
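The client-side extraction step could look roughly like the sketch below. It assumes the reply follows the output schema from the system prompt (reasoning, move as from-to, promotion); the ParsedReply type and parseGemmaReply name are hypothetical. The regex is greedy, so it spans from the first "{" to the last "}" in the reply, and any failure returns null so the caller can take the malformed-JSON retry path.

```typescript
const PROMOTIONS = ['q', 'r', 'b', 'n'] as const;
type Promotion = (typeof PROMOTIONS)[number];

// Hypothetical parsed form of the reply described in the output schema.
interface ParsedReply {
  reasoning: string;
  from: string;
  to: string;
  promotion: Promotion | null;
}

function parseGemmaReply(raw: string): ParsedReply | null {
  const match = raw.match(/\{[\s\S]*\}/); // greedy: first "{" through last "}"
  if (match === null) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(match[0]);
  } catch {
    return null; // not valid JSON
  }
  if (typeof parsed !== 'object' || parsed === null) return null;
  const rec = parsed as Record<string, unknown>;

  const move = rec['move'];
  if (typeof move !== 'string') return null;
  const squares = /^([a-h][1-8])-([a-h][1-8])$/.exec(move);
  if (squares === null) return null;
  const from = squares[1];
  const to = squares[2];
  if (from === undefined || to === undefined) return null;

  const reasoning = rec['reasoning'];
  const rawPromotion = rec['promotion'];
  const promotion = PROMOTIONS.find((p) => p === rawPromotion) ?? null;

  return {
    reasoning: typeof reasoning === 'string' ? reasoning : '',
    from,
    to,
    promotion,
  };
}
```

A syntactically valid reply still has to be checked against legalCandidates by the driver; this function only validates shape.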
System prompt skeleton (final wording deferred to implementation):
You are a chess agent playing BLIND CHESS as <COLOR>.
You see only your own pieces. The moderator announces moves with a fixed vocabulary.
## Your task each turn
1. Read the new announcements and your current view.
2. Update your beliefs about where opponent pieces likely are. Show this reasoning explicitly — your reasoning persists across turns and is your private memory.
3. Pick exactly one move from the legal candidates I provide.
## Output schema
Reply with JSON only, on its own line, no prose wrapper:
{"reasoning": "<your analysis>", "move": "<from>-<to>", "promotion": "q"|"r"|"b"|"n"|null}
## Vocabulary you'll see
[full enumeration of ModeratorText]
## Important
- Your reasoning is hidden from the human player. Be honest and detailed.
- Build up belief over turns. Reference your prior notes.
- If your move is rejected (you'll see "wont_help" or "illegal_move"), I'll show you the rejection and ask again. Don't repeat the rejected move.
Per-move user message skeleton:
Turn <N>. <COLOR> to move.
Announcements since your last turn:
- <list of ModeratorText entries with any payload>
Your view (own pieces, blind mode):
<list: piece, square>
Legal candidate moves:
<list: from-to, optionally with promotion>
Reply with reasoning + chosen move (JSON).
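A minimal sketch of rendering this skeleton from structured input. The TurnInput shape, the "(none)" placeholder for an empty announcement list, and the "=q" promotion suffix are assumptions made for illustration; the final wording is deferred to implementation, as with the system prompt.

```typescript
// Hypothetical input shape; the real driver would build this from BrainInput.
interface TurnInput {
  turn: number;
  color: 'white' | 'black';
  announcements: string[];
  ownPieces: { piece: string; square: string }[];
  candidates: { from: string; to: string; promotion?: string }[];
}

function renderTurnMessage(input: TurnInput): string {
  const announcements =
    input.announcements.length > 0
      ? input.announcements.map((a) => `- ${a}`)
      : ['- (none)']; // assumed placeholder when nothing was announced

  const lines = [
    `Turn ${input.turn}. ${input.color.toUpperCase()} to move.`,
    'Announcements since your last turn:',
    ...announcements,
    'Your view (own pieces, blind mode):',
    ...input.ownPieces.map((p) => `- ${p.piece} ${p.square}`),
    'Legal candidate moves:',
    ...input.candidates.map((c) => `- ${c.from}-${c.to}${c.promotion ? `=${c.promotion}` : ''}`),
    'Reply with reasoning + chosen move (JSON).',
  ];
  return lines.join('\n');
}
```

Keeping the renderer a pure function of its input makes the exact prompt text easy to snapshot-test and to diff between prompt iterations.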
Bot registry
Lives in state.ts. Map<gameId, BotDriver>. Created on AI game creation, removed when the game ends. Lifetime is bound to the game; restart drops both, consistent with current MVP behavior.
Touches in existing code
| File | Change |
|---|---|
| packages/shared/src/protocol.ts | Extend CreateGameRequest with vsAi?: { brain: 'casual' \| 'recon' }. Add 'ai_unavailable' to EndReason. Add optional aiInfo to joined and update server messages. |
| packages/server/src/state.ts | Add Game.aiOpponent?: { brain; color } (informational). Add bot registry. Add Game.aiThoughtsLog?: ChatTurn[] populated at game end for the post-game reveal. |
| packages/server/src/server.ts | POST /api/games handles vsAi, runs preflight, creates BotDriver. |
| packages/server/src/ws.ts | State-change observer triggers attached BotDriver. No special-case handling inside ws.ts itself. |
| packages/client/ | Two-section landing layout. AI badge under opponent slot. "AI is thinking..." indicator. Post-game thoughts reveal (Recon only). |
Notably NOT changed: view.ts, commit.ts, translator.ts, geometric.ts, Announcement type, ModeratorText enum. Bots flow through them identically to humans.
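The protocol additions might be typed roughly as follows. Only vsAi, 'ai_unavailable', 'abandoned', and the aiInfo fields (model, gpu, host) come from this spec; the other EndReason members and the exact types of mode and side are placeholders, since the existing protocol.ts is not reproduced here. The needsPreflight helper is hypothetical.

```typescript
type BrainKind = 'casual' | 'recon';

// Existing request fields are shown with assumed types.
interface CreateGameRequest {
  mode: 'vanilla' | 'blind';
  side: 'white' | 'black';
  highlightingEnabled: boolean;
  vsAi?: { brain: BrainKind }; // new: absent for human-vs-human games
}

// 'checkmate' and 'resign' are placeholder members; 'ai_unavailable' is the addition.
type EndReason = 'checkmate' | 'resign' | 'abandoned' | 'ai_unavailable';

// Sent on 'joined' and 'update' for Recon games only.
interface AiInfo {
  model: string; // e.g. 'gemma4:26b'
  gpu: string;   // e.g. 'RTX 3090 Ti'
  host: string;  // e.g. 'steel141'
}

// Only Recon games need the Ollama preflight before the game is created.
function needsPreflight(req: CreateGameRequest): boolean {
  return req.vsAi !== undefined && req.vsAi.brain === 'recon';
}
```

Making vsAi optional keeps existing human-vs-human clients valid without any change on their side.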
Data flow
Game creation (vs Casual)
User clicks "Casual bot" on landing
→ POST /api/games body: { mode, side, highlightingEnabled, vsAi: { brain: 'casual' } }
→ server: create Game, fill creator slot with new PlayerToken
→ server: create BotDriver{CasualBrain}, attach to Game, fill opposite slot
→ server: subscribe driver to game-state-change events
→ respond 201: { gameId, creatorToken, joinUrl: null } // no shareable link
→ client navigates to /#/g/<id> and opens WS /ws?game=<id>
→ server: on hello, sends 'joined' with view (no aiInfo for Casual)
→ if user is white, user moves first; else CasualBrain.decide() fires immediately
Game creation (vs Recon)
Same as above except:
- Server synchronously preflights the GPU endpoint list before responding to POST /api/games:
  - GET http://192.168.0.141:11434/api/tags with 1.5s timeout. 200 OK + gemma4:26b listed → primary selected.
  - else GET http://192.168.0.179:11434/api/tags with 1.5s timeout. 200 OK + gemma4:26b listed → fallback selected, log warning.
  - else respond HTTP 503 { error: 'ai_offline' }.
- Adds ~50–200ms to the create call when steel141 is reachable.
- If primary chosen, server fires a non-blocking warmup HTTP call (/api/chat with a minimal prompt, keep_alive: "30m") so the model is in VRAM by the bot's first move.
- BotDriver{ReconBrain} is attached; ReconBrain's chat history seeded with the system prompt.
- Server response includes aiInfo: { model, gpu, host } so the client renders the badge.
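The preflight order can be sketched with the HTTP call injected as a function, which also matches how the tests stub it. The Probe type, ProbeResult shape, and preflight name are hypothetical; a real probe would issue the GET against /api/tags with the timeout and collect the listed model names.

```typescript
interface Endpoint {
  url: string;
  gpu: string;
}

// ok: the tags request returned 200 within the timeout; models: the names it listed.
interface ProbeResult {
  ok: boolean;
  models: string[];
}

// Injected so tests can stub the network; rejects on connection errors or timeout.
type Probe = (endpoint: Endpoint, timeoutMs: number) => Promise<ProbeResult>;

type Preflight =
  | { status: 'selected'; endpoint: Endpoint; usedFallback: boolean }
  | { status: 'ai_offline' }; // caller responds HTTP 503 { error: 'ai_offline' }

const MODEL = 'gemma4:26b';
const PREFLIGHT_TIMEOUT_MS = 1500;

// Tries endpoints in priority order and stops at the first one that serves the model.
async function preflight(endpoints: Endpoint[], probe: Probe): Promise<Preflight> {
  for (let i = 0; i < endpoints.length; i++) {
    const endpoint = endpoints[i];
    if (endpoint === undefined) continue;
    let result: ProbeResult;
    try {
      result = await probe(endpoint, PREFLIGHT_TIMEOUT_MS);
    } catch {
      continue; // unreachable or timed out: try the next endpoint
    }
    if (result.ok && result.models.some((m) => m === MODEL)) {
      return { status: 'selected', endpoint, usedFallback: i > 0 };
    }
  }
  return { status: 'ai_offline' };
}
```

Probing sequentially rather than in parallel keeps the common case (primary reachable) to a single request, which is what the ~50–200ms estimate assumes.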
The bot's turn
[trigger]: game state transitions to "bot's turn" (after human commit, OR at game start if bot is white)
driver:
if alreadyDeciding for this game: ignore (mutex)
else mark "deciding" = true:
1. compute BrainInput:
- view = buildView(game, bot.color)
- newAnnouncements = announcements added since last decide call
- legalCandidates =
if mode === 'vanilla': chess.js .moves({verbose: true}) for bot.color
else (blind): geometricMoves(piece, sq, ownSquares) over own pieces,
plus promotion-required moves
- attemptHistory = []
2. action = await brain.decide(input)
3. dispatch action through normal commit handler:
- 'commit': call commit handler (same one ws.ts uses)
- if FSM rejects with wont_help/illegal_move:
- append to attemptHistory; goto step 2 with updated input; max 5 retries
- on retry-cap-hit: dispatch {type: 'resign'} (loud log)
- if FSM accepts: turn ends, observers fire (including this driver if game continues)
- 'resign' / 'offer-draw' / 'respond-draw': pass-through
4. mark "deciding" = false
5. log brain reasoning to journald (Recon only)
Opponent (human) move arrives at the bot
human commits a move → ws.ts dispatches → FSM accepts → translator emits Announcement[]
→ game state-change event fires
→ driver observes
→ driver checks "is it now bot's turn?": yes → next decide() call; no → idle
The driver does NOT need a separate signal for "opponent moved." The state-change observer covers it.
Game end
state.status transitions to 'finished'
→ driver observes
→ driver copies ReconBrain.chat → Game.aiThoughtsLog (Recon only; Casual has no thoughts to copy)
→ driver disposes brain (close any in-flight HTTP for ReconBrain via AbortController)
→ driver removes itself from registry
→ janitor (existing) prunes the game after 30min idle, same as humans
→ reveal: client renders full board for both sides (existing post-game UX)
→ AI-thoughts post-game reveal (Recon only): collapsible "View gemma4's reasoning" section
on the game-over screen, shows chat history as a chronological log of
{ply N, view at that time, announcements heard, reasoning, move played}
Mid-game GPU failover
ReconBrain.decide() → HTTP call to current endpoint
→ connection error / 5xx / 30s timeout
→ driver:
1. log: "<endpoint> failed mid-game, attempting failover to <other>"
2. preflight the other endpoint (1.5s timeout)
- 200 OK → switch ReconBrain.endpoint to fallback; mark failedOver = true
- else → bot resigns with endReason 'ai_unavailable'
3. retry the SAME decide() call against the new endpoint
- same chat history, same user message, no replay
4. on success: emit a UI-system message (NOT a moderator Announcement):
"AI moved to V100 (steel141 unreachable). Moves may take longer." + update aiInfo badge
5. on failure: bot resigns
Failover triggers (HTTP-layer only):
- Connection refused / DNS fail
- 5xx status
- Per-move timeout (30s normal, 90s first-move)
Does not trigger failover:
- Malformed JSON in Gemma's response → existing temp-bump-retry path
- Move-not-in-candidates → existing "pick from the list" retry
- wont_help from the FSM → existing retry-with-attemptHistory path
One-way only: once failed over to V100, stays there for the rest of the game. No flap-back.
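The split between HTTP-layer failures and model-layer failures can be written as a single classification function. This is a sketch: the DecideFailure and Recovery names are hypothetical, and resigning when the fallback itself fails is one reading of the one-way rule above (there is no flap-back to try).

```typescript
// Hypothetical failure taxonomy for one decide() attempt.
type DecideFailure =
  | { kind: 'connection' }             // refused or DNS failure
  | { kind: 'http_5xx' }
  | { kind: 'timeout' }                // per-move timeout elapsed
  | { kind: 'malformed_json' }
  | { kind: 'move_not_in_candidates' }
  | { kind: 'fsm_rejection' };         // wont_help / illegal_move

type Recovery =
  | 'failover'
  | 'resign'
  | 'temp_bump_retry'
  | 'repick_from_list'
  | 'attempt_history_retry';

// 90s for the first move (cold start), 30s afterwards.
function perMoveTimeoutMs(moveCount: number): number {
  return moveCount === 0 ? 90000 : 30000;
}

function recoveryFor(failure: DecideFailure, alreadyFailedOver: boolean): Recovery {
  switch (failure.kind) {
    case 'connection':
    case 'http_5xx':
    case 'timeout':
      // HTTP-layer failure: fail over once; a second one ends the game.
      return alreadyFailedOver ? 'resign' : 'failover';
    case 'malformed_json':
      return 'temp_bump_retry';
    case 'move_not_in_candidates':
      return 'repick_from_list';
    case 'fsm_rejection':
      return 'attempt_history_retry';
  }
}
```

Keeping the model-layer cases out of the failover branch avoids switching GPUs over what is really a prompt or parsing problem.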
Disconnect / reconnect (human side, AI game)
human WS drops → existing 5-minute grace timer starts
→ BotDriver: if it's the bot's turn, the in-flight Ollama call (if any) is allowed to
complete and the move is committed; the result is visible when the human reconnects.
→ grace expires → existing path: game ends with endReason 'abandoned', AI wins.
→ human reconnects within grace → existing path: 'joined' message with full state.
Error handling
| Failure | Detection | Response |
|---|---|---|
| Both Ollama endpoints down at game-creation time | Preflight 1.5s timeout × 2 | HTTP 503 { error: 'ai_offline' }. Game never created. Client shows "AI is offline right now, try again later." |
| Primary down at game-creation, fallback up | Preflight | Game created on V100. aiInfo reflects V100 from the start. Game-start UI message: "steel141 unreachable; playing on pve197 V100." |
| Primary dies mid-game, fallback up | First failed decide() | Section "Mid-game GPU failover" above. |
| Both endpoints die mid-game | Failover attempt also fails | Bot resigns with endReason: 'ai_unavailable'. Moderator panel: "AI service became unavailable. Game ended." Human "wins" but post-game labels it "AI unavailable" rather than crowing. |
| Gemma returns malformed JSON | Client-side regex \{[\s\S]*\} fails OR JSON.parse throws | Retry once with temperature += 0.1. Second failure: fall back to CasualBrain.decide() for this turn only; chat history doesn't get the failed turn appended. Loud log. |
| Gemma proposes a move not in legalCandidates | Driver compares Gemma's {from, to} against the candidate list | Append corrective user message: "That move wasn't a candidate. Pick from this list: <re-paste>." Retry once. Second failure: same Casual fallback. |
| Gemma's move is in candidates but FSM rejects with wont_help | Standard FSM path | Append rejection to attemptHistory, append corrective user message to chat history, decide again. Bounded by driver retry cap (5). On cap-hit: bot resigns. |
| Driver retry cap (5) hit | Internal counter | Bot resigns. Moderator panel: "AI ran out of valid moves to consider." Human gets the win. |
| Per-move timeout (30s normal / 90s first-move) | AbortController on the HTTP call | Treat as endpoint failure → failover path. |
| Bot tries to commit on a finished game | Driver state-change observer race | Discard the action. The mutex + observer should prevent this in practice, but the dispatch handler returns an error which the driver swallows. |
| Two simultaneous decide() invocations on the same driver | Per-driver mutex | Second invocation is a no-op. |
| BotDriver instance leaks | Janitor sweep | Janitor double-checks: any driver whose game is finished or absent → dispose. |
| Server restart with active AI games | All in-memory state lost (existing behavior) | Same as human-vs-human: games disappear. Acceptable for MVP. |
| Casual / Recon encounters position with zero legal candidates | legalCandidates.length === 0 | Means stalemate or checkmate: the game already ended via the FSM, so the driver shouldn't have been invoked. If it happens anyway: log loudly, bot resigns. |
The "fall back to Casual for this turn only" pattern protects gameplay continuity at the cost of consistency in the bot's reasoning history. The chat history never gets the failed turn appended; Gemma never sees that Casual played on its behalf — the next user message just says "your turn N+1" as if Gemma had played the actual move. Pragmatic compromise that keeps games playable instead of crashing them on flaky LLM output.
Testing
Five testing layers. Existing harness is 43 tests; this feature adds ≈30–40 new tests.
Unit tests — CasualBrain (≈10 tests)
- Single candidate → picks it.
- Multi-candidate scoring: with deterministic seed, capture-bias produces expected ranking.
- 8th-move-ish development heuristic activates correctly.
- attemptHistory causes the previously-rejected move to be excluded.
- Promotion defaults to queen.
- Draw-offer auto-response: parity → accept; lead → decline.
- Zero-candidate input → throws.
- Vanilla candidates ≠ blind candidates.
Unit tests — ReconBrain with mocked Ollama (≈12 tests)
ReconBrain is wired to an OllamaClient interface; tests inject a stub. No real network in tests.
- Sends correct payload shape.
- Chat history seeded with system message on init().
- Successful response: assistant message appended; BrainAction matches parsed JSON.
- Multi-turn: 3 decide() calls produce 1 system + 6 alternating user/assistant.
- Malformed JSON: stub returns garbage → second call with temperature bumped +0.1 → succeeds → only second turn appended to history.
- Both retries malformed → throws ReconLLMUnavailable.
- Move not in legalCandidates: corrective user message appended; second call returns valid candidate.
- Endpoint failover: first call rejects with EndpointUnreachable; brain switches endpoint; assert payload to second endpoint matches.
- One-way failover: after first failover, brain stays on V100.
- dispose() cancels in-flight HTTP via AbortController.
- System prompt isolation: fresh ReconBrain doesn't share state with disposed one.
- White vs Black: system prompt parametrizes color correctly.
Unit tests — BotDriver (≈8 tests)
Driver is wired to a Brain interface; tests use a StubBrain with scriptable responses. Game state is real.
- Mutex: two simultaneous "your turn" triggers → one decide() call.
- State-change observer fires decide() only when game.toMove === bot.color.
- Game finished → driver disposes brain and unsubscribes.
- Retry cap (5): stub brain returns wont_help-inducing move 5 times; driver dispatches resign on attempt 6.
- Casual fallback for malformed Recon turn: stub brain throws ReconLLMUnavailable; driver invokes a CasualBrain for that turn only; Recon's chat history not modified.
- AI-unavailable end-state propagates endReason: 'ai_unavailable'.
- Disconnect during the bot's turn: in-flight decide() allowed to complete; move committed; reconnecting human sees the move.
- Janitor disposes orphaned drivers.
Real WS integration tests (≈6 tests)
Same pattern as the existing 4 WS integration tests — ephemeral port, real Fastify, real ws client. Bot driver uses a StubBrain (no Ollama).
- Create AI game (Casual): server returns aiInfo: undefined; joined received; bot plays first move if white.
- Create AI game (Recon stubbed): joined includes aiInfo; preflight passes (stubbed); bot plays first move via stub brain.
- Full Casual vs scripted-human game: 8 moves played, integration end-to-end; capture announced correctly.
- Recon failover surfaces in a server message: kill primary endpoint stub mid-game, observe aiInfo update.
- Per-move timeout: stub brain's decide() hangs > 30s; driver triggers failover or resign; client observes the right end state.
- AI-unavailable preflight: both endpoint stubs return errors; POST /api/games returns 503; client renders error.
Self-play harness — scripts/selfplay.ts (operator tool, NOT in CI)
pnpm selfplay --white casual --black casual --games 100
pnpm selfplay --white recon --black casual --games 50
pnpm selfplay --white recon --black recon --games 10
Reports: win/loss/draw breakdown, average moves per game, average per-move latency (Recon), reasoning-log archive (Recon, written to tmp/selfplay-runs/<timestamp>/).
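The report numbers could be computed with a small pure function over per-game results. The GameResult shape and the summarize name are assumptions about what the harness records; median length is included because the Phase 1 acceptance criteria refer to it.

```typescript
// Hypothetical per-game record written by the harness.
interface GameResult {
  winner: 'white' | 'black' | 'draw';
  moves: number;
}

interface Summary {
  games: number;
  whiteWins: number;
  blackWins: number;
  draws: number;
  whiteWinRate: number; // whiteWins / games, 0 when there are no games
  averageMoves: number;
  medianMoves: number;
}

function summarize(results: GameResult[]): Summary {
  const games = results.length;
  const whiteWins = results.filter((r) => r.winner === 'white').length;
  const blackWins = results.filter((r) => r.winner === 'black').length;
  const draws = games - whiteWins - blackWins;

  const totalMoves = results.reduce((sum, r) => sum + r.moves, 0);
  const sorted = results.map((r) => r.moves).sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Even count: mean of the two middle values; odd count: the middle value.
  const medianMoves =
    sorted.length === 0
      ? 0
      : sorted.length % 2 === 0
        ? ((sorted[mid - 1] ?? 0) + (sorted[mid] ?? 0)) / 2
        : sorted[mid] ?? 0;

  return {
    games,
    whiteWins,
    blackWins,
    draws,
    whiteWinRate: games === 0 ? 0 : whiteWins / games,
    averageMoves: games === 0 ? 0 : totalMoves / games,
    medianMoves,
  };
}
```

For the Recon-vs-Casual bar, the harness would run both color assignments and combine Recon's wins from each run before comparing against 60%.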
Live smoke checklist (manual, post-deploy)
- Create Casual game, play to completion. Both modes (vanilla + blind).
- Create Recon game on warm steel141. Play 10 moves. Latency ≤4s/move. Inspect journald.
- Create Recon game on cold steel141 (after 31min idle). First move 30–60s; UI shows "AI is starting up..." Subsequent moves fast.
- Force failover: stop steel141 ollama mid-game; bot fails over to V100; UI badge updates.
- Force ai-unavailable: stop both Ollama services; new Recon game returns 503 with friendly error.
- Post-game thoughts reveal: collapsible section appears for Recon, absent for Casual; reasoning matches the moves played.
UX
Landing page
Two visually distinct sections, sharing the same control vocabulary (mode/side/highlight):
┌─────────────────────────────────────┐
│ Play with a friend │
│ ───────────────────────── │
│ [mode] [side] [highlight] │
│ ( Create Game ) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Play vs Computer │
│ ───────────────────────── │
│ [mode] [side] [highlight] │
│ ( Casual bot ) ( gemma4 recon ) │
└─────────────────────────────────────┘
Tooltip / hover help:
- Casual bot — "Fast, plays simple moves, makes mistakes. Good for a quick game."
- gemma4 recon — "Gemma 4 large language model. Reasons about hidden information across turns. Slower; first move may take up to a minute. Plays better in blind mode than Casual."
Opponent slot — in-game badge
| State | Badge text |
|---|---|
| Casual game | "Casual bot" |
| Recon game (3090) | "gemma4:26b · RTX 3090 Ti" |
| Recon game (V100, primary) | "gemma4:26b · Tesla V100" |
| Recon game (failover) | "gemma4:26b · V100 (failed over)" (badge color shifts amber) |
| Mobile narrow | Truncates to "gemma4 · 3090 Ti" etc. |
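The badge table can be expressed as one function of the game's AI state. This is a sketch under assumptions: the badgeText name and its parameters are hypothetical, the short GPU label is derived by dropping a leading "RTX" or "Tesla", and the narrow variant omits the failover suffix because the table does not say how the two combine.

```typescript
interface AiInfo {
  model: string; // e.g. 'gemma4:26b'
  gpu: string;   // e.g. 'RTX 3090 Ti' or 'Tesla V100'
  host: string;
}

function badgeText(
  brain: 'casual' | 'recon',
  aiInfo: AiInfo | undefined,
  failedOver: boolean,
  narrow: boolean,
): string {
  if (brain === 'casual' || aiInfo === undefined) return 'Casual bot';

  // 'RTX 3090 Ti' -> '3090 Ti', 'Tesla V100' -> 'V100'
  const shortGpu = aiInfo.gpu.replace(/^(RTX|Tesla)\s+/, '');
  // 'gemma4:26b' -> 'gemma4'
  const shortModel = aiInfo.model.split(':')[0] ?? aiInfo.model;

  if (narrow) return `${shortModel} · ${shortGpu}`; // assumed: no suffix on narrow screens
  if (failedOver) return `${aiInfo.model} · ${shortGpu} (failed over)`;
  return `${aiInfo.model} · ${aiInfo.gpu}`;
}
```

The amber color shift on failover is a styling concern and would be driven by the same failedOver flag rather than by the text.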
"AI is thinking" indicator
When it's the bot's turn:
| Bot / situation | Indicator |
|---|---|
| Casual | "Casual bot is moving..." (rarely visible) |
| Recon, normal | "gemma4 is thinking..." with animated ellipsis |
| Recon, first move only | "gemma4 is starting up..." (cold-start framing) |
| Recon, just failed over | Moderator-panel-area system message: "AI moved to V100 (steel141 unreachable). Moves may take longer." |
Moderator panel
Vocabulary unchanged. AI reasoning is never rendered here during play. Two new UI-system messages (style-distinct from Announcement entries):
- Game-start (Recon only): "You are playing gemma4:26b on RTX 3090 Ti (steel141)."
- Failover (Recon only): "AI moved to V100 (steel141 unreachable). Moves may take longer."
Post-game thoughts reveal (Recon only)
Below the existing game-over content:
▾ View gemma4's reasoning (32 turns)
┌─────────────────────────────────────┐
│ Turn 1, your move was e2-e4 │
│ gemma4 (Black) thought: │
│ "<reasoning text>" │
│ → played c7-c5 │
├─────────────────────────────────────┤
│ Turn 2, ... │
└─────────────────────────────────────┘
Collapsed by default on mobile. Casual games omit the section.
Resign / draw / disconnect
- Resign button: same UX. Game ends, AI thoughts log is revealed (Recon only).
- Offer draw: human can offer; bot responds via respond-draw. Casual: heuristic auto-response. Recon: passes the offer as a user message; Gemma decides via JSON output schema with accept: bool.
- Bot resigns (retry cap / AI-unavailable): post-game labels the end appropriately rather than crowing about a human win.
- Human disconnect during AI game: existing 5-min grace; bot's in-flight decide() (if any) completes; result visible on reconnect.
Things explicitly NOT in MVP UX
- Live token streaming during Gemma's thinking. Static indicator only.
- Difficulty slider. Two named buttons only.
- Public AI vs AI spectate-able games. Self-play is CLI-only.
- Hint button in human-vs-human games.
- "Watch the AI think" mode.
Acceptance criteria
Phase 1 (Casual) is "done" when:
- 100 Casual-vs-Casual games complete with no crashes.
- Median game length is between 20 and 200 moves.
- Casual reliably beats a "random legal move" baseline (≥80% over 100 games).
- All Phase 1 unit + integration tests pass.
- Live smoke checklist for Casual passes.
- AI-game creation and play work end-to-end on the live URL.
Phase 2 (Recon) is "done" when:
- Recon wins ≥60% over 50 Recon-vs-Casual games, both colors.
- Average per-move latency ≤8s on the 3090 Ti (≤10s on V100), with cold-start excluded.
- Manual inspection of 10 random reasoning logs shows Gemma is using announcements as evidence (not just plausible-sounding text).
- All Phase 2 unit + integration tests pass.
- Live smoke checklist for Recon passes (warm, cold, failover, both-down).
- Post-game reasoning reveal renders correctly on phone and desktop.
Decision triggers if Phase 2 misses bars
- If Recon wins <60% but >40% vs Casual: prompt-engineering rabbit hole. Iterate on system prompt + per-turn message format. Try presenting candidates differently (e.g., with annotations).
- If Recon wins <40%: design signal. Either 26B isn't strong enough (try 31B at 5× latency cost — would also need to revisit per-move timeout caps) OR the candidate-list framing is wrong (consider feeding Gemma a textual board representation instead of just candidate moves).
- If latency is consistently >15s/move: the 32K context approach may be too expensive. Consider context compaction (summarize older turns into a "what I've inferred so far" running summary).
Risks / open questions
-
Recon plays at some level — but how much? This is the central research-y unknown. LLMs play vanilla chess poorly (badly trained on game positions), but the task here is different — Gemma isn't being asked to compute tactical depth, it's being asked to reason about what evidence implies about hidden state, and pick a move from a pre-computed legal list. That's much more LLM-shaped. Still, the 60% Recon-vs-Casual bar is a guess; we'll learn the real number from the self-play harness.
-
Cold-start UX on first move. 30–60s is long. The "AI is starting up..." copy mitigates but doesn't eliminate. If users complain we can: (a) preflight harder (an actual
/api/chatwarmup withkeep_alive: -1), (b) offer Casual as a one-click fallback if the user gets impatient, (c) shrink to gemma4:e4b for first-move-only and switch to 26B for subsequent. None of these are MVP. -
Chat history grows unboundedly. A 100-move game accumulates ~25K tokens. 32K context covers that comfortably, but a longer game (which is rare in casual play, but possible) would overflow. Mitigation: if we hit context overflow in practice, add per-turn compaction — replace the oldest 20 turns with a summary turn. Not MVP unless seen.
-
3090 GPU contention with mort-3090-scheduler. The scheduler is supposed to yield to other GPU users, but verifying this under chess load is unmeasured. Mitigation: monitor steel141 GPU utilization during early Recon games; if mort jobs interfere we'd need explicit coordination (e.g., a held lock).
- Bot proposes moves Gemma can't see consequences of. Casual's bias toward "geometrically reachable but not own-occupied" squares is just a heuristic; many such moves walk into traps. This is intentional for Casual (low strength is the design target), but if Recon makes the same mistakes despite reasoning, the prompt template needs tuning. Self-play exposes this clearly.
- The post-game reasoning reveal could be embarrassing. Gemma might write reasoning that's confidently wrong in a way that makes the AI look dumb. Per `gemma4-research/SYNTHESIS.md`, Gemma is "ultra-compliant and highly capable but doesn't know who it is"; a strong system prompt mitigates the worst, but reasoning logs are essentially uncurated LLM output. Mitigation: sample 10 logs early, then iterate on the prompt to suppress overly confident bad takes.
- Floating-point determinism differs across GPU architectures. Gemma will produce slightly different tokens on V100 vs 3090 Ti. Mitigation: none needed; we're not comparing across calls and only need a reasonable response.
- No mid-game flap-back. If steel141 recovers, we don't switch back from V100. Consequence: a recovered 3090 doesn't help an already-failed-over game. Mitigation: none in MVP. Cost is a slightly slower remainder of the game; acceptable.
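The compaction mitigation mentioned in the chat-history risk could look roughly like the sketch below. It is not part of the MVP. `ChatMessage`, `estimateTokens`, and `compactHistory` are hypothetical names, the 4-characters-per-token estimate is a rough assumption, and `summarize` is a caller-supplied stand-in (in practice it would probably be another model call).

```typescript
// Hypothetical message shape; the role names follow common chat-API conventions.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Rough assumption: about 4 characters per token. A real tokenizer would differ.
function estimateTokens(messages: readonly ChatMessage[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

// Folds the oldest `turnsToFold` turns into a single summary message once the
// history exceeds `tokenBudget`. One turn = one user message + one assistant reply.
function compactHistory(
  history: ChatMessage[],
  summarize: (old: readonly ChatMessage[]) => string,
  tokenBudget: number,
  turnsToFold = 20,
): ChatMessage[] {
  if (estimateTokens(history) <= tokenBudget) return history;
  const hasSystem = history[0]?.role === "system";
  const head = hasSystem ? history.slice(0, 1) : []; // keep the system prompt intact
  const body = hasSystem ? history.slice(1) : history;
  const foldCount = Math.min(turnsToFold * 2, body.length);
  const summary: ChatMessage = {
    role: "user",
    content: `Summary of earlier turns: ${summarize(body.slice(0, foldCount))}`,
  };
  return [...head, summary, ...body.slice(foldCount)];
}
```

A single call folds at most `turnsToFold` turns; if the result is still over budget, the caller can invoke it again. A real budget would likely sit somewhat below the 32K context to leave room for the next reply.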
Out of scope (deferred to post-MVP)
- Difficulty slider / strength selection beyond two named buttons.
- Stockfish integration (vanilla mode strength via real chess engine).
- AI vs AI spectate-able public games.
- Live token streaming during Gemma's thinking.
- Hint button in human-vs-human games.
- Per-turn context compaction for long games.
- Mid-game GPU flap-back to recovered primary.
- Multi-model selection (e.g., "play vs gemma4:31b" or "play vs qwen3-coder-next").
- Persistent reasoning logs across game restart (would require SQLite per the existing deferred row).
- Bot rating / Elo tracking across games.
- Bot personalities / styles ("aggressive recon", "defensive recon").
- In-game chat (player ↔ player or human ↔ Gemma). Considered 2026-04-28; deferred indefinitely. Player chat in blind mode is a side channel that bypasses the moderator-vocabulary security boundary; chat with Gemma leaks the bot's belief state and undermines the post-game reasoning reveal. See `DECISIONS.md` "Deferred / Rejected" for the full rationale.
Appendix A — Module layout
packages/server/src/bot/
├── index.ts # public API: createBotDriver, BotRegistry types
├── driver.ts # BotDriver class
├── brain.ts # Brain interface, BrainInput, BrainAction types
├── casual-brain.ts # CasualBrain class
├── recon-brain.ts # ReconBrain class
├── ollama-client.ts # OllamaClient interface + production HTTP impl
├── ollama-endpoints.ts # endpoint priority list, preflight logic
├── prompt.ts # system prompt template, per-turn user message builder
├── parse.ts # extract JSON {reasoning, move, promotion} from response
└── candidates.ts # legal candidate computation (vanilla vs blind)
packages/server/test/unit/bot/
├── casual-brain.test.ts
├── recon-brain.test.ts
├── driver.test.ts
├── ollama-endpoints.test.ts
└── parse.test.ts
packages/server/test/integration/
├── ai-game-casual.test.ts
├── ai-game-recon-stub.test.ts
└── ai-game-failover.test.ts
scripts/
└── selfplay.ts # operator tool, not in CI
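As an illustration of what `parse.ts` is described as doing (extracting JSON `{reasoning, move, promotion}` from a model response), here is a minimal sketch. `ParsedReply` and `parseReply` are hypothetical names, and the move string is treated as opaque; checking it against the legal candidate list is assumed to happen elsewhere.

```typescript
// Hypothetical result shape, mirroring the {reasoning, move, promotion} fields.
interface ParsedReply {
  reasoning: string;
  move: string;             // opaque here; validation against candidates is the caller's job
  promotion: string | null; // null when the model did not supply one
}

// Returns null when no well-formed reply can be recovered from the raw text.
function parseReply(raw: string): ParsedReply | null {
  // Take the span from the first "{" to the last "}" so prose around the JSON is tolerated.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let value: unknown;
  try {
    value = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null; // malformed JSON: let the caller decide whether to retry
  }
  if (typeof value !== "object" || value === null) return null;

  const obj = value as Record<string, unknown>;
  const reasoning = obj["reasoning"];
  const move = obj["move"];
  const promotion = obj["promotion"];
  if (typeof reasoning !== "string" || typeof move !== "string") return null;
  return { reasoning, move, promotion: typeof promotion === "string" ? promotion : null };
}
```

Returning `null` instead of throwing keeps the retry/fallback decision in the driver, which is where this spec places those paths.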
Appendix B — Gemma 4 prompt cookbook references
- `~/bin/gemma4-research/SYNTHESIS.md`: opinionated guide; multi-turn settings; anti-patterns
- `~/bin/gemma4-research/GOTCHAS.md` § "`think: false` Kills Gemma 4 26B in Multi-Turn Tool-Calling Loops"
- `~/bin/gemma4-research/CORPUS_ollama_variants.md`: model selection, VRAM, defaults
- `~/bin/gemma4-research/docs/reference/gpu-bakeoff-2026-04-20.md`: 3090 Ti vs Strix throughput, MoE vs dense
- `~/bin/gemma4-research/docs/reference/mort-bakeoff-2026-04-18.md`: `<think>` tokens stripped from Ollama 0.20.4 serialized history
Appendix C — Implementation phases
Phase 1 — Casual bot (single-week scope)
- Module scaffold under `packages/server/src/bot/`.
- `Brain` interface, `BrainInput`, `BrainAction` types.
- `CasualBrain` implementation + unit tests.
- `BotDriver` implementation + unit tests (with `StubBrain`).
- `legalCandidates` computation (vanilla + blind paths) + tests.
- Protocol additions: `vsAi` request field; bot registry in `state.ts`.
- `POST /api/games` handles `vsAi.brain === 'casual'`.
- `ws.ts` state-change observer wires up the driver.
- Client landing page two-section layout.
- Client opponent-slot badge + thinking indicator.
- Integration tests: ai-game-casual.test.ts.
- Self-play harness: `scripts/selfplay.ts` Casual-vs-Casual.
- Deploy to CT 690; run live smoke checklist for Casual.
- Update `DECISIONS.md` with the phase outcome.
Phase 2 — Recon bot (multi-week scope)
- `OllamaClient` interface + production HTTP impl with `AbortController`.
- `ollama-endpoints.ts` with preflight + failover logic + tests.
- `prompt.ts` system prompt + per-turn message builder.
- `parse.ts` JSON extraction + unit tests.
- `ReconBrain` implementation + unit tests (mocked Ollama).
- Protocol additions: `aiInfo`, `'ai_unavailable'` end reason, post-game reasoning fields.
- `POST /api/games` handles `vsAi.brain === 'recon'` (preflight + warmup).
- Driver retry/fallback paths; mid-game failover wiring.
- Client GPU badge + system messages + post-game reasoning reveal.
- Integration tests: ai-game-recon-stub.test.ts, ai-game-failover.test.ts.
- Self-play harness: Recon-vs-Casual mode.
- Iterate on prompt template based on self-play results until 60% bar met.
- Deploy to CT 690; run live smoke checklist for Recon (warm, cold, failover, both-down).
- Update `DECISIONS.md` with the phase outcome.
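
The failover behavior Phase 2 has to wire up (a priority list of endpoints, one-way failover, no mid-game flap-back) can be sketched as a small per-game selector. `Endpoint` and `StickyEndpointSelector` are hypothetical names, and the hostnames in the usage example are placeholders, not the real homelab addresses.

```typescript
// Hypothetical endpoint record; baseUrl values are supplied by configuration.
interface Endpoint {
  name: string;
  baseUrl: string;
}

// One instance per game. The index only ever moves forward, so a recovered
// primary is never re-selected for a game that has already failed over.
class StickyEndpointSelector {
  private readonly endpoints: readonly Endpoint[];
  private index = 0;

  constructor(endpoints: readonly Endpoint[]) {
    if (endpoints.length === 0) throw new Error("at least one endpoint required");
    this.endpoints = endpoints;
  }

  // Endpoint currently in use, or null once every endpoint has failed.
  current(): Endpoint | null {
    return this.endpoints[this.index] ?? null;
  }

  // Call when the current endpoint fails; returns the next one in priority order.
  failOver(): Endpoint | null {
    if (this.index < this.endpoints.length) this.index += 1;
    return this.current();
  }
}

// Usage with placeholder hosts (11434 is Ollama's default port):
const selector = new StickyEndpointSelector([
  { name: "steel141", baseUrl: "http://steel141.example:11434" },
  { name: "pve197", baseUrl: "http://pve197.example:11434" },
]);
```

A `null` from `current()` is the point at which a driver would presumably end the game with the `'ai_unavailable'` reason.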