This repo opens with the design-discovery work completed before any product code is written. Two model bakeoffs against gemma4:8b/26b/31b on a local Ollama established that:
- Whole-puzzle generation in the Connections shape is unreliable on Gemma 4 (gemma4:31b ~50% structural-pass, gemma4:26b ~20-30%); 31b is intentionally out of project scope, so the generation route is harder still.
- Atomic semantic-judging skills are reliable: 87.5%/93.75%/100% (8B/26b/31b) on JUDGE; *all three models* scored 10/10 on CREATIVE_ACCEPT (fair judging of player-INVENTED categories). That is the structural unlock vs static hand-curated word games.
The README contains the full writeup, the test bench, and a brainstormed bank of 10 distinct game-mechanics ideas across the fast/medium/slow tempo range, plus a primitives table for recombination.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
seth_semantic_game
Working title. A self-hosted word game built around an LLM's ability to fairly judge player-invented semantic categories in real time — something static, hand-curated word games structurally cannot do.
This repo documents the design discovery process, including two model bakeoffs that picked the architecture and a brainstormed bank of game-mechanics ideas that the actual product will draw from.
TL;DR
- Seed idea: clone NYT Connections (16 words → 4 hidden groups of 4) with a local LLM doing the curation.
- Seed idea died fast: unaided whole-puzzle generation on Gemma 4 ships broken puzzles ~50% of the time (duplicate tiles, mislabeled categories, fake wordplay) — see docs/reference/gemma-generation-bakeoff-2026-04-27-221751.md.
- The actual unlock: Gemma 4 reliably judges whether a player-supplied category fits a player-supplied set of words. Across 35 hand-labeled cases on three model sizes, CREATIVE_ACCEPT scored 10/10 on every model including the 8B variant at 0.7s per call. JUDGE landed at 87.5% / 93.75% / 100% (8B / 26b / 31b). See docs/reference/gemma-semantic-bakeoff-2026-04-27-224800.md.
- The pivot: stop trying to generate Connections. Build games where the player invents the groupings and the LLM is the live, fair judge. That's what the static format can't do.
- Models in scope: gemma4:latest (8B) for live judging, gemma4:26b for offline puzzle prep / critique. gemma4:31b was tested and is more accurate, but is intentionally out of scope for this project.
What we did
Two experiments, both reproducible from scripts/ against a local Ollama (point OLLAMA_HOST at your instance; defaults to http://localhost:11434).
Experiment 1 — Generation bakeoff
Question: can Gemma 4 generate a Connections-quality 16-word / 4-group puzzle in one shot?
Setup: 5 puzzles per model on gemma4:26b and gemma4:31b. Strict JSON schema requesting groups + difficulty bands + claimed overlap-trap words. No format=json (that's a known Gemma 4 + Ollama hang); JSON parsed client-side; up to 3 retries with temperature bumped +0.1 each attempt.
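The retry-and-parse loop described in the setup can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/: the names extract_json, generate_with_retries and call_model are hypothetical, the base temperature of 0.7 is an assumption, and "up to 3 retries" is read here as three attempts in total.

```python
import json
from typing import Callable, Optional


def extract_json(text: str) -> Optional[dict]:
    """Return the first JSON object found in free-form model output, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


def generate_with_retries(
    call_model: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> raw text
    prompt: str,
    base_temperature: float = 0.7,  # assumed starting value
    max_attempts: int = 3,
) -> Optional[dict]:
    """Parse JSON client-side; bump temperature by 0.1 on each new attempt."""
    for attempt in range(max_attempts):
        temperature = base_temperature + 0.1 * attempt
        parsed = extract_json(call_model(prompt, temperature))
        if parsed is not None:
            return parsed
    return None
```

Passing the model call in as a function keeps the loop testable without a running Ollama instance.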
Results:
| Model | Pass | Borderline | Fail | Avg s/puzzle |
|---|---|---|---|---|
| gemma4:26b | 1 | 1 + 1 partial | 2 | 5.2 |
| gemma4:31b-it-q4_K_M | 2 | 2 | 1 | 18.2 |
Failure modes ranked by severity:
- Structural violations: duplicate or near-duplicate words on the 16-tile board. Trivially detectable.
- Broken category logic: words listed in a category they don't actually fit (DELUXE doesn't start with the full Greek letter "DELTA"; LIBRA isn't a "type of scale"). Hard to detect deterministically; needs a critique pass.
- Redundant categories: two groups themed on the same concept. Detectable.
- Self-graded traps don't always hold up: Gemma's claimed intended_traps were sometimes nonsense (PRESS was claimed to fit "Words after BLOOD," but the compound is blood pressure, not blood press). Important consequence: the same model cannot be trusted to grade its own output.
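The first failure mode in the list above is the one a deterministic filter can catch. A minimal sketch, assuming a puzzle arrives as a mapping of category name to words; the function names and the near-duplicate rule (case-insensitive, trailing S dropped) are illustrative assumptions, not the filter used in the bakeoff:

```python
from collections import Counter


def normalize(word: str) -> str:
    """Crude near-duplicate key: uppercase, letters only, trailing S dropped."""
    letters = "".join(ch for ch in word.upper() if ch.isalpha())
    return letters[:-1] if len(letters) > 3 and letters.endswith("S") else letters


def structural_problems(groups: dict[str, list[str]]) -> list[str]:
    """Return human-readable problems; an empty list means the board passes."""
    problems = []
    if len(groups) != 4:
        problems.append(f"expected 4 groups, got {len(groups)}")
    for category, words in groups.items():
        if len(words) != 4:
            problems.append(f"{category!r} has {len(words)} words, expected 4")
    counts = Counter(normalize(w) for words in groups.values() for w in words)
    for key, n in sorted(counts.items()):
        if n > 1:
            problems.append(f"duplicate or near-duplicate tile: {key} x{n}")
    return problems
```

Broken category logic and nonsense traps are out of reach for a check like this, which is why those items call for a separate critique pass.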
This was decisive for the project direction. Unaided generation isn't viable, and we're explicitly capping at 26b, which is the less reliable generator. So we need a different game shape: one that doesn't depend on the LLM generating finished puzzles unaided.
Experiment 2 — Semantic-skill bakeoff
Question: instead of whole-puzzle generation, can Gemma reliably perform the atomic skills a live game would need? Specifically:
- JUDGE — given a category and 4 words, does Gemma correctly say yes/no on whether they all fit?
- CREATE — given a category, does Gemma produce 4 tightly-fitting words?
- CREATIVE_ACCEPT — given 4 words and a player-proposed category, does Gemma fairly judge whether the category validates the grouping (even if it differs from any "intended" category)?
The third one is the design-relevant one. If it works, the game can let players invent their own groupings — which is structurally impossible for a hand-curated static format.
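To make the CREATIVE_ACCEPT shape concrete, here is a hedged sketch of a prompt builder and a verdict parser. The prompt wording and the {"accept": ...} reply schema are assumptions made for illustration; the bakeoff script may phrase both differently. Treating any unparseable reply as a rejection is also an assumption, chosen to err on the strict side.

```python
import json


def build_creative_accept_prompt(words: list[str], category: str) -> str:
    """Hypothetical prompt wording for judging a player-proposed category."""
    return (
        "You are a fair judge for a word game.\n"
        f"Words: {', '.join(words)}\n"
        f"Proposed category: {category}\n"
        "Does EVERY word fit the category? "
        'Reply with JSON only: {"accept": true or false, "reason": "..."}'
    )


def parse_verdict(raw: str) -> bool:
    """Return True only on an explicit accept; anything unparseable is a reject."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        return False
    try:
        verdict = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return False
    return isinstance(verdict, dict) and verdict.get("accept") is True
```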
Setup: 35 hand-labeled cases (16 JUDGE / 10 CREATE / 9 CREATIVE_ACCEPT + 2 deliberately ambiguous) tested across gemma4:latest (8B), gemma4:26b, and gemma4:31b. Each case has explicit ground truth in the test bank.
Results:
| Model | JUDGE | CREATE | CREATIVE_ACCEPT | Avg s/call |
|---|---|---|---|---|
| gemma4:latest (8B) | 14/16 (87.5%) | 8/10 | 10/10 | 0.7 |
| gemma4:26b | 15/16 (93.75%) | 9/10 | 10/10 | 0.8 |
| gemma4:31b-it-q4_K_M | 16/16 (100%) | 9/10 | 10/10 | 2.3 |
Key findings:
- CREATIVE_ACCEPT is decisive across all three models. Each scored 10/10: all five player-creative-but-valid groupings were accepted (e.g. WHIP / NUT / CODE / SMILE → "Things you can crack") and all five invalid ones were rejected (e.g. OAK / MAPLE / BIRCH / PINE → "Furniture brands"). The model gets the distinction.
- 8B is fast enough to use as a live judge. Sub-second on a 24 GB consumer GPU; per-guess economics are effectively free.
- 26b is mildly over-permissive on borderline cases. It accepted KIWI as a tech brand (APPLE / ORANGE / KIWI / BLACKBERRY → "Tech/phone brands"). 8B and 31b were stricter. For a live game, false positives degrade integrity more than false negatives, so 8B's calibration is the right tradeoff for live judging.
- One failure mode is shared by all three models: "homophones-of-body-parts" (8B gave SEA/SEE/HEAR/HERE, none of which sound like body parts; 26b gave EYE, which IS a body part rather than a homophone of one; 31b parse-failed three times running). Avoid this category class or scaffold prompts with worked examples.
What we picked
Model assignments:
| Role | Model | Why |
|---|---|---|
| Live JUDGE (per player guess) | gemma4:latest (8B) | Sub-second, strict-enough calibration, 87.5% accuracy on tight cases |
| Live CREATIVE_ACCEPT | gemma4:latest (8B) | 10/10 in test, sub-second |
| Offline puzzle generation (if used at all) | gemma4:26b with strict filters + retries | 31b is out of scope by user constraint; 26b plus a deterministic post-filter and a critique pass is the workable path |
| Offline critique pass | gemma4:26b grading 8B's work, OR a non-Gemma open-weights judge | A model cannot be trusted to grade itself; the bakeoff confirmed Gemma rubber-stamps its own structural mistakes |
Operational gotchas baked into the scripts (all from upstream Gemma 4 + Ollama issue tracker; documented in the bakeoff scripts):
- No format: "json". The server-side JSON enforcer hangs gemma4:26b Q4 indefinitely; ask for JSON in the prompt and parse client-side.
- Set think: false for single-turn JSON pipelines; otherwise thinking tokens consume the response budget and response comes back empty.
- Override Ollama defaults: num_ctx (default 2048 truncates the prompt), num_predict (default 128 truncates the output).
- For multi-turn tool-calling agents the rule is the opposite: leave think unset on 26b. Not relevant here, but worth knowing.
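Put together, the single-turn gotchas above translate into a request body along these lines. This is a sketch against Ollama's /api/generate endpoint, not a copy of the bakeoff scripts: build_payload and call_ollama are hypothetical names, and the num_ctx and num_predict values are placeholder assumptions to be sized for the actual prompts.

```python
import json
import os
import urllib.request


def build_payload(model: str, prompt: str, temperature: float = 0.2) -> dict:
    """Request body for /api/generate with the single-turn gotchas applied."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "think": False,  # keep thinking tokens from eating the response budget
        # Deliberately no "format" key: JSON is requested in the prompt
        # and parsed client-side.
        "options": {
            "temperature": temperature,
            "num_ctx": 8192,      # assumed value, well above the 2048 default
            "num_predict": 1024,  # assumed value, well above the 128 default
        },
    }


def call_ollama(model: str, prompt: str, timeout: float = 120.0) -> str:
    """POST to the local Ollama and return the raw text of the 'response' field."""
    host = os.environ.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]
```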
Game-mechanics idea bank
The two bakeoffs together say: don't build a game where the LLM is the curator. Build a game where the LLM is the live, fair judge of player creativity. Below are 10 distinct game ideas that take that as the design constraint. None of them is Connections; each one leans on something a static game structurally can't replicate (live category validation, multi-solution puzzles, generative answer pools, semantic chains, etc.).
Each idea lists its tempo (how fast the game feels), the AI calls per turn (so cost can be reasoned about), and the structural novelty (the thing this idea can do that a hand-curated static format cannot).
Fast-paced (≤60-second rounds)
1. Pile — speedrun categorize
- Tempo: real-time, 60-second rounds.
- Mechanic: A pool of ~16 random words. You drag any 3–5 of them into a box and type a category. The LLM (8B) judges in ~0.7s. Accepted → those words clear, refilled from a deck. Rejected → they stay. Score = words categorized per minute.
- AI calls: 1 per submission (CREATIVE_ACCEPT shape: player-supplied category + player-supplied words).
- Structural novelty: the player invents groupings under time pressure; categories aren't pre-known. A static game has a single fixed answer per puzzle; this one has open-ended valid answers as long as the LLM can confirm tightness.
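The clear-and-refill rule could look something like the sketch below. The function name and the choice to refill one word per cleared word from the front of the deck are assumptions; the verdict itself is passed in as a boolean, since it would come from the live CREATIVE_ACCEPT call.

```python
def apply_submission(
    pool: list[str], deck: list[str], chosen: list[str], accepted: bool
) -> int:
    """Mutate pool and deck for one Pile submission; return the words cleared."""
    if not 3 <= len(chosen) <= 5 or len(set(chosen)) != len(chosen):
        raise ValueError("choose 3 to 5 distinct words")
    if any(word not in pool for word in chosen):
        raise ValueError("every chosen word must currently be in the pool")
    if not accepted:
        return 0  # rejected words stay on the board
    for word in chosen:
        pool.remove(word)
        if deck:
            pool.append(deck.pop(0))  # refill from the front of the deck
    return len(chosen)
```

Score per minute is then the sum of the returned counts within the 60-second round.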
2. Bridge — single-word polysemy speedrun
- Tempo: real-time, ~10 sec per move.
- Mechanic: Two category cards on screen ("Words for sharp pain" and "Things that bite"). Type one word the LLM accepts as fitting BOTH (e.g. STING). Move on. Faster = more points.
- AI calls: 2 JUDGE calls per submission (one per category, on the player's word).
- Structural novelty: the polysemy/multi-meaning skill — a known Connections difficulty axis — turned into the primary gameplay loop. Static games can plant such words but can't let the player invent them on demand.
3. Threaded — semantic word chains
- Tempo: real-time / continuous.
- Mechanic: Words drift across a conveyor belt. You build a chain by linking consecutive words with a category the LLM accepts ("APPLE → ORANGE: both fruits" → "ORANGE → RED: both colors" → "RED → ANGRY: red with anger"). Chain length = score. One chain per game.
- AI calls: 1 JUDGE per link, on the player's pair-and-category.
- Structural novelty: emergent semantic graphs from arbitrary word streams. The category set isn't pre-built — it's whatever the player can find. A static game can't be open-ended on the connection vocabulary.
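One way to score a chain, assuming each link is a (from_word, to_word, category) triple and that a link must start where the previous one ended. The judge callable stands in for the per-link JUDGE call, and stopping at the first broken link is an assumption about the rules:

```python
from typing import Callable

Link = tuple[str, str, str]  # (from_word, to_word, player's connecting category)


def chain_score(links: list[Link], judge: Callable[[str, str, str], bool]) -> int:
    """Count links until the chain breaks (discontinuity or a rejected link)."""
    score = 0
    previous_end = None
    for start, end, category in links:
        if previous_end is not None and start != previous_end:
            break  # each link must begin where the last one ended
        if not judge(start, end, category):
            break  # one AI call per link
        score += 1
        previous_end = end
    return score
```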
Medium-paced (5–15 minute sessions)
4. Stretch — push a category to its limit
- Tempo: medium, 5-min sessions.
- Mechanic: The game opens with a tight seed category and 4 starting words ("Types of trees: OAK, MAPLE, BIRCH, PINE"). Add a 5th word — does it still fit? LLM judges. Yes → add a 6th. Each accepted word = +1 point. First rejection ends the round. Some categories support more stretch than others (broader = more elastic).
- AI calls: 1 JUDGE per word added.
- Structural novelty: category elasticity as a gameplay dimension. There's no pre-set answer length. The player learns intuitions about which categories admit how much stretching — a meta-skill no static game develops.
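A round of Stretch reduces to a short loop. In this sketch the judge callable stands in for the per-word JUDGE call; skipping repeated words without ending the round is an assumption the description above does not settle.

```python
from typing import Callable, Iterable


def stretch_round(
    seed_words: Iterable[str],
    category: str,
    attempts: Iterable[str],
    judge: Callable[[str, str], bool],  # hypothetical: (category, word) -> fits?
) -> int:
    """Return +1 per accepted new word; the first rejection ends the round."""
    accepted = {word.upper() for word in seed_words}
    score = 0
    for word in attempts:
        key = word.upper()
        if key in accepted:
            continue  # assumption: repeats neither score nor end the round
        if not judge(category, key):
            break
        accepted.add(key)
        score += 1
    return score
```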
5. Inverse — multi-solution sort
- Tempo: medium, ~10 min per puzzle.
- Mechanic: 16 words on a board with NO predetermined grouping. The player sorts them into ANY 4 groups of 4 with ANY categories of their choice. The LLM judges all 4 categories. All 4 valid → win. Bonus for tightness (LLM rates each category 1–5).
- AI calls: 4 CREATIVE_ACCEPT per submission, plus optional 4 tightness-score calls.
- Structural novelty: Connections has one valid answer; this version has thousands. Players compete on creativity and tightness, not on guessing the curator's mind.
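Before spending four AI calls, a submission can be gated deterministically: it must be four groups of four that use every board word exactly once. The sketch below assumes groups are keyed by the player's category text and that the score is the sum of the four tightness ratings; both the names and that scoring formula are illustrative assumptions.

```python
from typing import Callable


def is_valid_partition(board: list[str], groups: dict[str, list[str]]) -> bool:
    """True if the groups are 4 sets of 4 using every board word exactly once."""
    if len(groups) != 4 or any(len(words) != 4 for words in groups.values()):
        return False
    used = [word for words in groups.values() for word in words]
    return sorted(used) == sorted(board)


def score_submission(
    board: list[str],
    groups: dict[str, list[str]],
    accept: Callable[[list[str], str], bool],  # stands in for CREATIVE_ACCEPT
    rate: Callable[[list[str], str], int],     # stands in for the 1-5 tightness call
) -> int:
    """Return 0 on a loss, otherwise the summed tightness ratings (4 to 20)."""
    if not is_valid_partition(board, groups):
        return 0  # cheap deterministic gate before any AI call
    if not all(accept(words, category) for category, words in groups.items()):
        return 0
    return sum(rate(words, category) for category, words in groups.items())
```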
6. Misfit — odd-one-out, then redeem
- Tempo: medium, ~3 min per puzzle.
- Mechanic: The game shows a category and 4–5 words; one of them doesn't quite fit. Stage 1: identify the misfit. Stage 2 (bonus): propose a category the misfit word DOES fit. Both stages judged by the LLM.
- AI calls: 1 JUDGE on stage 1 (verifies the misfit), 1 CREATIVE_ACCEPT on stage 2 (validates the player's redemption category).
- Structural novelty: the second stage — "what category does the wrong word actually fit?" — is essentially impossible without live judging. Static games can plant misfits; they can't accept arbitrary creative redemptions.
Slow / daily
7. Coalition — daily creativity leaderboard
- Tempo: daily, 24-hour cycle, async.
- Mechanic: Once per day, the system publishes 16 words (offline-generated by 26b with the guarded pipeline + filter + critique pass). All players worldwide get the same 16. Each player submits their own 4×4 sort with 4 self-supplied categories. Server collects all submissions. Daily leaderboard ranks by:
- Validity: all 4 categories accepted by the LLM (binary gate).
- Tightness score: LLM rates each category 1–5; submission score is the average.
- Uniqueness: how few other players used the same exact grouping (rewards creativity over the obvious solution).
- AI calls: 4 CREATIVE_ACCEPT + 4 tightness ratings per submission.
- Structural novelty: the social/share ritual of Wordle and Connections, but with creativity as the leaderboard axis instead of speed-to-known-answer. "I split the daily 16 with the only 'Greek myths' grouping anyone found" is a different brag than "I solved it in 2 mistakes."
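The uniqueness axis needs an order-independent key for a grouping, so that the same split submitted in a different order counts as the same solution. A rough sketch follows; how the three axes combine is not specified above, so the ordering here (validity gate, then average tightness, then fewer players sharing the grouping) is purely an assumption.

```python
from collections import Counter


def canonical(groups: list[list[str]]) -> frozenset:
    """Order-independent key: the same split in any order maps to the same value."""
    return frozenset(frozenset(word.upper() for word in group) for group in groups)


def rank(submissions: dict[str, dict]) -> list[str]:
    """Rank players; submissions[player] has 'groups', 'valid', and 'tightness'."""
    valid = {p: s for p, s in submissions.items() if s["valid"]}  # binary gate
    shared = Counter(canonical(s["groups"]) for s in valid.values())

    def sort_key(player: str) -> tuple:
        entry = valid[player]
        average = sum(entry["tightness"]) / len(entry["tightness"])
        # Higher tightness first, then rarer groupings, then name for stability.
        return (-average, shared[canonical(entry["groups"])], player)

    return sorted(valid, key=sort_key)
```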
8. Bench — collaborative single-category foraging
- Tempo: daily, 24-hour async.
- Mechanic: Each day a single category is published ("Words that follow GREEN" or "Things you can break"). Players have 24 hours to submit as many words as they can; LLM judges each. Each accepted word is "claimed" by the first submitter (publicly visible). Per-player score = unique claims.
- AI calls: 1 JUDGE per submitted word.
- Structural novelty: the answer set is generative, not hand-curated. NYT can't ship an open-ended "submit anything that fits" puzzle because they don't know all the answers; the LLM does (well enough for 87.5% of cases, with the bench growing publicly to fill in the rest).
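The first-claim rule can be kept in a plain dictionary from word to claimant. In this sketch the names are hypothetical and judge stands in for the per-word JUDGE call; checking the ledger before calling the judge is a design choice that avoids paying for an AI call on an already-claimed word.

```python
from collections import Counter
from typing import Callable


def submit_word(
    claims: dict[str, str], player: str, word: str, judge: Callable[[str], bool]
) -> str:
    """Record the first valid submitter of a word; return the outcome."""
    key = word.strip().upper()
    if key in claims:
        return "already claimed"  # no AI call needed for repeats
    if not judge(key):
        return "rejected"
    claims[key] = player
    return "claimed"


def leaderboard(claims: dict[str, str]) -> list[tuple[str, int]]:
    """Per-player count of unique claims, highest first."""
    return Counter(claims.values()).most_common()
```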
Hybrid / structurally distinctive
9. Heist — competitive bluff-and-claim
- Tempo: medium-fast, 2-team multiplayer.
- Mechanic: Two teams share a pool of words. Each turn, the active team announces a category ("Words that follow BLUE") and has 30 seconds to claim words from the pool that fit. The opposing team can challenge any claim — if the LLM agrees the word doesn't fit, the claiming team loses points; if it does, the challenger loses points. Bluffing dynamics emerge naturally: claim a borderline word and dare them to challenge.
- AI calls: 1 JUDGE per claim (at challenge-time only — no need to judge unchallenged claims unless you want a "true scoring" cleanup pass at end-of-game).
- Structural novelty: competitive risk-taking on category boundaries. The challenge mechanic literally requires a live, fair judge — there's no static-game equivalent because static games can't adjudicate disputes mid-play.
10. Hidden — find the broadest tight category
- Tempo: medium, ~5 min per puzzle.
- Mechanic: 12 (or more) words on a board. Find ONE category that fits ALL of them — and the narrower / more specific the category, the higher the score. ("Things that exist" gets you 1 point; "Things you'd find in a 1980s bedroom" gets you 8.) LLM judges on both validity (does it actually fit all 12?) and tightness (1–5).
- AI calls: 1 batched JUDGE (on category × 12 words) per submission, plus 1 tightness rating.
- Structural novelty: the inversion. Every other word game asks the player to find narrow groups inside a board; this one asks the player to find the broadest category that still feels tight. A different cognitive skill, and impossible without live category judging.
Recombinable building blocks
The 10 ideas above mix six primitives. Use these to remix or design new variants:
| Primitive | Variants |
|---|---|
| Time pressure | Real-time / per-move timer / per-day async / untimed |
| Goal direction | Find a valid grouping · validate a player-proposed grouping · find a misfit · find a "bridge" word · find the broadest tight category · build a chain |
| Player count | Solo · async-multi (Wordle-shape) · sync-co-op · sync-versus |
| Word source | Daily-curated 16 · player-supplied · conveyor-fed stream · category-seeded generation |
| Scoring axis | Speed · count · uniqueness vs other players · LLM-rated tightness · chain length |
| AI call shape | JUDGE single · JUDGE batched (one category × N words) · CREATIVE_ACCEPT · CREATE (rare — from the bakeoff this is the least reliable axis) · tightness-rating |
Easy recombinations to consider:
- Pile + Coalition = daily 60-second speedrun on the day's curated word pool, leaderboard by score.
- Stretch + Hidden = find the longest broadest category that still passes the tightness bar.
- Heist + Threaded = chain-builder versus mode where teams steal links from each other's chains.
- Bench + Misfit = daily foraging where some submissions are deliberate adversarial misfits the community has to flag.
Open questions / things still untested
- Adversarial player input on CREATIVE_ACCEPT. Tests used honest categories. Real players will try to game the judge with categories like "Words containing a vowel" (trivially true of most English words) or "Words that are 4–7 letters long" (true by construction in many cases). This needs a category-tightness pre-check on player input: at minimum, require the category to fail for at least one word from the wider deck, or apply a specificity bar.
- Cultural / contextual category robustness. Tested categories were lexical/factual ("Roman gods", "fruits", "things you can crack"). Cultural references and time-bound categories ("Words in Beatles songs", "Common Texan slang") may break the judge.
- Critique-pass effectiveness. The generation pipeline assumes a second-model critique pass catches structural mistakes. Not yet verified — feed Experiment 1's failed puzzles into a critique prompt and check.
- 8B's "no" bias on hard YES cases. It missed judge-y3 (days of the week; it said all four were misfits, which was incoherent) and judge-y6 (cold turkey). 8B might be slightly more conservative in production than its test numbers suggest.
- Diversity over time. All 10 puzzles in Experiment 1 were unseeded; 31b reached for "scales" twice in 5 puzzles. With 26b alone for generation, the diversity question is sharper. A 30-day seeded run is the next experiment if any of the daily-puzzle ideas (Coalition, Bench) goes forward.
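For the adversarial-input item above, the "must fail for at least one word from the wider deck" rule might be sketched like this. It is untested against real players; is_specific_enough and judge_word are hypothetical names, and the right sample size and rejection threshold are open questions.

```python
from typing import Callable, Iterable


def is_specific_enough(
    category: str,
    deck_sample: Iterable[str],
    judge_word: Callable[[str, str], bool],  # hypothetical: (category, word) -> fits?
    min_rejections: int = 1,
) -> bool:
    """True if the category excludes at least min_rejections sampled deck words."""
    rejections = 0
    for word in deck_sample:
        if not judge_word(category, word):
            rejections += 1
            if rejections >= min_rejections:
                return True  # stop early to save AI calls
    return False
```

A category such as "Words containing a vowel" would fit every sampled word and so fail this check, at the cost of extra JUDGE calls per new category.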
Repo structure
.
├── README.md # this file
├── IDEA.md # original brief, with note about the pivot
├── DECISIONS.md # decision log, kept as project moves forward
├── scripts/
│ ├── gemma-generation-bakeoff.py # Experiment 1 — whole-puzzle generation
│ └── gemma-semantic-bakeoff.py # Experiment 2 — atomic skills
└── docs/reference/
├── gemma-generation-bakeoff-2026-04-27-221751.md # Experiment 1 report (graded)
├── gemma-generation-bakeoff-2026-04-27-221751-raw.json
├── gemma-semantic-bakeoff-2026-04-27-224800.md # Experiment 2 report (graded)
└── gemma-semantic-bakeoff-2026-04-27-224800-raw.json
Reproduce
# point at any local Ollama with gemma4:latest and gemma4:26b loaded
export OLLAMA_HOST=http://localhost:11434
python3 scripts/gemma-semantic-bakeoff.py # ~5 min on a 24 GB GPU
python3 scripts/gemma-generation-bakeoff.py # ~10 min
Reports land in docs/reference/ with timestamps. Hand-grade the CREATE outputs and any TODO grades inline in the markdown — both bakeoff scripts emit grading-friendly reports.
License
Not yet specified. If you're considering using this code or the test bank in your own work, open an issue and ask.