Mortdecai 5a2a02e483 docs: bootstrap repo with bakeoff results and game-mechanics idea bank

This repo opens with the design-discovery work completed before any product
code is written. Two model bakeoffs against gemma4:8b/26b/31b on a local
Ollama established that:

- Whole-puzzle generation in the Connections shape is unreliable on Gemma 4
  (gemma4:31b ~50% structural-pass, gemma4:26b ~20-30%); 31b is intentionally
  out of project scope, so the generation route is harder still.
- Atomic semantic-judging skills are reliable: 87.5%/93.75%/100% (8B/26b/31b)
  on JUDGE; *all three models* scored 10/10 on CREATIVE_ACCEPT — fair judging
  of player-INVENTED categories. That is the structural unlock vs static
  hand-curated word games.

The README contains the full writeup, the test bench, and a brainstormed
bank of 10 distinct game-mechanics ideas across the fast/medium/slow tempo
range, plus a primitives table for recombination.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-27 23:09:46 -04:00


seth_semantic_game

Working title. A self-hosted word game built around an LLM's ability to fairly judge player-invented semantic categories in real time — something static, hand-curated word games structurally cannot do.

This repo documents the design discovery process, including two model bakeoffs that picked the architecture and a brainstormed bank of game-mechanics ideas that the actual product will draw from.

TL;DR

Seed idea: clone NYT Connections (16 words → 4 hidden groups of 4) with a local LLM doing the curation.
Seed idea died fast: unaided whole-puzzle generation on Gemma 4 ships broken puzzles ~50% of the time (duplicate tiles, mislabeled categories, fake wordplay) — see docs/reference/gemma-generation-bakeoff-2026-04-27-221751.md.
The actual unlock: Gemma 4 reliably judges whether a player-supplied category fits a player-supplied set of words. Across 35 hand-labeled cases on three model sizes, CREATIVE_ACCEPT scored 10/10 on every model including the 8B variant at 0.7s per call. JUDGE landed at 87.5% / 93.75% / 100% (8B / 26b / 31b). See docs/reference/gemma-semantic-bakeoff-2026-04-27-224800.md.
The pivot: stop trying to generate Connections. Build games where the player invents the groupings and the LLM is the live, fair judge. That's what the static format can't do.
Models in scope: gemma4:latest (8B) for live judging, gemma4:26b for offline puzzle prep / critique. gemma4:31b was tested and is more accurate, but is intentionally out of scope for this project.

What we did

Two experiments, both reproducible from scripts/ against a local Ollama (point OLLAMA_HOST at your instance; defaults to http://localhost:11434).

Experiment 1 — Generation bakeoff

Question: can Gemma 4 generate a Connections-quality 16-word / 4-group puzzle in one shot?

Setup: 5 puzzles per model on gemma4:26b and gemma4:31b. Strict JSON schema requesting groups + difficulty bands + claimed overlap-trap words. No format=json (that's a known Gemma 4 + Ollama hang); JSON parsed client-side; up to 3 retries with temperature bumped +0.1 each attempt.
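The parse-and-retry loop described in the setup can be sketched as follows. This is a minimal illustration, not the code in scripts/: `extract_json`, `generate_with_retries`, and the injected `call_model` callable are hypothetical names, the starting temperature of 0.7 is an assumption, and "up to 3 retries" is read here as three attempts in total.

```python
import json
from typing import Callable, Optional


def extract_json(text: str) -> Optional[dict]:
    """Parse the span from the first '{' to the last '}' in a free-text reply."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None


def generate_with_retries(
    call_model: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> raw text
    prompt: str,
    base_temperature: float = 0.7,  # assumed starting value
    max_attempts: int = 3,
) -> Optional[dict]:
    """Ask for JSON in the prompt, parse client-side, bump temperature on failure."""
    for attempt in range(max_attempts):
        temperature = round(base_temperature + 0.1 * attempt, 2)
        parsed = extract_json(call_model(prompt, temperature))
        if parsed is not None:
            return parsed
    return None  # caller decides what a fully failed puzzle means
```

Injecting `call_model` keeps the retry logic testable without a running Ollama instance.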

Results:

| Model | Pass | Borderline | Fail | Avg s/puzzle |
|---|---|---|---|---|
| `gemma4:26b` | 1 | 1 + 1 partial | 2 | 5.2 |
| `gemma4:31b-it-q4_K_M` | 2 | 2 | 1 | 18.2 |

Failure modes ranked by severity:

Structural violations — duplicate or near-duplicate words on the 16-tile board. Trivially detectable.
Broken category logic — words listed in a category they don't actually fit (DELUXE doesn't start with the full Greek letter "DELTA"; LIBRA isn't a "type of scale"). Hard to detect deterministically — needs a critique pass.
Redundant categories — two groups themed on the same concept. Detectable.
Self-graded traps don't always hold up — Gemma's claimed intended_traps were sometimes nonsense (PRESS claimed to fit "Words after BLOOD," but the compound is blood pressure, not blood press). Important consequence: the same model cannot be trusted to grade its own output.
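The failure modes marked as detectable above lend themselves to a deterministic post-filter. The sketch below is one possible shape for it, not the repo's implementation: the function names, the `{category: [words]}` input layout, and the crude plural-stripping rule for near-duplicates are all assumptions. It only catches redundant categories whose labels are literally the same after normalization; broken category logic still needs a critique pass.

```python
from typing import Dict, List


def normalize(word: str) -> str:
    """Crude near-duplicate key: case-fold, keep letters only, strip a plural 's'."""
    key = "".join(ch for ch in word.casefold() if ch.isalpha())
    return key[:-1] if len(key) > 3 and key.endswith("s") else key


def structural_problems(groups: Dict[str, List[str]]) -> List[str]:
    """Return human-readable problems; an empty list means the board passes."""
    problems = []
    if len(groups) != 4:
        problems.append(f"expected 4 groups, got {len(groups)}")
    for category, words in groups.items():
        if len(words) != 4:
            problems.append(f"{category!r} has {len(words)} words, expected 4")
    seen = {}  # normalized key -> first word that produced it
    for words in groups.values():
        for word in words:
            key = normalize(word)
            if key in seen:
                problems.append(f"{word!r} duplicates {seen[key]!r}")
            else:
                seen[key] = word
    labels = [normalize(category) for category in groups]
    if len(set(labels)) != len(labels):
        problems.append("two groups share the same category label")
    return problems
```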

This was decisive for the project direction. Unaided generation isn't viable, and the project is explicitly capped at 26b, which is the less reliable generator. So we need a different game shape: one that doesn't depend on the LLM generating finished puzzles unaided.

Experiment 2 — Semantic-skill bakeoff

Question: instead of whole-puzzle generation, can Gemma reliably perform the atomic skills a live game would need? Specifically:

JUDGE — given a category and 4 words, does Gemma correctly say yes/no on whether they all fit?
CREATE — given a category, does Gemma produce 4 tightly-fitting words?
CREATIVE_ACCEPT — given 4 words and a player-proposed category, does Gemma fairly judge whether the category validates the grouping (even if it differs from any "intended" category)?

The third one is the design-relevant one. If it works, the game can let players invent their own groupings — which is structurally impossible for a hand-curated static format.
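To make the CREATIVE_ACCEPT call shape concrete, here is a hedged sketch of a prompt builder and a verdict parser. The prompt wording, the `verdict`/`reason` JSON keys, and both function names are assumptions for illustration; the bakeoff script may phrase and parse this differently. Unparseable replies are treated as rejections, which is a conservative choice rather than something the bakeoff prescribes.

```python
import json
from typing import List, Tuple


def build_creative_accept_prompt(words: List[str], category: str) -> str:
    """Hypothetical prompt wording for a player-proposed category."""
    return (
        "A player grouped these words: " + ", ".join(words) + ".\n"
        f'The player says the category is: "{category}".\n'
        "Does every word genuinely fit that category? "
        'Reply with JSON only: {"verdict": "accept" or "reject", '
        '"reason": "<one sentence>"}'
    )


def parse_verdict(reply: str) -> Tuple[bool, str]:
    """Return (accepted, reason). Anything unparseable counts as a rejection."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return False, "unparseable reply"
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return False, "unparseable reply"
    verdict = str(data.get("verdict", "")).strip().lower()
    return verdict == "accept", str(data.get("reason", ""))
```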

Setup: 35 hand-labeled cases (16 JUDGE / 10 CREATE / 9 CREATIVE_ACCEPT + 2 deliberately ambiguous) tested across gemma4:latest (8B), gemma4:26b, and gemma4:31b. Each case has explicit ground truth in the test bank.

Results:

| Model | JUDGE | CREATE | CREATIVE_ACCEPT | Avg s/call |
|---|---|---|---|---|
| `gemma4:latest` (8B) | 14/16 (87.5%) | 8/10 | 10/10 | 0.7 |
| `gemma4:26b` | 15/16 (93.75%) | 9/10 | 10/10 | 0.8 |
| `gemma4:31b-it-q4_K_M` | 16/16 (100%) | 9/10 | 10/10 | 2.3 |

Key findings:

CREATIVE_ACCEPT is decisive across all three models. Each scored 10/10: all five player-creative-but-valid groupings were accepted (e.g. WHIP / NUT / CODE / SMILE → "Things you can crack") and all five invalid ones were rejected (e.g. OAK / MAPLE / BIRCH / PINE → "Furniture brands"). The model gets the distinction.
8B is fast enough to use as a live judge. Sub-second on a 24 GB consumer GPU; per-guess economics are effectively free.
26b is mildly over-permissive on borderline cases. It accepted KIWI as a tech brand (APPLE / ORANGE / KIWI / BLACKBERRY → "Tech/phone brands"). 8B and 31b were stricter. For a live game, false positives degrade integrity more than false negatives, so 8B's calibration is the right tradeoff for live judging.
One failure mode is shared by all three models: "homophones-of-body-parts" (8B gave SEA/SEE/HEAR/HERE — none of which sound like body parts; 26b gave EYE which IS a body part rather than a homophone of one; 31b parse-failed three times running). Avoid this category class or scaffold prompts with worked examples.

What we picked

Model assignments:

| Role | Model | Why |
|---|---|---|
| Live JUDGE (per player guess) | `gemma4:latest` (8B) | Sub-second, strict-enough calibration, 87.5% accuracy on tight cases |
| Live CREATIVE_ACCEPT | `gemma4:latest` (8B) | 10/10 in test, sub-second |
| Offline puzzle generation (if used at all) | `gemma4:26b` with strict filters + retries | 31b is out of scope by user constraint; 26b plus a deterministic post-filter and a critique pass is the workable path |
| Offline critique pass | `gemma4:26b` grading 8B's work, OR a non-Gemma open-weights judge | A model cannot be trusted to grade itself; the bakeoff confirmed Gemma rubber-stamps its own structural mistakes |

Operational gotchas baked into the scripts (all from upstream Gemma 4 + Ollama issue tracker; documented in the bakeoff scripts):

No format: "json" — server-side JSON enforcer hangs gemma4:26b Q4 indefinitely; ask for JSON in the prompt and parse client-side.
think: false for single-turn JSON pipelines. Otherwise thinking tokens consume the response budget and the response comes back empty.
Override Ollama defaults: num_ctx (default 2048 truncates the prompt), num_predict (default 128 truncates the output).
For multi-turn tool-calling agents the rule is the opposite: leave think unset on 26b. Not relevant here, but worth knowing.
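Put together, the single-turn gotchas above translate into a request body roughly like the one below. This is a sketch, not the scripts' code: the `num_ctx`, `num_predict`, and temperature values are assumed placeholders, and `build_generate_payload` / `ollama_generate` are hypothetical helper names. It targets Ollama's `/api/generate` endpoint with streaming disabled and reads the host from `OLLAMA_HOST` as described earlier.

```python
import json
import os
import urllib.request


def build_generate_payload(model: str, prompt: str, temperature: float = 0.2) -> dict:
    """Request body with the single-turn gotchas applied."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "think": False,  # keep thinking tokens from eating the response budget
        # Deliberately no "format": "json" key; JSON is requested in the prompt.
        "options": {
            "num_ctx": 8192,      # assumed value, above the 2048 default
            "num_predict": 1024,  # assumed value, above the 128 default
            "temperature": temperature,
        },
    }


def ollama_generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """POST to /api/generate and return the raw text for client-side parsing."""
    host = os.environ.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_generate_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```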

Game-mechanics idea bank

The two bakeoffs together say: don't build a game where the LLM is the curator. Build a game where the LLM is the live, fair judge of player creativity. Below are 10 distinct game ideas that take that as the design constraint. None of them is Connections; each one leans on something a static game structurally can't replicate (live category validation, multi-solution puzzles, generative answer pools, semantic chains, etc.).

Each idea lists its tempo (how fast the game feels), the AI calls per turn (so cost can be reasoned about), and the structural novelty (the thing this idea can do that a hand-curated static format cannot).

Fast-paced (≤60-second rounds)

1. Pile — speedrun categorize

Tempo: real-time, 60-second rounds.
Mechanic: A pool of ~16 random words. You drag any 3–5 of them into a box and type a category. The LLM (8B) judges in ~0.7s. Accepted → those words clear, refilled from a deck. Rejected → they stay. Score = words categorized per minute.
AI calls: 1 per submission (CREATIVE_ACCEPT shape: player-supplied category + player-supplied words).
Structural novelty: the player invents groupings under time pressure; categories aren't pre-known. A static game has a single fixed answer per puzzle; this one has open-ended valid answers as long as the LLM can confirm tightness.
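A minimal sketch of the Pile submission loop, assuming a `judge(words, category)` callable that wraps the CREATIVE_ACCEPT call. The class name, the fixed shuffle seed, and scoring by cleared words are illustrative assumptions; the 60-second timer is left to the caller.

```python
import random
from typing import Callable, List

Judge = Callable[[List[str], str], bool]  # hypothetical: (words, category) -> accepted?


class PileRound:
    def __init__(self, deck: List[str], judge: Judge, pool_size: int = 16, seed: int = 0):
        self.deck = list(deck)
        random.Random(seed).shuffle(self.deck)
        self.pool = [self.deck.pop() for _ in range(min(pool_size, len(self.deck)))]
        self.judge = judge
        self.score = 0  # words cleared so far

    def submit(self, words: List[str], category: str) -> bool:
        """One submission costs at most one judge call."""
        if len(set(words)) != len(words) or not 3 <= len(words) <= 5:
            return False
        if any(word not in self.pool for word in words):
            return False
        if not self.judge(words, category):
            return False  # rejected words stay in the pool
        for word in words:
            self.pool.remove(word)
            if self.deck:
                self.pool.append(self.deck.pop())  # refill from the deck
        self.score += len(words)
        return True
```

Malformed submissions are rejected before the judge is called, so only well-formed guesses spend an LLM call.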

2. Bridge — single-word polysemy speedrun

Tempo: real-time, ~10 sec per move.
Mechanic: Two category cards on screen ("Words for sharp pain" and "Things that bite"). Type one word the LLM accepts as fitting BOTH (e.g. STING). Move on. Faster = more points.
AI calls: 2 JUDGE calls per submission (one per category, on the player's word).
Structural novelty: the polysemy/multi-meaning skill — a known Connections difficulty axis — turned into the primary gameplay loop. Static games can plant such words but can't let the player invent them on demand.

3. Threaded — semantic word chains

Tempo: real-time / continuous.
Mechanic: Words drift across a conveyor belt. You build a chain by linking consecutive words with a category the LLM accepts ("APPLE → ORANGE: both fruits" → "ORANGE → RED: both colors" → "RED → ANGRY: red with anger"). Chain length = score. One chain per game.
AI calls: 1 JUDGE per link, on the player's pair-and-category.
Structural novelty: emergent semantic graphs from arbitrary word streams. The category set isn't pre-built — it's whatever the player can find. A static game can't be open-ended on the connection vocabulary.
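Chain scoring for Threaded can be sketched as below, assuming each link is a `(from_word, to_word, category)` tuple and `judge_link` wraps one JUDGE call. Both names are hypothetical, and ending the chain at the first discontinuity or rejection is one reading of "one chain per game".

```python
from typing import Callable, List, Tuple

Link = Tuple[str, str, str]  # (from_word, to_word, category)


def chain_length(links: List[Link], judge_link: Callable[[str, str, str], bool]) -> int:
    """Count links until one breaks continuity or is rejected by the judge."""
    length = 0
    previous_end = None
    for start, end, category in links:
        if previous_end is not None and start != previous_end:
            break  # the chain must continue from the last accepted word
        if not judge_link(start, end, category):
            break
        length += 1
        previous_end = end
    return length
```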

Medium-paced (5–15 minute sessions)

4. Stretch — push a category to its limit

Tempo: medium, 5-min sessions.
Mechanic: The game opens with a tight seed category and 4 starting words ("Types of trees: OAK, MAPLE, BIRCH, PINE"). Add a 5th word — does it still fit? LLM judges. Yes → add a 6th. Each accepted word = +1 point. First rejection ends the round. Some categories support more stretch than others (broader = more elastic).
AI calls: 1 JUDGE per word added.
Structural novelty: category elasticity as a gameplay dimension. There's no pre-set answer length. The player learns intuitions about which categories admit how much stretching — a meta-skill no static game develops.
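The Stretch round reduces to a loop that stops at the first rejection. In this sketch `judge` is a hypothetical callable taking the category, the words accepted so far, and the candidate; skipping repeated words without ending the round is an assumption.

```python
from typing import Callable, Iterable, List


def play_stretch(
    category: str,
    seed_words: List[str],
    candidates: Iterable[str],
    judge: Callable[[str, List[str], str], bool],  # hypothetical JUDGE wrapper
) -> int:
    """Return the points scored: +1 per accepted word, round ends on first rejection."""
    accepted = list(seed_words)
    points = 0
    for candidate in candidates:
        if candidate in accepted:
            continue  # assumed rule: repeats are ignored, not punished
        if not judge(category, accepted, candidate):
            break
        accepted.append(candidate)
        points += 1
    return points
```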

5. Inverse — multi-solution sort

Tempo: medium, ~10 min per puzzle.
Mechanic: 16 words on a board with NO predetermined grouping. The player sorts them into ANY 4 groups of 4 with ANY categories of their choice. The LLM judges all 4 categories. All 4 valid → win. Bonus for tightness (LLM rates each category 1–5).
AI calls: 4 CREATIVE_ACCEPT per submission, plus optional 4 tightness-score calls.
Structural novelty: Connections has one valid answer; this version has thousands. Players compete on creativity and tightness, not on guessing the curator's mind.
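Before spending four CREATIVE_ACCEPT calls on an Inverse submission, a cheap deterministic check can confirm that it really is a 4×4 partition of the board. A sketch, with hypothetical function names and an assumed `(category, words)` layout per group:

```python
from typing import Callable, List, Tuple

Group = Tuple[str, List[str]]  # (player-supplied category, four words)


def is_valid_partition(board: List[str], groups: List[Group]) -> bool:
    """True if the groups are 4 sets of 4 that use every board word exactly once."""
    if len(groups) != 4 or any(len(words) != 4 for _, words in groups):
        return False
    used = [word for _, words in groups for word in words]
    return sorted(used) == sorted(board)


def judge_submission(
    board: List[str],
    groups: List[Group],
    creative_accept: Callable[[List[str], str], bool],  # hypothetical LLM wrapper
) -> bool:
    """Win only if the partition is well-formed and all 4 categories are accepted."""
    if not is_valid_partition(board, groups):
        return False
    return all(creative_accept(words, category) for category, words in groups)
```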

6. Misfit — odd-one-out, then redeem

Tempo: medium, ~3 min per puzzle.
Mechanic: The game shows a category and 4–5 words; one of them doesn't quite fit. Stage 1: identify the misfit. Stage 2 (bonus): propose a category the misfit word DOES fit. Both stages judged by the LLM.
AI calls: 1 JUDGE on stage 1 (verifies the misfit), 1 CREATIVE_ACCEPT on stage 2 (validates the player's redemption category).
Structural novelty: the second stage — "what category does the wrong word actually fit?" — is essentially impossible without live judging. Static games can plant misfits; they can't accept arbitrary creative redemptions.

Slow / daily

7. Coalition — daily creativity leaderboard

Tempo: daily, 24-hour cycle, async.
Mechanic: Once per day, the system publishes 16 words (offline-generated by 26b with the guarded pipeline + filter + critique pass). All players worldwide get the same 16. Each player submits their own 4×4 sort with 4 self-supplied categories. Server collects all submissions. Daily leaderboard ranks by:
- Validity: all 4 categories accepted by the LLM (binary gate).
- Tightness score: LLM rates each category 1–5; submission score is the average.
- Uniqueness: how few other players used the same exact grouping (rewards creativity over the obvious solution).
AI calls: 4 CREATIVE_ACCEPT + 4 tightness ratings per submission.
Structural novelty: the social/share ritual of Wordle and Connections, but with creativity as the leaderboard axis instead of speed-to-known-answer. "I split the daily 16 with the only 'Greek myths' grouping anyone found" is a different brag than "I solved it in 2 mistakes."
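The three leaderboard components can be computed as sketched below. The README does not say how they combine into a single rank, so the sort order here (tighter first, then rarer grouping) is an assumption, as are the function names and the tuple layout. Groupings are compared by word partition only, ignoring category wording, and popularity is counted over all submissions.

```python
from collections import Counter
from statistics import mean
from typing import Dict, FrozenSet, List, Tuple

Group = Tuple[str, List[str], bool, int]  # (category, words, accepted, tightness 1-5)


def grouping_key(groups: List[Group]) -> FrozenSet[FrozenSet[str]]:
    """Order-independent identity of a sort; category wording is ignored."""
    return frozenset(frozenset(words) for _, words, _, _ in groups)


def score_day(submissions: Dict[str, List[Group]]) -> List[dict]:
    """Rank valid submissions by average tightness, then by uniqueness."""
    popularity = Counter(grouping_key(groups) for groups in submissions.values())
    rows = []
    for player, groups in submissions.items():
        if not all(accepted for _, _, accepted, _ in groups):
            continue  # validity is a binary gate
        rows.append({
            "player": player,
            "tightness": mean(tightness for _, _, _, tightness in groups),
            "shared_with": popularity[grouping_key(groups)] - 1,
        })
    # Assumed ordering: higher tightness first, fewer sharers second.
    return sorted(rows, key=lambda row: (-row["tightness"], row["shared_with"]))
```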

8. Bench — collaborative single-category foraging

Tempo: daily, 24-hour async.
Mechanic: Each day a single category is published ("Words that follow GREEN" or "Things you can break"). Players have 24 hours to submit as many words as they can; LLM judges each. Each accepted word is "claimed" by the first submitter (publicly visible). Per-player score = unique claims.
AI calls: 1 JUDGE per submitted word.
Structural novelty: the answer set is generative, not hand-curated. NYT can't ship an open-ended "submit anything that fits" puzzle because they don't know all the answers; the LLM does (well enough for 87.5% of cases, with the bench growing publicly to fill in the rest).
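A first-submitter claim ledger for Bench might look like this. It is a sketch under assumptions: the class name, the upper-casing normalization, and the status strings are all hypothetical. Checking the ledger before calling the judge means an already-claimed word costs no LLM call.

```python
from typing import Callable, Dict


class Bench:
    def __init__(self, category: str, judge_word: Callable[[str, str], bool]):
        self.category = category
        self.judge_word = judge_word      # hypothetical JUDGE wrapper: (category, word)
        self.claims: Dict[str, str] = {}  # normalized word -> first submitter

    def submit(self, player: str, word: str) -> str:
        key = word.strip().upper()
        if key in self.claims:
            return "already claimed"      # no judge call needed
        if not self.judge_word(self.category, key):
            return "rejected"
        self.claims[key] = player
        return "claimed"

    def scores(self) -> Dict[str, int]:
        """Per-player score = number of unique claims."""
        totals: Dict[str, int] = {}
        for owner in self.claims.values():
            totals[owner] = totals.get(owner, 0) + 1
        return totals
```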

Hybrid / structurally distinctive

9. Heist — competitive bluff-and-claim

Tempo: medium-fast, 2-team multiplayer.
Mechanic: Two teams share a pool of words. Each turn, the active team announces a category ("Words that follow BLUE") and has 30 seconds to claim words from the pool that fit. The opposing team can challenge any claim — if the LLM agrees the word doesn't fit, the claiming team loses points; if it does, the challenger loses points. Bluffing dynamics emerge naturally: claim a borderline word and dare them to challenge.
AI calls: 1 JUDGE per claim (at challenge-time only — no need to judge unchallenged claims unless you want a "true scoring" cleanup pass at end-of-game).
Structural novelty: competitive risk-taking on category boundaries. The challenge mechanic literally requires a live, fair judge — there's no static-game equivalent because static games can't adjudicate disputes mid-play.
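Challenge resolution in Heist is a single judged comparison. The sketch below assumes a flat one-point stake and a `judge(category, word)` wrapper; both are illustrative choices, and `resolve_challenge` is a hypothetical name.

```python
from typing import Callable, Dict


def resolve_challenge(
    scores: Dict[str, int],
    claimer: str,
    challenger: str,
    category: str,
    word: str,
    judge: Callable[[str, str], bool],  # hypothetical JUDGE wrapper
    stake: int = 1,                     # assumed penalty size
) -> bool:
    """Judge the disputed claim and deduct the stake from whichever team was wrong."""
    fits = judge(category, word)
    loser = challenger if fits else claimer
    scores[loser] -= stake
    return fits
```

Only challenged claims reach the judge, which matches the one-call-per-challenge cost described above.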

10. Hidden — find the broadest tight category

Tempo: medium, ~5 min per puzzle.
Mechanic: 12 (or more) words on a board. Find ONE category that fits ALL of them — and the narrower / more specific the category, the higher the score. ("Things that exist" gets you 1 point; "Things you'd find in a 1980s bedroom" gets you 8.) LLM judges on both validity (does it actually fit all 12?) and tightness (1–5).
AI calls: 1 batched JUDGE (on category × 12 words) per submission, plus 1 tightness rating.
Structural novelty: the inversion. Every other word game asks the player to find narrow groups inside a board; this one asks the player to find the broadest category that still feels tight. A different cognitive skill, and impossible without live category judging.

Recombinable building blocks

The 10 ideas above mix five primitives. Use these to remix or design new variants:

| Primitive | Variants |
|---|---|
| Time pressure | Real-time / per-move timer / per-day async / untimed |
| Goal direction | Find a valid grouping · validate a player-proposed grouping · find a misfit · find a "bridge" word · find the broadest tight category · build a chain |
| Player count | Solo · async-multi (Wordle-shape) · sync-co-op · sync-versus |
| Word source | Daily-curated 16 · player-supplied · conveyor-fed stream · category-seeded generation |
| Scoring axis | Speed · count · uniqueness vs other players · LLM-rated tightness · chain length |
| AI call shape | JUDGE single · JUDGE batched (one category × N words) · CREATIVE_ACCEPT · CREATE (rare; from the bakeoff this is the least reliable axis) · tightness-rating |

Easy recombinations to consider:

Pile + Coalition = daily 60-second speedrun on the day's curated word pool, leaderboard by score.
Stretch + Hidden = find the longest broadest category that still passes the tightness bar.
Heist + Threaded = chain-builder versus mode where teams steal links from each other's chains.
Bench + Misfit = daily foraging where some submissions are deliberate adversarial misfits the community has to flag.

Open questions / things still untested

Adversarial player input on CREATIVE_ACCEPT. Tests used honest categories. Real players will try to game the judge with categories like "Words containing a vowel" (trivially true of most English words) or "Words that are 4–7 letters long" (true by construction in many cases). Need a category-tightness pre-check on player input: at minimum, require the category to fail for at least one word from the wider deck, or apply a specificity bar.
Cultural / contextual category robustness. Tested categories were lexical/factual ("Roman gods", "fruits", "things you can crack"). Cultural references and time-bound categories ("Words in Beatles songs", "Common Texan slang") may break the judge.
Critique-pass effectiveness. The generation pipeline assumes a second-model critique pass catches structural mistakes. Not yet verified — feed Experiment 1's failed puzzles into a critique prompt and check.
8B's "no" bias on hard YES cases. It missed judge-y3 (days of the week — said all four were misfits, which was incoherent) and judge-y6 (cold turkey). 8B might be slightly more conservative in production than its test numbers suggest.
Diversity over time. All 10 puzzles in Experiment 1 were unseeded; 31b reached for "scales" twice in 5 puzzles. With 26b alone for generation, the diversity question is sharper. A 30-day seeded run is the next experiment if any of the daily-puzzle ideas (Coalition, Bench) goes forward.
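For the first open question above, the "must fail for at least one word from the wider deck" pre-check could be sketched like this. It is untested against real models: `is_too_broad` and `judge_word` are hypothetical names, and the sample size of 8 is an arbitrary assumption that trades extra JUDGE calls against confidence.

```python
import random
from typing import Callable, List


def is_too_broad(
    category: str,
    submitted: List[str],
    deck: List[str],
    judge_word: Callable[[str, str], bool],  # hypothetical JUDGE wrapper
    sample_size: int = 8,                    # assumed; each sampled word costs a call
    seed: int = 0,
) -> bool:
    """True if the category accepts every sampled word from outside the submission."""
    outside = [word for word in deck if word not in submitted]
    sample = random.Random(seed).sample(outside, min(sample_size, len(outside)))
    # With nothing to sample there is no evidence either way, so do not flag.
    return bool(sample) and all(judge_word(category, word) for word in sample)
```

A batched JUDGE call (one category against the whole sample) would cut this to a single request, at the cost of a more complex reply to parse.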

Repo structure

.
├── README.md                          # this file
├── IDEA.md                            # original brief, with note about the pivot
├── DECISIONS.md                       # decision log, kept as project moves forward
├── scripts/
│   ├── gemma-generation-bakeoff.py    # Experiment 1 — whole-puzzle generation
│   └── gemma-semantic-bakeoff.py      # Experiment 2 — atomic skills
└── docs/reference/
    ├── gemma-generation-bakeoff-2026-04-27-221751.md       # Experiment 1 report (graded)
    ├── gemma-generation-bakeoff-2026-04-27-221751-raw.json
    ├── gemma-semantic-bakeoff-2026-04-27-224800.md         # Experiment 2 report (graded)
    └── gemma-semantic-bakeoff-2026-04-27-224800-raw.json

Reproduce

# point at any local Ollama with gemma4:latest and gemma4:26b loaded
export OLLAMA_HOST=http://localhost:11434
python3 scripts/gemma-semantic-bakeoff.py    # ~5 min on a 24 GB GPU
python3 scripts/gemma-generation-bakeoff.py  # ~10 min

Reports land in docs/reference/ with timestamps. Hand-grade the CREATE outputs and any TODO grades inline in the markdown — both bakeoff scripts emit grading-friendly reports.

License

Not yet specified. If you're considering using this code or the test bank in your own work, open an issue and ask.
