docs: bootstrap repo with bakeoff results and game-mechanics idea bank
This repo opens with the design-discovery work completed before any product code is written. Two model bakeoffs against gemma4:8b/26b/31b on a local Ollama established that:

- Whole-puzzle generation in the Connections shape is unreliable on Gemma 4 (gemma4:31b ~50% structural-pass, gemma4:26b ~20-30%); 31b is intentionally out of project scope, so the generation route is harder still.
- Atomic semantic-judging skills are reliable: 87.5%/93.75%/100% (8b/26b/31b) on JUDGE; *all three models* scored 10/10 on CREATIVE_ACCEPT (fair judging of player-INVENTED categories). That is the structural unlock vs static hand-curated word games.

The README contains the full writeup, the test bench, and a brainstormed bank of 10 distinct game-mechanics ideas across the fast/medium/slow tempo range, plus a primitives table for recombination.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,790 @@
[
  {
    "model": "gemma4:26b",
    "i": 1,
    "dt": 5.670714855194092,
    "ok": true,
    "attempts": 1,
    "puzzle": {
      "theme_seed": "An exploration of common words that can be paired with 'BIRD' or 'DRAGON' or relate to types of 'FLY'.",
      "groups": [
        {
          "difficulty": "yellow",
          "category": "Types of Flies",
          "words": [
            "FRUIT",
            "HOUSE",
            "FIRE",
            "BLOW"
          ]
        },
        {
          "difficulty": "green",
          "category": "Words before 'FLY'",
          "words": [
            "DRAGON",
            "MAY",
            "BUTTER",
            "BOTTLE"
          ]
        },
        {
          "difficulty": "blue",
          "category": "Types of Birds",
          "words": [
            "LARK",
            "SWALLOW",
            "CRANE",
            "SWIFT"
          ]
        },
        {
          "difficulty": "purple",
          "category": "Words that follow 'DRAGON'",
          "words": [
            "FRUIT",
            "FLY",
            "SLAYER",
            "FLAME"
          ]
        }
      ],
      "intended_traps": [
        {
          "word": "FRUIT",
          "actual_group": "purple",
          "trap_group": "yellow",
          "reason": "Fruit fly is a very common type of fly, making it a primary red herring for the yellow group."
        },
        {
          "word": "FIRE",
          "actual_group": "yellow",
          "trap_group": "purple",
          "reason": "Fire dragon or dragon flame connection could lead someone to look for a dragon-themed group."
        },
        {
          "word": "DRAGON",
          "actual_group": "green",
          "trap_group": "blue",
          "reason": "Dragon is a mythical creature, and users might look for a 'mythical creatures' group including Crane or Swallow (which are real birds)."
        },
        {
          "word": "FLY",
          "actual_group": "purple",
          "trap_group": "green",
          "reason": "Dragonfly is a common word, tempting the player to put 'FLY' in the 'Words before FLY' group."
        }
      ]
    },
    "eval_count": 500,
    "prompt_eval_count": 440
  },
  {
    "model": "gemma4:26b",
    "i": 2,
    "dt": 5.300434589385986,
    "ok": true,
    "attempts": 1,
    "puzzle": {
      "theme_seed": "Homophones and linguistic overlaps between nature and everyday objects.",
      "groups": [
        {
          "difficulty": "yellow",
          "category": "Types of Trees",
          "words": [
            "BIRCH",
            "CEDAR",
            "MAPLE",
            "PINE"
          ]
        },
        {
          "difficulty": "green",
          "category": "Things that can be 'Dropped'",
          "words": [
            "BEAT",
            "HINT",
            "LINE",
            "SUBJECT"
          ]
        },
        {
          "difficulty": "blue",
          "category": "Units of Measurement",
          "words": [
            "FOOT",
            "GRAM",
            "SECOND",
            "STONE"
          ]
        },
        {
          "difficulty": "purple",
          "category": "Homophones of Body Parts",
          "words": [
            "EYE",
            "HARE",
            "MUSSEL",
            "QUEUE"
          ]
        }
      ],
      "intended_traps": [
        {
          "word": "PINE",
          "actual_group": "yellow",
          "trap_group": "green",
          "reason": "To 'pine' for someone is a feeling, but 'pine' could be mistaken for something you can 'drop' (like a pine needle)."
        },
        {
          "word": "FOOT",
          "actual_group": "blue",
          "trap_group": "yellow",
          "reason": "A 'foot' is a unit of measurement, but it is also a body part, potentially leading to the 'Body Part Homophones' group."
        },
        {
          "word": "BEAT",
          "actual_group": "green",
          "trap_group": "blue",
          "reason": "A 'beat' is a rhythm, but it could be mistaken for a unit of time/measurement in a musical context."
        },
        {
          "word": "EYE",
          "actual_group": "purple",
          "trap_group": "blue",
          "reason": "The word 'eye' is a body part, which might tempt players looking for a body part category, but it is actually a homophone (I) for the purple group."
        }
      ]
    },
    "eval_count": 510,
    "prompt_eval_count": 440
  },
  {
    "model": "gemma4:26b",
    "i": 3,
    "dt": 5.103548049926758,
    "ok": true,
    "attempts": 1,
    "puzzle": {
      "theme_seed": "The concept of 'Pressure' and things that can be measured or applied.",
      "groups": [
        {
          "difficulty": "yellow",
          "category": "Units of Pressure",
          "words": [
            "BAR",
            "PASCAL",
            "PSI",
            "TORR"
          ]
        },
        {
          "difficulty": "green",
          "category": "Words after 'BLOOD'",
          "words": [
            "PRESSURE",
            "TYPE",
            "ORANGE",
            "DONOR"
          ]
        },
        {
          "difficulty": "blue",
          "category": "Things that can be 'Broken'",
          "words": [
            "HEART",
            "RECORD",
            "LAW",
            "PROMISE"
          ]
        },
        {
          "difficulty": "purple",
          "category": "Words that are also Verbs for 'Applying Force'",
          "words": [
            "PRESS",
            "STAMP",
            "PUNCH",
            "DRIVE"
          ]
        }
      ],
      "intended_traps": [
        {
          "word": "PRESSURE",
          "actual_group": "green",
          "trap_group": "yellow",
          "reason": "Pressure is the concept being measured by the units in the yellow group."
        },
        {
          "word": "BAR",
          "actual_group": "yellow",
          "trap_group": "blue",
          "reason": "A 'bar' is something that can be 'broken' (like a chocolate bar) or refers to a legal setting."
        },
        {
          "word": "PRESS",
          "actual_group": "purple",
          "trap_group": "green",
          "reason": "The 'Press' is a common term in news/media, and 'Press' can be related to 'Blood' in a medical context (blood pressure)."
        },
        {
          "word": "RECORD",
          "actual_group": "blue",
          "trap_group": "purple",
          "reason": "A 'record' is something you can 'stamp' or 'drive' (as in a driving record)."
        }
      ]
    },
    "eval_count": 492,
    "prompt_eval_count": 440
  },
  {
    "model": "gemma4:26b",
    "i": 4,
    "dt": 5.257405757904053,
    "ok": true,
    "attempts": 1,
    "puzzle": {
      "theme_seed": "A collection of words related to things that can be 'broken' or 'cracked', mixed with other categories like cooking and music.",
      "groups": [
        {
          "difficulty": "yellow",
          "category": "Kinds of Eggs",
          "words": [
            "CHICKEN",
            "QUAIL",
            "GOOSE",
            "DUCK"
          ]
        },
        {
          "difficulty": "green",
          "category": "Things You Can Crack",
          "words": [
            "NUT",
            "CODE",
            "WHIP",
            "SMILE"
          ]
        },
        {
          "difficulty": "blue",
          "category": "Musical Terms",
          "words": [
            "NOTE",
            "CHORD",
            "CLEF",
            "BEAT"
          ]
        },
        {
          "difficulty": "purple",
          "category": "Words Before 'Egg'",
          "words": [
            "NEST",
            "EASTER",
            "BAD",
            "NESTING"
          ]
        }
      ],
      "intended_traps": [
        {
          "word": "DUCK",
          "actual_group": "yellow",
          "trap_group": "green",
          "reason": "DUCK is an action related to avoiding a whip or a blow, and can be paired with 'crack' in some linguistic contexts."
        },
        {
          "word": "CHICKEN",
          "actual_group": "yellow",
          "trap_group": "purple",
          "reason": "CHICKEN is an egg-related word that could tempt someone looking for '___ Egg' phrases (like Chicken Egg)."
        },
        {
          "word": "BEAT",
          "actual_group": "blue",
          "trap_group": "green",
          "reason": "BEAT is something you can do to a whip or a person, fitting the 'Things you can crack/strike' theme."
        },
        {
          "word": "NEST",
          "actual_group": "purple",
          "trap_group": "yellow",
          "reason": "NEST is intrinsically linked to the birds in the yellow group (CHICKEN, GOOSE, etc.)."
        }
      ]
    },
    "eval_count": 506,
    "prompt_eval_count": 440
  },
  {
    "model": "gemma4:26b",
    "i": 5,
    "dt": 4.578032493591309,
    "ok": true,
    "attempts": 1,
    "puzzle": {
      "theme_seed": "A collection of words centered around various types of 'records' and 'tracks'.",
      "groups": [
        {
          "difficulty": "yellow",
          "category": "Types of Records",
          "words": [
            "LP",
            "EP",
            "Single",
            "Album"
          ]
        },
        {
          "difficulty": "green",
          "category": "Things that have Tracks",
          "words": [
            "Train",
            "Athlete",
            "Song",
            "Mud"
          ]
        },
        {
          "difficulty": "blue",
          "category": "Words for 'Evidence'",
          "words": [
            "Clue",
            "Trace",
            "Mark",
            "Proof"
          ]
        },
        {
          "difficulty": "purple",
          "category": "Words that follow 'Side'",
          "words": [
            "Effect",
            "Kick",
            "Split",
            "Walk"
          ]
        }
      ],
      "intended_traps": [
        {
          "word": "Single",
          "actual_group": "yellow",
          "trap_group": "blue",
          "reason": "A 'single' can be a piece of evidence or a solitary item."
        },
        {
          "word": "Song",
          "actual_group": "green",
          "trap_group": "yellow",
          "reason": "A song is often released as an LP, EP, or Single."
        },
        {
          "word": "Trace",
          "actual_group": "blue",
          "trap_group": "green",
          "reason": "A trace can be a physical track left behind in mud."
        },
        {
          "word": "Split",
          "actual_group": "purple",
          "trap_group": "green",
          "reason": "A 'split' can refer to a track or division in a race."
        }
      ]
    },
    "eval_count": 436,
    "prompt_eval_count": 440
  },
  {
    "model": "gemma4:31b-it-q4_K_M",
    "i": 1,
    "dt": 18.89196801185608,
    "ok": true,
    "attempts": 1,
    "puzzle": {
      "theme_seed": "Words related to circles, timing, and subtle wordplay with 'Hands'",
      "groups": [
        {
          "difficulty": "yellow",
          "category": "Circular Objects",
          "words": [
            "RING",
            "COIN",
            "WHEEL",
            "DISC"
          ]
        },
        {
          "difficulty": "green",
          "category": "Units of Time",
          "words": [
            "SECOND",
            "MINUTE",
            "HOUR",
            "DAY"
          ]
        },
        {
          "difficulty": "blue",
          "category": "Ways to Assist",
          "words": [
            "HELP",
            "AID",
            "BACK",
            "SUPPORT"
          ]
        },
        {
          "difficulty": "purple",
          "category": "Things with Hands",
          "words": [
            "CLOCK",
            "WATCH",
            "DECK",
            "PLAYER"
          ]
        }
      ],
      "intended_traps": [
        {
          "word": "SECOND",
          "actual_group": "green",
          "trap_group": "blue",
          "reason": "To 'second' a motion is to support it."
        },
        {
          "word": "WATCH",
          "actual_group": "purple",
          "trap_group": "green",
          "reason": "A watch is a device used to measure the units of time (second, minute, hour)."
        },
        {
          "word": "CLOCK",
          "actual_group": "purple",
          "trap_group": "green",
          "reason": "Similar to watch, it is intrinsically linked to the time units group."
        },
        {
          "word": "BACK",
          "actual_group": "blue",
          "trap_group": "yellow",
          "reason": "Could be mistaken as a part of a circular object or a physical position."
        }
      ]
    },
    "eval_count": 439,
    "prompt_eval_count": 440
  },
  {
    "model": "gemma4:31b-it-q4_K_M",
    "i": 2,
    "dt": 16.603100776672363,
    "ok": true,
    "attempts": 1,
    "puzzle": {
      "theme_seed": "A mix of architectural elements, types of scales, and linguistic wordplay",
      "groups": [
        {
          "difficulty": "yellow",
          "category": "Parts of a Building",
          "words": [
            "BEAM",
            "COLUMN",
            "EAVE",
            "FOUNDATION"
          ]
        },
        {
          "difficulty": "green",
          "category": "Types of Scales",
          "words": [
            "LIBRA",
            "MUSICAL",
            "WEIGHING",
            "TIMELINE"
          ]
        },
        {
          "difficulty": "blue",
          "category": "Synonyms for Balance",
          "words": [
            "EQUILIBRIUM",
            "POISE",
            "STEADY",
            "STABILITY"
          ]
        },
        {
          "difficulty": "purple",
          "category": "Words that start with a Greek Letter",
          "words": [
            "ALPHABET",
            "BETRAYAL",
            "GAMUT",
            "DELUXE"
          ]
        }
      ],
      "intended_traps": [
        {
          "word": "LIBRA",
          "actual_group": "green",
          "trap_group": "blue",
          "reason": "Libra is the sign of the scales, which represents balance/equilibrium."
        },
        {
          "word": "COLUMN",
          "actual_group": "yellow",
          "trap_group": "green",
          "reason": "A column can be a way to organize data on a scale or chart."
        },
        {
          "word": "GAMUT",
          "actual_group": "purple",
          "trap_group": "green",
          "reason": "The 'gamut' refers to the full range or scale of something."
        }
      ]
    },
    "eval_count": 410,
    "prompt_eval_count": 440
  },
  {
    "model": "gemma4:31b-it-q4_K_M",
    "i": 3,
    "dt": 17.279732942581177,
    "ok": true,
    "attempts": 1,
    "puzzle": {
      "theme_seed": "Words related to vision, optics, and ocular puns",
      "groups": [
        {
          "difficulty": "yellow",
          "category": "Parts of the Eye",
          "words": [
            "CORNEA",
            "PUPIL",
            "IRIS",
            "RETINA"
          ]
        },
        {
          "difficulty": "green",
          "category": "Types of Lenses",
          "words": [
            "CONCAVE",
            "CONVEX",
            "CONTACT",
            "BIFOCAL"
          ]
        },
        {
          "difficulty": "blue",
          "category": "Words meaning 'To Witness'",
          "words": [
            "OBSERVE",
            "SPOT",
            "NOTE",
            "VIEW"
          ]
        },
        {
          "difficulty": "purple",
          "category": "___ EYE",
          "words": [
            "BULLS",
            "BIRD",
            "FISH",
            "PRIVATE"
          ]
        }
      ],
      "intended_traps": [
        {
          "word": "PUPIL",
          "actual_group": "yellow",
          "trap_group": "blue",
          "reason": "A pupil is a student, and one might 'note' or 'observe' a pupil in a classroom setting."
        },
        {
          "word": "PRIVATE",
          "actual_group": "purple",
          "trap_group": "blue",
          "reason": "Private can be a synonym for 'secret' or 'hidden', contrasting with the act of observing or spotting."
        },
        {
          "word": "CONTACT",
          "actual_group": "green",
          "trap_group": "blue",
          "reason": "To 'contact' someone can be confused with 'spotting' or 'viewing' them in a social context."
        }
      ]
    },
    "eval_count": 422,
    "prompt_eval_count": 440
  },
  {
    "model": "gemma4:31b-it-q4_K_M",
    "i": 4,
    "dt": 19.867947340011597,
    "ok": true,
    "attempts": 1,
    "puzzle": {
      "theme_seed": "Common words that can also be types of birds or musical terms",
      "groups": [
        {
          "difficulty": "yellow",
          "category": "Things that can be 'Cast'",
          "words": [
            "SPELL",
            "VOTE",
            "SHADOW",
            "NET"
          ]
        },
        {
          "difficulty": "green",
          "category": "Musical Dynamics",
          "words": [
            "PIANO",
            "FORTE",
            "SHARP",
            "FLAT"
          ]
        },
        {
          "difficulty": "blue",
          "category": "Types of Birds",
          "words": [
            "CRANE",
            "SWALLOW",
            "SWIFT",
            "LOON"
          ]
        },
        {
          "difficulty": "purple",
          "category": "Words that mean 'to move quickly'",
          "words": [
            "DART",
            "BOLT",
            "RUSH",
            "FLY"
          ]
        }
      ],
      "intended_traps": [
        {
          "word": "SWIFT",
          "actual_group": "blue",
          "trap_group": "purple",
          "reason": "Swift is a synonym for moving quickly, fitting the 'move quickly' group."
        },
        {
          "word": "FLY",
          "actual_group": "purple",
          "trap_group": "blue",
          "reason": "Fly is a common bird-related term (or a type of insect), making it look like it belongs with Crane and Swallow."
        },
        {
          "word": "SWALLOW",
          "actual_group": "blue",
          "trap_group": "purple",
          "reason": "While a bird, if the user interprets 'move quickly' as 'gulp down', it might confuse them, though primarily it serves as a bird lure."
        },
        {
          "word": "SHARP",
          "actual_group": "green",
          "trap_group": "purple",
          "reason": "A 'sharp' turn or movement can be associated with quickness/darting."
        }
      ]
    },
    "eval_count": 487,
    "prompt_eval_count": 440
  },
  {
    "model": "gemma4:31b-it-q4_K_M",
    "i": 5,
    "dt": 18.18777823448181,
    "ok": true,
    "attempts": 2,
    "puzzle": {
      "theme_seed": "Words related to types of 'scales' and musical terminology",
      "groups": [
        {
          "difficulty": "yellow",
          "category": "Musical Scales",
          "words": [
            "MAJOR",
            "MINOR",
            "CHROMATIC",
            "PENTATONIC"
          ]
        },
        {
          "difficulty": "green",
          "category": "Things that Weigh",
          "words": [
            "BALANCE",
            "SPRING",
            "DIGITAL",
            "TRIPLE-BEAM"
          ]
        },
        {
          "difficulty": "blue",
          "category": "Ways to Climb",
          "words": [
            "ASCEND",
            "MOUNT",
            "SCALE",
            "SURMOUNT"
          ]
        },
        {
          "difficulty": "purple",
          "category": "___ NOTE",
          "words": [
            "FOOT",
            "BANK",
            "KEY",
            "HIGH"
          ]
        }
      ],
      "intended_traps": [
        {
          "word": "SCALE",
          "actual_group": "blue",
          "trap_group": "yellow",
          "reason": "It is the root word for musical scales, leading the player to look for other scale-related terms."
        },
        {
          "word": "KEY",
          "actual_group": "purple",
          "trap_group": "yellow",
          "reason": "A 'key' is fundamentally linked to musical scales (e.g., the Key of C Major)."
        },
        {
          "word": "HIGH",
          "actual_group": "purple",
          "trap_group": "blue",
          "reason": "High is an adjective often associated with climbing or ascending."
        },
        {
          "word": "BALANCE",
          "actual_group": "green",
          "trap_group": "blue",
          "reason": "Balance can be seen as a state of being when climbing or mountaineering."
        }
      ]
    },
    "eval_count": 453,
    "prompt_eval_count": 440
  }
]
@@ -0,0 +1,278 @@
# Gemma 4 Generation Bakeoff -- 2026-04-27-221751

## Setup

- Local Ollama on the test host (RTX 3090 Ti, 24 GB VRAM)
- Other GPU workloads paused for the duration of the run
- Models: `gemma4:26b`, `gemma4:31b-it-q4_K_M`
- 5 puzzles per model, base temperature 0.8
- Gemma 4 settings (per `~/bin/gemma4-research/GOTCHAS.md`): `think=false`, `num_ctx=8192`, `num_predict=4096`. No `format=json` (infinite-loop bug). JSON extracted client-side via `body[body.find('{'):body.rfind('}')+1]`.
- Up to 3 attempts per puzzle with temperature bumped +0.1 each retry (AI_Visualizer pattern). Reported metrics use the *successful* attempt.
- One-shot, unaided generation. No critique pass, no example puzzle in prompt.
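
The extraction and retry settings listed above can be sketched roughly as follows. This is a minimal illustration, not the bench's actual code: `extract_json`, `generate_with_retry`, and the `call_model` callable (temperature in, raw response text out) are hypothetical names, and treating an unparseable response as the retry trigger is an assumption, since the setup notes do not say what counts as a failed attempt.

```python
import json
from typing import Callable, Optional


def extract_json(body: str) -> Optional[dict]:
    """Slice from the first '{' to the last '}' and parse; return None on failure."""
    start, end = body.find("{"), body.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(body[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def generate_with_retry(
    call_model: Callable[[float], str],  # hypothetical: temperature -> raw model text
    base_temperature: float = 0.8,
    max_attempts: int = 3,
    bump: float = 0.1,
) -> dict:
    """Try up to max_attempts times, raising the temperature by `bump` per retry."""
    for attempt in range(1, max_attempts + 1):
        temperature = base_temperature + bump * (attempt - 1)
        puzzle = extract_json(call_model(temperature))
        if puzzle is not None:  # assumption: "success" means the JSON parsed
            return {"ok": True, "attempts": attempt, "puzzle": puzzle}
    return {"ok": False, "attempts": max_attempts, "puzzle": None}
```

Passing the model call in as a function keeps the sketch independent of any particular HTTP client, so the retry logic can be exercised with a stub before pointing it at a live Ollama instance.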

## Timing

| Model | n | avg s | avg tokens | tok/s |
|---|---|---|---|---|
| `gemma4:26b` | 5 | 5.2 | 489 | 94.3 |
| `gemma4:31b-it-q4_K_M` | 5 | 18.2 | 442 | 24.3 |
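
As a cross-check, the timing rows can be recomputed from the per-run `dt` and `eval_count` fields in the results JSON. The sketch below assumes tok/s is average tokens divided by average seconds, which reproduces the `gemma4:26b` row from that model's five records; `timing_summary` is a hypothetical helper name, not part of the test bench.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def timing_summary(records: List[Dict]) -> List[Tuple[str, int, float, float, float]]:
    """Per model: (model, n, avg seconds, avg tokens, tokens per second)."""
    by_model = defaultdict(list)
    for record in records:
        if record.get("ok"):
            by_model[record["model"]].append(record)

    rows = []
    for model, runs in by_model.items():
        n = len(runs)
        avg_s = sum(r["dt"] for r in runs) / n
        avg_tokens = sum(r["eval_count"] for r in runs) / n
        # Ratio of averages (assumed); a mean of per-run ratios differs slightly.
        rows.append((model, n, avg_s, avg_tokens, avg_tokens / avg_s))
    return rows


# dt and eval_count copied from the five gemma4:26b records in the results file.
records = [
    {"model": "gemma4:26b", "ok": True, "dt": 5.670714855194092, "eval_count": 500},
    {"model": "gemma4:26b", "ok": True, "dt": 5.300434589385986, "eval_count": 510},
    {"model": "gemma4:26b", "ok": True, "dt": 5.103548049926758, "eval_count": 492},
    {"model": "gemma4:26b", "ok": True, "dt": 5.257405757904053, "eval_count": 506},
    {"model": "gemma4:26b", "ok": True, "dt": 4.578032493591309, "eval_count": 436},
]
for model, n, avg_s, avg_tokens, tps in timing_summary(records):
    print(f"| `{model}` | {n} | {avg_s:.1f} | {avg_tokens:.0f} | {tps:.1f} |")
```

Run as written, this prints the same `gemma4:26b` row shown in the table.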

## JSON parse rate

- `gemma4:26b`: 5/5 parsed cleanly
- `gemma4:31b-it-q4_K_M`: 5/5 parsed cleanly

## gemma4:26b -- puzzle 1

**Theme seed:** An exploration of common words that can be paired with 'BIRD' or 'DRAGON' or relate to types of 'FLY'.

| Diff | Category | Words |
|---|---|---|
| yellow | Types of Flies | FRUIT, HOUSE, FIRE, BLOW |
| green | Words before 'FLY' | DRAGON, MAY, BUTTER, BOTTLE |
| blue | Types of Birds | LARK, SWALLOW, CRANE, SWIFT |
| purple | Words that follow 'DRAGON' | FRUIT, FLY, SLAYER, FLAME |

**Claimed traps (4):**
- `FRUIT` (actually purple, traps yellow): Fruit fly is a very common type of fly, making it a primary red herring for the yellow group.
- `FIRE` (actually yellow, traps purple): Fire dragon or dragon flame connection could lead someone to look for a dragon-themed group.
- `DRAGON` (actually green, traps blue): Dragon is a mythical creature, and users might look for a 'mythical creatures' group including Crane or Swallow (which are real birds).
- `FLY` (actually purple, traps green): Dragonfly is a common word, tempting the player to put 'FLY' in the 'Words before FLY' group.

**Grade: FAIL.** `FRUIT` appears in **two groups** (yellow Types of Flies + purple Words that follow DRAGON), so the board has only 15 distinct tiles, not the required 16. Categories are also redundant: yellow ("Types of Flies") and green ("Words before FLY") are both fly-themed. A human curator would never ship this.

_dt=5.7s, tokens=500_
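
The duplicate-tile failure in this puzzle is the kind of defect a mechanical check can catch before any human grading. The sketch below is one possible shape for such a check; `structural_problems` is a hypothetical helper, and the rules it encodes (4 groups, 4 words each, 16 distinct tiles compared case-insensitively) are taken from the grading notes rather than from the bench's own code.

```python
from collections import Counter
from typing import Dict, List


def structural_problems(puzzle: Dict) -> List[str]:
    """Return human-readable structural problems; an empty list means the shape is OK."""
    problems: List[str] = []
    groups = puzzle.get("groups", [])
    if len(groups) != 4:
        problems.append(f"expected 4 groups, found {len(groups)}")

    tiles: List[str] = []
    for group in groups:
        words = [w.strip().upper() for w in group.get("words", [])]
        if len(words) != 4:
            problems.append(f"{group.get('difficulty')}: expected 4 words, found {len(words)}")
        tiles.extend(words)

    # A tile used in two groups shrinks the board below 16 distinct words.
    for word, count in Counter(tiles).items():
        if count > 1:
            problems.append(f"{word} appears {count} times")
    if len(set(tiles)) != 16:
        problems.append(f"{len(set(tiles))} distinct tiles, expected 16")
    return problems


# Groups copied from gemma4:26b puzzle 1.
puzzle_1 = {
    "groups": [
        {"difficulty": "yellow", "words": ["FRUIT", "HOUSE", "FIRE", "BLOW"]},
        {"difficulty": "green", "words": ["DRAGON", "MAY", "BUTTER", "BOTTLE"]},
        {"difficulty": "blue", "words": ["LARK", "SWALLOW", "CRANE", "SWIFT"]},
        {"difficulty": "purple", "words": ["FRUIT", "FLY", "SLAYER", "FLAME"]},
    ]
}
print(structural_problems(puzzle_1))
# ['FRUIT appears 2 times', '15 distinct tiles, expected 16']
```

A check like this only covers shape. Category redundancy and trap quality still need a judging pass.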

## gemma4:26b -- puzzle 2

**Theme seed:** Homophones and linguistic overlaps between nature and everyday objects.

| Diff | Category | Words |
|---|---|---|
| yellow | Types of Trees | BIRCH, CEDAR, MAPLE, PINE |
| green | Things that can be 'Dropped' | BEAT, HINT, LINE, SUBJECT |
| blue | Units of Measurement | FOOT, GRAM, SECOND, STONE |
| purple | Homophones of Body Parts | EYE, HARE, MUSSEL, QUEUE |

**Claimed traps (4):**
- `PINE` (actually yellow, traps green): To 'pine' for someone is a feeling, but 'pine' could be mistaken for something you can 'drop' (like a pine needle).
- `FOOT` (actually blue, traps yellow): A 'foot' is a unit of measurement, but it is also a body part, potentially leading to the 'Body Part Homophones' group.
- `BEAT` (actually green, traps blue): A 'beat' is a rhythm, but it could be mistaken for a unit of time/measurement in a musical context.
- `EYE` (actually purple, traps blue): The word 'eye' is a body part, which might tempt players looking for a body part category, but it is actually a homophone (I) for the purple group.

**Grade: PARTIAL FAIL.** Yellow/green/blue all tight. Purple "Homophones of Body Parts" is half-broken: `HARE`→hair ✓, `MUSSEL`→muscle ✓, but `EYE` *is* a body part (not a homophone of one) and `QUEUE`→cue is not a body-part homophone. 2/4 wrong on the trickiest group is a curator-rejection. Trap reasoning is weak (claims `EYE` traps blue Units, but body parts aren't units).

_dt=5.3s, tokens=510_

## gemma4:26b -- puzzle 3

**Theme seed:** The concept of 'Pressure' and things that can be measured or applied.

| Diff | Category | Words |
|---|---|---|
| yellow | Units of Pressure | BAR, PASCAL, PSI, TORR |
| green | Words after 'BLOOD' | PRESSURE, TYPE, ORANGE, DONOR |
| blue | Things that can be 'Broken' | HEART, RECORD, LAW, PROMISE |
| purple | Words that are also Verbs for 'Applying Force' | PRESS, STAMP, PUNCH, DRIVE |

**Claimed traps (4):**
- `PRESSURE` (actually green, traps yellow): Pressure is the concept being measured by the units in the yellow group.
- `BAR` (actually yellow, traps blue): A 'bar' is something that can be 'broken' (like a chocolate bar) or refers to a legal setting.
- `PRESS` (actually purple, traps green): The 'Press' is a common term in news/media, and 'Press' can be related to 'Blood' in a medical context (blood pressure).
- `RECORD` (actually blue, traps purple): A 'record' is something you can 'stamp' or 'drive' (as in a driving record).

**Grade: BORDERLINE PASS.** Categories tight, 16 distinct words ✓. `BAR` (yellow → blue trap) is a real overlap (chocolate bar, broken bar). `PRESS` is claimed to trap green ("Words after BLOOD"), but blood **pressure** is the compound, not blood **press**, so the claimed trap doesn't actually hold. `RECORD`→stamp/drive is a stretch. 1/4 of Gemma's self-graded traps is broken; remainder OK.

_dt=5.1s, tokens=492_

## gemma4:26b -- puzzle 4

**Theme seed:** A collection of words related to things that can be 'broken' or 'cracked', mixed with other categories like cooking and music.

| Diff | Category | Words |
|---|---|---|
| yellow | Kinds of Eggs | CHICKEN, QUAIL, GOOSE, DUCK |
| green | Things You Can Crack | NUT, CODE, WHIP, SMILE |
| blue | Musical Terms | NOTE, CHORD, CLEF, BEAT |
| purple | Words Before 'Egg' | NEST, EASTER, BAD, NESTING |

**Claimed traps (4):**
- `DUCK` (actually yellow, traps green): DUCK is an action related to avoiding a whip or a blow, and can be paired with 'crack' in some linguistic contexts.
- `CHICKEN` (actually yellow, traps purple): CHICKEN is an egg-related word that could tempt someone looking for '___ Egg' phrases (like Chicken Egg).
- `BEAT` (actually blue, traps green): BEAT is something you can do to a whip or a person, fitting the 'Things you can crack/strike' theme.
- `NEST` (actually purple, traps yellow): NEST is intrinsically linked to the birds in the yellow group (CHICKEN, GOOSE, etc.).

**Grade: FAIL.** Purple has `NESTING` as "Words Before EGG", but *nesting egg* isn't a phrase (*nest egg* is). Worse, `NEST` and `NESTING` are near-duplicate tiles. Yellow "Kinds of Eggs" and purple "Words Before EGG" are also redundant (chicken egg vs nest egg, both compound forms). Curator reject.

_dt=5.3s, tokens=506_
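
Near-duplicate tiles such as `NEST` / `NESTING` pass a plain distinct-word count, so they need a separate check. One crude heuristic, sketched here under the assumption that "near-duplicate" can be approximated by one tile being a prefix of another, is shown below. `near_duplicate_tiles` is a hypothetical helper, and the prefix rule will also flag legitimate pairs (for example `CAR` / `CARPET`), so its output is a prompt for human review rather than an automatic reject.

```python
from typing import Iterable, List, Tuple


def near_duplicate_tiles(words: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (shorter, longer) pairs where one tile is a prefix of another."""
    tiles = sorted({w.strip().upper() for w in words})
    pairs = []
    for i, shorter in enumerate(tiles):
        # In sorted order a prefix always precedes the words that extend it.
        for longer in tiles[i + 1:]:
            if longer.startswith(shorter):
                pairs.append((shorter, longer))
    return pairs


# All 16 tiles from gemma4:26b puzzle 4.
puzzle_4_tiles = [
    "CHICKEN", "QUAIL", "GOOSE", "DUCK",
    "NUT", "CODE", "WHIP", "SMILE",
    "NOTE", "CHORD", "CLEF", "BEAT",
    "NEST", "EASTER", "BAD", "NESTING",
]
print(near_duplicate_tiles(puzzle_4_tiles))
# [('NEST', 'NESTING')]
```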

## gemma4:26b -- puzzle 5

**Theme seed:** A collection of words centered around various types of 'records' and 'tracks'.

| Diff | Category | Words |
|---|---|---|
| yellow | Types of Records | LP, EP, Single, Album |
| green | Things that have Tracks | Train, Athlete, Song, Mud |
| blue | Words for 'Evidence' | Clue, Trace, Mark, Proof |
| purple | Words that follow 'Side' | Effect, Kick, Split, Walk |

**Claimed traps (4):**
- `Single` (actually yellow, traps blue): A 'single' can be a piece of evidence or a solitary item.
- `Song` (actually green, traps yellow): A song is often released as an LP, EP, or Single.
- `Trace` (actually blue, traps green): A trace can be a physical track left behind in mud.
- `Split` (actually purple, traps green): A 'split' can refer to a track or division in a race.

**Grade: PASS.** Cleanest 26b output. 16 distinct words; categories tight; the word "track" carries **four different meanings** across green (railroad/running/album/footprints in mud), with `SONG`→Records and `TRACE`→Tracks as real overlaps. That's the multi-meaning anchor pattern Connections actually uses. `SINGLE` and `SPLIT` traps are weak, but the puzzle structure is publication-shaped.

_dt=4.6s, tokens=436_

## gemma4:31b-it-q4_K_M -- puzzle 1

**Theme seed:** Words related to circles, timing, and subtle wordplay with 'Hands'

| Diff | Category | Words |
|---|---|---|
| yellow | Circular Objects | RING, COIN, WHEEL, DISC |
| green | Units of Time | SECOND, MINUTE, HOUR, DAY |
| blue | Ways to Assist | HELP, AID, BACK, SUPPORT |
| purple | Things with Hands | CLOCK, WATCH, DECK, PLAYER |

**Claimed traps (4):**
- `SECOND` (actually green, traps blue): To 'second' a motion is to support it.
- `WATCH` (actually purple, traps green): A watch is a device used to measure the units of time (second, minute, hour).
- `CLOCK` (actually purple, traps green): Similar to watch, it is intrinsically linked to the time units group.
- `BACK` (actually blue, traps yellow): Could be mistaken as a part of a circular object or a physical position.

**Grade: BORDERLINE PASS.** `DECK` in "Things with Hands" is shaky: decks have cards, players have hands; the phrase is "all hands on deck" (sailors), not "deck has hands." Other 3 purple members fit. Strong saving grace: `SECOND`→assist (second a motion), `WATCH`→time, `CLOCK`→time are 3 genuinely strong traps. Categories otherwise tight.

_dt=18.9s, tokens=439_

## gemma4:31b-it-q4_K_M -- puzzle 2

**Theme seed:** A mix of architectural elements, types of scales, and linguistic wordplay

| Diff | Category | Words |
|---|---|---|
| yellow | Parts of a Building | BEAM, COLUMN, EAVE, FOUNDATION |
| green | Types of Scales | LIBRA, MUSICAL, WEIGHING, TIMELINE |
| blue | Synonyms for Balance | EQUILIBRIUM, POISE, STEADY, STABILITY |
| purple | Words that start with a Greek Letter | ALPHABET, BETRAYAL, GAMUT, DELUXE |

**Claimed traps (3):**
- `LIBRA` (actually green, traps blue): Libra is the sign of the scales, which represents balance/equilibrium.
- `COLUMN` (actually yellow, traps green): A column can be a way to organize data on a scale or chart.
- `GAMUT` (actually purple, traps green): The 'gamut' refers to the full range or scale of something.

**Grade: FAIL.** Two broken categories. Green "Types of Scales" includes `LIBRA` (zodiac sign, not a scale type) and `TIMELINE` (a line, not a scale). Purple "Words that start with a Greek Letter" holds only for `ALPHABET` (ALPHA + BET). `BETRAYAL` starts with "BETR", not the full letter name "BETA"; `GAMUT` starts with "GAMU", not "GAMMA"; `DELUXE` starts with "DELU", not "DELTA". 3 of 4 purple words don't actually start with the complete Greek-letter name.

_dt=16.6s, tokens=410_
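
Letter-pattern categories like "Words that start with a Greek Letter" are among the few claims that can be verified mechanically instead of by judgment. The sketch below assumes the category means "begins with the full English name of a Greek letter"; `greek_prefix` is a hypothetical helper. Short names such as `MU`, `NU`, and `PI` prefix many ordinary words, so a production check would probably want a stricter rule.

```python
from typing import Optional

GREEK_LETTERS = [
    "ALPHA", "BETA", "GAMMA", "DELTA", "EPSILON", "ZETA", "ETA", "THETA",
    "IOTA", "KAPPA", "LAMBDA", "MU", "NU", "XI", "OMICRON", "PI",
    "RHO", "SIGMA", "TAU", "UPSILON", "PHI", "CHI", "PSI", "OMEGA",
]


def greek_prefix(word: str) -> Optional[str]:
    """Return the longest Greek-letter name the word starts with, or None."""
    upper = word.strip().upper()
    matches = [letter for letter in GREEK_LETTERS if upper.startswith(letter)]
    return max(matches, key=len) if matches else None


# Purple group from gemma4:31b-it-q4_K_M puzzle 2.
for tile in ["ALPHABET", "BETRAYAL", "GAMUT", "DELUXE"]:
    print(tile, greek_prefix(tile))
# ALPHABET ALPHA
# BETRAYAL None
# GAMUT None
# DELUXE None
```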

## gemma4:31b-it-q4_K_M -- puzzle 3

**Theme seed:** Words related to vision, optics, and ocular puns

| Diff | Category | Words |
|---|---|---|
| yellow | Parts of the Eye | CORNEA, PUPIL, IRIS, RETINA |
| green | Types of Lenses | CONCAVE, CONVEX, CONTACT, BIFOCAL |
| blue | Words meaning 'To Witness' | OBSERVE, SPOT, NOTE, VIEW |
| purple | ___ EYE | BULLS, BIRD, FISH, PRIVATE |

**Claimed traps (3):**
- `PUPIL` (actually yellow, traps blue): A pupil is a student, and one might 'note' or 'observe' a pupil in a classroom setting.
- `PRIVATE` (actually purple, traps blue): Private can be a synonym for 'secret' or 'hidden', contrasting with the act of observing or spotting.
- `CONTACT` (actually green, traps blue): To 'contact' someone can be confused with 'spotting' or 'viewing' them in a social context.

**Grade: PASS.** All 4 categories tight: parts of eye / lens types / witness verbs / `___EYE` compounds (bull's, bird's, fish, private). `PUPIL`→student is a real, NYT-style multi-meaning trap. Only 3 traps claimed (rubric asked for 2+, ✓). Closest puzzle to publication quality across both models.

_dt=17.3s, tokens=422_
|
||||
|
||||
## gemma4:31b-it-q4_K_M -- puzzle 4
|
||||
|
||||
**Theme seed:** Common words that can also be types of birds or musical terms
|
||||
|
||||
| Diff | Category | Words |
|
||||
|---|---|---|
|
||||
| yellow | Things that can be 'Cast' | SPELL, VOTE, SHADOW, NET |
|
||||
| green | Musical Dynamics | PIANO, FORTE, SHARP, FLAT |
|
||||
| blue | Types of Birds | CRANE, SWALLOW, SWIFT, LOON |
|
||||
| purple | Words that mean 'to move quickly' | DART, BOLT, RUSH, FLY |
|
||||
|
||||
**Claimed traps (4):**
|
||||
- `SWIFT` (actually blue, traps purple): Swift is a synonym for moving quickly, fitting the 'move quickly' group.
|
||||
- `FLY` (actually purple, traps blue): Fly is a common bird-related term (or a type of insect), making it look like it belongs with Crane and Swallow.
|
||||
- `SWALLOW` (actually blue, traps purple): While a bird, if the user interprets 'move quickly' as 'gulp down', it might confuse them, though primarily it serves as a bird lure.
|
||||
- `SHARP` (actually green, traps purple): A 'sharp' turn or movement can be associated with quickness/darting.
|
||||
|
||||
**Grade: BORDERLINE PASS.** Green should be "Musical Terms", not "Dynamics": `PIANO` and `FORTE` are dynamics (loudness), but `SHARP` and `FLAT` are accidentals (pitch modifiers). Pedantic but a real category-label miss. Saving graces: `SWIFT` (a bird whose name also means fast) and `FLY` (a move-quickly verb that reads as bird-related) are exactly the kind of multi-meaning anchors a real Connections puzzle would deploy.
|
||||
|
||||
_dt=19.9s, tokens=487_
|
||||
|
||||
## gemma4:31b-it-q4_K_M -- puzzle 5
|
||||
|
||||
**Theme seed:** Words related to types of 'scales' and musical terminology
|
||||
|
||||
| Diff | Category | Words |
|
||||
|---|---|---|
|
||||
| yellow | Musical Scales | MAJOR, MINOR, CHROMATIC, PENTATONIC |
|
||||
| green | Things that Weigh | BALANCE, SPRING, DIGITAL, TRIPLE-BEAM |
|
||||
| blue | Ways to Climb | ASCEND, MOUNT, SCALE, SURMOUNT |
|
||||
| purple | ___ NOTE | FOOT, BANK, KEY, HIGH |
|
||||
|
||||
**Claimed traps (4):**
|
||||
- `SCALE` (actually blue, traps yellow): It is the root word for musical scales, leading the player to look for other scale-related terms.
|
||||
- `KEY` (actually purple, traps yellow): A 'key' is fundamentally linked to musical scales (e.g., the Key of C Major).
|
||||
- `HIGH` (actually purple, traps blue): High is an adjective often associated with climbing or ascending.
|
||||
- `BALANCE` (actually green, traps blue): Balance can be seen as a state of being when climbing or mountaineering.
|
||||
|
||||
**Grade: PASS.** The whole puzzle is built around `SCALE` carrying three meanings: musical scale (yellow), weighing scale (green's theme — though Gemma mislabels it "Things that Weigh" instead of "Types of Scales"), and "to climb" (blue, where SCALE-the-word lives). That is exactly the central-anchor pattern a real NYT Connections puzzle uses. `KEY`, `HIGH`, `BALANCE` traps all genuinely overlap. Categories slightly mislabeled but structure is publication-quality.
|
||||
|
||||
_dt=18.2s, tokens=453_
|
||||
|
||||
---
|
||||
|
||||
## Aggregate
|
||||
|
||||
| Model | Pass | Borderline | Fail | Avg s | Avg tok/s |
|
||||
|---|---|---|---|---|---|
|
||||
| `gemma4:26b` | 1 (#5) | 1 (#3) + 1 partial (#2) | 2 (#1, #4) | 5.2 | 94.3 |
|
||||
| `gemma4:31b-it-q4_K_M` | 2 (#3, #5) | 2 (#1, #4) | 1 (#2) | 18.2 | 24.3 |
|
||||
|
||||
**31b is materially more reliable** — 2 clean passes vs 26b's 1, and only 1 hard fail vs 26b's 2 hard fails plus a partial-fail. 31b is 3.5× slower per generation but at 18s for a once-per-day puzzle, that's irrelevant. 26b is fast enough for interactive use but produces broken puzzles half the time.
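As a cross-check on the 31b row, the averages can be recomputed from the per-puzzle `_dt=..., tokens=..._` footers above. This is a minimal sketch: the variable names are illustrative, and throughput is computed as total tokens over total seconds, which is an assumption about how the original script averaged.

```python
# Per-puzzle (seconds, tokens) for gemma4:31b-it-q4_K_M, copied from the
# "_dt=..., tokens=..._" footers of puzzles 1-5 above.
runs_31b = [(18.9, 439), (16.6, 410), (17.3, 422), (19.9, 487), (18.2, 453)]

total_s = sum(dt for dt, _ in runs_31b)
total_tok = sum(tok for _, tok in runs_31b)

avg_s = total_s / len(runs_31b)
# Assumption: pooled rate (total tokens / total seconds), not a mean of per-run rates.
tok_per_s = total_tok / total_s
# 5.2 s is the 26b average taken from the aggregate table.
slowdown = round(avg_s, 1) / 5.2

print(f"avg s = {avg_s:.1f}, tok/s = {tok_per_s:.1f}, slowdown = {slowdown:.1f}x")
```

Both figures round to the values in the table (18.2 s and 24.3 tok/s), and the ratio to 26b's 5.2 s gives the 3.5× quoted above.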
|
||||
|
||||
### Failure modes (in order of how often they recur)
|
||||
|
||||
1. **Structural violations** — duplicate or near-duplicate words on the 16-tile board, or a word listed in two groups. (#1-26b: `FRUIT` × 2; #4-26b: `NEST`/`NESTING`.) Catastrophic — a real Connections board has 16 *distinct* tiles. **Trivially detectable** with a deterministic post-filter.
|
||||
2. **Broken category logic** — words placed in a category that don't actually fit. (#2-26b: `EYE`/`QUEUE` aren't body-part homophones; #4-26b: `NESTING` isn't a "Word before EGG"; #2-31b: `LIBRA`/`TIMELINE` aren't scales, `DELUXE` doesn't start with the full Greek letter "DELTA"; #1-31b: `DECK` doesn't have hands.) **Hard to detect deterministically** — needs a critique/judging pass.
|
||||
3. **Redundant categories** — two groups themed on the same concept (#1-26b: yellow + green both fly-themed; #4-26b: yellow + purple both egg-themed). Detectable with a category-similarity check.
|
||||
4. **Weak/circular trap reasoning** — Gemma's claimed "intended_traps" sometimes don't actually hold. (#3-26b: `PRESS` doesn't fit "Words after BLOOD" — the compound is *blood pressure*, not *blood press*.) Means **Gemma cannot reliably grade its own puzzles** — independent judging required.
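Failure mode 1 is the one that needs no model at all. Below is a minimal sketch of such a post-filter, assuming the puzzle JSON shape used in the results file (`groups`, each with `category` and `words`). The function name, the toy board, and the prefix heuristic for near-duplicates are illustrative assumptions, not part of the bench.

```python
def structural_errors(puzzle: dict) -> list[str]:
    """Return structural problems found on a board; an empty list means it passes."""
    errors = []
    groups = puzzle.get("groups", [])
    if len(groups) != 4:
        errors.append(f"expected 4 groups, got {len(groups)}")

    words = []
    for group in groups:
        group_words = group.get("words", [])
        if len(group_words) != 4:
            errors.append(f"group {group.get('category')!r} does not have 4 words")
        words.extend(w.strip().upper() for w in group_words)

    # Exact duplicates, e.g. the same tile listed in two groups.
    dupes = sorted({w for w in words if words.count(w) > 1})
    if dupes:
        errors.append(f"duplicate tiles: {dupes}")

    # Near-duplicates (heuristic, an assumption): one tile of 4+ letters is a
    # prefix of another tile, e.g. NEST / NESTING. Sorting guarantees a prefix
    # always comes before the longer word.
    distinct = sorted(set(words))
    for i, short in enumerate(distinct):
        for longer in distinct[i + 1:]:
            if len(short) >= 4 and longer.startswith(short):
                errors.append(f"near-duplicate tiles: {short} / {longer}")
    return errors


# Toy board (hypothetical, not one of the generated puzzles) with both defects.
toy_board = {"groups": [
    {"category": "A", "words": ["FRUIT", "HOUSE", "FIRE", "BLOW"]},
    {"category": "B", "words": ["DRAGON", "MAY", "BUTTER", "NEST"]},
    {"category": "C", "words": ["LARK", "SWALLOW", "CRANE", "SWIFT"]},
    {"category": "D", "words": ["FRUIT", "FLY", "NESTING", "BALL"]},
]}
for problem in structural_errors(toy_board):
    print(problem)
```

The prefix rule will also flag legitimate pairs such as a word and its compound, so it is better treated as a reason to regenerate than as proof of a defect. Failure modes 2 to 4 are out of reach for a check like this.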
|
||||
|
||||
### Successes (when Gemma gets it right, what it does right)
|
||||
|
||||
- **Multi-meaning anchor words** — `SCALE` (3 meanings, #5-31b), `SWIFT`/`FLY` (bird + fast, #4-31b), `PUPIL` (eye + student, #3-31b), `TRACK` (3 meanings, #5-26b). When Gemma builds a puzzle around an anchor, it produces real Connections-grade overlap.
|
||||
- **Compound-word categories** — "`___ EYE`" (#3-31b), "Side `___`" (#5-26b), "Words before EGG" (#4-26b — when Gemma doesn't poison it). These are the easiest pattern to get right.
|
||||
- **Tight short labels** when Gemma sticks to well-known domains (parts of eye, types of lenses, types of records).
|
||||
|
||||
### Implication for design
|
||||
|
||||
**Generation is viable, but not unaided.** The shape of the data engine:
|
||||
|
||||
```
|
||||
generate (gemma4:31b)
|
||||
→ deterministic filter [check 16 distinct tiles, no dup words, all words appear in categories]
|
||||
→ category-similarity check [reject puzzles with redundant themes]
|
||||
→ critique pass [either gemma4:31b second pass, or qwen3-coder:30b as judge]
|
||||
→ reject + regenerate if any fail; accept once filtered
|
||||
→ cache as the day's puzzle
|
||||
```
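The reject-and-regenerate step of the pipeline above can be sketched as a small loop. This is a sketch under stated assumptions: `generate` and the entries of `checks` are hypothetical callables standing in for the model call, the deterministic filter, the category-similarity check, and the critique pass; none of them exist in the repo yet.

```python
from typing import Callable, Optional


def daily_puzzle(
    generate: Callable[[], dict],
    checks: list[Callable[[dict], bool]],
    max_attempts: int = 5,
) -> Optional[dict]:
    """Generate candidates until one passes every check, or give up.

    Checks run in list order and stop at the first failure, so cheap
    deterministic checks should come before any model-based critique.
    """
    for _ in range(max_attempts):
        candidate = generate()
        if all(check(candidate) for check in checks):
            return candidate
    return None
```

Ordering the checks cheapest-first means the critique model is only called on boards that have already passed the free checks.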
|
||||
|
||||
At ~18s/generation and a roughly 50% structural-pass rate, a daily puzzle costs an expected ~2 generations + 1 critique = maybe 1 minute of GPU time per day. Effectively free.
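The arithmetic behind that estimate, as a short sketch. It assumes each generation passes independently at the quoted rate (so the number of generations follows a geometric distribution), that a critique call costs about as much as one generation, and that the critique itself never rejects; the last two are assumptions, not measurements.

```python
gen_seconds = 18.2       # 31b average from the aggregate table
pass_rate = 0.5          # rough structural-pass rate quoted above
critique_seconds = 18.2  # assumption: a critique call costs about one generation

# Mean of a geometric distribution: expected tries until the first success.
expected_generations = 1 / pass_rate
expected_seconds = expected_generations * gen_seconds + critique_seconds

print(f"{expected_generations:.0f} generations, about {expected_seconds:.0f} s per day")
```

If the critique pass also rejects some boards, the expected cost rises accordingly; that rejection rate is exactly what the untested two-pass experiment would measure.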
|
||||
|
||||
**26b is unsuitable as the primary generator** — too many hard fails. It could plausibly be the *judging* model (cheaper, runs on every player guess) since judging is much easier than generating. But that decision is for the brainstorm.
|
||||
|
||||
### Risks not yet checked
|
||||
|
||||
- **Diversity over time.** All 10 puzzles produced here come from a single seed-less batch. If Gemma keeps reaching for the same themes (we saw "scales" twice on 31b alone), a year-round daily stream might get repetitive. Test this with seeded prompts before committing.
|
||||
- **Connections-vs-Gemma blind anchor not run.** I deferred this — the structural failures in Gemma's output (duplicate words, broken categories) are so obviously curator-rejection-tier that no human-curated puzzle would have them, so the within-Gemma comparison was decisive on its own. Still, before final design, eyeball one Gemma-pass puzzle next to a real NYT puzzle and check whether it actually feels equivalent.
|
||||
- **Two-pass critique not validated.** The proposal above assumes a critique pass would catch Gemma's category mistakes. That assumption has not been tested. The next experiment is "feed Gemma's broken puzzles back to Gemma (or to a different model) and see if it flags the structural issues."
|
||||
|
||||
@@ -0,0 +1,514 @@
|
||||
# Gemma 4 Semantic Bakeoff -- 2026-04-27 22:51
|
||||
|
||||
## Setup
|
||||
- Local Ollama on the test host (RTX 3090 Ti, 24 GB VRAM)
|
||||
- Models: `gemma4:latest`, `gemma4:26b`, `gemma4:31b-it-q4_K_M`
|
||||
- Temperature 0.2 (raised +0.1 per retry on JSON parse fail, max 3 attempts)
|
||||
- `think=false`, `num_ctx=4096`, `num_predict=512`; `format=json` not set (per gemma4-research/GOTCHAS.md)
|
||||
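- 38 test cases: 16 JUDGE, 10 CREATE, 12 CREATIVE_ACCEPT (10 yes/no cases that are auto-scored, plus 2 borderline cases, ca-b1 and ca-b2, scored BORDERLINE rather than pass/fail)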
|
||||
- Ground truth hand-labeled inline in `scripts/gemma-semantic-bakeoff.py`
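The retry policy in the setup (start at 0.2, add 0.1 per JSON parse failure, at most 3 attempts) can be sketched as follows. `call_model` is a hypothetical stand-in for whatever sends the prompt to Ollama and returns the raw text; the actual function in `scripts/gemma-semantic-bakeoff.py` may be structured differently.

```python
import json
from typing import Callable, Optional


def ask_with_retries(
    call_model: Callable[[str, float], str],
    prompt: str,
    base_temperature: float = 0.2,
    step: float = 0.1,
    max_attempts: int = 3,
) -> tuple[Optional[dict], int]:
    """Return (parsed JSON or None, attempts used)."""
    for attempt in range(max_attempts):
        # Temperature rises by `step` on each retry: 0.2, 0.3, 0.4 by default.
        temperature = base_temperature + step * attempt
        raw = call_model(prompt, temperature)
        try:
            return json.loads(raw), attempt + 1
        except json.JSONDecodeError:
            continue  # parse failure: retry at a higher temperature
    return None, max_attempts
```

A `None` result after the last attempt corresponds to a row counted under "parse fails" in the summary table.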
|
||||
|
||||
## Auto-scored summary
|
||||
|
||||
| Model | JUDGE pass | CREATIVE_ACCEPT pass | parse fails | avg s |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | 14/16 | 10/10 | 0 | 0.7 |
|
||||
| `gemma4:26b` | 15/16 | 10/10 | 0 | 0.8 |
|
||||
| `gemma4:31b-it-q4_K_M` | 16/16 | 10/10 | 1 | 2.3 |
|
||||
|
||||
## JUDGE
|
||||
|
||||
### judge-y1
|
||||
|
||||
- Category: `Types of trees`
|
||||
- Words: `OAK`, `MAPLE`, `BIRCH`, `PINE`
|
||||
- Ground truth: **yes**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | yes | All four words are common names of tree species. | PASS | 0.7s |
|
||||
| `gemma4:26b` | yes | All four words are specific, well-known types of trees. | PASS | 0.8s |
|
||||
| `gemma4:31b-it-q4_K_M` | yes | All four words are common types of trees. | PASS | 2.1s |
|
||||
|
||||
### judge-y2
|
||||
|
||||
- Category: `Greek letters`
|
||||
- Words: `ALPHA`, `BETA`, `GAMMA`, `DELTA`
|
||||
- Ground truth: **yes**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | yes | All four words are names of Greek letters. | PASS | 0.6s |
|
||||
| `gemma4:26b` | yes | All four words are standard names of letters in the Greek alphabet. | PASS | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | yes | All four words are the first four letters of the Greek alphabet. | PASS | 2.1s |
|
||||
|
||||
### judge-y3
|
||||
|
||||
- Category: `Days of the week`
|
||||
- Words: `MONDAY`, `FRIDAY`, `SUNDAY`, `WEDNESDAY`
|
||||
- Ground truth: **yes**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | no | The words are not all representative of the days of the week. (misfit: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) | FAIL | 0.8s |
|
||||
| `gemma4:26b` | yes | All four words are specific names of days in the week. | PASS | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | yes | All four words are standard days of the week. | PASS | 1.9s |
|
||||
|
||||
### judge-y4
|
||||
|
||||
- Category: `Synonyms for 'happy'`
|
||||
- Words: `JOYFUL`, `GLAD`, `CHEERFUL`, `ELATED`
|
||||
- Ground truth: **yes**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | yes | All four words are direct and common synonyms for 'happy'. | PASS | 0.6s |
|
||||
| `gemma4:26b` | yes | All four words are direct synonyms representing different degrees of happiness. | PASS | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | yes | All four words are direct synonyms for the state of being happy. | PASS | 2.1s |
|
||||
|
||||
### judge-y5
|
||||
|
||||
- Category: `___ HOUSE (compound words)`
|
||||
- Words: `GREEN`, `ICE`, `FIRE`, `COURT`
|
||||
- Ground truth: **yes**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | yes | All four words can precede 'HOUSE' to form common compound words. | PASS | 0.7s |
|
||||
| `gemma4:26b` | no | While Green, Ice, and Fire House are common, 'Court House' is typically written as two words (Courthouse) or refers to a | FAIL | 0.9s |
|
||||
| `gemma4:31b-it-q4_K_M` | yes | All four words form common compound words or phrases with 'house'. | PASS | 2.1s |
|
||||
|
||||
### judge-y6
|
||||
|
||||
- Category: `Words that follow COLD`
|
||||
- Words: `SHOULDER`, `FRONT`, `SNAP`, `TURKEY`
|
||||
- Ground truth: **yes**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | no | Only SHOULDER, FRONT, and SNAP fit common phrases with COLD, while TURKEY does not. (misfit: TURKEY) | FAIL | 0.8s |
|
||||
| `gemma4:26b` | yes | All four words form common phrases when preceded by 'cold': cold shoulder, cold front, cold snap, and cold turkey. | PASS | 0.9s |
|
||||
| `gemma4:31b-it-q4_K_M` | yes | All four words form common compound words or phrases when preceded by 'cold'. | PASS | 2.1s |
|
||||
|
||||
### judge-y7
|
||||
|
||||
- Category: `Verbs meaning 'to move quickly'`
|
||||
- Words: `DART`, `BOLT`, `RUSH`, `FLY`
|
||||
- Ground truth: **yes**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | yes | All four words are strong synonyms for moving rapidly. | PASS | 0.6s |
|
||||
| `gemma4:26b` | yes | All four words are synonyms for moving at high speed. | PASS | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | yes | All four words are common synonyms for moving rapidly. | PASS | 2.0s |
|
||||
|
||||
### judge-y8
|
||||
|
||||
- Category: `Synonyms for 'idea'`
|
||||
- Words: `NOTION`, `CONCEPT`, `THOUGHT`, `INKLING`
|
||||
- Ground truth: **yes**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | yes | All four words are common synonyms for a mental concept or idea. | PASS | 0.7s |
|
||||
| `gemma4:26b` | yes | All four words are direct synonyms for a mental concept, notion, or vague understanding. | PASS | 0.8s |
|
||||
| `gemma4:31b-it-q4_K_M` | yes | All four words are common synonyms for a mental representation or a suggestion of an idea. | PASS | 2.2s |
|
||||
|
||||
### judge-n1
|
||||
|
||||
- Category: `Types of trees`
|
||||
- Words: `OAK`, `MAPLE`, `BIRCH`, `CARROT`
|
||||
- Ground truth: **no** (misfit: CARROT)
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | no | Carrot is a root vegetable, not a type of tree. (misfit: CARROT) | PASS | 0.7s |
|
||||
| `gemma4:26b` | no | Carrot is a root vegetable, not a type of tree. (misfit: CARROT) | PASS | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | no | Carrot is a root vegetable, not a tree. (misfit: CARROT) | PASS | 2.1s |
|
||||
|
||||
### judge-n2
|
||||
|
||||
- Category: `Greek letters`
|
||||
- Words: `ALPHA`, `BETA`, `GAMMA`, `CYRILLIC`
|
||||
- Ground truth: **no** (misfit: CYRILLIC)
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | no | Cyrillic is a script, not a Greek letter. (misfit: CYRILLIC) | PASS | 0.7s |
|
||||
| `gemma4:26b` | no | CYRILLIC refers to a script, not a specific Greek letter. (misfit: CYRILLIC) | PASS | 0.8s |
|
||||
| `gemma4:31b-it-q4_K_M` | no | Cyrillic is a writing system, not a Greek letter. (misfit: CYRILLIC) | PASS | 2.2s |
|
||||
|
||||
### judge-n3
|
||||
|
||||
- Category: `Synonyms for 'happy'`
|
||||
- Words: `JOYFUL`, `GLAD`, `SAD`, `ELATED`
|
||||
- Ground truth: **no** (misfit: SAD)
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | no | The word 'SAD' is an antonym, not a synonym for 'happy'. (misfit: SAD) | PASS | 0.7s |
|
||||
| `gemma4:26b` | no | The word 'SAD' is an antonym rather than a synonym. (misfit: SAD) | PASS | 0.8s |
|
||||
| `gemma4:31b-it-q4_K_M` | no | SAD is an antonym of happy, not a synonym. (misfit: SAD) | PASS | 2.2s |
|
||||
|
||||
### judge-n4
|
||||
|
||||
- Category: `Days of the week`
|
||||
- Words: `MONDAY`, `JANUARY`, `SUNDAY`, `WEDNESDAY`
|
||||
- Ground truth: **no** (misfit: JANUARY)
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | no | January is a month, not a day of the week. (misfit: JANUARY) | PASS | 0.7s |
|
||||
| `gemma4:26b` | no | JANUARY is a month, not a day of the week. (misfit: JANUARY) | PASS | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | no | JANUARY is a month, not a day of the week. (misfit: JANUARY) | PASS | 2.1s |
|
||||
|
||||
### judge-n5
|
||||
|
||||
- Category: `Body parts`
|
||||
- Words: `ARM`, `LEG`, `EYE`, `NIGHT`
|
||||
- Ground truth: **no** (misfit: NIGHT)
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | no | NIGHT is not a body part. (misfit: NIGHT) | PASS | 0.6s |
|
||||
| `gemma4:26b` | no | The word 'NIGHT' is not a body part. (misfit: NIGHT) | PASS | 0.8s |
|
||||
| `gemma4:31b-it-q4_K_M` | no | NIGHT is a time of day, not a body part. (misfit: NIGHT) | PASS | 2.1s |
|
||||
|
||||
### judge-n6
|
||||
|
||||
- Category: `Types of birds`
|
||||
- Words: `CRANE`, `SWALLOW`, `BAT`, `MOSQUITO`
|
||||
- Ground truth: **no** (misfit: BAT, MOSQUITO)
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | no | Bats and mosquitoes are not types of birds. (misfit: BAT, MOSQUITO) | PASS | 0.7s |
|
||||
| `gemma4:26b` | no | BAT and MOSQUITO are mammals and insects, respectively, not birds. (misfit: BAT, MOSQUITO) | PASS | 0.8s |
|
||||
| `gemma4:31b-it-q4_K_M` | no | Bat and mosquito are not birds. (misfit: BAT, MOSQUITO) | PASS | 2.1s |
|
||||
|
||||
### judge-n7
|
||||
|
||||
- Category: `Things that are red`
|
||||
- Words: `APPLE`, `BLUE`, `ROSE`, `GRASS`
|
||||
- Ground truth: **no** (misfit: BLUE, GRASS)
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | no | Blue is a color, not necessarily a thing that is red. (misfit: BLUE) | PASS | 0.7s |
|
||||
| `gemma4:26b` | no | BLUE and GRASS are not typically red. (misfit: BLUE, GRASS) | PASS | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | no | Blue and grass are not red. (misfit: BLUE, GRASS) | PASS | 2.0s |
|
||||
|
||||
### judge-n8
|
||||
|
||||
- Category: `Words that follow COLD`
|
||||
- Words: `SHOULDER`, `FRONT`, `PIZZA`, `MOUNTAIN`
|
||||
- Ground truth: **no** (misfit: PIZZA, MOUNTAIN)
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | no | Only SHOULDER, FRONT, and PIZZA fit common phrases with COLD, but MOUNTAIN does not. (misfit: MOUNTAIN) | PASS | 0.8s |
|
||||
| `gemma4:26b` | no | While shoulder, front, and mountain can follow 'cold', 'cold pizza' is a description of temperature rather than a common | PASS | 0.9s |
|
||||
| `gemma4:31b-it-q4_K_M` | no | Cold pizza and cold mountain are not established compound words or common idioms like cold shoulder and cold front. (mis | PASS | 2.7s |
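The Score column in this section appears to grade only the yes/no verdict: in judge-n7, `gemma4:latest` named only `BLUE` as a misfit where the ground truth lists `BLUE` and `GRASS`, and still scored PASS. A stricter scorer that also compares misfit sets might look like the sketch below. The function and parameter names are hypothetical and not taken from `scripts/gemma-semantic-bakeoff.py`.

```python
def score_judge(
    verdict: str,
    misfits: set[str],
    truth: str,
    truth_misfits: set[str],
    strict: bool = False,
) -> str:
    """Score one JUDGE answer against the hand-labeled ground truth."""
    if verdict != truth:
        return "FAIL"
    # Strict mode additionally requires the exact misfit set on "no" cases.
    if strict and truth == "no" and misfits != truth_misfits:
        return "FAIL"
    return "PASS"


# judge-n7, gemma4:latest: right verdict, incomplete misfit list.
print(score_judge("no", {"BLUE"}, "no", {"BLUE", "GRASS"}))
print(score_judge("no", {"BLUE"}, "no", {"BLUE", "GRASS"}, strict=True))
```

Whether strict scoring matters depends on the game mechanic: a mechanic that only needs accept/reject can use the verdict alone, while one that tells the player which word is wrong needs the misfit list to be right as well.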
|
||||
|
||||
## CREATE
|
||||
|
||||
### create-e1
|
||||
|
||||
- Category: `Types of trees`
|
||||
- Quality bar: Four valid tree species; e.g. OAK, MAPLE, BIRCH, PINE.
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | Oak, Maple, Pine, Birch | These are all common types of trees. | MANUAL | 0.6s |
|
||||
| `gemma4:26b` | Oak, Maple, Birch, Cedar | These are all common species of trees. | MANUAL | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | Oak, Maple, Pine, Birch | All four are common types of trees. | MANUAL | 1.9s |
|
||||
|
||||
### create-e2
|
||||
|
||||
- Category: `Greek letters`
|
||||
- Quality bar: Four genuine Greek letters.
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | Alpha, Beta, Gamma, Delta | These are the first four letters of the Greek alphabet. | MANUAL | 0.6s |
|
||||
| `gemma4:26b` | ALPHA, BETA, GAMMA, DELTA | These are the first four letters of the Greek alphabet. | MANUAL | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | Alpha, Beta, Gamma, Delta | These are the first four letters of the Greek alphabet. | MANUAL | 1.9s |
|
||||
|
||||
### create-e3
|
||||
|
||||
- Category: `Synonyms for 'angry'`
|
||||
- Quality bar: Four words that all genuinely mean angry/furious.
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | furious, irate, enraged, mad | All four words are direct synonyms for feeling intense anger. | MANUAL | 0.6s |
|
||||
| `gemma4:26b` | Irate, Livid, Furious, Incensed | All four words are direct synonyms for being extremely angry. | MANUAL | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | Furious, Irate, Livid, Enraged | All four words are strong synonyms for being extremely angry. | MANUAL | 2.1s |
|
||||
|
||||
### create-e4
|
||||
|
||||
- Category: `Days of the week`
|
||||
- Quality bar: Four of the seven weekday names, no months or other words.
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | Monday, Tuesday, Wednesday, Thursday | These are four consecutive days of the week. | MANUAL | 0.6s |
|
||||
| `gemma4:26b` | Monday, Tuesday, Wednesday, Thursday | These are all consecutive days of the week. | MANUAL | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | Monday, Wednesday, Friday, Sunday | They are all days of the week. | MANUAL | 1.8s |
|
||||
|
||||
### create-m1
|
||||
|
||||
- Category: `___ STORM (compound words ending in STORM)`
|
||||
- Quality bar: Four words that each form a real compound or fixed phrase with STORM (e.g. SAND, BRAIN, THUNDER, SNOW).
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | BLIZZARD, HEAT, STORM, SNOW | These words can all precede 'STORM' to form a type of storm. | MANUAL | 0.7s |
|
||||
| `gemma4:26b` | BRAIN, DUST, FIRE, SNOW | Each word forms a common compound word when followed by 'STORM'. | MANUAL | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | BRAIN, FIRE, THUNDER, SNOW | All four words form common compound words when followed by STORM. | MANUAL | 2.1s |
|
||||
|
||||
### create-m2
|
||||
|
||||
- Category: `Synonyms for 'small'`
|
||||
- Quality bar: Four words that all genuinely mean small.
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | tiny, petite, minuscule, diminutive | All four words are direct synonyms for 'small' in size. | MANUAL | 0.7s |
|
||||
| `gemma4:26b` | tiny, microscopic, diminutive, minuscule | All four words are direct synonyms for something of very small size. | MANUAL | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | tiny, petite, slight, miniature | All four words are direct synonyms for small in size. | MANUAL | 2.0s |
|
||||
|
||||
### create-m3
|
||||
|
||||
- Category: `Words that follow BLUE`
|
||||
- Quality bar: Four words that each form a real compound with BLUE (e.g. BERRY, BIRD, PRINT, BELL, GRASS).
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | JEANS, BIRD, MOON, BLOOD | All words can follow 'BLUE' to form common phrases. | MANUAL | 0.7s |
|
||||
| `gemma4:26b` | BIRD, JAY, PRINT, WHALE | Each word can be preceded by 'blue' to form a common term. | MANUAL | 0.8s |
|
||||
| `gemma4:31b-it-q4_K_M` | BERRY, JAY, MOON, PRINT | All four words follow 'BLUE' to form common compound words or phrases. | MANUAL | 2.2s |
|
||||
|
||||
### create-m4
|
||||
|
||||
- Category: `Things that can be 'broken'`
|
||||
- Quality bar: Four words that each form a real fixed phrase with 'broken' (heart, record, law, promise, etc.).
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | heart, promise, record, bone | All four items can be metaphorically or literally broken. | MANUAL | 0.6s |
|
||||
| `gemma4:26b` | Promise, Record, Heart, Law | All four are common objects or concepts that can be broken. | MANUAL | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | Record, Promise, Silence, Heart | All four are common abstract or physical things that can be broken. | MANUAL | 2.1s |
|
||||
|
||||
### create-h1
|
||||
|
||||
- Category: `Words that are homophones of body parts but spelled differently (e.g. HARE = hair, MUSSEL = muscle)`
|
||||
- Quality bar: Four words that each sound like a body part but are spelled differently. Valid examples: HARE (hair), MUSSEL (muscle), HEAL (heel), SOUL (sole), AYE/EYE-spelled-otherwise. EYE and HEEL alone do NOT count -- those are the body parts themselves, not homophones of them.
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | SEA, SEE, HEAR, HERE | All four words are homophones of common body parts (sea/see for 'C' sound, hear/here for 'hear'). | MANUAL | 0.8s |
|
||||
| `gemma4:26b` | MUSSEL, HARE, EYE, KNEAD | Each word is a homophone of a body part (muscle, hair, eye, knee) but spelled differently. | MANUAL | 0.9s |
|
||||
| `gemma4:31b-it-q4_K_M` | _parse fail_ | `JSONDecodeError('Extra data: line 3 column 1 (char 133)')` | PARSE_FAIL | 6.7s |
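Python's `json.loads` raises `Extra data` when a complete JSON value is followed by more non-whitespace text. One possible mitigation is to parse only the first value with `json.JSONDecoder.raw_decode` and inspect the remainder. This is a sketch: the sample reply below is invented for illustration, since the report does not record what the 31b reply actually contained.

```python
import json


def parse_first_json(raw: str):
    """Parse the first JSON value in `raw`; return it with any trailing text."""
    text = raw.lstrip()  # raw_decode does not skip leading whitespace
    obj, end = json.JSONDecoder().raw_decode(text)
    return obj, text[end:].strip()


# Invented reply: a JSON object followed by commentary.
reply = '{"words": ["HARE", "MUSSEL"]}\n\nSome trailing commentary'
obj, rest = parse_first_json(reply)
print(obj)
print(repr(rest))
```

Silently accepting the first object hides the fact that the model ignored the output format, so logging the discarded remainder is probably worth the extra line.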
|
||||
|
||||
### create-h2
|
||||
|
||||
- Category: `Words that contain a body part as a substring (e.g. HEARTBEAT contains HEART)`
|
||||
- Quality bar: Four words that each contain a body part anywhere inside them. Valid examples: HEARTH (HEART), CHESTNUT (CHEST), EARTH (EAR), HEADACHE (HEAD).
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | KNEEHIGH, EARDRUM, STOMACHACHE, EYELASH | Each word contains a body part name as a substring. | MANUAL | 0.7s |
|
||||
| `gemma4:26b` | HANDSOME, FOOTPRINT, EARNEST, ARMORY | Each word contains a body part (hand, foot, ear, arm) as a substring. | MANUAL | 0.8s |
|
||||
| `gemma4:31b-it-q4_K_M` | HEARTBEAT, HANDSOME, FOOTPRINT, ARMCHAIR | Each word contains a body part as a substring. | MANUAL | 2.2s |
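Unlike most CREATE cases, this one can be checked without manual grading, because substring containment is deterministic. A minimal sketch, assuming a small illustrative body-part list (a real check would need a fuller vocabulary):

```python
# Illustrative list only; extend as needed.
BODY_PARTS = {"ARM", "CHEST", "EAR", "EYE", "FOOT", "HAND",
              "HEAD", "HEART", "KNEE", "STOMACH"}


def body_parts_in(word: str) -> list[str]:
    """Return the body parts contained in `word`, excluding the word itself."""
    upper = word.upper()
    return sorted(part for part in BODY_PARTS if part in upper and part != upper)


# The gemma4:26b answer from the table above.
for answer in ["HANDSOME", "FOOTPRINT", "EARNEST", "ARMORY"]:
    print(answer, body_parts_in(answer))
```

Excluding exact matches mirrors the create-h1 quality bar, where a bare body part such as `EYE` does not count as an answer.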
|
||||
|
||||
## CREATIVE_ACCEPT
|
||||
|
||||
### ca-y1
|
||||
|
||||
- Words: `SCALE`, `MOUNT`, `ASCEND`, `CLIMB`
|
||||
- Player's category: `"Verbs for going up"`
|
||||
- Ground truth: **yes**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | yes | All four words can be used as verbs meaning to go up. | PASS | 0.6s |
|
||||
| `gemma4:26b` | yes | All four words are synonyms or closely related verbs describing the act of rising or ascending. | PASS | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | yes | All four words are synonyms for the action of moving upward. | PASS | 2.1s |
|
||||
|
||||
### ca-y2
|
||||
|
||||
- Words: `APPLE`, `ORANGE`, `KIWI`, `BLACKBERRY`
|
||||
- Player's category: `"Fruits"`
|
||||
- Ground truth: **yes**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | yes | All four listed items are types of fruit. | PASS | 0.6s |
|
||||
| `gemma4:26b` | yes | All four words are common types of fruit. | PASS | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | yes | All four words are common types of fruit. | PASS | 1.8s |
|
||||
|
||||
### ca-y3
|
||||
|
||||
- Words: `WHIP`, `NUT`, `CODE`, `SMILE`
|
||||
- Player's category: `"Things you can crack"`
|
||||
- Ground truth: **yes**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | yes | You can crack a nut, a code, a smile, or a whip (like a whip crack). | PASS | 0.7s |
|
||||
| `gemma4:26b` | yes | You can crack a whip, crack a nut, crack a code, and crack a smile. | PASS | 0.8s |
|
||||
| `gemma4:31b-it-q4_K_M` | yes | All four words are common objects or concepts that can be 'cracked'. (weak: WHIP) | PASS | 2.2s |
|
||||
|
||||
### ca-y4
|
||||
|
||||
- Words: `BAT`, `BALL`, `GLOVE`, `MITT`
|
||||
- Player's category: `"Baseball equipment"`
|
||||
- Ground truth: **yes**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | yes | All four items are commonly associated with baseball. | PASS | 0.6s |
|
||||
| `gemma4:26b` | yes | All four words are standard pieces of equipment used in the game of baseball. | PASS | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | yes | All four items are standard pieces of equipment used in baseball. | PASS | 2.0s |
|
||||
|
||||
### ca-y5
|
||||
|
||||
- Words: `MARS`, `VENUS`, `MERCURY`, `JUPITER`
|
||||
- Player's category: `"Roman gods"`
|
||||
- Ground truth: **yes**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | yes | All four are names of Roman deities associated with planets. | PASS | 0.6s |
|
||||
| `gemma4:26b` | yes | All four words are names of prominent deities in Roman mythology. | PASS | 0.7s |
|
||||
| `gemma4:31b-it-q4_K_M` | yes | All four words are names of Roman deities as well as planets. | PASS | 2.0s |
|
||||
|
||||
### ca-n1
|
||||
|
||||
- Words: `OAK`, `MAPLE`, `BIRCH`, `PINE`
|
||||
- Player's category: `"Furniture brands"`
|
||||
- Ground truth: **no**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | no | These words are types of trees, not furniture brands. (weak: OAK, MAPLE, BIRCH, PINE) | PASS | 0.7s |
|
||||
| `gemma4:26b` | no | These words represent types of trees rather than furniture brands. (weak: OAK, MAPLE, BIRCH, PINE) | PASS | 0.8s |
|
||||
| `gemma4:31b-it-q4_K_M` | no | These are types of trees/woods, not brands of furniture. (weak: OAK, MAPLE, BIRCH, PINE) | PASS | 2.7s |
|
||||
|
||||
### ca-n2
|
||||
|
||||
- Words: `ALPHA`, `BETA`, `GAMMA`, `DELTA`
|
||||
- Player's category: `"Words meaning 'small'"`
|
||||
- Ground truth: **no**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | no | These words are Greek letters, not inherently meaning 'small'. (weak: ALPHA) | PASS | 0.6s |
|
||||
| `gemma4:26b` | no | Alpha, beta, gamma, and delta are Greek letters used to denote various things, but none of them inherently mean 'small'. | PASS | 0.9s |
|
||||
| `gemma4:31b-it-q4_K_M` | no | These are Greek letters used as designations, not synonyms for 'small'. (weak: ALPHA, BETA, GAMMA, DELTA) | PASS | 2.6s |
|
||||
|
||||
### ca-n3
|
||||
|
||||
- Words: `BAT`, `BALL`, `GLOVE`, `MITT`
|
||||
- Player's category: `"Things worn on your hand"`
|
||||
- Ground truth: **no**
|
||||
|
||||
| Model | Verdict | Reason | Score | dt |
|
||||
|---|---|---|---|---|
|
||||
| `gemma4:latest` | no | BAT and BALL are not things worn on the hand. (weak: BAT, BALL) | PASS | 0.7s |
|
||||
| `gemma4:26b` | no | Bat and ball are not items worn on the hand. (weak: BAT, BALL) | PASS | 0.8s |
|
||||
| `gemma4:31b-it-q4_K_M` | no | A bat and a ball are not worn on the hand. (weak: BAT, BALL) | PASS | 2.3s |
|
||||
|
||||
### ca-n4

- Words: `MONDAY`, `FRIDAY`, `SUNDAY`, `WEDNESDAY`
- Player's category: `"Months of the year"`
- Ground truth: **no**

| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | All the words listed are days of the week, not months of the year. (weak: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) | PASS | 0.8s |
| `gemma4:26b` | no | None of the words are months of the year; they are all days of the week. (weak: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) | PASS | 0.9s |
| `gemma4:31b-it-q4_K_M` | no | The words are days of the week, not months of the year. (weak: MONDAY, FRIDAY, SUNDAY, WEDNESDAY) | PASS | 2.6s |

### ca-n5

- Words: `WHIP`, `NUT`, `CODE`, `SMILE`
- Player's category: `"Things found in a kitchen"`
- Ground truth: **no**

| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | Nut and smile are not typically considered kitchen items. (weak: NUT, SMILE) | PASS | 0.7s |
| `gemma4:26b` | no | While whip and nut can relate to food, code and smile have no plausible connection to kitchen items. (weak: CODE, SMILE) | PASS | 0.8s |
| `gemma4:31b-it-q4_K_M` | no | Code and smile are not items typically found in a kitchen. (weak: CODE, SMILE) | PASS | 2.4s |

### ca-b1

- Words: `APPLE`, `ORANGE`, `KIWI`, `BLACKBERRY`
- Player's category: `"Tech/phone brands"`
- Ground truth: **borderline** -- APPLE and BLACKBERRY clearly are tech brands; ORANGE is an EU/UK telecom carrier (defensible); KIWI is not a tech brand. A strict grader says no; a lenient grader might accept ORANGE.

| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | None of these words are recognized tech or phone brands. (weak: APPLE) | BORDERLINE | 0.6s |
| `gemma4:26b` | yes | All four words are names of well-known technology companies or mobile device brands. | BORDERLINE | 0.7s |
| `gemma4:31b-it-q4_K_M` | no | Orange and Kiwi are not recognized tech or phone brands. (weak: ORANGE, KIWI) | BORDERLINE | 2.3s |

### ca-b2

- Words: `SHARP`, `FLAT`, `NATURAL`, `KEY`
- Player's category: `"Real estate listing terms"`
- Ground truth: **borderline** -- FLAT (UK apartment) and KEY (keys) connect; SHARP (a sharp property) and NATURAL (natural light?) are weak. Strict no; creative yes.

| Model | Verdict | Reason | Score | dt |
|---|---|---|---|---|
| `gemma4:latest` | no | Only SHARP and FLAT are common musical terms, while KEY and NATURAL are also musical terms, making 'Real estate listing | BORDERLINE | 0.8s |
| `gemma4:26b` | no | While 'flat' and 'key' can relate to real estate, 'sharp' and 'natural' have no plausible connection to real estate list | BORDERLINE | 0.9s |
| `gemma4:31b-it-q4_K_M` | no | Natural and Sharp are not standard terms used in real estate listings. (weak: NATURAL, SHARP) | BORDERLINE | 2.4s |

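The Score column in the per-case tables appears to follow a simple rule: a verdict matching a yes/no ground truth is a PASS, and any verdict on a borderline case is recorded as BORDERLINE. A minimal sketch of that rule, together with a parser for the `(weak: ...)` suffix the models append to their reasons, might look like this. The names `score_verdict` and `parse_weak_words` are hypothetical, and treating a mismatch as FAIL is an assumption, since no mismatch appears in these tables.

```python
import re

# Matches the trailing "(weak: A, B)" annotation seen in the Reason column.
WEAK_RE = re.compile(r"\(weak:\s*([^)]*)\)")


def parse_weak_words(reason: str) -> list[str]:
    """Return the words a model flagged as weak fits, or [] if none were flagged."""
    match = WEAK_RE.search(reason)
    if match is None:
        return []
    return [w.strip() for w in match.group(1).split(",") if w.strip()]


def score_verdict(verdict: str, ground_truth: str) -> str:
    """Assumed rule: borderline cases are never auto-scored as pass or fail."""
    if ground_truth == "borderline":
        return "BORDERLINE"
    return "PASS" if verdict == ground_truth else "FAIL"


# Examples taken from ca-n3 and ca-b1.
print(parse_weak_words("A bat and a ball are not worn on the hand. (weak: BAT, BALL)"))
print(score_verdict("no", "no"))
print(score_verdict("yes", "borderline"))
```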
---

## Manual grading of CREATE outputs

Auto-grading isn't possible for CREATE, so outputs are graded by hand here.

| Case | gemma4:latest (8B) | gemma4:26b | gemma4:31b |
|---|---|---|---|
| **e1** Trees | OAK MAPLE PINE BIRCH ✓ | OAK MAPLE BIRCH CEDAR ✓ | OAK MAPLE PINE BIRCH ✓ |
| **e2** Greek letters | ALPHA BETA GAMMA DELTA ✓ | ALPHA BETA GAMMA DELTA ✓ | ALPHA BETA GAMMA DELTA ✓ |
| **e3** "Angry" synonyms | furious irate enraged mad ✓ | irate livid furious incensed ✓ | furious irate livid enraged ✓ |
| **e4** Days of week | Mon Tue Wed Thu ✓ | Mon Tue Wed Thu ✓ | Mon Wed Fri Sun ✓ |
| **m1** \_\_\_ STORM | **FAIL**: listed STORM itself; HEAT-storm not standard; BLIZZARD-storm redundant | brain dust fire snow ✓ | brain fire thunder snow ✓ |
| **m2** "Small" synonyms | tiny petite minuscule diminutive ✓ | tiny microscopic diminutive minuscule ✓ | tiny petite slight miniature ✓ |
| **m3** Words after BLUE | jeans bird moon blood ✓ | bird jay print whale ✓ | berry jay moon print ✓ |
| **m4** "Broken" things | heart promise record bone ✓ | promise record heart law ✓ | record promise silence heart ✓ |
| **h1** Body-part homophones | **FAIL**: gave SEA/SEE/HEAR/HERE, missed the body-part requirement entirely | **PARTIAL**: MUSSEL/HARE/KNEAD ✓ but EYE is the body part, not a homophone of one | **PARSE_FAIL** (after 3 retries) |
| **h2** Containing body parts | KNEEHIGH EARDRUM STOMACHACHE EYELASH ✓ | HANDSOME FOOTPRINT EARNEST ARMORY ✓ | HEARTBEAT HANDSOME FOOTPRINT ARMCHAIR ✓ |

**CREATE pass rate**: 8B = 8/10 (fails on m1 and h1) · 26b = 9/10 (1 partial on h1) · 31b = 9/10 (1 parse-fail on h1)

The h1 failure is consistent with the prior puzzle bakeoff (#2-26b had the same "homophones of body parts" failure: words that ARE body parts vs words that SOUND LIKE them). **All three models fail h1**, though in different ways: 8B drops the body-part requirement, 26b mixes in an actual body part, and 31b fails to parse. Designs depending on deep wordplay categories like this need either prompt scaffolding (give a worked example) or human curation of category seeds.

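The h1 confusion (words that are body parts versus words that sound like them) is the kind of failure a cheap post-generation check could catch before a puzzle ships. A minimal sketch, assuming a small hand-curated homophone table; `BODY_PART_HOMOPHONES`, `BODY_PARTS`, and `check_homophone_group` are hypothetical names, and the table entries are illustrative rather than drawn from the test bench:

```python
# Hypothetical hand-curated table: candidate word -> body part it sounds like.
BODY_PART_HOMOPHONES = {
    "MUSSEL": "MUSCLE",
    "HARE": "HAIR",
    "HEAL": "HEEL",
    "SOUL": "SOLE",
}
# Words that are themselves body parts must be rejected outright.
BODY_PARTS = set(BODY_PART_HOMOPHONES.values()) | {"EYE", "KNEE", "ARM"}


def check_homophone_group(words):
    """Split words into (accepted, rejected), with a reason for each rejection."""
    accepted, rejected = [], {}
    for word in words:
        w = word.upper()
        if w in BODY_PARTS:
            rejected[w] = "is a body part, not a homophone of one"
        elif w not in BODY_PART_HOMOPHONES:
            rejected[w] = "not in the curated homophone table"
        else:
            accepted.append(w)
    return accepted, rejected


accepted, rejected = check_homophone_group(["MUSSEL", "HARE", "EYE", "SEA"])
print(accepted)
print(rejected)
```

A rejected word could be fed back to the generator as a retry hint, or the whole category could be routed to human curation.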
## Aggregate

| Model | JUDGE | CREATE | CREATIVE_ACCEPT | Borderline | Avg s | Notes |
|---|---|---|---|---|---|---|
| `gemma4:latest` (8B) | 14/16 | 8/10 | 10/10 | 0/2 strict-aligned | 0.7 | Fastest. Slight bias toward "no" on hard YES cases (judge-y3 days-of-week miss, judge-y6 cold-turkey miss) |
| `gemma4:26b` | 15/16 | 9/10 | 10/10 | 1/2 over-permissive (said KIWI is a tech brand) | 0.8 | Best speed/quality balance for live judging. Shows mild "be helpful, agree" bias |
| `gemma4:31b-it-q4_K_M` | 16/16 | 9/10 | 10/10 | 2/2 strict | 2.3 | Most accurate. Only candidate for once-per-day generation. 1 parse-fail on h1 (3 retries didn't recover) |

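The raw counts in the aggregate table convert directly to percentages (for example, 14/16 on JUDGE is the 87.5% quoted for 8B). A small sketch of that conversion, with the counts copied from the table; the `RESULTS` layout and the `pass_rate` helper are just one convenient, hypothetical way to hold them:

```python
# (passed, total) per axis, copied from the aggregate table.
RESULTS = {
    "gemma4:latest": {"JUDGE": (14, 16), "CREATE": (8, 10), "CREATIVE_ACCEPT": (10, 10)},
    "gemma4:26b": {"JUDGE": (15, 16), "CREATE": (9, 10), "CREATIVE_ACCEPT": (10, 10)},
    "gemma4:31b-it-q4_K_M": {"JUDGE": (16, 16), "CREATE": (9, 10), "CREATIVE_ACCEPT": (10, 10)},
}


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passed / total


for model, axes in RESULTS.items():
    rates = ", ".join(f"{axis} {pass_rate(*pt):.2f}%" for axis, pt in axes.items())
    print(f"{model}: {rates}")
```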
### What this proves

1. **The CREATIVE_ACCEPT axis works on every model tested.** This is the structural unlock that makes a Gemma-powered Connections derivative meaningfully different from the static NYT format: live, fair judging of player-invented groupings. 10/10 across 3 models on 5 valid + 5 invalid player categories: accept WHIP/NUT/CODE/SMILE for "things you can crack", reject OAK/MAPLE/BIRCH/PINE for "furniture brands", reject MONDAY/FRIDAY/SUNDAY/WEDNESDAY for "months." The model gets the distinction Connections cares about.
2. **Per-guess JUDGE economics are cheap.** 0.7-0.8s on the 3090 Ti for 8B/26b. Even 1000 player-guesses/day costs <15 GPU-minutes, which is effectively free.
3. **31b is the right generator** (validated earlier) and **the right offline critique judge.** 26b/8B are the right live judges.
4. **Hard wordplay categories (homophones-of-body-parts class) consistently fail** across all three models. Either avoid them or scaffold with examples.

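The "<15 GPU-minutes" figure in point 2 follows from straightforward arithmetic on the measured latencies. A quick sketch, assuming guesses are judged one at a time with no batching or queueing overhead (the function name is illustrative):

```python
def gpu_minutes_per_day(guesses_per_day: int, seconds_per_judgement: float) -> float:
    """Total GPU time in minutes, assuming strictly sequential judging."""
    return guesses_per_day * seconds_per_judgement / 60.0


# Average per-guess latencies from the aggregate table.
for model, latency in [("gemma4:latest (8B)", 0.7), ("gemma4:26b", 0.8)]:
    print(f"{model}: {gpu_minutes_per_day(1000, latency):.1f} GPU-minutes/day")
```

At 1000 guesses per day this works out to roughly 11.7 minutes for 8B and 13.3 minutes for 26b.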
### What this doesn't prove (limits / next steps)

- **Borderline-case behavior is model-specific.** 26b said yes to KIWI as a tech brand, which is a real false-positive risk for the CREATIVE_ACCEPT design. If the live game uses 26b, it will sometimes accept groupings a strict grader would reject. 8B's stricter bias makes it safer here despite the 87.5% JUDGE rate; 31b is consistent and would be the gold standard but is too slow for live use.
- **Cultural/contextual categories untested.** Categories like "Words in a Beatles song" or "Things only true after 2020" may break the judge in ways simple semantics don't.
- **No adversarial player.** What if the player invents a category to *deliberately game* the system into accepting a near-wrong grouping? E.g. "Words that contain a vowel" trivially fits any 4 English words. Need a category-tightness check on player input, not just word-fit.
- **Ground truth is mine and arguable.** judge-y5 (\_\_\_ HOUSE: 26b said "courthouse is one word, not 'court' + 'house'") is a defensible call I marked as a fail. Real human-grader agreement might bump 26b to 16/16 too.

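The category-tightness idea from the adversarial-player point can be sketched as follows: estimate what fraction of a reference vocabulary fits the player's category, and reject categories that fit nearly everything. This is only an illustration under stated assumptions. In the real game the membership test would presumably be a judge-model call; here a plain Python predicate stands in for it, and `category_tightness`, `is_too_loose`, `REFERENCE_WORDS`, and the 0.5 threshold are all hypothetical.

```python
from typing import Callable, Iterable

# Hypothetical reference sample; a real check would use a much larger word list.
REFERENCE_WORDS = ["WHIP", "NUT", "CODE", "SMILE", "OAK", "MAPLE", "BIRCH", "PINE",
                   "MONDAY", "FRIDAY", "BAT", "BALL", "GLOVE", "MITT", "ALPHA", "BETA"]


def category_tightness(fits: Callable[[str], bool], sample: Iterable[str]) -> float:
    """Fraction of sample words that do NOT fit; 0.0 means the category fits everything."""
    words = list(sample)
    if not words:
        raise ValueError("sample must not be empty")
    return 1.0 - sum(1 for w in words if fits(w)) / len(words)


def is_too_loose(fits: Callable[[str], bool], sample: Iterable[str],
                 min_tightness: float = 0.5) -> bool:
    """Flag categories that accept more than (1 - min_tightness) of the sample."""
    return category_tightness(fits, sample) < min_tightness


# Stand-ins for judge-model calls (assumption: the real check would ask the model).
def contains_vowel(word: str) -> bool:
    return any(ch in "AEIOU" for ch in word.upper())


def is_tree(word: str) -> bool:
    return word.upper() in {"OAK", "MAPLE", "BIRCH", "PINE"}


print(is_too_loose(contains_vowel, REFERENCE_WORDS))  # True: fits every sample word
print(is_too_loose(is_tree, REFERENCE_WORDS))         # False: fits only 4 of 16
```

Running the tightness check before the word-fit judge keeps trivially broad categories from ever reaching the accept/reject decision.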