Evolving card games nobody has played yet

NAME

evolving-card-games — Evolving card games nobody has played yet

SYNOPSIS

I set out to evolve novel card games with genetic algorithms and Monte Carlo simulation, and hit two walls. Almost every random rulebook is unplayable garbage, and once you fix that with constrained game skeletons, evolution just rediscovers games that already exist. The fix for the first is to search a manifold where every point is a valid game; the fix for the second is to put an LLM in the selection loop as a novelty judge, because structural metrics are blind to the entire corpus of published card games. It was built across a run of Claude models, Opus 4.5 through Fable 5. Fable 5 rebuilt the measurement foundation before a US government suspension order got it pulled four days after launch, leaving Opus 4.8 to execute the breakthrough. Most recently I generalized the six hand-coded game families into a generative grammar that stays playable-by-construction and now represents most known card games -- Crazy Eights, Hearts, Spades, Casino, Gin, Big Two, poker -- from one set of typed primitives, all feeding the same judge. The certified-novel games are genuinely playable; whether any of them is actually fun is still an open question, which is the whole point of the playable web version.

METADATA

date: June 17, 2026

length: 19.2K

reading: ~16m

The idea: a rulebook is just a genome

A card game is a set of rules you can encode: setup (deal pattern, hand sizes), turn structure (draw, play, pass), legal moves (match by suit or rank, special card effects), win conditions (empty hand, point threshold, capture), scoring. Encode all of that as a genome and you can mutate it and cross it over, taking the turn structure from game A and the win condition from game B.
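A minimal sketch of that encoding, written in Python for brevity. The field names and value sets are illustrative assumptions, not the project's actual genome, which is far richer.

```python
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Genome:
    hand_size: int        # setup: cards dealt to each player
    turn_structure: str   # e.g. "draw-then-play" or "play-or-pass"
    match_rule: str       # legal moves: match by "suit", "rank", or "suit-or-rank"
    win_condition: str    # "empty-hand", "point-threshold", or "capture"

# Hypothetical value sets a mutation may draw from.
CHOICES = {
    "hand_size": [5, 7, 10, 13],
    "turn_structure": ["draw-then-play", "play-or-pass"],
    "match_rule": ["suit", "rank", "suit-or-rank"],
    "win_condition": ["empty-hand", "point-threshold", "capture"],
}

def crossover(a: Genome, b: Genome) -> Genome:
    """Keep a's setup, turn structure, and move rule; take b's win condition."""
    return replace(a, win_condition=b.win_condition)

def mutate(g: Genome, rng: random.Random) -> Genome:
    """Re-roll one randomly chosen field (it may re-draw the same value)."""
    field = rng.choice(sorted(CHOICES))
    return replace(g, **{field: rng.choice(CHOICES[field])})

game_a = Genome(7, "draw-then-play", "suit-or-rank", "empty-hand")
game_b = Genome(13, "play-or-pass", "suit", "point-threshold")
child = crossover(game_a, game_b)  # A's turn structure, B's win condition
```

Nothing in this representation stops a mutation from pairing a move rule with a win condition it can never satisfy. The sketch encodes games; it does not check that they work.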

For fitness you score each candidate on measurable proxies for fun: decision density, comeback potential, skill-versus-luck, session length. You measure those by simulating thousands of games with three kinds of AI player. Random play validates that the game is mechanically playable. Greedy play measures whether obvious strategy helps. MCTS approximates skilled play, and the gap between MCTS and random win rates is your skill signal. Clean idea. I still think it’s the right idea.
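To make the skill signal concrete, here is a toy version in Python. The one-trick game and the `skilled_policy` stand-in for MCTS are assumptions made up for illustration; only the shape of the measurement (skilled win rate minus random win rate against the same opponent) comes from the description above.

```python
import random

def play_once(policy_a, policy_b, rng):
    """One-trick toy game: each player holds 3 distinct cards, higher card wins."""
    deck = list(range(52))
    rng.shuffle(deck)
    hand_a, hand_b = deck[:3], deck[3:6]
    return policy_a(hand_a, rng) > policy_b(hand_b, rng)  # True if A wins

def random_policy(hand, rng):
    return rng.choice(hand)

def skilled_policy(hand, rng):
    return max(hand)  # stand-in for MCTS: the best play in this toy game

def win_rate(policy_a, policy_b, games, seed):
    rng = random.Random(seed)
    wins = sum(play_once(policy_a, policy_b, rng) for _ in range(games))
    return wins / games

def skill_gap(games=2000, seed=0):
    """How much skilled play beats random play against a random opponent."""
    skilled = win_rate(skilled_policy, random_policy, games, seed)
    baseline = win_rate(random_policy, random_policy, games, seed)
    return skilled - baseline
```

In this toy the skilled player wins about three games in four, so the gap sits near 0.25. A game of pure luck would put it near zero, which is the case the metric is meant to flag.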

Wall #1: almost every rulebook is garbage

The first version encoded rules directly and mutated them freely. Opus 4.5 built it across January, and the commit history is a wall of betting subsystems, a bytecode compiler, a CGo bridge, a web UI. The problem shows up the moment you run evolution. The overwhelming majority of random or mutated rulebooks produce games that don’t work: infinite loops, states with no legal move, win conditions you can never reach, games that collapse into drawing and passing forever. The fitness landscape is an ocean of zeros with rare islands of playable games, and mutation mostly walks you off an island into the water.

So you spend your entire compute budget proving candidates invalid before you ever get to measure whether one is fun. No amount of cleverness in the fitness function fixes it, because the fitness function never runs on a broken game.

The pivot: stop searching for playable games

The fix was to stop searching the space of all rulebooks. Pick a handful of templates for how real card games actually work, where the game loop itself guarantees every state has a legal move, and evolve the parameters inside a template instead of evolving whether the game functions.

Six skeletons: shedding (Crazy Eights), trick-taking (Whist, Hearts), rummy (Gin), climbing (Big Two), casino-style capture (Scopa), and vying/betting (poker). Each is a runner that’s playable by construction. The genome stopped deciding whether the game works and started deciding only what happens inside a game that already works.
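A sketch of what "playable by construction" can look like for the shedding skeleton, in Python. This is an assumed minimal runner, not the project's code: the legal-move function always returns something (a play, else a draw, else a pass), and a turn cap guarantees the loop ends. The only evolved parameter shown is `hand_size`.

```python
import random

def legal_moves(hand, top, stock):
    """Never empty: matching plays, else draw if stock remains, else pass."""
    plays = [c for c in hand if c[0] == top[0] or c[1] == top[1]]
    if plays:
        return [("play", c) for c in plays]
    return [("draw", None)] if stock else [("pass", None)]

def run_shedding(hand_size, players=2, seed=0, max_turns=500):
    """Random play inside the skeleton; returns (winner, turns played)."""
    rng = random.Random(seed)
    deck = [(rank, suit) for rank in range(13) for suit in "CDHS"]
    rng.shuffle(deck)
    hands = [[deck.pop() for _ in range(hand_size)] for _ in range(players)]
    top = deck.pop()
    for turn in range(max_turns):
        p = turn % players
        moves = legal_moves(hands[p], top, deck)
        assert moves  # holds by construction, whatever the parameters
        kind, card = rng.choice(moves)
        if kind == "play":
            hands[p].remove(card)
            top = card
            if not hands[p]:
                return p, turn + 1
        elif kind == "draw":
            hands[p].append(deck.pop())
    # Turn cap reached: the smallest hand wins, so the game still ends.
    return min(range(players), key=lambda i: len(hands[i])), max_turns
```

Whatever value the genome picks for `hand_size` (within what the deck can deal), the run produces a winner and a turn count, so every simulation measures quality, not validity.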

That one change moves the entire search budget from validity to quality. Every candidate is a real game, so every simulation tells you something about how good it is instead of whether it runs. I threw out the Python and rewrote the whole thing in pure Go. This was the Opus 4.6-into-4.7 window, April. The rewrite and the skeleton idea are the cleanest structural decisions in the project, and they’re where the model stepped up from 4.5.

Measuring fun, badly, on purpose

The fitness function is five metrics: meaningful decisions, game arc, interaction, skill gradient, session length. None of them measure fun. They measure things that correlate with fun if you squint, which is the best you can do without a human in the room.

The trick that keeps them honest is calibration against ground truth. There are eleven real, published, time-tested games in the seed set, and a game still in circulation after decades is fun by survival. If your metric scores Gin Rummy a 0.3, the metric is broken. Real games are free labels, and you should use them.
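The calibration idea reduces to a small check, sketched here in Python. The function name, the margin, the fixture names, and every score below are made up to show the shape of the gate; none of them are the project's real numbers.

```python
def calibration_gate(scores, classics, fixtures, margin=0.05):
    """Pass only if the worst classic still outranks the best degenerate fixture."""
    worst_classic = min(scores[name] for name in classics)
    best_fixture = max(scores[name] for name in fixtures)
    return worst_classic >= best_fixture + margin

# Made-up scores, purely illustrative.
scores = {
    "gin-rummy": 0.71,
    "hearts": 0.68,
    "always-pass": 0.05,        # hypothetical degenerate fixture
    "first-player-wins": 0.12,  # hypothetical degenerate fixture
}
classics = ["gin-rummy", "hearts"]
fixtures = ["always-pass", "first-player-wins"]

ok = calibration_gate(scores, classics, fixtures)
# A metric that ranks a surviving classic below a broken game fails the gate.
broken = calibration_gate({**scores, "gin-rummy": 0.10}, classics, fixtures)
```

When the gate fails, the metric is what gets fixed, not the classic.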

The first cut of these metrics shipped with the April rewrite, and it was mostly proxies wearing the costume of measurements: decision density read off an event log instead of counting legal moves, interaction was a tag taxonomy, the calibration floor was a number I’d picked by feel. In June an audit caught all of it, including a few load-bearing claims from the v1 days that turned out to be false (the “parallel” fitness evaluator had been running serially the whole time because Python 3.13 multiprocessing deadlocks with CGo). The model that ran that audit was Fable 5, in its short window of availability, and it rebuilt the metrics into real measurements: decision density from actual legal-move counts, game arc from per-game lead trajectories, interaction from option-perturbation, a calibration gate where published classics provably outrank degenerate fixtures, and a battery of degeneracy vetoes so a game that passes random play but collapses under greedy play gets killed. More on that window later. For now the lesson: audit your own benchmarks before you quote them, including the ones you wrote feeling good about yourself.
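Two of those rebuilt pieces are simple enough to sketch in Python. Both definitions are assumptions about what a measurement of this kind might look like (the "two or more legal moves" criterion and the 0.9 floor are invented for illustration), not the rebuilt metrics themselves.

```python
def decision_density(legal_move_counts):
    """Fraction of turns on which the player had a real choice (2+ legal moves)."""
    if not legal_move_counts:
        return 0.0
    real_choices = sum(1 for n in legal_move_counts if n >= 2)
    return real_choices / len(legal_move_counts)

def degenerate(random_finish_rate, greedy_finish_rate, floor=0.9):
    """Veto a game that completes under random play but collapses under greedy play."""
    return random_finish_rate >= floor and greedy_finish_rate < floor

# Legal-move counts recorded turn by turn during one simulated game.
density = decision_density([1, 3, 2, 1])
```

The point of counting legal moves directly is that the number comes from the game state, not from whatever happened to be written to an event log.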

The other thing the rebuild taught, over and over, is that every metric is a lens ground for the games you already had, and it’s blind to anything shaped differently. Big Two scored near zero on interaction for a while. Climbing games are intensely interactive, but the metric measured changes in option count, and Big Two’s beat-or-pass constraint doesn’t register that way. Add a climbing-specific interaction measure and the score jumps to par with Gin. Every new game family taught the same thing: it needs its own interaction measure and its own greedy strategy, or the metrics literally cannot see it. You discover the blind spot only when something you know is good scores obviously wrong.

Wall #2: it works, and everything it makes already exists

Now the system reliably produced playable games. They were also all games you could find in a copy of Hoyle’s. Evolution finds the nearest fun attractor, and the nearest fun attractors are the games humans already discovered over a few centuries of playing cards. Playability was solved; novelty wasn’t.

The deeper problem was that I couldn’t even measure novelty. Every structural novelty signal I had, behavioral distance and distance-from-seeds, anchors on my own seed set. A game can sit miles from my eleven seeds and still be a faithful rediscovery of Mau-Mau or Pinochle, published games I simply didn’t seed. The structural metrics are blind to the entire corpus of human card games they were never shown. You cannot compute “is this new to humanity” from a simulation, because the simulation has never heard of humanity’s games.

This is when I put the project aside for a couple of months.

The pivot: an LLM is the only thing that read the rulebooks

The thing that has read the corpus of published card games is a language model. So I generate a blind rulebook for each evolved game, a dossier with no hint of where it came from, and ask an LLM a narrow question: is this a novel game or a rediscovery of something published, and name the closest existing game. Then I feed that verdict back into selection as a fitness term. Novel compositions get explored more. Rediscoveries get starved.
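A sketch of the verdict-as-fitness-term idea in Python. The multipliers, the cache keyed by rulebook hash, and `fake_judge` (a keyword stub standing in for the LLM call) are all assumptions for illustration; the real judge is a model reading a full blind dossier.

```python
import hashlib

# Assumed multipliers: rediscoveries get starved, novel games get explored more.
NOVELTY_WEIGHT = {"known": 0.5, "variant": 0.8, "novel": 1.2}

def judged_fitness(base_fitness, rulebook, judge, cache):
    """Scale structural fitness by the judge's verdict, judging each rulebook once."""
    key = hashlib.sha256(rulebook.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = judge(rulebook)  # the expensive call in a real system
    return base_fitness * NOVELTY_WEIGHT[cache[key]]

calls = []

def fake_judge(rulebook):
    """Stub judge: a real one would send the blind rulebook to an LLM."""
    calls.append(rulebook)
    return "novel" if "combo" in rulebook else "known"

cache = {}
novel_score = judged_fitness(0.6, "shed cards; multi-card combo plays", fake_judge, cache)
known_score = judged_fitness(0.9, "match the top card by suit or rank", fake_judge, cache)
```

With these weights a judged-novel game at 0.6 outscores a rediscovery at 0.9, which is the whole effect: selection pressure now has a term that knows about published games.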

Three moving parts. A selection term during evolution that rewards judged-novel lineages. A checkpoint scheme so the verdict table compounds across generations. A publication ranking so a certified-novel game surfaces above a higher-fitness rediscovery. The judging runs cheap-in-loop and expensive-to-certify: Sonnet judges every chunk during evolution, Opus certifies the final top set. In a controlled check the two agreed within one category on all six test games, exactly on four, with zero flips between “known” and “novel,” and Sonnet is the more conservative of the two, which is the safe direction for an in-loop filter that can starve a lineage.
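The publication ranking and the judge-agreement check can both be sketched in a few lines of Python. The verdict lists below are invented to reproduce the shape of the reported result (six games, four exact matches, no known/novel flips); they are not the actual verdicts.

```python
LEVEL = {"known": 0, "variant": 1, "novel": 2}

def publication_rank(games):
    """Certified-novel games first, then variants, then known; fitness breaks ties."""
    return sorted(games, key=lambda g: (-LEVEL[g["verdict"]], -g["fitness"]))

def agreement(cheap, strong):
    """Compare two judges' verdicts on the same games, category by category."""
    gaps = [abs(LEVEL[a] - LEVEL[b]) for a, b in zip(cheap, strong)]
    return {
        "exact": gaps.count(0),
        "within_one": sum(g <= 1 for g in gaps),
        "flips": gaps.count(2),  # one judge says known, the other says novel
    }

ranked = publication_rank([
    {"name": "a", "verdict": "known", "fitness": 0.9},
    {"name": "b", "verdict": "novel", "fitness": 0.6},
])

# Illustrative verdicts only, not the real ones.
cheap = ["known", "variant", "variant", "novel", "known", "variant"]
strong = ["known", "variant", "novel", "novel", "variant", "variant"]
report = agreement(cheap, strong)
```

The `flips` count is the number to watch: a one-category disagreement costs little, while a known/novel flip means the in-loop judge is starving or feeding the wrong lineages.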

The mechanism that actually generates novelty is cross-skeleton borrowing. Let a shedding game borrow a mechanic from another family. Shallow borrows, the ones that only add a scoring tally, read as variants. The borrow that mattered changes the legal-move set: multi-card combos lifted out of climbing and dropped into shedding, so the moves you can make are different, not just the points you score. Pair that with a win condition that ends the game on points accumulated over several rounds and you get the validated recipe. Blind judges certify that combination novel four times out of four. Pure move-tweaks with no win-condition change they correctly call variants, two out of two. The judge is discriminating.
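To show the difference between a borrow that changes scoring and one that changes the legal-move set, here is a Python sketch of the second kind. The specific combo rule (a same-rank pair is playable if either card would be playable alone) is an assumption invented for the example, not the project's rule.

```python
from itertools import combinations

def single_plays(hand, top):
    """Plain shedding: play one card matching the top card's rank or suit."""
    return [(c,) for c in hand if c[0] == top[0] or c[1] == top[1]]

def with_combo_borrow(hand, top):
    """Shedding plus same-rank pairs, a multi-card combo borrowed from climbing."""
    moves = single_plays(hand, top)
    for a, b in combinations(hand, 2):
        same_rank = a[0] == b[0]
        one_matches = a[0] == top[0] or a[1] == top[1] or b[1] == top[1]
        if same_rank and one_matches:
            moves.append((a, b))
    return moves

hand = [(5, "H"), (5, "S"), (9, "C")]  # cards are (rank, suit)
top = (2, "H")
before = single_plays(hand, top)       # only the 5 of hearts
after = with_combo_borrow(hand, top)   # the 5 of hearts, or both fives together
```

The borrowed rule adds a move that did not exist before, so the decision on this turn is different. A borrow that only changed the points tally would leave `before` and `after` identical.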

The honest accounting of who did what is messier than the tidy version. Mostly it was me bouncing half-formed ideas off whatever model I had open until the deep-borrow framing shook out, then handing it over to build. More than once a model went off on its own and fixed something I hadn’t thought through clearly enough, which counts for plenty. The breakthrough commits are dated June 14, Opus 4.8. Calling the recipe a clean flash of human insight would be flattering myself; it came out of the back-and-forth.

One bug is worth the war story regardless of which model hit it. For a stretch the judges called everything a variant, including games I was sure were new. The rulebook generator was rendering every borrowed mechanic as a generic, parameter-free blurb, so the judges were assessing novelty blind to the actual rules. I made it render the concrete rule, re-judged the exact same games, and they went from all-variant to mostly novel. The games never changed, only the description did. Your measurement can lie to you even when the thing being measured is fine, and an LLM judge is only as good as the document you hand it.

The models built it

I always used the newest Claude model available, and execution quality tracked the releases closely: Opus 4.5 produced the v1 sprawl and its dead end, 4.6 and 4.7 the clean skeleton rewrite, Fable 5 the measurement rebuild, Opus 4.8 the novelty breakthrough.

Model	Released	Role in DarwinDeck
Opus 4.5	Nov 2025	v1, the Python/Go hybrid and its unconstrained-genome dead end
Opus 4.6	Feb 2026	start of the pure-Go rewrite
Opus 4.7	Apr 2026	the skeletons, the first cut of the five metrics
Opus 4.8	May 2026	judge-in-loop scaffolding; took back over June 13 and executed the novelty breakthrough
Fable 5	Jun 9, 2026	the June 11-12 audit and metrics rebuild; pulled June 13 by the US government suspension

Fable 5 shipped on June 9 and ran the next three days under deliberately loose direction, the whole prompt being “audit this and try to fix it as best you can.” In a roughly 76-commit burst it did exactly that and, crucially, worked out what I was trying to do without much actual direction. It rebuilt the five metrics from proxies into actual measurements, calibrated them so published classics outrank degenerate fixtures, added the degeneracy vetoes that keep the search honest, made the game loop pure, and built the determinized ISMCTS player that gives the skill gradient its teeth. Then, on June 13, the US government ordered Anthropic to suspend Fable 5 and Mythos 5 for all foreign nationals on national-security grounds, and because Anthropic couldn’t reliably gate access by nationality it pulled both models for everyone.¹ So the most capable model I had spent three days hardening my card-game fitness function, then got recalled by a federal directive, and Opus 4.8 took back over. The next day the breakthrough landed.

Once the metrics worked, the next piece was the deep-borrow plus integration-pressure plus judge-in-the-loop architecture, which involved a lot of back and forth with Opus 4.8. Fable probably could’ve built that on its own with a nudge or two.

The step I kept skipping: ask a human

The whole tower, five proxy metrics plus an LLM novelty judge, optimizes for fun with nothing checking the proxies against a real person actually enjoying a game. I’ve played them. The v1 games were mostly broken or nonsensical, which is how I knew the unconstrained genome was a dead end; the new ones I’ve tried are genuinely playable. What I haven’t done is sit other people down with a wide set and collect ratings, and I haven’t even played all of mine yet. The original plan called for exactly that: the last bullet was human playtesting, validate the proxies by correlating their scores against enjoyment ratings. It’s the one step I kept deferring, because it’s the only one that needs other people.

So this week I built the thing that should have come earlier. A web server that serves any evolved game as a browser game against the AI, with a rating widget, deployed at cards.signalnine.net. It runs a curated set across all six skeletons. Now the loop the plan called for on day one can finally close: people play the judge-certified-novel games, rate them, and those ratings tell me whether any of the proxies correlate with fun at all.
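Once ratings exist, the check itself is a rank correlation between each proxy and the human scores. A dependency-free Python sketch of Spearman's rho follows; the choice of Spearman is my assumption (ratings are ordinal, so rank correlation seems the natural fit), and the sample numbers are invented.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(proxy_scores, human_ratings):
    """Rank correlation between one proxy metric and mean human ratings per game."""
    return pearson(ranks(proxy_scores), ranks(human_ratings))

# Invented numbers: one proxy score and one mean rating per game.
rho = spearman([0.1, 0.4, 0.9], [1, 3, 5])
```

A proxy whose rho stays near zero across enough rated games is a candidate for the half of the tower to throw away.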

I don’t know the answer yet, and that’s the point. It’s entirely possible the certified-novel games are clever and not fun. That result would be worth more than building three more skeletons, because it would tell me which half of the tower to throw away.

The skeletons were also a ceiling

The skeletons fixed playability by fencing the search into six families: shedding, trick-taking, rummy, climbing, casino, vying. Every point inside is a valid game, which is the whole point. It also means the most structurally different games you can find are those six plus a handful of cross-family fusions. Dozens of distinct games, not thousands.

So I stopped hand-coding skeletons and generated them. A game becomes a composition of typed primitives: a move-generator (match-and-shed, beat-or-pass, accumulate-toward-a-target, capture-from-a-table), an end condition, a scoring rule, and a set of modifiers. One interpreter runs any composition. The six skeletons stop being six hand-written runners and become six points in a grammar with room for many more.
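A Python sketch of what a composition of typed primitives could look like. The four move-generators are the ones named above; the `End` and `Scoring` members are hypothetical placeholders, since the post does not list the real ones.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product

class MoveGen(Enum):
    MATCH_AND_SHED = "match-and-shed"
    BEAT_OR_PASS = "beat-or-pass"
    ACCUMULATE = "accumulate-toward-a-target"
    CAPTURE = "capture-from-a-table"

class End(Enum):  # hypothetical end conditions
    EMPTY_HAND = "empty-hand"
    STOCK_EXHAUSTED = "stock-exhausted"
    POINT_TARGET = "point-target"

class Scoring(Enum):  # hypothetical scoring rules
    WINNER_TAKES_ROUND = "winner-takes-round"
    CARDS_LEFT_PENALTY = "cards-left-penalty"
    CAPTURED_POINTS = "captured-points"

@dataclass(frozen=True)
class Composition:
    move_gen: MoveGen
    end: End
    scoring: Scoring
    modifiers: frozenset = frozenset()

def all_base_compositions():
    """Every modifier-free point in the grammar."""
    return [Composition(m, e, s) for m, e, s in product(MoveGen, End, Scoring)]

# A Crazy Eights-like game is one point in the space, not a hand-written runner.
crazy_eights_like = Composition(
    MoveGen.MATCH_AND_SHED, End.EMPTY_HAND, Scoring.CARDS_LEFT_PENALTY
)
```

Because a game is now a value rather than a program, enumerating, mutating, and comparing games are all operations on plain data.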

“Generate the rules” is exactly what v1 did, and v1 was a desert, so the only thing that matters is keeping this generative space out of the desert. The grammar is playable-by-construction two ways. Every move-generator carries an unconditional fallback, so the legal-move set is never empty and you can’t reach a dead end. And the interpreter has its own termination rule, so a composition whose win condition is never reached still ends instead of looping. I ran every composition in the space through random play: zero ever got stuck (safety) and zero ever failed to terminate (liveness), across the whole space.
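Those two guarantees fit in a few lines. The following Python sketch is an assumed minimal interpreter, with a `"pass"` fallback and a turn cap standing in for whatever fallback and termination rule the real interpreter uses.

```python
import random

def with_fallback(move_gen):
    """Wrap a move generator so the legal-move set is never empty."""
    def gen(state):
        return move_gen(state) or ["pass"]
    return gen

def interpret(move_gen, apply_move, is_over, state, rng, max_turns=1000):
    """Run any composition under random play; always returns, by construction."""
    gen = with_fallback(move_gen)
    for turn in range(max_turns):
        if is_over(state):
            return state, turn, "win-condition"
        state = apply_move(state, rng.choice(gen(state)))
    return state, max_turns, "turn-cap"  # the interpreter's own termination rule

rng = random.Random(0)

# A composition that works: count down to zero.
ok = interpret(
    move_gen=lambda s: ["dec"],
    apply_move=lambda s, m: s - 1 if m == "dec" else s,
    is_over=lambda s: s == 0,
    state=5,
    rng=rng,
)

# A pathological composition: no moves, unreachable win. It still ends.
stuck = interpret(
    move_gen=lambda s: [],
    apply_move=lambda s, m: s,
    is_over=lambda s: False,
    state=5,
    rng=rng,
)
```

The second call is the case that sank v1. Here it terminates with the reason `"turn-cap"`, so the fitness function still gets to run and score the game as bad.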

Most compositions are valid and pointless: a scoring rule that crowns the same winner the end condition already picked, an end the move dynamics can never reach. So the grammar carries a coherence type that makes those unrepresentable, and encoding it collapses twenty loosely-enumerated families to the four the primitives actually support, which turn out to be exactly the hand-coded skeletons the grammar covers. The base grammar reproduces the skeletons and invents no spurious novelty. That’s the correct answer: novelty should come from composition.

The cross-family borrows become typed modifiers gated by that same coherence type. The table of which borrow is legal on which skeleton, which I used to maintain by hand, becomes a small type rule the compiler enforces. Four base families crossed with their compatible modifiers give about twenty playable-by-construction games, and they drop into the same engine, the same five metrics, and the same blind judge with no special-casing. The judge calibrated identically: bare bases come back known, simple fusions variant, and one richer fusion, multi-card combos plus a forced-suit constraint plus a knock-to-end, comes back novel, a dual win-path timing decision it couldn’t place in any published game. It’s the same loop over a broader space.
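One way to picture the type rule, sketched in Python: each base family provides some game-state features, each modifier requires some, and a borrow is legal when the requirement is covered. The feature names and both tables are hypothetical, and in Python this is a runtime check where the post describes a rule the compiler enforces.

```python
# Hypothetical: which state features each base family provides.
PROVIDES = {
    "match-and-shed": {"hand", "discard-pile"},
    "beat-or-pass": {"hand", "current-lead"},
    "accumulate-toward-a-target": {"hand", "melds"},
    "capture-from-a-table": {"hand", "table"},
}

# Hypothetical: which features each modifier needs in order to make sense.
REQUIRES = {
    "multi-card-combo": {"hand"},
    "forced-suit": {"discard-pile"},
    "knock-to-end": {"hand"},
    "sum-capture": {"table"},
}

def legal_modifiers(base):
    """Modifiers whose requirements the base family satisfies."""
    return sorted(m for m, needs in REQUIRES.items() if needs <= PROVIDES[base])

def compose(base, modifiers):
    """Build a game, rejecting any incoherent borrow up front."""
    illegal = [m for m in modifiers if not REQUIRES[m] <= PROVIDES[base]]
    if illegal:
        raise TypeError(f"{illegal} not legal on {base}")
    return (base, tuple(sorted(modifiers)))

fusion = compose("match-and-shed", ["multi-card-combo", "forced-suit", "knock-to-end"])
```

The hand-maintained table of legal borrows turns into two small declarations plus a subset test, and adding a new modifier means adding one line to `REQUIRES`.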

What started as a four-move-generator prototype now spans seven, adding trick-taking, rummy, and a betting generator for poker, and represents about four in five of the known card games I surveyed, every one playable-by-construction. Crazy Eights, Hearts, Spades, Casino, Gin, Big Two, five-card poker all fall out of the same primitives plus typed modifiers: trump, bidding, partnerships, Uno’s skip and reverse and draw-two, Scopa’s sum-capture. The handful it still can’t reach – Euchre, Bridge, Hold’em – each need a bespoke primitive the others don’t share, so they’re the long tail, not the bulk. None of this changes the open question. A bigger space of certified-novel games is still a bigger space of games that might not be fun. What it buys is leverage: when the ratings come back and tell me which proxies to trust, the thing on the other side of that filter is most of the card-game design space, generated, instead of six hand-built points in it. The manifold from the first wall is finally generative.

What I’d tell someone starting this

Constrain the search space to a manifold where every point is valid, then search inside it. Optimizing validity burns your whole budget; you want every evaluation spent on quality.

But the constraint that makes the space safe can also cap it. Six valid skeletons is a safe search space and a small one. If you can, generalize the constraint into a generator, a grammar where every composition is valid by construction, and you get the safety without the ceiling.

Calibrate your metrics against real ground truth. Published games are free labels. If your metric ranks them low, the metric is wrong.

Assume your metrics are blind to anything structurally unlike what you built them for. You’ll only find the blind spot when something you trust scores obviously wrong, so go looking on purpose.

Treat “playable” and “novel” as separate problems. Selection walks toward the nearest known-good thing unless you actively push it away.

Some properties can’t be computed, only judged. An LLM is a usable fitness function for a thing your code can’t measure, here whether a game is new to human culture. Run a cheap model in the loop and an expensive one to certify.

When I was stuck, what moved it was getting clearer about what I was actually asking for, not waiting for a better model. The models were good collaborators, and one went off and fixed things I hadn’t thought through. But the stall broke only when my own grip on the problem tightened. When you’re stuck, upgrade that before your tooling.

And the ground truth you’re approximating has to be measured directly at some point, or you’re optimizing your own assumptions with great precision. Build that instrument early. I built it last, which is the one thing about the route I’d actually change.

If you want to help me find out whether any of this is fun, go play a few and rate them: cards.signalnine.net. The ratings are the experiment now.

¹ Anthropic’s statement on the US government directive to suspend access to Fable 5 and Mythos 5, June 13, 2026.