gabe@signalnine:~/posts$ man evolving-card-games
NAME
evolving-card-games — Evolving card games nobody has played yet
SYNOPSIS
I set out to evolve novel card games with genetic algorithms and Monte Carlo simulation, and hit two walls. Almost every random rulebook is unplayable garbage, and once you fix that with constrained game skeletons, evolution just rediscovers games that already exist. The fix for the first is to search a manifold where every point is a valid game; the fix for the second is to put an LLM in the selection loop as a novelty judge, because structural metrics are blind to the entire corpus of published card games. It was built across a run of Claude models, Opus 4.5 through Fable 5, the latter of which rebuilt the measurement foundation before the US government pulled it three days after launch, leaving Opus 4.8 to execute the breakthrough. Whether any of the certified-novel games is actually fun is still untested, which is the whole point of the playable web version.
METADATA
date: June 17, 2026
length: 15.5K
reading: ~13m
BODY

Earlier this year I wrote down a plan: use Monte Carlo simulation and genetic algorithms to evolve novel card games playable with a standard 52-card deck. Encode a game as a genome of rules, simulate thousands of games with AI players, score the results on proxy metrics for “fun,” select and mutate the best, repeat. The last item on the plan was “top evolved games get human trials.”

I built almost all of it by pair-programming with whatever the newest Claude model was at the time. That detail turns out to matter, so keep it in the back of your mind: the project spans Opus 4.5 in November through Fable 5 this month, and toward the end a new model landed about every six weeks. One of them got pulled off the board by the US federal government partway through. The plan was right about the destination and wrong about almost the entire route. Two walls, two pivots, and one step I kept skipping.

The idea: a rulebook is just a genome

A card game is a set of rules you can encode: setup (deal pattern, hand sizes), turn structure (draw, play, pass), legal moves (match by suit or rank, special card effects), win conditions (empty hand, point threshold, capture), scoring. Encode all of that as a genome and you can mutate it and cross it over, taking the turn structure from game A and the win condition from game B.
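
As a rough sketch of what that encoding could look like, here is a toy genome with one field per rule group, plus the crossover described above. It is written in Python for brevity, and every field name and value is an illustrative assumption, not the project's actual schema.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Genome:
    # Hypothetical fields, one per rule group named in the text.
    hand_size: int        # setup
    turn_structure: str   # e.g. "draw-then-play"
    match_rule: str       # legal moves: "suit", "rank", or "suit-or-rank"
    win_condition: str    # e.g. "empty-hand" or "point-threshold"
    point_target: int     # scoring


def crossover(a: Genome, b: Genome) -> Genome:
    """Keep game A's turn structure and take game B's win condition."""
    return replace(a, win_condition=b.win_condition, point_target=b.point_target)


def mutate(g: Genome, rng: random.Random) -> Genome:
    """Nudge one numeric parameter; a fuller mutator would touch every field."""
    return replace(g, hand_size=max(1, g.hand_size + rng.choice([-1, 1])))


a = Genome(7, "draw-then-play", "suit-or-rank", "empty-hand", 0)
b = Genome(10, "play-then-draw", "rank", "point-threshold", 100)
child = crossover(a, b)
```

Nothing in this sketch checks that `child` describes a game that actually works, which is the problem the next section runs into.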

For fitness you score each candidate on measurable proxies for fun: decision density, comeback potential, skill-versus-luck, session length. You measure those by simulating thousands of games with three AI players. Random play validates the game is mechanically playable. Greedy play measures whether obvious strategy helps. MCTS approximates skilled play, and the gap between MCTS and random win rates is your skill signal. Clean idea. I still think it’s the right idea.
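
The skill signal is easy to show on a toy game. In the sketch below, the game, the policies, and the function names are all stand-ins I made up for illustration, and `strong_policy` is a placeholder for MCTS (in this toy game the best move is trivially the highest card).

```python
import random


def play_toy_game(policy_a, policy_b, rng):
    """Toy simulator: each seat is dealt three ranks and plays one.
    Returns True if seat A's card is strictly higher."""
    hand_a = [rng.randint(1, 13) for _ in range(3)]
    hand_b = [rng.randint(1, 13) for _ in range(3)]
    return policy_a(hand_a, rng) > policy_b(hand_b, rng)


def random_policy(hand, rng):
    return rng.choice(hand)


def strong_policy(hand, rng):
    # Placeholder for MCTS: here the best play is simply the highest card.
    return max(hand)


def win_rate(policy, opponent, games=2000, seed=0):
    """Fraction of simulated games that `policy` wins against `opponent`."""
    rng = random.Random(seed)
    wins = sum(play_toy_game(policy, opponent, rng) for _ in range(games))
    return wins / games


def skill_signal(games=2000, seed=0):
    """Gap between skilled and random win rates against a random opponent."""
    skilled = win_rate(strong_policy, random_policy, games, seed)
    baseline = win_rate(random_policy, random_policy, games, seed)
    return skilled - baseline
```

A game of pure luck would put `skill_signal()` near zero; the wider the gap, the more the game rewards playing well.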

Wall #1: almost every rulebook is garbage

The first version encoded rules directly and mutated them freely. Opus 4.5 built it across January, and the commit history is a wall of betting subsystems, a bytecode compiler, a CGo bridge, a web UI. The problem shows up the moment you run evolution: the overwhelming majority of random or mutated rulebooks produce games that don’t work. Infinite loops. States with no legal move. Win conditions you can never reach. Games that collapse into drawing and passing forever. The fitness landscape is a desert of zeros with rare islands of playable games, and mutation mostly walks you off an island into the sand.

So you spend your entire compute budget proving candidates invalid before you ever get to measure whether one is fun. No amount of cleverness in the fitness function fixes it, because the fitness function never runs on a broken game.

The pivot: stop searching for playable games

The fix was to stop searching the space of all rulebooks. Pick a handful of templates for how real card games actually work, where the game loop itself guarantees every state has a legal move, and evolve the parameters inside a template instead of evolving whether the game functions.

Six skeletons: shedding (Crazy Eights), trick-taking (Whist, Hearts), rummy (Gin), climbing (Big Two), casino-style capture (Scopa), and vying/betting (poker). Each is a runner that’s playable by construction. The genome stopped deciding whether the game works and started deciding only what happens inside a game that already works.
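
To make "playable by construction" concrete, here is a minimal sketch of a shedding-style runner, under my own assumptions about how such a loop might be written. The two guarantees are that the legal-move list always contains a fallback, and that a turn cap ends any game that would otherwise run forever. The parameter names and the `"draw-or-pass"` token are hypothetical.

```python
import random


def legal_moves(hand, top, match_rule):
    """Matching cards plus a fallback move that is always legal."""
    suit_ok = match_rule in ("suit", "suit-or-rank")
    rank_ok = match_rule in ("rank", "suit-or-rank")
    plays = [c for c in hand
             if (suit_ok and c[1] == top[1]) or (rank_ok and c[0] == top[0])]
    return plays + ["draw-or-pass"]  # never empty, whatever the parameters


def run_shedding(hand_size, match_rule, players=2, max_turns=500, seed=0):
    """Play one game with random moves; return the winning seat or None."""
    rng = random.Random(seed)
    deck = [(rank, suit) for rank in range(1, 14) for suit in "CDHS"]
    rng.shuffle(deck)
    hands = [[deck.pop() for _ in range(hand_size)] for _ in range(players)]
    top = deck.pop()
    for turn in range(max_turns):  # turn cap: the loop always terminates
        seat = turn % players
        move = rng.choice(legal_moves(hands[seat], top, match_rule))
        if move == "draw-or-pass":
            if deck:
                hands[seat].append(deck.pop())
        else:
            hands[seat].remove(move)
            top = move
            if not hands[seat]:
                return seat  # this seat emptied its hand
    return None  # capped out: scored as a draw, not a crash
```

Whatever values evolution picks for `hand_size` and `match_rule` (as long as the deal fits in the deck), the runner returns a result, so the fitness function always has something to score.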

That one change moves the entire search budget from validity to quality. Every candidate is a real game, so every simulation tells you something about how good it is instead of whether it runs. I threw out the Python and rewrote the whole thing in pure Go. This was the Opus 4.6-into-4.7 window, April. The rewrite and the skeleton idea are the cleanest structural decisions in the project, and they’re where the model stepped up from 4.5.

Measuring fun, badly, on purpose

The fitness function is five metrics: meaningful decisions, game arc, interaction, skill gradient, session length. None of them measure fun. They measure things that correlate with fun if you squint, which is the best you can do without a human in the room.

The trick that keeps them honest is calibration against ground truth. There are eleven real, published, time-tested games in the seed set, and a game still in circulation after decades is fun by survival. If your metric scores Gin Rummy a 0.3, the metric is broken. Real games are free labels, and you should use them.

The first cut of these metrics shipped with the April rewrite, and it was mostly proxies wearing the costume of measurements: decision density read off an event log instead of counting legal moves, interaction was a tag taxonomy, the calibration floor was a number I’d picked by feel. In June an audit caught all of it, including a few load-bearing claims from the v1 days that turned out to be false (the “parallel” fitness evaluator had been running serially the whole time because Python 3.13 multiprocessing deadlocks with CGo). The model that ran that audit was Fable 5, in its short window of availability, and it rebuilt the metrics into real measurements: decision density from actual legal-move counts, game arc from per-game lead trajectories, interaction from option-perturbation, a calibration gate where published classics provably outrank degenerate fixtures, and a battery of degeneracy vetoes so a game that passes random play but collapses under greedy play gets killed. More on that window later. For now the lesson: audit your own benchmarks before you quote them, including the ones you wrote feeling good about yourself.
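
A calibration gate and a degeneracy veto can both be very small. The sketch below is one way to express them, not the project's implementation: the fixture names, the completion-rate definition of "collapse," the 0.9 floor, and all the scores are placeholders chosen only to exercise the functions.

```python
def calibration_failures(classics, fixtures):
    """Return every (classic, fixture) pair where the degenerate fixture
    scores at least as high as the published classic. Empty list = gate passes."""
    return [(c, f)
            for c, c_score in classics.items()
            for f, f_score in fixtures.items()
            if f_score >= c_score]


def vetoed(completion_rate_by_player, floor=0.9):
    """Kill a game if any player type fails to finish enough of its games.
    Assumes 'collapse' means games hitting a turn cap instead of ending."""
    return any(rate < floor for rate in completion_rate_by_player.values())


# Placeholder numbers, not measured results.
classics = {"Gin Rummy": 0.71, "Hearts": 0.66}
fixtures = {"always-pass": 0.05, "coin-flip": 0.30}
```

With these placeholders, `calibration_failures(classics, fixtures)` is empty, and `vetoed({"random": 0.99, "greedy": 0.40})` is true: the game survives random play but dies under greedy play, so it gets killed.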

The other thing the rebuild taught, over and over, is that every metric is a lens ground for the games you already had, and it’s blind to anything shaped differently. Big Two scored near zero on interaction for a while. Climbing games are intensely interactive, but the metric measured changes in option count, and Big Two’s beat-or-pass constraint doesn’t register that way. Add a climbing-specific interaction measure and the score jumps to par with Gin. Every new game family taught the same thing: it needs its own interaction measure and its own greedy strategy, or the metrics literally cannot see it. You discover the blind spot only when something you know is good scores obviously wrong.

Wall #2: it works, and everything it makes already exists

Now the system reliably produced playable games. They were also all games you could already buy. Evolution finds the nearest fun attractor, and the nearest fun attractors are the games humans already discovered over a few centuries of playing cards. Playable was solved. Novel was not.

The deeper problem was that I couldn’t even measure novelty. Every structural novelty signal I had, behavioral distance and distance-from-seeds, anchors on my own seed set. A game can sit miles from my eleven seeds and still be a faithful rediscovery of Mau-Mau or Pinochle, published games I simply didn’t seed. The structural metrics are blind to the entire corpus of human card games they were never shown. You cannot compute “is this new to humanity” from a simulation, because the simulation has never heard of humanity’s games.
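
The blindness is visible in a few lines. This sketch assumes games are summarized as small feature vectors and that novelty is the distance to the nearest seed; the vectors are invented for illustration and the three-feature representation is my assumption, not the project's.

```python
import math


def distance_from_seeds(candidate, seeds):
    """Structural novelty as distance to the nearest seeded game."""
    return min(math.dist(candidate, vec) for vec in seeds.values())


# Made-up behavioral feature vectors.
seeds = {"Gin Rummy": (0.8, 0.2, 0.1), "Hearts": (0.1, 0.9, 0.3)}
unseeded_published = (0.2, 0.1, 0.9)  # stand-in for a published game I never seeded
rediscovery = unseeded_published      # evolution lands exactly on it
near_seed = (0.75, 0.2, 0.1)          # a light tweak of a seeded game
```

Here `distance_from_seeds(rediscovery, seeds)` comes out far larger than the same call on `near_seed`, so the metric reports a perfect rediscovery as highly novel. The function has no argument through which the rest of the published corpus could enter.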

This is where I stalled for weeks.

The pivot: an LLM is the only thing that read the rulebooks

The thing that has read the corpus of published card games is a language model. So I generate a blind rulebook for each evolved game, a dossier with no hint of where it came from, and ask an LLM a narrow question: is this a novel game or a rediscovery of something published, and name the closest existing game. Then I feed that verdict back into selection as a fitness term. Novel compositions get explored more. Rediscoveries get starved.
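
In code, the judge is just another term in the score. The sketch below assumes three verdict labels and a hand-picked bonus table, both of which are illustrative; `fake_judge` stands in for the real LLM call, which is not shown.

```python
from typing import Callable, NamedTuple


class Verdict(NamedTuple):
    label: str         # "novel", "variant", or "known" (label names assumed)
    closest_game: str  # the judge's nearest published game


# Hypothetical weights: reward judged-novel games, starve rediscoveries.
NOVELTY_BONUS = {"novel": 0.3, "variant": 0.0, "known": -0.3}


def selection_score(structural_fitness: float,
                    rulebook: str,
                    judge: Callable[[str], Verdict]):
    """Combine the simulated fitness with the judge's verdict."""
    verdict = judge(rulebook)
    return structural_fitness + NOVELTY_BONUS[verdict.label], verdict


def fake_judge(rulebook: str) -> Verdict:
    # Stand-in for the LLM call so the sketch runs offline.
    label = "novel" if "combo" in rulebook else "known"
    return Verdict(label, "Crazy Eights")
```

With these weights a judged-novel game at fitness 0.6 outscores a rediscovery at 0.8, which is the whole point: selection stops walking toward the nearest known game.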

Three moving parts. A selection term during evolution that rewards judged-novel lineages. A checkpoint scheme so the verdict table compounds across generations. A publication ranking so a certified-novel game surfaces above a higher-fitness rediscovery. The judging runs cheap-in-loop and expensive-to-certify: Sonnet judges every chunk during evolution, Opus certifies the final top set. In a controlled check the two agreed within one category on all six test games, exactly on four, with zero flips between “known” and “novel,” and Sonnet is the more conservative of the two, which is the safe direction for an in-loop filter that can starve a lineage.
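
Two of those parts fit in a short sketch: a verdict table that persists so each rulebook is judged once, and a publication order that puts certified-novel games first. The hashing scheme and the tuple layout are my assumptions about one reasonable way to do this.

```python
import hashlib


def rulebook_key(rulebook: str) -> str:
    """Stable key so the same rulebook text is never judged twice."""
    return hashlib.sha256(rulebook.encode("utf-8")).hexdigest()


def judged(rulebook, table, judge):
    """Look up a verdict in the table, calling the judge only on a miss."""
    key = rulebook_key(rulebook)
    if key not in table:
        table[key] = judge(rulebook)
    return table[key]


def publication_order(games):
    """games: list of (name, fitness, verdict_label).
    Certified-novel games first, then by fitness within each group."""
    return sorted(games, key=lambda g: (g[2] != "novel", -g[1]))
```

Given `[("A", 0.9, "known"), ("B", 0.6, "novel"), ("C", 0.7, "novel")]`, the order is C, B, A: the highest-fitness game lands last because it is a rediscovery. Saving `table` alongside each checkpoint is what lets verdicts compound across generations.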

The mechanism that actually generates novelty is cross-skeleton borrowing. Let a shedding game borrow a mechanic from another family. Shallow borrows, the ones that only add a scoring tally, read as variants. The borrow that mattered changes the legal-move set: multi-card combos lifted out of climbing and dropped into shedding, so the moves you can make are different, not just the points you score. Pair that with a win condition that ends the game on points accumulated over several rounds and you get the validated recipe. Blind judges certify that combination novel four times out of four. Pure move-tweaks with no win-condition change get correctly called variants, two out of two. The judge is discriminating.
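
The difference between a shallow borrow and a deep one shows up directly in the move generator. In this sketch the combo rule (same-rank groups where at least one card matches the pile) is a hypothetical example of a climbing-style mechanic, not the rule the evolved games actually use.

```python
from itertools import combinations


def matches(card, top):
    """Standard shedding match: same rank or same suit as the top card."""
    return card[0] == top[0] or card[1] == top[1]


def single_plays(hand, top):
    return [(c,) for c in hand if matches(c, top)]


def combo_plays(hand, top, max_size=3):
    """Borrowed from climbing (assumed rule): play a same-rank group
    if at least one card in it matches the top of the pile."""
    plays = []
    for size in range(2, max_size + 1):
        for group in combinations(hand, size):
            same_rank = len({c[0] for c in group}) == 1
            if same_rank and any(matches(c, top) for c in group):
                plays.append(group)
    return plays


def legal_plays(hand, top, borrow_combos):
    plays = single_plays(hand, top)
    if borrow_combos:
        plays += combo_plays(hand, top)
    return plays
```

For the hand `[(7, "H"), (7, "S"), (9, "H"), (2, "C")]` on a top card of `(4, "H")`, the plain game has two legal plays and the borrowed version has three, the extra one being the pair of sevens. A scoring-only borrow would leave this list untouched.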

The honest accounting of who did what cuts against the easy version of this story, so here it is straight. The design idea was mine, not the model’s: a borrow that changes the legal-move set, a mechanic-aware novelty pressure that selects for deep fusions, and the judge wired directly into selection. I’d been circling the novelty problem for weeks, and what unblocked it was a specific architectural idea. The breakthrough commits, dated June 14, are that idea executed by Opus 4.8. The model was an excellent pair of hands. The insight was a human one. That’s the version I’d want told about my own work, and it happens to be true.

One bug is worth the war story regardless of which model hit it. For a stretch the judges called everything a variant, including games I was sure were new. The rulebook generator was rendering every borrowed mechanic as a generic, parameter-free blurb, so the judges were assessing novelty blind to the actual rules. I made it render the concrete rule, re-judged the exact same games, and they went from all-variant to mostly novel. The games never changed. The description did. Your measurement can lie to you even when the thing being measured is fine, and an LLM judge is only as good as the document you hand it.
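
The shape of that bug, reduced to a sketch: two renderers over the same mechanic, one of which throws the parameters away. The dictionary layout and the wording of both templates are hypothetical.

```python
def render_generic(mechanic):
    """The buggy shape: names the family and drops every parameter."""
    return f"This game borrows a mechanic from the {mechanic['family']} family."


def render_concrete(mechanic):
    """The fix: state the actual rule the players follow."""
    p = mechanic["params"]
    return (f"On your turn you may play {p['min_cards']} to {p['max_cards']} "
            f"cards of the same rank together as a single move.")


# Hypothetical borrowed mechanic.
combo = {"family": "climbing", "params": {"min_cards": 2, "max_cards": 4}}
```

Two games with very different `params` produce identical text under `render_generic`, so a judge reading that text has nothing to tell them apart.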

The models built it, mostly as hands

I always used the newest Claude model available, and execution quality tracked the releases closely: Opus 4.5 produced the v1 sprawl and its dead end, 4.6 and 4.7 the clean skeleton rewrite, Fable 5 the measurement rebuild, Opus 4.8 the novelty breakthrough.

Model | Released | Role in DarwinDeck
Opus 4.5 | Nov 2025 | v1, the Python/Go hybrid and its unconstrained-genome dead end
Opus 4.6 | Feb 2026 | start of the pure-Go rewrite
Opus 4.7 | Apr 2026 | the skeletons, the first cut of the five metrics
Opus 4.8 | May 2026 | judge-in-loop scaffolding; took back over June 13 and executed the novelty breakthrough
Fable 5 | Jun 9, 2026 | the June 11-12 audit and metrics rebuild; pulled June 13 by the US government suspension

Fable 5 shipped on June 9 and ran the next three days under deliberately loose direction, the whole brief being “audit this and make the measurements real.” In a roughly 76-commit burst it did exactly that: rebuilt the five metrics from proxies into actual measurements, calibrated them so published classics outrank degenerate fixtures, added the degeneracy vetoes that keep the search honest, made the game loop pure, and built the determinized ISMCTS player that gives the skill gradient its teeth. Then, on June 13, the US government ordered Anthropic to suspend Fable 5 and Mythos 5 for all foreign nationals on national-security grounds, and because Anthropic couldn’t reliably gate access by nationality it pulled both models for everyone.[1] So the most capable model I had spent three days hardening my card-game fitness function, then got recalled by a federal directive, and Opus 4.8 took back over. Two days later the breakthrough landed.

The thing that did not track the model releases was the key idea. The deep-borrow plus integration-pressure plus judge-in-the-loop architecture was mine, and I handed it to Opus 4.8 to build. The lesson I take from that runs opposite to the hype: the model was never the bottleneck on the creative step. Better models made me a faster and more reliable builder at every phase, and Fable made the ground under the whole thing trustworthy, but the insight that broke the stall was a human one. Only the Opus 4.5 attribution is actually in the commit metadata, for what it’s worth; the rest is the release calendar laid over the commit dates plus my own memory of what I was running, and one externally-dated event that happens to bracket the handoff precisely.

The step I kept skipping: ask a human

The whole tower, five proxy metrics plus an LLM novelty judge, optimizes for fun and novelty without a single human ever playing one of these games. The original plan knew this. The last bullet was human playtesting, and the plan explicitly said to validate the proxies by correlating their scores against enjoyment ratings. It’s the one step I kept deferring, because it’s the only one that needs other people.

So this week I built the thing that should have come earlier. A web server that serves any evolved game as a browser game against the AI, with a rating widget, deployed at cards.signalnine.net. It runs a curated set across all six skeletons. Now the loop the plan called for on day one can finally close: people play the judge-certified-novel games, rate them, and those ratings tell me whether any of the proxies correlate with fun at all.
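
Once ratings exist, the validation itself is a rank correlation between each proxy and the mean human rating per game. Here is a dependency-free sketch of Spearman's coefficient; using Spearman rather than some other statistic is my assumption, and no real ratings are shown because none exist yet.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; undefined (divides by zero) if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(proxy_scores, human_ratings):
    """Rank correlation between a proxy metric and enjoyment ratings."""
    return pearson(ranks(proxy_scores), ranks(human_ratings))
```

A proxy that orders games the same way players do scores near 1, and one that orders them backwards scores near -1. A proxy hovering around zero is a candidate for the half of the tower to throw away.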

I don’t know the answer yet, and that’s the point. It’s entirely possible the certified-novel games are clever and not fun. That result would be worth more than building three more skeletons, because it would tell me which half of the tower to throw away.

What I’d tell someone starting this

Constrain the search space to a manifold where every point is valid, then search inside it. Optimizing validity burns your whole budget; you want every evaluation spent on quality.

Calibrate your metrics against real ground truth. Published games are free labels. If your metric ranks them low, the metric is wrong.

Assume your metrics are blind to anything structurally unlike what you built them for. You’ll only find the blind spot when something you trust scores obviously wrong, so go looking on purpose.

Treat “playable” and “novel” as separate problems. Selection walks toward the nearest known-good thing unless you actively push it away.

Some properties can’t be computed, only judged. An LLM is a usable fitness function for a thing your code can’t measure, here whether a game is new to human culture. Run a cheap model in the loop and an expensive one to certify.

The model is the executor, not the source of the idea. The novelty problem sat unsolved for weeks, and a specific design idea broke it. Better models made me faster at every step and built the breakthrough cleanly once I’d designed it, but the design was mine. When you’re stuck, upgrade your understanding of the problem before your tooling.

And the ground truth you’re approximating has to be measured directly at some point, or you’re optimizing your own assumptions with great precision. Build that instrument early. I built it last, which is the one thing about the route I’d actually change.

If you want to help me find out whether any of this is fun, go play a few and rate them: cards.signalnine.net. The ratings are the experiment now.