I Pruned Two Layers From a 30B Code Model and Got 47% Better at Coding Benchmarks. Then I Tested It On Real Software Engineering Tasks.

It got worse.

But the full story is more interesting than that.

The Setup

I have two projects that collided.

Thunderdome is an agentic SWE benchmark I built that tests models on real multi-file tasks. Unlike SWE-bench, which is mostly “here’s a bug, fix it,” Thunderdome tasks are greenfield – build an ecommerce backend, implement full-text search, create a plugin marketplace from scratch. The model has to make architectural decisions, not just patch diffs. 3,300+ scored trials across 95 orchestrator variants so far.

llm-circuit-finder is a toolkit for layer surgery on transformer models – duplicating or removing layers from GGUF files to see what happens. The idea: not every layer in a model is helping. Some are actively hurting. Find them, cut them out, see if the model gets better.

The technique builds on David Ng’s RYS method, which showed you can duplicate specific middle-layer blocks in a transformer and get better reasoning – no training, no weight changes, just running the same layers twice. The upstream toolkit generalized this into a search framework but only does duplication. My fork adds pruning and custom code/tool-use probes, because I wanted to know if cutting layers can make models smaller without making them dumber.
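Conceptually, pruning a GGUF is a renumber-and-rewrite job: drop the tensors for the removed blocks, shift every later block's index down, and update the layer-count metadata. Here's a minimal sketch of just the index remap — `prune_layer_map` is a hypothetical helper for illustration, not the fork's actual API, and real surgery also has to rename tensors like `blk.29.attn_q.weight` accordingly:

```python
def prune_layer_map(n_layers: int, del_start: int, del_end: int) -> dict[int, int]:
    """Map old layer indices to new ones after removing [del_start, del_end).

    Hypothetical helper: real GGUF surgery additionally rewrites tensor
    names (blk.N.*) and the architecture's block_count metadata key.
    """
    keep = [i for i in range(n_layers) if not (del_start <= i < del_end)]
    return {old: new for new, old in enumerate(keep)}

# del(28,30): layers 28-29 removed, everything from 30 up shifts down by two
mapping = prune_layer_map(48, 28, 30)
print(len(mapping))   # 46 layers remain
print(mapping[30])    # old layer 30 becomes layer 28
```

The half-open `del(start,end)` convention matches the config notation used in the results table below.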

Seven models, four architectures.

The Headline Result

Qwen3-Coder-30B-A3B has 48 layers and 128 MoE experts. I deleted layers 28-29 and the model went from 38.1% to 84.9% on my coding probe — a 47-point jump. Eleven of twelve pruning configurations improved code scores. Layers 28-29 aren't just redundant; they're actively interfering with code generation. Or are they?

I ran the pruned model through Thunderdome on real SWE tasks.

-3.1%.

Worse. Not catastrophically worse, and the pruned model has an 11% smaller VRAM footprint and runs ~3% faster, but still worse.

On the bright side, if you only evaluate on single-turn coding benchmarks and never test on real workloads, this technique is incredible. Truly a breakthrough for the field of publishing misleading leaderboard scores.

Seven Models, One Table

Model              Arch          Experts  Best Config   Code Probe Δ  Thunderdome Δ
Qwen3-Coder-30B    Pure MoE      128      del(28,30)    +46.9%        -3.1%
Mixtral 8x7B       Pure MoE      8        dup(24,27)    +11.9%        +0.9%
DeepSeek-Coder-V2  MoE+dense     64       dup(6,9)      +7.8%         0.0%
GPT-OSS-20B        Pure MoE      128      all hurt      —             —
Devstral-24B       Dense         —        all hurt      —             —
Nemotron-Nano-30B  SSM+MoE       128      all crash     —             —
Qwen3.5-35B-A3B    MoE+thinking  256      incompatible  —             —

The pattern: probe improvements don't transfer to real SWE work. Across every model and strategy, surgically modified models score within ~5% of baseline on Thunderdome. The probes measure isolated algorithmic ability. Thunderdome measures multi-turn agentic reasoning: file navigation, test interpretation, context management across multiple tool calls. A different skill entirely. Even adding a tool-use probe alongside the coding one didn't help find real improvements.

For context: Thunderdome’s top orchestrators (Conclave Review on Opus 4.6, various disciplined Sonnet configurations) score 87-88%. Qwen3-Coder 30B running locally scores 54.6%. The gap between local open models and orchestrated frontier models is ~33 points, and layer surgery moves the needle by less than 5 in either direction. The bottleneck is model capability, not architecture.

Expert Count Determines Strategy

High experts (128+) → PRUNE redundant layers
Low experts (8-64)  → DUPLICATE useful layers

Models with many experts have natural redundancy — the routing network has learned to spread computation across so many experts that some layers become net-negative for specific tasks. Removing them helps on probes because you’re eliminating interference.

Models with few experts (like Mixtral’s 8) have every layer doing load-bearing work. You can’t remove anything without breaking something. But you can duplicate useful layers to reinforce computation that’s already working.

Where the Circuits Live

The Qwen3-Coder layer map is interesting:

Layer:  0    8    12   16   20   24   28   32   36   40   48
        |----|----|----|----|----|----|----|----|----|----|
Reason: .....████.......................................   (layers 8-12)
Code:   ..........████.........████████.................   (layers 12-15, 28-32)

Reasoning and code share circuits at layers 12-15. The code-specific region at 28-32 is where the best prune target sits. The model grew a dedicated code-processing region in its late-middle layers, and on the probe, part of that region does more harm than good.

But Thunderdome tells you those “harmful” layers are still part of routing paths that multi-turn SWE tasks depend on. The model adapted to their presence. Removing them improves the isolated code generation kernel while degrading the connective tissue around it.

What Doesn’t Work At All

Dense models. Devstral-24B: 0 of 12 configurations improved either metric. Every layer is tightly coupled. No surgery possible.

Hybrid SSM+MoE. Nemotron-Nano-30B: every configuration crashed on load. The alternating Mamba-2/attention pattern creates rigid layer dependencies. If you’re thinking about trying this on Nemotron models, save yourself the time.

Too few layers. GPT-OSS-20B has 128 experts but only 24 layers — not enough redundancy to prune, not enough structure to duplicate meaningfully.

Thinking models. Qwen3.5’s thinking mode generates 200+ tokens of reasoning trace that breaks probe evaluation. Separate issue but worth noting.

The Question You’re Actually Asking

“Can I use this to fit Qwen3-Coder on my 16GB GPU?”

I tested this directly. Pruned 6 layers (42 remaining) and quantized to Q4_K_S: 15.4 GB, -3.5% on Thunderdome. Then I just used a lower quantization on the full model — Q3_K_M: 14.7 GB, -1.1% on Thunderdome.

Lower quantization wins. Smaller file, better performance.

Variant            Size     Thunderdome Δ
Baseline Q4_K_M    18.6 GB  —
Pruned + Q4_K_S    15.4 GB  -3.5%
Full model Q3_K_M  14.7 GB  -1.1%

The reason is intuitive once you see the per-task breakdown. Quantization distributes quality loss evenly — every layer gets a little worse but nothing breaks. Pruning concentrates the loss — one task (plugin-marketplace) dropped from 0.477 to 0.215 because it depended on the removed layers. With Q3_K_M that same task only dropped to 0.468.
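You can see the difference in a toy per-task breakdown. The distributions below are illustrative, not Thunderdome data — only the plugin-marketplace pair (0.477 → 0.215 pruned, → 0.468 quantized) comes from the actual runs — but they show how similar aggregate deltas hide very different worst cases:

```python
# Two ways to lose quality across 19 tasks, per the post's argument.
even  = [-0.012] * 19            # quantization: every task a little worse
spike = [0.0] * 18 + [-0.262]    # pruning: one dependent task collapses

for name, deltas in [("quantized", even), ("pruned", spike)]:
    print(f"{name}: mean {sum(deltas) / len(deltas):+.3f}, "
          f"worst single task {min(deltas):+.3f}")
```

The means are nearly identical; the worst-case drop differs by an order of magnitude, which is exactly what the plugin-marketplace task showed.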

Quantization is a progressive tax. Pruning is an amputation.

If you want Qwen3-Coder on a 4060 Ti or 5080, use Q3_K_M at 14.7 GB. No surgery needed. That’s a 30B MoE code model on a $300 card with 19 tasks’ worth of Thunderdome data showing only -1.1% degradation from the full-size quant. It won’t match a $200/mo Claude subscription — Qwen3-Coder scores 54.6% where Opus-backed orchestrators score 87%+ — but it’s free, it’s local, and it’s a lot better than a 7B dense model.

So What Is Layer Surgery Actually Good For?

Understanding model internals. The probe results are real: layers 28-29 in Qwen3-Coder genuinely interfere with algorithmic code generation. The expert-count heuristic (high experts → prune, low experts → duplicate) holds across architectures. The circuit map showing where reasoning and code live in the layer stack is novel and useful for anyone doing mechanistic interpretability work on MoE models.

It’s just not a deployment optimization. If your goal is fitting a model into less VRAM, use a lower quant. If your goal is understanding why a model behaves the way it does, surgery is a powerful diagnostic tool.

Reproduce It

Everything’s open except the Thunderdome benchmark repos which are only private so they dont end up in a training dataset, ping me if you want access to those:

git clone https://github.com/signalnine/llm-circuit-finder.git
cd llm-circuit-finder
pip install gguf requests tqdm

# Run the code+SWE probe sweep on any GGUF model
python sweep.py \
  --model /path/to/model.gguf \
  --llama-server /path/to/llama-server \
  --block-sizes 2 3 --stride 4 --start-min 4 --start-max 40 \
  --server-args --n-gpu-layers 999 --flash-attn on

Raw data for all 7 models is in the repo as JSONL. Thunderdome is at github.com/signalnine/thunderdome. All results converged at n=3.

Hardware

Everything ran on a single RTX 5090 (32 GB) with an i7-14700 and 64 GB DDR4. Inference via ollama 0.18.0 and llama.cpp built from source with CUDA sm_120. No cloud GPUs required for any of the experiments in this post.