TurboQuant (ICLR 2026) compresses transformer KV caches using Walsh-Hadamard rotation plus Lloyd-Max quantization. 2-5x compression, minimal quality loss. The paper shipped with a llama.cpp fork that had Metal kernels but no CUDA support.
I spent 12 days porting it. 140 commits, 6 PRs, two overnight runs of an automated optimizer that found a 3.5x speedup nobody would have written by hand.
the initial port
PR #3, March 26-27. Goal: make turbo3 (3-bit PolarQuant, 3.4 bpv) work on NVIDIA. That meant flash attention kernels (VEC, MMA, tile paths) with on-the-fly dequant from 3-bit block format, a warp-cooperative Walsh-Hadamard transform using __shfl_xor_sync for the binary butterfly, quantize/dequant for the KV cache with centroid lookup and sign decode, and mixed KV types so you could run turbo3 K with q8_0 V or vice versa.
Asymmetric compression matters because cheaper K with better V is often the right tradeoff.
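The warp-cooperative WHT is easier to see on the CPU. Here is a minimal sketch in plain C++, with array indices standing in for lanes; in the actual kernel each lane holds one element and finds its butterfly partner with `__shfl_xor_sync`. The transform is unnormalized (real use divides by sqrt(n)).

```cpp
#include <cstddef>

// Binary butterfly of the Walsh-Hadamard transform. On the GPU each lane
// holds one element and pairs up via __shfl_xor_sync(mask, x, stride);
// here the same XOR pairing runs over a plain array. n must be a power of two.
void wht(float *x, size_t n) {
    for (size_t stride = 1; stride < n; stride <<= 1) {
        for (size_t i = 0; i < n; ++i) {
            size_t j = i ^ stride;          // butterfly partner
            if (j > i) {                    // visit each pair once
                float a = x[i], b = x[j];
                x[i] = a + b;               // "lower lane" keeps the sum
                x[j] = a - b;               // "upper lane" keeps the difference
            }
        }
    }
}
```

Because the pairing is pure XOR on the index, the in-warp version needs no shared memory at all for widths up to 32, which is exactly what the later shuffle optimization exploits.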
First benchmark on RTX 5090: 187 t/s generation on Qwen3.5-35B at 3.47x KV compression. 98.5% of f16 decode speed.
expanding the type system
PRs #24 and #30, March 28-29. Three additions:
turbo4 (4-bit, 4.6 bpv) for applications that need higher quality. Required nibble-packed dequant and a 16-entry centroid LUT.
turbo2 (2-bit, 2.4 bpv) for VRAM-constrained scenarios — 6.4x compression. Only 4 centroids. Fast but lossy at long context.
Zero-padding for non-128 heads. GLM-4.7 Flash has head_dim=160, which doesn’t align with the 128-element WHT. Zero-pad on quantize, mask on dequant, no quality loss.
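For the turbo4 path, the nibble unpacking plus 16-entry LUT looks roughly like this. This is a sketch only: the fork's actual block layout, scale handling, and centroid values may differ.

```cpp
#include <cstdint>

// Nibble-packed 4-bit dequant with a 16-entry centroid table.
// Each byte carries two codes; each code indexes a learned centroid.
void dequant_turbo4(const uint8_t *packed, const float centroids[16],
                    float scale, float *out, int n_bytes) {
    for (int i = 0; i < n_bytes; ++i) {
        const uint8_t byte = packed[i];
        out[2 * i + 0] = scale * centroids[byte & 0x0F];  // low nibble
        out[2 * i + 1] = scale * centroids[byte >> 4];    // high nibble
    }
}
```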
Also added cross-type flash attention so any combination of {turbo2, turbo3, turbo4, q8_0} works for K and V independently.
Bug parade: GLM-4.7 crashed on turbo4 because context init didn’t account for zero-padding. KV state serialization wrote wrong tensor widths. MLA models needed Q rotation group_size derived from K, not V.
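The padding itself is trivial; the bugs were all in the bookkeeping around it. A sketch of the shape logic, with hypothetical names and assuming rounding up to the next multiple of the 128-wide WHT:

```cpp
#include <cstring>

// Round head_dim up to the 128-element WHT width (160 -> 256 for GLM-4.7),
// zero-fill the tail before quantizing, and mask it off again on dequant.
constexpr int WHT_WIDTH = 128;

int wht_padded_dim(int head_dim) {
    return ((head_dim + WHT_WIDTH - 1) / WHT_WIDTH) * WHT_WIDTH;
}

void pad_head(const float *src, float *dst, int head_dim) {
    const int padded = wht_padded_dim(head_dim);
    memcpy(dst, src, head_dim * sizeof(float));
    memset(dst + head_dim, 0, (padded - head_dim) * sizeof(float));
}
```

The serialization and context-init bugs above were all cases where one code path used `head_dim` and another used the padded width.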
auto-asymmetric KV
PR #36 on March 31. Symmetric turbo3 on Qwen2.5-7B blew up: PPL 2887 instead of ~8.
The cause was GQA. Qwen has 7 query heads per KV head, so each K vector serves 7 queries, and quantization error in K gets amplified across every query it serves.
The fix: at KV cache init, if the GQA ratio is >= 6:1, silently upgrade K to q8_0 and keep V at turbo3. Zero user intervention. PPL goes back to normal. Opt-out via TURBO_AUTO_ASYMMETRIC=0 for users who really want to suffer.
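The detection amounts to a few lines at cache init. A sketch with hypothetical names (the fork reads the real head counts from the model hyperparameters):

```cpp
#include <cstdlib>
#include <cstring>

enum kv_type { KV_TURBO3, KV_Q8_0 };

// If the GQA ratio (query heads per KV head) is >= 6, quantization error
// in each K vector is amplified across too many queries: upgrade K to
// q8_0 and leave V alone. TURBO_AUTO_ASYMMETRIC=0 opts out.
kv_type pick_k_type(int n_head, int n_head_kv, kv_type requested) {
    const char *env = std::getenv("TURBO_AUTO_ASYMMETRIC");
    const bool enabled = !(env && std::strcmp(env, "0") == 0);
    if (enabled && n_head_kv > 0 && n_head / n_head_kv >= 6)
        return KV_Q8_0;
    return requested;
}
```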
Same PR included the first round of FA kernel optimizations: shared-memory LUT for turbo KQ scoring (+15% at 32K), q8_1 Q quantization path for turbo VEC kernels, __expf fast-math softmax, warp shuffle for the WHT butterfly that eliminated 5 of 7 __syncthreads.
autoresearch
April 1-6. Inspired by Karpathy’s autoresearch, I built an agent loop that edits a kernel, builds, benchmarks, and keeps or reverts the change based on the result. While I sleep, it tries things.
`scripts/autoresearch/`
- `run_track.sh` — outer loop: invoke agent, manage git branches
- `run_experiment.sh` — build -> llama-bench -> PPL check -> coherence check
- `build.sh` — cmake wrapper
- `track-kv/` — KV FA kernel optimization track
- `track-weight/` — TQ4_1S weight kernel optimization track
Branch per session. 3-failure pause. 5-stall redirect. Single-file enforcement. GPU thermal logging. Coherence testing spins up llama-server and validates factual Q&A so the agent can’t optimize itself into a model that produces gibberish.
track 1: KV flash attention (PR #53)
33 experiments on fattn-vec.cuh. The agent found:
- `nthreads_KQ=1`, `nthreads_V=8` — better occupancy for turbo’s simpler dequant
- Warp shuffle KQ scores — eliminates shared memory for score reduction
- Precomputed scaled V centroids per block — factor scale out of the inner loop
- `__launch_bounds__` occupancy 2 — better register allocation
- Shmem KQ LUT — precompute Q × centroid in shared memory
Result: +9% decode throughput on turbo3 KV.
track 2: TQ4_1S weight kernels (PR #57)
This was the big one. The TQ4_1S weight mul_mat_vec kernel started at 68 t/s with a naive scalar implementation. 86 experiments across two overnight runs. The progression:
- fp16 activation buffer (V1) — base
- Shared-memory centroid LUT (V3, +89%) — moves the centroid table from constant to shared memory, eliminating serialization on divergent lane access
- half2 arithmetic (V5) — pack activation pairs for 2x density
- Vectorized 128-bit loads (V6, +45%) — load 4x uint32 weight data and 4x int4 activation data per iteration
- Register `__byte_perm` centroid decode (V12) — zero-memory centroid lookup using the byte-permute instruction. The breakthrough.
- dp4a int8 path (V14) — pre-rotate activations to q8_1, use the integer dot-product intrinsic for 4x throughput vs float FMA
- NWARPS 8→4 (V15) — better occupancy for the dp4a path
Final: 240 t/s. 3.5x speedup. The __byte_perm trick at experiment 12 was not something I’d have tried manually. That’s the autoresearch payoff — the agent doesn’t get bored or anchor on a first idea, it just tries dozens of approaches systematically.
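Why the `__byte_perm` trick works: the intrinsic treats its two register operands as an eight-byte pool and assembles a result word from any four of them, selected by the nibbles of the third operand. Pack the 8 centroids into two registers and the packed codes become the selector, so the lookup table is effectively an instruction operand. A CPU model, ignoring the intrinsic's sign-replication modes; how the fork actually packs codes and centroids may differ, this just shows the mechanism:

```cpp
#include <cstdint>

// CPU model of CUDA's __byte_perm(a, b, s): result byte i is pool byte
// sel_i, where sel_i is nibble i of s and the pool is the 8 bytes of b:a.
uint32_t byte_perm(uint32_t a, uint32_t b, uint32_t s) {
    const uint64_t pool = ((uint64_t)b << 32) | a;
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        const uint32_t sel = (s >> (4 * i)) & 0x7;  // byte index 0..7
        r |= (uint32_t)((pool >> (8 * sel)) & 0xFF) << (8 * i);
    }
    return r;
}

// Decode four 3-bit codes at once: the 8 int8 centroids sit in two
// registers (cent_lo = centroids 0-3, cent_hi = 4-7) and the packed
// codes act directly as the byte selector. No memory access at all.
uint32_t decode4_codes(uint32_t codes, uint32_t cent_lo, uint32_t cent_hi) {
    return byte_perm(cent_lo, cent_hi, codes);
}
```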
the prefill problem
April 6-7. The TQ4_1S weight kernel was great for decode (ne[1]=1) but prefill was 2x slower than the load-time q8_0 conversion path. Root cause: cuBLAS uses int8 tensor core GEMM for q8_0 but falls back to fp32 GEMM for TQ4_1S after element-wise dequant.
I tried four approaches:
- Warp-cooperative dequant kernel. 16x less compute per block for TQ4_1S→fp16. Didn’t help — the bottleneck was GEMM dispatch, not dequant.
- Multi-token dp4a kernel templated on `ncols_dst` (1-8). Loads weight data once and reuses it across tokens. Good for speculative decoding, modest for general prefill.
- Runtime dequant + cuBLAS. TQ4_1S→fp16 on the fly, then cuBLAS fp16 tensor core GEMM. ~5.5K t/s prefill.
- Lazy fp16 cache. Dequant once on first prefill, persist. ~12K t/s warm. Discarded — 15 GB extra VRAM defeats the entire point of compression.
The pragmatic choice: load-time q8_0 conversion as default (13.3K t/s prefill, 175 t/s decode), with GGML_TQ_NATIVE=1 opt-in for decode-heavy workloads (5.5K t/s prefill, 226 t/s decode).
You can’t have both. Compressed formats save bandwidth (good for decode) but need decompression (bad for prefill). Pick where you pay.
cross-platform fixes
Community testing surfaced a stack of platform issues:
- Multi-GPU crash. Static global CUDA buffers (`d_act_buf`, `d_q8_1_buf`) were allocated on GPU 0 and accessed from GPU 1. Switched to per-device `ggml_cuda_pool_alloc` from `ctx.pool(id)`.
- HIP/ROCm build failure. `__dp4a` is NVIDIA-only. Replaced with the `ggml_cuda_dp4a` wrapper from `common.cuh`.
- RDNA4 performance regression. dp4a int8 regressed on AMD (101 vs 135 t/s). TheTom contributed an arch-dispatch fix routing AMD to a scalar half-precision kernel.
- Static build failure. The `TURBO_IQ_API` macro assumed `GGML_BACKEND_BUILD` was defined; broke with `BUILD_SHARED_LIBS=OFF`. Wrapped in `#ifdef GGML_BACKEND_SHARED`.
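The `__dp4a` portability issue is representative: the intrinsic computes a packed 4×int8 dot product with 32-bit accumulate, which any backend can express. A CPU reference of the semantics the wrapper has to provide on each platform (a behavioral model, not the wrapper's actual code):

```cpp
#include <cstdint>

// Reference semantics of dp4a: interpret a and b as four signed int8
// lanes, multiply pairwise, and accumulate into c.
int32_t dp4a_ref(uint32_t a, uint32_t b, int32_t c) {
    for (int i = 0; i < 4; ++i) {
        const int8_t ai = (int8_t)(a >> (8 * i));
        const int8_t bi = (int8_t)(b >> (8 * i));
        c += (int32_t)ai * (int32_t)bi;
    }
    return c;
}
```

On NVIDIA this is one instruction; on AMD it maps to a different intrinsic per architecture, which is exactly why the RDNA4 dispatch fix was needed.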
alpha scaling
April 7-8. TheTom’s PR #60 introduced TURBO_ALPHA — a norm scaling factor for turbo dequant that reduces KL divergence on Q4_K_M weight models. I ported it to CUDA using per-TU __constant__ variables initialized lazily from the environment variable.
Validated on Mistral-24B Q4_K_M: -14.2% mean KLD with TURBO_ALPHA=102, exactly matching Metal. The effect is model-dependent: strong on Mistral, moderate on Q2_K Qwen, negligible on GLM and DeepSeek MoE.
Per-TU __constant__ memory in CUDA is a trap. Template instantiation across dozens of .cu files means each gets its own copy. cudaMemcpyToSymbol only writes to the calling TU’s copy. I lost hours debugging alpha scaling before I understood this.
the numbers
prefill — turbo3 vs f16 KV (RTX 5090)
| Model | pp512 | pp32K | pp65K | pp131K |
|---|---|---|---|---|
| Qwen3.5-35B f16 | 7,033 | 6,395 | 5,806 | OOM |
| Qwen3.5-35B turbo3 | 6,877 (-2.2%) | 6,277 (-1.8%) | 5,707 (-1.7%) | 4,702 |
| Mistral-24B f16 | 5,322 | 3,920 | 3,005 | OOM |
| Mistral-24B turbo3 | 5,148 (-3.3%) | 3,740 (-4.6%) | 2,918 (-2.9%) | 1,717 |
| GLM-4.7 f16 | 6,869 | 3,080 | 1,970 | OOM |
| GLM-4.7 turbo3 | 6,767 (-1.5%) | 3,063 (-0.5%) | 1,957 (-0.7%) | 1,074 |
f16 OOMs at 131K context on every model. turbo3 still runs.
quality — turbo3 vs q4_0 KV (Qwen3.5-35B)
| Metric | f16 | q4_0 (4.5 bpv) | turbo3 (3.4 bpv) | turbo2 (2.4 bpv) |
|---|---|---|---|---|
| PPL @2K | 5.754 | 5.805 | 5.772 | 5.799 |
| PPL @8K | 5.975 | 6.009 | 6.023 | 5.986 |
| Mean KLD | — | 0.02278 | 0.02220 | 0.02743 |
| NIAH 4-32K | 19/20 | 20/20 | 20/20 | 18/20 |
turbo3 at 3.4 bpv matches q4_0 quality (4.5 bpv) with 25% less V memory and perfect needle-in-a-haystack retrieval. turbo2 at 2.4 bpv is viable but starts missing needles past 16K.
TQ4_1S weight compression
| Config | pp512 | tg128 | VRAM |
|---|---|---|---|
| Load-time q8_0 (default) | 13,291 | 175 | 7.5 GiB |
| Native TQ4_1S | 5,492 | 226 | 4.5 GiB |
Native decode is 29% faster with 40% less VRAM. The cost is slower prefill.
what I learned
Autoresearch works. The __byte_perm trick at experiment 12 is the kind of optimization a human spends days on. The agent found it overnight by trying approaches a human would dismiss out of stubbornness or boredom.
The prefill/decode tradeoff is fundamental. Compressed formats save bandwidth (good for decode) but need decompression (bad for prefill). There’s no free lunch.
Per-TU __constant__ memory in CUDA is a trap. I lost hours to it.
Auto-asymmetric is essential. GQA models with high K:Q ratios catastrophically amplify K quantization error. Detecting this automatically and upgrading K is the difference between “works” and “PPL 2887.”
Test on the exact model. Alpha scaling helps 14% on Mistral, 0% on GLM. Benchmark results from one model don’t generalize.
by the numbers
- 140 commits across 12 days
- 6 PRs (4 merged, 2 open)
- 119 autoresearch experiments (86 weight + 33 KV FA)
- 3.5x decode speedup on the TQ4_1S weight kernel (68 → 240 t/s)
- 2.7x KV compression with <1% PPL loss and perfect NIAH
- 131K context where f16 OOMs on 32 GB
The fork: github.com/TheTom/llama-cpp-turboquant. PRs #3, #24, #30, #36, #53, #57, plus the feature/cuda-alpha-scaling branch.