TTY/01 /posts/porting-turboquant-to-cuda load: 0.42 0.51 0.38 mem: 18.2/32G ONLINE
gabe@signalnine:~/posts$ man porting-turboquant-to-cuda
NAME
porting-turboquant-to-cuda — Porting TurboQuant to CUDA
SYNOPSIS
I built the CUDA backend for TurboQuant — Walsh-Hadamard rotation plus Lloyd-Max KV cache compression. 12 days, 140 commits, 6 PRs. 3.5x decode speedup on the weight kernel, with the breakthrough optimization found by an automated agent at experiment 12 of an overnight run.
METADATA
date      May 9, 2026
length    10.7K
reading   ~8m
tags
BODY

TurboQuant (ICLR 2026) compresses transformer KV caches using Walsh-Hadamard rotation plus Lloyd-Max quantization. 2-5x compression, minimal quality loss. The paper shipped with a llama.cpp fork that had Metal kernels but no CUDA support.

I spent 12 days porting it. 140 commits, 6 PRs, two overnight runs of an automated optimizer that found a 3.5x speedup nobody would have written by hand.

the initial port

PR #3, March 26-27. Goal: make turbo3 (3-bit PolarQuant, 3.4 bpv) work on NVIDIA. That meant:

  • Flash attention kernels (VEC, MMA, tile paths) with on-the-fly dequant from the 3-bit block format
  • A warp-cooperative Walsh-Hadamard transform using __shfl_xor_sync for the binary butterfly
  • Quantize/dequant for the KV cache with centroid lookup and sign decode
  • Mixed KV types, so you can run turbo3 K with q8_0 V or vice versa

Asymmetric compression matters because cheaper K with better V is often the right tradeoff.
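For orientation, here is roughly what the warp-cooperative butterfly looks like: a minimal sketch that assumes each of the 32 lanes holds 4 consecutive elements of a 128-element head vector, not the actual kernel.

  // Minimal sketch of a warp-cooperative 128-point Walsh-Hadamard transform.
  // Layout assumption: element index = lane*4 + i, so stages 1-2 run inside
  // each lane's registers and stages 3-7 exchange values across lanes with
  // __shfl_xor_sync instead of shared memory. Normalization is omitted.
  __device__ void wht128_warp(float v[4]) {
      const int lane = threadIdx.x & 31;

      // Stages 1-2: 4-point butterflies inside the lane.
      const float a0 = v[0] + v[1], a1 = v[0] - v[1];
      const float a2 = v[2] + v[3], a3 = v[2] - v[3];
      v[0] = a0 + a2; v[1] = a1 + a3;
      v[2] = a0 - a2; v[3] = a1 - a3;

      // Stages 3-7: cross-lane butterflies. The partner lane differs in one
      // bit of the lane id; the lower lane keeps the sum, the upper lane the
      // difference.
      #pragma unroll
      for (int mask = 1; mask < 32; mask <<= 1) {
          #pragma unroll
          for (int i = 0; i < 4; ++i) {
              const float other = __shfl_xor_sync(0xffffffff, v[i], mask);
              v[i] = (lane & mask) ? (other - v[i]) : (v[i] + other);
          }
      }
  }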

First benchmark on RTX 5090: 187 t/s generation on Qwen3.5-35B at 3.47x KV compression. 98.5% of f16 decode speed.

expanding the type system

PRs #24 and #30, March 28-29. Three additions:

turbo4 (4-bit, 4.6 bpv) for applications that need higher quality. Required nibble-packed dequant and a 16-entry centroid LUT (sketched below).

turbo2 (2-bit, 2.4 bpv) for VRAM-constrained scenarios — 6.4x compression. Only 4 centroids. Fast but lossy at long context.

Zero-padding for non-128 heads. GLM-4.7 Flash has head_dim=160, which doesn’t align with the 128-element WHT. Zero-pad on quantize, mask on dequant, no quality loss.
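The turbo4 nibble-packed dequant mentioned above is conceptually simple. A sketch under an assumed block layout (the struct, field sizes, and names are illustrative, not the real ggml block format):

  #include <cuda_fp16.h>
  #include <cstdint>

  // Hypothetical 4-bit block: a per-block scale plus 32 weights packed two
  // 4-bit centroid indices per byte. The real format differs in detail.
  struct block_turbo4 {
      half    d;        // per-block scale
      uint8_t qs[16];   // 32 x 4-bit centroid indices
  };

  // lut: the 16 Lloyd-Max centroids, assumed staged in shared memory.
  __device__ void dequant_block_turbo4(const block_turbo4 * b, const half * lut, float * out) {
      const float d = __half2float(b->d);
      #pragma unroll
      for (int i = 0; i < 16; ++i) {
          const uint8_t q = b->qs[i];
          out[2*i + 0] = d * __half2float(lut[q & 0x0F]);  // low nibble
          out[2*i + 1] = d * __half2float(lut[q >> 4]);    // high nibble
      }
  }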

Also added cross-type flash attention so any combination of {turbo2, turbo3, turbo4, q8_0} works for K and V independently.

Bug parade: GLM-4.7 crashed on turbo4 because context init didn’t account for zero-padding. KV state serialization wrote wrong tensor widths. MLA models needed Q rotation group_size derived from K, not V.

auto-asymmetric KV

PR #36 on March 31. Symmetric turbo3 on Qwen2.5-7B blew up: PPL 2887 instead of ~8.

The cause was GQA. Qwen2.5-7B has a 7:1 ratio of query heads to KV heads, so each K vector serves 7 queries. Quantization error in K gets amplified across every query that shares it.

The fix: at KV cache init, if the GQA ratio is >= 6:1, silently upgrade K to q8_0 and keep V at turbo3. Zero user intervention. PPL goes back to normal. Opt-out via TURBO_AUTO_ASYMMETRIC=0 for users who really want to suffer.
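The check itself is tiny. A host-side sketch of the init-time decision (GGML_TYPE_Q8_0 is real; GGML_TYPE_TURBO3, the function name, and the exact hook point are assumptions):

  #include <cstdlib>
  #include "ggml.h"   // ggml_type, GGML_TYPE_Q8_0

  // GGML_TYPE_TURBO3 is fork-specific and shown here only for illustration.
  static ggml_type turbo_pick_k_type(int n_head, int n_head_kv, ggml_type requested) {
      const int    gqa_ratio = n_head / n_head_kv;
      const char * env       = std::getenv("TURBO_AUTO_ASYMMETRIC");
      const bool   opt_out   = env && std::atoi(env) == 0;
      // A high GQA ratio multiplies K quantization error across every query
      // head that shares the K vector, so upgrade K and leave V compressed.
      if (!opt_out && gqa_ratio >= 6 && requested == GGML_TYPE_TURBO3) {
          return GGML_TYPE_Q8_0;
      }
      return requested;
  }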

Same PR included the first round of FA kernel optimizations: shared-memory LUT for turbo KQ scoring (+15% at 32K), q8_1 Q quantization path for turbo VEC kernels, __expf fast-math softmax, warp shuffle for the WHT butterfly that eliminated 5 of 7 __syncthreads.

autoresearch

April 1-6. Inspired by Karpathy’s autoresearch, I built an agent loop that edits a kernel, builds, benchmarks, and keeps or reverts the change based on the result. While I sleep, it tries things.

scripts/autoresearch/
  run_track.sh      — outer loop: invoke agent, manage git branches
  run_experiment.sh — build -> llama-bench -> PPL check -> coherence check
  build.sh          — cmake wrapper
  track-kv/         — KV FA kernel optimization track
  track-weight/     — TQ4_1S weight kernel optimization track

One branch per session. Pause after 3 consecutive failures. Redirect after 5 experiments without improvement. Single-file edit enforcement. GPU thermal logging. Coherence testing spins up llama-server and validates factual Q&A so the agent can’t optimize itself into a model that produces gibberish.

track 1: KV flash attention (PR #53)

33 experiments on fattn-vec.cuh. The agent found:

  • nthreads_KQ=1, nthreads_V/=8 — better occupancy for turbo’s simpler dequant
  • Warp shuffle KQ scores — eliminates shared memory for score reduction
  • Precomputed scaled V centroids per block — factor scale out of the inner loop
  • __launch_bounds__ occupancy 2 — better register allocation
  • Shmem KQ LUT — precompute Q × centroid in shared memory

Result: +9% decode throughput on turbo3 KV.
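The warp-shuffle score reduction in that list is the standard xor-shuffle pattern. A generic sketch rather than the literal kernel code:

  // After 5 xor-shuffle steps every lane holds the warp-wide result, with no
  // shared memory and no __syncthreads.
  __device__ __forceinline__ float warp_reduce_sum(float x) {
      #pragma unroll
      for (int mask = 16; mask > 0; mask >>= 1) {
          x += __shfl_xor_sync(0xffffffff, x, mask);
      }
      return x;
  }

  // The same pattern with fmaxf gives the running maximum for the softmax.
  __device__ __forceinline__ float warp_reduce_max(float x) {
      #pragma unroll
      for (int mask = 16; mask > 0; mask >>= 1) {
          x = fmaxf(x, __shfl_xor_sync(0xffffffff, x, mask));
      }
      return x;
  }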

track 2: TQ4_1S weight kernels (PR #57)

This was the big one. The TQ4_1S weight mul_mat_vec kernel started at 68 t/s with a naive scalar implementation. 86 experiments across two overnight runs. The progression:

  1. fp16 activation buffer (V1) — base
  2. Shared-memory centroid LUT (V3, +89%) — moves the centroid table from constant to shared memory, eliminating serialization on divergent lane access
  3. half2 arithmetic (V5) — pack activation pairs for 2x density
  4. Vectorized 128-bit loads (V6, +45%) — load 4x uint32 weight data and 4x int4 activation data per iteration
  5. Register __byte_perm centroid decode (V12) — zero-memory centroid lookup built on the PRMT byte-permute instruction. The breakthrough.
  6. dp4a int8 path (V14) — pre-rotate activations to q8_1, use the integer dot product intrinsic for 4x throughput vs float FMA
  7. NWARPS 8→4 (V15) — better occupancy for the dp4a path

Final: 240 t/s. 3.5x speedup. The __byte_perm trick at experiment 12 was not something I’d have tried manually. That’s the autoresearch payoff — the agent doesn’t get bored or anchor on a first idea, it just tries dozens of approaches systematically.
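To make the idea concrete: __byte_perm (PRMT) can act as an 8-entry, register-resident lookup table whose packed int8 output feeds __dp4a directly. A simplified sketch; the real V12/V14 kernels handle the 16-entry TQ4_1S table with an extra permute/select step:

  #include <cstdint>

  // Eight signed-char centroids live packed across two 32-bit registers.
  // Each selector nibble of idx4 holds one 3-bit index; __byte_perm picks
  // the corresponding byte with no shared- or constant-memory traffic.
  __device__ __forceinline__ uint32_t decode4_centroids(uint32_t lut_lo, uint32_t lut_hi,
                                                        uint32_t idx4) {
      return __byte_perm(lut_lo, lut_hi, idx4);
  }

  // The packed result pairs directly with q8_1 activations via dp4a:
  // a 4-way int8 dot product with 32-bit accumulate in one instruction.
  __device__ __forceinline__ int dot4_accum(uint32_t centroids, uint32_t activations, int acc) {
      return __dp4a((int) centroids, (int) activations, acc);
  }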

the prefill problem

April 6-7. The TQ4_1S weight kernel was great for decode (ne[1]=1) but prefill was 2x slower than the load-time q8_0 conversion path. Root cause: cuBLAS uses int8 tensor core GEMM for q8_0 but falls back to fp32 GEMM for TQ4_1S after element-wise dequant.

I tried four approaches:

  • Warp-cooperative dequant kernel. 16x less compute per block for TQ4_1S→fp16. Didn’t help — the bottleneck was GEMM dispatch, not dequant.
  • Multi-token dp4a kernel templated on ncols_dst (1-8). Loads weight data once and reuses it across tokens (sketched after this list). Good for speculative decoding, modest for general prefill.
  • Runtime dequant + cuBLAS. TQ4_1S→fp16 on the fly, then cuBLAS fp16 tensor core GEMM. ~5.5K t/s prefill.
  • Lazy fp16 cache. Dequant once on first prefill, persist. ~12K t/s warm. Discarded — 15 GB extra VRAM defeats the entire point of compression.
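The multi-token kernel boils down to a register-reuse trick. A simplified fp16 sketch (the real kernel is templated over the quantized TQ4_1S block format and uses dp4a; assume blockDim.x == 32, one warp per output row):

  #include <cuda_fp16.h>

  // Each weight element is loaded once and multiplied against all ncols_dst
  // activation vectors while it sits in a register.
  template <int ncols_dst>
  __global__ void mul_mat_multitok(const half * __restrict__ W,  // [nrows, ncols]
                                   const half * __restrict__ x,  // [ncols_dst, ncols]
                                   float      * __restrict__ y,  // [ncols_dst, nrows]
                                   const int ncols, const int nrows) {
      const int row = blockIdx.x * blockDim.y + threadIdx.y;
      if (row >= nrows) return;

      float acc[ncols_dst] = {0.0f};
      for (int col = threadIdx.x; col < ncols; col += 32) {
          const float w = __half2float(W[(size_t) row * ncols + col]);
          #pragma unroll
          for (int j = 0; j < ncols_dst; ++j) {
              acc[j] += w * __half2float(x[(size_t) j * ncols + col]);
          }
      }
      #pragma unroll
      for (int j = 0; j < ncols_dst; ++j) {
          // Sum this row's 32 partial results across the warp via xor-shuffle.
          for (int mask = 16; mask > 0; mask >>= 1) {
              acc[j] += __shfl_xor_sync(0xffffffff, acc[j], mask);
          }
          if (threadIdx.x == 0) y[(size_t) j * nrows + row] = acc[j];
      }
  }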

The pragmatic choice: load-time q8_0 conversion as default (13.3K t/s prefill, 175 t/s decode), with GGML_TQ_NATIVE=1 opt-in for decode-heavy workloads (5.5K t/s prefill, 226 t/s decode).

You can’t have both. Compressed formats save bandwidth (good for decode) but need decompression (bad for prefill). Pick where you pay.
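In practice the choice is a single load-time switch. A host-side sketch (the enum and function names are illustrative; GGML_TQ_NATIVE is the opt-in described above):

  #include <cstdlib>

  enum class tq_weight_path { convert_q8_0, native_tq4_1s };

  static tq_weight_path tq_pick_weight_path() {
      const char * env = std::getenv("GGML_TQ_NATIVE");
      // Default: convert TQ4_1S weights to q8_0 at load time so prefill can
      // run through cuBLAS int8 tensor-core GEMM. Opt in to the native path
      // when the workload is decode-heavy and VRAM matters more than prefill.
      return (env && std::atoi(env) == 1) ? tq_weight_path::native_tq4_1s
                                          : tq_weight_path::convert_q8_0;
  }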

cross-platform fixes

Community testing surfaced a stack of platform issues:

  • Multi-GPU crash. Static global CUDA buffers (d_act_buf, d_q8_1_buf) were allocated on GPU 0 and accessed from GPU 1. Switched to per-device ggml_cuda_pool_alloc from ctx.pool(id).
  • HIP/ROCm build failure. __dp4a is NVIDIA-only. Replaced with the ggml_cuda_dp4a wrapper from common.cuh.
  • RDNA4 performance regression. dp4a int8 regressed on AMD (101 vs 135 t/s). TheTom contributed an arch dispatch fix routing AMD to a scalar half-precision kernel.
  • Static build failure. The TURBO_IQ_API macro assumed GGML_BACKEND_BUILD was defined; broke with BUILD_SHARED_LIBS=OFF. Wrapped in #ifdef GGML_BACKEND_SHARED.

alpha scaling

April 7-8. TheTom’s PR #60 introduced TURBO_ALPHA — a norm scaling factor for turbo dequant that reduces KL divergence on Q4_K_M weight models. I ported it to CUDA using per-TU __constant__ variables initialized lazily from the environment variable.

Validated on Mistral-24B Q4_K_M: -14.2% mean KLD with TURBO_ALPHA=102, exactly matching Metal. The effect is model-dependent: strong on Mistral, moderate on Q2_K Qwen, negligible on GLM and DeepSeek MoE.

Per-TU __constant__ memory in CUDA is a trap. Template instantiation across dozens of .cu files means each gets its own copy. cudaMemcpyToSymbol only writes to the calling TU’s copy. I lost hours debugging alpha scaling before I understood this.
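The shape of the trap and the lazy-init workaround, as a sketch (file layout, symbol names, and the interpretation of TURBO_ALPHA’s raw value are illustrative):

  // turbo_alpha.cuh -- included by many .cu files. Because the symbol has
  // internal linkage, EVERY translation unit gets its own __constant__ copy.
  #include <cstdlib>
  #include <cuda_runtime.h>

  static __constant__ float g_turbo_alpha;

  // cudaMemcpyToSymbol(g_turbo_alpha, ...) called from one .cu file writes
  // only that file's copy; kernels compiled in other .cu files still read
  // their own zero-initialized copy. Workaround: each TU initializes its
  // own copy lazily from the environment, so all copies end up consistent.
  static bool g_turbo_alpha_set = false;   // host-side flag, also per TU

  static void turbo_alpha_init_once() {
      if (g_turbo_alpha_set) return;
      const char * env   = std::getenv("TURBO_ALPHA");
      const float  alpha = env ? (float) std::atof(env) : 0.0f;   // raw value; scaling omitted
      cudaMemcpyToSymbol(g_turbo_alpha, &alpha, sizeof(alpha));   // writes THIS TU's copy
      g_turbo_alpha_set = true;
  }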

the numbers

prefill (t/s) — turbo3 vs f16 KV (RTX 5090)

Model                   pp512            pp32K            pp65K            pp131K
Qwen3.5-35B f16         7,033            6,395            5,806            OOM
Qwen3.5-35B turbo3      6,877 (-2.2%)    6,277 (-1.8%)    5,707 (-1.7%)    4,702
Mistral-24B f16         5,322            3,920            3,005            OOM
Mistral-24B turbo3      5,148 (-3.3%)    3,740 (-4.6%)    2,918 (-2.9%)    1,717
GLM-4.7 f16             6,869            3,080            1,970            OOM
GLM-4.7 turbo3          6,767 (-1.5%)    3,063 (-0.5%)    1,957 (-0.7%)    1,074

f16 OOMs at 131K context on every model. turbo3 still runs.

quality — turbo3 vs q4_0 KV (Qwen3.5-35B)

Metric              f16       q4_0 (4.5 bpv)    turbo3 (3.4 bpv)    turbo2 (2.4 bpv)
PPL @2K             5.754     5.805             5.772               5.799
PPL @8K             5.975     6.009             6.023               5.986
Mean KLD (vs f16)   —         0.02278           0.02220             0.02743
NIAH 4-32K          19/20     20/20             20/20               18/20

turbo3 at 3.4 bpv matches q4_0 quality (4.5 bpv) with 25% less V memory and perfect needle-in-a-haystack retrieval. turbo2 at 2.4 bpv is viable but starts missing needles past 16K.

TQ4_1S weight compression

Config                      pp512 (t/s)    tg128 (t/s)    VRAM
Load-time q8_0 (default)    13,291         175            7.5 GiB
Native TQ4_1S               5,492          226            4.5 GiB

Native decode is 29% faster at 40% less VRAM. The cost is slower prefill.

what I learned

Autoresearch works. The __byte_perm trick at experiment 12 is the kind of optimization a human spends days on. The agent found it overnight by trying approaches a human would dismiss out of stubbornness or boredom.

The prefill/decode tradeoff is fundamental. Compressed formats save bandwidth (good for decode) but need decompression (bad for prefill). There’s no free lunch.

Per-TU __constant__ memory in CUDA is a trap. I lost hours to it.

Auto-asymmetric is essential. GQA models with high K:Q ratios catastrophically amplify K quantization error. Detecting this automatically and upgrading K is the difference between “works” and “PPL 2887.”

Test on the exact model. Alpha scaling helps 14% on Mistral, 0% on GLM. Benchmark results from one model don’t generalize.

by the numbers

  • 140 commits across 12 days
  • 6 PRs (4 merged, 2 open)
  • 119 autoresearch experiments (86 weight + 33 KV FA)
  • 3.5x decode speedup on the TQ4_1S weight kernel (68 → 240 t/s)
  • 2.7x KV compression with <1% PPL loss and perfect NIAH
  • 131K context where f16 OOMs on 32 GB

The fork: github.com/TheTom/llama-cpp-turboquant. PRs #3, #24, #30, #36, #53, #57, plus the feature/cuda-alpha-scaling branch.