`--ctx-size 262144 --parallel 2` — Gemma 4 31B with a full 128K context window per user on a single RTX 5090.
At `--ctx-size 32768 --parallel 2`, where both configurations fit in VRAM:
| Metric | Q8_0 V Cache | TQ3_0 V Cache | Difference |
|---|---|---|---|
| Generation Speed | 66.2 tok/s | 65.1 tok/s | −1.7% |
| VRAM Used | 22,753 MiB | 22,347 MiB | −406 MiB |
| Output Quality | Correct | Correct | Identical |
Two simultaneous large-context requests on the 262K configuration. Q8_0 cannot run this test — it fails at server startup.
Tested configurations on an RTX 5090 (32 GB) with Gemma 4 31B. Total context fixed at 128K, except the last row, where each of two slots gets its own full 128K.
| Parallel Slots | Context / Slot | Q8_0 V Cache (VRAM) | TQ3_0 V Cache (VRAM) |
|---|---|---|---|
| 1 | 128K | 25,045 MiB | OK |
| 4 | 32K | 28,708 MiB | OK |
| 7 | 18K | 32,325 MiB — OOM | 31,279 MiB |
| 2 | 128K each | 31,685 MiB — OOM | 28,485 MiB |
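The Q8_0 rows of the table let you back out a rough per-slot KV-cache cost, assuming VRAM usage is approximately base + slots × per-slot cache (a simplification that ignores allocator overhead and fragmentation):

```python
# Back-of-envelope from the table's Q8_0 rows (MiB), assuming
# VRAM ≈ base + slots * kv_per_slot. Illustrative only.
one_slot_128k  = 25045   # 1 slot x 128K
two_slots_128k = 31685   # 2 slots x 128K (OOM on a 32,607 MiB card)

kv_per_slot = two_slots_128k - one_slot_128k   # ≈ 6,640 MiB per 128K slot at Q8_0
base        = one_slot_128k - kv_per_slot      # model weights + overhead

print(kv_per_slot, base)  # 6640 18405
```

The same arithmetic explains the final row: shrinking the V-cache share with TQ3_0 brings the two-slot configuration down to 28,485 MiB, which fits where Q8_0 does not.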
- dequantize_V: O(1) codebook lookup per element (no WHT) inside FA's VEC kernel
- IWHT32: warp-shuffle CUDA kernel applied once per head per layer, after attention
- FATTN_VEC_CASE template instantiation for TQ3_0 — no intermediate F16 buffer

Getting from "it compiles" to "it works at full speed" required solving a chain of bugs, each hidden behind the last. If you're adding a new quantization type to llama.cpp, this section might save you days.
The dequantize_V_tq3_0 function used codebook values {-1.0, -0.3, 0.3, 1.0} copied from
TheTom's fork, which doesn't use WHT at all. The actual quantize function uses Lloyd-Max centroids
{-1.510, -0.4528, 0.4528, 1.510}. Since the deferred IWHT requires exact WHT-domain values,
wrong centroids meant the correction couldn't produce valid output.
Lesson: When adapting code from a fork that uses a different algorithm variant, every constant needs to be verified against your actual quantize path. "Close enough" doesn't exist in quantization.
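If those centroids come from Lloyd-Max quantization of a unit Gaussian (consistent with the values quoted), they can be reproduced, and thus verified, with a few lines of Lloyd's algorithm. The sample size and seed here are arbitrary:

```python
import numpy as np

# Lloyd's algorithm on a standard-normal sample. Converges to the 4-level
# Lloyd-Max centroids cited above: {-1.510, -0.4528, 0.4528, 1.510}.
rng = np.random.default_rng(0)
x = rng.standard_normal(2_000_000)

c = np.array([-1.5, -0.5, 0.5, 1.5])      # initial guess
for _ in range(50):
    edges = (c[:-1] + c[1:]) / 2          # nearest-centroid cell boundaries
    idx = np.searchsorted(edges, x)       # assign each sample to a cell
    c = np.array([x[idx == k].mean() for k in range(4)])

print(np.round(c, 3))
```

Running the same check against the constants actually compiled into the dequant path would have caught the copied codebook immediately.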
The IWHT kernel uses a 32-bit bitmask to encode which elements get sign-flipped during the inverse transform.
The mask was 0xAA6CD4B2 instead of the correct 0xB1A6D4B2. This was verified by computing
the mask directly from the sign array used in the forward WHT:
```python
signs = [+1,-1,+1,+1,-1,-1,+1,-1,+1,+1,-1,+1,-1,+1,-1,-1,
         +1,-1,-1,+1,+1,-1,+1,-1,-1,+1,+1,+1,-1,-1,+1,-1]
mask = sum((1 << j) for j in range(32) if signs[j] == -1)

assert mask == 0xB1A6D4B2   # Expected: computed from the forward WHT's sign array
# The kernel had 0xAA6CD4B2 ← wrong
```
Lesson: Packed bitmask constants can't be eyeballed. Compute them programmatically from the source of truth and assert equality.
After fixing bugs 1 and 2, the output was still garbage. The IWHT kernel was numerically correct (verified against a Python reference to 3.6×10⁻⁸ error). It was firing on all 60 layers. The input/output types were correct. Everything looked right.
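For reference, a 32-point orthonormal WHT of the kind used for that numerical check can be written in a few lines of Python. Since the orthonormal transform is its own inverse, applying it twice must return the input. This is a reference sketch only; it omits the per-element sign flips, which are a separate step:

```python
import numpy as np

def wht32(x):
    """Orthonormal 32-point Walsh-Hadamard transform (involutory)."""
    y = np.asarray(x, dtype=np.float64).copy()
    h = 1
    while h < 32:
        for i in range(0, 32, 2 * h):      # butterfly stages
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    return y / np.sqrt(32.0)

x = np.random.default_rng(1).standard_normal(32)
err = np.max(np.abs(wht32(wht32(x)) - x))  # round-trip error, ~1e-15 in f64
```

A float32 CUDA kernel checked against this f64 reference should agree to roughly 1e-7, which matches the 3.6×10⁻⁸ figure above.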
The breakthrough came from capturing raw float values at the IWHT kernel. When we compared the "fast" path (codebook-only dequant + IWHT correction) against the "slow" path (full IWHT in dequant, no correction), the values going into the IWHT were byte-for-byte identical. That should have been impossible — the two paths should produce completely different FA outputs.
Root cause: dequantize_V_tq3_0 in fattn-common.cuh was never being called. Flash Attention had no template instantiation for TQ3_0:
```cpp
// fattn.cu — these existed:
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_Q8_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_F16)

// this did NOT exist:
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_TQ3_0) // ← missing
```
Without a matching template, the ggml scheduler silently dequantized the V cache from TQ3_0 to F16
using convert.cu's general-purpose dequantize — which included the full IWHT already.
Our code changes to the FA-internal dequant were dead code. The build system gave zero warnings.
Lesson: Adding a new type to llama.cpp's type system does NOT automatically enable Flash Attention support.
Without explicit FATTN_VEC_CASE instantiations, FA silently falls back to F16 conversion.
You will get correct output (from the conversion) but none of your FA-internal optimizations will run.
The only way to detect this is to verify your code actually executes.
After fixing the silent dequantization by removing the IWHT from convert.cu (so the scheduler's
F16 conversion would output WHT-domain values), the IWHT correction produced correct output. But performance
was 9.7 tok/s vs 66.2 tok/s baseline — a 6.8× slowdown.
The problem: the scheduler was still converting TQ3_0 → F16 into a temporary buffer before FA, adding a full extra write and read of the V cache. The IWHT wasn't the bottleneck — the unnecessary conversion was.
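A rough bits-per-element count shows why the temporary buffer hurts, assuming about 3 bits per V element for TQ3_0 and 16-bit F16. This counts only V-cache traffic, so the observed 6.8× end-to-end slowdown is smaller than the raw ratio:

```python
# Approximate DRAM traffic per V-cache element, in bits. Illustrative only.
direct    = 3             # FA's VEC kernel reads TQ3_0 in place
converted = 3 + 16 + 16   # read TQ3_0, write F16 temp, read F16 inside FA

ratio = converted / direct
print(ratio)  # ≈ 11.7x more V-cache traffic via the temp buffer
```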
The fix was 3 changes to fattn.cu:
```cpp
// 1. Add template instantiations (tells FA how to read TQ3_0 directly)
FATTN_VEC_CASE(256, GGML_TYPE_Q8_0, GGML_TYPE_TQ3_0)  // SWA layers
FATTN_VEC_CASE(512, GGML_TYPE_Q8_0, GGML_TYPE_TQ3_0)  // full attn layers

// 2. Allow K≠V types for TQ3_0
if (K->type != V->type && V->type != GGML_TYPE_TQ3_0) return BEST_FATTN_KERNEL_NONE;

// 3. Force VEC kernel for TQ3_0 (MMA expects F16)
if (V->type == GGML_TYPE_TQ3_0) return BEST_FATTN_KERNEL_VEC;
```
Result: 9.7 tok/s → 65.1 tok/s. The VEC kernel now reads TQ3_0 data directly via
dequantize_V_tq3_0, no intermediate buffer, no unnecessary conversion.
The IWHT unary op corrects the output once per head per layer.
Lesson: If your new quant type is working but mysteriously slow, check whether FA is actually using your native dequant or silently converting to F16. The speed difference is 6.8×.
| Step | Outcome |
|---|---|
| 1 | Implemented TQ3_0 type, IWHT kernel, FA dequant function. Compiles clean. Output: garbage. |
| 2 | Fixed codebook + sign mask. Still garbage; IWHT kernel verified numerically correct. |
| 3 | Discovered dequantize_V was dead code — no FA template. Removed the IWHT from convert.cu. Output: correct; speed: 9.7 tok/s (6.8× slowdown). |
| 4 | Added native FA support (3 changes in fattn.cu). Speed: 65.1 tok/s — matching the Q8_0 baseline. |
| Component | Detail |
|---|---|
| GPU | NVIDIA RTX 5090 |
| VRAM | 32,607 MiB |
| Compute | SM 12.0 (Blackwell) |
| Platform | RunPod |
| Model | gemma-4-31B-it-UD-Q4_K_XL |
| Model Size | 17.46 GB |
| Parameters | 30.70B |
| llama.cpp | b8639 (a1cfb64) + TQ3_0 patches |
| Flash Attention | VEC kernel (native TQ3_0) |
| K Cache | Q8_0 (8-bit) |
| V Cache | TQ3_0 (3-bit, deferred IWHT) |