TQ3_0 V Cache — KV Cache Compression Benchmark

TurboQuant 3-bit V cache with deferred inverse Walsh-Hadamard Transform
April 4, 2026 · Gemma 4 31B Q4_K_XL · RTX 5090 (32 GB)

The Test: Two Users × 128K Context Each

--ctx-size 262144 --parallel 2 — Gemma 4 31B with full 128K context window per user on a single RTX 5090.

| V Cache | Format | Result |
|---|---|---|
| Q8_0 | 8 bits per element (default) | OOM: 31,685 MiB needed, 31,602 MiB available |
| TQ3_0 | 3 bits per element + IWHT correction | 65.4 tok/s · 28,485 MiB · correct output |

VRAM Usage

| Configuration | VRAM | Headroom |
|---|---|---|
| Q8_0 V cache | 31,685 MiB | over the GPU limit: OOM |
| TQ3_0 V cache | 28,485 MiB | 3.1 GB free |
| GPU total | 32,607 MiB | |

VRAM saved: 3.1 GB · V cache compression: 2.67× · Speed overhead: 1.7%
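The 2.67× figure follows directly from bits-per-element arithmetic. A quick sanity check (a simplified sketch that ignores per-block scale overhead, which both formats carry):

```python
# V cache compression ratio: 8-bit Q8_0 vs 3-bit TQ3_0 (scale overhead ignored)
q8_bits, tq3_bits = 8, 3
ratio = q8_bits / tq3_bits
assert round(ratio, 2) == 2.67

# VRAM freed in the 2 x 128K configuration above
saved_mib = 31_685 - 28_485
saved_gib = saved_mib / 1024
assert saved_mib == 3_200 and round(saved_gib, 1) == 3.1
```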

Speed Comparison

At --ctx-size 32768 --parallel 2 where both configurations fit in VRAM:

| Metric | Q8_0 V Cache | TQ3_0 V Cache | Difference |
|---|---|---|---|
| Generation speed | 66.2 tok/s | 65.1 tok/s | −1.7% |
| VRAM used | 22,753 MiB | 22,347 MiB | −406 MiB |
| Output quality | Correct | Correct | Identical |

Concurrent Load Test

Two simultaneous large-context requests on the 262K configuration. Q8_0 cannot run this test — it fails at server startup.

TQ3_0 V Cache — Both Slots Active

| Slot | Prompt Tokens | Generation Speed |
|---|---|---|
| 0 | 27,027 | 4.4 tok/s |
| 1 | 16,027 | 36.2 tok/s |

Wall time: 77.1 s · Combined context: 43,054 tokens · Peak VRAM: 29,125 / 32,607 MiB

Extended Stress Test — 127K Combined Context

| Slot | Prompt Tokens | Generation Speed |
|---|---|---|
| 0 | 75,028 | 12.8 tok/s |
| 1 | 52,028 | 21.4 tok/s |

Wall time: 283 s · Combined context: 127,056 tokens · Peak VRAM: 29,127 / 32,607 MiB

Where Q8_0 Fails, TQ3_0 Fits

Tested configurations on RTX 5090 (32 GB) with Gemma 4 31B. Total context is fixed at 128K for the first three rows; the final row repeats the headline two-user configuration with 128K per slot.

| Parallel Slots | Context / Slot | Q8_0 V Cache | TQ3_0 V Cache |
|---|---|---|---|
| 1 slot | 128K | 25,045 MiB (OK) | |
| 4 slots | 32K | 28,708 MiB (OK) | |
| 7 slots | 18K | 32,325 MiB (OOM) | 31,279 MiB (OK) |
| 2 slots | 128K each | 31,685 MiB (OOM) | 28,485 MiB (OK) |

How It Works

The Problem: TurboQuant compresses V cache values using a Walsh-Hadamard Transform (WHT) before quantizing to 3 bits. Reading values back requires the inverse WHT on all 32 elements per block — expensive when called per-element inside Flash Attention's inner loop.
The Insight: WHT is linear. Attention's weighted sum is linear. They commute:
IWHT( Σ_t α_t · WHT(V_t) ) = Σ_t α_t · V_t
Instead of inverse-transforming every V element before summing (hot path), we sum the raw quantized values and apply the inverse transform once to the attention output (cold path).
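The identity is easy to check numerically. The sketch below uses the standard butterfly fast WHT (the exact ordering and scaling inside TurboQuant's kernels may differ) and confirms that deferring the inverse transform past the weighted sum changes nothing:

```python
import random

def fwht(x):
    """Fast Walsh-Hadamard transform, unnormalized butterfly (Sylvester order)."""
    x = list(x)
    h = 1
    while h < len(x):
        for i in range(0, len(x), h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def iwht(x):
    # The WHT is self-inverse up to a factor of n
    n = len(x)
    return [v / n for v in fwht(x)]

random.seed(0)
n, T = 32, 8
V = [[random.gauss(0, 1) for _ in range(n)] for _ in range(T)]  # T value vectors
alpha = [random.random() for _ in range(T)]                     # attention weights

# Hot path: accumulate WHT-domain values, apply the inverse once at the end
summed = [sum(a * w for a, w in zip(alpha, col))
          for col in zip(*[fwht(v) for v in V])]
deferred = iwht(summed)

# Reference: inverse-transform each value before the weighted sum
reference = [sum(a * v[k] for a, v in zip(alpha, V)) for k in range(n)]

assert all(abs(d - r) < 1e-9 for d, r in zip(deferred, reference))
```

Because the two paths agree to floating-point precision, the expensive inverse transform can move from the per-element inner loop to a single pass over the attention output.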
The Implementation:
  • dequantize_V: O(1) codebook lookup per element (no WHT) inside FA's VEC kernel
  • IWHT32: Warp-shuffle CUDA kernel applied once per head per layer after attention
  • Native FATTN_VEC_CASE template instantiation for TQ3_0 — no intermediate F16 buffer
Prior art: The deferred IWHT technique was independently discovered by multiple contributors in the llama.cpp TurboQuant community (discussion #20969), including AmesianX (CUDA, V post-processing + warp-shuffle IWHT) and the pre-rotated queries approach (veritatisquaesitoressumus, AmesianX). spiritbuun achieved similar speed via bulk dequant-then-MMA. Our implementation uses a distinct integration path through the VEC FA kernel with native template instantiation, avoiding intermediate F16 buffers.

The Debugging Journey

Getting from "it compiles" to "it works at full speed" required solving a chain of bugs, each hidden behind the last. If you're adding a new quantization type to llama.cpp, this section might save you days.

Bug 1: Wrong Codebook Values

The dequantize_V_tq3_0 function used codebook values {-1.0, -0.3, 0.3, 1.0} copied from TheTom's fork, which doesn't use WHT at all. The actual quantize function uses Lloyd-Max centroids {-1.510, -0.4528, 0.4528, 1.510}. Since the deferred IWHT requires exact WHT-domain values, wrong centroids meant the correction couldn't produce valid output.
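For reference, ±0.4528 and ±1.510 are the classic Lloyd-Max reproduction levels for a 4-level quantizer on unit-Gaussian data, which is roughly what scaled WHT-domain values look like. A simplified nearest-centroid sketch (illustrative only, not the real TQ3_0 pack format) shows the borrowed constants are measurably worse on Gaussian samples:

```python
import random

def quantize_mse(samples, codebook):
    # Mean squared error of nearest-centroid quantization (illustrative helper)
    err = 0.0
    for x in samples:
        q = min(codebook, key=lambda c: abs(c - x))
        err += (x - q) ** 2
    return err / len(samples)

random.seed(0)
samples = [random.gauss(0, 1) for _ in range(100_000)]

lloyd_max = [-1.510, -0.4528, 0.4528, 1.510]   # from the actual quantize path
borrowed  = [-1.0, -0.3, 0.3, 1.0]             # copied from the non-WHT fork

mse_correct = quantize_mse(samples, lloyd_max)
mse_wrong   = quantize_mse(samples, borrowed)
assert mse_correct < mse_wrong
```

And this only measures round-trip error; the deferred-IWHT correction additionally needs the exact WHT-domain centroids, so the wrong constants break correctness outright, not just accuracy.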

Lesson: When adapting code from a fork that uses a different algorithm variant, every constant needs to be verified against your actual quantize path. "Close enough" doesn't exist in quantization.

Bug 2: Wrong Sign Mask in IWHT Kernel

The IWHT kernel uses a 32-bit bitmask to encode which elements get sign-flipped during the inverse transform. The mask was 0xAA6CD4B2 instead of the correct 0xB1A6D4B2. This was verified by computing the mask directly from the sign array used in the forward WHT:

signs = [+1,-1,+1,+1,-1,-1,+1,-1,+1,+1,-1,+1,-1,+1,-1,-1,
         +1,-1,-1,+1,+1,-1,+1,-1,-1,+1,+1,+1,-1,-1,+1,-1]
mask = sum(1 << j for j in range(32) if signs[j] == -1)
assert mask == 0xB1A6D4B2, hex(mask)  # correct mask, derived from the forward WHT
# Kernel shipped with: 0xAA6CD4B2  ← wrong

Lesson: Packed bitmask constants can't be eyeballed. Compute them programmatically from the source truth and assert equality.

Bug 3: The Real Killer — Dead Code

After fixing bugs 1 and 2, the output was still garbage. The IWHT kernel was numerically correct (verified against a Python reference to 3.6×10⁻⁸ error). It was firing on all 60 layers. The input/output types were correct. Everything looked right.

The breakthrough came from capturing raw float values at the IWHT kernel. When we compared the "fast" path (codebook-only dequant + IWHT correction) against the "slow" path (full IWHT in dequant, no correction), the values going into the IWHT were byte-for-byte identical. This was impossible — the two paths should produce completely different FA outputs.

Root cause: dequantize_V_tq3_0 in fattn-common.cuh was never being called. Flash Attention had no template instantiation for TQ3_0:

// fattn.cu — these existed:
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_Q8_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_F16)

// this did NOT exist:
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_TQ3_0)  // ← missing

Without a matching template, the ggml scheduler silently dequantized the V cache from TQ3_0 to F16 using convert.cu's general-purpose dequantize — which included the full IWHT already. Our code changes to the FA-internal dequant were dead code. The build system gave zero warnings.

Lesson: Adding a new type to llama.cpp's type system does NOT automatically enable Flash Attention support. Without explicit FATTN_VEC_CASE instantiations, FA silently falls back to F16 conversion. You will get correct output (from the conversion) but none of your FA-internal optimizations will run. The only way to detect this is to verify your code actually executes.

Bug 4: The 6.8× Slowdown

After fixing the silent dequantization by removing the IWHT from convert.cu (so the scheduler's F16 conversion would output WHT-domain values), the IWHT correction produced correct output. But performance was 9.7 tok/s vs 66.2 tok/s baseline — a 6.8× slowdown.

The problem: the scheduler was still converting TQ3_0 → F16 into a temporary buffer before FA. This doubled memory bandwidth. The IWHT wasn't the bottleneck — the unnecessary conversion was.

The fix was 3 changes to fattn.cu:

// 1. Add template instantiations (tells FA how to read TQ3_0 directly)
FATTN_VEC_CASE(256, GGML_TYPE_Q8_0, GGML_TYPE_TQ3_0)  // SWA layers
FATTN_VEC_CASE(512, GGML_TYPE_Q8_0, GGML_TYPE_TQ3_0)  // full attn layers

// 2. Allow K≠V types for TQ3_0
if (K->type != V->type && V->type != GGML_TYPE_TQ3_0)
    return BEST_FATTN_KERNEL_NONE;

// 3. Force VEC kernel for TQ3_0 (MMA expects F16)
if (V->type == GGML_TYPE_TQ3_0) return BEST_FATTN_KERNEL_VEC;

Result: 9.7 tok/s → 65.1 tok/s. The VEC kernel now reads TQ3_0 data directly via dequantize_V_tq3_0, no intermediate buffer, no unnecessary conversion. The IWHT unary op corrects the output once per head per layer.

Lesson: If your new quant type is working but mysteriously slow, check whether FA is actually using your native dequant or silently converting to F16. The speed difference is 6.8×.

Timeline Summary

  • Step 1: Implemented TQ3_0 type, IWHT kernel, FA dequant function. Compiles clean. Output: garbage.
  • Step 2: Fixed codebook + sign mask. Still garbage. IWHT kernel verified numerically correct.
  • Step 3: Discovered dequantize_V was dead code (no FA template). Moved IWHT removal to convert.cu. Output: correct; speed: 9.7 tok/s (6.8× slow).
  • Step 4: Added native FA support (3 lines in fattn.cu). Speed: 65.1 tok/s, within 1.7% of the Q8_0 baseline.

Test Environment

Hardware

GPU: NVIDIA RTX 5090
VRAM: 32,607 MiB
Compute: SM 12.0 (Blackwell)
Platform: RunPod

Software

Model: gemma-4-31B-it-UD-Q4_K_XL
Model size: 17.46 GB
Parameters: 30.70B
llama.cpp: b8639 (a1cfb64) + TQ3_0 patches
Flash Attention: VEC kernel (native TQ3_0)
K cache: Q8_0 (8-bit)
V cache: TQ3_0 (3-bit, deferred IWHT)