`--ctx-size 262144 --parallel 2` — Gemma 4 31B with a full 128K context window per user on a single RTX 5090.
At `--ctx-size 32768 --parallel 2`, where both configurations fit in VRAM:
| Metric | Q8_0 V Cache | TQ3_0 V Cache | Difference |
|---|---|---|---|
| Generation Speed | 66.2 tok/s | 65.1 tok/s | −1.7% |
| VRAM Used | 22,753 MiB | 22,347 MiB | −406 MiB |
| Output Quality | Correct | Correct | Identical |
Two simultaneous large-context requests on the 262K configuration. Q8_0 cannot run this test — it fails at server startup.
Tested configurations on an RTX 5090 (32 GB) with Gemma 4 31B. Total context fixed at 128K, except the last row, where each of two slots gets its own full 128K.
| Parallel Slots | Context / Slot | Q8_0 V Cache (VRAM) | TQ3_0 V Cache (VRAM) |
|---|---|---|---|
| 1 | 128K | 25,045 MiB | OK |
| 4 | 32K | 28,708 MiB | OK |
| 7 | 18K | 32,325 MiB — OOM | 31,279 MiB |
| 2 | 128K each | 31,685 MiB — OOM | 28,485 MiB |
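The Q8_0 rows of the table let you back out a rough per-slot KV-cache cost, assuming VRAM usage is approximately base + slots × per-slot cache (a simplification that ignores allocator overhead and fragmentation):

```python
# Back-of-envelope from the table's Q8_0 rows (MiB), assuming
# VRAM ≈ base + slots * kv_per_slot. Illustrative only.
one_slot_128k  = 25045   # 1 slot x 128K
two_slots_128k = 31685   # 2 slots x 128K (OOM on a 32,607 MiB card)

kv_per_slot = two_slots_128k - one_slot_128k   # ≈ 6,640 MiB per 128K slot at Q8_0
base        = one_slot_128k - kv_per_slot      # model weights + overhead

print(kv_per_slot, base)  # 6640 18405
```

The same arithmetic explains the final row: shrinking the V-cache share with TQ3_0 brings the two-slot configuration down to 28,485 MiB, which fits where Q8_0 does not.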
- dequantize_V: O(1) codebook lookup per element (no WHT) inside FA's VEC kernel
- IWHT32: warp-shuffle CUDA kernel applied once per head per layer, after attention
- FATTN_VEC_CASE template instantiation for TQ3_0 — no intermediate F16 buffer

Getting from "it compiles" to "it works at full speed" required solving a chain of bugs, each hidden behind the last. If you're adding a new quantization type to llama.cpp, this section might save you days.
The dequantize_V_tq3_0 function used codebook values {-1.0, -0.3, 0.3, 1.0} copied from
TheTom's fork, which doesn't use WHT at all. The actual quantize function uses Lloyd-Max centroids
{-1.510, -0.4528, 0.4528, 1.510}. Since the deferred IWHT requires exact WHT-domain values,
wrong centroids meant the correction couldn't produce valid output.
Lesson: When adapting code from a fork that uses a different algorithm variant, every constant needs to be verified against your actual quantize path. "Close enough" doesn't exist in quantization.
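If those centroids come from Lloyd-Max quantization of a unit Gaussian (consistent with the values quoted), they can be reproduced, and thus verified, with a few lines of Lloyd's algorithm. The sample size and seed here are arbitrary:

```python
import numpy as np

# Lloyd's algorithm on a standard-normal sample. Converges to the 4-level
# Lloyd-Max centroids cited above: {-1.510, -0.4528, 0.4528, 1.510}.
rng = np.random.default_rng(0)
x = rng.standard_normal(2_000_000)

c = np.array([-1.5, -0.5, 0.5, 1.5])      # initial guess
for _ in range(50):
    edges = (c[:-1] + c[1:]) / 2          # nearest-centroid cell boundaries
    idx = np.searchsorted(edges, x)       # assign each sample to a cell
    c = np.array([x[idx == k].mean() for k in range(4)])

print(np.round(c, 3))
```

Running the same check against the constants actually compiled into the dequant path would have caught the copied codebook immediately.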
The IWHT kernel uses a 32-bit bitmask to encode which elements get sign-flipped during the inverse transform.
The mask was 0xAA6CD4B2 instead of the correct 0xB1A6D4B2. This was verified by computing
the mask directly from the sign array used in the forward WHT:
```python
signs = [+1,-1,+1,+1,-1,-1,+1,-1,+1,+1,-1,+1,-1,+1,-1,-1,
         +1,-1,-1,+1,+1,-1,+1,-1,-1,+1,+1,+1,-1,-1,+1,-1]
mask = sum((1 << j) for j in range(32) if signs[j] == -1)

assert mask == 0xB1A6D4B2   # Expected: computed from the forward WHT's sign array
# The kernel had 0xAA6CD4B2 ← wrong
```
Lesson: Packed bitmask constants can't be eyeballed. Compute them programmatically from the source of truth and assert equality.
After fixing bugs 1 and 2, the output was still garbage. The IWHT kernel was numerically correct (verified against a Python reference to 3.6×10⁻⁸ error). It was firing on all 60 layers. The input/output types were correct. Everything looked right.
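For reference, a 32-point orthonormal WHT of the kind used for that numerical check can be written in a few lines of Python. Since the orthonormal transform is its own inverse, applying it twice must return the input. This is a reference sketch only; it omits the per-element sign flips, which are a separate step:

```python
import numpy as np

def wht32(x):
    """Orthonormal 32-point Walsh-Hadamard transform (involutory)."""
    y = np.asarray(x, dtype=np.float64).copy()
    h = 1
    while h < 32:
        for i in range(0, 32, 2 * h):      # butterfly stages
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    return y / np.sqrt(32.0)

x = np.random.default_rng(1).standard_normal(32)
err = np.max(np.abs(wht32(wht32(x)) - x))  # round-trip error, ~1e-15 in f64
```

A float32 CUDA kernel checked against this f64 reference should agree to roughly 1e-7, which matches the 3.6×10⁻⁸ figure above.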
The breakthrough came from capturing raw float values at the IWHT kernel. When we compared the "fast" path (codebook-only dequant + IWHT correction) against the "slow" path (full IWHT in dequant, no correction), the values going into the IWHT were byte-for-byte identical. That should have been impossible — the two paths should produce completely different FA outputs.
Root cause: dequantize_V_tq3_0 in fattn-common.cuh was never being called. Flash Attention had no template instantiation for TQ3_0:
```cpp
// fattn.cu — these existed:
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_Q8_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_F16)

// this did NOT exist:
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_TQ3_0) // ← missing
```
Without a matching template, the ggml scheduler silently dequantized the V cache from TQ3_0 to F16
using convert.cu's general-purpose dequantize — which included the full IWHT already.
Our code changes to the FA-internal dequant were dead code. The build system gave zero warnings.
Lesson: Adding a new type to llama.cpp's type system does NOT automatically enable Flash Attention support.
Without explicit FATTN_VEC_CASE instantiations, FA silently falls back to F16 conversion.
You will get correct output (from the conversion) but none of your FA-internal optimizations will run.
The only way to detect this is to verify your code actually executes.
After fixing the silent dequantization by removing the IWHT from convert.cu (so the scheduler's
F16 conversion would output WHT-domain values), the IWHT correction produced correct output. But performance
was 9.7 tok/s vs 66.2 tok/s baseline — a 6.8× slowdown.
The problem: the scheduler was still converting TQ3_0 → F16 into a temporary buffer before FA, adding a full extra write and read of the V cache. The IWHT wasn't the bottleneck — the unnecessary conversion was.
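A rough bits-per-element count shows why the temporary buffer hurts, assuming about 3 bits per V element for TQ3_0 and 16-bit F16. This counts only V-cache traffic, so the observed 6.8× end-to-end slowdown is smaller than the raw ratio:

```python
# Approximate DRAM traffic per V-cache element, in bits. Illustrative only.
direct    = 3             # FA's VEC kernel reads TQ3_0 in place
converted = 3 + 16 + 16   # read TQ3_0, write F16 temp, read F16 inside FA

ratio = converted / direct
print(ratio)  # ≈ 11.7x more V-cache traffic via the temp buffer
```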
The fix was 3 changes to fattn.cu:
```cpp
// 1. Add template instantiations (tells FA how to read TQ3_0 directly)
FATTN_VEC_CASE(256, GGML_TYPE_Q8_0, GGML_TYPE_TQ3_0)  // SWA layers
FATTN_VEC_CASE(512, GGML_TYPE_Q8_0, GGML_TYPE_TQ3_0)  // full attn layers

// 2. Allow K≠V types for TQ3_0
if (K->type != V->type && V->type != GGML_TYPE_TQ3_0) return BEST_FATTN_KERNEL_NONE;

// 3. Force VEC kernel for TQ3_0 (MMA expects F16)
if (V->type == GGML_TYPE_TQ3_0) return BEST_FATTN_KERNEL_VEC;
```
Result: 9.7 tok/s → 65.1 tok/s. The VEC kernel now reads TQ3_0 data directly via
dequantize_V_tq3_0, no intermediate buffer, no unnecessary conversion.
The IWHT unary op corrects the output once per head per layer.
Lesson: If your new quant type is working but mysteriously slow, check whether FA is actually using your native dequant or silently converting to F16. The speed difference is 6.8×.
| Step | Outcome |
|---|---|
| 1 | Implemented TQ3_0 type, IWHT kernel, FA dequant function. Compiles clean. Output: garbage. |
| 2 | Fixed codebook + sign mask. Still garbage; IWHT kernel verified numerically correct. |
| 3 | Discovered dequantize_V was dead code — no FA template. Removed the IWHT from convert.cu. Output: correct; speed: 9.7 tok/s (6.8× slowdown). |
| 4 | Added native FA support (3 changes in fattn.cu). Speed: 65.1 tok/s — matching the Q8_0 baseline. |
| Component | Detail |
|---|---|
| GPU | NVIDIA RTX 5090 |
| VRAM | 32,607 MiB |
| Compute | SM 12.0 (Blackwell) |
| Platform | RunPod |
| Model | gemma-4-31B-it-UD-Q4_K_XL |
| Model Size | 17.46 GB |
| Parameters | 30.70B |
| llama.cpp | b8639 (a1cfb64) + TQ3_0 patches |
| Flash Attention | VEC kernel (native TQ3_0) |
| K Cache | Q8_0 (8-bit) |
| V Cache | TQ3_0 (3-bit, deferred IWHT) |