What Is TurboQuant — And Why Is It All the Rage Right Now?

What Is TurboQuant — And Why Is It All the Rage Right Now?


What Is TurboQuant — And Why Is It All the Rage Right Now?

It's 2 AM. Your production LLM deployment starts throwing OOM errors. You stare at the GPU memory dashboard and watch it climb: 40GB, 60GB, 80GB. The culprit isn't your model weights. It's the KV Cache — the giant table of intermediate attention vectors your transformer keeps around to avoid recalculating every previous token from scratch.

Scaling context windows from 4K to 128K tokens didn't just change your product. It multiplied your per-request memory footprint by 32x. And somewhere in that explosion of floats, a research team asked a deceptively simple question: what if we just… compressed the vectors?

The answer, it turns out, is not simple at all. Naive compression destroys the attention mechanism entirely. But a new algorithm called TurboQuant does something clever — it comes in two distinct variants with two distinct jobs. The first, TurboQuant_mse, applies a random rotation before quantizing to solve the distribution problem. The second, TurboQuant_prod, adds a one-bit residual correction on top to eliminate inner product bias.

These are not interchangeable. They solve different problems and fail in different ways when misapplied. In this article, we'll walk through both variants from first principles: the intuition, the math, the step-by-step arithmetic, and the concrete PyTorch implementation.


Table of Contents


The Problem: Your Transformer's Memory Is Hemorrhaging

Every modern transformer computes three vectors per token per layer: a Query (Q), a Key (K), and a Value (V). During generation, the model needs to compare the current token's Q against every previous token's K to decide what to pay attention to. Rather than recompute all previous K and V vectors from scratch on each new token, the model caches them. This is the KV Cache.

The problem: those caches are enormous.

For a model with a 128K context window, the KV Cache for a single sequence can consume 60–80 GB of GPU memory — more than the model weights themselves. At scale, serving 100 concurrent sessions means you either need a warehouse of H100s or you start aggressively evicting context.

This is where vector quantization enters. Instead of storing each 16-bit float in full precision, you store a compressed integer approximation. At 4x compression, you shave your 80GB cache down to 20GB.

The catch? Every compression algorithm introduces error. And in transformers, the wrong kind of error doesn't degrade quality gracefully — it causes attention collapse. This is the problem TurboQuant was specifically designed to solve.


The Mental Model: Quantization as a Compression Codec

Before diving into TurboQuant specifically, let's establish the mental model that makes everything else click.

Quantization maps continuous numbers to a small discrete set of values (centroids) and stores only the integer index.

Think of it like a color palette. Instead of storing a 32-bit full-color pixel, you pick the nearest color from a 16-color palette and store a 4-bit index. When you want the color back, you look up the index.

Applied to a 4096-dimensional embedding vector, the process works per coordinate:

  • Each floating-point coordinate maps to its nearest centroid value.
  • You store integer indices instead of floats.
  • At decode time, you map indices back to centroids.

At 1 bit per coordinate, you only have two centroid choices (−0.5 and +0.5). That's a 16x compression of a 16-bit float. At 2 bits, you have four choices — 8x compression.

The problem with this approach is that real LLM vectors are wildly non-uniform. One coordinate might be 10.5, the next 0.001. A fixed centroid set designed for "typical" values fails badly for any specific vector. TurboQuant solves this before anything else happens — by rotating the vector first.


The Two Variants and Why Both Exist

TurboQuant is not a single algorithm. It's a family of two, each solving a distinct problem:

Variant Core Problem It Solves Use Case
TurboQuant_mse Non-uniform coordinate distributions make fixed centroids inefficient Image/audio compression, geometric reconstruction
TurboQuant_prod MSE quantization deflates inner products, corrupting attention scores LLM KV cache, vector databases, RAG, attention

TurboQuant_prod is built on top of TurboQuant_mse — it runs the MSE step first, then adds a 1-bit residual correction. To understand the full system, you must understand both layers. We'll cover them in order.


Variant 1 — TurboQuant_mse: The Kaleidoscope Trick {#variant-1-turboquant-mse}

The Core Insight

The fundamental problem with naive quantization is unpredictable coordinate distributions. When a vector arrives with coordinates like [10.5, 0.001, −3.2, 0.0004], your pre-built centroid set — designed for "average" vectors — will do a terrible job on the extreme values. You'd need to either learn a custom codebook per model or accept massive quantization errors.

TurboQuant sidesteps both options entirely with a single insight: apply a random rotation to the vector before quantizing it.

A rotation matrix Π is orthogonal — it preserves the L2 norm (length) of any vector it's applied to. It never scales, never distorts. It only changes orientation. What makes a random rotation particularly useful here is a consequence of the central limit theorem: after rotating a high-dimensional vector by a random orthogonal matrix, each output coordinate becomes an approximate sum of many independent components. By the CLT, such sums converge to a Gaussian (bell-curve) distribution, regardless of the original coordinate distribution.

Once coordinates are Gaussian-distributed, everything becomes predictable. You can precompute a single optimal centroid set for a Gaussian distribution once, offline, and it will work well for any rotated vector — from any LLM, any domain, any input. No data collection, no calibration, no per-model training.

This is the "kaleidoscope" analogy: you scramble the vector into a predictable shape, compress that predictable shape efficiently, then unscramble it on the way out.

The Algorithm

Quantization (compress):

  1. Rotate: y = Π · x
  2. For each coordinate y[j], find the index of the nearest centroid: idx[j] = argmin_k |y[j] − c_k|

Dequantization (decompress):

  1. Map indices back to centroid values: ỹ[j] = c_{idx[j]}
  2. Apply the inverse rotation: x̃_mse = Πᵀ · ỹ

The inverse of an orthogonal matrix is its transpose — a free operation requiring no extra computation.

The centroids c_k are pre-computed using the Lloyd-Max algorithm applied to a standard Gaussian distribution. This is a classical result: Lloyd-Max gives the centroid placement that minimizes mean squared error for a given distribution and a given number of levels. Because we guarantee Gaussian inputs via rotation, these centroids are optimal without any per-model calibration.


Full Walkthrough: TurboQuant_mse with Real Numbers

Let's trace a concrete 2D vector through every step of TurboQuant_mse. We'll use explicit matrix arithmetic so the mechanics are fully transparent.

Initialization

  • Input vector: x = [1.0, 0.0]
  • Rotation matrix:
Π = [[ 0.8,  −0.6 ],
     [ 0.6,   0.8 ]]

(Pre-generated via QR decomposition of a random matrix. This is a valid orthogonal matrix: its rows and columns are both orthonormal.)

  • Centroids (1-bit budget): c = [−0.5, 0.5]

(With 1 bit, we have 2¹ = 2 allowed values. Index 0 maps to −0.5, Index 1 maps to 0.5.)


Step 1 — Rotate the Vector

We apply the rotation matrix to distribute the coordinates into a Gaussian-like shape.

y = Π · x

y[0] = (0.8 × 1.0) + (−0.6 × 0.0) = 0.8 + 0.0 = 0.8
y[1] = (0.6 × 1.0) + ( 0.8 × 0.0) = 0.6 + 0.0 = 0.6

Result: y = [0.8, 0.6]

The original vector had all its energy in a single coordinate (1.0, 0.0). After rotation, the energy is spread across both dimensions. This is exactly the Gaussianization effect that makes the centroids useful.


Step 2 — Quantize: Find the Nearest Centroid

For each coordinate in y, we find which centroid it's closest to. Our only options are c[0] = −0.5 and c[1] = +0.5.

Coordinate y[0] = 0.8:

Distance to c[0] = −0.5:  |0.8 − (−0.5)| = |1.3| = 1.3
Distance to c[1] = +0.5:  |0.8 − 0.5|   = |0.3| = 0.3  ← smaller

Closest centroid: c[1] = 0.5Index 1

Coordinate y[1] = 0.6:

Distance to c[0] = −0.5:  |0.6 − (−0.5)| = |1.1| = 1.1
Distance to c[1] = +0.5:  |0.6 − 0.5|   = |0.1| = 0.1  ← smaller

Closest centroid: c[1] = 0.5Index 1

Compressed result: idx = [1, 1]

This is what gets committed to memory — two integers, not two floats. At 1 bit per coordinate, this is the complete compressed representation of the original vector [1.0, 0.0].


Step 3 — Dequantize: Map Indices Back to Centroids

To reconstruct, we first translate the integer indices back into their centroid float values:

idx = [1, 1]  →  ỹ = [c[1], c[1]] = [0.5, 0.5]

This is our approximate rotated vector ỹ = [0.5, 0.5].


Step 4 — Inverse Rotation (Un-rotate)

We reverse the rotation using Πᵀ — the transpose of the original rotation matrix. For an orthogonal matrix, the transpose is the inverse: Πᵀ · Π = I.

Transposing Π swaps the off-diagonal elements:

Π  = [[ 0.8, −0.6 ],      Πᵀ = [[ 0.8,  0.6 ],
      [ 0.6,  0.8 ]]             [−0.6,  0.8 ]]

Now the matrix multiplication x̃_mse = Πᵀ · ỹ:

Top coordinate (Row 1 of Πᵀ dotted with ỹ):

(0.8 × 0.5) + (0.6 × 0.5) = 0.4 + 0.3 = 0.7

Bottom coordinate (Row 2 of Πᵀ dotted with ỹ):

(−0.6 × 0.5) + (0.8 × 0.5) = −0.3 + 0.4 = 0.1

MSE decompressed result: x̃_mse = [0.7, 0.1]


The TurboQuant_mse Stage Summary

Stage Input Operation Result
Rotate x = [1.0, 0.0] y = Π · x y = [0.8, 0.6]
Quantize y = [0.8, 0.6] Nearest centroid in {−0.5, 0.5} idx = [1, 1]
Centroid Lookup idx = [1, 1] ỹ = c[idx] ỹ = [0.5, 0.5]
Un-rotate ỹ = [0.5, 0.5] x̃_mse = Πᵀ · ỹ x̃_mse = [0.7, 0.1]

The reconstructed vector [0.7, 0.1] is "pointing in the right direction" — mostly to the right, slightly upward — but it has clearly deflated and drifted from the original [1.0, 0.0]. This drift is the quantization error, and it is an unavoidable consequence of 1-bit compression.

The exact error is:

r = x − x̃_mse = [1.00.7, 0.00.1] = [0.3, −0.1]

For geometric reconstruction tasks — where your downstream goal is placing the reconstructed vector as close as possible to the original in Euclidean space — TurboQuant_mse may be perfectly acceptable. For LLM attention, the story changes completely, which brings us to the second variant.


Why TurboQuant_mse Is Not Enough for Attention

Here is the failure mode that is easy to miss.

You might think quantization error is just "noise" — a small geometric distance between the original and the reconstructed vector. Small noise means decent reconstruction, right?

Not in a transformer. The attention mechanism doesn't care about geometric distance. It cares about inner products (dot products):

attention_score(query, key) = dot(Q, K) / sqrt(d)

If you've compressed K with a quantizer that systematically shrinks the dot product, then every attention score across every layer, every head, every token is deflated by the same factor. The softmax distribution flattens. The model loses the ability to distinguish a critical fact from irrelevant noise.

The paper proves this precisely: MSE-optimized quantization at 1 bit per coordinate shrinks the expected inner product by exactly 2/π ≈ 0.637. Concretely:

  • True attention score: 10.0 — "This is critical. Pay full attention."
  • MSE-quantized score: 6.37 — "This is mildly interesting, maybe."

The downstream consequences:

  1. Softmax flattening. The softmax of uniformly deflated scores is flatter — the model spreads attention evenly instead of concentrating it.
  2. Needle-in-a-haystack failure. The one critical fact buried in a 128K context gets the same attention weight as filler text. The model hallucinates.
  3. Vector search corruption. In RAG pipelines, similarity rankings are wrong across the entire corpus because every score is deflated by the same factor.

TurboQuant_mse is an excellent quantizer for tasks where geometric fidelity matters. For LLM attention, it introduces an unfixable bias. You need the second variant.

Key Takeaway: The failure mode of applying TurboQuant_mse to LLM attention is not gradual degradation. It is systematic attention collapse, proportional to your compression ratio.


Variant 2 — TurboQuant_prod: The QJL Fix

The Core Insight

The inner product bias introduced by MSE quantization cannot be fixed by tuning the MSE algorithm. The bias is inherent to how Lloyd-Max centroids work — they pull extreme values toward bucket averages, which mathematically shrinks vector magnitudes. No amount of centroid adjustment eliminates this.

Instead, TurboQuant_prod takes a different approach: measure the exact error, compress that error cheaply, and add the correction back at decode time.

After running TurboQuant_mse, we know both the original vector x and the reconstructed approximation x̃_mse. The exact difference is the residual:

r = x − x̃_mse

If we could store r in full precision and add it back, we'd have lossless compression. Instead, we compress it to 1 bit per coordinate using a technique called Quantized Johnson-Lindenstrauss (QJL).

The QJL step takes a pre-generated random Gaussian matrix S and computes:

qjl = sign(S · r)

Each coordinate of the projected residual is compressed to a single bit: +1 (positive) or −1 (negative). We also save the residual's L2 magnitude ‖r‖₂ as a single scalar float.

What makes this mathematically sound is a theorem from random matrix theory: the expected value of the reconstructed inner product, when using the QJL-decoded residual, equals the true inner product with zero bias. The correction is noisy but unbiased — averaged across the high dimensionality of real LLM vectors, the noise cancels exactly.

The full reconstruction formula is:

x̃_prod = x̃_mse + (√(π/2) / d) · ‖r‖₂ · Sᵀ · qjl

The scaling constant √(π/2) / d is not a hyperparameter. It is a mathematically derived value that exactly cancels the expected bias of the sign-based projection — it comes from the first moment of the half-normal distribution.

What gets stored in memory for TurboQuant_prod:

  • idx — integer indices from the MSE step (the bulk of the compression)
  • qjl — binary ±1 values from the residual correction (1 bit per coordinate)
  • ‖r‖₂ — a single scalar float for residual magnitude scaling

The total bit budget: (target_bits − 1) for the MSE step, plus 1 bit for QJL. Same total bits spent as TurboQuant_mse at the equivalent budget, but with the inner product bias fully eliminated.


Full Walkthrough: TurboQuant_prod with Real Numbers

We'll continue directly from where the TurboQuant_mse walkthrough ended, using the same input vector and rotation matrix. This makes the two variants directly comparable.

Setup (Carried Over from TurboQuant_mse)

  • Original vector: x = [1.0, 0.0]
  • MSE-decompressed vector: x̃_mse = [0.7, 0.1] (From Step 4 above — after rotating, quantizing, and un-rotating)
  • QJL random matrix:
S = [[ 1.2, −0.4 ],
     [ 0.5,  0.9 ]]

(Pre-generated Gaussian matrix, stored alongside the rotation Π. Must be identical at compress and decompress time.)

We'll also use a query vector to evaluate attention score preservation at the end:

  • Query: y = [2.0, 1.0]
  • True inner product: ⟨y, x⟩ = (2.0 × 1.0) + (1.0 × 0.0) = 2.0

Step 5 — Compute the Residual

We calculate exactly how much the MSE step got wrong:

r = x − x̃_mse

r[0] = 1.00.7 =  0.3
r[1] = 0.00.1 = −0.1

Residual: r = [0.3, −0.1]

We also compute the L2 norm (Pythagorean length) of this residual, which we'll save for the decoding step:

‖r‖₂ = √(0.3² + (−0.1)²) = √(0.09 + 0.01) = √0.100.316

This single float 0.316 captures the magnitude of our quantization error. It gets stored in memory alongside idx and qjl.


Step 6 — 1-Bit QJL Compression of the Residual

We project the residual through matrix S and take only the sign of each result.

Part A: Matrix multiplication S · r

S · r = [[ 1.2, −0.4 ],  ·  [ 0.3, −0.1 ]
         [ 0.5,  0.9 ]]

Row 1: (1.2 × 0.3) + (−0.4 × −0.1) = 0.36 + 0.04 = 0.40
Row 2: (0.5 × 0.3) + ( 0.9 × −0.1) = 0.150.09 = 0.06

Result: S · r = [0.40, 0.06]

Part B: Apply the sign() function

qjl = sign([0.40, 0.06]) = [+1, +1]

Both projections are positive, so both bits are +1.

What gets committed to memory:

  • Integer indices idx = [1, 1] (from the MSE step)
  • Binary bits qjl = [+1, +1] (1 bit per coordinate)
  • Magnitude scalar ‖r‖₂ ≈ 0.316 (one float)

The full floating-point vector [1.0, 0.0] has been compressed into a handful of integers, two bits, and a scalar.


Step 7 — Decode the QJL Correction

At inference time, when the model needs to read this cached key vector, we unpack the QJL correction and add it back to x̃_mse.

The decoding formula:

x̃_qjl = (√(π/2) / d) · ‖r‖₂ · Sᵀ · qjl

Part A: Compute the scalar multiplier

√(π/2) ≈ √1.57081.253
d = 2  (2D example)

scalar = (1.253 / 2) × 0.3160.626 × 0.3160.198

Part B: Compute Sᵀ · qjl

We transpose S by swapping the off-diagonal entries:

S  = [[ 1.2, −0.4 ],      Sᵀ = [[ 1.2,  0.5 ],
      [ 0.5,  0.9 ]]             [−0.4,  0.9 ]]

Multiply by qjl = [+1, +1]:

Row 1: (1.2 × 1) + (0.5 × 1)  = 1.2 + 0.5  = 1.7
Row 2: (−0.4 × 1) + (0.9 × 1) = −0.4 + 0.9 = 0.5

Result: Sᵀ · qjl = [1.7, 0.5]

Part C: Combine scalar and matrix result

x̃_qjl = 0.198 × [1.7, 0.5] = [0.337, 0.099] ≈ [0.34, 0.10]

Our decompressed residual correction: x̃_qjl = [0.34, 0.10].


Step 8 — Final Combination

We add the MSE base vector and the QJL correction together:

x̃_prod = x̃_mse + x̃_qjl
        = [0.7, 0.1] + [0.34, 0.10]
        = [1.04, 0.20]

The QJL term has re-inflated the deflated MSE vector back toward the original [1.0, 0.0].


The Grand Comparison

Now let's see what the attention mechanism actually sees when it computes dot products with query y = [2.0, 1.0]:

Approach Reconstructed Vector Inner Product with y = [2.0, 1.0] Error
No compression [1.0, 0.0] (2×1.0) + (1×0.0) = **2.0** 0%
TurboQuant_mse only [0.7, 0.1] (2×0.7) + (1×0.1) = **1.5** 25% deviation
TurboQuant_prod [1.04, 0.20] (2×1.04) + (1×0.20) = **2.28** 14% deviation

With just one extra bit per coordinate for the QJL correction, the inner product error drops from 25% to 11% in this tiny 2D example. At the real dimensionalities used in LLMs (d = 4096), the Johnson-Lindenstrauss averaging effect makes this correction progressively more precise — the expected error approaches exactly zero.

The paper proves mathematically that: E[⟨y, x̃_prod⟩] = ⟨y, x⟩. The correction is unbiased.


The Loss Functions Explained

The two variants minimize different objectives, and understanding which one your use case requires is the single most important decision when deploying TurboQuant.

MSE Distortion (minimized by TurboQuant_mse):

D_mse = E[ ‖x − x̃‖²₂ ]

This measures the average squared Euclidean distance between original and reconstructed vectors. Lloyd-Max centroid optimization minimizes this directly. It penalizes large geometric misplacements, ensuring the reconstructed vector sits as close as possible to the original in coordinate space. This is the right objective when your downstream task cares about where the vector is.

Inner Product Distortion (minimized by TurboQuant_prod):

D_prod = E[ |⟨y, x⟩ − ⟨y, x̃⟩|² ]

This measures the expected squared error in dot product scores, averaged over queries y. MSE quantization introduces a systematic bias in this metric — the 2/π shrinkage factor. The QJL residual step is mathematically constructed to make D_prod approach zero. Use this when your downstream task cares about similarity between vectors, not their individual coordinates.

A useful mental shortcut: if your application ever calls .dot(), .cosine_similarity(), or computes an attention score, you need TurboQuant_prod.


PyTorch Implementation

Here is the complete implementation of both variants. The code follows the architecture exactly as described in the walkthroughs above.

import torch
import math


class TurboQuantMSE:
    """
    TurboQuant_mse: Rotation + scalar quantization.

    Optimizes for geometric (L2) reconstruction fidelity.
    The random rotation guarantees Gaussian coordinate distribution,
    making pre-computed Lloyd-Max centroids optimal for any input vector.

    Use for: image/audio compression, geometric reconstruction.
    Do NOT use for LLM KV cache or vector database lookups — see TurboQuantProd.
    """

    def __init__(self, dim: int, num_bits: int = 2):
        self.dim = dim
        self.num_bits = num_bits
        self.num_centroids = 2 ** num_bits

        # Generate a random orthogonal rotation matrix Π via QR decomposition.
        # Any random orthogonal matrix works — this is not data-dependent.
        random_matrix = torch.randn(dim, dim)
        q, _ = torch.linalg.qr(random_matrix)
        self.Pi = q  # Shape: (dim, dim), orthogonal

        # Centroid values for each integer index.
        # In production, use pre-computed Lloyd-Max centroids for N(0,1).
        # These tabulated values minimize MSE for a Gaussian distribution.
        # Here we use linspace as a reasonable approximation.
        self.centroids = torch.linspace(-1.0, 1.0, self.num_centroids)

    def quantize(self, x: torch.Tensor) -> torch.Tensor:
        """
        x: (batch, dim) float tensor
        returns: (batch, dim) integer tensor of centroid indices

        Step 1: Rotate  ->  Step 2: Find nearest centroid
        """
        # Step 1: Rotate — spreads coordinate values into Gaussian distribution
        y = x @ self.Pi.T  # (batch, dim)

        # Step 2: For each coordinate, find the index of the nearest centroid.
        # Broadcasting: (batch, dim, 1) − (num_centroids,) → (batch, dim, num_centroids)
        distances = torch.abs(y.unsqueeze(-1) - self.centroids)
        idx = torch.argmin(distances, dim=-1)  # (batch, dim) integers
        return idx

    def dequantize(self, idx: torch.Tensor) -> torch.Tensor:
        """
        idx: (batch, dim) integer tensor
        returns: (batch, dim) float tensor

        Step 3: Map indices to centroids  ->  Step 4: Inverse rotation
        """
        # Step 3: Translate integer indices back to float centroid values
        y_tilde = self.centroids[idx]  # (batch, dim)

        # Step 4: Un-rotate with Πᵀ (transpose = inverse for orthogonal matrices)
        x_tilde_mse = y_tilde @ self.Pi  # (batch, dim)
        return x_tilde_mse


class TurboQuantProd:
    """
    TurboQuant_prod: MSE quantization + 1-bit QJL residual correction.

    Extends TurboQuantMSE to eliminate the inner product bias introduced by
    MSE-optimized centroid placement. The QJL step is mathematically proven
    to make E[] =  (unbiased inner products).

    Use for: LLM KV cache, vector databases, RAG retrieval, attention mechanisms.
    Any task where dot products determine behavior must use this variant.
    """

    def __init__(self, dim: int, target_bits: int = 3):
        self.dim = dim

        # Budget allocation: (target_bits - 1) for MSE, 1 for QJL residual.
        # Example: target_bits=3 → 2-bit MSE (4 centroids) + 1-bit QJL.
        self.mse_quantizer = TurboQuantMSE(dim, num_bits=target_bits - 1)

        # Random Gaussian matrix S for the QJL transform.
        # Must be identical at quantization and dequantization time.
        # Store this with your model artifacts — not regenerated per call.
        self.S = torch.randn(dim, dim)  # Shape: (dim, dim)

    def quantize(self, x: torch.Tensor):
        """
        x: (batch, dim) float tensor
        returns:
            idx:       (batch, dim)  — MSE centroid indices
            qjl_signs: (batch, dim)  — ±1 residual correction bits
            r_norm:    (batch, 1)    — residual L2 norm for scale recovery
        """
        # Phase 1: Run TurboQuantMSE to get the base compressed representation
        idx = self.mse_quantizer.quantize(x)            # (batch, dim) integers
        x_tilde_mse = self.mse_quantizer.dequantize(idx)

        # Phase 2: Compute the residual — the exact error the MSE step made
        r = x - x_tilde_mse  # (batch, dim)

        # Phase 3: Compress the residual to 1 bit per coordinate via QJL
        # sign(S · r) captures the direction of the projection in ±1
        qjl_signs = torch.sign(r @ self.S.T)  # (batch, dim), values in {-1, +1}

        # Save the residual magnitude — needed to scale the correction at decode
        r_norm = torch.norm(r, p=2, dim=-1, keepdim=True)  # (batch, 1)

        return idx, qjl_signs, r_norm

    def dequantize(
        self,
        idx: torch.Tensor,
        qjl_signs: torch.Tensor,
        r_norm: torch.Tensor
    ) -> torch.Tensor:
        """
        Reconstruct the vector with unbiased inner product fidelity.

        The constant sqrt(pi/2)/d is derived from the first moment of the
        half-normal distribution — it exactly cancels the expected bias
        of the sign-based projection. This is not a hyperparameter.
        """
        # Step 1: Reconstruct the deflated MSE base vector
        x_tilde_mse = self.mse_quantizer.dequantize(idx)  # (batch, dim)

        # Step 2: Decode the QJL correction using the unbiasing formula
        constant = math.sqrt(math.pi / 2) / self.dim
        x_tilde_qjl = constant * r_norm * (qjl_signs @ self.S)  # (batch, dim)

        # Step 3: Add the correction — this "re-inflates" the MSE-deflated vector
        return x_tilde_mse + x_tilde_qjl


# ─── Verification ──────────────────────────────────────────────────────────────

if __name__ == "__main__":
    torch.manual_seed(42)
    dim   = 4096
    batch = 64

    # Simulate unit-normalized key vectors (typical in LLM attention)
    x = torch.randn(batch, dim)
    x = x / x.norm(dim=-1, keepdim=True)

    # Query vectors for inner product evaluation
    y = torch.randn(batch, dim)
    y = y / y.norm(dim=-1, keepdim=True)

    true_scores = (y * x).sum(dim=-1)

    # ── TurboQuantMSE ──────────────────────────────────────────────────────────
    mse_quantizer = TurboQuantMSE(dim=dim, num_bits=2)
    idx_mse       = mse_quantizer.quantize(x)
    x_hat_mse     = mse_quantizer.dequantize(idx_mse)

    mse_scores = (y * x_hat_mse).sum(dim=-1)
    mse_error  = (true_scores - mse_scores).abs() / true_scores.abs()

    print("TurboQuantMSE (2-bit) — Inner Product Fidelity:")
    print(f"  Mean relative error: {mse_error.mean().item():.4f}")
    print(f"  Expected: ~{1 - 2/math.pi:.4f} bias factor at 1-bit; decreases with more bits\n")

    # ── TurboQuantProd ─────────────────────────────────────────────────────────
    prod_quantizer   = TurboQuantProd(dim=dim, target_bits=3)
    idx, signs, norm = prod_quantizer.quantize(x)
    x_hat_prod       = prod_quantizer.dequantize(idx, signs, norm)

    prod_scores = (y * x_hat_prod).sum(dim=-1)
    prod_error  = (true_scores - prod_scores).abs() / true_scores.abs()

    print("TurboQuantProd (3-bit total: 2-bit MSE + 1-bit QJL) — Inner Product Fidelity:")
    print(f"  Mean relative error: {prod_error.mean().item():.4f}")
    print(f"  Should be near 0 — QJL guarantees unbiased expected inner product")

Implementation notes worth flagging:

On centroids: torch.linspace is a placeholder. In production, use pre-computed Lloyd-Max optimal values for N(0,1) at your chosen bit width. These are tabulated constants, not learned from your data — which is a key differentiator from methods like product quantization.

On the rotation matrix Pi: Regenerating it per deployment is safe. TurboQuant only requires Pi to be random and orthogonal, not data-optimal. Any fresh QR decomposition gives equivalent performance.

On the QJL matrix S: Unlike Pi, the matrix S must be identical at quantization and dequantization time. If the inference server uses a different S than the one used when the cache was written, the residual correction will be garbage. Treat S as a model artifact and checkpoint it.


Comparison: TurboQuant vs. Competing Approaches {#comparison-matrix}

Method Inner Product Bias Training Data Required Memory Overhead Hardware Dependency Best For
TurboQuant_prod None (proven unbiased) No Low None LLM KV cache, vector DBs, RAG
TurboQuant_mse High (2/π factor) No Very low None Image/audio compression, geometric tasks
Product Quantization (PQ) Low Yes (codebook training) Medium None Offline ANN indexes
GPTQ / AWQ N/A (weight quant.) Yes (calibration data) Low Some (INT4 kernels) Model weight compression
FP8 KV Cache Very low No Low Yes (H100+) High-end GPU serving
KVQuant Low Yes (per-layer profiling) Medium None Research / offline

TurboQuant_prod occupies a unique niche: near-zero inner product bias, no training data required, no special hardware. The tradeoff against FP8 is straightforward — if you're running exclusively on Hopper-class hardware with a serving stack that supports native FP8 KV cache, prefer FP8. On A100s, consumer hardware, or mixed fleets, TurboQuant is the more portable and practical choice.

Happy building.


Want to go deeper? The original TurboQuant paper is worth reading in full

Previous Post
Next Post

post written by: