The KV Cache: Why Long Context Isn’t Free

Your model has a 200K-token context window, so you do the obvious thing: you stuff it. Full chat history, a dozen retrieved documents, the entire system prompt. The tokens fit — the API doesn’t complain — and yet responses crawl, your GPU throws out-of-memory errors under load, and the per-request cost is triple what the token math predicted.

The tokens fitting was never the constraint. The constraint is a data structure most developers never think about until it’s the thing that’s on fire: the KV cache.

It’s the reason transformers generate text fast at all. It’s also the reason “just send more context” is one of the most expensive sentences in production AI. The cache grows with every token in your context, it lives in scarce GPU memory, and it has to be read in full on every single step of generation. Understanding it is the difference between knowing that long context is slow and knowing exactly why — and which lever actually fixes it.

This is a deep dive. We’ll build up from what the cache is, through the memory and bandwidth math that makes it bite, to the architectural and serving-layer techniques — GQA, MLA, PagedAttention, quantization — that the entire 2026 inference stack is built around.

Table of Contents

  1. What the KV cache actually is
  2. Why it exists: the alternative is quadratic
  3. The memory math (and why it explodes)
  4. The hidden cost: bandwidth, not just memory
  5. Shrinking the cache: GQA, MQA, and MLA
  6. Serving-layer wins: PagedAttention, quantization, prefix caching
  7. What this means for you
  8. FAQ

What the KV cache actually is

When a transformer processes text, each attention layer projects every token into three vectors: a query, a key, and a value. To compute attention for a given token, the model compares its query against the keys of every preceding token, then takes a weighted sum of their values.

The KV cache is exactly what it sounds like: a store of the key and value vectors the model has already computed for every prior token, kept in GPU memory so they don’t have to be recomputed on the next step. That’s it. A buffer of past keys and values, one set per layer, sitting in VRAM alongside the model weights — and quietly becoming the largest thing in memory as your sequences get long.
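To make that concrete, here is a minimal single-layer, single-head sketch in plain Python. The `project_q`, `project_k`, and `project_v` functions are hypothetical toy stand-ins for the learned projection matrices, and `KVCache` is an illustrative name rather than any library's class. The point is only the shape of the data: one key and one value appended per token, and every new query attending over everything stored so far.

```python
import math

def attend(query, keys, values):
    """Softmax(q.k / sqrt(d)) weighted sum of values, for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)                      # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]

class KVCache:
    """One layer, one head: append-only lists of past keys and values."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

# Toy stand-ins for the learned projections (hypothetical; real ones are matrices).
def project_q(x): return [2.0 * a for a in x]
def project_k(x): return [a + 1.0 for a in x]
def project_v(x): return [-a for a in x]

embeddings = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
cache = KVCache()
outputs = []
for x in embeddings:
    cache.append(project_k(x), project_v(x))   # K and V computed once per token
    outputs.append(attend(project_q(x), cache.keys, cache.values))
```

A real model keeps one such cache per layer and per KV head, which is where the multipliers in the memory formula later in this article come from.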

Why it exists: the alternative is quadratic

To see why the cache is non-negotiable, picture generation without it.

Transformers generate autoregressively — one token at a time, each new token conditioned on everything before it. To produce token number 1,000, the model needs the keys and values of tokens 1 through 999. With no cache, it would recompute all of them from scratch on every step. Token 1,000 recomputes 999 tokens’ worth of K and V; token 1,001 recomputes 1,000; and so on. Across a full generation that’s quadratic work — O(n²) — and it makes long-form generation agonizingly, unusably slow.

The cache turns that repeated recompute into a lookup. Compute each token’s K and V exactly once, store them, and every future step just reads them back. Per step, the projection work drops from O(n) to O(1), and attention for the new token becomes a single O(n) pass over the cached keys and values. Here’s the shape of it:

# Sketch only: `model`, `sample`, `prompt`, and `max_new_tokens` are placeholders.

# Without a cache: re-encode the whole sequence every step → O(n^2) K/V work over a generation.
tokens = list(prompt)
for _ in range(max_new_tokens):
    logits = model(tokens)            # recomputes K, V for every prior token, every step
    tokens.append(sample(logits[-1]))

# With a KV cache: compute each token's K, V once and reuse them → O(n) K/V work over a generation.
cache = None
next_toks = list(prompt)              # the first call encodes the whole prompt (prefill)
generated = []
for _ in range(max_new_tokens):
    logits, cache = model(next_toks, past_kv=cache)  # later calls encode only the new token
    new_tok = sample(logits[-1])
    generated.append(new_tok)
    next_toks = [new_tok]

So the KV cache is a classic engineering bargain: you spend memory to save compute. The catch — and the entire subject of this article — is that the memory side of that bargain has a much steeper bill than most people expect.
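If you want to see the compute side of the bargain as numbers, a rough count of key/value projections is enough. The sketch below assumes the simplest possible accounting (one K/V projection per token encoded, per layer) and uses hypothetical function names; it ignores the attention arithmetic itself.

```python
def kv_projections_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Every step re-encodes the full sequence seen so far."""
    return sum(prompt_len + step for step in range(new_tokens))

def kv_projections_with_cache(prompt_len: int, new_tokens: int) -> int:
    """The prompt is encoded once (prefill), then one new token per later step."""
    if new_tokens == 0:
        return 0
    return prompt_len + new_tokens - 1

no_cache = kv_projections_without_cache(prompt_len=1_000, new_tokens=1_000)
cached = kv_projections_with_cache(prompt_len=1_000, new_tokens=1_000)
print(no_cache, cached)   # 1499500 1999
```

For a 1,000-token prompt and 1,000 generated tokens, that is roughly 1.5 million token encodings versus about 2,000, and the gap widens as sequences grow.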

The memory math (and why it explodes)

The size of the KV cache is fully determined by the model’s shape and how much text you’re holding. For standard multi-head attention, and for the shared-head variants covered below, it’s:

def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,        # KV heads — NOT query heads. This is the lever GQA pulls.
    head_dim: int,
    seq_len: int,           # prompt + generated tokens
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # BF16 / FP16
) -> float:
    # factor of 2 = one tensor for K, one for V
    total = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
    return total / 1e9

# Llama 3 70B: 80 layers, GQA with 8 KV heads, head_dim 128, BF16
kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=128_000)
# ≈ 42 GB for ONE 128K-token sequence, on top of the model weights

Read that result again. A single 128K-token request needs roughly 42 GB of KV cache — and that’s the efficient case, because Llama 3 already uses grouped-query attention to keep the KV head count down to 8. The same model with full multi-head attention (64 KV heads) would need around eight times as much: well over 300 GB of cache for one sequence, which doesn’t fit on any single GPU made.
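The eight-times figure is easy to check by hand. The sketch below works per token rather than per sequence, using a hypothetical helper `kv_bytes_per_token`; the 64-KV-head variant is the same thought experiment as above, not a model that actually ships.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of K and V stored for ONE token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

SEQ_LEN = 128_000

gqa = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)    # 327,680 bytes
mha = kv_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)   # 2,621,440 bytes

gqa_gb = gqa * SEQ_LEN / 1e9   # about 41.9 GB
mha_gb = mha * SEQ_LEN / 1e9   # about 335.5 GB
```

Thinking in bytes per token is often the handier unit: at roughly a third of a megabyte per token, every extra thousand tokens of context costs this model about 0.33 GB of cache.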

Two terms in that formula are the troublemakers. Sequence length is linear: double your context, double your cache. Batch size is also linear, and it’s the one that surprises teams — every concurrent request you serve needs its own full cache. The KV cache is what caps how many users you can serve at once on a given GPU. Run out of room and you don’t degrade gracefully; you OOM.
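A back-of-the-envelope capacity check makes the concurrency cap tangible. This is a sketch under loud assumptions: the function name is hypothetical, the 640 GB figure stands for an assumed 8 × 80 GB node, 140 GB is 70B parameters at 2 bytes each, and activations, fragmentation, and framework overhead are folded into an optional `reserve_gb` that defaults to zero.

```python
import math

def max_concurrent_sequences(total_vram_gb, weights_gb, kv_gb_per_sequence, reserve_gb=0.0):
    """How many full-length sequences fit once the weights are loaded."""
    free_gb = total_vram_gb - weights_gb - reserve_gb
    if free_gb <= 0:
        return 0
    return math.floor(free_gb / kv_gb_per_sequence)

# Assumed 8 x 80 GB node, 70B model in BF16 (about 140 GB of weights).
long_ctx = max_concurrent_sequences(640, 140, kv_gb_per_sequence=42)      # 128K tokens each
short_ctx = max_concurrent_sequences(640, 140, kv_gb_per_sequence=2.625)  # 8K tokens each
print(long_ctx, short_ctx)   # 11 190
```

Same hardware, same model: 11 concurrent users at 128K tokens each, or 190 at 8K. Real servers land below both numbers once overhead is counted, but the ratio is the part that matters.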

⚠️ Note: Model weights are a fixed cost — load them once, done. The KV cache is a per-request, per-token cost that scales with traffic and context. On a busy long-context service, the cache routinely consumes more VRAM than the weights themselves. If you size your hardware off weights alone, you will run out of memory in production.

This is the concrete mechanism behind a softer problem covered elsewhere in this series: piling context into the window doesn’t just risk degrading the model’s answer quality, it has a hard, physical cost in gigabytes. “More context is free because it fits” is wrong on both counts.

The hidden cost: bandwidth, not just memory

Here’s the part that surprises even people who know the memory math. The KV cache doesn’t only cost you space — it costs you time, and not in the way you’d guess.

Token generation is memory-bandwidth-bound, not compute-bound. To generate each new token, the GPU must read the entire KV cache out of memory to compute attention against it. The actual arithmetic is cheap; the bottleneck is shovelling all those cached bytes from VRAM into the compute units, every single step. The bigger the cache, the more bytes move per token, the slower each token comes out.

That reframes long context entirely. A 100K-token prompt isn’t slow because there’s “more to think about.” It’s slow because every generated token drags a 100K-token cache across the memory bus first. Latency scales with cache size whether or not you’re anywhere near the context limit. So the cost of a long prompt is paid twice: once in the memory to hold the cache, and again in the per-token latency to read it.
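You can put a floor under that latency with one division. The sketch below assumes the cache is streamed from memory exactly once per generated token and nothing else is in the way; the 3,350 GB/s bandwidth is an assumed figure in the range of current HBM-equipped data-center GPUs, and the function name is hypothetical. Weight reads, which also happen every step, are deliberately left out so the cache's share is visible on its own.

```python
def kv_read_ms_per_token(kv_cache_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on per-token time spent just reading the KV cache."""
    return kv_cache_gb / bandwidth_gb_per_s * 1000.0

BANDWIDTH = 3_350.0   # GB/s, assumed

short = kv_read_ms_per_token(2.625, BANDWIDTH)   # ~8K-token cache on the 70B example
long = kv_read_ms_per_token(42.0, BANDWIDTH)     # ~128K-token cache on the 70B example
print(round(short, 2), round(long, 2))           # 0.78 12.54
```

Under these assumptions the long request spends over 12 ms per token on cache reads alone, sixteen times the short one, before a single weight has been loaded.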

This is why the headline number — context window size — tells you almost nothing about what long context will actually cost you in latency and dollars. The window is what fits. The cache is what you pay for. And it’s a big part of why the inference bill climbs faster than per-token pricing would suggest: long-context requests are expensive in memory and bandwidth, not just token count.

Shrinking the cache: GQA, MQA, and MLA

Because the cache scales with the number of KV heads, the most effective fixes attack that term directly — at the architecture level, before a model is ever trained.

Multi-Query and Grouped-Query Attention

Standard multi-head attention gives every query head its own key/value head. Multi-Query Attention (Shazeer, 2019) goes to the opposite extreme: all query heads share a single KV head. That shrinks the cache by a factor equal to the head count — dramatic — but sharing one KV projection across every head costs a noticeable 1–3% in quality.

Grouped-Query Attention (Ainslie et al., 2023) is the pragmatic middle, which is why it became the default for nearly every modern open model. Query heads are split into groups, and each group shares one KV head. With 64 query heads and 8 KV heads (Llama 3 70B’s configuration), you get an 8× smaller cache while staying within a hair of full multi-head quality. A neat bonus from the original work: you can convert an existing multi-head model to GQA by “uptraining” it on roughly 5% of the original pretraining compute, rather than training from scratch.

GQA isn’t a tradeoff you opt into at inference time — it’s a property of the model you chose. It’s one of the first things I check when evaluating a model for a long-context, high-throughput workload, because it sets the ceiling on how many concurrent requests you can serve.
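The sharing scheme itself is just integer division. Here is a small sketch of which KV head each query head reads from, assuming the usual contiguous grouping; `kv_head_for` is a hypothetical helper name, not a function from any framework.

```python
def kv_head_for(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Index of the KV head that a given query head attends with."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# Multi-head attention: every query head has its own KV head.
mha = [kv_head_for(h, 64, 64) for h in range(64)]
# Grouped-query attention, 64 query heads over 8 KV heads: groups of 8.
gqa = [kv_head_for(h, 64, 8) for h in range(64)]
# Multi-query attention: one KV head for everyone.
mqa = [kv_head_for(h, 64, 1) for h in range(64)]
```

All three variants are the same mechanism with a different `n_kv_heads`, and only the KV heads are cached, so the cache shrinks by exactly the group size.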

Multi-Head Latent Attention

DeepSeek’s Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2 and central to DeepSeek-V3, attacks the same bottleneck from a different angle. Instead of sharing keys and values across heads like GQA, MLA compresses them: it projects the full K and V down into a small low-rank latent vector, caches only that latent, and reconstructs the full-resolution keys and values on the fly during attention.

The payoff is roughly 3–5× cache compression — comparable to aggressive GQA — but with a twist that makes it genuinely interesting. DeepSeek’s ablations report that MLA didn’t just preserve quality, it slightly beat full multi-head attention, whereas GQA tends to sit a touch below it. So MLA is that rare optimization that can improve memory and quality at once. The cost is complexity: it requires extra projection steps, a decoupled rotary-position-embedding scheme, and specialized inference kernels to realize the gains, which is why you mostly see it in frontier-scale architectures (DeepSeek’s family, and the models that have since borrowed the recipe) rather than in every model.
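The compress-then-reconstruct idea can be sketched with three matrices. Everything below is a toy under stated assumptions: the sizes are made up (they are not DeepSeek's), the weights are random rather than learned, the helper names are hypothetical, and the decoupled rotary-embedding path is left out entirely. It shows only what gets stored per token and what gets rebuilt at attention time.

```python
import random

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, n_heads, head_dim, latent_dim = 32, 4, 8, 16   # toy sizes, not DeepSeek's

w_down = random_matrix(latent_dim, d_model, rng)              # hidden state -> latent
w_up_k = random_matrix(n_heads * head_dim, latent_dim, rng)   # latent -> all heads' keys
w_up_v = random_matrix(n_heads * head_dim, latent_dim, rng)   # latent -> all heads' values

latent_cache = []
for _ in range(5):
    hidden = [rng.uniform(-1, 1) for _ in range(d_model)]
    latent_cache.append(matvec(w_down, hidden))   # the only thing stored per token

# At attention time, rebuild full-size keys and values from the cached latents.
keys = [matvec(w_up_k, latent) for latent in latent_cache]
values = [matvec(w_up_v, latent) for latent in latent_cache]

cached_per_token = latent_dim                 # 16 numbers
mha_per_token = 2 * n_heads * head_dim        # 64 numbers for full K and V
```

With these toy sizes the cache holds 16 numbers per token instead of 64. The compression ratio is set by how small you dare make `latent_dim`, which is a training-time decision, not a serving-time one.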

💡 Tip — when you’re choosing a base model for self-hosted long-context serving, its attention design is a first-order decision, not a footnote. GQA-8 versus full MHA is an 8× difference in how many users you can fit on the same GPU. MLA can push that further. This single architectural choice often matters more than any serving tweak you’ll make later.

Serving-layer wins: PagedAttention, quantization, prefix caching

Architecture sets the ceiling; the serving stack decides how close you get to it. Three techniques do most of the work in 2026.

PagedAttention, the idea at the heart of vLLM, borrows from operating-system virtual memory. Naive serving allocates one big contiguous block of cache per request, sized for the worst case, which wastes enormous amounts of memory to fragmentation and over-allocation. PagedAttention instead stores the cache in small fixed-size “pages” that don’t need to be contiguous, allocated on demand. The result is far less wasted memory, which means far higher batch sizes and throughput on the same hardware — and it’s why vLLM-style serving became the default rather than a nice-to-have.
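The bookkeeping behind paging is small enough to sketch. The class below is a hypothetical illustration of the idea, not vLLM's actual implementation or API: a pool of fixed-size blocks, a free list, and a per-request block table that grows one block at a time as tokens arrive.

```python
class PagedKVAllocator:
    """Hands out fixed-size cache blocks on demand instead of one big slab per request."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.lengths = {}        # request id -> tokens stored so far

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:   # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

allocator = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    allocator.append_token("a")   # 20 tokens -> 2 blocks
for _ in range(5):
    allocator.append_token("b")   # 5 tokens -> 1 block
```

A request only ever wastes the unfilled tail of its last block, so the worst-case waste is one block per request rather than an entire worst-case-length reservation.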

KV cache quantization stores the cached keys and values at lower precision — FP8 or INT8 instead of BF16 — roughly halving the cache’s memory footprint with minimal quality impact. It’s the same core idea as model-weight quantization, applied to the cache instead of the parameters, and the two stack: a quantized model with a quantized cache fits dramatically more on a GPU.
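As a rough illustration of the INT8 case, here is symmetric per-vector quantization in plain Python. The function names are hypothetical and the scheme is the simplest one available (one scale per cached vector); production kernels typically use finer-grained scales and fused dequantization, so treat this as the idea rather than the implementation.

```python
def quantize_int8(vector):
    """Floats -> (integer codes in [-127, 127], one float scale for the vector)."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale

def dequantize_int8(codes, scale):
    """Approximate reconstruction used when attention reads the cache."""
    return [c * scale for c in codes]

key = [0.5, -1.27, 0.01, 1.0]
codes, scale = quantize_int8(key)
restored = dequantize_int8(codes, scale)
worst_error = max(abs(a - b) for a, b in zip(key, restored))
```

Each element now takes one byte instead of two, plus one shared scale per vector, and the reconstruction error per element is bounded by half the scale.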

Prefix caching exploits the fact that many requests share a common prefix — a long system prompt, a fixed set of instructions, a document reused across a conversation. Instead of recomputing the KV cache for that shared prefix on every request, the server computes it once and reuses it. This is the serving-side machinery behind the provider “prompt caching” features that cut both latency and cost; it’s the same mechanism, exposed as a billing line item.
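A stripped-down version of the lookup shows why a shared system prompt is nearly free the second time. The sketch assumes block-granular matching on exact token prefixes and stores whole prefix tuples for clarity; `PrefixCache` is a hypothetical name, and real servers key blocks by chained hashes and store actual K/V tensors rather than set entries.

```python
class PrefixCache:
    """Tracks which block-aligned token prefixes already have K/V computed."""

    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self.cached_prefixes = set()

    def prefill(self, tokens):
        """Return how many tokens still need K/V computed for this request."""
        reused = 0
        still_matching = True
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            prefix = tuple(tokens[:end])
            if still_matching and prefix in self.cached_prefixes:
                reused = end                      # this block's K/V can be reused
            else:
                still_matching = False
                self.cached_prefixes.add(prefix)  # computed now, reusable next time
        return len(tokens) - reused

system_prompt = list(range(64))                   # 64 shared tokens (made-up ids)
cache = PrefixCache(block_size=16)
first = cache.prefill(system_prompt + [9001, 9002, 9003])   # cold: all 67 tokens
second = cache.prefill(system_prompt + [7001])              # warm: only the 1 new token
```

The second request pays prefill for one token instead of 65. Note that the match has to be exact from the first token onward, which is why stable content belongs at the front of the prompt and per-request content at the end.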

What this means for you

You don’t have to implement any of this to benefit from understanding it. The practical consequences are concrete.

Treat context length as a budget, not a free allowance. Every token you put in the prompt is paid for twice — once in cache memory, once in per-token decode latency — on every call. Trimming retrieved context from twenty chunks to five isn’t only about answer quality; it directly shrinks the cache and speeds up generation. When you’re choosing a self-hosted model, read its attention architecture: GQA is table stakes, and MLA is a signal the model was designed for serious long-context serving. When you’re choosing a serving stack, use one with PagedAttention and prefix caching — that’s not premature optimization, it’s the baseline. And if you’re memory-bound, KV cache quantization is often the fastest win available.

The through-line is the same one that runs across this whole series, just expressed in hardware: more isn’t free. A bigger context window is more cache to hold and more bytes to read, not a free upgrade. The KV cache is one of nine concepts worth keeping sharp in the 2026 AI terms guide for developers — and it’s the one that turns “why is long context slow and expensive?” from a mystery into arithmetic.

FAQ

What is a KV cache in an LLM?

A KV cache stores the key and value vectors a transformer computes for every token it has already processed, keeping them in GPU memory so they don’t have to be recomputed on each generation step. It’s what makes token-by-token generation fast: without it, producing each new token would require recomputing attention over the entire preceding sequence, which is quadratic work and prohibitively slow.

Why does long context make LLM inference slow and expensive?

Two reasons, both tied to the KV cache. First, the cache grows linearly with sequence length, so long context consumes large amounts of GPU memory and limits how many requests you can serve at once. Second, token generation is memory-bandwidth-bound: every new token requires reading the entire cache from memory, so a bigger cache means slower generation even when the tokens technically fit in the context window.

How big is the KV cache?

Its size is 2 × layers × KV-heads × head-dimension × sequence-length × batch-size × bytes-per-element. As a concrete example, a single 128K-token sequence on a Llama-3-70B-class model (using grouped-query attention with 8 KV heads) needs roughly 42 GB of cache — on top of the model weights. The full multi-head version of the same model would need around eight times more.

What is the difference between GQA and MLA?

Both reduce KV cache size. Grouped-Query Attention (GQA) has groups of query heads share a smaller number of KV heads, cutting the cache by the grouping factor (8× in Llama 3 70B) with a small quality cost. Multi-Head Latent Attention (MLA), from DeepSeek, instead compresses keys and values into a low-rank latent vector and reconstructs them at inference, achieving similar 3–5× compression while, in DeepSeek’s tests, slightly improving quality over full multi-head attention.

How can I reduce KV cache memory usage?

At the model level, choose an architecture with GQA or MLA rather than full multi-head attention. At the serving level, use a stack with PagedAttention (such as vLLM) to eliminate memory fragmentation, enable prefix caching to reuse the cache for shared prompts, and apply KV cache quantization (FP8/INT8) to roughly halve the footprint. At the application level, send less context — every token you cut is cache you don’t allocate.
