Transformers live and die by self-attention. Yet, in its standard form, self-attention requires pairwise dot products between states at all timesteps, yielding quadratic time complexity and linear space complexity over sequence length. Modern LLMs like Claude Opus 4.7 come with context windows of 1M tokens, but the context anxiety remains.
Much work has been done on reducing the complexity of self-attention, including:
- replacements of attention with linear recurrent models; see Table 2 in the DeltaNet paper1
- hierarchical/routed attention, which gives significant constant-factor speedups but still grows quadratically in time, e.g. DeepSeek-V4’s Compressed Sparse Attention2
- sliding-window attention with a fixed-size cache, notably used in
mistral-7bandgpt-oss-120B
Replacing self-attention with recurrent models loses a crucial bias. By retaining past states, self-attention enables late interactions3 over sparse subsets of the input sequence4, drastically reducing the sample complexity in learning basic algorithms, e.g. copying arbitrary-length sequences. On the other hand, recurrent models are forced to compress at every step, increasing the complexity of learning such tasks.
DeepSeek’s algorithmic efforts have succeeded at decreasing the price of inference on their models, but have not unlocked infinite context windows. Sliding-window attention is not the solution either, as shown below.
These observations motivate a different approach: is it possible to (a) retain past states in a form suitable for late interactions while (b) ensuring the history never exceeds a predetermined maximum size?
In an auto-regressive Transformer, each forward pass generates a key-value pair $(k, v)$ at each attention layer (or attention head in the case of multi-head attention, but the extension is straightforward). The KV-cache for a sequence of length $n$ is a list of pairs:
$$C = \left\{(k_1, v_1), (k_2, v_2), \ldots, (k_n, v_n)\right\}$$
Viewing the Transformer as a process running on an operating system makes the KV-cache analogous to a list of variables allocated in memory. A variable’s key $k$ contains the metadata for the retrieval of its value $v$, the actual content. As the process runs, new variables are stored. However, the standard Transformer model never removes entries, leading to its unbounded memory requirement. Eviction and compaction are techniques for managing the process memory effectively.
Eviction Heuristics
An eviction policy $\pi(e|x, C)$ with $e\in \left\{1, 2, \ldots, n\right\}$ is a categorical distribution over entries in the KV-cache, such that sampling it selects a key-value pair to evict. The logits create an ordering, so this extends easily to a policy that retains a subset of entries, e.g. the top-$k$.
To devise an eviction policy, recall the formula for a self-attention layer (ignoring scaling):
$$\mathrm{Att}(X) = \underbrace{\mathrm{softmax}\big( Q(X) K(X)^\mathsf{T} \big)}_{\text{attention coefficients}}\cdot V(X)$$
where functions $Q, K, V$ broadcast over the input sequence $X=\left[x_1, x_2, \ldots, x_n\right]$ and $\mathrm{softmax}$ broadcasts over the first input dimension. Given input sequence $X$, suppose we sample completions from the distribution $p(Y|X)$. We define the future utility of the entry at KV-cache index $i$ as its expected contribution to the completion $Y$ by summing over attention coefficients. The vector $F$ of future utilities is:
$$F_i = \mathbb{E}_{Y|X} \left[\sum_{j=1}^{|Y|} \mathrm{softmax}\left( Q(Y) K(X)^\mathsf{T}\right)_{ji} \right]$$
The underlying assumption is that frequently attended entries carry proportionally greater causal influence on the generated completion. With this definition, eviction decisions become simple: remove the entry corresponding to $\argmin F$ from the KV-cache. However, exact computation is intractable as $p(Y|X)$ is too complex to efficiently enumerate, so further approximations are necessary.
A simple approach is to make the KV-cache a queue by evicting the oldest entry at each step. This is referred to as sliding-window attention, as it only stores a fixed window of the most recent key-value pairs. Some models are trained with this as part of their architecture, but otherwise it is a poor heuristic to apply on a pre-trained Transformer, because recency is a limited signal for future utility.
The attention oracle heuristic treats a reserved suffix (e.g. 16 tokens) of $X$ as a sampled completion, then evicts from the preceding positions (adapted from the $\mathrm{H}_2\mathrm{O}$ paper5). Protecting the first few positions from eviction is also essential, as that is where attention sinks form in auto-regressive LLMs6.
To get a feel for how difficult it is to create a good eviction policy, I ran a small experiment. I chose the open-source allenai/OLMo-2-0425-1B model as a testbed due to its dense, full-attention architecture. As a proxy for model performance, I computed mean cross-entropy (CE) over 2048-token sequences sampled from the allenai/dolmino-mix-1124 dataset.
Capping the KV-cache to a budget of 512 tokens and implementing sliding-window eviction over half of the attention layers increased the Dolmino CE loss by $+3.4\%$. By protecting the first four tokens from eviction to maintain attention sinks, the loss-delta decreased to only $+0.7\%$. Switching to the attention oracle policy gave a $+0.3\%$ delta.
Clearly, Dolmino was not challenging enough for testing eviction policies because such simple heuristics were so effective. To amend this, I created a custom dataset named Evicto QA that consists of (question, context, answer) triples. It combines the Natural Questions7 and Qasper8 datasets, then randomizes the location of the answer within the context. With limited KV-cache, a successful eviction policy must retain relevant information conditioned on the question to predict the answer. To isolate this effect, I computed mean CE loss only over answer tokens. Sliding-window eviction with protected attention sinks gave a $+19.1\%$ increase in loss over the base model, while the attention oracle policy had a $+14.9\%$ delta. This made Evicto QA a useful benchmark for testing improvements upon the heuristics.
Learned Eviction Policies
Moving beyond heuristics, is it possible to learn a good eviction policy from data?
To train a policy, I ran GRPO with negative CE loss as the reward. I broke the input sequence into chunks and evicted at the chunk level, because this makes training faster (rollouts are serial) and because tokens are often highly correlated with their neighbours.
I parametrized a scoring function over chunks of size $s=16$. It has a dynamic component that is an attention oracle with LoRA-modified queries of the most recent chunk $T$ with indices in $\left\{n-s+1, \ldots, n\right\}$. It also has a static component with learned queries $Q_\text{static}$ that are reused for every eviction decision, because certain keys may be reliable candidates in a variety of contexts. For a non-final chunk $P$, the combined scorer is:
$$\mathrm{Score}(P) = \sum_{i\in P} \sum_{j \in T} \mathrm{softmax} \left( Q^\prime(x_j) K(x_i)^\mathsf{T} \right)_{ji}$$
where
$$Q^\prime(x) = \begin{bmatrix} Q(x) + \mathrm{LoRA}(x) \\ Q_\text{static} \end{bmatrix}$$
I used 10 rollouts for computing the advantage and a batch size of 192. With a scorer consisting of a rank-16 LoRA and 64 static queries, I ran training for 60 steps, which was enough to see each example (~12k total) in Evicto QA once. This validated that an eviction policy could be learned using a fairly basic training setup. Attention sink protection was applied to all methods shown below.
| method | loss | delta |
|---|---|---|
| full cache | 3.051 | 0.0% |
| learned policy | 3.338 | +9.4% |
| attention oracle | 3.506 | +14.9% |
| learned policy (init) | 3.549 | +16.3% |
| sliding-window | 3.635 | +19.1% |
There are several directions for improvement. Recent research from Apple trains eviction policies to maximize future utility directly at each attention head9, rather than minimizing a global CE loss. The scorer is parametrized as a 2-layer MLP with 256 hidden units to score the $i$-th entry with $[k_i,v_i,i]$ as input, which is akin to the static queries in my scorer. Their text completions are sampled using a full-cache model independently of eviction decisions, so training is much more efficient than online RL, but also makes the learning off-policy. They show strong results running Qwen2.5-7B-Chat on the RULER dataset.
Another paper from Stanford uses an attention oracle for eviction and continues RLVR post-training with it active, which forces the model to adapt its reasoning under memory pressure10. They make the model budget-aware through a structured tag indicating the percent of context that will be evicted each time. Intuitively, the model can learn that it should not start a new sub-task when a large portion of context is about to be cleared.
Compaction Techniques
Compaction is the more general term for compressing a KV-cache by returning a smaller one. Eviction is a special case where the cache size is reduced by directly removing entries created during inference. However, there is no hard constraint that the compressed cache must be a subset of the original; more advanced methods modify its entries or even rewrite it completely.
Attention matching11 is a recently proposed technique that (1) detects entries with the highest future utility and evicts the rest, (2) reweights attention coefficients to account for attention mass lost from evicted entries, and (3) updates the remaining values using least-squares fitting. To do this, they need to predict what future queries will look like, which they solve using prompts such as <context> Repeat the previous context. <context>. One can view it as a more complete (but also more expensive) version of the attention oracle followed by adaptation of the values.
The next level is to build the compressed KV-cache from scratch, i.e. to write a summary. As a baseline, the model can be prompted to summarize the context thus far, which is what Claude Code uses based on leaked source code12. In my experience, Claude Code’s auto-compaction tends to derail the conversation by mixing old context with the new. I’ve developed a habit of forking sessions and manually running /compact whenever I feel the existing context is somewhat self-contained and shouldn’t mix with upcoming instructions. From a user’s perspective, this is a poor experience, but there are likely diminishing returns to tuning at the prompt level.
A promising approach is to teach the model to periodically summarize its context using <summary> control tokens, similar to how reasoning models use <think> to delineate reasoning steps. This has the advantage of being dynamic, so the model decides when to summarize, and of being more general, as it can be used on any LLM with summarization abilities. A paper from Microsoft Research13 introduces the OpenMementos dataset, which contains ~230k reasoning traces annotated with compressed summaries generated by GPT-5. They used supervised fine-tuning with a fully active KV-cache to teach open-source models how to use the summary tokens. Later they used masking to remove context outside the summaries and force the model to only use its summaries. A key insight to highlight is that restarting the KV-cache associated with summary tokens results in a 15 percentage-point decrease on AIME24, indicating that there is value in retaining the cache entries that were computed before compaction as they contain more information than the input tokens alone.
What Next?
The design space of eviction and compaction methods is fascinating. And yet, hand-crafted heuristics and least-squares fitting are inevitably limited by their assumptions, even though they look elegant to a research scientist. The bitter lesson has been vindicated numerous times: both instruction-tuning and reasoning in LLMs are enabled through training rather than special modelling techniques.
Instead of building sophisticated objectives or architectures, datasets should encode the behaviours needed for open-ended contexts: summarization, continual-learning rubrics, and memory management. Perhaps all it takes to achieve infinite context in practice is to fine-tune a model with Memento-style summarization in an RLVR loop. In any case, there is plenty of work to be done.
Parallelizing Linear Transformers with the Delta Rule over Sequence Length ↩︎
ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction ↩︎
When Do Transformers Outperform Feedforward and Recurrent Networks? A Statistical Perspective ↩︎
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models ↩︎
A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers ↩︎
Neural Garbage Collection: Learning to Forget while Learning to Reason ↩︎