Unlike an RNN, whose state has a fixed size, a Transformer's memory grows linearly with sequence length: the KV cache stores a key and value for every past token. This is the late interaction principle – rather than committing to a single state at each timestep, the Transformer retains all past states and mixes them with self-attention, deferring the decision of what information to retain.
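To make the contrast concrete, here is a minimal sketch of a per-layer KV cache during autoregressive decoding; the layer/head/dimension sizes are made up for illustration, not taken from any particular model. An RNN would overwrite a fixed-size state at each step, whereas this cache only ever appends.

```python
import numpy as np

# Illustrative sizes only.
n_layers, n_heads, d_head = 12, 12, 64

cache = [{"k": np.zeros((0, n_heads, d_head)), "v": np.zeros((0, n_heads, d_head))}
         for _ in range(n_layers)]

def append_step(cache, new_k, new_v):
    """An RNN would keep O(1) state; here every step appends, so memory is O(T)."""
    for layer, k_t, v_t in zip(cache, new_k, new_v):
        layer["k"] = np.concatenate([layer["k"], k_t[None]], axis=0)
        layer["v"] = np.concatenate([layer["v"], v_t[None]], axis=0)
    return cache

for t in range(1024):  # after T steps each layer holds T keys and T values
    new_k = np.random.randn(n_layers, n_heads, d_head)
    new_v = np.random.randn(n_layers, n_heads, d_head)
    cache = append_step(cache, new_k, new_v)
```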
But what if the late interaction principle is mainly useful for training? Perhaps we can RNN-ify the Transformer after it has developed all the important representations and circuits it needs; can we trim the fat?
Attention Is a Kernel
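One way to read this: the attention output for a query is a kernel-weighted average of the stored values, i.e. a Nadaraya–Watson estimator with an exponential kernel over query–key similarity:

$$
\mathrm{Attn}(q) = \sum_{j=1}^{T} \frac{\kappa(q, k_j)}{\sum_{j'=1}^{T} \kappa(q, k_{j'})}\, v_j,
\qquad \kappa(q, k) = \exp\!\left(\frac{q^\top k}{\sqrt{d}}\right)
$$

Compressing the cache then amounts to approximating this kernel sum with far fewer (key, value) pairs, which is what the steps below work toward.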
- Step 1: eviction
- Step 2: eviction plus value matching
- Step 3: inducing points
Eviction Policies
- Attention sinks
- Heuristics
- Key-scoring
- Learned queries
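As a concrete example of key-scoring, here is a minimal sketch of an accumulated-attention eviction policy in the spirit of heavy-hitter heuristics; the scoring rule, budget, and sink count are illustrative, not necessarily the policy used here.

```python
import numpy as np

def evict_by_score(keys, values, attn_weights, budget=512, n_sink=4):
    """Keep `budget` tokens: protect the first `n_sink` positions (attention sinks),
    then keep the tokens that received the most accumulated attention.

    keys, values:  (T, d) cached keys/values for one head
    attn_weights:  (Q, T) attention received by each cached token from recent queries
    """
    T = keys.shape[0]
    if T <= budget:
        return keys, values

    score = attn_weights.sum(axis=0)             # accumulated attention per cached token
    score[:n_sink] = np.inf                      # sinks are never evicted
    keep = np.sort(np.argsort(score)[-budget:])  # top-`budget` tokens, in original order
    return keys[keep], values[keep]
```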
The goal is to predict what future queries will need.

- Eviction – essentially an online coreset problem
- Compaction – eviction, but applied to very large chunks at once
- Attention matching / STILL – inducing points (sketched below)
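For the inducing-point idea, a rough sketch under my own assumptions: a set of learned static queries cross-attends to a block of the cache and emits a smaller set of summary key–value pairs. Module and parameter names (`InducingPointCompressor`, `n_queries`, etc.) are illustrative, not the actual architecture.

```python
import torch
import torch.nn as nn

class InducingPointCompressor(nn.Module):
    """Compress a block of cached (K, V) into a fixed number of summary pairs.

    Learned static queries cross-attend to the block; the attention outputs
    become the new values, and a projection of them becomes the new keys.
    """
    def __init__(self, d_model: int, n_queries: int = 64, n_heads: int = 8):
        super().__init__()
        self.static_q = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_key = nn.Linear(d_model, d_model)

    def forward(self, block_k: torch.Tensor, block_v: torch.Tensor):
        # block_k, block_v: (batch, block_len, d_model)
        q = self.static_q.unsqueeze(0).expand(block_k.shape[0], -1, -1)
        summary, _ = self.attn(q, block_k, block_v)   # (batch, n_queries, d_model)
        return self.to_key(summary), summary          # compressed (K, V)
```

In this framing the static queries play the role of inducing points: a fixed-size bottleneck trained so that attending to the summaries approximates attending to the full block.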
Supervised vs RL approach
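For the supervised route, one natural objective (my reading of "attention matching", not necessarily the exact loss used here) is to make attention over the compressed cache reproduce attention over the full cache:

```python
import torch
import torch.nn.functional as F

def attention_matching_loss(q, full_k, full_v, comp_k, comp_v):
    """Supervised signal: attention outputs over the compressed cache should match
    attention outputs over the full cache (teacher). Shapes: q is (batch, Q, d);
    full caches are (batch, T, d), compressed caches are (batch, M, d)."""
    d = q.shape[-1]
    def attend(k, v):
        w = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
        return w @ v
    target = attend(full_k, full_v).detach()   # teacher: full cache, no gradient
    pred = attend(comp_k, comp_v)              # student: compressed cache
    return F.mse_loss(pred, target)
```

The RL alternative would presumably score the discrete eviction/compaction decisions by downstream task reward instead, avoiding the need to differentiate through them.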
- LoRA rank: 16
- Static queries: 64
- Cache size: 512
- Chunk size: 16
- Eviction layers: 11, 12
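For reference, the settings above collected into one config; the field names are mine, only the values come from the notes.

```python
from dataclasses import dataclass

@dataclass
class CompressionConfig:
    # Field names are illustrative; values are taken from the notes above.
    lora_rank: int = 16
    n_static_queries: int = 64
    cache_size: int = 512               # token budget for the compressed cache
    chunk_size: int = 16                # eviction/compaction granularity
    eviction_layers: tuple = (11, 12)   # layers where eviction is applied
```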
Attention sinks are a result of the positional encodings. If we remove RoPE, perhaps we won't need the protected chunks for attention sinks.