
Designing KV Cache Compressors

Unlike RNNs, a Transformer's memory usage grows linearly with sequence length: the KV cache stores the keys and values of every past token. This is a consequence of the late interaction principle – rather than updating a single state at each timestep, the Transformer retains all past states and mixes them with self-attention, delaying the decision of what information to retain.
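To make the growth concrete, here is a minimal sketch of KV-cache memory as a function of sequence length; the layer count, KV-head count, head dimension, and fp16 storage are illustrative assumptions, not any particular model.

```python
# Minimal sketch: KV-cache memory grows linearly with sequence length.
# The model dimensions below are illustrative assumptions.

def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Keys and values are stored for every past token, so memory is O(seq_len)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return seq_len * per_token

for n in (1_024, 8_192, 65_536):
    print(f"{n:>6} tokens -> {kv_cache_bytes(n) / 2**30:.2f} GiB")
```

With these assumed dimensions the cache grows by roughly 128 KiB per token, i.e. about 1 GiB at 8K tokens and 8 GiB at 64K.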

But what if the late interaction principle is mainly useful for training? Perhaps we can RNN-ify the Transformer after it has developed all the important representations and circuits it needs; can we trim the fat?

Attention is a kernel

  • Step 1: eviction
  • Step 2: eviction and value matching
  • Step 3: inducing points (see the sketch after this list)
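To make the kernel view concrete, here is a minimal numpy sketch (the shapes, the random initialisation, and the absence of a fitting step are all assumptions): softmax attention is a kernel-weighted average of the cached values, and inducing points replace the n cached key/value pairs with m ≪ n pseudo pairs that queries attend to instead.

```python
# Minimal sketch: attention as kernel smoothing, compressed with inducing points.
import numpy as np

def attention(q, K, V):
    # Softmax attention is a kernel-weighted average of values,
    # with kernel k(q, k_i) = exp(q · k_i / sqrt(d)).
    scores = q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n, m, d = 512, 32, 64                          # n cached tokens, m inducing points
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
q = rng.normal(size=(1, d))

# Inducing points: m pseudo key/value pairs standing in for the full cache.
# Here they are random; in practice they would be fitted so that
# attention(q, K_ind, V_ind) approximates attention(q, K, V) for expected queries.
K_ind, V_ind = rng.normal(size=(m, d)), rng.normal(size=(m, d))

full = attention(q, K, V)                      # exact output over the whole cache
approx = attention(q, K_ind, V_ind)            # output from the compressed cache
```

Compressing n cache entries down to m inducing points caps memory at a constant size regardless of sequence length, which is what step 3 is after.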

Eviction Policies

  • Attention sinks
  • Heuristics
  • Key-scoring
  • Learned queries

The goal is to predict what future queries will need.

  • Eviction – similar to an online coreset problem (see the sketch below)
  • Compaction – eviction, but with very large chunks
  • Attention matching / STILL – inducing points
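As one concrete instance of the key-scoring family, here is a hedged sketch of an eviction step that protects an attention-sink prefix and drops the keys with the least accumulated attention mass; the scoring rule, the sink handling, and the function signature are assumptions rather than the exact method used here.

```python
# Hedged sketch of key-scoring eviction: drop the entries with the least
# accumulated attention, but never the protected attention-sink prefix.
import numpy as np

def evict(keys, values, attn_mass, cache_size=512, sink_tokens=16):
    """Shrink the cache to at most `cache_size` entries.

    attn_mass: per-entry attention mass accumulated from recent queries
    (the key-scoring heuristic assumed here).
    """
    n = keys.shape[0]
    if n <= cache_size:
        return keys, values, attn_mass
    scores = attn_mass.astype(float).copy()
    scores[:sink_tokens] = np.inf                     # protect the sink prefix
    keep = np.sort(np.argsort(scores)[-cache_size:])  # top scores, original order
    return keys[keep], values[keep], attn_mass[keep]

# Usage sketch: call after each new chunk of tokens is appended to the cache,
# e.g. every 16 tokens with a 512-entry budget.
```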

Supervised vs RL approach

LoRA 16, Static queries 64

All eviction policies share:

  • Cache size: 512
  • Chunk size: 16
  • Eviction layers: 11, 12
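For reference, here is a hypothetical config object bundling the settings above with the LoRA and static-query numbers mentioned earlier; the field names, and reading "LoRA 16" as a LoRA rank, are assumptions.

```python
# Hypothetical configuration collecting the settings from these notes.
# Field names, and the interpretation of "LoRA 16" as a rank, are assumptions.
from dataclasses import dataclass

@dataclass
class EvictionConfig:
    cache_size: int = 512               # max KV entries kept per eviction layer
    chunk_size: int = 16                # tokens considered for eviction at a time
    eviction_layers: tuple[int, ...] = (11, 12)
    lora_rank: int = 16                 # "LoRA 16"
    num_static_queries: int = 64        # "Static queries 64"
```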

Attention sinks are a result of the positional encodings. If we remove RoPE, perhaps we won't need the protected chunks for attention sinks.