Predictions on Continual Learning


I think I know roughly what the solution to continual learning will look like. And by the end of the year I expect one of the big labs will have implemented it. I’ll write an update post before the end of 2026 to see if I was right.

Pre-training on text creates a bundle of algorithms that we collectively call an LLM. Post-training then tunes the relevance of these algorithms within the model, aligning its behaviour to particular use cases, like being a helpful assistant.

The common approach to continual learning at the moment is to create a structured memory and have the LLM populate it with information it identifies as important. During inference, the LLM uses in-context learning to adapt its output to these memories. As a model experiences more text, its memory grows, but this runs into problems at scale: (a) VRAM is expensive, (b) increasing the supported context size of LLMs requires high-quality text with long-range dependencies, which is relatively rare, and (c) self-attention has time complexity quadratic in sequence length.
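To make the structured-memory approach concrete, here is a toy sketch of the write/retrieve loop. Everything in it is illustrative: the hashing-trick `embed` stands in for a real sentence encoder, and the `Memory` class is a hypothetical store, not any lab's actual system.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy hashing-trick embedding standing in for a real sentence encoder:
    # each word deterministically bumps one coordinate.
    v = np.zeros(dim)
    for word in text.lower().split():
        v[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    return v / np.linalg.norm(v)

class Memory:
    """Store facts the LLM flags as important; retrieve by similarity."""

    def __init__(self):
        self.entries: list[tuple[str, np.ndarray]] = []

    def write(self, fact: str) -> None:
        self.entries.append((fact, embed(fact)))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Return the k most similar facts, to be prepended to the prompt
        # so the model can adapt via in-context learning.
        q = embed(query)
        scored = sorted(self.entries, key=lambda e: -float(e[1] @ q))
        return [fact for fact, _ in scored[:k]]

mem = Memory()
mem.write("user prefers tabs over spaces")
mem.write("project uses python 3.12")
mem.write("user dislikes verbose answers")
context = mem.retrieve("which python version does the project use", k=1)
```

The scale problems in the paragraph above show up directly here: `entries` grows without bound, and everything retrieved must still fit in (and be attended over in) the context window.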

Instead, we need a way to compress the memory and access it efficiently during inference. A naive approach would be to periodically optimize the model on generated trajectories so that context is baked directly into its weights. The Titans paper1 instead proposes a gradient-based update at test time on an explicit fixed-size memory module, rather than on the model’s weights. The challenge with these approaches is deciding how to weight the experiences, since (a) ‘garbage in, garbage out’ implies that a conversation with someone stupid will make the LLM stupid as well, (b) a dataset of self-generated tokens will likely be quite small, and (c) the training signal would be too raw and noisy, making optimization inefficient.
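A minimal sketch of such a fixed-size memory, based on my reading of the Titans idea rather than the paper's exact method: a linear associative map `M` is updated at test time by gradient steps on the "surprise" (prediction error) for each key/value pair, with momentum and a forgetting term. All names and hyperparameters here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
M = np.zeros((d, d))    # memory size is fixed, independent of stream length
S = np.zeros_like(M)    # momentum accumulating past surprise
lr, momentum, forget = 0.5, 0.5, 0.01

def memory_step(M, S, k, v):
    surprise = M @ k - v              # how wrong the memory is for this pair
    grad = np.outer(surprise, k)      # d/dM of 0.5 * ||M k - v||^2
    S = momentum * S - lr * grad
    M = (1.0 - forget) * M + S        # decay old content, write the update
    return M, S

# Stream a few key/value associations through the memory.
pairs = []
for _ in range(3):
    k = rng.standard_normal(d)
    pairs.append((k / np.linalg.norm(k), rng.standard_normal(d)))

for k, v in pairs:
    err_before = np.linalg.norm(M @ k - v)
    for _ in range(20):
        M, S = memory_step(M, S, k, v)
    err_after = np.linalg.norm(M @ k - v)  # surprise shrinks as the pair is stored
```

Note that the memory stays `d × d` no matter how long the stream gets, which is exactly the compression the paragraph above asks for; the open question is how to weight which experiences get written.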

To solve this, I think the big labs will leverage features from within the LLMs themselves that correspond to meaningful changes in behaviour. This makes it possible to optimize over trajectories efficiently due to (a) the denoising nature of projection onto feature directions (less variance) and (b) the selection of directions being conducive to useful learning (beneficial biases). The process of finding such features could involve both human and automated methods.
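The denoising claim in (a) can be illustrated with synthetic numbers: if a behaviour lives along a known unit feature direction `f` in activation space, reading the activation out through `f` discards the noise in the other `d − 1` dimensions. Everything here is a toy setup of my own; `f` stands in for e.g. an SAE feature direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n = 64, 1.0, 2000

f = rng.standard_normal(d)
f /= np.linalg.norm(f)
s_true = 3.0                             # true strength of the behaviour

# Noisy activations: signal along f plus isotropic noise everywhere.
acts = s_true * f + sigma * rng.standard_normal((n, d))

# Using the raw activation as the learning signal keeps noise in all d dims.
raw_mse = np.mean(np.sum((acts - s_true * f) ** 2, axis=1))   # ~ sigma^2 * d

# Projecting onto f first keeps only the 1-D component that matters.
proj = acts @ f                                                # scalar per sample
proj_mse = np.mean(np.sum((proj[:, None] * f - s_true * f) ** 2, axis=1))  # ~ sigma^2
```

The mean squared error of the projected readout is roughly `d` times smaller, which is the "less variance" part of (a); the "beneficial biases" part of (b) depends on choosing `f` well in the first place.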

A blog post from Thinking Machines showed that on-policy distillation using a rank-32 LoRA on Qwen3-8B was able to match full supervised fine-tuning at $9\times$ lower cost2. Intuitively, this suggests that the model contains latent abilities that are unlocked by low-rank modifications to its weights. While that result comes from optimizing on a specific dataset, I hypothesize that the mechanism works in the other direction too: by fine-tuning a feature-constrained LoRA at test time, the model can achieve continual learning.
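Here is one way a "feature-constrained LoRA" could look, sketched on a single linear layer; the layer, data, and loss are stand-ins I made up for illustration. The down-projection `A` is frozen to pre-selected unit feature directions, and only the up-projection `B` is trained at test time, so every weight update stays in the span of those features.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, lr = 32, 32, 4, 0.2

W = rng.standard_normal((d_out, d_in)) * 0.1  # frozen "pre-trained" weights
A = rng.standard_normal((r, d_in))            # frozen feature directions
A /= np.linalg.norm(A, axis=1, keepdims=True)
B = np.zeros((d_out, r))                      # the only trainable parameters

# Test-time "experience": targets that differ from W's behaviour only
# along the chosen feature directions, so they are reachable by training B.
X = rng.standard_normal((64, d_in))
T = rng.standard_normal((d_out, r))
Y = (W + T @ A) @ X.T

for _ in range(300):
    pred = (W + B @ A) @ X.T
    grad_B = (pred - Y) @ (A @ X.T).T / X.shape[0]  # d/dB of 0.5 * mean sq. error
    B -= lr * grad_B

rel_err = np.linalg.norm((W + B @ A) @ X.T - Y) / np.linalg.norm(Y)
```

The design choice mirrors the hypothesis in the paragraph above: constraining the update to feature directions keeps the number of trainable parameters small and, hopefully, restricts learning to behaviourally meaningful changes.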

Let’s consider two practical applications of continual learning and how features could be selected to enable online learning:

  1. Coding agent: bundle it with features related to software design, syntax style, and high-level engineering behaviours. Depending on the environment and agents it interacts with, the coding agent should be able to adapt its behaviour to become e.g. an architect, a fixer, a tech lead, or a coding machine3.
  2. Home assistant: bundle it with features focused on human emotions and relationships. For instance, some people prefer a more direct communication style than others. This type of trait can be personalized for each user based on interaction history and has been studied at length in research on personas4.

I’m excited to try this myself, perhaps using SAE features from one of the Gemma models, or by generating my own features on one of the OLMo models.