I think I know roughly what the solution to continual learning will look like. And by the end of the year I expect one of the big labs will have implemented it. I’ll write an update post before the end of 2026 to see if I was right.
Pre-training on text creates a bundle of algorithms which we call an LLM. Once developed, post-training adjusts the weighting of these algorithms within the model to align its behaviour with particular use cases, like being a helpful assistant.
The current approach to continual learning with LLMs is to create a structured memory and have the model populate it with information it identifies as important. During inference, the LLM uses in-context learning to adapt its output to these memories. As a model experiences more text, it will grow its memory, but there are problems with doing so at scale: (a) VRAM is expensive, (b) increasing the supported context size of LLMs requires high-quality text with long-range dependencies, which is relatively rare, and (c) self-attention has quadratic time complexity.
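As a toy illustration of this pattern (the class name and the word-overlap retrieval heuristic are my own, not any particular framework's; real systems use embedding similarity):

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Hypothetical structured memory: the model writes salient facts,
    and relevant entries are prepended to the prompt at inference."""
    entries: list[str] = field(default_factory=list)

    def write(self, fact: str) -> None:
        self.entries.append(fact)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Toy relevance score: word overlap with the query.
        q = set(query.lower().split())
        scored = sorted(self.entries,
                        key=lambda e: len(q & set(e.lower().split())),
                        reverse=True)
        return scored[:k]


def build_prompt(store: MemoryStore, user_msg: str) -> str:
    # In-context learning: retrieved memories become part of the prompt.
    memories = "\n".join(store.retrieve(user_msg))
    return f"Relevant memories:\n{memories}\n\nUser: {user_msg}"
```

Note that the memory grows without bound as the model experiences more text, which is exactly where the scaling problems above bite.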
Instead, we need a way to compress the memory and access it efficiently during inference. A naive approach would be to periodically optimize the model on generated trajectories so context is baked directly into its weights. The Titans paper[1] proposes a smart optimizer step on an explicit fixed-size memory module rather than updating the model’s weights.
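A loose numerical sketch of the idea (my simplification, not the paper's exact formulation): treat the memory as a fixed-size linear map trained online, where the gradient of the reconstruction error acts as a "surprise" signal, combined with momentum and a forgetting term:

```python
import numpy as np


def update_memory(M, k, v, lr=0.1, momentum=0.9, decay=0.01, S=None):
    """Titans-style online update (simplified sketch). M is a fixed-size
    memory matrix trained to associate keys with values; S carries the
    momentum state across updates."""
    if S is None:
        S = np.zeros_like(M)
    err = M @ k - v               # surprise: how wrong is memory on this pair?
    grad = np.outer(err, k)      # gradient of 0.5 * ||M k - v||^2
    S = momentum * S - lr * grad  # momentum on the surprise signal
    M = (1 - decay) * M + S       # forget old memories, apply the update
    return M, S
```

The appeal is that the memory's size is constant regardless of how much text has been seen, sidestepping the VRAM and context-length problems above.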
The challenge with these approaches is deciding how to weight the experiences. Three key issues arise: (a) ‘garbage in, garbage out’: a conversation with an uninformed user can make the LLM less capable, (b) a dataset of self-generated tokens will likely be quite small, and (c) the training signal would be too raw and noisy, making optimization inefficient.
More generally, the most efficient method for teaching Transformer models new behaviours can be identified by considering two dimensions: (1) the breadth of the target domain (wide vs narrow) and (2) how much coverage the existing model already has (low vs high). The key to continual learning is figuring out how to apply these methods at the appropriate frequency over time.
| | low coverage | high coverage |
|---|---|---|
| narrow | in-context learning | LoRA fine-tuning |
| wide | structured memory | full fine-tuning |
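The table reads as a simple lookup over the two dimensions:

```python
def choose_method(breadth: str, coverage: str) -> str:
    """Map (breadth of target domain, existing model coverage) to the
    most efficient teaching method, per the 2x2 table above."""
    table = {
        ("narrow", "low"): "in-context learning",
        ("narrow", "high"): "LoRA fine-tuning",
        ("wide", "low"): "structured memory",
        ("wide", "high"): "full fine-tuning",
    }
    return table[(breadth, coverage)]
```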
Agents (see Claude Code) and memory frameworks (see Mem0) already handle the low coverage cases very well. A user can provide a large amount of information in-context and have it compacted or stored for future reference. Since the type of information can vary greatly, it makes sense to store it in the most general format: tokens.
Moreover, full fine-tuning of models for wide tasks using continuous RL has proven effective (see Composer 2 by Cursor), where continual training on new data from its user base means a new checkpoint can be released every five hours[2]. This is not too different from how recommendation systems are adapted to user preferences on social media platforms.
The missing piece is a generic LoRA fine-tuning method that can be applied at test time for narrow applications, e.g. for a specific user or even a specific task. This is akin to how a human adapts to a new job role over the course of a week. Compared to the other elements in the table above, there is no mature technique for this yet.
To solve this, I think the big labs will leverage features from within the LLMs themselves that correspond to meaningful changes in behaviour. This makes it possible to optimize over trajectories efficiently due to (a) the denoising nature of projection onto feature directions (lower variance) and (b) the selection of directions being conducive to useful learning (beneficial biases). The process of finding such features could involve both human preference and automated methods.
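A minimal sketch of the variance-reduction intuition in (a): project a raw, noisy gradient onto a small set of feature directions (assumed orthonormal here, e.g. after a QR decomposition), so any noise outside their span is discarded:

```python
import numpy as np


def project_update(grad, features):
    """Project a raw (noisy) gradient onto the span of a small set of
    feature directions. `features` is (k, d) with k << d; rows are
    assumed orthonormal."""
    coeffs = features @ grad   # (k,) coordinates in feature space
    return features.T @ coeffs  # back to parameter space, denoised
```

Point (b) then amounts to choosing which rows of `features` to keep: directions known to correspond to meaningful behaviour make the constrained update a useful bias rather than an arbitrary one.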
A blog post from Thinking Machines showed that on-policy distillation using a rank-32 LoRA on Qwen3-8B was able to match full supervised fine-tuning at $9\times$ lower cost[3]. Intuitively, this shows that the model contains latent abilities that are unlocked by low-rank modifications to its weights. While this holds when optimizing on specific datasets, I hypothesize that it also holds in reverse: by fine-tuning a feature-constrained LoRA at test time, the model will adapt in useful and interpretable ways.
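For reference, the standard LoRA parameterization keeps the base weight frozen and confines the update to a rank-$r$ subspace (sketch below; shapes and the zero-init of $B$ follow the usual LoRA recipe, so the adapter starts as a no-op):

```python
import numpy as np


def lora_forward(x, W, A, B, alpha=32.0):
    """LoRA forward pass: y = x @ (W + (alpha / r) * A @ B).
    Only A (d_in, r) and B (r, d_out) are trained; W stays frozen,
    so the learned update has rank at most r."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B
```

A feature-constrained variant, as hypothesized above, would further restrict `A` and `B` so the update moves the model only along chosen feature directions.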
Let’s consider two practical applications of continual learning and how features could be selected to enable online learning:
- Coding agent: equip it with features focused on software design, syntax style, and high-level engineering behaviours. Depending on the environment and agents it interacts with, the coding agent should be able to adapt its behaviour to become e.g. an architect, a fixer, a tech lead, or a coding machine[4].
- Home assistant: equip it with features focused on human emotions and relationships. For instance, some people prefer a more direct communication style than others. This type of trait can be personalized for each user based on interaction history and has been studied at length in research on personas[5].
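A hypothetical mechanism for this kind of per-user trait adaptation, in the spirit of activation steering (function names and shapes are my own, purely illustrative):

```python
import numpy as np


def steer(hidden, feature_dirs, coeffs):
    """Add weighted feature directions (e.g. a 'directness' direction)
    to a hidden state. feature_dirs is (k, d); coeffs is (k,)."""
    return hidden + coeffs @ feature_dirs


def update_coeffs(coeffs, feedback, lr=0.05):
    """Nudge per-user trait strengths from scalar feedback per trait
    (+1 thumbs-up / -1 thumbs-down), clamped to a safe range."""
    return np.clip(coeffs + lr * feedback, -1.0, 1.0)
```

The same idea carries over to the feature-constrained LoRA above: interaction history supplies the training signal, and the chosen feature directions keep the adaptation interpretable.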
I’m excited to try this myself, perhaps using SAE features from one of the Gemma models, or by generating my own features on one of the OLMo models.