ngeo.dev

Premise: models are being fine-tuned more and more often, which risks catastrophic forgetting, for example of safety and alignment training. To test this, I fine-tuned OLMo-3 7B on OpenOrca for a few epochs. The effect was not drastic: the refusal rate on the StrongREJECT benchmark dropped slightly, from 99% to 97%. That drop is too small to be concerning on its own, but I did not continue training because of cost. Instead, I pivoted to testing the confounding effect of forgetting by targeting the model's weights directly, which led me into the unlearning literature.

Let the post-training objective be approximated as a mixture of a large background distribution $R$ and a smaller dataset component $S$, where $S$ is the dataset whose local contribution we want to downweight or remove.

$$\mathcal{L}_p(\theta)=(1-p)\mathcal{L}_R(\theta)+p\mathcal{L}_S(\theta)$$

Assume the trained model parameters $\theta^*$ are approximately stationary for this mixture:

$$\nabla_\theta \mathcal{L}_p(\theta^*) \approx 0$$

Writing

$$g_R = \nabla_\theta \mathcal{L}_R(\theta^*),\qquad g_S = \nabla_\theta \mathcal{L}_S(\theta^*)$$

we have

$$(1-p)g_R + p g_S \approx 0$$

and therefore

$$g_R \approx -\frac{p}{1-p}g_S$$

Now consider a hypothetical objective in which the contribution of $S$ is reduced from $p$ to $p' = p-\rho$:

$$\mathcal{L}_{p'}(\theta)=(1-p')\mathcal{L}_R(\theta)+p'\mathcal{L}_S(\theta)$$

At the original parameters $\theta^*$, the gradient of the new objective is

$$\nabla_\theta \mathcal{L}_{p'}(\theta^*)=(1-p')g_R+p'g_S$$

Using the stationarity condition for the original mixture, this becomes

$$\nabla_\theta \mathcal{L}_{p'}(\theta^*)\approx-\frac{\rho}{1-p}g_S$$
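The intermediate algebra is short enough to write out. It uses only the definition of the reduced weight ($p$ minus $\rho$) and the two stationarity relations above, and introduces no new assumptions:

$$\begin{aligned}\nabla_\theta \mathcal{L}_{p-\rho}(\theta^*) &= (1-p+\rho)\,g_R+(p-\rho)\,g_S \\ &= \big[(1-p)\,g_R+p\,g_S\big]+\rho\,(g_R-g_S) \\ &\approx \rho\,(g_R-g_S) \\ &\approx \rho\left(-\frac{p}{1-p}-1\right)g_S = -\frac{\rho}{1-p}\,g_S\end{aligned}$$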

A local Newton approximation to the new optimum, with $H$ the Hessian of the new objective at $\theta^*$ (assumed positive definite), gives

$$\Delta \theta \approx -H^{-1}\nabla_\theta \mathcal{L}_{p'}(\theta^*)$$

$$\Delta \theta \approx \frac{\rho}{1-p}H^{-1}g_S$$

Since the true dataset proportion $p$ and the downweighting amount $\rho$ are unknown, we absorb the scalar factor $\rho/(1-p)$ into a tunable perturbation magnitude $\eta$:

$$\Delta \theta = \eta H^{-1}g_S$$

This is the basic leave-dataset-out perturbation: move in the curvature-preconditioned gradient direction of the dataset component $S$.

If $\mathcal{L}_S$ is a loss for preserving the model’s current behavior on $S$, then $+\Delta\theta$ is the local leave-out or forgetting direction, while $-\Delta\theta$ is the local retention or reinforcement direction.
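The leave-dataset-out step can be sanity-checked on a toy problem. The sketch below is not the experiment from this post: it assumes two made-up quadratic losses (the names `H_R`, `H_S`, `c_R`, `c_S` and the values of `p` and `rho` are all illustrative), for which stationarity and the Newton step are exact, so the perturbed parameters should coincide with the optimum of the downweighted mixture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # toy parameter dimension (illustrative)


def random_spd(dim):
    """Random symmetric positive-definite matrix, used as a toy Hessian."""
    m = rng.normal(size=(dim, dim))
    return m @ m.T + dim * np.eye(dim)


# Toy quadratic losses: L_X(theta) = 0.5 * (theta - c_X)^T H_X (theta - c_X)
H_R, H_S = random_spd(d), random_spd(d)
c_R, c_S = rng.normal(size=d), rng.normal(size=d)


def loss_S(theta):
    diff = theta - c_S
    return 0.5 * diff @ H_S @ diff


def mixture_optimum(p):
    """Exact minimizer of (1 - p) * L_R + p * L_S for the quadratic toy."""
    H = (1 - p) * H_R + p * H_S
    return np.linalg.solve(H, (1 - p) * H_R @ c_R + p * H_S @ c_S)


p, rho = 0.2, 0.05  # assumed mixture weight and downweighting amount
theta_star = mixture_optimum(p)

# Gradient of L_S at the original optimum.
g_S = H_S @ (theta_star - c_S)

# Hessian of the downweighted objective, then the leave-out step
# delta_theta = rho / (1 - p) * H^{-1} g_S.
p_new = p - rho
H_new = (1 - p_new) * H_R + p_new * H_S
delta_theta = rho / (1 - p) * np.linalg.solve(H_new, g_S)

theta_retrained = mixture_optimum(p_new)
step_matches = np.allclose(theta_star + delta_theta, theta_retrained)
loss_increased = loss_S(theta_retrained) > loss_S(theta_star)
print(step_matches, loss_increased)
```

In this toy setting the step lands on the retrained optimum and the loss on $S$ goes up, which is the sense in which $+\Delta\theta$ is the forgetting direction. For a real network both the stationarity assumption and the quadratic model hold only approximately.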

KFE / EKFAC coordinate form

For a linear layer with weight matrix $W$, KFAC approximates the curvature block as

$$H_W \approx A \otimes G$$

where $A$ is the activation covariance and $G$ is the preactivation-gradient covariance.

Let

$$A = Q_A \Lambda_A Q_A^\top$$

and

$$G = Q_G \Lambda_G Q_G^\top$$

The KFE coordinates of the weight matrix are

$$\tilde W = Q_G^\top W Q_A$$

and the projected dataset gradient is

$$\tilde g_S = Q_G^\top \nabla_W \mathcal{L}_S Q_A$$

In plain KFAC, the curvature associated with KFE coordinate $(i,j)$ is

$$\kappa_{ij} = \lambda^G_i \lambda^A_j$$

In EKFAC, $\kappa_{ij}$ is replaced by the empirical diagonal curvature estimate in the KFE basis.

The idealized inverse-curvature perturbation would be

$$\Delta \tilde W_{ij} = \eta \frac{\tilde g_{S,ij}}{\kappa_{ij}}$$

However, because $\kappa_{ij}$ can vary over many orders of magnitude and may be noisy for small values, we use a clipped log-space preconditioner.
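Before any clipping, the idealized KFE-coordinate update can be sketched in NumPy. Everything here is a stand-in: `acts` and `pre_grads` are random matrices playing the role of per-example layer inputs and preactivation gradients, `g_S` is a random dataset gradient, and `eta = 1.0` is arbitrary. The EKFAC line assumes the common estimator, namely the second moment of per-example gradients projected into the KFE basis. The final check confirms that dividing by $\lambda^G_i \lambda^A_j$ in KFE coordinates matches applying $G^{-1}$ and $A^{-1}$ directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 512, 6, 4  # toy sizes (illustrative)

# Stand-ins for per-example layer inputs and preactivation gradients.
acts = rng.normal(size=(n, d_in))
pre_grads = rng.normal(size=(n, d_out))

# KFAC factors and their eigendecompositions.
A = acts.T @ acts / n            # activation covariance, (d_in, d_in)
G = pre_grads.T @ pre_grads / n  # preactivation-gradient covariance, (d_out, d_out)
lam_A, Q_A = np.linalg.eigh(A)
lam_G, Q_G = np.linalg.eigh(G)

# Hypothetical dataset gradient for a weight matrix of shape (d_out, d_in).
g_S = rng.normal(size=(d_out, d_in))
g_tilde = Q_G.T @ g_S @ Q_A  # projection into KFE coordinates

# Plain KFAC curvature: kappa_ij = lam_G[i] * lam_A[j].
kappa_kfac = np.outer(lam_G, lam_A)

# EKFAC-style diagonal (assumed estimator): mean squared per-example
# gradient in the KFE basis. The per-example gradient is
# outer(pre_grads[k], acts[k]), so its projection factorizes.
proj_g = pre_grads @ Q_G  # row k holds Q_G^T pre_grads[k]
proj_a = acts @ Q_A       # row k holds Q_A^T acts[k]
kappa_ekfac = (proj_g ** 2).T @ (proj_a ** 2) / n

# Idealized inverse-curvature perturbation, mapped back to weight space.
eta = 1.0
delta_tilde = eta * g_tilde / kappa_kfac
delta_W = Q_G @ delta_tilde @ Q_A.T

# Reference: apply the inverse Kronecker factors directly.
reference = eta * np.linalg.solve(G, g_S) @ np.linalg.inv(A)
kfe_matches_direct = np.allclose(delta_W, reference)
print(kfe_matches_direct)
```

Swapping `kappa_kfac` for `kappa_ekfac` in the division gives the EKFAC variant; the two grids have the same shape but generally different entries.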

Median-normalized log-space preconditioner

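For each layer $\ell$, let $i$ now index the flattened KFE coordinates of that layer (the pair $(i,j)$ from the previous section collapsed into one index), and define the median-normalized curvature

$$z_{\ell i}=\frac{\kappa_{\ell i}}{\operatorname{median}_j(\kappa_{\ell j})}$$

In log space, up to a numerical floor,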

$$\log z_{\ell i}=\log(\kappa_{\ell i}+\epsilon)-\operatorname{median}_j\left[\log(\kappa_{\ell j}+\epsilon)\right]$$

where $\epsilon > 0$ is a small numerical floor.

We then clip the log-normalized curvature values by quantile:

$$\bar{\ell}_{\ell i}=\operatorname{clip}\left(\log z_{\ell i},q_{\min},q_{\max}\right)$$

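where $q_{\min}$ and $q_{\max}$ are, for example, the 1st and 99th percentiles of $\log z_{\ell i}$ within the layer. Note that $\bar{\ell}_{\ell i}$ denotes the clipped log-curvature value, while the subscript $\ell$ is still the layer index.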

The temperature-controlled preconditioned direction is then

$$u_{\ell i}(\alpha)=\tilde g_{S,\ell i}\exp\left(-\alpha \bar{\ell}_{\ell i}\right)$$

Equivalently,

$$u_{\ell i}(\alpha)=\frac{\tilde g_{S,\ell i}}{\bar z_{\ell i}^{\alpha}}$$

where

$$\bar z_{\ell i}=\exp(\bar{\ell}_{\ell i})$$

The parameter $\alpha$ controls the degree of preconditioning:

$$\alpha=0\Rightarrow u_{\ell i}=\tilde g_{S,\ell i}$$

which is the raw KFE-projected gradient direction, while

$$\alpha=1\Rightarrow u_{\ell i}=\frac{\tilde g_{S,\ell i}}{\bar z_{\ell i}}$$

which is the clipped inverse-EKFAC-preconditioned direction. Intermediate values such as $\alpha=0.5$ give partial or square-root preconditioning.
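The preconditioner is compact enough to sketch for a single layer. This is a minimal NumPy version under assumptions: the function name `log_space_direction`, the default floor `eps`, and the default percentiles are illustrative choices, and the curvature and gradient arrays are random stand-ins rather than real EKFAC statistics.

```python
import numpy as np


def log_space_direction(g_tilde, kappa, alpha, eps=1e-12, q_lo=1.0, q_hi=99.0):
    """Median-normalized, quantile-clipped log-space preconditioning.

    g_tilde and kappa are same-shaped arrays for one layer, in KFE coordinates.
    """
    log_kappa = np.log(kappa + eps)
    # Median normalization, done directly in log space.
    log_z = log_kappa - np.median(log_kappa)
    # Clip to the [q_lo, q_hi] percentile range of this layer.
    lo, hi = np.percentile(log_z, [q_lo, q_hi])
    log_z_clipped = np.clip(log_z, lo, hi)
    # alpha = 0 returns g_tilde unchanged; alpha = 1 divides by the clipped z.
    return g_tilde * np.exp(-alpha * log_z_clipped)


rng = np.random.default_rng(0)
# Stand-in curvature spanning many orders of magnitude.
kappa = np.exp(rng.normal(scale=4.0, size=(4, 6)))
g_tilde = rng.normal(size=(4, 6))

u_raw = log_space_direction(g_tilde, kappa, alpha=0.0)
u_half = log_space_direction(g_tilde, kappa, alpha=0.5)
u_full = log_space_direction(g_tilde, kappa, alpha=1.0)
```

Working in log space keeps the whole operation to a subtraction, a clip, and one exponential, and it makes the role of $\alpha$ explicit as a multiplier on the log-curvature.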

Euclidean-normalized perturbation

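After constructing the preconditioned direction $u(\alpha)$, we normalize it and scale it to a chosen Euclidean perturbation budget $\rho$. From here on, $\rho$ denotes this budget rather than the downweighting amount used in the derivation above: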

$$\Delta \tilde\theta=\rho\frac{u(\alpha)}{\|u(\alpha)\|_2+\epsilon}$$

For a single layer, this is

$$\Delta \tilde W_{\ell i}=\rho_\ell\frac{u_{\ell i}(\alpha)}{\|u_\ell(\alpha)\|_2+\epsilon}$$

or, with global normalization across all edited layers,

$$\Delta \tilde W_{\ell i}=\rho\frac{u_{\ell i}(\alpha)}{\sqrt{\sum_{\ell'}\sum_j u_{\ell' j}(\alpha)^2}+\epsilon}$$

Finally, transform the perturbation back to weight space:

$$\Delta W_\ell=Q_{G,\ell}\Delta \tilde W_\ell Q_{A,\ell}^\top$$

and apply

$$W_\ell'=W_\ell+\Delta W_\ell$$
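The normalization and back-transformation can be sketched as follows. The helper names `perturb_layer` and `scale_globally` are hypothetical, and the orthogonal bases here are random stand-ins (from a QR decomposition) for the eigenvector matrices of $G$ and $A$. Because those bases are orthogonal, the Frobenius norm of the weight-space update equals the budget $\rho$ up to the small $\epsilon$.

```python
import numpy as np


def perturb_layer(W, u, Q_G, Q_A, rho, eps=1e-12):
    """Scale u to Euclidean norm rho in KFE coordinates, then map to weight space."""
    delta_tilde = rho * u / (np.linalg.norm(u) + eps)  # Frobenius norm for 2-D input
    delta_W = Q_G @ delta_tilde @ Q_A.T
    return W + delta_W, delta_W


def scale_globally(us, rho, eps=1e-12):
    """Global variant: one shared norm across all edited layers."""
    total = np.sqrt(sum(np.sum(u ** 2) for u in us))
    return [rho * u / (total + eps) for u in us]


rng = np.random.default_rng(0)
d_out, d_in = 4, 6  # toy layer shape (illustrative)

# Random orthogonal matrices standing in for the KFE bases.
Q_G, _ = np.linalg.qr(rng.normal(size=(d_out, d_out)))
Q_A, _ = np.linalg.qr(rng.normal(size=(d_in, d_in)))

W = rng.normal(size=(d_out, d_in))
u = rng.normal(size=(d_out, d_in))  # stand-in preconditioned direction

W_new, delta_W = perturb_layer(W, u, Q_G, Q_A, rho=0.1)
update_norm = np.linalg.norm(delta_W)

# Two layers sharing a single budget.
scaled = scale_globally([u, 2.0 * u], rho=0.1)
global_norm = np.sqrt(sum(np.sum(s ** 2) for s in scaled))
```

With per-layer normalization every edited layer moves by its own $\rho_\ell$; with global normalization, layers with larger preconditioned gradients take a larger share of the single budget.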

The two main sweep variables are therefore $\alpha$, which controls the degree of EKFAC/KFE preconditioning, and $\rho$, which controls the total Euclidean perturbation magnitude.

In experiments, $\alpha$ can be swept over values such as

$$\alpha\in\{0,0.25,0.5,0.75,1.0,1.25\}$$

and $\rho$ can be swept logarithmically to trace the tradeoff between behavioral change on $S$ and general capability degradation.
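A sweep grid along these lines could be built as below. The $\alpha$ values are the ones listed above; the range and number of $\rho$ values are purely hypothetical placeholders, since the post does not specify them.

```python
import itertools

import numpy as np

# Alpha values from the sweep described above.
alphas = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]

# Log-spaced perturbation budgets. The endpoints and count are assumptions.
rhos = np.geomspace(1e-3, 1e1, num=9)

# Every (alpha, rho) pair to evaluate.
grid = list(itertools.product(alphas, rhos))
print(len(grid))
```

Each pair would then be evaluated twice: once for behavioral change on $S$ and once on a general capability benchmark, so the tradeoff curve can be traced per $\alpha$.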
