I expected refusal behavior to be more brittle.
The motivating experiment was simple: take the weight difference between allenai/Olmo-3-7B-Instruct and a refusal-abliterated checkpoint, project that difference into the base model’s EKFAC/KFE coordinates, rank the coordinates by a Heretic-style salience score, and then delete the highest-ranked coordinates from the base model. If those directions are really where refusal removal lives, zeroing enough of them should push the base model toward the abliterated model’s behavior.
It did not.
Across the mlp.down_proj modules of layers 19-30 (12 modules), I zeroed up to 35M KFE coordinates per module, or roughly 420M selected coordinates in total. MMLU accuracy fell, sharply at the largest deletion size, and the model eventually became visibly less coherent, but it kept refusing harmful requests at essentially the same rate.

| Intervention | Refusal rate on harmful requests | MMLU accuracy |
|---|---|---|
| Base-ish zero-out, 100 coords/module | 99% | 82.6% |
| Zero-out, 1M coords/module | 99% | 83.4% |
| Zero-out, 20M coords/module | 99% | 81.3% |
| Zero-out, 35M coords/module | 99% | 64.5% |
| Heretic checkpoint, same eval harness | 67% | 73.5% |
| Josiefied abliterated checkpoint, same eval harness | 58% | 81.5% |

This is the part I find most interesting: the model is not merely robust to small random-looking perturbations. It is robust to deleting a very large set of directions chosen specifically from a refusal-abliterated model delta. Even when the deletion causes real capability damage, refusal remains.
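For concreteness, here is a minimal sketch of the project, rank, and zero-out step for a single weight matrix. It is not the code behind the table. The function name, the argument layout, and especially the score (squared delta coordinate divided by a damped per-coordinate eigenvalue) are my assumptions for illustration; the real Heretic-style salience score may differ, and the random matrices stand in for actual EKFAC factors.

```python
import numpy as np


def zero_top_kfe_coords(w_base, delta, q_out, q_in, lam, k, damping=1e-8):
    """Zero the k highest-scoring KFE coordinates of one weight matrix.

    w_base, delta : (d_out, d_in) base weights and abliterated-minus-base delta
    q_out, q_in   : orthonormal eigenvector matrices of the two Kronecker factors
    lam           : (d_out, d_in) non-negative per-coordinate eigenvalue estimates
    """
    if k <= 0:
        return w_base.copy()
    # Rotate the weights and the delta into the eigenbasis.
    w_kfe = q_out.T @ w_base @ q_in
    d_kfe = q_out.T @ delta @ q_in
    # ASSUMED score (illustrative only): squared delta over damped eigenvalue.
    score = d_kfe**2 / (lam + damping)
    # Flat indices of the k largest scores.
    k = min(k, score.size)
    top = np.argpartition(score.ravel(), -k)[-k:]
    keep = np.ones(score.size, dtype=bool)
    keep[top] = False
    # Delete the selected coordinates; leave the rest untouched.
    w_kfe = np.where(keep.reshape(score.shape), w_kfe, 0.0)
    # Rotate back to the original weight basis.
    return q_out @ w_kfe @ q_in.T


# Toy usage with random stand-ins for the real factors.
rng = np.random.default_rng(0)
d_out, d_in = 6, 8
q_out, _ = np.linalg.qr(rng.normal(size=(d_out, d_out)))
q_in, _ = np.linalg.qr(rng.normal(size=(d_in, d_in)))
lam = rng.uniform(0.1, 1.0, size=(d_out, d_in))
w_base = rng.normal(size=(d_out, d_in))
delta = 0.1 * rng.normal(size=(d_out, d_in))
w_edit = zero_top_kfe_coords(w_base, delta, q_out, q_in, lam, k=10)
```

Note that the delta only decides which coordinates get deleted. Its sign and magnitude never enter the edited weights, which matters for the rank-1 observation further down.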
One reason this feels like a meaningful stress test is that the Fisher geometry is the geometry used by influence-function-style arguments. In that view, we are not zeroing arbitrary weights. We are selecting coordinates using an inverse-Fisher-like score derived from an actual abliteration delta. Informally, this resembles asking for a very unfavorable hypothetical fine-tuning dataset: one whose influence concentrates exactly on the parameter directions most associated with a known refusal-removal edit, then deleting those directions rather than merely nudging them. If refusal survives that, it is some evidence against the worry that ordinary fine-tuning would accidentally find the same pathway.
That changes how I think about post-training risk. If a targeted deletion of hundreds of millions of Heretic-ranked coordinates does not remove refusal, then ordinary capability fine-tuning should be even less likely to stumble into broad refusal removal. The caveat is important: this does not mean refusal is impossible to remove. Heretic and Josiefied show that adversarial, refusal-specific interventions can reduce refusal substantially. The claim is narrower: refusal seems robust under generic or poorly matched weight-space damage, even when the damage is selected using a relevant ablation model.
There is also a major coverage limitation: this experiment only computed EKFAC/KFE for 12 MLP down_proj matrices in the deeper part of the model, layers 19-30. Refusal could be implemented elsewhere: in attention projections, in residual-stream directions not well captured by these MLP output matrices, or in earlier layers that shape the model’s interpretation of harmful instructions before the late MLPs act. So this result is not evidence that refusal is globally deletion-proof; it is evidence that deleting coordinates in one plausible deep-MLP slice is not sufficient to remove it.
The KFE analysis also suggests why simple coordinate deletion may be the wrong operation. The Heretic-minus-base delta in these matrices is almost rank-1: across the down_proj matrices of layers 19-30, the top singular direction carries about 99.5% of Frobenius energy. So the behavioral edit may depend on a coherent signed direction and magnitude, not just on the set of high-scoring individual KFE coordinates.
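The rank-1 check is easy to reproduce on any delta matrix. The sketch below assumes "Frobenius energy" means the squared Frobenius norm, so the top direction's share is the largest squared singular value over the sum of all squared singular values. The function name and the toy delta are mine, not taken from the experiment.

```python
import numpy as np


def top_singular_energy(delta, rank=1):
    """Fraction of squared Frobenius norm carried by the top `rank` singular directions."""
    # Singular values come back sorted in descending order.
    s = np.linalg.svd(delta, compute_uv=False)
    total = np.sum(s**2)
    return float(np.sum(s[:rank] ** 2) / total)


# Toy stand-in for a nearly rank-1 delta: an outer product plus small noise.
rng = np.random.default_rng(0)
u = rng.normal(size=(64, 1))
v = rng.normal(size=(1, 128))
toy_delta = u @ v + 0.01 * rng.normal(size=(64, 128))
print(f"top-1 energy fraction: {top_singular_energy(toy_delta):.4f}")
```

On the real checkpoints one would call this per module on the difference between the abliterated and base down_proj weights.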

My current read is:
- Refusal behavior is not easy to erase by deleting many selected KFE coordinates.
- Capability can degrade before refusal drops.
- Successful abliterations probably exploit structured, signed low-rank updates rather than just removing isolated directions.
- For ordinary fine-tuning, this is mildly reassuring: robustness to targeted deletion suggests robustness to incidental drift. For adversarial fine-tuning, it is not reassuring; the known abliterated checkpoints are direct counterexamples.
- The result is local to the modules analyzed. A whole-model story needs attention projections, earlier layers, and possibly residual-stream probes.

The next experiment I would run is to broaden the KFE map before pushing the zero-out harder. Compute the same Heretic/Josiefied KFE ranking for attention projections and earlier blocks, especially modules that write directly into the residual stream, then repeat the matched refusal-vs-MMLU sweep. After that, apply scaled versions of the actual low-rank Heretic/Josiefied delta and compare them with coordinate deletion at matched MMLU damage. If refusal moves under the signed low-rank delta or under non-MLP module deletion, but not under deep-MLP coordinate deletion, that would pin the effect on update geometry or module location rather than raw overlap with the Heretic directions.
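A rough sketch of the scaled low-rank step, assuming the delta is truncated with an SVD and scaled by a scalar `alpha`. Both the function name and the choice of a plain truncated SVD are hypothetical; the actual sweep would also need the refusal and MMLU evals at each `alpha`, which are not shown here.

```python
import numpy as np


def apply_scaled_low_rank_delta(w_base, delta, alpha, rank=1):
    """Return w_base + alpha * (best rank-`rank` approximation of delta)."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Keep only the top `rank` singular directions, with their signs and magnitudes.
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    return w_base + alpha * low_rank


# Toy sweep over alpha on an exactly rank-1 stand-in delta.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(6, 8))
delta = rng.normal(size=(6, 1)) @ rng.normal(size=(1, 8))
sweep = {
    alpha: apply_scaled_low_rank_delta(w_base, delta, alpha)
    for alpha in (0.0, 0.25, 0.5, 1.0)
}
```

Unlike coordinate deletion, this edit keeps the sign and scale of the delta, so comparing the two at matched MMLU damage isolates whether the signed direction is what moves refusal.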