LLMs vs Traditional Software


Since around the time ChatGPT came out, I’ve heard many of my software engineer colleagues say variations of the following:

“LLMs are not robust like traditional software because they are non-deterministic.”

However, I think this misdiagnoses the real issue. To understand why, we need to review some maths.

In mathematical optimization, problems are posed as the minimization of an objective subject to constraints, of which there are two types:

  • Hard constraints: these define the boundaries of the feasible set, and every solution must satisfy them.
  • Soft constraints: these add penalties (or preferences) that guide solutions toward desired regions, but violations are permitted at a cost (both forms are sketched below).
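
To make the distinction concrete, here is one standard way to write the two forms, with a quadratic penalty as the soft version (the objective $f$, constraint $g$, and weight $\lambda$ are purely illustrative):

```latex
% Hard-constrained problem: only points with g(x) <= 0 are admissible.
\min_{x} \; f(x) \quad \text{subject to} \quad g(x) \le 0

% Soft-constrained relaxation: any x is admissible, but violating
% g(x) <= 0 is penalized with weight \lambda > 0.
\min_{x} \; f(x) + \lambda \, \max\bigl(0, g(x)\bigr)^{2}
```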

Hard constraints are tricky because they create discontinuities in the search space, with no gradient to guide an optimizer through them. On the other hand, they are super useful. Consider how reliable software handles edge cases, validates inputs, and avoids outputting invalid values. Type systems in programming languages can guarantee that certain classes of errors cannot occur, while formal verification methods can prove specific properties about critical software components[1].
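
As a toy illustration (a hypothetical function, not from any particular codebase), input validation enforces a hard constraint at the program level: a value that violates it simply cannot pass through.

```python
# Hypothetical example: a hard constraint enforced in code. Invalid inputs
# are rejected outright, so no value that leaves this function can violate it.
def parse_age(raw: str) -> int:
    age = int(raw)                     # raises ValueError on non-numeric input
    if not 0 <= age <= 150:
        raise ValueError(f"age out of range: {age}")
    return age                         # guaranteed to satisfy 0 <= age <= 150

parse_age("42")    # -> 42
# parse_age("-7")  # -> ValueError: age out of range: -7
```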

LLMs are fundamentally different. Even an LLM that perfectly minimized its training objective would still be inconsistent on basic facts, because the training data itself contains contradictions and the objective encourages matching this inconsistent distribution. Consider memes that are intentionally incorrect for comedic effect, e.g. with text that says 1+1=3. Even though the vast majority of a diverse dataset indicates that 1+1=2, an LLM given the prefix 1+1= will still put non-zero mass on 3 as the next token. This has nothing to do with determinism; it follows from the fact that LLMs are software designed using soft constraints.
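
A small sketch with made-up logits shows the point: even when the logits strongly favour 2, the softmax never assigns exactly zero probability to 3.

```python
import numpy as np

# Toy illustration with made-up logits for the token following "1+1=".
# A model that fits its training distribution perfectly still reproduces the
# small fraction of "1+1=3" jokes it saw, so the mass on "3" is never zero.
vocab = ["2", "3", "4"]
logits = np.array([6.0, 1.0, -2.0])

probs = np.exp(logits - logits.max())
probs /= probs.sum()

for token, p in zip(vocab, probs):
    print(f"P(next token = {token!r}) = {p:.6f}")
# "2" dominates, but the probability of "3" is small and strictly positive.
```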

In classical mechanics, determinism means that a system started from identical initial conditions will always evolve in exactly the same way[2]. LLMs are deterministic in their forward pass: given the same input tokens, they will always produce the same logit distributions. When LLMs appear non-deterministic in practice, it is usually due to implementation details like floating-point precision and the varying reduction orders that come with batching and distributed hardware.
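
The floating-point point is easy to make concrete: addition of floats is not associative, so changing the order of a reduction (as batching and parallelism do) shifts the result by a few bits, which can be enough to flip a near-tie between two candidate tokens during sampling.

```python
# Floating-point addition is not associative: the same numbers combined in a
# different order give slightly different results. Batching and parallel
# reductions change that order, which is where most of the apparent
# non-determinism of deployed LLMs comes from.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6

print(left == right)  # False -- a few bits of drift, but enough to flip
                      # a near-tie between two candidate tokens.
```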

So what’s the correct statement then? I suggest:

“LLMs are not robust like traditional software because they are optimized using soft constraints.”

Is this a fundamental limit to how LLMs are trained? In the optimization literature it is quite common to relax hard constraints into soft ones, then slowly harden them according to a schedule (think annealing). In LLM training this would be equivalent to decreasing the softmax temperature before computing the cross-entropy loss, effectively hardening the softmax into an argmax. In theory, this could force the LLM to learn hard requirements, like only completing 1+1= with 2. This only works if one of the token logits is fixed to a constant, since otherwise the LLM can learn to scale its logits in a way that nullifies the effect of the temperature[3]. But more damagingly, we’d have to figure out what we actually agree on as fact. And that’s a much harder problem.
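
A rough sketch of the hardening idea, again with made-up logits: lowering the temperature drives the softmax toward a one-hot argmax, and the caveat about rescaled logits is noted in the comments.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Made-up logits for the token after "1+1=", candidates ["2", "3", "4"].
logits = [4.0, 1.5, 0.0]

# Lowering the temperature sharpens the distribution toward an argmax --
# the "hardening" schedule described above.
for T in [1.0, 0.5, 0.1, 0.01]:
    print(f"T={T:<5} ->", np.round(softmax(logits, T), 6))

# Caveat from the text: if the model can freely rescale its logits
# (e.g. multiply them all by 1/T), the temperature's effect is cancelled,
# which is why one logit would need to be pinned to a constant.
```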