
The Three ELBOs


The Evidence Lower Bound (ELBO) is ubiquitous in machine learning, appearing in three equivalent forms. The standard didactic derivation starts from the data log-likelihood, introduces a latent variable $Z$ and a distribution $q(Z)$ over it, and then applies Jensen's inequality.

$$ \begin{align*} \log p(X) &= \log \int p(X,Z) dZ \\ &= \log \left\langle\frac{p(X,Z)}{q(Z)}\right\rangle_{q(Z)} \\ &\geq \left\langle\log \frac{p(X,Z)}{q(Z)}\right\rangle_{q(Z)} \\ &= \left\langle\log p(X,Z)\right\rangle_{q(Z)} + \mathcal{H}[q(Z)] \end{align*} $$
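It can help to see the inequality numerically. The sketch below (Python with NumPy) checks the first form on a toy model with a single observation and a binary latent; the joint table `p_xz` and the distribution `q` are arbitrary values chosen for illustration.

```python
# A minimal numeric check of the first ELBO form, assuming a toy model with a
# binary latent Z and one observed x. The joint table p(x, Z) is made up.
import numpy as np

p_xz = np.array([0.10, 0.30])   # p(x, Z=0), p(x, Z=1)
log_px = np.log(p_xz.sum())     # evidence log p(x)

q = np.array([0.6, 0.4])        # any distribution q(Z)

# First form: <log p(x, Z)>_q + H[q(Z)]
elbo = np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q))

print(f"log p(x) = {log_px:.4f}, ELBO = {elbo:.4f}")
assert elbo <= log_px + 1e-12   # the bound holds for any q
```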

This first form is useful for deriving the update steps of the EM algorithm analytically: the parameters optimized in the M-step appear only in the first term, while the entropy of $q(Z)$ is constant with respect to them.
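As a concrete illustration, here is a minimal sketch of EM for a two-component 1-D Gaussian mixture, assuming unit variances and fixed, equal mixing weights so that only the means are learned. The M-step maximizes $\left\langle\log p(X,Z)\right\rangle_{q(Z)}$ alone, since the entropy term does not depend on the means.

```python
# A sketch of EM for a two-component 1-D Gaussian mixture, assuming unit
# variances and fixed mixing weights for brevity; only the means are fit.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
mu = np.array([-1.0, 1.0])       # initial means
pi = np.array([0.5, 0.5])        # fixed mixing weights (assumption)

for _ in range(20):
    # E-step: responsibilities q(Z_i = k) = p(Z_i = k | x_i)
    log_p = -0.5 * (x[:, None] - mu[None, :]) ** 2 + np.log(pi)
    q = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)

    # M-step: maximize <log p(X, Z)>_q over mu; the entropy term plays no
    # role, so the update is just the responsibility-weighted mean
    mu = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)

print(mu)   # approaches the true means (-2, 3)
```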

The other two forms come from expanding $p(X,Z)$ with the product rule. Writing $\mathcal{F}$ for the bound:

$$ \begin{align*} \mathcal{F} &= \left\langle\log p(X,Z)\right\rangle_{q(Z)} + \mathcal{H}[q(Z)] \\ &= \left\langle\log \frac{p(X|Z)p(Z)}{q(Z)}\right\rangle_{q(Z)} \\ &= \left\langle\log p(X|Z)\right\rangle_{q(Z)} - \mathcal{KL}[q(Z)\parallel p(Z)] \end{align*} $$

This form of the ELBO is used as the training objective for VAEs, where $q(Z_i)$ approximates the posterior $p(Z_i|X_i)$ with an independent latent for each datapoint. The first term can be interpreted as a reconstruction term, because it rewards accurate predictions of $X$ given only the latents. The second term acts as a regularizer, constraining the likely values of $Z$ by keeping $q(Z)$ close to the prior (usually a standard Gaussian).
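A minimal numeric sketch of this objective for one datapoint is shown below, assuming $q(Z|x)=\mathcal{N}(\mu,\sigma^2)$, a standard normal prior, and a toy likelihood $p(x|z)=\mathcal{N}(x;z,1)$ standing in for a decoder. The reconstruction term is estimated by Monte Carlo with the reparameterization trick, and the KL term uses its closed form for two univariate Gaussians; the values of `x`, `mu`, and `sigma` are made up rather than coming from a trained model.

```python
# A numeric sketch of the VAE-style ELBO for a single datapoint, assuming
# q(Z|x) = N(mu, sigma^2), prior p(Z) = N(0, 1), and a toy likelihood
# p(x|z) = N(x; z, 1) in place of a learned decoder.
import numpy as np

rng = np.random.default_rng(1)
x = 0.8                              # observed datapoint (illustrative)
mu, sigma = 0.5, 0.7                 # assumed encoder output for this x

# Reconstruction term: Monte Carlo estimate of <log p(x|Z)>_q using the
# reparameterization z = mu + sigma * eps with eps ~ N(0, 1)
eps = rng.standard_normal(10_000)
z = mu + sigma * eps
log_px_given_z = -0.5 * np.log(2 * np.pi) - 0.5 * (x - z) ** 2
recon = log_px_given_z.mean()

# KL[N(mu, sigma^2) || N(0, 1)] in closed form
kl = 0.5 * (sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

elbo = recon - kl
print(f"reconstruction = {recon:.4f}, KL = {kl:.4f}, ELBO = {elbo:.4f}")
```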

$$ \begin{align*} \mathcal{F} &= \left\langle\log p(X,Z)\right\rangle_{q(Z)} + \mathcal{H}[q(Z)] \\ &= \left\langle\log \frac{p(Z|X)p(X)}{q(Z)}\right\rangle_{q(Z)} \\ &= \left\langle\log p(X)\right\rangle_{q(Z)} - \mathcal{KL}[q(Z)\parallel p(Z|X)] \\ &= \log p(X) - \mathcal{KL}[q(Z)\parallel p(Z|X)] \end{align*} $$

The third form makes explicit that the lower bound falls short of the data log-likelihood by exactly the KL divergence between $q(Z)$ and the posterior $p(Z|X)$, so the bound is tight when $q(Z)=p(Z|X)$. This motivates the E-step of the EM algorithm, which sets $q(Z)=p(Z|X)$ when the posterior can be computed analytically. The field of approximate inference studies cases where this is intractable, replacing the exact E-step with a variational one that only approximates the posterior. As long as that step decreases the KL divergence, the algorithm still makes progress by tightening the lower bound.
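The identity in the third form is easy to verify numerically. The sketch below reuses the toy binary-latent model from the first example: for any $q(Z)$, the first-form ELBO equals $\log p(X) - \mathcal{KL}[q(Z)\parallel p(Z|X)]$, and setting $q$ to the posterior closes the gap.

```python
# A numeric check of the third form on the toy binary-latent model: the gap
# between log p(x) and the ELBO is exactly KL[q(Z) || p(Z|x)], and it
# vanishes when q equals the posterior.
import numpy as np

p_xz = np.array([0.10, 0.30])       # joint p(x, Z=z) for one observed x
log_px = np.log(p_xz.sum())
posterior = p_xz / p_xz.sum()       # p(Z|x) = [0.25, 0.75]

def elbo(q):
    # First form: <log p(x, Z)>_q + H[q(Z)]
    return np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q))

def kl(q, p):
    return np.sum(q * np.log(q / p))

for q in (np.array([0.6, 0.4]), np.array([0.4, 0.6]), posterior):
    print(f"q={q}, ELBO={elbo(q):.4f}, "
          f"log p(x) - KL[q||p(Z|x)] = {log_px - kl(q, posterior):.4f}")
# The two columns agree; with q = p(Z|x) the ELBO equals log p(x).
```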