Linear Regression for Frequentists

As scientists, we love to take measurements, but some things are quite difficult to measure, so instead we take the value of a proxy and make a prediction. We call the proxy measurements covariates (also features), and the predicted value the response (also target).

Linear regression is one of the simplest models for learning relationships between variables. It has a wide array of tools built around it, both for theoretical analysis and practical use.

The Linear Model

Suppose we believe there is a linear function $\phi$ that maps the covariates to our target but is off by a small amount each time. The error of each observation is modeled as an independent sample from a Gaussian distribution centered at zero with variance $\sigma^2$. Note that this is a strong assumption and we will relax it later.

Given a sample $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ of $N$ independent observations, we write the model as: $$y_i = \phi(x_i) + \epsilon_i$$

We parametrize $\phi$ with a vector of parameters $\beta \in \R^{D}$. To include the intercept, one of the covariates has to be constant, e.g. by setting it to 1. In vector notation: $$y = X\beta + \epsilon$$ where $X$ is a $N \times D$ matrix and $\epsilon\sim\mathcal{N}(0, \sigma^2I)$. Note that since Gaussians are closed under affine transformations, our model says $y | X \sim \mathcal{N}(X\beta, \sigma^2I)$.

If the true $\beta$ were known, we could predict the expected response for new covariates $x$ by computing $\mathbb{E}[y | x] = x^\mathsf{T}\beta$.

Parameter Estimation

Frequentists believe parameters are fixed and unknown, but can be estimated through repeated sampling. A common approach is Maximum Likelihood Estimation (MLE), which seeks the value of $\beta$ that maximizes the data’s likelihood $p(y | X)$. Since we are modeling $y$ as a Gaussian, the likelihood is: $$p(y | X) = \prod_{i=1}^N \mathcal{N}(y_i | x_i^\mathsf{T} \beta, \sigma^2)$$ $$= \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - x_i^\mathsf{T} \beta)^2\right)$$

Switching to matrix notation and taking the negative log of the likelihood: $$-\log p(y | X) = \frac{N}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \underbrace{(y - X\beta)^\mathsf{T} (y - X\beta)}_{\text{RSS}}$$

The residual sum of squares ($\text{RSS}$) is a quadratic form in $\beta$, which can be minimized analytically. By setting its derivative to zero we obtain the ordinary least squares (OLS) estimator: $$\frac{\partial}{\partial \beta} \text{RSS} = -2X^\mathsf{T} (y - X\beta)$$ $$\implies \hat{\beta} = \argmin_\beta \text{RSS} = (X^\mathsf{T} X)^{-1} X^\mathsf{T} y$$

Oh look, that’s where the name of XTX Markets comes from! Now, if we substitute $\hat{\beta}$ into the linear model with the training data as inputs we obtain: $$\hat{y} = X\hat{\beta} = \underbrace{X (X^\mathsf{T} X)^{-1} X^\mathsf{T}}_{P} y$$

This reveals the projection matrix $P$, often called the hat matrix, which is idempotent $P^2 = P$ and symmetric $P^\mathsf{T} = P$, meaning it behaves like the identity matrix on the column space of $X$.
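
As a sanity check, the estimator and the hat matrix can be computed directly in numpy (a minimal sketch on synthetic data; all names and values here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
X[:, 0] = 1.0                        # constant column for the intercept
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# OLS via the normal equations, mirroring the closed form above.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The hat matrix: idempotent, symmetric, and the identity on col(X).
P = X @ np.linalg.inv(X.T @ X) @ X.T
```

In practice `np.linalg.lstsq(X, y, rcond=None)` is preferred over forming $X^\mathsf{T}X$ explicitly, since it solves the problem via the SVD and tolerates poor conditioning.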

We will assume $X$ is full-rank going forward, i.e. $\text{rank}(X)=\min(N,D)$. Recall $X$ contains $N$ samples of $D$ covariates. We encounter three cases:

Case 1: $D < N$

$X^\mathsf{T} X$ is invertible so we can compute the OLS estimator as shown above. Intuitively, $\hat{y} = X\hat{\beta}$ is the projection of $y$ onto the column space of $X$.

Case 2: $D > N$

$X^\mathsf{T}X$ is not invertible. With more features than samples, the columns of $X$ form an overcomplete spanning set and $\beta$ is not identifiable. If we additionally ask that the optimal $\beta$ has minimal $\ell_2$ norm, we can use the method of Lagrange multipliers with constraint $X\beta = y$ to find a unique solution $\hat{\beta} = X^\mathsf{T} (XX^\mathsf{T})^{-1} y$.
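
The minimum-norm solution is easy to check numerically (a numpy sketch with arbitrary synthetic data); it coincides with the Moore-Penrose pseudoinverse and interpolates the training targets exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 10, 25                         # more features than samples
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Minimum-l2-norm solution to X beta = y.
beta_hat = X.T @ np.linalg.solve(X @ X.T, y)
```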

Case 3: $D = N$

$X$ is square and invertible, so its columns form a basis. We can write $(X^\mathsf{T} X)^{-1} = X^{-1}(X^\mathsf{T})^{-1}$ and it follows that $P=I$, so $\hat{y} = y$: the model interpolates the training data exactly.

Estimator Quality

Going forward, let’s assume we are operating in the case where $D\leq N$. We apply the distributional assumptions on the errors $\epsilon$ to derive the sampling distribution of $\hat{\beta}$.

$$\hat{\beta} = (X^\mathsf{T}X)^{-1} X^\mathsf{T}\underbrace{(X\beta + \epsilon)}_{y| X} = \beta + (X^\mathsf{T}X)^{-1} X^\mathsf{T}\epsilon$$ $$\implies \hat{\beta} \sim \mathcal{N}\left(\beta, (X^\mathsf{T}X)^{-1}\sigma^2\right)$$

Frequentists are interested in how estimators behave in the limit of infinite samples. An estimator’s bias measures how much its expected value deviates from the true value, while its variance measures how spread out its samples are. An estimator with lower variance requires fewer samples to pin down the unknown parameter to a given precision.

The estimator $\hat{\beta}$ is unbiased because its bias is $E\left[\hat{\beta}\right]-\beta = 0$. Its variance is more interesting. If we eigen-decompose $X^\mathsf{T}X = U\Lambda U^\mathsf{T}$ with eigenvalues $\lambda_i$ on the diagonal of $\Lambda$, then $V\left[\hat{\beta}\right] = U\Lambda^{-1}\sigma^2 U^\mathsf{T}$. If $\lambda_{\text{min}}$ is small, the variance in that direction will be large. In other words, columns of $X$ that are almost collinear will reduce the effective rank of $X^\mathsf{T}X$ and spike the estimator’s variance.
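
This effect is easy to demonstrate: two nearly collinear columns produce a tiny eigenvalue in $X^\mathsf{T}X$ and a correspondingly large coefficient variance (a numpy sketch; the collinearity level is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
x1 = rng.normal(size=N)
x2 = x1 + 0.01 * rng.normal(size=N)    # nearly collinear with x1
X = np.column_stack([x1, x2])

# Covariance of beta_hat up to the factor sigma^2: small eigenvalues
# of X^T X translate into large variances along those directions.
eigvals = np.linalg.eigvalsh(X.T @ X)  # ascending order
cov = np.linalg.inv(X.T @ X)
```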

Covariate Selection

An important consideration in modelling is which features to use.

Suppose we have fitted a model and want to reduce the model loss by adding a new covariate $s$. Let the current residual be $r=y-X\hat{\beta}$. Define $\tilde s := (I-P)s$, the part of $s$ orthogonal to $\mathrm{col}(X)$. Then refitting with $s$ yields:

$$ \text{RSS}_{\text{new}}=\text{RSS}_{\text{old}}-\frac{(\tilde{s}^\mathsf{T} r)^2}{\tilde{s}^\mathsf{T}\tilde{s}}\le \text{RSS}_{\text{old}} $$

Thus any covariate whose novel component correlates with the residual ($\tilde{s}^\mathsf{T} r \neq 0$) strictly decreases training $\text{RSS}$. However, adding many features increases the risk of spurious correlation (fitting noise). Under the Gaussian error model, $V\left[x^\mathsf{T} \hat{\beta}\mid X\right] = x^\mathsf{T} (X^\mathsf{T} X)^{-1}x \cdot\sigma^2$, so weakly identified (low-eigenvalue) directions of $X^\mathsf{T} X$ can amplify prediction variance on test inputs that project onto them.
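
The RSS update can be verified numerically (a numpy sketch on arbitrary synthetic data):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50
X = rng.normal(size=(N, 3))
y = rng.normal(size=N)
s = rng.normal(size=N)                 # candidate covariate

P = X @ np.linalg.inv(X.T @ X) @ X.T
r = y - P @ y                          # current residual
s_tilde = s - P @ s                    # component of s orthogonal to col(X)

rss_old = r @ r
rss_pred = rss_old - (s_tilde @ r) ** 2 / (s_tilde @ s_tilde)

# Refit with s appended and compare against the closed-form update.
X_new = np.column_stack([X, s])
beta_new = np.linalg.lstsq(X_new, y, rcond=None)[0]
rss_new = np.sum((y - X_new @ beta_new) ** 2)
```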

Given a dataset with $D$ features, there are $2^D$ possible subsets to choose from, so we have to be smarter than brute force.

Statistical Tests

One practical approach to selection uses a statistical test to determine if each feature is significant in the context of the others. Previously, we derived the sampling distribution for the random variable $\hat{\beta}$, but the value computed from observations is just a sample $\beta_\text{obs}$. From here, we can test whether the coefficient for a particular covariate is zero, i.e. the null hypothesis $H_0: \beta_i = 0$. This is done using a p-value: the probability of observing a value as or more extreme than the one we observed, given $H_0$ is true.

In a two-sided test we want to know if the coefficient is significantly different from zero, without regard to direction. The p-value is computed as $2\min(F_0(\beta_\text{obs}), 1-F_0(\beta_\text{obs}))$ where $F_0$ is the CDF of the sampling distribution with $\beta_i = 0$. We reject the null hypothesis if the p-value is less than a chosen significance level $\alpha$ e.g. $0.05$. If we fail to reject the null hypothesis, we can consider removing the covariate from the model.

While the hypothesis sets a value for $\beta$, the sampling distribution also involves $\sigma^2$, which affects how we proceed.

Case 1: $\sigma$ is known

With the sampling distribution for $\hat{\beta}$ in hand, we can compute the z-score for each coefficient. $$z_i = \frac{\hat{\beta}_i - \beta_i}{\sigma \sqrt{v_i}}$$

where $v_i$ is the $i$-th diagonal element of $(X^\mathsf{T}X)^{-1}$.

The p-value is $2\cdot \Phi(-|z_i|)$ where $\Phi$ is the CDF of the standard Normal distribution $\mathcal{N}(0, 1)$.
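
Putting the known-$\sigma$ case together (a numpy sketch on simulated data; the error function gives the standard Normal CDF without extra dependencies):

```python
import numpy as np
from math import erf, sqrt

def std_normal_cdf(z):
    # Phi(z) expressed via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rng = np.random.default_rng(4)
N, D, sigma = 200, 3, 0.5
X = rng.normal(size=(N, D))
beta_true = np.array([1.0, 0.0, -0.5])    # the second coefficient is truly zero
y = X @ beta_true + sigma * rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
v = np.diag(XtX_inv)
z = beta_hat / (sigma * np.sqrt(v))       # z-scores under H0: beta_i = 0
p = 2.0 * np.array([std_normal_cdf(-abs(zi)) for zi in z])
```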

Case 2: $\sigma$ is unknown

One might observe that given $N$ samples, we effectively have $N$ independent error terms, so the sample variance should be $\frac{\text{RSS}}{N}$, but this is a biased estimate. Intuitively, $P$ removes uncertainty along $D$ dimensions, leaving $N-D$ degrees of freedom for the error terms. Note that the trace of the projection matrix $P$ equals the sum of its eigenvalues, i.e. $\text{rank}(X) = D$.

$$\mathbb{E}\left[\text{RSS}\right] = \mathbb{E}\left[ (y - Py)^\mathsf{T} (y-Py) \right] = \mathbb{E}\left[y^\mathsf{T}(I - P)y\right]$$ $$ \implies \mathbb{E}\left[\text{RSS}\right] = \mathbb{E}\left[\epsilon^\mathsf{T}(I-P)\epsilon \right] = (N-D)\sigma^2$$

Hence, an unbiased estimate of the sample variance is:

$$\hat{\sigma}^2 = \frac{\text{RSS}}{N-D}$$
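
A quick simulation supports the claim that $\frac{\text{RSS}}{N-D}$ is unbiased (a numpy sketch; the dimensions and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, sigma2 = 30, 5, 2.0
X = rng.normal(size=(N, D))
P = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(N)

# Average RSS / (N - D) over many independent noise draws.
estimates = []
for _ in range(2000):
    y = X @ rng.normal(size=D) + np.sqrt(sigma2) * rng.normal(size=N)
    rss = y @ (I - P) @ y
    estimates.append(rss / (N - D))
mean_est = np.mean(estimates)
```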

For an intercept-only model (a single constant covariate, so $D=1$), the scaling factor $\tfrac{1}{N-1}$ is known as Bessel’s correction. To gain some intuition for why it is needed, consider sampling twice from $\mathcal{N}(0,1)$. The sample mean $\bar{x}=\frac{x_1+x_2}{2}$ will be non-zero and equidistant from both points.

$$\text{RSS}(\bar{x})=(x_1-\bar{x})^2+(x_2-\bar{x})^2=\frac{(x_2-x_1)^2}{2}$$ $$\le x_1^2+x_2^2=\text{RSS}(0)$$

so computing residuals using the sample mean gives an under-estimate of the true variance. With the correction, $\tfrac{1}{2}\text{RSS}(0) - \text{RSS}(\bar{x}) = x_1x_2$ which has zero mean.

We might be tempted to compute a z-score for each coefficient using the variance estimate, but this would be incorrect! In addition to $\hat{\beta}_i$ being a random variable, $\hat{\sigma}^2$ is as well. This means we cannot compute a p-value using the Gaussian CDF. Instead, we compute the t-statistic.

Observe that $\frac{\text{RSS}}{\sigma^2} \sim \chi^2_{N-D}$ follows a Chi-squared distribution with $N-D$ degrees of freedom, i.e. the distribution of a sum of $N-D$ squared samples from a standard Normal. Doing some algebra:

$$t_i = \frac{\hat{\beta}_i - \beta_i}{\sqrt{(N-D)^{-1}\text{RSS}}\sqrt{v_i}}\cdot \frac{\sigma}{\sigma}$$ $$\implies t_i = \sqrt{N-D} \cdot \underbrace{\frac{\hat{\beta}_i - \beta_i}{\sigma\sqrt{v_i}}}_{Z}\cdot \underbrace{\frac{\sigma}{\sqrt{\text{RSS}}}}_{U^{-1/2}} = \frac{Z}{\sqrt{U/(N-D)}}$$

where $Z\sim\mathcal{N}(0,1)$ and $U\sim\chi^2_{N-D}$. The two random variables are independent because $Z$ depends only on the projected noise $P\epsilon\in\text{col}(X)$, while $U$ depends only on the residual noise $(I-P)\epsilon\in\text{col}(X)^\perp$.

The resulting ratio is therefore $t$-distributed. For degrees of freedom $<30$, Student’s $t$-distribution has notably heavier tails than the Gaussian. With $t$-statistics, we can compute a p-value just as we did with the z-scores under the null hypothesis, but using the $t$-distribution CDF.
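
The full t-test can be sketched as follows (assuming SciPy is available for the $t$-distribution CDF; the data and coefficients are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
N, D = 40, 3
X = rng.normal(size=(N, D))
beta_true = np.array([1.5, 0.0, -1.0])
y = X @ beta_true + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
rss = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = rss / (N - D)                      # unbiased variance estimate
t_stat = beta_hat / np.sqrt(sigma2_hat * np.diag(XtX_inv))
p = 2.0 * stats.t.sf(np.abs(t_stat), df=N - D)  # two-sided p-values
```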

LASSO

Another approach to covariate selection is to add a sparsity-inducing penalty to the loss function. In statistics this is called the least absolute shrinkage and selection operator (LASSO); the closely related formulation in signal processing is basis pursuit, as the model pursues a sparse subset of basis features.

If we chose to limit the number of active features to $k$, our optimization problem would become:

$$\min_{\beta} \text{RSS} \text{ s.t. } \Vert \beta \Vert_0 \leq k$$

where $\Vert \cdot \Vert_0$ is the $\ell_0$ norm, counting the number of non-zero elements in the vector. However, this problem is NP-hard: in general it requires a combinatorial search over feature subsets. The $\ell_0$ norm is also unsuitable for gradient-based optimization due to its discrete nature. Instead, a common approach is to relax the problem using the $\ell_1$ norm:

$$\min_{\beta} \text{RSS} \text{ s.t. } \Vert \beta \Vert_1 \leq t$$ where $t$ is a continuous budget playing the role of $k$.

This problem is convex but non-smooth, so it is solved efficiently with methods such as coordinate descent or proximal gradient descent rather than plain gradient descent. The unconstrained formulation uses a pre-specified Lagrange multiplier $\lambda$ to enforce the constraint:

$$\min_{\beta} \left(\text{RSS} + \lambda \Vert \beta \Vert_1\right)$$
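
A compact way to solve the penalized problem is proximal gradient descent (ISTA): alternate a gradient step on the smooth $\tfrac{1}{2}\text{RSS}$ term with soft-thresholding, the proximal operator of the $\ell_1$ penalty. A numpy sketch (the $\tfrac{1}{2}$ factor just rescales $\lambda$; the data below is synthetic):

```python
import numpy as np

def soft_threshold(x, tau):
    # Proximal operator of tau * ||.||_1: shrink toward zero, clip at zero.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    # Minimize (1/2)||y - X beta||^2 + lam * ||beta||_1 via ISTA.
    L = np.linalg.norm(X, ord=2) ** 2        # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)          # gradient of the smooth part
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10)
beta_true[0], beta_true[5] = 3.0, -2.0       # sparse ground truth
y = X @ beta_true + rng.normal(size=100)
beta_lasso = lasso_ista(X, y, lam=50.0)      # most coefficients land exactly at zero
```

Coordinate descent (as in scikit-learn's `Lasso`) is usually faster in practice, but ISTA makes the connection between the penalty and soft-thresholding explicit.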

When covariates form subsets of their own, a useful extension to this method is Group LASSO, which penalizes the activity of entire groups of features by summing their $\ell_2$ norms.

Relaxed Assumptions

Thus far we have assumed the errors $\epsilon$ are independent and identically distributed (i.i.d.) Gaussian with variance $\sigma^2$. There is some elegant theory that allows us to relax these assumptions.

Non-Gaussian Errors

First, we can replace the Gaussian assumption with the weaker requirement that the error distribution has zero mean, finite variance, and all error samples are uncorrelated. This is weaker because the distribution’s higher-order moments are left unconstrained and the errors need only be uncorrelated rather than fully independent.

The Gauss-Markov Theorem states that the OLS estimator is the best linear unbiased estimator (BLUE). Suppose there exists another linear estimator $\tilde{\beta} = \left((X^\mathsf{T}X)^{-1}X^\mathsf{T} + A\right)y$ where $A\in\R^{D\times N}$ is a non-zero matrix.

$$E\left[\tilde{\beta}\right] = E\left[\left((X^\mathsf{T}X)^{-1}X^\mathsf{T} + A\right)y\right] = (I + AX)\beta$$

Hence, $\tilde{\beta}$ is biased unless $AX=0$, which causes cross terms to vanish in the variance calculation.

$$V\left[\tilde{\beta}\right] = \left((X^\mathsf{T}X)^{-1}X^\mathsf{T} + A\right)V\left[\epsilon\right]\left((X^\mathsf{T}X)^{-1}X^\mathsf{T} + A\right)^\mathsf{T}$$ $$= \sigma^2\left((X^\mathsf{T}X)^{-1} + AA^\mathsf{T}\right) = V[\hat{\beta}] + \sigma^2\underbrace{AA^\mathsf{T}}_{\text{PSD}} \geq V[\hat{\beta}]$$ where the final inequality holds in the positive semi-definite ordering.

Thus, $\hat{\beta}$ is BLUE.

Heteroscedasticity

The assumption that errors have fixed variance is called homoscedasticity. In practice it is often invalid, as the measurement device might become increasingly mis-calibrated over time. When errors have different variances, we call it heteroscedasticity. Suppose $\epsilon\sim\mathcal{N}(0,W^{-1})$ where $W$ is a diagonal matrix with diagonal elements $w_i$. If $W$ is known, the weighted least squares estimator (WLS) is:

$$\hat{\beta} = (X^\mathsf{T}WX)^{-1}X^\mathsf{T}Wy$$
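
A numpy sketch of WLS with a known, hypothetical variance schedule:

```python
import numpy as np

rng = np.random.default_rng(8)
N, D = 200, 2
X = rng.normal(size=(N, D))
beta_true = np.array([1.0, -2.0])
sd = np.linspace(0.1, 3.0, N)            # error scale grows over the sample
y = X @ beta_true + sd * rng.normal(size=N)

W = np.diag(1.0 / sd ** 2)               # known precision (inverse-variance) weights
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

Down-weighting noisy observations gives the WLS estimator a smaller variance than plain OLS under heteroscedasticity.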

If $W$ is unknown, we can use iterative re-weighted least squares (IRLS): start with an initial guess for $W$, fit $\hat{\beta}$, then update $W$ using an estimated model for $V\left[\epsilon_i\mid X\right]$ based on the residuals (often via $r_i^2$ or $\log r_i^2$), and iterate until convergence. Naive updates like $w_i = r_i^{-2}$ can be numerically unstable. In practice, regularization is required through M-estimation or priors on $W$, but that is beyond the scope of this post.
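
A minimal IRLS sketch, under the assumption (hypothetical for this example) that the error variance depends on the covariate through $\log \sigma_i^2 \approx a + b\,|x_i|$; the floor on the fitted variance keeps the weights numerically stable:

```python
import numpy as np

rng = np.random.default_rng(9)
N = 300
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
sd = 0.2 + np.abs(x)                          # heteroscedastic noise scale
y = X @ np.array([1.0, 2.0]) + sd * rng.normal(size=N)

beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS initialization
for _ in range(10):
    r = y - X @ beta
    # Fit a simple variance model on log squared residuals.
    A = np.column_stack([np.ones(N), np.abs(x)])
    coef = np.linalg.lstsq(A, np.log(r ** 2 + 1e-8), rcond=None)[0]
    var_hat = np.maximum(np.exp(A @ coef), 1e-4)   # floor for stability
    w = 1.0 / var_hat
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
```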