Linear Regression: Frequentist Approach

Linear regression is one of the simplest models and builds upon (in my opinion) very elegant and intuitive maths. It has a wide array of tools built around it, both for theoretical analysis and practical use.

We love to take measurements, but some things are quite difficult to measure, so we can only obtain the value of a proxy and make a prediction. We call the proxy measurements covariates and the target value the response variable.

The Linear Model

Suppose we believe there is a linear function $\phi$ that maps the covariates to our target but is off by a small amount each time. The error of each observation is modeled as an independent sample from a Gaussian distribution centered at zero with variance $\sigma^2$. Note that this is a strong assumption and we will relax it later.

Given a sample $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ of $N$ independent observations, we write the model as: $$y_i = \phi(x_i) + \epsilon_i$$

We parametrize $\phi$ with a vector of parameters $\beta \in \mathbb{R}^{D}$. To include the intercept, one of the covariates has to be constant, e.g. by setting it to 1. In vector notation, we write: $$y = X\beta + \epsilon$$ where $X$ is an $N \times D$ matrix and $\epsilon\sim\mathcal{N}(0, \sigma^2I)$. Note that since Gaussians are closed under affine transformations, our model says $y | X \sim \mathcal{N}(X\beta, \sigma^2I)$.
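
To make the setup concrete, here is a minimal NumPy sketch that simulates data from this model; the particular $N$, $D$, `beta_true` and `sigma` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 100, 3                              # sample size and number of covariates (arbitrary)
beta_true = np.array([2.0, -1.0, 0.5])     # hypothetical "true" parameters
sigma = 0.3                                # noise standard deviation

# Design matrix: the first column is constant 1 to act as the intercept.
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])

# y | X ~ N(X beta, sigma^2 I)
y = X @ beta_true + rng.normal(scale=sigma, size=N)
```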

If the true $\beta$ were known, we could predict the expected response for new covariates $x$ by computing $\hat{y} = \mathbb{E}[y | x] = \phi(x)$.

Parameter Estimation

Frequentists believe parameters are fixed and unknown, but can be estimated through repeated sampling. A common approach is Maximum Likelihood Estimation (MLE), which seeks the value of $\beta$ that maximizes the data’s likelihood $p(y | X, \beta)$. Since we are modeling $y$ as a Gaussian, the likelihood is: $$p(y | X, \beta) = \prod_{i=1}^N \mathcal{N}(y_i | x_i^\mathsf{T} \beta, \sigma^2)$$ $$= \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - x_i^\mathsf{T} \beta)^2\right)$$

Switching to matrix notation and taking the negative log of the likelihood: $$-\log p(y | X, \beta) = \frac{N}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} (y - X\beta)^\mathsf{T} (y - X\beta)$$ $$\implies \frac{\partial}{\partial \beta}\left(-\log p(y | X, \beta)\right) = -\frac{1}{\sigma^2} X^\mathsf{T} (y - X\beta)$$

We can solve for $\hat{\beta}$ analytically: the negative log-likelihood is a convex function of $\beta$, so any stationary point is a global minimizer. Setting the derivative to zero and rearranging gives: $$\hat{\beta} = (X^\mathsf{T} X)^{-1} X^\mathsf{T} y$$
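
Continuing the simulated example, the MLE is a couple of lines of NumPy. Forming $X^\mathsf{T}X$ explicitly is fine for a small, well-conditioned problem like this, though a QR/SVD-based least-squares solve is the numerically safer habit.

```python
# Continuing with X and y from the simulation sketch above.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # textbook normal equations

# Numerically preferable alternative: avoids forming X^T X explicitly.
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)                                   # should be close to beta_true
```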

Oh look, that’s where the name of XTX Markets comes from! Now, if we substitute $\hat{\beta}$ into the linear model with the training data as inputs we obtain $$\hat{y} = X\hat{\beta} = \underbrace{X (X^\mathsf{T} X)^{-1} X^\mathsf{T}}_{P} y$$

This reveals the projection matrix $P$, which is idempotent ($P^2 = P$) and symmetric ($P^\mathsf{T} = P$), meaning it behaves like the identity matrix on the column space of $X$.
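
Both properties, plus the identity-on-$\mathrm{col}(X)$ behaviour, are easy to check numerically on the simulated design from the sketch above:

```python
# Continuing with X from the sketch above.
P = X @ np.linalg.solve(X.T @ X, X.T)   # hat/projection matrix

print(np.allclose(P @ P, P))   # idempotent
print(np.allclose(P.T, P))     # symmetric
print(np.allclose(P @ X, X))   # acts like the identity on the column space of X
```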

We will assume $X$ is full-rank going forward, i.e. $\text{rank}(X)=\min(N,D)$. Recall $X$ contains $N$ samples of $D$ covariates. We encounter three cases:

Case 1: $D < N$

$X^\mathsf{T} X$ is invertible so we can compute $\hat{\beta}$ as shown above. Intuitively, $\hat{y} = X\hat{\beta}$ is the projection of $y$ onto the column space of $X$, and $\hat{\beta}$ holds its coordinates in the basis formed by the columns of $X$.

Case 2: $D > N$

$X^\mathsf{T} X$ is not invertible. With more parameters than observations, the system $X\beta = y$ is underdetermined: $X$ has a non-trivial null space, so there are infinitely many values of $\beta$ that perfectly predict $y$. If we add the objective that the optimal $\beta$ has minimal 2-norm, we can use the method of Lagrange multipliers with constraint $X\beta = y$ to find a unique solution $\hat{\beta} = X^\mathsf{T} (XX^\mathsf{T})^{-1} y$.
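
A small self-contained sketch of this minimum-norm solution, with an arbitrary wide design `X_wide`; for a full row-rank $X$ it coincides with applying the Moore-Penrose pseudoinverse to $y$.

```python
import numpy as np

rng_wide = np.random.default_rng(1)
N_wide, D_wide = 5, 20                      # more covariates than samples (arbitrary)
X_wide = rng_wide.normal(size=(N_wide, D_wide))
y_wide = rng_wide.normal(size=N_wide)

# Minimum-norm interpolating solution: beta = X^T (X X^T)^{-1} y
beta_min_norm = X_wide.T @ np.linalg.solve(X_wide @ X_wide.T, y_wide)

# The Moore-Penrose pseudoinverse gives the same solution when X has full row rank.
beta_pinv = np.linalg.pinv(X_wide) @ y_wide

print(np.allclose(X_wide @ beta_min_norm, y_wide))   # fits the data exactly
print(np.allclose(beta_min_norm, beta_pinv))
```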

Case 3: $D = N$

$X$ is square and invertible, so it effectively defines a new basis. We can write $(X^\mathsf{T} X)^{-1} = X^{-1}(X^\mathsf{T})^{-1}$, and it follows that $P=I$, so $\hat{y} = y$: the fit interpolates the training data exactly.

Going forward, let's assume we are operating in the case where $D\leq N$. Applying our distributional assumptions on the errors $\epsilon$, we can derive the sampling distribution of $\hat{\beta}$:

$$\hat{\beta} = (X^\mathsf{T}X)^{-1} X^\mathsf{T}\underbrace{(X\beta + \epsilon)}_{y| X} = \beta + (X^\mathsf{T}X)^{-1} X^\mathsf{T}\epsilon$$ $$\implies \hat{\beta} \sim \mathcal{N}\left(\beta, (X^\mathsf{T}X)^{-1}\sigma^2\right)$$
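
A quick Monte Carlo sanity check of this sampling distribution, continuing the simulated example: hold $X$ fixed, redraw the noise many times, and refit each time.

```python
# Continuing with X, beta_true and sigma from the simulation sketch above.
estimates = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta_true + rng.normal(scale=sigma, size=N)))
    for _ in range(5000)
])

print(estimates.mean(axis=0))               # ~ beta_true (unbiased)
print(np.cov(estimates, rowvar=False))      # ~ sigma^2 (X^T X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))
```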

Frequentists are interested in how estimators behave in the limit of infinite samples. An estimator’s bias measures how much its expected value deviates from the true value, while its variance measures how much it fluctuates from one data sample to another. All else equal, an estimator with lower variance requires fewer samples to estimate the unknown parameter to a given accuracy.

The estimator $\hat{\beta}$ is unbiased because its bias is $\mathbb{E}[\hat{\beta}]-\beta = 0$. Its variance is more interesting. If we eigen-decompose $X^\mathsf{T}X = U\Lambda U^\mathsf{T}$ with eigenvalues $\lambda_i$ on the diagonal of $\Lambda$, then $\mathbb{V}[\hat{\beta}] = U\Lambda^{-1}\sigma^2 U^\mathsf{T}$. If $\lambda_{\text{min}}$ is very small, the variance along the corresponding eigenvector will be very large. Hence, it’s crucial to avoid collinearity in $X$ and to scale the covariates to have similar variances.
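
The collinearity issue is easy to provoke numerically. In this hypothetical variation of the sketch (reusing `rng`, `N` and `sigma` from above), two covariates are nearly identical, so the smallest eigenvalue of $X^\mathsf{T}X$ collapses and the variance along its eigenvector blows up:

```python
# A separate, nearly collinear design (not the running X).
x1 = rng.normal(size=N)
x2 = x1 + 0.01 * rng.normal(size=N)          # nearly identical to x1
X_col = np.column_stack([np.ones(N), x1, x2])

eigvals = np.linalg.eigvalsh(X_col.T @ X_col)
print(eigvals)                               # the smallest eigenvalue is tiny

# Variance of beta_hat along the corresponding eigenvector ~ sigma^2 / lambda_min
print(sigma**2 / eigvals.min())
```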

Covariate Selection

We can use the sampling distribution for $\hat{\beta}$ to test hypotheses about the unknown parameters. Significance tests are widely used for covariate selection.

Case 1: $\sigma$ is known

With the sampling distribution for $\hat{\beta}$ in hand, we can compute the z-score for each coefficient and run hypothesis tests. $$z_i = \frac{\hat{\beta}_i - \beta_i}{\sigma \sqrt{v_i}}$$

where $v_i$ is the $i$-th diagonal element of $(X^\mathsf{T}X)^{-1}$.

We can then test the null hypothesis that the coefficient for a particular covariate is zero, i.e. $H_0: \beta_i = 0$. The p-value in a two-sided test is $2 \cdot \Phi(-|z_i|)$ where $\Phi$ is the CDF of the standard Normal distribution $\mathcal{N}(0,1)$. If the p-value is less than a chosen significance level $\alpha$ e.g. $0.05$, we reject the null hypothesis and conclude that the coefficient is non-zero.
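
With the simulated `sigma` from the earlier sketch treated as known, the z-scores and two-sided p-values under $H_0: \beta_i = 0$ look like this; `scipy.stats.norm` provides $\Phi$:

```python
from scipy import stats

# Continuing with X, sigma and beta_hat from the sketches above.
v = np.diag(np.linalg.inv(X.T @ X))          # v_i = [(X^T X)^{-1}]_{ii}
z = beta_hat / (sigma * np.sqrt(v))          # z-scores under H0: beta_i = 0
p_values = 2 * stats.norm.cdf(-np.abs(z))    # two-sided p-values

print(p_values < 0.05)                       # reject H0 at alpha = 0.05?
```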

Case 2: $\sigma$ is unknown

We can estimate $\sigma$ from the data using its residual sum of squares (RSS): $$RSS = (y - X\hat{\beta})^\mathsf{T} (y-X\hat{\beta})$$

One might observe that given $N$ samples, we effectively have $N$ independent error terms, so the sample variance should be $\frac{RSS}{N}$, but this is a biased estimate. Intuitively, $P$ removes uncertainty along $D$ dimensions, leaving only $N-D$ degrees of freedom for the error terms; indeed, the trace of the projection matrix $P$ equals the rank of $X$, which is $D$ here.

$$\mathbb{E}\left[RSS\right] = \mathbb{E}\left[ (y - Py)^\mathsf{T} (y-Py) \right] = \mathbb{E}\left[y^\mathsf{T}(I - P)y\right]$$ Since $PX\beta = X\beta$, we have $(I-P)y = (I-P)\epsilon$, so $$ \mathbb{E}\left[RSS\right] = \mathbb{E}\left[\epsilon^\mathsf{T}(I-P)\epsilon \right] = \mathrm{tr}(I-P)\,\sigma^2 = (N-D)\sigma^2$$

Hence, an unbiased estimate of the sample variance is:

$$\hat{\sigma}^2 = \frac{RSS}{N-D}$$
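
Continuing the simulated example, the residual sum of squares and the unbiased variance estimate:

```python
# Continuing with X, y and beta_hat from the sketches above.
N, D = X.shape
residuals = y - X @ beta_hat
rss = residuals @ residuals

sigma2_hat = rss / (N - D)      # unbiased estimate of sigma^2, close to sigma**2
print(sigma2_hat)
```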

In the simplest setting of estimating the variance of a scalar sample (an intercept-only model, $D=1$), the scaling factor $\tfrac{1}{N-1}$ is known as Bessel’s correction. To gain some intuition for why it is needed, consider sampling twice from $\mathcal{N}(0,1)$. The sample mean $\bar{x}=\frac{x_1+x_2}{2}$ will be non-zero and equidistant from both points. However

$$\mathrm{RSS}(\bar{x})=(x_1-\bar{x})^2+(x_2-\bar{x})^2=\frac{(x_2-x_1)^2}{2}\le x_1^2+x_2^2=\mathrm{RSS}(0)$$

so computing residuals around the sample mean under-estimates the true variance. With the correction, the estimate becomes $\tfrac{RSS(\bar{x})}{N-1} = RSS(\bar{x})$, which differs from the known-mean estimate $\tfrac{1}{2}RSS(0)$ by $\tfrac{1}{2}RSS(0) - RSS(\bar{x}) = x_1x_2$, a quantity with zero mean.
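
A tiny simulation of this two-sample example makes the bias visible (the number of simulated pairs is arbitrary):

```python
import numpy as np

rng_pairs = np.random.default_rng(4)
pairs = rng_pairs.normal(size=(100_000, 2))      # many pairs of samples from N(0, 1)
xbar = pairs.mean(axis=1, keepdims=True)

rss_mean = ((pairs - xbar) ** 2).sum(axis=1)     # residuals around the sample mean
rss_zero = (pairs ** 2).sum(axis=1)              # residuals around the true mean 0

print(rss_zero.mean() / 2)   # ~1.0: dividing by N works when the true mean is known
print(rss_mean.mean() / 2)   # ~0.5: biased low when the mean is estimated
print(rss_mean.mean() / 1)   # ~1.0: Bessel's correction divides by N - 1 instead
```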

We might be tempted to compute a z-score for each coefficient using the variance estimate:

$$z_i = \frac{\hat{\beta}_i - \beta_i}{\hat{\sigma} \sqrt{v_i}}$$

but this is incorrect! In addition to $\hat{\beta}_i$ being a random variable, $\hat{\sigma}^2$ is as well. This means we cannot compute confidence intervals using a Gaussian CDF. Instead, we compute the t-statistic.

Observe that $\frac{RSS}{\sigma^2} \sim \chi^2_{N-D}$ follows a Chi-squared distribution, i.e. the distribution of a sum of $N-D$ squared samples from a standard Normal. Doing some algebra:

$$ t_i = \frac{\hat{\beta}_i - \beta_i}{\sqrt{(N-D)^{-1}RSS}\sqrt{v_i}}\cdot \frac{\sigma}{\sigma} = \underbrace{\frac{\hat{\beta}_i - \beta_i}{\sigma\sqrt{v_i}}}_{Z} \cdot \underbrace{\frac{\sigma}{\sqrt{RSS}}}_{U^{-1/2}} \cdot \sqrt{N-D} = \frac{Z}{\sqrt{U/(N-D)}} $$

where $Z\sim\mathcal{N}(0,1)$ and $U\sim\chi^2_{N-D}$. The two random variables are independent because $Z$ depends only on the projected noise $P\epsilon\in\mathrm{col}(X)$, while $U$ depends only on the residual noise $(I-P)\epsilon\in\mathrm{col}(X)^\perp$.

The resulting ratio is therefore $t$-distributed, i.e. $t_i \sim t_{N-D}$. The $t$-distribution has heavier tails than the Gaussian, noticeably so when the degrees of freedom drop below roughly 30. We can compute confidence intervals and p-values from t-statistics just as we did with the z-scores, only with the $t$-distribution’s CDF in place of $\Phi$.
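
Repeating the earlier test with the estimated variance, the t-statistics and two-sided p-values come from `scipy.stats.t` instead of the Normal CDF (continuing the simulated example):

```python
from scipy import stats

# Continuing with X, beta_hat and rss from the sketches above.
N, D = X.shape
sigma_hat = np.sqrt(rss / (N - D))
v = np.diag(np.linalg.inv(X.T @ X))

t_stats = beta_hat / (sigma_hat * np.sqrt(v))           # t-statistics under H0: beta_i = 0
p_values = 2 * stats.t.cdf(-np.abs(t_stats), df=N - D)  # two-sided p-values

print(p_values < 0.05)                                  # reject H0 at alpha = 0.05?
```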

Relaxed Assumptions

  • BLUE, Gauss-Markov theorem
    • the OLS estimator is unbiased and has minimal variance among linear unbiased estimators as long as the errors are zero-mean, homoscedastic and uncorrelated (no Gaussianity needed)
  • iteratively re-weighted least squares (IRLS) helps with heteroscedasticity (see the sketch after this list)
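
As a rough sketch of the weighted variant: given per-observation weights $w_i$ (typically inverse noise variances), weighted least squares solves $\hat{\beta} = (X^\mathsf{T}WX)^{-1}X^\mathsf{T}Wy$, and IRLS alternates this solve with re-estimating the weights from the residuals. A minimal illustration of the solve itself:

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Solve argmin_beta sum_i w_i * (y_i - x_i^T beta)^2."""
    XtW = X.T * w                          # X^T W: scales observation i by w_i
    return np.linalg.solve(XtW @ X, XtW @ y)

# One IRLS step would re-estimate w from the current residuals (e.g. via a model
# of the noise variance) and call weighted_least_squares again.
```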

TODO:

  • a little history of Student’s t-dist
  • covariate selection using t-dist significance test or LASSO
  • multi-output generalization – multiple target variables
  • weighted least squares, IRLS (connection to EM)

Should I mention the central limit theorem?

The Normal-InverseGamma has an interesting connection to frequentist statistics, which I cover in another post (but in short, the scaled inverse-$\chi^2$ is an instance of the InverseGamma, and marginalizing over the variance results in a Student-t distribution).