Linear Regression

Linearity is a great property for functions to have. Suppose we have a function \(f:\mathbb{R}^n\rightarrow\mathbb{R}^n\) mapping n-dimensional vectors to n-dimensional vectors. We say \(f\) is linear if $$ f(a\vec{x}+b\vec{y}) = af(\vec{x})+bf(\vec{y}) $$ for all scalars \(a,b\in\mathbb{R}\) and all vectors \(\vec{x},\vec{y}\in\mathbb{R}^n\). Essentially, linear functions preserve the structure of a vector space.
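As a quick sanity check, multiplication by a fixed matrix is a linear function, and we can verify the defining identity numerically. This is just a toy sketch in Python with NumPy; the matrix, vectors, and scalars are arbitrary, not anything from the discussion above.

```python
import numpy as np

# f(x) = A x for a fixed matrix A is linear; check f(a x + b y) == a f(x) + b f(y).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
x, y = rng.normal(size=3), rng.normal(size=3)
a, b = 2.5, -1.25

assert np.allclose(A @ (a * x + b * y), a * (A @ x) + b * (A @ y))
```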

The convenience of linear functions has historically made them a popular choice for modelling. In particular, they're easy to compute.

Suppose we have a vector \(\vec{x}\in\mathbb{R}^n\) (whose components \(x_i\) we'll call features) and a value \(y\) we wish to predict. A linear model attempts to predict \(y\) as a linear combination of the features \(x_i\): $$ f(\vec{x}) = f(x_1, x_2,\dots,x_n) = \hat{y} = \beta_0 + \sum_{i=1}^n x_i\beta_i, $$ where the \(\beta_i\)'s are coefficients to be estimated, as we'll see.
We write the predicted value of \(y\) as \(\hat{y}\) (pronounced y hat). It is common in statistics and machine learning to write a circumflex over a letter to denote a predicted quantity.
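To make the prediction formula concrete, here is a minimal sketch in Python with NumPy. The feature values and coefficients are made-up numbers chosen purely for illustration.

```python
import numpy as np

# Made-up features and coefficients, purely for illustration.
x = np.array([2.0, -1.0, 0.5])        # features x_1, x_2, x_3
beta_0 = 1.0                          # intercept
beta = np.array([0.3, 2.0, -0.7])     # coefficients beta_1, beta_2, beta_3

# y_hat = beta_0 + sum_i x_i * beta_i
y_hat = beta_0 + x @ beta
print(y_hat)  # -0.75
```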

We can fully utilize vector notation if we augment \(\vec{x}\) with a leading constant 1, so that \(\vec{x}=(1,x_1,\dots,x_n)\), and collect the coefficients into the vector \(\vec{\beta}=(\beta_0,\beta_1,\dots,\beta_n)\). Now we can write the model as $$ \hat{y} = \vec{x}\cdot\vec\beta = \vec{x}^T \vec{\beta}. $$
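With the same made-up numbers as before, the augmented form folds the intercept into a single dot product:

```python
import numpy as np

# Same toy numbers, with the constant 1 prepended to x.
x = np.array([1.0, 2.0, -1.0, 0.5])       # (1, x_1, x_2, x_3)
beta = np.array([1.0, 0.3, 2.0, -0.7])    # (beta_0, beta_1, beta_2, beta_3)

y_hat = x @ beta   # x^T beta
print(y_hat)       # -0.75, same prediction as before
```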

Now that we can make predictions, it'd be nice to know how good they are. We do this by choosing a loss function. Suppose that after making a prediction \(\hat{y}_i\), using features \(\vec{x}_i\) and coefficients \(\vec\beta\), we then observe the actual outcome \(y_i\). The most common loss function is the sum of squared errors (or just sum of squares): $$ \ell(\vec\beta) = \sum_{i=1}^N(y_i - \vec{x}_i^T\vec\beta)^2, $$ where \(N\) is the number of observations. Notice that the loss is a function of the coefficients. We'll use this quantity to guide us in improving our estimates of the model coefficients \(\vec\beta\).
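Here is a small sketch of the loss computation on invented toy data (three observations, each with an augmented feature vector):

```python
import numpy as np

# Toy data: three observations with augmented feature vectors (leading 1).
xs = [np.array([1.0, 0.0, 1.0]),
      np.array([1.0, 1.0, 0.0]),
      np.array([1.0, 2.0, 3.0])]
ys = [1.5, 2.0, 7.0]
beta = np.array([0.5, 1.0, 1.5])      # some candidate coefficients

# Sum of squared errors: sum_i (y_i - x_i^T beta)^2
loss = sum((y_i - x_i @ beta) ** 2 for x_i, y_i in zip(xs, ys))
print(loss)   # 0.5
```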

We rewrite the loss function using vector notation like so: $$ \ell(\vec\beta) = (\vec{y} - \textbf{X}\vec{\beta})^T(\vec{y} - \textbf{X}\vec{\beta}), $$ where \(\vec{y}=(y_1,\dots,y_N)\) is the vector of observed outcomes and \(\textbf{X}\) is the \(N\times(n+1)\) design matrix whose \(i\)-th row is \(\vec{x}_i^T\).
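The same toy numbers in matrix form, where each row of \(\textbf{X}\) is one augmented feature vector:

```python
import numpy as np

# Rows of X are the augmented feature vectors from the toy example above.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 3.0]])
y = np.array([1.5, 2.0, 7.0])
beta = np.array([0.5, 1.0, 1.5])

e = y - X @ beta          # residual vector y - X beta
loss = e @ e              # (y - X beta)^T (y - X beta)
print(loss)               # 0.5, matching the elementwise sum
```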

Using the properties of the transpose operation and some algebra, we find that $$ \begin{equation} \begin{aligned} \ell(\vec\beta) = (\vec{y} - \textbf{X}\vec{\beta})^T(\vec{y} - \textbf{X}\vec{\beta}) &= (\vec{y}^T - (\textbf{X}\vec{\beta})^T)(\vec{y} - \textbf{X}\vec{\beta}) \\ &= \vec{y}^T\vec{y} - \vec{y}^T\textbf{X}\vec{\beta} - (\textbf{X}\vec{\beta})^T\vec{y} +(\textbf{X}\vec{\beta})^T(\textbf{X}\vec{\beta}) \end{aligned} \end{equation} $$
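A quick numerical check of this expansion, again on the invented toy numbers:

```python
import numpy as np

# Verify the expanded form of the loss against the compact form.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 3.0]])
y = np.array([1.5, 2.0, 7.0])
beta = np.array([0.5, 1.0, 1.5])

lhs = (y - X @ beta) @ (y - X @ beta)
rhs = y @ y - y @ (X @ beta) - (X @ beta) @ y + (X @ beta) @ (X @ beta)
assert np.isclose(lhs, rhs)
```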

To minimize the loss function, we will compute its gradient. Let \(\vec e = \vec{y} - \textbf{X}\vec{\beta}\) with components \(e_i = y_i - \sum_{j=0}^n x_{ij}\beta_j\) (the index \(j=0\) corresponding to the constant column), so that we may write the loss function as \(\ell(\vec\beta) = \vec{e}^T \vec{e} = \sum_{i=1}^N e_i^2\).

Now we compute the derivative of the loss function with respect to a single component of \(\vec\beta\). Since \(\partial e_i/\partial\beta_k = -x_{ik}\), we have $$ \begin{equation} \begin{aligned} \frac{\partial\ell}{\partial \beta_k} &= \sum_i 2e_i \frac{\partial e_i}{\partial \beta_k}\\ &= -2\sum_i x_{ik}e_i \\ &= -2\vec{X}_k \cdot (\vec{y} - \textbf{X}\vec{\beta}), \end{aligned} \end{equation} $$ where \(\vec{X}_k\) is the \(k\)-th column of \(\textbf{X}\). Stacking these partial derivatives into a vector gives the gradient $$ \nabla\ell = -2 \textbf{X}^T(\vec{y} - \textbf{X}\vec\beta). $$
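As a sanity check on the gradient formula, we can compare it against central finite differences on random data. This is only a sketch; the data and the random seed are arbitrary.

```python
import numpy as np

# Check grad = -2 X^T (y - X beta) numerically.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))         # 20 observations, 4 columns
X[:, 0] = 1.0                        # constant column
y = rng.normal(size=20)
beta = rng.normal(size=4)

def loss(b):
    e = y - X @ b
    return e @ e

grad_analytic = -2 * X.T @ (y - X @ beta)

# Central differences, one coordinate at a time.
eps = 1e-6
grad_numeric = np.zeros_like(beta)
for k in range(len(beta)):
    d = np.zeros_like(beta)
    d[k] = eps
    grad_numeric[k] = (loss(beta + d) - loss(beta - d)) / (2 * eps)

assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)
```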

This quantity is minimized where \(\nabla\ell = 0\); dropping the factor of \(-2\), this means $$ \textbf{X}^T (\vec{y} - \textbf{X}\vec\beta) = \textbf{X}^T \vec{y} - \textbf{X}^T \textbf{X} \vec\beta = 0. $$ (Since the loss is a convex quadratic in \(\vec\beta\), this critical point is indeed a global minimum.) Some simple algebra gives the solution, provided \(\textbf{X}^T\textbf{X}\) is invertible: $$ \vec\beta = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T\vec{y} $$ \(\square\)
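Finally, a minimal sketch of the closed-form fit on synthetic data. The "true" coefficients below are invented for the example, and solving the linear system \(\textbf{X}^T\textbf{X}\vec\beta = \textbf{X}^T\vec{y}\) with a linear solver is numerically preferable to explicitly forming the inverse:

```python
import numpy as np

# Synthetic data: invented "true" coefficients plus a little noise.
rng = np.random.default_rng(1)
n_obs = 100
X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, 3))])
true_beta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_beta + 0.01 * rng.normal(size=n_obs)

# Solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # approximately [2.0, -1.0, 0.5, 3.0]
```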