Linear Regression
Linearity is a great property for functions to have. Suppose we have a function \(f:\mathbb{R}^n\rightarrow\mathbb{R}^n\) mapping n-dimensional vectors to n-dimensional vectors. We say \(f\) is linear if $$ f(a\vec{x}+b\vec{y}) = af(\vec{x})+bf(\vec{y}) $$ for all scalars \(a,b\in\mathbb{R}\) and vectors \(\vec{x},\vec{y}\in\mathbb{R}^n\). Essentially, linear functions preserve the structure of a vector space.
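As a quick numerical sanity check, here is a small sketch using NumPy: matrix multiplication is a standard example of a linear map, and we can verify the defining property on an arbitrary matrix and random vectors (all values below are made up for illustration).

```python
import numpy as np

# Illustrative example: f(x) = A @ x is a linear map from R^3 to R^3.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))

def f(x):
    return A @ x

a, b = 2.0, -0.5
x, y = rng.normal(size=3), rng.normal(size=3)

# f(a*x + b*y) should equal a*f(x) + b*f(y), up to floating-point error.
print(np.allclose(f(a * x + b * y), a * f(x) + b * f(y)))  # True
```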
The convenience of linear functions has historically made them a popular choice for modelling. In particular, they're easy to compute.
Suppose we have a vector \(\vec{x}\in\mathbb{R}^n\) (whose components \(x_i\) we'll call features) and a value \(y\) we wish to predict.
A linear model attempts to predict \(y\) as a linear combination of the features \(x_i\) (plus an intercept term):
$$
f(\vec{x}) = f(x_1, x_2,\dots,x_n) = \hat{y} = \beta_0 + \sum_{i=1}^n x_i\beta_i,
$$
where the \(\beta_i\)'s are coefficients that can be estimated, as we'll see.
We write the predicted value of \(y\) as \(\hat{y}\) (pronounced y hat). It is common in statistics and machine learning to write a circumflex over a letter to denote a predicted quantity.
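To make this concrete, here is a tiny numerical sketch; the feature values and coefficients below are made up purely for illustration.

```python
import numpy as np

x = np.array([1.5, -2.0, 0.3])        # features x_1, x_2, x_3
beta0 = 4.0                           # intercept beta_0
beta = np.array([0.5, 1.2, -0.7])     # coefficients beta_1, beta_2, beta_3

# The prediction y_hat = beta_0 + sum_i x_i * beta_i.
y_hat = beta0 + np.sum(x * beta)
print(y_hat)                          # ≈ 2.14
```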
We can take full advantage of vector notation if we augment \(\vec{x}\) with a leading constant 1, so that \(\vec{x}=(1,x_1,\dots,x_n)\), and collect the coefficients into the vector \(\vec{\beta}=(\beta_0,\beta_1,\dots,\beta_n)\). Now we can write the model as $$ \hat{y} = \vec{x}\cdot\vec\beta = \vec{x}^T \vec{\beta}. $$
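Continuing the toy numbers above (again, made up for illustration), prepending the constant 1 to \(\vec{x}\) lets a single dot product carry the intercept:

```python
import numpy as np

# Same made-up numbers as before, with the constant 1 prepended to x
# and beta_0 prepended to the coefficient vector.
x_aug = np.array([1.0, 1.5, -2.0, 0.3])     # (1, x_1, x_2, x_3)
beta = np.array([4.0, 0.5, 1.2, -0.7])      # (beta_0, beta_1, beta_2, beta_3)

y_hat = x_aug @ beta                        # x^T beta
print(y_hat)                                # ≈ 2.14, same as before
```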
Now that we can make predictions, it'd be nice to know how good they are. We do this by choosing a loss function. Suppose that for each of \(N\) observations we make a prediction \(\hat{y}_i\) using features \(\vec{x}_i\) (augmented as above) and coefficients \(\vec\beta\), and then observe the actual outcome \(y_i\). The most common loss function is the sum of squared errors (or just sum of squares): $$ \ell(\vec\beta) = \sum_{i=1}^N(y_i - \vec{x}_i^T\vec\beta)^2 $$ Notice that the loss is a function of the coefficients. We'll use this quantity to guide us in improving our estimates of the model coefficients \(\vec\beta\).
We rewrite the loss function using vector notation like so: $$ \ell(\vec\beta) = (\vec{y} - \textbf{X}\vec{\beta})^T(\vec{y} - \textbf{X}\vec{\beta}), $$ where \(\vec{y}=(y_1,\dots,y_N)\) is the vector of observed outcomes and \(\textbf{X}\) is the \(N\times(n+1)\) matrix whose \(i\)-th row is the augmented feature vector \(\vec{x}_i^T\).
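Here is a sketch of the loss on a small made-up dataset, computed both as the explicit sum over observations and in the vectorized matrix form (the numbers are arbitrary):

```python
import numpy as np

# Each row of X is an augmented feature vector (1, x_1, x_2); y holds the outcomes.
X = np.array([[1.0,  0.5,  1.0],
              [1.0,  1.5, -0.5],
              [1.0, -1.0,  2.0],
              [1.0,  2.0,  0.0]])
y = np.array([1.2, 0.4, 3.1, 0.9])
beta = np.array([0.5, -0.2, 1.0])        # candidate coefficients

# Explicit sum of squared errors over the observations.
loss_sum = sum((y_i - x_i @ beta) ** 2 for x_i, y_i in zip(X, y))

# Vectorized form: (y - X beta)^T (y - X beta).
residual = y - X @ beta
loss_vec = residual @ residual

print(np.allclose(loss_sum, loss_vec))   # True
```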
Using the properties of the transpose operation and some algebra, we find that $$ \begin{equation} \begin{aligned} \ell(\vec\beta) = (\vec{y} - \textbf{X}\vec{\beta})^T(\vec{y} - \textbf{X}\vec{\beta}) &= (\vec{y}^T - (\textbf{X}\vec{\beta})^T)(\vec{y} - \textbf{X}\vec{\beta}) \\ &= \vec{y}^T\vec{y} - \vec{y}^T\textbf{X}\vec{\beta} - (\textbf{X}\vec{\beta})^T\vec{y} +(\textbf{X}\vec{\beta})^T(\textbf{X}\vec{\beta}) \end{aligned} \end{equation} $$ Since \(\vec{y}^T\textbf{X}\vec{\beta}\) is a scalar, it equals its own transpose \((\textbf{X}\vec{\beta})^T\vec{y}\), so the two middle terms are equal.
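A quick numeric check of the expansion, reusing the arbitrary toy numbers from above:

```python
import numpy as np

X = np.array([[1.0,  0.5,  1.0],
              [1.0,  1.5, -0.5],
              [1.0, -1.0,  2.0],
              [1.0,  2.0,  0.0]])
y = np.array([1.2, 0.4, 3.1, 0.9])
beta = np.array([0.5, -0.2, 1.0])

Xb = X @ beta
lhs = (y - Xb) @ (y - Xb)                  # (y - X beta)^T (y - X beta)
rhs = y @ y - y @ Xb - Xb @ y + Xb @ Xb    # expanded form
print(np.allclose(lhs, rhs))               # True
```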
To minimize the loss function, we will compute its gradient. Let \(\vec e = \vec{y} - \textbf{X}\vec{\beta}\), with components \(e_i = y_i - \sum_{j=0}^n x_{ij}\beta_j\) (where \(x_{i0}=1\) is the augmented constant), so that we may write the loss function as \(\ell(\vec\beta) = \vec{e}^T \vec{e} = \sum_{i=1}^N e_i^2\).
Now we compute the derivative of the loss function with respect to a single component of \(\vec\beta\). Since \(\partial e_i/\partial\beta_k = -x_{ik}\), $$ \begin{equation} \begin{aligned} \frac{\partial\ell}{\partial \beta_k} &= \sum_i 2e_i \frac{\partial e_i}{\partial \beta_k}\\ &= -2\sum_i x_{ik}e_i \\ &= -2\vec{X}_k \cdot (\vec{y} - \textbf{X}\vec{\beta}), \end{aligned} \end{equation} $$ where \(\vec{X}_k \) is the \(k\)-th column of the matrix \(\textbf{X}\). Stacking these partial derivatives gives the gradient $$ \nabla\ell = -2 \textbf{X}^T(\vec{y} - \textbf{X}\vec\beta). $$
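To check the gradient formula numerically, here is a sketch that compares it against central finite differences on the toy data above (made-up numbers; the agreement is up to floating-point error):

```python
import numpy as np

X = np.array([[1.0,  0.5,  1.0],
              [1.0,  1.5, -0.5],
              [1.0, -1.0,  2.0],
              [1.0,  2.0,  0.0]])
y = np.array([1.2, 0.4, 3.1, 0.9])
beta = np.array([0.5, -0.2, 1.0])

def loss(b):
    r = y - X @ b
    return r @ r

# Gradient from the derivation: -2 X^T (y - X beta).
grad = -2 * X.T @ (y - X @ beta)

# Central finite differences as an independent check on each partial derivative.
eps = 1e-6
fd = np.array([
    (loss(beta + eps * np.eye(3)[k]) - loss(beta - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])
print(np.allclose(grad, fd))   # True
```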
The loss is minimized where the gradient vanishes, that is, where $$ \textbf{X}^T (\vec{y} - \textbf{X}\vec\beta) = \textbf{X}^T \vec{y} - \textbf{X}^T \textbf{X} \vec\beta = 0. $$ Provided \(\textbf{X}^T \textbf{X}\) is invertible, some simple algebra gives the solution: $$ \hat{\vec\beta} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T\vec{y}. $$ Since \(\ell\) is a convex quadratic in \(\vec\beta\), this stationary point is indeed a global minimum. \(\square\)
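Putting everything together, here is a minimal sketch of the closed-form solution on the toy data used above. Rather than forming the inverse explicitly, the sketch solves the normal equations \(\textbf{X}^T\textbf{X}\vec\beta = \textbf{X}^T\vec{y}\) directly and cross-checks against NumPy's least-squares routine:

```python
import numpy as np

X = np.array([[1.0,  0.5,  1.0],
              [1.0,  1.5, -0.5],
              [1.0, -1.0,  2.0],
              [1.0,  2.0,  0.0]])
y = np.array([1.2, 0.4, 3.1, 0.9])

# Solve the normal equations (X^T X) beta = X^T y rather than inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# At the minimizer, the gradient -2 X^T (y - X beta) vanishes.
print(np.allclose(X.T @ (y - X @ beta_hat), 0.0))                    # True

# Cross-check against NumPy's least-squares solver.
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))   # True
```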