Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is a method for estimating the parameters of a statistical model from a set of data. Suppose that we've collected a dataset \(\mathcal{D} = \{x_k\}_{k=1}^N\) and that we have good prior reason to believe the data is best modeled by a particular family of probability distributions \(p(x|\theta)\), where \(\theta\) denotes the model's parameters. The act of choosing a particular value for \(\theta\) is called model fitting or training.

Suppose we had a particular value of \(\theta\). Then the likelihood of the data given the parameters is \(p(\mathcal{D}|\theta) = p(x_1, x_2, \dots, x_N|\theta)\). We can make the simplifying assumption that the data is independent and identically distributed (IID), so that $$ \begin{aligned} p(\mathcal{D}|\theta) &= p(x_1, x_2, \dots, x_N|\theta) \\ &= \prod_{k=1}^N p(x_k|\theta). \end{aligned} $$
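
As a quick illustration of this factorization (the Gaussian family, the toy data, and the parameter values below are made-up assumptions, not part of the setup above), the joint likelihood is simply the product of per-point densities:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of an observation under a normal distribution with mean mu and std sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Toy dataset and a candidate parameter value theta = (mu, sigma).
data = np.array([1.2, 0.7, 2.3, 1.9, 0.4])
mu, sigma = 1.0, 1.0

# IID assumption: the joint likelihood factors into a product over the data points.
likelihood = np.prod(gaussian_pdf(data, mu, sigma))
print(likelihood)
```

For large \(N\) this product of many numbers less than one underflows quickly, which is one practical reason to work with the logarithm of the likelihood, as we do next.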

The essence of the maximum likelihood method lies in the simple idea that we should choose our model parameter \(\theta\) to maximize the likelihood of the data that we've actually observed. In practice, we're going to minimize the negative log-likelihood (NLL), $$ \ell_\mathcal{D}(\theta) = -\log p(\mathcal{D}|\theta) = -\sum_{k=1}^N\log{p(x_k|\theta)}, $$ where the notation \(\ell_\mathcal{D}(\theta)\) emphasizes that the NLL is a function of the parameters with the data fixed beforehand. We'll use this as a score to rank the quality of models with respect to our dataset.
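
Continuing the hypothetical Gaussian example from above, the NLL is just the negative sum of log-densities, which can be evaluated stably even for large datasets:

```python
import numpy as np

def gaussian_nll(theta, data):
    """NLL of IID data under a normal model; theta = (mu, sigma)."""
    mu, sigma = theta
    n = len(data)
    # -log p(x|mu, sigma), summed over the data points.
    return 0.5 * np.sum(((data - mu) / sigma) ** 2) + n * np.log(sigma * np.sqrt(2.0 * np.pi))

data = np.array([1.2, 0.7, 2.3, 1.9, 0.4])
print(gaussian_nll((1.0, 1.0), data))  # lower is better
```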

Note that minimizing the negative log-likelihood is equivalent to maximizing the likelihood function: the logarithm is strictly increasing, so it preserves the location of the maximum, and negating the objective turns that maximum into a minimum.

Now let's minimize the NLL by taking the derivative and finding the critical points \(\theta^*\): $$ \frac{\partial}{\partial\theta}\ell_\mathcal{D}(\theta^*) = -\sum_{k=1}^N\frac{\partial}{\partial\theta}\log{p(x_k|\theta^*)} = 0, $$ and then checking that the second derivative is positive there to ensure the point is indeed a minimum [3].
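
For a concrete worked example (a standard one, not taken from the discussion above), consider the exponential distribution \(p(x|\theta) = \theta e^{-\theta x}\) for \(x \geq 0\): $$ \begin{aligned} \ell_\mathcal{D}(\theta) &= -\sum_{k=1}^N \log\left(\theta e^{-\theta x_k}\right) = -N\log\theta + \theta\sum_{k=1}^N x_k, \\ \frac{\partial}{\partial\theta}\ell_\mathcal{D}(\theta^*) &= -\frac{N}{\theta^*} + \sum_{k=1}^N x_k = 0 \quad\Longrightarrow\quad \theta^* = \frac{N}{\sum_{k=1}^N x_k} = \frac{1}{\bar{x}}, \end{aligned} $$ and the second derivative \(N/(\theta^*)^2\) is positive, so \(\theta^*\) is indeed a minimum of the NLL.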

It is not always so easy to determine a neat closed-form expression for the maximum likelihood estimate \(\theta^*\). A prototypical example is the maximum likelihood estimator for the location of the Cauchy distribution, which has no closed form. In these cases, we can numerically optimize the likelihood using something like the Newton-Raphson method [1].
A beautifully clear exposition of Newton's method can be found in the first chapter of Structure and Interpretation of Computer Programs [2].
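
Here is a minimal sketch of that idea for the Cauchy case, assuming a known unit scale so that only the location \(\theta\) is estimated; the dataset and iteration count are made up for illustration:

```python
import numpy as np

def cauchy_nll_derivatives(theta, data):
    """First and second derivatives (in theta) of the unit-scale Cauchy NLL,
    which is sum_k log(1 + (x_k - theta)^2) up to an additive constant."""
    u = data - theta
    grad = -np.sum(2.0 * u / (1.0 + u**2))
    hess = np.sum(2.0 * (1.0 - u**2) / (1.0 + u**2) ** 2)
    return grad, hess

def newton_raphson_mle(data, n_steps=20):
    theta = np.median(data)  # the sample median is a sensible starting point for Cauchy data
    for _ in range(n_steps):
        grad, hess = cauchy_nll_derivatives(theta, data)
        theta -= grad / hess  # Newton step: find a root of the NLL's derivative
    return theta

data = np.array([-0.8, 0.1, 0.4, 1.3, 2.1, 15.0])  # made-up sample with one heavy-tailed outlier
print(newton_raphson_mle(data))
```

A bare Newton iteration like this can misbehave when started far from the optimum, since the Cauchy NLL is not convex; a production implementation would add a safeguard such as a line search or fall back to a bracketing method.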

Alternative methods of estimation exist, such as the method of moments, which is computationally simple and avoids the complexity of likelihood maximization, but it has its own drawbacks (see [0, §4.4.1, pp. 112-113]).
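
For a flavor of the method of moments (a standard example, not drawn from the passage above): under \(\text{Unif}(0,\theta)\) the first moment is \(\mathbb{E}[X] = \theta/2\), so matching it to the sample mean gives \(\hat{\theta}_{\text{MM}} = 2\bar{x} = \frac{2}{N}\sum_{k=1}^N x_k\), with no optimization required, whereas the maximum likelihood estimate for the same model is \(\max_k x_k\).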

An interesting twist on likelihood estimation is to add a penalization term to the negative log-likelihood: $$ \mathcal{L}(\theta;\lambda) = \lambda C(\theta) + \ell_\mathcal{D}(\theta), $$ where \(\lambda\geq 0\) is called the regularization parameter and \(C(\theta)\) is called the complexity penalty.

With an appropriate choice of complexity penalty, applying the above to a linear regression model gives the weights $$ w^* = \text{argmin}_w \ell_\mathcal{D}(w) + \lambda ||w||_2^2. $$ The term \(\lambda ||w||_2^2\), where \(||w||_2^2=\sum_{k=1}^D w_k^2\), penalizes large weight values. This is known as ridge regression. See [0, ch. 4] for more.
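
As a sketch of what this looks like in practice (assuming a linear-Gaussian model with unit noise variance, so that \(\ell_\mathcal{D}(w)\) reduces to a squared error up to additive constants, and with the resulting constant factor absorbed into \(\lambda\)), the penalized objective has a closed-form minimizer:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - X w||^2 + lam * ||w||^2 via its normal equations."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Made-up data: 50 points, 3 features, known weights plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

print(ridge_fit(X, y, lam=0.0))   # ordinary least squares (no penalty)
print(ridge_fit(X, y, lam=10.0))  # a larger lambda shrinks the weights toward zero
```

Increasing \(\lambda\) trades a worse fit to the training data for smaller, more stable weights.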

References