The Fisher Information Matrix
Consider the log-likelihood function $$ \ell(\theta;x) = \log p(x;\theta), $$ where \(p(x;\theta):X\times\Omega\rightarrow\mathbb{R}\) is a family of probability distributions indexed by parameters \(\theta\in\Omega\).
The gradient of the log-likelihood function with respect to the model parameters \(\theta\) is called the score function: $$ s(\theta;x) = \nabla_\theta{\ell(\theta;x)} $$ where \(\nabla_\theta\) emphasizes that we let \(\theta\) vary while holding \(x\) fixed.
With our score function, we'll now treat it as a statistic of our data set. We'll assume that our data is generated from the "true distribution" \(p(x;\theta^*)\) and we'll calculate the expectation: $$ \begin{equation} \begin{aligned} \text{E}_X\left[s(\theta;x)\right] &= \text{E}_X\left[\nabla_\theta\log p(x;\theta)\right] \\ &= \int_X\nabla_\theta\log p(x;\theta) p(x;\theta) dx. \end{aligned} \end{equation} $$ Observe that $$ \frac{\partial}{\partial\theta_i}\log p(x;\theta) = \frac{\partial_i\,p(x;\theta)}{p(x;\theta)} $$ where \(\partial_i = {\partial}/{\partial\theta_i}\) denotes partial differentiation with respect to the i-th model parameter.
From this, we get $$ \begin{equation} \begin{aligned} \nabla_\theta\log p(x;\theta) p(x;\theta) &= \langle p(x;\theta)\partial_1\,\log p(x;\theta),\dots,p(x;\theta)\partial_m\,\log p(x;\theta)\rangle^T\\ &= \langle \partial_1\,p,\dots,\partial_m\,p\rangle^T \end{aligned} \end{equation} $$
Assuming the Leibniz integral rule holds, we can differentiate under the integral sign (or vice versa): $$ \int_X\partial_i\,p(x;\theta)dx = \partial_i\int_X p(x;\theta) = \partial_i(1) = 0 $$ hence $$ \text{E}_X\left[s(\theta;x)\right] = \int_X\nabla_\theta\log p(x;\theta) p(x;\theta) dx = \nabla_\theta \int_X p(x;\theta)dx= 0. $$
So much for the mean. Consider now the covariance matrix $$ s(\theta;x) s(\theta;x)^T = [\sigma_{ij}] $$ (remember that \(s(\theta;x) = \nabla_\theta{\ell(\theta;x)}\) is a column vector), with \(ij\)-th entry given by $$ \begin{equation} \begin{aligned} \sigma_{ij} &= \frac{1}{p(x;\theta)}\frac{\partial p(x;\theta)}{\partial\theta_i}\frac{1}{p(x;\theta)}\frac{\partial p(x;\theta)}{\partial\theta_j}\\ &= \frac{\partial_i\,p\partial_j\,p}{p^2(x;\theta)}. \end{aligned} \end{equation} $$
If we take the partial derivative wrt \(\theta_j\) of $$ \int_X\partial_i\log p(x;\theta) p(x;\theta) dx = 0 $$ we find $$ \begin{equation} \begin{aligned} \partial_j\int_X\partial_i\log p(x;\theta)p(x;\theta) dx &= \int_X\partial_j\left(\partial_i\log p(x;\theta) p(x;\theta)\right)dx\\ \end{aligned} \end{equation} $$ Applying the product rule to the integrand gives us $$ \begin{equation} \begin{aligned} \int_X\left(\partial_j\partial_i\log{p}\right)p + \left(\partial_i\log{p}\right)\left(\partial_i p\right)\,dx= 0 \end{aligned} \end{equation} $$
So we now have an equality $$ \begin{equation} \begin{aligned} \int_X\left(\partial_j\partial_i\log{p}\right)p\,dx &= -\int_X\left(\partial_i\log{p}\right)\left(\partial_i p\right)\,dx\\ &= -\int_X\frac{\partial_i\,p\,\partial_j\,p}{p}dx \\ &= -\int_X \sigma_{ij}pdx = -\text{E}_X\left[\sigma_{ij}\right] = \text{E}_X\left[H_{ij}\right] \end{aligned} \end{equation} $$ where \(H\) is the Hessian matrix of the log-likelihood function $$ \left[H\right]_{ij} = \partial_i\partial_j\log p(x;\theta) = \frac{\partial^2}{\partial\theta_i\partial\theta_j}\log p(x;\theta). $$
The Hessian fulfils a role like the ordinary second derivative for single-variable functions, in particular it arises in generalizing the second-derivative test, acting as a measure of local curvature of the function, letting us know what type of critical point we're at (max, min, etc.).
Therefore the expression $$ \text{E}_X\left[\sigma_{ij}\right] = -\text{E}_X\left[H_{ij}\right] $$ relates the covariances of model parameters to the local curvature of the log-likelihood function.
References
- [0] Amari and Nagaoka, Methods of Information Geometry