Maximum likelihood for one variable

Summary

Setup and statistical model

  • Given: a random sample \(y_1, \ldots ,y_n\), where each \(y_i\) is a scalar. \(y_i\) may be discrete or continuous.
  • We postulate a statistical model by assuming that the density/distribution of \(y_i\) is known up to an unknown \(p×1\) vector of parameters \(θ\) .
  • The density/distribution function is denoted by

\[f\left( y_i;θ \right)\]

  • If \(y_i\) is continuous, then \(f\left( y_i;θ \right)\) is the probability density function; if \(y_i\) is discrete, it is the probability mass function (also called the probability distribution function).
  • The density/distribution function viewed as a function of \(θ\) is called the likelihood contribution or the individual likelihood function and it is denoted by \(L_i\left( θ \right)\) ,

\[L_i\left( θ \right)=f\left( y_i;θ \right)\]
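
  • Example: if \(y_i\) follows a Bernoulli distribution with success probability \(θ\) (a single parameter, so \(p=1\)), the likelihood contribution is

\[L_i\left( θ \right)=θ^{y_i}{\left( 1-θ \right)}^{1-y_i}\]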

Joint density and likelihood function

  • Because the observations in a random sample are independent, the joint density function is the product of the marginal densities

\[f_J\left( y;θ \right)=\prod_{i=1}^{n}{ f\left( y_i;θ \right) }\]

  • The joint density/distribution function viewed as a function of \(θ\) is called the likelihood function and it is denoted by \(L\left( θ \right)\) ,

\[L\left( θ \right)=\prod_{i=1}^{n}{ L_i\left( θ \right) }\]
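
  • Example (Bernoulli, continued): writing \(m=\sum_{i=1}^{n}{y_i}\) for the number of successes,

\[L\left( θ \right)=θ^{m}{\left( 1-θ \right)}^{n-m}\]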

Log-likelihood

  • The log-likelihood contribution or the individual log-likelihood function is defined as the (natural) logarithm of the likelihood contribution. The log-likelihood contribution is denoted by \(l_i\left( θ \right)\) :

\[l_i\left( θ \right)=\log L_i\left( θ \right)\]

  • The log-likelihood function is defined as the (natural) logarithm of the likelihood function. The log-likelihood function is denoted by \(l\left( θ \right)\) :

\[l\left( θ \right)=\log L\left( θ \right)\]

  • Result

\[l\left( θ \right)=\sum_{i=1}^{n}{ l_i\left( θ \right) }\]
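
  • Example (Bernoulli, continued):

\[l\left( θ \right)=\sum_{i=1}^{n}{ \left[ y_i\log θ+\left( 1-y_i \right)\log \left( 1-θ \right) \right] }=m\log θ+\left( n-m \right)\log \left( 1-θ \right)\]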

Maximum likelihood

  • Definition: The maximum likelihood estimator of \(θ\), denoted by \({\hat{θ}}_{ML}\), is defined as

\[{\hat{θ}}_{ML}=\arg\max_{θ} L\left( θ \right)\]

  • “arg max” means the argument that maximizes the function
  • Result: since \(\log x\) is strictly increasing,

\[{\hat{θ}}_{ML}=\arg\max_{θ} l\left( θ \right)\]

  • It is generally simpler to maximize \(l\left( θ \right)\) than \(L\left( θ \right)\); in practice the maximization is often carried out numerically, as in the sketch at the end of this subsection.
  • Result: if the model is correctly specified, then under weak regularity conditions \({\hat{θ}}_{ML}\) is a consistent estimator of \(θ\):

\[\operatorname{plim} {\hat{θ}}_{ML}=θ\]
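
A minimal numerical sketch of this maximization, assuming the Bernoulli example above; the sample and `theta_true` are made-up inputs for illustration, and SciPy's bounded scalar minimizer is applied to the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Made-up Bernoulli sample for illustration.
rng = np.random.default_rng(0)
theta_true = 0.3                                  # hypothetical true parameter
y = rng.binomial(1, theta_true, size=500)

def neg_loglik(theta):
    # -l(theta) = -sum_i [ y_i log(theta) + (1 - y_i) log(1 - theta) ]
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# Maximizing l(theta) is the same as minimizing -l(theta) over (0, 1).
res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, y.mean())  # numerical MLE vs. the analytic MLE (sample mean)
```

For the Bernoulli model the numerical optimum coincides with the analytic MLE \(\bar{y}\) derived below from the first order condition.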

Score vector

  • The individual score vector \(s_i\left( θ \right)\) is defined as

\[s_i\left( θ \right)= \frac{∂l_i\left( θ \right)}{∂θ}\]

  • \(s_i\left( θ \right)\) is \(p×1\) .
  • The score vector \(s\left( θ \right)\) is defined as

\[s\left( θ \right)= \frac{∂l\left( θ \right)}{∂θ}\]

  • \(s\left( θ \right)\) is \(p×1\) .
  • Result:

\[s\left( θ \right)=\sum_{i=1}^{n}{ s_i\left( θ \right) }\]

  • First order condition for \({\hat{θ}}_{ML}\) :

\[s\left( {\hat{θ}}_{ML} \right)=0\]
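
  • Example (Bernoulli, continued): with \(m=\sum_{i=1}^{n}{y_i}\) as before, the score is \(s\left( θ \right)=\frac{m}{θ}-\frac{n-m}{1-θ}\), and solving the first order condition gives

\[{\hat{θ}}_{ML}=\frac{m}{n}=\bar{y}\]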

Information matrix

  • The information matrix \(I\left( θ \right)\), which is the same for every \(i=1, \ldots ,n\) because the \(y_i\) are identically distributed, is defined as

\[I\left( θ \right)=-E\left( \frac{∂^2l_i\left( θ \right)}{∂θ∂θ'} \right)=-E\left( \frac{∂s_i\left( θ \right)}{∂θ'} \right)\]

  • \(I\left( θ \right)\) is \(p×p\) .
  • Result (the information matrix equality)

\[I\left( θ \right)=E\left( \frac{∂l_i\left( θ \right)}{∂θ} \frac{∂l_i\left( θ \right)}{∂θ'} \right)=E\left( s_i\left( θ \right)s_i{\left( θ \right)}' \right)\]
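
  • Example (Bernoulli, continued): \(\frac{∂^2l_i\left( θ \right)}{∂θ^2}=-\frac{y_i}{θ^2}-\frac{1-y_i}{{\left( 1-θ \right)}^2}\), and taking expectations with \(E\left( y_i \right)=θ\) gives

\[I\left( θ \right)=\frac{1}{θ}+\frac{1}{1-θ}=\frac{1}{θ\left( 1-θ \right)}\]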

The asymptotic distribution of the MLE

  • Result: if the model is correctly specified, then under weak regularity conditions

\[\sqrt{n}\left( {\hat{θ}}_{ML}-θ \right)\overset{d}{→}N\left( 0,V \right)\]

  • where \(V=I{\left( θ \right)}^{-1}\) , the asymptotic variance matrix of \({\hat{θ}}_{ML}\) .
  • Result: if the model is correctly specified, then under weak regularity conditions \({\hat{θ}}_{ML}\) is asymptotically efficient: it has the smallest asymptotic variance among all consistent estimators.
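
  • Example (Bernoulli, continued): \(V=I{\left( θ \right)}^{-1}=θ\left( 1-θ \right)\), so for large \(n\) the MLE \(\bar{y}\) is approximately \(N\left( θ,θ\left( 1-θ \right)/n \right)\), the familiar sampling distribution of a sample proportion.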

Estimating the asymptotic variance

  • In most cases, \(I\left( θ \right)\) cannot be found analytically.
  • However,

\[I\left( θ \right)=-E\left( \frac{∂^2l_i\left( θ \right)}{∂θ∂θ'} \right)\]

  • can be consistently estimated (by a law of large numbers) using the sample analogue

\[I_H\left( θ \right)=- \frac{1}{n}\sum_{i=1}^{n}{ \frac{∂^2l_i\left( θ \right)}{∂θ∂θ'} }\]

  • evaluated at the MLE, i.e., using \(I_H\left( {\hat{θ}}_{ML} \right)\).
  • Also,

\[I\left( θ \right)=E\left( \frac{∂l_i\left( θ \right)}{∂θ} \frac{∂l_i\left( θ \right)}{∂θ'} \right)=E\left( s_i\left( θ \right)s_i{\left( θ \right)}' \right)\]

  • can be consistently estimated using the sample analogue

\[I_G\left( θ \right)= \frac{1}{n}\sum_{i=1}^{n}{ s_i\left( θ \right)s_i{\left( θ \right)}' }\]

  • evaluated at the MLE, i.e., using \(I_G\left( {\hat{θ}}_{ML} \right)\).
  • We have two consistent estimators of \(I\left( θ \right)\) , \(I_H\left( {\hat{θ}}_{ML} \right)\) and \(I_G\left( {\hat{θ}}_{ML} \right)\) , where \(H\) stands for “Hessian” and \(G\) for “Gradient”. You can find many more.
  • Result

\[{\hat{V}}_H=I_H{\left( {\hat{θ}}_{ML} \right)}^{-1}\]

  • as well as

\[{\hat{V}}_G=I_G{\left( {\hat{θ}}_{ML} \right)}^{-1}\]

  • are consistent estimators of the asymptotic variance \(V\), the variance of \(\sqrt{n}\left( {\hat{θ}}_{ML}-θ \right)\) as \(n→∞\).
  • Result: for large \(n\), approximately,

\[{\hat{θ}}_{ML} \sim N\left( θ,n^{-1}{\hat{V}}_H \right)\]

  • and

\[{\hat{θ}}_{ML} \sim N\left( θ,n^{-1}{\hat{V}}_G \right)\]

  • \({\hat{V}}_H\) and \({\hat{V}}_G\) can be calculated numerically, as in the sketch below.
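
A concrete sketch, continuing the made-up Bernoulli example: the individual scores and second derivatives have simple closed forms, so \(I_H\) and \(I_G\) (and hence \({\hat{V}}_H\) and \({\hat{V}}_G\)) can be computed directly at the MLE.

```python
import numpy as np

# Made-up Bernoulli sample, as in the earlier sketch.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=500)
n = len(y)
theta_hat = y.mean()                      # analytic MLE

# Individual scores s_i(theta_hat); scalars here since p = 1.
s_i = y / theta_hat - (1 - y) / (1 - theta_hat)

# Individual second derivatives d^2 l_i / d theta^2 at theta_hat.
h_i = -y / theta_hat**2 - (1 - y) / (1 - theta_hat) ** 2

I_H = -h_i.mean()                         # Hessian-based estimate of I(theta)
I_G = np.mean(s_i**2)                     # gradient (outer-product) estimate

V_H = 1.0 / I_H                           # hat V_H
V_G = 1.0 / I_G                           # hat V_G
se = np.sqrt(V_H / n)                     # approximate standard error of theta_hat
print(V_H, V_G, se)
```

In this Bernoulli case both estimates reduce algebraically to \({\hat{θ}}_{ML}\left( 1-{\hat{θ}}_{ML} \right)\) at the MLE, so \({\hat{V}}_H={\hat{V}}_G\) exactly; in richer models the two generally differ in finite samples.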