The linear regression model with a missing variable

Summary

  • Setup: random sample \(\left( y_i,x_i,z_i \right)\) for \(i=1, \ldots ,n\) where \(x_i,z_i\) are scalars.
  • True relation (\(θ≠0\)):

\[E\left( y_i \mid x_i,z_i \right)=βx_i+θz_i\]

  • We incorrectly believe that \(θ=0\), so that

\[E\left( y_i \mid x_i,z_i \right)=βx_i\]

  • We believe that in the model

\[y_i=βx_i+ε_i\]

  • the exogeneity assumption is satisfied,

\[E\left( ε_i \mid x_i,z_i \right)=0\]

  • but this is wrong. In fact:

\[E\left( ε_i \mid x_i,z_i \right)=E\left( y_i-βx_i \mid x_i,z_i \right)=βx_i+θz_i-βx_i=θz_i≠0\]

  • For our OLS estimator of \(β\)

\[b= \frac{∑x_iy_i}{\sum{ x_i^2 }}\]

  • we have, after substituting \(y_i=βx_i+θz_i+u_i\) with \(E\left( u_i \mid x,z \right)=0\),

\[E\left( b \mid x,z \right)=β+θ \frac{∑x_iz_i}{\sum{ x_i^2 }}\]

  • and

\[E\left( b \right)=β+θE\left( \frac{∑x_iz_i}{\sum{ x_i^2 }} \right)\]

  • \(b\) will generally be a biased and inconsistent estimator of \(β\) if

\[E\left( \frac{∑x_iz_i}{\sum{ x_i^2 }} \right)≠0\]

  • The term

\[θE\left( \frac{∑x_iz_i}{\sum{ x_i^2 }} \right)\]

  • is called the “missing variable bias” (more commonly known as omitted variable bias).
  • Since we are almost certainly always missing some variables in applied work, is this the “end of econometrics”? Not really: \(b\) is “almost” unbiased if \(θ\) is small or the correlation between \(x\) and \(z\) is small. Even if this is not the case, there is a part 2 to this story.
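
The bias formula can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original notes: the function name `simulate_ols_without_z`, the parameter values, and the choice of standard normal \(x\), \(z\) and error term are all illustrative assumptions. It compares the average of \(b\) across simulated samples with \(β+θ\) times the average of \(∑x_iz_i/∑x_i^2\).

```python
import numpy as np


def simulate_ols_without_z(beta=1.0, theta=0.5, rho=0.6, n=200, reps=2000, seed=0):
    """Monte Carlo check of the missing variable bias formula.

    All defaults are illustrative assumptions: x and z are standard normal
    with correlation rho, and the error u is standard normal.
    Returns (average of b, beta + theta * average of sum(x*z)/sum(x^2)).
    """
    rng = np.random.default_rng(seed)

    # Each row is one simulated sample of size n.
    x = rng.standard_normal((reps, n))
    # Build z so that corr(x, z) = rho.
    z = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal((reps, n))
    u = rng.standard_normal((reps, n))

    # True relation: E(y | x, z) = beta * x + theta * z.
    y = beta * x + theta * z + u

    # OLS without intercept, regressing y on x only (z is left out).
    sxx = np.sum(x**2, axis=1)
    b = np.sum(x * y, axis=1) / sxx
    ratio = np.sum(x * z, axis=1) / sxx

    return b.mean(), beta + theta * ratio.mean()


if __name__ == "__main__":
    for rho in (0.0, 0.6):
        mean_b, predicted = simulate_ols_without_z(rho=rho)
        print(f"rho={rho}: average b = {mean_b:.3f}, formula = {predicted:.3f}")
```

Under these assumptions the two numbers should agree closely for any `rho`, and the average of \(b\) should be near \(β\) only when `rho` (or `theta`) is close to zero, which matches the remark that \(b\) is “almost” unbiased in those cases.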