Large Sample Properties of OLS: cont.
We have seen that under A.MLR1-2, A.MLR3' and A.MLR4, $\hat{\beta}$ is consistent
for $\beta$; i.e.
$$\operatorname{plim} \hat{\beta} = \beta.$$
This property ensures that, as the sample gets large, $\hat{\beta}$ becomes closer and
closer to $\beta$.
This is really important, but it is only a statement about convergence in probability, and so it tells us
nothing about the sampling distribution of OLS as $n$ gets large.
Under A.MLR6, i.e. $u|X \sim N(0, \sigma_u^2 I_n)$, we know that for any given $n$, $n > k$,
$$\hat{\beta}\,|\,X \sim N\big(\beta,\; \sigma_u^2 (X'X)^{-1}\big)$$
and also
$$\frac{\hat{\beta}_j - \beta_j}{se\big(\hat{\beta}_j\big)} \sim \text{Student-}t_{n-k}.$$
The issue is that assuming a normal distribution for the error is quite a strong
assumption. Via the central limit theorem it is possible to show that, even if
the errors are not normally distributed, we can still do valid inference (e.g.
t-tests, F-tests, etc.) as the sample size increases.
Result LS-OLS-3: Let A.MLR1-2, A.MLR3', A.MLR4-5 hold; then:
(i)
$$n^{1/2}\big(\hat{\beta} - \beta\big) \overset{d}{\to} N\Big(0,\; \big(\operatorname{plim}(X'X/n)\big)^{-1}\sigma_u^2\Big),$$
i.e. as $n \to \infty$, $n^{1/2}(\hat{\beta} - \beta)$ is distributed as a zero mean normal with variance
equal to $\big(\operatorname{plim}(X'X/n)\big)^{-1}\sigma_u^2$.
Note that $\big(\operatorname{plim}(X'X/n)\big)^{-1}\sigma_u^2$ is called the asymptotic variance of $n^{1/2}(\hat{\beta} - \beta)$,
also known as $\operatorname{avar}\big(n^{1/2}(\hat{\beta} - \beta)\big)$.
(ii)
$$\frac{\big(\operatorname{plim}(X'X/n)\big)^{1/2}}{\sigma_u}\; n^{1/2}\big(\hat{\beta} - \beta\big) \overset{d}{\to} N(0, I_k),$$
where $I_k$ is a $k \times k$ identity matrix.
(iii)
$$\frac{(X'X/n)^{1/2}}{\hat{\sigma}_u}\; n^{1/2}\big(\hat{\beta} - \beta\big) \overset{d}{\to} N(0, I_k),$$
where $\hat{\sigma}_u^2 = n^{-1}\sum_{i=1}^n \hat{u}_i^2$ or $\hat{\sigma}_u^2 = (n-k)^{-1}\sum_{i=1}^n \hat{u}_i^2$.
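To see Result LS-OLS-3 at work, here is a small Monte Carlo sketch (in Python; the data generating process and sample sizes are illustrative assumptions, not taken from these notes). It draws skewed, non-normal errors and checks that the studentized OLS slope behaves like a standard normal once $n$ is large:

```python
# A minimal Monte Carlo sketch illustrating Result LS-OLS-3 (iii): with
# non-normal (centered exponential) errors, the studentized OLS slope is
# still approximately N(0,1) once n is reasonably large.
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2 = 1.0, 0.5          # true coefficients (illustrative)
n_reps = 2000

for n in (25, 200, 2000):
    t_stats = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.normal(size=n)
        u = rng.exponential(1.0, size=n) - 1.0   # mean-zero, skewed errors
        y = beta1 + beta2 * x + u
        X = np.column_stack([np.ones(n), x])
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta_hat
        sigma2_hat = resid @ resid / (n - 2)
        var_hat = sigma2_hat * np.linalg.inv(X.T @ X)
        t_stats[r] = (beta_hat[1] - beta2) / np.sqrt(var_hat[1, 1])
    # Under asymptotic normality, |t| > 1.96 should occur about 5% of the time.
    print(f"n = {n:5d}: rejection rate at 1.96 = {np.mean(np.abs(t_stats) > 1.96):.3f}")
```

In experiments of this kind the empirical rejection rate is typically off 5% for small $n$ and approaches 5% as $n$ grows.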
Sketch of proof:
For sake of simplicity, we consider the simple linear model $y_i = \beta_1 + \beta_2 x_{2,i} + u_i$.
(i)
$$\hat{\beta}_2 = \frac{\sum_{i=1}^n (x_{2,i} - \bar{x}_2)\, y_i}{\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2},$$
so that
$$n^{1/2}\big(\hat{\beta}_2 - \beta_2\big) = \frac{n^{-1/2}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)\, u_i}{n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2}.$$
Now, $E\big((x_{2,i} - \bar{x}_2)\, u_i\big) = 0$, as the errors are uncorrelated with the regressors (and
have mean zero). Given that $E(u_i^2|x) = E(u_i^2) = \sigma_u^2$, it follows that
$$E\Big((x_{2,i} - \bar{x}_2)^2\, u_i^2\Big) = E\Big((x_{2,i} - \bar{x}_2)^2\, E(u_i^2|x_{2,i})\Big) = \operatorname{var}(x_2)\,\sigma_u^2.$$
Thus, by the central limit theorem,
$$n^{-1/2}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)\, u_i \overset{d}{\to} N\big(0,\, \operatorname{var}(x_2)\,\sigma_u^2\big),$$
and by noting that, because of the law of large numbers,
$$\operatorname{plim}\Big( n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2 \Big) = \operatorname{var}(x_2),$$
it follows that
$$n^{1/2}\big(\hat{\beta}_2 - \beta_2\big) \overset{d}{\to} N\Big(0,\; \frac{\sigma_u^2}{\operatorname{var}(x_2)}\Big).$$
(ii) Simply by standardization,
$$\frac{\operatorname{var}(x_2)^{1/2}}{\sigma_u}\; n^{1/2}\big(\hat{\beta}_2 - \beta_2\big) \overset{d}{\to} N(0, 1).$$
(iii) As $n \to \infty$, $\operatorname{plim}\hat{\sigma}_u^2 = \sigma_u^2$ and $\operatorname{plim}\, n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2 = \operatorname{var}(x_2)$.
In fact, as $n \to \infty$ we can use an estimator of the variance instead of the true
variance, provided the former is consistent.
Recall that, in the simple linear model,
$$se\big(\hat{\beta}_2\big) = \frac{\hat{\sigma}_u}{\Big(\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2\Big)^{1/2}}
= \frac{\hat{\sigma}_u / n^{1/2}}{\Big(n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2\Big)^{1/2}}.$$
As for large $n$, $\big(n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2\big)^{1/2}$ gets close to $\operatorname{var}(x_2)^{1/2}$, which is a
constant, and $\hat{\sigma}_u$ gets close to $\sigma_u$, which is a constant, we see that for large $n$,
$se(\hat{\beta}_2)$ is of order $n^{-1/2}$. For the multiple linear model, for $i = 1, \ldots, k$,
$$se\big(\hat{\beta}_i\big) = \big[(X'X/n)^{-1}\big]_{ii}^{1/2}\, \hat{\sigma}_u / n^{1/2},$$
so that $se(\hat{\beta}_i)$ is of order $n^{-1/2}$.
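As a quick numerical illustration of this rate (simulated data, chosen here purely for illustration), the following sketch shows that quadrupling the sample size roughly halves the estimated standard error of $\hat{\beta}_2$:

```python
# A small numerical sketch of the claim that se(beta_hat_2) is of order n^{-1/2}:
# quadrupling n roughly halves the standard error.
import numpy as np

rng = np.random.default_rng(1)

def slope_se(n):
    x = rng.normal(size=n)
    u = rng.normal(size=n)
    y = 1.0 + 0.5 * x + u
    X = np.column_stack([np.ones(n), x])
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - 2)
    return np.sqrt(sigma2_hat * np.linalg.inv(X.T @ X)[1, 1])

for n in (400, 1600, 6400):
    print(n, round(slope_se(n), 4))   # se roughly halves as n quadruples
```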
Note that
$$\frac{\hat{\beta}_i - \beta_i}{se\big(\hat{\beta}_i\big)} = \frac{n^{1/2}\big(\hat{\beta}_i - \beta_i\big)}{n^{1/2}\, se\big(\hat{\beta}_i\big)}
= n^{1/2}\big(\hat{\beta}_i - \beta_i\big)\, \frac{1}{\big[(X'X/n)^{-1}\big]_{ii}^{1/2}\, \hat{\sigma}_u},$$
with
$$\operatorname{plim} \frac{1}{\big[(X'X/n)^{-1}\big]_{ii}^{1/2}\, \hat{\sigma}_u}
= \frac{1}{\big[\big(\operatorname{plim}(X'X/n)\big)^{-1}\big]_{ii}^{1/2}\, \sigma_u},$$
which in the simple linear model equals $\operatorname{var}(x_2)^{1/2} / \sigma_u$.
Important consequences of asymptotic normality of OLS
Under A.MLR1-2, A.MLR3', A.MLR4-5, i.e. without assuming A.MLR6
and so without assuming normal errors:
#1 If $\beta_i = 0$, then $\hat{\beta}_i / se(\hat{\beta}_i)$ is asymptotically $N(0,1)$. Recalling that a
Student-t approaches a normal as the number of degrees of freedom approaches
infinity, it is also true that for $n$ large $\hat{\beta}_i / se(\hat{\beta}_i)$ is approximately distributed as a Student-t with
$n-k$ degrees of freedom. However, this is no longer true for "small" $n$.
#2 F-statistic.
$$F = \frac{(SSR_r - SSR_u)/q}{SSR_u/(n-k)},$$
where $SSR_r$ and $SSR_u$ denote respectively the sum of squared residuals of the
restricted and unrestricted models, and $q$ denotes the number of restrictions. If
the restrictions are "true", then $qF \overset{d}{\to} \chi^2(q)$. Thus, in finite samples $F$ is no
longer distributed as a Fisher-F with $(q, n-k)$ degrees of freedom, but as $n$ gets
large $qF$ will be distributed as $\chi^2(q)$.
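The following minimal sketch (simulated data and regressors chosen for illustration, not from the notes) computes $F$ from the restricted and unrestricted sums of squared residuals and compares $qF$ with the $\chi^2(q)$ critical value:

```python
# A hedged sketch of the relation between F and qF: we test q = 2 exclusion
# restrictions that are true in the simulated DGP, so qF should behave like a
# chi-squared(2) draw when n is large.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k, q = 1000, 4, 2                           # k parameters in unrestricted model
x = rng.normal(size=(n, 3))
y = 1.0 + 0.5 * x[:, 0] + rng.normal(size=n)   # x[:,1], x[:,2] truly irrelevant

X_u = np.column_stack([np.ones(n), x])         # unrestricted design
X_r = np.column_stack([np.ones(n), x[:, 0]])   # restricted design

def ssr(X, y):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e

SSR_u, SSR_r = ssr(X_u, y), ssr(X_r, y)
F = ((SSR_r - SSR_u) / q) / (SSR_u / (n - k))
print("F =", F, " qF =", q * F)
print("chi2(2) 5% critical value:", stats.chi2.ppf(0.95, q))
```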
#3 Wald statistic.
$$W = \big(R\hat{\beta} - r\big)' \Big[\hat{\sigma}_u^2\, R (X'X)^{-1} R'\Big]^{-1} \big(R\hat{\beta} - r\big).$$
If the $q$ restrictions are true, then $W \overset{d}{\to} \chi^2(q)$.
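A minimal sketch of the Wald statistic, again on simulated data (an illustrative assumption) and using the homoskedastic variance estimate $\hat{\sigma}_u^2 R(X'X)^{-1}R'$ as in the formula above:

```python
# Wald statistic W for q = 2 linear restrictions R*beta = r; under true
# restrictions W is approximately chi-squared(2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(size=(n, 3))
y = 1.0 + 0.5 * x[:, 0] + rng.normal(size=n)     # coefficients on x2, x3 are zero
X = np.column_stack([np.ones(n), x])             # columns: const, x1, x2, x3

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])

R = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])             # restrictions: last two betas = 0
r = np.zeros(2)
diff = R @ beta_hat - r
V = sigma2_hat * R @ np.linalg.inv(X.T @ X) @ R.T
W = diff @ np.linalg.solve(V, diff)
print("W =", W, " p-value =", 1 - stats.chi2.cdf(W, df=2))
```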
#4 Lagrange Multiplier Test.
(See the Week 4 notes for the definition.) If the restrictions are true, then $nR^2 \overset{d}{\to} \chi^2(q)$,
where $q$ denotes the number of restrictions.
Let $X_r$ be the set of regressors used in the restricted model. Note that
$$X_r' \hat{u}_r = X_r' y - X_r' X_r (X_r' X_r)^{-1} X_r' y = X_r' y - X_r' y = 0,$$
so it is the same whether we regress the residuals from the restricted model on
all the regressors or only on the omitted ones.
Example of LM test
Unrestricted model:
$$narr86 = \beta_1 + \beta_2\, pcnv + \beta_3\, avgsen + \beta_4\, tottime + \beta_5\, ptime86 + \beta_6\, qemp86 + u.$$
Sample: $n = 2725$; information on men arrested in 1986, who were born in
1960-61 and who had been arrested at least once prior to 1986. Here $narr86$ is the number
of times arrested in 1986 (from 0 to 12), $pcnv$ the proportion of arrests leading
to conviction, $avgsen$ the average sentence length served (often 0), $ptime86$ the months
spent in prison in 1986, $tottime$ the total number of months spent in prison by the
individual, and $qemp86$ the number of quarters the individual was employed in 1986. We
estimate a restricted model, in which $avgsen$ and $tottime$ are omitted, and get
$$\widehat{narr86} = .712 - .15\, pcnv - .034\, ptime86 - .104\, qemp86,$$
$$n = 2725, \quad R^2 = .0413.$$
We now take the residuals from the restricted model, say $\hat{u}_i$, and regress them
on $pcnv$, $ptime86$, $qemp86$, $tottime$ and $avgsen$; the resulting $R^2$ is $0.0015$, and
$nR^2 = 2725 \times 0.0015 = 4.09$. The 10% critical value of $\chi^2(2)$ is $4.61$; thus we cannot
reject at 10%. Now, $\Pr\big(\chi^2(2) > 4.09\big) = .129$, i.e. the p-value is .129. Thus, we
can only reject at significance levels higher than 13%.
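The mechanics of the $nR^2$ test can be sketched as follows (simulated data that mimics the structure of the example, with two truly excluded regressors; this is not the actual arrest data):

```python
# LM (nR^2) test sketch: estimate the restricted model, regress its residuals
# on ALL regressors, and compare nR^2 with the chi-squared(q) critical value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, q = 2000, 2
x = rng.normal(size=(n, 5))
y = 0.7 - 0.15 * x[:, 0] - 0.03 * x[:, 1] - 0.1 * x[:, 2] + rng.normal(size=n)
# x[:, 3] and x[:, 4] play the role of avgsen and tottime: truly excluded.

X_r = np.column_stack([np.ones(n), x[:, :3]])    # restricted regressors
X_f = np.column_stack([np.ones(n), x])           # full set of regressors

b_r = np.linalg.solve(X_r.T @ X_r, X_r.T @ y)
u_r = y - X_r @ b_r                               # restricted residuals

# Auxiliary regression of the restricted residuals on all regressors; its R^2
# drives the LM statistic.
g = np.linalg.solve(X_f.T @ X_f, X_f.T @ u_r)
fitted = X_f @ g
R2 = 1 - np.sum((u_r - fitted) ** 2) / np.sum((u_r - u_r.mean()) ** 2)

LM = n * R2
print("nR^2 =", LM, " 10% critical value chi2(2):", stats.chi2.ppf(0.90, q))
```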
Asymptotic efficiency
We have seen that under A.MLR1-A.MLR5 OLS is BLUE, the best linear unbiased
estimator (Gauss-Markov theorem). Best means that there is no other
linear unbiased estimator which has a smaller variance. This holds for any sample size
$n$, $n > k$. Given two unbiased estimators for $\beta$, say $\hat{\beta}$ and $\tilde{\beta}$, we say that $\hat{\beta}$ is
more efficient than $\tilde{\beta}$ if $\operatorname{var}(\hat{\beta}) \le \operatorname{var}(\tilde{\beta})$.
We see from Result LS-OLS-3, asymptotic normality of OLS, that
$$\operatorname{avar}\big(n^{1/2}(\hat{\beta} - \beta)\big) = \lim_{n\to\infty} \operatorname{var}\big(n^{1/2}(\hat{\beta} - \beta)\big) = \big(\operatorname{plim}(X'X/n)\big)^{-1}\sigma_u^2.$$
Under A.MLR1-2, A.MLR3' and A.MLR4-5, the OLS estimator has the smallest
asymptotic variance: for any other consistent estimator of $\beta$, say $\tilde{\beta}$, we have
$$\operatorname{avar}\big(n^{1/2}(\hat{\beta} - \beta)\big) \le \operatorname{avar}\big(n^{1/2}(\tilde{\beta} - \beta)\big).$$
Heteroskedasticity
In deriving the (asymptotic) variance of OLS, we have assumed A.MLR5,
according to which $E(uu'|X) = E(uu') = \sigma_u^2 I_n$.
This assumption in fact combines two assumptions:
(i) $E(u_i^2|X) = E(u_i^2) = \sigma_u^2$ for all $i$. This means that the elements on the main
diagonal do not depend on $X$ and they are all equal. This is what we call
conditional homoskedasticity.
(ii) $E(u_i u_j |X) = 0$ for all $i \ne j$; this means that all elements off the main
diagonal are equal to zero.
Such an assumption is known as serially uncorrelated errors (or non-autocorrelated
errors). It is more correct to define it as conditional uncorrelation.
Under A.MLR2, $(y_i, x_i)$ are independently and identically distributed (iid),
and so $u = y - X\beta$ is also an iid vector and thus (ii) always holds.
You will consider the case in which (ii) fails to hold in the time series part of this
course.
Consequences of failure of conditional homoskedasticity.
What happens when conditional homoskedasticity fails to hold?
We did not use the assumption of conditional homoskedasticity (A.MLR5) to
show unbiasedness or consistency of OLS (or any other estimator); thus violation
of (i) does not cause inconsistent (or biased) estimators.
On the other hand, the OLS estimators are no longer efficient, in the sense that
they no longer have the smallest possible variance.
In particular, the Gauss-Markov theorem no longer holds, i.e. OLS is no
longer the best linear unbiased estimator, and, in large samples, OLS no
longer has the smallest asymptotic variance.
Thus, once we drop the assumption of conditional homoskedasticity, OLS is
no longer efficient or asymptotically efficient.
However, there is another, bigger problem.
Tests based on OLS, e.g. tests for linear restrictions, require a consistent
estimator of the standard error in order to provide asymptotically valid inference.
So far we have used estimators of the variance which are consistent under
the assumption of conditional homoskedasticity. If this assumption fails to hold,
then the "usual" standard errors are no longer consistent for the true variance.
As an estimator of $Var\big(n^{1/2}(\hat{\beta} - \beta)\big)$ we used $\hat{\sigma}_u^2 (X'X/n)^{-1}$ (this is indeed
also the estimator provided by most computer packages). Such an estimator is
consistent for $\sigma_u^2 \big(\operatorname{plim}(X'X/n)\big)^{-1}$, but the latter is no longer the true variance!
For sake of simplicity, consider the simple model $y_i = \beta_1 + \beta_2 x_{2,i} + u_i$:
$$n^{1/2}\big(\hat{\beta}_2 - \beta_2\big) = \frac{n^{-1/2}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)\, u_i}{n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2}.$$
Now, recalling that $(y_i, x_{2,i})$ are iid,
$$E\Bigg(\Big(n^{-1/2}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)\, u_i\Big)^{\!2}\Bigg)
= E\Big((x_{2,i} - \bar{x}_2)^2\, u_i^2\Big)
= E\Big((x_{2,i} - \bar{x}_2)^2\, E(u_i^2|x_{2,i})\Big)
\ne \operatorname{var}(x_2)\,\sigma_u^2,$$
as $E(u_i^2|x_{2,i}) \ne \sigma_u^2$. Thus,
$$\operatorname{avar}\big(n^{1/2}(\hat{\beta}_2 - \beta_2)\big) \ne \frac{\sigma_u^2}{\operatorname{var}(x_2)}.$$
Thus the usual standard errors provided by the packages are not consistent
for the true standard deviation, and inference based on them is no longer valid.
That is, we believe we are (or are not) rejecting the null at, say, 5%, but this is not true.
Furthermore, we do not know whether the standard errors provided by the computer
are an overestimate or an underestimate of the true ones! (This depends on the
particular case.)
White's Standard Errors
A variance estimator which is consistent even in the case of conditional
heteroskedasticity has been proposed by White (1980). White's estimator
for $Var\big(n^{1/2}(\hat{\beta} - \beta)\big)$ is
$$(X'X/n)^{-1}\Big(n^{-1}\sum_{i=1}^n \hat{u}_i^2\, x_i x_i'\Big)(X'X/n)^{-1},$$
where the $\hat{u}_i$ are the OLS residuals, $\hat{u} = y - X\hat{\beta}$. Such an estimator is implemented
by computer packages nowadays; one just needs to use the "White" (or "robust") option.
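To make the formula concrete, here is a minimal sketch of the sandwich computation on simulated heteroskedastic data (the design below is an illustrative assumption, not from the notes); the "White" option in standard packages computes the same quantity:

```python
# White (1980) estimator of Var(n^{1/2}(beta_hat - beta)):
# (X'X/n)^{-1} (n^{-1} sum_i uhat_i^2 x_i x_i') (X'X/n)^{-1},
# compared with the usual homoskedastic estimate sigma2_hat * (X'X/n)^{-1}.
import numpy as np

rng = np.random.default_rng(5)
n = 5000
x = rng.normal(size=n)
u = rng.normal(size=n) * np.sqrt(0.5 + x**2)     # error variance depends on x
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones(n), x])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
uhat = y - X @ beta_hat

XtX_n_inv = np.linalg.inv(X.T @ X / n)
meat = (X * uhat[:, None]**2).T @ X / n          # n^{-1} sum_i uhat_i^2 x_i x_i'
V_white = XtX_n_inv @ meat @ XtX_n_inv           # White sandwich estimator
V_usual = (uhat @ uhat / (n - 2)) * XtX_n_inv    # usual homoskedastic estimate

# Standard errors of beta_hat itself: divide by n and take square roots.
print("White se:", np.sqrt(np.diag(V_white) / n))
print("Usual se:", np.sqrt(np.diag(V_usual) / n))
```

Under this heteroskedastic design the two sets of standard errors stay apart even for large $n$, which is exactly the point made below.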
In terms of the simple linear model, the usual estimator for $\operatorname{avar}\big(n^{1/2}(\hat{\beta}_2 - \beta_2)\big)$
is
$$\frac{\hat{\sigma}_u^2}{n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2},$$
while the White estimator for $\operatorname{avar}\big(n^{1/2}(\hat{\beta}_2 - \beta_2)\big)$ is
$$\frac{n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2\, \hat{u}_i^2}{\Big(n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2\Big)^{2}}.$$
Clearly, if conditional homoskedasticity indeed holds, then as $n$ gets large, the
usual standard errors and White's standard errors approach the same probability
limit, and so for large $n$ they are very close to each other.
On the other hand, if conditional homoskedasticity fails to hold, then the usual
and White's standard errors have different probability limits, and so for large $n$
they will be far apart.
Now, heteroskedasticity-robust t-statistics are constructed as
$$\frac{\hat{\beta}_i - \beta_i}{(White)\, se\big(\hat{\beta}_i\big)} = \frac{n^{1/2}\big(\hat{\beta}_i - \beta_i\big)}{n^{1/2}\, (White)\, se\big(\hat{\beta}_i\big)},$$
i.e. by scaling using the White standard errors instead of the usual ones.
IMPORTANT: Even if A.MLR5 (conditional homoskedasticity) and A.MLR6
(conditionally normal errors) hold, if we are using White standard errors, then,
in finite samples, $\dfrac{\hat{\beta}_i - \beta_i}{(White)\, se(\hat{\beta}_i)}$ is NOT distributed as a Student-t with $n-k$ degrees
of freedom, though asymptotically it is distributed as a standard normal.
Moral: by using White standard errors, we cannot draw any exact (finite sample)
inference. This is the price for robustness.
Second moral: if $n$ is large enough, there is no cost in using White SE.
Example: below we report the findings for the log-wage equation, reporting
both the usual standard errors and White's standard errors, the latter in square
brackets. Below, $marmale$ is a variable equal to 1 if the individual is a married
male, $marfem$ is a variable equal to 1 if the individual is a married female, and
$singfem$ is a variable equal to 1 if the individual is a single female:
$$\begin{aligned}
\widehat{\log wage} = \; & .321\,(.10)[.109] + .213\, marmale\,(.055)[.057] \\
& - .198\, marfem\,(.058)[.058] - .11\, singfem\,(.056)[.057] \\
& + .079\, educ\,(.0067)[.0074] + .027\, exper\,(.0055)[.0051] \\
& - .00054\, exper^2\,(.00011)[.00011] + .029\, tenure\,(.0068)[.0069] \\
& - .00053\, tenure^2\,(.00023)[.00024],
\end{aligned}$$
$$n = 526 \quad \text{and} \quad R^2 = .461.$$
Note: (i) The usual and White SE are very close. Inference drawn on the usual
or White's SE leads to the same conclusion, e.g. a parameter significantly
different from 0 at 5% using the usual SE is also significantly different from 0
at 5% using the White SE. (ii) We should then infer that there is no conditional
heteroskedasticity in this case. (iii) White SE can be either smaller or larger
than the usual SE.