Large Sample Properties of OLS: cont.

We have seen that under A.MLR1-2, A.MLR3' and A.MLR4, $\hat{\beta}$ is consistent for $\beta$, i.e. $\operatorname{plim}\hat{\beta} = \beta$. This property ensures that, as the sample gets large, $\hat{\beta}$ becomes closer and closer to $\beta$. This is really important, but it is a pointwise property, and so it tells us nothing about the sampling distribution of OLS as $n$ gets large.

Under A.MLR6, i.e. $u|X \sim N(0, \sigma_u^2 I_n)$, we know that for any given $n$, $n > k$,
$$\hat{\beta}\,|\,X \sim N\!\left(\beta,\; \sigma_u^2 (X'X)^{-1}\right)$$
and also
$$\frac{\hat{\beta}_i - \beta_i}{se(\hat{\beta}_i)} \sim \text{Student } t_{n-k}.$$
The issue is that assuming a normal distribution for the error is quite a strong assumption. Via the central limit theorem it is possible to show that, even if the errors are not normally distributed, we can still do valid inference (e.g. t-tests, F-tests, etc.) as the sample size increases.

Result LS-OLS-3: Let A.MLR1-2, A.MLR3', A.MLR4-5 hold. Then:

(i) $n^{1/2}(\hat{\beta} - \beta) \xrightarrow{d} N\!\left(0,\; (\operatorname{plim}(X'X/n))^{-1}\sigma_u^2\right)$, i.e. as $n \to \infty$, $n^{1/2}(\hat{\beta} - \beta)$ is distributed as a zero-mean normal with variance equal to $(\operatorname{plim}(X'X/n))^{-1}\sigma_u^2$.

Note that $(\operatorname{plim}(X'X/n))^{-1}\sigma_u^2$ is called the asymptotic variance of $n^{1/2}(\hat{\beta} - \beta)$, also written $\operatorname{avar}\!\left(n^{1/2}(\hat{\beta} - \beta)\right)$.

(ii) $\dfrac{(\operatorname{plim}(X'X/n))^{1/2}}{\sigma_u}\, n^{1/2}(\hat{\beta} - \beta) \xrightarrow{d} N(0, I_k)$, where $I_k$ is a $k \times k$ identity matrix.

(iii) $\dfrac{(X'X/n)^{1/2}}{\hat{\sigma}_u}\, n^{1/2}(\hat{\beta} - \beta) \xrightarrow{d} N(0, I_k)$, where $\hat{\sigma}_u^2 = n^{-1}\sum_{i=1}^n \hat{u}_i^2$ or $\hat{\sigma}_u^2 = (n-k)^{-1}\sum_{i=1}^n \hat{u}_i^2$.

Sketch of proof: For the sake of simplicity, we consider the simple linear model $y_i = \beta_1 + \beta_2 x_{2,i} + u_i$.

(i) Since
$$\hat{\beta}_2 = \frac{\sum_{i=1}^n (x_{2,i} - \bar{x}_2)(y_i - \bar{y})}{\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2},$$
we have
$$n^{1/2}(\hat{\beta}_2 - \beta_2) = \frac{n^{-1/2}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)\, u_i}{n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2}.$$
Now, $E\left((x_{2,i} - \mu_{x_2})\, u_i\right) = 0$, where $\mu_{x_2} = E(x_{2,i})$, as the errors are uncorrelated with the regressors (and have mean zero); replacing the sample mean $\bar{x}_2$ by $\mu_{x_2}$ is asymptotically innocuous. Given that $E(u_i^2|x) = E(u_i^2) = \sigma_u^2$, it follows that
$$E\!\left((x_{2,i} - \mu_{x_2})^2 u_i^2\right) = E\!\left((x_{2,i} - \mu_{x_2})^2\, E(u_i^2|x_{2,i})\right) = E\!\left((x_{2,i} - \mu_{x_2})^2\right)\sigma_u^2 = \operatorname{var}(x_2)\,\sigma_u^2.$$
Thus, by the central limit theorem,
$$n^{-1/2}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)\, u_i \xrightarrow{d} N\!\left(0,\; \operatorname{var}(x_2)\,\sigma_u^2\right),$$
and by noting that, because of the law of large numbers,
$$\operatorname{plim}\left(n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2\right) = \operatorname{var}(x_2),$$
it follows that
$$n^{1/2}(\hat{\beta}_2 - \beta_2) \xrightarrow{d} N\!\left(0,\; \frac{\sigma_u^2}{\operatorname{var}(x_2)}\right).$$

(ii) Simply by standardization,
$$\frac{\operatorname{var}(x_2)^{1/2}}{\sigma_u}\, n^{1/2}(\hat{\beta}_2 - \beta_2) \xrightarrow{d} N(0,1).$$

(iii) As $n \to \infty$, $\operatorname{plim}\hat{\sigma}_u^2 = \sigma_u^2$ and $\operatorname{plim}\, n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2 = \operatorname{var}(x_2)$. In fact, as $n \to \infty$ we can use an estimator of the variance instead of the true variance, provided the former is consistent.

Recall that, in the simple linear model,
$$se(\hat{\beta}_2) = \frac{\hat{\sigma}_u}{\left(\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2\right)^{1/2}} = \frac{\hat{\sigma}_u}{n^{1/2}\left(n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2\right)^{1/2}}.$$
As, for large $n$, $\left(n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2\right)^{1/2}$ gets close to $\operatorname{var}(x_2)^{1/2}$, which is a constant, and $\hat{\sigma}_u$ gets close to $\sigma_u$, which is a constant, we see that for large $n$, $se(\hat{\beta}_2)$ is of order $n^{-1/2}$.

For the multiple linear model, for $i = 1, \ldots, k$,
$$se(\hat{\beta}_i) = \left((X'X/n)^{-1}_{ii}\right)^{1/2} \hat{\sigma}_u / n^{1/2},$$
so that $se(\hat{\beta}_i)$ is of order $n^{-1/2}$.

Note that
$$\frac{\hat{\beta}_i - \beta_i}{se(\hat{\beta}_i)} = \frac{n^{1/2}(\hat{\beta}_i - \beta_i)}{n^{1/2}\, se(\hat{\beta}_i)} = \frac{n^{1/2}(\hat{\beta}_i - \beta_i)}{\left((X'X/n)^{-1}_{ii}\right)^{1/2}\hat{\sigma}_u},$$
with $\operatorname{plim}\, n^{1/2}\, se(\hat{\beta}_i) = \sigma_u \left((\operatorname{plim}(X'X/n))^{-1}_{ii}\right)^{1/2}$ (in the simple model, $\sigma_u / \operatorname{var}(x_2)^{1/2}$).

Important consequences of asymptotic normality of OLS

Under A.MLR1-2, A.MLR3', A.MLR4-5, i.e. without assuming A.MLR6 and so without assuming normal errors:

#1 If $\beta_i = 0$, then $\hat{\beta}_i / se(\hat{\beta}_i)$ is asymptotically $N(0,1)$. Recalling that a Student-t approaches a normal as the number of degrees of freedom approaches infinity, it is also true that for large $n$, $\hat{\beta}_i / se(\hat{\beta}_i)$ is approximately distributed as a Student-t with $n-k$ degrees of freedom. However, this is no longer true for "small" $n$.

#2 F-statistic.
$$F = \frac{(SSR_r - SSR_u)/q}{SSR_u/(n-k)},$$
where $SSR_r$ and $SSR_u$ denote respectively the sum of squared residuals of the restricted and unrestricted model, and $q$ denotes the number of restrictions. If the restrictions are "true", then $qF \xrightarrow{d} \chi^2(q)$. Thus, in finite samples $F$ is no longer distributed as a Fisher-F with $(q, n-k)$ degrees of freedom, but as $n$ gets large $qF$ will be distributed as $\chi^2(q)$.
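Consequence #1 can be checked numerically. The following is a minimal simulation sketch that is not part of the original notes: the data-generating process (uniform regressor, centred exponential errors) and all names are illustrative choices, and numpy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)

def t_ratio(n):
    """One simulated t-ratio for H0: beta_2 = 0 in y = beta_1 + beta_2*x_2 + u."""
    x2 = rng.uniform(0.0, 10.0, n)
    u = rng.exponential(1.0, n) - 1.0          # skewed, mean-zero errors: A.MLR6 fails
    y = 1.0 + 0.0 * x2 + u                     # the null beta_2 = 0 is true
    X = np.column_stack([np.ones(n), x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    uhat = y - X @ b
    s2 = uhat @ uhat / (n - X.shape[1])        # sigma_u^2 hat, (n - k) version
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return b[1] / se

for n in (10, 2000):
    t = np.array([t_ratio(n) for _ in range(5000)])
    # With non-normal errors the exact t(n-k) result fails, but the rejection
    # frequency of |t| > 1.96 should approach 5% as n grows (consequence #1).
    print(n, round(np.mean(np.abs(t) > 1.96), 3))
```

In such a simulation the rejection rate is typically visibly off 5% for very small $n$ but close to it for large $n$, in line with the asymptotic argument above.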
#3 Wald statistic.
$$W = (R\hat{\beta} - r)'\left[\hat{\sigma}_u^2\, R (X'X)^{-1} R'\right]^{-1} (R\hat{\beta} - r).$$
If the $q$ restrictions are true, then $W \xrightarrow{d} \chi^2(q)$.

#4 Lagrange Multiplier test (see the Week 4 notes for the definition). If the restrictions are true, then $nR^2 \xrightarrow{d} \chi^2(q)$, where $q$ denotes the number of restrictions and $R^2$ is from the regression of the restricted-model residuals on the regressors.

Let $X_r$ be the set of regressors used in the restricted model. Note that
$$X_r'\hat{u}_r = X_r'y - X_r'X_r (X_r'X_r)^{-1} X_r'y = X_r'y - X_r'y = 0,$$
so it is the same whether we regress the residuals from the restricted model on all the regressors or only on the omitted ones.

Example of LM test

Unrestricted model:
$$narr86 = \beta_1 + \beta_2\, pcnv + \beta_3\, avgsen + \beta_4\, tottime + \beta_5\, ptime86 + \beta_6\, qemp86 + u.$$
Sample: $n = 2725$; information on men arrested in 1986, who were born in 1960 or 1961 and who had been arrested at least once prior to 1986. Here $narr86$ is the number of times the man was arrested in 1986 (from 0 to 12), $pcnv$ the proportion of arrests leading to conviction, $avgsen$ the average sentence length served (often 0), $ptime86$ the months spent in prison in 1986, $tottime$ the total number of months spent in prison by the individual, and $qemp86$ the number of quarters the individual was employed in 1986.

We estimate a restricted model, in which $avgsen$ and $tottime$ are omitted, and get
$$\widehat{narr86} = .712 - .150\, pcnv - .034\, ptime86 - .104\, qemp86, \qquad n = 2725,\; R^2 = .0413.$$
We now take the residuals from the restricted model, say $\hat{u}_i$, and regress them on $pcnv$, $ptime86$, $qemp86$, $tottime$ and $avgsen$; the resulting $R^2$ is $.0015$, and $nR^2 = 2725 \times .0015 \approx 4.09$. The 10% critical value of $\chi^2(2)$ is 4.61, thus we cannot reject at the 10% level. Moreover, $\Pr\left(\chi^2(2) > 4.09\right) = .129$, i.e. the p-value is .129. Thus, we can only reject at significance levels higher than about 13%.
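To make the recipe concrete, here is a minimal sketch (not from the original notes) of the $nR^2$ computation; the function name and the data arrays are hypothetical placeholders, both design matrices are assumed to include a constant, and numpy is assumed to be available.

```python
import numpy as np

def lm_test(y, X_restricted, X_full):
    """nR^2 version of the LM test for exclusion restrictions.

    y, X_restricted and X_full are hypothetical user-supplied arrays; both
    design matrices are assumed to contain a column of ones.
    """
    n = y.shape[0]
    # Step 1: estimate the restricted model and keep its residuals u_r.
    b_r = np.linalg.lstsq(X_restricted, y, rcond=None)[0]
    u_r = y - X_restricted @ b_r
    # Step 2: regress u_r on the FULL set of regressors and compute the R^2
    # of this auxiliary regression.
    g = np.linalg.lstsq(X_full, u_r, rcond=None)[0]
    e = u_r - X_full @ g
    r2 = 1.0 - (e @ e) / ((u_r - u_r.mean()) @ (u_r - u_r.mean()))
    # Step 3: compare n * R^2 with a chi-square(q) critical value,
    # where q is the number of excluded regressors.
    return n * r2
```

Applied to the example above, this recipe reproduces $nR^2 = 2725 \times .0015 \approx 4.09$; a p-value can then be read off the $\chi^2(2)$ distribution (e.g. with scipy.stats.chi2.sf, if scipy is available).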
Asymptotic efficiency

We have seen that under A.MLR1-A.MLR5, OLS is BLUE, the best linear unbiased estimator (Gauss-Markov theorem). "Best" means that there is no other linear unbiased estimator which has smaller variance, and this holds for any sample size $n$, $n > k$. Given two unbiased estimators of $\beta$, say $\hat{\beta}$ and $\tilde{\beta}$, we say that $\hat{\beta}$ is more efficient than $\tilde{\beta}$ if $\operatorname{var}(\hat{\beta}) \le \operatorname{var}(\tilde{\beta})$.

We see from Result LS-OLS-3, asymptotic normality of OLS, that
$$\operatorname{avar}\!\left(n^{1/2}(\hat{\beta} - \beta)\right) = \lim_{n\to\infty} \operatorname{var}\!\left(n^{1/2}(\hat{\beta} - \beta)\right) = (\operatorname{plim}(X'X/n))^{-1}\sigma_u^2.$$
Under A.MLR1-2, A.MLR3' and A.MLR4-5, the OLS estimator has the smallest asymptotic variance: for any other consistent estimator of $\beta$, say $\tilde{\beta}$, we have
$$\operatorname{avar}\!\left(n^{1/2}(\hat{\beta} - \beta)\right) \le \operatorname{avar}\!\left(n^{1/2}(\tilde{\beta} - \beta)\right).$$

Heteroskedasticity

In deriving the (asymptotic) variance of OLS, we have assumed A.MLR5, according to which $E(uu'|X) = E(uu') = \sigma_u^2 I_n$. This assumption in fact incorporates two compound assumptions:

(i) $E(u_i^2|X) = E(u_i^2) = \sigma_u^2$ for all $i$. This means that the elements on the main diagonal do not depend on $X$ and are all equal. This is what we call conditional homoskedasticity.

(ii) $E(u_i u_j|X) = 0$ for all $i \ne j$; this means that all elements off the main diagonal are equal to zero. Such an assumption is known as serially uncorrelated (or non-autocorrelated) errors; it is more precisely described as conditional uncorrelatedness.

Under A.MLR2, $(y_i, x_i)$ are independently and identically distributed (iid), and so $u = y - X\beta$ is also an iid vector and thus (ii) always holds. You will consider the case in which (ii) fails in the time series part of this course.

Consequences of failure of conditional homoskedasticity

What happens when conditional homoskedasticity fails to hold? We did not use the assumption of conditional homoskedasticity (A.MLR5) to show unbiasedness or consistency of OLS (or of any other estimator), so violation of (i) does not cause inconsistent (or biased) estimators. On the other hand, OLS estimators are no longer efficient, in the sense that they no longer have the smallest possible variance. In particular, the Gauss-Markov theorem no longer holds, i.e. OLS is no longer the best linear unbiased estimator, and, in large samples, OLS no longer has the smallest asymptotic variance. Thus, once we drop the assumption of conditional homoskedasticity, OLS is no longer efficient or asymptotically efficient.

There is, however, another, bigger problem. Tests based on OLS, e.g. tests for linear restrictions, require a consistent estimator of the standard error in order to provide asymptotically valid inference. So far we have used estimators of the variance which are consistent under the assumption of conditional homoskedasticity. If this assumption fails to hold, then the "usual" standard errors are no longer consistent for the true standard deviation.

As an estimator of $\operatorname{var}\!\left(n^{1/2}(\hat{\beta} - \beta)\right)$ we used $\hat{\sigma}_u^2 (X'X/n)^{-1}$ (this is indeed also the estimator provided by most computer packages). Such an estimator is consistent for $\sigma_u^2 (\operatorname{plim}(X'X/n))^{-1}$, but the latter is no longer the true variance!

For the sake of simplicity, consider the simple model $y_i = \beta_1 + \beta_2 x_{2,i} + u_i$:
$$n^{1/2}(\hat{\beta}_2 - \beta_2) = \frac{n^{-1/2}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)\, u_i}{n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2}.$$
Now, recalling that $(y_i, x_{2,i})$ are iid,
$$E\!\left(\left(n^{-1/2}\sum_{i=1}^n (x_{2,i} - \mu_{x_2})\, u_i\right)^{\!2}\right) = E\!\left((x_{2,i} - \mu_{x_2})^2 u_i^2\right) = E\!\left((x_{2,i} - \mu_{x_2})^2\, E(u_i^2|x_{2,i})\right) \ne \operatorname{var}(x_2)\,\sigma_u^2,$$
as $E(u_i^2|x_{2,i}) \ne \sigma_u^2$. Thus
$$\operatorname{avar}\!\left(n^{1/2}(\hat{\beta}_2 - \beta_2)\right) \ne \frac{\sigma_u^2}{\operatorname{var}(x_2)}.$$
Thus the usual standard errors provided by the packages are not consistent for the true standard deviation, and inference based on them is no longer valid. That is, we believe we (do not) reject the null at, say, 5%, but this is not true. Furthermore, we do not know whether the standard errors provided by the computer are an overestimate or an underestimate of the true ones (this depends on the case at hand).

White's Standard Errors

A variance estimator which is consistent even in the case of conditional heteroskedasticity has been proposed by White (1980). The White estimator of $\operatorname{var}\!\left(n^{1/2}(\hat{\beta} - \beta)\right)$ is
$$(X'X/n)^{-1}\,(X'\hat{D}X/n)\,(X'X/n)^{-1}, \qquad \hat{D} = \operatorname{diag}(\hat{u}_1^2, \ldots, \hat{u}_n^2),$$
where $\hat{u} = y - X\hat{\beta}$ are the OLS residuals, so that $X'\hat{D}X/n = n^{-1}\sum_{i=1}^n \hat{u}_i^2\, x_i x_i'$. Such an estimator is implemented by computer packages nowadays; one just needs to use the "White" option.

In terms of the simple linear model, the usual estimator of $\operatorname{avar}\!\left(n^{1/2}(\hat{\beta}_2 - \beta_2)\right)$ is
$$\frac{\hat{\sigma}_u^2}{n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2},$$
while the White estimator of $\operatorname{avar}\!\left(n^{1/2}(\hat{\beta}_2 - \beta_2)\right)$ is
$$\frac{n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2\, \hat{u}_i^2}{\left(n^{-1}\sum_{i=1}^n (x_{2,i} - \bar{x}_2)^2\right)^{2}}.$$
Clearly, if conditional homoskedasticity indeed holds, then as $n$ gets large, the usual standard errors and White's standard errors approach the same probability limit, and so for large $n$ they are very close to each other. On the other hand, if conditional homoskedasticity fails to hold, then the usual and White's standard errors have different probability limits, and so for large $n$ they will be far apart.

Now, heteroskedasticity-robust t-statistics are constructed as
$$\frac{\hat{\beta}_i - \beta_i}{(\text{White})\, se(\hat{\beta}_i)} = \frac{n^{1/2}(\hat{\beta}_i - \beta_i)}{n^{1/2}\,(\text{White})\, se(\hat{\beta}_i)},$$
i.e. by scaling using White standard errors instead of the usual ones.

IMPORTANT: Even if A.MLR5 (conditional homoskedasticity) and A.MLR6 (conditionally normal errors) hold, if we are using White standard errors, then, in finite samples, $\hat{\beta}_i / (\text{White})\, se(\hat{\beta}_i)$ is NOT distributed as a Student-t with $n-k$ degrees of freedom, though asymptotically it is distributed as a standard normal.
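As an illustration, here is a short sketch (again not part of the original notes) that computes the usual and White standard errors side by side; the heteroskedastic data-generating process is an arbitrary illustrative choice, and numpy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x2 = rng.uniform(0.0, 10.0, n)
u = rng.normal(0.0, 1.0, n) * (0.2 + 0.3 * x2)   # E(u^2 | x2) depends on x2
y = 1.0 + 0.5 * x2 + u
X = np.column_stack([np.ones(n), x2])
k = X.shape[1]

b = np.linalg.lstsq(X, y, rcond=None)[0]
uhat = y - X @ b
XtX_inv = np.linalg.inv(X.T @ X)

# "Usual" variance estimator: sigma_u^2_hat * (X'X)^{-1}.
s2 = uhat @ uhat / (n - k)
se_usual = np.sqrt(np.diag(s2 * XtX_inv))

# White (1980) sandwich: (X'X)^{-1} (sum_i uhat_i^2 x_i x_i') (X'X)^{-1}.
meat = (X * (uhat ** 2)[:, None]).T @ X
se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(np.column_stack([se_usual, se_white]))   # noticeably different here
```

With the multiplicative heteroskedasticity built in above the two sets of standard errors differ markedly, whereas making the error variance constant brings them close together, in line with the probability-limit argument just given.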
Moral: by using White standard errors, we cannot draw any exact (finite-sample) inference. This is the price for robustness. Second moral: if $n$ is large enough, there is no cost in using White SEs.

Example: below we report the findings for the log-wage equation, reporting both the usual standard errors (in round brackets) and White's standard errors (in square brackets). Here $marmale$ is a variable equal to 1 if the individual is a married male, $marfem$ is equal to 1 if the individual is a married female, and $singfem$ is equal to 1 if the individual is a single female:
$$\widehat{\log(wage)} = .321\,(.100)\,[.109] + .213\, marmale\,(.055)\,[.057] - .198\, marfem\,(.058)\,[.058] - .110\, singfem\,(.056)\,[.057]$$
$$+\; .079\, educ\,(.0067)\,[.0074] + .027\, exper\,(.0055)\,[.0051] - .00054\, exper^2\,(.00011)\,[.00011] + .029\, tenure\,(.0068)\,[.0069] - .00053\, tenure^2\,(.00023)\,[.00024],$$
$$n = 526, \qquad R^2 = .461.$$
Note:
(i) The usual and White SEs are very close. Inference drawn on the usual or on White's SEs leads to the same conclusions, e.g. a parameter significantly different from 0 at 5% using the usual SE is also significantly different from 0 at 5% using the White SE.
(ii) We should then infer that conditional heteroskedasticity is not an important issue in this case.
(iii) White SEs can be either smaller or larger than the usual SEs.
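For reference, a hedged sketch of how a usual-versus-White comparison like the table above could be produced in practice, assuming the statsmodels package is available (where the HC0 covariance option corresponds to White's original estimator); the simulated variables below are purely illustrative stand-ins, not the wage data set used in the example.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical simulated stand-in for the wage data; the real example above
# uses n = 526 observations on log wages, education, experience, tenure, etc.
rng = np.random.default_rng(2)
n = 526
educ = rng.uniform(8.0, 18.0, n)
exper = rng.uniform(0.0, 30.0, n)
logwage = 0.3 + 0.08 * educ + 0.03 * exper + rng.normal(0.0, 0.4, n)

X = sm.add_constant(np.column_stack([educ, exper]))
usual = sm.OLS(logwage, X).fit()                  # classical standard errors
robust = sm.OLS(logwage, X).fit(cov_type="HC0")   # White's robust standard errors

# Coefficients are identical; only the standard errors change.
print(np.column_stack([usual.params, usual.bse, robust.bse]))
```

The point estimates coincide under the two options; only the reported standard errors (and hence t-statistics and p-values) differ, exactly as in the bracketed comparison above.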
© Copyright 2024