Should We Go One Step Further? An Accurate Comparison of One-Step and Two-Step Procedures in a Generalized Method of Moments Framework Jungbin Hwang and Yixiao Sun Department of Economics, University of California, San Diego Preliminary. Please do not circulate. Abstract According to the conventional asymptotic theory, the two-step Generalized Method of Moments (GMM) estimator and test perform as least as well as the one-step estimator and test in large samples. The conventional asymptotics theory, as elegant and convenient as it is, completely ignores the estimation uncertainty in the weighting matrix, and as a result it may not re‡ect …nite sample situations well. In this paper, we employ the …xed-smoothing asymptotic theory that accounts for the estimation uncertainty, and compare the performance of the one-step and two-step procedures in this more accurate asymptotic framework. We show the two-step procedure outperforms the one-step procedure only when the bene…t of using the optimal weighting matrix outweighs the cost of estimating it. This qualitative message applies to both the asymptotic variance comparison and power comparison of the associated tests. A Monte Carlo study lends support to our asymptotic results. JEL Classi…cation: C12, C32 Keywords: Asymptotic E¢ ciency, Fixed-smoothing Asymptotics, Heteroskedasticity and Autocorrelation Robust, Increasing-smoothing Asymptotics, Asymptotic Mixed Normality, Nonstandard Asymptotics, Two-step GMM Estimation 1 Introduction E¢ ciency is one of the most important problems in statistics and econometrics. In the widelyused GMM framework, it is standard practice to employ a two-step procedure to improve the e¢ ciency of the GMM estimator and the power of the associated tests. The two-step procedure requires the estimation of a weighting matrix. According to the Hansen (1982), the optimal weighting matrix is the asymptotic variance of the (scaled) sample moment conditions. For time series data, which is our focus here, the optimal weighting matrix is usually referred to as the Email: [email protected], [email protected]. Correspondence to: Department of Economics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0508. 1 long run variance (LRV) of the moment conditions. To be completely general, we often estimate the LRV using the nonparametric kernel or series method. Under the conventional asymptotics, both the one-step and two-step GMM estimators are asymptotically normal1 . In general, the two-step GMM estimator has a smaller asymptotic variance. Statistical tests based on the two-step estimator are also asymptotically more powerful than those based on the one-step estimator. A driving force behind these results is that the two-step estimator and the associated tests have the same asymptotic properties as the corresponding ones when the optimal weighting matrix is known. However, given that the optimal weighting matrix is estimated nonparametrically in the time series setting, there is large estimation uncertainty. A good approximation to the distributions of the two-step estimator and the associated tests should re‡ect this relatively high estimation uncertainty. One of the goals of this paper is to compare the asymptotic properties of the one-step and twostep procedures when the estimation uncertainty in the weighing matrix is accounted for. There are two ways to capture the estimation uncertainty. 
One is to use the high order conventional asymptotic theory under which the amount of nonparametric smoothing in the LRV estimator increases with the sample size but at a slower rate. While the estimation uncertainty vanishes in the …rst order asymptotics, we expect it to remain in high order asymptotics. The second way is to use an alternative asymptotic approximation that can capture the estimation uncertainty even with just a …rst-order asymptotics. To this end, we consider a limiting thought experiment in which the amount of nonparametric smoothing is held …xed as the sample size increases. This leads to the so-called …xed-smoothing asymptotics in the recent literature. In this paper, we employ the …xed-smoothing asymptotics to compare the one-step and twostep procedures. For the one-step procedure, the LRV estimator is used in computing the standard errors, leading to the popular heteroskedasticity and autocorrelation robust (HAR) standard errors. See, for example, Newey and West (1987) and Andrews (1991). For the two-step procedure, the LRV estimator not only appears in the standard error estimation but also plays the role of the optimal weighting matrix in the second-step GMM criterion function. Under the …xed-smoothing asymptotics, the weighting matrix converges to a random matrix. As a result, the second-step GMM estimator is not asymptotically normal but rather asymptotically mixed normal. The asymptotic mixed normality re‡ects the estimation uncertainty of the GMM weighting matrix and is expected to be closer to the …nite sample distribution of the second-step GMM estimator. In a recent paper, Sun (2014b) shows that both the one-step and two-step test statistics are asymptotically pivotal under this new asymptotic theory. So a nuisance-parameter-free comparison of the one-step and two-step tests is possible. Comparing the one-step and two-step procedures under the new asymptotics is fundamentally di¤erent from that under the conventional asymptotics. Under the new asymptotics, the twostep procedure outperforms the one-step procedure only when the bene…t of using the optimal weighting matrix outweighs the cost of estimating it. This qualitative message applies to both the asymptotic variance comparison and the power comparison of the associated tests. This is in sharp contrast with the conventional asymptotics where the cost of estimating the optimal weighting matrix is completely ignored. Since the new asymptotic approximation is more accurate than the conventional asymptotic approximation, comparing the two procedures under this new asymptotics will give an honest assessment of their relative merits. This is con…rmed by a Monte 1 In this paper, the one-step estimator refers to the …rst-step estimator in the typical two-step GMM framework. It does not refer to the continuous updating GMM estimator that involves only one step. We use the terms “one-step” and “…rst-step” interchangingly. 2 Carlo study. There is a large and growing literature on the …xed-smoothing asymptotics. For kernel LRV variance estimators, the …xed-smoothing asymptotics is the so-called the …xed-b asymptotics …rst studied by Kiefer, Vogelsang and Bunzel (2002) and Kiefer and Vogelsang (2002a, 2002b, 2005) in the econometrics literature. 
For other studies, see, for example, Jansson (2004), Sun, Phillips and Jin (2008), Sun and Phillips (2009), Gonçlaves and Vogelsang (2011), and Zhang and Shao (2013) in the time series setting; Bester, Conley, Hansen and Vogelsang (2014) and Sun and Kim (2014) in the spatial setting; and Gonçalves (2011), Kim and Sun (2013), and Vogelsang (2012) in the panel data setting. For OS LRV variance estimators, the …xed-smoothing asymptotics is the socalled …xed-K asymptotics. For its theoretical development and related simulation evidence, see, for example, Phillips (2005), Müller (2007), and Sun (2011, 2013). The approximation approaches in some other papers can also be regarded as special cases of the …xed-smoothing asymptotics. This includes, among others, Ibragimov and Müller (2010), Shao (2010) and Bester, Conley, and Hansen (2011). The …xed-smoothing asymptotics can be regarded as a convenient device to obtain some high order terms under the conventional increasing-smoothing asymptotics. The rest of the paper is organized as follows. The next section presents a simple overidenti…ed GMM framework. Section 3 compares the two procedures from the perspective of point estimation. Section 4 compares them from the testing perspective. Section 5 extends the ideas to a general GMM framework. Section 6 reports simulation evidence on the …nite sample performances of the two procedures. Proofs of the main theorems are provided in the Appendix. A word on notation: for a symmetric matrix A; A1=2 (or A1=2 ) is a matrix square root of A 0 such that A1=2 A1=2 = A: Note that A1=2 does not have to be symmetric. We will specify A1=2 explicitly when it is not symmetric. If not speci…ed, A1=2 is a symmetric matrix square root of A 1 based on its eigendecomposition. By de…nition, A 1=2 = A1=2 and so A 1 = (A 1=2 )0 (A 1=2 ): For matrices A and B; we use “A B” to signify that A B is positive (semi)de…nite. We use “0” and “O” interchangeably to denote a matrix of zeros whose dimension may be di¤erent at di¤erent occurrences. For two random variables X and Y; we use X ? Y to indicate that X and Y are independent. 2 A Simple Overidenti…ed GMM Framework To illustrate the basic ideas of this paper, we consider a simple overidenti…ed time series model of the form: y1t = 0 + u1t ; y1t 2 Rd ; y2t = u2t ; y2t 2 Rq (1) for t = 1; :::; T where 0 2 Rd is the parameter of interest and the vector process ut := (u01t ; u02t )0 is stationary with mean zero. We allow ut to have autocorrelation of unknown forms so that the long run variance of ut : 1 X = lrvar(ut ) = Eut u0t j j= 1 3 takes a general form. However, for simplicity, we assume that var(ut ) = 2 Id for the moment2 . Our model is just a location model. We initially consider a general GMM framework but later …nd out that our points can be made more clearly in the simple location model. In fact, the simple location model may be regarded as a limiting experiment in a general GMM framework. Embedding the location model in a GMM framework, the moment conditions are 0 E(yt ) 0 ; y 0 )0 . Let where yt = (y1t 2t p1 T p1 T gT ( ) = Then a GMM estimator of 0 0q =0 1 ! PT y1t Pt=1 T t=1 y2t : can be de…ned as ^GM M = arg min gT ( )0 W 1 gT ( ) T for some positive de…nite weighting matrix WT : Writing WT = where W11 is a d d matrix and W22 is a q T X ^GM M = 1 (y1t T W11 W12 W21 W22 ; q matrix, then it is easy to show that W y2t ) for W = W12 W221 : t=1 There are at least two di¤erent choices of WT . 
First, we can take WT to be the identity matrix WT = Im for m = d + q: In this case, W = 0 and the GMM estimator ^1T is simply T X ^1T = 1 y1t : T t=1 Second, we can take WT to be the ‘optimal’ weighting matrix WT = obtain the GMM estimator: T X ~2T = 1 (y1t y2t ) ; T . With this choice, we t=1 where = 12 221 is the long run regression coe¢ cient matrix. While ^1T completely ignores the information in fy2t g ; ~2T takes advantage of this information. 2 If V11 V21 var (ut ) = V := for any 2 V12 V22 6= > 0; we can let V1=2 = (V1 2 )1=2 0 0 V12 (V22 ) 1=2 (V22 )1=2 2 Id ! : 1 0 0 Then V1=2 (y1t ; y2t ) can be written as a location model whose error variance is the identity matrix Id : The estimation uncertainty in estimating V will not a¤ect our asymptotic results. 4 Under some moment and mixing conditions, we have p p d T ^1T T ~2T 0 =) N (0; 11 ) and d 0 =) N (0; 1 2) ; where 12 = 11 12 1 22 21 : So Asym Var(~2T ) < Asym Var(^1T ) unless 12 = 0. This is a well known result in the literature. Since we do not know in practice, ~2T is infeasible. However, given the feasible estimator ^1T ; we can estimate and construct a feasible version of ~2T : The common two-step estimation strategy is as follows. i) Estimate the long run covariance matrix by T T 1 XX s t ^ := ^ (^ u) = Qh ( ; ) u ^t T T T T 1X u ^ T s=1 t=1 0 where u ^t = (y1t =1 ! u ^s T 1X u ^ T =1 !0 ^1T ; y 0 )0 . 2t ii) Obtain the feasible two-step estimator ^2T = T 1 PT t=1 y1t ^ y2t where ^ = ^ 12 ^ 1 . 22 In the above de…nition of ^ , Qh (r; s) is a symmetric weighting function that depends on the smoothing parameter h: For conventional kernel estimators, Qh (r; s) = P k ((r s) =b) and we take K 1 h = 1=b: For the orthonormal series (OS) estimators, Qh (r; s) = K j=1 j (r) j (s) and we R1 2 take h = K; where j (r) are orthonormal basis functions on L [0; 1] satisfying 0 j (r) dr = 0: We parametrize h in such a way so that h indicates the level or amount of smoothing for both types of LRV estimators. P ^ g in constructing ^ (^ u) : For the Note that we use the demeaned process f^ ut T 1 T=1 u ^ ^ location model, (^ u) is numerically identical to (u) where the unknown error process fut g is used. The moment estimation uncertainty is re‡ected in the demeaning operation. Had we known the true value of 0 and hence the true moment process fut g ; we would not need to demean fut g. While ~2T is asymptotically more e¢ cient than ^1T ; is ^2T necessarily more e¢ cient than ^1T and in what sense? Is the Wald test based on ^2T necessary more powerful than that based on ^1T ? One of the objectives of this paper is to address these questions. 3 A Tale of Two Asymptotics: Point Estimation We …rst consider the conventional asymptotics where h ! 1 as T ! 1 but at a slower rate, i.e., h=T ! 0: The asymptotic direction is represented by the curved arrow in Figure 1. Sun (2014a, 2014b) calls this type of asymptotics the “Increasing-smoothing Asymptotics,”as h increases with the sample size. Under this type of asymptotics and some regularity conditions, we have ^ !p : p It can then be shown that ^2T is asymptotically equivalent to ~2T , i.e., T (~2T ^2T ) = op (1). As a direct consequence, we have p p d d 1 ^2T T ^1T 0 =) N (0; 11 ); T 0 =) N 0; 11 12 22 21 : So ^2T is still asymptotically more e¢ cient than ^1T : 5 Figure 1: Conventional Asymptotics (increasing smoothing) vs New Asymptotics (…xed smoothing) The conventional asymptotics, as elegant and convenient as it is, does not re‡ect the …nite sample situations well. 
Under this type of asymptotics, we essentially approximate the distribution of ^ by the degenerate distribution concentrating on . That is, we completely ignore the estimation uncertainty in ^ : The degenerate approximation is too optimistic, as ^ is a nonparametric estimator, which by de…nition can have high variation in p …nite samples. To obtain a more accurate distributional approximation of T (^2T 0 ); we could develop a high order increasing-smoothing asymptotics that re‡ects the estimation uncertainty in ^ . This is possible but requires strong assumptions that cannot be easily veri…ed. In addition, it is also technically challenging and tedious to rigorously justify the high order asymptotic theory. Instead of high order asymptotic theory under the conventional asymptotics, we adopt the type of asymptotics that holds h …xed as T ! 1: This asymptotic behavior of h and T is illustrated by the arrow pointing to the right in Figure 1. Given that h is …xed, we follow Sun (2014a, 2014b) and call this type of asymptotics the “Fixed-smoothing Asymptotics.” This type of asymptotics takes the sampling variability of ^ into consideration. Sun (2013, 2014a) has shown that the …xed-smoothing asymptotic distribution is high order correct under the conventional increasing-smoothing asymptotics. So the …xed-smoothing asymptotics can be regarded as a convenient device to obtain some high order terms under the conventional increasing-smoothing asymptotics. To establish the …xed-smoothing asymptotics, we maintain Assumption 1 on the kernel function and basis functions. 6 Assumption 1 (i) For kernel HAR variance estimators, the kernel function k ( ) satis…es the following condition: for any b 2 (0; 1] and 1, kb (x) and k (x) are symmetric, continuous, piecewise monotonic, and piecewise continuously di¤ erentiable on [ 1; 1]. (ii) For the OS HAR variance estimator, the basis functions j ( ) are piecewise monotonic, continuously di¤ erentiable R1 and orthonormal in L2 [0; 1] and 0 j (x) dx = 0: Assumption 1 on the kernel function is very mild. It includes many commonly used kernel functions such as the Bartlett kernel, Parzen kernel, and Quadratic Spectral (QS) kernel. De…ne Z 1Z 1 Z 1 Z 1 Qh ( 1 ; 2 )d 1 d 2 Qh (r; )d + Qh ( ; s)d Qh (r; s) = Qh (r; s) 0 0 0 0 which is a centered version of Qh (r; s); and T X T X s t ~= 1 ut u ^0s : Qh ( ; )^ T T T s=1 t=1 Assumption 1 ensures that ~ and ^ are asymptotically equivalent. Furthermore, under this assumption, Sun (2014a) shows that, for both kernel LRV and OS LRV estimation, the centered weighting function Qh (r; s) satis…es : Qh (r; s) = 1 X j j (r) j (s) j=1 R1 where f j (r)g is a sequence of continuously di¤erentiable functions satisfying 0 j (r)dr = 0 and the series on the right hand side converges to Qh (r; s) absolutely and uniformly over (r; s) 2 [0; 1] [0; 1]. The representation can be regarded as a spectral decomposition of the compact Fredholm operator with kernel Qh (r; s) : See Sun (2014a) for more discussion. Now, letting 0 ( ) := 1 and using the basis functions f j ( )g1 j=1 in the series representation of the weighting function, we make the following assumptions. Assumption 2 The vector process fut gTt=1 satis…es: P a) T 1=2 Tt=1 j (t=T )ut converges weakly to a continuous distribution, jointly over j = 0; 1; :::; J for every …xed J; b) For every …xed J and x 2 Rm , T P P 1 X p T t=1 t j ( )ut T T 1 X 1=2 p T t=1 x for j = 0; 1; ::; J t j ( )et T where 1=2 is a matrix square root of = = P1 ! 
= x for j = 0; 1; :::; J 1=2 12 12 0 0 j= 1 Eut ut j 7 1=2 22 1=2 22 ! ! + o(1) as T ! 1 >0 and et s iid N (0; Im ) Assumption 3 P1 j= 1 k Eut u0t k< 1: j Proposition 1 Let Assumptions 1–3 hold. As T ! 1 for a …xed h, we have: d (a) ^ =) 1 where = 1 ~1 = 1=2 Z 0 ~1 1Z 1 0 0 1=2 := 1;11 1;12 1;21 1;22 ~ 1;11 ~ 1;21 Qh (r; s)dBm (r)dBm (s)0 := and Bm ( ) is a standard Brownian motion of dimension m = d + q; p d Id ; (b) T ^2T 0 =) 1=2 Bm (1) where 1 = 1 independent of Bm (1) : Conditional on with variance 1, V2 = Id ; p T (^2T 0) V2 11 That is, the asymptotic distribution of 1 11 12 21 22 Id = 0 1 p T ^2T 1 (h; d; q) 0 0 12 1 11 ~ 1;12 ~ 1;22 := 1;12 1 1;22 is is a normal distribution 1 21 + 1 0 22 1 : is asymptotically mixed-normal rather than normal. Since 12 1 22 21 = = 12 12 1 22 0 12 1 21 1 22 1 22 1 12 21 1 22 + 1 0 1 0 22 1 0 almost surely, the feasible estimator ^2T has a large variation than the infeasible estimator ~2T : This is consistent with our intuition. p d ^ Under the …xed-smoothing asymptotics, we still have T (^1T 0 ) =) N (0; 11 ) as 1T does not depend onpthe smoothing parameter h: So the asymptotic (conditional) variances of p 1 T (^1T T (^2T 0 ) and 0 ) are both larger than 11 12 22 21 in terms of matrix de…niteness. However, it is not straightforward to directly compare them. To evaluate the relative magnitude of the two (conditional) asymptotic variances, we de…ne ~ 1 1 := ~ 1 (h; d; q) := ~ 1;12 ~ 1;22 ; (2) which does not depend on any nuisance parameter but depends on h; d; q: For notational economy, we sometimes suppress this dependence. Direct calculations show that 1 = 1=2 ~ 12 1 1=2 22 + 12 1 22 : (3) Using this, we have, for any conforming vector a : Ea0 (V2 11 ) a = Ea0 = a0 = a0 0 a0 12 01 + 1 21 a 1 22 1 a 1=2 ~ ~ 0 1=20 a0 12 221 21 a 12E 1 1 12 a h i 1=2 1=2 0 1=2 1=2 0 1 ~ ~0 12 22 21 ( 1 2 ) ( 1 2 ) a: 12 E 1 1 12 The following lemma gives a characterization of E ~ 1 (h; d; q) ~ 1 (h; d; q)0 : 8 (4) Lemma 2 For any d 1; we have E ~ 1 (h; d; q) ~ 1 (h; d; q)0 = Ejj ~ 1 (h; 1; q) jj2 Id : On the basis of the above lemma, we de…ne g(h; q) := Let = Ejj ~ 1 (h; 1; q)jj2 2 (0; 1): 1 + Ejj ~ 1 (h; 1; q)jj2 1=2 11 1=2 0 12 ( 22 ) 2 Rd q ; which is the long run correlation matrix between u1t and u2t . Using (4) and rewriting it in terms of ; we obtain the following proposition. 0 is positive (negative) semide…nite, Proposition 3 Let Assumptions 1–3 hold. If g(h; q)Id ^ ^ then 2T has a larger (smaller) asymptotic variance than 1T under the …xed-smoothing asymptotics. 1=2 In the proof of the proposition, we show that the choices of the matrix square roots 11 1=2 and 22 are innocuous. To use the proposition, we need only to check whether the condition is satis…ed for a given choice of the matrix square roots. 0 holds trivially. In this case, the asymptotic variance of ^ If = 0; then g(h; q)Id 2T is larger than the asymptotic variance of ^1T : Intuitively, when the long run correlation is zero, there is no information that can be explored to improve e¢ ciency. If we insist on using the long run correlation matrix in attempt to improve the e¢ ciency, we may end up with a less e¢ cient estimator, due to the noise in estimating the zero long run correlation matrix. On the other hand, if 0 = Id after some possible rotation, which holds when the long run variation of u1t is almost perfectly predicted by u2t ; then g(h; q)Id < 0 . 
In this case, it is worthwhile estimating the long run variance and using it to improve the e¢ ciency ^2T : There are many scenarios in 0 is inde…nite. In this case we may still be able to access between where the matrix g(h; q)Id the de…niteness of the matrix along some directions; see Theorem 10. The case with d = 1 is simplest. In this case, if 0 > g(h; q); then ^2T is asymptotically more e¢ cient than ^1T : Otherwise, it is not asymptotically less e¢ cient. In the case of kernel LRV estimation, it is hard to obtain an analytical expression for ~ Ejj 1 (h; 1; q)jj2 and hence g(h; q), although we can always simulate it numerically. The threshold g(h; q) depends on the smoothing parameter h = 1=b and the degree of overidenti…cation q: Tables 1–3 report the simulated values of g(h; q) for b = 0:00 : 0:01 : 0:20 and q = 1 5. It is clear that g(h; q) increases with q and decreases with the smoothing parameter h = 1=b. When the OS LRV estimation is used, we do not need to simulate g(h; q) as we can obtain a closed form expression for it. Using this expression, we can prove the following corollary. 0 Corollary 4 Let Assumptions 1–3 hold. Consider the case of OS LRV estimation. If Kq 1 Id is positive (negative) semide…nite, then ^2T has a larger (smaller) asymptotic variance than ^1T under the …xed-smoothing asymptotics. 0 to be positive semide…nite is that A necessary and su¢ cient condition for q=(K 1)Id 0 the largest eigenvalue of is smaller than q=(K 1): The eigenvalues of 0 are the squared long run correlation coe¢ cients between c01 u1t and c02 u2t for some c1 and c2 ; i.e., the square long run canonical correlation coe¢ cients between u1t and u2t : So if the largest squared long run 9 canonical correlation is smaller than q=(K 1); then ^2T has a larger asymptotic variance than ^1T : Similarly, if the smallest squared long run canonical correlation is greater than q=(K 1); then ^2T has a smaller asymptotic variance than ^1T : Since ^2T is not asymptotically normal, asymptotic variance comparison does not paint the whole picture. To compare the asymptotic distributions of ^1T and ^2T , we consider the case that d = q = 1 and K = 4 as an example: Figure 2 represents the shapes of probability density functions in the case of OS LRV estimation when ( 11 ; 212 ; 22 ) = (1; 0:10; 1). In this p a 1 case, 11 T (^1T N (0; 1) (red solid line) 12 22 21 = 0:9. The …rst graph shows 0) p a ^ and T ( 2T N (0; 0:9) (bluepdotted line) under p the conventional asymptotics. The con0) ventional limiting distributions for T (^1T ) and T (^2T 0 0 ) are both normal but the ^ latter has a smaller variance, so the asymptotic e¢ ciency of 2T is always guaranteed. However, this is not true in the second graph of Figure 2, which represents the limiting distributions under the …xed-smoothing asymptotics. Here, as before the red solid line represents N (0; 1) 2 but thePblue dottedPline represents the mixed normal distribution M N [0; 0:9(1 + ~ 1 )] with K K 2 2 ~ = 1 i=1 1i 2i = i=1 2i s M N (0; 1= K ): This can be obtained by using (4) with a = 1: More speci…cally, V2 = 1=2 ~ ~ 0 1=2 0 1 2 1 1( 1 2 ) 12 1 22 21 2 2 + 11 = 0:9 ~ 1 + 0:9 = 0:9(1 + ~ 1 ): (5) Comparing these two di¤erent families of distributions, we …nd that the asymptotic distribution of ^2T has fatter tail than that of ^1T : The asymptotic variance of ^2T is 0:9(1 + 1=2) = 1:35 which is larger than the asymptotic variance of ^1T . 
More generally, when d = q = 1 with ( 11 ; 212 ; 22 ) = (1; 2 ; 1); we can use (4) to obtain the asymptotic variance of ^2T as follows: 2 1+ 1 = 1 2 1+ 2 2 E ~1 1 1 K 2 2 = 1 K K 2 2 1 + E ~1 2 = 1 1 2 : So, for any choice of K, the asymptotic variance of ^2T decreases as 2 increases. Under the quadratic loss function, the asymptotic relative e¢ ciency of ^2T relative to ^1T is thus equal to 2 (K 1 1) = (K 2) : The second factor (K 1) = (K 2) > 1 re‡ects the cost of having to estimate : When 2 = 1=(K 1); ^1T and ^2T have the same asymptotic variance. For a given choice of K, if 2 1=(K 1), we prefer to use ^1T . This is not captured by the conventional asymptotics, which always chooses ^2T over ^1T . 4 A Tale of Two Asymptotics: Hypothesis Testing We are interested p in testing the null hypothesis H0 : R 0 = r against the local alternative H1 : R 0 = r + 0 = T for some p d matrix R and p 1 vectors r and 0 : Nonlinear restrictions can be converted into linear ones using the Delta method. We construct the following two Wald statistics: W1T := T (R^1T W2T := T (R^2T 1 r)0 R ^ 11 R (R^1T h i 1 (R^2T r)0 R ^ 1 2 R0 10 r) r) Conventional Asymptoptic Distributions 0.4 0.35 0.3 0.25 0.2 One-s tep estimator: N(0,1) T wo-s tep estimator: N(0,0.9) 0.15 0.1 0.05 0 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 1.5 2 Fixed-smoothign Asymptotic Distributions 0.4 0.35 0.3 0.25 0.2 0.15 One-s tep estimator: N(0,1) 0.1 T wo-s tep estimator: MN(0,0.9(1+ 2 )) 0.05 0 -2 -1.5 -1 -0.5 0 0.5 1 Figure 2: Limiting distributions of ^1T and ^2T based on the OS LRV estimator with K = 4: where ^ 1 2 = ^ 11 ^ 12 ^ 221 ^ 21 : When p = 1 and the alternative is one sided, we can construct the following two t statistics: p T R^1T r p T1T : = (6) R ^ 11 R0 p T R^2T r p T2T : = : (7) R ^ 1 2 R0 No matter whether the test is based on ^1T or ^2T ; we have to employ the long run covariance estimator ^ . De…ne the p p matrices 1 and 2 according to 1 In other words, 1 and 2 0 1 =R 11 R 0 and 2 0 2 are matrix square roots of R 11 =R 11 R 1 2R 0 0 : and R 1 2R 0 respectively. Under the conventional p increasing-smoothing asymptotics, it is straightforward to show that under H1 : R 0 = r + 0 = T : d W1T =) 1 2 p( d 1 T1T =) N ( When 0 1 2 d 0 ), W2T =) 0 ; 1), T2T =) N ( 1 d 1 2 2 1 2 p( 0 2 ), 0 ; 1): = 0; we obtain the null distributions: d 2 p W1T ; W2T =) d and T1T ; T2T =) N (0; 1): So under the conventional increasing-smoothing asymptotics, the null limiting distributions of 2 2 1 1 W1T and W2T are identical. Since 0 0 , under the conventional asymptotics, 1 2 the local asymptotic power function of the test based on W2T is higher than that based on W1T : The key driving force behind the conventional asymptotics is that we approximate the distribution of ^ by the degenerate distribution concentrating on : The degenerate approximation does not re‡ect the …nite sample distribution well. As in the previous section, we employ the …xed-smoothing asymptotics to derive more accurate distributional approximations. Let Z 1Z 1 Z 1Z 1 0 Cpp = Qh (r; s)dBp (r)dBp (s) ; Cpq = Qh (r; s)dBp (r)dBq (s)0 0 0 0 0 Z 1Z 1 0 Cqq = Qh (r; s)dBq (r)dBq (s)0 ; Cqp = Cpq 0 0 and 0 Cpq Cqq1 Cpq Dpp = Cpp where Bp ( ) 2 Rp and Bq ( ) 2 Rq are independent standard Brownian motion processes. Proposition 5 Let Assumptions 1–3 hold. As T ! 
1 for a …xed h, we have, under H1 : R p r + 0= T : d 2 1 (a) W1T =) W11 ( 0 1 d 1 2 2 0 d 1 1 d (d) T2T =) T21 1 2 2 Rp : (8) ) where W21 (k k2 ) = Bp (1) (c) T1T =) T11 = ) where W11 (k k2 ) = [Bp (1) + ]0 Cpp1 [Bp (1) + ] for (b) W2T =) W21 ( 0 Cpq Cqq1 Bq (1) + 1 0 Dpp1 Bp (1) p = Cpp for p = 1: 0 := Bp (1)+ 1 0 := Bp (1) Cpq Cqq1 Bq (1) + 0 1 2 0 = p Cpq Cqq1 Bq (1) + : (9) Dpp for p = 1: In Proposition 5, we use the notation W11 (k k2 ), which implies that the right hand side of (8) depends on only through k k2 : This is true, because for any orthogonal matrix H : [Bp (1) + ]0 Cpp1 [Bp (1) + ] = [HBp (1) + H ]0 HCpp1 H 0 [HBp (1) + H ] d = [Bp (1) + H ]0 Cpp1 [Bp (1) + H ] : 12 ~ 0 for some H ~ such that H is orthogonal, then If we choose H = ( = k k ; H) d [Bp (1) + ]0 Cpp1 [Bp (1) + ] = [Bp (1) + k k ep ]0 Cpp1 [Bp (1) + k k ep ] : So the distribution of [Bp (1) + ]0 Cpp1 [Bp (1) + ] depends on only through k k : Similarly, the distribution of the right hand side of (9) depends only on k k2 : When 0 = 0; we obtain the limiting distributions of W1T ; W2T ; T1T and T2T under the null hypothesis: d W1T =) W11 := W11 (0) = Bp (1)0 Cpp1 Bp (1) ; 0 d W2T =) W21 := W21 (0) = Bp (1) Cpq Cqq1 Bq (1) Dpp1 Bp (1) p d T1T =) T11 := T11 (0) = Bp (1)= Cpp ; p d T2T =) T21 := T21 (0) = Bp (1) Cpq Cqq1 Bq (1) = Dpp : Cpq Cqq1 Bq (1) ; These distributions are di¤erent from those under the conventional asymptotics. For W1T and p T1T ; the di¤erence lies in the random scaling factor Cpp or Cpp : The random scaling factor captures the estimation uncertainty of the LRV estimator. For W2T and T2T ; there is an additional di¤erence embodied by the random location shift Cpq Cqq1 Bq (1) with a consequent change in the random scaling factor. The proposition below provides some characterization of the two limiting distributions W11 and W21 : Proposition 6 For any x > 0; the following hold: (a) W21 (0) …rst-order stochastically dominates (FSD) W11 (0) in that P [W21 (0) h (b) P W11 (k k2 ) h (c) P W21 (k k2 ) x] > P [W11 (0) x] : i x strictly increases with k k2 and limk i x strictly increases with k k2 and limk h W11 (k k2 ) h 2 k!1 P W21 (k k ) k!1 P i x = 1: i x = 1: Proposition 6(a) is intuitive. W21 FSD W11 because W21 FSD Bp (1)0 Dpp1 Bp (1), which in turn FSD Bp (1)0 Cpp1 Bp (1) which is just W11 . According to a property of the …rst-order stochastic dominance, we have d W21 = W11 + We for some We > 0: Intuitively, W21 shifts some of the probability mass of W11 to the right. A direct implication is that the asymptotic critical values for W2T are larger than the corresponding ones for W1T : The di¤erence in critical values has implications on the power properties of the two tests. For x > 0; we have 1 P (T11 > x) = P W11 2 x2 1 and P (T21 > x) = P W21 2 x2 . It then follows from Proposition 6(a) that P (T21 > x) P (T11 > x) for x > 0: So for a onesided test with the alternative H1 : R 0 > r; critical values from T21 are larger than those from 13 T11 : Similarly, we have P (T21 < x) P (T11 < x) for x < 0: This implies that for a one-sided test with the alternative H1 : R 0 < r; critical values from T21 are smaller than those from T11 : Let W11 and W21 be the (1 ) quantile from the distributions W11 and W21 ; respectively. 
The local asymptotic power functions of the two tests are h i 2 2 2 1 1 := 1 ; K; p; q; = P W11 ( 1 1 0 ) > W11 ; 1 0 0 1 1 h i 2 2 2 1 1 1 := ; K; p; q; = P W ( ) > W 0 0 2 21 0 2 21 : 1 2 2 2 2 1 1 While ; we also have W21 > W11 : The e¤ects of the critical values and 0 0 2 1 the noncentrality parameter move in opposite directions. It is not straightforward to compare the two power functions. However, Proposition 6 suggests that if the di¤erence in the noncentrality 2 2 1 1 parameters is large enough to o¤set the increase in critical values, then 0 0 2 1 the two-step test based on W2T will be more powerful. 2 2 1 1 ; we de…ne To evaluate 0 0 1 2 = R R 11 R which is the long run correlation matrix 1 2 2 1 0 1 = 0 0 = 0 0 0 1 = 0 0 0 1 = 0 0 0 1 R 11 R 1 0 R h 1 12 ) 1=2 22 ; (10) between Ru1t and u2t : In terms of R 2 Rp q we have 2 1 22 12 1 R 0 R R 21 R 12 0 R R ( Ip (R 0 1 Ip n 1 Ip R 1=2 0 1 0 1 22 1 21 R Ip 1=2 0 ) 0 0 0 o 0 0 1 1 1 0 R R 0 R i 1 11 R 1 0 1 1 0 0 0 1=2 0 R R Ip 1 0 1 1 1 0 > 0: 0 1 1 0 1 So the di¤erence in the noncentrality parameters depends on the matrix R 0R : Pmin(p;q) 0 Let R = i=1 i ai bi be the singular value decomposition of R where fai g and fbi g are 2 orthonormal vectors in Rp and Rq respectively. Sorted in the descending order, are the i (squared) long run canonical correlation coe¢ cients between Ru1t and u2t : Then min(p;q) 1 2 2 1 0 1 2 0 X = i=1 Consider a special case that 1 2 2 1 := maxpi=1 2 i 2 i 1 2 i a0i 1 1 approaches 1. If a01 2 0 1 1 : 0 6= 0; then jj 1 2 2 0 jj and hence jj 2 1 0 jj2 approaches 1 as 21 approaches 1 from below. This case happens when the second block of moment conditions has very high long run prediction power for the …rst 2 block. In this case, we expect the W2T test to be more powerful, as limj 1 j!1 2 ( 2 1 0 ) = 1: Consider another special case that maxi 2i = 0; i.e., R is a matrix of zeros. In this case, the 1 0 second block of moment conditions contains no additional information, and we have 1 2 1 2 2 0 = : In this case, we expect the W2T test to be less powerful. 0 It follows from Proposition 6(b)(c) that for any , there exists a unique ( ) := ( ; h; p; q; ) such that ( )) : 1( ) = 2( 1 14 Then 1 2 = ( 0 2 2 ( 0 1 1 2 1 2( 2 0 ) )< 1 1 0 0 1( 2 0 0 1 ) 0 1 2 1 1 1 1 Ip 1 ) if and only if 2 2 0 1 < ( 1 2 0 ) 1 1 2 0 : But 2 0 0 R R 1=2 where h 0 R R 1 2 0 Ip i 0 R R Ip 1=2 1 1 0 ; ( ; h; p; q; ) 1 : ( ; h; p; q; ) f ( ) := f ( ; h; p; q; ) = So the power comparison depends on whether f ( 1 f 1 1 2 0 )Ip 0 R R is positive semide…nite. 2 1 Proposition 7 Let Assumptions 1–3 hold and de…ne := = 00 (R 11 R0 ) 1 0 . If 0 1 0 f ( ; h; p; q; )Ip R R is positive (negative) semide…nite, then the two-step test based on W2T has a lower (higher) local asymptotic power than the one-step test based on W1T under the …xedsmoothing asymptotics. Note that 0 R R = R 11 R 0 1=2 R 12 1 22 21 R0 R 11 R 0 1=2 : To pin down R 0R ; we need to choose a matrix square root of R 11 R0 : Following the arguments in the proof of Proposition 3, it is easy to see that we can use any matrix square root. In particular, we can use the principal matrix square root. When p = 1; which is of ultimate importance in empirical studies, R 0R is equal to the sum of the squared long run canonical correlation coe¢ cients. In this case, f ( ; h; p; q; ) is the threshold value of R 0R for assessing the relative e¢ ciency of the two tests. More speci…cally, when R 0R > f ( ; h; p; q; ), the two-step W2T test is more powerful than the one-step W1T test. 
Otherwise, the two-step W2T test is less powerful. Proposition 7 is in parallel with Proposition 3. The qualitative messages of these two propositions are the same — when the long run correlation is high enough, we should estimate and exploit it to reduce the variation of our point estimator and improve the power of the associated tests. However, the thresholds are di¤erent quantitatively. The two propositions fully characterize the threshold for each criterion under consideration. Proposition 8 Consider the case of OS LRV estimation. For any 2 ( ) and hence ( ; h; p; q; ) > 1 and f ( ; h; p; q; ) > 0: 2 R+ , we have 1( )> Proposition 8 is intuitive. When there is no long run correlation between Ru1t and u2t ; we 2 2 1 1 : In this case, the two-step W2T test is necessarily less powerful. The have = 0 0 1 2 proof uses the theory of uniformly most powerful invariant tests and the theory of complete and su¢ cient statistics. It is an open question whether the same strategy can be adopted to prove Proposition 8 in the case of kernel LRV estimation. Our extensive numerical work supports that ( ; h; p; q; ) > 1 and f ( ; h; p; q; ) > 0 continue to hold in the kernel case. It is not easy to give an analytical expression for f ( ; h; p; q; ) but we can compute it numerically without any di¢ culty. In Tables 4 and 5, we consider the case of OS LRV estimation and compute the values of f ( ; K; p; q; ) for = 1 10; K = 8; 10; 12; 14, p = 1 4 and q = 1 3: Similar to the asymptotic variance comparison, we …nd that these threshold values increase as the degree of over-identi…cation increases and decrease as the smoothing parameter K increases. Note that the continuity of two power functions 1 ( ; K; p; q; ) and 2 ( ; K; p; q; ) guarantees 15 that for 2 [1; 10] the two-step W2T test has a lower local asymptotic power than the one-step W1T test if the largest eigenvalue of R 0R is smaller than min 2[1;10] f ( ; K; p; q; ). on the other hand, for 2 [1; 10] the two-step W2T test has a higher local asymptotic power if the smallest eigenvalue of R 0R is greater than max f ( ; K; p; q; ). For the case of kernel LRV estimator, results not reported here show that f ( ; h; p; q; ) increases with q and decreases with h. This is entirely analogous to the case of OS LRV estimation. 5 General Overidenti…ed GMM Framework In this section, we consider the general GMM framework. The parameter of interest is a d 1 vector 2 Rd . Let vt 2 Rdv denote the vector of observations at time t. We assume that 0 is the true value, an interior point of the parameter space . The moment conditions E f (vt ; ) = 0; t = 1; 2; :::; T: hold if and only if = 0 where f (vt ; ) is an m 1 vector of continuously di¤erentiable functions. The process f (vt ; 0 ) may exhibit autocorrelation of unknown forms. We assume that m d and that the rank of E[@ f (vt ; 0 ) =@ 0 ] is equal to d: That is, we consider a model that is possibly over-identi…ed with the degree of over-identi…cation q = m d: 5.1 One-Step and Two-Step Estimation and Inference De…ne the m m contemporaneous covariance matrix 0 = E f (vt ; )f (vt ; ) and = 1 X j and the LRV matrix where j = E f (vt ; as: 0 0 )f (vt j ; 0 ) : j= 1 Let t 1 X gt ( ) = p f (vj ; ): T j=1 Given a simple positive-de…nite weighting matrix W0T that does not depend on any unknown parameter, we can obtain an initial GMM estimator of 0 as ^0T = arg min gT ( )0 W 1 gT ( ): 0T For example, we may set W0T equal to Im . 
In the case of IV regression, we may set W0T equal to ZT0 ZT =T where ZT is the matrix of the instruments. Using or as the weighting matrix, we obtain the following two (infeasible) GMM estimators: ~1T : = arg min gT ( )0 1 gT ( ); (11) ~2T : = arg min gT ( )0 1 gT ( ): (12) 2 2 For the estimator ~1T ; we use the contemporaneous covariance matrix as the weighting matrix and ignore all the serial dependency in the moment vector process ff (vt ; 0 )gTt=1 . In contrast to this procedure, the second estimator ~2T accounts for the long run dependency. The feasible 16 versions of these two estimators ^1T and ^2T can be naturally de…ned by replacing their estimates est (^0T ) and est (^0T ) where est ( ) : = and with T 1X f (vt ; )f (vt ; )0 ; T (13) t=1 T T 1 XX s t Qh ( ; )f (vt ; )f (vs ; )0 : est ( ) : = T T T (14) s=1 t=1 To test the null hypothesis H0 : R di¤erent Wald statistics as follows: 0 = r against H1 : R W1T : = T (R^1T W2T : = T (R^2T n o r)0 RV^1T R0 n o r)0 RV^2T R0 where V^1T = V^2T = h h and G01T 1 ^ est ( 1T )G1T G2T 1 ^ est ( 2T )G2T G1T i i 1h G01T 0 1 1 =r+ (R^1T r); (R^2T r); 1 ^ 1 ^ ^ est ( 1T ) est ( 1T ) est ( 1T )G1T 1 T 1 X @ f (vt ; ) = T @ 0 t=1 ih T 1 X @ f (vt ; ) = T @ 0 t=1 ; G2T =^1T p 0= T ; we construct two (15) 1 ^ est ( 1T )G1T G01T i 1 (16) : =^2T These are the standard Wald test statistics in the GMM framework. To compare the two estimators ^1T and ^2T and associated tests, we maintain the standard assumptions below. Assumption 4 As T ! 1; ^0T = point 0 2 : 0 +op (1) ; ^1T = 0 +op (1) ; ^2T = 0 +op (1) for an interior Assumption 5 De…ne t 1 @gt 1 X @ f (vt ; ) Gt ( ) = p = for t 0 T @ 0 T@ j=1 For any T = 0 1 and G0 ( ) = 0: + op (1); the following holds: (a) plimT !1 G[rT ] ( 0 G = G( 0 ) and G( ) = E@ f (vt ; )=@ ; (b) est ( T ) p ! T) = rG uniformly in r where > 0: With these assumptions and some mild conditions, the standard GMM theory gives us p T (^1T T 1 Xh 0 G 0) = p T t=1 1 G i 1 G0 1 f (vt ; 0) + op (1): Under the …xed-smoothing asymptotics, Sun (2014b) establishes the representation: p T (^2T T 1 Xh 0 p G 0) = T t=1 1 0 1G 17 i 1 G0 1 1 f (vt ; 0 ) + op (1); where 1 is de…ned in the similar way as 1 in Proposition 1: 1 = 1=2 ~ 1 01=2 : Due to the complicated structure of two transformed moment vector processes, it is not straightforward to compare the asymptotic distributions of ^1T and ^2T as in Sections 3 and 4. To confront this challenge, we let G= U V0 m m m d d d be a singular value decomposition (SVD) of G, where 0 A is a d A; = O d d ; d q d diagonal matrix and O is a matrix of zeros. Also, we de…ne f (vt ; 0) = (f1 0 (vt ; 0 ); f2 0 (vt ; 0 0 )) := U 0 f (vt ; 0) 2 Rm ; where f1 (vt ; 0 ) 2 Rd and f2 (vt ; 0 ) 2 Rq are the rotated moment conditions. The variance and long run variance matrices of ff (vt ; 0 )g are := U 0 U = 11 12 21 22 and := U 0 U . To convert the variance matrix into an identity matrix, we de…ne the normalized moment conditions below: f (vt ; 0) 0 0 0 0 ) ; f2 (vt ; 0 ) = f1 (vt ; := ( where 1=2 1=2 1 2) ( = 1=2 12 ( 22 ) ( 22 )1=2 0 1 1=2 ) ! f (vt ; 0) : (17) More speci…cally, f1 (vt ; 0) : =( 1 2) 1=2 f2 (vt ; 0) : =( 22 ) 1=2 h f1 (vt ; f2 (vt ; 0) 0) 12 ( 22 ) 1 f2 (vt ; 2 Rq : Then the contemporaneous variance of the time series ff (vt ; 0 ) 1: is := ( 1=2 ) 1 ( 1=2 0 )g 1 2) 1=2 p AV 0 T (^1T 0) = ( 1 2) 1=2 p AV 0 T (^2T 0) = T 1 X p f1 (vt ; T t=1 0) T 1 X p [f1 (vt ; T t=1 d =) M N 0; where 1 := 1;12 1 1;22 11 is the same as in Proposition 1. 18 0) in Assumptions 2 and 3. 
d + op (1) =) N (0; 0) 2 Rd ; is Im and the long run variance Lemma 9 Let Assumptions 1–5 hold by replacing ut with f (vt ; Then as T ! 1 for a …xed h; ( i 0) 1 f2 (vt ; 0 )] 0 12 1 1 11 ) (18) + op (1) 21 + 1 (19) 0 22 1 The asymptotic normality and mixed normality in the lemma are not new. The p p limiting distribution of T (^1T 0 ) is available from the standard GMM theory while that of T (^2T 0 ) has been recently established in Sun (2014b). It is the representation that is new and insightful. The representation casts the two estimators in the same form and enables us to directly compare their asymptotic properties. It follows from the proof of the lemma that ( 1 2) 1=2 AV 0 p T 1 X [f1 (vt ; 0) = p T t=1 T (~2T 0) 0 f2 (vt ; 0 )] + op (1) where 0 = 12 221 as de…ned before. So the di¤erence between the feasible and infeasible twostep GMM estimators lies in the uncertainty in estimating 0 : While the true value of appears in the asymptotic distribution of the infeasible estimator ~2T ; the …xed-smoothing limit of the implied estimator ^ := ^ 12 ^ 221 appears in that of the feasible estimator ^2T : It is important to point out that the estimation uncertainty in the whole weighting matrix est matters only through that in ^ : If we let (u1t ; u2t ) = (f1 (vt ; 0 ); f2 (vt ; 0 )); then the right hand sides of (18) and (19) are exactly the same as what we would obtain in the location model. The location model, as simple as it is, has implications for general settings from an asymptotic point of view. More speci…cally, de…ne y1t = ( 1=2 1 2) AV 0 0 + u1t ; y2t = u2t ; where u1t = f1 (vt ; 0 ) and u2t = f2 (vt ; 0 ). The estimation and inference problems in the GMM setting are asymptotically equivalent to those in the above simple location model with fy1t ; y2t g as the observations. ~ using To present our next theorem, we transform R into R ~ = RV A R 1 ( which has the same dimension as R: We let Z 1Z 1 ~ (h; p; q) = Qh (r; s)dBp (r)dBq (s)0 1 0 0 1=2 ; 1 2) Z 0 (20) 1Z 1 0 1 Qh (r; s)dBq (r)dBq (s)0 ; which is compatible with the de…nition in (2). We de…ne = 1=2 11 12 1=2 22 2 Rd q and R ~ = R ~0 11 R 1=2 ~ R 12 1=2 22 2 Rp q : While is the long run correlation matrix between f1 (vt ; 0 ) and f2 (vt ; 0 ), R is the long run ~ 1 (vt ; 0 ) and f2 (vt ; 0 ). Finally, we rede…ne the noncentrality pacorrelation matrix between Rf rameter : h i 1 1 = 00 R(G0 1 G) 1 (G0 1 G)(G0 1 G) 1 R0 (21) 0; which is the noncentrality parameter based on the …rst-step test. For the location model considered before, the above de…nition of R is identical to that in (10). In that case, we have G = (Id ; Od q )0 and so U = Im , A = Id and V = Id : Given the ~ = R: In addition, the assumption that = = Im ; which implies that 1 2 = Id ; we have R 1 0 0 noncentrality parameter reduces to = 0 (R 11 R ) 0 as de…ned in Proposition 7. 19 Theorem 10 Let the assumptions in Lemma 9 hold. 0 is positive (negative) semide…nite, then ^ (a) If g(h; q) Id 2T has a larger (smaller) ^ asymptotic variance than 1T as T ! 1 for a …xed h: 0 ^ (b) If g(h; q) Ip R R is positive (negative) semide…nite, then R 2T has a larger (smaller) ^ asymptotic variance than R 1T as T ! 1 for a …xed h: 0 (c) If f ( ; h; p; q; ) Ip R R is positive (negative) semide…nite, then the test based on W2T is asymptotically less (more) powerful than that based on W1T ; as T ! 1 for a …xed h: Theorem 10(a) is a special case of Theorem 10(b) with R = Id . We single out this case in order to compare it with Proposition 3. 
It is reassuring to see that the results are the same. The only di¤erence is that in the general GMM case we need to rotate and standardize the original moment conditions before computing the long run correlation matrix. Part (b) can also be applied to a 1=2 ~ = R( general location model with a nonscalar error variance, in which case R . Theorem 1 2) 10(c) reduces to Proposition 7 in the case of the location model. 5.2 Two-Step GMM Estimation and Inference with a Working Weighting Matrix In the previous subsection, we employ two speci…c weighting matrices — the variance and long run variance estimators. In this subsection, we consider a general weighting matrix WT (^0T ); which may depend on the initial estimator ^0T and the sample size T; leading to yet another GMM estimator: i 1 h ^aT = arg min gT ( )0 WT (^0T ) gT ( ) 2 where the subscript ‘a’signi…es ‘another’or ‘alternative’. An example of WT (^0T ) is the implied LRV matrix when we employ a simple approximating parametric model to capture the dynamics in the moment process. We could also use the general LRV estimator but we choose a large h so that the variation in WT (^0T ) is small. In the kernel LRV estimation, this amounts to including only autocovariances of low orders in constructing WT (^0T ): We assume that WT (^0T ) !p W , a positive de…nite nonrandom matrix under the …xed-smoothing asymptotics. W may not be equal to the variance or long run variance of the moment process. We call WT (^0T ) a working weighting matrix. This is in the same spirit of using a working correlation matrix rather than a true correlation matrix in the generalized estimating equations (GEE) setting. See, for example, Liang and Zeger (1986). In parallel to (15), we construct the test statistic n o r)0 RV^aT R0 WaT := T (R^aT where, for GaT = 1 T PT t=1 @ f (vt ; )=@ h i V^aT = G0aT WT 1 (^aT )GaT 1h 0 =^aT 1 (R^aT ; V^aT is de…ned according to G0aT WT 1 (^aT ) est ^aT W 1 (^aT )GaT T which is a standard variance estimator for ^aT : De…ne W = U 0 W U and W = r); 1 1=2 W 20 ( 0 1 1=2 ) = ih G0aT WT 1 (^aT )GaT W11 W12 W21 W22 i 1 ; and a = W12 W221 : Using the same argument for proving Lemma 9, we can show that ( 1=2 1 2) 1 A T 1 X p ) = [f1 (vt ; 0 T t=1 p V 0 T (^aT 0) a f2 (vt ; 0 )] The above representation is the same as that in (19) except that 1=2 1=2 De…ne ~ a according to a = 1 2 ~ a 22 + 0 : That is, 1=2 12 ( a ~ = a The relationship between In fact, if we let a 0) 1=2 22 1=2 a 12 = 1=2 22 1=2 1=2 W 0 = 1=2 0 Id ~ 12 ~ 11 W W ~ ~ W21 W22 (22) is now replaced by 1 and ~ a is entirely analogous to that between ~ = W + op (1): a: : (23) 1 and ~ 1 ; see (3). ; ~ 12 W ~ 1 ; which is the long run regression matrix implied by the normalized weighting then ~ a = W 22 ~: matrix W ~ [f1 (vt ; 0 ) Let Va be the long run variance of R a f2 (vt ; 0 )] and h i 1=2 1=2 ~ = V R ( ) 12 22 a;R a a 22 ; ~ [f1 (vt ; be the long run correlation matrix between R R = Id ; a;R reduces to a = 11 + a 0 22 a a 0 12 a 21 0) 1=2 a f2 (vt ; 0 )] ( 12 a 22 ) and f2 (vt ; 0) : When 1=2 22 : Let ~ a = RV A R 1 ( 1=2 1 2) 1=2 12 ~ =R 1=2 12 ~R = R ~aR ~ a0 and D 1=2 ~a = R ~ R ~0 1 2R 1=2 ~ R 1=2 12: ~R ~ 0 = Ip , that is, each row of D ~ R has the same dimension as R ~ a and R: By construction, D ~ RD D R d ~ ~ is a unit vector in R and the rows of DR are orthogonal to each other. The matrix DR signi…es ~a: the orthogonal directions embodied in R With these additional notations, we are ready to state the theorem below. Theorem 11 Let the assumptions in Lemma 9 hold. 
Assume further that WT (^0T ) !p W , a positive de…nite nonrandom matrix. 0 0 (a) If E ~ 1 (h; d; q) ~ 1 (h; d; q) ~ a ~ a is positive (negative) semide…nite, then ^2T has a larger (smaller) asymptotic variance than ^aT ; as T ! 1 for a …xed h: 0 ^ ~0 ~ R ~a ~0 D (b) If E ~ 1 (h; p; q) ~ 1 (h; p; q) D a R is positive (negative) semide…nite, then R 2T ^ has a larger (smaller) asymptotic variance than R aT ; as T ! 1 for a …xed h: 0 ^ (c) If g(h; d) Id a a is positive (negative) semide…nite, then 2T has a larger (smaller) ^ asymptotic variance than aT ; as T ! 1 for a …xed h: (d) If g(h; d) Ip a;R 0a;R is positive (negative) semide…nite, then R^2T has a larger (smaller) asymptotic variance than R^aT ; as T ! 1 for a …xed h: 1=2 2 0 (e) Let a = jjVa 0 jj : If f ( a ; h; p; q; ) Ip a;R a;R is positive (negative) semide…nite, then the test based on W2T is asymptotically less (more) powerful than that based on WaT ; as T ! 1 for a …xed h: 21 0 Theorem 11(a) is intuitive. While E ~ 1 (h; d; q) ~ 1 (h; d; q) can be regarded as the variance 0 in‡ation arising from estimating the long run regression matrix, ~ a ~ a can be regarded as the bias e¤ect from not using a true long run regression matrix. When the variance in‡ation e¤ect dominates the bias e¤ect, the two-step estimator ^2T will have a larger asymptotic variance than ^aT : In the special case when W = ; we obtain the infeasible two-step estimator ~2T and ~ a = 0: 0 ~ ~ 0 holds trivially and the feasible two-step estimator In this case, E ~ 1 (h; d; q) ~ 1 (h; d; q) a a ^2T has higher variation than its infeasible version ~2T : In another special case when W = ; we obtain the …rst step estimator ^1T in which case 0 ) 1=2 : Hence ~ (Id a = 0 and by (23), a = ~ ~ 0 = Id a a 0 1=2 0 1=2 0 Id : 0 The condition in Theorem 11(a) reduces to whether g(h; q) Iq not. Theorem 11(a) thus reduces to Theorem 10(a). is positive semide…nite or d ~ ~ R ~ (h; d; q) = Theorem 11(b) is a rotated version of Theorem 11(a). Given that D (h; p; q); the condition in Theorem 11(a) implies that in Theorem 11(b). So part (b) of the theorem is stronger than part (a) when R is not a square matrix. When W = ; we have a = 0 and so ~ R ~a ~0 D ~0 D a R = ~ R ~0 1 2R = ~ R 1 2R = ~ R 1 2R = Ip ~0 ~0 0 R R 1=2 h 1=2 h 1=2 h 1=2 ~ R ~( R ~ R 1=2 12 i a a 0) a 12 0 R R h ~ ~0 1 22 22 f( a ~0 21 R i 0 R R Ip 0 As a result, the condition E ~ 1 (h; p; q) ~ 1 (h; p; q) to 0 E ~ 1 (h; p; q) ~ 1 (h; p; q) Ip R i 1=2 0 12 ~ R ~ R ~ R 0 ~0 0 )g R ~0 1 2R 1=2 ~0 1 2R i ~ R 1=2 ~0 1 2R 1=2 1=2 : ~ R ~a ~0 D ~0 D a R in Theorem 11(b) is equivalent 0 R 1=2 0 R R Ip 0 R R 1=2 ; which can be reduced to the same condition in Theorem 10(b). Theorem 11 (c)-(e) is entirely analogous to Theorem 10(a)-(c). The only di¤erence is that the second block of moment conditions is removed from the …rst block using the implied matrix coe¢ cient a before computing the long run correlation coe¢ cient. To understand 11(e), we can see that the e¤ective moment conditions behind R^aT are: Ef1a (vt ; 0) = 0 for f1a (vt ; 0) = RV A 1 ( 1=2 [f1 (vt ; 0 ) 1 2) a f2 (vt ; 0 )] : R^aT uses the information in Ef2 (vt ; 0 ) = 0 to some extent, but it ignores the residual information that is still potentially available from Ef2 (vt ; 0 ) = 0: In contrast, R^2T attempts to explore the residual information. 
If there is no long run correlation between f1a (vt ; 0 ) and f2 (vt ; 0 ) ; i.e., aR = 0; then all the information in Ef2 (vt ; 0 ) = 0 has been fully captured by the e¤ective moment conditions underlying R^aT : As a result, the test based on R^aT necessarily outperforms that based on R^2T : If the long run correlation aR is large enough in the sense that is given in 11(e), the test based on R^2T could be more powerful than that based on R^aT in large samples. 22 5.3 Practical Recommendation Theorems 10 and 11 give theoretical comparisons of various procedures. But there are some unknown quantities in the two Theorem 10. In practice, given the set of moment conditions E f (vt ; ) = 0 and the data fvt g ; suppose that we want to test H0 : R 0 = r against R 0 6= r for some R 2 Rp d We can follow the steps below to evaluate the relative merit of one-step and two-step testing procedures. 1. Compute the initial estimator ^0T = arg min PT t=1 f (vt ; ) 2 : 2. One the basis of ^0T ; use a data-driven method to select the smoothing parameter: Denote ^ the data-driven value by h: 3. Compute ^ est ( 0T ) and ^ est ( 0T ) ^ using the formulae in (14) and smoothing parameter h: P ^ ^ V^ 0 4. Compute GT (^0T ) = T1 Tt=1 @ f @(vt0; ) j =^0T and its singular value decomposition U where ^ 0 = (A^d d ; Od q ) and A^d d is diagonal. 5. Estimate the variance and the long run variance of the rotated moment processes by ^ ^ := U ^0 ^ and ^ := U ^0 est ( 0T )U ^ ^ est ( 0T )U . 6. Compute the normalized LRV estimator: ^ = (^ ) 1=2 1^ 0 ^ where ^ ~ est = RV^ A^ 7. Let R 1( ^ 1=2 1=2 1 2) B =@ 0 ( ^ 1=2 ) 1=2 12 1 ^ ^ 12 ^ 0 ^ 11 ^ 21 := ^ 12 ^ 22 1=2 1 C A: 22 1=2 22 (24) and compute h 0 ~ est ^ 11 R ~ est ^R ^0R = (R ) 1=2 ih 0 ~ est ^ 12 ^ 1 ^ 12 R ~ est R 22 ih 0 ~ est ^ 11 R ~ est (R ) and its largest eigenvalue vmax ^R ^0R and smallest eigenvalue vmin ^R ^0R : 1=2 i0 8. Choose the value of o such that P 2p ( ) > p1 = 75%. We may also choose a value of to re‡ect scienti…c interest or economic signi…cance. 9. (a) If vmin ^R ^0R > f on W2T : (b) If vmax ^R ^0R < f on W1T : o ^ p; q; ; h; o ^ p; q; ; h; , then we use the second-step testing procedure based ; then we use the …rst-step testing procedure based (c) If either the condition in (a) or in (b) is satis…ed, then we use the …rst-step testing procedure based on WaT : 23 6 Simulation Evidence This section compares the …nite sample performances of one-step and two-step estimators and tests using the …xed-smoothing approximation. 6.1 Point Estimation We consider the location model given in (1) with the true parameter value 0 = (0; :::; 0) 2 Rd but we allow for a nonscalar error variance. The error fut g follows a VAR(1) process: u1ti = u1ti 1 u2ti = u2ti +p q q X u2tj 1 + e1ti for i = 1; :::; d (25) j=1 + e2ti for i = 1; :::; q 1 where e1ti iid N (0; 1) across i and t, e2ti iid N (0; 1) across i and t; and fe1t ; t = 1; 2; :::; T g are independent of fe2t ; t = 1; 2; :::; T g : Let ut := (u1t ; u2t )0 2 Rm , then ut = ut 1 + et where ! Id pq Jd;q e1t = ; et = iid N (0; Im ) : m m e2t 0 Iq Direct calculations give us the expressions for the long run and contemporaneous variances of fut g as = 1 X Eut ut j = (Im j= 1 0 and where Jd;q is the d = @ 1 (1 I + )2 d (1 0 1 B = var(ut ) = @ 1 ) 3p 2 (1 J )4 d ) 1 3p (1 J q q;d ) 1 (1 (1+ 2 ) 2 Id + 3 Jd (1 2 ) p J 2 2 q d q (1 ) 0 (Im ) ) J q d;q 2 Iq 1 1 A 2 p 1 1 Jd ) 2 Iq 2 2 q (1 q 1 C A; q matrix of ones. In our simulation, we …xhthe value of at 0:75 i so 1=2 1 that each time series is reasonably persistent. 
Let u1t = ( 1 2 ) u1t u2t and 12 ( 22 ) u2t = ( 22 ) 1=2 u2t and be the long run correlation matrix of ut = (u01t ; u02t )0 : With some algebraic manipulations, we have 0 = d+ 2 2 ) (1 2 1 Jd : 1 2 2 So the maximum eigenvalue of 0 is given by vmax ( 0 ) = 1 + (1 ) =(d 2 ) , which is also the only nonzero eigenvalue. Obviously, vmax ( 0 ) is increasing in 2 for any given and d: Given and d, we choose the value of to get a di¤erent value of vmax ( 0 ): We consider vmax ( 0 ) = 2 2 0; 0:09; 0:18; :::; 0:90; 0:99 which are obtained by setting 2 = vmax ( 0 )(1 ) = (d(1 vmax ( 0 )) : 24 For the basis functions in OS HAR estimation, we choose the following orthonormal basis 2 functions f j g1 j=1 in L [0; 1] space: 2j 1 (x) = p 2 cos(2j x) and 2j (x) = p 2 sin(2j x) for j = 1; :::; K=2 We also consider commonly used kernel based LRV estimators by employing three kernels: Bartlett, Parzen, QS kernels. In all our simulations the sample size T is 200: We focus on the case with d = 1, under which 0 is a scalar and max ( 0 ) = 0 . According to Proposition 3, if 0 greater than a threshold value, then V ar(^2T ) will be smaller than V ar(^1T ): Otherwise, V ar(^2T ) will be larger. We simulate V ar(^1T ) and V ar(^2T ) using 10; 000 simulation replications. Tables 6 7 report the simulated variances under the OS and kernel HAR estimation with q = 3 and 4 at some given values of the smoothing parameters3 . We …rst discuss the case when OS HAR estimation is used. It is clear that V ar(^2T ) becomes smaller than V ar(^1T ) only when 0 ) is large enough. For example, when q = 4 and there is no long run correlation, i.e., max ( 0 ) = 0; we have V ar(^1T ) = 0:080 < V ar(^2T ) = 0:112, and so ^1T is more e¢ cient than max ( ^2T with 27% e¢ ciency gain. These numerical observations are consistent with our theoretical result in Proposition 1: ^2T becomes more e¢ cient relative to ^1T only when the bene…t of using long run correlation matrix outweighs the cost of estimating it. With the choice of K = 14 and q = 4; Table 6 indicates that V ar(^2T ) starts to become smaller than V ar(^1T ) when max ( 0 ) = 0 crosses a value in the interval [0:270; 0:360] from below: This agrees with the theoretical threshold value 0 = q=(K 1) 0:307 given in Corollary 4: In the case when kernel HAR estimation is used, we get the exactly same qualitative messages. For example, consider the case with the Bartlett kernel, b = 0:08; and q = 3: We observe that V ar(^2T ) starts to become smaller than V ar(^1T ) when max ( 0 ) = 0 crosses a value in the interval [0:09; 0:18] from below. This is compatible with the threshold value 0:152 given in Table 1. 6.2 Hypothesis Testing In addition to the VAR(1) error process, we also consider the following VARMA(1,1) process for ut : u1ti = u1ti u2ti = i 1 + e1t + p u2ti 1 + e2ti q q X e2;tj 1 for i = 1; :::; d (26) j=1 for i = 1; :::; q i:i:d where et N (0; Im ) : The corresponding long run covariance matrix and contemporaneous covariance matrix are 1 0 2 1 Jd (1 )2 pq Jd;q 2 Id + (1 )2 A = @ (1 ) 1 J I q q;d 2p 2 (1 ) q (1 ) 3 To calculate the values of K and b, we employ the plug-in values (1991)’s AMSE-optimal fomulae assuming ( 0 ) = 0:40: 25 and B in Phillips (2005) and Andrews and 1 = 2 1 Id + p1 q1 2 2 2 1 p1 q1 Jd Jd;q 2 1 Jq;d 2 1 ! 
Considering both the VAR(1) and VARMA(1,1) error processes, we implement three testing procedures on the basis of $W_{1T}$, $W_{2T}$ and $W_{aT}$. The significance level is $\alpha = 0.05$, and we set $v_{\max}(\rho\rho') \in \{0.00, 0.35, 0.50, 0.60, 0.80, 0.90\}$ with $d = 3$ and $q = 3$. The null hypotheses of interest are
$$H_0^1: \theta_1 = 0\ \text{ and }\ H_0^2: \theta_1 = \theta_2 = 0,$$
for which $p = 1$ and $2$, respectively. Here, $W_{aT}$ is based on the working weighting matrix $W_a(\hat\theta_{0T})$ that uses a VAR(1) as the approximating model for the estimated error process $\{\hat u_t(\hat\theta_{0T})\}$. Note that $W_a(\hat\theta_{0T})$ converges in probability to the true long run variance matrix if the true data generating process is a VAR(1). However, the probability limit of $W_a(\hat\theta_{0T})$ is different from the true LRV when the true DGP is a VARMA(1,1). For the smoothing parameters, we employ the data-driven AMSE-optimal choices based on the VAR(1) plug-in implementation developed by Andrews (1991) and Phillips (2005). All the results below are based on 10,000 simulation replications.

Tables 8-15 report the empirical size of the three testing procedures under the two types of asymptotic approximations, with OS HAR and kernel-based LRV estimators. It is clear that all three tests, based on $W_{1T}$, $W_{aT}$ and $W_{2T}$, suffer from severe size distortion if the conventional normal (or chi-square) critical values are used. For example, when the DGP is a VAR(1) and OS HAR estimation is implemented, the empirical sizes of the three tests using the OS variance estimator are around 16%-24% when $p = 2$. The relatively large size distortion of the $W_{2T}$ test comes from the additional cost of estimating the weighting matrix. However, if the nonstandard critical values from $W_{1\infty}$ and $W_{2\infty}$ are used, the size distortion of all procedures is substantially reduced. This result agrees with the previous literature, such as Sun (2013, 2014a, 2014b) and Kiefer and Vogelsang (2005), which highlights the higher accuracy of the fixed-smoothing approximations. We also observe that, when the fixed-smoothing approximations are used, the $W_{1T}$ test is more size-distorted than the $W_{2T}$ test in most cases. Similar results for the kernel cases are reported in Tables 10-15.

Next, we investigate the finite sample power of the three procedures. We use finite sample critical values simulated under the null, so the power is size-adjusted and the power comparison is meaningful. The DGPs are the same as before except that the parameters are generated under the local alternatives $\theta = \theta_0 + \delta_0/\sqrt T$. The degree of overidentification considered here is $q = 3$. Also, the domain of each power curve is rescaled to be $\delta^2 := \delta_0'\Omega_{11}^{-1}\delta_0$, as in Section 4. Figures 3-6 show the size-adjusted finite sample power of the procedures under OS HAR estimation. The dashed lines in the figures indicate the power curves of the $W_{aT}$ procedure, while $W_{1T}$ and $W_{2T}$ are represented by dash-dot and dotted lines, respectively. In all figures, the power curve of the two-step test shifts upward as the degree of the long run correlation $v_{\max}(\rho_R\rho_R')$ increases, and it starts to dominate that of the one-step test from a certain point $v_{\max}(\rho_R\rho_R') \in (0,1)$. This is consistent with Proposition 7.
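The fixed-smoothing critical values used for these size and power comparisons can be obtained by direct simulation of the limiting random variables. The following is a minimal sketch for the OS-based one-step test, assuming the null limit $W_{1\infty}(0) = B_p(1)'C_{pp}^{-1}B_p(1)$ with $C_{pp} = K^{-1}\sum_j\xi_j\xi_j'$, $\xi_j \sim$ iid $N(0, I_p)$ independent of $B_p(1)$; the function name is ours, and the scaled-$F$ cross-check draws on the known $F$ representation for the OS case (Sun, 2013).

```python
import numpy as np
from scipy import stats

def w1_inf_cv(p, K, alpha=0.05, n_sim=100_000, seed=0):
    """Simulate the (1 - alpha) critical value of the fixed-smoothing null limit
    W_{1,inf}(0) = B_p(1)' C_pp^{-1} B_p(1), C_pp = (1/K) sum_j xi_j xi_j'."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_sim)
    for s in range(n_sim):
        b = rng.standard_normal(p)               # B_p(1)
        xi = rng.standard_normal((K, p))         # xi_j ~ iid N(0, I_p)
        cpp = xi.T @ xi / K
        draws[s] = b @ np.linalg.solve(cpp, b)
    return np.quantile(draws, 1 - alpha)

p, K = 2, 14
print(w1_inf_cv(p, K))                                 # simulated fixed-smoothing cv
print(p * K / (K - p + 1) * stats.f.ppf(0.95, p, K - p + 1))  # scaled-F cross-check
print(stats.chi2.ppf(0.95, p))                         # conventional chi-square cv
```

The chi-square critical value is visibly smaller than the fixed-smoothing one, which is the source of the over-rejection documented in Tables 8-15.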
For example, with $K = 14$ and $p = 1$, the power curves in Figure 3 show that the power curve of the two-step test $W_{2T}$ starts to cross that of the one-step test $W_{1T}$ when $v_{\max}(\rho_R\rho_R')$ is around 0.25. This matches our theoretical results in Proposition 7 and Table 5, which indicate that the threshold value $\max_{\delta^2\in[1,10]} f(\delta^2; K, p, q, \alpha)$ is about 0.273 when $K = 14$, $p = 1$ and $q = 3$. Also, if $v_{\max}(\rho_R\rho_R')$ is as high as 0.75, the two-step test is more powerful than the one-step test in most cases.

Lastly, when the DGP is a VAR(1), the performance of $W_{aT}$ dominates $W_{1T}$ and $W_{2T}$ over the whole range $v_{\max}(\rho_R\rho_R') \in (0,1)$. This is because the working weighting matrix $W_a(\hat\theta_{0T})$ with the VAR(1) approximation captures the true long run variance matrix almost precisely, and this improves the power of the tests whenever there is some long run correlation between $u_{1t}$ and $u_{2t}$. In the VARMA(1,1) case, however, Figures 5 and 6 show that the advantage of $W_{aT}$ is reduced, especially when $v_{\max}(\rho_R\rho_R')$ is close to one. This is due to the misspecification bias in $W_a(\hat\theta_{0T})$, which is estimated from a misspecified approximating model. Nevertheless, we still observe comparable performance of $W_{aT}$ for most values of $v_{\max}(\rho_R\rho_R')$. Figures 7-9, based on the kernel LRV estimators, deliver the same qualitative message.

7 Conclusion

In this paper we have provided more accurate and honest comparisons between the popular one-step and two-step GMM estimators and the associated inference procedures. We have given some clear guidance on when we should go one step further and use a two-step procedure. Qualitatively, we want to go one step further only if the benefit of doing so clearly outweighs the cost. When the benefit and cost comparison is not clear-cut, we recommend using the two-step procedure with a working weighting matrix.

The qualitative message of the paper applies more broadly. As long as a two-step procedure involves additional nonparametric estimation uncertainty relative to the one-step procedure, we have to be very cautious about using the two-step procedure. While some asymptotic theory may indicate that the two-step procedure is always more efficient, the efficiency gain may not materialize in finite samples. In fact, blindly using the two-step procedure may sometimes do more harm than good.

There are many possible extensions of the paper; we give some examples here. First, we can use the more accurate approximations to compare the continuous updating GMM estimator and other generalized empirical likelihood estimators with the one-step and two-step estimators. Second, while the fixed-smoothing asymptotics captures the nonparametric estimation uncertainty of the weighting matrix estimator, it does not fully capture the estimation uncertainty embodied in the first-step estimator. The source of the problem is that we do not observe the moment process and have to use the estimated moment process, based on the first-step estimator, to construct the nonparametric variance estimator. It would be interesting to develop a further refinement of the fixed-smoothing approximation to capture the first-step estimation uncertainty more adequately. Finally, it would also be very interesting to give an honest assessment of the relative merits of the OLS and GLS estimators, which are popular in empirical applications.

8 Table of Selected Notations

Moment processes and their variance, LRV and implied regression matrices:
$$u_t = f(v_t,\theta_0)\ \text{(moment process)},\qquad \tilde u_t = U'f(v_t,\theta_0)\ \text{(rotated moment process)},$$
$$\Sigma = Eu_tu_t'\ \text{(variance)},\qquad \Omega = \sum_{j=-\infty}^{\infty}Eu_tu_{t-j}' = \begin{pmatrix}\Omega_{11} & \Omega_{12}\\ \Omega_{21} & \Omega_{22}\end{pmatrix}\ \text{(LRV)},$$
$$\rho = \Omega_{11}^{-1/2}\Omega_{12}\big(\Omega_{22}^{-1/2}\big)'\ \text{(long run correlation matrix)},\qquad \beta_\infty = \Omega_{12}\Omega_{22}^{-1}\ \text{(implied regression matrix)}.$$

Some limiting quantities:
$$C = \int_0^1\!\!\int_0^1 Q_h(r,s)dB_{p+q}(r)dB_{p+q}(s)' = \begin{pmatrix}C_{pp} & C_{pq}\\ C_{qp} & C_{qq}\end{pmatrix},\qquad D_{pp} = C_{pp} - C_{pq}C_{qq}^{-1}C_{qp},$$
$$\tilde\Phi_1 = \int_0^1\!\!\int_0^1 Q_h(r,s)dB_m(r)dB_m(s)' = \begin{pmatrix}\tilde\Phi_{1,11} & \tilde\Phi_{1,12}\\ \tilde\Phi_{1,21} & \tilde\Phi_{1,22}\end{pmatrix},\qquad \tilde\beta_1 = \tilde\Phi_1(h,d,q) := \tilde\Phi_{1,12}\tilde\Phi_{1,22}^{-1}.$$
Table 1: Threshold values g(h,q) for asymptotic variance comparison with the Bartlett kernel

   b      q=1    q=2    q=3    q=4    q=5
 0.010   0.007  0.014  0.020  0.027  0.033
 0.020   0.014  0.027  0.040  0.053  0.065
 0.030   0.020  0.040  0.059  0.078  0.097
 0.040   0.027  0.053  0.079  0.104  0.128
 0.050   0.034  0.066  0.098  0.128  0.157
 0.060   0.040  0.079  0.116  0.152  0.185
 0.070   0.047  0.092  0.135  0.175  0.211
 0.080   0.054  0.104  0.152  0.197  0.237
 0.090   0.061  0.117  0.170  0.218  0.260
 0.100   0.068  0.129  0.186  0.238  0.282
 0.110   0.074  0.141  0.203  0.257  0.303
 0.120   0.081  0.153  0.218  0.274  0.322
 0.130   0.088  0.164  0.233  0.291  0.340
 0.140   0.094  0.175  0.247  0.306  0.356
 0.150   0.101  0.186  0.260  0.321  0.371
 0.160   0.107  0.196  0.273  0.334  0.384
 0.170   0.113  0.206  0.284  0.347  0.397
 0.180   0.119  0.216  0.295  0.358  0.407
 0.190   0.124  0.226  0.306  0.369  0.417
 0.200   0.130  0.235  0.316  0.380  0.425

Note: $h = 1/b$ indicates the level of smoothing and $q$ is the degree of overidentification. When $\rho\rho' > g(h,q)I_d$, the two-step estimator $\hat\theta_{2T}$ is more efficient than the one-step estimator $\hat\theta_{1T}$.

Table 2: Threshold values g(h,q) for asymptotic variance comparison with the Parzen kernel

   b      q=1    q=2    q=3    q=4    q=5
 0.010   0.006  0.011  0.016  0.022  0.027
 0.020   0.011  0.022  0.033  0.043  0.054
 0.030   0.017  0.033  0.049  0.065  0.081
 0.040   0.022  0.044  0.065  0.087  0.107
 0.050   0.028  0.055  0.082  0.108  0.134
 0.060   0.033  0.066  0.099  0.130  0.161
 0.070   0.039  0.077  0.115  0.152  0.187
 0.080   0.045  0.088  0.132  0.173  0.213
 0.090   0.051  0.100  0.148  0.194  0.238
 0.100   0.057  0.111  0.164  0.215  0.263
 0.110   0.063  0.122  0.181  0.236  0.288
 0.120   0.069  0.133  0.197  0.257  0.312
 0.130   0.075  0.145  0.213  0.277  0.336
 0.140   0.081  0.156  0.229  0.297  0.359
 0.150   0.087  0.168  0.245  0.317  0.382
 0.160   0.093  0.179  0.261  0.337  0.404
 0.170   0.100  0.191  0.277  0.356  0.426
 0.180   0.106  0.202  0.293  0.375  0.448
 0.190   0.112  0.214  0.308  0.393  0.469
 0.200   0.118  0.225  0.323  0.411  0.489

See notes to Table 1.

Table 3: Threshold values g(h,q) for asymptotic variance comparison with the QS kernel

   b      q=1    q=2    q=3    q=4    q=5
 0.010   0.010  0.020  0.030  0.040  0.050
 0.020   0.021  0.041  0.061  0.082  0.102
 0.030   0.031  0.062  0.093  0.124  0.154
 0.040   0.042  0.084  0.126  0.166  0.206
 0.050   0.053  0.106  0.158  0.209  0.258
 0.060   0.065  0.128  0.191  0.252  0.311
 0.070   0.077  0.151  0.225  0.296  0.362
 0.080   0.089  0.175  0.259  0.340  0.414
 0.090   0.102  0.198  0.293  0.382  0.464
 0.100   0.115  0.222  0.326  0.423  0.516
 0.110   0.127  0.247  0.359  0.463  0.565
 0.120   0.140  0.271  0.392  0.502  0.612
 0.130   0.153  0.296  0.426  0.542  0.655
 0.140   0.166  0.321  0.458  0.581  0.697
 0.150   0.179  0.346  0.489  0.619  0.736
 0.160   0.193  0.371  0.520  0.655  0.773
 0.170   0.206  0.395  0.549  0.690  0.806
 0.180   0.220  0.418  0.578  0.722  0.834
 0.190   0.233  0.441  0.605  0.752  0.859
 0.200   0.246  0.463  0.630  0.779  0.879

See notes to Table 1.

Table 4: Values of $f(\delta^2; K, p, q, \alpha)$ for $\alpha$ = 0.05 and K = 8, 10

  K   delta^2    p=1                    p=2                    p=3
                 q=1    q=2    q=3     q=1    q=2    q=3     q=1    q=2    q=3
  8      1      0.144  0.350  0.514   0.234  0.375  0.599   0.231  0.456  0.614
  8      5      0.152  0.357  0.494   0.208  0.372  0.600   0.229  0.490  0.647
  8      9      0.154  0.343  0.489   0.218  0.397  0.608   0.227  0.498  0.667
  8     13      0.162  0.354  0.494   0.217  0.404  0.616   0.229  0.511  0.686
  8     17      0.168  0.360  0.501   0.210  0.397  0.621   0.233  0.513  0.700
  8     21      0.171  0.363  0.509   0.204  0.405  0.619   0.234  0.515  0.712
  8     25      0.165  0.379  0.526   0.207  0.410  0.623   0.233  0.514  0.720
 10      1      0.071  0.283  0.441   0.159  0.290  0.450   0.192  0.365  0.509
 10      5      0.133  0.276  0.425   0.132  0.304  0.431   0.187  0.331  0.508
 10      9      0.133  0.260  0.408   0.134  0.308  0.432   0.196  0.337  0.513
 10     13      0.133  0.267  0.405   0.133  0.308  0.440   0.209  0.349  0.509
 10     17      0.138  0.267  0.417   0.139  0.309  0.447   0.212  0.348  0.515
 10     21      0.128  0.304  0.418   0.137  0.313  0.453   0.202  0.348  0.523
 10     25      0.133  0.311  0.432   0.151  0.321  0.457   0.204  0.360  0.522

Note: If the largest squared long run canonical correlation between the two blocks of moment conditions is smaller than $f(\delta^2; K, p, q, \alpha)$, then the two-step test is asymptotically less powerful than the one-step test.

Table 5: Values of $f(\delta^2; K, p, q, \alpha)$ for $\alpha$ = 0.05 and K = 12, 14

  K   delta^2    p=1                    p=2                    p=3
                 q=1    q=2    q=3     q=1    q=2    q=3     q=1    q=2    q=3
 12      1      0.092  0.200  0.298   0.146  0.202  0.322   0.163  0.339  0.346
 12      5      0.101  0.217  0.304   0.124  0.247  0.350   0.118  0.290  0.360
 12      9      0.093  0.201  0.308   0.127  0.226  0.357   0.117  0.274  0.357
 12     13      0.104  0.196  0.322   0.121  0.230  0.361   0.116  0.275  0.369
 12     17      0.092  0.194  0.327   0.124  0.239  0.362   0.123  0.284  0.379
 12     21      0.119  0.204  0.327   0.116  0.240  0.362   0.122  0.286  0.385
 12     25      0.113  0.251  0.350   0.105  0.243  0.364   0.123  0.296  0.388
 14      1      0.071  0.286  0.273   0.101  0.182  0.331   0.135  0.263  0.366
 14      5      0.091  0.223  0.267   0.133  0.184  0.279   0.110  0.211  0.337
 14      9      0.084  0.215  0.265   0.118  0.192  0.283   0.118  0.212  0.338
 14     13      0.087  0.199  0.257   0.107  0.200  0.283   0.122  0.216  0.334
 14     17      0.087  0.213  0.265   0.105  0.196  0.284   0.128  0.207  0.334
 14     21      0.103  0.245  0.262   0.097  0.190  0.289   0.119  0.213  0.337
 14     25      0.086  0.244  0.269   0.095  0.201  0.295   0.130  0.205  0.339

See note to Table 4.

Table 6: Finite sample variance comparison between $\hat\theta_{1T}$ and $\hat\theta_{2T}$ under VAR(1) error with T = 200 and q = 3

                            Var(theta_2T)
 vmax(rho rho')  Var(theta_1T)  OS K=14  Bartlett b=0.08  Parzen b=0.15  QS b=0.08  Var(theta_aT)
 0.000           0.081          0.103    0.100            0.108          0.109      0.089
 0.090           0.093          0.105    0.103            0.110          0.111      0.093
 0.180           0.107          0.108    0.105            0.112          0.113      0.096
 0.270           0.124          0.111    0.108            0.114          0.115      0.099
 0.360           0.146          0.115    0.111            0.117          0.118      0.102
 0.450           0.174          0.120    0.116            0.120          0.122      0.106
 0.540           0.214          0.127    0.122            0.125          0.127      0.110
 0.630           0.272          0.137    0.131            0.132          0.134      0.116
 0.720           0.368          0.154    0.145            0.144          0.146      0.123
 0.810           0.554          0.185    0.174            0.166          0.170      0.135
 0.900           1.073          0.274    0.253            0.227          0.235      0.166
 0.990          10.892          1.937    1.731            1.372          1.451      0.714

Table 7: Finite sample variance comparison between $\hat\theta_{1T}$ and $\hat\theta_{2T}$ under VAR(1) error with T = 200 and q = 4

                            Var(theta_2T)
 vmax(rho rho')  Var(theta_1T)  OS K=14  Bartlett b=0.070  Parzen b=0.150  QS b=0.070  Var(theta_aT)
 0.000           0.081          0.112    0.104             0.120           0.114       0.089
 0.090           0.092          0.114    0.106             0.121           0.115       0.093
 0.180           0.106          0.117    0.108             0.123           0.118       0.096
 0.270           0.122          0.124    0.111             0.126           0.120       0.100
 0.360           0.146          0.125    0.115             0.129           0.124       0.105
 0.450           0.175          0.130    0.121             0.133           0.129       0.110
 0.540           0.217          0.139    0.129             0.139           0.135       0.116
 0.630           0.278          0.151    0.141             0.148           0.146       0.123
 0.720           0.379          0.172    0.160             0.163           0.162       0.134
 0.810           0.576          0.213    0.198             0.193           0.196       0.152
 0.900           1.128          0.328    0.305             0.276           0.289       0.197
 0.990          11.627          2.538    2.364             1.884           2.089       1.013

Table 8: Empirical size of the one-step and two-step tests based on the series LRV estimator under VAR(1) error with psi = 0.75, p = 1, 2, and T = 200

 p = 1 and q = 3
                   One Step (Omega-hat)   One Step (W)      Two Step
 vmax(rho_R rho_R')  chi^2    W_1inf      chi^2   W_1inf    chi^2   W_2inf
 0.00                0.128    0.098       0.151   0.119     0.187   0.076
 0.15                0.126    0.096       0.135   0.103     0.177   0.061
 0.25                0.135    0.102       0.138   0.105     0.187   0.063
 0.33                0.135    0.105       0.127   0.094     0.174   0.059
 0.57                0.139    0.107       0.086   0.061     0.154   0.044
 0.75                0.143    0.116       0.046   0.031     0.118   0.032

 p = 2 and q = 3
 0.00                0.181    0.111       0.222   0.138     0.290   0.077
 0.26                0.191    0.118       0.219   0.136     0.296   0.069
 0.40                0.192    0.115       0.201   0.120     0.290   0.065
 0.50                0.195    0.119       0.194   0.112     0.290   0.057
 0.73                0.206    0.120       0.168   0.095     0.272   0.057
 0.86                0.206    0.124       0.143   0.082     0.245   0.051

Table 9: Empirical size of the one-step and two-step tests based on the series LRV estimator under VARMA(1,1) error with psi = 0.75, p = 1, 2, and T = 200

 p = 1 and q = 3
                   One Step (Omega-hat)   One Step (W)      Two Step
 vmax(rho_R rho_R')  chi^2    W_1inf      chi^2   W_1inf    chi^2   W_2inf
 0.00                0.117    0.091       0.138   0.108     0.181   0.068
 0.15                0.140    0.113       0.142   0.113     0.173   0.071
 0.25                0.144    0.117       0.140   0.113     0.165   0.065
 0.33                0.155    0.127       0.141   0.111     0.160   0.060
 0.57                0.167    0.138       0.128   0.106     0.121   0.043
 0.75                0.168    0.141       0.118   0.096     0.087   0.025

 p = 2 and q = 3
 0.00                0.188    0.119       0.227   0.146     0.290   0.080
 0.26                0.202    0.129       0.209   0.136     0.270   0.073
 0.40                0.206    0.135       0.204   0.134     0.254   0.069
 0.50                0.223    0.148       0.215   0.144     0.251   0.065
 0.73                0.221    0.148       0.205   0.138     0.214   0.053
 0.86                0.222    0.156       0.194   0.132     0.178   0.044

Table 10: Empirical size of the one-step and two-step tests based on the Bartlett kernel variance estimator under VAR(1) error with psi = 0.75, p = 1, 2 and T = 200

 p = 1 and q = 3
                   One Step (Omega-hat)   One Step (W)      Two Step
 vmax(rho_R rho_R')  chi^2    W_1inf      chi^2   W_1inf    chi^2   W_2inf
 0.00                0.156    0.138       0.192   0.172     0.201   0.133
 0.15                0.163    0.138       0.175   0.154     0.201   0.120
 0.25                0.161    0.138       0.164   0.141     0.196   0.112
 0.33                0.154    0.127       0.140   0.115     0.181   0.100
 0.57                0.147    0.119       0.085   0.066     0.144   0.069
 0.75                0.152    0.128       0.035   0.023     0.115   0.053

 p = 2 and q = 3
 0.00                0.239    0.183       0.287   0.228     0.305   0.177
 0.26                0.230    0.166       0.263   0.196     0.298   0.150
 0.40                0.231    0.169       0.243   0.170     0.296   0.138
 0.50                0.228    0.161       0.234   0.159     0.286   0.130
 0.73                0.228    0.157       0.179   0.118     0.263   0.108
 0.86                0.230    0.159       0.161   0.108     0.240   0.098

Table 11: Empirical size of the one-step and two-step tests based on the Bartlett kernel variance estimator under VARMA(1,1) error with psi = 0.75, p = 1, 2 and T = 200

 p = 1 and q = 3
                   One Step (Omega-hat)   One Step (W)      Two Step
 vmax(rho_R rho_R')  chi^2    W_1inf      chi^2   W_1inf    chi^2   W_2inf
 0.00                0.161    0.142       0.196   0.177     0.203   0.134
 0.15                0.147    0.127       0.165   0.144     0.188   0.116
 0.25                0.140    0.117       0.149   0.129     0.174   0.105
 0.33                0.131    0.115       0.134   0.113     0.158   0.090
 0.57                0.117    0.099       0.083   0.068     0.109   0.051
 0.75                0.109    0.092       0.035   0.026     0.058   0.024

 p = 2 and q = 3
 0.00                0.235    0.180       0.292   0.230     0.307   0.174
 0.26                0.213    0.157       0.239   0.181     0.278   0.146
 0.40                0.203    0.147       0.224   0.165     0.262   0.124
 0.50                0.205    0.146       0.209   0.151     0.246   0.115
 0.73                0.191    0.136       0.167   0.114     0.195   0.085
 0.86                0.190    0.133       0.147   0.105     0.174   0.078

Table 12: Empirical size of the one-step and two-step tests based on the Parzen kernel variance estimator under VAR(1) error with psi = 0.75, p = 1, 2 and T = 200

 p = 1 and q = 3
                   One Step (Omega-hat)   One Step (W)      Two Step
 vmax(rho_R rho_R')  chi^2    W_1inf      chi^2   W_1inf    chi^2   W_2inf
 0.00                0.145    0.108       0.182   0.139     0.214   0.090
 0.15                0.148    0.105       0.173   0.125     0.223   0.076
 0.25                0.142    0.102       0.161   0.115     0.220   0.070
 0.33                0.142    0.101       0.142   0.099     0.211   0.063
 0.57                0.150    0.105       0.107   0.068     0.186   0.050
 0.75                0.141    0.101       0.054   0.030     0.147   0.034

 p = 2 and q = 3
 0.00                0.216    0.123       0.278   0.169     0.340   0.102
 0.26                0.225    0.117       0.267   0.149     0.348   0.085
 0.40                0.221    0.117       0.260   0.140     0.346   0.081
 0.50                0.219    0.112       0.241   0.123     0.331   0.072
 0.73                0.217    0.102       0.199   0.097     0.310   0.059
 0.86                0.226    0.116       0.175   0.080     0.292   0.054

Table 13: Empirical size of the one-step and two-step tests based on the Parzen kernel variance estimator under VARMA(1,1) error with psi = 0.75, p = 1, 2 and T = 200

 p = 1 and q = 3
                   One Step (Omega-hat)   One Step (W)      Two Step
 vmax(rho_R rho_R')  chi^2    W_1inf      chi^2   W_1inf    chi^2   W_2inf
 0.00                0.142    0.104       0.186   0.141     0.218   0.088
 0.15                0.134    0.099       0.164   0.125     0.210   0.082
 0.25                0.136    0.099       0.155   0.117     0.200   0.076
 0.33                0.127    0.096       0.150   0.113     0.191   0.074
 0.57                0.122    0.087       0.110   0.079     0.156   0.052
 0.75                0.111    0.082       0.070   0.046     0.114   0.033

 p = 2 and q = 3
 0.00                0.220    0.124       0.279   0.171     0.338   0.100
 0.26                0.204    0.112       0.248   0.142     0.320   0.094
 0.40                0.198    0.108       0.226   0.135     0.303   0.083
 0.50                0.196    0.112       0.225   0.131     0.291   0.085
 0.73                0.186    0.106       0.188   0.102     0.255   0.063
 0.86                0.182    0.105       0.156   0.083     0.219   0.055

Table 14: Empirical size of the one-step and two-step tests based on the QS kernel variance estimator under VAR(1) error with psi = 0.75, p = 1, 2 and T = 200

 p = 1 and q = 3
                   One Step (Omega-hat)   One Step (W)      Two Step
 vmax(rho_R rho_R')  chi^2    W_1inf      chi^2   W_1inf    chi^2   W_2inf
 0.00                0.138    0.107       0.174   0.144     0.204   0.089
 0.15                0.138    0.103       0.164   0.126     0.209   0.077
 0.25                0.141    0.106       0.151   0.115     0.214   0.076
 0.33                0.135    0.099       0.145   0.106     0.208   0.069
 0.57                0.149    0.110       0.101   0.068     0.187   0.056
 0.75                0.132    0.099       0.049   0.029     0.136   0.036

 p = 2 and q = 3
 0.00                0.210    0.124       0.265   0.168     0.312   0.101
 0.26                0.217    0.122       0.261   0.151     0.335   0.089
 0.40                0.216    0.119       0.244   0.141     0.327   0.084
 0.50                0.214    0.114       0.234   0.130     0.332   0.077
 0.73                0.204    0.113       0.188   0.099     0.295   0.063
 0.86                0.214    0.121       0.158   0.082     0.277   0.063

Table 15: Empirical size of the one-step and two-step tests based on the QS kernel variance estimator under VARMA(1,1) error with psi = 0.75, p = 1, 2 and T = 200

 p = 1 and q = 3
                   One Step (Omega-hat)   One Step (W)      Two Step
 vmax(rho_R rho_R')  chi^2    W_1inf      chi^2   W_1inf    chi^2   W_2inf
 0.00                0.141    0.112       0.175   0.141     0.204   0.090
 0.15                0.137    0.110       0.164   0.132     0.201   0.089
 0.25                0.130    0.104       0.149   0.117     0.188   0.076
 0.33                0.123    0.096       0.140   0.111     0.178   0.074
 0.57                0.117    0.094       0.113   0.088     0.152   0.058
 0.75                0.110    0.085       0.060   0.042     0.110   0.034

 p = 2 and q = 3
 0.00                0.213    0.128       0.271   0.176     0.323   0.106
 0.26                0.199    0.123       0.249   0.160     0.310   0.104
 0.40                0.194    0.122       0.231   0.147     0.297   0.096
 0.50                0.183    0.108       0.212   0.130     0.278   0.083
 0.73                0.188    0.114       0.187   0.113     0.250   0.072
 0.86                0.182    0.113       0.156   0.091     0.217   0.061
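The kernel threshold values in Tables 1-3 can be checked by direct simulation of $\tau(h,q) = E\|\tilde\Phi_1(h,1,q)\|^2$ and $g(h,q) = \tau/(1+\tau)$, approximating the double Wiener integrals on a grid. The following is a minimal sketch for the Bartlett kernel, assuming the demeaned-kernel representation of the fixed-$b$ limit from the fixed-$b$ literature; the function name and discretization choices are ours.

```python
import numpy as np

def g_bartlett_mc(b, q, n_grid=400, n_sim=5000, seed=0):
    """Monte Carlo sketch of g(h, q) = tau / (1 + tau),
    tau = E || Phi_1(h, 1, q) ||^2, for the Bartlett kernel with fixed b.
    Double stochastic integrals are approximated on an n_grid grid with
    the demeaned kernel Q_b(r, s)."""
    rng = np.random.default_rng(seed)
    r = (np.arange(n_grid) + 0.5) / n_grid
    k = np.maximum(0.0, 1.0 - np.abs(r[:, None] - r[None, :]) / b)
    Q = k - k.mean(0, keepdims=True) - k.mean(1, keepdims=True) + k.mean()
    tau = 0.0
    for _ in range(n_sim):
        dB1 = rng.standard_normal(n_grid) / np.sqrt(n_grid)       # dB_1 increments
        dBq = rng.standard_normal((n_grid, q)) / np.sqrt(n_grid)  # dB_q increments
        num = dB1 @ Q @ dBq                 # int int Q dB_1 dB_q'  (1 x q)
        den = dBq.T @ Q @ dBq               # int int Q dB_q dB_q'  (q x q)
        phi = np.linalg.solve(den, num)     # Phi_1(h, 1, q) transposed
        tau += phi @ phi
    tau /= n_sim
    return tau / (1 + tau)

print(g_bartlett_mc(0.08, 3))   # compare with the Table 1 entry 0.152
```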
Figure 3: Size-adjusted power of the three tests based on the OS HAR estimator under VAR(1) error with p = 1, q = 3, psi = 0.75, T = 200, and K = 14.

Figure 4: Size-adjusted power of the three tests based on the OS HAR estimator under VAR(1) error with p = 2, q = 3, psi = 0.75, T = 200, and K = 14.

Figure 5: Size-adjusted power of the three tests based on the OS HAR estimator under VARMA(1,1) error with p = 1, q = 3, psi = 0.75, T = 200, and K = 14.

Figure 6: Size-adjusted power of the three tests based on the OS HAR estimator under VARMA(1,1) error with p = 2, q = 3, psi = 0.75, T = 200, and K = 14.

Figure 7: Size-adjusted power of the three tests based on the Bartlett HAR estimator under VAR(1) error with p = 2, q = 3, psi = 0.75, T = 200, and b = 0.078.

Figure 8: Size-adjusted power of the three tests based on the Parzen HAR estimator under VAR(1) error with p = 2, q = 3, psi = 0.75, T = 200, and b = 0.16.

Figure 9: Size-adjusted power of the three tests based on the QS HAR estimator under VAR(1) error with p = 2, q = 3, psi = 0.75, T = 200, and b = 0.079.

9 Appendix of Proofs

Lemma 12. Let $A\in\mathbb R^{n\times n}$ and $B\in\mathbb R^{n\times n}$ be any two matrix square roots of a positive definite symmetric matrix $C\in\mathbb R^{n\times n}$, i.e., $AA' = BB' = C$. Then there exists an orthogonal matrix $Q\in\mathbb R^{n\times n}$ such that $A = BQ$.

Proof of Lemma 12. It follows from $AA' = BB'$ that $B^{-1}AA'(B')^{-1} = I_n$. Let $Q = B^{-1}A$; then $QQ' = I_n$, that is, $Q$ is an orthogonal matrix. For this choice of $Q$, we have $A = BQ$, as desired.

Proof of Proposition 1. Part (a) follows from Lemma 2 of Sun (2013). For part (b), we note that $\hat\beta \Rightarrow \tilde\beta_1$, and so
$$\sqrt T\left(\hat\theta_{2T}-\theta_0\right) = \left(I_d, -\hat\beta\right)\begin{pmatrix}\frac{1}{\sqrt T}\sum_{t=1}^T(y_{1t}-Ey_{1t})\\ \frac{1}{\sqrt T}\sum_{t=1}^T y_{2t}\end{pmatrix} \Rightarrow \left(I_d, -\tilde\beta_1\right)\Omega^{1/2}B_m(1).$$

Proof of Lemma 2. For any $a\in\mathbb R^d$, conditioning on $B_q(\cdot)$ and using the independence of $B_d(\cdot)$ and $B_q(\cdot)$, we have
$$Ea'\tilde\Phi_1(h,d,q)\tilde\Phi_1(h,d,q)'a = E\,\mathrm{tr}\left\{\left[\int_0^1\!\!\int_0^1 Q_h(r,s)dB_q(r)dB_q(s)'\right]^{-2}\left[\int_0^1\!\!\int_0^1 Q_h(r,s)dB_q(r)dB_d(s)'\right]aa'\left[\int_0^1\!\!\int_0^1 Q_h(r,s)dB_d(r)dB_q(s)'\right]\right\} = \tau(h,q)\,a'a,$$
where
$$\tau(h,q) = E\,\mathrm{tr}\left\{\left[\int_0^1\!\!\int_0^1 Q_h(r,s)dB_q(r)dB_q(s)'\right]^{-2}\int_0^1\!\!\int_0^1\left[\int_0^1 Q_h(r,\tau)Q_h(\tau,s)d\tau\right]dB_q(r)dB_q(s)'\right\}.$$
So $E\tilde\Phi_1(h,d,q)\tilde\Phi_1(h,d,q)' = \tau(h,q)I_d$. Since this holds for any $d$, we have $E\|\tilde\Phi_1(h,1,q)\|^2 = \tau(h,q)$. It then follows that
$$E\tilde\Phi_1(h,d,q)\tilde\Phi_1(h,d,q)' = E\left\|\tilde\Phi_1(h,1,q)\right\|^2 I_d.$$
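Lemma 2 reduces the matrix comparison to the scalar constant $\tau(h,q) = E\|\tilde\Phi_1(h,1,q)\|^2$. The proof of Corollary 4 below shows that, for the OS estimator, $\tilde\Phi_1(h,1,q) = \big(\sum_j\xi_j\eta_j'\big)\big(\sum_j\eta_j\eta_j'\big)^{-1}$ with $\xi_j \sim$ iid $N(0,1)$ and $\eta_j \sim$ iid $N(0,I_q)$, so $\tau$ can be verified by simulation against the inverse-Wishart closed form $q/(K-q-1)$. The following is a minimal sketch with names of our choosing.

```python
import numpy as np

def tau_os_mc(K, q, n_sim=200_000, seed=0):
    """Monte Carlo estimate of tau(h, q) = E || Phi_1(h, 1, q) ||^2
    for the OS LRV estimator with K basis functions."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_sim):
        xi = rng.standard_normal(K)         # xi_j ~ iid N(0, 1)
        eta = rng.standard_normal((K, q))   # eta_j ~ iid N(0, I_q)
        # Phi_1 = (sum_j xi_j eta_j') (sum_j eta_j eta_j')^{-1}, a 1 x q vector
        phi = np.linalg.solve(eta.T @ eta, eta.T @ xi)
        total += phi @ phi
    return total / n_sim

K, q = 14, 4
tau = tau_os_mc(K, q)
print(tau, q / (K - q - 1))            # both ~ 0.444
print(tau / (1 + tau), q / (K - 1))    # threshold g(h, q): both ~ 0.308
```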
Proof of Proposition 3. We have
$$\Omega_{1\cdot2} = \Omega_{11} - \Omega_{12}\Omega_{22}^{-1}\Omega_{21} = \Omega_{11}^{1/2}\left[I_d - \Omega_{11}^{-1/2}\Omega_{12}\Omega_{22}^{-1}\Omega_{21}\big(\Omega_{11}^{-1/2}\big)'\right]\big(\Omega_{11}^{1/2}\big)' = \Omega_{11}^{1/2}\left(I_d - \rho\rho'\right)\big(\Omega_{11}^{1/2}\big)'.$$
Note that $\Omega_{11}^{1/2}(I_d-\rho\rho')^{1/2}$ is a matrix square root of $\Omega_{1\cdot2}$. Combining this with Lemma 12, we can represent any matrix square root of $\Omega_{1\cdot2}$ as
$$\Omega_{1\cdot2}^{1/2} = \Omega_{11}^{1/2}\left(I_d - \rho\rho'\right)^{1/2}Q \qquad(27)$$
for an orthogonal matrix $Q$. Using this representation and Proposition 1(b), the asymptotic variance of $\hat\theta_{2T}$ can be written as
$$\Omega_{1\cdot2}^{1/2}\left[I_d + E\tilde\Phi_1(h,d,q)\tilde\Phi_1(h,d,q)'\right]\big(\Omega_{1\cdot2}^{1/2}\big)' = \Omega_{11}^{1/2}\left(I_d-\rho\rho'\right)^{1/2}Q\left[I_d + E\tilde\Phi_1\tilde\Phi_1'\right]Q'\left(I_d-\rho\rho'\right)^{1/2}\big(\Omega_{11}^{1/2}\big)'. \qquad(28)$$
It then follows that $\hat\theta_{2T}$ has a larger (smaller) asymptotic variance than $\hat\theta_{1T}$ under the fixed-smoothing asymptotics if
$$E\tilde\Phi_1(h,d,q)\tilde\Phi_1(h,d,q)' \succeq (\preceq)\ Q'\left[\left(I_d-\rho\rho'\right)^{-1} - I_d\right]Q$$
for any orthogonal matrix $Q$. But by Lemma 2, this is equivalent to
$$E\left\|\tilde\Phi_1(h,1,q)\right\|^2 I_d \succeq (\preceq)\ \left(I_d-\rho\rho'\right)^{-1} - I_d. \qquad(29)$$
The condition in (29) is equivalent to $\big[E\|\tilde\Phi_1(h,1,q)\|^2 + 1\big]I_d \succeq (I_d-\rho\rho')^{-1}$, which is the same as
$$\rho\rho' \preceq \frac{E\|\tilde\Phi_1(h,1,q)\|^2}{E\|\tilde\Phi_1(h,1,q)\|^2 + 1}\,I_d = g(h,q)\,I_d.$$
Note that $\rho$ is not uniquely defined, as $\Omega_{11}^{1/2}$ and $\Omega_{22}^{1/2}$ may not be uniquely defined. The non-uniqueness is innocuous, however. According to Lemma 12, if $\tilde\rho$ is another version of the long run correlation matrix, then $\tilde\rho = Q_1\rho Q_2$ for some orthogonal matrices $Q_1$ and $Q_2$. So $\tilde\rho\tilde\rho' = Q_1\rho\rho'Q_1'$, and $\tilde\rho\tilde\rho' \preceq g(h,q)I_d$ if and only if $\rho\rho' \preceq g(h,q)I_d$.

Proof of Corollary 4. For the OS LRV estimation, we have $Q_h(r,s) = \frac1K\sum_{i=1}^K\Phi_i(r)\Phi_i(s)$, and so, by the orthonormality of $\{\Phi_i\}$,
$$\int_0^1 Q_h(r,\tau)Q_h(\tau,s)d\tau = \frac{1}{K^2}\sum_{i=1}^K\sum_{j=1}^K\Phi_i(r)\left[\int_0^1\Phi_i(\tau)\Phi_j(\tau)d\tau\right]\Phi_j(s) = \frac1K Q_h(r,s).$$
As a result,
$$\tau(h,q) = \frac1K\,E\,\mathrm{tr}\left[\int_0^1\!\!\int_0^1 Q_h(r,s)dB_q(r)dB_q(s)'\right]^{-1}.$$
Let $\xi_i = \int_0^1\Phi_i(r)dB_q(r) \sim$ iid $N(0,I_q)$; then $\int_0^1\int_0^1 Q_h\,dB_qdB_q' = \frac1K\sum_{i=1}^K\xi_i\xi_i'$ and
$$\tau(h,q) = \mathrm{tr}\,E\left[\left(\sum_{j=1}^K\xi_j\xi_j'\right)^{-1}\right] = \frac{q}{K-q-1},$$
where the last equality follows from the mean of an inverse Wishart distribution. Using this, we have
$$g(h,q) = \frac{\tau(h,q)}{1+\tau(h,q)} = \frac{q/(K-q-1)}{1+q/(K-q-1)} = \frac{q}{K-1}.$$
The corollary then follows from Proposition 3.

Proof of Proposition 5. It suffices to prove parts (a) and (b), as parts (c) and (d) follow from similar arguments. Part (b) is a special case of Theorem 6(a) of Sun (2014b) with $G = [I_d, O_{d\times q}]'$. It remains to prove part (a). Under $R\theta_0 = r + \delta_0/\sqrt T$, we have
$$\sqrt T\left(R\hat\theta_{1T}-r\right) = \sqrt T\,R\left(\hat\theta_{1T}-\theta_0\right) + \delta_0 \Rightarrow R\Omega_{11}^{1/2}B_d(1) + \delta_0.$$
Using Proposition 1(a), we have
$$R\hat\Omega_{11}R' \Rightarrow R\Omega_{11}^{1/2}C_{dd}\big(R\Omega_{11}^{1/2}\big)',$$
where $C_{dd} = \int_0^1\int_0^1 Q_h(r,s)dB_d(r)dB_d(s)'$ and $C_{dd}\perp B_d(1)$. Invoking the continuous mapping theorem yields
$$W_{1T} := \sqrt T\left(R\hat\theta_{1T}-r\right)'\left(R\hat\Omega_{11}R'\right)^{-1}\sqrt T\left(R\hat\theta_{1T}-r\right) \Rightarrow \left[R\Omega_{11}^{1/2}B_d(1)+\delta_0\right]'\left[R\Omega_{11}^{1/2}C_{dd}\big(R\Omega_{11}^{1/2}\big)'\right]^{-1}\left[R\Omega_{11}^{1/2}B_d(1)+\delta_0\right].$$
Now, letting $\Lambda_R$ be a matrix square root of $R\Omega_{11}R'$, the pair $\big(R\Omega_{11}^{1/2}B_d(1),\ R\Omega_{11}^{1/2}C_{dd}(R\Omega_{11}^{1/2})'\big)$ is distributionally equivalent to $\big(\Lambda_RB_p(1),\ \Lambda_RC_{pp}\Lambda_R'\big)$, and so
$$W_{1T} \Rightarrow \left[B_p(1)+\Lambda_R^{-1}\delta_0\right]'C_{pp}^{-1}\left[B_p(1)+\Lambda_R^{-1}\delta_0\right] \overset d= W_{1\infty}\left(\delta_0'\left(R\Omega_{11}R'\right)^{-1}\delta_0\right),$$
as desired.

Proof of Proposition 6. Part (a). We first prove that $P\big(\chi_p^2(\delta^2)>x\big)$ increases with $\delta^2$ for any integer $p$ and $x>0$. Note that
$$P\left(\chi_p^2(\delta^2)>x\right) = \sum_{j=0}^\infty e^{-\delta^2/2}\frac{(\delta^2/2)^j}{j!}P\left(\chi^2_{p+2j}>x\right),$$
and so
$$\frac{\partial P\big(\chi_p^2(\delta^2)>x\big)}{\partial\delta^2} = -\frac12\sum_{j=0}^\infty e^{-\delta^2/2}\frac{(\delta^2/2)^j}{j!}P\left(\chi^2_{p+2j}>x\right) + \frac12\sum_{j=1}^\infty e^{-\delta^2/2}\frac{(\delta^2/2)^{j-1}}{(j-1)!}P\left(\chi^2_{p+2j}>x\right)$$
$$= \frac12\sum_{j=0}^\infty e^{-\delta^2/2}\frac{(\delta^2/2)^j}{j!}\left[P\left(\chi^2_{p+2+2j}>x\right) - P\left(\chi^2_{p+2j}>x\right)\right] > 0,$$
as needed.
Let $\zeta\sim N(0,1)$ and let $\eta$ be a zero mean random variable independent of $\zeta$. Using the monotonicity of $P(\chi_p^2(\delta^2)>x)$ in $\delta^2$, we have
$$P\left((\zeta+\eta)^2>x\right) = E\left[P\left(\chi_1^2(\eta^2)>x\right)\mid\eta\right] \geq P\left(\chi_1^2>x\right) = P\left(\zeta^2>x\right)\ \text{ for any } x.$$

Now we proceed to prove the theorem. Note that $B_p(1)$ and $B_q(1)$ are independent of $C_{pq}$, $C_{pp}$ and $C_{qq}$. Let $D_{pp}^{-1} = \sum_{i=1}^p\lambda_{Di}d_id_i'$ be the spectral decomposition of $D_{pp}^{-1}$, where $\lambda_{Di}\geq0$ almost surely and $\{d_i\}$ are orthonormal in $\mathbb R^p$. Then
$$\left[B_p(1)-C_{pq}C_{qq}^{-1}B_q(1)\right]'D_{pp}^{-1}\left[B_p(1)-C_{pq}C_{qq}^{-1}B_q(1)\right] = \sum_{i=1}^p\lambda_{Di}\left(\zeta_i+\eta_i\right)^2,$$
where $\zeta_i = d_i'B_p(1)$ and $\eta_i = -d_i'C_{pq}C_{qq}^{-1}B_q(1)$. Conditional on $C_{pq}$, $C_{pp}$ and $C_{qq}$, $\{\zeta_i\}$ is independent of $\{\eta_i\}$, and $\zeta_i \sim$ iid $N(0,1)$ both conditionally and unconditionally. In addition, for any $x>0$,
$$P\left(W_{2\infty}(0)>x\right) = E\,P\left(\sum_{i=1}^p\lambda_{Di}(\zeta_i+\eta_i)^2>x\ \Big|\ C_{pq},C_{pp},C_{qq}\right) \geq E\,P\left(\lambda_{D1}\zeta_1^2+\sum_{i=2}^p\lambda_{Di}(\zeta_i+\eta_i)^2>x\ \Big|\ C_{pq},C_{pp},C_{qq},\{\eta_i\}_{i=2}^p\right),$$
where the inequality applies the preliminary result conditionally, one coordinate at a time. Using the above argument repeatedly, we have
$$P\left(W_{2\infty}(0)>x\right) \geq P\left(\sum_{i=1}^p\lambda_{Di}\zeta_i^2>x\right) = P\left(B_p(1)'D_{pp}^{-1}B_p(1)>x\right) > P\left(B_p(1)'C_{pp}^{-1}B_p(1)>x\right) = P\left(W_{1\infty}(0)>x\right),$$
where the last inequality follows from the fact that $D_{pp}^{-1} > C_{pp}^{-1}$ almost surely.

Part (b). Let $C_{pp}^{-1} = \sum_{i=1}^p\lambda_{Ci}c_ic_i'$ be the spectral decomposition of $C_{pp}^{-1}$. Since $C_{pp}>0$ with probability one, $\lambda_{Ci}>0$ with probability one. We have
$$W_{1\infty}\left(\|\delta\|^2\right) \overset d= \left[B_p(1)+\|\delta\|e_p\right]'C_{pp}^{-1}\left[B_p(1)+\|\delta\|e_p\right] = \sum_{i=1}^p\lambda_{Ci}\left[c_i'B_p(1)+\|\delta\|c_i'e_p\right]^2,$$
where, conditional on $\{\lambda_{Ci}\}_{i=1}^p$ and $\{c_i\}_{i=1}^p$, the $[c_i'B_p(1)+\|\delta\|c_i'e_p]^2$ follow independent noncentral chi-square distributions with noncentrality parameters $\|\delta\|^2(c_i'e_p)^2$. Now consider two vectors $\delta_1$ and $\delta_2$ such that $\|\delta_1\|<\|\delta_2\|$. Applying the strict monotonicity of $P(\chi_1^2(\delta^2)>x)$ in $\delta^2$ coordinate by coordinate, conditional on $\{\lambda_{Ci}\}_{i=1}^p$ and $\{c_i\}_{i=1}^p$, and repeating the argument across coordinates, we obtain
$$P\left[W_{1\infty}\left(\|\delta_1\|^2\right)>x\right] < P\left[W_{1\infty}\left(\|\delta_2\|^2\right)>x\right],$$
as desired.

Part (c). We note that
$$W_{2\infty}\left(\|\delta\|^2\right) = \left[B_p(1)-C_{pq}C_{qq}^{-1}B_q(1)+\|\delta\|e_p\right]'D_{pp}^{-1}\left[B_p(1)-C_{pq}C_{qq}^{-1}B_q(1)+\|\delta\|e_p\right].$$
Define $\Pi := I_p + C_{pq}C_{qq}^{-1}C_{qq}^{-1}C_{qp}$ and $\tilde e_p := \Pi^{-1/2}e_p$. Then
$$W_{2\infty}\left(\|\delta\|^2\right) = \left[\Pi^{-1/2}\left(B_p(1)-C_{pq}C_{qq}^{-1}B_q(1)\right)+\|\delta\|\tilde e_p\right]'\Pi^{1/2}D_{pp}^{-1}\Pi^{1/2}\left[\Pi^{-1/2}\left(B_p(1)-C_{pq}C_{qq}^{-1}B_q(1)\right)+\|\delta\|\tilde e_p\right].$$
Let $\sum_{i=1}^p\tilde\lambda_{Di}\tilde d_i\tilde d_i'$ be the spectral decomposition of $\Pi^{1/2}D_{pp}^{-1}\Pi^{1/2}$ and define $\tilde\zeta_i := \tilde d_i'\Pi^{-1/2}\left[B_p(1)-C_{pq}C_{qq}^{-1}B_q(1)\right]$. Then, conditional on $C_{pq}$, $C_{pp}$ and $C_{qq}$, $\tilde\zeta_i\sim$ iid $N(0,1)$. Since the conditional distribution does not depend on $C_{pq}$, $C_{pp}$ and $C_{qq}$, the $\tilde\zeta_i$ are iid $N(0,1)$ unconditionally as well.
Now
$$W_{2\infty}\left(\|\delta_1\|^2\right) = \sum_{i=1}^p\tilde\lambda_{Di}\left(\tilde\zeta_i+\|\delta_1\|\tilde d_i'\tilde e_p\right)^2,$$
and so, for two vectors $\delta_1$ and $\delta_2$ such that $\|\delta_1\|<\|\delta_2\|$, we have
$$P\left\{W_{2\infty}\left(\|\delta_1\|^2\right)>x\right\} = E\,P\left\{\sum_{i=1}^p\tilde\lambda_{Di}\left(\tilde\zeta_i+\|\delta_1\|\tilde d_i'\tilde e_p\right)^2>x\ \Big|\ C_{pq},C_{pp},C_{qq}\right\}$$
$$< E\,P\left\{\sum_{i=1}^p\tilde\lambda_{Di}\left(\tilde\zeta_i+\|\delta_2\|\tilde d_i'\tilde e_p\right)^2>x\ \Big|\ C_{pq},C_{pp},C_{qq}\right\} = P\left\{W_{2\infty}\left(\|\delta_2\|^2\right)>x\right\}.$$

Proof of Proposition 8. Instead of directly proving $\pi_1(\delta^2)>\pi_2(\delta^2)$ for any $\delta^2>0$, we consider the following testing problem: we observe $(Y,S)\in\mathbb R^{p+q}\times\mathbb R^{(p+q)\times(p+q)}$ with $Y\perp S$ and
$$Y = \begin{pmatrix}Y_1\\ Y_2\end{pmatrix} \sim N_{p+q}(\mu,\Sigma)\ \text{ with }\ \mu = \begin{pmatrix}\mu_0\\ 0\end{pmatrix},\ \Sigma = \begin{pmatrix}\Sigma_{11} & O\\ O & \Sigma_{22}\end{pmatrix},\qquad S = \begin{pmatrix}S_{11} & S_{12}\\ S_{21} & S_{22}\end{pmatrix} \sim \frac{W_{p+q}(K,\Sigma)}{K},$$
where $Y_1,\mu_0\in\mathbb R^p$, $\Sigma_{11}$ and $\Sigma_{22}$ are non-singular matrices, and $W_{p+q}(K,\Sigma)$ is the Wishart distribution with $K$ degrees of freedom. We want to test $H_0:\mu_0=0$ against $H_1:\mu_0\neq0$. The testing problem is partially motivated by Das Gupta and Perlman (1974) and Marden and Perlman (1980).

The joint pdf of $(Y,S)$ can be written as
$$f(Y,S\mid\mu_0,\Sigma_{11},\Sigma_{22}) = \kappa(\mu_0,\Sigma_{11},\Sigma_{22})\,h(S)\exp\left\{-\frac12\mathrm{tr}\left[\Sigma_{11}^{-1}\left(Y_1Y_1'+KS_{11}\right)+\Sigma_{22}^{-1}\left(Y_2Y_2'+KS_{22}\right)\right]+Y_1'\Sigma_{11}^{-1}\mu_0\right\}$$
for some functions $\kappa(\cdot)$ and $h(\cdot)$. It follows from the exponential structure that
$$\Psi := \left(Y_1,\ S_{11},\ Y_2Y_2'+KS_{22}\right)$$
is a complete sufficient statistic for $\vartheta := (\mu_0,\Sigma_{11},\Sigma_{22})$. We note that $Y_1\sim N(\mu_0,\Sigma_{11})$, $KS_{11}\sim W_p(K,\Sigma_{11})$ and $Y_2Y_2'+KS_{22}\sim W_q(K+1,\Sigma_{22})$, and these three random variables are mutually independent.

Now, we define the following two test functions for testing $H_0:\mu_0=0$ against $H_1:\mu_0\neq0$:
$$\varphi_1(\Psi) := 1\left\{V_1(\Psi)>\mathcal W_{1\infty}^\alpha\right\}\ \text{ and }\ \varphi_2(\Psi) := E\left[1\left\{W_2(Y,S)>\mathcal W_{2\infty}^\alpha\right\}\mid\Psi\right],$$
where $V_1(\Psi) := Y_1'S_{11}^{-1}Y_1$ and
$$W_2(Y,S) := \left(Y_1-S_{12}S_{22}^{-1}Y_2\right)'\left(S_{11}-S_{12}S_{22}^{-1}S_{21}\right)^{-1}\left(Y_1-S_{12}S_{22}^{-1}Y_2\right).$$
We can show that the distributions of $V_1(\Psi)$ and $W_2(Y,S)$ depend on the parameter $\vartheta$ only via $\mu_0'\Sigma_{11}^{-1}\mu_0$. First, it is easy to show that
$$W_2(Y,S) = \begin{pmatrix}Y_1\\ Y_2\end{pmatrix}'\begin{pmatrix}S_{11}&S_{12}\\ S_{21}&S_{22}\end{pmatrix}^{-1}\begin{pmatrix}Y_1\\ Y_2\end{pmatrix} - Y_2'S_{22}^{-1}Y_2.$$
Let $\tilde Y := \Sigma^{-1/2}(Y-\mu)\sim N(0,I_{p+q})$ and $\tilde S := \Sigma^{-1/2}S\big(\Sigma^{-1/2}\big)'\sim W_{p+q}(K,I_{p+q})/K$. Then $\tilde Y\perp\tilde S$ and
$$W_2(Y,S) = \left(\tilde Y+\Sigma^{-1/2}\mu\right)'\tilde S^{-1}\left(\tilde Y+\Sigma^{-1/2}\mu\right) - \tilde Y_2'\tilde S_{22}^{-1}\tilde Y_2.$$
It is now obvious that the distribution of $W_2(Y,S)$ depends on $\mu_0$ only via $\|\Sigma^{-1/2}\mu\|^2$, which is equal to $\mu_0'\Sigma_{11}^{-1}\mu_0$. Second, we have
$$V_1(\Psi) = \left(\tilde Y_1+\Sigma_{11}^{-1/2}\mu_0\right)'\tilde S_{11}^{-1}\left(\tilde Y_1+\Sigma_{11}^{-1/2}\mu_0\right),$$
and so the distribution of $V_1(\Psi)$ depends on $\vartheta$ only via $\|\Sigma_{11}^{-1/2}\mu_0\|^2$, which is also equal to $\mu_0'\Sigma_{11}^{-1}\mu_0$.

It is easy to show that the null distributions of $V_1(\Psi)$ and $W_2(Y,S)$ are the same as those of $W_{1\infty}$ and $W_{2\infty}$, respectively. In view of the critical values used, both tests $\varphi_1$ and $\varphi_2$ have the correct level $\alpha$. Since
$$E\varphi_1(\Psi) = P\left(V_1(\Psi)>\mathcal W_{1\infty}^\alpha\right)\ \text{ and }\ E\varphi_2(\Psi) = P\left(W_2(Y,S)>\mathcal W_{2\infty}^\alpha\right),$$
the power functions of the two tests $\varphi_1$ and $\varphi_2$ are $\pi_1\big(\mu_0'\Sigma_{11}^{-1}\mu_0\big)$ and $\pi_2\big(\mu_0'\Sigma_{11}^{-1}\mu_0\big)$, respectively.

We consider a group of transformations $\mathcal G$, whose elements are indexed by $A\in\mathcal A_{p\times p} := \{A\in\mathbb R^{p\times p}: A\text{ is a }p\times p\text{ non-singular matrix}\}$ and which acts on the sample space of the sufficient statistic through the mapping
$$g:\ \left(Y_1,\ S_{11},\ Y_2Y_2'+KS_{22}\right)\mapsto\left(AY_1,\ AS_{11}A',\ Y_2Y_2'+KS_{22}\right).$$
The induced group of transformations $\bar{\mathcal G}$ acting on the parameter space $\Theta := \mathbb R^p\times\mathcal S_{p\times p}\times\mathcal S_{q\times q}$ is given by
$$\bar g:\ \vartheta = (\mu_0,\Sigma_{11},\Sigma_{22})\mapsto\left(A\mu_0,\ A\Sigma_{11}A',\ \Sigma_{22}\right).$$
Our testing problem is obviously invariant to this group of transformations.
Define
$$\mathcal V(\Psi) := \left(Y_1'S_{11}^{-1}Y_1,\ Y_2Y_2'+KS_{22}\right) =: \left(V_1(\Psi),\ V_2(\Psi)\right).$$
It is clear that $\mathcal V(\Psi)$ is invariant under $\mathcal G$. We can also show that $\mathcal V(\Psi)$ is maximal invariant under $\mathcal G$. To do so, we consider two different samples $\Psi := (Y_1,S_{11},Y_2Y_2'+KS_{22})$ and $\Psi^* := (Y_1^*,S_{11}^*,Y_2^*Y_2^{*\prime}+KS_{22}^*)$ such that $\mathcal V(\Psi) = \mathcal V(\Psi^*)$. We want to show that there exists a $p\times p$ non-singular matrix $A$ such that $Y_1^* = AY_1$ and $S_{11}^* = AS_{11}A'$ whenever $Y_1^{*\prime}S_{11}^{*-1}Y_1^* = Y_1'S_{11}^{-1}Y_1$. By Theorem A9.5 (Vinograd's Theorem) in Muirhead (2009), there exists an orthogonal $p\times p$ matrix $H$ such that $S_{11}^{*-1/2}Y_1^* = HS_{11}^{-1/2}Y_1$, and this gives us the non-singular matrix $A := S_{11}^{*1/2}HS_{11}^{-1/2}$ satisfying $Y_1^* = AY_1$ and $S_{11}^* = AS_{11}A'$. Similarly, we can show that
$$v(\vartheta) := \left(\mu_0'\Sigma_{11}^{-1}\mu_0,\ \Sigma_{22}\right)$$
is maximal invariant under the induced group $\bar{\mathcal G}$. Therefore, restricting attention to $\mathcal G$-invariant tests, testing $H_0:\mu_0=0$ against $H_1:\mu_0\neq0$ reduces to testing
$$H_0':\ \mu_0'\Sigma_{11}^{-1}\mu_0 = 0\ \text{ against }\ H_1':\ \mu_0'\Sigma_{11}^{-1}\mu_0>0$$
based on the maximal invariant statistic $\mathcal V(\Psi)$.

Let $f(V_1;\mu_0'\Sigma_{11}^{-1}\mu_0)$ and $f(V_2;\Sigma_{22})$ be the marginal pdfs of $V_1 := V_1(\Psi)$ and $V_2 := V_2(\Psi)$. By construction, $V_1(\Psi)K/(K-p+1)$ follows the noncentral $F$ distribution $F_{p,K-p+1}(\mu_0'\Sigma_{11}^{-1}\mu_0)$, so $f(V_1;\mu_0'\Sigma_{11}^{-1}\mu_0)$ is the (scaled) pdf of the noncentral $F$ distribution. It is well known that the noncentral $F$ distribution has the monotone likelihood ratio (MLR) property in $V_1$ with respect to the parameter $\mu_0'\Sigma_{11}^{-1}\mu_0$ (e.g., Chapter 7.9 in Lehmann and Romano (2008)). Also, in view of the independence between $V_1$ and $V_2$, the joint distribution of $\mathcal V(\Psi)$ has the MLR property in $V_1$. By virtue of the Neyman-Pearson lemma, the test $\varphi_1(\Psi) := 1\{V_1(\Psi)>\mathcal W_{1\infty}^\alpha\}$ is the unique uniformly most powerful invariant (UMPI) test among all $\mathcal G$-invariant tests based on the complete sufficient statistic $\Psi$. So if $\varphi_2$ is equivalent to a $\mathcal G$-invariant test, then $\pi_1(\mu_0'\Sigma_{11}^{-1}\mu_0) > \pi_2(\mu_0'\Sigma_{11}^{-1}\mu_0)$ for any $\mu_0'\Sigma_{11}^{-1}\mu_0>0$. To show that $\varphi_2$ has this property, we let $g\in\mathcal G$ be any element of $\mathcal G$ with the corresponding matrix $A_g$ and induced transformation $\bar g\in\bar{\mathcal G}$. Then
$$E_\vartheta\left[\varphi_2(g\Psi)\right] = E_{\bar g\vartheta}\left[\varphi_2(\Psi)\right] = \pi_2\left((A_g\mu_0)'(A_g\Sigma_{11}A_g')^{-1}(A_g\mu_0)\right) = \pi_2\left(\mu_0'\Sigma_{11}^{-1}\mu_0\right) = E_\vartheta\left[\varphi_2(\Psi)\right]\ \text{ for all }\vartheta.$$
It follows from the completeness of $\Psi$ that $\varphi_2(g\Psi) = \varphi_2(\Psi)$ almost surely, and this delivers the desired result.

Proof of Lemma 9. We prove a more general result by establishing a representation for
$$\frac1{\sqrt T}\sum_{t=1}^T\left[G'M^{-1}G\right]^{-1}G'M^{-1}f(v_t,\theta_0)$$
in terms of the rotated and normalized moment conditions, for any $m\times m$ (almost surely) positive definite matrix $M$, which can be random. Let $\tilde\Sigma := U'\Sigma U$ and $\tilde\Omega := U'\Omega U$ be the variance and LRV of the rotated moment process $U'f(v_t,\theta_0)$, and let $\tilde\Sigma^{1/2}$ be the block upper-triangular square root
$$\tilde\Sigma^{1/2} = \begin{pmatrix}\tilde\Sigma_{1\cdot2}^{1/2} & \tilde\Sigma_{12}\tilde\Sigma_{22}^{-1/2}\\ O & \tilde\Sigma_{22}^{1/2}\end{pmatrix},\quad \tilde\Sigma_{1\cdot2} := \tilde\Sigma_{11}-\tilde\Sigma_{12}\tilde\Sigma_{22}^{-1}\tilde\Sigma_{21},$$
so that
$$\left(I_d,O\right)\big(\tilde\Sigma^{1/2}\big)^{-1} = \tilde\Sigma_{1\cdot2}^{-1/2}\left(I_d,\ -\tilde\Sigma_{12}\tilde\Sigma_{22}^{-1}\right).$$
Define the normalized moment process $f^*(v_t,\theta_0) := \big(\tilde\Sigma^{1/2}\big)^{-1}U'f(v_t,\theta_0) = \big(f_1^{*\prime}, f_2^{*\prime}\big)'$, whose variance is $I_m$ and whose LRV is $\Omega^* := \big(\tilde\Sigma^{1/2}\big)^{-1}\tilde\Omega\big(\tilde\Sigma^{1/2}\big)^{-1\prime}$, and let
$$\tilde M := U'MU,\quad M^* := \big(\tilde\Sigma^{1/2}\big)^{-1}\tilde M\big(\tilde\Sigma^{1/2}\big)^{-1\prime} = \begin{pmatrix}M_{11}^* & M_{12}^*\\ M_{21}^* & M_{22}^*\end{pmatrix},\quad M_{1\cdot2}^* := M_{11}^*-M_{12}^*M_{22}^{*-1}M_{21}^*.$$
Using the SVD $G = U\Delta V'$ with $\Delta' = (A, O)$, we obtain
$$G'M^{-1}G = VA\left(I_d,O\right)\tilde M^{-1}\left(I_d,O\right)'AV' = VA\,\tilde\Sigma_{1\cdot2}^{-1/2\prime}\big(M_{1\cdot2}^*\big)^{-1}\tilde\Sigma_{1\cdot2}^{-1/2}AV'. \qquad(30)$$
In addition, using $(I_d,O)M^{*-1} = \big(M_{1\cdot2}^*\big)^{-1}\big(I_d,\ -M_{12}^*M_{22}^{*-1}\big)$,
$$G'M^{-1}f(v_t,\theta_0) = VA\,\tilde\Sigma_{1\cdot2}^{-1/2\prime}\left(I_d,O\right)M^{*-1}f^*(v_t,\theta_0) = VA\,\tilde\Sigma_{1\cdot2}^{-1/2\prime}\big(M_{1\cdot2}^*\big)^{-1}\left[f_1^*(v_t,\theta_0)-M_{12}^*M_{22}^{*-1}f_2^*(v_t,\theta_0)\right].$$
Hence
$$\frac1{\sqrt T}\sum_{t=1}^T\left[G'M^{-1}G\right]^{-1}G'M^{-1}f(v_t,\theta_0) = VA^{-1}\tilde\Sigma_{1\cdot2}^{1/2}\,\frac1{\sqrt T}\sum_{t=1}^T\left[f_1^*(v_t,\theta_0)-M_{12}^*M_{22}^{*-1}f_2^*(v_t,\theta_0)\right]. \qquad(31)$$
Taking $M = \Sigma$, we have $\tilde M = \tilde\Sigma$, $M^* = I_m$, and so $M_{12}^*M_{22}^{*-1} = O$. Using this and the stochastic expansion of $\sqrt T(\hat\theta_{1T}-\theta_0)$, we have
$$\tilde\Sigma_{1\cdot2}^{-1/2}AV'\sqrt T\left(\hat\theta_{1T}-\theta_0\right) = \frac1{\sqrt T}\sum_{t=1}^Tf_1^*(v_t,\theta_0)+o_p(1) \Rightarrow N\left(0,\ \Omega_{11}^*\right).$$
Next, taking $M = \hat\Omega$, we have $M^* = \hat\Omega^* \Rightarrow \tilde\Phi_1$, and so $M_{12}^*M_{22}^{*-1} \Rightarrow \tilde\Phi_{1,12}\tilde\Phi_{1,22}^{-1} = \tilde\beta_1$. Using this and the stochastic expansion of $\sqrt T(\hat\theta_{2T}-\theta_0)$, we have
$$\tilde\Sigma_{1\cdot2}^{-1/2}AV'\sqrt T\left(\hat\theta_{2T}-\theta_0\right) = \frac1{\sqrt T}\sum_{t=1}^T\left[f_1^*(v_t,\theta_0)-\tilde\beta_1f_2^*(v_t,\theta_0)\right]+o_p(1) \Rightarrow MN\left(0,\ \Omega_{11}^*-\tilde\beta_1\Omega_{21}^*-\Omega_{12}^*\tilde\beta_1'+\tilde\beta_1\Omega_{22}^*\tilde\beta_1'\right). \qquad(32)$$

Proof of Theorem 10. Parts (a) and (b). Instead of comparing the asymptotic variances of $R\sqrt T(\hat\theta_{1T}-\theta_0)$ and $R\sqrt T(\hat\theta_{2T}-\theta_0)$ directly, we equivalently compare the asymptotic variances of
$$\left[RVA^{-1}\tilde\Sigma_{1\cdot2}A^{-1}V'R'\right]^{-1/2}R\sqrt T\left(\hat\theta_{1T}-\theta_0\right)\ \text{ and }\ \left[RVA^{-1}\tilde\Sigma_{1\cdot2}A^{-1}V'R'\right]^{-1/2}R\sqrt T\left(\hat\theta_{2T}-\theta_0\right).$$
We can do so because $\big[RVA^{-1}\tilde\Sigma_{1\cdot2}A^{-1}V'R'\big]^{-1/2}$ is nonsingular. Note that the latter two asymptotic variances are the same as those of the respective one-step estimator $\hat\theta_{1T}^R$ and two-step estimator $\hat\theta_{2T}^R$ of $\theta_0^R$ in the following simple location model:
$$y_{1t}^R = \theta_0^R + u_{1t}^R\in\mathbb R^p,\qquad y_{2t}^R = u_{2t}^R\in\mathbb R^q, \qquad(33)$$
where, with $\tilde R := RVA^{-1}\tilde\Sigma_{1\cdot2}^{1/2}$ and $u_t^* := f^*(v_t,\theta_0)$,
$$\theta_0^R = \left[RVA^{-1}\tilde\Sigma_{1\cdot2}A^{-1}V'R'\right]^{-1/2}R\theta_0,\quad u_{1t}^R = \left[RVA^{-1}\tilde\Sigma_{1\cdot2}A^{-1}V'R'\right]^{-1/2}\tilde R\,u_{1t}^*,\quad u_{2t}^R = u_{2t}^*,$$
and the (contemporaneous) variance and long run variance of $u_t^* = (u_{1t}^{*\prime}, u_{2t}^{*\prime})'$ are $I_m$ and $\Omega^*$, respectively.

It suffices to compare the asymptotic variances of $\hat\theta_{1T}^R$ and $\hat\theta_{2T}^R$ in the above location model. By construction, the variance of $u_t^R := (u_{1t}^{R\prime}, u_{2t}^{R\prime})'$ is
$$\mathrm{var}\left(u_t^R\right) = \begin{pmatrix}I_p & O\\ O & I_q\end{pmatrix} = I_{p+q}.$$
So the above location model has exactly the same form as the model in Section 3, and we can invoke Proposition 3 to complete the proof. The long run variance of $u_t^R$ is
$$\mathrm{lrvar}\left(u_t^R\right) = \begin{pmatrix}\Omega_{11}^R & \Omega_{12}^R\\ \Omega_{21}^R & \Omega_{22}^R\end{pmatrix},$$
where
$$\Omega_{11}^R = \left[RVA^{-1}\tilde\Sigma_{1\cdot2}A^{-1}V'R'\right]^{-1/2}\tilde R\,\Omega_{11}^*\,\tilde R'\left\{\left[RVA^{-1}\tilde\Sigma_{1\cdot2}A^{-1}V'R'\right]^{-1/2}\right\}',$$
$$\Omega_{12}^R = \left[RVA^{-1}\tilde\Sigma_{1\cdot2}A^{-1}V'R'\right]^{-1/2}\tilde R\,\Omega_{12}^*,\qquad \Omega_{22}^R = \Omega_{22}^*.$$
The implied long run correlation matrix is then
$$\rho^R = \big(\Omega_{11}^R\big)^{-1/2}\Omega_{12}^R\big(\Omega_{22}^R\big)^{-1/2} = \left[\tilde R\,\Omega_{11}^*\tilde R'\right]^{-1/2}\tilde R\,\Omega_{12}^*\big(\Omega_{22}^*\big)^{-1/2} =: \tilde\rho_R,$$
so that
$$\rho^R\rho^{R\prime} = \left[\tilde R\,\Omega_{11}^*\tilde R'\right]^{-1/2}\tilde R\,\Omega_{12}^*\big(\Omega_{22}^*\big)^{-1}\Omega_{21}^*\tilde R'\left\{\left[\tilde R\,\Omega_{11}^*\tilde R'\right]^{-1/2}\right\}'.$$
Part (b) then follows from Proposition 3.
In the above calculation we have used a particular form of the matrix square root $[\tilde R\,\Omega_{11}^*\tilde R']^{-1/2}$. If we employ a different matrix square root, then $\rho^R = Q\tilde\rho_R$ for some orthogonal matrix $Q$. This is innocuous, as our result is rotation invariant.

In the special case when $R = I_d$, $\tilde R$ is a square matrix, and we have
$$\rho^R = \left[\tilde R\,\Omega_{11}^*\tilde R'\right]^{-1/2}\tilde R\,\Omega_{12}^*\big(\Omega_{22}^*\big)^{-1/2} = \big(\Omega_{11}^*\big)^{-1/2}\Omega_{12}^*\big(\Omega_{22}^*\big)^{-1/2} = \rho^*.$$
So part (a) is a special case of part (b). Again, if $[\tilde R\,\Omega_{11}^*\tilde R']^{-1/2}$ takes another form instead of $(\Omega_{11}^*)^{-1/2}$, then $\rho^R = Q\rho^*$ for some orthogonal matrix $Q$, and part (a) is still a special case of part (b) because of the rotational invariance.

Part (c). The local asymptotic powers of the one-step test and the two-step test are the same as the local asymptotic powers of the respective one-step and two-step tests in the location model given in (33). Part (c) then follows from Proposition 7 with
$$\|\delta\|^2 = \delta_0'\left[\tilde R\,\Omega_{11}^*\tilde R'\right]^{-1}\delta_0,\qquad \tilde R = RVA^{-1}\tilde\Sigma_{1\cdot2}^{1/2}. \qquad(34)$$
It remains to show that this is equal to the quantity defined in (21). Using (30) with $M = \Sigma$, we have $M^* = I_m$ and
$$\left[G'\Sigma^{-1}G\right]^{-1}G'\Sigma^{-1} = VA^{-1}\tilde\Sigma_{1\cdot2}^{1/2}\left(I_d,O\right)\big(\tilde\Sigma^{1/2}\big)^{-1}U',$$
which implies that
$$R\left[G'\Sigma^{-1}G\right]^{-1}G'\Sigma^{-1}\Omega\,\Sigma^{-1}G\left[G'\Sigma^{-1}G\right]^{-1}R' = \tilde R\left(I_d,O\right)\Omega^*\left(I_d,O\right)'\tilde R' = \tilde R\,\Omega_{11}^*\tilde R'.$$
This is exactly the variance matrix appearing in (34), so the noncentrality parameter in (21) coincides with that in (34).

Proof of Theorem 11. Part (a). Plugging $M = W_a$ into (31) and using similar notation, we have
$$\sqrt T\left(\hat\theta_{aT}-\theta_0\right) = VA^{-1}\tilde\Sigma_{1\cdot2}^{1/2}\,\frac1{\sqrt T}\sum_{t=1}^T\left[f_1^*(v_t,\theta_0)-\beta_af_2^*(v_t,\theta_0)\right]+o_p(1).$$
So we have
$$\tilde\Sigma_{1\cdot2}^{-1/2}AV'\sqrt T\left(\hat\theta_{aT}-\theta_0\right) = \frac1{\sqrt T}\sum_{t=1}^T\left\{\left[f_1^*(v_t,\theta_0)-\beta_\infty f_2^*(v_t,\theta_0)\right]+\left(\beta_\infty-\beta_a\right)f_2^*(v_t,\theta_0)\right\}+o_p(1),$$
$$\tilde\Sigma_{1\cdot2}^{-1/2}AV'\sqrt T\left(\hat\theta_{2T}-\theta_0\right) = \frac1{\sqrt T}\sum_{t=1}^T\left\{\left[f_1^*(v_t,\theta_0)-\beta_\infty f_2^*(v_t,\theta_0)\right]+\left(\beta_\infty-\tilde\beta_1\right)f_2^*(v_t,\theta_0)\right\}+o_p(1), \qquad(35)$$
where the first term in each summation has no long run correlation with the second term. It then follows that $\sqrt T(\hat\theta_{2T}-\theta_0)$ has a larger asymptotic variance than $\sqrt T(\hat\theta_{aT}-\theta_0)$ if
$$E\left(\beta_\infty-\tilde\beta_1\right)\Omega_{22}^*\left(\beta_\infty-\tilde\beta_1\right)' > \left(\beta_\infty-\beta_a\right)\Omega_{22}^*\left(\beta_\infty-\beta_a\right)'.$$
By definition, $\tilde\beta_1 = \beta_\infty+\big(\Omega_{1\cdot2}^*\big)^{1/2}\tilde\Phi_1\big(\Omega_{22}^*\big)^{-1/2}$ and $\beta_a = \beta_\infty+\big(\Omega_{1\cdot2}^*\big)^{1/2}\tilde\Phi_a\big(\Omega_{22}^*\big)^{-1/2}$. The inequality is then equivalent to
$$E\big(\Omega_{1\cdot2}^*\big)^{1/2}\tilde\Phi_1\tilde\Phi_1'\left[\big(\Omega_{1\cdot2}^*\big)^{1/2}\right]' > \big(\Omega_{1\cdot2}^*\big)^{1/2}\tilde\Phi_a\tilde\Phi_a'\left[\big(\Omega_{1\cdot2}^*\big)^{1/2}\right]',$$
which simplifies to $E\tilde\Phi_1\tilde\Phi_1' > \tilde\Phi_a\tilde\Phi_a'$, as desired.

Part (b). It follows from part (a) that
$$\mathrm{asymvar}\left[R\sqrt T\left(\hat\theta_{2T}-\theta_0\right)\right]-\mathrm{asymvar}\left[R\sqrt T\left(\hat\theta_{aT}-\theta_0\right)\right] = E\,\tilde R_a\tilde\Phi_1\tilde\Phi_1'\tilde R_a'-\tilde R_a\tilde\Phi_a\tilde\Phi_a'\tilde R_a',$$
where $\tilde R_a := RVA^{-1}\tilde\Sigma_{1\cdot2}^{1/2}\big(\Omega_{1\cdot2}^*\big)^{1/2}$. Normalizing by $\big(\tilde R_a\tilde R_a'\big)^{-1/2}$ and noting that $\big(\tilde R_a\tilde R_a'\big)^{-1/2}\tilde R_a\tilde\Phi_1 \overset d= \tilde\Phi_1(h,p,q)$, we find that the asymptotic variance of $R\sqrt T(\hat\theta_{2T}-\theta_0)$ exceeds that of $R\sqrt T(\hat\theta_{aT}-\theta_0)$ if
$$E\,\tilde\Phi_1(h,p,q)\tilde\Phi_1(h,p,q)' > \big(\tilde R_a\tilde R_a'\big)^{-1/2}\tilde R_a\tilde\Phi_a\tilde\Phi_a'\tilde R_a'\left[\big(\tilde R_a\tilde R_a'\big)^{-1/2}\right]'.$$

Parts (c), (d) and (e). These three parts are similar to Theorem 10(a), (b) and (c), respectively. We only give the proof of part (e) in some detail.
It is easy to show that under the local alternative $H_1: R\theta_0 = r+\delta_0/\sqrt T$, we have $W_{aT} \Rightarrow W_{1\infty}\big(\|V_a^{-1/2}\delta_0\|^2\big)$, where
$$V_a = R\left[G'W_a^{-1}G\right]^{-1}G'W_a^{-1}\Omega\,W_a^{-1}G\left[G'W_a^{-1}G\right]^{-1}R' = \tilde R\left(I_d,-\beta_a\right)\Omega^*\left(I_d,-\beta_a\right)'\tilde R' \qquad(36)$$
and $\tilde R = RVA^{-1}\tilde\Sigma_{1\cdot2}^{1/2}$. Similarly, we have $W_{2T} \Rightarrow W_{2\infty}\big(\|V_2^{-1/2}\delta_0\|^2\big)$, where
$$V_2 = R\left[G'\Omega^{-1}G\right]^{-1}R' = \tilde R\left(I_d,-\beta_\infty\right)\Omega^*\left(I_d,-\beta_\infty\right)'\tilde R',$$
which is the asymptotic variance of $R\sqrt T(\tilde\theta_{2T}-\theta_0)$, with $\tilde\theta_{2T}$ being the infeasible optimal two-step GMM estimator. The difference between the two matrices $V_a$ and $V_2$ is
$$V_a-V_2 = \tilde R\left(\beta_a-\beta_\infty\right)\Omega_{22}^*\left(\beta_a-\beta_\infty\right)'\tilde R'.$$
Define
$$\rho_{a,R} := V_a^{-1/2}\tilde R\left(\beta_\infty-\beta_a\right)\big(\Omega_{22}^*\big)^{1/2},$$
which is the long run correlation matrix between $V_a^{-1/2}\tilde R\,[f_1^*(v_t,\theta_0)-\beta_af_2^*(v_t,\theta_0)]$ and $f_2^*(v_t,\theta_0)$. Then
$$V_a^{-1/2}\left(V_a-V_2\right)\left[V_a^{-1/2}\right]' = \rho_{a,R}\rho_{a,R}'\ \text{ and }\ V_a^{-1/2}V_2\left[V_a^{-1/2}\right]' = I_p-\rho_{a,R}\rho_{a,R}',$$
and so
$$\left\|V_2^{-1/2}\delta_0\right\|^2-\left\|V_a^{-1/2}\delta_0\right\|^2 = \left[V_a^{-1/2}\delta_0\right]'\left\{\left[I_p-\rho_{a,R}\rho_{a,R}'\right]^{-1}-I_p\right\}\left[V_a^{-1/2}\delta_0\right] \geq 0.$$
So if $\rho_{a,R}\rho_{a,R}' > f(\delta_a^2; h, p, q, \alpha)I_p$ with $\delta_a^2 := \|V_a^{-1/2}\delta_0\|^2$, then $\|V_2^{-1/2}\delta_0\|^2 > \|V_a^{-1/2}\delta_0\|^2$ and the test based on $W_{2T}$ is asymptotically more powerful than that based on $W_{aT}$.

References

[1] Andrews, D. W. (1991): "Heteroskedasticity and autocorrelation consistent covariance matrix estimation." Econometrica 59(3), 817-858.

[2] Bester, C. A., Conley, T. G., and Hansen, C. B. (2011): "Inference with dependent data using cluster covariance estimators." Journal of Econometrics 165(2), 137-151.

[3] Bester, C. A., Conley, T. G., Hansen, C. B., and Vogelsang, T. J. (2014): "Fixed-b asymptotics for spatially dependent robust nonparametric covariance matrix estimators." Forthcoming, Econometric Theory.

[4] Goncalves, S. and Vogelsang, T. (2011): "Block bootstrap HAC robust tests: the sophistication of the naive bootstrap." Econometric Theory 27(4), 745-791.

[5] Goncalves, S. (2011): "The moving blocks bootstrap for panel linear regression models with individual fixed effects." Econometric Theory 27(5), 1048-1082.

[6] Das Gupta, S. and Perlman, M. D. (1974): "On the power of the noncentral F-test: effect of additional variates on Hotelling's T² test." Journal of the American Statistical Association 69, 174-180.

[7] Hansen, L. P. (1982): "Large sample properties of generalized method of moments estimators." Econometrica 50, 1029-1054.

[8] Ibragimov, R. and Müller, U. K. (2010): "t-statistic based correlation and heterogeneity robust inference." Journal of Business and Economic Statistics 28, 453-468.

[9] Jansson, M. (2004): "The error in rejection probability of simple autocorrelation robust tests." Econometrica 72(3), 937-946.

[10] Kiefer, N. M., Vogelsang, T. J. and Bunzel, H. (2002): "Simple robust testing of regression hypotheses." Econometrica 68(3), 695-714.
[11] Kiefer, N. M. and Vogelsang, T. J. (2002a): "Heteroskedasticity-autocorrelation robust testing using bandwidth equal to sample size." Econometric Theory 18, 1350-1366.

[12] Kiefer, N. M. and Vogelsang, T. J. (2002b): "Heteroskedasticity-autocorrelation robust standard errors using the Bartlett kernel without truncation." Econometrica 70(5), 2093-2095.

[13] Kiefer, N. M. and Vogelsang, T. J. (2005): "A new asymptotic theory for heteroskedasticity-autocorrelation robust tests." Econometric Theory 21, 1130-1164.

[14] Kim, M. S. and Sun, Y. (2013): "Heteroskedasticity and spatiotemporal dependence robust inference for linear panel models with fixed effects." Journal of Econometrics 177(1), 85-108.

[15] Lehmann, E. L. and Romano, J. (2008): Testing Statistical Hypotheses. Springer: New York.

[16] Liang, K.-Y. and Zeger, S. (1986): "Longitudinal data analysis using generalized linear models." Biometrika 73(1), 13-22.

[17] Marden, J. and Perlman, M. D. (1980): "Invariant tests for means with covariates." Annals of Statistics 8(1), 25-63.

[18] Müller, U. K. (2007): "A theory of robust long-run variance estimation." Journal of Econometrics 141(2), 1331-1352.

[19] Muirhead, R. J. (2009): Aspects of Multivariate Statistical Theory. Vol. 197. Wiley.

[20] Newey, W. K. and West, K. D. (1987): "A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix." Econometrica 55, 703-708.

[21] Phillips, P. C. B. (2005): "HAC estimation by automated regression." Econometric Theory 21(1), 116-142.

[22] Shao, X. (2010): "A self-normalized approach to confidence interval construction in time series." Journal of the Royal Statistical Society B 72, 343-366.

[23] Sun, Y. (2011): "Robust trend inference with series variance estimator and testing-optimal smoothing parameter." Journal of Econometrics 164(2), 345-366.

[24] Sun, Y. (2013): "A heteroskedasticity and autocorrelation robust F test using orthonormal series variance estimator." Econometrics Journal 16(1), 1-26.

[25] Sun, Y. (2014a): "Let's fix it: fixed-b asymptotics versus small-b asymptotics in heteroscedasticity and autocorrelation robust inference." Journal of Econometrics 178(3), 659-677.

[26] Sun, Y. (2014b): "Fixed-smoothing asymptotics in a two-step GMM framework." Econometrica 82(6), 2327-2370.

[27] Sun, Y., Phillips, P. C. B. and Jin, S. (2008): "Optimal bandwidth selection in heteroskedasticity-autocorrelation robust testing." Econometrica 76(1), 175-194.

[28] Sun, Y. and Phillips, P. C. B. (2009): "Bandwidth choice for interval estimation in GMM regression." Working paper, Department of Economics, UC San Diego.

[29] Vogelsang, T. J. (2012): "Heteroskedasticity, autocorrelation, and spatial correlation robust inference in linear panel models with fixed-effects." Journal of Econometrics 166(2), 303-319.

[30] Zhang, X. and Shao, X. (2013): "Fixed-smoothing asymptotics for time series." Annals of Statistics 41(3), 1329-1349.