An Accurate Comparison of One-Step and Two

Should We Go One Step Further?
An Accurate Comparison of One-Step and Two-Step Procedures
in a Generalized Method of Moments Framework
Jungbin Hwang and Yixiao Sun
Department of Economics,
University of California, San Diego
Preliminary. Please do not circulate.
Abstract
According to the conventional asymptotic theory, the two-step Generalized Method of
Moments (GMM) estimator and test perform as least as well as the one-step estimator and
test in large samples. The conventional asymptotics theory, as elegant and convenient as it
is, completely ignores the estimation uncertainty in the weighting matrix, and as a result it
may not re‡ect …nite sample situations well. In this paper, we employ the …xed-smoothing
asymptotic theory that accounts for the estimation uncertainty, and compare the performance
of the one-step and two-step procedures in this more accurate asymptotic framework. We show
the two-step procedure outperforms the one-step procedure only when the bene…t of using
the optimal weighting matrix outweighs the cost of estimating it. This qualitative message
applies to both the asymptotic variance comparison and power comparison of the associated
tests. A Monte Carlo study lends support to our asymptotic results.
JEL Classi…cation: C12, C32
Keywords: Asymptotic E¢ ciency, Fixed-smoothing Asymptotics, Heteroskedasticity and Autocorrelation Robust, Increasing-smoothing Asymptotics, Asymptotic Mixed Normality, Nonstandard Asymptotics, Two-step GMM Estimation
1
Introduction
E¢ ciency is one of the most important problems in statistics and econometrics. In the widelyused GMM framework, it is standard practice to employ a two-step procedure to improve the
e¢ ciency of the GMM estimator and the power of the associated tests. The two-step procedure
requires the estimation of a weighting matrix. According to the Hansen (1982), the optimal
weighting matrix is the asymptotic variance of the (scaled) sample moment conditions. For time
series data, which is our focus here, the optimal weighting matrix is usually referred to as the
Email: [email protected], [email protected]. Correspondence to: Department of Economics, University of
California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0508.
1
long run variance (LRV) of the moment conditions. To be completely general, we often estimate
the LRV using the nonparametric kernel or series method.
Under the conventional asymptotics, both the one-step and two-step GMM estimators are
asymptotically normal1 . In general, the two-step GMM estimator has a smaller asymptotic variance. Statistical tests based on the two-step estimator are also asymptotically more powerful than
those based on the one-step estimator. A driving force behind these results is that the two-step
estimator and the associated tests have the same asymptotic properties as the corresponding ones
when the optimal weighting matrix is known. However, given that the optimal weighting matrix
is estimated nonparametrically in the time series setting, there is large estimation uncertainty. A
good approximation to the distributions of the two-step estimator and the associated tests should
re‡ect this relatively high estimation uncertainty.
One of the goals of this paper is to compare the asymptotic properties of the one-step and twostep procedures when the estimation uncertainty in the weighing matrix is accounted for. There
are two ways to capture the estimation uncertainty. One is to use the high order conventional
asymptotic theory under which the amount of nonparametric smoothing in the LRV estimator
increases with the sample size but at a slower rate. While the estimation uncertainty vanishes in
the …rst order asymptotics, we expect it to remain in high order asymptotics. The second way
is to use an alternative asymptotic approximation that can capture the estimation uncertainty
even with just a …rst-order asymptotics. To this end, we consider a limiting thought experiment
in which the amount of nonparametric smoothing is held …xed as the sample size increases. This
leads to the so-called …xed-smoothing asymptotics in the recent literature.
In this paper, we employ the …xed-smoothing asymptotics to compare the one-step and twostep procedures. For the one-step procedure, the LRV estimator is used in computing the standard
errors, leading to the popular heteroskedasticity and autocorrelation robust (HAR) standard errors. See, for example, Newey and West (1987) and Andrews (1991). For the two-step procedure,
the LRV estimator not only appears in the standard error estimation but also plays the role of the
optimal weighting matrix in the second-step GMM criterion function. Under the …xed-smoothing
asymptotics, the weighting matrix converges to a random matrix. As a result, the second-step
GMM estimator is not asymptotically normal but rather asymptotically mixed normal. The asymptotic mixed normality re‡ects the estimation uncertainty of the GMM weighting matrix and
is expected to be closer to the …nite sample distribution of the second-step GMM estimator. In
a recent paper, Sun (2014b) shows that both the one-step and two-step test statistics are asymptotically pivotal under this new asymptotic theory. So a nuisance-parameter-free comparison of
the one-step and two-step tests is possible.
Comparing the one-step and two-step procedures under the new asymptotics is fundamentally
di¤erent from that under the conventional asymptotics. Under the new asymptotics, the twostep procedure outperforms the one-step procedure only when the bene…t of using the optimal
weighting matrix outweighs the cost of estimating it. This qualitative message applies to both
the asymptotic variance comparison and the power comparison of the associated tests. This
is in sharp contrast with the conventional asymptotics where the cost of estimating the optimal
weighting matrix is completely ignored. Since the new asymptotic approximation is more accurate
than the conventional asymptotic approximation, comparing the two procedures under this new
asymptotics will give an honest assessment of their relative merits. This is con…rmed by a Monte
1
In this paper, the one-step estimator refers to the …rst-step estimator in the typical two-step GMM framework.
It does not refer to the continuous updating GMM estimator that involves only one step. We use the terms
“one-step” and “…rst-step” interchangingly.
2
Carlo study.
There is a large and growing literature on the …xed-smoothing asymptotics. For kernel LRV
variance estimators, the …xed-smoothing asymptotics is the so-called the …xed-b asymptotics …rst
studied by Kiefer, Vogelsang and Bunzel (2002) and Kiefer and Vogelsang (2002a, 2002b, 2005) in
the econometrics literature. For other studies, see, for example, Jansson (2004), Sun, Phillips and
Jin (2008), Sun and Phillips (2009), Gonçlaves and Vogelsang (2011), and Zhang and Shao (2013)
in the time series setting; Bester, Conley, Hansen and Vogelsang (2014) and Sun and Kim (2014)
in the spatial setting; and Gonçalves (2011), Kim and Sun (2013), and Vogelsang (2012) in the
panel data setting. For OS LRV variance estimators, the …xed-smoothing asymptotics is the socalled …xed-K asymptotics. For its theoretical development and related simulation evidence, see,
for example, Phillips (2005), Müller (2007), and Sun (2011, 2013). The approximation approaches
in some other papers can also be regarded as special cases of the …xed-smoothing asymptotics.
This includes, among others, Ibragimov and Müller (2010), Shao (2010) and Bester, Conley,
and Hansen (2011). The …xed-smoothing asymptotics can be regarded as a convenient device to
obtain some high order terms under the conventional increasing-smoothing asymptotics.
The rest of the paper is organized as follows. The next section presents a simple overidenti…ed GMM framework. Section 3 compares the two procedures from the perspective of
point estimation. Section 4 compares them from the testing perspective. Section 5 extends the
ideas to a general GMM framework. Section 6 reports simulation evidence on the …nite sample
performances of the two procedures. Proofs of the main theorems are provided in the Appendix.
A word on notation: for a symmetric matrix A; A1=2 (or A1=2 ) is a matrix square root of A
0
such that A1=2 A1=2 = A: Note that A1=2 does not have to be symmetric. We will specify A1=2
explicitly when it is not symmetric. If not speci…ed, A1=2 is a symmetric matrix square root of A
1
based on its eigendecomposition. By de…nition, A 1=2 = A1=2
and so A 1 = (A 1=2 )0 (A 1=2 ):
For matrices A and B; we use “A B” to signify that A B is positive (semi)de…nite. We use
“0” and “O” interchangeably to denote a matrix of zeros whose dimension may be di¤erent at
di¤erent occurrences. For two random variables X and Y; we use X ? Y to indicate that X and
Y are independent.
2
A Simple Overidenti…ed GMM Framework
To illustrate the basic ideas of this paper, we consider a simple overidenti…ed time series model
of the form:
y1t =
0
+ u1t ; y1t 2 Rd ;
y2t = u2t ; y2t 2 Rq
(1)
for t = 1; :::; T where 0 2 Rd is the parameter of interest and the vector process ut := (u01t ; u02t )0
is stationary with mean zero. We allow ut to have autocorrelation of unknown forms so that the
long run variance of ut :
1
X
= lrvar(ut ) =
Eut u0t j
j= 1
3
takes a general form. However, for simplicity, we assume that var(ut ) = 2 Id for the moment2 .
Our model is just a location model. We initially consider a general GMM framework but later
…nd out that our points can be made more clearly in the simple location model. In fact, the
simple location model may be regarded as a limiting experiment in a general GMM framework.
Embedding the location model in a GMM framework, the moment conditions are
0
E(yt )
0 ; y 0 )0 . Let
where yt = (y1t
2t
p1
T
p1
T
gT ( ) =
Then a GMM estimator of
0
0q
=0
1
!
PT
y1t
Pt=1
T
t=1 y2t
:
can be de…ned as
^GM M = arg min gT ( )0 W 1 gT ( )
T
for some positive de…nite weighting matrix WT : Writing
WT =
where W11 is a d
d matrix and W22 is a q
T
X
^GM M = 1
(y1t
T
W11 W12
W21 W22
;
q matrix, then it is easy to show that
W y2t )
for
W
= W12 W221 :
t=1
There are at least two di¤erent choices of WT . First, we can take WT to be the identity
matrix WT = Im for m = d + q: In this case, W = 0 and the GMM estimator ^1T is simply
T
X
^1T = 1
y1t :
T
t=1
Second, we can take WT to be the ‘optimal’ weighting matrix WT =
obtain the GMM estimator:
T
X
~2T = 1
(y1t
y2t ) ;
T
. With this choice, we
t=1
where = 12 221 is the long run regression coe¢ cient matrix. While ^1T completely ignores
the information in fy2t g ; ~2T takes advantage of this information.
2
If
V11
V21
var (ut ) = V :=
for any
2
V12
V22
6=
> 0; we can let
V1=2 =
(V1 2 )1=2
0
0
V12 (V22 ) 1=2
(V22 )1=2
2
Id
!
:
1
0
0
Then V1=2
(y1t
; y2t
) can be written as a location model whose error variance is the identity matrix Id : The estimation uncertainty in estimating V will not a¤ect our asymptotic results.
4
Under some moment and mixing conditions, we have
p
p
d
T ^1T
T ~2T
0 =) N (0; 11 ) and
d
0
=) N (0;
1 2) ;
where
12
=
11
12
1
22
21 :
So Asym Var(~2T ) < Asym Var(^1T ) unless 12 = 0. This is a well known result in the literature.
Since we do not know in practice, ~2T is infeasible. However, given the feasible estimator ^1T ;
we can estimate
and construct a feasible version of ~2T : The common two-step estimation
strategy is as follows.
i) Estimate the long run covariance matrix by
T
T
1 XX
s t
^ := ^ (^
u) =
Qh ( ; ) u
^t
T
T T
T
1X
u
^
T
s=1 t=1
0
where u
^t = (y1t
=1
!
u
^s
T
1X
u
^
T
=1
!0
^1T ; y 0 )0 .
2t
ii) Obtain the feasible two-step estimator ^2T = T
1
PT
t=1
y1t
^ y2t where ^ = ^ 12 ^ 1 .
22
In the above de…nition of ^ , Qh (r; s) is a symmetric weighting function that depends on the
smoothing parameter h: For conventional kernel estimators, Qh (r; s) = P
k ((r s) =b) and we take
K
1
h = 1=b: For the orthonormal series (OS) estimators, Qh (r; s) = K
j=1 j (r) j (s) and we
R1
2
take h = K; where j (r) are orthonormal basis functions on L [0; 1] satisfying 0 j (r) dr = 0:
We parametrize h in such a way so that h indicates the level or amount of smoothing for both
types of LRV estimators.
P
^ g in constructing ^ (^
u) : For the
Note that we use the demeaned process f^
ut T 1 T=1 u
^
^
location model, (^
u) is numerically identical to (u) where the unknown error process fut g is
used. The moment estimation uncertainty is re‡ected in the demeaning operation. Had we known
the true value of 0 and hence the true moment process fut g ; we would not need to demean fut g.
While ~2T is asymptotically more e¢ cient than ^1T ; is ^2T necessarily more e¢ cient than ^1T
and in what sense? Is the Wald test based on ^2T necessary more powerful than that based on
^1T ? One of the objectives of this paper is to address these questions.
3
A Tale of Two Asymptotics: Point Estimation
We …rst consider the conventional asymptotics where h ! 1 as T ! 1 but at a slower rate, i.e.,
h=T ! 0: The asymptotic direction is represented by the curved arrow in Figure 1. Sun (2014a,
2014b) calls this type of asymptotics the “Increasing-smoothing Asymptotics,”as h increases with
the sample size. Under this type of asymptotics and some regularity conditions,
we have ^ !p :
p
It can then be shown that ^2T is asymptotically equivalent to ~2T , i.e., T (~2T ^2T ) = op (1).
As a direct consequence, we have
p
p
d
d
1
^2T
T ^1T
0 =) N (0; 11 ); T
0 =) N 0; 11
12 22 21 :
So ^2T is still asymptotically more e¢ cient than ^1T :
5
Figure 1: Conventional Asymptotics (increasing smoothing) vs New Asymptotics (…xed smoothing)
The conventional asymptotics, as elegant and convenient as it is, does not re‡ect the …nite
sample situations well. Under this type of asymptotics, we essentially approximate the distribution of ^ by the degenerate distribution concentrating on . That is, we completely ignore the
estimation uncertainty in ^ : The degenerate approximation is too optimistic, as ^ is a nonparametric estimator, which by de…nition can have high variation in
p …nite samples.
To obtain a more accurate distributional approximation of T (^2T
0 ); we could develop a
high order increasing-smoothing asymptotics that re‡ects the estimation uncertainty in ^ . This
is possible but requires strong assumptions that cannot be easily veri…ed. In addition, it is
also technically challenging and tedious to rigorously justify the high order asymptotic theory.
Instead of high order asymptotic theory under the conventional asymptotics, we adopt the type
of asymptotics that holds h …xed as T ! 1: This asymptotic behavior of h and T is illustrated
by the arrow pointing to the right in Figure 1. Given that h is …xed, we follow Sun (2014a, 2014b)
and call this type of asymptotics the “Fixed-smoothing Asymptotics.” This type of asymptotics
takes the sampling variability of ^ into consideration.
Sun (2013, 2014a) has shown that the …xed-smoothing asymptotic distribution is high order correct under the conventional increasing-smoothing asymptotics. So the …xed-smoothing
asymptotics can be regarded as a convenient device to obtain some high order terms under the
conventional increasing-smoothing asymptotics.
To establish the …xed-smoothing asymptotics, we maintain Assumption 1 on the kernel function and basis functions.
6
Assumption 1 (i) For kernel HAR variance estimators, the kernel function k ( ) satis…es the
following condition: for any b 2 (0; 1] and
1, kb (x) and k (x) are symmetric, continuous,
piecewise monotonic, and piecewise continuously di¤ erentiable on [ 1; 1]. (ii) For the OS HAR
variance estimator, the basis functions j ( ) are piecewise monotonic, continuously di¤ erentiable
R1
and orthonormal in L2 [0; 1] and 0 j (x) dx = 0:
Assumption 1 on the kernel function is very mild. It includes many commonly used kernel
functions such as the Bartlett kernel, Parzen kernel, and Quadratic Spectral (QS) kernel.
De…ne
Z 1Z 1
Z 1
Z 1
Qh ( 1 ; 2 )d 1 d 2
Qh (r; )d +
Qh ( ; s)d
Qh (r; s) = Qh (r; s)
0
0
0
0
which is a centered version of Qh (r; s); and
T X
T
X
s t
~= 1
ut u
^0s :
Qh ( ; )^
T
T T
s=1 t=1
Assumption 1 ensures that ~ and ^ are asymptotically equivalent. Furthermore, under this
assumption, Sun (2014a) shows that, for both kernel LRV and OS LRV estimation, the centered
weighting function Qh (r; s) satis…es :
Qh (r; s) =
1
X
j
j (r) j (s)
j=1
R1
where f j (r)g is a sequence of continuously di¤erentiable functions satisfying 0 j (r)dr = 0
and the series on the right hand side converges to Qh (r; s) absolutely and uniformly over (r; s) 2
[0; 1] [0; 1]. The representation can be regarded as a spectral decomposition of the compact
Fredholm operator with kernel Qh (r; s) : See Sun (2014a) for more discussion.
Now, letting 0 ( ) := 1 and using the basis functions f j ( )g1
j=1 in the series representation
of the weighting function, we make the following assumptions.
Assumption 2 The vector process fut gTt=1 satis…es:
P
a) T 1=2 Tt=1 j (t=T )ut converges weakly to a continuous distribution, jointly over j = 0; 1; :::; J
for every …xed J;
b) For every …xed J and x 2 Rm ,
T
P
P
1 X
p
T t=1
t
j ( )ut
T
T
1 X
1=2 p
T t=1
x for j = 0; 1; ::; J
t
j ( )et
T
where
1=2
is a matrix square root of
=
=
P1
!
=
x for j = 0; 1; :::; J
1=2
12
12
0
0
j= 1 Eut ut j
7
1=2
22
1=2
22
!
!
+ o(1) as T ! 1
>0
and et s iid N (0; Im )
Assumption 3
P1
j= 1
k Eut u0t
k< 1:
j
Proposition 1 Let Assumptions 1–3 hold. As T ! 1 for a …xed h, we have:
d
(a) ^ =) 1 where
=
1
~1 =
1=2
Z
0
~1
1Z 1
0
0
1=2
:=
1;11
1;12
1;21
1;22
~ 1;11
~ 1;21
Qh (r; s)dBm (r)dBm (s)0 :=
and Bm ( ) is a standard Brownian motion of dimension m = d + q;
p
d
Id ;
(b) T ^2T
0 =)
1=2 Bm (1) where 1 =
1
independent of Bm (1) :
Conditional on
with variance
1,
V2 =
Id ;
p
T (^2T
0)
V2
11
That is,
the asymptotic distribution of
1
11
12
21
22
Id
=
0
1
p
T ^2T
1 (h; d; q)
0
0
12 1
11
~ 1;12
~ 1;22
:=
1;12
1
1;22
is
is a normal distribution
1
21
+
1
0
22 1 :
is asymptotically mixed-normal rather than normal. Since
12
1
22
21
=
=
12
12
1
22
0
12 1
21
1
22
1
22
1
12
21
1
22
+
1
0
1
0
22 1
0
almost surely, the feasible estimator ^2T has a large variation than the infeasible estimator ~2T :
This is consistent with our intuition.
p
d
^
Under the …xed-smoothing asymptotics, we still have T (^1T
0 ) =) N (0; 11 ) as 1T
does not depend onpthe smoothing parameter h: So the asymptotic (conditional) variances of
p
1
T (^1T
T (^2T
0 ) and
0 ) are both larger than
11
12 22 21 in terms of matrix
de…niteness. However, it is not straightforward to directly compare them. To evaluate the
relative magnitude of the two (conditional) asymptotic variances, we de…ne
~
1
1
:= ~ 1 (h; d; q) := ~ 1;12 ~ 1;22
;
(2)
which does not depend on any nuisance parameter but depends on h; d; q: For notational economy,
we sometimes suppress this dependence. Direct calculations show that
1
=
1=2 ~
12 1
1=2
22
+
12
1
22 :
(3)
Using this, we have, for any conforming vector a :
Ea0 (V2
11 ) a
= Ea0
= a0
= a0
0
a0 12 01 + 1 21 a
1 22 1 a
1=2 ~ ~ 0
1=20
a0 12 221 21 a
12E 1 1 12 a
h
i
1=2
1=2 0
1=2
1=2 0
1
~ ~0
12 22 21 ( 1 2 ) ( 1 2 ) a:
12 E 1 1
12
The following lemma gives a characterization of E ~ 1 (h; d; q) ~ 1 (h; d; q)0 :
8
(4)
Lemma 2 For any d
1; we have E ~ 1 (h; d; q) ~ 1 (h; d; q)0 = Ejj ~ 1 (h; 1; q) jj2
Id :
On the basis of the above lemma, we de…ne
g(h; q) :=
Let
=
Ejj ~ 1 (h; 1; q)jj2
2 (0; 1):
1 + Ejj ~ 1 (h; 1; q)jj2
1=2
11
1=2 0
12 ( 22 )
2 Rd
q
;
which is the long run correlation matrix between u1t and u2t . Using (4) and rewriting it in terms
of ; we obtain the following proposition.
0 is positive (negative) semide…nite,
Proposition 3 Let Assumptions 1–3 hold. If g(h; q)Id
^
^
then 2T has a larger (smaller) asymptotic variance than 1T under the …xed-smoothing asymptotics.
1=2
In the proof of the proposition, we show that the choices of the matrix square roots 11
1=2
and 22 are innocuous. To use the proposition, we need only to check whether the condition is
satis…ed for a given choice of the matrix square roots.
0 holds trivially. In this case, the asymptotic variance of ^
If = 0; then g(h; q)Id
2T
is larger than the asymptotic variance of ^1T : Intuitively, when the long run correlation is zero,
there is no information that can be explored to improve e¢ ciency. If we insist on using the long
run correlation matrix in attempt to improve the e¢ ciency, we may end up with a less e¢ cient
estimator, due to the noise in estimating the zero long run correlation matrix. On the other
hand, if 0 = Id after some possible rotation, which holds when the long run variation of u1t is
almost perfectly predicted by u2t ; then g(h; q)Id < 0 . In this case, it is worthwhile estimating
the long run variance and using it to improve the e¢ ciency ^2T : There are many scenarios in
0 is inde…nite. In this case we may still be able to access
between where the matrix g(h; q)Id
the de…niteness of the matrix along some directions; see Theorem 10. The case with d = 1
is simplest. In this case, if 0 > g(h; q); then ^2T is asymptotically more e¢ cient than ^1T :
Otherwise, it is not asymptotically less e¢ cient.
In the case of kernel LRV estimation, it is hard to obtain an analytical expression for
~
Ejj 1 (h; 1; q)jj2 and hence g(h; q), although we can always simulate it numerically. The threshold g(h; q) depends on the smoothing parameter h = 1=b and the degree of overidenti…cation q:
Tables 1–3 report the simulated values of g(h; q) for b = 0:00 : 0:01 : 0:20 and q = 1 5. It is
clear that g(h; q) increases with q and decreases with the smoothing parameter h = 1=b.
When the OS LRV estimation is used, we do not need to simulate g(h; q) as we can obtain a
closed form expression for it. Using this expression, we can prove the following corollary.
0
Corollary 4 Let Assumptions 1–3 hold. Consider the case of OS LRV estimation. If Kq 1 Id
is positive (negative) semide…nite, then ^2T has a larger (smaller) asymptotic variance than ^1T
under the …xed-smoothing asymptotics.
0 to be positive semide…nite is that
A necessary and su¢ cient condition for q=(K 1)Id
0
the largest eigenvalue of
is smaller than q=(K 1): The eigenvalues of 0 are the squared
long run correlation coe¢ cients between c01 u1t and c02 u2t for some c1 and c2 ; i.e., the square long
run canonical correlation coe¢ cients between u1t and u2t : So if the largest squared long run
9
canonical correlation is smaller than q=(K 1); then ^2T has a larger asymptotic variance than
^1T : Similarly, if the smallest squared long run canonical correlation is greater than q=(K 1);
then ^2T has a smaller asymptotic variance than ^1T :
Since ^2T is not asymptotically normal, asymptotic variance comparison does not paint the
whole picture. To compare the asymptotic distributions of ^1T and ^2T , we consider the case
that d = q = 1 and K = 4 as an example: Figure 2 represents the shapes of probability density functions in the case of OS LRV estimation when
( 11 ; 212 ; 22 ) = (1; 0:10; 1). In this
p
a
1
case, 11
T (^1T
N (0; 1) (red solid line)
12 22 21 = 0:9. The …rst graph shows
0)
p
a
^
and T ( 2T
N (0; 0:9) (bluepdotted line) under p
the conventional asymptotics. The con0)
ventional limiting distributions for T (^1T
)
and
T (^2T
0
0 ) are both normal but the
^
latter has a smaller variance, so the asymptotic e¢ ciency of 2T is always guaranteed. However, this is not true in the second graph of Figure 2, which represents the limiting distributions
under the …xed-smoothing asymptotics. Here, as before the red solid line represents N (0; 1)
2
but thePblue dottedPline represents the mixed normal distribution M N [0; 0:9(1 + ~ 1 )] with
K
K
2
2
~ =
1
i=1 1i 2i =
i=1 2i s M N (0; 1= K ): This can be obtained by using (4) with a = 1:
More speci…cally,
V2 =
1=2 ~ ~ 0
1=2 0
1 2 1 1( 1 2 )
12
1
22
21
2
2
+
11
= 0:9 ~ 1 + 0:9 = 0:9(1 + ~ 1 ):
(5)
Comparing these two di¤erent families of distributions, we …nd that the asymptotic distribution
of ^2T has fatter tail than that of ^1T : The asymptotic variance of ^2T is 0:9(1 + 1=2) = 1:35
which is larger than the asymptotic variance of ^1T .
More generally, when d = q = 1 with ( 11 ; 212 ; 22 ) = (1; 2 ; 1); we can use (4) to obtain the
asymptotic variance of ^2T as follows:
2
1+ 1
=
1
2
1+
2
2
E ~1
1
1
K
2
2
= 1
K
K
2
2
1 + E ~1
2
= 1
1
2
:
So, for any choice of K, the asymptotic variance of ^2T decreases as 2 increases. Under the
quadratic loss function, the asymptotic relative e¢ ciency of ^2T relative to ^1T is thus equal to
2 (K
1
1) = (K 2) : The second factor (K 1) = (K 2) > 1 re‡ects the cost of having
to estimate : When 2 = 1=(K 1); ^1T and ^2T have the same asymptotic variance. For a given
choice of K, if 2 1=(K 1), we prefer to use ^1T . This is not captured by the conventional
asymptotics, which always chooses ^2T over ^1T .
4
A Tale of Two Asymptotics: Hypothesis Testing
We are interested p
in testing the null hypothesis H0 : R 0 = r against the local alternative
H1 : R 0 = r + 0 = T for some p d matrix R and p 1 vectors r and 0 : Nonlinear restrictions
can be converted into linear ones using the Delta method. We construct the following two Wald
statistics:
W1T := T (R^1T
W2T := T (R^2T
1
r)0 R ^ 11 R
(R^1T
h
i 1
(R^2T
r)0 R ^ 1 2 R0
10
r)
r)
Conventional Asymptoptic Distributions
0.4
0.35
0.3
0.25
0.2
One-s tep estimator: N(0,1)
T wo-s tep estimator: N(0,0.9)
0.15
0.1
0.05
0
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
1.5
2
Fixed-smoothign Asymptotic Distributions
0.4
0.35
0.3
0.25
0.2
0.15
One-s tep estimator: N(0,1)
0.1
T wo-s tep estimator: MN(0,0.9(1+
2
))
0.05
0
-2
-1.5
-1
-0.5
0
0.5
1
Figure 2: Limiting distributions of ^1T and ^2T based on the OS LRV estimator with K = 4:
where ^ 1 2 = ^ 11 ^ 12 ^ 221 ^ 21 : When p = 1 and the alternative is one sided, we can construct
the following two t statistics:
p
T R^1T r
p
T1T : =
(6)
R ^ 11 R0
p
T R^2T r
p
T2T : =
:
(7)
R ^ 1 2 R0
No matter whether the test is based on ^1T or ^2T ; we have to employ the long run covariance
estimator ^ . De…ne the p p matrices 1 and 2 according to
1
In other words,
1
and
2
0
1
=R
11 R
0
and
2
0
2
are matrix square roots of R
11
=R
11 R
1 2R
0
0
:
and R
1 2R
0
respectively.
Under the conventional
p increasing-smoothing asymptotics, it is straightforward to show that
under H1 : R 0 = r + 0 = T :
d
W1T =)
1
2
p(
d
1
T1T =) N (
When
0
1
2
d
0
), W2T =)
0 ; 1),
T2T =) N (
1
d
1
2
2
1
2
p(
0
2
),
0 ; 1):
= 0; we obtain the null distributions:
d
2
p
W1T ; W2T =)
d
and T1T ; T2T =) N (0; 1):
So under the conventional increasing-smoothing asymptotics, the null limiting distributions of
2
2
1
1
W1T and W2T are identical. Since
0
0 , under the conventional asymptotics,
1
2
the local asymptotic power function of the test based on W2T is higher than that based on W1T :
The key driving force behind the conventional asymptotics is that we approximate the distribution of ^ by the degenerate distribution concentrating on : The degenerate approximation
does not re‡ect the …nite sample distribution well. As in the previous section, we employ the
…xed-smoothing asymptotics to derive more accurate distributional approximations. Let
Z 1Z 1
Z 1Z 1
0
Cpp =
Qh (r; s)dBp (r)dBp (s) ; Cpq =
Qh (r; s)dBp (r)dBq (s)0
0
0
0
0
Z 1Z 1
0
Cqq =
Qh (r; s)dBq (r)dBq (s)0 ; Cqp = Cpq
0
0
and
0
Cpq Cqq1 Cpq
Dpp = Cpp
where Bp ( ) 2 Rp and Bq ( ) 2 Rq are independent standard Brownian motion processes.
Proposition
5 Let Assumptions 1–3 hold. As T ! 1 for a …xed h, we have, under H1 : R
p
r + 0= T :
d
2
1
(a) W1T =) W11 (
0
1
d
1
2
2
0
d
1
1
d
(d) T2T =) T21
1
2
2 Rp :
(8)
) where
W21 (k k2 ) = Bp (1)
(c) T1T =) T11
=
) where
W11 (k k2 ) = [Bp (1) + ]0 Cpp1 [Bp (1) + ] for
(b) W2T =) W21 (
0
Cpq Cqq1 Bq (1) +
1
0
Dpp1 Bp (1)
p
= Cpp for p = 1:
0
:= Bp (1)+
1
0
:= Bp (1)
Cpq Cqq1 Bq (1) +
0
1
2
0
=
p
Cpq Cqq1 Bq (1) +
:
(9)
Dpp for p = 1:
In Proposition 5, we use the notation W11 (k k2 ), which implies that the right hand side of
(8) depends on only through k k2 : This is true, because for any orthogonal matrix H :
[Bp (1) + ]0 Cpp1 [Bp (1) + ] = [HBp (1) + H ]0 HCpp1 H 0 [HBp (1) + H ]
d
= [Bp (1) + H ]0 Cpp1 [Bp (1) + H ] :
12
~ 0 for some H
~ such that H is orthogonal, then
If we choose H = ( = k k ; H)
d
[Bp (1) + ]0 Cpp1 [Bp (1) + ] = [Bp (1) + k k ep ]0 Cpp1 [Bp (1) + k k ep ] :
So the distribution of [Bp (1) + ]0 Cpp1 [Bp (1) + ] depends on only through k k : Similarly, the
distribution of the right hand side of (9) depends only on k k2 :
When 0 = 0; we obtain the limiting distributions of W1T ; W2T ; T1T and T2T under the null
hypothesis:
d
W1T =) W11 := W11 (0) = Bp (1)0 Cpp1 Bp (1) ;
0
d
W2T =) W21 := W21 (0) = Bp (1) Cpq Cqq1 Bq (1) Dpp1 Bp (1)
p
d
T1T =) T11 := T11 (0) = Bp (1)= Cpp ;
p
d
T2T =) T21 := T21 (0) = Bp (1) Cpq Cqq1 Bq (1) = Dpp :
Cpq Cqq1 Bq (1) ;
These distributions are di¤erent from those under the conventional
asymptotics. For W1T and
p
T1T ; the di¤erence lies in the random scaling factor Cpp or Cpp : The random scaling factor captures the estimation uncertainty of the LRV estimator. For W2T and T2T ; there is an additional
di¤erence embodied by the random location shift Cpq Cqq1 Bq (1) with a consequent change in the
random scaling factor.
The proposition below provides some characterization of the two limiting distributions W11
and W21 :
Proposition 6 For any x > 0; the following hold:
(a) W21 (0) …rst-order stochastically dominates (FSD) W11 (0) in that
P [W21 (0)
h
(b) P W11 (k k2 )
h
(c) P W21 (k k2 )
x] > P [W11 (0)
x] :
i
x strictly increases with k k2 and limk
i
x strictly increases with k k2 and limk
h
W11 (k k2 )
h
2
k!1 P W21 (k k )
k!1 P
i
x = 1:
i
x = 1:
Proposition 6(a) is intuitive. W21 FSD W11 because W21 FSD Bp (1)0 Dpp1 Bp (1), which
in turn FSD Bp (1)0 Cpp1 Bp (1) which is just W11 . According to a property of the …rst-order
stochastic dominance, we have
d
W21 = W11 + We
for some We > 0: Intuitively, W21 shifts some of the probability mass of W11 to the right. A
direct implication is that the asymptotic critical values for W2T are larger than the corresponding
ones for W1T : The di¤erence in critical values has implications on the power properties of the
two tests.
For x > 0; we have
1
P (T11 > x) = P W11
2
x2
1
and P (T21 > x) = P W21
2
x2 .
It then follows from Proposition 6(a) that P (T21 > x) P (T11 > x) for x > 0: So for a onesided test with the alternative H1 : R 0 > r; critical values from T21 are larger than those from
13
T11 : Similarly, we have P (T21 < x) P (T11 < x) for x < 0: This implies that for a one-sided
test with the alternative H1 : R 0 < r; critical values from T21 are smaller than those from T11 :
Let W11 and W21 be the (1
) quantile from the distributions W11 and W21 ; respectively.
The local asymptotic power functions of the two tests are
h
i
2
2
2
1
1
:= 1
; K; p; q;
= P W11 ( 1 1 0 ) > W11 ;
1
0
0
1
1
h
i
2
2
2
1
1
1
:=
;
K;
p;
q;
=
P
W
(
)
>
W
0
0
2
21
0
2
21 :
1
2
2
2
2
1
1
While
; we also have W21 > W11 : The e¤ects of the critical values and
0
0
2
1
the noncentrality parameter move in opposite directions. It is not straightforward to compare the
two power functions. However, Proposition 6 suggests that if the di¤erence in the noncentrality
2
2
1
1
parameters
is large enough to o¤set the increase in critical values, then
0
0
2
1
the two-step test based on W2T will be more powerful.
2
2
1
1
; we de…ne
To evaluate
0
0
1
2
= R
R
11 R
which is the long run correlation matrix
1
2
2
1
0
1
=
0
0
=
0
0
0
1
=
0
0
0
1
=
0
0
0
1
R
11 R
1
0
R
h
1
12 )
1=2
22 ;
(10)
between Ru1t and u2t : In terms of
R
2 Rp
q
we have
2
1
22
12
1
R
0
R R
21 R
12
0
R R
( Ip
(R
0
1
Ip
n
1
Ip
R
1=2
0
1
0
1
22
1
21 R
Ip
1=2 0
)
0
0
0
o
0
0
1
1
1
0
R R
0
R
i
1
11 R
1
0
1
1
0
0
0
1=2
0
R R
Ip
1
0
1
1
1
0
> 0:
0
1
1
0
1
So the di¤erence in the noncentrality parameters depends on the matrix R 0R :
Pmin(p;q)
0
Let R = i=1
i ai bi be the singular value decomposition of R where fai g and fbi g are
2
orthonormal vectors in Rp and Rq respectively. Sorted in the descending order,
are the
i
(squared) long run canonical correlation coe¢ cients between Ru1t and u2t : Then
min(p;q)
1
2
2
1
0
1
2
0
X
=
i=1
Consider a special case that
1
2
2
1
:= maxpi=1
2
i
2
i
1
2
i
a0i
1
1
approaches 1. If a01
2
0
1
1
:
0
6= 0; then jj
1
2
2
0 jj
and hence jj 2 1 0 jj2 approaches 1 as 21 approaches 1 from below. This case happens
when the second block of moment conditions has very high long run prediction power for the …rst
2
block. In this case, we expect the W2T test to be more powerful, as limj 1 j!1 2 ( 2 1 0 ) = 1:
Consider another special case that maxi 2i = 0; i.e., R is a matrix of zeros. In this case, the
1
0
second block of moment conditions contains no additional information, and we have
1
2
1
2
2
0
=
: In this case, we expect the W2T test to be less powerful.
0
It follows from Proposition 6(b)(c) that for any , there exists a unique ( ) := ( ; h; p; q; )
such that
( )) :
1( ) = 2(
1
14
Then
1
2
= (
0
2
2
(
0
1
1
2
1
2(
2
0
)
)<
1
1
0
0
1(
2
0
0
1
)
0
1
2
1
1
1
1
Ip
1
) if and only if
2
2
0
1
< (
1
2
0
)
1
1
2
0
: But
2
0
0
R R
1=2
where
h
0
R R
1
2
0
Ip
i
0
R R
Ip
1=2
1
1
0
;
( ; h; p; q; ) 1
:
( ; h; p; q; )
f ( ) := f ( ; h; p; q; ) =
So the power comparison depends on whether f (
1
f
1
1
2
0
)Ip
0
R R
is positive semide…nite.
2
1
Proposition 7 Let Assumptions 1–3 hold and de…ne
:=
= 00 (R 11 R0 ) 1 0 . If
0
1
0
f ( ; h; p; q; )Ip
R R is positive (negative) semide…nite, then the two-step test based on W2T
has a lower (higher) local asymptotic power than the one-step test based on W1T under the …xedsmoothing asymptotics.
Note that
0
R R
= R
11 R
0
1=2
R
12
1
22
21
R0 R
11 R
0
1=2
:
To pin down R 0R ; we need to choose a matrix square root of R 11 R0 : Following the arguments in
the proof of Proposition 3, it is easy to see that we can use any matrix square root. In particular,
we can use the principal matrix square root.
When p = 1; which is of ultimate importance in empirical studies, R 0R is equal to the
sum of the squared long run canonical correlation coe¢ cients. In this case, f ( ; h; p; q; ) is the
threshold value of R 0R for assessing the relative e¢ ciency of the two tests. More speci…cally,
when R 0R > f ( ; h; p; q; ), the two-step W2T test is more powerful than the one-step W1T test.
Otherwise, the two-step W2T test is less powerful.
Proposition 7 is in parallel with Proposition 3. The qualitative messages of these two propositions are the same — when the long run correlation is high enough, we should estimate and exploit
it to reduce the variation of our point estimator and improve the power of the associated tests.
However, the thresholds are di¤erent quantitatively. The two propositions fully characterize the
threshold for each criterion under consideration.
Proposition 8 Consider the case of OS LRV estimation. For any
2 ( ) and hence ( ; h; p; q; ) > 1 and f ( ; h; p; q; ) > 0:
2 R+ , we have
1(
)>
Proposition 8 is intuitive. When there is no long run correlation between Ru1t and u2t ; we
2
2
1
1
: In this case, the two-step W2T test is necessarily less powerful. The
have
=
0
0
1
2
proof uses the theory of uniformly most powerful invariant tests and the theory of complete and
su¢ cient statistics. It is an open question whether the same strategy can be adopted to prove
Proposition 8 in the case of kernel LRV estimation. Our extensive numerical work supports that
( ; h; p; q; ) > 1 and f ( ; h; p; q; ) > 0 continue to hold in the kernel case.
It is not easy to give an analytical expression for f ( ; h; p; q; ) but we can compute it numerically without any di¢ culty. In Tables 4 and 5, we consider the case of OS LRV estimation and
compute the values of f ( ; K; p; q; ) for = 1 10; K = 8; 10; 12; 14, p = 1 4 and q = 1 3:
Similar to the asymptotic variance comparison, we …nd that these threshold values increase as
the degree of over-identi…cation increases and decrease as the smoothing parameter K increases.
Note that the continuity of two power functions 1 ( ; K; p; q; ) and 2 ( ; K; p; q; ) guarantees
15
that for 2 [1; 10] the two-step W2T test has a lower local asymptotic power than the one-step
W1T test if the largest eigenvalue of R 0R is smaller than min 2[1;10] f ( ; K; p; q; ). on the other
hand, for 2 [1; 10] the two-step W2T test has a higher local asymptotic power if the smallest
eigenvalue of R 0R is greater than max f ( ; K; p; q; ).
For the case of kernel LRV estimator, results not reported here show that f ( ; h; p; q; )
increases with q and decreases with h. This is entirely analogous to the case of OS LRV estimation.
5
General Overidenti…ed GMM Framework
In this section, we consider the general GMM framework. The parameter of interest is a d 1
vector 2
Rd . Let vt 2 Rdv denote the vector of observations at time t. We assume that 0
is the true value, an interior point of the parameter space . The moment conditions
E f (vt ; ) = 0; t = 1; 2; :::; T:
hold if and only if = 0 where f (vt ; ) is an m 1 vector of continuously di¤erentiable functions.
The process f (vt ; 0 ) may exhibit autocorrelation of unknown forms. We assume that m d and
that the rank of E[@ f (vt ; 0 ) =@ 0 ] is equal to d: That is, we consider a model that is possibly
over-identi…ed with the degree of over-identi…cation q = m d:
5.1
One-Step and Two-Step Estimation and Inference
De…ne the m
m contemporaneous covariance matrix
0
= E f (vt ; )f (vt ; ) and
=
1
X
j
and the LRV matrix
where
j
= E f (vt ;
as:
0
0 )f (vt j ; 0 ) :
j= 1
Let
t
1 X
gt ( ) = p
f (vj ; ):
T j=1
Given a simple positive-de…nite weighting matrix W0T that does not depend on any unknown
parameter, we can obtain an initial GMM estimator of 0 as
^0T = arg min gT ( )0 W 1 gT ( ):
0T
For example, we may set W0T equal to Im . In the case of IV regression, we may set W0T equal
to ZT0 ZT =T where ZT is the matrix of the instruments.
Using or as the weighting matrix, we obtain the following two (infeasible) GMM estimators:
~1T
: = arg min gT ( )0
1
gT ( );
(11)
~2T
: = arg min gT ( )0
1
gT ( ):
(12)
2
2
For the estimator ~1T ; we use the contemporaneous covariance matrix as the weighting matrix
and ignore all the serial dependency in the moment vector process ff (vt ; 0 )gTt=1 . In contrast
to this procedure, the second estimator ~2T accounts for the long run dependency. The feasible
16
versions of these two estimators ^1T and ^2T can be naturally de…ned by replacing
their estimates est (^0T ) and est (^0T ) where
est ( ) : =
and
with
T
1X
f (vt ; )f (vt ; )0 ;
T
(13)
t=1
T
T
1 XX
s t
Qh ( ; )f (vt ; )f (vs ; )0 :
est ( ) : =
T
T T
(14)
s=1 t=1
To test the null hypothesis H0 : R
di¤erent Wald statistics as follows:
0
= r against H1 : R
W1T
: = T (R^1T
W2T
: = T (R^2T
n
o
r)0 RV^1T R0
n
o
r)0 RV^2T R0
where
V^1T
=
V^2T
=
h
h
and
G01T
1 ^
est ( 1T )G1T
G2T
1 ^
est ( 2T )G2T
G1T
i
i
1h
G01T
0
1
1
=r+
(R^1T
r);
(R^2T
r);
1 ^
1 ^
^
est ( 1T ) est ( 1T ) est ( 1T )G1T
1
T
1 X @ f (vt ; )
=
T
@ 0
t=1
ih
T
1 X @ f (vt ; )
=
T
@ 0
t=1
; G2T
=^1T
p
0=
T ; we construct two
(15)
1 ^
est ( 1T )G1T
G01T
i
1
(16)
:
=^2T
These are the standard Wald test statistics in the GMM framework.
To compare the two estimators ^1T and ^2T and associated tests, we maintain the standard
assumptions below.
Assumption 4 As T ! 1; ^0T =
point 0 2 :
0 +op (1) ;
^1T =
0 +op (1) ;
^2T =
0 +op (1)
for an interior
Assumption 5 De…ne
t
1 @gt
1 X @ f (vt ; )
Gt ( ) = p
=
for t
0
T
@ 0
T@
j=1
For any
T
=
0
1 and G0 ( ) = 0:
+ op (1); the following holds: (a) plimT !1 G[rT ] (
0
G = G( 0 ) and G( ) = E@ f (vt ; )=@ ; (b)
est ( T )
p
!
T)
= rG uniformly in r where
> 0:
With these assumptions and some mild conditions, the standard GMM theory gives us
p
T (^1T
T
1 Xh 0
G
0) = p
T t=1
1
G
i
1
G0
1
f (vt ;
0)
+ op (1):
Under the …xed-smoothing asymptotics, Sun (2014b) establishes the representation:
p
T (^2T
T
1 Xh 0
p
G
0) =
T t=1
1 0
1G
17
i
1
G0
1
1 f (vt ; 0 )
+ op (1);
where 1 is de…ned in the similar way as 1 in Proposition 1: 1 = 1=2 ~ 1 01=2 :
Due to the complicated structure of two transformed moment vector processes, it is not
straightforward to compare the asymptotic distributions of ^1T and ^2T as in Sections 3 and 4.
To confront this challenge, we let
G= U
V0
m m m d d d
be a singular value decomposition (SVD) of G, where
0
A is a d
A;
=
O
d d
;
d q
d diagonal matrix and O is a matrix of zeros. Also, we de…ne
f (vt ;
0)
= (f1 0 (vt ;
0 ); f2
0
(vt ;
0
0 ))
:= U 0 f (vt ;
0)
2 Rm ;
where f1 (vt ; 0 ) 2 Rd and f2 (vt ; 0 ) 2 Rq are the rotated moment conditions. The variance and
long run variance matrices of ff (vt ; 0 )g are
:= U 0 U =
11
12
21
22
and
:= U 0 U . To convert the variance matrix into an identity matrix, we de…ne the normalized
moment conditions below:
f (vt ;
0)
0
0 0
0 ) ; f2 (vt ; 0 )
= f1 (vt ;
:= (
where
1=2
1=2
1 2)
(
=
1=2
12 ( 22 )
( 22 )1=2
0
1
1=2 )
!
f (vt ;
0)
:
(17)
More speci…cally,
f1 (vt ;
0)
: =(
1 2)
1=2
f2 (vt ;
0)
: =(
22 )
1=2
h
f1 (vt ;
f2 (vt ;
0)
0)
12 (
22 )
1
f2 (vt ;
2 Rq :
Then the contemporaneous variance of the time series ff (vt ;
0 ) 1:
is := ( 1=2 ) 1 ( 1=2
0 )g
1 2)
1=2
p
AV 0 T (^1T
0) =
(
1 2)
1=2
p
AV 0 T (^2T
0)
=
T
1 X
p
f1 (vt ;
T t=1
0)
T
1 X
p
[f1 (vt ;
T t=1
d
=) M N 0;
where
1
:=
1;12
1
1;22
11
is the same as in Proposition 1.
18
0)
in Assumptions 2 and 3.
d
+ op (1) =) N (0;
0)
2 Rd ;
is Im and the long run variance
Lemma 9 Let Assumptions 1–5 hold by replacing ut with f (vt ;
Then as T ! 1 for a …xed h;
(
i
0)
1 f2 (vt ; 0 )]
0
12 1
1
11 )
(18)
+ op (1)
21
+
1
(19)
0
22 1
The asymptotic
normality and mixed normality in the lemma are not new. The
p
p limiting
distribution of T (^1T 0 ) is available from the standard GMM theory while that of T (^2T 0 )
has been recently established in Sun (2014b). It is the representation that is new and insightful.
The representation casts the two estimators in the same form and enables us to directly compare
their asymptotic properties.
It follows from the proof of the lemma that
(
1 2)
1=2
AV
0
p
T
1 X
[f1 (vt ;
0) = p
T t=1
T (~2T
0)
0 f2 (vt ; 0 )]
+ op (1)
where 0 = 12 221 as de…ned before. So the di¤erence between the feasible and infeasible twostep GMM estimators lies in the uncertainty in estimating 0 : While the true value of appears
in the asymptotic distribution of the infeasible estimator ~2T ; the …xed-smoothing limit of the
implied estimator ^ := ^ 12 ^ 221 appears in that of the feasible estimator ^2T : It is important
to point out that the estimation uncertainty in the whole weighting matrix est matters only
through that in ^ :
If we let (u1t ; u2t ) = (f1 (vt ; 0 ); f2 (vt ; 0 )); then the right hand sides of (18) and (19) are
exactly the same as what we would obtain in the location model. The location model, as simple
as it is, has implications for general settings from an asymptotic point of view. More speci…cally,
de…ne
y1t = (
1=2
1 2)
AV 0
0
+ u1t ;
y2t = u2t ;
where u1t = f1 (vt ; 0 ) and u2t = f2 (vt ; 0 ). The estimation and inference problems in the GMM
setting are asymptotically equivalent to those in the above simple location model with fy1t ; y2t g
as the observations.
~ using
To present our next theorem, we transform R into R
~ = RV A
R
1
(
which has the same dimension as R: We let
Z 1Z 1
~ (h; p; q) =
Qh (r; s)dBp (r)dBq (s)0
1
0
0
1=2
;
1 2)
Z
0
(20)
1Z 1
0
1
Qh (r; s)dBq (r)dBq (s)0
;
which is compatible with the de…nition in (2). We de…ne
=
1=2
11
12
1=2
22
2 Rd
q
and
R
~
= R
~0
11 R
1=2
~
R
12
1=2
22
2 Rp
q
:
While is the long run correlation matrix between f1 (vt ; 0 ) and f2 (vt ; 0 ), R is the long run
~ 1 (vt ; 0 ) and f2 (vt ; 0 ). Finally, we rede…ne the noncentrality pacorrelation matrix between Rf
rameter :
h
i 1
1
= 00 R(G0 1 G) 1 (G0 1
G)(G0 1 G) 1 R0
(21)
0;
which is the noncentrality parameter based on the …rst-step test.
For the location model considered before, the above de…nition of R is identical to that in
(10). In that case, we have G = (Id ; Od q )0 and so U = Im , A = Id and V = Id : Given the
~ = R: In addition, the
assumption that =
= Im ; which implies that 1 2 = Id ; we have R
1
0
0
noncentrality parameter reduces to = 0 (R 11 R )
0 as de…ned in Proposition 7.
19
Theorem 10 Let the assumptions in Lemma 9 hold.
0 is positive (negative) semide…nite, then ^
(a) If g(h; q) Id
2T has a larger (smaller)
^
asymptotic variance than 1T as T ! 1 for a …xed h:
0
^
(b) If g(h; q) Ip
R R is positive (negative) semide…nite, then R 2T has a larger (smaller)
^
asymptotic variance than R 1T as T ! 1 for a …xed h:
0
(c) If f ( ; h; p; q; ) Ip
R R is positive (negative) semide…nite, then the test based on W2T
is asymptotically less (more) powerful than that based on W1T ; as T ! 1 for a …xed h:
Theorem 10(a) is a special case of Theorem 10(b) with R = Id . We single out this case in order
to compare it with Proposition 3. It is reassuring to see that the results are the same. The only
di¤erence is that in the general GMM case we need to rotate and standardize the original moment
conditions before computing the long run correlation matrix. Part (b) can also be applied to a
1=2
~ = R(
general location model with a nonscalar error variance, in which case R
. Theorem
1 2)
10(c) reduces to Proposition 7 in the case of the location model.
5.2
Two-Step GMM Estimation and Inference with a Working Weighting
Matrix
In the previous subsection, we employ two speci…c weighting matrices — the variance and long
run variance estimators. In this subsection, we consider a general weighting matrix WT (^0T );
which may depend on the initial estimator ^0T and the sample size T; leading to yet another
GMM estimator:
i 1
h
^aT = arg min gT ( )0 WT (^0T )
gT ( )
2
where the subscript ‘a’signi…es ‘another’or ‘alternative’.
An example of WT (^0T ) is the implied LRV matrix when we employ a simple approximating
parametric model to capture the dynamics in the moment process. We could also use the general
LRV estimator but we choose a large h so that the variation in WT (^0T ) is small. In the kernel
LRV estimation, this amounts to including only autocovariances of low orders in constructing
WT (^0T ): We assume that WT (^0T ) !p W , a positive de…nite nonrandom matrix under the
…xed-smoothing asymptotics. W may not be equal to the variance or long run variance of the
moment process. We call WT (^0T ) a working weighting matrix. This is in the same spirit of using
a working correlation matrix rather than a true correlation matrix in the generalized estimating
equations (GEE) setting. See, for example, Liang and Zeger (1986).
In parallel to (15), we construct the test statistic
n
o
r)0 RV^aT R0
WaT := T (R^aT
where, for GaT =
1
T
PT
t=1
@ f (vt ; )=@
h
i
V^aT = G0aT WT 1 (^aT )GaT
1h
0
=^aT
1
(R^aT
; V^aT is de…ned according to
G0aT WT 1 (^aT )
est
^aT W 1 (^aT )GaT
T
which is a standard variance estimator for ^aT :
De…ne
W = U 0 W U and W =
r);
1
1=2 W
20
(
0
1
1=2 )
=
ih
G0aT WT 1 (^aT )GaT
W11 W12
W21 W22
i
1
;
and a = W12 W221 :
Using the same argument for proving Lemma 9, we can show that
(
1=2
1 2)
1
A
T
1 X
p
)
=
[f1 (vt ;
0
T t=1
p
V 0 T (^aT
0)
a f2 (vt ; 0 )]
The above representation is the same as that in (19) except that
1=2
1=2
De…ne ~ a according to a = 1 2 ~ a 22 + 0 : That is,
1=2
12 ( a
~ =
a
The relationship between
In fact, if we let
a
0)
1=2
22
1=2
a
12
=
1=2
22
1=2
1=2
W
0
=
1=2
0
Id
~ 12
~ 11 W
W
~
~
W21 W22
(22)
is now replaced by
1
and ~ a is entirely analogous to that between
~ =
W
+ op (1):
a:
:
(23)
1
and ~ 1 ; see (3).
;
~ 12 W
~ 1 ; which is the long run regression matrix implied by the normalized weighting
then ~ a = W
22
~:
matrix W
~ [f1 (vt ; 0 )
Let Va be the long run variance of R
a f2 (vt ; 0 )] and
h
i
1=2
1=2 ~
=
V
R
(
)
12
22
a;R
a
a
22 ;
~ [f1 (vt ;
be the long run correlation matrix between R
R = Id ; a;R reduces to
a
=
11
+
a
0
22 a
a
0
12 a
21
0)
1=2
a f2 (vt ; 0 )]
(
12
a
22 )
and f2 (vt ;
0) :
When
1=2
22 :
Let
~ a = RV A
R
1
(
1=2
1 2)
1=2
12
~
=R
1=2
12
~R = R
~aR
~ a0
and D
1=2
~a = R
~
R
~0
1 2R
1=2
~
R
1=2
12:
~R
~ 0 = Ip , that is, each row of D
~ R has the same dimension as R
~ a and R: By construction, D
~ RD
D
R
d
~
~
is a unit vector in R and the rows of DR are orthogonal to each other. The matrix DR signi…es
~a:
the orthogonal directions embodied in R
With these additional notations, we are ready to state the theorem below.
Theorem 11 Let the assumptions in Lemma 9 hold. Assume further that WT (^0T ) !p W , a
positive de…nite nonrandom matrix.
0
0
(a) If E ~ 1 (h; d; q) ~ 1 (h; d; q) ~ a ~ a is positive (negative) semide…nite, then ^2T has a larger
(smaller) asymptotic variance than ^aT ; as T ! 1 for a …xed h:
0
^
~0
~ R ~a ~0 D
(b) If E ~ 1 (h; p; q) ~ 1 (h; p; q) D
a R is positive (negative) semide…nite, then R 2T
^
has a larger (smaller) asymptotic variance than R aT ; as T ! 1 for a …xed h:
0
^
(c) If g(h; d) Id
a a is positive (negative) semide…nite, then 2T has a larger (smaller)
^
asymptotic variance than aT ; as T ! 1 for a …xed h:
(d) If g(h; d) Ip a;R 0a;R is positive (negative) semide…nite, then R^2T has a larger (smaller)
asymptotic variance than R^aT ; as T ! 1 for a …xed h:
1=2
2
0
(e) Let a = jjVa
0 jj : If f ( a ; h; p; q; ) Ip
a;R a;R is positive (negative) semide…nite,
then the test based on W2T is asymptotically less (more) powerful than that based on WaT ; as
T ! 1 for a …xed h:
21
0
Theorem 11(a) is intuitive. While E ~ 1 (h; d; q) ~ 1 (h; d; q) can be regarded as the variance
0
in‡ation arising from estimating the long run regression matrix, ~ a ~ a can be regarded as the
bias e¤ect from not using a true long run regression matrix. When the variance in‡ation e¤ect
dominates the bias e¤ect, the two-step estimator ^2T will have a larger asymptotic variance than
^aT :
In the special case when W = ; we obtain the infeasible two-step estimator ~2T and ~ a = 0:
0
~ ~ 0 holds trivially and the feasible two-step estimator
In this case, E ~ 1 (h; d; q) ~ 1 (h; d; q)
a a
^2T has higher variation than its infeasible version ~2T :
In another special case when W = ; we obtain the …rst step estimator ^1T in which case
0 ) 1=2 : Hence
~
(Id
a = 0 and by (23), a =
~ ~ 0 = Id
a a
0
1=2
0
1=2
0
Id
:
0
The condition in Theorem 11(a) reduces to whether g(h; q) Iq
not. Theorem 11(a) thus reduces to Theorem 10(a).
is positive semide…nite or
d ~
~ R ~ (h; d; q) =
Theorem 11(b) is a rotated version of Theorem 11(a). Given that D
(h; p; q);
the condition in Theorem 11(a) implies that in Theorem 11(b). So part (b) of the theorem is
stronger than part (a) when R is not a square matrix.
When W = ; we have a = 0 and so
~ R ~a ~0 D
~0
D
a R =
~
R
~0
1 2R
=
~
R
1 2R
=
~
R
1 2R
=
Ip
~0
~0
0
R R
1=2 h
1=2 h
1=2 h
1=2
~
R
~(
R
~
R
1=2
12
i
a a
0)
a
12
0
R R
h
~ ~0
1
22
22 f( a
~0
21 R
i
0
R R
Ip
0
As a result, the condition E ~ 1 (h; p; q) ~ 1 (h; p; q)
to
0
E ~ 1 (h; p; q) ~ 1 (h; p; q)
Ip
R
i
1=2 0
12
~
R
~
R
~
R
0 ~0
0 )g R
~0
1 2R
1=2
~0
1 2R
i
~
R
1=2
~0
1 2R
1=2
1=2
:
~ R ~a ~0 D
~0
D
a R in Theorem 11(b) is equivalent
0
R
1=2
0
R R
Ip
0
R R
1=2
;
which can be reduced to the same condition in Theorem 10(b).
Theorem 11 (c)-(e) is entirely analogous to Theorem 10(a)-(c). The only di¤erence is that
the second block of moment conditions is removed from the …rst block using the implied matrix
coe¢ cient a before computing the long run correlation coe¢ cient.
To understand 11(e), we can see that the e¤ective moment conditions behind R^aT are:
Ef1a (vt ;
0)
= 0 for f1a (vt ;
0)
= RV A
1
(
1=2
[f1 (vt ; 0 )
1 2)
a f2 (vt ; 0 )] :
R^aT uses the information in Ef2 (vt ; 0 ) = 0 to some extent, but it ignores the residual information that is still potentially available from Ef2 (vt ; 0 ) = 0: In contrast, R^2T attempts to explore
the residual information. If there is no long run correlation between f1a (vt ; 0 ) and f2 (vt ; 0 ) ;
i.e., aR = 0; then all the information in Ef2 (vt ; 0 ) = 0 has been fully captured by the e¤ective
moment conditions underlying R^aT : As a result, the test based on R^aT necessarily outperforms
that based on R^2T : If the long run correlation aR is large enough in the sense that is given in
11(e), the test based on R^2T could be more powerful than that based on R^aT in large samples.
22
5.3
Practical Recommendation
Theorems 10 and 11 give theoretical comparisons of various procedures. But there are some
unknown quantities in the two Theorem 10. In practice, given the set of moment conditions
E f (vt ; ) = 0 and the data fvt g ; suppose that we want to test H0 : R 0 = r against R 0 6= r
for some R 2 Rp d We can follow the steps below to evaluate the relative merit of one-step and
two-step testing procedures.
1. Compute the initial estimator ^0T = arg min
PT
t=1 f (vt ;
)
2
:
2. One the basis of ^0T ; use a data-driven method to select the smoothing parameter: Denote
^
the data-driven value by h:
3. Compute
^
est ( 0T )
and
^
est ( 0T )
^
using the formulae in (14) and smoothing parameter h:
P
^ ^ V^ 0
4. Compute GT (^0T ) = T1 Tt=1 @ f @(vt0; ) j =^0T and its singular value decomposition U
where ^ 0 = (A^d d ; Od q ) and A^d d is diagonal.
5. Estimate the variance and the long run variance of the rotated moment processes by
^
^ := U
^0
^ and ^ := U
^0
est ( 0T )U
^
^
est ( 0T )U .
6. Compute the normalized LRV estimator:
^ = (^ )
1=2
1^
0
^
where
^
~ est = RV^ A^
7. Let R
1( ^
1=2
1=2
1 2)
B
=@
0
( ^ 1=2
)
1=2
12
1
^
^
12
^
0
^ 11
^ 21
:=
^ 12
^ 22
1=2
1
C
A:
22
1=2
22
(24)
and compute
h
0
~ est ^ 11 R
~ est
^R ^0R = (R
)
1=2
ih
0
~ est ^ 12 ^ 1 ^ 12 R
~ est
R
22
ih
0
~ est ^ 11 R
~ est
(R
)
and its largest eigenvalue vmax ^R ^0R and smallest eigenvalue vmin ^R ^0R :
1=2
i0
8. Choose the value of o such that P 2p ( ) > p1
= 75%. We may also choose a value of
to re‡ect scienti…c interest or economic signi…cance.
9. (a) If vmin ^R ^0R > f
on W2T :
(b) If vmax ^R ^0R < f
on W1T :
o
^ p; q;
; h;
o
^ p; q;
; h;
, then we use the second-step testing procedure based
; then we use the …rst-step testing procedure based
(c) If either the condition in (a) or in (b) is satis…ed, then we use the …rst-step testing
procedure based on WaT :
23
6
Simulation Evidence
This section compares the …nite sample performances of one-step and two-step estimators and
tests using the …xed-smoothing approximation.
6.1
Point Estimation
We consider the location model given in (1) with the true parameter value 0 = (0; :::; 0) 2 Rd
but we allow for a nonscalar error variance. The error fut g follows a VAR(1) process:
u1ti
=
u1ti 1
u2ti = u2ti
+p
q
q
X
u2tj
1
+ e1ti for i = 1; :::; d
(25)
j=1
+ e2ti for i = 1; :::; q
1
where e1ti
iid N (0; 1) across i and t, e2ti
iid N (0; 1) across i and t; and fe1t ; t = 1; 2; :::; T g
are independent of fe2t ; t = 1; 2; :::; T g : Let ut := (u1t ; u2t )0 2 Rm , then ut = ut 1 + et where
!
Id pq Jd;q
e1t
=
; et =
iid N (0; Im ) :
m m
e2t
0
Iq
Direct calculations give us the expressions for the long run and contemporaneous variances of
fut g as
=
1
X
Eut ut
j
= (Im
j= 1
0
and
where Jd;q is the d
= @
1
(1
I +
)2 d
(1
0
1
B
= var(ut ) = @
1
)
3p
2
(1
J
)4 d
)
1
3p
(1
J
q q;d
)
1
(1
(1+ 2 )
2 Id +
3 Jd
(1 2 )
p
J
2 2 q d
q (1
)
0
(Im
)
)
J
q d;q
2
Iq
1
1
A
2
p
1
1
Jd
)
2 Iq
2 2
q (1
q
1
C
A;
q matrix of ones. In our simulation, we …xhthe value of
at 0:75
i so
1=2
1
that each time series is reasonably persistent. Let u1t = ( 1 2 )
u1t
u2t and
12 ( 22 )
u2t = ( 22 ) 1=2 u2t and be the long run correlation matrix of ut = (u01t ; u02t )0 : With some
algebraic manipulations, we have
0
=
d+
2 2
)
(1
2
1
Jd :
1
2 2
So the maximum eigenvalue of 0 is given by vmax ( 0 ) = 1 + (1
) =(d 2 ) , which is also
the only nonzero eigenvalue. Obviously, vmax ( 0 ) is increasing in 2 for any given and d: Given
and d, we choose the value of to get a di¤erent value of vmax ( 0 ): We consider vmax ( 0 ) =
2 2
0; 0:09; 0:18; :::; 0:90; 0:99 which are obtained by setting 2 = vmax ( 0 )(1
) = (d(1 vmax ( 0 )) :
24
For the basis functions in OS HAR estimation, we choose the following orthonormal basis
2
functions f j g1
j=1 in L [0; 1] space:
2j 1 (x)
=
p
2 cos(2j x) and
2j (x)
=
p
2 sin(2j x) for j = 1; :::; K=2
We also consider commonly used kernel based LRV estimators by employing three kernels:
Bartlett, Parzen, QS kernels. In all our simulations the sample size T is 200:
We focus on the case with d = 1, under which ρ_0ρ_0' is a scalar and v_max(ρ_0ρ_0') = ρ_0ρ_0'. According to Proposition 3, if ρ_0ρ_0' is greater than a threshold value, then Var(θ̂_2T) will be smaller than Var(θ̂_1T); otherwise, Var(θ̂_2T) will be larger. We simulate Var(θ̂_1T) and Var(θ̂_2T) using 10,000 simulation replications.

Tables 6-7 report the simulated variances under the OS and kernel HAR estimation with q = 3 and 4 at some given values of the smoothing parameters.³ We first discuss the case when OS HAR estimation is used. It is clear that Var(θ̂_2T) becomes smaller than Var(θ̂_1T) only when v_max(ρ_0ρ_0') is large enough. For example, when q = 4 and there is no long run correlation, i.e., v_max(ρ_0ρ_0') = 0, we have Var(θ̂_1T) = 0.080 < Var(θ̂_2T) = 0.112, so θ̂_1T is more efficient than θ̂_2T, with a 27% efficiency gain. These numerical observations are consistent with our theoretical result in Proposition 1: θ̂_2T becomes more efficient relative to θ̂_1T only when the benefit of using the long run correlation matrix outweighs the cost of estimating it. With the choice of K = 14 and q = 4, Table 7 indicates that Var(θ̂_2T) starts to become smaller than Var(θ̂_1T) when v_max(ρ_0ρ_0') = ρ_0ρ_0' crosses a value in the interval [0.270, 0.360] from below. This agrees with the theoretical threshold value ρ_0ρ_0' = q/(K-1) = 4/13 ≈ 0.307 given in Corollary 4.

In the case when kernel HAR estimation is used, we get exactly the same qualitative message. For example, consider the case with the Bartlett kernel, b = 0.08, and q = 3. We observe that Var(θ̂_2T) starts to become smaller than Var(θ̂_1T) when v_max(ρ_0ρ_0') = ρ_0ρ_0' crosses a value in the interval [0.09, 0.18] from below. This is compatible with the threshold value 0.152 given in Table 1.
6.2 Hypothesis Testing
In addition to the VAR(1) error process, we also consider the following VARMA(1,1) process for u_t:

  u_{1t,i} = λ u_{1t-1,i} + e_{1t,i} + (ψ/√q) Σ_{j=1}^{q} e_{2t-1,j}  for i = 1, ..., d,    (26)
  u_{2t,i} = λ u_{2t-1,i} + e_{2t,i}  for i = 1, ..., q,

where e_t ~ i.i.d. N(0, I_m). The corresponding long run covariance matrix Ω and contemporaneous covariance matrix Σ are

  Ω = (1-λ)^{-2} [ I_d + ψ² J_d      (ψ/√q) J_{d,q} ]
                 [ (ψ/√q) J_{q,d}    I_q            ]

and

  Σ = (1-λ²)^{-1} [ I_d + ψ² J_d       (λψ/√q) J_{d,q} ]
                  [ (λψ/√q) J_{q,d}     I_q             ].

With some additional algebra, we have

  ρ_0 ρ_0' = ψ² / [dψ² + (1-λ)^{-2}] · J_d.

³ To calculate the values of K and b, we employ the plug-in versions of the AMSE-optimal formulae in Phillips (2005) and Andrews (1991), assuming v_max(ρ_0ρ_0') = 0.40.
Considering both the VAR(1) and VARMA(1,1) error processes, we implement three testing procedures on the basis of W_1T, W_2T, and W_aT. The significance level is α = 0.05, and we set v_max(ρ_0ρ_0') ∈ {0.00, 0.35, 0.50, 0.60, 0.80, 0.90} with d = 3 and q = 3. The null hypotheses of interest are

  H_01: θ_1 = 0;   H_02: θ_1 = θ_2 = 0,

where p = 1, 2, respectively. Here, W_aT is based on the working weighting matrix W(θ̂_0T) using a VAR(1) as the approximating model for the estimated error process {û_t(θ̂_0T)}. Note that W(θ̂_0T) converges in probability to the true long run variance matrix Ω if the true data generating process is a VAR(1). However, the probability limit of W(θ̂_0T) is different from the true LRV when the true DGP is a VARMA(1,1) process. For the smoothing parameters, we employ the data-driven AMSE-optimal bandwidth through the VAR(1) plug-in implementation developed by Andrews (1991) and Phillips (2005). All our results below are based on 10,000 simulation replications.
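The VAR(1) working weighting matrix is straightforward to construct: fit a VAR(1) to the estimated moment process by least squares and form Ŵ = (I - Â)^{-1} Σ̂_e (I - Â')^{-1}, which is exact when the true DGP is a VAR(1) and misspecified otherwise. A minimal sketch (the function name is a hypothetical placeholder):

```python
import numpy as np

def var1_lrv(u):
    """LRV implied by a fitted VAR(1): (I - A)^{-1} Sigma_e (I - A')^{-1}.

    u : (T, m) array of estimated moment conditions u_t(theta_hat_0T)
    """
    u = u - u.mean(axis=0)                        # demean
    y, x = u[1:], u[:-1]
    A = np.linalg.lstsq(x, y, rcond=None)[0].T    # OLS VAR(1) coefficient, (m, m)
    e = y - x @ A.T                               # residuals
    sigma_e = e.T @ e / len(e)
    inv = np.linalg.inv(np.eye(u.shape[1]) - A)
    return inv @ sigma_e @ inv.T
```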
Tables 8-15 report the empirical size of the three testing procedures under the two types of asymptotic approximations, with the OS HAR and kernel-based estimators. It is clear that all three tests, based on W_1T, W_aT, and W_2T, suffer from severe size distortion if the conventional chi-square critical values are used. For example, when the DGP is VAR(1) and OS HAR estimation is implemented, the empirical sizes of the three tests using the OS variance estimator are around 16%-24% when p = 2. The relatively large size distortion of the W_2T test comes from the additional cost of estimating the weighting matrix. However, if the nonstandard critical values from W_1∞ and W_2∞ are used, the size distortion of both procedures is substantially reduced. This result agrees with the previous literature, such as Sun (2013, 2014a&b) and Kiefer and Vogelsang (2005), which highlights the higher accuracy of the fixed-smoothing approximations. We also observe that, when the fixed-smoothing approximations are used, the W_1T test is more size-distorted than the W_2T test in most cases. Similar results for the kernel cases are reported in Tables 10-15.
Next, we investigate the finite sample power performance of the three procedures. We use the finite sample critical values under the null, so the power is size-adjusted and the power comparison is meaningful. The DGPs are the same as before except that the parameters are generated from the local alternatives θ = θ_0 + δ_0/√T. The degree of overidentification considered here is q = 3. Also, the domain of each power curve is rescaled to be δ² := δ_0' Ω_11^{-1} δ_0, as in Section 4 (see the sketch after this paragraph for the size-adjustment recipe).
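Size adjustment here is the standard Monte Carlo recipe: simulate the test statistic under the null, take its empirical (1-α) quantile as the critical value, and reuse that value under the alternative. A schematic sketch, with a hypothetical simulate_stat(delta, rng) returning one draw of the statistic:

```python
import numpy as np

def size_adjusted_power(simulate_stat, delta, alpha=0.05, reps=10_000, seed=0):
    """Size-adjusted power: null-based empirical critical value, then rejection rate."""
    rng = np.random.default_rng(seed)
    null_draws = np.array([simulate_stat(0.0, rng) for _ in range(reps)])
    cv = np.quantile(null_draws, 1 - alpha)    # finite sample critical value
    alt_draws = np.array([simulate_stat(delta, rng) for _ in range(reps)])
    return (alt_draws > cv).mean()
```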
Figures 3-6 show the size-adjusted finite sample power of the procedures based on OS HAR estimation. The dashed lines in the figures indicate the power curves of the W_aT procedure, while W_1T and W_2T are represented by dash-dot and dotted lines, respectively. We can see that, in all figures, the power curve of the two-step test shifts upward as the degree of the long run correlation v_max(ρ_R ρ_R') increases, and it starts to dominate that of the one-step test from a certain point v_max(ρ_R ρ_R') ∈ (0, 1). This is consistent with Proposition 7.
For example, with K = 14 and p = 1, the power curves in Figure 3 show that the power curve of the two-step test W_2T starts to cross that of the one-step test W_1T when v_max(ρ_R ρ_R') is around 0.25. This matches our theoretical results in Proposition 7 and Table 5, which indicate that the threshold value max_{δ² ∈ [1,10]} f(δ²; K, p, q, α) is about 0.273 when K = 14, p = 1, and q = 3. Also, if v_max(ρ_R ρ_R') is as high as 0.75, we can see that the two-step test is more powerful than the one-step test in most cases.

Lastly, when the DGP is VAR(1), the performance of W_aT dominates that of W_1T and W_2T over the whole range of v_max(ρ_R ρ_R') ∈ (0, 1). This is because the working weighting matrix W(θ̂_0T) with the VAR(1) approximation can almost precisely capture the true long run variance matrix Ω, and this improves the power of the tests whenever there is some long run correlation between u_{1t} and u_{2t}. However, in the VARMA(1,1) case, Figures 5-6 show that the advantages of W_aT are reduced, especially when v_max(ρ_R ρ_R') is close to one. This is due to the misspecification bias in W(θ̂_0T), which is estimated from a model that differs from the true one. Nevertheless, we still observe comparable performance of W_aT for most values of v_max(ρ_R ρ_R'). Figures 7-9, based on the kernel LRV estimators, deliver the same qualitative messages.
7 Conclusion
In this paper we have provided more accurate and honest comparisons between the popular one-step and two-step GMM estimators and the associated inference procedures. We have given some clear guidance on when we should go one step further and use a two-step procedure. Qualitatively, we want to go one step further only if the benefit of doing so clearly outweighs the cost. When the benefit and cost comparison is not clear-cut, we recommend using the two-step procedure with a working weighting matrix.

The qualitative message of the paper applies more broadly. As long as there is additional nonparametric estimation uncertainty in a two-step procedure relative to the one-step procedure, we have to be cautious about using the two-step procedure. While some asymptotic theory may indicate that the two-step procedure is always more efficient, the efficiency gain may not materialize in finite samples. In fact, blindly using the two-step procedure may sometimes do more harm than good.

There are many possible extensions of the paper; we give some examples here. First, we can use the more accurate approximations to compare the continuous-updating GMM and other generalized empirical likelihood estimators with the one-step and two-step estimators. Second, while the fixed-smoothing asymptotics captures the nonparametric estimation uncertainty of the weighting matrix estimator, it does not fully capture the estimation uncertainty embodied in the first-step estimator. The source of the problem is that we do not observe the moment process and have to use the estimated moment process, based on the first-step estimator, to construct the nonparametric variance estimator. It would be interesting to develop a further refinement of the fixed-smoothing approximation that captures the first-step estimation uncertainty more adequately. Finally, it would also be very interesting to give an honest assessment of the relative merits of the OLS and GLS estimators, which are popular in empirical applications.
8 Table of Selected Notations
Moment processes and their variance, LRV, and implied regression matrices:

  u_t = f(v_t, θ_0)                                the moment process
  ũ_t = U' f(v_t, θ_0)                             the rotated moment process
  ū_t = ( [Σ̃_{1·2}^{-1/2}(ũ_{1t} - Σ̃_12 Σ̃_22^{-1} ũ_{2t})]', [Σ̃_22^{-1/2} ũ_{2t}]' )'
                                                   the rotated and normalized moment process
  Σ = E u_t u_t',   Σ̃ = U'ΣU = [Σ̃_11, Σ̃_12; Σ̃_21, Σ̃_22]       contemporaneous variances
  Ω = Σ_{j=-∞}^{∞} E u_t u_{t-j}',   Ω̃ = U'ΩU = [Ω̃_11, Ω̃_12; Ω̃_21, Ω̃_22]   long run variances
  ρ = Ω_11^{-1/2} Ω_12 (Ω_22^{-1/2})'               implied long run correlation matrix

Some limiting quantities:

  Δ̃_∞ = ∫_0^1 ∫_0^1 Q_h(r,s) dB_m(r) dB_m(s)' = [Δ̃_{∞,11}, Δ̃_{∞,12}; Δ̃_{∞,21}, Δ̃_{∞,22}]
  Δ_∞ = ∫_0^1 ∫_0^1 Q_h(r,s) dB_{p+q}(r) dB_{p+q}(s)' = [C_pp, C_pq; C_qp, C_qq]
  β̃_∞ = β̃_∞(h, d, q) = Δ̃_{∞,12} Δ̃_{∞,22}^{-1},   β_∞ = β_∞(h, p, q) = C_pq C_qq^{-1}
  D_pp = C_pp - C_pq C_qq^{-1} C_pq'
Table 1: Threshold values g(h, q) for asymptotic variance comparison with the Bartlett kernel

b       q = 1   q = 2   q = 3   q = 4   q = 5
0.010 0.007 0.014 0.020 0.027 0.033
0.020 0.014 0.027 0.040 0.053 0.065
0.030 0.020 0.040 0.059 0.078 0.097
0.040 0.027 0.053 0.079 0.104 0.128
0.050 0.034 0.066 0.098 0.128 0.157
0.060 0.040 0.079 0.116 0.152 0.185
0.070 0.047 0.092 0.135 0.175 0.211
0.080 0.054 0.104 0.152 0.197 0.237
0.090 0.061 0.117 0.170 0.218 0.260
0.100 0.068 0.129 0.186 0.238 0.282
0.110 0.074 0.141 0.203 0.257 0.303
0.120 0.081 0.153 0.218 0.274 0.322
0.130 0.088 0.164 0.233 0.291 0.340
0.140 0.094 0.175 0.247 0.306 0.356
0.150 0.101 0.186 0.260 0.321 0.371
0.160 0.107 0.196 0.273 0.334 0.384
0.170 0.113 0.206 0.284 0.347 0.397
0.180 0.119 0.216 0.295 0.358 0.407
0.190 0.124 0.226 0.306 0.369 0.417
0.200 0.130 0.235 0.316 0.380 0.425
Note: h = 1/b indicates the level of smoothing and q is the degree of overidentification. When ρρ' > g(h, q) I_d, the two-step estimator θ̂_2T is more efficient than the one-step estimator θ̂_1T.
Table 2: Threshold values g(h, q) for asymptotic variance comparison with the Parzen kernel

b       q = 1   q = 2   q = 3   q = 4   q = 5
0.010 0.006 0.011 0.016 0.022 0.027
0.020 0.011 0.022 0.033 0.043 0.054
0.030 0.017 0.033 0.049 0.065 0.081
0.040 0.022 0.044 0.065 0.087 0.107
0.050 0.028 0.055 0.082 0.108 0.134
0.060 0.033 0.066 0.099 0.130 0.161
0.070 0.039 0.077 0.115 0.152 0.187
0.080 0.045 0.088 0.132 0.173 0.213
0.090 0.051 0.100 0.148 0.194 0.238
0.100 0.057 0.111 0.164 0.215 0.263
0.110 0.063 0.122 0.181 0.236 0.288
0.120 0.069 0.133 0.197 0.257 0.312
0.130 0.075 0.145 0.213 0.277 0.336
0.140 0.081 0.156 0.229 0.297 0.359
0.150 0.087 0.168 0.245 0.317 0.382
0.160 0.093 0.179 0.261 0.337 0.404
0.170 0.100 0.191 0.277 0.356 0.426
0.180 0.106 0.202 0.293 0.375 0.448
0.190 0.112 0.214 0.308 0.393 0.469
0.200 0.118 0.225 0.323 0.411 0.489
See notes to Table 1.
Table 3: Threshold values g(h, q) for asymptotic variance comparison with the QS kernel

b       q = 1   q = 2   q = 3   q = 4   q = 5
0.010   0.010   0.020   0.030   0.040   0.050
0.020   0.021   0.041   0.061   0.082   0.102
0.030   0.031   0.062   0.093   0.124   0.154
0.040   0.042   0.084   0.126   0.166   0.206
0.050   0.053   0.106   0.158   0.209   0.258
0.060   0.065   0.128   0.191   0.252   0.311
0.070   0.077   0.151   0.225   0.296   0.362
0.080   0.089   0.175   0.259   0.340   0.414
0.090   0.102   0.198   0.293   0.382   0.464
0.100   0.115   0.222   0.326   0.423   0.516
0.110   0.127   0.247   0.359   0.463   0.565
0.120   0.140   0.271   0.392   0.502   0.612
0.130   0.153   0.296   0.426   0.542   0.655
0.140   0.166   0.321   0.458   0.581   0.697
0.150   0.179   0.346   0.489   0.619   0.736
0.160   0.193   0.371   0.520   0.655   0.773
0.170   0.206   0.395   0.549   0.690   0.806
0.180   0.220   0.418   0.578   0.722   0.834
0.190   0.233   0.441   0.605   0.752   0.859
0.200   0.246   0.463   0.630   0.779   0.879
See notes to Table 1.
Table 4: Values of f(δ²; K, p, q, α) for α = 0.05 and K = 8, 10

            p = 1                   p = 2                   p = 3
 K   δ²   q=1    q=2    q=3      q=1    q=2    q=3      q=1    q=2    q=3
 8    1   0.144  0.350  0.514    0.234  0.375  0.599    0.231  0.456  0.614
 8    5   0.152  0.357  0.494    0.208  0.372  0.600    0.229  0.490  0.647
 8    9   0.154  0.343  0.489    0.218  0.397  0.608    0.227  0.498  0.667
 8   13   0.162  0.354  0.494    0.217  0.404  0.616    0.229  0.511  0.686
 8   17   0.168  0.360  0.501    0.210  0.397  0.621    0.233  0.513  0.700
 8   21   0.171  0.363  0.509    0.204  0.405  0.619    0.234  0.515  0.712
 8   25   0.165  0.379  0.526    0.207  0.410  0.623    0.233  0.514  0.720
10    1   0.071  0.283  0.441    0.159  0.290  0.450    0.192  0.365  0.509
10    5   0.133  0.276  0.425    0.132  0.304  0.431    0.187  0.331  0.508
10    9   0.133  0.260  0.408    0.134  0.308  0.432    0.196  0.337  0.513
10   13   0.133  0.267  0.405    0.133  0.308  0.440    0.209  0.349  0.509
10   17   0.138  0.267  0.417    0.139  0.309  0.447    0.212  0.348  0.515
10   21   0.128  0.304  0.418    0.137  0.313  0.453    0.202  0.348  0.523
10   25   0.133  0.311  0.432    0.151  0.321  0.457    0.204  0.360  0.522

Note: The rows are indexed by the noncentrality parameter δ². If the largest squared long run canonical correlation between the two blocks of moment conditions is smaller than f(δ²; K, p, q, α), then the two-step test is asymptotically less powerful than the one-step test.
Table 5: Values of f(δ²; K, p, q, α) for α = 0.05 and K = 12, 14

            p = 1                   p = 2                   p = 3
 K   δ²   q=1    q=2    q=3      q=1    q=2    q=3      q=1    q=2    q=3
12    1   0.092  0.200  0.298    0.146  0.202  0.322    0.163  0.339  0.346
12    5   0.101  0.217  0.304    0.124  0.247  0.350    0.118  0.290  0.360
12    9   0.093  0.201  0.308    0.127  0.226  0.357    0.117  0.274  0.357
12   13   0.104  0.196  0.322    0.121  0.230  0.361    0.116  0.275  0.369
12   17   0.092  0.194  0.327    0.124  0.239  0.362    0.123  0.284  0.379
12   21   0.119  0.204  0.327    0.116  0.240  0.362    0.122  0.286  0.385
12   25   0.113  0.251  0.350    0.105  0.243  0.364    0.123  0.296  0.388
14    1   0.071  0.286  0.273    0.101  0.182  0.331    0.135  0.263  0.366
14    5   0.091  0.223  0.267    0.133  0.184  0.279    0.110  0.211  0.337
14    9   0.084  0.215  0.265    0.118  0.192  0.283    0.118  0.212  0.338
14   13   0.087  0.199  0.257    0.107  0.200  0.283    0.122  0.216  0.334
14   17   0.087  0.213  0.265    0.105  0.196  0.284    0.128  0.207  0.334
14   21   0.103  0.245  0.262    0.097  0.190  0.289    0.119  0.213  0.337
14   25   0.086  0.244  0.269    0.095  0.201  0.295    0.130  0.205  0.339

Note: The rows are indexed by the noncentrality parameter δ². If the largest squared long run canonical correlation between the two blocks of moment conditions is smaller than f(δ²; K, p, q, α), then the two-step test is asymptotically less powerful than the one-step test.
Table 6: Finite sample variance comparison between θ̂_1T and θ̂_2T under VAR(1) error with T = 200 and q = 3

                               Var(θ̂_2T)
v_max(ρ_0ρ_0')  Var(θ̂_1T)   OS HAR     Bartlett   Parzen     QS         Var(θ̂_aT)
                              K = 14     b = 0.08   b = 0.15   b = 0.08
0.000           0.081         0.103      0.100      0.108      0.109      0.089
0.090           0.093         0.105      0.103      0.110      0.111      0.093
0.180           0.107         0.108      0.105      0.112      0.113      0.096
0.270           0.124         0.111      0.108      0.114      0.115      0.099
0.360           0.146         0.115      0.111      0.117      0.118      0.102
0.450           0.174         0.120      0.116      0.120      0.122      0.106
0.540           0.214         0.127      0.122      0.125      0.127      0.110
0.630           0.272         0.137      0.131      0.132      0.134      0.116
0.720           0.368         0.154      0.145      0.144      0.146      0.123
0.810           0.554         0.185      0.174      0.166      0.170      0.135
0.900           1.073         0.274      0.253      0.227      0.235      0.166
0.990           10.892        1.937      1.731      1.372      1.451      0.714
Table 7: Finite sample variance comparison between θ̂_1T and θ̂_2T under VAR(1) error with T = 200 and q = 4

                               Var(θ̂_2T)
v_max(ρ_0ρ_0')  Var(θ̂_1T)   OS HAR     Bartlett   Parzen     QS         Var(θ̂_aT)
                              K = 14     b = 0.070  b = 0.150  b = 0.070
0.000           0.081         0.112      0.104      0.120      0.114      0.089
0.090           0.092         0.114      0.106      0.121      0.115      0.093
0.180           0.106         0.117      0.108      0.123      0.118      0.096
0.270           0.122         0.124      0.111      0.126      0.120      0.100
0.360           0.146         0.125      0.115      0.129      0.124      0.105
0.450           0.175         0.130      0.121      0.133      0.129      0.110
0.540           0.217         0.139      0.129      0.139      0.135      0.116
0.630           0.278         0.151      0.141      0.148      0.146      0.123
0.720           0.379         0.172      0.160      0.163      0.162      0.134
0.810           0.576         0.213      0.198      0.193      0.196      0.152
0.900           1.128         0.328      0.305      0.276      0.289      0.197
0.990           11.627        2.538      2.364      1.884      2.089      1.013
Table 8: Empirical size of the one-step and two-step tests based on the series LRV estimator under VAR(1) error when λ = 0.75, p = 1, 2, and T = 200

p = 1 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.128   0.098     0.151   0.119      0.187   0.076
0.15               0.126   0.096     0.135   0.103      0.177   0.061
0.25               0.135   0.102     0.138   0.105      0.187   0.063
0.33               0.135   0.105     0.127   0.094      0.174   0.059
0.57               0.139   0.107     0.086   0.061      0.154   0.044
0.75               0.143   0.116     0.046   0.031      0.118   0.032

p = 2 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.181   0.111     0.222   0.138      0.290   0.077
0.26               0.191   0.118     0.219   0.136      0.296   0.069
0.40               0.192   0.115     0.201   0.120      0.290   0.065
0.50               0.195   0.119     0.194   0.112      0.290   0.057
0.73               0.206   0.120     0.168   0.095      0.272   0.057
0.86               0.206   0.124     0.143   0.082      0.245   0.051

Note: For each test, the left column uses the conventional χ² critical values and the right column uses the fixed-smoothing critical values (quantiles of W1∞ for the W1T and WaT tests and of W2∞ for the W2T test).

Table 9: Empirical size of the one-step and two-step tests based on the series LRV estimator under VARMA(1,1) error when λ = 0.75, p = 1, 2, and T = 200

p = 1 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.117   0.091     0.138   0.108      0.181   0.068
0.15               0.140   0.113     0.142   0.113      0.173   0.071
0.25               0.144   0.117     0.140   0.113      0.165   0.065
0.33               0.155   0.127     0.141   0.111      0.160   0.060
0.57               0.167   0.138     0.128   0.106      0.121   0.043
0.75               0.168   0.141     0.118   0.096      0.087   0.025

p = 2 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.188   0.119     0.227   0.146      0.290   0.080
0.26               0.202   0.129     0.209   0.136      0.270   0.073
0.40               0.206   0.135     0.204   0.134      0.254   0.069
0.50               0.223   0.148     0.215   0.144      0.251   0.065
0.73               0.221   0.148     0.205   0.138      0.214   0.053
0.86               0.222   0.156     0.194   0.132      0.178   0.044
Table 10: Empirical size of the one-step and two-step tests based on the Bartlett kernel variance estimator under VAR(1) error when λ = 0.75, p = 1, 2, and T = 200

p = 1 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.156   0.138     0.192   0.172      0.201   0.133
0.15               0.163   0.138     0.175   0.154      0.201   0.120
0.25               0.161   0.138     0.164   0.141      0.196   0.112
0.33               0.154   0.127     0.140   0.115      0.181   0.100
0.57               0.147   0.119     0.085   0.066      0.144   0.069
0.75               0.152   0.128     0.035   0.023      0.115   0.053

p = 2 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.239   0.183     0.287   0.228      0.305   0.177
0.26               0.230   0.166     0.263   0.196      0.298   0.150
0.40               0.231   0.169     0.243   0.170      0.296   0.138
0.50               0.228   0.161     0.234   0.159      0.286   0.130
0.73               0.228   0.157     0.179   0.118      0.263   0.108
0.86               0.230   0.159     0.161   0.108      0.240   0.098

Table 11: Empirical size of the one-step and two-step tests based on the Bartlett kernel variance estimator under VARMA(1,1) error when λ = 0.75, p = 1, 2, and T = 200

p = 1 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.161   0.142     0.196   0.177      0.203   0.134
0.15               0.147   0.127     0.165   0.144      0.188   0.116
0.25               0.140   0.117     0.149   0.129      0.174   0.105
0.33               0.131   0.115     0.134   0.113      0.158   0.090
0.57               0.117   0.099     0.083   0.068      0.109   0.051
0.75               0.109   0.092     0.035   0.026      0.058   0.024

p = 2 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.235   0.180     0.292   0.230      0.307   0.174
0.26               0.213   0.157     0.239   0.181      0.278   0.146
0.40               0.203   0.147     0.224   0.165      0.262   0.124
0.50               0.205   0.146     0.209   0.151      0.246   0.115
0.73               0.191   0.136     0.167   0.114      0.195   0.085
0.86               0.190   0.133     0.147   0.105      0.174   0.078
Table 12: Empirical size of the one-step and two-step tests based on the Parzen kernel variance estimator under VAR(1) error when λ = 0.75, p = 1, 2, and T = 200

p = 1 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.145   0.108     0.182   0.139      0.214   0.090
0.15               0.148   0.105     0.173   0.125      0.223   0.076
0.25               0.142   0.102     0.161   0.115      0.220   0.070
0.33               0.142   0.101     0.142   0.099      0.211   0.063
0.57               0.150   0.105     0.107   0.068      0.186   0.050
0.75               0.141   0.101     0.054   0.030      0.147   0.034

p = 2 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.216   0.123     0.278   0.169      0.340   0.102
0.26               0.225   0.117     0.267   0.149      0.348   0.085
0.40               0.221   0.117     0.260   0.140      0.346   0.081
0.50               0.219   0.112     0.241   0.123      0.331   0.072
0.73               0.217   0.102     0.199   0.097      0.310   0.059
0.86               0.226   0.116     0.175   0.080      0.292   0.054

Table 13: Empirical size of the one-step and two-step tests based on the Parzen kernel variance estimator under VARMA(1,1) error when λ = 0.75, p = 1, 2, and T = 200

p = 1 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.142   0.104     0.186   0.141      0.218   0.088
0.15               0.134   0.099     0.164   0.125      0.210   0.082
0.25               0.136   0.099     0.155   0.117      0.200   0.076
0.33               0.127   0.096     0.150   0.113      0.191   0.074
0.57               0.122   0.087     0.110   0.079      0.156   0.052
0.75               0.111   0.082     0.070   0.046      0.114   0.033

p = 2 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.220   0.124     0.279   0.171      0.338   0.100
0.26               0.204   0.112     0.248   0.142      0.320   0.094
0.40               0.198   0.108     0.226   0.135      0.303   0.083
0.50               0.196   0.112     0.225   0.131      0.291   0.085
0.73               0.186   0.106     0.188   0.102      0.255   0.063
0.86               0.182   0.105     0.156   0.083      0.219   0.055
Table 14: Empirical size of the one-step and two-step tests based on the QS kernel variance estimator under VAR(1) error when λ = 0.75, p = 1, 2, and T = 200

p = 1 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.138   0.107     0.174   0.144      0.204   0.089
0.15               0.138   0.103     0.164   0.126      0.209   0.077
0.25               0.141   0.106     0.151   0.115      0.214   0.076
0.33               0.135   0.099     0.145   0.106      0.208   0.069
0.57               0.149   0.110     0.101   0.068      0.187   0.056
0.75               0.132   0.099     0.049   0.029      0.136   0.036

p = 2 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.210   0.124     0.265   0.168      0.312   0.101
0.26               0.217   0.122     0.261   0.151      0.335   0.089
0.40               0.216   0.119     0.244   0.141      0.327   0.084
0.50               0.214   0.114     0.234   0.130      0.332   0.077
0.73               0.204   0.113     0.188   0.099      0.295   0.063
0.86               0.214   0.121     0.158   0.082      0.277   0.063

Table 15: Empirical size of the one-step and two-step tests based on the QS kernel variance estimator under VARMA(1,1) error when λ = 0.75, p = 1, 2, and T = 200

p = 1 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.141   0.112     0.175   0.141      0.204   0.090
0.15               0.137   0.110     0.164   0.132      0.201   0.089
0.25               0.130   0.104     0.149   0.117      0.188   0.076
0.33               0.123   0.096     0.140   0.111      0.178   0.074
0.57               0.117   0.094     0.113   0.088      0.152   0.058
0.75               0.110   0.085     0.060   0.042      0.110   0.034

p = 2 and q = 3
                   W1T test           WaT test           W2T test
v_max(ρ_Rρ_R')     χ²      W1∞       χ²      W1∞        χ²      W2∞
0.00               0.213   0.128     0.271   0.176      0.323   0.106
0.26               0.199   0.123     0.249   0.160      0.310   0.104
0.40               0.194   0.122     0.231   0.147      0.297   0.096
0.50               0.183   0.108     0.212   0.130      0.278   0.083
0.73               0.188   0.114     0.187   0.113      0.250   0.072
0.86               0.182   0.113     0.156   0.091      0.217   0.061
Figure 3: Size-adjusted power of the three tests based on the OS HAR estimator under VAR(1) error with p = 1, q = 3, λ = 0.75, T = 200, and K = 14. [Six panels of power curves; vertical axis: Power (0 to 1); horizontal axis: 0 to 25.]
Figure 4: Size-adjusted power of the three tests based on the OS HAR estimator under VAR(1) error with p = 2, q = 3, λ = 0.75, T = 200, and K = 14. [Six panels of power curves; vertical axis: Power (0 to 1); horizontal axis: 0 to 25.]
Figure 5: Size-adjusted power of the three tests based on the OS HAR estimator under VARMA(1,1) error with p = 1, q = 3, λ = 0.75, T = 200, and K = 14. [Six panels of power curves; vertical axis: Power (0 to 1); horizontal axis: 0 to 25.]
Figure 6: Size-adjusted power of the three tests based on the OS HAR estimator under VARMA(1,1) error with p = 2, q = 3, λ = 0.75, T = 200, and K = 14. [Six panels of power curves; vertical axis: Power (0 to 1); horizontal axis: 0 to 25.]
Figure 7: Size-adjusted power of the three tests based on the Bartlett HAR estimator under VAR(1) error with p = 2, q = 3, λ = 0.75, T = 200, and b = 0.078. [Six panels of power curves; vertical axis: Power (0 to 1); horizontal axis: 0 to 25.]
Figure 8: Size-adjusted power of the three tests based on the Parzen HAR estimator under VAR(1) error with p = 2, q = 3, λ = 0.75, T = 200, and b = 0.16. [Six panels of power curves; vertical axis: Power (0 to 1); horizontal axis: 0 to 25.]
Figure 9: Size-adjusted power of the three tests based on the QS HAR estimator under VAR(1) error with p = 2, q = 3, λ = 0.75, T = 200, and b = 0.079. [Six panels of power curves; vertical axis: Power (0 to 1); horizontal axis: 0 to 25.]
9 Appendix of Proofs
Lemma 12 Let A ∈ R^{n×n} and B ∈ R^{n×n} be any two matrix square roots of a positive definite symmetric matrix C ∈ R^{n×n} such that AA' = BB' = C. Then there exists an orthogonal matrix Q ∈ R^{n×n} such that A = BQ.

Proof of Lemma 12. It follows from AA' = BB' that B^{-1}AA'(B')^{-1} = I_n. Let Q = B^{-1}A; then QQ' = I_n, that is, Q is an orthogonal matrix. For this choice of Q, we have A = BQ, as desired.
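A quick numerical illustration of Lemma 12 (purely illustrative; the Cholesky and symmetric square roots are two possible choices of A and B):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
C = X @ X.T + 4 * np.eye(4)            # positive definite symmetric matrix

A = np.linalg.cholesky(C)               # one square root: AA' = C
w, V = np.linalg.eigh(C)
B = V @ np.diag(np.sqrt(w)) @ V.T       # another square root: BB' = C (symmetric)

Q = np.linalg.solve(B, A)               # Q = B^{-1} A
assert np.allclose(Q @ Q.T, np.eye(4))  # Q is orthogonal
assert np.allclose(B @ Q, A)            # A = BQ
```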
Proof of Proposition 1. Part (a) follows from Lemma 2 of Sun (2013). For part (b), we note that β̂ ⇒ β̃_∞, and so

  √T(θ̂_2T - θ_0) = (1/√T) Σ_{t=1}^{T} [(y_{1t} - E y_{1t}) - β̂ y_{2t}]
                 = (I_d, -β̂) [ (1/√T) Σ_{t=1}^{T} (y_{1t} - E y_{1t}) ; (1/√T) Σ_{t=1}^{T} y_{2t} ]
                 ⇒ (I_d, -β̃_∞) Ω^{1/2} B_m(1).
Proof of Lemma 2. For any a ∈ R^d, we have

  E a' β̃_∞(h,d,q) β̃_∞(h,d,q)' a
   = E tr{ a' [∫_0^1∫_0^1 Q_h(r,s) dB_d(r) dB_q'(s)] [∫_0^1∫_0^1 Q_h(r,s) dB_q(r) dB_q'(s)]^{-2} [∫_0^1∫_0^1 Q_h(r,s) dB_q(r) dB_d'(s)] a }
   = E tr{ [∫_0^1∫_0^1 Q_h(r,s) dB_q(r) dB_q'(s)]^{-2} [∫_0^1∫_0^1 Q_h(r,s) dB_q(r) dB_d'(s)] a a' [∫_0^1∫_0^1 Q_h(r,s) dB_d(r) dB_q'(s)] }
   = E tr{ [∫_0^1∫_0^1 Q_h(r,s) dB_q(r) dB_q'(s)]^{-2} ∫_0^1∫_0^1 [∫_0^1 Q_h(r,τ) Q_h(τ,s) dτ] dB_q(r) dB_q'(s) } · a'a
   = κ(h,q) a'a,

where

  κ(h,q) := E tr{ [∫_0^1∫_0^1 Q_h(r,s) dB_q(r) dB_q'(s)]^{-2} ∫_0^1∫_0^1 [∫_0^1 Q_h(r,τ) Q_h(τ,s) dτ] dB_q(r) dB_q'(s) }.

So E β̃_∞(h,d,q) β̃_∞(h,d,q)' = κ(h,q) I_d. Since this holds for any d, we have E‖β̃_∞(h,1,q)‖² = κ(h,q). It then follows that

  E β̃_∞(h,d,q) β̃_∞(h,d,q)' = E‖β̃_∞(h,1,q)‖² · I_d.
Proof of Proposition 3. We have

  Ω_{1·2} := Ω_11 - Ω_12 Ω_22^{-1} Ω_21
           = Ω_11^{1/2} [ I_d - Ω_11^{-1/2} Ω_12 Ω_22^{-1} Ω_21 (Ω_11^{-1/2})' ] (Ω_11^{1/2})'
           = Ω_11^{1/2} (I_d - ρρ') (Ω_11^{1/2})',

where ρ = Ω_11^{-1/2} Ω_12 (Ω_22^{-1/2})'. Combining this with Lemma 12, we can represent any matrix square root of Ω as

  Ω^{1/2} = [ Ω_11^{1/2}(I_d - ρρ')^{1/2}    Ω_12 (Ω_22^{-1/2})' ]
            [ O                               Ω_22^{1/2}          ] Q    (27)

for an orthogonal matrix Q. Using this representation together with Proposition 1(b) and Lemma 12, it follows that θ̂_2T has a larger (smaller) asymptotic variance than θ̂_1T under the fixed-smoothing asymptotics if

  E[ β̃_∞(h,d,q) β̃_∞(h,d,q)' ] Q_0' (I_d - ρρ') Q_0 ≥ (≤) Q_0' ρρ' Q_0    (28)

for any orthogonal matrix Q_0. But by Lemma 2, this is equivalent to

  E‖β̃_∞(h,1,q)‖² (I_d - ρρ') ≥ (≤) ρρ'.    (29)

The condition in (29) is in turn equivalent to

  ( E‖β̃_∞(h,1,q)‖² + 1 ) ρρ' ≤ E‖β̃_∞(h,1,q)‖² I_d,

or

  ρρ' ≤ [ E‖β̃_∞(h,1,q)‖² / ( E‖β̃_∞(h,1,q)‖² + 1 ) ] I_d = g(h,q) I_d.

Note that ρ is not uniquely defined, as Ω_11^{1/2} and Ω_22^{1/2} may not be uniquely defined. However, the non-uniqueness is innocuous. According to Lemma 12, if ρ̃ is another version of the long run correlation matrix, then ρ̃ = Q_1 ρ Q_2 for some orthogonal matrices Q_1 and Q_2. So

  ρ̃ρ̃' = Q_1 ρρ' Q_1',

and ρ̃ρ̃' ≤ g(h,q) I_d if and only if ρρ' ≤ g(h,q) I_d.
Proof of Corollary 4. For the OS LRV estimation, we have

  Q_h(r,s) = (1/K) Σ_{i=1}^{K} φ_i(r) φ_i(s),

and so

  ∫_0^1 Q_h(r,τ) Q_h(τ,s) dτ = (1/K²) Σ_{i=1}^{K} Σ_{j=1}^{K} φ_i(r) [∫_0^1 φ_i(τ)φ_j(τ) dτ] φ_j(s) = (1/K) Q_h(r,s).

As a result,

  κ(h,q) = (1/K) E tr{ [∫_0^1∫_0^1 Q_h(r,s) dB_q(r) dB_q'(s)]^{-1} }.

Let ξ_i := ∫_0^1 φ_i(r) dB_q(r); then ξ_i ~ iid N(0, I_q) and ∫∫ Q_h dB_q dB_q' = (1/K) Σ_{j=1}^{K} ξ_j ξ_j'. Hence

  κ(h,q) = E tr[ ( Σ_{j=1}^{K} ξ_j ξ_j' )^{-1} ] = tr( I_q / (K - q - 1) ) = q / (K - q - 1),

where the last equality follows from the mean of an inverse Wishart distribution. Using this, we have

  g(h,q) = κ(h,q) / (1 + κ(h,q)) = [ q/(K-q-1) ] / [ 1 + q/(K-q-1) ] = q / (K - 1).

The corollary then follows from Proposition 3.
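The inverse-Wishart moment used in the last step is easy to verify numerically. A small sanity check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
q, K, reps = 3, 14, 50_000

draws = rng.standard_normal((reps, K, q))
# tr[(sum_j xi_j xi_j')^{-1}] averaged over Monte Carlo replications
vals = [np.trace(np.linalg.inv(x.T @ x)) for x in draws]
print(np.mean(vals), q / (K - q - 1))   # both should be close to 0.3
```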
Proof of Proposition 5. It suffices to prove parts (a) and (b), as parts (c) and (d) follow from similar arguments. Part (b) is a special case of Theorem 6(a) of Sun (2014b) with G = [I_d, O_{d×q}]'. It remains to prove part (a). Under Rθ_0 = r + δ_0/√T, we have

  √T(Rθ̂_1T - r) = √T R(θ̂_1T - θ_0) + δ_0 ⇒ R Ω_11^{1/2} B_d(1) + δ_0.

Using Proposition 1(a), we have

  R Ω̂_11 R' ⇒ R Ω_11^{1/2} C_dd (R Ω_11^{1/2})',

where C_dd = ∫_0^1∫_0^1 Q_h(r,s) dB_d(r) dB_d(s)' and C_dd ⊥ B_d(1). Invoking the continuous mapping theorem yields

  W_1T := √T(Rθ̂_1T - r)' (R Ω̂_11 R')^{-1} √T(Rθ̂_1T - r)
        ⇒ [R Ω_11^{1/2} B_d(1) + δ_0]' [R Ω_11^{1/2} C_dd (R Ω_11^{1/2})']^{-1} [R Ω_11^{1/2} B_d(1) + δ_0].

Now [R Ω_11^{1/2} B_d(1), R Ω_11^{1/2} C_dd (R Ω_11^{1/2})'] is distributionally equivalent to [Λ_1 B_p(1), Λ_1 C_pp Λ_1'], where Λ_1 Λ_1' = R Ω_11 R', and so

  W_1T ⇒ [Λ_1 B_p(1) + δ_0]' [Λ_1 C_pp Λ_1']^{-1} [Λ_1 B_p(1) + δ_0]
       =_d [B_p(1) + δ]' C_pp^{-1} [B_p(1) + δ] = W_1∞(‖δ‖²),

where δ = Λ_1^{-1} δ_0 and ‖δ‖² = δ_0' (R Ω_11 R')^{-1} δ_0, as desired.
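The limit W_1∞(0) is easy to tabulate by simulation. In the OS case, Q_h(r,s) = K^{-1} Σ φ_i(r)φ_i(s) makes C_pp a scaled Wishart matrix independent of B_p(1), which gives the grid-free sketch below (an illustration under that simplification, not the paper's tabulation code):

```python
import numpy as np

def w1_inf_cv(p, K, alpha=0.05, reps=100_000, seed=0):
    """Simulate the (1 - alpha) quantile of W_{1,inf}(0) for the OS HAR estimator.

    In the OS case, Cpp = K^{-1} sum_j xi_j xi_j' with xi_j iid N(0, I_p),
    independent of Bp(1) ~ N(0, I_p).
    """
    rng = np.random.default_rng(seed)
    b = rng.standard_normal((reps, p))
    xi = rng.standard_normal((reps, K, p))
    stats = np.empty(reps)
    for i in range(reps):
        cpp = xi[i].T @ xi[i] / K
        stats[i] = b[i] @ np.linalg.solve(cpp, b[i])
    return np.quantile(stats, 1 - alpha)
```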
Proof of Proposition 6.
Part (a). We first prove that P(χ²_p(δ²) > x) increases with δ² for any integer p and any x > 0. Note that

  P(χ²_p(δ²) > x) = Σ_{j=0}^{∞} [ e^{-δ²/2} (δ²/2)^j / j! ] P(χ²_{p+2j} > x),

so we have

  ∂P(χ²_p(δ²) > x) / ∂δ²
   = -(1/2) Σ_{j=0}^{∞} [ e^{-δ²/2} (δ²/2)^j / j! ] P(χ²_{p+2j} > x) + (1/2) Σ_{j=1}^{∞} [ e^{-δ²/2} (δ²/2)^{j-1} / (j-1)! ] P(χ²_{p+2j} > x)
   = (1/2) Σ_{j=0}^{∞} [ e^{-δ²/2} (δ²/2)^j / j! ] [ P(χ²_{p+2+2j} > x) - P(χ²_{p+2j} > x) ]
   > 0,

as needed.
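This monotonicity is also easy to confirm numerically with the noncentral chi-square survival function (a quick check, not part of the proof):

```python
from scipy.stats import ncx2, chi2

p, x = 2, 5.99            # x close to the 5% critical value of chi2(2)
for nc in [0.0, 1.0, 4.0, 9.0]:
    # the survival function P(chi2_p(nc) > x) is increasing in the noncentrality nc
    print(nc, chi2.sf(x, p) if nc == 0 else ncx2.sf(x, p, nc))
```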
Let ξ ~ N(0,1) and let η be a zero-mean random variable that is independent of ξ. Using the monotonicity of P(χ²_p(δ²) > x) in δ², we have

  P((ξ + η)² > x) = E[ P(χ²_1(η²) > x) | η ] > P(χ²_1 > x) = P(ξ² > x) for any x.

Now we proceed to prove the proposition. Note that B_p(1) and B_q(1) are independent of C_pq, C_pp, and C_qq. Let D_pp^{-1} = Σ_{i=1}^{p} λ_{Di} d_i d_i' be the spectral decomposition of D_pp^{-1}, where λ_{Di} ≥ 0 almost surely and {d_i} are orthonormal in R^p. Then

  [B_p(1) - C_pq C_qq^{-1} B_q(1)]' D_pp^{-1} [B_p(1) - C_pq C_qq^{-1} B_q(1)] = Σ_{i=1}^{p} λ_{Di} (ξ_i - η_i)²,

where ξ_i = d_i' B_p(1) and η_i = d_i' C_pq C_qq^{-1} B_q(1). Conditional on (C_pq, C_pp, C_qq), {ξ_i} is independent of {η_i}, and ξ_i ~ iid N(0,1) both conditionally and unconditionally. So for any x > 0,

  P(W_2∞(0) > x) = E P( Σ_{i=1}^{p} λ_{Di}(ξ_i - η_i)² > x | C_pq, C_pp, C_qq )
   = E P( λ_{D1}(ξ_1 - η_1)² + Σ_{i=2}^{p} λ_{Di}(ξ_i - η_i)² > x | C_pq, C_pp, C_qq, {η_i}_{i=1}^{p}, {ξ_i}_{i=2}^{p} )
   ≥ E P( λ_{D1} ξ_1² + Σ_{i=2}^{p} λ_{Di}(ξ_i - η_i)² > x | C_pq, C_pp, C_qq, {η_i}_{i=2}^{p}, {ξ_i}_{i=2}^{p} ).

Using the above argument repeatedly, we have

  P(W_2∞(0) > x) ≥ P( Σ_{i=1}^{p} λ_{Di} ξ_i² > x ) = P( B_p(1)' D_pp^{-1} B_p(1) > x )
                > P( B_p(1)' C_pp^{-1} B_p(1) > x ) = P(W_1∞(0) > x),

where the last inequality follows from the fact that D_pp^{-1} > C_pp^{-1} almost surely.

Part (b). Let C_pp^{-1} = Σ_{i=1}^{p} λ_{Ci} c_i c_i' be the spectral decomposition of C_pp^{-1}. Since C_pp > 0 with probability one, λ_{Ci} > 0 with probability one. We have

  W_1∞(‖δ‖²) =_d [B_p(1) + ‖δ‖ e_p]' C_pp^{-1} [B_p(1) + ‖δ‖ e_p] = Σ_{i=1}^{p} λ_{Ci} [ c_i' B_p(1) + ‖δ‖ c_i' e_p ]²,

where, conditional on {λ_{Ci}}_{i=1}^{p} and {c_i}_{i=1}^{p}, the terms [c_i' B_p(1) + ‖δ‖ c_i' e_p]² follow independent noncentral chi-square distributions with noncentrality parameters ‖δ‖² (c_i' e_p)². Now consider two vectors δ_1 and δ_2 such that ‖δ_1‖ < ‖δ_2‖. We have

  P[ W_1∞(‖δ_1‖²) > x ]
   = E P{ λ_{C1}[c_1' B_p(1) + ‖δ_1‖ c_1' e_p]² + Σ_{i=2}^{p} λ_{Ci}[c_i' B_p(1) + ‖δ_1‖ c_i' e_p]² > x | {λ_{Ci}}, {c_i} }
   < E P{ λ_{C1}[c_1' B_p(1) + ‖δ_2‖ c_1' e_p]² + Σ_{i=2}^{p} λ_{Ci}[c_i' B_p(1) + ‖δ_1‖ c_i' e_p]² > x | {λ_{Ci}}, {c_i} },

where we have used the strict monotonicity of P(χ²_1(δ²) > x) in δ². Repeating the above argument, we have

  P[ W_1∞(‖δ_1‖²) > x ] < P{ Σ_{i=1}^{p} λ_{Ci}[c_i' B_p(1) + ‖δ_2‖ c_i' e_p]² > x }
   = P{ [B_p(1) + δ_2]' C_pp^{-1} [B_p(1) + δ_2] > x } = P[ W_1∞(‖δ_2‖²) > x ],

as desired.

Part (c). We note that

  W_2∞(‖δ‖²) = [B_p(1) - C_pq C_qq^{-1} B_q(1) + ‖δ‖ e_p]' D_pp^{-1} [B_p(1) - C_pq C_qq^{-1} B_q(1) + ‖δ‖ e_p]
   = { (I_p + C_pq C_qq^{-1} C_qq^{-1} C_qp)^{-1/2} [B_p(1) - C_pq C_qq^{-1} B_q(1)] + ‖δ‖ ẽ_p }'
     × (I_p + C_pq C_qq^{-1} C_qq^{-1} C_qp)^{1/2} D_pp^{-1} (I_p + C_pq C_qq^{-1} C_qq^{-1} C_qp)^{1/2}
     × { (I_p + C_pq C_qq^{-1} C_qq^{-1} C_qp)^{-1/2} [B_p(1) - C_pq C_qq^{-1} B_q(1)] + ‖δ‖ ẽ_p },

where ẽ_p = (I_p + C_pq C_qq^{-1} C_qq^{-1} C_qp)^{-1/2} e_p. Let Σ_{i=1}^{p} λ̃_{Di} d̃_i d̃_i' be the spectral decomposition of (I_p + C_pq C_qq^{-1} C_qq^{-1} C_qp)^{1/2} D_pp^{-1} (I_p + C_pq C_qq^{-1} C_qq^{-1} C_qp)^{1/2}, and define

  ξ̃_{di} = d̃_i' (I_p + C_pq C_qq^{-1} C_qq^{-1} C_qp)^{-1/2} [B_p(1) - C_pq C_qq^{-1} B_q(1)].

Then, conditional on (C_pq, C_pp, C_qq), ξ̃_{di} ~ iid N(0,1). Since this conditional distribution does not depend on (C_pq, C_pp, C_qq), ξ̃_{di} ~ iid N(0,1) unconditionally. Now

  W_2∞(‖δ_1‖²) = Σ_{i=1}^{p} λ̃_{Di} ( ξ̃_{di} + ‖δ_1‖ d̃_i' ẽ_p )²,

and so, for two vectors δ_1 and δ_2 with ‖δ_1‖ < ‖δ_2‖,

  P{ W_2∞(‖δ_1‖²) > x } = E P{ Σ_{i=1}^{p} λ̃_{Di}(ξ̃_{di} + ‖δ_1‖ d̃_i'ẽ_p)² > x | C_pq, C_pp, C_qq }
   < E P{ Σ_{i=1}^{p} λ̃_{Di}(ξ̃_{di} + ‖δ_2‖ d̃_i'ẽ_p)² > x | C_pq, C_pp, C_qq }
   = P{ W_2∞(‖δ_2‖²) > x }.
Proof of Proposition 8. Instead of directly proving π_1(δ²) > π_2(δ²) for any δ² > 0, we consider the following testing problem: we observe (Y, S) ∈ R^{p+q} × R^{(p+q)×(p+q)} with Y ⊥ S, where

  Y = (Y_1', Y_2')' ~ N_{p+q}(μ_*, Σ_*) with μ_* = (μ_0', 0')', Σ_* = [Σ_11, O; O, Σ_22],
  S = [S_11, S_12; S_21, S_22] ~ W_{p+q}(K, Σ_*) / K,

Y_1, μ_0 ∈ R^p, Y_2 ∈ R^q, Σ_11 and Σ_22 are nonsingular, and W_{p+q}(K, Σ_*) is the Wishart distribution with K degrees of freedom. We want to test H_0: μ_0 = 0 against H_1: μ_0 ≠ 0. The testing problem is partially motivated by Das Gupta and Perlman (1974) and Marden and Perlman (1980).

The joint pdf of (Y, S) can be written as

  f(Y, S | μ_0, Σ_11, Σ_22) = γ(μ_0, Σ_11, Σ_22) h(S) exp{ -(1/2) tr[ Σ_11^{-1}(Y_1 Y_1' + K S_11) + Σ_22^{-1}(Y_2 Y_2' + K S_22) ] + Y_1' Σ_11^{-1} μ_0 }

for some functions γ(·) and h(·). It follows from the exponential structure that

  Z := (Y_1, S_11, Y_2 Y_2' + K S_22)

is a complete sufficient statistic for ϑ := (μ_0, Σ_11, Σ_22). We note that Y_1 ~ N(μ_0, Σ_11), K S_11 ~ W_p(K, Σ_11), and Y_2 Y_2' + K S_22 ~ W_q(K+1, Σ_22), and these three random variables are mutually independent.

Now, we define the following two test functions for testing H_0: μ_0 = 0 against H_1: μ_0 ≠ 0:

  φ_1(Z) := 1{ V_1(Z) > W^α_{1∞} } and φ_2(Z) := E[ 1{ W_2(Y,S) > W^α_{2∞} } | Z ],

where

  V_1(Z) := Y_1' S_11^{-1} Y_1 and W_2(Y,S) := (Y_1 - S_12 S_22^{-1} Y_2)' (S_11 - S_12 S_22^{-1} S_21)^{-1} (Y_1 - S_12 S_22^{-1} Y_2).

We can show that the distributions of V_1(Z) and W_2(Y,S) depend on the parameter ϑ only via μ_0' Σ_11^{-1} μ_0. First, it is easy to show that

  W_2(Y,S) = (Y_1', Y_2') [S_11, S_12; S_21, S_22]^{-1} (Y_1', Y_2')' - Y_2' S_22^{-1} Y_2.

Let

  Ỹ = (Ỹ_1', Ỹ_2')' := Σ_*^{-1/2} (Y - μ_*) ~ N(0, I_{p+q}), μ̃ := Σ_*^{-1/2} μ_*,
  S̃ := Σ_*^{-1/2} S (Σ_*^{-1/2})' ~ W_{p+q}(K, I_{p+q}) / K.

Then Ỹ ⊥ S̃ and

  W_2(Y,S) = (Ỹ + μ̃)' S̃^{-1} (Ỹ + μ̃) - Ỹ_2' S̃_22^{-1} Ỹ_2.

It is now obvious that the distribution of W_2(Y,S) depends on ϑ only via ‖μ̃‖², which is equal to μ_0' Σ_11^{-1} μ_0. Second, we have

  V_1(Z) = (Ỹ_1 + Σ_11^{-1/2} μ_0)' S̃_11^{-1} (Ỹ_1 + Σ_11^{-1/2} μ_0),

and so the distribution of V_1(Z) depends on ϑ only via ‖Σ_11^{-1/2} μ_0‖², which is also equal to μ_0' Σ_11^{-1} μ_0.

It is easy to show that the null distributions of V_1(Z) and W_2(Y,S) are the same as W_1∞ and W_2∞, respectively. In view of the critical values used, both tests φ_1(Z) and φ_2(Z) have the correct level α. Since

  E_ϑ φ_1(Z) = P( V_1(Z) > W^α_{1∞} ) and E_ϑ φ_2(Z) = E{ E[ 1(W_2(Y,S) > W^α_{2∞}) | Z ] } = P( W_2(Y,S) > W^α_{2∞} ),

the power functions of the two tests φ_1 and φ_2 are π_1(μ_0' Σ_11^{-1} μ_0) and π_2(μ_0' Σ_11^{-1} μ_0), respectively.

We consider a group of transformations G, which consists of the elements of A_{p×p} := {A ∈ R^{p×p}: A is a (p×p) nonsingular matrix} and acts on the sample space Ξ := R^p × R^{p×p} × R^{q×q} of the sufficient statistic through the mapping

  g: (Y_1, S_11, Y_2 Y_2' + K S_22) → (A Y_1, A S_11 A', Y_2 Y_2' + K S_22).

The induced group of transformations Ḡ acting on the parameter space Θ := R^p × S_{p×p} × S_{q×q} is given by

  ḡ: (μ_0, Σ_11, Σ_22) → (A μ_0, A Σ_11 A', Σ_22).

Our testing problem is obviously invariant to this group of transformations.

Define V(Z) := (Y_1' S_11^{-1} Y_1, Y_2 Y_2' + K S_22) := (V_1(Z), V_2(Z)). It is clear that V(Z) is invariant under G. We can also show that V(Z) is maximal invariant under G. To do so, we consider two different samples Z := (Y_1, S_11, Y_2 Y_2' + K S_22) and Z* := (Y_1*, S_11*, Y_2* Y_2*' + K S_22*) such that V(Z) = V(Z*). We want to show that there exists a p×p nonsingular matrix A such that Y_1* = A Y_1 and S_11* = A S_11 A' whenever Y_1*'(S_11*)^{-1}Y_1* = Y_1' S_11^{-1} Y_1. By Theorem A9.5 (Vinograd's Theorem) in Muirhead (2009), there exists an orthogonal p×p matrix H such that (S_11*)^{-1/2} Y_1* = H S_11^{-1/2} Y_1, and this gives us the nonsingular matrix A := (S_11*)^{1/2} H S_11^{-1/2} satisfying Y_1* = A Y_1 and S_11* = A S_11 A'. Similarly, we can show that

  v(ϑ) := (μ_0' Σ_11^{-1} μ_0, Σ_22)

is maximal invariant under the induced group Ḡ. Therefore, restricting attention to G-invariant tests, testing H_0: μ_0 = 0 against H_1: μ_0 ≠ 0 reduces to testing

  H_0': μ_0' Σ_11^{-1} μ_0 = 0 against H_1': μ_0' Σ_11^{-1} μ_0 > 0

based on the maximal invariant statistic V(Z).

Let f(V_1; μ_0' Σ_11^{-1} μ_0) and f(V_2; Σ_22) be the marginal pdfs of V_1 := V_1(Z) and V_2 := V_2(Z). By construction, V_1(Z)(K-p+1)/(Kp) follows the noncentral F distribution F_{p,K-p+1}(μ_0' Σ_11^{-1} μ_0), so f(V_1; μ_0' Σ_11^{-1} μ_0) is the (scaled) pdf of the noncentral F distribution. It is well known that the noncentral F distribution has the monotone likelihood ratio (MLR) property in V_1 with respect to the parameter μ_0' Σ_11^{-1} μ_0 (e.g., Chapter 7.9 in Lehmann and Romano (2008)). Also, in view of the independence between V_1 and V_2, the joint distribution of V(Z) also has the MLR property in V_1. By virtue of the Neyman-Pearson lemma, the test φ_1(Z) := 1{V_1(Z) > W^α_{1∞}} is the unique uniformly most powerful invariant (UMPI) test among all G-invariant tests based on the complete sufficient statistic Z. So if φ_2(Z) is equivalent to a G-invariant test, then π_1(μ_0' Σ_11^{-1} μ_0) > π_2(μ_0' Σ_11^{-1} μ_0) for any μ_0' Σ_11^{-1} μ_0 > 0. To show that φ_2(Z) has this property, we let g ∈ G be any element of G with the corresponding matrix A_g and induced transformation ḡ ∈ Ḡ. Then

  E_ϑ[ φ_2(gZ) ] = E_{ḡϑ}[ φ_2(Z) ] = π_2( (A_g μ_0)'(A_g Σ_11 A_g')^{-1}(A_g μ_0) ) = π_2( μ_0' Σ_11^{-1} μ_0 ) = E_ϑ[ φ_2(Z) ]

for all ϑ. It follows from the completeness of Z that φ_2(gZ) = φ_2(Z) almost surely, and this delivers the desired result.
Proof of Lemma 9. We prove a more general result by establishing a representation for

  (1/√T) Σ_{t=1}^{T} [G' M^{-1} G]^{-1} G' M^{-1} f(v_t, θ_0)

in terms of the rotated and normalized moment conditions, for any m×m (almost surely) positive definite matrix M, which may be random. Let

  M̄ := U' M U,  M̃ := Σ̃_{1/2}^{-1} M̄ (Σ̃_{1/2}^{-1})' = [M̃_11, M̃_12; M̃_21, M̃_22],

and M̃_{1·2} := M̃_11 - M̃_12 M̃_22^{-1} M̃_21, where Σ̃_{1/2} is the matrix square root of Σ̃ = U'ΣU given by Σ̃_{1/2} = [Σ̃_{1·2}^{1/2}, Σ̃_12 Σ̃_22^{-1/2}; O, Σ̃_22^{1/2}]. Using the SVD G = U [A', O]' V' of G, we have

  G' M^{-1} G = V A [I_d, O] M̄^{-1} [I_d, O]' A V' = V A (Σ̃_{1·2}^{-1/2})' M̃_{1·2}^{-1} Σ̃_{1·2}^{-1/2} A V',

so that

  [G' M^{-1} G]^{-1} = V A^{-1} (Σ̃_{1·2})^{1/2} M̃_{1·2} [(Σ̃_{1·2})^{1/2}]' A^{-1} V',    (30)

where we have used

  [I_d, O] Σ̃_{1/2}^{-1} = (Σ̃_{1·2})^{-1/2} [I_d, -Σ̃_12 (Σ̃_22)^{-1}].

In addition, writing f̄(v_t, θ_0) = (f̄_1(v_t,θ_0)', f̄_2(v_t,θ_0)')' := Σ̃_{1/2}^{-1} U' f(v_t, θ_0) for the rotated and normalized moment process, we have

  G' M^{-1} f(v_t, θ_0) = V A [I_d, O] (Σ̃_{1/2}^{-1})' M̃^{-1} f̄(v_t, θ_0)
                        = V A (Σ̃_{1·2}^{-1/2})' M̃_{1·2}^{-1} [ f̄_1(v_t, θ_0) - M̃_12 M̃_22^{-1} f̄_2(v_t, θ_0) ].

Hence

  (1/√T) Σ_{t=1}^{T} [G' M^{-1} G]^{-1} G' M^{-1} f(v_t, θ_0)
   = V A^{-1} (Σ̃_{1·2})^{1/2} (1/√T) Σ_{t=1}^{T} [ f̄_1(v_t,θ_0) - M̃_12 M̃_22^{-1} f̄_2(v_t,θ_0) ].    (31)

Let M = Σ. Then M̄ = U'ΣU = Σ̃ and M̃ = Σ̃_{1/2}^{-1} Σ̃ (Σ̃_{1/2}^{-1})' = I_m, so M̃_12 M̃_22^{-1} = O. Using this and the stochastic expansion of √T(θ̂_1T - θ_0), we have

  √T(θ̂_1T - θ_0) = V A^{-1} (Σ̃_{1·2})^{1/2} (1/√T) Σ_{t=1}^{T} f̄_1(v_t, θ_0) + o_p(1).

It then follows that

  (Σ̃_{1·2})^{-1/2} A V' √T(θ̂_1T - θ_0) = (1/√T) Σ_{t=1}^{T} f̄_1(v_t, θ_0) + o_p(1) ⇒ N(0, Ω̄_11),

where Ω̄ := Σ̃_{1/2}^{-1} Ω̃ (Σ̃_{1/2}^{-1})' = [Ω̄_11, Ω̄_12; Ω̄_21, Ω̄_22] is the long run variance of f̄(v_t, θ_0).

Next, let M be the weighting matrix used by the second-step estimator. Under the fixed-smoothing asymptotics, M̃_12 M̃_22^{-1} converges weakly to a random limit β̃_1, and so

  √T(θ̂_2T - θ_0) = V A^{-1} (Σ̃_{1·2})^{1/2} (1/√T) Σ_{t=1}^{T} [ f̄_1(v_t,θ_0) - β̃_1 f̄_2(v_t,θ_0) ] + o_p(1).

It then follows that

  (Σ̃_{1·2})^{-1/2} A V' √T(θ̂_2T - θ_0) = (1/√T) Σ_{t=1}^{T} [ f̄_1(v_t,θ_0) - β̃_1 f̄_2(v_t,θ_0) ] + o_p(1)
   ⇒ MN( 0, Ω̄_11 - Ω̄_12 β̃_1' - β̃_1 Ω̄_21 + β̃_1 Ω̄_22 β̃_1' ).    (32)
Proof of Theorem 10. Parts (a) and (b). Instead of comparing the asymptotic variances of R√T(θ̂_1T - θ_0) and R√T(θ̂_2T - θ_0) directly, we equivalently compare the asymptotic variances of

  [R V A^{-1} Σ̃_{1·2} A^{-1} V' R']^{-1/2} R√T(θ̂_1T - θ_0) and [R V A^{-1} Σ̃_{1·2} A^{-1} V' R']^{-1/2} R√T(θ̂_2T - θ_0).

We can do so because R V A^{-1} Σ̃_{1·2} A^{-1} V' R' is nonsingular. Note that the latter two asymptotic variances are the same as those of the respective one-step estimator θ̂^R_1T and two-step estimator θ̂^R_2T of θ^R_0 in the following simple location model:

  y^R_{1t} = θ^R_0 + u^R_{1t} ∈ R^p,  y_{2t} = u_{2t} ∈ R^q,    (33)

where, with R̃ := R V A^{-1} (Σ̃_{1·2})^{1/2} (so that R̃R̃' = R V A^{-1} Σ̃_{1·2} A^{-1} V' R'),

  θ^R_0 = (R̃R̃')^{-1/2} R θ_0,  u^R_{1t} = (R̃R̃')^{-1/2} R̃ f̄_1(v_t, θ_0),  u_{2t} = f̄_2(v_t, θ_0).

It suffices to compare the asymptotic variances of θ̂^R_1T and θ̂^R_2T in the above location model. By construction, the variance of u^R_t := ((u^R_{1t})', u_{2t}')' is

  var(u^R_t) = [I_p, O; O, I_q] = I_{p+q}.

So the location model has exactly the same form as the model in Section 3, and we can invoke Proposition 3 to complete the proof. The long run variance of u^R_t is

  lrvar(u^R_t) = [Ω^R_11, Ω^R_12; Ω^R_21, Ω^R_22],

where

  Ω^R_11 = (R̃R̃')^{-1/2} R̃ Ω̄_11 R̃' [(R̃R̃')^{-1/2}]',
  Ω^R_12 = (R̃R̃')^{-1/2} R̃ Ω̄_12,
  Ω^R_22 = Ω̄_22.

The implied long run correlation matrix is then

  ρ^R = (Ω^R_11)^{-1/2} Ω^R_12 [(Ω^R_22)^{-1/2}]' = (R̃ Ω̄_11 R̃')^{-1/2} R̃ Ω̄_12 (Ω̄_22^{-1/2})'
      = [ (R̃ Ω̄_11 R̃')^{-1/2} R̃ Ω̄_11^{1/2} ] ρ̄,

where ρ̄ := Ω̄_11^{-1/2} Ω̄_12 (Ω̄_22^{-1/2})' is the long run correlation matrix of f̄(v_t, θ_0). Part (b) then follows from Proposition 3.

In the above calculation we have used a particular form of the square root (R̃ Ω̄_11 R̃')^{1/2}. If we employ a different matrix square root, then ρ^R changes to Q ρ^R for some p×p orthogonal matrix Q. This is innocuous as our result is rotation invariant.

In the special case when R = I_d, R̃ is a square matrix, and we can choose (R̃ Ω̄_11 R̃')^{1/2} = R̃ Ω̄_11^{1/2}, so that

  ρ^R = Ω̄_11^{-1/2} R̃^{-1} R̃ Ω̄_12 (Ω̄_22^{-1/2})' = Ω̄_11^{-1/2} Ω̄_12 (Ω̄_22^{-1/2})' = ρ̄.

So part (a) is a special case of part (b). Again, if (R̃ Ω̄_11 R̃')^{1/2} takes another form, then ρ^R = Q ρ̄ for some orthogonal matrix Q, and part (a) is still a special case of part (b) because of the rotational invariance.

Part (c). The local asymptotic power of the one-step test and the two-step test is the same as the local asymptotic power of the respective one-step and two-step tests in the location model given in (33). Part (c) then follows from Proposition 7 with

  ‖δ‖² = δ_0' [ R̃ Ω̄_11 R̃' ]^{-1} δ_0.    (34)

It remains to show that this is equal to the quantity defined in (21). Using (30) with M = Σ, we have M̃ = I_m and

  [G' Σ^{-1} G]^{-1} = V A^{-1} Σ̃_{1·2} A^{-1} V'.

Using (30) again, but with M^{-1} = Σ^{-1} Ω Σ^{-1}, we obtain

  M̃^{-1} = Σ̃_{1/2}' Σ̃^{-1} Ω̃ Σ̃^{-1} Σ̃_{1/2} = Σ̃_{1/2}^{-1} Ω̃ (Σ̃_{1/2}^{-1})' = Ω̄,

which implies that M̃_{1·2}^{-1} = Ω̄_11, so that

  G' Σ^{-1} Ω Σ^{-1} G = V A (Σ̃_{1·2}^{-1/2})' Ω̄_11 Σ̃_{1·2}^{-1/2} A V'.

Therefore,

  R [G'Σ^{-1}G]^{-1} G'Σ^{-1}ΩΣ^{-1}G [G'Σ^{-1}G]^{-1} R' = R V A^{-1} Σ̃_{1·2}^{1/2} Ω̄_11 (Σ̃_{1·2}^{1/2})' A^{-1} V' R' = R̃ Ω̄_11 R̃',

so ‖δ‖² in (34) is exactly the same as the expression given in (21).
Proof of Theorem 11. Part (a). Plugging M = W, the probability limit of the working weighting matrix, into (31) and using similar notation, we have

  √T(θ̂_aT - θ_0) = (1/√T) Σ_{t=1}^{T} [G' W^{-1} G]^{-1} G' W^{-1} f(v_t, θ_0) + o_p(1)
                 = V A^{-1} (Σ̃_{1·2})^{1/2} (1/√T) Σ_{t=1}^{T} [ f̄_1(v_t,θ_0) - β_a f̄_2(v_t,θ_0) ] + o_p(1),

where β_a := W̃_12 W̃_22^{-1}. So, with β_0 := Ω̄_12 Ω̄_22^{-1},

  (Σ̃_{1·2})^{-1/2} A V' √T(θ̂_aT - θ_0) = (1/√T) Σ [f̄_{1t} - β_0 f̄_{2t}] + (β_0 - β_a) (1/√T) Σ f̄_{2t} + o_p(1),
  (Σ̃_{1·2})^{-1/2} A V' √T(θ̂_2T - θ_0) = (1/√T) Σ [f̄_{1t} - β_0 f̄_{2t}] + (β_0 - β̃_1) (1/√T) Σ f̄_{2t} + o_p(1),    (35)

where in each expansion the first term has no long run correlation with the second term. It then follows that √T(θ̂_2T - θ_0) has a larger asymptotic variance than √T(θ̂_aT - θ_0) if

  E (β̃_1 - β_0) Ω̄_22 (β̃_1 - β_0)' > (β_a - β_0) Ω̄_22 (β_a - β_0)'.

By definition,

  β̃_1 = β_0 + Ω̄_{1·2}^{1/2} Δ̃_1 Ω̄_22^{-1/2} and β_a = β_0 + Ω̄_{1·2}^{1/2} Δ̃_a Ω̄_22^{-1/2}.

The inequality is then equivalent to

  E[ Ω̄_{1·2}^{1/2} Δ̃_1 Δ̃_1' (Ω̄_{1·2}^{1/2})' ] > Ω̄_{1·2}^{1/2} Δ̃_a Δ̃_a' (Ω̄_{1·2}^{1/2})',

which simplifies to E Δ̃_1 Δ̃_1' > Δ̃_a Δ̃_a', as desired.

Part (b). It follows from part (a) that

  asymvar[ R√T(θ̂_aT - θ_0) ] - asymvar[ R√T(θ̂_2T - θ_0) ]
   = E{ R V A^{-1} (Σ̃_{1·2})^{1/2} Ω̄_{1·2}^{1/2} ( Δ̃_a Δ̃_a' - Δ̃_1 Δ̃_1' ) (Ω̄_{1·2}^{1/2})' [(Σ̃_{1·2})^{1/2}]' A^{-1} V' R' }
   = R̃_a Δ̃_a Δ̃_a' R̃_a' - E[ R̃_a Δ̃_1 Δ̃_1' R̃_a' ],

where R̃_a := R V A^{-1} (Σ̃_{1·2})^{1/2} Ω̄_{1·2}^{1/2}. It is easy to see that (R̃_a R̃_a')^{-1/2} R̃_a Δ̃_1 =_d Δ̃_1(h, p, q). Hence asymvar[R√T(θ̂_2T - θ_0)] > asymvar[R√T(θ̂_aT - θ_0)] if

  E Δ̃_1(h,p,q) Δ̃_1(h,p,q)' > [ (R̃_a R̃_a')^{-1/2} R̃_a Δ̃_a ] [ (R̃_a R̃_a')^{-1/2} R̃_a Δ̃_a ]'.

Parts (c), (d), (e). These three parts are similar to Theorem 10(a), (b), and (c), respectively. We only give the proof of part (e) in some detail. It is easy to show that under the local alternative H_1: Rθ_0 = r + δ_0/√T, we have W_aT ⇒ W_1∞(‖V_a^{-1/2} δ_0‖²), where

  V_a = R [G'W^{-1}G]^{-1} G'W^{-1} Ω W^{-1} G [G'W^{-1}G]^{-1} R'
      = [R V A^{-1}(Σ̃_{1·2})^{1/2}] (I_d, -β_a) Ω̄ (I_d, -β_a)' [R V A^{-1}(Σ̃_{1·2})^{1/2}]'.    (36)

Similarly, we have W_2T ⇒ W_2∞(‖V_2^{-1/2} δ_0‖²), where

  V_2 = R [G' Ω^{-1} G]^{-1} R' = [R V A^{-1}(Σ̃_{1·2})^{1/2}] (I_d, -β_0) Ω̄ (I_d, -β_0)' [R V A^{-1}(Σ̃_{1·2})^{1/2}]',

which is the asymptotic variance of R√T(θ̃_2T - θ_0), with θ̃_2T being the infeasible optimal two-step GMM estimator.

The difference of the two matrices V_a and V_2 is

  V_a - V_2 = [R V A^{-1}(Σ̃_{1·2})^{1/2}] (β_a - β_0) Ω̄_22 (β_a - β_0)' [R V A^{-1}(Σ̃_{1·2})^{1/2}]'.

Define

  ρ_{a,R} := V_a^{-1/2} [R V A^{-1}(Σ̃_{1·2})^{1/2}] (Ω̄_12 - β_a Ω̄_22) Ω̄_22^{-1/2},

which is the long run correlation matrix between R V A^{-1}(Σ̃_{1·2})^{1/2} [f̄_1(v_t,θ_0) - β_a f̄_2(v_t,θ_0)] and f̄_2(v_t,θ_0). Then

  V_a^{-1/2} (V_a - V_2) [V_a^{-1/2}]' = ρ_{a,R} ρ_{a,R}' and V_a^{-1/2} V_2 [V_a^{-1/2}]' = I_p - ρ_{a,R} ρ_{a,R}',

and so

  ‖V_2^{-1/2} δ_0‖² - ‖V_a^{-1/2} δ_0‖² = δ_0' [V_a^{-1/2}]' { (I_p - ρ_{a,R} ρ_{a,R}')^{-1} - I_p } V_a^{-1/2} δ_0 ≥ 0.

So if ρ_{a,R} ρ_{a,R}' > f(δ²; h, p, q, Δ̃_a) I_p, then ‖V_2^{-1/2} δ_0‖² exceeds ‖V_a^{-1/2} δ_0‖² by enough to offset the heavier two-step critical values, and the test based on W_2T is asymptotically more powerful than that based on W_aT.
References
[1] Andrews, D. W. (1991): “Heteroskedasticity and autocorrelation consistent covariance matrix estimation.” Econometrica 59(3), 817–858.
[2] Bester, C. A., Conley, T. G., and Hansen, C. B. (2011): “Inference with dependent data
using cluster covariance estimators.” Journal of Econometrics 165(2), 137–151.
[3] Bester, C. A., Conley, T. G., Hansen, C. B., and Vogelsang, T. J. (2014): “Fixed-b asymptotics for spatially dependent robust nonparametric covariance matrix estimators.” Forthcoming, Econometric Theory.
[4] Goncalves, S. and Vogelsang, T. (2011): “Block Bootstrap HAC Robust Tests: The Sophistication of the Naive Bootstrap.” Econometric Theory 27(4), 745–791.
[5] Goncalves, S. (2011): "The moving blocks bootstrap for panel linear regression models with individual fixed effects." Econometric Theory 27(5), 1048–1082.
[6] Das Gupta, S. and Perlman, M. D. (1974): "On the power of the noncentral F-test: effect of additional variates on Hotelling's T² test." Journal of the American Statistical Association 69, 174–180.
[7] Hansen, L. P. (1982): “Large Sample Properties of Generalized Method of Moments Estimators.” Econometrica 50, 1029–1054.
[8] Ibragimov, R. and Müller, U. K. (2010): “t-Statistic Based Correlation and Heterogeneity
Robust Inference.” Journal of Business and Economic Statistics 28, 453–468.
[9] Jansson, M. (2004): “The error in rejection probability of simple autocorrelation robust
tests.” Econometrica 72(3), 937–946.
[10] Kiefer, N. M., Vogelsang, T. J. and Bunzel, H. (2002): “Simple robust testing of regression
hypotheses” Econometrica 68 (3), 695–714.
[11] Kiefer, N. M. and Vogelsang, T. J. (2002a): “Heteroskedasticity-autocorrelation robust testing using bandwidth equal to sample size.” Econometric Theory 18, 1350–1366.
[12] — — — — (2002b): “Heteroskedasticity-autocorrelation Robust Standard Errors Using the
Bartlett Kernel without Truncation.” Econometrica 70(5), 2093–2095.
[13] — — — — (2005): “A New Asymptotic Theory for Heteroskedasticity-Autocorrelation Robust
Tests.” Econometric Theory 21, 1130–1164.
[14] Kim, M. S. and Sun, Y. (2013): "Heteroskedasticity and spatiotemporal dependence robust inference for linear panel models with fixed effects." Journal of Econometrics 177(1), 85–108.
[15] Lehmann, E.L. and Romano, J. (2008): Testing Statistical Hypotheses, Springer: New York.
[16] Liang, K.-Y. and Zeger, S. (1986). Longitudinal data analysis using generalized linear models.
Biometrika 73(1):13–22.
[17] Marden, J. and Perlman, M. D. (1980): “Invariant tests for means with covariates.”Annals
of Statistics 8(1), 25–63.
[18] Müller, U. K. (2007): “A theory of robust long-run variance estimation.”Journal of Econometrics 141(2), 1331–1352.
[19] Muirhead, R. J. (2009): Aspects of Multivariate Statistical Theory. Vol. 197. Wiley.
[20] Newey, W. K. and West, K. D. (1987): "A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix." Econometrica 55, 703–708.
[21] Phillips, P. C. B. (2005). “HAC estimation by automated regression.” Econometric Theory
21(1), 116–142.
[22] Shao, X. (2010): "A Self-normalized Approach to Confidence Interval Construction in Time Series." Journal of the Royal Statistical Society B 72, 343–366.
[23] Sun, Y. (2011): “Robust Trend Inference with Series Variance Estimator and Testing-optimal
Smoothing Parameter.” Journal of Econometrics 164(2), 345–66.
[24] Sun, Y. (2013): “A Heteroskedasticity and Autocorrelation Robust F Test Using Orthonormal Series Variance Estimator”. Econometrics Journal 16(1), 1–26.
[25] Sun, Y. (2014a): “Let’s Fix It: Fixed-b Asymptotics versus Small-b Asymptotics in Heteroscedasticity and Autocorrelation Robust Inference.” Journal of Econometrics 178(3),
659–677.
[26] Sun, Y. (2014b): “Fixed-smoothing Asymptotics in a Two-step GMM Framework.” Econometrica 82(6), 2327–2370.
[27] Sun, Y., Phillips, P. C. B. and Jin, S. (2008). “Optimal Bandwidth Selection in
Heteroskedasticity-Autocorrelation Robust Testing.” Econometrica 76(1), 175–94.
[28] Sun, Y. and Phillips, P. C. B. (2009): “Bandwidth Choice for Interval Estimation in GMM
Regression.” Working paper, Department of Economics, UC San Diego.
[29] Vogelsang, T. J. (2012): "Heteroskedasticity, autocorrelation, and spatial correlation robust inference in linear panel models with fixed effects." Journal of Econometrics 166(2), 303–319.
[30] Zhang, X., and Shao, X. (2013). “Fixed-smoothing asymptotics for time series.” Annals of
Statistics 41(3), 1329–1349.