
Small Sample Properties of GMM and GEL Estimation and Inference
Procedures for Social Interaction Models
Bryan S. Graham*
(Initial Draft: April 17, 2005)
(This Draft: May 5, 2005)
Abstract
This paper studies the small sample properties of generalized method-of-moments (GMM) and
generalized empirical likelihood (GEL) estimators and testing procedures. For linear IV-type models both two-step GMM and GEL achieve first order efficiency by replacing an infeasible optimal
instrumental variable with a consistent estimate (c.f., Newey 1993). GEL, unlike two-step GMM,
uses a semiparametrically efficient estimate of this optimal instrument. As a result the first order
conditions for GEL are asymptotically mean zero, while those for two-step GMM are not. Many
discrepancies in the small sample performance of the two estimators and their associated inference
procedures are attributable to this difference. The paper uses the ‘linear-in-means’ model of social interactions as an organizing example. Conditional variance restrictions (moments) provide
one means of identifying the strength of social interactions in this model (Graham 2005). The
Monte Carlo designs, loosely calibrated to the peer effects and learning application pursued by Graham (2005), are also interpretable as being based upon linear simultaneous-equations models with
heteroscedastic and right-skewed errors. This is a useful setting for comparing the small sample
properties of GMM and GEL estimators. While first order asymptotically equivalent, for the designs
considered here, higher order expansions suggest non-trivial differences in asymptotic bias, both
between GMM and GEL, as well as among specific members of the GEL family (Newey and Smith
2004). The usefulness of higher order expansions for rationalizing differences in small sample bias
is directly evaluated. The coverage properties of alternative approaches to constructing confidence
intervals are also investigated, including two new approaches specific to GEL. GEL estimators pose sizable computational challenges. As a result, information
on their small sample properties is limited and their use in actual empirical research even more so.
This paper outlines and uses two alternative algorithms for computation.
JEL Classification: C12, C15, C31.
Key Words: Social Interactions, Excess Variance Contrasts, GMM, Generalized Empirical
Likelihood (GEL), Quasi-Maximum Likelihood (QML), Bias, Confidence Intervals, Monte Carlo.
* I would like to thank Gary Chamberlain for many extensive and very useful conversations. I have also found comments and suggestions from Caroline Hoxby, Dale Jorgenson, Larry Katz, and Jim Stock, as well as feedback from seminar participants at the University of California - San Diego, Harvard-MIT, and the Harvard Econometrics Lunch, to be very useful. Financial support was provided by the Program of Fellowships for Junior Scholars, the MacArthur Research Network on Social Interactions and Economic Inequality. All the usual disclaimers apply. Correspondence: Department of Economics, Harvard University, Littauer Building, Cambridge, MA 02138. e-mail: [email protected].
1 Introduction
It is now widely recognized that the small sample properties of the conventional two-step GMM
estimator and associated inference procedures can be extraordinarily poor in empirical settings
of substantive interest to economists (e.g., Tauchen 1986, Altonji and Segal 1996, Clark 1996,
Hansen, Heaton and Yaron 1996). Motivated by these concerns several alternative estimators and
testing procedures for statistical models defined by moment restrictions have been suggested. These
include the empirical likelihood (EL) estimator of Imbens (1997) and Qin and Lawless (1994), the
exponential tilting (ET) estimator of Kitamura and Stutzer (1997) and Imbens, Spady and Johnson
(1998), as well as the continuously updating (CU) estimator of Hansen, Heaton and Yaron (1996).
Newey and Smith (2004) show that these estimators are members of a common family — dubbed
generalized empirical likelihood (GEL) estimators — and characterize their higher-order properties.¹
Parallel work focuses on improving the coverage properties of confidence intervals for parameters
identified by moment conditions. Alternatives are typically linked to various members of the GEL
family of estimators. Examples include intervals formed by inverting the ‘S-statistic’ of Stock and
Wright (2000), the minimum-χ2 ‘tilting’ parameter statistic of Imbens and Spady (2002), the score
statistic of Guggenberger and Smith (2004) and Kleibergen (2005), and the 'many weak moments'
Wald statistic of Newey and Windmeijer (2005).
Despite a substantial amount of promising theoretical research, little is known about the small
sample properties of these estimators. An important reason for this lack of knowledge is that the
most convenient representation of GEL parameter estimates is as the solution to a saddle-point
(i.e., min-max) problem that is difficult to quickly and reliably solve (c.f., Mittelhammer, Judge
and Schoenberg 2005). Monte Carlo experiments are correspondingly costly, with the few small-scale studies currently available being quite limited in their coverage.² With a few exceptions, such
as Imbens (1997, 2002), existing Monte Carlo studies are also not calibrated to specific economic
data sets.3
The Monte Carlo experiments reported below directly respond to these concerns. The basic
design uses the ‘linear-in-means’ model of social interactions with identification arising from conditional variance restrictions (Graham 2005). The design, calibrated to the Project STAR peer
effects application of Graham (2004), is equivalent to one based on a linear simultaneous equations model with modest endogeneity as well as heteroscedastic and right-skewed first-stage and
structural errors.
This setting provides an interesting context in which to explore the small sample properties of
¹ Imbens (2002) provides an accessible, practitioner-oriented, introduction to this literature.
² In some Monte Carlo studies the outer portion of the saddle-point problem is solved by a costly grid-search to ensure numerical accuracy (e.g., Hansen, Heaton and Yaron 1996, Stock and Wright 2000, Guggenberger 2005a, 2005b). In others the saddle-point representation is replaced by an expensive, but apparently reliable, penalty-function approach (Imbens, Spady and Johnson 1998, Ramalho 2005). For computational reasons, Guggenberger and Hahn (2005) even study the properties of a two-step approximation to the empirical likelihood estimator.
³ Most Monte Carlo studies are based upon the linear simultaneous equations model with homoscedastic errors and a single endogenous regressor (e.g., Guggenberger 2005a, 2005b, Guggenberger and Hahn 2005) or simple AR(1) dynamic panel data models (e.g., Bond, Bowsher and Windmeijer 2001, Newey and Windmeijer 2005). Others are based on stylized Monte Carlo designs inherited from older studies on the finite sample properties of GMM (e.g., Imbens, Spady and Johnson 1998, Imbens and Spady 2002, Ramalho 2005).
GMM and GEL estimators and test statistics. Previous research indicates that two-step GMM performs very poorly when the identifying moments are derived from variance-covariance restrictions
and there is modest overidentification (Altonji and Segal 1996, Clark 1996, Imbens 1997, Horowitz
1998). Introducing endogeneity is unlikely to improve the situation. The higher order expansions
of Newey and Smith (2004) also suggest substantively important differences both between GMM
and GEL, as well as among specific GEL estimators in settings like those considered here.
Section 2 outlines the ‘linear-in-means’ model and quickly reviews the key identification result
of Graham (2004). The section also outlines the linear simultaneous-equations interpretation of the
model. Section 3 reviews the two-step GMM estimator and associated testing procedures. Section 4
defines the GEL estimator and reviews some of its main first order asymptotic properties. Section 5
discusses the connection between GEL and (semiparametrically) efficient estimates of expectations.
In the context of linear IV-type models, both two-step GMM and GEL achieve first order asymptotic
efficiency by replacing an infeasible optimal instrument with a consistent estimate. While GEL
uses an efficient estimate of this instrument, two-step GMM does not. Section 6 discusses how
this difference affects the higher order asymptotic properties of the two estimators and associated
testing procedures. Section 7 outlines and discusses issues of inference with GEL estimators.
Section 8 outlines the various Monte Carlo designs and reports simulation results. Newey and
Smith’s (2004) asymptotic bias formulae are complex functions of the conditional variance and
skewness of the moment vector. I calculate asymptotic bias analytically for each estimator and
Monte Carlo design and compare it to the corresponding, small sample, Monte Carlo estimate of
median bias. This provides a direct test of the usefulness of higher order asymptotic theory for
understanding the bias properties of GMM and GEL estimators. While exercises of this type have
been undertaken in the context of the linear simultaneous-equations model with homoscedastic and
symmetric errors (e.g., Hahn, Hausman and Kuersteiner 2004), there is no previous work of this
kind in the GMM/GEL context. Suggestions for applied researchers are given in Section 9.
Proofs and derivations are collected in Appendices A to D. Appendix E outlines the computational methods used in the Monte Carlo work. The approach builds on the Newton-Raphson and
quasi-Newton procedures suggested by Owen (2001) and Imbens (2002) respectively. I have found
my implementation of this method, which uses a novel estimate of the initial Hessian matrix, to be
fast and reliable in most settings. For weakly identified designs with few overidentifying restrictions,
the profiled saddle-point criterion function can be very flat in the parameter of interest, making it
difficult for standard quasi-Newton algorithms to maintain positive definiteness of the approximate
Hessian. For this setting I outline a computationally convenient GEL analog of Fisher-scoring.
Appendix F provides auxiliary details on the Monte Carlo experiments.
2 The linear-in-means model
In this paper I work with the reduced form of the 'linear-in-means' model of social interactions

y_ci = α_c + (γ − 1) ε̄_c + ε_ci,   (1)
where c = 1, …, N indexes social groups and i = 1, …, M_c individuals within groups. I assume the presence of group-size variation (i.e., Var(M_c) > 0). In particular, the marginal distribution of group sizes is multinomial with M_c ∈ M ≡ {m_1, …, m_S}, Pr(M_c = m_s) = p_s > 0 and Σ_{s=1}^S p_s = 1.
Equation (1) is the reduced form of a simple structural model of social interactions reviewed by Graham (2005). In equilibrium, individual outcomes depend on group-level variables, α_c, own characteristics, ε_ci, and mean peer group characteristics, ε̄_c. The parameter of interest is γ, the 'social multiplier'. Suppose (1) is a reduced form model of student achievement as in Graham (2005), with α_c equaling teacher quality, ε_ci student ability, and ε̄_c the average ability of students in the cth classroom. If a low ability student is replaced with a high ability student, average achievement in the classroom will increase for pure compositional reasons as well as because the high ability student raises peer quality, and hence positive learning spillovers, in the classroom. The ratio of the full equilibrium effect to the pure compositional effect equals γ. Since γ is a multiplier its null value is one.
I assume that E[y_ci|M_c] = E[α_c|M_c] = E[ε_ci|M_c] = 0⁴ and consider the identifying assumptions

E[ε_c ε_c′|M_c] = (σ²(M_c) − σ_εε) I_Mc + σ_εε ι_Mc ι_Mc′,
Var(α_c|M_c) = σ²_α,   (2)
Cov(α_c, ε_c|M_c) = σ_αε ι_Mc,

where ε_c = (ε_c1, …, ε_cMc)′ and I_Mc and ι_Mc denote an M_c × M_c identity matrix and an M_c × 1 vector of ones respectively. Under (2) we can assume, without loss of generality, that σ_εε = σ_αε = 0, which is done henceforth.⁵
The variance-covariance matrix of individual outcomes y_c = (y_c1, …, y_cMc)′, conditional on group size, is then

Var(y_c|M_c) = Ω(φ; M_c) = σ²(M_c) I_Mc + (σ²_α + (γ² − 1) σ²(M_c) M_c⁻¹) ι_Mc ι_Mc′.   (3)
The goal is to estimate and conduct inference on the social interactions parameter γ².

Observe that (3) follows a one-way error component structure; this is motivated by the combination of within-group exchangeability of individuals and conditional exchangeability of social groups of the same size. This suggests that the relevant unrestricted or reduced form model involves a total of 2S parameters corresponding to the two error components for each of the S group sizes. Identification comes from imposing restrictions on how these error components vary with group size.
Let φ = (σ²′, θ′)′ with σ² = (σ²_1, …, σ²_S)′ where σ²_s = σ²(m_s), θ = (σ²_α, γ²)′, and dim(φ) = dim(σ²) + dim(θ) = S + K. The parameter space is

φ ∈ Φ ≡ {(σ²′, σ²_α, γ²)′ : σ² ∈ ℝ^S_+, σ²_α ≥ 0, γ² > 0}.   (4)
⁴ This condition can be satisfied in practice by working with residuals from a preliminary nonparametric regression of y_ci on M_c. Since the support of M_c is discrete, no smoothing is required with a sufficiently large data set; one can simply work with deviations of y_ci from its group-size-specific mean.
⁵ For example, under (2) we can always redefine non-orthogonal ε_ci and α_c to equal ε*_ci = ε_ci − E*[ε_ci|α_c] and α*_c = α_c + E*[ε_ci|α_c], where E*[ε_ci|α_c] denotes the (minimum mean-square-error) linear predictor of ε_ci given α_c (c.f., Chamberlain 1984, p. 1249).
Identification requires that the conditional variance of y_c is given by (3) for some choice of φ.

Condition 1 For some φ_0 ∈ Φ, Var(y_c|M_c) = Ω(φ_0; M_c).

Condition 1 is a strong requirement. While the error-component structure of Var(y_c|M_c) is well motivated by exchangeability arguments, the assumptions that Var(α_c|M_c), Cov(α_c, ε_c|M_c) and Cov(ε_ci, ε_cj|M_c) (for i ≠ j) are all constant in group size are substantive. Graham (2004) discusses the plausibility of these assumptions in the context of a peer effects and learning application that uses data from Project STAR. In general their plausibility depends on the underlying technology of the phenomenon under consideration, as well as the process by which individuals are assigned to peers of a given 'quality' (ε̄_c) and to 'locations' (α_c).
Also required is:

Condition 2 For all φ ∈ Φ, φ ≠ φ_0,

E[‖Ω(φ_0; M_c) − Ω(φ; M_c)‖] = Σ_{s=1}^S p_s ‖Ω(φ_0; m_s) − Ω(φ; m_s)‖ > 0,

where expectations are taken with respect to the marginal distribution of M_c, which has probability mass p_s at each of the s = 1, …, S points in its support.
Condition 2 will hold as long as there is group-size variation (i.e., S ≥ 2). To see this consider the following transforms of the within- and between-group variation of the data:

g_c^w = M_c⁻¹(M_c − 1)⁻¹ Σ_{i=1}^{M_c} (y_ci − ȳ_c)²,   g_c^b = ȳ_c².   (5)
Conditioning on group size and taking expectations yields

E[g_c^w|M_c] = (r_c′σ²) M_c⁻¹,   E[g_c^b|M_c] = σ²_α + γ² (r_c′σ²) M_c⁻¹,   (6)

where r_c is an S × 1 vector with a 1 in the sth element if M_c = m_s and zeros elsewhere, such that r_c′σ² = σ²(M_c). Observe that E[g_c^w|M_c] equals the amount of between-group variance in outcomes (across groups of size M_c) that we would 'expect' to observe in the absence of both unobserved group-level heterogeneity (σ²_α = 0) and social interactions (γ² = 1).
Under (2) the ratio of differences

γ² = (E[g_c^b|M_c = m] − E[g_c^b|M_c = m′]) / (E[g_c^w|M_c = m] − E[g_c^w|M_c = m′]),   m, m′ ∈ M, m ≠ m′,   (7)
identifies γ². The strength of social interactions is estimated by the ratio of the actual difference in between-group variation in outcomes across groups of different sizes to the difference in variation we would 'expect' to observe in the absence of social interactions. In the absence of group-size variation the contributions of σ²_α and γ² (r_c′σ²) M_c⁻¹ to the between-group variance in the data are not separately identifiable. If group-size variation is modest, identification may be weak, in the sense that the denominator of (7) is small.
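As a concrete illustration, the ratio of differences in (7) can be checked by simulation. The sketch below is illustrative only: it borrows the normal DGP with σ²_s = 1 and γ = 2 (so γ² = 4) used in the Monte Carlo designs of Section 8, but the two group sizes, the value σ²_α = 0.04, and the sample sizes are arbitrary choices made here for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, sig_a2 = 2.0, 0.04          # gamma^2 = 4; sig_a2 is an arbitrary illustrative value

def simulate(m, n):
    """Draw n groups of size m from the reduced form (1); return (g^w, g^b)."""
    eps = rng.normal(size=(n, m))                  # individual characteristics, sigma^2(m) = 1
    alpha = rng.normal(0.0, np.sqrt(sig_a2), n)    # group-level heterogeneity
    y = alpha[:, None] + (gamma - 1.0) * eps.mean(axis=1, keepdims=True) + eps
    ybar = y.mean(axis=1)
    gw = ((y - ybar[:, None]) ** 2).sum(axis=1) / (m * (m - 1))   # within transform (5)
    return gw, ybar ** 2                                          # (g^w, g^b)

gw2, gb2 = simulate(2, 200_000)
gw40, gb40 = simulate(40, 200_000)

# sample analog of (7): contrast conditional means across the two group sizes
gamma2_hat = (gb2.mean() - gb40.mean()) / (gw2.mean() - gw40.mean())
print(gamma2_hat)   # close to gamma^2 = 4
```

With many groups the ratio of sample contrasts recovers γ² closely; shrinking the gap between the two group sizes shrinks the denominator and weakens identification, exactly as described above.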
From (6) we have the conditional moment restriction

E[u(g_c, θ_0)|M_c] = 0,   (8)

where u(g_c, θ) = g_c^b − σ²_α − γ² g_c^w with g_c = (g_c^w, g_c^b)′. In principle, (8) implies an infinite number of unconditional moment restrictions of the form E[f(M_c) u(g_c, θ_0)] = 0 for different functions of group size f(M_c). With M_c taking one of only a finite number of values, however, the number of unconditional moment restrictions implied by (8) is also finite (although it may be large in certain empirical applications). The full set of unconditional moments implied by (8) is thus

E[ψ(z_c, θ_0)] = E[r_c u(g_c, θ_0)] = 0,   (9)

where z_c = (r_c′, g_c′)′. I am interested in the properties of different estimators and associated inferential procedures based on (9). One direct implication of the discreteness of the group-size support is that an efficient estimator based on (9) attains the semiparametric efficiency bound for the estimation problem defined by (8) (c.f., Chamberlain 1987, 1992).
In order to interpret the Monte Carlo experiments reported below in the light of previous research on the small sample properties of linear simultaneous-equations estimators (e.g., Hahn, Hausman and Kuersteiner 2004), it is convenient to recast the excess variance model in that form. Let v_c^b = g_c^b − E[g_c^b|r_c], v_c^w = g_c^w − E[g_c^w|r_c] and Π = (σ²_1/m_1, …, σ²_S/m_S)′. Using this notation the excess variance social interactions model is

g_c^b = σ²_α + γ² g_c^w + u_c,   (10)
g_c^w = r_c′Π + v_c^w,   (11)

where u_c = v_c^b − γ² v_c^w. Since g_c^b and g_c^w are functions of the between- and within-group squares of the underlying data we would typically expect, in contrast to the 'classical' linear simultaneous-equations model, u_c and v_c^w to have skewed distributions in addition to being heteroscedastic with respect to M_c.
To develop some further intuition about the properties of the model it is useful to consider the special case where y_c|M_c is multivariate normal. In this setting the least squares regression of g_c^b on a constant and g_c^w is inconsistent, with a probability limit of

γ̂²_LS →p γ² · κ,   κ = Var(σ²(M_c) M_c⁻¹) / (E[2σ⁴(M_c) M_c⁻²(M_c − 1)⁻¹] + Var(σ²(M_c) M_c⁻¹)).   (12)

As highlighted by (12), the motivation for instrumenting g_c^w with group size is measurement error: g_c^w is a noisy measure of the true 'expected' between-group variance in outcomes, σ²(M_c) M_c⁻¹. The unconditional variance of g_c^w consists of two components. The first, equaling Var(σ²(M_c) M_c⁻¹), is the actual amount of variation in 'expected' between-group variance in the sample. The second, equaling E[Var(g_c^w|M_c)] = E[2σ⁴(M_c) M_c⁻²(M_c − 1)⁻¹], is measurement error. The attenuation bias in γ̂²_LS thus takes a standard signal-to-noise ratio form.
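For a given design the attenuation factor κ in (12) can be computed in closed form and checked against a simulated least squares slope. The group sizes, probabilities, and σ²_α used below are illustrative choices of mine, not one of the paper's twelve designs:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, sig_a2, sig2 = 2.0, 0.04, 1.0
ms, ps = np.array([2.0, 40.0]), np.array([0.5, 0.5])   # illustrative design

# analytic kappa from (12): signal = Var(sigma^2(M)/M), noise = E[Var(g^w | M)]
x = sig2 / ms
signal = ps @ x ** 2 - (ps @ x) ** 2
noise = ps @ (2.0 * sig2 ** 2 / (ms ** 2 * (ms - 1.0)))
kappa = signal / (signal + noise)

def draw(m, n):
    """Return (g^w, g^b) for n normal groups of size m."""
    eps = rng.normal(size=(n, int(m)))
    # group mean of y equals alpha + gamma * eps_bar under the reduced form (1)
    ybar = rng.normal(0.0, np.sqrt(sig_a2), n) + gamma * eps.mean(axis=1)
    gw = ((eps - eps.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / (m * (m - 1.0))
    return gw, ybar ** 2

pairs = [draw(m, 200_000) for m in ms]
gw = np.concatenate([p[0] for p in pairs])
gb = np.concatenate([p[1] for p in pairs])

slope = np.cov(gw, gb)[0, 1] / np.var(gw)   # least squares slope of g^b on g^w
print(slope, gamma ** 2 * kappa)            # the two should be close
```

The simulated slope settles near γ²·κ rather than γ², which is the attenuation that motivates instrumenting g_c^w with the group-size indicators r_c.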
The correlation between u_c and v_c^w, the disturbances of the structural (10) and first-stage (11) equations respectively, provides a measure of the degree of simultaneity. Under normality we have

ρ_uv = −[1 + E[(σ²_α + σ²(M_c) M_c⁻¹)²] / E[σ⁴(M_c) M_c⁻²(M_c − 1)⁻¹]]^{−1/2}.

The absolute level of simultaneity, |ρ_uv|, is decreasing in σ²_α, the amount of group-level heterogeneity, and is increasing in Var(g_c^w|M_c), the measurement error portion of the variation in g_c^w. For a given marginal distribution of group size, the degree of endogeneity can be calibrated by the choice of σ²_α.
Under normality the estimated first-stage coefficients, Π̂, will have a large sample distribution of

√N(Π̂ − Π) →d N(0, 2V_w),

where V_w equals

V_w = diag{σ⁴_1/(p_1 m²_1 (m_1 − 1)), …, σ⁴_S/(p_S m²_S (m_S − 1))}.

The concentration parameter, a unitless measure of instrument strength, therefore equals (c.f., Bound, Jaeger, and Baker 1995, Stock, Wright and Yogo 2002)

μ² = (N/2) Π′D′(D V_w D′)⁻¹ D Π,   (13)
where D equals the first S − 1 rows of the within-group operator. Note that (13) is the population Wald statistic⁶ corresponding to the null hypothesis that 'expected' between-group variance does not change with group size, i.e., non-identification (H_0 : Π_1 = … = Π_S). Define the 'first stage F-statistic' as the feasible Wald statistic divided by the number of excluded instruments (which in this case is S − 1).⁷ Staiger and Stock (1997) show that in large samples E[F − 1] ≃ μ²/(S − 1), suggesting that the F-statistic provides a useful measure of the strength of identification in the case of a single endogenous regressor.
The Monte Carlo results reported in Section 8 are based on a total of twelve designs corresponding to different degrees of overidentification and instrument strength. For all designs N = 300 and σ²_α is chosen to keep ρ_uv fixed at approximately −0.20; these values are consistent with the Project STAR peer group effects application pursued in Graham (2004). I consider designs with S = 2, 4, 7 and 12 group sizes, corresponding to zero, two, five and ten overidentifying restrictions. The distribution of group sizes is calibrated in each case to keep μ²/(S − 1) equal to 5 ('weak instruments'), 20 ('strong instruments') or 50 ('very strong instruments'). The data generating process (DGP) is (ε_c, α_c)|M_c ∼ MVN with a variance-covariance matrix given by (3), where σ²_α is chosen as described above, σ²_1 = … = σ²_S = 1 and γ = 2.
⁶ The population Wald statistic uses the population values of Π and V_w in place of estimates.
⁷ Many computer regression packages, such as Stata, refer to a Wald statistic normalized in this way as an 'F-statistic' (see http://www.stata.com/support/faqs/stat/wald.html).
3 Generalized method of moments
Let ψ_N(θ) = Σ_{c=1}^N ψ(z_c, θ)/N, the sample average of the identifying moments. A generalized method-of-moments (GMM) estimator chooses θ̂ such that a linear combination of ψ_N(θ) equals zero. That is, θ̂_GMM satisfies

W*_N · ψ_N(θ̂_GMM) = 0,   (14)

where W*_N is a K × S weight matrix that may depend on the data, such that W*_N →p W*_0. The asymptotic variance minimizing choice for the weight matrix is Γ_0′Ω_0⁻¹ (e.g., Hansen 1982, Chamberlain 1982, 1984), where

Γ_0 = E[∂ψ(z_c, θ_0)/∂θ′]   (S × K),   Ω_0 = E[ψ(z_c, θ_0) ψ(z_c, θ_0)′]   (S × S).
In practice, the Jacobian and second moment terms, Γ_0 and Ω_0 respectively, are replaced with consistent estimates. For example, Hansen's (1982) optimal two-step estimator can be interpreted as replacing Γ_0 and Ω_0 in (14) with

Γ_N(θ̂) = N⁻¹ Σ_{c=1}^N ∂ψ(z_c, θ̂)/∂θ′,   Ω_N(θ̄) = N⁻¹ Σ_{c=1}^N ψ(z_c, θ̄) ψ(z_c, θ̄)′,

where θ̄ is an inefficient preliminary estimate of θ based upon a non-stochastic weight matrix.
In empirical work θ̂ is typically defined as the minimizer of a quadratic form in ψ_N(θ):

θ̂ = arg min_{θ∈Θ} (1/2) ψ_N(θ)′ W_N⁻¹ ψ_N(θ),   (15)

where W_N →p W_0 is an S × S weight matrix. Inspecting the first order conditions associated with (15) indicates that this definition of θ̂ is equivalent to setting W*_N equal to Γ_N(θ̂)′W_N⁻¹ in the representation of θ̂ given by (14) above. The asymptotic variance minimizing choice of W_N is therefore any matrix which is consistent for Ω_0, the second moment matrix of ψ(z_c, θ_0).
In addition to Hansen's (1982) two-step estimator, the two-stage least squares (2SLS) estimator is frequently used in applied work when the identifying moments are linear in the parameters, as is the case in (9). Let x_c = (1, g_c^w)′ = −∂u(g_c, θ_0)/∂θ′; the 2SLS estimator uses the weight matrix

W*_N = [Σ_{c=1}^N x_c r_c′/N] × [Σ_{c=1}^N r_c r_c′/N]⁻¹.

2SLS is typically inefficient since the conditional variance of u(g_c, θ_0) is not constant in group size; Σ_{c=1}^N r_c r_c′/N →p E[r_c r_c′] is therefore not proportional to Ω_0.⁸ Despite its name, 2SLS is a 'one-step' GMM estimator since a preliminary estimate of θ is not required for weight matrix formation.
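Because the moments (9) are linear in θ, every GMM estimator here has a closed form: writing ψ_N(θ) = a − Bθ with a = Σ_c r_c g_c^b/N and B = Σ_c r_c x_c′/N (shorthand introduced for this sketch, not the paper's notation), the minimizer of (15) is θ̂ = (B′W_N⁻¹B)⁻¹B′W_N⁻¹a. A minimal numpy sketch of the one-step (2SLS) and two-step estimators on simulated data (illustrative design, not one of the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, sig_a2 = 2.0, 0.04
sizes = [2, 10, 40]                 # illustrative group sizes

# simulate (g^w, g^b, r) group by group under the normal DGP with sigma^2_s = 1
gw_l, gb_l, r_l = [], [], []
for s, m in enumerate(sizes):
    n = 30_000
    eps = rng.normal(size=(n, m))
    ybar = rng.normal(0.0, np.sqrt(sig_a2), n) + gamma * eps.mean(axis=1)
    gw_l.append(((eps - eps.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / (m * (m - 1)))
    gb_l.append(ybar ** 2)
    r = np.zeros((n, len(sizes))); r[:, s] = 1.0   # group-size indicators r_c
    r_l.append(r)
gw, gb, R = map(np.concatenate, (gw_l, gb_l, r_l))
N = gw.size

X = np.column_stack([np.ones(N), gw])   # x_c = (1, g^w_c)'
a, B = R.T @ gb / N, R.T @ X / N        # psi_N(theta) = a - B @ theta

def gmm(W):
    """Closed-form GMM: (B' W^-1 B)^-1 B' W^-1 a."""
    return np.linalg.solve(B.T @ np.linalg.solve(W, B), B.T @ np.linalg.solve(W, a))

theta1 = gmm(R.T @ R / N)                    # one-step (2SLS) weight matrix
u1 = gb - X @ theta1                         # preliminary residuals u(g_c, theta_bar)
Omega = (R * u1[:, None] ** 2).T @ R / N     # Omega_N(theta_bar), diagonal here
theta2 = gmm(Omega)                          # efficient two-step GMM
print(theta2)                                # approx (sigma_a^2, gamma^2) = (0.04, 4.0)
```

Because r_c is a one-hot indicator, Ω_N is diagonal with entries proportional to the group-size-specific mean squared residuals, which is what the two-step reweighting exploits.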
In large samples GMM estimators converge in distribution to

√N(θ̂ − θ_0) →d N(0, A_0⁻¹ B_0 A_0⁻¹),   A_0 = Γ_0′W_0⁻¹Γ_0,   B_0 = Γ_0′W_0⁻¹Ω_0W_0⁻¹Γ_0,   (16)

⁸ Homoscedasticity of u(g_c, θ_0) only holds for the special case where σ²(M_c) is proportional to group size, M_c (i.e., σ²(M_c) ≡ λ · M_c). However, identification breaks down in this case since E[g_c^w|M_c] = λ, which is constant in group size.
where W_0 is the probability limit of the (quadratic form) weight matrix used for estimation. If this matrix is consistent for Ω_0, then the variance expression simplifies to I(θ)⁻¹ = (Γ_0′Ω_0⁻¹Γ_0)⁻¹. Among estimators based on (9), I(θ)⁻¹ is the asymptotic efficiency bound (c.f., Hansen 1982, Chamberlain 1987). Because M_c has a discrete multinomial distribution, I(θ)⁻¹ is also the efficiency bound for the problem defined by the conditional moment restriction (8). For this reason all GMM estimators based on (9) which use a weight matrix W*_N such that W*_N →p W*_0 ≡ Γ_0′Ω_0⁻¹ attain the semiparametric efficiency bound for the problem defined by (8).
A direct application of Chamberlain's (1987) results yields a semiparametric efficiency bound of

{E[E[x_c|M_c]′ E[u_c²|M_c]⁻¹ E[x_c|M_c]]}⁻¹,

where u_c = u(g_c, θ_0). For ψ(z_c, θ_0) given by (9) it is straightforward to show that

Γ_0 = −(p_1 E[x_c|M_c = m_1], …, p_S E[x_c|M_c = m_S])′,   Ω_0 = diag{p_1 E[u_c²|M_c = m_1], …, p_S E[u_c²|M_c = m_S]}.   (17)

Multiplying out we have

(Γ_0′Ω_0⁻¹Γ_0)⁻¹ = [Σ_{s=1}^S p_s E[x_c|M_c = m_s]′ E[x_c|M_c = m_s] / E[u_c²|M_c = m_s]]⁻¹ = {E[E[x_c|M_c]′ E[u_c²|M_c]⁻¹ E[x_c|M_c]]}⁻¹.

Thus I(θ_0) also equals the Fisher information associated with the conditional moment problem.
While the first order asymptotic properties of two-step GMM are identical to those of the (infeasible) oracle estimator which uses Γ_0′Ω_0⁻¹ as the weight matrix, Newey and Smith (2004) show that its higher order properties are affected by weight matrix estimation. Altonji and Segal (1996) and Clark (1996), among others, provide Monte Carlo evidence consistent with this result. Using the same data to estimate the optimal weight matrix as well as θ can induce a correlation between the sampling error in W*_N and ψ_N(θ). For a fixed sample size, sampling error in W*_N = Γ_N(θ̂)′Ω_N(θ̄)⁻¹, and hence the bias of θ̂, will be increasing in the number of overidentifying restrictions since this increases the dimension of W*_N.

Thus although the two-step GMM estimator based on (9) attains the semiparametric efficiency bound, its finite sample properties may be very poor since implementation requires estimating a weight matrix with a potentially large number of unique elements. Applied researchers typically 'solve' this problem by either arbitrarily reducing the number of identifying moments (e.g., Tauchen 1986) or by using inefficient one-step procedures such as the equally weighted minimum distance (EWMD) procedure introduced by Abowd and Card (1989). In applied dynamic panel data analysis the 'one-step' 2SLS estimator described above is often used instead of efficient two-step GMM (c.f., Blundell and Bond 1998, Bond, Bowsher and Windmeijer 2002). All of these procedures involve losses of asymptotic efficiency.
3.1 Inference with GMM estimators
Consider hypotheses of the form

H_0 : c(θ_0) = 0,   (18)

with C_0 = ∂c(θ_0)/∂θ′. It will be useful to define the following notation. Let θ̂ and θ̃ denote the unrestricted and restricted estimates of θ respectively (i.e., θ̃ minimizes (15) subject to (18)). Furthermore, let θ̄ and θ̃̄ respectively denote consistent preliminary unrestricted and restricted estimates, where such estimates are required for efficient weight matrix estimation. Finally, for any function of θ, let F̂ = F(θ̂), etc.

For any choice of weight matrix, W*_N, we can conduct inference using either the Wald or score statistics:

W^M(θ̂) = N · c(θ̂)′(Ĉ Â⁻¹ B̂ Â⁻¹ Ĉ′)⁻¹ c(θ̂),   (19)

S^M(θ̃) = N · ψ_N(θ̃)′ W̃_N⁻¹ Γ̃_N Ã⁻¹ C̃′(C̃ Ã⁻¹ B̃ Ã⁻¹ C̃′)⁻¹ C̃ Ã⁻¹ Γ̃_N′ W̃_N⁻¹ ψ_N(θ̃),   (20)

where Â and B̂ are constructed from estimates of their constituent elements as given in (16). Both statistics are χ²_J random variables in large samples, where J = rank(C_0).⁹
For the efficient two-step GMM estimator simpler forms of both statistics are appropriate:

W(θ̂) = N · c(θ̂)′(Ĉ(Γ̂_N′Ω̂_N⁻¹Γ̂_N)⁻¹Ĉ′)⁻¹ c(θ̂),   (21)

S(θ̃) = N · ψ_N(θ̃)′ Ω̃_N⁻¹ Γ̃_N (Γ̃_N′Ω̃_N⁻¹Γ̃_N)⁻¹ Γ̃_N′ Ω̃_N⁻¹ ψ_N(θ̃).   (22)

The statistic given by (22) is different from the score statistic given by Newey and West (1987). It differs from their formulation in that the variance-covariance matrix of the score vector is estimated under the null, as opposed to the alternative. Intuitively, this difference should improve the size properties of the test by improving the precision of the weight matrix estimate (c.f., Bond and Windmeijer 2002). Regardless, the above formulation of the test statistic is more in the spirit of the score principle as applied in the maximum likelihood setting since it is based entirely on quantities computed under the null.
The minimized two-step GMM criterion function (15) is also a χ² random variable. This suggests an additional approach to inference. Let S_N(θ, θ̄) = (1/2) ψ_N(θ)′ Ω_N(θ̄)⁻¹ ψ_N(θ). The Sargan-Hansen difference or 'deviance' statistic

LR(θ̃, θ̂) = 2N[S_N(θ̃, θ̃̄) − S_N(θ̂, θ̄)],   (23)

is the GMM analog of a likelihood ratio. The weight matrices used to form S_N(θ̃, θ̃̄) and S_N(θ̂, θ̄) are computed under the null and alternative respectively.¹⁰
All of the above test statistics can be inverted to form confidence sets. Let θ_1 be the scalar parameter for which a confidence set is desired. Define θ̃(θ_1) = (θ_1, θ̃_2(θ_1)′)′ to be the constrained GMM estimate of θ holding θ_1 fixed at a given null value. A (1 − α) × 100 percent confidence set based on the Sargan-Hansen difference statistic is

CI_α^LR(θ_1) = {θ_1 | LR(θ̃(θ_1), θ̂) ≤ χ²_{1,1−α}},

where χ²_{1,1−α} is the 1 − α quantile of the χ²_1 distribution. Confidence sets based on other statistics are defined analogously.

⁹ Equations (19) and (20) follow from the standard theory of M-estimation. An excellent textbook exposition of this theory is given by Wooldridge (2002, Chapter 12).
¹⁰ Newey and West (1987) advocate the use of the same weight matrix in both quadratic forms, ensuring the statistic is positive in finite samples.
Newey and West (1987) show that for moment functions and hypotheses that are linear in θ the two-step GMM forms of W, S, and LR are numerically identical if the same estimate of Ω_0 is used throughout. For example, if Ω_0 is estimated without imposing the null restrictions, then all of the test statistics are identical to the standard Wald test. The three statistics as defined above, however, are distinct even in the linear case. This is because the Wald test computes Ω_0 under the alternative, the score test under the null, while the Sargan-Hansen difference test uses both a restricted and an unrestricted estimate of Ω_0. These differences may be practically important in finite samples, particularly when testing distant nulls, as is done implicitly when forming confidence sets.

Confidence sets constructed using the S and LR tests associated with the efficient GMM estimator need not be symmetric around θ̂. The LR statistic is also invariant to differentiable reparameterizations of the null hypothesis. Under weak identification, in the sense described by Stock and Wright (2000), confidence intervals constructed using the above test statistics are invalid, since the test statistics in this case have non-standard limiting distributions.
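Inverting the Sargan-Hansen difference statistic is straightforward here because, for each null value of γ², the constrained estimate of σ²_α has a closed form. The sketch below traces out the resulting confidence set by grid search; the simulated design, grid bounds, and the hard-coded χ²_{1,0.95} critical value 3.841 are illustrative choices, and the restricted and unrestricted criteria each use their own weight matrix, as in the text:

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, sig_a2 = 2.0, 0.04
sizes = [2, 10, 40]                 # illustrative design

gw_l, gb_l, r_l = [], [], []
for s, m in enumerate(sizes):
    n = 10_000
    eps = rng.normal(size=(n, m))
    ybar = rng.normal(0.0, np.sqrt(sig_a2), n) + gamma * eps.mean(axis=1)
    gw_l.append(((eps - eps.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / (m * (m - 1)))
    gb_l.append(ybar ** 2)
    r = np.zeros((n, len(sizes))); r[:, s] = 1.0
    r_l.append(r)
gw, gb, R = map(np.concatenate, (gw_l, gb_l, r_l))
N = gw.size
X = np.column_stack([np.ones(N), gw])
a, B = R.T @ gb / N, R.T @ X / N    # psi_N(theta) = a - B @ theta
b1, b2 = B[:, 0], B[:, 1]

def omega(theta):                   # Omega_N evaluated at theta
    u = gb - X @ theta
    return (R * u[:, None] ** 2).T @ R / N

def crit(psi, Om):                  # 2N * S_N = N * psi' Om^-1 psi
    return N * psi @ np.linalg.solve(Om, psi)

W1 = R.T @ R / N                    # preliminary (2SLS-type) weight matrix
th_bar = np.linalg.solve(B.T @ np.linalg.solve(W1, B), B.T @ np.linalg.solve(W1, a))
Om_hat = omega(th_bar)
th_hat = np.linalg.solve(B.T @ np.linalg.solve(Om_hat, B), B.T @ np.linalg.solve(Om_hat, a))
J_hat = crit(a - B @ th_hat, Om_hat)

def lr(g2):
    """Sargan-Hansen difference statistic for H0: gamma^2 = g2."""
    rhs = a - b2 * g2
    s_bar = (b1 @ np.linalg.solve(W1, rhs)) / (b1 @ np.linalg.solve(W1, b1))
    Om = omega(np.array([s_bar, g2]))                  # weight matrix under the null
    s_til = (b1 @ np.linalg.solve(Om, rhs)) / (b1 @ np.linalg.solve(Om, b1))
    return crit(rhs - b1 * s_til, Om) - J_hat

grid = np.linspace(2.0, 6.0, 81)
ci = grid[np.array([lr(t) for t in grid]) <= 3.841]    # chi^2_{1,0.95}
print(ci.min(), ci.max())                              # end points of the LR set
```

Because restricted and unrestricted weight matrices differ, lr can dip slightly below zero near γ̂², and the resulting set need not be symmetric around the point estimate, as noted above.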
4 Generalized empirical likelihood
Newey and Smith (2004) provide a masterful synthesis of the generalized empirical likelihood (GEL) literature; Imbens (2002) provides a user-oriented introduction. The exposition in this section emphasizes the characterization of $\hat{\theta}_{GEL}$ as the solution to a saddle-point (i.e., min-max) problem. This allows for the straightforward derivation of large sample properties using standard results on M-estimators (e.g., Newey and McFadden 1994, Wooldridge 2002, Chapter 12).
Let $\mathcal{V}$ be an open interval containing zero and $\Lambda_N(\theta) \overset{def}{=} \{\lambda : \lambda'\psi_c(\theta) \in \mathcal{V},\ c = 1,\ldots,N\}$. $\hat{\theta}_{GEL}$ is the solution to the following saddle-point problem

$\hat{\theta}_{GEL} \overset{def}{=} \arg\min_{\theta\in\Theta}\ \sup_{\lambda\in\Lambda_N(\theta)} l_N(\lambda,\theta),$   (24)

where $\Theta$ is a compact set of potential parameter values, $l_N(\lambda,\theta) = \sum_{c=1}^N \rho(\lambda'\psi_c(\theta))$ the unprofiled GEL criterion function and $\rho(v)$ a concave scalar function on $\mathcal{V}$, normalized such that $\rho_1(0) = \rho_2(0) = -1$ (with $\rho_j = \partial^j\rho/\partial v^j$). For the empirical likelihood (EL), exponential tilting (ET) and continuous updating (CU) estimators $\rho(v)$ equals $\ln(1-v)$, $-e^v$ and $-(1+v)^2/2$ respectively.
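The three carrier functions and their common normalization can be checked numerically; the sketch below (function names are mine) verifies $\rho_1(0) = \rho_2(0) = -1$ by finite differences, so the three criteria agree up to second order at $v = 0$:

```python
import numpy as np

# GEL carrier functions rho(v) for EL, ET and CU, as given in the text:
rho = {
    "EL": lambda v: np.log(1.0 - v),        # empirical likelihood
    "ET": lambda v: -np.exp(v),             # exponential tilting
    "CU": lambda v: -(1.0 + v) ** 2 / 2.0,  # continuous updating
}

def d1(f, v, h=1e-5):  # central first difference
    return (f(v + h) - f(v - h)) / (2 * h)

def d2(f, v, h=1e-4):  # central second difference
    return (f(v + h) - 2 * f(v) + f(v - h)) / h ** 2

# All three satisfy the normalization rho_1(0) = rho_2(0) = -1.
norms = {k: (d1(f, 0.0), d2(f, 0.0)) for k, f in rho.items()}
```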
By concentrating out the $S\times 1$ vector of 'tilting' parameters, $\lambda$, we get a profiled objective function of

$l_N^p(\theta) \overset{def}{=} \sup_{\lambda\in\Lambda_N(\theta)} l_N(\lambda,\theta) = l_N(\tilde{\lambda}_N(\theta),\theta),$   (25)

where $\tilde{\lambda}_N(\theta)$ denotes the maximizing value of the tilting parameter for a given sample size and value of $\theta$. $\hat{\theta}_{GEL}$ is the minimizer of (25).
Calculating $\hat{\theta}_{GEL}$ requires an outer minimization over $l_N(\tilde{\lambda}_N(\theta),\theta)$, with each candidate value of $\theta$ requiring an inner maximization solving for $\tilde{\lambda}_N(\theta)$. While unavailable in closed form (except for CU), $\tilde{\lambda}_N(\theta)$ is simple to calculate numerically. Imbens, Spady and Johnson (1998), Owen (2001), Mittelhammer, Judge and Schoenberg (2005) and Imbens (2002) discuss various algorithms. Two alternative methods of computing $\hat{\theta}_{GEL}$ that appear to work well in practice are outlined in Appendix E. The first is a specific implementation of the quasi-Newton procedure suggested by Imbens (2002) while the second is a GEL-analog of Fisher-scoring.
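As an illustration of the inner maximization, a bare-bones Newton solver for $\tilde{\lambda}_N(\theta)$ under exponential tilting might look as follows. This is a sketch assuming well-behaved moments, not the Appendix E procedures; because $l_N(\lambda,\theta)$ is globally concave in $\lambda$, undamped Newton steps from $\lambda = 0$ typically suffice:

```python
import numpy as np

def lambda_tilde_ET(psi, iters=50):
    """Inner GEL loop for exponential tilting (rho(v) = -exp(v)):
    maximize l(lambda) = sum_c rho(lambda' psi_c) by Newton steps.
    psi is the N x S matrix of moments at a candidate theta."""
    N, S = psi.shape
    lam = np.zeros(S)
    for _ in range(iters):
        w = np.exp(psi @ lam)                   # equals -rho_1(lambda' psi_c)
        grad = -(psi * w[:, None]).sum(axis=0)  # dl/dlambda
        hess = -(psi.T * w) @ psi               # d2l/dlambda2, negative definite
        lam = lam - np.linalg.solve(hess, grad)
    return lam

# Example: two over-identifying moments psi_c = (z_c, z_c^2 - 1)'.
rng = np.random.default_rng(1)
z = rng.normal(size=500)
psi = np.column_stack([z, z ** 2 - 1.0])
lam = lambda_tilde_ET(psi)
# First order condition: sum_c exp(lam' psi_c) psi_c = 0
foc = (np.exp(psi @ lam)[:, None] * psi).sum(axis=0)
```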
The large sample properties of $\hat{\theta}_{GEL}$ are well-known and summarized by the following Theorem (c.f., Imbens 1997, Qin and Lawless 1994, Newey and Smith 2004).

Theorem 1 Under correct specification (i.e., (9) or $E[\psi(z_c,\theta_0)] = 0$) and regularity conditions, $(\hat{\lambda},\hat{\theta}) \overset{p}{\to} (0,\theta_0)$ with a limiting distribution of

$\sqrt{N}\begin{pmatrix}\hat{\lambda}\\ \hat{\theta}-\theta_0\end{pmatrix} \overset{D}{\to} N\left(0,\begin{bmatrix}P_0 & 0\\ 0 & I(\theta_0)^{-1}\end{bmatrix}\right),$   (26)

where $I(\theta_0) = \Gamma_0'\Omega_0^{-1}\Gamma_0$ and $P_0 = \Omega_0^{-1} - \Omega_0^{-1}\Gamma_0 I(\theta_0)^{-1}\Gamma_0'\Omega_0^{-1}$.

Proof. See Appendix B.
A key implication of Theorem 1 is the first order equivalence of all GEL estimators with one
another as well as with two-step GMM.
It is also straightforward to characterize the asymptotic properties of $\hat{\theta}_{GEL}$ under misspecification by viewing GEL as an M-estimator. Under misspecification there is no $\theta_0 \in \Theta$ such that $E[\psi(z_c,\theta_0)] = 0$. In this setting $\hat{\theta}$ is no longer consistent but (generally)11 converges to the pseudo-true value $\theta^* = \arg\min E[\rho(\tilde{\lambda}_N(\theta)'\psi(z_c,\theta))]$ and, by standard large sample results for M-estimators (c.f., Newey and McFadden 1994, Wooldridge 2002, Chapter 12), continues to have a limiting normal distribution of

$\sqrt{N}(\hat{\theta}-\theta^*) \overset{D}{\to} N(0,\ A^p(\theta^*)^{-1}B^p(\theta^*)A^p(\theta^*)^{-1}),$   (27)

where

$A^p(\theta) = E[H^p(z_c,\theta)], \qquad B^p(\theta) = Var(s^p(z_c,\theta)),$

with $s^p(z_c,\theta)$ and $H^p(z_c,\theta)$ equaling the cth observation's contribution to the score and Hessian of the profiled GEL objective function (25) respectively. Analytic expressions for $s^p(z_c,\theta)$ and $H^p(z_c,\theta)$ are provided in Appendix A. As pointed out by Imbens (1997), under misspecification the limiting pseudo-true parameter value, $(\lambda^*,\theta^*)$, varies with the specific GEL estimator.
The limit distribution given by (26) is a special case of the general M-estimator result given by
(27), as the next Theorem demonstrates.
11 Schennach (2003) shows that the asymptotic variance of the EL member of the GEL family of estimators may become undefined in the presence of misspecification due to the presence of a singularity in its influence function (see also Ragusa 2005).
Theorem 2 Under correct specification an information matrix equality of the form
A (θ 0 ) = B (θ 0 ) = I (θ 0 )
(28)
holds.
Proof. See Appendix B.
Although its proof is simple and straightforward, to my knowledge Theorem 2 is a new result.12
One of its implications is that the 'observed information' matrix is consistent for $I(\theta_0)$. That is,

$\nabla_{\theta\theta} l_N^p(\hat{\theta})/N \overset{p}{\to} I(\theta_0).$   (29)

The Hessian is a direct function of the estimated parameter and realized sample, as well as an indirect function of them through their impact on the (profiled) tilting parameter $\tilde{\lambda}_N(\hat{\theta})$. If $\tilde{\lambda}_N(\hat{\theta})$ is instead replaced by its limiting (population) value of zero, it is straightforward to show that the inverse 'Hessian' simplifies to $(\hat{\Gamma}_N'\hat{\Omega}_N^{-1}\hat{\Gamma}_N)^{-1}/N$ where $\hat{\Gamma}_N = N^{-1}\sum_{c=1}^N \partial\hat{\psi}_c/\partial\theta'$ and $\hat{\Omega}_N = N^{-1}\sum_{c=1}^N \hat{\psi}_c\hat{\psi}_c'$ (c.f., equation (67) in Appendix A). This is a direct estimate of $I(\theta_0)^{-1}/N$ and is loosely akin to the conditional Fisher information estimator used in the parametric maximum likelihood setting, taking a form similar to that of a conventional GMM variance estimate.13
4.1 Minimum discrepancy formulation
An informative alternative characterization of $\hat{\theta}_{GEL}$ is as the solution to the following minimum discrepancy (MD) problem

$(\hat{\theta},\hat{\pi}) = \arg\min_{\theta\in\Theta,\,\pi_1,\ldots,\pi_N} \sum_{c=1}^N h(\pi_c), \quad s.t. \quad \sum_{c=1}^N \pi_c\psi_c(\theta) = 0, \quad \sum_{c=1}^N \pi_c = 1,$   (30)

where $\pi$ is an $N\times 1$ vector of probability weights. Newey and Smith (2004, Theorem 2.2, p. 224) show duality between $\hat{\theta}_{MD}$ and $\hat{\theta}_{GEL}$ when $h(\pi_c)$ is a member of the Cressie-Read family of discrepancies. The duality holds for $h(\pi_c) = \frac{(N\pi_c)^{\eta+1}-1}{\eta(\eta+1)}\frac{1}{N}$ and $\rho(v) = -\frac{(1+\eta v)^{(\eta+1)/\eta}}{\eta+1}$ (with expressions being interpreted as limits for $\eta = 0$ or $-1$).14
The minimum discrepancy formulation is important since it indicates that we can interpret GEL as simultaneously choosing a set of sample weights, $\hat{\pi}$, and a parameter value, $\hat{\theta}$, such that in the re-weighted sample the moment conditions are mean zero, as is assumed to be the case in the sampled population (i.e., $\sum_{c=1}^N \hat{\pi}_c\hat{\psi}_c = 0$). Subject to this restriction the weights are chosen to be as close as possible to $\iota_N/N = (1/N,\ldots,1/N)'$, the empirical measure of the data. Algorithmically,
12 Kleibergen (2005) and Newey and Windmeijer (2005) show that $Var(s^p(z_c,\theta_0)) = I(\theta_0)$ for the special case of the continuously updating (CU) estimator and the entire GEL family of estimators respectively. Kitamura, Tripathi and Ahn (2005, p. 1678), in the context of an extension of the EL estimator for conditional moment models, note that the Hessian of the (profiled) empirical likelihood criterion function, divided by the sample size, is consistent for $I(\theta_0)$.
13 The standard two-step GMM variance estimator for cross-section data typically replaces $\hat{\Omega}_N$ with $\bar{\Omega}_N$, where the bar indicates evaluation at a preliminary consistent, but typically inefficient, estimate of $\theta$.
14 In an interesting and wide-ranging working paper Ragusa (2005) shows that GEL-MD duality extends beyond the Cressie-Read family.
for a candidate value of $\theta$ a re-weighting of the data, $\tilde{\pi}(\theta)$, such that $\sum_{c=1}^N \tilde{\pi}_c(\theta)\psi_c(\theta) = 0$ is solved for; $\theta$ is then chosen to make the requisite re-weighting close to the empirical distribution. 'Closeness' is defined in terms of the empirical discrepancy measure, $h(\pi_c)$, used.
For EL and ET $h(\pi_c)$ equals $-\ln(\pi_c)$ and $\pi_c\ln(\pi_c)$ respectively. In both cases 'closeness' is defined in terms of the Kullback-Leibler information criterion (KLIC). For EL expectations are taken with respect to the empirical measure of the data such that $\hat{\theta}_{EL}$ is chosen to minimize

$K(\tilde{\pi}(\theta), \iota_N/N) = E_{F_N}\left[\ln\left\{\frac{dF_N(z)}{dF_\pi(z,\theta)}\right\}\right] = \sum_{c=1}^N \{\ln(1/N) - \ln(\tilde{\pi}_c(\theta))\}(1/N),$

subject to the restrictions in (30) and where

$F_N(z) = \sum_{c=1}^N 1(z_c\le z)/N, \qquad F_\pi(z,\theta) = \sum_{c=1}^N 1(z_c\le z)\tilde{\pi}_c(\theta),$   (31)

are the empirical distribution function and the fitted cumulative distribution function (CDF) of $z$ respectively. An important property of the fitted CDF is that $\int\psi(z,\theta)dF_\pi(z,\theta) = 0$. EL chooses $\theta$ such that, among feasible distributions (i.e., those for which the moments are mean zero), the KLIC-discrepancy between $\tilde{\pi}(\hat{\theta})$ and $\iota_N/N$ is smallest.
ET reverses the roles of the fitted and empirical measure of the data, with $\hat{\theta}_{ET}$ chosen to minimize

$K(\iota_N/N, \tilde{\pi}(\theta)) = E_{F_\pi}\left[\ln\left\{\frac{dF_\pi(z,\theta)}{dF_N(z)}\right\}\right] = \sum_{c=1}^N \{\ln(\tilde{\pi}_c(\theta)) - \ln(1/N)\}\tilde{\pi}_c(\theta).$

For CU $h(\pi_c) = (1/2)\times[(N\pi_c)^2 - 1]/N$; therefore $\hat{\theta}_{CU}$ minimizes the Euclidean distance between $\tilde{\pi}(\theta)$ and $\iota_N/N$.
To summarize, GEL chooses $\hat{\theta}$ to minimize the distance between the empirical distribution function $F_N(z)$ and the fitted CDF. This interpretation also makes clear what GEL does in the presence of misspecification. Under misspecification $F_\pi(z,\hat{\theta})$ converges to $F^*(z)$, which, among the set of distributions for which the moments are mean zero, is 'closest', in the sense of the metric, $h(\cdot)$, to the true distribution function (Imbens 1997, p. 367).15 Since the relevant metric varies across different GEL estimators, under misspecification the limiting pseudo-true value $\theta^*$ also varies with the choice of estimator.
For any given value of $\theta$ we can concentrate out the $N\times 1$ vector of probability weights, $\tilde{\pi}(\theta)$, leading to the saddle-point characterization of the GEL estimator given by (24). Newey and Smith (2004) show that the probability weights associated with (30) can be recovered directly from the saddle-point problem by

$\tilde{\pi}_c(\theta) = \frac{\rho_1(\tilde{\lambda}(\theta)'\psi_c(\theta))}{\sum_{c=1}^N \rho_1(\tilde{\lambda}(\theta)'\psi_c(\theta))}, \qquad c = 1,\ldots,N.$   (32)
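The mechanics of (30) and (32) can be illustrated in the CU case, where the tilting parameter has a closed form (equation (37) below). The moment functions here are illustrative stand-ins; the point is that the implied probabilities make the re-weighted moments exactly mean zero:

```python
import numpy as np

# Implied probabilities (32) for CU: rho_1(v) = -(1 + v) and the tilting
# parameter is lambda = -Omega_hat^{-1} psi_bar (see equation (37)).
rng = np.random.default_rng(2)
z = rng.lognormal(size=400)
# E z = e^{1/2} and E z^2 = e^2 for a standard lognormal, so both
# moments below are mean zero in the population.
psi = np.column_stack([z - np.e ** 0.5, z ** 2 - np.e ** 2.0])
N = psi.shape[0]
psi_bar = psi.mean(axis=0)
Omega = psi.T @ psi / N                # uncentered second moment of psi
lam = -np.linalg.solve(Omega, psi_bar)
rho1 = -(1.0 + psi @ lam)
pi = rho1 / rho1.sum()                # equation (32)

reweighted = pi @ psi                 # moments in the re-weighted sample
```

In the re-weighted sample the constraint $\sum_c \pi_c\psi_c(\theta) = 0$ of (30) holds exactly (up to floating point), while the weights stay as close to $1/N$ as the constraint allows.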
It is also straightforward to show that the $S\times 1$ vector of tilting parameters, $\hat{\lambda}$, in the saddle-point representation of the problem is proportional to the vector of Lagrange multipliers on the first constraint in the MD one (c.f., Imbens 2002, Newey and Smith 2004). Therefore $\hat{\lambda}$ has an intuitive interpretation as a measure of the tightness of the restriction that the moments be mean zero in the re-weighted sample.

15 This result is analogous to the behavior of maximum likelihood under misspecification (e.g., Gourieroux and Monfort 1993).
The following Theorem, due to Imbens (1997), shows that (31) is an efficient estimate of the CDF of $z$. As will be argued below, the characterization of the GEL probabilities as efficient estimates of the underlying distribution function of $z$ (efficient because they take into account the prior knowledge that $\psi(z_c,\theta_0)$ is mean zero) is important for understanding the attractive higher order bias properties of $\hat{\theta}_{GEL}$.
Theorem 3 Let $\omega_A = \Pr(z\in A)$. The estimate

$\hat{\omega}_A = \tilde{\omega}_A(\hat{\theta}) = \int 1(z\in A)\,dF_\pi(z,\hat{\theta}) = \sum_{c=1}^N 1(z_c\in A)\tilde{\pi}_c(\hat{\theta})$

is consistent for $\omega_A$ (i.e., $\hat{\omega}_A \overset{p}{\to} \omega_A$) and attains the semiparametric efficiency bound with a limiting distribution of

$\sqrt{N}(\hat{\omega}_A - \omega_A) \overset{D}{\to} N(0, V(\hat{\omega}_A)),$

where

$V(\hat{\omega}_A) = \omega_A(1-\omega_A) - \psi_A' P_0\psi_A,$

with $\psi_A = E[\psi(z_c,\theta_0)\times 1(z_c\in A)] = \omega_A\times E[\psi(z_c,\theta_0)\,|\,z_c\in A]$.

Proof. See Imbens (1997) and Appendix B.
The first term in $V(\hat{\omega}_A)$ is simply the limiting variance of the sample frequency of the event $z_c\in A$. The second term represents the asymptotic gain in precision from exploiting (9). Intuitively, efficiency of $F_\pi(z,\hat{\theta})$ follows directly from efficiency of $\hat{\theta}$ and because $\int\psi(z,\hat{\theta})dF_\pi(z,\hat{\theta}) = 0$, as is assumed to be true in the population from which the sample was drawn.
For what follows it is convenient to define the additional set of weights:

$\tilde{k}_c(\theta) = \frac{\tilde{k}^*(\tilde{\lambda}(\theta)'\psi_c(\theta))}{\sum_{c=1}^N \tilde{k}^*(\tilde{\lambda}(\theta)'\psi_c(\theta))}, \qquad c = 1,\ldots,N,$   (33)

where $\tilde{k}^*(v) = \frac{\rho_1(v)+1}{v}$. Note that for EL $\tilde{k}_c(\theta) = \tilde{\pi}_c(\theta)$, but this does not hold in general.
Using the two sets of GEL weights, as well as the empirical measure, we can consider three consistent estimators for the expectation of any function of $z$ and $\theta$. For $\mu_a \overset{def}{=} E[a(z_c,\theta_0)]$ we have: (1) $\hat{a}_N = \frac{1}{N}\sum_{c=1}^N a(z_c,\hat{\theta})$, (2) $\hat{a}_\pi = \sum_{c=1}^N \hat{\pi}_c\,a(z_c,\hat{\theta})$, and (3) $\hat{a}_k = \sum_{c=1}^N \hat{k}_c\,a(z_c,\hat{\theta})$. The next section demonstrates that, when all that is known is that $E[\psi(z_c,\theta_0)] = 0$, $\hat{a}_\pi$ is an efficient estimate of the mean of $a(z_c,\theta_0)$.
5 Efficient semiparametric estimation of expectations
Suppose we wish to estimate the mean of the scalar function $a(z_c,\theta_0)$.16 Although the distribution of $z_c$ is not parametrically specified, the problem is only semiparametric since prior information in the form of the moment restriction (9) is available. Brown and Newey (1998) consider this problem in some generality (c.f., Imbens and Lancaster (1994) for an interesting empirical application of related ideas).
Assume $\hat{\theta}$ is an efficient estimate of $\theta$; a common, obvious and consistent estimate of $\mu_a$ is the sample mean $\hat{a}_N = \sum_{c=1}^N a(z_c,\hat{\theta})/N$, but this fails to utilize any of the information contained in (9). Alternatively one could use an estimate of the (minimum mean-square error) linear predictor of $a_c$ given 1 and $\hat{\psi}_c'$ evaluated at $\hat{\psi}_c' = \hat{\psi}_N'$ (the sample mean of the moment vector):

$\hat{\mu}_a = \hat{b}_1 + \hat{b}_\psi'\hat{\psi}_N,$   (34)

where $\hat{b}_1$ and $\hat{b}_\psi$ are the least squares coefficients from the regression of $a_c$ onto 1 and $\hat{\psi}_c'$:

$\begin{pmatrix}\hat{b}_1\\ \hat{b}_\psi\end{pmatrix} = \begin{pmatrix}1 & \sum_{c=1}^N \hat{\psi}_c'/N\\ \sum_{c=1}^N \hat{\psi}_c/N & \sum_{c=1}^N \hat{\psi}_c\hat{\psi}_c'/N\end{pmatrix}^{-1}\begin{pmatrix}\sum_{c=1}^N \hat{a}_c/N\\ \sum_{c=1}^N \hat{\psi}_c\hat{a}_c/N\end{pmatrix} \overset{p}{\to} \begin{pmatrix}\mu_a\\ \Omega_0^{-1}\Sigma_{a\psi}'\end{pmatrix}$   (35)

with $\Sigma_{a\psi} = E[a_c\psi_c']$. This estimator, like $\hat{a}_N$, is consistent since $\hat{b}_1$ and $\hat{\psi}_N$ are consistent for $\mu_a$ and zero respectively. If $a_c$ and $\psi_c$ are correlated (i.e., $b_\psi = \Omega_0^{-1}\Sigma_{a\psi}' \ne 0$) we might also expect (34) to be variance-reducing relative to the sample mean estimator. However (34) is still inefficient since it fails to exploit the prior knowledge that $\psi_c$ is known to be mean zero in the population from which the sample was drawn. Replacing $\hat{\psi}_N$ with zero gives the efficient estimator $\hat{\mu}_a^* = \hat{b}_1$. By applying standard results on partitioned inverses to (35) we have $\hat{\mu}_a^*$ equal to

$\hat{\mu}_a^* = (1 - \hat{\psi}_N'\hat{\Omega}_N^{-1}\hat{\psi}_N)^{-1}\sum_{c=1}^N (\hat{a}_c - \hat{\Sigma}_{a\psi,N}\hat{\Omega}_N^{-1}\hat{\psi}_c)/N = \sum_{c=1}^N (a_c - \Sigma_{a\psi}\Omega_0^{-1}\psi_c)/N + o_p(N^{-1/2}).$   (36)
Inspection of (36) reveals that $\hat{\mu}_a^*$ is (asymptotically) uncorrelated with the sample average of the moments, $\hat{\psi}_N$. That $\hat{\mu}_a^*$ is an efficient estimator of $\mu_a$ (when the only prior information available is (9)) follows directly from this lack of correlation, since any such correlation could be used to construct an asymptotically more efficient estimator.
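A small numerical sketch of (34) through (36): the efficient estimator is just the intercept from a least squares fit of $a_c$ on a constant and the moments, and it coincides exactly with the closed form in the first line of (36). The data generating process below is illustrative, not the paper's design:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 300
z = rng.normal(size=N)
psi = np.column_stack([z, z ** 2 - 1.0])    # mean-zero moments
a = 1.0 + z + 0.5 * rng.normal(size=N)      # a_c correlated with psi_c; mu_a = 1

# mu_a^* = b_1: the fitted linear predictor evaluated at psi = 0.
X = np.column_stack([np.ones(N), psi])
b = np.linalg.lstsq(X, a, rcond=None)[0]
mu_star = b[0]

# Equivalent closed form, first line of (36), with uncentered moments:
psi_bar = psi.mean(axis=0)
Omega = psi.T @ psi / N
Saw = a @ psi / N                           # Sigma_hat_{a psi}
mu_36 = (a.mean() - Saw @ np.linalg.solve(Omega, psi_bar)) \
        / (1.0 - psi_bar @ np.linalg.solve(Omega, psi_bar))
```

The two expressions agree to machine precision, which is just the partitioned-inverse algebra referred to in the text.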
Theorem 4 $\hat{\mu}_a^*$ attains the semiparametric efficiency bound for estimating $\mu_a = E[a(z_c,\theta_0)]$ when all that is known about the distribution of $z_c$ is (9) (i.e., that $\psi(z_c,\theta)$ is mean zero for some $\theta = \theta_0$). We have $\hat{\mu}_a^* \overset{p}{\to} \mu_a$ with a limiting distribution of

$\sqrt{N}(\hat{\mu}_a^* - \mu_a) \overset{D}{\to} N(0, V(\hat{\mu}_a^*)),$

where

$V(\hat{\mu}_a^*) = \Sigma_{aa} - \Sigma_{a\psi}\Omega_0^{-1}\Sigma_{a\psi}' + \Sigma_{a\psi}\Omega_0^{-1}\Gamma_0 I(\theta_0)^{-1}\Gamma_0'\Omega_0^{-1}\Sigma_{a\psi}' = \Sigma_{aa} - \Sigma_{a\psi}P_0\Sigma_{a\psi}'.$

Proof. See Appendix B and also c.f. Brown and Newey (1998, Theorem 2, p. 456).

16 The generalization of what follows to matrix-valued functions is straightforward but comes at the expense of considerably more notation. Results for the general case are given in Theorem 6 below.
Each component of $V(\hat{\mu}_a^*)$ has an interesting interpretation. Assume that $\theta_0$ is known; in this case the first element of $V(\hat{\mu}_a^*)$ is the asymptotic variance of the sample mean estimator, $\hat{a}_N$. The second element represents the asymptotic gain in precision from utilizing the prior information that $\psi(z_c,\theta_0)$ is mean zero. In practice $\theta_0$ is not known; the last term represents the asymptotic penalty which arises from replacing $\theta_0$ with the efficient estimate $\hat{\theta}$.
To get some intuition for the efficiency result of Theorem 4 consider the alternative linear estimator $\hat{\mu}_a^+ = \hat{\mu}_a^* + \alpha\hat{\psi}_N$ for the special case of a single moment restriction (i.e., $\dim(\psi_c) = S = 1$) and $\theta_0$ known. The large sample variance of $\hat{\mu}_a^+$ is

$Var(\hat{\mu}_a^+) = \frac{\sigma_a^2(1-\rho_{a\psi}^2)}{N} + \frac{\alpha^2\sigma_\psi^2}{N},$

where $\rho_{a\psi}$ is the correlation between $a_c$ and $\psi_c$. $Var(\hat{\mu}_a^+)$ is minimized at $\alpha = 0$ or $\hat{\mu}_a^+ = \hat{\mu}_a^*$. Choosing $\alpha \ne 0$ introduces correlation between $\hat{\mu}_a^+$ and $\hat{\psi}_N$ which, when present, can be used to reduce variance.
Note that the asymptotic relative efficiency of $\hat{\mu}_a^*$ over $\hat{a}_N$ (the sample mean of $a_c$) is $1/(1-\rho_{a\psi}^2)$. The intuition is straightforward: when $a_c$ and $\psi_c$ are correlated, the prior knowledge that $\psi_c$ is mean zero is informative about $\mu_a$.
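This relative efficiency claim is easy to probe in a toy Monte Carlo (single known moment, $\theta_0$ known; the design is mine, not the paper's). With correlation 0.8 the variance ratio should sit near $1-\rho^2 = 0.36$:

```python
import numpy as np

rng = np.random.default_rng(4)
reps, N, r = 2000, 200, 0.8
mu_plain = np.empty(reps)
mu_eff = np.empty(reps)
for i in range(reps):
    z = rng.normal(size=N)                                  # psi_c = z_c, known mean zero
    a = r * z + np.sqrt(1.0 - r ** 2) * rng.normal(size=N)  # corr(a_c, psi_c) = r
    Saw = (a @ z) / N                                       # Sigma_hat_{a psi}
    Om = (z @ z) / N                                        # Omega_hat
    mu_plain[i] = a.mean()
    mu_eff[i] = a.mean() - (Saw / Om) * z.mean()            # second line of (36)
ratio = mu_eff.var() / mu_plain.var()                       # roughly 1 - r^2 = 0.36
```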
A straightforward but very important connection exists between Theorem 4 and the use of the GEL probabilities $\tilde{\pi}(\hat{\theta}) = \{\tilde{\pi}_1(\hat{\theta}),\ldots,\tilde{\pi}_N(\hat{\theta})\}'$ to estimate expectations. Manipulating (36) we can rewrite $\hat{\mu}_a^*$ as the weighted sum,

$\hat{\mu}_a^* = \int a(z,\hat{\theta})dF_\pi(z,\hat{\theta}_{CU}) = \sum_{c=1}^N \tilde{\pi}_c(\hat{\theta}_{CU})a(z_c,\hat{\theta}_{CU}), \qquad \tilde{\pi}_c(\theta) = \frac{1+\tilde{\lambda}_N(\theta)'\psi_c(\theta)}{\sum_{c=1}^N (1+\tilde{\lambda}_N(\theta)'\psi_c(\theta))},$   (37)

where $F_\pi(z,\hat{\theta}_{CU}) = \sum_{c=1}^N 1(z_c\le z)\tilde{\pi}_c(\hat{\theta}_{CU})$ is the estimate of the CDF of $z$ associated with the continuously updating (CU) estimator. In deriving (37) I have used the result that for $\rho(v) = -(1+v)^2/2$ the vector of tilting parameters, $\tilde{\lambda}_N(\hat{\theta})$, equals $-\hat{\Omega}_N^{-1}\hat{\psi}_N$ (see equation (77) in Appendix C).

Given the prior knowledge that $\psi_c$ is mean zero, the form of $\tilde{\pi}(\hat{\theta}_{CU})$ is intuitive. If $\hat{\psi}_N'\hat{\Omega}_N^{-1}\psi_c > 0$ then 'large' realizations of $\psi_c$ are over-represented in the sample and $\tilde{\pi}(\hat{\theta}_{CU})$ downweights such observations accordingly; if $\hat{\psi}_N'\hat{\Omega}_N^{-1}\psi_c < 0$ then 'large' realizations are under-represented and $\tilde{\pi}(\hat{\theta}_{CU})$ upweights such observations accordingly (c.f., Back and Brown 1993).

That $\hat{\mu}_a^* = \int a(z,\hat{\theta}_{CU})dF_\pi(z,\hat{\theta}_{CU})$ is efficient suggests that $F_\pi(z,\hat{\theta}_{CU})$ is an efficient estimate of the CDF of $z$, but this is precisely what Theorem 3 shows to be true. Since any concave $\rho(v)$ can be approximated by a quadratic function in the neighborhood of zero, replacing $\tilde{\pi}_c(\hat{\theta}_{CU})$ with the probability weights associated with any other GEL estimator also yields efficient estimators of the mean of $a(z_c,\theta_0)$ (see Theorem 6 below). Equations (36) and (37) show that estimating the mean of $a(z_c,\theta_0)$ using the GEL probability weights is asymptotically equivalent to using the linear predictor of $a_c$ given 1 and $\hat{\psi}_c'$ where $\hat{\psi}_c$ is evaluated at its population mean value of zero.
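Equation (37) can be verified numerically: with the closed-form CU tilting parameter, the probability-weighted mean reproduces the regression-intercept estimator $\hat{\mu}_a^* = \hat{b}_1$ exactly, not just asymptotically. The data below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 250
z = rng.normal(size=N)
psi = np.column_stack([z, z ** 2 - 1.0])        # mean-zero moments
a = 2.0 + z + rng.normal(size=N)

psi_bar = psi.mean(axis=0)
Omega = psi.T @ psi / N
lam = -np.linalg.solve(Omega, psi_bar)          # CU tilting parameter
w = 1.0 + psi @ lam
pi_cu = w / w.sum()                             # CU probabilities in (37)
mu_pi = pi_cu @ a                               # probability-weighted estimate

# Regression-intercept form of mu_a^* from (34)-(36):
b = np.linalg.lstsq(np.column_stack([np.ones(N), psi]), a, rcond=None)[0]
mu_star = b[0]
```

The exact agreement in finite samples is special to CU (its $\rho$ is quadratic); for EL or ET the equivalence holds up to $o_p(N^{-1/2})$, as Theorem 6 states.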
That $\hat{a}_\pi = \sum_{c=1}^N \tilde{\pi}_c(\hat{\theta})a(z_c,\hat{\theta})$ is an efficient estimate of $\mu_a$ and that efficient estimates are asymptotically uncorrelated with $\hat{\psi}_N$, the sample average of the moments, is key to understanding the attractive asymptotic bias properties of GEL estimators derived by Newey and Smith (2004).
6 Anticipating the small sample properties of $\hat{\theta}_{GMM}$ and $\hat{\theta}_{GEL}$
Intuition regarding the finite sample bias of GMM and GEL estimates of $\theta$ can be drawn from their respective first order conditions.

Theorem 5 The first order conditions associated with GMM and GEL are

$\frac{1}{N}\sum_{c=1}^N \tilde{h}_c u(g_c,\hat{\theta}) = 0,$   (38)

where $\tilde{h}_c$ is an estimate of the infeasible optimal instrument, $h_c = \Gamma_0'\Omega_0^{-1}r_c = -E[x_c|M_c]/E[u_c^2|M_c]$. For two-step GMM and GEL $\tilde{h}_c$ equals

$\tilde{h}_c^{GMM} = \hat{\Gamma}_N'\bar{\Omega}_N^{-1}r_c, \qquad \tilde{h}_c^{GEL} = \hat{\Gamma}_\pi'\hat{\Omega}_k^{-1}r_c,$   (39)

where $\bar{\Omega}_N$ is computed using the preliminary consistent, but typically inefficient, estimate $\bar{\theta}$.

Proof. See Appendix C.
Theorem 5 indicates that a key feature of GEL estimators, relative to the conventional two-step GMM estimator, is their use of the probability weights, $\tilde{\pi}_c(\hat{\theta})$ and $\tilde{k}_c(\hat{\theta})$, in the construction of the optimal instrumental variable. All GEL estimates use an efficient estimate of the Jacobian matrix in their construction of $\tilde{h}_c$ and EL also uses an efficient estimate of the variance of the moments.
It is useful to start by considering the asymptotic bias of two-step GMM. Theorem 5 implies that the bias of $\hat{\theta}_{GMM}$ equals

$\hat{\theta} - \theta_0 = \left(\sum_{c=1}^N \tilde{h}_c x_c'/N\right)^{-1}\left(\sum_{c=1}^N \tilde{h}_c u(g_c,\theta_0)/N\right).$   (40)
First assume that $x_c$ is replaced with its conditional expectation, $E[x_c|M_c]$; this eliminates small sample bias due to endogeneity (i.e., from sampling error in the Jacobian matrix). In this setting bias arises solely from the correlation between the heteroscedasticity-correcting denominator of the estimated optimal instrument and the residual. That is, from the correlation between

$\hat{E}[u_c^2|M_c] = \frac{\sum_{i=1}^N u_i^2\cdot 1(M_i = M_c)}{\sum_{i=1}^N 1(M_i = M_c)}$   (41)

and $u(g_c,\theta_0)$.
In group-size cells where some groups 'happen to have' unusually small or large average outcomes, $\bar{y}_c$, this heteroscedasticity-correcting weight will be small (since between-group variation in $y_{ci}$ will be large). In these cells $u_c$ will also tend to be positive. This implies that less weight will be placed on positive realizations of the sampling error than on negative. Since $u_c$ is mean zero, the finite-sample bias of $\hat{\theta}$ will be downward.17
The direction of bias due to correlation between sampling variability in the Jacobian and $u_c$ operates in the same direction but for different reasons. In group-size cells that include groups with 'unusually large' amounts of within-group outcome variation the numerator of the second element of $\tilde{h}_c$,

$\hat{E}[g_c^w|M_c] = \frac{\sum_{i=1}^N g_i^w\cdot 1(M_i = M_c)}{\sum_{i=1}^N 1(M_i = M_c)},$   (42)

will be large. In these same cells $u_c$ will tend to be negative. Again this implies that negative realizations of $u_c$ receive more weight in estimation, inducing an additional downward bias in $\hat{\theta}$.
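The downweighting mechanism behind (41) can be mimicked in a toy simulation (a stylized sketch with lognormal cells, not the paper's Monte Carlo design): weighting cell means by inverse estimated variances pulls the estimate below the truth when the data are right-skewed, because cells that happen to draw large outcomes also show large estimated variances, so positive sampling error is downweighted:

```python
import numpy as np

rng = np.random.default_rng(6)
reps, cells, n = 3000, 20, 5
true_mean = np.exp(0.5)                 # mean of a standard lognormal
wt_est = np.empty(reps)
plain_est = np.empty(reps)
for rep in range(reps):
    x = rng.lognormal(size=(cells, n))  # right-skewed outcomes in 'cells' groups
    xbar = x.mean(axis=1)
    w = 1.0 / x.var(axis=1, ddof=1)     # inverse *estimated* cell variances
    wt_est[rep] = (w * xbar).sum() / w.sum()
    plain_est[rep] = x.mean()
bias_wt = wt_est.mean() - true_mean     # systematically negative
bias_plain = plain_est.mean() - true_mean
```

This is the same mechanism Altonji and Segal (1996) document for optimal minimum distance estimation: the unweighted mean is essentially unbiased, while the 'efficiently' weighted one is biased downward.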
Examining the nature of the bias in $\hat{\theta}_{GMM}$ which arises from Jacobian variability clarifies how GEL estimators eliminate this source of bias asymptotically. For simplicity consider the case where $\Omega_0$ is known. Sampling error in the Jacobian term is therefore the main source of bias in $\hat{\theta}_{GMM}$. Evaluating (40) we have approximately

$E[\hat{\theta}_{GMM} - \theta_0] \simeq E\left[\left(\sum_{c=1}^N \tilde{h}_c x_c'/N\right)^{-1}\right]\times E\left[\sum_{c=1}^N \tilde{h}_c u_c/N\right]$
$\quad = E[(\hat{\Gamma}_N'\Omega_0^{-1}\hat{\Gamma}_N)^{-1}]\times E[\hat{\Gamma}_N'\Omega_0^{-1}\psi_N]$
$\quad = I(\theta_0)^{-1}\times E\left[\frac{1}{N}\sum_{c=1}^N x_c r_c'\Omega_0^{-1}\psi_N\right]$
$\quad = \frac{I(\theta_0)^{-1}}{N}\times E[x_c r_c'\Omega_0^{-1}r_c u_c]$
$\quad = \frac{I(\theta_0)^{-1}}{N}\times E\left[\frac{\Sigma_{xu}(M_c)}{p(M_c)\sigma_u^2(M_c)}\right]$
$\quad = \frac{I(\theta_0)^{-1}}{N}\times \sum_{s=1}^S \frac{\Sigma_{xu}(m_s)}{\sigma_u^2(m_s)},$   (43)

where $p(M_c)$ is the marginal probability of observing a group of size $M_c$ (i.e., $p(M_c) = p_s$ if $M_c = m_s$), $\Sigma_{xu}(M_c) = E[x_cu_c|M_c]$ is the $K\times 1$ vector of conditional covariances between $x_c$ and $u_c$ and $\sigma_u^2(M_c) = E[u_c^2|M_c]$ is the conditional variance of $u_c$. Note that (43) generally increases with $S$, the number of group sizes.18 Intuitively this is because increases in $S$ enlarge the dimension of, and hence variability in, the estimated Jacobian.
17 This argument directly mirrors that given by Altonji and Segal (1996) for the bias of the optimal minimum distance (OMD) estimator in covariance-restriction models.
18 Under homoscedasticity of $u_c$ (43) simplifies to $I(\theta_0)^{-1}S\Sigma_{xu}/(N\sigma_u^2)$. Under the additional assumption that the (implied) 'first-stage' errors, $v_c$, are homoscedastic (as is assumed in the conventional linear simultaneous equations model), and that $x_c$ is scalar, the bias expression further simplifies to

$E[\hat{\theta}_{GMM} - \theta_0] = \frac{\Sigma_{xu}}{\sigma_v^2}\,\frac{1}{\mu^2/S + 1},$
GEL estimators use alternative estimates of $\Gamma_0$ and $\Omega_0$ to construct $\tilde{h}_c$. Some simple algebra shows that the second element of $\tilde{h}_c^{GEL}$ equals

$\tilde{h}_c^{GEL} = -\,\frac{\frac{1}{p_s}\sum_{i=1}^N \hat{\pi}_i\cdot g_i^w\cdot 1(M_i = M_c)}{\frac{1}{p_s}\sum_{i=1}^N \hat{k}_i\cdot \hat{u}_i^2\cdot 1(M_i = M_c)}.$

Theorem 6 provides intuition on how the presence of the GEL weights ($\hat{\pi}_c$ and $\hat{k}_c$) in $\tilde{h}_c^{GEL}$ generates the attractive asymptotic bias properties of $\hat{\theta}_{GEL}$ found by Newey and Smith (2004, Theorem 4.2, p. 228).
Theorem 6 Let $\hat{a}_c = a(z_c,\hat{\theta})$ be a $K\times L$ matrix $a(z_c,\theta_0) = (a_1(z_c,\theta_0),\ldots,a_L(z_c,\theta_0))$ with $\sum_{c=1}^N \hat{a}_{cl}\hat{\psi}_c' \overset{p}{\to} \Sigma_{a\psi}^{(l)} = E[a_l(z_c,\theta_0)\psi(z_c,\theta_0)']$, the $K\times S$ covariance between the lth column of $a(z_c,\theta_0)$ and $\psi(z_c,\theta_0)$. Let $\Sigma_{a\psi} = (\Sigma_{a\psi}^{(1)},\ldots,\Sigma_{a\psi}^{(L)})$ be the $K\times LS$ matrix of the $L$ individual covariance matrices. For the GEL probability weights $\hat{\pi}_c$ and $\hat{k}_c$, given by (32) and (33) respectively, and the empirical weights $1/N$ we have the expansions

$\sum_{c=1}^N \hat{\pi}_c\hat{a}_c = N^{-1}\sum_{c=1}^N \left(a_c - \Sigma_{a\psi}(I_L\otimes\Omega_0^{-1}\psi_c)\right) + o_p(N^{-1/2}),$
$\sum_{c=1}^N \hat{k}_c\hat{a}_c = N^{-1}\sum_{c=1}^N \left(a_c + \tfrac{1}{2}\rho_3\Sigma_{a\psi}(I_L\otimes\Omega_0^{-1}\psi_c)\right) + o_p(N^{-1/2}),$
$\sum_{c=1}^N \hat{a}_c/N = N^{-1}\sum_{c=1}^N a_c + o_p(N^{-1/2}),$

and hence, to the order of the approximations, we have

$E[\hat{a}_\pi(I_L\otimes\psi_N')] = 0,$
$E[\hat{a}_k(I_L\otimes\psi_N')] = \frac{1}{N}\left(1 + \tfrac{1}{2}\rho_3\right)\Sigma_{a\psi},$
$E[\hat{a}_N(I_L\otimes\psi_N')] = \frac{1}{N}\Sigma_{a\psi}.$

Proof. See Appendix C.
Theorem 6 formalizes the claim made in Section 5 that using $\hat{\pi}$ to estimate expectations is semiparametrically efficient when all that is known is (9). Asymptotic uncorrelatedness between $\hat{a}_\pi$ and $\psi_N$ follows directly from efficiency since any such correlation, if present, could be used to construct an alternative estimate of the mean of $a(z_c,\theta_0)$ with lower variance.
A direct application of Theorem 6 demonstrates how GEL estimators, by using an efficient estimate of the Jacobian, asymptotically eliminate bias due to correlation of its sampling error with $\psi_N$. Let $d$ denote an $S^2\times 1$ vector with a total of $S$ ones appropriately placed such that $(I_S\otimes\psi_c')d = \psi_c$.19 Continuing to consider the case where $\Omega_0$ is known and evaluating (40) with
where $\mu^2$ is the 'concentration parameter' measuring instrument strength (c.f., Stock, Wright and Yogo 2002). This is precisely the standard expression for the finite sample bias of the 2SLS estimator (e.g., Hahn and Hausman 2002). Under the above conditions two-step GMM with $\Omega_0$ known is 2SLS.

19 That is, $d$ has a total of $S$ ones, located in its first element, its $(S+2)$th element, its $(2S+3)$th element and so on, with all remaining elements equal to zero.
$\tilde{h}_c = \hat{\Gamma}_\pi'\Omega_0^{-1}r_c$ we have:

$E[\hat{\theta}_{GEL} - \theta_0] \simeq I(\theta_0)^{-1}\times E\left[\sum_{c=1}^N \hat{\pi}_c x_c r_c'\Omega_0^{-1}\psi_N\right]$   (44)
$\quad = \frac{I(\theta_0)^{-1}}{N}\times E\left[(x_c r_c'\Omega_0^{-1} - \Sigma_{\Gamma\psi}(I_S\otimes\Omega_0^{-1}\psi_c))\psi_c\right]$
$\quad = \frac{I(\theta_0)^{-1}}{N}\times E\left[(x_c r_c'\Omega_0^{-1} - \Sigma_{\Gamma\psi}(I_S\otimes\Omega_0^{-1}\psi_c))(I_S\otimes\psi_c')d\right]$
$\quad = \frac{I(\theta_0)^{-1}}{N}\times E\left[x_c r_c'\Omega_0^{-1}(I_S\otimes\psi_c')d - \Sigma_{\Gamma\psi}(I_S\otimes\Omega_0^{-1}\psi_c\psi_c')d\right]$
$\quad = \frac{I(\theta_0)^{-1}}{N}\times \left(\Sigma_{\Gamma\psi}d - \Sigma_{\Gamma\psi}(I_S\otimes\Omega_0^{-1}E[\psi_c\psi_c'])d\right)$
$\quad = \frac{I(\theta_0)^{-1}}{N}\times \left(\Sigma_{\Gamma\psi}d - \Sigma_{\Gamma\psi}d\right) = 0,$

where the second line applies Theorem 6 with $a_c$ equal to the $K\times S$ matrix $x_c r_c'\Omega_0^{-1}$ and the fifth line uses the fact that $\Sigma_{a\psi} = E[a_c(I_S\otimes\psi_c')]$. The GEL probabilities asymptotically eliminate bias due to correlation between variability in the Jacobian matrix and the sample moments by 'subtracting off' that part of the variation in $\hat{\Gamma}_N$ which is correlated with $\psi_N$. Unlike two-step GMM, asymptotic bias due to Jacobian variability is zero and therefore not increasing in the number of group sizes, $S$.
Theorem 4.5 of Newey and Smith (2004) provides stochastic expansions for the GMM and GEL estimators that formalize the intuition driving the above examples. From their theorem we have $E[\hat{\theta} - \theta_0] = b(\theta_0)/N$ with $b(\theta_0)$ differing by estimator according to

$b^{ORACLE}(\theta_0) = B_O,$
$b^{GEL}(\theta_0) = B_O + \left(1 + \frac{\rho_3}{2}\right)B_\Omega,$
$b^{GMM}(\theta_0) = B_O + B_\Gamma + B_\Omega + B_W.$   (45)

Specializing Newey and Smith's (2004) expressions for $B_O$, $B_\Gamma$, $B_\Omega$, and $B_W$ to the 'linear-in-means' model outlined in Section 2 we have

$B_O = I(\theta_0)^{-1}E\left[h_c'h_c I(\theta_0)^{-1}\Sigma_{xu}(M_c)\right],$
$B_\Omega = I(\theta_0)^{-1}\sum_{s=1}^S \frac{E[x_c|m_s]\,\mu_{3u}(m_s)}{\sigma_u^2(m_s)\,[\sigma_u^2(m_s)]^{3/2}},$
$B_\Gamma = I(\theta_0)^{-1}\sum_{s=1}^S \frac{\Sigma_{xu}(m_s)}{\sigma_u^2(m_s)},$   (46)
$B_W = -2I(\theta_0)^{-1}E\left[h_c r_c'(H - H_W)'\Sigma_{xu}(M_c)\right],$

with $H = (\Gamma_0'\Omega_0^{-1}\Gamma_0)^{-1}\Gamma_0'\Omega_0^{-1}$, $H_W = (\Gamma_0'W_0\Gamma_0)^{-1}\Gamma_0'W_0$, $\mu_{3u}(M_c) = E[u_c^3|M_c]$ and the remaining terms as defined above.
As noted by Newey and Smith (2004) each element of $b(\theta_0)$ has an intuitive interpretation. $B_O/N$ equals the asymptotic bias of the infeasible oracle estimator which uses the true Jacobian and weight matrices to form the optimal instrument $h_c = \Gamma_0'\Omega_0^{-1}r_c$; this component of bias is typically small and not sensitive to the degree of overidentification. $B_\Gamma/N$ equals the bias arising
from the correlation between sampling error in the estimated Jacobian and ψ N . All GEL estimators,
because they use an efficient estimate of Γ0 when forming the optimal instrument, asymptotically
eliminate this source of bias. In weakly identified settings, in the sense of Stock and Wright (2000),
this source of bias can be substantial.
$B_\Omega/N$ equals the bias due to correlation between the sampling error in the estimated variance-covariance matrix of the moments (i.e., the weight matrix) and $\psi_N$. The EL member of the GEL family asymptotically eliminates bias due to weight matrix variability since it uses an efficient estimate of $\Omega_0$, as well as $\Gamma_0$, when forming $\tilde{h}_c$. Other GEL estimators are sensitive to this source of bias depending on the sign and magnitude of $\rho_3$. For CU and ET $\rho_3$ equals zero and minus one respectively, indicating that CU removes none of the bias due to weight matrix variability relative to two-step GMM, while ET removes about half of this source of bias.
The $B_\Omega/N$ bias term depends on the conditional skewness of $u(g_c,\theta)$. Since $\hat{\Omega}_N$ is a function of the sample second moments of $u_c$, in the absence of skewness it will be uncorrelated with $u_c$. For the linear-in-means model $u(g_c,\theta)$ is a function of the within- and between-group squares of the original data; it is therefore likely a right-skewed random variable. This suggests that the $B_\Omega/N$ bias component is negative in the current context.
BW /N is the bias associated with constructing the optimal (quadratic form) weight matrix in
two-step GMM using a preliminary consistent, but inefficient, estimate of θ. An additional iteration
of the two-step GMM estimator, such that the weight matrix is estimated using an efficient estimate
of θ, eliminates this source of bias.
Equation (45) suggests a set of patterns to look for in the Monte Carlo results reported in
Section 8 below. First, the small sample bias of GMM, CU, and ET should all be increasing in
the number of distinct group sizes, S. ET should be least sensitive to S, two-step GMM most
sensitive. The empirical likelihood (EL) member of the GEL family has the same asymptotic bias
as the infeasible oracle estimator. Its small sample bias should be unaffected by the magnitude
of S. Of course even higher order asymptotic approximations need not be a good guide to the
sampling properties of these estimators in specific empirical settings.
7 Inference with GEL estimators
As noted in the introduction, associated with GEL are an array of alternative test statistics. These include the score statistic of Kleibergen (2005) and Guggenberger and Smith (2004), the S-statistic (a GMM analog of the Anderson-Rubin statistic) proposed by Stock and Wright (2000), likelihood ratio type statistics as in Smith (1997), and the tilting parameter tests of Imbens, Spady and Johnson (1998) and Imbens and Spady (2002).
This section attempts to organize these various approaches to inference into a common framework. Virtually all of the 'new' test statistics suggested above are actually just members of the standard M-estimator versions of the classical 'trinity' of Wald, Score or Likelihood Ratio (i.e., criterion function) type statistics.

Most suggestions found in the literature can be categorized into one of two conceptual approaches. In the first, and by far the most common, approach direct hypotheses of the form $H_0: \theta = \theta_0$ are evaluated using the Wald, Score or Likelihood Ratio principle as applied to the profiled GEL criterion function (e.g., Qin and Lawless 1994, Kitamura and Stutzer 1997, Smith 1997, Kleibergen 2005, Guggenberger and Smith 2004).
In the second, less common, approach hypotheses concerning $\theta$ are tested indirectly by assessing the tilting parameter's proximity to zero, i.e. hypotheses take the form $H_0: \tilde{\lambda}(\theta_0) = 0$ (e.g., Corcoran, Davison and Spady 1995, Imbens 1997, Imbens, Spady and Johnson 1998, Imbens and Spady 2002). As with the first approach, indirect tests can be based on the Wald, Score, and Likelihood Ratio principles, although in practice the range of existing suggestions does not cover all these possibilities.
As with bias, the asymptotic size of some GEL tests often depends critically on their use of the GEL probabilities. For simplicity as well as to focus ideas, in this section I assume that $\Omega_N(\theta_0) \overset{p}{\to} \Omega_0$ sufficiently quickly such that the effects of weight matrix variability can be ignored.
7.1 The classical trinity
Score statistics. The role played by the GEL probabilities in assuring accurate inference is easiest to see when comparing Newey and West's (1987) score test for the two-step GMM estimator with the corresponding score test for GEL (e.g., Guggenberger and Smith 2004, Kleibergen 2005).
From Section 3, the score of the two-step GMM quadratic form criterion function, evaluated at $\theta_0$ and scaled by $1/\sqrt{N}$, is

$$s_N^{GMM}(\theta_0)/\sqrt{N} = -\sqrt{N}\,\Gamma_N(\theta_0)'\Omega_0^{-1}\psi_N(\theta_0) + o_p(1). \qquad (47)$$
Appendix D shows that the score of the profiled GEL criterion function can be written as

$$s_N^{p}(\theta_0)/\sqrt{N} = -\sqrt{N}\,\Gamma_\pi(\theta_0)'\Omega_0^{-1}\psi_N(\theta_0) + o_p(1). \qquad (48)$$
The main difference between the two scores is that $s_N^{p}(\theta_0)/\sqrt{N}$ is formed using an efficient estimate of the expected Jacobian of the moments. A sequence of arguments identical to those used to derive (43) and (44) above gives

$$E\big[s_N^{GMM}(\theta_0)/N\big] = \frac{1}{N}\sum_{s=1}^{S}\frac{\Sigma_{xu}(m_s)}{\sigma_u^2(m_s)}, \qquad E\big[s_N^{p}(\theta_0)/N\big] = 0. \qquad (49)$$
Because the GEL score is constructed using an efficient estimate of the Jacobian its (higher order)
asymptotic mean is zero. The two-step GMM score uses an inefficient estimate of the Jacobian. The
sampling variability in this estimate is correlated with ψN (θ 0 ), generating asymptotic bias in the
score. The score of the two-step GMM criterion function is not mean zero even when evaluated at
the true parameter, θ 0 . This suggests that in small samples, particularly with many overidentifying
restrictions, tests based on the two score statistics may have very different size properties.
The contrasts between the two statistics are even more dramatic when we consider their respective properties under weak identification. One way to model weak identification is to assume that the expectation of $\psi(z_c,\theta)$ is of order $1/\sqrt{N}$ uniformly in $\theta$ (e.g., Stock and Wright 2000). This captures the idea that the gradient of $E[\psi(z_c,\theta)]$ is flat in $\theta$ and hence its poor identification.
particular, assume that
¸
Γ∗
∂ψ(zc , θ 0 )
= Γ0N = 0ξ ,
E
0
N
∂θ
∙
with ‘strong identification’ ξ = 0 and Γ0N = Γ∗0 is a fixed full rank matrix. ‘Weak identification’
can be modelled by setting ξ = 1/2. In this case Γ0N will not be a full rank matrix in large samples,
leading to ‘vanishing’ identification. In order to make further progress we require the following
corollary to Theorem 6.
Corollary 1 Let $a(z_c,\theta_0)$ be a $K\times 1$ vector obeying a central limit theorem such that $\sqrt{N}\,(\bar a_N(\theta_0)-\mu_a)\xrightarrow{D}A$ with $A\sim N(0,\Sigma_{aa})$. A direct corollary of Theorem 6 is the asymptotic independence of $\psi_N$ and $a_\pi$, with

$$\sqrt{N}\begin{pmatrix}\psi_N(\theta_0)\\ a_\pi(\theta_0)-\mu_a\end{pmatrix}\xrightarrow{D}N\!\left(\begin{pmatrix}0\\0\end{pmatrix},\begin{pmatrix}\Omega_0 & 0\\ 0 & \Sigma_{aa}-\Sigma_{a\psi}\Omega_0^{-1}\Sigma_{a\psi}'\end{pmatrix}\right).$$
Following the general approach of Kleibergen (2005, Theorem 1)$^{20}$ we can apply a central limit theorem such that $N^{\xi}\Gamma_\pi$ converges in distribution to $G$, a normal random variable with mean $\Gamma_0^{*}$.$^{21}$ Note that in the case of strong identification $\xi=0$ and $N^{\xi}\Gamma_\pi$ converges in probability to the fixed matrix $\Gamma_0^{*}$. Let $P\sim N(0,\Omega_0)$ be the limiting distribution of $\sqrt{N}\,\psi_N$.
By Corollary 1, $G$ and $P$ are independent random variables; therefore

$$N^{\xi}\,\Gamma_\pi(\theta_0)'\Omega_N^{-1}\sqrt{N}\,\psi_N(\theta_0)\xrightarrow{D}G'\Omega_0^{-1}P.$$
Conditioning on a specific realization of $G$ and scaling by $(G'\Omega_0^{-1}G)^{-1/2}$ we have

$$(G'\Omega_0^{-1}G)^{-1/2}G'\Omega_0^{-1}P\,\big|\,G=g \sim N(0,I_K).$$
However, by virtue of the independence of $G$ and $P$, this result also holds regardless of the specific realization of $G$ (i.e., unconditionally). Therefore

$$(\Gamma_\pi'\Omega_0^{-1}\Gamma_\pi)^{-1/2}\Gamma_\pi'\Omega_0^{-1}\sqrt{N}\,\psi_N(\theta_0)\xrightarrow{D}N(0,I_K).$$
The final step in this argument breaks down if Γπ (θ 0 ) is replaced with the inefficient ΓN (θ 0 ). Since
variability in ΓN (θ0 ) will be correlated with ψ N across repeated samples the score’s large sample
(unconditional) distribution will be non-normal and depend on nuisance parameters (Guggenberger
and Smith 2004). Although the efficient estimate of the Jacobian is also very noisy in weakly
identified settings, across repeated samples its variability will be independent of ψN .
$^{20}$ See also Caner (2005, Theorem 5) and Guggenberger and Smith (2004, Theorem 4).
$^{21}$ In order to avoid the introduction of tedious notation involving the vec(·) operator, in what follows I assume that $K=\dim(\theta_0)=1$ and hence that $\Gamma_\pi$ is an $S\times 1$ vector.
It is also illuminating to consider the joint null

$$H_0:\ \lambda=0,\quad \theta=\theta_0. \qquad (50)$$

Note that (50) simultaneously considers the hypothesis that $\theta=\theta_0$ as well as the validity of the overidentifying moment restrictions at $\theta=\theta_0$.
It is straightforward to form GEL-based score tests of (50). The $(S+K)\times 1$ score of the unprofiled GEL criterion function under (50) equals $s_N(0,\theta_0)/\sqrt{N}=\sqrt{N}\,(\psi_N(\theta_0)',0)'$, with a variance of $\mathrm{diag}\{\Omega_0,0\}$ (see Appendices B and D). A feasible score statistic is therefore

$$S^{AR} = N\cdot\psi_N(\theta_0)'\Omega_\pi(\theta_0)^{-1}\psi_N(\theta_0), \qquad (51)$$

which is essentially the 'S-statistic' of Stock and Wright (2000).$^{22}$ That is, $S^{AR}$ is a GEL analog of the Anderson-Rubin statistic; it is a $\chi^2$ random variable with $S$ degrees of freedom under (50).

Forming a score test for the validity of the moments (i.e., that $\lambda=0$) conditional on a known value of $\theta_0$ yields an interesting type of Sargan-Hansen overidentification statistic. Appendix D shows that the appropriate score statistic in this case equals

$$S^{J} = N\cdot\psi_N(\theta_0)'P_\pi\psi_N(\theta_0) \qquad (52)$$
$$\quad\; = N\cdot\psi_N(\theta_0)'\Omega_\pi^{-1}\psi_N(\theta_0) - N\cdot\psi_N(\theta_0)'\Omega_\pi^{-1}\Gamma_\pi(\Gamma_\pi'\Omega_\pi^{-1}\Gamma_\pi)^{-1}\Gamma_\pi'\Omega_\pi^{-1}\psi_N(\theta_0)$$
$$\quad\; = S^{AR} - S^{K},$$

with $\Gamma_\pi=\Gamma_\pi(\theta_0)$ and similarly for $\Omega_\pi$. $S^{J}$ is a $\chi^2$ random variable with $S-K$ degrees of freedom. From (51) and (52) we see that $S^{AR}=S^{J}+S^{K}$, highlighting how $S^{AR}$ combines a score test of the null that $\theta=\theta_0$ with a Sargan-Hansen type test of the validity of the overidentifying moments at $\theta=\theta_0$.
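As a concrete illustration, the decomposition $S^{AR}=S^{J}+S^{K}$ can be computed directly from a sample of moment contributions. The sketch below is illustrative only: the function name and the simulated moments are hypothetical, and in practice $\Gamma_\pi$ and $\Omega_\pi$ would be the efficient GEL estimates described above.

```python
import numpy as np

def score_decomposition(psi, Gamma, Omega):
    """Compute S_AR and its decomposition S_AR = S_J + S_K.

    psi   : (N, S) moment contributions evaluated at theta_0
    Gamma : (S, K) estimate of the expected Jacobian
    Omega : (S, S) estimate of the variance of the moments
    """
    N = psi.shape[0]
    psi_bar = psi.mean(axis=0)              # psi_N(theta_0)
    Oinv = np.linalg.inv(Omega)
    S_AR = N * psi_bar @ Oinv @ psi_bar     # Anderson-Rubin type statistic
    A = Oinv @ Gamma                        # Omega^{-1} Gamma, shape (S, K)
    # S_K projects Omega^{-1/2} psi_bar onto the column space of Omega^{-1/2} Gamma.
    S_K = N * (psi_bar @ A) @ np.linalg.solve(Gamma.T @ A, A.T @ psi_bar)
    S_J = S_AR - S_K                        # overidentification component
    return S_AR, S_J, S_K
```

Under (50) the three pieces are asymptotically $\chi^2$ with $S$, $S-K$ and $K$ degrees of freedom respectively, and the identity $S^{AR}=S^{J}+S^{K}$ holds exactly in any sample.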
If instead of conditioning on $\theta=\theta_0$ we condition on $\hat\theta$, we have $\Gamma_\pi'\Omega_\pi^{-1}\psi_N(\hat\theta)\simeq 0$ by the GEL first order conditions of Theorem 5, and hence (52) simplifies to $N\cdot\psi_N(\hat\theta)'\hat\Omega_\pi^{-1}\psi_N(\hat\theta)$, a conventional Sargan-Hansen type statistic.
Since $S^{J}$ and $S^{K}$ are asymptotically independent at $\theta=\theta_0$, one can test (50) sequentially. First, test the validity of the overidentifying moment restrictions at $\theta=\theta_0$ using $S^{J}$; then, conditional on failing to reject these restrictions, test the null that $\theta=\theta_0$ using the score statistic $S^{K}$. This procedure, a version of which is suggested by Kleibergen (2005) for the continuously updating (CU) estimator, avoids 'spuriously' accepting the $\theta=\theta_0$ null when $\theta_0$ is in a region of the parameter space where the identifying moment restrictions are violated (e.g., near a maximum of the GEL criterion). The overall asymptotic size of this procedure equals $\alpha_J+\alpha_K-\alpha_J\alpha_K$, where $\alpha_J$ and $\alpha_K$ denote the sizes of the individual tests.
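The size of the sequential procedure follows from the independence of $S^{J}$ and $S^{K}$ under the joint null: the overall rejection rate is $1-(1-\alpha_J)(1-\alpha_K)$. A quick simulation confirms this; the degrees of freedom and the hard-coded 5 percent $\chi^2$ critical values below are illustrative choices, not taken from the paper's designs.

```python
import numpy as np

rng = np.random.default_rng(1)
B = 200_000
crit_J = 5.991  # 5% critical value of chi2 with 2 df (here S - K = 2)
crit_K = 3.841  # 5% critical value of chi2 with 1 df (here K = 1)

s_J = rng.chisquare(2, size=B)  # independent draws under the joint null (50)
s_K = rng.chisquare(1, size=B)

# Reject if the overidentification test rejects, or, failing that, if the
# score test of theta = theta_0 rejects; this is the union of the two events.
reject = (s_J > crit_J) | (s_K > crit_K)
overall = reject.mean()
predicted = 1 - (1 - 0.05) * (1 - 0.05)  # = 0.0975
print(overall, predicted)  # the simulated rate is close to 0.0975
```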
In the Monte Carlo experiments I consider the small sample properties of two versions of the score statistic. The first, which exploits the information equality, is

$$S(\tilde\theta) = s_N^{p}(\tilde\theta)'(\tilde\Gamma_\pi'\tilde\Omega_\pi^{-1}\tilde\Gamma_\pi)^{-1}s_N^{p}(\tilde\theta)/N, \qquad (53)$$

where $s_N^{p}(\tilde\theta)=\nabla_\theta l_N^{p}(\tilde\theta)$ is the score of the profiled GEL criterion function evaluated at $\tilde\theta$ (c.f., equation (65) in Appendix A). Recall that $\tilde\theta$ denotes the GEL point estimates computed under the null hypothesis (18). One could also replace $(\tilde\Gamma_\pi'\tilde\Omega_\pi^{-1}\tilde\Gamma_\pi)^{-1}/N$ in (53) with $H_N^{p}(\tilde\theta)^{-1}$.

$^{22}$ Their statistic replaces $\Omega_\pi(\theta_0)$ with $\Omega_N(\theta_0)$.
Guggenberger and Smith (2004) consider an alternative variant of the score statistic given by

$$S^{GS}(\tilde\theta) = N\cdot\tilde\lambda(\tilde\theta)'\tilde\Gamma_\pi(\tilde\Gamma_\pi'\tilde\Omega_\pi^{-1}\tilde\Gamma_\pi)^{-1}\tilde\Gamma_\pi'\tilde\lambda(\tilde\theta), \qquad (54)$$

while Kleibergen (2005) and Newey and Windmeijer (2005) suggest the variant

$$S^{K}(\tilde\theta) = N\cdot\tilde\psi_N'\tilde\Omega_\pi^{-1}\tilde\Gamma_\pi(\tilde\Gamma_\pi'\tilde\Omega_\pi^{-1}\tilde\Gamma_\pi)^{-1}\tilde\Gamma_\pi'\tilde\Omega_\pi^{-1}\tilde\psi_N, \qquad (55)$$

which is similar in form to the Newey and West (1987) score statistic for two-step GMM. The latter two variants, (54) and (55), replace $s_N^{p}(\tilde\theta)/\sqrt{N}$, the actual (scaled) score of the profiled GEL criterion function, with $o_p(1)$ approximations. This suggests that the variant given by (53) is preferable, and this is the form used in the Monte Carlo experiments.$^{23}$

The second version of the score statistic examined in the Monte Carlo experiments follows directly from standard M-estimator results. Making no use of the information matrix equality suggests the robust statistic

$$S^{M}(\tilde\theta) = s_N^{p}(\tilde\theta)'\tilde A_p^{-1}C'(C\tilde A_p^{-1}\tilde B_p\tilde A_p^{-1}C')^{-1}C\tilde A_p^{-1}s_N^{p}(\tilde\theta)/N, \qquad (56)$$

where $\tilde A_p$ is estimated by $H_N^{p}(\tilde\theta)/N$ and $\tilde B_p$ by the outer-product-gradient estimator $\sum_{c=1}^{N}\tilde s_c^{p}\tilde s_c^{p\prime}/N$. The $S^{M}$ score statistic has not been suggested in the context of GEL previously.
The large sample distribution of the score statistic is $\chi_J^2$ under the maintained null (recalling that $\mathrm{rank}(C)=J$). The score statistic retains its good properties under weak identification when some elements of $\theta$ are replaced with constrained GEL estimates, as long as the $K-J$ parameters (or 'combinations' of parameters) that are unconstrained by (18) are strongly identified (Guggenberger and Smith 2004, Kleibergen 2005).
Likelihood ratio statistics. It is also possible to base inference directly on the profiled GEL criterion function, $l_N^{p}(\theta)$. Smith (1997) shows that in large samples the minimized (and normalized) GEL criterion function is a $\chi^2$ random variable:

$$LR^{J}(\hat\theta) = 2\sum_{c=1}^{N}\big[\rho(\tilde\lambda(\hat\theta)'\psi_c(\hat\theta))-\rho_0\big] \xrightarrow{D} \chi^2_{S-K},$$

where $S-K$ equals the degree of overidentification. $LR^{J}(\hat\theta)$ measures the extent to which the data need to be re-weighted in order for the moment restrictions to be satisfied. It can form the basis of an overidentification test.
$^{23}$ If $\tilde\Omega_\pi$ is replaced by $\tilde\Omega_N$ then the three statistics are identical for the continuously updating (CU) estimator (Guggenberger and Smith 2004).
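To make the criterion-function statistics concrete, the sketch below implements the inner GEL loop for the exponential tilting case, $\rho(v)=-e^{v}$, and the normalized criterion $2\sum_c[\rho(\lambda'\psi_c)-\rho_0]$. It is a minimal illustration under simulated moments, with hypothetical function names, not the estimation code used in the paper.

```python
import numpy as np

def et_lambda(psi, tol=1e-10, max_iter=50):
    """Inner GEL loop for exponential tilting, rho(v) = -exp(v):
    maximizes sum_c rho(lambda' psi_c) over lambda by Newton's method."""
    lam = np.zeros(psi.shape[1])
    for _ in range(max_iter):
        w = np.exp(psi @ lam)                 # exp(lambda' psi_c), c = 1..N
        grad = w @ psi                        # minus the gradient of the criterion
        hess = psi.T @ (w[:, None] * psi)     # minus the Hessian; positive definite
        step = np.linalg.solve(hess, grad)
        lam -= step
        if np.max(np.abs(step)) < tol:
            break
    return lam

def gel_criterion(psi):
    """Normalized ET criterion 2 sum_c [rho(lam' psi_c) - rho(0)], rho(0) = -1.
    With psi held at a fixed theta_0 it behaves like a chi2(S) variable."""
    lam = et_lambda(psi)
    return 2.0 * np.sum(1.0 - np.exp(psi @ lam))
```

At the inner optimum the tilted first order condition $\sum_c e^{\lambda'\psi_c}\psi_c=0$ holds, and the normalized criterion is nonnegative by construction, since $\lambda=0$ yields a value of zero.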
Hypotheses like (18) can be tested using

$$LR(\hat\theta,\tilde\theta) = 2\sum_{c=1}^{N}\big[\rho(\tilde\lambda(\tilde\theta)'\psi_c(\tilde\theta))-\rho(\tilde\lambda(\hat\theta)'\psi_c(\hat\theta))\big].$$
The LR statistic is the EL analog of a log-likelihood ratio in a fully parametric model. In the
absence of any moment restrictions the empirical distribution of the data places probability 1/N
on each point. Overidentification necessitates a re-weighting of the data in order to ensure that
the moment conditions are satisfied. As more and more restrictions are placed on the data, the
weights will diverge from the empirical measure in a finite sample. The LR statistic measures
the incremental effect of imposing the null hypothesis (18) on this divergence. The large sample
distribution of LR is χ2J under the maintained null.
The LR statistic does not have a standard limiting distribution under weak identification (Guggenberger and Smith 2004). This is because $\hat\theta$ is inconsistent in this setting. However, the following Anderson-Rubin type statistic

$$LR^{AR}(\tilde\theta) = 2\sum_{c=1}^{N}\big[\rho(\tilde\lambda(\tilde\theta)'\psi_c(\tilde\theta))-\rho_0\big],$$

is a $\chi^2$ random variable in large samples, regardless of the strength of identification, with degrees of freedom equal to $S-(K-J)$ (again assuming that the $K-J$ parameters unconstrained by (18) are strongly identified). Note that $LR^{AR}(\tilde\theta)=LR^{J}(\hat\theta)+LR(\hat\theta,\tilde\theta)$.
Wald statistics. The Monte Carlo experiments conducted below consider four variants of the Wald statistic. The first variant, analogous to the Wald statistic associated with two-step GMM, is based on a direct estimate of $I(\theta_0)^{-1}$:

$$W(\hat\theta) = N\cdot c(\hat\theta)'\big[\hat C(\hat\Gamma_\pi'\hat\Omega_\pi^{-1}\hat\Gamma_\pi)^{-1}\hat C'\big]^{-1}c(\hat\theta). \qquad (57)$$

Alternatively, exploiting (29), we have a statistic based on the 'observed information' in the sample:

$$W^{H}(\hat\theta) = c(\hat\theta)'\big[\hat C H_N^{p}(\hat\theta)^{-1}\hat C'\big]^{-1}c(\hat\theta). \qquad (58)$$

Kitamura, Gautam and Ahn (2004) note that $W^{H}$ leads to asymptotically valid inference, but do not evaluate its small sample properties. A third Wald statistic, which although new is implicit in the work of Imbens (1997), treats GEL as an M-estimator, ignoring the information matrix equality given by Theorem 2:

$$W^{M}(\hat\theta) = c(\hat\theta)'\Big[\hat C H_N^{p}(\hat\theta)^{-1}\Big\{\sum_{c=1}^{N}s_c^{p}(\hat\theta)s_c^{p}(\hat\theta)'\Big\}H_N^{p}(\hat\theta)^{-1}\hat C'\Big]^{-1}c(\hat\theta). \qquad (59)$$
Newey and Windmeijer (2005) suggest another Wald statistic which, like (59), is based on a sandwich estimate of the variance-covariance matrix for $\hat\theta$, but where the outer-product estimator for the variance of the score is replaced by a direct estimate. Their statistic is

$$W^{NW}(\hat\theta) = N\cdot c(\hat\theta)'\big[\hat C H_N^{p}(\hat\theta)^{-1}(\hat\Gamma_\pi'\hat\Omega_\pi^{-1}\hat\Gamma_\pi)H_N^{p}(\hat\theta)^{-1}\hat C'\big]^{-1}c(\hat\theta). \qquad (60)$$
Newey and Windmeijer (2005) derive (60) using an alternative ‘many weak moments’ asymptotics where the number of moments grows with the sample size. Under their asymptotics GEL
remains consistent with a normal limiting distribution as long as the number of moments does
not increase too quickly with the sample size. However — heuristically — Theorem 2 does not hold
in this setting. The variance of the score is larger than the average Hessian under ‘many weak
moments’ asymptotics and therefore the conventional variance estimator is too small. Fortunately,
the sandwich variance estimator is consistent for the correct limiting variance in this setting.
Since (59) and (60) are estimating the same quantity they should behave similarly in finite
samples. It is interesting that a simple M-estimator argument leads to essentially the same test
statistic as the more complicated ‘many weak moments’ asymptotics of Newey and Windmeijer
(2005).
In weakly identified settings, with the number of moments held fixed, none of these statistics
will lead to valid inference (Stock and Wright 2000). This is because GEL is inconsistent with a
non-standard limiting distribution in such settings.
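The sandwich construction shared by (59) and (60) can be sketched compactly. The inputs (restriction value, restriction Jacobian, Hessian, per-observation scores) are hypothetical placeholders for the quantities defined in the text, and the outer-product variant shown here corresponds to the $W^{M}$ form.

```python
import numpy as np

def wald_sandwich(c_val, C, H, scores):
    """Sandwich-form Wald statistic in the spirit of (59).

    c_val  : (J,)   value of the restrictions c(theta_hat)
    C      : (J, K) Jacobian of the restrictions
    H      : (K, K) Hessian of the profiled criterion at theta_hat
    scores : (N, K) per-observation scores s_c^p(theta_hat)
    """
    B = scores.T @ scores            # outer-product estimate of the score variance
    Hinv = np.linalg.inv(H)
    V = C @ Hinv @ B @ Hinv @ C.T    # sandwich variance of c(theta_hat)
    return c_val @ np.linalg.solve(V, c_val)
```

Replacing the outer-product matrix `B` with a direct estimate of the score variance gives the Newey-Windmeijer form (60); the surrounding algebra is identical, which is one way to see why the two statistics should behave similarly in finite samples.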
Tilting parameter tests. Imbens, Spady and Johnson (1998) suggest the following Wald test for the null of correct specification:

$$IS^{J}(\hat\theta) = N\cdot\tilde\lambda_N(\hat\theta)'\hat R\,\tilde\lambda_N(\hat\theta), \qquad (61)$$

where $\hat R$ is a 'robust' sandwich estimate of the variance of the moments, $\Omega_0$, equalling

$$\hat R = \Big[\sum_{c=1}^{N}\hat\rho_2\hat\psi_c\hat\psi_c'/N\Big]\times\Big[\sum_{c=1}^{N}\hat\rho_1^2\hat\psi_c\hat\psi_c'/N\Big]^{-1}\times\Big[\sum_{c=1}^{N}\hat\rho_2\hat\psi_c\hat\psi_c'/N\Big]. \qquad (62)$$
The use of an estimate of $\Omega_0$ as the weight matrix in a quadratic form in the tilting parameters is appropriate because $P_0\Omega_0 P_0=P_0$ and hence $\Omega_0$ is a generalized inverse of $P_0$ (c.f., Imbens, Spady and Johnson 1998). A heuristic motivation for the specific choice of $\hat R$ as an estimate of $\Omega_0$ is provided in Appendix D. Imbens, Spady and Johnson (1998) and Imbens and Spady (2002) report Monte Carlo experiments where $IS^{J}(\hat\theta)$ has very good size properties. Presumably these results are driven by the superior accuracy of the asymptotic normal approximation to the sampling distribution of $\tilde\lambda_N(\hat\theta)$ relative to the corresponding approximation for $\hat\theta$ itself.
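A sketch of the quadratic form (61) with a sandwich weight built from the $\rho$ derivatives follows; the function name and inputs are hypothetical, and the middle term uses squared first derivatives following the usual sandwich logic for the $\lambda$ first order condition, so treat it as illustrative rather than as the authors' exact implementation.

```python
import numpy as np

def is_j_statistic(lam, psi, rho1, rho2):
    """Tilting-parameter Wald statistic N * lam' R_hat lam.

    lam        : (S,)   tilting parameter estimate
    psi        : (N, S) moment contributions
    rho1, rho2 : (N,)   first and second derivatives of rho at lam' psi_c
    """
    N = psi.shape[0]
    def wmat(w):                    # weighted second moment: sum_c w_c psi_c psi_c' / N
        return psi.T @ (w[:, None] * psi) / N
    # Sandwich estimate of Omega_0 built from the lambda first order condition.
    R = wmat(rho2) @ np.linalg.inv(wmat(rho1 ** 2)) @ wmat(rho2)
    return N * lam @ R @ lam
```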
A difference between (61) evaluated at the constrained and unconstrained parameter estimates provides a direct test of (18). Alternatively, Imbens and Spady (2002) suggest the minimum-$\chi^2$ statistic

$$IS(\hat\theta,\tilde\theta) = N\cdot(\tilde\lambda_N(\tilde\theta)-\tilde\lambda_N(\hat\theta))'\hat R\,(\tilde\lambda_N(\tilde\theta)-\tilde\lambda_N(\hat\theta)) \xrightarrow{D} \chi^2_J,$$

for testing (18). The coverage properties of confidence intervals formed by inverting $IS(\hat\theta,\tilde\theta)$ are evaluated in the Monte Carlo experiments below.
8 Monte Carlo
The goal of the Monte Carlo experiments is to evaluate the small sample properties of GMM and
GEL estimators and associated testing procedures for the linear-in-means model in light of the
theoretical predictions of Sections 6 and 7. The Monte Carlo experiments mimic key features of
the Project STAR data set used in Graham (2004) to estimate peer group effects. Five estimators
based on the conditional moment restriction (8) are evaluated:
1. GMM: This is Hansen’s (1982) optimal two-step estimator. One-step 2SLS estimates are
used to construct the weight matrix.
2. ET: The ‘exponential tilting’ member of the GEL family of estimators. This estimator is
advocated by Imbens, Spady and Johnson (1998). It is computationally attractive relative to
other GEL estimators.
3. EL: The ‘empirical likelihood’ member of the GEL family of estimators. This estimator has
the smallest asymptotic bias among all GEL estimators.
4. 2SLS: This is a 'one-step' GMM estimator which uses $W_N=(\sum r_c r_c'/N)^{-1}$ as a weight matrix. It is inefficient, but computationally straightforward.

5. QML: This is an M-estimator which uses as its criterion function the log-likelihood that would be appropriate if $y_c|M_c$ were multivariate normal. Normality is not required for the consistency of $\hat\theta_{QML}$, but the estimator is only efficient in that case. In the designs considered below, the QML estimator is in fact the maximum likelihood (ML) estimator; however, the Wald and Score statistics used for hypothesis testing are of the QML/robust form (c.f., MaCurdy 1982, Gourieroux and Monfort 1993, Wooldridge 2002).
It is also useful to compare the properties of the above estimators to those of the infeasible oracle estimator, which uses the unknown optimal linear combination of the sample average of the moment vector, $\Gamma_0'\Omega_0^{-1}$, for estimation. The $\Gamma_0$ and $\Omega_0$ matrices are given by (17). A closed form expression for $\Omega_0$ requires auxiliary distributional assumptions. Under normality, Appendix F shows that $\Omega_0$ is diagonal with elements

$$\Omega_0^{ss} = p_s\cdot\left\{\frac{1}{m_s^2}\left[2\sigma_\alpha^2+\gamma^2\sigma_s^2\frac{1}{m_s}\right]^2+\gamma^4\frac{2\sigma_s^4}{m_s^2(m_s-1)}\right\}, \qquad s=1,\dots,S.$$

The oracle chooses $\hat\theta_O$ to satisfy $\Gamma_0'\Omega_0^{-1}\psi_N(\hat\theta_O)=0$.
8.1 Small sample bias
Table 1 reports the theoretical values of each of the four asymptotic bias components, $B_O/N$, $B_\Gamma/N$, $B_\Omega/N$, and $B_W/N$, for all twelve Monte Carlo designs. The bias values are divided by the appropriate asymptotic standard error of $\hat\gamma^2$ (i.e., by $(I(\theta_0)^{-1}_{\gamma\gamma}/N)^{1/2}$). Note that the standard error used for scaling varies across designs, and hence the ranking of these scaled asymptotic biases across designs need not correspond to rankings based on absolute bias.
The scaled bias components vary substantially across the twelve designs. Holding the population first stage F statistic, $\mu^2/(S-1)$, constant, the bias due to Jacobian and weight matrix variability increases with the degree of overidentification. Jacobian bias is greatest for the weakly identified designs of Panel A, weight matrix bias for the very strongly identified designs of Panel C.
Higher order theory suggests that the two-step GMM, ET and EL estimators vary in their sensitivity to the Jacobian and weight matrix bias components. The twelve designs therefore provide an interesting setting in which to compare small sample bias in the GMM, ET and EL estimates of $\gamma^2$ and to assess the ability of higher order asymptotic theory to rationalize any differences. The asymptotic biases of these three estimators differ dramatically across some of the designs. The most extreme example is the $S=12$, $\mu^2/(S-1)=50$ design, where the asymptotic bias of $\hat\gamma^2$, as a fraction of its standard error, is $-0.8369$, $-0.3721$ and $-0.0001$ for the GMM, ET and EL estimators respectively (see Table 2). These are extraordinary differences for first order equivalent estimators.
Table 2 reports Monte Carlo estimates of the median bias of $\hat\gamma^2$ for each estimator and design. Median bias is reported instead of mean bias since GEL estimators, like LIML, may lack finite moments in small samples (c.f., Guggenberger 2005a, 2005b, Guggenberger and Hahn 2005). Also reported is the asymptotic bias for each estimator and design, calculated analytically using the formulae given in Section 6. The bias estimates, as well as their theoretical counterparts, are scaled by $\sqrt{I(\theta_0)^{-1}_{\gamma\gamma}/N}$. The results are based upon $B=5{,}000$ successful Monte Carlo draws for each design.$^{24}$ The standard error of each of the scaled median bias estimates is approximately 0.0177.$^{25}$
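The quoted standard error can be reproduced directly: the sample median of $B$ draws from a distribution that is approximately standard normal (as the scaled bias estimates are here) has asymptotic standard error $\sqrt{\pi/(2B)}$.

```python
import numpy as np

# Asymptotic standard error of the sample median of B standard normal draws:
# sqrt(pi / (2 B)).
B = 5_000
se_median = np.sqrt(np.pi / (2 * B))
print(round(se_median, 4))  # 0.0177, matching the value quoted in the text
```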
The qualitative bias ranking implied by Newey and Smith's (2004) expansions is mirrored by the small sample median bias estimates. Among GMM, ET and EL, median bias is greatest for the two-step GMM estimator and smallest for the EL estimator. The quasi-maximum likelihood estimator, which is the ML estimator in the current setting, consistently has the lowest median bias among all feasible estimators. The ORACLE, QML, and EL estimators are higher order equivalent in terms of bias for these designs. However, the EL estimator performs noticeably worse than either the QML or ORACLE estimators. Importantly, the median bias of EL is sensitive to the number of moments, unlike its asymptotic counterpart. Overall, higher order expansions do a reasonable job of rationalizing differences in the small sample median bias of different estimates of $\gamma^2$; however, important discrepancies remain. For example, for the $S=12$, $\mu^2/(S-1)=50$ design the median bias of the EL, QML and ORACLE estimates is $-0.2232$, $-0.0447$, and $-0.0497$ respectively, although the asymptotic biases of these three estimates are identical.
Table 3 directly evaluates the usefulness of Newey and Smith's (2004) asymptotic bias expressions for understanding the finite sample properties of the GMM and GEL estimators. Under the null hypothesis that an estimator's asymptotic bias accurately characterizes its small sample median bias we have, for $N$ and $B$ large enough,

$$\sqrt{B}\,(Bias_{MC}-Bias_{A}) \overset{D}{\approx} N(0,\pi/2),$$

where $Bias_{MC}$ is scaled median bias and $Bias_{A}$ is the scaled asymptotic bias of the corresponding estimate; both are reported in Table 2. A weighted least squares regression of $Bias_{MC}$ onto $Bias_{A}$, with weights $1/\sqrt{\pi/(2B)}$, the coefficient on $Bias_{A}$ constrained to equal 1, and no constant, should thus have a root mean square error of 1. Panel A of Table 3 reports minimum-$\chi^2$ tests of this null estimator-by-estimator. For all estimators median bias differs significantly from the values predicted by the higher order asymptotics of Newey and Smith (2004). The asymptotic bias predictions are most accurate for the ET and QML estimators and least so for the GMM and EL estimators.

$^{24}$ There were no convergence failures for any of the estimators across all draws and designs.
$^{25}$ The notes to Table 2 explain how this standard error is estimated.
Panel B of Table 3 reports weighted least squares regressions based upon the idea that

$$Bias_{MC} \overset{D}{\approx} N(\alpha_O B_O^{*}+\alpha_\Gamma B_\Gamma^{*}+\alpha_\Omega B_\Omega^{*}+\alpha_W B_W^{*},\;\pi/2B),$$

with $B_O^{*}=(B_O/N)\times[I(\theta_0)^{-1}_{\gamma\gamma}/N]^{-1/2}$, and similarly for $B_\Gamma^{*}$, $B_\Omega^{*}$ and $B_W^{*}$, and the values of $\alpha_O$, $\alpha_\Gamma$, $\alpha_\Omega$ and $\alpha_W$ equal to those implied by (45) (again under the null that an estimator's asymptotic bias accurately characterizes its small sample median bias). Since $B_O^{*}$ and $B_W^{*}$ vary little across the designs considered here, $\alpha_O$ and $\alpha_W$ are constrained to equal the values given in (45), while $\alpha_\Gamma$ and $\alpha_\Omega$ are allowed to deviate from their theoretical values.
For the two-step GMM estimator higher order theory suggests that $\alpha_\Gamma=\alpha_\Omega=1$; for the GEL estimators it suggests $\alpha_\Gamma=0$ and $\alpha_\Omega=1+\rho_3/2$, which equals $1/2$ and $0$ for ET and EL respectively. For QML, which is ML for these designs, it suggests $\alpha_\Gamma=\alpha_\Omega=0$. The actual regressions suggest that two-step GMM is noticeably less sensitive to Jacobian and weight matrix variability than would be predicted on the basis of asymptotic theory. For the ET and QML estimators, $\alpha_\Gamma$ and $\alpha_\Omega$ are not significantly different from their predicted values. For EL, unlike asymptotic bias, small sample median bias does appear to be sensitive to weight matrix variability, albeit much less so than for either the GMM or ET estimators.
8.2 Accuracy of the asymptotic approximation to the sampling distribution of $\hat\theta$ and confidence interval coverage
Table 4 reports scaled median absolute error (MAE) values for each estimator and design. MAE is a robust measure of dispersion. Under the null that the first order asymptotic approximation to the sampling distribution of $\hat\gamma^2$ is accurate, the scaled MAE estimates should be close to one. Deviations from this benchmark could be due either to non-negligible bias or to the actual sampling distribution having thick or thin tails.
Table 5 gives the ratio of a robust quantile-based estimate of the standard deviation of the sampling distribution of $\hat\gamma^2$ to its true asymptotic standard error, $\sqrt{I(\theta_0)^{-1}_{\gamma\gamma}/N}$, as well as the ratio of the median estimated asymptotic standard error to $\sqrt{I(\theta_0)^{-1}_{\gamma\gamma}/N}$. With the exception of the weakly identified designs, where there is evidence of thick tails in the GEL and QML estimators, the Monte Carlo standard deviation of $\hat\gamma^2$ is close to the value predicted by the standard asymptotic normal approximation.
normal approximation. However the ratio of the median estimated standard error to the true
asymptotic standard error is well below one for the heavily overidentified designs (S = 7, 12). This
is consistent with Γ0 and/or Ω0 being difficult to estimate precisely in small samples, biasing the
estimated standard errors downwards.
To see that this is indeed the case, it is illuminating to examine the standard variance estimator for $\hat\theta_{GMM}$, $(\hat\Gamma_N'\Omega_N^{-1}\hat\Gamma_N)^{-1}/N$, in detail. The Jacobian term, $\hat\Gamma_N$, equals $-\sum_{c=1}^{N}r_c x_c'/N$. Variability in this term is modest when $\theta$ is strongly identified. The $s$th row of the estimated Jacobian matrix equals $(1,\hat E[g_{cw}|M_c=m_s])$, where

$$\hat E[g_{cw}|M_c=m_s] = -\frac{1}{m_s}\,\frac{1}{n_s-N_s}\sum_{g=1}^{N}\sum_{i=1}^{M_g}(y_{gi}-\bar y_g)^2\times 1(M_g=m_s),$$

with $N_s=\sum_{c=1}^{N}1(M_c=m_s)$. When $n_s=N_s m_s$ is moderately large this term is estimated with reasonable precision. For the $S=12$, $\mu^2/(S-1)=5$ 'weakly' identified design, 5,000 Monte Carlo draws give a bias for $\hat E[g_{cw}|M_c=m_6]$ of (standard deviation in parentheses)

$$E[\{\Gamma_N(\theta_0)-\Gamma_0\}_{6,2}] = 0.0000022\;\;(0.0001623).$$
Both the bias and standard deviation are small relative to the population value of −0.0029885.
The estimated Jacobian, because it uses the individual-level within-group variation of the data, is
typically quite precise in linear-in-means applications.
The estimated variance-covariance matrix of the moments for two-step GMM, $\Omega_N(\theta)$, equals the outer product

$$\Omega_N(\theta) = N^{-1}\sum_{c=1}^{N}u(g_c,\theta)^2 r_c r_c' = \mathrm{diag}\{p_1\hat\sigma_u^2(m_1),\dots,p_S\hat\sigma_u^2(m_S)\},$$

where

$$p_s = \frac{N_s}{N}, \qquad \hat\sigma_u^2(m_s) = \frac{\sum_{c=1}^{N}u(g_c,\theta)^2\cdot 1(M_c=m_s)}{\sum_{c=1}^{N}1(M_c=m_s)}.$$

Since $u_c$ is a function of squares of $y_{ci}$, the elements of $\Omega_0$ are functions of the conditional fourth moments of the original data. Fourth moments are difficult to estimate with precision. For $N=300$ and $S=12$, each of the twelve distinct elements of $\Omega_N(\theta)$ is estimated using an average of only 25 observations! In small samples with modest overidentification, $\Omega_N(\theta)$ is thus a very imprecise estimate of $\Omega_0$.
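The imprecision is easy to see in a toy simulation: with roughly 25 observations per cell, the sampling standard deviation of a cell-level variance estimate is large relative to its target. The residual distribution below is a stand-in chosen only for illustration, not the paper's design.

```python
import numpy as np

rng = np.random.default_rng(5)
n_cell, draws = 25, 5_000
u = rng.standard_normal((draws, n_cell))    # toy residuals with E[u^2] = 1
sigma2_hat = (u ** 2).mean(axis=1)          # one Omega_N diagonal cell per draw
print(sigma2_hat.mean(), sigma2_hat.std())  # mean near 1, sd near sqrt(2/25) ~ 0.28
```

A sampling standard deviation near 0.28 around a target of 1 means individual weight matrix cells are routinely off by 25 percent or more, and with heavier-tailed residuals (larger fourth moments) the dispersion would be worse still.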
Consider the $S=12$, $\mu^2/(S-1)=50$ 'very strongly' identified design. Even allowing $\Omega_0$ to be estimated using the 'true' residuals, $u(g_c,\theta_0)$, we have for the 6th diagonal element of $\Omega_N(\theta_0)^{-1}$ a bias and standard deviation of

$$E[\{\Omega_N(\theta_0)^{-1}-\Omega_0^{-1}\}_{6,6}] = 18.1738\;\;(32.2942),$$

which is very large relative to the population value of 38.7998.
Although the asymptotic normal approximation to the sampling distribution of $\hat\gamma^2$ is reasonably accurate (with the exception of the weakly identified designs), consistently estimating $I(\theta)^{-1}$ appears to be challenging.
Coverage rates for different 95 percent confidence intervals under weak, strong and very strong identification are given in Tables 6, 7 and 8 respectively. Several patterns consistently emerge. The ET and EL Wald intervals based on $W^{H}$, which uses an 'observed information' variance estimate, perform significantly better than traditional intervals based on $W$. This is especially true in the weakly identified designs. The Wald intervals based upon either the Newey-Windmeijer or M-estimator sandwich variance estimators, $W^{NW}$ and $W^{M}$, do even better, with $W^{M}$ typically doing best.

The Score statistics consistently have coverage closest to 95 percent across all designs. Among the two Score statistics, the standard M-estimator version, $S^{M}$, which does not exploit the information equality result of Theorem 2, outperforms the efficient versions suggested by Kleibergen (2005), Guggenberger and Smith (2004) and Newey and Windmeijer (2005). For the $S=12$, $\mu^2/(S-1)=5$ design, for example, the traditional Wald intervals have an actual coverage of only 79.0 percent while the $S^{M}$ intervals have a coverage of 93.9 percent, very close to nominal coverage.
The superior coverage of the GEL-based score statistic intervals is consistent with the analysis of Section 7. Recall that the EL first order conditions use a semiparametrically efficient estimate of the optimal instrumental variable. The instrument is constructed from efficient estimates of both the Jacobian and the weight matrix; therefore the EL score has an asymptotic bias of zero under the null. The GMM score, in contrast, has a non-zero asymptotic mean. Hence we would expect the EL score-based intervals to exhibit superior coverage. Construction of the score statistic requires an estimate of $I(\theta)^{-1}$ which, as argued above, is difficult to estimate. The contrast between the $S$ and $S^{M}$ statistics in terms of size may reflect the fact that the latter effectively uses a superior estimate of $I(\theta)^{-1}$. This is a topic for further research.
The LR intervals, which base inference directly upon the GEL criterion function, perform similarly to the $W^{H}$ intervals, which is not surprising as $W^{H}$ is a quadratic approximation to LR in the neighborhood of $\hat\theta_{GEL}$. The IS intervals, suggested by Imbens and Spady (2002), consistently perform worst of all. I attribute this poor performance to the quadratic form structure of the statistic, in which the weight matrix is an estimate of $\Omega_0$, which is very difficult to estimate precisely in the current setting.

For the two-step GMM estimator, the coverage of intervals based on changes in the quadratic form criterion function (23) is quite good, with coverage typically close to that of the best performing ET and EL $S^{M}$ intervals.$^{26}$ For QML the $W^{M}$ and $S^{M}$ intervals perform similarly, both with actual coverage very close to 95 percent across all designs considered. Note that, although QML is ML in the current setting, $W^{M}$ and $S^{M}$ are M-estimator versions of the Wald and Score statistics that do not take into account simplifications arising from the information matrix equality.
The two-stage least squares (2SLS) $W^{M}$ and $S^{M}$ intervals are consistently outperformed by their ET, EL and QML counterparts. However, the 2SLS $W^{M}$ intervals consistently have better coverage than the two-step GMM $W$ intervals.

$^{26}$ Bond, Bowsher and Windmeijer (2001) find that these intervals perform well in strongly identified dynamic panel data settings where there are often many overidentifying restrictions.
9 Suggestions for applied researchers
The Monte Carlo results have a number of implications for applied researchers estimating models identified by moment conditions, particularly those interested in the social interactions application of Graham (2005). First, the bias properties of EL are consistently superior to those of two-step GMM and ET. In the designs considered here QML has the best bias properties among feasible estimators. However, the QML estimator is also the maximum likelihood estimator for these designs. It will be interesting to examine the performance of QML relative to EL in settings where it is not efficient.$^{27}$
In weakly identified settings, where Jacobian variability is an important source of bias, GEL estimators outperform 2SLS as well as two-step GMM in terms of median bias. In settings where bias due to weight matrix variability is a concern, the comparison is less clear cut. 2SLS eliminates bias due to weight matrix variability by virtue of using a non-stochastic weight matrix. This comes, of course, with a loss in asymptotic efficiency. In strongly identified settings, where Jacobian bias is small, 2SLS tends to display lower median bias than both ET and EL. In such settings the improved performance in terms of bias may compensate for any efficiency losses. However, bias correction of either ET or EL provides an alternative which, in principle, involves no efficiency loss.$^{28}$
In terms of inference, the simple and newly suggested $W^{M}$ and $S^{M}$ statistics, derived by treating ET and EL as M-estimators, consistently outperform all other alternatives considered. For researchers using the two-step GMM estimator, inference based directly upon the quadratic form criterion function also performs well. For strongly identified designs, an applied researcher uninterested in programming could report 2SLS point estimates but conduct inference using the two-step GMM criterion function, and reasonably expect good bias and coverage properties; this approach would not be advisable in weakly identified settings, however.
Although these recommendations are guided by specific Monte Carlo designs, they are likely to
be useful for a wide variety of linear instrumental variable models. In particular theory provides
a strong case for privileging GEL over GMM and for conducting inference using the GEL score
statistic. Qualitatively the Monte Carlo evidence supports these preferences (although I find that
the sandwich score statistic performs better than the forms suggested by Guggenberger and Smith
(2004) and Kleinbergen (2005)).
Overall, GEL estimators and inference procedures offer real improvements relative to two-step GMM in terms of bias and coverage, performing well across all the designs considered here. Unfortunately, GEL estimators are computationally challenging and require special programming. They are therefore less convenient than both 2SLS and two-step GMM, which are readily available in standard software packages. For researchers interested in using the GEL estimators and testing procedures evaluated here, a self-contained MATLAB program for models defined by linear moment conditions is available from the author.

^{27} Monte Carlo experiments where $y_c|M_c$ is multivariate Laplace are in progress.
^{28} In fact, Newey and Smith (2004) prove that bias-corrected EL is higher order efficient.
Appendices

A  Calculating the score and Hessian of the profiled generalized empirical likelihood objective function, $l_N^p(\theta)$

This appendix provides analytic formulae for the score and Hessian of the profiled and unprofiled GEL criterion functions. The expressions generalize and extend those given by Owen (2001) for the empirical likelihood estimator.
It is convenient to express the score and Hessian of the profiled GEL objective function (25) in terms of the first and second derivatives of the unprofiled objective function, $l_N(\lambda,\theta) \overset{def}{=} \sum_{c=1}^N \rho(\lambda'\psi_c(\theta))$. The first derivatives of $l_N$ with respect to $\lambda$ and $\theta$ are
\[
\underset{J\times 1}{\nabla_\lambda l_N} = \sum_{c=1}^N \rho_1(\lambda'\psi_c)\psi_c, \qquad \underset{K\times 1}{\nabla_\theta l_N} = \sum_{c=1}^N \rho_1(\lambda'\psi_c)\left(\frac{\partial\psi_c}{\partial\theta'}\right)'\lambda, \qquad (63)
\]
while the second derivatives and cross-partial equal
\[
\underset{J\times J}{\nabla_{\lambda\lambda} l_N} = \sum_{c=1}^N \rho_2(\lambda'\psi_c)\psi_c\psi_c', \qquad (64)
\]
\[
\underset{K\times K}{\nabla_{\theta\theta} l_N} = \sum_{c=1}^N \rho_1(\lambda'\psi_c)\left[\sum_{j=1}^J \frac{\partial^2\psi_c^{(j)}}{\partial\theta\partial\theta'}\lambda_j\right] + \sum_{c=1}^N \rho_2(\lambda'\psi_c)\left[\sum_{j=1}^J \frac{\partial\psi_c^{(j)}}{\partial\theta}\lambda_j\right]\left[\sum_{j=1}^J \frac{\partial\psi_c^{(j)}}{\partial\theta}\lambda_j\right]'
\]
\[
\phantom{\underset{K\times K}{\nabla_{\theta\theta} l_N}} = \sum_{c=1}^N \rho_1(\lambda'\psi_c)\left[\sum_{j=1}^J \frac{\partial^2\psi_c^{(j)}}{\partial\theta\partial\theta'}\lambda_j\right] + \sum_{c=1}^N \rho_2(\lambda'\psi_c)\left(\frac{\partial\psi_c}{\partial\theta'}\right)'\lambda\lambda'\left(\frac{\partial\psi_c}{\partial\theta'}\right),
\]
\[
\underset{K\times J}{\nabla_{\theta\lambda} l_N} = \sum_{c=1}^N \rho_1(\lambda'\psi_c)\left(\frac{\partial\psi_c}{\partial\theta'}\right)' + \sum_{c=1}^N \rho_2(\lambda'\psi_c)\left(\frac{\partial\psi_c}{\partial\theta'}\right)'\lambda\psi_c'.
\]
The score ($s_N^p(\theta) = \nabla_\theta l_N^p(\theta)$) and Hessian ($H_N^p(\theta) = \nabla_{\theta\theta} l_N^p(\theta)$) can be written in terms of (63) and (64).
By the chain and product rules the score of the profiled objective function (25) is
\[
s_N^p(\theta) = \left(\frac{\partial\tilde\lambda_N(\theta)}{\partial\theta'}\right)'\nabla_\lambda l_N + \nabla_\theta l_N = \nabla_\theta l_N, \qquad (65)
\]
where the second equality follows from the Envelope Theorem (i.e., since $\nabla_\lambda l_N = 0$ at the maximizer $\tilde\lambda_N(\theta_0) = \arg\sup_{\lambda\in\Lambda_N(\theta_0)} l_N(\lambda,\theta_0)$). The $c$th observation's contribution to the score is
\[
s^p(z_c,\theta) = \left[\frac{\partial\tilde\lambda}{\partial\theta'}\right]'\rho_1(\tilde\lambda'\psi_c)\psi_c + \rho_1(\tilde\lambda'\psi_c)\left[\frac{\partial\psi_c}{\partial\theta'}\right]'\tilde\lambda,
\]
where dependence of $\psi_c$, $\pi_c$, and $\tilde\lambda$ on $\theta$ has been suppressed to simplify notation.
The Hessian equals $(\partial\tilde\lambda/\partial\theta')'\nabla_{\lambda\lambda}l_N(\partial\tilde\lambda/\partial\theta') + 2(\partial\tilde\lambda/\partial\theta')'\nabla_{\lambda\theta}l_N + \nabla_{\theta\theta}l_N$. However we can use the fact that
\[
\frac{\partial\tilde\lambda_N(\theta)}{\partial\theta'} = -\left(\nabla_{\lambda\lambda}l_N\right)^{-1}\nabla_{\lambda\theta}l_N, \qquad (66)
\]
to write the Hessian as
\[
H_N^p(\theta) = \nabla_{\theta\theta}l_N - \nabla_{\theta\lambda}l_N\left(\nabla_{\lambda\lambda}l_N\right)^{-1}\nabla_{\lambda\theta}l_N. \qquad (67)
\]
To understand (66) note that totally differentiating $\nabla_\lambda l_N$ w.r.t. $\theta$ we have
\begin{align*}
0 &= \sum_{c=1}^N \rho_1(\tilde\lambda'\psi_c)\frac{\partial\psi_c}{\partial\theta'} + \sum_{c=1}^N \rho_2(\tilde\lambda'\psi_c)\psi_c\left[\tilde\lambda'\frac{\partial\psi_c}{\partial\theta'} + \psi_c'\frac{\partial\tilde\lambda}{\partial\theta'}\right] \\
&= \sum_{c=1}^N\left[\rho_1(\tilde\lambda'\psi_c)\frac{\partial\psi_c}{\partial\theta'} + \rho_2(\tilde\lambda'\psi_c)\psi_c\tilde\lambda'\frac{\partial\psi_c}{\partial\theta'}\right] + \sum_{c=1}^N \rho_2(\tilde\lambda'\psi_c)\psi_c\psi_c'\frac{\partial\tilde\lambda}{\partial\theta'},
\end{align*}
and solving for $\partial\tilde\lambda/\partial\theta'$ yields
\[
\frac{\partial\tilde\lambda_N(\theta)}{\partial\theta'} = -\left[\sum_{c=1}^N \rho_2(\tilde\lambda'\psi_c)\psi_c\psi_c'\right]^{-1}\times\left[\sum_{c=1}^N \rho_1(\tilde\lambda'\psi_c)\frac{\partial\psi_c}{\partial\theta'} + \rho_2(\tilde\lambda'\psi_c)\psi_c\tilde\lambda'\frac{\partial\psi_c}{\partial\theta'}\right] = -\left(\nabla_{\lambda\lambda}l_N\right)^{-1}\nabla_{\lambda\theta}l_N,
\]
which equals (66).
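Formula (66) can be sanity-checked by finite differences in a case where $\tilde\lambda_N(\theta)$ is available in closed form — continuous updating with a scalar moment, for which $\rho_1(v) = -1 - v$ and $\rho_2(v) = -1$. The sketch below (illustrative data, not from the paper) compares the implicit-function expression with a numerical derivative of the closed-form $\tilde\lambda_N(\theta)$:

```python
import numpy as np

rng = np.random.default_rng(9)
N = 200
r = rng.normal(size=N)
x = rng.normal(size=N)
y = 0.5 * x + rng.normal(size=N)

def psi(th):                             # scalar moment psi_c = r_c(y_c - x_c theta)
    return r * (y - x * th)

def lam_tilde(th):                       # closed-form CU maximizer in lambda
    p = psi(th)
    return -p.sum() / (p * p).sum()

th0 = 0.4
p = psi(th0)
dp = -r * x                              # dpsi_c / dtheta
lam = lam_tilde(th0)
d_ll = -(p * p).sum()                    # grad_{lambda lambda} l_N for CU
d_lt = ((-1.0 - 2.0 * lam * p) * dp).sum()   # grad_{lambda theta} l_N for CU
implicit = -d_lt / d_ll                  # formula (66)

eps = 1e-6
fd = (lam_tilde(th0 + eps) - lam_tilde(th0 - eps)) / (2 * eps)
assert np.isclose(implicit, fd, rtol=1e-3, atol=1e-8)
```

The same implicit-function calculation is what makes the profiled Hessian (67) cheap to compute once the inner problem in $\lambda$ has been solved.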
B  Proof of asymptotic normality and related results

B.1  Proof of Theorem 1 (Asymptotic normality of $(\hat\lambda_{GEL}', \hat\theta_{GEL}')'$)

Asymptotic normality of $\hat\theta_{GEL}$ has been shown by Imbens (1997) and Qin and Lawless (1994) for the case of empirical likelihood and by Kitamura and Stutzer (1997) for the case of exponential tilting. Newey and Smith (2004) give a rigorous and general proof for all members of the GEL family (Theorem 3.2, p. 226). The reader is referred to their proof for regularity conditions.
Taking consistency as given, a sketch of the derivation of the limiting distribution is as follows: from (63) and (64) above we have at $(\theta,\lambda) = (\theta_0, 0)$
\[
\nabla_\lambda l_N(0,\theta_0) = -\sum_{c=1}^N \psi_c, \qquad \nabla_\theta l_N(0,\theta_0) = 0, \qquad (68)
\]
\[
\nabla_{\lambda\lambda} l_N(0,\theta_0) = -\sum_{c=1}^N \psi_c\psi_c', \qquad \nabla_{\theta\theta} l_N(0,\theta_0) = 0, \qquad \nabla_{\theta\lambda} l_N(0,\theta_0) = \sum_{c=1}^N \left(\frac{\partial\psi_c}{\partial\theta'}\right)'.
\]
Expanding each row of the stacked first order conditions w.r.t. $\lambda$ and $\theta$ around $(0,\theta_0)$ in a mean-value expansion we have
\[
0 = s_N(\hat\lambda,\hat\theta) = s_N(0,\theta_0) + H_N(\bar\lambda,\bar\theta)\begin{pmatrix} \hat\lambda \\ \hat\theta - \theta_0 \end{pmatrix},
\]
where
\[
s_N(\lambda,\theta) = \begin{pmatrix} \nabla_\lambda l_N(\lambda,\theta) \\ \nabla_\theta l_N(\lambda,\theta) \end{pmatrix}, \qquad H_N(\lambda,\theta) = \begin{pmatrix} \nabla_{\lambda\lambda} l_N(\lambda,\theta) & \nabla_{\lambda\theta} l_N(\lambda,\theta) \\ \nabla_{\theta\lambda} l_N(\lambda,\theta) & \nabla_{\theta\theta} l_N(\lambda,\theta) \end{pmatrix} \qquad (69)
\]
and each row of the Hessian matrix is evaluated at a different mean value between $(\hat\lambda,\hat\theta)$ and $(0,\theta_0)$. Multiplying through by $1/\sqrt{N}$ and rearranging we have
\[
\sqrt{N}\begin{pmatrix} \hat\lambda \\ \hat\theta - \theta_0 \end{pmatrix} = -\left[\frac{H_N(\bar\lambda,\bar\theta)}{N}\right]^{-1} s_N(0,\theta_0)/\sqrt{N}.
\]
By (68) and the uniform weak law of large numbers (UWLLN)
\[
\frac{H_N(\bar\lambda,\bar\theta)}{N} \overset{p}{\to} -\begin{pmatrix} \Omega_0 & -\Gamma_0 \\ -\Gamma_0' & 0 \end{pmatrix} \overset{def}{=} A_0, \qquad (70)
\]
where $\Gamma_0 = E[\partial\psi_c(\theta_0)/\partial\theta']$ and $\Omega_0 = E[\psi_c(\theta_0)\psi_c(\theta_0)']$. Using the continuity of the inverse of a non-singular matrix and standard results on partitioned inverses gives
\[
\left[\frac{H_N(\bar\lambda,\bar\theta)}{N}\right]^{-1} \overset{p}{\to} -\begin{pmatrix} P_0 & -\Omega_0^{-1}\Gamma_0 I(\theta_0)^{-1} \\ -I(\theta_0)^{-1}\Gamma_0'\Omega_0^{-1} & -I(\theta_0)^{-1} \end{pmatrix}, \qquad (71)
\]
where $I(\theta_0) = \Gamma_0'\Omega_0^{-1}\Gamma_0$ and $P_0 = \Omega_0^{-1} - \Omega_0^{-1}\Gamma_0 I(\theta_0)^{-1}\Gamma_0'\Omega_0^{-1}$.
Applying the central limit theorem to the normalized score vector gives
\[
s_N(0,\theta_0)/\sqrt{N} \overset{D}{\to} N(0, B_0), \qquad B_0 \overset{def}{=} \begin{pmatrix} \Omega_0 & 0 \\ 0 & 0 \end{pmatrix}, \qquad (72)
\]
which with Slutsky's theorem and (71) gives (26). Q.E.D.
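The partitioned inverse in (71) is easy to verify numerically. The sketch below builds $A_0$ from a randomly generated full-rank $\Gamma_0$ and positive definite $\Omega_0$ and checks the claimed block form of its inverse; dimensions and matrices are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
J, K = 4, 2                          # number of moments and parameters
G = rng.normal(size=(J, K))          # Gamma0
M = rng.normal(size=(J, J))
Om = M @ M.T + J * np.eye(J)         # Omega0, positive definite

Oi = np.linalg.inv(Om)
I0 = G.T @ Oi @ G                    # I(theta0) = Gamma0' Omega0^{-1} Gamma0
Ii = np.linalg.inv(I0)
P0 = Oi - Oi @ G @ Ii @ G.T @ Oi     # P0 as defined after (71)

# A0 = -[[Omega0, -Gamma0], [-Gamma0', 0]] from (70)
A0 = -np.block([[Om, -G], [-G.T, np.zeros((K, K))]])
Ainv = np.linalg.inv(A0)
claimed = -np.block([[P0, -Oi @ G @ Ii], [-Ii @ G.T @ Oi, -Ii]])
assert np.allclose(Ainv, claimed)
```

The lower-right block of the inverse is what delivers the familiar $I(\theta_0)^{-1}$ asymptotic variance for $\hat\theta_{GEL}$ in (26).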
B.2  Proof of Theorem 2 (Information Matrix Equality)

By the continuity of the inverse of a non-singular matrix and the law of large numbers $(H_N^p(\hat\theta)/N)^{-1}$ converges to $I(\theta_0)^{-1}$. To derive the asymptotic variance of the score of $l_N^p(\theta_0)$ — the profiled GEL criterion function — we can adapt standard results on concentrated likelihood functions given by Amemiya (1985, pp. 125 - 126). By the Envelope Theorem the score evaluated at $\theta_0$ equals
\[
s_N^p(\theta_0) = \nabla_\theta l_N^p(\theta_0) = \nabla_\theta l_N(\tilde\lambda_N(\theta_0), \theta_0),
\]
where $\tilde\lambda_N(\theta_0) = \arg\sup_{\lambda\in\Lambda_N(\theta_0)} l_N(\lambda,\theta_0)$. A mean value expansion around $\tilde\lambda_N(\theta_0) = 0$ gives
\[
s_N^p(\theta_0) = \nabla_\theta l_N(0,\theta_0) + \nabla_{\theta\lambda} l_N(\bar\lambda,\bar\theta)\cdot\tilde\lambda_N(\theta_0) = \nabla_{\theta\lambda} l_N(\bar\lambda,\bar\theta)\cdot\tilde\lambda_N(\theta_0), \qquad (73)
\]
with the second equality following from (68) and where $(\bar\lambda,\bar\theta)$ lies between $(\tilde\lambda_N(\theta_0),\theta_0)$ and $(0,\theta_0)$ and varies from row-to-row. We also have, for fixed and known $\theta_0$,
\[
\sqrt{N}\tilde\lambda_N(\theta_0) \overset{LD}{=} -\left(\lim_{N\to\infty}\frac{1}{N}\nabla_{\lambda\lambda} l_N(0,\theta_0)\right)^{-1}\nabla_\lambda l_N(0,\theta_0)/\sqrt{N}, \qquad (74)
\]
where the $\overset{LD}{=}$ denotes that both sides of (74) have the same limiting distribution. Rearranging (73) and using (74) as well as (71) above we have
\[
s_N^p(\theta_0)/\sqrt{N} \overset{D}{\to} \Gamma_0'\Omega_0^{-1}\cdot N(0,\Omega_0).
\]
Therefore the asymptotic variance of the score equals $\Gamma_0'\Omega_0^{-1}\Gamma_0 = I(\theta_0)$ and hence
\[
E[H^p(z_c,\theta_0)] = Var(s^p(z_c,\theta_0)) = I(\theta_0), \qquad (75)
\]
which is an information matrix equality for GEL estimators. Q.E.D.
B.3  Proof of Theorem 3 (Limiting distribution of $\hat\omega_A = \sum_{c=1}^N 1(z_c\in A)\tilde\pi_c(\hat\theta)$)

Consistency of $\hat\omega_A$ for $\omega_A = \Pr(z_c\in A)$ follows from the fact that in large samples $F_\pi(z,\hat\theta)$ is an $o_p(1)$ approximation to the empirical distribution function. By the Glivenko-Cantelli Theorem (e.g., van der Vaart 1998, p. 266) $F_\pi(z,\hat\theta)$ is thus consistent for $F_0(z)$.

The limiting distribution of $\hat\omega_A$ can be computed using a straightforward method-of-moments argument. Let $\beta = (\omega_A, \theta')'$ and consider the augmented $(1+S)\times 1$ vector of moments $m(z_c,\beta) = \{\omega_A - 1(z_c\in A), \psi(z_c,\theta)'\}'$. From Chamberlain (1987, 1992) the variance bound for $\beta$ is
\[
I(\beta)^{-1} = \left\{E\left[\frac{\partial m(z_c,\beta_0)}{\partial\beta'}\right]' E\left[m(z_c,\beta_0)m(z_c,\beta_0)'\right]^{-1} E\left[\frac{\partial m(z_c,\beta_0)}{\partial\beta'}\right]\right\}^{-1},
\]
where
\[
E\left[\frac{\partial m(z_c,\beta)}{\partial\beta'}\right] = \begin{pmatrix} 1 & 0 \\ 0 & \Gamma_0 \end{pmatrix}, \qquad E\left[m(z_c,\beta)m(z_c,\beta)'\right] = \begin{pmatrix} \omega_A(1-\omega_A) & -\psi_A' \\ -\psi_A & \Omega_0 \end{pmatrix},
\]
with $\psi_A = E[\psi(z_c,\theta)\times 1(z_c\in A)] = \omega_A\times E[\psi(z_c,\theta)|z_c\in A]$ as stated in the Theorem. Since $\hat\omega_A$ is efficient its limiting variance equals the relevant element of $I(\beta)^{-1}$. By standard results on partitioned inverses we have (writing $A = \omega_A(1-\omega_A)$)
\[
I(\beta) = \begin{pmatrix} (A - \psi_A'\Omega_0^{-1}\psi_A)^{-1} & -(A - \psi_A'\Omega_0^{-1}\psi_A)^{-1}\psi_A'\Omega_0^{-1}\Gamma_0 \\ -\Gamma_0'\Omega_0^{-1}\psi_A(A - \psi_A'\Omega_0^{-1}\psi_A)^{-1} & \Gamma_0'(\Omega_0 - \psi_A A^{-1}\psi_A')^{-1}\Gamma_0 \end{pmatrix}.
\]
Again applying the partitioned inverse formula we get the upper-left element of $I(\beta)^{-1}$ equal to
\[
I(\beta)^{-1}_{\omega\omega} = (A - \psi_A'\Omega_0^{-1}\psi_A) + \psi_A'\Omega_0^{-1}\Gamma_0\left[\Gamma_0'(\Omega_0 - \psi_A A^{-1}\psi_A')^{-1}\Gamma_0 - \Gamma_0'\Omega_0^{-1}\psi_A(A - \psi_A'\Omega_0^{-1}\psi_A)^{-1}\psi_A'\Omega_0^{-1}\Gamma_0\right]^{-1}\Gamma_0'\Omega_0^{-1}\psi_A;
\]
the result follows by showing that
\[
(\Omega_0 - \psi_A A^{-1}\psi_A')^{-1} - \Omega_0^{-1}\psi_A(A - \psi_A'\Omega_0^{-1}\psi_A)^{-1}\psi_A'\Omega_0^{-1} = \Omega_0^{-1},
\]
which is true since $(\Omega_0 - \psi_A A^{-1}\psi_A')^{-1} = \Omega_0^{-1} + \Omega_0^{-1}\psi_A(A - \psi_A'\Omega_0^{-1}\psi_A)^{-1}\psi_A'\Omega_0^{-1}$. Q.E.D.
B.4  Proof of Theorem 4 (Efficient estimation of expectations)

The semiparametric efficiency bound for estimating $\beta = (\mu_a', \theta')'$ — when the only prior information is the moment restriction (9) — can be calculated directly using the results of Chamberlain (1987, 1992). Efficiency of $\mu_a^*$ follows by showing that its asymptotic variance attains this bound.

Similarly to the proof of Theorem 3, consider the augmented vector of moments $m(z_c,\beta) = \{(a_c-\mu_a)', \psi(z_c,\theta)'\}'$ with an expected Jacobian and covariance matrix of
\[
E\left[\frac{\partial m(z_c,\beta)}{\partial\beta'}\right] = \begin{pmatrix} -I_K & 0 \\ 0 & \Gamma_0 \end{pmatrix}, \qquad E\left[m(z_c,\beta)m(z_c,\beta)'\right] = \begin{pmatrix} \Sigma_{aa} & \Sigma_{a\psi} \\ \Sigma_{a\psi}' & \Omega_0 \end{pmatrix},
\]
where $K = \dim(a_c)$. Applying the bound from Chamberlain (1987) and solving for $I(\beta)^{-1}_{aa}$ gives
\[
I(\beta)^{-1}_{aa} = \Sigma_{aa} - \Sigma_{a\psi}\Omega_0^{-1}\Sigma_{a\psi}' + \Sigma_{a\psi}\Omega_0^{-1}\Gamma_0 I(\theta_0)^{-1}\Gamma_0'\Omega_0^{-1}\Sigma_{a\psi}'. \qquad (76)
\]
If $\theta_0$ is known the final term in $I(\beta)^{-1}_{aa}$ drops out, with the variance bound simplifying to $\Sigma_{aa} - \Sigma_{a\psi}\Omega_0^{-1}\Sigma_{a\psi}'$.
To derive the limiting variance of $\sqrt{N}\mu_a^*$ replace $\bar\psi_N(\hat\theta)$ in (36) with a mean value expansion around $\theta_0$ to yield
\[
\mu_a^* = \bar a_N - \hat\Sigma_{a\psi,N}\hat\Omega_N^{-1}\left\{\bar\psi_N(\theta_0) + \left(\frac{1}{N}\sum_{c=1}^N\frac{\partial\psi(z_c,\bar\theta)}{\partial\theta'}\right)(\hat\theta-\theta_0)\right\} = \bar a_N - \hat\Sigma_{a\psi,N}\hat\Omega_N^{-1}\left\{\bar\psi_N(\theta_0) + \bar\Gamma_N(\hat\theta-\theta_0)\right\}.
\]
By the UWLLN and the continuity of the inverse of a non-singular matrix we have $\hat\Sigma_{a\psi,N}\overset{p}{\to}\Sigma_{a\psi}$, $\hat\Omega_N^{-1}\overset{p}{\to}\Omega_0^{-1}$, and $\bar\Gamma_N\overset{p}{\to}\Gamma_0$. Replacing $\sqrt{N}(\hat\theta-\theta_0)$ with the asymptotically equivalent quantity $-I(\theta_0)^{-1}\Gamma_0'\Omega_0^{-1}\sqrt{N}\bar\psi_N(\theta_0)$ (see the proof to Theorem 1 above) we have
\[
\sqrt{N}\mu_a^* = \sqrt{N}\bar a_N - \Sigma_{a\psi}\Omega_0^{-1}\sqrt{N}\bar\psi_N(\theta_0) + \Sigma_{a\psi}\Omega_0^{-1}\Gamma_0 I(\theta_0)^{-1}\Gamma_0'\Omega_0^{-1}\sqrt{N}\bar\psi_N(\theta_0) + o_p(1).
\]
Applying the central limit theorem and some tedious algebra we have
\begin{align*}
Var(\sqrt{N}\mu_a^*) &= Var(\sqrt{N}\bar a_N) + \Sigma_{a\psi}\Omega_0^{-1}Var(\sqrt{N}\bar\psi_N)\Omega_0^{-1}\Sigma_{a\psi}' \\
&\quad + \Sigma_{a\psi}\Omega_0^{-1}\Gamma_0 I(\theta_0)^{-1}\Gamma_0'\Omega_0^{-1}Var(\sqrt{N}\bar\psi_N)\Omega_0^{-1}\Gamma_0 I(\theta_0)^{-1}\Gamma_0'\Omega_0^{-1}\Sigma_{a\psi}' \\
&\quad - 2\Sigma_{a\psi}\Omega_0^{-1}Cov(\sqrt{N}\bar\psi_N, \sqrt{N}\bar a_N') + 2\Sigma_{a\psi}\Omega_0^{-1}\Gamma_0 I(\theta_0)^{-1}\Gamma_0'\Omega_0^{-1}Cov(\sqrt{N}\bar\psi_N, \sqrt{N}\bar a_N') \\
&\quad - 2\Sigma_{a\psi}\Omega_0^{-1}Cov(\sqrt{N}\bar\psi_N, \sqrt{N}\bar\psi_N')\Omega_0^{-1}\Gamma_0 I(\theta_0)^{-1}\Gamma_0'\Omega_0^{-1}\Sigma_{a\psi}' \\
&= \Sigma_{aa} + \Sigma_{a\psi}\Omega_0^{-1}\Sigma_{a\psi}' + \Sigma_{a\psi}\Omega_0^{-1}\Gamma_0 I(\theta_0)^{-1}\Gamma_0'\Omega_0^{-1}\Sigma_{a\psi}' \\
&\quad - 2\Sigma_{a\psi}\Omega_0^{-1}\Sigma_{a\psi}' + 2\Sigma_{a\psi}\Omega_0^{-1}\Gamma_0 I(\theta_0)^{-1}\Gamma_0'\Omega_0^{-1}\Sigma_{a\psi}' - 2\Sigma_{a\psi}\Omega_0^{-1}\Gamma_0 I(\theta_0)^{-1}\Gamma_0'\Omega_0^{-1}\Sigma_{a\psi}' \\
&= \Sigma_{aa} - \Sigma_{a\psi}\Omega_0^{-1}\Sigma_{a\psi}' + \Sigma_{a\psi}\Omega_0^{-1}\Gamma_0 I(\theta_0)^{-1}\Gamma_0'\Omega_0^{-1}\Sigma_{a\psi}',
\end{align*}
which equals the bound given by (76). Therefore $\mu_a^*$ is efficient. Q.E.D.
'Rao-Blackwell' argument for efficiency. An alternative proof of the efficiency of $\mu_a^*$ follows from a straightforward application of the Rao-Blackwell Theorem (e.g., Rao 1973, pp. 317 - 318). For simplicity consider the case where $\psi(z_c,\theta_0)$ is scalar and $\theta_0$ is known. The unbiasedness/consistency restriction limits the class of potential estimators to those which can be written as linear combinations of $\bar a_N$ and $\bar\psi_N$. Consider one such alternative estimator $\hat\mu_a^+ = \hat\mu_a^* + \alpha\bar\psi_N$. For any $\alpha\in[0, -2Cov(\hat\mu_a^*,\bar\psi_N)/Var(\bar\psi_N)]$ this estimator will have lower variance than $\hat\mu_a^*$ unless $Cov(\hat\mu_a^*,\bar\psi_N) = 0$ since
\[
Var(\hat\mu_a^+) = Var(\hat\mu_a^*) + 2\alpha Cov(\hat\mu_a^*,\bar\psi_N) + \alpha^2 Var(\bar\psi_N) \le Var(\hat\mu_a^*)
\]
for all $Cov(\hat\mu_a^*,\bar\psi_N)$ different from zero. Therefore $Cov(\hat\mu_a^*,\bar\psi_N) = 0$ is a necessary condition for $\hat\mu_a^*$ to be efficient. Sufficiency of this condition follows by observing that for $\hat\mu_a^+ = \hat\mu_a^* + \alpha\bar\psi_N$ and $Cov(\hat\mu_a^*,\bar\psi_N) = 0$ we have
\[
E[\hat\mu_a^*(\hat\mu_a^* - \hat\mu_a^+)] = -\alpha E[\hat\mu_a^*\bar\psi_N] = 0,
\]
or equivalently that $Var(\hat\mu_a^*) = Cov(\hat\mu_a^*, \hat\mu_a^+)$. Letting $\rho^2 = Cov(\hat\mu_a^*,\hat\mu_a^+)/Var(\hat\mu_a^+)$, we therefore have $Var(\hat\mu_a^*) = \rho^2 Var(\hat\mu_a^+) \le Var(\hat\mu_a^+)$, with the latter constraint holding with equality only when $\hat\mu_a^+ = \hat\mu_a^*$. Q.E.D.
C  Form of GEL estimated optimal instruments and probability expansions

C.1  Proof of Theorem 5 (Form of estimated instrument, $\tilde h_c^{GEL}$)

To derive the implied GEL estimates of the optimal instrumental variable, $h_c = -E[x_c|M_c]\times E[u(g_c,\theta_0)^2|M_c]^{-1}$, given by (39), it is useful to rearrange the GEL first order conditions as in Theorem 2.3 of Newey and Smith (2004, p. 224). Following their proof we can manipulate the first order condition for $\lambda$, given by (63) above, to get
\[
0 = \sum_{c=1}^N \rho_1(\hat\lambda'\hat\psi_c)\hat\psi_c = \sum_{c=1}^N\left[\rho_1(\hat\lambda'\hat\psi_c) + 1\right]\hat\psi_c - \sum_{c=1}^N \hat\psi_c = \left[\sum_{c=1}^N \frac{\rho_1(\hat\lambda'\hat\psi_c)+1}{\hat\lambda'\hat\psi_c}\hat\psi_c\hat\psi_c'\right]\hat\lambda - \sum_{c=1}^N \hat\psi_c.
\]
Solving for $\hat\lambda$ and rescaling by $1/N$ then yields
\[
\hat\lambda = \left[\sum_{c=1}^N \hat k_c\hat\psi_c\hat\psi_c'\right]^{-1}\left[\frac{1}{N}\sum_{c=1}^N\hat\psi_c\right], \qquad (77)
\]
where $\hat k_c$ is as defined in (33). Substituting (77) into (63), the first order condition for $\theta$, and dividing by $\sum_{c=1}^N \rho_1(\hat\lambda'\hat\psi_c)$ yields
\[
\left[\sum_{c=1}^N \hat\pi_c\left(\frac{\partial\hat\psi_c}{\partial\theta'}\right)'\right]\left[\sum_{c=1}^N \hat k_c\hat\psi_c\hat\psi_c'\right]^{-1}\left[\frac{1}{N}\sum_{c=1}^N\hat\psi_c\right] = 0,
\]
where $\hat\pi_c$ is as defined in (32).

For $\psi_c = r_c u(g_c,\theta) = r_c(g_c^b - x_c'\theta)$ and $\tilde h_c = \hat\Gamma_\pi'\hat\Omega_k^{-1}r_c$ we therefore have
\[
\frac{1}{N}\sum_{c=1}^N \tilde h_c^{GEL}(g_c^b - x_c'\hat\theta) = 0,
\]
which can be rearranged to give (38) and (40). Q.E.D.
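For empirical likelihood the reweighting behind this argument can be made concrete with a scalar moment. The sketch below (illustrative data, not the paper's code) solves the $\lambda$ first order condition $\sum_c \psi_c/(1-\lambda\psi_c) = 0$ by Newton's method and confirms that the implied probabilities $\pi_c = 1/[N(1-\lambda\psi_c)]$ sum to one and annihilate the sample moment:

```python
import numpy as np

rng = np.random.default_rng(4)
psi = rng.normal(loc=0.1, size=300)     # moment contributions at some theta

lam = 0.0
for _ in range(25):
    r = 1.0 - lam * psi                 # EL requires r_c > 0
    g = np.sum(psi / r)                 # first order condition to drive to zero
    h = np.sum(psi**2 / r**2)           # its derivative in lambda (positive)
    lam -= g / h                        # Newton step

pi = 1.0 / (len(psi) * (1.0 - lam * psi))
assert np.isclose(pi.sum(), 1.0)
assert abs(np.sum(pi * psi)) < 1e-10
```

The $\hat\pi_c$ downweight observations with large positive moment contributions, which is exactly the mechanism through which GEL constructs its semiparametrically efficient estimate of the optimal instrument.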
C.2  Proof of Theorem 6 (GEL probability expansions)

The first result of the proof — the expansion for $\sum_{c=1}^N\hat\pi_c\hat a_c$ — extends Newey's (2002) result to the case where $a(z_c,\theta)$ is a matrix. Elements of this proof strategy are also used by Caner (2003, Theorem 5) and Guggenberger and Smith (2004, Theorem 4). The expansion for $\hat k_c$ is derived along the same lines.

From Newey and Smith (2004, Theorem 3.1, p. 226) we have $\sqrt{N}\hat\lambda$ bounded in probability (i.e., $\hat\lambda = O_p(N^{-1/2})$) and that $\max_{c=1,\ldots,N}|\hat\lambda'\hat\psi_c| \overset{p}{\to} 0$.
Taking a mean value expansion of $\nabla_\lambda l_N(\hat\lambda,\hat\theta) = 0$ around $\hat\lambda = 0$ we have
\[
0 = -\sum_{c=1}^N\hat\psi_c/N + \left(\sum_{c=1}^N\rho_2(\bar\lambda'\hat\psi_c)\hat\psi_c\hat\psi_c'/N\right)(\hat\lambda - 0) = -\hat{\bar\psi}_N - \Omega_0\hat\lambda + o_p(N^{-1/2}),
\]
which solving for $\hat\lambda$ yields
\[
\hat\lambda = -\Omega_0^{-1}\hat{\bar\psi}_N + o_p(N^{-1/2}). \qquad (78)
\]
Expanding $\sum_{c=1}^N\rho_1(\hat\lambda'\hat\psi_c)/N$ around $\hat\lambda = 0$ we have
\[
\sum_{c=1}^N\rho_1(\hat\lambda'\hat\psi_c)/N = -1 + \left(\sum_{c=1}^N\rho_2(\bar\lambda'\hat\psi_c)\hat\psi_c'/N\right)\hat\lambda = -1 - \hat{\bar\psi}_N'\hat\lambda + o_p(N^{-1/2}) = -1 + o_p(N^{-1/2}). \qquad (79)
\]
Let $\hat a_c = a(z_c,\hat\theta)$ be a $K\times L$ matrix, $a(z_c,\theta_0) = (a_1(z_c,\theta_0),\ldots,a_L(z_c,\theta_0))$, with $\Sigma_{a\psi}$ as defined in the statement of the Theorem. Expanding $\sum_{c=1}^N\rho_1(\hat\lambda'\hat\psi_c)\hat a_c/N$ we have
\begin{align*}
\sum_{c=1}^N\rho_1(\hat\lambda'\hat\psi_c)\hat a_c/N &= -\sum_{c=1}^N\hat a_c/N + \left[\sum_{c=1}^N\rho_2(\bar\lambda'\hat\psi_c)\hat a_c(I_L\otimes\hat\psi_c')/N\right](I_L\otimes\hat\lambda) \qquad (80) \\
&= -\sum_{c=1}^N\hat a_c/N + \left[\sum_{c=1}^N\hat a_c(I_L\otimes\hat\psi_c')/N\right](I_L\otimes\Omega_0^{-1}\hat{\bar\psi}_N) + o_p(N^{-1/2}) \\
&= -\sum_{c=1}^N\hat a_c/N + \Sigma_{a\psi}(I_L\otimes\Omega_0^{-1}\hat{\bar\psi}_N) + o_p(N^{-1/2}) \\
&= -\sum_{c=1}^N\left(\hat a_c - \Sigma_{a\psi}(I_L\otimes\Omega_0^{-1}\hat\psi_c)\right)/N + o_p(N^{-1/2}) \\
&= O_p(1).
\end{align*}
Combining (79) and (80) we have
\begin{align*}
\sum_{c=1}^N\hat\pi_c\hat a_c &= \frac{\sum_{c=1}^N\rho_1(\hat\lambda'\hat\psi_c)\hat a_c/N}{\sum_{c=1}^N\rho_1(\hat\lambda'\hat\psi_c)/N} = \frac{-\sum_{c=1}^N\left(\hat a_c - \Sigma_{a\psi}(I_L\otimes\Omega_0^{-1}\hat\psi_c)\right)/N + o_p(N^{-1/2})}{-1 + o_p(N^{-1/2})} \\
&= \sum_{c=1}^N\left(\hat a_c - \Sigma_{a\psi}(I_L\otimes\Omega_0^{-1}\hat\psi_c)\right)/N + o_p(N^{-1/2}) \\
&= \sum_{c=1}^N\left(a_c - \Sigma_{a\psi}(I_L\otimes\Omega_0^{-1}\psi_c)\right)/N + o_p(N^{-1/2}),
\end{align*}
as claimed.

Let $\hat k_c^*(\hat\lambda'\hat\psi_c) = \dfrac{\rho_1(\hat\lambda'\hat\psi_c)+1}{\hat\lambda'\hat\psi_c}$.
Using l'Hopital's rule and a slight abuse of notation we have a mean value expansion around $\hat\lambda\to 0$ of
\[
\sum_{c=1}^N\hat k_c^*(\hat\lambda'\hat\psi_c)/N = -1 + \frac{1}{2}\rho_3\hat\lambda'\hat{\bar\psi}_N + o_p(N^{-1/2}) = -1 + o_p(N^{-1/2}), \qquad (81)
\]
where the first equality uses a double application of l'Hopital's rule^{29} to evaluate the limits of $\hat k_c^*$ and its derivative as $\hat\lambda\to 0$. Again using l'Hopital's rule we have
\begin{align*}
\sum_{c=1}^N\hat k_c^*(\hat\lambda'\hat\psi_c)\hat a_c/N &= -\sum_{c=1}^N\hat a_c/N + \left[\sum_{c=1}^N\frac{1}{2}\rho_3\hat a_c(I_L\otimes\hat\psi_c')/N\right](I_L\otimes\hat\lambda) + o_p(N^{-1/2}) \qquad (82) \\
&= -\sum_{c=1}^N\left(\hat a_c + \frac{1}{2}\rho_3\Sigma_{a\psi}(I_L\otimes\Omega_0^{-1}\hat\psi_c)\right)/N + o_p(N^{-1/2}).
\end{align*}
Combining (81) and (82) we have
\begin{align*}
\sum_{c=1}^N\hat k_c\hat a_c &= \sum_{c=1}^N\left(\hat a_c + \frac{1}{2}\rho_3\Sigma_{a\psi}(I_L\otimes\Omega_0^{-1}\hat\psi_c)\right)/N + o_p(N^{-1/2}) \\
&= \sum_{c=1}^N\left(a_c + \frac{1}{2}\rho_3\Sigma_{a\psi}(I_L\otimes\Omega_0^{-1}\psi_c)\right)/N + o_p(N^{-1/2}),
\end{align*}
where $\hat k_c = k(\hat\lambda'\hat\psi_c)$ is defined by (33).

For the second part of the Theorem it suffices to show that $E[\hat a_k(I_L\otimes\bar\psi_N')] = (1+\frac{1}{2}\rho_3)\Sigma_{a\psi}/N$ (recall that $\hat a_k = \sum_{c=1}^N\hat k_c\hat a_c$). By the first part of the Theorem we have, to the order of the approximation,
\begin{align*}
E\left[\hat a_k(I_L\otimes\bar\psi_N')\right] &= E\left[\sum_{c=1}^N\left(a_c + \frac{1}{2}\rho_3\Sigma_{a\psi}(I_L\otimes\Omega_0^{-1}\psi_c)\right)/N\,(I_L\otimes\bar\psi_N')\right] \\
&= \frac{1}{N}E\left[\sum_{c=1}^N\left(a_c(I_L\otimes\bar\psi_N') + \frac{1}{2}\rho_3\Sigma_{a\psi}(I_L\otimes\Omega_0^{-1}\psi_c)(I_L\otimes\bar\psi_N')\right)\right] \\
&= \frac{1}{N}\sum_{c=1}^N\Sigma_{a\psi}/N + \frac{1}{N}E\left[\sum_{c=1}^N\frac{1}{2}\rho_3\Sigma_{a\psi}(I_L\otimes\Omega_0^{-1}\psi_c\bar\psi_N')\right] \\
&= \frac{1}{N}\Sigma_{a\psi} + \frac{1}{N}\sum_{c=1}^N\frac{1}{2}\rho_3\Sigma_{a\psi}(I_L\otimes I_S)/N \\
&= \frac{1}{N}\Sigma_{a\psi} + \frac{1}{N}\cdot\frac{1}{2}\rho_3\Sigma_{a\psi} = \frac{1}{N}\left(1+\frac{1}{2}\rho_3\right)\Sigma_{a\psi},
\end{align*}
as claimed. The remaining results follow analogously. Q.E.D.

^{29} Specifically, l'Hopital's rule is applied twice to $-\rho_1(\bar\lambda'\hat\psi_c)\hat\psi_c/(\bar\lambda'\hat\psi_c)^2$, which is undefined for $\lambda = 0$. The first application gives $-\rho_2(\bar\lambda'\hat\psi_c)\hat\psi_c\hat\psi_c'/2(\bar\lambda'\hat\psi_c)$, which is also undefined at $\lambda = 0$. The second application then gives $\rho_3(\bar\lambda'\hat\psi_c)\hat\psi_c\hat\psi_c'/2\hat\psi_c$, which evaluates to $\frac{1}{2}\rho_3\hat\psi_c'$ at $\lambda = 0$.
D  Test statistics

D.1  GEL Score statistic based on the unprofiled criterion

Consider the joint null $H_0: (\lambda',\theta')' = (0',\theta_0')'$; this combines a test for moment validity ($\lambda = 0$) with a test on the parameter of interest ($\theta = \theta_0$). Let $\beta = (\lambda',\theta')'$; the null hypothesis can be expressed as $c(\beta_0) = 0$ with $c(\beta_0) = C\beta_0 - (0',\theta_0')'$ and $C = I_{S+K}$. Using the standard M-estimator form of the score statistic given by Wooldridge (2002, p. 365), among others, we have
\[
S^{AR} = s_N(0,\theta_0)' A_0^{-g}C'(CA_0^{-g}B_0 A_0^{-g}C')^{-1}CA_0^{-g}s_N(0,\theta_0)/N \sim \chi^2_S,
\]
where $A^{-g}$ denotes a generalized inverse of $A$. Using (69), (70) and (72) to substitute in for $s_N(0,\theta_0)$, $A_0$ and $B_0$ respectively, the fact that $\Omega_0$ is a generalized inverse of $P_0$, and multiplying out yields $S^{AR} = N\cdot\bar\psi_N(\theta_0)'\Omega_0^{-1}\bar\psi_N(\theta_0)$, which, with $\Omega_0$ replaced by $\Omega_\pi(\theta_0)$, is the statistic given by (51).

If we wish to test $H_0: \lambda = 0$ conditional on $\theta = \theta_0$ we can express the null as $c(\beta) = C\beta$ with $C = (I_S, 0)$. Using the fact that $CA_0^{-1} = (P_0, -\Omega_0^{-1}\Gamma_0 I(\theta_0)^{-1})$ and $CA_0^{-1}B_0 A_0^{-1}C' = P_0$ we have a score statistic of
\begin{align*}
S^J &= N\cdot\bar\psi_N(\theta_0)'P_0\bar\psi_N(\theta_0) \\
&= N\cdot\bar\psi_N(\theta_0)'\Omega_0^{-1}\bar\psi_N(\theta_0) - N\cdot\bar\psi_N(\theta_0)'\Omega_0^{-1}\Gamma_0(\Gamma_0'\Omega_0^{-1}\Gamma_0)^{-1}\Gamma_0'\Omega_0^{-1}\bar\psi_N(\theta_0) \\
&= S^{AR} - S^K \sim \chi^2_{S-K},
\end{align*}
which, replacing $\Gamma_0$ and $\Omega_0$ with $\Gamma_\pi(\theta_0)$ and $\Omega_\pi(\theta_0)$, gives the expression given in the text.
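As a quick calibration check (an illustrative design, not one of the paper's experiments), the following sketch simulates $S^{AR} = N\bar\psi_N'\hat\Omega^{-1}\bar\psi_N$ under a correctly specified null and compares its average with the mean of the $\chi^2_S$ reference distribution:

```python
import numpy as np

rng = np.random.default_rng(5)
S_dim, N, reps = 3, 400, 500
stats = []
for _ in range(reps):
    psi = rng.normal(size=(N, S_dim))     # mean-zero moments under the null
    psibar = psi.mean(axis=0)
    Om = psi.T @ psi / N                  # Omega_hat evaluated at the null
    stats.append(N * psibar @ np.linalg.inv(Om) @ psibar)
stats = np.array(stats)

# chi^2_3 has mean 3; the simulated statistic should be close on average
assert abs(stats.mean() - S_dim) < 0.5
```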
D.2  GEL Score statistic based on the profiled criterion

From Theorem 2 we have
\[
\frac{s_N^p(\tilde\theta)'}{\sqrt{N}}\, I(\theta_0)^{-1}\,\frac{s_N^p(\tilde\theta)}{\sqrt{N}} \sim \chi^2_K;
\]
replacing $I(\theta_0)^{-1}$ with $(\tilde\Gamma_\pi'\tilde\Omega_\pi^{-1}\tilde\Gamma_\pi)^{-1}$ gives (53). From (63) and (65) we have
\begin{align*}
\frac{s_N^p(\tilde\theta)}{\sqrt{N}} &= \frac{1}{\sqrt{N}}\sum_{c=1}^N\rho_1(\tilde\lambda'\tilde\psi_c)\left(\frac{\partial\tilde\psi_c}{\partial\theta'}\right)'\tilde\lambda \\
&= -\sqrt{N}\left[\sum_{c=1}^N\tilde\pi_c\left(\frac{\partial\tilde\psi_c}{\partial\theta'}\right)\right]'\tilde\lambda + \sqrt{N}o_p(N^{-1/2}) \\
&= -\sqrt{N}\tilde\Gamma_\pi'\tilde\lambda + o_p(1),
\end{align*}
where the second line uses that $\sum_{c=1}^N\rho_1(\tilde\lambda'\tilde\psi_c)/N = -1 + o_p(N^{-1/2})$, as given by (79) above. Substituting into (53) gives the Guggenberger and Smith (2004) form of the score statistic (54). Using the fact that $\sqrt{N}\tilde\lambda = -\sqrt{N}\Omega_0^{-1}\tilde{\bar\psi}_N + o_p(1)$ (c.f., (78) above) we also have $s_N^p(\tilde\theta)/\sqrt{N} = \sqrt{N}\tilde\Gamma_\pi'\Omega_0^{-1}\tilde{\bar\psi}_N + o_p(1)$; substituting into (53) and replacing $\Omega_0^{-1}$ with $\tilde\Omega_\pi^{-1}$ gives the Kleibergen (2005) and Newey and Windmeijer (2005) form of the score statistic (55).
D.3  Tilting parameter test

To motivate the robust estimate of $\Omega_0$ suggested by Imbens, Spady and Johnson (1998), begin by taking a mean value expansion of the first order condition for $\lambda$ — given by (63) above — around $\hat\lambda = \lambda_*$, holding $\theta$ fixed at its estimated value. This yields, after some rearranging,
\[
\sqrt{N}(\tilde\lambda(\hat\theta) - \lambda_*) = -\left[\frac{1}{N}\sum_{c=1}^N\rho_2(\bar\lambda'\hat\psi_c)\hat\psi_c\hat\psi_c'\right]^{-1}\sum_{c=1}^N\rho_1(\lambda_*'\hat\psi_c)\hat\psi_c/\sqrt{N},
\]
suggesting that $\sqrt{N}(\tilde\lambda(\hat\theta) - \lambda_*)$ has an approximate variance of
\[
\hat R^{-1} = \left[\frac{1}{N}\sum_{c=1}^N\rho_2\hat\psi_c\hat\psi_c'\right]^{-1}\times\left[\frac{1}{N}\sum_{c=1}^N\rho_1^2\hat\psi_c\hat\psi_c'\right]\times\left[\frac{1}{N}\sum_{c=1}^N\rho_2\hat\psi_c\hat\psi_c'\right]^{-1},
\]
where both $\bar\lambda$ and $\lambda_*$ have been replaced with $\tilde\lambda_N(\hat\theta)$ for the purposes of computation. Under correct specification $\lambda_* = 0$; under this null the IS statistic, using the above variance formula, is $N\tilde\lambda_N(\hat\theta)'\hat R\,\tilde\lambda_N(\hat\theta)$ with an (approximate) $\chi^2_{S-K}$ distribution. Pagan and Robertson (1997, pp. 28 - 30) discuss this derivation in more detail.
The derivative of $\tilde\lambda_N(\theta_0)$ with respect to $\theta$ has an interesting interpretation. From (66) the Jacobian associated with the null restriction $\tilde\lambda_N(\theta_0) = 0$ equals
\[
\frac{\partial\tilde\lambda_N(\theta_0)}{\partial\theta'} = -\left(\frac{\nabla_{\lambda\lambda}l_N}{N}\right)^{-1}\frac{\nabla_{\lambda\theta}l_N}{N},
\]
where $\nabla_{\lambda\lambda}l_N/N = -\Omega_0 + o_p(N^{-1/2})$ and
\[
\underset{S\times K}{\nabla_{\lambda\theta}l_N(\tilde\lambda_N(\theta_0),\theta_0)/N} = \frac{1}{N}\sum_{c=1}^N\left[\rho_1(\tilde\lambda'\psi_c) + \rho_2(\tilde\lambda'\psi_c)\psi_c\tilde\lambda'\right]\frac{\partial\psi_c}{\partial\theta'} = -\frac{1}{N}\sum_{c=1}^N\left[\frac{\partial\psi_c}{\partial\theta'} - \Sigma_{\Gamma\psi}(I_K\otimes\Omega_0^{-1}\psi_c)\right] + o_p(N^{-1/2}),
\]
and hence $\nabla_\theta\tilde\lambda_N(\theta_0) = -\Omega_0^{-1}\left[\Gamma_N - \Sigma_{\Gamma\psi}(I_K\otimes\Omega_0^{-1}\bar\psi_N)\right] + o_p(N^{-1/2})$: the product of $-\Omega_0^{-1}$ and an efficient estimate of the Jacobian matrix.
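A minimal sketch of the sandwich variance $\hat R^{-1}$ behind the IS statistic, specialized to empirical likelihood ($\rho_1(v) = -1/(1-v)$, $\rho_2(v) = -1/(1-v)^2$). At $\lambda = 0$ the sandwich collapses to $\hat\Omega^{-1}$, which provides a convenient check; data and dimensions are illustrative:

```python
import numpy as np

def R_inv(lam, psi):
    v = psi @ lam
    r1 = -1.0 / (1.0 - v)
    r2 = -1.0 / (1.0 - v) ** 2
    N = psi.shape[0]
    H = (psi * r2[:, None]).T @ psi / N          # (1/N) sum rho2 psi psi'
    V = (psi * (r1**2)[:, None]).T @ psi / N     # (1/N) sum rho1^2 psi psi'
    Hi = np.linalg.inv(H)
    return Hi @ V @ Hi                            # sandwich R_hat^{-1}

rng = np.random.default_rng(6)
psi = rng.normal(size=(250, 2))
Om = psi.T @ psi / psi.shape[0]
# at lambda = 0 the sandwich reduces to Omega_hat^{-1}
assert np.allclose(R_inv(np.zeros(2), psi), np.linalg.inv(Om))
```

Away from $\lambda = 0$ the two sides differ, which is precisely the robustness the IS statistic exploits.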
E  Computation

Each of the suggested algorithms applies a gradient-based optimization procedure to the profiled GEL criterion function. This requires concentrating out the $S\times 1$ vector of tilting parameters, $\lambda$, for each iteration of $\theta$. Fortunately this is straightforward and can be done quickly.

For a given value of $\theta$, compute $\tilde\lambda_N(\theta)$ to form $l_N^p(\theta)$, the profiled GEL criterion function. For the continuous updating (CU) estimator $\tilde\lambda_N(\theta)$ exists in closed form, but otherwise it must be solved for numerically by using the first and second derivatives of $l_N(\lambda,\theta)$ with respect to $\lambda$ — given by (63) and (64) above — to implement a Newton-Raphson procedure. To get starting values for this procedure take a Taylor expansion of (63) around $\tilde\lambda_N(\theta) = 0$ and solve for
\[
\tilde\lambda_N^*(\theta) = -\left(\sum_{c=1}^N\psi_c(\theta)\psi_c(\theta)'\right)^{-1}\left(\sum_{c=1}^N\psi_c(\theta)\right).
\]
Note that $\tilde\lambda_N^*(\theta) = \tilde\lambda_N(\theta)$ for CU since the quadratic approximation is exact in this case.
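The starting value $\tilde\lambda_N^*(\theta)$ is one line of linear algebra. The sketch below (illustrative data) computes it and verifies that it solves the CU first order condition $\sum_c\rho_1(\lambda'\psi_c)\psi_c = 0$ exactly, where for CU $\rho_1(v) = -1 - v$:

```python
import numpy as np

rng = np.random.default_rng(7)
psi = rng.normal(loc=0.05, size=(300, 2))   # psi_c(theta) stacked as (N, S)

# lambda*_N(theta) = -(sum psi psi')^{-1}(sum psi)
lam_star = -np.linalg.solve(psi.T @ psi, psi.sum(axis=0))

# CU first order condition: sum_c (-1 - lam' psi_c) psi_c = 0
foc = -(psi.sum(axis=0) + psi.T @ psi @ lam_star)
assert np.allclose(foc, 0.0, atol=1e-6)
```

For EL or ET the same vector serves only as a Newton starting point; one or two Newton steps using (63)-(64) then typically suffice to solve the inner problem to high precision.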
p
The profiled GEL criterion function, lN
(θ), can be combined with the expression for its score
vector given by (65) to implement any of a number of gradient-based optimization procedures.30
I have found that the BFGS quasi-Newton method, as implemented by MATLAB 6.0’s fminunc()
function, works well in practice and this is the method used in the Monte Carlo experiments
undertaken for the paper. If the implementation allows for a user-specified initial approximation to
the Hessian matrix one should use N × ΓN (θ)ΩN (θ)−1 ΓN (θ), where θ are the starting values (e.g.,
2SLS estimates).
One can also use the expression given for the Hessian, equation (67) above, to implement a
Newton-Raphson procedure where the i + 1 updates to θ equal
p
θ (i+1) = θ(i) − [HN
(θ (i) )]−1 spN (θ (i) ),
p
(θ (i) ) with N ×ΓN (θ (i) )ΩN (θ (i) )−1 ΓN (θ(i) )
with iteration continuing until θ (i+1) ' θ (i) . Replacing HN
I have found that the version of spN (θ) that does note take advantage of the simplification due to the Envelope
Theorem, given in the first line of (65), works best is practice.
30
47
p
results in a GEL analog of Fisher-scoring. If HN
(θ (i) ) is not positive definite at any point during
the optimization procedure it can be replaced with N × ΓN (θ (i) )ΩN (θ (i) )−1 ΓN (θ(i) ) for a single
iteration, resulting in a hybrid Newton-Raphson/Fisher-scoring algorithm that works quite well
and is numerically stable in a variety of settings.
For the empirical likelihood (EL) case, following Owen (2001, p. 235), I replace the $\rho(v)$ function with
\[
\rho^*(v) = \begin{cases} \ln(1-v) & v \le 1 - N^{-1} \\ \ln(N^{-1}) - 1.5 + 2(1-v)N - \frac{1}{2}(1-v)^2N^2 & v > 1 - N^{-1} \end{cases}.
\]
For $\max_c \pi_c(\hat\theta) \le 1$, $\rho^*(v)$ will result in the same solution as using $\rho(v)$, with the advantage that $\rho^*(v)$ has unrestricted domain, while using $\rho(v)$ requires optimization to be conducted under the maintained restriction that $1 - \lambda'\psi_c(\theta) > 0$ for all $c = 1,\ldots,N$. The derivatives of $\rho(v)$ that enter the score vector are modified accordingly. For CU and ET $\rho(v)$ is as given in the main text.
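A sketch of this pseudo-logarithm (assuming the piecewise form above), together with checks that the value and, numerically, the first derivative match across the knot $v = 1 - 1/N$:

```python
import numpy as np

def rho_star(v, N):
    # log(1 - v) below the knot; quadratic extension above it
    v = np.asarray(v, dtype=float)
    knot = 1.0 - 1.0 / N
    quad = (np.log(1.0 / N) - 1.5 + 2.0 * (1.0 - v) * N
            - 0.5 * (1.0 - v) ** 2 * N ** 2)
    safe = np.where(v <= knot, 1.0 - v, 1.0)   # guard log against v > knot
    return np.where(v <= knot, np.log(safe), quad)

N = 50
knot = 1.0 - 1.0 / N
eps = 1e-7
assert np.isclose(rho_star(knot, N), np.log(1.0 / N))   # value is continuous
left = (rho_star(knot, N) - rho_star(knot - eps, N)) / eps
right = (rho_star(knot + eps, N) - rho_star(knot, N)) / eps
assert np.isclose(left, right, rtol=1e-3)               # slope is continuous
```

Because the quadratic branch also matches the second derivative at the knot, Newton steps behave smoothly even when a trial $\lambda$ briefly pushes some $1-\lambda'\psi_c$ toward zero.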
F  Additional details on the Monte Carlo experiments

In order to form the optimal instrument for the case where $y_c|M_c$ is multivariate normal with a covariance matrix given by (3), the following results on scaled $\chi^2_p$ random variables are useful (these facts are easy to verify using Stein's Lemma; Casella and Berger 1990, pp. 188 - 189):
\[
E[\lambda\chi^2_p] = p\lambda, \qquad Var(\lambda\chi^2_p) = 2p\lambda^2, \qquad E\left[\left(\lambda\chi^2_p - E[\lambda\chi^2_p]\right)^3\right] = 8p\lambda^3.
\]
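These moment facts are cheap to confirm by simulation (values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
p, lam, n = 4, 0.5, 2_000_000
x = lam * rng.chisquare(p, size=n)          # scaled chi-square draws

assert abs(x.mean() - p * lam) < 0.01       # E = p * lam = 2
assert abs(x.var() - 2 * p * lam**2) < 0.05 # Var = 2 p lam^2 = 2
m3 = np.mean((x - x.mean()) ** 3)
assert abs(m3 - 8 * p * lam**3) < 0.2       # third central moment = 4
```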
Under joint normality of $\alpha_c$ and $\varepsilon_c$ the within- and between-group 'squares', $g_c^w$ and $g_c^b$ respectively, are independent scaled $\chi^2$ random variables:
\[
g_c^w|M_c \sim \frac{\sigma^2(M_c)}{M_c}\frac{1}{M_c-1}\chi^2_{M_c-1}, \qquad g_c^b|M_c \sim \left(\sigma^2_\alpha + \gamma^2\frac{\sigma^2(M_c)}{M_c}\right)\chi^2_1.
\]
To calculate conditional moments of the residual — $u(g_c,\theta)$ — it is convenient to rearrange $u(g_c,\theta) = g_c^b - \sigma^2_\alpha - \gamma^2 g_c^w$ as
\[
u(g_c,\theta) = (g_c^b - E[g_c^b|M_c]) - \gamma^2(g_c^w - E[g_c^w|M_c]),
\]
where $E[g_c^b|M_c] = \sigma^2_\alpha + \gamma^2\sigma^2(M_c)/M_c$ and $E[g_c^w|M_c] = \sigma^2(M_c)/M_c$. Using independence of $g_c^w$ and $g_c^b$ and the results on the centered moments of a scaled $\chi^2$ random variable given above we have:
\[
E\left[u(g_c,\theta)^2|M_c\right] = 2\left(\sigma^2_\alpha + \gamma^2\frac{\sigma^2(M_c)}{M_c}\right)^2 + 2\,\frac{\gamma^4\sigma^4(M_c)}{M_c^2}\frac{1}{M_c-1},
\]
\[
E\left[u(g_c,\theta)^3|M_c\right] = 8\left(\sigma^2_\alpha + \gamma^2\frac{\sigma^2(M_c)}{M_c}\right)^3 - 8\,\frac{\gamma^6\sigma^6(M_c)}{M_c^3}\frac{1}{(M_c-1)^2}.
\]
Using the form of the optimal instrument given by Chamberlain (1987, 1992) and Newey (1993) for the conditional moment $E[u(g_c,\theta)|M_c] = 0$ and the above results we have
\begin{align*}
h_c &= -E[x_c|M_c]\times E\left[u(g_c,\theta)^2|M_c\right]^{-1} = -(1, E[g_c^w|M_c])'\times E\left[u(g_c,\theta)^2|M_c\right]^{-1} \\
&= -\left(1, \frac{\sigma^2(M_c)}{M_c}\right)'\times\left[2\left(\sigma^2_\alpha + \gamma^2\frac{\sigma^2(M_c)}{M_c}\right)^2 + 2\,\frac{\gamma^4\sigma^4(M_c)}{M_c^2}\frac{1}{M_c-1}\right]^{-1}.
\end{align*}
References
Abowd, John M. and David Card. (1989). “On the covariance structure of earnings and hours
changes,” Econometrica 57 (2): 411 - 445.
Altonji, Joseph G. and Lewis M. Segal. (1996). “Small-sample bias in GMM estimation of
covariance structures,” Journal of Business and Economic Statistics 14 (3): 353 - 366.
Amemiya, Takeshi. (1985). Advanced Econometrics. Cambridge, MA: Harvard University
Press.
Back, Kerry and David P. Brown. (1993). “Implied probabilities in GMM estimators,” Econometrica 61 (4): 971 - 975.
Bound, John, David A. Jaeger and Regina M. Baker. (1995). “Problems with instrumental
variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak,” Journal of the American Statistical Association 90 (430): 443 - 450.
Blundell, Richard and Stephen Bond. (1998). “Initial conditions and moment restrictions in
dynamic panel data models,” Journal of Econometrics 87 (1): 115 - 143.
Bond, Stephen, Clive Bowsher and Frank Windmeijer. (2001). “Criterion-based inference for
GMM in autoregressive panel data models,” Economics Letters 73 (3): 379 - 388.
Bond, Stephen and Frank Windmeijer. (2002). “Finite sample inference for GMM estimators
in linear panel data models,” CEMMAP Working Paper CWP04/02.
Brown, Bryan W. and Whitney K. Newey. (1998). “Efficient semiparametric estimation of
expectations,” Econometrica 66 (2): 453 - 464.
Caner, Mehmet. (2003). “Exponential tilting with weak instruments: estimation and testing,”
Mimeo, University of Pittsburgh.
Casella, George and Roger L. Berger (1990). Statistical Inference. Belmont, CA: Duxbury
Press.
Chamberlain, Gary. (1982). “Multivariate regression models for panel data,” Journal of
Econometrics 18 (1): 5 - 46.
Chamberlain, Gary. (1984). “Panel data," Handbook of Econometrics 2: 1247 - 1318 (Z.
Griliches & M.D. Intriligator, Eds.). Amsterdam: North-Holland.
Chamberlain, Gary. (1987). “Asymptotic efficiency in estimation with conditional moment
restrictions,” Journal of Econometrics 34 (3): 305 - 334.
Chamberlain, Gary. (1992). “Efficiency bounds for semiparametric regression,” Econometrica
60 (3): 567 - 596.
Clark, Todd E. (1996). “Small-sample properties of estimators of nonlinear models of covariance structure,” Journal of Business and Economic Statistics 14 (3): 367 - 373.
Corcoran, S.A., A.C. Davison and R.H. Spady. (1995). “Reliable inference from empirical
likelihoods,” Mimeo, Nuffield College, Oxford University.
Ferguson, Thomas S. (1996). A Course in Large Sample Theory. London: Chapman & Hall.
Gourieroux, Christian and Alain Monfort. (1993). “Pseudo-likelihood methods,” Handbook
of Statistics 11 : 335 - 362 (G.S. Maddala, C.R. Rao & H.D. Vinod, Eds.). Amsterdam:
North-Holland.
Graham, Bryan S. (2005). “Identifying social interactions through excess variance contrasts,”
Mimeo, Harvard University.
Guggenberger, Patrik. (2005a). “Monte Carlo evidence suggesting a no moment problem of
the continuous updating estimator,” Economics Bulletin 3 (13): 1 - 6.
Guggenberger, Patrik. (2005b). “Finite-sample evidence suggesting a heavy tail problem of
the generalized empirical likelihood estimator,” Mimeo, UCLA.
Guggenberger, Patrik and Jinyong Hahn. (2005). “Finite sample properties of the 2-step
empirical likelihood estimator,” Mimeo, UCLA.
Guggenberger, Patrik and Richard J. Smith. (2004). “Generalized empirical likelihood estimators and tests under partial, weak and strong identification,” Mimeo, UCLA.
Hahn, Jinyong and Jerry Hausman. (2002). “Notes on bias in estimators for simultaneous
equations models,” Economics Letters 75 (2): 237 - 241.
Hahn, Jinyong, Jerry Hausman, and Guido Kuersteiner. (2004). “Estimation with weak instruments: accuracy of higher-order bias and MSE approximations,” Econometrics Journal
7 (1): 272 - 306.
Hansen, Lars Peter. (1982). “Large sample properties of generalized method of moments
estimators," Econometrica 50 (4): 1029 - 1054.
Hansen, Lars Peter, John Heaton and Amir Yaron. (1996). “Finite-sample properties of some
alternative GMM estimators,” Journal of Business and Economic Statistics 14 (3): 262 - 280.
Horowitz, Joel L. (1998). “Bootstrap methods for covariance structures,” Journal of Human
Resources 33 (1): 39 - 61.
Imbens, Guido W. (1997). “One-step estimators of over-identified generalized method of
moments models,” Review of Economic Studies 64 (3): 359 - 383.
Imbens, Guido W. (2002). “Generalized method of moments and empirical likelihood,” Journal of Business and Economic Statistics 20 (4): 493 - 506.
Imbens, Guido W. and Tony Lancaster. (1994). “Combining micro and macro data in microeconometric models,” Review of Economic Studies 61 (4): 655 - 680.
Imbens, Guido W. and Richard Spady (2002). “Confidence intervals in generalized method
of moments models,” Journal of Econometrics 107 (1): 87 - 98.
Imbens, Guido W, Richard Spady and Phillip Johnson (1998). “Information theoretic approaches to inference in moment condition models,” Econometrica 66 (2): 333 - 357.
Kitamura, Yuichi and Michael Stutzer. (1997). “An information-theoretic alternative to generalized method of moments estimation,” Econometrica 65 (4): 861 - 874.
Kitamura, Yuichi, Gautam Tripathi and Hyungtaik Ahn. (2004). “Empirical likelihood-based
inference in conditional moment restriction models,” Econometrica 72 (6): 1667 - 1714.
Kleibergen, Frank. (2005). “Testing parameters in GMM without assuming that they are
identified,” Econometrica, forthcoming.
MaCurdy, Thomas E. (1982). “The use of time series processes to model the error structure
of earnings in a longitudinal data analysis,” Journal of Econometrics 18 (1): 83 - 114.
Mittelhammer, Ron C., George Judge and Ron Schoenberg. (2005). “Empirical evidence concerning the finite sample performance of EL-type structural equation estimation and inference
methods,” Identification and Inference for Econometric Models: A Festschrift in Honor of
Thomas Rothenberg: 282 - 305 (D.W.K Andrews & J.H. Stock, Eds.). Cambridge: Cambridge
University Press.
Newey, Whitney K. (1993). “Efficient estimation of models with conditional moment restrictions,” Handbook of Statistics 11 : 419 - 454 (G.S. Maddala, C.R. Rao & H.D. Vinod, Eds.).
Amsterdam: North-Holland.
Newey, Whitney. (2002). “Notes on GMM and Generalized Empirical Likelihood,” Mimeo,
MIT.
Newey, Whitney K. and Daniel McFadden. (1994). “Large sample estimation and hypothesis
testing,” Handbook of Econometrics 4: 2111 - 2245 (R.F. Engle & D.F. McFadden, Eds.).
Amsterdam: North-Holland.
Newey, Whitney K. and Richard J. Smith. (2004). “Higher order properties of GMM and
generalized empirical likelihood estimators,” Econometrica 72 (1): 219 - 255.
Newey, Whitney K. and Kenneth D. West. (1987). “Hypothesis testing with efficient method
of moments estimation,” International Economic Review 28 (3): 777 - 787.
Newey, Whitney K. and Frank Windmeijer. (2005). “Many weak moment asymptotics for
generalized empirical likelihood estimators,” Mimeo, MIT.
Owen, Art B. (2001). Empirical Likelihood. New York: Chapman & Hall/CRC.
Pagan, Adrian R. and John Robertson. (1997). “GMM and its problems,” Mimeo, ANU.
Qin, Jing and Jerry Lawless. (1994). “Empirical likelihood and general estimating equations,”
Annals of Statistics 22 (1): 300 - 325.
Ragusa, Giuseppe. (2005). “Alternatives to GMM: properties of minimum divergence estimators,” Mimeo, UCSD.
Ramalho, Joaquim J.S. (2005). “Small sample bias of alternative estimation methods for moment condition models: Monte Carlo evidence for covariance structures,” Studies in Nonlinear
Dynamics and Econometrics 9 (1) Article 3.
Rao, C. Radhakrishna. (1973). Linear Statistical Inference and Its Applications, 2nd Ed. New
York: John Wiley and Sons.
Schennach, Susanne M. (2003). “Exponentially tilted empirical likelihood,” Mimeo, University
of Chicago.
Smith, Richard J. (1997). “Alternative semi-parametric likelihood approaches to generalized
method of moments estimation,” Economic Journal 107 (443): 503 - 519.
Smith, Richard J. (2000). “Empirical likelihood estimation and inference,” Applications of
Differential Geometry to Econometrics: 119 - 150 (P. Marriot & M. Salmon, Eds.). Cambridge: Cambridge University Press.
Staiger, Douglas and James H. Stock. (1997). “Instrumental variables regression with weak
instruments,” Econometrica 65 (3): 557 - 586.
Stock, James H. and Jonathan H. Wright. (2000). “GMM with weak identification,” Econometrica 68 (5): 1055 - 1096.
Stock, James H., Jonathan H. Wright and Motohiro Yogo. (2002). “A survey of weak instruments and weak identification in generalized method of moments,” Journal of Business and
Economic Statistics 20 (4): 518 - 529.
Tauchen, George. (1986). “Statistical properties of generalized method-of-moments estimators of structural parameters obtained from financial market data,” Journal of Business and
Economic Statistics 4 (4): 397 - 416.
van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge: Cambridge University Press.
Wooldridge, Jeffrey M. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: The MIT Press.
Table 1: Individual asymptotic bias components for γ2 under weak, strong and very strong identification (multivariate normal DGP)

                          B*_O       B*_Γ       B*_Ω       B*_W
Panel A: μ2/(S − 1) = 5
  S = 2                     −          −          −          −
  S = 4                     −          −          −          −
  S = 7                  −0.0000    −0.2028    −0.0670    −0.0017
  S = 12                 −0.0000    −0.2465    −0.1751    −0.0011
Panel B: μ2/(S − 1) = 20
  S = 2                  −0.0000       −          −          −
  S = 4                  −0.0000    −0.0553    −0.0413    −0.0003
  S = 7                  −0.0001    −0.1092    −0.2040    −0.0028
  S = 12                 −0.0001    −0.1347    −0.4310    −0.0032
Panel C: μ2/(S − 1) = 50
  S = 2                  −0.0000       −          −          −
  S = 4                  −0.0001    −0.0434    −0.2200    −0.0008
  S = 7                  −0.0001    −0.0533    −0.2658    −0.0024
  S = 12                 −0.0001    −0.0894    −0.7440    −0.0033

Notes: The table reports the asymptotic bias of γ̂2 due to each of the four components given by (46) for each of the twelve Monte Carlo designs. The reported bias values are divided by the asymptotic standard error of γ̂2 (i.e., B*_O = (B_O/N)/[I(θ0)^{-1}_γγ/N]^{1/2}, etc.).
Table 2: Median bias of competing estimates of γ2 under weak, strong and very strong identification (multivariate normal DGP)

               S = 2               S = 4               S = 7               S = 12
           BiasMC   BiasA     BiasMC   BiasA     BiasMC   BiasA     BiasMC   BiasA
Panel A: μ2/(S − 1) = 5
GMM           −       −          −       −      −0.2106  −0.2716   −0.3255  −0.4227
ET            −       −          −       −      −0.0281  −0.0335   −0.0343  −0.0876
EL            −       −          −       −      −0.0004  −0.0000    0.0002  −0.0001
QML           −       −          −       −      −0.0267  −0.0000   −0.0012  −0.0001
Oracle        −       −          −       −       0.0049  −0.0000    0.0050  −0.0001
2SLS          −       −          −       −      −0.1796     −      −0.2203     −
Panel B: μ2/(S − 1) = 20
GMM        0.0080  −0.0000   −0.0901  −0.0970   −0.2021  −0.3160   −0.3867  −0.5689
ET         0.0080  −0.0000   −0.0334  −0.0207   −0.0804  −0.1021   −0.2149  −0.2156
EL         0.0080  −0.0000   −0.0240  −0.0000   −0.0371  −0.0001   −0.1412  −0.0001
QML        0.0080  −0.0000   −0.0198  −0.0000   −0.0014  −0.0001   −0.0378  −0.0001
Oracle     0.0080  −0.0000   −0.0181  −0.0000   −0.0017  −0.0001   −0.0359  −0.0001
2SLS       0.0080  −0.0000   −0.0732     −      −0.1038     −      −0.1540     −
Panel C: μ2/(S − 1) = 50
GMM        0.0063  −0.0000   −0.2014  −0.2642   −0.2054  −0.3216   −0.5468  −0.8369
ET         0.0063  −0.0000   −0.0885  −0.1100   −0.1124  −0.1330   −0.3622  −0.3721
EL         0.0063  −0.0000    0.0003  −0.0001   −0.0617  −0.0001   −0.2232  −0.0001
QML        0.0063  −0.0000   −0.0294  −0.0001   −0.0045  −0.0001   −0.0447  −0.0001
Oracle     0.0063  −0.0000    0.0333  −0.0001   −0.0092  −0.0001   −0.0479  −0.0001
2SLS       0.0063  −0.0000   −0.0459     −      −0.0594     −      −0.1484     −

Notes: The table reports Monte Carlo estimates of scaled median bias (BiasMC) for competing estimates of γ2 (i.e., the median bias of γ̂2 divided by its asymptotic standard error, [I(θ0)^{-1}_γγ/N]^{1/2}). Reported statistics are based on B = 5,000 successful Monte Carlo draws. The column labelled BiasA gives the asymptotic bias of γ̂2, again divided by [I(θ0)^{-1}_γγ/N]^{1/2}, as calculated analytically using Newey and Smith's (2004) expansions. The standard error of the median bias estimates is approximately [(π/2)/5000]^{1/2} ≈ 0.01772454 since, for N and B large enough, γ̂2/[I(θ0)^{-1}_γγ/N]^{1/2} →d N(BiasA, 1) and √B(BiasMC − BiasA) →d N(0, π/2) (c.f., Ferguson 1996, Theorem 13, p. 89).
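The [(π/2)/5000]^{1/2} figure in the notes follows from the asymptotic normality of the sample median and is easy to verify numerically. A minimal sketch (Python standard library only; the simulation settings below are illustrative, not the paper's actual designs):

```python
import math
import random
import statistics

B = 5000  # Monte Carlo draws per design, as in the table

# Standard error of a sample median of B draws from N(bias, 1):
# sqrt((pi/2)/B), the value quoted in the notes.
se_median = math.sqrt((math.pi / 2) / B)
print(round(se_median, 8))  # 0.01772454

# Simulation check: the dispersion of medians of B standard normal draws
# should be close to the analytical standard error above.
random.seed(0)
medians = [statistics.median(random.gauss(0.0, 1.0) for _ in range(B))
           for _ in range(200)]
print(abs(statistics.stdev(medians) - se_median) < 0.005)
```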
Table 3: (Monte Carlo) median bias relative to theoretical predictions of Newey and Smith (2004) (multivariate normal DGP)

               GMM            ET             EL             QML
Panel A
r.m.s.e.      8.0032         1.3356         5.4783         1.4655
             [pv = 0.00]    [pv = 0.07]    [pv = 0.00]    [pv = 0.03]
Panel B
α_Γ           0.8414        −0.1224        −0.1537         0.0371
             (0.0530)       (0.0612)       (0.1017)       (0.0727)
α_Ω           0.6293         0.4964         0.3063         0.0413
             (0.0210)       (0.0243)       (0.0404)       (0.0289)
r.m.s.e.      0.9040         1.0447         1.7351         1.2401
             [pv = 0.56]    [pv = 0.36]    [pv = 0.01]    [pv = 0.16]

Notes: The table reports coefficients and root mean squared error (r.m.s.e.) from weighted least squares regressions of the BiasMC estimates reported in Table 2 onto the corresponding theoretical bias components reported in Table 1. The weights equal [(π/2)/5000]^{-1/2} (see the notes to Table 2 for a discussion of why this is a reasonable estimate of the inverse standard error of BiasMC). Panel A reports the r.m.s.e. from a regression of BiasMC onto B*_O, B*_Γ, B*_Ω and B*_W where the coefficients are constrained to the appropriate theoretical values given in (45). Panel B allows the coefficients on B*_Γ and B*_Ω to vary freely (with the remaining coefficients continuing to be constrained to their theoretical values). Constants are not included in any of the regressions. Regressions are performed estimator-by-estimator using data from the eight overidentified designs. Under the null that the theoretical bias expressions accurately capture the finite sample median bias properties of the given estimator, r.m.s.e. should be close to one with RSS ∼ χ²_DF, where RSS equals the residual sum-of-squares and DF the degrees of freedom in the regression. The p-value for the χ² test is reported below the r.m.s.e. values.
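The χ² calibration in the notes can be reproduced with a short calculation. For an even number of degrees of freedom the χ² survival function has a closed form, so no statistical library is needed. The sketch below recovers, approximately, the [pv = 0.07] reported for ET in Panel A from its r.m.s.e. of 1.3356, under the assumption (consistent with the reported p-values, but not stated explicitly in the notes) that r.m.s.e. = (RSS/DF)^{1/2} with DF = 8 in Panel A:

```python
import math

def chi2_sf_even_df(x, df):
    """P(X > x) for X ~ chi-squared with an EVEN number of degrees of
    freedom, via the closed-form Poisson sum (no scipy required)."""
    assert df > 0 and df % 2 == 0
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))

# ET, Panel A: r.m.s.e. = 1.3356 with DF = 8 (eight overidentified
# designs, all coefficients constrained), so RSS = r.m.s.e.^2 * DF.
rss = 1.3356 ** 2 * 8
pv = chi2_sf_even_df(rss, df=8)
print(round(pv, 3))  # ~0.075; the table reports pv = 0.07 for ET in Panel A
```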
Table 4: Median absolute error (MAE) of competing estimates of γ2 under weak, strong and very strong identification (multivariate normal DGP)

               S = 2     S = 4     S = 7     S = 12
Panel A: μ2/(S − 1) = 5
GMM              −         −      0.9824    0.9919
ET               −         −      1.1610    1.1951
EL               −         −      1.1628    1.1803
QML              −         −      1.1053    1.0885
Oracle           −         −      1.0273    0.9720
2SLS             −         −      0.9681    0.9379
Panel B: μ2/(S − 1) = 20
GMM           0.9994    1.0249    1.0319    1.1127
ET            0.9994    1.0476    1.0519    1.0884
EL            0.9994    1.0433    1.0420    1.0605
QML           0.9994    1.0205    1.0196    1.0113
Oracle        0.9994    1.0095    0.9962    1.0025
2SLS          0.9994    1.0055    0.9948    1.0191
Panel C: μ2/(S − 1) = 50
GMM           0.9913    1.0295    1.0540    1.1779
ET            0.9913    1.0144    1.0366    1.0962
EL            0.9913    1.0211    1.0304    1.0531
QML           0.9913    1.0153    0.9939    0.9965
Oracle        0.9913    1.0055    0.9926    1.0042
2SLS          0.9913    0.9813    0.9810    1.0019

Notes: The table reports Monte Carlo estimates of the (scaled) median absolute error (MAE) of competing estimates of γ2; the MAE estimates are multiplied by 1.4826 and divided by [I(θ0)^{-1}_γγ/N]^{1/2}. Reported statistics are based on B = 5,000 successful Monte Carlo draws. If the first-order asymptotic normal approximation to the sampling distribution of γ̂2 is accurate, then the scaled MAE values should, up to simulation error, equal one.
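The 1.4826 factor is the usual normal-consistency constant for median-based scale estimates: for Gaussian data the median absolute deviation equals σ·Φ^{-1}(0.75), so multiplying by 1/Φ^{-1}(0.75) converts the MAE into an estimate of σ. A one-line check (Python standard library):

```python
from statistics import NormalDist

# For X ~ N(0, sigma^2), median(|X - median|) = sigma * Phi^{-1}(0.75),
# so 1 / Phi^{-1}(0.75) is the rescaling constant used in the notes.
q75 = NormalDist().inv_cdf(0.75)  # ~0.6745
print(round(1 / q75, 4))  # 1.4826
```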
Table 5: Standard deviation of competing estimates of γ2 under weak, strong and very strong identification (multivariate normal DGP)

                 S = 2               S = 4               S = 7               S = 12
             std    med(ŝe)      std    med(ŝe)      std    med(ŝe)      std    med(ŝe)
Panel A: μ2/(S − 1) = 5
GMM            −       −           −       −        0.9685   0.8586     0.9373   0.7837
ET             −       −           −       −        1.2283   0.8550     1.2701   0.7662
EL             −       −           −       −        1.2455   0.8781     1.2788   0.7937
QML            −       −           −       −        1.1628   1.1127     1.1418   1.1225
Oracle         −       −           −       −        1.0675   1.0000     1.0253   1.0000
2SLS           −       −           −       −        0.9664   0.9154     0.9307   0.8937
Panel B: μ2/(S − 1) = 20
GMM         1.0705   1.0156     1.0105   0.9439     1.0110   0.8909     1.0165   0.8342
ET          1.0705   1.0156     1.0485   0.9410     1.0673   0.8821     1.0844   0.8107
EL          1.0705   1.0156     1.0598   0.9529     1.0621   0.9037     1.0732   0.8428
QML         1.0705   1.0156     1.0294   1.0157     1.0270   1.0183     1.0233   1.0066
Oracle      1.0705   1.0000     1.0254   1.0000     1.0110   1.0000     1.0011   1.0000
2SLS        1.0705   1.0156     1.0086   0.9738     0.9813   0.9571     0.9747   0.9472
Panel C: μ2/(S − 1) = 50
GMM         1.0112   0.9920     1.0342   0.9520     1.0339   0.9004     1.0225   0.8374
ET          1.0112   0.9920     1.0414   0.9508     1.0371   0.8901     1.0467   0.8092
EL          1.0112   0.9920     1.0359   0.9606     1.0414   0.9157     1.0532   0.8400
QML         1.0112   0.9920     1.0265   0.9918     1.0047   0.9887     1.0083   0.9905
Oracle      1.0112   1.0000     1.0191   1.0000     1.0042   1.0000     0.9898   1.0000
2SLS        1.0112   0.9920     1.0020   0.9726     0.9928   0.9673     0.9731   0.9493

Notes: For each design, the column labelled "std" reports the ratio between the actual sampling standard deviation of γ̂2 and its asymptotic standard error (std(γ̂2)/[I(θ0)^{-1}_γγ/N]^{1/2}); the column labelled "med(ŝe)" reports the ratio between the median estimated asymptotic standard error and the true one (med(ŝe_γ̂2)/[I(θ0)^{-1}_γγ/N]^{1/2}). std(γ̂2) equals the standard deviation of the Monte Carlo estimates of γ̂2; this is a robust estimate formed by subtracting the 5th quantile from the 95th quantile of the Monte Carlo distribution of γ̂2 and dividing by 2 × 1.645. med(ŝe_γ̂2) denotes the median estimated asymptotic standard error of γ̂2 based on the conventional variance-covariance estimator.
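The robust scale estimate described in the notes (the 5th-to-95th inter-quantile range divided by 2 × 1.645) recovers σ for normal data, since the 95th percentile of N(0, σ²) sits at roughly 1.645σ. A small check on simulated draws (Python standard library; the seed and sample size are illustrative only):

```python
import random
from statistics import NormalDist, quantiles

random.seed(1)
draws = [random.gauss(0.0, 1.0) for _ in range(5000)]  # true sigma = 1

cuts = quantiles(draws, n=20)      # 19 cut points: cuts[0] = 5%, cuts[-1] = 95%
z95 = NormalDist().inv_cdf(0.95)   # ~1.645, the constant used in the notes
robust_sd = (cuts[-1] - cuts[0]) / (2 * z95)
print(round(robust_sd, 3))         # close to 1.0
```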
Table 6: Actual coverage of alternative 95 percent asymptotic confidence intervals for γ2 under weak identification (multivariate normal DGP)

μ2/(S − 1) = 5

[Individual coverage entries are not recoverable from the extracted text. Layout: Panels A and B report, for each confidence interval type (W, W^H, W^NW, W^M, S, S^M, LR, IS) and each estimator (GMM, ET, EL, QML, Oracle, 2SLS), the estimated coverage for the S = 2, 4, 7 and 12 designs.]

Notes: Table reports coverage rates for nominal 95 percent intervals constructed by inverting the various test statistics outlined in the main text (i.e., Pr(γ2 ∈ CI_α(γ2, σ̃²_α(γ2)))). Coverage is estimated as the fraction of Monte Carlo repetitions where a test of H0 : γ2 = γ20 fails to reject, where γ20 denotes the DGP value. Reported results are for the case of weak identification (μ2/(S − 1) = 5).
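As a guide to reading the coverage tables, the simulation noise in each entry can be bounded with the standard binomial formula; the only inputs taken from the paper are B = 5,000 repetitions and the 0.95 nominal level:

```python
import math

B = 5000   # Monte Carlo repetitions behind each coverage estimate
p = 0.95   # nominal coverage

# A coverage entry is a binomial proportion, so its simulation standard
# error is sqrt(p(1-p)/B), roughly 0.003 here. Deviations from 0.95 of a
# percentage point or more are therefore systematic, not simulation noise.
se_coverage = math.sqrt(p * (1 - p) / B)
print(round(se_coverage, 4))  # 0.0031
```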
Table 7: Actual coverage of alternative 95 percent asymptotic confidence intervals for γ2 under strong identification (multivariate normal DGP)

μ2/(S − 1) = 20

[Individual coverage entries are not recoverable from the extracted text. Layout: Panels A and B report, for each confidence interval type (W, W^H, W^NW, W^M, S, S^M, LR, IS) and each estimator (GMM, ET, EL, QML, Oracle, 2SLS), the estimated coverage for the S = 2, 4, 7 and 12 designs.]

Notes: Table reports coverage rates for nominal 95 percent intervals constructed by inverting the various test statistics outlined in the main text (i.e., Pr(γ2 ∈ CI_α(γ2, σ̃²_α(γ2)))). Coverage is estimated as the fraction of Monte Carlo repetitions where a test of H0 : γ2 = γ20 fails to reject, where γ20 denotes the DGP value. Reported results are for the case of strong identification (μ2/(S − 1) = 20).
Table 8: Actual coverage of alternative 95 percent asymptotic confidence intervals for γ2 under very strong identification (multivariate normal DGP)

μ2/(S − 1) = 50

[Individual coverage entries are not recoverable from the extracted text. Layout: Panels A and B report, for each confidence interval type (W, W^H, W^NW, W^M, S, S^M, LR, IS) and each estimator (GMM, ET, EL, QML, Oracle, 2SLS), the estimated coverage for the S = 2, 4, 7 and 12 designs.]

Notes: Table reports coverage rates for nominal 95 percent intervals constructed by inverting the various test statistics outlined in the main text (i.e., Pr(γ2 ∈ CI_α(γ2, σ̃²_α(γ2)))). Coverage is estimated as the fraction of Monte Carlo repetitions where a test of H0 : γ2 = γ20 fails to reject, where γ20 denotes the DGP value. Reported results are for the case of very strong identification (μ2/(S − 1) = 50).