CLUSTER-SAMPLE METHODS IN APPLIED ECONOMETRICS: AN EXTENDED ANALYSIS

Jeffrey M. Wooldridge
Department of Economics
Michigan State University
East Lansing, MI 48824-1038
(517) 353-5972
[email protected]

This version: June 2006

ABSTRACT

This is an expanded version of Wooldridge (2003), which provided an overview of cluster-sample methods in linear models. Here I include additional details for linear models and provide, as much as possible, a parallel treatment for nonlinear models, including a summary of strategies for dealing with data sets having a small number of clusters.

Keywords: Cluster Correlation; Generalized Estimating Equations; Minimum Distance Estimation; Panel Data; Robust Variance Matrix; Unobserved Effect

JEL Classification Codes: C13, C21, C23

1. INTRODUCTION

In Wooldridge (2003), I provided a brief overview of econometric approaches to analyzing cluster samples in the context of a linear regression model. I considered cases with both large and small cluster sizes (relative to the number of clusters). That treatment was necessarily terse, and some subtle issues were only briefly mentioned or neglected entirely.

The asymptotic theory for the case with a large number of clusters (and relatively small cluster sizes), either in linear or nonlinear models, has been pretty well worked out; for a summary, see, for example, Wooldridge (2002). Just as importantly, popular statistical packages, such as Stata®, allow for computation of variance matrices that are robust to arbitrary cluster correlation for a variety of linear and nonlinear estimation methods. Still, while accounting for clustering in data is much more common than it was 10 years ago, inference methods robust to cluster correlation are still not used routinely in all relevant applications. I think that is partly because empirical researchers are not entirely sure when certain estimators are robust to various kinds of misspecification. I hope this expanded paper helps to fill that gap.

For nonlinear models, there are some open modeling questions for cases where the group sizes vary – a common situation with true cluster samples – and one wants to allow correlation between the unobserved group effect (or heterogeneity) and the covariates that vary within group. I discuss the modeling issues, and offer some tentative solutions, in Section 3.1. This is very preliminary and hopefully generates some interest in the problem.

The case of a small number of clusters has received much attention recently, particularly for linear models. In Wooldridge (2003), I summarized two possible ways to estimate the effects of cluster-level covariates on individual-specific outcomes when the number of clusters is small (while the sizes of the clusters are moderately large). One approach, suggested by Donald and Lang (2001), is to effectively treat the number of groups as the number of observations, and use finite sample analysis (with individual-specific unobservables becoming unimportant – relative to the cluster effect – as the cluster sizes get large). A second approach is to view the cluster-level covariates as imposing restrictions on cluster-specific intercepts in a set of individual-specific regression models, and then to impose and test the restrictions using minimum distance estimation. The minimum distance (MD) approach, when valid, has the benefit of allowing standard asymptotic analysis under a weak set of assumptions, at least when the group sample sizes are large.
When it is valid, the MD approach can be expected to lead to stronger statistical significance than the Donald and Lang approach when there are few groups. Campolieti (2004) applies both the Donald and Lang (2001) (or DL, for short) and minimum distance approaches. Conveniently, both the DL and minimum distance approaches can be extended to a broad class of nonlinear models. As in the linear case, the conditions under which minimum distance can be justified are less restrictive than those for the DL approach (which, importantly, requires unobserved cluster effects to be drawn from a homoskedastic normal distribution). Plus, the DL approach does not allow one to obtain standard errors for the partial effects themselves in nonlinear models. I cover this case in Section 3.2.

2. THE LINEAR MODEL WITH CLUSTER EFFECTS

This section considers linear models estimated using cluster samples (of which a panel data set is a special case). For each group or cluster g, let $\{(y_{gm}, x_g, z_{gm}): m = 1, \dots, M_g\}$ be the observable data, where $M_g$ is the number of units in cluster g, $y_{gm}$ is a scalar response, $x_g$ is a $1 \times K$ vector containing explanatory variables that vary only at the group level, and $z_{gm}$ is a $1 \times L$ vector of covariates that vary within (as well as across) groups.

2.1. Specification of the Model

The linear model with an additive error is

$y_{gm} = \alpha + x_g \beta + z_{gm} \gamma + v_{gm}$,  $m = 1, \dots, M_g$; $g = 1, \dots, G$.  (1)

Our approach to estimation and inference in equation (1) depends on several factors, including whether we are interested in the effects of aggregate variables $\beta$ or individual-specific variables $\gamma$. Plus, we need to make assumptions about the error terms. In the context of pure cluster sampling, an important issue is whether the $v_{gm}$ contain a common group effect that can be separated in an additive fashion, as in

$v_{gm} = c_g + u_{gm}$,  $m = 1, \dots, M_g$,  (2)

where $c_g$ is an unobserved cluster effect and $u_{gm}$ is the idiosyncratic error. (In the statistics literature, (1) and (2) are referred to as a "hierarchical linear model.") One important issue is whether the explanatory variables in (1) can be taken to be appropriately exogenous. Under (2), exogeneity issues are usefully broken down by separately considering $c_g$ and $u_{gm}$.

Throughout we assume that the sampling scheme generates observations that are independent across g. This assumption can be restrictive, particularly when the clusters are large geographical units. This paper does not formally consider problems of "spatial correlation" across clusters, although below I will mention some benefits of fixed effects estimators in such settings.

Appropriate sampling assumptions within cluster are more complicated. Theoretically, the simplest case also allows the most flexibility for robust inference: from a large population of relatively small clusters, we draw a large number of clusters (G), where cluster g has $M_g$ members. This setup is appropriate, for example, in randomly sampling a large number of families, classrooms, or firms from a large population. The key feature is that the number of groups is large enough relative to the group sizes so that we can allow essentially unrestricted within-cluster correlation. Randomly sampling a large number of clusters also applies to many panel data sets, where the cross-sectional population size is large (say, individuals, firms, even cities or counties) and the number of time periods is relatively small.
In the panel data setting, G is the number of cross-sectional units and $M_g$ is the number of time periods for unit g.

A different sampling scheme results in data sets that also can be arranged by group. We first stratify the population into G ≥ 2 nonoverlapping groups and then obtain a random sample of size $M_g$ from each group. Ideally, the strata sizes are large in the population, hopefully resulting in large $M_g$. I adopt this perspective for the "small G" case in Section 2.3.

2.2. Large Group Asymptotics

In this section I review methods and estimators justified when the asymptotic approximations are based on $G \to \infty$ with the group sizes fixed. The theory is well developed; see, for example, White (1984) and Wooldridge (2002, Chapters 10, 11). Here, I summarize the "large G" theory, emphasizing how one might wish to use methods robust to cluster sampling even when it is not so obvious.

First suppose that the covariates satisfy

$E(v_{gm} \mid x_g, z_{gm}) = 0$,  $m = 1, \dots, M_g$; $g = 1, \dots, G$.  (3)

For consistency, we can, of course, get by with zero correlation assumptions, but we use (3) for convenience because it meshes well with assumptions concerning conditional second moments. Importantly, the exogeneity in (3) only requires that $z_{gm}$ and $v_{gm}$ are uncorrelated. In particular, it does not specify assumptions concerning $v_{gm}$ and $z_{gp}$ for $m \neq p$. In the panel data literature, where m is a time index, (3) is called the "contemporaneous exogeneity" assumption; see Wooldridge (2002, Chapter 7). Allowing for correlation between $v_{gm}$ and $z_{gp}$, $m \neq p$, is useful for some panel data applications and possibly even cluster samples (if the covariates of one unit can affect another unit's response).

Under (3) and a standard rank condition on the covariates, the pooled OLS estimator, where we regress $y_{gm}$ on 1, $x_g$, $z_{gm}$, $m = 1, \dots, M_g$; $g = 1, \dots, G$, is consistent for $\delta \equiv (\alpha, \beta', \gamma')'$ (as $G \to \infty$ with $M_g$ fixed) and $\sqrt{G}$-asymptotically normal. See, for example, Wooldridge (2002, Section 7.8). Without more assumptions, a robust variance matrix is needed to account for correlation within clusters or heteroskedasticity in $\text{Var}(v_{gm} \mid x_g, Z_g)$, or both. When $v_{gm}$ has the form in (2), the amount of within-cluster correlation can be substantial, which means the usual OLS standard errors can be very misleading (and, in most cases, systematically too small). In the case of common cluster sizes, Wooldridge (2002, Section 7.8) gives the formula for a variance-matrix estimator that assumes no particular kind of within-cluster correlation nor a particular form of heteroskedasticity. Allowing for different group sizes is straightforward. Write the equation by cluster as

$y_g = W_g \delta + v_g$,  $g = 1, \dots, G$,  (4)

where $W_g$ is the $M_g \times (1 + K + L)$ matrix of all regressors for group g. Then the $(1 + K + L) \times (1 + K + L)$ variance matrix estimator is

$\widehat{\text{Avar}}(\hat{\delta}) = \left( \sum_{g=1}^G W_g' W_g \right)^{-1} \left( \sum_{g=1}^G W_g' \hat{v}_g \hat{v}_g' W_g \right) \left( \sum_{g=1}^G W_g' W_g \right)^{-1}$,  (5)

where $\hat{v}_g$ is the $M_g \times 1$ vector of pooled OLS residuals for group g. This asymptotic variance is now computed routinely by packages such as Stata® (see the appendix). It is important, however, to remember that the standard errors and test statistics obtained are known to be valid only as $G \to \infty$ with each $M_g$ fixed. Probably they are valid if $M_g$ increases at a rate somewhat less than G. Practically, the important issue is how well the robust variance matrix estimator works for common cluster sample sizes, something I comment on below.
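As a concrete illustration, here is a minimal numpy sketch of the sandwich formula in (5) for pooled OLS. The function name and the long-format data layout (observations stacked across clusters with a cluster-identifier array) are my own conventions, not part of the paper.

```python
import numpy as np

def pooled_ols_cluster_robust(y, W, groups):
    """Pooled OLS with the cluster-robust sandwich variance in (5).

    y      : (N,) responses, observations stacked across clusters
    W      : (N, 1+K+L) regressors (intercept, x_g, z_gm)
    groups : (N,) cluster identifiers
    """
    bread = np.linalg.inv(W.T @ W)              # (sum_g W_g'W_g)^{-1}
    delta_hat = bread @ (W.T @ y)               # pooled OLS coefficients
    vhat = y - W @ delta_hat                    # stacked residuals
    meat = np.zeros((W.shape[1], W.shape[1]))
    for g in np.unique(groups):                 # sum_g W_g' vhat_g vhat_g' W_g
        s = W[groups == g].T @ vhat[groups == g]
        meat += np.outer(s, s)
    avar = bread @ meat @ bread                 # formula (5)
    return delta_hat, np.sqrt(np.diag(avar))
```

The same numbers (up to finite-sample degrees-of-freedom corrections) come from statsmodels via `sm.OLS(y, W).fit(cov_type='cluster', cov_kwds={'groups': groups})`.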
In estimating the parameters in (1), the pooled OLS estimator ignores the within-cluster correlation of the $v_{gm}$. If we strengthen the exogeneity assumption to

$E(v_{gm} \mid x_g, Z_g) = 0$,  $m = 1, \dots, M_g$; $g = 1, \dots, G$,  (6)

where $Z_g$ is the $M_g \times L$ matrix of unit-specific covariates, then we can exploit the presence of $c_g$ in (2) in a generalized least squares (GLS) analysis. [In the panel data case, (6) is known as the "strict exogeneity" assumption on $\{z_{gm}: m = 1, \dots, M_g\}$.] The standard "random effects" approach makes enough assumptions so that the $M_g \times M_g$ variance-covariance matrix of $v_g = (v_{g1}, v_{g2}, \dots, v_{g,M_g})'$ has the so-called "random effects" form,

$\text{Var}(v_g) = \sigma_c^2 j_{M_g} j_{M_g}' + \sigma_u^2 I_{M_g}$,  (7)

where $j_{M_g}$ is the $M_g \times 1$ vector of ones and $I_{M_g}$ is the $M_g \times M_g$ identity matrix. In the standard setup, we also make the "system homoskedasticity" assumption,

$\text{Var}(v_g \mid x_g, Z_g) = \text{Var}(v_g)$.  (8)

It is important to understand the role of assumption (8): it implies that the conditional variance-covariance matrix is the same as the unconditional variance-covariance matrix, but it does not restrict $\text{Var}(v_g)$. The particular random effects structure on $\text{Var}(v_g)$ is given by (7). Under (7) and (8), the resulting GLS estimator is the well-known random effects (RE) estimator; see, for example, Wooldridge (2002, Section 10.3).

The random effects estimator is asymptotically more efficient than pooled OLS under (6), (7), and (8) (and a standard rank condition). The RE estimates and test statistics are computed routinely by popular software packages. Nevertheless, an important point is often overlooked in applications of RE: one can, and in many cases should, make inference completely robust to an unknown form of $\text{Var}(v_g \mid x_g, Z_g)$. Wooldridge [2002, equation (7.49)] gives the robust formula in the balanced case (same group sizes), and the formula extends easily to the unbalanced case. One way to think about the robust formula is to characterize RE as a pooled OLS estimator on quasi-time-demeaned data. For each g, define $\hat{\theta}_g = 1 - \{1/[1 + M_g(\hat{\sigma}_c^2/\hat{\sigma}_u^2)]\}^{1/2}$, where $\hat{\sigma}_c^2$ and $\hat{\sigma}_u^2$ are estimators of the two variance parameters. Then the RE estimator is identical to the pooled OLS estimator of

$y_{gm} - \hat{\theta}_g \bar{y}_g$ on $(1 - \hat{\theta}_g)$, $(1 - \hat{\theta}_g)x_g$, $(z_{gm} - \hat{\theta}_g \bar{z}_g)$,  $m = 1, \dots, M_g$; $g = 1, \dots, G$;  (9)

see, for example, Hsiao (1986). For fully robust inference, we can just apply the fully robust variance matrix estimator in (5), but on the transformed data. The idea in obtaining a fully robust variance matrix for RE is straightforward, and dates back at least to the "generalized estimating equation" (GEE) literature [Liang and Zeger (1986)]. Even if $\text{Var}(v_g \mid x_g, Z_g)$ does not have the RE form, the RE estimator is still consistent and $\sqrt{G}$-asymptotically normal under (6), and it is likely to be more efficient than pooled OLS. [Except in unusual cases, incorrectly accounting for within-group correlation by assuming a constant within-group correlation is likely to be more efficient than entirely ignoring within-group correlation.] Yet we should recognize that the RE second moment assumptions can be violated without causing inconsistency in the RE estimator.
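Putting the pieces together, the following sketch implements RE estimation via the quasi-time demeaning in (9), with fully robust (cluster) inference on the transformed data. The function name is mine, and the variance components are taken as inputs, assumed to have been estimated in a preliminary step (e.g., from pooled OLS residuals).

```python
import numpy as np
import statsmodels.api as sm

def re_quasi_demeaned(y, W, groups, sig2_c, sig2_u):
    """RE as pooled OLS on quasi-demeaned data, as in (9), with a fully
    robust (cluster) variance matrix. sig2_c and sig2_u are assumed
    estimated in a preliminary step. W must contain a column of ones,
    which the transformation turns into (1 - theta_g) automatically."""
    y_t, W_t = y.astype(float).copy(), W.astype(float).copy()
    for g in np.unique(groups):
        idx = groups == g
        M_g = idx.sum()
        theta_g = 1.0 - np.sqrt(1.0 / (1.0 + M_g * sig2_c / sig2_u))
        y_t[idx] -= theta_g * y[idx].mean()
        W_t[idx] -= theta_g * W[idx].mean(axis=0)
    return sm.OLS(y_t, W_t).fit(cov_type='cluster',
                                cov_kwds={'groups': groups})
```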
Making inference robust to serial correlation in the idiosyncratic errors for panel data applications, particularly with more than a few time periods, can be very important. But within-group correlation in the idiosyncratic errors can arise for cluster samples, too, especially if underlying (1) is a random coefficient model, where $z_{gm}\gamma_g$ replaces $z_{gm}\gamma$. By estimating a standard random effects model that assumes common slopes $\gamma$, we effectively include $z_{gm}(\gamma_g - \gamma)$ in the idiosyncratic error; this generally creates within-group correlation because $z_{gm}(\gamma_g - \gamma)$ and $z_{gp}(\gamma_g - \gamma)$ will be correlated for $m \neq p$, conditional on $Z_g$. Also, the idiosyncratic error will have heteroskedasticity that is a function of $z_{gm}$. Nevertheless, if we assume $E(\gamma_g \mid x_g, Z_g) = E(\gamma_g) \equiv \gamma$ along with (6), the random effects estimator still consistently estimates the average slopes, $\gamma$. Therefore, in applying random effects to panel data or cluster samples, it is sensible (with large G) to make the variance estimator of random effects robust to arbitrary heteroskedasticity and within-group correlation.

With panel data – so that there is a natural ordering of the observations within group – it may make sense to estimate an unrestricted version of $\text{Var}(v_g)$, especially if G is large. Even in that case, we can use Wooldridge [2002, equation (7.49)] to obtain a variance matrix robust to $\text{Var}(v_g \mid x_g, Z_g) \neq \text{Var}(v_g)$. In economics, the prevailing view is still that robust inference is not necessary when using GLS. Adopting the perspective of the GEE literature – that one often chooses a "working" variance-covariance matrix for convenience, but should recognize it may not represent the conditional variance-covariance matrix – is relatively straightforward these days. The appendix shows how to use GEE software to obtain fully robust inference after random effects estimation.

If we are only interested in estimating $\gamma$, the "fixed effects" (FE) or "within" estimator is attractive. The within transformation subtracts off group averages from the dependent variable and explanatory variables:

$y_{gm} - \bar{y}_g = (z_{gm} - \bar{z}_g)\gamma + u_{gm} - \bar{u}_g$,  $m = 1, \dots, M_g$; $g = 1, \dots, G$,  (10)

and this equation is estimated by pooled OLS. (The $x_g$ get swept away by the within-group demeaning.) Under a full set of "fixed effects" assumptions – which, unlike random effects, allows arbitrary correlation between $c_g$ and the $z_{gm}$ – inference is straightforward using standard software. Nevertheless, analogous to the random effects case, it is often important to allow $\text{Var}(u_g \mid Z_g)$ to have an arbitrary form, including within-group correlation and heteroskedasticity. Arellano (1987) proposed a fully robust variance matrix estimator for the fixed effects estimator, and it is consistent (as G increases with the $M_g$ fixed) with cluster samples or panel data; see also Wooldridge [2002, equation (10.59)]. For panel data, the idiosyncratic errors can always have serial correlation or heteroskedasticity, and it is easy to guard against these problems in inference.

Reasons for wanting a fully robust variance matrix estimator for FE applied to cluster samples are similar to the RE case. For example, if $\gamma_g$ replaces $\gamma$ in (1), then $(z_{gm} - \bar{z}_g)(\gamma_g - \gamma)$ appears in the error term of (10). The FE estimator is still consistent if $E(\gamma_g \mid z_{g1} - \bar{z}_g, \dots, z_{g,M_g} - \bar{z}_g) = E(\gamma_g)$ – a reasonable assumption that allows $\gamma_g$ to be correlated with $\bar{z}_g$ provided the group slopes are mean-independent of the within-group deviations from the mean. Nevertheless, $(u_{gm}, u_{gp})$ will be correlated for $m \neq p$. A fully robust variance matrix estimator is

$\widehat{\text{Avar}}(\hat{\gamma}_{FE}) = \left( \sum_{g=1}^G \ddot{Z}_g' \ddot{Z}_g \right)^{-1} \left( \sum_{g=1}^G \ddot{Z}_g' \hat{u}_g \hat{u}_g' \ddot{Z}_g \right) \left( \sum_{g=1}^G \ddot{Z}_g' \ddot{Z}_g \right)^{-1}$,  (11)

where $\ddot{Z}_g$ is the matrix of within-group deviations from means and $\hat{u}_g$ is the $M_g \times 1$ vector of fixed effects residuals. This estimator is justified with large-G asymptotics.
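A minimal numpy sketch of the within estimator with the Arellano (1987) variance matrix in (11) follows; again, the function name and data layout are my own conventions.

```python
import numpy as np

def fixed_effects_robust(y, Z, groups):
    """Within (FE) estimator of gamma with the Arellano (1987)
    cluster-robust variance matrix in (11). Z holds only the z_gm
    columns (the intercept and x_g are swept away by demeaning)."""
    y_dd, Z_dd = y.astype(float).copy(), Z.astype(float).copy()
    for g in np.unique(groups):
        idx = groups == g
        y_dd[idx] -= y_dd[idx].mean()           # within-group demeaning
        Z_dd[idx] -= Z_dd[idx].mean(axis=0)
    bread = np.linalg.inv(Z_dd.T @ Z_dd)
    gamma_hat = bread @ (Z_dd.T @ y_dd)
    uhat = y_dd - Z_dd @ gamma_hat              # FE residuals
    meat = np.zeros((Z.shape[1], Z.shape[1]))
    for g in np.unique(groups):
        s = Z_dd[groups == g].T @ uhat[groups == g]
        meat += np.outer(s, s)
    avar = bread @ meat @ bread                 # formula (11)
    return gamma_hat, np.sqrt(np.diag(avar))
```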
One benefit of a fixed effects approach, especially in the standard model with constant slopes but $c_g$ in the composite error term, is that no adjustments are necessary if the $c_g$ are correlated across groups. When the groups represent different geographical units, we might expect correlation across groups close to each other. If we think such correlation is largely captured through the unobserved effect $c_g$, then its elimination via the within transformation effectively solves the problem. If we use pooled OLS or a random effects approach, we would have to deal with spatial correlation across g, in addition to within-group correlation, and this is a difficult problem.

Recently, in the context of fixed effects estimation and panel data, Kézdi (2001) and Bertrand, Duflo, and Mullainathan (2002) study the finite-sample properties of robust variance matrix estimators that are theoretically justified only as $G \to \infty$. One common finding is that the fully robust estimator works reasonably well even when the cross-sectional sample size (G) is not especially large relative to the time-series dimension. When $\text{Var}(u_g \mid Z_g)$ does not depend on $Z_g$, a variance matrix that exploits system homoskedasticity can perform better than the fully robust variance matrix estimator. Kézdi (2001) also finds that, without serial correlation, an estimator that adjusts the variance matrix only for heteroskedasticity works well. This finding is probably more relevant for cluster sample applications, where independence of the idiosyncratic errors is often reasonable unless slopes vary by group.

Importantly, the encouraging simulation findings for fixed effects with panel data are not in conflict with findings that the robust variance matrix for the pooled OLS estimator with a small number of clusters can behave poorly. An important difference between the two problems is that for FE estimation using panel data, the issue is serial correlation in the idiosyncratic errors $\{u_{gm}: m = 1, \dots, M_g\}$, and in simulation studies to date this correlation has been assumed to die out as the time periods get far apart. [Typically, a stable AR(1) model is assumed.] The pooled OLS estimator that keeps $c_g$ in the error term suffers because of the constant correlation across all observations within cluster. Plus, fixed effects can only estimate $\gamma$, while for pooled OLS on cluster samples the focus is typically on $\beta$. If we are interested in $\beta$, an interesting open question for a small cluster sample size is whether inference with the random effects estimator, whether carried out under the usual RE assumptions or made robust to arbitrary within-group correlation and heteroskedasticity, can improve on robust inference for pooled OLS.

The previous discussion extends immediately to instrumental variables versions of all estimators. With large G, one can typically afford to make pooled two stage least squares (2SLS), random effects 2SLS, and fixed effects 2SLS robust to arbitrary within-cluster correlation and heteroskedasticity. Also, more efficient estimation is possible by applying generalized method of moments (GMM); again, GMM is justified with large G. Unfortunately, there appears to be little simulation evidence on how the "large G" standard errors for IV methods perform when G is not so large.

2.3. Large Group Size Asymptotics

The previous subsection discussed the mixed performance of standard "large G" variance matrix estimators when G is not so large.
I now discuss estimation of the parameters in (1) under a standard stratified sampling scheme. In particular, suppose we have defined G strata in the population, and we obtain our data by randomly sampling from each stratum. As before, $M_g$ is the sample size for stratum (group) g. Without specifying how the sample was obtained, the resulting data set is essentially indistinguishable from that described in Section 2.2, where we randomly draw clusters from a population of many clusters. But the difference turns out to be important, and in this subsection I act as if the usual "large G" asymptotics do not apply.

We are mostly interested in estimating $\beta$ in (1), but that generally requires estimating $\gamma$, too. Donald and Lang (2001) (hereafter, DL) study this problem, and I draw on their work while offering a complementary approach. DL's examples are for very small G, and one of their examples is Card and Krueger (1994), who had G = 2 states, New Jersey and Pennsylvania. DL's approach can be, and has been, applied with larger numbers of groups. In Moulton (1990), the clusters are states in the U.S. (G = 50) and, in Loeb and Bound (1996), G = 36 cohort-division groups. I explicitly consider the case where G and the $M_g$ are both (moderately) "large" in Section 2.4.

To illustrate the pitfalls in applying standard methods such as pooled OLS when G is "small," consider a special case of (1), where $x_g$ is a scalar (as in DL) and $z_{gm}$ is not in the equation. The equation is

$y_{gm} = \alpha + \beta x_g + c_g + u_{gm}$,  $m = 1, \dots, M_g$; $g = 1, \dots, G$,  (12)

where $c_g$ and $\{u_{gm}: m = 1, \dots, M_g\}$ are independent of $x_g$, and $\{u_{gm}: m = 1, \dots, M_g\}$ is a mean-zero, independent, identically distributed sequence for each g. Now, if the cluster effect $c_g$ is absent from the model, then we can estimate (12) using pooled ordinary least squares, and inference is straightforward. If we assume $\text{Var}(u_{gm})$ is constant across g then, provided $N \equiv M_1 + \dots + M_G$ is large enough (often N as small as 30 suffices) – even if G is not large – we can use the usual t statistics from the pooled OLS regression as having an approximate standard normal distribution. Making inference robust to heteroskedasticity is straightforward, and tends to work well even for moderate N.

Donald and Lang (2001) allow for the presence of $c_g \sim \text{Normal}(0, \sigma_c^2)$, which they assume to be independent of $\{u_{gm}: m = 1, \dots, M_g\}$. With a common cluster effect, there is no averaging out within cluster that allows application of the central limit theorem. One way to see the problem is to note that the pooled OLS estimator, $\hat{\beta}$, is identical to the "between" estimator obtained from the regression

$\bar{y}_g$ on 1, $x_g$,  $g = 1, \dots, G$.  (13)

Conditional on the $x_g$, $\hat{\beta}$ inherits its distribution from $\{\bar{v}_g: g = 1, \dots, G\}$, the within-group averages of the composite errors $v_{gm} \equiv c_g + u_{gm}$. The presence of $c_g$ means new observations within group do not provide additional information for estimating $\beta$ beyond how they affect the group average, $\bar{y}_g$.

If we add some strong assumptions, there is an exact solution to the inference problem. In particular, assume $u_{gm} \sim \text{Normal}(0, \sigma_u^2)$ and $M_g = M$ for all g (same sample size across strata). Then $\bar{v}_g \sim \text{Normal}(0, \sigma_c^2 + \sigma_u^2/M)$ for all g. Because we assume independence across g, the equation

$\bar{y}_g = \alpha + \beta x_g + \bar{v}_g$,  $g = 1, \dots, G$,  (14)

satisfies the classical linear model assumptions. Therefore, we can use inference based on the $t_{G-2}$ distribution to test hypotheses about $\beta$, provided G > 2.
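The DL procedure is just OLS on the G group means. A short sketch follows (the pandas column names 'y', 'x', and 'g' are hypothetical); because the second-stage regression has only G observations, statsmodels automatically reports t statistics and confidence intervals based on the $t_{G-2}$ distribution.

```python
import pandas as pd
import statsmodels.api as sm

def donald_lang(df):
    """DL inference: collapse to group means and treat (14) as a classical
    linear model, so inference uses the t_{G-2} distribution."""
    means = df.groupby('g').agg(ybar=('y', 'mean'), x=('x', 'first'))
    X = sm.add_constant(means['x'])
    # OLS on G observations; df_resid = G - 2, so reported p-values and
    # confidence intervals are automatically based on t_{G-2}
    return sm.OLS(means['ybar'], X).fit()
```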
When G is very small, the requirements for a significant t statistic using the $t_{G-2}$ distribution are much more stringent than if we use the $t_{M_1 + M_2 + \dots + M_G - 2}$ distribution – which is what we would be doing if we used the usual pooled OLS statistics. When $x_g$ is a $1 \times K$ vector, we need G > K + 1 to use the $t_{G-K-1}$ distribution for inference. [In Moulton (1990), $x_g$ contains 17 elements, which can be handled in this setup because G = 50.]

As pointed out by DL, performing the correct inference in the presence of $c_g$ is not just a matter of correcting the pooled OLS standard errors for cluster correlation, or using the RE estimator. There is only one estimator: pooled OLS, random effects, and the between regression in (13) all lead to the same $\hat{\beta}$. But it is the regression in (13) that gives the appropriate standard error and reports the correct (small) degrees of freedom in the t distribution.

We can apply the DL method without normality of the $u_{gm}$ if the common group size M is large: by the central limit theorem, $\bar{u}_g$ will be approximately normally distributed very generally. Then, because $c_g$ is normally distributed, we can treat $\bar{v}_g$ as approximately normal with constant variance. Further, even if the group sizes differ across g, for very large group sizes $\bar{u}_g$ will be a negligible part of $\bar{v}_g$: $\text{Var}(\bar{v}_g) = \sigma_c^2 + \sigma_u^2/M_g$. Provided $c_g$ is normally distributed and it dominates $\bar{u}_g$, a classical linear model analysis on (14) should be roughly valid.

The broadest applicability of DL's setup is when the average of the idiosyncratic errors, $\bar{u}_g$, can be ignored – either because $\sigma_u^2$ is small relative to $\sigma_c^2$, $M_g$ is large, or both. It is important to see that applying DL with different group sizes or nonnormality of the $u_{gm}$ is identical to ignoring the estimation error in the sample averages, $\bar{y}_g$. In other words, it is as if we are analyzing the simple regression $\mu_g = \alpha + \beta x_g + c_g$ using the classical linear model assumptions (where $\bar{y}_g$ is used in place of the unknown group mean, $\mu_g$). With small G, we need to assume $c_g$ is normally distributed.

The DL solution to the inference problem is actually pretty common as a strategy to check robustness of results obtained from cluster samples, but typically it is implemented with large G. Often with cluster samples one estimates the parameters using the disaggregated data and also the averaged data. With covariates that vary within cluster, using averaged data is generally inefficient. But it does mean that standard errors need not be made robust to within-cluster correlation. DL essentially point out that using (14) with small G ensures that inference is conservative (at least if normality of $c_g$ holds and the group sizes are large).

For small G and large $M_g$, inference obtained from analyzing (14) as a classical linear model will be very conservative in the absence of a cluster effect. Perhaps this is desirable, but it rules out some widely-used staples of policy analysis. For example, suppose we have two populations (maybe men and women, two different cities, or a treatment and a control group) with means $\mu_g$, $g = 1, 2$, and we would like to obtain a confidence interval for their difference. Under random sampling from each population, and assuming normality and equal population variances, the usual comparison-of-means statistic is distributed exactly as $t_{M_1 + M_2 - 2}$ under the null hypothesis of equal population means. (Or, we can construct an exact 95% confidence interval for the difference in population means.)
With even moderate sizes for $M_1$ and $M_2$, the $t_{M_1 + M_2 - 2}$ distribution is close to the standard normal distribution. Plus, we can relax normality to obtain approximately valid inference, and it is easy to adjust the t statistic to allow for different population variances. The comparison-of-means setup is most students' first exposure to policy analysis. But we cannot even study the difference-in-means estimator in the DL setup because G = 2.

DL criticize Card and Krueger (1994) for comparing mean wage changes of fast-food workers across two states because Card and Krueger fail to account for the state effect (New Jersey or Pennsylvania), $c_g$, in the composite error, $v_{gm}$. But the DL criticism in the G = 2 case is indistinguishable from a common question raised for any difference-in-differences analysis: How can we be sure that any observed difference in means is due entirely to the policy change? To characterize the problem as failing to account for an unobserved group effect is not necessarily helpful. If the experiment is appropriately randomized, then $c_g$ should be part of the effect to be estimated.

Consider a related example. Suppose that, over the summer, a school district with two high schools, say A and B, decides to provide personal computers for students at high school B who just finished their first year. The announcement is made just before the school year starts, so that students cannot switch high schools. The response variable is the change in a standardized test score given in the district from the first to the second year. If we sample students from each high school, a comparison of means should be appropriate if the experiment is effectively randomized. Of course, if there are other confounding factors – say, the average increase in test scores is already higher at school B – then the difference-in-differences estimator will not be appropriate.

If $z_{gm}$ appears in (1), we can modify (14) by adding the group averages, $\bar{z}_g$, as explanatory variables, but only if G > K + L + 1. Under the normality and homoskedasticity assumptions with common group sizes, inference can be carried out using the $t_{G-K-L-1}$ distribution. DL propose an estimation method that has an exact $t_{G-K-1}$ distribution, regardless of the size of L, but they assume that $\bar{z}_g$ is constant across g, $\bar{z}_g = \bar{z}$, in which case $\bar{z}$ gets absorbed into the intercept in (14) and we are back to the case without $z_{gm}$.

With large group sizes $M_g$, a different perspective on estimating $\beta$ sheds additional insight on problems of inference, and allows more flexibility in the kinds of questions that can be answered. Write for each group g

$y_{gm} = \delta_g + z_{gm}\gamma_g + u_{gm}$,  $m = 1, \dots, M_g$,  (15)

where we assume random sampling within group and independent sampling across groups. We make the standard assumptions for OLS to be consistent (as $M_g \to \infty$) and $\sqrt{M_g}$-asymptotically normal; see, for example, Wooldridge (2002, Chapter 4). The presence of group-level variables $x_g$ in a "structural" model can be viewed as putting restrictions on the intercepts, $\delta_g$, in the separate group models in (15). In particular,

$\delta_g = \alpha + x_g \beta$,  $g = 1, \dots, G$,  (16)

where we now think of $x_g$ as fixed, observed attributes of heterogeneous groups. With K attributes we must have G ≥ K + 1. If $M_g$ is large enough to estimate the $\delta_g$ precisely, a simple two-step estimation strategy suggests itself. First, obtain the $\hat{\delta}_g$, along with $\hat{\gamma}_g$, from an OLS regression within each group. If G = K + 1 then, typically, we can solve for $\hat{\theta} \equiv (\hat{\alpha}, \hat{\beta}')'$ uniquely in terms of the $G \times 1$ vector $\hat{\delta}$:
$\hat{\theta} = X^{-1}\hat{\delta}$, where X is the $(K+1) \times (K+1)$ matrix with g-th row $(1, x_g)$. If G > K + 1 then, in a second step, we can use a minimum distance approach, as described in Wooldridge (2002, Section 14.6). If we use $I_G$, the G × G identity matrix, as the weighting matrix, then the minimum distance estimator can be computed from the OLS regression

$\hat{\delta}_g$ on 1, $x_g$,  $g = 1, \dots, G$.  (17)

Under asymptotics such that $M_g = \rho_g M$, where $0 < \rho_g \leq 1$ and $M \to \infty$, the minimum distance estimator $\hat{\theta}$ is consistent and $\sqrt{M}$-asymptotically normal. Still, this particular minimum distance estimator is asymptotically inefficient except under strong assumptions. Because the samples are assumed to be independent, it is not difficult to obtain the efficient minimum distance (MD) estimator – also called the "minimum chi-square" estimator.

First consider the case where $z_{gm}$ does not appear in the first stage estimation, so that $\hat{\delta}_g$ is just $\bar{y}_g$, the sample mean for group g. Let $\hat{\sigma}_g^2$ denote the usual sample variance for group g. Because the $\bar{y}_g$ are independent across g, the efficient MD estimator uses a diagonal weighting matrix. As a computational device, the minimum chi-square estimator can be computed by using the weighted least squares (WLS) version of (17), where group g is weighted by $M_g/\hat{\sigma}_g^2$ (groups that have more data and smaller variance receive greater weight). Conveniently, the reported t statistics from the WLS regression are asymptotically standard normal as the group sizes $M_g$ get large. (With fixed G, the WLS nature of the estimation is just a computational device; the standard asymptotic analysis of the WLS estimator has $G \to \infty$.) The minimum distance approach works with small G provided G ≥ K + 1 and each $M_g$ is large enough so that normality is a good approximation to the distribution of the (properly scaled) sample average within each group.

If $z_{gm}$ is present in the first-stage estimation, we use as the minimum chi-square weights the inverses of the asymptotic variances of the G intercepts in the separate G regressions. With large $M_g$, we might make these fully robust to heteroskedasticity in $E(u_{gm}^2 \mid z_{gm})$ using the White (1980) sandwich variance estimator, or at least allow for different $\sigma_g^2$. In any case, once we have the $\widehat{\text{Avar}}(\hat{\delta}_g)$ – which are just the squared reported standard errors for the $\hat{\delta}_g$ – we use $1/\widehat{\text{Avar}}(\hat{\delta}_g)$ as the weights in the computationally simple WLS procedure. We are still using independence across g in obtaining a diagonal weighting matrix in the MD estimation.

An important by-product of the WLS regression is a minimum chi-square statistic that can be used to test the G − K − 1 overidentifying restrictions. The statistic is easily obtained as the weighted sum of squared residuals, say $SSR_w$. Under the null hypothesis in (16), $SSR_w \overset{d}{\to} \chi^2_{G-K-1}$ as the group sizes, $M_g$, get large. If we reject $H_0$ at a reasonably small significance level, the $x_g$ are not sufficient for characterizing the changing intercepts across groups. If we fail to reject $H_0$, we can have some confidence in our specification, and perform inference using the standard normal distribution for t statistics for testing linear combinations of the population averages.

We might also be interested in how one of the slopes in $\gamma_g$ depends on the group features, $x_g$. Then, we simply replace $\hat{\delta}_g$ with, say, $\hat{\gamma}_{g1}$, the slope on the first element of $z_{gm}$. Naturally, we would use $1/\widehat{\text{Avar}}(\hat{\gamma}_{g1})$ as the weights in the MD estimation.
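The WLS computation and the overidentification statistic fit in a few lines. In the sketch below (my own naming), the first argument can hold either the group intercepts or a chosen group-specific slope, with the corresponding squared standard errors as weights.

```python
import numpy as np
from scipy import stats

def min_chi_square(delta_hat, avar_delta, X):
    """Efficient minimum distance ('minimum chi-square') estimation of
    delta_g = alpha + x_g*beta, computed as WLS with weights
    1/Avar(delta_hat_g), plus the overidentification statistic SSR_w.

    delta_hat  : (G,) first-stage intercepts (or a chosen slope)
    avar_delta : (G,) squared standard errors of delta_hat
    X          : (G, K+1) matrix with g-th row (1, x_g)
    """
    w = 1.0 / avar_delta
    XtWX = X.T @ (X * w[:, None])
    theta_hat = np.linalg.solve(XtWX, X.T @ (w * delta_hat))
    se = np.sqrt(np.diag(np.linalg.inv(XtWX)))
    resid = delta_hat - X @ theta_hat
    ssr_w = np.sum(w * resid**2)        # minimum chi-square statistic
    df = X.shape[0] - X.shape[1]        # G - K - 1 overidentifying restrictions
    pval = stats.chi2.sf(ssr_w, df)
    return theta_hat, se, ssr_w, pval
```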
The minimum distance approach can also be applied if we impose $\gamma_g = \gamma$ for all g, as in the original model (1). Obtaining the $\hat{\delta}_g$ themselves is easy: run the pooled regression

$y_{gm}$ on $d1_g, d2_g, \dots, dG_g, z_{gm}$,  $m = 1, \dots, M_g$; $g = 1, \dots, G$,  (18)

where $d1_g, d2_g, \dots, dG_g$ are group dummy variables. Using the $\hat{\delta}_g$ from the pooled regression (18) in MD estimation is complicated by the fact that the $\hat{\delta}_g$ are no longer asymptotically independent; in fact, $\hat{\delta}_g = \bar{y}_g - \bar{z}_g\hat{\gamma}$, where $\hat{\gamma}$ is the vector of common slopes, and the presence of $\hat{\gamma}$ induces correlation among the intercept estimators. Let $\hat{V}$ be the G × G estimated (asymptotic) variance matrix of the $G \times 1$ vector $\hat{\delta}$. Then the MD estimator is $\hat{\theta} = (X'\hat{V}^{-1}X)^{-1}X'\hat{V}^{-1}\hat{\delta}$, and its estimated asymptotic variance is $(X'\hat{V}^{-1}X)^{-1}$. If the OLS regression (17) is used, or the WLS version, the resulting standard errors will be incorrect because they ignore the across-group correlation in the estimators. (With large group sizes the errors might be small; see the next subsection.)

Intermediate approaches are available, too. Loeb and Bound (1996) (LB for short) allow different group intercepts and group-specific slopes on education, but impose common slopes on demographic and family background variables. The main group-level covariate is the student-teacher ratio. Thus, LB are interested in seeing how the student-teacher ratio affects the relationship between test scores and education levels. LB use both the unweighted estimator and the weighted estimator and find that the results differ in unimportant ways. Because they impose common slopes on a set of regressors, the estimated slopes on education (say, $\hat{\gamma}_{g1}$) are not asymptotically independent, and perhaps using a nondiagonal estimated variance matrix $\hat{V}$ (which would be 36 × 36 in this case) is more appropriate; but see Section 2.4.

As another example, suppose $x_g$ is a binary treatment indicator, equal to unity if group g is in the treatment group and zero otherwise. Then $\hat{\beta}$ is an estimate of an average treatment effect (ATE). If G = 2, there are no restrictions to test, as in the simple comparison-of-means setup. If G > 2 we can test the overidentifying restrictions. If we reject them, then there are unaccounted-for group-level differences, and we might want to re-specify our model by adding more elements to $x_g$, even if we think the new elements of $x_g$ are not systematically related to the original elements. Of course, if we add elements until G = K + 1, there are no restrictions to test.

If we reject the overidentifying restrictions, we are essentially concluding that $\delta_g = \alpha + x_g\beta + c_g$, where $c_g$ is the error made in imposing the restrictions (16) on the $\delta_g$. One possibility is to apply the Donald and Lang approach, which is to analyze the OLS regression in (17) in the context of the classical linear model (CLM), where inference is based on the $t_{G-K-1}$ distribution. Why is a CLM analysis justified? Since $\hat{\delta}_g = \delta_g + O_p(M_g^{-1/2})$, we can ignore the estimation error in $\hat{\delta}_g$ for large $M_g$. (Recall that the same "large $M_g$" assumption underlies the minimum distance approach.) Then, it is as if we are estimating the equation $\delta_g = \alpha + x_g\beta + c_g$, $g = 1, \dots, G$, by OLS. If the $c_g$ are drawn from a normal distribution, classical analysis is applicable because $c_g$ is assumed to be independent of $x_g$. This approach is desirable when one cannot, or does not want to, find group-level observables that completely determine the $\delta_g$.
It is predicated on the assumption that the other factors in $c_g$ are not systematically related to $x_g$, a reasonable assumption if, say, $x_g$ is a randomly assigned treatment at the group level, a case considered by Angrist and Lavy (2002).

Before settling for conservative inference, which is justified only under normality of $c_g$ when we can ignore the estimation error in $\hat{\delta}_g$, it makes sense to choose $x_g$ carefully and test $\text{Var}(c_g) = 0$. This is precisely the purpose of the minimum chi-square overidentification statistic. The minimum chi-square approach is justified very generally when the group sizes, $M_g$, are moderately large; normality is not needed anywhere, nor is homoskedasticity within or across groups. If the $M_g$ are not very large but G − K − 1 is large, the $\chi^2_{G-K-1}$ distribution may not be a good approximation to the distribution of the overidentification statistic.

Even if the overidentification statistic rejects (16), it may be too rash to adopt the DL approach with small degrees of freedom. To see why, consider a treatment effect setup. Suppose there are three large groups (G = 3), and groups one and two are the control groups while group three is the treatment group. So $x_g$ is a binary variable with $x_1 = x_2 = 0$ and $x_3 = 1$. For simplicity, there are no control variables. Then the estimated average treatment effect using the DL regression (13) or the (unweighted) minimum distance regression (17) is identical, and simply equal to the difference in means for the treatment and control groups. It is conveniently written as

$\hat{\beta} = \bar{y}_3 - (p_1\bar{y}_1 + p_2\bar{y}_2)$,  (19)

where $p_1 = M_1/(M_1 + M_2)$ and $p_2 = M_2/(M_1 + M_2)$ are the proportions of observations for groups one and two, respectively, out of the total number of control units. In other words, the mean outcome for the control group is just a weighted average. The efficient minimum distance estimator would use a different weighting scheme, but that is not important for the main point here.

Because G = 3, we can test the single overidentifying restriction. This is tantamount to testing whether $\mu_1 = \mu_2$, that is, whether the population means of the two control groups are the same. Here is the important point: should we really reject the standard analysis just because there is a heterogeneous response in the two control groups? I think not. If we do, and apply the DL approach, we wind up using a t distribution with a single degree of freedom to conduct inference, even though we may have hundreds, or even thousands, of observations from each group or stratum. Instead, we can be satisfied with (19) as the estimate of the treatment effect without worrying whether $\mu_1 = \mu_2$. In fact, if the sample proportions $p_1$ and $p_2$ converge to the relative population frequencies, say $\rho_1$ and $\rho_2$, then the probability limit of (19) is simply $\mu_3 - (\rho_1\mu_1 + \rho_2\mu_2)$, where the term in parentheses is the average across the two control group populations. If $p_1$ and $p_2$ are not consistent estimates of the population frequencies, then the control group mean estimates a weighted average of $\mu_1$ and $\mu_2$. We may or may not be interested in the particular weighted average implied by (19). Nevertheless, the DL method produces exactly the same estimate of the ATE, but DL would insist that the estimated effect be deemed statistically significant only if the t statistic, from the regression of $\bar{y}_g$ on 1, $x_g$, $g = 1, 2, 3$, is on the order of 12.8 (for a two-sided, 5% level test).
Rather than either rejecting the minimum distance approach if $\mu_1 \neq \mu_2$, or simply accepting the estimated treatment effect as possibly estimating an uninteresting weighted combination of the mean effects for different groups, we can specify ahead of time a particular linear combination of the means that is of interest. So, for example, suppose that the first $G_1$ groups are the controls and the remaining $G_2$ groups are the treated groups. Then, we might define the treatment effect in terms of two simple averages, say $\mu_C = (\mu_1 + \dots + \mu_{G_1})/G_1$ and $\mu_T = (\mu_{G_1+1} + \dots + \mu_G)/G_2$; the treatment effect would be defined as $\tau = \mu_T - \mu_C$. Even with only a moderate sample size for each g, we can typically get fairly precise estimates of the $\mu_g$ using the sample averages. Estimation of $\tau$ is then straightforward, and inference is standard because asymptotic standard errors for $\hat{\mu}_C$ and $\hat{\mu}_T$ are easily obtained. Of course, one might use weighted averages in each case, particularly if one has knowledge of population frequencies. The general point is that once we have specified an interesting population parameter, we can obtain a consistent, approximately normal estimator, and inference can be based on the standard normal distribution when the group sizes are moderately large, as in the sketch at the end of this subsection. There is no reason to rely on the small degrees of freedom approach of DL. Beyond the treatment effect case, the issue of how to define parameters of interest appears complicated, and deserves further research.
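A minimal sketch of the estimator $\hat{\tau} = \hat{\mu}_T - \hat{\mu}_C$ and its standard error follows; the function name and inputs are my own conventions. Independence across groups makes the variance additive.

```python
import numpy as np

def group_average_effect(ybar, avar_ybar, treated):
    """tau_hat = mu_T_hat - mu_C_hat, where mu_C and mu_T are simple
    averages of the group means over control and treated groups.

    ybar      : (G,) group sample means
    avar_ybar : (G,) estimated variances of the group means (s_g^2 / M_g)
    treated   : (G,) boolean treatment indicator
    """
    ctrl = ~treated
    tau_hat = ybar[treated].mean() - ybar[ctrl].mean()
    # independence across g gives an additive variance formula
    se = np.sqrt(avar_ybar[treated].sum() / treated.sum()**2
                 + avar_ybar[ctrl].sum() / ctrl.sum()**2)
    return tau_hat, se   # tau_hat/se is approximately standard normal
```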
2.4. What if G and the M_g are Both "Large"?

In Section 2.2, I reviewed methods appropriate for a large number of groups and relatively small group sizes. In Section 2.3, I considered two approaches appropriate for large group sizes and a small number of groups. The DL and minimum distance approaches use the large group sizes assumption differently: in its most applicable setting, DL use the large $M_g$ assumption to ignore the first-stage estimation error entirely, while the asymptotics underlying the MD approach are based on applying the central limit theorem within each group.

Not surprisingly, more flexibility is afforded if G and the $M_g$ are both "large." For example, suppose we adopt the DL specification (with an unobserved cluster effect) and G = 50 (say, states in the U.S.). Further, assume first that the group sizes are large enough (or the cluster effects are so strong) that the first-stage estimation error can be ignored. Then, it matters not whether we impose some common slopes or run separate regressions for each group (state) in the first stage: we can treat the $\hat{\delta}_g$, $g = 1, \dots, G$, as independent "observations" to be used in the second stage. This means we apply regression (17) and use the usual inference procedures. The difference now is that with G = 50, the usual t statistics have some robustness to nonnormality of the $c_g$, assuming the CLT approximation works well. With small G, the exact inference was based on normality of the $c_g$.

Loeb and Bound (1996) essentially use regression (17), but with estimated slopes as the dependent variable in place of estimated intercepts. As mentioned in Section 2.3, LB impose some common slopes across groups, which means all estimated parameters are dependent across groups. The minimum distance approach without cluster effects is one way to account for the dependence. Alternatively, one can simply adopt the DL perspective and just assume the estimation error is swamped by $c_g$; then standard OLS analysis is approximately justified.

LB also apply a weighted least squares procedure that is intended to account for the different sampling variation in their slope estimates (largely due to different group sample sizes). With a cluster effect, the LB weights are not the correct ones for eliminating heteroskedasticity. [Plus, we do not have the actual variances of the $\hat{\delta}_g$ or $\hat{\gamma}_{g1}$; we have only estimates of their asymptotic variances. This is fine from the MD perspective, but not really correct from a WLS perspective.] There is nothing really wrong with using the incorrect weights provided the standard errors are appropriately adjusted, but with G = 36 one might be concerned about the finite-sample properties of robust standard errors. With a cluster effect, the same problem arises with OLS, as it ignores the heteroskedasticity. From the MD perspective, the LB WLS approach would be correct if each first-stage regression were estimated separately; but they were not.

In summary, with G somewhat large, the DL approach gains robustness to nonnormality of the $c_g$ if the first-stage estimation error can be safely ignored. A WLS approach is justified in the MD framework when separate regressions can be run for each group. But that weighting is not the correct one when either a cluster effect is present or one must account for dependence in the first-stage estimators across groups.

3. NONLINEAR MODELS

Many of the issues for nonlinear models are the same as for linear models. The biggest difference is that, in many cases, standard approaches require distributional assumptions about the unobserved group effects. In addition, it is more difficult in nonlinear models to allow for group effects correlated with covariates, especially when group sizes differ. For the small G case, we offer extensions of the Donald and Lang (2001) approach (with large group sizes) and the minimum distance approach.

Rather than use a general, abstract setting, I illustrate the issues for nonlinear models with a few widely used examples. Here I discuss binary response, count data, tobit models, and fractional response models.

3.1. Large Group Asymptotics

As in Section 2, we begin with the case where the asymptotic analysis is with fixed group sizes, $M_g$, and the number of groups, G, getting large.

3.1.1. Binary Response Models

We can illustrate many issues using an unobserved effects probit model. Let $y_{gm}$ be a binary response, with $x_g$ and $z_{gm}$, $m = 1, \dots, M_g$, $g = 1, \dots, G$, defined as in Section 2. Assume that

$y_{gm} = 1[\alpha + x_g\beta + z_{gm}\gamma + c_g + u_{gm} \geq 0]$  (20)

$u_{gm} \mid x_g, Z_g, c_g \sim \text{Normal}(0, 1)$  (21)

(where $1[\cdot]$ is the indicator function). Equations (20) and (21) imply

$P(y_{gm} = 1 \mid x_g, z_{gm}, c_g) = P(y_{gm} = 1 \mid x_g, Z_g, c_g) = \Phi(\alpha + x_g\beta + z_{gm}\gamma + c_g)$,  (22)

where $\Phi(\cdot)$ is the standard normal cumulative distribution function (cdf). We assume throughout that only $z_{gm}$ affects the response probability of $y_{gm}$ conditional on $x_g$ and $c_g$; the outcomes of $z_{gp}$ for $p \neq m$ are assumed not to matter. This is captured in (22). For pooled methods we could relax this restriction (as in the linear case), but, with the presence of $c_g$, this affords little generality in practice.

The presence of $c_g$ in (22) raises several important issues. Before even considering estimation, we should know how to define the partial effects of the explanatory variables, $(x_g, z_{gm})$. The elements of $\beta$ and $\gamma$ provide the directions of effects and, for continuous elements, the relative sizes.
For example, if the first element of $x_g$ is continuous,

$\dfrac{\partial P(y_{gm} = 1 \mid x_g, z_{gm}, c_g)}{\partial x_{g1}} = \beta_1 \phi(\alpha + x_g\beta + z_{gm}\gamma + c_g)$,  (23)

where $\phi(\cdot)$ is the standard normal density function. From (23), the sign of $\beta_1$ tells us whether $x_{g1}$ has a positive or negative partial effect on the response probability. Further, if $x_{g2}$ is another continuous variable, the ratio of partial effects is simply $\beta_1/\beta_2$. For discrete covariates, the signs of the parameters give the direction of effects, although the relative magnitudes are not free of the unobservable, $c_g$. In any case, the magnitude of the partial effects depends directly on $c_g$, as (23) shows in the case of a continuous covariate.

Often, we need a measure of the change in the response probability given a change in a regressor. To address the dependence of (23) [and its discrete analogs] on $c_g$, we can average (23) across the population distribution of $c_g$, for fixed values of the covariates. This results in what I have called the "average partial effect" (APE); see, for example, Wooldridge (2002, 2005). The simplest way to obtain APEs is to specify a distribution (or a conditional distribution) for $c_g$. We can easily derive the APEs, and then consistently estimate them, if we assume $c_g$ is independent of $(x_g, Z_g)$ and is normally distributed:

$c_g \mid x_g, Z_g \sim \text{Normal}(0, \sigma_c^2)$,  (24)

where the zero mean is without loss of generality because (20) contains an intercept, $\alpha$. The unconditional normality assumption for $c_g$ implies that the APEs are obtained from the function

$P(y_{gm} = 1 \mid x_g, Z_g) = \Phi[(\alpha + x_g\beta + z_{gm}\gamma)/(1 + \sigma_c^2)^{1/2}] \equiv \Phi(\alpha_c + x_g\beta_c + z_{gm}\gamma_c)$,  (25)

where $\alpha_c = \alpha/(1 + \sigma_c^2)^{1/2}$, and so on; see, for example, Wooldridge (2002, Chapter 15). Conveniently, the scaled coefficients are exactly the coefficients estimated by using a simple pooled probit procedure. So, for estimating the average partial effects, pooled probit is perfectly acceptable. With large G and small group sizes, we can easily make the standard errors and test statistics robust to arbitrary within-group correlation. This is especially convenient for panel data applications, where serial correlation in the idiosyncratic errors, $u_{gm}$ (where m indexes time), causes extra dependence in the responses $y_{gm}$ that is not accounted for by $c_g$. In other words, the "cluster-robust" variance matrix estimator allows for any kind of within-group correlation, just as in the linear case.

If we supplement (20), (21), and (24) with

$u_{g1}, \dots, u_{g,M_g}$ are independent conditional on $(x_g, Z_g, c_g)$,  (26)

then we have the so-called "random effects probit" model. Under the RE probit assumptions, $\alpha$, $\beta$, $\gamma$, and $\sigma_c^2$ are all identified, which means we can estimate the APEs as well as the partial effects evaluated at the average value of $c_g$, which is zero. We can also compute partial effects at other values of $c_g$ that we might select from the normal distribution with estimated standard deviation $\hat{\sigma}_c$. The details for random effects probit in the balanced panel data case are given in Wooldridge (2002, Chapter 15). The unbalanced case is similar.

The random effects probit estimator has no known robustness properties – that is, if any of the assumptions (20), (21), (24), and (26) fail, the parameter estimators are believed to be inconsistent. Plus, the APEs would not be consistently estimated in general. Therefore, if we insist that the parameter estimators and APEs are consistently estimated, there is no logical reason to use robust inference; the usual MLE inference is appropriate. Still, one might view the RE probit model as an approximation to the true model, and use a "sandwich" estimator for robust inference; see White (1982) and Wooldridge (2002, Chapter 12).
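A minimal statsmodels sketch of pooled probit with cluster-robust inference, plus the implied APEs for continuous covariates (the function name and data layout are mine):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

def pooled_probit(y, W, groups):
    """Pooled probit for the scaled coefficients in (25), with standard
    errors robust to arbitrary within-group correlation, plus APEs for
    continuous covariates (scaled coefficients times the average density)."""
    res = sm.Probit(y, W).fit(cov_type='cluster',
                              cov_kwds={'groups': groups}, disp=0)
    apes = stats.norm.pdf(W @ res.params).mean() * res.params
    return res, apes
```

statsmodels' built-in `res.get_margeff()` computes average marginal effects (with delta-method standard errors) and is a reasonable alternative to the hand-rolled APE line.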
Random effects probit relies on a full set of distributional assumptions, while pooled probit relaxes assumptions on the joint distribution at the cost of inefficient estimators (of the scaled parameters indexing the APEs). If we maintain the strict exogeneity assumption on the explanatory variables, there are estimators that are generally more efficient than pooled probit while remaining consistent under (20), (21), and (24). The "generalized estimating equation" (GEE) literature applies to grouped data (with panel data as a special case); see Liang and Zeger (1986). Perhaps the best way to think about GEE is as a multivariate weighted nonlinear least squares (WNLS) estimator, with explicit recognition that the variance-covariance matrix underlying the WNLS estimation is likely misspecified. The GEE approach begins by specifying a conditional mean for the $M_g \times 1$ vector $y_g$. In the binary response case, $E(y_g \mid x_g, Z_g)$ is simply the vector of response probabilities in (25). [In the context of GEE, the expression in (25) would be called the "population averaged model" because the unobserved effect has been averaged out.] Because $y_{gm}$ is binary,

$\text{Var}(y_{gm} \mid x_g, Z_g) = \Phi(\alpha_c + x_g\beta_c + z_{gm}\gamma_c)[1 - \Phi(\alpha_c + x_g\beta_c + z_{gm}\gamma_c)]$.  (27)

The pooled weighted nonlinear least squares estimator, using as weights the inverse of the estimated variance function, produces an estimator asymptotically equivalent to the pooled probit estimator. (Remember, this is all as $G \to \infty$ with the $M_g$ fixed.) What GEE does is exploit the within-group correlation to obtain a more efficient estimator. It does this by specifying a "working correlation matrix" (WCM), which is, for each g, an $M_g \times M_g$ matrix of correlations. This matrix is supposed to approximate $\text{Corr}(y_g \mid x_g, Z_g)$. In most implementations of GEE, the WCM does not depend on the conditioning variables. For cluster samples and panel data, a useful WCM is the "exchangeable" matrix that assumes a constant correlation, $\rho$, across any two pairs m and p. Combined with (27), the WCM gives an approximation to $\text{Var}(y_g \mid x_g, Z_g)$. The resulting (simple) structure is known to be wrong under the random effects probit model, or any of its extensions (such as models that allow within-group correlation in $u_{g1}, \dots, u_{g,M_g}$). The idea is that accounting for the within-cluster correlation in even a crude way generally leads to more efficient estimation than ignoring the correlation in estimation, as with pooled probit. Because the $M_g \times M_g$ variance matrices are assumed to be misspecified, a "sandwich" estimator should be used for inference, and this option is standard (and should always be used) for GEE applications, as in the sketch below. The GEE estimates are directly comparable to the pooled probit estimates – the parameters index the APEs in both cases – and can be compared to random effects estimates after scaling the latter by $1/(1 + \hat{\sigma}_c^2)^{1/2}$.
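A sketch of the GEE estimator with a probit link and exchangeable working correlation, assuming a recent statsmodels release (CamelCase link classes):

```python
import statsmodels.api as sm

def gee_probit_exchangeable(y, W, groups):
    """GEE with a probit link and an exchangeable working correlation;
    statsmodels reports the robust ('sandwich') covariance by default,
    which keeps inference valid when the WCM is misspecified."""
    fam = sm.families.Binomial(link=sm.families.links.Probit())
    mod = sm.GEE(y, W, groups=groups, family=fam,
                 cov_struct=sm.cov_struct.Exchangeable())
    return mod.fit()
```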
In many large G applications, interest centers on the causal effect of the unit-specific covariates, $z_{gm}$, controlling for differences in unobserved heterogeneity, $c_g$. Then, $x_g$ (and an intercept) is absorbed into $c_g$, and correlation between $c_g$ and $(z_{g1}, z_{g2}, \dots, z_{g,M_g})$ is allowed. For linear models, we saw that the within or fixed effects estimator allows arbitrary correlation, and does not restrict the within-cluster dependence of $(u_{g1}, \dots, u_{g,M_g})$. Unfortunately, allowing general correlation is much more difficult in nonlinear models. In the balanced case, where the $M_g$ are the same, a device due to Chamberlain (1980) can be used. Actually, in practice a more restrictive version is often used: the distribution of $c_g$ given $(z_{g1}, z_{g2}, \dots, z_{gM})$ is assumed to be homoskedastic normal with a mean linear in the group average:

$c_g \mid Z_g \sim \text{Normal}(\psi + \bar{z}_g\xi, \sigma_a^2)$,  (28)

where $\sigma_a^2$ is the conditional variance $\text{Var}(c_g \mid Z_g)$. If we use all of the random effects probit assumptions but with (28) in place of (24), then we obtain a simple extension of the RE probit model: simply add the group averages, $\bar{z}_g$, as a set of additional explanatory variables. The marginal distributions are

$P(y_{gm} = 1 \mid Z_g) = \Phi[(\psi + z_{gm}\gamma + \bar{z}_g\xi)/(1 + \sigma_a^2)^{1/2}] \equiv \Phi(\psi_a + z_{gm}\gamma_a + \bar{z}_g\xi_a)$,  (29)

where now the coefficients are scaled by a function of the conditional variance. As shown in Wooldridge (2002, Chapter 15), the average partial effects are consistently estimated by taking derivatives and changes of

$G^{-1} \sum_{g=1}^G \Phi(\hat{\psi}_a + z\hat{\gamma}_a + \bar{z}_g\hat{\xi}_a)$  (30)

with respect to elements of z. In other words, we average out the effects of the group averages, $\bar{z}_g$. The RE probit assumptions identify the original coefficients and $\sigma_a^2$. We can also apply pooled probit or GEE without (26) and directly estimate the scaled coefficients (and only the scaled coefficients are identified). Also, one is free to add $x_g$ to the set of covariates, but, typically, one would not interpret the partial effects in a causal way (because $c_g$ and $x_g$ might be (partially) correlated).

Unfortunately, while the Chamberlain-Mundlak device is sensible for balanced cluster samples (including balanced panels), it needs to be modified for the unbalanced case. [Of course, one can always balance a cluster sample under the assumption that the cluster sizes are exogenous, and that might be desirable if there is not much variation in the cluster sizes. But sometimes it entails throwing away a lot of data.] Here I will simply mention what might be done. At a minimum, (28) should be modified to allow the variances to depend on the cluster size, $M_g$. Under restrictive assumptions, such as joint normality of $(c_g, z_{g1}, \dots, z_{g,M_g})$, with the $z_{gm}$ independent and identically distributed within a cluster, one can derive $\text{Var}(c_g \mid Z_g)$. But these are strong assumptions. With G large and little variation in the group sizes, $M_g$, one might just allow a different variance for each $M_g$. Then the marginal distributions are

$P(y_{gm} = 1 \mid Z_g) = \Phi[(\psi + z_{gm}\gamma + \bar{z}_g\xi)/(1 + \sigma_{a,M_g}^2)^{1/2}]$.  (31)

A simple approach, and one that allows $(\psi, \xi)$ to be $(\psi_{M_g}, \xi_{M_g})$, is to estimate a different equation by pooled probit for each group size [with covariates $(1, z_{gm}, \bar{z}_g)$]. Then, we can estimate APEs for each $M_g$, and then average those. Obtaining standard errors via the delta method would be tricky, and this approach ignores the constancy of $\gamma$ across groups. [In effect, we would just be obtaining $(\hat{\psi}_{M_g}, \hat{\gamma}_{M_g}, \hat{\xi}_{M_g})$ for each group size.] To impose a common $\gamma$, a minimum distance approach can be used. Alternatively, for each different value of $M_g$ we can estimate a random effects probit model with the $\bar{z}_g$ included as covariates. In fact, we could be more general and allow $(\psi, \xi)$ to be $(\psi_{M_g}, \xi_{M_g})$. Then, after estimating the slope and variance parameters for a set of extended RE probits, we can use a minimum distance approach to estimate the common $\gamma$. This might be a useful topic for future research.
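For the balanced case, a sketch of the Chamberlain-Mundlak pooled probit and the APE computation in (30) follows. All names are mine; the APE helper assumes the group means are distinct, so that one row per group can be recovered.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

def mundlak_probit(y, Z, groups):
    """Pooled probit with the Chamberlain-Mundlak device: covariates are
    (1, z_gm, zbar_g), as in (29), with cluster-robust inference."""
    zbar = np.empty_like(Z, dtype=float)
    for g in np.unique(groups):
        idx = groups == g
        zbar[idx] = Z[idx].mean(axis=0)      # group average, repeated by row
    W = np.column_stack([np.ones(len(y)), Z, zbar])
    res = sm.Probit(y, W).fit(cov_type='cluster',
                              cov_kwds={'groups': groups}, disp=0)
    return res, zbar

def ape_at(res, zbar, z0):
    """APEs of continuous z's at the point z0, averaging out zbar_g as in
    (30): gamma_a * G^{-1} sum_g phi(psi_a + z0 gamma_a + zbar_g xi_a)."""
    L = zbar.shape[1]
    psi_a, gamma_a, xi_a = res.params[0], res.params[1:1+L], res.params[1+L:]
    zbar_g = np.unique(zbar, axis=0)         # one row per group (if distinct)
    index = psi_a + z0 @ gamma_a + zbar_g @ xi_a
    return stats.norm.pdf(index).mean() * gamma_a
```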
In estimating APEs, one can dispense with parametric assumptions entirely, provided nonparametric estimation is feasible. In particular, consider the two assumptions

$P(y_{gm} = 1|Z_g, c_g) = P(y_{gm} = 1|z_{gm}, c_g) \equiv F(z_{gm}, c_g)$   (32)

and

$D(c_g|z_{g1}, z_{g2}, \ldots, z_{g,M_g}) = D(c_g|\bar{z}_g)$.   (33)

Assumption (32) implies strict exogeneity conditional on $c_g$, while (33) means that the distribution of the group effect, $c_g$, given all within-group covariates, depends only on the group average; a specific distributional assumption, such as homoskedastic normality with a mean linear in $\bar{z}_g$, is clearly a special case. Define

$H_g(z_{gm}, \bar{z}_g) \equiv P(y_{gm} = 1|z_{gm}, \bar{z}_g)$.   (34)

Under (32) and (33), it can be shown that the APEs are obtained from $E_{\bar{z}_g}[H_g(z, \bar{z}_g)]$; see, for example, Wooldridge (2005). If the group sizes differ, $H_g(\cdot,\cdot)$ generally does depend on $g$. If there are relatively few group sizes, it makes sense to estimate the $H_g(\cdot,\cdot)$ separately for each group size $M_g$. Then, the APEs can be estimated from

$G^{-1}\sum_{g=1}^{G} \hat{H}_g(z, \bar{z}_g)$.   (35)

As a practical matter, we might just use flexible parametric models, such as pooled probit or random effects probit. This approach does not impose the constant function $F(\cdot,\cdot)$ specified in (32), and so more efficient methods should be available, but I will not explore that here.

Other strategies are available for estimating APEs. A procedure commonly known as "fixed effects probit" treats the $c_g$ as parameters to estimate in

$P(y_{gm} = 1|Z_g, c_g) = P(y_{gm} = 1|z_{gm}, c_g) = \Phi(z_{gm}\gamma + c_g)$.   (36)

With small group sizes $M_g$ (say, siblings or short panel data sets), treating the $c_g$ as parameters to estimate creates an incidental parameters problem. In particular, the estimator of $\gamma$ can be badly biased. Interestingly, as shown in recent work by Fernández-Val (2005) for the balanced panel case with independence across $m$ conditional on $(Z_g, c_g)$, the average partial effects, obtained from

$G^{-1}\sum_{g=1}^{G} \Phi(z\hat{\gamma} + \hat{c}_g)$,   (37)

are unbiased for the population APEs up to order $M^{-2}$ when the group effects are not present. When group effects are present, Fernández-Val (2005) proposes a bias correction for the APEs. See Fernández-Val (2005) for details.

Some prefer to use a logit response function rather than a probit. The use of logit for any of the methods discussed previously makes estimation of parameters and average partial effects more difficult, so logit has little to offer as an alternative there. Nevertheless, for allowing correlation between $c_g$ and $z_{gm}$, logit has an advantage over probit: under the conditional independence assumption

$y_{g1}, \ldots, y_{g,M_g}$ are independent conditional on $(Z_g, c_g)$,   (38)

the conditional maximum likelihood estimator eliminates $c_g$ and leads to consistent estimation of $\gamma$ without having to specify $D(c_g|Z_g)$. While this is useful for obtaining directions and relative magnitudes of partial effects, it does not allow us to estimate the APEs. Plus, for panel data applications it is limited by the conditional independence assumption.
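A minimal sketch of the conditional MLE under (38), using hypothetical variable names; in Stata the estimator is available directly as conditional (fixed effects) logit:

iis id
xtlogit y z1 z2, fe

(Equivalently, clogit y z1 z2, group(id).) Again, this delivers consistent estimates of $\gamma$ but not of the APEs.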
3.1.2. Count Data

Almost all of the discussion for binary response models extends to other nonlinear models for cluster samples. An important application is to count data (or, in fact, any nonnegative response variable). A popular model for the conditional mean is

$E(y_{gm}|x_g, Z_g, c_g) = E(y_{gm}|x_g, z_{gm}, c_g) = \exp(\alpha + x_g\beta + z_{gm}\gamma + c_g)$.   (39)

As with the probit model, we assume strict exogeneity of the unit-specific covariates, $z_{gm}$, conditional on $(x_g, c_g)$. As usual, strict exogeneity can be relaxed for pooled methods but not for random effects, fixed effects, and GLS-type methods.

A convenient normalization in (39) is $E[\exp(c_g)] = 1$. Under (39), a simple way to estimate $\alpha$, $\beta$, and $\gamma$ is the pooled Poisson quasi-maximum likelihood estimator (QMLE). In other words, act as if $y_{gm}$ has a Poisson distribution given $(x_g, z_{gm})$, and ignore the within-group correlation in estimation. In many statistical packages, it is simple to make inference robust to arbitrary within-group correlation (including serial correlation in a panel data setting). The Poisson QMLE is known to be fully robust to distributional misspecification as well as to within-group correlation. All that is required is

$E(y_{gm}|x_g, Z_g) = \exp(\alpha + x_g\beta + z_{gm}\gamma)$   (40)

(and we can replace $Z_g$ with $z_{gm}$), along with regularity conditions. See Wooldridge (2002, Chapter 19) for references and a summary.

As in the linear case, any pooled method is likely to be inefficient. A typical random effects Poisson approach adds the assumptions

$y_{gm}|x_g, Z_g, c_g \sim \mathrm{Poisson}[\exp(\alpha + x_g\beta + z_{gm}\gamma + c_g)]$,   (41)

$y_{g1}, \ldots, y_{g,M_g}$ are independent conditional on $(x_g, Z_g, c_g)$,   (42)

and

$\exp(c_g)|x_g, Z_g \sim \mathrm{Gamma}(\delta, 1/\delta)$,   (43)

where the gamma distribution is parameterized to have mean equal to one. [Of course, other distributions are possible, but the gamma is very common.] This estimator is programmed in many packages and is, of course, efficient under the maintained assumptions. Unfortunately, the estimator has no known robustness properties to violations of (41), (42), or (43).

As in the binary response case, a middle ground is available that maintains robustness while likely being more efficient than the pooled estimator. A GEE approach assumes (40), uses the nominal Poisson variance assumption

$\mathrm{Var}(y_{gm}|x_g, Z_g) = E(y_{gm}|x_g, Z_g) = \exp(\alpha + x_g\beta + z_{gm}\gamma)$,   (44)

and specifies a working correlation matrix. Typically, with a cluster sample the "exchangeable" option would be used. As described in Section 3.1.1, the GEE estimator is a multivariate weighted nonlinear least squares estimator with a (probably) misspecified variance matrix. Thus, inference should be made fully robust using a sandwich estimator, and this option is available (see appendix).

Unlike in the probit case, for count data we can derive the correct conditional variance matrix under the nominal random effects Poisson assumptions, and the resulting correlation matrix does not have the exchangeable form. As shown in Wooldridge (2002, Chapter 19),

$\mathrm{Var}(y_{gm}|x_g, Z_g) = \exp(\alpha + x_g\beta + z_{gm}\gamma) + \eta^2[\exp(\alpha + x_g\beta + z_{gm}\gamma)]^2$   (45)

$\mathrm{Cov}(y_{gm}, y_{gp}|x_g, Z_g) = \eta^2\exp(\alpha + x_g\beta + z_{gm}\gamma)\exp(\alpha + x_g\beta + z_{gp}\gamma)$,   (46)

where $\eta^2 = 1/\delta = \mathrm{Var}[\exp(c_g)]$. The conditional correlation clearly depends on the covariates. Therefore, rather than using a constant working correlation, one can use the variance matrix implied by (45) and (46). Such an estimator is still fully robust but would be more efficient than the exchangeable GEE estimator under the RE Poisson assumptions. Wooldridge (2002, Chapter 19) provides details on estimating $\eta^2$ as well as on fully robust inference [to guard against (45) or (46) being incorrect]. As far as I know, this estimator is not programmed in standard econometrics software, although it would not be difficult to do so.
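A minimal sketch of the pooled Poisson QMLE with inference robust to arbitrary within-group correlation, along with the exchangeable-WCM GEE alternative (hypothetical variable names, following the appendix conventions):

poisson y x1 z1 z2, robust cluster(id)
iis id
xtgee y x1 z1 z2, family(pois) link(log) corr(exch) robust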
It is straightforward to extend the previous models to allow correlation between $c_g$ and $z_{gm}$. As in the binary response case, we initially drop $x_g$. Then, we might be willing to assume

$E[\exp(c_g)|Z_g] = E[\exp(c_g)|\bar{z}_g] = \exp(\psi_g + \bar{z}_g\xi_g)$,   (47)

where $(\psi_g, \xi_g)$ depends on $g$ through the group size, $M_g$. Then

$E(y_{gm}|Z_g) = \exp(\psi_g + z_{gm}\gamma + \bar{z}_g\xi_g)$,   (48)

and now we can apply the pooled Poisson estimator, the GEE estimator, or the WNLS estimator implied by (45) and (46), but where we allow the intercept and the slopes on $\bar{z}_g$ to depend on the group sizes. One might want to use a different estimate of $\eta^2$ for each group size, but that is not necessary for consistency. One could also extend the RE Poisson estimator to allow $\exp(c_g)|Z_g$ to have a gamma distribution with mean as in (47), although estimation would be much more complicated than extending the other methods. As in the binary response case, one can add $x_g$ to the model, subject to the issue of how to interpret the parameters. Because the elements of $\gamma$ are semi-elasticities (or elasticities), often estimation of $\gamma$ is sufficient. Still, estimation of APEs can be based on

$G^{-1}\sum_{g=1}^{G} \exp(\hat{\psi}_g + z\hat{\gamma} + \bar{z}_g\hat{\xi}_g)$.   (49)

Rather than restricting the nature of the relationship between $c_g$ and $Z_g$, we can use the "fixed effects Poisson" estimator to consistently estimate $\gamma$. Originally, this estimator was derived as a conditional MLE by Hausman, Hall, and Griliches (1984). Blundell, Griffith, and Windmeijer (2002) showed that treating the $c_g$ as parameters to estimate yields the same estimator as the conditional MLE, so the name "fixed effects Poisson" cannot cause confusion. As shown by Wooldridge (1999), the FE Poisson estimator has very desirable robustness properties: it is consistent and $\sqrt{G}$-asymptotically normal under the assumption

$E(y_{gm}|Z_g, c_g) = \exp(z_{gm}\gamma + c_g)$.   (50)

The actual distribution of $y_{gm}$ is practically unrestricted (except for regularity conditions), as is the conditional dependence within group. If interest lies in $\gamma$, the FE Poisson estimator is very attractive. Of course, one should generally use a sandwich variance matrix estimator [see Wooldridge (1999) or Wooldridge (2002, Section 19.6.4)]; unfortunately, this is not currently a standard option in statistical packages.

3.1.3. Tobit Models for Corner Solutions

The discussion for Tobit models is very similar to that for probit models. Here, I assume that Tobit is applied to a "corner solution" response, so that $y_{gm}$, a nonnegative variable with probability mass at zero, is the variable we wish to explain. In particular, I will focus on estimating APEs on the mean response. The standard Tobit model with a group effect is

$y_{gm} = \max(0, \alpha + x_g\beta + z_{gm}\gamma + c_g + u_{gm})$   (51)

$u_{gm}|x_g, Z_g, c_g \sim \mathrm{Normal}(0, \sigma_u^2)$.   (52)

Among other things, (51) and (52) imply

$E(y_{gm}|x_g, Z_g, c_g) = \Phi[(\alpha + x_g\beta + z_{gm}\gamma + c_g)/\sigma_u](\alpha + x_g\beta + z_{gm}\gamma + c_g) + \sigma_u\phi[(\alpha + x_g\beta + z_{gm}\gamma + c_g)/\sigma_u] \equiv m(\alpha + x_g\beta + z_{gm}\gamma + c_g, \sigma_u^2)$.   (53)

The APEs are obtained by integrating this expression with respect to the distribution of $c_g$. A typical random effects approach assumes

$c_g|x_g, Z_g \sim \mathrm{Normal}(0, \sigma_c^2)$,   (54)

in which case the APEs are remarkably simple to obtain. They are estimated from

$m(\hat{\alpha} + x\hat{\beta} + z\hat{\gamma}, \hat{\sigma}_v^2)$,   (55)

where $\sigma_v^2 = \sigma_c^2 + \sigma_u^2$. Conveniently, pooled Tobit directly provides consistent and $\sqrt{G}$-asymptotically normal estimators of $\alpha$, $\beta$, $\gamma$, and $\sigma_v^2$. A sandwich estimator is easily obtained to allow arbitrary within-group correlation. The full random effects Tobit specification adds the same conditional independence assumption as random effects probit; see (26). Then $\sigma_c^2$ and $\sigma_u^2$ are separately identified, but consistency requires all of the RE Tobit assumptions. In principle, a GEE-type approach can be used in the Tobit case as well. This could be based on the scores from the Tobit log-likelihood, although, as far as I know, the details remain to be worked out.
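A sketch of pooled Tobit with cluster-robust inference, followed by a by-hand evaluation of the mean response function $m(\cdot)$ in (55) at chosen covariate values (hypothetical variable names; I assume the Tobit scale estimate is accessible as _b[/sigma] and use current Stata names for the standard normal cdf and density):

tobit y x1 z1 z2, ll(0) cluster(id)
* evaluate the mean response at x1 = 1, z1 = 0, z2 = 0
scalar sv = _b[/sigma]
scalar xb = _b[_cons] + _b[x1]
display normal(xb/sv)*xb + sv*normalden(xb/sv)

Because pooled Tobit estimates the composite scale $\sigma_v$, plugging its estimates into this formula delivers the APE version in (55).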
As with probit, we can modify pooled Tobit and RE Tobit to allow for correlation between $c_g$ and $Z_g$, as in (28). But, with different group sizes, we should replace $\sigma_a^2 = \mathrm{Var}(c_g|\bar{z}_g)$ with $\sigma_{a,M_g}^2$. As in the probit case, we can estimate a different variance for each group size $M_g$ if the group sizes are relatively few. An even easier approach is to allow all parameters to differ by $M_g$, estimate the APEs in each case, and average them together. Either pooled Tobit or RE Tobit can be used. The APEs would be obtained from

$G^{-1}\sum_{g=1}^{G} m(\hat{\psi}_{M_g} + z\hat{\gamma}_{M_g} + \bar{z}_g\hat{\xi}_{M_g}, \hat{\sigma}_{v,M_g}^2)$,   (56)

where the "$M_g$" subscript just indicates a different set of estimates for each different group size. With larger $M_g$, one might use "fixed effects" Tobit. While it is known that the estimator of $\gamma$ suffers from an incidental parameters problem, one can use

$G^{-1}\sum_{g=1}^{G} m(z\hat{\gamma} + \hat{c}_g, \hat{\sigma}_u^2)$   (57)

to estimate the APEs. I know of no simulation evidence that studies APEs based on (57); they might be reasonable estimates for moderate $M_g$. Fernández-Val (2005) contains some relevant theory.

3.1.4. Fractional Responses

Recently, more applied econometric work is recognizing that fractional responses can require special treatment. Naturally, one can always opt for a linear model (as in the case of binary, count, and corner solution responses). But there is more interest in using functional forms that ensure expected values are in the unit interval. One possibility for responses that have probability mass at zero and one is to adopt a two-limit Tobit model. The analysis would then be very similar to Section 3.1.3. In particular, the formulas for the expected value in the two-limit case can be easily adapted to allow for the presence of unobserved heterogeneity, either assumed independent of the covariates or with $D(c_g|\bar{z}_g)$ assumed normal. The drawback of the two-limit Tobit model is that it makes sense only when there is pile-up at zero and one. Even in such cases, it fully specifies the distribution $D(y_{gm}|x_g, Z_g)$. If we are primarily interested in effects on the mean response, more robust methods are available. These methods produce consistent estimators whether there is pile-up at neither, both, or only one endpoint.

Papke and Wooldridge (2005) recently proposed methods for fractional responses for balanced panel data sets. Here I draw on that work and make suggestions for the unbalanced case. In fact, the methods are virtually identical to those for the probit case, but with a change in how we interpret the probit response function. With $y_{gm}$ a fractional response, we begin with

$E(y_{gm}|x_g, Z_g, c_g) = \Phi(\alpha + x_g\beta + z_{gm}\gamma + c_g)$, $m = 1, \ldots, M_g$.   (58)

In other words, we simply assume the mean response has the probit form, which makes (58) attractive as a model of a fractional mean. [The logit form is also possible but does not lead to simple estimating equations.] If we add the independence and normality assumption

$c_g|x_g, Z_g \sim \mathrm{Normal}(0, \sigma_c^2)$,   (59)

just as in (24), then we have an extension of (25):

$E(y_{gm}|x_g, Z_g) = \Phi(\alpha_c + x_g\beta_c + z_{gm}\gamma_c)$,   (60)

where the "c" subscript denotes scaling by $1/(1 + \sigma_c^2)^{1/2}$. Because the Bernoulli quasi-likelihood function identifies the parameters of a correctly specified conditional mean – see Gourieroux, Monfort, and Trognon (1984) or Papke and Wooldridge (1996) – we can use pooled "probit" to consistently estimate the scaled parameters and the APEs. Naturally, we should use a fully robust sandwich variance matrix estimator, to account not only for within-group correlation but also for the fact that the Bernoulli variance assumption can no longer be expected to hold.
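A minimal sketch of the pooled Bernoulli QMLE for a fractional response, with a sandwich variance matrix robust both to within-group correlation and to failure of the Bernoulli variance assumption (hypothetical variable names; this is the glm command listed in the appendix):

glm y x1 z1 z2, family(bin) link(probit) robust cluster(id)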
See Papke and Wooldridge (2005) for more discussion. A random effects probit analysis does not make sense here because we do not have a fully specified distribution. But the GEE methods described for probit apply immediately. In fact, using standard GEE commands, one would not even know $y_{gm}$ is fractional rather than binary (except possibly for a "warning" that one is applying "probit" to a nonbinary response). The discussion about allowing correlation between $c_g$ and $Z_g$ is identical to the binary response case. Again, it is important to allow at least $\mathrm{Var}(c_g|\bar{z}_g)$ to depend on the group size, and possibly the slopes on $\bar{z}_g$ as well. A simple, flexible, but inefficient approach is to estimate a different model for each $M_g$, provided there is not much variation in the $M_g$. Better would be to apply a minimum distance approach that recognizes a constant vector $\gamma$ on $z_{gm}$ in (58).

3.2. Large Group Size Asymptotics

Unlike in the linear case covered by Donald and Lang (2001), for nonlinear models no exact inference is available. Nevertheless, if the group sizes $M_g$ are reasonably large, we can extend the approximate inference of the DL approach to nonlinear models. Plus, the minimum distance approach carries over essentially without change. We can apply the methods to any of the nonlinear models in Section 3.1 (as well as others). Here, I illustrate the probit response function (which means the analysis applies to binary or fractional responses).

With small $G$ but random sampling of $\{(y_{gm}, z_{gm}): m = 1, \ldots, M_g\}$ within each $g$, write

$P(y_{gm} = 1|z_{gm}) = \Phi(\delta_g + z_{gm}\gamma_g)$, $m = 1, \ldots, M_g$,   (61)

$\delta_g = \alpha + x_g\beta$, $g = 1, \ldots, G$.   (62)

As with the linear model, we assume the intercept, $\delta_g$ in (61), is a function of the group features $x_g$. With the $M_g$ moderately large, we can get good estimates of the $\delta_g$. The $\hat{\delta}_g$, $g = 1, \ldots, G$, are easily obtained by estimating a separate probit for each group. Under (62), we can apply the minimum distance approach just as before. Let $\widehat{\mathrm{Avar}}(\hat{\delta}_g)$ denote the estimated asymptotic variances of the $\hat{\delta}_g$ (so these shrink to zero at the rate $1/M_g$). For a binary response, these are just the usual MLE estimated variances. For a fractional response, they would come from a sandwich estimate of the asymptotic variance. Then, the minimum distance estimates are the WLS estimates from the regression

$\hat{\delta}_g$ on $1, x_g$, $g = 1, \ldots, G$,   (63)

using $1/\widehat{\mathrm{Avar}}(\hat{\delta}_g)$ as the weights. This is the efficient minimum distance estimator and, conveniently, the proper asymptotic standard errors are reported by the WLS estimation (even though we are doing large-$M_g$, not large-$G$, asymptotics). The overidentification test is obtained exactly as in the linear case: there are $G - K - 1$ degrees of freedom in the test.

The same cautions about using the overidentification test to reject the minimum distance approach apply here as well. In particular, in the treatment effect setup, where $x_g$ is zero or one, one might reject a weighted comparison of means simply because the means within the control group, or within the treatment group, or both, differ. It might make more sense to define, using other considerations, weighted averages of the population means, and then to estimate those averages using within-group sample averages. Similar considerations may be relevant when $x_g$ is a vector with a more complicated structure. If we impose common slopes $\gamma_g = \gamma$, $g = 1, \ldots, G$, then in applying MD estimation we should account for the across-group correlation in the $\hat{\delta}_g$, as described in Section 2.3.
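A sketch of the two-step procedure under (61) and (62), with hypothetical variable names and, say, G = 10 groups identified by id = 1, ..., 10. The first stage estimates a separate probit intercept for each group; the second stage runs the efficient minimum distance (WLS) regression, here via Stata's vwls command, which weights by the inverse of the estimated variances and reports a goodness-of-fit chi-square statistic that, under the assumptions here, serves as the overidentification test:

gen dhat = .
gen sd = .
forvalues g = 1/10 {
    quietly probit y z1 z2 if id == `g'
    replace dhat = _b[_cons] if id == `g'
    replace sd = _se[_cons] if id == `g'
}
collapse (mean) dhat sd x1, by(id)
vwls dhat x1, sd(sd)

For the conservative Donald and Lang alternative discussed below, one would instead run reg dhat x1 on the G group-level observations and use the t critical values with G - K - 1 degrees of freedom.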
The need to account for this correlation also arises if we impose that only a subset of the slopes is common across $g$. If we reject the overidentification restrictions and wish to have conservative inference, then we can adapt Donald and Lang and treat

$\hat{\delta}_g = \alpha + x_g\beta + \mathrm{error}_g$, $g = 1, \ldots, G$,   (64)

as approximately satisfying the classical linear model assumptions, provided $G > K + 1$, just as before. As in the linear case, this approach is justified if $\delta_g = \alpha + x_g\beta + c_g$ with $c_g$ independent of $x_g$ and drawn from a homoskedastic normal distribution. It assumes that we can ignore the estimation error in $\hat{\delta}_g$, based on $\hat{\delta}_g - \delta_g = O_p(M_g^{-1/2})$. Because the DL approach ignores the estimation error in $\hat{\delta}_g$, it is unchanged if one imposes some constant slopes across the groups, as with the linear model. Once one estimates $\alpha$ and $\beta$, the estimated effect on the response probability can be obtained by averaging the response probability for a given $x$:

$G^{-1}\sum_{g=1}^{G}\left[M_g^{-1}\sum_{m=1}^{M_g}\Phi(\hat{\alpha} + x\hat{\beta} + z_{gm}\hat{\gamma}_g)\right]$,   (65)

where derivatives or differences with respect to the elements of $x$ can be computed. Here, the minimum distance approach has an important advantage over the DL approach: the finite sample properties of (65) are virtually impossible to obtain, whereas the large-$M_g$ asymptotics underlying minimum distance would be straightforward using the delta method. Whether bootstrapping can be applied is an interesting question.

Particularly with binary response problems, the two-step methods described here are problematical when the response does not vary within group. For example, suppose that $x_g$ is a binary treatment – equal to one for receiving a voucher to attend college – and $y_{gm}$ is an indicator of attending college. Each group is a high school class, say. If some high schools have all students attend college, one cannot use probit (or logit) of $y_{gm}$ on $z_{gm}$, $m = 1, \ldots, M_g$. A linear regression returns zero slope coefficients and an intercept equal to unity. Of course, if randomization occurs at the group level – that is, $x_g$ is independent of group attributes – then it is not necessary to control for the $z_{gm}$. Instead, the within-group averages can be used in a simple minimum distance approach. In this case, as $y_{gm}$ is binary, the DL approximation will not be valid, as the CLM assumptions will not even approximately hold in the model $\bar{y}_g = \alpha + x_g\beta + e_g$ (because $\bar{y}_g$ is always a fraction regardless of the size of $M_g$).

As in the linear case, more flexibility is afforded if $G$ is somewhat large along with large $M_g$. Then, in implementing the DL approach – with or without common slopes imposed in the first stage – one gains robustness to nonnormality of $c_g$ if $G$ is large enough so that $G^{-1/2}\sum_{g=1}^{G} c_g$ and $G^{-1/2}\sum_{g=1}^{G} x_g'c_g$ are approximately normally distributed. If we estimate separate models in the first stage and do not wish to ignore the estimation error in the $\hat{\delta}_g$, then $\mathrm{error}_g$ in (64) is heteroskedastic, and we can, perhaps, use heteroskedasticity-robust standard errors in (64).

4. CONCLUDING REMARKS

My purpose in writing this expanded version of Wooldridge (2003) is to flesh out some issues only touched on in the shorter paper, while also considering more general frameworks. This version of the paper contains what are best described as conjectures, or at least suggestions that may not turn out to work very well. Future research could resolve the issue of how to allow group heterogeneity to be correlated with the unit-specific covariates when the group sizes differ. Minimum distance estimation strikes me as promising, but that remains to be seen.
Less parametric approaches should also be considered. Much more remains to be learned about the "small G" case, too, particularly for nonlinear models. A simulation study comparing Donald and Lang's (2001) conservative approach to minimum distance estimation, when neither method is entirely appropriate, would be worthwhile. For example, in a treatment effect case with $G = 4$ – two treatment groups and two control groups, with different means within each group – one might select different, modest sizes of $M_g$ (ranging from maybe 20 to 40). The underlying population distributions can be normal. Which method, DL or minimum distance, produces the better confidence intervals for the ATE (defined appropriately for a fictitious population)?

REFERENCES

Angrist, J.D. and V. Lavy (2002), "The Effect of High School Matriculation Awards: Evidence from Randomized Trials," NBER Working Paper 9389.

Arellano, M. (1987), "Computing Robust Standard Errors for Within-Groups Estimators," Oxford Bulletin of Economics and Statistics 49, 431-434.

Baker, M. and N.M. Fortin (2001), "Occupational Gender Composition and Wages in Canada, 1987-1988," Canadian Journal of Economics 34, 345-376.

Bertrand, M., E. Duflo, and S. Mullainathan (2002), "How Much Should We Trust Differences-in-Differences Estimates?" mimeo, MIT Department of Economics.

Blundell, R., R. Griffith, and F. Windmeijer (2002), "Individual Effects and Dynamics in Count Data Models," Journal of Econometrics 108, 113-131.

Campolieti, M. (2004), "Disability Insurance Benefits and Labor Supply: Some Additional Evidence," Journal of Labor Economics 22, 863-889.

Card, D. and A.B. Krueger (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review 84, 772-793.

Chamberlain, G. (1980), "Analysis of Covariance with Qualitative Data," Review of Economic Studies 47, 225-238.

Donald, S.G. and K. Lang (2001), "Inference with Difference in Differences and Other Panel Data," mimeo, University of Texas Department of Economics.

Fernández-Val, I. (2005), "Estimation of Structural Parameters and Marginal Effects in Binary Choice Panel Data Models with Fixed Effects," mimeo, Boston University Department of Economics.

Gourieroux, C., A. Monfort, and A. Trognon (1984), "Pseudo Maximum Likelihood Methods: Theory," Econometrica 52, 681-700.

Hausman, J.A., B.H. Hall, and Z. Griliches (1984), "Econometric Models for Count Data with an Application to the Patents-R&D Relationship," Econometrica 52, 909-938.

Hsiao, C. (1986), Analysis of Panel Data. Cambridge: Cambridge University Press.

Kézdi, G. (2001), "Robust Standard Error Estimation in Fixed-Effects Panel Models," mimeo, University of Michigan Department of Economics.

Liang, K.-Y. and S.L. Zeger (1986), "Longitudinal Data Analysis Using Generalized Linear Models," Biometrika 73, 13-22.

Loeb, S. and J. Bound (1996), "The Effect of Measured School Inputs on Academic Achievement: Evidence from the 1920s, 1930s and 1940s Birth Cohorts," Review of Economics and Statistics 78, 653-664.

Moulton, B.R. (1990), "An Illustration of a Pitfall in Estimating the Effects of Aggregate Variables on Micro Units," Review of Economics and Statistics 72, 334-338.

Papke, L.E. and J.M. Wooldridge (1996), "Econometric Methods for Fractional Response Variables with an Application to 401(k) Plan Participation Rates," Journal of Applied Econometrics 11, 619-632.

Papke, L.E. and J.M. Wooldridge (2005), "Panel Data Methods for Fractional Response Variables with an Application to Test Pass Rates," mimeo, Michigan State University Department of Economics.

Pepper, J.V. (2002), "Robust Inferences from Random Clustered Samples: An Application Using Data from the Panel Study of Income Dynamics," Economics Letters 75, 341-345.
White, H. (1980), "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity," Econometrica 48, 817-838.

White, H. (1982), "Maximum Likelihood Estimation with Misspecified Models," Econometrica 50, 1-26.

White, H. (1984), Asymptotic Theory for Econometricians. Orlando, FL: Academic Press.

Wooldridge, J.M. (1999), "Distribution-Free Estimation of Some Nonlinear Panel Data Models," Journal of Econometrics 90, 77-97.

Wooldridge, J.M. (2002), Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.

Wooldridge, J.M. (2003), "Cluster-Sample Methods in Applied Econometrics," American Economic Review 93, 133-138.

Wooldridge, J.M. (2005), "Unobserved Heterogeneity and Estimation of Average Partial Effects," in Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, D.W.K. Andrews and J.H. Stock (eds.). Cambridge: Cambridge University Press, 27-55.

APPENDIX

Below are commands in version 8.0 of Stata that can be used for robust inference with cluster or panel data sets. All commands assume that there is an identifier, "id," that identifies the cluster or group for each observation.

Section 2.2

Pooled OLS

reg y x1 x2 ... xK z1 z2 ... zL, cluster(id)

Random Effects: Usual Inference

iis id
xtreg y x1 ... xK z1 ... zL, re

Random Effects: Fully Robust Inference

iis id
xtgee y x1 x2 ... xK z1 ... zL, corr(exch) robust

Note: The xtgee and xtreg estimates will generally differ for two reasons. First, xtgee uses estimates of $\sigma_c^2$ based on an initial pooled OLS estimation, while xtreg uses a fixed-effects based estimate. Second, xtgee iterates, while xtreg does not.

Section 2.3

A separate linear model is estimated for each group identifier to obtain the asymptotic variances of the intercepts. In the second step, OLS can be used (as in DL), relying on the reported t statistics and critical values, or minimum distance (implemented as WLS) using the standard normal approximation.

reg y z1 ... zL if id == g

Section 3.1.1

Pooled Probit

probit y x1 x2 ... xK z1 z2 ... zL, robust cluster(id)

Random Effects Probit

iis id
xtprobit y x1 x2 ... xK z1 z2 ... zL, re

GEE with an Exchangeable Working Correlation Matrix

iis id
xtgee y x1 x2 ... xK z1 z2 ... zL, family(bin) link(probit) corr(exch) robust

or

xtprobit y x1 x2 ... xK z1 z2 ... zL, pa

(The qualifier "pa" stands for "population averaged." Stata simply uses the first command when "pa" is used with xtprobit.)

Using the Chamberlain-Mundlak Device

All of the previous commands can include the within-group averages of the unit-specific covariates. For example, Chamberlain's random effects probit model can be estimated by

iis id
xtprobit y x1 x2 ... xK z1 z2 ... zL z1bar z2bar ... zLbar, re

But, as discussed in Section 3.1.1, at a minimum these coefficients are all scaled by a variance that depends on the group size, $M_g$. One could do this estimation for each group size and average the APEs. Even better would be to use minimum distance to impose the restriction of a common $\gamma$ in the structural model.

Section 3.1.2

Pooled Poisson

poisson y x1 ... xK z1 ... zL, robust cluster(id)

Random Effects Poisson

iis id
xtpois y x1 ... xK z1 ... zL, re

GEE with an Exchangeable WCM

xtgee y x1 ... xK z1 ... zL, family(pois) link(log) corr(exch) robust

The Fixed Effects Poisson Estimator

iis id
xtpois y x1 ... xK z1 ... zL, fe
Note: Currently, Stata does not report robust standard errors for FE Poisson.

Section 3.1.3

Pooled Tobit

tobit y x1 ... xK z1 ... zL z1bar ... zLbar, ll(0) cluster(id)

Random Effects Tobit

xttobit y x1 ... xK z1 ... zL z1bar ... zLbar, re

Section 3.1.4

Pooled Two-Limit Tobit

tobit y x1 ... xK z1 ... zL z1bar ... zLbar, ll(0) ul(1) cluster(id)

Note: Unfortunately, xttobit does not currently allow specifying an upper bound, so the RE version of the two-limit Tobit is not immediately available.

Pooled Quasi-MLE

glm y x1 ... xK z1 ... zL, family(bin) link(probit) robust cluster(id)

GEE with Exchangeable WCM

xtgee y x1 ... xK z1 ... zL, family(bin) link(logit) corr(exch) robust

Using the Chamberlain-Mundlak Device

For GEE, this would be

xtgee y x1 ... xK z1 ... zL z1bar ... zLbar, family(bin) link(probit) corr(exch) robust

Section 3.2

A separate model is estimated for each group identifier to obtain the asymptotic variances of the intercepts, so the commands below are carried out for each group g. In the second step, OLS can be used (as in DL), relying on the reported t statistics and critical values, or minimum distance (implemented as WLS) using the standard normal approximation.

Binary Response

probit y z1 ... zL if id == g

Count Response

glm y z1 ... zL if id == g, family(poisson) link(log) robust

Fractional Response

glm y z1 ... zL if id == g, family(bin) link(probit) robust

or

glm y z1 ... zL if id == g, family(bin) link(logit) robust