Small-Sample Adjustments for Wald-Type Tests Using Sandwich Estimators

Small-Sample Adjustments for Wald-Type Tests Using Sandwich Estimators
Author(s): Michael P. Fay and Barry I. Graubard
Source: Biometrics, Vol. 57, No. 4 (Dec., 2001), pp. 1198-1206
Published by: International Biometric Society
Stable URL: http://www.jstor.org/stable/3068253
Accessed: 02/07/2009 11:27
Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at
http://www.jstor.org/page/info/about/policies/terms.jsp. JSTOR's Terms and Conditions of Use provides, in part, that unless
you have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and you
may use content in the JSTOR archive only for your personal, non-commercial use.
Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at
http://www.jstor.org/action/showPublisher?publisherCode=ibs.
Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed
page of such transmission.
JSTOR is a not-for-profit organization founded in 1995 to build trusted digital archives for scholarship. We work with the
scholarly community to preserve their work and the materials they rely upon, and to build a common research platform that
promotes the discovery and use of these resources. For more information about JSTOR, please contact [email protected].
International Biometric Society is collaborating with JSTOR to digitize, preserve and extend access to
Biometrics.
http://www.jstor.org
BIOMETRICS 57, 1198-1206
December 2001
Small-Sample Adjustments for Wald-Type
Tests Using Sandwich Estimators
Michael
P. Fay
National Cancer Institute, 6116 Executive Boulevard, Suite 504, MSC 8317,
Bethesda, Maryland 20892-8317, U.S.A.
email: [email protected]
and
Barry
I. Graubard
National Cancer Institute, 6120 Executive Boulevard, EPS Room 8032,
Bethesda, Maryland 20892-7335, U.S.A.
SUMMARY. The sandwich estimator of variance may be used to create robust Wald-type tests from estimating equations that are sums of K independent or approximately independent terms. For example, for
repeated measures data on K individuals, each term relates to a different individual. These tests applied
to a parameter may have greater than nominal size if K is small, or more generally if the parameter to
be tested is essentially estimated from a small number of terms in the estimating equation. We offer some
practical modifications to these robust Wald-type tests, which asymptotically approach the usual robust
Wald-type tests. We show that one of these modifications provides exact coverage for a simple case and
examine by simulation the modifications applied to the generalized estimating equations of Liang and Zeger
(1986), conditional logistic regression, and the Cox proportional hazard model.
KEY WORDS: Conditional logistic regression; Cox proportional hazards model; Generalized estimating
equations; Robust Wald statistics; Sandwich estimator; Small sample size.
1. Introduction
The sandwich estimator of variance has been used with many
different types of data to provide inferences robust to certain
model misspecifications. We mention three examples. Liang
and Zeger (1986), using their generalized estimating equations
(GEEs), applied sandwich estimators of variances to repeated
measures data. Lin and Wei (1989) applied sandwich estimators to the Cox proportional hazards model, and Fay et al.
(1998) applied them to conditional logistic regression. Each
of these applications use estimating equations with K independent orapproximately independent terms. The Wald-type
tests used with the sandwich estimator are valid as K goes
to infinity. For finite samples simulations have shown that
these (unadjusted) Wald sandwich tests tend to be liberal
(see Lin and Wei, 1989; Emrich and Piedmonte, 1992; Gunsolley, Getchell, and Chinchilli, 1995; Fay et al., 1998; and
Mancl and DeRouen, 2001). In this paper we consider finite
sample adjustments for Wald sandwich tests and apply them
to quite general estimating equations, which include the three
applications mentioned above.
Consider the data that motivated this research. Fay et al.
(1997) performed a meta-analysis to determine the effect of
different types of dietary fat on the development of mammary tumors in rodents. The analysis combined 146 exper-
iments ("sets" in the terminology of Fay et al., 1997). Each
experiment had two or more groups of animals, each group received the same intervention except for diets, and the response
was the number of animals in the group that developed tumors. Although within an experiment there were only diet
differences between the groups, between experiments there
were many differences. For example, the type and/or dose of
carcinogen and the amount of follow-up time varied between
experiments. Thus, as detailed in Fay et al. (1998), a conditional logistic regression was used to condition out intercept
differences between experiments, and a sandwich estimator
of variance was used to account for heterogeneity of the diet
parameters. A random effects model for the diet parameters
was not tractable because each of the diet effects was not varied within each experiment. For example, Fay et al. (1997)
measured effects of four types of fat, but one of the types of
fats, n-3 polyunsaturated fatty acid (n-3 PUFA), was rarely
present except in very small portions. Only 22 of the experiments had any dietary groups with greater than 1% of the
dietary fat equal to n-3 PUFA. Thus it is difficult to estimate a random effect for the n-3 PUFA parameter in all but
these 22 experiments. The important aspect of these data for
this paper is that although there is a large sample size of independent observations, i.e., 146 experiments that make up
1198
Small-Sample
Adjustments for Wald-Type Tests
the terms in the estimating equation, there is an effectively
small sample size for testing the n-3 PUFA parameter. When
a parameter is primarily estimated from a small proportion
of the terms in the estimating equations (e.g., the n-3 PUFA
parameter in the diet data), we call the estimating equations
unbalanced for that parameter. We reanalyze the diet data in
section 3.3.
We propose two types of adjustments. First, we use Taylor
series approximations to adjust for the bias of the sandwich
estimator of variance. Mancl and DeRouen (2001) provide
such an adjustment for GEEs and compare it to many earlier
bias corrections (see that paper for earlier references). In this
paper we propose a bias correction similar to that of Mancl
and DeRouen (2001), but our correction may be applied to
more general estimating equations than the GEE. Our second adjustment is to use an F (or t) distribution instead of a
chi-square (or a normal) distribution to calculate significance.
Others have used this adjustment with the canonical use of
1 and K - p degrees of freedom, where p is the number of
parameters describing the mean (see Lipsitz et al., 1994, and
Mancl and DeRouen, 2001, for the GEE case). As seen for
the diet example above, these canonical degrees of freedom
may not be appropriate for estimating equations that are unbalanced for certain parameters. This problem was noted by
Chesher and Austin (1991) for the simple linear model. We
propose an estimator of degrees of freedom that produces different degrees of freedom depending on which parameter (or
combination of parameters) is being tested. These estimators
can more properly adjust for situations such as the diet example. Similar but not equivalent small sample adjustments
have been proposed for some special cases (see Fay et al.,
1998, for conditional logistic regression; Lipsitz and Ibrahim
(1999) for linear regression; and Fai and Cornelius (1996) for
unbalanced split-plot experiments).
Because of the way we motivate our degrees of freedom estimators, we restrict ourselves to tests of linear combinations
of parameters. In other words if p is the parameter vector,
then we are restricted to tests with null hypotheses equal to
CT/ = CTo0, where 3o is known and C is a vector of constants (i.e., C is not allowed to be a matrix).
In section 2.1 we review the use of the sandwich estimator
of variance and its associated Wald test. In section 2.2 we propose our bias correction and compare it to that of Mancl and
DeRouen (2001), and in section 2.3 we propose some degrees
of freedom estimators. In section 3 we examine one special
case where one of the modifications produces an exact test,
and we perform simulations to test these modifications when
used with GEEs, conditional logistic regression, and Cox proportional hazards models.
2. Main Result
2.1 Background
Consider
estimating
equations
of the form, SK1
Ui(/3) = 0,
where 3 is a p x 1 parameter vector. Let / be the solution
to the estimating equations. Let Ui = Ui(/), and let a
hat over any function denote evaluation at 3 (e.g., Ui
Ui(3)). Assume E{ iK_1 Ui(/3)} = 0 for some 3o, and both
cov{Ui (3o), Uj3(3o)} -
0 for i ? j and
-30
-- 0 as K -- oo.
We use a Taylor series approximation about / but replace the
1199
derivative by an estimator,
Ui -
Ui
i (
(1)
-),
where Qi is an estimator of -dUi//3
evaluated at 3.
Summing over all clusters and rearranging terms gives
/
-/3
i
Vm
(2)
I
where Vm == (K 1 Qi)-l and will be called the "modelbased" variance. If Vm is approximately constant with small
changes in /3, then the variance of 3 - 3 may be estimated
with the sandwich estimator, Vs = Vm(EK1 UiUT)Vm.
We consider hypotheses of the form Null: CT/ = CT3o
vs. Alternative: CTP3
CT 30, where C is a p x 1 vector
of constants. For example, we test whether 3j = 0 or not
by letting C be all zeros except for a 1 in the jth row
and 0o = 0. The unadjusted Wald sandwich test rejects
when T2
{T(
- 3o)}2/CT VsC
> (X2)-(1
- a), where
(X2)-l(q) is the qth quantile of the chi-square distribution
with one degree of freedom.
2.2 A Bias-Corrected Sandwich Covariance Estimator
Mancl and DeRouen (2001) motivated a bias correction of Vs
for the GEE case by using a first-order Taylor series expansion
of the ith residual vector together with approximation (2) to
approximate the expected value of the ith squared residuals.
Analogously, we use a first-order Taylor series expansion of
Ui (approximation [1]) together with approximation (2) to
obtain
E (UiU^)
(Ip-
QiVm) 'i
(Ip-
Qi Vm)
(
+ QjVm
Z"'j
\j
;
I
Vm,(3
(3)
where i = cov(Ui) and Ip is a p x p identity matrix. For
tractability, Mancl and DeRouen (2001) motivated their bias
correction by assuming the the last term in their expression
(4) is small [note there is a typographical error in expression
(4) of Mancl and DeRouen, 2001; the cov(yi) in the last term
should be cov(yj)], which is analogous to assuming here that
the last term in expression (3) is small.
We take a different approach to tractability. We consider
a correction that is reasonable when the working variance
model is approximately within a scale factor of the true
variance, i.e., when 'i
cQi for all i and some constant
c and the model-based variance is consistent. For example, in
the GEE case this would occur when the correlation model
is correctly specified. When ji .
c2i, approximation (3)
T i(Ip - VmQi).
simplifies to E(UiUi)
(Ip - QiVm)qi
To partially correct for this bias [i.e., E(UiUT)
T
'i], we
estimate
Ii
with
i = HiUi Ui HiT, where Hi is given below
and the form of the correction ensures that Ii is a symmetric
nonnegative definite matrix. In general, (Ip - QiVm) is not
may
symmetric, so that the choice of Hi = (Ip - iVm)-/2
not exist. Instead, we propose the simple bias correction of
letting Hi be a p x p diagonal matrix with jjth element equal
to {1 - min(b, {fiVm}jj)}-1/2,
where b < 1 is a constant
1200
Biometrics,
defined by the user. Setting a bound, b, is a practical necessity
to prevent extreme adjustments when the jjth element of
QiVm is very close to 1. (In fact it is possible for the jjth
element of QiVm to be greater than 1; see http://srab.
cancer. gov/sandwich.)
We arbitrarily use b = 0.75 for our
simulations, which ensures that each diagonal element of Hi
is less than or equal to 2. We write this adjusted sandwich
estimator
as Va = Vm (K-=1 HiUiUTHi)
Vm. Although
the
bound 0.75 is arbitrary, the bound of b = 0.75 is rarely
reached. For example, the simulations in section 3 gave almost
exactly the same results (results not shown) when run without
any bound (i.e., the bound is infinity).
2.3 An Approximate F-Distribution
Let T2 be the Wald test statistic using Va instead of Vs. The
main result motivated in this section is
B
{CT(f-_o)}2U
T
=CTVaC
UTB1U
(4)
Fl,d,
where ~ denotes approximate distribution under the null
Bo and B1 are pK x pK
hypothesis, UT = [UT ... U,
matrices given in appendix A, F1,d is an F distribution with
1 and d degrees of freedom, and we give estimators of d
later. Following a standard derivation of the F distribution,
our F distribution approximation may be motivated by the
following three approximate conditions: (1) cr-2UTBoU
X1; (2) a-2dUTB1U
Xd; and (3) UTBoU and UTB1U are
approximately independent, where a2 = var{CT(/3 - 3o)}.
We discuss each of these approximations in Appendix A.
In order to estimate d we need estimators of qi for i =
1,.. , K. We estimate Ti in one of two ways. First, we simply
use Ii as in section 2.2; this estimator tends to overestimate
the heterogeneity of the Ti. One can see this by noting that
even when all Fi are equal, the Ji vary. The associated
estimator of d is
dH=
{
(
)
{trace
(Bi)}2
trace (B1IB1)
(5)
'
where I = block diagonal(Tl, * * IK) (see Appendix A).
Alternatively, we propose the more complicated iV =
Wi(Ke=l
we))-1(_i=
lj),
where
wi
=
CT{(ji
Q)-1
-
Vm}C and wi(E we)- represents the proportional reduction
in the model-based variance of CT(3 - 3) because of the
addition of the ith cluster. The idea behind the /i is that
it smooths the extremely variable estimates Tj, j = 1,..., K,
yet unlike the unweighted sum (i.e., using K-1 Z j to
estimate each Ti) it still accounts for some of the differences in
cluster variability associated with the working variance model.
We estimate dH in a similar manner to dH by replacing 'Ji
with 'i in equation (5).
To see if the adjustment of section 2.2 is necessary, we
estimate the distribution of Ts2 with an F distribution with
1 and d (equal to either d or d) degrees of freedom. Here d
is calculated the same as dH except with the replacement of
Hi with an identity matrix for all i. Similarly define d. We
compare the different methods in section 3.
December 2001
The adjustments of this section do not affect the asymptotic
properties of the unadjusted Wald sandwich test. Under the
assumption that the data are sufficiently regular such that
i(EQj)as K --
1
-
0 for all i and d --
oo, (Ip -
iVm)1
-
oo as K -*
oo, then
0.
Ip, and Va-Vs
Further, if d -- oo, then the Fl,d distribution approaches a
chi-square distribution with 1 degree of freedom. Thus even if
the assumptions and approximations that motivate equation
(4) are tenuous in some situations, the adjustments of this
section may likely be an improvement over the unadjusted
Wald sandwich test.
3. Simulations and Special Cases
3.1 Normal Model with the Same Design for Each Cluster
Consider the (true) model Yi/ -
N(X/,
E), where Yi is an
n x 1 vector of responses, X is an n x p full-rank design matrix,
n > p, and E is a pxp covariance matrix. We use independence
estimating equations (see Liang and Zeger, 1986), i.e., Ui =
(XT(Yi - X/3), where q-1 is a scalar dispersion estimator.
Under the null model for any nondegenerate E, X, and C,
then T2 = {(K - 1)/K}T2
F1,K-1.
Further,
here the
degrees of freedom estimators, dH and d, give K - 1, so
that the test, say 65, that compares T2 with F1 , produces
an exact test for any K (see Appendix B). In contrast,
for K = 20, the standard sandwich test, 61 (T2 compared
with X2) has size 0.071, 62 (T2 compared with F -) has
simulated size 0.034, 63 (T2 compared with F i,) has size
0.055, and 64 (T2 compared with F1- ) has simulated size
l,dH
0.031 where the simulations had 100,000 replications. The
associated test statistic of Mancl and DeRouen (2001) is
T/D
= {(K - 1)/K}2
T2. For K = 20, comparing
TMD to
X1 has size 0.059, whereas comparing it to F1,K-p with p = 1
has size 0.045, and comparing it to F1,K-p with p = 2 has size
0.044. Thus in this case 65 performs best, but 63 and the tests
of Mancl and DeRouen (2001) have close to nominal size.
3.2 Generalized Estimating Equations
Consider GEEs of the form (see Liang and Zeger, 1986)
V-1
Z= 1
Ei-U() DTUi())
(/,){Y - U(/P)}, where Yi is
an ni x 1 vector of responses, /i is the model of E(Yi),
Di = O,ai(P3)/o/3and Vi(3) estimates the "working variance"
of Yi. The function Vi(/3) has a complicated form. [It is
denoted Vi(3) in Liang and Zeger, 1986. See that paper for
details.] Let our estimate of -OUi/O/ be Qi = DTvi-1Di,
and Vs is the sandwich estimator proposed by Liang and
Zeger (1986). To test the GEE models we simulated both
Poisson and binomial data and tested the GEE model with
both an independence and exchangeable working correlation
within the cluster.
For the Poisson case we simulated four types of data sets, all
with K = 20 clusters. Each type of data set is denoted by one
level of each of two descriptors, the variance (either Poisson or
overdispersed Poisson), and the treatment assignment (either
changed within the cluster or fixed within cluster). For the
models that changed treatment assignments within the cluster
we used the working independence variance, and for those
that did not we used the working exchangeable variance. For
the model-based variance we included a scalar overdispersion
Small-Sample Adjustments for Wald-Type Tests
1201
Table 1
Simulations for poisson GEE: proportion rejected at a = 0.05
level (average length of confidence intervals) for 1000 simulations
Test
Distribution
Test
Variance
(df)
One Treatment
Per Cluster
Ri = Exchangeable
T = Oa
T = 1/2
Both Treatments
for each Cluster
Ri = Independence
T= 0
T = 1/2
.091 (.424)
.043 (.060)
.330 (.115)
.090 (.079)
Vm
6m
X2(1)
.089 (.421)
.075 (.058)
.074 (.227)
.103 (.079)
Vs
61
X2(1)
.052
.040
.028
.029 (.325)
Vs
62
(.097)
(.555)
(.071)
F(1,d)
.067
.066
.053
.055 (.244)
63
Vs
(.451)
(.062)
F(1,d)
(.087)
.042
.039
.022
.024 (.335)
Va
64
(.570)
(.073)
F(1,dH)
(.101)
.059 (.091)
.064 (.463)
.046 (.064)
.048 (.251)
F(1, dH)
Va
65
.041 (.098)
.047 (.501)
.043 (.066)
.041 (.258)
66
VMD
F(1, K- p)
a Forthis column, 991 out of 1000 converged(see http: //srab. cancer. gov/sandwich for a non-converging
data set); results relate to those 991 simulations. All other columns had 1000 convergedsimulations.
Data are simulated by Yij Poisson(/ij), i = 1, ...,20; j = 1,...,nil + ni2 where tij = exp(log(10)+
xijbij),
xij = 1 for j < nil, xij = -1 for j > nil, and bij are independent pseudonormal random variates
with mean 0 and standard deviation r. All models of means are lij = exp(/30+
q3lxij). We test whether
/1 = 0 and the average length of confidenceintervals are for P1 only. "Both treatments for each cluster" is
nia = ceiling(Nia), for a = 1, 2, where Nia is distributed pseudo-Gammawith mean 10 and variance 20,
and ceiling(X) gives the smallest integer greater than or equal to X. "Onetreatment per cluster" uses the
same method for generating the nia, then sets ni2 = 0 for i < 10 and sets nil1 = 0 for i > 10.
term as described in Liang and Zeger (1986); it is the default in the yags software that we modified to use for this
simulation (the yags function written in Splus by V. Carey
is reviewed in Horton and Lipsitz, 1999, and can be found at
/index. ssoft.
http: // biosunl. harvard. edu /carey
html). The details of the data generation are listed with the
results in Table 1. We see that 61 is liberal in all four cases, 63
is less liberal, whereas 65 appears to have values closer to the
nomial level. The test 66 (T2D compared to F,K-p) appears
to perform well, although perhaps slightly conservatively. The
tests 62 and 64 can be very conservative, especially in the cases
with both treatments in each cluster. We see in the last column of the first row that 6m may perform very poorly when
the model is misspecified. All of the modified tests (62, 63, 64,
65, and 66) appear to have sizes that come closer to maintaining the nominal level than the standard sandwich test, and
the average length of the confidence intervals give a crude
measure of the price paid to achieve those sizes.
We modeled the simulation for logistic regression after real
data. Preisser and Qaqish (1999) have made available data
from the Guidelines for Urinary Incontinence Discussion and
Evaluation (GUIDE) study at http://www.phs.wfubmc.
Preisser and Qaqish (1999) use
edu/data/uipreiss.html.
five covariates measured from a baseline survey to predict
whether individual patients would respond yes or no as to
whether they are bothered by accidental loss of urine. There
were 137 patients from 38 practices in the study, and 54 patients responded that they were bothered by their urinary
incontinence. We fit a logistic model without each of the five
covariates in the model and used each of the five sets of fitted
values as the true probability of response. We then simulated
binary responses using those probabilities. Thus each of the
five sets of simulated data came from a logistic model with one
of the parameters equal to zero. We then tested whether that
parameter equals zero. For this simulation we let the working correlation be the independence model. The results are
presented in Table 2. In this case the model-based variance
is correctly specified and 6m performs well and 61 appears
consistently liberal, whereas 64 and 65 appear quite conservative, with 62 and 66 slightly less conservative. In this case, 63
appears to be the best of the sandwich tests.
In Table 3 we list the sample variance of the parameter estimate and compare it to the mean of the variance estimates
for the four variance estimates, Vm, Vs, VMD (Mancl and
DeRouen's, 2001, adjustment), and Va. We bold the variance
estimator with the mean closest to the sample variance of
the parameter estimate. When the model is misspecified (see
the fourth row), Vm underestimates the variance, but otherwise Vm does well. The estimator Vs tends to underestimate
the variance. Neither of the corrected sandwich variance estimates, VMD and Va, appears to regularly outperform the
other. In many situations both VMD and Va appear to overcorrect for bias.
3.3 Conditional Logistic Regression
For conditional logistic regression, we have binary responses
grouped into clusters, and within each cluster we condition
on the total number of positive responses. The estimating
equations can be written in terms of the sufficient statistic
for 3, ti (i.e., Ui(/3) = ti - E(ti 1 3), where E(ti I 3) is the
modeled value for ti given 3). Then fi(/3) = -OUi/O3. For
details see, e.g., Fay et al. (1998).
We repeated four sets of the simulations from Fay et al.
(1998), which were motivated by a meta-analysis. We briefly
describe the simulations using the same terminology as Fay
et al. (1998); see that paper for more details. Each simulation
1202
December 2001
Biometrics,
Table 2
Simulations for logistic independence estimating equations derived from GUIDE data: proportion
rejected at a = 0.05 level (average length of confidence intervals) for 1000 simulations
Test
Var
Test
Distn
(df)
Model
without
FEMALE
6m
61
62
63
64
65
66
Vm
Vs
Vs
Vs
Va
Va
VMD
X2(1)
x2(1)
F(1,d)
F(1,d)
F(1,dH)
F(1,dH)
F(1, K - p)
.048
.078
.038
.051
.018
.026
.042
(2.58)
(2.41)
(3.06)
(2.70)
(3.47)
(3.02)
(2.87)
Model
without
AGE
.052
.067
.039
.057
.018
.029
.039
estimates 3 = [/3132]T from K clusters of 60 binary responses,
and here we test whether P1 is equal to zero or not. Each of
the four sets of simulations is described by both the "design"
(either A or D) and the "case" (either 1 or 4) and consists of
1000 simulations. Design A has K = 20 clusters, and by design
one-fourth of the clusters do not contribute the estimation of
,31 because of conditioning, whereas similarly design D has
K = 40 clusters but 37/40 of the clusters do not contribute
to the estimation of /1. The responses are all generated with a
random intercept term for each cluster. For case 1 the effects
for / are fixed, whereas for case 4 these effects are random.
Thus the conditional logistic model is correctly specified for
case 1 but not for case 4. All simulations are under the null
hypothesis. The results are presented in Table 4. Note that
design D shows that 61 can be very liberal in extreme cases.
We return to the meta analysis of Fay et al. (1997) mentioned in the introduction. The six different test results that
the effect of the n-3 PUFA is different from zero give two-sided
p-values of p = 0.0134 for 6m, p = 0.3377 for 61, p = 0.4136
for 62, p = 0.3590 for 63, p = 0.4320 for 64, and p = 0.3824
(2.38)
(2.32)
(2.68)
(2.46)
(3.06)
(2.74)
(2.63)
Model
without
DAYACC
Model
without
SEVERE
.049
.076
.044
.058
.031
.037
.040
.051
.062
.029
.049
.010
.017
.033
(.258)
(.242)
(.286)
(.269)
(.329)
(.296)
(.289)
(1.42)
(1.36)
(1.62)
(1.46)
(1.87)
(1.69)
(1.56)
Model
without
TOILET
.055
.080
.040
.056
.016
.028
.042
(.350)
(.330)
(.405)
(.375)
(.636)
(.551)
(.408)
for 65. If there is a random effect of the n-3 PUFA, the usual
sandwich test, 61, gives better coverage than the model-based
test, 6m, but judging from our simulations this may be slightly
liberal. Based on our simulations, the modified sandwich tests
(62,63,64, and 65) appear more likely to have nominal coverage. We recommend the use of either 63 or 65 for having better
coverage than 61 while (usually) not being overly conservative.
For details of the biological issues and the full model, see Fay
et al. (1997).
3.4 Cox Proportional Hazards
The Cox model uses the partial likelihood for right-censored
failure time data. Let the associated efficient score statistic for
/3 be written as ES=1 U (P3),where the summation is over the
K individuals whose failure time may or may not be rightcensored (see Appendix C for details of the notation). The
terms Ui are not independent, but Lin and Wei (1989) showed
' Ui (/) =
that the estimating equation may be written as i=
where
the
are
Ui
SK=1Ui(/3),
asymptotically independent,
Ui = Ui + Ui, and where EiN1 Ui'(3) = 0 for all /3. Then
Table 3
Comparison of simulated variance estimators
Sample
Variance
of 3i
Poisson
Ri=Exchangeable,
7 = Oa
Ri=Exchangeable, r= 1/2
Ri=Independence, T = 0
Ri=Independence, T = 1/2
Mean
of
Mean
of
Mean
of
Vm
Vs
VMD
Mean
of
Va
GEE (Table 1)
0.00051
0.00043
0.00042
0.00056
0.00045
0.01410
0.00023
0.00405
0.01237
0.00024
0.00087
0.01201
0.00023
0.00376
0.01482
0.00025
0.00424
0.01264
0.00024
0.00398
Logistic Independence
Estimating Equations (Table 2)
0.47065
0.43737
0.39234
0.51774
0.47954
Model without FEMALE
0.42700
0.40968
0.37511
0.35659
0.44090
Model without AGE
0.00434
0.00390
0.00521
0.00455
Model without DAYACC
0.00466
0.12274
0.12409
0.13186
0.15042
0.16223
Model without SEVERE
0.00806
0.00732
0.01050
0.01101
Model without TOILET
0.00898
a For this row, 991 out of 1000 converged;statistics calculated from only those 991 simulations. All other
rows had 1000 convergedsimulations.
The bold value in each row represents the closest mean to the sample variance of /i.
Adjustments for Wald- Type Tests
Small-Sample
1203
Table 4
Simulations for conditional logistic regression: proportion rejected at
a = 0.05 level (average length of confidence intervals) for 1000 simulations
Test
Var
Test
Distn (df)
Design A
Case 1
Design A
Case 4
Design D
Case 1
6m
61
Vm
Vs
X2(1)
X2(l)
.049 (.577)
.098 (.534)
.430 (.601)
.090 (1.36)
.027 (2.60)
.252 (1.85)
.345
.253
(2.80)
(4.02)
62
63
64
65
Vs
Vs
Va
Va
F(1,d)
F(1,d)
F(1,dH)
F(1,dH)
.042
.072
.033
.066
.045
.065
.038
.056
.044
.047
.025
.021
.054
.038
.030
.019
(12.00)
(10.19)
(16.33)
(13.79)
(.681)
(.590)
(.708)
(.613)
(1.74)
(1.51)
(1.82)
(1.57)
Design D
Case 4
(5.35)
(4.45)
(6.92)
(5.77)
All simulations test whether i1 = 0, and average confidence interval lengths are for /1 only. In design
A p1 is estimated from 15 clusters, whereas in design D 41 is estimated from three clusters. For case 1 the
effects for p3 are fixed, whereas for case 4 these effects are random.
Vm = -{IK
aU /ao+UU/ao}o
-
(Et1
-aUa
4. Discussion
We have examined four new modifications to the standard
Wald test with a sandwich estimator of variance. These modifications were necessary because the standard sandwich test
,
/a)-
and the sandwich estimator of Lin and Wei (1989) is Vs.
Although we do not need to define Qi, our estimator of
for the calculation of Vm and Vs, for our adjust-OUi/la/,
ments we do need to define Qi for i = 1,..., K. To do this
we use numerical derivatives similar to the method that Gail,
is liberal when the parameter
Lubin, and Rubinstein (1981) used to calculate U'/O3la. We
let the abth element of Qi be {Qi(/3)}ab = (4h)-1{Uia(/ +
hb)
-
Uia(
-
hb) + Uib(/3 + ha) - Uib(3
- ha)}, where
to be tested is essentially
Uia is
the ath element of Ui, ha is a vector of length p with all zeros
except the ath row, which has a value of h, and h is some
small number. We use h = 0.0001 for our simulations.
We performed simulations on three true models, (1) a proportional hazards model, (2) model 10 from Table 1 of Lin
and Wei (1989) in which the model-based test (6m) performed
worst for K = 50, and (3) model 11 from the same table in
which the sandwich test (61) performed worst for K = 50.
We give the details of the models and the results in Table 5.
Again the modified sandwich tests (62,63,64, and 65) appear
closer to the nominal level than the standard sandwich test,
61, although 63 appears quite liberal for K = 20 for models
10 and 11.
Table 5
Simulations for Cox regression: proportion rejected at a = 0.05
level (average length of confidence intervals) for 1000 simulations
Test
6m
61
62
63
64
65
Proportional Hazards
K = 20
K = 50
.051
.103
.055
.051
.038
(1.13)
(0.99)
(1.33)
(1.25)
(1.48)
.043 (1.37)
.042
.063
.037
.043
.032
(.614)
(.567)
(.672)
(.625)
(.715)
.037 (.652)
Model 10
K = 20
.180
.119
.047
.080
.039
esti-
mated from a small number of terms in the estimating equation. This liberalness can occur when either K, the number of
terms in the estimating equation, is small, or when primarily
a small proportion of terms are used to estimate the parameter (or linear combination of parameters). We have referred
to the latter type of estimating equations as unbalanced for
that parameter.
In all of our simulations the four new modifications had
an estimated size less than the liberal size of the standard
sandwich test. In the simple balanced normal model, 65 is
exact, and 65 did reasonably well in the other balanced cases
(see Tables 1 and 5 and design A from Table 4). However,
in more unbalanced cases (see Table 2 and design D from
Table 4), 65 appeared to produce a quite conservative test.
For this reason, unless the data are fairly well balanced, we
favor the use of 63, which appears to be less liberal than 61 but
not overly conservative even for unbalanced equations. The
(1.71)
(2.26)
(3.29)
(2.89)
(3.69)
.068 (3.13)
Model 11
K = 50
.195
.079
.049
.063
.044
(1.02)
(1.47)
(1.67)
(1.59)
(1.72)
.059 (1.63)
K = 20
.104
.137
.076
.089
.059
(1.13)
(1.01)
(1.38)
(1.28)
(1.55)
.073 (1.40)
K = 50
.073
.091
.049
.066
.039
(.614)
(.593)
(.729)
(.656)
(.780)
.054 (.686)
All simulations modeled the hazard proportional to exp(/3lzI + 2 z2) and tested whether 31 = 0 or not.
True models with t = time to event: (proportional hazards) t = exp(z2)e2; (model 10) t = exp(-.5z2 -
z2+ .5ej); (model 11) t = exp(-.5z2) + .5e; where Z1,Z2, and e1 are independent standard normal
pseudorandomvariables (p-r.v.), with z1 and z2 truncated at ?5, and e2 is a standard exponential p-r.v.
1204
Biometrics,
difference between 65 and 63 is that 65 includes a bias correction for the variance, whereas 63 does not. In our simulations
on the GEE case, the standard sandwich estimator regularly
underestimated the variance, whereas both our adjusted variance estimator, Va, and that proposed by Mancl and DeRouen
(2001), VMD, appeared less biased in the balanced cases (see
top section of Table 3). In the more unbalanced cases (see
bottom section of Table 3), both Va and VMD appeared to
overcorrect for bias in most cases, and this overcorrection in
Va may be the source of the conservativeness of 65. Further
work on correcting the bias of the sandwich variance estimator in unbalanced data is required. For the balanced cases the
advantage of Va over VMD is that it may be applied to other
cases besides the GEE case.
Although not listed in the tables, we included in our simulations a test comparing T2 to F1,K-p as has been used in
the literature (see, e.g., Lipsitz et al., 1994). This test gave
estimated sizes that were always less than the standard sandwich test (61), but at least as large as the four new tests
(62, 3, 64, and 65). As expected this test does reasonably well
with balanced data but not with unbalanced data. For example, for Table 4 the four estimated sizes (to test f1 = 0)
were 0.078 (design A, case 1), 0.073 (design A, case 4), 0.245
(design D, case 1), and 0.248 (design D, case 4). For insight
into this, note the failure of that method to address the degrees of freedom differences in the parameters in design D,
where /1 is estimated from only three clusters, whereas 32
is estimated from 39 clusters (see Fay et al., 1998). The degrees of freedom estimate, K - p, does not properly account
for this data structure and compares both the T2 statistic
for testing p1 = 0 and the T2 statistic for testing /2 = 0
against the same distribution, F,K-p. The complete simulation results of the tests comparing T2 to Fl,K-p along with
Splus functions that perform our adjustments and the Splus
programs used to perform all the simulations are given at
http://srab.cancer.gov/sandwich.
ACKNOWLEDGEMENTS
We thank William Davis, Edward Korn, and the reviewers
and editors for helpful comments on earlier versions of this
paper.
RESUME
L'estimateur sandwich de la variance peut etre utilise pour
creer des tests robustes type Wald a partir d'equations d'estimation qui sont la somme de K termes independants ou
presque independants. Par exemple, dans le cas de donnees
repetees sur K sujets, chaque terme correspond a un sujet. Ces
tests sur un parametre peuvent avoir un risque de premiere
espece eleve si K est petit, ou, plus generalement, si le parametre a tester est surtout estime a partir d'un petit nombre de termes dans l'equation d'estimation. Nous proposons
quelques modifications pratiques de ces tests robustes type
Wald qui tendent asymptotiquement vers les classiques tests
robustes type Wald. Nous montrons que l'une de ces modifications donne des resultats equivalents dans les cas simples,
et etudions par simulation les modifications appliquees aux
equations d'estimation generalisees de Liang et Zeger (1986),
a la regression logistique conditionnelle et au modele de Cox
a risques proportionnels.
December 2001
REFERENCES
Chesher, A. and Austin, G. (1991). The finite-sample distributions of heteroscedasticity robust Wald statistics. Journal of Econometrics 47, 153-173.
Emrich, L. J. and Piedmonte, M. R. (1992). On some small
sample properties of generalized estimating equation estimates for multivariate dichotomous outcomes. Journal
of Statistical Computation and Simulation 41, 19-29.
Fai, A. H-T. and Cornelius, P. L. (1996). Approximate F-tests
of multiple degree of freedom hypotheses in generalized
least squares analyses of unbalanced split-plot experiments. Journal of Statistical Computation and Simulation 54, 363-378.
Fay, M. P., Freedman, L. S., Clifford, C. K., and Midthune,
D. N. (1997). Effect of different types and amounts of fat
on the development of mammary tumors in rodents: A
review. Cancer Research 57, 3979-3988.
Fay, M. P., Graubard, B. I., Freedman, L. S., and Midthune,
D. N. (1998). Conditional logistic regression with sandwich estimators: application to a meta-analysis. Biometrics 54, 195-208.
Gail, M. H., Lubin, J. H., and Rubinstein, L. V. (1981). Likelihood calculations for matched case-control studies and
survival studies with tied death times. Biometrika 68,
703-707.
Gunsolley, J. C., Getchell, C., and Chinchilli, V. M. (1995).
Small sample characteristics of generalized estimating
equations. Communications in Statistics: Simulation and
Computation 24, 869-878.
Horton, N. J. and Lipsitz, S. R. (1999). Review of software
to fit generalized estimating equation regression models.
American Statistician 53, 160-169.
Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 1322.
D.
Y. and Wei, L. J. (1989). The robust inference for the
Lin,
Cox proportional hazards model. Journal of the American Statistical Association 84, 1074-1078.
Lipsitz, S. R., Fitzmaurice, G. M., Orav, E. J., and Laird, N.
M. (1994). Performance of generalized estimating equations in practical situations. Biometrics 50, 270-278.
Lipsitz, S. R. and Ibrahim, J. G. (1999). A degrees-of-freedom
approximation for a t-statistic with heterogeneous variance. The Statistician 48, 495-506.
Mancl, L. A. and DeRouen, T. A. (2001). A covariance estimator for GEE with improved small sample properties.
Biometrics 57, 126-134.
Preisser, J. S. and Qaqish, B. F. (1999). Robust regression
for clustered data with application to binary responses.
Biometrics 55, 574-579.
Searle, S. R. (1982). Matrix Algebra Useful for Statistics. New
York: Wiley.
Received February 2001. Revised August 2001.
Accepted August 2001.
APPENDIX A
Details of Degrees of Freedom Approximation
Using approximation (2), we write {CT (fwhere Bo = FVmCC
30)2}
UTBoU,
VmFT and FT = [Ip . Ip]. Rewrite
Small-Sample
Adjustments for Wald-Type Tests
CTVaC as
K
CTVaC =
CTVm (HiUiU
H,) VmC
i=l
K
= E
UT (HiVmCCTVmHi)
Ui,
i=l
where the second equality uses the fact that CTVmHiUi is a
scalar; this is why we require C to be a vector. We rewrite this
as CTVaC = UT MU, where M is a pK x pK block diagonal
matrix with ith block equal to HiVmCCTVmHi. We combine
the Taylor series approximations for the Ui to get U m GU,
where G =
IpK
-
VmFT
and QT =
[i...
QK]
Thus
CTVaC UT BU, where B1 = GT MG.
Now consider the three conditions:
X21:Assume CTVmUi has mean 0 and un(1) a-2UTBoU
known variance cr? (where we allow ai = 0 for some i),
that the CTVmUi are independent, and that approximation (2) holds. Provided some regularity conditions are
true, by the Liapunouv central limit theorem C-1CT(l -
1205
and is distributed tK-1
/KZ{(K-1)-1IK=1
(Zi-Z)2}-1/2
by the standard derivation of the t-test, and T2 is distributed
F1,K-1. All proposed estimators of d use equation (5) with
different estimators of 1i and different values of M (for d and
d the value Hi is set to Ip in the definition of M). Let Ti (I)
be an arbitrary estimator of 'i (I) and Mi be an arbitrary
block diagonal element of M. Then because the Qi = OX7TX
are all equal, trace(QB1) = {(K - )/K}IEK trace('iMi)
and trace(iBli B1) = {(K-2)/K}K=1
trace('iMJiiMi)+
R
R-1 I
1++
(K - 1)2
where
riK
-- oo, where a2 = EK1 i2. Thus
do) -+ N(0, 1) as K
- po)}2
is asymptotically
chi-squared with 1
a-2{CT(
degree of freedom.
(2) c-2dUTB1 U Xd: Assume U is distributed normal with
mean 0 and variance I = block diagonal (T1,..., K).
Under these assumptions, UTB1U has a distribution described by a weighted sum of chi-square random variables; however, we approximate its distribution with a
(much simpler) Gamma distribution with the same mean
and variance, i.e., with mean equal to trace('B1) and
variance equal to 2trace(BB1iJBi) (see Searle, 1982, p.
355). Since UTB1U is an approximately unbiased estimator of a2, we estimate a2 with trace(TBi), so that
u-2UTB1U is approximately Gamma with mean 1 and
or equivavariance 2trace(BB1'Bl)/{trace(BjB1)}-2,
~
=
where
d
xd
lently c-2dUTB1U
{trace(IB1)}2/
trace (B B1B ).
(3) UTBoU and UTB1U are independent: Assume that U
is normal with mean 0 and variance ', with rank(t) =
pK. Under correct model specification, i.e., Ji' cfi for
0 and UTBoU and UTB1U are
all i. Then BoIB1
approximately independent (see Searle, 1982, p. 356).
R
=
JZ(ZiK
-
)4
and the Zi are the same independent standard normal random
variables as in the previous expression for Ta.
APPENDIX C
Notation for Cox Proportional Hazards Model
We use standard notation for counting processes, so that the
notations Yi, Zi, Xi, and 6i are different in this section than
in the rest of the paper. Let X1,..
,XK
be the time until
failure or right-censoring, with 6i = 1 when Xi represents
a failure and 6i = 0 otherwise. Let Zi(t) be the covariate
vector for the ith observation observed at time t. Then Ui =
6i{Zi(Xi)Z(3, Xi)}, where
K
Z(3, t)=
E Yi(t)Zi(t)exp{/f'Zi(t)}
-i_ K
E
Yj(t) exp{3'Zj(t)}
j=l
Refer to the structure defined in section 3.1. Let Zi = K-1/2 x
where 02 = var{CT(P
2)2
i=l
Details: Normal Model with Identical Designs
i - CTP)},
{CT(XTX)-xTy
=
X(XT
K-1CT(XTX)-lXTX
3o)
2
-
E(Zi
APPENDIX B
a-1
To simplify these expres-
EK=
_3Ki1 trace(kiMijMj3).
K-2
sions we use the theorem that trace(AB) = trace(BA) for
any matrices A and B for which AB and BA are defined, so
that the traces may be written as scalars. Then we can show
that equation (5) gives K- 1 for both d and dH, whereas d
and dH may both be written as
-
X)-lC. Under the null
hypothesis assumptions of the model, the Zi are independent
standard normal random variables. Then K-1EK
Zi = Z =
=
and EzI1 (Zi-Z)2 = (2KCTVsC
K-1/2a-1CT(^-/),
The last step comes because Hi = Ip{K/
c-2(K-1)CTVaC.
(K - 1)}1/2 for all i. Thus, under the null hypothesis, Ta =
and here Yi(t) = 1 if Xi > t and 0 otherwise. Further,
K
0U?
U,-
[6jYi(Xj)exp{ 3'Zi(Xj)}{Zi(Xj)
- Z({, Xj)}]
3j=1
-
K
E
-h=l
Yh(Xj) exp { 'Zh (Xj)}
Biometrics, December 2001
1206
-h=l
Y9=i(
-j=
9Y (Xi) exp {3'Zj(Xi)
where A02 = AAT, for any vector, A.
x {zj(Xi)-
Z(3, Xi)}?2