Sample Size and Power Calculations With Correlated Binary Data

Sample Size and Power Calculations With Correlated Binary Data
Wei Pan, PhD,
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN
Address reprint request to: Dr. Wei Pan, Division of Biostatistics, A460 Mayo Building, MMC
303, 420 Delaware Street SE, Minneapolis, MN 55455-0378, (Email: [email protected]).
0
ABSTRACT: Correlated binary data are common in biomedical studies. Such data can be
analyzed using Liang and Zeger's generalized estimating equations (GEE) approach. An attractive
point of the GEE approach is that one can use a misspecied working correlation matrix, such as the
working independence model (i.e. the identity matrix), and draw (asymptotically) valid statistical
inference by using the so-called robust or sandwich variance estimator. In this article we derive
some explicit formulas for sample size and power calculations under various common situations.
The given formulas are based on using the robust variance estimator in GEE. We believe that
these formulas will facilitate the practice in planning two-arm clinical trials with correlated binary
outcome data.
KEY WORDS: Clinical trials, GEE, logistic regression, sandwich estimator, z -test.
1
INTRODUCTION
It is important and even necessary to do sample size and power calculations in planning a
clinical trial 1]. This is relatively well studied and there is a large literature giving many specialized
formulas to use. However, most formulas available are only for independent or continuous response
data 1, 2]. In practice, correlated binary data also arise frequently. As a general and exible method
for correlated discrete data, the generalized estimating equations (GEE) approach has become
increasingly important and widely used in analyzing such data 3, 4]. An attractive property of
GEE is that one can use a misspecied working correlation matrix and yet the regression coecient
estimator is still consistent and asymptotically normal. The covariance matrix of the estimated
regression coecients can be also consistently estimated using the so-called robust or sandwich
estimator. A convenient choice of the working correlation structure is the working independence
model: the correlated data can be treated as though they were independent, and the resulting
regression parameter estimates along with the robust covariance estimator can be used to draw
proper statistical inference.
Recently Liu and Liang discussed a general method for sample size/power calculations in GEE
5]. Their method is based on the (generalized) score test in GEE and the resulting (asymptotically)
non-central chi-square distribution of the test statistic under the alternative hypothesis. In general,
there is no closed form available except for some special cases. As examples, they gave two explicit
formulas for analyzing correlated binary data, where it is assumed that the working correlation
structure is correctly specied as a compound symmetric matrix. They did not give formulas for
other common special cases. For instance, it would be helpful to have a formula to use when
the working independence model is used in GEE. It appears complicated and nontrivial to derive
explicit formulas for some important and common situations, at least for many practitioners, though
in principle it could be done following Liu and Liang's outline. Shih discussed an alternative
approach to sample size/power calculations in GEE, which is based on the z -test using the robust
variance estimator of the estimate of the parameter to be tested 6]. After discussing the general
method, Shih only gave one specic example for binary data. In this article, we follow Shih's
approach and derive explicit formulas for several important special cases. In addition to several
situations where a working correlation is specied as the true correlation structure, we also discuss
other situations when a misspecied working correlation matrix is used. Note that one needs to
2
specify the underlying correlation structure in sample size/power calculations and thus can use
it as the working correlation structure. However, we argue that using other working correlation
structures are sometimes useful or necessary. If we admit that we may not know the underlying
true correlation structure at the analysis stage, it is doubtful that we can know it at the earlier
design stage. Hence, at least, sensitivity analyses can be conducted using various combinations of
the working/true correlation structures in the design stage. More comments on this issue will be
given in a later section. We hope that the reader can derive relevant formulas for other settings by
following the method outlined here.
We also note that the issue of sample size and power calculations with binary data has been
actively discussed in the literature. In addition to those with the GEE approach 5, 6], there are
some other proposals with other estimation (non-GEE) methods, such as the weighted least squares
7-9]. Here we only consider the GEE approach due to its increasing use in practice.
In this paper we loosely use \subjects" to refer to the subsampling or correlated sampling units,
which consist of a higher level unit called \cluster." In familial studies, the cluster is each family
while the subjects are family members. In longitudinal studies, the cluster consists of repeated
measures from the same study participant and each measurement is treated as a \subject."
The general set-up here is that we are comparing two treatments, a control and an active
treatment. We consider two important scenarios. One is that the treatment is given at the cluster
level: all the subjects from the same cluster receive the same treatment. There are many such
examples. For instance, in a longitudinal study, each patient as a cluster is given only one treatment,
and repeated measures are made on the same patient across a period of time. The second scenario
is that each cluster may receive more than one treatment. This is also common in practice. For
example, in two-period crossover trials, each patient receives each treatment in turn. Another
example is that in ophthalmologic studies, one eye of a patient often serves as the control for teh
patient's other eye.
This article is organized as follows. In the next section we rst review GEE in the current
context, and relate the issue of sample size/power calculations in GEE with that in the well-known
two-sample z -test. The following section presents explicit sample size/power formulas for situations
where the treatment is assigned at the cluster level. We later discuss several special cases where
both treatments may be given to the same cluster. We end with a short discussion.
3
BACKGROUND AND THEORY
The Model and GEE
Denote the j th binary response (outcome) variable from cluster i as yij , and the corresponding
2 1 covariates as xij for j = 1 ::: ni and i = 1 ::: N . By convention we use the coding yij = 1
or 0. The rst element of xij is always 1 (for the intercept term) while the second is the indicator
of the treatment group, which is 0 for control and 1 for treatment. Let yi = (yi1 ::: yin ) and
Xi = (xi1 ::: xin ) respectively be the response vector and the design matrix for the ith cluster.
In general, the components of yi are correlated while yi and yk are independent if i 6= k. In other
words, there is a within-cluster correlation, but observations from dierent clusters are independent.
Any statistical analysis has to take the within-cluster correlation into account properly.
Dene the marginal mean and variance of yij as ij = E (yij jxij ) and V ar(yij ) = ij (1 ; ij ).
We use the marginal logistic regression model:
0
i
0
i
!
logit(ij ) = log 1 ;ij = xij ij
0
where = (1 2 ) is the unknown regression coecient vector. Note that 2 measures the treatment eect compared with the control 2 = 0 means that there is no dierence between the
treatment and control. The GEE estimator ^ of can be obtained through solving the following
so-called GEE:
N
X
DiVi 1(yi ; i) = 0
0
0
;
i=1
where i = (i1 ::: in ) , Di = @i =@ and Vi is a working covariance matrix for yi . Let Ai be an
ni ni diagonal matrix with the j th diagonal element as V ar(yij ). Then the working covariance
matrix can be expressed as
Vi = A1i =2 Rwi( )A1i =2 0
i
where Rwi( ) is a working correlation matrix with possibly unknown parameters , which can
be estimated using method of moments 3] or another set of estimating equations 10]. As usual
we assume that the structure of Rwi and the true correlation structure R0i do not change across
the clusters, hence we can denote them as Rw and R0 respectively (but keep in mind that their
dimension is ni ni, which may change with the cluster i). It is noted that the true covariance of
y, Cov(yi ) = A1i =2 R0A1i =2 , is in general unequal to the working covariance Vi = A1i =2 Rw A1i =2 unless
R0 = Rw .
4
An attractive property of GEE is that the GEE estimator ^ is still consistent and asymptotically
normal even if an incorrect correlation matrix Rw is used. Furthermore, the covariance matrix of
^ can be consistently estimated by the robust or sandwich estimator:
R = (1 1 0 1 1 )
;
where
1 =
N
X
i=1
Di Vi 1 Di and 0 =
0
;
(1)
;
N
X
i=1
Di Vi 1 A1i =2 R0 A1i =2 Vi 1Di :
0
;
;
(2)
1 1 is the usual model-based covariance estimator: its (asymptotic) validity depends on the correctness of Rw = R0 , in which case the robust covariance is the same as the model-based one. For
correlated discrete data, in general, it is dicult to know the structure of R0 in advance. There
are several commonly used (working) correlation structures:
;
The identity matrix: Rw = I . This is also called the working independence model. It is
the most convenient and probably most common choice. It has been found that in many
situations the resulting ^ is also highly ecient 3, 11]. In general, if there is no strong prior
knowledge about R0 , we feel that Rw = I could be used at least, computationally, it is the
simplest and most stable (in terms of the convergence of the iterative algorithm in solving
the GEE).
Compound symmetry (CS):
0
1
1
:::
BB
C
BB 1 ::: CCC
B
C
Rw = B
BB 1 ::: CCC :
BB
CC
B@ :::
CA
::: 1
5
The rst order auto-regressive (AR(1)):
0
BB 1
BB 1
BB
2
Rw = B
BB BB :::
@ n 1 n
i;
i ;2
2
1
n
i ;3
ni
:::
::: n
::: n
:::
1
;
i ;2
i ;3
1
1
CC
CC
CC
CC :
CC
CA
Note that for simplicity we have suppressed the obvious dependence of Rw on ni . Throughout the
paper, we use to denote the parameter in the CS or AR(1) correlation structure.
Sample size and Power Calculations
The goal is to test whether the treatment has a more favorable eect than the control does.
The hypotheses of interest can be formulated as:
H0 : 2 = 0 versus HA : 2 = b > 0:
Note that since there are only two treatment groups, we can use p0 or p1 to denote the (marginal)
mean of the binary response in the control or treatment group respectively: p0 = E yij jxij = (1 0))
and p1 = E yij jxij = (1 1)]. It is well-known that 2 is the log-odds ratio: 2 = log 1 p1p1 = 1 p0p0 .
p
Denote the robust variance estimate of N (^2 ; 2 ) as vR , which is the (2,2)th element of
p
N R . Since N (^2 ; 2 ) has an approximately normal distribution N (0 vR ), a z -test can be used
to test the null hypothesis. A two-sided z -test with type I error and power 1 ; requires the
sample size
;
vR (Z=2 ; Z1 )2
b2
where Z = 1 (1 ; ) is the 100(1 ; )th percentile of the standard normal distribution.
With a given sample size N , the power is
N=
;
;
(3)
;
s !
1 ; = 1 ; Z=2 ; b vN :
R
(4)
The corresponding formulas for a one-sided test are straightforward. As to be shown, in the
current setting, N corresponds to the number of clusters. The robust variance vR depends on the
cluster size ni . Either ni or N can be regarded as the \sample size" since increasing either one leads
6
to a larger data set. Apparently there is a trade-o between increasing ni or N . The decision may
depend on their respective cost. In this paper ni is treated as known and the goal is to determine
N for a specied power. This strategy is general since for a given series of possible ni 's, one can
determine the corresponding N 's and then pick up the most desirable combination of (ni N ). It
is obvious that the key to sample size/power calculations is to obtain vR , for which, in general,
there is no closed form available. However, in several important special cases, we show next that
it is possible (but tedious) to derive the closed forms of vR , and hence those for sample size/power
calculations.
CONSTANT WITHIN-CLUSTER TREATMENT
We rst discuss the situation when the treatment is given at the cluster level. Hence the
covariate matrix is Xi = (1n 1n ) if the ith cluster is in the treatment group, and Xi = (1n 0n )
otherwise. Here we use 1n to denote an ni 1 vector of 1's, and similarly for 0n . We assume that
the proportion of clusters assigned to the control group is and that to the treatment is 1 ; . In
the following we rst discuss a general situation, then derive formulas for various special cases. For
convenience the results of this section is summarized in Table 1.
Table 1 about here
i
i
i
i
i
i
A General Formula
It can be veried that Di = p0 (1 ; p0 )1n 0n ] if the ith cluster is in the control group, and
Di = p1 (1 ; p1 )1n 1n ] otherwise. Following Shih 6], we have
i
i
i
i
Di Vi 1 Di = 1n Rw 1 1n M
0
0
;
;
i
i
where
1
0
p (1 ; p0 ) + (1 ; )p1 (1 ; p1 ) (1 ; )p1 (1 ; p1) C
M =B
A
@ 0
(1 ; )p1 (1 ; p1 )
(1 ; )p1 (1 ; p1 )
0
1
1
;1 C
M 1 = p (11; p ) B
@
A:
0
0
;1 1 + (1 p0)(1p1 (1p0 )p1 )
;
;
;
;
Hence, the model-based covariance estimate is
1 = (
1
;
N
X
i=1
1n Rw 11n M ) 1 :
0
;
i
;
i
7
It can be also veried that
0 =
N
X
i=1
Di Vi 1 Cov(yi )Vi 1 Di =
;
0
;
N
X
i=1
1n Rw 1R0Rw 11n M
0
;
;
i
i
and thus the robust covariance estimate is
PN 1 R 1 R R 1 1
R = (1 0 1 ) = i=1
P n w 0 w n M 1:
( Ni=1 1n Rw 1 1n )2
p
The robust variance estimator of N (^2 ; 2 ) is
PN 1 R 1R R 11 1
1
n w 0 w n
+
:
vR = N i=1
P
( Ni=1 1n Rw 1 1n )2 p0 (1 ; p0 ) (1 ; )p1 (1 ; p1 )
0
;
1
1
;
;
i
i
;
0
i
i
0
;
;
i
i
0
;
i
;
;
(5)
i
Once vR is available, we can plug it into formula (3) or (4) to obtain an explicit formula for sample
size or power calculations. We denote in this section
PN 1 R 1 R R 1 1
c := N i=1
P n w 0 w n
( N 1 R 1 1 )2
0
;
;
i
i
i=1 ni w ni
0
;
(6)
and now the key to sample size/power calculations is to obtain c. We will give explicit formulas for
c for several important special cases.
Special Case I: Rw = R0 = CS
The rst special case is that Rw is specied correctly as CS. Then 0 = 1 , and we have
c = PN N
:
n
i=1 1+(nii 1)
(7)
;
When ni = n for all i, the formula reduces to the one given in Shih 6]:
c = 1 + (nn; 1) = n1 1 + (n ; 1) ] :
(8)
Replacing the corresponding vR in formula (3) and comparing it with that for independent data 1],
we will nd that the required sample size is inated by a factor 1+(n ; 1) due to the within-cluster
correlation.
A similar but dierent formula is also given in Liu and Liang 5]. The discrepancy arises since
the derivations of Liu and Liang are based on the (generalized) score test whereas ours are based
on the z -test, a special case of the Wald test. It is well known that the score test and Wald test are
asymptotically equivalent. In some situations the score test may be preferred 12]. However, the
8
Wald test is more convenient to use. Currently, for the GEE methodology, only the Wald test is
implemented in the two popular statistical packages, SAS and Splus. Hence, in practice, the Wald
test is more commonly used than is the score test, and our proposal based on the Wald test will
then be useful. As to be shown numerically, the results from these two approaches are often close.
Special Case II: Rw = I and R0 = CS
Now suppose that the correct correlation matrix is still CS but the working independence model
is used. Then
PN 1 R 1 N PN n 1 + (n ; 1)]
i
n 0 n =
i=1 Pi
c = N Pi=1
:
(9)
( Ni=1 1n 1n )2
( Ni=1 ni )2
In general, formula (9) is not equal to (7). However, when ni = n for all i, formula (9) reduces to
(8) in other words, using the working independence model results in no loss of eciency. This is
somewhat surprising. However, for bivariate data (i.e. n = 2), it is not surprising since McDonald
11] has proved that the GEE estimator ^ using the working independence model is the same as
that using Rw = R0 , implying that their variances must be the same. Hence using the working
independence model in this case results in no loss of information.
Special Case III: Rw = R0 = AR(1)
Without loss of generality, we always assume ni 2 when Rw or R0 is AR(1) in this paper. We
have
c = PN N 1 = PN N (1 + )
:
(10)
i=1 1n R0 1n
i=1 ni ; (ni ; 2)]
When ni = n for all i,
(11)
c = h 1 + 2 i:
n 1 ; (1 ; n )
0
i
i
0
i
i
0
;
i
i
In general, formula (11) is not equal to (8). However, when n = 2, formulas (11) and (8) are the
same, because for bivariate data the AR(1) reduces to CS.
Special Case IV: Rw = I and R0 = AR(1)
PN n + 2(n ; 1) + 2(n ; 2)2 + ::: + 2n 1 ]
i
i
c = N i=1 i
:
(12)
P
N
( i=1 ni)2
When ni = n for all i, it is
2 2
1 n 1
1
(13)
c = 1 + 2(1 ; n ) + 2(1 ;n n ) + ::: + 2 n i;
;
9
which in general is not equal to formula (11). Furthermore, when < 1, formula (13) can be
rewritten as the following:
c = n1 1 + 1 2; 1 ; n1(1;;) :
n
(14)
In the Appendix, we show that
formula (8) > (14) > (11)
(15)
when 0 < < 1 and n > 2. An intuitive interpretation is as follows. When the true correlation
structure is AR(1), for two observations yij and yik in the same cluster, as their \distance" jj ; kj
increases, their correlation j k decreases. On the other hand, if the true correlation structure
is CS, the correlation between yij and yik is always , not depending on jj ; kj. Hence, the
average within-cluster correlation is weaker for AR(1) than that for CS. In other words, for two
equally-sized clusters with AR(1) and CS correlation structures satisfying Corr(yij yij +1 ) = ,
the eective sample size in the cluster for AR(1) is larger than that for CS. This explains the rst
inequality. For the second inequality, it is expected that using the working correlation matrix as
the true underlying correlation structure will increase the eciency over that using an incorrect
working correlation structure.
j ;
j
Numerical Examples
For illustration we rst consider a hypothetical example. Suppose the probabilities of having
the desired outcome (i.e., yij = 1) for the control and the treatment are p0 = 0:6 and p1 = 0:8
respectively. The correlation parameter is = 0:4, hence for any j 6= k, Corr(yij yik ) = if
the true correlation structure R0 is CS, and Corr(yij yik ) = j k if R0 is AR(1). A cluster (of
subjects) is assigned to the control/treatment group with an equal probability: = 0:5. There are
n = 4 subjects in each cluster. The c in formulas (8), (11) and (14) are respectively 0.5500, 0.4375
and 0.4480. Their order is consistent with inequality (15). Then we can calculate
j ;
j
vR = c p (11; p ) + (1 ; )p1 (1 ; p )
0
0
1
1
as 11.458, 9.115 and 9.333 respectively. Since b = log 1 p1p1 = 1 p0p0 = 0:981, based on sample size
N = 40 we can calculate the powers using formula (3) as 44.9%, 53.8% and 52.8%, which are
respectively for Case I or II (Rw = R0 = CS or Rw = I , R0 = CS ), Case III (Rw = R0 = AR(1))
;
;
10
and Case IV (Rw = I , R0 = AR(1)). Similarly, for the specied power 80%, the required sample
sizes (i.e., numbers of clusters) N for three cases are 94, 75 and 77 respectively.
Using Liu and Liang's formula, for Rw = R0 = CS , to retain the 80% power the required sample
size is N = 87, and the power is 47.9% for N = 40.
More examples with various values of p0 , p1 and are listed in Tables 2 and 3. The set-ups in
the two tables are similar except that in Table 2 the risk dierence p1 ; p0 remains as 0.2, whereas
in Table 3 the odds ratio exp(b) = p1 (1 ; p0 )=p0 (1 ; p1 ) is always 2.5. It can be veried that the
inequality (15) again holds. It is also noted that the dierence between the sample sizes for cases
III and IV is very small. Unsurprisingly, as the correlation increases, the required sample size
also increases.
It appears that for Case I, using Liu and Liang's formula requires smaller sample sizes than does
using our formula (see the columns L&L and I,II in Tables 2 and 3), though often the dierence
is not large. As discussed earlier, Liu and Liang's formula is based on the score test whereas ours
on the Wald test. Although the score test and the Wald test are asymptotically equivalent, it is
possible that for nite samples the former is more ecient. The choice of the two tests in GEE
needs to be further studied in the future. In the current context, which formula to use will depend
on which test is planned to be used in the later analysis. For instance, if the score test is to be used,
then it is natural to use Liu and Liang's formula. However, since the score test for GEE has not
been implemented in many statistical packages, our formulas based on the Wald test is attractive.
In addition, Liu and Liang did not give explicit formulas for other cases we consider.
Tables 2 and 3 about here
Special Case V: Rw = R0 = I
To facilitate comparisons for the next two special cases, we give the formula for independent
data:
PN 1 1
c = N PNi=1 n n 2 = PNN (
1 1 )
n
0
i
i
i=1 ni ni
0
i=1 i
(16)
which reduces to the familiar 1=n when all ni = n.
Special Case VI: Rw = CS and R0 = I
Now we discuss the situation when we incorrectly specify the correlation coecient as for
11
independent data.
PN
PN 1 R 21
n
i=1 1+(n
n
i
=1 n w
c = N PN
n
(
1 R 11 )2 = N PN
0
i
;
i=1 ni w ni
0
]2
i ;1)
i
i
;
i=1 1+(ni
i
;
1)
2 :
(17)
Note that if all ni = n, formula (17) reduces to c = 1=n, which is equal to that of using Rw = R0 = I .
For illustration, we include the mathematical derivation of formula (17) in the appendix.
Special Case VII: Rw = AR(1) and R0 = I
PN 1 R 2 1
PN n
i
=1 n w n
i=1 i
c = N PN
= N P
:
1
2
N ni ; (ni ; 2)] 2
( i=1 1n Rw 1n )
i=1
0
;
0
(18)
i
i
;
i
i
Although we do not have a proof, based on our numerical experience we hypothesize that formula
(16) (17) (18), which implies that using incorrect working correlation matrices may lead to
larger variances vR and thus larger sample sizes needed for a specied power.
When all ni = n, we can simplify formula (18) as
c = 1 2 > n1 :
n 1 ; 1 ; n2 (19)
Hence in this situation using Rw = AR(1) results in a larger sample size needed than that for
Rw = R0 = I .
VARYING WITHIN-CLUSTER TREATMENT
Now we allow for both treatments to be allocated to the subjects within a single cluster. Without
loss of generality, assume that within any cluster i, the rst ni0 subjects are assigned to the control
group while the remaining ni1 = ni ; ni0 are assigned to the treatment group. Note that for varying
within-cluster treatment, there may be no closed form solutions in many cases and thus we will not
discuss cases III-VII considered earlier.
A General Formula
First it is straightforward to establish
0
p0(1 ; p0 ) 1n 0 0n 0
Di = B
@
p1(1 ; p1 ) 1n 1 1n 1
1i = Di Ai 1=2 Rw 1 Ai 1=2 Di 0
;
;
;
i
i
i
i
1
]C
A
]
0i = Di Ai 1=2 Rw 1 R0 Rw 1 Ai 1=2 Di :
0
;
;
;
;
12
The model-based covariance matrix is
1 1 = (
;
while the robust covariance estimate is
N
X
R = (
i=1
N
X
i=1
N
X
1i ) 1 (
;
1i ) 1 i=1
;
0i )(
N
X
i=1
1i ) 1 ]:
;
In general, neither 1 1 nor R can be written in a closed form. But, at least numerically, with a
given set of parameters, it is feasible to compute either 1 1 or R .
Special Case I: Rw = R0 = CS
We rst consider the general situation with Rw = R0 = CS and ni > 2. Denote
;
;
0
1
a b
1i = B
@ i i CA bi di
then it can be veried that
s
"
#
s
"
#
; p1) + n p (1 ; p ) u + (n ; 1)w p0(1 ; p0) ai = ni0p0(1 ; p0) ui + (ni0 ; 1)wi pp1(1
i1 1
1
i
i1
i p (1 ; p )
0 (1 ; p0 )
1
1
q
bi = ni0ni1 wi p0(1 ; p0 )p1(1 ; p1 ) + ni1 p1 (1 ; p1)ui + (ni1 ; 1)wi ]
di = ni1 p1 (1 ; p1 )ui + (ni1 ; 1)wi ]
where
ui = ; (n ; 1)(ni2;;2)(n +; 12) ; 1 i
i
wi = (n ; 1)2 ;(n ; 2) ; 1 :
i
i
Due to the complexity of the form of 1i1 , we will not give it here. In principle, with given
P
parameters one can calculate 1 = Ni=1 1i and thus R or 1 1 , from which vR can be obtained.
For the bivariate case that n0i = n1i = 1 for all i, a direct calculation of Di 1 Vi Di 1 shows that
the covariance matrix
;
;
0;
;
0
R = 1 1 = N1 B
@
;
q
p0 (1 p0 ) + 1
;
;
1
p0 (1;p0 )
1
p0 (1;p0 )p1 (1;p1 )
Hence
1
1
;
;
s
;
vR = p (1 ; p ) + p (1 ; p ) ; 2 p (1 ; p )1p (1 ; p ) :
0
0
1
1
0
0 1
1
1
1
q
1
p0 (1 p0 ) + q p0 (1 p0 )p1 (1 p1 ) C
A
1
1
1
+
;
2
p0 (1 p0 ) p1 (1 p1 )
p0 (1 p0 )p1 (1 p1 )
;
;
;
;
;
(20)
13
A sample size/power formula results by substituting formula (20) into formula (3) or (4). A dierent
formula derived from the score test is also given by Liu and Liang 5].
Special Case II: Rw = I and R0 = CS
Straightforward but tedious algebraic work can show that
0
1
n p (1 ; p0 ) + ni1 p1(1 ; p1 ) ni1 p1(1 ; p1) C
1i = B
@ i0 0
A
ni1 p1 (1 ; p1 )
ni1 p1(1 ; p1)
0
1
1
1
;
CA :
n 0 p0 (1 p0 )
1i1 = B
@ n 0p0(1 1 p0 )
1
1
; n 0p0(1 p0 ) n 0 p0(1 p0 ) + n 1 p1(1 p1 )
Further, if we denote
0
1
e
f
0i = B
@ i i CA ;
;
i
;
i
;
i
;
i
i
;
fi gi
then
s
"
#
s
"
#
; p1) +n p (1;p ) 1 + (n ; 1) + n p0(1 ; p0) ei = ni0 p0 (1;p0 ) 1 + (ni0 ; 1) + ni1 pp1 (1
i1 1
1
i1
i0 p (1 ; p )
0 (1 ; p0 )
1
1
q
fi = ni0ni1 p0 (1 ; p0 )p1(1 ; p1 ) + ni1 1 + (ni1 ; 1)]p1 (1 ; p1 )
gi = ni1 1 + (ni1 ; 1)]p1 (1 ; p1):
R could be calculated, but its closed form is too complex. Now we pursue it by further assuming
that ni0 = m0 and ni1 = m1 for all i. Then
0
1
u w C
R = N1 B
@
A
w vR
where
u = 1m+p(m(10 ;; p1)) 0 0
w = ; 1m+p(m(10 ;; p1)) + p
0 0
and
0
0
p0 (1 ; p0)p1 (1 ; p1)
vR = 1m+p(m(10 ;; p1)) + 1m+p(m(11 ;; p1)) ; p
0 0
0
1 1
1
2
:
p0(1 ; p0 )p1 (1 ; p1)
(21)
14
In the bivariate case with m0 = m1 = 1, it is simple to verify that formula (21) reduces to
formula (20). In other words, in this situation, there is no information loss by using a working
independence model. Again this is not surprising in light of McDonald's result 11]: the GEE
estimate ^ and thus its variance obtained by using Rw = I are equal to those using Rw = R0 when
m0 = m1 = 1.
Examples
Diggle et al. give the data of a crossover trial on cerebrovascular deciency4]. Two treatments,
an active drug and a placebo were compared. Thirty-four patients received the active drug rst
followed by placebo. The other 33 patients received placebo rst, followed by the active drug. The
binary endpoint is a normal/abnormal electrocardiogram reading after receiving each treatment
for each patient. It was found that there was no signicant period or carryover eect. Hence, a
marginal logistic model with an intercept term and a covariate indicating the treatment group can
be tted. The GEE estimates of the regression coecients (using Rw = CS ) are ^ = (0:52 0:56)
with an estimated within-cluster correlation ^ = 0:614. The active drug appears to be favorable
the log odds-ratio of having a normal electrocardiogram reading is 0.56 when comparing the active
drug with placebo. Now assume that the true values of the parameters are the same as the above
estimates, and we want to estimate the required sample size to have a certain power to reject the
null hypothesis that there is no treatment eect of the active drug. Note that since all ni = 2 in any
two-period crossover trial, it is reasonable to assume the true correlation structure as CS. Using
formula (20) or (21) we obtain vR = 3:723. To maintain the 80% and 90% power, the required
sample sizes are 94 and 125 respectively. If we use Liu and Liang's formula, the required sample
sizes are 91 and 122 respectively.
As a comparison, we next consider a hypothetical example similar to the one in the section on
Numerical Examples except that the cluster size is n = 2. Recall other parameters are p0 = 0:6,
p1 = 0:8 and = 0:4. In each cluster one subject is assigned to the control and the other to the
treatment. Using formula (20) or (21) we get vR = 6:334. Hence, using GEE with R0 = CS and
either Rw = CS or Rw = I , for the given N = 40, the corresponding power is 69.3%. On the other
hand, to obtain the power 80%, the required sample size is 52. Using Liu and Liang's formula
yields slightly dierent results: the power is 72.7% for N = 40, and N = 48 is needed to obtain the
80% power.
0
15
Comparing the results here with those in the section on Numerical Examples, we see an apparent
increase of the power here even though the cluster size is reduced from 4 to 2. This is not surprising:
with a positive within-cluster correlation, the within-cluster comparison of the treatments can
increase the eciency of detecting the treatment dierence. It is in fact one of the main advantages
of crossover trials.
DISCUSSION
We have discussed the issue of sample size and power calculations in GEE with correlated
binary data. We have presented relevant formulas for several important special cases. Many of our
given formulas are explicit while the others are straightforward (but maybe tedious) to calculate.
Note that all the formulas derived are only applicable to two-armed trials. In particular, we did
not include any covariates (other than the treatment indicator) in the marginal logistic regression
model. Extensions to the above situations may be complicated. Although this article focuses on
correlated binary data, we believe that by following the same line one can derive relevant formulas
for count or continuous data.
We have emphasized the use of the robust variance estimator of the estimated regression coefcient due to its widespread use in practice. This is in fact an attractive property of the GEE: the
(asymptotically) valid statistical inference only depends on whether the marginal mean structure
(here the logistic regression model) is correctly specied, but not on the correct specication of
the second moments of the response variable, including both the marginal variance and the withincluster correlation structure. For discrete data, it is more often that one does not have a high
condence in the underlying true correlation structure. The GEE approach side-steps this diculty by using some working correlation structure. In practice, there are other reasons for using the
working independence model. First, when the sample size (or cluster size) is large, as it is in many
community-based trials, using non-diagonal working correlation matrices may lead to slow- or even
non-convergence of the GEE program. Second, in some situations, such as in partly conditional
modeling with longitudinal data, it may even be necessary to use the working independence model
13, 14]. Third, there may be some theoretical diculties with using a more complex working
correlation structure 15]. This could happen when the true correlation structure is mis-specied as
a non-simple one and we take this misspecied true correlation structure as the working structure.
16
On the other hand, we also note the limitation of the GEE approach. It is well-known that with
nite samples the performance of the robust variance estimator may be less satisfactory 16, 17].
For small samples we recommend the use of simulations to conrm the results obtained from the
given sample size/power formulas.
ACKNOWLEDGMENTS
The author is grateful to Bruce Lindgren for reading an earlier draft and giving helpful suggestions, to John Connett and Chap Le for discussions on the subject. The author thanks two referees,
an associate editor and the editor for many detailed and helpful comments.
APPENDIX
1. Proof of formula (15)
If 0 < < 1, then > k for any k > 1. Hence from (13) we have
1 + 2(1 ; n1 ) + 2(1 ; n2 )2 + ::: + 2 n1 n
(14) =
n
1 + 2(1 ; n1 ) + 2(1 ; n2 ) + ::: + 2 n1 1
;
<
n
2
1 + n 1 + 2 + ::: + (n ; 1)]
= (8):
n
On the other hand, since 1 ; n < 1 ; and 2=n > 0,
1 1 " 2(1 ; 1 ) #
1
2
n = (11):
(14) > n 1 + 1 ; 1 ; n 1 > n 1 +
1 ; + n2 =
2. Proof of formula (17)
Since Rw = CS , we have
0
1
u
w
w
:::
w
i
i
i
i
BB
C
BB wi ui wi ::: wi CCC
B
C
Rw 1 = B
BB wi wi ui ::: wi CCC BB
CC
B@ :::
CA
;
wi wi wi ::: ui
where
ui = ; (n ; 1)(ni2;;2)(n +; 12) ; 1 i
i
wi = (n ; 1)2 ;(n ; 2) ; 1 :
i
i
17
0
1
a
b
b
:::
b
i C
BB i i i
BB bi ai bi ::: bi CCC
B
C
Rw 2 = B
BB bi bi ai ::: bi CCC BB
CC
B@ :::
CA
Then
;
bi bi bi ::: ai
where
ai = u2i + (ni ; 1)wi2 bi = 2ui wi + (ni ; 2)wi2 :
It is easy to verify that
1n Rw 11n = niui + (ni ; 1)wi ] = (n ; n1)i + 1 i
0
;
i
i
1n Rw 21n = niui + (ni ; 1)wi ]2 0
;
i
i
and formula (17) is readily available.
REFERENCES
1. Friedman LM, Furberg CD, DeMets DL. Fundamentals of Clinical Trials. 3nd ed. New York:
Springer 1998.
2. Shuster JJ. Practical Handbook of Sample Size Guidelines for Clinical Trials. Boca Raton,
FL: CRC Press 1993.
3. Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika
198673:13-22.
4. Diggle PJ, Liang K-Y, Zeger SL. Analysis of Longitudinal Data. Oxford: Oxford 1994.
5. Liu G, Liang K-Y. Sample size calculations for studies with correlated observations. Biometrics 199753:937-947.
6. Shih WJ. Sample size and power calculations for periodontal and other studies with clustered samples using the method of generalized estimating equations. Biometrical Journal
199739:899-908.
7. Lipsitz SR, Fitzmaurice GM. Sample size for repeated measures studies with binary responses.
Statistics in Medicine 199413:1233-1239.
18
8. Lee EW, Dubin N. Estimation and sample size considerations for clustered binary responses.
Statistics in Medicine 199413:1241-1252.
9. Sargent DJ, Sloan JA, Cha SS. Sample size and design considerations for phase II clinical
trials with correlated observations. Controlled Clinical Trials 199919:242-252.
10. Prentice RL. Correlated binary regression with covariates specic to each binary observation.
Biometrics 198844:1033-1048.
11. McDonald BW. Estimating logistic regression parameters for bivariate binary data. Journal
of the Royal Statistical Societyi, Ser. B 199355:391-397.
12. Breslow N. Tests of hypotheses in overdispersed Poisson regression and other quasi-likelihood
models. Journal of the American Statistical Association 199085:565-571.
13. Pepe MS, Anderson GL. A cautionary note on inference for marginal regression models with
longitudinal data and general correlated response data. Communcations in Statististics, Ser.
B 199423:939-951.
14. Pepe MS, Whitaker RC, Seidel K. Estimating and comparing univariate associations with
application to the prediction of adult obesity. Statistics in Medicine 199918:163-173.
15. Crowder M. On the use of a working correlation matrix in using generalized linear models for
repeated measures. Biometrika 199582:407-410.
16. Drum M, McCullagh P. Comment. Statistical Science 19938:300-301.
17. Emrich LJ, Piedmonte MR. On some small sample properties of generalized estimating equation estimates for multivariate dichotomous outcomes. Journal of Statistical Computation
and Simulation 199241:19-29.
19
Table 1: Formulas for calculating c in various special cases. Once c is obtained, then one can
use formulas (5) and (6) to calculate vR , based on which and using formulas (3) and (4) one can
calculate the required sample size N or power 1 ; .
Cases
Unequal ni Equal ni n
I. Rw = R0 = CS
(7)
(8)
(9)
(8)
II. Rw = I , R0 = CS
III. Rw = R0 = AR(1)
(10)
(11)
IV. Rw = I , R0 = AR(1)
(12)
(13)
V. Rw = R0 = I
(16)
1=n
VI. Rw = CS , R0 = I
(17)
1=n
(18)
(19)
VII. Rw = AR(1), R0 = I
20
Table 2: Sample sizes N required to attain 80% and 90% power for Cases I-IV. For comparison,
results of Liu and Liang (L&L) for Case I are also included. In all set-ups, p1 ; p0 = 0:2, = 0:5
and all ni = n = 4.
Set-up
= :2, p0 = :1, p1 = :3
p0 = :2, p1 = :4
p0 = :3, p1 = :5
p0 = :4, p1 = :6
= :4, p0 = :1, p1 = :3
p0 = :2, p1 = :4
p0 = :3, p1 = :5
p0 = :4, p1 = :6
= :6, p0 = :1, p1 = :3
p0 = :2, p1 = :4
p0 = :3, p1 = :5
p0 = :4, p1 = :6
Power=0.80
L&L I,II III
48 55 46
63 68 57
73 77 64
76 80 67
65 76 60
87 94 75
100 106 84
104 110 88
83 96 79
110 119 98
127 135 110
132 140 114
IV
46
58
65
67
62
77
86
90
81
101
114
118
Power=0.90
L&L I,II III
64 74 62
85 92 76
97 103 86
101 107 89
87 101 81
116 126 100
133 142 113
139 147 117
111 129 105
148 160 131
170 180 147
177 187 153
IV
62
77
87
90
83
102
115
120
109
135
152
158
21
Table 3: Sample sizes N required to attain 80% and 90% power for Cases I-IV. For comparison,
results of Liu and Liang (L&L) for Case I are also included. In all set-ups, the log odds ratio is
b = log(2:5), = 0:5 and all ni = n = 4.
Set-up
= :2, p0 = :1, p1 = :217
p0 = :2, p1 = :385
p0 = :3, p1 = :517
p0 = :4, p1 = :625
= :4, p0 = :1, p1 = :217
p0 = :2, p1 = :385
p0 = :3, p1 = :517
p0 = :4, p1 = :625
= :6, p0 = :1, p1 = :217
p0 = :2, p1 = :385
p0 = :3, p1 = :517
p0 = :4, p1 = :625
Power=0.80
L&L I,II III
120 128 107
73 79 66
62 66 55
59 64 53
164 176 140
101 108 86
85 91 72
81 87 69
209 224 183
128 137 112
107 115 94
103 111 91
IV
108
66
56
53
144
88
74
71
190
116
98
94
Power=0.90
L&L I,II III
160 172 143
98 105 88
83 88 74
79 85 71
220 236 188
134 144 115
113 121 97
109 117 93
280 300 245
171 183 150
144 154 126
138 148 121
IV
144
88
74
71
192
118
99
95
254
155
131
125