Document 262763

Sankhy¯
a : The Indian Journal of Statistics
2010, Volume 72-B, Part 1, pp. 58-75
c 2010, Indian Statistical Institute
Sample size determination in logistic regression
M. Khorshed Alam
University of Cincinnati Medical Centre, USA
M. Bhaskara Rao
University of Cincinnati Medical Centre, USA
Fu-Chih Cheng
North Dakota State University, USA
Abstract
Whittemore (1981) and Hsieh et al. (1998) have proposed different methods for determining sample size in the context of testing the significance of
a slope parameter in logistic regression. Their sample size formulas have
been incorporated in some statistical software packages. In this paper, we
use a variation of Whittemore (1981) method to calculate sample size. We
compare these three sample size formulas to assess closeness of the nominal
and observed powers via simulations. These studies lead us to propose another method, which is a combination of Hsieh et al. (1998) method and the
proposed variation of the Whittemore (1981) method for calculating sample
size when the covariate is normally distributed. However, when the covariate
has a Bernoulli distribution, sample size calculations widely diverge between
the Hsieh et al. (1998) method and the proposed method. Interestingly,
the proposed method gives a better account of power than the Hsieh et al.
(1998) method.
AMS (2000) subject classification. Primary 62J12; Secondary 62F03, 62Q05.
Keywords and phrases. Covariates, Logistic Regression, Sample Size, Power,
Size, Simulations.
1
Introduction
Logistic regression is ubiquitous in many epidemiological studies, in which
a binary response variable Y , representing disease status, is modeled as a
function of some risk factors. We will focus only on a single risk factor
(X), which is assumed to have a specific known distribution. The logistic
regression model is given by
P (Y = 1|X) =
eγ0 +γ1 X
= 1 − P (Y = 0|X)
1 + eγ0 +γ1 X
Sample Size Determination in Logistic Regression
59
for some unknown parameters γ0 and γ1 . We want to test the validity of the
null hypothesis H0 : γ1 = 0 against the alternative H1 : γ1 = A (specified)
> 0 based on a random sample (Y1 , X1 ), (Y2 , X2 ), . . . , (YN , XN ) of size N on
(Y, X). Normally, a test based on the asymptotic theory of the maximum
likelihood estimator γˆ1 of γ1 is used for testing H0 against H1 . The basic
question tackled in this paper is what should be the sample size N so that
the asymptotic test has a given size α and power 1 − β? In a typical sample
size calculation, three ingredients are essential: size (α), power (1 − β),
and specific alternative value of the parameter of interest (γ1 = A). In the
context of logistic regression considered here, γ0 is a nuisance parameter.
There are two ways to tackle the nuisance parameter γ0 .
1 Assume that γ0 is known.
2 Estimate γ0 under the null hypothesis and then use the asymptotic
theory of the resultant score statistic or use the asymptotic theory of
the likelihood ratio test statistic for the calculation of sample size.
Approach 1 has been pursued by Whittemore (1981) and Hsieh et al.
(1998). Approach 2, which has been pursued by Self and Mauritsen (1988)
and Self et al. (1992), is complicated and iterative without an explicit formula, and it will not be pursued here.
Whittemore (1981) has made an additional assumption, namely small
response probability, in her calculations. The assumption of small response
probability means that 1 + eγ0 +γ1 X ∼
= 1 for likely X. Technically, if X
has a standard normal distribution, small response probability means that
2
γ1
E(1 + eγ0 +γ1 X ) = 1 + eγ0 + 2 ∼
= 1. The small response probability condition
puts severe restrictions on both γ0 and γ1 . She has tabulated sample sizes for
each of the following cases of distribution of X : standard normal; standard
exponential; Poisson (γ = 1); and Bernoulli; for specified size α, power 1−β,
and γ1 = A. When X has a standard normal distribution, the sample size
is given by
A2
(Zα + e− 4
N=
eγ0 A2
Zβ 2
)
(1.1)
where Zα is the upper 100* αth percentile of the standard normal distribution. Whittemore (1981) offered a modification of (1.1) in order to ameliorate
the assumption of small response probability. The modified formula is given
60
M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng
by
N=
Zα + e−
A2
4
eγ0 A2
Zβ
2

eγ0
∗ 1 + 2
[1 + (1 + A2 ) e
∗
(1 + 2 eγ0
5A2
4

]
(1.2)
Formula (1.2) is incorporated in the software nQuery (See nQuery Advisor,
Release 6.0, Appendix 7-19). Sample size calculations are not available in
nQuery for other distributions of X. The software warns the user that the
formula can only be used if 0.4 ≤ eA ≤ 2.5, or equivalently −0.916 ≤ A ≤
0.916. The role of γ0 is not addressed. However, in the actual usage of
the software, γ0 is restricted to the interval [− ln(99), 0]. If the warning
is ignored the sample size numbers are outlandish if formula (1.2) is used
directly. Hsieh (1989) has tabulated sample sizes from a different angle using
the Whittemore formula. This version is incorporated in nQuery, Sections
18-4 to 18-11.
We will now review Hsieh et al. (1998) method. The critical idea here is
that the logistic regression problem can be viewed as a two-sample problem.
The following is a chain of ideas they presented. Assume that X has a standard normal distribution. Let µ1 = E(X|Y = 1), µ2 = E(X|Y = 0), σ12 =
V ar(X|Y = 1) and σ22 = V ar(X|Y = 0). If γ1 = 0, the random variables X
and Y are independently distributed. Consequently, the conditional distributions of X|Y = 1 and X|Y = 0 and the distribution of X are all identical.
Therefore, the hypothesis H0 : γ1 = 0 implies H0 : µ1 = µ2 . The logic
was that if the conditional distributions were indeed normal, the original
testing problem can be brought under the purview of a two-sample t-test.
However, there is a caveat. If H0 : γ1 = 0 is not true, then µ1 6= µ2 and
σ12 6= σ22 , and in fact, the conditional distributions are non-normal. Hsieh et
al. (1998) assumed that σ12 = σ22 = σ 2 say, under the alternative hypothesis
µ −µ
and γ1 = A ∼
= 1 σ 2 , and then using the framework of a two-sample problem (see Rosner, 2000, p. 384), they came up with the following sample size
formula
(Zα + Zβ )2
N= ∗
,
(1.3)
P (1 − P ∗ )A2
γ
e 0
where P ∗ = 1+e
γ0 . This formula is simpler than (1.2). It is incorporated
in the software PASS (2005). One crucial advantage of Hsieh et al. (1998)
method over the Whittemore (1981) method is that the assumption of small
response probability is not used in the calculations. In the same vein, Hsieh
Sample Size Determination in Logistic Regression
61
et al. (1998) also developed a sample size formula when the covariate X is
binary, i.e., it has a Bernoulli distribution (π). The formula is given by
q
q
2
P (1−P )
1−π
Zα
+ Zβ
P1 (1 − P1 ) + P2 (1 − P2 ) π
π
(1.4)
N=
(P1 − P2 )2 (1 − π)
where P1 =
eγ0
1+eγ0 , P2
=
eγ0 +A
1+eγ0 +A
and P = (1 − π) P1 + πP2 .
This formula is also incorporated in nQuery and PASS. There is one critical
concern about these formulas in that they are very sensitive to the choice
of γ0 . A small variation in γ0 leads to a big difference in sample sizes.
Therefore, it is imperative to conduct a pilot study in order to have some
idea about the intercept γ0 .
Hsieh et al. (1998) invoked the standard two-sample framework for calculating sample sizes. However, the conditions that are to be met for the
applicability of the two-sample paradigm do not carry verbatim to the framework considered by Hsieh et al. (1998).
Standard Two-sample framework
1. Populations: X ∼ (µ1 , σ 2 )
Y ∼ (µ2 , σ 2 )
H 0 : µ1 = µ2
2. Hypotheses: H0 : (µ1 , µ2 )
H 1 : µ1 > µ2
3. Distributions are normal with
a common variance whatever
the means are.
Hsieh, Block and Larsen Set-up
1. Populations: Conditional
distributions X|Y = 1 and
X|Y = 0. The covariate X
has a normal distribution.
2. Hypotheses: H0 : γ1 = 0
H1 : γ1 > 0
3. Distributions X|Y = 1 and
X|Y = 0 are normal when
only γ = 0. Distributions are
non-normal if γ1 6= 0. Further,
they have different variances.
Even though there is a substantial divergence between these two sets
of conditions, Hsieh et al. (1998) forged ahead and derived a sample size
formula exploiting the standard two-sample paradigm. Surprisingly, it works
to a large extent as we shall see later. However, the calculations fail badly
when the covariate X is Bernoulli.
In this paper, we propose a variation of the Whittemore (1981) method
and derive an explicit formula for the sample size. Its exact computation
requires numerical integration or summation, which is easy to incorporate
in any software.
62
M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng
In Section 2, we present the proposed variation of the Whittemore (1981)
method and derive a formula (Formula (2.4) in Section 2) for sample size.
We also present sample size tables contrasting the three formulas when the
covariate X has a standard normal distribution. In Section 3, we present
our simulation work for comparing nominal and actual sizes and powers
under different sample size formulas. There are some issues that stem out
of the simulation study. Both Hsieh et al. (1998) method and the proposed
variation do not seem to meet the nominal levels of power for a certain range
of parameter values. We, however, come up with a remedy by suggesting the
average of the sample sizes given by formulas (1.3) and (2.4). Simulations
are conducted using the new sample sizes in order to compare nominal and
observed powers. The proposed remedy seems to work. In Section 4, sample
sizes are calculated using the proposed method for other distributions of
X. In Section 5, the Bernoulli case is dealt separately and we observe that
samples size calculations widely diverge between the proposed method and
Hsieh et al. (1998) method. Using simulations, we demonstrate that the
proposed method gives a better account of the nominal power. Finally, in
Section 6, a discussion of the methods is carried out.
2
Variation
Let X be any covariate with mean zero and variance one. The case
of binary X is dealt with separately. The data consist of N independent
realizations of (Y1 , X1 ), (Y2 , X2 ), . . . , (YN , XN ) of (Y, X). The conditional
likelihood of the data is given by
1−Yi
γ0 +γ1 Xi Yi 1
e
N
L = L(γ1 ) = Πi=1
1 + eγ0 +γ1 Xi
1 + eγ0 +γ1 Xi
h
i
∂2
2 eγ0 +γ1 X
Note that −E ∂γ
= N I(γ1 ), say.
2 ln L = N E X 1+eγ0 +γ1 X
1
Let γˆ1 be the maximum likelihood estimator of γ1 . The asymptotic variance
1
.
of γˆ1 is N I(γ
1)
Note that
I(0) = EX 2
and
eγ0
eγ0
=
(1 + eγ0 )2
(1 + eγ0 )2
"
I(A) = E X 2
eγ0 +AX
(1 + eγ0 +AX )2
#
(2.1)
(2.2)
Sample Size Determination in Logistic Regression
63
The large sample test (Wald test) for testing H0 : γ1 = 0 is built on the
following statistic,
p
γˆ1
= γˆ1 N I(0)
Z=
(2.3)
SE(ˆ
γ1 )H0
Asymptotically, Z has a standard normal distribution underH0 . An α-level
one-sided test is given by: Reject H0 in favor of H1 : γ1 > 0 if and only
if Z > Zα , where Zα (critical value) is the 100α% upper percentile of the
standard normal distribution. Let 1−β be the power decreed and γ1 = A > 0
the specified alternative. Set
p
1 − β = P r γˆ1 N I(0) > Zα |H1
!
p
p
N I(A)
Zα |γ1 = A
= P r (ˆ
γ1 − A + A) N I(A) > p
N I(0)
!
p
p
p
N I(A)
= P r (ˆ
γ1 − A) N I(A) > p
Zα − A N I(A) |γ1 = A
N I(0)
p
Under H1 : γ1 = A, (ˆ
γ1 − A) N I(A) has a standard normal distribution,
asymptotically.
√
p
N I(A)
Set −Zβ = √
Zα − A N I(A). Then
N I(0)
N=
√Zα
I(0)
Zβ
+√
A2
I(A)
2
=
Zα (1+eγ0 )
√ γ
e 0
Zβ
+√
A2
I(A)
2
.
(2.4)
Structurally, the formula for N given above is the same as the one given by
Whittemore (1981). In our formula (2.4), I(A) is obtained numerically. The
value of I(A) depends on the underlying distribution of X. We have used
a combination of FORTRAN and Mathematica for numerical computations.
The complete code is available on request.
We have tabulated (Table 1) the required sample sizes along with those
given by Whittemore (1981) using formula (1.2) and those given by Hsieh
et al. (1998) using Formula (1.3).
64
M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng
Table 1: Sample Size (N) calculations in logistic regression for different
values of event rate (eγ0 ). The covariate has a standard Normal
distribution.
eγ0
α
β
A
.1
0.01
0.01
.5
1
2
.1
0.01
0.05
.5
1
2
.1
0.05
0.01
.5
1
2
.1
0.05
0.05
.5
1
2
NF
W
HBL
NF
W
HBL
NF
W
HBL
NF
W
HBL
NF
W
HBL
NF
W
HBL
NF
W
HBL
NF
W
HBL
NF
W
HBL
NF
W
HBL
NF
W
HBL
NF
W
HBL
NF
W
HBL
NF
W
HBL
NF
W
HBL
NF
W
HBL
1
16
1
8
1
4
1
2
1
2
4
8
16
38908
39030
39101
1411
1524
1565
306
428
392
77
2791
98
28369
28452
28486
1047
1123
1140
233
325
285
59
2369
72
28320
28427
28486
1010
1099
1140
213
299
285
53
1724
72
19451
19522
19548
706
763
782
153
214
196
39
1396
49
21853
21588
21919
829
880
877
198
291
220
58
2771
55
15928
15737
15968
610
648
639
147
221
160
42
2351
40
15912
15723
15968
598
634
639
142
203
160
43
1711
40
10925
10798
10958
414
440
439
99
146
110
29
1386
28
13526
13039
13530
544
554
542
146
222
136
51
2759
34
9855
9505
9857
396
408
395
105
169
99
35
2342
25
9854
9497
9857
396
400
395
108
155
99
40
1704
25
6762
6522
6764
272
277
271
73
111
68
26
1380
17
9766
8679
9742
415
391
390
123
188
98
50
2754
25
7112
6326
7097
299
288
284
86
143
71
33
2341
18
7118
6321
7097
306
282
284
93
132
71
40
1704
18
4883
4341
4870
208
196
195
62
94
49
25
1377
13
8692
6541
8660
378
309
347
117
171
87
50
2752
22
6328
4769
6309
272
228
253
81
130
64
33
2329
16
6336
4765
6309
280
223
253
89
120
64
41
1697
16
4346
3272
4328
189
155
174
59
86
44
25
1382
11
9766
5451
9742
415
269
390
123
163
98
50
2761
25
7112
3974
7097
299
198
284
86
124
71
33
2327
18
7118
3971
7097
306
194
284
93
114
71
40
1696
18
4883
2727
4870
208
135
195
62
82
49
25
1370
13
13526
4912
13530
544
248
542
146
158
136
51
2752
34
9855
3580
9857
396
183
395
105
121
99
35
2335
25
9854
3577
9857
396
179
395
108
111
99
40
1700
25
6762
2457
6764
272
124
271
73
79
68
26
1378
17
21853
4642
21919
829
238
877
198
156
220
58
2747
55
15928
3384
15968
610
175
639
147
119
160
42
2339
40
15912
3381
15968
598
172
639
142
109
160
43
1696
40
10925
2322
10959
414
119
439
99
78
110
29
1374
28
38908
4507
39101
1411
233
1565
306
155
392
77
2747
98
28369
3285
28486
1047
172
1140
233
118
285
59
2330
72
28320
3283
28486
1010
168
1140
213
108
285
53
1704
72
19451
2254
19548
706
117
782
153
78
196
39
1374
49
Legend: NF = Formula (2.4); W = Formula (1.2); HBL = Formula (1.3).
Sample Size Determination in Logistic Regression
65
Some comments are in order on Table 1. In our new formula (2.4), sample
size N will remain the same for eγ0 and e−γ0 . In the case of the Whittemore
formula (1.2), the sample sizes are different. As one can see, for large values
of eγ0 and A, the small response probability condition in formula (1.2) is
breaking down leading to unreasonable sample sizes in comparison with those
provided by formulas (1.3) and (2.4). Even if we follow the advice of nQuery
that 0.4 ≤ eA ≤ 2.5, the sample sizes are still uncomfortable. We have
calculated required sample sizes using formula (1.2) for the case α = 0.05,
1 1 1 1
, 8 , 4 , 2 , and 1. The numbers for γ0 and A fall
1 − β = 0.90, and eγ0 = 16
within the guidelines. Sample sizes are also calculated using formulas (1.3)
and (2.4) for the same specifications. The numbers are tabulated below.
Sample Sizes
eγ0
Formula
(1.2)
(1.3)
(2.4)
1
16
1
8
1
4
1
2
206
185
152
137
104
95
103
64
67
80
46
55
1
77
41
52
As one can notice, formula (1.2) demands much larger sample size than the
other two.
3
Observed Power
Simulations are conducted to examine how ours, Whittemore (1981), and
Hsieh et al. (1998) sample sizes are achieving nominal size, α, and nominal
power, 1 − β. We consider two cases: H0 : γ1 = 0 versus H1 : γ1 = 0.5 and
versus H1 : γ1 = 1.0. For each choice for eγ0 (listed in Table 2), α = 0.05,
and each choice of β(0.10, 0.05), sample size N is calculated using formulas
(1.2), (1.3), and (2.4). The input for simulations are α, β, γ0 and γ1 . If we
use the test statistic (2.3) to calculate observed size and power, they will be
very close to nominal size and power for sample sizes calculated as per formula (2.4). We would like to see how close the observed and nominal powers
are when in (2.3) γ0 is replaced by the maximum likelihood estimator γˆ0 .
More specifically, in simulations we use the following test: Reject H0 in favor
√
γ
ˆ0
Nγ
ˆ1 e 2
(1+eγˆ0 )
> Zα to calculate observed size and power. Simulation
of H1 if
results are reported in Tables 2 and 3.
66
New formula (2.4)
Nominal
power
γ0
log
log
log
log
log
log
log
log
log
(1/16)
(1/8)
(1/4)
(1/2)
(1)
(2)
(4)
(8)
(16)
90 %
N
566
330
215
163
148
163
215
330
566
Whittemore Modified formula (1.2)
90 %
95 %
95 %
Actual
Power
Size
0.8841
0.0477
0.8944
0.0509
0.9028
0.0586
0.9031
0.0558
0.9136
0.0547
0.9111
0.0518
0.9066
0.0571
0.8878
0.0482
0.8966
0.0480
N
706
414
272
208
189
208
272
414
706
Actual
Power
Size
0.9395
0.0473
0.9443
0.0472
0.9530
0.0516
0.9576
0.0529
0.9605
0.0523
0.9562
0.0534
0.9500
0.0523
0.9424
0.0505
0.9345
0.0494
N
608
351
221
156
124
108
99
95
93
Actual
Power
Size
0.9016
0.0534
0.9080
0.0491
0.9097
0.0475
0.8981
0.0586
0.8641
0.0541
0.7982
0.0591
0.6377
0.0549
0.4531
0.0477
0.2608
0.0356
N
763
440
277
196
155
135
124
119
117
Actual
Power
Size
0.9533
0.0462
0.9527
0.0529
0.9498
0.0530
0.9470
0.0531
0.9221
0.5308
0.8585
0.0572
0.7190
0.0573
0.5369
0.0516
0.3252
0.0422
Hsieh, Bloch, & Larsen formula (1.3)
90 %
95 %
N
619
347
215
155
138
155
215
347
619
Actual
Power
Size
0.9083
0.0492
0.9090
0.0515
0.9056
0.0517
0.8969
0.0557
0.8937
0.0570
0.8982
0.0519
0.9042
0.0544
0.9073
0.0489
0.9095
0.0489
N
782
439
271
195
174
195
271
439
782
Actual
Power
Size
0.9549
0.0465
0.9516
0.0480
0.9485
0.0524
0.9446
0.0565
0.9414
0.0502
0.9430
0.0543
0.9488
0.0518
0.9552
0.0514
0.9573
0.0436
Table 3 : Calculated sample sizes N, actual power, and size in logistic regression for testing H0 : γ1 = 0 against H1 : γ1 = 1.0.
Nominal size = α = 0.05
New formula (2.4)
Nominal
power
γ0
log
log
log
log
log
log
log
log
log
(1/16)
(1/8)
(1/4)
(1/2)
(1)
(2)
(4)
(8)
(16)
90 %
N
125
80
57
48
45
48
57
80
125
Actual
Power
Size
0.8729
0.0393
0.8916
0.0468
0.9122
0.0627
0.9262
0.0676
0.9343
0.0654
0.9311
0.0652
0.9040
0.0585
0.8935
0.0485
0.8765
0.0429
Whittemore Modified formula (1.2)
90 %
95 %
95 %
N
153
99
73
62
59
62
73
99
153
Actual
Power
Size
0.9217
0.0404
0.9390
0.0522
0.9636
0.0559
0.9694
0.0606
0.9749
0.0599
0.9696
0.0628
0.9570
0.0546
0.9433
0.0472
0.9246
0.0460
N
175
199
91
77
70
67
65
64
64
Actual
Power
Size
0.9506
0.0427
0.9664
0.0508
0.9823
0.0579
0.9886
0.0600
0.9872
0.0604
0.9795
0.0621
0.9362
0.0571
0.8209
0.0511
0.5716
0.0299
N
214
146
111
94
86
82
79
78
78
Actual
Power
Size
0.9818
0.0472
0.9863
0.0489
0.9926
0.0532
0.9957
0.0566
0.9955
0.0611
0.9913
0.0602
0.9632
0.0541
0.8844
0.0496
0.6666
0.0316
Hsieh, Bloch, & Larsen formula (1.3)
90 %
95 %
N
155
87
54
39
35
39
54
87
155
Actual
Power
Size
0.9326
0.0426
0.9113
0.048
0.8932
0.0644
0.8803
0.0699
0.8777
0.0682
0.8828
0.0686
0.8907
0.0627
0.9205
0.0491
0.9335
0.0417
N
196
110
68
49
44
49
68
110
196
Actual
Power
Size
0.9675
0.0464
0.9618
0.0505
0.9468
0.0568
0.9344
0.0631
0.9312
0.0685
0.9332
0.0646
0.9466
0.0599
0.9572
0.0515
0.9695
0.0446
M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng
Table 2 : Calculated sample sizes N, actual power, and size in logistic regression for testing H0 : γ1 = 0 against H1 : γ1 = 0.5.
Nominal size = α = 0.05
Sample Size Determination in Logistic Regression
67
From Tables 2 and 3, it is clear that the Whittemore formula (1.2) is not
giving actual powers close to the nominal one as we move away from small
γ0 to large γ0 . This study provides a clear warning to what happens if we
ignore the small response probability condition.
New formula (2.4) and formula (1.3) are also not up to the mark. For
small and large values of γ0 , New formula (2.4) is giving significantly lower
powers than the nominal ones. For middle values of γ0 formula (1.3) is
giving significantly lower powers than the nominal ones. This is because
the formulas are derived by assuming γ0 is known, whereas in simulations
estimated γ0 is used. In the case of formula (1.3), the assumption that
2
σ12 = σ22 = σ 2 and A ∼
= µ1 −µ
σ , which are not valid, might be playing a role
in lower observed powers. For example, when γ0 = 3, γ1 = A = 1, µ1 =
0.05621, µ2 = 0.75642, σ12 = 0.95000, and σ22 = 1.05810. If we take σ 2 to be
µ −µ
the average of σ12 and σ22 , we have A ∼
= 1 σ 2 = 0.80609 but A = 1.
As a remedy for this problem, we suggest to take the average of the
sample sizes given by formulas (1.3) and (2.4). With these sample sizes, we
did simulations. The results are tabulated in Table 4. We have now a clear
indication that the new sample size calculation seems to be working.
Table 4 : Calculated sample sizes N , actual power, and size in
logistic regrssion for testing H0 : γ1 = 0 against H1 : γ1 = 1.0.
Nominal size = α = 0.05.
N = [New formula (2.4) + Hsieh, Block, & Larsen (1.3)]/2
Nominal Power
90%
95%
γ0
N
Actual
N
Actual
Power
Size
Power
Size
log (1/16)
140 0.9071 0.0398 175 0.9550 0.0484
log (1/8 )
84 0.9072 0.0498 105 0.9492 0.0469
56 0.9074 0.0585 71 0.9548 0.0548
log (1/4 )
44 0.9088 0.0659 56 0.9521 0.0604
log (1/2 )
log (1)
40 0.9111 0.0681 52 0.9550 0.0671
44 0.9116 0.0637 56 0.9536 0.0610
log (2)
log (4)
56 0.9064 0.0577 71 0.9544 0.0541
84 0.9012 0.0514 105 0.9537 0.0500
log (8)
140 0.9055 0.0440 175 0.9527 0.0453
log (16)
68
M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng
4
Other Distributions
We look at three other cases for the distribution of the covariate X.
1. X has a standardized exponential distribution, i.e.,X = U − 1, where
U has a standard exponential distribution (Exp(λ) ( with λ = 1).
2. X has a standardized Poisson distribution, i.e.,X = U − 1, where U
has a Poisson distribution with mean unity.
3. X has a Bernoulli distribution (π).
In Cases 1 and 2, the entity I(0), under H0 : γ1 = 0 remains the same as
in (2.1), and the test statistic also remains the same as in (2.3). The formula
for N structurally also remains the same as in (2.4). The only difference lies
in the computation of I(A), which is extracted numerically from (2.2) using
the assumed distribution of X. The sample sizes as per formula (2.4) are
given in Tables 5, 6, and 7 along side the Whittemore(1981) numbers. In
the case of Bernoulli distribution, we have an explicit formula for I(A) given
by,
#
"
eγ0 +A
.
(4.1)
I(A) = π
(1 + eγ0 +A )2
The Whittemore (1981) formula (1.2) can not be used for certain values of
A in the exponential case as the integral involved in the formula does not
exist. The symbol (xx) in Table 5 indicates such a contingency. Hsieh et al.
(1998) have not considered these distributions.
5
The Bernoulli Case
In the case of Bernoulli X with success probability π, we have calculated
required sample sizes using formula (1.4) of Hsieh et al. (1998) and the
following formula stemming from our proposed method:
N=
√Zα
I(0)
Zβ
+√
A2
where I(A) is given by (2.2).
I(A)
2
=
γ0
Zα
√(1+eγ )
π∗ e 0
Zβ
+√
A2
I(A)
2
,
(5.1)
Sample Size Determination in Logistic Regression
69
The sample sizes differ considerably with formula (1.4) giving higher numbers. We conducted simulations (10,000 times) to check which sample size
provides a good account of the nominal power. We have tabulated the sample sizes in Table 7 and simulation work in Tables 8 and 9. From Tables 8
and 9, it is clear that formula (1.4) gives more power than what we have
sought with larger sample size it provides.
6
Discussion
Sample size formulas are available in the literature in the context of
logistic regression with a single covariate. Formula (1.2) has been derived
by Whittemore (1981) under the assumption of small response probability.
Formula (1.3) has been derived by Hsieh et al. (1998) by formulating the
hypothesis testing problem as a two-sample problem. In this paper, we
propose a new way for calculating sample size given by formula (2.4). We
tabulate below the approaches pursued by the three contributors in this
context.
Whittemore (1981)
Maximum Likelihood
Small response
probability
Hsieh, Block and Larsen
(1998)
Method
Two-sample problem
Assumption
The conditional distributions
of X|Y = 1 and X|Y = 0
have normal distributions
with equal variance under
H1 (Not true)
New Approach
Maximum Likelihood
None
We compare all three formulas via simulations to see how close the observed and nominal powers are. When the covariate is Bernoulli, the new
method provides a better account of nominal power than the method proposed by Hsieh et al. (1998). Unlike Hsieh et al. (1988) method, which has
limitations on X, our method is applicable whatever may be the distribution
of the covariate X.
70
M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng
Table 5 : Sample Size (N) calculations in logistic regression for different values of
event rate (eγ0 ).
The covariate has a standard Exponential distribution.
eγ0
α
β
.1
0.01
0.01
.5
1
2
.1
0.01
0.05
.5
1
2
.1
0.01
0.01
.5
1
2
.1
0.05
0.01
.5
1
2
.1
0.05
0.05
.5
1
2
.1
0.05
0.10
1
16
A
.5
1
2
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
1
8
25658
14606
31171
15585
787
492
732
366
192
130
xx
xx
58
44
xx
xx
20227
11478
23135
11567
657
401
604
302
192
130
xx
xx
58
44
xx
xx
17595
9965
19340
9670
592
356
541
271
146
93
xx
xx
41
29
xx
xx
17218
9835
22294
11147
495
318
467
233
121
86
xx
xx
38
31
xx
xx
12827
7302
15590
7795
393
246
366
183
96
65
xx
xx
29
22
xx
xx
10750
6106
12506
6253
344
211
317
159
85
56
xx
xx
25
18
xx
xx
Legend: NF =
1
4
1
2
9245
6898
7739
3896
349
297
183
92
102
96
xx
xx
39
42
xx
xx
7229
5355
5784
2892
277
228
151
76
102
96
xx
xx
39
42
xx
xx
6255
4613
4835
2418
241
194
135
68
67
58
xx
xx
23
22
xx
xx
6259
4705
5574
2787
234
296
117
58
71
70
xx
xx
30
34
xx
xx
4622
3449
3898
1949
175
149
92
46
51
48
xx
xx
20
21
xx
xx
3851
2858
3126
1563
147
122
79
40
42
38
xx
xx
16
16
xx
xx
Formula (2.4);
1
2
4
6391
7476
10696
1948
974
487
312
405
623
46
23
12
111
155
250
xx
xx
xx
55
83
142
xx
xx
xx
4922
5714
8129
1446
723
362
232
294
444
38
19
9
111
155
250
xx
xx
xx
55
83
142
xx
xx
xx
4217
4872
6905
1209
604
302
194
241
360
34
17
9
63
82
128
xx
xx
xx
27
38
62
xx
xx
xx
4398
5185
7462
1393
697
348
224
297
465
29
15
7
84
121
199
xx
xx
xx
45
71
122
xx
xx
xx
3195
3738
5347
974
487
244
156
203
312
23
11
6
56
78
125
xx
xx
xx
28
42
71
xx
xx
xx
2633
3063
4364
782
391
195
126
160
243
20
10
5
43
58
93
xx
xx
xx
20
30
50
xx
xx
xx
W = Formula (1.2).
8
16
17661
244
1976
6
443
xx
260
xx
13375
181
758
5
443
xx
260
xx
11333
151
611
4
223
xx
111
xx
12368
174
810
4
357
xx
226
xx
8830
122
538
3
222
xx
130
xx
7186
98
416
3
163
xx
90
xx
31856
122
1988
3
832
xx
496
xx
24075
90
1393
2
832
xx
496
xx
20371
76
1119
2
414
xx
210
xx
22356
87
1505
3
674
xx
434
xx
15926
61
994
1
416
xx
248
xx
12943
49
765
1
304
xx
170
xx
Sample Size Determination in Logistic Regression
71
Table 6 : Sample Size (N) calculations in logistic regression for different values of
event rate (eγ0 ).
The covariate X + 1 has a standard Poisson distribution.
eγ0
α
β
A
.1
0.01
0.01
.5
1
2
.1
0.01
0.05
.5
1
2
.1
0.01
0.01
.5
1
2
.1
0.05
0.01
.5
1
2
.1
0.05
0.05
.5
1
2
.1
0.05
0.10
1
16
.5
1
2
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
NF
W
1
8
1
4
26077
12527
7246
32875
16438
8219
506
302
197
1028
514
257
119
70
46
176
88
44
29
18
12
24
12
6
20489
10165
5974
24171
12085
6043
471
276
178
791
396
198
113
66
42
146
73
37
28
17
11
23
12
6
17786
9007
5331
20077
10039
5020
452
263
168
677
339
170
109
63
41
132
66
33
27
16
10
23
12
6
17562
8143
4636
23739
11870
5935
279
170
113
708
354
177
64
39
26
111
56
28
16
10
7
12
6
3
13037
6263
3623
16443
8222
4111
253
151
99
541
257
129
60
35
23
88
44
22
15
9
6
12
6
3
10899
5361
3134
13100
6550
3275
240
141
91
424
212
106
57
34
22
77
39
19
14
9
6
12
6
3
Legend: NF = Formula
1
2
1
2
4
4959
4137
4216
5361
4109
2055
1027
514
148
132
144
189
129
64
32
16
35
30
34
45
22
11
6
3
9
8
9
12
3
2
1
1
4128
3494
3681
4735
3021
1511
755
378
132
118
130
172
99
49
25
12
32
28
31
42
18
9
5
2
8
7
8
11
3
2
1
0
3716
3173
3384
4417
2510
1255
628
314
124
111
122
164
85
43
21
11
30
27
30
40
17
8
4
2
8
7
8
10
3
2
1
1
3132
2570
2591
3157
2968
1484
742
371
86
77
83
107
89
45
22
11
20
18
20
26
14
7
4
2
5
5
5
7
2
1
1
0
2479
2068
2136
2681
2056
1028
514
257
74
66
72
95
65
32
16
8
18
16
17
23
11
6
3
2
5
4
5
6
2
1
1
0
2163
1824
1911
2443
1638
819
410
205
68
61
67
88
53
27
14
7
16
15
16
22
10
5
3
1
4
4
4
6
2
1
1
0
(2.4); W = Formula (1.2).
8
16
7850
257
285
8
69
2
17
0
7112
189
465
6
65
1
16
0
6733
157
255
6
63
1
16
0
4478
186
158
6
38
1
10
0
3925
129
143
4
35
1
9
0
3645
103
135
4
33
1
9
0
12813
129
478
4
117
1
29
0
11881
94
451
3
111
1
28
0
11398
79
438
3
108
1
27
0
7095
93
258
3
63
1
16
0
6406
64
239
2
59
1
15
0
6053
51
229
2
56
1
14
0
72
M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng
Table 7: Sample Size (N ) calculations in logistic regression for different values of
event rate (eγ0 ).
The covariate X has a Bernouli distribution (π = 0.5)
eγ0
α
β
A
1
16
1
8
1
4
1
2
1
2
.1 NF
74884 42204 26287 19184 8669 19834
W
135188 67594 33979 16899 8450 4225
HBL 149727 84375 52552 38349 34662 36949
2558
1476
957
739
715
873
0.01 .5 NF
W
4981
2490 1245
623
312
156
HBL 5078
2926 2202 1459 1412 1725
1 NF
544
326
223
186
196
258
W
1156
578
289
145
73
36
HBL 1052
625
426
354
373
494
2 NF
111
73
58
57
70
105
W
266
133
67
34
17
9
HBL
189
122
94
93
117
178
.1 NF
54966 30949 19247 14013 6859 14406
W
98917 49459 24730 12365 6183 3091
0.01
HBL 109079 61469 38326 27939 25253 28886
1932
1109
712
544
518
624
0.05 .5 NF
W
3698
1849
925
462
231
116
HBL 3701
2133 1646 1065 1030 1258
1 NF
424
251
169
137
140
180
W
870
435
218
109
55
27
HBL
768
457
312
259
274
362
2 NF
91
58
43
41
48
69
W
204
102
51
26
13
7
HBL
139
90
70
69
87
131
.1 NF
45603 25661 15941 11588 10431 11866
0.01
W
81894 40947 20474 10237 5119 2559
HBL 90034 50737 31602 23062 20845 23843
1634
935
597
452
426
508
0.01 .5 NF
W
3092
1546
733
387
193
97
HBL 3055
1761 1383
880
851
1039
1 NF
366
215
143
114
114
144
W
734
367
184
92
46
23
HBL
635
378
257
215
227
299
2 NF
80
50
37
33
38
52
W
174
87
44
22
11
6
HBL
116
75
59
58
72
109
.1 NF
29786 16776 10437 7604 17340 7830
W
53681 26840 13420 6710 3355 1678
0.05
HBL 59233 33380 20791 15172 13713 15686
1039
598
385
295
282
341
0.01 .5 NF
W
1999
9999
500
250
125
63
HBL 2010
1158
888
578
559
683
1 NF
226
134
91
74
77
99
W
469
234
117
59
30
15
HBL
417
248
169
141
149
196
2 NF
48
31
24
22
27
39
W
109
55
28
14
7
4
HBL
76
49
38
38
47
71
Legend: NF = Formula (5.1); W = Formula (1.2); HBL
0.01
4
8
16
27913 22806 81795
2113 1056 528
55802 91199 163539
1290 2176 3975
78
39
20
2555 4318 7895
403
1704 1309
18
9
5
777
1363 2546
179
328
627
4
2
1
308
570
548
20228 17959 59136
1546
773
387
40654 66441 119141
913
1531 2785
58
29
15
1863 3147 5753
276
476
881
14
7
4
568
995 1856
114
206
390
3
2
1
226
417
400
16635 27744 48555
1280
640
320
33557 54841 98340
739
1233 2237
49
24
12
1538 2599 4750
218
373
687
12
6
3
469
822 1533
85
152
287
3
2
1
187
345
331
11001 45618 32181
839
420
210
22067 36080 64697
500
840 1529
31
16
8
1012 1709 3124
153
264
489
8
4
2
308
540 1008
64
117
222
2
1
1
123
226
217
= Formula (1.4).
Sample Size Determination in Logistic Regression
Contd. Table 7:
eγ0
1
16
A
α
β
1
8
1
4
1
2
1
2
.1 NF
54144 30544 19055 13939 6861 14494
W
98094 49047 24524 12262 6131 3066
HBL 109076 61466 38283 27936 25250 28883
1797
1043
682
534
524
648
0.01 .5 NF
W
3561
1781
891
445
223
112
HBL 3698
2130 1563 1062 1027 1255
1 NF
370
224
157
135
146
197
W
815
408
204
102
51
26
HBL
765
454
309
257
271
359
2 NF
72
49
41
43
55
85
W
185
93
46
23
12
6
HBL
136
88
68
67
84
129
.1 NF
37437 21099 13142 9591 5263 9916
W
67616 33808 16904 8452 4226 2113
0.05
HBL 74852 42181 26272 19172 17329 19822
1279
738
479
370
358
436
0.05 .5 NF
W
2491
1246
623
312
156
78
HBL 2539
1463 1101
730
706
862
1 NF
272
163
112
93
98
129
W
578
289
145
73
36
18
HBL
526
313
213
177
187
247
2 NF
56
37
29
29
35
53
W
133
67
34
17
9
4
HBL
95
61
47
47
59
89
Legend: NF = Formula (5.1); W = Formula (1.2); HBL
0.05
4
8
16
20443 18135 60043
1553
767
383
40651 66438 119138
967
1642 3009
56
28
14
1860 3144 5750
312
550 1029
13
7
3
565
992 1853
148
275
528
3
2
1
223
414
399
13955 13844 40891
1057
529
264
27897 45593 81757
645
1088 1988
39
20
10
1278 2159 3947
202
352
655
9
5
3
389
682 1273
90
164
314
2
1
1
154
285
274
= Formula (1.4).
Table 8 : Power analysis for given sample sizes (N ) and Nominal size α = 0.05
in logistic regression for testing H0 : γ1 = 0 against H1 : γ1 = 0.5,
when X ∼ Bernoulli (π = 0.5).
Nominal
power
γ0
log
log
log
log
log
log
log
log
log
(1/16)
(1/8)
(1/4)
(1/2)
(1)
(2)
(4)
(8)
(16)
New formula (5.1)
90%
95%
N
1039
598
385
295
282
341
500
840
1529
Actual
Power
0.9160
0.9202
0.9231
0.9163
0.9129
0.9142
0.9137
0.9152
0.9197
N
1279
738
479
370
358
436
645
1088
1988
Actual
Power
0.9512
0.9472
0.9492
0.9501
0.9471
0.9529
0.9506
0.9508
0.9497
Hsieh, Bloch, & Larsen formula (1.4)
90%
95%
N
2010
1158
888
578
559
683
1012
1709
3124
Actual
Power
0.9868
0.9884
0.9945
0.9893
0.9867
0.9976
0.9859
0.9861
0.9881
N
2539
1463
1101
730
706
862
1278
2159
3947
Actual
Power
0.9949
0.9958
0.9971
0.9950
0.9945
0.9952
0.9957
0.9955
0.9965
73
74
M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng
Table 9 : Power analysis for given sample sizes (N ) and Nominal size α = 0.05
in logistic regression for testing H0 : γ1 = 0 against H1 : γ1 = 1,
when X ∼ Bernoulli (π = 0.5).
Nominal
power
γ0
log
log
log
log
log
log
log
log
log
(1/16)
(1/8)
(1/4)
(1/2)
(1)
(2)
(4)
(8)
(16)
New formula (5.1)
90%
95%
N
226
134
91
74
77
99
153
264
489
Actual
Power
0.9193
0.9245
0.9246
0.9194
0.9227
0.9242
0.9278
0.9239
0.9269
N
272
163
112
93
98
129
202
352
655
Actual
Power
0.9507
0.9536
0.9501
0.9544
0.9524
0.9557
0.9615
0.9589
0.9630
Hsieh, Bloch, & Larsen formula (1.4)
90%
95%
N
417
248
169
141
149
196
308
540
1008
Actual
Power
0.9887
0.9878
0.9861
0.9864
0.9885
0.9897
0.9885
0.9904
0.9912
N
526
313
213
177
187
247
389
682
1273
Actual
Power
0.9964
0.9953
0.9957
0.9951
0.9956
0.9960
0.9947
0.9960
0.9962
References
Elashoff, J. D. (2005). nQuery Advisor Release 6.0, Sample Size and
Power Determination. Boston, MA, USA: Statistical Solutions Ltd.
Hintze, J. H. (2005). PASS USERS GUIDE-III, Power Analysis and
Sample Size Determination. Kaysville, Utah, USA: NCSS.
Hsieh, F.Y. (1989). Sample size tables for logistic regression. Statistics
in Medicine, 8, 795-802.
Hsieh, F.Y., Bloch, D. A., and Larsen, M.D. (1998). A simple
method of sample size calculation for linear and logistic regression.
Statistics in Medicine, 17, 1623-1634.
Rosner, B. (2000). Fundamentals of Biostatistics. California, USA:
Duxbury.
Self, S.G., and Mauritsen, R.H. (1988). Power/sample size calculations for generalized linear models. Biometrics, 44, 79-86.
Self, S.G., Mauritsen, R.H., and Ohara, J. (1992). Power calculations for likelihood ratio tests in generalized linear models. Biometrics,
48, 31-39.
Whittemore, A. (1981). Sample size for logistic regression with small
response probability. Journal of the American Statistical Association,
76, 27-32.
Sample Size Determination in Logistic Regression
M. Khorshed Alam
Center for Genome Information
Department of Environmental Health
University of Cincinnati Medical Centre
OH 45267
U.S.A.
E-mail: [email protected]
75
M. Bhaskara Rao
Center for Genome Information
Department of Environmental Health
University of Cincinnati Medical Centre
OH 45267
U.S.A.
E-mail: [email protected]
Fu-Chih Cheng
Department of Statistics
North Dakota State University
Fargo, ND 58102
U.S.A.
E-mail: [email protected]
Paper received November 2007; revised November 2008.