Sankhy¯ a : The Indian Journal of Statistics 2010, Volume 72-B, Part 1, pp. 58-75 c 2010, Indian Statistical Institute Sample size determination in logistic regression M. Khorshed Alam University of Cincinnati Medical Centre, USA M. Bhaskara Rao University of Cincinnati Medical Centre, USA Fu-Chih Cheng North Dakota State University, USA Abstract Whittemore (1981) and Hsieh et al. (1998) have proposed different methods for determining sample size in the context of testing the significance of a slope parameter in logistic regression. Their sample size formulas have been incorporated in some statistical software packages. In this paper, we use a variation of Whittemore (1981) method to calculate sample size. We compare these three sample size formulas to assess closeness of the nominal and observed powers via simulations. These studies lead us to propose another method, which is a combination of Hsieh et al. (1998) method and the proposed variation of the Whittemore (1981) method for calculating sample size when the covariate is normally distributed. However, when the covariate has a Bernoulli distribution, sample size calculations widely diverge between the Hsieh et al. (1998) method and the proposed method. Interestingly, the proposed method gives a better account of power than the Hsieh et al. (1998) method. AMS (2000) subject classification. Primary 62J12; Secondary 62F03, 62Q05. Keywords and phrases. Covariates, Logistic Regression, Sample Size, Power, Size, Simulations. 1 Introduction Logistic regression is ubiquitous in many epidemiological studies, in which a binary response variable Y , representing disease status, is modeled as a function of some risk factors. We will focus only on a single risk factor (X), which is assumed to have a specific known distribution. The logistic regression model is given by P (Y = 1|X) = eγ0 +γ1 X = 1 − P (Y = 0|X) 1 + eγ0 +γ1 X Sample Size Determination in Logistic Regression 59 for some unknown parameters γ0 and γ1 . We want to test the validity of the null hypothesis H0 : γ1 = 0 against the alternative H1 : γ1 = A (specified) > 0 based on a random sample (Y1 , X1 ), (Y2 , X2 ), . . . , (YN , XN ) of size N on (Y, X). Normally, a test based on the asymptotic theory of the maximum likelihood estimator γˆ1 of γ1 is used for testing H0 against H1 . The basic question tackled in this paper is what should be the sample size N so that the asymptotic test has a given size α and power 1 − β? In a typical sample size calculation, three ingredients are essential: size (α), power (1 − β), and specific alternative value of the parameter of interest (γ1 = A). In the context of logistic regression considered here, γ0 is a nuisance parameter. There are two ways to tackle the nuisance parameter γ0 . 1 Assume that γ0 is known. 2 Estimate γ0 under the null hypothesis and then use the asymptotic theory of the resultant score statistic or use the asymptotic theory of the likelihood ratio test statistic for the calculation of sample size. Approach 1 has been pursued by Whittemore (1981) and Hsieh et al. (1998). Approach 2, which has been pursued by Self and Mauritsen (1988) and Self et al. (1992), is complicated and iterative without an explicit formula, and it will not be pursued here. Whittemore (1981) has made an additional assumption, namely small response probability, in her calculations. The assumption of small response probability means that 1 + eγ0 +γ1 X ∼ = 1 for likely X. Technically, if X has a standard normal distribution, small response probability means that 2 γ1 E(1 + eγ0 +γ1 X ) = 1 + eγ0 + 2 ∼ = 1. The small response probability condition puts severe restrictions on both γ0 and γ1 . She has tabulated sample sizes for each of the following cases of distribution of X : standard normal; standard exponential; Poisson (γ = 1); and Bernoulli; for specified size α, power 1−β, and γ1 = A. When X has a standard normal distribution, the sample size is given by A2 (Zα + e− 4 N= eγ0 A2 Zβ 2 ) (1.1) where Zα is the upper 100* αth percentile of the standard normal distribution. Whittemore (1981) offered a modification of (1.1) in order to ameliorate the assumption of small response probability. The modified formula is given 60 M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng by N= Zα + e− A2 4 eγ0 A2 Zβ 2 eγ0 ∗ 1 + 2 [1 + (1 + A2 ) e ∗ (1 + 2 eγ0 5A2 4 ] (1.2) Formula (1.2) is incorporated in the software nQuery (See nQuery Advisor, Release 6.0, Appendix 7-19). Sample size calculations are not available in nQuery for other distributions of X. The software warns the user that the formula can only be used if 0.4 ≤ eA ≤ 2.5, or equivalently −0.916 ≤ A ≤ 0.916. The role of γ0 is not addressed. However, in the actual usage of the software, γ0 is restricted to the interval [− ln(99), 0]. If the warning is ignored the sample size numbers are outlandish if formula (1.2) is used directly. Hsieh (1989) has tabulated sample sizes from a different angle using the Whittemore formula. This version is incorporated in nQuery, Sections 18-4 to 18-11. We will now review Hsieh et al. (1998) method. The critical idea here is that the logistic regression problem can be viewed as a two-sample problem. The following is a chain of ideas they presented. Assume that X has a standard normal distribution. Let µ1 = E(X|Y = 1), µ2 = E(X|Y = 0), σ12 = V ar(X|Y = 1) and σ22 = V ar(X|Y = 0). If γ1 = 0, the random variables X and Y are independently distributed. Consequently, the conditional distributions of X|Y = 1 and X|Y = 0 and the distribution of X are all identical. Therefore, the hypothesis H0 : γ1 = 0 implies H0 : µ1 = µ2 . The logic was that if the conditional distributions were indeed normal, the original testing problem can be brought under the purview of a two-sample t-test. However, there is a caveat. If H0 : γ1 = 0 is not true, then µ1 6= µ2 and σ12 6= σ22 , and in fact, the conditional distributions are non-normal. Hsieh et al. (1998) assumed that σ12 = σ22 = σ 2 say, under the alternative hypothesis µ −µ and γ1 = A ∼ = 1 σ 2 , and then using the framework of a two-sample problem (see Rosner, 2000, p. 384), they came up with the following sample size formula (Zα + Zβ )2 N= ∗ , (1.3) P (1 − P ∗ )A2 γ e 0 where P ∗ = 1+e γ0 . This formula is simpler than (1.2). It is incorporated in the software PASS (2005). One crucial advantage of Hsieh et al. (1998) method over the Whittemore (1981) method is that the assumption of small response probability is not used in the calculations. In the same vein, Hsieh Sample Size Determination in Logistic Regression 61 et al. (1998) also developed a sample size formula when the covariate X is binary, i.e., it has a Bernoulli distribution (π). The formula is given by q q 2 P (1−P ) 1−π Zα + Zβ P1 (1 − P1 ) + P2 (1 − P2 ) π π (1.4) N= (P1 − P2 )2 (1 − π) where P1 = eγ0 1+eγ0 , P2 = eγ0 +A 1+eγ0 +A and P = (1 − π) P1 + πP2 . This formula is also incorporated in nQuery and PASS. There is one critical concern about these formulas in that they are very sensitive to the choice of γ0 . A small variation in γ0 leads to a big difference in sample sizes. Therefore, it is imperative to conduct a pilot study in order to have some idea about the intercept γ0 . Hsieh et al. (1998) invoked the standard two-sample framework for calculating sample sizes. However, the conditions that are to be met for the applicability of the two-sample paradigm do not carry verbatim to the framework considered by Hsieh et al. (1998). Standard Two-sample framework 1. Populations: X ∼ (µ1 , σ 2 ) Y ∼ (µ2 , σ 2 ) H 0 : µ1 = µ2 2. Hypotheses: H0 : (µ1 , µ2 ) H 1 : µ1 > µ2 3. Distributions are normal with a common variance whatever the means are. Hsieh, Block and Larsen Set-up 1. Populations: Conditional distributions X|Y = 1 and X|Y = 0. The covariate X has a normal distribution. 2. Hypotheses: H0 : γ1 = 0 H1 : γ1 > 0 3. Distributions X|Y = 1 and X|Y = 0 are normal when only γ = 0. Distributions are non-normal if γ1 6= 0. Further, they have different variances. Even though there is a substantial divergence between these two sets of conditions, Hsieh et al. (1998) forged ahead and derived a sample size formula exploiting the standard two-sample paradigm. Surprisingly, it works to a large extent as we shall see later. However, the calculations fail badly when the covariate X is Bernoulli. In this paper, we propose a variation of the Whittemore (1981) method and derive an explicit formula for the sample size. Its exact computation requires numerical integration or summation, which is easy to incorporate in any software. 62 M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng In Section 2, we present the proposed variation of the Whittemore (1981) method and derive a formula (Formula (2.4) in Section 2) for sample size. We also present sample size tables contrasting the three formulas when the covariate X has a standard normal distribution. In Section 3, we present our simulation work for comparing nominal and actual sizes and powers under different sample size formulas. There are some issues that stem out of the simulation study. Both Hsieh et al. (1998) method and the proposed variation do not seem to meet the nominal levels of power for a certain range of parameter values. We, however, come up with a remedy by suggesting the average of the sample sizes given by formulas (1.3) and (2.4). Simulations are conducted using the new sample sizes in order to compare nominal and observed powers. The proposed remedy seems to work. In Section 4, sample sizes are calculated using the proposed method for other distributions of X. In Section 5, the Bernoulli case is dealt separately and we observe that samples size calculations widely diverge between the proposed method and Hsieh et al. (1998) method. Using simulations, we demonstrate that the proposed method gives a better account of the nominal power. Finally, in Section 6, a discussion of the methods is carried out. 2 Variation Let X be any covariate with mean zero and variance one. The case of binary X is dealt with separately. The data consist of N independent realizations of (Y1 , X1 ), (Y2 , X2 ), . . . , (YN , XN ) of (Y, X). The conditional likelihood of the data is given by 1−Yi γ0 +γ1 Xi Yi 1 e N L = L(γ1 ) = Πi=1 1 + eγ0 +γ1 Xi 1 + eγ0 +γ1 Xi h i ∂2 2 eγ0 +γ1 X Note that −E ∂γ = N I(γ1 ), say. 2 ln L = N E X 1+eγ0 +γ1 X 1 Let γˆ1 be the maximum likelihood estimator of γ1 . The asymptotic variance 1 . of γˆ1 is N I(γ 1) Note that I(0) = EX 2 and eγ0 eγ0 = (1 + eγ0 )2 (1 + eγ0 )2 " I(A) = E X 2 eγ0 +AX (1 + eγ0 +AX )2 # (2.1) (2.2) Sample Size Determination in Logistic Regression 63 The large sample test (Wald test) for testing H0 : γ1 = 0 is built on the following statistic, p γˆ1 = γˆ1 N I(0) Z= (2.3) SE(ˆ γ1 )H0 Asymptotically, Z has a standard normal distribution underH0 . An α-level one-sided test is given by: Reject H0 in favor of H1 : γ1 > 0 if and only if Z > Zα , where Zα (critical value) is the 100α% upper percentile of the standard normal distribution. Let 1−β be the power decreed and γ1 = A > 0 the specified alternative. Set p 1 − β = P r γˆ1 N I(0) > Zα |H1 ! p p N I(A) Zα |γ1 = A = P r (ˆ γ1 − A + A) N I(A) > p N I(0) ! p p p N I(A) = P r (ˆ γ1 − A) N I(A) > p Zα − A N I(A) |γ1 = A N I(0) p Under H1 : γ1 = A, (ˆ γ1 − A) N I(A) has a standard normal distribution, asymptotically. √ p N I(A) Set −Zβ = √ Zα − A N I(A). Then N I(0) N= √Zα I(0) Zβ +√ A2 I(A) 2 = Zα (1+eγ0 ) √ γ e 0 Zβ +√ A2 I(A) 2 . (2.4) Structurally, the formula for N given above is the same as the one given by Whittemore (1981). In our formula (2.4), I(A) is obtained numerically. The value of I(A) depends on the underlying distribution of X. We have used a combination of FORTRAN and Mathematica for numerical computations. The complete code is available on request. We have tabulated (Table 1) the required sample sizes along with those given by Whittemore (1981) using formula (1.2) and those given by Hsieh et al. (1998) using Formula (1.3). 64 M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng Table 1: Sample Size (N) calculations in logistic regression for different values of event rate (eγ0 ). The covariate has a standard Normal distribution. eγ0 α β A .1 0.01 0.01 .5 1 2 .1 0.01 0.05 .5 1 2 .1 0.05 0.01 .5 1 2 .1 0.05 0.05 .5 1 2 NF W HBL NF W HBL NF W HBL NF W HBL NF W HBL NF W HBL NF W HBL NF W HBL NF W HBL NF W HBL NF W HBL NF W HBL NF W HBL NF W HBL NF W HBL NF W HBL 1 16 1 8 1 4 1 2 1 2 4 8 16 38908 39030 39101 1411 1524 1565 306 428 392 77 2791 98 28369 28452 28486 1047 1123 1140 233 325 285 59 2369 72 28320 28427 28486 1010 1099 1140 213 299 285 53 1724 72 19451 19522 19548 706 763 782 153 214 196 39 1396 49 21853 21588 21919 829 880 877 198 291 220 58 2771 55 15928 15737 15968 610 648 639 147 221 160 42 2351 40 15912 15723 15968 598 634 639 142 203 160 43 1711 40 10925 10798 10958 414 440 439 99 146 110 29 1386 28 13526 13039 13530 544 554 542 146 222 136 51 2759 34 9855 9505 9857 396 408 395 105 169 99 35 2342 25 9854 9497 9857 396 400 395 108 155 99 40 1704 25 6762 6522 6764 272 277 271 73 111 68 26 1380 17 9766 8679 9742 415 391 390 123 188 98 50 2754 25 7112 6326 7097 299 288 284 86 143 71 33 2341 18 7118 6321 7097 306 282 284 93 132 71 40 1704 18 4883 4341 4870 208 196 195 62 94 49 25 1377 13 8692 6541 8660 378 309 347 117 171 87 50 2752 22 6328 4769 6309 272 228 253 81 130 64 33 2329 16 6336 4765 6309 280 223 253 89 120 64 41 1697 16 4346 3272 4328 189 155 174 59 86 44 25 1382 11 9766 5451 9742 415 269 390 123 163 98 50 2761 25 7112 3974 7097 299 198 284 86 124 71 33 2327 18 7118 3971 7097 306 194 284 93 114 71 40 1696 18 4883 2727 4870 208 135 195 62 82 49 25 1370 13 13526 4912 13530 544 248 542 146 158 136 51 2752 34 9855 3580 9857 396 183 395 105 121 99 35 2335 25 9854 3577 9857 396 179 395 108 111 99 40 1700 25 6762 2457 6764 272 124 271 73 79 68 26 1378 17 21853 4642 21919 829 238 877 198 156 220 58 2747 55 15928 3384 15968 610 175 639 147 119 160 42 2339 40 15912 3381 15968 598 172 639 142 109 160 43 1696 40 10925 2322 10959 414 119 439 99 78 110 29 1374 28 38908 4507 39101 1411 233 1565 306 155 392 77 2747 98 28369 3285 28486 1047 172 1140 233 118 285 59 2330 72 28320 3283 28486 1010 168 1140 213 108 285 53 1704 72 19451 2254 19548 706 117 782 153 78 196 39 1374 49 Legend: NF = Formula (2.4); W = Formula (1.2); HBL = Formula (1.3). Sample Size Determination in Logistic Regression 65 Some comments are in order on Table 1. In our new formula (2.4), sample size N will remain the same for eγ0 and e−γ0 . In the case of the Whittemore formula (1.2), the sample sizes are different. As one can see, for large values of eγ0 and A, the small response probability condition in formula (1.2) is breaking down leading to unreasonable sample sizes in comparison with those provided by formulas (1.3) and (2.4). Even if we follow the advice of nQuery that 0.4 ≤ eA ≤ 2.5, the sample sizes are still uncomfortable. We have calculated required sample sizes using formula (1.2) for the case α = 0.05, 1 1 1 1 , 8 , 4 , 2 , and 1. The numbers for γ0 and A fall 1 − β = 0.90, and eγ0 = 16 within the guidelines. Sample sizes are also calculated using formulas (1.3) and (2.4) for the same specifications. The numbers are tabulated below. Sample Sizes eγ0 Formula (1.2) (1.3) (2.4) 1 16 1 8 1 4 1 2 206 185 152 137 104 95 103 64 67 80 46 55 1 77 41 52 As one can notice, formula (1.2) demands much larger sample size than the other two. 3 Observed Power Simulations are conducted to examine how ours, Whittemore (1981), and Hsieh et al. (1998) sample sizes are achieving nominal size, α, and nominal power, 1 − β. We consider two cases: H0 : γ1 = 0 versus H1 : γ1 = 0.5 and versus H1 : γ1 = 1.0. For each choice for eγ0 (listed in Table 2), α = 0.05, and each choice of β(0.10, 0.05), sample size N is calculated using formulas (1.2), (1.3), and (2.4). The input for simulations are α, β, γ0 and γ1 . If we use the test statistic (2.3) to calculate observed size and power, they will be very close to nominal size and power for sample sizes calculated as per formula (2.4). We would like to see how close the observed and nominal powers are when in (2.3) γ0 is replaced by the maximum likelihood estimator γˆ0 . More specifically, in simulations we use the following test: Reject H0 in favor √ γ ˆ0 Nγ ˆ1 e 2 (1+eγˆ0 ) > Zα to calculate observed size and power. Simulation of H1 if results are reported in Tables 2 and 3. 66 New formula (2.4) Nominal power γ0 log log log log log log log log log (1/16) (1/8) (1/4) (1/2) (1) (2) (4) (8) (16) 90 % N 566 330 215 163 148 163 215 330 566 Whittemore Modified formula (1.2) 90 % 95 % 95 % Actual Power Size 0.8841 0.0477 0.8944 0.0509 0.9028 0.0586 0.9031 0.0558 0.9136 0.0547 0.9111 0.0518 0.9066 0.0571 0.8878 0.0482 0.8966 0.0480 N 706 414 272 208 189 208 272 414 706 Actual Power Size 0.9395 0.0473 0.9443 0.0472 0.9530 0.0516 0.9576 0.0529 0.9605 0.0523 0.9562 0.0534 0.9500 0.0523 0.9424 0.0505 0.9345 0.0494 N 608 351 221 156 124 108 99 95 93 Actual Power Size 0.9016 0.0534 0.9080 0.0491 0.9097 0.0475 0.8981 0.0586 0.8641 0.0541 0.7982 0.0591 0.6377 0.0549 0.4531 0.0477 0.2608 0.0356 N 763 440 277 196 155 135 124 119 117 Actual Power Size 0.9533 0.0462 0.9527 0.0529 0.9498 0.0530 0.9470 0.0531 0.9221 0.5308 0.8585 0.0572 0.7190 0.0573 0.5369 0.0516 0.3252 0.0422 Hsieh, Bloch, & Larsen formula (1.3) 90 % 95 % N 619 347 215 155 138 155 215 347 619 Actual Power Size 0.9083 0.0492 0.9090 0.0515 0.9056 0.0517 0.8969 0.0557 0.8937 0.0570 0.8982 0.0519 0.9042 0.0544 0.9073 0.0489 0.9095 0.0489 N 782 439 271 195 174 195 271 439 782 Actual Power Size 0.9549 0.0465 0.9516 0.0480 0.9485 0.0524 0.9446 0.0565 0.9414 0.0502 0.9430 0.0543 0.9488 0.0518 0.9552 0.0514 0.9573 0.0436 Table 3 : Calculated sample sizes N, actual power, and size in logistic regression for testing H0 : γ1 = 0 against H1 : γ1 = 1.0. Nominal size = α = 0.05 New formula (2.4) Nominal power γ0 log log log log log log log log log (1/16) (1/8) (1/4) (1/2) (1) (2) (4) (8) (16) 90 % N 125 80 57 48 45 48 57 80 125 Actual Power Size 0.8729 0.0393 0.8916 0.0468 0.9122 0.0627 0.9262 0.0676 0.9343 0.0654 0.9311 0.0652 0.9040 0.0585 0.8935 0.0485 0.8765 0.0429 Whittemore Modified formula (1.2) 90 % 95 % 95 % N 153 99 73 62 59 62 73 99 153 Actual Power Size 0.9217 0.0404 0.9390 0.0522 0.9636 0.0559 0.9694 0.0606 0.9749 0.0599 0.9696 0.0628 0.9570 0.0546 0.9433 0.0472 0.9246 0.0460 N 175 199 91 77 70 67 65 64 64 Actual Power Size 0.9506 0.0427 0.9664 0.0508 0.9823 0.0579 0.9886 0.0600 0.9872 0.0604 0.9795 0.0621 0.9362 0.0571 0.8209 0.0511 0.5716 0.0299 N 214 146 111 94 86 82 79 78 78 Actual Power Size 0.9818 0.0472 0.9863 0.0489 0.9926 0.0532 0.9957 0.0566 0.9955 0.0611 0.9913 0.0602 0.9632 0.0541 0.8844 0.0496 0.6666 0.0316 Hsieh, Bloch, & Larsen formula (1.3) 90 % 95 % N 155 87 54 39 35 39 54 87 155 Actual Power Size 0.9326 0.0426 0.9113 0.048 0.8932 0.0644 0.8803 0.0699 0.8777 0.0682 0.8828 0.0686 0.8907 0.0627 0.9205 0.0491 0.9335 0.0417 N 196 110 68 49 44 49 68 110 196 Actual Power Size 0.9675 0.0464 0.9618 0.0505 0.9468 0.0568 0.9344 0.0631 0.9312 0.0685 0.9332 0.0646 0.9466 0.0599 0.9572 0.0515 0.9695 0.0446 M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng Table 2 : Calculated sample sizes N, actual power, and size in logistic regression for testing H0 : γ1 = 0 against H1 : γ1 = 0.5. Nominal size = α = 0.05 Sample Size Determination in Logistic Regression 67 From Tables 2 and 3, it is clear that the Whittemore formula (1.2) is not giving actual powers close to the nominal one as we move away from small γ0 to large γ0 . This study provides a clear warning to what happens if we ignore the small response probability condition. New formula (2.4) and formula (1.3) are also not up to the mark. For small and large values of γ0 , New formula (2.4) is giving significantly lower powers than the nominal ones. For middle values of γ0 formula (1.3) is giving significantly lower powers than the nominal ones. This is because the formulas are derived by assuming γ0 is known, whereas in simulations estimated γ0 is used. In the case of formula (1.3), the assumption that 2 σ12 = σ22 = σ 2 and A ∼ = µ1 −µ σ , which are not valid, might be playing a role in lower observed powers. For example, when γ0 = 3, γ1 = A = 1, µ1 = 0.05621, µ2 = 0.75642, σ12 = 0.95000, and σ22 = 1.05810. If we take σ 2 to be µ −µ the average of σ12 and σ22 , we have A ∼ = 1 σ 2 = 0.80609 but A = 1. As a remedy for this problem, we suggest to take the average of the sample sizes given by formulas (1.3) and (2.4). With these sample sizes, we did simulations. The results are tabulated in Table 4. We have now a clear indication that the new sample size calculation seems to be working. Table 4 : Calculated sample sizes N , actual power, and size in logistic regrssion for testing H0 : γ1 = 0 against H1 : γ1 = 1.0. Nominal size = α = 0.05. N = [New formula (2.4) + Hsieh, Block, & Larsen (1.3)]/2 Nominal Power 90% 95% γ0 N Actual N Actual Power Size Power Size log (1/16) 140 0.9071 0.0398 175 0.9550 0.0484 log (1/8 ) 84 0.9072 0.0498 105 0.9492 0.0469 56 0.9074 0.0585 71 0.9548 0.0548 log (1/4 ) 44 0.9088 0.0659 56 0.9521 0.0604 log (1/2 ) log (1) 40 0.9111 0.0681 52 0.9550 0.0671 44 0.9116 0.0637 56 0.9536 0.0610 log (2) log (4) 56 0.9064 0.0577 71 0.9544 0.0541 84 0.9012 0.0514 105 0.9537 0.0500 log (8) 140 0.9055 0.0440 175 0.9527 0.0453 log (16) 68 M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng 4 Other Distributions We look at three other cases for the distribution of the covariate X. 1. X has a standardized exponential distribution, i.e.,X = U − 1, where U has a standard exponential distribution (Exp(λ) ( with λ = 1). 2. X has a standardized Poisson distribution, i.e.,X = U − 1, where U has a Poisson distribution with mean unity. 3. X has a Bernoulli distribution (π). In Cases 1 and 2, the entity I(0), under H0 : γ1 = 0 remains the same as in (2.1), and the test statistic also remains the same as in (2.3). The formula for N structurally also remains the same as in (2.4). The only difference lies in the computation of I(A), which is extracted numerically from (2.2) using the assumed distribution of X. The sample sizes as per formula (2.4) are given in Tables 5, 6, and 7 along side the Whittemore(1981) numbers. In the case of Bernoulli distribution, we have an explicit formula for I(A) given by, # " eγ0 +A . (4.1) I(A) = π (1 + eγ0 +A )2 The Whittemore (1981) formula (1.2) can not be used for certain values of A in the exponential case as the integral involved in the formula does not exist. The symbol (xx) in Table 5 indicates such a contingency. Hsieh et al. (1998) have not considered these distributions. 5 The Bernoulli Case In the case of Bernoulli X with success probability π, we have calculated required sample sizes using formula (1.4) of Hsieh et al. (1998) and the following formula stemming from our proposed method: N= √Zα I(0) Zβ +√ A2 where I(A) is given by (2.2). I(A) 2 = γ0 Zα √(1+eγ ) π∗ e 0 Zβ +√ A2 I(A) 2 , (5.1) Sample Size Determination in Logistic Regression 69 The sample sizes differ considerably with formula (1.4) giving higher numbers. We conducted simulations (10,000 times) to check which sample size provides a good account of the nominal power. We have tabulated the sample sizes in Table 7 and simulation work in Tables 8 and 9. From Tables 8 and 9, it is clear that formula (1.4) gives more power than what we have sought with larger sample size it provides. 6 Discussion Sample size formulas are available in the literature in the context of logistic regression with a single covariate. Formula (1.2) has been derived by Whittemore (1981) under the assumption of small response probability. Formula (1.3) has been derived by Hsieh et al. (1998) by formulating the hypothesis testing problem as a two-sample problem. In this paper, we propose a new way for calculating sample size given by formula (2.4). We tabulate below the approaches pursued by the three contributors in this context. Whittemore (1981) Maximum Likelihood Small response probability Hsieh, Block and Larsen (1998) Method Two-sample problem Assumption The conditional distributions of X|Y = 1 and X|Y = 0 have normal distributions with equal variance under H1 (Not true) New Approach Maximum Likelihood None We compare all three formulas via simulations to see how close the observed and nominal powers are. When the covariate is Bernoulli, the new method provides a better account of nominal power than the method proposed by Hsieh et al. (1998). Unlike Hsieh et al. (1988) method, which has limitations on X, our method is applicable whatever may be the distribution of the covariate X. 70 M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng Table 5 : Sample Size (N) calculations in logistic regression for different values of event rate (eγ0 ). The covariate has a standard Exponential distribution. eγ0 α β .1 0.01 0.01 .5 1 2 .1 0.01 0.05 .5 1 2 .1 0.01 0.01 .5 1 2 .1 0.05 0.01 .5 1 2 .1 0.05 0.05 .5 1 2 .1 0.05 0.10 1 16 A .5 1 2 NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W 1 8 25658 14606 31171 15585 787 492 732 366 192 130 xx xx 58 44 xx xx 20227 11478 23135 11567 657 401 604 302 192 130 xx xx 58 44 xx xx 17595 9965 19340 9670 592 356 541 271 146 93 xx xx 41 29 xx xx 17218 9835 22294 11147 495 318 467 233 121 86 xx xx 38 31 xx xx 12827 7302 15590 7795 393 246 366 183 96 65 xx xx 29 22 xx xx 10750 6106 12506 6253 344 211 317 159 85 56 xx xx 25 18 xx xx Legend: NF = 1 4 1 2 9245 6898 7739 3896 349 297 183 92 102 96 xx xx 39 42 xx xx 7229 5355 5784 2892 277 228 151 76 102 96 xx xx 39 42 xx xx 6255 4613 4835 2418 241 194 135 68 67 58 xx xx 23 22 xx xx 6259 4705 5574 2787 234 296 117 58 71 70 xx xx 30 34 xx xx 4622 3449 3898 1949 175 149 92 46 51 48 xx xx 20 21 xx xx 3851 2858 3126 1563 147 122 79 40 42 38 xx xx 16 16 xx xx Formula (2.4); 1 2 4 6391 7476 10696 1948 974 487 312 405 623 46 23 12 111 155 250 xx xx xx 55 83 142 xx xx xx 4922 5714 8129 1446 723 362 232 294 444 38 19 9 111 155 250 xx xx xx 55 83 142 xx xx xx 4217 4872 6905 1209 604 302 194 241 360 34 17 9 63 82 128 xx xx xx 27 38 62 xx xx xx 4398 5185 7462 1393 697 348 224 297 465 29 15 7 84 121 199 xx xx xx 45 71 122 xx xx xx 3195 3738 5347 974 487 244 156 203 312 23 11 6 56 78 125 xx xx xx 28 42 71 xx xx xx 2633 3063 4364 782 391 195 126 160 243 20 10 5 43 58 93 xx xx xx 20 30 50 xx xx xx W = Formula (1.2). 8 16 17661 244 1976 6 443 xx 260 xx 13375 181 758 5 443 xx 260 xx 11333 151 611 4 223 xx 111 xx 12368 174 810 4 357 xx 226 xx 8830 122 538 3 222 xx 130 xx 7186 98 416 3 163 xx 90 xx 31856 122 1988 3 832 xx 496 xx 24075 90 1393 2 832 xx 496 xx 20371 76 1119 2 414 xx 210 xx 22356 87 1505 3 674 xx 434 xx 15926 61 994 1 416 xx 248 xx 12943 49 765 1 304 xx 170 xx Sample Size Determination in Logistic Regression 71 Table 6 : Sample Size (N) calculations in logistic regression for different values of event rate (eγ0 ). The covariate X + 1 has a standard Poisson distribution. eγ0 α β A .1 0.01 0.01 .5 1 2 .1 0.01 0.05 .5 1 2 .1 0.01 0.01 .5 1 2 .1 0.05 0.01 .5 1 2 .1 0.05 0.05 .5 1 2 .1 0.05 0.10 1 16 .5 1 2 NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W NF W 1 8 1 4 26077 12527 7246 32875 16438 8219 506 302 197 1028 514 257 119 70 46 176 88 44 29 18 12 24 12 6 20489 10165 5974 24171 12085 6043 471 276 178 791 396 198 113 66 42 146 73 37 28 17 11 23 12 6 17786 9007 5331 20077 10039 5020 452 263 168 677 339 170 109 63 41 132 66 33 27 16 10 23 12 6 17562 8143 4636 23739 11870 5935 279 170 113 708 354 177 64 39 26 111 56 28 16 10 7 12 6 3 13037 6263 3623 16443 8222 4111 253 151 99 541 257 129 60 35 23 88 44 22 15 9 6 12 6 3 10899 5361 3134 13100 6550 3275 240 141 91 424 212 106 57 34 22 77 39 19 14 9 6 12 6 3 Legend: NF = Formula 1 2 1 2 4 4959 4137 4216 5361 4109 2055 1027 514 148 132 144 189 129 64 32 16 35 30 34 45 22 11 6 3 9 8 9 12 3 2 1 1 4128 3494 3681 4735 3021 1511 755 378 132 118 130 172 99 49 25 12 32 28 31 42 18 9 5 2 8 7 8 11 3 2 1 0 3716 3173 3384 4417 2510 1255 628 314 124 111 122 164 85 43 21 11 30 27 30 40 17 8 4 2 8 7 8 10 3 2 1 1 3132 2570 2591 3157 2968 1484 742 371 86 77 83 107 89 45 22 11 20 18 20 26 14 7 4 2 5 5 5 7 2 1 1 0 2479 2068 2136 2681 2056 1028 514 257 74 66 72 95 65 32 16 8 18 16 17 23 11 6 3 2 5 4 5 6 2 1 1 0 2163 1824 1911 2443 1638 819 410 205 68 61 67 88 53 27 14 7 16 15 16 22 10 5 3 1 4 4 4 6 2 1 1 0 (2.4); W = Formula (1.2). 8 16 7850 257 285 8 69 2 17 0 7112 189 465 6 65 1 16 0 6733 157 255 6 63 1 16 0 4478 186 158 6 38 1 10 0 3925 129 143 4 35 1 9 0 3645 103 135 4 33 1 9 0 12813 129 478 4 117 1 29 0 11881 94 451 3 111 1 28 0 11398 79 438 3 108 1 27 0 7095 93 258 3 63 1 16 0 6406 64 239 2 59 1 15 0 6053 51 229 2 56 1 14 0 72 M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng Table 7: Sample Size (N ) calculations in logistic regression for different values of event rate (eγ0 ). The covariate X has a Bernouli distribution (π = 0.5) eγ0 α β A 1 16 1 8 1 4 1 2 1 2 .1 NF 74884 42204 26287 19184 8669 19834 W 135188 67594 33979 16899 8450 4225 HBL 149727 84375 52552 38349 34662 36949 2558 1476 957 739 715 873 0.01 .5 NF W 4981 2490 1245 623 312 156 HBL 5078 2926 2202 1459 1412 1725 1 NF 544 326 223 186 196 258 W 1156 578 289 145 73 36 HBL 1052 625 426 354 373 494 2 NF 111 73 58 57 70 105 W 266 133 67 34 17 9 HBL 189 122 94 93 117 178 .1 NF 54966 30949 19247 14013 6859 14406 W 98917 49459 24730 12365 6183 3091 0.01 HBL 109079 61469 38326 27939 25253 28886 1932 1109 712 544 518 624 0.05 .5 NF W 3698 1849 925 462 231 116 HBL 3701 2133 1646 1065 1030 1258 1 NF 424 251 169 137 140 180 W 870 435 218 109 55 27 HBL 768 457 312 259 274 362 2 NF 91 58 43 41 48 69 W 204 102 51 26 13 7 HBL 139 90 70 69 87 131 .1 NF 45603 25661 15941 11588 10431 11866 0.01 W 81894 40947 20474 10237 5119 2559 HBL 90034 50737 31602 23062 20845 23843 1634 935 597 452 426 508 0.01 .5 NF W 3092 1546 733 387 193 97 HBL 3055 1761 1383 880 851 1039 1 NF 366 215 143 114 114 144 W 734 367 184 92 46 23 HBL 635 378 257 215 227 299 2 NF 80 50 37 33 38 52 W 174 87 44 22 11 6 HBL 116 75 59 58 72 109 .1 NF 29786 16776 10437 7604 17340 7830 W 53681 26840 13420 6710 3355 1678 0.05 HBL 59233 33380 20791 15172 13713 15686 1039 598 385 295 282 341 0.01 .5 NF W 1999 9999 500 250 125 63 HBL 2010 1158 888 578 559 683 1 NF 226 134 91 74 77 99 W 469 234 117 59 30 15 HBL 417 248 169 141 149 196 2 NF 48 31 24 22 27 39 W 109 55 28 14 7 4 HBL 76 49 38 38 47 71 Legend: NF = Formula (5.1); W = Formula (1.2); HBL 0.01 4 8 16 27913 22806 81795 2113 1056 528 55802 91199 163539 1290 2176 3975 78 39 20 2555 4318 7895 403 1704 1309 18 9 5 777 1363 2546 179 328 627 4 2 1 308 570 548 20228 17959 59136 1546 773 387 40654 66441 119141 913 1531 2785 58 29 15 1863 3147 5753 276 476 881 14 7 4 568 995 1856 114 206 390 3 2 1 226 417 400 16635 27744 48555 1280 640 320 33557 54841 98340 739 1233 2237 49 24 12 1538 2599 4750 218 373 687 12 6 3 469 822 1533 85 152 287 3 2 1 187 345 331 11001 45618 32181 839 420 210 22067 36080 64697 500 840 1529 31 16 8 1012 1709 3124 153 264 489 8 4 2 308 540 1008 64 117 222 2 1 1 123 226 217 = Formula (1.4). Sample Size Determination in Logistic Regression Contd. Table 7: eγ0 1 16 A α β 1 8 1 4 1 2 1 2 .1 NF 54144 30544 19055 13939 6861 14494 W 98094 49047 24524 12262 6131 3066 HBL 109076 61466 38283 27936 25250 28883 1797 1043 682 534 524 648 0.01 .5 NF W 3561 1781 891 445 223 112 HBL 3698 2130 1563 1062 1027 1255 1 NF 370 224 157 135 146 197 W 815 408 204 102 51 26 HBL 765 454 309 257 271 359 2 NF 72 49 41 43 55 85 W 185 93 46 23 12 6 HBL 136 88 68 67 84 129 .1 NF 37437 21099 13142 9591 5263 9916 W 67616 33808 16904 8452 4226 2113 0.05 HBL 74852 42181 26272 19172 17329 19822 1279 738 479 370 358 436 0.05 .5 NF W 2491 1246 623 312 156 78 HBL 2539 1463 1101 730 706 862 1 NF 272 163 112 93 98 129 W 578 289 145 73 36 18 HBL 526 313 213 177 187 247 2 NF 56 37 29 29 35 53 W 133 67 34 17 9 4 HBL 95 61 47 47 59 89 Legend: NF = Formula (5.1); W = Formula (1.2); HBL 0.05 4 8 16 20443 18135 60043 1553 767 383 40651 66438 119138 967 1642 3009 56 28 14 1860 3144 5750 312 550 1029 13 7 3 565 992 1853 148 275 528 3 2 1 223 414 399 13955 13844 40891 1057 529 264 27897 45593 81757 645 1088 1988 39 20 10 1278 2159 3947 202 352 655 9 5 3 389 682 1273 90 164 314 2 1 1 154 285 274 = Formula (1.4). Table 8 : Power analysis for given sample sizes (N ) and Nominal size α = 0.05 in logistic regression for testing H0 : γ1 = 0 against H1 : γ1 = 0.5, when X ∼ Bernoulli (π = 0.5). Nominal power γ0 log log log log log log log log log (1/16) (1/8) (1/4) (1/2) (1) (2) (4) (8) (16) New formula (5.1) 90% 95% N 1039 598 385 295 282 341 500 840 1529 Actual Power 0.9160 0.9202 0.9231 0.9163 0.9129 0.9142 0.9137 0.9152 0.9197 N 1279 738 479 370 358 436 645 1088 1988 Actual Power 0.9512 0.9472 0.9492 0.9501 0.9471 0.9529 0.9506 0.9508 0.9497 Hsieh, Bloch, & Larsen formula (1.4) 90% 95% N 2010 1158 888 578 559 683 1012 1709 3124 Actual Power 0.9868 0.9884 0.9945 0.9893 0.9867 0.9976 0.9859 0.9861 0.9881 N 2539 1463 1101 730 706 862 1278 2159 3947 Actual Power 0.9949 0.9958 0.9971 0.9950 0.9945 0.9952 0.9957 0.9955 0.9965 73 74 M. Khorshed Alam, M. Bhaskara Rao and Fu-Chih Cheng Table 9 : Power analysis for given sample sizes (N ) and Nominal size α = 0.05 in logistic regression for testing H0 : γ1 = 0 against H1 : γ1 = 1, when X ∼ Bernoulli (π = 0.5). Nominal power γ0 log log log log log log log log log (1/16) (1/8) (1/4) (1/2) (1) (2) (4) (8) (16) New formula (5.1) 90% 95% N 226 134 91 74 77 99 153 264 489 Actual Power 0.9193 0.9245 0.9246 0.9194 0.9227 0.9242 0.9278 0.9239 0.9269 N 272 163 112 93 98 129 202 352 655 Actual Power 0.9507 0.9536 0.9501 0.9544 0.9524 0.9557 0.9615 0.9589 0.9630 Hsieh, Bloch, & Larsen formula (1.4) 90% 95% N 417 248 169 141 149 196 308 540 1008 Actual Power 0.9887 0.9878 0.9861 0.9864 0.9885 0.9897 0.9885 0.9904 0.9912 N 526 313 213 177 187 247 389 682 1273 Actual Power 0.9964 0.9953 0.9957 0.9951 0.9956 0.9960 0.9947 0.9960 0.9962 References Elashoff, J. D. (2005). nQuery Advisor Release 6.0, Sample Size and Power Determination. Boston, MA, USA: Statistical Solutions Ltd. Hintze, J. H. (2005). PASS USERS GUIDE-III, Power Analysis and Sample Size Determination. Kaysville, Utah, USA: NCSS. Hsieh, F.Y. (1989). Sample size tables for logistic regression. Statistics in Medicine, 8, 795-802. Hsieh, F.Y., Bloch, D. A., and Larsen, M.D. (1998). A simple method of sample size calculation for linear and logistic regression. Statistics in Medicine, 17, 1623-1634. Rosner, B. (2000). Fundamentals of Biostatistics. California, USA: Duxbury. Self, S.G., and Mauritsen, R.H. (1988). Power/sample size calculations for generalized linear models. Biometrics, 44, 79-86. Self, S.G., Mauritsen, R.H., and Ohara, J. (1992). Power calculations for likelihood ratio tests in generalized linear models. Biometrics, 48, 31-39. Whittemore, A. (1981). Sample size for logistic regression with small response probability. Journal of the American Statistical Association, 76, 27-32. Sample Size Determination in Logistic Regression M. Khorshed Alam Center for Genome Information Department of Environmental Health University of Cincinnati Medical Centre OH 45267 U.S.A. E-mail: [email protected] 75 M. Bhaskara Rao Center for Genome Information Department of Environmental Health University of Cincinnati Medical Centre OH 45267 U.S.A. E-mail: [email protected] Fu-Chih Cheng Department of Statistics North Dakota State University Fargo, ND 58102 U.S.A. E-mail: [email protected] Paper received November 2007; revised November 2008.
© Copyright 2024