EP 521 Spring, 2004, Vol I, Part 5

§3. Sample Size Estimation

A key to study design is the sample size or "power" calculation, which is required of every grant proposal.

In this section:
(1) We begin with the theory behind power calculations and demonstrate how simple formulae for power and sample size are derived.
(2) Next, we show a unified treatment of power for RD, OR, and RR based on this theory.
(3) Then, we describe how varying the question being asked can have a substantial effect on the required sample size.
(4) We briefly explain the information needed for power calculations for matched-pair studies.
(5) We give some demonstrations of how to use and interpret software for power calculations.

Goals: to understand what affects power, how to define the problem, and how to get the computer to give you the answer you need.

§3.1 POWER IN GENERAL

Sample Size Estimation: Terminology Review

Null hypothesis (Ho): specified value for a parameter (OR, RR, RD, IRR, IRD, for example)
Alternative hypothesis (Ha): specified alternative value for a parameter
Type I error = Pr(reject Ho | Ho is true) = α
Type II error = Pr(fail to reject Ho | Ha is true) = β, i.e., Pr(fail to reject Ho | Ho is false)
Power = Pr(reject Ho | Ha is true) = 1 - β
1 - α = ?
("Pr" signifies probability over repetitions of the study.)

(References: Woodward, chap 8; Rothman and Greenland, pp. 184-8)

Notes:
(1) The α-level is not a p-value. The p-value is a quantity computed from, and varying with, the data. α is fixed and is specified without seeing the data.
(2) The p-value is not Pr(Ho vs Ha). It is loosely defined as Pr(test statistic as or more extreme than observed | Ho true).
(3) The p-value is not Pr(data | Ho). That is the likelihood. The likelihood is usually much smaller than the p-value, because the p-value includes not only Pr(data | Ho) but also Pr(all other more extreme data configurations | Ho).
(4) Absence of evidence is not evidence of absence.
Failing to reject Ho ≠ accepting Ho as true.
(5) Studies with low power to produce results with appropriately narrow confidence intervals (as defined by the purpose of the study) are not "negative studies"; they are "indeterminate".

An initial description of what we are doing will help.

[Figure: sampling distributions under H0 and Ha; horizontal axis marked at 0, 2, 3, and 4.]

Type I error (α): H0 is true but you will reject H0 in favor of Ha. Suppose that 2 is your threshold (critical value) for rejecting H0. Then, if H0 is true, you have only a very small chance of observing a value to the right of 2, and a large chance of observing something to the left of 2.

Type II error (β): If Ha is true, then you have a chance of observing a value to the left of 2, below the critical value, but it is not great. You have a much larger chance of observing a value to the right of 2. How big a chance you have of observing a value at 2 or to the right of 2, if Ha is true, depends upon how far Ha is from H0. If Ha is far away, then power is bigger and Type II error is smaller.

Now what happens when the sample size increases (or when the variance decreases)? The distributions become narrower. (This is the distribution of the mean, for example.) Holding everything else constant, what does that do to my power to detect a difference? At 2, I have little chance of falsely rejecting H0. This would be a very high critical value for rejecting H0. But if Ha is true, you have an almost certain chance of observing a value of at least 2, meaning that power is almost 1.0 and Type II error is almost 0.

[Figure: the same two sampling distributions, now narrower; vertical axis 0 to .8, horizontal axis 0 to 4.]

I can pick a vertical line (2, for example) to correspond to a Type I error. This is usually the case. Then I can posit what Ha is (3 or 4), and if the sample size tells me how broad the distributions of the effect size are under H0 and Ha, then I can estimate what Type II error and power will be.
Alternatively, I can specify Type I error and power (and thus Type II error) and estimate just how close Ha can be to H0 to achieve this level of power.

[Figure: overlapping sampling distributions with the Type I error and Type II error regions marked.]
From: Methods in Observational Epidemiology by J.L. Kelsey, A.S. Whittemore, Alfred S. Evans and W. Douglas Thompson, 1996, New York, Oxford University Press, p. 328.

Power calculations are based on the sampling distribution of the difference (means, proportions) of the groups being compared.

d = value of the "difference" [risk difference, log OR, difference in means, etc.] when the null is true (d = 0)
dc = critical value: the value of the difference that is just significantly different from d at significance level α
d* = value of the difference when the null is false

Some key numbers to remember on SS calcs (for purposes of this presentation):

Quantity          Interpretation                                        Value
Zα/2              Type I error of 0.05                                  1.96
Zβ                Type II error of 0.2 (80% power)                      +0.84
Zβ                Type II error of 0.1 (90% power)                      +1.28
(Zα/2 + Zβ)²      Used in SS calcs: Type I = 0.05, Type II = 0.2        7.85
(Zα/2 + Zβ)²      Used in SS calcs: Type I = 0.05, Type II = 0.1        10.5

Some texts refer to Zβ as Z1-β and to Zα/2 as Z1-α/2, and thus have slightly different formulae.

SO:
1. When the null is false (Ha is true), we are sampling from the distribution on the right. Values to the left of dc occur with probability β: the area to the left of dc, when d* is true, is the Type II error = Pr(failing to reject H0 | Ha is true), the probability of inappropriately failing to reject H0.
2. Values to the right of dc in the shaded area represent the probability, α/2, of rejecting H0 when we should fail to reject (since H0 is true).
3.
Values to the right of dc, forming part of the distribution of d*, represent the power of detecting a true difference: Pr(rejecting H0 | H0 false) = 1 - β.

By using the standard normal:

$d_c = d + Z_{\alpha/2}\,[\mathrm{se}(d)]$   (Eq 5.1)

and

$d_c = d^* - Z_\beta\,[\mathrm{se}(d^*)]$   (Eq 5.2)

where Zα/2 is the standard normal deviate corresponding to the position of dc on the distribution around d, and Zβ is the standard normal deviate corresponding to the position of dc on the distribution around d*.

e.g., β = 0.1 = Type II error, (1 - β) = 0.9, Z1-β = 1.28, Zβ = -1.28

Think in terms of flipping over the Ha distribution, so that we look at z's in Ha from right to left rather than the usual left to right.

[Figure: standard normal density with -1.28 and +1.28 marked on the x-axis, showing the positions of d, dc, and d*.]

Point: Use +1.28 for β = 0.1.

Then, setting Eq 5.1 = Eq 5.2 and solving for Zβ, we get:

$Z_\beta = \dfrac{d^* - d - Z_{\alpha/2}\,[\mathrm{se}(d)]}{\mathrm{se}(d^*)}$   (Eq 5.3)

Usually we assume se(d) = se(d*) and simplify:

$Z_\beta = \dfrac{d^* - d}{\mathrm{se}(d^*)} - Z_{\alpha/2}$   (Eq 5.4)

Note: Zβ can range from −∞ to +∞.

If, as is usual, d = 0, then

$Z_\beta = \dfrac{d^*}{\mathrm{se}(d^*)} - Z_{\alpha/2}$   (Eq 5.5)

What if d* = d = 0? Then Zβ = -Zα/2, and power is 0.025 (in each tail) for α = 0.05. (Makes sense: we reject only falsely.)

Using the simple Eq 5.5, we can arrive at a series of simple formulae for power and sample size calculations.
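Eq 5.4 and Eq 5.5 translate directly into a few lines of code. The following is a minimal sketch, not part of the original notes: the function name `power_from_difference` and its arguments are illustrative assumptions, and it relies only on the standard normal distribution from the Python standard library.

```python
from statistics import NormalDist

def power_from_difference(d_star, se, alpha=0.05, d_null=0.0):
    """Power via Eq 5.4 / Eq 5.5: Z_beta = (d* - d) / se(d*) - Z_{alpha/2}.

    Hypothetical helper for illustration; assumes se(d) = se(d*).
    """
    z = NormalDist()                         # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)       # about 1.96 for alpha = 0.05
    z_beta = (d_star - d_null) / se - z_alpha
    return z_beta, z.cdf(z_beta)             # power = Phi(Z_beta)

# d* = d = 0: Z_beta = -Z_{alpha/2}, so power equals alpha/2
print(power_from_difference(0.0, 1.0))
# d* equal to 2.8 standard errors gives power close to 80%
print(power_from_difference(2.8, 1.0))
```

The second call illustrates why (Zα/2 + Zβ) = 1.96 + 0.84 = 2.8 is the distance, in standard errors, between H0 and Ha that yields roughly 80% power.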
§3.2 Power and sample sizes in case control and cohort studies

Methods of Sampling and Estimation of Sample Size

Definitions of Symbols Used in Equations for Calculating Power and Required Sample Size

Symbol   Definition
d*       Non-null value of the difference in proportions or means (i.e., the magnitude of difference one wishes to detect)
n        In a cohort or cross-sectional study, the number of exposed individuals studied; in a case-control study, the number of cases
r        In a cohort or cross-sectional study, the ratio of the number of unexposed individuals studied to the number of exposed individuals studied; in a case-control study, the ratio of the number of controls studied to the number of cases studied
σ        Standard deviation in the population for a continuously distributed variable
p1       In a cohort study (or a cross-sectional study), the proportion of exposed individuals who develop (or have) the disease; in a case-control study, the proportion of cases who are exposed
p0       In a cohort study (or a cross-sectional study), the proportion of unexposed individuals who develop (or have) the disease; in a case-control study, the proportion of controls who are exposed
p̄        $\bar{p} = \dfrac{p_1 + r\,p_0}{1 + r}$ = weighted average of p1 and p0

(Ref: Kelsey et al. 1996, Table 12-11.)

So, when n is fixed by costs, time, etc., we can use power calculations.

Initial derivation of Eq 5.6 from Eq 5.5

Recall the variance of a difference in means (assuming independence):

$\mathrm{Var}(A - B) = \mathrm{Var}(A) + \mathrm{Var}(B)$

Assuming a common standard deviation:

$\mathrm{var}(d^*) = \sigma^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)$

Here, we know $n_2 = r \cdot n_1$, so

$\mathrm{var}(d^*) = \sigma^2\left(\dfrac{1}{n_1} + \dfrac{1}{r \cdot n_1}\right) = \sigma^2\,\dfrac{r+1}{r \cdot n_1}$

So,

$\mathrm{se}(d^*) = \sigma\sqrt{\dfrac{r+1}{r \cdot n_1}}$

Therefore, Zβ for a difference in means:

$Z_\beta = \dfrac{d^*}{\sigma}\sqrt{\dfrac{nr}{r+1}} - Z_{\alpha/2}$   (Eq 5.6)

Zβ for a difference in proportions:

$Z_\beta = \left[\dfrac{nr}{\bar{p}(1-\bar{p})(r+1)}\right]^{1/2} d^* - Z_{\alpha/2}$

$Z_\beta = \left[\dfrac{n\,(d^*)^2\,r}{(r+1)\,\bar{p}(1-\bar{p})}\right]^{1/2} - Z_{\alpha/2}$   (Eq 5.7)

Equivalent!
Recall $\mathrm{Var}(p) = \dfrac{p(1-p)}{n}$. Substitute $\sqrt{\bar{p}(1-\bar{p})}$ for σ in Eq 5.6 to obtain Eq 5.7.

Note: we have defined d* as the risk difference (RD). We can express RD in terms of either RR or OR and the baseline risk (p0).

For RR: $RR = \dfrac{p_1}{p_0}$, so $p_1 = p_0\,RR$ and

$d^* = p_0\,RR - p_0 = p_0\,(RR - 1)$

For OR: $OR = \dfrac{p_1/(1-p_1)}{p_0/(1-p_0)}$, so $p_1 = \dfrac{p_0 \cdot OR}{1 + p_0(OR - 1)}$ and

$d^* = \dfrac{p_0\,OR}{1 + p_0(OR - 1)} - p_0$

We may have a specific OR or RR in mind and need to know the implied value of p1.

So, we have a (1) simple and (2) unified approach for (a) sample size and (b) power calculations for (i) RD, (ii) RR, or (iii) OR, as well as for differences in means.

Example #1: Cohort design: Does smoking during pregnancy show an association with increased risk of low birth weight in offspring?

Known facts:
1. Prevalence of smoking during pregnancy is about 25%, i.e., 3 non-smokers for each smoker. So, r = 3 if we just pick a cohort at random and follow them.
2. Incidence (overall) of low birth weight (≤ 2500 gm) is ~7%.

Suppose we have the time and dollars to study 1200 births. We expect 1200/4 = 300 to be exposed to smoking during gestation (n = 300).

Suppose we want to measure the difference in risk (proportions of low birth weight babies) and we want to detect a difference of 4% (= d*). What is the power to detect this difference?

We must compute p0 and p1 from the overall incidence of LBW = 0.07. That is simply a weighted average of the risks among smokers and non-smokers:

0.07 = (0.25)(p0 + 0.04) [smokers] + (0.75)(p0) [nonsmokers], because p1 = (p0 + 0.04)

Now, solve for p0:

p0 = 0.06
p1 = 0.10

$\bar{p} = \dfrac{0.10 + 3(0.06)}{1 + 3} = 0.07$, where $\bar{p} = \dfrac{p_1 + r\,p_0}{1 + r}$ and $r = \dfrac{\text{unexposed}}{\text{exposed}} = \dfrac{3}{1}$

For α = 0.05:

$Z_\beta = \left[\dfrac{n\,(d^*)^2\,r}{(r+1)\,\bar{p}(1-\bar{p})}\right]^{1/2} - Z_{\alpha/2} = \left[\dfrac{300\,(0.04)^2\,(3)}{(3+1)(0.07)(0.93)}\right]^{1/2} - 1.96 = 0.39$

For Zβ = 0.39, power = 0.652.
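The arithmetic of Example #1 can be checked with a short script. This is a sketch under the assumptions of Eq 5.7 (normal approximation, pooled proportion p̄); the function name `power_two_proportions` is hypothetical and not taken from any package.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(n, p1, p0, r, alpha=0.05):
    """Approximate power for a difference in proportions (Eq 5.7).

    n  : exposed subjects (cohort) or cases (case-control)
    r  : ratio of unexposed to exposed (or controls to cases)
    Illustrative helper only; uses the pooled-variance approximation.
    """
    p_bar = (p1 + r * p0) / (1 + r)          # weighted average of p1 and p0
    d_star = p1 - p0                         # risk difference to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = sqrt(n * d_star**2 * r / ((r + 1) * p_bar * (1 - p_bar))) - z_alpha
    return z_beta, NormalDist().cdf(z_beta)

# Example #1: 300 smokers, 3 non-smokers per smoker, p1 = 0.10, p0 = 0.06
z_beta, power = power_two_proportions(n=300, p1=0.10, p0=0.06, r=3)
print(round(z_beta, 2), round(power, 3))     # 0.39 0.652
```

Changing `n` or `r` in the call shows directly how the power responds to the study size and the allocation ratio.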
The power of 0.652 is the shaded area under the normal density curve, read from left to right: the cumulative normal from negative infinity to +0.39. Note that this power plot is just the same as the earlier plot of the H0 and Ha distributions, except that we are now depicting power from left to right instead of from right to left (under the normal density). Be careful: we want the cumulative probability.

What do the power calculation programs produce?
1. STATA sampsi gets 0.592.
2. Stplan gives 0.606 (uses the arcsin transformation).
3. N-Query Advisor gives 0.63.

R. Localio

Example #2: Case control study of smoking during pregnancy and low birth weight in offspring.

Using the same numbers as before:
Case = giving birth to a low birthweight baby
Control = giving birth to a "normal" birthweight baby (e.g., 2501 gm)

For p0, we will use the overall prevalence of smoking (EXPOSURE) in the general population of pregnant women [because cases are a small minority], i.e., p0 = proportion of controls who are exposed = 0.25 (as before).

Want to detect OR = 1.8
Can study 175 cases
Plan control:case ratio = r = 2

Solve for p1:

$p_1 = \dfrac{p_0\,OR}{1 + p_0(OR - 1)} = \dfrac{(0.25)(1.8)}{1 + (0.25)(1.8 - 1.0)} = 0.375$

$d^* = p_1 - p_0 = 0.375 - 0.250 = 0.125$

$\bar{p} = \dfrac{p_1 + r\,p_0}{1 + r} = \dfrac{(0.375) + 2(0.25)}{1 + 2} = 0.292$

$Z_\beta = \left[\dfrac{(175)(0.125)^2(2)}{(2+1)(0.29166)(0.70834)}\right]^{1/2} - 1.96 = 1.01$

Power = 84.4% to detect OR = 1.8.

This result means that the two distributions, one for Ho: OR = 1.0 and the other for Ha: OR = 1.8, do not overlap very much (see the earlier figure of the H0 and Ha distributions).

NOTES:
1) n = 175 cases, so total sample size = 175 + 350 = 525.
2) In the cohort study, we had p1 = 0.10 and p0 = 0.06, which gives OR = 1.74. The cohort study needed 1200 births.
3) Everything is re-expressed as a difference in proportions (or means).
4)
We need to know:
a) Exposure prevalence in the population (for a case control or cohort study)
b) Disease (incidence) in the population (for a cohort study)
c) Desired "effect size" ("clinically important" difference)
d) Minor notational and other differences may be found in different texts, e.g., $\dfrac{\bar{p}(1-\bar{p})}{n}$ replaced by $\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}$

Sample Size: Solve for n

For means:

$(Z_\beta + Z_{\alpha/2}) = \dfrac{d^*}{\sigma}\sqrt{\dfrac{nr}{r+1}}$

$(Z_\beta + Z_{\alpha/2})^2 = \dfrac{(d^*)^2}{\sigma^2} \cdot \dfrac{nr}{r+1}$

Then

$n = \dfrac{(Z_\beta + Z_{\alpha/2})^2\,\sigma^2\,(r+1)}{(d^*)^2\,r}$

For proportions:

$(Z_{\alpha/2} + Z_\beta)^2 = \dfrac{n\,(d^*)^2\,r}{(r+1)\,\bar{p}(1-\bar{p})}$

$n = \dfrac{(Z_\beta + Z_{\alpha/2})^2\,\bar{p}(1-\bar{p})\,(r+1)}{(d^*)^2\,r}$   (Eq 5.9)

There are some common values for given levels of power and Type I error.

Tables for common values of key parameters (Kelsey et al., 1996, Table 12-16, p 333):

Values of (Zα/2 + Zβ)² for frequently used combinations of significance level and power

Significance level α    Power (1 - β)    (Zα/2 + Zβ)²
0.01                    0.80             11.679
                        0.90             14.879
                        0.95             17.814
                        0.99             24.031
0.05                    0.80             7.849
                        0.90             10.507
                        0.95             12.995
                        0.99             18.372
0.10                    0.80             6.183
                        0.90             8.564
                        0.95             10.822
                        0.99             15.770

So, 7.85 and 10.5 are the key values to remember.

Another example: Case control study of smoking and low birth weight

Want OR = 1.8 to be detectable
Power = 90%
α = 0.05
(Recall) p̄ = 0.292, (1 - p̄) = 0.708, d* = 0.125, r = 2 (plus other prevalence assumptions)

Thus:

$n = \dfrac{(10.507)(0.29166)(0.70834)(3)}{(0.125)^2\,(2)} = 208.4$, rounded up to 209

n = 209 cases + 418 controls = 627 [Remember that 175 cases gave 84% power]

§3.3 Special Concerns in Power (Sample Size) Calculations

Worries:
1. Measurement error
2. Those selected/invited vs. those who agree to participate.
   If 80% enroll and 500 are needed: (0.8)(x) = 500, so x = 500/0.8 = 625 must be invited.
3. Censoring
   Loss to follow-up
   Other causes of death than the cause of interest
   Many assumptions are involved in the calculations for such studies

§3.3.1 Measurement Error: effect on power
(Refs: Armstrong et al 1992; Kelsey et al 1996, ch 13)

Where errors can occur:
   Exposure variables (most common worry)
   Disease (outcome) classification
   Confounding factors or covariates

Effect of nondifferential error (misclassification or measurement): it commonly (although not always) biases or attenuates the measure (effect size) towards the null.

Effects of nondifferential error in exposure on sample sizes [in simple cases]:
   Effect of bias: the observed effect size is smaller than the true effect size (the observed effect will be closer to the null), so a larger sample is needed to demonstrate an effect for a given true effect.
   Effect of variance: confidence intervals for corrected measures of effect size are wider than if exposure were measured without error.

Effects of nondifferential error in confounders: the effect size can be biased in either direction.

Remedies for measurement error in planning studies:
   Estimate measurement error from pre-existing data
   Use tables on attenuation bias (Kelsey, Armstrong)
   If the error is not known, plan a validation substudy (complex)
   Plan on multiple measurements of subjects

For estimating the impact of nondifferential error, estimate the sensitivity and specificity of observed exposure:

                        True exposure
                        +        -
Observed exposure  +    a        b
                   -    c        d

Then
Sn = Pr(O+ | True+) = a/(a+c)
Sp = Pr(O- | True-) = d/(b+d)
Prevalence of exposure = (a+c)/(a+b+c+d)

Effect on the Odds Ratio of Nondifferential Error in the Measurement of a Binary Exposure Variable* (Kelsey et al., 1996, page 350)
[Table body not reproduced here.]
*The entries in the body of the table are the attenuated values of the odds ratio resulting from the effects of the nondifferential error
in measuring exposure. Classification in terms of disease status is assumed to be error free.

§3.3.2 How many controls per case

What should the value of r be?
r = ratio of controls/cases or unexposed/exposed

In practice, the number of cases in a case control study is the total number available, so we can't get any more than there are. Then we can increase power by increasing r (i.e., taking more controls), BUT precision increases very little beyond r = 3 or 4 (when c = 1).

Summary: We have a unified method for computing power and sample size for different parameters (RR, OR, RD, difference in means). They all depend on tradeoffs between Type I and Type II error, the assumed differences (or ratios) of the means (or proportions), the standard deviation of the distributions (in the case of differences in means), and the sample size. Power calculation programs do this work for us, but we need to understand what we are asking of those programs.

§3.4 The Fallacy of the Post-hoc Power Calculation
(see Goodman & Berlin, 1994)

Suppose σ = 10 (σ² = 100) and N = 50 subjects per group.

We have done a study comparing the effects of two drugs on a continuous outcome measure with the above variance. The result of the study is that the difference between the means of the two groups is 4 units. (The two groups are independent.)

$\mathrm{var}(A - B) = \mathrm{var}(A) + \mathrm{var}(B) = \dfrac{\sigma^2}{n} + \dfrac{\sigma^2}{n} = \dfrac{100}{50} + \dfrac{100}{50} = 4$

We do a Z-test (known variance):

$Z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{100}{50} + \dfrac{100}{50}}} = \dfrac{4}{\sqrt{4}} = 2$

So the test (either Z or t) would barely reject H0 of no difference in means at the α = .05 level.

Now, suppose that the planned detectable difference = 3.0 with 80% power and alpha = 0.05. But after the experiment, we observe a difference = 2.0, with CI = 0 to 4. This result means that you happened to observe an effect size in the sample that is lower than the true effect size in the hypothesized population.
We must always distinguish
(1) the hypothesized true (but unobserved) population, and
(2) the actual observed sample from that population.

Each sample from the true population will differ somewhat and will have a different estimated effect size. If you hypothesized a large difference and you found only a small difference, then you are "out of luck". Too bad. Your p-value will likely be > 0.05.

Q: What was the power to detect a difference of 4 units given N = 50 per group (i.e., r = 1) and σ = 10?

$Z_\beta = \dfrac{d^*}{\sigma}\sqrt{\dfrac{nr}{r+1}} - Z_{\alpha/2} = \dfrac{4}{10}\sqrt{\dfrac{50 \cdot 1}{2}} - 1.96 = (0.4)\sqrt{25} - 1.96 = 0.04$

So power ≈ 0.52, i.e., about one half.

So if the power was so low, how did we detect a difference?

Meaningless question: the d* in the formula relates to the hypothetical mean of an alternative distribution, not to an observed event. An observed event will always have (1 - β) < 0.5 if the finding is "not significant". In other words, if z < +1.96, power is < 0.5.

So if we observe an "NS" finding, we can always say the study was underpowered. But we do not know what we will find until after the experiment. In short, d* - d = d* is not the same thing as dOBS - d = dOBS. There is no place for power after we observe dOBS.

Ref: Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. American Statistician. 2001;55(1):19-24.

§3.5 Sample Sizes for Confidence Intervals

Sample size for a single mean or proportion

L = margin of error within which you want to estimate the mean or proportion (1/2 the width of the CI)

Then for a MEAN:

$L = Z \cdot \dfrac{\sigma}{\sqrt{n}} = 1.96 \cdot \mathrm{se}$

$L^2 = \dfrac{Z^2\,\sigma^2}{n}$

$n = \dfrac{Z^2\,\sigma^2}{L^2}$

For a proportion (e.g., sensitivity and specificity):

$n = \dfrac{Z^2\,p(1-p)}{L^2}$

where Z is the standard normal value (2-sided) corresponding to the desired proportion of the time that the estimate is to be within the desired margin of error.

Example: Suppose you want to estimate the proportion of people with high cholesterol (> 200 mg/dl) to within 4 percentage points.
You guess that the proportion will be around 40%. Then:

$n = \dfrac{(1.96)^2\,(0.4)(0.6)}{(0.04)^2} \approx 576$

With this n, there is a 95% probability (before doing the study) that the estimate obtained will be within 4 percentage points of the population value. This calculation does not address the situation in which you want to "rule out" a true value above (or below) a particular hypothesized value (see later).

The corresponding confidence interval is $\hat{p} \pm 1.96\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$.

"Worst case" for proportions, when you have no idea what p will be: use 0.5.

For the example: $n = \dfrac{(1.96)^2\,(0.5)(0.5)}{(0.04)^2}$, so n = 601 (not much bigger).

Suppose you wanted ± 3%? Then let p = 0.5 be the proportion used for the sample size calculation: n = 1068 (much bigger).

This is how the pollsters give you their ± numbers and compute n, i.e., for p = 0.5 and L = ± 0.04, the 95% CI is 0.5 (0.46, 0.54).

Suppose you think p will be around .001 and you want ± 0.0005. This is a small proportion!

$n = \dfrac{(1.96)^2\,(0.001)(0.999)}{(0.0005)^2} = 15{,}352$, a big study. (Cancer rates, etc. are this low.)

But these calculations on the width of the CI fail to consider the uncertainty of the observed point estimate, e.g., $\widehat{OR}$, even when the true OR is fixed. They assume you will be satisfied with this CI wherever it is centered. The following examples show how that assumption might not hold.

§3.5 (continued) Sample Sizes for Confidence Intervals

The same confidence interval question might be answered differently.

Question #1
Suppose you want to ensure that your estimate of sensitivity (Sn) will have a 2-sided confidence interval of ± 5 percentage points. Assume you think that you will observe Sn = 0.9. How many subjects with disease do you need to produce a confidence interval of (0.85 to 0.95)?
$n = \dfrac{z^2\,p(1-p)}{L^2}$

If z = 1.96, p = 0.9, and L = 0.05, then n = 3.84 * 0.9 * 0.1 / 0.0025 = 138.

Question #2
Suppose you want to ensure that, whatever observed Sn you find after your experiment, you can eliminate a true Sn < 0.85 by means of a 95% confidence interval around the estimate. How many subjects with disease do you need to ensure with 80% power that the lower confidence bound is at least 0.85?

This second question is different. It can be viewed as a hypothesis test. How can we calculate this CI?

Here is the STATA code and output for that question:

    . sampsi .9 .85, power(0.8) onesample

    Estimated sample size for one-sample comparison of proportion
      to hypothesized value

    Test Ho: p = 0.9000, where p is the proportion in the population

    Assumptions:
             alpha =   0.0500  (two-sided)
             power =   0.8000
     alternative p =   0.8500

    Estimated required sample size:
                 n =      316

This number is much larger. For question #1, you assume that you will observe Sn = 0.90. All you want to know is how wide the resulting CI will be. But for question #2 you are assuming only that the true Sn = 0.9, and that the observed Sn might vary randomly around the true value. So, your observed Sn might be smaller than 0.9! You must build in extra power so that, whatever you observe, the lower bound of your confidence interval will be at least 0.85. (Simulations confirm this second result.)

Correspondence between these two different questions: If in STATA one sets the alternative hypothesis (Ha) at the end of the confidence interval and stipulates power = 0.5, then the sample size is the same as for question #1, i.e., n = 138.

Question #3
Suppose you want to ensure that, whatever observed Sn you find after your experiment, you can eliminate a true Sn < 0.85 and show p < 0.05. How many subjects with disease do you need to ensure with 80% power that the lower confidence bound of a one-sided 95% confidence interval is at least 0.85?
This amounts to a one-sided, one-sample test:

    . sampsi .9 .85, power(0.8) onesample onesided

    Estimated sample size for one-sample comparison of proportion
      to hypothesized value

    Test Ho: p = 0.9000, where p is the proportion in the population

    Assumptions:
             alpha =   0.0500  (one-sided)
             power =   0.8000
     alternative p =   0.8500

    Estimated required sample size:
                 n =      253

Question #4
Fourth type of confidence interval problem: the predicted CI when planning experiments (Reference: Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994; 121:200-206.)

Problem: Evaluating a medical treatment with a 45% cure rate. The proposed surgical alternative must have a higher cure rate, 70% or more (to offset the higher risk of surgical morbidity).
Difference = .70 - .45 = 0.25 (25 percentage points)

Question: if we design a study with 90% power to detect a difference of this size (or larger), what is the predicted confidence interval going to be?

Assume α = 0.05, two-sided.

(1) Step 1: Compute the sample sizes n1 and n2 for each group to achieve 90% power to detect a difference of 0.25:

$n = \dfrac{(z_{\alpha/2} + z_\beta)^2\,\bar{p}(1-\bar{p})\,(r+1)}{(d^*)^2\,r} = \dfrac{(10.507)(0.575)(0.425)(2)}{(0.25)(0.25)} \approx 82$

(2) Step 2: Compute the predicted confidence interval:

Predicted 95% CI = observed difference ± 0.6 * Δ0.90

where Δ0.90 = true difference for which there is 90% power = 0.25.

Predicted CI = observed difference ± 0.6 * 0.25 = observed difference ± 0.15.

So, the predicted CI for this problem will be 0.30 wide. Thus, if the observed difference is 0.15, the lower bound of the CI will just equal 0.0.

The same reasoning holds when using the alternative formula: ± 0.7 * Δ0.80. In that case, given the same set of facts, if there is 80% power to demonstrate a risk difference of 0.25, then one would expect the confidence interval to be wider. It is 0.7 * 0.25, or ± 0.175.
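The multipliers 0.6 and 0.7 can be reproduced numerically. The sketch below assumes that they arise as Zα/2 / (Zα/2 + Zβ), which follows if the detectable difference Δ equals (Zα/2 + Zβ) standard errors and the CI half-width equals Zα/2 standard errors; the function name `predicted_ci_half_width` is a hypothetical label for this illustration.

```python
from statistics import NormalDist

def predicted_ci_half_width(delta, power, alpha=0.05):
    """Predicted CI half-width given the difference detectable at `power`.

    Assumes delta = (Z_{alpha/2} + Z_beta) * se, so that the half-width
    Z_{alpha/2} * se equals delta * Z_{alpha/2} / (Z_{alpha/2} + Z_beta).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # about 1.96
    z_beta = z.inv_cdf(power)            # about 1.28 (90%) or 0.84 (80%)
    return delta * z_alpha / (z_alpha + z_beta)

print(round(predicted_ci_half_width(0.25, 0.90), 3))  # close to 0.6 * 0.25
print(round(predicted_ci_half_width(0.25, 0.80), 3))  # close to 0.7 * 0.25
```

With power = 0.5 the same function returns the full Δ, which matches the earlier observation that setting power to 0.5 reproduces the plain CI-width calculation.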
(See Goodman and Berlin 1994 for the derivation.)

Why is this so? As the power increases (50%, 80%, 90%), the resulting confidence interval gets narrower, holding constant the observed risk difference.

So,
(a) Compute the sample size to detect a risk difference at a given power level.
(b) Use the simple formula to predict the confidence interval.

§3.6 Relative size of (a) standard deviation and (b) desired effect size on power and sample sizes

Suppose, in any of these situations, you have no idea what σ² will be?

e.g., comparing 2 means:

$n = \dfrac{(Z_{\alpha/2} + Z_\beta)^2\,\sigma^2\,(r+1)}{(d^*)^2\,r}$

We can always say that we would like to detect a difference of, say, one (or 0.5, or whatever) standard deviations, e.g., d* = ± σ.

Thus, for r = 1 (for example):

$n = \dfrac{2\,(Z_{\alpha/2} + Z_\beta)^2\,\sigma^2}{\sigma^2} = 2\,(Z_{\alpha/2} + Z_\beta)^2$ (small)

n = 2 * 7.85 = 15.7

For d* = 0.5σ and r = 1:

$n = \dfrac{2\,(Z_{\alpha/2} + Z_\beta)^2\,\sigma^2}{(0.5\sigma)^2} = \dfrac{2\,(Z_{\alpha/2} + Z_\beta)^2\,\sigma^2}{0.25\,\sigma^2} = 8\,(Z_{\alpha/2} + Z_\beta)^2$

n = 8 * 7.85 = 62.8 (The sample size gets big quickly.)

Note that this all depends only on the ratio $\sigma^2 / (d^*)^2$. (The same reasoning could be applied to a single mean.)

So, sample size depends on the sd relative to the desired difference to be demonstrated.

Formulae differ according to textbook and sample size program:

The formula above for comparing groups is approximate (but is used in many texts). A "more exact" form is [(Fleiss, p. 41) for one control per case]:

$n = \dfrac{\left(z_{\alpha/2}\sqrt{2\,\bar{p}\,\bar{q}} - z_{1-\beta}\sqrt{p_0 q_0 + p_1 q_1}\right)^2}{(p_0 - p_1)^2}$ (per group), with $\bar{p} = \dfrac{p_1 + p_0}{2}$ (remember r = 1)

(In Fleiss's notation $z_{1-\beta}$ is negative for power above 50%, hence the minus sign.)

Tables are also available for common combinations of p and power.

Fleiss JL. Statistical Methods for Rates and Proportions, 2nd Edition. New York: John Wiley & Sons, Inc.; 1981: 262.

Always note, however, when using formulae from texts, that each author might define the terms differently and therefore have slightly different formulae.
For example, the Schlesselman formula (pg. 145):

$n = \dfrac{\left(Z_\alpha\sqrt{2\,\bar{p}\,\bar{q}} + Z_\beta\sqrt{p_1 q_1 + p_0 q_0}\right)^2}{(p_1 - p_0)^2}$

NOTE: this formula calls for Zα = +1.96.

where:

$p_1 = \dfrac{p_0 R}{1 + p_0(R - 1)}$, $\bar{p} = \tfrac{1}{2}(p_1 + p_0)$, $\bar{q} = 1 - \bar{p}$, $q_1 = 1 - p_1$, $q_0 = 1 - p_0$, and R = the odds ratio

A formula that is simpler than the ones above (for r = 1, i.e., two equal sized groups), and for practical purposes equivalent, is given by

$n = \dfrac{2\,\bar{p}\,\bar{q}\,(Z_\alpha + Z_\beta)^2}{(p_1 - p_0)^2}$

Corresponding to α = .05 (two-sided) and β = .10, one has Zα = 1.96 and Zβ = 1.28, so that the equation reduces to a particularly simple formula:

$n = \dfrac{21\,\bar{p}\,\bar{q}}{(p_1 - p_0)^2}$

For r = 1, look at the huge sample sizes needed to detect small differences, especially when the baseline risk is low.

[Table not reproduced.]
From: Case-Control Studies: Design, Conduct, Analysis by James J. Schlesselman, New York, 1982, Oxford University Press, Appendix A.

Summary of different results

What is important from the tables, such as the one from Kelsey, is that you can see just how severe the penalties are when one wants to demonstrate small effects. Consider the joint effects of (a) increasing power and (b) decreasing size of the true OR.

§3.7 Sample Sizes for Matched Studies

§3.7.1 Frequency matching – as in stratified design

First we consider frequency matching. The formulae for stratified (frequency matched) and individually matched studies are in Schlesselman's book (CC Studies, p 159).

Recall that our estimates of the MH OR are based on weighted estimates of stratum-specific ORs. This is the corresponding method of arriving at a sample size for a stratified design. It is a way to incorporate strata, or a confounding factor, into the estimation of power or sample size.

We must specify the following parameters, assuming we have J strata.
1. p0j = exposure prevalence in controls in the jth stratum
2.
fj = fraction of the total observations in stratum j, where $\sum_j f_j = 1.0$
3. Type I error
4. Power
5. Assumed true effect size (RR = OR, in this case)

Assume:
   Equal number of cases and controls in each stratum
   Constant RR (OR) across strata (no effect modification)

Required total number of "cases":

$n = \dfrac{(Z_{\alpha/2} + Z_\beta)^2}{\sum_j f_j\,g_j}$

where (using q = 1 - p)

$g_j = \dfrac{(\ln OR)^2}{\dfrac{1}{p_{0j}\,q_{0j}} + \dfrac{1}{p_{1j}\,q_{1j}}}$, $\quad p_{1j} = \dfrac{p_{0j}\,(OR)}{1 + p_{0j}\,(OR - 1)}$

The formula is essentially a weighted sum of d* and var(d*) from our general sample size/power formula.

Example: OC Use and MI (Schlesselman, Table 6.5, p 160)
Hypothesized effect size R = 3, α = 0.05, β = 0.10 (power = 0.9)

Age      fj      p0j     p1j     gj      fj·gj
25-29    .03     .22     .46     .122    .0037
30-34    .09     .08     .21     .062    .0056
35-39    .16     .07     .18     .055    .0088
40-44    .30     .02     .06     .018    .0054
45-49    .42     .02     .06     .018    .0076
Total    1.00 (by definition)            .0311 = Σ fj gj

fj = .42 is where we have the most cases (age category 45-49).
p0j = .22 is where the exposure prevalence is highest (age category 25-29).

Then the required

$N = \dfrac{(1.96 + 1.28)^2}{0.0311} \approx 338$ cases

Reason for the frequency matching: efficiency. Consider the context of myocardial infarction and oral contraceptive use. Most cases are in what age group? Most exposure is in what age group?

§3.7.2 Pair Matched Studies (Schless. §6.6, pp 160 ff.)

There are special methods of computing power for matched studies. We consider first the simplest situation: 1 to 1 matching. But "matched" studies can also have multiple controls per case.

The number of discordant pairs (= m) required to detect a relative risk (RR) is given by:

$m = \dfrac{\left[\dfrac{Z_{\alpha/2}}{2} + Z_\beta\sqrt{P(1-P)}\right]^2}{\left(P - \dfrac{1}{2}\right)^2}$

where $P = \dfrac{OR}{1 + OR} \approx \dfrac{RR}{1 + RR}$. So, here we are assuming that OR ≈ RR.

We are going to work with the OR because it is the ratio of the frequencies of discordant pairs. (We then make the assumption that OR is a good estimate for RR.)
Here $P = \dfrac{u_{10}}{u_{10} + u_{01}}$ in the paired data table. (See the notation in Vol I, Part 4.) P must be distinguished from $p_0$ and $p_1$, the risks of exposure among the controls and the cases.

Derivation of sample size formula for McNemar's test:

Recall that McNemar's test is equivalent to a test of a binomial proportion, where the proportion is the fraction of discordant pairs that are, for example, in the $u_{10}$ cell of the 2 by 2 table of paired data. This was shown in Vol I, Part 4.

We can use this relationship and a version of the sample size formula we have seen before to show the correspondence between the previous formulae and the ones specifically suited for matched pair case-control studies. Details appear in Schlesselman (pp. 145, 161). (These calculations can be done by computer: PS, Power and Precision, PASS, for example.)

$H_0$: $p = \tfrac{1}{2}$ (OR = 1), where $\hat{p} = \dfrac{U_{10}}{U_{10} + U_{01}}$ (here $m = U_{10} + U_{01}$).

The standard sample size formula for a one-sample binomial test is

$$n = \frac{\left[ z_{\alpha/2} \sqrt{p_0 (1 - p_0)} + z_\beta \sqrt{p (1 - p)} \right]^2}{(p - p_0)^2}.$$

Here $z_{\alpha/2}$ denotes the same two-sided critical value (1.96 for α = 0.05) that is written $Z_\alpha$ in the discordant-pair formula. Substituting $p_0 = \tfrac{1}{2}$:

$$n = \frac{\left[ z_{\alpha/2} \sqrt{\tfrac{1}{2} \cdot \tfrac{1}{2}} + z_\beta \sqrt{p (1 - p)} \right]^2}{\left( p - \tfrac{1}{2} \right)^2} = \frac{\left[ \dfrac{z_{\alpha/2}}{2} + z_\beta \sqrt{p (1 - p)} \right]^2}{\left( p - \tfrac{1}{2} \right)^2}.$$

Letting m = n, we have derived the formula for the number of discordant pairs.

Note: The denominator corresponds to d* from before, because we have expressed the OR in terms of p, and we are essentially doing the calculation for the difference between the desired OR and OR = 1 (null).

Estimating the number of discordant pairs.

We do this from our estimates of the risk of exposure in the control group. Let $p_e$ = the probability of an exposure-discordant pair and M = the total number of pairs needed to yield m discordant pairs:

$$M = \frac{m}{p_e}.$$

This probability will depend in part on the baseline risk of exposure among the controls, on the odds ratio that we are trying to demonstrate, and on the skill (or lack thereof) in selecting matching criteria.
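The reduction from the general one-sample binomial formula to the discordant-pair formula can be verified numerically. This is a minimal sketch, assuming exact normal quantiles from the standard library; the function names `one_sample_binomial_n` and `discordant_pairs` are hypothetical and chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_binomial_n(p, p_null, alpha=0.05, power=0.90):
    """General one-sample binomial sample size (two-sided test)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    numerator = (
        z_a * sqrt(p_null * (1 - p_null)) + z_b * sqrt(p * (1 - p))
    ) ** 2
    return numerator / (p - p_null) ** 2


def discordant_pairs(odds_ratio, alpha=0.05, power=0.90):
    """Reduced form with p_null = 1/2 and P = OR / (1 + OR)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    big_p = odds_ratio / (1 + odds_ratio)
    return (z_a / 2 + z_b * sqrt(big_p * (1 - big_p))) ** 2 / (big_p - 0.5) ** 2


# For OR = 2, P = 2/3; both forms should agree.
m_general = one_sample_binomial_n(p=2 / 3, p_null=0.5)
m_reduced = discordant_pairs(odds_ratio=2)
print(f"general: {m_general:.2f}, reduced: {m_reduced:.2f}")
```

Both functions give the same value (a little over 90 discordant pairs for OR = 2, α = 0.05, power = 0.9), which is the point of the derivation: the matched-pair formula is the one-sample binomial formula evaluated at a null proportion of one half.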
First, consider the baseline case of estimating what fraction of the matched pairs will be informative, i.e., what fraction will be discordant pairs. Although $p_e$ depends on the matching criteria, using the notation from McNemar's test, the matched pairs can be displayed in the following table (E = exposed, Ē = not exposed):

                    Control
                    E        Ē
  Case    E         u11      u10
          Ē         u01      u00

$p_e$ = Pr(exposure-discordant pair). By definition:

$$p_e = \Pr(U_{10}) + \Pr(U_{01}) = \Pr(E \mid \text{case}) \Pr(\bar{E} \mid \text{ctrl}) + \Pr(\bar{E} \mid \text{case}) \Pr(E \mid \text{ctrl}) = p_1 (1 - p_0) + (1 - p_1)\, p_0$$

Note: this is an approximation, because $p_e$ depends on the matching criteria (which include factors other than E).

We can compute $p_1$, the proportion of exposed cases, from the OR and the value of $p_0$, the proportion of exposed controls, using the formula for the OR:

$$p_1 = \frac{p_0\,OR}{1 + p_0 (OR - 1)}.$$

Then $q_0 = 1 - p_0$, $q_1 = 1 - p_1$, and

$$M = \frac{m}{p_e} = \frac{m}{p_0 q_1 + p_1 q_0} = \text{sample size needed (total pairs)}.$$

But there might be other reasons for assuming that the true percentage of usable discordant pairs is actually smaller than what we might expect.

Example: Pair (1 to 1) matched study of OC use and congenital heart disease.

For α = 0.05, β = 0.1.
We think $p_0$ = 0.03, i.e., a 3% risk of exposure among the population of controls (so, a rare exposure).
We want to detect OR = 2.

From the relationship between the OR and $p_1$:

$$p_1 = \Pr(E \mid \text{case}) = \frac{(.03)(2)}{1 + .03(1)} = .058, \quad \text{because } OR - 1 = 2 - 1 = 1,$$

and

$$P = \frac{OR}{1 + OR} = \frac{2}{3}, \qquad 1 - P = \frac{1}{3}.$$

Then, from the formula derived from McNemar's test:

$$m = \frac{\left[ \dfrac{1.96}{2} + 1.28 \sqrt{\dfrac{2}{3} \cdot \dfrac{1}{3}} \right]^2}{\left( \dfrac{2}{3} - \dfrac{1}{2} \right)^2} = 90 \text{ discordant pairs.}$$

Then, to estimate the total number of pairs:

$$p_e = \Pr(\text{discordant pair}) = p_0 q_1 + p_1 q_0 = (.03)(.942) + (.058)(.97) = .028 + .056 = .084$$

Then:

$$M = \frac{m}{p_e} = \frac{90}{.084} = 1071 \text{ matched pairs}$$

What happens with other combinations of parameters?
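One way to explore other combinations is a short script that chains the three steps of the worked example: m from the OR, $p_e$ from $p_0$ and the OR, and then M = m / $p_e$. This is a sketch under the same independence assumption as the text (no correlation of exposure within pairs); the function name `matched_pairs_needed` and the grid of values are illustrative choices, not part of the original notes.

```python
from math import sqrt
from statistics import NormalDist


def matched_pairs_needed(p0, odds_ratio, alpha=0.05, power=0.90):
    """Return (m, pe, M) for a 1:1 matched design.

    m  = discordant pairs required
    pe = probability that a pair is exposure-discordant,
         assuming case and control exposures are independent
    M  = total matched pairs required (m / pe)
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    big_p = odds_ratio / (1 + odds_ratio)
    m = (z_a / 2 + z_b * sqrt(big_p * (1 - big_p))) ** 2 / (big_p - 0.5) ** 2
    p1 = p0 * odds_ratio / (1 + p0 * (odds_ratio - 1))
    pe = p0 * (1 - p1) + p1 * (1 - p0)
    return m, pe, m / pe


for p0 in (0.03, 0.10, 0.20):
    for odds_ratio in (2.0, 2.5, 3.0):
        m, pe, total = matched_pairs_needed(p0, odds_ratio)
        print(f"p0={p0:.2f} OR={odds_ratio:.1f} "
              f"m={m:6.1f} pe={pe:.3f} M={total:7.0f}")
```

For $p_0$ = 0.03 and OR = 2 the unrounded calculation gives roughly 1066 total pairs, slightly fewer than the 1071 obtained by hand, because the hand calculation rounds m and $p_e$. The required M falls quickly as either the exposure prevalence or the odds ratio increases.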
  alpha  Za/2  power  ZB    OR   P     P0=Pr(E|ctrl)  r  m      P1=Pr(E|case)  q0=1-P0  q1=1-P1  pe     M     DuPont
  0.05   1.96  0.9    1.28  2    0.67  0.03           1  90.34  0.06           0.97     0.94     0.086  1046  1066
  0.05   1.96  0.9    1.28  2    0.67  0.1            1  90.34  0.2            0.9      0.8      0.26   347   368
  0.05   1.96  0.9    1.28  2    0.67  0.2            1  90.34  0.4            0.8      0.6      0.44   205   266
  0.05   1.96  0.9    1.28  2    0.67  0.5            1  90.34  1              0.5      0        0.5    181   181
  0.05   1.96  0.9    1.28  2.5  0.71  0.03           1  52.93  0.075          0.97     0.925    0.101  527   543
  0.05   1.96  0.9    1.28  3    0.75  0.03           1  37.7   0.09           0.97     0.91     0.115  329   343

So, we can see that M depends heavily on the probability of exposure among the controls, as well as on the OR that one assumes is present in truth. The column labeled M gives the results from this program. The DuPont numbers in the right column are from the program "PS" written by DuPont and Plummer.

Note: the P1 column in this table equals OR × P0 (a rare-exposure approximation) rather than the exact formula for $p_1$, so the rows with larger P0 (in particular P0 = 0.5, where P1 = 1) should be read with caution.

We are making assumptions about $p_0$, $p_1$, the OR, and the matching factors:
Pr(exposed) for the members of each pair are independent and have constant probability
Homogeneity of Pr(E) for each pair

If our matching is less than optimal, and we have overmatched to some extent, then Pr(exposure) for the case and control in each pair will tend to be similar, resulting in a larger number of "noninformative" pairs. The program by DuPont and Plummer allows the user to adjust for this correlation of exposure.

We can reverse this process and estimate power for a given number of discordant pairs (Ref: Schlesselman, p. 162):

$$z_\beta = \frac{ -\dfrac{z_{\alpha/2}}{2} + \sqrt{m \left( P - \tfrac{1}{2} \right)^2} }{ \sqrt{P (1 - P)} },$$

where power = Pr(Z ≤ $z_\beta$), and m is the number of discordant pairs (as before). So, $z_\beta$ = 1.28 is equivalent to power = 0.9.

Notes:
1. Can estimate m from M using M = m / $p_e$ (i.e., m = M $p_e$).
2. It is better to estimate $p_e$ from preliminary data, or to revise it after initial data collection.
3. We have looked at case-control studies (because that is where matching is more common). But this framework can apply to cohort studies.
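The reverse calculation, power for a given number of discordant pairs, can be sketched in a few lines. The function name `power_from_discordant_pairs` is hypothetical, and the sketch assumes the same normal approximation and two-sided α as the formula in the notes.

```python
from math import sqrt
from statistics import NormalDist


def power_from_discordant_pairs(m, odds_ratio, alpha=0.05):
    """Approximate power of McNemar's test given m discordant pairs."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    big_p = odds_ratio / (1 + odds_ratio)
    # z_beta = [-z_{alpha/2}/2 + sqrt(m) * |P - 1/2|] / sqrt(P(1 - P))
    z_beta = (-z_a / 2 + sqrt(m) * abs(big_p - 0.5)) / sqrt(big_p * (1 - big_p))
    return NormalDist().cdf(z_beta)  # power = Pr(Z <= z_beta)


for m in (45, 90, 180):
    print(f"m={m:4d}  power={power_from_discordant_pairs(m, odds_ratio=2):.3f}")
```

With m = 90 and OR = 2 the power comes out very close to 0.9, as expected, since 90 is the number of discordant pairs the forward formula asks for. To start from the total number of pairs instead, first convert with m = M $p_e$.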
§3.7.3 Matched studies with more than one control per case (or, in the instance of cohort studies, more than one unexposed per exposed)

The same principles apply to these more complex designs. In these instances, there are several paired tables per matched set, each table representing the cross-classification of the case with one of its controls. (So, if there are 3 controls per case, one can think of a set of 3 tables of paired comparisons.)

(1) A simple adjustment:

Let c = the number of controls per case, and let n be the number of cases assuming 1 to 1 matching. Then with c to 1 matching, one needs $n_1$ cases, where:

$$n_1 = \frac{(c + 1)\, n}{2c}.$$

Thus, if one needed 1050 cases (and 1050 controls) with 1 to 1 matching, and then selected 2 controls per case, the new number of cases = (2 + 1)(1050) / (2 × 2) = 3(1050)/4 = 788, and the number of controls = 1576.

This approximation is good in many cases, but falls apart when the probability of exposure of a sampled control is low.

(2) More complex methods:

Better approximations are available. The programs in DuPont and Plummer (PS) use an estimate of the correlation of the exposure status between a case and its matched controls.

The formula we have seen (Schlesselman) assumes no correlation. DuPont and Plummer generalized this formula for multiple controls per case AND for the possibility of some correlation.

[Aside: You can think of correlation in terms of two columns of data:

  Case   Control
  1      1
  1      0
  0      1
  0      0
  1      0

where a 1 indicates exposed and a 0 indicates unexposed, and each row is a matched pair (or one of a set of matched pairs). Then the correlation is simple to obtain using the standard formula.]

A reasonable starting value is corr = 0.2. As the correlation increases, the sample size (number of cases) increases.
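Both the simple c-to-1 adjustment and the exposure correlation described in the aside are easy to compute directly. The sketch below is illustrative: the function names are hypothetical, the case count is rounded up to a whole number, and the correlation is the ordinary Pearson correlation of the two 0/1 columns (the choice of "standard formula" is an assumption).

```python
from math import ceil, sqrt


def cases_with_c_controls(n_one_to_one, c):
    """Cases needed with c controls per case: n1 = (c + 1) * n / (2 * c)."""
    return ceil((c + 1) * n_one_to_one / (2 * c))


def exposure_correlation(case, control):
    """Pearson correlation of 0/1 exposure indicators across matched pairs."""
    n = len(case)
    mean_x = sum(case) / n
    mean_y = sum(control) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(case, control)) / n
    var_x = sum((x - mean_x) ** 2 for x in case) / n
    var_y = sum((y - mean_y) ** 2 for y in control) / n
    return cov / sqrt(var_x * var_y)


# The example from the text: 1050 cases at 1:1, moving to 2 controls per case.
n_cases = cases_with_c_controls(1050, c=2)
n_controls = 2 * n_cases
print(n_cases, n_controls)

# The five matched pairs shown in the aside.
corr = exposure_correlation([1, 1, 0, 0, 1], [1, 0, 1, 0, 0])
print(f"correlation = {corr:.3f}")
```

For the five illustrative pairs the correlation is slightly negative, which is only an artifact of the toy data. In practice a positive correlation is what one would expect from matching, and that is the case in which the required number of cases grows.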
Effect of additional matches and correlation on sample sizes:

What happens when we add controls per case in a matched study? The number of cases needed drops, but the total number of patients increases.

  Controls per case   No. case patients   Total patients
  1                   1066                2132
  2                   782                 2346
  3                   688                 2752
  4                   641                 3205

Assuming the same OR, power, alpha, p0, and p1 as in our example. Calculations from program PS.

Effect of correlation on sample sizes (using the same example):

  Corr   Case patients
  0      1066
  0.1    1230
  0.2    1437
  0.3    1705

Correlation might occur when matching is less than optimal.

Available software: PS, PASS.

Reference: DuPont WD. Power calculations for matched case-control studies. Biometrics. 1988; 44:1157-68.

§3.8 Miscellaneous Comments on Sample Size Calculations

1. More complex problems (interactions): usually simplify the problem into a 2 by 2 table or a subgroup comparison. Just think about a 2 by 2 table for one of the subgroups of interest and power the study to detect a clinically meaningful effect for that subgroup alone.

But there is a program from NCI (Power.exe) that is specifically designed for computing power to detect interaction. [Ms. Holly Brown ([email protected]).]

Ref: Lubin JH, Gail MH. On power and sample size for studying features of the relative odds of disease. Am J Epidemiol 1990;131:552-566.
Garcia-Closas M, Lubin JH. Power and sample size calculations in case-control studies of gene-environmental interactions: Comments on different approaches. Am J Epidemiol 1999;149:689-693.

2. Adjustment of sample sizes from programs for:
   Measurement error
   Loss to follow-up
   Lack of independence of observations (clustering)
   Repeated measures
   Covariates
   Comparisons of subgroups

End of Vol 1 Part 5