EP 521 Spring, 2007 Vol I, Part 5

§3.1 Sample Size Estimation

A key element of study design is the sample size or "power" calculation, required of every grant proposal. In this section: (1) we begin with the theory behind power calculations and demonstrate how simple formulae for power and sample size are derived; (2) next, we show a unified treatment of power for the RD, OR, and RR based on this theory; (3) then, we describe how varying the question being asked can have a substantial effect on the required sample size; (4) we briefly explain the information needed for power calculations in matched-pair studies; and (5) we give some demonstrations of how to use and interpret software for power calculations.

Goals: to understand what affects power, how to define the problem, and how to get the computer to give you the answer you need.

Power in General: Terminology Review

Null hypothesis (Ho): a specified value for a parameter (OR, RR, RD, IRR, IRD, for example)
Alternative hypothesis (Ha): a specified alternative value for that parameter
Type I error = Pr(reject Ho | Ho is true) = α
Type II error = Pr(fail to reject Ho | Ha is true) = Pr(fail to reject Ho | Ho is false) = β
Power = Pr(reject Ho | Ha is true) = 1 − β
1 − α = Pr(fail to reject Ho | Ho is true)
("Pr" signifies probability over repetitions of the study.)
(References: Woodward, chap 8; Rothman and Greenland, pp. 184-8)

Notes:
(1) The α-level is not a p-value. A p-value is a quantity computed from, and varying with, the data; α is fixed and is specified without seeing the data.
(2) The p-value is not Pr(Ho vs Ha). It is loosely defined as Pr(result as extreme as or more extreme than the one observed | Ho true).
(3) The p-value is not Pr(data | Ho). That is the likelihood. The likelihood is usually much smaller than the p-value, because the p-value includes not only Pr(data | Ho) but also Pr(all other more extreme data configurations | Ho).
(4) Absence of evidence is not evidence of absence. Failing to reject Ho ≠ accepting Ho as true.
(5) Studies with power too low to produce results with appropriately narrow confidence intervals (as defined by the purpose of the study) are not "negative studies"; they are "indeterminate".

An initial description of what we are doing will help.

[Figure: two normal densities, one centered at the null value Ho = 0 and one at the alternative Ha = 3, with a critical value at 2 on the horizontal axis.]

Type I error (α): Ho is true but you reject Ho in favor of Ha. Suppose that 2 is your threshold (critical value) for rejecting Ho. Then, if Ho is true, you have only a very small chance of observing a value to the right of 2 and a large chance of observing something to the left of 2.

Type II error (β): if Ha is true, you have some chance of observing a value to the left of 2, below the critical value, but it is not great; you have a much larger chance of observing a value to the right of 2. How big a chance you have of observing a value at 2 or to the right of 2, if Ha is true, depends on how far Ha is from Ho. If Ha is far away, then power is bigger and Type II error is smaller.

Now what happens when the sample size increases (or the variance decreases)? The distributions become narrower. (This is the distribution of the mean, for example.) Holding everything else constant, what does that do to my power to detect a difference? At 2, I have little chance of falsely rejecting Ho; this would be a very high critical value for rejecting Ho. But if Ha is true, you have an almost certain chance of observing a value of at least 2, meaning that power is almost 1.0 and Type II error is almost 0. A numeric sketch of this picture follows.
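The picture can be checked numerically. A minimal Stata sketch, assuming (hypothetically) unit-variance normal distributions centered at 0 (Ho) and 3 (Ha) with the critical value at 2; halving the standard deviation stands in for increasing the sample size:

    * alpha = area of the Ho curve right of 2; power = area of the Ha curve right of 2
    display "alpha = " 1-normal((2-0)/1) "  power = " 1-normal((2-3)/1)
    * narrower distributions (sd halved): both error rates shrink
    display "alpha = " 1-normal((2-0)/0.5) "  power = " 1-normal((2-3)/0.5)

With sd = 1, α ≈ 0.023 and power ≈ 0.84; with sd = 0.5, α ≈ 0.00003 and power ≈ 0.98.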
[Figure: the same pair of normal densities, with the critical value at 2 and alternative values at 3 or 4 on the horizontal axis.]

I can pick a vertical line (at 2, for example) to correspond to a Type I error; this is usually how it is done. Then I can posit what Ha is (3 or 4), and if the sample size tells me how broad the distribution of the effect size is under Ho and under Ha, I can estimate what the Type II error and power will be. Alternatively, I can specify the Type I error and the power (and thus the Type II error) and estimate just how close Ha can be to Ho while still achieving that level of power.

We now change the paradigm only slightly. Every estimate has a distribution. The estimate can be a sample mean, or a measure of association or effect size in a sample. So we now think in terms of a distribution of the OR, RR, or RD: the distribution of the OR, RR, or RD if the null hypothesis were true vs the distribution if the alternative were true.

[Figure: distributions of the estimate under Ho and Ha with the Type I and Type II error regions shaded. From: Methods in Observational Epidemiology by J.L. Kelsey, A.S. Whittemore, A.S. Evans, and W.D. Thompson, 1996, New York, Oxford University Press, p. 328.]

Power calculations are based on the sampling distribution of the difference (in means, proportions) between the groups being compared.

d = value of the "difference" (risk difference, log OR, difference in means, etc.) when the null is true (d = 0)
d_c = critical value: the value of the difference that is just significantly different from d at significance level α
d* = value of the difference when the null is false, i.e., when Ha is true

Some key numbers to remember in sample size calculations (for purposes of this presentation):

Quantity   Interpretation                  Value
Z_{α/2}    Type I error of 0.05            1.96
Z_β        Type II error 0.2 (80% power)   +0.84
Z_β        Type II error 0.1 (90% power)   +1.28

(Z_{α/2} + Z_β)², used in sample size calculations:

Type I = 0.05; Type II = 0.2:   7.85
Type I = 0.05; Type II = 0.1:   10.5

Some texts refer to Z_β as Z_{1−β} and to Z_{α/2} as Z_{1−α/2}, and thus have slightly different formulae.
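These constants are easy to reproduce; a quick check in Stata (assuming a two-sided α):

    display invnormal(1-0.05/2)                      // Z_alpha/2 = 1.96
    display invnormal(0.80)                          // Z_beta for 80% power = 0.84
    display invnormal(0.90)                          // Z_beta for 90% power = 1.28
    display (invnormal(0.975)+invnormal(0.80))^2     // 7.85
    display (invnormal(0.975)+invnormal(0.90))^2     // 10.51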
The key to all these sample size formulae is to look at the two distributions, the difference under Ho vs the difference under Ha, with the cumulative probability (the area under the curve) calculated from OPPOSITE DIRECTIONS. Why? Because we are contrasting them. We look at Ho (vs Ha) going from left to right, but we look at Ha (vs Ho) by cumulating the probability from right to left. This assumes the difference under Ha is larger than under Ho.

So:
1. When the null is false (Ha is true), we are sampling from the distribution on the right. Values to the left of d_c occur with probability β and represent the probability of inappropriately failing to reject Ho. Area to the left of d_c when d* is true = Type II error = Pr(fail to reject Ho | Ha is true).
2. Values to the right of d_c in the shaded area represent the probability α/2 of rejecting Ho when we should fail to reject (since Ho is true).
3. Values to the right of d_c that form part of the distribution of d* represent the power to detect a true difference = Pr(reject Ho | Ho is false) = Pr(reject Ho | Ha is true) = 1 − β.

Using the standard normal:

$d_c = d + Z_{\alpha/2}\,se(d)$   (Eq 5.1), from the frame of reference of Ho,

where $Z_{\alpha/2}$ is the standard normal deviate corresponding to the position of $d_c$ on the distribution around d; and

$d_c = d^* - Z_\beta\,se(d^*)$   (Eq 5.2), from the frame of reference of Ha,

where $Z_\beta$ is the standard normal deviate corresponding to the position of $d_c$ on the distribution around $d^*$. This is easier to see if $se(d) = se(d^*) = 1$; then $d_c = d + Z_{\alpha/2}$ and $d_c = d^* - Z_\beta$.

E.g., for β = 0.1 (Type II error), power = 1 − β = 0.9, so $Z_{1-\beta} = 1.28$ and $Z_\beta = -1.28$. Think in terms of flipping the Ha distribution over, so that we read the z's in Ha from right to left rather than the usual left to right.

[Figure: standard normal density with −1.28, 0, and +1.28 marked, showing the positions of d, d_c, and d*.]

The key practical point: use +1.28 for β = 0.1. Then, setting Eq 5.1 equal to Eq 5.2:

$d + Z_{\alpha/2}\,se(d) = d^* - Z_\beta\,se(d^*)$, or, if the se's equal 1, $d + Z_{\alpha/2} = d^* - Z_\beta$.

This is the key to estimating power and sample size. Finally, solving for $Z_\beta$ we get:

$Z_\beta = \dfrac{d^* - d - Z_{\alpha/2}\,se(d)}{se(d^*)}$   (Eq 5.3)

Usually we assume $se(d) = se(d^*)$ and simplify:

$Z_\beta = \dfrac{d^* - d}{se(d^*)} - Z_{\alpha/2}$   (Eq 5.4)

Note: $Z_\beta$ can range from −∞ to +∞.

As is usual, d = 0, which corresponds to Ho: RD = 0, ln(OR) = 0, ln(RR) = 0, or $\bar X_1 - \bar X_2 = 0$. Then:

$Z_\beta = \dfrac{d^*}{se(d^*)} - Z_{\alpha/2}$   (Eq 5.5)

Using the simple Eq 5.5, we can arrive at a series of simple formulae for power and sample size calculations.
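Eq 5.5 is one line of code. A sketch with hypothetical values d* = 0.5 and se(d*) = 0.2 (power is the cumulative normal at $Z_\beta$):

    display normal(0.5/0.2-invnormal(0.975))    // power ~ 0.71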
§3.2 Power and Sample Sizes in Case-Control and Cohort Studies

Methods of sampling and estimation of sample size. Notation:

n: in a cohort or cross-sectional study, the number of exposed individuals studied; in a case-control study, the number of cases
r: in a cohort or cross-sectional study, the ratio of the number of unexposed individuals studied to the number of exposed individuals studied; in a case-control study, the ratio of the number of controls studied to the number of cases studied
σ: standard deviation in the population for a continuously distributed variable
p1: in a cohort (or cross-sectional) study, the proportion of exposed individuals who develop (or have) the disease; in a case-control study, the proportion of cases who are exposed
p0: in a cohort (or cross-sectional) study, the proportion of unexposed individuals who develop (or have) the disease; in a case-control study, the proportion of controls who are exposed
$\bar p = \dfrac{p_1 + r\,p_0}{1+r}$ = weighted average of p1 and p0

(Ref: Kelsey et al. 1996, Table 12-11.)

So, when n is fixed by costs, time, etc., one can use power (rather than sample size) calculations.

Initial derivation of Eq 5.6 from Eq 5.5. We begin with the difference in means, or with the risk difference. Recall the variance of a difference in means (assuming independence): Var(A − B) = Var(A) + Var(B). Assuming a common standard deviation, and d* = the difference under Ha:

$var(d^*) = \sigma^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)$

Here we know $n_2 = r \cdot n_1$, so

$var(d^*) = \sigma^2\left(\dfrac{1}{n_1} + \dfrac{1}{r\,n_1}\right) = \sigma^2\,\dfrac{r+1}{r\,n_1}$, and therefore $se(d^*) = \sigma\sqrt{\dfrac{r+1}{r\,n_1}}$.

Therefore, $Z_\beta$ for a difference in means:

$Z_\beta = \dfrac{d^*}{\sigma}\sqrt{\dfrac{nr}{r+1}} - Z_{\alpha/2}$   (Eq 5.6)

$Z_\beta$ for a difference in proportions (a risk difference); recall $Var(p) = \frac{p(1-p)}{n}$ and substitute $\sqrt{\bar p(1-\bar p)}$ for σ in Eq 5.6:

$Z_\beta = d^*\left[\dfrac{nr}{\bar p(1-\bar p)(r+1)}\right]^{1/2} - Z_{\alpha/2} = \left[\dfrac{n(d^*)^2\,r}{(r+1)\,\bar p(1-\bar p)}\right]^{1/2} - Z_{\alpha/2}$   (Eq 5.7)

Note: if we have defined d* as the risk difference (RD), then we can express the RR (or OR) in terms of the RD and the baseline or reference risk $p_0$.

For the RR: $RR = \dfrac{p_1}{p_0}$, so $p_1 = p_0\,RR$ and $d^* = p_0\,RR - p_0 = p_0(RR - 1)$.

For the OR: $OR = \dfrac{p_1(1-p_0)}{p_0(1-p_1)}$, so $p_1 = \dfrac{p_0\,OR}{1 + p_0(OR-1)}$ and $d^* = \dfrac{p_0\,OR}{1 + p_0(OR-1)} - p_0$.

We may have a specific OR or RR in mind and need to know the implied value of p1. So we have a (1) simple and (2) unified approach to (a) sample size and (b) power calculations for the (i) RD, (ii) RR, or (iii) OR, as well as for differences in means.

When we think about sample size calculations for the OR and RR, we should always think of the problem, and the picture, in terms of the risk difference (RD), because the RD gives a better idea of the "size" of the effect for which we want to estimate power or sample size. For example, suppose p0 = 0.01 and p1 = 0.03. The RR seems large (3.0), but the RD is small (0.03 − 0.01 = 0.02), and if n0 = 100 and n1 = 100 we expect a total of only 4 events (1 + 3 = 4). Now, if p0 = 0.2 and p1 = 0.6, RR = 3.0 as before, but power is much greater because we expect so many more events (20 + 60 = 80). The much larger number of events is reflected in the large risk difference (0.6 − 0.2 = 0.4).
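Eq 5.7 is easy to wrap up for reuse. A minimal Stata sketch (the program name zbeta and its argument order are our own invention, and a two-sided α = 0.05 is assumed):

    capture program drop zbeta
    program define zbeta
        // args: n = exposed (or cases), d = risk difference d*,
        //       r = unexposed:exposed (or control:case) ratio, pbar = weighted average risk
        args n d r pbar
        scalar zb = sqrt(`n'*(`d')^2*`r'/((`r'+1)*`pbar'*(1-`pbar'))) - invnormal(0.975)
        display "Z_beta = " zb "   power = " normal(zb)
    end

We use it on the two worked examples below.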
Example #1. Cohort design: does smoking during pregnancy show an association with an increased risk of low birth weight in the offspring?

Known facts:
1. The prevalence of smoking during pregnancy is about 25%, i.e., 3 non-smokers for each smoker. So r = 3 if we sample from the cohort using simple random sampling and follow them.
2. The overall incidence of low birth weight (≤ 2500 gm) is ~ 7%.

Suppose we have the time and dollars to study 1200 births. Expect 1200/4 = 300 exposed (to smoking during gestation), so n = 300. Suppose we want to measure the difference in risk (the proportions of low birth weight babies) and we want to detect a difference of 4 percentage points (= d*). What is the power to detect this difference?

We must compute p0 and p1 from the overall incidence of LBW = 0.07, which is simply a weighted average of the risks among smokers and non-smokers:

0.07 = (0.25)(p0 + 0.04) [smokers] + (0.75)(p0) [non-smokers], because p1 = p0 + 0.04.

Now, solve for p0: p0 = 0.06, p1 = 0.10, and

$\bar p = \dfrac{p_1 + r\,p_0}{1+r} = \dfrac{0.10 + 3(0.06)}{1+3} = 0.07$, where $r = \dfrac{\text{unexposed}}{\text{exposed}} = \dfrac{3}{1}$.

For α = 0.05:

$Z_\beta = \left[\dfrac{n(d^*)^2\,r}{(r+1)\,\bar p(1-\bar p)}\right]^{1/2} - Z_{\alpha/2} = \left[\dfrac{300(0.04)^2(3)}{(3+1)(0.07)(0.93)}\right]^{1/2} - 1.96 = 0.39$

For Z_β = 0.39, power = 0.652. On a normal density plot this is the shaded area under the curve from negative infinity to +0.39: the cumulative normal, now depicted from left to right instead of from right to left. This is the flipped-over distribution that we use to estimate power or Type II error.

What do the power calculation programs produce?

Program                           Power
Stata (sampsi)                    0.59
Stplan (arcsine transformation)   0.61
N-Query Advisor                   0.63
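Using the Eq 5.7 sketch from above for Example #1 (n = 300, d* = 0.04, r = 3, p̄ = 0.07):

    zbeta 300 0.04 3 0.07    // Z_beta = 0.39, power ~ 0.65

The packaged programs give slightly lower answers (0.59 to 0.63), presumably because they use somewhat different approximations (the arcsine transformation is noted above; others may use unpooled variances or continuity corrections), not because the inputs differ.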
Example #2. Case-control study of smoking during pregnancy and low birth weight in offspring, using the same numbers as before.

Case = giving birth to a low birthweight (LBW) baby; Control = giving birth to a "normal" birthweight baby (e.g., ≥ 2501 gm).

In the case-control setting, we now redefine p0 and p1 to be not the risk of LBW but the prevalence of exposure among the controls and the cases, respectively. For p0 we will use the overall prevalence of smoking (exposure) in the general population of pregnant women, because cases are a small minority: p0 = proportion of controls who are exposed = 0.25 (as before).

We want to detect OR = 1.8, we can study 175 cases, and the control-to-case ratio is r = 2. Solve for the prevalence of smoking among the cases:

$p_1 = \dfrac{p_0\,OR}{1 + p_0(OR-1)} = \dfrac{(0.25)(1.8)}{1 + (0.25)(1.8 - 1.0)} = 0.375$

$d^* = p_1 - p_0 = 0.375 - 0.250 = 0.125$

$\bar p = \dfrac{p_1 + r\,p_0}{1+r} = \dfrac{0.375 + 2(0.25)}{1+2} = 0.292$

$Z_\beta = \left[\dfrac{(175)(0.125)^2(2)}{(2+1)(0.292)(0.708)}\right]^{1/2} - 1.96 = 1.01$

Power = 0.844 (84.4%) to detect OR = 1.8. This result means that the two distributions, one for Ho: OR = 1.0 and the other for Ha: OR = 1.8, do not overlap very much. (This answer is very close to what the program PS gives: power = 0.844 for 179 cases.)

Notes:
1) n = 175 cases, so the total sample size = 175 + 350 = 525, with power = 0.84.
2) In the cohort study we had p1 = 0.10 and p0 = 0.06, which gives OR = 1.74, and the cohort study needed 1200 births for power = 0.6. Why the difference between the case-control and cohort calculations? How many events did we stipulate for the case-control study? 175. How many events would we expect from the cohort study?

Exposed:     300 × 0.10 = 30
Unexposed:   900 × 0.06 = 54
Total:                    84

For a cohort study with power = 0.84, we would need to enroll even more than 1200 patients: 519 exposed + 1557 unexposed = 2076 total. This would produce the following events:

Exposed:     519 × 0.10 = 52
Unexposed:  1557 × 0.06 = 93
Total:                   145

3) Everything is re-expressed as a difference in proportions (or means).
4) We need to know: (a) the exposure prevalence in the population (for a case-control or cohort study); (b) the disease risk in the population (for a cohort study); (c) the desired "effect size" (the "clinically important" difference). (d) Minor notational and other differences may be found in different texts, e.g., $\frac{\bar p(1-\bar p)}{n}$ replaced by $\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}$.

To obtain the sample size, solve Eq 5.6 for n. For means:

$\left(Z_\beta + Z_{\alpha/2}\right)^2 = \dfrac{(d^*)^2}{\sigma^2}\cdot\dfrac{nr}{r+1}$, so $n = \dfrac{\left(Z_\beta + Z_{\alpha/2}\right)^2\,\sigma^2\,(r+1)}{(d^*)^2\,r}$   (Eq 5.8)

For proportions:

$\left(Z_\beta + Z_{\alpha/2}\right)^2 = \dfrac{n(d^*)^2\,r}{(r+1)\,\bar p(1-\bar p)}$, so $n = \dfrac{\left(Z_\beta + Z_{\alpha/2}\right)^2\,\bar p(1-\bar p)(r+1)}{(d^*)^2\,r}$   (Eq 5.9)

Values of $(Z_{\alpha/2} + Z_\beta)^2$ for frequently used combinations of significance level and power (Kelsey et al., 1996, Table 12-16, p. 333):

Significance level α   Power (1 − β)   (Z_{α/2} + Z_β)²
0.01                   0.80            11.679
0.01                   0.90            14.879
0.01                   0.95            17.814
0.01                   0.99            24.031
0.05                   0.80            7.849
0.05                   0.90            10.507
0.05                   0.95            12.995
0.05                   0.99            18.372
0.10                   0.80            6.183
0.10                   0.90            8.564
0.10                   0.95            10.822
0.10                   0.99            15.770

So, 7.85 and 10.5 are the key values to remember.

Example #3. Case-control study of smoking and low birth weight. We want OR = 1.8 to be detectable, with power = 90% and α = 0.05. Recall from the prior example: $\bar p$ = 0.292, $1-\bar p$ = 0.708, d* = 0.125, r = 2. Thus:

$n = \dfrac{(10.507)(0.29166)(0.70834)(3)}{(0.125)^2(2)} = 208.4 \approx 209$

Then n = 209 cases + 418 controls = 627. (Remember that 175 cases gave 84% power.)

In summary, we can use a few formulae to estimate power, or sample size, for simple study designs, and these formulae form the basis for the computer-based sample size programs that we all use.
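Eq 5.9 in the same style as the power sketch; again a two-sided α = 0.05 is assumed and the program name is our own:

    capture program drop ncases
    program define ncases
        // args: power, d = risk difference d*, r = control:case ratio, pbar
        args power d r pbar
        scalar k = (invnormal(0.975)+invnormal(`power'))^2
        display "n = " ceil(k*`pbar'*(1-`pbar')*(`r'+1)/((`d')^2*`r'))
    end
    ncases 0.90 0.125 2 0.29166    // Example #3: n ~ 209 cases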
§3.3 Special Concerns in Power (Sample Size) Calculations

§3.3.1 Measurement error: effect on power (Refs: Armstrong et al. 1992; Kelsey et al. 1996, ch. 13)

Where errors can occur:
- Exposure variables (the most common worry)
- Disease (outcome) classification
- Confounding factors or covariates

Effect of nondifferential error (misclassification or measurement error): it commonly (although not always) biases or attenuates the measure (effect size) towards the null.

Effects of nondifferential error in exposure on sample sizes (in simple cases):
- The observed effect size is smaller than the true effect size, i.e., it takes more power to demonstrate an effect for a given true effect (the observed effect will be closer to the null): the effect of bias.
- Confidence intervals for corrected measures of effect size are wider than if the exposure were measured without error: the effect of variance.

Effects of nondifferential error in confounders: the effect size can be biased in either direction.

Remedies for measurement error in planning studies:
- Estimate the measurement error from pre-existing data
- Use tables on attenuation bias (Kelsey, Armstrong)
- If the error is not known, plan a validation substudy (complex)
- Plan on multiple measurements of the subjects

For estimating the impact of nondifferential error, estimate the sensitivity and specificity of the observed exposure:

                      True exposure
                      +    −
Observed exposure  +  a    b
                   −  c    d

Then Sn = Pr(O+ | True+) = a/(a+c); Sp = Pr(O− | True−) = d/(b+d); prevalence of exposure = (a+c)/(a+b+c+d).

Attenuated values of the odds ratio resulting from nondifferential error in measuring exposure. Classification of disease status is assumed to be error free; true OR = 2.0. (Ref: Kelsey et al., 1996, p. 350.)

                          Observed OR by exposure prevalence
Sensitivity  Specificity  0.01    0.5
0.6          0.6          1.01    1.14
0.6          0.9          1.05    1.42
0.6          0.99         1.37    1.54
0.9          0.6          1.02    1.48
0.9          0.9          1.08    1.72
0.9          0.99         1.47    1.82
0.99         0.6          1.02    1.68
0.99         0.9          1.09    1.89
0.99         0.99         1.50    1.97

§3.3.2 Those selected/invited vs. those who agree to participate. If we need 500 subjects and expect to enroll 80% of those invited: (0.8)(x) = 500, so x = 500/0.8 = 625. We must invite enough patients so that, after the refusers drop out, there will still be enough patients.

§3.3.3 Censoring: loss to follow-up, and causes of death other than the cause of interest. Many assumptions are involved in the calculations for such studies.

§3.3.4 The scale on which we estimate the effect size is important to understand. The scale can make an apparent (although not real) difference in the sample size or power calculation. Example: suppose you estimate 80% power to detect OR = 3.0 with a baseline/reference risk of 0.2. This seems like a study that does not have much power to detect a relative effect of exposure. But this OR = 3.0 corresponds to RR = 2.14, and now the study seems more powerful. It is not! The only difference is that you are expressing your effect size in terms of the RR.
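The OR-to-RR conversion used here follows from the formula for p1 above: RR = p1/p0 = OR/(1 + p0(OR − 1)). A quick check for OR = 3.0 and p0 = 0.2:

    display 3.0/(1+0.2*(3.0-1))    // RR = 2.14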
§3.3.5 How many controls per case: what should the value of r be? (r = the ratio of controls to cases, or of unexposed to exposed.) In practice, the number of cases in a case-control study is the total number available, so we can't get any more than there are. We can then increase power by increasing r (i.e., taking more controls), BUT precision does not increase much beyond r = 3 or 4.

§3.4 The Fallacy of the Post-hoc Power Calculation (see Berlin & Goodman, 1994)

Suppose σ = 10 (σ² = 100) and n = 50 subjects per group. We have done a study comparing the effects of two drugs on a continuous outcome measure with the above variance, and the result is that the difference between the means of the two (independent) groups is 4 units. We do a Z-test (known variance), with variance 100/50 = σ²/n per group, because var(A − B) = var(A) + var(B):

$Z = \dfrac{\bar x_1 - \bar x_2}{\sqrt{\frac{100}{50} + \frac{100}{50}}} = \dfrac{4}{2} = 2$

So the test (either Z or t) would barely reject Ho of no difference in means at the α = 0.05 level. But does this result mean that power was low or high?

Now suppose that the planned detectable difference was 6.0 with 80% power and α = 0.05, but after the experiment we observe a difference of 4.0, with CI = 0 to 8. This result means that you happened to observe an effect size in the sample that is lower than the true effect size in the hypothesized population. What was the power to detect a difference of 4 units, given n = 50 per group (i.e., r = 1) and σ = 10?

$Z_\beta = \dfrac{d^*}{\sigma}\sqrt{\dfrac{nr}{r+1}} - 1.96 = \dfrac{4}{10}\sqrt{\dfrac{50\cdot 1}{2}} - 1.96 = (0.4)\sqrt{25} - 1.96 = 0.04$

So power = 0.50 or 0.51. If the power was so low, how did we detect a difference? That is a meaningless question: the d* in the formula is the hypothetical mean of an alternative distribution, not an observed result. An observed result will always correspond to "power" (1 − β) < 0.5 when the finding is "not significant"; in other words, if z < +1.96, the computed power is < 0.5. So if you observe an "NS" finding, a post-hoc calculation will always say the study was underpowered. But we do not know what we will find until after the experiment.

In short, d* − d = d* and d_obs − d = d_obs (we can set d = 0); there is no place for power calculations after d_obs is observed. We must always distinguish (1) the hypothesized true (but unobserved) population from (2) the actual observed sample from that population. Each sample from the true population will differ somewhat and will have a different estimated effect size. If you hypothesized a large difference and you found only a small difference, then you are "out of luck". Too bad: your p-value will likely be > 0.05.

Ref: Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. American Statistician. 2001;55(1):19-24.
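The observed-power calculation above, in one line (difference 4, σ = 10, 50 per group):

    display normal((4/10)*sqrt(50/2)-invnormal(0.975))    // ~ 0.52

Any just-significant result (z ≈ 1.96) gives "observed power" ≈ 0.50, which is exactly why the calculation is uninformative.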
§3.5 Sample Sizes for Confidence Intervals

Sample size for a single mean or proportion. Let L = the margin of error within which you want to estimate the mean or proportion (½ × the width of the CI). Then, for the mean:

$L = z\cdot\dfrac{\sigma}{\sqrt n} = 1.96 \cdot se$, so $L^2 = \dfrac{z^2\sigma^2}{n}$ and $n = \dfrac{z^2\sigma^2}{L^2}$   (Eq 5.10)

For a proportion (e.g., a sensitivity or specificity), substitute $p(1-p)$ for $\sigma^2$:

$n = \dfrac{z^2\,p(1-p)}{L^2}$   (Eq 5.11)

where z is the standard normal value (two-sided) corresponding to the desired proportion of the time that the estimate is to be within the desired margin of error.

Example: suppose you want to estimate the proportion of people with high cholesterol (> 200 mg/dl LDL) to within 4 percentage points, and you guess that the proportion will be around 40%. Then:

$n = \dfrac{(1.96)^2(0.4)(0.6)}{(0.04)^2} = 576$

Be sure to understand the definition of L: here L is ½ the width of the confidence interval; other texts use L = the width of the entire interval. With this n, there is a 95% probability (before doing the study) that the estimate obtained will be within 4 percentage points of the population value. This calculation does not address the situation in which you want to "rule out" a true value above (or below) a particular hypothesized value (see later).

"Worst case" for proportions, when you have no idea what p will be: use p = 0.5, because the variance is maximized (0.25) when p = 0.5; for any other value of p the variance is lower. For the example:

$n = \dfrac{(1.96)^2(0.5)(0.5)}{(0.04)^2} = 601$ (not much bigger)

Suppose you wanted ±3%? Then, with p = 0.5 as the proportion used for the calculation, n = 1068 (much bigger). This is how the pollsters compute n and give you their "±" numbers: for p = 0.5 and L = ±0.03, the 95% CI is 0.5 (0.47, 0.53).

Suppose you think p will be around 0.001 and you want ±0.0005. This is a small proportion! (Cancer rates, etc., are this low.)

$n = \dfrac{(1.96)^2(0.001)(0.999)}{(0.0005)^2} = 15{,}352$, a big study.

But these calculations of the width of the confidence interval fail to consider the uncertainty of the observed point estimate (e.g., the estimated OR), even when the true OR is fixed. They assume you will be satisfied with this confidence interval wherever it is centered. The following examples show how that assumption might not hold.
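Eq 5.11 for the examples above, as one-liners:

    display (invnormal(0.975))^2*0.4*0.6/(0.04^2)    // ~ 576
    display (invnormal(0.975))^2*0.25/(0.04^2)       // ~ 601 (worst case, p = 0.5)
    display (invnormal(0.975))^2*0.25/(0.03^2)       // ~ 1068 (pollsters' +/- 3%)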
§3.5 (continued) Sample Sizes for Confidence Intervals

The same confidence interval question might be answered differently.

Question #1. Suppose you want to ensure that your estimate of sensitivity (Sn) will have a (two-sided) confidence interval of ±5 percentage points, and assume you think you will observe Sn = 0.9. How many subjects with disease do you need to produce a confidence interval of (0.85 to 0.95)? We have just seen this calculation:

$n = \dfrac{z^2\,p(1-p)}{L^2}$; if z = 1.96, p = 0.9, and L = 0.05, then n = 3.84 × 0.9 × 0.1 / 0.0025 = 138.

Question #2. Suppose you want to ensure that, whatever observed Sn you find after your experiment, you can rule out a true Sn < 0.85 by means of a 95% confidence interval around the estimate. How many subjects with disease do you need to ensure, with 80% power, that the lower confidence bound is at least 0.85? This second question is different: it can be viewed as a hypothesis test. Again, assume the true Sn = 0.9. Here is the Stata code and output for that question:

. sampsi .9 .85, power(0.8) onesample

Estimated sample size for one-sample comparison of proportion to hypothesized value

Test Ho: p = 0.9000, where p is the proportion in the population

Assumptions:
         alpha = 0.0500
         power = 0.8000
 alternative p = 0.8500  (two-sided)

Estimated required sample size: n = 316

This number is much larger. For Question #1, you assume that you will observe Sn = 0.90; all you want to know is how wide the resulting CI will be. But for Question #2 you are assuming only that the true Sn = 0.9, and the observed Sn might vary randomly around the true value. So your observed Sn might be smaller than 0.9! You must build in extra power so that, whatever you observe, the lower bound of the confidence interval will be at least 0.85. (Simulations confirm this second result.)

Correspondence between these two different questions: if, in Stata, one sets the alternative hypothesis (Ha) at the end of the confidence interval and one stipulates power = 0.5, then the sample size is the same as for Question #1, i.e., n = 138.

Question #3. Suppose you want to ensure that, whatever observed Sn you find after your experiment, you can rule out a true Sn < 0.85 and show p < 0.05. How many subjects with disease do you need to ensure, with 80% power, that the lower confidence bound of a one-sided 95% confidence interval is at least 0.85? Again, assume the true Sn = 0.9. This amounts to a one-sided, one-sample test:

. sampsi .9 .85, power(0.8) onesample onesided

Estimated sample size for one-sample comparison of proportion to hypothesized value

Test Ho: p = 0.9000, where p is the proportion in the population

Assumptions:
         alpha = 0.0500
         power = 0.8000
 alternative p = 0.8500  (one-sided)

Estimated required sample size: n = 253

Question #4. A fourth type of confidence interval problem: the predicted CI when planning experiments. (Reference: Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994;121:200-206.)

Problem: evaluating a medical treatment with a 45% cure rate. A proposed surgical alternative must have a higher cure rate, 70% or more (to offset the higher risk of surgical morbidity). Difference = 0.70 − 0.45 = 0.25 (25 percentage points). Question: if we design a study with 90% power to detect a difference of this size (or larger), what is the predicted confidence interval going to be? It will not be ±25 percentage points; it will be narrower.

Assume α = 0.05, two-sided.

Step 1: compute the sample sizes n1 and n2 for each group to achieve 90% power to detect a difference of 0.25. From Eq 5.9:

$n = \dfrac{(z_{\alpha/2} + z_\beta)^2\,\bar p(1-\bar p)(r+1)}{(d^*)^2\,r} = \dfrac{(10.507)(0.575)(0.425)(2)}{(0.25)(0.25)} = 82$

Step 2: compute the predicted confidence interval. Predicted 95% CI = observed difference ± 0.6 × Δ_0.90, where Δ_0.90 = the true difference for which there is 90% power = 0.25. So the predicted CI = observed difference ± 0.6 × 0.25 = ±0.15; the predicted CI for this problem will be 0.30 wide. Thus, if the observed difference is 0.15, the lower bound of the CI will be exactly 0.0.

The same result holds when using the alternative formula ±0.7 × Δ_0.80: given the same set of facts, if there is 80% power to demonstrate a risk difference of 0.25, then one would expect the confidence interval to be wider, namely ±0.7 × 0.25 = ±0.175. (See Goodman and Berlin 1994 for the derivation.) Why is this so? As the power increases (50%, 80%, 90%), the resulting confidence interval gets narrower, holding the observed risk difference constant.

So: (a) compute the sample size to detect a risk difference at some power level; (b) use the simple formula to predict the confidence interval. (Reference: Goodman SN, Berlin JA, as above.)
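A sketch of the two steps for Question #4 (here p̄ = (0.70 + 0.45)/2 = 0.575 and r = 1):

    * Step 1: group size for 90% power to detect d* = 0.25
    display (invnormal(.975)+invnormal(.90))^2*.575*.425*2/(.25^2)    // ~ 82 per group
    * Step 2: predicted half-width of the 95% CI
    display 0.6*0.25                                                  // +/- 0.15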
§3.6 Relative size of (a) the standard deviation and (b) the desired effect size in power and sample size calculations

Suppose, in any of these situations, you have no idea what σ² will be. There is an easy way to estimate power or sample size without stipulating the standard deviation in advance. E.g., comparing two means (from Eq 5.8):

$n = \dfrac{(Z_{\alpha/2} + Z_\beta)^2\,\sigma^2\,(r+1)}{(d^*)^2\,r}$

We can always say that we would like to detect a difference of, say, one (or 0.5, or whatever) standard deviations, e.g., d* = σ (a difference of 1 sd). Thus, for r = 1 (for example):

$n = \dfrac{(Z_{\alpha/2} + Z_\beta)^2\,\sigma^2\,(2)}{\sigma^2} = 2(Z_{\alpha/2} + Z_\beta)^2 = 2 \times 7.85 = 15.7$   (Eq 5.12)

For d* = 0.5σ and r = 1:

$n = \dfrac{2(Z_{\alpha/2} + Z_\beta)^2\,\sigma^2}{(0.5\sigma)^2} = \dfrac{2(Z_{\alpha/2} + Z_\beta)^2}{0.25} = 8(Z_{\alpha/2} + Z_\beta)^2 = 8 \times 7.85 = 62.8$

(The sample size gets big quickly.) Note that the sample size depends only on the ratio $\sigma^2/(d^*)^2$, i.e., on the sd relative to the desired difference to be demonstrated. This idea is often invoked in planning experiments, but one should still have an idea of what the standard deviation might be in the population.

Formulae differ between textbooks and sample size programs. The formula above for comparing groups is approximate (but is used in many texts). A "more exact" form, for one control per case, is (Fleiss, p. 41):

$n = \dfrac{\left(z_{\alpha/2}\sqrt{2\bar p\bar q} - z_{1-\beta}\sqrt{p_0 q_0 + p_1 q_1}\right)^2}{(p_0 - p_1)^2}$ per group, with $\bar p = \dfrac{p_1 + p_0}{2}$ (remember r = 1).

Tables are also available for common combinations of p and power. (Fleiss JL. Statistical Methods for Rates and Proportions, 2nd Edition. New York: John Wiley & Sons, Inc.; 1981:262.)

Always note, however, when using formulae from texts that each author may define the terms differently and therefore have slightly different formulae. For example, Schlesselman (p. 145):

$n = \dfrac{\left(Z_\alpha\sqrt{2\bar p\bar q} + Z_\beta\sqrt{p_1 q_1 + p_0 q_0}\right)^2}{(p_1 - p_0)^2}$

Note: this calls Z_α = +1.96, where we have used z_{α/2}. Here $p_1 = \dfrac{p_0\,OR}{1 + p_0(OR-1)}$, $\bar p = \frac{1}{2}(p_1 + p_0)$, $q_1 = 1 - p_1$, $\bar q = 1 - \bar p$, $q_0 = 1 - p_0$, and OR = the odds ratio.

A formula that is simpler than the ones above (for r = 1, i.e., two equal-sized groups), and for practical purposes equivalent, is:

$n = \dfrac{2\bar p\bar q\,(Z_\alpha + Z_\beta)^2}{(p_1 - p_0)^2}$

Corresponding to α = 0.05 (two-sided) and β = 0.10, one has Z_α = 1.96 and Z_β = 1.28, so the equation reduces to a particularly simple formula:

$n = \dfrac{21\,\bar p\bar q}{(p_1 - p_0)^2}$

We have power and sample size programs to do much of this work.
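A quick check that the "21" rule tracks Eq 5.9. Taking the Example #3 inputs but (hypothetically) with equal groups (r = 1), so p1 = 0.375, p0 = 0.25, and p̄ = 0.3125:

    display 21*0.3125*0.6875/((0.375-0.25)^2)    // ~ 289 cases (and 289 controls)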
Sample sizes needed are huge when the effect to be shown is small and the baseline risk is also small. From: Case-Control Studies: Design, Conduct, Analysis by James J. Schlesselman, New York, 1982, Oxford University Press, Appendix A. Alpha = 0.05 (two-sided). Rows are p0 (exposure prevalence among controls); columns are the odds ratio to be detected:

p0      OR=0.1  OR=0.5  OR=0.9   OR=1.1   OR=1.5  OR=2.0  OR=3.0  OR=6.0
0.01    1420    6323    201260   222890   10649   3206    1074    304
0.02    707     3174    101552   112691   5402    1632    550     158
0.04    351     1600    51726    57631    2781    846     288     85
0.05    279     1286    41773    46635    2258    689     236     70
0.10    137     658     21933    24732    1217    378     133     42
0.20    66      347     12209    14046    714     229     85      37
0.30    42      248     9206     10805    568     105     73      34
0.35    36      221     8453     10022    535     102     71      34

Consider the joint effects of (a) a small baseline risk and (b) a small true OR. If the baseline risk is small and the OR is not very different from 1.0, then the risk difference is small, i.e., d* in our notation is small, and then the $(d^*)^2$ in the denominator of our sample size formula is very small, so n is very large.

§3.7 Sample Sizes for Matched Studies

§3.7.1 Frequency matching, as in a stratified design

First we consider frequency matching. The formulae for stratified (frequency-matched) and individually matched studies are in Schlesselman (p. 159). Recall that our estimates of the MH OR are based on weighted estimates of the stratum-specific ORs; this is the corresponding method of arriving at a sample size for a stratified design. It is a way to incorporate strata, or a confounding factor, into the estimation of power or sample size.

One must specify the following parameters, assuming J strata:
1. p0j = exposure prevalence among controls in the jth stratum
2. fj = the fraction of the total observations in stratum j, where $\sum_j f_j = 1.0$
3. Type I error
4. Power
5. The assumed true effect size (RR = OR, in this case)

Assume equal numbers of cases and controls in each stratum, and a constant RR (OR) across strata (no effect modification). The required total number of "cases" is

$n = \dfrac{\left(Z_{\alpha/2} + Z_\beta\right)^2}{\sum_j f_j g_j}$   (Eq 5.13)

where (using q = 1 − p)

$g_j = \dfrac{[\ln(OR)]^2}{\dfrac{1}{p_{0j}q_{0j}} + \dfrac{1}{p_{1j}q_{1j}}}$ and $p_{1j} = \dfrac{p_{0j}(OR)}{1 + p_{0j}(OR - 1)}$.

The formula is essentially a weighted sum of $(d^*)^2$ and var(d*) from our general sample size/power formula, where ln(OR) here is the equivalent of d*.

Example: oral contraceptive use and myocardial infarction (Schlesselman, Table 6.5, p. 160). Hypothesized effect size R = 3, α = 0.05, β = 0.10 (power = 0.9).

Age     fj     p0j    p1j    gj     fj·gj
25-29   .03    .22    .46    .122   .0037
30-34   .09    .08    .21    .062   .0056
35-39   .16    .07    .18    .055   .0088
40-44   .30    .02    .06    .018   .0054
45-49   .42    .02    .06    .018   .0076
Total   1.00 (by definition)        .0311 = Σ fj·gj

fj = .42 is where we have the most cases (age category 45-49); p0j = .22 is the exposure prevalence where most of the exposure is. Then the required number of cases is

$n = \dfrac{(1.96 + 1.28)^2}{0.0311} = 328\ \text{cases}$

One must then calculate the total sample size. The reason for the frequency matching is efficiency: in the context of myocardial infarction and oral contraceptive use, most cases are in which age group? Most exposure is in which age group? They differ, so we use frequency matching to ensure adequate numbers of subjects in each of the 5 age groups.
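The g_j column is easy to verify. For the first age stratum (p0 = 0.22, p1 = 0.46, OR = 3):

    display ln(3)^2/(1/(.22*.78)+1/(.46*.54))    // ~ 0.122, as in the table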
§3.7.2 Pair-Matched Studies (Schlesselman sec. 6.6, pp. 160 ff.)

There are special methods of computing power for matched studies. We consider first the simplest situation, 1-to-1 matching; "matched" studies can also have multiple controls per case.

In paired designs, the power calculations are based on estimating the power for a given frequency of discordant pairs, and then estimating the number of discordant pairs from (a) the probability of a discordant pair and (b) the quality of the matching.

(a) The number of discordant pairs (= m) required to detect a relative risk (RR) is given by

$m = \dfrac{\left[\tfrac{1}{2}Z_{\alpha/2} + Z_\beta\sqrt{P(1-P)}\right]^2}{\left(P - \tfrac{1}{2}\right)^2}$, where $P = \dfrac{OR}{1+OR} \approx \dfrac{RR}{1+RR}$.   (Eq 5.14)

We work with the OR because it is the ratio of the frequencies of discordant pairs. (We then make the assumption that the OR is a good estimate of the RR.) Here $P = \dfrac{u_{10}}{u_{10} + u_{01}}$ in the paired-data table (see the notation in Vol I, Part 4); that is, P = the proportion of discordant pairs in the "b" cell of the McNemar table: $P = \frac{b}{b+c}$. This P must be distinguished from p0 and p1, the risks of exposure among the controls and the cases. (These calculations can be done by computer: PS, Power and Precision, or PASS, for example.)

Derivation of the sample size formula for McNemar's test [Advanced Material]:

Recall that McNemar's test is equivalent to a test of a binomial proportion, where the proportion is the fraction of discordant pairs that fall, for example, in the u10 cell of the 2-by-2 table of paired data. This was shown in Vol I, Part 4. We can use this relationship, and a version of the sample size formula we have seen before, to show the correspondence between the previous formulae and the ones specifically suited to matched-pair case-control studies. Details appear in Schlesselman (pp. 145, 161).

Ho: $P = \frac{1}{2}$ (OR = 1), where $P = \frac{u_{10}}{u_{10}+u_{01}}$ and here $m = u_{10} + u_{01}$. The standard sample size formula for a one-sample binomial test is

$n = \dfrac{\left[z_{\alpha/2}\sqrt{p_0(1-p_0)} + z_\beta\sqrt{p(1-p)}\right]^2}{(p - p_0)^2}$

and with $p_0 = \frac{1}{2}$, so that $\sqrt{p_0(1-p_0)} = \frac{1}{2}$:

$n = \dfrac{\left[\tfrac{1}{2}z_{\alpha/2} + z_\beta\sqrt{p(1-p)}\right]^2}{\left(p - \tfrac{1}{2}\right)^2}$

Letting m = n, we have derived the formula for the number of discordant pairs. Note: the denominator corresponds to the d* from before, because we have expressed the OR in terms of P, and we are essentially doing the calculation for the difference between the desired OR and OR = 1 (the null). [END OF ADVANCED MATERIAL]

(b) Estimating the proportion of discordant pairs among all pairs. We do this from our estimate of the risk of exposure in the control group. Let pe = the probability of an exposure-discordant pair and M = the total number of pairs needed to yield m discordant pairs: $M = \frac{m}{p_e}$. This probability will depend in part on the baseline risk of exposure among the controls, on the odds ratio that we are trying to demonstrate, and on the skill (or lack thereof) in selecting the matching criteria.

First, consider the baseline case of estimating what fraction of the matched pairs will be informative, i.e., what fraction will be discordant pairs. Although pe depends on the matching criteria, using the notation from McNemar's test, the matched pairs can be displayed in the following table:

           Control E+   Control E−
Case E+    u11          u10
Case E−    u01          u00
pe = Pr(exposure-discordant pair). By definition:

pe = Pr(u10) + Pr(u01)
   = Pr(E+|case) · Pr(E−|ctrl) + Pr(E−|case) · Pr(E+|ctrl)
   = p1(1 − p0) + (1 − p1)p0

Note: this is an approximation, because pe also depends on the matching criteria (which include factors other than E). We can compute p1, the proportion of exposed cases, from the OR and the value of p0, the proportion of exposed controls, using the formula for the OR (OR and p0 are stipulated values):

$p_1 = \dfrac{p_0\,OR}{1 + p_0(OR-1)}$. Then, with $q_0 = 1 - p_0$ and $q_1 = 1 - p_1$,

$M = \dfrac{m}{p_e} = \dfrac{m}{p_0 q_1 + p_1 q_0}$ = the number of pairs needed.   (Eq 5.15)

But there might be other reasons for assuming that the true percentage of usable discordant pairs is actually smaller than what we might expect.

Example: pair (1-to-1) matched study of OC use and congenital heart disease, with α = 0.05 and β = 0.1. We think p0 = 0.03, i.e., a 3% risk of exposure in the population of controls (so, a rare exposure), and we want to detect OR = 2. From the relationship between the OR and p1:

$p_1 = \dfrac{(0.03)(2)}{1 + 0.03(1)} = 0.058$, because OR − 1 = 2 − 1 = 1; and $P = \dfrac{OR}{1+OR} = \dfrac{2}{3}$, $(1-P) = \dfrac{1}{3}$, from the formula derived from McNemar's test.

$m = \dfrac{\left[\tfrac{1.96}{2} + 1.28\sqrt{\tfrac{2}{3}\cdot\tfrac{1}{3}}\right]^2}{\left(\tfrac{2}{3} - \tfrac{1}{2}\right)^2} = 90\ \text{discordant pairs, from Eq 5.14.}$

Then, to estimate the total number of pairs:

pe = Pr(discordant pair) = p0q1 + p1q0 = (0.03)(0.942) + (0.058)(0.97) = 0.028 + 0.056 = 0.084

$M = \dfrac{m}{p_e} = \dfrac{90}{0.084} = 1071\ \text{matched pairs}$

What happens with other combinations of parameters?

alpha  Za/2  power  Zb    OR   P     p0    r  m      p1     q0    q1     pe     M     PS
0.05   1.96  0.9    1.28  2    0.67  0.03  1  90.34  0.06   0.97  0.94   0.086  1046  1066
0.05   1.96  0.9    1.28  2    0.67  0.1   1  90.34  0.2    0.9   0.8    0.26   347   368
0.05   1.96  0.9    1.28  2    0.67  0.2   1  90.34  0.4    0.8   0.6    0.44   205   266
0.05   1.96  0.9    1.28  2    0.67  0.5   1  90.34  1      0.5   0      0.5    181   181
0.05   1.96  0.9    1.28  2.5  0.71  0.03  1  52.93  0.075  0.97  0.925  0.101  527   543
0.05   1.96  0.9    1.28  3    0.75  0.03  1  37.7   0.09   0.97  0.91   0.115  329   343

(Here p0 = Pr(E|ctrl) and p1 = Pr(E|case). The column labeled M reports results from these formulae; the numbers in the right-hand "PS" column are from the program PS written by DuPont and Plummer.)

So we can see that M depends heavily on the probability of exposure among the controls, as well as on the OR that one assumes is present in truth.

We are making assumptions about p0, p1, the OR, and the matching factors. If the matching is less than optimal, and we have overmatched to some extent, then the Pr(exposure) for the case and the control in each pair will tend to be similar, resulting in a larger number of "noninformative" (concordant) pairs. Recall the (hypothetical) example of matching on use of Coffee-mate in the association of pancreatic cancer and coffee. The program by DuPont and Plummer allows the user to adjust for this correlation of exposure.
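The worked example, end to end, assuming the same inputs (OR = 2, p0 = 0.03, two-sided α = 0.05, power = 0.9):

    scalar P  = 2/3
    scalar m  = (invnormal(.975)/2+invnormal(.90)*sqrt(P*(1-P)))^2/(P-1/2)^2
    scalar pe = .03*(1-.058) + .058*(1-.03)
    display "m = " m "   M = " m/pe    // m ~ 90.3; M ~ 1069 (the notes' 1071 uses rounded inputs)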
We can reverse this process and estimate power for a given number of discordant pairs (ref: Schlesselman, p. 162):

$z_\beta = \dfrac{\sqrt{m}\left(P - \tfrac{1}{2}\right) - \tfrac{1}{2}z_{\alpha/2}}{\sqrt{P(1-P)}}$   (Eq 5.16)

where power = Pr(Z ≤ z_β) and m is the number of discordant pairs (as before). So z_β = 1.28 is equivalent to power = 0.9.

Notes:
1. One can estimate m from M by m = M·pe.
2. Better, estimate pe from preliminary data, or revise it after initial data collection.
3. We have looked at case-control studies (because that is where matching is more common), but this framework can also apply to cohort studies.

§3.7.3 Matched studies with more than one control per case (or, in the instance of cohort studies, more than one unexposed per exposed)

The same principles apply to these more complex designs. In these instances, there are several paired tables per matched set, each table representing the cross-classification of pairing for the case with each of the controls. (So, if there are 3 controls per case, one can think of a set of 3 tables of paired comparisons.)

(1) A simple adjustment: let c = the number of controls per case, and let n be the number of cases assuming 1-to-1 matching. Then with c-to-1 matching, one needs n1 cases, where n1 = (c + 1)n / (2c). Thus, if one needed 1050 cases (and 1050 controls) and one then selected 2 controls per case, the new number of cases = (2+1)(1050)/(2·2) = 3(1050)/4 = 788, and the number of controls = 1576. This approximation is good in many cases, but falls apart when the probability of exposure of a sampled control is low.

(2) More complex methods: better approximations are available. The programs of DuPont and Plummer (PS) use an estimate of the correlation of the exposure status between a case and its matched controls. The formula we have seen (Schlesselman) assumes no correlation; DuPont and Plummer generalized this formula for multiple controls per case AND for the possibility of some correlation.

[Aside: you can think of the correlation in terms of two columns of data, where a 1 indicates exposed, a 0 indicates unexposed, and each row is a matched pair (or one of a set of matched pairs):

Case  Control
1     1
1     0
0     1
0     0
1     0

Then the correlation is simple to obtain using the standard formula. A good starting guess is corr = 0.2.] As the correlation increases, the sample size (number of cases) increases.

Effect of additional matches and correlation on sample sizes. When we add controls per case in a matched study, the number of cases needed drops, but the total number of patients increases (assuming the same OR, power, alpha, p0, and p1 as in our example; calculations from the program PS):

Controls per case   Case patients   Total patients
1                   1066            2132
2                   782             2346
3                   688             2752
4                   641             3205

Effect of correlation on sample sizes (using the same example):

Corr   Case patients
0      1066
0.1    1230
0.2    1437
0.3    1705

Correlation might occur when matching is less than optimal. Available software: PS, PASS. Reference: DuPont WD. Power calculations for matched case-control studies. Biometrics. 1988;44:1157-68.

§3.8 Miscellaneous Comments on Sample Size Calculations

1. Power to detect interactions. "Interaction" takes on different meanings in different applications. The safest way to do a power calculation for interaction is via simulation: stipulate the OR of exposure and outcome in the reference group, and then the ratio of odds ratios (ROR) between the OR (of exposure and outcome) in the comparison group or groups and the reference group. Good software for this application is hard to find; a simulation sketch follows.
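A minimal simulation sketch in Stata, with all inputs hypothetical (baseline risk 0.10, exposure prevalence 0.25, reference-group OR = 2, ROR = 2, 2000 subjects per replicate); the share of replicates rejecting the interaction test estimates the power:

    capture program drop onerep
    program define onerep, rclass
        drop _all
        set obs 2000
        generate g  = runiform() < .5                 // comparison-group indicator
        generate e  = runiform() < .25                // exposure
        generate p  = invlogit(logit(.1) + ln(2)*e + ln(2)*e*g)
        generate y  = runiform() < p                  // outcome
        generate eg = e*g                             // interaction term
        logit y e g eg
        test eg                                       // Wald test of the interaction
        return scalar reject = (r(p) < .05)
    end
    set seed 1234
    simulate reject=r(reject), reps(500) nodots: onerep
    summarize reject                                  // mean of reject = estimated power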
2. For the special application of gene-environment interaction, there is a program from NCI (Power.exe). The reason for using a special program: assume that among females OR = 2.0 (1.2 to 3.5) and among males OR = 1.3 (0.8 to 2.1). Then the OR for females is significantly different from 1.0, but one might still want to test whether the OR for females > the OR for males. (Contact: Ms. Holly Brown ([email protected]).)

References:
Lubin JH, Gail MH. On power and sample size for studying features of the relative odds of disease. Am J Epidemiol 1990;131:552-566.
Garcia-Closas M, Lubin JH. Power and sample size calculations in case-control studies of gene-environmental interactions: comments on different approaches. Am J Epidemiol 1999;149:689-693.

3. Adjustments to the sample sizes from programs. Most programs give the power and sample size for the simple case of perfect data, but there are other problems that can make the power computed under assumptions of simple random sampling too high. Adjustments are needed for such common problems as:
- Measurement error
- Loss to follow-up
- Lack of independence of observations (clustering; complex)
- Repeated measures (this requires material beyond the level of this course)

4. Covariates. Covariates will tend to improve power by reducing variance, if the modeling is done correctly. As a rule of thumb, estimate power for the simple case of no covariates and assume that the introduction of covariates to adjust for confounding will improve power.

End of Vol I, Part 5. End of Volume I.

Copyright © 2006, Trustees of the University of Pennsylvania