7/9/2014 Lecture Topics Institute for Clinical and Translational Research Introduction to Clinical Research Power and Sample Size: Considerations in Study Design Nae-Yuh Wang, PhD Associate Professor of Medicine, Biostatistics & Epidemiology July 16, 2014 ACKNOWLEDGEMENT • John McGready Department of Biostatistics Bloomberg School of Public Health 2 LECTURE TOPICS • Sample size and statistical inferences • Factors influencing sample size and power • Sample size determination when comparing two means • Sample size determination when comparing two proportions 3 1 7/9/2014 Institute for Clinical and Translational Research Introduction to Clinical Research Section A Sample Size and Statistical Inferences WHY IS SAMPLE SIZE RELEVANT • A small sample will not allow precise estimates, therefore could not tell whether observed difference is real or due to variability. We may call things equal when we shouldn’t • A very large sample size will allow us to detect tiny differences that have no clinical significance. Waste of research resources • A very large sample size may not be large enough if the event rate is very small • Sometimes the available sample size is fixed and we have to decide what we will be able to do with such a sample 5 INTERVAL ESTIMATION Interval estimation: using sample mean and SE (SD of the sampling distribution) to construct 95% confidence interval to infer the population mean • Sampling distribution becomes more and more like a normal (Gaussian) distribution with increasingly larger N • 95% CI = sample mean +/- 1.96*SE = sample mean +/- 1.96*SD/√N 6 2 7/9/2014 WHY IS SAMPLE SIZE RELEVANT ESTIMATION Estimation: What is the prevalence of obesity among adults in Baltimore City? • How precise do we want our estimate to be? • 95% CI = p +/- 1.96*SE(p) = p +/- 1.96*SD/√n = p +/- 1.96*√p(1- p)/n • Larger sample leads to more precise estimate, but it requires more resources. 7 ESTIMATION PRECISION • Estimate the percentage of adults who are obese in such way that I have 95% confidence that the true percentage is within +/- 10% of my estimate. • Define parameters of variables such as adults (e.g. age range), obese (e.g. BMI > 30 kg/m2), etc. • Use 95% confidence interval as measure of estimation precision. • Set acceptable precision level e, to be equated to the 95% confidence limit, 1.96 x SE 8 ESTIMATION PRECISION e 1.96 SD/ n n (1 .96 )2SD 2 / e 2 Commonly seen formula: n = 4 pq / e2 n: size of sample e: level of acceptable precision p: expected prevalence value q =1 - p SD2 = Variance = pq 4: constant equivalent to 95% level of confidence 9 3 7/9/2014 ESTIMATION PRECISION • Say p = .30 e = .1 (from literature or pilot data) (determined subjectively, level of minimal clinical significance) n = 4(.3)(.7)/(.1)2 = .84/.01 = 84 • If wanted precision of +/- 5% e = .05 n = 4(.3)(.7)/(.05)2 = .84/.0025 = 336 • Greater estimation precision larger sample size 10 ESTIMATION PRECISION • Estimating the difference in a single parameter (e.g. mean) between 2 populations • Example: What is the difference in bone mineral density (lumber spine) between European and African American women, age 25-34 years of age? 11 ESTIMATION PRECISION • Commonly used formula: n = 8 SD2 / e2 = 4 (2 SD2) / e2 n = size of sample for each population e = level of acceptable precision SD = population standard deviation, SD2 the variance 2 SD2 reflects the fact that statistical inferences on difference of two independent populations subject to two sources of variability 8 = constant equivalent to 95% confidence level and reflecting 2 sources of error ** Assumption: the populations we are comparing have the same variance, i.e. 12 SDB2 = SDW2 = SD2 4 7/9/2014 ESTIMATION PRECISION • Use formula: n = 8 SD2 / e2 Say SD = 1 (based on literature or pilot data) e = .5 (i.e. tolerable error = +/- 0.5 SD) n = 8 (1/.5)2 = 8(4) = 32 (per group) SD = 2 e = .25 (i.e. tolerable error = +/- 0.125 SD) n = 8 (2/.25)2 = 8(8)2 = 512 (per group) • Greater population variability larger sample size • Greater estimation precision larger sample size • Higher confidence level (smaller a error) larger sample size 13 EXAMPLE • Consider the following results from a study done on 29 women, all 35–39 years old Sample Data n Mean SBP SD of SBP OC users 8 132.8 15.3 Non-OC Users 21 127.4 18.2 14 EXAMPLE • Hypothesis: OC use is associated with higher blood pressure • Statistically speaking we are interested in testing : H0:µOC = µNO OC HA: µOC ≠ µNO OC H0: µOC - µNO OC = 0 HA: µOC - µNO OC ≠ 0 represents population mean SBP for OC users µOC µNO OC represents population mean BP for women not using OC 15 5 7/9/2014 STATISTICAL POWER TRUTH DECISION H0 Type I Error Reject H0 alpha-level HA Power 1-beta Not Type II Error Reject H0 beta 16 EXAMPLE • The sample mean difference in blood pressures is 132.8 – 127.4 = 5.4 • This could be considered scientifically (clinically) significant, however, the result is not statistically significant (or even close to it!) at the = 0.05 level • Suppose, as a researcher, you were concerned about detecting a population difference of this magnitude if it truly existed • A study based on 29 women has low power to detect a difference of such magnitude 17 STATISTICAL POWER • Power is a measure of “doing the right thing” when HA is true! • Higher power is better (smaller type II error) • We can calculate power for a given study if we specify a specific HA • An OC/Blood pressure study based on 29 women has power of .13 (13%) to detect a difference in BP of 5.4 mmHg or more, if this difference truly exists between the populations of 35-39 years old women who are using and not using OC • When power is this low, it is difficult to determine whether there is really no difference in population means or we just call it wrong 18 6 7/9/2014 STATISTICAL POWER • Recall, the sampling behavior of x1 x 2 is normally distributed (large samples) with this sampling distributed centered at true mean difference. If H0 is true, then curve is centered at 0 19 o: 1-2= 0 HH 0: µ1 - µ2 = 0 STATISTICAL POWER • Recall, the sampling behavior of x1 x 2 is normally distributed (large samples) with this sampling distributed centered at true mean difference. If HA truth, then curve is centered at some value d, d≠0 HA: 1-2 = d 20 STATISTICAL POWER • H0 will be rejected (for α = 0.05) if the sample result, x1 x 2 , is more than 2 standard errors away from 0, either above or below H0H: oµ: 11-- µ22== 00 21 7 7/9/2014 STATISTICAL POWER • H0 will be rejected (for α = 0.05) if the sample result, x1 x 2 , is more than 2 standard errors away from 0, either above or below H0: µ1 - µ2 = 0 HA: µ1 - µ2 = d 22 WHY IS SAMPLE SIZE RELEVANT HYPOTHESIS TESTING Does the sample better support a distribution under H0 or H1 ? Increased n H1 New TX better H0 No difference H1 H0 23 New TX better No difference Institute for Clinical and Translational Research Introduction to Clinical Research Section B Factors Influencing Sample Size and Power 8 7/9/2014 HYPOTHESIS TESTING Does a clinically significant effect exist between a treated group and a non treated or differently treated group? 1-sided test: n / group = (Z1-α + Z1-β)2 2 (SD/Δ)2 2-sided test: n / group = (Z1-α/2 + Z1-β)2 2 (SD/Δ)2 25 SAMPLE SIZE / POWER CALCULATION BASIC ELEMENTS n , Zα , Zβ , SD2 , Δ2 • Smaller α error larger sample size • Smaller β error (higher power) larger sample size • Higher variability in outcome larger sample size • Smaller Δ difference to detect larger sample size 26 HYPOTHESIS TESTING Use formula: n / group = (Z1-α + Z1-β)2 (2 SD2 ) / Δ2 where Z1-α = normal deviate for a error (two sided test uses Z1-α/2) Z1-β = normal deviate for b error 2 indicates two groups (sources of variability) i.e. treatment group; placebo group Δ = least difference between treated (estrogen) group and non treated group (placebo) that is considered clinically significant n = sample size of each group 27 9 7/9/2014 HYPOTHESIS TESTING Given formula: n = (Z1-α/2 + Z1-β)2 (2 SD2 ) / Δ2 Then n = (1.96 + 0.84)2 2 (.65/.30)2 = (15.68) (4.69) = 73.6 => 74 Allow for 20% random dropout: nc = 74 / .8 = 92.5 Need 93 per arm for 80% power to detect a .3 difference between group with 20% random dropouts .8nc = 74 90% power use n = (1.96 + 1.28)2 2 (.65/.30)2 = 98.6 28 WHAT INFLUENCES POWER? • In order to INCREASE power for a study comparing two group means, we need to do the following: 1. Change the hypothesized values of µ1 and µ2 so that the difference ( µ1 - µ2 ) is bigger 29 H0: µ1 - µ2 = 0 HA: µ1 - µ2 = d WHAT INFLUENCES POWER? • In order to INCREASE power for a study comparing two group means, we need to do the following: 1. Change the hypothesized values of µ1 and µ2 so that the difference ( µ1 - µ2 ) is bigger H0: µ1 - µ2 = 0 30 HA: µ1 - µ2 = d’ 10 7/9/2014 WHAT INFLUENCES POWER? • In order to INCREASE power for a study comparing two group means, we need to do the following: 2. Increase the sample size in each group 31 H0: µ1 - µ2 = 0 HA: µ1 - µ2 = d WHAT INFLUENCES POWER? • In order to INCREASE power for a study comparing two group means, we need to do the following: 2. Increase the sample size in each group 32 H0: µH1o: -1-µ22= =0 0 HAH: Aµ: 11--2 µ=2d = d WHAT INFLUENCES POWER? • In order to INCREASE power for a study comparing two group means, we need to do the following: 3. Increase the -level (functionally speaking, make the test “easier to reject”) with = 0.05 33 H0: µ1 - µ2 = 0 HA: µ1 - µ2 = d 11 7/9/2014 WHAT INFLUENCES POWER? • In order to INCREASE power for a study comparing two group means, we need to do the following: 3. Increase the -level (functionally speaking, make the test “easier to reject”) with = 0.10 34 H0: µ1 - µ2 = 0 HA: µ1 - µ2 = d SAMPLE SIZE / POWER CALCULATION BASIC ELEMENTS n , Zα , Zβ , SD2 , Δ2 • • • • increased α error higher power Larger sample size higher power Less variability (SD) in outcome higher power Larger Δ difference to detect higher power 35 SAMPLE SIZE / POWER CALCULATION OTHER CONSIDERATIONS When the outcome of interest is before-after change of a certain outcome measure, and D denotes the between group difference of these changes: • The SD used in the calculation should be the SD for the before-after change, not the SD for the crosssectional outcome measure • Assuming the SDbefore and SDafter = σ, then SDchange = σ x √ 2(1-ρ) where ρ is the correlation between the outcome measures before and after • Literature usually report SDbefore but not ρ 36 12 7/9/2014 POWER & STUDIES • Power is computed to inform study design, and calculation of post-hoc power after a study is completed does not provide additional information to the study results • Can only be computed for specific HA’s: i.e. this study had XX% to detect a difference in population means of YY or greater. • Frequently, in study design, a required sample size is computed to actually achieve a certain preset power level to find a “Clinically/scientifically” minimal important difference in means • “Industry standard” for power: 80% (or greater) 37 Institute for Clinical and Translational Research Introduction to Clinical Research Section C Sample Size Calculations for Comparing Two Means EXAMPLE • Blood pressure and oral contraceptives – Suppose we used data from the example in Section A to motivate the following question: – Is oral contraceptive use associated with higher blood pressure among individuals between the ages of 35–39? 39 13 7/9/2014 PILOT STUDY • Recall, the data: Sample Data n Mean SBP SD of SBP OC users 8 132.8 15.3 Non-OC Users 21 127.4 18.2 40 PILOT STUDY • 5 mmHg difference in BP is too large to ignore • Need a bigger study that has ample power to detect such a difference, should it really exist in the population • Task: determine sample sizes needed to detect a 5 mmHg increase in blood pressure in O.C. users with 80% power at significance level α = 0.05 – Using pilot data, we estimate that the standard deviations are 15.3 and 18.2 in O.C. and non-O.C. users respectively 41 PILOT STUDY • Here we have a desired power in mind and want to find the sample sizes necessary to achieve a power of 80% to detect a population difference in blood pressure of five or more mmHg between the two groups 42 14 7/9/2014 PILOT STUDY • We can find the necessary sample size(s) of this study if we specify . – α-level of test (.05) – Specific values for μ1 and μ2 (specific HA) and hence d = μ1 - μ2: usually represents the minimum scientific difference of interest) – Estimates of σ1 and σ2 – The desired power(.80) 43 PILOT STUDY • How can we specify d = μ1 - μ2 and estimate population SDs? – Researcher knowledge—experience makes for good educated guesses – Make use of pilot study data! 44 PILOT STUDY • Fill in blanks from pilot study – -level of test (.05) – Specific HA (μOC=132.8, μNO OC =127.4), and hence d = μ1 - μ2 = 5.4 mmHg – – Estimates of σOC ( = 15.3) and σNO OC (=18.2) The power we desire (.80) 45 15 7/9/2014 PILOT STUDY • Given this information, we can use Stata to do the sample size calculation • “sampsi” command – Command syntax (items in italics are numbers to be supplied by researcher) sampsi 1 2, alpha() power(power) sd1(1) sd2(2) 46 STATA RESULTS • Blood Pressure/OC example Total N = 306 47 PILOT STUDY/STATA RESULTS • Our results from Stata suggest that in order to detect a difference in BP of 5.4 mmHg (if it really exists in the population) with high (80%) certainty, we would need to enroll 153 O.C. users and 153 non-users • This assumed that we wanted equal numbers of women in each group! 48 16 7/9/2014 PILOT STUDY/STATA RESULTS • Suppose we estimate that the prevalence of O.C. use amongst women 35–39 years of age is 20% – We wanted this reflected in our group sizes • If O.C. users are 20% of the population, non-OC users are 80% – There are four times as many non-users as there are users (4:1 ratio) 49 PILOT STUDY/STATA RESULTS • We can specify a ratio of group sizes in Stata – Again, using “sampsi” command with “ratio” option sampsi 1 2, alpha() power(power) sd1(1) sd2(2) ratio(n2/n1) 50 STATA RESULTS • Blood Pressure/OC example Total N = 430 51 17 7/9/2014 Institute for Clinical and Translational Research Introduction to Clinical Research Section D Sample Size Determination for Comparing Two Proportions POWER FOR COMPARING TWO PROPORTIONS • Same ideas as with comparing means 53 PILOT STUDY • We can find the necessary sample size(s) of this study if we specify . – α-level of test – Specific values for p1 and p2 (specific HA) and hence d = p1 - p2: usually represents the minimum scientific difference of interest) – The desired power 54 18 7/9/2014 PILOT STUDY • Given this information, we can use Stata to do the sample size calculation • “sampsi” command – Command syntax (items in italics are numbers to be supplied by researcher) sampsi p1 p2, alpha() power(power) 55 ANOTHER EXAMPLE • Two drugs for treatment of peptic ulcer compared (Familiari, et al., 1981) – The percentage of ulcers healed by pirenzepine (drug A) and trithiozine (drug B) was 77% and 58% based on 30 and 31 patients respectively (p-value = .17), 95% CI for difference in proportions healed ( p DRUG A - p DRUG B ) was (-.04, .42) – The power to detect a difference as large as the sample results with samples of size 30 and 31 respectively is only 25% 56 Healed Not Healed Drug A 23 7 Total 30 Drug B 18 13 31 Continued EXAMPLE • As a clinician, you find the sample results intriguing – want to do a larger study to better quantify the difference in proportions healed • Redesign a new trial, using aforementioned study results to “guestimate” population characteristics – Use pDRUG A = .77 and pDRUG B = .58 – 80% power – = .05 • Command in Stata sampsi .77 .58, alpha (.05) power (.8) 57 19 7/9/2014 EXAMPLE • Command in Stata sampsi .77 .58, alpha (.05) power (.8) 58 EXAMPLE • What would happen if you change power to 90%? sampsi .77 .58, alpha (.05) power (.9) 59 EXAMPLE • Suppose you wanted two times as many people on trithiozone (“Drug B”) as compared to pirenzephine (“Drug A”) – Here, the ratio of sample size for Group 2 to Group 1 is 2.0 • Can use “ratio” option in “sampsi” command 60 20 7/9/2014 Example • Changing the ratio sampsi .77 .58, alpha (.05) power (.9) ratio(2) 61 SAMPLE SIZE FOR COMPARING TWO PROPORTIONS • A randomized trial is being designed to determine if vitamin A supplementation can reduce the risk of breast cancer – The study will follow women between the ages of 45–65 for one year – Women were randomized between vitamin A and placebo • What sample sizes are recommended? 62 BREAST CANCER/VITAMIN A EXAMPLE • Design a study to have 80% power to detect a 50% relative reduction in risk of breast cancer w/vitamin A p VITA .50 ) (i.e. RR p PLACEBO using a (two-sided) test with α-level = .05 • To get estimates of proportions of interest: - using other studies, the breast cancer rate in the controls can be assumed to be 150/100,000 per year 63 21 7/9/2014 BREAST CANCER/VITAMIN A EXAMPLE • A 50% relative reduction: if pVITA .50 then, pVIT A .50 p PLACEBO RR p PLACEBO So, for this desired difference in the relative scale: pVIT A 150 0.5 100,000 75 100,000 Notice, that this difference on the absolute scale, pVIT Asmaller p PLACEBOin magnitude: , is much 64 75 0.00075 0.075% 100,000 BREAST CANCER SAMPLE SIZE CALCULATION IN STATA • “sampsi” command sampsi .00075 .0015, alpha(.05) power(.8) 65 BREAST CANCER SAMPLE SIZE CALCULATION IN STATA • You would need about 34,000 individuals per group • Why so many? – Difference between two hypothesized proportions is very small: pVIT A p PLACEBO = .00075 We would expect about 50 cancer cases among the controls and 25 cancer cases among the vitamin A group 150 34,000 51 100,000 75 vitamin A : 34,000 25 100,000 placebo : 66 22 7/9/2014 BREAST CANCER/VITAMIN A EXAMPLE • Suppose you want 80% power to detect only a 20% (relative) reduction in risk associated with vitamin A p A 20% relative reduction: if RR p VITA .80 then PLACEBO pVIT A .80 p PLACEBO So, for this desired difference in the relative scale: pVIT A 150 100,000 0.8 120 100,000 Notice, that this difference on the absolute scale, pVITsmaller , is much in magnitude: A p PLACEBO 67 30 0.0003 0.03% 100,000 BREAST CANCER SAMPLE SIZE CALCULATION IN STATA • “sampsi” command sampsi .0012 .0015, alpha(.05) power(.8) 68 BREAST CANCER—VITAMIN A EXAMPLE REVISITED • You would need about 242,000 per group! • We would expect 360 cancer cases among the placebo group and 290 among vitamin A group 69 23 7/9/2014 AN ALTERNATIVE APPROACH— DESIGN A LONGER STUDY • Proposal – Five-year follow-up instead of one year Here: pVITA≈5×.0012 = .006 pPLACEBO≈5×.0015 = .0075 • Need about 48,400 per group – Yields about 290 cases among vitamin A and 360 cases among placebo • Issue 70 – Loss to follow-up Institute for Clinical and Translational Research Introduction to Clinical Research Section E More Examples HYPOTHESIS TESTING EXAMPLE Johns Hopkins Practice-based Opportunity for Promotion of Weight Reduction (POWER) Trial: • a comparative effectiveness trial that tested two practical behavioral weight loss interventions in obese medical outpatients with cardiovascular risk factors 72 24 7/9/2014 HYPOTHESIS TESTING EXAMPLE 73 HYPOTHESIS TESTING EXAMPLE Study Design Randomization Control Remote In-Person Baseline 6 Mo 12 Mo = Measured weights and other outcomes 24 Mo 74 HYPOTHESIS TESTING EXAMPLE Hypotheses: • Primary: – The Remote arm will produce 2.75kg greater mean weight loss over the 24 month trial period than the Control arm – The In-Person arm will produce 2.75kg greater mean weight loss over the 24 month trial period than the Control arm • Secondary: – In-Person vs. Remote, other weight outcomes (% weight loss, % reached goal), other health outcomes… 75 25 7/9/2014 HYPOTHESIS TESTING EXAMPLE Considerations for selecting the primary hypotheses in POWER trial • Weight loss can be quantitatively measured with high precision, easy interpretation • Continuous outcomes typically allow greater statistical power • Weight loss as small as 2.75 kg has been shown to produce health benefits, so clinically relevant • Difference in intervention effects between Remote vs In-Person not clear but expected to be smaller • Type-I error adjustment for multiple comparisons 76 SAMPLE SIZE / POWER CALCULATION OTHER CONSIDERATIONS Back to the POWER trial example: Minimum Detectable Difference (MDD) in Kg Minimum Detectable Difference A B α=0.025 α=0.05 Sample c 18m ∆ σ VIF ρ β=0.80 β=0.90 β=0.80 β=0.90 size Baseline 300 103.83 -3.41 16.53 1.0 0.85 2.68 3.10 3.60 4.15 (100/group) 16.53 1.5 0.85 2.75 3.27 4.40 5.05 350 16.53 1.0 0.85 2.48 2.90 3.30 3.85 16.53 1.5 0.85 3.05 3.55 4.05 4.70 (117/group) 400 16.53 1.0 0.85 2.33 2.70 3.10 3.60 (133/group) 16.53 1.5 0.85 2.85 3.30 3.80 4.40 A B C Power if either of 2 tests is positive at α=0.05/2; Power for single test at α=0.05; Baseline, 18m ∆ (= change from EST+DASH intervention in PREMIER, net of control), and rho(ρ, correlation between baseline and 18m), all from PREMIER trial unless otherwise indicated. 77 SAMPLE SIZE / POWER CALCULATION OTHER CONSIDERATIONS Then, budget planning indicated that we could afford N = 390 (130/group), so the table evolved to: Net tx effect in Kg 2.00 2.50 2.75 3.00 3.25 3.50 4.00 4.50 α=0.025A (at least 1 of 2 significant) 0.52 0.72 0.80 0.87 0.92 0.95 0.99 1.00 α=0.025B (1 of 1 significant) 0.31 0.47 0.56 0.64 0.72 0.78 0.89 0.95 α=0.05C (1 of 1 significant) 0.41 0.58 0.66 0.74 0.80 0.86 0.93 0.97 78 26 7/9/2014 SAMPLE SIZE / POWER CALCULATION OTHER CONSIDERATIONS Recall POWER has 2 primary hypotheses to test: H0: Remote - Control = 0; H1: Remote - Control > 2.75kg H0’: In-Person - Control = 0; H1’: In-Person - Control > 2.75kg If each test is controlled at α = 0.05, then the prob. for both tests to be type-I error free would be (0.95)(0.95) = 0.9025 So the prob. for at least 1 test to encounter type-I error would be 1 - 0.9025 = .0975 > 0.05 => Multiple comparisons inflate overall α error 79 SAMPLE SIZE / POWER CALCULATION OTHER CONSIDERATIONS If each of the 2 tests is controlled at α = 0.05/2 = 0.025, then the prob. for both independent tests to be type-I error free would be (0.975)(0.975) = 0.950625 So the prob. for at least 1 test to encounter type-I error would be 1 - 0.950625 = .049375 < 0.05 Bonferroni adjustment for multiple comparisons: if we are testing k hypotheses and want to control the overall α at the nominal 0.05 level, then we should test each at α = 0.05/k level 80 SAMPLE SIZE / POWER CALCULATION OTHER CONSIDERATIONS Bonferroni correction is conservative in terms of statistical power, requires greater sample size Holm method: • Compare the smaller p-value of the 2 tests against α = 0.05/2 = 0.025. If significant, then compare the larger p-value against α = 0.05. If the first test is not significant, then declare both tests as non-significant. • Family-wise type I error rate < 0.05 • Provide better statistical power then the Bonferroni correction 81 27 7/9/2014 SAMPLE SIZE / POWER CALCULATION OTHER CONSIDERATIONS Then, budget planning indicated that we could afford N = 390 (130/group), so the table evolved to: Net tx effect in Kg 2.00 2.50 2.75 3.00 3.25 3.50 4.00 4.50 α=0.025A (at least 1 of 2 significant) 0.52 0.72 0.80 0.87 0.92 0.95 0.99 1.00 α=0.025B (1 of 1 significant) 0.31 0.47 0.56 0.64 0.72 0.78 0.89 0.95 α=0.05C (1 of 1 significant) 0.41 0.58 0.66 0.74 0.80 0.86 0.93 0.97 82 HYPOTHESIS TESTING EXAMPLE 83 HYPOTHESIS TESTING EXAMPLE 84 28 7/9/2014 HYPOTHESIS TESTING EXAMPLE Hypotheses: – Ca + Vit D arm will have 22% reduction of colon CA incidence over average 8 years of trial period than the Placebo arm Log-rank test at α = 0.05, with incidence rate of 0.2% per year in the placebo arm, and pre-fixed N = 35,000 Plug these assumed parameters into the calculator => power = 81% (83% stated in the paper) 85 SUMMARY Sample size / power evaluation is a critical part of study design for hypothesis testing • Smaller α error larger sample size • Smaller β error (higher power) larger sample size • Higher variability in outcome larger sample size • Smaller difference to detect larger sample size Need to account for type I error when testing multiple hypotheses 86 Institute for Clinical and Translational Research Introduction to Clinical Research Section F Sample Size and Study Design Principles: A Brief Summary 29 7/9/2014 DESIGNING YOUR OWN STUDY • When designing a study, there is a tradeoff between : – Power – -level – Sample size – Minimum detectable difference (specific HA) • Industry standard—80% power, = 0.05 88 DESIGNING YOUR OWN STUDY • What if sample size calculation yields group sizes that are too big (i.e., cannot afford or will take too long to do the study) or are very difficult to recruit enough participants for the study? – Increase minimum difference of interest – Increase -level – Decrease desired power – Select a more precise primary outcome 89 DESIGNING YOUR OWN STUDY • Sample size calculations are an important part of study proposal – Study funders want to know that the researcher can detect a relationship with a high degree of certainty (should it really exist) • Even if you anticipate confounding factors, these approaches are the best you can do and are relatively easy • Accounting for confounders requires more information and sample size has to be done via computer simulation—consult a statistician! 90 30 7/9/2014 DESIGNING YOUR OWN STUDY • What is this specific alternative hypothesis? – Power can only be calculated for a specific alternative hypothesis – When comparing two groups this means providing value of the true population mean (proportion) for each group 91 DESIGNING YOUR OWN STUDY • What is this specific alternative hypothesis? – Therefore specifying a difference between the two groups, typically based on the smallest difference of scientific / clinical interest – This difference is frequently called effect size – When sample size, power and other parameters are fixed for a study, this difference derived from the calculation is frequently called minimum detectable difference – Typically aiming for a sample size that would afford sufficient power to detect the minimum detectable difference that covers the smallest difference of 92 scientific / clinical interest DESIGNING YOUR OWN STUDY • Where does this specific alternative hypothesis come from? – Hopefully, not the statistician! – As this is generally a quantity of scientific interest, it is best estimated by a knowledgeable researcher or pilot study data – This is perhaps the most difficult component of sample size calculations, as there is no magic rule or “industry standard” 93 31 7/9/2014 CALCULATING POWER • In order to calculate power for a study comparing two population means, we need the following: – Sample size for each group – Estimated (population) means, μ1 and μ2 for each group—these values frame a specific alternative hypothesis (usually minimum difference of scientific interest) – Estimated (population) SD’s, σ1 and σ2 – -level of the hypothesis test 94 OTHER CONSIDERATIONS Obtaining high quality data through best possible study design and rigorous research conduct are critical to the success of clinical investigations Work with a biostatistician starting at the study design stage 95 32
© Copyright 2024