V. Sample size

Outline
• What is needed to determine sample size?
• Sample size calculation for:
  – Proportions
  – Means
  – Time-to-event
  – Noninferiority
• Adjustments

Sample Size
Question: How many patients do I need?
Answer: It depends.

It depends
• Study objective
  – Phase I, II, III
  – Test for a difference (superiority)
  – Test for equivalence (non-inferiority)
• How treatment effects are measured
  – Proportions (e.g. response rate)
  – Time-to-event endpoint (e.g. survival)
  – Continuous outcome (e.g. blood pressure)
• Size of the effect, or tolerance of the confidence interval
• Chances of reaching wrong conclusions

It depends
• Based on:
  – Objectives
  – Hypothesis
  – Primary endpoint
  – Design
  – Method of analysis
    • Test procedure
    • Confidence interval

Consider how results will be analyzed
P-values
• A test of significance of H0, based on the distribution of a test statistic assuming H0 is true
• It is: the probability of observing what we observed, or something more extreme, if H0 is true
• It is not: the probability that H0 is true
Confidence interval
• A range of plausible values for the population parameter
• It is: an interval constructed so that it includes the parameter in 95% of trials
• It is not: an interval that contains the parameter with 95% probability
Sample size depends on the planned test procedure or the confidence interval of interest.

Where do we begin?
N = (Total budget / Cost per patient)?  Hopefully not!

Sample Size Calculations
• Formulate a PRIMARY question or hypothesis to test (or determine what you are estimating). Write down H0 and HA.
• Determine the endpoint. Choose an outcome measure. How do we "measure" or "quantify" the responses?
• Envision the analysis.

Truth table

                      Truth
Test result           H0 true             H0 false
Reject H0             Type I error (α)    Correct
Do not reject H0      Correct             Type II error (β)

What is Needed to Determine the Sample Size?
• α
  – Up to the investigator (often 0.05)
  – How much type I error can you afford?
• 1 − β (power)
  – Up to the investigator (often 80%–90%)
  – How much type II error can you afford?

What is Needed to Determine the Sample Size?
• Choosing α and β: weigh the cost of a type I error against the cost of a type II error.
• In early-phase clinical trials we often do not want to "miss" a real effect, so we may design for higher power (perhaps 90%) and consider relaxing the α error (perhaps to 0.10).
• To approve a new drug, the FDA requires significance in two Phase III trials strictly designed with α error no greater than 0.05 (power = 1 − β is often set to 80%, but the FDA does not have a rule about power).

Sample Size Calculations
• The idea is to choose a sample size such that both of the following conditions hold:
  1. If the null hypothesis is true, then the probability of incorrectly rejecting it is (no more than) α.
  2. If the alternative hypothesis is true, then the probability of correctly rejecting it is (at least) 1 − β = power.
• Don't over-power the study:
  – Expensive
  – Finds statistical significance for effects that are clinically irrelevant

What is Needed to Determine the Sample Size?
• The minimum difference (between groups) that is clinically relevant or meaningful
  – Requires clinical input
  – Equivalently, choose the mean for both groups (or the null and alternative differences)
• Note: for equivalence/noninferiority studies, we instead need the maximum irrelevant or non-meaningful difference.
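To make the two conditions above concrete, here is a small illustrative sketch (not part of the original notes): a Monte Carlo check, in Python, that a candidate per-arm N roughly delivers the stated α and power for a two-sample test of proportions. The function name and the scenario (rates .35/.40/.30 and N = 476) are our own choices; the value N = 476 anticipates the worked two-proportion example later in this section.

```python
# Illustrative sketch: simulate many trials and estimate the rejection rate of a
# pooled two-sample proportion z-test, under H0 (equal rates) and under HA.
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
z_crit = NormalDist().inv_cdf(0.975)          # two-sided alpha = 0.05

def rejection_rate(p_control, p_treatment, n_per_arm, n_sim=20_000):
    """Fraction of simulated trials in which the pooled z-test rejects H0: pC = pT."""
    x_c = rng.binomial(n_per_arm, p_control, n_sim)    # control responders per trial
    x_t = rng.binomial(n_per_arm, p_treatment, n_sim)  # treatment responders per trial
    pooled = (x_c + x_t) / (2 * n_per_arm)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    diff = (x_c - x_t) / n_per_arm
    z = np.divide(diff, se, out=np.zeros_like(diff), where=se > 0)
    return float(np.mean(np.abs(z) > z_crit))

N = 476   # candidate per-arm size (derived later for pC = .40 vs pT = .30)
print("type I error under H0 (p = .35 in both arms):", rejection_rate(0.35, 0.35, N))  # near 0.05
print("power under HA (pC = .40, pT = .30):", rejection_rate(0.40, 0.30, N))           # near 0.90
```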
What is Needed to Determine the Sample Size?
• Estimates of variability
  – Not known in advance
  – Often obtained from prior studies
  – Consider a pilot study for this
  – SD for each group
• Explore the literature and data from ongoing studies for the estimates needed in the calculations
  – Get the SD for the placebo group and the treatment group
  – May need to validate this estimate later

Sample Size Calculations
• Provide numbers for a range of scenarios and various combinations of parameters (e.g., for various values of the difference between groups, combinations of α and β, etc.).

Example: Find N so that, if the actual event rate were π, we would be unlikely to observe zero events.
What is the chance of observing zero events among N patients if the true event rate were π?

    Prob(no events) = (1 − π)^N

              π = .01   .05   .10    .20
    N =  5       .95    .77   .59    .33
        10       .90    .60   .35    .11
        15       .86    .46   .21    .04
        20       .82    .35   .12    .01
        30       .74    .21   .04    .001

If we observe 0/30, then either:
  1) π = .10 and we observed a rare event (probability = .04), or
  2) π is really < .10.

Sample size for a comparative study involving the analysis of proportions
Randomize patients to treatment A or treatment B.
Objective: use the observed difference δ = p̂A − p̂B to determine whether PA − PB ≠ 0 by a medically significant amount.
  H0: PA − PB = 0
  H1: PA − PB ≠ 0 (one-sided or two-sided alternative)
When N patients per treatment are evaluated, we want to know whether the observed δ is significant evidence to reject H0 and declare a treatment difference.
Test procedure: if |δ| is too large, reject H0.

Four interrelated quantities: α, β, ∆, N
α = false-positive rate (type I error) = Prob(reject H0 | H0 true) = Prob(|δ| too large | PA = PB)
β = false-negative rate (type II error) = Prob(fail to reject H0 | H0 false)
    [β is a function of the true difference PA − PB = ∆]
1 − β = power (sensitivity) = Prob(reject H0 | H0 false)
∆ = the true population difference; must be guessed; the difference of medical importance
N = sample size per arm
Given the test procedure, these quantities are interrelated: e.g., as ∆ increases → the difference becomes easier to detect → 1 − β increases.

Conclusions in hypothesis testing – the truth table
We can characterize the possible outcomes of a hypothesis test as follows:

                              True state
Test result                   H0 true (∆ = 0)     H0 false (∆ ≠ 0)
Reject H0 (p < 0.05)          Type I error (α)    No error (1 − β)
Do not reject H0 (p > 0.05)   No error            Type II error (β)

• If the test statistic z is large enough (e.g., it falls into the shaded rejection region of the reference distribution), we believe the result is too large to have come from a distribution with mean 0 (i.e., PC = PT).
• Thus we reject H0: PC − PT = 0, since there is at most a 5% chance that such a result could have come from a distribution with no difference.

Example: Compare response rates for MVP versus CAMP in metastatic non-small cell lung cancer.

                    MVP        CAMP
Patients            NM         NC
Responders          RM         RC
Response rate       RM/NM      RC/NC

δ = RM/NM − RC/NC   observed difference
∆ = πM − πC         population difference

For example, δ = 7/20 − 4/21 = 16%.

Test of hypotheses
• Two-sided: H0: PT = PC   vs.   Ha: PT ≠ PC
• One-sided: H0: PT ≥ PC   vs.   Ha: PT < PC
• Classic test: zα = critical value, z = test statistic
  – Two-sided: if |z| > zα, reject H0
  – One-sided: if z > zα, reject H0
• Recommend that zα be the same value in both cases (e.g. 1.96):
  – two-sided: 2α = .05, zα = 1.96
  – one-sided: α = .025, zα = 1.96

Typical design assumptions
1. 2α = .05, .025, .01  →  Zα = 1.96, 2.24, 2.58
2. Power = .80, .90, .95  →  Zβ = 0.84, 1.28, 1.645
   (Power should be at least .80 for trial design.)
3. ∆ = the smallest difference we want to detect, e.g.
   ∆ = PC − PT = .40 − .30 = .10 (a 25% reduction)
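As a quick sketch (standard-library Python only, function-free), the design constants above are simply standard-normal quantiles of the chosen error rates; the following reproduces the values listed in items 1 and 2.

```python
# Illustrative sketch: Zα and Zβ are standard-normal quantiles of the chosen error rates.
from statistics import NormalDist

z = NormalDist().inv_cdf

for two_alpha in (0.05, 0.025, 0.01):
    print(f"2α = {two_alpha}:  Zα = {z(1 - two_alpha / 2):.2f}")   # 1.96, 2.24, 2.58

for power in (0.80, 0.90, 0.95):
    print(f"power = {power}:  Zβ = {z(power):.3f}")                # 0.842, 1.282, 1.645
```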
Sample size formula: two proportions

    N = [ Zα·√(2·p̄(1 − p̄)) + Zβ·√(PC(1 − PC) + PT(1 − PT)) ]² / ∆²,   where p̄ = (PC + PT)/2

• Zα = constant associated with P{|Z| > Zα} = 2α (two-sided); e.g. 2α = .05 gives Zα = 1.96
• Zβ = constant associated with 1 − β, i.e. P{Z < Zβ} = 1 − β; e.g. 1 − β = .90 gives Zβ = 1.282
• A simpler approximation uses the pooled variance throughout:

    N = 2(Zα + Zβ)²·p̄(1 − p̄) / ∆²

• In terms of the standardized difference ∆ / √(2·p̄(1 − p̄)):

    N = (Zα + Zβ)² / [ ∆ / √(2·p̄(1 − p̄)) ]²

• Given any two of N, ∆, and power, the same relationship can be solved for the third (e.g., solve for Zβ ⇒ 1 − β, or for the detectable ∆).

Simple example
• H0: PC = PT
• HA: PC = .40, PT = .30, so ∆ = .40 − .30 = .10
• Assume α = .05 (two-sided) and 1 − β = .90, so Zα = 1.96 and Zβ = 1.282
• p̄ = (.40 + .30)/2 = .35

Simple example (2)
Thus:
a. N = [1.96·√(2(.35)(.65)) + 1.282·√((.3)(.7) + (.4)(.6))]² / (.4 − .3)²  =  476,   so 2N = 952
b. N = (1.96 + 1.282)² / [(.4 − .3) / √(2(.35)(.65))]²  =  478,   so 2N = 956

Approximate* total sample size (2N) for comparing two proportions, with significance level (α) of 0.05 and power (1 − β) of 0.80 or 0.90

True proportions              α = 0.05 (two-sided)        α = 0.05 (one-sided)
pC          pI                1−β = 0.90   1−β = 0.80     1−β = 0.90   1−β = 0.80
(Control)   (Intervention)
0.60        0.50                 1040          780            850          610
0.60        0.40                  260          200            210          160
0.60        0.30                  120           90             90           70
0.60        0.20                   60           50             50           40
0.50        0.40                 1040          780            850          610
0.50        0.30                  250          190            210          150
0.50        0.25                  160          120            130           90
0.50        0.20                  110           80             90           60
0.40        0.30                  960          720            780          560
0.40        0.25                  410          310            330          240
0.40        0.20                  220          170            180          130
0.30        0.20                  790          590            640          470
0.30        0.15                  330          250            270          190
0.30        0.10                  170          130            140          100
0.20        0.15                 2430         1810           1980         1430
0.20        0.10                  540          400            440          320
0.20        0.05                  200          150            170          120
0.10        0.05                 1170          870            950          690

*Sample sizes are rounded up to the nearest 10.

Sample size
Sample size is very sensitive to the value of ∆.
• To detect a difference from 25% to 40%: N = 205 per arm for α = .05, β = .10, which is over 270 fewer per arm than to detect a difference from 30% to 40%.
• Large numbers are required if we want high power to detect small differences.
• Trials designed to show no difference require very large N because the ∆ to be ruled out is small.
• Consider: (i) current knowledge, (ii) likely improvement, (iii) feasibility (available accrual).
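A small sketch in Python of the two-proportion formulas above, reproducing the worked example. The arithmetic follows the slides; the function names and rounding are our own.

```python
# Illustrative sketch: per-arm sample size for comparing two proportions.
from math import sqrt
from statistics import NormalDist

def n_per_arm_two_proportions(pC, pT, alpha_two_sided=0.05, power=0.90):
    """Per-arm N (unrounded) from the first formula above (unpooled variance under HA)."""
    z = NormalDist().inv_cdf
    z_a, z_b = z(1 - alpha_two_sided / 2), z(power)
    p_bar = (pC + pT) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar)) +
           z_b * sqrt(pC * (1 - pC) + pT * (1 - pT))) ** 2
    return num / (pC - pT) ** 2

def n_per_arm_approx(pC, pT, alpha_two_sided=0.05, power=0.90):
    """Per-arm N (unrounded) from the pooled approximation 2(Zα+Zβ)² p̄(1−p̄)/∆²."""
    z = NormalDist().inv_cdf
    z_a, z_b = z(1 - alpha_two_sided / 2), z(power)
    p_bar = (pC + pT) / 2
    return 2 * (z_a + z_b) ** 2 * p_bar * (1 - p_bar) / (pC - pT) ** 2

print(round(n_per_arm_two_proportions(0.40, 0.30)),
      round(n_per_arm_approx(0.40, 0.30)))   # 476 and 478 per arm, i.e. 2N of about 952 and 956
```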
for several N & ∆ give the power for several ∆ & 1- β give the required N See “Friedman article on sample size” (class website) for adjustments for lost to follow-up (or non-adherence) to randomized treatments ⎛ 1 ⎞ N * = Nx ⎜ ⎟ ⎝ 1 − LFU ⎠ 2 LFU = Fraction of cases expected to be lost to follow-up V- 30 V- 31 Comparison of two means • H0: µC = µT ⇔ µC - µT = 0 • HA: µC - µT = ∆ • Test statistic for sample means ~ N (µ, σ) XC − XT Z= σ 2 (1/ N C + 1/ NT ) • Let N = NC = NT for design N= 2( Zα + Z β ) 2 σ 2 • Standardized ∆ • Power Zβ = ∆2 = ~N(0,1) for H0 ( Zα + Z β ) 2 (∆ / 2σ ) 2 N / 2 (∆ / σ ) − Zα V- 32 Two Groups Or µ 0 = µ c − µT = ∆ 0 = 0 µ1 = µC − µT = ∆1 Z = XC − XT ~ N (0,1) σ 2/n V- 33 Example IQ scores σ = 15 ∆ = 0.3x15 = 4.5 • Set α = .05 β = 0.10 1 - β = 0.90 • HA: ∆ = 0.3σ ⇔ ∆ / σ = 0.3 • Sample Size N = 2(1.96 + 1.282) 2 2(10.51) 21.02 = = (0.3) 2 (0.3) 2 (0.3) 2 • N = 234 • ⇒ 2N = 468 V- 34 V- 35 Comparing time to event distributions • Primary efficacy endpoint is the time to an event • Compare the survival distributions for the two groups • Measure of treatment effect is the ratio of the hazard rates in the two groups = ratio of the medians • Must also consider the length of follow-up V- 36 Assuming exponential survival distributions • If P(T > t) = e-λt, where λ = λ1 in group 1, λ = λ2 in group 2, let H0: λ2 = λ1 ⇒ λ1/λ2 = 1 Ha: λ2 ≠ λ1 ⇒ λ1/λ2 ≠ 1 • Then define the effect size by ∆ = λ1/λ2 = med2/med1 where medi=-ln(.5)/ λi • Standardized difference ln(∆) / √ 2 V- 37 Assuming exponential survival distributions • The statistical test is powered by the total number of events observed at the time of the analysis, d. 4(Zα +Zβ )2 d= [ln(∆ )]2 V- 38 Converting number of events (d) to required sample size (2N) • d = 2N x P(event) ⇒ 2N = d / P(event) • P(event) is a function of the length of total follow-up at time of analysis and the average hazard rate. • Let AR = accrual rate (patients per year) A = period of uniform accrual (2N=AR x A) F = period of follow-up after accrual A/2 + F = average total follow-up at planned analysis λ = average hazard rate • Then P(event) = 1 – P(no event) = 1 - e-λ(A/2 + F) V- 39 Survival example Lung cancer trial: median survival = 1 year. Test with 80% power whether a new treatment can improve median survival to 1.5 years. d = 4 (1.96 + .84)2 / (ln 1.5)2 = 191 events How many patients accrued during A years and followed for an additional F years will provide 191 events? λ= ((-ln(.5)/1) + (-ln(.5)/1.5))/2 = (.693 + .462)/2 = .578 For each A and F we can calculate P(event) to obtain the number of required patients (2N). V- 40 Survival example (2) For example, A = 2; F = 2; λ = .578 (avg. 1 yr med. and 1.5 yr med.) P(event) = 1 - e-λ(A/2 + F) = 1 - e- .578 (2/2 + 2) = 1 - e- .578 (3) = .823 To get 191 events for A = 2 and F = 2, we need 2N = 191 / P(event) = 191 / .823 = 232 Because 232 patients are accrued during 2 years, an annual accrual rate (AR) of 116 pts per year is implied. Balance AR, A, F to get sample size (2N) needed to produce the required number of events. 
Survival example
Total sample size (2N) required to detect a 1.5-year median versus a 1-year median with 2α = .05 and power = 80%:

                          Years of additional follow-up
                             1      2      3
Years of accrual    1      330    250    221
                    2      279    232    212
                    3      250    221    207

Sample size for a confidence interval (CI)
• X̄ = sample mean, S = standard deviation, T = critical value from the t-distribution
• C.I.: X̄ ± T·S/√N
• If L is the desired half-width, i.e. the permitted difference between the sample mean and the population mean (L = tolerance = |X̄ − µ|), then

    N = T²S² / L²

  will give the desired tolerance. For a 95% CI use T² ≈ 4; for a 99% CI use T² ≈ 6.6. S² must be guessed.
• If X̄ estimates a response rate P:  N = T²·P(1 − P) / L²
• Example: to obtain a 95% C.I. with L = .1 for a P around .5:  N = 4(.5)(.5)/(.1)² = 100

Equivalence (non-inferiority) designs
• Goal: demonstrate that the new therapy is "as good as" the standard (the new therapy is less invasive, less toxic, or cheaper than the standard)
• We cannot show H0: ∆ = 0; instead, specify a minimum acceptable margin ∆0
• For a proportion endpoint (π), the hypotheses are
    H0: πstd > πnew + ∆0   (standard is better than new by more than ∆0)
    Ha: πstd < πnew + ∆0   (new is good enough)
• We wish to achieve power 1 − β when the true difference is πstd − πnew = ∆a (often ∆a = 0)

Equivalence (non-inferiority) designs
• The sample size required in each group is:

    N = (Zα + Zβ)² · [πnew(1 − πnew) + πstd(1 − πstd)] / (∆0 − ∆a)²

• Very sensitive to the definition of equivalence, ∆0
• Report results using confidence intervals

Equivalence (non-inferiority) designs
• If the upper limit of the CI for πstd − πnew excludes ∆0, then we are confident that the new Rx is not too much worse than the standard Rx.
• Accept the new Rx if U1−α ≤ ∆0 (the one-sided upper confidence limit lies below the margin).
• Sample size: choose N per group to ensure probability 1 − β that U1−α falls below ∆0 when the true difference is πstd − πnew = 0:

    N = (Zα + Zβ)² / [ ∆0 / √(πnew(1 − πnew) + πstd(1 − πstd)) ]²   per arm

Equivalence trial example
Lallemant et al. conducted an RCT in Thailand to test the equivalence of the 076 regimen and three abbreviated ZDV regimens (NEJM 2000;343:982-991).
• Let ∆ = pT − pS, where pT and pS are the HIV transmission rates in the new-Rx and standard-Rx groups.
• Note that the standard treatment is the 076 regimen.
• Choose a maximal difference, ∆0, that is considered acceptable.

Equivalence trials
Define the hypotheses as
• H0: pT − pS > ∆0 (transmission with the new treatment is higher than [not equivalent to] standard therapy), and
• HA: pT − pS < ∆0 (the new treatment is (approximately) equivalent to the standard treatment).
Note that the test is one-sided.

The Lallemant trial
"Because the reference treatment in this study was a regimen of established efficacy, we tested for equivalent efficacy of the experimental treatments, choosing a threshold for equivalence that would balance public health concerns with clinical benefits. Using a cost-effectiveness approach, we determined that an absolute increase of 6 percent in the rate of transmission of HIV infection would be the limit beyond which the clinical risk associated with the experimental treatment would not be balanced by its economic and logistical advantages."

The Lallemant trial
"With this criterion for equivalence, we calculated that a sample of 1,398 mother-infant pairs would be required to provide more than 90 percent statistical power with a 5 percent one-sided type I error and an 11 percent overall transmission rate. Equivalence would be established if the upper limit of the one-sided 95 percent confidence interval for the arithmetic difference in the percentage rates of HIV transmission was less than 6."
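A sketch in Python of the noninferiority sample-size formula above, evaluated with assumptions taken from the Lallemant description (about 11% transmission in both arms, a 6 percentage-point margin, one-sided α = 0.05, 90% power). This is an illustration of the formula only, not a reconstruction of the published calculation, which presumably also reflects the trial's multi-arm design; the function name is our own.

```python
# Illustrative sketch: per-group N for a noninferiority comparison of two proportions.
from math import ceil
from statistics import NormalDist

def n_per_group_noninferiority(p_new, p_std, margin, true_diff=0.0,
                               alpha_one_sided=0.05, power=0.90):
    """Per-group N so the one-sided upper limit for p_std - p_new falls below the margin."""
    z = NormalDist().inv_cdf
    z_a, z_b = z(1 - alpha_one_sided), z(power)
    var = p_new * (1 - p_new) + p_std * (1 - p_std)
    return ceil((z_a + z_b) ** 2 * var / (margin - true_diff) ** 2)

# Assumed inputs based on the Lallemant description: 11% rate in both arms, ∆0 = 0.06.
print(n_per_group_noninferiority(0.11, 0.11, 0.06))   # about 466 per group
```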
The COBALT Trial
The COBALT trial was an equivalence trial designed to show that double-bolus alteplase was equivalent to an accelerated infusion of alteplase for the treatment of acute MI. The criterion for equivalence was a difference in mortality rates no greater than 0.4 percent!

The COBALT Trial
7,169 patients were randomized. The 30-day mortality rates were 7.98 percent in the double-bolus group and 7.53 percent in the accelerated-infusion group, an unfavorable absolute difference of 0.44 percent. Because the one-sided 95 percent confidence limit for the difference in mortality rates exceeded the prespecified limit, the authors concluded that double-bolus alteplase had not been shown to be equivalent to an accelerated infusion of alteplase.

GUSTO III
GUSTO III was a superiority trial testing the hypothesis that double-bolus administration of reteplase would reduce 30-day mortality by 20 percent relative to an accelerated infusion of alteplase. In 15,059 randomized patients, the 30-day mortality rates were 7.47 percent for reteplase and 7.24 percent for alteplase, an absolute difference of 0.23 percent (two-sided 95 percent confidence interval, -0.66 to 1.10 percent). Thus the GUSTO trial failed to reject the null hypothesis of equal 30-day mortality rates in the two treatment groups.

Comparing results for 30-day mortality

            Accelerated alteplase (standard)   Double-bolus arm         Conclusion
COBALT      7.53%                              7.98% (DB alteplase)     Failed to demonstrate equivalence of DB alteplase
GUSTO III   7.24%                              7.47% (DB reteplase)     Failed to show superiority of DB reteplase

How did this happen?
The COBALT investigators chose such a stringent standard for equivalence that an equally effective regimen was unlikely to be "proven" equivalent. GUSTO III does appear to show similar efficacy of the two regimens, so it is appropriate not to infer superiority.

Interpreting superiority and equivalence trials
Suppose that we have three treatment options: control (C), standard (S), and new therapy (N). To evaluate the new therapy, one can
  A: perform a superiority trial comparing N to C, or
  B: perform an equivalence trial comparing N to S.

What can we learn?
• In a superiority trial, we test the null hypothesis that N and C are equally effective. If we reject H0, we can infer that N is superior to C.
• The superiority trial provides no internal comparison of N and S. We cannot say, for example, that N is as good as S.

What can we learn?
• In an equivalence trial, we test the null hypothesis that N is inferior to S. If we reject the null hypothesis, we can infer that N is, to within a specified margin, equivalent to S.
• The equivalence trial provides no internal comparison of N to C.
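Before turning to practical adjustments, here is a short sketch of the confidence-interval decision rule used in the equivalence examples above: compute the one-sided 95% upper confidence limit for the difference and compare it with the margin ∆0. The group sizes and rates below are illustrative assumptions loosely based on the COBALT figures (roughly equal randomization assumed); they are not the trial's exact data, and the function name is our own.

```python
# Illustrative sketch: noninferiority decision rule via a one-sided upper confidence limit
# for the difference in event rates (normal approximation).
from math import sqrt
from statistics import NormalDist

def upper_limit_one_sided(p_new, n_new, p_std, n_std, alpha=0.05):
    """One-sided (1 - alpha) upper confidence limit for p_new - p_std."""
    z = NormalDist().inv_cdf(1 - alpha)
    diff = p_new - p_std
    se = sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)
    return diff + z * se

margin = 0.004   # COBALT's equivalence margin of 0.4 percentage points
u = upper_limit_one_sided(0.0798, 3585, 0.0753, 3584)   # assumed roughly equal group sizes
print(round(u, 4), u < margin)   # about 0.015, False: equivalence not established
```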
Sample Size Calculations
• Expect things to go wrong and plan for this BEFORE the study begins (and in the sample-size calculations).
• Think about the challenges of reality and the things they don't normally teach you about in the classroom.
• Sample size estimates need to incorporate adjustments for real problems that occur but are not often part of the standard formulas.

Sample Size Calculations
• Each of these can affect the sample size calculation:
  – Missing data (imputation?)
  – Noncompliance
  – Multiple testing
  – Unequal group sizes
  – Use of nonparametric testing (vs. parametric)
  – Noninferiority or equivalence trial
  – Inadvertent enrollment of ineligible subjects

Adjustment for Lost to Follow-up
• Calculate the sample size N.
• Let x = the proportion expected to be lost to follow-up.
• Nadj = N / (1 − x)²
• One still needs to consider the potential bias of analyzing only subjects with non-missing data.
• Note: no adjustment is necessary if you plan to impute missing values. If you use imputation, an adjustment for the dilution effect may be warranted (for example, when treating anyone with missing data as a "failure" or a "non-responder").

Adjustment for Dilution Effect
• Adjustment for the dilution effect due to noncompliance or to inclusion (perhaps inadvertent) of subjects who cannot respond:
  – Calculate the sample size N.
  – Let x = the proportion expected to be non-compliant.
  – Nadj = N / (1 − x)²

Adjustment for Unequal Allocation
• Adjustment for unequal allocation to two groups:
  – Let QE and QC be the sample fractions, with QE + QC = 1.
  – Calculate the sample size Nbal for equal group sizes (i.e., QE = QC = 0.5).
  – Nunbal = Nbal × (QE⁻¹ + QC⁻¹) / 4
  – Note: power is optimized when QE = QC = 0.5.

Adjustment for Nonparametric Testing
• Most sample-size calculations are performed using parametric methods.
• Adjustment for using a nonparametric test instead of a parametric test (Pitman efficiency), in case the parametric assumptions do not hold (note: applicable to 1- and 2-sample t-tests):
  – Calculate the sample size Npar for the 1- or 2-sample t-test.
  – Nnonpar = Npar / 0.864

Adjustment for Equivalence/Noninferiority Studies
• Calculate the sample size for a standard superiority trial, but reverse the roles of α and β.
• Works for the large-sample binomial and the normal case. Does not work for survival data.

Other Points to Remember
• Trials are usually powered on an efficacy endpoint.
  – Thus a trial may be underpowered to detect differences in safety parameters (particularly rare events or small differences).
• Sample-size re-estimation has become common.
  – Needed because estimates are used in the original calculation.
  – Must be done very carefully.
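Finally, a small sketch collecting the adjustment rules above as Python helpers. The function names and example values are illustrative; N = 476 is the per-arm size from the two-proportion example earlier in this section.

```python
# Illustrative sketch: inflation factors from the adjustment slides above.
from math import ceil

def adjust_lost_to_followup(n, lfu_fraction):
    """Inflate N for an expected fraction lost to follow-up: N / (1 - x)^2."""
    return ceil(n / (1 - lfu_fraction) ** 2)

def adjust_dilution(n, noncompliance_fraction):
    """Inflate N for dilution due to noncompliance: N / (1 - x)^2."""
    return ceil(n / (1 - noncompliance_fraction) ** 2)

def adjust_unequal_allocation(n_balanced, q_experimental):
    """Apply the (QE^-1 + QC^-1)/4 factor for allocation fractions QE and QC = 1 - QE."""
    q_e, q_c = q_experimental, 1 - q_experimental
    return ceil(n_balanced * (1 / q_e + 1 / q_c) / 4)

def adjust_nonparametric(n_parametric):
    """Pitman-efficiency adjustment when replacing a 1- or 2-sample t-test: N / 0.864."""
    return ceil(n_parametric / 0.864)

N = 476                                         # per-arm size from the two-proportion example
print(adjust_lost_to_followup(N, 0.10))         # 588
print(adjust_unequal_allocation(2 * N, 2 / 3))  # 1071: total size under 2:1 allocation
print(adjust_nonparametric(N))                  # 551
```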
© Copyright 2024