Sample Size Estimation in Clinical Trials
Hsiao-Hui Tsou (鄒小蕙)
Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes

Outline
• Why does this matter? Scientific and ethical implications
• Statistical definitions and notation
• Questions that need to be answered prior to determining sample size
• Study design issues affecting sample size
• Some basic sample size formulas

Example: How many subjects?
• Compare new treatment (T) with a control (C)
• Previous data suggest a control failure rate ($P_C$) of about 40%
• Investigator believes treatment can reduce $P_C$ by 25% (relative), i.e. $P_T = 0.30$, $P_C = 0.40$
• N = number of subjects per group = ?

Example: How many subjects?
• Compare new treatment (T) with a control (C)
• Previous data suggest median survival is 12 months for the control group
• Investigator believes treatment can increase median survival to 16 months
• N = number of subjects per group = ?

Scientific and Ethical Implications
From a scientific perspective:
• We can never be sure we have made the right decision regarding the effect of the intervention
• However, we want enough subjects enrolled to address the study question adequately and to feel comfortable that we have reached the correct conclusion
From an ethical perspective:
Too few subjects:
• Cannot adequately address the study question. The time, discomfort and risk to subjects have served no purpose.
• May conclude no effect of an intervention that is beneficial. Current and future subjects may not benefit from the new intervention based on the current (inconclusive) study.
Too many subjects:
• Too many subjects are unnecessarily exposed to risk.
• Enroll only enough patients to answer the study question, to minimize the discomfort and risk subjects may be exposed to.

Where to begin?
• Understand the research question
• "What's the question?"
  – The following are NOT research questions:
    • We want to "look at" median PFS
    • We want to analyze the data
    • We want to see if our results are significant
• Need to
  – Visualize the final analysis and the statistical methods to be used

Where to begin?
• Analysis determines sample size
  – Sample size calculations are based on the planned method of analysis
  – If you don't know how the data will be analyzed (e.g., 2-sample t-test), then you cannot accurately estimate the sample size

Sample Size Calculation
• Formulate a PRIMARY research question
• Identify:
  1. A hypothesis to test (write down $H_0$ and $H_A$), or
  2. A quantity to estimate (e.g., using confidence intervals)
• Determine the endpoint or outcome measure associated with the hypothesis test or quantity to be estimated
  – How do we "measure" or "quantify" the responses?
  – Is the measure continuous, binary, or a time-to-event?
  – Is this a one-sample or two-sample problem?

Sample Size Calculation
• Based upon the PRIMARY outcome
• Other analyses (i.e., secondary outcomes) may be planned, but the study may not be powered to detect effects for these outcomes

Definitions and Notation
• Null hypothesis ($H_0$): No difference between groups
    $H_0: p_1 = p_2$    $H_0: \mu_1 = \mu_2$
• Alternative hypothesis ($H_A$): There is a difference between groups
    $H_A: p_1 \ne p_2$    $H_A: \mu_1 \ne \mu_2$
• P-value: Chance of obtaining the observed result or one more extreme when groups are equal (under $H_0$)
  – Test of significance of $H_0$
  – Based on distribution of a test statistic assuming $H_0$ is true
  – It is NOT the probability that $H_0$ is true

Definitions and Notation
• $\delta$: Measure of the true population difference; must be estimated.
  Difference of medical importance: $\delta = |p_1 - p_2| = |\mu_1 - \mu_2|$
• n: Sample size per arm
• N: Total sample size (N = 2n for 2 groups with equal allocation)
• Type I error: Rejecting $H_0$ when $H_0$ is true
• $\alpha$: The type I error rate. Maximum p-value considered statistically significant
• Type II error: Failing to reject $H_0$ when $H_0$ is false
• $\beta$: The type II error rate
• Power ($1 - \beta$): Probability of detecting a group effect given the size of the effect ($\delta$) and the sample size of the trial (N)
• $\alpha$ = ? $\beta$ = ? In phase II or phase III?

Sample Size Calculation Using Hypothesis Testing
• The most common approach
• The idea is to choose a sample size such that both of the following conditions hold simultaneously:
  – If the null hypothesis is true, then the probability of incorrectly rejecting it is (no more than) $\alpha$
  – If the alternative hypothesis is true, then the probability of correctly rejecting the null hypothesis is (at least) $1 - \beta$ = power

Statistical Considerations
[Figure: regions labelled $\alpha$ and power = $1 - \beta$]
• p: probability of a result at least as extreme as the one observed when the test and control groups do not differ
• $\alpha$: probability that an ineffective drug is judged effective
• $\beta$: probability that an effective drug is judged ineffective
• $1 - \beta$: statistical power (probability that an effective drug is judged effective)

Determinants of Sample Size
• $\alpha$: Regulated by FDA for phase III pivotal trials (0.05)
• $\beta$: Up to the investigator (power $1 - \beta$ is often 80%-90%)
  – Not regulated by FDA
• An "effect size" to detect
  – Minimum difference that is clinically relevant (for superiority)
    • E.g., $H_0: p_1 - p_2 = 0$ vs. $H_A: p_1 - p_2 = 0.20$
  – Maximum difference that is clinically irrelevant (for noninferiority)
• Estimates of variability

The quantities $\alpha$, $\beta$, $\delta$ and N are all interrelated. Holding all other values constant, what happens to the power of the study if
• $\delta$ increases? Power ↑
• $\alpha$ decreases? Power ↓
• N increases? Power ↑
• variability increases? Power ↓
Note: Typical error rates are $\alpha = .05$ and $\beta = .1$ or $.2$ (90% or 80% power). Why is $\alpha$ often smaller than $\beta$?

Considerations
• Type of primary endpoint
• 1-sample vs. 2-sample
• Independent samples or paired?
• What is the sample size allocation ratio?
• 1-sided vs. 2-sided
  – 2-sided hypothesis:
    • $H_0: \theta_T = \theta_C$ vs.
$H_A: \theta_T \ne \theta_C$
  – 1-sided hypothesis:
    • $H_0: \theta_T \le \theta_C$ vs. $H_A: \theta_T > \theta_C$
    • $H_0: \theta_T \ge \theta_C$ vs. $H_A: \theta_T < \theta_C$

Primary Endpoint
• Dichotomous response variables (success and failure): compare event rates $P_I$ and $P_C$
• Continuous response variables (blood pressure): compare true mean levels $\mu_I$ and $\mu_C$
• Time to failure (or occurrence of a clinical event): OS, PFS, RFS, …
  – Compare hazard rates $\lambda_I$ and $\lambda_C$
  – Compare median survival times

Distribution of Test Statistics
• Many have a common form
• $\theta$ = population parameter (e.g., difference in means)
• $\hat\theta$ = sample estimate
• Then $Z = (\hat\theta - E(\hat\theta)) / SE(\hat\theta)$
• Then Z has a(n approximately) Normal(0, 1) distribution

Test of Hypothesis
• z = test statistic, compared with a critical value
• Two-sided (classic test), e.g. $H_0: P_T = P_C$: reject $H_0$ if $|z| > z_{1-\alpha/2}$; for $\alpha = .05$ the critical value is 1.96
• One-sided, $H_0: P_T \le P_C$: reject $H_0$ if $z > z_{1-\alpha}$; for $\alpha = .05$ the critical value is 1.645
• Recommend using the same critical value in both cases (e.g. 1.96), i.e. two-sided $\alpha = .05$ or one-sided $\alpha = .025$

Typical Design Assumptions (1)
1. $\alpha$ = 0.05, 0.025, 0.01
2. Power = 0.80, 0.90. Should be at least .80 for the design of a phase III trial. How about for phase II?
3. $\delta$ = smallest difference one hopes to detect, e.g. $\delta = P_C - P_T = .40 - .30 = .10$, a 25% reduction

Typical Design Assumptions (2)
Two-sided significance level $\alpha$    0.05    0.025   0.01
Critical value $Z_\alpha$                1.96    2.24    2.58
Power $1 - \beta$                        0.80    0.90    0.95
Critical value $Z_\beta$                 0.84    1.282   1.645

Binomial Case
• $H_0: P_C = P_I$
• Test statistic (normal approximation via the CLT):
  $$Z = \frac{\hat P_C - \hat P_I}{\sqrt{\bar P (1 - \bar P)\,(1/N_C + 1/N_I)}} \sim N(0,1) \text{ under } H_0, \qquad \bar P = \frac{N_C \hat P_C + N_I \hat P_I}{N_C + N_I}$$
• Sample size: Assume
  – $N_I = N_C = N$
  – $H_a: \delta = P_C - P_I$

Binomial Sample Size Formula
  $$N = \frac{\left[ Z_\alpha \sqrt{2 \bar P (1 - \bar P)} + Z_\beta \sqrt{P_C (1 - P_C) + P_I (1 - P_I)} \right]^2}{\delta^2}$$
• $P\{|Z| > Z_\alpha\} = \alpha$
• $P\{Z < Z_\beta\} = 1 - \beta$
or
  $$N = \frac{2 (Z_\alpha + Z_\beta)^2\, \bar P (1 - \bar P)}{\delta^2}$$

Example
• $H_0: P_C = P_I$
• $H_a: P_C = 0.4$, $P_I = 0.3$
• Assume $\alpha = 0.05$, $1 - \beta = 0.90$, i.e.
$Z_\alpha = 1.96$, $Z_\beta = 1.282$, $\bar P = (0.4 + 0.3)/2 = 0.35$
• $$N = \frac{\left[ 1.96 \sqrt{2(.35)(.65)} + 1.282 \sqrt{(.3)(.7) + (.4)(.6)} \right]^2}{(.4 - .3)^2} \approx 476, \qquad 2N = 952$$
• OR $$N = \frac{2 (1.96 + 1.282)^2 (.35)(.65)}{(.4 - .3)^2} \approx 478, \qquad 2N = 956$$

Approximate* Total Sample Size for Comparing Various Proportions in Two Groups with Significance Level ($\alpha$) of 0.05 and Power ($1 - \beta$) of 0.80 and 0.90

True proportions                  alpha = 0.05 (one-sided)     alpha = 0.05 (two-sided)
pC (Control)  pI (Intervention)   1-beta=0.90   1-beta=0.80    1-beta=0.90   1-beta=0.80
0.60          0.50                850           610            1040          780
0.60          0.40                210           160            260           200
0.60          0.30                90            70             120           90
0.60          0.20                50            40             60            50
0.50          0.40                850           610            1040          780
0.50          0.30                210           150            250           190
0.50          0.25                130           90             160           120
0.50          0.20                90            60             110           80
0.40          0.30                780           560            960           720
0.40          0.25                330           240            410           310
0.40          0.20                180           130            220           170
0.30          0.20                640           470            790           590
0.30          0.15                270           190            330           250
0.30          0.10                140           100            170           130
0.20          0.15                1980          1430           2430          1810
0.20          0.10                440           320            540           400
0.20          0.05                170           120            200           150
0.10          0.05                950           690            1170          870
*Sample sizes are rounded up to the nearest 10

Comparison of Means
• Some outcome variables are continuous
  – Blood pressure
  – Serum chemistry
  – Pulmonary function
  – HAM-D
• Hypothesis tested by comparison of mean values between groups, or comparison of mean changes

Comparison of Two Means: Test for Equality
• $H_0: \mu_I = \mu_C$, i.e.
$\mu_I - \mu_C = 0$
• $H_a: \mu_I - \mu_C = \delta$
• Test statistic: assume $X \sim N(\mu, \sigma^2)$
  $$Z = \frac{\bar X_C - \bar X_I}{\sqrt{\sigma^2 (1/N_C + 1/N_I)}} \sim N(0, 1) \text{ under } H_0$$
• $N_C = N_I = N$:
  $$N = \frac{2 (Z_\alpha + Z_\beta)^2 \sigma^2}{\delta^2} = \frac{2 (Z_\alpha + Z_\beta)^2}{(\delta / \sigma)^2}$$

Example
• To evaluate the effect of a test drug on cholesterol in patients with coronary heart disease (CHD)
• Low-density lipoprotein (LDL) cholesterol is the lipid most directly associated with increased risk of CHD
• Goal: to compare two cholesterol-lowering agents for treatment of patients with CHD
• Endpoint: percent change in LDL-C
• Detect a difference of 5% between the treatment groups ($\delta = 5$)

Example (Equality)
• Standard deviation = 10% ($\sigma = 10$)
• Significance level = 0.05 ($2\alpha = 0.05$)
• 80% power ($1 - \beta = 0.80$)
• $Z_\alpha = 1.96$, $Z_\beta = 0.84$
  $$N = \frac{2 (1.96 + 0.84)^2}{(5/10)^2} = 62.72 \approx 63, \qquad 2N = 126$$

Goals of Controlled Clinical Trials (1): Superiority Trials
• A controlled trial may demonstrate efficacy of the test treatment by showing that it is superior to the control
  – No treatment
  – Best standard of care

Comparison of Two Means: Test for Superiority
• $H_0: \mu_T - \mu_C \le M$
• $H_a: \mu_T - \mu_C > M$
• Test statistic: assume $X \sim N(\mu, \sigma^2)$
  $$Z = \frac{\bar X_T - \bar X_C - M}{\sqrt{\sigma^2 (1/N_T + 1/N_C)}} \sim N(0, 1) \text{ under } H_0$$
• $N_C = N_T = N$:
  $$N = \frac{2 (Z_{1-\alpha} + Z_{1-\beta})^2 \sigma^2}{(\delta - M)^2} = \frac{2 (Z_{1-\alpha} + Z_{1-\beta})^2}{((\delta - M)/\sigma)^2}$$

Example (Superiority)
• Consider M = 1% as the minimal difference of clinical importance.
• Detect a difference of 5% between the treatment groups ($\delta = 5$)
• Standard deviation = 10% ($\sigma = 10$)
• Significance level = 0.05 ($2\alpha = 0.05$)
• 80% power ($1 - \beta = 0.80$)
• $Z_{1-\alpha} = 1.96$, $Z_{1-\beta} = 0.84$
  $$N = \frac{2 (1.96 + 0.84)^2}{((5 - 1)/10)^2} = 98$$

Comparison of Two Means (M = 0)
• $H_0: \mu_T \le \mu_C$, i.e. $\mu_T - \mu_C \le 0$
• $H_A: \mu_T - \mu_C > 0$
• Test statistic for sample means, $X \sim N(\mu, \sigma^2)$:
  $$Z = \frac{\bar X_T - \bar X_C}{\sqrt{\sigma^2 (1/N_T + 1/N_C)}} \sim N(0, 1) \text{ under } H_0$$
• Let $N = N_C = N_T$ for design, and consider $\mu_T - \mu_C = \delta$:
  $$N = \frac{2 (Z_\alpha + Z_\beta)^2 \sigma^2}{\delta^2} = \frac{2 (Z_\alpha + Z_\beta)^2}{(\delta/\sigma)^2}$$
• Power: $Z_\beta = \sqrt{N/2}\,(\delta/\sigma) - Z_\alpha$, so that power $= \Phi(Z_\beta)$

Example
• Detect a difference of 5% between the treatment groups ($\delta = 5$)
• Standard deviation = 10% ($\sigma = 10$)
• Significance level = 0.05 ($2\alpha = 0.05$)
• 80% power ($1 - \beta = 0.80$)
• $Z_\alpha = 1.96$, $Z_\beta = 0.84$
  $$N = \frac{2 (1.96 + 0.84)^2}{(5/10)^2} = 62.72 \approx 63, \qquad 2N = 126$$

Goals of Controlled Clinical Trials: Non-Inferiority Trials
• A clinical trial comparing a test product with an active control (without a placebo arm) is designed as an active-control non-inferiority trial
• Objective:
  – to demonstrate that the test product is not worse than the active control by more than a pre-specified, small amount (the non-inferiority margin M).
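
The per-arm formula used in the two-means examples above, $2 (Z_\alpha + Z_\beta)^2 \sigma^2 / (\delta - M)^2$, together with the matching power expression, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the `margin` parameter are hypothetical, and exact normal quantiles from the standard library are assumed in place of the rounded 1.96 and 0.84.

```python
import math
from statistics import NormalDist


def n_per_arm_two_means(delta, sigma, two_alpha=0.05, power=0.80, margin=0.0):
    """Unrounded per-arm N for a normal test comparing two means.

    delta  : assumed true difference in means (same units as sigma)
    sigma  : common standard deviation
    margin : superiority margin M; 0 gives the plain test, and a negative
             value (-M) gives the non-inferiority form with delta + M
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - two_alpha / 2)  # about 1.96 when 2*alpha = 0.05
    z_beta = z.inv_cdf(power)               # about 0.84 for 80% power
    effect = (delta - margin) / sigma       # standardized distance from the margin
    if effect <= 0:
        raise ValueError("delta must exceed the margin")
    return 2 * (z_alpha + z_beta) ** 2 / effect ** 2


def power_two_means(n, delta, sigma, two_alpha=0.05, margin=0.0):
    """Power implied by n subjects per arm: Phi(sqrt(n/2) * effect - z_alpha)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - two_alpha / 2)
    effect = (delta - margin) / sigma
    return z.cdf(math.sqrt(n / 2) * effect - z_alpha)


if __name__ == "__main__":
    # M = 0 example: delta = 5, sigma = 10, 2*alpha = 0.05, 80% power
    n0 = n_per_arm_two_means(delta=5, sigma=10)
    print(math.ceil(n0), round(power_two_means(math.ceil(n0), 5, 10), 3))
    # Same inputs with a superiority margin M = 1
    print(math.ceil(n_per_arm_two_means(delta=5, sigma=10, margin=1)))
```

Because exact quantiles are used, the unrounded values can differ slightly from hand calculations with 1.96 and 0.84; the function returns the unrounded N so that the rounding rule stays a visible, separate choice.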
Goals of Controlled Clinical Trials: Non-Inferiority Trials
• A controlled trial may demonstrate efficacy by showing the test treatment to be similar in efficacy to a known effective treatment
  – The active control must have been effective under the conditions of the trial
  – New treatment cannot be worse by more than a pre-specified amount (M)
  – New treatment may not be better than the standard but may have other advantages
    – Cost
    – Toxicity
    – Invasiveness

Comparison of Two Means (NI)
• $H_0: \mu_C - \mu_T \ge M$, i.e. $\mu_T \le \mu_C - M$ (M > 0)
• $H_A: \mu_C - \mu_T < M$, i.e. $\mu_T > \mu_C - M$
• Test statistic for sample means, $X \sim N(\mu, \sigma^2)$:
  $$Z = \frac{\bar X_T - \bar X_C + M}{\sqrt{\sigma^2 (1/N_T + 1/N_C)}} \sim N(0, 1) \text{ under } H_0$$
• Let $N = N_C = N_T$ for design, and consider $\mu_T - \mu_C = \delta$:
  $$N = \frac{2 (Z_\alpha + Z_\beta)^2 \sigma^2}{(\delta + M)^2} = \frac{2 (Z_\alpha + Z_\beta)^2}{((\delta + M)/\sigma)^2}$$
• Power: $Z_\beta = \sqrt{N/2}\,((\delta + M)/\sigma) - Z_\alpha$, so that power $= \Phi(Z_\beta)$

Example
• To evaluate the effect of a test drug on cholesterol in patients with coronary heart disease (CHD)
• Low-density lipoprotein (LDL) cholesterol is the lipid most directly associated with increased risk of CHD
• Goal: to compare two cholesterol-lowering agents for treatment of patients with CHD

Example (cont'd)
• Set $2\alpha = .05$, $1 - \beta = 0.90$ (90% power)
• Suppose a difference of 5% (M = 5%) in percent change of LDL is considered a clinically meaningful difference.
• Assume that the standard deviation is 10% ($\sigma$ = 10%)
Non-inferiority test:
  – The non-inferiority margin is chosen to be 5%, i.e., M = 0.05

Example (cont'd): Non-inferiority test
• Assume the true difference in mean LDL between treatment groups is $\mu_T - \mu_C = 0\%$, i.e. $\delta = 0\%$
• The test drug is much cheaper
  $$n_1 = n_2 = \frac{2 (1.96 + 1.282)^2 (0.1)^2}{(0 + 0.05)^2} = 84.08$$
  i.e. 85 subjects per group after rounding up

Comparing Time to Event Distributions
• Primary efficacy endpoint is the time to an event
• Compare the survival distributions for the two groups
• Measure of treatment effect is the ratio of the hazard rates in the two groups (under an exponential model, the inverse of the ratio of the medians)
• Must also consider the length of follow-up

Time to Failure
• Use a parametric model for sample size
• Exponential model
  – $S(t) = e^{-\lambda t}$, $\lambda$ = hazard rate
  – $H_0: \lambda_C = \lambda_I$ (i.e., $H_0: \lambda_C - \lambda_I = 0$)
  – $$N = \frac{2 (Z_\alpha + Z_\beta)^2}{[\ln(\lambda_C / \lambda_I)]^2}$$
• No censoring
• Instant recruiting

More General Model
• Not all patients are followed to an event (i.e., censoring)
• Patients are recruited over some period of time (i.e., staggered entry)
• $$N = \frac{(Z_\alpha + Z_\beta)^2 \left[ g(\lambda_C) + g(\lambda_I) \right]}{(\lambda_C - \lambda_I)^2}$$
  where $g(\lambda)$ is defined as follows.
More General Model
• Instant recruitment and censored at time T:
  $$g(\lambda) = \frac{\lambda^2}{1 - e^{-\lambda T}}$$
• Continual recruitment up to time T and censored at time T:
  $$g(\lambda) = \frac{\lambda^3 T}{\lambda T - 1 + e^{-\lambda T}}$$
• Continual recruitment up to time $T_0$, $T > T_0$, and censored at T:
  $$g(\lambda) = \frac{\lambda^2}{1 - \left[ e^{-\lambda (T - T_0)} - e^{-\lambda T} \right] / (\lambda T_0)}$$

Example
Assume
• $\alpha = 0.05$ (two-sided) and $1 - \beta = 0.90$
• $\lambda_C = 0.3$ and $\lambda_I = 0.2$
• T = 5 years follow-up
• Recruitment $T_0 = 3$

Example: Sample size results
• No censoring and instant recruiting: N = 128
• Instant recruiting and censored at T: N = 188
• Continual recruitment up to T and censored at T: N = 310
• Continual recruitment to $T_0$ and censored at T: N = 233

Example
• Control: median survival = 12 months
• Treatment: median survival = 16 months
• $T_0 = 3$, $T = 4$ (time in years)
• $\alpha = 0.05$ (two-sided) and $1 - \beta = 0.80$
• $0.5 = \exp(-\lambda_C \times 1)$, so $\lambda_C = -\ln(0.5) = 0.6931$
• $0.5 = \exp(-\lambda_I \times 1.33)$, so $\lambda_I = -\ln(0.5)/1.33 = 0.5212$
• N = 261 per group

Simon's two-stage design
• A priori, decide the sample size for Stage 1 and the sample size for Stage 2.
• If reasonable evidence of efficacy is seen by the end of Stage 1, then continue to Stage 2.
• Aims to minimize the overall (expected) sample size.
• Can terminate at the end of Stage 1 either because
  – treatment very efficacious
  – treatment not at all efficacious
• Usually, termination is due to lack of efficacy

Example of Simon's two-stage design (1)
• Suppose we have a new treatment that we would like to investigate for efficacy.
  – The standard therapy/placebo control has a response rate of 0.25. (uninteresting level)
  – We would be interested in this new therapy if the response rate were 0.50 or greater. (at least some desired level)
• Design assumptions:
  – $H_0: p \le 0.25$
  – $H_1: p \ge 0.50$
• For power of 80% and with an overall type I error rate of 0.05, the sample size for stage 1 is 9 and for stage 2 is 15.

Example of Simon's two-stage design (2)
• Rules:
  – If more than 2 responses (i.e., 3 or more) are seen among the 9 patients at the end of stage 1, continue to stage 2.
  – If more than 9 responses (i.e., 10 or more) in total are seen among all 24 patients at the end of stage 2, consider the new treatment efficacious.
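
The error rates of a two-stage rule like the one in the example can be checked exactly from binomial probabilities. The sketch below is illustrative only: the function names are hypothetical, and the cutoff convention is an assumption (stop after stage 1 if responses are at most `r1`; declare the treatment promising only if total responses exceed `r`). With the stage sizes 9 and 15 from the example and assumed cutoffs `r1 = 2`, `r = 9`, the computed type I error and power come out close to the nominal 0.05 and 80%. Whether a boundary count means "stop" or "continue" changes these rates noticeably, so cutoffs should be checked against the design table actually used.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); equals 1 when k <= 0."""
    return sum(binom_pmf(j, n, p) for j in range(max(k, 0), n + 1))


def simon_two_stage(n1, r1, n, r, p):
    """Operating characteristics at true response rate p.

    Assumed convention: stop after stage 1 if responses <= r1; declare the
    treatment promising only if total responses > r out of n.
    Returns (P(declare promising), P(early termination), expected N).
    """
    n2 = n - n1
    pet = sum(binom_pmf(x, n1, p) for x in range(0, r1 + 1))
    promising = sum(
        binom_pmf(x1, n1, p) * binom_sf(r + 1 - x1, n2, p)
        for x1 in range(r1 + 1, n1 + 1)
    )
    expected_n = n1 + (1 - pet) * n2
    return promising, pet, expected_n


if __name__ == "__main__":
    # Stage sizes 9 + 15 from the example; cutoffs r1 = 2, r = 9 are assumed.
    alpha, pet0, en0 = simon_two_stage(n1=9, r1=2, n=24, r=9, p=0.25)
    power, _, _ = simon_two_stage(n1=9, r1=2, n=24, r=9, p=0.50)
    print(f"type I error={alpha:.3f} power={power:.3f} PET0={pet0:.3f} EN0={en0:.1f}")
```

Evaluating the same function at `p = 0.25` and `p = 0.50` gives the type I error and the power from one code path, which keeps the two quantities consistent with a single decision rule.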
Randomized phase II trials with a prospective control (1)
• Two-arm two-stage designs
  – Arm 1: experimental arm
  – Arm 2: control arm
• $H_0: p_1 \le p_2$ against $H_1: p_1 > p_2$
• Example:
  – to evaluate the anti-tumor activity of the CD30 antibody SGN30 combined with GVD (gemcitabine, vinorelbine, doxorubicin) chemotherapy (Arm 1) compared with GVD plus placebo (Arm 2) in patients with relapsed/refractory classical HL (Hodgkin lymphoma).

Randomized phase II trials with a prospective control (2)
• $H_0: p_1 \le p_2$ against $H_1: p_1 > p_2$
• $H_0: p_1 = p_2 = 0.7$ and $H_1: p_1 = 0.85$, $p_2 = 0.7$
• Type I error $\alpha = 0.15$, power $= 1 - \beta = 0.8$

p2     p1     Optimal (n, n1, a1, a)    EN
0.7    0.85   (73, 27, 1, 6)            47.28

Randomized phase II trials with a prospective control (3)
• Stage 1: accrue $n_1$ patients to each arm.
  – $X_1$ = number of responders among the $n_1$ patients in Arm 1 (test drug)
  – $Y_1$ = number of responders among the $n_1$ patients in Arm 2
  – Proceed to stage 2 if $X_1 - Y_1 \ge a_1$.
  – Otherwise, reject Arm 1 (i.e., fail to reject $H_0$; the test drug fails to show a trend of effectiveness) and stop the trial.

Randomized phase II trials with a prospective control (4)
• Stage 2: accrue an additional $n_2$ patients to each arm.
  – $X_2$ = number of responders among the $n_2$ patients in Arm 1 (test drug)
  – $Y_2$ = number of responders among the $n_2$ patients in Arm 2
  – $X = X_1 + X_2$ = total number of responders in Arm 1
  – $Y = Y_1 + Y_2$ = total number of responders in Arm 2
  – Accept Arm 1 (i.e., reject $H_0$) for further investigation if $X - Y \ge a$.
  – Otherwise, reject Arm 1.
• For the design above, $n_2 = 73 - 27 = 46$
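
The two-arm two-stage rule (continue if $X_1 - Y_1 \ge a_1$, accept Arm 1 if $X - Y \ge a$) can be evaluated exactly by enumerating the distribution of the difference in responders at each stage. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the two stages are treated as independent binomial samples, and the expected per-arm sample size is assumed to be $n_1 + n_2 \cdot P(\text{continue})$.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def diff_pmf(n, p1, p2):
    """Exact pmf of D = X - Y with X ~ Bin(n, p1), Y ~ Bin(n, p2), as {d: P(D = d)}."""
    px = [binom_pmf(k, n, p1) for k in range(n + 1)]
    py = [binom_pmf(k, n, p2) for k in range(n + 1)]
    out = {}
    for x, fx in enumerate(px):
        for y, fy in enumerate(py):
            out[x - y] = out.get(x - y, 0.0) + fx * fy
    return out


def two_arm_two_stage(n, n1, a1, a, p1, p2):
    """Return (P(accept Arm 1), expected N per arm) for the rule
    'continue if X1 - Y1 >= a1; accept Arm 1 if X - Y >= a'."""
    n2 = n - n1
    d1 = diff_pmf(n1, p1, p2)
    d2 = diff_pmf(n2, p1, p2)
    # Upper tail of the stage-2 difference: tail2[t] = P(D2 >= t)
    tail2, running = {}, 0.0
    for t in range(n2, -n2 - 1, -1):
        running += d2[t]
        tail2[t] = running
    p_continue = sum(pr for d, pr in d1.items() if d >= a1)
    p_accept = 0.0
    for d, pr in d1.items():
        if d >= a1:
            need = a - d  # stage-2 difference still required to reach a
            if need <= -n2:
                p_accept += pr
            elif need <= n2:
                p_accept += pr * tail2[need]
    return p_accept, n1 + n2 * p_continue


if __name__ == "__main__":
    design = dict(n=73, n1=27, a1=1, a=6)  # (n, n1, a1, a) from the table
    alpha, en_h0 = two_arm_two_stage(**design, p1=0.7, p2=0.7)
    power, _ = two_arm_two_stage(**design, p1=0.85, p2=0.7)
    print(f"type I error={alpha:.3f} power={power:.3f} EN under H0={en_h0:.2f}")
```

Working with the difference $D = X - Y$ directly keeps the enumeration small (at most a few thousand terms per stage), so no normal approximation is needed for designs of this size.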