Biostatistics Research Center

Sample Size Estimation in
Clinical Trials
Hsiao-Hui Tsou (鄒小蕙)
National Health Research Institutes
Institute of Population Health Sciences
Division of Biostatistics and Bioinformatics
email: [email protected]
REFERENCES
1. Friedman, Furberg & DeMets (1998). Fundamentals of Clinical Trials,
Third Edition. Springer-Verlag, New York, NY.
2. Chow, S.C., and Liu, J.P. (2004). Design and Analysis of Clinical
Trials: Concepts and Methodologies, Second Edition. John Wiley and
Sons, Inc., New York, NY.
Outline
• Why does this matter? Scientific and ethical
implications
• Statistical definitions and notation
• Questions that need to be answered prior to
determining sample size
• Study design issues affecting sample size
• Some basic sample size formulas
Example: How many subjects?
• Compare new treatment (T) with a control (C)
• Previous data suggest a control failure rate (PC) of about 40%
• Investigator believes treatment can reduce PC by 25% (relative),
i.e. PT = 0.30, PC = 0.40
• N = number of subjects per group = ?
Example: How many subjects?
• Compare new treatment (T) with a control (C)
• Previous data suggest median survival is 12 months for the
control group
• Investigator believes treatment can increase median survival to
16 months
• N = number of subjects per group = ?
Scientific and Ethical Implications
From a scientific perspective:
• Can’t be sure we’ve made right decision
regarding the effect of the intervention
• However, we want enough subjects enrolled to
adequately address study question to feel
comfortable that we’ve reached correct
conclusion
From an ethical perspective:
Too few subjects:
• Cannot adequately address study question. The time,
discomfort and risk to subjects have served no
purpose.
• May conclude no effect of an intervention that is
beneficial. Current and future subjects may not
benefit from new intervention based on current
(inconclusive) study.
Too many subjects:
• Too many subjects unnecessarily exposed to risk.
Should enroll only enough patients to answer
study question, to minimize the discomfort and risk
subjects may be exposed to.
Where to begin?
• Understand the research question
• “What’s the question?”
– The following are NOT research questions:
• We want to “look at” median PFS
• We want to analyze the data
• We want to see if our results are significant
• Need to
– Visualize the final analysis and the statistical methods to be
used
Where to begin?
• Analysis determines sample size
– Sample size calculations are based on the planned method
of analysis
– If you don’t know how the data will be analyzed (e.g., 2-sample t-test), then you cannot accurately estimate the
sample size
Sample Size Calculation
• Formulate a PRIMARY research question
• Identify:
1. A hypothesis to test (write down H0 and HA), or
2. A quantity to estimate (e.g., using confidence intervals)
• Determine the endpoint or outcome measure associated with
the hypothesis test or quantity to be estimated
– How do we “measure” or “quantify” the responses?
– Is the measure continuous, binary, or a time-to-event?
– Is this a one-sample or two-sample problem?
Sample Size Calculation
• Based upon the PRIMARY outcome
• Other analyses (i.e., secondary outcomes) may be planned,
but the study may not be powered to detect effects for these
outcomes
Definitions and Notation
• Null hypothesis (H0): No difference between groups
H0: p1 = p2
H0: μ1 = μ2
• Alternative hypothesis (HA): There is a difference
between groups
HA: p1 ≠ p2
HA: μ1 ≠ μ2
• P-Value: Chance of obtaining observed result or one
more extreme when groups are equal (under H0)
– Test of significance of H0
– Based on distribution of a test statistic assuming H0 is true
– It is NOT the probability that H0 is true
Definitions and Notation
• δ: Measure of the true population difference; must be
estimated. Difference of medical importance
δ = |p1 - p2|
δ = |μ1 - μ2|
• n: Sample size per arm
• N: Total sample size (N = 2n for 2 groups with
equal allocation)
• Type I error: Rejecting H0 when H0 is true
• α: The type I error rate. Maximum p-value considered
statistically significant
• Type II error: Failing to reject H0 when H0 is false
• β: The type II error rate
• Power (1 - β): Probability of detecting group effect given
the size of the effect (δ) and the sample size of the trial (N)
• α = ? β = ? In phase II or phase III?
Sample Size Calculation Using
Hypothesis Testing
• The most common approach
• The idea is to choose a sample size such that both of
the following conditions simultaneously hold:
– If the null hypothesis is true, then the probability of
incorrectly rejecting it is (no more than) α
– If the alternative hypothesis is true, then the probability of
correctly rejecting the null hypothesis is (at least) 1-β = power
Statistical Considerations
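[Figure: regions labeled α and Power = 1-β]
p: probability of a result at least as extreme as the one observed when the test and control groups do not differ
α: probability that an ineffective drug is judged effective
β: probability that an effective drug is judged ineffective
1 - β: statistical power (probability that an effective drug is judged effective)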
Determinants of Sample Size
• α : Regulated by FDA for phase III pivotal trials (0.05)
• β : Up to the investigator (power 1-β is often 80%-90%)
– Not regulated by FDA
• An “effect size” to detect
– Minimum difference that is clinically relevant (for superiority)
• E.g., H0: p1 - p2 = 0 vs. HA: p1 - p2 = 0.20
– Maximum difference that is clinically irrelevant (for noninferiority)
• Estimates of variability
The quantities α, β, δ and N are all interrelated.
Holding all other values constant, what happens to
the power of the study if
• δ increases? Power ↑
• α decreases? Power ↓
• N increases? Power ↑
• variability increases? Power ↓
Note: Typical error rates are α = .05 and β = .1 or .2
(90 or 80% power). Why is α often smaller than β?
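The four directions above can be checked numerically. The following is a minimal sketch, assuming a two-sided z-test comparing two means with equal allocation; the function name `power_two_means` and the baseline values (difference 5, standard deviation 10, 63 per group) are illustrative choices, not part of the original slides.

```python
from math import sqrt
from statistics import NormalDist

def power_two_means(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two means."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standardized distance between the null and the alternative
    shift = abs(delta) / (sd * sqrt(2.0 / n_per_group))
    # The tiny chance of rejecting in the wrong direction is ignored
    return NormalDist().cdf(shift - z_crit)

base = power_two_means(delta=5, sd=10, n_per_group=63)
bigger_delta = power_two_means(delta=7, sd=10, n_per_group=63)
smaller_alpha = power_two_means(delta=5, sd=10, n_per_group=63, alpha=0.01)
bigger_n = power_two_means(delta=5, sd=10, n_per_group=100)
bigger_sd = power_two_means(delta=5, sd=15, n_per_group=63)

print(f"baseline power:        {base:.3f}")
print(f"delta increases:       {bigger_delta:.3f}")
print(f"alpha decreases:       {smaller_alpha:.3f}")
print(f"N increases:           {bigger_n:.3f}")
print(f"variability increases: {bigger_sd:.3f}")
```

Changing one argument at a time while holding the others fixed mirrors the "holding all other values constant" wording of the slide.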
Consideration
• Type of primary endpoint
• 1-sample vs. 2-sample
• Independent samples or paired?
• What is the sample size allocation ratio?
• 1-sided vs. 2-sided
– 2-sided hypothesis:
• H0 : θT = θC vs. HA : θT ≠ θC
– 1-sided hypothesis:
• H0 : θT ≤ θC vs. HA : θT > θC
• H0 : θT ≥ θC vs. HA : θT < θC
Primary Endpoint
• Dichotomous response variables (success and failure)
– Compare event rates PI and PC
• Continuous response variables (blood pressure)
– Compare true mean levels μI and μC
• Time to failure (or occurrence of a clinical event): OS, PFS, RFS, …
– Compare hazard rates λI and λC
– Compare median survival times
Distribution of Test Statistics
• Many have a common form
• θ = population parameter (e.g., difference in means)
• θ̂ = sample estimate
• Then
\[ Z = \frac{\hat{\theta} - E(\hat{\theta})}{SE(\hat{\theta})} \]
• Then Z has a(n approximately) Normal (0,1)
distribution
Test of Hypothesis
• Two sided
e.g. H0: PT = PC vs. HA: PT ≠ PC
Classic test: if |z| > z_{1-α/2}, reject H0
α = .05, z_{1-α/2} = 1.96
• One sided
H0: PT ≤ PC vs. HA: PT > PC
If z > z_{1-α}, reject H0
α = .05, z_{1-α} = 1.645
where z = test statistic and z_{1-α/2}, z_{1-α} = critical values
• Recommend that the critical value be the same in both cases (e.g. 1.96):
two-sided α = .05 or one-sided α = .025; critical value = 1.96 in both
Typical Design Assumptions (1)
1. α = 0.05, 0.025, 0.01
2. Power = 0.80, 0.90
Should be at least .80 for design for Phase III
How about for phase II?
3. δ = smallest difference hope to detect
e.g. δ = PC - PT
= .40 - .30 = .10
25% reduction!
Typical Design Assumptions (2)
Two-Sided Significance Level
α        Z_{1-α/2}
0.05     1.96
0.025    2.24
0.01     2.58

Power
1-β      Z_{1-β}
0.80     0.84
0.90     1.282
0.95     1.645
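The tabulated critical values are quantiles of the standard normal distribution. As a small sketch (the helper names `z_two_sided` and `z_power` are hypothetical, introduced only for illustration), they can be recovered with Python's standard library:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def z_two_sided(alpha):
    """Critical value for a two-sided test at significance level alpha."""
    return std_normal.inv_cdf(1 - alpha / 2)

def z_power(power):
    """Normal quantile corresponding to the desired power 1 - beta."""
    return std_normal.inv_cdf(power)

for alpha in (0.05, 0.025, 0.01):
    print(f"two-sided alpha = {alpha}: z = {z_two_sided(alpha):.3f}")
for power in (0.80, 0.90, 0.95):
    print(f"power = {power}: z = {z_power(power):.3f}")
```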
Binomial Case
• H0: PC = PI
• Test statistic (normal approximation via the CLT):
\[ Z = \frac{\hat{P}_C - \hat{P}_I}{\sqrt{\bar{P}(1-\bar{P})\,(1/N_C + 1/N_I)}} \sim N(0,1) \text{ under } H_0 \]
where
\[ \bar{P} = \frac{N_C \hat{P}_C + N_I \hat{P}_I}{N_C + N_I} \]
• Sample size: Assume
– NI = NC = N
– Ha: d = PC - PI
Binomial Sample Size Formula
\[ N = \frac{\left[ Z_\alpha \sqrt{2\bar{P}(1-\bar{P})} + Z_\beta \sqrt{P_C(1-P_C) + P_I(1-P_I)} \right]^2}{d^2} \]
• P{|Z| > Zα} = α
• P{Z < Zβ} = 1-β
or
\[ N = \frac{2 (Z_\alpha + Z_\beta)^2 \, \bar{P}(1-\bar{P})}{d^2} \]
Example
• H0: PC = PI
• Ha: PC = 0.4, PI = 0.3
• Assume α = 0.05, 1-β = 0.90, i.e. Zα = 1.96, Zβ = 1.282, P̄ = (0.4+0.3)/2 = 0.35
•
\[ N = \frac{\left[ 1.96\sqrt{2(.35)(.65)} + 1.282\sqrt{(.3)(.7) + (.4)(.6)} \right]^2}{(.4 - .3)^2} \approx 476, \quad 2N = 952 \]
• OR
\[ N = \frac{2(1.96 + 1.282)^2 (.35)(.65)}{(.4 - .3)^2} \approx 478, \quad 2N = 956 \]
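The worked example can be reproduced with a short script. This is a sketch under stated assumptions: the function name `n_binomial` is hypothetical, the two expressions are the pooled-plus-unpooled variance formula and its simplified form as read from the slides, and exact normal quantiles are used in place of the rounded 1.96 and 1.282.

```python
from math import sqrt
from statistics import NormalDist

def n_binomial(p_c, p_i, alpha=0.05, power=0.90):
    """Per-group sample size for comparing two proportions (two-sided)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p_c + p_i) / 2          # average event rate under equal allocation
    d = p_c - p_i                    # difference to detect
    # Formula with pooled variance under H0 and unpooled variance under Ha
    full = (z_a * sqrt(2 * p_bar * (1 - p_bar))
            + z_b * sqrt(p_c * (1 - p_c) + p_i * (1 - p_i))) ** 2 / d ** 2
    # Simplified formula using the pooled variance throughout
    simple = 2 * (z_a + z_b) ** 2 * p_bar * (1 - p_bar) / d ** 2
    return full, simple

full, simple = n_binomial(0.4, 0.3)
print(f"full formula:       N = {full:.1f} per group")
print(f"simplified formula: N = {simple:.1f} per group")
```

The two versions differ by only a couple of subjects here, which is why the simplified form is often used for quick planning.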
Approximate* Total Sample Size for Comparing Various
Proportions in Two Groups with Significance Level (α) of
0.05 and Power (1-β) of 0.80 and 0.90

True Proportions              α = 0.05 (one-sided)    α = 0.05 (two-sided)
pC         pI                 1-β       1-β           1-β       1-β
(Control)  (Intervention)     0.90      0.80          0.90      0.80
0.60       0.50               850       610           1040      780
           0.40               210       160           260       200
           0.30               90        70            120       90
           0.20               50        40            60        50
0.50       0.40               850       610           1040      780
           0.30               210       150           250       190
           0.25               130       90            160       120
           0.20               90        60            110       80
0.40       0.30               780       560           960       720
           0.25               330       240           410       310
           0.20               180       130           220       170
0.30       0.20               640       470           790       590
           0.15               270       190           330       250
           0.10               140       100           170       130
0.20       0.15               1980      1430          2430      1810
           0.10               440       320           540       400
           0.05               170       120           200       150
0.10       0.05               950       690           1170      870

*Sample sizes are rounded up to the nearest 10
Comparison of Means
• Some outcome variables are continuous
– Blood Pressure
– Serum Chemistry
– Pulmonary Function
– HAM-D
• Hypothesis tested by comparison of mean values
between groups, or comparison of mean changes
Comparison of Two Means: Test for Equality
• H0: μI = μC, i.e. μI - μC = 0
• Ha: μI - μC = d
• Test statistic: Assume X ~ N(μ, s²)
\[ Z = \frac{\bar{X}_C - \bar{X}_I}{\sqrt{s^2 (1/N_C + 1/N_I)}} \sim N(0,1) \text{ under } H_0 \]
• NC = NI = N
\[ N = \frac{2 (Z_\alpha + Z_\beta)^2 s^2}{d^2} = \frac{2 (Z_\alpha + Z_\beta)^2}{(d/s)^2} \]
Example
• To evaluate the effect of a test drug on cholesterol
in patients with coronary heart disease (CHD)
• Low-density lipoprotein (LDL) cholesterol is the lipid fraction most
directly associated with increased risk of CHD
• Goal: to compare two cholesterol lowering agents
for treatment of patients with CHD
• Endpoint: per cent change in LDL-C
• Detect a difference of 5% between the treatment
groups (d = 5)
Example (Equality)
• Standard deviation = 10% (s = 10)
• Significance level = 0.05 (2α = 0.05)
• 80% power (1-β = 0.80)
• Zα = 1.96, Zβ = 0.84
\[ N = \frac{2 (1.96 + 0.84)^2}{(5/10)^2} = 62.72 \approx 63, \quad 2N = 126 \]
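As a quick check of this arithmetic, the sketch below evaluates N = 2(Zα + Zβ)² / (d/s)² for d = 5 and s = 10. The function name `n_per_group_means` is hypothetical, and exact normal quantiles are used rather than the rounded 1.96 and 0.84, so the unrounded value differs slightly in the second decimal place.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_means(d, s, two_sided_alpha=0.05, power=0.80):
    """Per-group sample size for a test of equality of two means."""
    z_alpha = NormalDist().inv_cdf(1 - two_sided_alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Only the standardized difference d/s matters
    return 2 * (z_alpha + z_beta) ** 2 / (d / s) ** 2

n = n_per_group_means(d=5, s=10)
n_rounded = ceil(n)  # always round up to whole subjects
print(f"N per group = {n:.2f}, rounded up to {n_rounded}; total = {2 * n_rounded}")
```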
Goals of Controlled Clinical
Trials (1)
Superiority Trials
• A controlled trial may demonstrate
efficacy of the test treatment by
showing that it is superior to the control
– No treatment
– Best standard of care
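Comparison of Two Means: Test for Superiority
• H0: μT - μC ≤ M
• Ha: μT - μC > M
• Test statistic: Assume X ~ N(μ, s²)
\[ Z = \frac{\bar{X}_T - \bar{X}_C - M}{\sqrt{s^2 (1/N_T + 1/N_C)}} \sim N(0,1) \text{ under } H_0 \]
• NC = NT = N, with d = μT - μC the true difference under Ha
\[ N = \frac{2 (Z_{1-\alpha} + Z_{1-\beta})^2 s^2}{(d - M)^2} = \frac{2 (Z_{1-\alpha} + Z_{1-\beta})^2}{((d - M)/s)^2} \]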
Example (Superiority)
• Consider M = 1% as the minimal difference of clinical
importance.
• Detect a difference of 5% between the treatment groups
(d = 5)
• Standard deviation = 10% (s = 10)
• Significance level = 0.05 (2α = 0.05)
• 80% power (1-β = 0.80)
• Z1-α = 1.96, Z1-β = 0.84
\[ N = \frac{2 (1.96 + 0.84)^2}{((5 - 1)/10)^2} = \frac{15.68}{0.16} = 98 \]
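To see how the superiority margin M enters the calculation, the sketch below evaluates N = 2(Z1-α + Z1-β)² s² / (d - M)² with and without a margin. The function name `n_superiority` is hypothetical, and the rounded quantiles 1.96 and 0.84 from the slide are passed in as defaults.

```python
def n_superiority(d, margin, s, z_alpha=1.96, z_beta=0.84):
    """Per-group N for H0: muT - muC <= margin vs Ha: muT - muC > margin."""
    effective = d - margin  # the margin shrinks the detectable difference
    if effective <= 0:
        raise ValueError("the true difference d must exceed the margin M")
    return 2 * (z_alpha + z_beta) ** 2 * s ** 2 / effective ** 2

n_m0 = n_superiority(d=5, margin=0, s=10)
n_m1 = n_superiority(d=5, margin=1, s=10)
inflation = n_m1 / n_m0  # equals (d / (d - M))^2

print(f"M = 0: N = {n_m0:.2f} per group")
print(f"M = 1: N = {n_m1:.2f} per group")
print(f"inflation factor from the margin: {inflation:.4f}")
```

Because N scales with 1/(d - M)², even a small margin relative to d raises the required sample size noticeably.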
Comparison of Two Means (M=0)
• H0: μT ≤ μC ⇔ μT - μC ≤ 0
• HA: μT - μC > 0
• Test statistic for sample means ~ N(μ, s²)
\[ Z = \frac{\bar{X}_T - \bar{X}_C}{\sqrt{s^2 (1/N_T + 1/N_C)}} \sim N(0,1) \text{ under } H_0 \]
• Let N = NC = NT for design, consider μT - μC = d
\[ N = \frac{2 (Z_\alpha + Z_\beta)^2 s^2}{d^2} = \frac{2 (Z_\alpha + Z_\beta)^2}{(d/s)^2} \]
• Power
\[ Z_\beta = \sqrt{N/2}\,(d/s) - Z_\alpha \]
Example
• Detect a difference of 5% between the treatment
groups (d = 5)
• Standard deviation = 10% (s = 10)
• Significance level = 0.05 (2α = 0.05)
• 80% power (1-β = 0.80)
• Zα = 1.96, Zβ = 0.84
\[ N = \frac{2 (1.96 + 0.84)^2}{(5/10)^2} = 62.72 \approx 63, \quad 2N = 126 \]
Goals of Controlled Clinical Trials
Non-Inferiority Trials
• A clinical trial comparing a test product with an
active control (without a placebo arm) is designed
as active control non-inferiority trial
• Objective:
– to demonstrate that the test product is not worse than
the active control by more than a pre-specified, small
amount (the non-inferiority margin M).
Goals of Controlled Clinical Trials
Non-Inferiority Trials
• Controlled trial may demonstrate efficacy by
showing the test treatment to be similar in
efficacy to a known effective treatment
– The active control had to be effective under the
conditions of the trials
– New treatment cannot be worse by a pre-specified
amount (M)
– New treatment may not be better than the
standard but may have other advantages
– Cost
– Toxicity
– Invasiveness
Comparison of Two Means (NI)
• H0: μC - μT ≥ M ⇔ μT ≤ μC - M (M > 0)
• HA: μC - μT < M ⇔ μT > μC - M
• Test statistic for sample means ~ N(μ, s²)
\[ Z = \frac{\bar{X}_T - \bar{X}_C + M}{\sqrt{s^2 (1/N_T + 1/N_C)}} \sim N(0,1) \text{ under } H_0 \]
• Let N = NC = NT for design, consider μT - μC = ε
\[ N = \frac{2 (Z_\alpha + Z_\beta)^2 s^2}{(\varepsilon + M)^2} = \frac{2 (Z_\alpha + Z_\beta)^2}{((\varepsilon + M)/s)^2} \]
• Power
\[ Z_\beta = \sqrt{N/2}\,((\varepsilon + M)/s) - Z_\alpha \]
Example
• To evaluate the effect of a test drug on
cholesterol in patients with coronary heart
disease (CHD)
• Low-density lipoprotein (LDL) cholesterol is the lipid fraction most
directly associated with increased risk of CHD
• Goal: to compare two cholesterol lowering
agents for treatment of patients with CHD
Example (cont’)
• Set 2α = .05, 1-β = 0.90 (90% power)
• Suppose a difference of 5% (M = 5%) in percent
change of LDL is considered a clinically
meaningful difference.
• Assume that the standard deviation is 10% (σ =
10%)
Non-inferiority test:
– The non-inferiority margin is chosen to be 5%, i.e.,
M = 0.05
Example (cont’)
Non-inferiority test:
• Assume the true difference in mean LDL between
treatment groups is μT - μC = 0%, i.e., ε = 0%
• The test drug is much cheaper
\[ n_1 = n_2 = \frac{2 (1.96 + 1.282)^2 (0.1)^2}{(0 + 0.05)^2} = 84.08 \]
(round up to 85 per group)
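The non-inferiority calculation can be checked in both directions: solve for N, then plug N back into the power relation. This is a sketch assuming the effective difference in the denominator is ε + M (with ε = 0 the sign convention does not change the result); the function names are hypothetical, and the rounded quantiles 1.96 and 1.282 from the slide are used as constants.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z_ALPHA = 1.96   # 2*alpha = 0.05, as in the example
Z_BETA = 1.282   # 90% power

def n_noninferiority(eps, margin, sd):
    """Per-group N when the true difference is eps and the margin is M."""
    return 2 * (Z_ALPHA + Z_BETA) ** 2 * sd ** 2 / (eps + margin) ** 2

def power_noninferiority(n, eps, margin, sd):
    """Power implied by Z_beta = sqrt(N/2) * (eps + M) / sd - Z_alpha."""
    z_beta = sqrt(n / 2) * (eps + margin) / sd - Z_ALPHA
    return NormalDist().cdf(z_beta)

n = n_noninferiority(eps=0.0, margin=0.05, sd=0.10)
n_up = ceil(n)
print(f"N per group = {n:.2f}, rounded up to {n_up}")
print(f"power with {n_up - 1} per group: {power_noninferiority(n_up - 1, 0.0, 0.05, 0.10):.4f}")
print(f"power with {n_up} per group: {power_noninferiority(n_up, 0.0, 0.05, 0.10):.4f}")
```

Rounding up rather than to the nearest integer is what keeps the achieved power at or above the 90% target.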
Comparing Time to Event Distributions
• Primary efficacy endpoint is the time to an event
• Compare the survival distributions for the two
groups
• Measure of treatment effect is the ratio of the
hazard rates in the two groups (under an exponential
model, the inverse of the ratio of the medians)
• Must also consider the length of follow-up
Time to Failure
• Use a parametric model for sample size
• Exponential model
– S(t) = e^{-λt}, λ = hazard rate
– H0: λC = λI (i.e., H0: λC - λI = 0)
–
\[ N = \frac{2 (Z_\alpha + Z_\beta)^2}{[\log(\lambda_C / \lambda_I)]^2} \]
• No censoring
• Instant recruiting
More General Model
• Not all patients are followed to an event (i.e.,
censoring)
• Patients are recruited over some period of
time (i.e., staggered entry)
•
\[ N = \frac{(Z_\alpha + Z_\beta)^2 \left[ g(\lambda_C) + g(\lambda_I) \right]}{(\lambda_C - \lambda_I)^2} \]
g(λ) is defined as follows.
More General Model
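• Instant recruitment and censored at time T
\[ g(\lambda) = \frac{\lambda^2}{1 - e^{-\lambda T}} \]
• Continual recruitment up to time T and
censored at time T
\[ g(\lambda) = \frac{\lambda^3 T}{\lambda T - 1 + e^{-\lambda T}} \]
• Continual recruitment up to time T0, T > T0,
and censored at T
\[ g(\lambda) = \frac{\lambda^2}{1 - \left[ e^{-\lambda (T - T_0)} - e^{-\lambda T} \right] / (\lambda T_0)} \]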
Example
Assume
• α = 0.05 (two-sided) and 1-β = 0.90
• λC = 0.3 and λI = 0.2
• T = 5 years follow-up
• Recruitment T0 = 3
Example
Sample size results
• No censoring and instant recruiting
N = 128
• Instant recruiting and censored at T
N = 188
• Continual recruitment up to T and censored at
T
N = 310
• Continue recruitment to T0, and censored at T
N = 233
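The four results can be recomputed under the exponential model. The sketch below is one reading of the formulas, stated as assumptions: N = 2(Zα + Zβ)² / [ln(λC/λI)]² with no censoring, and N = (Zα + Zβ)² [g(λC) + g(λI)] / (λC - λI)² otherwise, with g taking the three forms coded here. All function names are hypothetical. With exact normal quantiles and rounding to the nearest integer, these expressions give 128, 188, 310 and 233.

```python
from math import exp, log
from statistics import NormalDist

def z_term(alpha=0.05, power=0.90):
    """(Z_alpha + Z_beta)^2 for a two-sided test."""
    nd = NormalDist()
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) ** 2

def g_instant(lam, T):
    """Instant recruitment, censoring at time T."""
    return lam ** 2 / (1 - exp(-lam * T))

def g_continual(lam, T):
    """Continual recruitment up to T, censoring at T."""
    return lam ** 3 * T / (lam * T - 1 + exp(-lam * T))

def g_staggered(lam, T, T0):
    """Continual recruitment up to T0 < T, censoring at T."""
    return lam ** 2 / (1 - (exp(-lam * (T - T0)) - exp(-lam * T)) / (lam * T0))

def n_log_ratio(lam_c, lam_i, k):
    """No censoring, instant recruiting."""
    return 2 * k / log(lam_c / lam_i) ** 2

def n_general(lam_c, lam_i, g, k):
    """General model with a censoring/recruitment adjustment g."""
    return k * (g(lam_c) + g(lam_i)) / (lam_c - lam_i) ** 2

lam_c, lam_i, T, T0 = 0.3, 0.2, 5.0, 3.0
k = z_term()
sizes = {
    "no censoring, instant recruiting": n_log_ratio(lam_c, lam_i, k),
    "instant recruiting, censored at T": n_general(lam_c, lam_i, lambda l: g_instant(l, T), k),
    "recruitment up to T, censored at T": n_general(lam_c, lam_i, lambda l: g_continual(l, T), k),
    "recruitment up to T0, censored at T": n_general(lam_c, lam_i, lambda l: g_staggered(l, T, T0), k),
}
for label, n in sizes.items():
    print(f"{label}: N = {n:.1f}")
```

The ordering of the results shows the cost of censoring: the later patients enter relative to the end of follow-up, the fewer events are observed and the larger N must be.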
Example
• Control: median survival = 12 months
• Treatment: median survival = 16 months
• T0 = 3, T = 4
• α = 0.05 (two-sided) and 1-β = 0.80
• 0.5 = exp(-λC×1), λC = -ln(0.5) = 0.6931
• 0.5 = exp(-λI×1.33), λI = -ln(0.5)/1.33 = 0.5212
• N = 261 per group
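The conversion from median survival to hazard rate used above follows from S(median) = 0.5 under the exponential model. A minimal sketch, with the hypothetical helper `hazard_from_median` and time measured in years:

```python
from math import log

def hazard_from_median(median_years):
    """Exponential hazard rate implied by a median survival time."""
    # S(t) = exp(-lambda * t) = 0.5 at the median, so lambda = ln(2) / median
    return log(2) / median_years

lam_c = hazard_from_median(12 / 12)        # control: 12 months = 1 year
lam_i_slide = hazard_from_median(1.33)     # treatment, 16 months rounded to 1.33 years
lam_i_exact = hazard_from_median(16 / 12)  # treatment, unrounded
hazard_ratio = lam_i_exact / lam_c         # equals 12/16, the inverse ratio of medians

print(f"lambda_C = {lam_c:.4f}")
print(f"lambda_I = {lam_i_slide:.4f} (using 1.33 years)")
print(f"lambda_I = {lam_i_exact:.4f} (using 16/12 years)")
print(f"hazard ratio = {hazard_ratio:.2f}")
```

Rounding 16 months to 1.33 years shifts λI slightly, which is enough to move the final sample size by a few subjects.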
Simon’s two-stage design
• A priori, decide sample size for Stage 1 and sample size for
Stage 2.
• If reasonable evidence of efficacy is seen by end of Stage 1,
then continue to stage 2.
• Minimizes overall sample size.
• Can terminate at end of stage 1 either because
– treatment very efficacious
– treatment not at all efficacious
• Usually, termination is due to lack of efficacy
Example of Simon’s two-stage design (1)
• Suppose we have a new treatment that we would like to
investigate for efficacy.
– The standard therapy/placebo control has a response rate of 0.25.
(uninteresting level)
– We would be interested in this new therapy if the response rate were
0.50 or greater. (at least some desired level)
• Design assumptions:
– H0: p ≤ 0.25
– H1: p ≥ 0.50
• For power of 80% and with overall type I error rate of 0.05, the
sample size for stage 1 is 9 and for stage 2 is 15.
Example of Simon’s two-stage design (2)
• Rules:
– If 2 or more responses are seen at the end of stage 1, continue to stage 2.
– If 9 or more responses (total) are seen at the end of stage 2, consider
new treatment efficacious.
Randomized phase II trials with a prospective control (1)
• Two-arm two-stage designs
– Arm 1: experimental arm
– Arm 2: control arm
• H0 : p1 ≤ p2 against H1 : p1 > p2
• Example:
– to evaluate the anti-tumor activity of the CD30 antibody SGN-30 combined with GVD (gemcitabine, vinorelbine,
doxorubicin) chemotherapy (Arm 1) compared with GVD
plus placebo (Arm 2) in patients with relapsed/refractory
classical Hodgkin lymphoma (HL).
Randomized phase II trials with a prospective
control (2)
• H0 : p1 ≤ p2 against H1 : p1 > p2
• H0 : p1 = p2 = 0.7 and H1 : p1 = 0.85, p2 = 0.7
• Type I error alpha = 0.15, Power = 1-beta = 0.8
p2      p1      Optimal (n, n1, a1, a)    EN
0.7     0.85    (73, 27, 1, 6)            47.28
Randomized phase II trials with a prospective control (3)
• Stage 1: accrue n1 patients to each arm.
– X1 = number of responders among the n1 patients in Arm 1 (test drug)
– Y1 = number of responders among the n1 patients in Arm 2
– Proceed to stage 2 if X1 − Y1 ≥ a1.
– Otherwise, reject Arm 1 (i.e., fail to reject H0: no trend toward
effectiveness of the test drug has been shown) and stop the trial.
p2      p1      Optimal (n, n1, a1, a)    EN
0.7     0.85    (73, 27, 1, 6)            47.28
Randomized phase II trials with a prospective control (4)
• Stage 2: accrue an additional n2 patients to each arm.
– X2 = number of responders among the n2 patients in Arm 1 (test drug)
– Y2 = number of responders among the n2 patients in Arm 2
– X = X1 + X2 = total number of responders in Arm 1
– Y = Y1 + Y2 = total number of responders in Arm 2
– Accept Arm 1 (i.e., reject H0) for further investigation if X − Y ≥ a.
– Otherwise, reject Arm 1.
p2      p1      Optimal (n, n1, a1, a)    EN
0.7     0.85    (73, 27, 1, 6)            47.28
n2 = 73 - 27 = 46
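The two-stage decision rules can be written out as a small function. This is only an illustrative sketch: the function name `two_stage_decision` and the returned strings are hypothetical, and the default cutoffs a1 = 1 and a = 6 are taken from the design (n, n1, a1, a) = (73, 27, 1, 6).

```python
def two_stage_decision(x1, y1, x2=None, y2=None, a1=1, a=6):
    """Apply the stage 1 and stage 2 rules to observed responder counts.

    x1, y1: responders in Arm 1 and Arm 2 among the first n1 patients per arm.
    x2, y2: responders among the additional n2 patients per arm (if accrued).
    """
    if x1 - y1 < a1:
        return "stop after stage 1: reject Arm 1"
    if x2 is None or y2 is None:
        return "proceed to stage 2"
    if (x1 + x2) - (y1 + y2) >= a:
        return "accept Arm 1 for further investigation"
    return "reject Arm 1"

n, n1 = 73, 27
n2 = n - n1  # additional patients per arm in stage 2

print(f"n2 = {n2}")
print(two_stage_decision(x1=19, y1=19))
print(two_stage_decision(x1=22, y1=19))
print(two_stage_decision(x1=22, y1=19, x2=40, y2=33))
print(two_stage_decision(x1=22, y1=19, x2=35, y2=33))
```

The responder counts in the usage lines are made-up inputs chosen to exercise each branch, not trial data.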
Thank You
for Your Attention!