
V. Sample size
Outline
• What is needed to determine sample size?
• Sample size calculation for:
–Proportions
–Means
–Time-to-event
–Noninferiority
•Adjustments
Sample Size
Question: How many patients do I need?
Answer:
V- 2
It depends
• Study objective – Phase I, II, III
• Test for a difference (superiority)
• Test for equivalence (non-inferiority)
• How treatment effects are measured
– Proportions (e.g. response rate)
– Time to event endpoint (e.g. survival)
– Continuous outcome (e.g. BP)
• Size of the effect or tolerance of the CI
• Chances to reach wrong conclusions
V- 3
It depends
• Based on:
– Objectives
– Hypothesis
– Primary endpoint
– Design
– Method of analysis
• Test procedure
• Confidence interval
V- 4
Consider how results will be analyzed
P-value – Test of significance of H0, based on the distribution
of a test statistic assuming H0 is true
• It is – the probability of observing what we observed, or
something more extreme, if H0 is true
• It is not – the probability that H0 is true
Confidence interval – A range of plausible values of the
population parameter
• It is – an interval that includes the parameter in 95% of trials
• It is not – a 95% probability that the parameter is in the interval
Sample size depends on the planned test procedure or the
confidence interval of interest
V- 5
Where do we begin?
N = (Total Budget / Cost per patient)?
Hopefully not!
V- 6
Sample Size Calculations
• Formulate a PRIMARY question or hypothesis to
test (or determine what you are estimating). Write
down H0 and HA.
• Determine the endpoint. Choose an outcome
measure. How do we “measure” or “quantify” the
responses?
• Envision the analysis
V- 7
                      Truth
Test result           H0 true              H0 false
Reject H0             Type I error (α)     Correct
Do not reject H0      Correct              Type II error (β)
V- 8
What is Needed to Determine the
Sample-Size?
• α
– Up to the investigator (often = 0.05)
– How much type I error can you afford?
• 1-β (power)
– Up to the investigator (often 80%-90%)
– How much type II error can you afford?
V- 9
What is Needed to Determine the
Sample-Size?
• Choosing α and β:
– Weigh the cost of a Type I error versus a Type II error.
• In early phase clinical trials, we often do not want to “miss” a
significant result and thus often consider designing a study for
higher power (perhaps 90%) and may consider relaxing the α
error (perhaps 0.10).
• In order to approve a new drug, the FDA requires significance
in two Phase III trials strictly designed with α error no greater
than 0.05 (Power = 1-β is often set to 80% but the FDA does
not have a rule about power).
V- 10
Sample Size Calculations
• The idea is to choose a sample size such that
both of the following conditions hold:
1. If the null hypothesis is true, then the probability of
incorrectly rejecting it is (no more than) α
2. If the alternative hypothesis is true, then the
probability of correctly rejecting the null is (at least) 1-β =
power.
• Don’t want to over-power the study
– Expensive
– Finds statistical significance for effects that are
clinically irrelevant
V- 11
What is Needed to Determine the
Sample-Size?
• The “minimum difference (between groups) that is
clinically relevant or meaningful”.
– Requires clinical input
– Equivalently, choose the mean for both groups (or the null and
alternative differences)
• Note for equivalence/noninferiority studies, we need the
“maximum irrelevant or non-meaningful difference”.
V- 12
What is Needed to Determine the
Sample-Size?
• Estimates of variability
– Do not know in advance
– Often obtained from prior studies
– Consider a pilot study for this
– SD for each group
• Explore the literature and data from ongoing studies
for estimates needed in calculations
– Get SD for placebo group and treatment group
– May need to validate this estimate later
V- 13
Sample Size Calculations
• Provide numbers for a range of scenarios and
various combinations of parameters (e.g., for
various values of the difference between groups,
combinations of α and β, etc.).
V- 14
Example:
Find N so that if the actual event rate were π we would be
unlikely to observe zero events.
What is the chance of observing zero events among N patients
if the true event rate were π?
$\text{Prob(no events)} = (1 - \pi)^N$

          π = .01   π = .05   π = .10   π = .20
N = 5       .95       .77       .59       .33
N = 10      .90       .60       .35       .11
N = 15      .86       .46       .21       .04
N = 20      .82       .35       .12       .01
N = 30      .74       .21       .04       .001

If 0/30 then either:
1) π = .10 and we observed a rare event (probability = .04)
OR
2) π is really < .10
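The zero-event probabilities in this example can be recomputed with a few lines of Python. This is a minimal sketch that assumes patients are independent with a common event rate; the function name `prob_no_events` is made up for illustration.

```python
# Probability of observing zero events among n patients when the true
# event rate is pi, assuming independent patients: (1 - pi) ** n.
def prob_no_events(pi: float, n: int) -> float:
    return (1.0 - pi) ** n

rates = [0.01, 0.05, 0.10, 0.20]
sizes = [5, 10, 15, 20, 30]

# Print one row per sample size, one column per event rate.
print("N    " + "  ".join(f"pi={pi:.2f}" for pi in rates))
for n in sizes:
    row = "  ".join(f"{prob_no_events(pi, n):7.3f}" for pi in rates)
    print(f"{n:<4d} {row}")

# The 0/30 case discussed in the example, with a true rate of .10.
p_0_of_30 = prob_no_events(0.10, 30)
```

Solving (1 - π)^N ≤ some small probability for N gives the smallest study that would be unlikely to show zero events at a given π.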
V- 15
Sample size for a comparative study
involving the analysis of proportions
Design: RANDOMIZE patients to treatment A or treatment B
Objective: Use the observed difference δ (the estimate of PA – PB)
to determine if PA – PB ≠ 0 by a medically significant amount
Test
H0: PA – PB = 0
H1: PA – PB ≠ 0
One-sided or two-sided alternative
When N patients per treatment are evaluated we will want to
know if the observed δ is significant evidence for us to reject H0,
and declare a treatment difference.
Test procedure: if | δ | is too large then reject H0
V- 16
Four interrelated quantities α, β, ∆, N
α = False-positive rate (type I error) =
Prob (reject H0 | H0 true) =
Prob ( | δ | too large | PA = PB)
β = False-negative rate (type II error) =
Prob (fail to reject H0 | H0 false)
[β is a function of the true PA – PB = ∆]
1 – β = Power or sensitivity = Prob (reject H0 | H0 false)
∆ = Measure of true population difference; must be guessed.
Difference of medical importance.
N = Sample size per arm
Given the test procedure, the quantities are interrelated
e.g., as ∆ increases → easier to detect diff. → 1 – β increases.
V- 17
Conclusions in hypothesis testing –
the truth table
We can characterize the possible outcomes of a
hypothesis test as follows:
                             True State
Test Result                  H0 True (∆=0)        H0 False (∆≠0)
Reject H0 (p<0.05)           Type I Error (α)     No Error (1- β)
Do not reject H0 (p>0.05)    No Error             Type II Error (β)
V- 18
• If the test statistic z is large enough (e.g. falls into the shaded area
of the scale), we believe this result is too large to have come from a
distribution with mean 0 (i.e. Pc = Pt)
• Thus we reject H0: Pc - Pt = 0, accepting that there is at most a 5%
chance that a result this extreme could have come from a distribution
with no difference
V- 19
Example:
Compare response rates for MVP versus CAMP for
metastatic non-small cell lung cancer
                  MVP        CAMP
Patients          NM         NC
Responders        RM         RC
Response rate     RM/NM      RC/NC
δ = RM/NM – RC/NC    Observed difference
∆ = πM - πC          Population difference
For example, δ = 7/20 – 4/21 = 16%
V- 20
V- 21
V- 22
Test of hypotheses
• Classic test
zα = critical value, z = test statistic

               Two sided          One sided
H0             PT = PC            PT ≥ PC
Ha             PT ≠ PC            PT < PC
Reject H0      if |z| > zα        if z > zα

• Recommend
zα be the same value in both cases (e.g. 1.96)
two sided: 2α = .05, zα = 1.96
or
one sided: α = .025, zα = 1.96
V- 23
Typical design assumptions
1. 2α = .05, .025, .01
   Zα = 1.96, 2.24, 2.58
2. Power = .80, .90, .95
   Zβ = 0.84, 1.28, 1.645
   Should be at least .80 for trial design
3. ∆ = smallest difference we expect to detect
   e.g. ∆ = PC - PT = .40 - .30 = .10 (a 25% relative reduction)
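The tabulated Zα and Zβ constants are standard normal quantiles, so they can be recovered rather than looked up. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper names are illustrative only.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def z_alpha_two_sided(two_alpha: float) -> float:
    """Critical value with P(|Z| > z) = two_alpha."""
    return std_normal.inv_cdf(1.0 - two_alpha / 2.0)

def z_beta(power: float) -> float:
    """Quantile with P(Z < z) = power = 1 - beta."""
    return std_normal.inv_cdf(power)

for two_alpha in (0.05, 0.025, 0.01):
    print(f"2*alpha = {two_alpha}: Z_alpha = {z_alpha_two_sided(two_alpha):.3f}")
for power in (0.80, 0.90, 0.95):
    print(f"power = {power}: Z_beta = {z_beta(power):.3f}")
```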
V- 24
Sample size formula two proportions
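$$N = \frac{\left[ Z_\alpha \sqrt{2p(1-p)} + Z_\beta \sqrt{P_C(1-P_C) + P_T(1-P_T)} \right]^2}{\Delta^2}$$
where $p = (P_C + P_T)/2$ is the average of the two proportions and N is the sample size per arm.
• Zα = constant associated with P{|Z| > Zα} = 2α (two sided)
(e.g. 2α = .05, Zα = 1.96)
• Zβ = constant associated with 1 – β; P{Z < Zβ} = 1 - β
(e.g. 1 - β = .90, Zβ = 1.282)
• Simplified version
$$N = \frac{2 (Z_\alpha + Z_\beta)^2 \, p(1-p)}{\Delta^2}$$
• Standardized delta
$$N = \frac{(Z_\alpha + Z_\beta)^2}{\left( \Delta / \sqrt{2p(1-p)} \right)^2}$$
• Solve for Zβ ⇒ 1 - β, or for ∆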
V- 25
Simple example
• H0: PC = PT
• HA: PC = .40, PT = .30
∆ = .40 - .30 = .10
• Assume
α = .05
1 - β = .90
Zα = 1.96 (Two-sided)
Zβ = 1.282
• p = ((.40 + .30)/2) = .35
V- 26
Simple example (2)
Thus
a.
$$N = \frac{\left[ 1.96 \sqrt{2(.35)(.65)} + 1.282 \sqrt{(.3)(.7) + (.4)(.6)} \right]^2}{(.4 - .3)^2} = 476$$
2N = 952
b.
$$N = \frac{(1.96 + 1.282)^2}{\left( (.4 - .3) / \sqrt{2(.35)(.65)} \right)^2} = 478$$
2N = 956
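Both versions of the calculation can be checked numerically. The following is a minimal sketch of the two per-arm formulas for comparing proportions; the function names are hypothetical, and the Z constants are passed in as the rounded values used on the slides.

```python
from math import sqrt

def n_per_arm_full(p_c: float, p_t: float, z_alpha: float, z_beta: float) -> float:
    """Per-arm N using separate variances under H0 and HA."""
    p_bar = (p_c + p_t) / 2.0          # average proportion, used under H0
    delta = p_c - p_t
    numerator = (z_alpha * sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_beta * sqrt(p_c * (1.0 - p_c) + p_t * (1.0 - p_t))) ** 2
    return numerator / delta ** 2

def n_per_arm_simple(p_c: float, p_t: float, z_alpha: float, z_beta: float) -> float:
    """Per-arm N using the pooled variance 2*p*(1-p) throughout."""
    p_bar = (p_c + p_t) / 2.0
    delta = p_c - p_t
    return 2.0 * (z_alpha + z_beta) ** 2 * p_bar * (1.0 - p_bar) / delta ** 2

n_a = n_per_arm_full(0.40, 0.30, 1.96, 1.282)    # version (a)
n_b = n_per_arm_simple(0.40, 0.30, 1.96, 1.282)  # version (b)
print(f"(a) N = {n_a:.1f} per arm, (b) N = {n_b:.1f} per arm")
```

The slides report these results rounded to the nearest integer; when fixing an actual enrollment target one would usually round up instead.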
V- 27
Approximate* total sample size for comparing various
proportions in two groups with significance level (α) of
0.05 and power (1-β) of 0.80 and 0.90
True Proportions                 α=0.05 (two-sided)      α=0.05 (one-sided)
pC           pI                  1-β        1-β          1-β        1-β
(Control)    (Intervention)      0.90       0.80         0.90       0.80
0.60         0.50                1040       780          850        610
0.60         0.40                260        200          210        160
0.60         0.30                120        90           90         70
0.60         0.20                60         50           50         40
0.50         0.40                1040       780          850        610
0.50         0.30                250        190          210        150
0.50         0.25                160        120          130        90
0.50         0.20                110        80           90         60
0.40         0.30                960        720          780        560
0.40         0.25                410        310          330        240
0.40         0.20                220        170          180        130
0.30         0.20                790        590          640        470
0.30         0.15                330        250          270        190
0.30         0.10                170        130          140        100
0.20         0.15                2430       1810         1980       1430
0.20         0.10                540        400          440        320
0.20         0.05                200        150          170        120
0.10         0.05                1170       870          950        690
*Sample sizes are rounded up to the nearest 10
V- 28
Sample size
Sample size is very sensitive to values of ∆
e.g. To detect the difference from 25% to 40%
N = 205 per arm for α = .05, β = .10
Over 270 fewer per arm than to detect from 30% to 40% (N = 478)
Large numbers are required if we want high power to
detect small differences
Trials designed to show no difference require very large N
because the ∆ to rule out is small
Consider
i) Current knowledge
ii) Likely improvement
iii) Feasibility – available accrual
V- 29
Sample size (2)
Present a range of values
i.e. for several N & ∆ give the power
for several ∆ & 1- β give the required N
See “Friedman article on sample size”
(class website) for adjustments for lost to
follow-up (or non-adherence) to randomized
treatments
$$N^* = N \times \left( \frac{1}{1 - LFU} \right)^2$$
LFU = Fraction of cases expected to be lost to follow-up
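As an illustration of the lost-to-follow-up inflation, the sketch below applies the squared factor to a per-arm N. The function name `inflate_for_loss` and the 10% loss figure are assumptions chosen for the example, not values from the slides.

```python
from math import ceil

def inflate_for_loss(n: float, lfu: float) -> int:
    """Inflate sample size n by (1 / (1 - lfu))**2, rounding up."""
    if not 0.0 <= lfu < 1.0:
        raise ValueError("lfu must be a fraction in [0, 1)")
    return ceil(n * (1.0 / (1.0 - lfu)) ** 2)

# Hypothetical scenario: 478 per arm and 10% expected loss to follow-up.
n_star = inflate_for_loss(478, 0.10)
print(f"Adjusted N per arm: {n_star}")
```

Because the factor is squared, even modest loss rates add noticeably to the required N.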
V- 30
V- 31
Comparison of two means
• H0: µC = µT ⇔ µC - µT = 0
• HA: µC - µT = ∆
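• Test statistic for sample means ~ N (µ, σ)
$$Z = \frac{\bar{X}_C - \bar{X}_T}{\sqrt{\sigma^2 (1/N_C + 1/N_T)}} \sim N(0,1) \text{ under } H_0$$
• Let N = NC = NT for design
$$N = \frac{2 (Z_\alpha + Z_\beta)^2 \sigma^2}{\Delta^2}$$
• Standardized ∆
$$N = \frac{(Z_\alpha + Z_\beta)^2}{\left( \Delta / (\sqrt{2}\,\sigma) \right)^2}$$
• Power
$$Z_\beta = \sqrt{N/2}\,(\Delta / \sigma) - Z_\alpha$$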
V- 32
Two Groups
Or
µ0 = µC − µT = ∆0 = 0
µ1 = µC − µT = ∆1
$$Z = \frac{\bar{X}_C - \bar{X}_T}{\sigma \sqrt{2/n}} \sim N(0,1)$$
V- 33
Example
IQ scores
σ = 15
∆ = 0.3x15 = 4.5
• Set α = .05
β = 0.10
1 - β = 0.90
• HA: ∆ = 0.3σ ⇔ ∆ / σ = 0.3
• Sample Size
$$N = \frac{2 (1.96 + 1.282)^2}{(0.3)^2} = \frac{2 (10.51)}{(0.3)^2} = \frac{21.02}{(0.3)^2}$$
• N = 234
• ⇒ 2N = 468
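The IQ example can be reproduced with the two-means formula, and the power relation can be used to confirm the result. This is a sketch under the slide's assumptions (common known σ, equal group sizes); the function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group_means(delta: float, sigma: float,
                      z_alpha: float, z_beta: float) -> float:
    """Per-group N = 2 (Z_alpha + Z_beta)^2 sigma^2 / delta^2."""
    return 2.0 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2

def power_two_means(n: int, delta: float, sigma: float, z_alpha: float) -> float:
    """Power from Z_beta = sqrt(n / 2) * (delta / sigma) - Z_alpha."""
    z_b = sqrt(n / 2.0) * (delta / sigma) - z_alpha
    return NormalDist().cdf(z_b)

# IQ example: sigma = 15, delta = 0.3 * sigma = 4.5, two-sided alpha = .05.
n_iq = n_per_group_means(4.5, 15.0, 1.96, 1.282)
n_iq_rounded = ceil(n_iq)
achieved = power_two_means(n_iq_rounded, 4.5, 15.0, 1.96)
print(f"N per group = {n_iq_rounded}, achieved power = {achieved:.3f}")
```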
V- 34
V- 35
Comparing time to event distributions
• Primary efficacy endpoint is the time to an
event
• Compare the survival distributions for the
two groups
• Measure of treatment effect is the ratio of the
hazard rates in the two groups (under exponential survival,
this equals the inverse ratio of the medians)
• Must also consider the length of follow-up
V- 36
Assuming exponential survival
distributions
• If $P(T > t) = e^{-\lambda t}$, where λ = λ1 in group 1,
λ = λ2 in group 2, let
H0: λ2 = λ1 ⇒ λ1/λ2 = 1
Ha: λ2 ≠ λ1 ⇒ λ1/λ2 ≠ 1
• Then define the effect size by
∆ = λ1/λ2 = med2/med1, where medi = -ln(.5)/λi
• Standardized difference
$\ln(\Delta) / \sqrt{2}$
V- 37
Assuming exponential survival
distributions
• The statistical test is powered by the total number
of events observed at the time of the analysis, d.
$$d = \frac{4 (Z_\alpha + Z_\beta)^2}{[\ln(\Delta)]^2}$$
V- 38
Converting number of events (d) to
required sample size (2N)
• d = 2N x P(event) ⇒
2N = d / P(event)
• P(event) is a function of the length of total follow-up at
time of analysis and the average hazard rate.
• Let AR = accrual rate (patients per year)
A = period of uniform accrual (2N=AR x A)
F = period of follow-up after accrual
A/2 + F = average total follow-up at planned analysis
λ = average hazard rate
• Then $P(\text{event}) = 1 - P(\text{no event}) = 1 - e^{-\lambda (A/2 + F)}$
V- 39
Survival example
Lung cancer trial: median survival = 1 year.
Test with 80% power whether a new treatment can improve
median survival to 1.5 years.
d = 4 (1.96 + .84)2 / (ln 1.5)2 = 191 events
How many patients accrued during A years and followed
for an additional F years will provide 191 events?
λ= ((-ln(.5)/1) + (-ln(.5)/1.5))/2 = (.693 + .462)/2 = .578
For each A and F we can calculate P(event) to obtain the
number of required patients (2N).
V- 40
Survival example (2)
For example, A = 2; F = 2;
λ = .578 (avg. 1 yr med. and 1.5 yr med.)
$P(\text{event}) = 1 - e^{-\lambda (A/2 + F)} = 1 - e^{-.578 (2/2 + 2)} = 1 - e^{-.578 (3)} = .823$
To get 191 events for A = 2 and F = 2, we need
2N = 191 / P(event) = 191 / .823 = 232
Because 232 patients are accrued during 2 years,
an annual accrual rate (AR) of 116 pts per year is implied.
Balance AR, A, F to get sample size (2N) needed
to produce the required number of events.
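The steps of the survival example (events required, average hazard, probability of an event, total patients) can be sketched in Python as follows. The function names are hypothetical; the model assumes exponential survival and uniform accrual, as on the slides.

```python
from math import exp, log

def required_events(hazard_ratio: float, z_alpha: float, z_beta: float) -> float:
    """Total events d = 4 (Z_alpha + Z_beta)^2 / [ln(hazard ratio)]^2."""
    return 4.0 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2

def prob_event(avg_hazard: float, accrual_years: float, followup_years: float) -> float:
    """P(event) = 1 - exp(-lambda * (A/2 + F)) with uniform accrual over A years."""
    return 1.0 - exp(-avg_hazard * (accrual_years / 2.0 + followup_years))

d = required_events(1.5, 1.96, 0.84)          # medians 1.5 vs 1 year
lam = (log(2) / 1.0 + log(2) / 1.5) / 2.0     # -ln(.5) equals ln(2)
p_event = prob_event(lam, 2.0, 2.0)           # A = 2, F = 2
total_n = 191 / p_event                       # 191 events, as on the slide
print(f"d = {d:.1f}, lambda = {lam:.3f}, P(event) = {p_event:.3f}, 2N = {total_n:.0f}")
```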
V- 41
Survival example
Total sample size required to detect a 1.5 year median
versus a 1 year median with 2α = .05 and power = 80%
                       Years of additional follow-up
Years of accrual       1        2        3
1                      330      250      221
2                      279      232      212
3                      250      221      207
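A grid like this one can be generated by looping over accrual and follow-up periods. The sketch below assumes 191 required events, an unrounded average hazard, and rounding up to whole patients. The slides do not state their rounding convention, so a cell may differ from the table by one patient.

```python
from math import ceil, exp, log

D_EVENTS = 191                                   # required events, from the slides
LAMBDA = (log(2) / 1.0 + log(2) / 1.5) / 2.0     # average hazard rate per year

def total_patients(accrual_years: float, followup_years: float) -> int:
    """Total sample size 2N = d / P(event), rounded up (an assumption)."""
    p_event = 1.0 - exp(-LAMBDA * (accrual_years / 2.0 + followup_years))
    return ceil(D_EVENTS / p_event)

# Keys are (years of accrual, years of additional follow-up).
grid = {(a, f): total_patients(a, f) for a in (1, 2, 3) for f in (1, 2, 3)}

print("accrual  F=1  F=2  F=3")
for a in (1, 2, 3):
    print(f"{a:<8d} " + "  ".join(f"{grid[(a, f)]:3d}" for f in (1, 2, 3)))
```

Longer follow-up and longer accrual both raise the probability that each patient has an event, so fewer patients are needed for the same number of events.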
V- 42
Sample size for confidence interval (CI)
$\bar{X}$ = Sample mean
S = Standard deviation
T = Critical value from t-distribution
C.I.: $\bar{X} \pm T S / \sqrt{N}$
If L = desired half-width, i.e., magnitude of the permitted
difference between sample mean and population mean,
$L = \text{tolerance} = |\bar{X} - \mu|$
then
$$N = \frac{T^2 S^2}{L^2}$$
will give desired tolerance
• 95% CI use $T^2 = 4$
• 99% CI use $T^2 = 6.6$
• Guess $S^2$
If $\bar{X}$ estimates the response rate P
$$N = \frac{T^2 P (1 - P)}{L^2}$$
Example: to obtain a 95% C.I. with L = .1 for a P around .5
$$N = \frac{4 (.5)(.5)}{(.1)^2} = 100$$
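The precision-based calculation is simple enough to script. This sketch uses the rounded constants from the slide (T² = 4 for 95%, 6.6 for 99%); the function names are made up for illustration.

```python
def n_for_mean_ci(t_squared: float, s_squared: float, half_width: float) -> float:
    """N = T^2 S^2 / L^2 for a CI on a mean with half-width L."""
    return t_squared * s_squared / half_width ** 2

def n_for_proportion_ci(t_squared: float, p: float, half_width: float) -> float:
    """Same formula with S^2 = P (1 - P) for a response rate."""
    return n_for_mean_ci(t_squared, p * (1.0 - p), half_width)

n_95 = n_for_proportion_ci(4.0, 0.5, 0.1)   # 95% CI, P near .5, L = .1
n_99 = n_for_proportion_ci(6.6, 0.5, 0.1)   # 99% CI, same P and L
print(f"95% CI: N is about {n_95:.0f}; 99% CI: N is about {n_99:.0f}")
```

Using P = .5 is the conservative choice, since P(1 - P) is largest there.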
V- 43
Equivalence (non-inferiority) designs
• To demonstrate new therapy is “as good as” standard
(new is less invasive, less toxic or cheaper than standard)
• Cannot prove H0: ∆=0; instead specify the maximum acceptable difference ∆0
• For proportion endpoint (π), the hypotheses are
H0: πstd > πnew + ∆0
(std is better than new by more than ∆0)
Ha: πstd < πnew + ∆0
(new is good enough)
• Wish to achieve power of 1 - β when the true difference is
πstd - πnew = ∆a (often = 0)
V- 44
Equivalence (non-inferiority) designs
• Then the sample size required in each group is:
$$N = \frac{(Z_\alpha + Z_\beta)^2 \left[ \pi_{new}(1 - \pi_{new}) + \pi_{std}(1 - \pi_{std}) \right]}{(\Delta_0 - \Delta_a)^2}$$
• Sensitive to the definition of equivalence, ∆0
• Report results using confidence intervals
V- 45
Equivalence (non-inferiority) designs
• If the upper limit of the CI for πstd - πnew lies below ∆0 then
we are confident that the new Rx is not too much worse
than the std Rx.
• Accept new Rx if U1-α ≤ ∆0
[Figure: number line marking 0, the upper confidence limit U1-α, and ∆0, with U1-α < ∆0]
• Sample size: Choose N per group to ensure 1-β probability
that U1-α is below ∆0 if true difference is πstd - πnew = 0.
$$N = \frac{(Z_\alpha + Z_\beta)^2}{\left( \Delta_0 / \sqrt{\pi_{new}(1-\pi_{new}) + \pi_{std}(1-\pi_{std})} \right)^2} \text{ per arm}$$
V- 46
Equivalence trial example
Lallemant et al. conducted an RCT in Thailand to test the
equivalence of the 076 regimen and three abbreviated
ZDV regimens (NEJM 2000;343:982-991)
Let ∆ = pT - pS
– where pT and pS are the HIV transmission rates in
the new Rx and standard Rx groups. Note that the
standard treatment is the 076 regimen.
Choose a maximal difference, ∆0, that is considered
acceptable.
V- 47
Equivalence trials
Define the Null Hypothesis as
– H0: pT - pS > ∆0 (transmission with new treatment
is higher than [not equivalent to] standard therapy)
and
– HA: pT - pS < ∆0 (new treatment is (approximately)
equivalent to standard treatment)
Note that the test is one-sided.
V- 48
The Lallemant trial
Because the reference treatment in this study was a
regimen of established efficacy, we tested for
equivalent efficacy of the experimental treatments,
choosing a threshold for equivalence that would
balance public health concerns with clinical benefits.
Using a cost-effectiveness approach, we
determined that an absolute increase of 6 percent in
the rate of transmission of HIV infection would be the
limit beyond which the clinical risk associated with
the experimental treatment would not be balanced by
its economic and logistical advantages.
V- 49
The Lallemant trial
With this criterion for equivalence, we calculated
that a sample of 1,398 mother-infant pairs would
be required to provide more than 90 percent
statistical power with a 5 percent one-sided type I
error and an 11 percent overall transmission rate.
Equivalence would be established if the upper
limit of the one-sided 95 percent confidence
interval for the arithmetic difference in the
percentage rates of HIV transmission was less
than 6.
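As a rough check, the per-group non-inferiority formula from the earlier slides can be evaluated with the design numbers quoted here. The sketch assumes an 11 percent transmission rate in both groups and a true difference of zero. Those two inputs are assumptions for illustration, since the quoted text gives only the overall rate, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_noninferiority(pi_new: float, pi_std: float, margin: float,
                     true_diff: float, alpha_one_sided: float, power: float) -> float:
    """Per-group N = (Z_a + Z_b)^2 [pi_new(1-pi_new) + pi_std(1-pi_std)] / (margin - true_diff)^2."""
    z_a = NormalDist().inv_cdf(1.0 - alpha_one_sided)
    z_b = NormalDist().inv_cdf(power)
    variance = pi_new * (1.0 - pi_new) + pi_std * (1.0 - pi_std)
    return (z_a + z_b) ** 2 * variance / (margin - true_diff) ** 2

# Assumed inputs: 11% in both groups, margin 6%, one-sided alpha 5%, power 90%.
n_group = n_noninferiority(0.11, 0.11, 0.06, 0.0, 0.05, 0.90)
n_group_rounded = ceil(n_group)
print(f"About {n_group_rounded} per group")
```

Under these assumptions the formula gives 466 per group. Three times that figure is 1,398, although the quoted text does not say how the total was assembled.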
V- 50
The COBALT Trial
The COBALT trial was an equivalence trial
designed to show that double-bolus
alteplase was equivalent to accelerated
infusion of alteplase for the treatment of
acute MI.
The criterion for equivalence was a difference
in mortality rates no greater than 0.4
percent!
V- 51
The COBALT Trial
7,169 patients were randomized. 30-day
mortality rates were 7.98 percent in the
double-bolus group and 7.53 percent in the
accelerated-infusion group, an unfavorable
absolute difference of 0.44 percent. Because
the one-sided 95 percent confidence limit for
the difference in mortality rates exceeded the
prespecified limit, the authors concluded that
double-bolus alteplase had not been shown to
be equivalent to an accelerated infusion of
alteplase.
V- 52
GUSTO III
GUSTO III was a superiority trial to test the
hypothesis that double-bolus administration of
reteplase would reduce 30-day mortality by 20
percent relative to an accelerated infusion of
alteplase.
In 15,059 randomized patients, the 30-day mortality
rates were 7.47 percent for reteplase and 7.24
percent for alteplase - an absolute difference of
0.23 percent (two-sided 95 percent confidence
interval, -0.66 to 1.10 percent). Thus the GUSTO
trial failed to reject the null hypothesis of equal
30-day mortality rates in the two treatment groups.
V- 53
Comparing results for
30-day mortality
Trial        Accelerated alteplase (standard)    Comparison regimen                Conclusion
COBALT       7.53%                               Double-bolus alteplase: 7.98%     Failed to demonstrate equivalence of DB alteplase
GUSTO III    7.24%                               Double-bolus reteplase: 7.47%     Failed to show superiority of DB reteplase
V- 54
How did this happen?
The COBALT investigators chose such a
stringent standard for equivalence that an
equally effective regimen was unlikely to be
“proven” equivalent.
GUSTO does appear to show similar efficacy
of the two regimens. Thus it is appropriate
not to infer superiority.
V- 55
Interpreting superiority and
equivalence trials
Suppose that we have three treatment options,
control (C), standard (S), and new therapy (N).
To evaluate the new therapy, one can
A: Perform a superiority trial comparing N to C.
B: Perform an equivalence trial comparing N to S.
V- 56
What can we learn?
In a superiority trial, we test the null hypothesis that
N and C are equally effective. If we reject H0, we
can infer that N is superior to C.
The superiority trial provides no internal comparison
of N and S. We cannot say, for example, that N is
as good as S.
V- 57
What can we learn?
• In an equivalence trial, we test the null hypothesis
that N is inferior to S. If we reject the null
hypothesis, we can infer that N is, to a specified
magnitude, equivalent to S.
• The equivalence trial provides no internal
comparison of N to C.
V- 58
Sample Size Calculations
• Expect things to go wrong and plan for this BEFORE the
study begins (and in sample- size calculations).
• Think about the challenges of reality and the things that
they don’t normally teach you about in the classroom.
• Sample size estimates need to incorporate adjustments for
real problems that occur but are not often a part of the
standard formulas.
V- 59
Sample Size Calculations
• Each of these can affect sample size
calculations:
– Missing data (imputation?)
– Noncompliance
– Multiple testing
– Unequal group sizes
– Use of nonparametric testing (vs. parametric)
– Noninferiority or equivalence trial
– Inadvertent enrollment of ineligible subjects
V- 60
Adjustment for Lost-to-Follow-up
• Calculate sample size N.
• Let x = proportion expected to be lost to follow-up.
• $N_{adj} = N / (1 - x)^2$
• One still needs to consider the potential bias of examining only
subjects with non-missing data.
• Note: no adjustment is necessary if you plan to impute missing values.
If you use imputation, an adjustment for dilution effect may be
warranted (for example when treating anyone with missing data as a
“failure” or a “non-responder”).
V- 61
Adjustment for Dilution Effect
• Adjustment for dilution effect due to noncompliance or inclusion (perhaps inadvertently) of
subjects that cannot respond:
– Calculate sample size N.
– Let x=proportion expected to be non-compliant.
– $N_{adj} = N / (1 - x)^2$
V- 62
Adjustment for Unequal Allocation
• Adjustment for unequal allocation in two groups:
– Let QE and QC be the sample fractions such that
QE+QC=1.
– Calculate sample size Nbal for equal sample sizes (i.e.,
QE=QC=0.5)
– $N_{unbal} = N_{bal} \left( (Q_E^{-1} + Q_C^{-1}) / 4 \right)$
– Note power is optimized when QE=QC=0.5
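The unequal-allocation adjustment can be illustrated with a short function. In this sketch the balanced figure is treated as a total sample size and a 2:1 allocation is used; both choices, and the function name, are assumptions for the example.

```python
def n_unbalanced(n_balanced: float, q_e: float, q_c: float) -> float:
    """N_unbal = N_bal * (1/Q_E + 1/Q_C) / 4, with Q_E + Q_C = 1."""
    if abs(q_e + q_c - 1.0) > 1e-9:
        raise ValueError("allocation fractions must sum to 1")
    return n_balanced * (1.0 / q_e + 1.0 / q_c) / 4.0

# Hypothetical case: balanced total of 956, then 2:1 allocation.
n_2_to_1 = n_unbalanced(956, 2.0 / 3.0, 1.0 / 3.0)
n_equal = n_unbalanced(956, 0.5, 0.5)
print(f"2:1 allocation total: {n_2_to_1:.1f}; 1:1 allocation total: {n_equal:.1f}")
```

With 2:1 allocation the inflation factor is 1.125, so the total rises by 12.5 percent relative to the balanced design.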
V- 63
Adjustment for Nonparametric Testing
• Most sample-size calculations are performed using
parametric methods
• Adjustment for use of nonparametric test instead of a
parametric test (Pitman Efficiency) in case parametric
assumptions do not hold (Note: applicable for 1 and 2
sample t-tests):
– Calculate sample size Npar for 1 or 2 sample t-test.
– Nnonpar = Npar /(0.864)
V- 64
Adjustment for
Equivalence/Noninferiority Studies
• Calculate sample size for standard superiority trial
but reverse the roles of α and β.
• Works for large sample binomial and the normal.
Does not work for survival data.
V- 65
Other Points to Remember
• Trials are usually powered based on an efficacy
endpoint.
– Thus a trial may be underpowered to detect differences
in safety parameters (particularly rare events or small
differences)
• Sample-size re-estimation has become common
– Since estimates are used in original calculation
– Must be done very carefully
V- 66