7/9/2014
Institute for Clinical and Translational Research
Introduction to Clinical Research
Power and Sample Size:
Considerations in Study Design
Nae-Yuh Wang, PhD
Associate Professor of Medicine, Biostatistics &
Epidemiology
July 16, 2014
ACKNOWLEDGEMENT
• John McGready
Department of Biostatistics
Bloomberg School of Public Health
LECTURE TOPICS
• Sample size and statistical inferences
• Factors influencing sample size and power
• Sample size determination when comparing two
means
• Sample size determination when comparing two
proportions
Institute for Clinical and Translational Research
Introduction to Clinical Research
Section A
Sample Size and Statistical Inferences
WHY IS SAMPLE SIZE RELEVANT?
• A small sample will not allow precise estimates, so we cannot tell whether an observed difference is real or due to sampling variability. We may call things equal when we shouldn't
• A very large sample will allow us to detect tiny differences that have no clinical significance, wasting research resources
• A very large sample may still not be large enough if the event rate is very small
• Sometimes the available sample size is fixed, and we have to decide what we will be able to do with such a sample
INTERVAL ESTIMATION
Interval estimation: use the sample mean and SE (the SD of the sampling distribution) to construct a 95% confidence interval for the population mean
• The sampling distribution becomes more and more like a normal (Gaussian) distribution as N grows
• 95% CI = sample mean +/- 1.96*SE
         = sample mean +/- 1.96*SD/√N
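The CI formula above can be checked with a short Python sketch (a cross-check, not part of the original slides; the sample values are hypothetical):

```python
import math
from statistics import NormalDist

def ci_95(mean, sd, n):
    """95% CI for a population mean: sample mean +/- 1.96 * SD/sqrt(N)."""
    z = NormalDist().inv_cdf(0.975)  # = 1.96
    se = sd / math.sqrt(n)           # SE = SD/sqrt(N)
    return (mean - z * se, mean + z * se)

# Hypothetical sample: mean SBP 120, SD 15, N = 100 -> SE = 1.5
lo, hi = ci_95(120, 15, 100)
print(round(lo, 2), round(hi, 2))  # 117.06 122.94
```

Note how the interval narrows as N grows: quadrupling N halves the SE, and hence halves the CI width.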
WHY IS SAMPLE SIZE RELEVANT?
ESTIMATION
Estimation: What is the prevalence of obesity among adults in Baltimore City?
• How precise do we want our estimate to be?
• 95% CI = p +/- 1.96*SE(p)
         = p +/- 1.96*SD/√n
         = p +/- 1.96*√(p(1-p)/n)
• A larger sample gives a more precise estimate, but requires more resources
ESTIMATION PRECISION
• Estimate the percentage of adults who are obese in such a way that I have 95% confidence that the true percentage is within +/- 10% of my estimate
• Define the parameters of variables such as adults (e.g., age range) and obese (e.g., BMI > 30 kg/m²)
• Use the 95% confidence interval as the measure of estimation precision
• Set the acceptable precision level e equal to the 95% confidence limit, 1.96 × SE
ESTIMATION PRECISION
e = 1.96 × SD/√n  ⟹  n = (1.96)² SD² / e²
Commonly seen formula:
n = 4pq / e²
n: size of sample
e: level of acceptable precision
p: expected prevalence value
q = 1 - p
SD² = Variance = pq
4: constant approximating (1.96)², the 95% level of confidence
ESTIMATION PRECISION
• Say p = .30 (from literature or pilot data)
  e = .10 (determined subjectively; level of minimal clinical significance)
  n = 4(.3)(.7)/(.1)² = .84/.01 = 84
• If we wanted precision of +/- 5%:
  e = .05
  n = 4(.3)(.7)/(.05)² = .84/.0025 = 336
• Greater estimation precision → larger sample size
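The two calculations above can be verified with a short Python sketch (illustrative, not part of the original slides; the function name is mine):

```python
import math

def n_for_proportion(p, e):
    """n = 4*p*q/e^2: sample size to estimate a proportion p to within
    +/- e with ~95% confidence (round() guards against float error)."""
    q = 1 - p
    return math.ceil(round(4 * p * q / e**2, 8))

print(n_for_proportion(0.30, 0.10))  # 84
print(n_for_proportion(0.30, 0.05))  # 336
```

Halving e quadruples the required n, since e enters the formula squared.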
ESTIMATION PRECISION
• Estimating the difference in a single parameter (e.g., mean) between 2 populations
• Example: What is the difference in bone mineral density (lumbar spine) between European and African American women, ages 25-34?
ESTIMATION PRECISION
• Commonly used formula:
n = 8 SD² / e² = 4 (2 SD²) / e²
n = size of sample for each population
e = level of acceptable precision
SD = population standard deviation; SD² the variance
2 SD² reflects the fact that statistical inference on the difference of two independent populations is subject to two sources of variability
8 = constant equivalent to the 95% confidence level, reflecting the 2 sources of variability
** Assumption: the populations we are comparing have the same variance, i.e., SDB² = SDW² = SD²
ESTIMATION PRECISION
• Use formula: n = 8 SD² / e²
Say SD = 1 (based on literature or pilot data)
    e = .5 (i.e., tolerable error = +/- 0.5 SD)
    n = 8 (1/.5)² = 8(4) = 32 (per group)
Say SD = 2
    e = .25 (i.e., tolerable error = +/- 0.125 SD)
    n = 8 (2/.25)² = 8(64) = 512 (per group)
• Greater population variability → larger sample size
• Greater estimation precision → larger sample size
• Higher confidence level (smaller α error) → larger sample size
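The two-group version can be checked the same way (a sketch, not part of the original slides):

```python
import math

def n_per_group(sd, e):
    """n = 8*SD^2/e^2: per-group sample size to estimate a difference of
    two means to within +/- e with ~95% confidence (equal variances)."""
    return math.ceil(round(8 * sd**2 / e**2, 8))

print(n_per_group(1, 0.5))   # 32
print(n_per_group(2, 0.25))  # 512
```

Doubling SD while halving e multiplies n by 16, since both enter the formula squared.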
EXAMPLE
• Consider the following results from a study done on 29 women, all 35-39 years old

Sample Data      n    Mean SBP   SD of SBP
OC users         8    132.8      15.3
Non-OC users    21    127.4      18.2
EXAMPLE
• Hypothesis: OC use is associated with higher blood pressure
• Statistically speaking, we are interested in testing:
H0: µOC = µNO OC, i.e., H0: µOC - µNO OC = 0
HA: µOC ≠ µNO OC, i.e., HA: µOC - µNO OC ≠ 0
µOC represents the population mean SBP for OC users
µNO OC represents the population mean SBP for women not using OC
STATISTICAL POWER
                        TRUTH
DECISION          H0                HA
Reject H0         Type I Error      Power
                  (alpha-level)     (1 - beta)
Not Reject H0                       Type II Error
                                    (beta)
EXAMPLE
• The sample mean difference in blood pressures is 132.8 - 127.4 = 5.4
• This could be considered scientifically (clinically) significant; however, the result is not statistically significant (or even close to it!) at the α = 0.05 level
• Suppose, as a researcher, you were concerned about detecting a population difference of this magnitude if it truly existed
• A study based on 29 women has low power to detect a difference of such magnitude
STATISTICAL POWER
• Power is a measure of "doing the right thing" when HA is true!
• Higher power is better (smaller type II error)
• We can calculate power for a given study if we specify a specific HA
• An OC/blood pressure study based on 29 women has power of .13 (13%) to detect a difference in BP of 5.4 mmHg or more, if this difference truly exists between the populations of women 35-39 years old who are using and not using OC
• When power is this low, it is difficult to determine whether there is really no difference in population means or we simply failed to detect one
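The 13% figure can be reproduced with the usual normal-approximation power formula (a Python cross-check, not part of the original slides):

```python
import math
from statistics import NormalDist

def power_two_means(n1, n2, sd1, sd2, diff, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test to detect a
    true mean difference `diff` (normal approximation)."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)     # SE of the mean difference
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    z_eff = diff / se
    # P(observed difference lands beyond +/- z_crit * SE when HA is true)
    return NormalDist().cdf(z_eff - z_crit) + NormalDist().cdf(-z_eff - z_crit)

# OC pilot: n = 8 vs. 21, SDs 15.3 and 18.2, true difference 5.4 mmHg
print(round(power_two_means(8, 21, 15.3, 18.2, 5.4), 2))  # 0.13
```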
STATISTICAL POWER
• Recall, the sampling behavior of x̄1 - x̄2 is normally distributed (large samples), with the sampling distribution centered at the true mean difference. If H0 is true, then the curve is centered at 0
[Figure: sampling distribution of x̄1 - x̄2 under H0: µ1 - µ2 = 0]
STATISTICAL POWER
• If instead HA is the truth, then the curve is centered at some value d, d ≠ 0
[Figure: sampling distribution of x̄1 - x̄2 under HA: µ1 - µ2 = d]
STATISTICAL POWER
• H0 will be rejected (for α = 0.05) if the sample result, x̄1 - x̄2, is more than 2 standard errors away from 0, either above or below
[Figure: rejection regions under H0: µ1 - µ2 = 0]
STATISTICAL POWER
• The same rejection rule applies when the truth is HA: µ1 - µ2 = d; power is the probability that x̄1 - x̄2 lands in the rejection region
[Figure: distributions under H0: µ1 - µ2 = 0 and HA: µ1 - µ2 = d]
WHY IS SAMPLE SIZE RELEVANT?
HYPOTHESIS TESTING
Does the sample better support a distribution under H0 or H1?
[Figure: overlapping distributions under H0 ("no difference") and H1 ("new TX better"); with increased n the curves narrow and separate]
Institute for Clinical and Translational Research
Introduction to Clinical Research
Section B
Factors Influencing Sample Size and Power
HYPOTHESIS TESTING
Does a clinically significant effect exist between a treated group and a non-treated or differently treated group?
1-sided test: n/group = (Z1-α + Z1-β)² × 2 × (SD/Δ)²
2-sided test: n/group = (Z1-α/2 + Z1-β)² × 2 × (SD/Δ)²
SAMPLE SIZE / POWER CALCULATION
BASIC ELEMENTS
n, Zα, Zβ, SD², Δ²
• Smaller α error → larger sample size
• Smaller β error (higher power) → larger sample size
• Higher variability in outcome → larger sample size
• Smaller Δ (difference to detect) → larger sample size
HYPOTHESIS TESTING
Use formula:
n/group = (Z1-α + Z1-β)² (2 SD²) / Δ²
where Z1-α = normal deviate for the α error (a two-sided test uses Z1-α/2)
Z1-β = normal deviate for the β error
2 indicates two groups (sources of variability), i.e., treatment group and placebo group
Δ = least difference between the treated (estrogen) group and the non-treated group (placebo) that is considered clinically significant
n = sample size of each group
HYPOTHESIS TESTING
Given formula:
n = (Z1-α/2 + Z1-β)² (2 SD²) / Δ²
Then
n = (1.96 + 0.84)² × 2 × (.65/.30)² = (15.68)(4.69) = 73.6 => 74
Allow for 20% random dropout: .8 × nc = 74, so nc = 74 / .8 = 92.5
Need 93 per arm for 80% power to detect a .3 difference between groups with 20% random dropout
For 90% power, use n = (1.96 + 1.28)² × 2 × (.65/.30)² = 98.6 => 99 per arm (before dropout)
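The arithmetic on this slide can be sketched in Python (a cross-check, not part of the original slides; the function name is mine):

```python
import math
from statistics import NormalDist

def n_per_arm(sd, delta, alpha=0.05, power=0.80, dropout=0.0):
    """n/group = (Z_{1-alpha/2} + Z_{1-beta})^2 * 2 * (SD/Delta)^2,
    then inflated for an expected random dropout fraction."""
    z = NormalDist().inv_cdf
    n = (z(1 - alpha / 2) + z(power))**2 * 2 * (sd / delta)**2
    return math.ceil(n / (1 - dropout))

print(n_per_arm(0.65, 0.30))                # 74
print(n_per_arm(0.65, 0.30, dropout=0.20))  # 93
print(n_per_arm(0.65, 0.30, power=0.90))    # 99 (98.6 before rounding up)
```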
WHAT INFLUENCES POWER?
• In order to INCREASE power for a study comparing two group means, we can do one or more of the following:
1. Change the hypothesized values of µ1 and µ2 so that the difference (µ1 - µ2) is bigger
[Figure: H0: µ1 - µ2 = 0 vs. HA: µ1 - µ2 = d]
WHAT INFLUENCES POWER?
1. (continued) With a larger hypothesized difference d', the curves separate further
[Figure: H0: µ1 - µ2 = 0 vs. HA: µ1 - µ2 = d']
WHAT INFLUENCES POWER?
2. Increase the sample size in each group
[Figure: H0: µ1 - µ2 = 0 vs. HA: µ1 - µ2 = d]
WHAT INFLUENCES POWER?
2. (continued) With larger n, each sampling distribution narrows, so more of the HA curve falls in the rejection region
[Figure: H0: µ1 - µ2 = 0 vs. HA: µ1 - µ2 = d, with larger n]
WHAT INFLUENCES POWER?
3. Increase the α-level (functionally speaking, make the test "easier to reject")
[Figure: rejection regions with α = 0.05; H0: µ1 - µ2 = 0, HA: µ1 - µ2 = d]
WHAT INFLUENCES POWER?
3. (continued) The same comparison with α = 0.10: a wider rejection region, hence higher power
[Figure: rejection regions with α = 0.10; H0: µ1 - µ2 = 0, HA: µ1 - µ2 = d]
SAMPLE SIZE / POWER CALCULATION
BASIC ELEMENTS
n, Zα, Zβ, SD², Δ²
• Increased α error → higher power
• Larger sample size → higher power
• Less variability (SD) in outcome → higher power
• Larger Δ (difference to detect) → higher power
SAMPLE SIZE / POWER CALCULATION
OTHER CONSIDERATIONS
When the outcome of interest is the before-after change of a certain outcome measure, and Δ denotes the between-group difference of these changes:
• The SD used in the calculation should be the SD of the before-after change, not the SD of the cross-sectional outcome measure
• Assuming SDbefore = SDafter = σ, then
SDchange = σ × √(2(1-ρ))
where ρ is the correlation between the outcome measures before and after
• The literature usually reports SDbefore but not ρ
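The change-score SD formula is easy to sketch (a cross-check, not part of the original slides; the σ and ρ values below are hypothetical):

```python
import math

def sd_of_change(sigma, rho):
    """SD of a before-after change when SD_before = SD_after = sigma and
    rho is the before/after correlation: sigma * sqrt(2 * (1 - rho))."""
    return sigma * math.sqrt(2 * (1 - rho))

# Hypothetical values (rho is rarely reported, as the slide notes)
print(round(sd_of_change(10, 0.85), 2))  # 5.48
```

Note that with ρ > 0.5 the change score is less variable than the cross-sectional measure, which is why change outcomes often need smaller samples.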
POWER & STUDIES
• Power is computed to inform study design; calculating post-hoc power after a study is completed does not add information to the study results
• Power can only be computed for a specific HA, e.g., this study had XX% power to detect a difference in population means of YY or greater
• Frequently, in study design, a required sample size is computed to achieve a preset power level to detect a clinically/scientifically minimal important difference in means
• "Industry standard" for power: 80% (or greater)
Institute for Clinical and Translational Research
Introduction to Clinical Research
Section C
Sample Size Calculations for Comparing Two Means
EXAMPLE
• Blood pressure and oral contraceptives
– Suppose we used data from the example in
Section A to motivate the following question:
– Is oral contraceptive use associated with higher
blood pressure among individuals between the
ages of 35–39?
PILOT STUDY
• Recall, the data:

Sample Data      n    Mean SBP   SD of SBP
OC users         8    132.8      15.3
Non-OC users    21    127.4      18.2
PILOT STUDY
• 5 mmHg difference in BP is too large to ignore
• Need a bigger study that has ample power to detect
such a difference, should it really exist in the
population
• Task: determine sample sizes needed to detect a 5
mmHg increase in blood pressure in O.C. users with 80%
power at significance level α = 0.05
– Using pilot data, we estimate that the standard
deviations are 15.3 and 18.2 in O.C. and non-O.C.
users respectively
PILOT STUDY
• Here we have a desired power in mind and want to
find the sample sizes necessary to achieve a power
of 80% to detect a population difference in blood
pressure of five or more mmHg between the two
groups
PILOT STUDY
• We can find the necessary sample size(s) for this study if we specify:
– α-level of the test (.05)
– Specific values for μ1 and μ2 (a specific HA), and hence d = μ1 - μ2 (usually the minimum scientific difference of interest)
– Estimates of σ1 and σ2
– The desired power (.80)
PILOT STUDY
• How can we specify d = μ1 - μ2 and estimate
population SDs?
– Researcher knowledge—experience makes for
good educated guesses
– Make use of pilot study data!
PILOT STUDY
• Fill in the blanks from the pilot study:
– α-level of test (.05)
– Specific HA (μOC = 132.8, μNO OC = 127.4), and hence d = μOC - μNO OC = 5.4 mmHg
– Estimates of σOC (= 15.3) and σNO OC (= 18.2)
– The desired power (.80)
PILOT STUDY
• Given this information, we can use Stata to do the sample size calculation
• "sampsi" command
– Command syntax (items in italics are numbers to be supplied by the researcher):
sampsi μ1 μ2, alpha(α) power(power) sd1(σ1) sd2(σ2)
STATA RESULTS
• Blood Pressure/OC example
Total N = 306
PILOT STUDY/STATA RESULTS
• Our results from Stata suggest that in order to
detect a difference in BP of 5.4 mmHg (if it really
exists in the population) with high (80%) certainty,
we would need to enroll 153 O.C. users and 153
non-users
• This assumed that we wanted equal numbers of
women in each group!
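The Stata output did not survive extraction, but the quoted totals can be reproduced with the standard normal-approximation formula (a Python cross-check, not part of the original slides):

```python
import math
from statistics import NormalDist

def n_two_means(m1, m2, sd1, sd2, alpha=0.05, power=0.80):
    """Per-group n (equal allocation), normal approximation:
    n = (Z_{1-alpha/2} + Z_{1-beta})^2 * (sd1^2 + sd2^2) / (m1 - m2)^2."""
    z = NormalDist().inv_cdf
    n = (z(1 - alpha / 2) + z(power))**2 * (sd1**2 + sd2**2) / (m1 - m2)**2
    return math.ceil(n)

# OC pilot values: means 132.8 vs. 127.4, SDs 15.3 and 18.2
n = n_two_means(132.8, 127.4, 15.3, 18.2)
print(n, 2 * n)  # 153 306
```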
PILOT STUDY/STATA RESULTS
• Suppose we estimate that the prevalence of O.C.
use amongst women 35–39 years of age is 20%
– We wanted this reflected in our group sizes
• If O.C. users are 20% of the population, non-OC
users are 80%
– There are four times as many non-users as there
are users (4:1 ratio)
PILOT STUDY/STATA RESULTS
• We can specify a ratio of group sizes in Stata
– Again, using the "sampsi" command with the "ratio" option:
sampsi μ1 μ2, alpha(α) power(power) sd1(σ1) sd2(σ2) ratio(n2/n1)
STATA RESULTS
• Blood pressure/OC example with a 4:1 ratio of non-users to users
Total N = 430
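The unequal-allocation total can also be cross-checked (a sketch, not part of the original slides; the function name is mine):

```python
import math
from statistics import NormalDist

def n_unequal_groups(m1, m2, sd1, sd2, ratio, alpha=0.05, power=0.80):
    """Group sizes when group 2 has `ratio` times as many subjects as
    group 1 (normal approximation, as with sampsi's ratio() option)."""
    z = NormalDist().inv_cdf
    n1 = ((z(1 - alpha / 2) + z(power))**2
          * (sd1**2 + sd2**2 / ratio) / (m1 - m2)**2)
    n1 = math.ceil(n1)
    return n1, ratio * n1

# OC pilot values with 4 non-users per OC user
n_oc, n_non = n_unequal_groups(132.8, 127.4, 15.3, 18.2, 4)
print(n_oc, n_non, n_oc + n_non)  # 86 344 430
```

Note the 4:1 design needs 430 women in total versus 306 with equal groups: unbalanced allocation costs power per subject.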
Institute for Clinical and Translational Research
Introduction to Clinical Research
Section D
Sample Size Determination for Comparing Two
Proportions
POWER FOR COMPARING TWO
PROPORTIONS
• Same ideas as with comparing means
PILOT STUDY
• We can find the necessary sample size(s) for this study if we specify:
– α-level of the test
– Specific values for p1 and p2 (a specific HA), and hence d = p1 - p2 (usually the minimum scientific difference of interest)
– The desired power
PILOT STUDY
• Given this information, we can use Stata to do the sample size calculation
• "sampsi" command
– Command syntax (items in italics are numbers to be supplied by the researcher):
sampsi p1 p2, alpha(α) power(power)
ANOTHER EXAMPLE
• Two drugs for treatment of peptic ulcer compared (Familiari, et al., 1981)
– The percentage of ulcers healed by pirenzepine (drug A) and trithiozine (drug B) was 77% and 58%, based on 30 and 31 patients respectively (p-value = .17); the 95% CI for the difference in proportions healed (pDRUG A - pDRUG B) was (-.04, .42)
– The power to detect a difference as large as the sample results, with samples of size 30 and 31 respectively, is only 25%

          Healed   Not Healed   Total
Drug A      23         7          30
Drug B      18        13          31
EXAMPLE
• As a clinician, you find the sample results intriguing and want to do a larger study to better quantify the difference in proportions healed
• Redesign a new trial, using the aforementioned study results to "guesstimate" population characteristics
– Use pDRUG A = .77 and pDRUG B = .58
– 80% power
– α = .05
• Command in Stata
sampsi .77 .58, alpha(.05) power(.8)
EXAMPLE
• Command in Stata
sampsi .77 .58, alpha(.05) power(.8)
[Stata output not captured in the transcript]
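As a cross-check (not part of the original slides), the per-group n can be sketched with the classic pooled-variance formula plus Fleiss's continuity correction; the continuity correction is an assumption on my part about what the (missing) Stata output used, though it does reproduce the ~34,000-per-group figure quoted later in the breast cancer example:

```python
import math
from statistics import NormalDist

def n_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n to detect p1 vs. p2 with a two-sided test: pooled-
    variance normal approximation plus a continuity correction."""
    z = NormalDist().inv_cdf
    za, zb = z(1 - alpha / 2), z(power)
    pbar = (p1 + p2) / 2
    d = abs(p1 - p2)
    n = (za * math.sqrt(2 * pbar * (1 - pbar))
         + zb * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))**2 / d**2
    n_cc = n / 4 * (1 + math.sqrt(1 + 4 / (n * d)))**2  # continuity correction
    return math.ceil(n_cc)

print(n_two_proportions(0.77, 0.58))  # 105 per group under these assumptions
```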
EXAMPLE
• What would happen if you change power to 90%?
sampsi .77 .58, alpha(.05) power(.9)
EXAMPLE
• Suppose you wanted two times as many people on trithiozine ("Drug B") as on pirenzepine ("Drug A")
– Here, the ratio of sample size for Group 2 to Group 1 is 2.0
• Can use the "ratio" option in the "sampsi" command
EXAMPLE
• Changing the ratio
sampsi .77 .58, alpha(.05) power(.9) ratio(2)
SAMPLE SIZE FOR COMPARING TWO PROPORTIONS
• A randomized trial is being designed to determine if vitamin A supplementation can reduce the risk of breast cancer
– The study will follow women between the ages of 45-65 for one year
– Women will be randomized between vitamin A and placebo
• What sample sizes are recommended?
BREAST CANCER/VITAMIN A EXAMPLE
• Design a study to have 80% power to detect a 50% relative reduction in risk of breast cancer with vitamin A
(i.e., RR = pVIT A / pPLACEBO = .50)
using a (two-sided) test with α-level = .05
• To get estimates of the proportions of interest:
– Using other studies, the breast cancer rate in the controls can be assumed to be 150/100,000 per year
BREAST CANCER/VITAMIN A EXAMPLE
• A 50% relative reduction: if RR = pVIT A / pPLACEBO = .50, then pVIT A = .50 × pPLACEBO
So, for this desired difference on the relative scale:
pVIT A = 0.5 × 150/100,000 = 75/100,000
Notice that this difference on the absolute scale, pVIT A - pPLACEBO, is much smaller in magnitude:
|pVIT A - pPLACEBO| = 75/100,000 = 0.00075 = 0.075%
BREAST CANCER SAMPLE SIZE CALCULATION IN STATA
• "sampsi" command
sampsi .00075 .0015, alpha(.05) power(.8)
BREAST CANCER SAMPLE SIZE CALCULATION IN STATA
• You would need about 34,000 individuals per group
• Why so many?
– The difference between the two hypothesized proportions is very small: pPLACEBO - pVIT A = .00075
• We would expect about 51 cancer cases among the controls and about 25 among the vitamin A group:
placebo:   (150/100,000) × 34,000 ≈ 51
vitamin A: (75/100,000) × 34,000 ≈ 25
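The expected case counts are simple arithmetic and can be checked directly (a sketch, not part of the original slides):

```python
# Expected one-year case counts at roughly the computed sample size
n = 34_000                    # per group, from the sampsi result
placebo_rate = 150 / 100_000  # assumed annual breast cancer rate in controls
vita_rate = 75 / 100_000      # 50% relative reduction

print(round(n * placebo_rate, 1))  # 51.0 expected cases on placebo
print(round(n * vita_rate, 1))     # 25.5, i.e., about 25 on vitamin A
```

With rare events, it is the handful of expected cases, not the tens of thousands enrolled, that carries the statistical information.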
BREAST CANCER/VITAMIN A EXAMPLE
• Suppose you want 80% power to detect only a 20% (relative) reduction in risk associated with vitamin A
• A 20% relative reduction: if RR = pVIT A / pPLACEBO = .80, then pVIT A = .80 × pPLACEBO
So, for this desired difference on the relative scale:
pVIT A = 0.8 × 150/100,000 = 120/100,000
Notice that this difference on the absolute scale, pVIT A - pPLACEBO, is much smaller in magnitude:
|pVIT A - pPLACEBO| = 30/100,000 = 0.0003 = 0.03%
BREAST CANCER SAMPLE SIZE CALCULATION IN STATA
• "sampsi" command
sampsi .0012 .0015, alpha(.05) power(.8)
BREAST CANCER—VITAMIN A EXAMPLE REVISITED
• You would need about 242,000 per group!
• We would expect about 360 cancer cases among the placebo group and 290 among the vitamin A group
AN ALTERNATIVE APPROACH—DESIGN A LONGER STUDY
• Proposal
– Five-year follow-up instead of one year
Here:
pVIT A ≈ 5 × .0012 = .006
pPLACEBO ≈ 5 × .0015 = .0075
• Need about 48,400 per group
– Yields about 290 cases among vitamin A and 360 cases among placebo
• Issue
– Loss to follow-up
Institute for Clinical and Translational Research
Introduction to Clinical Research
Section E
More Examples
HYPOTHESIS TESTING
EXAMPLE
Johns Hopkins Practice-based Opportunity for
Promotion of Weight Reduction (POWER) Trial:
• a comparative effectiveness trial that tested
two practical behavioral weight loss
interventions in obese medical outpatients
with cardiovascular risk factors
HYPOTHESIS TESTING EXAMPLE
Study Design
[Diagram: randomization to Control, Remote, and In-Person arms; weights and other outcomes measured at Baseline, 6, 12, and 24 months]
HYPOTHESIS TESTING EXAMPLE
Hypotheses:
• Primary:
– The Remote arm will produce 2.75 kg greater mean weight loss over the 24-month trial period than the Control arm
– The In-Person arm will produce 2.75 kg greater mean weight loss over the 24-month trial period than the Control arm
• Secondary:
– In-Person vs. Remote; other weight outcomes (% weight loss, % reached goal); other health outcomes…
HYPOTHESIS TESTING EXAMPLE
Considerations for selecting the primary hypotheses in the POWER trial:
• Weight loss can be quantitatively measured with high precision and easy interpretation
• Continuous outcomes typically allow greater statistical power
• Weight loss as small as 2.75 kg has been shown to produce health benefits, so it is clinically relevant
• The difference in intervention effects between Remote vs. In-Person is not clear but is expected to be smaller
• Type-I error adjustment for multiple comparisons
SAMPLE SIZE / POWER CALCULATION
OTHER CONSIDERATIONS
Back to the POWER trial example:
Minimum Detectable Difference (MDD) in kg

                                                    MDD, α=0.025 (A)    MDD, α=0.05 (B)
Sample size (C)   Baseline  18m ∆   σ      VIF  ρ     β=0.80  β=0.90      β=0.80  β=0.90
300 (100/group)   103.83    -3.41   16.53  1.0  0.85  2.68    3.10        3.60    4.15
                                    16.53  1.5  0.85  2.75    3.27        4.40    5.05
350 (117/group)                     16.53  1.0  0.85  2.48    2.90        3.30    3.85
                                    16.53  1.5  0.85  3.05    3.55        4.05    4.70
400 (133/group)                     16.53  1.0  0.85  2.33    2.70        3.10    3.60
                                    16.53  1.5  0.85  2.85    3.30        3.80    4.40

(A) Power if either of 2 tests is positive at α = 0.05/2. (B) Power for a single test at α = 0.05. (C) Baseline, 18m ∆ (= change from the EST+DASH intervention in PREMIER, net of control), and ρ (correlation between baseline and 18m) are all from the PREMIER trial unless otherwise indicated.
SAMPLE SIZE / POWER CALCULATION
OTHER CONSIDERATIONS
Then, budget planning indicated that we could afford N = 390 (130/group), so the table evolved to:

Net tx effect   α=0.025 (A)             α=0.025 (B)      α=0.05 (C)
in kg           (at least 1 of 2 sig.)  (1 of 1 sig.)    (1 of 1 sig.)
2.00            0.52                    0.31             0.41
2.50            0.72                    0.47             0.58
2.75            0.80                    0.56             0.66
3.00            0.87                    0.64             0.74
3.25            0.92                    0.72             0.80
3.50            0.95                    0.78             0.86
4.00            0.99                    0.89             0.93
4.50            1.00                    0.95             0.97
SAMPLE SIZE / POWER CALCULATION
OTHER CONSIDERATIONS
Recall POWER has 2 primary hypotheses to test:
H0: Remote - Control = 0;  H1: Remote - Control > 2.75 kg
H0': In-Person - Control = 0;  H1': In-Person - Control > 2.75 kg
If each test is controlled at α = 0.05, then the probability that both (independent) tests are free of type-I error would be
(0.95)(0.95) = 0.9025
So the probability that at least 1 test encounters a type-I error would be
1 - 0.9025 = 0.0975 > 0.05
=> Multiple comparisons inflate the overall α error
SAMPLE SIZE / POWER CALCULATION
OTHER CONSIDERATIONS
If each of the 2 tests is controlled at α = 0.05/2 = 0.025, then the probability that both independent tests are free of type-I error would be
(0.975)(0.975) = 0.950625
So the probability that at least 1 test encounters a type-I error would be
1 - 0.950625 = 0.049375 < 0.05
=> Bonferroni adjustment for multiple comparisons: if we are testing k hypotheses and want to control the overall α at the nominal 0.05 level, then we should test each at the α = 0.05/k level
SAMPLE SIZE / POWER CALCULATION
OTHER CONSIDERATIONS
The Bonferroni correction is conservative in terms of statistical power and requires a greater sample size
Holm method:
• Compare the smaller p-value of the 2 tests against α = 0.05/2 = 0.025. If significant, then compare the larger p-value against α = 0.05. If the first test is not significant, then declare both tests non-significant.
• Family-wise type-I error rate < 0.05
• Provides better statistical power than the Bonferroni correction
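The Holm step-down procedure described above can be sketched for any number of tests (illustrative code, not part of the original slides):

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/k, alpha/(k-1),
    ..., alpha; stop at the first non-significant one.
    Returns reject flags in the original order."""
    k = len(pvalues)
    order = sorted(range(k), key=lambda i: pvalues[i])
    reject = [False] * k
    for step, i in enumerate(order):
        if pvalues[i] <= alpha / (k - step):
            reject[i] = True
        else:
            break  # all remaining (larger) p-values fail as well
    return reject

# Two primary tests, as in POWER: smaller p vs. 0.025, then larger vs. 0.05
print(holm_reject([0.020, 0.040]))  # [True, True]
print(holm_reject([0.030, 0.040]))  # [False, False]
```

The second call shows Holm's gatekeeping: 0.030 fails the 0.025 threshold, so the larger p-value is never tested.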
HYPOTHESIS TESTING EXAMPLE
Hypotheses:
– The Ca + Vit D arm will have a 22% reduction in colon CA incidence over the average 8-year trial period relative to the Placebo arm
Log-rank test at α = 0.05, with an incidence rate of 0.2% per year in the placebo arm, and pre-fixed N = 35,000
Plugging these assumed parameters into the calculator
=> power = 81% (83% stated in the paper)
SUMMARY
Sample size / power evaluation is a critical part of study design for hypothesis testing
• Smaller α error → larger sample size
• Smaller β error (higher power) → larger sample size
• Higher variability in outcome → larger sample size
• Smaller difference to detect → larger sample size
Need to account for type I error when testing multiple hypotheses
Institute for Clinical and Translational Research
Introduction to Clinical Research
Section F
Sample Size and Study Design Principles:
A Brief Summary
DESIGNING YOUR OWN STUDY
• When designing a study, there is a tradeoff between:
– Power
– α-level
– Sample size
– Minimum detectable difference (specific HA)
• Industry standard—80% power, α = 0.05
DESIGNING YOUR OWN STUDY
• What if the sample size calculation yields group sizes that are too big (i.e., you cannot afford the study, it will take too long, or it will be very difficult to recruit enough participants)?
– Increase the minimum difference of interest
– Increase the α-level
– Decrease the desired power
– Select a more precise primary outcome
DESIGNING YOUR OWN STUDY
• Sample size calculations are an important part of a study proposal
– Study funders want to know that the researcher can detect a relationship with a high degree of certainty (should it really exist)
• Even if you anticipate confounding factors, these approaches are the best you can do at the design stage and are relatively easy
• Accounting for confounders requires more information, and sample size determination has to be done via computer simulation—consult a statistician!
DESIGNING YOUR OWN STUDY
• What is this specific alternative hypothesis?
– Power can only be calculated for a specific
alternative hypothesis
– When comparing two groups this means providing
value of the true population mean (proportion)
for each group
DESIGNING YOUR OWN STUDY
• What is this specific alternative hypothesis?
– It specifies a difference between the two groups, typically based on the smallest difference of scientific/clinical interest
– This difference is frequently called the effect size
– When sample size, power, and other parameters are fixed for a study, the difference derived from the calculation is frequently called the minimum detectable difference
– Typically we aim for a sample size that affords sufficient power so that the minimum detectable difference covers the smallest difference of scientific/clinical interest
DESIGNING YOUR OWN STUDY
• Where does this specific alternative hypothesis come from?
– Hopefully, not the statistician!
– As this is generally a quantity of scientific interest, it is best estimated by a knowledgeable researcher or from pilot study data
– This is perhaps the most difficult component of sample size calculations, as there is no magic rule or "industry standard"
CALCULATING POWER
• In order to calculate power for a study comparing two population means, we need the following:
– Sample size for each group
– Estimated (population) means, μ1 and μ2, for each group—these values frame a specific alternative hypothesis (usually the minimum difference of scientific interest)
– Estimated (population) SDs, σ1 and σ2
– α-level of the hypothesis test
OTHER CONSIDERATIONS
Obtaining high-quality data through the best possible study design and rigorous research conduct is critical to the success of clinical investigations
Work with a biostatistician starting at the study design stage