Power analysis and sample size calculation

Eirik Skogvoll, MD, PhD
Professor, Faculty of Medicine
Consultant, Dept. of Anaesthesiology and Emergency Medicine
The hypothesis test framework (Rosner tbl. 7.1)

                   Truth
Conclusion         H0 correct               H0 wrong
H0 accepted        1 − α                    β = P(type II error)
H0 rejected        α = P(type I error)      1 − β = Power

Type I error: rejecting H0 when H0 is true
Type II error: accepting H0 when H0 is false
Hypothesis

A statement about one or more population parameters, e.g.

μ ≤ 3       (expected value, or "true mean")
μ1 = μ2     (two expected values)
σ² = 14     (variance)
π = 0.4     (probability of a "success" in a binomial trial)
π1 = π2     (two probabilities)
ψ = 1       (odds ratio, from a contingency table)
β1 = 0      (regression coefficient)
ρ > 0       (correlation coefficient)
The sample size is a compromise…
• Significance level
• Required or optimal power
• Available resources
– Money
– Time
– Practical issues of recruiting patients or subjects
• The problem under investigation
Power
• The probability that some test will detect a difference, if it exists
• With low power, even large differences may go unnoticed:
Absence of evidence is not evidence of absence!
• Power is usually set to 0.8 – 0.9 (80-90 %)
• An infinitely small difference is impossible to detect
• The researcher must specify the smallest difference of (clinical) interest
• Power calculations require knowledge of the background variability (usually the standard deviation)
Power or significance?
• We wish to reduce the probability of both a type I error (α) and a type II error (β), because…
  – A small α means that it is difficult to reject H0
  – A small β means that it is easier to reject H0 (and accept H1)
  – But minimizing α and β at the same time is complicated, because α increases as β decreases, and vice versa
• Strategy:
  – Keep α at a comfortable level (0.10, 0.05, 0.01): the maximum "acceptable" probability of making a type I error, i.e. rejecting H0 when it is true
  – Then choose a test that minimizes β, i.e. maximizes power (= 1 − β). Beware of H1: tests may have different properties!
Power (Rosner 7.5)
One-sided alternative hypothesis:

H0: μ = μ0 vs. H1: μ = μ1 < μ0

H0 is rejected if zobs < zα and accepted if zobs ≥ zα.
The decision does not depend on μ1, as long as μ1 < μ0 (Rosner Fig. 7.5).
Power, Pr(reject H0 | H0 wrong) = 1 − Pr(type II error) = 1 − β, does however depend on μ1.
(Rosner Fig. 7.5: the sampling distributions under H0: μ = μ0 and under H1: μ = μ1 < μ0.)
Under H1, X̄ ~ N(μ1, σ²/n).

We reject H0 if the observed statistic is less than the critical value zα, i.e. if zobs < zα. Using the corresponding value for X̄:

Pr(Z < zα) = Pr( (X̄ − μ0)/(σ/√n) < zα ) = Pr( X̄ < μ0 + zα · σ/√n )      (note that zα < 0)
… remember that Z = (X̄ − μ1)/(σ/√n)  ⇔  X̄ = Z · σ/√n + μ1

thus, Power = P( X̄ < μ0 + zα · σ/√n ) = P( Z · σ/√n + μ1 < μ0 + zα · σ/√n )
= P( Z + μ1/(σ/√n) < μ0/(σ/√n) + zα )
= P( Z < (μ0 − μ1)/(σ/√n) + zα )
= P( Z < zα + (μ0 − μ1)/σ · √n )
= Φ( zα + (μ0 − μ1)/σ · √n )      … because P(Z < z) = Φ(z) … (table 3a)
One-sided alternatives, in general:

H0: μ = μ0 vs. H1: μ = μ1

Power = Φ( zα + |μ0 − μ1|/σ · √n ) = Φ( −z1−α + |μ0 − μ1|/σ · √n )      (Rosner, Eq. 7.19)

(remember that zα = −z1−α; we want the argument of Φ to be as large as possible)
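As a minimal sketch (not from the slides), Eq. 7.19 can be evaluated directly; the function name and the use of scipy for the normal quantile are my own assumptions:

```python
# Minimal sketch of Rosner Eq. 7.19 (one-sided power); names are illustrative.
from math import sqrt
from scipy.stats import norm

def power_one_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a one-sample, one-sided z-test: Phi(-z_{1-alpha} + |mu0 - mu1|/sigma * sqrt(n))."""
    z = norm.ppf(1 - alpha)                  # z_{1-alpha}; note that z_alpha = -z_{1-alpha}
    return norm.cdf(-z + abs(mu0 - mu1) / sigma * sqrt(n))
```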
Factors influencing power:

• Power is reduced if α is reduced (zα becomes more negative)
• Power increases with increasing |μ0 − μ1|
• Power is reduced with increasing σ
• Power is increased with increasing n
Example (Rosner expl. 7.26)
Birth weight.
H0: μ0 = 120 oz, vs. H1: μ1 = 115 oz,
α = 0.05, σ = 24, n = 100.
Power = Pr(reject H0 | H0 wrong) =
Φ( z0.05 + (120 − 115)/24 · √100 ) = Φ( −1.645 + 5 · 10/24 ) = Φ(0.438)
= Pr(Z < 0.438) = 0.669

→ Pr(type II error) = Pr(accept H0 | H0 wrong) = β = 1 − 0.669 = 0.331
Example (Rosner expl. 7.29)
Birth weight.
H0: μ0 = 120 oz, vs. H1: μ1 = 110 oz,
α = 0.05, σ = 24, n = 100.
Power = Φ( z0.05 + (120 − 110)/24 · √100 ) = Φ( −1.645 + 10 · 10/24 ) = Φ(2.52)
= Pr(Z < 2.52) = 0.99
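As a quick numerical check (my own verification, not part of the slides), both examples can be reproduced with scipy:

```python
# Reproducing Rosner examples 7.26 and 7.29 with the values from the slides.
from math import sqrt
from scipy.stats import norm

z_05 = norm.ppf(0.05)                                   # = -1.645
print(norm.cdf(z_05 + (120 - 115) / 24 * sqrt(100)))    # example 7.26: ~0.669
print(norm.cdf(z_05 + (120 - 110) / 24 * sqrt(100)))    # example 7.29: ~0.994
```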
Power for a one-sample test, two-sided alternative (in general):

H0: μ = μ0 vs. H1: μ = μ1 ≠ μ0

Power ≈ Φ( −z1−α/2 + |μ0 − μ1|/σ · √n )      (Rosner Eq. 7.21)
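A corresponding sketch for the two-sided case (assumed function name, scipy for the quantile):

```python
# Minimal sketch of Rosner Eq. 7.21 (approximate two-sided power).
from math import sqrt
from scipy.stats import norm

def power_two_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Approximate power of a one-sample, two-sided z-test."""
    z = norm.ppf(1 - alpha / 2)              # z_{1-alpha/2}
    return norm.cdf(-z + abs(mu0 - mu1) / sigma * sqrt(n))
```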
From Power to Sample size…

From Rosner Eq. 7.21:

Power = Φ( −z1−α/2 + |μ0 − μ1| · √n / σ ) = 1 − β      … now we set ("fix") 1 − β

Apply Φ⁻¹ to both sides:

Φ⁻¹( Φ( −z1−α/2 + |μ0 − μ1| · √n / σ ) ) = Φ⁻¹(1 − β)

⇔ −z1−α/2 + |μ0 − μ1| · √n / σ = z1−β
⇔ |μ0 − μ1| · √n / σ = z1−β + z1−α/2
⇔ |μ0 − μ1| · √n = σ · (z1−β + z1−α/2)
⇔ √n = σ · (z1−β + z1−α/2) / |μ0 − μ1|
Sample size for a one-sample test, two-sided alternative:

H0: μ = μ0 vs. H1: μ = μ1 ≠ μ0

n = σ² · (z1−β + z1−α/2)² / (μ0 − μ1)²      (Rosner Eq. 7.28)
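A sketch of Eq. 7.28, rounded up to a whole subject (the function name and the rounding convention are my assumptions):

```python
# Minimal sketch of Rosner Eq. 7.28 (one-sample, two-sided sample size).
from math import ceil
from scipy.stats import norm

def n_one_sample(mu0, mu1, sigma, alpha=0.05, power=0.80):
    z_a = norm.ppf(1 - alpha / 2)            # z_{1-alpha/2}
    z_b = norm.ppf(power)                    # z_{1-beta}
    return ceil(sigma**2 * (z_b + z_a)**2 / (mu0 - mu1)**2)
```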
Two-sample test: sample size (Rosner 8.10)

Given X1 ~ N(μ1, σ1²) and X2 ~ N(μ2, σ2²),
H0: μ1 = μ2 vs. H1: μ1 ≠ μ2 (two-sided test),
significance level α, and power 1 − β:

n = (σ1² + σ2²) · (z1−α/2 + z1−β)² / (μ2 − μ1)²      … in each group (Rosner Eq. 8.26)
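And a sketch of Eq. 8.26 for the per-group size in a two-sample comparison (again, names and rounding are my assumptions):

```python
# Minimal sketch of Rosner Eq. 8.26 (two-sample, two-sided): n in each group.
from math import ceil
from scipy.stats import norm

def n_per_group(mu1, mu2, sigma1, sigma2, alpha=0.05, power=0.80):
    z_a = norm.ppf(1 - alpha / 2)            # z_{1-alpha/2}
    z_b = norm.ppf(power)                    # z_{1-beta}
    return ceil((sigma1**2 + sigma2**2) * (z_a + z_b)**2 / (mu2 - mu1)**2)
```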
Rules of thumb

Lehr's "quick" equations for a two-sample comparison (Δ = standardized difference, m = no. in each group):

• 80 % power, α = 0.05 (two-sided): m = 16/Δ²
• 90 % power, α = 0.05 (two-sided): m = 21/Δ²
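Lehr's rules are simple enough to code in a few lines (a sketch; rounding up is my convention):

```python
# Lehr's rules of thumb: per-group size m from the standardized difference Delta.
from math import ceil

def lehr_80(delta):
    """m = 16/Delta^2: 80 % power, alpha = 0.05 (two-sided)."""
    return ceil(16 / delta**2)

def lehr_90(delta):
    """m = 21/Delta^2: 90 % power, alpha = 0.05 (two-sided)."""
    return ceil(21 / delta**2)
```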
The standardized difference
Machin et al. "Sample Size Tables for Clinical Studies", 3rd ed., 2009; Altman 1991 (figure 15.2)

We need to specify the "standardized difference":

Δ = |μ2 − μ1| / σ

… the least clinically important difference, expressed in standard deviations. It may vary from 0.1 to 1.0, where Δ = 0.1 – 0.2 constitutes a "small" difference, while Δ = 0.8 is "large".
What is the background variability ("noise")?

• From other studies
• From pilot data
• From the range of possible values ("quick and dirty"):

SD ≈ Range / 4

• This result stems from the basic properties of the Normal (Gaussian) distribution, in which approx. 95 % of the population is found within ±2 SD of the mean
Example

• 80 % power, α = 0.05 (two-sided), Δ = standardised difference = 0.5:
m = 16/Δ² = 16/0.25 = 64
Total sample size = 2 · 64 = 128
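For comparison (my own check, not on the slide), the exact formula of Eq. 8.26 with equal SDs gives nearly the same per-group size as Lehr's rule here:

```python
# Lehr's rule vs. the exact two-sample formula for Delta = 0.5.
from math import ceil
from scipy.stats import norm

delta = 0.5
z_a, z_b = norm.ppf(0.975), norm.ppf(0.80)   # alpha = 0.05 (two-sided), 80 % power
print(ceil(16 / delta**2))                   # Lehr: 64 per group
print(ceil(2 * (z_a + z_b)**2 / delta**2))   # exact: 63 per group
```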
Nomogram (Altman 1991)

(Figure: Altman's nomogram linking the standardised difference, power, and the total sample size.)
Example

• 80 % power, α = 0.05 (two-sided), Δ = 0.5: reading off the nomogram gives N (total) ≈ 125
Factors influencing power (1 − β)

• Power decreases if α is reduced (zα becomes more negative)
• Power increases with increasing |μ0 − μ1|
• Power decreases with increasing σ
• Power increases with increasing sample size n
Factors leading to an increased sample size (n)

• Large standard deviation
• Strict requirement for α (low level: 0.01 – 0.02)
• Strict requirement for power (1 − β) (high level: 0.90 – 0.95)
• Small absolute difference between the parameters under H1 and H0 (|μ1 − μ0|)
Confidence intervals

• Most journals want CIs reported along with (or instead of) p-values
• The expected length of a confidence interval is determined by the sample size:

l = 2 · t(n−1, 1−α/2) · s / √n

…where
t(n−1, 1−α/2) = the relevant quantile of the t distribution with n − 1 degrees of freedom
s = sample standard deviation
n = no. of observations
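The expected length is easy to tabulate for candidate values of n (a sketch; scipy's t quantile and the function name are my assumptions):

```python
# Expected CI length l = 2 * t_{n-1, 1-alpha/2} * s / sqrt(n).
from math import sqrt
from scipy.stats import t

def ci_length(s, n, alpha=0.05):
    """Expected length of a two-sided (1 - alpha) CI for a mean, given sample SD s."""
    return 2 * t.ppf(1 - alpha / 2, df=n - 1) * s / sqrt(n)

print(ci_length(s=24, n=100))                # ~9.5 oz with the birth-weight SD above
```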
How to make efficient use of your subjects

• Re-use subjects if possible, using a repeated-measurements design. The error variance is reduced to the intra-subject component, as each subject acts as his own control.
• Enter more than one experimental factor, using a factorial design. Each subject counts twice (or thrice…)!
• Refine your measurement methods to reduce residual error.
• Never, never, never dichotomize a continuous outcome variable into a binary one (i.e. 1 or 0)! This has the largest variance.
Software for sample size calculations

• Machin et al. "Sample Size Software for Clinical Studies", 3rd ed., 2009 (note that you must set dot (.) as the decimal separator)
• Sample Power
• R
• StatXact (for special procedures)
References

• Eide DM. Design av dyreforsøk [Design of animal experiments], ch. 28 in A. Smith: Kompendium i forsøksdyrlære [Compendium in laboratory animal science]. http://oslovet.veths.no/dokument.aspx?dokument=262 (2007)
• Mead R. The Design of Experiments: Statistical Principles for Practical Applications. Cambridge: Cambridge Univ. Press; 1990.
• Altman DG. Practical Statistics for Medical Research. London: Chapman & Hall; 1991.
• Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ. 1995;311(7003):485.
• Rosner B. Fundamentals of Biostatistics. Belmont, CA: Duxbury Thomson Brooks/Cole; 2005.
• Schulz KF, Grimes DA. Sample size calculations in randomised trials: mandatory and mystical. Lancet. 2005;365(9467):1348-53.
• Bland JM. The tyranny of power: is there a better way to calculate sample size? BMJ. 2009;339:b3985.
• Machin D, Campbell M, Tan SB, Tan SH. Sample Size Tables for Clinical Studies. 3rd ed. Oxford: Wiley-Blackwell; 2009.