Sample Size and Power

Jörg Aßmus
Haugesund, 14.6.2011

"I only believe in statistics that I doctored myself."
(said to be said by Winston Churchill)


Outline

1. Introduction - Why do we talk about sample size?
2. Short introduction to hypothesis testing
3. Power and sample size
4. What can we do?
Introduction

What is the target of a study?

RCT (randomized controlled trial):

                    Sample
                   /      \
      Treatment group    Control group
             |                  |
         Output 1           Output 2

Research question: Is there a difference between the outputs?
Methodical question: What do we need to be able to see a difference (if there is one)?


Introduction - A simple example

Body height of 20-year-old men in Germany and the Netherlands:
No.    1     2     3     4     5     6     7     8     9    10
GER  1.79  1.89  1.77  1.95  1.80  1.85  1.80  1.76  1.82  1.77
NED  1.77  1.89  1.73  1.87  2.00  1.80  1.89  1.87  1.88  1.89

Testing the difference (t-test):

Sample size     ∆      p-value   Difference?
     5        0.014    0.8145    no
    10        0.040    0.1993    no
    50        0.046    0.0058    yes
   100        0.060    0.0000    yes
  1000        0.054    0.0000    yes

Conclusion: The test result depends on the sample size.
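The rows of this table are easy to re-run. Below is a minimal sketch for the n = 10 case, using Python with SciPy; the language choice is an assumption, since the talk itself does not show any code:

```python
# Two-sample t-test on the height data above (heights in metres),
# reproducing the n = 10 row of the table.
from scipy import stats

ger = [1.79, 1.89, 1.77, 1.95, 1.80, 1.85, 1.80, 1.76, 1.82, 1.77]
ned = [1.77, 1.89, 1.73, 1.87, 2.00, 1.80, 1.89, 1.87, 1.88, 1.89]

t, p = stats.ttest_ind(ger, ned)   # pooled-variance two-sample t-test
delta = sum(ned) / len(ned) - sum(ger) / len(ger)

print(f"difference = {delta:.3f} m, p = {p:.4f}")
# The difference (~0.04 m) is real, but with only 10 observations per
# group p comes out around 0.2: not significant, as in the table.
# (The slide's exact p may stem from a slightly different test variant.)
```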
Introduction: What do we want to do?

Conclusion from the introductory example: We can see an existing difference if we use a sufficiently large sample size.

Question: Why don't we take the sample size as large as possible?

1. Economy:
   - Why should we include more subjects than we have to?
   - Every trial costs!
2. Ethics:
   - We should never burden more test persons or animals than necessary.
3. Statistics:
   - We can prove almost any effect if we just use a sufficiently large sample size.
   - Tension: statistical significance vs. clinical relevance.

⇒ We have to find the correct sample size to detect the desired effect: not too small, not too large.


What do we need on the way?

- How does a test work?
- What does "power of a test" mean?
- What determines the sample size?
- How do we handle this in practical tasks?
A short introduction to hypothesis testing

Strategy:

1. Formulate a hypothesis:
   Null hypothesis H0: Eh1 = Eh2 (expected heights equal)
   vs.
   Alternative H1: Eh1 ≠ Eh2 (expected heights different)

2. Find an appropriate test statistic:

   T = √n · |Eh2 − Eh1| / σ

3. Compute the observed test statistic:

   Tobs = √n · |ĥ2 − ĥ1| / s_pooled

4. Reject the null hypothesis H0 if Tobs is too large.

But what does this mean: "too large"?


Possible results of a single test

                        Reality
Test decision     H0 true         H0 false
accept            RIGHT           type II error
reject            type I error    RIGHT

How can we know? The test decision is obvious; the reality is not.

Wrong decisions:
· Rejection even though H0 is true (type I error)
· No rejection even though H0 is false (type II error)

What do we want?
· Reduction of the wrong decisions
⇒ Minimal probability for both types of errors

Dilemma: For a given data set and a given test method it is impossible to reduce both error types at once.
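To make steps 2-3 concrete, here is a small sketch (again Python/NumPy, an assumption) that evaluates the statistic above for the height data. Note that conventions differ on whether n means the per-group or the combined sample size; this only changes a constant factor, and library routines handle it internally:

```python
# Evaluate the observed test statistic of steps 2-3 for the height data:
# Tobs = sqrt(n) * |h2_hat - h1_hat| / s_pooled   (the slide's form)
import numpy as np

ger = np.array([1.79, 1.89, 1.77, 1.95, 1.80, 1.85, 1.80, 1.76, 1.82, 1.77])
ned = np.array([1.77, 1.89, 1.73, 1.87, 2.00, 1.80, 1.89, 1.87, 1.88, 1.89])

n = len(ger)                                                  # equal group sizes
s_pooled = np.sqrt((ger.var(ddof=1) + ned.var(ddof=1)) / 2)   # pooled SD
t_slide = np.sqrt(n) * abs(ned.mean() - ger.mean()) / s_pooled
t_usual = t_slide / np.sqrt(2)   # the standard two-sample t statistic
print(f"Tobs (slide's form) = {t_slide:.3f}, usual t = {t_usual:.3f}")
```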
Statistical approach

Dilemma (recalled): For a given data set and a given test method it is impossible to reduce both error types at once.
⇒ We try to deal only with the type I error.
⇒ We assume that H0 is true.

Idea: What is the probability that everything happened by accident?

Solution: The p-value

p := P(T > Tobs | H0)

The p-value is, for a given data set, the probability of getting the observed test statistic or a worse one, assuming that the null hypothesis is true.

Remarks:
- For given data: the p-value is a fixed number → characteristic for the data set.
- Theoretically: the p-value is a random variable ⇒ the p-value has a distribution.

[Figure: density of the test statistic, with Tobs and the critical value Tkrit marked on the axis of test statistic values.]


A simulation experiment:

We generate data for two populations with the following properties:
- All data are Gaussian.
- Equal for both populations:
  · Mean: µ1 = µ2 = 0
  · Standard deviation: σ1 = σ2 = 1
  · Sample size: M = 20
- Different for the populations:
  · Nothing
- Test problem: H0: µ1 = µ2 vs. H1: µ1 ≠ µ2

Approach:
1. Generate a data set for both populations.
2. Compute the p-value for a t-test (H0: no difference).
3. Plot the p-value.
4. Repeat 1.-3.

[Figure: two samples under H0.]
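This experiment is easy to re-run; a minimal sketch (Python/SciPy assumed) that reproduces the findings on the next slide, a flat p-value histogram with about 5% rejections:

```python
# Simulate p-values of a t-test when H0 is true:
# mu1 = mu2 = 0, sigma = 1, M = 20 per group, many repetitions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
M, reps, alpha = 20, 100_000, 0.05

pvals = np.empty(reps)
for i in range(reps):
    x = rng.normal(0.0, 1.0, M)   # sample 1
    y = rng.normal(0.0, 1.0, M)   # sample 2, same distribution
    pvals[i] = stats.ttest_ind(x, y).pvalue

print(f"rejected: {np.mean(pvals < alpha):.3%}")   # close to 5%
# A histogram of pvals is flat: under H0 the p-value is uniform on [0, 1].
```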
Result of the experiment

[Figure: distribution of the p-values under H0 (100,000 repetitions). REJECT (p < 0.05): N = 4963 (4.963%); ACCEPT: N = 95037 (95.037%).]

- p-values are uniformly distributed under a true null hypothesis.
- 5% of the p-values lie under 0.05.
- This is independent of the sample size.

⇒ We reject the null hypothesis if the p-value lies under a given significance level α (usual convention: α = 0.05). The probability of a type I error (incorrect rejection) then does not exceed the chosen significance level.


Power and Sample size

What did we learn about tests?

- Test decision: made so as to control the probability of a type I error.
⇒ Interpretation: we control the probability of incorrectly detecting an effect.

Question: What about the probability of not detecting an existing effect?

We know: with
- given data,
- a given test method,
- a given significance level,
we can no longer influence the probability of the type II error (recall the test dilemma!).

But what does this probability look like? Let us do one more simulation experiment.
A simulation experiment:

We generate data for two populations with the following properties:
- All data are Gaussian.
- Equal for both populations:
  · Standard deviation: σ1 = σ2 = 1
  · Sample size: M = 20
- Different for the populations:
  · Mean: µ1 = 0, µ2 = 0.8
- Test problem: H0: µ1 = µ2 vs. H1: µ1 ≠ µ2

Approach:
1. Generate a data set for both populations.
2. Compute the p-value for a t-test (H0: no difference).
3. Plot the p-value.
4. Repeat 1.-3.

[Figure: two samples under H1.]


Result of the experiment:

[Figure: distribution of the p-values under H1 (100,000 repetitions). REJECT (p < 0.05): N = 70999 (70.999%); ACCEPT: N = 29001 (29.001%).]

⇒ Difference detected: ≈ 69% of the trials ← correct
   No difference detected: ≈ 31% of the trials ← type II error

Ability of the test to detect the difference → the power of the test.
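The same sketch as before, with µ2 = 0.8, estimates this probability directly (Python/SciPy assumed); the rejection fraction comes out near the 69% quoted above:

```python
# Simulate the t-test with a true effect (mu2 = 0.8): the rejection
# rate now estimates the power of the test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
M, reps, alpha = 20, 100_000, 0.05

reject = 0
for _ in range(reps):
    x = rng.normal(0.0, 1.0, M)   # sample 1: mu1 = 0
    y = rng.normal(0.8, 1.0, M)   # sample 2: mu2 = 0.8
    reject += stats.ttest_ind(x, y).pvalue < alpha

print(f"estimated power: {reject / reps:.3f}")   # about 0.69
```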
Power and Sample size

Recall: α = the probability of wrong rejections (type I error).
Definition: β = the probability of wrong acceptances (type II error).

The power of a test is the probability of detecting a wrong null hypothesis:

Power = 1 − β

Question: What does the power of a test depend on?

[Figure: means of simulated data for two samples, repeated 100,000 times; panels for sample sizes M = 5, 10, 50, 100, 1000, mean differences ∆µ = 0, 0.05, 0.1, 0.15, 0.2, and σ = 0.04, 0.08, 0.12, 0.16, 0.2 (axes: height in m, about 1.7-2.1).]

The power depends on:
- the sample size
- the standard deviation
- the effect (mean difference)
- the significance level

⇒ Power and sample size are a "complementary" pair of values.
Rule of thumb: if you know one of them, you know the other.
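These four dependencies can be seen numerically by varying one ingredient at a time. A small sketch using Python's statsmodels (an assumption; any of the tools mentioned later would do), where the standard deviation enters through the standardized effect d = ∆µ/σ:

```python
# Power of a two-sample t-test while varying one ingredient at a time.
# Baseline: d = (mu2 - mu1)/sigma = 0.8, n = 20 per group, alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

tt = TTestIndPower()
base = dict(effect_size=0.8, nobs1=20, alpha=0.05)

print(tt.solve_power(**base))                          # baseline: ~0.69
print(tt.solve_power(**{**base, "nobs1": 40}))         # larger sample: power up
print(tt.solve_power(**{**base, "effect_size": 0.4}))  # smaller effect: power down
print(tt.solve_power(**{**base, "alpha": 0.01}))       # smaller alpha: power down
```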
Power and Sample size

What did we learn?

• Power: the ability of the test to detect a wrong null hypothesis
  (e.g. t-test: the ability to detect a difference).
• Criterion: the type II error: Power = 1 − β.
• Power and sample size: corresponding values.
• The needed sample size depends on:
  - the desired effect (effect size ↓ ⇒ needed sample size ↑)
  - the sample variance (variance ↑ ⇒ needed sample size ↑)
  - the significance level (α ↓ ⇒ needed sample size ↑)
  - the desired test power (power ↑ ⇒ needed sample size ↑)
  - the test type


Let us turn it around: What are the ingredients needed for the computation of the needed sample size?

1. Desired detectable effect:
   - What effect (mean difference, risk, coefficient size) is clinically relevant?
   - Which effect must be detectable to make the study meaningful?
   - This is a choice by the researcher (not statistically determined!).
2. Variance in the data:
   - e.g. the standard deviation of both samples for a t-test
   - taken from experience, former studies, or pilot studies
3. Significance level α:
   - usually set to α = 0.05
   - adjustments must be taken into account (e.g. multiple testing)
4. Desired power of the test:
   - often 1 − β = 0.8 is used
   - This is a choice by the researcher (not statistically determined!).
5. Type of test:
   - Different tests for the same problem often have different power.
Computation of the sample size

Problem: There is no general formula for the power or the sample size.

Computation possibilities:
1. The old-fashioned way: Pocock's formula
2. The modern way: statistical packages
3. If nothing else helps: simulations, bootstrap
4. Ask somebody


Computation of the sample size - Pocock's formula

Continuous outcome (t-test):

N = [2σ² / (µ2 − µ1)²] · f(α, β)

- µ1, µ2 ... population means
- σ ... population standard deviation
- α ... significance level
- β ... type II error probability (β = 1 − power)

Dichotomous outcome (χ²-test):

N = [(p1(1 − p1) + p2(1 − p2)) / (p2 − p1)²] · f(α, β)

- p1, p2 ... proportions, risks (they determine effect and variance)
- f(α, β) ... factor taken from Pocock's table
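Both formulas are one-liners in code. In the sketch below (Python assumed) the table factor is computed as f(α, β) = (z(1−α/2) + z(1−β))², which is my reading of how the table on the next slide is generated; its values agree with the table up to rounding:

```python
# Pocock's sample size formulas, with f(alpha, beta) from normal quantiles.
from scipy.stats import norm

def f_factor(alpha: float, beta: float) -> float:
    return (norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)) ** 2

def n_continuous(mu1, mu2, sigma, alpha=0.05, beta=0.2):
    """Per-group N for a continuous outcome (t-test)."""
    return 2 * sigma**2 / (mu2 - mu1)**2 * f_factor(alpha, beta)

def n_dichotomous(p1, p2, alpha=0.05, beta=0.2):
    """Per-group N for a dichotomous outcome (chi-square test)."""
    return (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1)**2 * f_factor(alpha, beta)

print(round(f_factor(0.05, 0.2), 1))   # 7.8, vs. 7.9 in the table
print(n_continuous(0.0, 0.8, 1.0))     # about 25 per group, matching the
                                       # simulation example further below
```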
Computation of the sample size - Pocock's table

f(α, β):

            β = 0.05   β = 0.10   β = 0.20   β = 0.50
α = 0.10      10.8        8.6        6.2        2.7
α = 0.05      13.0       10.5        7.9        3.8
α = 0.02      15.8       13.0       10.0        5.4
α = 0.01      17.8       14.9       11.7        6.6

- Larger f(α, β) ⇒ larger sample size
- Smaller α ⇒ larger f(α, β) ⇒ larger sample size
- Larger power (smaller β) ⇒ larger f(α, β) ⇒ larger sample size


Computation of the sample size - Program packages

- SPSS SamplePower
  · formerly Power and Precision
  · standalone, but included in the SPSS license
  · http://www.spss.com/software/statistics/samplepower/
- Included in different program packages:
  · R (package pwr)
  · Stata (sampsi, powerreg)
  · SAS (power)
  · Matlab (sampsizepwr)
- Interactive online calculators:
  · http://statpages.org/#Power (overview)

Problem: How do we deal with different test types?
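For completeness, Python (not on the list above) offers the same functionality in statsmodels; a minimal sketch for the running t-test example:

```python
# Power and sample size for a two-sample t-test via statsmodels,
# playing the role of R's pwr or Matlab's sampsizepwr.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power of the running example: d = 0.8, n = 20 per group, alpha = 0.05.
power = analysis.solve_power(effect_size=0.8, nobs1=20, alpha=0.05)
print(f"power = {power:.3f}")      # about 0.69

# Sample size per group needed for power 0.8.
n = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.8)
print(f"n per group = {n:.1f}")    # about 25.5
```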
Computation of the sample size - SamplePower 2.0
[Screenshot.]

Computation of the sample size - Interactive tools
[Screenshot.]
Computation of the sample size - Simulation

- requires programming
- should usually be done by a statistician
- used if there is no adequate program or formula

Idea:
1. Define a target power, e.g. 0.8.
2. Generate artificial data with given parameters:
   · means µ1, µ2
   · standard deviation σ
   · significance level α
   · predefined sample size N
3. Compute the test result.
4. Repeat 2.-3. and count the number of rejections.
5. power = (number of rejections) / (number of simulations)
6. Repeat 2.-5. for different sample sizes.
7. Select the lowest sample size with a power above the predefined one (step 1).
(A code sketch of these steps follows after the next figure.)

- Distinction between simulation and bootstrap:
  · Bootstrap: use random subsamples of real data.
  · Simulation: generate new data.


Computation of the sample size - Simulation example

[Figure: power simulation (t-test) with effect ∆µ = 0.8, σ = 1, α = 0.05; power as a function of the sample size (0 to 100). Needed sample size: N = 25.]
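A sketch of steps 1-7 for the t-test example in the figure (Python/SciPy assumed; the talk's own simulation code is not shown):

```python
# Estimate the power by simulation for increasing sample sizes and
# pick the smallest N whose estimated power reaches the target.
import numpy as np
from scipy import stats

def simulated_power(n, mu1=0.0, mu2=0.8, sigma=1.0,
                    alpha=0.05, n_sim=2000):
    """Steps 2-5: fraction of rejections among n_sim simulated trials."""
    rng = np.random.default_rng(0)
    reject = 0
    for _ in range(n_sim):
        x = rng.normal(mu1, sigma, n)
        y = rng.normal(mu2, sigma, n)
        reject += stats.ttest_ind(x, y).pvalue < alpha
    return reject / n_sim

target = 0.8                           # step 1: desired power
for n in range(5, 101):                # step 6: different sample sizes
    if simulated_power(n) >= target:   # step 7: first N above the target
        print(f"needed sample size: N = {n} per group")
        break
# With the figure's parameters (delta = 0.8, sigma = 1, alpha = 0.05)
# this lands around N = 25; n_sim is kept small here, so expect some
# Monte Carlo jitter in the exact crossing point.
```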
Computation of the sample size - Simulation example

[Figure: empirical power estimation (1000 repetitions); power as a function of the effect (x-axis labelled ∆σ, from 0 to 5) for sample sizes M = 12 and M = 250.]


SamplePower 2.0

• http://www.spss.com/software/statistics/samplepower/
• formerly "Power and Precision"
• standalone program, included in the SPSS license
  ⇒ available in Helse Vest (→ email Helse Vest IKT)!
• Different groups of methods:
  - mean comparison (only t-test)
  - proportions (risks, cross tables)
  - correlations
  - ANOVA
  - regression (linear, logistic)
  - survival analysis
  - some noncentral tests
• Help:
  - no proper book found
  - textbook "Power and Precision" (Borenstein, M.)
  - embedded help system (not always easy to understand)
  - tutorials on the web
Starting with a simple example:
Comparison of the means of 2 independent samples (t-test)

- All data are Gaussian.
- Equal for both populations:
  · Standard deviation: σ1 = σ2 = 1
  · Sample size: M = 20
- Different for the populations:
  · Mean: µ1 = 0, µ2 = 0.8
- Test problem: H0: µ1 = µ2 vs. H1: µ1 ≠ µ2

[Figure: two samples under H1.]

What do we need for the calculation?
- Test design
- Means
- Standard deviation
- Sample size? Power?

Recall: the experimentally computed power was 1 − β = 0.69 (69%).


SamplePower 2.0 - Textbook

"Power and Precision"
Authors: Michael Borenstein, Hannah Rothstein, Jacob Cohen, David Schoenfeld, Jesse Berlin, Edward Lakatos
Compatible with SamplePower 2.0
What can we do with SamplePower 2.0 with a 2-independent-samples t-test?

- Compute the power for given:
  · effect
  · standard deviation
  · sample size
- Compute the sample size for given:
  · effect
  · standard deviation
  · power
- Adjust:
  · significance level
  · confidence intervals
  · precision of the numbers (µ, SD, N)
- Create power tables and plots for different:
  · significance levels
  · sample sizes
  · effects
  · standard deviations

Computing the power for a given sample size
[Screenshot.]

Computing the sample size for a given power (0.9)
[Screenshot.]
Cross tables (2x2)

Question: Is the appearance of side effects of the treatment associated with the sex of the patient?

           Side effects: No    Side effects: Yes        Total
Sex        Count   Percent     Count   Percent     Count   Percent
Male        238     74.8%        80     25.2%       318     48.0%
Female      226     65.5%       119     34.5%       345     52.0%
Total       464     70.0%       199     30.0%       663     100%

p < 0.0001

From the LTBI study (Ann Iren Olsen, Helse Fonna, not yet published)


Cross tables (RxC)

Question: Is the appearance of side effects of the treatment associated with the health region?

                Side effects: No   Side effects: Yes     Don't know          Total
Health Region   Count   Percent    Count   Percent    Count   Percent   Count   Percent
Helse SørØst     224     64.9%      107     31.0%       14     4.1%      345     48.8%
Helse Vest        85     65.9%       41     31.8%        3     2.3%      129     18.1%
Helse Midt        60     74.1%       18     22.2%        3     3.7%       81     11.4%
Helse Nord        99     63.1%       33     21.0%       25    15.9%      157     22.1%
Total            468     65.7%      199     27.9%       45     6.3%      712    100.0%

From the LTBI study (Ann Iren Olsen, Helse Fonna, not yet published)
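For a table like the 2x2 one above, the association test itself is a one-liner; a minimal sketch (Python/SciPy assumed) using the counts from that table:

```python
# Chi-square test of association between sex and side effects,
# using the 2x2 counts from the table above.
import numpy as np
from scipy.stats import chi2_contingency

#                  No   Yes
table = np.array([[238,  80],    # male
                  [226, 119]])   # female

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```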
Cross tables (RxC) - Power with a sample size of 100
[Screenshot.]

Cross tables (RxC) - Power for different sample sizes and significance levels; sample size for a power of 0.9
[Screenshot.]