Sample size and power Jon Michael Gran MF9130 Introductory course in statistics

Sample size and power
Jon Michael Gran
Department of Biostatistics, UiO
MF9130 Introductory course in statistics
Wednesday 08.06.2011
1 / 39
Overview
Sample size and power
(Aalen chapter 9.6, Kirkwood and Sterne chapter 35)
• Principles of sample size
• Sample size for precision
• Sample size in hypothesis testing, power
2 / 39
Sample size
Conclusions in scientific studies greatly depend on sample size
• A too small sample reduces the chance to get significant
findings and reduce the generalizability of the study →
Non-significant results not interesting
• There are methods for calculating the required sample size
• Calculation of required sample size is done prior to start of
study
• See e.g. Kirkwood and Sterne Section 35, for a good
discussion
3 / 39
Example: Pocock et al. BMJ 1994; 309:1189-97
Meta-analysis: change in IQ (children above 5 years) for increase in
blood lead from 10 to 20 µg/dl, using three measures of blood lead
4 / 39
Sources of error when testing
• Two main sources:
I Systematic error
I Random error
• Systematic errors are due to a faulty design: Bias, lack of
randomization, no blinding etc.
• Random errors are due to just random variation. Can be
reduced by increasing the sample size
5 / 39
Random errors
• Remember the calculations of standard error in previous
lectures
• The effect of random error decreases by the square root of n
as the sample increases
• Standard error of the mean:
σ
se = √
n
• Standard error of a proportion:
s
se =
p(1 − p)
n
6 / 39
Illustration: Standard error of the mean of a variable with
standard deviation σ = 1
7 / 39
Sample size calculations
You will have to quantify:
• Realistic or clinically relevant difference between groups
• Expected variation in the data (e.g. standard deviation)
• Significance level, e.g. 5%
• Wanted power - or alternatively; how many individuals n you
can afford to include in the study
8 / 39
So what is power?
• Power = P(reject H0 |Ha true)
• With words: The power is the probability of rejecting the null
hypothesis if the alternative hypothesis is true
• Or: The probability to find a difference between the
groups if there really are a difference
• More subjects → higher power
9 / 39
How large a power should you have?
• As high as possible; but a question about what is possible
with regards to sample size
• The usual minimum for a study with “good” power is 80%
(“Industry minimum”)
• Many large definitive studies aim at a power of 99.9%
10 / 39
First: Sample size for precision
• Want to construct a 95% confidence interval with a given
degree of precision
• Wish to be 95% certain that the estimate is within ±2se of
the true value (or 1.96 to be precise)
• If the estimate is normally distributed, sample mean ±2se
covers approximately 95%. This means that a = 2se
11 / 39
Remember the normal distribution
• µ = expected value, σ = standard deviation
• µ ± 2σ covers approximately 95% of the distribution (a = 2σ)
12 / 39
Sample size for estimating the mean with required accuracy (a)
• To estimate a mean, the following number of observations are
needed:
n=
(because a = 2se = 2 ·
4σ 2
a2
√σ )
n
13 / 39
Sample size for estimating the mean with required accuracy (a)
• To estimate a mean, the following number of observations are
needed:
n=
(because a = 2se = 2 ·
4σ 2
a2
√σ )
n
• Example: Number of observations required to estimate the
cholesterol level (mmol/dl) in the population with a precision
of 0.5. Suppose σ = 1. Number of observations required:
n=
4 × 12
= 16
0.52
13 / 39
Sample size for estimating a proportion
• To estimate a proportion, the following number of
observations are needed:
n=
(because a = 2se = 2 ·
q
4p(1 − p)
a2
p(1−p)
)
n
14 / 39
Sample size for estimating a proportion
• To estimate a proportion, the following number of
observations are needed:
n=
(because a = 2se = 2 ·
q
4p(1 − p)
a2
p(1−p)
)
n
• Example: Number of observations required to estimate the
prevalence of depression in the population with accuracy 0.03.
Suppose p = 0.1. Number of observations required:
n=
4 × 0.1(1 − 0.1)
= 400
0.032
14 / 39
Sample size in hypothesis testing
• Study sample has to be so large that an interesting effect will
turn out as significant, with a large probability
• This means that negative results will also be interesting if
the power is high (and worthy of publication)
Figur: For example testing for difference in two normally distributed
variables
15 / 39
Types of errors in hypothesis testing
• Type I error: To conclude that there is an effect (or
difference), when in reality there is not
16 / 39
Types of errors in hypothesis testing
• Type I error: To conclude that there is an effect (or
difference), when in reality there is not
I
I
Controlled by the significance level. Probability: α
Usually choose significance level α = 5%: Only 5% chance of
discovering an effect (rejecting the null hypothesis) when there
really is no effect
16 / 39
Types of errors in hypothesis testing
• Type I error: To conclude that there is an effect (or
difference), when in reality there is not
I
I
Controlled by the significance level. Probability: α
Usually choose significance level α = 5%: Only 5% chance of
discovering an effect (rejecting the null hypothesis) when there
really is no effect
• Type II error: Not discovering a true effect
I Controlled by the sample size. Probability: β
I Power: 1-β (chance of discovering the effect)
I A minimum for ’good’ power was said to usually be 80%: so
80% chance of discovering a true effect (rejecting the null
hypothesis if there is a difference)
16 / 39
Effect size (or standardized difference)
• In order to do sample size calculations, one often needs to
calculate the effect size
• Also called standardized difference
• Formulas for the effect size depends on which methods you
want to use for the analysis
I
I
I
Two sample t-test?
Paired t-test?
Comparing proportions?
• Effect size depends on
I Clinically relevant and realistic difference (often called ∆)
I The standard deviation of your measured variable
17 / 39
Different effect sizes
The smaller the effect size is, the more data you need to
sample to get high power. This is because the data distributions
will overlap to a large extent for small effect sizes
18 / 39
From: http://www.uccs.edu/∼lbecker/psy590/es.htm
19 / 39
Nomogram from Altman: Practical statistics for Medical
Research
20 / 39
Using the nomogram: Comparing means in two groups
• Clinically relevant (and realistic) difference: ∆
• Standard deviation in both groups: σ
• Effect size is given by: ∆/σ
21 / 39
Using the nomogram: Comparing means in two groups
• Clinically relevant (and realistic) difference: ∆
• Standard deviation in both groups: σ
• Effect size is given by: ∆/σ
• Example: Comparing cholesterol level in two groups:
I Clinically relevant (and realistic) difference: ∆ = 0.5
I Standard deviation: σ = 1
I Effect size:∆/σ = 0.5
I A relevant difference of 0.5 means that if the average level in
the two groups is for example 5.7 and 6.2, it would be
important to discover
• Chose significance level α = 0.05 and power
1 − β = 0.80
21 / 39
Using the nomogram from Altman
22 / 39
Conclusion, cholesterol example
• An effect size of 0.5 gives a sample size of 120, i.e. need to
measure cholesterol level in 60 patients in each group in
order to get a significant effect
• The nomogram always states the total sample size
23 / 39
Sample size for continuous data - paired data
• E.g. a cross-over study
• Effect size: 2 · ∆/σd , where σd is the standard deviation of
the difference between the two measurements
• New cholesterol reducing drug, what is the effect of using it
for a month?
• A difference of 0.5 is important to discover, assume σd = 1
• Need to sample 30 patients
24 / 39
Using the nomogram from Altman
25 / 39
Sample size for proportions
• Compare proportions in two groups
I Initial guess on the proportions: p and p
1
2
I Relevant difference: p − p
1
2
I Average proportion: p
¯ = (p1 + p2 )/2
p1 −p2
I Effect size:
p
¯ ×(1−¯
p)
26 / 39
Sample size for proportions
• Compare proportions in two groups
I Initial guess on the proportions: p and p
1
2
I Relevant difference: p − p
1
2
I Average proportion: p
¯ = (p1 + p2 )/2
p1 −p2
I Effect size:
p
¯ ×(1−¯
p)
• Example: Compare the prevalence of depression in two
populations
I
I
I
Guess on the proportions: 0.10 and 0.20
Average proportion: p¯ = 0.15
Effect size: √ 0.20−0.10
= 0.28
0.15×(1−0.15)
26 / 39
Using the nomogram from Altman
27 / 39
Conclusion, depression example
• Choose:
α = 0.05 (5%)
1 - beta = 0.80 (80%)
• An effect size of 0.28 gives a sample size of 400, i.e. need
200 patients in each group to get a significant difference
28 / 39
We can also do sample size calculations in another way
• Example: Want to test the effect of nicotine gum. 15% of
quitting smokers are still not smoking after 6 months. If gum
increases this proportion to 30%, we want a significant test
• Average proportion is 0.225, effect size is 0.36 (calculate
yourself)
• For financial reasons, can only afford 200 patients in total
• Get power of 75%, i.e. 75% chance of discovering this effect
29 / 39
Using the nomogram from Altman
30 / 39
Using formulas: example from Kirkwood and Sterne
• Study comparing a continuous variable in two groups:
n=
(u+v )2 (σ12 +σ02 )
(µ1 −µ0 )2
where n is the sample size in each group, σi is the standard
deviation in group i, µ1 − µ0 = ∆ is the clinical relevant
difference and u and v are constants depending on the
significance level and power (90% power give u = 1.28, and
5% sign. level give v = 1.96, see tables for other values)
• Note: Sample size calculation depend on three main
quantities: The desired power, the clinical relevant difference,
and σ. In other words; the more precise measure, the less n
you need
31 / 39
The effect of averaging from normal distributions
Becomes easy to get significant differences, since the distributions
do not overlap anymore!
32 / 39
Different size of the groups
• If you need 400 patients in total, do you need exactly 200 in
each group?
• Broadly speaking: It does not matter that much
• If you need 400 patients in total, you can put 250 in one
group and 150 in the other
• If you want an exact answer, there are formulas for this. Use
literature
33 / 39
You have done sample size calculations, what can still go
wrong?
• Have been too optimistic on how fast patients are included
in the study
I
I
I
Too strict inclusion criteria
Too time demanding to collect information from each patient
Drop outs
• Negative surprises can be avoided by doing a pilot study!
• Wise to include more patients than estimated from the
sample size calculations to allow for drop outs (if possible)
34 / 39
Final remarks
• Number of required individuals will vary a lot between
studies
• A large difference between groups means that a study with 30
in each group is as good as a study with 100 in each group
• However, if the difference is smaller than expected, the power
will decrease
• The estimates for effect size and variance has a lot of
impact on these calculations! Are they correct?
• What if you don’t know the variance? (usually you don’t)
Look in similar studies or create tables with different variances
and choose the highest sample size you could afford
35 / 39
Final remarks (cont.)
• Important to come up with a few main hypotheses that you
wish to test
I
Choosing significance level 5% means that you will reject a
null hypothesis that is true in reality 5% of the time!
• If you find that other research hypotheses are more interesting
after you have collected your data, the initial sample size
calculations may be worthless
• We have seen simple examples; what if your research
hypothesis specifies a regression problem with many
variables?
I
Sample size calculation quickly becomes very uncertain and
difficult. May be pragmatic and do calculations of sample size
on the basis of a few main variables
• Generally: Need larger samples to control for more variables.
Use programs and see what is available for ANOVA, regression
and so on
36 / 39
Software for calculating sample size
• Buyable programs:
I SamplePower http://www.spss.com/samplepower
I nQuery http://www.statsol.ie/nquery/nquery.htm
• Free programs: http://statpages.org/
I This page also includes other statistical programs
37 / 39
38 / 39
Summary
• For fast or approximate calculations use the nomogram from
Altman
• See table 35.1 in Kirkwood and Sterne for overview of exact
formulas
• Power formulas are not hard, but can be very algebra intensive
- may want to use a computer program
• Take all calculations with a few grain of salt - fudge factor is
important!
• If in doubt when designing a study - ask a statistician. You
don’t want to use months or years collecting data, only to
learn that they are impossible to analyze the way you want to
- analyse comes after design
39 / 39