A Brief Tutorial on Sample Size and Statistical Power Calculations for Animal Experiments

Phillip Chapman
Department of Statistics, CSU
March 8, 2007

Objectives:
1. Overview of basic ideas of power and sample size calculations in
the context of simple examples.
2. Introduction to easily available tools for power and sample size:
• Software available on CSU campus for a reasonable cost
(SAS, $48/yr.; MINITAB, $127 purchase)
• Interactive software available for free on the web. (Russell
Lenth, U. of Iowa, and UCLA Online Power Calculator.)
3. Give some practical advice about how to proceed in some
common situations.
4. Overall objective: help you plan better experiments (and in the
process help you fill out the ACUC A-100 form).
Outline:
1. Two-sample t-test for two independent samples.
SAS Analyst (interactive menu)
MINITAB
Russell Lenth Online Power Programs
UCLA Online Power Calculator
(http://calculators.stat.ucla.edu/ - no demo today)
2. Selecting inputs to power programs (two-sample t-test).
3. Confidence interval width for two independent samples.
4. Hypothesis tests for comparing two proportions (z-test and
Fisher’s Exact Test for small samples) – SAS Proc Power.
5. Confidence interval width for comparing two proportions.
6. Additional Examples (fewer details, primarily for reference):
A. T-test for paired samples
B. One-way ANOVA,
Contrasts in a one-way ANOVA – SAS Proc Power

1. Two-sample t-test for two independent samples
Mice will be randomly assigned to two groups (n mice per group):
1. Treated group (T)
2. Control group (C) – (untreated or reference treatment)
We plan to record Y = log(CFU) of bacteria in the lungs.
We will compare the two groups using a two-sample t-test (α=0.05):

$$ t = \frac{\bar{y}_C - \bar{y}_T}{s_{pooled}\sqrt{2/n}} $$

Compare t to tcrit (from t-table)
Significant difference if |t| > tcrit
(or compare p-value to α, usually 0.05)
Significant difference = success of experiment.
Probability of significance = “POWER”
How do we assure a high probability of success?
(i.e. How do we assure high power?)
Notation:
µC = pop. mean in control gp.
µT = pop. mean in treated gp.
σC = pop. std. dev. within the control.
σT = pop. std. dev. within the treated.
Assume σC = σT = σ (common std. dev.)

Hypothesis test: Are the means the same?
H0: µT = µC (or ∆ = µC − µT = 0)
HA: µT ≠ µC (or ∆ = µC − µT ≠ 0)
[Figure: Distribution of calculated t, as if H0 is true. The tails beyond −tcrit and tcrit each have area α/2.]
α = “Type I error rate” = probability of rejecting H0, when H0 is true.

Power = probability of rejecting H0 if false (i.e. HA true)
[Figure: Distribution of t, if H0 is true, and distribution of t, if HA is true. Each tail of the H0 curve beyond ±tcrit has area α/2 = 0.025; the area under the HA curve beyond tcrit is the power.]
The center of the “HA true” curve is λ.
Power depends on:
1. σ = estimate of the within group std. dev. (= 1.0 above)
2. "True values" of µC and µT (µC = 6.0 and µT = 4.5 above)
3. n = sample size (= 9 above)
Power (area under the HA curve) is easily calculated in SAS.

$$ \lambda = \frac{\mu_C - \mu_T}{\sigma\sqrt{2/n}} = \frac{6 - 4.5}{1.0\sqrt{2/9}} = 3.18 $$

Result: Power=0.847
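For readers without SAS at hand, the calculation can be sketched in plain Python. This is a minimal illustration, not the SAS algorithm: it assumes the usual noncentral-t formulation (T = (Z + λ)/√(V/df) with V chi-square on df = 2(n − 1) degrees of freedom) and uses simple midpoint-rule integration. All function names are hypothetical.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def chi2_pdf(v, df):
    """Chi-square density at v > 0, computed on the log scale."""
    return exp((df / 2 - 1) * log(v) - v / 2 - (df / 2) * log(2) - lgamma(df / 2))


def reject_prob(tcrit, df, lam, steps=4000):
    """P(|T| > tcrit) for a noncentral t with df >= 2 and noncentrality lam.

    Midpoint rule over the chi-square variable V (never evaluates v = 0).
    """
    upper = df + 12 * sqrt(2 * df) + 40  # generous cut-off for the chi-square tail
    h = upper / steps
    total = 0.0
    for i in range(steps):
        v = (i + 0.5) * h
        c = tcrit * sqrt(v / df)
        # Given V = v: T > tcrit iff Z > c - lam, and T < -tcrit iff Z < -c - lam
        total += chi2_pdf(v, df) * ((1 - PHI(c - lam)) + PHI(-c - lam))
    return total * h


def t_critical(df, alpha=0.05):
    """Two-sided critical value, found by bisection on the central t (lam = 0)."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if reject_prob(mid, df, 0.0) > alpha:
            lo = mid  # tail area still too large: critical value is further out
        else:
            hi = mid
    return (lo + hi) / 2


def two_sample_power(mu_c, mu_t, sigma, n, alpha=0.05):
    """Power of the two-sided, equal-variance two-sample t-test, n per group."""
    df = 2 * (n - 1)
    lam = (mu_c - mu_t) / (sigma * sqrt(2 / n))
    return reject_prob(t_critical(df, alpha), df, lam)


if __name__ == "__main__":
    print(round(two_sample_power(6.0, 4.5, 1.0, 9), 3))
```

Running it with the slide's inputs (µC = 6.0, µT = 4.5, σ = 1.0, n = 9) gives a value that can be compared with the 0.847 reported by SAS.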
1. Double-click on the SAS icon to start the SAS program.
2. From the “Solutions” menu, select “Analysis”, then “Analyst”.
3. From the “Statistics” menu, select “Sample Size”, then “Two-Sample t-test”.
4. Put appropriate values into the boxes (give a range of n values
requested), then select “OK”.
5. Read power results from the “Power Analysis” window.
6. Look at the graph in the “Plot of Power vs. N per group” window.

Initial SAS window with “Analyst” selected:
Analyst window with “Two-sample t-test” selected:
Two-sample t-test window:
1. It doesn’t matter which means are labeled “1” and “2”.
2. You can specify n, and have it calculate power, or vice versa.
3. Select a range of n values, and an increment (“by”).
4. Use the “Help” button for explanations, if needed.
5. Select “OK” when done.
Two-sample t-test results:
Interpretation: We have a 0.847 probability of observing a statistically
significant two-sided (α=0.05) t-test if the true group means are 4.5
and 6.0, respectively, the within-group standard deviation is 1.0 for
both groups, and the sample size is n=9 per group.
The optional plot from SAS Analyst:
Russell Lenth’s Online Power Calculator (Java applets)
Location: http://www.stat.uiowa.edu/~rlenth/Power
1. Very complete with respect to t-tests and ANOVA’s; less
complete for proportions.
2. See immediate response to “sliders”, or enter numbers by
clicking on the little grey boxes in the corner.
3. Requires Java for interactive applets.
4. Programs can be downloaded to run on your own computer.
Suggested (by Lenth) citation format:
Lenth, R. V. (2006). Java Applets for Power and Sample Size [Computer
software]. Retrieved month day, year, from
http://www.stat.uiowa.edu/~rlenth/Power.
Two-sample t-test power calculator from Russell
Lenth’s web page:
Note: This menu allows the two groups to have different
variances (not an option in SAS Analyst).
Click on button to enter values (rather
than use the slider)
In MINITAB 13, select “Power and Sample Size”, then “2-Sample t..”
from the “Stat” menu. (Version 15 was just released.)
More than one sample size (per group), difference between
means, and power value may be specified.
Select “Options” to set α or change to a one-sided test.
The options sub-menu from the previous screen. Change what you
like, and then select OK.
MINITAB two-sample t-test power results:
Power for n=9 (per group) = 0.8476
Significant difference if |t| > tcrit
(t is observed from the data; tcrit comes from the t-table.)

$$ \lambda = \frac{\mu_C - \mu_T}{\sigma\sqrt{2/n}} $$

1. Power increases with |λ|. (It moves the HA curve away from zero.)
2. Power increases with α. (It moves the tcrit value toward zero.)
3. Power increases with larger true differences (∆ = µC − µT).
4. Power decreases with increased variability (σ).
5. Power increases with √n.

Since power depends on these values, how do you decide
which numbers to use?

2. Selecting inputs to power programs.
1. Selecting the Type I error rate α: There is a lot of tradition behind
using α=0.05. Occasionally, α=0.10 is used when the results are
thought to be preliminary.
2. Estimating σ usually causes the most difficulty:
A. Use estimates from previous or published work. (Be careful to
note whether it is the standard deviation, or the standard error of
the mean, that is being reported: s.e.(mean) = sd / √n.)
B. Use estimates from similar experiments, even though they may
not be completely comparable (but add appropriate caveats).
C. Use a range of values that might be reasonable. (Larger values
of σ yield pessimistic (i.e. large) sample sizes.)
D. Use relative values: If the mean is believed to be about 100, you
might try to estimate the relative variability (use the coefficient of
variation = σ/µ) and work backward from there.
3. Desired power: 0.80 is usually a minimum; 0.90 is more common.
Power above 0.99 would generally be thought to be a waste of
resources. Since power is the probability of success of the
experiment, you have some leeway in selecting it. You might allow
more leeway, if you are less sure of your variance estimates. Also,
you may have multiple objectives, and you may have more power
for some than others.
4. One-sided or two-sided? Two-sided is the conservative choice
(less power and larger sample size). One-sided tests are suspect
in some fields. (“Home run in Coors Field” analogy.)
5. Magnitude of difference: ∆ = µC − µT. Select the smallest
difference that it is necessary to detect. Since the power depends
only on the difference, the individual means are not important.
Even the difference is only important relative to σ.
Sometimes ∆/σ is called the “effect size.”
Some references classify effect sizes as small, medium, and large
when they are 0.2, 0.5, and 0.8, respectively. My opinion is that
you should decide the appropriate effect size for your scientific
purpose; ignore what is identified as small or large.
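The input choices above can be tried out with a small Python sketch. It is only a rough planning aid under stated assumptions: the function names are hypothetical, and the sample size uses the large-sample normal approximation n ≈ 2(z₁₋α/₂ + z_power)²/(∆/σ)², which tends to come out slightly smaller than the t-based answers from SAS, MINITAB, or Lenth's applets.

```python
from math import ceil, sqrt
from statistics import NormalDist


def se_to_sd(se_mean, n):
    """Recover a standard deviation from a reported standard error of the mean."""
    return se_mean * sqrt(n)


def effect_size(delta, sigma):
    """Difference to detect, expressed in within-group standard deviations."""
    return delta / sigma


def approx_n_per_group(delta, sigma, power=0.90, alpha=0.05):
    """Normal-approximation sample size per group for a two-sided test."""
    z = NormalDist().inv_cdf
    d = effect_size(delta, sigma)
    return ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 / d ** 2)


if __name__ == "__main__":
    # A published s.e. of 0.5 based on 4 animals corresponds to sd = 1.0
    sigma = se_to_sd(0.5, 4)
    # Try a range of sigma values, as suggested in item 2C
    for s in (0.8 * sigma, sigma, 1.2 * sigma):
        print(round(s, 2), approx_n_per_group(delta=1.5, sigma=s))
```

Looping over several plausible σ values shows how quickly the required n grows when the variability estimate is pessimistic.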
3. Confidence interval width for independent samples
When the purpose of the experiment is to estimate a difference between
treatment means, rather than test whether the difference is zero (or any
other value), a more appropriate sample size tool is confidence interval
width, rather than power.
A 95% C.I. is of the form:

$$ \bar{y}_C - \bar{y}_T \pm E, \quad \text{where } E = t_{0.025}\, s_{pooled} \sqrt{2/n} $$

E is called the “margin of error”. Solving the above for n and
letting σ approximate spooled:

$$ n = \frac{2\,(t_{0.025})^2\,\sigma^2}{E^2}, \qquad t_{0.025} \approx 2.0 $$

The formula can only be used for large experiments due to two
problems:
1. For small n, t0.025 is somewhat larger than 2.0.
2. For small n, spooled may be larger than σ, resulting in a wider CI than
planned. (These problems are avoided if you use SAS ANALYST.)
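As a quick arithmetic check, the large-sample formula can be evaluated directly. The sketch below is illustrative only (the function name is hypothetical); it uses t0.025 ≈ 2.0 and ignores both small-sample problems listed above, so software that accounts for them will ask for more animals.

```python
from math import ceil


def n_for_margin(sigma, margin, t_value=2.0):
    """Per-group n from n = 2 * t^2 * sigma^2 / E^2, rounded up."""
    return ceil(2 * t_value ** 2 * sigma ** 2 / margin ** 2)


if __name__ == "__main__":
    # Example inputs: sigma = 1.0, margin of error E = 0.75
    print(n_for_margin(sigma=1.0, margin=0.75))
    # Halving E roughly quadruples n
    print(n_for_margin(sigma=1.0, margin=0.375))
```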
In the two-sample problem considered earlier:
Mice will be randomly assigned to two groups (n mice per group):
1. Treated group (T)
2. Control group (C) – (untreated or reference treatment)
We plan to record Y = log(CFU) of bacteria in the lungs.
σ = estimate of the within group std. dev. (assumed = 1.0)
New objective: Estimate the difference between the mean
log(CFU) of the two groups with a margin of error (E) of
approximately 0.75 log(CFU), and be 90% sure that the
final 95% confidence interval will be no wider than ±E.
Do the above using SAS Analyst.
After opening SAS Analyst, select Two-Sample Confidence Interval
as below:
Desired precision: E = 0.75
Within group standard deviation: σ = 1.0
Select the “N per group” button to get sample size.
“Power” in this context is the probability that the interval will be no
wider than the specified E. (This is random because spooled is
random and may be larger than σ.) Specified as 0.90.
Interpretation (approximate): If the true within-group standard
deviation is 1.0, and the sample size is 20 per group, then there
is a 90% probability that the resulting confidence interval will be
no wider than 1.5 (i.e. diff ± 0.75).
Some comments about the previous C.I. width procedure:
1. Selecting the confidence level: Analogous to selecting α in the
testing situation. The most common level is 95% confidence, but other
levels are sometimes used (80%, 90%, 99%).
2. Estimating σ causes the most difficulty (as before).
3. Selecting E (margin of error): Depends on the objectives of the
experiment. With how much precision do you need to estimate
the mean difference? (Be careful what you choose, because n
becomes dramatically larger as E becomes smaller.)
4. “Power” is the probability that the final interval will be no wider
(i.e. not worse) than planned. It is not really power in the testing
sense, but it is still the probability of success, if success is measured
by whether the final interval is no wider than planned. Most
people pick 0.80 or 0.90, depending on how lucky they feel.
5. My opinion: C.I. width criteria should be used more often, and
power of hypothesis tests should be used less often. Many
researchers feel forced into power calculations when they are
really trying to estimate a difference with a desired precision, rather
than test whether the difference is zero.
6. Post hoc (retrospective) power: Often researchers are asked the
question, “Did your experiment have enough power to detect a
difference?” This question is circular, because if your effect is
significant, you had enough power; if it is not significant, you didn’t.
A. If you calculate post hoc power, use the estimated σ from the
experiment, but identify the magnitude of the difference to detect
based on scientific reasons; do not use the estimated differences
from the experiment.
B. C.I. widths can be used in place of post hoc (retrospective) power
analyses:
[Figure: C.I. Exp #1 and C.I. Exp #2, shown relative to 0.0 and the difference of interest.]

4. Hypothesis tests for comparing two proportions
Birds will be sampled from two populations and the rate of West Nile
Virus compared between the two populations.
1. Population 1 (y1 positives out of n1 sampled)
2. Population 2 (y2 positives out of n2 sampled)
We will compare the two groups using an α=0.05 two-sample z-test
of proportions:
Let:

$$ \hat{p}_1 = y_1 / n_1, \qquad \hat{p}_2 = y_2 / n_2, \qquad \bar{p} = (y_1 + y_2) / (n_1 + n_2) $$

p̂1 and p̂2 estimate the true proportions π1 and π2.
p̄ estimates the true proportion, assuming that H0: π1 = π2.
H0 is tested against a two-sided alternative using a z-test (with n1 = n2 = n):

$$ z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{2\,\bar{p}\,(1 - \bar{p})/n}} $$

Compare z to zcrit (from the z-table)
Significant difference if |z| > zcrit
(or compare p-value to α, usually 0.05)

To estimate power:
1. Choose type I error rate α: We assume α = 0.05.
2. Choose values for π1 and π2: We assume 0.20 and 0.30.
3. Choose sample sizes: We assume both equal 100 (or choose
desired power).
Programs:
1. Lenth’s power program (moderate to large samples).
2. MINITAB (moderate to large samples).
3. SAS Proc Power (This calculation is not available in SAS Analyst.)
A. z-test program (moderate to large samples).
B. Fisher’s Exact Test Program (for small samples)
4. UCLA Power Calculator for Fisher’s Exact Test (no demo today).
From Lenth’s Web Page:
Power estimated to be 0.31 (too low!). Need to increase n, or be
more realistic about the size of a difference you can detect.
From MINITAB:
Type I error rate (α) and one-sided versus two-sided test
options are chosen from the “Options” menu.
Power from MINITAB is slightly different from that from Lenth’s web site,
because Lenth uses an optional continuity correction.
Run the following program in SAS:
(Screenshot callout: “Click here to run”.)
SAS Proc Power results match the MINITAB results:
Re-run the previous program setting power to 0.90 (SAS computes n).
392 per group required to achieve the desired 0.90 power:
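The large-sample z-test calculation can also be sketched in Python for readers who want to see the arithmetic. This is an illustration under assumptions, not the SAS or MINITAB code: it uses the common normal approximation with a pooled standard error under H0 and an unpooled one under HA, applies no continuity correction, and the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

N01 = NormalDist()  # standard normal


def z_test_power(pi1, pi2, n, alpha=0.05):
    """Approximate power of the two-sided z-test with n per group."""
    zcrit = N01.inv_cdf(1 - alpha / 2)
    pbar = (pi1 + pi2) / 2
    se0 = sqrt(2 * pbar * (1 - pbar) / n)                   # SE assuming H0
    se1 = sqrt((pi1 * (1 - pi1) + pi2 * (1 - pi2)) / n)     # SE assuming HA
    delta = abs(pi1 - pi2)
    upper = N01.cdf((delta - zcrit * se0) / se1)            # reject in the expected direction
    lower = N01.cdf((-delta - zcrit * se0) / se1)           # reject in the opposite direction
    return upper + lower


def z_test_n(pi1, pi2, power=0.90, alpha=0.05):
    """Approximate n per group for the requested power (ignores the tiny opposite tail)."""
    zcrit = N01.inv_cdf(1 - alpha / 2)
    zpow = N01.inv_cdf(power)
    pbar = (pi1 + pi2) / 2
    num = zcrit * sqrt(2 * pbar * (1 - pbar)) + zpow * sqrt(pi1 * (1 - pi1) + pi2 * (1 - pi2))
    return ceil((num / (pi1 - pi2)) ** 2)


if __name__ == "__main__":
    print(round(z_test_power(0.20, 0.30, 100), 3))
    print(z_test_n(0.20, 0.30, power=0.90))
```

With π1 = 0.20, π2 = 0.30, and power 0.90, the sample-size function agrees with the 392 per group reported above.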
Fisher’s Exact Test for small sample problems:
Consider an extreme case:
π1 = 0.10 and π2 = 0.90 (very different)
Because the proportions are so different, the sample size
required to achieve power equal to 0.90 should not be very large.
For small samples, Fisher’s Exact Test is generally preferred
to the z-test or chi-square test.
Sample size 10 per group required to achieve 0.90 power.
Note: The UCLA online Power Calculator will also do power for a
Fisher’s Exact Test.
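For samples this small, the power of Fisher's Exact Test can be computed by brute force: enumerate every possible pair of counts and add up the probability of those that give a significant result. The sketch below assumes the common two-sided p-value definition (sum of the probabilities of all tables no more likely than the observed one); the function names are hypothetical, and the result need not match a given package exactly, since some packages use approximations.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k positives out of n."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def fisher_two_sided_p(y1, n1, y2, n2):
    """Two-sided Fisher p-value for the 2x2 table with y1/n1 and y2/n2 positives."""
    m = y1 + y2  # total positives (fixed margin)
    denom = comb(n1 + n2, m)
    lo, hi = max(0, m - n2), min(n1, m)
    probs = {x: comb(n1, x) * comb(n2, m - x) / denom for x in range(lo, hi + 1)}
    cutoff = probs[y1] * (1 + 1e-9)  # small tolerance so exact ties are counted
    return sum(p for p in probs.values() if p <= cutoff)


def fisher_power(pi1, pi2, n, alpha=0.05):
    """Exact power with n per group: total probability of significant outcomes."""
    power = 0.0
    for y1 in range(n + 1):
        for y2 in range(n + 1):
            if fisher_two_sided_p(y1, n, y2, n) <= alpha:
                power += binom_pmf(y1, n, pi1) * binom_pmf(y2, n, pi2)
    return power


if __name__ == "__main__":
    for n in (6, 8, 10):
        print(n, round(fisher_power(0.10, 0.90, n), 3))
```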
5. Confidence interval width for comparing two proportions

$$ \hat{p}_1 - \hat{p}_2 \pm E \qquad (\text{width} = 2E) $$

$$ E = z_{0.025} \times SE(\hat{p}_1 - \hat{p}_2) = z_{0.025} \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}} $$

Assume n1 and n2 are the same (= n), put in
estimates for p̂1 and p̂2, and solve for n:

$$ n = \frac{z_{0.025}^2 \left( \hat{p}_1 (1 - \hat{p}_1) + \hat{p}_2 (1 - \hat{p}_2) \right)}{E^2} $$

What to use for p̂1 and p̂2 in this formula?
Guess the ranges of values for p1 and p2, OR take the worst case:
p1 = 0.5 and p2 = 0.5.

Example: How large must n (per group) be to get a 95% C.I. for π1 − π2
of width 0.20 (i.e. E = 0.10)?
Say p1 is thought to be between 0.4 and 0.6
and p2 is thought to be between 0.2 and 0.4.

$$ n = \frac{(1.96)^2 \left[ (0.5)(1 - 0.5) + (0.4)(1 - 0.4) \right]}{(0.10)^2} \approx 188.2, \text{ so } n = 189 $$

Notes:
1. This sample size estimate can be inaccurate when p1 or p2 is near (or
thought to be near) 0 or 1.
2. The margin of error (E) obtained using this sample size will be
smaller than the desired E about half the time.
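The formula is easy to evaluate for several guesses of p1 and p2. The following Python sketch is an illustration only (the function name is hypothetical); it uses the exact normal quantile rather than the rounded 1.96 and rounds n up.

```python
from math import ceil
from statistics import NormalDist


def n_for_proportion_margin(p1, p2, margin, conf=0.95):
    """Per-group n so that the C.I. for p1 - p2 has margin of error about `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / margin ** 2)


if __name__ == "__main__":
    # Worst case (both proportions 0.5) versus guesses taken from the assumed ranges
    for p1, p2 in [(0.5, 0.5), (0.5, 0.4), (0.5, 0.2)]:
        print(p1, p2, n_for_proportion_margin(p1, p2, margin=0.10))
```

Comparing the worst case with values from the guessed ranges shows how much is saved, if anything, by committing to a narrower guess.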
6. Additional Examples: A. T-test for paired samples
Testing for a mean difference in paired samples is testing
whether the mean difference is zero. The σ in the analysis
is the std. dev. of the differences.
Selecting a Paired t-test in SAS Analyst
The Paired t-test dialog box in SAS Analyst
Selecting a Paired Confidence interval in SAS Analyst
The Paired Confidence interval dialog box in SAS Analyst

6. Additional Examples: B. One-way ANOVA
Select “Balanced ANOVA” to get any of many ANOVA models
Lenth’s “Balanced ANOVA” dialog box. Use the arrow to see a
long list of possible models.
Below I have used Excel to compute the standard deviation of the
alternative hypothesis treatment means for use in Lenth’s One-way
ANOVA dialog box. (next page)
I have also used Excel to compute the corrected sum of squares
(CSS) for the same alternative hypothesis treatment means for
later use in SAS Analyst.
t = # treatments
µi = treatment means
µ̄ = mean of the treatment means

$$ CSS = \sum_i (\mu_i - \bar{\mu})^2 $$

$$ SD[\text{treatment}] = \sqrt{\frac{\sum_i (\mu_i - \bar{\mu})^2}{t - 1}} $$

First choose “F-tests” to compare all treatments.
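The same two quantities can be computed without Excel. In the Python sketch below, the function names and the list of treatment means are hypothetical (the slides do not list the means that were used); only the formulas for CSS and SD[treatment] come from the text.

```python
from math import sqrt


def corrected_ss(means):
    """CSS: sum of squared deviations of the treatment means from their average."""
    grand = sum(means) / len(means)
    return sum((m - grand) ** 2 for m in means)


def sd_treatment(means):
    """SD[treatment] = sqrt(CSS / (t - 1)), with t = number of treatments."""
    return sqrt(corrected_ss(means) / (len(means) - 1))


if __name__ == "__main__":
    # Hypothetical alternative-hypothesis means, for illustration only
    hypothetical_means = [2.0, 3.0, 4.0, 5.0]
    print(corrected_ss(hypothetical_means))
    print(round(sd_treatment(hypothetical_means), 3))
```

As a consistency check on the tutorial's own numbers, CSS = 22.83 and SD[treatment] = 2.137 agree with each other if there are six treatments, since √(22.83/5) ≈ 2.137.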
Lenth’s One-way ANOVA dialog box.
SD[treatment]=2.137, from previous page.
Power = 0.8024 in the One-way ANOVA F-test
Lenth’s dialog box for power of contrasts in the One-Way ANOVA.
(Select “Differences/Contrasts” at the bottom of Slide 51.)
Power = 0.377 for detecting a difference of 3.0 between two
treatments.
Selecting One-way ANOVA in SAS Analyst:
The corrected sum of squares of the alternative hypothesis treatment
means (CSS) = 22.83 was previously computed in Excel:
Results of the SAS Analyst One-way ANOVA power calculation
Power for a contrast in a One-way ANOVA computed using SAS Proc
Power
Contrast of treatment #1 with treatment #4. The true
treatment means under the alternative hypothesis are 2.0
and 5.0, respectively.
Output for SAS Proc Power Program on the previous page.
A general comment about power for ANOVA models:
Although it is possible to calculate power for a variety
of multi-factor ANOVA models, significance in the F-tests
for main effects and interactions is not the end of
the analysis.
It is generally necessary to do follow-up comparisons
of individual treatments, or contrasts of treatment
groups. Often that returns the problem to something
that can be addressed with the two-sample
comparisons, or the contrast comparisons, discussed
in this presentation.