Sample Size Estimation: Required of Every Grant Proposal

EP 521 Spring, 2007 Vol I, Part 5
§3. Sample Size Estimation
A key part of study design is the sample size or “power” calculation, which is required of every grant proposal.
In this section:
(1) We begin with the theory behind power calculations and demonstrate how simple formulae for power and sample size are derived.
(2) Next, we show a unified treatment of power for RD, OR, and RR based on this theory.
(3) Then, we describe how varying the question being asked can have a substantial effect on the required sample size.
(4) We briefly explain the information needed for power calculations for matched-pair studies.
(5) We give some demonstrations of how to use and interpret software for power calculations.
Goals: to understand what affects power, how to define the problem, and how to get the computer to give you the answer you need.
Copyright © 2006, Trustees of the University of Pennsylvania
§3.1 Power In General
Sample Size Estimation: Terminology Review
Null hypothesis (Ho): specified value for a parameter (OR, RR, RD, IRR, IRD, for example)
Alternative hypothesis (Ha): specified alternative value for a parameter
Type I error = Pr(Reject Ho | Ho is true) = α
Type II error = Pr(fail to reject Ho | Ha is true) = Pr(fail to reject Ho | Ho is false) = β
Power = Pr(reject Ho | Ha is true) = 1- β
1-α = ?
(“Pr” signifies probability over repetitions of the study)
(References: Woodward, chap 8; Rothman and Greenland, pp. 184-8)
Notes:
(1) The α-level is not a p-value. A p-value is a quantity computed from, and varying with, the data. α is fixed and is specified without seeing the data.
(2) The p-value is not Pr(Ho vs Ha). It is loosely defined as Pr(observed result or one more extreme than observed | Ho true).
(3) The p-value is not Pr(data | Ho). That is the likelihood. The likelihood is usually much smaller than the p-value, because the p-value includes not only Pr(data | Ho) but also Pr(all other more extreme data configurations | Ho).
(4) Absence of evidence is not evidence of absence. Failing to reject Ho ≠ accepting Ho as true.
(5) Studies with too little power to produce appropriately narrow confidence intervals (as defined by the purpose of the study) are not “negative studies”; they are “indeterminate”.
An initial description of what we are doing will help.
[Figure: sampling distributions under H0 and Ha; horizontal axis marked at 0, 2, 3, and 4.]
Type I error (α): H0 is true, but you reject H0 in favor of Ha. Suppose that 2 is your threshold (critical value) for rejecting H0. Then, if H0 is true, you have only a very small chance of observing a value to the right of 2, and a large chance of observing a value to the left of 2.
Type II error (β): If Ha is true, then you have some chance of observing a value to the left of 2, below the critical value, but it is not great. You have a much larger chance of observing a value to the right of 2. How big a chance you have of observing a value at 2 or to the right of 2, if Ha is true, depends upon how far Ha is from H0. If Ha is far away, then power is bigger and Type II error is smaller.
What happens when sample size increases (or when variance decreases)? The distributions become narrower. (This is the distribution of the mean, for example.) Holding everything else constant, what does that do to the power to detect a difference? At 2, there is now very little chance of falsely rejecting H0; this is a very high critical value for rejecting H0. But if Ha is true, you are almost certain to observe a value of at least 2, meaning that power is almost 1.0 and Type II error is almost 0.
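The picture described here can be checked numerically. The following is a minimal sketch, assuming normal sampling distributions with H0 centered at 0, Ha centered at 3, and a critical value of 2; the standard errors (1.0 and 0.5) and the function name are illustrative assumptions, not values from the notes.

```python
from statistics import NormalDist

def error_rates(critical, mean_h0, mean_ha, se):
    """Return (type I error, power) for a test that rejects H0
    whenever the estimate falls to the right of `critical`."""
    alpha = 1.0 - NormalDist(mean_h0, se).cdf(critical)  # area right of the cutoff under H0
    power = 1.0 - NormalDist(mean_ha, se).cdf(critical)  # area right of the cutoff under Ha
    return alpha, power

# Assumed values: H0 at 0, Ha at 3, cutoff at 2; se = 1.0, then 0.5 (larger sample)
for se in (1.0, 0.5):
    alpha, power = error_rates(critical=2.0, mean_h0=0.0, mean_ha=3.0, se=se)
    print(f"se={se}: type I error={alpha:.5f}, power={power:.3f}")
```

Halving the standard error (as a fourfold increase in sample size would) shrinks the Type I error at the same cutoff and pushes power toward 1, which is the behavior described above.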
[Figure: density curves; horizontal axis from 0 to 4, vertical axis from 0 to 0.8.]
I can pick a vertical line (2, for example) to correspond to a chosen Type I error; this is the usual starting point. Then I can posit what Ha is (3 or 4), and if the sample size tells me how broad the distributions of the effect size are under H0 and Ha, I can estimate what the Type II error and power will be.
Alternatively, I can specify the Type I error and the power (and thus the Type II error) and estimate just how close Ha can be to H0 while still achieving this level of power.
We now change the paradigm only slightly. Every estimate has a distribution. The estimate can
be of a sample mean, or a measure of association or effect size in a sample.
We now think in terms of a distribution of OR, RR, or RD. We think of the distribution of OR,
RR, or RD if the null hypothesis were true vs the distribution if the alternative were true.
[Figure: sampling distributions of the difference under H0 and Ha, with the Type I error and Type II error areas labeled.]
From: Methods in Observational Epidemiology by J.L. Kelsey, A.S. Whittemore, Alfred S. Evans and W. Douglas Thompson,
1996, New York, Oxford University Press, p. 328.
Power calculations are based on the sampling distribution of the difference (in means or proportions) between the groups being compared.
d = value of the "difference" [risk difference, log OR, difference in means, etc.] when the null is true (d = 0)
dc = critical value: the value of the difference that is just significantly different from d at significance level α
d* = value of the difference when the null is false, i.e., when Ha is true
Some key numbers to remember for sample size calculations (for purposes of this presentation):

Quantity    Interpretation                       Value
Zα/2        Type I error of 0.05                 1.96
Zβ          Type II error of 0.2 (80% power)     +0.84
Zβ          Type II error of 0.1 (90% power)     +1.28

Values of (Zα/2 + Zβ)², used in sample size calculations:

Type I = 0.05; Type II = 0.2       7.85
Type I = 0.05; Type II = 0.1       10.5
Some texts refer to Zβ as Z 1-β and Zα/2 as Z 1-α/2 and thus have
slightly different formulae.
The key to all these sample size formulae is to look at the two distributions, the difference under Ho and the difference under Ha, with the cumulative probability (the area under the curve) calculated from OPPOSITE DIRECTIONS. Why? Because we are contrasting them. We look at Ho (vs Ha) by cumulating probability from left to right, but we look at Ha (vs Ho) by cumulating probability from right to left. These directions assume that the difference under Ha is larger than under Ho; they reverse if it is smaller.
So:
1. When the null is false (Ha is true), we are sampling from the distribution on the right. Values to the left of dc occur with probability β, the probability of inappropriately failing to reject H0. Area to the left of dc, when d* is true = Type II error = Pr(failing to reject H0 | Ha is true).
2. Values to the right of dc in the shaded area, α/2, represent the probability of rejecting H0 when we should fail to reject (since H0 is true).
3. Values to the right of dc, forming part of the distribution of d*, represent the power of detecting a true difference = Pr(rejecting H0 | H0 false) = Pr(reject Ho | Ha is true) = 1 − β.
By using the standard normal:
$$d_c = d + Z_{\alpha/2}\,[\mathrm{se}(d)] \qquad \text{(Eq 5.1), from the frame of reference of Ho,}$$
and
$$d_c = d^* - Z_\beta\,[\mathrm{se}(d^*)] \qquad \text{(Eq 5.2), from the frame of reference of Ha,}$$
where Zα/2 is the standard normal deviate corresponding to the position of dc on the distribution around d, and Zβ is the standard normal deviate corresponding to the position of dc on the distribution around d*.
This we can see more easily if se(d) = se(d*) = 1. Then dc = d + Zα/2 and dc = d* − Zβ.
e.g., β = 0.1 = Type II error, so power = (1 − β) = 0.9; Z1−β = 1.28, Zβ = −1.28.
Think in terms of flipping over the Ha distribution, so that we look at z's in Ha from right to left rather than the usual left to right.
[Figure: standard normal density over x, with −1.28, 0, and 1.28 marked on the horizontal axis and the positions of d, dc, and d* indicated.]
The key practical point: use +1.28 for β = 0.1.
Then, setting Eq 5.1 = Eq 5.2: d + Zα/2 se(d) = d* − Zβ se(d*), or, if se = 1, d + Zα/2 = d* − Zβ. This is the key to estimating power and sample size. Finally, solving for Zβ we get:
$$Z_\beta = \frac{d^* - d - Z_{\alpha/2}\,[\mathrm{se}(d)]}{\mathrm{se}(d^*)} \qquad \text{(Eq 5.3)}$$
Usually we assume se(d) = se(d*) and simplify:
$$Z_\beta = \frac{d^* - d}{\mathrm{se}(d^*)} - Z_{\alpha/2} \qquad \text{(Eq 5.4)}$$
Note: Zβ can range from −∞ to +∞.
As is usual, d = 0, which corresponds to Ho: RD = 0, ln(OR) = 0, ln(RR) = 0, or X̄1 − X̄2 = 0. Then,
$$Z_\beta = \frac{d^*}{\mathrm{se}(d^*)} - Z_{\alpha/2} \qquad \text{(Eq 5.5)}$$
Using the simple Eq 5.5, we can arrive at a series of simple formulae for power and sample size calculations.
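Eq 5.5 translates directly into a few lines of code. This is a minimal sketch, assuming a two-sided α and a normal sampling distribution; the function name and the input values are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

def power_from_difference(d_star, se, alpha=0.05):
    """Eq 5.5: Z_beta = d*/se(d*) - Z_{alpha/2}; power is the standard
    normal area to the left of Z_beta."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = d_star / se - z_alpha
    return z_beta, std_normal.cdf(z_beta)

# Illustrative input: a true difference 2.8 standard errors away from the null
z_beta, power = power_from_difference(d_star=2.8, se=1.0)
print(f"Z_beta = {z_beta:.2f}, power = {power:.2f}")
```

A difference of about 2.8 standard errors gives Zβ near 0.84, i.e., roughly 80% power, which matches the key numbers listed earlier.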
§3.2 Power and sample sizes in case control and cohort studies
Methods of Sampling and Estimation of Sample Size
n = in a cohort or cross-sectional study, the number of exposed individuals studied; in a case-control study, the number of cases
r = in a cohort or cross-sectional study, the ratio of the number of unexposed individuals studied to the number of exposed individuals studied; in a case-control study, the ratio of the number of controls studied to the number of cases studied
σ = standard deviation in the population for a continuously distributed variable
p1 = in a cohort study (or a cross-sectional study), the proportion of exposed individuals who develop (or have) the disease; in a case-control study, the proportion of cases who are exposed
p0 = in a cohort study (or a cross-sectional study), the proportion of unexposed individuals who develop (or have) the disease; in a case-control study, the proportion of controls who are exposed
$$\bar{p} = \frac{p_1 + r\,p_0}{1 + r} = \text{weighted average of } p_1 \text{ and } p_0$$
(Ref: Kelsey et al., 1996, Table 12-11.)
So, when n is fixed by costs, time, etc., we can use power calculations.
Initial derivation of Eq 5.6 from Eq 5.5. We begin with the difference in means, or with the risk difference.
Recall the variance of a difference in means (assuming independence): Var(A − B) = Var(A) + Var(B).
Assuming a common standard deviation σ, and with d* = the difference under Ha,
$$\mathrm{var}(d^*) = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$$
Here, we know n2 = r·n1, so
$$\mathrm{var}(d^*) = \sigma^2\left(\frac{1}{n_1} + \frac{1}{r\,n_1}\right) = \sigma^2\left(\frac{r+1}{r\,n_1}\right)$$
So,
$$\mathrm{se}(d^*) = \sigma\sqrt{\frac{r+1}{r\,n_1}}$$
Therefore, Zβ for a difference in means:
$$Z_\beta = \frac{d^*}{\sigma}\sqrt{\frac{nr}{r+1}} - Z_{\alpha/2} \qquad \text{(Eq 5.6)}$$
Zβ for a difference in proportions (or a risk difference): recall Var(p) = p(1 − p)/n, and substitute √(p̄(1 − p̄)) for σ above in Eq 5.6:
$$Z_\beta = \frac{d^*}{\sqrt{\bar{p}(1-\bar{p})}}\left[\frac{nr}{r+1}\right]^{1/2} - Z_{\alpha/2} = \left[\frac{n\,(d^*)^2\,r}{(r+1)\,\bar{p}(1-\bar{p})}\right]^{1/2} - Z_{\alpha/2} \qquad \text{(Eq 5.7)}$$
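Eq 5.6 and Eq 5.7 can be written as two small functions. This is a sketch under the notes' assumptions (common variance, normal approximation, two-sided α); the function names and the example inputs are hypothetical, and the proportions version assumes p1 > p0 so that d* is positive.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal

def z_beta_means(d_star, sigma, n, r, alpha=0.05):
    """Eq 5.6: n is the size of the first group, r*n the size of the second."""
    return (d_star / sigma) * sqrt(n * r / (r + 1)) - Z.inv_cdf(1 - alpha / 2)

def z_beta_props(p1, p0, n, r, alpha=0.05):
    """Eq 5.7, with d* = p1 - p0 and p_bar the weighted average of p1 and p0."""
    p_bar = (p1 + r * p0) / (1 + r)
    d_star = p1 - p0
    return sqrt(n * d_star**2 * r / ((r + 1) * p_bar * (1 - p_bar))) - Z.inv_cdf(1 - alpha / 2)

# Illustrative inputs (not taken from the notes)
print(f"means:       power = {Z.cdf(z_beta_means(d_star=5, sigma=10, n=64, r=1)):.3f}")
print(f"proportions: power = {Z.cdf(z_beta_props(p1=0.30, p0=0.20, n=200, r=1)):.3f}")
```

In both cases power is the standard normal area to the left of Zβ, so the only step after Eq 5.6 or 5.7 is a call to the normal CDF.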
Note: If we have defined d* as the risk difference (RD), then we can express RR (or OR) in terms of RD and the baseline or reference risk (p0).
For RR:
$$RR = \frac{p_1}{p_0}, \qquad p_1 = p_0\,RR, \qquad \text{so } d^* = p_0\,RR - p_0 = p_0\,(RR - 1)$$
For OR:
$$OR = \frac{p_1/(1-p_1)}{p_0/(1-p_0)}, \qquad \text{so } p_1 = \frac{p_0\,OR}{1 + p_0\,(OR-1)} \quad \text{and} \quad d^* = \frac{p_0\,OR}{1 + p_0\,(OR-1)} - p_0$$
We may have a specific OR or RR in mind and need to know the implied value of p1.
So, we have a (1) simple, and (2) unified approach for (a) sample size and (b) power calculations for (i)
RD (ii) RR, or (iii) OR, as well as for differences in means.
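The conversions from RR or OR to the implied p1 and risk difference are simple enough to sketch as helper functions. The function names and the example baseline risk are hypothetical.

```python
def p1_from_rr(p0, rr):
    """Risk in the index group implied by a relative risk and baseline risk p0."""
    return p0 * rr

def p1_from_or(p0, odds_ratio):
    """Proportion in the index group implied by an odds ratio and baseline p0."""
    return p0 * odds_ratio / (1 + p0 * (odds_ratio - 1))

# Illustrative baseline risk of 0.10 with RR = 2 and with OR = 2
p0 = 0.10
for label, p1 in (("RR=2", p1_from_rr(p0, 2.0)), ("OR=2", p1_from_or(p0, 2.0))):
    print(f"{label}: p1 = {p1:.4f}, d* = p1 - p0 = {p1 - p0:.4f}")
```

The same ratio of 2 implies a slightly smaller p1 (and so a smaller d*) on the odds scale than on the risk scale, which is why the scale must be stated before computing power.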
When we think about sample size calculations for OR and RR, we should always think of the problem and the picture in terms of the risk difference (RD), because the RD gives a better idea of the “size” of the effect for which we want to estimate power or sample size.
For example, suppose p0 = 0.01 and p1 = 0.03. Then the RR seems large (3.0), but the RD is small (0.03 − 0.01 = 0.02).
If n0 = 100 and n1 = 100, we expect a total of only 4 events (1 + 3 = 4).
Now, if p0 = 0.2 and p1 = 0.6, RR = 3.0 as before.
The reason for the increased power for this latter RR is that we expect many more events (20 + 60 = 80). The much larger number of events is reflected in the large risk difference (0.6 − 0.2 = 0.4).
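To make the contrast concrete, the sketch below applies Eq 5.7 to both scenarios with 100 subjects per group. It assumes α = 0.05 (two-sided) and r = 1; the function name is hypothetical, and the normal approximation is rough when only a handful of events is expected.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p1, p0, n, r=1.0, alpha=0.05):
    """Approximate power from Eq 5.7 (n in the first group, r*n in the second)."""
    std_normal = NormalDist()
    p_bar = (p1 + r * p0) / (1 + r)
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = sqrt(n * (p1 - p0) ** 2 * r / ((r + 1) * p_bar * (1 - p_bar))) - z_alpha
    return std_normal.cdf(z_beta)

for p0, p1 in ((0.01, 0.03), (0.2, 0.6)):
    events = 100 * p0 + 100 * p1
    print(f"RR={p1 / p0:.1f}, RD={p1 - p0:.2f}, expected events={events:.0f}, "
          f"power={power_two_proportions(p1, p0, n=100):.3f}")
```

Both scenarios have RR = 3.0, yet the first has power well below 20% and the second is close to 100%, in line with the difference in expected events.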
Example #1: Cohort design: Is smoking during pregnancy associated with an increased risk of low birth weight in offspring?
Known facts:
1. Prevalence of smoking during pregnancy is about 25%, i.e., 3 non-smokers for each smoker. So, r = 3 if we sample from the cohort using simple random sampling and follow them.
2. Incidence (overall) of low birth weight (≤ 2500 gm) is ~7%.
Suppose we have the time and dollars to study 1200 births.
Expect 1200/4 = 300 exposed to smoking during gestation (n = 300).
Suppose we want to measure the difference in risk (proportions of low birth weight babies) and we want to detect a difference of 4 percentage points (d* = 0.04). What is the power to detect this difference?
We must compute p0 and p1 from the overall incidence of LBW = 0.07, which is simply a weighted average of the risks among smokers and non-smokers:
0.07 = (0.25)(p0 + 0.04) [smokers] + (0.75)(p0) [non-smokers],
because p1 = (p0 + 0.04).
Now, solve for p0: p0 = 0.06 and p1 = 0.10, so
$$\bar{p} = \frac{p_1 + r\,p_0}{1 + r} = \frac{0.10 + 3(0.06)}{1 + 3} = 0.07, \qquad \text{where } r = \frac{\text{unexposed}}{\text{exposed}} = \frac{3}{1}$$
For α = 0.05:
$$Z_\beta = \left[\frac{n\,(d^*)^2\,r}{(r+1)\,\bar{p}(1-\bar{p})}\right]^{1/2} - Z_{\alpha/2} = \left[\frac{300\,(0.04)^2\,(3)}{(3+1)(0.07)(0.93)}\right]^{1/2} - 1.96 = 0.39$$
For Zβ = 0.39, power = 0.652. This is depicted on the normal density plot on the next page as the shaded area under the curve, cumulated from left to right, representing the cumulative normal from negative infinity to +0.39. Note that this power plot is just the same as the prior plot, except that we are now depicting power from left to right instead of from right to left (under the normal density).
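The arithmetic of Example #1 can be reproduced step by step. This sketch uses only the inputs stated in the example (25% exposure prevalence, 7% overall risk, d* = 0.04, n = 300 exposed, r = 3); the variable names are my own.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()
prev_exposed, overall_risk, d_star = 0.25, 0.07, 0.04
n_exposed, r, alpha = 300, 3, 0.05

# overall = prev*(p0 + d*) + (1 - prev)*p0, so p0 = overall - prev*d*
p0 = overall_risk - prev_exposed * d_star
p1 = p0 + d_star
p_bar = (p1 + r * p0) / (1 + r)

# Eq 5.7
z_alpha = std_normal.inv_cdf(1 - alpha / 2)
z_beta = sqrt(n_exposed * d_star**2 * r / ((r + 1) * p_bar * (1 - p_bar))) - z_alpha
power = std_normal.cdf(z_beta)
print(f"p0={p0:.2f}, p1={p1:.2f}, p_bar={p_bar:.2f}, Z_beta={z_beta:.2f}, power={power:.3f}")
```

The result agrees with the hand calculation (Zβ of about 0.39 and power of about 0.65); small differences in the last digit can arise from using an unrounded Zα/2 instead of 1.96.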
[Figure: normal density plot with the area to the left of Zβ = 0.39 shaded.]
This is the flipped-over distribution that we use to estimate power or Type II error.

What do the power calculation programs produce?

Program                            Power
Stata (sampsi)                     0.59
Stplan (arcsin transformation)     0.61
N-query advisor                    0.63
Example #2: Case-control study of smoking during pregnancy and low birth weight in offspring.
Using the same numbers as before:
Case = giving birth to a low birth weight (LBW) baby
Control = giving birth to a "normal" birth weight baby (e.g., 2501 gm)
In the case-control setting, we now redefine p0 and p1 to be not the risk of LBW, but rather the probability of exposure among the controls and the cases, respectively.
For p0, we will use the overall prevalence of smoking (exposure) in the general population of pregnant women, because cases are a small minority; i.e., p0 = proportion of controls who are exposed = 0.25 (as before).
We want to detect OR = 1.8; we can study 175 cases, and the control-to-case ratio is r = 2.
Solve for the prevalence of smoking among cases:
$$p_1 = \frac{p_0\,OR}{1 + p_0\,(OR-1)}, \qquad \text{or} \qquad p_1 = \frac{(0.25)(1.8)}{1 + (0.25)(1.8 - 1.0)} = 0.375$$
$$d^* = p_1 - p_0 = 0.375 - 0.250 = 0.125$$
$$\bar{p} = \frac{p_1 + r\,p_0}{1 + r} = \frac{0.375 + 2(0.25)}{1 + 2} = 0.292$$
$$Z_\beta = \left[\frac{(175)(0.125)^2(2)}{(2+1)(0.292)(0.708)}\right]^{1/2} - 1.96 = 1.01$$
Power = 0.844 (84.4%) to detect OR = 1.8. This result means that the two distributions, one for Ho: OR = 1.0 and the other for Ha: OR = 1.8, do not overlap very much.
(This answer is very close to what the program PS gives: power = 0.844 for 179 cases.)
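Example #2 can be reproduced the same way, starting from the odds ratio. The sketch uses the stated inputs (p0 = 0.25, OR = 1.8, 175 cases, r = 2); the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def case_control_power(p0, odds_ratio, n_cases, r, alpha=0.05):
    """Power to detect `odds_ratio` via Eq 5.7, where p0 and p1 are the
    exposure proportions among controls and cases."""
    std_normal = NormalDist()
    p1 = p0 * odds_ratio / (1 + p0 * (odds_ratio - 1))
    p_bar = (p1 + r * p0) / (1 + r)
    d_star = p1 - p0
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = sqrt(n_cases * d_star**2 * r / ((r + 1) * p_bar * (1 - p_bar))) - z_alpha
    return p1, z_beta, std_normal.cdf(z_beta)

p1, z_beta, power = case_control_power(p0=0.25, odds_ratio=1.8, n_cases=175, r=2)
print(f"p1={p1:.3f}, Z_beta={z_beta:.2f}, power={power:.3f}")
```

This returns p1 = 0.375, Zβ of about 1.01, and power of about 0.84, matching the hand calculation.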
Notes
1) n = 175 cases, so total sample size = 175 + 350 = 525, with power = 0.84.
2) In the cohort study, we had p1 = 0.10 and p0 = 0.06, which gives OR = 1.74. The cohort study needed 1200 births for a power of about 0.6. Why the difference between the case-control and cohort study calculations? How many events did we stipulate for the case-control study? (175.) How many events would we expect from the cohort study?

Exposed:      300 × 0.10 = 30
Unexposed:    900 × 0.06 = 54
Total:                   = 84

For a cohort study with power = 0.84, we need to enroll even more than 1200 patients: 519 exposed + 1557 unexposed = 2076 total. This would produce the following events:

Exposed:      519 × 0.10 = 52
Unexposed:    1557 × 0.06 = 93
Total:                    = 145
3) Everything is re-expressed as a difference in proportions (or means).
4) We need to know:
a) Exposure prevalence in the population (for a case-control or cohort study)
b) Disease risk in the population (for a cohort study)
c) Desired "effect size" ("clinically important" difference)
d) Minor notational and other differences may be found in different texts, e.g.,
$$\frac{2\,\bar{p}(1-\bar{p})}{n} \quad \text{replaced by} \quad \frac{p_1(1-p_1)}{n} + \frac{p_2(1-p_2)}{n}$$
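To obtain the sample size, solve for n using Eq (5.6) for means and Eq (5.7) for proportions.
For means:
$$\left(Z_\beta + Z_{\alpha/2}\right)^2 = \frac{(d^*)^2}{\sigma^2}\cdot\frac{nr}{r+1}, \qquad \text{then} \qquad n = \frac{\left(Z_{\alpha/2} + Z_\beta\right)^2 \sigma^2\,(r+1)}{(d^*)^2\,r} \qquad \text{(Eq 5.8)}$$
For proportions:
$$\left(Z_\beta + Z_{\alpha/2}\right)^2 = \frac{n\,(d^*)^2\,r}{(r+1)\,\bar{p}(1-\bar{p})}, \qquad \text{then} \qquad n = \frac{\left(Z_{\alpha/2} + Z_\beta\right)^2\,\bar{p}(1-\bar{p})\,(r+1)}{(d^*)^2\,r} \qquad \text{(Eq 5.9)}$$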
Tables for common values of key parameters (Kelsey et al., 1996, Table 12-16, p. 333):
Values of (Zα/2 + Zβ)² for frequently used combinations of significance level and power

Significance level (α)    Power (1 − β)    (Zα/2 + Zβ)²
0.01                      0.80             11.679
                          0.90             14.879
                          0.95             17.814
                          0.99             24.031
0.05                      0.80             7.849
                          0.90             10.507
                          0.95             12.995
                          0.99             18.372
0.10                      0.80             6.183
                          0.90             8.564
                          0.95             10.822
                          0.99             15.770

So, 7.85 and 10.5 are the key values to remember.
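The tabulated values need not be memorized or looked up; they follow from standard normal quantiles. A minimal sketch, with a hypothetical function name:

```python
from statistics import NormalDist

def z_sum_squared(alpha, power):
    """(Z_{alpha/2} + Z_beta)^2 for a two-sided alpha and a given power."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2

for alpha in (0.01, 0.05, 0.10):
    for power in (0.80, 0.90, 0.95, 0.99):
        print(f"alpha={alpha:.2f}  power={power:.2f}  {z_sum_squared(alpha, power):.3f}")
```

Here Zβ is taken as the upper quantile for the desired power (for example, +0.84 for 80% power), which is the sign convention used in the table of key numbers.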
Example #3: Case-control study of smoking and low birth weight
We want OR = 1.8 to be detectable with power = 90% and α = 0.05.
Recall from the prior example: p̄ = 0.292, (1 − p̄) = 0.708, d* = 0.125, r = 2.
Thus:
$$n = \frac{(10.507)(0.29166)(0.70834)(3)}{(0.125)^2\,(2)} = 208.4 \cong 209$$
Then n = 209 cases + 418 controls = 627 total. [Remember that 175 cases gave 84% power.]
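Eq 5.9 for this example can be sketched as a function that returns the number of cases, rounded up to the next whole subject. The function name is hypothetical; the inputs are those of Example #3.

```python
from math import ceil
from statistics import NormalDist

def n_for_proportions(p1, p0, r, power, alpha=0.05):
    """Eq 5.9: size of the first group (cases, or exposed) needed to detect p1 - p0."""
    z = NormalDist()
    k = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2
    p_bar = (p1 + r * p0) / (1 + r)
    return k * p_bar * (1 - p_bar) * (r + 1) / ((p1 - p0) ** 2 * r)

n_cases = n_for_proportions(p1=0.375, p0=0.25, r=2, power=0.90)
print(f"n = {n_cases:.1f} -> {ceil(n_cases)} cases and {2 * ceil(n_cases)} controls")
```

Rounding up (rather than to the nearest integer) keeps the achieved power at or above the target.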
In summary, we can use a few formulae to estimate power, or sample size, for simple study designs, and
these formulae form the basis for the computer-based sample size programs that we all use.
§3.3 Special Concerns in Power (Sample Size) Calculations
§3.3.1 Measurement Error: effect on power (Refs: Armstrong et al 1992; Kelsey et al 1996, ch 13)
Where errors can occur:
Exposure variables (most common worry)
Disease (outcome) classification
Confounding factor or covariates
Effect of nondifferential error (misclassification or measurement error): commonly (although not always) biases or attenuates the measure (effect size) towards the null.
Effects of nondifferential error in exposure on sample sizes (in simple cases):
Observed effect size is smaller than the true effect size, i.e., it takes a larger sample to demonstrate an effect for a given true effect (the observed effect will be closer to the null): effect of bias.
Confidence intervals for corrected measures of effect size are wider than if exposure were measured without error: effect of variance.
Effects of nondifferential error in confounders: effect size can be biased in either direction.
Remedies for measurement error in planning studies:
Estimate measurement error from pre-existing data
Use tables on attenuation bias (Kelsey, Armstrong)
If error is not known, plan a validation substudy (complex)
Plan on multiple measurements of subjects
To estimate the impact of nondifferential error, estimate the sensitivity and specificity of the observed exposure:

                         True exposure
                         +        −
Observed exposure   +    a        b
                    −    c        d

Then
Sn = Pr(O+ | True+) = a/(a+c)
Sp = Pr(O− | True−) = d/(b+d)
Prevalence of exposure = (a+c)/(a+b+c+d)
Sensitivity   Specificity   Exposure prevalence   Observed OR
0.6           0.6           0.01                  1.01
                            0.5                   1.14
0.6           0.9           0.01                  1.05
                            0.5                   1.42
0.6           0.99          0.01                  1.37
                            0.5                   1.54
0.9           0.6           0.01                  1.02
                            0.5                   1.48
0.9           0.9           0.01                  1.08
                            0.5                   1.72
0.9           0.99          0.01                  1.47
                            0.5                   1.82
0.99          0.6           0.01                  1.02
                            0.5                   1.68
0.99          0.9           0.01                  1.09
                            0.5                   1.89
0.99          0.99          0.01                  1.50
                            0.5                   1.97

Attenuated values of the odds ratio resulting from the effects of nondifferential error in measuring exposure. Classification in terms of disease status is assumed to be error free. True OR = 2.0.
Ref: Kelsey et al., 1996, page 350.
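The attenuation can be sketched from sensitivity and specificity alone. The code below assumes (this is my reading, not stated in the notes) that "exposure prevalence" is the true prevalence among non-diseased subjects, that the true OR fixes the true prevalence among the diseased, and that each group's observed prevalence is Sn·p + (1 − Sp)(1 − p). Function names are hypothetical.

```python
def observed_prevalence(p, sn, sp):
    """Apparent exposure prevalence: true positives plus false positives."""
    return sn * p + (1 - sp) * (1 - p)

def odds(p):
    return p / (1 - p)

def observed_or(true_or, p0, sn, sp):
    """Odds ratio seen after nondifferential exposure misclassification."""
    p1 = p0 * true_or / (1 + p0 * (true_or - 1))   # true prevalence among the diseased
    return odds(observed_prevalence(p1, sn, sp)) / odds(observed_prevalence(p0, sn, sp))

for sn, sp in ((0.6, 0.6), (0.9, 0.9), (0.99, 0.99)):
    for prev in (0.01, 0.5):
        print(f"Sn={sn}, Sp={sp}, prevalence={prev}: observed OR = "
              f"{observed_or(2.0, prev, sn, sp):.2f}")
```

Under these assumptions the sketch agrees with the tabulated values to within 0.01. It also shows why specificity matters most when exposure is rare: with prevalence 0.01, false positives among the many unexposed swamp the true positives.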
§3.3.2 Those selected/invited vs. those who agree to participate
If 500 participants are needed and 80% of those invited are expected to enroll: (0.8)(x) = 500, so x = 500/0.8 = 625 must be invited.
We must invite enough patients so that, after the refusers drop out, there will still be enough patients.
§3.3.3. Censoring
Loss to follow-up
Other causes of death than the cause of interest
Many assumptions involved in the calculations for such studies
§3.3.4 The scale on which we estimate effect size -- Important to understand
The scale can make an apparent (although not real) difference in the sample size or power calculation.
Example. Suppose you estimate 80% power to detect OR=3.0, and baseline/reference risk = 0.2.
This seems like a study that does not have much power to detect a relative effect of exposure.
But this OR = 3.0 corresponds to a RR = 2.14. Now, the study seems more powerful.
But it is not! The only difference is that you are expressing your effect size in terms of RR !
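The OR-to-RR conversion in this example takes one line once p1 is known. A minimal sketch with a hypothetical function name:

```python
def or_to_rr(odds_ratio, p0):
    """Relative risk implied by an odds ratio at baseline (reference) risk p0."""
    p1 = p0 * odds_ratio / (1 + p0 * (odds_ratio - 1))
    return p1 / p0

print(f"OR = 3.0 at baseline risk 0.2 corresponds to RR = {or_to_rr(3.0, 0.2):.2f}")
```

The underlying p0, p1, and risk difference are identical on both scales, so the power is identical too; only the label on the effect size changes.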
§3.3.5 How many controls per case
What should the value of r be?
r = ratio of controls/cases or unexposed/exposed
In practice, the number of cases in a case-control study is often the total number available, so we cannot get any more. Then we can increase power by increasing r (i.e., taking more controls), BUT precision improves very little beyond r = 3 or 4.
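The diminishing return from extra controls is visible in the factor (r + 1)/r that multiplies σ²/n in the variance of d*. A minimal sketch with a hypothetical function name; for proportions p̄ also shifts with r, so this isolates only the sample-size effect.

```python
from math import sqrt

def variance_factor(r):
    """(r + 1) / r: the multiplier of sigma^2 / n in var(d*) from the derivation of Eq 5.6."""
    return (r + 1) / r

for r in (1, 2, 3, 4, 5, 10):
    rel_se = sqrt(variance_factor(r) / variance_factor(1))
    print(f"r={r:>2}: variance factor={variance_factor(r):.3f}, "
          f"se relative to r=1: {rel_se:.3f}")
```

The factor falls from 2 at r = 1 to 1.25 at r = 4 and can never drop below 1, so most of the attainable gain is already captured by three or four controls per case.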
§3.4 The Fallacy of the Post-hoc Power Calculation (see Goodman & Berlin, 1994)
Suppose σ = 10 (σ² = 100) and n = 50 subjects per group.
We have done a study comparing the effects of two drugs on a continuous outcome measure with the above variance.
The result of the study is that the difference between the means of the two groups is 4 units. (The two groups are independent.)
The variance of each group mean is σ²/n = 100/50, and var(A − B) = var(A) + var(B).
We do a Z-test (known variance):
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{100}{50} + \frac{100}{50}}} = \frac{4}{\sqrt{4}} = 2$$
So the test (either Z or t) would barely reject H0 of no difference in means at the α = 0.05 level.
But does this result mean that power was low or high?
Now, suppose that the planned detectable difference was 6.0, with 80% power and α = 0.05.
But after the experiment, we observe a difference of 4.0, with CI = 0 to 8.
This result means that you happened to observe an effect size in the sample that is lower than the effect size in the hypothesized population.
What was the power to detect a difference of 4 units given N = 50 per group (i.e., r = 1) and σ = 10?
$$z_\beta = \frac{d^*}{\sigma}\sqrt{\frac{nr}{r+1}} - 1.96 = \frac{4}{10}\sqrt{\frac{50 \cdot 1}{2}} - 1.96 = (0.4)\sqrt{25} - 1.96 = 0.04$$
So power is about 0.5 (0.52).
So if the power was so low, how did we detect a difference?
This is a meaningless question: the d* in the formula relates to the hypothetical mean of an alternative distribution, not to an observed event. Power computed at the observed difference will always be (1 − β) < 0.5 if the finding is “not significant”.
In other words, if z < +1.96, power is < 0.5.
So if we observe an "NS" finding, such a calculation will always say that the study was underpowered.
But we do not know what we will find until after the experiment.
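The circularity can be seen by plugging several possible observed differences into Eq 5.6 as if each were d*. A sketch using the numbers above (σ = 10, n = 50 per group, r = 1, Zα/2 = 1.96); the function name and the list of differences are illustrative.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()
sigma, n, r = 10.0, 50, 1

def power_at(d_star, z_alpha=1.96):
    """Eq 5.6 evaluated at a hypothesized difference d_star."""
    return std_normal.cdf((d_star / sigma) * sqrt(n * r / (r + 1)) - z_alpha)

se_diff = sqrt(sigma**2 / n + sigma**2 / n)   # se of the difference in means
for d_obs in (2.0, 3.92, 4.0, 6.0):
    z_obs = d_obs / se_diff
    print(f"difference={d_obs}: z={z_obs:.2f}, "
          f"'power' at that difference={power_at(d_obs):.3f}")
```

The "power" computed at the observed difference is a one-to-one function of the observed z: it is exactly 0.5 when z = 1.96 and below 0.5 whenever the result is not significant, so it adds nothing to the p-value.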
In short, d* − d = d*, and dOBS − d = dOBS (we can set d = 0).
There is no place for power after we observe dOBS.
We must always distinguish:
(1) The hypothesized true (but unobserved) population
(2) The actual observed sample from that population
Each sample from the true population will differ somewhat and will have a different estimated effect size.
If you hypothesized a large difference and you found only a small difference, then you are “out of luck”. Too bad. Your p-value will likely be > 0.05.
Ref: Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis.
American Statistician. 2001;55(1):19-24.
§3.5 Sample Sizes for Confidence Intervals
Sample size for a single mean or proportion
L = margin of error within which you want to estimate the mean or proportion (1/2 × width of CI)
Then for the mean:
$$L = z\,\frac{\sigma}{\sqrt{n}} = 1.96 \cdot \mathrm{se}, \qquad L^2 = \frac{z^2\sigma^2}{n}, \qquad \text{so} \qquad n = \frac{z^2\sigma^2}{L^2} \qquad \text{(Eq 5.10)}$$
For a proportion (e.g., sensitivity or specificity) we substitute p(1 − p)/n for σ²/n. Then,
$$n = \frac{z^2\,p(1-p)}{L^2} \qquad \text{(Eq 5.11)}$$
where z is the standard normal value (2-sided) corresponding to the desired proportion of the time that the estimate is to be within the desired margin of error.
Example: Suppose you want to estimate the proportion of people with high cholesterol (> 200 mg/dl) to within 4 percentage points. You guess that the proportion will be around 40%. Then:
$$n = \frac{(1.96)^2(0.4)(0.6)}{(0.04)^2} \approx 576$$
Be sure to understand the definition of “L”.
L is ½ the width of the confidence interval. Other texts use L = width of the entire interval.
With this n, there is a 95% probability (before doing the study) that the estimate obtained will be within 4 percentage points of the population value. This calculation does not address the situation in which you want to "rule out" a true value above (or below) a particular hypothesized value (see later).
"Worst case" for proportions, when you have no idea what p will be: use 0.5, because the variance is maximized (0.25) when p = 0.5. For any other value of p, the variance is lower.
For the example:
$$n = \frac{(1.96)^2(0.5)(0.5)}{(0.04)^2} = 601 \quad \text{(not much bigger)}$$
Suppose you wanted ± 3%?
Then let p = 0.5 be the proportion used for the sample size calculation: n = 1068 (much bigger).
This is how the pollsters give you their ± numbers and compute n; i.e., for p = 0.5 and L = ± 0.03, the 95% CI is 0.5 (0.47, 0.53).
Suppose you think p will be around 0.001 and you want ± 0.0005. This is a small proportion!
$$n = \frac{(1.96)^2(0.001)(0.999)}{(0.0005)^2} = 15{,}352, \text{ a big study.}$$
(Cancer rates, etc. are this low.)
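Eq 5.11 applied to the four examples above, as a minimal sketch. The function name is hypothetical, and z is fixed at 1.96 to match the hand calculations.

```python
def n_for_margin(p, margin, z=1.96):
    """Eq 5.11: sample size so that the CI half-width (L) equals `margin`."""
    return z**2 * p * (1 - p) / margin**2

for p, margin in ((0.4, 0.04), (0.5, 0.04), (0.5, 0.03), (0.001, 0.0005)):
    print(f"p={p}, L={margin}: n = {n_for_margin(p, margin):.1f}")
```

The function returns the unrounded value; in practice one rounds up to the next whole subject. Note that halving L quadruples n, since L enters the formula squared.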
But these calculations on the width of the confidence interval fail to consider the uncertainty of the observed point estimate (e.g., the estimated OR), even when the true OR is fixed. They assume you will be satisfied with this confidence interval wherever it is centered.
The following examples show how that assumption might not hold.
The same confidence interval question might be answered differently.
Question #1
Suppose you want to ensure that your estimate of sensitivity (Sn) will have a (2-sided) confidence interval of ±5 percentage points. Assume you think that you will observe Sn = 0.9. How many subjects with disease do you need to produce a confidence interval of (0.85 to 0.95)? We have just seen this calculation:
$$n = \frac{z^2\,p(1-p)}{L^2}$$
If z = 1.96, p = 0.9, and L = 0.05, then n = 3.84 × 0.9 × 0.1 / 0.0025 = 138.
Question #2
Suppose you want to ensure that, whatever Sn you observe after your experiment, you can eliminate a true Sn < 0.85 by means of a 95% confidence interval around the estimate. How many subjects with disease do you need to ensure with 80% power that the lower confidence bound is at least 0.85? This second question is different. It can be viewed as a hypothesis test. Again, assume true Sn = 0.9.
How can we calculate this?
Here is the Stata code and output for that question:

    . sampsi .9 .85, power(0.8) onesample

    Estimated sample size for one-sample comparison of proportion
      to hypothesized value

    Test Ho: p = 0.9000, where p is the proportion in the population

    Assumptions:
             alpha =   0.0500  (two-sided)
             power =   0.8000
     alternative p =   0.8500

    Estimated required sample size:
                 n =      316
This number is much larger. For question #1, you assume that you will observe Sn = 0.90. All you
want to know is how wide will the resulting CI be. But for question #2 you are assuming only that the
true Sn =0.9, and that the observed Sn might vary randomly around the true value. So, your observed
Sn might be smaller than 0.9! You must build in extra power so that whatever you observe, your lower
bound of the confidence interval will be at least 0.85.
(Simulations confirm this second result.)
Correspondence between these two different questions:
If in STATA one sets the alternative hypothesis (Ha) at the end of the confidence interval, and one
stipulates the power=0.5, then the sample size is the same as for question #1, i.e., n=138.
Question #3:
Suppose you want to ensure that whatever observed Sn you find after your experiment, that you can
eliminate a true Sn <0.85 and show a p<0.05. How many subjects do you need with disease to
ensure with 80% power that the lower confidence bound of a one-sided 95% confidence interval is
at least 0.85? Again, assume true Sn = 0.9
This amounts to a one-sided, one-sample test:

    . sampsi .9 .85, power(0.8) onesample onesided

    Estimated sample size for one-sample comparison of proportion
      to hypothesized value

    Test Ho: p = 0.9000, where p is the proportion in the population

    Assumptions:
             alpha =   0.0500  (one-sided)
             power =   0.8000
     alternative p =   0.8500

    Estimated required sample size:
                 n =      253
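The Stata results above can be approximated by hand with the usual normal-approximation formula for a one-sample test of a proportion, n = [(Zα·√(p0(1−p0)) + Zβ·√(pa(1−pa))) / (pa − p0)]². Whether sampsi uses exactly this formula internally is an assumption on my part; the Python function below is a hypothetical sketch, not Stata's implementation.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_n(p_null, p_alt, power=0.80, alpha=0.05, one_sided=False):
    """Normal-approximation sample size for testing Ho: p = p_null against p_alt."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha) if one_sided else z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    numerator = z_alpha * sqrt(p_null * (1 - p_null)) + z_beta * sqrt(p_alt * (1 - p_alt))
    return (numerator / (p_alt - p_null)) ** 2

print(f"two-sided, 80% power: {one_sample_n(0.9, 0.85):.1f}")
print(f"one-sided, 80% power: {one_sample_n(0.9, 0.85, one_sided=True):.1f}")
print(f"two-sided, 50% power: {one_sample_n(0.9, 0.85, power=0.5):.1f}")
```

Rounded up, the first two values give the 316 and 253 reported by Stata. With power = 0.5 the Zβ term vanishes and the formula reduces to the confidence-interval calculation of Question #1 (about 138).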
Question #4: Fourth type of confidence interval problem: predicted CI when planning experiments
(Reference Goodman SN, Berlin JA. The use of predicted confidence intervals when planning
experiments and the misuse of power when interpreting results. Ann Intern Med. 1994; 121:200-206.)
Problem:
Evaluating a medical treatment: 45% cure rate.
Proposed surgical alternative must have higher cure rate: 70%+
(to offset higher risk of surgical morbidity)
Difference = .70 - .45 = 0.25 (25% pts)
Question: If we design a study with 90% power to detect a difference of this size (or larger), what is the predicted confidence interval going to be?
It will not be ±25 percentage points. It will be narrower.
Assume α=0.05, two sided
(1) Step 1: Compute the sample sizes n1 and n2 for each group needed to achieve 90% power to detect a
difference of 0.25:
$$ n = \frac{(z_{\alpha/2} + z_{\beta})^2 \, \bar{p}(1-\bar{p})(r+1)}{(d^*)^2 \, r} = \frac{10.507 \times 0.575 \times 0.425 \times 2}{0.25 \times 0.25} = 82 \quad \text{(from Eq. 5.9)} $$
(2) Step 2: Compute predicted confidence interval
Predicted 95% CI = observed difference ± 0.6 × Δ0.90,
where Δ0.90 = the true difference for which there is 90% power = 0.25.
Predicted CI = observed difference ± 0.6 × 0.25 = observed difference ± 0.15.
So, the predicted CI for this problem will be 0.30 wide.
Thus, if the observed difference is 0.15, the lower bound of the CI will be just 0.0.
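The two steps can be retraced numerically. This is a minimal sketch using the cure rates from the problem statement; the variable names are illustrative only, and unrounded z values are used, so the sample size comes out slightly above 82.

```python
from statistics import NormalDist

z = NormalDist().inv_cdf

alpha, power = 0.05, 0.90
p_medical, p_surgical = 0.45, 0.70       # cure rates from the problem statement
d_star = p_surgical - p_medical          # difference to detect, 0.25
p_bar = (p_medical + p_surgical) / 2     # average proportion, 0.575
r = 1                                    # equal group sizes

# Step 1: per-group sample size, same form as the Eq. 5.9 calculation above
z_sum_sq = (z(1 - alpha / 2) + z(power)) ** 2
n_per_group = z_sum_sq * p_bar * (1 - p_bar) * (r + 1) / (d_star ** 2 * r)

# Step 2: predicted 95% CI half-width under the 0.6 * Delta_0.90 rule
half_width = 0.6 * d_star

print(f"n per group ~ {n_per_group:.1f}")
print(f"predicted CI: observed difference +/- {half_width:.2f}")
```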
The same approach holds when using the alternative formula: ± 0.7 × Δ0.80.
In that case, given the same set of facts, if there is 80% power to demonstrate a risk difference of 0.25,
then one would expect the confidence interval to be wider: ± 0.7 × 0.25, or ± 0.175.
(See Goodman and Berlin 1994 for derivation)
Why is this so: for a fixed difference to be detected, higher power (50%, 80%, 90%) requires a larger
sample, so the resulting confidence interval gets narrower. So,
(a) Compute sample size to detect a risk difference at a power level
(b) Use the simple formula to predict the confidence interval.
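One way to see where the multipliers 0.6 and 0.7 come from, under the usual normal approximation (a sketch, not a quotation of Goodman and Berlin's derivation): the detectable difference is (zα/2 + zβ) standard errors, while the CI half-width is zα/2 standard errors, so their ratio is zα/2 / (zα/2 + zβ). The function name below is hypothetical.

```python
from statistics import NormalDist


def predicted_ci_factor(power, alpha=0.05):
    """Ratio of CI half-width to the difference detectable with the given power."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    return z_alpha / (z_alpha + z(power))


# Multipliers for 50%, 80% and 90% power, rounded to two decimals
factors = {pw: round(predicted_ci_factor(pw), 2) for pw in (0.5, 0.8, 0.9)}
print(factors)
```

At 50% power the factor is 1, so the predicted half-width equals the detectable difference itself; it shrinks to about 0.7 and 0.6 of that difference at 80% and 90% power.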
Reference: Goodman SN. Berlin JA. The use of predicted confidence intervals when planning experiments and the
misuse of power when interpreting results. Annals of Internal Medicine. 1994; 121:200-6.
§3.6 Relative size of (a) standard deviation and (b) desired effect size on power and sample sizes:
Suppose, in any of these situations, you have no idea what σ² will be. There is an easy way to estimate
power or sample size without stipulating the standard deviation in advance.
e.g., comparing 2 means:
$$ n = \frac{(Z_{\alpha/2} + Z_{\beta})^2 \, \sigma^2 \, (r+1)}{(d^*)^2 \, r}, \quad \text{from Eq. 5.8.} $$
We can always say that we would like to detect a difference of, say, one (or 0.5, or whatever)
standard deviations,
e.g., d* = σ, or d* = 1 sd.
Thus, for d* = σ and r = 1 (for example):
$$ n = \frac{2 (Z_{\alpha/2} + Z_{\beta})^2 \sigma^2}{\sigma^2} = 2 (Z_{\alpha/2} + Z_{\beta})^2 = 2 \times 7.85 = 15.7 \quad \text{Eq (5.12)} $$
For d* = 0.5σ and r = 1:
$$ n = \frac{2 (Z_{\alpha/2} + Z_{\beta})^2 \sigma^2}{(0.5\sigma)^2} = \frac{2 (Z_{\alpha/2} + Z_{\beta})^2}{0.25} = 8 (Z_{\alpha/2} + Z_{\beta})^2 = 8 \times 7.85 = 62.8 $$
(The sample size gets big quickly)
Note, sample size depends only on the ratio σ²/(d*)², i.e., on the sd relative to the desired difference to be
demonstrated. This idea is often invoked in planning experiments. But one should have an idea of what
the standard deviation might be in the population.
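As a sketch of this idea, the function below (a hypothetical helper, not part of any package) takes the difference to detect in units of σ, so that σ cancels out of Eq. 5.8. It assumes α = 0.05 two-sided and 80% power, which is what the value 7.85 ≈ (1.96 + 0.84)² used above corresponds to.

```python
from statistics import NormalDist


def n_per_group_sd_units(k, alpha=0.05, power=0.80, r=1):
    """Per-group n from Eq. 5.8 when d* = k * sigma (sigma cancels)."""
    z = NormalDist().inv_cdf
    z_sum_sq = (z(1 - alpha / 2) + z(power)) ** 2   # about 7.85
    return z_sum_sq * (r + 1) / (k ** 2 * r)


for k in (1.0, 0.5, 0.25):
    print(f"d* = {k} sd -> n per group ~ {n_per_group_sd_units(k):.1f}")
```

Halving the detectable difference (in sd units) quadruples the required sample size.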
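Formulae differ according to textbook and sample size programs:
The formula above for comparing groups is approximate (but is used in many texts). A "more
exact" form is [(Fleiss, p. 41) for one control per case]:
$$ n = \frac{\left( z_{\alpha/2}\sqrt{2\bar{p}\bar{q}} - z_{1-\beta}\sqrt{p_0 q_0 + p_1 q_1} \right)^2}{(p_0 - p_1)^2} \quad \text{(per group)}, \qquad \bar{p} = \frac{p_1 + p_0}{2} \quad \text{(remember } r = 1\text{)} $$
Here z_{1−β} denotes the normal deviate cutting off an upper-tail area of 1 − β, so z_{1−β} = −z_β and the
minus sign agrees with formulae written with + z_β.
Tables are also available for common combinations of p and power.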
Fleiss JL. Statistical Methods for Rates and Proportions, 2nd Edition. New York: John Wiley & Sons,
Inc.; 1981: 262.
Always note, however, when using formulae from texts, that each author may define the terms
differently and therefore present slightly different formulae.
For example, the Schlesselman formula (pg. 145):
$$ n = \frac{\left( Z_{\alpha}\sqrt{2\bar{p}\bar{q}} + Z_{\beta}\sqrt{p_1 q_1 + p_0 q_0} \right)^2}{(p_1 - p_0)^2} $$
Note: This calls Z_α = +1.96, where we have used z_{α/2}.
Where:
$$ p_1 = \frac{p_0 \, OR}{1 + p_0 (OR - 1)}, \quad \bar{p} = \tfrac{1}{2}(p_1 + p_0), \quad q_1 = 1 - p_1, \quad \bar{q} = 1 - \bar{p}, \quad q_0 = 1 - p_0 $$
OR = the odds ratio
A formula that is simpler than the ones above (for r = 1, i.e., two equal sized groups) and for practical
purposes equivalent, is given by:
$$ n = \frac{2\bar{p}\bar{q}\,(Z_{\alpha} + Z_{\beta})^2}{(p_1 - p_0)^2} $$
Corresponding to α = .05 (two-sided) and β = .10, one has Zα = 1.96 and Zβ = 1.28, so that the equation
reduces to a particularly simple formula:
$$ n = \frac{21 \, \bar{p}\bar{q}}{(p_1 - p_0)^2}. $$
We have power and sample size programs to do much of this work.
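To get a feel for how close the simple formula is to the fuller one, here is a minimal sketch comparing the two for a single pair of proportions (0.45 and 0.70, chosen only for illustration). The function names are hypothetical; Zα = 1.96 and Zβ = 1.28 as in the text.

```python
from math import sqrt

Z_ALPHA, Z_BETA = 1.96, 1.28   # two-sided alpha = 0.05, power = 0.90


def n_full(p0, p1):
    """Schlesselman-style formula with separate null and alternative variances."""
    p_bar = (p0 + p1) / 2
    q_bar = 1 - p_bar
    num = (Z_ALPHA * sqrt(2 * p_bar * q_bar)
           + Z_BETA * sqrt(p1 * (1 - p1) + p0 * (1 - p0)))
    return (num / (p1 - p0)) ** 2


def n_simple(p0, p1):
    """Simpler formula: 2*p_bar*q_bar*(Z_alpha + Z_beta)^2 / (p1 - p0)^2."""
    p_bar = (p0 + p1) / 2
    return 2 * p_bar * (1 - p_bar) * (Z_ALPHA + Z_BETA) ** 2 / (p1 - p0) ** 2


full = n_full(0.45, 0.70)
simple = n_simple(0.45, 0.70)
print(f"full formula:   n ~ {full:.1f} per group")
print(f"simple formula: n ~ {simple:.1f} per group")
```

For this example the two answers differ by about two subjects per group, which is negligible next to the uncertainty in the assumed proportions.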
Required sample sizes are huge when both the effect to be shown and the baseline risk are small. From: Case-Control
Studies: Design, Conduct, Analysis by James J. Schlesselman, New York, 1982, Oxford University Press, Appendix A
Alpha = 0.05 (two sided). Rows: p0 (exposure prevalence among controls). Columns: odds ratio (OR).

p0      OR=0.1   OR=0.5   OR=0.9   OR=1.1   OR=1.5   OR=2.0   OR=3.0   OR=6.0
0.01      1420     6323   201260   222890    10649     3206     1074      304
0.02       707     3174   101552   112691     5402     1632      550      158
0.04       351     1600    51726    57631     2781      846      288       85
0.05       279     1286    41773    46635     2258      689      236       70
0.10       137      658    21933    24732     1217      378      133       42
0.20        66      347    12209    14046      714      229       85       37
0.30        42      248     9206    10805      568      105       73       34
0.35        36      221     8453    10022      535      102       71       34

Consider the joint effects of (a) a small baseline risk and (b) a small true OR. If the baseline risk
is small, and the OR is not very different from 1.0, then the risk difference is small, i.e., d* in our
notation is small, and then (d*)² in the denominator of our power calculation formula is very small.
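The effect of a small d* can be illustrated with a short sketch that converts (p0, OR) into p1 and then applies the Schlesselman formula. The function names are hypothetical, and Zβ = 1.28 (power 0.90) is an assumption, adopted because it reproduces table entries such as 3206 (p0 = 0.01, OR = 2.0) and 378 (p0 = 0.10, OR = 2.0).

```python
from math import sqrt


def p1_from_or(p0, odds_ratio):
    """Exposure prevalence in cases implied by p0 and the odds ratio."""
    return p0 * odds_ratio / (1 + p0 * (odds_ratio - 1))


def n_per_group(p0, odds_ratio, z_alpha=1.96, z_beta=1.28):
    """Schlesselman-style n per group; z_beta = 1.28 is an assumed power of 0.90."""
    p1 = p1_from_or(p0, odds_ratio)
    p_bar = (p0 + p1) / 2
    q_bar = 1 - p_bar
    num = (z_alpha * sqrt(2 * p_bar * q_bar)
           + z_beta * sqrt(p1 * (1 - p1) + p0 * (1 - p0)))
    return (num / (p1 - p0)) ** 2


# Risk difference d* and required n for a few (p0, OR) combinations
for p0, odds_ratio in [(0.01, 1.1), (0.01, 2.0), (0.10, 2.0)]:
    d_star = p1_from_or(p0, odds_ratio) - p0
    n = round(n_per_group(p0, odds_ratio))
    print(f"p0={p0:.2f} OR={odds_ratio}: d*={d_star:.4f}, n ~ {n}")
```

With p0 = 0.01 and OR = 1.1 the risk difference is below 0.001, which is why the corresponding table entry runs to six digits.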
§3.7 Sample Sizes for Matched Studies
§3.7.1 Frequency matching – as in stratified design
First we consider frequency matching. The formulae for stratified (frequency matched) and
individually matched studies are in Schlesselman (p 159).
Recall, our estimates of the MH OR are based on weighted estimates of stratum-specific ORs. This is the
corresponding method of arriving at a sample size for a stratified design.
This is a way to incorporate strata, or a confounding factor, into the estimation of power or sample
size.
One must specify the following parameters, assuming there are J strata.
1. p0j = exposure prevalence in controls in the jth stratum
2. fj = fraction of the total observations in stratum j, where $\sum_j f_j = 1.0$
3. Type I error
4. Power
5. Assumed true effect size (RR=OR, in this case)
Assume: Equal number of cases and controls in each stratum
Constant RR (OR) across strata (no effect modification)
Required total number of "cases":
$$ n = \frac{(Z_{\alpha/2} + Z_{\beta})^2}{\sum_j f_j g_j} \quad \text{Eq (5.13)} $$
where (using q = 1 – p):
$$ g_j = \frac{(\ln(OR))^2}{\dfrac{1}{p_{1j} q_{1j}} + \dfrac{1}{p_{0j} q_{0j}}}, \qquad p_{1j} = \frac{p_{0j}(OR)}{1 + p_{0j}(OR - 1)} $$
The formula is essentially a weighted sum of (d * ) 2 and var(d*) from our general sample size/power
formula, where ln(OR) here is the equivalent of d*.
Example: Oral contraceptive use and myocardial infarction (Schlesselman, Table 6.5, p 160)
Hypothesized Effect Size R = 3, α = 0.05, β = 0.10 (power=0.9)

Age      fj     p0j    p1j    gj      fj gj
25-29    .03    .22    .46    .122    .0037
30-34    .09    .08    .21    .062    .0056
35-39    .16    .07    .18    .055    .0088
40-44    .30    .02    .06    .018    .0054
45-49    .42    .02    .06    .018    .0076
Total    1.00 (by definition)         .0311 = Σ fj gj

fj = .42 is where we have the most cases (age category 45-49)
p0j = .22 exposure prevalence - where most exposure is.
Then, the required
$$ n = \frac{(1.96 + 1.28)^2}{0.0311} = 328 \text{ cases}. $$
One must then calculate the total sample size.
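The mechanics of the stratified calculation can be sketched as follows, recomputing g_j from the rounded f_j and p0j in the table. The helper name g is hypothetical. Because the tabulated inputs are rounded, this recomputation gives Σ f_j g_j of about 0.0305 and a case count in the mid-340s, somewhat above the 328 quoted; the sketch is meant to show the steps, not to reproduce the published figure.

```python
from math import log

Z_ALPHA, Z_BETA = 1.96, 1.28   # alpha = 0.05 two-sided, power = 0.90
ODDS_RATIO = 3.0

# (f_j, p0_j) per age stratum, as tabulated (rounded values)
strata = [(0.03, 0.22), (0.09, 0.08), (0.16, 0.07), (0.30, 0.02), (0.42, 0.02)]


def g(p0, odds_ratio):
    """g_j = ln(OR)^2 / [1/(p1*q1) + 1/(p0*q0)] for one stratum."""
    p1 = p0 * odds_ratio / (1 + p0 * (odds_ratio - 1))
    return log(odds_ratio) ** 2 / (1 / (p1 * (1 - p1)) + 1 / (p0 * (1 - p0)))


weighted_sum = sum(f * g(p0, ODDS_RATIO) for f, p0 in strata)
n_cases = (Z_ALPHA + Z_BETA) ** 2 / weighted_sum

print(f"sum f_j g_j ~ {weighted_sum:.4f}")
print(f"required cases ~ {n_cases:.0f}")
```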
Reason for the frequency matching: efficiency.
In the context of the case of myocardial infarction and oral contraceptive use.
Most cases in what age group?
Most exposure in what age group?
So, we use frequency matching to ensure adequate numbers of subjects in each of the 5 age groups.
§3.7.2 Pair Matched Studies (Schlesselman sec 6.6, pp 160 ff.)
There are special methods of computing power for matched studies. We consider first the simplest
situation: 1 to 1 matching. But “matched” studies can also have multiple controls per case.
In paired designs, the power calculations are based on estimating the power given a frequency of
discordant pairs, and then estimating the number of discordant pairs from (a) the probability of discordant
pairs, and (b) the quality of the matching.
(a) The number of discordant pairs (= m) required to detect a relative risk (RR) is given by:
$$ m = \frac{\left[ Z_{\alpha/2}/2 + Z_{\beta}\sqrt{P(1-P)} \right]^2}{\left( P - \tfrac{1}{2} \right)^2}, \quad \text{where } P = \frac{OR}{1+OR} \approx \frac{RR}{1+RR}. \quad \text{(Eq 5.14)} $$
We are going to work with the OR because it is the ratio of the frequencies of discordant pairs. (We then
make the assumption that OR is a good estimate for RR).
Here P = u10 / (u10 + u01) in the paired data table. (See the notation in Vol I, Part 4). That is, P = the
proportion of discordant pairs in the “b” cell of the McNemar table: P = b / (b + c).
Must distinguish P from p0 and p1 = risk of exposure among the controls and the cases.
(These calculations can be done by computer: PS, Power and Precision, PASS, for example).
Derivation of sample size formula for McNemar’s test: [Advanced Materials]
Recall that McNemar’s test is equivalent to a test of a binomial proportion, where the proportion is the
fraction of discordant pairs that are, for example, in the u10 cell in the 2 by 2 table of paired data. This
was shown in Vol I, Part 4.
We can use this relationship and a version of the sample size formula we have seen before to show the
correspondence between previous formulae and the ones specifically suited for matched pair case control
studies.
Details appear in Schlesselman (pp 145, 161)
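Ho: p = 1/2 (OR = 1), where $\hat{p} = \dfrac{U_{10}}{U_{10} + U_{01}}$, and here m = U10 + U01.
The standard sample size formula for one-sample binomial test:
$$ n = \frac{p_0 (1 - p_0) \left[ z_{\alpha/2} + z_{\beta} \sqrt{\dfrac{p(1-p)}{p_0 (1 - p_0)}} \right]^2}{(p - p_0)^2}, $$
and, setting p0 = 1/2,
$$ n = \frac{\tfrac{1}{2} \cdot \tfrac{1}{2} \left[ z_{\alpha/2} + z_{\beta} \sqrt{\dfrac{p(1-p)}{\tfrac{1}{2} \cdot \tfrac{1}{2}}} \right]^2}{\left( p - \tfrac{1}{2} \right)^2} = \frac{\left[ \dfrac{z_{\alpha/2}}{2} + z_{\beta} \sqrt{p(1-p)} \right]^2}{\left( p - \tfrac{1}{2} \right)^2}. $$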
Letting m = n, we have derived the formula for number of discordant pairs.
Note: The denominator corresponds to d* from before, because we have expressed OR in terms of p, and
we are essentially doing the calculation for the difference between the desired OR and OR=1 (null).
[END OF ADVANCED MATERIALS]
(b) Estimating the proportion of discordant pairs among all pairs.
We do this from our estimates of the risk of exposure in the control group.
Let pe = the probability of an exposure-discordant pair and M = the total number of pairs needed to yield
m discordant pairs. M = m / pe.
This probability will depend in part on the baseline risk of exposure among the controls, on the odds ratio
that we are trying to demonstrate, and on the skill (or lack thereof) in selecting matching criteria.
First, consider the baseline case of estimating what fraction of the matched pairs will be informative, i.e,
what fraction will be discordant pairs.
Although pe depends on matching criteria, using the notation from McNemar’s test, the matched pairs can
be displayed in following table:
              Control E+    Control E-
Case E+          U11           U10
Case E-          U01           U00
pe = Pr(exposure discordant pairs)
By definition: pe = Pr(U10) + Pr(U01)
   = Pr(E+|case) * Pr(E-|ctrl) + Pr(E-|case) * Pr(E+|ctrl)
   = p1 * (1-p0) + (1-p1) * p0
Note: this is an approximation, because pe depends on the matching criteria (which include factors other
than E).
We can compute p1 , the proportion of exposed cases, from the OR and the value of p0 , the proportion of
exposed controls, using the formula for OR.
$$ p_1 = \frac{p_0 \, OR}{1 + p_0 (OR - 1)}. $$
OR and p0 are stipulated values. Then q0 = 1 - p0, and q1 = 1 - p1, and
$$ M = \frac{m}{p_e} = \frac{m}{(p_0 q_1 + p_1 q_0)} = \text{sample size needed.} \quad \text{Eq (5.15)} $$
But there might be other reasons for assuming that the true percentage of usable discordant pairs is
actually smaller than what we might expect.
Example: Pair (1 to 1) Matched study of OC use and congenital heart disease.
For α = 0.05, β = 0.1
We think: p0 = 0.03, i.e., 3% risk of exposure among population of controls (so, rare exposure)
We want to detect OR = 2
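We know from the relationship among OR, p0, and p1 = Pr(E|case):
$$ p_1 = \frac{(.03)(2)}{1 + .03(1)} = .058, \quad \text{because } OR - 1 = 2 - 1 = 1, $$
and
$$ P = \frac{OR}{1 + OR} = \frac{2}{3}, \qquad (1 - P) = \frac{1}{3}. $$
This is from the formula derived from McNemar’s test.
$$ m = \frac{\left[ \dfrac{1.96}{2} + 1.28 \sqrt{\dfrac{2}{3} \cdot \dfrac{1}{3}} \right]^2}{\left( \dfrac{2}{3} - \dfrac{1}{2} \right)^2} = 90 \text{ discordant pairs, from Eq 5.14.} $$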
Then, to estimate the total number of pairs:
pe =prob. (discordant pair)
= (p0q1 + p1q0)
= [(0.03) (0.942) + (0.058) (0.97)]
= 0.028 + 0.056 = 0.084
Then: M = m / pe = 90 / .084 = 1071 matched pairs
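The worked example can be retraced in a few lines. This is a minimal sketch of Eq 5.14 and Eq 5.15 with hypothetical function names; it uses unrounded z values and an unrounded p1, so it gives m ≈ 90.3 and M ≈ 1066 rather than the hand-rounded 90 and 1071.

```python
from math import sqrt
from statistics import NormalDist


def discordant_pairs(odds_ratio, alpha=0.05, power=0.90):
    """Eq 5.14: discordant pairs needed, with P = OR / (1 + OR)."""
    z = NormalDist().inv_cdf
    big_p = odds_ratio / (1 + odds_ratio)
    num = z(1 - alpha / 2) / 2 + z(power) * sqrt(big_p * (1 - big_p))
    return (num / (big_p - 0.5)) ** 2


def total_pairs(m, p0, odds_ratio):
    """Eq 5.15: total matched pairs, assuming no exposure correlation within pairs."""
    p1 = p0 * odds_ratio / (1 + p0 * (odds_ratio - 1))
    p_e = p0 * (1 - p1) + p1 * (1 - p0)   # probability a pair is discordant
    return m / p_e


m = discordant_pairs(2.0)
M = total_pairs(m, p0=0.03, odds_ratio=2.0)
print(f"m ~ {m:.2f} discordant pairs, M ~ {M:.0f} matched pairs")
```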
What happens with other combinations of parameters?
All rows: alpha = 0.05 (Zα/2 = 1.96), power = 0.9 (Zβ = 1.28), r = 1.

OR    P      p0=Pr(E|ctrl)   m       p1=Pr(E|case)   q0=1-p0   q1=1-p1   pe      M      DuPont
2     0.67   0.03            90.34   0.06            0.97      0.94      0.086   1046   1066
2     0.67   0.1             90.34   0.2             0.9       0.8       0.26    347    368
2     0.67   0.2             90.34   0.4             0.8       0.6       0.44    205    266
2     0.67   0.5             90.34   1               0.5       0         0.5     181    181
2.5   0.71   0.03            52.93   0.075           0.97      0.925     0.101   527    543
3     0.75   0.03            37.7    0.09            0.97      0.91      0.115   329    343

So, we can see that M depends heavily on the probability of exposure among the controls, as well as on the
OR that one assumes is present in truth.
The column labeled M reports results from this program. DuPont numbers in right column are from
program “PS” written by DuPont and Plummer.
We are making assumptions about p0, p1, OR and matching factors.
If matching is less than optimal, and we have overmatched to some extent, then the pr(exposure) for
the case and control in each pair will tend to be similar, resulting in a larger number of
“noninformative” pairs. Recall the (hypothetical) example of matching on use of coffee mate in the
association of pancreatic cancer and coffee.
Program by DuPont and Plummer allows user to adjust for this correlation of exposure.
We can reverse this process and estimate power for a given number of discordant pairs. (Ref = Schlesselman
p 162)
$$ z_{\beta} = \frac{ -\dfrac{z_{\alpha/2}}{2} + \sqrt{ m \left( P - \tfrac{1}{2} \right)^2 } }{ \sqrt{P(1-P)} }, \quad \text{Eq (5.16)} $$
where power = Pr(Z ≤ zβ),
and m is the number of discordant pairs (as before).
So, zβ = 1.28 is equivalent to power=0.9
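A minimal sketch of Eq 5.16, with a hypothetical function name, shows the reverse calculation: given m discordant pairs and a target OR, solve for zβ and convert it to power with the standard normal CDF.

```python
from math import sqrt
from statistics import NormalDist


def power_from_discordant(m, odds_ratio, alpha=0.05):
    """Eq 5.16: power of the matched-pair test given m discordant pairs."""
    nd = NormalDist()
    big_p = odds_ratio / (1 + odds_ratio)
    z_beta = ((-nd.inv_cdf(1 - alpha / 2) / 2 + sqrt(m) * abs(big_p - 0.5))
              / sqrt(big_p * (1 - big_p)))
    return nd.cdf(z_beta)


for m in (45, 90.34, 180):
    print(f"m = {m}: power ~ {power_from_discordant(m, 2.0):.3f}")
```

Feeding back the m ≈ 90.34 found earlier for OR = 2 returns a power of about 0.90, as it should.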
Notes:
1. Can estimate m from M using M = m / pe.
2. It is better to estimate pe from preliminary data, or to revise it after initial data collection.
3. We have looked at case control studies (because that is where matching is more common). But
this framework can apply to cohort studies.
§3.7.3 Matched studies with more than one control per case (or in the instance of cohort studies,
more than one unexposed per exposed).
The same principles apply to these more complex designs.
In these instances, there are several sets of paired tables per matched set, each table representing the
cross classification of pairing for the case with each of the controls. (So, if there are 3 controls per
case, one can think of a set of 3 tables of paired comparisons).
(1) A simple adjustment:
Let c = the number of controls per case, and let n be the number of cases assuming 1 to 1 matching.
Then with c to 1 matching, one needs n1 cases, where: n1 = (c + 1) n / (2c).
Thus, if one needed 1050 cases (and 1050 controls) and then selected 2 controls per case, the
new number of cases = (2+1)(1050)/(2×2) = 3×1050/4 = 788, and the number of controls = 1576. This
approximation is good in many cases, but falls apart when the probability of exposure of a sampled
control is low.
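The simple adjustment is easy to script. In this sketch the function name is hypothetical and the result is rounded up to whole subjects.

```python
from math import ceil


def cases_with_c_controls(n_one_to_one, c):
    """Cases needed with c controls per case: n1 = (c + 1) * n / (2 * c)."""
    return (c + 1) * n_one_to_one / (2 * c)


n_cases = ceil(cases_with_c_controls(1050, 2))   # 787.5 rounds up to 788
n_controls = 2 * n_cases

print(n_cases, n_controls)

# Diminishing returns: the number of cases can never drop below n / 2
for c in (1, 2, 3, 4, 10):
    print(c, ceil(cases_with_c_controls(1050, c)))
```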
(2) More complex methods:
Better approximations are available. The programs in DuPont and Plummer (PS) use an
estimate of the correlation of the exposure status between a case and its matched controls.
The formula we have seen (Schlesselman) assumes no correlation. DuPont and Plummer
generalized this formula (for multiple controls per case AND for the possibility of some
correlation. )
[Aside: You can think of correlation in terms of two columns of data:

Case   Control
1      1
1      0
0      1
0      0
1      0

where a 1 indicates exposed and a 0 indicates unexposed and each row is a matched pair (or one
of a set of matched pairs). Then the correlation is simple to obtain using the standard formula.]
A good start is corr = 0.2. As the correlation increases, the sample size (number of cases)
increases.
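For binary exposure indicators the "standard formula" is the ordinary Pearson correlation. A minimal sketch for the five illustrative pairs in the aside follows (the function name is hypothetical). For this toy data the value happens to come out slightly negative, about -0.17; real matched data with overmatching would tend to give a positive value.

```python
from math import sqrt
from statistics import fmean

# Exposure status (1 = exposed, 0 = unexposed) for five matched pairs
case = [1, 1, 0, 0, 1]
control = [1, 0, 1, 0, 0]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


r = pearson(case, control)
print(f"correlation of exposure within pairs ~ {r:.3f}")
```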
Effect of additional matches and correlation on sample sizes:
What happens when we add controls per case in a matched study:
The number of cases needed drops, but the total number of patients increases.

Controls per case   No. case patients   Total patients
1                   1066                2132
2                   782                 2346
3                   688                 2752
4                   641                 3205

Assuming same OR, power, alpha, p0, p1 as in our example.
Calculations from program PS
Effect of correlation on sample sizes (using same example):

Corr   Case patients
0      1066
0.1    1230
0.2    1437
0.3    1705

Correlation might occur when matching is less than optimal.
Available software: PS, PASS.
Reference: DuPont WD. Power calculations for matched case-control
studies. Biometrics. 1988; 44:1157-68.
§3.8 Miscellaneous Comments on Sample Size calculations –
1. Power to detect interactions –
“Interaction” takes on different meanings in different applications.
The safest way to do a power calculation for interaction is via simulation: stipulate the OR of
exposure and outcome in the reference group, and then the ratio of odds ratios (ROR) between the
OR (of exposure and outcome) in the comparison group or groups and that in the reference group.
Good software for this application is hard to find.
2. For the special application of gene-environment interaction, there is a program from NCI (Power.exe).
The reason for using a special program:
Assume that among females, OR=2.0 (1.2 to 3.5), and for males OR=1.3 (0.8 to 2.1). Then the OR
for females is significantly different from 1.0, but one might still want to test whether the OR for
females > OR for males.
References:
[Ms. Holly Brown ([email protected]).]
Lubin JH, Gail MH. On power and sample size for studying features of
the relative odds of disease. Am J Epidemiol 1990;131:552-566.
Garcia-Closas M, Lubin JH. Power and sample size calculations in
case-control studies of gene-environmental interactions: Comments on
different approaches. Am J Epidemiol 1999;149:689-693.
3. Adjustment for sample size from programs
Most programs give the power and sample size for the simple case of perfect data. But other
problems can make the power computed under the assumption of simple random sampling too
high. Adjustments are needed for such common problems as:
Measurement error
Loss to follow up
Lack of independence of observations (clustering= complex)
Repeated measures – This requires materials beyond the level of this course.
4. Covariates –
Covariates will tend to improve power by reducing variance, if the modeling is done correctly. As
a rule of thumb – estimate power for the simple case of no covariates and assume that the
introduction of covariates to adjust for confounding will improve power.
End of Vol 1 Part 5. End of Volume 1