SAMPLING AND SAMPLE SIZE
Andrew Zeitlin
Georgetown University and IGC Rwanda
With slides from Ben Olken and
the World Bank’s Development Impact Evaluation Initiative
Review
•  We want to learn how a program can affect a
group
•  Set-up: assume we have performed our lottery and identified our treatment and control groups
•  What we would like to measure is the difference:
Effect of the Program = Mean in treatment − Mean in control
Example: the average income of farmers who adopted fertilizer because of a new incentive program vs. that of farmers in the control group who did not receive any incentives
Bias vs. Noise
•  What if we let farmers choose whether or not to use
fertilizer?
•  Our study would be biased! If only the wealthiest, most educated farmers choose to use fertilizer, then we cannot isolate the effect of fertilizer by comparing treatment and control. We would see the combined effect of being wealthy, educated, and using fertilizer.
•  This is why we randomize! To remove other factors that
might create bias.
Bias vs. Noise
•  What if we only pick ten farmers each in treatment and control, and we randomly get the four richest farmers in our control group and only poor farmers in the treatment group?
•  The fact that the randomization did not balance farmer wealth means our groups are not really similar. The control group may make greater gains (due to more resources) despite the fertilizer.
•  Bottom line: randomization removes bias, but it does
not remove random noise in the data.
•  This is why we worry about sampling!
What do we mean by “noise”?
[Figure slides illustrating sampling noise]
Why “random” is necessary but not
enough
•  Random does not mean balanced! It just means any imbalance is not systematic.
•  Which of the following coin flips was random?
I. T, H, T, T, H, H, T, H, H, T — this one we made up
II. T, T, H, H, H, H, H, T, H, H — this was a real random coin flip: 5 heads in a row!
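To see how easily such "unbalanced-looking" sequences arise by chance, here is a minimal simulation of our own (not from the original slides): it estimates how often 10 fair coin flips contain a run of at least 5 heads.

```python
import random

random.seed(1)
trials = 100_000
hits = sum(
    "HHHHH" in "".join(random.choice("HT") for _ in range(10))
    for _ in range(trials)
)
# The exact answer is 112/1024 ~ 0.109: about one sequence in nine.
print(f"P(5+ heads in a row within 10 flips) ~ {hits / trials:.3f}")
```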
A random sample is accurate, but may not
be precise
•  What is the average age of the people in this room?
•  If I pick the youngest looking person in the room and ask their age,
I am biasing the type of response I am likely to get.
•  If I pick someone at random and ask their age, that is not biased,
but still doesn’t tell me much since it is just one person.
•  If I ask everyone except for one person chosen at random, I am likely to get close to the right average age: this is a good random sample.
•  If I ask everyone, it is no longer a sample of the room—it is the
“universe”
Which of these is more accurate?
[Figure: two panels of estimates, I and II]
A.  I.
B.  II.
C.  Don’t know

Accuracy versus Precision
[Figure: estimates scattered around the truth — Precision comes from sample size; Accuracy comes from randomization]
Real World Constraints
•  Random sampling can be noisy!
•  In a world with no budget constraint we could collect
data on ALL the individuals (universe) in the
treatment and in the control groups.
•  In practice, we do not observe the entire population,
just a sample.
Example: we do not have data for all farmers of the country/
region, but just for a random sample of them in treatment and
control groups
Bottom line: Estimated Effect = True Effect + Noise
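A minimal sketch of this bottom line (all numbers are hypothetical): every random sample returns the true effect plus its own draw of noise.

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect = 2.0   # hypothetical program impact on income
sigma = 10.0        # standard deviation of the outcome
n = 100             # farmers per arm

# Five independent experiments, each with its own random sample:
estimates = []
for _ in range(5):
    control = rng.normal(50.0, sigma, n)
    treatment = rng.normal(50.0 + true_effect, sigma, n)
    estimates.append(treatment.mean() - control.mean())

print([round(e, 2) for e in estimates])  # scattered around 2.0, never exactly 2.0
```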
The basic questions in statistics
•  How confident can you be in your results?
•  → How big does your sample need to be?
Hypothesis Testing
•  In criminal law, most institutions follow the rule:
“innocent until proven guilty”
•  The presumption is that the accused is innocent
and the burden is on the prosecutor to show guilt
•  The jury or judge starts with the “null hypothesis” that
the accused person is innocent
•  The prosecutor has a hypothesis that the accused
person is guilty
Hypothesis Testing
•  In program evaluation, instead of “presumption of
innocence,” the rule is: “presumption of
insignificance”
•  The “Null hypothesis” (H0) is that there was no
(zero) impact of the program
•  The burden of proof is on the evaluator to show a
significant effect of the program
Hypothesis Testing: Conclusions
•  If it is very unlikely (less than a 5% probability) that the
difference is solely due to chance:
•  We “reject our null hypothesis”
•  We may now say:
•  “our program has a statistically significant impact”
Two Types of Mistakes (1)
First type of error: you conclude that the program has an effect, when in fact it has no effect
•  Significance level of a test: probability that you will falsely conclude that the program has an effect, when in fact it does not
•  If you test at the 5% level, there is only a 5% chance of falsely concluding that the program had an effect when it did not
•  Common levels are: 5%, 10%, 1%
Two Types of Mistakes (2)
Second Type of Error: You conclude that the
program has no effect when indeed it had an
effect, but it was not measured with enough
precision (or “noise” got in the way)
Power of a test: the probability of finding a significant effect if there truly is an effect
•  Higher power is better, since you are more likely to detect a true effect
Practical steps
There are two related ways one might apply this logic:
1.  Start from the sample size that you can afford. Figure
out what would be the smallest ‘true’ effect that you
could detect with reasonable confidence and power.
Ø This is known as the minimum detectable effect for a given
design.
2.  Start from a plausible effect size, and figure out how big
a sample you need in order to be able to detect this with
reasonable confidence and power.
Ø We will focus on this second approach.
Practical Steps
Ø  Set a pre-specified confidence level (5%) – i.e. just
set the initial point of the line in the graph
Ø Decide a level of power.
•  Common values used are 80% or 90%. Intuitively, the
larger the sample, the larger the power.
•  Power is a planning tool: one minus the power is the
probability to be disappointed….
Ø Set a range of pre-specified effect sizes (what you
think the program will do)
Ø What is the smallest effect that should prompt a policy
response?
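These steps can be carried out with standard software. A hedged sketch using statsmodels (the 0.3 SD effect size is an illustrative choice, not a value from the slides):

```python
from statsmodels.stats.power import TTestIndPower

n_per_arm = TTestIndPower().solve_power(
    effect_size=0.3,  # hypothesized effect, in standard deviations
    alpha=0.05,       # significance level
    power=0.80,       # desired power
    ratio=1.0,        # equal-sized treatment and control groups
)
print(f"Required sample size per arm: {n_per_arm:.1f}")  # ~175; round up
```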
Picking an Effect Size to choose sample size
•  We can guess an effect size using:
•  economics
•  past data on the outcome of interest, or even past evaluations
•  What is the smallest effect that would justify adopting the program?
•  Cost of this program vs. the benefits it brings
•  Cost of this program vs. the alternative use of the money
Underpowered
•  Common danger: picking effect sizes that are too
optimistic—the sample size may be set too low to detect
an actual effect!
•  Example:
•  Evaluators believe a program will increase high school graduation
by 15 percentage points.
•  They survey enough schools to detect increases of 12 percentage points or more.
•  The program increased graduation rates by 10 percentage points, but they missed it entirely due to lack of power!
•  They report the program had no statistically significant
effect, even though it actually had one!
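The logic of this example can be checked with a power calculation. A sketch assuming, purely for illustration, a 50% baseline graduation rate (the slides do not specify one):

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

analysis = NormalIndPower()
# The study is sized to detect a 12-percentage-point gain with 80% power...
h_planned = proportion_effectsize(0.62, 0.50)
n = analysis.solve_power(effect_size=h_planned, alpha=0.05, power=0.80)
# ...but the true effect is only 10 percentage points:
h_true = proportion_effectsize(0.60, 0.50)
power = analysis.solve_power(effect_size=h_true, nobs1=n, alpha=0.05)
print(f"n per arm: {n:.0f}; power against the true effect: {power:.2f}")  # ~0.64
```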
How difficult is this to do?
•  Proposition I:
There exists at least one statistician in the world
who has already put into a magic formula the
optimal sample size required to address this
problem
•  Proposition II:
The formula has also been implemented in almost all statistical software
• Not difficult to do, and only requires simple calculations
to understand the logic (really simple!)
Power: main ingredients
1.  Effect Size
2.  Sample Size
3.  Variance
4.  Proportion of sample in T vs. C
5.  Clustering
Larger effect = more power to detect
•  A device detects all animals over six feet (1.8 meters) tall.
•  Power to detect adult men: Under 10%
•  Power to detect adult women: Under 1%
•  Power to detect adult mice: 0%
•  Power to detect adult giraffes: 100%
•  The taller the animal (the bigger the effect size) we care about, the more power we have (and the less sample we need)
Effect Size: 1*SE
•  Hypothesized effect size determines distance between
means
[Figure: sampling distributions of the estimated effect under H0 (control) and Hβ (treatment), centered 1 standard error apart]
Effect Size = 1*SE
[Figure: the same distributions, with the significance (rejection) region shaded]
Power: 26%
If the true impact was 1*SE…
[Figure: the same distributions, with the power region under Hβ shaded]
The Null Hypothesis would be rejected only 26% of the time
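Where does 26% come from? The slides' figures are consistent with a one-sided test at the 5% level (our inference); a quick check with the normal approximation:

```python
from scipy.stats import norm

alpha = 0.05
critical = norm.ppf(1 - alpha)  # one-sided critical value, ~1.645
effect_in_se = 1.0              # the true effect equals 1 standard error
power = 1 - norm.cdf(critical - effect_in_se)
print(f"Power: {power:.0%}")    # ~26%
```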
Effect Size: 3*SE
[Figure: H0 and Hβ distributions, now 3 standard errors apart]
Bigger hypothesized effect size → distributions farther apart
Effect size 3*SE: Power = 91%
[Figure: with the distributions 3 SE apart, most of the Hβ distribution lies beyond the critical value]
Bigger effect size means more power
What effect size should you use when designing your experiment?
A.  Smallest effect size that is still cost effective
B.  Largest effect size you expect your program to produce
C.  Both
D.  Neither
Power: main ingredients
1.  Effect Size
2.  Sample Size
3.  Variance
4.  Proportion of sample in T vs. C
5.  Clustering
By increasing sample size you increase…
A.  Accuracy
B.  Precision
C.  Both
D.  Neither
E.  Don’t know
Larger sample size = more power to detect
•  We want to know the average age in the city
•  If we randomly pick one person in the city, we might pick a 100-year-old.
•  If we randomly pick 2,000 people, even if we pick the 100-year-old as one of them, he will be balanced out by the other random selections.
•  This intuition extends to effect sizes (see the sketch below).
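A minimal simulation of the age example (the city's age distribution is made up): larger samples pin down the mean much more tightly.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 80, size=50_000)  # a hypothetical city
for n in (1, 10, 2000):
    means = [rng.choice(ages, size=n).mean() for _ in range(1000)]
    print(f"n={n:>4}: sample means range from {min(means):.1f} to {max(means):.1f}")
```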
Power: Effect size = 1SD,
Sample size = N
[Figure: H0 and Hβ sampling distributions at sample size N, with the significance region shaded]
Power: Sample size = 4N
[Figure: at sample size 4N the sampling distributions are half as wide, so they overlap less]
Power: 64%
[Figure: power region shaded]
Power: Sample size = 9N
[Figure: at sample size 9N the distributions are narrower still]
Power: 91%
[Figure: power region shaded]
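The 26% → 64% → 91% progression follows the square-root law: quadrupling the sample halves the standard error, so a fixed effect doubles when measured in SE units. Continuing the one-sided 5% approximation used above:

```python
from scipy.stats import norm

critical = norm.ppf(0.95)  # one-sided 5% test, as before
for multiple, label in [(1, "N"), (4, "4N"), (9, "9N")]:
    effect_in_se = 1.0 * multiple ** 0.5  # SE shrinks with the square root of N
    power = 1 - norm.cdf(critical - effect_in_se)
    print(f"Sample size {label:>2}: power = {power:.0%}")  # 26%, 64%, 91%
```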
Power: main ingredients
1.  Effect Size
2.  Sample Size
3.  Variance
4.  Proportion of sample in T vs. C
5.  Clustering
More variance = less power to detect
•  Imagine the following intervention: Giving away ten bags
of rice
•  In this example, this program has a large effect on ALL poor
people, and no effect on ALL rich people.
•  Low Variance: If our population is all poor, we only need
to sample one person to see the true effect of giving away
rice
•  High Variance: If our population is half poor, and half rich
(high variance) and we randomly sample twenty people,
what happens if only 5 are poor?
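The dependence on variance is quantitative: the required sample size scales with σ². A quick illustration (all numbers hypothetical):

```python
from statsmodels.stats.power import TTestIndPower

raw_effect = 3.0  # hypothesized impact, in outcome units
for sd in (5.0, 10.0):
    d = raw_effect / sd  # the standardized effect shrinks as the SD grows
    n = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"outcome SD = {sd}: n per arm = {n:.0f}")
# Doubling the SD roughly quadruples the required sample.
```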
What are typical ways to reduce the underlying (population) variance?
A.  Include covariates
B.  Increase the sample
C.  Do a baseline survey
D.  All of the above
E.  A and B
F.  A and C
Variance
•  There is sometimes very little we can do to reduce the
noise
•  The underlying variance is what it is
•  We can try to “absorb” variance:
•  using a baseline
•  controlling for other variables
•  In practice, controlling for other variables (besides the baseline
outcome) buys you very little
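A sketch of how much a baseline can "absorb" (the correlations are illustrative, and the (1 − r²) scaling is the standard ANCOVA approximation rather than a formula from the slides):

```python
from statsmodels.stats.power import TTestIndPower

d = 0.3  # hypothesized effect, in standard deviations
n_raw = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.80)
for r in (0.0, 0.5, 0.8):
    # Controlling for a baseline correlated r with the endline outcome
    # leaves residual variance (1 - r^2) of the original.
    print(f"baseline correlation r = {r}: ~{n_raw * (1 - r**2):.0f} per arm")
```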
Power: main ingredients
1.  Effect Size
2.  Sample Size
3.  Variance
4.  Proportion of sample in T vs. C
5.  Clustering
More balanced treatment assignment
=> More power
•  What’s better?
•  99 people who get the treatment and one control
•  50 treatment and 50 control
•  This logic continues. What’s better?
•  60 people who get the treatment and 40 control
•  50 treatment and 50 control
Sample split: 50% C, 50% T
[Figure: with a 50-50 split, the H0 and Hβ distributions have equal width; significance region shaded]
Power: 91%
[Figure: power region shaded]
If it’s not a 50-50 split?
•  What happens to the relative width of the two sampling distributions if the split is not 50-50?
•  Say, 25-75?
Sample split: 25% C, 75% T
[Figure: with a 25-75 split, the control-mean distribution is wider than the treatment-mean distribution]
Power: 83%
[Figure: power region shaded]
How unbalanced is too unbalanced?
Bloom (2006): “Because precision erodes slowly until the degree of imbalance becomes extreme (roughly P ≤ 0.2 or P ≥ 0.8), there is considerable latitude for using an unbalanced allocation.”
This helps if…
•  Politics dictate a
small control group
•  Costs dictate a
small treatment
group
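Bloom's claim is easy to verify with the P(1 − P) term that appears in the appendix formulas: the MDE scales with 1/√(P(1 − P)), so precision erodes slowly at first and then quickly.

```python
# Relative MDE of an unbalanced design, compared with a 50-50 split.
for P in (0.5, 0.6, 0.7, 0.8, 0.9):
    inflation = (0.25 / (P * (1 - P))) ** 0.5
    print(f"P = {P}: MDE is {inflation:.2f}x the balanced-design MDE")
```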
Power: main ingredients
1.  Effect Size
2.  Sample Size
3.  Variance
4.  Proportion of sample in T vs. C
5.  Clustering
Clustered design: definition
•  In sampling:
•  When clusters of individuals (e.g. schools, communities) are randomly selected from the population, before selecting individuals for observation
•  In randomized evaluation:
•  When clusters of individuals are randomly assigned to different
treatment groups
Clustered design: intuition
•  You want to know how close the upcoming national
elections will be
•  Method 1: Randomly select 50 people from entire Indian
population
•  Method 2: Randomly select 5 families, and ask ten
members of each family their opinion
Low intra-cluster correlation (ICC), aka ρ (rho)
[Figure: outcomes vary widely within each cluster]
HIGH intra-cluster correlation (ρ)
[Figure: outcomes are very similar within each cluster]
All uneducated people live in one village. People with only primary education live in another. College grads live in a third, etc. The ICC (ρ) for education will be…
A.  High
B.  Low
C.  No effect on rho
D.  Don’t know
If ICC (ρ) is high, what is a more efficient way of increasing power?
A.  Include more clusters in the sample
B.  Include more people in clusters
C.  Both
D.  Don’t know
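The standard design-effect formula, 1 + (n − 1)ρ, makes the answer concrete (ρ = 0.3 is an illustrative "high" ICC): holding total interviews fixed, extra clusters buy far more effective sample than extra people per cluster.

```python
def design_effect(n, rho):
    """Variance inflation from sampling clusters of size n with ICC rho."""
    return 1 + (n - 1) * rho

rho = 0.3  # an illustratively high intra-cluster correlation
for J, n in [(10, 100), (100, 10), (1000, 1)]:
    effective_n = J * n / design_effect(n, rho)
    print(f"{J:>4} clusters x {n:>3} people: effective sample ~ {effective_n:.0f}")
```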
Further topics: Imperfect compliance
In some cases, policymakers/researchers can assign individuals
to a given treatment arm, but this doesn’t mean they will take it
up. What does this mean for power?
•  Consider an extreme case in which nobody in the treatment group takes up.
•  In that case, no matter how big the sample size, you can’t detect the
treatment’s impact because you never see it.
•  Alternatively, what happens if everybody ends up getting the
treatment in both treatment and control groups?
•  The required sample size is inversely proportional to (c − d)²,
•  where c is the fraction of treated who ‘comply’, and
•  d is the fraction of control who ‘defy’
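A quick illustration of the 1/(c − d)² scaling (the 176-per-arm baseline carries over from the earlier statsmodels sketch):

```python
n_full_compliance = 176  # per arm, from the earlier 0.3 SD / 80% power example
for c, d in [(1.0, 0.0), (0.8, 0.0), (0.6, 0.1)]:
    n = n_full_compliance / (c - d) ** 2
    print(f"compliance c = {c}, defiance d = {d}: need ~{n:.0f} per arm")
```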
Wrap-up on Power
•  Power calculations look scary but they are just a
formalization of common sense
•  At times we do not have the right information to conduct them properly
•  However, it is important to spend effort on them:
•  Avoid launching studies that will have no power at all: waste of time
and money, potentially harmful
•  Devote the appropriate resources to the studies that you decide to
conduct (and not too much)
Appendix: The nuts and bolts (1)
For an experimental design with perfect compliance and individual-level assignment (no clustering),
•  Minimum detectable effect, for sample size N:
$$MDE = (t_{1-\kappa} + t_{\alpha/2})\,\sqrt{\frac{\sigma^2}{P(1-P)\,N}}$$
•  Minimum sample size, for hypothesized effect size β:
$$N = \frac{(t_{1-\kappa} + t_{\alpha/2})^2\,\sigma^2}{\beta^2\,P(1-P)}$$

Appendix: The nuts and bolts (2)
When compliance becomes imperfect, with c the fraction of those assigned to treatment who take up and d the fraction of control who do likewise…
•  Minimum detectable effect, for sample size N:
$$MDE = \frac{t_{1-\kappa} + t_{\alpha/2}}{c-d}\,\sqrt{\frac{\sigma^2}{P(1-P)\,N}}$$
•  Minimum sample size, for hypothesized effect size β:
$$N = \frac{(t_{1-\kappa} + t_{\alpha/2})^2\,\sigma^2}{(c-d)^2\,\beta^2\,P(1-P)}$$

Appendix: The nuts and bolts (3)
For an experimental design with perfect compliance and group-based assignment,
•  Minimum detectable effect, for J groups with n members each:
$$MDE = (t_{1-\kappa} + t_{\alpha/2})\,\sqrt{\frac{\sigma^2}{J\,P(1-P)}\left(\rho + \frac{1-\rho}{n}\right)}$$
•  Minimum number of groups, for hypothesized effect size β:
$$J = \frac{(t_{1-\kappa} + t_{\alpha/2})^2\,\sigma^2}{\beta^2\,P(1-P)}\left(\rho + \frac{1-\rho}{n}\right)$$
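These formulas translate directly into code. A sketch that approximates the t quantiles with normal quantiles and writes kappa for the desired power, as in the slides:

```python
from scipy.stats import norm

def t_sum(alpha=0.05, kappa=0.80):
    """t_{1-kappa} + t_{alpha/2}: ~0.84 + 1.96 = 2.80 at the defaults."""
    return norm.ppf(kappa) + norm.ppf(1 - alpha / 2)

def mde(N, sigma2, P=0.5, c=1.0, d=0.0, alpha=0.05, kappa=0.80):
    """Appendix (1)-(2): minimum detectable effect with compliance c, defiance d."""
    return t_sum(alpha, kappa) / (c - d) * (sigma2 / (P * (1 - P) * N)) ** 0.5

def n_required(beta, sigma2, P=0.5, c=1.0, d=0.0, alpha=0.05, kappa=0.80):
    """Appendix (1)-(2): minimum total sample size for hypothesized effect beta."""
    return t_sum(alpha, kappa) ** 2 * sigma2 / ((c - d) ** 2 * beta ** 2 * P * (1 - P))

def mde_clustered(J, n, sigma2, rho, P=0.5, alpha=0.05, kappa=0.80):
    """Appendix (3): J clusters of n members with intra-cluster correlation rho."""
    return t_sum(alpha, kappa) * (
        sigma2 / (P * (1 - P) * J) * (rho + (1 - rho) / n)
    ) ** 0.5

# Example: a 0.3 SD effect with a 50-50 split needs ~349 people in total,
# and 50 clusters of 20 with rho = 0.05 give an MDE of ~0.25 SD.
print(round(n_required(beta=0.3, sigma2=1.0)))
print(round(mde_clustered(J=50, n=20, sigma2=1.0, rho=0.05), 3))
```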