SAMPLING AND SAMPLE SIZE

Andrew Zeitlin
Georgetown University and IGC Rwanda
With slides from Ben Olken and the World Bank's Development Impact Evaluation Initiative

Review
• We want to learn how a program can affect a group.
• Set-up: assume we have performed our lottery and identified our treatment and control groups.
• What we would like to measure is the difference:
  Effect of the Program = Mean in treatment − Mean in control
• Example: the average income of farmers who adopted fertilizer because of a new incentive program vs. the average income of farmers in the control group, who did not receive any incentives.

Bias vs. Noise
• What if we let farmers choose whether or not to use fertilizer?
• Our study would be biased! If only the wealthiest, most educated farmers choose to use fertilizer, then comparing treatment and control will not show the effect of fertilizer. We would see the combined effect of being wealthy, educated, and using fertilizer.
• This is why we randomize: to remove other factors that might create bias.

Bias vs. Noise
• What if we pick only ten farmers for treatment and control, and we randomly get the four richest farmers in our control group and only poor farmers in our treatment group?
• The fact that the randomization did not balance farmer wealth means our groups are not really comparable. The control group may make greater gains (due to more resources) despite the fertilizer going to the treatment group.
• Bottom line: randomization removes bias, but it does not remove random noise in the data.
• This is why we worry about sampling!

What do we mean by "noise"?

Why "random" is necessary but not enough
• Random does not mean balanced! It just means that any imbalance did not arise for any systematic reason.
• Which of the following coin flips was random?
  T, H, T, T, H, H, T, H, H, T (this one we made up)
  T, T, H, H, H, H, H, T, H, H (this was a genuinely random coin flip: five heads in a row!)

A random sample is accurate, but may not be precise
• What is the average age of the people in this room?
• If I pick the youngest-looking person in the room and ask their age, I am biasing the type of response I am likely to get.
• If I pick someone at random and ask their age, that is not biased, but it still does not tell me much, since it is just one person.
• If I ask everyone except one person chosen at random, I am likely to get close to the right average age: this is a good random sample.
• If I ask everyone, it is no longer a sample of the room; it is the "universe".

[Figure: two scatter patterns of estimates, labeled I and II.]
Poll: Which of these is more accurate?
A. I
B. II
C. Don't know

[Figure: Accuracy versus Precision. Estimates scattered around the truth; accuracy comes from randomization, precision from sample size.]

Real World Constraints
• Random sampling can be noisy!
• In a world with no budget constraint, we could collect data on ALL the individuals (the universe) in the treatment and control groups.
• In practice, we do not observe the entire population, just a sample.
• Example: we do not have data for all farmers in the country/region, just for a random sample of them in the treatment and control groups.
• Bottom line: Estimated Effect = True Effect + Noise

THE basic questions in statistics
• How confident can you be in your results?
• → How big does your sample need to be?
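Before turning to hypothesis testing, the identity Estimated Effect = True Effect + Noise can be made concrete with a short simulation. This is a minimal sketch: the population, the +10 income effect, and all numbers below are made up for illustration. Randomization keeps the average estimate centered on the true effect at any sample size, while the noise (the spread of estimates across experiments) shrinks only as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population of 10,000 farmer incomes; the program's true effect is +10.
TRUE_EFFECT = 10.0
incomes = rng.normal(loc=100.0, scale=20.0, size=10_000)

def one_experiment(n):
    """Sample n farmers, randomize half to treatment, estimate the effect."""
    sample = rng.choice(incomes, size=n, replace=False)
    treated = rng.permutation(n) < n // 2          # a random half gets treated
    outcomes = sample + TRUE_EFFECT * treated      # treatment shifts outcomes
    return outcomes[treated].mean() - outcomes[~treated].mean()

for n in (10, 100, 1000):
    estimates = [one_experiment(n) for _ in range(2_000)]
    print(f"n={n:4d}: average estimate {np.mean(estimates):6.2f} (unbiased), "
          f"noise (SD across experiments) {np.std(estimates):5.2f}")
```

With n = 10 the estimate is right on average but swings wildly from one experiment to the next; with n = 1,000 it is both unbiased and precise.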
Hypothesis Testing
• In criminal law, most institutions follow the rule "innocent until proven guilty".
• The presumption is that the accused is innocent, and the burden is on the prosecutor to show guilt.
• The jury or judge starts with the "null hypothesis" that the accused person is innocent.
• The prosecutor has a hypothesis that the accused person is guilty.

Hypothesis Testing
• In program evaluation, instead of a "presumption of innocence", the rule is a "presumption of insignificance".
• The null hypothesis (H0) is that the program had no (zero) impact.
• The burden of proof is on the evaluator to show a significant effect of the program.

Hypothesis Testing: Conclusions
• If it is very unlikely (less than a 5% probability) that the difference is due solely to chance:
  • we "reject our null hypothesis", and
  • we may now say: "our program has a statistically significant impact".

Two Types of Mistakes (1)
• First type of error: concluding that the program has an effect when in fact it has, at best, no effect.
• Significance level of a test: the probability that you will falsely conclude that the program has an effect, when in fact it does not.
• If you find an effect at the 5% level, you can be 95% confident in the validity of your conclusion that the program had an effect.
• Common levels are 5%, 10%, and 1%.

Two Types of Mistakes (2)
• Second type of error: you conclude that the program has no effect when it did have one, but it was not measured with enough precision (or "noise" got in the way).
• Power of a test: the probability of finding a significant effect if there truly is an effect.
• Higher power is better, since I am then more likely to have an effect to report.

Practical steps
There are two related ways one might apply this logic:
1. Start from the sample size that you can afford, and figure out the smallest 'true' effect you could detect with reasonable confidence and power.
   → This is known as the minimum detectable effect for a given design.
2. Start from a plausible effect size, and figure out how big a sample you need in order to detect it with reasonable confidence and power.
   → We will focus on this second approach.

Practical Steps
• Set a pre-specified significance level (typically 5%): this fixes the rejection threshold.
• Decide on a level of power.
  • Common values are 80% or 90%. Intuitively, the larger the sample, the larger the power.
  • Power is a planning tool: one minus the power is the probability of being disappointed….
• Set a range of pre-specified effect sizes (what you think the program will do).
• What is the smallest effect that should prompt a policy response?

Picking an effect size to choose the sample
• We can guess an effect size using:
  • economics;
  • past data on the outcome of interest, or even past evaluations.
• What is the smallest effect that would justify adopting the program?
  • The cost of this program vs. the benefits it brings.
  • The cost of this program vs. the alternative uses of the money.

Underpowered
• A common danger is picking effect sizes that are too optimistic: the sample size may then be set too low to detect an actual effect!
• Example (see the simulation sketch below):
  • Evaluators believe a program will increase high school graduation by 15 percentage points.
  • They survey only enough schools to detect increases of 12 percentage points or more.
  • The program increases graduation rates by 10 percentage points, but they miss it entirely due to lack of power!
  • They report that the program had no statistically significant effect, even though it actually had one!
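A simulation makes both types of mistake concrete. This is a sketch, not from the slides: the normal outcome model, the effect sizes, and the 50-per-arm sample are assumptions chosen for illustration. Under the null, the test rejects about 5% of the time (the significance level); under a true effect, the rejection rate is the design's power, and a small true effect with the same sample shows what "underpowered" looks like.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

def rejection_rate(true_effect, n_per_arm, n_sims=5_000, alpha=0.05):
    """Fraction of simulated experiments in which a two-sided t-test
    rejects the null of zero effect (outcomes measured in SD units)."""
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_arm)
        treatment = rng.normal(true_effect, 1.0, n_per_arm)
        _, p_value = stats.ttest_ind(treatment, control)
        rejections += p_value < alpha
    return rejections / n_sims

# No true effect: we falsely reject ~5% of the time (the significance level).
print("type I error rate:   ", rejection_rate(true_effect=0.0, n_per_arm=50))
# True effect of 0.6 SD with 50 per arm: power around 85%.
print("power, 0.6 SD effect:", rejection_rate(true_effect=0.6, n_per_arm=50))
# True effect of 0.2 SD with the same sample: badly underpowered (~17%).
print("power, 0.2 SD effect:", rejection_rate(true_effect=0.2, n_per_arm=50))
```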
How difficult is this to do?
• Proposition I: there exists at least one statistician in the world who has already put the optimal sample size required to address this problem into a magic formula.
• Proposition II: the rule has also been implemented in almost all statistical software.
• It is not difficult to do, and understanding the logic requires only simple calculations (really simple!).

Power: main ingredients
1. Effect size
2. Sample size
3. Variance
4. Proportion of sample in T vs. C
5. Clustering

Larger effect = More power to detect
• A device detects all animals over six feet (1.8 meters) tall.
  • Power to detect adult men: under 10%
  • Power to detect adult women: under 1%
  • Power to detect adult mice: 0%
  • Power to detect adult giraffes: 100%
• The taller the animal (the larger the effect size) we care about, the more power we have (and the less we need).

Effect Size: 1*SE
• The hypothesized effect size determines the distance between the means.
[Figure: null (H0) and alternative (Hβ) sampling distributions for control and treatment, one standard error apart.]

Effect Size = 1*SE
[Figure: the same distributions, with the significance threshold marked.]

Power: 26%
• If the true impact were 1*SE, the null hypothesis would be rejected only 26% of the time.
[Figure: the area of the Hβ distribution beyond the significance threshold, the power, is 26%.]

Effect Size: 3*SE
• A bigger hypothesized effect size → distributions farther apart.
[Figure: H0 and Hβ distributions, three standard errors apart.]

Effect size 3*SE: Power = 91%
• A bigger effect size means more power.
[Figure: with the distributions 3*SE apart, 91% of the Hβ distribution lies beyond the significance threshold.]

Poll: What effect size should you use when designing your experiment?
A. The smallest effect size that is still cost-effective
B. The largest effect size you expect your program to produce
C. Both
D. Neither

Power: main ingredients
1. Effect size
2. Sample size
3. Variance
4. Proportion of sample in T vs. C
5. Clustering

Poll: By increasing sample size you increase…
A. Accuracy
B. Precision
C. Both
D. Neither
E. Don't know

Larger sample size = More power to detect
• We want to know the average age in the city.
• If we randomly pick one person in the city, we might pick a 100-year-old.
• If we randomly pick 2,000 people, even if we pick the 100-year-old as one of them, he will be balanced out by the other random selections.
• This intuition extends to effect sizes (the sketch after these figures shows the arithmetic).

Power: Effect size = 1*SE, Sample size = N
[Figure: control and treatment sampling distributions one standard error apart, with the significance threshold marked.]

Power: Sample size = 4N
[Figure: quadrupling the sample halves the standard error, pulling the two distributions apart.]

Power: 64%
[Figure: with sample size 4N, power rises to 64%.]

Power: Sample size = 9N
[Figure: with nine times the sample, the standard error falls to one third of its original value.]

Power: 91%
[Figure: with sample size 9N, power rises to 91%.]
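Where do 26%, 64%, and 91% come from? Those figures are consistent with a one-sided test at the 5% level; that convention is our inference, not something the slides state. Under that assumption, a few lines reproduce them: power is the chance the estimate lands beyond the critical value when the truth sits `effect_in_se` standard errors from zero.

```python
from scipy.stats import norm

def power(effect_in_se, alpha=0.05):
    """Power of a one-sided test when the true effect sits `effect_in_se`
    standard errors of the estimator above zero."""
    critical = norm.ppf(1 - alpha)               # rejection threshold under H0
    return 1 - norm.cdf(critical - effect_in_se)

print(f"effect 1*SE:             {power(1.0):.0%}")  # 26%
print(f"effect 3*SE:             {power(3.0):.0%}")  # 91%
# Quadrupling the sample halves the SE, so a 1*SE effect becomes 2*SE;
# nine times the sample cuts the SE to a third, turning it into 3*SE.
print(f"sample 4N (effect 2*SE): {power(2.0):.0%}")  # 64%
print(f"sample 9N (effect 3*SE): {power(3.0):.0%}")  # 91%
```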
Power: main ingredients
1. Effect size
2. Sample size
3. Variance
4. Proportion of sample in T vs. C
5. Clustering

More variance = Less power to detect
• Imagine the following intervention: giving away ten bags of rice.
• In this example, the program has a large effect on ALL poor people, and no effect on ALL rich people.
• Low variance: if our population is all poor, we only need to sample one person to see the true effect of giving away rice.
• High variance: if our population is half poor and half rich, and we randomly sample twenty people, what happens if only 5 of them are poor?

Poll: What are typical ways to reduce the underlying (population) variance?
A. Include covariates
B. Increase the sample
C. Do a baseline survey
D. All of the above
E. A and B
F. A and C

Variance
• There is sometimes very little we can do to reduce the noise: the underlying variance is what it is.
• We can try to "absorb" variance:
  • using a baseline;
  • controlling for other variables.
• In practice, controlling for other variables (besides the baseline outcome) buys you very little.

Power: main ingredients
1. Effect size
2. Sample size
3. Variance
4. Proportion of sample in T vs. C
5. Clustering

More balanced treatment assignment => More power
• What's better?
  • 99 people who get the treatment and one control, or
  • 50 treatment and 50 control?
• This logic continues. What's better?
  • 60 people who get the treatment and 40 control, or
  • 50 treatment and 50 control?

Sample split: 50% C, 50% T
[Figure: H0 and Hβ sampling distributions under an even split, with the significance threshold marked.]

Power: 91%
[Figure: under the 50-50 split, power is 91%.]

What if it's not a 50-50 split?
• What happens to the relative width ("fatness") of the sampling distributions if the split is not 50-50?
• Say, 25-75? (The sketch after the clustering slides reproduces these numbers.)

Sample split: 25% C, 75% T
[Figure: under the 25-75 split, the sampling distributions widen.]

Power: 83%
[Figure: under the 25-75 split, power falls to 83%.]

How unbalanced is too unbalanced?
• Bloom (2006): "Because precision erodes slowly until the degree of imbalance becomes extreme (roughly P ≤ 0.2 or P ≥ 0.8), there is considerable latitude for using an unbalanced allocation."
• This helps if…
  • politics dictate a small control group, or
  • costs dictate a small treatment group.

Power: main ingredients
1. Effect size
2. Sample size
3. Variance
4. Proportion of sample in T vs. C
5. Clustering

Clustered design: definition
• In sampling: when clusters of individuals (e.g. schools, communities) are randomly selected from the population, before selecting individuals for observation.
• In randomized evaluation: when clusters of individuals are randomly assigned to different treatment groups.

Clustered design: intuition
• You want to know how close the upcoming national elections will be.
• Method 1: randomly select 50 people from the entire Indian population.
• Method 2: randomly select 5 families, and ask ten members of each family their opinion.

[Figure: two panels contrasting LOW intra-cluster correlation (ICC, aka ρ, "rho") with HIGH intra-cluster correlation.]

Poll: All uneducated people live in one village. People with only primary education live in another. College grads live in a third, etc. The ICC (ρ) on education will be…
A. High
B. Low
C. No effect on rho
D. Don't know

Poll: If ICC (ρ) is high, what is a more efficient way of increasing power?
A. Include more clusters in the sample
B. Include more people in each cluster
C. Both
D. Don't know
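The 91% and 83% figures, and the answer to this last poll, can be checked with a short calculation. This sketch keeps the one-sided 5% convention assumed earlier; `power_at_split` and the design-effect arithmetic are our illustrations, not formulas from the slides. The standard error of the treatment/control difference is proportional to 1/√(P(1−P)), so tilting the split away from 0.5 inflates it, while clustering shrinks the effective sample by the design effect 1 + (n−1)ρ.

```python
from math import sqrt
from scipy.stats import norm

def power_at_split(effect_se_even, P, alpha=0.05):
    """One-sided power when a share P is treated, for an effect measured in
    standard errors under an even (P = 0.5) split. The SE scales with
    1/sqrt(P*(1-P)), so moving away from 0.5 inflates it."""
    inflation = sqrt(0.25 / (P * (1 - P)))
    return 1 - norm.cdf(norm.ppf(1 - alpha) - effect_se_even / inflation)

print(f"50-50 split: {power_at_split(3.0, 0.50):.0%}")  # 91%
print(f"25-75 split: {power_at_split(3.0, 0.75):.0%}")  # 83%
print(f"10-90 split: {power_at_split(3.0, 0.90):.0%}")  # erodes fast past Bloom's bounds

# Clustering: with intra-cluster correlation rho and n people per cluster,
# precision is as if the sample were divided by the design effect.
rho, n = 0.3, 10
deff = 1 + (n - 1) * rho
print(f"design effect: {deff:.1f}")  # 3.7: ten clustered people ~ 2.7 independent ones
```

Adding clusters attacks the J in the variance formula directly, while adding people within clusters only chips away at the (1−ρ)/n term; that is why, when ρ is high, more clusters is the more efficient route.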
Further topics: Imperfect compliance
• In some cases, policymakers or researchers can assign individuals to a given treatment arm, but this does not mean those individuals will take it up. What does this mean for power?
• Consider an extreme case in which nobody in the treatment group takes up the treatment. In that case, no matter how big the sample size, you cannot detect the treatment's impact, because you never see it.
• Alternatively, what happens if everybody ends up getting the treatment, in both the treatment and control groups?
• The required sample size is inversely proportional to $(c-d)^2$, where
  • $c$ is the fraction of the treatment group who 'comply' (take up the treatment), and
  • $d$ is the fraction of the control group who 'defy' (take it up anyway).

Wrap-up on Power
• Power calculations look scary, but they are just a formalization of common sense.
• At times we do not have the right information to conduct them properly.
• However, it is important to spend effort on them:
  • avoid launching studies that will have no power at all: a waste of time and money, and potentially harmful;
  • devote the appropriate resources (and not too many) to the studies that you do decide to conduct.

Appendix: The nuts and bolts (1)
For an experimental design with perfect compliance and individual-level assignment (no clustering), where $\alpha$ is the significance level, $1-\kappa$ the desired power, $\sigma^2$ the outcome variance, and $P$ the share of the sample assigned to treatment:
• Minimum detectable effect, for sample size $N$:
  $MDE = (t_{1-\kappa} + t_{\alpha/2}) \sqrt{\dfrac{\sigma^2}{P(1-P)\,N}}$
• Minimum sample size, for hypothesized effect size $\beta$:
  $N = \dfrac{(t_{1-\kappa} + t_{\alpha/2})^2\, \sigma^2}{\beta^2\, P(1-P)}$

Appendix: The nuts and bolts (2)
When compliance becomes imperfect, with $c$ the fraction of those assigned to treatment who take it up and $d$ the fraction of the control group who do likewise:
• Minimum detectable effect, for sample size $N$:
  $MDE = \dfrac{t_{1-\kappa} + t_{\alpha/2}}{c-d} \sqrt{\dfrac{\sigma^2}{P(1-P)\,N}}$
• Minimum sample size, for hypothesized effect size $\beta$:
  $N = \dfrac{(t_{1-\kappa} + t_{\alpha/2})^2\, \sigma^2}{(c-d)^2\, \beta^2\, P(1-P)}$

Appendix: The nuts and bolts (3)
For an experimental design with perfect compliance and group-based assignment, with intra-cluster correlation $\rho$:
• Minimum detectable effect, for $J$ groups of $n$ members each:
  $MDE = (t_{1-\kappa} + t_{\alpha/2}) \sqrt{\dfrac{\sigma^2}{J\,P(1-P)} \left(\rho + \dfrac{1-\rho}{n}\right)}$
• Minimum number of groups, for hypothesized effect size $\beta$:
  $J = \dfrac{(t_{1-\kappa} + t_{\alpha/2})^2\, \sigma^2}{\beta^2\, P(1-P)} \left(\rho + \dfrac{1-\rho}{n}\right)$
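As a check on the appendix formulas, here is a direct translation into code. It is a sketch under two assumptions we are adding: normal critical values stand in for the $t$ values (a standard large-sample approximation), and the default arguments (5% significance, 80% power, even split) are illustrative rather than taken from the slides. The function names are our own.

```python
from math import sqrt
from scipy.stats import norm

def critical_sum(alpha=0.05, power=0.80):
    """t_{1-kappa} + t_{alpha/2}, approximated with normal critical values."""
    return norm.ppf(power) + norm.ppf(1 - alpha / 2)

def mde(N, sigma2, P=0.5, c=1.0, d=0.0, alpha=0.05, power=0.80):
    """Minimum detectable effect for sample size N; c and d are take-up rates
    in treatment and control (c=1, d=0 is perfect compliance)."""
    return critical_sum(alpha, power) / (c - d) * sqrt(sigma2 / (P * (1 - P) * N))

def n_required(beta, sigma2, P=0.5, c=1.0, d=0.0, alpha=0.05, power=0.80):
    """Minimum sample size for hypothesized effect size beta."""
    return (critical_sum(alpha, power) ** 2 * sigma2
            / ((c - d) ** 2 * beta ** 2 * P * (1 - P)))

def j_required(beta, sigma2, n, rho, P=0.5, alpha=0.05, power=0.80):
    """Minimum number of groups of size n, with intra-cluster correlation rho."""
    return n_required(beta, sigma2, P, 1.0, 0.0, alpha, power) * (rho + (1 - rho) / n)

# Detect a 0.2-SD effect (sigma2 = 1) with 80% power at the 5% level:
print(round(n_required(beta=0.2, sigma2=1.0)))          # ~785 individuals
# If only 60% of the treatment group takes up (and no one in control),
# the required sample grows by 1/(c-d)^2 ~ 2.8 times:
print(round(n_required(beta=0.2, sigma2=1.0, c=0.6)))   # ~2180
# Group-based assignment, 10 members per cluster, rho = 0.3:
print(round(j_required(beta=0.2, sigma2=1.0, n=10, rho=0.3)))  # ~290 clusters
```

Note how the clustered design needs about 2,900 individuals (290 clusters of 10) where the individually randomized design needs 785: exactly the design effect of 3.7 from the previous sketch.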
© Copyright 2024