Sample size and power Jon Michael Gran Department of Biostatistics, UiO MF9130 Introductory course in statistics Wednesday 08.06.2011 1 / 39 Overview Sample size and power (Aalen chapter 9.6, Kirkwood and Sterne chapter 35) • Principles of sample size • Sample size for precision • Sample size in hypothesis testing, power 2 / 39 Sample size Conclusions in scientific studies greatly depend on sample size • A too small sample reduces the chance to get significant findings and reduce the generalizability of the study → Non-significant results not interesting • There are methods for calculating the required sample size • Calculation of required sample size is done prior to start of study • See e.g. Kirkwood and Sterne Section 35, for a good discussion 3 / 39 Example: Pocock et al. BMJ 1994; 309:1189-97 Meta-analysis: change in IQ (children above 5 years) for increase in blood lead from 10 to 20 µg/dl, using three measures of blood lead 4 / 39 Sources of error when testing • Two main sources: I Systematic error I Random error • Systematic errors are due to a faulty design: Bias, lack of randomization, no blinding etc. • Random errors are due to just random variation. Can be reduced by increasing the sample size 5 / 39 Random errors • Remember the calculations of standard error in previous lectures • The effect of random error decreases by the square root of n as the sample increases • Standard error of the mean: σ se = √ n • Standard error of a proportion: s se = p(1 − p) n 6 / 39 Illustration: Standard error of the mean of a variable with standard deviation σ = 1 7 / 39 Sample size calculations You will have to quantify: • Realistic or clinically relevant difference between groups • Expected variation in the data (e.g. standard deviation) • Significance level, e.g. 5% • Wanted power - or alternatively; how many individuals n you can afford to include in the study 8 / 39 So what is power? • Power = P(reject H0 |Ha true) • With words: The power is the probability of rejecting the null hypothesis if the alternative hypothesis is true • Or: The probability to find a difference between the groups if there really are a difference • More subjects → higher power 9 / 39 How large a power should you have? • As high as possible; but a question about what is possible with regards to sample size • The usual minimum for a study with “good” power is 80% (“Industry minimum”) • Many large definitive studies aim at a power of 99.9% 10 / 39 First: Sample size for precision • Want to construct a 95% confidence interval with a given degree of precision • Wish to be 95% certain that the estimate is within ±2se of the true value (or 1.96 to be precise) • If the estimate is normally distributed, sample mean ±2se covers approximately 95%. This means that a = 2se 11 / 39 Remember the normal distribution • µ = expected value, σ = standard deviation • µ ± 2σ covers approximately 95% of the distribution (a = 2σ) 12 / 39 Sample size for estimating the mean with required accuracy (a) • To estimate a mean, the following number of observations are needed: n= (because a = 2se = 2 · 4σ 2 a2 √σ ) n 13 / 39 Sample size for estimating the mean with required accuracy (a) • To estimate a mean, the following number of observations are needed: n= (because a = 2se = 2 · 4σ 2 a2 √σ ) n • Example: Number of observations required to estimate the cholesterol level (mmol/dl) in the population with a precision of 0.5. Suppose σ = 1. Number of observations required: n= 4 × 12 = 16 0.52 13 / 39 Sample size for estimating a proportion • To estimate a proportion, the following number of observations are needed: n= (because a = 2se = 2 · q 4p(1 − p) a2 p(1−p) ) n 14 / 39 Sample size for estimating a proportion • To estimate a proportion, the following number of observations are needed: n= (because a = 2se = 2 · q 4p(1 − p) a2 p(1−p) ) n • Example: Number of observations required to estimate the prevalence of depression in the population with accuracy 0.03. Suppose p = 0.1. Number of observations required: n= 4 × 0.1(1 − 0.1) = 400 0.032 14 / 39 Sample size in hypothesis testing • Study sample has to be so large that an interesting effect will turn out as significant, with a large probability • This means that negative results will also be interesting if the power is high (and worthy of publication) Figur: For example testing for difference in two normally distributed variables 15 / 39 Types of errors in hypothesis testing • Type I error: To conclude that there is an effect (or difference), when in reality there is not 16 / 39 Types of errors in hypothesis testing • Type I error: To conclude that there is an effect (or difference), when in reality there is not I I Controlled by the significance level. Probability: α Usually choose significance level α = 5%: Only 5% chance of discovering an effect (rejecting the null hypothesis) when there really is no effect 16 / 39 Types of errors in hypothesis testing • Type I error: To conclude that there is an effect (or difference), when in reality there is not I I Controlled by the significance level. Probability: α Usually choose significance level α = 5%: Only 5% chance of discovering an effect (rejecting the null hypothesis) when there really is no effect • Type II error: Not discovering a true effect I Controlled by the sample size. Probability: β I Power: 1-β (chance of discovering the effect) I A minimum for ’good’ power was said to usually be 80%: so 80% chance of discovering a true effect (rejecting the null hypothesis if there is a difference) 16 / 39 Effect size (or standardized difference) • In order to do sample size calculations, one often needs to calculate the effect size • Also called standardized difference • Formulas for the effect size depends on which methods you want to use for the analysis I I I Two sample t-test? Paired t-test? Comparing proportions? • Effect size depends on I Clinically relevant and realistic difference (often called ∆) I The standard deviation of your measured variable 17 / 39 Different effect sizes The smaller the effect size is, the more data you need to sample to get high power. This is because the data distributions will overlap to a large extent for small effect sizes 18 / 39 From: http://www.uccs.edu/∼lbecker/psy590/es.htm 19 / 39 Nomogram from Altman: Practical statistics for Medical Research 20 / 39 Using the nomogram: Comparing means in two groups • Clinically relevant (and realistic) difference: ∆ • Standard deviation in both groups: σ • Effect size is given by: ∆/σ 21 / 39 Using the nomogram: Comparing means in two groups • Clinically relevant (and realistic) difference: ∆ • Standard deviation in both groups: σ • Effect size is given by: ∆/σ • Example: Comparing cholesterol level in two groups: I Clinically relevant (and realistic) difference: ∆ = 0.5 I Standard deviation: σ = 1 I Effect size:∆/σ = 0.5 I A relevant difference of 0.5 means that if the average level in the two groups is for example 5.7 and 6.2, it would be important to discover • Chose significance level α = 0.05 and power 1 − β = 0.80 21 / 39 Using the nomogram from Altman 22 / 39 Conclusion, cholesterol example • An effect size of 0.5 gives a sample size of 120, i.e. need to measure cholesterol level in 60 patients in each group in order to get a significant effect • The nomogram always states the total sample size 23 / 39 Sample size for continuous data - paired data • E.g. a cross-over study • Effect size: 2 · ∆/σd , where σd is the standard deviation of the difference between the two measurements • New cholesterol reducing drug, what is the effect of using it for a month? • A difference of 0.5 is important to discover, assume σd = 1 • Need to sample 30 patients 24 / 39 Using the nomogram from Altman 25 / 39 Sample size for proportions • Compare proportions in two groups I Initial guess on the proportions: p and p 1 2 I Relevant difference: p − p 1 2 I Average proportion: p ¯ = (p1 + p2 )/2 p1 −p2 I Effect size: p ¯ ×(1−¯ p) 26 / 39 Sample size for proportions • Compare proportions in two groups I Initial guess on the proportions: p and p 1 2 I Relevant difference: p − p 1 2 I Average proportion: p ¯ = (p1 + p2 )/2 p1 −p2 I Effect size: p ¯ ×(1−¯ p) • Example: Compare the prevalence of depression in two populations I I I Guess on the proportions: 0.10 and 0.20 Average proportion: p¯ = 0.15 Effect size: √ 0.20−0.10 = 0.28 0.15×(1−0.15) 26 / 39 Using the nomogram from Altman 27 / 39 Conclusion, depression example • Choose: α = 0.05 (5%) 1 - beta = 0.80 (80%) • An effect size of 0.28 gives a sample size of 400, i.e. need 200 patients in each group to get a significant difference 28 / 39 We can also do sample size calculations in another way • Example: Want to test the effect of nicotine gum. 15% of quitting smokers are still not smoking after 6 months. If gum increases this proportion to 30%, we want a significant test • Average proportion is 0.225, effect size is 0.36 (calculate yourself) • For financial reasons, can only afford 200 patients in total • Get power of 75%, i.e. 75% chance of discovering this effect 29 / 39 Using the nomogram from Altman 30 / 39 Using formulas: example from Kirkwood and Sterne • Study comparing a continuous variable in two groups: n= (u+v )2 (σ12 +σ02 ) (µ1 −µ0 )2 where n is the sample size in each group, σi is the standard deviation in group i, µ1 − µ0 = ∆ is the clinical relevant difference and u and v are constants depending on the significance level and power (90% power give u = 1.28, and 5% sign. level give v = 1.96, see tables for other values) • Note: Sample size calculation depend on three main quantities: The desired power, the clinical relevant difference, and σ. In other words; the more precise measure, the less n you need 31 / 39 The effect of averaging from normal distributions Becomes easy to get significant differences, since the distributions do not overlap anymore! 32 / 39 Different size of the groups • If you need 400 patients in total, do you need exactly 200 in each group? • Broadly speaking: It does not matter that much • If you need 400 patients in total, you can put 250 in one group and 150 in the other • If you want an exact answer, there are formulas for this. Use literature 33 / 39 You have done sample size calculations, what can still go wrong? • Have been too optimistic on how fast patients are included in the study I I I Too strict inclusion criteria Too time demanding to collect information from each patient Drop outs • Negative surprises can be avoided by doing a pilot study! • Wise to include more patients than estimated from the sample size calculations to allow for drop outs (if possible) 34 / 39 Final remarks • Number of required individuals will vary a lot between studies • A large difference between groups means that a study with 30 in each group is as good as a study with 100 in each group • However, if the difference is smaller than expected, the power will decrease • The estimates for effect size and variance has a lot of impact on these calculations! Are they correct? • What if you don’t know the variance? (usually you don’t) Look in similar studies or create tables with different variances and choose the highest sample size you could afford 35 / 39 Final remarks (cont.) • Important to come up with a few main hypotheses that you wish to test I Choosing significance level 5% means that you will reject a null hypothesis that is true in reality 5% of the time! • If you find that other research hypotheses are more interesting after you have collected your data, the initial sample size calculations may be worthless • We have seen simple examples; what if your research hypothesis specifies a regression problem with many variables? I Sample size calculation quickly becomes very uncertain and difficult. May be pragmatic and do calculations of sample size on the basis of a few main variables • Generally: Need larger samples to control for more variables. Use programs and see what is available for ANOVA, regression and so on 36 / 39 Software for calculating sample size • Buyable programs: I SamplePower http://www.spss.com/samplepower I nQuery http://www.statsol.ie/nquery/nquery.htm • Free programs: http://statpages.org/ I This page also includes other statistical programs 37 / 39 38 / 39 Summary • For fast or approximate calculations use the nomogram from Altman • See table 35.1 in Kirkwood and Sterne for overview of exact formulas • Power formulas are not hard, but can be very algebra intensive - may want to use a computer program • Take all calculations with a few grain of salt - fudge factor is important! • If in doubt when designing a study - ask a statistician. You don’t want to use months or years collecting data, only to learn that they are impossible to analyze the way you want to - analyse comes after design 39 / 39
© Copyright 2024