Sample Size and Power
Jörg Aßmus, Haugesund, 14.6.2011

"I only believe in statistics that I doctored myself."
(attributed to Winston Churchill)

Outline
1. Introduction: Why do we talk about sample size?
2. A short introduction to hypothesis testing
3. Power and sample size
4. What can we do?

Introduction

What is the target of a study?
RCT (randomized controlled trial):
  Sample → Treatment group → Output 1
         → Control group   → Output 2
Research question: Is there a difference between the outputs?
Methodical question: What do we need in order to see a difference (if there is one)?

Introduction - A simple example

Body height of 20-year-old men in Germany and the Netherlands (heights in m):

  No.   1     2     3     4     5     6     7     8     9     10
  GER   1.79  1.89  1.77  1.95  1.80  1.85  1.80  1.76  1.82  1.77
  NED   1.77  1.89  1.73  1.87  2.00  1.80  1.89  1.87  1.88  1.89

Testing the difference (t-test):

  Sample size   ∆       p-value   Difference?
  5             0.014   0.8145    no
  10            0.040   0.1993    no
  50            0.046   0.0058    yes
  100           0.060   0.0000    yes
  1000          0.054   0.0000    yes

Conclusion: The test result depends on the sample size.

Introduction: What do we want to do?

Conclusion from the introductory example: We can see an existing difference if we use a sufficiently large sample size.

Question: Why don't we simply take the sample size as large as possible?
1. Economy:
   - Why should we include more subjects than we have to?
   - Every trial costs!
2. Ethics:
   - We should never burden more test persons or animals than necessary.
3. Statistics:
   - We can prove almost any effect if we only have a sufficiently large sample size.
   - Field of tension: statistical significance vs. clinical relevance.

We have to find the correct sample size to detect the desired effect: not too small, not too large.

What do we need on the way?
- How does a test work?
- What does "power of a test" mean?
- What determines the sample size?
- How do we handle this in practical tasks?

A short introduction to hypothesis testing

1. Formulate a hypothesis:
   Null hypothesis H0: Eh1 = Eh2 (expected heights equal)
   vs. alternative H1: Eh1 ≠ Eh2 (expected heights different)
2. Find an appropriate test statistic:
   T = √n · |Eh2 − Eh1| / σ
3. Compute the observed test statistic:
   T_obs = √n · |ĥ2 − ĥ1| / s_pooled
4. Reject the null hypothesis H0 if T_obs is too large.
   But what does "too large" mean?

Possible results of a single test

                          Reality
  Test decision    H0 true         H0 false
  accept           RIGHT           type II error
  reject           type I error    RIGHT

How can we know? The test decision is obvious; the underlying reality is not.

Wrong decisions:
· Rejection even though H0 is true (type I error)
· No rejection even though H0 is false (type II error)

What do we want? A reduction of the wrong decisions, i.e. minimal probability for both types of errors.

Dilemma: For a given data set and a given test method it is impossible to reduce both error types at once.
⇒ We try to deal only with the type I error.
⇒ We assume that H0 is true.

Statistical approach. Idea: What is the probability that everything happened by accident?

Solution: the p-value.
For a given data set, the p-value is the probability of obtaining the observed test statistic, or a more extreme one, assuming that the null hypothesis is true:
  p := P(T > T_obs | H0)

Remarks:
- For given data, the p-value is a fixed number: it is characteristic of the data set.
- Theoretically, the p-value is a random variable, so it has a distribution.

[Figure: density of the test statistic, with the observed value T_obs and the critical value T_crit marked.]

We reject the null hypothesis if the p-value is below a given significance level α (usual convention: α = 0.05). The probability of a type I error (incorrect rejection) is then no larger than the chosen significance level.
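The slides contain no code; as a minimal sketch (Python with numpy and scipy, an addition here), steps 2-4 can be run on the n = 10 height data from the introductory example:

```python
import numpy as np
from scipy import stats

# Height data from the table above (n = 10 per country)
ger = np.array([1.79, 1.89, 1.77, 1.95, 1.80, 1.85, 1.80, 1.76, 1.82, 1.77])
ned = np.array([1.77, 1.89, 1.73, 1.87, 2.00, 1.80, 1.89, 1.87, 1.88, 1.89])

# Observed effect: difference of the sample means (about 0.04 m)
delta = ned.mean() - ger.mean()

# Two-sample t-test with pooled standard deviation (equal_var=True)
t_obs, p_value = stats.ttest_ind(ned, ger, equal_var=True)

print(f"delta = {delta:.3f}, T_obs = {t_obs:.2f}, p = {p_value:.4f}")
# With only n = 10 the p-value stays clearly above 0.05: no significant
# difference, matching the qualitative conclusion of the n = 10 table row.
```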
A simulation experiment

We generate data for two populations with the following properties:
- All data are Gaussian.
- Equal for both populations:
  · Mean: µ1 = µ2 = 0
  · Standard deviation: σ1 = σ2 = 1
  · Sample size: M = 20
- Different for the populations: nothing.
- Test problem: H0: µ1 = µ2 vs. H1: µ1 ≠ µ2

Approach:
1. Generate a data set for both populations.
2. Compute the p-value for a t-test (H0: no difference).
3. Plot the p-value.
4. Repeat 1.-3.

Result of the experiment

[Figure: distribution of the p-values under H0, 100,000 runs; REJECT: N = 4963 (4.963%), ACCEPT: N = 95037 (95.037%).]

- The p-values are uniformly distributed under a true null hypothesis.
- About 5% of the p-values fall below 0.05.
- This holds independently of the sample size.

Power and sample size

What did we learn about tests?
- The test decision is made so as to control the probability of a type I error.
- Interpretation: we control the probability of incorrectly detecting an effect.

Question: What about the probability of NOT detecting an existing effect?

We know: with given data, a given test method, and a given significance level, we can no longer influence the probability of the type II error (recall the test dilemma!). But what does this probability look like? Let us do one more simulation experiment.

A simulation experiment

We generate data for two populations with the following properties:
- All data are Gaussian.
- Equal for both populations:
  · Standard deviation: σ1 = σ2 = 1
  · Sample size: M = 20
- Different for the populations:
  · Means: µ1 = 0, µ2 = 0.8
- Test problem: H0: µ1 = µ2 vs. H1: µ1 ≠ µ2

Approach:
1. Generate a data set for both populations.
2. Compute the p-value for a t-test (H0: no difference).
3. Plot the p-value.
4. Repeat 1.-3.

Result of the experiment

[Figure: distribution of the p-values under H1, 100,000 runs, split into REJECT and ACCEPT regions.]

- Difference detected: ≈ 69% of the trials ← correct decision
- No difference detected: ≈ 31% of the trials ← type II error

The ability of the test to detect the difference is called the power of the test.

Definition. Recall:
- α: probability of a wrong rejection (type I error)
- β: probability of a wrong acceptance (type II error)
The power of a test is the probability of detecting a false null hypothesis:
  Power = 1 − β
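Both experiments are easy to reproduce. A minimal sketch (Python with numpy and scipy, assumed here; the slides themselves do not specify the software used):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
M, alpha, n_sim = 20, 0.05, 100_000   # per-group size, level, repetitions

def rejection_rate(mu1, mu2, sigma=1.0):
    """Steps 1-4: simulate both populations n_sim times, t-test each pair."""
    x = rng.normal(mu1, sigma, size=(M, n_sim))   # sample 1, all repetitions
    y = rng.normal(mu2, sigma, size=(M, n_sim))   # sample 2, all repetitions
    _, p = stats.ttest_ind(x, y, axis=0)          # one p-value per repetition
    return np.mean(p < alpha)

# Under H0 the p-values are uniform: about 5% of them fall below alpha
print("Rejection rate under H0:", rejection_rate(0.0, 0.0))
# Under H1 (mu2 - mu1 = 0.8) the test rejects in roughly 69% of the trials
print("Rejection rate under H1:", rejection_rate(0.0, 0.8))
```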
Question: What does the power of a test depend on?

[Figure: means of simulated data for two samples, repeated 100,000 times; panels for σ = 0.04, 0.08, 0.12, 0.16, 0.2, sample sizes M = 5, 10, 50, 100, 1000, and mean differences ∆µ = 0, 0.05, 0.1, 0.15, 0.2.]

The power depends on:
- the sample size
- the standard deviation
- the effect (mean difference)
- the significance level

⇒ Power and sample size are a "complementary" pair of values.
Rule of thumb: if you know one of them, you know the other.

What did we learn?
• Power: the ability of the test to detect a false null hypothesis (e.g. t-test: the ability to detect a difference).
• Criterion: the type II error; Power = 1 − β.
• Power and sample size: corresponding values.
• The needed sample size depends on:
  - the desired effect (effect size ↓ ⇒ needed sample size ↑)
  - the sample variance (variance ↑ ⇒ needed sample size ↑)
  - the significance level (α ↓ ⇒ needed sample size ↑)
  - the desired test power (power ↑ ⇒ needed sample size ↑)
  - the test type

Let us turn it around: what are the ingredients needed for the computation of the needed sample size?
1. The desired detectable effect:
   - What effect (mean difference, risk, coefficient size) is clinically relevant?
   - Which effect must be detectable to make the study meaningful?
   - This is a choice made by the researcher (not statistically determined!).
2. The variance in the data:
   - e.g. the standard deviation of both samples for a t-test
   - taken from experience, former studies, or pilot studies
3. The significance level α:
   - usually set to α = 0.05
   - adjustments must be taken into account (e.g. multiple testing)
4. The desired power of the test:
   - often 1 − β = 0.8 is used
   - This is a choice made by the researcher (not statistically determined!).
5. The type of test:
   - Different tests for the same problem often have different power.

Computation of the sample size

Problem: There is no general formula for the power or the sample size.
Computation possibilities:
1. The old-fashioned way: Pocock's formula
2. The modern way: statistical packages
3. If nothing else helps: simulation, bootstrap
4. Ask somebody

Computation of the sample size - Pocock's formula

Continuous outcome (t-test):
  N = 2σ² / (µ2 − µ1)² · f(α, β)
- µ1, µ2 ... population means
- σ ... population standard deviation
- α ... significance level
- β ... type II error probability (β = 1 − power)

Dichotomous outcome (χ²-test):
  N = [p1(1 − p1) + p2(1 − p2)] / (p2 − p1)² · f(α, β)
- p1, p2 ... proportions (risks), which determine effect and variance
- f(α, β) ... factor taken from Pocock's table

Computation of the sample size - Pocock's table

  f(α, β)     β = 0.05   β = 0.10   β = 0.20   β = 0.50
  α = 0.10    10.8       8.6        6.2        2.7
  α = 0.05    13.0       10.5       7.9        3.8
  α = 0.02    15.8       13.0       10.0       5.4
  α = 0.01    17.8       14.9       11.7       6.6

- Larger f(α, β) ⇒ larger sample size
- Smaller α ⇒ larger f(α, β) ⇒ larger sample size
- Larger power (smaller β) ⇒ larger f(α, β) ⇒ larger sample size

Computation of the sample size - Program packages

- SPSS SamplePower
  · formerly "Power and Precision"
  · stand-alone, but included in the SPSS license
  · http://www.spss.com/software/statistics/samplepower/
- Included in different program packages:
  · R (package pwr)
  · Stata (sampsi, powerreg)
  · SAS (power)
  · Matlab (sampsizepwr)
- Interactive online calculators:
  · http://statpages.org/#Power (overview)

Problem: How do we deal with different test types?

[Screenshots: SamplePower 2.0; interactive online tools.]

Computation of the sample size - Simulation

- requires programming
- should usually be done by a statistician
- used if there is no adequate program or formula

Idea (a code sketch follows below):
1. Define a target power, e.g. 0.8.
2. Generate artificial data with given parameters:
   · means µ1, µ2
   · standard deviation σ
   · significance level α
   · predefined sample size N
3. Compute the test result.
4. Repeat 2.-3. and count the number of rejections.
5. power = (number of rejections) / (number of simulations)
6. Repeat 2.-5. for different sample sizes.
7. Select the lowest sample size whose power exceeds the one predefined in step 1.

Distinction between simulation and bootstrap:
· Bootstrap: use random subsamples of real data.
· Simulation: generate new data.

[Figure: power simulation (t-test); effect ∆µ = 0.8, σ = 1, α = 0.05; needed sample size: N = 25.]
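The sketch below (Python, an addition here) implements this recipe for the running example (∆µ = 0.8, σ = 1, α = 0.05, target power 0.8) and compares it with Pocock's formula; N is read as the size of each group, which is an assumption on my part since the slides do not say so explicitly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_power(n, delta=0.8, sigma=1.0, alpha=0.05, n_sim=20_000):
    """Steps 2-5: generate data for per-group size n and count rejections."""
    x = rng.normal(0.0, sigma, size=(n, n_sim))
    y = rng.normal(delta, sigma, size=(n, n_sim))
    _, p = stats.ttest_ind(x, y, axis=0)
    return np.mean(p < alpha)

# Pocock for comparison: N = 2*sigma^2 / (mu2 - mu1)^2 * f(alpha, beta),
# with f(0.05, 0.20) = 7.9 from the table above
print("Pocock: N =", 2 * 1.0**2 / 0.8**2 * 7.9, "per group")   # about 24.7

# Steps 6-7: increase N until the simulated power exceeds the target of 0.8
for n in range(5, 101):
    if simulated_power(n) >= 0.8:
        print("Simulation: N =", n, "per group")
        break
# Both approaches should land near N = 25, as in the figure above.
```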
Computation of the sample size - Simulation example

[Figure: empirical power estimation (1000 repetitions); power as a function of the standardized effect ∆µ/σ (0 to 5) for sample sizes between M = 12 and M = 250.]

Computation of the sample size - SamplePower 2.0

• http://www.spss.com/software/statistics/samplepower/
• formerly "Power and Precision"
• stand-alone program included in the SPSS license ⇒ available in Helse Vest (→ email Helse Vest IKT)!
• Different groups of methods:
  - mean comparison (only t-test)
  - proportions (risks, cross tables)
  - correlations
  - ANOVA
  - regression (linear, logistic)
  - survival analysis
  - some noncentral tests
• Help:
  - I did not find a proper book on the program itself.
  - textbook "Power and Precision" (Borenstein, M.)
  - embedded help system (not always easy to understand)
  - tutorials on the web

The textbook "Power and Precision"
- Authors: Michael Borenstein, Hannah Rothstein, Jacob Cohen, David Schoenfeld, Jesse Berlin, Edward Lakatos
- Compatible with SamplePower 2.0

Starting with a simple example: comparison of the means of 2 independent samples (t-test)

- All data are Gaussian.
- Equal for both populations:
  · Standard deviation: σ1 = σ2 = 1
  · Sample size: M = 20
- Different for the populations:
  · Means: µ1 = 0, µ2 = 0.8
- Test problem: H0: µ1 = µ2 vs. H1: µ1 ≠ µ2

What do we need for the calculation?
- the test design
- the means
- the standard deviation
- the sample size? the power?
Recall the experimentally computed power: 1 − β = 0.69 (69%).

What can we do with SamplePower 2.0 for a t-test with 2 independent samples?

- Compute the power for a given effect, standard deviation, and sample size.
- Compute the sample size for a given effect, standard deviation, and power.
- Adjust the significance level, the confidence intervals, and the precision of the numbers (µ, SD, N).
- Create power tables and plots for different significance levels, sample sizes, effects, and standard deviations.

[Screenshots: computing the power for a given sample size; computing the sample size for a given power (0.9).]
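SamplePower is a point-and-click program, so the slides show screenshots rather than code. As a hedged analogue, the same two computations can be sketched with Python's statsmodels (not among the packages listed above; an assumption here):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power for a given sample size: effect 0.8 with sigma = 1 gives d = 0.8
power = analysis.power(effect_size=0.8, nobs1=20, alpha=0.05)
print(f"Power with M = 20 per group: {power:.2f}")   # ~0.69, as in the experiment

# Sample size for a given power of 0.9
nobs = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.9)
print(f"Per-group size for power 0.9: {nobs:.1f}")   # roughly 34
```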
Cross tables (2x2)

Question: Is the appearance of side effects of the treatment associated with the sex of the patient?

            Side effects: No    Side effects: Yes   Total
  Male      238 (74.8%)         80 (25.2%)          318 (48.0%)
  Female    226 (65.5%)         119 (34.5%)         345 (52.0%)
  Total     464 (70.0%)         199 (30.0%)         663 (100%)

p < 0.0001
From the LTBI study (Ann Iren Olsen, Helse Fonna, not published yet).

Cross tables (RxC)

Question: Is the appearance of side effects of the treatment associated with the health region of the patient?

  Health region   No           Yes          Don't know   Total   Percent
  Helse SørØst    224 (64.9%)  107 (31.0%)  14 (4.1%)    345     48.8
  Helse Vest      85 (65.9%)   41 (31.8%)   3 (2.3%)     129     18.1
  Helse Midt      60 (74.1%)   18 (22.2%)   3 (3.7%)     81      11.4
  Helse Nord      99 (63.1%)   33 (21.0%)   25 (15.9%)   157     22.1
  Total           468 (65.7%)  199 (27.9%)  45 (6.3%)    712     100.0

From the LTBI study (Ann Iren Olsen, Helse Fonna, not published yet).

[Screenshots, cross tables (RxC): power with a sample size of 100; power for different sample sizes and significance levels; sample size for a power of 0.9.]
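A hedged sketch of how the 2x2 question could be handled outside a GUI (Python with scipy, an addition here): run the χ²-test on the counts above, then plug the observed side-effect risks into Pocock's dichotomous-outcome formula as the planning effect. Using the observed risks for planning is an illustrative choice, not something the slides prescribe, and the exact p-value depends on the test variant used.

```python
from scipy.stats import chi2_contingency

# Counts from the 2x2 table: rows = male/female, columns = no/yes side effects
table = [[238, 80],
         [226, 119]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")

# Pocock, dichotomous outcome: N = (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2 * f(alpha, beta)
p1, p2 = 0.252, 0.345   # observed side-effect risks (male vs. female)
f_ab = 10.5             # f(0.05, 0.10) from Pocock's table, i.e. power 0.9
n_per_group = (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2 * f_ab
print(f"N per group to detect this risk difference: {n_per_group:.0f}")  # ~500
```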