Significance level vs. confidence level Significance level vs. confidence level Unit 3: Foundations for inference Lecture 3: Decision errors, significance levels, sample size Two sided One sided Statistics 101 0.95 Mine C ¸ etinkaya-Rundel 0.9 0.025 0.025 -1.96 May 29, 2013 0 1.96 Two sided HT with α = 0.05 is equivalent to 95% confidence interval. Statistics 101 (Mine C ¸ etinkaya-Rundel) Significance level vs. confidence level 0.05 0.05 0 1.65 One sided HT with α = 0.05 is equivalent to 90% confidence interval. U3 - L3: Decision errors, significance levels, sample size May 29, 2013 2 / 24 Significance level vs. confidence level Agreement of CI and HT Poll A 95% confidence interval for the average waiting time at an emergency room is (128 minutes, 147 minutes). Which of the following is false? Confidence intervals and hypothesis tests (almost) always agree, as long as the two methods use equivalent levels of significance / confidence. (a) A hypothesis test of HA : µ , 120 min at α = 0.05 is equivalent to this CI. A two sided hypothesis with threshold of α is equivalent to a confidence interval with CL = 1 − α. A one sided hypothesis with threshold of α is equivalent to a confidence interval with CL = 1 − (2 × α). (b) A hypothesis test of HA : µ > 120 min at α = 0.025 is equivalent to this CI. If H0 is rejected, a confidence interval that agrees with the result of the hypothesis test should not include the null value. (c) This interval does not support the claim that the average wait time is 120 minutes. If H0 is failed to be rejected, a confidence interval that agrees with the result of the hypothesis test should include the null value. (d) The claim that the average wait time is 120 minutes would not be rejected using a 90% confidence interval. Statistics 101 (Mine C ¸ etinkaya-Rundel) U3 - L3: Decision errors, significance levels, sample size May 29, 2013 3 / 24 Statistics 101 (Mine C ¸ etinkaya-Rundel) U3 - L3: Decision errors, significance levels, sample size May 29, 2013 4 / 24 Statistical vs. Practical Significance Statistical vs. Practical Significance Sample Size Statistical vs. Practical Significance Real differences between the point estimate and null value are easier to detect with larger samples. However, very large samples will result in statistical significance even for tiny differences between the sample mean and the null value (effect size), even when the difference is not practically significant. This is especially important to research: if we conduct a study, we want to focus on finding meaningful results (we want observed differences to be real, but also large enough to matter). The role of a statistician is not just in the analysis of data, but also in planning and design of a study. Poll All else held equal, will p-value be lower if n = 100 or n = 10, 000? (a) n = 100 (b) n = 10, 000 “To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.” – R.A. Fisher Statistics 101 (Mine C ¸ etinkaya-Rundel) U3 - L3: Decision errors, significance levels, sample size May 29, 2013 5 / 24 Statistics 101 (Mine C ¸ etinkaya-Rundel) Finding the appropriate sample size May 29, 2013 6 / 24 Finding the appropriate sample size A market researcher collected data about the savings of 82 customers at a competing car insurance company. The mean and standard deviation of this sample are $395 and $102, respectively. He would like conduct another survey but have a margin of error of approximately $10 at a 99% confidence level. How large of a sample should he collect? Poll A random sample of 100 students who took online speed reading courses yielded an average increase in reading speed of 415% and a standard deviation of 220%. We would like to calculate a 95% confidence interval for the average increase in reading speed with a margin of error of no more than 15%. How many students, at a minimum, should we sample? Original sample: x¯ = 395, s = 102, n = 82 Desired CI: ME = 10, confidence level = 99% z ? for a 99% confidence level = ? U3 - L3: Decision errors, significance levels, sample size (a) 29 102 s ME = z × √ → 10 = 2.58 × √ n n n 2.58 × 102 10 = 692.5319 (b) 586 (c) 826 !2 ≥ (d) 827 He should survey at least 693 customers. Statistics 101 (Mine C ¸ etinkaya-Rundel) U3 - L3: Decision errors, significance levels, sample size May 29, 2013 7 / 24 Statistics 101 (Mine C ¸ etinkaya-Rundel) U3 - L3: Decision errors, significance levels, sample size May 29, 2013 8 / 24 Decision errors Type 1 and Type 2 errors Decision errors Decision errors Type 1 and Type 2 errors Decision errors (cont.) There are two competing hypotheses: the null and the alternative. In a hypothesis test, we make a decision about which might be true, but our choice might be incorrect. Hypothesis tests are not flawless. In the court system innocent people are sometimes wrongly convicted and the guilty sometimes walk free. H0 true Truth Similarly, we can make a wrong decision in statistical hypothesis tests as well. The difference is that we have the tools necessary to quantify how often we make errors in statistics. HA true Decision fail to reject H0 reject H0 X Type 1 Error Type 2 Error X A Type 1 Error is rejecting the null hypothesis when H0 is true. A Type 2 Error is failing to reject the null hypothesis when HA is true. We (almost) never know if H0 or HA is true, but we need to consider all possibilities. Statistics 101 (Mine C ¸ etinkaya-Rundel) U3 - L3: Decision errors, significance levels, sample size Decision errors May 29, 2013 9 / 24 Statistics 101 (Mine C ¸ etinkaya-Rundel) Type 1 and Type 2 errors U3 - L3: Decision errors, significance levels, sample size Decision errors May 29, 2013 10 / 24 Type 1 and Type 2 errors Hypothesis Test as a trial Poll If we again think of a hypothesis test as a criminal trial then it makes sense to frame the verdict in terms of the null and alternative hypotheses: A patient named Diana was diagnosed with Fibromyalgia, a longterm syndrome of body pain, and was prescribed anti-depressants. Being the skeptic that she is, Diana didn’t initially believe that anti-depressants would help her symptoms. However after a couple months of being on the medication she decides that the antidepressants are working, because she feels like her symptoms are in fact getting better. What is a Type 1 error in this context? H0 : Defendant is innocent HA : Defendant is guilty Which type of error is being committed in the following cirumstances? Declaring the defendant innocent when they are actually guilty (a) Concluding that anti-depressants work for the treatment of Fibromyalgia symptoms when they actually do not. Declaring the defendant guilty when they are actually innocent (b) Concluding that anti-depressants do not work for the treatment of Fibromyalgia symptoms when they actually do. Which error do you think is the worse error to make? “better that ten guilty persons escape than that one innocent suffer” – William Blackstone Statistics 101 (Mine C ¸ etinkaya-Rundel) U3 - L3: Decision errors, significance levels, sample size May 29, 2013 11 / 24 Statistics 101 (Mine C ¸ etinkaya-Rundel) U3 - L3: Decision errors, significance levels, sample size May 29, 2013 12 / 24 Decision errors Error rates & power Decision errors Type 1 error rate Error rates & power Filling in the table... As a general rule we reject H0 when the p-value is less than 0.05, i.e. we use a significance level of 0.05, α = 0.05. Truth This means that, for those cases where H0 is actually true, we do not want to incorrectly reject it more than 5% of those times. In other words, when using a 5% significance level there is about 5% chance of making a Type 1 error. H0 true Decision fail to reject H0 reject H0 1−α Type 1 Error, α HA true Type 2 Error, β Power, 1 − β Type 1 error is rejecting H0 when you shouldn’t have, and the probability of doing so is α (significance level) Type 2 error is failing to reject H0 when you should have, and the probability of doing so is β (a little more complicated to calculate) P (Type 1 error) = α This is why we prefer to small values of α – increasing α increases the Type 1 error rate. Power of a test is the probability of correctly rejecting H0 , and the probability of doing so is 1 − β In hypothesis testing, we want to keep α and β low, but there are inherent trade-offs. Statistics 101 (Mine C ¸ etinkaya-Rundel) U3 - L3: Decision errors, significance levels, sample size Decision errors May 29, 2013 13 / 24 Statistics 101 (Mine C ¸ etinkaya-Rundel) Error rates & power U3 - L3: Decision errors, significance levels, sample size Decision errors Type 2 error rate May 29, 2013 14 / 24 Error rates & power Achieving desired power There are several ways to increase power (and hence decrease type 2 error rate): If the alternative hypothesis is actually true, what is the chance that we make a Type 2 Error, i.e. we fail to reject the null hypothesis even when we should reject it? 1 Increase the sample size. 2 Decrease the standard deviation of the sample, which essentially has the same effect as increasing the sample size (it will decrease the standard error). With a smaller s we have a better chance of distinguishing the null value from the observed point estimate. This is difficult to ensure but cautious measurement process and limiting the population so that it is more homogenous may help. The answer is not obvious. If the true population average is very close to the null hypothesis value, it will be difficult to detect a difference (and reject H0 ). If the true population average is very different from the null hypothesis value, it will be easier to detect a difference. Clearly, β depends on the effect size (δ): difference between the point estimate and the null value. 3 4 Statistics 101 (Mine C ¸ etinkaya-Rundel) U3 - L3: Decision errors, significance levels, sample size May 29, 2013 15 / 24 Increase α, which will make it more likely to reject H0 (but note that this has the side effect of increasing the Type 1 error rate). Consider a larger effect size. If the true mean of the population is in the alternative hypothesis but close to the null value, it will be harder to detect a difference. Statistics 101 (Mine C ¸ etinkaya-Rundel) U3 - L3: Decision errors, significance levels, sample size May 29, 2013 16 / 24 Example: Dancers Example: Dancers 5 A random sample of 36 female college-aged dancers was obtained and their heights (in inches) were measured. Provided below are some summary statistics and a histogram of the distribution of these dancers’ heights. The average height of all college-aged females is 64.5 inches. Do these data provide convincing evidence that the average height of female college-aged dancers is different from this value? 4 36 63.6 inches sd 2.13 inches Statistics 101 (Mine C ¸ etinkaya-Rundel) 0 1 mean 2 3 n 60 61 62 63 64 65 66 U3 - L3: Decision errors, significance levels, sample size 67 May 29, 2013 17 / 24 Statistics 101 (Mine C ¸ etinkaya-Rundel) Example: Dancers U3 - L3: Decision errors, significance levels, sample size May 29, 2013 18 / 24 Example: Dancers Poll Poll What is the equivalent confidence level for this two-sided hypothesis test with α = 0.05? (a) 80% (b) 90% If we were to calculate a confidence interval with the equivalent confidence level to this hypothesis test, would this interval include 64.5 (the null value)? (a) Yes (c) 95% (b) No (d) 99.7% (c) Cannot tell without calculating the interval (e) 97.5% Statistics 101 (Mine C ¸ etinkaya-Rundel) U3 - L3: Decision errors, significance levels, sample size May 29, 2013 19 / 24 Statistics 101 (Mine C ¸ etinkaya-Rundel) U3 - L3: Decision errors, significance levels, sample size May 29, 2013 20 / 24 Example: Unemployment and relationship problems Example: Unemployment and relationship problems Poll A USA Today/Gallup poll conducted between 2010 and 2011 asked a group of unemployed and underemployed Americans if they have had major problems in their relationships with their spouse or another close family member as a result of not having a job (if unemployed) or not having a full-time job (if underemployed). 27% of the 1,145 unemployed respondents and 25% of the 675 underemployed respondents said they had major problems in relationships as a result of their employment status. Which of the following is the correct set of hypotheses? ˆunemp = pˆunderemp (a) H0 : p ˆunemp , pˆunderemp HA : p (c) H0 : punemp = punderemp HA : punemp > punderemp (b) H0 : punemp = punderemp HA : punemp , punderemp (d) H0 : µunemp = µunderemp HA : µunemp , µunderemp Statistics 101 (Mine C ¸ etinkaya-Rundel) U3 - L3: Decision errors, significance levels, sample size May 29, 2013 21 / 24 The p-value for this hypothesis test is approximately 0.35. Explain what this means in context of the hypothesis test and the data. Statistics 101 (Mine C ¸ etinkaya-Rundel) Review: CLT U3 - L3: Decision errors, significance levels, sample size May 29, 2013 22 / 24 Review: CLT Suppose an iPod has 3,000 songs. The histogram below shows the distribution of the lengths of these songs. We also know that, for this iPod, the mean length is 3.45 minutes and the standard deviation is 1.63 minutes. Calculate the probability that a randomly selected song lasts more than 5 minutes. 800 600 You are about to take a trip to visit your parents and the drive is 6 hours. You make a random playlist of 100 songs. What is the probability that your playlist lasts the entire drive? 400 200 0 0 2 4 6 8 10 Length of song The distribution of the lengths of these songs is slightly right skewed to the right, therefore it is not reasonable to use a normal model to estimate this probability. However, we can approximate the probability using the histogram. P (X > 5) = Statistics 101 (Mine C ¸ etinkaya-Rundel) 350 + 100 + 25 + 20 + 5 500 = = 0.17 3000 3000 U3 - L3: Decision errors, significance levels, sample size May 29, 2013 23 / 24 Statistics 101 (Mine C ¸ etinkaya-Rundel) U3 - L3: Decision errors, significance levels, sample size May 29, 2013 24 / 24