Significance level vs. confidence level Unit 3: Foundations for inference

Significance level vs. confidence level
Significance level vs. confidence level
Unit 3: Foundations for inference
Lecture 3: Decision errors, significance levels,
sample size
Two sided
One sided
Statistics 101
0.95
Mine C
¸ etinkaya-Rundel
0.9
0.025
0.025
-1.96
May 29, 2013
0
1.96
Two sided HT with α = 0.05
is equivalent to
95% confidence interval.
Statistics 101 (Mine C
¸ etinkaya-Rundel)
Significance level vs. confidence level
0.05
0.05
0
1.65
One sided HT with α = 0.05
is equivalent to
90% confidence interval.
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
2 / 24
Significance level vs. confidence level
Agreement of CI and HT
Poll
A 95% confidence interval for the average waiting time at an emergency room is (128 minutes, 147 minutes). Which of the following is
false?
Confidence intervals and hypothesis tests (almost) always agree,
as long as the two methods use equivalent levels of significance /
confidence.
(a) A hypothesis test of HA : µ , 120 min at α = 0.05 is equivalent to
this CI.
A two sided hypothesis with threshold of α is equivalent to a
confidence interval with CL = 1 − α.
A one sided hypothesis with threshold of α is equivalent to a
confidence interval with CL = 1 − (2 × α).
(b) A hypothesis test of HA : µ > 120 min at α = 0.025 is equivalent
to this CI.
If H0 is rejected, a confidence interval that agrees with the result
of the hypothesis test should not include the null value.
(c) This interval does not support the claim that the average wait time
is 120 minutes.
If H0 is failed to be rejected, a confidence interval that agrees
with the result of the hypothesis test should include the null value.
(d) The claim that the average wait time is 120 minutes would not be
rejected using a 90% confidence interval.
Statistics 101 (Mine C
¸ etinkaya-Rundel)
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
3 / 24
Statistics 101 (Mine C
¸ etinkaya-Rundel)
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
4 / 24
Statistical vs. Practical Significance
Statistical vs. Practical Significance
Sample Size
Statistical vs. Practical Significance
Real differences between the point estimate and null value are
easier to detect with larger samples.
However, very large samples will result in statistical significance
even for tiny differences between the sample mean and the null
value (effect size), even when the difference is not practically
significant.
This is especially important to research: if we conduct a study,
we want to focus on finding meaningful results (we want
observed differences to be real, but also large enough to matter).
The role of a statistician is not just in the analysis of data, but
also in planning and design of a study.
Poll
All else held equal, will p-value be lower if n = 100 or n = 10, 000?
(a) n = 100
(b) n = 10, 000
“To call in the statistician after the experiment is done may be no more than asking
him to perform a post-mortem examination: he may be able to say what the
experiment died of.” – R.A. Fisher
Statistics 101 (Mine C
¸ etinkaya-Rundel)
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
5 / 24
Statistics 101 (Mine C
¸ etinkaya-Rundel)
Finding the appropriate sample size
May 29, 2013
6 / 24
Finding the appropriate sample size
A market researcher collected data about the savings of 82 customers
at a competing car insurance company. The mean and standard deviation of this sample are $395 and $102, respectively. He would like
conduct another survey but have a margin of error of approximately
$10 at a 99% confidence level. How large of a sample should he collect?
Poll
A random sample of 100 students who took online speed reading
courses yielded an average increase in reading speed of 415% and
a standard deviation of 220%. We would like to calculate a 95% confidence interval for the average increase in reading speed with a margin
of error of no more than 15%. How many students, at a minimum,
should we sample?
Original sample: x¯ = 395, s = 102, n = 82
Desired CI: ME = 10, confidence level = 99%
z ? for a 99% confidence level =
?
U3 - L3: Decision errors, significance levels, sample size
(a) 29
102
s
ME = z × √ → 10 = 2.58 × √
n
n
n
2.58 × 102
10
= 692.5319
(b) 586
(c) 826
!2
≥
(d) 827
He should survey at least 693 customers.
Statistics 101 (Mine C
¸ etinkaya-Rundel)
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
7 / 24
Statistics 101 (Mine C
¸ etinkaya-Rundel)
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
8 / 24
Decision errors
Type 1 and Type 2 errors
Decision errors
Decision errors
Type 1 and Type 2 errors
Decision errors (cont.)
There are two competing hypotheses: the null and the alternative. In
a hypothesis test, we make a decision about which might be true, but
our choice might be incorrect.
Hypothesis tests are not flawless.
In the court system innocent people are sometimes wrongly
convicted and the guilty sometimes walk free.
H0 true
Truth
Similarly, we can make a wrong decision in statistical hypothesis
tests as well.
The difference is that we have the tools necessary to quantify
how often we make errors in statistics.
HA true
Decision
fail to reject H0
reject H0
X
Type 1 Error
Type 2 Error
X
A Type 1 Error is rejecting the null hypothesis when H0 is true.
A Type 2 Error is failing to reject the null hypothesis when HA is
true.
We (almost) never know if H0 or HA is true, but we need to
consider all possibilities.
Statistics 101 (Mine C
¸ etinkaya-Rundel)
U3 - L3: Decision errors, significance levels, sample size
Decision errors
May 29, 2013
9 / 24
Statistics 101 (Mine C
¸ etinkaya-Rundel)
Type 1 and Type 2 errors
U3 - L3: Decision errors, significance levels, sample size
Decision errors
May 29, 2013
10 / 24
Type 1 and Type 2 errors
Hypothesis Test as a trial
Poll
If we again think of a hypothesis test as a criminal trial then it makes
sense to frame the verdict in terms of the null and alternative
hypotheses:
A patient named Diana was diagnosed with Fibromyalgia, a longterm syndrome of body pain, and was prescribed anti-depressants.
Being the skeptic that she is, Diana didn’t initially believe that
anti-depressants would help her symptoms. However after a couple months of being on the medication she decides that the antidepressants are working, because she feels like her symptoms are
in fact getting better. What is a Type 1 error in this context?
H0 : Defendant is innocent
HA : Defendant is guilty
Which type of error is being committed in the following cirumstances?
Declaring the defendant innocent when they are actually guilty
(a) Concluding that anti-depressants work for the treatment of
Fibromyalgia symptoms when they actually do not.
Declaring the defendant guilty when they are actually innocent
(b) Concluding that anti-depressants do not work for the treatment of
Fibromyalgia symptoms when they actually do.
Which error do you think is the worse error to make?
“better that ten guilty persons escape than that one innocent suffer”
– William Blackstone
Statistics 101 (Mine C
¸ etinkaya-Rundel)
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
11 / 24
Statistics 101 (Mine C
¸ etinkaya-Rundel)
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
12 / 24
Decision errors
Error rates & power
Decision errors
Type 1 error rate
Error rates & power
Filling in the table...
As a general rule we reject H0 when the p-value is less than
0.05, i.e. we use a significance level of 0.05, α = 0.05.
Truth
This means that, for those cases where H0 is actually true, we do
not want to incorrectly reject it more than 5% of those times.
In other words, when using a 5% significance level there is about
5% chance of making a Type 1 error.
H0 true
Decision
fail to reject H0
reject H0
1−α
Type 1 Error, α
HA true
Type 2 Error, β
Power, 1 − β
Type 1 error is rejecting H0 when you shouldn’t have, and the
probability of doing so is α (significance level)
Type 2 error is failing to reject H0 when you should have, and the
probability of doing so is β (a little more complicated to calculate)
P (Type 1 error) = α
This is why we prefer to small values of α – increasing α
increases the Type 1 error rate.
Power of a test is the probability of correctly rejecting H0 , and the
probability of doing so is 1 − β
In hypothesis testing, we want to keep α and β low, but there are
inherent trade-offs.
Statistics 101 (Mine C
¸ etinkaya-Rundel)
U3 - L3: Decision errors, significance levels, sample size
Decision errors
May 29, 2013
13 / 24
Statistics 101 (Mine C
¸ etinkaya-Rundel)
Error rates & power
U3 - L3: Decision errors, significance levels, sample size
Decision errors
Type 2 error rate
May 29, 2013
14 / 24
Error rates & power
Achieving desired power
There are several ways to increase power (and hence decrease type
2 error rate):
If the alternative hypothesis is actually true, what is the chance that
we make a Type 2 Error, i.e. we fail to reject the null hypothesis even
when we should reject it?
1
Increase the sample size.
2
Decrease the standard deviation of the sample, which essentially
has the same effect as increasing the sample size (it will
decrease the standard error). With a smaller s we have a better
chance of distinguishing the null value from the observed point
estimate. This is difficult to ensure but cautious measurement
process and limiting the population so that it is more
homogenous may help.
The answer is not obvious.
If the true population average is very close to the null hypothesis
value, it will be difficult to detect a difference (and reject H0 ).
If the true population average is very different from the null
hypothesis value, it will be easier to detect a difference.
Clearly, β depends on the effect size (δ): difference between the
point estimate and the null value.
3
4
Statistics 101 (Mine C
¸ etinkaya-Rundel)
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
15 / 24
Increase α, which will make it more likely to reject H0 (but note
that this has the side effect of increasing the Type 1 error rate).
Consider a larger effect size. If the true mean of the population is
in the alternative hypothesis but close to the null value, it will be
harder to detect a difference.
Statistics 101 (Mine C
¸ etinkaya-Rundel)
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
16 / 24
Example: Dancers
Example: Dancers
5
A random sample of 36 female college-aged dancers was obtained and their
heights (in inches) were measured. Provided below are some summary
statistics and a histogram of the distribution of these dancers’ heights. The
average height of all college-aged females is 64.5 inches. Do these data
provide convincing evidence that the average height of female college-aged
dancers is different from this value?
4
36
63.6 inches
sd
2.13 inches
Statistics 101 (Mine C
¸ etinkaya-Rundel)
0
1
mean
2
3
n
60
61
62
63
64
65
66
U3 - L3: Decision errors, significance levels, sample size
67
May 29, 2013
17 / 24
Statistics 101 (Mine C
¸ etinkaya-Rundel)
Example: Dancers
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
18 / 24
Example: Dancers
Poll
Poll
What is the equivalent confidence level for this two-sided hypothesis
test with α = 0.05?
(a) 80%
(b) 90%
If we were to calculate a confidence interval with the equivalent confidence level to this hypothesis test, would this interval include 64.5 (the
null value)?
(a) Yes
(c) 95%
(b) No
(d) 99.7%
(c) Cannot tell without calculating the interval
(e) 97.5%
Statistics 101 (Mine C
¸ etinkaya-Rundel)
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
19 / 24
Statistics 101 (Mine C
¸ etinkaya-Rundel)
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
20 / 24
Example: Unemployment and relationship problems
Example: Unemployment and relationship problems
Poll
A USA Today/Gallup poll conducted between 2010 and 2011 asked
a group of unemployed and underemployed Americans if they have
had major problems in their relationships with their spouse or another
close family member as a result of not having a job (if unemployed) or
not having a full-time job (if underemployed). 27% of the 1,145 unemployed respondents and 25% of the 675 underemployed respondents
said they had major problems in relationships as a result of their employment status. Which of the following is the correct set of hypotheses?
ˆunemp = pˆunderemp
(a) H0 : p
ˆunemp , pˆunderemp
HA : p
(c) H0 : punemp = punderemp
HA : punemp > punderemp
(b) H0 : punemp = punderemp
HA : punemp , punderemp
(d) H0 : µunemp = µunderemp
HA : µunemp , µunderemp
Statistics 101 (Mine C
¸ etinkaya-Rundel)
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
21 / 24
The p-value for this hypothesis test is approximately 0.35. Explain
what this means in context of the hypothesis test and the data.
Statistics 101 (Mine C
¸ etinkaya-Rundel)
Review: CLT
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
22 / 24
Review: CLT
Suppose an iPod has 3,000 songs. The histogram below shows the distribution of the lengths of these songs. We also know that, for this iPod, the mean
length is 3.45 minutes and the standard deviation is 1.63 minutes. Calculate
the probability that a randomly selected song lasts more than 5 minutes.
800
600
You are about to take a trip to visit your parents and the drive is 6 hours. You
make a random playlist of 100 songs. What is the probability that your playlist
lasts the entire drive?
400
200
0
0
2
4
6
8
10
Length of song
The distribution of the lengths of these songs is slightly right skewed
to the right, therefore it is not reasonable to use a normal model to
estimate this probability.
However, we can approximate the probability using the histogram.
P (X > 5) =
Statistics 101 (Mine C
¸ etinkaya-Rundel)
350 + 100 + 25 + 20 + 5
500
=
= 0.17
3000
3000
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
23 / 24
Statistics 101 (Mine C
¸ etinkaya-Rundel)
U3 - L3: Decision errors, significance levels, sample size
May 29, 2013
24 / 24