STAT 311 Week 14 Worksheet Solutions: Bringing Everything Together
Confidence Intervals
1. The student body president at a large Midwestern university would like to determine if the students at his school support increasing the cost of student parking permits if it would fund building
a new parking garage. The student government takes a simple random sample of 100 students and
finds that 47 of them support the increase.
a.) Calculate a 95% confidence interval for the true proportion of students at that university that
support the increase. Be sure to clearly communicate about the parameter in presenting your interval.
The first step is always to identify the population and the parameter. The population is the group of subjects that we are studying. The first sentence tells us that the people of interest, the population, are
all students at a large Midwestern university. The parameter is the numerical summary of the
population that we want to learn about. In this case, the first sentence tells us that we want
to know “the (TRUE) proportion of students at the school that support increasing the cost of student
parking permits if it would fund building a new parking garage.”
Since we are dealing with a proportion, we can take out our formula sheets and find the two sections
about proportions.
Figure 0.1: Formula Sheet Clip for Sampling Distribution of the Sample Proportion
On the front page of the formula sheet, we find information about the sampling distribution of the
sample proportion. This section of the formula sheet contains the information necessary for determining if it is appropriate to conduct inference (calculate a confidence interval or perform a hypothesis
test). It is appropriate to calculate the confidence interval for a proportion when the underlying sampling distribution is approximately normal, which happens when:
• We have a random sample
• The sample size is large enough: $n\hat{p} \ge 10$ AND $n\hat{q} = n(1 - \hat{p}) \ge 10$.
For this problem, we are told that “the student government takes a simple random sample” and thus
the first condition is satisfied. Is the sample large enough? We are told that the sample size is 100
($n = 100$) and that 47 students supported the increase ($\hat{p} = \frac{47}{100} = .47$). If we then check
$n\hat{p} = 100 \times .47 = 47 \ge 10$ and $n\hat{q} = n(1 - \hat{p}) = 100 \times .53 = 53 \ge 10$, we can see
that it is appropriate to calculate the confidence interval in this case.
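The sample-size check above can be sketched in a few lines of Python. This is only an illustration: the function name `large_sample_ok` and the `threshold` parameter are made up for this sketch, and the random-sample condition cannot be checked in code since it depends on how the data were collected.

```python
def large_sample_ok(successes, n, threshold=10):
    """Check the large-sample condition for a proportion.

    n * p_hat is simply the number of successes, and n * q_hat is the
    number of failures, so we compare the two counts to the threshold.
    """
    failures = n - successes
    return successes >= threshold and failures >= threshold


# Parking permit sample: 47 of 100 students support the increase.
print(large_sample_ok(47, 100))  # True, since 47 >= 10 and 53 >= 10
```

A sample with only 5 supporters out of 100 would fail the check, because the number of successes would be below 10.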
To actually calculate the interval, we find the section of the formula sheet dealing with confidence
intervals for proportions.
Figure 0.2: Formula Sheet Clip for Hypothesis Testing and Confidence Intervals for Proportions
Here we see the formula:
$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}\hat{q}}{n}} \quad \text{OR} \quad \hat{p} \pm z^* \, SE(\hat{p}) \quad \text{OR} \quad \hat{p} \pm \text{Margin of Error}$$
We can calculate each of these pieces separately and put them together.
The first step is to calculate the standard error, which is:
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = \sqrt{\frac{.47 \times .53}{100}} = 0.04991$$
Next, we need to find the appropriate $z^*$ value. Remember, for a confidence interval, this is a
value that you read off the Z-table, NOT a quantity that you calculate. The $z^*$ value for a 95% interval
is 1.96. (If you do not know how to find this number, make sure to find out before the final, either at the
tutoring center or by asking another student.)
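If you want to double check a table lookup with software, the following sketch may help. It assumes Python's standard `statistics.NormalDist`; the helper name `critical_z` is made up for this illustration. It does the same job as reading the Z-table: it finds the value that leaves half of the leftover probability in each tail of the standard normal distribution.

```python
from statistics import NormalDist


def critical_z(confidence):
    """Return z* for a given confidence level (e.g. 0.95)."""
    tail_area = (1 - confidence) / 2          # area in EACH tail
    return NormalDist().inv_cdf(1 - tail_area)


z95 = critical_z(0.95)
print(round(z95, 2))  # 1.96
```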
Now that we have a $z^*$ and a standard error, we can calculate the Margin of Error, $MOE = z^* \, SE(\hat{p}) =
1.96 \times 0.04991 = 0.0978$. Thus the confidence interval is:
$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}\hat{q}}{n}} \;\to\; \hat{p} \pm 1.96 \times 0.04991 \;\to\; \hat{p} \pm 0.0978 \;\to\; (0.3722,\ 0.5678)$$
Once we have a confidence interval, the final step is to interpret it in the context of the problem!
We believe that the TRUE proportion of students at the Midwestern university that support increasing
the cost of student parking permits to fund a new parking garage is between 37.22% and 56.78%.
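As a check on the arithmetic, here is a minimal Python sketch of the same calculation: standard error, then margin of error, then the interval. The function name `proportion_ci` and the parameter name `z_crit` are illustrative choices, and the value 1.96 is the table value for a 95% interval.

```python
from math import sqrt


def proportion_ci(successes, n, z_crit=1.96):
    """Confidence interval for a proportion: p_hat +/- z* * SE(p_hat)."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)   # standard error of p_hat
    moe = z_crit * se                    # margin of error
    return p_hat - moe, p_hat + moe, se, moe


lower, upper, se, moe = proportion_ci(47, 100)
print(round(se, 5), round(moe, 4))       # 0.04991 0.0978
print(round(lower, 4), round(upper, 4))  # 0.3722 0.5678
```

Remember that the check of the two conditions (random sample, large enough sample) has to come before this calculation; the code will happily produce an interval even when the conditions fail.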
b.) A student senator who opposes the increase saw the results and said “it is clear that less than half
of the students support this measure, therefore we should not increase the permit costs." Use your
answer to part (a) to refute his statement. Be concise in your argument.
The senator's claim is not supported by the data. The confidence interval contains values above 0.5 (it
extends up to 0.5678), so proportions greater than one half are plausible values for the population proportion.
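The logic of this argument can be written as a tiny check. This is just a sketch using the interval endpoints from part (a); the variable names are made up. The senator's claim would only be supported if the ENTIRE interval were below 0.5.

```python
# 95% confidence interval from part (a)
ci_lower, ci_upper = 0.3722, 0.5678

# "Less than half support the measure" is only backed by the data
# when every plausible value is below 0.5.
entire_interval_below_half = ci_upper < 0.5
half_is_plausible = ci_lower <= 0.5 <= ci_upper

print(entire_interval_below_half)  # False
print(half_is_plausible)           # True
```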
What is the Point?
The point of this problem is to get you to take your Statistics 311 knowledge out into the real world. In
whatever field you will be working in, it is probable that at some point you will come across statistics of
some sort. You may forget the specific formulas and some of the details, but hopefully you can remember the basic idea behind confidence intervals. The idea is that we have some parameter that we want
to study. To study the parameter we take a sample and calculate a sample statistic. We know that every
sample we take will give us different sample statistics. If we can obtain a confidence interval (whether
we calculate it ourselves or have it provided to us) we know that it tells us the plausible values of our
parameter!
You will often see confidence intervals around election time, when news organizations report
the results of a poll. For example, suppose a news organization reports that President Obama leads in
the polls with 48% of the vote with a margin of error of ± 4%. This means that the poll estimates that the
TRUE proportion of all voters that will vote for President Obama is between 44% and 52%!
Hypothesis Testing
2. A real estate broker was interested in the typical size of homes in Wake County. Specifically she
had been told that homes were typically around 1500 square feet. She believes that perhaps the size
was actually larger than 1500 square feet and took a random sample of 25 homes in Wake County
to try to verify that idea. Since home size is likely to be skewed she decides to use the median to
measure “typical."
Here is the output from a hypothesis test that was conducted:
Hypothesis test results:
Parameter: median of Variable
H0: Parameter = 1500
HA: Parameter > 1500
Variable   n    n for test   Sample Median   Below   Equal   Above   p-val
Sq.Ft.     25   25           1740            9       0       16      0.1148
a.) If the median of the population is really 1500 how many of the 25 observations would you expect
to be less than 1500?
First, let’s think for a second about why we are discussing the median as the parameter and not the mean. We are
told that the population of home sizes in Wake County is skewed AND that we are only taking a sample of size 25. Thus, it is not appropriate to use the methods we have learned for means! Recall that
one of the strengths of the median is that it is not influenced by outliers, whereas the mean is!
Now let’s contemplate why the population of home sizes is skewed, which may provide some insight
into the nature of our problem. Most people have a salary between $30,000 and $120,000 and thus
live in a “smallish" to medium sized house (maybe between 1500-2500 sq. ft.), as that is what they can
afford. A very small number of people make astronomical amounts of money and build very very big
houses (maybe around 10,000 sq. ft.). This makes for a population distribution that is skewed to the
right and looks like the figure below.
[Figure: Population of Home Sizes in Wake County. Horizontal axis: Home Size (Sq. Ft.), 0 to 10,000; vertical axis: Density.]
Since the population is so heavily skewed and we are taking a “small" sample size, we do not want our
results to be messed up by one really big house that could potentially be randomly selected to be included in our data. Thus, the median is a great choice for the parameter of interest in this case as it
helps us to learn about what a “typical" house size in Wake County is.
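The median is the “middle" of a set of values: half of the values lie below it and half lie above it. (In our sample of 25 observations, the sample median is the thirteenth number in sorted order.) So if 1500 sq. ft. really is the population median, then each sampled home has a 50% chance of being smaller than 1500 sq. ft., and we expect
about 25 × 0.5 = 12.5, that is, around 12 or 13, observations to be less than 1500.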
b.) Since she is interested in the median the broker conducts what is called a Sign test. The Sign test
is about the population median rather than the mean. She correctly carried out the test in software
and the resulting output is given below. What conclusion can we draw from this output?
For this problem, we are given output from a test (the Sign Test) that we never learned about. However,
we do know something about hypothesis tests, p-values, and how to draw conclusions from a
hypothesis test. In this respect this problem is very much like every other hypothesis test we have done
before: the overall idea is the same but the details are a little different, so we just need to adapt what we
have learned to a new context.
Here the parameter is the TRUE median home size in Wake County, and thus the population is ALL
homes in Wake County. The Null Hypothesis states that the TRUE median home size in Wake County is
1500 sq. ft. whereas the Alternative Hypothesis states that the TRUE median home size in Wake County
is larger than 1500 sq. ft.
In the output, we are given a p-value of 0.1148, which is above our default significance level of 0.05.
Thus, at the 5% significance level, we fail to reject the null hypothesis. That is, there is insufficient evidence to conclude that the TRUE median home size in Wake County is larger than 1500 sq. ft.
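You are not expected to compute a Sign test by hand, but for the curious, here is a sketch of where a p-value like this can come from. It assumes the usual exact Sign test: if the null hypothesis is true, each home has a 50% chance of being above 1500 sq. ft., so the number of homes above 1500 behaves like a Binomial(25, 0.5) count. The worksheet does not say how the software did its calculation, so treat the function below (the name `sign_test_upper_p` is made up) as an illustration only.

```python
from math import comb


def sign_test_upper_p(n_above, n_below):
    """One-sided p-value for HA: median > null value.

    Observations equal to the null value are assumed to have been
    dropped already (here there were 0 of them).
    """
    n = n_above + n_below
    # P(X >= n_above) when X ~ Binomial(n, 0.5)
    favorable = sum(comb(n, k) for k in range(n_above, n + 1))
    return favorable / 2 ** n


p_value = sign_test_upper_p(n_above=16, n_below=9)
print(round(p_value, 4))  # 0.1148
```

With 16 homes above and 9 below, this calculation agrees with the p-value of 0.1148 shown in the output.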
What is the Point?
The point of this problem is to get you to take your Statistics 311 knowledge out into the real world. In
whatever field you will be working in, it is probable that at some point you will come across statistics
of some sort. You may forget about specific formulas and details, but hopefully you can remember the
basic idea behind hypothesis testing and how to reach a conclusion about the population parameter.
There is some parameter that we are interested in studying. If someone can formulate a question about
that parameter, we can turn that question into a set of hypotheses. A test can be done (usually by a
software program) that provides a p-value. From the p-value we can decide to reject the null hypothesis
or fail to reject it and make an intelligent conclusion about the parameter!
Common Mistakes on Midterm 2 and Things to Review
• When communicating about the parameter of interest, be sure to clearly distinguish between
population parameter and sample statistic. Use the word TRUE or POPULATION! We have focused extensively on three types of parameters: proportion, mean, and slope. When reading
through a problem where you need to calculate something, determine which of these parameters is relevant to the question and go to the appropriate place on the formula sheet before
performing ANY CALCULATIONS!
• When performing a calculation, please show your work and be organized. You may know what
you're doing, but if we cannot tell then we cannot give you credit.
• Do not mix up Standard Error and Standard Deviation. Standard Error is an estimate, computed from the sample, of the Standard Deviation of a statistic's sampling distribution. Look on your formula sheet and identify both the Standard Deviation formulas
and the Standard Error formulas. What is the difference?
• Practice correctly stating the hypotheses. The hypotheses are about population parameters and
NOT about sample statistics. A population parameter represents an unknown quantity that
we want to learn about, and a statistic is a number that we calculate from a sample. A hypothesis
is a question that we ask about a parameter! The null hypothesis represents the “status quo” or
“general consensus” whereas the alternative hypothesis is the claim that we hope to find evidence for.
• P-values are always between 0 and 1, so if you have calculated a p-value that is not between 0 and
1 then something has gone wrong in your calculation!
• Test statistics that are very very large (greater than 25) should make you think to double check
your calculations.
• Extrapolation is when you try to predict the response variable (y) at values of the explanatory
variable (x) outside the range of your data. We CAN make predictions of the response
for any value of x (the explanatory variable) that lies within the range of our data set; this is NOT
extrapolation.
• Given a p-value, you need to be able to determine if you reject or do not reject the null, and then
make an appropriate conclusion in the context of the problem. For any hypothesis test, you have
a significance level α (usually α = 0.05) and a p-value. If your p-value is less than your significance level then you reject the null hypothesis as what you observed is very unlikely by random
chance if the null hypothesis is true. (It is always important to remember that the hypothesis
test is conducted assuming that the null hypothesis is true!) If your p-value is higher than your
significance level then you fail to reject the null hypothesis. DO NOT ACCEPT THE NULL! (See
Worksheet 10 and 11 solutions)
• Review the connection between confidence intervals and two-sided hypothesis tests, as these
are two ways of doing the same thing, reaching a conclusion about a population parameter. The
confidence interval gives you a range of plausible values of your parameter. A hypothesis test
looks at the null hypothesis and asks: Is this a plausible value for the parameter? Yes or No. If we
are given a confidence interval, then for any two-sided hypothesis test, we will fail to reject any
null hypothesis if the null value is contained in the interval and will reject any null hypothesis if
the null value is not contained in the interval! (See Worksheet 11 Solutions)
• Review how to recognize and perform a paired t-test. The machinery is very similar to a hypothesis test for means except that you instead work with the differenced data, so the hypotheses and
the interpretation will change. (See the last problem on Worksheet 11 Solutions)
• Review how to find the p-value when the test statistic is negative, in particular for a t-test. The alternative tells us “where" in the null distribution to look. It almost ALWAYS helps to DRAW PICTURES; students who draw pictures very rarely make mistakes when calculating p-values!
• Review how to find the p-value from the randomization test output. (There were a few quiz
questions where you had to do this.)
• Review the relationship between p-values from a one-sided and a two-sided test. If you have the p-value
for a one-sided test, multiply it by 2 to get the p-value for the two-sided test. If you have a p-value from a two-sided test, divide the p-value by 2 to get the one-sided p-value. (This works when the null distribution is symmetric and the test statistic falls on the side that the one-sided alternative points to.) Draw the null
distribution to see why this is true.
• Review the meaning of $R^2$, know what it is and how to find it in the output. Also, know how to use
R-sq to compare two models. (See Worksheet 12 Solutions)
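Several of the review points above (p-values for a negative test statistic, the one-sided versus two-sided relationship, and the reject / fail to reject decision) can be tied together in a short Python sketch. To stay within the standard library it uses a standard normal null distribution instead of a t-distribution, so it illustrates the idea rather than reproducing a t-test; the function names and the test statistic z = -1.5 are made up for illustration.

```python
from statistics import NormalDist


def one_sided_p(z, alternative):
    """P-value for a z statistic under a standard normal null."""
    std_normal = NormalDist()
    if alternative == "greater":
        return 1 - std_normal.cdf(z)   # area to the RIGHT of z
    if alternative == "less":
        return std_normal.cdf(z)       # area to the LEFT of z
    raise ValueError("alternative must be 'greater' or 'less'")


def two_sided_p(z):
    """Area in BOTH tails beyond |z|."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(p_value, alpha=0.05):
    """Never 'accept' the null: we either reject or fail to reject."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


z = -1.5                               # hypothetical negative test statistic
p_less = one_sided_p(z, "less")        # alternative points to the left
p_two = two_sided_p(z)

print(round(p_less, 4), round(p_two, 4))  # 0.0668 0.1336
print(decide(p_less))                     # fail to reject H0
```

Notice that the two-sided p-value is exactly twice the one-sided p-value here because the statistic is negative and the alternative is "less". If the alternative were "greater", the one-sided p-value would be the large area to the right of -1.5 instead, which is why drawing the picture matters.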