STAT 350 (Fall 2014)
Author: Will A. Eagon
Lab 9: SAS Solution
Lab 9: Inference for Regression
Objectives: Perform Inference and check for assumptions in Linear
Regression
A (30 points) Beer and Blood Alcohol (Data Set: ex10-23bac.txt)
How well does the number of beers a student drinks predict his or her blood alcohol content? Sixteen
student volunteers at Ohio State University drank a randomly assigned number of 12-ounce cans of
beer. Thirty minutes later, a police officer measured their blood alcohol content (BAC). Here are the
data:
The students were equally divided between men and women and differed in weight and usual drinking
habits. Because of this variation, many students don’t believe that number of drinks predicts blood
alcohol well.
Solution
File → Open Worksheet → Files of type: Text (*.txt) → ex10-23bac.txt → Open
The following code is used for all answers in this problem except where explicitly mentioned otherwise.
Stat → Regression → Regression → Fit Regression Model → Response: BAC, Continuous predictors:
Beers → Options (Confidence Level for all intervals: 90, type of confidence interval: Two-sided) → OK →
Results: Display of results: Expanded tables → OK → OK
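For readers who want to see what the Fit Regression Model step computes, here is a minimal Python sketch of the least-squares formulas for a single predictor. The function name fit_line and the toy numbers are illustrative assumptions: they are not the lab data, and this is not Minitab's implementation.

```python
from statistics import mean

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares about the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Toy values for illustration only (NOT the ex10-23bac.txt data).
toy_x = [1, 2, 3, 4]
toy_y = [3, 5, 7, 9]
b0, b1 = fit_line(toy_x, toy_y)
print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```

With the actual worksheet, x would be the Beers column and y the BAC column.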
1. (5 pts) Make a scatterplot of the data (including the least squares regression line). Briefly describe
the relationship between blood alcohol content and the number of beers.
Solution:
Graph → Scatterplot → With Regression → OK → (Y-variables: BAC, X variables: Beers) → OK
It appears that there is a moderately strong, positive, linear relationship between BAC and number of
beers.
2. (5 pts) Obtain the equation of the least-squares regression line for predicting blood alcohol from
number of beers. What is r² for these data?
Solution:
Regression Equation
BAC = -0.0127 + 0.01796 Beers

Model Summary
S          R-sq    R-sq(adj)  PRESS      R-sq(pred)
0.0204410  79.98%  78.55%     0.0086171  70.51%
r² = 79.98%
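As a check, r² can be recomputed from the sums of squares reported in the Analysis of Variance output for this problem. The sketch below is only an arithmetic check under that assumption; the variable names are illustrative.

```python
# Sums of squares copied from the Analysis of Variance output for this problem.
ss_regression = 0.023375
ss_error = 0.005850
ss_total = 0.029225
n = 16  # sixteen student volunteers

r_sq = ss_regression / ss_total
# Adjusted r^2 accounts for the two estimated parameters (intercept and slope).
r_sq_adj = 1 - (ss_error / (n - 2)) / (ss_total / (n - 1))

print(f"r^2 = {r_sq:.2%}, adjusted r^2 = {r_sq_adj:.2%}")
```

Both values should agree with the R-sq and R-sq(adj) entries in the Model Summary.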
3. (5 pts) Using the results from parts (1) and (2), briefly summarize what your data analysis shows.
You should emphasize in this summary whether this data can be used for prediction purposes.
Please comment on whether this is an SRS situation based on the information provided.
Solution:
Analysis of Variance
Source         DF  Seq SS    Contribution  Adj SS    Adj MS    F-Value  P-Value
Regression      1  0.023375        79.98%  0.023375  0.023375    55.94    0.000
  Beers         1  0.023375        79.98%  0.023375  0.023375    55.94    0.000
Error          14  0.005850        20.02%  0.005850  0.000418
  Lack-of-Fit   7  0.002802         9.59%  0.002802  0.000400     0.92    0.543
  Pure Error    7  0.003048        10.43%  0.003048  0.000435
Total          15  0.029225       100.00%
First, the sample can be treated as an SRS: although the volunteers were equally divided between men
and women, an SRS of all students could plausibly produce equal numbers of men and women. The
scatterplot shows a linear relationship, and the standard deviation about the line is approximately
constant. The R² value (79.98%) is reasonably close to 1 and the MSE (0.000418) is very small, so the
points in the scatterplot are close to the regression line in an absolute sense. Therefore, these data
can be used for prediction.
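The MSE and the regression standard error S referred to in this summary can be recovered from the Analysis of Variance output above, as can the lack-of-fit F statistic. The following is a small sketch of that arithmetic; the variable names are illustrative and the numbers are copied from the output.

```python
import math

# Values copied from the Analysis of Variance output above.
ss_error, df_error = 0.005850, 14
ss_lack_of_fit, df_lack_of_fit = 0.002802, 7
ss_pure_error, df_pure_error = 0.003048, 7

mse = ss_error / df_error
s = math.sqrt(mse)  # regression standard error, in BAC units
f_lack_of_fit = (ss_lack_of_fit / df_lack_of_fit) / (ss_pure_error / df_pure_error)

print(f"MSE = {mse:.6f}, S = {s:.4f}, lack-of-fit F = {f_lack_of_fit:.2f}")
```

The lack-of-fit P-value of 0.543 in the output gives no evidence against a straight-line model, which is consistent with the linear pattern seen in the scatterplot.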
4. (10 pts) Is there significant evidence that drinking more beers increases blood alcohol on the
average in the population of all students? Please perform the 4*-step process (state
hypotheses, give a test statistic and P-value, and state your conclusion).
Solution:
Coefficients
Term      Coef     SE Coef  90% CI              T-Value  P-Value  VIF
Constant  -0.0127  0.0126   (-0.0350, 0.0096)     -1.00    0.332
Beers     0.01796  0.00240  (0.01373, 0.02219)     7.48    0.000  1.00
Step 0: Definition of the terms
β1 is the population slope.
Step 1: State the hypotheses
H0: β1 = 0
Ha: β1 > 0
Step 2: Find the Test Statistic, report DF.
t = 7.48
DF = 14
Step 3: Find the p-value:
P-value < 0.0005/2 = 0.00025
Minitab reports the two-sided P-value (displayed as 0.000, that is, less than 0.0005), so we divide it
by 2 to obtain the one-sided P-value. This is valid because the observed t is positive, in the direction
of Ha.
Step 4: Conclusion:
α = 0.05
Since the P-value < 0.00025 ≤ 0.05, we should reject H0.
The data provide sufficiently strong evidence (P-value < 0.00025) that there is a positive linear
association between BAC and number of beers.
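The test statistic and the one-sided P-value can be reproduced from the Coefficients table. The sketch below uses only the Python standard library and approximates the upper tail of the t distribution by numerical integration; the helper names t_pdf and upper_tail, the integration limit, and the step count are assumptions made for this illustration, not part of the Minitab output.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def upper_tail(t_obs, df, upper=100.0, steps=20_000):
    """Approximate P(T > t_obs) with the trapezoid rule on [t_obs, upper]."""
    h = (upper - t_obs) / steps
    total = 0.5 * (t_pdf(t_obs, df) + t_pdf(upper, df))
    total += sum(t_pdf(t_obs + i * h, df) for i in range(1, steps))
    return total * h

coef, se_coef, df = 0.01796, 0.00240, 14  # slope, its SE, and error DF from the output
t_stat = coef / se_coef
p_one_sided = upper_tail(t_stat, df)

print(f"t = {t_stat:.2f}, one-sided P-value ~ {p_one_sided:.1e}")
```

The exact P-value is far below the 0.00025 bound used in Step 3, so the conclusion does not depend on the rounding in the Minitab display.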
5. (5 pts) Steve thinks he can drive legally 30 minutes after he drinks 5 beers. The legal limit is BAC
= 0.08. Give and interpret a 90% prediction interval for Steve’s BAC. Can he be confident he won’t
be arrested if he drives and is stopped? Note: It is still bad to drive when buzzed, that is, when your
BAC is below 0.08.
Solution:
Stat → Regression → Regression → Predict → Response: BAC, enter individual values, Beers:5 →
Options (Confidence level: 90, Type of Interval: Two-sided) → OK → Results → Be sure that both
Regression equation and Prediction tables are checked → OK → OK
Variable  Setting
Beers     5

Fit        SE Fit     90% CI                  90% PI
0.0771182  0.0051300  (0.0680826, 0.0861538)  (0.0399988, 0.114238)
The 90% prediction interval for Steve’s BAC is (0.0399988, 0.114238).
We are 90% confident that the next value of BAC after drinking 5 beers is between 0.0399988 and
0.114238.
Since values greater than 0.08 are in the interval, he should not be confident that he can drive 30
minutes after drinking 5 beers.
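The prediction interval can be rebuilt by hand from the Fit, SE Fit, and S values in the output. The sketch below assumes a t critical value of 1.76131 (90% two-sided, 14 degrees of freedom), taken from a standard t table rather than from the Minitab output; the variable names are illustrative.

```python
import math

fit = 0.0771182      # fitted BAC at Beers = 5 (from the Prediction output)
se_fit = 0.0051300   # standard error of the fitted mean
s = 0.0204410        # regression standard error S from the Model Summary
t_star = 1.76131     # assumed: t critical value, 90% two-sided, df = 14 (t table)

# A new observation varies around the mean response, so its standard error
# combines the uncertainty in the fitted mean with the residual spread.
se_pred = math.sqrt(s ** 2 + se_fit ** 2)
pi_low, pi_high = fit - t_star * se_pred, fit + t_star * se_pred
ci_low, ci_high = fit - t_star * se_fit, fit + t_star * se_fit

print(f"90% CI for the mean: ({ci_low:.4f}, {ci_high:.4f})")
print(f"90% PI for Steve:    ({pi_low:.4f}, {pi_high:.4f})")
print("Legal limit 0.08 inside PI:", pi_low < 0.08 < pi_high)
```

The prediction interval is much wider than the confidence interval because it describes one individual rather than the mean BAC of all students who drink 5 beers.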
B (50 points) House Prices (Data Set: sales.txt - webpage)
Real estate is typically reassessed annually for property tax purposes. This assessed
value, however, is not necessarily the same as the fair market value of the property. The data file
summarizes an SRS of 30 properties recently sold in a Midwestern city. Both variables, Sales Price and
Assessed value are measured in thousands of dollars.
Solution:
File → Open Worksheet → Files of type: Text (*.txt) → sales.txt → Open
The following code is used for all answers in this problem except where explicitly mentioned otherwise.
Stat → Regression → Regression → Fit Regression Model → Response: SalesPrice, Continuous
predictors: AssessedValue → Options (Confidence Level for all intervals: 95, type of confidence
interval: Two-sided) → OK → Graphs (Residual Plots: Check Individual plots with histogram and
Normal Plot, Residuals versus the variables: AssessedValue) → OK → Results: Display of results:
Expanded tables → OK → OK
1. (4 pts) Inspect the data. How many have a selling price greater than the assessed value? Do you
think this trend would be true for the larger population of all homes recently sold? Explain your
answer.
Solution:
There are 17 houses that have a selling price greater than the assessed value.
This is slightly more than half of the 30 houses (17/30 ≈ 57%). In a larger sample from the same city,
we would still expect roughly half of the houses to have a selling price greater than the assessed
value. This trend may not generalize to cities outside this Midwestern city because real estate values
within a single city are related to one another.
2. (5 pts) Make a scatterplot with assessed value on the horizontal axis. Please include the
regression line in your plot. Briefly describe the relationship between assessed value and selling
price.
Solution:
Graph → Scatterplot → With Regression → OK → (Y-variables: SalesPrice, X variables: AssessedValue)
→ OK
The two variables have a strong, positive, linear relationship. However, there may be an
x-outlier with an AssessedValue of more than 300.
3. (5 pts) Obtain the residuals and plot them versus assessed value. Is there anything unusual to
report? If so, explain.
Solution:
I see no pattern in the residual plot, so the association appears to be linear. The plot also supports
the assumption of constant standard deviation. Again, there appears to be an outlier with AssessedValue
greater than 300.
4. (5 pts) Do the residuals appear to be approximately Normal? Explain your answer. Be sure to
include the appropriate graph in your answer.
Solution:
The residuals appear to be approximately Normal: on the QQ plot the points are close to the line, and
the curve on the histogram matches the bars without important deviations. Therefore, the x-outlier does
not affect the Normality of the residuals.
5. (5 pts) Based on your answers to parts (2), (3), and (4), do the assumptions for the linear
regression analysis appear reasonable? Explain your answer.
Solution:
First, it is appropriate to treat our sample as an SRS. The three other assumptions are also met:
linearity, constant standard deviation of the residuals, and Normality of the residuals. The only
concern is the x-outlier; we would need to determine whether it is influential.
6. (5 pts) Obtain the least-squares regression line for predicting selling price from assessed value.
Solution:
Regression Equation
SalesPrice = 37.4 + 0.849 AssessedValue
7. (3 pts) Calculate the predicted selling prices for homes currently assessed at $155,000,
$220,000, and $285,000. (This part may be done by hand.)
Stat → Regression → Regression → Predict → Response: SalesPrice, enter individual values,
AssessedValue: 155, 220, 285 → OK
Variable       Setting
AssessedValue  155

Fit      SE Fit   90% CI              90% PI
168.984  6.83222  (157.347, 180.621)  (121.872, 216.096)

Variable       Setting
AssessedValue  220

Fit      SE Fit   90% CI              90% PI
224.160  5.90144  (214.108, 234.212)  (177.414, 270.906)

Variable       Setting
AssessedValue  285

Fit      SE Fit   90% CI              90% PI
279.336  12.0943  (258.736, 299.936)  (229.251, 329.421)
OR
Sales Price 1 = 37.41025 + 0.84886 * 155 = 168.9836
Sales Price 2 = 37.41025 + 0.84886 * 220 = 224.1595
Sales Price 3 = 37.41025 + 0.84886 * 285 = 279.3354
8. (3 pts) Suppose these houses sold for $142,900, $224,000, and $286,000 respectively. Calculate
the residual for each of these sales. (This part may be done by hand.)
Solution:
Residual 1 = Observed Value – Predicted value = 142.9 – 168.9836 = -26.08 (-$26,080)
Residual 2 = Observed Value – Predicted value = 224 – 224.1595 = -0.16 (-$160)
Residual 3 = Observed Value – Predicted value = 286 – 279.3354 = 6.66 ($6,660)
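The hand calculations for the predicted prices and the residuals can be checked with a few lines of Python. This is a sketch using the coefficients from the hand calculation; the variable names are illustrative. With residual = observed – predicted, a house that sold for less than its predicted price has a negative residual.

```python
intercept, slope = 37.41025, 0.84886  # coefficients used in the hand calculation

assessed = [155, 220, 285]        # assessed values, in $1000s
sold_for = [142.9, 224.0, 286.0]  # observed sale prices, in $1000s

predicted = [intercept + slope * x for x in assessed]
residuals = [y - y_hat for y, y_hat in zip(sold_for, predicted)]

for x, y_hat, e in zip(assessed, predicted, residuals):
    print(f"assessed {x}: predicted {y_hat:.4f}, residual {e:+.2f}")
```

The first two houses sold below their predicted prices and the third sold above it.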
9. (10 pts) Construct and interpret a 95% confidence interval for the slope and the intercept.
Explain why inference on the intercept is not of interest in this problem.
Solution:
Coefficients
Term
Constant
AssessedValue
Coef
37.4
0.849
SE Coef
23.9
0.121
95% CI
(-11.7, 86.5)
(0.601, 1.097)
T-Value
1.56
7.03
P-Value
0.130
0.000
VIF
1.00
Slope:
95% CI (0.601, 1.097)
We are 95% confident that the population slope is between 0.601 and 1.097.
Intercept:
95% CI (-11.7, 86.5)
We are 95% confident that the population y-intercept is between -11.7 and 86.5.
Since there cannot be an Assessed Value of 0 for a house, the y-intercept is not relevant in this
situation.
OR
Since the data points do not include an Assessed Value of 0, the y-intercept would be an extrapolated
point so should not be considered in the study.
10. (5 pts) Using the result from part (9), compare the estimated regression line with y = x, which
says, on average, the selling price is equal to the assessed value. Is there evidence that this
model is not reasonable? In other words, is the selling price typically larger or smaller than the
assessed value? Explain your answer. How does your answer compare to your response in part
(1).
Solution:
To answer this question, look at the confidence intervals for both the slope and the y-intercept. If
y = x is the regression line for the population, then 0 would be in the confidence interval for the
y-intercept and 1 would be in the confidence interval for the slope. Since both occur here (0 is in
(-11.7, 86.5) and 1 is in (0.601, 1.097)), there is no evidence that this model is unreasonable. Note:
It is usually not appropriate to remove the y-intercept from the model, because the methodology used
here would then no longer be appropriate.
This is consistent with what was stated in part (1): nearly half of the selling prices were below
the assessed values.
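The comparison with y = x amounts to two containment checks, one per confidence interval. The sketch below uses the 95% intervals from the Coefficients table; the helper name contains is an illustrative assumption. It checks each interval separately, as the solution does, and is not a joint test of both parameters.

```python
# 95% confidence intervals copied from the Coefficients table.
intercept_ci = (-11.7, 86.5)
slope_ci = (0.601, 1.097)

def contains(interval, value):
    """True when value lies inside the closed interval (low, high)."""
    low, high = interval
    return low <= value <= high

# The line y = x corresponds to an intercept of 0 and a slope of 1.
intercept_ok = contains(intercept_ci, 0.0)
slope_ok = contains(slope_ci, 1.0)
print("0 in intercept CI:", intercept_ok)
print("1 in slope CI:", slope_ok)
```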