Applied Regression Analysis 41100
Instructor: Federico M. Bandi
Sample Midterm (Solutions)
The allotted time is 1 hour and 30 minutes. The exam is divided into three parts. The first and second part
are true-false and multiple choice, respectively. Please answer the true-false and multiple choice questions
on the exam by circling the best answer. There will be no partial credit for these questions. The third part
of the exam consists of several problems. Please answer these problems in the space provided on the exam
(you may use the back of the sheets if necessary). You will get partial credit for these problems provided
that your answers are organized and legible so that your train of thought can be easily followed.
Note: You should answer all questions on the exam. The blue books will not be looked at.
Please print your name in the space provided below and sign.
Panicking is not allowed and will be penalized!
Name:
Please, sign the following pledge: “I pledge my honor that I have not violated the Honor Code during this
examination.”
Signature:
True/False        8 Points
Multiple Choice   18 Points
Question 1        13 Points
Question 2        27 Points
Total             66 Points
True or False (1 point each)
(1) The estimated least-squares residuals are not correlated with the X values but could be correlated
with the fitted values T F
False. The fitted values are perfectly correlated with the X values since Ŷ = b0 + b1 X. Therefore,
the correlation between the fitted values and the residuals is also zero. See Chapter 1 in
the notes.
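A quick numerical check of this claim can be sketched as follows. The data are simulated and purely hypothetical (the seed, sample size and coefficients are assumptions made for illustration), and the slope and intercept are computed with the usual least-squares formulas.

```python
import random

# Hypothetical simulated data, for illustration only.
random.seed(0)
x = [random.gauss(10.0, 2.0) for _ in range(50)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]


def mean(v):
    return sum(v) / len(v)


def corr(u, v):
    """Sample correlation between two equal-length lists."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


# Least-squares slope and intercept.
xbar, ybar = mean(x), mean(y)
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * a for a in x]
resid = [b - f for b, f in zip(y, fitted)]

corr_resid_x = corr(resid, x)            # zero up to floating-point error
corr_resid_fitted = corr(resid, fitted)  # zero up to floating-point error
corr_fitted_x = corr(fitted, x)          # +1 here because the estimated slope is positive
print(corr_resid_x, corr_resid_fitted, corr_fitted_x)
```

Since the fitted values are a linear function of X, any correlation the residuals have with X carries over (up to sign) to the fitted values, which is why both come out as zero.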
(2) If the correlation between the estimated least-squares residuals and the X values were positive, then
the estimated slope in linear regression analysis would be too flat.
True. If the slope is too flat the residuals associated with values around the lower bound of
the range of the X variable are negative on average while those associated with values about
the upper bound of the range of the X variable are positive on average. Hence, the shape of
the scatter plot of the residuals against the X variable is upward sloping. We have a similar
discussion in Chapter 1 in the notes.
(3) In simple regression analysis, E(β0) = b0
T F
False. The opposite is true. The estimator is unbiased (not the actual parameter), therefore
E(b0) = β0. See Chapter 3.
(4) The true variance of the residuals (σ²) will eventually decrease if enough observations are added to the
sample
T F
False. The true variance is a population parameter (see Chapter 2). Only its estimate (s²)
is affected by the sample size.
(5) If we fail to reject a certain null hypothesis about a certain parameter of interest (the slope in the SLR
model, say) at the 5% level, then we might still reject at the 1% level
TF
False. The cut-off values for a 1% test are larger than the cut-off values for a 5% test.
Therefore, if you fail to reject at the 5% level you also fail to reject at the 1% level. See
Chapter 4.
(6) It is possible to reject a certain null hypothesis about a certain parameter of interest (the slope in the
SLR model, say) even when the conjectured null hypothesis is true
TF
True. The nature of the tests that we have been discussing is such that the probability of
rejecting when the null hypothesis is true coincides with the level of the test (it is 5% for a 5%
test). See Chapter 4.
(7) If the p-value is 3%, then we would always reject the null hypothesis
TF
False. For example, you would not reject the null when the level of the test is 1%. Of course,
you would reject the null if the level of the test is 5%. See Chapter 4.
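The decision rule behind questions (5) and (7) can be sketched in a few lines. The helper name `reject` is hypothetical, and the cut-off values use the standard normal distribution as a large-sample stand-in for the t distribution (an assumption, not something taken from the notes).

```python
from statistics import NormalDist


def reject(p_value, level):
    """Reject the null hypothesis when the p-value is below the test level."""
    return p_value < level


# Two-sided cut-off values under the normal approximation.
std_normal = NormalDist()
cutoff_5 = std_normal.inv_cdf(1 - 0.05 / 2)  # about 1.96
cutoff_1 = std_normal.inv_cdf(1 - 0.01 / 2)  # about 2.58

print(reject(0.03, 0.05), reject(0.03, 0.01))
print(cutoff_5, cutoff_1)
```

A p-value of 3% leads to rejection at the 5% level but not at the 1% level, and the 1% cut-off is larger than the 5% cut-off, so failing to reject at 5% implies failing to reject at 1%.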
(8) When the sample size is very large, the interval (b0 +b1 Xf −2s, b0 +b1 Xf +2s) is a valid 95% predictive
interval for Yf given a new Xf .
True. Just take a look at the formula for a predictive interval at the end of Chapter 4.
Notice that the t cut-off values can be replaced by 2 and −2 (the approximate cut-off values
based on the normal distribution) because the t distribution tends to the normal as the sample
size (the number of degrees of freedom) gets large. Notice also that the statement implies
that a simple “plug-in” interval (one that you obtain by replacing the true quantities β0, β1
and σ with the estimated quantities b0, b1 and s) is a good way to predict Yf given a new Xf
when the estimation uncertainty is not substantial (when n is large). In this case the sample
quantities (b0, b1 and s, that is) are similar to the population quantities (β0, β1 and σ, that is)
which are the quantities that you would use if you knew what the population looks like (i.e.,
if you knew the true data-generating process). (Compare the predictive interval based on the
true population parameters in Chapter 2 to the predictive interval based on estimates of the
population parameters in Chapter 4.)
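To see why the plug-in interval works for large n, here is a small sketch based on the textbook SLR formula for the standard error of a prediction, s_pred = s ∗ sqrt(1 + 1/n + (Xf − X̄)² / ((n − 1) s_X²)). That this is the same formula as in Chapter 4 is an assumption, and all the numbers below are hypothetical.

```python
import math


def s_pred(s, n, xf, xbar, sx):
    """Standard error of a prediction in simple linear regression (textbook formula)."""
    return s * math.sqrt(1.0 + 1.0 / n + (xf - xbar) ** 2 / ((n - 1) * sx ** 2))


# Hypothetical values: residual std s, mean and std of X, and the new point Xf.
s, xbar, sx, xf = 1.0, 0.0, 1.0, 1.5

# Ratio s_pred / s for increasing sample sizes.
ratios = {n: s_pred(s, n, xf, xbar, sx) / s for n in (10, 100, 10_000)}
for n, r in ratios.items():
    print(n, round(r, 4))
```

The ratio s_pred / s shrinks towards 1 as n grows, so the interval based on s_pred and the plug-in interval based on s become nearly identical.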
Multiple Choice (3 points each)
[1] You run a linear least-squares regression on a data set of 20 observations. The sample average of the
X values is 10 and the sample average of the Y values is 5. Suppose you add an observation that has
X = 10 and Y = 5. Now you run a new least-squares regression on the sample of 21 observations.
How would the slope estimate change (compared to the previous regression)?
(a) It would increase
(b) It would decrease
(c) It would not change
(d) Cannot tell based on the information given
It would not change. The terms that get added in the expressions for the numerator and
the denominator in the formula for the slope estimate (see Chapter 1 in the notes) for the
21-st observation are both equal to zero, since the new (X, Y) pair is at the point of means
(X̄, Ȳ).
[2] Assume the same set-up as in the previous question. How would the R² change?
(a) It would increase
(b) It would decrease
(c) It would not change
(d) Cannot tell based on the information given
It would not change. Both the SSE and the SST in the definition of the coefficient of
determination (see Chapter 1) remain unchanged. In fact, the term for the 21-st observation
in both summations is equal to zero. (To see this, recall that the point of means (X̄, Ȳ) is on
the regression line.)
[3] Assume the same set-up as in the previous questions. How would the estimated standard deviation of
the residuals (s) change?
(a) It would increase
(b) It would decrease
(c) It would not change
(d) Cannot tell based on the information given
It would decrease. SSE is unchanged, but n is larger.
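The answers to questions [1] to [3] can be checked numerically. The sketch below uses hypothetical simulated data (20 points, arbitrary seed) and a hypothetical helper `fit`, then adds one observation at the point of means and refits.

```python
import random


def fit(x, y):
    """Least-squares fit returning the slope, R-squared and residual std s."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
    sst = sum((b - ybar) ** 2 for b in y)
    return {"b1": b1, "r2": 1 - sse / sst, "s": (sse / (n - 2)) ** 0.5}


# Hypothetical data, for illustration only.
random.seed(1)
x = [random.uniform(5, 15) for _ in range(20)]
y = [0.5 * a + random.gauss(0, 1) for a in x]

before = fit(x, y)
# Add a 21st observation located exactly at the point of means.
after = fit(x + [sum(x) / 20], y + [sum(y) / 20])

print(before)
print(after)
```

The slope and R² agree up to floating-point error, while s falls because the same SSE is now divided by 19 rather than 18.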
[4] Consider the following simple linear regression (SLR) model:
Yi = 1 + 2Xi + εi ,
where εi ∼ N(0, 1) i.i.d. The error term εi is independent of Xi for every i. Which of the following
statements is WRONG?
(a) E(Y |X = 2) = 5
(b) The 95% predictive interval for Y given X = 2 is (3, 7)
(c) The 68% predictive interval for Y given X = 2 is (4, 6)
(d) If Xi ∼ N(0, 1), then the variance of Y is equal to 6 (i.e., Var(Y) = 6)
(e) If Xi ∼ N(0, 1), then the expected value of Y is equal to 1 (i.e., E(Y) = 1)
The answer is (d).
(a) E(Y|X = 2) = 1 + 2 ∗ 2 = 5,
(b) 95% predictive interval = (5 − 2 ∗ 1, 5 + 2 ∗ 1) = (3, 7),
(c) 68% predictive interval = (5 − 1 ∗ 1, 5 + 1 ∗ 1) = (4, 6),
(d) Var(Y) = 4Var(X) + Var(ε) = 5,
(e) E(Y) = 1 + 2E(X) + E(ε) = 1.
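The same calculations can be sketched with Python's standard library. The variable names are hypothetical, and the coverage figures show how close the "2 standard deviations" and "1 standard deviation" rules are to 95% and 68%.

```python
from statistics import NormalDist

# Model from the question: Y = 1 + 2 X + eps, eps ~ N(0, 1).
b0, b1, sigma = 1.0, 2.0, 1.0

# (a) Conditional mean at X = 2.
cond_mean = b0 + b1 * 2

# (b), (c) Coverage of the intervals (3, 7) and (4, 6) for Y given X = 2.
y_given_x = NormalDist(mu=cond_mean, sigma=sigma)
cover_3_7 = y_given_x.cdf(7) - y_given_x.cdf(3)
cover_4_6 = y_given_x.cdf(6) - y_given_x.cdf(4)

# (d), (e) Unconditional moments when X ~ N(0, 1), independent of eps.
mean_x, var_x = 0.0, 1.0
var_y = b1 ** 2 * var_x + sigma ** 2
mean_y = b0 + b1 * mean_x

print(cond_mean, cover_3_7, cover_4_6, var_y, mean_y)
```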
[5] Which of the following results in a LARGER confidence interval width for the intercept estimate in
linear regression analysis (everything else being kept constant)?
(a) smaller estimated intercept
(b) smaller degree of confidence
(c) smaller sample size
(d) smaller estimated residual variance (s²)
(e) larger regressors variance (s_X²)
(f) None of the above
The answer is (c). Immediate, by looking at the formula for the confidence interval of the
intercept from Chapter 4.
[6] One concern about the depletion of the ozone layer is that the increase in UV light will decrease crop
yields. An experiment was conducted in a green house where soybean plants were exposed to varying
UV levels measured in Dobson units. At the end of the experiment the yield (kg) was measured. Using
100 observations, a linear regression analysis was performed with the following results:
            Estimate     Std error   t-ratio   P-value
Intercept   3.9800118    0.053774    74.01     < .0001
UV          −0.046285    0.010741    ??        0.0008
Which of the following statements is WRONG?
(a) An increase in UV light decreases crop yields
(b) The missing t-ratio is −4.309
(c) An approximate 95% confidence interval for the true slope is −0.046285 ± 2 ∗ 0.010741
(d) At the 5% level, we fail to reject the null hypothesis that the true slope is equal to −0.05
(e) None of the above
The answer is (e).
(a) Yes, of course.
(b) −0.046285 / 0.010741 = −4.309.
(c) Yes, of course.
(d) (−0.046285 − (−0.05)) / 0.010741 ≈ 0.35 (fail to reject).
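For readers who want to verify the arithmetic, here is a minimal sketch using the estimate and standard error from the regression output. The variable names are hypothetical, and the cut-off of 2 is the usual rule of thumb.

```python
# Slope estimate and standard error from the regression output.
b1, se = -0.046285, 0.010741

# (b) t-ratio for the null hypothesis that the true slope is zero.
t_zero = b1 / se

# (c) Approximate 95% confidence interval for the true slope.
ci = (b1 - 2 * se, b1 + 2 * se)

# (d) t-ratio for the null hypothesis that the true slope is -0.05.
t_null = (b1 - (-0.05)) / se

print(round(t_zero, 3), ci, round(t_null, 2))
```

The second t-ratio is well below 2 in absolute value and, equivalently, −0.05 lies inside the approximate 95% confidence interval.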
Long Problems
[1] (13 points) Suppose that the weekly sales (SALES) of a company i depend on advertising (AD) levels
according to the following simple linear regression (SLR) model:
SALESi = 10 + 5ADi + εi ,
where εi is N (0, 4).
The variables and their units are:
SALES = amount of weekly sales (in thousands of dollars)
AD = weekly advertising expenditure (in thousands of dollars)
(a) (3 points) If no money is spent on advertising this week, what is the probability that SALES will be
greater than 10 thousand dollars?
If ADi = 0, then
SALESi = 10 + εi ∼ N(10, 4).
So, P (SALES > 10) = .5.
(b) (2 points) If the company spends one thousand dollars on advertising this week, what is the expected
value of SALES?
If ADi = 1, then
SALESi = 10 + 5 ∗ 1 + εi ∼ N(15, 4).
So, E(SALES) = 15.
(c) (2 points) If the company spends one thousand dollars on advertising this week, what is the standard
deviation of SALES?
If ADi = 1, then
SALESi = 10 + 5 ∗ 1 + εi ∼ N(15, 4).
So, V (SALES) = 4 and Std(SALES) = 2.
(d) (3 points) If the company spends one thousand dollars on advertising this week, what is the approximate
probability that SALES will be between 13 thousand dollars and 17 thousand dollars?
If ADi = 1, then
SALESi = 10 + 5 ∗ 1 + εi ∼ N(15, 4).
So, P(13 < SALES < 17) = P((13 − 15)/2 < (SALES − 15)/2 < (17 − 15)/2) = P(−1 < Z < 1) ≈ 68%.
(e) (3 points) If the company spends one thousand dollars on advertising this week, what is the approximate
probability that SALES will be greater than 19 thousand dollars?
If ADi = 1, then
SALESi = 10 + 5 ∗ 1 + εi ∼ N(15, 4).
So, P(SALES > 19) = P((SALES − 15)/2 > (19 − 15)/2) = P(Z > 2) ≈ 2.5%.
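Parts (a) to (e) can be reproduced with Python's standard library, as in the sketch below. The variable names are hypothetical. The exact normal probabilities differ slightly from the rule-of-thumb answers (for instance, P(Z > 2) is about 2.3% rather than 2.5%).

```python
from statistics import NormalDist

# Model: SALES = 10 + 5 AD + eps, eps ~ N(0, 4), so the error std is 2.
sigma = 4 ** 0.5

sales_no_ad = NormalDist(mu=10 + 5 * 0, sigma=sigma)  # AD = 0
sales_ad_1 = NormalDist(mu=10 + 5 * 1, sigma=sigma)   # AD = 1

p_a = 1 - sales_no_ad.cdf(10)                    # (a) P(SALES > 10 | AD = 0)
mean_b = sales_ad_1.mean                         # (b) E(SALES | AD = 1)
sd_c = sales_ad_1.stdev                          # (c) Std(SALES | AD = 1)
p_d = sales_ad_1.cdf(17) - sales_ad_1.cdf(13)    # (d) P(13 < SALES < 17 | AD = 1)
p_e = 1 - sales_ad_1.cdf(19)                     # (e) P(SALES > 19 | AD = 1)

print(p_a, mean_b, sd_c, round(p_d, 4), round(p_e, 4))
```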
[2] (27 points) For 114 restaurants in NYC we have Zagat ratings on food as well as information on the
price of a meal. We want to understand if there is a relationship between quality (as summarized by
the Zagat ratings) and average price per meal. We run a simple linear regression of price on quality
and obtain the following output
            Estimate   Std error   t-ratio   P-value
Intercept   −18.154    6.553       −2.77     0.007
slope       2.6253     0.3315      7.92      0.000
In addition, we know that s = 8.93 and R-sq = 35.9%
(a) (2 points) Give an economic interpretation for the sign of the slope.
Quality has a positive impact on price. As the rating increases by one, the average price
increases by $2.62.
(b) (2 points) Find an approximate 95% confidence interval for the true slope.
2.6253 ± 2 ∗ 0.3315
(c) (2 points) Test the hypothesis that the slope is equal to zero at the 5% level. (You should be very
precise here.)
P-value = 0.000 < 0.05, so we reject the null hypothesis that the slope is equal to zero.
(d) (3 points) Test the hypothesis that the slope is equal to 2 at the 5% level. (You should use our usual
rule-of-thumb here.)
(2.62 − 2) / 0.3315 = 1.87 < 2, so we fail to reject.
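The calculations in parts (b) to (d) can be sketched as follows, using the unrounded slope from the regression output. The variable names are hypothetical. With the unrounded slope the t-ratio in part (d) comes out slightly higher than the hand calculation, but it stays below 2 and the conclusion is the same.

```python
# Slope estimate and standard error from the regression output.
b1, se = 2.6253, 0.3315

# (b) Approximate 95% confidence interval for the true slope.
ci = (b1 - 2 * se, b1 + 2 * se)

# (c) t-ratio for the null hypothesis that the slope is zero.
t_zero = b1 / se

# (d) t-ratio for the null hypothesis that the slope is 2 (rule of thumb: compare with 2).
t_two = (b1 - 2) / se

print(ci, round(t_zero, 2), round(t_two, 2))
```

Note that 2 lies inside the approximate confidence interval from part (b), which is the same conclusion as in part (d).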
(e) (3 points) You are planning to have dinner in Soho at the restaurant YumYum. You know that the
Zagat rating is 24. How much do you expect to pay?
P̂ = −18.154 + 2.62 ∗ 24 = $44.88
(f) (3 points) Assume the estimated values are the true model parameters and find a 95% predictive
interval for the total price given a rating equal to 24.
$44.88 ± 2 ∗ s = $44.88 ± 2 ∗ (8.93) = ($27.02, $62.74)
(g) (3 points) Compare your result from (f) to the true predictive interval from Minitab which is ($26.831, $62.876) .
What do you notice? Why?
They are very similar. A simple “plug-in” interval (which implies acting as if the estimates
were the true model parameters) is a good approximation to the correct predictive interval
from Chapter 4 in the notes since we have a sufficiently large number of observations (notice
what happens to the predictive interval from Chapter 4 in the notes when n gets very very
large...)
(h) (2 points) Use the information in point (g) and the fact that the t cut-off value t_{112, 0.025} = −1.98, to
find the standard error of the predicted values (s_pred in the notes).
$62.876 = $44.88 + 1.98 ∗ s_pred ⇒ s_pred = ($62.876 − $44.88) / 1.98 = 9.09.
(i) (3 points) Use your result from part (h) and the value of s to find the standard error of the fitted
values (s_fit in the notes).
s_pred² = s_fit² + s² ⇒ s_fit² = s_pred² − s² = 82.62 − 79.74 = 2.88.
Hence, s_fit = √2.88 = 1.69
(j) (2 points) Use your result from part (i) to find a confidence interval for the expected price given a
rating of 24.
$44.88 ± 1.98 ∗ 1.69 = (41.53, 48.22)
(k) (2 points) Do you think quality of food is sufficient to explain the price of a meal? Briefly explain your
answer.
No, there is something else going on. Maybe decor, style and so on. The R-squared is quite
low.
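The chain of calculations in parts (e) to (j) can be sketched in Python as a cross-check. The variable names are hypothetical, and the sketch carries full precision through every step, so its results agree with the hand calculations only up to rounding (for example, the fitted price comes out at about $44.85).

```python
import math

# Estimates from the regression output, and the rating for YumYum.
b0, b1, s = -18.154, 2.6253, 8.93
rating = 24
t_cut = 1.98          # t cut-off value given in part (h)
minitab_upper = 62.876  # upper end of the Minitab predictive interval in part (g)

# (e) Expected price at the given rating.
fitted = b0 + b1 * rating

# (f) Plug-in 95% predictive interval, treating the estimates as the truth.
plug_in = (fitted - 2 * s, fitted + 2 * s)

# (h) Back out s_pred from the upper end of the Minitab interval.
s_pred = (minitab_upper - fitted) / t_cut

# (i) s_pred^2 = s_fit^2 + s^2, solved for s_fit.
s_fit = math.sqrt(s_pred ** 2 - s ** 2)

# (j) Confidence interval for the expected price at the given rating.
ci_mean = (fitted - t_cut * s_fit, fitted + t_cut * s_fit)

print(round(fitted, 2), plug_in, round(s_pred, 2), round(s_fit, 2), ci_mean)
```

The confidence interval for the expected price is much narrower than the predictive interval because s_fit is small relative to s: with 114 observations most of the prediction uncertainty comes from the error term, not from estimating the line.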