UNIVERSITY OF TORONTO AT SCARBOROUGH
Sample Exam
STAC67
Duration - 3 hours
AIDS ALLOWED: THIS EXAM IS OPEN BOOK (NOTES)
Calculator (No phone calculators are allowed)
LAST NAME_____________________________________________________
FIRST NAME_____________________________________________________
STUDENT NUMBER___________________________________________
There are 17 pages including this page.
Total marks: 95
PLEASE CHECK AND MAKE SURE THAT THERE ARE NO MISSING PAGES
IN THIS BOOKLET.
1) The following SAS output (from PROC UNIVARIATE) was obtained from a study of
the relationship between the boiling temperature of water (in degrees Fahrenheit) and the
atmospheric pressure (in inches of mercury). In the SAS outputs below the boiling
temperature is denoted by BT and the atmospheric pressure by AP.
The UNIVARIATE Procedure
Variable: BT
N                        31    Sum Weights              31
Mean                  191.6    Sum Observations     5939.6
Std Deviation    8.37321125    Variance         70.1106667
Skewness         0.58262076    Kurtosis         -0.5660379
Uncorrected SS   1140130.68    Corrected SS        2103.32
Coeff Variation  4.37015201    Std Error Mean   1.50387314
The UNIVARIATE Procedure
Variable: AP
N                        31    Sum Weights              31
Mean             20.0276452    Sum Observations    620.857
Std Deviation    3.86371881    Variance          14.928323
Skewness         0.96406479    Kurtosis         0.23090812
Uncorrected SS   12882.1534    Corrected SS     447.849691
Coeff Variation  19.2919276    Std Error Mean   0.69394438
The CORR Procedure
2 Variables: BT AP
Pearson Correlation Coefficients, N = 31
Prob > |r| under H0: Rho=0
           BT         AP
BT    1.00000    0.98455
                  <.0001
AP    0.98455    1.00000
       <.0001
a) [5 points] Assuming that a linear relationship exists between AP and BT and that the
data satisfy the necessary assumptions, calculate the least squares regression equation of
BT on AP.
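Sol
b1 = r x Sy/Sx = 0.98455 x 8.37321125/3.86371881 = 2.1337 (approx.)
b0 = y_bar - b1 x x_bar = 191.6 - 2.1337 x 20.0276 = 148.87 (approx.)
So the fitted line is BT_hat = 148.87 + 2.1337 AP.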
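As a check on the arithmetic in part a), here is a minimal Python sketch that plugs the summary statistics from the SAS output into b1 = r Sy/Sx and b0 = y_bar - b1 x_bar. The variable names are illustrative only and are not part of the exam.

```python
# Summary statistics copied from the PROC UNIVARIATE and PROC CORR output.
r = 0.98455                              # Pearson correlation of BT and AP
mean_bt, sd_bt = 191.6, 8.37321125       # response (BT)
mean_ap, sd_ap = 20.0276452, 3.86371881  # predictor (AP)

b1 = r * sd_bt / sd_ap       # slope: b1 = r * Sy / Sx
b0 = mean_bt - b1 * mean_ap  # intercept: b0 = y_bar - b1 * x_bar

print(f"BT_hat = {b0:.3f} + {b1:.4f} * AP")
```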
b) [2 points] What proportion of the variability in the boiling temperature of water (i.e.
BT) is explained by this simple linear regression model?
Sol This is R-sq = 0.98455^2 = 0.9693 (approx.), i.e. about 96.9%.
c) [5 points] Calculate a 95% confidence interval for the slope of the regression line.
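Sol Find MSE first using R-sq = 1 - SSE/SST, i.e. SSE = (1 - R-sq) x SST with SST = 2103.32
(the corrected SS of BT), and MSE = SSE/(n - 2). Alternatively use SSR = b1^2 x Sxx and
SSE = SST - SSR. Then use the formula for the CI for b1:
b1 +/- t(0.975; 29) x sqrt(MSE/Sxx), where Sxx = 447.849691 (the corrected SS of AP).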
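The steps for the slope interval can be sketched in Python as follows. This is only an illustration under the stated model assumptions; the critical value t(0.975; 29) = 2.045 is taken from a standard t table rather than from the exam output.

```python
import math

n = 31
r = 0.98455
sst = 2103.32      # corrected SS of BT (total sum of squares)
sxx = 447.849691   # corrected SS of AP

b1 = r * math.sqrt(sst / sxx)   # same as r * Sy / Sx
sse = (1 - r**2) * sst          # from R-sq = 1 - SSE/SST
mse = sse / (n - 2)
se_b1 = math.sqrt(mse / sxx)    # standard error of the slope

t_crit = 2.045                  # t(0.975; 29), table value (assumption)
lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1
print(f"b1 = {b1:.4f}, SE = {se_b1:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```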
2) A researcher wished to study the relation between patient satisfaction (Y) and patient’s
age (X1), severity of illness (X2, an index) and anxiety level (X3). Some SAS outputs for
the regression analysis of his data are given below. You may assume that the model is
appropriate (i.e. satisfies the assumptions needed.) for answering the questions below.
The REG Procedure
Model: MODEL1
Model Crossproducts X'X X'Y Y'Y
Variable    Intercept        x1        x2        x3         y
Intercept          46      1766      2320     105.2      2832
x1               1766     71378     90051    4107.2    103282
x2               2320     90051    117846    5344.7    140814
x3              105.2    4107.2    5344.7    244.62      6327
y                2832    103282    140814      6327    187722
The REG Procedure
Model: MODEL1
Dependent Variable: y
X'X Inverse, Parameter Estimates, and SSE
Variable       Intercept              x1              x2              x3               y
Intercept   3.2477116535    0.0092211391     -0.06793079    -0.067298817    158.49125167
x1          0.0092211391    0.0004560816    -0.000318596    -0.004662271    -1.141611847
x2           -0.06793079    -0.000318596    0.0023924814    -0.017710085    -0.442004262
x3          -0.067298817    -0.004662271    -0.017710085    0.4982577303    -13.47016319
y           158.49125167    -1.141611847    -0.442004262    -13.47016319    4248.8406818
Parameter Estimates
                   Parameter    Standard
Variable    DF      Estimate       Error    t Value    Pr > |t|     Type I SS
Intercept    1     158.49125    18.12589       8.74      <.0001        174353
x1           1      -1.14161     0.21480      -5.31      <.0001    8275.38885
x2           1      -0.44200     0.49197      -0.90      0.3741     480.91529
x3           1     -13.47016     omitted    omitted     omitted     364.15952
i) [4 points] Test whether there is a regression relation between Y and the explanatory
variables X1, X2 and X3. State the null and the alternative hypotheses. Use α = 0.05.
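Sol H0: β1 = β2 = β3 = 0 against Ha: at least one of β1, β2, β3 is not 0.
SSE = 4248.8406818
SST = 187722 - (2832)^2/46 = 13369.30, so SSR = SST - SSE = 9120.46, and so calculate
F = (SSR/3)/(SSE/42) and compare it with F(0.95; 3, 42).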
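A short Python sketch of the overall F statistic, using only numbers from the X'X, X'Y, Y'Y and SSE output above. Note that SST is the corrected total sum of squares, Y'Y - (sum of Y)^2/n. The critical value F(0.95; 3, 42) of roughly 2.83 is a table value assumed here, not part of the exam output.

```python
n, p = 46, 4              # observations; parameters (intercept + 3 slopes)
sum_y = 2832              # 1'Y from the crossproducts table
sum_y2 = 187722           # Y'Y
sse = 4248.8406818        # from the X'X inverse / SSE table

sst = sum_y2 - sum_y**2 / n      # corrected total SS
ssr = sst - sse
msr = ssr / (p - 1)
mse = sse / (n - p)
f_stat = msr / mse

f_crit = 2.83                    # approx. F(0.95; 3, 42), table value (assumption)
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, F = {f_stat:.2f}")
print("reject H0" if f_stat > f_crit else "do not reject H0")
```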
ii) [4 points] Calculate a 95% confidence interval for β3 (the coefficient of X3 in the
above model)
Sol beta3_hat = -13.47016
S^2(beta3_hat) = MSE x the diagonal element of X'X Inverse for x3
= (SSE/(46-(3+1))) x 0.4982577303
CI = beta3_hat +/- t(0.975; 42) x S(beta3_hat)
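The interval for β3 can be sketched numerically as below, with n - p = 46 - 4 = 42 error degrees of freedom. The value t(0.975; 42) = 2.018 is taken from a t table and is an assumption of this sketch.

```python
import math

sse = 4248.8406818
df_error = 46 - 4                 # n minus number of parameters
mse = sse / df_error

b3 = -13.47016                    # estimate for x3
c33 = 0.4982577303                # (x3, x3) element of (X'X)^-1
se_b3 = math.sqrt(mse * c33)

t_crit = 2.018                    # t(0.975; 42), table value (assumption)
ci_b3 = (b3 - t_crit * se_b3, b3 + t_crit * se_b3)
print(f"SE(b3) = {se_b3:.4f}, 95% CI = ({ci_b3[0]:.3f}, {ci_b3[1]:.3f})")
```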
iii) [4 points] Calculate a 95% confidence interval for β2 - β3 (β2 and β3 are the
coefficients of X2 and X3 respectively in the above model).
Sol estimate of β2 - β3 = -0.44200 - (-13.47016) = 13.02816
SE^2 of (beta2_hat - beta3_hat) = S^2(beta2_hat) + S^2(beta3_hat) - 2 x cov(beta2_hat, beta3_hat)
cov(beta2_hat, beta3_hat) is MSE x the (x2, x3) element of X'X Inverse, i.e. MSE x (-0.017710085)
CI = estimate +/- t(0.975; 42) x SE
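A minimal sketch of the computation for β2 - β3, reading the needed elements of (X'X)^-1 from the SAS output. As before, t(0.975; 42) = 2.018 is an assumed table value.

```python
import math

mse = 4248.8406818 / (46 - 4)

b2, b3 = -0.44200, -13.47016
c22 = 0.0023924814      # (x2, x2) element of (X'X)^-1
c33 = 0.4982577303      # (x3, x3) element
c23 = -0.017710085      # (x2, x3) element

diff = b2 - b3
# Var(b2 - b3) = Var(b2) + Var(b3) - 2 Cov(b2, b3)
var_diff = mse * (c22 + c33 - 2 * c23)
se_diff = math.sqrt(var_diff)

t_crit = 2.018          # t(0.975; 42), table value (assumption)
ci_diff = (diff - t_crit * se_diff, diff + t_crit * se_diff)
print(f"estimate = {diff:.5f}, SE = {se_diff:.4f}")
print(f"95% CI = ({ci_diff[0]:.3f}, {ci_diff[1]:.3f})")
```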
iv) [4 points] Calculate and interpret the value of the coefficient of partial determination
between Y and X2, given that X1 is in the model.
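Sol R-sq(Y, X2 | X1) = SSR(X2|X1)/SSE(X1)
We have SST. SSR(X1) = Type I SS for X1 = 8275.38885
and
SSE(X1) = SST - SSR(X1)
SSR(X2|X1) = Type I SS for X2 = 480.91529
Interpretation: this is the proportion of the variation in Y left unexplained by X1 that is
explained when X2 is added to the model.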
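The coefficient of partial determination can be evaluated with a few lines of Python. This sketch uses the corrected total SS, SST = Y'Y - (sum of Y)^2/n, and the Type I sums of squares from the output.

```python
n = 46
sst = 187722 - 2832**2 / n     # corrected total SS
ssr_x1 = 8275.38885            # Type I SS for X1
ssr_x2_given_x1 = 480.91529    # Type I SS for X2

sse_x1 = sst - ssr_x1          # error SS of the model with X1 only
r2_partial = ssr_x2_given_x1 / sse_x1
print(f"SSE(X1) = {sse_x1:.2f}, partial R-sq(Y, X2 | X1) = {r2_partial:.4f}")
```

On these numbers X2 accounts for a little under 10% of the variation in Y that X1 leaves unexplained.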
v) [4 points] Test whether both X2 and X3 can be dropped from the model (i.e. keeping
only X1 in the model). Use α = 0.05.
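Sol SSdrop = SSR(X2, X3 | X1) = 480.9 + 364.16 = 845.07 (the sum of the Type I SS for X2 and X3).
Then F = (SSdrop/2)/MSE with MSE = SSE/42, compared with F(0.95; 2, 42).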
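The drop (partial F) test for X2 and X3 can be sketched as follows. The critical value F(0.95; 2, 42) of roughly 3.22 comes from an F table and is an assumption here.

```python
ss_drop = 480.91529 + 364.15952   # SSR(X2, X3 | X1) from the Type I SS
df_drop = 2
mse_full = 4248.8406818 / (46 - 4)

f_drop = (ss_drop / df_drop) / mse_full
f_crit = 3.22                     # approx. F(0.95; 2, 42), table value (assumption)

print(f"SSdrop = {ss_drop:.2f}, F = {f_drop:.3f}")
print("reject H0: beta2 = beta3 = 0" if f_drop > f_crit else "do not reject H0")
```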
vi) [4 points] Test whether both X1 and X2 can be dropped from the model (i.e. keeping
only X3 in the model). Use α = 0.05.
Sol To calculate SSR(reduced), calculate b1 for the simple linear regression of Y on X3, then
SSR(X3) = b1^2 x Sxx (Sxx = corrected sum of squares of X3), and then use the drop test:
F = [(SSR(full) - SSR(X3))/2]/MSE(full).
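A sketch of this calculation, building the simple regression of Y on X3 from the crossproducts table (sums of X3, Y, X3^2 and X3*Y). The critical value F(0.95; 2, 42) of roughly 3.22 is an assumed table value.

```python
n = 46
sum_x3, sum_y = 105.2, 2832
sum_x3_sq, sum_x3_y, sum_y_sq = 244.62, 6327, 187722

sxx = sum_x3_sq - sum_x3**2 / n          # corrected SS of X3
sxy = sum_x3_y - sum_x3 * sum_y / n      # corrected cross-product
b1 = sxy / sxx                           # slope of Y on X3 alone
ssr_x3 = b1**2 * sxx                     # SSR of the reduced model

sse_full = 4248.8406818
sst = sum_y_sq - sum_y**2 / n
ssr_full = sst - sse_full
mse_full = sse_full / (n - 4)

f_drop = ((ssr_full - ssr_x3) / 2) / mse_full
f_crit = 3.22                            # approx. F(0.95; 2, 42) (assumption)
print(f"b1 = {b1:.3f}, SSR(X3) = {ssr_x3:.1f}, F = {f_drop:.2f}")
```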
vii) [4 points] Give the ANOVA table (with all entries calculated) for the regression
model for Y with the two independent variables X1 and X2.
3) (Based on q16 p173 Terry; this is q5 STAB27 Final W08) A company designing and
marketing lighting fixtures needed to develop forecasts of sales (SALES = total monthly
sales in thousands of dollars). The company considered the following predictors:
ADEX = advertising expense in thousands of dollars
MTGRATE = mortgage rate for 30-year loans (%)
HSSTARTS = housing starts in thousands of units
The company collected data on these variables and the SAS outputs below were obtained
from this study.
The REG Procedure
Model: MODEL1
Dependent Variable: SALES
Number of Observations Read    46
Analysis of Variance
                            Sum of       Mean
Source             DF      Squares     Square    F Value    Pr > F
Model               3      6187071    2062357      84.42    <.0001
Error              42      1026032      24429
Corrected Total    45      7213102
Root MSE           156.29883    R-Square    0.8578
Dependent Mean    1631.32609    Adj R-Sq    0.8476
Parameter Estimates
                   Parameter     Standard                           Variance
Variable    DF      Estimate        Error    t Value    Pr > |t|    Inflation
Intercept    1    1612.46147    432.34267       3.73      0.0006            0
ADEX         1       0.32736      0.43919       0.75      0.4602      2.82856
MTGRATE      1    -151.22802     39.74780      -3.80      0.0005      2.99244
HSSTARTS     1      12.86316      1.18625      10.84      <.0001      1.14005
[Figure: Plot of Residuals vs Predicted Values (Response: SALES). Vertical axis: Residuals; horizontal axis: Predicted Values.]
[Figure: Plot of Residuals vs Normal Scores (Response: SALES). Vertical axis: Residuals; horizontal axis: Normal Scores.]
i) [3 points] Calculate the value of R-squared for the regression of ADEX on MTGRATE
and HSSTARTS.
Ans VIF(ADEX) = 1/(1 - R-sq(ADEX | MTGRATE, HSSTARTS)) = 2.82856
And so R-sq = 1 - 1/2.82856 = 0.646463218
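The inversion of the VIF formula is a one-liner; the snippet below is just a quick arithmetic check using the variance inflation factor for ADEX from the SAS output.

```python
vif_adex = 2.82856             # Variance Inflation for ADEX

# VIF = 1 / (1 - R^2)  =>  R^2 = 1 - 1 / VIF
r2_adex = 1 - 1 / vif_adex
print(f"R-sq(ADEX on MTGRATE, HSSTARTS) = {r2_adex:.4f}")
```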
Here is the complete output:
Regression Analysis: ADEX versus MTGRATE, HSSTARTS
The regression equation is
ADEX = 766 - 71.2 MTGRATE - 0.067 HSSTARTS
Predictor       Coef    SE Coef        T        P    VIF
Constant      765.54      94.38     8.11    0.000
MTGRATE      -71.219      8.516    -8.36    0.000    1.1
HSSTARTS     -0.0670     0.4118    -0.16    0.871    1.1
S = 54.2712    R-Sq = 64.6%    R-Sq(adj) = 63.0%
ii) State whether the following statements are true or false. Circle your answer. [1 point
for each part]
a) The residual plots above show that the distribution of residuals is left-skewed. (True /
False)
Ans F
b) The residual plots above show clear evidence of non-constant variance of errors.
(True / False)
Ans F
c) The small p-value (p = 0.000 from the ANOVA table) for the global F-test for model 1
implies that all three variables should be retained in the model. (True / False)
Ans F
d) If we add another predictor to the above model with three predictors (so that we have
4 predictors), the SSE for that model (i.e. the model with 4 predictors) will be greater
than 1026032. (True / False)
Ans F, SSE cannot increase as k increases
e) If we add another predictor to the above model with three predictors (so that we have
4 predictors), the SSRegression for that model (i.e. the model with 4 predictors) will be
less than 6187071. (True / False)
Ans F, SSReg cannot decrease as k increases.
f) If we add another predictor to the above model with three predictors (so that we have
4 predictors), the SSTotal for that model (i.e. the model with 4 predictors) will be less
than 7213102. (True / False)
Ans F, SST does not depend on the X's
g) The value of the adjusted R-squared for the regression model for SALES on
MTGRATE and HSSTARTS (i.e. with only two predictors) will be less than 0.8476.
(True / False)
Ans F (see the output below, where R-Sq(adj) = 84.9%)
Regression Analysis: SALES versus MTGRATE, HSSTARTS
The regression equation is
SALES = 1863 - 175 MTGRATE + 12.8 HSSTARTS
Predictor       Coef    SE Coef        T        P    VIF
Constant      1863.1      270.4     6.89    0.000
MTGRATE      -174.54      24.40    -7.15    0.000    1.1
HSSTARTS      12.841      1.180    10.88    0.000    1.1
S = 155.489    R-Sq = 85.6%    R-Sq(adj) = 84.9%
4) [5 points] A researcher suspected that the systolic blood pressure of individuals is
related to weight. He calculated the least squares regression equation of systolic blood
pressure on weight based on a sample of 14 individuals. The estimated slope of this
simple linear regression model was 0.13173 with a standard error of 0.04625 (i.e.
b1 = 0.13173 and s_b1 = 0.04625). Calculate the correlation between systolic blood pressure
and weight for this sample of individuals.
Sol
t = 0.13173/0.04625 = 2.848216216
F = R-sq/[(1 - R-sq)/(14 - 2)] = t^2 = 8.112335613
And so R-sq = 8.11/(12 + 8.11) = 0.4032819493
Since the slope is positive, r = +sqrt(R-sq) = 0.635 (approx.).
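The chain t -> F -> R-sq -> r can be checked numerically. The sketch below uses the unrounded t^2, so R-sq differs from the hand calculation above in the fourth decimal place; r takes the sign of the slope.

```python
import math

n = 14
b1, se_b1 = 0.13173, 0.04625

t_stat = b1 / se_b1
f_stat = t_stat**2                      # in simple regression F = t^2
r_sq = f_stat / (f_stat + (n - 2))      # solve F = R^2 / [(1 - R^2)/(n - 2)]
r = math.copysign(math.sqrt(r_sq), b1)  # r has the same sign as the slope

print(f"t = {t_stat:.4f}, R-sq = {r_sq:.4f}, r = {r:.3f}")
```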
This question is based on the data from the summer 06 B22 final (regression question). Here
are some useful outputs.
Systolic blood pressure readings of individuals are thought to be related to weight. The
following MINITAB output was obtained from a regression analysis of systolic blood
pressure on weight (in pounds). The next five questions are based on this information.
Descriptive Statistics: Systolic, Weight
Variable     N   N*     Mean   SE Mean   StDev   Minimum       Q1   Median       Q3
Systolic    14    0   154.50      1.49    5.57    145.00   150.75   153.50   158.50
Weight      14    0   194.07      7.18   26.86    164.00   173.00   188.00   212.00
Correlations: Systolic, Weight
Pearson correlation of Systolic and Weight = 0.635
The regression equation is
Systolic = 129 + 0.132 Weight
Predictor       Coef      StDev        T        P
Constant     128.935      9.055    14.24    0.000
Weight       0.13173    0.04625     2.85    0.015
R-Sq = (omitted)
Analysis of Variance
Source            DF        SS        MS       F        P
Regression         1    162.75    162.75    8.11    0.015
Residual Error    12    240.75     20.06
Total             13    403.50
5) The data and some useful information on a response variable y and two explanatory
variables x1 and x2 are given below:
y     3   5   8  11   7   6  12   3
x1    1   2   2   3   3   2   3   2
x2    1   1   2   2   4   4   2   3

$$(X'X)^{-1} = \begin{pmatrix} 1.67 & -0.57 & -0.11 \\ -0.57 & 0.33 & -0.08 \\ -0.11 & -0.08 & 0.12 \end{pmatrix}$$
a) [ 4 points] Estimate the linear regression model for y on the two explanatory variables
x1 and x2.
Sol Use $b = (X'X)^{-1} X'Y$, where $X'Y = (\sum y, \sum x_1 y, \sum x_2 y)' = (55, 137, 131)'$.
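A small pure-Python sketch of b = (X'X)^-1 X'Y for this data set. It uses the more precise inverse matrix printed in the MINITAB output further below (XPXI3) rather than the two-decimal version in the question, so that the result matches the fitted equation; with the rounded matrix the estimates differ slightly.

```python
y = [3, 5, 8, 11, 7, 6, 12, 3]
x1 = [1, 2, 2, 3, 3, 2, 3, 2]
x2 = [1, 1, 2, 2, 4, 4, 2, 3]

# (X'X)^-1 as printed by MINITAB (matrix XPXI3).
xtx_inv = [
    [1.67373, -0.572034, -0.110169],
    [-0.572034, 0.334746, -0.076271],
    [-0.110169, -0.076271, 0.118644],
]

# X'Y = (sum y, sum x1*y, sum x2*y)
xty = [
    sum(y),
    sum(a * b for a, b in zip(x1, y)),
    sum(a * b for a, b in zip(x2, y)),
]

# b = (X'X)^-1 X'Y, one row of the inverse at a time
b = [sum(row[j] * xty[j] for j in range(3)) for row in xtx_inv]
print("X'Y =", xty)
print("b0, b1, b2 =", [round(v, 3) for v in b])
```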
b) [ 6 points] MSE for the simple linear regression model of y on x1 is 4.5. Test for the
lack of fit of this model (i.e. simple linear regression model of y on x1) using pure error
sums of squares.
Regression Analysis: y versus x1, x2
The regression equation is
y = - 0.75 + 4.41 x1 - 0.966 x2
Predictor       Coef    SE Coef        T        P
Constant      -0.746      2.642    -0.28    0.789
x1             4.407      1.181     3.73    0.014
x2           -0.9661     0.7033    -1.37    0.228
S = 2.04193    R-Sq = 73.6%    R-Sq(adj) = 63.0%
Analysis of Variance
Source            DF        SS        MS       F        P
Regression         2    58.028    29.014    6.96    0.036
Residual Error     5    20.847     4.169
Total              7    78.875
Source    DF    Seq SS
x1         1    50.161
x2         1     7.867
MTB > info
Information on the Worksheet
Column    Count    Name
C1            8    y
C2            8    x1
C3            8    x2
M3        3 x 3    XPXI3
MTB > print XPXI3
Data Display
Matrix XPXI3
  1.67373    -0.57203    -0.11017
-0.572034    0.334746   -0.076271
-0.110169   -0.076271    0.118644
MTB > Regress 'y' 1 'x1' ;
SUBC> Constant;
Regression Analysis: y versus x1
The regression equation is
y = - 1.64 + 3.79 x1
Predictor      Coef    SE Coef        T        P
Constant     -1.643      2.742    -0.60    0.571
x1            3.786      1.169     3.24    0.018
S = 2.18763    R-Sq = 63.6%    R-Sq(adj) = 57.5%
Analysis of Variance
Source            DF        SS        MS        F        P
Regression         1    50.161    50.161    10.48    0.018
Residual Error     6    28.714     4.786
Total              7    78.875
MTB > Regress 'y' 1 'x1' ;
SUBC> Constant;
SUBC> Pure;
SUBC> Brief 2.
Regression Analysis: y versus x1
The regression equation is
y = - 1.64 + 3.79 x1
Predictor      Coef    SE Coef        T        P
Constant     -1.643      2.742    -0.60    0.571
x1            3.786      1.169     3.24    0.018
S = 2.18763    R-Sq = 63.6%    R-Sq(adj) = 57.5%
Analysis of Variance
Source            DF        SS        MS        F        P
Regression         1    50.161    50.161    10.48    0.018
Residual Error     6    28.714     4.786
  Lack of Fit      1     1.714     1.714     0.32    0.597
  Pure Error       5    27.000     5.400
Total              7    78.875
1 rows with no replicates
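The lack-of-fit decomposition in the MINITAB output can be reproduced from the raw data. The sketch below groups the y values by x1 to get the pure error SS, recomputes SSE for the simple regression of y on x1 directly from the data (it agrees with the 28.714 in the output), and forms F = MSLF/MSPE.

```python
from collections import defaultdict

y = [3, 5, 8, 11, 7, 6, 12, 3]
x1 = [1, 2, 2, 3, 3, 2, 3, 2]
n = len(y)

# Pure error: within-group variation of y at each distinct x1 value.
groups = defaultdict(list)
for xi, yi in zip(x1, y):
    groups[xi].append(yi)
sspe = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups.values())
df_pe = n - len(groups)

# SSE of the simple linear regression of y on x1.
xbar, ybar = sum(x1) / n, sum(y) / n
sxx = sum((v - xbar) ** 2 for v in x1)
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x1, y))
sst = sum((v - ybar) ** 2 for v in y)
sse = sst - sxy**2 / sxx
df_e = n - 2

# Lack of fit = residual error minus pure error.
sslf, df_lf = sse - sspe, df_e - df_pe
f_lof = (sslf / df_lf) / (sspe / df_pe)
print(f"SSPE = {sspe:.3f} (df {df_pe}), SSLF = {sslf:.3f} (df {df_lf}), F = {f_lof:.2f}")
```

An F of about 0.32 on (1, 5) degrees of freedom gives no evidence of lack of fit, in line with the p-value 0.597 in the output.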
5) [5 points] Consider the simple linear regression model $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ with the
usual assumptions (i.e. $E(\varepsilon_i) = 0$ for all $i$, $V(\varepsilon_i) = \sigma^2$ for all $i$, and $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ whenever
$i \neq j$. The normality of the $\varepsilon_i$'s is not required for the results below.).
Let $b_0$ and $b_1$ be the least squares estimators of $\beta_0$ and $\beta_1$ respectively.
Prove that
$$\mathrm{Var}[e_i] = \sigma^2 \left[ 1 - \frac{1}{n} - \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2} \right], \quad \text{where } e_i = Y_i - \hat{Y}_i .$$
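Sol
$$\mathrm{Var}[e_i] = \mathrm{Var}[Y_i - \hat{Y}_i] = \mathrm{Var}[Y_i] + \mathrm{Var}[\hat{Y}_i] - 2\,\mathrm{Cov}[Y_i, \hat{Y}_i]$$
$$\mathrm{Cov}[Y_i, \hat{Y}_i] = \mathrm{Cov}[Y_i, b_0 + b_1 X_i] = \mathrm{Cov}\Big[Y_i, \sum_{j=1}^{n} k'_j Y_j + X_i \sum_{j=1}^{n} k_j Y_j\Big]$$
where $k_j = \dfrac{X_j - \bar{X}}{S_{XX}}$, $S_{XX} = \sum_{j=1}^{n} (X_j - \bar{X})^2$ and $k'_j = \dfrac{1}{n} - \bar{X} k_j$.
Since the $Y_j$ are uncorrelated, only the $j = i$ terms contribute:
$$\mathrm{Cov}[Y_i, \hat{Y}_i] = k'_i\,\mathrm{Cov}(Y_i, Y_i) + X_i k_i\,\mathrm{Cov}(Y_i, Y_i) = \sigma^2 [k'_i + X_i k_i]$$
$$= \sigma^2 \left[ \frac{1}{n} - \bar{X} k_i + X_i k_i \right] = \sigma^2 \left[ \frac{1}{n} + (X_i - \bar{X}) k_i \right] = \sigma^2 \left[ \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2} \right]$$
and so
$$\mathrm{Var}[e_i] = \mathrm{Var}[Y_i] + \mathrm{Var}[\hat{Y}_i] - 2\,\mathrm{Cov}[Y_i, \hat{Y}_i]$$
$$= \sigma^2 + \sigma^2 \left[ \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2} \right] - 2\sigma^2 \left[ \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2} \right]$$
$$= \sigma^2 \left[ 1 - \frac{1}{n} - \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2} \right] \qquad \blacksquare$$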
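The identity can also be sanity-checked numerically. In the sketch below the X values are hypothetical (chosen only for illustration) and σ^2 is set to 1. Writing e_i as a linear combination of the uncorrelated Y_j, with Y_hat_i = sum over j of h_ij Y_j and h_ij = 1/n + (X_i - X_bar)(X_j - X_bar)/S_XX, gives Var[e_i] directly as σ^2 times the sum of squared coefficients, which is then compared with the closed form.

```python
# Hypothetical design points, used only to illustrate the identity.
x = [1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 3.0, 2.0]
sigma2 = 1.0

n = len(x)
xbar = sum(x) / n
sxx = sum((v - xbar) ** 2 for v in x)

def h(i, j):
    """Weight of Y_j in the fitted value Y_hat_i."""
    return 1 / n + (x[i] - xbar) * (x[j] - xbar) / sxx

max_gap = 0.0
for i in range(n):
    # e_i = sum_j (delta_ij - h_ij) Y_j with uncorrelated Y_j of variance sigma2
    direct = sigma2 * sum(((1.0 if i == j else 0.0) - h(i, j)) ** 2 for j in range(n))
    closed_form = sigma2 * (1 - 1 / n - (x[i] - xbar) ** 2 / sxx)
    max_gap = max(max_gap, abs(direct - closed_form))

print(f"largest discrepancy over all i: {max_gap:.2e}")
```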
6) A psychologist conducted a study to examine the nature of the relation, if any, between
an employee’s emotional stability (X) and the employee’s ability to perform in a task
group (Y). Emotional stability was measured by a written test, for which the higher the
score, the greater the emotional stability. Ability to perform in a task group (Y = 1 if able,
Y = 0 if unable) was evaluated by the supervisor. The psychologist is considering a
logistic regression model for the data. The SAS output below is based on the results for
27 employees.
The SAS System
The LOGISTIC Procedure
Model Information
Data Set                      WORK.A
Response Variable             Y
Number of Response Levels     2
Number of Observations        27
Model                         binary logit
Optimization Technique        Fisher's scoring
Response Profile
Ordered             Total
  Value    Y    Frequency
      1    0           13
      2    1           14
Probability modeled is Y=1.
Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        8.1512     1        0.0043
Score                   7.3223     1        0.0068
Wald                    5.7692     1        0.0163
Analysis of Maximum Likelihood Estimates
                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept     1    -10.3089      4.3770        5.5472        0.0185
X             1      0.0189     0.00788        5.7692        0.0163
Odds Ratio Estimates
             Point          95% Wald
Effect    Estimate      Confidence Limits
X          omitted     omitted     omitted
[2 points] i) Estimate the probability that an employee with emotional stability score of
500 (i.e. X = 500) will be able to perform the task.
[4 points] ii) Calculate a 90 percent confidence interval for the odds ratio of X.
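One possible way to approach both parts, sketched in Python from the maximum likelihood estimates in the output: part i) evaluates the fitted logistic function at X = 500, and part ii) exponentiates a Wald interval for the slope. The normal critical value z(0.95) = 1.645 is a table value, and the variable names are illustrative.

```python
import math

b0, b1 = -10.3089, 0.0189    # intercept and slope estimates
se_b1 = 0.00788              # standard error of the slope

# i) estimated P(Y = 1 | X = 500) under the binary logit model
eta = b0 + b1 * 500
p_hat = 1 / (1 + math.exp(-eta))

# ii) 90% Wald interval for the odds ratio exp(beta1)
z = 1.645                    # z(0.95), table value (assumption)
odds_ratio = math.exp(b1)
or_ci = (math.exp(b1 - z * se_b1), math.exp(b1 + z * se_b1))

print(f"P(Y=1 | X=500) = {p_hat:.4f}")
print(f"odds ratio = {odds_ratio:.4f}, 90% CI = ({or_ci[0]:.4f}, {or_ci[1]:.4f})")
```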
7) A personnel officer in a company administered four aptitude tests to each of 25
applicants for entry-level clerical positions. For purpose of this study, all 25 applicants
were accepted for positions irrespective of their test scores. After a period each applicant
was rated for proficiency (denoted by Y) on the job. The SAS output below is intended to
identify the best subset of the four tests (denoted by X1, X2, X3, and X4).
The SAS System
The REG Procedure
Model: MODEL1
Dependent Variable: Y
Adjusted R-Square Selection Method
Number of Observations Read    25
Number of Observations Used    25
Number in    Adjusted
    Model    R-Square    R-Square        C(p)    Variables in Model
        3      0.9560      0.9615      3.7274    X1 X3 X4
        4      0.9555      0.9629      5.0000    X1 X2 X3 X4
        2      0.9269      0.9330     17.1130    X1 X3
        3      0.9247      0.9341     18.5215    X1 X2 X3
        2      0.8661      0.8773     47.1540    X3 X4
        3      0.8617      0.8790     48.2310    X2 X3 X4
        3      0.8233      0.8454     66.3465    X1 X2 X4
        2      0.7985      0.8153     80.5653    X1 X4
        1      0.7962      0.8047     84.2465    X3
        2      0.7884      0.8061     85.5196    X2 X3
        2      0.7636      0.7833     97.7978    X2 X4
        1      0.7452      0.7558    110.5974    X4
        2           A      0.4642    269.7800    X1 X2
        1      0.2326      0.2646    375.3447    X1
        1      0.2143      0.2470    384.8325    X2
Even though this SAS output is for R-square selection method, it has useful information
that can be used in other selection methods.
a) [5 points] Identify the variable that will enter the model at the second step of the
stepwise regression procedure. Explain clearly how you identified this variable.
Sol X3 has the largest R-sq among the four single-variable models and so it enters the
model at the first step (assuming F = [R-sq/df_reg]/[(1 - R-sq)/df_error] is significant at the
required significance level to enter the model).
Now among the three two-variable models containing X3, the model with X1 has the
highest R-sq, and so X1 has the largest t-ratio (in absolute value) among the candidates to join X3.
If a variable is selected at this step, it must be X1 (see the SAS stepwise output below).
b) [2 points] Identify the variables that you will select if you want to use Mallows'
C(p) criterion. Explain clearly the reason for your answer.
Sol Choose the model with Cp close to the number of variables + 1 (other than the model with all
variables, for which this always holds).
E.g. the model with X1 X3 X4, which has Cp = 3.7274 (close to 4).
c) [3 points] Calculate the value of the adjusted R-square for the model with the
predictors X1 and X2 only. (Note this is the model for which the adjusted R-square has
been deleted in the above SAS output.)
Ans
Adjusted R-sq = 1 - [(n - 1)/(n - p)] x (1 - R-sq) = 1 - (24/22) x (1 - 0.4642) = 0.4154909091
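The same arithmetic in Python, written as a small helper so it can be checked against other rows of the selection table. The function name is hypothetical and introduced only for this illustration.

```python
def adjusted_r2(r2, n, n_predictors):
    """Adjusted R-sq = 1 - (n - 1)/(n - p) * (1 - R-sq), with p = n_predictors + 1."""
    p = n_predictors + 1
    return 1 - (n - 1) / (n - p) * (1 - r2)

adj_x1_x2 = adjusted_r2(0.4642, n=25, n_predictors=2)   # the deleted entry "A"
adj_x1 = adjusted_r2(0.2646, n=25, n_predictors=1)      # check against the X1-only row
print(f"A = {adj_x1_x2:.4f}; X1-only model: {adj_x1:.4f}")
```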
Here are some useful outputs
The SAS System
The REG Procedure
Model: MODEL1
Dependent Variable: Y
Adjusted R-Square Selection Method
Number of Observations Read    25
Number of Observations Used    25
Number in    Adjusted
    Model    R-Square    R-Square        C(p)    Variables in Model
        3      0.9560      0.9615      3.7274    X1 X3 X4
        4      0.9555      0.9629      5.0000    X1 X2 X3 X4
        2      0.9269      0.9330     17.1130    X1 X3
        3      0.9247      0.9341     18.5215    X1 X2 X3
        2      0.8661      0.8773     47.1540    X3 X4
        3      0.8617      0.8790     48.2310    X2 X3 X4
        3      0.8233      0.8454     66.3465    X1 X2 X4
        2      0.7985      0.8153     80.5653    X1 X4
        1      0.7962      0.8047     84.2465    X3
        2      0.7884      0.8061     85.5196    X2 X3
        2      0.7636      0.7833     97.7978    X2 X4
        1      0.7452      0.7558    110.5974    X4
        2      0.4155      0.4642    269.7800    X1 X2
        1      0.2326      0.2646    375.3447    X1
        1      0.2143      0.2470    384.8325    X2
The SAS System
The REG Procedure
Model: MODEL2
Dependent Variable: Y
Number of Observations Read    25
Number of Observations Used    25

Stepwise Selection: Step 1
Variable X3 Entered: R-Square = 0.8047 and C(p) = 84.2465
Analysis of Variance
                              Sum of          Mean
Source             DF        Squares        Square    F Value    Pr > F
Model               1     7285.97715    7285.97715      94.78    <.0001
Error              23     1768.02285      76.87056
Corrected Total    24     9054.00000
              Parameter     Standard
Variable       Estimate        Error    Type II SS    F Value    Pr > F
Intercept    -106.13284     20.44719    2071.05812      26.94    <.0001
X3              1.96759      0.20210    7285.97715      94.78    <.0001
Bounds on condition number: 1, 1

Stepwise Selection: Step 2
Variable X1 Entered: R-Square = 0.9330 and C(p) = 17.1130
Analysis of Variance
                              Sum of          Mean
Source             DF        Squares        Square    F Value    Pr > F
Model               2     8447.34255    4223.67128     153.17    <.0001
Error              22      606.65745      27.57534
Corrected Total    24     9054.00000
              Parameter     Standard
Variable       Estimate        Error    Type II SS    F Value    Pr > F
Intercept    -127.59569     12.68526    2789.93352     101.17    <.0001
X1              0.34846      0.05369    1161.36540      42.12    <.0001
X3              1.82321      0.12307    6051.48790     219.45    <.0001
Bounds on condition number: 1.0338, 4.1351

Stepwise Selection: Step 3
Variable X4 Entered: R-Square = 0.9615 and C(p) = 3.7274
Analysis of Variance
                              Sum of          Mean
Source             DF        Squares        Square    F Value    Pr > F
Model               3     8705.80299    2901.93433     175.02    <.0001
Error              21      348.19701      16.58081
Corrected Total    24     9054.00000
              Parameter     Standard
Variable       Estimate        Error    Type II SS    F Value    Pr > F
Intercept    -124.20002      9.87406    2623.35826     158.22    <.0001
X1              0.29633      0.04368     763.11559      46.02    <.0001
X3              1.35697      0.15183    1324.38825      79.87    <.0001
X4              0.51742      0.13105     258.46044      15.59    0.0007
Bounds on condition number: 2.8335, 19.764

All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the
model.
Summary of Stepwise Selection
        Variable    Variable    Number     Partial       Model
Step    Entered     Removed     Vars In    R-Square    R-Square       C(p)    F Value    Pr > F
   1    X3                            1      0.8047      0.8047    84.2465      94.78    <.0001
   2    X1                            2      0.1283      0.9330    17.1130      42.12    <.0001
   3    X4                            3      0.0285      0.9615     3.7274      15.59    0.0007
8) In a study of the larvae growing in a lake, the researchers collected data on the
following variables.
Y = The number of larvae of the Chaoborous collected in a sample of the sediment from
an area of approximately 225 cm^2 of the lake bottom
X1 = The dissolved oxygen (mg/l) in the water at the bottom
X2 = The depth (m) of the lake at the sampling point
Some useful SAS outputs for fitting the regression model
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$, using the data from this study are given below.
Assume that the model given below is appropriate (i.e. satisfies all the necessary
assumptions) to answer the questions below.
The REG Procedure
Model: MODEL1
Dependent Variable: Y
Number of Observations Read    14
Number of Observations Used    14
Analysis of Variance
                              Sum of          Mean
Source             DF        Squares        Square    F Value    Pr > F
Model               3     1311.15692     437.05231      28.23    <.0001
Error              10      154.84308      15.48431
Corrected Total    13     1466.00000
Parameter Estimates
                   Parameter    Standard
Variable    DF      Estimate       Error    t Value    Pr > |t|     Type I SS
Intercept    1      24.30070     7.91874       3.07      0.0119    5054.00000
X1           1      -2.31549     1.08721      -2.13      0.0590    1197.46122
X2           1       1.22493     0.95978       1.28      0.2307     113.67566
X1X2         1      -0.00563     0.15660      -0.04      0.9720       0.02003
State whether each of the following statements is true or false (based on the information
given above). [1 point for each part]
i) The effect of the amount of oxygen dissolved in water (i.e. X1) on the number of larvae
depends on the depth (at α = 0.1)
(True / False)
Ans F, the p-value for X1X2 is > 0.1
ii) The terms X2 and X1X2 have no significant contribution to the model and so both
these terms can be dropped from the above model (at α = 0.1)
(True / False)
Ans False. SSR(X2, X1X2 | X1) = 113.68 + 0.02 = 113.7
F = (113.7/2)/15.48 = 56.85/15.48 = 3.67248062
p-value = 1 - 0.936301 = 0.063699 < 0.10,
F(2, 10, 0.10) = 2.92 < 3.67 and so reject H0.
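The partial F statistic for dropping X2 and X1X2 together can be recomputed from the Type I sums of squares and the full-model MSE. The critical value 2.92 is the F(2, 10) upper 10% point quoted in the answer above.

```python
ss_x2 = 113.67566        # Type I SS for X2 (given X1)
ss_x1x2 = 0.02003        # Type I SS for X1X2 (given X1, X2)
mse_full = 15.48431      # error mean square of the full model (10 df)

ss_drop = ss_x2 + ss_x1x2
f_partial = (ss_drop / 2) / mse_full
f_crit = 2.92            # F(0.90; 2, 10) as quoted in the answer

print(f"F = {f_partial:.3f}; reject H0 at alpha = 0.10: {f_partial > f_crit}")
```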
iii) The value of the t-statistic for testing the null hypothesis H0: β2 = 1 against
H1: β2 > 1 is greater than 1.20.
(True / False)
Ans F
t = (b2 - 1)/SE(b2) = (1.2249 - 1)/0.9598 = 0.2343196499 < 1.20
iv) The p-value of the t-test for testing the null hypothesis H0: β1 = 0 against H1: β1 > 0
is less than 0.10.
(True / False)
Ans F
The t value for X1 is -2.13 with two-sided p-value 0.0590, so the upper-tail
P-value = 1 - 0.0295 = 0.9705
v) The sum of squares of errors (SSE) for the simple linear regression model of Y on X1
is greater than 150.0. (True / False)
Ans T, it is greater than the SSE for the bigger model above, i.e. 154.84.
Multiple-choice questions (Miscellaneous) (2 points for each question)
9) If the slope of a least squares regression line of Y on X is negative, what else must be
negative?
A) The correlation of X and Y
B) The slope of a least squares regression line of X on Y
C) The coefficient of determination (R-sq) for the regression of Y on X
D) More than one of the above must be negative
E) None of the above need be negative
Ans D, the correlation of X and Y and the slope of X on Y must be negative.
10) If there were no linear relationship between X and Y (i.e. correlation (r) = 0), what
would be the predicted value of Y (predicted using the estimated least squares regression
equation) at any given value of X?
A) 0
B) Mean of the Y values (i.e. $\bar{Y}$)
C) Mean of the X values (i.e. $\bar{X}$)
D) (Mean of Y values - Mean of X values) (i.e. $\bar{Y} - \bar{X}$)
E) It depends on the variance of Y
Ans B
Total 95 points