
Chapter 6
Solutions for Linear Model Selection and Regularization
Textbook: An Introduction to Statistical Learning with Applications in R
Question 1: We perform best subset, forward stepwise, and
backward stepwise selection on a single data set. For each
approach, we obtain p + 1 models, containing 0, 1, 2,...,p
predictors.
(a) Which of the three models with k predictors has the
smallest training RSS?
Solution:
Best subset selection will have the smallest training RSS, since it considers every possible model with k predictors and chooses the one with the lowest training RSS. Forward and backward stepwise selection, by contrast, can only choose the k-variable model that extends the (k-1)-variable model or prunes the (k+1)-variable model found in the previous step, so they may miss the overall best k-variable model.
(b) Which of the three models with k predictors has the
smallest test RSS?
Solution:
There is no way to tell. Best subset selection has the smallest training RSS, but the model with the smallest test RSS could come from any of the three approaches, since the model that fits the training data best may overfit and perform worse on new data.
(c) True or False
i) The predictors in the k-variable model identified by
forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
Solution:
This statement is true, since forward stepwise selection forms the (k+1)-variable model by adding one more variable to the k predictors already selected, so the predictors in the k-variable model are always a subset of those in the (k+1)-variable model.
ii) The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
Solution:
This statement is true, since backward stepwise selection forms the k-variable model by removing the least useful variable from the (k+1)-variable model, so the predictors in the k-variable model are always a subset of those in the (k+1)-variable model.
iii) The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
Solution:
This statement is false in general. The variables remaining in the k-variable model from backward stepwise selection can be entirely different from the variables in the (k+1)-variable model identified by forward stepwise selection, since the two procedures build their models independently of one another.
iv) The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
Solution:
This statement is false in general, for the same reason given in part iii: forward and backward stepwise selection build their models independently, so their selected variables need not be nested.
v) The predictors in the k-variable model identified by best
subset are a subset of the predictors in the (k + 1)-variable
model identified by best subset selection.
Solution:
This statement is false in general, since best subset selection chooses the best predictors for the k-variable model independently of the variables chosen for the (k+1)-variable model. Thus the predictors could be different between the two models.
2. For parts (a) through (c), indicate which of i. through iv. is
correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction
accuracy when its increase in bias is less than its decrease in
variance.
ii. More flexible and hence will give improved prediction
accuracy when its increase in variance is less than its
decrease in bias.
iii. Less flexible and hence will give improved prediction
accuracy when its increase in bias is less than its decrease in
variance.
iv. Less flexible and hence will give improved prediction
accuracy when its increase in variance is less than its
decrease in bias.
Solution:
The solution is option iii. The lasso is less flexible than least squares: shrinking the coefficients reduces variance at the cost of some increase in bias, so it gives improved prediction accuracy when the increase in bias is less than the decrease in variance.
(b) Repeat (a) for ridge regression relative to least squares.
Solution:
The solution is option iii. The same reasoning applies to ridge regression: it is less flexible than least squares, reducing variance while increasing bias, so it gives improved prediction accuracy when the increase in bias is less than the decrease in variance.
(c) Repeat (a) for non-linear methods relative to least
squares.
Solution:
The solution is option ii. Non-linear methods are more flexible than least squares, which increases variance and decreases bias; they give improved prediction accuracy when the increase in variance is less than the decrease in bias.
3. Suppose we estimate the regression coefficients in a linear
regression model by minimizing
\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s
for a particular value of s. For parts (a) through (e), indicate
which of i. through v. is correct. Justify your answer.
(a) As we increase s from 0, the training RSS will:
i. Increase initially, and then eventually start decreasing
in an inverted U shape.
ii. Decrease initially, and then eventually start
increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.
Solution:
Option iv. As s increases from 0, the constraint becomes less restrictive and the coefficients can move closer to their least squares values, so the model fits the training data better and the training RSS steadily decreases.
(b) Repeat (a) for test RSS.
Solution:
Option ii. The test RSS decreases initially as the model gains enough flexibility to capture the signal, then eventually starts increasing as the model begins to overfit, giving a U shape.
(c) Repeat (a) for variance.
Solution:
The solution is option iii. As s increases the coefficients are less constrained and the model becomes more flexible, so the variance steadily increases.
(d) Repeat (a) for (squared) bias.
Solution:
The solution is iv. As s increases from 0, the fit moves toward the unbiased least squares solution, so the squared bias steadily decreases.
(e) Repeat (a) for the irreducible error.
Solution:
The solution is v. The irreducible error does not depend on the model, so it remains constant.
4. Suppose we estimate the regression coefficients in a
linear regression model by minimizing
\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
for a particular value of λ. For parts (a) through (e), indicate
which of i. through v. is correct. Justify your answer.
(a) As we increase λ from 0, the training RSS will:
i. Increase initially, and then eventually start decreasing in
an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a
U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.
Solution:
The solution is iii. As λ increases from 0, the coefficient estimates are shrunk further from their least squares values (as λ → ∞ they all approach 0), so the fit to the training data worsens and the training RSS steadily increases.
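As an illustration beyond the text, here is a minimal ridge regression sketch in R on simulated data (the object names, seed, and lambda grid are illustrative choices, not from the exercise), showing that the training RSS grows as λ grows:

    library(glmnet)
    set.seed(1)
    xm <- matrix(rnorm(100 * 10), 100, 10)            # simulated predictors
    yv <- as.vector(xm %*% rnorm(10)) + rnorm(100)    # simulated response
    # ridge regression (alpha = 0) over a grid of lambda values
    fit <- glmnet(xm, yv, alpha = 0, lambda = 10^seq(-3, 3, length.out = 50))
    # training RSS at each lambda; predict() gives one column of fitted values per lambda
    train.rss <- colSums((yv - predict(fit, newx = xm))^2)
    plot(log(fit$lambda), train.rss, type = "b",
         xlab = "log(lambda)", ylab = "Training RSS")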
(b) Repeat (a) for test RSS.
Solution:
The solution is ii. As λ increases from 0, the variance initially falls faster than the squared bias rises, so the test RSS decreases; eventually the growing bias dominates and the test RSS increases, giving a U shape.
(c) Repeat (a) for variance.
Solution:
The solution is iv. Increasing λ shrinks the coefficients and makes the model less flexible, so the variance steadily decreases.
(d) Repeat (a) for (squared) bias.
Solution:
The solution is iii. As λ increases we move away from the unbiased least squares fit toward an increasingly constrained model, so the squared bias steadily increases.
(e) Repeat (a) for the irreducible error.
Solution:
The solution is v. The irreducible error does not depend on the model, so it remains constant.
8. In this exercise, we will generate simulated data, and will
then use this data to perform best subset selection.
(a) Use the rnorm() function to generate a predictor X of
length n = 100, as well as a noise vector of length n = 100.
Solution:
Use rnorm() to generate both the predictor X and the noise vector, each of length n = 100, as in the sketch below.
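A minimal sketch of this step (the seed and object names are illustrative choices, not from the text):

    set.seed(1)        # illustrative seed, for reproducibility
    x   <- rnorm(100)  # predictor X of length n = 100
    eps <- rnorm(100)  # noise vector of length n = 100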
(b) Generate a response vector Y of length n = 100 according
to the model
Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon
where β0, β1, β2, and β3 are constants of your choice.
Solution:
𝛽0 = 5.00, 𝛽1 = 0.35, 𝛽2 = 4.37, and 𝛽3 = 0.87
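Continuing the sketch from part (a) with the coefficient values above, the response could be generated as:

    y <- 5.00 + 0.35 * x + 4.37 * x^2 + 0.87 * x^3 + eps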
(c) Use the regsubsets() function to perform best subset
selection in order to choose the best model containing the
predictors X, X^2, ..., X^10. What is the best model obtained
according to C_p, BIC, and adjusted R^2? Show some plots to
provide evidence for your answer, and report the coefficients
of the best model obtained. Note you will need to use the
data.frame() function to create a single data set containing
both X and Y .
Solution:
The best model according to C_p, BIC, and adjusted R^2 was the one with 3 predictors (X, X^2, and X^3); its estimated coefficients were approximately β0 = 5.00, β1 = 0.35, β2 = 4.37, and β3 = 0.87, matching the true values.
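A minimal sketch of the best subset selection step, assuming the x and y objects from the sketches above (the data frame and object names are illustrative):

    library(leaps)
    df <- data.frame(y = y, x = x)
    # best subset selection over X, X^2, ..., X^10
    fit.full <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = df, nvmax = 10)
    fit.sum  <- summary(fit.full)
    which.min(fit.sum$cp)     # model size chosen by Cp
    which.min(fit.sum$bic)    # model size chosen by BIC
    which.max(fit.sum$adjr2)  # model size chosen by adjusted R^2
    coef(fit.full, 3)         # coefficients of the 3-variable model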
(d) Repeat (c), using forward stepwise selection and also
using backwards stepwise selection. How does your answer
compare to the results in (c)?
Solution:
Forward and backward stepwise selection both chose the model of size 3, and its coefficients match the results of best subset selection and the true values.
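A minimal sketch of the stepwise fits, reusing the data frame from the sketch in part (c); only the method argument changes:

    fit.fwd <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = df,
                          nvmax = 10, method = "forward")
    fit.bwd <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = df,
                          nvmax = 10, method = "backward")
    coef(fit.fwd, 3)  # forward stepwise, 3-variable model
    coef(fit.bwd, 3)  # backward stepwise, 3-variable model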
(e) Now fit a lasso model to the simulated data, again using
X, X2, ...,X10 as predictors. Use cross-validation to select the
optimal value of λ. Create plots of the cross-validation error
as a function of λ. Report the resulting coefficient estimates,
and discuss the results obtained.
Solution:
The plot from cv.glmnet shows the number of nonzero coefficients (degrees of freedom) along the top axis and the cross-validation MSE as a function of log(λ). Each red dot is the cross-validation error for a given λ, with a vertical segment showing plus or minus one standard error, and the two vertical dotted lines mark the λ with the minimum cross-validation error and the largest λ within one standard error of it. We then plotted the cross-validation error as a function of λ.
The lasso almost exactly recovered the simulated model: it additionally included x^4, but with a very small coefficient. The estimated coefficients were:
𝛽0 = 5.122005645
𝛽1 = 0.171749843
𝛽2 = 4.214476922
𝛽3 = 0.875687373
𝛽4 = 0.008424754
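A minimal sketch of the cross-validated lasso fit described above, assuming the x and y objects from the earlier sketches (object names are illustrative):

    library(glmnet)
    xmat <- as.matrix(poly(x, 10, raw = TRUE))  # predictor matrix: X, X^2, ..., X^10
    cv.out <- cv.glmnet(xmat, y, alpha = 1)     # lasso (alpha = 1) with 10-fold cross-validation
    plot(cv.out)                                # CV error as a function of log(lambda)
    best.lambda <- cv.out$lambda.min
    predict(cv.out, s = best.lambda, type = "coefficients")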
(f) Now generate a response vector Y according to the model Y = β0 + β7X^7 + ε, and perform best subset selection and the lasso. Discuss the results obtained.
Solution:
The true coefficient values chosen were β0 = 5 and β7 = 0.78.
Cross-validation selected a best λ of 1.105952, and the fitted lasso model retained essentially only the X^7 term, with coefficient estimates close to the true values. Thus the lasso is a very effective technique for both identifying the right model and estimating its coefficients.
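A minimal sketch of this part, reusing x and eps from the part (a) sketch and the coefficient choices above (object names are illustrative):

    library(glmnet)
    y7 <- 5 + 0.78 * x^7 + eps                  # response for part (f)
    cv.out7 <- cv.glmnet(as.matrix(poly(x, 10, raw = TRUE)), y7, alpha = 1)
    cv.out7$lambda.min                          # cross-validated choice of lambda
    predict(cv.out7, s = "lambda.min", type = "coefficients")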