[Figure 7.1 plot: Prediction Error versus Model Complexity (df); low complexity corresponds to high bias / low variance, high complexity to low bias / high variance.]
FIGURE 7.1. Behavior of test sample and training sample error as the model complexity is varied. The light blue curves show the training error $\overline{\mathrm{err}}$, while the light red curves show the conditional test error $\mathrm{Err}_{\mathcal{T}}$ for 100 training sets of size 50 each, as the model complexity is increased. The solid curves show the expected test error $\mathrm{Err}$ and the expected training error $E[\overline{\mathrm{err}}]$.
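The curves in Figure 7.1 come from repeated simulation. Below is a minimal sketch of that kind of experiment, assuming a hypothetical true function, noise level, and polynomial degree as the complexity parameter; none of these are the book's actual simulation choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                                   # hypothetical "true" regression function
    return np.sin(2 * np.pi * x)

def simulate(n_train=50, n_reps=100, degrees=range(1, 13), sigma=0.3):
    """Average training and test error over many training sets of size n_train."""
    x_test = rng.uniform(0, 1, 2000)
    y_test = f(x_test) + sigma * rng.standard_normal(x_test.size)
    train_err = np.zeros((n_reps, len(degrees)))
    test_err = np.zeros_like(train_err)
    for r in range(n_reps):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + sigma * rng.standard_normal(n_train)
        for j, d in enumerate(degrees):
            coef = np.polyfit(x, y, deg=d)                  # complexity = polynomial degree d
            train_err[r, j] = np.mean((y - np.polyval(coef, x)) ** 2)
            test_err[r, j] = np.mean((y_test - np.polyval(coef, x_test)) ** 2)
    return train_err.mean(axis=0), test_err.mean(axis=0)   # approximate E[err] and Err

avg_train, avg_test = simulate()
print(avg_train)   # decreases steadily with model complexity
print(avg_test)    # U-shaped: falls, then rises once variance dominates
```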
[Figure 7.2 schematic labels: Truth, Realization, Closest fit in population, Closest fit, Shrunken fit; Model bias, Estimation bias, Estimation variance; MODEL SPACE and RESTRICTED MODEL SPACE regions.]
FIGURE 7.2. Schematic of the behavior of bias and variance. The model space is the set of all possible predictions from the model, with the "closest fit" labeled with a black dot. The model bias from the truth is shown, along with the variance, indicated by the large yellow circle centered at the black dot labeled "closest fit in population." A shrunken or regularized fit is also shown, having additional estimation bias, but smaller prediction error due to its decreased variance.
[Figure 7.3 plots: four panels (k-NN Regression, Linear Model Regression, k-NN Classification, Linear Model Classification) showing error versus Number of Neighbors k or Subset Size p.]
FIGURE 7.3. Expected prediction error (orange), squared bias (green) and variance (blue) for a simulated example. The top row is regression with squared error loss; the bottom row is classification with 0–1 loss. The models are k-nearest neighbors (left) and best subset regression of size p (right). The variance and bias curves are the same in regression and classification, but the prediction error curves differ.
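The squared-bias and variance curves in Figure 7.3 are obtained by refitting the same model on many training sets. A minimal sketch for the k-NN regression case follows, with a hypothetical true function and noise level standing in for the book's simulation design.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
f = lambda X: np.sin(3 * X[:, 0])              # hypothetical true regression function
X_test = rng.uniform(-2, 2, size=(200, 1))     # fixed test points for the decomposition

def bias_variance(k, n_train=80, n_reps=200, sigma=0.5):
    """Estimate squared bias and variance of k-NN at the test points by refitting."""
    preds = np.zeros((n_reps, X_test.shape[0]))
    for r in range(n_reps):
        X = rng.uniform(-2, 2, size=(n_train, 1))
        y = f(X) + sigma * rng.standard_normal(n_train)
        preds[r] = KNeighborsRegressor(n_neighbors=k).fit(X, y).predict(X_test)
    avg_pred = preds.mean(axis=0)
    sq_bias = np.mean((avg_pred - f(X_test)) ** 2)   # averaged over the test points
    variance = np.mean(preds.var(axis=0))
    return sq_bias, variance

for k in (1, 5, 20, 50):
    print(k, bias_variance(k))   # bias grows and variance shrinks as k increases
```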
[Figure 7.4 plots: left panel "Log-likelihood Loss" (Log-likelihood versus Number of Basis Functions), right panel "0-1 Loss" (Misclassification Error versus Number of Basis Functions); curves: Train, Test, AIC.]
FIGURE 7.4. AIC used for model selection for the phoneme recognition example of Section 5.2.3. The logistic regression coefficient function $\beta(f) = \sum_{m=1}^{M} h_m(f)\theta_m$ is modeled as an expansion in M spline basis functions. In the left panel we see the AIC statistic used to estimate $\mathrm{Err}_{\mathrm{in}}$ using log-likelihood loss. Included is an estimate of Err based on an independent test sample. It does well except for the extremely over-parametrized case (M = 256 parameters for N = 1000 observations). In the right panel the same is done for 0–1 loss. Although the AIC formula does not strictly apply here, it does a reasonable job in this case.
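A minimal sketch of the AIC computation behind the left panel, using the in-sample log-likelihood and AIC = −2·loglik/N + 2d/N. A polynomial basis and scikit-learn's LogisticRegression (with a large C as a stand-in for unpenalized maximum likelihood) replace the book's spline basis and fitting code; those substitutions are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aic_for_basis_sizes(H, y, sizes):
    """H: N x M_max matrix of basis-function values h_m(f); y: 0/1 labels."""
    N = len(y)
    out = {}
    for M in sizes:
        # a large C approximates an unpenalized maximum-likelihood fit (an assumption)
        model = LogisticRegression(C=1e6, max_iter=5000).fit(H[:, :M], y)
        p = model.predict_proba(H[:, :M])[:, 1]
        loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        out[M] = -2 * loglik / N + 2 * (M + 1) / N          # M coefficients + intercept
    return out

# hypothetical example: a polynomial basis in place of the book's spline basis
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 500)
H = np.vander(x, N=9, increasing=True)[:, 1:]               # columns x, x^2, ..., x^8
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-3 * x))).astype(int)
print(aic_for_basis_sizes(H, y, sizes=[2, 4, 6, 8]))        # choose the M with smallest AIC
```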
[Figure 7.5 plot: sin(50 · x) versus x on [0, 1].]
FIGURE 7.5. The solid curve is the function sin(50x) for x ∈ [0, 1]. The green (solid) and blue (hollow) points illustrate how the associated indicator function I(sin(αx) > 0) can shatter (separate) an arbitrarily large number of points by choosing an appropriately high frequency α.
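A minimal sketch of the shattering idea in Figure 7.5: for a handful of points and an arbitrary labeling, a brute-force search over the single parameter α can find a frequency whose sign pattern reproduces the labels. The search range and grid below are illustrative assumptions and may need to grow with the number of points.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.1, 1.0, 5))            # a few points in (0, 1]
labels = rng.integers(0, 2, size=x.size)         # an arbitrary 0/1 labeling to match

alphas = np.linspace(1, 10_000, 1_000_000)       # brute-force grid over frequencies
pred = (np.sin(np.outer(alphas, x)) > 0).astype(int)   # I(sin(alpha * x) > 0), one row per alpha
hits = np.where((pred == labels).all(axis=1))[0]

print("target labels:", labels)
print("matched by alpha =", alphas[hits[0]] if hits.size else "none found in this range")
```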
FIGURE 7.6. The first three panels show that the class of lines in the plane can shatter three points. The last panel shows that this class cannot shatter four points, as no line will put the hollow points on one side and the solid points on the other. Hence the VC dimension of the class of straight lines in the plane is three. Note that a class of nonlinear curves could shatter four points, and hence has VC dimension greater than three.
[Figure 7.7 boxplots: % Increase Over Best (0–100) for AIC, BIC, and SRM; scenarios reg/KNN, reg/linear, class/KNN, class/linear.]
FIGURE 7.7. Boxplots show the distribution of the relative error $100 \times [\mathrm{Err}_{\mathcal{T}}(\hat{\alpha}) - \min_\alpha \mathrm{Err}_{\mathcal{T}}(\alpha)] / [\max_\alpha \mathrm{Err}_{\mathcal{T}}(\alpha) - \min_\alpha \mathrm{Err}_{\mathcal{T}}(\alpha)]$ over the four scenarios of Figure 7.3. This is the error in using the chosen model relative to the best model.
[Figure 7.8 plot: 1 − Err versus Size of Training Set (0 to 200).]
FIGURE 7.8. Hypothetical learning curve for a classifier on a given task: a plot of 1 − Err versus the size of the training set N. With a dataset of 200 observations, five-fold cross-validation would use training sets of size 160, which would behave much like the full set. However, with a dataset of 50 observations, five-fold cross-validation would use training sets of size 40, and this would result in a considerable overestimate of prediction error.
[Figure 7.9 plot: Misclassification Error versus Subset Size p.]
FIGURE 7.9. Prediction error (orange) and tenfold cross-validation curve (blue) estimated from a single training set, from the scenario in the bottom right panel of Figure 7.3.
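A minimal sketch of producing a 10-fold CV curve like the blue curve above, using univariate screening (SelectKBest) as a stand-in for the book's best-subset search, on synthetic data; both substitutions are assumptions. The point is that the subset-selection step lives inside the cross-validated pipeline, so it is redone in every fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# hypothetical data: 80 observations, 20 predictors, a few of them informative
X, y = make_classification(n_samples=80, n_features=20, n_informative=5, random_state=0)

cv_curve = {}
for p in range(1, 21):
    # selection happens inside each fold because it is part of the pipeline
    pipe = make_pipeline(SelectKBest(f_classif, k=p), LogisticRegression(max_iter=1000))
    cv_curve[p] = 1 - cross_val_score(pipe, X, y, cv=10).mean()   # misclassification error

best_p = min(cv_curve, key=cv_curve.get)
print(best_p, cv_curve[best_p])
```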
[Figure 7.10 histograms: Frequency versus Correlations of Selected Predictors with Outcome; upper panel "Wrong way", lower panel "Right way".]
FIGURE 7.10. Cross-validation the wrong and right way: the histograms show the correlation of class labels, in 10 randomly chosen samples, with the 100 predictors chosen using the incorrect (upper red) and correct (lower green) versions of cross-validation.
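A minimal sketch of the experiment behind Figure 7.10, with hypothetical sizes (N = 50 samples, 5000 pure-noise predictors, 100 predictors retained, a held-out fold of 10). Screening the predictors on all of the data before splitting (the "wrong way") leaves them correlated with the labels of the held-out fold, whereas screening only on the training part (the "right way") does not.

```python
import numpy as np

rng = np.random.default_rng(4)
N, P, keep = 50, 5000, 100
X = rng.standard_normal((N, P))                  # predictors independent of the labels
y = np.repeat([0.0, 1.0], N // 2)

def corr_with_y(X, y):
    """Correlation of each predictor column with the outcome y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# a held-out fold of 10 samples, balanced so its labels are not constant
fold = np.concatenate([rng.choice(25, 5, replace=False), 25 + rng.choice(25, 5, replace=False)])
train = np.setdiff1d(np.arange(N), fold)

wrong = np.argsort(corr_with_y(X, y))[-keep:]                 # screened on ALL the data
right = np.argsort(corr_with_y(X[train], y[train]))[-keep:]   # screened on the training part only

fold_corr = corr_with_y(X[fold], y[fold])
print("wrong way, mean correlation with held-out labels:", fold_corr[wrong].mean())  # well above 0
print("right way, mean correlation with held-out labels:", fold_corr[right].mean())  # near 0
```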
[Figure 7.11 panels: histogram of Error on Full Training Set; Error on 1/5 versus Error on 4/5; Class Label versus Predictor 436 (blue), with "full" and "4/5" fits; CV Errors.]
FIGURE 7.11. Simulation study to investigate the performance of cross-validation in a high-dimensional problem where the predictors are independent of the class labels. The top-left panel shows the number of errors made by individual stump classifiers on the full training set (20 observations). The top-right panel shows the errors made by individual stumps trained on a random split of the dataset into 4/5ths (16 observations) and tested on the remaining 1/5th (4 observations). The best performers are depicted by colored dots in each panel. The bottom left panel shows the effect of ...
[Figure 7.12 schematic: training sample $Z = (z_1, z_2, \ldots, z_N)$; bootstrap samples $Z^{*1}, Z^{*2}, \ldots, Z^{*B}$; bootstrap replications $S(Z^{*1}), S(Z^{*2}), \ldots, S(Z^{*B})$.]
FIGURE 7.12. Schematic of the bootstrap process. We wish to assess the statistical accuracy of a quantity S(Z) computed from our dataset. B training sets $Z^{*b}$, $b = 1, \ldots, B$, each of size N, are drawn with replacement from the original dataset. The quantity of interest S(Z) is computed from each bootstrap training set, and the values $S(Z^{*1}), \ldots, S(Z^{*B})$ are used to assess the statistical accuracy of S(Z).
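A minimal sketch of this process, using the standard error of the sample median as a hypothetical quantity of interest S(Z).

```python
import numpy as np

rng = np.random.default_rng(5)

def bootstrap_se(z, S, B=1000):
    """Estimate the standard error of the statistic S(z) by the bootstrap."""
    N = len(z)
    # draw B samples of size N with replacement and recompute S on each
    reps = np.array([S(z[rng.integers(0, N, size=N)]) for _ in range(B)])
    return reps.std(ddof=1), reps

z = rng.standard_normal(100)                  # hypothetical training sample
se_median, reps = bootstrap_se(z, np.median)  # S(Z) = sample median
print("bootstrap SE of the median:", se_median)
```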
[Figure 7.13 boxplots: % Increase Over Best (0–100) for Cross-validation and Bootstrap; scenarios reg/KNN, reg/linear, class/KNN, class/linear.]
FIGURE 7.13. Boxplots show the distribution of the relative error $100 \cdot [\mathrm{Err}_{\hat{\alpha}} - \min_\alpha \mathrm{Err}(\alpha)] / [\max_\alpha \mathrm{Err}(\alpha) - \min_\alpha \mathrm{Err}(\alpha)]$ over the four scenarios of Figure 7.3. This is the error in using the chosen model relative to the best model. There are 100 training sets represented in each boxplot.
[Figure 7.14 panels: Prediction Error, 10-Fold CV Error, and Leave-One-Out CV Error versus Subset Size p, plus Approximation Error (Mean Absolute Deviation versus Subset Size p) with curves $E_{\mathcal{T}}|\mathrm{CV}_{10} - \mathrm{Err}|$, $E_{\mathcal{T}}|\mathrm{CV}_{10} - \mathrm{Err}_{\mathcal{T}}|$, and $E_{\mathcal{T}}|\mathrm{CV}_N - \mathrm{Err}_{\mathcal{T}}|$.]
FIGURE 7.14. Conditional prediction error $\mathrm{Err}_{\mathcal{T}}$, 10-fold cross-validation, and leave-one-out cross-validation curves for 100 simulations from the top-right panel in Figure 7.3. The thick red curve is the expected prediction error $\mathrm{Err}$, while the thick black curves are the expected CV curves $E_{\mathcal{T}}\,\mathrm{CV}_{10}$ and $E_{\mathcal{T}}\,\mathrm{CV}_N$. The lower-right panel shows the mean absolute deviation of the CV curves from the conditional error, $E_{\mathcal{T}}|\mathrm{CV}_K - \mathrm{Err}_{\mathcal{T}}|$ for K = 10 (blue) and K = N (green), as well as from the expected error, $E_{\mathcal{T}}|\mathrm{CV}_{10} - \mathrm{Err}|$ (orange).
[Figure 7.15 panels: CV Error versus Prediction Error for Subset Size 1, 5, and 10 (Leave-one-out and 10-Fold), and Correlation versus Subset Size.]
FIGURE 7.15. Plots of the CV estimates of error versus the true conditional error for each of the 100 training sets, for the simulation setup in the top right panel of Figure 7.3. Both 10-fold and leave-one-out CV are depicted in different colors. The first three panels correspond to different subset sizes p, and vertical and horizontal lines are drawn at Err(p). Although there appears to be little correlation in these plots, we see in the lower right panel that for the most part the correlation is negative.