© Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chapter 7.

[Figure 7.1 plot: Prediction Error (0.0–1.2) versus Model Complexity (df, 0–35); the low-complexity end is labeled High Bias / Low Variance, the high-complexity end Low Bias / High Variance.]

FIGURE 7.1. Behavior of test sample and training sample error as the model complexity is varied. The light blue curves show the training error $\overline{\mathrm{err}}$, while the light red curves show the conditional test error $\mathrm{Err}_{\mathcal{T}}$ for 100 training sets of size 50 each, as the model complexity is increased. The solid curves show the expected test error $\mathrm{Err}$ and the expected training error $E[\overline{\mathrm{err}}]$.

[Figure 7.2 schematic labels: Truth, Realization, Closest fit, Closest fit in population, Model bias, Estimation bias, Estimation variance, Shrunken fit; MODEL SPACE and RESTRICTED MODEL SPACE.]

FIGURE 7.2. Schematic of the behavior of bias and variance. The model space is the set of all possible predictions from the model, with the "closest fit" labeled with a black dot. The model bias from the truth is shown, along with the variance, indicated by the large yellow circle centered at the black dot labeled "closest fit in population." A shrunken or regularized fit is also shown, having additional estimation bias, but smaller prediction error due to its decreased variance.

[Figure 7.3 panels: k-NN — Regression and Linear Model — Regression (top row), k-NN — Classification and Linear Model — Classification (bottom row); x-axes Number of Neighbors k (50 down to 0) and Subset Size p (5–20).]

FIGURE 7.3. Expected prediction error (orange), squared bias (green) and variance (blue) for a simulated example. The top row is regression with squared error loss; the bottom row is classification with 0–1 loss. The models are k-nearest neighbors (left) and best subset regression of size p (right). The variance and bias curves are the same in regression and classification, but the prediction error curves differ.

[Figure 7.4 panels: Log-likelihood Loss (left, y-axis Log-likelihood) and 0–1 Loss (right, y-axis Misclassification Error), each with Train, Test, and AIC curves against Number of Basis Functions (2–128).]

FIGURE 7.4. AIC used for model selection for the phoneme recognition example of Section 5.2.3. The logistic regression coefficient function $\beta(f) = \sum_{m=1}^{M} h_m(f)\,\theta_m$ is modeled as an expansion in $M$ spline basis functions. In the left panel we see the AIC statistic used to estimate $\mathrm{Err}_{\mathrm{in}}$ using log-likelihood loss. Included is an estimate of $\mathrm{Err}$ based on an independent test sample. It does well except for the extremely over-parametrized case ($M = 256$ parameters for $N = 1000$ observations). In the right panel the same is done for 0–1 loss. Although the AIC formula does not strictly apply here, it does a reasonable job in this case.

[Figure 7.5 plot: $\sin(50 \cdot x)$, from −1.0 to 1.0, against $x \in [0, 1]$.]

FIGURE 7.5. The solid curve is the function $\sin(50x)$ for $x \in [0, 1]$. The green (solid) and blue (hollow) points illustrate how the associated indicator function $I(\sin(\alpha x) > 0)$ can shatter (separate) an arbitrarily large number of points by choosing an appropriately high frequency $\alpha$.
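The shattering claim in Figure 7.5 can be checked empirically. Below is a minimal sketch (not from the book; the number of points, the labeling, and the frequency grid are arbitrary choices for the demonstration) that searches a grid of frequencies for an $\alpha$ at which $I(\sin(\alpha x) > 0)$ reproduces a random labeling of six points:

```python
# Illustrative sketch only: search for a frequency alpha at which the
# classifier I(sin(alpha * x) > 0) matches an arbitrary 0/1 labeling
# of six points in [0, 1].
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=6)        # six points in [0, 1]
y = rng.integers(0, 2, size=6)           # an arbitrary 0/1 labeling

alphas = np.linspace(1.0, 1e5, 500_000)  # grid of candidate frequencies
# Sign pattern of sin(alpha * x_i) for every candidate alpha at once:
preds = (np.sin(np.outer(alphas, x)) > 0).astype(int)
hits = np.flatnonzero((preds == y).all(axis=1))

if hits.size:
    print(f"alpha = {alphas[hits[0]]:.1f} realizes the labeling {y}")
else:
    print("no match on this grid; widen or refine the alpha grid")
```

Because the sign patterns of $\sin(\alpha x_i)$ cycle at effectively incommensurate rates as $\alpha$ grows, each of the $2^6$ labelings occurs at some frequency for generic points; the grid only has to be wide and fine enough to land on one.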
FIGURE 7.6. The first three panels show that the class of lines in the plane can shatter three points. The last panel shows that this class cannot shatter four points, as no line will put the hollow points on one side and the solid points on the other. Hence the VC dimension of the class of straight lines in the plane is three. Note that a class of nonlinear curves could shatter four points, and hence has VC dimension greater than three.

[Figure 7.7 panels: AIC, BIC, and SRM; boxplots of % Increase Over Best (0–100) for the scenarios reg/KNN, reg/linear, class/KNN, and class/linear.]

FIGURE 7.7. Boxplots show the distribution of the relative error $100 \times [\mathrm{Err}_{\mathcal{T}}(\hat\alpha) - \min_\alpha \mathrm{Err}_{\mathcal{T}}(\alpha)] / [\max_\alpha \mathrm{Err}_{\mathcal{T}}(\alpha) - \min_\alpha \mathrm{Err}_{\mathcal{T}}(\alpha)]$ over the four scenarios of Figure 7.3. This is the error in using the chosen model relative to the best model.

[Figure 7.8 plot: $1 - \mathrm{Err}$ (0.0–0.8) versus Size of Training Set (0–200).]

FIGURE 7.8. Hypothetical learning curve for a classifier on a given task: a plot of $1 - \mathrm{Err}$ versus the size of the training set $N$. With a dataset of 200 observations, 5-fold cross-validation would use training sets of size 160, which would behave much like the full set. However, with a dataset of 50 observations, fivefold cross-validation would use training sets of size 40, and this would result in a considerable overestimate of prediction error.

[Figure 7.9 plot: Misclassification Error (0.0–0.6) versus Subset Size p (5–20).]

FIGURE 7.9. Prediction error (orange) and tenfold cross-validation curve (blue) estimated from a single training set, from the scenario in the bottom right panel of Figure 7.3.

[Figure 7.10 panels: Wrong way (upper) and Right way (lower) histograms; Frequency (0–30) versus Correlations of Selected Predictors with Outcome (−1.0 to 1.0).]

FIGURE 7.10. Cross-validation the wrong and right way: the histograms show the correlation of class labels, in 10 randomly chosen samples, with the 100 predictors chosen using the incorrect (upper red) and correct (lower green) versions of cross-validation.

[Figure 7.11 panels: histogram of Error on Full Training Set (top left); Error on 1/5 versus Error on 4/5 (top right); Class Label versus Predictor 436, with fits on the full data and on 4/5 of it (bottom left); CV Errors (bottom right).]

FIGURE 7.11. Simulation study to investigate the performance of cross-validation in a high-dimensional problem where the predictors are independent of the class labels. The top-left panel shows the number of errors made by individual stump classifiers on the full training set (20 observations). The top-right panel shows the errors made by individual stumps trained on a random split of the dataset into 4/5ths (16 observations) and tested on the remaining 1/5th (4 observations). The best performers are depicted by colored dots in each panel. The bottom-left panel shows the effect of re-estimating the stump within each fold, plotting the class labels against the selected predictor (number 436) for fits on the full data and on a 4/5ths split; the bottom-right panel shows the distribution of cross-validation error counts.
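The selection bias illustrated in Figures 7.10 and 7.11 is easy to reproduce. Below is a minimal sketch, assuming scikit-learn is available, of the kind of setup described with Figure 7.10: 50 samples in two classes, 5000 predictors independent of the labels, the 100 strongest predictors screened (here by a univariate F-score, a stand-in for the correlation screening in the text), and a 1-nearest-neighbor classifier. The wrong way screens on all the data before cross-validating; the right way repeats the screening inside each training fold:

```python
# Illustrative sketch only: "wrong" vs. "right" cross-validation when a
# predictor-screening step precedes the classifier.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5000))   # predictors carry no signal
y = np.repeat([0, 1], 25)             # two balanced classes

# Wrong way: the screening step sees the labels of the future held-out
# folds, so the CV error is wildly optimistic.
X_screened = SelectKBest(f_classif, k=100).fit_transform(X, y)
wrong = 1 - cross_val_score(KNeighborsClassifier(n_neighbors=1),
                            X_screened, y, cv=5).mean()

# Right way: screening is re-run inside each training fold.
pipe = Pipeline([("screen", SelectKBest(f_classif, k=100)),
                 ("knn", KNeighborsClassifier(n_neighbors=1))])
right = 1 - cross_val_score(pipe, X, y, cv=5).mean()

print(f"wrong-way CV error: {wrong:.2f}, right-way CV error: {right:.2f}")
```

Since the predictors are independent of the labels, the true error rate of any rule is 50%; the right-way estimate hovers near 0.5, while the wrong-way estimate is far below it because the screening step has already seen the held-out labels.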
[Figure 7.12 schematic: bootstrap samples $Z^{*1}, Z^{*2}, \dots, Z^{*B}$ drawn from the training sample $Z = (z_1, z_2, \dots, z_N)$, yielding bootstrap replications $S(Z^{*1}), S(Z^{*2}), \dots, S(Z^{*B})$. A minimal code sketch of this resampling loop appears after Figure 7.15.]

FIGURE 7.12. Schematic of the bootstrap process. We wish to assess the statistical accuracy of a quantity $S(Z)$ computed from our dataset. $B$ training sets $Z^{*b}$, $b = 1, \dots, B$, each of size $N$, are drawn with replacement from the original dataset. The quantity of interest $S(Z)$ is computed from each bootstrap training set, and the values $S(Z^{*1}), \dots, S(Z^{*B})$ are used to assess the statistical accuracy of $S(Z)$.

[Figure 7.13 panels: Cross-validation and Bootstrap; boxplots of % Increase Over Best (0–100) for reg/KNN, reg/linear, class/KNN, and class/linear.]

FIGURE 7.13. Boxplots show the distribution of the relative error $100 \cdot [\mathrm{Err}(\hat\alpha) - \min_\alpha \mathrm{Err}(\alpha)] / [\max_\alpha \mathrm{Err}(\alpha) - \min_\alpha \mathrm{Err}(\alpha)]$ over the four scenarios of Figure 7.3. This is the error in using the chosen model relative to the best model. There are 100 training sets represented in each boxplot.

[Figure 7.14 panels: Prediction Error, 10-Fold CV Error, and Leave-One-Out CV Error (Error versus Subset Size p), plus Approximation Error (Mean Absolute Deviation versus Subset Size p).]

FIGURE 7.14. Conditional prediction-error $\mathrm{Err}_{\mathcal{T}}$, 10-fold cross-validation, and leave-one-out cross-validation curves for 100 simulations from the top-right panel in Figure 7.3. The thick red curve is the expected prediction error $\mathrm{Err}$, while the thick black curves are the expected CV curves $E_{\mathcal{T}}\,\mathrm{CV}_{10}$ and $E_{\mathcal{T}}\,\mathrm{CV}_{N}$. The lower-right panel shows the mean absolute deviation of the CV curves from the conditional error, $E_{\mathcal{T}}|\mathrm{CV}_K - \mathrm{Err}_{\mathcal{T}}|$ for $K = 10$ (blue) and $K = N$ (green), as well as from the expected error, $E_{\mathcal{T}}|\mathrm{CV}_{10} - \mathrm{Err}|$ (orange).

[Figure 7.15 panels: CV Error versus Prediction Error at Subset Sizes 1, 5, and 10, and Correlation versus Subset Size (5–20), with leave-one-out and 10-fold CV in different colors.]

FIGURE 7.15. Plots of the CV estimates of error versus the true conditional error for each of the 100 training sets, for the simulation setup in the top-right panel of Figure 7.3. Both 10-fold and leave-one-out CV are depicted in different colors. The first three panels correspond to different subset sizes $p$, and vertical and horizontal lines are drawn at $\mathrm{Err}(p)$. Although there appears to be little correlation in these plots, we see in the lower-right panel that for the most part the correlation is negative.
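As a concrete rendering of the Figure 7.12 schematic, here is a minimal sketch in which the statistic $S(Z)$ is taken to be the sample median (an arbitrary choice for the demonstration): $B$ datasets of size $N$ are drawn with replacement from $Z$, $S$ is recomputed on each, and the spread of the $B$ replications measures the accuracy of $S(Z)$:

```python
# Illustrative sketch only: the bootstrap loop of Figure 7.12 with
# S(Z) = sample median.
import numpy as np

rng = np.random.default_rng(42)
Z = rng.standard_normal(100)      # training sample, N = 100
S = np.median                     # the quantity of interest S(Z)

B = 1000
replications = np.empty(B)
for b in range(B):
    Z_b = rng.choice(Z, size=Z.size, replace=True)  # Z*b, size N
    replications[b] = S(Z_b)                        # S(Z*b)

# The spread of the B values assesses the accuracy of S(Z); their
# standard deviation is the bootstrap estimate of its standard error.
print(f"S(Z) = {S(Z):.3f}, bootstrap SE = {replications.std(ddof=1):.3f}")
```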