Introduction The approximation method Results Summary Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer Jelle Goeman Department of Medical Statistics Leiden University Medical Center Validation in Statistics and Machine Learning 6 October 2010 lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Outline 1 Introduction 2 The approximation method 3 Results 4 Summary lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Penalized regression Ridge regression βˆridge = argmax{l(β) − λ X βi2 } i Shrinks Lasso regression βˆridge = argmax{l(β) − λ X |βi |} i Shrinks and selects lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary The penalized package On CRAN: R package penalized Ridge Lasso Elastic net Regression models Linear regression Logistic regression (GLM) Cox Proportional Hazards model lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Choosing the value of λ Between λ too large: over-shrinkage λ too small: overfit lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Choosing the value of λ Between λ too large: over-shrinkage λ too small: overfit How to optimize λ? Leave-one-out cross-validation K -fold cross-validation Akaike’s information criterion Generalized cross-validation (.632+) bootstrap cross-validation ... Fast approximate leave-one-out cross-validation for large sample sizes lumc-logo Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Leave-one-out Ingredients Response y1 , . . . , yn Predictor variables x1 , . . . , xn ˆ λ(−i) not using xi and yi Fitted models β A loss function L. Assume continuity. Leave-one-out loss n X ˆ λ(−i) ) L(yi , xi , β i=1 lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Approximate leave-one-out Leave-one-out loss ˆ λ(−1) , . . . , β ˆ λ(−n) Requires calculation of β lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Approximate leave-one-out Leave-one-out loss ˆ λ(−1) , . . . , β ˆ λ(−n) Requires calculation of β Time consuming - when n is large ˆ λ(−i) takes much time - when each β - double cross-validation lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Approximate leave-one-out Leave-one-out loss ˆ λ(−1) , . . . , β ˆ λ(−n) Requires calculation of β Time consuming - when n is large ˆ λ(−i) takes much time - when each β - double cross-validation Solution ˆ λ(−i) based on β ˆλ approximate β lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Models Assumption − ∂2l =D ∂η∂η 0 (diagonal) with η = Xβ the linear predictor Generalized linear models Linear regression Logistic regression Cox proportional hazards (full likelihood) lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary General idea 0 ˆ Taylor approximation of l(−i) (β) at β = β λ 0 0 ˆ λ ) + (β − β ˆ λ )l 00 (β ˆ λ ) + O (β − β ˆ λ )2 . l(−i) (β) = l(−i) (β (−i) 0 ˆ λ(−i) gives: solving l(−i) (β) = 0 at β = β −1 λ 0 ˆ λ(−i) = β ˆ λ − l 00 (β ˆ λ) ˆ λ ) + O (β ˆ (−i) − β ˆ λ )2 β l ( β (−i) (−i) lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary General idea 0 ˆ Taylor approximation of l(−i) (β) at β = β λ 0 0 ˆ λ ) + (β − β ˆ λ )l 00 (β ˆ λ ) + O (β − β ˆ λ )2 . l(−i) (β) = l(−i) (β (−i) 0 ˆ λ(−i) gives: solving l(−i) (β) = 0 at β = β −1 λ 0 ˆ λ(−i) = β ˆ λ − l 00 (β ˆ λ) ˆ λ ) + O (β ˆ (−i) − β ˆ λ )2 β l ( β (−i) (−i) still n inverses to be calculated lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Sherman-Morrison-Woodbury theorem −1 B−1 uvT B−1 B + uvT = B−1 − , 1 + vT B−1 u B nonsingular p × p matrix, u, v p-dimensional column vectors lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Sherman-Morrison-Woodbury theorem −1 B−1 uvT B−1 B + uvT = B−1 − , 1 + vT B−1 u B nonsingular p × p matrix, u, v p-dimensional column vectors 00 (β ˆ λ ))−1 (in the ridge model) Apply to (l(−i) XT (−i) D(−i) X(−i) + λIp −1 −1 = XT DX + λIp − dii xi xT i lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Final formula ridge ˆ λ(−i) β with −1 T xi ∆i X DX + λI p ˆ − , =β 1 − vii λ −1 1 1 XT D 2 V = D 2 X XT DX + λIp ˆλ D and ∆ (residuals) based on value β ˆ λ(−i) ’s with just 1 inverse calculation and all approximate β some matrix multiplications! lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Final formula ridge ˆ λ(−i) β with −1 T xi ∆i X DX + λI p ˆ − , =β 1 − vii λ −1 1 1 XT D 2 V = D 2 X XT DX + λIp ˆλ D and ∆ (residuals) based on value β ˆ λ(−i) ’s with just 1 inverse calculation and all approximate β some matrix multiplications! Reparamaterization Dimension covariate space can be reduced from p to n Fast approximate leave-one-out cross-validation for large sample sizes lumc-logo Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Models Linear model Approximation = exact lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Models Linear model Approximation = exact Cox proportional hazards Use full likelihood, not partial likelihood Baseline hazard not cross-validated Trick possible: add intercept term lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Final formula lasso ˆ λ(−i) β with −1 T xi ∆i X DX ˆ − , =β 1 − vii λ −1 1 1 V = D 2 X XT DX XT D 2 lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Final formula lasso ˆ λ(−i) β with −1 T xi ∆i X DX ˆ − , =β 1 − vii λ −1 1 1 V = D 2 X XT DX XT D 2 ˆ λk ≈ β ˆ λ(−i) we know: locally, if β k ˆ λk = 0 if β ⇒ ˆ λ(−i) = 0 β k Refinements possible lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary To what extent is this approximation useful? Are the approximated values comparable to the real values? ˆ λ(−i) ) ≈ cvl(approximated β ˆ λ(−i) )? - cvl(real β Would we find approximately the same values of λ? - do we find approximately the same maximum of the cvl when ˆ λ(−i) ’s? using the approximated β How much worse are the models? - do we find approximately the same cvl at the maximum found? lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary The dataset used Breast cancer data of the Netherlands Cancer Institute Paper by Van ’t Veer et al. (Nature, 2002) Followed up by Van de Vijver et al. (NEJM, 2002) 295 breast cancer patients effective dimension 79, due to censoring Microarray (Agilent): 4,919 genes preselected (Rosetta technology) Response of interest survival time (up to 18 years follow-up) lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Ridge Regression cvpl −480 −460 −440 −420 −400 cvpl appr cvpl train cvpl 0 2000 4000 6000 8000 lambda Fast approximate leave-one-out cross-validation for large sample sizes 10000 lumc-logo Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Ridge Regression: in more detail −480 cvpl appr cvpl appr cvpl: lambda= 438.2634, −485 cvpl cvpl= -475.8422 real cvpl: lambda= 458.5212, −490 cvpl= -476.2204 0 2000 4000 6000 8000 lambda Fast approximate leave-one-out cross-validation for large sample sizes 10000 lumc-logo Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary cvpl appr cvpl train cvpl cvpl −1000 −900 −800 −700 −600 −500 −400 −300 Lasso Regression 5 10 15 lambda Fast approximate leave-one-out cross-validation for large sample sizes 20 lumc-logo Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Lasso Regression: in more detail −500 −480 cvpl appr cvpl appr cvpl: lambda= 7.60564, cvpl cvpl= -477.3704 −520 real cvpl: lambda= 7.70299, −540 cvpl= -479.4855 5 10 15 lambda Fast approximate leave-one-out cross-validation for large sample sizes 20 lumc-logo Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary −400 Wang breast cancer data: ridge −600 −800 −700 cvpl −500 cvpl appr cvpl appr int cvpl train cvpl 0 10000 20000 30000 40000 lambda Fast approximate leave-one-out cross-validation for large sample sizes 50000 lumc-logo Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary −660 Wang breast cancer data: ridge cvpl −680 −675 −670 −665 cvpl appr cvpl appr int cvpl 0 10000 20000 30000 40000 lambda Fast approximate leave-one-out cross-validation for large sample sizes 50000 lumc-logo Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary cvpl appr cvpl appr int cvpl train cvpl cvpl −700 −680 −660 −640 −620 −600 −580 Wang breast cancer data: lasso 30 40 50 60 70 lambda Fast approximate leave-one-out cross-validation for large sample sizes lumc-logo Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Wang breast cancer data: lasso zoomed cvpl −700 −695 −690 −685 −680 cvpl appr cvpl appr cvpl intercept 30 40 50 60 70 lambda Fast approximate leave-one-out cross-validation for large sample sizes lumc-logo Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Efficiency Time needed to calculate cvl for specific value of λ, lasso λ = 7.70 real cvpl: 49.00 seconds appr cvpl: 6.09 seconds approximately 8 times as fast Time needed to calculate cvl for specific value of λ, ridge λ = 458.5 real cvpl: 389.27 seconds appr cvpl: 17.40 seconds more than 20 times as fast! lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Some additional comments Are these results representative of different datasets? What aspects of a dataset determine the performance of the approximation method? Back to the theory: λ ˆ (−i) − β ˆ λ )2 O (β Error diminishes when: n gets larger λ gets larger lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary In short... Approximate LOOCV gives reasonable approximate of λ in penalization methods reasonable outcomes of approximated cvl: comparisons between models possible works great for ridge; less stable for lasso Can be used to find ”neighborhood” of optimal λ Best for large values of n best possible approximations most time saved double LOOCV Fast approximate leave-one-out cross-validation for large sample sizes lumc-logo Rosa Meijer, Jelle Goeman Introduction The approximation method Results Summary Questions? lumc-logo Fast approximate leave-one-out cross-validation for large sample sizes Rosa Meijer, Jelle Goeman
© Copyright 2024