COMP 598 – Applied Machine Learning
Lecture 11: Ensemble Learning (cont’d)

Instructor: Joelle Pineau ([email protected])
TAs: Pierre-Luc Bacon ([email protected]), Angus Leigh ([email protected])
Class web page: www.cs.mcgill.ca/~jpineau/comp598


Today's quiz (5 min.)
• Consider the training data below (marked "+", "-" for the output), and the test data (marked "A", "B", "C").
  – What class would the 1-NN algorithm predict for each test point?
  – What class would the 3-NN algorithm predict for each test point?
  – Which training points could be removed without changing the 3-NN decision rule?
• Properties of the NN algorithm?
[Figure: training points ("+", "-") and test points A, B, C plotted in the X1–X2 plane.]


Today's quiz (5 min.) – solutions
• What class would the 1-NN algorithm predict for each test point? A = +, B = -, C = -.
• What class would the 3-NN algorithm predict for each test point? A = +, B = -, C = +.
• Which training points could be removed without changing the 3-NN decision rule? They are shown removed in the figure on this slide.
• Properties of the NN algorithm?
[Figure: the same data with the removable training points deleted.]


Random forests (Breiman)
• Basic algorithm:
  – Use K bootstrap replicates to train K different trees.
  – At each node, pick m variables at random (use m < M, the total number of features).
  – Determine the best test (using normalized information gain).
  – Recurse until the tree reaches maximum depth (no pruning).
• Comments:
  – Each individual tree has high variance, but the ensemble averages over trees, which reduces variance.
  – Random forests are very competitive for both classification and regression, but still subject to overfitting.


Extremely randomized trees (Geurts et al., 2005)
• Basic algorithm:
  – Construct K decision trees.
  – At each node, pick m attributes at random (without replacement) and pick a random test involving each attribute.
  – Evaluate all tests (using a normalized information gain metric) and pick the best one for the node.
  – Continue until a desired depth, or a desired minimum number of instances (nmin) at each leaf, is reached.
• Comments:
  – Very reliable method for both classification and regression.
  – The smaller m is, the more randomized the trees are; small m works best, especially with high levels of noise.
  – Small nmin means less bias and more variance, but the variance is controlled by averaging over trees.
  – (A short scikit-learn sketch contrasting random forests and extremely randomized trees appears after the Additive models slide below.)


Randomization in general
• Instead of searching very hard for the best hypothesis, generate lots of random ones, then average their results.
• Examples: random feature selection, random projections.
• Advantages:
  – Very fast, easy, can handle lots of data.
  – Can circumvent difficulties in optimization.
  – Averaging reduces the variance introduced by randomization.
• Disadvantages:
  – The combined predictor can be more expensive to evaluate (must go over all trees).
  – Still typically subject to overfitting.
  – Low interpretability compared to standard decision trees.


Additive models
• In an ensemble, the output on any instance is computed by averaging the outputs of several hypotheses.
• Idea: don't construct the hypotheses independently. Instead, new hypotheses should focus on instances that are problematic for the existing hypotheses.
  – If an example is difficult, more components should focus on it.
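Both randomized-tree ensembles above have standard implementations in scikit-learn (RandomForestClassifier and ExtraTreesClassifier). The sketch below is an illustration only, not part of the original slides: the synthetic data set and the particular parameter values are arbitrary choices, and the hyperparameter names map onto the lecture's notation as noted in the comments.

```python
# Sketch: random forests vs. extremely randomized trees (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data set, purely for illustration.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=0)

# Random forest: K bootstrap replicates, best split among m random features.
forest = RandomForestClassifier(n_estimators=100,     # K trees
                                max_features="sqrt",  # m < M features per node
                                random_state=0)

# Extremely randomized trees: splits chosen among random thresholds as well,
# and (by default in this library) trained on the full sample, not bootstraps.
extra = ExtraTreesClassifier(n_estimators=100,
                             max_features="sqrt",
                             min_samples_split=2,     # roughly the role of n_min
                             random_state=0)

for name, model in [("random forest", forest), ("extra trees", extra)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

Here n_estimators plays the role of K, max_features the role of m, and min_samples_split roughly the role of nmin from the Geurts et al. description.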
Boosting
• Recall the algorithm:
  – Use the training set to train a simple predictor.
  – Re-weight the training examples, putting more weight on examples that were not properly classified by the previous predictor.
  – Repeat n times.
  – Combine the simple hypotheses into a single, accurate predictor.
[Figure: the original data D1 is re-weighted into D2, ..., Dn; a weak learner is trained on each Di, giving hypotheses H1, H2, ..., Hn; the final hypothesis is F(H1, H2, ..., Hn).]


Notation
• Assume that examples are drawn independently from some probability distribution P on the set of possible data D.
• Let JP(h) be the expected error of hypothesis h when data is drawn from P:
    JP(h) = Σ<x,y> J(h(x), y) P(<x,y>),
  where J(h(x), y) could be the squared error, or the 0/1 loss.


Weak learners
• Assume we have some "weak" binary classifiers. E.g., a decision stump is a single-node decision tree that applies one threshold test: xi > t.
• "Weak" means JP(h) < 1/2 - γ (assuming 2 classes), where γ > 0, i.e. the true error of the classifier is only slightly better than random guessing.
• Questions:
  – How do we re-weight the examples?
  – How do we combine many simple predictors into a single classifier?


Toy example (figure only)

Toy example: First step (figure only)

Toy example: Second step (figure only)

Toy example: Third step (figure only)

Toy example: Final hypothesis (figure only)


AdaBoost (Freund & Schapire, 1995)
• Algorithm (shown as a figure on the slide; these are the standard AdaBoost steps):
  – Initialize the example weights D1(i) = 1/N.
  – For t = 1, ..., T: train a weak learner ht on the data weighted by Dt; compute its weighted error εt = Σi Dt(i) I(ht(xi) ≠ yi); set αt = (1/2) ln((1 - εt)/εt), the weight of weak learner t; update Dt+1(i) ∝ Dt(i) exp(-αt yi ht(xi)) and normalize.
  – Output the final hypothesis hf(x) = sign(Σt αt ht(x)).
  – (A NumPy sketch of these updates follows after the Properties of AdaBoost slide.)


Extensions: Multiclass problems
• Reduce to a binary problem by creating several binary questions for each example:
  – Does or does not example x belong to class 1?
  – Does or does not example x belong to class 2?
  – Does or does not example x belong to class 3?
  – ...


Properties of AdaBoost
• Compared to other boosting algorithms, the main insight is to automatically adapt to the error rate at each iteration.
• Training error on the final hypothesis is at most
    Πt 2√(εt (1 - εt)) = Πt √(1 - 4 γt²) ≤ exp(-2 Σt γt²)
  (recall: γt is how much better than random ht is, i.e. εt = 1/2 - γt).
• AdaBoost therefore reduces the training error exponentially fast, as long as each weak learner stays a little better than random.
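To make the re-weighting and the weak-learner weights αt concrete, here is a minimal NumPy sketch of AdaBoost with decision stumps, written to mirror the standard updates on the AdaBoost slide above. It is an illustration only: the helper names, the brute-force stump search, and the tiny synthetic data set are my own choices, not code from the course.

```python
# Sketch: AdaBoost with decision stumps (assumes NumPy; labels in {-1, +1}).
import numpy as np

def best_stump(X, y, w):
    """Weak learner: the single-feature threshold test with lowest weighted error."""
    n, d = X.shape
    best = (np.inf, None)                     # (weighted error, (feature, threshold, sign))
    for j in range(d):
        for t in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.where(X[:, j] > t, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, (j, t, s))
    return best

def stump_predict(X, stump):
    j, t, s = stump
    return s * np.where(X[:, j] > t, 1, -1)

def adaboost(X, y, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                   # D_1(i) = 1/N
    stumps, alphas = [], []
    for _ in range(T):
        err, stump = best_stump(X, y, w)
        err = max(err, 1e-12)                 # guard against a perfect stump
        alpha = 0.5 * np.log((1 - err) / err) # weight of weak learner t
        pred = stump_predict(X, stump)
        w *= np.exp(-alpha * y * pred)        # up-weight misclassified examples
        w /= w.sum()                          # re-normalize to a distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(X, stumps, alphas):
    votes = sum(a * stump_predict(X, s) for s, a in zip(stumps, alphas))
    return np.sign(votes)                     # weighted vote of the stumps

# Tiny illustrative data set (not the toy example from the slides).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
stumps, alphas = adaboost(X, y, T=20)
print("training error:", np.mean(adaboost_predict(X, stumps, alphas) != y))
```

On a simple data set like this, the training error typically falls toward zero within a few dozen rounds, consistent with the exponential bound on the Properties of AdaBoost slide.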
Real data set: Text Categorization
• Decision stumps: test for the presence of a word or short phrase.
  – Example: "If the word Clinton appears in the document, predict the document is about politics."
• Databases: AP, Reuters.


Boosting: Experimental Results [Freund & Schapire, 1996]
• Comparison of C4.5, boosting C4.5, and boosting decision stumps (depth-1 trees), on 27 benchmark datasets.
[Figure: test error of each method on the benchmark datasets.]


Bagging vs. Boosting
• Comparison of C4.5 versus boosting FindAttrTest, boosting FindDecRule, boosting C4.5, and bagging C4.5.
• These experiments indicate that boosting using pseudo-loss did dramatically better than boosting using error.
[Figure: scatter plots of C4.5's error against the error of each of the other methods.]


Bagging vs. Boosting
• Bagging is typically faster, but may get a smaller error reduction (not by much).
• Bagging works well with "reasonable" classifiers.
• Boosting works with very simple classifiers. E.g., Boostexter does text classification using decision stumps based on single words.
• Boosting may have a problem if a lot of the data is mislabeled, because it will focus heavily on those examples, leading to overfitting.


Why does boosting work?
• Weak learners have high bias. By combining them, we get more expressive classifiers; hence, boosting is a bias-reduction technique.
• AdaBoost looks for a good approximation to the log-odds ratio, within the space of functions that can be captured by a linear combination of the base classifiers.
• What happens as we run boosting longer? Intuitively, we get more and more complex hypotheses. How would you expect bias and variance to evolve over time?


A naive (but reasonable) analysis of generalization error
• Expect the training error to continue to drop (until it reaches 0).
• Expect the test error to increase as we get more voters and hf becomes too complex.
[Figure: sketch of the expected training and test error curves as the number of hypotheses grows.]


Actual typical run of AdaBoost
• Test error does not increase, even after 1000 rounds (more than 2 million decision nodes!).
• Test error continues to drop even after the training error reaches 0!
• These are consistent results across many sets of experiments!
• Conjecture: boosting does not overfit!
[Figure: training and test error vs. number of rounds, from 10 to 1000 on a log scale.]
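The behaviour on the last two slides is easy to probe empirically. Below is a small scikit-learn sketch that tracks AdaBoost's test error as rounds are added and compares it to bagging full decision trees. It is an illustration under my own choices of synthetic data set and parameters, not an experiment from the lecture; in this library the default AdaBoost weak learner is a depth-1 decision stump and the default bagging base model is a full decision tree.

```python
# Sketch: test error of boosting vs. bagging as the ensemble grows (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=15,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Boosting: weighted vote of weak learners trained sequentially on re-weighted data.
boost = AdaBoostClassifier(n_estimators=300, random_state=0)
boost.fit(X_tr, y_tr)
for t, y_pred in enumerate(boost.staged_predict(X_te), start=1):
    if t in (1, 10, 100, 300):
        print(f"boosting, {t:3d} rounds: test error = {np.mean(y_pred != y_te):.3f}")

# Bagging: unweighted vote of full trees trained on bootstrap replicates.
bag = BaggingClassifier(n_estimators=100, random_state=0)
bag.fit(X_tr, y_tr)
print(f"bagging, 100 trees: test error = {np.mean(bag.predict(X_te) != y_te):.3f}")
```

On many data sets the boosted test error keeps improving, or at least does not rise, well past the point where the training error reaches zero, which is consistent with the "actual typical run" slide; with heavily mislabeled data, boosting can instead overfit, as noted on the Bagging vs. Boosting slide.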
What you should know
• Ensemble methods combine several hypotheses into one prediction.
• They work better than the best individual hypothesis from the same class because they reduce bias or variance (or both).
• Extremely randomized trees are a variance-reduction technique.
• Bagging is mainly a variance-reduction technique, useful for complex hypotheses.
  – Main idea: sample the data repeatedly, train several classifiers, and average their predictions (a short sketch follows below).
• Boosting focuses on harder examples, and gives a weighted vote to the hypotheses.
• Boosting works by reducing bias and increasing the classification margin.
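As a closing illustration of the bagging recipe summarized above (sample with replacement, train several classifiers, combine their predictions), here is a minimal from-scratch sketch. The helper names and the use of scikit-learn decision trees as the base classifier are my own choices, not part of the lecture.

```python
# Sketch: bagging by hand (assumes NumPy and scikit-learn decision trees).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, K=25, rng=None):
    """Train K trees, each on a bootstrap replicate of (X, y)."""
    rng = rng or np.random.default_rng(0)
    n = len(y)
    trees = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)   # sample n points with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Majority vote over the ensemble (use averaging for regression)."""
    votes = np.stack([t.predict(X) for t in trees])   # shape (K, n_test)
    return (votes.mean(axis=0) >= 0.5).astype(int)    # labels assumed in {0, 1}

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
trees = bagged_trees(X_tr, y_tr, K=25)
single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("single tree test error:", np.mean(single.predict(X_te) != y_te))
print("bagged 25 trees test error:", np.mean(bagged_predict(trees, X_te) != y_te))
```

The bagged vote usually has lower test error than the single deep tree, reflecting the variance reduction from averaging.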