COMS 4721: Machine Learning for Data Science
Lecture 13, 3/10/2015
Prof. John Paisley
Columbia University

COMS W4721: Machine Learning for Data Science, Spring 2015

BOOSTING

For more, see the textbook: Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms, MIT Press, 2012. (I borrow some figures from that book.)

BAGGING CLASSIFIERS

Algorithm: Bagging binary classifiers

Given $(x_1, y_1), \dots, (x_n, y_n)$, $x \in \mathcal{X}$, $y \in \{-1, +1\}$:

- For $b = 1, \dots, B$:
  - Sample a bootstrap dataset $\mathcal{B}_b$ of size $n$ from $D_0 = \sum_{i=1}^{n} \frac{1}{n} \delta_{x_i}$.
  - Learn a classifier $f_b$ using data in $\mathcal{B}_b$.
- Set the classification rule to be

$$f_{\text{bag}}(x_0) = \text{sign}\Big( \sum_{b=1}^{B} f_b(x_0) \Big).$$

- With bagging, we saw how a committee of classifiers votes on a label.
- Each classifier is learned on a bootstrap sample from the data set.
- Learning a collection of classifiers is referred to as an ensemble method.

BOOSTING

"How is it that a committee of blockheads can somehow arrive at highly reasoned decisions, despite the weak judgment of the individual members?"
- Schapire & Freund, "Boosting: Foundations and Algorithms"

Boosting is another powerful method for ensemble learning. It is similar to bagging in that a set of classifiers is combined to make a better one.

We are free to choose the classifier, but a "weak" one is usually picked, i.e., one with accuracy a little better than random guessing, but very fast to learn.

Short history

- 1984: Leslie Valiant and Michael Kearns ask if "boosting" is possible.
- 1989: Robert Schapire creates the first boosting algorithm.
- 1990: Yoav Freund creates an optimal boosting algorithm.
- 1995: Freund and Schapire create AdaBoost (Adaptive Boosting), the major boosting algorithm.
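The bagging procedure described at the start of this lecture can be sketched in a few lines. This is only an illustration under stated assumptions: the inputs are one-dimensional, the base classifier is a decision stump fit by exhaustive search, and the names `fit_stump` and `bagging` are hypothetical helpers, not part of the lecture.

```python
import random

def fit_stump(xs, ys):
    """Fit a 1-D decision stump that predicts s if x > thr else -s.

    Thresholds are taken from the observed values, and the (thr, s)
    pair with the fewest training errors is kept.
    """
    best = None
    for thr in sorted(set(xs)):
        for s in (+1, -1):
            errors = sum(1 for x, y in zip(xs, ys)
                         if (s if x > thr else -s) != y)
            if best is None or errors < best[0]:
                best = (errors, thr, s)
    _, thr, s = best
    return lambda x: s if x > thr else -s

def bagging(xs, ys, B, seed=0):
    """Learn B stumps on bootstrap samples and return the voting rule f_bag."""
    rng = random.Random(seed)
    n = len(xs)
    classifiers = []
    for _ in range(B):
        # Bootstrap dataset: n indices drawn uniformly with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        classifiers.append(fit_stump([xs[i] for i in idx],
                                     [ys[i] for i in idx]))

    def f_bag(x0):
        vote = sum(f(x0) for f in classifiers)
        return 1 if vote >= 0 else -1  # sign of the vote; a tie goes to +1

    return f_bag

# Toy data (made up for illustration): negatives below 0.5, positives above.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [-1, -1, -1, -1, +1, +1, +1, +1]
f_bag = bagging(xs, ys, B=25)
print(f_bag(0.0), f_bag(1.0))
```

Choosing an odd `B` means the vote can never tie, so the tie-breaking convention in `f_bag` does not matter in this example.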
BAGGING VS BOOSTING (SCHEMATIC)

[Figure: Bagging (left) learns $f_1(x)$, $f_2(x)$, $f_3(x)$ from bootstrap samples of the training sample. Boosting (right) learns $f_1(x)$, $f_2(x)$, $f_3(x)$ from weighted samples of the training sample.]

THE ADABOOST ALGORITHM (A SAMPLING VERSION)

Algorithm: Boosting a binary classifier

Given $(x_1, y_1), \dots, (x_n, y_n)$, $x \in \mathcal{X}$, $y \in \{-1, +1\}$, set $w_1(i) = \frac{1}{n}$.

- For $t = 1, \dots, T$:
  1. Sample a bootstrap dataset $\mathcal{B}_t$ of size $n$ from $D_t = \sum_{i=1}^{n} w_t(i) \delta_{x_i}$.
  2. Learn a classifier $f_t$ using data in $\mathcal{B}_t$.
  3. Set $\epsilon_t = \sum_{i=1}^{n} w_t(i) 1\{y_i \neq f_t(x_i)\}$ and $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.
  4. Update $\tilde{w}_{t+1}(i) = w_t(i) \exp\{-\alpha_t y_i f_t(x_i)\}$.
  5. Normalize $w_{t+1}(i) = \tilde{w}_{t+1}(i) / \sum_j \tilde{w}_{t+1}(j)$.
- Set the classification rule to be

$$f_{\text{boost}}(x_0) = \text{sign}\Big( \sum_{t=1}^{T} \alpha_t f_t(x_0) \Big).$$

Comment: Step 1 can often be skipped and $w_t$ used directly in Step 2, which is the way AdaBoost is normally presented.

THE ADABOOST ALGORITHM (SCHEMATIC)

[Figure: Starting from the training sample, a weighted sample $\mathcal{B}_1 \sim D_1$ is classified to give $\{\alpha_1, f_1(x)\}$. Its error $\epsilon_1$ determines the next weighted sample $\mathcal{B}_2 \sim D_2$, which gives $\{\alpha_2, f_2(x)\}$. Its error $\epsilon_2$ determines $\mathcal{B}_3 \sim D_3$, which gives $\{\alpha_3, f_3(x)\}$. The output is $f_{\text{boost}}(x_0) = \text{sign}\big( \sum_{t=1}^{T} \alpha_t f_t(x_0) \big)$.]

BOOSTING A DECISION STUMP (EXAMPLE 1)

Original data
- Uniform data distribution.
- Learn a weak classifier. Here: use a decision stump.

[Figure: scatter of "+" and "-" points, and a decision stump that splits on $x_1 > 1.7$ with leaf predictions $\hat{y} = 1$ and $\hat{y} = 3$.]

Round 1 classifier
- Weighted error: $\epsilon_1 = 0.3$
- Weight update: $\alpha_1 = 0.42$

Weighted data after round 1

[Figure: the same points, resized according to their weights.]

Round 2 classifier
- Weighted error: $\epsilon_2 = 0.21$
- Weight update: $\alpha_2 = 0.65$

Weighted data after round 2

[Figure: the same points, resized according to their weights.]

Round 3 classifier
- Weighted error: $\epsilon_3 = 0.14$
- Weight update: $\alpha_3 = 0.92$

Classifier after three rounds

$$f_{\text{boost}}(x) = \text{sign}\big( 0.42\, f_1(x) + 0.65\, f_2(x) + 0.92\, f_3(x) \big)$$

BOOSTING A DECISION STUMP (EXAMPLE 2)

A toy problem

- Random guessing: 50% error
- Decision stump: 45.8% error
- Full decision tree: 24.7% error
- Boosted stump: 5.8% error

BOOSTING

[Figure: scatter plot. Point = one dataset. Location = error rate with and without boosting.]

The boosted version of the same classifier almost always produces better results.

BOOSTING

[Figure: two scatter plots of error rates across datasets.]

(left) Boosting a bad classifier is often better than not boosting a good one.
(right) Boosting a good classifier is often better still (but can take more time).
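The AdaBoost algorithm above can be sketched as follows. This is a minimal illustration, not the lecture's reference code: it uses the weights $w_t$ directly when fitting (skipping the sampling in Step 1, as the comment on the algorithm allows), the weak learner is a decision stump found by exhaustive search, and the names `fit_weighted_stump`, `adaboost` and `predict` as well as the clamp on $\epsilon_t$ are assumptions added for the example.

```python
import math

def fit_weighted_stump(X, y, w):
    """Return (error, feature j, threshold, sign s) with minimal weighted error.

    The stump predicts s if x[j] > threshold else -s.
    """
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for s in (+1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (s if xi[j] > thr else -s) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, s)
    return best

def adaboost(X, y, T):
    """Run T rounds of AdaBoost and return a list of (alpha, j, thr, s)."""
    n = len(X)
    w = [1.0 / n] * n                      # w_1(i) = 1/n
    ensemble = []
    for _ in range(T):
        eps, j, thr, s = fit_weighted_stump(X, y, w)
        # Clamp to avoid log(0) when a stump is perfect (an added safeguard).
        eps = min(max(eps, 1e-12), 1.0 - 1e-12)
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        preds = [s if x[j] > thr else -s for x in X]
        # Step 4: up-weight mistakes, down-weight correct points.
        w_tilde = [wi * math.exp(-alpha * yi * pi)
                   for wi, yi, pi in zip(w, y, preds)]
        Z = sum(w_tilde)                   # Step 5: normalizer
        w = [wt / Z for wt in w_tilde]
        ensemble.append((alpha, j, thr, s))
    return ensemble

def predict(ensemble, x0):
    """f_boost(x0): sign of the alpha-weighted vote of the stumps."""
    score = sum(a * (s if x0[j] > thr else -s) for a, j, thr, s in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data (made up): no single stump can separate the pattern + + - - + +.
X = [(0,), (1,), (2,), (3,), (4,), (5,)]
y = [+1, +1, -1, -1, +1, +1]
one_round = adaboost(X, y, T=1)
three_rounds = adaboost(X, y, T=3)
print([predict(one_round, x) for x in X])
print([predict(three_rounds, x) for x in X])
```

On this toy data a single stump misclassifies two of the six points, while three rounds reproduce all six labels. The weighted errors of the three rounds work out to 1/3, 1/4 and 1/6.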
BOOSTING AND FEATURE MAPS

Q: What makes boosting work so well?

A: This is a very well studied question. We will present one analysis later, but we can also give intuition by tying it in with what we've discussed.

The classification for a new $x_0$ from boosting is

$$f_{\text{boost}}(x_0) = \text{sign}\Big( \sum_{t=1}^{T} \alpha_t f_t(x_0) \Big).$$

Define $\phi(x) = [f_1(x), \dots, f_T(x)]^T$, where each $f_t(x) \in \{-1, +1\}$.

- We can think of $\phi(x)$ as a high dimensional feature map of $x$.
- The vector $\alpha = [\alpha_1, \dots, \alpha_T]^T$ corresponds to a hyperplane.
- So the classifier can be written $f_{\text{boost}}(x_0) = \text{sign}(\phi(x_0)^T \alpha)$.
- Boosting learns the feature mapping and hyperplane simultaneously.

APPLICATION: FACE DETECTION

FACE DETECTION (VIOLA & JONES, 2001)

Problem: Locate the faces in an image or video.

Processing: Divide the image into patches of different scales, e.g., 24 × 24, 48 × 48, etc. Extract features from each patch.

Classify each patch as face or no face using a boosted decision stump. This can be done in real time, for example by your digital camera (at 15 fps).

- One patch from a larger image. Mask it with many "feature extractors."
- Each pattern gives one number, which is the sum of all pixels in the black region minus the sum of pixels in the white region (total of 45,000+ features).

[Figure 5 (Viola and Jones): The first and second features selected by AdaBoost. The two features are shown in the top row and then overlayed on a typical training face in the bottom row. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. The second feature compares the intensities in the eye regions to the intensity across the ...]

FACE DETECTION (EXAMPLE RESULTS)

[Figure 10 (Viola and Jones): Output of the face detector on a number of test images from the MIT + CMU test set.]

ANALYSIS OF BOOSTING

Training error theorem

We can use analysis to make a statement about the predictive accuracy of boosting on the training data.

Theorem: Under the AdaBoost framework, if $\epsilon_t$ is the weighted error of classifier $f_t$, then for the classifier $f_{\text{boost}}(x_0) = \text{sign}\big( \sum_{t=1}^{T} \alpha_t f_t(x_0) \big)$,

$$\text{training error} = \frac{1}{n} \sum_{i=1}^{n} 1\{y_i \neq f_{\text{boost}}(x_i)\} \leq \exp\Big( -2 \sum_{t=1}^{T} \big( \tfrac{1}{2} - \epsilon_t \big)^2 \Big).$$

Even if each $\epsilon_t$ is only a little better than random guessing, the sum over $T$ classifiers can make the exponent large in magnitude, and so the bound small.

For example: $\epsilon_t = 0.45$, $T = 1000$ → training error ≤ 0.0067.

PROOF OF THEOREM

Setup

We break the proof into three steps. It is an application of the fact that

$$\text{if } \underbrace{a < b}_{\text{Step 2}} \text{ and } \underbrace{b < c}_{\text{Step 3}} \text{ then } \underbrace{a < c}_{\text{conclusion}}.$$

- Step 1 allows us to know what $b$ is above.
- Steps 2 and 3 correspond to the two inequalities.

Also recall the following step from AdaBoost:

- Update $\tilde{w}_{t+1}(i) = w_t(i) e^{-\alpha_t y_i f_t(x_i)}$ and normalize $w_{t+1}(i) = \tilde{w}_{t+1}(i) / \sum_j \tilde{w}_{t+1}(j)$. We define $Z_t = \sum_j \tilde{w}_{t+1}(j)$ for use in the proof.

PROOF OF THEOREM

Step 1

We first want to expand the equation of the weights to show that

$$w_{T+1}(i) = \frac{1}{n} \frac{\exp\{-y_i \sum_{t=1}^{T} \alpha_t f_t(x_i)\}}{\prod_{t=1}^{T} Z_t},$$

where the sum in the exponent is the boosted score whose sign is $f_{\text{boost}}(x_i)$.

Derivation of Step 1: To do so, first notice the recurrence $w_{t+1}(i) = w_t(i) e^{-\alpha_t y_i f_t(x_i)} / Z_t$. We can break down $w_t(i)$ in exactly the same way, and continue until $w_1(i)$:

$$w_{T+1}(i) = w_1(i) \times \frac{\exp\{-\alpha_1 y_i f_1(x_i)\}}{Z_1} \times \cdots \times \frac{\exp\{-\alpha_T y_i f_T(x_i)\}}{Z_T} = \frac{1}{n} \frac{\exp\{-y_i \sum_{t=1}^{T} \alpha_t f_t(x_i)\}}{\prod_{t=1}^{T} Z_t}.$$

PROOF OF THEOREM

Step 2

We next need to show that the training error of the classifier after $T$ rounds is not greater than $\prod_{t=1}^{T} Z_t$.

Derivation of Step 2: From Step 1,

$$w_{T+1}(i) = \frac{1}{n} \frac{\exp\{-y_i \sum_{t=1}^{T} \alpha_t f_t(x_i)\}}{\prod_{t=1}^{T} Z_t} \quad \Rightarrow \quad w_{T+1}(i) \prod_{t=1}^{T} Z_t = \frac{1}{n} e^{-y_i \sum_{t=1}^{T} \alpha_t f_t(x_i)}.$$

Therefore

$$\frac{1}{n} \sum_{i=1}^{n} 1\{y_i \neq f_{\text{boost}}(x_i)\} \leq \frac{1}{n} \sum_{i=1}^{n} \exp\Big\{-y_i \sum_{t=1}^{T} \alpha_t f_t(x_i)\Big\} = \sum_{i=1}^{n} w_{T+1}(i) \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} Z_t.$$

The inequality holds term by term: if $y_i \neq f_{\text{boost}}(x_i)$ then $y_i \sum_t \alpha_t f_t(x_i) \leq 0$ and the exponential is at least 1, and otherwise the exponential is still positive. The last equality uses $\sum_{i=1}^{n} w_{T+1}(i) = 1$.

PROOF OF THEOREM

Step 3

The final step is to calculate an upper bound on $Z_t$, and therefore on $\prod_{t=1}^{T} Z_t$. Since $\prod_{t=1}^{T} Z_t$ is an upper bound on the training error, the upper bound from Step 3 is also an upper bound on the training error.

Derivation of Step 3: This step is slightly more involved. It also shows why $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.

$$Z_t = \sum_{i=1}^{n} w_t(i) \exp\{-\alpha_t y_i f_t(x_i)\} = \sum_{i \,:\, y_i = f_t(x_i)} e^{-\alpha_t} w_t(i) + \sum_{i \,:\, y_i \neq f_t(x_i)} e^{\alpha_t} w_t(i) = e^{-\alpha_t} (1 - \epsilon_t) + e^{\alpha_t} \epsilon_t.$$

PROOF OF THEOREM

Derivation of Step 3 (continued): We're currently at $Z_t = e^{-\alpha_t} (1 - \epsilon_t) + e^{\alpha_t} \epsilon_t$.

Remember from Step 2 that

$$\text{training error} = \frac{1}{n} \sum_{i=1}^{n} 1\{y_i \neq f_{\text{boost}}(x_i)\} \leq \prod_{t=1}^{T} Z_t.$$

We want the training error to be small. We therefore pick $\alpha_t$ to minimize $Z_t$. This minimization can be done independently for each $t$, and the minimum occurs at

$$\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}.$$

Plugging this value in gives $Z_t = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}$.
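The minimization of $Z_t$ can be worked out explicitly. The derivation below is a sketch that treats $\epsilon_t$ as a fixed number with $0 < \epsilon_t < 1$; since the second derivative of $Z_t$ in $\alpha_t$ is positive, the stationary point is a minimum.

```latex
\begin{align*}
\frac{\partial Z_t}{\partial \alpha_t}
  &= -e^{-\alpha_t}(1-\epsilon_t) + e^{\alpha_t}\epsilon_t = 0
  \;\Longrightarrow\; e^{2\alpha_t} = \frac{1-\epsilon_t}{\epsilon_t}
  \;\Longrightarrow\; \alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}, \\
Z_t
  &= \sqrt{\frac{\epsilon_t}{1-\epsilon_t}}\,(1-\epsilon_t)
   + \sqrt{\frac{1-\epsilon_t}{\epsilon_t}}\,\epsilon_t
   = 2\sqrt{\epsilon_t(1-\epsilon_t)}.
\end{align*}
```

For instance, the round 1 error $\epsilon_1 = 0.3$ from Example 1 gives $\alpha_1 = \frac{1}{2}\ln(0.7/0.3) \approx 0.42$ and $Z_1 = 2\sqrt{0.21} \approx 0.917$.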
PROOF OF THEOREM

Derivation of Step 3 (conclusion): Thus $Z_t = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}$. We re-write this as

$$Z_t = \sqrt{1 - 4 \big( \tfrac{1}{2} - \epsilon_t \big)^2}.$$

[Figure: plot of $e^{-x}$ and $1 - x$, showing that $1 - x$ lies below $e^{-x}$.]

We use the general inequality $1 - x \leq e^{-x}$ to conclude that

$$Z_t = \Big( 1 - 4 \big( \tfrac{1}{2} - \epsilon_t \big)^2 \Big)^{\frac{1}{2}} \leq \Big( e^{-4 ( \frac{1}{2} - \epsilon_t )^2} \Big)^{\frac{1}{2}} = e^{-2 ( \frac{1}{2} - \epsilon_t )^2}.$$

PROOF OF THEOREM

Putting it all together

Step 3 showed that $Z_t \leq e^{-2 ( \frac{1}{2} - \epsilon_t )^2}$. Because both sides are positive, taking the product over $t$ preserves the inequality:

$$\prod_{t=1}^{T} Z_t \leq \prod_{t=1}^{T} e^{-2 ( \frac{1}{2} - \epsilon_t )^2} = e^{-2 \sum_{t=1}^{T} ( \frac{1}{2} - \epsilon_t )^2}.$$

From the earlier steps we showed that $\prod_{t=1}^{T} Z_t$ is also an upper bound on the training error, so

$$\text{training error} = \frac{1}{n} \sum_{i=1}^{n} 1\{y_i \neq f_{\text{boost}}(x_i)\} \leq \prod_{t=1}^{T} Z_t \leq e^{-2 \sum_{t=1}^{T} ( \frac{1}{2} - \epsilon_t )^2}.$$

The two ends of this chain are what we set out to prove.

TRAINING VS TESTING ERROR

Q: Driving the training error to zero leads one to ask, does boosting overfit?

A: Sometimes, but very often it doesn't!

[Figure: error versus rounds of boosting, with curves for C4.5 (tree) testing error, AdaBoost testing error, and AdaBoost training error.]
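As a numerical check of the training error theorem proved above, the two bounds in the final chain can be evaluated directly. The function names `training_error_bound` and `z_product` are hypothetical, and the constant error sequence is the one from the theorem's example ($\epsilon_t = 0.45$, $T = 1000$).

```python
import math

def training_error_bound(eps_list):
    """Theorem's bound: exp(-2 * sum_t (1/2 - eps_t)^2)."""
    return math.exp(-2.0 * sum((0.5 - e) ** 2 for e in eps_list))

def z_product(eps_list):
    """Intermediate bound from Step 3: product of Z_t = 2*sqrt(eps_t*(1-eps_t))."""
    prod = 1.0
    for e in eps_list:
        prod *= 2.0 * math.sqrt(e * (1.0 - e))
    return prod

eps_list = [0.45] * 1000
bound = training_error_bound(eps_list)
print(round(bound, 4))                # 0.0067, i.e. exp(-5)
print(z_product(eps_list) <= bound)   # True: the product of Z_t is the tighter bound
```

The product of the $Z_t$ is never larger than the exponential bound because each factor satisfies $Z_t \leq e^{-2(\frac{1}{2} - \epsilon_t)^2}$, which is exactly the inequality used in Step 3.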