COMS 4721: Machine Learning for Data Science
Lecture 13, 3/10/2015
Prof. John Paisley, Columbia University

BOOSTING

For more, see the textbook: Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms, MIT Press, 2012. (I borrow some figures from that book.)

BAGGING CLASSIFIERS

Algorithm: Bagging binary classifiers
Given (x_1, y_1), ..., (x_n, y_n), with x ∈ X and y ∈ {−1, +1}:
- For b = 1, ..., B:
  1. Sample a bootstrap dataset B_b of size n from D_0 = (1/n) Σ_{i=1}^n δ_{x_i}.
  2. Learn a classifier f_b using the data in B_b.
- Set the classification rule to be f_bag(x_0) = sign( Σ_{b=1}^B f_b(x_0) ).

- With bagging, we saw how a committee of classifiers votes on a label.
- Each classifier is learned on a bootstrap sample from the data set.
- Learning a collection of classifiers is referred to as an ensemble method.

BOOSTING

"How is it that a committee of blockheads can somehow arrive at highly reasoned decisions, despite the weak judgment of the individual members?"
- Schapire & Freund, "Boosting: Foundations and Algorithms"

Boosting is another powerful method for ensemble learning. It is similar to bagging in that a set of classifiers is combined to make a better one. We are free to choose the base classifier, but a "weak" one is usually picked, i.e., one whose accuracy is only a little better than random guessing but which is very fast to learn.

Short history
1984: Leslie Valiant and Michael Kearns ask whether "boosting" is possible.
1989: Robert Schapire creates the first boosting algorithm.
1990: Yoav Freund creates an optimal boosting algorithm.
1995: Freund and Schapire create AdaBoost (Adaptive Boosting), the major boosting algorithm.

BAGGING VS BOOSTING (SCHEMATIC)

[Figure: bagging learns f_1(x), f_2(x), f_3(x), ... on bootstrap samples drawn from the training sample; boosting learns f_1(x), f_2(x), f_3(x), ... on successively reweighted versions of the training sample.]

THE ADABOOST ALGORITHM (A SAMPLING VERSION)

Algorithm: Boosting a binary classifier
Given (x_1, y_1), ..., (x_n, y_n), with x ∈ X and y ∈ {−1, +1}, set w_1(i) = 1/n.
- For t = 1, ..., T:
  1. Sample a bootstrap dataset B_t of size n from D_t = Σ_{i=1}^n w_t(i) δ_{x_i}.
  2. Learn a classifier f_t using the data in B_t.
  3. Set ε_t = Σ_{i=1}^n w_t(i) 1{y_i ≠ f_t(x_i)} and α_t = (1/2) ln((1 − ε_t)/ε_t).
  4. Update w̃_{t+1}(i) = w_t(i) exp{−α_t y_i f_t(x_i)}.
  5. Normalize w_{t+1}(i) = w̃_{t+1}(i) / Σ_j w̃_{t+1}(j).
- Set the classification rule to be f_boost(x_0) = sign( Σ_{t=1}^T α_t f_t(x_0) ).

Comment: Step 1 can often be skipped and w_t used directly in Step 2, which is the way AdaBoost is normally presented. (A code sketch of this sampling version follows the schematic below.)

THE ADABOOST ALGORITHM (SCHEMATIC)

[Figure: at each round t, a weighted sample B_t ~ D_t is drawn from the training sample, a classifier is learned to produce the pair (α_t, f_t(x)), and the weighted error ε_t determines the next set of weights.]

f_boost(x_0) = sign( Σ_{t=1}^T α_t f_t(x_0) )
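To make the algorithm concrete, here is a minimal sketch of the sampling version of AdaBoost above. The choice of a brute-force decision stump as the weak learner, all function names, and the use of NumPy are illustrative assumptions, not details given in the lecture.

```python
import numpy as np

def learn_stump(X, y):
    """Fit a one-feature decision stump by brute force: try every
    (feature, threshold, sign) and keep the one with lowest error."""
    n, d = X.shape
    best = (np.inf, 0, 0.0, 1)                    # (error, feature, threshold, sign)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = np.mean(pred != y)
                if err < best[0]:
                    best = (err, j, thr, sign)
    _, j, thr, sign = best
    return lambda Z: np.where(Z[:, j] > thr, sign, -sign)

def adaboost(X, y, T=50, seed=0):
    """Sampling version of AdaBoost for labels y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                       # w_1(i) = 1/n
    alphas, classifiers = [], []
    for t in range(T):
        idx = rng.choice(n, size=n, p=w)          # Step 1: bootstrap B_t ~ D_t
        f_t = learn_stump(X[idx], y[idx])         # Step 2: weak classifier on B_t
        pred = f_t(X)
        eps = np.sum(w * (pred != y))             # Step 3: weighted error eps_t ...
        eps = np.clip(eps, 1e-10, 1 - 1e-10)      #   (guard against eps_t in {0, 1})
        alpha = 0.5 * np.log((1 - eps) / eps)     #   ... and alpha_t
        w = w * np.exp(-alpha * y * pred)         # Step 4: reweight
        w /= w.sum()                              # Step 5: normalize
        alphas.append(alpha)
        classifiers.append(f_t)
    def f_boost(X0):
        # Weighted vote of the T weak classifiers, followed by the sign
        votes = sum(a * f(X0) for a, f in zip(alphas, classifiers))
        return np.sign(votes)
    return f_boost
```

Boosting such stumps on a two-dimensional toy set produces exactly the kind of weighted combination of axis-aligned splits (like the x_1 > 1.7 rule) shown in the next example.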
BOOSTING A DECISION STUMP (EXAMPLE 1)

[Figure: a two-dimensional toy dataset of + and − points, shown with its data distribution and the decision boundary learned at each round.]

- Original data, uniform data distribution. Learn a weak classifier; here we use a decision stump, x_1 > 1.7, which predicts ŷ = +1 on one side of the threshold and ŷ = −1 on the other.
- Round 1 classifier: weighted error ε_1 = 0.3, weight update α_1 = 0.42.
- Weighted data after round 1: misclassified points receive larger weight.
- Round 2 classifier: weighted error ε_2 = 0.21, weight update α_2 = 0.65.
- Weighted data after round 2.
- Round 3 classifier: weighted error ε_3 = 0.14, weight update α_3 = 0.92.
- Classifier after three rounds: 0.42 × (stump 1) + 0.65 × (stump 2) + 0.92 × (stump 3), followed by the sign function.

BOOSTING A DECISION STUMP (EXAMPLE 2)

A toy problem:
- Random guessing: 50% error
- Decision stump: 45.8% error
- Full decision tree: 24.7% error
- Boosted stump: 5.8% error

BOOSTING

[Figure: scatter plot in which each point is one dataset and its location gives the error rate with and without boosting.]

The boosted version of the same classifier almost always produces better results.

BOOSTING

(left) Boosting a bad classifier is often better than not boosting a good one.
(right) Boosting a good classifier is often better still (but can take more time).

BOOSTING AND FEATURE MAPS

Q: What makes boosting work so well?
A: This is a very well studied question. We will present one analysis later, but we can also give intuition by tying it in with what we've discussed.

The classification of a new x_0 from boosting is f_boost(x_0) = sign( Σ_{t=1}^T α_t f_t(x_0) ).

Define φ(x) = [f_1(x), ..., f_T(x)]^T, where each f_t(x) ∈ {−1, +1}.
- We can think of φ(x) as a high-dimensional feature map of x.
- The vector α = [α_1, ..., α_T]^T corresponds to a hyperplane.
- So the classifier can be written f_boost(x_0) = sign(φ(x_0)^T α).
- Boosting learns the feature mapping and the hyperplane simultaneously.

APPLICATION: FACE DETECTION

FACE DETECTION (VIOLA & JONES, 2001)

Problem: Locate the faces in an image or video.

Processing: Divide the image into patches of different scales, e.g., 24 × 24, 48 × 48, etc. Extract features from each patch. Classify each patch as face or no face using a boosted decision stump. This can be done in real time, for example by your digital camera (at 15 fps).

Take one patch from the larger image and mask it with many "feature extractors." Each pattern gives one number: the sum of the pixels in the black region minus the sum of the pixels in the white region (45,000+ features per patch in total).

[Figure 5 from Viola & Jones: the first and second features selected by AdaBoost, shown alone and overlaid on a typical training face. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks, exploiting the observation that the eye region is often darker than the cheeks; the second compares the intensities in the eye regions to the intensity across the bridge of the nose. Evaluating a rectangle feature requires only 6 to 9 array references, each weak classifier one threshold operation, and their combination one multiply, one addition, and a final threshold per feature, which is what makes real-time scanning of sub-windows possible.]
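The speed argument above relies on each rectangle feature being computable with a handful of array references. The sketch below shows one common way to do this, via an integral image (summed-area table); the specific two-rectangle layout, patch size, and all names are illustrative assumptions rather than details from the lecture.

```python
import numpy as np

def integral_image(patch):
    """Summed-area table: ii[r, c] = sum of patch[:r+1, :c+1]."""
    return np.cumsum(np.cumsum(patch, axis=0), axis=1)

def box_sum(ii, top, left, h, w):
    """Sum of pixels in an h-by-w rectangle using 4 array references."""
    p = np.pad(ii, ((1, 0), (1, 0)))   # zero row/column so top=0, left=0 works
    return (p[top + h, left + w] - p[top, left + w]
            - p[top + h, left] + p[top, left])

def two_rect_feature(patch, top, left, h, w):
    """A Haar-like two-rectangle feature: sum over the upper ('black')
    rectangle minus sum over the lower ('white') rectangle."""
    ii = integral_image(patch)
    return (box_sum(ii, top, left, h, w)
            - box_sum(ii, top + h, left, h, w))

# Example: one 24x24 patch and one feature; a detector would enumerate all
# positions and scales of a few such patterns (45,000+ features per patch).
patch = np.random.default_rng(0).random((24, 24))
print(two_rect_feature(patch, top=2, left=4, h=4, w=12))
```

Each such number is then fed to a decision stump (one threshold operation), and the stumps are combined by boosting exactly as in the algorithm earlier.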
FACE DETECTION (EXAMPLE RESULTS)

[Figure 10 from Viola & Jones: output of the face detector on a number of test images from the MIT + CMU test set.]

ANALYSIS OF BOOSTING

Training error theorem

We can use analysis to make a statement about the predictive accuracy of boosting on the training data.

Theorem: Under the AdaBoost framework, if ε_t is the weighted error of classifier f_t, then for the classifier f_boost(x_0) = sign( Σ_{t=1}^T α_t f_t(x_0) ),

    training error = (1/n) Σ_{i=1}^n 1{y_i ≠ f_boost(x_i)} ≤ exp( −2 Σ_{t=1}^T (1/2 − ε_t)^2 ).

Even if each ε_t is only a little better than random guessing, accumulating these terms over T classifiers can lead to a substantial value in the exponent.

For example: ε_t = 0.45 and T = 1000 give training error ≤ 0.0067.
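As a quick numerical check of the example above (a small sketch, not part of the slides), the bound with a constant ε_t = 0.45 over T = 1000 rounds evaluates to exp(−5) ≈ 0.0067:

```python
import math

def training_error_bound(eps_list):
    """AdaBoost training-error bound: exp(-2 * sum_t (1/2 - eps_t)^2)."""
    return math.exp(-2 * sum((0.5 - e) ** 2 for e in eps_list))

print(training_error_bound([0.45] * 1000))   # ~0.0067, matching the example
```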
PROOF OF THEOREM

Setup

We break the proof into three steps. It is an application of the fact that if a ≤ b (Step 2) and b ≤ c (Step 3), then a ≤ c (the conclusion).

- Step 1 tells us what the intermediate quantity b is.
- Steps 2 and 3 establish the two inequalities.
- Also recall the weight update from AdaBoost: w̃_{t+1}(i) = w_t(i) e^{−α_t y_i f_t(x_i)}, normalized to w_{t+1}(i) = w̃_{t+1}(i) / Σ_j w̃_{t+1}(j). For use in the proof, we define the normalizer Z_t = Σ_j w̃_{t+1}(j).

Step 1

We first expand the equation of the weights to show that

    w_{T+1}(i) = (1/n) exp{ −y_i Σ_{t=1}^T α_t f_t(x_i) } / ( Π_{t=1}^T Z_t ),

where Σ_{t=1}^T α_t f_t(x_i) is the weighted vote whose sign gives f_boost(x_i).

Derivation of Step 1: First notice the recurrence w_{t+1}(i) = w_t(i) e^{−α_t y_i f_t(x_i)} / Z_t. We can break down w_t(i) in exactly the same way, and continue until we reach w_1(i) = 1/n:

    w_{T+1}(i) = w_1(i) × exp{−α_1 y_i f_1(x_i)} / Z_1 × ··· × exp{−α_T y_i f_T(x_i)} / Z_T
               = (1/n) exp{ −y_i Σ_{t=1}^T α_t f_t(x_i) } / ( Π_{t=1}^T Z_t ).

Step 2

We next show that the training error of the boosted classifier is not greater than Π_{t=1}^T Z_t.

Derivation of Step 2: Rearranging Step 1 gives w_{T+1}(i) Π_{t=1}^T Z_t = (1/n) exp{−y_i Σ_t α_t f_t(x_i)}. Also, whenever y_i ≠ f_boost(x_i) the vote y_i Σ_t α_t f_t(x_i) is non-positive, so 1{y_i ≠ f_boost(x_i)} ≤ exp{−y_i Σ_t α_t f_t(x_i)}. Therefore

    (1/n) Σ_{i=1}^n 1{y_i ≠ f_boost(x_i)} ≤ (1/n) Σ_{i=1}^n exp{−y_i Σ_{t=1}^T α_t f_t(x_i)}
                                          = Σ_{i=1}^n w_{T+1}(i) Π_{t=1}^T Z_t
                                          = Π_{t=1}^T Z_t,

since the weights w_{T+1}(i) sum to one.

Step 3

The final step is to calculate an upper bound on Z_t, and therefore on Π_{t=1}^T Z_t. Since Π_{t=1}^T Z_t is an upper bound on the training error, the upper bound from Step 3 is also an upper bound on the training error.

Derivation of Step 3: This step is slightly more involved. It also shows why α_t = (1/2) ln((1 − ε_t)/ε_t).

    Z_t = Σ_{i=1}^n w_t(i) exp{−α_t y_i f_t(x_i)}
        = e^{−α_t} Σ_{i : y_i = f_t(x_i)} w_t(i) + e^{α_t} Σ_{i : y_i ≠ f_t(x_i)} w_t(i)
        = e^{−α_t} (1 − ε_t) + e^{α_t} ε_t.

We are currently at Z_t = e^{−α_t}(1 − ε_t) + e^{α_t} ε_t. Remember from Step 2 that

    training error = (1/n) Σ_{i=1}^n 1{y_i ≠ f_boost(x_i)} ≤ Π_{t=1}^T Z_t.

We want the training error to be small, so we pick α_t to minimize Z_t. This minimum is independent for each t; setting the derivative e^{α_t} ε_t − e^{−α_t}(1 − ε_t) to zero gives

    α_t = (1/2) ln( (1 − ε_t)/ε_t ).

Plugging this value in gives Z_t = 2 sqrt( ε_t (1 − ε_t) ).

We rewrite this as Z_t = sqrt( 1 − 4(1/2 − ε_t)^2 ) and use the general inequality 1 − x ≤ e^{−x} (the line 1 − x lies below the curve e^{−x}) to conclude that

    Z_t = ( 1 − 4(1/2 − ε_t)^2 )^{1/2} ≤ ( e^{−4(1/2 − ε_t)^2} )^{1/2} = e^{−2(1/2 − ε_t)^2}.

Putting it all together

Step 3 showed that Z_t ≤ e^{−2(1/2 − ε_t)^2}. Because both sides are positive, taking the product over t preserves the inequality:

    Π_{t=1}^T Z_t ≤ Π_{t=1}^T e^{−2(1/2 − ε_t)^2} = exp( −2 Σ_{t=1}^T (1/2 − ε_t)^2 ).

From the earlier steps, Π_{t=1}^T Z_t is also an upper bound on the training error, so

    training error = (1/n) Σ_{i=1}^n 1{y_i ≠ f_boost(x_i)} ≤ Π_{t=1}^T Z_t ≤ exp( −2 Σ_{t=1}^T (1/2 − ε_t)^2 ).

The two ends of this chain are what we set out to prove.

TRAINING VS TESTING ERROR

Q: Driving the training error to zero leads one to ask: does boosting overfit?
A: Sometimes, but very often it doesn't!

[Figure: error vs. rounds of boosting, comparing the C4.5 (tree) testing error with the AdaBoost testing error and the AdaBoost training error.]
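As a final sanity check (again a sketch, not part of the lecture), the snippet below verifies numerically that AdaBoost's choice of α_t minimizes Z_t = e^{−α}(1 − ε) + e^{α}ε, that the minimized value equals 2 sqrt(ε(1 − ε)), and that it sits below the bound e^{−2(1/2 − ε)^2} used in the proof. The example value ε = 0.3 is arbitrary.

```python
import math

def Z(alpha, eps):
    """Z_t as a function of alpha for a fixed weighted error eps."""
    return math.exp(-alpha) * (1 - eps) + math.exp(alpha) * eps

eps = 0.3
alpha_star = 0.5 * math.log((1 - eps) / eps)               # AdaBoost's choice of alpha_t
nearby = [Z(alpha_star + d, eps) for d in (-0.1, 0.0, 0.1)]
print(min(nearby) == nearby[1])                            # True: Z is smallest at alpha_star
print(Z(alpha_star, eps), 2 * math.sqrt(eps * (1 - eps)))  # both ~0.9165: the closed form
print(2 * math.sqrt(eps * (1 - eps))
      <= math.exp(-2 * (0.5 - eps) ** 2))                  # True: below the proof's bound
```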