
Foundations of Machine Learning (FOML)
Lecture 4
Kristiaan Pelckmans
September 25, 2015
Overview
Today:
- AdaBoost.
- Analysis.
- Extensions.
- Discussion.
Overview (Ct'd)
Course:
1. Introduction.
2. Support Vector Machines (SVMs).
3. Probably Approximately Correct (PAC) analysis.
4. Boosting. Fri 2015-09-25.
5. Online Learning. Tue 2015-09-29.
6. Multi-class classification (Kalyam), Fri 2015-10-02.
7. Ranking (Jakob, Fredrik, Andreas), Tue 2015-10-06.
8. Regression (Tatiana, Ruben), Thu 2015-10-08.
9. Stability-based analysis (Tilo, Juozas, Yevgen), Tue 2015-10-13.
10. Dimensionality reduction (Fredrik, Thomas, Ali Basirat), Thu 2015-10-15.
11. Reinforcement learning (Sholeh), Wed 2015-10-21.
12. Presentations of the results of the mini-projects.
Overview (Ct'd)
Miniprojects: AdaBoost
1. Detecting faces.
2. Integral image representation (see the sketch after this list).
3. AdaBoost.
4. Hierarchical classification.
5. Empirical and actual risk.
6. Complexity of weak learners.
7. How about tuning T?
8. Numerical results.
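For the integral-image item above, a minimal sketch (assuming NumPy; the function names and the toy image are illustrative, not taken from the course material). Once the table is built, any rectangular box sum costs only four lookups, which is what makes Haar-like features cheap to evaluate:

import numpy as np

def integral_image(img):
    # ii[r, c] = sum of img[:r, :c]; a leading zero row/column avoids edge cases.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    # Sum of img[r0:r1, c0:c1] from four lookups in the integral image.
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

# Example: a Haar-like two-rectangle feature = left half minus right half.
img = np.arange(16.0).reshape(4, 4)
ii = integral_image(img)
feature = box_sum(ii, 0, 0, 4, 2) - box_sum(ii, 0, 2, 4, 4)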
AdaBoost.
Average learning and boosting
- How to boost weak learners into a global strong learner?
- Weak learner of a concept class C: for any distribution D and any δ > 0, one gets a hypothesis $h_S$ with
  $\Pr_{S \sim D^m}\left[ R(h_S) \le \tfrac{1}{2} - \gamma \right] \ge 1 - \delta$,
  provided $m$ samples with $m \ge \mathrm{poly}(1/\delta, n, \mathrm{size}(c))$.
- E.g. decision stumps or decision trees; a stump is sketched below.
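A minimal decision-stump sketch as a concrete weak learner (assuming NumPy; the exhaustive threshold search and the function names are illustrative, not the course's reference implementation):

import numpy as np

def fit_stump(X, y, D):
    # X: (m, n) features, y: (m,) labels in {-1, +1}, D: (m,) weights summing to 1.
    m, n = X.shape
    best = (np.inf, 0, 0.0, 1)  # (weighted error, feature index, threshold, polarity)
    for j in range(n):
        for theta in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.where(X[:, j] <= theta, 1, -1)
                err = np.sum(D * (pred != y))
                if err < best[0]:
                    best = (err, j, theta, s)
    return best  # weak learning asks that this error stays below 1/2 - gamma

def stump_predict(X, j, theta, s):
    # Predict with a fitted stump (threshold on feature j, with polarity s).
    return s * np.where(X[:, j] <= theta, 1, -1)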
AdaBoost.
Citations
- M. Kearns, L.G. Valiant. Cryptographic limitations on learning boolean formulae and automata. Technical report, 1988.
- M. Kearns, L.G. Valiant. Efficient distribution-free learning of stochastic concepts, 1990.
- R.E. Schapire. The strength of weak learnability. Machine Learning, 1990.
- Y. Freund. Boosting a weak learning algorithm by majority. COLT, 1990.
- L. Breiman. Bagging predictors. Machine Learning, 24(2): 123-140, 1996.
- Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1): 119-139, 1997.
- R.E. Schapire, Y. Freund, P. Bartlett and W.S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5): 1651-1686, 1998.
- D. Mease, A. Wyner. Evidence Contrary to the Statistical View of Boosting. JMLR, 2008.
- R.E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.
AdaBoost (Ct'd)
Initiate $D_1(i) = \frac{1}{m}$ for all $i$.
For $t = 1, \dots, T$:
1. Choose a new base classifier $h_t \in H$ based on the weighted dataset, with weighted error
   $\epsilon_t = \sum_{i=1}^{m} D_t(i)\, I(y_i \neq h_t(x_i))$.
2. Let $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$.
3. Let $Z_t = 2\left(\epsilon_t (1 - \epsilon_t)\right)^{1/2}$.
4. Let $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \exp\left(-\alpha_t y_i h_t(x_i)\right)$ for all $i$.
Then output $f = \sum_{t=1}^{T} \alpha_t h_t$ and $g = \mathrm{sign}(f)$, as implemented in the sketch below.
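The update rules above translate almost line by line into code. A sketch, assuming scikit-learn's depth-1 decision trees as the base class H; the clamping of the weighted error is an added numerical safeguard, not part of the algorithm as stated:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T):
    # X: (m, n) features, y: (m,) labels in {-1, +1}.
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                          # D_1(i) = 1/m
    hs, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.sum(D * (pred != y))                # weighted error eps_t
        eps = float(np.clip(eps, 1e-12, 1 - 1e-12))  # guard against eps_t in {0, 1}
        alpha = 0.5 * np.log((1.0 - eps) / eps)      # alpha_t
        D = D * np.exp(-alpha * y * pred)            # numerator of D_{t+1}
        D = D / D.sum()                              # dividing by Z_t
        hs.append(h)
        alphas.append(alpha)
    return hs, np.array(alphas)

def predict(hs, alphas, X):
    # g(x) = sign(sum_t alpha_t h_t(x)).
    f = sum(a * h.predict(X) for a, h in zip(alphas, hs))
    return np.sign(f)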
AdaBoost (Ct'd)
Theorem (bound on the empirical error of $g$). Assume $\epsilon_t \le \frac{1}{2}$ for all $t$. Then
$\hat{R}(g) \le \exp\left( -2 \sum_{t=1}^{T} \left( \tfrac{1}{2} - \epsilon_t \right)^2 \right)$.
Proof: First, unrolling the update rule gives
$D_{T+1}(i) = D_T(i)\, \frac{\exp\left(-\alpha_T y_i h_T(x_i)\right)}{Z_T} = \frac{\exp\left(-y_i f(x_i)\right)}{m \prod_{t=1}^{T} Z_t}$,
and then
$\hat{R}(g) = \frac{1}{m} \sum_{i=1}^{m} I\left(y_i f(x_i) \le 0\right) \le \frac{1}{m} \sum_{i=1}^{m} \exp\left(-y_i f(x_i)\right) = \frac{1}{m} \sum_{i=1}^{m} \left( m \prod_{t=1}^{T} Z_t \right) D_{T+1}(i) = \prod_{t=1}^{T} Z_t.$
AdaBoost (Ct'd)
Proof (Ct'd): Next,
$Z_t = \sum_{i=1}^{m} D_t(i) \exp\left(-\alpha_t y_i h_t(x_i)\right) = \sum_{i: y_i h_t(x_i) = +1} D_t(i)\, e^{-\alpha_t} + \sum_{i: y_i h_t(x_i) = -1} D_t(i)\, e^{\alpha_t} = (1 - \epsilon_t) e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = 2\sqrt{\epsilon_t (1 - \epsilon_t)}$,
where the last step plugs in the (optimal!) choice of $\alpha_t$. Hence
$\prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_{t=1}^{T} \sqrt{1 - 4\left(\tfrac{1}{2} - \epsilon_t\right)^2} \le \exp\left( -2 \sum_{t=1}^{T} \left( \tfrac{1}{2} - \epsilon_t \right)^2 \right)$,
where the final inequality uses $1 - x \le e^{-x}$.
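A quick numerical check of this chain of (in)equalities, for an arbitrary made-up sequence of weighted errors eps_t < 1/2 (assuming NumPy):

import numpy as np

eps = np.array([0.30, 0.40, 0.45, 0.35, 0.48])      # made-up eps_t values
Z = 2.0 * np.sqrt(eps * (1.0 - eps))                # Z_t = 2 sqrt(eps_t (1 - eps_t))
lhs = np.prod(Z)                                    # prod_t Z_t, the bound on the empirical risk
mid = np.prod(np.sqrt(1.0 - 4.0 * (0.5 - eps)**2))  # same product, rewritten
rhs = np.exp(-2.0 * np.sum((0.5 - eps)**2))         # exp(-2 sum_t (1/2 - eps_t)^2)
print(lhs, mid, rhs)                                # lhs == mid <= rhs, as claimed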
AdaBoost (Ct'd)
Coordinate Descent. Consider the function
$F_t(\alpha) = \sum_{i=1}^{m} \exp\left( -y_i \sum_{s=1}^{t} \alpha_s h_s(x_i) \right)$.
Then, at each iteration $t$, one chooses the direction $h_t$ by minimising the derivative (taken at $\alpha_t = 0$),
$\arg\min_{h_t} \frac{d F_t(\alpha)}{d \alpha_t} = \arg\min_{h_t}\, (2 \epsilon_t - 1)\, m \prod_{s=1}^{t-1} Z_s$.
Moreover, the step size follows from
$\frac{d F_t(\alpha)}{d \alpha_t} = 0 \iff \alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$.
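A small sanity check (assuming NumPy; the value of eps_t is made up) that the closed-form alpha_t indeed minimises (1 - eps_t) e^{-alpha} + eps_t e^{alpha}, the alpha_t-dependent factor of F_t:

import numpy as np

eps = 0.3                                         # made-up weighted error eps_t
alphas = np.linspace(-3.0, 3.0, 100001)
phi = (1 - eps) * np.exp(-alphas) + eps * np.exp(alphas)
alpha_grid = alphas[np.argmin(phi)]               # numerical minimiser
alpha_closed = 0.5 * np.log((1 - eps) / eps)      # alpha_t from the slide
print(alpha_grid, alpha_closed)                   # both approximately 0.4236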
AdaBoost (Ct'd)
Relation to logistic regression
- Zero-one loss: $\frac{1}{m} \sum_{i=1}^{m} I\left(y_i h(x_i) < 0\right)$.
- Hinge loss: $\frac{1}{m} \sum_{i=1}^{m} \max\left(1 - y_i h(x_i), 0\right)$.
- Logistic loss: $\frac{1}{m} \sum_{i=1}^{m} \log\left(1 + \exp(-2 y_i h(x_i))\right)$.
- Boosting (exponential) loss: $\frac{1}{m} \sum_{i=1}^{m} \exp\left(-y_i h(x_i)\right)$.
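The four losses evaluated on a made-up vector of margins y_i h(x_i), as a point of comparison (assuming NumPy):

import numpy as np

margins = np.array([2.1, 0.4, -0.3, 1.5, -1.2])        # made-up margins y_i h(x_i)
zero_one = np.mean(margins < 0)
hinge    = np.mean(np.maximum(1 - margins, 0))
logistic = np.mean(np.log(1 + np.exp(-2 * margins)))
boosting = np.mean(np.exp(-margins))
print(zero_one, hinge, logistic, boosting)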
AdaBoost (Ct'd)
Overfitting?
- Assume that $\mathrm{VCdim}(H) = d$, and let
  $H_T = \left\{ \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t h_t + b \right) \right\}$.
- Then
  $\mathrm{VCdim}(H_T) \le 2 (d+1)(T+1) \log\left((T+1) e\right)$.
- Hence one needs early stopping.
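To see why early stopping is needed, one can tabulate how the bound on VCdim(H_T) grows with T. A sketch assuming a base class with d = 10 and reading the logarithm as natural (a different base only rescales the numbers):

import numpy as np

d = 10                                       # assumed VC dimension of the base class H
T = np.array([1, 10, 100, 1000])
bound = 2 * (d + 1) * (T + 1) * np.log((T + 1) * np.e)
print(bound)                                 # grows essentially linearly in T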
AdaBoost (Ct'd)
Rademacher analysis. Let $H = \{ h : X \to \mathbb{R} \}$ be any hypothesis set, and let
$\mathrm{conv}(H) = \left\{ \sum_{h} \mu_h h \ :\ \sum_{h} \mu_h \le 1,\ \mu_h \ge 0 \right\}$;
then for any sample $S$ one has
$\mathcal{R}_S(H) = \mathcal{R}_S(\mathrm{conv}(H))$.
AdaBoost (Ct'd)
Proof:
$\mathcal{R}_S(\mathrm{conv}(H)) = \mathbb{E}_\sigma\left[ \sup_{h_1,\dots,h_p,\ \mu_1,\dots,\mu_p} \frac{1}{m} \sum_{i=1}^{m} \sigma_i \sum_{k=1}^{p} \mu_k h_k(x_i) \right]$
$= \mathbb{E}_\sigma\left[ \sup_{h_1,\dots,h_p}\ \sup_{\mu_1,\dots,\mu_p} \sum_{k=1}^{p} \mu_k\, \frac{1}{m} \sum_{i=1}^{m} \sigma_i h_k(x_i) \right]$
$= \mathbb{E}_\sigma\left[ \sup_{h_1,\dots,h_p} \max_{k} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h_k(x_i) \right]$
$= \mathbb{E}_\sigma\left[ \sup_{h} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i) \right] = \mathcal{R}_S(H),$
where the third equality holds because a linear function of $\mu$ attains its supremum at a vertex of the simplex.
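A small Monte-Carlo illustration of the key step (a sketch with a made-up finite class of p classifiers evaluated on m points): for every draw of sigma, the objective is linear in mu and is therefore maximised at a vertex of the simplex, so no convex combination beats the best single h_k:

import numpy as np

rng = np.random.default_rng(0)
m, p = 20, 5
H = rng.choice([-1.0, 1.0], size=(p, m))      # H[k, i] = h_k(x_i) for a toy finite class

for _ in range(1000):
    sigma = rng.choice([-1.0, 1.0], size=m)
    corr = H @ sigma / m                      # (1/m) sum_i sigma_i h_k(x_i), one value per k
    mu = rng.dirichlet(np.ones(p), size=500)  # random convex weights
    assert (mu @ corr).max() <= corr.max() + 1e-12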
AdaBoost (Ct'd)
Margin-based analysis
- L1 margin:
  $\rho(x) = \frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\sum_{t=1}^{T} |\alpha_t|}$.
- On a sample:
  $\rho_S = \min_{i}\ \frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\sum_{t=1}^{T} |\alpha_t|}$.
- Maximising this margin? As an LP ... (see the sketch below).
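One way to write this margin maximisation as a linear program (a sketch; it assumes the normalisation $\sum_t \alpha_t = 1$ with $\alpha_t \ge 0$):

\begin{align*}
\max_{\rho,\,\alpha} \quad & \rho \\
\text{s.t.} \quad & y_i \sum_{t=1}^{T} \alpha_t h_t(x_i) \ \ge\ \rho, \qquad i = 1, \dots, m, \\
& \sum_{t=1}^{T} \alpha_t = 1, \qquad \alpha_t \ge 0, \quad t = 1, \dots, T.
\end{align*}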
AdaBoost (Ct'd)
Game-theoretic interpretation
- Loss (negative payoff) matrix $M$ and mixed strategies $p$ and $q$:
  $\min_{p} \max_{q}\ p^{T} M q = \max_{q} \min_{p}\ p^{T} M q$.
- Equivalently,
  $\min_{p} \max_{j}\ p^{T} M e_j = \max_{q} \min_{i}\ e_i^{T} M q$.
- Apply this to boosting with $M_{i,t} = y_i h_t(x_i)$; then
  $2\gamma^* = \min_{D} \max_{t=1,\dots,T} \sum_{i=1}^{m} D(i)\, y_i h_t(x_i) = \max_{\alpha} \min_{i=1,\dots,m}\ y_i \sum_{t=1}^{T} \frac{\alpha_t}{\|\alpha\|_1} h_t(x_i) = \rho^*$
  (solved numerically in the sketch below).
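Both sides of the game can be solved numerically as linear programs. A sketch, assuming SciPy's linprog as a generic LP solver and a made-up matrix M of values y_i h_t(x_i):

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, T = 8, 4
M = rng.choice([-1.0, 1.0], size=(m, T))        # M[i, t] = y_i h_t(x_i)

# Learner side: max over alpha in the simplex of min_i (M alpha)_i  (= rho*).
c = np.r_[np.zeros(T), -1.0]                    # variables x = (alpha, rho); maximise rho
A_ub = np.c_[-M, np.ones(m)]                    # rho - (M alpha)_i <= 0
A_eq = np.r_[np.ones(T), 0.0].reshape(1, -1)    # sum_t alpha_t = 1
res1 = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
               bounds=[(0, None)] * T + [(None, None)])

# Booster side: min over D in the simplex of max_t sum_i D(i) M[i, t]  (= 2 gamma*).
c = np.r_[np.zeros(m), 1.0]                     # variables x = (D, v); minimise v
A_ub = np.c_[M.T, -np.ones(T)]                  # (M^T D)_t - v <= 0
A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)    # sum_i D(i) = 1
res2 = linprog(c, A_ub=A_ub, b_ub=np.zeros(T), A_eq=A_eq, b_eq=[1.0],
               bounds=[(0, None)] * m + [(None, None)])

print(-res1.fun, res2.fun)                      # the two values coincide (von Neumann)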
Conclusions
Boosting:
- AdaBoost: weak → strong learner.
- Analysis.
- Noise: LogitBoost.
- Detecting outliers.