Statistical Learning Theory

Machine Learning and Applications
Christoph Lampert
Spring Semester 2014/2015
Lecture 4
Statistical Learning Theory
more: [Shalev-Shwartz, Ben-David: "Understanding Machine Learning", 2014]
Learning scenario:
• X : input set
• Y: output/label set, for now: Y = {−1, 1} or Y = {0, 1}
• p(x, y): data distribution (unknown to us)
• for now: deterministic labels, y = f (x) for unknown f : X → Y
• D = {(x^1, y^1), . . . , (x^m, y^m)} ∼ p(x, y) i.i.d.: training set
• H ⊆ {h : X → Y}: hypothesis set (the learner's choice)
Quantity of interest:
• Rp(h) = P_{(x,y)∼p(x,y)}{ h(x) ≠ y } = P_{x∼p(x)}{ h(x) ≠ f(x) }
What to learn?
• We know: there is (at least one) f : X → Y that has Rp (f ) = 0.
• Can we find such f from D? If yes, how large must m be?
Probably Approximately Correct (PAC) Learnability
[L. Valiant, 1984]
A hypothesis class H is called PAC learnable by an algorithm A, if
• for every ε > 0 (accuracy → "approximately correct")
• and every δ > 0 (confidence → "probably")
there exists an
• m0 = m0(ε, δ) ∈ N, (minimal sample size)
such that
• for every labeling function f ∈ H (i.e. Rp(f) = 0),
• and for every probability distribution p over X,
when we run the learning algorithm A on a training set, D, consisting of
m ≥ m0 examples sampled i.i.d. from p, the algorithm returns a
hypothesis h ∈ H that, with probability at least 1 − δ, fulfills Rp(h) ≤ ε.
∀m ≥ m0(ε, δ):   P_{D∼p^m}[ Rp(A[D]) > ε ] ≤ δ.
Example: PAC Learnability
Example
• X = { all bitstrings of length k },   Y = {0, 1}
• H = {h1, h2, h3, h4} with
  h1(x) = 0,   h2(x) = 1,   h3(x) = parity(x),   h4(x) = 1 − parity(x)
• labeling function: f = hj for unknown j
• Dm = {(x^1, f(x^1)), (x^2, f(x^2)), . . . , (x^m, f(x^m))}
• A(D): pick first hypothesis with 0 training error on D
Example: f = h3 (parity), p(x) = uniform, small ε (even ε = 0)
• D1 = {(0111, 1)}: A[D] = h2, Rp(A[D]) = 1/2 > ε
• D1 = {(0000, 0)}: A[D] = h1, Rp(A[D]) = 1/2 > ε
One sample is not enough.
Example: f = h3 (parity), p(x) = uniform, small ε (even ε = 0)
• D2 = {(0111, 1), (0000, 0)}: A[D] = h3, Rp(A[D]) = 0 ≤ ε
• D2 = {(1111, 0), (0111, 1)}: A[D] = h3, Rp(A[D]) = 0 ≤ ε
• D2 = {(0100, 1), (1000, 1)}: A[D] = h2, Rp(A[D]) = 1/2 > ε
m = 2 is sometimes enough, sometimes not. Can we quantify this?
• D2 uniquely identifies h3 (only h3 has 0 training error) if y^1 ≠ y^2,
  i.e. parity(x^1) ≠ parity(x^2)
• for p(x) = uniform, this happens with probability 1/2.
General m?
• A[Dm] = h3 iff y^i ≠ y^j for some i ≠ j   →   Rp(A[Dm]) = 0
• A[Dm] ≠ h3 iff y^1 = · · · = y^m           →   Rp(A[Dm]) = 1/2
• P[ Rp(A[Dm]) ≥ ε ] = (1/2)^m + (1/2)^m = 1/2^(m−1)   (all labels 0 / all labels 1)   ≤ δ
invert: for given δ, we need m ≥ 1 + log2(1/δ),
→ m0(ε, δ) = ⌈1 + log2(1/δ)⌉
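The failure probability above is easy to check empirically. Below is a minimal Python sketch (my own illustration, not from the lecture) that simulates the algorithm A on the parity example and compares the observed failure rate with the derived value 1/2^(m−1); the bitstring length k = 4 is an arbitrary choice.

import random

k = 4  # bitstring length (arbitrary choice for the simulation)

def parity(x):                    # x is a tuple of bits
    return sum(x) % 2

# H = {h1, h2, h3, h4} in the order in which A checks them
hypotheses = [lambda x: 0, lambda x: 1, parity, lambda x: 1 - parity(x)]

def A(D):
    """Pick the first hypothesis with zero training error on D."""
    return next(h for h in hypotheses if all(h(x) == y for x, y in D))

def failure_rate(m, trials=100_000):
    """Estimate P[ Rp(A[D_m]) >= eps ] for f = parity and uniform p(x)."""
    fails = 0
    for _ in range(trials):
        D = []
        for _ in range(m):
            x = tuple(random.randint(0, 1) for _ in range(k))
            D.append((x, parity(x)))
        # A fails (Rp = 1/2) exactly when it returns a constant hypothesis,
        # i.e. when it does not return the parity function
        fails += A(D) is not parity
    return fails / trials

for m in (1, 2, 3, 5):
    print(m, failure_rate(m), 1 / 2 ** (m - 1))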
Caveat: we only looked at f = h3 and the uniform data distribution,
but learnability must work for every hypothesis and every data
distribution!
Example 2: f = h3 (parity), p(x) = only x with parity 0 can occur
• Dm = {(0101, 0), . . . , (0000, 0)}
• A[Dm] = h1 ≠ h3 regardless of m
• Rp(A[Dm]) = 0 !!!
  because h1(x) = h3(x) for all x with p(x) > 0
Empirical Risk Minimization
Definition (Empirical Risk Minimization (ERM) Algorithm)
input hypothesis set H ⊆ {h : X → Y} (not necessarily finite)
input training set D = {(x^1, y^1), . . . , (x^m, y^m)}
output hypothesis h* ∈ H with R̂_D(h*) = min_{h∈H} R̂_D(h), for
    R̂_D(h) := (1/m) Σ_{i=1}^m ⟦h(x^i) ≠ y^i⟧
(arbitrary deterministic choice if the min is not unique)
ERM returns a classifier of minimal training error.
• We saw already: this might work (parity) or not (decision trees)
• Can we characterize when ERM works and when it fails?
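Spelled out as code, the ERM rule is just a minimum over training errors. The following is a minimal Python sketch (my own illustration, assuming a finite hypothesis set given as a list of callables); Python's min returns the first minimizer, which provides the arbitrary-but-deterministic tie-breaking mentioned above.

def empirical_risk(h, D):
    """Training error R_hat_D(h) = (1/m) * sum_i [h(x_i) != y_i]."""
    return sum(1 for x, y in D if h(x) != y) / len(D)

def erm(hypotheses, D):
    """Return a hypothesis with minimal training error on D.

    hypotheses: finite list of callables h: X -> Y
    D: list of (x, y) pairs
    Ties are broken by list order (deterministic)."""
    return min(hypotheses, key=lambda h: empirical_risk(h, D))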
Results in Statistical Learning Theory
Theorem (Learnability of finite hypothesis classes (realizable case))
Let H = {h1 , . . . , hK } be a finite hypothesis class and f ∈ H. Then H is
PAC-learnable by the empirical risk minimization algorithm with
m0(ε, δ) = (1/ε) ( log(|H|) + log(1/δ) ).
Examples: Finite hypothesis classes
Selecting between different models:
• Amazon sells pre-trained classifiers:
1. a decision tree,
2. LogReg or
3. a support vector machine?
Which one to buy?
Finite precision:
• For x ∈ R^d, the hypothesis set H = { f(x) = sign⟨w, x⟩ } is infinite.
• But: on a computer, w is restricted to 64-bit doubles: |Hc| = 2^(64 d).
m0(ε, δ) = (1/ε) ( log(|Hc|) + log(1/δ) ) ≈ (1/ε) ( 44 d + log(1/δ) )
Implementation:
• H = { all algorithms implementable in 10 KB C-code } is finite.
Logarithmic dependence on |H| makes even large (finite) hypothesis sets
(kind of) tractable.
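For concreteness, a small numeric sketch (my own, with an arbitrary dimension d = 100) that evaluates the bound m0(ε, δ) = (1/ε)(log|Hc| + log(1/δ)) for the finite-precision class with |Hc| = 2^(64 d):

import math

def m0_finite(log_H, eps, delta):
    """Sample-size bound for a finite hypothesis class in the realizable case."""
    return math.ceil((log_H + math.log(1 / delta)) / eps)

d = 100                        # feature dimension (arbitrary example value)
log_H = 64 * d * math.log(2)   # log |Hc| = log 2^(64 d)  ~  44.4 * d
print(m0_finite(log_H, eps=0.01, delta=0.01))   # roughly 4.4 * 10^5 examples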
Example: Learning Thresholding Functions
Infinite H can be PAC learnable:
[figure: sample points on the interval [0, 1], the target threshold a_f and a hypothesis threshold a]
• X = [0, 1], Y = {0, 1},
• H = {ha(x) = ⟦x ≥ a⟧, for 0 ≤ a ≤ 1},
• f(x) = h_{a_f}(x) for some 0 ≤ a_f ≤ 1.
• training set D = {(x 1 , y 1 ), . . . , (x m , y m )}
• ERM rule: h = argmin_{ha∈H} (1/m) Σ_{i=1}^m ⟦ha(x^i) ≠ y^i⟧,
  pick the smallest possible "+1" region (largest a) when the minimizer is not unique
  (to make the algorithm deterministic)
Claim: ERM learns f (in the PAC sense)
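A minimal sketch of this ERM rule in Python (my own illustration, not from the lecture): it evaluates the training error over candidate thresholds and uses the tie-breaking described above (largest consistent a, i.e. the smallest "+1" region).

import random

def erm_threshold(D):
    """Return a threshold a minimizing the training error of h_a(x) = [x >= a].

    D: list of (x, y) with x in [0, 1] and y in {0, 1}.
    Among all minimizers it returns the largest a (deterministic tie-breaking)."""
    candidates = [0.0] + sorted(x for x, _ in D) + [1.0]
    def err(a):
        return sum(1 for x, y in D if int(x >= a) != y)
    best = min(err(a) for a in candidates)
    return max(a for a in candidates if err(a) == best)

# usage sketch in the realizable case: labels generated by a_f = 0.3
a_f = 0.3
D = [(x, int(x >= a_f)) for x in (random.random() for _ in range(50))]
a_hat = erm_threshold(D)
print(a_hat, abs(a_hat - a_f))   # for uniform p(x), Rp(h_a_hat) = |a_hat - a_f|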
Example: Sine Waves
ERM cannot learn everything:
[figure: sin(λx) on [0, 1] for λ = 55, with the induced {0, 1} labels]
• X = [0, 1], Y = {0, 1}, H = {hλ(x) = ⟦sin(λx) > 0⟧ : λ ∈ R}
• f = hλ for some λ ∈ R (here: λ = 55)
• training set S = {(x^1, y^1), . . . , (x^m, y^m)}
• ERM rule: h = argmin_{hλ} (1/m) Σ_{i=1}^m ⟦hλ(x^i) ≠ y^i⟧,
  pick the smallest λ when the minimizer is not unique
Claim: ERM cannot learn f in the PAC sense
Agnostic PAC Learning
More realistic scenario: labeling isn’t a deterministic function
• X : input set
• Y: output/label set, for now: Y = {−1, 1} or Y = {0, 1}
• p(x, y): data distribution (unknown to us)
• labels no longer assumed deterministic: in general there is no f with y = f(x)
• D = {(x^1, y^1), . . . , (x^m, y^m)} ∼ p(x, y) i.i.d.: training set
• H ⊆ {h : X → Y}: hypothesis set
Quantity of interest:
• Rp(h) = P_{(x,y)∼p(x,y)}{ h(x) ≠ y }
What can we learn?
• there might not be an f ∈ H that has Rp(f) = 0.
• there might not be any f : X → Y that has Rp(f) = 0.
• but can we at least find the best h from the hypothesis set?
Definition (Agnostic PAC Learning)
A hypothesis class H is called agnostic PAC learnable by A, if
• for every ε > 0 (accuracy → "approximately correct")
• and every δ > 0 (confidence → "probably")
there exists an
• m0 = m0(ε, δ) ∈ N, (minimal sample size)
such that
• for every probability distribution p(x, y) over X × Y,
running the learning algorithm A on a training set, D, consisting of
m ≥ m0 examples sampled i.i.d. from p, yields a hypothesis h ∈ H that,
with probability at least 1 − δ, fulfills
Rp(h) ≤ min_{h̄∈H} Rp(h̄) + ε.
∀m ≥ m0(ε, δ):   P_{D∼p^m}[ Rp(A[D]) − min_{h̄∈H} Rp(h̄) > ε ] ≤ δ.
Agnostic PAC Learnability
Three main quantities:
• training error, R̂_D(h) = (1/m) Σ_{i=1}^m ⟦h(x^i) ≠ y^i⟧, for any h ∈ H,
• generalization error, Rp(h) = E_{(x,y)∼p} ⟦h(x) ≠ y⟧, for any h ∈ H,
• best achievable generalization error, min_{h̄∈H} Rp(h̄).
Definition (ε-representative sample)
A training set D is called ε-representative (for the current situation), if
∀h ∈ H:   |R̂_D(h) − Rp(h)| ≤ ε.
Lemma ("ERM works well for (ε/2)-representative training sets")
Let D be (ε/2)-representative. Then any h with R̂_D(h) = min_{h̄∈H} R̂_D(h̄)
(i.e. a possible output of ERM) fulfills
Rp(h) ≤ min_{h̄∈H} Rp(h̄) + ε.
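The proof is a short chain of inequalities; here is a sketch in LaTeX, following the standard argument (cf. the book referenced at the top), with h denoting the ERM output and h̄ an arbitrary competitor:

% D is (eps/2)-representative, h minimizes \hat R_D over H, \bar h \in H arbitrary:
\begin{align*}
R_p(h) &\le \hat R_D(h) + \tfrac{\epsilon}{2}        && \text{(representativeness, applied to } h)\\
       &\le \hat R_D(\bar h) + \tfrac{\epsilon}{2}   && \text{(} h \text{ minimizes } \hat R_D)\\
       &\le R_p(\bar h) + \tfrac{\epsilon}{2} + \tfrac{\epsilon}{2}
                                                     && \text{(representativeness, applied to } \bar h)\\
       &= R_p(\bar h) + \epsilon.
\end{align*}
% Taking the minimum over \bar h \in \mathcal{H} gives the claim.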
Uniform Convergence
Definition
A hypothesis class H is said to have the uniform convergence property,
if for every ε > 0, δ > 0, there exists an m^UC = m^UC(ε, δ) ∈ N, such
that for every probability distribution p over X the following holds:
• the probability that a training set D of size m ≥ m^UC,
  sampled i.i.d. from p, is ε-representative is at least 1 − δ.
Theorem
Let H have the uniform convergence property with sample complexity
m^UC(ε, δ). Then H is agnostically PAC-learnable with sample complexity
m(ε, δ) ≤ m^UC(ε/2, δ), and the ERM rule is a successful PAC learning
algorithm.
Uniform Convergence is sufficient for PAC Learnability
Theorem
Every finite hypothesis set, H, is agnostic PAC-learnable by ERM.
Theorem
H = {⟦sin(λx) > 0⟧ : λ ∈ R} is not (agnostic) PAC-learnable by ERM.
Is there a better algorithm besides ERM?
Maybe one that always works?
There exists no universal learning algorithm
No-Free-Lunch Theorem [Wolpert, Macready 1997]
Let
• X input set, Y = {0, 1} label set,
• A an arbitrary learning algorithm for binary classification,
• m (training size) any number smaller than |X |/2
then, there exists a data distribution p over X × Y such that
• there is a function f : X → Y with Rp(f) = 0, but
• P_{D∼p^m}[ Rp(A[D]) ≥ 1/8 ] ≥ 1/7.
Summary: For every learning algorithm, there exists a task on which it
fails, even though another algorithm would have succeeded on that task.
Proof: "Understanding Machine Learning", Section 5.1, page 57ff
We construct a distribution, defined on 2m points, such that
• for every training set of m points, there’s a perfect classifier,
• the labels of the m observed points contain no information about
the labels on the remaining points
Corollary
Let X be a set of infinite size and let H = Y X = {h : X → {0, 1}} be
the set of all functions from X to {0, 1}. Then H is not PAC-learnable
by any algorithm.
So learning is hopeless? Only if we have absolutely no prior knowledge.
If we know, e.g., that the target function is linear, or smooth, we can
work with a smaller H.
Bias–Complexity Tradeoff
Let c = argmin_{h∈H} R̂_D(h) be an ERM classifier. Then we can
decompose its error:
Rp(c) = E_Bayes + E_app + E_est
• E_Bayes = min_{h∈Y^X} Rp(h)                     (Bayes error)
• E_app = min_{h∈H} Rp(h) − min_{h∈Y^X} Rp(h)     (approximation error)
• E_est = Rp(c) − min_{h∈H} Rp(h)                 (estimation error)
There's nothing we can do about E_Bayes, but:
• E_app gets smaller the "richer" H is,
• E_est usually gets bigger for "richer" H.
Tradeoff of expressive power and complexity:
Pick H rich enough to make E_app small, but not so rich that E_est is large.
Infinite H can be PAC learnable
How to measure richness of H?
Intuition: number of hypotheses? number of free parameters?
• This 1-parameter hypothesis class is learnable:
H = {h(x) = sign f (x) for f (x) = x − θ with θ ∈ R}.
• This 1-parameter hypothesis class is not learnable:
H = {h(x) = sign f (x) for f (x) = sin(θx) with θ ∈ R}.
Definition
Let H ⊂ {h : X → {0, 1}}, let C = {c1 , . . . , cm } ⊂ X .
The restriction of H to C is
HC = { (h(c1), . . . , h(cm)) : h ∈ H } ⊆ {0, 1}^|C|,
i.e. the set of all label combinations that classifiers from H achieve on C.
Definition (Shattering)
A hypothesis class H shatters a finite set C ⊆ X, if the restriction of H
to C is the set of all possible labelings of C with {0, 1}, i.e. |HC| = 2^|C|.
Definition (VC Dimension)
The VC dimension of a hypothesis class H, denoted VCdim(H), is the
maximal size of a set C ⊂ X that can be shattered by H. If H can
shatter sets of arbitrarily large size we say that VCdim(H) = ∞.
Examples
1) Finite H = {h1, . . . , hK}:
HC = { (h(c1), . . . , h(cm)) : h ∈ H } ⊆ {0, 1}^|C|
can have at most |H| elements, so no set of size |C| > log2 |H| can be
shattered. So VCdim(H) ≤ ⌊log2 |H|⌋.
2) Threshold functions, H = {hθ(x) = ⟦x > θ⟧, for θ ∈ R}.
• C = {c1}:
  ◦ for θ < c1: hθ(c1) = 1,
  ◦ for θ ≥ c1: hθ(c1) = 0.
  HC = {(0), (1)}, |HC| = 2^1, so H shatters a set C with 1 element.
• C = {c1, c2}, w.l.o.g. c1 < c2:
  ◦ for θ < c1:        (hθ(c1), hθ(c2)) = (1, 1)
  ◦ for c1 ≤ θ < c2:   (hθ(c1), hθ(c2)) = (0, 1)
  ◦ for θ ≥ c2:        (hθ(c1), hθ(c2)) = (0, 0)
  ◦ there is no hθ ∈ H with (hθ(c1), hθ(c2)) = (1, 0).
  HC = {(0, 0), (0, 1), (1, 1)}, |HC| ≤ 3, no matter what c1, c2 are.
H shatters a set of size 1, but no set of size 2 ⇒ VCdim(H) = 1
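The two statements about threshold functions can be verified mechanically. Below is a small Python sketch (my own illustration): it enumerates the restriction H_C by trying one threshold per gap between the points of C and checks whether all 2^|C| labelings occur.

def restriction(hypotheses, C):
    """H_C: the set of label tuples that hypotheses from H realize on C."""
    return {tuple(h(c) for c in C) for h in hypotheses}

def shatters(hypotheses, C):
    return len(restriction(hypotheses, C)) == 2 ** len(C)

def threshold_hypotheses(C):
    """On a finite set C, one threshold per gap (plus the two extremes) suffices."""
    xs = sorted(C)
    thresholds = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    return [lambda x, t=t: int(x > t) for t in thresholds]

C1, C2 = [0.4], [0.2, 0.7]
print(shatters(threshold_hypotheses(C1), C1))   # True:  a set of size 1 is shattered
print(shatters(threshold_hypotheses(C2), C2))   # False: the labeling (1, 0) never occurs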
Examples
3) Axis-parallel rectangles
• the four points in the arrangement below can be labeled in 2^4 ways.
• for every set of five points, the labeling shown below does not work
  (this statement needs a proof, of course) → VCdim(H) = 4
[figure: four shatterable points; five points with labels 1, 1, 0, 1, 1 that no axis-parallel rectangle realizes]
4) Linear classifiers
• H = {h(x) = ⟦⟨w, x⟩ + b ≥ 0⟧ : w ∈ R^d, b ∈ R}.   VCdim(H) = d + 1.
5) All (measurable) functions
• H = {h : X → {0, 1}}.   VCdim(H) = |X| (usually ∞).
6) Sine waves
• H = {⟦sin(λx) > 0⟧ : λ ∈ R}.   VCdim(H) = ∞
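For example 3), the "needs a proof" part can at least be checked by brute force for a concrete configuration. A small Python sketch (my own, using one specific five-point set): a labeling is realizable by an axis-parallel rectangle (inside = 1) exactly if the bounding box of the positive points contains no negative point.

from itertools import product

def realizable(points, labels):
    """True if some axis-parallel rectangle labels exactly the positive points with 1."""
    pos = [p for p, y in zip(points, labels) if y == 1]
    neg = [p for p, y in zip(points, labels) if y == 0]
    if not pos:
        return True                      # an empty rectangle gives the all-zero labeling
    xmin, xmax = min(p[0] for p in pos), max(p[0] for p in pos)
    ymin, ymax = min(p[1] for p in pos), max(p[1] for p in pos)
    return not any(xmin <= x <= xmax and ymin <= y <= ymax for x, y in neg)

five_points = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (0.0, 2.0), (2.0, 2.0)]
missing = [y for y in product([0, 1], repeat=5) if not realizable(five_points, y)]
print(len(missing) > 0)                  # True: these five points are not shattered
print((1, 1, 0, 1, 1) in missing)        # the labeling from the figure (middle point = 0)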
VC Theory (Vapnik-Chervonenkis, 1971)
Theorem (Fundamental Theorem of Statistical Learning)
Let H ⊂ { X → {0, 1} } be a hypothesis set. Then the following statements are equivalent:
• H has the uniform convergence property.
• Any ERM rule is successful in agnostic PAC learning for H.
• H is agnostic PAC learnable.
• H is PAC learnable.
• Any ERM rule is successful in PAC learning for H.
• H has finite VC-dimension.
VC Theory (Vapnik-Chervonenkis, 1971)
Theorem (Fundamental Theorem – quantitative version)
Let H ⊂ { X → {0, 1} } be a hypothesis set with VCdim(H) = d < ∞.
• H is PAC learnable with sample complexity
  C1 · (d + log(1/δ)) / ε   ≤   m0(ε, δ)   ≤   C2 · (d log(1/ε) + log(1/δ)) / ε
• H is agnostic PAC learnable with sample complexity
  C1' · (d + log(1/δ)) / ε²   ≤   m0(ε, δ)   ≤   C2' · (d + log(1/δ)) / ε²
• for any m ∈ N it holds with probability 1 − δ over D ∼ p^m that
  Rp(h) ≤ R̂_D(h) + √( 8 ( d log(em/d) + log(8/δ) ) / (2m) )
  uniformly for all h ∈ H.
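A quick numeric look at the uniform bound in the last bullet, as reconstructed above (my own sketch; it assumes natural logarithms and the constants shown on the slide):

import math

def vc_gap(d, m, delta):
    """sqrt( 8 * ( d*log(e*m/d) + log(8/delta) ) / (2*m) )"""
    return math.sqrt(8 * (d * math.log(math.e * m / d) + math.log(8 / delta)) / (2 * m))

for m in (1_000, 10_000, 100_000):
    print(m, round(vc_gap(d=10, m=m, delta=0.05), 3))
# the gap between Rp(h) and the training error shrinks roughly like sqrt(d * log(m) / m)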
Structural Risk Minimization
Let VCdim(H) = d < ∞. Then, with probability 1 − δ over D ∼ p^m,
Rp(h) ≤ R̂_D(h) + √( 8 ( d log(em/d) + log(8/δ) ) / (2m) )          (∗)
uniformly for all h ∈ H.
Formal procedure:
• fix H with VCdim(H) < ∞ (otherwise hopeless)
• obtain a sufficiently large training set (|D| ≥ m(ε, δ))
• enjoy the guarantee (∗)
Real world:
• D (and thereby m) is fixed, but H isn't.
Task:
• How to choose H?
Structural Risk Minimization
Let H1 ⊂ H2 ⊂ . . . ⊂ HK be a nested sequence of hypothesis spaces.
Examples: Hk are
• all decision trees of depth at most k,
• linear functions with weight vector w such that ‖w‖^2 ≤ k, . . .
Not examples (because not nested):
• linear classifiers in R^1, R^2, . . . ,
• 1-NN, 2-NN, 3-NN, . . .
Let dk = VCdim(Hk). Then for each Hk we have, with probability 1 − δ:
Rp(h) ≤ R̂_D(h) + √( 8 ( dk log(em/dk) + log(8/δ) ) / (2m) )   =: rhs(h)
Structural Risk Minimization (SRM):
hSRM = hk   for   k = argmin_{k=1,2,...} { rhs(hk) },   with hk = ERM(Hk, S)
Structural Risk Minimization
Structural Risk Minimization (SRM):
hSRM = hk   for   k = argmin_{k=1,2,...} { rhs(hk) },   with hk = ERM(Hk, S)
• finding ERM(Hk, S) is itself a minimization
• the guarantee holds for any h, not just the ERM solution
More convenient: merge both steps into one:
hSRM = argmin_{h ∈ ∪_{k=1,...,K} Hk}   R̂_S(h)  +  √( 8 ( d(h) log(em/d(h)) + log(8K/δ) ) / (2m) )
                                      (loss term)         (complexity term)
with d(h) = VCdim(Hk) for the smallest k with h ∈ Hk.
• loss term (= training error): encourages a good fit to the data
• complexity term: penalizes (too) complex hypotheses
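A minimal sketch of the SRM selection step in Python (my own illustration with made-up numbers, not from the lecture): given, for each nested class Hk, its VC dimension dk and the training error of ERM(Hk, S), pick the k that minimizes the right-hand side rhs(hk).

import math

def rhs(train_err, d, m, delta):
    """Training error plus the complexity term of the bound (*) for VC dimension d."""
    return train_err + math.sqrt(8 * (d * math.log(math.e * m / d) + math.log(8 / delta)) / (2 * m))

m, delta = 5_000, 0.05
# (d_k, training error of ERM(H_k, S)) for k = 1, 2, ...  -- illustrative values only
classes = [(2, 0.30), (5, 0.18), (12, 0.10), (40, 0.07), (150, 0.05)]

scores = [rhs(err, d, m, delta) for d, err in classes]
k_srm = min(range(len(classes)), key=scores.__getitem__)
print("selected k:", k_srm + 1, "score:", round(scores[k_srm], 3))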
Example: Structural Risk Minimization
Let Hk be the set of k-sparse linear classifiers in R^n,
Hk = { h(x) = sign f(x) for f(x) = ⟨w, x⟩ },
where w ∈ R^n has at most k non-zero entries (with k < n/2),
• H1 ⊂ H2 ⊂ · · · ⊂ H⌊n/2−1⌋
• VCdim(Hk) < 2k log(n)
[Tyler Neylon: "Sparse solutions for linear prediction problems", 2006]
Simplify the bound (with ‖w‖_L0 = number of nonzero entries of w):
Rp(h) ≤ R̂_D(h) + C ‖w‖_L0 + C'
Simplified hSRM:
hSRM = argmin_h   R̂_D(h)  +  C ‖w‖_L0
                 (loss term)  (regularizer)
Guarantees for Maximum-Margin Classifiers
Remember maximum-margin classifiers (SVMs):
min_w   (1/(2C)) ‖w‖^2  +  (1/m) Σ_{i=1}^m ξ^i      for ξ^i = max{0, 1 − y^i ⟨w, x^i⟩}
        (regularizer)      ((hinge) loss)
Slightly different derivation and loss; simplest for the squared hinge loss:
Theorem (Radius–Margin Bound)                 [Shawe-Taylor, Cristianini, 2001]
Let S = {(xi, yi)}_{i=1,...,m} ⊂ R^n × {±1} with ‖xi‖ ≤ R. For f(x) = ⟨w, x⟩,
set ξi = max{0, 1 − yi ⟨w, xi⟩}. Then, for h(x) = sign f(x), with probability ≥ 1 − δ,
Rp(h) ≤ (c/m) ( ( R^2 ‖w‖^2 + Σ_{i=1}^m ξi^2 ) log^2 m + log(1/δ) )
Note: the bound is independent of the feature dimension n.
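The quantities entering the bound are easy to compute for a given classifier; the constant c is unspecified, so the sketch below (my own, with a toy dataset) only evaluates the dimension-independent complexity term R^2 ‖w‖^2 + Σ ξi^2.

import math

def complexity_term(w, data):
    """R^2 * ||w||^2 + sum_i xi_i^2 for data = [(x, y), ...] with y in {-1, +1}."""
    R = max(math.sqrt(sum(v * v for v in x)) for x, _ in data)
    norm_w_sq = sum(v * v for v in w)
    slacks = [max(0.0, 1.0 - y * sum(wi * xi for wi, xi in zip(w, x))) for x, y in data]
    return R * R * norm_w_sq + sum(s * s for s in slacks)

# toy usage: w separates the two points with margin 1, so all slacks are zero
data = [([1.0, 0.0], +1), ([-1.0, 0.0], -1)]
print(complexity_term([1.0, 0.0], data))   # 1.0 = R^2 * ||w||^2, no slack contribution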