Machine Learning and Applications
Christoph Lampert, Spring Semester 2014/2015
Lecture 4: Statistical Learning Theory
more: [Shalev-Shwartz, Ben-David: "Understanding Machine Learning", 2014]

Learning scenario:
• X : input set
• Y : output/label set, for now: Y = {−1, 1} or Y = {0, 1}
• p(x, y): data distribution (unknown to us)
• for now: deterministic labels, y = f(x) for unknown f : X → Y
• D = {(x¹, y¹), . . . , (xᵐ, yᵐ)} ∼ p(x, y) i.i.d.: training set
• H ⊆ {h : X → Y}: hypothesis set (the learner's choice)

Quantity of interest:
• R_p(h) = P_{(x,y)∼p(x,y)}{ h(x) ≠ y } = P_{x∼p(x)}{ h(x) ≠ f(x) }

What to learn?
• We know: there is (at least one) f : X → Y that has R_p(f) = 0.
• Can we find such an f from D? If yes, how large must m be?

Probably Approximately Correct (PAC) Learnability [L. Valiant, 1984]
A hypothesis class H is called PAC learnable by an algorithm A, if
• for every ε > 0 (accuracy → "approximately correct")
• and every δ > 0 (confidence → "probably")
there exists an
• m0 = m0(ε, δ) ∈ ℕ (minimal sample size)
such that
• for every labeling function f ∈ H (i.e. R_p(f) = 0),
• and for every probability distribution p over X,
when we run the learning algorithm A on a training set D consisting of m ≥ m0 examples sampled i.i.d. from p, the algorithm returns a hypothesis h ∈ H that, with probability at least 1 − δ, fulfills R_p(h) ≤ ε. In short:

    ∀m ≥ m0(ε, δ):  P_{D∼pᵐ}[ R_p(A[D]) > ε ] ≤ δ.

Example: PAC Learnability
• X = { all bitstrings of length k },  Y = {0, 1}
• H = {h1, h2, h3, h4} with
  - h1(x) = 0,  h2(x) = 1,  h3(x) = parity(x),  h4(x) = 1 − parity(x)
• labeling function: f = hj for unknown j
• Dm = {(x¹, f(x¹)), (x², f(x²)), . . . , (xᵐ, f(xᵐ))}
• A(D): pick the first hypothesis with 0 training error on D

Example: f = h3 (parity), p(x) = uniform, ε small (even ε = 0)
• D1 = {(0111, 1)}:  A[D] = h2,  R_p(A[D]) = 1/2 > ε
• D1 = {(0000, 0)}:  A[D] = h1,  R_p(A[D]) = 1/2 > ε
One sample is not enough.

• D2 = {(0111, 1), (0000, 0)}:  A[D] = h3,  R_p(A[D]) = 0 ≤ ε
• D2 = {(1111, 0), (0111, 1)}:  A[D] = h3,  R_p(A[D]) = 0 ≤ ε
• D2 = {(0100, 1), (1000, 1)}:  A[D] = h2,  R_p(A[D]) = 1/2 > ε
m = 2 is sometimes enough, sometimes not. Can we quantify this?
• D2 uniquely identifies h3 (only h3 has 0 error) if y¹ ≠ y², i.e. parity(x¹) ≠ parity(x²),
• for p(x) = uniform, this happens with probability 1/2 (see the simulation sketch below).
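This probability can be checked numerically. Below is a minimal Python sketch (an illustration, not part of the original slides): it implements the rule A, "return the first hypothesis with zero training error", for the four-hypothesis parity class and estimates P[A[Dm] = h3] by Monte Carlo. The function names (parity, erm_first_consistent, estimate_success_probability) are made up for this example.

```python
import random

def parity(x):
    """Parity of a bitstring given as a tuple of 0/1 entries."""
    return sum(x) % 2

# the four hypotheses h1, ..., h4 from the example
H = [
    lambda x: 0,                # h1
    lambda x: 1,                # h2
    lambda x: parity(x),        # h3
    lambda x: 1 - parity(x),    # h4
]

def erm_first_consistent(D):
    """A[D]: return the first hypothesis with zero training error on D."""
    for h in H:
        if all(h(x) == y for x, y in D):
            return h
    raise ValueError("no consistent hypothesis (cannot happen when f is in H)")

def estimate_success_probability(m, k=4, trials=100_000, seed=0):
    """Estimate P[A[D_m] = h3] when f = parity and p(x) is uniform."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        D = []
        for _ in range(m):
            x = tuple(rng.randint(0, 1) for _ in range(k))
            D.append((x, parity(x)))
        # A[D] equals h3 exactly when both label values appear in D
        if erm_first_consistent(D) is H[2]:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    print(estimate_success_probability(m=2))   # close to 1/2
    print(estimate_success_probability(m=5))   # close to 1 - 2**(-4) = 0.9375
```

The m = 5 call already previews the general formula derived next: the learner fails only if all observed labels agree, which happens with probability 2 · (1/2)^m.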
General m?
• A[Dm] = h3 iff yⁱ ≠ yʲ for some i ≠ j   →  R_p(A[Dm]) = 0
• A[Dm] ≠ h3 iff y¹ = · · · = yᵐ           →  R_p(A[Dm]) = 1/2
• P[ R_p(A[Dm]) ≥ ε ] = (1/2)ᵐ  [all labels 0]  +  (1/2)ᵐ  [all labels 1]  =  1/2^(m−1)  ≤ δ
• invert: for a given δ, we need m ≥ 1 + log₂(1/δ),  →  m0(ε, δ) = ⌈1 + log₂(1/δ)⌉

Caveat: we only looked at f = h3 and the uniform data distribution, but learnability must work for every labeling function and every data distribution!

Example 2: f = h3 (parity), p(x): only x with parity 0 can occur
• Dm = {(0101, 0), . . . , (0000, 0)}
• A[Dm] = h1 ≠ h3 regardless of m
• but R_p(A[Dm]) = 0 (!), because h1(x) = h3(x) for all x with p(x) > 0

Empirical Risk Minimization
Definition (Empirical Risk Minimization (ERM) Algorithm)
input: hypothesis set H ⊆ {h : X → Y} (not necessarily finite)
input: training set D = {(x¹, y¹), . . . , (xᵐ, yᵐ)}
output: hypothesis h ∈ H with R̂_D(h) = min_{h′∈H} R̂_D(h′), for

    R̂_D(h) := (1/m) Σ_{i=1}^m ⟦h(xⁱ) ≠ yⁱ⟧

(arbitrary deterministic choice if the minimizer is not unique)

ERM returns a classifier of minimal training error.
• We saw already: this might work (parity) or not (decision trees).
• Can we characterize when ERM works and when it fails?

Results in Statistical Learning Theory
Theorem (Learnability of finite hypothesis classes (realizable case))
Let H = {h1, . . . , hK} be a finite hypothesis class and f ∈ H. Then H is PAC-learnable by the empirical risk minimization algorithm with

    m0(ε, δ) = (1/ε) ( log|H| + log(1/δ) ).

Examples: Finite hypothesis classes
Selecting between different models:
• Amazon sells pre-trained classifiers: 1. a decision tree, 2. LogReg, or 3. a support vector machine? Which one to buy?
Finite precision:
• For x ∈ ℝᵈ, the hypothesis set H = {f(x) = sign⟨w, x⟩} is infinite.
• But: on a computer, w is restricted to 64-bit doubles: |H_c| = 2^(64d), and
  m0(ε, δ) = (1/ε)( log|H_c| + log(1/δ) ) ≈ (1/ε)( 44d + log(1/δ) ).
Implementation:
• H = { all algorithms implementable in 10 KB of C code } is finite.
The logarithmic dependence on |H| makes even large (finite) hypothesis sets (kind of) tractable; see the small calculator sketched below.
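To get a feeling for the numbers, here is a small illustrative calculator (not from the slides) for the sample-size bound of the finite-class theorem, applied to the finite-precision linear classifiers above. Natural logarithms are assumed, which is what turns 64 d log 2 into roughly 44 d; the dimension d = 100 is an arbitrary choice for this example.

```python
import math

def m0_finite_realizable(log_size_H, eps, delta):
    """Sample size guaranteeing R_p(h) <= eps with probability >= 1 - delta
    for ERM over a finite hypothesis class in the realizable case:
    m0 = (log|H| + log(1/delta)) / eps, rounded up."""
    return math.ceil((log_size_H + math.log(1.0 / delta)) / eps)

if __name__ == "__main__":
    d = 100                              # feature dimension, chosen for illustration
    log_size_H = 64 * d * math.log(2)    # |H| = 2^(64 d)  ->  log|H| ~ 44.4 d
    print(m0_finite_realizable(log_size_H, eps=0.01, delta=0.01))  # roughly 444,000
```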
Example: Learning Thresholding Functions
Infinite H can be PAC learnable:
[figure: labeled points on the interval [0, 1]; the threshold a_f of the target h_{a_f} separates the 0-labeled from the 1-labeled points]
• X = [0, 1],  Y = {0, 1}
• H = { h_a(x) = ⟦x ≥ a⟧, for 0 ≤ a ≤ 1 }
• f(x) = h_{a_f}(x) for some 0 ≤ a_f ≤ 1
• training set D = {(x¹, y¹), . . . , (xᵐ, yᵐ)}
• ERM rule: h = argmin_{h_a ∈ H} (1/m) Σ_{i=1}^m ⟦h_a(xⁱ) ≠ yⁱ⟧;
  pick the smallest possible "+1" region (largest a) when not unique (to make the algorithm deterministic)
Claim: ERM learns f (in the PAC sense). A sketch of this ERM rule follows below.
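A minimal Python sketch of this ERM rule (illustrative, not from the slides): among all thresholds consistent with the training data it returns the largest one, i.e. the smallest possible "+1" region. The helper names erm_threshold and predict are invented for this example.

```python
def erm_threshold(D):
    """ERM for H = {h_a(x) = [x >= a]} on X = [0, 1], realizable case.

    Returns the largest threshold a that has zero training error on
    D = [(x_1, y_1), ..., (x_m, y_m)]: in the realizable case every
    positive example satisfies x >= a_f and every negative one x < a_f,
    so a = min of the positive examples is consistent and makes the
    '+1' region as small as possible (the deterministic tie-break)."""
    positives = [x for x, y in D if y == 1]
    if positives:
        return min(positives)
    # no positive example observed: take the largest allowed threshold,
    # shrinking the '+1' region to the single point x = 1
    return 1.0

def predict(a, x):
    return 1 if x >= a else 0

# toy usage with labels generated by some h_{a_f} with 0.47 < a_f <= 0.63
D = [(0.12, 0), (0.47, 0), (0.63, 1), (0.91, 1)]
a_hat = erm_threshold(D)                    # -> 0.63, the smallest positive point
print([predict(a_hat, x) for x, _ in D])    # zero training error: [0, 0, 1, 1]
```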
Example: Sine Waves
ERM cannot learn everything:
[figure: sin(λx) on [0, 1] for λ = 55, thresholded at 0]
• X = [0, 1],  Y = {0, 1},  H = { h_λ(x) = ⟦sin(λx) > 0⟧ : λ ∈ ℝ }
• f = h_λ for some λ ∈ ℝ (here: λ = 55)
• training set S = {(x¹, y¹), . . . , (xᵐ, yᵐ)}
• ERM rule: h = argmin_{h_λ} (1/m) Σ_{i=1}^m ⟦h_λ(xⁱ) ≠ yⁱ⟧;  pick the smallest λ when not unique
Claim: ERM cannot learn f in the PAC sense.

Agnostic PAC Learning
More realistic scenario: labeling isn't a deterministic function
• X : input set
• Y : output/label set, for now: Y = {−1, 1} or Y = {0, 1}
• p(x, y): data distribution (unknown to us)
• no longer assume deterministic labels y = f(x)
• D = {(x¹, y¹), . . . , (xᵐ, yᵐ)} ∼ p(x, y) i.i.d.: training set
• H ⊆ {h : X → Y}: hypothesis set
Quantity of interest:
• R_p(h) = P_{(x,y)∼p(x,y)}{ h(x) ≠ y }
What can we learn?
• there might not be an f ∈ H that has R_p(f) = 0,
• there might not be any f : X → Y that has R_p(f) = 0,
• but can we at least find the best h from the hypothesis set?

Definition (Agnostic PAC Learning)
A hypothesis class H is called agnostic PAC learnable by A, if
• for every ε > 0 (accuracy → "approximately correct")
• and every δ > 0 (confidence → "probably")
there exists an
• m0 = m0(ε, δ) ∈ ℕ (minimal sample size)
such that
• for every probability distribution p(x, y) over X × Y,
running the learning algorithm A on a training set D consisting of m ≥ m0 examples sampled i.i.d. from p yields a hypothesis h ∈ H that, with probability at least 1 − δ, fulfills R_p(h) ≤ min_{h̄∈H} R_p(h̄) + ε. In short:

    ∀m ≥ m0(ε, δ):  P_{D∼pᵐ}[ R_p(A[D]) − min_{h̄∈H} R_p(h̄) > ε ] ≤ δ.

Agnostic PAC Learnability
Three main quantities:
• training error, R̂_S(h) = (1/m) Σ_{i=1}^m ⟦h(xⁱ) ≠ yⁱ⟧ for any h ∈ H,
• generalization error, R_p(h) = E_{(x,y)∼p} ⟦h(x) ≠ y⟧ for any h ∈ H,
• best achievable generalization error, min_{h̄∈H} R_p(h̄).

Definition (ε-representative sample)
A training set D is called ε-representative (for the current situation), if

    ∀h ∈ H:  |R̂_D(h) − R_p(h)| ≤ ε.

Lemma ("ERM works well for (ε/2)-representative training sets")
Let D be (ε/2)-representative. Then any h with R̂_D(h) = min_{h̄∈H} R̂_D(h̄) (i.e. a possible output of ERM) fulfills

    R_p(h) ≤ min_{h̄∈H} R_p(h̄) + ε.

(Reason: for a minimizer h̄* of R_p we have R_p(h) ≤ R̂_D(h) + ε/2 ≤ R̂_D(h̄*) + ε/2 ≤ R_p(h̄*) + ε.)

Uniform Convergence
Definition
A hypothesis class H is said to have the uniform convergence property, if for every ε > 0, δ > 0, there exists an m^UC = m^UC(ε, δ) ∈ ℕ, such that for every probability distribution p over X the following holds:
• the probability that a training set D of size m ≥ m^UC, sampled i.i.d. from p, is ε-representative is at least 1 − δ.

Theorem
Let H have the uniform convergence property with sample complexity m^UC(ε, δ). Then H is agnostically PAC-learnable with sample complexity m(ε, δ) ≤ m^UC(ε/2, δ), and the ERM rule is a successful agnostic PAC learning algorithm.

Uniform Convergence is sufficient for PAC Learnability
Theorem: Every finite hypothesis set H is agnostic PAC-learnable by ERM.
Theorem: H = { ⟦sin(λx) > 0⟧ : λ ∈ ℝ } is not (agnostic) PAC-learnable by ERM.
Is there a better algorithm besides ERM? Maybe one that always works?

There exists no universal learning algorithm
No-Free-Lunch Theorem [Wolpert, Macready 1997]
Let
• X be an input set and Y = {0, 1} the label set,
• A an arbitrary learning algorithm for binary classification,
• m (the training set size) any number smaller than |X|/2.
Then there exists a data distribution p over X × Y such that
• there is a function f : X → {0, 1} with R_p(f) = 0, but
• P_{D∼pᵐ}[ R_p(A[D]) ≥ 1/8 ] ≥ 1/7.
Summary: For every learning algorithm, there exists a task on which it fails, even though another algorithm would have succeeded on that task.
Proof: "Understanding Machine Learning", Section 5.1, page 57ff.
We construct a distribution, defined on 2m points, such that
• for every training set of m points, there's a perfect classifier,
• the labels of the m observed points contain no information about the labels of the remaining points.

Corollary
Let X be a set of infinite size and let H = Y^X = {h : X → {0, 1}} be the set of all functions from X to {0, 1}. Then H is not PAC-learnable by any algorithm.

So learning is hopeless? Only if we have absolutely no prior knowledge. If we know, e.g., that the target function is linear, or smooth, we can work with a smaller H.

Bias–Complexity Tradeoff
Let c = argmin_{h∈H} R̂_D(h) be an ERM classifier. Then we can decompose its error:

    R_p(c) = E_Bayes + E_app + E_est

• E_Bayes = min_{h∈Y^X} R_p(h)                      (Bayes error)
• E_app   = min_{h∈H} R_p(h) − min_{h∈Y^X} R_p(h)   (approximation error)
• E_est   = R_p(c) − min_{h∈H} R_p(h)               (estimation error)
There's nothing we can do about E_Bayes, but:
• E_app gets smaller the "richer" H is,
• E_est usually gets bigger for "richer" H.
Tradeoff of expressive power and complexity:
Pick H rich enough to make E_app small, but not so rich that E_est is large.

How to measure the richness of H?
Intuition: number of hypotheses? number of free parameters?
• This 1-parameter hypothesis class is learnable:
  H = { h(x) = sign f(x) for f(x) = x − θ, with θ ∈ ℝ }.
• This 1-parameter hypothesis class is not learnable:
  H = { h(x) = sign f(x) for f(x) = sin(θx), with θ ∈ ℝ }.

Definition
Let H ⊂ {h : X → {0, 1}} and let C = {c1, . . . , cm} ⊂ X. The restriction of H to C is

    H_C = { (h(c1), . . . , h(cm)) : h ∈ H } ⊂ {0, 1}^|C|,

i.e. the set of all label combinations that classifiers from H achieve on C.

Definition (Shattering)
A hypothesis class H shatters a finite set C ⊆ X, if the restriction of H to C is the set of all possible labelings of C by {0, 1}, i.e. |H_C| = 2^|C|.

Definition (VC Dimension)
The VC dimension of a hypothesis class H, denoted VCdim(H), is the maximal size of a set C ⊂ X that can be shattered by H. If H can shatter sets of arbitrarily large size, we say that VCdim(H) = ∞.
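These definitions can be checked mechanically on small, finite examples. The following brute-force sketch (illustrative, not from the slides) represents hypotheses as Python callables, computes the restriction H_C, tests shattering, and searches a finite pool of candidate points for the largest shattered subset. Because both the candidate points and the sampled hypotheses are finite, the result is only a lower bound on VCdim(H).

```python
from itertools import combinations

def restriction(H, C):
    """H_C: the set of label patterns that hypotheses in H realize on C."""
    return {tuple(h(c) for c in C) for h in H}

def shatters(H, C):
    """True if H realizes all 2^|C| labelings of C."""
    return len(restriction(H, C)) == 2 ** len(C)

def vc_dim_lower_bound(H, points, max_size=4):
    """Size of the largest subset of `points` that is shattered by H.
    Only a lower bound on VCdim(H): larger shattered sets might exist
    outside the finite candidate pool."""
    best = 0
    for size in range(1, max_size + 1):
        if any(shatters(H, C) for C in combinations(points, size)):
            best = size
    return best

# threshold functions [x > theta], sampled on a grid of theta values
thetas = [i / 10 for i in range(-1, 12)]
H_thresh = [lambda x, t=t: int(x > t) for t in thetas]
print(vc_dim_lower_bound(H_thresh, points=[0.2, 0.4, 0.6, 0.8]))  # -> 1
```

The printed value 1 matches the hand calculation for threshold functions in the next example: singletons are shattered, but no pair of points can receive the labeling (1, 0).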
Examples
1) Finite H = {h1, . . . , hK}:
   H_C = { (h(c1), . . . , h(cm)) : h ∈ H } ⊂ {0, 1}^|C| can have at most |H| elements, so no set of size |C| > log₂|H| can be shattered. So VCdim(H) ≤ ⌊log₂|H|⌋.

2) Threshold functions, H = { h_θ(x) = ⟦x > θ⟧, for θ ∈ ℝ }.
• C = {c1}
  - for θ < c1: h_θ(c1) = 1;  for θ ≥ c1: h_θ(c1) = 0.
  - H_C = {(0), (1)}, |H_C| = 2¹, so H shatters a set C with 1 element.
• C = {c1, c2}, w.l.o.g. c1 < c2
  - for θ < c1:        (h_θ(c1), h_θ(c2)) = (1, 1)
  - for c1 ≤ θ < c2:   (h_θ(c1), h_θ(c2)) = (0, 1)
  - for θ ≥ c2:        (h_θ(c1), h_θ(c2)) = (0, 0)
  - there is no h_θ ∈ H with (h_θ(c1), h_θ(c2)) = (1, 0).
  - H_C = {(0, 0), (0, 1), (1, 1)}, |H_C| ≤ 3, no matter what c1, c2 are.
H shatters a set of size 1, but no set of size 2  ⇒  VCdim(H) = 1.

3) Axis-parallel rectangles
• four points in a diamond-like arrangement can be labeled in 2⁴ ways,
• for every set of five points, the labeling "four outer points 1, innermost point 0" doesn't work (this statement needs a proof, of course)
→ VCdim(H) = 4
[figure: five points; the four outer points are labeled 1, the point in the middle is labeled 0]

4) Linear classifiers
• H = { h(x) = ⟦⟨w, x⟩ + b ≥ 0⟧ : w ∈ ℝᵈ, b ∈ ℝ }.  VCdim(H) = d + 1.

5) All (measurable) functions
• H = {h : X → {0, 1}}.  VCdim(H) = |X| (usually ∞).

6) Sine waves
• H = { sign(sin(λx)) for λ ∈ ℝ }.  VCdim(H) = ∞.

VC Theory (Vapnik–Chervonenkis, 1971)
Theorem (Fundamental Theorem of Statistical Learning)
Let H ⊂ { X → {0, 1} } be a hypothesis set. Then the following statements are equivalent:
• H has the uniform convergence property.
• Any ERM rule is successful in agnostic PAC learning for H.
• H is agnostic PAC learnable.
• H is PAC learnable.
• Any ERM rule is successful in PAC learning for H.
• H has finite VC-dimension.

Theorem (Fundamental Theorem – quantitative)
Let H ⊂ { X → {0, 1} } be a hypothesis set with VCdim(H) = d < ∞.
• H is PAC learnable with sample complexity

    C1 (d + log(1/δ))/ε  ≤  m0(ε, δ)  ≤  C2 (d log(1/ε) + log(1/δ))/ε

• H is agnostic PAC learnable with sample complexity

    C′1 (d + log(1/δ))/ε²  ≤  m0(ε, δ)  ≤  C′2 (d + log(1/δ))/ε²

• for any m ∈ ℕ it holds with probability 1 − δ over D ∼ pᵐ, uniformly for all h ∈ H:

    R_p(h) ≤ R̂_D(h) + √( (8 d log(em/d) + log(8/δ)) / (2m) )

  (the complexity term is evaluated numerically in the sketch below).
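As a quick illustration (not from the slides, and taking the constants of the bound above at face value), the following evaluates the complexity term for a few values of d and m, showing its roughly √(d log m / m) decay.

```python
import math

def vc_complexity_term(d, m, delta=0.05):
    """sqrt((8 d log(em/d) + log(8/delta)) / (2m)), the uniform deviation
    term of the quantitative fundamental theorem as stated above."""
    return math.sqrt((8 * d * math.log(math.e * m / d)
                      + math.log(8 / delta)) / (2 * m))

for d in (3, 10, 100):
    for m in (1_000, 10_000, 100_000):
        print(d, m, round(vc_complexity_term(d, m), 3))
# e.g. d = 3, m = 1000 gives about 0.29; increasing m by a factor of 100
# shrinks the term by roughly a factor of 10 (up to the log factor)
```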
Structural Risk Minimization
Let VCdim(H) = d < ∞. Then, with probability 1 − δ over D ∼ pᵐ, uniformly for all h ∈ H:

    R_p(h) ≤ R̂_D(h) + √( (8 d log(em/d) + log(8/δ)) / (2m) )        (∗)

Formal procedure:
• fix H with VCdim(H) < ∞ (otherwise hopeless),
• obtain a sufficiently large training set (|D| ≥ m(ε, δ)),
• enjoy the guarantee of (∗).
Real world:
• D (and thereby m) is fixed, but H isn't.
Task:
• How to choose H?

Structural Risk Minimization
Let H1 ⊂ H2 ⊂ · · · ⊂ HK be a nested sequence of hypothesis spaces.
Examples: Hk are
• all decision trees of depth at most k,
• linear functions with weight vector w such that ‖w‖² ≤ k, . . .
Not examples (because not nested):
• linear classifiers in ℝ¹, ℝ², . . .
• 1-NN, 2-NN, 3-NN, . . .

Let dk = VCdim(Hk). Then for each Hk we have, with probability 1 − δ:

    R_p(h) ≤ R̂_D(h) + √( (8 dk log(em/dk) + log(8/δ)) / (2m) )  =:  rhs(h)

Structural Risk Minimization (SRM):

    h_SRM = h_k    for  k = argmin_{k=1,2,...} rhs(h_k)   with  h_k = ERM(Hk, S)

• finding ERM(Hk, S) is itself a minimization,
• the guarantee holds for any h, not just the ERM.

More convenient: merge both steps into one:

    h_SRM = argmin_{h ∈ ⋃_{k=1}^K Hk}   R̂_S(h)  +  √( (8 d(h) log(em/d(h)) + log(8K/δ)) / (2m) )
                                       [loss term]            [complexity term]

• loss term (= training error): encourages a good fit of the data,
• complexity term: penalizes (too) complex hypotheses.
A minimal code sketch of this selection rule follows below.
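A minimal SRM sketch under the same assumptions (illustrative only): train_erm and vc_dims are placeholders for whatever ERM solver and hypothesis-class hierarchy one actually uses; the code trains one ERM classifier per Hk and returns the one minimizing training error plus complexity term.

```python
import math

def complexity_term(d, m, K, delta=0.05):
    """Complexity penalty from the merged SRM objective above:
    sqrt((8 d log(em/d) + log(8K/delta)) / (2m))."""
    return math.sqrt((8 * d * math.log(math.e * m / d)
                      + math.log(8 * K / delta)) / (2 * m))

def srm_select(train_erm, vc_dims, D, delta=0.05):
    """Structural risk minimization over nested classes H_1 c ... c H_K.

    Assumptions for this sketch:
      - train_erm(k, D) returns (h_k, training_error_of_h_k),
      - vc_dims[k-1] holds d_k = VCdim(H_k)."""
    m, K = len(D), len(vc_dims)
    best = None
    for k in range(1, K + 1):
        h_k, train_err = train_erm(k, D)
        bound = train_err + complexity_term(vc_dims[k - 1], m, K, delta)
        if best is None or bound < best[0]:
            best = (bound, h_k, k)
    return best  # (bound value, selected hypothesis, selected index k)
```

Note that the selection uses only the training set and the VC dimensions; no validation split is needed, which is exactly the appeal of the uniform bound.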
Example: Structural Risk Minimization
Let Hk be the set of k-sparse linear classifiers in ℝⁿ,

    Hk = { h(x) = sign f(x) for f(x) = ⟨w, x⟩ },  where w ∈ ℝⁿ has at most k non-zero entries (with k < n/2),

• H1 ⊂ H2 ⊂ · · · ⊂ H_⌊n/2−1⌋
• VCdim(Hk) < 2k log(n)   [Tyler Neylon: "Sparse solutions for linear prediction problems", 2006]

Simplify the bound (with ‖h‖_L0 = number of non-zero entries of w):

    R_p(h) ≤ R̂_D(h) + C ‖h‖_L0 + C′

Simplified h_SRM:

    h_SRM = argmin_h  R̂_D(h)  +  C ‖h‖_L0
                     [loss term]   [regularizer]

Guarantees for Maximum-Margin Classifiers
Remember maximum-margin classifiers (SVMs):

    min_w  (1/(2C)) ‖w‖²  +  (1/m) Σ_{i=1}^m ξⁱ        for ξⁱ = max{0, 1 − yⁱ⟨w, xⁱ⟩}
           [regularizer]     [(hinge) loss]

Slightly different derivation and loss, simplest for the squared hinge loss:

Theorem (Radius–Margin Bound) [Shawe-Taylor, Cristianini, 2001]
Let S = {(x_i, y_i)}_{i=1,...,m} ⊂ ℝⁿ × {±1} with ‖x_i‖ ≤ R. For f(x) = ⟨w, x⟩, set ξ_i = max{0, 1 − y_i⟨w, x_i⟩}. Then, for h(x) = sign f(x), with probability ≥ 1 − δ,

    R_p(h) ≤ (c/m) ( ( R²‖w‖² + Σ_{i=1}^m ξ_i² ) log² m + log(1/δ) ).

Note: the bound is independent of the feature dimension n. A small numerical sketch follows below.
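For completeness, a small numerical sketch (illustrative; the unspecified constant c from the theorem is set to 1 here, so the absolute numbers are only indicative) that evaluates the radius–margin expression for a given weight vector w on a toy dataset with labels in {−1, +1}.

```python
import math

def radius_margin_bound(w, X, y, delta=0.05, c=1.0):
    """Evaluate (c/m) * ((R^2 ||w||^2 + sum_i xi_i^2) * log^2(m) + log(1/delta))
    for f(x) = <w, x>.  The constant c is unspecified in the theorem and is
    set to 1 here purely for illustration."""
    m = len(X)
    R = max(math.sqrt(sum(v * v for v in x)) for x in X)     # max_i ||x_i||
    w_norm_sq = sum(v * v for v in w)                        # ||w||^2
    xi_sq = sum(max(0.0, 1.0 - yi * sum(wv * xv for wv, xv in zip(w, x))) ** 2
                for x, yi in zip(X, y))                      # sum of squared slacks
    return (c / m) * ((R ** 2 * w_norm_sq + xi_sq) * math.log(m) ** 2
                      + math.log(1.0 / delta))

# toy usage; with only four points the bound is of course vacuous,
# it becomes meaningful when m is large relative to R^2 ||w||^2
X = [(1.0, 0.5), (-0.8, -1.2), (0.9, 1.1), (-1.0, 0.2)]
y = [+1, -1, +1, -1]
w = (1.5, 0.5)
print(radius_margin_bound(w, X, y))
```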