
582669 Supervised Machine Learning (Spring 2014)
Homework 5, sample solutions
Credit for the solutions goes mainly to Panu Luosto and Joonas Paalasmaa, with some additional
contributions by Jyrki Kivinen.
Problem 1
This exercise is of course a drill: there are simpler ways of finding the minimum of x² in a closed
interval.
First part. Let f(x) = x² and p(x) = (x − 2)² − 1. We are asked to minimize f(x) with the constraint
p(x) ≤ 0. Both functions are clearly convex, and the Karush-Kuhn-Tucker conditions of this convex
optimization problem are
 0
f (x) + λp0 (x) = 0




p(x) ≤ 0

λ≥0



λp(x) = 0 ,
or equivalently

2x + λ(2x − 4) = 0





(x − 2)2 − 1 ≤ 0

λ≥0




2
λ((x − 2) − 1) = 0 .
(i)
(ii)
(iii)
(iv)
The solutions of (iv) are λ = 0, x = 3 or x = 1. If λ = 0, we get x = 0 from (i), and then condition
(ii) does not hold. If x = 3, equation (i) becomes 6 + 2λ = 0, which means that condition (iii) is
violated. We are left with x = 1. Then (i) yields 2 − 2λ = 0 ⇔ λ = 1, and the inequalities (ii)
and (iii) hold. In other words, f(x) is minimized subject to p(x) ≤ 0 when x = 1.
The Lagrangian of the problem is
    L(x, λ) = f(x) + λp(x)
            = x² + λ((x − 2)² − 1)
            = (1 + λ)x² − 4λx + 3λ ,
and the dual is
    g(λ) = inf_{x∈R} L(x, λ)
         = inf_{x∈R} (1 + λ)x² − 4λx + 3λ .
The dual problem is to maximize g(λ) in the set λ ∈ [0, ∞[. Let h(x) = (1 + λ)x² − 4λx + 3λ.
Because λ ≥ 0, the graph of h is a convex parabola, and the minimum of h is at the point where
h'(x) = (2 + 2λ)x − 4λ = 0 ⇒ x = 2λ/(1 + λ). Thus
    g(λ) = (1 + λ) · 4λ²/(1 + λ)² − 8λ²/(1 + λ) + 3λ
         = (4λ² − 8λ² + 3λ + 3λ²)/(1 + λ)
         = (3λ − λ²)/(1 + λ) .
With some more calculation, we could show that the dual g is maximized at λ = 1, and g(1) = 1.
Our optimization problem is convex, and p(2) < 0; in other words, there is x ∈ R such that p(x) is
strictly less than zero. Therefore, according to Theorem 3.4 of the lecture slides, strong duality
holds, and g(λ*) = f(x*), where λ* and x* are the solutions of the dual and primal optimization
problems, respectively.
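As a quick numerical sanity check (not required by the exercise), one can also compare the primal and dual optima in MATLAB/Octave. The sketch below uses the fact that the feasible set {x : (x − 2)² − 1 ≤ 0} is the interval [1, 3]; the variable names and the search interval [0, 10] for λ are illustrative choices.

% Numerical check of strong duality for the first part.
f = @(x) x.^2;                              % primal objective
g = @(lam) (3*lam - lam.^2) ./ (1 + lam);   % dual function derived above
x_star   = fminbnd(f, 1, 3);                % minimize f over the feasible interval [1, 3]
lam_star = fminbnd(@(lam) -g(lam), 0, 10);  % maximize g over [0, 10]
fprintf('primal: x* = %.4f, f(x*) = %.4f\n', x_star, f(x_star));
fprintf('dual:   lambda* = %.4f, g(lambda*) = %.4f\n', lam_star, g(lam_star));
% Both optimal values come out as 1, in agreement with g(lambda*) = f(x*).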
Second part. Let f(x) = x² and let p(x) = (x − 2)² − 8. Again, we want to minimize the convex
function f subject to p(x) ≤ 0, where p is also a convex function. The KKT conditions of this convex
optimization problem are
    2x + λ(2x − 4) = 0        (i)
    (x − 2)² − 8 ≤ 0          (ii)
    λ ≥ 0                     (iii)
    λ((x − 2)² − 8) = 0 .     (iv)
From (iv) and (i) we easily find the solution x = λ = 0. In order to check for other possible solutions
(for practice only), we get x = 2 + 2√2 and x = 2 − 2√2 from (iv). But then solving λ = x/(2 − x)
from (i) shows that condition (iii) is violated in these two cases. So the solution of the problem is
x = 0.
The Lagrangian of f is
    L(x, λ) = f(x) + λp(x)
            = x² + λ((x − 2)² − 8) ,
and the dual is
    g(λ) = inf_{x∈R} L(x, λ)
         = inf_{x∈R} (1 + λ)x² − 4λx − 4λ .
Finding the infimum proceeds along the same lines as in the first part of the exercise, and
    g(λ) = (1 + λ)(2λ/(1 + λ))² − 4λ · 2λ/(1 + λ) − 4λ
         = (4λ² − 8λ² − 4λ − 4λ²)/(1 + λ)
         = −2λ(2 + 4λ)/(1 + λ) .
In the dual problem, we maximize g(λ) in the set λ ∈ [0, ∞[. Because g(λ) ≤ 0 everywhere in R₊, and
g(λ) = 0 only when λ = 0, the maximum is achieved at λ = 0. Then x = (2 · 0)/(1 + 0) = 0. Strong
duality holds also in this case, which can also be seen from the fact that all the KKT conditions are
satisfied when λ = x = 0.
Problem 2
Let f : R^d → R, f(w) = ‖w − w_t‖² = (w − w_t) · (w − w_t) = w · w − 2w · w_t + w_t · w_t, and let
p(w) = 1 − y_t w · x_t. Notice that this is the revised version of exercise 2; in the original version p
was different. As a warm-up, we prove formally that f is a convex function in R^d.
A twice differentiable function is convex in a convex set if and only if its Hessian matrix is positive
semidefinite on the interior of the convex set (compare with the non-negativity of the second
derivative of a function of type R → R). Because f(w) = Σ_{i=1}^d (w_i − w_{t,i})², it is easy to see that
    ∂f(w)/∂w_i = 2(w_i − w_{t,i})    (i ∈ {1, 2, . . . , d}) ,
and for all i, j ∈ {1, 2, . . . , d}
    ∂²f(w)/(∂w_i ∂w_j) = 2 if i = j   and   ∂²f(w)/(∂w_i ∂w_j) = 0 if i ≠ j .
So the Hessian matrix of f is H(f) = 2I_d. It holds for all z ∈ R^{d×1} that z^T (2I_d) z = 2‖z‖² ≥ 0, which
means H(f) is positive semidefinite.
The optimization problem is to minimize f(w) subject to p(w) ≤ 0. The Karush-Kuhn-Tucker
conditions are
    ∇f(w) + λ∇p(w) = 0
    p(w) ≤ 0
    λ ≥ 0
    λp(w) = 0 ,
or equivalently
    2w − 2w_t − λy_t x_t = 0        (i)
    1 − y_t w · x_t ≤ 0             (ii)
    λ ≥ 0                           (iii)
    λ(1 − y_t w · x_t) = 0 .        (iv)
Equation (i) yields
    w = w_t + (1/2)λy_t x_t .        (1)
From equality (iv) we get two solutions. In the first case, λ = 0, and inequality (iii) is true. Also,
w = w_t, and 1 − y_t w_t · x_t ≤ 0 has to hold.
In the second solution λ > 0, and we get from (iv)
    1 − y_t w · x_t = 1 − y_t (w_t + (1/2)λy_t x_t) · x_t
                    = 1 − y_t w_t · x_t − (1/2)λ x_t · x_t
                    = 0 ,        (2)
where we used y_t² = 1. In this case (ii) holds with equality. Solving λ from (2) and using (iii) gives
    λ = 2 · (1 − y_t w_t · x_t)/‖x_t‖² ≥ 0 ,        (3)
which implies 1 − y_t w_t · x_t ≥ 0. Plugging λ from (3) into (1) yields
    w = w_t + (1/2) · 2 · (1 − y_t w_t · x_t)/‖x_t‖² · y_t x_t
      = w_t + y_t (1 − y_t w_t · x_t)/‖x_t‖² · x_t .
Finally, putting all things together, we have the update rule
    w_{t+1} = w_t + σ_t y_t (1 − y_t w_t · x_t)/‖x_t‖² · x_t ,
where
    σ_t = 0  if y_t w_t · x_t ≥ 1 ,
    σ_t = 1  if y_t w_t · x_t < 1 .
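For concreteness, here is a minimal MATLAB/Octave sketch of a single update with this rule; the function name and the column-vector convention for w_t and x_t are illustrative assumptions.

function w_new = margin_update(w_t, x_t, y_t)
  % One step of the update rule derived above; w_t and x_t are column vectors.
  margin = y_t * (w_t' * x_t);
  if margin < 1                     % sigma_t = 1: the margin constraint is violated
    w_new = w_t + y_t * (1 - margin) / (x_t' * x_t) * x_t;
  else                              % sigma_t = 0: w_t already satisfies the constraint
    w_new = w_t;
  end
end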
We were asked to show that if w_1 = 0, then w_t is a linear combination of the vectors x_1, x_2, . . . , x_{t−1};
in other words, w_t = α_1 x_1 + α_2 x_2 + · · · + α_{t−1} x_{t−1}, where α_1, α_2, . . . , α_{t−1} ∈ R. If t = 1, we have an
empty sum, and the statement is true. Let t ∈ {2, 3, . . . } and assume that w_{t−1} = Σ_{i=1}^{t−2} α_i x_i. Using
our update rule,
    w_t = Σ_{i=1}^{t−2} α_i x_i + σ_{t−1} y_{t−1} (1 − y_{t−1} w_{t−1} · x_{t−1})/‖x_{t−1}‖² · x_{t−1}
        = Σ_{i=1}^{t−1} α_i x_i ,
where we have written α_{t−1} = σ_{t−1} y_{t−1} (1 − y_{t−1} w_{t−1} · x_{t−1})/‖x_{t−1}‖². According to the induction principle,
the statement is thus true for all t ∈ N.
In the kernelized Perceptron algorithm we replace the x_i vectors with their counterparts ψ(x_i) in
the feature space. The algorithm can be described as follows:
For t = 1, 2, . . . , T do
1. Get the instance x_t.
2. Let p_t = w_t · ψ(x_t) = Σ_{i=1}^{t−1} α_i ψ(x_i) · ψ(x_t) = Σ_{i=1}^{t−1} α_i k(x_i, x_t). Predict ŷ_t = sign(p_t).
3. Get the correct answer y_t.
4. If y_t p_t ≤ 1, set α_t ← y_t (1 − y_t w_t · ψ(x_t))/‖ψ(x_t)‖² = (y_t − Σ_{i=1}^{t−1} α_i k(x_i, x_t))/k(x_t, x_t);
   otherwise set α_t ← 0 (and x_t can be discarded).
5. Let ‖w'_{t+1}‖² = (Σ_{i=1}^t α_i ψ(x_i)) · (Σ_{j=1}^t α_j ψ(x_j)) = Σ_{i=1}^t Σ_{j=1}^t α_i α_j k(x_i, x_j). If ‖w'_{t+1}‖ > 1,
   set α_i ← α_i/‖w'_{t+1}‖ for all i ∈ {1, 2, . . . , t}.
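The steps above can also be written out as code. The following MATLAB/Octave sketch is one possible realization: the RBF kernel, the function name, and the data layout (rows of X are instances, y holds the ±1 labels) are assumptions made for illustration, not part of the exercise.

function [alpha, yhat] = kernel_margin_perceptron(X, y, sigma)
  % Kernelized variant of the update rule above, with an RBF kernel.
  T = size(X, 1);
  alpha = zeros(T, 1);
  yhat  = zeros(T, 1);
  k = @(a, b) exp(-sum((a - b).^2) / (2 * sigma^2));  % kernel k(x, x')
  K = zeros(T, T);                                    % precomputed kernel matrix
  for i = 1:T
    for j = 1:T
      K(i, j) = k(X(i, :), X(j, :));
    end
  end
  for t = 1:T
    p = K(1:t-1, t)' * alpha(1:t-1);   % step 2: p_t = sum_i alpha_i k(x_i, x_t)
    yhat(t) = sign(p);                 % prediction (sign(0) = 0 in MATLAB)
    if y(t) * p <= 1                   % step 4: margin violated, set alpha_t
      alpha(t) = (y(t) - p) / K(t, t);
    end
    wnorm = sqrt(alpha(1:t)' * K(1:t, 1:t) * alpha(1:t));  % step 5
    if wnorm > 1
      alpha(1:t) = alpha(1:t) / wnorm;
    end
  end
end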
Problem 3
The idea of this exercise was to show that if there is a solution to the given minimization problem,
then coordinate i of the new weight vector satisfies
    w_{t+1,i} = (1/Z) w_{t,i} β_t^{y_t x_{t,i}} ,
where β_t ≥ 1. Notice that the formulation of the exercise is slightly revised.
Let f : R^d_+ → R, f(w) = d_re(w, w_t). See the discussion about defining f on the boundary of the set
R^d_+ in the solution of exercise 2b of the third homework. Similarly as in the previous exercise, we
start by proving that f is convex in R^d_+, which we do by examining the Hessian matrix of f on the
interior of R^d_+. Let w = (w_1, w_2, . . . , w_d) ∈ (R_+ \ {0})^d. For all i ∈ {1, 2, . . . , d}
    ∂f(w)/∂w_i = ∂(Σ_{i=1}^d w_i ln(w_i/w_{t,i}))/∂w_i = ln w_i − ln w_{t,i} + 1 ,
and
    ∂²f(w)/(∂w_i ∂w_i) = ∂(ln w_i − ln w_{t,i} + 1)/∂w_i = 1/w_i .
Additionally, for all i, j ∈ {1, 2, . . . , d}, i ≠ j,
    ∂²f(w)/(∂w_i ∂w_j) = 0 .
Let z ∈ R^{d×1}. The Hessian of f is positive semidefinite because
    z^T H(f) z = (z_1/w_1, z_2/w_2, . . . , z_d/w_d) z = Σ_{i=1}^d z_i²/w_i ≥ 0 .
We have now proved that f is convex in R^d_+.
This time, in addition to an inequality constraint, we also have an equality constraint. Let
    p(w) = −y_t w · x_t ,    q(w) = Σ_{i=1}^d w_i − 1 .
The conditions w_i ≥ 0 are left out because of the definition of f. The optimization problem is to
minimize f in the set R^d_+ subject to
    p(w) ≤ 0   and   q(w) = 0 .
The function p is linear, and therefore convex. Let a = (1, 1, . . . , 1). We see that q(w) = a · w − 1 is
affine. So our minimization problem is convex.
The KKT conditions for the optimization problem are
    ∇f(w) + λ∇p(w) + γ∇q(w) = 0
    p(w) ≤ 0
    q(w) = 0
    λ ≥ 0
    λp(w) = 0 ,
where γ ∈ R. For all i ∈ {1, 2, . . . , d} it holds that
    ∂f(w)/∂w_i = ln w_i − ln w_{t,i} + 1 ,
    ∂p(w)/∂w_i = −y_t x_{t,i} ,
    ∂q(w)/∂w_i = 1 ,
and the first KKT condition thus yields
    (ln w_i − ln w_{t,i} + 1) − λy_t x_{t,i} + γ = 0 .
This implies
    ln w_i = ln w_{t,i} − 1 + λy_t x_{t,i} − γ ,
and further
    w_i = w_{t,i} exp(−1 + λy_t x_{t,i} − γ)
        = exp(−1 − γ) w_{t,i} exp(λy_t x_{t,i})
        = (1/Z) w_{t,i} β_t^{y_t x_{t,i}} ,
where 1/Z = exp(−1 − γ) and β_t = exp(λ). The condition q(w) = 0 now yields
    Σ_{i=1}^d (1/Z) w_{t,i} β_t^{y_t x_{t,i}} − 1 = 0   ⇒   Z = Σ_{i=1}^d w_{t,i} β_t^{y_t x_{t,i}} ,
and because of the condition λ ≥ 0 we know that β_t = exp(λ) ≥ 1.
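As an illustration (not part of the required derivation), the resulting update takes only a couple of lines of MATLAB/Octave; the function name is hypothetical, w_t is assumed to be a nonnegative weight vector summing to one, and beta_t ≥ 1 is assumed to be supplied by the learning algorithm.

function w_new = exponentiated_update(w_t, x_t, y_t, beta_t)
  % Componentwise w_{t,i} * beta_t^(y_t * x_{t,i}), then divide by Z
  % so that the new weights again sum to one.
  w_new = w_t .* beta_t.^(y_t * x_t);
  w_new = w_new / sum(w_new);
end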
Problem 4
The test and training errors obtained with different choices of the kernel width and the parameter C
are shown below. For both parameters, the values 1, 10 and 100 are tried.
Note that the smallest test set error is attained with C = 1.0 when width = 1.0 and with C = 100.0
when width = 10.0. When width = 100.0, the classifier breaks down completely, because the kernel
width is much larger than the extent of the data set.
% Load the data: columns 1-2 are the coordinates, column 3 is the +/-1 label.
data = dlmread('hw05data.txt');
X = data(:,1);
Y = data(:,2);
label = data(:,3);
N = size(data,1);
negative_mask = find(label == -1);
positive_mask = find(label == 1);

% Use the first half of the data for training and the second half for testing.
X_train = X(1:N/2);
Y_train = Y(1:N/2);
label_train = label(1:N/2);
X_test = X(N/2+1:end);
Y_test = Y(N/2+1:end);
label_test = label(N/2+1:end);
train_data = data(1:N/2, 1:2);
test_data = data(N/2+1:end, 1:2);

% Try all combinations of the RBF kernel width and the box constraint C.
sigma = [1 10 100];
C = [1 10 100];
for sigma_i = 1:3
    for C_i = 1:3
        figure;
        model = svmtrain(train_data, label_train, 'kernel_function', ...
            'rbf', 'rbf_sigma', sigma(sigma_i), ...
            'boxconstraint', C(C_i), 'showplot', true)
        label_test_data_result = svmclassify(model, test_data, 'showplot', true);
        label_train_data_result = svmclassify(model, train_data);
        train_error = mean(label_train ~= label_train_data_result)
        test_error = mean(label_test ~= label_test_data_result)
        title(sprintf('width=%.1f C=%.1f testerr.=%.3f trainerr.=%.3f', ...
            sigma(sigma_i), C(C_i), test_error, train_error));
        print -deps
    end
end
The nine figures (one for each parameter combination) show the decision regions together with the
training points, the classified test points, and the support vectors. The errors reported in the
figure titles are:

    width    C        test err.   train err.
    1.0      1.0      0.045       0.055
    1.0      10.0     0.055       0.055
    1.0      100.0    0.065       0.050
    10.0     1.0      0.250       0.245
    10.0     10.0     0.080       0.085
    10.0     100.0    0.055       0.075
    100.0    1.0      0.510       0.485
    100.0    10.0     0.510       0.485
    100.0    100.0    0.505       0.485