
582669 Supervised Machine Learning (Spring 2014)
Homework 5, sample solutions
Credit for the solutions goes mainly to Panu Luosto and Joonas Paalasmaa, with some additional
contributions by Jyrki Kivinen.
Problem 1
This exercise is of course a drill: there are simpler ways of finding the minimum of x² in a closed
interval.
First part. Let f(x) = x² and p(x) = (x − 2)² − 1. We are asked to minimize f(x) with the constraint
p(x) ≤ 0. Both functions are clearly convex, and the Karush-Kuhn-Tucker conditions of this convex
optimization problem are
 0
f (x) + λp0 (x) = 0




p(x) ≤ 0

λ≥0



λp(x) = 0 ,
or equivalently

2x + λ(2x − 4) = 0





(x − 2)2 − 1 ≤ 0

λ≥0




2
λ((x − 2) − 1) = 0 .
(i)
(ii)
(iii)
(iv)
The solutions of (iv) are λ = 0, x = 3 or x = 1. If λ = 0, we get x = 0 from (i), and then condition
(ii) does not hold. If x = 3, equation (i) becomes 6 + 2λ = 0, which means that condition (iii) is
violated. We are left with x = 1. Then (i) yields 2 − 2λ = 0 ⇔ λ = 1, and the inequalities (ii)
and (iii) hold. In other words, f(x) is minimized subject to p(x) ≤ 0 when x = 1.
The Lagrangian of the problem is
    L(x, λ) = f(x) + λp(x)
            = x² + λ((x − 2)² − 1)
            = (1 + λ)x² − 4λx + 3λ ,
and the dual is
    g(λ) = inf_{x∈R} L(x, λ)
         = inf_{x∈R} (1 + λ)x² − 4λx + 3λ .
The dual problem is to maximize g(λ) in the set λ ∈ [0, ∞[. Let h(x) = (1 + λ)x² − 4λx + 3λ.
Because λ ≥ 0, the graph of h is a convex parabola, and the minimum of h is at the point where
h'(x) = (2 + 2λ)x − 4λ = 0 ⇒ x = 2λ/(1 + λ). Thus
    g(λ) = (1 + λ) · 4λ²/(1 + λ)² − 8λ²/(1 + λ) + 3λ
         = (4λ² − 8λ² + 3λ + 3λ²)/(1 + λ)
         = (3λ − λ²)/(1 + λ) .
With some more calculation, we could show that the dual g is maximized at λ = 1, and g(1) = 1.
Our optimization problem is convex, and p(2) < 0; in other words, there is x ∈ R such that p(x) is
strictly less than zero. Therefore, according to Theorem 3.4 of the lecture slides, strong duality
holds, and g(λ*) = f(x*), where λ* and x* are the solutions of the dual and primal optimization
problems, respectively.
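As a quick numerical sanity check (not required by the exercise), one can also compare the primal and dual optima in MATLAB/Octave. The sketch below uses the fact that the feasible set {x : (x − 2)² − 1 ≤ 0} is the interval [1, 3]; the variable names and the search interval [0, 10] for λ are illustrative choices.

% Numerical check of strong duality for the first part.
f = @(x) x.^2;                              % primal objective
g = @(lam) (3*lam - lam.^2) ./ (1 + lam);   % dual function derived above
x_star   = fminbnd(f, 1, 3);                % minimize f over the feasible interval [1, 3]
lam_star = fminbnd(@(lam) -g(lam), 0, 10);  % maximize g over [0, 10]
fprintf('primal: x* = %.4f, f(x*) = %.4f\n', x_star, f(x_star));
fprintf('dual:   lambda* = %.4f, g(lambda*) = %.4f\n', lam_star, g(lam_star));
% Both optimal values come out as 1, in agreement with g(lambda*) = f(x*).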
Second part. Let f(x) = x² and let p(x) = (x − 2)² − 8. Again, we want to minimize the convex
function f subject to p(x) ≤ 0, where p is also a convex function. The KKT conditions of this convex
optimization problem are
    2x + λ(2x − 4) = 0        (i)
    (x − 2)² − 8 ≤ 0          (ii)
    λ ≥ 0                     (iii)
    λ((x − 2)² − 8) = 0 .     (iv)
From (iv) and (i) we easily find the solution x = λ = 0. In order to check for other possible solutions
(for practice only), we get x = 2 + 2√2 and x = 2 − 2√2 from (iv). But then solving λ = x/(2 − x)
from (i) shows that condition (iii) is violated in these two cases. So the solution of the problem is
x = 0.
The Lagrangian of f is
    L(x, λ) = f(x) + λp(x)
            = x² + λ((x − 2)² − 8) ,
and the dual is
    g(λ) = inf_{x∈R} L(x, λ)
         = inf_{x∈R} (1 + λ)x² − 4λx − 4λ .
Finding the infimum proceeds along the same lines as in the first part of the exercise, and
    g(λ) = (1 + λ)(2λ/(1 + λ))² − 4λ · 2λ/(1 + λ) − 4λ
         = (4λ² − 8λ² − 4λ − 4λ²)/(1 + λ)
         = −2λ(2 + 4λ)/(1 + λ) .
In the dual problem, we maximize g(λ) in the set λ ∈ [0, ∞[. Because g(λ) ≤ 0 everywhere in R₊, and
g(λ) = 0 only when λ = 0, the maximum is achieved at λ = 0. Then x = (2 · 0)/(1 + 0) = 0. Strong
duality holds also in this case, which can also be seen from the fact that all the KKT conditions are
satisfied when λ = x = 0.
Problem 2
Let f : R^d → R, f(w) = ‖w − w_t‖² = (w − w_t) · (w − w_t) = w · w − 2w · w_t + w_t · w_t, and let
p(w) = 1 − y_t w · x_t. Notice that this is the revised version of exercise 2; in the original version p
was different. As a warm-up, we prove formally that f is a convex function in R^d.
A twice differentiable function is convex in a convex set if and only if its Hessian matrix is positive
semidefinite on the interior of the convex set (compare with the non-negativity of the second
derivative of a function of type R → R). Because f(w) = Σ_{i=1}^d (w_i − w_{t,i})², it is easy to see that
    ∂f(w)/∂w_i = 2(w_i − w_{t,i})    (i ∈ {1, 2, . . . , d}) ,
and for all i, j ∈ {1, 2, . . . , d}
    ∂²f(w)/(∂w_i ∂w_j) = 2 if i = j   and   ∂²f(w)/(∂w_i ∂w_j) = 0 if i ≠ j .
So the Hessian matrix of f is H(f) = 2I_d. It holds for all z ∈ R^{d×1} that z^T (2I_d) z = 2‖z‖² ≥ 0, which
means H(f) is positive semidefinite.
The optimization problem is to minimize f(w) subject to p(w) ≤ 0. The Karush-Kuhn-Tucker
conditions are
    ∇f(w) + λ∇p(w) = 0
    p(w) ≤ 0
    λ ≥ 0
    λp(w) = 0 ,
or equivalently
    2w − 2w_t − λy_t x_t = 0        (i)
    1 − y_t w · x_t ≤ 0             (ii)
    λ ≥ 0                           (iii)
    λ(1 − y_t w · x_t) = 0 .        (iv)
Equation (i) yields
    w = w_t + (1/2)λy_t x_t .        (1)
From equality (iv) we get two solutions. In the first case, λ = 0, and inequality (iii) is true. Also,
w = w_t, and 1 − y_t w_t · x_t ≤ 0 has to hold.
In the second solution λ > 0, and we get from (iv)
    1 − y_t w · x_t = 1 − y_t (w_t + (1/2)λy_t x_t) · x_t
                    = 1 − y_t w_t · x_t − (1/2)λ x_t · x_t
                    = 0 ,        (2)
where we used y_t² = 1. In this case (ii) holds with equality. Solving λ from (2) and using (iii) gives
    λ = 2 · (1 − y_t w_t · x_t)/‖x_t‖² ≥ 0 ,        (3)
which implies 1 − y_t w_t · x_t ≥ 0. Plugging λ from (3) into (1) yields
    w = w_t + (1/2) · 2 · (1 − y_t w_t · x_t)/‖x_t‖² · y_t x_t
      = w_t + y_t (1 − y_t w_t · x_t)/‖x_t‖² · x_t .
Finally, putting all things together, we have the update rule
    w_{t+1} = w_t + σ_t y_t (1 − y_t w_t · x_t)/‖x_t‖² · x_t ,
where
    σ_t = 0  if y_t w_t · x_t ≥ 1 ,
    σ_t = 1  if y_t w_t · x_t < 1 .
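For concreteness, here is a minimal MATLAB/Octave sketch of a single update with this rule; the function name and the column-vector convention for w_t and x_t are illustrative assumptions.

function w_new = margin_update(w_t, x_t, y_t)
  % One step of the update rule derived above; w_t and x_t are column vectors.
  margin = y_t * (w_t' * x_t);
  if margin < 1                     % sigma_t = 1: the margin constraint is violated
    w_new = w_t + y_t * (1 - margin) / (x_t' * x_t) * x_t;
  else                              % sigma_t = 0: w_t already satisfies the constraint
    w_new = w_t;
  end
end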
We were asked to show that if w_1 = 0, then w_t is a linear combination of the vectors x_1, x_2, . . . , x_{t−1};
in other words, w_t = α_1 x_1 + α_2 x_2 + · · · + α_{t−1} x_{t−1}, where α_1, α_2, . . . , α_{t−1} ∈ R. If t = 1, we have an
empty sum, and the statement is true. Let t ∈ {2, 3, . . . } and assume that w_{t−1} = Σ_{i=1}^{t−2} α_i x_i. Using
our update rule,
    w_t = Σ_{i=1}^{t−2} α_i x_i + σ_{t−1} y_{t−1} (1 − y_{t−1} w_{t−1} · x_{t−1})/‖x_{t−1}‖² · x_{t−1}
        = Σ_{i=1}^{t−1} α_i x_i ,
where we have written α_{t−1} = σ_{t−1} y_{t−1} (1 − y_{t−1} w_{t−1} · x_{t−1})/‖x_{t−1}‖². According to the induction principle,
the statement is thus true for all t ∈ N.
In the kernelized Perceptron algorithm we replace the x_i vectors with their counterparts ψ(x_i) in
the feature space. The algorithm can be described as follows:
For t = 1, 2, . . . , T do
1. Get the instance x_t.
2. Let p_t = w_t · ψ(x_t) = Σ_{i=1}^{t−1} α_i ψ(x_i) · ψ(x_t) = Σ_{i=1}^{t−1} α_i k(x_i, x_t). Predict ŷ_t = sign(p_t).
3. Get the correct answer y_t.
4. If y_t p_t ≤ 1, set α_t ← y_t (1 − y_t w_t · ψ(x_t))/‖ψ(x_t)‖² = (y_t − Σ_{i=1}^{t−1} α_i k(x_i, x_t))/k(x_t, x_t);
   otherwise set α_t ← 0 (and x_t can be discarded).
5. Let ‖w'_{t+1}‖² = (Σ_{i=1}^t α_i ψ(x_i)) · (Σ_{j=1}^t α_j ψ(x_j)) = Σ_{i=1}^t Σ_{j=1}^t α_i α_j k(x_i, x_j). If ‖w'_{t+1}‖ > 1,
   set α_i ← α_i/‖w'_{t+1}‖ for all i ∈ {1, 2, . . . , t}.
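The steps above can also be written out as code. The following MATLAB/Octave sketch is one possible realization: the RBF kernel, the function name, and the data layout (rows of X are instances, y holds the ±1 labels) are assumptions made for illustration, not part of the exercise.

function [alpha, yhat] = kernel_margin_perceptron(X, y, sigma)
  % Kernelized variant of the update rule above, with an RBF kernel.
  T = size(X, 1);
  alpha = zeros(T, 1);
  yhat  = zeros(T, 1);
  k = @(a, b) exp(-sum((a - b).^2) / (2 * sigma^2));  % kernel k(x, x')
  K = zeros(T, T);                                    % precomputed kernel matrix
  for i = 1:T
    for j = 1:T
      K(i, j) = k(X(i, :), X(j, :));
    end
  end
  for t = 1:T
    p = K(1:t-1, t)' * alpha(1:t-1);   % step 2: p_t = sum_i alpha_i k(x_i, x_t)
    yhat(t) = sign(p);                 % prediction (sign(0) = 0 in MATLAB)
    if y(t) * p <= 1                   % step 4: margin violated, set alpha_t
      alpha(t) = (y(t) - p) / K(t, t);
    end
    wnorm = sqrt(alpha(1:t)' * K(1:t, 1:t) * alpha(1:t));  % step 5
    if wnorm > 1
      alpha(1:t) = alpha(1:t) / wnorm;
    end
  end
end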
Problem 3
The idea of this exercise was to show that if there is a solution to the given minimization problem,
then coordinate i of the new weight vector satisfies
    w_{t+1,i} = (1/Z) w_{t,i} β_t^{y_t x_{t,i}} ,
where β_t ≥ 1. Notice that the formulation of the exercise is slightly revised.
Let f : R^d_+ → R, f(w) = d_re(w, w_t). See the discussion about defining f on the boundary of the set
R^d_+ in the solution of exercise 2b of the third homework. Similarly as in the previous exercise, we
start by proving that f is convex in R^d_+, which we do by examining the Hessian matrix of f on the
interior of R^d_+. Let w = (w_1, w_2, . . . , w_d) ∈ (R_+ \ {0})^d. For all i ∈ {1, 2, . . . , d}
    ∂f(w)/∂w_i = ∂(Σ_{i=1}^d w_i ln(w_i/w_{t,i}))/∂w_i = ln w_i − ln w_{t,i} + 1 ,
and
    ∂²f(w)/(∂w_i ∂w_i) = ∂(ln w_i − ln w_{t,i} + 1)/∂w_i = 1/w_i .
Additionally, for all i, j ∈ {1, 2, . . . , d}, i ≠ j,
    ∂²f(w)/(∂w_i ∂w_j) = 0 .
Let z ∈ R^{d×1}. The Hessian of f is positive semidefinite because
    z^T H(f) z = (z_1/w_1, z_2/w_2, . . . , z_d/w_d) z = Σ_{i=1}^d z_i²/w_i ≥ 0 .
We have now proved that f is convex in R^d_+.
This time, in addition to an inequality constraint, we also have an equality constraint. Let
    p(w) = −y_t w · x_t ,    q(w) = Σ_{i=1}^d w_i − 1 .
The conditions w_i ≥ 0 are left out because of the definition of f. The optimization problem is to
minimize f in the set R^d_+ subject to
    p(w) ≤ 0   and   q(w) = 0 .
The function p is linear, and therefore convex. Let a = (1, 1, . . . , 1). We see that q(w) = a · w − 1 is
affine. So our minimization problem is convex.
The KKT conditions for the optimization problem are
    ∇f(w) + λ∇p(w) + γ∇q(w) = 0
    p(w) ≤ 0
    q(w) = 0
    λ ≥ 0
    λp(w) = 0 ,
where γ ∈ R. For all i ∈ {1, 2, . . . , d} it holds that
    ∂f(w)/∂w_i = ln w_i − ln w_{t,i} + 1 ,
    ∂p(w)/∂w_i = −y_t x_{t,i} ,
    ∂q(w)/∂w_i = 1 ,
and the first KKT condition thus yields
    (ln w_i − ln w_{t,i} + 1) − λy_t x_{t,i} + γ = 0 .
This implies
    ln w_i = ln w_{t,i} − 1 + λy_t x_{t,i} − γ ,
and further
    w_i = w_{t,i} exp(−1 + λy_t x_{t,i} − γ)
        = exp(−1 − γ) w_{t,i} exp(λy_t x_{t,i})
        = (1/Z) w_{t,i} β_t^{y_t x_{t,i}} ,
where 1/Z = exp(−1 − γ) and β_t = exp(λ). The condition q(w) = 0 now yields
    Σ_{i=1}^d (1/Z) w_{t,i} β_t^{y_t x_{t,i}} − 1 = 0   ⇒   Z = Σ_{i=1}^d w_{t,i} β_t^{y_t x_{t,i}} ,
and because of the condition λ ≥ 0 we know that β_t = exp(λ) ≥ 1.
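As an illustration (not part of the required derivation), the resulting update takes only a couple of lines of MATLAB/Octave; the function name is hypothetical, w_t is assumed to be a nonnegative weight vector summing to one, and beta_t ≥ 1 is assumed to be supplied by the learning algorithm.

function w_new = exponentiated_update(w_t, x_t, y_t, beta_t)
  % Componentwise w_{t,i} * beta_t^(y_t * x_{t,i}), then divide by Z
  % so that the new weights again sum to one.
  w_new = w_t .* beta_t.^(y_t * x_t);
  w_new = w_new / sum(w_new);
end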
Problem 4
The test and training errors obtained with different choices of the kernel width and the parameter C
are shown below. For both parameters, the values 1, 10 and 100 are tried.
Note that the smallest test set error is attained with C = 1.0 when width = 1.0 and with C = 100.0
when width = 10.0. When width = 100.0, the classifier breaks down completely, because the kernel
width is much larger than the extent of the data set.
% Load the data: columns 1-2 are the coordinates, column 3 is the +/-1 label.
data = dlmread('hw05data.txt');
X = data(:,1);
Y = data(:,2);
label = data(:,3);
N = size(data,1);
negative_mask = find(label == -1);
positive_mask = find(label == 1);

% Use the first half of the data for training and the second half for testing.
X_train = X(1:N/2);
Y_train = Y(1:N/2);
label_train = label(1:N/2);
X_test = X(N/2+1:end);
Y_test = Y(N/2+1:end);
label_test = label(N/2+1:end);
train_data = data(1:N/2, 1:2);
test_data = data(N/2+1:end, 1:2);

% Try all combinations of the RBF kernel width and the box constraint C.
sigma = [1 10 100];
C = [1 10 100];
for sigma_i = 1:3
    for C_i = 1:3
        figure;
        model = svmtrain(train_data, label_train, 'kernel_function', ...
            'rbf', 'rbf_sigma', sigma(sigma_i), ...
            'boxconstraint', C(C_i), 'showplot', true)
        label_test_data_result = svmclassify(model, test_data, 'showplot', true);
        label_train_data_result = svmclassify(model, train_data);
        train_error = mean(label_train ~= label_train_data_result)
        test_error = mean(label_test ~= label_test_data_result)
        title(sprintf('width=%.1f C=%.1f testerr.=%.3f trainerr.=%.3f', ...
            sigma(sigma_i), C(C_i), test_error, train_error));
        print -deps
    end
end
The nine figures (one for each parameter combination) show the decision regions together with the
training points, the classified test points, and the support vectors. The errors reported in the
figure titles are:

    width    C        test err.   train err.
    1.0      1.0      0.045       0.055
    1.0      10.0     0.055       0.055
    1.0      100.0    0.065       0.050
    10.0     1.0      0.250       0.245
    10.0     10.0     0.080       0.085
    10.0     100.0    0.055       0.075
    100.0    1.0      0.510       0.485
    100.0    10.0     0.510       0.485
    100.0    100.0    0.505       0.485