
Classification
Lihi Zelnik-Manor
Video Analysis Course
Summer 2008
Clustering
Objective: separate the data into classes (no labels are given)
Classification
Given: a labeled database
Question: what class does a new sample belong to? ("What am I?")
Nearest Neighbor
Assign the query the label of its single closest labeled example: "my nearest neighbor belongs to some class, so I am of that class as well."
K-Nearest Neighbor
Look at the K closest labeled examples and take a majority vote: "most of my K nearest neighbors belong to one class, so I am of that class too."
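As a concrete illustration (not part of the original slides), here is a minimal k-nearest-neighbor sketch in Python/NumPy; with k = 1 it reduces to the plain nearest-neighbor rule. The data and function name are made up for the example.

```python
import numpy as np

def knn_classify(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)    # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                     # majority label

# toy labeled database: two classes in 2D
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.8, 0.9]), k=3))   # -> 1
```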
Fisher Linear Discriminant
Objective: find a linear projection that separates the data clusters.
A poorly chosen projection gives poor separation; a well-chosen one gives good separation.
FLD: theory
• Minimize the within-class scatter (make the ellipses small)
• Maximize the between-class scatter (maximize the distance between ellipses)
FLD: Problem formulation
$N$ vectors (features, images): $[x_1, \ldots, x_N]$
$K$ classes: $[c_1, \ldots, c_K]$
Mean of all vectors: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Mean of each class: $\mu_k = \frac{1}{N_k}\sum_{x_i \in c_k} x_i$
FLD: scatter matrices
Scatter of class $k$: $S_k = \sum_{x_i \in c_k} (x_i - \mu_k)(x_i - \mu_k)^T$
Within-class scatter: $S_W = \sum_{k=1}^{K} S_k$
Between-class scatter: $S_B = \sum_{k=1}^{K} N_k (\mu - \mu_k)(\mu - \mu_k)^T$
Total scatter: $S_T = S_W + S_B$
FLD: practice (1)
• After projection: $y_i = W^T x_i$
• Between-class scatter (of the $y$'s): $S_B(y) = W^T S_B(x)\, W$
• Within-class scatter (of the $y$'s): $S_W(y) = W^T S_W(x)\, W$
FLD: practice (2)
The wanted projection:
$W_{\text{optimal}} = \arg\max_W \frac{S_B(y)}{S_W(y)} = \arg\max_W \frac{W^T S_B(x)\, W}{W^T S_W(x)\, W}$
How is it found? As the generalized eigenvectors:
$S_B w_i = \lambda_i S_W w_i, \quad i = 1, \ldots, m$
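To make this recipe concrete, here is a minimal sketch (assumed function name and toy data, not from the original slides) that builds $S_W$ and $S_B$ exactly as defined above and solves the generalized eigenproblem with scipy.linalg.eigh:

```python
import numpy as np
from scipy.linalg import eigh   # solves the generalized problem S_B w = lambda S_W w

def fld(X, labels, m=1):
    """Return the m projection directions maximizing between/within class scatter."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for k in np.unique(labels):
        Xk = X[labels == k]
        mu_k = Xk.mean(axis=0)
        S_W += (Xk - mu_k).T @ (Xk - mu_k)                # within-class scatter
        S_B += len(Xk) * np.outer(mu - mu_k, mu - mu_k)   # between-class scatter
    # eigh(S_B, S_W) solves S_B w = lambda S_W w; keep the top-m eigenvectors
    # (if S_W is singular, a small ridge can be added to its diagonal)
    eigvals, eigvecs = eigh(S_B, S_W)
    return eigvecs[:, np.argsort(eigvals)[::-1][:m]]      # columns are w_1 .. w_m

# toy 2-class data in 2D, projected onto a single FLD direction
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + [4, 1]])
labels = np.array([0] * 20 + [1] * 20)
W = fld(X, labels, m=1)
y = X @ W          # projected 1-D representation with well-separated classes
```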
FLD: example result
• Face recognition under varying lighting and expression
Linear separators
$h(x) = \begin{cases} +1 & \text{if } f(x) > 0 \\ -1 & \text{if } f(x) < 0 \end{cases}$
where the hyperplane is $f(x) = w \cdot x + b$.
SVM = Support Vector Machines
Slides adapted from Andrew Moore (http://www.cs.cmu.edu/~awm/tutorials) and Mingyue Tan, The University of British Columbia.
Linear Classifiers
[Figure: input $x$ and parameters $(w, b)$ feed the classifier $f$, producing the estimate $y_{\text{est}}$.]
$f(x, w, b) = \mathrm{sign}(w \cdot x + b)$: points with $w \cdot x + b > 0$ are labeled $y = +1$, points with $w \cdot x + b < 0$ are labeled $y = -1$.
How would you classify this data? Any of these separating lines would be fine.. but which is best? (A badly placed boundary misclassifies points into the +1 class.)
Classifier Margin
$f(x, w, b) = \mathrm{sign}(w \cdot x + b)$
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
Support Vectors are those datapoints that the margin pushes up against.
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only the support vectors are important; the other training examples are ignorable.
3. Empirically it works very very well.
Linear SVM
Linear SVM Mathematically
$M$ = margin width; $x^+$ and $x^-$ are points on the two margin boundaries.
What we know:
• $w \cdot x^+ + b = +1$
• $w \cdot x^- + b = -1$
• $w \cdot (x^+ - x^-) = 2$
$M = \frac{(x^+ - x^-) \cdot w}{\|w\|} = \frac{2}{\|w\|}$
Linear SVM Mathematically
Goal: 1) Correctly classify all training data:
$w \cdot x_i + b \ge +1$ if $y_i = +1$
$w \cdot x_i + b \le -1$ if $y_i = -1$
i.e. $y_i (w \cdot x_i + b) \ge 1$ for all $i$
2) Maximize the margin $M = \frac{2}{\|w\|}$, which is the same as minimizing $\frac{1}{2} w^T w$.
We can formulate a Quadratic Optimization Problem and solve for $w$ and $b$:
$\min_{w,b} \; \Phi(w) = \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 \;\; \forall i$
Solving the Optimization Problem
$\min_{w,b} \; \Phi(w) = \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 \;\; \forall i$
Optimize a quadratic function subject to linear constraints.
There are many existing algorithms for quadratic optimization, e.g., quadratic programming solvers.
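To make the "solve it with quadratic programming" step concrete, here is a minimal sketch that feeds exactly this QP to the off-the-shelf cvxopt solver; the toy data, the function name and the tiny ridge on $b$ are assumptions made for this example, not part of the original slides.

```python
import numpy as np
from cvxopt import matrix, solvers   # generic QP solver (assumed installed)

def hard_margin_svm(X, y):
    """Solve min 1/2 ||w||^2  s.t.  y_i (w . x_i + b) >= 1, over z = [w; b]."""
    n, d = X.shape
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)          # penalize only w ...
    P[d, d] = 1e-8                 # ... tiny ridge on b keeps the QP well-posed
    q = np.zeros(d + 1)
    # constraints y_i (w . x_i + b) >= 1 rewritten in the solver's form G z <= h
    G = -np.hstack([y[:, None] * X, y[:, None]])
    h = -np.ones(n)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:d], z[d]             # w, b

# toy separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print("predictions:", np.sign(X @ w + b))
print("margin width M = 2/||w|| =", 2 / np.linalg.norm(w))
```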
Dataset with noise
Hard margin: so far we required that all data points be classified correctly (no training error).
What if the training set is noisy?
- Solution 1: use very powerful kernels. OVERFITTING!
Soft Margin Classification
Slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples.
What should our quadratic optimization criterion be? Minimize
$\frac{1}{2}\, w \cdot w + C \sum_{k=1}^{R} \xi_k$
Hard Margin vs. Soft Margin
The old formulation:
$\min_{w,b} \; \Phi(w) = \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 \;\; \forall i$
The new formulation, incorporating slack variables:
$\min_{w,b} \; \Phi(w) = \frac{1}{2} w^T w + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0 \;\; \forall i$
The parameter $C$ can be viewed as a way to control overfitting.
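As a rough illustration of how $C$ controls overfitting, the sketch below trains scikit-learn's linear SVC on made-up noisy data with a few arbitrary $C$ values; small $C$ tolerates more slack (wider margin), while large $C$ approaches the hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# noisy, almost linearly separable data (made-up example)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-1, -1], 0.8, (50, 2)), rng.normal([1, 1], 0.8, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # margin width is 2 / ||w||, as derived on the earlier slide
    print(f"C={C:>6}: margin width = {2 / np.linalg.norm(clf.coef_[0]):.3f}, "
          f"training accuracy = {clf.score(X, y):.2f}")
```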
Dual form
Primal form:
$\min_{w,b} \; \Phi(w) = \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 \;\; \forall i$
An alternative solution involves constructing a dual problem, in which a Lagrange multiplier $\alpha_i$ is associated with every constraint of the primal problem:
Dual form:
$\max_\alpha \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \alpha_i \ge 0 \text{ and } \sum_{i=1}^{n} \alpha_i y_i = 0$
The Optimization Problem Solution
The solution has the form:
$w = \sum_i \alpha_i y_i x_i$, and $b = y_k - w^T x_k$ for any $x_k$ such that $\alpha_k \ne 0$.
Each non-zero $\alpha_i$ indicates that the corresponding $x_i$ is a support vector.
Then the classifying function has the form:
$f(x) = \sum_i \alpha_i y_i x_i^T x + b$
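One way to see this solution in practice (a sketch, not from the slides): scikit-learn's SVC exposes the dual solution through dual_coef_ (the products $\alpha_i y_i$ of the support vectors), support_vectors_ and intercept_, from which $w$, $b$ and $f(x)$ can be rebuilt; the toy data below is made up.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)     # very large C approximates the hard margin

alpha_y = clf.dual_coef_[0]                     # alpha_i * y_i, one entry per support vector
sv = clf.support_vectors_
w = alpha_y @ sv                                # w = sum_i alpha_i y_i x_i
b = clf.intercept_[0]

x_test = np.array([1.0, 0.5])
print(np.sign(w @ x_test + b))                  # f(x) = sum_i alpha_i y_i x_i^T x + b
print(clf.predict([x_test])[0])                 # matches the library's own prediction
```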
Why dual form?
$\max_\alpha \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \alpha_i \ge 0 \text{ and } \sum_{i=1}^{n} \alpha_i y_i = 0$
$f(x) = \sum_i \alpha_i y_i x_i^T x + b$
In training we only have to compute the inner products $x_i^T x_j$ between all pairs of training points.
Query classification relies on an inner product between the test point $x$ and the support vectors.
Let's see why this is useful.
Non-linear SVMs
Datasets that are linearly separable with some noise work out great.
But what are we going to do if the dataset is just too hard?
How about mapping the data to a higher-dimensional space, e.g. lifting a 1-D input $x$ to $(x, x^2)$?
Non-linear SVMs: Feature spaces
General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
$\Phi: x \to \varphi(x)$
The Kernel "trick"
The linear classifier solution relies only on the dot products $K(x_i, x_j) = x_i^T x_j$.
After the transformation $\Phi: x \to \varphi(x)$, the dot product becomes $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$.
We only need to compute the kernel values $K$, never $\varphi$ explicitly.
A kernel function is a function that corresponds to an inner product in some expanded feature space.
Kernel example
Let $x = [x_1 \; x_2]$ and $K(x_i, x_j) = (1 + x_i^T x_j)^2$. We need to show that $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$:
$K(x_i, x_j) = (1 + x_i^T x_j)^2$
$= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$
$= [1 \;\; x_{i1}^2 \;\; \sqrt{2}\, x_{i1} x_{i2} \;\; x_{i2}^2 \;\; \sqrt{2} x_{i1} \;\; \sqrt{2} x_{i2}] \, [1 \;\; x_{j1}^2 \;\; \sqrt{2}\, x_{j1} x_{j2} \;\; x_{j2}^2 \;\; \sqrt{2} x_{j1} \;\; \sqrt{2} x_{j2}]^T$
$= \varphi(x_i)^T \varphi(x_j)$,
where $\varphi(x) = [1 \;\; x_1^2 \;\; \sqrt{2}\, x_1 x_2 \;\; x_2^2 \;\; \sqrt{2} x_1 \;\; \sqrt{2} x_2]$.
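A quick numeric check of this identity (the two test vectors are chosen arbitrarily for the example):

```python
import numpy as np

def phi(x):
    """Explicit feature map matching the quadratic kernel (1 + x^T y)^2 in 2D."""
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi = np.array([0.5, -1.0])
xj = np.array([2.0, 0.3])
print((1 + xi @ xj) ** 2)    # kernel value computed directly
print(phi(xi) @ phi(xj))     # same value via the explicit feature map
```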
What Functions are Kernels?
Mercer's theorem: every positive semi-definite symmetric function is a kernel.
Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:
$$K = \begin{bmatrix} K(x_1, x_1) & K(x_1, x_2) & K(x_1, x_3) & \cdots & K(x_1, x_N) \\ K(x_2, x_1) & K(x_2, x_2) & K(x_2, x_3) & \cdots & K(x_2, x_N) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ K(x_N, x_1) & K(x_N, x_2) & K(x_N, x_3) & \cdots & K(x_N, x_N) \end{bmatrix}$$
Examples of Kernel Functions
Linear: $K(x_i, x_j) = x_i^T x_j$
Polynomial of power $p$: $K(x_i, x_j) = (1 + x_i^T x_j)^p$
Gaussian (radial-basis function network): $K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
Sigmoid: $K(x_i, x_j) = \tanh(\beta_0 x_i^T x_j + \beta_1)$
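For illustration, a small sketch (with made-up data) that builds the Gram matrix for some of these kernels and checks the Mercer condition from the previous slide, i.e. that the matrix is symmetric and positive semi-definite:

```python
import numpy as np

def gram_matrix(X, kernel):
    """Build the N x N Gram matrix K[i, j] = kernel(x_i, x_j)."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

kernels = {
    "linear":     lambda a, b: a @ b,
    "poly (p=2)": lambda a, b: (1 + a @ b) ** 2,
    "gaussian":   lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2 * 1.0 ** 2)),  # sigma = 1
}

X = np.random.randn(5, 3)            # five arbitrary 3-D points
for name, k in kernels.items():
    K = gram_matrix(X, k)
    # a valid (Mercer) kernel yields a symmetric, positive semi-definite Gram matrix
    print(name, np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) >= -1e-9))
```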
Non-linear SVMs Mathematically
Dual problem formulation:
$\max_\alpha \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \alpha_i \ge 0 \text{ and } \sum_{i=1}^{n} \alpha_i y_i = 0$
The solution is:
$f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$
The optimization techniques for finding the $\alpha_i$'s remain the same!
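A hedged end-to-end sketch of the non-linear case (not from the slides), using scikit-learn's SVC on a synthetic two-rings dataset; the gamma and C values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# a dataset that no straight line can separate
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm    = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)   # Gaussian kernel

print("linear kernel accuracy:", linear_svm.score(X, y))     # roughly chance level
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))        # close to 1.0
```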
Nonlinear SVM - Overview
The SVM locates a separating hyperplane in the feature space and classifies points in that space.
It does not need to represent the feature space explicitly; it only needs to define a kernel function.
The kernel function plays the role of the dot product in the feature space.
Properties of SVM
• Flexibility in choosing a similarity function
• Sparseness of the solution when dealing with large data sets: only support vectors are used to specify the separating hyperplane
• Ability to handle large feature spaces: the complexity does not depend on the dimensionality of the feature space
• Overfitting can be controlled by the soft margin approach
• Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution
• Feature selection
SVM Applications
• SVM has been used successfully in many real-world problems:
- text (and hypertext) categorization
- image classification
- bioinformatics (protein classification, cancer classification)
- hand-written character recognition
Weakness of SVM
• It is sensitive to noise: a relatively small number of mislabeled examples can dramatically decrease the performance.
• It only considers two classes. How to do multi-class classification with SVM?
Answer:
1) With output arity m, learn m SVMs:
• SVM 1 learns "Output == 1" vs. "Output != 1"
• SVM 2 learns "Output == 2" vs. "Output != 2"
• ...
• SVM m learns "Output == m" vs. "Output != m"
2) To predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
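A minimal sketch of this one-vs-rest recipe in scikit-learn (the iris data and linear kernel are arbitrary choices for the example, not part of the original slides):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# one-vs-rest: with output arity m, learn m binary SVMs
X, y = load_iris(return_X_y=True)
classes = np.unique(y)
svms = {c: SVC(kernel="linear", C=1.0).fit(X, (y == c).astype(int) * 2 - 1)
        for c in classes}                 # SVM c learns "Output == c" vs "Output != c"

def predict(x):
    # pick the class whose SVM pushes x furthest into its positive region
    scores = {c: svm.decision_function(x.reshape(1, -1))[0] for c, svm in svms.items()}
    return max(scores, key=scores.get)

print(predict(X[0]), y[0])
```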