
In–sample Model Selection for Support Vector Machines
Davide Anguita, Member, IEEE, Alessandro Ghio, Member, IEEE,
Luca Oneto, Student Member, IEEE, and Sandro Ridella, Member, IEEE
Abstract— In-sample model selection for Support Vector Machines is a promising approach that allows using the training set both for learning the classifier and for tuning its hyperparameters. This is a welcome improvement with respect to out-of-sample methods, like cross-validation, which require removing some samples from the training set and using them only for model selection purposes. Unfortunately, in-sample methods require a precise control of the classifier function space, which can be achieved only through an unconventional SVM formulation, based on Ivanov regularization. We prove in this work that, even in this case, it is possible to exploit well-known Quadratic Programming solvers like, for example, Sequential Minimal Optimization, thus improving the applicability of the in-sample approach.
I. INTRODUCTION

The Support Vector Machine (SVM) algorithm [1] has stood out in recent years as one of the most effective techniques for facing classification tasks. The widespread diffusion of SVMs resulted mostly from their successful application in many real-world problems, spanning heterogeneous domains (e.g. text categorization [2], computational biology [3], etc.). The success of the SVM algorithm is mainly due to two factors: (i) it is possible to optimize the trade-off between under- and over-fitting capabilities of the model by suitably controlling the margin [1]; and (ii) the coefficients of the classifier can be found by solving a Convex Constrained Quadratic Programming (CCQP) problem, which admits only a global minimum. In particular, the convexity of the SVM training problem amplifies the practical effectiveness of SVMs with respect to, for example, traditional Artificial Neural Networks (ANNs) [4], which require the solution of difficult non-linear optimization problems.
Unfortunately, the search for the solution of the CCQP does not conclude the SVM learning phase, which consists
of two steps: an efficient QP solver is used for training the
classifier (i.e. in order to solve the CCQP problem), while
a tool for tuning a set of additional variables (the SVM
hyperparameters) must be exploited in order to select the
classifier characterized by the best performance. This last
step is known as the model selection phase of SVM and is
strictly linked to the estimation of the generalization error,
namely the misclassification rate attainable by the model on
new and previously unobserved patterns. In fact, the best
model can be chosen according to the lowest generalization
error [5].
Davide Anguita, Alessandro Ghio, Luca Oneto and Sandro
Ridella are with the Department of Biophysical and Electronic
Engineering, University of Genova, Via Opera Pia 11A, I-16145
Genova, Italy (email: {Davide.Anguita, Alessandro.Ghio, Luca.Oneto,
Sandro.Ridella}@unige.it).
The underlying problem of model selection through generalization error estimation is of major importance. Empirical approaches, like K-fold Cross Validation [6], Bootstrap [7] or Leave-One-Out [8], are often exploited for this purpose: they require splitting the dataset into two subsets, one for solving the CCQP problem (i.e. the training set) and the remaining one for estimating the generalization error and performing the model selection (i.e. the hold-out set). Despite being effective in many applications, these techniques appear to be unreliable when applied to small sample problems [9], [10], [11], i.e. datasets where the number of patterns is small compared to the dimensionality of the data. This is the case, for example, of microarray data, where a few tens of samples, each composed of several thousand features, are available: in these cases, in-sample methods would be preferable. Their main advantage, with respect to empirical approaches, is the use of the whole set of available data both for training the model and for estimating the generalization error. Moreover, they provide deep insights into the classification algorithms and are based on rigorous approaches that give a prediction, in probability, of the generalization ability of a classifier.
In recent years, several in-sample approaches have appeared in the literature [12], [13], [14], [15], but the underlying hypotheses, which must be fulfilled for the consistency of the estimation, are seldom satisfied in practice, and the resulting generalization estimate can be very pessimistic [16]. However, in recent works [9], [17], a method for practically applying an in-sample approach, the Maximal Discrepancy (MD) [15], has been proposed. The MD-based approach often guarantees better performance in the small sample setting with respect to empirical techniques and, therefore, represents an effective solution to the SVM model selection problem.

Unfortunately, as highlighted in [9], [17], in order to apply the MD approach to the SVM, an alternative formulation, based on Ivanov regularization [20], must be used, therefore giving up the conventional formulation, based on Tikhonov regularization [18], for which very efficient solvers have been developed throughout the years [19]. To the best knowledge of the authors, no specific procedures have been proposed for finding a solution of the Ivanov SVM formulation, despite the fact that this formulation appears in Vapnik's original work. Therefore, in Sections IV and V, we propose a new method, which relies on conventional QP solvers for solving the Ivanov CCQP optimization problem, thus allowing us to provide a complete and effective tool (MD model selection & training, using conventional solvers) for SVM learning in the small sample setting.
II. THE MAXIMAL DISCREPANCY OF A CLASSIFIER

Let us consider a sequence of n independent and identically distributed (i.i.d.) patterns D_n = {(x_1, y_1), ..., (x_n, y_n)}, where x ∈ X ⊆ R^d and y ∈ Y ⊆ R, sampled according to an unknown distribution P(x, y). In particular, as we analyze binary classification problems, we set Y = {±1}.

Let us fix the hypothesis space (i.e. the class of functions) F and select a classifier f(x) within it. Then, the error attainable by the classifier on the whole population L(f), i.e. the generalization error, is defined as:

    L(f) = E_{(X,Y)} \ell(f(x), y)    (1)

where \ell(f(x), y): X × Y → [0, 1] is a suitable bounded loss function. Since P(x, y) is unknown, we cannot directly evaluate L(f). On the contrary, we can compute the empirical error of f ∈ F:

    \hat{L}_n(f) = (1/n) \sum_{i=1}^{n} \ell(f(x_i), y_i).    (2)

Clearly, \hat{L}_n(f) estimates L(f) but is afflicted by a bias: as shown in [15], a penalty term should somehow take into account the complexity of the hypothesis space F to which the classifier f belongs.

The Maximal Discrepancy (MD) is a statistical tool which allows computing this bias and, thus, rigorously upper-bounding the generalization error by exploiting the available data. Let us split D_n in two halves and compute the two empirical errors:

    \hat{L}^{(1)}_{n/2}(f) = (2/n) \sum_{i=1}^{n/2} \ell(f(x_i), y_i)    (3)
    \hat{L}^{(2)}_{n/2}(f) = (2/n) \sum_{i=n/2+1}^{n} \ell(f(x_i), y_i).    (4)

The Maximal Discrepancy is:

    \hat{M}_n(F) = \max_{f ∈ F} [ \hat{L}^{(1)}_{n/2}(f) - \hat{L}^{(2)}_{n/2}(f) ].    (5)

Then, we obtain the following upper bound for the generalization error of a classifier f ∈ F [9], [17], which holds with probability (1 - δ):

    L(f) ≤ \hat{L}_n(f) + \hat{M}_n(F) + Δ(δ, n),    (6)

where

    Δ(δ, n) = 3 \sqrt{ \log(2/δ) / (2n) }    (7)

is the confidence term, which is independent from both the hypothesis space and the classifier, but depends on the number of patterns of the training set.

If the loss function is bounded and symmetric, such that \ell(f(x_i), -y_i) = 1 - \ell(f(x_i), y_i), the value of MD can be computed by a conventional empirical minimization procedure. In fact, a new dataset D'_n can be created by simply flipping the labels of half of the patterns, so that:

    \hat{M}_n(F) = 1 - 2 \min_{f ∈ F} \hat{L}'_n(f),    (8)

where \hat{L}'_n(f) is the empirical error on D'_n.
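Once a minimizer of the bounded empirical error over F is available, the quantities in Eqns. (6)-(8) are simple to assemble. The following Python sketch illustrates the computation; min_empirical_error is a placeholder (an assumption, not part of the paper) for a routine returning the minimum of the empirical error over F, e.g. an I-SVM trained with the peeling procedure of [9], [17].

```python
import numpy as np

def confidence_term(delta, n):
    """Delta(delta, n) of Eq. (7): 3 * sqrt(log(2/delta) / (2n))."""
    return 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * n))

def md_bound(X, y, min_empirical_error, delta=0.05):
    """Upper bound of Eq. (6), with the MD term computed as in Eq. (8).

    `min_empirical_error(X, y)` is a placeholder for a routine returning
    min over F of the empirical (bounded) error on (X, y).
    """
    n = len(y)
    # D'_n: flip the labels of half of the patterns (Eq. (8)).
    y_flipped = y.copy()
    y_flipped[n // 2:] *= -1
    md = 1.0 - 2.0 * min_empirical_error(X, y_flipped)   # \hat{M}_n(F)
    emp = min_empirical_error(X, y)                       # \hat{L}_n(f)
    return emp + md + confidence_term(delta, n)           # Eq. (6)
```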
III. THE SUPPORT VECTOR MACHINE

For the sake of simplicity, as we are working in the small sample regime, we focus here on the linear Support Vector Machine (SVM):

    f(x) = w^T x + b.    (9)

We will analyze in Section VII how the results can be generalized to the non-linear case, through the usual kernel trick [1]. The weights w ∈ R^d and the bias b ∈ R are found by solving the following primal CCQP problem, based on Tikhonov regularization [14]:

    \min_{w,b,ξ}  (1/2) ||w||^2 + C e^T ξ    (10)
                 y_i (w^T x_i + b) ≥ 1 - ξ_i    ∀i ∈ [1, ..., n]
                 ξ_i ≥ 0    ∀i ∈ [1, ..., n]

where e_i = 1 ∀i and C is a hyperparameter that must be chosen through a model selection procedure (e.g. Maximal Discrepancy). The first term (1/2)||w||^2 is the regularization term, which forces the classifier to attain a large margin, while ξ are slack variables introduced to upper-bound the number of misclassifications according to the hinge loss function \ell_ξ(f(x), y):

    \ell_ξ(f(x), y) = [1 - y_i f(x_i)]_+ / 2 = ξ_i / 2,    (11)

where [·]_+ = max(0, ·) [1]. By introducing n Lagrange multipliers α ∈ R^n, it is possible to write problem (10) in its dual form, for which efficient QP solvers have been proposed in the literature [19]:

    \min_{α}  (1/2) \sum_{i=1}^{n} \sum_{j=1}^{n} α_i α_j q_{ij} - \sum_{i=1}^{n} α_i    (12)
              0 ≤ α_i ≤ C    ∀i ∈ [1, ..., n]
              y^T α = 0,

where

    q_{ij} = y_i y_j x_i^T x_j.    (13)

After solving problem (12), the Lagrange multipliers can be used to define the SVM classifier:

    f(x) = \sum_{i=1}^{n} y_i α_i x_i^T x + b.    (14)

The patterns characterized by α_i > 0 (or, in other words, by y_i f(x_i) ≤ 1) are called Support Vectors (SVs).
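Problem (12) is the standard SVM dual, so any off-the-shelf package can be used to obtain the multipliers of Eq. (14). As a small sanity check (not part of the original paper), the following sketch relies on scikit-learn's SVC, whose dual_coef_ attribute stores y_i α_i for the Support Vectors, and verifies that the classifier of Eq. (14) matches the library's decision function.

```python
import numpy as np
from sklearn.svm import SVC

def decision_function_from_dual(clf, X_train, x):
    """Evaluate f(x) of Eq. (14) from the dual solution of problem (12)."""
    sv = X_train[clf.support_]          # support vectors (alpha_i > 0)
    coef = clf.dual_coef_.ravel()       # y_i * alpha_i for each SV
    return np.dot(coef, sv @ x) + clf.intercept_[0]

# Minimal usage example on toy data.
rng = np.random.RandomState(0)
X = rng.randn(40, 5)
y = np.where(X[:, 0] + 0.1 * rng.randn(40) > 0, 1, -1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)
x_new = rng.randn(5)
assert np.isclose(decision_function_from_dual(clf, X, x_new),
                  clf.decision_function(x_new.reshape(1, -1))[0])
```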
Ideally, to apply the Maximal Discrepancy approach, we could choose \ell_ξ as the loss function for computing \min_{f ∈ F} \hat{L}_n(f) = (1/n) \sum_{i=1}^{n} \ell_ξ(f(x_i), y_i) and \hat{L}'_n(f) = (1/n) \sum_{i=1}^{n} \ell_ξ(f(x_i), y_i) in Eqns. (6) and (8), respectively [9]. However, three problems arise: (i) the hinge loss is not bounded and, thus, cannot be directly used in Eq. (8); (ii) the value \sum_{i=1}^{n} ξ_i is not minimized by problems (10) and (12); (iii) it is not straightforward to find a simple relation between C and the hypothesis space F in the conventional formulations (10) and (12). The peeling approach, proposed in [9], [17], allows circumventing the first issue by iteratively removing the Critical SVs (CSVs) that cause the unboundedness of \ell_ξ, i.e. the patterns for which y_i f(x_i) < -1. Concerning problems (ii) and (iii), we propose to use an alternative SVM formulation [14], based on Ivanov regularization, which minimizes the error (\min_{f ∈ F} \sum_{i=1}^{n} ξ_i) and where F is defined as the set of functions with ||w||^2 ≤ w^2_{MAX} and b ∈ (-∞, +∞):

    \min_{w,b,ξ}  e^T ξ    (15)
                 ||w||^2 ≤ w^2_{MAX}
                 y_i (w^T x_i + b) ≥ 1 - ξ_i    ∀i ∈ [1, ..., n]
                 ξ_i ≥ 0    ∀i ∈ [1, ..., n]

This formulation was proposed by Vapnik himself [14] but, to the best knowledge of the authors, no dedicated procedures have been developed for solving it. In [9], a heuristic solution, based on the ideas of [21], has been proposed, but it does not allow exploiting conventional optimization algorithms and, most of all, does not allow introducing the kernel non-linearity. In this work, we prove that the well-known Sequential Minimal Optimization algorithm [22], [23] can be exploited for directly solving problem (15), by detailing a relationship between Ivanov and Tikhonov regularization.
IV. IVANOV AND TIKHONOV REGULARIZATION

Let us define, for the sake of brevity, the Tikhonov regularization based formulation of (10) as T-SVM and the Ivanov regularization based formulation of (15) as I-SVM. We want to solve I-SVM by using the efficient and reliable optimization tools developed for T-SVM. For this purpose we define as (w^C, b^C, ξ^C) the solution of T-SVM, given a fixed C, and we introduce some preliminary properties that we need to prove, for reasons that will become clearer in the next section, that:

Property 1) T-SVM and I-SVM are equivalent for some value of C;
Property 2) If we consider two distinct values of C, C_1 and C_2, such that C_2 > C_1, we have that:

    \sum_{i=1}^{n} ξ_i^{C_2} < \sum_{i=1}^{n} ξ_i^{C_1}    if ||w^{C_2}||^2 > ||w^{C_1}||^2
    \sum_{i=1}^{n} ξ_i^{C_2} = \sum_{i=1}^{n} ξ_i^{C_1}    if ||w^{C_2}||^2 = ||w^{C_1}||^2    (16)

in other words, if the hyperparameter C grows, then both the amount of misclassification and the size of the margin decrease or remain the same.
Property 3) In T-SVM, if ||w^{C_2}||^2 = ||w^{C_1}||^2 for C_2 > C_1, then ∀C_U > C_1 we have ||w^{C_U}||^2 = ||w^{C_1}||^2. This means that if the margin stops decreasing, it will remain the same regardless of the value of C.
In the following sub-section, we will prove the previous properties, which also give us interesting insights into the two different formulations of the SVM problem. The achieved results will be exploited for formulating the algorithm which allows solving I-SVM using the conventional QP solvers designed for T-SVM. For improving the readability of the paper, the proofs are presented in the Appendix.
A. I-SVM and T-SVM: Properties
The following theorem allows proving the previously presented Property 1.
Theorem 1: The I-SVM problem (15) is equivalent to the conventional T-SVM problem (10) for some value of C [14].
Theorem 2: Let us consider the I-SVM and T-SVM problems. Let (w^{w_{MAX}}, b^{w_{MAX}}, ξ^{w_{MAX}}) and (w^C, b^C, ξ^C) be the solutions of, respectively, I-SVM and T-SVM. If ||w^C||^2 = ||w^{w_{MAX}}||^2 for a given C, then \sum_{i=1}^{n} ξ_i^C = \sum_{i=1}^{n} ξ_i^{w_{MAX}}.
As we proved that the two problems are equivalent for some value of C, we consider only the T-SVM problem in the remainder of this section. The result given by the following theorem is instrumental for demonstrating the second property presented in Section IV.
Theorem 3: Let us consider the T-SVM optimization problem. Given a value of the hyperparameter C = C_1, let us solve problem (10) and let K^{C_1} be the minimum value of the cost function:

    K^{C_1} = ||w^{C_1}||^2 / 2 + C_1 \sum_{i=1}^{n} ξ_i^{C_1}.    (17)

Let us consider a larger value for the hyperparameter C_2 > C_1. Let us compute the cost at the optimum:

    K^{C_2} = ||w^{C_2}||^2 / 2 + C_2 \sum_{i=1}^{n} ξ_i^{C_2}.    (18)

Then:

    K^{C_2} ≥ K^{C_1}.    (19)

Corollary 1: Given the same hypotheses of Theorem 3, if:

    \sum_{i=1}^{n} ξ_i^{C_2} ≠ 0,    (20)

then

    K^{C_2} > K^{C_1}.    (21)

By exploiting the results of the previous theorems, we can prove Property 2.

Theorem 4: Let us consider the T-SVM optimization problem. Given two values of the hyperparameter C such that C_2 > C_1, let us solve problem (10) and let K^{C_1} and K^{C_2} be the minimum values of the cost function. As, from Theorem 3, K^{C_2} ≥ K^{C_1}, then:

    \sum_{i=1}^{n} ξ_i^{C_2} < \sum_{i=1}^{n} ξ_i^{C_1}    if ||w^{C_2}||^2 > ||w^{C_1}||^2
    \sum_{i=1}^{n} ξ_i^{C_2} = \sum_{i=1}^{n} ξ_i^{C_1}    if ||w^{C_2}||^2 = ||w^{C_1}||^2.    (22)

Corollary 2: Given the same hypotheses of Theorem 4, then ||w^{C_2}||^2 ≥ ||w^{C_1}||^2.

Corollary 3: Given the same hypotheses of Theorem 4, if ||w^{C_2}||^2 = ||w^{C_1}||^2, then ∀C_S such that C_1 ≤ C_S ≤ C_2:

    ||w^{C_S}||^2 = ||w^{C_2}||^2 = ||w^{C_1}||^2    (23)
    \sum_{i=1}^{n} ξ_i^{C_S} = \sum_{i=1}^{n} ξ_i^{C_2} = \sum_{i=1}^{n} ξ_i^{C_1}.    (24)

The following Corollary allows us to prove Property 3.

Corollary 4: Given the same hypotheses of Theorem 4, if ||w^{C_2}||^2 = ||w^{C_1}||^2 and \sum_{i=1}^{n} ξ_i^{C_2} = \sum_{i=1}^{n} ξ_i^{C_1}, then ∀C_U ∈ [C_1, +∞):

    ||w^{C_U}||^2 = ||w^{C_2}||^2 = ||w^{C_1}||^2    (25)
    \sum_{i=1}^{n} ξ_i^{C_U} = \sum_{i=1}^{n} ξ_i^{C_2} = \sum_{i=1}^{n} ξ_i^{C_1}.    (26)
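As a numerical illustration of Theorem 4 and its corollaries (not part of the original paper), the following sketch trains a linear T-SVM for increasing values of C on synthetic data and prints ||w^C||^2 and the total slack, which should be non-decreasing and non-increasing in C, respectively; scikit-learn's SVC is used here purely for convenience.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = rng.randn(60, 10)
y = np.where(X[:, 0] + 0.5 * rng.randn(60) > 0, 1, -1)

for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_.ravel()
    margins = clf.decision_function(X)            # f(x_i)
    slacks = np.maximum(0.0, 1.0 - y * margins)   # xi_i = [1 - y_i f(x_i)]_+
    print(f"C={C:8.2f}  ||w||^2={np.dot(w, w):8.4f}  sum(xi)={slacks.sum():8.4f}")
```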
V. EXPLOITING EFFICIENT QP SOLVERS FOR I-SVM

In the previous section, some interesting properties have been proved. In particular:
• Theorem 1 states that the I-SVM and the conventional T-SVM problems are equivalent for a given (and finite) value of the hyperparameter C;
• Theorem 2 states that if, for a given value of C, we obtain the same value of the margin with T-SVM and I-SVM, then the upper bound on the number of misclassifications in T-SVM is minimized too, as \sum_{i=1}^{n} ξ_i^C = \sum_{i=1}^{n} ξ_i^{w_{MAX}} = \min_{f ∈ F} \sum_{i=1}^{n} ξ_i;
• Theorem 4 and its corollaries have been used for demonstrating the monotonicity of the trend of the optimal value ||w^C||^2 when varying C in T-SVM. In particular, if, for C_2 > C_1, ||w^{C_2}||^2 = ||w^{C_1}||^2 and \sum_{i=1}^{n} ξ_i^{C_2} = \sum_{i=1}^{n} ξ_i^{C_1}, then ∀C_U ∈ [C_1, +∞) we have that ||w^{C_U}||^2 = ||w^{C_1}||^2 and \sum_{i=1}^{n} ξ_i^{C_U} = \sum_{i=1}^{n} ξ_i^{C_1}.

In other words, we want to search for the optimal value of C (namely C_opt) which allows us to find the solution of I-SVM exploiting the conventional T-SVM formulation (Theorems 1 and 2). Then, we start with an initialization value C = C_start (e.g. C_start = 1) and solve the conventional CCQP problem (12), using the efficient algorithms available in the literature:
• if ||w^{C_start}||^2 > w^2_{MAX}, the solution of problem (15) will lie on the hypothesis space boundary, i.e. ||w^{C_opt}||^2 = w^2_{MAX}. Then, we set C_up = C_start and we decrease C, exploiting the results of Theorem 4 and its corollaries, until ||w^C||^2 ≤ w^2_{MAX}, looking for a feasible C_low. After a C_low and a C_up have been identified, we can simply apply a bisection method for finding C_opt, as ||w^C||^2 is monotonic with respect to the variations of C;
• if ||w^{C_start}||^2 ≤ w^2_{MAX}, we set C_low = C_start and we increase C:
  – if, for two consecutive values of C, we find the same ||w^C||^2, we have that C_opt = C, as stated by Corollary 4, and ||w^{C_opt}||^2 ≤ w^2_{MAX}. Then, the algorithm returns (w^C, b^C, ξ^C);
  – otherwise, ||w^{C_opt}||^2 = w^2_{MAX} and we look for a feasible C_up. After a C_low and a C_up have been identified, we can apply the bisection method for searching C_opt.

The proposed method is presented in Algorithm 1: it is worth noting that only conventional QP solvers are required for problem (12) (the QPSVM routine).

Algorithm 1 The algorithm for solving Eq. (15) using only conventional and efficient QP solvers.
Input: a dataset D_n, w^2_{MAX}, a tolerance τ
Output: w, b, ξ
  C = 1
  {w, b, ξ} = QPSVM(C)
  if ||w||^2 > w^2_{MAX} then
    C_up = C
    while ||w||^2 > w^2_{MAX} do
      C = C / 2
      {w, b, ξ} = QPSVM(C)
      if ||w||^2 < w^2_{MAX} then
        C_low = C
        exit from the cycle
      end if
    end while
  else
    C_low = C
    while ||w||^2 < w^2_{MAX} do
      C = C * 2
      w_old = w_new
      {w, b, ξ} = QPSVM(C)
      if ||w||^2 > w^2_{MAX} then
        C_up = C
        exit from the cycle
      end if
      w_new = ||w||^2
      if w_new - w_old < τ then
        Return {w, b, ξ}
      end if
    end while
  end if
  while | ||w||^2 - w^2_{MAX} | > τ do
    C = C_low + (C_up - C_low) / 2
    {w, b, ξ} = QPSVM(C)
    if ||w||^2 > w^2_{MAX} then
      C_up = C
    end if
    if ||w||^2 < w^2_{MAX} then
      C_low = C
    end if
  end while
  Return {w, b, ξ}
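A compact Python rendering of Algorithm 1 is sketched below for illustration only; the routine qp_svm is a placeholder (an assumption, not part of the paper) for any conventional T-SVM solver, e.g. an SMO implementation, returning (w, b, ξ) for a given C.

```python
import numpy as np

def solve_isvm(qp_svm, w_max_sq, tol=1e-4, c_start=1.0):
    """Sketch of Algorithm 1: solve the Ivanov formulation (15) by
    bisection on the Tikhonov hyperparameter C of problem (10)."""
    C = c_start
    w, b, xi = qp_svm(C)
    if np.dot(w, w) > w_max_sq:
        c_up = C
        while np.dot(w, w) > w_max_sq:          # decrease C until feasible
            C /= 2.0
            w, b, xi = qp_svm(C)
        c_low = C
    else:
        c_low = C
        c_up = None
        norm_old = np.dot(w, w)
        while np.dot(w, w) < w_max_sq:          # increase C towards the boundary
            C *= 2.0
            w, b, xi = qp_svm(C)
            norm_new = np.dot(w, w)
            if norm_new > w_max_sq:
                c_up = C
                break
            if norm_new - norm_old < tol:       # ||w^C||^2 stopped growing (Corollary 4)
                return w, b, xi
            norm_old = norm_new
        if c_up is None:                        # boundary reached without overshooting
            return w, b, xi
    while abs(np.dot(w, w) - w_max_sq) > tol:   # bisection between C_low and C_up
        C = 0.5 * (c_low + c_up)
        w, b, xi = qp_svm(C)
        if np.dot(w, w) > w_max_sq:
            c_up = C
        else:
            c_low = C
    return w, b, xi
```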
VI. EXPERIMENTAL RESULTS

We consider the MNIST [25] and the DaimlerChrysler [26] datasets. The MNIST dataset consists of 62000 images, representing the numbers from 0 to 9: in particular, we consider the 13074 patterns containing 0's and 1's, allowing us to deal with a binary classification problem. The dimensionality of the data is 784 (i.e. 28×28 pixels). The DaimlerChrysler dataset consists of 9800 8-bit grayscale images (36×18 pixels), representing pedestrians crossing a road and non-pedestrian examples. We build the training set by randomly sampling a small number of patterns, varying from 10 to 100, while the remaining images are used as a test set. In order to obtain statistically relevant results, we build a set of 100 replicates using a random sampling technique. Note that the MNIST dataset is characterized by 784 features, while the dimensionality of the DaimlerChrysler problem is 648; in both cases the dimensionality is much higher than the number of samples in each of the training sets, which therefore defines a typical small sample setting.

Table I shows the average misclassification rates obtained on the test sets of the MNIST and DaimlerChrysler datasets, where the experimental setup is the following:
• the data are normalized in the range [0, 1];
• the model selection is performed by searching for the optimal value of w_MAX in the interval [10^-3, 10^2], which includes the cases of interest, among 40 values equally spaced on a logarithmic scale. The optimal model is chosen according to the generalization error, estimated using MD, where the I-SVM problems are solved exploiting Algorithm 1;
• the error on the test set is computed using the optimal model.

TABLE I
TEST ERROR RATES OBTAINED USING MD ON TWO REAL-WORLD DATASETS (MNIST AND DAIMLERCHRYSLER).

   n     MNIST          DaimlerChrysler
   10    2.5 ± 0.6      24.9 ± 0.9
   20    2.4 ± 0.3      29.8 ± 0.7
   40    1.2 ± 0.3      27.2 ± 0.9
   60    0.6 ± 0.1      27.3 ± 0.8
   80    0.7 ± 0.2      25.3 ± 0.8
  100    0.5 ± 0.1      25.7 ± 0.5
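As an illustrative outline (not the authors' code), the model selection loop just described can be sketched as follows; train_and_bound is a placeholder that trains the I-SVM for a given w_MAX (e.g. via the solve_isvm sketch above, followed by the peeling step of [9], [17]) and returns the MD bound of Eq. (6), and the grid mirrors the 40 logarithmically spaced values of w_MAX in [10^-3, 10^2] used in the experiments.

```python
import numpy as np

def select_w_max(X, y, train_and_bound, delta=0.05):
    """Pick the w_MAX value whose MD-based bound (Eq. (6)) is lowest.

    `train_and_bound(X, y, w_max_sq, delta)` is a placeholder returning the
    MD generalization bound for the I-SVM trained with the given w_MAX.
    """
    grid = np.logspace(-3, 2, 40)       # candidate w_MAX values
    bounds = [train_and_bound(X, y, w**2, delta) for w in grid]
    best = int(np.argmin(bounds))
    return grid[best], bounds[best]
```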
VII. EXTENSION TO THE NON-LINEAR CASE

For the sake of simplicity, in this work we focused on the linear Support Vector Machine. However, the obtained results can be generalized to non-linear SVMs as well. By applying a mapping to the input samples

    x → φ(x),    (27)

the SVM training problem based on Ivanov regularization becomes:

    \min_{w,b,ξ}  e^T ξ    (28)
                 ||w||^2 ≤ w^2_{MAX}
                 y_i (w^T φ(x_i) + b) ≥ 1 - ξ_i    ∀i ∈ [1, ..., n]
                 ξ_i ≥ 0    ∀i ∈ [1, ..., n].

By introducing n Lagrange multipliers, we obtain the same dual formulation of (35), where q_{ij} of Eq. (13) is computed by applying the usual kernel trick [1] and exploiting a kernel function K(·, ·):

    q_{ij} = y_i y_j φ(x_i) · φ(x_j) = y_i y_j K(x_i, x_j).    (29)

Analogously, the conventional dual formulation assumes the same form of problem (12), where q_{ij} is computed according to Eq. (29).

As all the theorems presented in Section IV hold also in the non-linear case, Algorithm 1 can be trivially generalized for solving problem (28).
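To make the extension concrete, a minimal sketch of Eq. (29) follows; the Gaussian (RBF) kernel is chosen here purely as an example, since the paper does not prescribe a specific kernel.

```python
import numpy as np

def q_matrix(X, y, gamma=1.0):
    """q_ij = y_i y_j K(x_i, x_j) of Eq. (29), with an RBF kernel as example."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)
    return np.outer(y, y) * K
```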
VIII. CONCLUSIONS

We have shown how to apply conventional CCQP solvers to the solution of the SVM learning phase, when formulated as an Ivanov regularization problem. As confirmed by the experimental results, this approach makes in-sample methods comparable, at least in the considered cases, to widely used empirical methods for model selection. We hope that this work can help in spreading the use of the alternative Ivanov formulation of the SVM, which has several advantages with respect to the conventional one.
APPENDIX

In this Appendix we propose the proofs of the theorems and corollaries presented in Section IV-A.

Proof of Theorem 1: Let us compute the Lagrangian of the primal I-SVM problem (15):

    L = \sum_{i=1}^{n} ξ_i - \sum_{i=1}^{n} β_i [y_i (w^T x_i + b) - 1 + ξ_i]    (30)
        - (γ/2) (w^2_{MAX} - w^T w) - \sum_{i=1}^{n} δ_i ξ_i.    (31)

We obtain the following Karush-Kuhn-Tucker (KKT) conditions:

    ∂L/∂w = 0  →  w = (1/γ) \sum_{i=1}^{n} β_i y_i x_i    (32)
    ∂L/∂b = 0  →  \sum_{i=1}^{n} β_i y_i = 0    (33)
    ∂L/∂ξ_i = 0  →  1 - β_i - δ_i = 0  →  β_i ≤ 1    ∀i ∈ [1, ..., n]    (34)

and the dual formulation:

    \min_{β,γ}  (1/(2γ)) \sum_{i=1}^{n} \sum_{j=1}^{n} β_i β_j q_{ij} - \sum_{i=1}^{n} β_i + γ w^2_{MAX} / 2    (35)
               0 ≤ β_i ≤ 1    ∀i ∈ [1, ..., n]
               γ ≥ 0
               y^T β = 0.

Let us consider the conventional primal formulation (10): we can multiply the expression of the cost by 1/C (as C > 0) and add the constant value -w^2_{MAX}/(2C) without affecting the optimal solution (w^C, b^C, ξ^C):

    \min_{w,b,ξ}  (1/(2C)) ( ||w||^2 - w^2_{MAX} ) + \sum_{i=1}^{n} ξ_i    (36)
                 y_i (w^T x_i + b) ≥ 1 - ξ_i    ∀i ∈ [1, ..., n]
                 ξ_i ≥ 0    ∀i ∈ [1, ..., n]

We can compute the dual of the previous problem:

    L = (1/(2C)) ( ||w||^2 - w^2_{MAX} ) + \sum_{i=1}^{n} ξ_i
        - \sum_{i=1}^{n} β_i [y_i (w^T x_i + b) - 1 + ξ_i] - \sum_{i=1}^{n} δ_i ξ_i,    (37)

obtaining the following KKT conditions:

    ∂L/∂w = 0  →  w = C \sum_{i=1}^{n} β_i y_i x_i    (38)
    ∂L/∂b = 0  →  \sum_{i=1}^{n} β_i y_i = 0    (39)
    ∂L/∂ξ_i = 0  →  1 - β_i - δ_i = 0  →  β_i ≤ 1    ∀i ∈ [1, ..., n].

The new dual formulation can be written as:

    \min_{β}  (C/2) \sum_{i=1}^{n} \sum_{j=1}^{n} β_i β_j q_{ij} - \sum_{i=1}^{n} β_i + w^2_{MAX} / (2C)    (40)
             0 ≤ β_i ≤ 1    ∀i ∈ [1, ..., n]
             y^T β = 0,

which is equivalent to (35) for

    C = 1/γ,    (41)

where γ > 0.

Proof of Theorem 2: As the T-SVM training problem is convex, from Eq. (10) we have that

    ||w^{w_{MAX}}||^2 / 2 + C \sum_{i=1}^{n} ξ_i^{w_{MAX}} ≥ ||w^C||^2 / 2 + C \sum_{i=1}^{n} ξ_i^C.    (42)

As we supposed that C is such that ||w^C||^2 = ||w^{w_{MAX}}||^2, then:

    \sum_{i=1}^{n} ξ_i^{w_{MAX}} ≥ \sum_{i=1}^{n} ξ_i^C.    (43)

However, as \sum_{i=1}^{n} ξ_i^{w_{MAX}} is the solution of problem (15), we must have that

    \sum_{i=1}^{n} ξ_i^{w_{MAX}} = \sum_{i=1}^{n} ξ_i^C.    (44)

Unfortunately, it is impossible to prove that, if \sum_{i=1}^{n} ξ_i^C = \sum_{i=1}^{n} ξ_i^{w_{MAX}} for some value of C, then the condition ||w^C||^2 = ||w^{w_{MAX}}||^2 is always verified. In fact, it is easy to find a counterexample: let us consider a linearly separable problem for which the constraint ||w||^2 ≤ w^2_{MAX} holds: in this case, there are infinitely many possible solutions w^{w_{MAX}} to problem (15). On the contrary, T-SVM allows choosing, among them, the weight vector w^C such that the margin is maximized, i.e. we will have that ||w^C||^2 ≤ ||w^{w_{MAX}}||^2: in this case, when solving I-SVM exploiting the QP algorithms designed for T-SVM, we will force ||w^{w_{MAX}}||^2 = ||w^C||^2, without loss of generality.

Proof of Theorem 3: Let us proceed via reductio ad absurdum, i.e. suppose that K^{C_2} < K^{C_1}. Then:

    K^{C_1} = ||w^{C_1}||^2 / 2 + C_1 \sum_{i=1}^{n} ξ_i^{C_1}
            > ||w^{C_2}||^2 / 2 + C_2 \sum_{i=1}^{n} ξ_i^{C_2}
            ≥ ||w^{C_2}||^2 / 2 + C_1 \sum_{i=1}^{n} ξ_i^{C_2}    (45)

as, by hypothesis, C_2 > C_1. However, problem (10) is convex, thus it is characterized by a unique global minimum: then, either (w^{C_1}, b^{C_1}, ξ^{C_1}) is not the optimal solution or C_2 ≤ C_1. In any case, the hypotheses of the theorem are contradicted and, then, it must be K^{C_2} ≥ K^{C_1}.

Proof of Corollary 1: From Theorem 3, we have that K^{C_2} ≥ K^{C_1}. Let us suppose that K^{C_2} = K^{C_1}; then:

    K^{C_1} = ||w^{C_1}||^2 / 2 + C_1 \sum_{i=1}^{n} ξ_i^{C_1}
            = ||w^{C_2}||^2 / 2 + C_2 \sum_{i=1}^{n} ξ_i^{C_2}.    (46)

Eq. (46) is valid only if \sum_{i=1}^{n} ξ_i^{C_2} = 0. In fact, if we suppose \sum_{i=1}^{n} ξ_i^{C_2} ≠ 0, we can upper-bound Eq. (46) and get:

    K^{C_1} = ||w^{C_2}||^2 / 2 + C_2 \sum_{i=1}^{n} ξ_i^{C_2}
            > ||w^{C_2}||^2 / 2 + C_1 \sum_{i=1}^{n} ξ_i^{C_2}.    (47)

But the last term is the cost, at C_1, of a feasible solution, so it cannot be smaller than K^{C_1}: the hypotheses of the corollary are contradicted and, then, it must be K^{C_2} > K^{C_1}.

Proof of Theorem 4: In this proof, we proceed by considering all the possible cases, proving by contradiction that configurations other than the ones of the thesis are not admissible.

As a first step, suppose ||w^{C_2}||^2 < ||w^{C_1}||^2. If \sum_{i=1}^{n} ξ_i^{C_2} < \sum_{i=1}^{n} ξ_i^{C_1}, then K^{C_2} < K^{C_1}, which is impossible (see Theorem 3). If \sum_{i=1}^{n} ξ_i^{C_2} = \sum_{i=1}^{n} ξ_i^{C_1}, then:

    ||w^{C_2}||^2 / 2 + C_1 \sum_{i=1}^{n} ξ_i^{C_2} < K^{C_1},    (48)

which contradicts the hypothesis that K^{C_1} is the global minimum and, then, is not admissible. If \sum_{i=1}^{n} ξ_i^{C_2} > \sum_{i=1}^{n} ξ_i^{C_1}, then:

    ||w^{C_2}||^2 / 2 + C_1 \sum_{i=1}^{n} ξ_i^{C_2} > ||w^{C_1}||^2 / 2 + C_1 \sum_{i=1}^{n} ξ_i^{C_1} = K^{C_1}.    (49)

From Eq. (49), we get:

    C_1 > ( ||w^{C_1}||^2 - ||w^{C_2}||^2 ) / ( 2 ( \sum_{i=1}^{n} ξ_i^{C_2} - \sum_{i=1}^{n} ξ_i^{C_1} ) ).    (50)

Analogously, we have for K^{C_2}:

    ||w^{C_1}||^2 / 2 + C_2 \sum_{i=1}^{n} ξ_i^{C_1} > ||w^{C_2}||^2 / 2 + C_2 \sum_{i=1}^{n} ξ_i^{C_2} = K^{C_2},    (51)

from which we obtain:

    C_2 < ( ||w^{C_1}||^2 - ||w^{C_2}||^2 ) / ( 2 ( \sum_{i=1}^{n} ξ_i^{C_2} - \sum_{i=1}^{n} ξ_i^{C_1} ) ).    (52)

By joining Eqns. (50) and (52), we have that C_2 < C_1, which contradicts the theorem hypotheses.

Suppose now that ||w^{C_2}||^2 = ||w^{C_1}||^2. If \sum_{i=1}^{n} ξ_i^{C_2} < \sum_{i=1}^{n} ξ_i^{C_1}, then

    ||w^{C_1}||^2 / 2 + C_1 \sum_{i=1}^{n} ξ_i^{C_2} < K^{C_1},    (53)

which is impossible, as we supposed that K^{C_1} is the global minimum. Analogously, if \sum_{i=1}^{n} ξ_i^{C_2} > \sum_{i=1}^{n} ξ_i^{C_1}, then:

    ||w^{C_2}||^2 / 2 + C_2 \sum_{i=1}^{n} ξ_i^{C_1} < K^{C_2},    (54)

or, in other words, K^{C_2} is not the global minimum. This contradicts the hypothesis and, thus, this configuration is not admissible.

Finally, let us consider the case ||w^{C_2}||^2 > ||w^{C_1}||^2. If \sum_{i=1}^{n} ξ_i^{C_2} = \sum_{i=1}^{n} ξ_i^{C_1}, then

    ||w^{C_1}||^2 / 2 + C_2 \sum_{i=1}^{n} ξ_i^{C_2} < K^{C_2},    (55)

which is, again, impossible as K^{C_2} is supposed to be the global minimum. As a last step, let us consider the case \sum_{i=1}^{n} ξ_i^{C_2} > \sum_{i=1}^{n} ξ_i^{C_1}, for which we have:

    ||w^{C_1}||^2 / 2 + C_2 \sum_{i=1}^{n} ξ_i^{C_1} < K^{C_2},    (56)

which violates the hypotheses of the theorem. Thus, all the configurations for ||w^{C_{(1,2)}}||^2 and \sum_{i=1}^{n} ξ_i^{C_{(1,2)}} other than the ones of the thesis are not admissible.

Proof of Corollary 2: The proof is a simple application of Theorem 4, as no configurations with ||w^{C_2}||^2 < ||w^{C_1}||^2 are admissible.

Proof of Corollary 3: As C_1 ≤ C_S, from Theorem 4 we know that only two configurations are possible:
• if ||w^{C_S}||^2 > ||w^{C_1}||^2, then \sum_{i=1}^{n} ξ_i^{C_S} < \sum_{i=1}^{n} ξ_i^{C_1};
• if ||w^{C_S}||^2 = ||w^{C_1}||^2, then \sum_{i=1}^{n} ξ_i^{C_S} = \sum_{i=1}^{n} ξ_i^{C_1}.
Analogously, as C_S ≤ C_2, we have:
• if ||w^{C_2}||^2 > ||w^{C_S}||^2, then \sum_{i=1}^{n} ξ_i^{C_2} < \sum_{i=1}^{n} ξ_i^{C_S};
• if ||w^{C_2}||^2 = ||w^{C_S}||^2, then \sum_{i=1}^{n} ξ_i^{C_2} = \sum_{i=1}^{n} ξ_i^{C_S}.
As we supposed that ||w^{C_2}||^2 = ||w^{C_1}||^2, we have that the configurations of Eqns. (23) and (24) are the only ones which do not violate the corollary hypotheses.

Proof of Corollary 4: When varying the hyperparameter C, the optimal weight vector w^C, the bias b^C and the slack variables ξ^C are characterized by a piecewise linear trend, as shown by Hastie et al. [24], where the discontinuity points are enumerable and computable. In every interval between two consecutive discontinuity points, the SVM classification function can be expressed as

    f(x) = (C / C^l) [ f^l(x) - h^l(x) ] + h^l(x)    (57)

where C^l is the value of the hyperparameter at the last discontinuity point and f^l(x) is the solution of the SVM training problem for C = C^l. Moreover, h^l(x) is defined as:

    h^l(x) = \sum_{i ∈ SV} α_i^l y_i x^T x_i + b    (58)

where α^l is the solution of problem (12) for C = C^l and SV is the set of indexes of the Support Vectors.

For the sake of simplicity, let us suppose that C^l = C_1. If, as hypothesized, we obtain the same SVM solution for two distinct values C_1 and C_2, then

    f^l(x) - h^l(x) = 0    (59)

and C_2 is not a discontinuity point, as the set E^l is not modified [24]. Thus, we can safely move to the following discontinuity point C^{l+1}, which occurs in correspondence of

    C^{l+1} = \min_{i ∈ SV} C^l ( y_i - h^l(x_i) ) / ( f^l(x_i) - h^l(x_i) ).    (60)

However, from Eq. (59), we have that

    C^{l+1} → +∞,    (61)

so we have the same solution ||w^{C_U}||^2 = ||w^{C_2}||^2 = ||w^{C_1}||^2 and \sum_{i=1}^{n} ξ_i^{C_U} = \sum_{i=1}^{n} ξ_i^{C_2} = \sum_{i=1}^{n} ξ_i^{C_1}, ∀C_U ∈ [C_1, +∞).
REFERENCES
[1] V. Vapnik, “An overview of statistical learning theory”, IEEE Trans. on
Neural Networks, vol. 10, pp. 988–999, 1999.
[2] T. Joachims, “Learning to Classify Text using Support Vector Machines:
Methods, Theory, and Algorithms”, Kluwer, 2002.
[3] T. Jaakkola, M. Diekhans, D. Haussler, “A discriminative framework
for detecting remote protein homologies”, Journal of Computational
Biology, vol. 7, pp. 95–114, 2000.
[4] B. Schoelkopf, K.K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, V.
Vapnik, “Comparing Support Vector Machines with Gaussian Kernels to
Radial Basis Function Classifiers”, IEEE Trans. on Signal Processing,
vol. 45, pp. 2758–2765, 1997.
[5] I. Guyon, A. Saffari, G. Dror, G. Cawley, “Model Selection: Beyond the Bayesian/Frequentist Divide”, The Journal of Machine Learning Research, vol. 11, pp. 61–87, 2010.
[6] D. Anguita, S. Ridella, S. Rivieccio, “K–fold generalization capability
assessment for support vector classifiers”, Proc. of the Int. Joint Conf.
on Neural Networks, pp. 855–858, Montreal, Canada, 2005.
[7] B. Efron, R. Tibshirani, “An introduction to the Bootstrap”, Chapman and Hall, 1993.
[8] O. Bousquet, A. Elisseeff, “Stability and generalization”, Journal of
Machine Learning Research, vol. 2, pp. 499-526, 2002.
[9] D. Anguita, A. Ghio, N. Greco, L. Oneto, S. Ridella, “Model selection
for support vector machines: Advantages and disadvantages of the
machine learning theory”, International Joint Conference on Neural
Networks, 2010.
[10] U.M. Braga-Neto, E.R. Dougherty, “Is cross-validation valid for small-sample microarray classification?”, Bioinformatics, vol. 20, pp. 374–380, 2004.
[11] A. Isaksson, M. Wallman, H. Goeransson, M.G. Gustafsson, “Cross-validation and bootstrapping are unreliable in small sample classification”, Pattern Recognition Letters, vol. 29, pp. 1960–1965, 2008.
[12] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, “Choosing
multiple parameters for Support Vector Machines”, Machine Learning,
vol. 46, pp. 131-159, 2002.
[13] S. Floyd, M. Warmuth, “Sample compression, learnability, and the
Vapnik-Chervonenkis dimension”, Machine Learning, vol. 21, pp. 136, 1995.
[14] V.N. Vapnik, “The nature of statistical learning theory”, Springer
Verlag, 2000.
[15] P.L. Bartlett, S. Boucheron, G. Lugosi, “Model selection and error
estimation”, Machine Learning, vol. 48, pp. 85–113, 2002.
[16] K. Duan, S.S. Keerthi, A.N. Poo, “Evaluation of simple performance measures for tuning SVM hyperparameters”, Neurocomputing, vol. 51, pp. 41–59, 2003.
[17] D. Anguita, A. Ghio, S. Ridella, “Maximal Discrepancy for Support
Vector Machines”, Neurocomputing, (in press), 2011.
[18] A.N. Tikhonov, V.Y. Arsenin, “Solution of Ill-posed Problems”, Winston & Sons, 1977.
[19] L. Bottou, C.J. Lin, “Support Vector Machine Solvers”, in “Large Scale Kernel Machines”, edited by L. Bottou, O. Chapelle, D. DeCoste, J. Weston, The MIT Press, pp. 1–28, 2007.
[20] V.V. Ivanov, “The Theory of Approximate Methods and Their Application to the Numerical Solution of Singular Integral Equations”, Nordhoff
International, 1976.
[21] L. Martein, S. Schaible, “On solving a linear program with one
quadratic constraint”, Decisions in Economics and Finance, vol. 10,
1987.
[22] J.C. Platt, “Fast training of support vector machines using sequential
minimal optimization”, Advances in Kernel Methods, pp. 185–208, MIT
press, 1999.
[23] C.J. Lin, “Asymptotic convergence of an SMO algorithm without any
assumption”, IEEE Trans. on Neural Networks, vol. 13, pp. 248–250,
2002.
[24] T. Hastie, S. Rosset, R. Tibshirani, J. Zhu, “The entire regularization
path for the support vector machine”, Journal of Machine Learning
Research, vol. 5, pp. 1391–1415, 2004.
[25] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, Y. Bengio, “An
empirical evaluation of deep architectures on problems with many
factors of variation”, Proc. of the Int. Conf. on Machine Learning,
pp. 473–480, 2007.
[26] S. Munder, D.M. Gavrila, “An Experimental Study on Pedestrian Classification”, IEEE Trans. on Pattern Analysis and Machine Intelligence,
vol. 28, pp. 1863–1868, 2006.