In-sample Model Selection for Support Vector Machines

Davide Anguita, Member, IEEE, Alessandro Ghio, Member, IEEE, Luca Oneto, Student Member, IEEE, and Sandro Ridella, Member, IEEE

Davide Anguita, Alessandro Ghio, Luca Oneto and Sandro Ridella are with the Department of Biophysical and Electronic Engineering, University of Genova, Via Opera Pia 11A, I-16145 Genova, Italy (email: {Davide.Anguita, Alessandro.Ghio, Luca.Oneto, Sandro.Ridella}@unige.it).

Abstract— In-sample model selection for Support Vector Machines is a promising approach that allows the training set to be used both for learning the classifier and for tuning its hyperparameters. This is a welcome improvement with respect to out-of-sample methods, like cross-validation, which require some samples to be removed from the training set and used only for model selection purposes. Unfortunately, in-sample methods require a precise control of the classifier function space, which can be achieved only through an unconventional SVM formulation, based on Ivanov regularization. We prove in this work that, even in this case, it is possible to exploit well-known Quadratic Programming solvers such as, for example, Sequential Minimal Optimization, thus improving the applicability of the in-sample approach.

I. INTRODUCTION

The Support Vector Machine (SVM) algorithm [1] has stood out in recent years as one of the most effective techniques for classification tasks. The widespread diffusion of SVMs resulted mostly from their successful application to many real-world problems, spanning heterogeneous domains (e.g. text categorization [2], computational biology [3], etc.). The success of the SVM algorithm is mainly due to two factors: (i) it is possible to optimize the trade-off between the under- and over-fitting capabilities of the model by suitably controlling the margin [1]; and (ii) the coefficients of the classifier can be found by solving a Convex Constrained Quadratic Programming (CCQP) problem, which admits only a global minimum. In particular, the convexity of the SVM training problem amplifies the practical effectiveness of SVMs with respect to, for example, traditional Artificial Neural Networks (ANNs) [4], which require the solution of difficult non-linear optimization problems.

Unfortunately, solving the CCQP does not conclude the SVM learning phase, which consists of two steps: an efficient QP solver is used for training the classifier (i.e. for solving the CCQP problem), while a tool for tuning a set of additional variables (the SVM hyperparameters) must be exploited in order to select the classifier characterized by the best performance. This last step is known as the model selection phase of the SVM and is strictly linked to the estimation of the generalization error, namely the misclassification rate attainable by the model on new and previously unobserved patterns. In fact, the best model can be chosen as the one with the lowest generalization error [5].

The underlying problem of model selection through generalization error estimation is of major importance. Empirical approaches, like K-fold Cross Validation [6], Bootstrap [7] or Leave-One-Out [8], are often exploited for this purpose: they require splitting the dataset into two subsets, one for solving the CCQP problem (i.e. the training set) and the remaining one for estimating the generalization error and performing the model selection (i.e. the hold-out set).
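For concreteness, the following minimal Python sketch illustrates the out-of-sample baseline just described: a K-fold cross-validation search over the hyperparameter C of a linear SVM, using scikit-learn. The synthetic data, the grid and the number of folds are placeholders of ours, not the setup used in this paper.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(60, 100)                       # small-sample regime: n << d
y = np.sign(X[:, 0] + 0.1 * rng.randn(60))   # synthetic labels in {-1, +1}

# K-fold CV repeatedly removes a fold from the training set and uses it
# only to score the hyperparameter C (the hold-out principle).
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": np.logspace(-3, 3, 13)},
                      cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)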
Despite being effective in many applications, these techniques appear to be unreliable when applied to small-sample problems [9], [10], [11], i.e. datasets where the number of patterns is small compared to the dimensionality of the data. This is the case, for example, of microarray data, where a few tens of samples, each composed of several thousand features, are available: in these cases, in-sample methods would be preferable. Their main advantage, with respect to empirical approaches, is the use of the whole set of available data both for training the model and for estimating the generalization error. Moreover, they provide deep insights into the classification algorithms and are based on rigorous approaches for predicting, in probability, the generalization ability of a classifier. In the last years, several in-sample approaches have appeared in the literature [12], [13], [14], [15], but the underlying hypotheses, which must be fulfilled for the consistency of the estimation, are seldom satisfied in practice and the generalization estimate can be very pessimistic [16]. However, in recent works [9], [17], a method for practically applying an in-sample approach, the Maximal Discrepancy (MD) [15], has been proposed. The MD-based approach often guarantees better performance in the small-sample setting with respect to empirical techniques and, therefore, represents an effective solution to the SVM model selection problem.

Unfortunately, as highlighted in [9], [17], in order to apply the MD approach to the SVM, an alternative formulation, based on Ivanov regularization [20], must be used, therefore giving up the conventional formulation, based on Tikhonov regularization [18], for which very efficient solvers have been developed over the years [19]. To the best of the authors' knowledge, no specific procedures have been proposed for finding a solution of the Ivanov SVM formulation, despite the fact that this formulation appears in Vapnik's original work. Therefore, in Sections IV and V, we propose a new method, which relies on conventional QP solvers for solving the Ivanov CCQP optimization problem, thus allowing us to provide a complete and effective tool (MD model selection & training, using conventional solvers) for SVM learning in the small-sample setting.

II. THE MAXIMAL DISCREPANCY OF A CLASSIFIER

Let us consider a sequence of n independent and identically distributed (i.i.d.) patterns D_n = {(x_1, y_1), ..., (x_n, y_n)}, where x ∈ X ⊆ R^d and y ∈ Y ⊆ R, sampled according to an unknown distribution P(x, y). In particular, as we analyze binary classification problems, we set Y = {±1}. Let us fix the hypothesis space (i.e. the class of functions) F and select a classifier f(x) within it. Then, the error attainable by the classifier on the whole population L(f), i.e. the generalization error, is defined as:

  L(f) = E_{(X,Y)} \ell(f(x), y)    (1)

where \ell(f(x), y), with \ell: X × Y → [0, 1], is a suitable bounded loss function. Since P(x, y) is unknown, we cannot directly evaluate L(f). On the contrary, we can compute the empirical error of f ∈ F:

  \hat{L}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i).    (2)

Clearly, \hat{L}_n(f) estimates L(f) but is affected by a bias: as shown in [15], a penalty term should somehow take into account the complexity of the hypothesis space F to which the classifier f belongs. The Maximal Discrepancy (MD) is a statistical tool which allows us to compute this bias and, thus, to rigorously upper-bound the generalization error by exploiting the available data. Let us split D_n in two halves and compute the two empirical errors:

  \hat{L}^{(1)}_{n/2}(f) = \frac{2}{n} \sum_{i=1}^{n/2} \ell(f(x_i), y_i)    (3)

  \hat{L}^{(2)}_{n/2}(f) = \frac{2}{n} \sum_{i=n/2+1}^{n} \ell(f(x_i), y_i).    (4)

The Maximal Discrepancy (MD) is:

  \hat{M}_n(F) = \max_{f \in F} \left[ \hat{L}^{(1)}_{n/2}(f) - \hat{L}^{(2)}_{n/2}(f) \right].    (5)

If the loss function is bounded and symmetric, such that \ell(f(x_i), -y_i) = 1 - \ell(f(x_i), y_i), the value of the MD can be computed by a conventional empirical minimization procedure. In fact, a new dataset D'_n can be created by simply flipping the labels of half of the patterns, so that:

  \hat{M}_n(F) = 1 - 2 \min_{f \in F} \hat{L}'_n(f),    (6)

where \hat{L}'_n(f) is the empirical error on D'_n. Then, we obtain the following upper bound for the generalization error of a classifier f ∈ F [9], [17], which holds with probability (1 - δ):

  L(f) \le \hat{L}_n(f) + \hat{M}_n(F) + \Delta(\delta, n),    (7)

where

  \Delta(\delta, n) = 3 \sqrt{ \frac{\log \frac{2}{\delta}}{2n} }    (8)

is the confidence term, which is independent from both the hypothesis space and the classifier, but depends on the number of patterns of the training set.
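A minimal sketch of the bookkeeping behind Eqns. (6)-(8) is given below, using the 0/1 loss as an example of a bounded and symmetric loss. The routine train is a placeholder for any learner that (approximately) minimizes the empirical error over the chosen hypothesis space F (in this paper, the constrained I-SVM of Section IV), so this is only an illustration of the computation, not the exact procedure of [9], [17].

import numpy as np

def zero_one(y_pred, y_true):
    # Bounded, symmetric loss: l(f(x), -y) = 1 - l(f(x), y) for +/-1 decisions.
    return (y_pred != y_true).astype(float)

def md_bound(X, y, train, delta=0.05):
    n = len(y)
    # Eq. (6): flip the labels of the second half and minimize the empirical
    # error on the flipped dataset D'_n (the training routine approximates the min over F).
    y_flip = y.copy()
    y_flip[n // 2:] *= -1
    f_flip = train(X, y_flip)
    md = 1.0 - 2.0 * zero_one(f_flip(X), y_flip).mean()   # \hat{M}_n(F)
    # Empirical error of the classifier trained on the original sample.
    f = train(X, y)
    emp = zero_one(f(X), y).mean()                          # \hat{L}_n(f)
    conf = 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * n))   # Eq. (8)
    return emp + md + conf                                  # Eq. (7), holds w.p. 1 - delta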
III. THE SUPPORT VECTOR MACHINE

For the sake of simplicity, as we are working in the small-sample regime, we focus here on the linear Support Vector Machine (SVM):

  f(x) = w^T x + b.    (9)

We will analyze in Section VII how the results can be generalized to the non-linear case, through the usual kernel trick [1]. The weights w ∈ R^d and the bias b ∈ R are found by solving the following primal CCQP problem, based on Tikhonov regularization [14]:

  \min_{w,b,\xi}  \frac{1}{2}\|w\|^2 + C e^T \xi    (10)
    y_i (w^T x_i + b) \ge 1 - \xi_i   ∀i ∈ [1, ..., n]
    \xi_i \ge 0                       ∀i ∈ [1, ..., n]

where e_i = 1 ∀i and C is a hyperparameter that must be chosen through a model selection procedure (e.g. Maximal Discrepancy). The first term \frac{1}{2}\|w\|^2 is the regularization term, which forces the classifier to attain a large margin, while the \xi_i are slack variables, introduced to upper-bound the number of misclassifications according to the hinge loss function \ell_\xi(f(x), y):

  \ell_\xi(f(x_i), y_i) = \xi_i = [1 - y_i f(x_i)]_+,    (11)

where [\cdot]_+ = \max(0, \cdot) [1]. By introducing n Lagrange multipliers \alpha ∈ R^n, it is possible to write problem (10) in its dual form, for which efficient QP solvers have been proposed in the literature [19]:

  \min_{\alpha}  \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j q_{ij} - \sum_{i=1}^{n} \alpha_i    (12)
    0 \le \alpha_i \le C   ∀i ∈ [1, ..., n]
    y^T \alpha = 0,

where

  q_{ij} = y_i y_j x_i^T x_j.    (13)

After solving problem (12), the Lagrange multipliers can be used to define the SVM classifier:

  f(x) = \sum_{i=1}^{n} y_i \alpha_i x_i^T x + b.    (14)

The patterns characterized by \alpha_i > 0 (or, in other words, by y_i f(x_i) \le 1) are called Support Vectors (SVs).

Ideally, to apply the Maximal Discrepancy approach, we could choose \ell_\xi as the loss function, computing \min_{f \in F} \hat{L}'_n(f) and \hat{L}_n(f) with \ell = \ell_\xi in Eqns. (6) and (7), respectively [9]. However, three problems arise: (i) the hinge loss is not bounded and, thus, cannot be directly used in Eq. (7); (ii) the value \sum_{i=1}^{n} \xi_i is not minimized by problems (10) and (12); (iii) it is not straightforward to find a simple relation between C and the hypothesis space F in the conventional formulations (10) and (12). The peeling approach, proposed in [9], [17], allows the first issue to be circumvented by iteratively removing the Critical SVs (CSVs) that cause the unboundedness of \ell_\xi, i.e. the patterns for which y_i f(x_i) < -1.
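The following small helper, given a trained linear classifier (w, b), computes the slacks of Eq. (11) and the index sets just discussed; it is an illustrative sketch (the names are ours), useful, for example, to identify the candidates removed at each step of the peeling procedure of [9], [17].

import numpy as np

def margins_and_sets(w, b, X, y):
    f = X @ w + b                        # f(x_i) = w^T x_i + b, Eq. (9)
    xi = np.maximum(0.0, 1.0 - y * f)    # hinge slacks, Eq. (11)
    sv = np.where(y * f <= 1.0)[0]       # Support Vectors
    csv = np.where(y * f < -1.0)[0]      # Critical SVs (peeling candidates)
    return xi, sv, csv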
Concerning problems (ii) and (iii), we propose to use an alternative SVM formulation [14], based on Ivanov regularization, which minimizes the error (\min_{f \in F} \sum_{i=1}^{n} \xi_i) and where F is defined as the set of functions with \|w\|^2 \le w_{MAX}^2 and b ∈ (-∞, +∞):

  \min_{w,b,\xi}  e^T \xi    (15)
    \|w\|^2 \le w_{MAX}^2
    y_i (w^T x_i + b) \ge 1 - \xi_i   ∀i ∈ [1, ..., n]
    \xi_i \ge 0                       ∀i ∈ [1, ..., n]

This formulation was proposed by Vapnik himself [14] but, to the best of the authors' knowledge, no dedicated procedures have been developed for solving it. In [9], a heuristic solution, based on the ideas of [21], was proposed, but it does not allow conventional optimization algorithms to be exploited and, above all, does not allow the kernel non-linearity to be introduced. In this work, we prove that the well-known Sequential Minimal Optimization algorithm [22], [23] can be exploited for directly solving problem (15), by detailing a relationship between Ivanov and Tikhonov regularization.

IV. IVANOV AND TIKHONOV REGULARIZATION

Let us define, for the sake of brevity, the Tikhonov-regularization-based formulation (10) as T-SVM and the Ivanov-regularization-based formulation (15) as I-SVM. We want to solve I-SVM by using the efficient and reliable optimization tools developed for T-SVM. For this purpose, we define as w^C, b^C, \xi^C the solution of T-SVM, given a fixed C, and we introduce some preliminary properties, which we need to prove for reasons that will become clearer in the next section:

Property 1) T-SVM and I-SVM are equivalent for some value of C;

Property 2) If we consider two distinct values of C, C_1 and C_2, such that C_2 > C_1, we have that:

  \sum_{i=1}^{n} \xi_i^{C_2} < \sum_{i=1}^{n} \xi_i^{C_1}   if  \|w^{C_2}\|^2 > \|w^{C_1}\|^2
  \sum_{i=1}^{n} \xi_i^{C_2} = \sum_{i=1}^{n} \xi_i^{C_1}   if  \|w^{C_2}\|^2 = \|w^{C_1}\|^2    (16)

in other words, if the hyperparameter C grows, then both the amount of misclassification and the size of the margin decrease or remain the same;

Property 3) In T-SVM, if \|w^{C_2}\|^2 = \|w^{C_1}\|^2 for C_2 > C_1, then ∀ C_U > C_1 we have \|w^{C_U}\|^2 = \|w^{C_1}\|^2. This means that, if the margin stops decreasing, it will remain the same regardless of the value of C.

In the following sub-section, we will prove the previous properties, which also give us interesting insights into the two different formulations of the SVM problem. The achieved results will then be exploited for formulating the algorithm which allows I-SVM to be solved using the conventional QP solvers designed for T-SVM. To improve the readability of the paper, the proofs are presented in the Appendix.
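These properties can also be observed empirically. The following sketch, which assumes scikit-learn's SVC with a linear kernel as the T-SVM solver of problem (10), trains the classifier on synthetic data for increasing values of C and prints \|w^C\|^2 and \sum_i \xi_i^C; the former should never decrease and the latter should never increase, and both eventually freeze (Properties 2 and 3). This is an illustration of ours, not a substitute for the proofs in the Appendix.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(40, 20)
y = np.sign(X[:, 0] + 0.5 * rng.randn(40))

for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_.ravel(), clf.intercept_[0]
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # hinge slacks of problem (10)
    print(f"C={C:7.2f}  ||w||^2={w @ w:8.4f}  sum(xi)={xi.sum():8.4f}")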
A. I-SVM and T-SVM: Properties

The following theorem allows us to prove the previously presented Property 1.

Theorem 1: The I-SVM problem (15) is equivalent to the conventional T-SVM problem (10) for some value of C [14].

Theorem 2: Let us consider the I-SVM and T-SVM problems. Let (w^{w_{MAX}}, b^{w_{MAX}}, \xi^{w_{MAX}}) and (w^C, b^C, \xi^C) be the solutions of, respectively, I-SVM and T-SVM. If \|w^C\|^2 = \|w^{w_{MAX}}\|^2 for a given C, then \sum_{i=1}^{n} \xi_i^C = \sum_{i=1}^{n} \xi_i^{w_{MAX}}.

As we proved that the two problems are equivalent for some value of C, we consider only the T-SVM problem in the remainder of this section. The result given by the following theorem is preparatory for demonstrating the second property presented in Section IV.

Theorem 3: Let us consider the T-SVM optimization problem. Given a value of the hyperparameter C = C_1, let us solve problem (10) and let K^{C_1} be the minimum value of the cost function:

  K^{C_1} = \frac{\|w^{C_1}\|^2}{2} + C_1 \sum_{i=1}^{n} \xi_i^{C_1}.    (17)

Let us consider a larger value of the hyperparameter, C_2 > C_1, and let us compute the cost at the optimum:

  K^{C_2} = \frac{\|w^{C_2}\|^2}{2} + C_2 \sum_{i=1}^{n} \xi_i^{C_2}.    (18)

Then:

  K^{C_2} \ge K^{C_1}.    (19)

Corollary 1: Given the same hypotheses of Theorem 3, if:

  \sum_{i=1}^{n} \xi_i^{C_2} \ne 0,    (20)

then

  K^{C_2} > K^{C_1}.    (21)

By exploiting the results of the previous theorems, we can prove Property 2.

Theorem 4: Let us consider the T-SVM optimization problem. Given two values of the hyperparameter C such that C_2 > C_1, let us solve problem (10) and let K^{C_1} and K^{C_2} be the minimum values of the cost function. As, from Theorem 3, K^{C_2} \ge K^{C_1}, then:

  \sum_{i=1}^{n} \xi_i^{C_2} < \sum_{i=1}^{n} \xi_i^{C_1}   if  \|w^{C_2}\|^2 > \|w^{C_1}\|^2
  \sum_{i=1}^{n} \xi_i^{C_2} = \sum_{i=1}^{n} \xi_i^{C_1}   if  \|w^{C_2}\|^2 = \|w^{C_1}\|^2.    (22)

Corollary 2: Given the same hypotheses of Theorem 4, \|w^{C_2}\|^2 \ge \|w^{C_1}\|^2.

Corollary 3: Given the same hypotheses of Theorem 4, if \|w^{C_2}\|^2 = \|w^{C_1}\|^2, then ∀ C_S such that C_1 \le C_S \le C_2:

  \|w^{C_S}\|^2 = \|w^{C_2}\|^2 = \|w^{C_1}\|^2    (23)
  \sum_{i=1}^{n} \xi_i^{C_S} = \sum_{i=1}^{n} \xi_i^{C_2} = \sum_{i=1}^{n} \xi_i^{C_1}.    (24)

The following corollary allows us to prove Property 3.

Corollary 4: Given the same hypotheses of Theorem 4, if \|w^{C_2}\|^2 = \|w^{C_1}\|^2 and \sum_{i=1}^{n} \xi_i^{C_2} = \sum_{i=1}^{n} \xi_i^{C_1}, then ∀ C_U ∈ [C_1, +∞):

  \|w^{C_U}\|^2 = \|w^{C_2}\|^2 = \|w^{C_1}\|^2    (25)
  \sum_{i=1}^{n} \xi_i^{C_U} = \sum_{i=1}^{n} \xi_i^{C_2} = \sum_{i=1}^{n} \xi_i^{C_1}.    (26)

V. EXPLOITING EFFICIENT QP SOLVERS FOR I-SVM

In the previous section, some interesting properties have been proved. In particular:
• Theorem 1 states that the I-SVM and the conventional T-SVM problems are equivalent for a given (and finite) value of the hyperparameter C;
• Theorem 2 states that if, for a given value of C, we obtain the same value of the margin with T-SVM and I-SVM, then the upper bound on the number of misclassifications is minimized in T-SVM too, as \sum_{i=1}^{n} \xi_i^C = \sum_{i=1}^{n} \xi_i^{w_{MAX}} = \min_{f \in F} \sum_{i=1}^{n} \xi_i;
• Theorem 4 and its corollaries have been used for demonstrating the monotonicity of the trend of the optimal value \|w^C\|^2 when varying C in T-SVM. In particular, if, for C_2 > C_1, \|w^{C_2}\|^2 = \|w^{C_1}\|^2 and \sum_{i=1}^{n} \xi_i^{C_2} = \sum_{i=1}^{n} \xi_i^{C_1}, then ∀ C_U ∈ [C_1, +∞) we have that \|w^{C_U}\|^2 = \|w^{C_1}\|^2 and \sum_{i=1}^{n} \xi_i^{C_U} = \sum_{i=1}^{n} \xi_i^{C_1}.

In other words, we want to search for the optimal value of C (namely C_{opt}) which allows us to find the solution of I-SVM exploiting the conventional T-SVM formulation (Theorems 1 and 2). Then, we start with an initialization value C = C_{start} (e.g. C_{start} = 1) and solve the conventional CCQP problem (12), using the efficient algorithms available in the literature:
• if \|w^{C_{start}}\|^2 > w_{MAX}^2, the solution of problem (15) will lie on the hypothesis space boundary, i.e. \|w^{C_{opt}}\|^2 = w_{MAX}^2. Then, we set C_{up} = C_{start} and we decrease C, exploiting the results of Theorem 4 and its corollaries, until \|w^C\|^2 \le w_{MAX}^2, looking for a feasible C_{low}. After a C_{low} and a C_{up} have been identified, we can simply apply a bisection method for finding C_{opt}, as \|w^C\|^2 is monotonic with respect to the variations of C;
• if \|w^{C_{start}}\|^2 \le w_{MAX}^2, we set C_{low} = C_{start} and we increase C:
  – if, for two consecutive values of C, we find the same \|w^C\|^2, then C_{opt} = C, as stated by Corollary 4, and \|w^{C_{opt}}\|^2 \le w_{MAX}^2. Then, the algorithm returns w^C, b^C, \xi^C;
  – otherwise, \|w^{C_{opt}}\|^2 = w_{MAX}^2 and we look for a feasible C_{up}. After a C_{low} and a C_{up} have been identified, we can apply the bisection method for searching C_{opt} (a sketch of this search is given below).
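Before the formal pseudocode, the following Python sketch mirrors this search, assuming only a routine qp_svm(C) that solves the conventional dual (12) and returns (w, b, ξ); scikit-learn's linear SVC is used here as a stand-in for any T-SVM solver, and the names and tolerances are ours.

import numpy as np
from sklearn.svm import SVC

def qp_svm(C, X, y):
    # Stand-in for any conventional T-SVM solver for problem (12).
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_.ravel(), clf.intercept_[0]
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return w, b, xi

def i_svm(X, y, w2_max, tol=1e-4, c_start=1.0):
    # Bisection over C, exploiting the monotonicity of ||w^C||^2 (Theorem 4).
    C = c_start
    w, b, xi = qp_svm(C, X, y)
    if w @ w > w2_max:
        c_up = C
        while True:                       # decrease C until feasible
            C /= 2.0
            w, b, xi = qp_svm(C, X, y)
            if w @ w <= w2_max:
                c_low = C
                break
    else:
        c_low = C
        w2_old = w @ w
        while True:                       # increase C until the boundary is crossed
            C *= 2.0
            w, b, xi = qp_svm(C, X, y)
            if w @ w > w2_max:
                c_up = C
                break
            if abs(w @ w - w2_old) < tol:  # margin stopped changing: early exit (Corollary 4)
                return w, b, xi
            w2_old = w @ w
    while abs(w @ w - w2_max) > tol:      # bisection between c_low and c_up
        C = 0.5 * (c_low + c_up)
        w, b, xi = qp_svm(C, X, y)
        if w @ w > w2_max:
            c_up = C
        else:
            c_low = C
    return w, b, xi

The early exit handles, in particular, linearly separable data, where \|w^C\|^2 saturates below w_{MAX}^2 and the Ivanov constraint never becomes active.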
The proposed method is presented in Algorithm 1: it is worth noting that only conventional QP solvers are required, for problem (12) (the QPSVM routine).

Algorithm 1  The algorithm for solving Eq. (15) using only conventional and efficient QP solvers.
  Input: a dataset D_n, w_{MAX}^2, a tolerance \tau
  Output: {w, b, \xi}
  C = 1
  {w, b, \xi} = QPSVM(C)
  if \|w\|^2 > w_{MAX}^2 then
    C_{up} = C
    while \|w\|^2 > w_{MAX}^2 do
      C = C / 2
      {w, b, \xi} = QPSVM(C)
      if \|w\|^2 < w_{MAX}^2 then
        C_{low} = C
        exit from the cycle
      end if
    end while
  else
    C_{low} = C
    w_{new} = \|w\|^2
    while \|w\|^2 < w_{MAX}^2 do
      C = C * 2
      w_{old} = w_{new}
      {w, b, \xi} = QPSVM(C)
      if \|w\|^2 > w_{MAX}^2 then
        C_{up} = C
        exit from the cycle
      end if
      w_{new} = \|w\|^2
      if w_{new} - w_{old} < \tau then
        return {w, b, \xi}
      end if
    end while
  end if
  while | \|w\|^2 - w_{MAX}^2 | > \tau do
    C = C_{low} + (C_{up} - C_{low}) / 2
    {w, b, \xi} = QPSVM(C)
    if \|w\|^2 > w_{MAX}^2 then
      C_{up} = C
    end if
    if \|w\|^2 < w_{MAX}^2 then
      C_{low} = C
    end if
  end while
  return {w, b, \xi}

VI. EXPERIMENTAL RESULTS

We consider the MNIST [25] and the DaimlerChrysler [26] datasets. The MNIST dataset consists of 62000 images, representing the numbers from 0 to 9: in particular, we consider the 13074 patterns containing 0's and 1's, allowing us to deal with a binary classification problem. The dimensionality of the data is 784 (i.e. 28×28 pixels). The DaimlerChrysler dataset consists of 9800 8-bit grayscale images (36×18 pixels), representing pedestrians crossing a road and non-pedestrian examples. We build the training set by randomly sampling a small number of patterns, varying from 10 to 100, while the remaining images are used as a test set. In order to obtain statistically relevant results, we build a set of 100 replicates using a random sampling technique. Note that the MNIST dataset is characterized by 784 features, while the dimensionality of the DaimlerChrysler problem is 648: both are much higher than the number of samples in each of the training sets, so the experiments define a typical small-sample setting.

TABLE I
TEST ERROR RATES OBTAINED USING MD ON TWO REAL-WORLD DATASETS (MNIST AND DAIMLERCHRYSLER).

  n      MNIST          DaimlerChrysler
  10     2.5 ± 0.6      24.9 ± 0.9
  20     2.4 ± 0.3      29.8 ± 0.7
  40     1.2 ± 0.3      27.2 ± 0.9
  60     0.6 ± 0.1      27.3 ± 0.8
  80     0.7 ± 0.2      25.3 ± 0.8
  100    0.5 ± 0.1      25.7 ± 0.5

Table I shows the average misclassification rates obtained on the test sets of the MNIST and DaimlerChrysler datasets, where the experimental setup is the following:
• the data are normalized in the range [0, 1];
• the model selection is performed by searching for the optimal value of w_{MAX} in the interval [10^{-3}, 10^{2}], which includes the cases of interest, among 40 values equally spaced on a logarithmic scale. The optimal model is chosen according to the generalization error, estimated using MD, where the I-SVM problems are solved exploiting Algorithm 1 (see the sketch below);
• the error on the test set is computed using the optimal model.
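The following sketch outlines this model-selection protocol, reusing the md_bound and i_svm routines sketched in the previous sections (both are illustrative stand-ins, not the authors' code); X_tr and y_tr denote a training replicate already normalized to [0, 1].

import numpy as np

def select_wmax(X_tr, y_tr, delta=0.05):
    best = (np.inf, None)
    for w_max in np.logspace(-3, 2, 40):           # 40 log-spaced values in [1e-3, 1e2]
        def train(X, y, w2=w_max ** 2):
            w, b, _ = i_svm(X, y, w2)              # Algorithm 1 (sketched above)
            return lambda Z: np.where(Z @ w + b >= 0, 1.0, -1.0)
        bound = md_bound(X_tr, y_tr, train, delta) # MD-based estimate of L(f), Eq. (7)
        if bound < best[0]:
            best = (bound, w_max)
    return best[1]                                 # w_MAX with the lowest estimated error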
VII. EXTENSION TO THE NON-LINEAR CASE

For the sake of simplicity, in this work we focused on the linear Support Vector Machine. However, the obtained results can be generalized to non-linear SVMs as well. By applying a mapping to the input samples,

  x → φ(x),    (27)

the SVM training problem based on Ivanov regularization becomes:

  \min_{w,b,\xi}  e^T \xi    (28)
    \|w\|^2 \le w_{MAX}^2
    y_i (w^T φ(x_i) + b) \ge 1 - \xi_i   ∀i ∈ [1, ..., n]
    \xi_i \ge 0                          ∀i ∈ [1, ..., n].

By introducing n Lagrange multipliers, we obtain a dual with the same form as (35), where the q_{ij} of Eq. (13) is computed by applying the usual kernel trick [1] and exploiting a kernel function K(·, ·):

  q_{ij} = y_i y_j φ(x_i) · φ(x_j) = y_i y_j K(x_i, x_j).    (29)

Analogously, the conventional dual formulation takes the same form as problem (12), where q_{ij} is computed according to Eq. (29). As all the theorems presented in Section IV hold also in the non-linear case, Algorithm 1 can be trivially generalized for solving problem (28).

VIII. CONCLUSIONS

We have shown how to apply conventional CCQP solvers to the solution of the SVM learning phase, when it is formulated as an Ivanov regularization problem. As confirmed by the experimental results, this approach makes in-sample methods comparable, at least in the considered cases, to widely used empirical methods for model selection. We hope that this work can help spread the use of the alternative Ivanov formulation of the SVM, which has several advantages with respect to the conventional one.

APPENDIX

In this Appendix we propose the proofs of the theorems and corollaries presented in Section IV-A.

Proof of Theorem 1: Let us compute the Lagrangian of the primal I-SVM problem (15):

  L = \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \beta_i [y_i (w^T x_i + b) - 1 + \xi_i] - \frac{\gamma}{2} (w_{MAX}^2 - w^T w) - \sum_{i=1}^{n} \delta_i \xi_i.    (30)

We obtain the following Karush-Kuhn-Tucker (KKT) conditions:

  \frac{\partial L}{\partial w} = 0  →  w = \frac{1}{\gamma} \sum_{i=1}^{n} \beta_i y_i x_i    (32)
  \frac{\partial L}{\partial b} = 0  →  \sum_{i=1}^{n} \beta_i y_i = 0    (33)
  \frac{\partial L}{\partial \xi_i} = 0  →  1 - \beta_i - \delta_i = 0  →  \beta_i \le 1   ∀i ∈ [1, ..., n]    (34)

and the dual formulation:

  \min_{\beta,\gamma}  \frac{1}{2\gamma} \sum_{i=1}^{n} \sum_{j=1}^{n} \beta_i \beta_j q_{ij} - \sum_{i=1}^{n} \beta_i + \frac{\gamma w_{MAX}^2}{2}    (35)
    0 \le \beta_i \le 1   ∀i ∈ [1, ..., n]
    \gamma \ge 0
    y^T \beta = 0.

Let us consider the conventional primal formulation (10): we can multiply the cost by 1/C (as C > 0) and add the constant value -w_{MAX}^2 / (2C) without affecting the optimal solution (w^C, b^C, \xi^C):

  \min_{w,b,\xi}  \frac{1}{2C} (\|w\|^2 - w_{MAX}^2) + \sum_{i=1}^{n} \xi_i    (36)
    y_i (w^T x_i + b) \ge 1 - \xi_i   ∀i ∈ [1, ..., n]
    \xi_i \ge 0                       ∀i ∈ [1, ..., n]

We can compute the dual of the previous problem. Its Lagrangian is:

  L = \frac{1}{2C} (\|w\|^2 - w_{MAX}^2) + \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \beta_i [y_i (w^T x_i + b) - 1 + \xi_i] - \sum_{i=1}^{n} \delta_i \xi_i,    (37)

obtaining the following KKT conditions:

  \frac{\partial L}{\partial w} = 0  →  w = C \sum_{i=1}^{n} \beta_i y_i x_i    (38)
  \frac{\partial L}{\partial b} = 0  →  \sum_{i=1}^{n} \beta_i y_i = 0    (39)
  \frac{\partial L}{\partial \xi_i} = 0  →  1 - \beta_i - \delta_i = 0  →  \beta_i \le 1   ∀i ∈ [1, ..., n].

The new dual formulation can be written as:

  \min_{\beta}  \frac{C}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \beta_i \beta_j q_{ij} - \sum_{i=1}^{n} \beta_i + \frac{w_{MAX}^2}{2C}    (40)
    0 \le \beta_i \le 1   ∀i ∈ [1, ..., n]
    y^T \beta = 0,

which is equivalent to (35) for

  C = \frac{1}{\gamma},    (41)

where \gamma > 0.

Proof of Theorem 2: As the T-SVM training problem is convex, from Eq. (10) we have that

  \frac{\|w^{w_{MAX}}\|^2}{2} + C \sum_{i=1}^{n} \xi_i^{w_{MAX}} \ge \frac{\|w^C\|^2}{2} + C \sum_{i=1}^{n} \xi_i^C.    (42)

As we supposed that C is such that \|w^C\|^2 = \|w^{w_{MAX}}\|^2, then:

  \sum_{i=1}^{n} \xi_i^{w_{MAX}} \ge \sum_{i=1}^{n} \xi_i^C.    (43)

However, as \xi^{w_{MAX}} is the solution of problem (15), which minimizes the sum of the slacks over a feasible set that contains (w^C, b^C, \xi^C), we must have that

  \sum_{i=1}^{n} \xi_i^{w_{MAX}} = \sum_{i=1}^{n} \xi_i^C.    (44)

Unfortunately, it is impossible to prove that, if \sum_{i=1}^{n} \xi_i^C = \sum_{i=1}^{n} \xi_i^{w_{MAX}} for some value of C, then the condition \|w^C\|^2 = \|w^{w_{MAX}}\|^2 is always verified. In fact, it is easy to find a counterexample: let us consider a linearly separable problem for which the constraint \|w\|^2 \le w_{MAX}^2 holds: in this case, there are infinitely many possible solutions w^{w_{MAX}} of problem (15). On the contrary, T-SVM allows us to choose, among them, the weight vector w^C such that the margin is maximized, i.e. we will have \|w^C\|^2 \le \|w^{w_{MAX}}\|^2: in this case, when solving I-SVM by exploiting the QP algorithms designed for T-SVM, we will force \|w^C\|^2 = \|w^{w_{MAX}}\|^2, without loss of generality.

Proof of Theorem 3: Let us proceed via reductio ad absurdum, i.e. suppose that K^{C_2} < K^{C_1}. Then:

  K^{C_1} = \frac{\|w^{C_1}\|^2}{2} + C_1 \sum_{i=1}^{n} \xi_i^{C_1} > \frac{\|w^{C_2}\|^2}{2} + C_2 \sum_{i=1}^{n} \xi_i^{C_2} \ge \frac{\|w^{C_2}\|^2}{2} + C_1 \sum_{i=1}^{n} \xi_i^{C_2}    (45)

as, by hypothesis, C_2 > C_1.
However, problem (10) is convex and thus characterized by a unique global minimum: then, either (w^{C_1}, b^{C_1}, \xi^{C_1}) is not the optimal solution, or C_2 \le C_1. In either case, the hypotheses of the theorem are contradicted and, then, it must be K^{C_2} \ge K^{C_1}.

Proof of Corollary 1: From Theorem 3, we have that K^{C_2} \ge K^{C_1}. Let us suppose that K^{C_2} = K^{C_1}; then:

  K^{C_1} = \frac{\|w^{C_1}\|^2}{2} + C_1 \sum_{i=1}^{n} \xi_i^{C_1} = \frac{\|w^{C_2}\|^2}{2} + C_2 \sum_{i=1}^{n} \xi_i^{C_2}.    (46)

Eq. (46) is valid only if \sum_{i=1}^{n} \xi_i^{C_2} = 0. In fact, if we suppose \sum_{i=1}^{n} \xi_i^{C_2} \ne 0, we can lower-bound Eq. (46) and get:

  K^{C_1} = \frac{\|w^{C_2}\|^2}{2} + C_2 \sum_{i=1}^{n} \xi_i^{C_2} > \frac{\|w^{C_2}\|^2}{2} + C_1 \sum_{i=1}^{n} \xi_i^{C_2},    (47)

where the right-hand side is the cost of a feasible point of the C_1 problem and is therefore not smaller than K^{C_1}, a contradiction. Thus, under the hypothesis of the corollary, it must be K^{C_2} > K^{C_1}.

Proof of Theorem 4: In this proof, we proceed by considering all the possible cases, proving by contradiction that configurations other than the ones of the thesis are not admissible.

As a first step, suppose \|w^{C_2}\|^2 < \|w^{C_1}\|^2. If \sum_{i=1}^{n} \xi_i^{C_2} < \sum_{i=1}^{n} \xi_i^{C_1}, then K^{C_2} < K^{C_1}, which is impossible (see Theorem 3). If \sum_{i=1}^{n} \xi_i^{C_2} = \sum_{i=1}^{n} \xi_i^{C_1}, then:

  \frac{\|w^{C_2}\|^2}{2} + C_1 \sum_{i=1}^{n} \xi_i^{C_2} < K^{C_1},    (48)

which contradicts the hypothesis that K^{C_1} is the global minimum and, then, is not admissible. If \sum_{i=1}^{n} \xi_i^{C_2} > \sum_{i=1}^{n} \xi_i^{C_1}, then:

  \frac{\|w^{C_2}\|^2}{2} + C_1 \sum_{i=1}^{n} \xi_i^{C_2} > \frac{\|w^{C_1}\|^2}{2} + C_1 \sum_{i=1}^{n} \xi_i^{C_1} = K^{C_1}.    (49)

From Eq. (49), we get:

  C_1 > \frac{\|w^{C_1}\|^2 - \|w^{C_2}\|^2}{2 \left( \sum_{i=1}^{n} \xi_i^{C_2} - \sum_{i=1}^{n} \xi_i^{C_1} \right)}.    (50)

Analogously, we have for K^{C_2}:

  \frac{\|w^{C_1}\|^2}{2} + C_2 \sum_{i=1}^{n} \xi_i^{C_1} > \frac{\|w^{C_2}\|^2}{2} + C_2 \sum_{i=1}^{n} \xi_i^{C_2} = K^{C_2},    (51)

from which we obtain:

  C_2 < \frac{\|w^{C_1}\|^2 - \|w^{C_2}\|^2}{2 \left( \sum_{i=1}^{n} \xi_i^{C_2} - \sum_{i=1}^{n} \xi_i^{C_1} \right)}.    (52)

By joining Eqns. (50) and (52), we have that C_2 < C_1, which contradicts the theorem hypotheses.

Suppose now that \|w^{C_2}\|^2 = \|w^{C_1}\|^2. If \sum_{i=1}^{n} \xi_i^{C_2} < \sum_{i=1}^{n} \xi_i^{C_1}, then

  \frac{\|w^{C_2}\|^2}{2} + C_1 \sum_{i=1}^{n} \xi_i^{C_2} < K^{C_1},    (53)

which is impossible, as we supposed that K^{C_1} is the global minimum. Analogously, if \sum_{i=1}^{n} \xi_i^{C_2} > \sum_{i=1}^{n} \xi_i^{C_1}, then:

  \frac{\|w^{C_1}\|^2}{2} + C_2 \sum_{i=1}^{n} \xi_i^{C_1} < K^{C_2},    (54)

or, in other words, K^{C_2} is not the global minimum. This contradicts the hypothesis and, thus, this configuration is not admissible.

Finally, let us consider \|w^{C_2}\|^2 > \|w^{C_1}\|^2. If \sum_{i=1}^{n} \xi_i^{C_2} = \sum_{i=1}^{n} \xi_i^{C_1}, then

  \frac{\|w^{C_1}\|^2}{2} + C_2 \sum_{i=1}^{n} \xi_i^{C_2} < K^{C_2},    (55)

which is, again, impossible, as K^{C_2} is supposed to be the global minimum. As a last step, let us consider the case \sum_{i=1}^{n} \xi_i^{C_2} > \sum_{i=1}^{n} \xi_i^{C_1}, for which we have:

  \frac{\|w^{C_1}\|^2}{2} + C_2 \sum_{i=1}^{n} \xi_i^{C_1} < K^{C_2},    (56)

which is, once more, impossible since K^{C_2} is the global minimum. Thus, all the configurations of \|w^{C_{1,2}}\|^2 and \sum_{i=1}^{n} \xi_i^{C_{1,2}} other than the ones of the thesis are not admissible.

Proof of Corollary 2: The proof is a simple application of Theorem 4, as no configurations with \|w^{C_2}\|^2 < \|w^{C_1}\|^2 are admissible.

Proof of Corollary 3: As C_1 \le C_S, from Theorem 4 we know that only two configurations are possible:
• if \|w^{C_S}\|^2 > \|w^{C_1}\|^2, then \sum_{i=1}^{n} \xi_i^{C_S} < \sum_{i=1}^{n} \xi_i^{C_1};
• if \|w^{C_S}\|^2 = \|w^{C_1}\|^2, then \sum_{i=1}^{n} \xi_i^{C_S} = \sum_{i=1}^{n} \xi_i^{C_1}.
Analogously, as C_S \le C_2, we have:
• if \|w^{C_2}\|^2 > \|w^{C_S}\|^2, then \sum_{i=1}^{n} \xi_i^{C_2} < \sum_{i=1}^{n} \xi_i^{C_S};
• if \|w^{C_2}\|^2 = \|w^{C_S}\|^2, then \sum_{i=1}^{n} \xi_i^{C_2} = \sum_{i=1}^{n} \xi_i^{C_S}.
As we supposed that \|w^{C_2}\|^2 = \|w^{C_1}\|^2, the configurations of Eqns. (23) and (24) are the only ones which do not violate the corollary hypotheses.
Proof of Corollary 4: When varying the hyperparameter C, the optimal weight vector w^C, the bias b^C and the slack variables \xi^C are characterized by a piecewise linear trend, as shown by Hastie et al. [24], where the discontinuity points are countable and computable. In every interval between two consecutive discontinuity points, the SVM classification function can be expressed as

  f(x) = \frac{C}{C^l} \left[ f^l(x) - h^l(x) \right] + h^l(x)    (57)

where C^l is the value of the hyperparameter at the last discontinuity point and f^l(x) is the solution of the SVM training problem for C = C^l. Moreover, h^l(x) is defined as:

  h^l(x) = \sum_{i \in SV} \alpha_i^l y_i x^T x_i + b    (58)

where \alpha^l is the solution of problem (12) for C = C^l and SV is the set of indexes of the Support Vectors. For the sake of simplicity, let us suppose that C^l = C_1. If, as hypothesized, we obtain the same SVM solution for two distinct values C_1 and C_2, then

  f^l(x) - h^l(x) = 0    (59)

and C_2 is not a discontinuity point, as the elbow set E^l of [24] is not modified. Thus, we can safely move to the following discontinuity point C^{l+1}, which occurs in correspondence of

  C^{l+1} = C^l \min_{i \in SV} \frac{y_i - h^l(x_i)}{f^l(x_i) - h^l(x_i)}.    (60)

However, from Eq. (59), we have that

  C^{l+1} → +∞,    (61)

so we have the same solution \|w^{C_U}\|^2 = \|w^{C_2}\|^2 = \|w^{C_1}\|^2 and \sum_{i=1}^{n} \xi_i^{C_U} = \sum_{i=1}^{n} \xi_i^{C_2} = \sum_{i=1}^{n} \xi_i^{C_1}, ∀ C_U ∈ [C_1, +∞).

REFERENCES

[1] V. Vapnik, "An overview of statistical learning theory", IEEE Trans. on Neural Networks, vol. 10, pp. 988–999, 1999.
[2] T. Joachims, "Learning to Classify Text using Support Vector Machines: Methods, Theory, and Algorithms", Kluwer, 2002.
[3] T. Jaakkola, M. Diekhans, D. Haussler, "A discriminative framework for detecting remote protein homologies", Journal of Computational Biology, vol. 7, pp. 95–114, 2000.
[4] B. Schoelkopf, K.K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, V. Vapnik, "Comparing Support Vector Machines with Gaussian Kernels to Radial Basis Function Classifiers", IEEE Trans. on Signal Processing, vol. 45, pp. 2758–2765, 1997.
[5] I. Guyon, A. Saffari, G. Dror, G. Cawley, "Model Selection: Beyond the Bayesian/Frequentist Divide", Journal of Machine Learning Research, vol. 11, pp. 61–87, 2010.
[6] D. Anguita, S. Ridella, S. Rivieccio, "K-fold generalization capability assessment for support vector classifiers", Proc. of the Int. Joint Conf. on Neural Networks, pp. 855–858, Montreal, Canada, 2005.
[7] B. Efron, R. Tibshirani, "An Introduction to the Bootstrap", Chapman and Hall, 1993.
[8] O. Bousquet, A. Elisseeff, "Stability and generalization", Journal of Machine Learning Research, vol. 2, pp. 499–526, 2002.
[9] D. Anguita, A. Ghio, N. Greco, L. Oneto, S. Ridella, "Model selection for support vector machines: Advantages and disadvantages of the machine learning theory", Proc. of the Int. Joint Conf. on Neural Networks, 2010.
[10] U.M. Braga-Neto, E.R. Dougherty, "Is cross-validation valid for small-sample microarray classification?", Bioinformatics, vol. 20, pp. 374–380, 2004.
[11] A. Isaksson, M. Wallman, H. Goeransson, M.G. Gustafsson, "Cross-validation and bootstrapping are unreliable in small sample classification", Pattern Recognition Letters, vol. 29, pp. 1960–1965, 2008.
[12] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, "Choosing multiple parameters for Support Vector Machines", Machine Learning, vol. 46, pp. 131–159, 2002.
[13] S. Floyd, M. Warmuth, "Sample compression, learnability, and the Vapnik-Chervonenkis dimension", Machine Learning, vol. 21, pp. 1–36, 1995.
[14] V.N. Vapnik, "The nature of statistical learning theory", Springer Verlag, 2000.
[15] P.L. Bartlett, S. Boucheron, G. Lugosi, "Model selection and error estimation", Machine Learning, vol. 48, pp. 85–113, 2002.
[16] K. Duan, S.S. Keerthi, A.N. Poo, "Evaluation of simple performance measures for tuning SVM hyperparameters", Neurocomputing, vol. 51, pp. 41–59, 2003.
[17] D. Anguita, A. Ghio, S. Ridella, "Maximal Discrepancy for Support Vector Machines", Neurocomputing, (in press), 2011.
[18] A.N. Tikhonov, V.Y. Arsenin, "Solution of Ill-posed Problems", Winston & Sons, 1977.
[19] L. Bottou, C.J. Lin, "Support Vector Machine Solvers", in "Large Scale Learning Machines", edited by L. Bottou, O. Chapelle, D. DeCoste, J. Weston, The MIT Press, pp. 1–28, 2007.
[20] V.V. Ivanov, "The Theory of Approximate Methods and Their Application to the Numerical Solution of Singular Integral Equations", Nordhoff International, 1976.
[21] L. Martein, S. Schaible, "On solving a linear program with one quadratic constraint", Decisions in Economics and Finance, vol. 10, 1987.
[22] J.C. Platt, "Fast training of support vector machines using sequential minimal optimization", in "Advances in Kernel Methods", pp. 185–208, The MIT Press, 1999.
[23] C.J. Lin, "Asymptotic convergence of an SMO algorithm without any assumption", IEEE Trans. on Neural Networks, vol. 13, pp. 248–250, 2002.
[24] T. Hastie, S. Rosset, R. Tibshirani, J. Zhu, "The entire regularization path for the support vector machine", Journal of Machine Learning Research, vol. 5, pp. 1391–1415, 2004.
[25] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, Y. Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation", Proc. of the Int. Conf. on Machine Learning, pp. 473–480, 2007.
[26] S. Munder, D.M. Gavrila, "An Experimental Study on Pedestrian Classification", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 28, pp. 1863–1868, 2006.