
In-sample and Out-of-sample Model Selection and Error
Estimation for Support Vector Machines
Authors: Davide Anguita, Alessandro Ghio, Luca Oneto, Sandro Ridella
Submitted to: IEEE Transactions on Neural Networks and Learning Systems
Disclaimer
The following article has been accepted for publication by the IEEE Transactions on Neural
Networks and Learning Systems, to which the copyright is transferred. The authors distributed this
copy for personal use to potentially interested parties. No commercial use may be made of the
article or the work described in it, nor is re-distribution of such work allowed.
In–sample and Out–of–sample Model Selection and
Error Estimation for Support Vector Machines
Davide Anguita, Member, IEEE, Alessandro Ghio, Member, IEEE,
Luca Oneto, Member, IEEE, and Sandro Ridella, Member, IEEE
Abstract—In–sample approaches to model selection and error
estimation of Support Vector Machines (SVMs) are not as
widespread as out–of–sample methods, where part of the data is
removed from the training set for validation and testing purposes,
mainly because their practical application is not straightforward
and the latter provide, in many cases, satisfactory results. In this
paper, we survey some recent and not-so-recent results of the
data–dependent Structural Risk Minimization (SRM) framework
and propose a proper re-formulation of the SVM learning
algorithm, so that the in–sample approach can be effectively
applied. The experiments, performed both on simulated and
real–world datasets, show that our in–sample approach can
be favorably compared to out–of–sample methods, especially
in cases where the latter provide questionable results. In
particular, when the number of samples is small compared to
their dimensionality, like in classification of microarray data, our
proposal can outperform conventional out–of–sample approaches
like the Cross Validation, the Leave–One–Out or the Bootstrap.
Index Terms—Structural Risk Minimization, Model Selection,
Error Estimation, Statistical Learning Theory, Support Vector
Machine, Cross Validation, Leave One Out, Bootstrap
I. INTRODUCTION
Model selection addresses the problem of tuning the
complexity of a classifier to the available training data,
so as to avoid either under– or overfitting [1]. These problems
affect most classifiers because, in general, their complexity
is controlled by one or more hyperparameters, which must
be tuned separately from the training process in order to
achieve optimal performance. Some examples of tunable
hyperparameters are the number of hidden neurons or the
amount of regularization in Multi Layer Perceptrons (MLPs)
[2], [3] and the margin/error trade–off or the value of the
kernel parameters in Support Vector Machines (SVMs) [4],
[5], [6]. Strictly related to this problem is the estimation of the
generalization error of a classifier: in fact, the main objective
of building an optimal classifier is to choose both its parameters and hyperparameters, so as to minimize its generalization
error and compute an estimate of this value for predicting
the classification performance on future data. Unfortunately,
despite the large amount of work on this important topic, the
problem of model selection and error estimation of SVMs is
still open and the objective of extensive research [7], [8], [9],
[10], [11], [12].
Among the several methods proposed for this purpose, it
is possible to identify two main approaches: out–of–sample
Davide Anguita, Alessandro Ghio, Luca Oneto and Sandro Ridella are
with the DITEN Department, University of Genova, Via Opera Pia 11A, I-16145 Genova, Italy (email: {Davide.Anguita, Alessandro.Ghio, Luca.Oneto,
Sandro.Ridella}@unige.it).
and in–sample methods. The first ones are favored by practitioners because they work well in many situations and allow
the application of simple statistical techniques for estimating
the quantities of interest. Some examples of out–of–sample
methods are the well–known k–Fold Cross Validation (KCV),
the Leave–One–Out (LOO), and the Bootstrap (BTS) [13],
[14], [15]. All these techniques rely on a similar idea: the
original dataset is resampled, with or without replacement,
to build two independent datasets called, respectively, the
training and validation (or estimation) sets. The first one is
used for training a classifier, while the second one is exploited
for estimating its generalization error, so that the hyperparameters can be tuned to achieve its minimum value. Note
that both error estimates computed through the training and
validation sets are, obviously, optimistically biased; therefore,
if a generalization estimate of the final classifier is desired, it
is necessary to build a third independent set, called the test set,
by nesting two of the resampling procedures mentioned above.
Unfortunately, this additional splitting of the original dataset
results in a further shrinking of the available learning data and
contributes to a further increase of the computational burden.
Furthermore, after the learning and model selection phases,
the user is left with several classifiers (e.g. k classifiers in
the case of KCV), each one with possibly different values of
the hyperparameters; combining them, or retraining a final
classifier on the entire dataset, can lead to unexpected results [16].
Despite these drawbacks, when a reasonably large amount of
data is available, out–of–sample techniques work reasonably
well. However, there are several settings where their use has
been questioned by many researchers [17], [18], [19]. In particular, the main difficulties arise in the small–sample regime
or, in other words, when the size of the training set is small
compared to the dimensionality of the patterns. A typical example is the case of microarray data, where less than a hundred
samples, composed of thousands of genes, are often available
[20]. In these cases, in–sample methods would represent the
obvious choice for performing the model selection phase: in
fact, they allow exploiting the whole set of available data for
both training the model and estimating its generalization error,
thanks to the application of rigorous statistical procedures.
Despite their unquestionable advantages with respect to out-of-sample methods, their use is not widespread: one of the
reasons is the common belief that in–sample methods are very
useful for gaining deep theoretical insights on the learning
process or for developing new learning algorithms, but they
are not suitable for practical purposes. The SVM itself, that is
one of the most successful classification algorithms of the last
decade, stems from the well–known Vapnik’s Structural
Risk Minimization (SRM) principle [5], [21], which represents
the seminal approach to in–sample methods. However, SRM
is not able, in practice, to estimate the generalization error
of the trained classifier or select its optimal hyperparameters
[22], [23]. Similar principles are equally interesting from a
theoretical point of view, but seldom useful in practice [21],
[24], [25] as they are overly pessimistic. In the past years,
some proposals have heuristically adapted the SRM principle
to in–sample model selection purposes, with some success, but
they had to give up its theoretical rigour, thus compromising its
applicability [26].
We present in this work a new method for applying a data–
dependent SRM approach [27] to model selection and error
estimation, by exploiting new results in the field of Statistical
Learning Theory (SLT) [28]. In particular, we describe an in–
sample method for applying the data–dependent SRM principle to a slightly modified version of the SVM. Our approach
is general, but is particularly effective in performing model
selection and error estimation in the small–sample setting: in
these cases, it is able to outperform out–of–sample techniques.
The novelty of our approach is the exploitation of new results
on the Maximal Discrepancy and Rademacher Complexity
theory [28], trying not to give up any theoretical rigor while
achieving good performance in practice. Our purpose is not to
claim the general superiority of in–sample methods over out–
of–sample ones, but to explore advantages and disadvantages
of both approaches, in order to understand why and when they
can be successfully applied. For this reason, a theoretically
rigorous analysis of out–of–sample methods is also presented.
Finally, we show that the proposed in–sample method allows
using a conventional quadratic programming solver for SVMs
to control the complexity of the classifier. In other words, even
if we make use of a modified SVM, to allow for the application
of the in–sample approach, any well–known optimization algorithm like, for example, the Sequential Minimal Optimization
(SMO) method [29], [30] can be used for performing the
training, the model selection and the error estimation phases,
as in the out-of-sample cases.
The paper is organized as follows: Section II details the
classification problem framework and describes the in–sample
and out–of–sample general approaches. Sections III and
IV survey old and new statistical tools, which are the basis
for the subsequent analysis of out–of–sample and in–sample
methods. Section V proposes a new method for applying the
data–dependent SRM approach to the model selection and
error estimation of a modified Support Vector Machine and
details also an algorithm for exploiting conventional SVM–
specific Quadratic Programming solvers. Finally, Section VI
shows the application of our proposal to real–world small–
sample problems, along with a comparison to out–of–sample
methods.
II. THE CLASSIFICATION PROBLEM FRAMEWORK
We consider a binary classification problem, with an input
space X ∈ ℜ^d and an output space Y ∈ {−1, +1}. We
assume that the data (x, y), with x ∈ X and y ∈ Y, is
composed of random variables distributed according to an
unknown distribution P and we observe a sequence of n
independent and identically distributed (i.i.d.) pairs Dn =
{(x_1, y_1), . . . , (x_n, y_n)}, sampled according to P. Our
goal is to build a classifier or, in other words, to construct
a function f : X → Y, which predicts Y from X .
Obviously, we need a criterion to choose f , therefore we
measure the expected error performed by the selected function
on the entire data population, i.e. the risk:
$$L(f) = \mathbb{E}_{(\mathcal{X},\mathcal{Y})}\, \ell(f(x), y), \qquad (1)$$
where ℓ(f (x), y) is a suitable loss function, which measures
the discrepancy between the prediction of f and the true Y,
according to some user–defined criteria. Some examples of
loss functions are the hard loss
$$\ell_I(f(x), y) = \begin{cases} 0 & \text{if } yf(x) > 0 \\ 1 & \text{otherwise,} \end{cases} \qquad (2)$$
which is an indicator function that simply counts the number
of misclassified samples; the hinge loss, which is used by the
SVM algorithm and is a convex upper bound of the previous
one [4]:
$$\ell_H(f(x), y) = \max\left(0, 1 - yf(x)\right); \qquad (3)$$
the logistic loss, which is used for obtaining probabilistic
outputs from a classifier [31]:
$$\ell_L(f(x), y) = \frac{1}{1 + e^{yf(x)}}, \qquad (4)$$
and, finally, the soft loss [32], [33]:
$$\ell_S(f(x), y) = \min\left(1, \max\left(0, \frac{1 - yf(x)}{2}\right)\right), \qquad (5)$$
which is a piecewise linear approximation of the former and
a clipped version of the hinge loss.
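For concreteness, a minimal Python sketch of the four losses of Eqs. (2)–(5), vectorized with NumPy; the function names are ours, not from the paper:

```python
import numpy as np

def hard_loss(f_x, y):
    # Eq. (2): 0 if the sample is correctly classified, 1 otherwise.
    return (y * f_x <= 0).astype(float)

def hinge_loss(f_x, y):
    # Eq. (3): convex upper bound of the hard loss, used by the SVM.
    return np.maximum(0.0, 1.0 - y * f_x)

def logistic_loss(f_x, y):
    # Eq. (4): probability assigned to the wrong class by a sigmoidal output.
    return 1.0 / (1.0 + np.exp(y * f_x))

def soft_loss(f_x, y):
    # Eq. (5): hinge loss clipped to [0, 1], a piecewise linear
    # approximation of the logistic loss.
    return np.clip((1.0 - y * f_x) / 2.0, 0.0, 1.0)
```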
Let us consider a class of functions F. Thus, the optimal
classifier f ∗ ∈ F is:
$$f^* = \arg\min_{f \in \mathcal{F}} L(f). \qquad (6)$$
Since P is unknown, we cannot directly evaluate the risk,
nor find $f^*$. The only available option is to devise a method
for selecting F, and consequently f ∗ , based on the available
data and, possibly, some a–priori knowledge. Note that the
model selection problem consists, generally speaking, in the
identification of a suitable F: in fact, the hyperparameters
of the classifier affect, directly or indirectly, the function
class where the learning algorithm searches for the, possibly
optimal, function [21], [34].
The Empirical Risk Minimization (ERM) approach suggests
to estimate the true risk L(f ) by its empirical version:
$$\hat{L}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i), \qquad (7)$$
so that
$$f_n^* = \arg\min_{f \in \mathcal{F}} \hat{L}_n(f). \qquad (8)$$
Unfortunately, $\hat{L}_n(f)$ typically underestimates L(f) and can
lead to severe overfitting because, if the class of functions is
sufficiently large, it is always possible to find a function that
perfectly fits the data but shows poor generalization capability.
For this reason, it is necessary to perform a model selection
step, by selecting an appropriate F, so to avoid classifiers that
are prone to overfit the data.
A typical approach is to study the random variable $L(f) - \hat{L}_n(f)$, which represents the generalization bias of the classifier f. In particular, given a user–defined confidence value
δ, the objective is to bound the probability that the true risk
exceeds the empirical one:
$$P\left[L(f) \geq \hat{L}_n(f) + \epsilon\right] \leq \delta, \qquad (9)$$
leading to bounds of the following form, which hold with
probability (1 − δ):
$$L(f) \leq \hat{L}_n(f) + \epsilon. \qquad (10)$$
Eq. (10) can be used to select an optimal function class and,
consequently, an optimal classifier, by minimizing the term on
the right side of the inequality.
The out–of–sample approach suggests to use an independent
dataset, sampled from the same data distribution that generated
the training set, so that the bound is valid, for any classifier,
even after it has learned the Dn set. Given additional m samples Dm = {(x1 , y1 ), . . . , (xm , ym )} and given a classifier
fn , which has been trained on Dn , its generalization error can
be upper bounded, in probability, according to:
$$L(f_n) \leq \hat{L}_m(f_n) + \epsilon, \qquad (11)$$
where the bound holds with probability (1−δ). Then the model
selection phase can be performed by varying the hyperparameters of the classifier until the right side of Eq. (11) reaches
its minimum. In particular, let us suppose to consider several
function classes F1 , F2 , . . ., indexed by different values of the
hyperparameters, then the optimal classifier $f_n^* = f_{n,i^*}^*$ is the
result of the following minimization process:
$$i^* = \arg\min_i \left[\hat{L}_m(f_{n,i}^*) + \epsilon\right], \qquad (12)$$
where
$$f_{n,i}^* = \arg\min_{f \in \mathcal{F}_i} \hat{L}_n(f). \qquad (13)$$
Note that, if we are interested in estimating the generalization
error of fn∗ , we need to apply again the bound of Eq. (11), but
using some data (i.e. the test set) that has not been involved in
this procedure. It is also worth mentioning that the partition of
the original dataset into training and validation (and eventually
test) sets can affect the tightness of the bound, due to lucky or
unlucky splittings, and, therefore, its effectiveness. This is a
major issue for out–of–sample methods and several heuristics
have been proposed in the literature for dealing with this
problem (e.g. stratified sampling or topology–based splitting
[35]) but they will not be analyzed here, as they are outside
of the scope of this paper. Eq. (11) can be rewritten as
$$L(f_n) \leq \hat{L}_n(f_n) + \left(\hat{L}_m(f_n) - \hat{L}_n(f_n)\right) + \epsilon, \qquad (14)$$
clearly showing that the out–of–sample approach can be considered as a penalized ERM, where the penalty term takes into
account the discrepancy between the classifier performance on
the training and the validation set. This formulation explains
also other approaches to model selection like, for example,
the early stopping procedure, which is widely used in neural
network learning [36]. In fact, Eq. (14) suggests to stop the
learning phase when the performance of the classifier on the
training and validation sets begins to diverge.

Fig. 1: The Structural Risk Minimization principle.
The in–sample approach, instead, targets the use of the same
dataset for learning, model selection and error estimation,
without resorting to additional samples. In particular, this
approach can be summarized as follows: a learning algorithm
takes as input the data Dn and produces a function fn and
an estimate of the error $\hat{L}_n(f_n)$, which is a random variable
depending on the data themselves. As we cannot know a priori
which function will be chosen by the algorithm, we
consider uniform deviations of the error estimate:
$$L(f_n) - \hat{L}_n(f_n) \leq \sup_{f \in \mathcal{F}} \left(L(f) - \hat{L}_n(f)\right). \qquad (15)$$
Then, the model selection phase can be performed according to
a data–dependent version of the Structural Risk Minimization
(SRM) framework [27], which suggests choosing a possibly
infinite sequence $\{\mathcal{F}_i, i = 1, 2, \ldots\}$ of model classes of increasing complexity, $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \ldots$ (Figure 1), and minimizing
the empirical risk in each class with an added penalty term,
which, in our case, gives rise to bounds of the following form:
$$L(f_n) \leq \hat{L}_n(f_n) + \sup_{f \in \mathcal{F}} \left(L(f) - \hat{L}_n(f)\right). \qquad (16)$$
From Eq. (16) it is clear that the price to pay for avoiding
the use of additional validation/test sets is the need to take
into account the behavior of the worst possible classifier in the
class, while the out–of–sample approach focuses on the actual
learned classifier.
Applying Eq. (16) for model selection and error estimation
purposes is a straightforward operation, at least in theory.
A very small function class is selected and its size (i.e. its
complexity) is increased, by varying the hyperparameters, until
the bound reaches its minimum, which represents the optimal
trade–off between under– and overfitting, and therefore, iden-
tifies the optimal classifier:
$$f_n^* = \arg\min_{\substack{f \in \mathcal{F}_i \\ \mathcal{F}_i \in \{\mathcal{F}_1, \mathcal{F}_2, \ldots\}}} \left[\hat{L}_n(f) + \sup_{f \in \mathcal{F}_i}\left(L(f) - \hat{L}_n(f)\right)\right]. \qquad (17)$$
Furthermore, after the procedure has been completed, the value
of Eq. (16) provides, by construction, a probabilistic upper
bound of the error rate of the selected classifier.

A. The need for a strategy for selecting the class of functions

From the previous analysis, it is obvious that the class of
functions F, where the classifier $f_n^*$ is searched, plays a central
role in the successful application of the in–sample approach.
However, in the conventional data–dependent SRM formulation, the function space is arbitrarily centered and this choice
severely influences the sequence $\mathcal{F}_i$. Its detrimental effect on
Eq. (16) can be clearly understood through the example of
Fig. 2, where we assume that the optimal classifier $f^*$ is known.
In this case, in fact, the hypothesis space $\mathcal{F}_{i^*}$, which includes
$f_n^*$, is characterized by a large penalty term and, thus, $f_n^*$ is
greatly penalized with respect to other models. Furthermore,
the bound of Eq. (16) becomes very loose, as the penalty term
takes into account the entire $\mathcal{F}_{i^*}$ class, so the generalization
error of the chosen classifier is greatly overestimated.

Fig. 2: Hypothesis spaces with different centroids.

The main problem lies in the fact that $f_n^*$ is ‘far’ from the
aprioristically chosen centroid $f_0$ of the sequence $\mathcal{F}_i$. On the
contrary, if we were able to define a sequence of function
classes $\mathcal{F}_i'$, $i = 1, 2, \ldots$, centered on a function $f_0'$ sufficiently
close to $f^*$, the penalty term would be noticeably reduced and
we would be able to improve both the SRM model selection
and error estimation phases, by choosing a model close to the
optimal one. We argue that this is one of the main reasons
why the in–sample approach has not been considered
effective so far and that this line of research should be better
explored, if we are interested in building better classification
algorithms and, at the same time, more reliable performance
estimates.

In the recent literature, the data–dependent selection of the
centroid has been theoretically approached, for example, in
[27], but only few authors have proposed some methods for
dealing with this problem [37]. One example is the use of
Localized Rademacher Complexities [38] or, in other words,
the study of penalty terms that take into account only the
classifiers with low empirical error: although this approach is
very interesting from a theoretical point of view, its application
in practice is not evident. A more practical approach has been
proposed by Vapnik and other authors [21], [39], introducing
the concept of Universum, i.e. a dataset composed of samples
that do not belong to any class represented in the training set.
However, no generalization bounds, like the ones that will be
presented here, have been proposed for this approach.

III. A THEORETICAL ANALYSIS OF OUT-OF-SAMPLE TECHNIQUES
Out–of–sample methods are favored by practitioners because they work well in many real cases and are simple to
implement. Here we present a rigorous statistical analysis of
two well known out–of–sample methods for model selection
and error estimation: the k–fold Cross Validation (KCV) and
the Bootstrap (BTS). The philosophy of the two methods
is similar: part of the data is left out from the training set
and is used for estimating the error of the classifier that
has been found during the learning phase. The splitting of
the original training data is repeated several times, in order
to average out unlucky cases; therefore, the entire procedure
produces several classifiers (one for each data splitting). Note
that, from the point of view of our analysis, it is not statistically
correct to select one of the trained classifiers to perform the
classification of new samples because, in this case, the samples
of the validation (or test) set would not be i.i.d. anymore. For
this reason, every time a new sample is received, the user
should randomly select one of the classifiers, so that the error
estimation bounds, one for each trained classifier, can be safely
averaged.
A. Bounding the true risk with out–of–sample data
Depending on the chosen loss function, we can apply different statistical tools to estimate the classifier error. When dealing with the hard loss, we are considering sums of Bernoulli
random variables, so we can use the well–known one–sided
Clopper–Pearson bound [40], [41]. Given $t = m\hat{L}_m(f_n)$
misclassifications, and defining $p = \hat{L}_m(f_n) + \epsilon$, the
errors follow a Binomial distribution:
$$B(t; m, p) = \sum_{j=0}^{t} \binom{m}{j} p^j (1-p)^{m-j}, \qquad (18)$$
so we can bound the generalization error by computing the
inverse of the Binomial tail:
$$\epsilon^*(\hat{L}_m, m, \delta) = \max_{\epsilon}\left\{\epsilon : B(t; m, \hat{L}_m + \epsilon) \geq \delta\right\} \qquad (19)$$
and, therefore, with probability (1 − δ):
$$L(f_n) \leq \hat{L}_m(f_n) + \epsilon^*(\hat{L}_m, m, \delta). \qquad (20)$$
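For illustration, a small Python sketch of the inversion in Eq. (19), using SciPy's binomial CDF and a root finder; the helper name and the example numbers are ours:

```python
from scipy.stats import binom
from scipy.optimize import brentq

def clopper_pearson_epsilon(L_hat, m, delta):
    """Invert the Binomial tail of Eq. (19): largest epsilon such that
    B(t; m, L_hat + epsilon) >= delta, with t = m * L_hat misclassifications."""
    t = int(round(m * L_hat))
    if binom.cdf(t, m, 1.0) >= delta:  # degenerate case (t = m)
        return 1.0 - L_hat
    # binom.cdf(t, m, p) decreases in p, so the boundary is a simple root.
    p_star = brentq(lambda p: binom.cdf(t, m, p) - delta, L_hat, 1.0)
    return p_star - L_hat

# Example: 3 errors on a 100-sample test set, 95% confidence.
print(clopper_pearson_epsilon(0.03, 100, 0.05))
```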
More explicit bounds, albeit less sharp, are available in the
literature and allow gaining a better insight on the behavior of
the error estimate. Among them are the so–called Empirical
Chernoff Bound [40] and the more recent Empirical Bernstein
Bound [42], [43], which is valid for bounded random variables,
including Bernoulli ones:
$$L(f_n) \leq \hat{L}_m(f_n) + s\sqrt{\frac{2\ln\frac{2}{\delta}}{m}} + \frac{7\ln\frac{2}{\delta}}{3(m-1)}, \qquad (21)$$
where $s = \sqrt{\hat{L}_m(f_n)\left(1 - \hat{L}_m(f_n)\right)}$. This bound can be easily
related to Eq. (11) and shows clearly that the classifier error
decays at a pace between $O(m^{-1})$ and $O(m^{-1/2})$, depending
on the performance of the trained classifier on the validation
dataset.
The hinge loss, unfortunately, gives rise to bounds that
decay at a much slower pace with respect to the previous one.
In fact, as noted in [44] almost fifty years ago, when dealing
with a positive unbounded random variable, the Markov inequality cannot be improved, as there are some distributions for
which the equality is attained. Therefore, the assumption that
the loss function is bounded becomes crucial to obtain any
improvement over the Markov bound.
On the contrary, the soft and the logistic losses show a
behavior similar to the hard loss: in fact, Eq. (21) can be used
in these cases as well. In this paper, however, we propose to
use a tighter bound, which was conceived by Hoeffding in
[44] and has been neglected in the literature, mainly because
it cannot be put in closed form. With our notation, the bound
is:
$$P\left[L(f) - \hat{L}_m(f) > \epsilon\right] \leq \left[\left(\frac{\hat{L}_m(f)}{\hat{L}_m(f) + \epsilon}\right)^{\hat{L}_m(f)} \left(\frac{1 - \hat{L}_m(f)}{1 - \hat{L}_m(f) - \epsilon}\right)^{1 - \hat{L}_m(f)}\right]^{m}. \qquad (22)$$
By equating the right part of Eq. (22) to δ and solving it
numerically, we can find the value $\epsilon^*$, that can be inserted in
Eq. (20), as a function of δ, m and $\hat{L}_m(f)$.
Note that the above inequality is, in practice, as tight as the
one derived by the application of the Clopper–Pearson bound
[45], [46], is numerically more tractable and is valid for both
hard and soft losses, so it will be used in the rest of the paper
when dealing with the out–of–sample approach.
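A minimal sketch of this numerical inversion, assuming the form of Eq. (22) reconstructed above and using plain bisection; the helper name is ours:

```python
import math

def hoeffding_epsilon(L_hat, m, delta, tol=1e-10):
    """Find epsilon such that the right-hand side of Eq. (22) equals delta."""
    def rhs(eps):
        # m * [ L ln(L/(L+eps)) + (1-L) ln((1-L)/(1-L-eps)) ], with 0*ln(0) := 0
        log_val = 0.0
        if L_hat > 0.0:
            log_val += L_hat * math.log(L_hat / (L_hat + eps))
        if L_hat < 1.0:
            log_val += (1.0 - L_hat) * math.log((1.0 - L_hat) / (1.0 - L_hat - eps))
        return math.exp(m * log_val)

    lo, hi = 0.0, 1.0 - L_hat  # rhs decreases from 1 to 0 on this interval
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rhs(mid) > delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(hoeffding_epsilon(0.03, 100, 0.05))
```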
B. k–fold Cross Validation
The KCV technique consists in splitting a dataset in k
independent subsets and using, in turn, all but one set to train
the classifier, while the remaining set is used to estimate the
generalization error. When k = n this becomes the well–
known Leave–One–Out (LOO) technique, which is often used
in the small–sample setting, because all the samples, except
one, are used for training the model [14].
If our target is to perform both the model selection and
the error estimation of the final classifier, a nested KCV is
required, where (k − 2) subsets are used, in turn, for the
training phase, one is used as a validation set to optimize
the hyperparameters and the last one as a test set to estimate
the generalization error. Note that $O(k^2)$ training steps are
necessary in this case.
To guarantee the statistical soundness of the KCV approach,
one of the k trained classifiers must be randomly chosen before
classifying a new sample. This procedure is seldom used in
practice because, usually, one retrains a final classifier on the
entire training data: however, as pointed out by many authors,
we believe that this heuristic procedure is the one to blame
for unexpected and inconsistent results of the KCV technique
in the small–sample setting.
If the nested KCV procedure is applied, so to guarantee
the independence of the training, validation and test sets, the
generalization error can be bounded by [40], [44], [46]:
$$L(f_n^*) \leq \frac{1}{k}\sum_{j=1}^{k}\left[\hat{L}^j_{\frac{n}{k}}\left(f^*_{\frac{n}{k},j}\right) + \epsilon^*\left(\hat{L}^j_{\frac{n}{k}}, \frac{n}{k}, \delta\right)\right], \qquad (23)$$
where $\hat{L}^j_{\frac{n}{k}}(f^*_{\frac{n}{k},j})$ is the error performed by the j-th optimal
classifier on the corresponding test set, composed of $\frac{n}{k}$
samples, and $f_n^*$ is the randomly selected classifier.
It is interesting to note that, for the LOO procedure, $n/k = 1$,
so the bound becomes useless, in practice, for any reasonable
value of the confidence δ. This is another hint that the LOO
procedure should be used with care, as this result raises a
strong concern on its reliability, especially in the small–sample
setting, which is the elective setting for LOO.
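For illustration, the averaging of Eq. (23) can be sketched as follows, reusing the hypothetical clopper_pearson_epsilon helper from the earlier sketch (the fold error rates below are made up):

```python
def kcv_error_bound(fold_errors, fold_size, delta):
    """Average of the per-fold test-set bounds, as in Eq. (23)."""
    k = len(fold_errors)
    return sum(L_j + clopper_pearson_epsilon(L_j, fold_size, delta)
               for L_j in fold_errors) / k

# Example: k = 10 folds with 20 test samples each.
print(kcv_error_bound([0.05, 0.10, 0.0, 0.05, 0.15, 0.10, 0.05, 0.0, 0.10, 0.05],
                      fold_size=20, delta=0.05))
```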
C. Bootstrap
The BTS method is a pure resampling technique: at each
j-th step, a training set, with the same cardinality as the original
one, is built by sampling the patterns with replacement. The
remaining data, which consist, on average, of approximately
36.8% of the original dataset, are used to compose the
validation set. The procedure is then repeated several times
($N_B \in \left[1, \binom{2n-1}{n}\right]$) in order to obtain statistically sound results
[13]. As for the KCV, if the user is interested in performing the
error estimation of the trained classifiers, a nested Bootstrap
is needed, where the sampling procedure is repeated twice in
order to create both a validation and a test set. If we suppose
that the test set consists of mj patterns then, after the model
selection phase, we will be left with NB different models, for
which the average generalization error can be expressed as:
$$L(f_n^*) \leq \frac{1}{N_B}\sum_{j=1}^{N_B}\left[\hat{L}^j_{m_j}\left(f^*_{m_j,j}\right) + \epsilon^*\left(\hat{L}^j_{m_j}, m_j, \delta\right)\right]. \qquad (24)$$
As can be seen by comparing (23) and (24), the KCV and the
BTS are equivalent, except for the different sampling approach
of the original dataset.
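A minimal sketch of a single bootstrap resample and its out-of-bag validation split (the function name is ours):

```python
import numpy as np

def bootstrap_split(n, seed=0):
    """One bootstrap resample: n training indices drawn with replacement;
    the unused (out-of-bag) indices form the validation set."""
    rng = np.random.default_rng(seed)
    train_idx = rng.integers(0, n, size=n)
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[train_idx] = False
    return train_idx, np.flatnonzero(oob_mask)

train_idx, val_idx = bootstrap_split(1000)
print(len(val_idx) / 1000)  # on average about 1/e ~ 36.8% of the data is out-of-bag
```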
IV. A THEORETICAL ANALYSIS OF IN–SAMPLE TECHNIQUES
As detailed in previous sections, the main objective of in–
sample techniques is to upper bound the supremum on the right
of Eq. (16), so we need a bound which holds simultaneously
for all functions in a class. The Maximal Discrepancy and
Rademacher Complexity are two different statistical tools,
which can be exploited for such purposes. The Rademacher
Complexity of a class of functions F is defined as:
$$\hat{R}(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \frac{2}{n}\sum_{i=1}^{n} \sigma_i \ell(f(x_i), y_i), \qquad (25)$$
where $\sigma_1, \ldots, \sigma_n$ are n independent random variables for
which $P(\sigma_i = +1) = P(\sigma_i = -1) = 1/2$. An upper bound of
L(f) in terms of $\hat{R}(\mathcal{F})$ was proposed in [47] and the proof
is mainly an application of the following result, known as
McDiarmid’s inequality:

Theorem 1: [48] Let $Z_1, \ldots, Z_n$ be independent random
variables taking values in a set $\mathcal{Z}$, and assume that $g: \mathcal{Z}^n \to \Re$
is a function satisfying
$$\sup_{z_1,\ldots,z_n,\hat{z}_i} |g(z_1, \ldots, z_n) - g(z_1, \ldots, \hat{z}_i, \ldots, z_n)| < c_i \quad \forall i.$$
Then, for any ǫ > 0,
$$P\left\{g(z_1,\ldots,z_n) - \mathbb{E}\{g(z_1,\ldots,z_n)\} \geq \epsilon\right\} < e^{-\frac{2\epsilon^2}{\sum_{i=1}^{n} c_i^2}},$$
$$P\left\{\mathbb{E}\{g(z_1,\ldots,z_n)\} - g(z_1,\ldots,z_n) \geq \epsilon\right\} < e^{-\frac{2\epsilon^2}{\sum_{i=1}^{n} c_i^2}}.$$

In other words, the theorem states that, if replacing the i-th
coordinate $z_i$ by any other value, g changes by at most $c_i$, then
the function is sharply concentrated around its mean. Using
McDiarmid’s inequality, it is possible to bound the supremum
of Eq. (16), thanks to the following theorem [47]. We detail
here a simplified proof, which also corrects some of the errors
that appear in [47].

Theorem 2: Given a dataset $D_n$, consisting of n patterns
$x_i \in \mathcal{X} \subseteq \Re^d$, given a class of functions F and a loss function
$\ell(\cdot, \cdot) \in [0, 1]$, then
$$P\left[\sup_{f \in \mathcal{F}} \left(L(f) - \hat{L}_n(f)\right) \geq \hat{R}(\mathcal{F}) + \varepsilon\right] \leq 2\exp\left(\frac{-2n\varepsilon^2}{9}\right). \qquad (26)$$

Proof: Let us consider a ghost sample $D'_n = \{x'_i, y'_i\}$,
composed of n patterns generated from the same probability
distribution of $D_n$; the following upper bound holds¹:
$$\mathbb{E}_{(\mathcal{X},\mathcal{Y})} \sup_{f \in \mathcal{F}} \left[L(f) - \hat{L}_n(f)\right] \qquad (27)$$
$$= \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \sup_{f \in \mathcal{F}} \left[\mathbb{E}_{(\mathcal{X}',\mathcal{Y}')}\left[\hat{L}'_n(f)\right] - \hat{L}_n(f)\right] \qquad (28)$$
$$\leq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \mathbb{E}_{(\mathcal{X}',\mathcal{Y}')} \sup_{f \in \mathcal{F}} \left[\hat{L}'_n(f) - \hat{L}_n(f)\right] \qquad (29)$$
$$= \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \mathbb{E}_{(\mathcal{X}',\mathcal{Y}')} \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \left(\ell'_i - \ell_i\right) \qquad (30)$$
$$= \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \mathbb{E}_{(\mathcal{X}',\mathcal{Y}')} \mathbb{E}_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i\left(\ell'_i - \ell_i\right) \qquad (31)$$
$$\leq 2\,\mathbb{E}_{(\mathcal{X},\mathcal{Y})} \mathbb{E}_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i \ell_i \qquad (32)$$
$$= \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{R}(\mathcal{F}), \qquad (33)$$
from which we obtain:
$$\mathbb{E}_{(\mathcal{X},\mathcal{Y})} \sup_{f \in \mathcal{F}} \left[L(f) - \hat{L}_n(f)\right] \leq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{R}(\mathcal{F}). \qquad (34)$$
For the sake of simplicity, let us define $\hat{S}(\mathcal{F}) = \sup_{f \in \mathcal{F}} \left[L(f) - \hat{L}_n(f)\right]$. Then, by using McDiarmid’s inequality, we know that $\hat{S}(\mathcal{F})$ is sharply concentrated around
its mean:
$$P\left[\hat{S}(\mathcal{F}) \geq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{S}(\mathcal{F}) + \epsilon\right] \leq e^{-2n\epsilon^2}, \qquad (35)$$
because the loss function is bounded. Therefore, combining
these two results, we obtain:
$$P\left[\hat{S}(\mathcal{F}) \geq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{R}(\mathcal{F}) + \epsilon\right] \qquad (36)$$
$$\leq P\left[\hat{S}(\mathcal{F}) \geq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{S}(\mathcal{F}) + \epsilon\right] \leq e^{-2n\epsilon^2}. \qquad (37)$$
We are interested in bounding L(f) with $\hat{R}(\mathcal{F})$, so we can
write:
$$P\left[\hat{S}(\mathcal{F}) \geq \hat{R}(\mathcal{F}) + \epsilon\right] \qquad (38)$$
$$\leq P\left[\hat{S}(\mathcal{F}) \geq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{S}(\mathcal{F}) + a\epsilon\right] + P\left[\mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{R}(\mathcal{F}) \geq \hat{R}(\mathcal{F}) + (1-a)\epsilon\right] \qquad (39)$$
$$\leq e^{-2na^2\epsilon^2} + e^{-\frac{n}{2}(1-a)^2\epsilon^2}, \qquad (40)$$
where, in the last step, we applied again McDiarmid’s
inequality. By setting $a = \frac{1}{3}$ we have:
$$P\left[\hat{S}(\mathcal{F}) \geq \hat{R}(\mathcal{F}) + \epsilon\right] \leq 2e^{\frac{-2n\epsilon^2}{9}}. \qquad (41)$$

¹ In order to simplify the notation we define $\ell'_i = \ell(f(x'_i), y'_i)$ and $\ell_i = \ell(f(x_i), y_i)$.

The previous theorem allows us to obtain the main result
by fixing a confidence δ and solving Eq. (26) with respect to ǫ
(i.e. setting $2e^{-2n\epsilon^2/9} = \delta$, so that $\epsilon = 3\sqrt{\log(2/\delta)/(2n)}$),
thus obtaining the following explicit bound, which holds with
probability (1 − δ):
$$L(f_n) \leq \hat{L}_n(f_n) + \hat{R}(\mathcal{F}) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}. \qquad (42)$$
The approach based on the Maximal Discrepancy is similar
to the previous one and provides similar results. For the sake
of brevity we refer the reader to [32], [49] for the complete
proofs and to [32] for a comparison of the two approaches:
here we give only the final results.
Let us split $D_n$ in two halves, $D^{(1)}_{n/2}$ and $D^{(2)}_{n/2}$, and compute
the corresponding empirical errors:
$$\hat{L}^{(1)}_{n/2}(f) = \frac{2}{n}\sum_{i=1}^{n/2} \ell(f(x_i), y_i), \qquad (43)$$
$$\hat{L}^{(2)}_{n/2}(f) = \frac{2}{n}\sum_{i=n/2+1}^{n} \ell(f(x_i), y_i). \qquad (44)$$
Then, the Maximal Discrepancy $\hat{M}$ of F is defined as
$$\hat{M}(\mathcal{F}) = \max_{f \in \mathcal{F}} \left[\hat{L}^{(1)}_{n/2}(f) - \hat{L}^{(2)}_{n/2}(f)\right] \qquad (45)$$
and, under the same hypothesis of Theorem 2, the following
bound holds, with probability (1 − δ):
$$L(f) \leq \hat{L}_n(f) + \hat{M}(\mathcal{F}) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}. \qquad (46)$$
A. The in–sample approach in practice

The theoretical analysis of the previous section does not
clarify how the in–sample techniques can be applied in
practice and, in particular, how they can be used to develop
effective model selection and error estimation phases for SVMs.
The first problem, analogously to the case of the out–of–sample
approach, is related to the boundedness requirement of the
loss function, which is not satisfied by the SVM hinge loss.
A recent result appears to be promising in generalizing
McDiarmid’s inequality to the case of (almost) unbounded
functions [50], [51]:

Theorem 3: [50] Let $Z_1, \ldots, Z_n$ be independent random
variables taking values in a set $\mathcal{Z}$ and assume that a function
$g: \mathcal{Z}^n \to [-A, A] \subseteq \Re$ satisfies
$$\sup_{z_1,\ldots,z_n,\hat{z}_i} |g(z_1, \ldots, z_n) - g(z_1, \ldots, \hat{z}_i, \ldots, z_n)| < c_n \quad \forall i$$
on a subset $\mathcal{G} \subseteq \mathcal{Z}$ with probability $1 - \delta_n$, while
$$\forall \{z_1, \ldots, z_n\} \in \bar{\mathcal{G}},\ \exists z'_i \in \mathcal{Z} \text{ such that } c_n < |g(z_1, \ldots, z_n) - g(z_1, \ldots, z'_i, \ldots, z_n)| \leq 2A,$$
where $\bar{\mathcal{G}} \cup \mathcal{G} = \mathcal{Z}$; then, for any ǫ > 0,
$$P\left\{|g - \mathbb{E}[g]| \geq \epsilon\right\} \leq 2e^{\frac{-\epsilon^2}{8nc_n^2}} + \frac{2An\delta_n}{c_n}.$$

In other words, the theorem states that if g satisfies the same
conditions of Theorem 1, with high probability $(1 - \delta_n)$, then
the function is (almost) concentrated around its mean. Unfortunately, the bound is exponential only if it is possible to show
that $\delta_n$ decays exponentially, which requires introducing some
constraints on the probability distribution generating the data.
As we are working in the agnostic case, where no hypothesis
on the data is assumed, this approach is outside of the scope
of this paper, albeit it opens some interesting research cases
like, for example, when some additional information on the
data is available.

The use of the soft loss function, instead, which is bounded
in the interval [0, 1] and will be adapted to the SVM in the
following sections, allows us to apply the bound of Eq. (42).
By noting that the soft loss satisfies the following symmetry
property:
$$\ell(f(x), y) = 1 - \ell(f(x), -y), \qquad (47)$$
it can be shown that the Rademacher Complexity can be easily
computed by learning a modified dataset. In fact, let us define
$I^+ = \{i : \sigma_i = +1\}$ and $I^- = \{i : \sigma_i = -1\}$; then:
$$\hat{R}(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \frac{2}{n}\sum_{i=1}^{n} \sigma_i \ell_i \qquad (48)$$
$$= 1 + \mathbb{E}_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \frac{2}{n}\left[\sum_{i \in I^+} \left(\ell(f_i, y_i) - 1\right) - \sum_{i \in I^-} \ell(f_i, y_i)\right] \qquad (49)$$
$$= 1 + \mathbb{E}_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \left[-\frac{2}{n}\sum_{i \in I^+} \ell(f_i, -y_i) - \frac{2}{n}\sum_{i \in I^-} \ell(f_i, y_i)\right] \qquad (50)$$
$$= 1 + \mathbb{E}_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \left[-\frac{2}{n}\sum_{i=1}^{n} \ell(f_i, -\sigma_i y_i)\right] \qquad (51)$$
$$= 1 - \mathbb{E}_{\boldsymbol{\sigma}} \inf_{f \in \mathcal{F}} \frac{2}{n}\sum_{i=1}^{n} \ell(f_i, \sigma_i). \qquad (52)$$
In other words, the Rademacher Complexity of the class F
can be computed by learning the original dataset, but where
the labels have been randomly flipped. Analogously, it can be
proved that the Maximal Discrepancy of the class F can be
computed by learning the original dataset, where the labels
of the samples in $D^{(2)}_{n/2}$ have been flipped [28].
The second problem that must be addressed is finding an
efficient way to compute the quantities $\hat{R}$ and $\hat{M}$, avoiding
the computation of $\mathbb{E}_{\boldsymbol{\sigma}}[\cdot]$, which would require $N = 2^n$
training phases. A simple approach is to adopt a Monte–Carlo
estimation of this quantity, by computing:
$$\hat{R}_k(\mathcal{F}) = \frac{1}{k}\sum_{j=1}^{k} \sup_{f \in \mathcal{F}} \frac{2}{n}\sum_{i=1}^{n} \sigma_i^j \ell(f(x_i), y_i), \qquad (53)$$
where $1 \leq k \leq N$ is the number of Monte–Carlo trials.
The effect of computing $\hat{R}_k$, instead of $\hat{R}$, can be made explicit
by noting that the Monte–Carlo trials can be modeled as a
sampling without replacement from the N possible label configurations. Then, we can apply any bound for the tail of the
hypergeometric distribution like, for example, Serfling’s
bound [52], to write:
$$P\left[\hat{R}(\mathcal{F}) \geq \hat{R}_k(\mathcal{F}) + \epsilon\right] \leq e^{\frac{-2k\epsilon^2}{1 - \frac{k-1}{N}}}. \qquad (54)$$
We know that
$$\mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{S}(\mathcal{F}) \leq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{R}(\mathcal{F}) \qquad (55)$$
and, moreover,
$$P\left[\hat{S}(\mathcal{F}) \geq \hat{R}_k(\mathcal{F}) + a\epsilon\right] \qquad (56)$$
$$\leq P\left[\hat{S}(\mathcal{F}) \geq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{S}(\mathcal{F}) + a_1\epsilon\right] + P\left[\mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{R}(\mathcal{F}) \geq \hat{R}(\mathcal{F}) + a_2\epsilon\right] + P\left[\hat{R}(\mathcal{F}) \geq \hat{R}_k(\mathcal{F}) + a_3\epsilon\right]$$
$$\leq e^{-2na_1^2\epsilon^2} + e^{-\frac{n}{2}a_2^2\epsilon^2} + e^{\frac{-2ka_3^2\epsilon^2}{1 - \frac{k-1}{N}}}, \qquad (57)$$
where $a = a_1 + a_2 + a_3$. So, by setting $a_1 = \frac{1}{4}$, $a_2 = \frac{1}{2}$ and
$$a_3 = \frac{1}{4}\sqrt{\frac{n\left(1 - \frac{k-1}{N}\right)}{k}}, \qquad (58)$$
we have that:
$$P\left[\hat{S}(\mathcal{F}) \geq \hat{R}_k(\mathcal{F}) + \frac{3 + \sqrt{\frac{n\left(1 - \frac{k-1}{N}\right)}{k}}}{4}\,\epsilon\right] \leq 3e^{-\frac{n\epsilon^2}{8}}. \qquad (59)$$
Then, with probability (1 − δ):
$$L(f) \leq \hat{L}_n(f) + \hat{R}_k(\mathcal{F}) + \left(3 + \sqrt{\frac{n\left(1 - \frac{k-1}{N}\right)}{k}}\right)\sqrt{\frac{\ln\frac{3}{\delta}}{2n}}, \qquad (60)$$
which recovers the bound of Eq. (42), for $k \to N$, up to some
constants.
The Maximal Discrepancy approach results in a very similar
bound (the proofs can be found in [32]):
$$L(f) \leq \hat{L}_n(f) + \frac{1}{k}\sum_{j=1}^{k} \hat{M}^{(j)}(\mathcal{F}) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}, \qquad (61)$$
which holds with probability (1 − δ) and where k is the number
of random shuffles of $D_n$ before splitting it in two halves. Note
that, in this case, the confidence term does not depend on k:
this is a consequence of retaining the information provided
by the labels $y_i$, which is lost in the Rademacher Complexity
approach.
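To illustrate Eqs. (52)–(53), the following sketch estimates the Rademacher Complexity by repeatedly learning randomly drawn labels; a standard hinge-loss LinearSVC from scikit-learn is used here only as a convenient stand-in for the soft-loss learner introduced in Section V, and the function names are ours:

```python
import numpy as np
from sklearn.svm import LinearSVC

def soft_loss_error(clf, X, y):
    # Empirical soft loss of Eq. (5) for a fitted classifier.
    return np.mean(np.clip((1.0 - y * clf.decision_function(X)) / 2.0, 0.0, 1.0))

def rademacher_mc(X, y, C, k=10, seed=0):
    """Monte-Carlo estimate of Eq. (53): per Eq. (52), the Rademacher Complexity
    equals one minus the expected minimum loss over randomly drawn labels."""
    rng, n, vals = np.random.default_rng(seed), len(y), []
    for _ in range(k):
        sigma = rng.choice([-1.0, 1.0], size=n)
        clf = LinearSVC(C=C).fit(X, sigma)
        vals.append(1.0 - 2.0 * soft_loss_error(clf, X, sigma))
    return float(np.mean(vals))
```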
V. THE APPLICATION OF THE IN–SAMPLE APPROACH TO THE SUPPORT VECTOR MACHINE
Let us consider the training set $D_n$, and the input space
$\mathcal{X} \in \Re^d$. We map our input space in a feature space $\mathcal{X}' \in \Re^D$
with the function $\phi: \Re^d \to \Re^D$. Then, the SVM classifier is
defined as:
$$f(x) = w \cdot \phi(x) + b, \qquad (62)$$
where the weights $w \in \Re^D$ and the bias $b \in \Re$ are found
by solving the following primal convex constrained quadratic
programming (CCQP) problem:
$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C e^T \xi \qquad (63)$$
$$y_i(w \cdot \phi(x_i) + b) \geq 1 - \xi_i$$
$$\xi_i \geq 0,$$
where $e_i = 1$ $\forall i \in \{1, \ldots, n\}$ [4]. The above problem is also
known as the Tikhonov formulation of the SVM, because it
can be seen as a regularized ill–posed problem.
By introducing n Lagrange multipliers $\alpha_1, \ldots, \alpha_n$, it is
possible to write the problem of Eq. (63) in its dual form,
for which efficient solvers have been developed throughout
the years:
$$\min_{\alpha} \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{n} \alpha_i \qquad (64)$$
$$0 \leq \alpha_i \leq C$$
$$\sum_{i=1}^{n} y_i \alpha_i = 0,$$
where $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ is a suitable kernel function.
After solving the problem (64), the Lagrange multipliers can
be used to define the SVM classifier in its dual form:
$$f(x) = \sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + b. \qquad (65)$$
The hyperparameter C in problems (63) and (64) is tuned
during the model selection phase, and indirectly defines the
set of functions F. Then, any out–of–sample technique can
be applied to estimate the generalization error of the classifier
and the optimal value of C can be chosen accordingly.
Unfortunately, this formulation suffers from several drawbacks:
• the hypothesis space F is not directly controlled by
the hyperparameter C, but only indirectly through the
minimization process;
• the loss function of SVM is not bounded, which represents a problem for out–of–sample techniques as well,
because the optimization is performed using the hinge
loss, while the error estimation is usually computed with
the hard loss;
• the function space is centered in an arbitrary way with
respect to the optimal (unknown) classifier.
It is worthwhile to write the SVM optimization problem as
[21]:
$$\min_{w,b,\xi} \sum_{i=1}^{n} \xi_i \qquad (66)$$
$$\|w\|^2 \leq \rho \qquad (67)$$
$$y_i(w \cdot \phi(x_i) + b) \geq 1 - \xi_i \qquad (68)$$
$$\xi_i \geq 0, \qquad (69)$$
which is the equivalent Ivanov formulation of problem (63)
for some value of the hyperparameter ρ. From Eq. (67) it is
clear that ρ explicitly controls the size of the function space
F, which is centered in the origin and consists of the set of
linear classifiers with margin greater or equal to 2/ρ. In fact,
as ρ is increased, the set of functions is enriched by classifiers
with smaller margin and, therefore, of greater classification
(and overfitting) capability.
A possibility to center the space F in a different point is to
translate the weights of the classifiers by some constant value,
so that Eq. (67) becomes $\|w - w_0\|^2 \leq \rho$. By applying this
idea to the Ivanov formulation of the SVM and substituting
the hinge loss with the soft loss, we obtain the following new
optimization problem:
$$\min_{w,b,\xi,\eta} \sum_{i=1}^{n} \eta_i \qquad (70)$$
$$\|w\|^2 \leq \rho$$
$$y_i(w \cdot \phi(x_i) + b) + y_i \lambda f_0(x_i) \geq 1 - \xi_i$$
$$\xi_i \geq 0$$
$$\eta_i = \min(2, \xi_i),$$
where $f_0$ is the classifier, which has been selected as the
center of the function space F, and λ is a normalization
constant. Note that $f_0$ can be either a linear classifier $f_0(x) = w_0 \cdot \phi(x) + b_0$ or a non-linear one (e.g. analogously to
what is shown in [53]) but, in general, can be any a–priori and
auxiliary information, which helps in relocating the function
space closer to the optimal classifier. In this respect, $f_0$ can be
considered as a hint, a concept introduced in [54] in the context
of neural networks, which must be defined independently from
the training data. The normalization constant λ weights the
amount of hints that we are keen to accept in searching for our
optimal classifier: if we set λ = 0, we obtain the conventional
Ivanov formulation of the SVM, while for larger values of λ the
hint is weighted even more than the regularization process
itself. The sensitivity analysis of the SVM solution with
respect to the variations of λ is an interesting issue that would
require a thorough study, so we do not address it here. In
any case, as we are working in the agnostic case, we equally
weight the hint and the regularized learning process, thus we
choose λ = 1 in this work.
The previous optimization problem can be re-formulated in
its dual form and solved by general–purpose convex programming algorithms [21]. However, we show here that it can also
be solved by conventional SVM learning algorithms, if we
rewrite it in the usual Tikhonov formulation:
$$\min_{w,b,\xi,\eta} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \eta_i \qquad (71)$$
$$y_i(w \cdot \phi(x_i) + b) + y_i f_0(x_i) \geq 1 - \xi_i$$
$$\xi_i \geq 0$$
$$\eta_i = \min(2, \xi_i).$$
It can be shown that the two formulations are equivalent, in
the sense that, for any ρ, there is at least one C for which the
Ivanov and Tikhonov solutions coincide [21]. In particular, the
value of C is a non–decreasing function of the value of ρ, so
that, given a particular C, the corresponding ρ can be found
by a simple bisection algorithm [55], [56], [57].
Regardless of the formulation, the optimization problem is
non–convex, so we must resort to methods that are able to find
an approximate suboptimal solution, like the Peeling technique
[32], [58] or the ConCave–Convex Procedure (CCCP) [33].
In particular, the CCCP, which is synthesized in Algorithm
1, suggests breaking the objective function of Eq. (71) into its
convex and concave parts:

Algorithm 1 ConCave–Convex Procedure
  Initialize $\theta^{(0)}$
  repeat
    $\theta^{(t+1)} = \arg\min_{\theta} \left[ J_{convex}(\theta) + \left.\frac{dJ_{concave}(\theta)}{d\theta}\right|_{\theta=\theta^{(t)}} \cdot \theta \right]$
  until $\theta^{(t+1)} = \theta^{(t)}$

$$\min_{w,b,\xi,\varsigma} \overbrace{\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i}^{J_{convex}(\theta)} \overbrace{- C\sum_{i=1}^{n}\varsigma_i}^{J_{concave}(\theta)} \qquad (72)$$
$$y_i(w \cdot \phi(x_i) + b) + y_i f_0(x_i) \geq 1 - \xi_i$$
$$\xi_i \geq 0$$
$$\varsigma_i = \max(0, \xi_i - 2),$$
where $\theta = [w|b]$ is introduced to simplify the notation.
Obviously, the algorithm does not guarantee to find the optimal
solution, but it converges to a usually good solution in a finite
number of steps [33].
To apply the CCCP we must compute the derivative of the
concave part of the objective function at the t-th step:
$$\left.\frac{dJ_{concave}(\theta)}{d\theta}\right|_{\theta=\theta^{(t)}} \cdot \theta \qquad (73)$$
$$= \left(\sum_{i=1}^{n} \left.\frac{d(-C\varsigma_i)}{d\theta}\right|_{\theta=\theta^{(t)}}\right) \cdot \theta \qquad (74)$$
$$= \sum_{i=1}^{n} \Delta_i^{(t)} y_i \left(w \cdot \phi(x_i) + b\right), \qquad (75)$$
where
$$\Delta_i^{(t)} = \begin{cases} C & \text{if } y_i\left(w^{(t)} \cdot \phi(x_i) + b^{(t)}\right) < -1 \\ 0 & \text{otherwise.} \end{cases} \qquad (76)$$
Then, the (t + 1)-th solution $w^{(t+1)}, b^{(t+1)}$ can be found
by solving the following learning problem:
$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i + \sum_{i=1}^{n}\Delta_i^{(t)} y_i\left(w \cdot \phi(x_i) + b\right) \qquad (77)$$
$$y_i(w \cdot \phi(x_i) + b) + y_i f_0(x_i) \geq 1 - \xi_i$$
$$\xi_i \geq 0.$$
As a last issue, it is worth noting that the dual formulation
of the previous problem can be obtained, by introducing n
Lagrange multipliers $\beta_i$:
$$\min_{\beta} \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \beta_i \beta_j y_i y_j K(x_i, x_j) + \sum_{i=1}^{n}\left(y_i f_0(x_i) - 1\right)\beta_i \qquad (78)$$
$$-\Delta_i^{(t)} \leq \beta_i \leq C - \Delta_i^{(t)}$$
$$\sum_{i=1}^{n} y_i \beta_i = 0,$$
which can be solved by any SVM–specific algorithm like, for
example, the well–known Sequential Minimal Optimization
algorithm [29], [30].
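To make the procedure concrete, here is a rough sketch of the CCCP loop for the linear-kernel case; the convex subproblem of Eq. (77) is solved in its primal form with CVXPY rather than with the SMO-based dual solver discussed above, and all function and variable names are ours:

```python
import numpy as np
import cvxpy as cp

def cccp_soft_loss_svm(X, y, C, f0, max_iter=20):
    """CCCP loop for the soft-loss SVM with hint f0 (linear kernel, primal form).
    X: (n, d) array, y: (n,) array in {-1, +1}, f0: (n,) array of f0(x_i)."""
    n, d = X.shape
    w_val, b_val = np.zeros(d), 0.0
    for _ in range(max_iter):
        # Eq. (76): Delta_i = C where y_i (w^(t) . x_i + b^(t)) < -1, 0 otherwise.
        delta = C * (y * (X @ w_val + b_val) < -1.0)

        # Eq. (77): convex subproblem, with the concave part linearized through Delta.
        w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)
        scores = X @ w + b
        obj = (0.5 * cp.sum_squares(w) + C * cp.sum(xi)
               + cp.sum(cp.multiply(delta * y, scores)))
        cons = [cp.multiply(y, scores) + y * f0 >= 1 - xi, xi >= 0]
        cp.Problem(cp.Minimize(obj), cons).solve()

        if np.allclose(w.value, w_val) and np.isclose(float(b.value), b_val):
            break
        w_val, b_val = w.value, float(b.value)
    return w_val, b_val
```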
A. The method in a nutshell
In this section, we briefly summarize the method, which
allows applying the in-sample approach to the SVM model
selection and error estimation problems.
As a first step, we have to identify a centroid $f_0$: for this
purpose, possible a-priori information can be exploited; otherwise,
a method to identify a hint in a data-dependent way is
suggested in [53]. Note that $f_0$ can be either a linear or a non-linear
SVM classifier and, in principle, can be even computed by
exploiting a kernel that differs from the one used during the
learning phase.
Once the sequence of classes of functions is centered,
we explore its hierarchy according to the Structural Risk
Minimization principle: ideally by looking for the optimal
hyperparameter ρ ∈ (0, +∞), similarly to the search for
the optimal C in conventional SVMs. For every value of ρ,
i.e. for every class of functions, Problem (70) is solved by
exploiting the procedure previously presented in Section V and
either the Rademacher Complexity (Eq. (60)) or the Maximal
Discrepancy (Eq. (61)) bounds are computed. Finally, F and,
therefore, the corresponding classifier are chosen, for which
the value of the estimated generalization error is minimized.
Note that this value, by construction, is a statistically valid
estimate.
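Putting the pieces together, a rough sketch of this selection loop, assuming the soft_loss_error and rademacher_mc helpers from the sketch at the end of Section IV are in scope; the grid over C is used here as a stand-in for the search over ρ, and the names are ours:

```python
import numpy as np
from sklearn.svm import LinearSVC

def in_sample_selection(X, y, C_grid, delta=0.05, k=10):
    """Choose the hyperparameter minimizing the bound of Eq. (60)."""
    n, best = len(y), None
    for C in C_grid:
        clf = LinearSVC(C=C).fit(X, y)
        # Confidence term of Eq. (60); (k - 1)/N is negligible since N = 2**n.
        conf = (3.0 + np.sqrt(n / k)) * np.sqrt(np.log(3.0 / delta) / (2.0 * n))
        bound = soft_loss_error(clf, X, y) + rademacher_mc(X, y, C, k) + conf
        if best is None or bound < best[0]:
            best = (bound, C, clf)
    return best  # (estimated generalization error, chosen hyperparameter, classifier)
```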
VI. EXPERIMENTAL RESULTS
We describe in this section two sets of experiments. The first
one is built by using relatively large datasets that allow us to
simulate the small–sample setting. Each dataset is sampled by
extracting a small amount of data to build the training sets and
exploiting the remaining data as a good representative of the
entire sample population. The rationale behind this choice is to
build some toy problems, but based on real–world data, so to
better explore the performance of our proposal in a controlled
setting. Thanks to this approach, the experimental results can be
easily interpreted and the two approaches, in–sample vs. out–
of–sample, easily compared. The second set, instead, targets
the classification of microarray data, which consists of true
small–sample datasets.
A. The simulated small–sample setting
We consider the well–known MNIST [59] dataset consisting
of 62000 images, representing the numbers from 0 to 9: in
particular, we consider the 13074 patterns containing 0’s and
1’s, allowing us to deal with a binary classification problem.
We build a small–sample dataset by randomly sampling a
small number of patterns, varying from n = 10 to n = 400,
which is a value much smaller than the dimensionality of the
data d = 28 × 28 = 784, while the remaining 13074 − n
images are used as a reference set. In order to build statistically
relevant results, the entire procedure is repeated 30 times
during the experiments. We also consider a balanced version
of the DaimlerChrysler dataset [60], where half of the 9800
images, of d = 36 × 18 = 648 pixels, contains the picture of
a pedestrian, while the other half contains only some general
background or other objects.
These two datasets target different objectives: the MNIST
dataset represents an easy classification problem, in the sense
that a low classification error, well below 1%, can be easily
achieved; on the contrary, the DaimlerChrysler dataset is a
much more difficult problem, because the samples from each
class are quite overlapped, so the small–sample setting makes
this problem even more difficult to solve. By analyzing these
two opposite cases, it is possible to gain a better insight
on the performance of the various methods. In all cases,
we use a linear kernel φ(x) = x, as the training data are
obviously linearly separable (d > n) and the use of a nonlinear
transformation would further complicate the interpretation of
the results.
In Tables I–VIII, the results obtained with the different
methods are reported. Each column refers to a different
approach:
• RC and MD are the in–sample procedures using, respectively, the Rademacher Complexity and Maximal Discrepancy approaches, with $f_0 = 0$;
• RC$_f$ and MD$_f$ are similar to the previous cases, but 30% of the samples of the training set are used for finding a hint $f_0(x) = w_0 \cdot x + b_0$ by learning a linear classifier on them (refer to [53] for further details);
• KCV is the k-fold Cross Validation procedure, with k = 10;
• LOO is the Leave–One–Out procedure;
• BTS is the Bootstrap technique with $N_B = 100$.
For in–sample methods, the model selection is performed by
searching the optimal hyperparameter $\rho \in [10^{-6}, 10^{3}]$ among
30 values, equally spaced in a logarithmic scale, while for the
out–of–sample approaches the search is performed by varying
C in the same range.
Tables I and II show the error rate achieved by each selected
classifier on the reference sets, using the soft loss for computing the error. In particular, the in–sample methods exploit
the soft loss for the learning phase, which, by construction,
includes also the model selection and error estimation phases.
The out–of–sample approaches, instead, use the conventional
hinge loss for finding the optimal classifier and the soft loss
for model selection. When the classifiers have been found,
according to the respective approaches, their performance is
verified on the reference dataset, so to check if a good model
has been selected, and the achieved misclassification rate is
reported in the tables. All the figures are in percentage and
the best values are highlighted. As can be easily seen, the best
approaches are RCf and MDf , which consistently outperform
the out–of–sample ones. It is also clear that centering the
function space in a more appropriate point, thanks to the hint
f0 , improves the ability of the procedure to select a better
classifier, with respect to in–sample approaches without hints. This
is a result of the shrinking of the function space, which directly
affects the tightness of the generalization error bounds. As
a last remark, it is possible to note that RC and MD often
select the same classifier, with a slight superiority of RC,
when dealing with difficult classification problems. This is
also an expected result, because MD makes use of the label
information, which is misleading if the samples are not well
separable [49].
The use of the soft loss is not so common in the SVM
literature, so we repeated the experiments by applying the hard
loss for computing the misclassification error on the reference
dataset. Tables III and IV report the results and confirm the
superiority of the in–sample approach, when dealing with the
MNIST problem, while the results are comparable with the
out–of–sample methods in the case of the DaimlerChrysler
dataset. In particular, the in–sample methods with hints appear
to perform slightly better than the BTS and slightly worse than
KCV and LOO, even though the difference is almost negligible. This is not surprising, because the in–sample methods
adopt a soft–loss for the training phase, which is not the same
loss used for evaluating them.
Tables V and VI show the error estimation computed
with the various approaches by exploiting the error bounds,
presented in Sections III and IV: in particular, the in–sample
methods provide these values directly, as a byproduct of
the training phase, while the figures for the out–of–sample
methods are obtained by applying the Hoeffding bound of Eq.
(22) on the samples of the test set. The missing values indicate
that the estimation is not consistent, because it exceeds 100%.
In this case, BTS (but not always KCV nor LOO) outperforms
the in–sample ones, which are more pessimistic. However, by
taking a closer view to the results, several interesting facts can
be inferred. The out–of–sample methods are very sensitive to
the number of test samples: in fact, the LOO method, which
uses only one sample at the time for the error estimation,
is not able to provide consistent results. The quality of the
error estimation improves for KCV, which uses one tenth of
the data, and even more for BTS, which selects, on average, a
third of the data for performing the estimation. In any case, by
comparing the results in Table V with the ones in Table I, it is
clear that even out–of–sample methods are overly pessimistic:
in the best case (BTS, with n = 400), the generalization error
is overestimated by a factor greater than 4. This result seems
to be in contrast with the common belief that out–of–sample
methods provide a good estimation of the generalization error,
but it is not surprising because, most of the time, when
the generalization error of a classifier is reported in the
literature, the confidence term (i.e. the second term on the
right side of Eq. (23)) is usually neglected and only its average
performance is disclosed (i.e. the first term on the right side
of Eq. (23)). The results on the two datasets provide another
interesting insight on the behavior of the two approaches: it
is clear that out–of–sample methods exploit the distribution of
the samples of the test set, because they are able to identify
the intrinsic difficulty of the classification problem; in–sample
methods, instead, do not possess this kind of information and,
therefore, maintain a pessimistic approach in all cases, which
is not useful for easy classification problems, like MNIST. This
is confirmed also by the small difference in performance of the
two approaches on the difficult DaimlerChrysler problem. On
the other hand, the advantage of having a test set, for out–of–
sample methods, is overcome by the need of reducing the size
of the training and validation sets, which causes the methods
to choose a worse performing classifier. This is related to the
well–known issue of the optimal splitting of the data between
training and test sets, which is still an open problem.

TABLE IX: Human gene expressions datasets.

  Dataset                      d       n
  Brain Tumor 1 [62]          5920     90
  Brain Tumor 2 [62]         10367     50
  Colon Cancer 1 [63]        22283     47
  Colon Cancer 2 [64]         2000     62
  DLBCL [62]                  5469     77
  Duke Breast Cancer [65]     7129     44
  Leukemia [66]               7129     72
  Leukemia 1 [62]             5327     72
  Leukemia 2 [62]            11225     72
  Lung Cancer [62]           12600    203
  Myeloma [67]               28032    105
  Prostate Tumor [62]        10509    102
  SRBCT [62]                  2308     83
Finally, Tables VII and VIII show the error estimation of
the out–of–sample methods using the hard loss. In this case,
the in–sample methods cannot be applied, because it is not
possible to perform the learning with this loss. As expected,
the error estimation improves, with respect to the previous case,
except for the LOO method, which is not able to provide
consistent results. The improvement with respect to the case of
the soft loss is due to the fact that we are now working in
a parametric setting (e.g. the errors are distributed according
to a Binomial distribution), while the soft loss gives rise to a
non–parametric estimation, which is a more difficult problem.
In summary, the experiments clearly show that in–sample
methods with hints are more reliable for model selection than
out–of–sample ones and that the Bootstrap appears to be the
best approach to perform the generalization error estimation
of the trained classifier.
B. Microarray small–sample datasets
The last set of experiments deals with several Human Gene
Expression datasets (Table IX), where all the problems are
converted, where needed, to two classes by simply grouping
some data. In this kind of setting, a reference set of reasonable
size is not available; therefore, we reproduce the methodology used
by [61], which consists in generating five different training/test
pairs, using a cross validation approach. The same procedures
of Section VI-A are used in order to compare the different
approaches to model selection and error estimation and the
results are reported in Tables X–XIII.
Table X shows the error rate obtained on the reference sets
using the soft loss, where the in–sample methods outperform
out–of–sample ones most of the time (8 vs. 5). The interesting
fact is the large improvement of the in–sample methods
with hints, with respect to the analogous versions without hints.
Providing some a–priori knowledge for selecting the classifier
space appears to be very useful and, in some cases (e.g. the
Brain Tumor 2, Colon Cancer 2 and DLBCL datasets), allows
solving problems that no other method, in–sample without
hints or out–of–sample, can deal with.
In Table XI, analogously, the misclassification rates on the
reference sets using the hard loss, which favors out–of–sample
methods, are reported. In this case, the three out–of–sample
methods globally outperform in–sample ones, but none of
them, considered singularly, is consistently better than in–
sample ones.
Finally, Tables XII and XIII show the error estimation using the soft and hard loss, respectively. The Bootstrap provides better estimates than all the other methods but, unfortunately, it suffers from two serious drawbacks: the estimates are very loose and, in some cases (e.g., the Brain Tumor 2, Colon Cancer 2 and DLBCL datasets), the estimation is not consistent, as it underestimates the actual classifier error rate. This is an indication that, in the small-sample setting, where the test data is very scarce, both in–sample and out–of–sample methods are of little use for estimating the generalization error of a classifier; however, while out–of–sample methods cannot be improved, because they work in a parametric setting where the Clopper–Pearson bound is the tightest possible, in–sample methods could lead to better estimates, as they leave room for further improvements, both in the theoretical framework and in their practical application.
C. Some notes on the computational effort
The proposed approach addresses the problem of model selection and error estimation of SVMs. Though general, it is most beneficial when dealing with small-sample problems (d ≫ n), like the Gene Expression datasets, where only a few samples are available (n ≈ 100). In this framework, the computational complexity and cost of the proposed method are not a critical issue, because the time needed to perform the procedure is small. As an example, we report in Table XIV the computational time, in seconds (measured on an Intel Core i5 2.3 GHz machine, with source code written in Fortran 90), needed to perform the different in–sample and out–of–sample procedures on the MNIST dataset: it is worth noting that the learning procedures always require less than one minute to complete. Similar results are obtained on the other datasets used in this paper.
VII. CONCLUSION
We have detailed a complete methodology for applying two
in–sample approaches, based on the data–dependent Structural
Risk Minimization principle, to the model selection and error
estimation of Support Vector Classifiers. The methodology is
theoretically justified and obtains good results in practice. At
the same time, we have shown that in–sample methods can be
comparable to, or even better than, more widely–used out–of–
sample methods, at least in the small-sample setting. A step towards improving their adoption is our proposal to transform the in–sample learning problem from the Ivanov formulation to the Tikhonov one, so that it can be easily approached by conventional SVM solvers.
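For reference, the two formulations mentioned above can be written, in generic form, as the constrained and the penalized versions of regularized empirical risk minimization; this is the textbook correspondence and not the specific reparameterization derived in the paper:

```latex
% Ivanov formulation: empirical risk minimized over a hypothesis space
% whose size is fixed explicitly by the constraint on the weights.
\min_{w,b} \ \sum_{i=1}^{n} \ell\bigl(y_i, \langle w, \phi(x_i)\rangle + b\bigr)
\quad \text{s.t.} \quad \|w\|^2 \le A^2

% Tikhonov formulation: the same trade-off expressed through a penalty
% term, as in the conventional soft-margin SVM objective.
\min_{w,b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \ell\bigl(y_i, \langle w, \phi(x_i)\rangle + b\bigr)
```

Under mild conditions on the loss, each constraint radius A corresponds to a penalty value C yielding the same solution, which is what makes the problem approachable by standard SVM solvers.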
We believe that our analysis opens new perspectives on the application of the data–dependent SRM theory to practical problems, by showing that the common belief about its poor practical effectiveness is greatly exaggerated. The SRM theory is just a different and sophisticated statistical tool that needs to be used with some care and that, we hope, will
be further improved in the future, both by building sharper
theoretical bounds and by finding more clever ways to exploit
hints for centering the classifier space.
REFERENCES
[1] I. Guyon, A. Saffari, G. Dror, and G. Cawley, “Model selection: beyond
the bayesian/frequentist divide,” The Journal of Machine Learning
Research, vol. 11, pp. 61–87, 2010.
[2] S. Geman, E. Bienenstock, and R. Doursat, “Neural networks and the
bias/variance dilemma,” Neural Computation, vol. 4, no. 1, pp. 1–58,
1992.
[3] P. Bartlett, “The sample complexity of pattern classification with neural
networks: the size of the weights is more important than the size of the
network,” IEEE Transactions on Information Theory, vol. 44, no. 2, pp.
525–536, 1998.
[4] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning,
vol. 20, no. 3, pp. 273–297, 1995.
[5] V. Vapnik, “An overview of statistical learning theory,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 988–999, 1999.
[6] Y. Shao, C. Zhang, X. Wang, and N. Deng, “Improvements on twin
support vector machines,” IEEE Transactions on Neural Networks,
no. 99, pp. 962–968, 2011.
[7] D. Anguita, S. Ridella, and F. Rivieccio, “K-fold generalization capability assessment for support vector classifiers,” in Proceedings of the
IEEE International Joint Conference on Neural Networks, 2005.
[8] B. Milenova, J. Yarmus, and M. Campos, “Svm in oracle database
10g: removing the barriers to widespread adoption of support vector
machines,” in Proceedings of the 31st International Conference on Very
Large Data Bases, 2005.
[9] Z. Xu, M. Dai, and D. Meng, “Fast and efficient strategies for model
selection of gaussian support vector machine,” IEEE Transactions on
Cybernetics, vol. 39, no. 5, pp. 1292–1307, 2009.
[10] T. Glasmachers and C. Igel, “Maximum likelihood model selection for
1-norm soft margin svms with multiple parameters,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, pp. 1522–1528, 2010.
[11] K. De Brabanter, J. De Brabanter, J. Suykens, and B. De Moor, “Approximate confidence and prediction intervals for least squares support
vector regression,” IEEE Transactions on Neural Networks, no. 99, pp.
110–120, 2011.
[12] M. Karasuyama and I. Takeuchi, “Nonlinear regularization path for
quadratic loss support vector machines,” IEEE Transactions on Neural
Networks, no. 99, pp. 1613–1625, 2011.
[13] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. Chapman & Hall, 1993.
[14] R. Kohavi, “A study of cross-validation and bootstrap for accuracy
estimation and model selection,” in Proceedings of the International
Joint Conference on Artificial Intelligence, 1995.
[15] F. Cheng, J. Yu, and H. Xiong, “Facial expression recognition in jaffe
dataset based on gaussian process classification,” IEEE Transactions on
Neural Networks, vol. 21, no. 10, pp. 1685–1690, 2010.
[16] D. Anguita, A. Ghio, S. Ridella, and D. Sterpi, “K–fold cross validation
for error rate estimate in support vector machines,” in Proceedings of
the International Conference on Data Mining, 2009.
[17] T. Clark, “Can out-of-sample forecast comparisons help prevent overfitting?” Journal of forecasting, vol. 23, no. 2, pp. 115–139, 2004.
[18] D. Rapach and M. Wohar, “In-sample vs. out-of-sample tests of stock
return predictability in the context of data mining,” Journal of Empirical
Finance, vol. 13, no. 2, pp. 231–247, 2006.
[19] A. Isaksson, M. Wallman, H. Goransson, and M. Gustafsson, "Cross-validation and bootstrapping are unreliable in small sample classification," Pattern Recognition Letters, vol. 29, no. 14, pp. 1960–1965, 2008.
[20] U. M. Braga-Neto and E. R. Dougherty, “Is cross-validation valid for
small-sample microarray classification?” Bioinformatics, vol. 20, no. 3,
pp. 374–380, 2004.
[21] V. N. Vapnik, Statistical Learning Theory. Wiley-Interscience, 1998.
[22] K. Duan, S. Keerthi, and A. Poo, “Evaluation of simple performance
measures for tuning svm hyperparameters,” Neurocomputing, vol. 51,
pp. 41–59, 2003.
[23] D. Anguita, A. Boni, R. Ridella, F. Rivieccio, and D. Sterpi, “Theoretical
and practical model selection methods for support vector classifiers,” in
Support Vector Machines: Theory and Applications, L. Wang, Ed., 2005,
pp. 159–180.
[24] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi, “General conditions
for predictivity in learning theory,” Nature, vol. 428, no. 6981, pp. 419–
422, 2004.
[25] B. Scholkopf and A. J. Smola, Learning with Kernels. The MIT Press,
2001.
[26] V. Cherkassky, X. Shao, F. Mulier, and V. Vapnik, “Model complexity
control for regression using vc generalization bounds,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1075–1089, 1999.
[27] J. Shawe-Taylor, P. Bartlett, R. Williamson, and M. Anthony, “Structural
risk minimization over data-dependent hierarchies,” IEEE Transactions
on Information Theory, vol. 44, no. 5, pp. 1926–1940, 1998.
[28] P. Bartlett, S. Boucheron, and G. Lugosi, “Model selection and error
estimation,” Machine Learning, vol. 48, no. 1, pp. 85–113, 2002.
[29] J. C. Platt, “Fast training of support vector machines using sequential
minimal optimization,” in Advances in kernel methods: support vector
learning, 1999.
[30] C. Lin, “Asymptotic convergence of an smo algorithm without any
assumptions,” IEEE Transactions on Neural Networks, vol. 13, no. 1,
pp. 248–250, 2002.
[31] J. C. Platt, “Probabilistic Outputs for Support Vector Machines and
Comparisons to Regularized Likelihood Methods,” in Advances in Large
Margin Classifier, 1999.
[32] D. Anguita, A. Ghio, N. Greco, L. Oneto, and S. Ridella, “Model
selection for support vector machines: Advantages and disadvantages
of the machine learning theory,” in Proceedings of the International
Joint Conference on Neural Networks, 2010.
[33] R. Collobert, F. Sinz, J. Weston, and L. Bottou, “Trading convexity
for scalability,” in Proceedings of the 23rd International Conference on
Machine learning, 2006.
[34] M. Anthony, Discrete mathematics of neural networks: selected topics.
Society for Industrial Mathematics, 2001.
[35] M. Aupetit, “Nearly homogeneous multi-partitioning with a deterministic generator,” Neurocomputing, vol. 72, no. 7-9, pp. 1379–1389, 2009.
[36] C. Bishop, Pattern Recognition and Machine Learning. Springer, New York, 2006.
[37] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, “The impact of unlabeled
patterns in rademacher complexity theory for kernel classifiers,” in
Proceedings of the Neural Information Processing System (NIPS), 2011,
pp. 1009–1016.
[38] P. Bartlett, O. Bousquet, and S. Mendelson, “Local rademacher complexities,” The Annals of Statistics, vol. 33, no. 4, pp. 1497–1537, 2005.
[39] J. Weston, R. Collobert, F. Sinz, L. Bottou, and V. Vapnik, “Inference
with the universum,” in Proceedings of the 23rd International Conference on Machine learning, 2006, pp. 1009–1016.
[40] J. Langford, “Tutorial on practical prediction theory for classification,”
Journal of Machine Learning Research, vol. 6, no. 1, p. 273, 2006.
[41] C. Clopper and E. Pearson, “The use of confidence or fiducial limits
illustrated in the case of the binomial,” Biometrika, vol. 26, no. 4, p.
404, 1934.
[42] J. Audibert, R. Munos, and C. Szepesvári, "Exploration-exploitation
tradeoff using variance estimates in multi-armed bandits,” Theoretical
Computer Science, vol. 410, no. 19, pp. 1876–1902, 2009.
[43] A. Maurer and M. Pontil, “Empirical bernstein bounds and sample
variance penalization,” Proceedings of the Int. Conference on Learning
Theory, 2009.
[44] W. Hoeffding, “Probability inequalities for sums of bounded random
variables,” Journal of the American Statistical Association, vol. 58, no.
301, pp. 13–30, 1963.
[45] V. Bentkus, "On Hoeffding's inequalities," The Annals of Probability,
vol. 32, no. 2, pp. 1650–1673, 2004.
[46] D. Anguita, A. Ghio, L. Ghelardoni, and S. Ridella, “Test error bounds
for classifiers: A survey of old and new results,” in Proceedings of The
IEEE Symposium on Foundations of Computational Intelligence, 2011,
pp. 80–87.
[47] P. Bartlett and S. Mendelson, “Rademacher and gaussian complexities:
Risk bounds and structural results,” The Journal of Machine Learning
Research, vol. 3, pp. 463–482, 2003.
[48] C. McDiarmid, “On the method of bounded differences,” Surveys in
combinatorics, vol. 141, no. 1, pp. 148–188, 1989.
[49] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, “Maximal discrepancy
vs. rademacher complexity for error estimation,” in Proceeding of The
European Symposium on Artificial Neural Networks, 2011, pp. 257–262.
[50] S. Kutin, “Extensions to mcdiarmid’s inequality when differences are
bounded with high probability,” TR-2002-04, University of Chicago,
Tech. Rep., 2002.
[51] E. Ordentlich, K. Viswanathan, and M. Weinberger, “Denoiser-loss
estimators and twice-universal denoising,” in Proceedings of the IEEE
International Symposium on Information Theory, 2009.
[52] R. Serfling, “Probability inequalities for the sum in sampling without
replacement,” The Annals of Statistics, vol. 2, no. 1, pp. 39–48, 1974.
[53] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, “Selecting the hypothesis
space for improving the generalization ability of support vector machines,” in Proceedings of the International Joint Conference on Neural
Networks, 2011, pp. 1169–1176.
[54] Y. Abu-Mostafa, “Hints,” Neural Computation, vol. 7, no. 4, pp. 639–
671, 1995.
[55] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, “The entire regularization
path for the support vector machine,” The Journal of Machine Learning
Research, vol. 5, pp. 1391–1415, 2004.
[56] M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K. Muller, and A. Zien,
“Efficient and accurate lp-norm multiple kernel learning,” Advances in
Neural Information Processing Systems (NIPS), vol. 22, no. 22, pp. 997–
1005, 2009.
[57] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, “In-sample model selection for support vector machines,” in Proceedings of the International
Joint Conference on Neural Networks, 2011, pp. 1154–1161.
[58] D. Anguita, A. Ghio, and S. Ridella, “Maximal discrepancy for support
vector machines,” Neurocomputing, vol. 74, pp. 1436–1443, 2011.
[59] L. Bottou, C. Cortes, J. Denker, H. Drucker, I. Guyon, L. Jackel,
Y. LeCun, U. Muller, E. Sackinger, P. Simard et al., “Comparison
of classifier methods: a case study in handwritten digit recognition,”
in Proceedings of the 12th IAPR International Conference on Pattern
Recognition Computer Vision and Image Processing, 1994.
[60] S. Munder and D. Gavrila, “An experimental study on pedestrian
classification,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 28, no. 11, pp. 1863–1868, 2006.
[61] A. Statnikov, C. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy, “A
comprehensive evaluation of multicategory classification methods for
microarray gene expression cancer diagnosis,” Bioinformatics, vol. 21,
no. 5, pp. 631–643, 2005.
[62] A. Statnikov, I. Tsamardinos, and Y. Dosbayev, “GEMS: a system for
automated cancer diagnosis and biomarker discovery from microarray
gene expression data,” International Journal of Medical Informatics,
vol. 74, no. 7-8, pp. 491–503, 2005.
[63] N. Ancona, R. Maglietta, A. Piepoli, A. D’Addabbo, R. Cotugno,
M. Savino, S. Liuni, M. Carella, G. Pesole, and F. Perri, “On the
statistical assessment of classifiers using DNA microarray data,” BMC
bioinformatics, vol. 7, no. 1, pp. 387–399, 2006.
[64] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and
A. Levine, “Broad patterns of gene expression revealed by clustering
analysis of tumor and normal colon tissues probed by oligonucleotide
arrays,” Proceedings of the national academy of sciences of the United
States of America, vol. 96, no. 12, pp. 6745–6767, 1999.
[65] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang,
H. Zuzan, J. Olson, J. Marks, and J. Nevins, “Predicting the clinical
status of human breast cancer by using gene expression profiles,”
Proceedings of the National Academy of Sciences of the United States
of America, vol. 98, no. 20, pp. 11 462–11 490, 2001.
[66] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P.
Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D.
Bloomfield, and E. S. Lander, “Molecular classification of cancer: Class
discovery and class prediction by gene expression monitoring,” Science,
vol. 286, no. 5439, pp. 531–537, 1999.
[67] D. Page, F. Zhan, J. Cussens, M. Waddell, J. Hardin, B. Barlogie,
and J. Shaughnessy Jr, “Comparative data mining for microarrays:
A case study based on multiple myeloma,” in Poster presentation at
International Conference on Intelligent Systems for Molecular Biology
August, 2002.
TABLE I: MNIST dataset: error on the reference set, computed using the soft loss.
n   | MDf         | RCf         | MD           | RC           | KCV          | LOO          | BTS
10  | 8.46 ± 0.97 | 8.98 ± 1.12 | 12.90 ± 0.83 | 13.20 ± 0.86 | 10.70 ± 0.88 | 10.70 ± 0.88 | 13.40 ± 0.76
20  | 5.10 ± 0.67 | 5.10 ± 0.67 | 8.39 ± 1.11  | 8.93 ± 1.20  | 6.96 ± 0.70  | 6.69 ± 0.71  | 9.37 ± 0.62
40  | 3.05 ± 0.23 | 3.05 ± 0.23 | 6.26 ± 0.16  | 6.26 ± 0.16  | 4.56 ± 0.27  | 4.31 ± 0.26  | 5.93 ± 0.26
60  | 2.36 ± 0.23 | 2.36 ± 0.23 | 5.95 ± 0.12  | 5.95 ± 0.12  | 3.42 ± 0.27  | 3.25 ± 0.29  | 4.40 ± 0.25
80  | 1.96 ± 0.14 | 1.96 ± 0.14 | 5.61 ± 0.07  | 5.61 ± 0.07  | 2.94 ± 0.18  | 2.79 ± 0.17  | 3.61 ± 0.17
100 | 1.63 ± 0.11 | 1.63 ± 0.11 | 5.26 ± 0.29  | 5.36 ± 0.21  | 2.42 ± 0.14  | 2.35 ± 0.17  | 3.15 ± 0.14
120 | 1.44 ± 0.11 | 1.44 ± 0.11 | 4.98 ± 0.40  | 4.98 ± 0.40  | 2.17 ± 0.14  | 2.09 ± 0.17  | 2.86 ± 0.15
150 | 1.27 ± 0.09 | 1.27 ± 0.09 | 3.71 ± 0.58  | 4.41 ± 0.53  | 1.89 ± 0.12  | 1.85 ± 0.15  | 2.43 ± 0.14
170 | 1.20 ± 0.08 | 1.20 ± 0.08 | 2.71 ± 0.42  | 3.59 ± 0.57  | 1.74 ± 0.11  | 1.65 ± 0.11  | 2.18 ± 0.12
200 | 1.08 ± 0.09 | 1.08 ± 0.09 | 2.25 ± 0.21  | 2.75 ± 0.47  | 1.53 ± 0.09  | 1.44 ± 0.09  | 1.98 ± 0.09
250 | 0.92 ± 0.05 | 0.92 ± 0.05 | 2.07 ± 0.03  | 2.07 ± 0.03  | 1.34 ± 0.06  | 1.27 ± 0.06  | 1.67 ± 0.08
300 | 0.81 ± 0.07 | 0.81 ± 0.07 | 2.02 ± 0.04  | 2.02 ± 0.04  | 1.18 ± 0.08  | 1.11 ± 0.09  | 1.48 ± 0.09
400 | 0.70 ± 0.06 | 0.70 ± 0.06 | 1.93 ± 0.02  | 1.93 ± 0.02  | 0.98 ± 0.06  | 0.92 ± 0.07  | 1.24 ± 0.07

TABLE II: DaimlerChrysler dataset: error on the reference set, computed using the soft loss.
n   | MDf          | RCf          | MD           | RC           | KCV          | LOO          | BTS
10  | 37.40 ± 3.38 | 37.90 ± 3.52 | 42.80 ± 2.91 | 44.80 ± 2.54 | 37.10 ± 2.58 | 37.10 ± 2.58 | 39.00 ± 2.63
20  | 31.50 ± 2.02 | 31.70 ± 2.00 | 37.70 ± 2.43 | 37.90 ± 2.38 | 32.00 ± 1.16 | 31.40 ± 1.11 | 34.60 ± 1.34
40  | 28.00 ± 0.76 | 28.00 ± 0.75 | 33.10 ± 0.64 | 33.10 ± 0.64 | 29.10 ± 0.81 | 29.50 ± 0.80 | 30.70 ± 0.68
60  | 26.60 ± 0.51 | 26.60 ± 0.50 | 31.60 ± 0.49 | 31.70 ± 0.46 | 27.60 ± 0.53 | 27.60 ± 0.75 | 28.90 ± 0.55
80  | 25.70 ± 0.50 | 25.70 ± 0.50 | 30.60 ± 0.48 | 30.90 ± 0.47 | 26.80 ± 0.59 | 26.90 ± 0.78 | 28.20 ± 0.48
100 | 25.20 ± 0.71 | 25.20 ± 0.71 | 30.20 ± 0.46 | 30.40 ± 0.49 | 25.70 ± 0.60 | 26.00 ± 0.78 | 27.90 ± 0.69
120 | 23.80 ± 0.43 | 23.80 ± 0.43 | 29.70 ± 0.40 | 29.80 ± 0.39 | 24.60 ± 0.42 | 24.60 ± 0.53 | 26.60 ± 0.49
150 | 22.90 ± 0.38 | 22.90 ± 0.37 | 28.90 ± 0.37 | 29.40 ± 0.34 | 23.80 ± 0.45 | 23.70 ± 0.43 | 25.40 ± 0.34
170 | 22.40 ± 0.35 | 22.40 ± 0.35 | 27.90 ± 0.33 | 28.70 ± 0.41 | 23.30 ± 0.38 | 23.20 ± 0.52 | 25.00 ± 0.34
200 | 21.80 ± 0.39 | 21.90 ± 0.38 | 27.90 ± 0.33 | 28.20 ± 0.37 | 22.70 ± 0.40 | 22.60 ± 0.44 | 24.70 ± 0.38
250 | 21.30 ± 0.39 | 21.30 ± 0.39 | 27.10 ± 0.23 | 27.30 ± 0.21 | 21.80 ± 0.34 | 21.60 ± 0.39 | 23.40 ± 0.30
300 | 20.50 ± 0.40 | 20.50 ± 0.41 | 27.00 ± 0.30 | 27.10 ± 0.24 | 21.00 ± 0.33 | 20.80 ± 0.35 | 22.50 ± 0.32
400 | 19.60 ± 0.29 | 19.60 ± 0.29 | 26.10 ± 0.35 | 26.30 ± 0.32 | 20.00 ± 0.27 | 20.00 ± 0.30 | 21.50 ± 0.25

TABLE III: MNIST dataset: error on the reference set, computed using the hard loss.
n   | MDf         | RCf         | MD          | RC          | KCV         | LOO         | BTS
10  | 2.33 ± 0.94 | 2.55 ± 1.04 | 2.69 ± 0.67 | 2.78 ± 0.66 | 2.77 ± 0.83 | 2.77 ± 0.83 | 3.21 ± 0.67
20  | 1.16 ± 0.31 | 1.16 ± 0.31 | 1.41 ± 0.43 | 1.49 ± 0.43 | 1.58 ± 0.42 | 1.91 ± 0.63 | 1.79 ± 0.34
40  | 0.47 ± 0.05 | 0.47 ± 0.05 | 0.72 ± 0.09 | 0.72 ± 0.09 | 0.90 ± 0.27 | 1.72 ± 0.56 | 0.78 ± 0.12
60  | 0.47 ± 0.10 | 0.47 ± 0.10 | 0.73 ± 0.10 | 0.73 ± 0.10 | 0.85 ± 0.26 | 1.15 ± 0.36 | 0.66 ± 0.12
80  | 0.39 ± 0.05 | 0.39 ± 0.05 | 0.65 ± 0.07 | 0.65 ± 0.07 | 0.73 ± 0.17 | 1.26 ± 0.30 | 0.56 ± 0.09
100 | 0.30 ± 0.04 | 0.30 ± 0.04 | 0.58 ± 0.06 | 0.59 ± 0.06 | 0.53 ± 0.14 | 0.76 ± 0.22 | 0.40 ± 0.05
120 | 0.28 ± 0.03 | 0.28 ± 0.03 | 0.56 ± 0.05 | 0.56 ± 0.05 | 0.57 ± 0.12 | 0.79 ± 0.18 | 0.43 ± 0.06
150 | 0.29 ± 0.04 | 0.29 ± 0.04 | 0.45 ± 0.07 | 0.51 ± 0.06 | 0.61 ± 0.19 | 0.77 ± 0.22 | 0.37 ± 0.05
170 | 0.30 ± 0.05 | 0.30 ± 0.05 | 0.35 ± 0.05 | 0.43 ± 0.07 | 0.49 ± 0.09 | 0.58 ± 0.11 | 0.37 ± 0.06
200 | 0.28 ± 0.04 | 0.28 ± 0.04 | 0.31 ± 0.04 | 0.37 ± 0.06 | 0.50 ± 0.11 | 0.61 ± 0.13 | 0.34 ± 0.04
250 | 0.25 ± 0.02 | 0.25 ± 0.02 | 0.28 ± 0.02 | 0.28 ± 0.02 | 0.40 ± 0.07 | 0.50 ± 0.10 | 0.32 ± 0.03
300 | 0.25 ± 0.04 | 0.25 ± 0.04 | 0.27 ± 0.02 | 0.27 ± 0.02 | 0.39 ± 0.08 | 0.48 ± 0.11 | 0.28 ± 0.04
400 | 0.20 ± 0.02 | 0.20 ± 0.02 | 0.26 ± 0.01 | 0.26 ± 0.01 | 0.29 ± 0.06 | 0.36 ± 0.07 | 0.24 ± 0.02

TABLE IV: DaimlerChrysler dataset: error on the reference set, computed using the hard loss.
n   | MDf          | RCf          | MD           | RC           | KCV          | LOO          | BTS
10  | 33.60 ± 4.53 | 34.60 ± 4.91 | 31.40 ± 4.01 | 31.70 ± 4.00 | 32.30 ± 3.54 | 32.30 ± 3.54 | 35.30 ± 3.56
20  | 27.10 ± 2.53 | 27.10 ± 2.52 | 27.20 ± 1.05 | 27.30 ± 1.09 | 26.70 ± 1.06 | 25.70 ± 0.61 | 29.40 ± 1.41
40  | 23.80 ± 0.75 | 23.80 ± 0.75 | 26.00 ± 0.63 | 26.00 ± 0.63 | 24.70 ± 0.88 | 24.40 ± 0.80 | 26.00 ± 0.78
60  | 23.10 ± 0.54 | 23.10 ± 0.54 | 25.90 ± 0.79 | 26.00 ± 0.80 | 23.50 ± 0.85 | 23.20 ± 0.78 | 24.40 ± 0.53
80  | 22.20 ± 0.54 | 22.30 ± 0.54 | 25.00 ± 0.51 | 25.20 ± 0.55 | 22.70 ± 0.73 | 22.80 ± 0.84 | 23.50 ± 0.48
100 | 22.00 ± 0.75 | 22.00 ± 0.75 | 24.20 ± 0.49 | 24.20 ± 0.48 | 21.80 ± 0.77 | 22.00 ± 0.74 | 23.40 ± 0.72
120 | 20.90 ± 0.50 | 20.90 ± 0.52 | 24.10 ± 0.55 | 24.30 ± 0.50 | 20.80 ± 0.51 | 21.00 ± 0.71 | 22.30 ± 0.51
150 | 20.10 ± 0.42 | 20.10 ± 0.40 | 23.70 ± 0.48 | 24.00 ± 0.49 | 19.80 ± 0.46 | 20.30 ± 0.78 | 21.50 ± 0.40
170 | 20.00 ± 0.43 | 20.00 ± 0.43 | 23.10 ± 0.36 | 23.60 ± 0.41 | 19.80 ± 0.51 | 19.90 ± 0.63 | 21.10 ± 0.40
200 | 19.40 ± 0.41 | 19.40 ± 0.41 | 22.70 ± 0.49 | 23.00 ± 0.51 | 19.20 ± 0.44 | 19.10 ± 0.48 | 20.80 ± 0.40
250 | 19.10 ± 0.39 | 19.10 ± 0.39 | 22.60 ± 0.43 | 22.70 ± 0.43 | 18.60 ± 0.44 | 18.50 ± 0.42 | 19.70 ± 0.32
300 | 18.50 ± 0.40 | 18.50 ± 0.41 | 22.40 ± 0.36 | 22.50 ± 0.31 | 17.80 ± 0.31 | 17.80 ± 0.37 | 19.00 ± 0.29
400 | 17.90 ± 0.32 | 17.90 ± 0.32 | 21.40 ± 0.55 | 21.60 ± 0.57 | 17.00 ± 0.25 | 16.90 ± 0.34 | 18.20 ± 0.27

TABLE V: MNIST dataset: error estimation using the soft loss.
n   | MDf          | RCf          | MD           | RC           | KCV          | LOO | BTS
10  | –            | –            | –            | –            | –            | –   | 78.09 ± 1.56
20  | –            | –            | –            | –            | 85.60 ± 0.76 | –   | 54.56 ± 0.98
40  | 77.00 ± 0.00 | 77.00 ± 0.00 | –            | –            | 62.19 ± 0.43 | –   | 32.34 ± 0.58
60  | 62.90 ± 0.00 | 62.90 ± 0.00 | 83.60 ± 0.52 | 85.00 ± 0.46 | 48.28 ± 0.31 | –   | 23.87 ± 0.41
80  | 54.50 ± 0.00 | 54.50 ± 0.00 | 72.70 ± 0.45 | 73.90 ± 0.41 | 39.16 ± 0.25 | –   | 18.21 ± 0.31
100 | 48.70 ± 0.00 | 48.70 ± 0.00 | 65.30 ± 0.30 | 66.50 ± 0.32 | 32.66 ± 0.19 | –   | 15.71 ± 0.24
120 | 44.50 ± 0.00 | 44.50 ± 0.00 | 59.50 ± 0.33 | 60.90 ± 0.34 | 28.41 ± 0.18 | –   | 13.81 ± 0.22
150 | 39.80 ± 0.00 | 39.80 ± 0.00 | 54.90 ± 0.26 | 55.10 ± 0.27 | 24.02 ± 0.17 | –   | 11.29 ± 0.21
170 | 37.40 ± 0.00 | 37.40 ± 0.00 | 51.80 ± 0.22 | 52.20 ± 0.23 | 21.39 ± 0.11 | –   | 10.01 ± 0.17
200 | 34.40 ± 0.00 | 34.40 ± 0.00 | 47.80 ± 0.19 | 48.40 ± 0.19 | 18.66 ± 0.12 | –   | 9.00 ± 0.14
250 | 30.80 ± 0.00 | 30.80 ± 0.00 | 43.30 ± 0.18 | 43.20 ± 0.17 | 15.44 ± 0.11 | –   | 7.13 ± 0.14
300 | 28.10 ± 0.00 | 28.10 ± 0.00 | 39.70 ± 0.17 | 39.60 ± 0.16 | 13.40 ± 0.10 | –   | 6.31 ± 0.16
400 | 24.40 ± 0.00 | 24.40 ± 0.00 | 34.80 ± 0.15 | 34.90 ± 0.16 | 10.23 ± 0.08 | –   | 5.03 ± 0.08

TABLE VI: DaimlerChrysler dataset: error estimation using the soft loss.
n   | MDf          | RCf          | MD           | RC           | KCV          | LOO | BTS
10  | –            | –            | –            | –            | –            | –   | 98.11 ± 4.96
20  | –            | –            | –            | –            | –            | –   | 77.76 ± 3.80
40  | 88.30 ± 1.92 | 88.40 ± 1.84 | –            | –            | 84.05 ± 2.03 | –   | 63.48 ± 2.18
60  | 71.80 ± 1.48 | 72.10 ± 1.50 | –            | –            | 73.58 ± 2.02 | –   | 54.50 ± 1.87
80  | 64.40 ± 1.07 | 64.80 ± 1.11 | 94.50 ± 0.89 | 94.30 ± 0.87 | 68.70 ± 1.13 | –   | 51.50 ± 1.45
100 | 58.90 ± 1.06 | 58.60 ± 1.08 | 87.90 ± 1.08 | 87.80 ± 1.01 | 63.54 ± 1.31 | –   | 48.10 ± 1.37
120 | 52.20 ± 0.95 | 52.20 ± 0.88 | 82.00 ± 0.81 | 82.30 ± 0.78 | 58.93 ± 1.19 | –   | 44.52 ± 1.02
150 | 48.80 ± 0.92 | 48.70 ± 0.89 | 77.60 ± 0.85 | 76.90 ± 0.83 | 55.36 ± 1.36 | –   | 41.83 ± 1.17
170 | 44.50 ± 0.88 | 44.60 ± 0.93 | 73.90 ± 0.74 | 73.50 ± 0.74 | 51.12 ± 0.74 | –   | 39.04 ± 0.94
200 | 42.30 ± 0.76 | 42.30 ± 0.78 | 71.10 ± 0.81 | 70.90 ± 0.82 | 48.64 ± 0.90 | –   | 37.84 ± 0.99
250 | 36.70 ± 0.66 | 36.70 ± 0.65 | 66.20 ± 0.65 | 65.60 ± 0.64 | 44.61 ± 0.82 | –   | 34.79 ± 0.82
300 | 35.40 ± 0.58 | 35.30 ± 0.60 | 62.80 ± 0.68 | 62.40 ± 0.64 | 42.22 ± 0.74 | –   | 33.05 ± 0.77
400 | 30.80 ± 0.75 | 30.80 ± 0.75 | 58.30 ± 0.74 | 57.80 ± 0.71 | 37.97 ± 0.84 | –   | 30.83 ± 0.84

TABLE VII: MNIST dataset: error estimation using the hard loss.
n   | KCV          | LOO | BTS
10  | 95.00 ± 1.99 | –   | 57.65 ± 2.48
20  | 77.64 ± 0.76 | –   | 34.15 ± 0.90
40  | 52.71 ± 0.32 | –   | 18.63 ± 0.45
60  | 39.30 ± 0.25 | –   | 12.79 ± 0.33
80  | 31.23 ± 0.24 | –   | 9.74 ± 0.23
100 | 25.89 ± 0.17 | –   | 7.86 ± 0.19
120 | 22.09 ± 0.13 | –   | 6.59 ± 0.17
150 | 18.10 ± 0.15 | –   | 5.30 ± 0.15
170 | 16.16 ± 0.09 | –   | 4.69 ± 0.11
200 | 13.91 ± 0.10 | –   | 4.00 ± 0.11
250 | 11.29 ± 0.09 | –   | 3.21 ± 0.09
300 | 9.50 ± 0.10  | –   | 2.68 ± 0.09
400 | 7.22 ± 0.07  | –   | 2.02 ± 0.08

TABLE VIII: DaimlerChrysler dataset: error estimation using the hard loss.
n   | KCV          | LOO | BTS
10  | 95.00 ± 6.88 | –   | 80.75 ± 7.42
20  | 77.64 ± 3.83 | –   | 64.80 ± 4.73
40  | 75.14 ± 2.18 | –   | 52.41 ± 2.78
60  | 58.18 ± 2.27 | –   | 42.17 ± 2.20
80  | 59.97 ± 1.44 | –   | 40.29 ± 1.53
100 | 50.69 ± 1.67 | –   | 38.98 ± 1.62
120 | 43.81 ± 1.45 | –   | 35.52 ± 1.28
150 | 43.98 ± 1.52 | –   | 32.94 ± 1.50
170 | 39.56 ± 1.02 | –   | 31.09 ± 1.09
200 | 34.37 ± 0.91 | –   | 29.71 ± 1.10
250 | 32.96 ± 0.97 | –   | 27.68 ± 0.99
300 | 31.90 ± 0.89 | –   | 26.27 ± 0.82
400 | 27.47 ± 0.86 | –   | 24.43 ± 0.83

TABLE X: Human gene expression dataset: error on the reference set, computed using the soft loss.
Dataset            | MDf           | RCf           | MD           | RC           | KCV          | LOO          | BTS
Brain Tumor 1      | 14.00 ± 6.37  | 14.00 ± 7.66  | 33.30 ± 0.02 | 33.30 ± 0.02 | 17.10 ± 1.94 | 15.70 ± 1.80 | 18.90 ± 3.29
Brain Tumor 2      | 5.78 ± 2.35   | 5.78 ± 3.34   | 76.20 ± 0.12 | 76.20 ± 0.12 | 73.90 ± 0.30 | 73.90 ± 0.27 | 75.60 ± 0.14
Colon Cancer 1     | 27.40 ± 14.78 | 27.40 ± 13.97 | 45.30 ± 0.05 | 45.30 ± 0.05 | 30.10 ± 3.01 | 29.00 ± 1.60 | 29.80 ± 4.14
Colon Cancer 2     | 25.50 ± 8.59  | 25.50 ± 7.92  | 67.70 ± 0.04 | 67.70 ± 0.04 | 67.70 ± 0.07 | 67.70 ± 0.07 | 67.70 ± 0.05
DLBCL              | 19.10 ± 2.03  | 19.10 ± 3.40  | 58.00 ± 0.04 | 58.00 ± 0.04 | 57.90 ± 0.06 | 57.90 ± 0.05 | 58.00 ± 0.03
Duke Breast Cancer | 33.40 ± 5.07  | 33.40 ± 5.61  | 50.00 ± 0.06 | 50.00 ± 0.06 | 27.90 ± 1.61 | 26.40 ± 3.00 | 31.60 ± 3.08
Leukemia           | 14.20 ± 5.05  | 14.20 ± 5.59  | 31.20 ± 0.04 | 31.20 ± 0.04 | 20.90 ± 3.61 | 20.60 ± 3.20 | 23.30 ± 3.29
Leukemia 1         | 17.40 ± 4.66  | 17.40 ± 5.16  | 38.80 ± 5.07 | 42.20 ± 3.33 | 18.20 ± 3.46 | 17.40 ± 4.15 | 21.20 ± 4.24
Leukemia 2         | 11.30 ± 4.34  | 11.30 ± 4.08  | 31.10 ± 0.03 | 31.10 ± 0.03 | 9.56 ± 3.08  | 9.59 ± 2.78  | 10.80 ± 3.71
Lung Cancer        | 11.60 ± 2.93  | 11.60 ± 3.35  | 30.20 ± 0.01 | 30.20 ± 0.00 | 11.20 ± 1.53 | 10.60 ± 1.63 | 13.10 ± 1.38
Myeloma            | 8.60 ± 2.00   | 8.60 ± 2.14   | 28.40 ± 0.00 | 28.40 ± 0.00 | 9.06 ± 0.65  | 8.63 ± 0.66  | 11.50 ± 0.60
Prostate Tumor     | 17.80 ± 4.65  | 17.80 ± 3.82  | 25.00 ± 4.25 | 25.60 ± 3.96 | 12.40 ± 3.71 | 11.80 ± 3.40 | 15.30 ± 3.93
SRBCT              | 10.20 ± 4.47  | 10.20 ± 4.49  | 31.50 ± 0.01 | 31.50 ± 0.01 | 10.90 ± 1.33 | 10.50 ± 1.55 | 13.60 ± 1.51

TABLE XI: Human gene expression dataset: error on the reference set, computed using the hard loss.
Dataset            | MDf           | RCf           | MD            | RC            | KCV           | LOO           | BTS
Brain Tumor 1      | 5.56 ± 4.52   | 5.56 ± 4.52   | 33.30 ± 0.00  | 33.30 ± 0.00  | 5.56 ± 4.52   | 8.89 ± 7.28   | 5.56 ± 7.82
Brain Tumor 2      | 2.86 ± 4.50   | 2.86 ± 4.50   | 2.86 ± 4.50   | 2.86 ± 4.50   | 2.86 ± 4.50   | 2.86 ± 4.50   | 2.86 ± 4.50
Colon Cancer 1     | 18.20 ± 16.50 | 18.20 ± 16.50 | 54.50 ± 0.00  | 54.50 ± 0.00  | 12.70 ± 17.50 | 16.40 ± 15.50 | 20.00 ± 13.60
Colon Cancer 2     | 18.60 ± 11.00 | 18.60 ± 11.00 | 42.90 ± 0.00  | 42.90 ± 0.00  | 15.70 ± 6.87  | 17.10 ± 9.36  | 17.10 ± 4.50
DLBCL              | 8.57 ± 4.58   | 8.57 ± 4.58   | 33.30 ± 0.00  | 33.30 ± 0.00  | 7.62 ± 6.24   | 8.57 ± 7.14   | 8.57 ± 4.58
Duke Breast Cancer | 21.70 ± 10.90 | 21.70 ± 10.90 | 48.30 ± 12.50 | 48.30 ± 12.50 | 6.67 ± 8.02   | 10.00 ± 10.50 | 8.33 ± 9.58
Leukemia           | 7.50 ± 6.01   | 5.00 ± 6.01   | 31.20 ± 0.00  | 31.20 ± 0.00  | 5.00 ± 3.21   | 2.50 ± 3.94   | 6.25 ± 5.08
Leukemia 1         | 1.00 ± 2.57   | 1.00 ± 2.57   | 49.00 ± 2.57  | 50.00 ± 0.00  | 2.00 ± 3.15   | 1.00 ± 2.57   | 1.00 ± 2.57
Leukemia 2         | 3.00 ± 3.15   | 4.00 ± 4.81   | 40.00 ± 0.00  | 40.00 ± 0.00  | 5.00 ± 4.07   | 5.00 ± 4.07   | 5.00 ± 4.07
Lung Cancer        | 5.96 ± 3.19   | 5.96 ± 3.19   | 34.00 ± 0.00  | 34.00 ± 0.00  | 7.23 ± 2.19   | 5.96 ± 3.19   | 7.23 ± 2.19
Myeloma            | 0.00 ± 0.00   | 0.00 ± 0.00   | 28.00 ± 0.00  | 28.00 ± 0.00  | 0.00 ± 0.00   | 0.00 ± 0.00   | 0.00 ± 0.00
Prostate Tumor     | 10.90 ± 4.67  | 10.90 ± 4.67  | 34.50 ± 12.00 | 35.50 ± 10.10 | 10.90 ± 4.67  | 10.90 ± 4.67  | 10.90 ± 4.67
SRBCT              | 1.74 ± 2.74   | 1.74 ± 2.74   | 39.10 ± 0.00  | 39.10 ± 0.00  | 2.61 ± 4.47   | 3.48 ± 6.52   | 5.22 ± 8.21

TABLE XII: Human gene expression dataset: error estimation using the soft loss.
Dataset            | MDf          | RCf          | MD           | RC           | KCV          | LOO | BTS
Brain Tumor 1      | 61.10 ± 1.25 | 61.10 ± 1.25 | 89.50 ± 0.00 | 90.20 ± 0.00 | 58.30 ± 0.78 | –   | 31.03 ± 0.54
Brain Tumor 2      | 81.70 ± 0.49 | 81.50 ± 0.02 | 96.10 ± 0.06 | –            | 82.40 ± 0.04 | –   | 39.50 ± 0.07
Colon Cancer 1     | 92.60 ± 2.84 | 92.50 ± 2.91 | –            | –            | 85.40 ± 0.76 | –   | 55.18 ± 0.35
Colon Cancer 2     | 74.90 ± 2.36 | 75.10 ± 2.55 | 82.70 ± 0.06 | 91.10 ± 0.06 | 72.10 ± 0.01 | –   | 32.44 ± 0.02
DLBCL              | 66.30 ± 1.02 | 66.30 ± 1.02 | 66.90 ± 0.08 | 74.10 ± 0.07 | 59.10 ± 0.02 | –   | 23.70 ± 0.02
Duke Breast Cancer | –            | –            | –            | –            | 85.30 ± 2.57 | –   | 60.27 ± 4.02
Leukemia           | 68.20 ± 0.96 | 67.60 ± 1.34 | –            | –            | 66.10 ± 1.22 | –   | 36.21 ± 1.61
Leukemia 1         | 70.70 ± 1.43 | 70.90 ± 1.44 | –            | –            | 70.00 ± 1.45 | –   | 41.80 ± 1.52
Leukemia 2         | 68.30 ± 0.43 | 69.00 ± 1.23 | 99.90 ± 0.01 | 99.20 ± 0.01 | 63.40 ± 0.86 | –   | 33.41 ± 1.55
Lung Cancer        | 39.10 ± 0.01 | 39.10 ± 0.01 | 70.60 ± 0.00 | 70.20 ± 0.00 | 42.00 ± 0.66 | –   | 15.68 ± 0.81
Myeloma            | 54.50 ± 0.00 | 54.50 ± 0.00 | 82.70 ± 0.01 | 82.60 ± 0.01 | 51.10 ± 0.36 | –   | 22.77 ± 0.77
Prostate Tumor     | 56.90 ± 1.39 | 56.90 ± 1.39 | –            | –            | 63.50 ± 0.66 | –   | 36.83 ± 1.08
SRBCT              | 63.40 ± 0.63 | 63.50 ± 0.75 | 97.70 ± 0.00 | 97.60 ± 0.00 | 58.20 ± 0.64 | –   | 31.10 ± 1.83

TABLE XIII: Human gene expression dataset: error estimation using the hard loss.
Dataset            | KCV          | LOO | BTS
Brain Tumor 1      | 34.04 ± 2.03 | –   | 22.05 ± 1.46
Brain Tumor 2      | 56.49 ± 1.29 | –   | 20.50 ± 1.68
Colon Cancer 1     | 56.49 ± 5.14 | –   | 49.29 ± 9.54
Colon Cancer 2     | 46.43 ± 4.11 | –   | 38.65 ± 3.85
DLBCL              | 41.43 ± 1.71 | –   | 21.21 ± 3.54
Duke Breast Cancer | 60.79 ± 2.40 | –   | 45.08 ± 3.90
Leukemia           | 41.43 ± 1.60 | –   | 21.21 ± 1.92
Leukemia 1         | 43.79 ± 1.60 | –   | 22.70 ± 3.63
Leukemia 2         | 43.79 ± 2.57 | –   | 22.70 ± 0.99
Lung Cancer        | 17.47 ± 0.88 | –   | 13.00 ± 1.16
Myeloma            | 31.23 ± 0.00 | –   | 9.74 ± 0.18
Prostate Tumor     | 47.07 ± 1.64 | –   | 20.00 ± 1.14
SRBCT              | 39.30 ± 2.57 | –   | 19.91 ± 1.68

TABLE XIV: MNIST dataset: computational time (in seconds) required by the different in–sample and out–of–sample procedures.
n   | MDf        | RCf        | MD         | RC         | KCV       | LOO        | BTS
10  | 0.1 ± 0.1  | 0.1 ± 0.1  | 0.1 ± 0.1  | 0.1 ± 0.1  | 0.0 ± 0.1 | 0.0 ± 0.1  | 0.0 ± 0.1
20  | 0.3 ± 0.1  | 0.4 ± 0.1  | 0.3 ± 0.1  | 0.3 ± 0.1  | 0.0 ± 0.1 | 0.1 ± 0.1  | 0.0 ± 0.1
40  | 0.7 ± 0.2  | 0.6 ± 0.2  | 0.6 ± 0.1  | 0.7 ± 0.1  | 0.1 ± 0.1 | 0.3 ± 0.1  | 0.0 ± 0.1
60  | 1.1 ± 0.3  | 1.1 ± 0.2  | 1.1 ± 0.1  | 1.2 ± 0.1  | 0.1 ± 0.1 | 0.8 ± 0.1  | 0.0 ± 0.1
80  | 2.3 ± 0.2  | 2.2 ± 0.4  | 2.0 ± 0.1  | 1.9 ± 0.2  | 0.3 ± 0.1 | 2.0 ± 0.3  | 0.1 ± 0.1
100 | 2.4 ± 0.2  | 2.6 ± 0.3  | 2.7 ± 0.2  | 2.7 ± 0.2  | 0.7 ± 0.1 | 2.9 ± 0.7  | 0.2 ± 0.2
120 | 4.5 ± 0.4  | 3.9 ± 0.3  | 4.2 ± 0.3  | 4.0 ± 0.2  | 1.1 ± 0.3 | 5.1 ± 0.4  | 0.5 ± 0.2
150 | 11.4 ± 0.4 | 10.9 ± 0.4 | 10.1 ± 0.3 | 9.7 ± 0.2  | 1.7 ± 0.3 | 9.3 ± 0.4  | 0.7 ± 0.1
170 | 17.8 ± 0.4 | 17.1 ± 0.3 | 12.8 ± 0.3 | 11.9 ± 0.2 | 2.4 ± 0.3 | 13.4 ± 0.4 | 0.9 ± 0.2
200 | 25.1 ± 0.3 | 24.8 ± 0.4 | 21.3 ± 0.3 | 20.5 ± 0.2 | 2.9 ± 0.2 | 18.1 ± 1.1 | 1.4 ± 0.3
250 | 27.4 ± 0.4 | 28.9 ± 0.4 | 25.2 ± 0.2 | 25.9 ± 0.3 | 4.1 ± 0.4 | 27.6 ± 1.2 | 1.9 ± 0.3
300 | 48.1 ± 0.4 | 47.2 ± 0.4 | 36.1 ± 0.2 | 36.8 ± 0.2 | 4.9 ± 0.3 | 39.3 ± 1.1 | 2.3 ± 0.3
400 | 58.9 ± 0.6 | 59.4 ± 0.4 | 44.2 ± 0.2 | 44.1 ± 0.2 | 6.3 ± 0.3 | 59.7 ± 1.3 | 4.1 ± 0.4