
Developing Methods for Machine Learning Algorithms
Using Automated Feature and Sample Selection
by
Hala Helmi
Thesis submitted to The University of Nottingham
for the Degree of Doctor of Philosophy
School of Computer Science
The University of Nottingham
Nottingham, United Kingdom
January 2014
Abstract
Machine learning is concerned with the design and development of algorithms and techniques that allow computers to "learn" from experience with respect to some class of tasks and performance measure. One application of machine learning is to improve the accuracy and efficiency of computer-aided diagnosis systems that assist physicians, radiologists, cardiologists, neuroscientists, and health-care technologists. This thesis focuses on machine learning and its application to different types of data sets, for example breast cancer detection.
Breast cancer, which is the most common cancer in women, is a complex disease characterised by multiple molecular alterations. Current routine clinical management relies on the availability of robust clinical and pathological prognostic and predictive factors, such as the Nottingham Prognostic Index, to support decision making. Recent advances in high-throughput molecular technologies have provided evidence of the biological heterogeneity of breast cancer.
Emphasis is laid on preprocessing of features, pattern classification, and
model selection. Before the classification task, feature selection and feature
transformation may be performed to reduce the dimensionality of the features
and to improve the classification performance. A genetic algorithm (GA) can be employed for feature selection based on different measures of data separability or on the estimated risk of a chosen classifier. A separate nonlinear transformation
can be performed by applying kernel principal component analysis and kernel
partial least squares.
Different classifiers are proposed in this work. The aim is to fuse the outputs of multiple classifiers in the test phase using weights derived during training; such fusion of classifier outputs improves classification performance. Five classifiers are used: Support Vector Machine, Naïve Bayes, k-nearest neighbour, Logistic Regression and Neural Network. Combining several classifiers to solve a given learning problem is a natural approach. Overall, an ensemble of classifiers tends to perform better than an individual classifier. Moreover, the distinct behaviour of each individual classifier is exploited by the ensemble to achieve higher precision for the system as a whole; in addition, it hedges the risk of selecting a single, inadequate classifier.
We propose a novel sampling method to replace the random sampling used by SVM and TSVM, in order to determine whether this could substantially reduce the number of labelled training examples needed and whether it can improve the performance of both SVM and TSVM. The new method uses redundant views to expand the labelled dataset and build strong learning models. The major difference is that the new method uses a number of classifier views (two views in our case) to select and sample unlabelled examples to be labelled by the domain experts, whereas the original SVM and TSVM randomly sample some unlabelled examples and use classifiers to assign labels to them.
The problem of model selection is studied in order to pick the best values of the hyperparameters for a parametric classifier. To choose the optimal kernel or regularization parameters of a classifier, we investigate different criteria, such as the validation error estimate and the leave-one-out bound, as well as different optimization methods, such as grid search, gradient descent, and GA. The problem of tuning the multiple parameters of a 2-norm support vector machine (SVM) is viewed as an identification problem for a nonlinear dynamic system. Independent kernel optimization based on different measures of data separability is also investigated for different kernel-based classifiers.
Numerous computer experiments using benchmark datasets verify the theoretical results and compare the techniques in terms of classification accuracy or the area under the receiver operating characteristic curve. Computational requirements, such as the computing time and the number of hyper-parameters, are also discussed. Experimental results demonstrate the effectiveness of these methods, with improved classification performance.
Acknowledgements
First of all, I would like to thank my supervisors, Prof. Jonathan Garibaldi and Prof. Uwe Aickelin, for the independence, guidance, and support they have given me throughout this project. Being at the University of Nottingham has been great fun. I have been lucky to have such great friends who have given me so much love and support, and who have made this experience such a great pleasure. I am especially grateful to all my colleagues in the Intelligent Modeling and Analysis Group, School of Computer Science, the University of Nottingham.
I would like to acknowledge financial support from the King Abdullah Foreign Scholarship Program (KAS), which provided scholarships and travel funds for my studies at the University of Nottingham, UK.
A very special thank you goes to my friends and my family, the most tolerant people I have ever met, for taking care of me, encouraging me, supporting me, and loving me. I want to thank my sisters for their endless patience whenever I wished to talk to them about my research, even though they work in a very different area. My PhD life would not have been so smooth and enjoyable without them always by my side.
Finally, I am deeply indebted to my dear parents for their love, patience
and encouragement all the years. I am infinitely grateful for the values they
have passed down to me, and for their continuous support throughout all my
studies.
Thank you all!
Contents
Abstract .......... ii
Acknowledgements .......... iv
Contents .......... v
List of Figures .......... viii
List of Tables .......... ix
1 Introduction .......... 2
1.1 Background .......... 2
1.2 Motivation .......... 7
1.3 Aims and Objective .......... 9
1.4 Thesis Organisation .......... 9
2 Literature Review .......... 15
2.1 Introduction .......... 16
2.2 Types of Algorithms .......... 17
2.2.1 Supervised Learning .......... 17
2.2.2 Unsupervised Learning .......... 19
2.2.3 Semi-Supervised Learning .......... 19
2.2.4 Reinforcement Learning .......... 20
2.2.5 Transduction .......... 20
2.2.6 Multi-task Learning .......... 21
2.3 Classification .......... 22
2.3.1 Linear Classifiers .......... 22
2.3.2 Artificial Neural Networks .......... 26
2.3.3 Kernel-based Classifiers .......... 32
2.3.4 Proximal Classifiers .......... 43
2.3.5 Prototype Classifiers .......... 47
2.4 Optimization .......... 49
2.4.1 Genetic Algorithm .......... 50
2.4.2 Gradient Descent .......... 51
2.5 Feature Selection .......... 52
2.5.1 Genetic Algorithm .......... 52
2.5.2 Sequential Backward Selection .......... 53
2.5.3 Recursive Feature Elimination .......... 54
2.6 Summary .......... 55
3 Improving TSVM vs. SVM Accordance Based Sample Selection .......... 57
3.1 Experimental Datasets and Feature Analysis .......... 57
3.1.1 Nottingham Tenovus Primary Breast Carcinoma (NTBC) .......... 58
3.1.2 Wisconsin Diagnosis Breast Cancer Dataset .......... 61
3.2 Background and motivation .......... 64
3.3 Experiments settings .......... 66
3.3.1 Support Vector Machine .......... 67
3.3.2 Measures for predictive accuracy .......... 69
3.5 Results .......... 74
3.6 Discussion of results .......... 83
3.7 Summary .......... 86
4 Automatic Features and Samples Ranking for SVM Classifier .......... 88
4.1 Introduction .......... 88
4.2 Background .......... 89
4.2.1 Feature Ranking .......... 89
4.2.2 Samples Ranking .......... 91
4.3 Methodology .......... 94
4.3.1 Feature Selection .......... 94
4.3.2 Sample Selection .......... 97
4.4 Experiment Settings .......... 102
4.4.1 Datasets .......... 103
4.4.2 Evaluation measures .......... 103
4.4.3 Ranking model .......... 105
4.4.4 Experiments .......... 106
4.5 Experimental Results .......... 108
4.5.1 MADELON data set (Feature Ranking) .......... 108
4.5.2 MADELON data set (Samples Ranking) .......... 112
4.5.3 Nottingham Breast Cancer Data set (Feature Ranking) .......... 113
4.5.4 Nottingham Breast Cancer Data set (Samples Ranking) .......... 115
4.6 Discussions .......... 118
4.7 Summary .......... 120
5 Ensemble weighted classifiers with accordance-based sampling .......... 121
5.1 Introduction .......... 121
5.2 Background .......... 125
5.2.1 Ensemble Weighed Classifier .......... 125
5.2.2 Sampling Most Informative Sample Method (Multi Views Sample MVS) .......... 128
5.3 Experimental Design .......... 130
5.4 Experimental Results and Discussion .......... 132
5.4.1 Runtime Performance Study .......... 147
5.5 Summary .......... 148
6 Examination of TSVM Algorithm Classification Accuracy with Feature Selection in Comparison with GLAD Algorithm .......... 150
6.1 Introduction .......... 150
6.2 Background .......... 150
6.2.1 Support Vector Machines .......... 151
6.2.2 Transductive Support Vector Machines .......... 152
6.2.3 Recursive Feature Elimination .......... 156
6.2.4 Genetic Algorithms .......... 159
6.3 Methods .......... 159
6.3.1 Support Vector Machines .......... 160
6.3.2 Transductive Support Vector Machines .......... 162
6.3.3 Recursive Feature Elimination .......... 164
6.3.4 Genetic Learning Across Datasets (GLAD) .......... 166
6.4 Experiments and Results .......... 168
6.4.1 Datasets .......... 168
6.4.2 TSVM Recursive Feature Elimination (TSVM-RFE) Result .......... 168
6.4.3 Comparing TSVM Algorithm result with GLAD Algorithm .......... 169
6.5 Discussion of results .......... 172
6.6 Summary .......... 173
7 Conclusions and Future Work .......... 176
7.1 Contributions .......... 176
7.3 Dissemination .......... viii
7.3.1 Journal papers .......... 183
7.3.2 Conference papers .......... 183
References
List of Figures
Figure 2.1: The structure of an SLP with one neuron in the output layer. .......... 27
Figure 2.2: The overall structure of an MLP with one hidden layer. .......... 28
Figure 2.3: The overall structure of the RBF networks. .......... 29
Figure 2.4: Illustration of support vectors for linear, non-separable patterns. .......... 42
Figure 3.1: Histogram of variable CK19 .......... 66
Figure 3.2: Histogram of variable P53 .......... 66
Figure 3.3: Histogram of WDBC .......... 66
Figure 3.4: Accordance Based Sampling TSVM vs. SVM with different percentages of labelled training data for each class. .......... 79
Figure 3.5: Original random sampling TSVM vs. SVM with different percentages of labelled training data with random sampling. .......... 80
Figure 3.6: Accordance Based Sampling TSVM vs. SVM with different percentages of labelled training data for each class. .......... 82
Figure 3.7: Original random sampling TSVM vs. SVM with different percentages of labelled training data with random sampling. .......... 83
Figure 4.1: Linear projection of four data points .......... 101
Figure 4.2: Diagram showing an example of an existing feature selection procedure .......... 102
Figure 4.3: Ranking accuracy of Ranking SVM with different feature selection methods on the MADELON dataset .......... 110
Figure 4.4: Ranking accuracy of RankNet with different feature selection methods on the MADELON dataset .......... 111
Figure 4.5: Accuracy convergence of random and selective sampling on the MADELON dataset .......... 112
Figure 4.6: Ranking accuracy of Ranking SVM with different feature selection methods on the NTBC dataset .......... 114
Figure 4.7: Ranking accuracy of RankNet with different feature selection methods on the NTBC dataset .......... 115
Figure 4.8: Accuracy convergence of random and selective sampling on the NDBC dataset .......... 118
Figure 5.1: Performance of SVM, LR, KNN, NN, NB and Majority under 10 folds with random sampling vs. multi view sampling .......... 139
Figure 5.2: Performance of SVM, LR, KNN, NN, NB and Majority under different fold sizes with the multi view sampling method .......... 143
Figure 5.3: Error bar and performance of SVM, LR, KNN, NN, NB and Majority for different dimensionality sizes with the multi view sampling method (d: number of dimensions) .......... 146
Figure 5.4: System runtime with respect to different fold sizes .......... 148
Figure 6.1: Multi Margin vs. SVM Maximum Margin optimal hyperplane separation .......... 152
Figure 6.2: Separation hyperplane (semi-supervised data) .......... 156
Figure 6.3: Maximum margin separation hyperplane for Transductive SVM (semi-supervised data) .......... 162
Figure 6.4: Testing error for 3 data sets. The 5-fold cross-validated paired t-test shows the relative differences between SVM-RFE and TSVM-RFE at the 95% confidence level (linear kernel, C = 1) .......... 171
List of Tables
Table 3.1: Benchmark datasets used in this work............................................. 57
Table 3.2: Complete list of antibodies used and their dilutions ....................... 60
Table 3.3: Comparison of results on three classifiers using all samples .......... 75
Table 3.4: Comparison of results on three classifiers using only 50 samples .. 76
Table 3.5: Average accuracies 10 cross validation experiments for the
classifiers (standard deviation in brackets) ....................................................... 76
Table 3.6: Comparing SVM and TSVM using random sampling using different
percentages of training ...................................................................................... 78
Table 3.7: Comparing SVM and TSVM using accordance sampling using
different percentages of training samples for each class .................................. 78
Table 3.8: Comparing SVM and TSVM using random sampling using different
percentages of training ...................................................................................... 81
Table 3.9: Comparing SVM and TSVM using accordance sampling using
different percentages of training samples for each class .................................. 82
Table 5.1 summarizes the features of sets of data employed for assessment. 132
Table 5.2 Predictive accuracy of each comparing algorithm under 2 folds
comparing Majority voting system (Bold 1st, Italic 2nd) .............................. 134
Table 5.3 Predictive accuracy of each comparing algorithm under 5 folds
comparing Majority voting system (Bold 1st , Italic 2nd ) ........................... 135
Table 5.4 Predictive accuracy of each comparing algorithm under 10 folds
comparing Majority voting system (Bold 1st , Italic 2nd ) ............................ 136
Table 5.5 Predictive accuracy of each comparing algorithm under 2 folds
comparing Majority voting system with multi view sampling method (Bold 1st
, Italic 2nd )..................................................................................................... 137
Table 5.6 Predictive accuracy of each comparing algorithm under 5 folds
comparing Majority voting system with multi view sampling method (Bold 1st
, Italic 2nd )..................................................................................................... 140
Table 5.7 Predictive accuracy of each comparing algorithm under 10 folds
comparing Majority voting system with multi view sampling method (Bold 1st
, Italic 2nd )..................................................................................................... 141
Table 5.8 Predictive accuracy of each comparing folds for Majority voting
system with random sampling comparing to multi view sampling method ... 142
Table 6.1: Accuracy Obtained with SVM-RFE, TSVM-RFE and GLAD ..... 172
Chapter 1
Introduction
1.1 Background
Machine learning usually refers to the changes in systems that perform
tasks associated with artificial intelligence, such as recognition, diagnosis,
planning, robot control, and prediction [1]. Machine learning is very important not only because the achievement of learning in machines might help us understand how animals and humans learn, but also for the following engineering reasons [1]: some tasks cannot be defined well except by
examples; that is, we might be able to specify input/output pairs but not a
concise relationship between inputs and desired outputs. We would like
machines to be able to adjust their internal structure to produce correct outputs
for a large number of sample inputs and, thus, suitably constrain their
input/output function to approximate the relationship implicit in the examples.
Also, machine learning can be used to reach on-the-job improvement of
existing machine designs, to capture more knowledge than humans would want
to write down in order to adapt to a changing environment, to reduce the need
for constant redesign, and to track as much new knowledge as possible.
Classification is a supervised learning procedure in which individual
items are placed into groups based on quantitative information on one or more
characteristics inherent in the items and based on a training set of previously
labelled items. Classification has attracted much research attention as it spans a
vast number of application areas, such as medical diagnosis, speech
recognition, handwriting recognition, natural language processing, document
classification, and internet search engines.
A classification system includes feature extraction, feature selection,
classification, and model selection. Feature extraction characterises an
object by measurements whose values are similar for objects in the same
category but different for objects in different categories. Feature selection is
performed to remove the irrelevant or redundant features that have a negative
effect on the accuracy of the classifier. A classifier uses the feature vector
provided by the feature extractor and feature selector to assign the object to a
category. Parameters of a classifier may be adjusted by optimizing the
estimated classification performance or measures of data separability. This
could lead to the problem of model selection.
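As a concrete illustration of this pipeline (feature selection, classification, and model selection by hyper-parameter tuning), the following minimal sketch uses scikit-learn; the dataset X and y, the choice of an SVM classifier, and all parameter values are illustrative assumptions rather than the methods developed in this thesis.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Feature selection followed by a classifier, chained into one estimator.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),  # keep the k best features
    ("clf", SVC(kernel="rbf")),                     # the classifier itself
])

# Model selection: tune the number of kept features and the classifier
# hyper-parameters by cross-validated grid search.
param_grid = {
    "select__k": [5, 10, 20],
    "clf__C": [0.1, 1, 10],
    "clf__gamma": ["scale", 0.01, 0.1],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
# search.fit(X, y)   # X: feature matrix, y: class labels (assumed given)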
The classification methods are so-called ‘supervised algorithms’.
Supervised machine learning is the search for algorithms that reason from
externally supplied instances to produce general hypotheses, which then make
predictions about future instances. In other words, the goal of supervised
learning is to build a concise model of the distribution of class labels in terms
of predictor features. The resulting classifier is then used to assign class labels
to the testing instances where the values of the predictor features are known but
the value of the class label is unknown [2].
Algorithms for supervised learning range from decision trees to
artificial neural networks and from support vector machines to Bayesian
classifiers. Decision tree learning, used in data mining and machine learning,
uses a decision tree as a predictive model which maps observations about an
item to conclusions about the item’s target value. In these tree structures,
leaves represent classifications and branches represent conjunctions of features
that lead to those classifications. Learned trees can also be re-represented as
sets of if-then rules to improve human readability [3].
Artificial Neural Networks (ANNs) provide a general, practical method
for learning real-valued, discrete-valued, and vector-valued functions from
examples. For certain types of problems, such as learning to interpret complex
real-world sensor data, artificial neural networks are among the most effective
learning methods currently known [3,4]. However, especially for big data sets,
ANNs may become huge and produce sets of rules which are then difficult to
interpret, especially for those researchers not familiar with computational
analysis.
Support Vector Machines (SVMs) can also be used for pattern
classification and nonlinear regression. The main idea of an SVM is to
construct a hyperplane as the decision surface in such a way that the margin of
separation between positive and negative examples is maximised in multidimensional space. The support vector machine can provide a good
generalization performance on pattern classification problems despite the fact
that it does not incorporate problem domain knowledge [4].
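For reference, the maximum-margin idea described above is commonly written as the following optimisation problem (a standard textbook formulation, stated here only as background rather than as this thesis's own derivation): given training pairs $(\mathbf{x}_i, y_i)$ with $y_i \in \{+1, -1\}$, the separable (hard-margin) SVM solves

$$ \min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2} \quad \text{subject to} \quad y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1, \ldots, l, $$

so that the resulting margin of separation equals $2 / \lVert \mathbf{w} \rVert$.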
Bayesian classifiers are based on the assumption that the quantities of
interest are governed by probability distributions and that optimal decisions can
be made by reasoning about these probabilities together with observed data. In
addition, Bayesian learning provides a quantitative approach to weighing the
evidence supporting alternative hypotheses [3].
Worldwide, cancer has become a major issue for human health. The
classification of cancer patients is of great importance for its prognosis. In the
last few years, many unsupervised and supervised algorithms have been
proposed for this task and modern machine learning techniques are
progressively being used by biologists to obtain proper tumour information
from databases. The World Health Organization’s Global Burden of Disease
statistics identified cancer as the second largest global cause of death, after
cardiovascular disease [5]. Cancer is the fastest growing segment of the disease
burden; the number of global cancer deaths is projected to increase by 45%
from 2007 to 2030 from 7.9 million to 11.5 million [6]. Breast cancer is the
second most common type of cancer after lung cancer, with 10.4% of all
cancer incidence, both sexes counted [7], and the fifth most common cause of
cancer death [8]. Breast cancer is a common disease which affects mostly but
not only women. The ability to accurately identify the malignancy is crucial for
prognosis and the preparation of effective treatment.
Breast cancer is usually, but not always, primarily classified by its
histological appearance [9]. The first subjective indication or sign of breast
cancer is typically a lump that feels different from the surrounding breast
tissue. More than 80% of breast cancer cases are discovered when the woman
feels a lump [10]. Lumps found in lymph nodes located in the armpits can also
indicate breast cancer. Whereas manual screening techniques are useful in
determining the possibility of cancer, further testing is necessary to confirm
whether a lump detected on screening is cancer, as opposed to a benign
alternative such as a simple cyst. In a clinical setting, breast cancer is
commonly diagnosed using a “triple test” of clinical breast examination (breast
examination by a trained medical practitioner), mammography, and fine needle
aspiration cytology. Both mammography and clinical breast examination, also
used for screening, can indicate an approximate likelihood that a lump is
cancer, and may also identify any other lesions.
Several treatments are available for breast cancer patients, depending
on the stage of the cancer. Doctors usually take many different factors into
account when deciding how to treat breast cancer. These factors may include
the patient’s age, the size of the tumour, the type of cancer a patient has, and
many more. Cancer research produces huge quantities of data that serve as a
basis for the development of improved diagnosis and therapies. Advanced
statistical and machine learning methods are needed for the interpretation of
primary data and generation of new knowledge needed for the development of
new diagnostic tools, drugs, and vaccines. Identification of functional groups
and subgroups of genes responsible for the development and spread of this type
of cancer as well as its subtypes is urgently needed for proper classification and
identification of key processes that can be targeted therapeutically. In addition,
accurate diagnostic techniques could enable various cancers to be detected in
their early stages and, consequently, the appropriate treatments could be
undertaken earlier [11].
1.2
Motivation
The fundamental motivation for this research is to take the SVM
framework, which is one of the most fundamental techniques in machine
learning, and try to make it more useful and powerful. A serious and solid improvement in this scenario-based approach reveals many opportunities where further research on computer science problems can pay a large dividend in the quality of classification and other data mining research, as well as in the quantity of results and the speed at which new research can be proposed, understood, and accomplished.
The second motivation for this work is to enhance SVM by wrapping
and integrating feature selection within SVM to expand the use of SVM to be
more applicable and practical for real datasets. This can be implemented by
using multi-view feature selection, because one selected feature set may
perform well on a certain dataset but may still not be the best feature set. The main aim is to find an ‘optimum feature set’ which should perform reasonably well. After selecting a number of the best-performing feature sets, keeping multiple feature-set views per dataset should lead to the best performance. One way to search for the optimum feature set is to combine all the feature sets into one large set, consisting of the union of all the individual sets, and to rank the features within it.
The third motivation for the current research has been to examine the SVM and the TSVM (transductive support vector machines have been widely used as a means of treating partially labelled data in semi-supervised learning) as supervised and semi-supervised algorithms able to classify and categorise data into sub-groups. In machine learning and statistics, classification is the
problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations or instances whose category membership is known. The individual observations are analysed into a set of quantifiable properties, known variously as explanatory variables, features, etc. These properties may be categorical (e.g. "A", "B", "AB" or "O" for blood type), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of
blood pressure). Some algorithms work only in terms of discrete data and
require that real-valued or integer-valued data be discretized into groups (e.g.
less than 5, between 5 and 10, or greater than 10). An example would be
assigning a given email into "spam" or "non-spam" classes or assigning a
diagnosis to a given patient as described by observed characteristics of the
patient (gender, blood pressure, presence or absence of certain symptoms, etc.).
In the terminology of machine learning, classification is considered an
instance of supervised learning, i.e. learning where a training set of correctly
identified observations is available. The corresponding unsupervised procedure
is known as clustering (or cluster analysis), and involves grouping data into
categories based on some measure of inherent similarity (e.g. the distance
between instances, considered as vectors in a multi-dimensional vector space).
The fourth motivation of this research is to expand the use of SVM to
provide more useful ways to be able to deal with large amounts of data. Over
the past few decades, rapid developments in data analysis technologies and
developments in information technologies have combined to produce an
incredible amount of information, and analysing all this information could help
with decision-making in many fields. So we need to exploit the creation and
advances in technologies and the consequently fast growth of datasets,
algorithms, computational and statistical techniques, and theory to solve formal
and practical problems arising from the management and analysis of data. Data
mining (the analysis step of the "knowledge discovery in databases" process, or
KDD), an interdisciplinary sub-field of computer science, is the computational
process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database
systems. The overall goal of the data mining process is to extract information
from a data set and transform it into an understandable structure for further use.
Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
1.3 Aims and Objectives
The objectives of this research are to develop an SVM based on one of
the most powerful machine learning algorithms for classification by integrating
feature selection and model selection to improve the classification
performance. The ultimate goals of this multi-disciplinary research project
concern both decision-making and technical aspects. This work aims to help in
the field of decision-making from the widely used approach of considering
classification techniques in which different methods are investigated and
results are derived from a consensus between techniques. Considering a more
technical and computational aspect, the aim is to develop an original
framework to elucidate core representative classes in a given dataset.
Several research questions and hypotheses underlying the overall work
were identified at the beginning of the project. Starting from the already
published studies on machine learning techniques, it was noted that an
extended review and comparison of different classification algorithms had not
yet been carried out. This knowledge gap led to the formulation of the
following research questions:
• Can sample selection methods improve machine learning (TSVM, SVM) to provide more accurate classification results?
• Is it possible to find an automated way to rank features and samples and use them to classify new records in the future?
• Is there a way to combine the results obtained by the multi-classifier approach?
• Can feature selection improve machine learning (TSVM, SVM) to provide more accurate classification results?
In order to achieve the aims stated above and to answer the research questions,
the following objectives were identified:
(i) To establish standard methodologies for multi-classifier approaches for
categorising the data into the right group.
(ii) To investigate the effect on classification accuracy of reducing the number of features or samples, applied to the various data sets available, using both semi-supervised and supervised (SVM and TSVM) methods.
(iii) To investigate different computational analysis methods applicable across
different types of data sets.
(iv) To determine an effective method to evaluate classification results and to
combine them in a set of representative groups of different characteristics.
(v) To develop an automated supervised and semi-supervised classification
algorithm using SVM and TSVM to be applied to any possible source of data,
independent of their underlying distributions.
1.4 Thesis Organisation
This thesis is structured as follows. Chapter 2 presents a literature
review of various classification approaches developed in the past to categorise
data points with high similarity. A review of different classification methods
used in literature to classify various data sets, including breast cancer data, has
been performed and is reported as well. Classification validity is introduced in
this chapter as a technique to assess the quality of classification results, and as
a method to select the best number of features to consider in the analysis.
Several validity measures used in this thesis work are reviewed and analysed in
detail. In addition, a general description of techniques developed for consensus
classification is reported, explaining various methods to assess the comparison
and the accord among different classification approaches. To conclude the
overview on classification algorithms, the feature selection approach is
described together with the most commonly used algorithm. The chapter ends
with a review of semi-supervised and supervised classification methods, which
are used to build models of the class distribution labels and to predict the class
assignment of possible new objects.
Chapter 3 is dedicated to measures that aim to evaluate the separability of features. A comparison was made between the performance of the original random-sampling SVM and TSVM on the one hand, and the new accordance-sampling method on the other. Accuracy was used to measure the performance of the models. The accordance sampling method was applied to TSVM and SVM on two datasets, and each method was run 10 times on each dataset. The original SVM and TSVM algorithms were then run using the same setup to measure their performance. To investigate the performance of accordance sampling on breast cancer classification, unlabelled sampling pools of 10%, 20%, 30%, etc. of the training data for each class were created for each dataset. Two examples from the pool were then randomly selected to provide the initial labelled examples. Thus, the learner obtained the remaining unlabelled examples and the two labelled examples, and the resulting classifier was tested.
In Chapter 4 supervised classification methods are used to validate the
classification. It presents the modelling of automated features and samples
ranking using Support Vector Machines (SVM). A new technique for feature
selection and ranking is suggested in this chapter. In biomedical data classification, feature selection can help to remove unimportant or redundant features and thus reduce classifier complexity. We have carried out many experiments
to check the performance of the suggested method in ranking for medical data,
and the method proves its ability to outperform traditional feature selection
methods. Our experiments are based on two main benchmark datasets. The
first, MADELON, is for sorting random data. This dataset includes around
4400 patients with dual relevance judgments. The Nottingham Tenovus
Primary Breast Cancer (NTBC) dataset [12] is the second dataset.
Chapter 5 provides a description of the new weighted voting
classification ensemble method based on a classifier combination scheme and a
novel multi-view sampling method. Because supervised labelling of data is costly, effort can be saved by avoiding the labelling of examples that carry little information, as is done in active learning; accordingly, active learning approaches to the semi-supervised learning problem are central to our classification work. The focus here is on five classifiers: Support vector machine, Naïve Bayes, k-nearest neighbour, Logistic Regression and Neural Network. In simple majority voting, all classifiers have equal weights; hence, if the classifiers make different predictions for an instance, the final decision becomes arbitrary due to tied votes. We assume that classifiers which correctly classify the most informative instances, selected by the new multi-view sampling method, are more reliable. If the classifiers make different predictions on an unseen instance, it is therefore reasonable to give more weight to the classifiers whose predictions most often agree with the majority.
Chapter 6 proposes to observe the performance of Transductive SVMs
(TSVM) combined with a feature selection method called recursive feature
elimination (RFE), which we use to select features for TSVMs for the first
time. The goal is to examine the classifiers’ accuracy and classification errors
using the TSVM method. This is in order to determine whether this method is
an effective model when combined with recursive feature elimination,
compared with another algorithm called Genetic Learning Across Datasets
(GLAD). On average, the results for semi-supervised learning surpass
supervised learning. However, it is also shown that the GLAD algorithm
outperforms SVM-RFE when we make use of the labelled data only. On the
other hand, TSVM-RFE exceeds GLAD when unlabelled data along with
labelled data are used. It performs much better with gene selection and
performs well even if the labelled data set is small.
The last chapter of this thesis, Chapter 7, concludes the work, drawing
out the main contributions and highlighting several possible directions for
future research. A list of publications and oral presentations derived from this
thesis is reported at the end of the chapter.
Chapter 2
Literature Review
This chapter presents the fundamentals of machine learning, centred on the topic of classification, with an introduction to machine learning and its applications set out in section 2.1. A brief survey of the prevalent kinds of machine learning algorithms is provided in section 2.2. Classification, as one of the most common learning techniques used by scientists as well as the focal point of this PhD study, can be regarded as a typical formulation of the supervised learning task. Reaching a consensus classification decision about the correspondence between types of breast cancer is a vital element of this work. A broad overview of the principal classification methods in use is presented in section 2.3. Many optimization algorithms are revisited in section 2.4, as the majority of machine learning algorithms either use optimization or are themselves instances of optimization algorithms.
This chapter has two main goals: firstly, to provide relevant fundamental information concerning all the research subjects that have been used in the development of the original framework, and secondly, to indicate the gaps in the body of knowledge. This provides the motivation for the thesis: to develop a framework that makes the core classes in a data set clear, applicable to any accessible source of data.
2.1 Introduction
Machine learning is a sub-field of artificial intelligence. The design and development of machine learning algorithms and techniques allow computers to ''learn''. In Mitchell's definition [13]:
''A computer program is said to learn from experience E with respect to some class of tasks T and performance measure M, if its performance at tasks in T, as measured by M, improves with experience E.''
In the last 50 years, the study of machine learning has developed from the efforts of a few computer engineers exploring whether computers could learn to play games into a field of statistics that clearly reaches beyond computational considerations. Studies have led to major statistical and computational theories of learning processes, and to learning algorithms that are routinely used in commercial systems, from speech recognition to computer vision; the field has also spun off an industry in data mining to discover the underlying regularities in the spectacular volume of data now available from the internet [14]. A number of choices are involved in designing a machine learning approach, including choosing the type of training experience, the target function to be learned, a representation for this target function, and an algorithm for learning the target function from training samples [13]. Machine
learning is naturally a multidisciplinary field, which draws on results from
artificial intelligence, probability and statistics, optimization theory, computational complexity theory, control theory, information theory, philosophy and other fields.
There are countless applications of machine learning, such as natural language processing [15,16,17], handwriting recognition [18,19,20,21], face and fingerprint recognition [22,19,20,23], search engines [24,25], medical analysis [26,27,28,29,30,31,32,33], and bioinformatics and cheminformatics [34,35,36]. In addition, they include detecting credit card fraud [37], assaying the stock market [38], classifying DNA sequences [39], object recognition in computer vision [40], compressing images [41], playing games [42,43], machinery movement and robot locations [44], and machine learning condition monitoring [45,46,47,48,49,50,51,52].
2.2 Types of Algorithms
Machine learning algorithms are classified according to the required
results. Prevalent kinds of algorithm include supervised learning, unsupervised
learning, semi-supervised learning, reinforcement learning, transduction and
finally multi-task learning.
2.2.1 Supervised Learning
The main aspect of this thesis is SVM, which is a type of supervised
learning, so it is worth giving some brief information about different types of
learning. To start with, supervised learning is a machine learning technique used to create a function from a group of training samples, each consisting of an input object (a feature vector) and a required outcome (target). The output of the function can take continuous values (regression) or can be a label for the category of the input data (classification).
The main aim of supervised learning is to predict the value of the function for any valid input object after learning from a number of training samples (i.e. pairs of input feature vectors and output targets). Dealing with a supervised learning problem involves several stages:
1. Determine the type of training samples, which could be, for example, a single feature from a patient's record, all features for that one patient, or all features from the records of several patients.
2. Gather a training set that reflects the real-life background of the problem; that is, assemble the input data together with the corresponding results, obtained either from the manual efforts of scientists or automatically.
3. Determine the input feature representation of the learned function (feature extraction). The accuracy of the learned function depends strongly on how well the input is represented. The input object is turned into a feature vector containing a number of features that describe the object. The number of features should not be so large that it causes the curse of dimensionality, but it should be large enough to allow accurate prediction of the outcome.
4. Identify the structure of the learned function and a matching learning algorithm.
5. Finalise the model. Run the learning algorithm on the gathered training set. The parameters of the learning algorithm can be adjusted by optimising performance on a subset of the training set (namely a validation set) or by means of cross-validation. After learning and adjusting the parameters, a test set that has been kept separate from the training set should be used to measure the performance (see the sketch after this list).
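The following minimal sketch illustrates steps 2-5 above (gather the data, learn, tune by an internal validation procedure, then measure on a held-out test set). It assumes scikit-learn, a k-nearest-neighbour learner, and an already loaded feature matrix X with labels y; all of these choices and parameter values are illustrative, not the methods used later in this thesis.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Step 2: X is the feature matrix, y the expert-supplied labels (assumed given).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 4-5: choose the learner, tune its parameter by cross-validation
# (which plays the role of the validation set), then report accuracy on
# the test set that was kept separate from training.
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)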
2.2.2 Unsupervised Learning
Unsupervised learning [53] is a method of machine learning where a
model is fit to observations. It is distinguished from supervised learning by the
fact that there is no a priori output. In unsupervised learning, a data set of input
objects is gathered, and treated as a set of random variables. A joint density
model is then built for the data set.
Unsupervised learning can be used in combination with Bayesian
inference to produce conditional probabilities for any of the random variables
given the others. A holy grail of unsupervised learning is the creation of a factorial code of the data, which may make a later supervised learning method work better when the raw input data is first translated into such a code. Unsupervised learning is also useful for data compression. Another form
of unsupervised learning is clustering [54], which is sometimes not
probabilistic.
2.2.3 Semi-Supervised Learning
Semi-supervised learning [55] in computer science is a technique of
machine learning which depends, for training, on both labelled and unlabelled
data (though generally more on unlabelled data). The technique is called "semi-supervised learning" since it is intermediate between unsupervised learning (which does not depend on labelled data at all) and supervised learning (which, on the contrary, depends entirely on labelled data).
Many computer scientists [56,57] have found that mixing unlabelled and labelled data can enhance the accuracy of learning. Acquiring labelled data for a learning problem typically requires skilled and qualified experts who can classify the training samples. The labelling process is very costly, which may make fully labelled training infeasible, whereas unlabelled data is widely available. In such situations, semi-supervised learning can be of great benefit.
2.2.4 Reinforcement Learning
Reinforcement learning [58,59,60] is a sub-area of machine learning
concerned with how an agent ought to take actions in an environment so as to
maximize some notion of long-term reward. Reinforcement learning
algorithms attempt to find a policy that maps states of the world to the actions
the agent ought to take in those states. The environment is typically formulated
as a finite-state Markov decision process (MDP), and reinforcement learning
algorithms for this context are highly related to dynamic programming
techniques. State transition probabilities and reward probabilities in the MDP
are typically stochastic but stationary over the course of the problem.
Reinforcement learning differs from the supervised learning problem
in that correct input/output pairs are never presented, nor sub-optimal actions
explicitly corrected. Further, there is a focus on on-line performance, which
involves finding a balance between exploration (of uncharted territory) and
exploitation (of current knowledge). The exploration vs. exploitation trade-off
in reinforcement learning has been mostly studied through the multi-armed
bandit problem.
2.2.5 Transduction
Vapnik [61] presented transduction in the last decades of the twentieth
century. Motivated by his view that transduction is preferable to induction,
since induction requires solving a more general problem (inferring a function)
before solving a more specific problem (computing outputs for new cases),
Vapnik also added that:
"When solving a problem of interest, do not solve a more general
problem as an intermediate step. Try to get the answer that you
really need but not a more general one." [61]
Binary classification is a clear instance where non-inductive learning can help: given a large set of test inputs, the way those inputs cluster may provide useful information about their classification labels.
We can clearly consider this an instance of semi-supervised learning. A
transductive support vector machine (TSVM) [61] provides an instance of an
algorithm in this category. We may regard our need for approximation as a
third means of transduction. The Bayesian committee machine (BCM) [62] is
another instance of an algorithm belonging to this category.
2.2.6 Multi-task Learning
Multi-task learning [63] does not deal with a single problem in isolation; its methodology relies on learning all related problems at the same time. Since the learner is able to exploit the commonality among the tasks, an improved model is obtained for the principal task, and for these reasons multi-task learning is a type of inductive transfer.
2.3 Classification
This section introduces classification. As this is the main aspect of this
thesis, almost all the chapters will cover at least aspects of classification so it is
important to give a brief introduction to classification and its methods. One
standard formulation of the supervised learning task is classification, including
binary and multi-class classification. Binary classification is the task of
classifying the input samples into two groups on the basis of whether they have some common property or not, such as medical diagnosis [26,27,28,29,30,31,32,33]. Given a set of $l$ labelled training samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, where $\mathbf{x}_i \in \mathcal{X} \subseteq \mathbb{R}^n$, $\mathcal{X}$ is the $n$-dimensional real feature space with a binary label space $\mathcal{Y} = \{1, -1\}$, and $y_i \in \mathcal{Y}$ is the label assigned to sample $\mathbf{x}_i$, the aim of binary classification is to seek a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that best predicts the label for the input sample. Multi-class classification is the task of assigning the input samples into one of the multiple categories. One conventional way to extend binary classifiers to a multi-class scenario is to decompose a multi-class problem into a series of two-class problems using a one-against-all implementation [64]. To solve a $c$-class classification problem with a label space $\mathcal{Y} = \{1, 2, \ldots, c\}$, $c$ binary classifiers can be constructed, each separating one class from the remaining classes.
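As a concrete illustration of the one-against-all decomposition just described, the short sketch below trains c binary classifiers and assigns each sample to the class whose classifier gives the largest decision value. It assumes NumPy and scikit-learn's LinearSVC purely for illustration; the chapter itself does not prescribe a particular base classifier.

import numpy as np
from sklearn.svm import LinearSVC

def one_vs_all_fit(X, y, classes):
    """Train one binary classifier per class (that class vs. the rest)."""
    models = {}
    for c in classes:
        binary_labels = np.where(y == c, 1, -1)
        models[c] = LinearSVC().fit(X, binary_labels)
    return models

def one_vs_all_predict(models, X):
    """Assign each sample to the class with the largest decision value."""
    classes = list(models)
    scores = np.column_stack([models[c].decision_function(X) for c in classes])
    return np.array(classes)[scores.argmax(axis=1)]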
2.3.1 Linear Classifiers
Fisher [65] proposed a method of linear classification called Fisher Linear Discriminant Analysis (FLDA), which seeks separating functions that best separate two or more classes of samples based on the ratio of the between-class to the within-class scatter. The separating function, given as
$$ f(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + b, \qquad (2.1) $$
is fixed by maximizing the objective
$$ J(\mathbf{w}) = \frac{\mathbf{w}^{T} S_b\, \mathbf{w}}{\mathbf{w}^{T} S_w\, \mathbf{w}}, \qquad (2.2) $$
where $\mathbf{w}$ and $b$ represent the weight vector and the bias of the separating function respectively, and $S_b$ and $S_w$ stand for the between-class and within-class scatter matrices, given by
$$ S_b = (\mathbf{m}_{+} - \mathbf{m}_{-})(\mathbf{m}_{+} - \mathbf{m}_{-})^{T}, \qquad
   S_w = \sum_{\mathbf{x}_i \in X_{+}} (\mathbf{x}_i - \mathbf{m}_{+})(\mathbf{x}_i - \mathbf{m}_{+})^{T} + \sum_{\mathbf{x}_i \in X_{-}} (\mathbf{x}_i - \mathbf{m}_{-})(\mathbf{x}_i - \mathbf{m}_{-})^{T}, \qquad (2.3\text{--}2.6) $$
with class means $\mathbf{m}_{+} = \frac{1}{l_{+}}\sum_{\mathbf{x}_i \in X_{+}} \mathbf{x}_i$ and $\mathbf{m}_{-} = \frac{1}{l_{-}}\sum_{\mathbf{x}_i \in X_{-}} \mathbf{x}_i$, where $\mathbf{x}_i$ denotes a sample of the positive class $(+)$ or of the negative class $(-)$, $l_{+}$ and $l_{-}$ refer to the numbers of positive and negative training samples, and $X_{+}$ and $X_{-}$ denote the subsets of positive and negative training samples respectively.

The optimal values of $\mathbf{w}$ and $b$ can be calculated by solving a generalized eigenvalue problem [66]. Letting $f^{*}(\mathbf{x})$ denote the derived optimal separating function, the label for an input sample $\mathbf{x}$ is predicted by
$$ \hat{y} = \operatorname{sgn}\big(f^{*}(\mathbf{x})\big), \qquad (2.7) $$
where $\operatorname{sgn}(\cdot)$ is $+1$ when its argument is non-negative and $-1$ otherwise, and $\hat{y}$ is an estimate of the label for the input sample $\mathbf{x}$.
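As a concrete illustration of Eqs. (2.1)-(2.7), the following minimal sketch (written for this summary, not taken from [65] or [66]; the synthetic data are hypothetical) computes the FLDA direction from the scatter matrices and predicts labels with the sign function.

    import numpy as np

    def flda_fit(X_pos, X_neg):
        """Fisher discriminant: w maximizes between-class over within-class scatter."""
        m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
        S_w = ((X_pos - m_pos).T @ (X_pos - m_pos) +
               (X_neg - m_neg).T @ (X_neg - m_neg))          # within-class scatter
        w = np.linalg.solve(S_w, m_pos - m_neg)              # closed-form Fisher direction
        b = -0.5 * w @ (m_pos + m_neg)                       # threshold midway between projected class means
        return w, b

    def flda_predict(w, b, X):
        return np.sign(X @ w + b)                            # Eq. (2.7): sign of the separating function

    rng = np.random.default_rng(1)
    X_pos = rng.normal(loc=+1.0, size=(50, 3))
    X_neg = rng.normal(loc=-1.0, size=(50, 3))
    w, b = flda_fit(X_pos, X_neg)
    print(flda_predict(w, b, np.vstack([X_pos[:3], X_neg[:3]])))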
Logistic Regression
Logistic classification is based on a model called logistic regression (LR), which makes it possible to predict the probability of an outcome [67]. This method will be used in Chapter 5.

In a logistic classification model, an event is the membership of a feature vector in one of the two classes concerned. The technique defines a variable that lies in the range [0, 1] as a function of the input features, so it can be interpreted as a probability. For an input feature vector $\mathbf{x}$, the LR (logistic regression) model can be written as
$$ P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp\!\big(-(\mathbf{w}^{T}\mathbf{x} + b)\big)}. $$
In linear regression, the model is typically fitted by the technique of least squares: the regression function is chosen to minimize the sum of squared distances between the observed and predicted values of the dependent variable. The parameters of the LR model, by contrast, are estimated using the maximum likelihood method [68]: the coefficients that make the observed results "most likely" are selected. The label of an input sample $\mathbf{x}$ is then predicted by the fitted LR model as
$$ \hat{y} = \begin{cases} +1, & P(y = 1 \mid \mathbf{x}) \geq 0.5, \\ -1, & \text{otherwise}. \end{cases} \qquad (2.9) $$
Naïve Bayes Classifier
Part of chapter 5 will be about the naïve Bayes classifier, so it is
important to give the reader some background on this. A naïve Bayes classifier
(NBC) [69] is a simple probabilistic classifier that applies the Bayes theorem
with strong (naïve) independence assumptions. Let $P(c_j)$ be the probability of occurrence of class $c_j$; this is known as the a priori (prior) probability. The a posteriori probability that an observed sample $\mathbf{x}$ comes from class $c_j$ is expressed as $P(c_j \mid \mathbf{x})$. According to the Bayes rule [70],
$$ P(c_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid c_j)\, P(c_j)}{p(\mathbf{x})}, $$
where $p(\mathbf{x})$ is the unconditional probability density function (PDF) of $\mathbf{x}$, and $p(\mathbf{x} \mid c_j)$ is the likelihood function of class $c_j$.

The NBC simply assumes that the features are independent given the class $c_j$, so that
$$ P(c_j \mid \mathbf{x}) = \frac{P(c_j) \prod_{k=1}^{n} p(x_k \mid c_j)}{p(\mathbf{x})}, $$
where $p(\mathbf{x})$ is a scaling factor that depends only on $\mathbf{x}$, i.e. a constant once the values of the feature vector are known. Models of this form are much more manageable, since they factor into the class prior $P(c_j)$ and the independent probability distributions $p(x_k \mid c_j)$, whose model parameters can be estimated from the training samples. The decision function of the Bayes classifier is given as
$$ \hat{y} = \arg\max_{c_j}\ P(c_j) \prod_{k=1}^{n} p(x_k \mid c_j). $$
Because the class-conditional feature distributions are decoupled, each distribution can be treated as a one-dimensional distribution in its own right. This partially avoids some of the problems caused by the curse of dimensionality, such as the need for data sets that grow rapidly with the number of features.
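A minimal Gaussian naïve Bayes sketch follows, assuming a normal class-conditional density for each continuous feature (an illustrative modelling choice, not necessarily the density model used elsewhere in this thesis).

    import numpy as np

    def nb_fit(X, y):
        """Estimate class priors and per-feature Gaussian parameters for each class."""
        params = {}
        for c in np.unique(y):
            Xc = X[y == c]
            params[c] = (len(Xc) / len(X),           # prior P(c)
                         Xc.mean(axis=0),            # per-feature means
                         Xc.var(axis=0) + 1e-9)      # per-feature variances (smoothed)
        return params

    def nb_predict(params, X):
        """Pick the class maximizing log P(c) + sum_k log p(x_k | c)."""
        classes = list(params.keys())
        scores = []
        for c in classes:
            prior, mu, var = params[c]
            log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
            scores.append(np.log(prior) + log_lik)
        return np.array(classes)[np.argmax(np.column_stack(scores), axis=1)]

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
    y = np.array([0] * 40 + [1] * 40)
    print(nb_predict(nb_fit(X, y), X[:5]))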
2.3.2 Artificial Neural Networks
Research on Artificial Neural Networks (ANN) is motivated by the observation that the human brain processes information in a very different way from an ordinary digital computer: its operation is complex, highly parallel and nonlinear. Artificial Neural Networks offer useful properties and capabilities, including nonlinear input-output mapping, adaptivity, evidential response, contextual information, fault tolerance, large-scale hardware implementability, uniformity of analysis and design and, finally, neurobiological analogy [71].
Single-layer Perceptrons
The simplest form of a layered Artificial Neural Network (ANN) is the single-layer perceptron (SLP), with an input layer of source nodes that projects onto an output layer of neurons. Single-layer perceptrons can only classify linearly separable patterns [71]. Figure 2.1 illustrates the structure of an SLP with one neuron in the output layer. Such an SLP, built on a single output neuron, is limited to performing binary classification; the label for an input sample $\mathbf{x}$ is predicted by $\hat{y} = \operatorname{sgn}(\mathbf{w}^{T}\mathbf{x})$, where $\mathbf{w}$ is the weight vector of the output neuron.
Figure 2.1: The structure of an SLP with one neuron in the output layer.
Multi-layer Perceptrons
A multi-layer perceptron (MLP) [71], with one input layer, one or more hidden layers, and an output layer, has the following distinguishing features:

• Each neuron in the network includes a non-linear activation function, commonly a smooth sigmoidal nonlinearity such as the logistic function [71].

• The hidden layers enable the network to learn complex tasks by progressively extracting the meaningful features from the input feature vectors while discarding the insignificant ones.

• The network exhibits a high degree of connectivity through its synapses; any change in the number or weights of the synaptic connections changes the connectivity of the network.
Figure 2.2: The overall structure of an MLP with one hidden layer.
The overall structure of an MLP with one hidden layer is shown in Fig. 2.2.
The back-propagation algorithm [71] can be employed to train an MLP by
minimizing the average squared error energy over all the training samples.
Radial Basis Function Networks
To perform a complex pattern classification task, radial basis function
(RBF) networks [71] transform the classification task into a high-dimensional
space in a nonlinear manner, by involving the following three layers:

• Input layer: made up of sensory units, which connect the network to its environment.

• Hidden layer: applies a nonlinear transformation from the input space to the hidden space, like a kernel function (see Section 2.3.3). One or more layers of hidden neurons are involved in the network, which acquires its complexity by means of these hidden neurons.

• Output layer: supplies the response of the network to the input pattern, and defines the separating function in the transformed hidden space.
Figure 2.3: The overall structure of the RBF networks.
The nonlinear transformation $\boldsymbol{\varphi}$ is made up of $m_1$ real-valued functions $\{\varphi_j(\mathbf{x})\}_{j=1}^{m_1}$, where $m_1$ denotes the number of neurons in the hidden layer, so that the mapping takes the form
$$ \boldsymbol{\varphi}(\mathbf{x}) = \big[\varphi_1(\mathbf{x}), \ldots, \varphi_{m_1}(\mathbf{x})\big]^{T}. \qquad (2.14) $$
The structure of the RBF network is shown in Fig. 2.3. In the transformed feature space, the RBF network determines the separating function
$$ f(\mathbf{x}) = \mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}) + b \qquad (2.16) $$
by minimizing a regularized cost function, where $\mathbf{w}$ denotes the weight vector and $b$ the bias of the separating function in the transformed feature space, $\lambda$ denotes the regularization parameter, and $D$ denotes a linear differential operator. As the regularization parameter $\lambda$ approaches zero, the optimal value of $\mathbf{w}$ approaches the pseudo-inverse solution of the over-determined least-squares data-fitting problem [72], in which the column vector of labels of all the training samples is fitted by the hidden-layer outputs. The label for an input sample $\mathbf{x}$ is then predicted from the sign of the separating function, as in Eq. (2.7).
Self-organizing Maps
Self-organizing maps (SOM) [73] map high-dimensional data sets onto one- or two-dimensional lattices; it is possible, but not common, to use higher-dimensional lattices. SOMs are made up of two layers of neurons: a one-dimensional input layer and a two-dimensional (2D) competitive layer organized as a 2D lattice of neurons. Each neuron $j$ in the competitive layer holds a weight vector $\mathbf{w}_j$ with the same dimensionality as the input space.

Training the competitive layer moves the competitive neurons towards the input samples. If the competitive layer includes $m$ neurons, the best-matching neuron for a given sample $\mathbf{x}$ is chosen as the winner; its index satisfies
$$ i^{*}(\mathbf{x}) = \arg\min_{j} \|\mathbf{x} - \mathbf{w}_j\|, \qquad j = 1, \ldots, m. \qquad (2.19) $$
At the $t$-th iteration, the weights of the winning neuron, as well as those of all the other neurons in the competitive layer, are adapted to suit the given sample $\mathbf{x}$ according to the rule
$$ \mathbf{w}_j(t+1) = \mathbf{w}_j(t) + \eta(t)\, h_{j,i^{*}}(t)\, \big(\mathbf{x} - \mathbf{w}_j(t)\big), \qquad (2.20) $$
where $h_{j,i^{*}}(t)$ depends on the lateral distance between the winning neuron $i^{*}$ and the excited neuron $j$ in the 2D lattice, and the remaining training parameters (such as the learning rate and the neighbourhood width) are set by the user. The magnitude of the change decreases with time and with distance from the winning neuron.
2.3.3 Kernel-based Classifiers
Aizerman was the first to exploit kernel functions in machine learning as inner products in a corresponding feature space. Kernel techniques embed the data in a suitable feature space and then use algorithms based on linear algebra, geometry and statistics to discover patterns in that data. A kernel technique is made up of two components: a module which carries out the mapping into an empirical feature space, and a learning algorithm applied in order to detect linear models in that space. A kernel function is a computational shortcut that allows linear models in high-dimensional empirical feature spaces to be represented efficiently, while retaining adequate representational power. Research [74] focuses on four main aspects of kernel-based classifiers:

• Samples in the original feature space are embedded into an empirical feature space.

• Linear relations are sought among the embedded feature vectors.

• The algorithms are implemented in such a way that only the pairwise inner products of the embedded feature vectors are needed, not their coordinates.

• A kernel function allows the pairwise inner products to be computed directly from the feature vectors.
Kernel Functions
Kernel functions provide a powerful and fundamental way to discover nonlinear relations using well-understood linear algorithms in a suitable feature space. In kernel-based classification the kernel matrix acts as an information bottleneck: all the information the learning algorithm uses about the data must be extracted from the kernel matrix.
Inner Product Space A vector space $V$ over the reals $\mathbb{R}$ is an inner product space if there exists a real-valued symmetric bilinear (linear in each argument) map $\langle \cdot, \cdot \rangle$ that satisfies
$$ \langle \mathbf{x}, \mathbf{x} \rangle \geq 0. \qquad (2.24) $$
The bilinear map is known as the inner, dot or scalar product. In the real feature space $\mathbb{R}^{n}$, the standard inner product between two vectors $\mathbf{x}$ and $\mathbf{z}$ is given by
$$ \langle \mathbf{x}, \mathbf{z} \rangle = \sum_{k=1}^{n} x_k z_k. \qquad (2.25) $$
An inner product space is sometimes referred to as a Hilbert space, though most researchers require the additional properties of completeness and separability, as well as sometimes requiring that the dimension be infinite [74].
Gram Matrix Given a set of feature vectors $\{\mathbf{x}_1, \ldots, \mathbf{x}_l\}$, the Gram matrix is defined as the $l \times l$ matrix $G$ whose entries are $G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$. If a kernel function $k(\cdot, \cdot)$ is used to evaluate the inner products in the transformed feature space with nonlinear mapping $\boldsymbol{\varphi}$, the associated Gram matrix is referred to as the kernel matrix, denoted by $K$, with entries given by
$$ K_{ij} = \langle \boldsymbol{\varphi}(\mathbf{x}_i), \boldsymbol{\varphi}(\mathbf{x}_j) \rangle = k(\mathbf{x}_i, \mathbf{x}_j). \qquad (2.26) $$
Different kernel functions can be designed based on their closure properties [74].
Kernel Forms Three types of kernel function are used in this work, namely the Gaussian, Cauchy, and triangle kernels, defined as follows:

• Gaussian kernel (RBF kernel): $k(\mathbf{x}, \mathbf{z}) = \exp\!\big(-\|\mathbf{x} - \mathbf{z}\|^{2} / (2\sigma^{2})\big)$

• Cauchy kernel [75]: $k(\mathbf{x}, \mathbf{z}) = 1 \big/ \big(1 + \|\mathbf{x} - \mathbf{z}\|^{2} / \sigma^{2}\big)$

• Triangle kernel [75]: $k(\mathbf{x}, \mathbf{z}) = \max\!\big(0,\ 1 - \|\mathbf{x} - \mathbf{z}\| / \sigma\big)$

where $\sigma$ is the kernel width set by the user. A more versatile RBF kernel with a different kernel width for each feature can also be used, given as
$$ k(\mathbf{x}, \mathbf{z}) = \exp\!\left(-\sum_{k=1}^{n} \frac{(x_k - z_k)^{2}}{2\sigma_k^{2}}\right), $$
where $\sigma_1, \ldots, \sigma_n$ are the kernel widths for the individual features, set by the user.
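The kernels above, and the kernel matrix of Eq. (2.26), can be written down in a few lines; the sketch below uses the forms stated in this section (the exact parameterisation of the Cauchy and triangle kernels in [75] may differ slightly, so these definitions should be read as assumptions).

    import numpy as np

    def gaussian_kernel(x, z, sigma=1.0):
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

    def cauchy_kernel(x, z, sigma=1.0):
        return 1.0 / (1.0 + np.sum((x - z) ** 2) / sigma ** 2)

    def triangle_kernel(x, z, sigma=1.0):
        return max(0.0, 1.0 - np.linalg.norm(x - z) / sigma)

    def kernel_matrix(X, kernel, **kw):
        """Build the Gram (kernel) matrix K with K[i, j] = k(x_i, x_j), as in Eq. (2.26)."""
        l = len(X)
        K = np.empty((l, l))
        for i in range(l):
            for j in range(l):
                K[i, j] = kernel(X[i], X[j], **kw)
        return K

    X = np.random.default_rng(3).normal(size=(5, 2))
    print(kernel_matrix(X, gaussian_kernel, sigma=0.5).round(3))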
Kernel Fisher Discriminant Analysis
Kernel Fisher discriminant analysis (KFDA), formulated by Mika et al. [76], combines kernel functions with FLDA. By expanding the weight vector of the separating function into a linear combination of all training samples, the separating function in the kernel-defined feature space takes the form
$$ f(\mathbf{x}) = \sum_{i=1}^{l} \beta_i\, k(\mathbf{x}_i, \mathbf{x}) + b, $$
where $\beta_i$ denotes the expansion weights. KFDA determines the optimal separating function by maximizing the Fisher criterion [74],
$$ J(\boldsymbol{\beta}) = \frac{(\mu_{+} - \mu_{-})^{2}}{s_{+}^{2} + s_{-}^{2}}, $$
where $\mu_{+}$ indicates the mean of the function values over the positive samples, $\mu_{-}$ the mean over the negative samples, and $s_{+}^{2} + s_{-}^{2}$ the corresponding within-class variance. By combining Eq. (2.31) and Eq. (2.32), the optimal values of $\boldsymbol{\beta}$ and $b$ can be computed by solving the resulting generalized eigenvalue problem [62]. Eq. (2.7) then predicts the label for a given sample $\mathbf{x}$.
Support Vector Machines
SVMs [77, 78] set up a hyperplane as the decision surface, maximizing the margin of separation between the positive and negative samples in a suitable feature space; this is known as the maximal margin principle. Boser et al. [79] developed a kernel-based SVM by combining kernel functions with large-margin hyperplanes, enabling SVMs to solve nonlinear and non-separable problems in machine learning. Along with the original C-SVM learning technique [77], Schölkopf et al. [80] developed the v-SVM learning technique, which is very similar to C-SVM except for the form of the optimization risk. This section describes the hard-margin SVM and three soft-margin SVMs, namely the 1-norm C-SVM (L1-SVM), the 2-norm C-SVM (L2-SVM), and the v-SVM, which will be used in several places in this thesis.
Hard-margin SVM The hard-margin SVM is used for linearly separable cases and defines the separating function
$$ f(\mathbf{x}) = \mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}) + b. \qquad (2.33) $$
In the kernel-defined feature space, the following regularized optimization risk is minimized:
$$ \min_{\mathbf{w},\, b}\ \tfrac{1}{2}\|\mathbf{w}\|^{2} \quad \text{s.t.} \quad y_i\big(\mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}_i) + b\big) \geq 1, \quad i = 1, \ldots, l, \qquad (2.34) $$
where $\|\mathbf{w}\|$ denotes the norm in the transformed feature space. By introducing Lagrange multipliers $\alpha_i \geq 0$, this is equivalent to solving the following constrained quadratic programming (QP) problem:
$$ \max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{l} \alpha_i - \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i \alpha_j\, y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \quad \sum_{i=1}^{l} \alpha_i y_i = 0,\ \ \alpha_i \geq 0, \qquad (2.35) $$
and
$$ \mathbf{w} = \sum_{i=1}^{l} \alpha_i\, y_i\, \boldsymbol{\varphi}(\mathbf{x}_i). \qquad (2.36) $$
Soft-margin C-SVM The C-SVM is an ordinary SVM with a soft margin for non-separable cases. It introduces the margin slack vector $\boldsymbol{\xi} = (\xi_1, \ldots, \xi_l)$, which allows samples to violate the margin inequality
$$ y_i\big(\mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}_i) + b\big) \geq 1 - \xi_i, \qquad \xi_i \geq 0, \qquad (2.37) $$
with the soft-margin loss $\sum_{i} \xi_i$. By involving the 1-norm of the margin slack vector, the L1-SVM determines the separating function in Eq. (2.33) by minimizing the following regularized optimization risk:
$$ \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|^{2} + C_{\mathrm{SVM}} \sum_{i=1}^{l} \xi_i, $$
where $C_{\mathrm{SVM}}$ is the positive regularization parameter defined by the user. By introducing the Lagrange multipliers $\alpha_i$, this corresponds to solving the following constrained QP problem:
$$ \max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{l} \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \quad \sum_{i} \alpha_i y_i = 0,\ \ 0 \leq \alpha_i \leq C_{\mathrm{SVM}}. \qquad (2.40) $$
By involving the 2-norm of the margin slack vector, the L2-SVM determines the separating function in Eq. (2.33) by minimizing the following regularized optimization risk:
$$ \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|^{2} + \frac{C_{\mathrm{SVM}}}{2} \sum_{i=1}^{l} \xi_i^{2}. \qquad (2.42) $$
By introducing the Lagrange multipliers, this is equivalent to solving the following constrained QP problem:
$$ \max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{l} \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j \Big( k(\mathbf{x}_i, \mathbf{x}_j) + \frac{\delta_{ij}}{C_{\mathrm{SVM}}} \Big) \quad \text{s.t.} \quad \sum_{i} \alpha_i y_i = 0,\ \ \alpha_i \geq 0, \qquad (2.43) $$
where $\delta_{ij}$ represents the Kronecker delta, which is 1 when $i = j$ and 0 in any other case. The L2-SVM is therefore a special instance of the hard-margin SVM with the modified kernel matrix
$$ \tilde{K} = K + \frac{1}{C_{\mathrm{SVM}}}\, I, \qquad (2.44) $$
where $I$ is the identity matrix. Eq. (2.36) is satisfied for both the L1-SVM and the L2-SVM.
Soft-margin v-SVM The v-SVM is an SVM with a soft margin for non-separable cases. It employs the same margin slack vector $\boldsymbol{\xi}$ as in the C-SVM learning procedure, but with a different soft-margin loss, in which the margin width $\rho$ appears explicitly and varies over positive values; when $\rho$ is fixed at 1, the v-SVM reduces to the C-SVM. The v-SVM identifies the separating function by minimizing the adjusted optimization risk
$$ \min_{\mathbf{w},\, b,\, \boldsymbol{\xi},\, \rho}\ \tfrac{1}{2}\|\mathbf{w}\|^{2} - \nu \rho + \frac{1}{l}\sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i\big(\mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}_i) + b\big) \geq \rho - \xi_i,\ \ \xi_i \geq 0,\ \ \rho \geq 0, $$
where the user sets $\nu$ as the regularization parameter, which varies in [0, 1]. By introducing the Lagrange multipliers $\alpha_i$, this is equivalent to solving the following constrained QP problem:
$$ \max_{\boldsymbol{\alpha}}\ -\tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \quad \sum_{i} \alpha_i y_i = 0,\ \ 0 \leq \alpha_i \leq \tfrac{1}{l},\ \ \sum_{i} \alpha_i \geq \nu. \qquad (2.47) $$
Eq. (2.36) is also satisfied for the v-SVM.
Karush-Kuhn-Tucker Optimality Conditions Let $\boldsymbol{\alpha}^{*}$ denote the optimal solution of the constrained QP problems in Eq. (2.35), Eq. (2.40), Eq. (2.43) and Eq. (2.47), and let $\mathbf{w}^{*}$ and $b^{*}$ denote the optimal weights and bias of the separating function in Eq. (2.33), respectively, calculated with $\boldsymbol{\alpha}^{*}$. The following Karush-Kuhn-Tucker (KKT) optimality conditions must be satisfied; they differ slightly for the different types of SVM [78,80]:

• Hard-margin SVM: $\alpha_i^{*}\big[\,y_i f^{*}(\mathbf{x}_i) - 1\,\big] = 0$  (2.48)

• L2-SVM: $\alpha_i^{*}\big[\,y_i f^{*}(\mathbf{x}_i) - 1 + \xi_i^{*}\,\big] = 0$  (2.49)

• L1-SVM: $\alpha_i^{*}\big[\,y_i f^{*}(\mathbf{x}_i) - 1 + \xi_i^{*}\,\big] = 0$  (2.50), $\quad (C_{\mathrm{SVM}} - \alpha_i^{*})\,\xi_i^{*} = 0$  (2.51)

• v-SVM: $\alpha_i^{*}\big[\,y_i f^{*}(\mathbf{x}_i) - \rho^{*} + \xi_i^{*}\,\big] = 0$  (2.52), $\quad (\tfrac{1}{l} - \alpha_i^{*})\,\xi_i^{*} = 0$  (2.53)
Support Vectors The KKT optimality conditions above define the support vectors (SVs). SVs are training samples with non-zero $\alpha_i^{*}$. For the L1-SVM and the v-SVM, there are two main types of SV: margin and non-margin SVs. Margin SVs are training samples whose $\alpha_i^{*}$ is non-zero but less than $C_{\mathrm{SVM}}$ for the L1-SVM, or less than $1/l$ for the v-SVM; they lie exactly on the margin, as shown in Figure 2.4. Non-margin SVs, on the other hand, are training samples whose $\alpha_i^{*}$ is exactly equal to $C_{\mathrm{SVM}}$ for the L1-SVM, or equal to $1/l$ for the v-SVM; they lie inside the margin, either on the correct or on the wrong side of the decision surface, as in Figure 2.4. Based on the KKT conditions, the SVs of the hard-margin SVM and the L2-SVM, as well as the margin SVs of the L1-SVM, satisfy $y_i f^{*}(\mathbf{x}_i) = 1$, while the margin SVs of the v-SVM satisfy $y_i f^{*}(\mathbf{x}_i) = \rho^{*}$, where $\rho^{*}$ is the optimal value of the margin width computed with $\boldsymbol{\alpha}^{*}$.
Separating Function The optimal value of the bias of the separating function for the C-SVM and the v-SVM can be derived by averaging over two sets of margin SVs, $S_{+}$ and $S_{-}$, of the same size but with labels $+1$ and $-1$ respectively. By combining Eq. (2.36) with Eq. (2.33), the optimal separating function is given by
$$ f^{*}(\mathbf{x}) = \sum_{i=1}^{l} \alpha_i^{*}\, y_i\, k(\mathbf{x}_i, \mathbf{x}) + b^{*}. $$
The label for an input sample $\mathbf{x}$ is then predicted by Eq. (2.7). The optimal value of the margin width $\rho^{*}$ for the v-SVM can be calculated from the margin SVs as $\rho^{*} = y_i f^{*}(\mathbf{x}_i)$.

Figure 2.4: Illustration of support vectors for linear, non-separable patterns.

The optimal value of the slack vector can be calculated based on the KKT conditions:

• C-SVM: $\xi_i^{*} = \max\big(0,\ 1 - y_i f^{*}(\mathbf{x}_i)\big)$  (2.58)

• v-SVM: $\xi_i^{*} = \max\big(0,\ \rho^{*} - y_i f^{*}(\mathbf{x}_i)\big)$  (2.59)
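In practice the QP problems above are rarely solved by hand. The following sketch trains a soft-margin C-SVM and a v-SVM with an off-the-shelf solver (scikit-learn's SVC and NuSVC, which are based on LIBSVM); it is an illustration on synthetic data, not the configuration used in the experiments of this thesis. The support vectors reported are the training samples with non-zero $\alpha_i^{*}$.

    import numpy as np
    from sklearn.svm import SVC, NuSVC

    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
    y = np.array([-1] * 50 + [+1] * 50)

    # L1 soft-margin C-SVM with a Gaussian (RBF) kernel; C is the regularization parameter
    c_svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

    # v-SVM: nu in (0, 1] bounds the fraction of margin errors and of support vectors
    v_svm = NuSVC(kernel="rbf", nu=0.3, gamma=0.5).fit(X, y)

    print("C-SVM support vectors:", len(c_svm.support_))   # samples with non-zero alpha
    print("v-SVM support vectors:", len(v_svm.support_))
    print("predictions:", c_svm.predict(X[:5]))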
2.3.4 Proximal Classifiers
Proximal classifiers solve the binary classification task by seeking two
proximal planes in a corresponding feature space, instead of one separating
plane. Bradley and Mangasarian [81] first addressed the topic of multi-plane
learning by proposing the unsupervised k-plane clustering method in 2000.
Later, a series of studies on multi-plane learning was developed for
supervised learning, such as the proximal SVM (PSVM) [82, 83] and its
corresponding statistical interpretation [84], parallelized algorithms for
classification with the incremental PSVM [85], a fuzzy extension of the PSVM
[86], and the multi-surface PSVM (MPSVM) [87].
Proximal Support Vector Machines
The PSVMs seek two parallel proximal planes that are pushed as far apart as possible; samples are classified by assigning them to the closest plane [82,83]. To maintain the parallelism condition and bound the samples according to the maximal margin rule, the following proximal planes are employed for the linear PSVMs:
$$ \mathbf{w}^{T}\mathbf{x} - b = +1, \qquad (2.60) $$
$$ \mathbf{w}^{T}\mathbf{x} - b = -1. \qquad (2.61) $$
The optimal values of $\mathbf{w}$ and $b$ are obtained by minimizing the following regularized optimization risk:
$$ \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\ \frac{C}{2}\,\|\boldsymbol{\xi}\|^{2} + \frac{1}{2}\big(\|\mathbf{w}\|^{2} + b^{2}\big) \qquad (2.62) $$
subject to the equality constraint
$$ y_i\big(\mathbf{w}^{T}\mathbf{x}_i - b\big) + \xi_i = 1, \qquad i = 1, \ldots, l, \qquad (2.63) $$
where $\boldsymbol{\xi}$ denotes the error variables (see also the slack vector in Section 2.3.3), and $C$ is the non-negative regularization parameter set by the user. Substituting $\boldsymbol{\xi}$ in terms of $\mathbf{w}$ and $b$ using the linear constraint in Eq. (2.63), the constrained optimization problem in Eq. (2.62) reduces to an unconstrained minimization problem, which can be solved by setting the gradients with respect to $\mathbf{w}$ and $b$ to zero.
For the nonlinear PSVMs, the proximal planes are defined in the kernel-determined feature space by expanding each plane in terms of the training samples, $\sum_{i=1}^{l} u_i\, k(\mathbf{x}_i, \mathbf{x}) - b = \pm 1$, where the $u_i$ are Lagrangian (expansion) multipliers. The constrained optimization problem to be solved takes the same form as Eq. (2.62), with the equality constraint expressed through the kernel expansion (Eq. (2.67)). Compared with the L2-SVM, the key idea of the PSVM is a simple but fundamental change: the inequality constraints associated with Eq. (2.42) are replaced by the equality constraints in Eq. (2.63) and Eq. (2.67).
Multi-surface Proximal Support Vector Machines
MPSVMs drop the parallelism condition on the proximal planes of the PSVMs and require each plane to be as close as possible to one of the two classes of training samples and as far as possible from the other; samples are classified by assigning them to the closest plane [87]. The two proximal planes in the original feature space are denoted by
$$ \mathbf{w}_1^{T}\mathbf{x} - b_1 = 0, \qquad (2.68) $$
$$ \mathbf{w}_2^{T}\mathbf{x} - b_2 = 0, \qquad (2.69) $$
where $\mathbf{w}$ is the weight vector (direction) and $b$ is the bias of each proximal plane; the first and the second plane are referred to by the subscripts 1 and 2 hereinafter.
In the kernel-determined feature space, kernel functions are introduced to incorporate nonlinearity: the direction vector of each hyperplane is expanded into a linear combination of all the training samples, so that the two proximal planes become
$$ \sum_{i=1}^{l} u_{1i}\, k(\mathbf{x}_i, \mathbf{x}) - b_1 = 0, \qquad (2.70) $$
$$ \sum_{i=1}^{l} u_{2i}\, k(\mathbf{x}_i, \mathbf{x}) - b_2 = 0, \qquad (2.71) $$
where $u_{1i}$ and $u_{2i}$ are expansion weights, forming two column vectors $\mathbf{u}_1$ and $\mathbf{u}_2$ respectively. To obtain the two planes, the following objective functions, with the numerator parts given in "sum of squares" form, are maximized (Eqs. (2.72) and (2.73)), so that each plane is kept close to its own class and far from the other:
A Tikhonov regularization term [88], which is often used to regularize least
squares and mathematical programming problems, is employed to improve the
classification performance of MPSVMs [87].
For linear classification, by incorporating Eq. (2.68) and Eq. (2.69), as well as the regularization term, into Eq. (2.72) and Eq. (2.73), and letting $\mathbf{z}_1 = [\mathbf{w}_1;\, b_1]$ and $\mathbf{z}_2 = [\mathbf{w}_2;\, b_2]$, the corresponding regularized quotient objective functions are maximized. The user sets the non-negative regularization parameter, denoted here by $\delta$; the matrix whose rows are the samples from the negative class and the matrix whose rows are the samples from the positive class enter the two quotients, and $\mathbf{e}$ represents a column vector with all entries equal to 1. Solving two generalized eigenvalue problems [87] allows the user to compute the optimal values of $\mathbf{w}_1$, $b_1$, $\mathbf{w}_2$ and $b_2$.
For nonlinear classification, by incorporating Eq. (2.70) and Eq. (2.71), as well as the regularization term, into Eq. (2.72) and Eq. (2.73), analogous objective functions are derived and maximized. Here the kernel matrix between the positive-class samples and all the training samples plays the role of the positive-class data matrix, while the kernel matrix between the negative-class samples and all the training samples plays the role of the negative-class data matrix. Solving two generalized eigenvalue problems [87] allows the user to compute the optimal values of $\mathbf{u}_1$, $b_1$, $\mathbf{u}_2$ and $b_2$.
2.3.5 Prototype Classifiers
Prototype classifiers are quite distinct from all the other techniques discussed so far in this chapter. They represent each class by several samples (prototypes) and then assign the label of a new sample according to its nearest prototype. By contrast, SVMs seek a single separating hyperplane, PSVMs seek two proximal hyperplanes, and ANNs rely on neurons as the basic processing units of the model.
k- Nearest Neighbours
The method of k-Nearest Neighbours (KNN) [89] lies at one extreme of the prototype-classifier scale: every training sample serves as a prototype, so the number of prototypes equals the number of training samples. Given a query sample, the k prototypes closest to it (those with the smallest Euclidean distances) are found, and the classification is decided by a majority vote among the labels of these k prototypes. We consider it important to mention the method and give this brief introduction, as it will be used later in Chapter 5.
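A minimal k-NN sketch with Euclidean distance and majority voting follows (an illustration only, on synthetic data):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_query, k=3):
        """Majority vote among the k training samples closest to the query."""
        dists = np.linalg.norm(X_train - x_query, axis=1)    # Euclidean distances
        nearest = np.argsort(dists)[:k]                      # indices of the k nearest prototypes
        return Counter(y_train[nearest]).most_common(1)[0][0]

    rng = np.random.default_rng(5)
    X_train = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
    y_train = np.array([0] * 30 + [1] * 30)
    print(knn_predict(X_train, y_train, np.array([3.5, 3.5]), k=5))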
Minimum Distance Classifier
The minimum distance classifier (MDC) [90,91,92] lies at the other extreme of the prototype-classifier scale: there is only one prototype for each class, namely the class centre (or mean), so the number of prototypes equals the number of classes. The distance between the query sample $\mathbf{x}$ and each prototype is computed, and the label of the nearest prototype, i.e. the one at the minimum distance from $\mathbf{x}$, is chosen as the label of $\mathbf{x}$.
Learning Vector Quantization
Learning vector quantization (LVQ) [93] is a prototype-based algorithm for supervised classification and is a particular instance of an ANN. LVQ implements a winner-take-all, Hebbian-learning-based approach [93]. An LVQ network has a first competitive layer of competitive neurons (prototypes) and a second linear layer of output neurons (classes).

The classes of the competitive layer are mapped onto the target object classes by the linear layer: the classes represented by the competitive layer are referred to as subclasses, each associated with a prototype, while the classes of the linear layer are the object classes. A practical benefit of LVQ is that it produces prototypes that are accessible and easy to interpret.
Clustering-based Prototypes
Clustering [54] categorizes objects into different groups; more precisely, it partitions a data set into subsets (clusters) so that the members of each subset share some common trait according to a suitable distance measure. Clustering algorithms are unsupervised techniques that can be used to generate prototypes [94]: there are as many prototypes as clusters, since each prototype is defined by the centre of a cluster.
2.4 Optimization
Research by Bennett and Parrado-Hernandez [95] recently revealed the
synergistic link between the fields of machine learning and mathematical
programming. They note:
“Optimization lies at the heart of machine learning. Most machine
learning problems reduce to optimization problems. Consider the
machine learning analyst in action solving a problem for some set of
data. The modeller formulates the problem by selecting an appropriate
family of models and massages the data into a format amenable to
modelling. Then the model is typically trained by solving a core
optimization problem that optimizes the variables or parameters of the
model with respect to the selected loss function and possibly some
regularization function. In the process of model selection and
validation, the core optimization problem may be solved many times.
The research area of mathematical programming theory intersects with
machine learning through these core optimization problems.”
Examples of machine learning models with existing optimization methods
include QP in SVM [77,78], semi-definite programming (SDP) in model
selection and hypothesis selection [96,97,98], and dynamic programming in
lightest derivation problems [99]. The actual optimization methods used in this
work are briefly described in the following sections.
2.4.1 Genetic Algorithm
A genetic algorithm (GA) [100] is a search method that provides exact or approximate solutions to optimization and search problems. A GA maintains a set of candidate solutions (a population) of approximations (individuals) to a solution, evolved according to the well-known biological "survival of the fittest" rule (see the tutorial by Chipperfield et al. [101]). Each individual is represented as a string, or chromosome, over some alphabet, of which the most commonly used is the binary alphabet {0,1}. An objective function and a fitness function are used to evaluate the performance achieved by each individual.

The fitter individuals have a better chance of being recombined by crossover, with a probability $P_c$, to give rise to the following generation; crossover exchanges genetic material between pairs, or larger groups, of individuals. A further genetic operator, mutation, is then applied to the new individuals with a low probability $P_m$, for example by regenerating a randomly chosen part of the chromosome, which helps the search move away from poor local optima.

Mutation guarantees that the probability of reaching any subspace of the problem space is never zero. Selection, crossover and mutation may make the sizes of the old and new populations differ by a fraction, which we call the generation gap. New individuals with high fitness values are brought into the new population in order to keep the population size constant, and the fitter individuals occupy the remaining positions. After a pre-determined number of generations the GA terminates; the quality and fitness of the population members may then be examined, and if no acceptable solution has been found the GA can be reset or a fresh search started.
2.4.2 Gradient Descent
Gradient descent (GD) is an optimization algorithm for differentiable functions of several variables. To find a local minimum of a function $E(\boldsymbol{\theta})$ of the parameter vector $\boldsymbol{\theta}$ using GD, steps are taken in the direction of the negative of the gradient (or of an approximate gradient) of the function at the current point. To perform GD, the parameter vector is first initialized to some value $\boldsymbol{\theta}(0)$, and each component of $\boldsymbol{\theta}$ is then updated at the $t$-th iteration according to the rule
$$ \theta_k(t+1) = \theta_k(t) - \eta\, \frac{\partial E}{\partial \theta_k}\bigg|_{\boldsymbol{\theta}(t)}, \qquad k = 1, \ldots, m, $$
where $m$ is the number of components of the vector $\boldsymbol{\theta}$, and the user-defined parameters, such as the learning rate $\eta$, control the speed of convergence.
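A minimal sketch of this update rule, applied to a simple quadratic error function (the learning rate value here is hypothetical and chosen only for illustration):

    import numpy as np

    def gradient_descent(grad, theta0, eta=0.1, n_iter=100):
        """Repeatedly step in the direction of the negative gradient."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(n_iter):
            theta = theta - eta * grad(theta)    # theta(t+1) = theta(t) - eta * dE/dtheta
        return theta

    # E(theta) = ||theta - [3, -2]||^2 has its minimum at [3, -2]
    grad_E = lambda theta: 2.0 * (theta - np.array([3.0, -2.0]))
    print(gradient_descent(grad_E, theta0=[0.0, 0.0]))      # converges towards [3, -2]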
2.5 Feature Selection
Classification uses labelled samples, each represented by a vector of numerical or nominal features, to learn a model that can assign objects to a definite set of specific categories. The feature set may inadvertently include some irrelevant features, which can negatively affect the precision of the classifier and should therefore be eliminated. Moreover, using fewer features makes the whole process cheaper and the resulting classification model more accessible and clear. Different methods of feature selection can be employed, such as sequential backward selection [89, 102], GA [46, 47, 102, 103, 104], and recursive feature elimination.
2.5.1 Genetic Algorithm
GA is a common and accessible optimization technique; it can be used as a stochastic global search technique that imitates the biological "survival of the fittest" rule (see Section 2.4.1). When GA is used for feature selection, each feature in the candidate feature set is encoded as a binary gene. Each candidate feature subset is represented as an n-bit binary string, where n denotes the total number of candidate features: an n-bit individual corresponds to an n-dimensional binary feature vector, in which 1 denotes inclusion and 0 denotes removal of the corresponding feature. For example, to select a subset from a total of six features, ordered as x1, ..., x6, the 6-bit individual 010001 denotes the feature combination {x2, x6}. Two schemes can be employed for GA-based feature selection (a small illustrative sketch follows the list below):

• Independent selection: searching for the optimal combination of features with an objective function that measures the separability of the data, such as the alignment of the kernel with a target function, a class separability measure, or an ordinary distance computed in the original feature space. This selection scheme is independent, since it is free of any classifier.

• Wrapper-type selection: searching for the optimal combination of features, with the objective function set as the estimated risk of a chosen classifier, such as the leave-one-out (LOO) error or the cross-validation error.
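The following sketch illustrates the binary encoding and a wrapper-style fitness evaluation under simplifying assumptions: single-point crossover, bit-flip mutation, and the cross-validated accuracy of a hypothetical base classifier as the fitness. It is not the GA configuration used in the experiments of this thesis.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def fitness(individual, X, y):
        """Wrapper-type fitness: cross-validated accuracy on the selected features."""
        if individual.sum() == 0:
            return 0.0
        return cross_val_score(LogisticRegression(), X[:, individual == 1], y, cv=3).mean()

    def ga_select(X, y, pop_size=20, n_gen=15, p_mut=0.05, rng=np.random.default_rng(6)):
        n = X.shape[1]
        pop = rng.integers(0, 2, size=(pop_size, n))               # random initial population
        for _ in range(n_gen):
            fit = np.array([fitness(ind, X, y) for ind in pop])
            parents = pop[np.argsort(fit)[-pop_size // 2:]]        # keep the fitter half
            children = []
            while len(children) < pop_size - len(parents):
                a, b = parents[rng.integers(len(parents), size=2)]
                cut = rng.integers(1, n)                           # single-point crossover
                child = np.concatenate([a[:cut], b[cut:]])
                child ^= (rng.random(n) < p_mut).astype(child.dtype)   # bit-flip mutation
                children.append(child)
            pop = np.vstack([parents, children])
        fit = np.array([fitness(ind, X, y) for ind in pop])
        return pop[np.argmax(fit)]                                  # best feature subset found

    rng = np.random.default_rng(7)
    X = rng.normal(size=(80, 6))
    y = (X[:, 1] + X[:, 5] > 0).astype(int)      # only features x2 and x6 carry signal
    print(ga_select(X, y))                        # ideally something like [0 1 0 0 0 1]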
2.5.2 Sequential Backward Selection
In Sequential Backward Selection (SBS), starting from the complete set of features, features are discarded one by one until a stopping condition is met. SBS works with a decreasing level of complexity, determined by the number of interactions among the feature variables, or, in other words, by the number of edges in the graphical representation of the model. SBS begins by designating the saturated model as the current model; the saturated model has the maximum complexity level, determined by the number of feature variables. At each step, SBS generates the set of candidate models whose complexity level is one lower than that of the current model, produced by eliminating single edges from the current model. Among these candidate models, the one whose removal causes the least deterioration compared with the current model is selected; it becomes the new current model, and the search continues. The search stops when none of the candidate models achieves an acceptably low deterioration, or when the complexity level of the current model reaches zero.
2.5.3 Recursive Feature Elimination
A large number of features harms the ability of most algorithms to learn predictive patterns, and RFE is one of the methods used to reduce the number of features. A good feature ranking criterion is not necessarily a good feature subset ranking criterion: criteria such as the change in the objective function caused by removing a feature, or the squared weight of a feature, estimate the effect of removing one feature at a time on the objective function, and they become very ineffective when several features must be removed at a time, which is necessary to obtain a small feature subset. This problem can be overcome by using the following iterative procedure, called Recursive Feature Elimination:

1. Train the classifier (optimize the weights with respect to the training objective).
2. Compute the ranking criterion for all features.
3. Remove the feature with the smallest ranking criterion.
The point of RFE is to start with all the features, determine and eliminate the least useful one, and keep iterating until some stopping condition is met. Variants of RFE differ in the stopping condition and in the way the feature to be removed is selected. Researchers have investigated RFE to determine how it can benefit studies of gene expression. RFE is not guaranteed to find the optimal feature subset; on the other hand, it reduces the complexity of feature selection by being "greedy": once a feature is selected for removal, it is never reintroduced. Most studies have found RFE able to select very good feature sets, but with 12,000 or more features to select from, and when the number of samples is large, RFE requires considerable computation time. Recursive feature elimination is extremely computationally expensive when only one least useful feature is removed during each iteration. For computational reasons, it may be more efficient to remove several features at a time, at the expense of possible classification performance degradation. In such a case, the method produces a feature subset ranking, as opposed to a feature ranking, and the feature subsets are nested, $S_1 \subset S_2 \subset \cdots$. If features are removed one at a time, there is also a corresponding feature ranking. However, the features that are top ranked (eliminated last) are not necessarily the ones that are individually most relevant; only taken together are the features of a subset optimal in some sense. It should be noted that RFE has no effect on correlation methods, since their ranking criterion is computed from information about a single feature.
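A minimal RFE sketch follows, using the squared weights of a linear classifier as the ranking criterion and removing one feature per iteration (a linear SVM is used here purely as an example of the base classifier; the synthetic data are illustrative).

    import numpy as np
    from sklearn.svm import LinearSVC

    def rfe_rank(X, y, base=LinearSVC):
        """Recursive feature elimination: repeatedly drop the feature with the smallest w_k^2."""
        remaining = list(range(X.shape[1]))
        ranking = []                                             # filled from least to most useful
        while remaining:
            clf = base(dual=False).fit(X[:, remaining], y)       # 1. train the classifier
            criterion = clf.coef_.ravel() ** 2                   # 2. ranking criterion w_k^2
            worst = int(np.argmin(criterion))                    # 3. smallest criterion
            ranking.append(remaining.pop(worst))
        return ranking[::-1]                                     # most useful (eliminated last) first

    rng = np.random.default_rng(8)
    X = rng.normal(size=(100, 5))
    y = np.where(2 * X[:, 0] - X[:, 3] > 0, 1, -1)
    print(rfe_rank(X, y))                                        # features 0 and 3 should rank near the top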
2.6 Summary
This chapter has surveyed the fundamentals of machine learning. As
this work focuses on pattern classification, several well-known classification
methods have been reviewed, covering the categories of Support Vector
Machines in detail, as this thesis focuses mainly on SVM and improving SVM
and TSVM. In addition, the chapter gives an overview and some details of
Linear Classifiers, Neural Networks, Kernel-based Classifiers, Proximal
Classifiers, and Prototype Classifiers. These classifiers will be used later in
validating the classification in order to present the modelling of automated
features and samples ranking compared to Support Vector Machines (SVM), as
one of the aims of this thesis is to improve SVM.
This chapter has also discussed the synergistic relations between the fields of machine learning and optimization, and it has presented GA as a representative method of evolutionary optimization. Moreover, the chapter has given an outline of feature selection, the different types of feature selection and the reasons for using it, which is directly related to the aims and objectives of this thesis: to investigate the effect on classification accuracy of reducing the number of features or samples in various available data sets, using both semi-supervised and supervised (SVM and TSVM) methods.
Chapter 3
Improving TSVM vs. SVM Accordance-Based Sample
Selection
This chapter introduces the experimental datasets used in this work and summarizes the commonly used measures for evaluating the separability of features. Section 3.1 provides a brief description of the datasets used. Section 3.2 presents the background and motivation for the accordance-based sample selection methods. Section 3.3 describes the experimental settings. Section 3.4 introduces the derivation of the new algorithm. Sections 3.5 and 3.6 present the results and the discussion of the results respectively. Lastly, Section 3.7 gives a brief summary of the chapter.
3.1 Experimental Datasets and Feature Analysis
We use two datasets, the public Wisconsin Diagnostic Breast Cancer dataset from the UCI Machine Learning Repository [105] and the in-house Nottingham Tenovus Primary Breast Cancer dataset [106], to evaluate the methods proposed in this work. Information on each dataset is listed in Table 3.1.
Datasets   No. of features   No. of samples   No. of classes   L     U     V
NTBC       25                1076             6                663   413   200/224
WDBC       14                569              2                519   50    200/213

Table 3.1: Benchmark datasets used in this work.
Accordance sampling selects the unlabelled examples on whose label
the two view classifiers agree the most. When the two classifiers are sufficient
and independent, the sampled examples are more reliably labelled. Thus,
selecting those examples on whose labels the two view classifiers agree is less
likely to introduce errors in the expanded labelled dataset.
NTBC contains six breast cancer classes. Three of them present luminal characteristics (luminal biomarkers are over-expressed) and are differentiated by the presence or absence of other markers such as oestrogen receptors and/or progesterone receptors. In one of the six classes the HER2 marker (a human epidermal growth factor receptor associated with higher aggressiveness in breast cancers) is strongly expressed. The last two classes are characterised by the over-expression of basal markers and the corresponding under-expression of the luminal ones; these two groups differ by the presence or absence of a marker called p53. For the WDBC dataset, the two classes are malignant and benign.
Since the number of features is large, we randomly generate small groups containing 2 features each and randomly select 200 pairs of views for each run. The last column in Table 3.1 reports the number of view pairs used in our experiments together with the total number of all possible view splits. We use the one-against-all method for the multiclass problem, grouping each class as positive and the rest as negative. U and L represent the numbers of unlabelled and labelled samples respectively.
3.1.1 Nottingham Tenovus Primary Breast Cancer (NTBC)
A series of 1076 patients from the Nottingham Tenovus Primary Breast Carcinoma Series, presenting with primary operable (stages I, II and III) invasive breast cancer between 1986 and 1998, was used. This dataset is the
Nottingham Tenovus Primary Breast Cancer (NTBC) data [106], which has
been used in many experiments in data mining for biomedical research
[107,108]. NTBC is an invasive breast cancer collection, developed by Daniele
Soria et al. at the University of Nottingham. All tumours were less than 5 cm in
diameter on clinical/pre-operative measurement and/or on operative histology
(pT1 and pT2). Women aged over 70 years were not included because of the
increased confounding factor of death from other causes and because primary
treatment protocols for these patients often differed from those for younger
women. There are in total 1076 patients upon which six stages of relevance
judgments are made, having in total 25 features for each patient. Processdetecting proteins reactivity for twenty-five proteins with known relevance in
breast cancer, including those used in routine clinical practice, was previously
determined using the standard techniques for detecting proteins on tumour
samples prepared as tissue microarrays [106]. Levels of process-detecting
reactivity were determined by microscopic analysis using the modified H-score
(values between 0-300), giving a semi-quantitative assessment of both the
intensity of staining and the percentage of positive cells. The complete list of
variables used in this study is given in Table 3.2.
This is a well-characterised series [106] of patients who were treated
according to standard clinical protocols. Patient management was based on
tumour characteristics using Nottingham Prognostic Index (NPI) and hormone
receptor status. Patients with an NPI score of at most 3.4 received no adjuvant therapy; those with an NPI score > 3.4 received hormone therapy if oestrogen receptor (ER) positive, or classical cyclophosphamide, methotrexate and 5-fluorouracil (CMF) if ER negative and fit enough to tolerate chemotherapy.
Table 3.2: Complete list of antibodies used and their dilutions

Antibody, clone                          Short Name   Dilution
Luminal phenotype
  CK 7/8 [clone CAM 5.2]                 CK7/8        1:2
  CK 18 [clone DC10]                     CK18         1:50
  CK 19 [clone BCK 108]                  CK19         1:100
Basal phenotype
  CK 5/6 [clone D5/16134]                CK5/6        1:100
  CK 14 [clone LL002]                    CK14         1:100
  SMA [clone 1A4]                        Actin        1:2000
  p63 ab-1 [clone 4A4]                   p63          1:200
Hormone receptors
  ER [clone 1D5]                         ER           1:80
  PgR [clone PgR 636]                    PgR          1:100
  AR [clone F39.4.1]                     AR           1:30
EGFR family members
  EGFR [clone EGFR.113]                  EGFR         1:10
  HER2/c-erbB-2                          HER2         1:250
  HER3/c-erbB-3 [clone RTJ1]             HER3         1:20
  HER4/c-erbB-4 [clone HFR1]             HER4         6:4
Tumour suppressor genes
  p53 [clone DO7]                        p53          1:50
  nBRCA1 Ab-1 [clone MS110]              nBRCA1       1:150
  Anti-FHIT [clone ZR44]                 FHIT         1:600
Cell adhesion molecules
  Anti E-cad [clone HECD-1]              E-cad        1:10/20
  Anti P-cad [clone 56]                  P-cad        1:200
Mucins
  NCL-Muc-1 [clone Ma695]                MUC1         1:300
  NCL-Muc-1 core [clone Ma552]           MUC1co       1:250
  NCL muc2 [clone Ccp58]                 MUC2         1:250
Apocrine differentiation
  Anti-GCDFP-15                          GCDFP        1:30
Neuroendocrine differentiation
  Chromogranin A [clone DAK-A3]          Chromo       1:100
  Synaptophysin [clone SY38]             Synapto      1:30
Hormonal therapy was given to 420 patients (39%) and chemotherapy
to 264 (24.5%). Data relating to survival was collated in a prospective manner
for those patients presenting after 1989 only; including survival time, defined
as the interval (in months) from the date of the primary treatment to the time of
death. The overall survival was taken as the time (in months) from the date of
the primary surgical treatment to the time of death or censorship.
3.1.2 Wisconsin Diagnostic Breast Cancer Dataset
Breast cancer is among the most common types of cancer worldwide, and it is the second most common cause of cancer-related death in women; about 10% of women in western countries develop breast cancer [109]. Doctors can decisively diagnose breast cancer only through a fine needle aspiration (FNA) biopsy or a core needle biopsy [110]. FNA, which uses a needle smaller than those used for blood tests to remove fluid, cells and small fragments of tissue for examination under a microscope, is the easiest and fastest method of obtaining a breast biopsy, and is effective for women who have fluid-filled cysts. With this in mind, we applied the Wisconsin Diagnostic Breast Cancer (WDBC) data [111, 112] in order to precisely identify malignant breast tumours from a set of benign and malignant samples on the basis of FNA alone.
To estimate the size, shape and texture of each cell nucleus, many previous
studies determined the following features:
1. Radius is calculated as the average length of the radial line segments from the centre of mass of the border to each of the border points.
2. Perimeter is measured as the sum of the distances between
consecutive boundary points.
3. Area is calculated by counting the number of pixels in the interior of the border and adding one half of the pixels on the perimeter, to correct for the digitization error.

4. Compactness combines the perimeter and the area to give a measure of the compactness of the cell, computed as $\text{compactness} = \text{perimeter}^{2} / \text{area}$. This dimensionless number is minimized by a circular disc and increases with the irregularity of the border.

5. Smoothness is quantified by measuring the difference between the length $r_i$ of each radial line and the mean of the lengths of the two radial lines surrounding it; if these differences are small, the contour of the region is smooth in the neighbourhood of successive border points. The differences are accumulated in a normalised form to avoid numerical instability caused by small divisors, where $r_i$ is the length of the line from the centre to each border point.
6. Concavity is observed by measuring the size of any indentations in
the borders of the cell nucleus.
7. Concave points are similar to concavity, but count only the number of border points lying on the concave regions of the border, rather than the magnitude of those concavities.

8. Symmetry is measured as the relative difference in length between pairs of line segments perpendicular to the major axis of the outline of the cell nucleus. The major axis is determined as the longest chord through the centre of the nucleus connecting two border points, and pairs of perpendicular segments are then taken at regular intervals. The sums are divided, instead of summing the individual ratios, to avoid unreliable values caused by small segments:
$$ \text{symmetry} = \frac{\sum_i |\,\text{left}_i - \text{right}_i\,|}{\sum_i (\,\text{left}_i + \text{right}_i\,)}, $$
where $\text{left}_i$ and $\text{right}_i$ denote the lengths of the perpendicular segments on the left and right of the major axis, respectively.
9. Fractal dimension is estimated using the "coastline approximation" described by Mandelbrot [113]. The perimeter of the nucleus is measured with increasingly larger "rulers"; as the ruler size increases, the precision of the measurement decreases and the measured perimeter decreases. Plotting these values on a log-log scale and measuring the downward slope gives an approximation to the fractal dimension.

10. Texture is measured as the variance of the grey-scale intensities in the component pixels. The mean value, standard error, and extreme (largest or "worst") value of each characteristic were computed for each case, which resulted in 14 features for each of the 569 images, yielding a database of 569 x 14 samples representing 357 benign and 212 malignant cases.
3.2 Background and motivation
Cancer diagnosis necessarily requires the classification of patients affected by breast cancer. Recently, scientists have applied several algorithms to tackle this task, and biologists increasingly use modern machine learning techniques for the purpose of extracting appropriate information about tumours. This chapter discusses semi-supervised and supervised machine learning methods for classifying such data sets, and a systematic comparison between the two is provided hereinafter. For the first part of this analysis, the Nottingham Tenovus Primary Breast Cancer data [106] will be considered; the full list of variables is reported in Table 3.2.
A Support Vector Machine (SVM) classifier and Transductive Support
Vector Machine classifier will be applied throughout this dataset. The same
machine learning techniques have already been used in studies such as
Bellaachia and Guven [114] and the revised study by Delen et al. [115]. The
above methods were applied in the search for the most suitable one for
predicting the survivability rate of breast cancer patients. This study was motivated by the necessity of finding an automated and robust method to validate the classification of breast cancer using accordance-based sampling (pg. 58).
In fact, six classes were obtained using agreement between different
classification algorithms. Starting from these groups, the aim was to reproduce
the classification, taking into account the high abnormality of the data (see
Figures 3.1 and 3.2). For this reason, the Support Vector Machine classifier
was used, then the results were compared with the Transductive Support
Vector Machine. It is important to note that, of the 1076 patients, only 62%
(663 cases) were classified into one of the six core groups, while the remaining
38% presented indeterminate or mixed characteristics. This part of the study
focused only on the subset of the ‘in-class’ cases. The objective was to run the
classifiers to find an automated way to justify and reproduce the classification
obtained before having the results that were based on our accorded sampling
methods.
In the second part of the chapter, a dataset taken from the UCI Machine
Learning Repository [105] will be considered in coping with abnormal data. In
this dataset, the performance of the SVM classifier with accordance-based
sampling will be compared with the TSVM approach for classification.
Moreover, as the assumption of normality of the data is strongly violated in many real-world problems, the implementation of SVM and TSVM classifiers with accordance-based sampling was developed and is presented in this chapter. This method deals with continuous, non-normally distributed variables which, as in the cases presented here, do not follow normal distributions (see Figure 3.3).
The algorithm has the same structure as the SVM and TSVM, considering the
effect of accordance-based sampling on the SVM and TSVM performance. The
results obtained with the new method will be compared with both those found
by the SVM algorithm and those obtained by applying TSVM.
Figure 3.1: Histogram of variable CK19
Figure 3.2: Histogram of variable P53
Figure 3.3: Histogram of WDBC
3.3 Experimental setting
For our empirical evaluation, two breast cancer datasets were used – the
in-house Nottingham Tenovus Primary Breast Cancer Series Dataset and
Wisconsin Diagnostic Breast Cancer UCI Dataset.
A comparison was made between the performance of the original random-sampling SVM and TSVM on the one hand, and the new accordance-sampling method on the other. Accuracy was used to measure the performance of the models. The accordance-sampling method was applied to TSVM and SVM on both datasets, and each method was run 10 times on each dataset. The original SVM and TSVM algorithms were then run using the same setup, so that their performance could be compared.
To investigate the performance of accordance sampling for breast cancer classification, for each dataset we created a pool of unlabelled data by sampling 10%, 20%, 30%, etc. of the examples from the training data for each class, as mentioned previously. We then randomly selected two examples from the pool to serve as the initial labelled data. Thus, the learner is given the remaining unlabelled examples and the two labelled examples, and the classifier is then tested.
The attributes of each dataset are split into two sets and are used as two
classifier views. This is a practical approach for generating two views from a
single attribute set. To comprehensively explore SVM and TSVM performance
on these views, we want to experiment with all possible combinations of the
views. Since the number of attributes is large, we randomly generate some
small groups containing 2 attributes. We randomly select 200 pairs of views for each run. The last column in Table 3.1 represents the number of view
pairs used in our experiments and the total number of all possible view splits.
We use the ‘one against all’ method for the multiclass problem, and we group
each class as positive and the rest as negative.
The performances of the Support Vector Machine classifier and the Transductive Support Vector Machine were evaluated using the SVM algorithm implemented in the LIBSVM package [116]. LIBSVM is a library for Support Vector Machines (SVMs) that has been actively developed since the year 2000; its goal is to help users apply SVMs easily to their applications, and it has gained wide popularity in machine learning and many other areas. All the techniques analysed were run 10 times using the 10-fold cross-validation option, and the accuracy of the obtained classification was evaluated simply as the percentage of correctly classified instances. The mean of the resulting accuracies was then computed.
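For concreteness, this repeated 10-fold cross-validation accuracy can be computed as in the sketch below (shown with scikit-learn's LIBSVM-based SVC on synthetic data; the actual experiments used the LIBSVM package [116] with the settings described in this section, so the data and parameters here are illustrative assumptions).

    import numpy as np
    from sklearn.model_selection import cross_val_score, StratifiedKFold
    from sklearn.svm import SVC

    rng = np.random.default_rng(9)
    X = np.vstack([rng.normal(-1, 1, (100, 4)), rng.normal(+1, 1, (100, 4))])
    y = np.array([0] * 100 + [1] * 100)

    accuracies = []
    for run in range(10):                                    # 10 repetitions, as in the experimental setup
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=run)
        scores = cross_val_score(SVC(kernel="rbf", C=1.0), X, y,
                                 cv=cv, scoring="accuracy")  # 10-fold cross-validation
        accuracies.append(scores.mean())

    print("mean accuracy over 10 runs: %.3f" % np.mean(accuracies))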
3.3.1 Support Vector Machine
Generally, SVM is a binary classification technique. Assume that the given training data $\{\mathbf{x}_1, \ldots, \mathbf{x}_l\}$ are vectors in some space $X \subseteq \mathbb{R}^{n}$, and that they are given labels $\{y_1, \ldots, y_l\}$ with $y_i \in \{-1, +1\}$. In their simplest form, SVMs use hyperplanes that separate the training data by a maximal margin. All vectors lying on one side of the hyperplane are labelled as $-1$, and all vectors lying on the other side are labelled as $+1$. The training instances that lie closest to the hyperplane are called support vectors [117].
This works within the framework of induction. Using a labelled training
set of data, the task was to create a classifier that would perform well on
unseen test data. In addition to regular induction, SVMs can also be used for
transduction. Here they are first given a set of both labelled and unlabelled
data. The learning task is to assign labels to the unlabelled data as accurately as
possible [118,119]. SVMs can perform transduction by finding the hyperplane
that maximizes the margin relative to both the labelled and unlabelled data
[118,120].
A binary (two-class) classification using SVMs presents an interesting
and effective approach to solving automated classification tasks. The initial
support vector machines were designed for binary classification; this has now
been extended to multiclass classification [120]. Almost all the current
multiclass classification methods fall into two categories: one against one, or
one against all [120,121]. We use the one against all method because it uses
less computational time and is more suitable for semi-supervised data.
A bioinformatics classification problem is usually a multiclass one, since more than two classes are typically needed; a clustering-based approach can be relied upon to predict labels for unlabelled examples [122,123]. The multiclass SVM is then used to learn from the augmented training set and to classify the test set [123,124].
3.3.2 Measures for predictive accuracy
There are many different measures for assessing the accuracy of a
model [125], two of which are calibration and discrimination. When a fraction
of about P of the events that are predicted with probability P actually occur, it
can be said that the predicted probabilities are well calibrated and a suitable
model for P(C|X) has been found [126]. Discrimination, by contrast, measures
a predictor’s ability to separate patients with different responses [125].
When the outcome variable is dichotomous and predictions are stated as
probabilities that an event will occur, calibration and discrimination are more
informative than other indices (such as the expected squared error) in
measuring accuracy [125]. The calibration plot is a method of showing how well a classifier is calibrated; a perfectly calibrated classifier is represented by a diagonal line on the graph [127]. Note that calibration applies only to probabilistic models, and not to SVM or TSVM.
The concordance index (c index) is a widely applicable measure of predictive discrimination; it applies to ordinary continuous outcomes, dichotomous diagnostic outcomes and ordinal outcomes. This index of predictive
discrimination is related to a rank correlation between predicted and observed
outcomes. The c index is defined as the proportion of all patient pairs in which
the predictions and outcomes are concordant. For predicting binary outcomes, c
is identical to the area under a receiver operating characteristic (ROC) curve
[127].
A ROC curve is a tool to measure the quality of a binary classifier
independently from the variation in time of the ratio between positive and
negative events [127]. In other words, it is a graphical plot of the sensitivity
versus (1 - specificity) for a binary classifier system as its discrimination
threshold is varied. The ROC can also be represented equivalently by plotting
the fraction of true positives (TPR = true positive rate) versus the fraction of
false positives (FPR = false positive rate). A completely random guess would give a point along a diagonal line (the so-called line of non-discrimination) from the bottom left to the top right corner. Usually, one is interested in the
area under the ROC curve, which gives the probability that a classifier will
rank a randomly chosen positive instance higher than a randomly chosen
negative one. A random classifier has an area of 0.5, while an ideal one has an
area of 1.
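The quantities described here, the TPR/FPR pairs of the ROC curve and the area under it (equivalent to the c index for binary outcomes), can be computed as in the following sketch; the scores and labels are synthetic and purely illustrative.

    import numpy as np

    def roc_points(y_true, scores):
        """Sweep the decision threshold and record (FPR, TPR) pairs."""
        points = []
        for t in np.sort(scores)[::-1]:
            pred = scores >= t
            tpr = (pred & (y_true == 1)).sum() / (y_true == 1).sum()   # true positive rate
            fpr = (pred & (y_true == 0)).sum() / (y_true == 0).sum()   # false positive rate
            points.append((fpr, tpr))
        return [(0.0, 0.0)] + points

    def auc(points):
        """Area under the ROC curve by the trapezoidal rule."""
        pts = sorted(points)
        return sum((x2 - x1) * (y1 + y2) / 2.0
                   for (x1, y1), (x2, y2) in zip(pts[:-1], pts[1:]))

    rng = np.random.default_rng(10)
    y = rng.integers(0, 2, size=200)
    scores = y + rng.normal(scale=0.8, size=200)       # informative but noisy scores
    print("AUC = %.3f" % auc(roc_points(y, scores)))   # ~0.5 for a random classifier, 1.0 for a perfect one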
3.4 Derivation of a New Algorithm
The main idea of the new algorithm is that the closer a variable value is
to its median in a particular class, the higher is the probability to be assigned to
that specific group. At the beginning of the algorithm, the value of each sample
was computed, as well as the prior probabilities.
The following step is the main part of the method, in which the individual samples to be labelled are selected. Accordance sampling selects the unlabelled examples on whose label the two view classifiers agree the most. When the two classifiers are sufficient and independent, the sampled examples are more reliably labelled. Thus, selecting those examples on whose labels the two view classifiers agree is less likely to introduce errors into the expanded labelled dataset; conversely, when the two view classifiers disagree, at least one of them must be assigning the wrong label to the example, which may lead to labelling errors in the expanded labelled dataset. This offers one way of understanding why the sampling method can work well in limiting labelling errors. However, in our case we cannot calculate the labelling error rates, since the real labels of the unlabelled examples are not known.
The basic idea is to have all the class representatives present in the sample. Say we have a dataset of 1000 records with 5 classes: class 1 → 500 records, class 2 → 250 records, class 3 → 125 records, class 4 → 65 records and class 5 → 60 records. It would not be fair to randomly select a sample of 250, because there is a high probability that class 4 or class 5 would not be represented. Nor would it be fair to take a sample of 50 records from each class, because that would leave only 10 records of class 5 for testing; in addition, class 1 would not be well trained, as only 50 of its 500 records would be used for training. Ideally, we should combine both ideas: all classes should be used, but each with the same percentage as in the data. Say we need 20% of the data for training; then 20% of class 1, 20% of class 2, and so on, are selected at random.
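A minimal Java sketch of such a class-proportional (stratified) split is given below; the class and method names are illustrative only and are not taken from the thesis implementation.

```java
import java.util.*;

/** Minimal stratified-sampling sketch: draws the same percentage from every
 *  class so that small classes are still represented in the training set. */
public class StratifiedSplit {

    /** byClass maps each class label to its records; trainFraction is e.g. 0.2 for 20%. */
    public static <T> Map<String, List<T>> split(Map<String, List<T>> byClass,
                                                 double trainFraction, long seed) {
        Random rnd = new Random(seed);
        List<T> train = new ArrayList<>(), test = new ArrayList<>();
        for (List<T> records : byClass.values()) {
            List<T> shuffled = new ArrayList<>(records);
            Collections.shuffle(shuffled, rnd);
            int nTrain = (int) Math.round(trainFraction * shuffled.size());
            train.addAll(shuffled.subList(0, nTrain));          // e.g. 20% of class 1, 20% of class 2, ...
            test.addAll(shuffled.subList(nTrain, shuffled.size()));
        }
        Map<String, List<T>> result = new HashMap<>();
        result.put("train", train);
        result.put("test", test);
        return result;
    }
}
```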
We propose a novel sampling method to replace the random sampling
used by SVM and TSVM to find whether this could extensively reduce the
amount of labelled training examples needed, and in addition to see if this can
improve the performance of both SVM and TSVM. The new method uses
redundant views to expand the labelled dataset to build strong learning models.
The major difference is that the new method uses a number of feature views (two views in our case) to select and sample unlabelled examples to be labelled by the domain experts, whereas the original SVM and TSVM randomly sample some unlabelled examples and use classifiers to assign labels to them.
We expect that SVM and TSVM will benefit from the accordance sampling method. Let V1 and V2 be the two feature views learned from the labelled training data L; both views are used to classify every unlabelled example in U.
SVM and TSVM train the redundant view classifiers by learning from the most informative labelled examples [128,129]. The view classifiers are then used to classify the most informative unlabelled examples, and the unlabelled examples on whose classification the two view classifiers agree the most are sampled. We use a ranking function to rank all the unlabelled instances according to the predictions of the view classifiers. The ranking score for an unlabelled instance $x_i$ is the larger of the average predicted probabilities for the positive and the negative class under the two view classifiers:

$$score(x_i) = \max\left\{ \tfrac{1}{2}\big(P_{V_1}(+1 \mid x_i) + P_{V_2}(+1 \mid x_i)\big),\; \tfrac{1}{2}\big(P_{V_1}(-1 \mid x_i) + P_{V_2}(-1 \mid x_i)\big) \right\} \qquad (3.2)$$

The generated scores produce a ranking in which the examples in the highest positions are those to which both view classifiers assign the same label with high confidence; these are the most informative unlabelled examples.
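The following Java sketch illustrates this ranking step under the definition above, scoring each unlabelled example by the larger of the two averaged class probabilities from the view classifiers; the class and variable names are illustrative assumptions, not the thesis implementation.

```java
import java.util.*;

/** Minimal accordance-sampling sketch: each unlabelled example is scored by
 *  the larger of the two average class probabilities produced by the two view
 *  classifiers, and the highest-scoring examples are sent to the expert. */
public class AccordanceSampling {

    /** pPosView1[i], pPosView2[i]: probability of the positive class for example i
     *  according to view classifiers V1 and V2. Returns indices sorted by score. */
    public static Integer[] rank(double[] pPosView1, double[] pPosView2) {
        int n = pPosView1.length;
        double[] score = new double[n];
        for (int i = 0; i < n; i++) {
            double avgPos = (pPosView1[i] + pPosView2[i]) / 2.0;
            double avgNeg = ((1 - pPosView1[i]) + (1 - pPosView2[i])) / 2.0;
            score[i] = Math.max(avgPos, avgNeg);   // high when both views agree confidently
        }
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(score[b], score[a]));   // descending by score
        return order;
    }
}
```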
3.5 Results
First of all, cases for which the value was missing were deleted from
the data sets. Then, the experiments were started by running the SVM and
TSVM classifiers in Java using the 10-fold cross-validation option and
evaluating the accuracy of the obtained classification simply by looking at the
percentage of correctly classified instances.
The classifiers were applied in order to get an automated way to justify
and reproduce the classification previously obtained. The results obtained from
the SVM were quite good; 572 out of 663 cases were correctly classified
(86.3%) and just 91 (13.7%) incorrectly classified. The main concern in using
this classifier came from the set of rules that were produced: these appear to be
quite numerous and not straightforward, especially if they were to be used by
scientists not familiar with computational analysis.
The Transductive Support Vector Machine (TSVM) was then considered. This
method performed better than the SVM, succeeding in correctly classifying
632 instances (95.4%) out of 663; just 31 cases (4.6%) were misclassified. A
summary of the above results can be found in Table 3.3.
Whole data

Method    Classified      Misclassified
SVM       572 (86.3%)     91 (13.7%)
TSVM      632 (95.4%)     31 (4.6%)

Table 3.3: Comparison of results for the two classifiers using all samples
The strategy was to select those discriminating samples whose distribution ranked highest among the classes during categorisation; these samples were selected using the accordance sampling method. An exhaustive search for the best combination of 50 samples out of the 663 was then performed based on the Support Vector Machine classification results. This was done with the clinical aim of reducing the number of samples used for classification, which should both simplify clinical testing with such samples and reduce its cost.
This 'new' smaller dataset was used to repeat the previous experiments with the same classifiers. With SVM, no significant differences could be seen: 576 cases (86.9%) were correctly classified, with a slight reduction in the number of misclassified instances, this time 87 (13.1%). The TSVM, instead, performed very well compared to the previous run: 651 out of 663 cases (98.2%) were classified properly and just 12 (1.8%) were misclassified. A summary of these results is reported in Table 3.4.
Top 50 Samples

Method    Classified      Misclassified
SVM       576 (86.9%)     87 (13.1%)
TSVM      651 (98.2%)     12 (1.8%)

Table 3.4: Comparison of results for the two classifiers using only 50 samples
The ten cross-validation accuracies of each algorithm were compared using t-tests, after checking for normality with the Shapiro test [130]. It was found that, for both the whole data and the 50-sample datasets, the TSVM classifier performed significantly better than SVM (p < 0.01). The findings are summarised in Table 3.5.
Average accuracies

              SVM          TSVM
Whole data    86.9 (2.5)   94.9 (2.6)
50 samples    87.8 (6.3)   97.6 (1.8)

Table 3.5: Average accuracies for 10 cross-validation experiments for the classifiers (standard deviation in brackets)
• Nottingham Tenovus Primary Breast Cancer (NTBC)
Figure 3.4 shows a graph of accuracy across different percentages of
the training data for each class using TSVM and SVM with accordance
sampling for the Nottingham Tenovus Primary Breast Cancer Series dataset. In comparison, Fig. 3.5 shows the accuracy across different percentages of training data using random sampling for TSVM and SVM.
We observed that TSVM was able to produce higher average
classification accuracy than SVM with the use of the accordance sampling
method across different amounts of labelled training data ranging between 10%
and 90%. The difference between SVM and TSVM is relatively small at first, but the gap widens from 40% training data, where the accuracies are 90.9% for TSVM and 84.76% for SVM; from 75% to 90% the difference starts to narrow again. This indicates that active learning based on accordance sampling provides benefit for both the supervised and the semi-supervised learning methods, with the TSVM algorithm still outperforming SVM. In practice, when a large amount of training data is used, such as 90%, TSVM achieved an average accuracy of 99.75% with the sampling method, while SVM achieved an average accuracy of 94.56%.
Accuracy was used to measure the performance of our model. The results of SVM and TSVM using the new accordance sampling method are compared with random sampling in Tables 3.6 and 3.7. The maximum accuracy of the SVM classifier with random sampling was obtained when using 90% training data (90.54%), while the maximum accuracy of SVM with accordance-based sampling, also at 90% training data, was 94.56%. At the other end, the minimum accuracy, obtained with 10% training data, was 80.87%, exactly the same result as with random sampling. Tables 3.6 and 3.7 summarise these findings, and Figure 3.5 plots the accuracy across the different amounts of training data.
Nottingham Tenovus Breast Cancer

Training %    TSVM        SVM
10            81.99862    80.87
20            81.97931    80.32
25            86.44067    82.33
30            86.92682    85.09
40            87.8        85.36
50            89.99       85.92
60            91.54       85.97
70            93.75       86.34
75            94.59459    88.96
80            94.12141    89.72
90            94.11965    90.54

Table 3.6: Comparing SVM and TSVM using random sampling with different percentages of training samples for each class
Nottingham Tenovus Breast Cancer

Training %    #Test    TSVM     SVM
10            595      83.12    80.87
20            529      85.66    82.66
25            496      85.88    84.43
30            463      86.34    84.12
40            397      90.9     84.76
50            331      93.07    87.09
60            265      93.27    88.87
70            199      94.37    90.34
75            166      95.64    92.76
80            133      97.37    93.88
90            67       99.75    94.56

Table 3.7: Comparing SVM and TSVM using accordance sampling with different percentages of training samples for each class
Figure 3.4: Accordance-based sampling TSVM vs. SVM with different percentages of labelled training data for each class (accuracy % plotted against % training data).
• Wisconsin Diagnostic Breast Cancer (WDBC)
Figure 3.6 indicates the accuracy across different percentages of
training data of each class using TSVM and SVM with accordance sampling
for the Wisconsin Diagnostic Breast Cancer (WDBC) data set. In contrast
Figure 3.7 indicates the accuracy across different percentages of training data
using the original random sampling TSVM and SVM run ten times for each
amount of training labelled data.
Figure 3.5: Original random sampling TSVM vs. SVM (Nottingham Tenovus Breast Cancer) with different percentages of labelled training data (accuracy % plotted against % training data).
It appeared that TSVM provided an advantage over regular SVM, producing higher average classification accuracy with the accordance sampling method across all the different amounts of training data between 10% and 90%. The gap between SVM and TSVM is relatively large from the start and widens and narrows irregularly. Specifically, at 20% training data the accuracies are 90.05% for TSVM and 86.45% for SVM, whereas the original random sampling gives 84.75% and 82.48% respectively with the same 20% of labelled training data. At 75% the difference narrows, but it then starts to widen again.
This indicates that active learning based on accordance sampling benefits both the supervised and the semi-supervised learning methods, with the TSVM algorithm still outperforming SVM. In practice, at 90% training data TSVM achieved an average accuracy of 96.5% with the accordance sampling method, compared with 93.51% for the original random sampling TSVM; the minimum for TSVM with accordance sampling was 89.36% at 10% training data. For SVM, accordance-based sampling at 90% training data gave 93.93%, while the original random sampling SVM gave around 90%. The results of SVM and TSVM with the new accordance sampling method are compared with random sampling in Tables 3.8 and 3.9. The maximum accuracy of the SVM classifier with random sampling was achieved at 90% training data (90.97%), while the maximum accuracy of SVM with accordance-based sampling at 90% training data was 93.93%. Comparing the minimum accuracies, obtained at 10% training data, accordance sampling gave 85.22%, slightly better than the 80.24% obtained with random sampling.
Breast Cancer Wisconsin

Training %    TSVM     SVM
10            84.37    80.24
20            84.75    82.48
25            85.95    84.75
30            86.23    84.12
40            88.11    84.31
50            88.56    85.19
60            90.44    87.70
70            91.02    87.91
75            91.28    89.92
80            92.37    90.10
90            93.51    90.97

Table 3.8: Comparing SVM and TSVM using random sampling with different percentages of training samples for each class
Breast Cancer Wisconsin

Training %    #Test    TSVM     SVM
10            629      89.36    85.22
20            559      90.05    86.45
25            524      91.95    86.77
30            489      92.23    88.09
40            419      93.11    89.34
50            349      93.56    90.22
60            280      94.44    91.67
70            210      94.02    92.88
75            175      94.8     92.98
80            140      95.37    93.07
90            70       96.5     93.93

Table 3.9: Comparing SVM and TSVM using accordance sampling with different percentages of training samples for each class
Figure 3.6: Accordance-based sampling TSVM vs. SVM (Wisconsin Diagnostic Breast Cancer) with different percentages of labelled training data for each class (accuracy % plotted against % training data).
Figure 3.7: Original random sampling TSVM vs. SVM (Wisconsin Diagnostic Breast Cancer) with different percentages of labelled training data (accuracy % plotted against % training data).
3.6 Discussion of results
In the experiments presented in this chapter, several different results
were obtained from the classifiers. Using the whole Nottingham dataset, the
best performance was obtained from the TSVM classifier: in fact, just 31 cases
were incorrectly classified. The SVM returned results worse than the TSVM,
and 91 cases were incorrect. When just the 50 samples were considered, there was a substantial improvement for both learning methods, supervised and semi-supervised: although the reduced set did not return the highest possible number of correctly classified instances, it performed much better than the full set, reducing the number of misclassified instances from 31 to 12 for TSVM and from 91 to 87 for SVM. Again, the best results were obtained using the TSVM; this time the SVM did not improve as much, with just 4 fewer misclassifications. Overall, the supervised support vector machine was the worse of the two classifiers, performing almost identically with all samples and with the reduced set.
Starting from these results, a 'non-parametric' approach to the SVM and TSVM classifiers, able to deal with continuous and non-normal covariates, was developed. The method was presented and its performance on two particular data sets was compared to the original random sampling SVM and TSVM. The SVM method did not perform well on all the data considered in this part of the work; focusing on breast cancer, this reflects a degree of independence between biological markers and clinical information. Moreover, all the dataset samples strongly violated the normality assumption, which justifies the 'non-parametric' approach. For each class, the median value and the histogram of its distribution were computed.
Different situations that might occur were then considered. If, for a particular class and a particular data point, the value of a sample was lower or greater than the extreme values of the same sample in the class considered at that stage, then a probability close to zero of belonging to that class was assigned to the data point. Secondly, if the value was identical to the median, the probability was set to one. Finally, if the data point was smaller than the median, the area between the distribution's minimum and the actual value was calculated (or between the value and the distribution's maximum if the value was greater than the median); the value obtained was then divided by half the number of observations. As for the SVM classifier, for each case the product of the probabilities of all samples given the classes was calculated, and the case was assigned to the class with the highest resulting score. With the method just described, a larger number of data points was correctly classified, raising the percentage from 80.24% to 85.22% for the Wisconsin breast cancer dataset using only 10% training with SVM, and from 85.19% to almost 90.22% for the same dataset using only 50% training, also with SVM. An increase from 90.97% to 93.93% for SVM was obtained using 90% training with accordance-based sampling.
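A minimal Java sketch of the median-based scoring rule described above is given below, under the stated rules (out-of-range values receive a near-zero probability, the median receives probability one, and otherwise the count of class observations between the value and the nearer extreme, divided by half the class size, is used); the class names and the small smoothing constant are illustrative assumptions, not the thesis code.

```java
/** Minimal sketch of the median-based class scoring described above. */
public class MedianClassScorer {

    /** classValues[c][f] holds the sorted training values of feature f in class c. */
    public static int classify(double[] x, double[][][] classValues) {
        int best = -1;
        double bestScore = -1;
        for (int c = 0; c < classValues.length; c++) {
            double score = 1.0;
            for (int f = 0; f < x.length; f++) {
                score *= featureProbability(x[f], classValues[c][f]);   // product over all features
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;   // class with the highest product of per-feature probabilities
    }

    private static double featureProbability(double v, double[] sorted) {
        int n = sorted.length;
        double min = sorted[0], max = sorted[n - 1];
        double median = (sorted[(n - 1) / 2] + sorted[n / 2]) / 2.0;
        if (v < min || v > max) return 1e-6;          // outside the class range: probability close to zero
        if (v == median) return 1.0;                  // exactly the median: probability one
        double count = 0;
        if (v < median) {                             // observations between the minimum and v
            for (double s : sorted) if (s <= v) count++;
        } else {                                      // observations between v and the maximum
            for (double s : sorted) if (s >= v) count++;
        }
        return count / (n / 2.0);
    }
}
```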
However, when using TSVM on the Wisconsin breast cancer dataset, different results were obtained: the proposed new model appeared more accurate (in terms of the percentage of patients correctly classified) and better calibrated for semi-supervised learning. The
improvement was from almost 84.37% to more than 89.36% for TSVM using
only 10% training for the Wisconsin Diagnostic Breast Cancer dataset, from
88.56% to almost 93.56% for the same dataset using only 50% training with
TSVM, and from 93.51% to 96.50% for TSVM using 90% training with
accordance-based sampling.
This was also true when considering the Nottingham Tenovus Breast Cancer data set, for which the new algorithm appeared to be slightly better calibrated and more accurate. A larger number of data points was correctly classified, raising the percentage from 81.99% to 83.12% for the Nottingham breast cancer dataset using only 10% of the training data with TSVM, and from 89.99% to almost 93.07% for the same dataset using only 50% of the training data with TSVM. Moreover, there was a rise from 94.11% to 99.75% for TSVM using 90% training with accordance-based sampling. With SVM the outcomes were more mixed: the proposed new model was not precise enough in terms of the proportion of patients correctly classified, and was less well calibrated for the supervised learning on the Nottingham breast cancer data. For SVM, accuracy remained at 80.87% using only 10% training on the Nottingham breast cancer dataset, improved from 85.92% to almost 87.09% for the same dataset using only 50% training, and rose from 90.54% to 94.56% using 90% training with accordance-based sampling.
However, for this dataset, when a random sampling SVM was fitted to
the data, the number of patients correctly assigned to their class was identical
to the one obtained when using SVM with accordance-based sampling for 10%
training. In addition, the ROC curve associated with the method presented here
was very similar to the one produced by the SVM, providing two close values
for the areas under the curve.
It is important to note that a couple of the datasets presented in this work were also used in [131] for comparing SVM as a supervised method with the kernel and TSVM as semi-supervised methods, obtaining both better and worse results. Those methods dealt with continuous sampling when using an SVM classifier. The accordance-based sampling method, instead, was developed to replace the random sampling of SVM and TSVM for several dataset samples. Moreover, it outperformed all the other methods proposed in [131] when applied to the breast cancer datasets.
It is also worth noting that the newly developed method is not meant to be applicable to all available datasets. In this chapter several situations were presented for which a classical approach, the TSVM classifier, was outperformed by a more general algorithm that does not assume any particular distribution of the analysed samples. In general, in our experience, the new method outperforms the classical SVM classifier when datasets with categorical samples are considered or when the majority of the samples do not follow a normal distribution. In these situations, it is advisable to use the SVM with the accordance-based sampling approach.
3.7 Summary
In this chapter, supervised and semi-supervised learning were applied to several case studies. In particular, two different classifiers, the Support Vector Machine and the Transductive Support Vector Machine, were reviewed and used on the 'in-class' patients of the Abd El-Rehim et al. [106] breast cancer dataset in order to validate the classification derived and characterised in previous studies. Surprisingly, the TSVM classifier performed quite well, especially when just the 50 'most important' samples were considered. This happened even though one of the underlying assumptions of the TSVM was strongly violated by the data: the samples did not follow a normal distribution. An accordance-based sampling version of the TSVM was then developed and validated on known datasets. These results were presented in this chapter, together with a comparison with the Support Vector Machine approach.
Chapter 4
Automatic Features and Samples Ranking for SVM Classifier
This chapter presents the modelling of automated feature and sample ranking using Support Vector Machines (SVM). Section 4.2 presents the motivation and background, and section 4.3 describes how the input data for the SVM algorithm were derived. Different models for ranking are studied and presented in section 4.4, together with two measures, MAP and NDCG. Sections 4.5 and 4.6 present experimental results and discussion respectively. Lastly, section 4.7 gives a brief summary of the chapter.
4.1 Introduction
Ranking is a central issue in data mining for biomedical data, in which
a given set of objects (e.g., documents, patients) are categorized in terms of the
computed score of each one. Depending on the application, the scores may
represent the degrees of relevance, preference, or importance. Generally, this
chapter takes the examples of ranking for relevance and the search for
importance. Only a small number of strong features combined with the most
informative samples were used to represent relevance and to rank biomedical
data for breast cancer patients. As mentioned, one of the most vital topics in
data mining for medical data is ranking. While algorithms for learning ranking
models have been intensively studied, this is not the case for sample or feature
selection, despite its importance. The reality is that many samples and feature
selection methods used in classification are directly applied to ranking. We
argue that, because of the striking differences between ranking and
classification, it is better to develop different feature and sample selection
methods for ranking.
4.2 Background
4.2.1 Feature Ranking
In recent years, with the development of supervised learning algorithms
like Ranking SVM [132,133] and RankNet [134], it has become possible to
incorporate more features and samples (strong or weak) into ranking models. In
this situation, feature selection inevitably becomes an important issue,
particularly from the following viewpoints. First, feature selection can help
enhance accuracy in many machine learning problems, which strongly
indicates that feature selection is also necessary for ranking. For example,
although the generalization ability of Support Vector Machines (SVM)
depends on a margin which does not change with the addition of irrelevant
features, it also depends on the radius of training data points, which can
increase when the number of features increases [135,136,137]. Moreover, the
probability of over-fitting also increases as the dimensions of the feature space
increase, and feature selection is a powerful means to avoid over-fitting [138].
Secondly, feature selection may make training more efficient. In data mining, especially in biomedical data research, the data size is usually very large and thus training ranking models is computationally costly. For example, when applying Ranking SVM to biomedical datasets, it is easy to encounter a situation in which training cannot be completed in an acceptable time. To deal with such a problem, we can conduct feature selection before training, since the complexity of most learning algorithms grows with the number of features.
Although feature selection is important, to our knowledge, there have
been no methods of feature selection dedicated specifically to ranking. Most of
the methods used in ranking were developed for classification. Basically,
feature selection methods in classification fall into three categories [139]. In
the first category, named filter, feature selection is defined as a pre-processing step and can be independent of learning. A filter method computes a score for each feature and then selects features according to the scores [140]. Yang et al. [141] and Forman [142] conducted comparative studies on filter methods, and they found that information gain (IG) and chi-square (CHI) are among the most effective methods of feature selection for classification. The second category, referred to as wrapper [142], utilizes the
learning system as a black box to score subsets of features, and the third
category, called the embedded method [142], performs feature selection within
the process of training. Of these three categories, the most comprehensively
studied methods are the filter methods. Therefore, we also base our discussions
on this category in this chapter, and we will use "feature selection" and "the filter methods for feature selection" interchangeably. When applying these feature selection methods to ranking, several problems may arise. First, there is a significant gap between classification and ranking. In ranking, a number of ordered categories are used, representing the ranking relationship between instances, while in classification the categories are "flat". Obviously, existing feature selection methods for classification are not suitable for ranking.
Second, the evaluation measures (e.g. mean average precision (MAP) [143]
and normalized discounted cumulative gain (NDCG) [144]) used in ranking
problems are different from those measures used in classification:
1- In classification, both precision and recall are of equal importance,
while in ranking, we consider precision to be more significant than
recall.
2- In ranking, it is critical to rank the top-n cases properly, while in classification it is important to classify all cases correctly. Due to these distinctions, new, dedicated methods for feature selection are needed in ranking.
Precision (also called positive predictive value) is the fraction of retrieved
instances that are relevant, while recall (also known as sensitivity) is the
fraction of relevant instances that are retrieved.
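In symbols, using the usual confusion-matrix notation (TP, FP and FN for true positives, false positives and false negatives), these two quantities are:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall\ (Sensitivity)} = \frac{TP}{TP + FN}$$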
4.2.2 Samples Ranking
Certainly, learning ranking (or preference) samples has become a
pivotal issue within the machine learning community [145,146,147] and as a
result many applications have been produced in data mining for biomedical
data [148,133]. Aiming at data mining for biomedical data applications,
distinctions can be made between the role of learning ranking samples and
learning classification samples as follows:
1. Unlike classification models, which output a distinct class for a data object, ranking models output a score for each data object, from which a global ordering of the data is constructed.
2. Unlike the training set in classification, which consists of data objects and their category labels, the training set in ranking consists of partial orders of the data. For example, let "$x_i$ is preferred to $x_j$" be written as "$x_i > x_j$". For a dataset $D = \{x_1, \dots, x_m\}$, an example of partial orders is $R = \{x_1 > x_2, x_3 > x_4, \dots\}$, and the target function $F$ outputs a ranking score $F(x)$ for any $x \in D$.
There are other types of ranking model; however, this one has produced practical applications in data mining for biomedical data [147,133,150]. Other suitable models for ranking are discussed in references [149,118]. Existing SVM methods for selective sampling were developed for binary classification, and reference [151] applies the methods proposed in [118] to conduct effective binary relevance feedback for image retrieval. However, these techniques are proposed within the context of binary classification (of whether the image is relevant or not) and thus do not support the learning of ranking models from partial orders.
Extending selective sampling to ranking [152] requires considerable effort. Existing approaches depend on pairwise decomposition [145] as well as constraint classification [146]; they extend multi-class classification to ranking and are thus limited to a finite, a priori fixed set of data, and the model does not scale to the size of the dataset. Our selective sampling is based on the large-margin ranking framework, which has proven effective in practice for learning global ranking functions [133]. This ranking model orders new instances according to their ranking scores, is scalable, and has produced several applicable methods for data mining in biomedical data [147,133,150].
SVMs (support vector machines) have proven effective in learning classification and regression functions [61,153,78]. They have also shown excellent performance in learning ranking functions [147,133,150]. They effectively learn ranking functions of high generalization: in the context of ranking, a function F of high generalization means that the learned ranking function F is not only concordant with the ordering of the training set (i.e., the partial orders) but also generalizes well beyond the training set. This is based on the "large-margin" principle, and nonlinear ranking is supported systematically through the "kernel trick" [147].
SVM ranking learns the underlying function in a supervised batch setting, which assumes that a set of training samples (i.e., partial orders) is given. In many applications, however, collecting training samples involves human labour, which is time-consuming and often expensive.
This labelling effort is a more serious problem in ranking than in classification, since it represents the central issue there [154]: labelled data in ranking denote a partial ordering of the data, so users must consider the relative ordering of items when labelling, whereas in classification users only consider the absolute class of each item.
The concept of active learning, or selective sampling, refers to approaches that aim to reduce the labelling effort by selecting only the most informative samples to be labelled. SVM selective sampling techniques have been developed and have proven effective in achieving high accuracy with fewer examples in many applications [133,134]. However, they are restricted to classification problems and do not extend to ranking problems.
Here, a selective sampling technique for learning SVM ranking functions is proposed. That is, using our selective sampling technique, an accurate SVM ranking function can be learned with fewer partial orders. Our method is "optimal" in the sense that, at each round, it selects the most concordant set of samples, which is considered the most informative in SVM ranking.
According to the experimental results, the labelling effort is significantly reduced. The sampling technique is applied to a data mining application [142]; many experiments were carried out on this application and their results are consistent. In other words, relying on the selective sampling method yields accurate ranking functions with fewer interactions with users.
4.3 Methodology
4.3.1 Feature Selection
The contribution of this chapter is the proposal of a new technique for
feature selection and ranking setting. Feature selection is useful in biomedical
data classification since it helps to remove useless/redundant features and thus
reduce classifier complexity.
In this chapter, we propose a novel method for this purpose with the following
properties.
1) The method makes use of ranking information, instead of simply viewing the ranks as flat categories. For example, it uses evaluation measures or loss functions [148,108] in ranking to measure the importance of features.
2) Considering the similarities between features, inspired by the work in [154,155], helps to avoid the redundant selection of features.
3) Feature selection is modelled as a multi-objective optimization problem in ranking: the final objective is to find the most important and least similar features.
4) A greedy search algorithm is presented to solve the optimization problem. Under certain conditions, the solution provided by this algorithm can be regarded as optimal for the original problem.
Feature selection for ranking requires such properties. We evaluated the proposed feature selection technique on two datasets, NTBC [95] and MADELON [105], and with two state-of-the-art ranking models, Ranking SVM [132,133] and RankNet [134]. We have carried out many experiments to check the performance of the suggested method; in ranking for medical data, it is able to outperform traditional feature selection methods.
Aiming at selecting $k$ features from the entire feature set $\{f_1, \dots, f_n\}$, our method first defines an importance score $w_i$ for each feature $f_i$ and a similarity score $s_{i,j}$ between any two features $f_i$ and $f_j$. Then an efficient algorithm is employed to maximize the total importance scores and minimize the total similarity scores of the selected set of features.
First, an importance score is assigned to each feature. This score is defined with respect to a standard evaluation measure, such as MAP or NDCG (defined in section 4.4), or a loss function (e.g. the pairwise ranking errors [121,122]). In the former case, we use the feature alone to rank the instances, evaluate the ranking with the measure, and take the result as the importance score; in the latter case, we again rank with the feature alone and take a score inversely related to the corresponding loss as the importance score. It is worth noting that features differ in whether their higher ranks correspond to larger or to smaller feature values. We therefore compute MAP, NDCG or the loss twice, ranking the cases in both normal and inverse order, and take the larger score as the importance score of the feature. Inspired by the work in [154,156,155], we also consider removing redundancy among the selected features.
This is particularly necessary when only a small number of features is to be used. In this work, we measure the similarity between any two features on the basis of their ranking results. That is, we regard each feature as a ranking model, and the similarity between two features is represented by the similarity between the ranking results that they produce. Many methods have been proposed to measure the distance between two ranking results (ranking lists), such as Spearman's footrule F, rank correlation R, and Kendall's $\tau$ [157,158]; any of them can be used here. Taking Kendall's $\tau$ as an example, for a query $q$ let $P_q$ denote the set of instance pairs with respect to $q$, let $\#\{\cdot\}$ denote the number of elements in a set, and let $x_s \succ_{f} x_t$ denote that instance $x_s$ is ranked ahead of instance $x_t$ by feature $f$. The Kendall's $\tau$ value of query $q$ for any two features $f_i$ and $f_j$ can then be calculated as

$$\tau_q(f_i, f_j) = \frac{\#\{(x_s, x_t) \in P_q : x_s \succ_{f_i} x_t \wedge x_s \succ_{f_j} x_t\} - \#\{(x_s, x_t) \in P_q : x_s \succ_{f_i} x_t \wedge x_t \succ_{f_j} x_s\}}{\#P_q}.$$

The $\tau_q$ values of all the queries are averaged, and the result $\tau(f_i, f_j)$ is used as the final similarity score between features $f_i$ and $f_j$. It is easy to see that $\tau(f_i, f_j) = \tau(f_j, f_i)$ holds.
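The greedy selection step can be pictured with the following Java sketch, which repeatedly picks the feature whose importance score, penalised by its similarity to the already selected features, is largest; the trade-off coefficient c and all identifiers are assumptions made for illustration rather than the exact algorithm given in the thesis.

```java
import java.util.*;

/** Minimal greedy sketch: pick k features maximizing importance minus a
 *  similarity penalty to the already selected features (assumes k <= n). */
public class GreedyFeatureSelection {

    /** importance[i]: importance score of feature i (e.g. its MAP or NDCG);
     *  similarity[i][j]: Kendall-tau style similarity between features i and j. */
    public static List<Integer> select(double[] importance, double[][] similarity, int k, double c) {
        int n = importance.length;
        List<Integer> selected = new ArrayList<>();
        boolean[] used = new boolean[n];
        for (int round = 0; round < k; round++) {
            int best = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < n; i++) {
                if (used[i]) continue;
                double penalty = 0;
                for (int j : selected) penalty += similarity[i][j];   // redundancy with chosen features
                double score = importance[i] - c * penalty;
                if (score > bestScore) { bestScore = score; best = i; }
            }
            used[best] = true;
            selected.add(best);
        }
        return selected;
    }
}
```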
4.3.2 Sample Selection
To establish the context of our discussion, we first set out the preliminaries. A data object is represented as a vector in which each element is a numerical value indicating an attribute of the data. For instance, a patient from a real breast cancer dataset can be represented by the vector (age, recurrence or survival, size, menopausal status, nodal status, histologic subtype).
We write $x_i \succ_R x_j$ when vector $x_i$ has a higher rank than $x_j$ in an ordering $R$. We presume that $R$ is a strict ordering, meaning that for all pairs $x_i$ and $x_j$ in a set $D$, either $x_i \succ_R x_j$ or $x_j \succ_R x_i$; the treatment can, however, be generalized directly to weak orderings.
Let $R^*$ be the optimal ranking of the data, in which the data are ordered perfectly according to the patient feature or situation of interest. A ranking function $F$ is evaluated by how closely its ordering $R_F$ approximates $R^*$. Kendall's $\tau$ is widely relied upon as a measure of the similarity between two orderings [133]. For two strict orderings $R_a$ and $R_b$, Kendall's $\tau$ can be expressed through the number $P$ of concordant pairs and the number $Q$ of discordant pairs: $R_a$ and $R_b$ either agree or disagree on how they order a pair $x_i$ and $x_j$, and the pair is accordingly concordant or discordant. For orderings $R_a$ and $R_b$ on a dataset $D$, we define the similarity function as

$$\tau(R_a, R_b) = \frac{P}{P + Q} \qquad (4.1)$$

To illustrate, suppose $R_a$ and $R_b$ order five vectors $x_1, \dots, x_5$ as follows:

$$R_a: \; x_1 \succ x_2 \succ x_3 \succ x_4 \succ x_5 \qquad (4.2)$$
$$R_b: \; x_3 \succ x_2 \succ x_1 \succ x_4 \succ x_5 \qquad (4.3)$$

Here we calculate $\tau(R_a, R_b)$ as 0.7, since the number of discordant pairs is 3, namely $(x_1, x_2)$, $(x_1, x_3)$ and $(x_2, x_3)$, while the remaining 7 of the 10 pairs are concordant; the two orderings are 70% in harmony. In terms of the $\tau$ measure, the degree of accuracy of $F$ is evaluated as the similarity between the ordering $R_F$ produced by $F$ and the optimal ordering $R^*$, i.e., $\tau(R_F, R^*)$.
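The following small Java sketch computes this pairwise-agreement similarity for the example above and reproduces the value 0.7; the class and method names are illustrative assumptions.

```java
/** Minimal sketch of the ordering-similarity measure of Eq. (4.1):
 *  the fraction of item pairs on whose relative order the two rankings agree. */
public class OrderingSimilarity {

    /** rankA[i], rankB[i]: position of item i in orderings R_a and R_b (0 = highest). */
    public static double tau(int[] rankA, int[] rankB) {
        int n = rankA.length, concordant = 0, total = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                total++;
                boolean sameOrder = (rankA[i] - rankA[j]) * (rankB[i] - rankB[j]) > 0;
                if (sameOrder) concordant++;
            }
        }
        return (double) concordant / total;
    }

    public static void main(String[] args) {
        // R_a: x1 > x2 > x3 > x4 > x5   and   R_b: x3 > x2 > x1 > x4 > x5
        int[] ra = {0, 1, 2, 3, 4};
        int[] rb = {2, 1, 0, 3, 4};
        System.out.println(tau(ra, rb));   // 7 of 10 pairs agree -> 0.7
    }
}
```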
• SVM Rank Learning
SVM techniques enable us to learn a global ranking function $F$ from a set of partial orders $R$, presuming $F$ is a linear ranking function $f_{\vec{w}}$ such that

$$x_i \succ x_j \iff \vec{w} \cdot x_i > \vec{w} \cdot x_j \qquad (4.4)$$

The learning algorithm adjusts a weight vector $\vec{w}$ for the given set of partial orders $R$. The data can be ranked linearly if there exists a function $f_{\vec{w}}$ (i.e., a weight vector $\vec{w}$) that satisfies Eq. (4.4) for all pairs in $R$. In summary, we aim at learning an $f_{\vec{w}}$ which is in harmony with the provided partial orders $R$ and generalizes well beyond $R$, by finding a weight vector $\vec{w}$ satisfying Eq. (4.4) for as many data pairs as possible. Though this problem is known to be NP-hard [107], reference [147] achieves an approximate solution based on SVM techniques by introducing (non-negative) slack variables $\xi_{ij}$ and minimizing their sum, an upper bound on the number of violated pairs [107], as follows.

QP 1.
$$\text{minimize:} \quad L(\vec{w}, \vec{\xi}) = \tfrac{1}{2}\, \vec{w} \cdot \vec{w} + C \sum \xi_{ij} \qquad (4.5)$$
$$\text{subject to:} \quad \vec{w} \cdot x_i \geq \vec{w} \cdot x_j + 1 - \xi_{ij} \quad \text{for all } (x_i \succ x_j) \in R \qquad (4.6)$$
$$\phantom{\text{subject to:}} \quad \xi_{ij} \geq 0 \qquad (4.7)$$
By the constraint (4.6) and by minimizing the upper bound $\sum \xi_{ij}$ in (4.5), QP 1 satisfies the orderings on the training set $R$ with minimal error. By minimizing $\vec{w} \cdot \vec{w}$, or equivalently by maximizing the "margin" $\delta = 1/\lVert \vec{w} \rVert$, it tries to maximize the generalization of the ranking function. This chapter will discuss how maximizing the margin corresponds to increasing the generalization of ranking. The soft margin parameter $C$ controls the trade-off between the size of the margin and the training error. QP 1 becomes equivalent to the SVM classification problem on pairwise difference vectors $(x_i - x_j)$. By rearranging the constraint (4.6) as

$$\vec{w} \cdot (x_i - x_j) \geq 1 - \xi_{ij} \qquad (4.8)$$

we can extend an existing SVM implementation to solve the QP.
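The rearranged constraint suggests the usual pairwise reduction: every preference becomes a difference vector that an ordinary binary SVM can be trained on. A minimal Java sketch of this transformation is shown below (all identifiers are illustrative; no particular SVM library is assumed).

```java
import java.util.*;

/** Minimal sketch of the pairwise transformation implied by Eq. (4.8):
 *  every preference x_i > x_j becomes a difference vector (x_i - x_j) with
 *  label +1 (and its negation with label -1), so that any standard linear
 *  SVM implementation can be trained on the resulting pairs. */
public class PairwiseTransform {

    public static class Pair {
        public final double[] x; public final int y;
        Pair(double[] x, int y) { this.x = x; this.y = y; }
    }

    /** prefs[k] = {i, j} means example i is preferred to example j. */
    public static List<Pair> toPairs(double[][] X, int[][] prefs) {
        List<Pair> pairs = new ArrayList<>();
        for (int[] p : prefs) {
            int d = X[p[0]].length;
            double[] diff = new double[d], neg = new double[d];
            for (int f = 0; f < d; f++) {
                diff[f] = X[p[0]][f] - X[p[1]][f];
                neg[f]  = -diff[f];
            }
            pairs.add(new Pair(diff, +1));
            pairs.add(new Pair(neg, -1));    // mirrored pair keeps the two classes balanced
        }
        return pairs;   // feed these to an ordinary SVM; the learned w then ranks new items by w . x
    }
}
```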
• Maximizing Generalization of Ranking
The support vectors in QP 1 are the data pairs $(x_i, x_j)$ such that $\vec{w} \cdot (x_i - x_j) = 1 - \xi_{ij}$. Assume that the training data is linearly rankable, so that all $\xi_{ij} = 0$. Then, from Eq. (4.8), the support vectors are the closest data pairs when projected onto $\vec{w}$: from Eq. (4.4), the linear ranking function $f_{\vec{w}}$ projects data vectors onto the weight vector $\vec{w}$, and the geometrical distance of two vectors $x_i$ and $x_j$ projected onto $\vec{w}$ is $\vec{w} \cdot (x_i - x_j)/\lVert \vec{w} \rVert$. Thus, geometrically, the margin $\delta$, which in the classification problem denotes the distance from the support vectors to the boundary, denotes in the ranking problem the distance between the closest two projections. This is illustrated by two different linear functions $f_{\vec{w}_1}$ and $f_{\vec{w}_2}$ that project four data vectors $x_1, x_2, x_3, x_4$ onto $\vec{w}_1$ and $\vec{w}_2$ respectively in a two-dimensional space.

Figure 4.1: Linear projection of four data points

Both $f_{\vec{w}_1}$ and $f_{\vec{w}_2}$ produce the same ordering for the four vectors, namely $x_1 \succ x_2 \succ x_3 \succ x_4$. The distances between the closest two projections onto $\vec{w}_1$ and $\vec{w}_2$ are denoted $\delta_1$ and $\delta_2$ respectively. We compute the weight vector $\vec{w}$ such that it is concordant with the given orders and generalizes beyond them, by maximizing the distance between the closest data pairs in ranking. By minimizing $\vec{w} \cdot \vec{w}$, QP 1 maximizes the margin, i.e., the distance between the closest projected data vectors. For example, in Figure 4.1, although the two weight vectors $\vec{w}_1$ and $\vec{w}_2$ induce the same ordering, $\vec{w}_1$ shows better generalization than $\vec{w}_2$, because the distance between the closest vectors projected onto $\vec{w}_1$ (i.e., $\delta_1$) is larger than that onto $\vec{w}_2$ (i.e., $\delta_2$). Maximizing the margin in ranking is explained in detail in [133]. The learned ranking function $F$ is expressed through dot products of data vectors, and thus it is possible to use nonlinear kernels to learn nonlinear ranking functions. See [133] for the nonlinear kernel extension of the ranking function.
Figure 4.2: Diagram showing an example of an existing feature selection procedure (from an initial feature set, features are extracted, a Ranking SVM model is trained and evaluated with NDCG/MAP, and feature inference is repeated until the NDCG/MAP criterion is satisfied; the selected features are then used, after input transformation of the training data, to train the final Ranking SVM model).
4.4 Experiment Settings
4.4.1 Datasets
Our experiments are based on two main benchmark datasets.
MADELON is UCI artificial data set we use it for classifying random sampling
data. The dataset includes around 4400 patients with binary relevance
judgments. This represents a two-class classification problem with scattered
binary contribution changes. The BM25 model [159] is used to recall the top
2000 patients for each feature who were already recalled. In our experiments
50 features were drawn out for each patient, involving both conventional
features‎ such‎ as‎ patient’s‎ age,‎ size‎ of‎ tumour‎ and‎ survival.‎ Relatively, we
specify the number of test examples required for the test set as 1800, the
number of training examples needed for the training set as 2000, and finally set
600 for validation.
The second dataset is the Nottingham Tenovus Primary Breast Cancer (NTBC) data [95], which has been used in many data mining experiments in biomedical research [107,108]. NTBC is an invasive breast cancer collection, developed by Daniele Soria et al. at the University of Nottingham. In our experiments, we divided each of the two datasets into three parts, for training (both feature selection and model training), validation, and testing. Therefore, for each dataset, we can create different settings corresponding to different training, validation, and testing sets, and run ten trials. The results reported in this chapter are averaged over the ten trials.
4.4.2 Evaluation measures
Two common measures were determined to evaluate ranking methods
for data mining in biomedical data, namely MAP [143] and NDCG [133,134].
• Mean average precision (MAP)

MAP measures the accuracy of ranking results when there are two kinds of instances, positive and negative (relevant and irrelevant). Precision at position $n$ measures how precise the top $n$ outputs of a run are:

$$P@n = \frac{\#\{\text{positive instances in the top } n\}}{n}$$

Average precision then averages $P@n$ over the positions of the positive instances:

$$AP = \frac{\sum_{n=1}^{N} P@n \cdot \mathrm{pos}(n)}{\#\{\text{positive instances}\}}$$

where $N$ is the number of retrieved instances and $\mathrm{pos}(n)$ is a binary function indicating whether the instance at position $n$ is positive. MAP is defined as $AP$ averaged over all runs. In our experiments, the NTBC dataset has six groups of labels; the "first" class was defined as positive and the other five as negative when calculating MAP, as in [107].
• Normalized discounted cumulative gain (NDCG)

NDCG measures the accuracy of a ranking when there are multiple levels of relevance judgment. Given a run, NDCG at position $n$ is defined as

$$NDCG@n = Z_n \sum_{j=1}^{n} \frac{2^{r(j)} - 1}{\log_2(1 + j)}$$

where $j$ denotes the position, $r(j)$ denotes the relevance score of the instance at rank $j$, and $Z_n$ is a normalization factor chosen so that a perfect ranking's NDCG at position $n$ equals 1. For runs in which the number of retrieved instances is less than $n$, NDCG is only calculated over the retrieved instances. In evaluation, NDCG is further averaged over all runs. Note that the above measures are not only used for evaluating feature selection methods, but are also used within our method to compute the importance scores of features.
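For concreteness, the Java sketch below computes average precision for a ranked list of binary labels and NDCG at a cut-off position under the definitions above; the class and method names are illustrative assumptions rather than the evaluation code used in the thesis.

```java
/** Minimal sketch of the two evaluation measures: average precision over a
 *  ranked list of binary labels, and NDCG@n with gain (2^r - 1) and a log2
 *  position discount. */
public class RankingMetrics {

    /** labels are given in ranked order; 1 = positive, 0 = negative. */
    public static double averagePrecision(int[] labels) {
        double hits = 0, sum = 0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] == 1) {
                hits++;
                sum += hits / (i + 1);      // precision at this position
            }
        }
        return hits == 0 ? 0 : sum / hits;  // MAP is this value averaged over all runs/queries
    }

    /** relevance scores r(j) in ranked order; returns NDCG at position n. */
    public static double ndcgAt(int[] relevance, int n) {
        double dcg = dcgAt(relevance, n);
        int[] ideal = relevance.clone();
        java.util.Arrays.sort(ideal);                       // ascending
        int[] desc = new int[ideal.length];
        for (int i = 0; i < ideal.length; i++) desc[i] = ideal[ideal.length - 1 - i];
        double idcg = dcgAt(desc, n);                       // Z_n normalisation (ideal ranking)
        return idcg == 0 ? 0 : dcg / idcg;
    }

    private static double dcgAt(int[] r, int n) {
        double dcg = 0;
        for (int j = 0; j < Math.min(n, r.length); j++) {
            dcg += (Math.pow(2, r[j]) - 1) / (Math.log(j + 2) / Math.log(2));   // discount log2(1 + rank)
        }
        return dcg;
    }
}
```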
4.4.3 Ranking model
Since feature and sample selection are a preparatory step, their effectiveness must be evaluated in combination with ranking models. We used two ranking models in our experiments, Ranking SVM and RankNet.
• Ranking SVM

Ranking SVM [132,133] has proved to be an effective algorithm for ranking in several earlier studies. Ranking SVM extends SVM to ranking: in contrast to traditional SVM, which works on single instances, Ranking SVM utilizes instance pairs and their preference labels in training. Its optimization formulation is that of QP 1 given in section 4.3.2 (Eqs. (4.5)-(4.7)).
• RankNet

Similarly to Ranking SVM, RankNet [134] also uses instance pairs in training. RankNet employs a neural network to express the ranking function and uses cross entropy as its loss function. Let $P_{ij}$ be the model's estimated posterior probability that $x_i$ is ranked higher than $x_j$, let $\bar{P}_{ij}$ be the corresponding "true" probability, and let $o_{ij} = f(x_i) - f(x_j)$, so that $P_{ij} = e^{o_{ij}}/(1 + e^{o_{ij}})$. The loss for an instance pair in RankNet is defined as

$$C_{ij} = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log (1 - P_{ij})$$

Gradient descent is employed to minimize the total loss over the training data. Because the gradient descent may converge to a local optimum, RankNet selects the best model based on the validation set. RankNet has been shown to be effective, particularly on large-scale datasets.
4.4.4 Experiments
This section reports our extensive experiments studying the effectiveness of the proposed selective sampling algorithm. Due to the lack of labelled, completely ordered real-world datasets for ranking, we mostly evaluate the method on artificially generated global orderings $R^*$. We first evaluate the method on the UCI dataset with generated ranking functions, and then use real-world data from the Nottingham breast cancer dataset to demonstrate the practicality of the method, using the Kendall's $\tau$ measure discussed previously to evaluate sample and feature ranking.
The experiments for feature ranking were conducted in the following way. First, we ran a feature selection method on the training set. Then we used the selected features to train a ranking model on the training set, and tuned the parameters of the ranking model (e.g. the combination coefficient in the objective function of Ranking SVM, and the number of epochs in RankNet) with the validation set. These two steps were repeated several times to tune the parameters of the feature selection methods (e.g. the trade-off parameter in our method). Finally, we used the obtained ranking model to conduct ranking on the test set, and evaluated the results in terms of MAP and NDCG.
In order to make a comparison, we selected IG and CHI as the
baselines. IG measures the reduction in uncertainty (entropy) in classification
prediction when the feature is known. CHI measures the degree of
independence between the feature and the categories. Since the notion of
category in ranking differs, in theory these two methods cannot be directly
applied to ranking.
In the MADELON data there are two categories, relevant and irrelevant, while for the NTBC dataset the judgments extend to three categories, from 'definitely relevant' to 'not relevant'. For this reason, we ignored the order information among the 'categories'. It is worth mentioning that IG and CHI are nonetheless used directly as feature selection methods in ranking, and this kind of approximation is commonly made. In addition, we also used 'With All Features' (WAF) as another baseline, in order to show the benefit of conducting feature selection.
The SVM ranking, selective sampling and feature selection algorithms were implemented in Matlab. Our experiments were carried out on a 64 2800+ PC with 1 GB of RAM.
4.5 Experimental Results
4.5.1 MADELON dataset (Feature Ranking)
Fig. 4.3 shows the performance of the feature selection methods on the
MADELON dataset when they work as pre-processors of RankingSVM. Fig.
4.4 shows the performance when using RankNet as the ranking model. In the
figures, the x-axis represents the number of selected features. Let us take Fig.
4.3 (a) as an example. It is found that by using the Support Vector Machine
algorithm, with only ten features RankingSVM can achieve the same or even
better performance when compared with the baseline method WAF. With more
features selected, the performance can be further enhanced. In particular, when
the number of features is 18, the ranking performance is 15% higher than that
of WAF. When the number of selected features increases further, the
performance does not improve, and in some cases even decreases. Feature selection is therefore necessary, but, as previously mentioned, selecting more features does not automatically increase ranking performance. Technically, this is because selecting more features may enhance performance on the training set, while over-fitting may cause performance on the test set to deteriorate; many other learning tasks, including classification itself, exhibit the same phenomenon. Consequently, effective feature selection can enhance both the accuracy and the efficiency of learning to rank.
The results of our experiments demonstrate that our SVM-based method often outperforms CHI, although not significantly. This confirms that feature selection, as a preparatory step, and the ranking model should be considered together during training. We should also bear in mind that a set of features selected using the SVM-based method may be regarded as a good selection on one level (on the basis of MAP or NDCG) but may still turn out to be a poor choice for training the model. It is also worth mentioning that there is very little difference between CHI and IG, and neither consistently outperforms the other.
(a) MAP of Ranking SVM
(b) NDCG@10 of Ranking SVM
Figure 4.3: Ranking accuracy of Ranking SVM with different feature selection methods on the MADELON dataset
Experimental results also indicate that with SVM feature selection
methods the ranking performances of Ranking SVM are more stable than those
with IG and CHI as the feature selection method. This is particularly true when
the number of selected features is small. For example, from Fig. 4.3 (a) it can
be seen that with 12 features, the MAP values with SVM are more than 0.3,
while those of IG and CHI are only 0.22 and 0.25 respectively. Furthermore,
IG and CHI do not result in a clearly better performance than WAF.
(a) MAP of RankNet
(b) NDCG@10 of RankNet
Figure 4.4: Ranking accuracy of RankNet with different feature selection
methods on the MADELON dataset
There may be two reasons for this: first, IG and CHI are not designed for ranking, so the ordinal information between instances may be lost when using them; second, IG and CHI may select redundant features. For NDCG@10 and for RankNet we observe similar tendencies and reach similar conclusions.
4.5.2 MADELON dataset (Samples Ranking)
At this point, we evaluate the learning performance of our method against
random sampling. We randomly created 1000 data samples of 10 dimensions
(i.e., a 1000-by-10 matrix) such that each element is a random number between
zero and one.
For evaluation, we artificially generated ranking functions: First, we
generated arbitrary linear functions
weight vector
. The global ordering
by randomly generating the
is constructed from the function.
Secondly, we constructed a second-degree polynomial function
+1)2 by also generating
randomly.
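A minimal Java sketch of this synthetic set-up is shown below: it draws a random 1000-by-10 matrix, generates a random weight vector, and derives the global orderings induced by the linear and polynomial functions; all identifiers are illustrative assumptions.

```java
import java.util.*;

/** Minimal sketch of the synthetic set-up: random 1000-by-10 data, a random
 *  linear function w.x and a polynomial function (w.x + 1)^2, each inducing a
 *  global ordering of the samples. */
public class SyntheticRanking {

    public static void main(String[] args) {
        Random rnd = new Random(0);
        int n = 1000, d = 10;
        double[][] X = new double[n][d];
        double[] w = new double[d];
        for (int f = 0; f < d; f++) w[f] = rnd.nextDouble();
        for (double[] row : X)
            for (int f = 0; f < d; f++) row[f] = rnd.nextDouble();

        double[] linear = new double[n], poly = new double[n];
        for (int i = 0; i < n; i++) {
            double wx = 0;
            for (int f = 0; f < d; f++) wx += w[f] * X[i][f];
            linear[i] = wx;                         // linear ranking function
            poly[i]   = Math.pow(wx + 1, 2);        // second-degree polynomial function
        }
        Integer[] orderL = order(linear), orderP = order(poly);   // global orderings R*
        System.out.println("top linear sample: " + orderL[0] + ", top polynomial sample: " + orderP[0]);
    }

    private static Integer[] order(double[] score) {
        Integer[] idx = new Integer[score.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(score[b], score[a]));   // descending by score
        return idx;
    }
}
```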
Figure 4.5: Accuracy convergence of random and selective sampling on the MADELON dataset, for both linear (L) and polynomial (P) ranking functions.
Global orderings $R^*$ were generated from the two types of ranking functions, and we then compared the accuracy of our sampling method with that of random sampling on these orderings. The outcomes were averaged over 20 runs. Figure 4.5 shows how the accuracy of selective sampling compares with that of random sampling for both the linear and the polynomial functions.
The SVM linear and polynomial kernels were used for the two cases respectively. The selective sampling method consistently outperforms random sampling on both types of functions. The accuracy at the first iteration is the same because both methods start with the same random samples; thereafter, selective sampling achieves higher accuracy at each iteration (i.e., for each number of training samples). Four samples were selected at each iteration.
4.5.3 Nottingham Breast Cancer Dataset (Feature Ranking)
The results of the different feature selection methods on the NTBC dataset when they work as pre-processors of Ranking SVM are presented in Fig. 4.6. It can be seen that IG performs the worst this time. When the number of features selected by IG is less than 30, the ranking accuracy is significantly below that of WAF.
(a) MAP of Ranking SVM
(b) NDCG@10 of Ranking SVM
Figure 4.6: Ranking accuracy of Ranking SVM with different feature selection methods on the NTBC dataset
On the other hand, 22 features or fewer are enough for both CHI and our algorithms to achieve good ranking accuracy, and our algorithms gradually outperform CHI as more features are added. For instance, in Figure 4.6 (a), selecting more features (from 15 to 20) helps to increase the MAP of Ranking SVM with our algorithms, whereas for CHI the MAP begins to decrease after 12 features have been selected. Our experiments demonstrate clearly that our algorithms are able to outperform both IG and CHI. For NDCG@10 and for RankNet, we observe similar tendencies and reach similar conclusions. In summary, our feature selection algorithms for ranking do outperform the feature selection methods proposed for classification, and also improve upon the baseline method without feature selection.
(a) MAP of RankNet
(b) NDCG@10 of RankNet
Figure 4.7: Ranking accuracy of RankNet with different feature selection methods on the NTBC dataset
4.5.4 Nottingham Breast Cancer Dataset (Samples Ranking)
In this section, we perform experiments with a real-life in-house dataset from the University of Nottingham. We extracted the immunohistochemistry data for all breast cancer patients, resulting in N = 1076 patients, each with attributes such as id, age, size, grade and stage. SVM linear kernels are used in this experiment.
Since complete orderings over real-world data are rare, it is hard to evaluate selective sampling in real situations. This section therefore focuses on showing the potential of the selective sampling method in data mining applications by reporting experimental results with real-life patients. We selected 10 different groups to test our system with different preferences and collected around 100 real selections.
It is worth mentioning that the perfect ordering $R^*$ intended by the selections remains unknown, and thus it is hard to make fair evaluations: it is not feasible for a user to provide a complete ordering over hundreds or thousands of patients or samples. We therefore evaluated the accuracy of the ranking function learned at one iteration against the partial ordering specified by the selection at the next iteration. That is, the accuracy of the ranking function learned at the $k$-th iteration is measured by the similarity between the partial ordering specified on the samples presented at the next iteration and the ordering generated by the learned function, with five samples presented at each iteration. This is called a measure of the expected accuracy, as it is an approximation evaluated over ten pairwise orderings (an ordering on five samples generates ten pairwise orderings). That is, "100% expected accuracy" means the function correctly ranks five randomly chosen samples.
This measure approximates the generalization performance of the ranking functions, as the samples of the next iteration are not part of the training data used to learn the current function. Further, this evaluation method can also be used to obtain fair evaluations, since the users are not aware of whether they are providing feedback or evaluating the functions at each round. However, this measure severely disfavours selective sampling: intuitively, selective sampling is most effective for learning precisely when the user's ordering on the presented samples is not what was expected from the previous iteration.
Thus, we used a random sampling for the study reported in this section.
However, note that selective sampling is expected to be more effective in
practice, as demonstrated in our experiments on the MADELON dataset in the
previous section and on real data.
Comparing random sampling (RAN) and selective sampling (SEL), we generated five samples at each iteration in each round, and measured both the accuracy of the learned ranking function and the response time. The experimental results were averaged over 20 runs. Highlights of the experiments are as follows. Figure 4.8 shows that SEL outperforms RAN from the second iteration onwards, although the two begin with similar accuracy because they start with the same random samples.
The size of the dataset does not affect the response time of RAN, and the time is about the same for both group 1 and group 2. SEL achieves higher accuracy in a shorter response time than RAN for both datasets, although SEL requires additional function evaluations. The dataset size does not affect its accuracy.
Figure 4.8: Accuracy convergence of random and selective sampling on NDBC
dataset
4.6 Discussion
From the results on the two datasets, the following observations are made:
1- Ranking performance is enhanced more by feature selection on the MADELON dataset than on the NDBC dataset. For instance, our feature selection methods improve performance on the MADELON dataset by about 10%, whereas feature selection improves performance on the NDBC dataset by just 5-6%.
2- Our proposed methods outperform IG and CHI, and the improvement is more significant on the MADELON dataset than on the NDBC dataset. For instance, on the MADELON dataset, ranking SVM with our methods is significantly better than with IG and CHI, whereas the improvement over IG is modest on the NDBC dataset. Further analysis was carried out to identify the reasons, as follows. We examined each feature as an individual ranking model against its MAP value, and sorted the features according to their MAP values. We found that the MADELON dataset contains useless or redundant features; for instance, there are 10 features whose MAP is smaller than 0.5. Feature selection therefore becomes necessary, not only to get rid of ineffective (or noisy) features, but also to enhance the final ranking performance. On the other hand, the relative effectiveness of most of the features in the NDBC dataset is not well differentiated, so the benefit of eliminating useless features is small.
Moreover, for the two datasets we examined the similarity between any two features (on the basis of Kendall's τ). Features in the MADELON dataset are clustered into many blocks, with highly similar features in the same block and less similar features in different blocks. Because our proposed technique penalises the total similarity score between selected features, only representative features are selected from each cluster, which reduces the redundancy among the features; as a result, our method performs better than the other feature selection methods. For the NDBC dataset, there are only two large blocks, with most features similar to each other. In this case, the similarity penalty in our approach does not work well, which is why the improvement of our method over the other methods is less significant.
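As an illustration of this similarity analysis (an assumption-laden sketch, not the thesis implementation), the snippet below measures the redundancy of two features by the Kendall's τ correlation between the sample orderings they induce, using scipy.stats.kendalltau; the synthetic features are invented for the example.

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_samples = 100
feature_a = rng.normal(size=n_samples)
feature_b = feature_a + 0.1 * rng.normal(size=n_samples)   # nearly redundant feature
feature_c = rng.normal(size=n_samples)                      # unrelated feature

# Similarity between two features = Kendall's tau between the sample
# orderings they induce; values near 1 indicate redundant features.
tau_ab, _ = kendalltau(feature_a, feature_b)
tau_ac, _ = kendalltau(feature_a, feature_c)
print(f"tau(a,b) = {tau_ab:.2f}, tau(a,c) = {tau_ac:.2f}")
```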
Based on the discussion above, we conclude that our method works very well when the effects of the features vary widely and there are redundant features. When applying our method in practice, therefore, one can first test these two aspects. For sample ranking, we used only SVM linear kernels in the experiments with NBC, because linear functions, while simple, are often expressive enough and have thus been a popular model for rank (or top-k) selection processing. However, deploying nonlinear functions might be necessary to deal with complex preferences that are not rankable by linear functions; nonlinear ranking functions can be learned directly using SVM nonlinear kernels.
4.7 Summary
In this chapter, we have proposed an optimization method for feature and sample selection in ranking. The contributions of this chapter include the following. We discussed the differences between classification and ranking, and made clear the limitations of existing feature and sample selection methods when applied to ranking. In addition, we proposed a novel method to select features and samples for ranking, in which the problem is formalized as an optimization problem: we maximize the total importance scores of the selected features and samples, and at the same time minimize the total similarity scores between the features as well as between the samples. We evaluated the proposed method using two datasets, with two ranking models, and in terms of a number of evaluation measures. Experimental results have validated the effectiveness and efficiency of the proposed method in improving the accuracy for both labelled and unlabelled data.
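The sketch below illustrates one plausible greedy instantiation of this formulation, not the exact optimization used in the thesis: it repeatedly picks the feature whose importance score, minus a weighted sum of its similarity to the already selected features, is largest. The importance scores, similarity matrix and trade-off weight lam are illustrative assumptions.

```python
import numpy as np

def greedy_select(importance, similarity, k, lam=0.5):
    """Greedily pick k features, maximizing total importance minus
    lam * total pairwise similarity among the selected features."""
    selected = []
    candidates = list(range(len(importance)))
    while len(selected) < k and candidates:
        def gain(j):
            return importance[j] - lam * sum(similarity[j][s] for s in selected)
        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected

importance = np.array([0.9, 0.85, 0.4, 0.8, 0.2])           # e.g. per-feature MAP
similarity = np.array([[1.0, 0.9, 0.1, 0.2, 0.0],            # e.g. Kendall's tau
                       [0.9, 1.0, 0.1, 0.3, 0.0],
                       [0.1, 0.1, 1.0, 0.1, 0.0],
                       [0.2, 0.3, 0.1, 1.0, 0.1],
                       [0.0, 0.0, 0.0, 0.1, 1.0]])
print(greedy_select(importance, similarity, k=3))
```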
Chapter 5
Ensemble weighted classifiers with accordance-based
sampling
The main aim of this chapter is to propose a new weighted voting classification ensemble method based on a classifier combination scheme and a novel multi-view sampling method, as a response to the high cost of supervised labelling of data. Active learning attempts to reduce the human effort needed to learn an accurate result by selecting only the most informative examples for labelling. Our work has focused on diverse ensembles for active learning in semi-supervised learning problems. The chapter is organised as follows. Majority voting classifier weights based on the accordance sampling method are discussed in further detail in the next section, followed by a description of the experiments carried out in section 5.3. In section 5.4, the results are presented and discussed. Lastly, conclusions and future work are drawn out in section 5.5.
5.1 Introduction
Recent developments in storage technology have made it possible for broad
areas of applications to rely on stream data for quick responses and rapid
decision-making [160]. One of the recent challenges facing data mining is to
digest the massive volumes of data collected from data stream environments
[160,161].
In the domain of classification, providing a set of labelled training
examples is essential for generating predictive models. It is well accepted that
labelling training examples is a costly procedure [162] which requires
comprehensive and intensive investigations on the instances, and incorrectly
labelled examples will significantly degrade the performance of the model built
from the data [162,165].
A common practice to address this problem is to use selective sampling methods to choose a limited number of instances to label, from which an accurate predictive model can be formed [163,164]. Selective sampling is a form of active learning that reduces the cost and number of training examples that need to be labelled by examining unlabelled examples and selecting the most informative ones [166,151].
A selective sampling method generally begins with a very small number of randomly labelled examples, carefully selects a few additional examples for which it requests labels, learns from the results of that request, and then uses its newly gained knowledge to carefully choose which examples to label next. The goal of selective sampling is to maximize the prediction accuracy while labelling only a very limited number of instances, and the main challenge is to identify the "important" instances that should be labelled in order to improve the model training, given that one cannot afford to label all samples [167]. A general practice is to employ rules for determining the most needed instances; for example, uncertainty sampling takes the instances on which the current learners have the highest uncertainty as the most needed instances for labelling. The intention is to label the instances on which the current learner(s) are most uncertain, so that providing labels for those instances helps improve the model training [168].
Classification is a predictive modelling task whose target variable is categorical. A multiple classifier model, or ensemble method, is a set of individual classifiers whose decisions are combined when classifying new patterns. In ensemble classification, many classifiers are combined to make a final prediction. There are many reasons for combining multiple classifiers to solve a given learning problem: ensemble classifiers generally perform better than a single classifier; multiple classifiers can exploit the different local behaviour of the individual classifiers to improve the accuracy of the overall system; and combining classifiers reduces the risk of picking an inadequate single classifier [167,169].
The final decision is usually made by voting after combining the
predictions from a set of classifiers. The use of an ensemble of classifiers has
gained wide acceptance in the machine learning and statistics community after
significant improvements in accuracy [151]. Two popular ensemble methods,
boosting and bagging, have received heavy attention. These methods are used
to resample or reweight training sets from the original data. Then a learning
algorithm is repeatedly applied for each of the resampled or reweighted
training sets [166,168].
Simple Majority voting is a decision rule that selects one of several alternatives based on the predicted class with the most votes, and is the decision rule used most often in ensemble methods. Weighted Majority voting can be applied if the decision of each classifier is multiplied by a weight reflecting the confidence in that classifier's decisions; simple Majority voting is a special case of weighted Majority voting [163,164].
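A minimal sketch of the two voting rules follows; the predictions and confidence weights are invented for illustration, and the function is not the thesis implementation.

```python
import numpy as np

def weighted_majority_vote(predictions, weights):
    """predictions: (n_classifiers, n_instances) array of class labels;
    weights: per-classifier confidence weights."""
    classes = np.unique(predictions)
    # Sum the weight of every classifier voting for each class, per instance.
    scores = np.array([
        (weights[:, None] * (predictions == c)).sum(axis=0) for c in classes
    ])
    return classes[np.argmax(scores, axis=0)]

preds = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 1]])
equal_w = np.ones(3)                  # simple Majority voting
conf_w = np.array([0.2, 0.5, 0.9])    # weighted Majority voting (illustrative weights)
print(weighted_majority_vote(preds, equal_w))
print(weighted_majority_vote(preds, conf_w))
```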
Our new sampling method generates multiple views and compares their results to identify the results that agree. Each view is a subset of the dataset in which certain attributes of all the data patterns are chosen and trained. For instance, in a two-view case, view 1 consists of a number of attributes and view 2 consists of the rest of the attributes. Once trained, the solutions from each view are compared with each other; if they agree, the corresponding examples are selected as informative examples. The scheme employed for ranking how informative they are uses two types of weight: a weight vector over classifiers and a weight vector over instances. The instance weight vector identifies the most informative instances and gives them high weights based on our new sampling method. The classifier weight vector puts large weights on the classifiers that give the largest number of predictions equal to the Majority on the most informative instances. Instances with higher weights play a more important role in determining the weights of the classifiers.
The aim is to fuse the outputs of multiple classifiers in the test phase using weights that are derived during training; this fusion of classifier outputs improves the classifiers' performance. Five classifiers are used: Support Vector Machine, Naïve Bayes, K-Nearest Neighbour, Logistic Regression and Neural Network. The common theme across these classifiers is that they attempt to minimize or maximize a single function. Our combined classifier is based on multiple functions computed from the local properties of the data; its main function is to search for an optimal set of weights, based on the result of majority voting of the classifiers on the training data, so as to maximize the recognition ability, and to apply these weights to the outputs of the classifiers during testing.
5.2 Background
5.2.1 Ensemble Weighted Classifier
The approach here revolves around five classifiers: Support Vector Machine, Naïve Bayes, K-Nearest Neighbour, Logistic Regression, and Neural Network. Under plain (unweighted) Majority voting, all of these classifiers are weighted equally. Consequently, if the classifiers predict an instance differently, the final decision can become arbitrary because of tied votes. Assuming the classifiers tend to classify the most informative instances, identified by the new multi-view sampling method, correctly, it is logical, when the classifiers disagree on an unknown instance, to grant more weight to the classifier that provides the largest number of predictions equal to the Majority.
Accurate evaluation of the most informative samples is therefore essential. The most informative samples can be thought of as those on which fewer classifiers make correct predictions. We therefore use the two weight vectors described above, namely a weight vector of classifiers and an instance weight vector, so our suggested process has two distinct phases. The weights of instances are proportional to the degree of informativeness of each instance, and these must be considered when assigning weights to classifiers; thus the weights for classifiers and the weights for instances depend on and are linked with each other. Through this cross-relationship between the performance of the classifiers and the informativeness of the instances, we can find the optimal weights for the classifiers and instances. These weights are found by an iterative procedure and are determined only by the performance matrix relating the instances and the classifiers. We do not need to assume prior knowledge of the behaviour of the individual classifiers.
Suppose we have n instances and k classifiers in an ensemble. Let X be an n × k performance matrix whose entries indicate whether a classification is right (1) or wrong (0), and let X′ be its transpose. Let Jij be an i × j matrix consisting of 1's, for any dimensions i and j. We also define 1n and 1k as n × 1 and k × 1 vectors of 1's. Finally, Ik denotes the k × k identity matrix.
1. Set the initial instance weight vector. Q0 gives higher weights to the rows of X with the fewest 1's (the most informative instances). The denominator simply normalizes the vector to unit length.
2. Calculate the classifier weight vector. Pm assigns higher weights to the more accurate classifiers after incorporating the instance weights Qm−1. Again, the denominator simply normalizes the vector to unit length.
3. Update the instance weight vector. Qm assigns higher weights to the most informative instances after incorporating the classifier weight vector Pm; the denominator simply normalizes the vector to unit length.
We provide a simple example illustrating the above algorithm. Suppose there are five classifiers and five instances in an ensemble. We define X = (x1 | x2 | x3 | x4 | x5) with x1 = (0,1,1,0,0)′, x2 = (0,1,1,1,0)′, x3 = (0,1,1,1,0)′, x4 = (0,1,1,1,0)′ and x5 = (0,1,0,1,1)′, where xi is the performance vector of the i-th classifier; 1 represents a correct decision by a classifier and 0 a wrong decision. We obtain the normalized weight vector on classifier decisions P* = (0.114, 0.769, 0.433, 0.314, 0.824)′ and the normalized weight vector on instances Q* = (0.032, 0.731, 0.439, 0.633, 0.453)′.
The classifier weights P* can be explained as follows: the accuracies of the classifiers are (0.75, 0.67, 0.5, 0.45, 0.75). Although the first and the fifth classifiers have the same error rate, more weight is given to the fifth classifier (0.824) than to the first (0.114), because the fifth classifier classified the most informative instance correctly; it is also the only classifier that made the correct decision for the fifth instance. The first instance has the lowest weight (0.032) in Q*. The least classifier weight is given to the first classifier (0.114) because it misclassified the instances with higher weights and is the most inaccurate of the five.
Regarding Q*, the highest weight is given to the second instance, on which all the classifiers made correct decisions. The least weight (0.032) is given to the first instance, since none of the classifiers made the right decision on it. Although the first and the fifth instances have the same accuracy, we note that their weights are different; this is due to the effect of P*. The first instance is misclassified by all classifiers and has the least weight, whereas the third instance is misclassified by the most important classifier, the fifth. When instances are equally informative, an instance on which the higher weighted classifier works better must get a higher value in Q*.
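The extracted text does not preserve the exact normalisation formulas, so the sketch below uses one plausible pair of updates (classifier weights proportional to X′Q, instance weights proportional to (1 − X)P, each normalised to unit length) purely to illustrate the cross-dependence between the two weight vectors; it will not reproduce the exact P* and Q* values quoted above.

```python
import numpy as np

def iterate_weights(X, n_iter=50):
    """X: n x k 0/1 performance matrix (1 = correct).  Illustrative updates:
    classifier weights favour classifiers that are correct on highly weighted
    instances; instance weights favour instances that few strong classifiers
    get right."""
    Q = (1.0 - X).sum(axis=1)            # start: fewer correct decisions -> larger weight
    Q = Q / np.linalg.norm(Q)
    for _ in range(n_iter):
        P = X.T @ Q                      # credit classifiers for correct, highly weighted instances
        P = P / np.linalg.norm(P)
        Q = (1.0 - X) @ P                # instances missed by strong classifiers stay informative
        Q = Q / np.linalg.norm(Q)
    return P, Q

X = np.array([[0, 0, 0, 0, 0],
              [1, 1, 1, 1, 1],
              [1, 1, 1, 1, 0],
              [0, 1, 1, 1, 1],
              [0, 0, 0, 0, 1]], dtype=float)   # rows: instances, columns: classifiers
P, Q = iterate_weights(X)
print(np.round(P, 3), np.round(Q, 3))
```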
5.2.2 Sampling Most Informative Sample Method (Multi-View
Sample MVS)
We propose a new method whose idea is to generate multiple views and to compare their results to identify the results that agree. Each view is a subset of the data set in which certain attributes of all the data patterns are chosen and trained; for instance, view 1 consists of attributes 1-5 and view 2 consists of attributes 6-25. Once trained, the solutions from each view are compared with each other. If they agree, the corresponding examples are selected as informative instances. In addition, a scheme for ranking how informative they are is employed, and the instances which rank highly are selected as the most informative instances to be used for training the classifiers and then for testing.
In effect, this uses random feature selection to generate different views, which are then trained and their results compared to identify agreement, so that informative instances can be selected; ranking is then used to find the most informative ones. In this way, less labelled data can be used to achieve higher classification accuracy.
Let V1 and V2 be the two view classifiers learned from the labelled training data L; both view classifiers are then used to classify every unlabelled example in U.
Our method trains the redundant view classifiers by learning from the most informative labelled examples [177,178]. The view classifiers are then used to classify the unlabelled examples, and the unlabelled examples on whose classification the two view classifiers agree the most are sampled. We use a ranking function to rank all the unlabelled instances according to the predictions of the view classifiers. The ranking score for an unlabelled instance xi is taken as the larger of the average predicted probabilities for the positive and negative classes produced by the two view classifiers. The scores generated result in a ranking in which the examples in the highest positions are the ones to which both view classifiers assign the same label with high confidence; these are the most informative unlabelled examples.
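A sketch of this two-view sampling step is given below, under stated assumptions: logistic regression models stand in for the view classifiers, the attribute split into two views is arbitrary, and the score is taken as the larger of the two views' average class probabilities for examples on which the views agree (as described in the text). It is not the thesis implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
labelled, unlabelled = np.arange(30), np.arange(30, 300)
view1, view2 = np.arange(0, 10), np.arange(10, 20)       # arbitrary attribute split

v1 = LogisticRegression(max_iter=1000).fit(X[labelled][:, view1], y[labelled])
v2 = LogisticRegression(max_iter=1000).fit(X[labelled][:, view2], y[labelled])

p1 = v1.predict_proba(X[unlabelled][:, view1])
p2 = v2.predict_proba(X[unlabelled][:, view2])
agree = p1.argmax(axis=1) == p2.argmax(axis=1)

# Score = larger of the average class probabilities; only agreed examples kept.
avg = (p1 + p2) / 2.0
score = np.where(agree, avg.max(axis=1), 0.0)
most_informative = unlabelled[np.argsort(score)[::-1][:10]]
print(most_informative)
```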
5.3 Experimental Design
In our experiments, we use 20 publicly accessible binary data sets and two multiclass data sets. Table 5.1 briefly illustrates their characteristics. Sixteen of them belong to the UCI Machine Learning Repository [105], five are from the UCI KDD Archive [157] and the last is the in-house Nottingham Tenovus Breast Cancer data set. To obtain a better measure of predictive accuracy, we compare the five classifiers using 10-fold cross-validation, where the average of the 10 fold estimates gives the cross-validation accuracy. The 10-fold cross-validation is repeated 100 times with different random partitions each time to give more stable estimates, and the average of the 100 cross-validation accuracies represents the final accuracy estimate for the data set.
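A minimal sketch of this evaluation protocol is shown below, assuming scikit-learn and a Naïve Bayes classifier as a stand-in for any of the five base learners; the data set and the details of the repetition loop are illustrative, not the thesis code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
repeat_means = []
for seed in range(100):                       # 100 repetitions with different partitions
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(GaussianNB(), X, y, cv=cv)
    repeat_means.append(scores.mean())        # average of the 10 fold estimates
print(f"final accuracy estimate: {np.mean(repeat_means):.3f}")
```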
Multiclass data sets: we select two multiclass data sets. One, from the UCI repository, is the dermatology data set, which contains 34 attributes (33 linear valued and one nominal), 366 instances and six classes. The second is the Nottingham Tenovus Breast Cancer data set, which contains three main clinical groups, Luminal, Basal and HER2, with six subgroups, for 1076 patients treated in the period 1986-1998, with immunohistochemical reactivity measured for 25 proteins with known relevance in breast cancer.
Support Vector Machine, K-Nearest Neighbour, Naïve Bayes, Logistic
Regression and Neural Network are employed as the base learning algorithms.
The generated classifiers are then combined to form an ensemble using our
weighted voting system. We performed all the experiments using Java
statistical packages. The SVM algorithm was implemented by the LIBSVM
package. We also implement the K-Nearest Neighbour algorithm available at
[169,170]. We use LIBLINEAR to implement Logistic Regression [171].
Classifier4J is a Java library designed to do classification. It comes with an
implementation of a Naïve Bayes classifier. We implement the Neural Network
from the Java library [180].
A number of parameters must be decided upon when designing the neural network (NN), SVM and KNN. For the NN, the parameters include the number of neurons per layer (50) and the number of training iterations (100); among the more important parameters in terms of training and network capacity are the number of hidden neurons, the learning rate and the momentum parameter (set to 0.1). In addition, kernels are used in transductive support vector machines to map the learning data (nonlinearly) into a higher-dimensional feature space in which the computational power of the linear learning machine is increased. The classifier uses RBF and polynomial kernels: the RBF kernel sets the regularization parameter to 200, while the polynomial kernels are set to 100. The SVM cost is set to 10 and gamma to 0.1. For K-Nearest Neighbours (KNN) we select only one parameter, the number of neighbours K.
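The following sketch shows how the explicitly quoted parameter values might be expressed with scikit-learn equivalents (the thesis itself used Java packages); the polynomial degree and the value of K are assumptions, since they are not stated above.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Neural network: 50 neurons per layer, 100 training iterations,
# learning rate and momentum set to 0.1 (values quoted in the text).
nn = MLPClassifier(hidden_layer_sizes=(50,), max_iter=100,
                   learning_rate_init=0.1, momentum=0.1, solver="sgd")

# SVM: cost C = 10 and gamma = 0.1 as quoted; RBF and polynomial kernels.
svm_rbf = SVC(kernel="rbf", C=10, gamma=0.1)
svm_poly = SVC(kernel="poly", degree=3, C=10, gamma=0.1)   # degree is an assumption

# K-nearest neighbours: the single parameter is the number of neighbours K.
knn = KNeighborsClassifier(n_neighbors=5)                   # K = 5 is an assumption
print(nn, svm_rbf, svm_poly, knn, sep="\n")
```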
5.4 Experimental Results and Discussion
For comparison purposes, we implemented five algorithms. The results given in Tables 5.2, 5.3 and 5.4 indicate the performance of the five algorithms, as well as our Majority voting system, for different numbers of cross-validation folds without using any special sampling method, in other words before using multi-view sampling. The results in Tables 5.5, 5.6 and 5.7 indicate the performance of all the algorithms for different numbers of cross-validation folds using the new multi-view sampling (MVS) method, in order to evaluate the effect it has on the five algorithms and on the Majority voting system.
We choose different numbers of cross-validation folds to evaluate our method because with fewer folds each training set contains fewer examples, and sparse training examples generally produce inferior learners; the advantage of having a small number of examples is training efficiency.
Table 5.1: Summary of the features of sets of data employed for assessment.

Data Set      Dimensionality   Labelled   Unlabelled   Total   #classes
Dermatology         35            220         146        366       6
NBDC                25            663         413       1076       6
Diabetes             8            268         500        786       2
Heart                9            120         150        270       2
WDBC                14            357         212        569       2
Austra              15            307         383        690       2
House               16            108         124        232       2
Vote                16            168         267        435       2
Vehicle             16            218         217        435       2
Hepatitis           19            123          32        155       2
Labor               26             37          20         57       2
Ethn                30           1310        1320       2630       2
Ionosphere          34            225         126        381       2
kr_vs_kp            40           1527        1669       3196       2
Isolet              51            300         300        600       2
Sonar               60            111          97        208       2
Colic               60            136         232        368       2
Credit-g            61            300         700       1000       2
BCI                117            200         200        400       2
Digital            241            734         766       1500       2
COIL2              241            750         750       1500       2
g241n              241            748         752       1500       2
From our results, SVM gives the best performance compared with the other algorithms when they are evaluated without using any particular sampling method. For two-fold cross-validation, SVM achieves the best prediction accuracy on 16 out of 22 data sets. The performance of SVM improves further when using 5- and 10-fold cross-validation, where it gives the best performance on 18 and 20 of the 22 data sets respectively.
Table 5.2: Predictive accuracy of each algorithm under two-fold cross-validation compared with the Majority voting system (Bold 1st, Italic 2nd)
Data Set
Dermatology
NTBC
SVM
87.82
87.61
LR
84.12
83.92
KNN
81.36
82.90
NN
71.02
75.82
NB
82.46
82.23
Majority
88.17
89.62
Diabetes
77.36
84.16
77.02
79.74
Heart
76.22
73.46
83.13
85.62
80.68
83.62
83.30
WDBC
90.56
92.36
84.22
77.21
91.82
83.37
91.82
Austra
70.56
62.86
66.85
60.75
71.81
House
85.12
85.64
88.09
89.04
Vote
89.65
70.19
72.22
86.56
64.37
68.77
69.22
Vehicle
84.56
70.75
80.53
66.25
76.73
82.07
81.88
78.05
Hepatitis
87.07
86.49
88.02
86.55
88.48
Labor
90.98
63.92
57.66
57.42
58.64
Ethn
65.87
63.22
61.54
66.08
65.97
65.22
68.46
Ionosphere
90.03
89.14
86.42
87.03
89.49
71.27
90.02
kr_vs_kp
80.76
77.54
76.93
76.25
79.39
79.52
Isolet
80.70
82.48
76.66
77.28
79.13
80.91
Sonar
61.36
58.26
59.02
58.72
61.73
Colic
72.29
70.41
71.02
69.14
71.99
62.92
72.24
Credit-g
85.51
82.54
83.92
82.14
84.59
83.12
BCI
91.53
89.72
90.22
86.23
90.35
89.35
Digital
83.31
81.46
82.07
80.19
83.00
83.20
COIL2
86.84
87.24
84.94
85.44
82.46
85.27
86.73
85.53
86.92
87.13
86.69
88.39
g241n
Comparing all five algorithms, we can easily see that the Support Vector Machine (SVM) with two-fold cross-validation and the multi-view sampling (MVS) method gives the best performance on 12 data sets out of 22. In addition, SVM ranks as the second best algorithm on the remaining data sets, except for the Diabetes data set. For five-fold cross-validation, the performance of SVM improves further: it gives the best performance on 14 of the 22 data sets and ranks second on the rest, again except for the Diabetes data set. Using 10-fold cross-validation adds two more data sets, so SVM gives the best performance on 16 data sets, although the Diabetes data set is still not ranked highly.
Table 5.3: Predictive accuracy of each algorithm under five-fold cross-validation compared with the Majority voting system (Bold 1st, Italic 2nd)
Data Set
Dermatology
NTBC
Diabetes
SVM
89.41
89.03
82.38
KNN
81.93
82.96
81.79
NN
72.42
75.8
83.04
NB
82.44
82.51
78.13
Majority
91.46
91.97
83.04
86.21
LR
84.32
84.64
83.29
76.42
Heart
74.25
81.52
78.94
84.86
WDBC
92.01
92.49
83.94
77.21
92.33
95.24
Austra
72.10
72.93
63.09
66.94
61.79
75.24
House
90.62
88.88
85.20
86.08
87.84
93.11
Vote
71.11
73.19
66.37
65.96
69.93
72.93
Vehicle
85.30
81.34
76.54
81.88
81.72
86.22
Hepatitis
90.76
88.10
86.79
87.83
87.38
91.60
Labor
65.74
61.73
58.14
59.88
65.73
68.44
Ethn
71.32
67.85
64.57
62.84
66.67
72.28
Ionosphere
92.14
89.15
88.24
87.15
88.35
93.25
kr_vs_kp
82.48
76.99
77.92
76.18
78.33
83.14
Isolet
82.00
82.53
75.89
76.22
78.17
84.40
Sonar
63.32
58.55
59.83
59.24
61.51
64.95
Colic
73.22
70.53
71.97
69.67
72.25
74.55
Credit-g
86.65
83.41
82.62
82.39
82.24
87.31
BCI
92.61
89.85
89.74
87.68
89.73
Digital
87.26
81.53
83.16
81.70
82.78
93.75
84.98
COIL2
88.60
89.45
86.07
85.42
82.43
85.26
88.32
85.53
87.27
87.14
86.94
91.51
g241n
The Neural Network (NN) is not an appropriate option for multi-view sampling across almost all the data sets, and its performance is consistently worse than the other methods, regardless of whether the classification problems are binary or multi-class. Although Logistic Regression (LR) with MVS outperforms SVM with MVS quite often on binary-class data sets, its performance is otherwise unsatisfactory and almost always inferior to SVM. The results from LR are nevertheless surprisingly good, and they generally improve as we increase the number of cross-validation folds, whether we use MVS or not.
Table 5.4: Predictive accuracy of each algorithm under 10-fold cross-validation
compared with the Majority voting system (Bold 1st, Italic 2nd)
Data Set
Dermatology
NTBC
SVM
89.65
90.37
LR
86.09
85.21
KNN
82.22
83.17
NN
72.70
75.93
NB
84.65
83.23
Majority
92.08
92.97
Diabetes
84.02
78.38
85.61
83.12
79.12
Heart
78.19
76.23
84.19
82.95
Wdbc
87.75
92.27
87.66
87.30
92.60
84.91
77.32
92.52
95.86
Austra
72.24
71.62
65.31
67.36
62.84
House
89.08
86.58
87.15
88.56
Vote
93.87
72.04
75.43
93.86
74.88
69.61
66.82
69.89
74.35
Vehicle
87.21
81.59
77.06
83.33
80.41
88.92
Hepatitis
91.11
88.41
87.62
88.23
89.09
92.74
Labor
67.97
65.18
58.58
59.16
66.82
69.16
Ethn
73.17
68.86
63.86
62.51
67.92
75.04
Ionosphere
kr_vs_kp
92.56
82.97
89.22
77.58
88.42
78.35
88.36
78.22
89.78
80.33
93.82
83.76
Isolet
82.39
81.62
78.09
78.26
79.20
85.79
Sonar
63.38
59.60
59.68
59.43
61.92
65.64
Colic
74.47
70.66
72.46
70.27
72.62
75.19
Credit-g
86.36
83.65
83.15
82.42
84.22
87.22
BCI
92.66
90.06
90.92
89.76
90.27
93.97
Digital
85.36
82.79
83.41
81.07
83.32
88.26
COIL2
88.95
86.17
85.58
82.52
85.50
90.45
g241n
89.11
85.71
88.24
87.51
88.82
92.18
Naïve Bayes (NB), surprisingly, achieves the best predictive accuracy with two-fold cross-validation and multi-view sampling for four data sets. Its performance then decreases relative to the other algorithms, and it obtains the best predictive accuracy for only one data set when using MVS with 5- and 10-fold cross-validation. From the results, it can be seen that MVS improves the accuracy of all the algorithms; the extent of the improvement differs from one algorithm to another and from one data set to another.
Comparing the different algorithms, the results again confirm that the Majority voting system produces considerably higher performance over all data sets than the other algorithms. This is very encouraging because our results show that, even when the base classifiers are incapable of estimating each instance's class probabilities accurately, our approach may still perform well. This is because Majority voting relies on the variance of the base classifiers' probability estimates, not on the absolute probability values. The results also show that the Majority voting system benefits very clearly from the multi-view sampling method.
Table 5.5: Predictive accuracy of each algorithm under 2-fold cross-validation comparing Majority voting system with multi-view sampling method (Bold 1st, Italic 2nd)

Data Set      SVM     LR      KNN     NN      NB      Majority
Dermatology   92.82   90.02   86.66   77.22   87.36   94.47
NTBC          90.91   88.12   86.50   80.32   85.43   94.22
Diabetes      80.56   88.26   80.52   84.14   88.72   88.12
Heart         86.20   80.02   76.66   87.23   83.48   87.57
WDBC          93.56   96.26   87.52   81.41   94.72   96.12
Austra        73.66   76.22   66.26   71.15   63.75   76.21
House         92.55   90.36   88.32   89.74   90.89   93.24
Vote          74.79   76.25   71.15   70.17   73.27   75.12
Vehicle       88.76   85.63   81.23   87.47   85.98   83.55
Hepatitis     94.43   91.42   90.24   92.67   89.90   93.23
Labor         67.82   62.46   61.62   63.74   69.88   70.42
Ethn          72.26   70.57   67.32   66.54   69.67   76.37
Ionosphere    93.23   93.24   89.92   91.43   92.59   94.52
kr_vs_kp      85.76   83.44   82.23   82.45   84.29   85.82
Isolet        83.86   86.54   80.12   81.64   82.19   85.37
Sonar         64.66   62.46   62.62   63.22   64.93   67.52
Colic         76.29   75.31   75.32   74.34   75.89   77.54
Credit-g      89.81   87.74   88.52   87.64   88.79   88.72
BCI           95.23   94.32   94.22   91.13   93.95   94.35
Digital       86.26   85.31   85.32   84.34   85.85   87.45
COIL2         89.66   88.66   88.56   86.48   87.99   90.85
g241n         92.24   91.43   92.22   93.33   91.59   94.69
Tables 5.5, 5.6 and 5.7 indicate that the results for the multi-class data sets are no worse than for the binary-class data sets, which shows that multi-view sampling is not more challenging for a multi-class data set than for a binary-class one. The performance for both binary and multi-class problems shows significant improvement using MVS for all the data sets compared with the results in Tables 5.2, 5.3 and 5.4, which evaluate the algorithms without any particular sampling method, using 2-, 5- and 10-fold cross-validation to validate the prediction accuracy.
In addition, it can be seen from Table 5.8 that MVS improves the accuracy of the Majority voting system for almost all the data sets. Increasing the number of cross-validation folds increases the prediction accuracy of the Majority voting system for all the data sets when no sampling method is used, but this was not always the case when we implemented MVS: some data sets then gave worse results as the number of folds increased.
For multi-view sampling selection, we observed that its performance is better than random selection almost all the time, for all the data sets. Since the instance selection procedure is replaced by a distribution-based measure instead of the uncertainty measure, this leads to the conclusion that instance distribution is more effective than uncertainty-based measures for finding the most "important" samples for labelling. We believe that the reason the two-fold results were less accurate than the others is the reliance on sample distributions to select "important" instances for labelling. Note that sample distributions do not necessarily indicate directly whether a sample is "important" for labelling or not, even if the method accurately captures the data distributions.
Figure 5.1: Performance of SVM, LR, KNN, NN, NB and Majority under 10-fold cross-validation with random sampling vs. multi-view sampling
Table 5.6: Predictive accuracy of each algorithm under 5-fold cross-validation comparing Majority voting system with multi-view sampling method (Bold 1st, Italic 2nd)

Data Set      SVM     LR      KNN     NN      NB      Majority
Dermatology   93.41   90.22   87.23   78.62   87.34   95.66
NTBC          91.33   88.84   86.56   80.30   85.71   94.47
Diabetes      84.58   87.39   85.29   87.44   81.23   85.44
Heart         88.11   80.22   77.45   85.62   81.74   86.96
WDBC          94.01   96.39   87.24   81.41   95.23   97.44
Austra        74.20   76.93   66.49   71.24   64.79   77.54
House         92.52   92.68   88.40   90.18   90.64   95.21
Vote          74.71   78.69   71.27   71.76   74.43   76.73
Vehicle       88.50   86.44   81.04   87.28   85.82   89.62
Hepatitis     93.21   92.45   90.54   92.48   90.73   94.25
Labor         68.64   66.53   62.34   64.98   69.53   71.54
Ethn          74.12   72.55   68.67   67.84   70.37   75.28
Ionosphere    94.34   93.25   91.74   91.55   91.45   95.65
kr_vs_kp      86.48   82.89   83.22   82.38   83.23   87.34
Isolet        84.16   86.59   79.35   80.58   81.23   86.76
Sonar         65.62   62.75   63.43   63.74   64.71   67.45
Colic         76.22   75.43   76.27   74.87   76.15   77.75
Credit-g      89.95   88.61   87.22   87.89   86.44   90.81
BCI           95.31   94.45   93.74   92.58   93.33   96.65
Digital       89.21   85.38   86.41   85.85   85.63   87.13
COIL2         90.42   89.79   88.54   86.45   87.98   90.34
g241n         93.45   91.43   92.57   93.34   91.84   95.71
When a data set experiences a significant class distribution change, this
immediately has an impact on the classification accuracy. Of course, the actual
accuracy relies not only on the class distributions, but also on the complexity of
the decision surfaces. The results here indicate that changing class distributions
is a challenge for active learning from data sets, especially for multi-class data
sets.
Table 5.7: Predictive accuracy of each algorithm under 10-fold cross-validation comparing Majority voting system with multi-view sampling method (Bold 1st, Italic 2nd)

Data Set      SVM     LR      KNN     NN      NB      Majority
Dermatology   93.65   91.99   87.52   78.90   89.55   95.78
NTBC          92.67   89.41   86.77   80.43   86.43   94.97
Diabetes      86.22   82.48   89.11   87.52   82.22   89.56
Heart         89.65   81.99   79.43   88.29   85.75   88.90
WDBC          94.27   96.50   88.21   81.52   95.42   97.56
Austra        74.34   75.62   68.71   71.66   65.84   77.23
House         95.77   92.88   89.78   91.25   91.36   95.46
Vote          75.64   80.38   74.51   72.62   74.39   77.65
Vehicle       90.41   86.69   81.56   88.73   84.51   91.82
Hepatitis     93.56   92.76   91.37   92.88   92.44   94.89
Labor         70.87   69.98   62.78   64.26   70.62   71.76
Ethn          75.97   73.56   67.96   67.51   71.62   77.54
Ionosphere    94.76   93.32   91.92   92.76   92.88   95.72
kr_vs_kp      86.97   83.48   83.65   84.42   85.23   87.46
Isolet        84.55   85.68   81.55   82.62   82.26   87.65
Sonar         65.68   63.80   63.28   63.93   65.12   67.64
Colic         77.47   75.56   76.76   75.47   76.52   77.89
Credit-g      89.66   88.85   87.75   87.92   88.42   90.22
BCI           95.36   94.66   94.92   94.66   93.87   96.37
Digital       87.31   86.64   86.66   85.22   86.17   89.91
COIL2         90.77   89.89   88.70   86.54   88.22   91.97
g241n         93.11   91.61   93.54   93.71   93.72   95.88
Table 5.8 summarizes the accuracy statistics from 30 runs, where the mean, standard error, median, standard deviation, sample variance, minimum, maximum and confidence level values are reported for eight data sets: Dermatology, NTBC, WDBC, Austra, kr_vs_kp, Ionosphere, COIL2 and g241n. We select the two multiclass data sets and pairs of data sets with small, medium and large dimensionality respectively.
Table 5.8: Predictive accuracy of each number of cross-validations for Majority voting system with random sampling compared to multi-view sampling method using the whole data.

              Majority                      Majority MVS
Data Sets     2-fold   5-fold   10-fold    2-fold   5-fold   10-fold
Dermatology   88.17    91.46    92.08      94.74    95.66    95.78
NTBC          89.62    91.97    92.97      94.22    94.47    94.97
Diabetes      83.62    83.04    87.66      88.12    85.44    89.56
Heart         83.37    84.86    87.3       87.57    86.96    88.9
WDBC          91.82    95.24    95.86      96.12    97.44    97.56
Austra        71.81    75.24    75.43      76.21    77.54    76.23
House         89.04    93.11    93.86      93.22    95.21    95.46
Vote          69.22    72.93    74.35      75.12    76.44    77.56
Vehicle       78.05    86.22    88.92      83.55    89.62    91.82
Hepatitis     88.48    91.6     92.74      93.23    94.25    94.89
Labor         65.22    68.44    69.16      70.42    71.54    71.76
Ethn          71.27    72.28    75.04      76.37    75.28    77.54
Ionosphere    90.02    93.25    93.82      94.52    95.65    95.72
kr_vs_kp      79.52    83.14    83.76      85.82    87.34    87.46
Isolet        80.91    84.4     85.79      85.52    86.76    87.65
Sonar         62.92    64.95    65.64      67.37    67.45    67.64
Colic         72.24    74.55    75.19      77.54    77.75    77.89
Credit-g      83.12    87.31    87.22      88.72    90.81    90.22
BCI           89.35    93.75    93.97      94.35    96.65    96.37
Digital       83.2     84.98    88.26      87.45    87.13    89.91
COIL2         86.73    88.32    90.45      90.85    90.34    91.97
g241n         88.39    91.51    92.18      94.69    95.71    95.88
Figure 5.1 illustrates the performance of Majority vs. Majority MVS on the 22 data sets across different dimensionalities. Our empirical experience suggests that the performance of Majority does not change greatly between successive cross-validation settings and gradually levels out; the largest improvement was for the Vehicle data set, at around 12%.
Figure 5.1 also shows the results for the other methods. For SVM vs. SVM MVS across various dimensionalities, the improvement in prediction accuracy from SVM to SVM MVS was between 2% and 4% for all of the data sets, while the accuracy improvement for LR vs. LR MVS was higher, at 2-6%. The Vote data set showed the greatest accuracy improvement when using multi-view sampling with LR, at 6%. NN vs. NN MVS showed the largest number of enhanced data sets, at four, and its highest improvement in prediction accuracy was 6%.
Figure 5.2: Performance of SVM, LR, KNN, NN, NB and Majority under varying numbers of cross-validation folds with the multi-view sampling method.
Figure 5.1 additionally illustrates the performance of KNN vs. KNN MVS for the 22 data sets across different dimensionalities. The experiments indicate that, in most cases, the performance of KNN changes with successive cross-validation settings, and the improvement when using MVS was between 3% and 6% for all the data sets.
Lastly, Figure 5.1 shows the performance of NB vs. NB MVS on all the data sets across different dimensionalities. The results show that, in all cases, the performance of NB increases when using MVS compared with the results before implementing MVS. NB performs very well for some data sets, with a prediction accuracy growth of 8%, while for most cases the improvement was between 2% and 5%. Figure 5.2 plots accuracy against different numbers of cross-validation folds using SVM, LR, KNN, NN, NB and the Majority weighted ensemble classifier for different data sets with the multi-view sampling method. The graph indicates that for data sets like Austra, BCI and g241n, the number of folds affected the prediction accuracy in an unstable way. For the Majority voting system, the best prediction accuracy was obtained with 5-fold cross-validation, with 10-fold validation second (large dimensionality).
For the Ethn data set the different numbers of folds did not have any major effect on the Majority voting system. On the other hand, for the Digital, COIL2 and Diabetes data sets the changing number of folds affects the prediction accuracy in an unstable way, although 10-fold cross-validation is highly recommended because the accuracy improves from 87% to around 90%. Lastly, the prediction accuracy of the Dermatology, Vehicle and Hepatitis data sets increases steadily as we increase the number of folds (small dimensionality).
Figure 5.3 shows accuracy against the mean percentage using SVM, LR, KNN, NN, NB and Majority for the above eight data sets with the implementation of MVS. We observed that Majority was able to produce a higher average classification accuracy than the other methods; for some data sets, Majority achieved a maximum accuracy of 97%. SVM also performed well, achieving an average accuracy of 93%. Majority achieved high accuracy for both binary and multiclass problems. The error bars show the average positive and negative error for the different methods. The results in Figure 5.3 show that the Majority voting system consistently outperforms the other methods across all data sets. We used the paired t-test to compare the mean difference between Majority and the other methods for the dermatology data set (P < 0.001, paired t-test; Fig. 5.3), and found no significant mean difference between Majority and SVM, LR, KNN, NN and NB (0.0029, 0.0016, 0.0137, 0.00136 and 0.00424 respectively).
The probability is extremely high, so the means are not significantly different for NTBC, WDBC, kr_vs_kp and g241n. On the other hand, the Austra, Ionosphere and COIL2 data sets show a highly significant mean difference (P < 0.001, paired t-test).
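A sketch of such a paired comparison is shown below, using scipy's paired t-test; the per-run accuracy values are invented for illustration and do not correspond to the reported results.

```python
import numpy as np
from scipy.stats import ttest_rel

# Illustrative per-run accuracies (e.g. 30 repeated runs) for two methods.
rng = np.random.default_rng(1)
majority_acc = 0.92 + 0.01 * rng.normal(size=30)
svm_acc = 0.90 + 0.01 * rng.normal(size=30)

t_stat, p_value = ttest_rel(majority_acc, svm_acc)   # paired t-test across runs
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```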
Figure 5.3: Error bar and performance of SVM, LR, KNN, NN, NB and
Majority for different dimensionality number sizes with multi-view sampling
method (d - number of dimensions).
This observation asserts that, for concept-drifting data streams with constantly changing class distributions and continually evolving decision surfaces, Majority can adaptively label instances and build a superior classifier ensemble. The advantage of Majority can be observed across different types of data streams (binary-class and multi-class), because it takes advantage of the individual strength of each classifier or method.
5.4.1 Runtime Performance Study
In Figure 5.4, we report the system runtime performance in response to
different numbers of cross-validations. The x-axis in Figure 5.4 denotes the
number of cross-validations and the y-axis denotes the average system runtime.
Comparing all five methods, the runtimes of KNN and NB are very close to
each other, with NB slightly more efficient than KNN. Not surprisingly, LR
has demonstrated itself to be the most efficient method due to the nature of its
simple random selection. NB and KNN are at the second tier because instance
labelling involves a recursive labelling and retraining process.
Similar to NB and KNN, NN also requires a recursive instance
selection process plus a number of local training runs to build classifiers from
each cross-validation. Consequently, NN is less efficient than KNN and NB.
The proposed SVM method is a time-consuming approach mainly because the
calculation of the ensemble variance and the weight updating require additional
scanning for each cross-validation. On average, when the number of cross-validations is 500 or less, the runtime of SVM is about 2 to 4 times longer than that of its peers. The larger the number of cross-validations, the more expensive SVM becomes, because the weight updating and instance labelling require more time.
Figure 5.4: System runtime with respect to different numbers of cross-validations
5.5 Summary
We proposed a new research topic on active learning with different fold
data sizes, where the data volumes continuously increase and data concepts
dynamically develop, and the objective is to label a portion of data to form a
classifier ensemble with the highest accuracy rate in predicting future samples.
We studied the connection between a classifier ensemble's variance and its prediction accuracy, and showed that minimizing a classifier ensemble's variance is equivalent to maximizing its prediction accuracy. We derived an optimal Majority weighting method to assign weight values to the base classifiers, such that they form an ensemble with maximum prediction accuracy. Following the above derivations, we proposed a Majority voting system for active learning from different fold data sizes, where the key is to label instances which are responsible for a large variance value in the classifier ensemble.
Our intuition was that providing class labels for such instances can significantly reduce the variance of the classifier ensemble, and therefore maximize the prediction accuracy. Experimental results on synthetic and real-world data showed that the dynamic nature of data streams poses significant challenges to existing active learning algorithms, especially when dealing with multiclass problems. Applying random sampling either globally or locally results in good performance in practice.
The proposed Majority voting system and active learning framework
address these challenges using a variance measure to guide the instance
classification process, followed by the voting weight to ensure that the instance
labelling process can classify the future sample in the most appropriate class.
Chapter 6
Examination of TSVM Algorithm Classification Accuracy with
Feature Selection in Comparison with GLAD Algorithm
The chapter is organized as follows. Section 6.2 presents the literature background regarding Support Vector Machines and Transductive Support Vector Machines, and also deals with recursive feature elimination. Section 6.3 elaborates on the TSVM algorithm together with RFE, and then summarizes the GLAD algorithm in a manner that enables us to compare the estimation precision of the two algorithms. Section 6.4 provides an analysis and comparison of the outcomes of the empirical execution of both algorithms, TSVM and GLAD. Finally, a brief conclusion is provided in section 6.5.
6.1 Introduction
Data mining techniques have traditionally been used to extract hidden predictive information in many diverse contexts, usually from data sets containing thousands of examples. Recently, the growth in biology, medical science and DNA analysis has led to the accumulation of vast amounts of biomedical data that require in-depth analysis. After years of research and development, many data mining, machine learning and statistical analysis systems and tools are available and have been used in bio-data exploration and bio-data analysis. Consequently, this chapter examines a relatively new technique in data mining called semi-supervised support vector machines (S3VMs), also known as transductive support vector machines [172], which lie between supervised learning with fully labelled training data and unsupervised learning without any labelled training data [173]. In this method we use both labelled and unlabelled samples for training: a small amount of labelled data with a large amount of unlabelled data. This chapter examines the performance of transductive SVMs combined with a feature selection method called recursive feature elimination (RFE), used to select molecular descriptors for transductive support vector machines (TSVM). We used the LIBSVM open source machine learning library, which implements support vector machines (SVMs) supporting classification and regression [116]; we modified some of the library code to extend it for use with TSVM along with RFE.
6.2 Background
6.2.1 Support Vector Machines
A support vector machine (SVM) is a discriminative classifier defined by a separating hyperplane: given labelled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorises new instances. The SVM algorithm finds the hyperplane that gives the training instances the largest minimum distance; in the theory of SVM, this distance is called the margin. Consequently, the optimal separating hyperplane maximizes the margin of the training data. A separating line should not pass too close to the data points, because it would then be sensitive to noise and would not generalize properly; a line passing as far as possible from all the points is therefore preferable.
Support vector machines (SVMs), a supervised machine learning method, are useful in many fields of biomedical research, such as the assessment of microarray expression data [174], the detection of remote protein homologies [175] and the identification of translation initiation sites [176]. SVMs can not only properly categorize objects, but also identify instances for which no category information is available [177]. The SVM method relies on training samples in order to specify in advance which data should be grouped together [174].
Figure 6.1: Multi Margin vs. SVM Maximum Margin optimal hyperplane
separation
6.2.2 Transductive Support Vector Machines
Transductive learning is a method that is closely connected with semisupervised learning, where semi-supervised learning is intermediate between
supervised and unsupervised. Vladimir Vapnik introduced Support Vector
Machines Semi-Supervised Learning in the 1990s. This was motivated by his
view that transduction (TSVM) is preferable to induction (SVM), since the
induction needs to solve more general problems (inferring a function) before
being able to solve a more detailed problem (computing outputs for new cases)
[178,179].
TSVM seeks a hyperplane and a labelling of the unlabelled examples such that the SVM objective function is minimized, subject to the constraint that a fraction of the unlabelled data is classified as positive. SVM margin maximization in the presence of unlabelled examples can be interpreted as an implementation of the cluster assumption.
The Transductive Support Vector Machine attempts to maximize the
hyperplane classifier between two classes, using labelled training data, while at
the same time forcing the hyperplane to be far away from the unlabelled
samples. TSVM seems to be a perfect semi-supervised learning algorithm
because it combines the regularization of the Support Vector Machine with the
straight implementation of the clustering assumption [180].
In semi-supervised learning, a labelled sample $\{(x_i, y_i)\}_{i=1}^{n}$ is observed together with an independent unlabelled sample $\{x_j\}_{j=n+1}^{n+m}$, where each $x_i$ is a $d$-dimensional input and $y_i \in \{-1, +1\}$. The labelled pairs are independently and identically distributed according to an unknown distribution $P(x, y)$, and the unlabelled inputs are distributed according to the marginal distribution $P(x)$.
TSVM is based on the idea of maximizing the separation between labelled and unlabelled data (see Vapnik [178]). It deals with an objective of the form
\[
\min_{f,\; y_{n+1},\dots,y_{n+m}} \; C \sum_{i=1}^{n} L\big(y_i f(x_i)\big) \; + \; C^{*} \sum_{j=n+1}^{n+m} L\big(y_j f(x_j)\big) \; + \; \tfrac{1}{2}\|f\|^{2},
\]
where $f$ represents a decision function in $\mathcal{F}$, a candidate function class, $L$ indicates the classification loss, $C$ and $C^{*}$ are tuning constants, and the penalty on $\|f\|$ is inversely related to the geometric separation margin. In the linear case, $f(x) = w \cdot x + b$ and $\|f\| = \|w\|$. In the nonlinear kernel case, $f(x) = \sum_i \alpha_i K(x_i, x) + b$, where $K$ is a kernel satisfying Mercer's condition to confirm that $\|f\|$ is well defined, with $\|\cdot\|$ being a proper norm (see [176,177] for more details).
Minimizing this objective with respect to $f$ and the unknown labels of the unlabelled data is non-convex; it can be solved through integer programming and is known to be NP-hard [182]. To solve it, Joachims [181] proposed an efficient local search algorithm that is the basis of SVMLight. This algorithm may fail to deliver a good local solution, resulting in worse performance of TSVM than of SVM. This aspect is confirmed by our numerical results, as well as by empirical studies in the literature. Chapelle and Zien [176] aimed to correct this problem by approximating the objective by a smooth problem optimised through gradient descent, and [183] used an extended bundle method to treat the non-convexity and non-smoothness of the cost function.
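As a concrete illustration of this objective, the sketch below evaluates a transductive SVM loss for a linear decision function, assuming a hinge loss and trade-off constants C and C*; it only computes the objective for a fixed hyperplane and does not attempt the non-convex optimisation discussed above. All data and parameter values are invented.

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def tsvm_objective(w, b, X_lab, y_lab, X_unl, C=1.0, C_star=0.1):
    """Transductive SVM objective for a linear decision function f(x) = w.x + b:
    margin term + hinge loss on labelled data + symmetric hinge loss that
    pushes the hyperplane away from the unlabelled points."""
    f_lab = X_lab @ w + b
    f_unl = X_unl @ w + b
    return (0.5 * w @ w
            + C * hinge(y_lab * f_lab).sum()
            + C_star * hinge(np.abs(f_unl)).sum())

rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(10, 3)), rng.choice([-1, 1], size=10)
X_unl = rng.normal(size=(40, 3))
print(tsvm_objective(np.array([0.5, -0.2, 0.1]), 0.0, X_lab, y_lab, X_unl))
```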
TSVMs [180] enhance the generalization accuracy of SVMs [182]
based on unlabelled data. Both TSVMs and SVMs aim to maximize the margin
of the hyperplane classifier based on labelled training data, while TSVM is
distinguished by pushing the hyperplane away from the unlabelled data.
One way of justifying this algorithm in the context of semi-supervised
learning is that one is finding a decision boundary that lies in a region of low
density, implementing the so-called cluster assumption (see e.g. [176]). In this
framework, if you believe the underlying distribution of the two classes is such
that‎there‎is‎a‎“gap”‎or‎low‎density‎region‎between‎them,‎then‎TSVMs‎can‎help‎
because they select a rule with exactly those properties.
Vapnik [178] has a different interpretation of the success of TSVMs
that is rooted in the idea that transduction (labelling a test set) is inherently
easier than induction (learning a general rule). In either case, experimentally it
seems clear that algorithms such as TSVMs can give considerable
improvement in generalization over SVMs, if the number of labelled points is
small and the number of unlabelled points is large. Unfortunately, TSVM
algorithms (like other semi-supervised approaches) are often unable to deal
with a large number of unlabelled examples. The first implementation of
TSVM appeared in [184], using an integer programming method that is
intractable for large problems. Joachims [181] then proposed a combinatorial
approach known as SVMLight-TSVM, which is practical for a few thousand
examples.
A sequential optimization procedure is introduced in [182] that could potentially scale well, although their largest experiment used only 1000 examples. However, their method was for the linear case only, and used a special kind of SVM with a 1-norm regularizer in order to retain linearity. Finally, Chapelle and Zien [176] proposed a primal method, which turned out to show improved generalization performance over the previous approaches, but still scales as (L+U)^3, where L and U are the numbers of labelled and unlabelled examples; this method also stores the entire (L+U) × (L+U) kernel matrix in memory. Other methods [177,179] transform the non-convex transductive problem into a convex semi-definite programming problem that scales as (L+U)^4 or worse.
Figure 6.2: Separation hyperplane for (semi-supervised data)
6.2.3 Recursive Feature Elimination
An enormous size of data set negatively affects the performance of most
prediction models algorithms. In order to minimize the feature set, we
underline the recursive feature elimination/removal (RFE) among many
proposed techniques. The proposal of RFE is to begin with all the features,
157
select the least useful features and remove them, and repeat until some
stopping condition is reached. Detecting the best subset features costs much, so
RFE decreases the difficulty of feature selection by being greedy. REF worked
well in gene expression studies by Guyon et al. [183]. Recursive feature
elimination (RFE) is another multivariate mapping approach that allows us to
detect (sparse) discriminative patterns in the dataset that are not limited to the
local neighbourhood of a feature, i.e. features may be spread across the whole
sample. The basic principle of RFE is to include initially all the features of a
large region, and to gradually exclude features that do not contribute to
discriminating patterns from different classes. Whether a feature in the current
feature set contributes enough to be kept is determined by the weight value of a
feature resulting from training a classifier (e.g. SVM) with the current set of
features. In order to increase the likelihood that the "best" features are selected,
feature elimination progresses gradually and includes cross-validation steps. In
each feature elimination step, a small proportion of features is discarded until a
core set of features remains with the highest discriminative power. Note that
using SVM to separate "good" from "bad" features implements a multivariate
feature selection strategy, as opposed to univariate feature selection which uses
single-feature F or t values from a statistical analysis. Nonetheless, an initial
feature reduction step using a univariate method might be useful if one wants
to restrict RFE to the subset of "active" features.
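A minimal sketch of this greedy elimination loop, using scikit-learn's RFE with a linear SVM as the ranking classifier (a stand-in, not the thesis implementation), is given below; the synthetic data and the choice of 5 retained features are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

# The linear SVM provides the per-feature weights; at each step the 10%
# lowest-ranked features are discarded until 5 features remain.
selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=5, step=0.1)
selector.fit(X, y)
print("selected feature indices:", list(selector.get_support(indices=True)))
```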
The implementation of RFE includes two nested levels of cross-validation to
maximize the chance of keeping the "best" features. At the first level, the
training data is partitioned and RFE is applied a number of times. In each
158
application, one of the folds is put aside for testing generalization performance,
while the remainder together form the training data for the RFE procedure, i.e.
for each of the RFEs another "split" of the data is used. When all the separate
RFEs have been performed, the final generalization performance is determined
as the average of the performance across the NF different splits, separately for
each reduction level. The final set of features (for a specific reduction level) is
obtained by merging the features with the best weights (highest absolute
values) across all splits.
The training data from each first-level split is used for a separate RFE
procedure, while the split with the test data is set aside and only used for
performance testing. The training data is then partitioned again into L subsplits and an SVM is trained on L splits in order to obtain robust weight
rankings for‎ feature‎ elimination.‎ A‎ feature’s‎ ranking‎ score‎ is‎ obtained‎ by‎
averaging the weights of that feature across the different second-level splits.
The absolute values of these scores are then ranked and the features with the
lowest ranks are removed. The "surviving" features are then used for the next
RFE iteration, which starts again with (a new) partitioning of the data into L
splits. The whole procedure is repeated R times until a desired number of
features has been reached. As described above, the RFE level that produces the
highest generalization performance across all first-level splits is finally selected
and the level's set of features is determined by merging the best features of the
respective first-level splits.
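To make the two-level procedure above concrete, the following is a minimal sketch in Python, assuming binary class labels, a linear SVM as the weight-producing classifier, and illustrative values for the number of outer folds, inner splits and the fraction of features dropped per step; it is not the exact implementation used in this work.

```python
# Minimal sketch (not the exact implementation) of the two-level RFE procedure:
# inner splits give averaged weight rankings, outer folds estimate generalization.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def rfe_with_inner_splits(X, y, n_keep, n_inner=5, drop_frac=0.1, C=1.0):
    """RFE on one outer-training set: average |w_i| over inner splits,
    drop the lowest-ranked fraction, repeat until n_keep features remain."""
    active = np.arange(X.shape[1])
    while active.size > n_keep:
        inner = StratifiedKFold(n_splits=n_inner, shuffle=True, random_state=0)
        scores = np.zeros(active.size)
        for tr_idx, _ in inner.split(X, y):
            clf = SVC(kernel="linear", C=C).fit(X[np.ix_(tr_idx, active)], y[tr_idx])
            scores += np.abs(clf.coef_.ravel())        # accumulate |w_i| per inner split
        scores /= n_inner                              # averaged ranking score per feature
        n_drop = min(max(1, int(drop_frac * active.size)), active.size - n_keep)
        keep = np.argsort(scores)[n_drop:]             # discard the lowest-ranked features
        active = active[np.sort(keep)]
    return active

def two_level_rfe(X, y, n_keep=60, n_outer=5):
    """Outer level: one RFE per split; the held-out fold measures generalization."""
    outer = StratifiedKFold(n_splits=n_outer, shuffle=True, random_state=0)
    selections, accuracies = [], []
    for tr, te in outer.split(X, y):
        feats = rfe_with_inner_splits(X[tr], y[tr], n_keep)
        clf = SVC(kernel="linear", C=1.0).fit(X[np.ix_(tr, feats)], y[tr])
        accuracies.append(clf.score(X[np.ix_(te, feats)], y[te]))
        selections.append(feats)
    # the final feature set would merge the best-weighted features across splits
    return selections, float(np.mean(accuracies))
```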
The support vector machine based recursive feature elimination, the so-called RFE-SVM approach [173], is a commonly used method to select features, as well as for subsequent classification, especially in the scope of biological data. In each iteration, a linear SVM is trained, followed by the removal of one or more "bad" features from further consideration.
The goodness of the features is determined by the absolute value of the
corresponding weights used in the SVM. The features remaining after a
number of iterations are deemed to be the most useful for discrimination, and
can be used to provide insights into the given data. A similar feature selection
strategy was used in the author unmasking approach, proposed for the task of
authorship verification [172] (a sub-area within the natural language processing
field). Instead of excluding the worst features, we could let the best features
iteratively drop. Recently, it has been observed experimentally on two
microarray datasets that using very low values for the regularisation constant C
can enhance the effectiveness of RFE-SVM execution [185].
Instead of repeatedly training SVMs over the usual iterations, we can rely on the limit C → 0. This limit can be computed in closed form using a centroid-based classifier. Moreover, unlike RFE-SVM, in this limit removing a number of features has no influence on the weights of the remaining features. Consequently, the need for multiple recursions is obviated, resulting in considerable computational savings.
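As an illustration of this limit, the sketch below ranks all features in a single pass from the weights of a centroid-based (nearest-centroid) classifier; it assumes class labels coded as +1/-1 and is only a simplified rendering of the idea, not the exact formulation referred to above.

```python
# Hedged sketch of the C -> 0 idea: the linear weights reduce to a centroid-difference
# rule, so features can be ranked in one pass without any recursion. Labels are assumed
# to be coded as +1 / -1.
import numpy as np

def centroid_feature_ranking(X, y):
    mu_pos = X[y == 1].mean(axis=0)          # per-feature mean of the positive class
    mu_neg = X[y == -1].mean(axis=0)         # per-feature mean of the negative class
    w = mu_pos - mu_neg                      # weight vector of a centroid-based classifier
    return np.argsort(-np.abs(w)), w         # features ordered from most to least useful
```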
6.2.4 Genetic Algorithms
Genetic algorithms are among the most useful approaches for solving a problem about which little is known. They are very general and can be applied to almost any search space: all that is required is a measure of how well a candidate solution performs, and a genetic algorithm can then evolve a high-quality solution. Genetic algorithms use the principles of selection and evolution to produce several solutions to a given problem.
Genetic algorithms tend to thrive when there is a very large set of candidate solutions and the search space is uneven, with many hills and valleys. Although they can operate in almost any environment, they are greatly outclassed by more situation-specific algorithms in simpler search spaces, so genetic algorithms are not always the best choice. They can also take quite a while to run and are therefore not always feasible for real-time use. They are, however, one of the most powerful methods for creating high-quality solutions to a problem relatively quickly.
6.3 Methods
This section describes TSVM-RFE, a method motivated by the task of classifying biomedical data. The goal is to examine classifier accuracy and classification errors using the transductive support vector machine method. We
set out to determine whether this method is an effective model when combined
with recursive feature elimination (RFE), compared with another algorithm
called Genetic Learning Across Datasets (GLAD).
6.3.1 Support Vector Machines
SVMs aim at creating a classifier with a large margin between the samples of the two classes while the training error is minimized. We employ a set of d-dimensional training samples $\mathbf{x}_i \in \mathbb{R}^d$, $i = 1, \dots, L$, labelled by $y_i \in \{-1, +1\}$, and their mapping $\phi(\mathbf{x}_i)$ defined implicitly through the kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^{\top}\phi(\mathbf{x}_j)$. The SVM has the following primal form:
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{L} \xi_i^{\,p} \quad \text{s.t.} \quad y_i\big(\mathbf{w}^{\top}\phi(\mathbf{x}_i) + b\big) \ge 1 - \xi_i, \qquad \xi_i \ge 0.$$
The SVM predictor for a sample $\mathbf{x}$ is settled by the inner product between the weight vector $\mathbf{w}$ and the mapped vector $\phi(\mathbf{x})$, plus the constant $b$:
$$f(\mathbf{x}) = \mathbf{w}^{\top}\phi(\mathbf{x}) + b.$$
Figure 6.3: Maximum margin separation hyperplane for Transductive SVM (semi-supervised data)
The predictor corresponds to a separating hyperplane in the mapped feature space. The prediction for each training sample $\mathbf{x}_i$ is associated with a violation term $\xi_i$, and $C$ is a user-specified constant that controls the penalty on these violation terms. The parameter $p$ above specifies the type of norm of the violation terms to be evaluated; it is usually set to 1 or 2, forming the 1-norm SVM or the 2-norm SVM respectively. The 1-norm and 2-norm TSVMs have been discussed in [181,184].
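For illustration, the following snippet trains a 2-norm soft-margin linear SVM with scikit-learn on synthetic data and checks that the learned weight vector and offset reproduce the predictor $f(\mathbf{x}) = \mathbf{w}^{\top}\phi(\mathbf{x}) + b$; the data and parameter values are arbitrary and are not those used in our experiments.

```python
# Illustrative soft-margin linear SVM on synthetic data (scikit-learn);
# parameters and data are arbitrary, not those used in the experiments.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (40, 5)), rng.normal(+1.0, 1.0, (40, 5))])
y = np.array([-1] * 40 + [+1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)        # minimises 1/2 ||w||^2 + C * sum(xi_i)
w, b = clf.coef_.ravel(), clf.intercept_[0]        # predictor f(x) = w . phi(x) + b
print(np.sign(X @ w + b)[:5])                      # same signs as clf.predict(X[:5])
```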
6.3.2 Transductive Support Vector Machines
This chapter introduces the extended SVM method, the transductive SVM. The 2-norm has been adopted for the TSVM here, which can be stated as:
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi},\, \boldsymbol{\xi}^{*},\, \mathbf{y}^{*}} \;\; \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{L} \xi_i^{2} + C^{*} \sum_{j=1}^{K} \xi_j^{*\,2}$$
subject to
$$y_i\big(\mathbf{w}^{\top}\phi(\mathbf{x}_i) + b\big) \ge 1 - \xi_i, \qquad y_j^{*}\big(\mathbf{w}^{\top}\phi(\mathbf{x}_j^{*}) + b\big) \ge 1 - \xi_j^{*},$$
where each $y_j^{*}$ represents the unknown label of $\mathbf{x}_j^{*}$, one of the $K$ unlabelled samples. In contrast with the SVM, the TSVM formulation accounts for the unlabelled data through the violation terms $\xi_j^{*}$ incurred when predicting each unlabelled sample $\mathbf{x}_j^{*}$. The penalty on these violation terms is controlled by the new constant $C^{*}$, which is associated with the unlabelled samples, while $C$ is associated with the labelled samples only.
Precisely solving the transductive problem requires searching over all possible assignments of $y_1^{*}, \dots, y_K^{*}$, which is usually intractable for large data sets. It is worth mentioning the approximate TSVM implemented in SVMlight [178,186].
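The sketch below gives a simplified, self-labelling approximation of this transductive idea: the current SVM assigns tentative labels to the unlabelled samples, which are then included in retraining with a gradually increasing penalty C*. It is only an illustration of how the unlabelled violation terms enter the objective, not the SVMlight algorithm cited above.

```python
# Simplified self-labelling approximation of the transductive idea (NOT the exact
# SVMlight TSVM): the current model labels the unlabelled points, which are then
# included in retraining with a gradually increasing penalty C*.
import numpy as np
from sklearn.svm import SVC

def simple_tsvm(X_lab, y_lab, X_unl, C=1.0, C_star_max=1.0, n_steps=5):
    clf = SVC(kernel="linear", C=C).fit(X_lab, y_lab)
    for C_star in np.linspace(C_star_max / n_steps, C_star_max, n_steps):
        y_unl = clf.predict(X_unl)                        # tentative labels y*_j
        X_all = np.vstack([X_lab, X_unl])
        y_all = np.concatenate([y_lab, y_unl])
        # per-sample weights: full penalty C on labelled points, C* on pseudo-labelled ones
        sw = np.concatenate([np.ones(len(y_lab)), np.full(len(y_unl), C_star / C)])
        clf = SVC(kernel="linear", C=C).fit(X_all, y_all, sample_weight=sw)
    return clf
```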
6.3.3 Recursive Feature Elimination
Recursive Feature Elimination (RFE) has the advantage of decreasing
the redundant and recursive features. RFE decreases the difficulty of feature
selection by being greedy.
• SVM Recursive Feature Elimination (SVM RFE)
SVM RFE is an application of RFE using the weight magnitude as ranking
criterion. We present below an outline of the algorithm in the linear case, using
the SVM-train equation, which is mentioned earlier.
SVM RFE Algorithm:
Inputs: the training examples X and the class labels y.
Initialize: the subset of surviving features s = [1, 2, ..., d] and the feature ranked list r = [ ].
Repeat until s is empty:
  1. Restrict the training examples to the surviving feature indices in s.
  2. Train the classifier (a linear SVM) on the restricted data.
  3. Compute the weight vector w, of dimension length(s).
  4. Compute the ranking criteria c_i = (w_i)^2 for all i.
  5. Find the feature with the smallest ranking criterion.
  6. Update the feature ranked list r with that feature.
  7. Eliminate the feature with the smallest ranking criterion from s.
Output:
Feature ranked list r.
As mentioned before, the algorithm can be generalized to remove more than
one feature per step for speed reasons.
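The outline above can be rendered as the following compact Python sketch, assuming a linear kernel so that the trained SVM exposes the weight vector directly; removing a chunk of features per step instead of one would only change the elimination step.

```python
# Compact, runnable rendering of the SVM-RFE loop above (one feature removed per
# step). A linear kernel is assumed so that clf.coef_ exposes the weight vector w.
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, C=1.0):
    surviving = list(range(X.shape[1]))      # s: surviving feature indices
    ranked = []                              # r: features in the order they are eliminated
    while surviving:
        clf = SVC(kernel="linear", C=C).fit(X[:, surviving], y)
        crit = clf.coef_.ravel() ** 2        # ranking criterion c_i = w_i^2
        worst = int(np.argmin(crit))         # feature with the smallest criterion
        ranked.append(surviving.pop(worst))  # eliminate it and record it
    ranked.reverse()                         # best-ranked (last surviving) feature first
    return ranked
```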
Extending SVM feature selection techniques to transductive feature selection is
straightforward. Specifically, we can produce TSVM RFE by iteratively
eliminating features with weights calculated from TSVM models. The
following steps illustrate how TSVM RFE evolves from the TSVM.
1. Pre-process the data and compute the filtering scores/grades; optionally, further normalize the data. This approach first filters some features based on scores such as Pearson correlation coefficients.
2. Initialize the feature indicator vector $\boldsymbol{\lambda}$ as an all-one vector.
3. Set the small entries of $\boldsymbol{\lambda}$ to zero according to a proportion/threshold, and set the remaining non-zero entries to 1.
4. Obtain a (sub-)optimal TSVM, as measured by cross-validation accuracy.
5. In accordance with RFE, estimate the weighting of each feature $i$ from the model obtained in step 4. With $\mathbf{x}_k$ denoting the given samples and $\mathbf{x}_k^{(-i)}$ the samples with feature $i$ eliminated, the weighting of the $i$-th feature can be expressed as
$$c_i = \tfrac{1}{2}\Big|\, \boldsymbol{\alpha}^{\top}\mathbf{H}\boldsymbol{\alpha} - \boldsymbol{\alpha}^{\top}\mathbf{H}^{(-i)}\boldsymbol{\alpha} \,\Big|, \qquad H_{kl} = y_k\, y_l\, K(\mathbf{x}_k, \mathbf{x}_l),$$
where $\mathbf{H}^{(-i)}$ is computed from the samples with feature $i$ eliminated. The estimation suggested in [160], which keeps the $\boldsymbol{\alpha}$ values unchanged rather than retraining, is easier to compute; in particular, the feature weights are identical to $w_i^{2}$ if the SVM is built upon a linear kernel. We go back to step 3 unless an acceptable number of features/iterations has been reached. Output the final predictor and the features indicated by large values of $\boldsymbol{\lambda}$.
Step 3 comprises the selection of a proportion/number of features according to a threshold applied to the vector $\boldsymbol{\lambda}$. For the filtering scores and the RFE method, the vector $\boldsymbol{\lambda}$ is changed to a binary vector, which has the effect of pruning or deactivating some features.
The threshold is usually chosen to prune a (fixed) number/proportion of features at each iteration. The value of the remaining features is then measured by the optimality of the TSVM model obtained in step 4. We apply cross-validation accuracy as the performance measure of the TSVM algorithm. For a subset of features selected by choosing a threshold value, we extend the model search over the free parameters, such as $C$ and $C^{*}$, and choose the parameter set which results in the highest cross-validation accuracy.
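A minimal sketch of this model-search step is given below. It scores each candidate proportion of retained features and each value of C by cross-validation accuracy; a plain supervised SVM stands in for the TSVM of step 4 purely to keep the example self-contained, and the grids of proportions and C values are illustrative.

```python
# Sketch of the model-search step: prune the lowest-|w| features for each candidate
# proportion and score each (proportion, C) pair by cross-validation accuracy.
# A supervised SVM stands in for the TSVM of step 4; the grids are illustrative.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def search_features_and_C(X, y, fractions=(1.0, 0.5, 0.25, 0.1), Cs=(0.1, 1.0, 10.0)):
    base_w = np.abs(SVC(kernel="linear", C=1.0).fit(X, y).coef_.ravel())
    best = (-np.inf, None, None)
    for frac in fractions:
        n_keep = max(1, int(frac * X.shape[1]))
        keep = np.argsort(-base_w)[:n_keep]          # threshold: keep the largest-weight features
        for C in Cs:
            acc = cross_val_score(SVC(kernel="linear", C=C), X[:, keep], y, cv=5).mean()
            if acc > best[0]:
                best = (acc, frac, C)
    return best   # (best CV accuracy, fraction of features kept, C)
```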
6.3.4 Genetic Learning Across Datasets (GLAD)
The GLAD algorithm is a distinctive semi-supervised learning algorithm, applied here as a wrapper method for feature selection. The GA is used to generate a population of related feature subsets. The labelled and unlabelled data samples are scored separately: Linear Discriminant Analysis (LDA) and K-means clustering (K = 2) were used for these two forms of data respectively [182]. A distinctive two-term scoring function scores the labelled and unlabelled data samples independently. The overall score is calculated as a weighted average of the two terms:
$$\text{score} = w \cdot \text{score}_{\text{labelled}} + (1 - w) \cdot \text{score}_{\text{unlabelled}}.$$
The labelled data score is defined as the typical leave-one-out cross-validation accuracy on the labelled training samples. The unlabelled data score consists of two terms: a cluster separation term and a steady-ratio term, expressed in terms of the centroid of each cluster, the ratio of data in each cluster, the number of data samples in each cluster, the expected ratio in each cluster, and the number of clusters.
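The following hedged sketch computes such a two-term score for one candidate feature subset. The leave-one-out LDA accuracy gives the labelled term, while simple stand-ins (the between-centroid distance and the deviation of the K-means cluster ratios from their expected values) replace the exact separation and steady-ratio terms of [182].

```python
# Hedged sketch of the two-term GLAD fitness for one candidate feature subset. The
# labelled term is leave-one-out LDA accuracy; simple stand-ins replace the exact
# cluster-separation and steady-ratio terms of [182].
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

def glad_score(X_lab, y_lab, X_unl, expected_ratio=(0.5, 0.5), weight=0.5):
    # Labelled-data term: leave-one-out cross-validation accuracy of an LDA classifier.
    lab = cross_val_score(LinearDiscriminantAnalysis(), X_lab, y_lab, cv=LeaveOneOut()).mean()
    # Unlabelled-data term: K-means (K = 2) separation plus a cluster-ratio term.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_unl)
    ratios = np.bincount(km.labels_, minlength=2) / len(X_unl)
    separation = np.linalg.norm(km.cluster_centers_[0] - km.cluster_centers_[1])
    ratio_term = 1.0 - 0.5 * np.abs(ratios - np.asarray(expected_ratio)).sum()
    unl = separation / (1.0 + separation) * ratio_term   # squashed to [0, 1)
    return weight * lab + (1.0 - weight) * unl            # weighted average of the two terms
```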
6.4 Experiments and Results
This section presents empirical results that allow us to evaluate the accuracy of the classification methods introduced earlier.
6.4.1 Datasets
• Leukaemia (AML-ALL): 7129 genes; two different forms of leukaemia, Acute Myeloblastic Leukaemia (AML), 25 samples, and Acute Lymphoblastic Leukaemia (ALL), 47 samples [187].
• Lymphoma (DLBCL): 7129 genes; 58 samples of Diffuse Large B-Cell Lymphoma (DLBCL) and 19 samples of Follicular Lymphoma (FL) [188].
• Chronic Myeloid Leukaemia (CML): 30 samples (18 severe emphysema, 12 mild or no emphysema) detected from a set of 22,283 human genes [189].
6.4.2 TSVM Recursive Feature Elimination (TSVM-RFE) Results
• Leukaemia (AML-ALL). The results for the leukaemia ALL/AML data set are summarized in Figure 6.4. TSVM-RFE gives the smallest error of 3.68%, and consistently smaller errors compared to SVM-RFE at 3.97% for 30, 40, . . . , 70 genes. Interestingly, in our experiments both of the methods give the lowest error when 60 genes are used. This provides a reasonable suggestion for the number of relevant genes that should be used for the leukaemia data.
• Lymphoma (DLBCL). The results for the lymphoma (DLBCL) data set
are summarized in Figure 6.4. TSVM-RFE gives the smallest error of
3.89%, and considerably smaller errors compared to SVM-RFE at
4.72% for 30, 40, . . . , 70 genes. The TSVM methods give the lowest
error with 60 genes, while the SVM methods give the lowest error at 50
genes with 4.72%, compared to 4.97% with 60 genes. This suggests the
number of relevant genes that should be used for the lymphoma
(DLBCL) data.
• Leukaemia (CML). Lastly, the TSVM-RFE and SVM-RFE results for
the leukaemia (CML) data set are provided in Figure 6.4 in the bottom
diagram. TSVM-RFE gives the smallest error of 6.52%, and markedly
smaller errors in contrast to the 7.85% with SVM-RFE for 30, 40, . . . ,
70 genes. Both algorithms show the lowest error when 50 genes are
used. This represents a sensible number of related genes that should be
used for the leukaemia (CML) data.
6.4.3 Comparing the TSVM Algorithm Results with the GLAD Algorithm
When implementing Genetic Learning Across Datasets, we conduct three experiments using the previous data sets, each addressing a different cancer diagnostic problem: the aim with ALL/AML is to distinguish between the two diagnoses; with the CML data set it is to predict the response to imatinib; with DLBCL it is to forecast the outcome.
In the AML-ALL data set, the accuracy using only labelled samples is 73.46%. Combining unlabelled and labelled samples increases the accuracy to 75.14%. Adding unlabelled samples increases the accuracy from 59.34% to 65.57% in the CML experiments. The addition of the unlabelled samples to the labelled samples for DLBCL raises the accuracy from 49.67% to 55.79%.
This shows that the GLAD algorithm outperforms SVM-RFE and TSVM-RFE
in some cases when we make use of the labelled data only without gene
selection. Table 6.1 shows, for example, that with the AML-ALL data set, the
GLAD algorithm gives 73.46%, while the SVM-RFE and TSVM-RFE
accuracy was 52.8% and 55.6% respectively.
However, the results for the second data set (DLBCL) show that the GLAD algorithm's accuracy was 49.67% and SVM-RFE's 55.8%. Furthermore, for the third data set (CML), SVM-RFE gives 59.02% without gene selection, while GLAD gives 59.34%. On the other hand, TSVM exceeds GLAD when we make use of unlabelled data along with labelled data and gene selection. The results are shown in Table 6.1.
[Figure 6.4: three panels (DLBCL, CML and AML/ALL data sets) plotting the error rate against the number of genes (30 to 70) for SVM-RFE and TSVM-RFE.]
Figure 6.4: Testing error for 3 data sets. The 5-fold cross-validated paired t-test shows the differences between SVM-RFE and TSVM-RFE when comparing the two methods at the confidence level of 95%. (Linear kernel, C = 1)
Table 6.1: Accuracy obtained with SVM-RFE, TSVM-RFE and GLAD

Datasets                                              | SVM-RFE Accuracy (labelled) | TSVM-RFE Accuracy | GLAD Accuracy
ALL-AML, without selection (7129 genes, 72 samples)   | 52.8%                       | 55.6%             | 73.46% (labelled)
ALL-AML, with selection (60 genes, 72 samples)        | 96.03%                      | 96.32%            | 75.14%
DLBCL, without selection (7129 genes, 77 samples)     | 55.8%                       | 57.1%             | 49.67% (labelled)
DLBCL, with selection (60 genes, 77 samples)          | 95.03%                      | 96.11%            | 55.79%
CML, without selection (22,283 genes, 30 samples)     | 59.02%                      | 72.6%             | 59.34% (labelled)
CML, with selection (50 genes, 30 samples)            | 92.15%                      | 93.48%            | 65.57%
For instance, with the CML data set using all the samples without gene selection, TSVM gives 72.6%, but with gene selection based on RFE, TSVM reaches 93.48%, while the GLAD result is 65.57% with gene selection. In the
same vein, the accuracy for the DLBCL data set reaches 96.11% by TSVM
with gene selection.
On the other hand, the GLAD algorithm gives 55.79% with gene selection. In
addition, TSVM with the AML-ALL data set with gene selection gives 96.32%
while the GLAD algorithm gives 75.14%. This means that TSVM performs
better than the GLAD algorithm, and with gene selection the result is superior.
6.5 Discussion of Results
From the results of the three datasets, we made the following observations:
1. TSVM-RFE can improve the performance more significantly for the
CML dataset than for the ALL-AML and DLBCL datasets. For
example, our TSVM-RFE methods can lead to a relative improvement
of more than 12% over the other methods, SVM-RFE and GLAD, for
the CML dataset, while the other methods only result in a 3~4%
improvement for the ALL-AML and DLBCL datasets.
2. Our proposed algorithms outperform SVM-RFE and GLAD more
significantly when we select some genes for the CML dataset than for
the other datasets, compared to when we use all the genes. On the other
hand, for example, GLAD is significantly better than SVM-RFE and
TSVM-RFE for the ALL-AML dataset; in contrast, the improvement
over SVM-RFE is modest for the DLBCL dataset. To determine the
reasons, we conducted the following additional experiments. We studied the genes against their scores when they were regarded as ranking models. More than 10% of the genes can improve the performance; in this case, feature selection can help to remove noisy features and thus improve the performance of the final ranking. In contrast, there are many features in the ALL-AML, DLBCL, and CML datasets that are not effective. Therefore, the benefit of removing noisy features is great.
Based on the discussion above, we conclude that if the effects of features vary considerably and there are redundant features, our method can work very well when applied in practice. It is also worth noting that the newly
developed method is meant to be applicable over large-scale datasets. In this
chapter, several situations were presented, for which the TSVM classifier was
outperformed by a more general algorithm that does not assume any particular
distribution of the analysed samples. In general, according to our experience,
the new method outperforms the SVM classifier.
6.6 Summary
This chapter has investigated topics focused on semi-supervised learning. This
was achieved by comparing two different methods for semi-supervised
learning using previously classified cancer data sets.
The results, on average, for semi-supervised learning surpass supervised learning. However, it was shown that the GLAD algorithm outperforms SVM-RFE when we make use of the labelled data only. On the other hand, TSVM-RFE exceeds GLAD when unlabelled data is used along with labelled data. It
performs much better with gene selection, and performs well even if the
labelled data set is small.
On the other hand, TSVM still has some drawbacks: when the size of the labelled data set is increased, the result does not increase significantly in accordance. Moreover, when the size of the unlabelled set is extremely small, the computational effort will be extremely high, because a small unlabelled set requires more computation.
Like almost all semi-supervised learning algorithms, TSVM shows some
instability and some results differ on different runs. This happens because
unlabelled samples may be wrongly labelled during the learning process. If we
find a way in future to select and eliminate the unlabelled sample first, then we
can limit the number of newly labelled samples for re-training the classifiers.
Chapter 7
Conclusions and Future Work
This thesis has been devoted to the core problem of pattern classification and
its applications. Three stages have been studied through pre-processing of
features, classification, and model selection for a classifier.
7.1 Contributions
Improving TSVM vs. SVM Accordance-Based Sample Selection
In this chapter, supervised and semi-supervised learning were applied over
several case studies. In particular, two different classifiers, the Support Vector
Machine and the Transductive Support Vector Machine, were reviewed and
used over the 'in-class' patients of the Abd El-Rehim et al. [95] breast cancer
dataset in order to validate the previous classification derived and characterised
in earlier studies. Surprisingly, the TSVM classifiers performed quite well,
especially when only the 50 'most important' samples were considered. This
happened even though one of the underlying assumptions of the TSVM was
strongly violated by the data: as a matter of fact, all the samples did not follow
a normal distribution. An accordance-based sampling version of the TSVM
was then developed and validated over known data sets. These latter results
were presented in this chapter, together with their comparison with the Support
Vector Machine approach. Using accordance-based sampling improved the accuracy on both data sets; the results show that the improvement for TSVM was greater than that for SVM, as shown in Chapter 3.
Automatic Features and Samples Ranking for SVM Classifier
In this chapter, we have proposed an optimization method for feature and
sample selection in ranking. The contributions of this chapter include the
following. We discussed the differences between classification and ranking,
and made clear the limitations of the existing feature and samples selection
methods when applied to ranking. In addition, we proposed a novel method to
select features and samples for ranking, in which the problem is formalized as
an optimization issue. In this method, we maximize the total importance scores
of selected features and samples, and at the same time minimize the total
similarity scores between the features in addition to samples. In this chapter,
we evaluated the proposed method using two datasets, with two ranking
models, and in terms of a number of evaluation measures. Experimental results
validated the effectiveness and efficiency of the proposed method.
Ensemble weighted classifiers with accordance-based sampling
Chapter 5 proposed a new research topic on active learning with different fold
data sizes, where the data volumes continuously increase and data concepts
dynamically develop, and the objective is to label a portion of data to form a
classifier ensemble with the highest accuracy rate in predicting future samples.
This chapter also studied the connection between a classifier ensemble's variance and its prediction accuracy, and showed that minimizing a classifier ensemble's variance is equivalent to maximizing its accuracy rate.
We derived an optimal Majority weighting method to assign weight values for
base classifiers, such that they can form an ensemble with maximum prediction
accuracy. Following the above derivations, we proposed a Majority voting
system for active learning from different fold data sizes, where the key is to
label instances which are responsible for a large variance value from the
classifier ensemble.
Our intuition was that providing class labels for such instances can significantly reduce the variance of the classifier ensemble, and therefore maximize the prediction accuracy rates. Experimental results on synthetic and real-world data showed that the dynamic nature of data streams poses significant challenges to existing active learning algorithms, especially when dealing with multiclass problems; simply applying uncertainty sampling globally or locally does not achieve good performance in practice. The proposed Majority voting system and active learning framework address these challenges using a variance measure to guide the instance classification process, followed by the voting weights to ensure that the instance labelling process can classify future samples into the most appropriate class.
Examination of TSVM Algorithm Classification Accuracy with Feature
Selection in Comparison with GLAD Algorithm
This chapter has investigated topics focused on semi-supervised learning. The result in Chapter 6 was achieved by comparing two different methods for semi-supervised learning using previously classified cancer data sets.
The results on average for semi-supervised learning surpass supervised learning.
However, it is shown that the GLAD algorithm outperforms SVM-RFE when
we make use of the labelled data only. On the other hand, TSVM-RFE exceeds
GLAD when unlabelled data is used along with labelled data. It performs much
better with gene selection and performs well even if the labelled data set is
small. On the other hand, TSVM still has some drawbacks: when the size of the labelled dataset is increased, the result does not improve accordingly. Moreover, when the size of the unlabelled sample set is extremely small, the computational effort will be extremely high, because a small unlabelled set requires more computation time.
Like almost all semi-supervised learning algorithms, TSVM shows some
instability and some results differ on different runs. This happens because
unlabelled samples may be wrongly labelled during the learning process. If we
find a way in future to select and eliminate the unlabelled samples first, then
we can limit the number of newly labelled samples for re-training the
classifiers.
7.2 Future Work
The following is a list of possible points that could guide the continuation of the present investigation:
• The classification performance of the comparatively weak
features has been improved by using our proposed kernel-based
classifier in this work. However, it is difficult to pre-determine
the associated optimal kernel parameter, as the robustness is not
satisfactory around those values of the kernel parameters that
can provide a high classification accuracy. The formulation and
optimization of the kernel parameter is an important issue to
explore.
• Investigation of 'not-classified' patients: by a form of consensus
clustering, six breast cancer classes have been defined in this
work. However, as highlighted in Chapter 4, not all the
available patients were classified in one of these six groups. A
very important future project will be to define a proper
classification for those patients in order to help doctors give
them more accurate prognoses, as well as targeting patients with
more specialised treatments. This represents a big challenge for
future work, as finding the proper cure for each patient will
decrease‎hospital‎costs‎as‎well‎as‎the‎patient’s‎pain.
• Getting new patients: one of the strategies that may be followed
to achieve the previous goal might be to increase the number of
available patients. This could be done by retrieving medical
records or by performing again the same biological analyses in
order to recover some missing data. It would be interesting to
investigate if it could be feasible to combine different sources of
data by merging studies from different research groups in which
data have been collected using very similar protocols.
• Globally optimum feature set: all wrapper or filter feature
selection methods try to find a set of features that perform better
than others under certain conditions, but they cannot guarantee
that the selected feature set is the globally optimum solution.
Searching for a globally optimum set of features has always
been a computationally infeasible task. The task is to select one combination out of 2^D possible combinations of features.
To solve this problem, we can apply any feature selection
algorithm as a polynomial mixed 0-1 problem, where 0-1
corresponds to the absence or presence of the corresponding
features in the selected feature set. Any linear optimiser to get a
global solution can easily solve this mixed 0-1 linear problem.
Potentially, this means that we can search for a globally
optimum feature set at a running cost of a linear optimization
technique. This is a huge improvement in computation costs and
the solution is also globally optimum. These techniques need to
be further investigated and tested on several available datasets
to verify their effectiveness.
• All of the importance weighting algorithms discussed in this
thesis try to minimise the distance between the distributions.
However, these methods do not guarantee to preserve data
variance properties. In this regard, kernel PCA does guarantee
to preserve the maximum variance of the data. There is a
requirement of developing transfer learning algorithms that
minimise the distance between the distributions while
preserving the data variance.
• The weighting algorithms have been developed for problems
where a large amount of unlabelled testing data is available and
importance weights can be calculated using limited training
data. However, these algorithms have been developed for offline
processes. This means that they cannot be applied to runtime
problems where processing time is of the essence. If we wish to
make these algorithms integral to SER systems, they have to be
modified to make them work online. This is where weighting
algorithms are lacking in comparison to CMN and MLLR
algorithms. If we consider the example of kernel mean
matching, it is a quadratic optimization problem. We know that
SVMs are also solved as a quadratic optimization problem.
• More complex classifiers designed with a small dataset may
appear to have higher performance than simpler classifiers
because of overtraining, but they may generalize poorly to
unknown cases. Many studies have shown that sometimes even
thousands of cases are not enough to ensure generalization. This
is particularly true when using powerful nonlinear techniques
with multiple stages. As many of the experiments carried out in
this study employed small medical datasets, further studies
should be conducted with 10 x larger sets, such as the digital
database for screening mammography (DDSM).
7.3 Dissemination
The research work reported in this thesis has been used in various
conference and journal papers as well as several internal and international
talks. What follows is a list of publications and presentations derived from this
work, together with a reference to the chapter in which the topic is covered.
7.3.1 Journal papers
In submission
Hala Helmi, Jonathan M. Garibaldi, Ensemble weighted classifiers with accordance-based sampling, submitted to Data Mining and Knowledge Discovery, 2013.
7.3.2 Conference papers
Hala Helmi, Jonathan M. Garibaldi, Improving SVM and TSVM with Multiclass Accordance Sampling for Breast Cancer, in the Proceedings of the 14th International Conference on Bioinformatics and Computational Biology (BIOCOMP 2013), Las Vegas, USA, July 2013.
Hala Helmi, Daphne Teck Ching Lai, Jonathan M. Garibaldi, Semi-Supervised Techniques in Breast Cancer Classification: A Comparison between Transductive SVM and Semi-Supervised FCM, in the Proceedings of the 12th Annual Workshop on Computational Intelligence (UKCI), Heriot-Watt University, Edinburgh, 2012.
Hala Helmi, Jonathan M. Garibaldi, Improving SVM with Accordance Sampling in Breast Cancer Classification, in the Proceedings of the International Conference on Bioinformatics and Computational Biology (BIOCOMP BG 2012), Varna, Bulgaria.
Hala Helmi, Jon M. Garibaldi and Uwe Aickelin, Examining the Classification Accuracy of TSVMs with Feature Selection in Comparison with the GLAD Algorithm, in the Proceedings of UKCI 2011, the 11th Annual Workshop on Computational Intelligence, Manchester, UK.
References
[1]
Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training
support vector machines. Technical Report MSR-TR-98-14, Microsoft Research.
[2]
Nils J. Nilsson. Introduction to Machine Learning. Artificial Intelligence
Laboratory, Department of Computer Science, Stanford University, 2005. Draft
of Incomplete Notes.
[3]
I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, NY,
1986.
[4]
Lluis Marquez. Machine learning and natural language processing. Technical
Report LSI-00-45-R, Departament de Llenguatges i Sisternes Informatics (LSI),
Universitat Politecnica de Catalunya (UPC), 2000.
[5]
J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver
operating characteristic curve. Radiology, 143: 29-36,1982.
[6]
B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal
margin classifiers. In Proc. of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, 1992.
[7]
O. L. Mangasarian. Breast cancer diagnosis and prognosis via linear
programming. Cancer Letter, 43(4): 570-577,1995.
[8]
D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation
analysis: An overview with application to learning methods. Neural
Computation. 16(12): 2639 - 2664,2004.
[9]
K. O. Ladly, C. B. Frank, G. D. Bell, Y. T. Zhang, and R. M. Rangayyan. The
effect of external loads and cyclic loading on normal patellofemoral joint signals.
Special Issue on Biomedical Engineering, Defence Science Journal (India). 43:
201-210, July 1993.
[10]
R. M. Haralick, K. Shanmugam, and I. Dinstein. Textural features for image
classification. IEEE Trans. on Systems, Man, Cybernetics, SMC-3(6): 610-622,
1973.
[11]
R. M. Rangayyan. Biomedical Signal Analysis -A Case-Study Approach. IEEE
and Wiley, New York, NY, 2002.
[12]
D.M. Abd El-Rehim, G. Ball, S.E. Pinder, E. Rakha, C. Paish, J.F. Robertson, D.
Macmillan, R.W. Blamey, and I.O. Ellis. High-throughput protein expression
analysis using tissue microarray technology of a large well-characterised series
identifies biologically distinct classes of breast cancer confirming recent cDNA
expression analyses. Int. Journal of Cancer, 116:340–350, 2005.
[13]
T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
[14]
T. M. Mitchell. The Discipline and Future of Machine Learning. Machine
Learning Department, School of Computer Science, Carnegie Mellon University,
2007.
[15]
P. Day and A. K. Nandi. Robust text-independent speaker verification using
genetic programming. IEEE Trans. on Audio, Speech and Language Processing,
15: 285-295,2007.
[16]
Lluis Marquez. Machine learning and natural language processing. Technical
Report LSI-00-45-R, Departament de Llenguatges i Sisternes Informatics (LSI),
Universitat Politecnica de Catalunya (UPC), 2000.
[17]
R. Navigli, P. Velardi, and A. Gangemi. Ontology learning and its application to
automated terminology translation. Intelligent Systems, IEEE, 18(1): 1541-1672,
2003.
[18]
C. -L. Liu, S. Jaeger, and M. Nakagawa. Online recognition of chinese
characters: the state-of-the-art. IEEE Trans. on Pattern Analysis and Machine
Intelligence, 26(2): 198-213,2004.
[19]
A. D. Parkins and A. K. Nandi. Genetic programming techniques for hand
written digit recognition. Signal Processing, 84(12): 2345-2365,2004.
[20]
A. D. Parkins and A. K. Nandi. Method for calculating first-order derivative
based feature saliency information in a trained neural network and its application
to handwritten digit recognition. IEE Proceedings - Part VIS, 152(2): 137147,2005.
[21]
R. Plamondon and S. N. Srihari. Online and off-line handwriting recognition: a
comprehensive survey. IEEE Trans. on Pattern Analysis and Machine
Intelligence, 22(l): 63-84,2000.
[22]
Y. Jian, A. F. Frangi, J. -Y. Yang, D. Zhang, and Z. Jin. KPCA plus LDA: a
complete kernel Fisher discriminant framework for feature extraction and
recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(2):
230 244,2005.
[23]
S. Yang, J. Song, H. Rajamani, C. Taewon, Y. Zhang, and R. Mooney. Fast and
effective worm fingerprinting via machine learning. In Proc. of the Int'l Conf. on
Autonomic Computing, ICAC, pages 311-313, TX, US, 2006.
[24]
S. Oyama, T. Kokubo, and T. Ishida. Domain-specific web search with keyword
spices. IEEE Trans. on Knowledge and Data Engineering. 16(1): 17 27.2004.
[25]
S. J. Vaughan-Nichols. Researchers make web searches more intelligent.
Computer, 39(12): 16-18,2006.
[26]
H. Alto, R. M. Rangayyan, and J. E. L. Desautels. Content-based retrieval and analysis of mammographic masses. Journal of Electronic Imaging, 14(2): 1-17, 2005. Article no. 023026.
[27]
T. C. S. S. Andre and R. M. Rangayyan. Classification of breast masses in mammograms using neural networks with shape, edge sharpness, and texture
features. Journal of Electronic Imaging. 15(1): 1-10,2006. article no. 013010.
[28]
H. Guo and A. K. Nandi. Breast cancer diagnosis using genetic programming
generated feature. Pattern Recognition, 39: 980-987,2006.
[29]
P. J. Lisboa and A. F. G. Taktak. The use of artificial neural networks in decision
support in cancer: a systematic review. Neural Networks, 19(4): 408-415,2006.
[30]
T. Mu and A. K. Nandi. Breast cancer detection from FNA using SVM with
different parameter tuning systems and SOM-RBF classifier. Journal of the
Franklin Institute, 344(3-4): 285-311,2007.
[31]
T. Mu, A. K. Nandi, and R. M. Rangayyan. Classification of breast masses via
nonlinear transformation of features based on a kernel matrix. Medical and
Biological Engineering and Computing, 45(8): 769-780,2007.
[32]
R. J. Nandi, A. K. Nandi, R. M. Rangayyan, and D. Scutt. Classification of
breast masses in mammograms using genetic programming and feature selection.
Medical and Biological Engineering and Computing, 44(8): 693-694,2006.
[33]
W. H. Wolberg, W. N. Street, and O. L. Mangasarian. Machine learning
techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letter,
77: 163-171,1994.
[34]
P. Bertone and M. Gerstein. Integrative data mining: the new direction in
bioinformatics. Engineering in Medicine and Biology Magazine, IEEE. 20(4):
33-40, 2001.
[35]
H. Hae-Jin, P. Yi, R. Harrison, and P. C. Tai. Improved protein secondary
structure prediction using support vector machine with a new encoding scheme
and an advanced tertiary classifier. IEEE Trans. on NanoBioscience, 3(4): 265271,2004.
[36]
S. Winters-Hilt, M. Landry, M. Akeson, M. Tanase, I. Amin, A. Coombs, E.
Morales, J. Millet, C. Baribault, and S. Sendamangalam. Cheminformatics
methods for novel nanopore analysis of HIV DNA termini. BMC
Bioinformatics, 2006.7 Suppl 2: S22.
[37]
Huang et al., Z. Huang, H. Chen, C.-J. Hsu, W.-H. Chen, S. Wu Credit rating
analysis with support vector machines and neural networks: a market
comparative study Decision Support Systems, 37 (4) (2004), pp. 543–558
[38]
P. D. Yoo, M. H. Kim, and T. Jan. Machine learning techniques and use of event
information for stock market prediction: A survey and evaluation. In Proc. Of the
Int'l Conf. on Computational Intelligence for Modelling Control and Automation,
CIMCA, and the Int'l Conf. on Intelligent Agents, Web Technologies and
Internet Commerce, IAWTIC, volume 2, pages 835-841, Vienna, Austria, 2005.
[39]
S. Handley. Predicting whether or not a nucleic acid sequence is an E. coli
promoter region using genetic programming. In Proc. of the Ist Int'l Symposium
on Intelligence in Neural and Biological Systems, INBS, pages 122-127,
Herndon, VA, 1995.
[40]
H. Tong-Cheng and D. You-Dong. Generic object recognition via integrating
distinct features with SVM. In Proc. of the Int'l Conf. on Machine Learning and
Cybernetics, pages 3897-3902, Dalian, 2006.
[41]
W. Xu, A. K. Nandi, and J. Zhang. Novel fuzzy reinforced learning vector
quantization algorithm and its application in image compression. IEEE
Proceedings - Part VIS, 150(5): 292-298,2003.
[42]
M. Lent. Game Smarts. Computer, 40(4): 99-101.2007.
[43]
K. O. Stanley, B. D. Bryant, and R. Miikkulainen. Real-time neuroevolution in the NERO video game. IEEE Trans. on Evolutionary Computation, 9(6): 653-668, 2005.
[44]
N. Kohl and P. Stone. Machine learning for fast quadrupedal locomotion. 2004.
[45]
H. Guo, L. B. Jack, and A. K. Nandi. Feature generation using genetic
programming with application to fault classification. IEEE Trans. on Systems,
Man, and Cybernetics, B: Cybernetics, 35(1): 89-99,2005.
[46]
L. B. Jack and A. K. Nandi. Genetic algorithms for feature selection in machine
condition monitoring with vibration signals. IEEE Proc. - Vision. Image Signal
Process. 147(3): 205-212,2000.
[47]
L. B. Jack and A. K. Nandi. Fault detection using support vector machine's and
artificial neural networks, augmented by genetic algorithms. Mechanical
Systems and Signal Processing, 16(2-3): 373-390,2002.
[48]
M.L.D. Wong, L.B. Jack and A.K. Nandi, Modified self-organising map for
automated novelty detection applied to vibration signal monitoring. Mechanical
Systems and Signal Processing, 20(3): 593-610,2006.
[49]
A. C. McCormick and A. K. Nandi. Real time classification of rotating shaft
loading conditions using artificial neural networks. IEEE Trans. on Neural Networks, 8(3): 748-757, 1997.
[50]
A. Rojas and A. K. Nandi. Practical scheme for fast detection and classification
of rolling-element bearing faults using support vector machines. Mechanical
Systems and Signal Processing, 20(7): 1523-1536,2006.
[51]
L. Zhang, L. B. Jack, and A. K. Nandi. Fault detection using genetic
programming. Mechanical Systems and Signal Processing, 19: 271-289,2005.
[52]
L. Zhang and A. K. Nandi. Fault classification using genetic programming.
Mechanical Systems and Signal Processing, 21: 1273-1284,2007.
[53]
G. Hinton and T. J. Sejnowski. Unsupervised Learning and Map Formation: Foundations of Neural Computation. MIT Press, Cambridge, MA, 1999.
[54]
S. Kotsiantis and P. Pintelas. Recent advances in clustering: A brief survey.
WSEAS Trans. on Information Science and Applications, 1(1): 73-81,2004.
[55]
O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press,
Cambridge, MA, 2006.
[56]
N. Dean, T. B. Murphy, and G. Downey. Updating classification rules with
unlabelled data with applications in food authenticity studies. Journal of the
Royal Statistical Society, Series C., 55(1): 1-14,2006.
[57]
B. Sahiner, N. Petrick, H. P. Chan, L. M. Hadjiiski, C. Paramagul. M. A. Helvie.
and M. N. Gurcan. Computer-aided characterization of mammographic masses:
Accuracy of mass segmentation and its effects on characterization. IEEE Trans
Medical Imaging, 20(12): 1275-1284,2001.
[58]
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT
Press, Cambridge, MA, 1998.
[59]
W. Xu, A. K. Nandi, and J. Zhang. Novel fuzzy reinforced learning vector
quantization algorithm and its application in image compression. IEE
Proceedings - Part VIS, 150(5): 292-298,2003.
[60]
W. Xu, A. K. Nandi, J. Zhang, and K. G. Evans. Novel vector quantiser design
using reinforced learning as a pre-process. Signal Processing, 85(7): 1315-1333,
2005.
[61]
V. N. Vapnik. Statistical learning theory, pages 339-371. New York: Wiley.
1998.
[62]
V. Tresp. A Bayesian committee machine. Neural Computation, 12: 2719-2741,
2000.
[63]
R. Caruana. Multitask learning: A knowledge-based source of inductive bias.
Machine Learning, 28: 41-75,1997.
[64]
V. N. Vapnik. The nature of statistical learning theory. Springer-Verlag, London,
UK, 1995.
[65]
R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals
of Eugenics, 7(2): 179-188,1936.
[66]
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and
Sons, New York, NY, 2nd edition, 2001.
[67]
J. Neter, M. H Kutner, C. J. Nachtsheim, and W. Wasserman. Applied Linear
Statistical Models. Irwin, Chicago, IL, 4 edition, 1990.
[68]
S. M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory.
Prentice Hall, 1993. Chapter 7.
[69]
P. Domingos and M. J. Pazzani. On the optimality of the simple bayesian
classifier under zero-one loss. Machine Learning, 29(2-3): 103-130,1997.
[70]
A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, NY, 2nd edition, 1984.
[71]
S. S. Haykin. Neural Networks: A Comprehensive, Foundation. Prentice Hall,
London, UK, 1999.
[72]
D. S. Broornhead and D. Lowe. Multivariable functional interpolation and
adaptive networks. Complex System, 2: 321-355,1988.
[73]
T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9): 1464-1480, 1990.
[74]
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis.
Cambridge University Press, Cambridge, UK, 2004.
[75]
Y. Chen and J. Z. Wang. Support vector learning for fuzzy rule-based
classification systems. IEEE Trans. on Fuzzy Systems, 11(6) : 716-728,2003.
[76]
S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K. Muller. Fisher discriminant
analysis with kernels. In Proc. of IEEE Neural Networks for Signal Processing
Workshop, pages 41- 48,1999.
[77]
C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):
273-297,1995.
[78]
N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines
and other kernel-based learning methods. Cambridge University Press,
Cambridge, UK, 2000.
[79]
B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal
margin classifiers. In Proc. of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, 1992.
[80]
B. Schölkopf, A. J. Smola, R. Williamson, and P. Bartlett. New support vector
algorithms. Neural Computation, 12: 1207-1245, May 2000.
[81]
B. Cady and M. Chung. Mammographic screening: No longer controversial.
American Journal of Clinical Oncology, 28(1): 1-4.2005.
[82]
G. Fung and O. L. Mangasarian. Proximal support vector machine classifiers. In
Proc, of the 7th ACM SIGKDD international conference on Knowledge
Discovery and Data Mining, pages 77-86, San Francisco, CA, 2001.
[83]
G. Fung and O. L. Mangasarian. Multicategory proximal support vector machine
classifiers. Machine Learning, 59: 77-97, May 2005.
[84]
D. Agarwal. Shrinkage estimator generalizations of proximal support vector
machines. In Proc. of the 8th Int'l Conf. Knowledge Discovery and Data Mining,
pages 173-182, Edmonton, Alberta, Canada, 2002.
[85]
A. Tveit and H. Engum. Parallelization of the incremental proximal support
vector machine classifier using a heap-based tree topology. In Workshop on
Parallel and Distributed computing for Machine Learning (In conjunction with
ECML'2003 and PKDD'2003), Cavtat, Dubrovnik, Croatia, 2003.
[86]
S. K. Pal, S. Bandyopadhyay, and S. Biswas. Fuzzy proximal support vector
classification via generalized eigenvalues. In Proc. of 1st Int’l‎Conf.‎on‎Pattern‎
Recognition and Machine Intelligence, pages 360-363, Kolkata, India. 2005.
[87]
O. L. Mangasarian and E. W. Wild. Multisurface proximal support vector
machine classification via generalized eigenvalues. IEEE Trans Pattern Analysis
and Machine Intelligence, 28: 69-74, January 2006.
[88]
A. N. Tikhonov and V. Y. Arsen. Solutions of Ill-posed Problems. John Wiley
and Sons, New York, NY, 1977.
[89]
A. R. Webb. Statistical Pattern Recognition. John Wiley and Sons Ltd., 2nd edition, 2002.
[90]
S. Bleha and D. Gillespie. Computer user identification using the mean and the
median as features. In Proc. of the IEEE Int'l Conf. on Systems, Man, and
Cybernetics, pages 4379-4381, San Diego, CA. 1998.
[91]
H. Lin and A. N. Venetsanopoulos. A weighted minimum distance classifier for
pattern recognition. In Proc. of the 6th Canadian Conf. on Electrical and
Computer Engineering, pages 904-907, Vancouver, BC, Canada, 1993.
[92]
D. Zhang, S. Chen, and Z. Zhou. Learning the kernel parameters in kernel
minimum distance classifier. Pattern Recognition, 39(l): 133-135,2006.
[93]
P. Somervuo and T. Kohonen. Self-organizing maps and learning vector
quantization for feature sequences. Neural Processing Letters. 10(2): 151 159.1999.
[94]
C. Chou, C. Lin, Y. Liu, and F. Chang. A prototype classification method and its
use in a hybrid solution for multiclass pattern recognition. Pattern Recognition,
39(4): 624-634,2006.
[95]
K.P. Bennett and E. Parrado-Hernandez, The Interplay of Optimization and
Machine Learning Research, Journal of Machine Learning Research, 7(Jul): 1265-1281, 2006.
[96]
G. Lanckriet, N. Cristianini, P. Bartlett, L. Ghaoui and M.I. Jordan. Learning the
kernel matrix with semidefinite programming. Journal of Machine Learning
Research, 5: 27-72,2004.
[97]
C. S. Ong, A. Smola. and R. Williamson. Learning the kernel with hyperkernels.
Journal of Machine Learning Research, 6: 1045-1071,2005.
[98]
Y. Zhang, S. Burer, and W. N. Street. Ensemble pruning via semi-definite
programming. Journal of Machine Learning Research, 7: 1315-1338,2006.
[99]
P. F. Felzenszwalb and D. McAllester. The generalized A* architecture. Journal
of Artificial Intelligence Research, 29: 153-190,2007.
[100] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine
Learning. Reading, MA: Addison-Wesley, 1989.
[101] A. J. Chipperfield, P. J. Fleming, H. Pohlheim, and C. M. Fonseca. Genetic
Algorithm Toolbox for use with MATLAB (version 1.2). University of
Sheffield, Sheffield, UK, 1994.
[102] D. Windridge and J. Kittler. Combined classifier optimisation via feature
selection. In Book Advances in Pattern Recognition: Proc. of the Joint 1APR
Int'l Workshops, SSPR 2000 and SPR 2000, volume 1876, pages 687-695,
Alicante, Spain, 2000.
[103] H. Vafaie and K. A. De Jong. Improving the performance of rule induction
system using genetic algorithms. In Proc. of the 1st Int'l Workshop on
Multistrategy Learning, pages 305-315, Harpers Ferry, WV, 1991.
[104] M. L. D. Wong and A. K. Nandi. Automatic digital modulation recognition using
the artificial neural network and genetic algorithm. Signal Processing, 84(2):
351-365,2004.
[105] Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of
Information and Computer Science.
[106] D.M. Abd El-Rehim, G. Ball, S.E. Pinder, E. Rakha, C. Paish, J.F. Robertson, D.
Macmillan, R.W. Blamey, and I.O. Ellis. High-throughput protein expression
analysis using tissue microarray technology of a large well-characterised series
identifies biologically distinct classes of breast cancer confirming recent cDNA
expression analyses. Int. Journal of Cancer, 116:340–350, 2005.
[107]
W. W. Cohen, R. E. Schapire, and Y. Singer. Learning to order things. In Proc.
Advances‎in‎Neural‎Information‎Processing‎Systems‎(NIPS’98),‎1998.
[108]
V. Hristidis, N. Koudas, and Y. Papakonstantinou. PREFER: A system for the
efficient execution of multi-parametric ranked queries. Proceedings ACM
SIGMOD International Conference on Management of Data, 2001.
[109] L. Duijm, J. H. Groenewoud, F. H. Jansen, J. Fracheboud, M. Beek, and H. J. de
Koning. Mammography screening in the Netherlands: delay in the diagnosis of
breast cancer after breast cancer screening. British Journal of Cancer , 91:1795–
1799, 2004.
[110] Breast Cancer Diagnostic Algorithms for Primary Care Providers, Breast Expert
Workgroup, third ed., Cancer Detection Section, California Department of
Health Service, 2005.
[111] W. H. Wolberg, W. N. Street, and O. L. Mangasarian. Breast cytology diagnosis
via digital image analysis. Analytical and Quantitative Cytology and Histology,
15(6): 396-404,1993.
[112] W. H. Wolberg, W. N. Street, and O. L. Mangasarian. Machine learning
techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letter,
77: 163-171,1994.
[113] B. B. Mandelbrot. The Fractal Geometry of Nature. Cupter 5. W. H. Freeman
and Company, New York, 1997.
[114] A. Bellaachia and E. Guven. Predicting breast cancer surviv-ability using data
mining techniques. Scientific Data Mining Workshop, in Conjunction with the
2006 SIAM Conference on Data Mining, 2006.
[115] D. Delen, G. Walker, and A. Kadam. Predicting breast cancer survivability: a
comparison of three data mining methods. Artificial Intelligence in Medicine,
34(2):113–127, 2005
[116] C.-C. Chang and C.-J. Lin, LIBSVM : a library for support vector machines.
ACM Transactions on Intelligent Systems and Technology, 2:27:1--27:27, 2011.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[117] N. El Barbri, E. Llobet, N. El Bari, X. Correig, B. Bouchikhi, Application of a
portable electronic nose system to assess the freshness of Moroccan sardines,
Materials Science and Engineering: C, 28 (5–6), 666-670, 2008.
[118] S.Tong and D.Koller, Support vector machine active learning with applications
to text classification, The Journal of Machine Learning Research, 2, 45-66, 2002.
[119] T.Nugent and DT. Jones, Transmembrane protein topology prediction using
support vector machines. BMC Bioinformatics, 10:159, 2009.
[120] Fu, Li M., and Casey S. Fu-Liu. "Multi-class cancer subtype classification based
on gene expression signatures with reliability analysis." FEBS Letters, 561(1): 186-190, 2004.
[121] Liu, Yi, and Yuan F. Zheng. "One-against-all multi-class SVM classification
using reliability measures." Neural Networks, 2005. IJCNN'05. Proceedings.
2005 IEEE International Joint Conference on. Vol. 2. IEEE, 2005.
[122] P. Mahesh, Multiclass approaches for support vector machine based land cover
classification, arXiv preprint arXiv:0802.2411,2008.
[123] J. Weston and C. Watkins, Multi-class support vector machines. Technical
Report CSD-TR-98-04, Department of Computer Science, Royal Holloway,
University of London, May, 1998.
[124] C.-W. Hsu and C.-J. Lin, A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2): 415-425, 2002.
[125] F.E. Harrell Jr., K.L. Lee, and D.B. Mark. Multivariable prognostic models:
Issues in developing models, evaluating assumptions and adequacy, and
measuring and reducing errors. Statistics in Medicine, 15:361–387, 1996.
[126] W.N. Venables and B.D. Ripley. Modern Applied Statistics with S. New
York:Springer, 4th edition, 2002.
[127] M. Vuk and T. Curk. ROC curve, lift chart and calibration plot. Metodološki zvezki, 3(1): 89–108, 2006.
[128] W. Wang and , Z. H. Zhou, On multi-view active learning and the combination
with semi-supervised learning, Proceedings of the 25th international conference
on Machine learning. ACM, 2008.
[129] I. Muslea, S. Minton, and, C. A. Knoblock, Selective sampling with redundant
views, Proceedings of the national conference on artificial intelligence. Menlo
Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2000.
[130] P. Royston. Algorithm AS 181: The W test for normality. Applied
Statistics,31:176–180, 1982.
[131] R.R. Bouckaert. Naive bayes classifiers that perform well with continuous
variables.In Proceedings of the 17th Australian Conference on AI (AI04). Berlin:
Springer, 2004.
[132] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for
ordinal regression. Advances in Large Margin Classifiers, MIT Press, Pages:
115-132, 2000.
[133] T. Joachims. Optimizing search engines using clickthrough data. KDD 2002.
[134] C. Burges, T. Shaked, E. Renshaw, A .Lazier, M. Deeds, N. Hamilton, G.
Hullender Learning to rank using gradient descent. ICML 2005.
[135] W. Lior, S. Bileschi. Combining variable selection with dimensionality
reduction. CVPR 2005.
[136] A. Blum and P. Langley. Selection of relevant features and examples in machine
learning. AI, 97(1-2), 1997.
[137] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio and V. Vapnik.
Feature selection for SVMs. NIPS 2001.
[138] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance.
ICML 2004.
[139] I. Guyon, A. Elisseeff. An introduction to variable and feature selection. Journal
of Machine Learning Research, 2003.
[140] Y. Yang and Jan O. Pedersen. A comparative study on feature selection in text
categorization. ICML 1997.
[141] G. Forman. An extensive empirical study of feature selection metrics for text
classification. Journal of Machine Learning Research, 2003.
[142] R. Kohavi, G. H. John. Wrappers for feature selection. Artificial Intelligence,
1997.
[143] R. B. Yates, B. R. Neto. Modern information retrieval, Addison Wesley, 1999.
[144] K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation of IR
techniques, ACM Transactions on Information Systems, 2002.
[145] J. Furnkranz and E. Hullermeier. Pairwise preference learning and ranking. In Proc. European Conf. Machine Learning (ECML'03), 2003.
[146] S. Har-Peled, D. Roth, and D. Zimak. Constraint classification: A new approach to multiclass classification and ranking. In Proc. Advances in Neural Information Processing Systems (NIPS'02), 2002.
[147] R. Herbrich, T. Graepel, and K. Obermayer, editors. Large margin rank boundaries for ordinal regression. MIT Press, 2000.
[148] E. Chang and S. Tong. Support vector machine active learning for image retrieval. In ACM Multimedia 2001, 2001.
[149] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proc. Int. Conf. Machine Learning (ICML'00), pages 839–846, 2000.
[150] H. Yu, S. Hwang, and K. C.-C. Chang. RankFP: A framework for supporting rank formulation and processing. In Proc. Int. Conf. Data Engineering (ICDE'05), 2005.
[151] D. Lewis and J. Catlett, "Heterogeneous uncertainty sampling for supervised learning," in Proc. ICML, 1994, pp. 148–156.
[152] K. Brinker. Active learning of label ranking functions. In Proc. Int. Conf. Machine Learning (ICML'04), 2004.
[153] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2: 121–167, 1998.
[154] R. Battiti. Using mutual information for selecting features in supervised neural
net learning. IEEE Transactions on Neural Networks. vol. 5, NO.4, July 1994.
[155] S. Theodoridis, K. Koutroumbas. Pattern recognition. Academic Press, New
York, 1999.
[156] N. Kwak, C. H. Choi. Input feature selection for classification problems. Neural
Networks, IEEE Transactions on Neural Networks, vol.13, No.1, January 2002.
[157] M. Kendall. Rank correlation methods. Oxford University Press, 1990.
[158] A. M. Liebetrau. Measures of association, volume 32 of Quantitative
Applications in the Social Sciences. Sage Publications, Inc., 1983.
[159] S. Robertson. Overview of the okapi projects, Journal of Documentation, Vol.
53, No. 1, pp. 3-7, 1997.
[160] L. Breiman, J. H. Friedman, R. A. Olshen, and C.J.Stone. Classification and
regression trees. Wadsworth and Brooks, 1984.
[161] C. Aggarwal, Data Streams: Models and Algorithms. New York: Springer-Verlag, 2007.
[162] D. Cohn, L. Atlas, and R. Ladner, “Improving generalization with active
learning,” Mach. Learn., vol. 15, no. 2, pp. 201–221, May 1994.
[163] X. Zhu, X. Wu, and Q. Chen, “Eliminating class noise in large datasets,” in Proc.
ICML, 2003, pp. 920–927.
[164] H. Seung, M. Opper, and H. Sompolinsky, “Query by committee,” in Proc.
COLT, 1992, pp. 287–294.
[165] M. Culver, D. Kun, and S. Scott, “Active learning to maximize area under the
ROC curve,” in Proc. ICDM, 2006, pp. 149–158.
[166] X. Zhu and X. Wu, “Class noise vs attribute noise: A quantitative study of their
impacts,” Artif. Intell. Rev., vol. 22, no. 3/4, pp. 177–210, Nov. 2004.
[167] W. Hu, W. Hu, N. Xie, and S. Maybank, “Unsupervised active learning based on
hierarchical graph-theoretic clustering,” IEEE Trans. Syst., Man, Cybern. B,
Cybern., vol. 39, no. 5, pp. 1147–1161, Oct. 2009.
[168] P. Mitra, C. Murthy, and S. Pal, “A probabilistic active support vector learning
algorithm,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 3, pp. 413–418,
Mar. 2004.
[169] B. Settles, “Active learning literature survey,” Univ. Wisconsin-Madison,
Madison, WI, Computer Science Tech. Rep. 1648, 2009.
[170] K. Czarnecki, Model Driven Architecture, OOPSLA Tutorial.
http://www.sts.tu-harburg.de/teaching/ss-07/FMDM/K-NearestNeighbors.pdf
[171] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen
Lin. LIBLINEAR: A library for large linear classification. Journal of Machine
Learning Research, 9:1871–1874, 2008. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf. Laral.istc.cnr.it [online],
2003 [cited 2011-02-02]. Neural Networks Library in Java. Available at
http://laral.istc.cnr.it/daniele/software/NNLibManual.pdf
[172] Zhu, X., "Semi-Supervised Learning Tutorial". Department of Computer
Sciences University of Wisconsin, Madison, USA, 2007.
[173] Abney, S. "Semisupervised Learning for Computational Linguistics" (1st ed.).
Chapman & Hall/CRC, 2007.
[174] Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T.
S., Ares, M. Jr., Haussler, D. "Knowledge-based analysis of microarray gene
expression data by using support vector machines". Proc. Natl. Acad. Sci. USA
97:262–267, 2000.
[175] Jaakkola, T., Haussler, D. and Diekhans, M. "Using the Fisher kernel method to
detect remote protein homologies". In Proceedings of ISMB, 1999.
[176] Zien, A., Ratsch, G., Mika, S., Scholkopf, B., Lemmen, C., Smola, A., Lengauer,
T., and Muller, K.-R. "Engineering support vector machine kernels that
recognize translation initiation sites", Bioinformatics, 16:799–807, 2000.
[177] Valafar, F. "Pattern Recognition Techniques in Microarray Data Analysis".
Annals of the New York Academy of Sciences 980(1): 41-64, 2002.
[178] Gammerman, A., Vovk, V. and Vapnik, V. "Learning by Transduction", In
Uncertainty in Artificial Intelligence, pp. 148–155, 1998.
[179] Collobert, R., Sinz, F., Weston, J. and Bottou, L. "Large Scale Transductive
SVMs". J. Mach. Learn. Res. 7 (December 2006), 1687-1712, 2006.
[180] R. Zhang, W. Wang, Y. Ma, C. Men. "Least Square Transduction Support
Vector Machine." Neural Processing Letters 29(2): 133-142, 2009.
[181] Joachims, T. "Transductive inference for text classification using support vector
machines". In Proceedings of ICML-99, pages 200–209, 1999.
[182] Han, J. and Kamber, M. "Data Mining: Concepts and Techniques". Morgan
Kaufmann Publishers, San Francisco, CA, 2001.
[183] Guyon, I., Weston, J., Barnhill, S., Vapnik, V. "Gene Selection for Cancer
Classification using Support Vector Machines". Machine Learning, 46(1): 389–422, 2002.
[184] Bennett, K. and Demiriz, A. "Semi-supervised support vector machines". In
NIPS, 1998.
[185] Harris, C. and Ghaffari, N. "Biomarker discovery across annotated and
unannotated microarray datasets using semi-supervised learning". BMC
Genomics 9 Suppl 2:S7, 2008.
[186] Joachims, T. "Making large-scale support vector machine learning practical". In
Advances in Kernel Methods: Support Vector Machines, 1999.
[187] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller
H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular
classification of cancer: class discovery and class prediction by gene expression
monitoring. Science, 286:531–537, 1999.
[188] Shipp MA, Ross KN, Tamayo P, Weng, AP, Kutok JL. "Diffuse large B-cell
lymphoma outcome prediction by gene expression profiling and supervised
machine learning". Nature Medicine, 8:68-74. 2002.
[189] A.S. Yong, R.M. Szydlo, J.M. Goldman, J.F. Apperley and J.V. Melo.
"Molecular profiling of CD34+ cells identifies low expression of CD7, along
with high expression of proteinase 3 or elastase, as predictors of longer survival
in patients with CML". Blood, 107:205–212, 2006.