Developing Methods for Machine Learning Algorithms Using Automated Feature and Sample Selection

by Hala Helmi

Thesis submitted to The University of Nottingham for the Degree of Doctor of Philosophy

School of Computer Science
The University of Nottingham
Nottingham, United Kingdom

January 2014

Abstract

Machine learning is concerned with the design and development of algorithms and techniques that allow computers to "learn" from experience with respect to some class of tasks and performance measure. One application of machine learning is to improve the accuracy and efficiency of computer-aided diagnosis systems to assist physicians, radiologists, cardiologists, neuroscientists, and health-care technologists. This thesis focuses on machine learning and its application to different types of data sets, for example breast cancer detection. Breast cancer, which is the most common cancer in women, is a complex disease characterised by multiple molecular alterations. Current routine clinical management relies on the availability of robust clinical and pathological prognostic and predictive factors, such as the Nottingham Prognostic Index, to support decision making. Recent advances in high-throughput molecular technologies have supported the evidence of a biological heterogeneity of breast cancer.

Emphasis is laid on the preprocessing of features, pattern classification, and model selection. Before the classification task, feature selection and feature transformation may be performed to reduce the dimensionality of the features and to improve the classification performance. A genetic algorithm (GA) can be employed for feature selection based on different measures of data separability or the estimated risk of a chosen classifier. A separate nonlinear transformation can be performed by applying kernel principal component analysis and kernel partial least squares.

Different classifiers are proposed in this work. The aim is to fuse the output of multiple classifiers in the test phase using weights that are derived during training; such fusion of classifier outputs improves classification performance. Five classifiers are used: support vector machine, naïve Bayes, k-nearest neighbour, logistic regression and neural network. Gathering numerous classifiers for the purpose of solving a given learning problem is a logical approach: overall, a suite of classifiers proves to perform better than an individual classifier. Besides, the distinct behaviour of each single classifier is exploited by the ensemble in order to obtain higher precision for the system as a whole; in addition, it helps us to hedge the risk of selecting a poor individual classifier.

We propose a novel sampling method to replace the random sampling used by SVM and TSVM, to find whether this could extensively reduce the amount of labelled experimental examples needed, and to see if this can improve the performance of both SVM and TSVM as well. The new method uses redundant views to expand the labelled dataset to build strong learning models. The major difference is that the new method uses a number of classifier views (two views in our case) to select and sample unlabelled examples to be labelled by the domain experts, while the original SVM and TSVM randomly sample some unlabelled examples and use classifiers to assign labels to them.

The problem of model selection is studied to pick the best values of the hyper-parameters for a parametric classifier.
To choose the optimal kernel or regularization parameters of a classifier, we investigate different criteria, such as the validation error estimate and the leave-one-out bound, as well as different optimization methods, such as grid search, gradient descent, and GA. The tuning of the multiple parameters of a 2-norm support vector machine (SVM) is viewed as an identification problem of a nonlinear dynamic system. Independent kernel optimization based on different measures of data separability is also investigated for different kernel-based classifiers. Numerous computer experiments using benchmark datasets verify the theoretical results and compare the techniques in terms of classification accuracy or area under the receiver operating characteristic curve. Computational requirements, such as the computing time and the number of hyper-parameters, are also discussed. Experimental results demonstrate the merit of these methods, with improved classification performance.

Acknowledgements

First of all, I would like to thank my supervisors, Prof. Jonathan Garibaldi and Prof. Uwe Aickelin, for the independence, guidance, and support they have given me throughout this project. Being at the University of Nottingham has been great fun. I have been lucky to have such great friends who have given me so much love and support, and who have made this experience such a great pleasure. I am especially grateful to all my colleagues in the Intelligent Modelling and Analysis Group, School of Computer Science, the University of Nottingham. I would like to acknowledge financial support from the King Abdullah Foreign Scholarship Program (KAS) for providing scholarships and travel funds for my studies at the University of Nottingham, UK. A very special thank you goes to my friends and my family, the most tolerant people I have ever met, for taking care of me, encouraging me, supporting me, and loving me. I want to thank my sisters for their endless patience whenever I wish to talk to them about my research, even though they work in a very different area. My PhD life would not have been so smooth and enjoyable without them always by my side. Finally, I am deeply indebted to my dear parents for their love, patience and encouragement over all the years. I am infinitely grateful for the values they have passed down to me, and for their continuous support throughout all my studies. Thank you all!

Contents

Abstract ............................................................................... ii
Acknowledgements ................................................................ iv
Contents ................................................................................ v
List of Figures .................................................................... viii
List of Tables ........................................................................ ix
1 Introduction ......................................................................... 2
1.1 Background ....................................................................... 2
1.2 Motivation ........................................................................ 7
1.3 Aims and Objectives .......................................................... 9
1.4 Thesis Organisation ........................................................... 9
2 Literature Review ............................................................... 15
2.1 Introduction ..................................................................... 16
2.2 Types of Algorithms ......................................................... 17
2.2.1 Supervised Learning ...................................................... 17
2.2.2 Unsupervised Learning .................................................. 19
2.2.3 Semi-Supervised Learning ............................................. 19
2.2.4 Reinforcement Learning ................................................ 20
2.2.5 Transduction ................................................................ 20
2.2.6 Multi-task Learning ...................................................... 21
2.3 Classification ................................................................... 22
2.3.1 Linear Classifiers .......................................................... 22
2.3.2 Artificial Neural Networks ............................................. 26
2.3.3 Kernel-based Classifiers ................................................ 32
2.3.4 Proximal Classifiers ...................................................... 43
2.3.5 Prototype Classifiers ..................................................... 47
2.4 Optimization .................................................................... 49
2.4.1 Genetic Algorithm ........................................................ 50
2.4.2 Gradient Descent .......................................................... 51
2.5 Feature Selection ............................................................. 52
2.5.1 Genetic Algorithm ........................................................ 52
2.5.2 Sequential Backward Selection ....................................... 53
2.5.3 Recursive Feature Elimination ........................................ 54
2.6 Summary ......................................................................... 55
3 Improving TSVM vs. SVM Accordance Based Sample Selection ... 57
3.1 Experimental Datasets and Feature Analysis ........................ 57
3.1.1 Nottingham Tenovus Primary Breast Carcinoma (NTBC) .... 58
3.1.2 Wisconsin Diagnosis Breast Cancer Dataset ..................... 61
3.2 Background and Motivation ............................................... 64
3.3 Experiments Settings ........................................................ 66
3.3.1 Support Vector Machine ................................................ 67
3.3.2 Measures for Predictive Accuracy ................................... 69
3.5 Results ............................................................................ 74
3.6 Discussion of Results ........................................................ 83
3.7 Summary ......................................................................... 86
4 Automatic Features and Samples Ranking for SVM Classifier .... 88
4.1 Introduction ..................................................................... 88
4.2 Background ..................................................................... 89
4.2.1 Feature Ranking ........................................................... 89
4.2.2 Samples Ranking .......................................................... 91
4.3 Methodology .................................................................... 94
4.3.1 Feature Selection .......................................................... 94
4.3.2 Sample Selection .......................................................... 97
4.4 Experiment Settings ........................................................ 102
4.4.1 Datasets ..................................................................... 103
4.4.2 Evaluation Measures .................................................... 103
4.4.3 Ranking Model ............................................................ 105
4.4.4 Experiments ............................................................... 106
4.5 Experimental Results ...................................................... 108
4.5.1 MADELON Data Set (Feature Ranking) ......................... 108
4.5.2 MADELON Data Set (Samples Ranking) ........................ 112
4.5.3 Nottingham Breast Cancer Data Set (Feature Ranking) ..... 113
4.5.4 Nottingham Breast Cancer Data Set (Samples Ranking) .... 115
4.6 Discussions .................................................................... 118
4.7 Summary ....................................................................... 120
5 Ensemble Weighted Classifiers with Accordance-Based Sampling ... 121
5.1 Introduction ................................................................... 121
5.2 Background .................................................................... 125
5.2.1 Ensemble Weighted Classifier ....................................... 125
5.2.2 Sampling Most Informative Sample Method (Multi Views Sample MVS) ... 128
5.3 Experimental Design ....................................................... 130
5.4 Experimental Results and Discussion ................................ 132
5.4.1 Runtime Performance Study ......................................... 147
5.5 Summary ....................................................................... 148
6 Examination of TSVM Algorithm Classification Accuracy with Feature Selection in Comparison with GLAD Algorithm ............ 150
6.1 Introduction ................................................................... 150
6.2 Background .................................................................... 150
6.2.1 Support Vector Machines ............................................. 151
6.2.2 Transductive Support Vector Machines .......................... 152
6.2.3 Recursive Feature Elimination ...................................... 156
6.2.4 Genetic Algorithms ..................................................... 159
6.3 Methods ........................................................................ 159
6.3.1 Support Vector Machines ............................................. 160
6.3.2 Transductive Support Vector Machines .......................... 162
6.3.3 Recursive Feature Elimination ...................................... 164
6.3.4 Genetic Learning Across Datasets (GLAD) ..................... 166
6.4 Experiments and Results ................................................. 168
6.4.1 Datasets ..................................................................... 168
6.4.2 TSVM Recursive Feature Elimination (TSVM-RFE) Result ... 168
6.4.3 Comparing TSVM Algorithm Result with GLAD Algorithm ... 169
6.5 Discussion of Results ...................................................... 172
6.6 Summary ....................................................................... 173
7 Conclusions and Future Work ............................................. 176
7.1 Contributions ................................................................. 176
7.3 Dissemination
7.3.1 Journal Papers ............................................................ 183
7.3.2 Conference Papers ....................................................... 183
References

List of Figures

Figure 2.1: The structure of an SLP with one neuron in the output layer. ....... 27
Figure 2.2: The overall structure of an MLP with one hidden layer. ............... 28
Figure 2.3: The overall structure of the RBF networks. ................................ 29
Figure 2.4: Illustration of support vectors for linear, non-separable patterns. .. 42
Figure 3.1: Histogram of variable CK19 ...................................................... 66
Figure 3.2: Histogram of variable P53 ........................................................ 66
Figure 3.3: Histogram of WDBC ................................................................ 66
Figure 3.4: Accordance Based Sampling TSVM vs. SVM with different percentages of labelled training data for each class ..................................................... 79
Figure 3.5: Original random sampling TSVM vs. SVM with different percentages of labelled training data with random sampling ............................................. 80
Figure 3.6: Accordance Based Sampling TSVM vs. SVM with different percentages of labelled training data for each class ..................................................... 82
Figure 3.7: Original random sampling TSVM vs. SVM with different percentages of labelled training data with random sampling ............................................. 83
Figure 4.1: Linear projection of four data points ......................................... 101
Figure 4.2: Diagram showing an example of an existing feature selection procedure ................................................................................................... 102
Figure 4.3: Ranking accuracy of Ranking SVM with different feature selection methods on the MADELON dataset ........................................................... 110
Figure 4.4: Ranking accuracy of RankNet with different feature selection methods on the MADELON dataset ........................................................................ 111
Figure 4.5: Accuracy convergence of random and selective sampling on the MADELON dataset ................................................................................... 112
Figure 4.6: Ranking accuracy of Ranking SVM with different feature selection methods on the NTBC dataset ................................................................... 114
Figure 4.7: Ranking accuracy of RankNet with different feature selection methods on the NTBC dataset ................................................................................ 115
Figure 4.8: Accuracy convergence of random and selective sampling on the NDBC dataset ..................................................................................................... 118
Figure 5.1: Performance of SVM, LR, KNN, NN, NB and Majority under 10 folds with random sampling vs. multi-view sampling ............................................ 139
Figure 5.2: Performance of SVM, LR, KNN, NN, NB and Majority under varying fold numbers with the multi-view sampling method ........................................... 143
Figure 5.3: Error bar and performance of SVM, LR, KNN, NN, NB and Majority for different dimensionality sizes with the multi-view sampling method (d = number of dimensions) ............................................................................................ 146
Figure 5.4: System runtime with respect to different fold sizes ...................... 148
Figure 6.1: Multi Margin vs. SVM Maximum Margin optimal hyperplane separation .................................................................................................. 152
Figure 6.2: Separation hyperplane (semi-supervised data) ............................. 156
Figure 6.3: Maximum margin separation hyperplane for Transductive SVM (semi-supervised data) ....................................................................................... 162
Figure 6.4: Testing error for three data sets. The 5-fold cross-validated paired t-test shows the relative differences between SVM-RFE and TSVM-RFE at the 95% confidence level (linear kernel, C = 1) ......................................................... 171

List of Tables

Table 3.1: Benchmark datasets used in this work .......................................... 57
Table 3.2: Complete list of antibodies used and their dilutions ....................... 60
Table 3.3: Comparison of results on three classifiers using all samples ........... 75
Table 3.4: Comparison of results on three classifiers using only 50 samples .... 76
Table 3.5: Average accuracies of 10 cross-validation experiments for the classifiers (standard deviation in brackets) .................................................................. 76
Table 3.6: Comparing SVM and TSVM using random sampling with different percentages of training ............................................................................... 78
Table 3.7: Comparing SVM and TSVM using accordance sampling with different percentages of training samples for each class .............................................. 78
Table 3.8: Comparing SVM and TSVM using random sampling with different percentages of training ............................................................................... 81
Table 3.9: Comparing SVM and TSVM using accordance sampling with different percentages of training samples for each class .............................................. 82
Table 5.1: Summary of the datasets employed for assessment ....................... 132
Table 5.2: Predictive accuracy of each compared algorithm under 2 folds against the majority voting system (bold 1st, italic 2nd) ................................................ 134
Table 5.3: Predictive accuracy of each compared algorithm under 5 folds against the majority voting system (bold 1st, italic 2nd) ................................................ 135
Table 5.4: Predictive accuracy of each compared algorithm under 10 folds against the majority voting system (bold 1st, italic 2nd) ........................................... 136
Table 5.5: Predictive accuracy of each compared algorithm under 2 folds against the majority voting system with the multi-view sampling method (bold 1st, italic 2nd) ................................................................................................................ 137
Table 5.6: Predictive accuracy of each compared algorithm under 5 folds against the majority voting system with the multi-view sampling method (bold 1st, italic 2nd) ................................................................................................................ 140
Table 5.7: Predictive accuracy of each compared algorithm under 10 folds against the majority voting system with the multi-view sampling method (bold 1st, italic 2nd) ................................................................................................................ 141
Table 5.8: Predictive accuracy of each compared fold count for the majority voting system with random sampling compared to the multi-view sampling method ..... 142
Table 6.1: Accuracy obtained with SVM-RFE, TSVM-RFE and GLAD .......... 172

Chapter 1
Introduction

1.1 Background

Machine learning usually refers to changes in systems that perform tasks associated with artificial intelligence, such as recognition, diagnosis, planning, robot control, and prediction [1]. Machine learning is important not only because the achievement of learning in machines might help us understand how animals and humans learn, but also for the following engineering reasons [1]: some tasks cannot be defined well except by examples; that is, we might be able to specify input/output pairs but not a concise relationship between inputs and desired outputs. We would like machines to be able to adjust their internal structure to produce correct outputs for a large number of sample inputs and, thus, suitably constrain their input/output function to approximate the relationship implicit in the examples.
Also, machine learning can be used to achieve on-the-job improvement of existing machine designs, to capture more knowledge than humans would want to write down, to adapt to a changing environment and so reduce the need for constant redesign, and to track as much new knowledge as possible.

Classification is a supervised learning procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items, and based on a training set of previously labelled items. Classification has attracted much research attention as it spans a vast number of application areas, such as medical diagnosis, speech recognition, handwriting recognition, natural language processing, document classification, and internet search engines. A classification system includes feature extraction, feature selection, classification, and model selection. Feature extraction characterises an object by measurements whose values are similar for objects in the same category but different for objects in different categories. Feature selection is performed to remove the irrelevant or redundant features that have a negative effect on the accuracy of the classifier. A classifier uses the feature vector provided by the feature extractor and feature selector to assign the object to a category. Parameters of a classifier may be adjusted by optimizing the estimated classification performance or measures of data separability. This leads to the problem of model selection.

Classification methods are so-called 'supervised algorithms'. Supervised machine learning is the search for algorithms that reason from externally supplied instances to produce general hypotheses, which then make predictions about future instances. In other words, the goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictor features. The resulting classifier is then used to assign class labels to the testing instances, where the values of the predictor features are known but the value of the class label is unknown [2].

Algorithms for supervised learning range from decision trees to artificial neural networks and from support vector machines to Bayesian classifiers. Decision tree learning, used in data mining and machine learning, uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. Learned trees can also be re-represented as sets of if-then rules to improve human readability [3].

Artificial Neural Networks (ANNs) provide a general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples. For certain types of problems, such as learning to interpret complex real-world sensor data, artificial neural networks are among the most effective learning methods currently known [3,4]. However, especially for big data sets, ANNs may become huge and produce sets of rules which are then difficult to interpret, especially for researchers not familiar with computational analysis.

Support Vector Machines (SVMs) can also be used for pattern classification and nonlinear regression. The main idea of an SVM is to construct a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximised in multidimensional space.
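The maximum-margin idea can be made concrete with a short sketch. The snippet below is a minimal illustration using scikit-learn's SVC on invented, linearly separable data; it is not the experimental setup used in this thesis, and all data and parameter values are chosen purely for demonstration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two linearly separable clouds: the negative class around (-2, -2),
# the positive class around (+2, +2).
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0)  # linear kernel: the decision surface is a hyperplane
clf.fit(X, y)

# The learned hyperplane is w.x + b = 0; only the support vectors determine it,
# and the width of the separating margin is 2 / ||w||.
w, b = clf.coef_[0], clf.intercept_[0]
print("support vectors:", len(clf.support_vectors_),
      "margin width: %.3f" % (2.0 / np.linalg.norm(w)))
```

Maximising the margin 2/||w|| is what distinguishes the SVM from other linear classifiers that merely find some separating hyperplane.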
The support vector machine can provide good generalization performance on pattern classification problems despite the fact that it does not incorporate problem domain knowledge [4]. Bayesian classifiers are based on the assumption that the quantities of interest are governed by probability distributions and that optimal decisions can be made by reasoning about these probabilities together with observed data. In addition, Bayesian learning provides a quantitative approach to weighing the evidence supporting alternative hypotheses [3].

Worldwide, cancer has become a major issue for human health. The classification of cancer patients is of great importance for prognosis. In the last few years, many unsupervised and supervised algorithms have been proposed for this task, and modern machine learning techniques are progressively being used by biologists to obtain proper tumour information from databases. The World Health Organization's Global Burden of Disease statistics identified cancer as the second largest global cause of death, after cardiovascular disease [5]. Cancer is the fastest growing segment of the disease burden; the number of global cancer deaths is projected to increase by 45% from 2007 to 2030, from 7.9 million to 11.5 million [6]. Breast cancer is the second most common type of cancer after lung cancer, with 10.4% of all cancer incidence, both sexes counted [7], and the fifth most common cause of cancer death [8]. Breast cancer is a common disease which affects mostly, but not only, women. The ability to accurately identify malignancy is crucial for prognosis and the preparation of effective treatment. Breast cancer is usually, but not always, primarily classified by its histological appearance [9]. The first subjective indication or sign of breast cancer is typically a lump that feels different from the surrounding breast tissue. More than 80% of breast cancer cases are discovered when the woman feels a lump [10]. Lumps found in lymph nodes located in the armpits can also indicate breast cancer. Whereas manual screening techniques are useful in determining the possibility of cancer, further testing is necessary to confirm whether a lump detected on screening is cancer, as opposed to a benign alternative such as a simple cyst. In a clinical setting, breast cancer is commonly diagnosed using a "triple test" of clinical breast examination (breast examination by a trained medical practitioner), mammography, and fine needle aspiration cytology. Both mammography and clinical breast examination, also used for screening, can indicate an approximate likelihood that a lump is cancer, and may also identify any other lesions. Several treatments are available for breast cancer patients, depending on the stage of the cancer. Doctors usually take many different factors into account when deciding how to treat breast cancer. These factors may include the patient's age, the size of the tumour, the type of cancer a patient has, and many more.

Cancer research produces huge quantities of data that serve as a basis for the development of improved diagnosis and therapies. Advanced statistical and machine learning methods are needed for the interpretation of primary data and the generation of new knowledge needed for the development of new diagnostic tools, drugs, and vaccines.
Identification of functional groups and subgroups of genes responsible for the development and spread of this type of cancer, as well as its subtypes, is urgently needed for proper classification and identification of key processes that can be targeted therapeutically. In addition, accurate diagnostic techniques could enable various cancers to be detected in their early stages and, consequently, the appropriate treatments could be undertaken earlier [11].

1.2 Motivation

The fundamental motivation for this research is to take the SVM framework, which is one of the most fundamental techniques in machine learning, and try to make it more useful and powerful. A serious and solid improvement in this scenario-based approach reveals many opportunities where further research in computer science problems can pay a large dividend in the quality of classification and other data mining research, as well as the quantity of results and the speed at which new research can be proposed, understood, and accomplished.

The second motivation for this work is to enhance SVM by wrapping and integrating feature selection within SVM, to make SVM more applicable and practical for real datasets. This can be implemented by using multi-view feature selection, because one selected feature set may perform well on a certain dataset but may still not be the best feature set. The main aim is to find an 'optimum feature set' which should perform reasonably well. After selecting a number of best-performing feature sets, having a multi-view feature set per dataset should lead to the best performance. One way to search for the optimum feature set is to combine all the feature sets into one big set consisting of the union of all individual sets, and to rank them.

The third motivation for the current research has been to examine SVM and TSVM (transductive support vector machines, which have been widely used as a means of treating partially labelled data in semi-supervised learning) as supervised and semi-supervised algorithms able to classify and categorise data into sub-groups. In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations or instances whose category membership is known. The individual observations are analysed into a set of quantifiable properties, known variously as explanatory variables, features, etc. These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure). Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be discretized into groups (e.g. less than 5, between 5 and 10, or greater than 10). An example would be assigning a given email to the "spam" or "non-spam" class, or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.). In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available.
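As a concrete illustration of the discretization step mentioned above, the sketch below bins a real-valued measurement into the three groups "less than 5", "between 5 and 10" and "greater than 10", and then trains a classifier that works on discrete data. The readings, labels and thresholds are invented for the example, and the use of scikit-learn's CategoricalNB is an assumption made only for illustration.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

def discretize(v):
    """Map a real-valued measurement to an ordinal group: <5 -> 0, 5-10 -> 1, >10 -> 2."""
    return 0 if v < 5 else (1 if v <= 10 else 2)

# Hypothetical blood-pressure-like readings with known class labels (the training set).
readings = np.array([3.2, 7.5, 12.1, 4.8, 9.9, 15.0])
labels = np.array([0, 0, 1, 0, 1, 1])  # e.g. 0 = benign, 1 = malignant

X = np.array([[discretize(v)] for v in readings])
clf = CategoricalNB().fit(X, labels)

# A new real-valued observation is discretized the same way before prediction.
print(clf.predict([[discretize(11.3)]]))  # -> [1]
```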
The corresponding unsupervised procedure is known as clustering (or cluster analysis), and involves grouping data into categories based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space).

The fourth motivation of this research is to expand the use of SVM to provide more useful ways of dealing with large amounts of data. Over the past few decades, rapid developments in data analysis technologies and in information technologies have combined to produce an incredible amount of information, and analysing all this information could help with decision-making in many fields. We therefore need to exploit these technological advances, and the consequent fast growth of datasets, algorithms, computational and statistical techniques, and theory, to solve formal and practical problems arising from the management and analysis of data. Data mining (the analysis step of the "knowledge discovery in databases" process, or KDD), an interdisciplinary sub-field of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

1.3 Aims and Objectives

The objective of this research is to develop an SVM based on one of the most powerful machine learning algorithms for classification, integrating feature selection and model selection to improve the classification performance. The ultimate goals of this multi-disciplinary research project concern both decision-making and technical aspects. This work aims to help in the field of decision-making through the widely used approach of considering classification techniques in which different methods are investigated and results are derived from a consensus between techniques. Considering a more technical and computational aspect, the aim is to develop an original framework to elucidate core representative classes in a given dataset.

Several research questions and hypotheses underlying the overall work were identified at the beginning of the project. Starting from the already published studies on machine learning techniques, it was noted that an extended review and comparison of different classification algorithms had not yet been carried out. This knowledge gap led to the formulation of the following research questions: Can sample selection methods improve machine learning (TSVM, SVM) to provide more accurate classification results? Is it possible to find an automated way to rank features and samples and use them to classify new records in the future? Is there a way to combine the results obtained by the multi-classifier approach? Can feature selection improve machine learning (TSVM, SVM) to provide more accurate classification results?

In order to achieve the aims stated above and to answer the research questions, the following objectives were identified:

(i) To establish standard methodologies for multi-classifier approaches for categorising the data into the right group.
(ii) To investigate the effect on classification accuracy of reducing the number of features or samples in the various data sets available, using both semi-supervised and supervised (SVM and TSVM) methods.

(iii) To investigate different computational analysis methods applicable across different types of data sets.

(iv) To determine an effective method to evaluate classification results and to combine them into a set of representative groups of different characteristics.

(v) To develop an automated supervised and semi-supervised classification algorithm using SVM and TSVM that can be applied to any possible source of data, independent of their underlying distributions.

1.4 Thesis Organisation

This thesis is structured as follows. Chapter 2 presents a literature review of various classification approaches developed in the past to categorise data points with high similarity. A review of different classification methods used in the literature to classify various data sets, including breast cancer data, is also reported. Classification validity is introduced in this chapter as a technique to assess the quality of classification results, and as a method to select the best number of features to consider in the analysis. Several validity measures used in this thesis are reviewed and analysed in detail. In addition, a general description of techniques developed for consensus classification is reported, explaining various methods to assess the comparison and the accord among different classification approaches. To conclude the overview of classification algorithms, the feature selection approach is described together with the most commonly used algorithms. The chapter ends with a review of semi-supervised and supervised classification methods, which are used to build models of the class label distribution and to predict the class assignment of possible new objects.

Chapter 3 is dedicated to measures that aim to evaluate the separability of features. A comparison was made between the performance of the original random-sampling SVM and TSVM on the one hand, and the new accordance-sampling method on the other. Accuracy was used to measure the performance strength of the model. The accordance sampling method was applied to TSVM and SVM on a couple of datasets, and each method was run 10 times on each dataset. The original SVM and TSVM algorithms were then run using the same setup to measure their performance. To investigate the performance of accordance sampling on breast cancer classification, unlabelled sampling pools of 10%, 20%, 30%, etc. of the training data for each class were created for each dataset. Then two examples from the pool were randomly selected, giving the initial labelled examples. Thus, the learner obtained the remaining unlabelled examples, two labelled examples and the test classifier.

In Chapter 4 supervised classification methods are used to validate the classification. It presents the modelling of automated feature and sample ranking using Support Vector Machines (SVM). A new technique for the feature selection and ranking setting is suggested in this chapter. In biomedical data classification, feature selection can help to remove unimportant or recurring features and so avoid classifier complexity. We have carried out many experiments to check the performance of the suggested method in ranking medical data, and the method proves its ability to outperform traditional feature selection methods.
Our experiments are based on two main benchmark datasets. The first, MADELON, is a synthetic dataset for ranking; it includes around 4,400 instances with binary relevance judgments. The Nottingham Tenovus Primary Breast Cancer (NTBC) dataset [12] is the second dataset.

Chapter 5 provides a description of the new weighted-voting classification ensemble method, based on a classifier combination scheme and a novel multi-view sampling method. Because supervised labelling of data is costly, effort can be saved by avoiding the labelling of examples that carry little information, in the spirit of active learning; the various types of active learning addressing the semi-supervised learning problem are therefore central to our classification approach. The work here revolves around five classifiers: support vector machine, naïve Bayes, k-nearest neighbour, logistic regression and neural network. In simple majority voting all classifiers have equal weights; hence, if the classifiers make different predictions for an instance, the final decision becomes arbitrary due to tied votes. Assuming that classifiers tend to classify correctly the most informative instances, selected by the new multi-view sampling method, it is reasonable, when the classifiers make different predictions on an unseen instance, to give more weight to the classifiers whose predictions agree with the majority.

Chapter 6 proposes to observe the performance of Transductive SVMs (TSVM) combined with a feature selection method called recursive feature elimination (RFE), which we use to select features for TSVMs for the first time. The goal is to examine the classifiers' accuracy and classification errors using the TSVM method. This is in order to determine whether this method is an effective model when combined with recursive feature elimination, compared with another algorithm called Genetic Learning Across Datasets (GLAD). On average, the results for semi-supervised learning surpass those for supervised learning. However, it is also shown that the GLAD algorithm outperforms SVM-RFE when only the labelled data are used. On the other hand, TSVM-RFE exceeds GLAD when unlabelled data are used along with labelled data. It performs much better with gene selection, and performs well even if the labelled data set is small.

The last chapter of this thesis, Chapter 7, concludes the work, drawing out the main contributions and highlighting several possible directions for future research. A list of publications and oral presentations derived from this thesis is reported at the end of the chapter.

Chapter 2
Literature Review

This chapter presents the basics of machine learning, focusing on the topic of classification, with an introduction to machine learning and its applications set out in section 2.1. A brief survey of the prevalent kinds of machine learning algorithms is provided in section 2.2. Classification, as one of the most common learning techniques used by scientists as well as the focal point of this PhD study, can be regarded as a typical formulation of the supervised learning task. Using a consensus classification resolution concerning the correspondence of types of breast cancer is a vital point of this work. A broad overview of the principal classification methods in use is presented in section 2.3. Several optimization algorithms are revisited in section 2.4, as the majority of machine learning algorithms either use optimization or are instances of optimization algorithms.
This chapter has two main goals: firstly, to provide relevant fundamental information concerning all the research subjects which have been used in the development of the original framework, and secondly, to indicate the gaps in the body of knowledge. This provides the motivation for the thesis: to evolve a framework for the purpose of making the core classes in a data set clear, applicable to any accessible source of data.

2.1 Introduction

Machine learning is a sub-field of artificial intelligence. The design and development of machine learning algorithms and techniques have enabled computers to "learn". In Mitchell's definition [13], it is:

"A computer program that learns from experience E regarding some class of tasks T and performance measure M, provided that its performance at tasks in T, as measured by M, improves with experience E."

In the last 50 years, the study of machine learning has developed from the efforts of a few computer engineers attempting to find out whether computers could learn to play games, into a field of statistics that clearly passes beyond computational considerations. Studies have led to major statistical and computational theories of learning operations and to learning algorithms that are routinely used in commercial systems, from speech recognition to computer vision, and have spun off an industry in data mining to find out the underlying rules within the spectacular volume of data now available from the internet [14].

A number of choices are involved in designing a machine learning approach, including choosing the type of training experience, the target function to be learned, a representation for this target function, and an algorithm for learning the target function from training samples [13]. Machine learning is naturally a multidisciplinary field, which draws on results from artificial intelligence, probability and statistics, optimization theory, computational complexity theory, control theory, information theory, philosophy and other fields.

There are countless applications of machine learning, such as natural language processing [15,16,17], handwriting recognition [18,19,20,21], face and fingerprint recognition [22,19,20,23], search engines [24,25], medical analysis [26,27,28,29,30,31,32,33], and bioinformatics and cheminformatics [34,35,36]. In addition, they include detecting credit card fraud [37], analysing the stock market [38], classifying DNA sequences [39], object recognition in computer vision [40], compressing images [41], playing games [42,43], machinery movement and robot locations [44], and machine condition monitoring [45,46,47,48,49,50,51,52].

2.2 Types of Algorithms

Machine learning algorithms are classified according to the desired results. Prevalent kinds of algorithm include supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, transduction and, finally, multi-task learning.

2.2.1 Supervised Learning

The main subject of this thesis is the SVM, which is a type of supervised learning, so it is worth giving some brief information about the different types of learning. To start with, supervised learning is a machine learning technique used to create a function from a group of training samples containing pairs of primary objects (exemplary feature vectors) and required outcomes (results). The function outcome can take values with no limit (regression), or can predict a label for the category of the input data (classification).
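The difference between the two outcome types can be sketched in a few lines. The snippet below fits a regressor (an unbounded real-valued outcome) and a classifier (a categorical label) to the same exemplary feature vectors; the data is synthetic and the choice of scikit-learn estimators is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # exemplary feature vectors
y_real = np.array([1.1, 1.9, 3.2, 3.8])     # required outcomes: real values
y_label = np.array([0, 0, 1, 1])            # required outcomes: category labels

reg = LinearRegression().fit(X, y_real)     # regression: value with no limit
clf = LogisticRegression().fit(X, y_label)  # classification: a label for the input

print(reg.predict([[2.5]]))  # a real number, roughly 2.5
print(clf.predict([[2.5]]))  # a class label, 0 or 1
```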
Supervised learning mainly aims to predict the value of the function for any valid input object after learning from a number of training samples (i.e. pairs of input feature vectors and output targets). Dealing with a supervised learning problem requires several stages:

1. Determine the kind of training samples, which could be, for example: a single feature from a patient's record, all features for that one patient, or all features from the records of several patients.

2. Gather a training set reflecting the real-life background of the problem. Consequently, we assemble a set of the main data and the corresponding results, either from the manual efforts of scientists or mechanically.

3. Determine the input feature representation of the learned function (feature extraction). The accuracy of the learned function depends highly on the quality of the input representation. The input data on an object becomes a feature vector comprising several features describing the object. We should use enough features to allow adequate prediction accuracy for the outcome, while avoiding the dilemma of high dimensionality.

4. Identify the structure of the learned function and the corresponding learning algorithm.

5. Finalise the model. We should run the learning algorithm on the gathered training set. We can tune the learning algorithm parameters by evaluating performance on a subset of the training set (namely a validation set) or by means of cross-validation. After learning and tuning the parameters, a test set which has been kept separate from the training set should be used to measure the performance.

2.2.2 Unsupervised Learning

Unsupervised learning [53] is a method of machine learning where a model is fit to observations. It is distinguished from supervised learning by the fact that there is no a priori output. In unsupervised learning, a data set of input objects is gathered and treated as a set of random variables. A joint density model is then built for the data set. Unsupervised learning can be used in combination with Bayesian inference to produce conditional probabilities for any of the random variables given the others. A holy grail of unsupervised learning is the creation of a factorial code of the data, which may make a later supervised learning method work better when the raw input data is first translated into a factorial code. Unsupervised learning is also useful for data compression. Another form of unsupervised learning is clustering [54], which is sometimes not probabilistic.

2.2.3 Semi-Supervised Learning

Semi-supervised learning [55] in computer science is a machine learning technique which depends, for training, on both labelled and unlabelled data (though generally more on unlabelled data). Our proposed technique is a form of semi-supervised learning, since it is intermediate between unsupervised learning (which does not depend on labelled data at all) and supervised learning (which, on the contrary, depends entirely on labelled data). Many computer science researchers [56,57] have found that mixing unlabelled and labelled data can enhance the precision of learning. Labelled data for a learning problem can only be acquired from skilled and qualified experts who are able to sort the training samples. The process of labelling is very costly, which may completely prevent the use of a fully labelled training set, while unlabelled data is widely available. Therefore, we can benefit from semi-supervised learning.
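One common way of benefiting from unlabelled data is self-training, sketched below: a classifier trained on the small labelled pool pseudo-labels the unlabelled examples it is most confident about, and is then retrained on the enlarged pool. This is a generic illustration on invented data, assuming scikit-learn's SVC; it is not the accordance-based sampling method proposed later in this thesis, which asks domain experts, rather than the classifier itself, to label the selected examples.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
# A few costly labelled examples and a large pool of cheap unlabelled ones.
X_lab = np.vstack([rng.randn(5, 2) - 2, rng.randn(5, 2) + 2])
y_lab = np.array([0] * 5 + [1] * 5)
X_unlab = np.vstack([rng.randn(50, 2) - 2, rng.randn(50, 2) + 2])

clf = SVC(kernel="linear", probability=True).fit(X_lab, y_lab)
for _ in range(3):  # a few self-training rounds
    if len(X_unlab) == 0:
        break
    proba = clf.predict_proba(X_unlab)
    confident = proba.max(axis=1) > 0.95  # pseudo-label only confident examples
    if not confident.any():
        break
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, proba.argmax(axis=1)[confident]])
    X_unlab = X_unlab[~confident]
    clf = SVC(kernel="linear", probability=True).fit(X_lab, y_lab)

print("labelled pool grew to", len(X_lab), "examples")
```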
2.2.4 Reinforcement Learning

Reinforcement learning [58,59,60] is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximise some notion of long-term reward. Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states. The environment is typically formulated as a finite-state Markov decision process (MDP), and reinforcement learning algorithms for this context are highly related to dynamic programming techniques. State transition probabilities and reward probabilities in the MDP are typically stochastic but stationary over the course of the problem. Reinforcement learning differs from the supervised learning problem in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off in reinforcement learning has been studied mostly through the multi-armed bandit problem.

2.2.5 Transduction

Vapnik [61] presented transduction in the last decades of the twentieth century, motivated by his view that transduction is preferable to induction, since induction requires solving a more general problem (inferring a function) before solving a more specific problem (computing outputs for new cases). Vapnik also added that: "When solving a problem of interest, do not solve a more general problem as an intermediate step. Try to get the answer that you really need but not a more general one." [61] Binary classification is a clear instance where non-inductive learning helps, as the cluster structure of a large set of test inputs can make the classification labels clearer by providing useful information about the labels. We can clearly consider this an instance of semi-supervised learning. A transductive support vector machine (TSVM) [61] provides an instance of an algorithm in this category. The need for approximation may be regarded as a third motivation for transduction; the Bayesian committee machine (BCM) [62] is another instance of an algorithm belonging to this category.

2.2.6 Multi-task Learning

Multi-task learning [63] never deals with a single problem in isolation, as its methodology relies on tackling all related problems at the same time. Since the learner is able to utilise the commonality among the tasks, an improved model is obtained for the principal task; for these reasons, multi-task learning proves to be a type of inductive transfer.

2.3 Classification

This section introduces classification. As this is the main subject of this thesis, almost all the chapters will cover at least some aspects of classification, so it is important to give a brief introduction to classification and its methods. One standard formulation of the supervised learning task is classification, including binary and multi-class classification.

Binary classification is the task of classifying the input samples into two groups on the basis of whether they have some common property or not, as in medical diagnosis [26,27,28,29,30,31,32,33]. Given a set of $l$ labelled training samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, where $\mathbf{x}_i \in \mathbb{R}^n$ lies in the $n$-dimensional real feature space, $\mathcal{Y} = \{+1, -1\}$ is a binary label space, and $y_i \in \mathcal{Y}$ is the label assigned to sample $\mathbf{x}_i$, the aim of binary classification is to seek a function $f: \mathbb{R}^n \to \mathcal{Y}$ that best predicts the label for the input sample.
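The binary formulation above can be made concrete with a small sketch: a candidate decision function f: R^n -> {+1, -1} and its empirical accuracy on the l labelled samples. The weight vector, bias and data below are invented purely for illustration.

```python
import numpy as np

def f(x, w=np.array([1.0, -0.5]), b=0.2):
    """A candidate linear decision function mapping R^2 to {+1, -1}."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# l = 4 labelled training samples (x_i, y_i) with y_i in {+1, -1}.
X = np.array([[1.0, 0.0], [0.0, 2.0], [2.0, 1.0], [-1.0, 0.5]])
y = np.array([1, -1, 1, -1])

# "Best predicts the label" is judged empirically: the fraction of samples
# on which the candidate function agrees with the given labels.
accuracy = np.mean([f(x) == yi for x, yi in zip(X, y)])
print(accuracy)
```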
Multi-class classification is the task of assigning the input samples to one of multiple categories. One conventional way to extend binary classifiers to a multi-class scenario is to decompose a multi-class problem into a series of two-class problems using the one-against-all implementation [64]. To solve a $c$-class classification problem with a label space $\mathcal{Y} = \{1, 2, \ldots, c\}$, $c$ binary classifiers can be constructed, each separating one class from the remaining classes.

2.3.1 Linear Classifiers

Fisher [65] suggested a method of linear classification called Fisher Linear Discriminant Analysis (FLDA), which seeks the separating function that best separates two or more classes of samples based on the ratio of the between-class and within-class scatter. The separating function, given as

$$f(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + b \tag{2.1}$$

is determined by maximising the objective

$$J(\mathbf{w}) = \frac{\mathbf{w}^{T} S_B \, \mathbf{w}}{\mathbf{w}^{T} S_W \, \mathbf{w}} \tag{2.2}$$

where $\mathbf{w}$ and $b$ represent the weight vector and the offset respectively, and $S_B$ and $S_W$ stand for the between-class and within-class scatter matrices, given by

$$S_B = (\mathbf{m}_{+} - \mathbf{m}_{-})(\mathbf{m}_{+} - \mathbf{m}_{-})^{T} \tag{2.3}$$

$$S_W = \sum_{\mathbf{x}_i \in \mathcal{X}_{+}} (\mathbf{x}_i - \mathbf{m}_{+})(\mathbf{x}_i - \mathbf{m}_{+})^{T} + \sum_{\mathbf{x}_i \in \mathcal{X}_{-}} (\mathbf{x}_i - \mathbf{m}_{-})(\mathbf{x}_i - \mathbf{m}_{-})^{T} \tag{2.4}$$

$$\mathbf{m}_{+} = \frac{1}{l_{+}} \sum_{\mathbf{x}_i \in \mathcal{X}_{+}} \mathbf{x}_i \tag{2.5}$$

$$\mathbf{m}_{-} = \frac{1}{l_{-}} \sum_{\mathbf{x}_i \in \mathcal{X}_{-}} \mathbf{x}_i \tag{2.6}$$

where $\mathbf{x}_i \in \mathcal{X}_{+}$ is a sample of the positive class (+), while $\mathbf{x}_i \in \mathcal{X}_{-}$ is a sample of the negative class (−); $l_{+}$ refers to the number of positive training samples, while $l_{-}$ refers to the number of negative training samples; and $\mathcal{X}_{+}$ and $\mathcal{X}_{-}$ denote the subsets of positive and negative training samples respectively. The optimal values of $\mathbf{w}$ and $b$ can be calculated by solving a generalised eigenvalue problem [66]. Letting $f^{*}$ denote the derived optimal separating function, the label for an input sample $\mathbf{x}$ is predicted by

$$\hat{y} = \operatorname{sgn}(f^{*}(\mathbf{x})) \tag{2.7}$$

where $\operatorname{sgn}(\cdot)$ is $+1$ when its argument is non-negative and $-1$ otherwise, and $\hat{y}$ is an estimate of the label for the input sample $\mathbf{x}$.

Logistic Regression

Logistic classification is based on a model called logistic regression (LR), which allows scientists to predict outcomes [67]. This method will be used in Chapter 5. In a logistic classification model, the event of interest is the membership of a vector in one of the two classes concerned. The technique models a variable confined to the range [0, 1] given the input properties, which can therefore be interpreted as a probability. For an input property vector $\mathbf{x}$, the LR model can be written as

$$P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp\left(-(\mathbf{w}^{T}\mathbf{x} + b)\right)} \tag{2.8}$$

In linear regression, the model is fitted using the technique of least squares: the regression model is generated from the smallest sum of squared distances between the observed and predicted values of the dependent variable. The parameters of the LR model, in contrast, are estimated using the maximum likelihood method [68]: the coefficients that make the observed results "most likely" are selected. The label of an input sample $\mathbf{x}$ is then predicted by

$$\hat{y} = \begin{cases} +1, & P(y = 1 \mid \mathbf{x}) \geq 0.5 \\ -1, & \text{otherwise} \end{cases} \tag{2.9}$$

Naïve Bayes Classifier

Part of Chapter 5 concerns the naïve Bayes classifier, so it is important to give the reader some background on this. A naïve Bayes classifier (NBC) [69] is a simple probabilistic classifier that applies Bayes' theorem with strong (naïve) independence assumptions. Let $P(C_k)$ be the probability of occurrence of class $C_k$; this is known as the a priori probability. The a posteriori probability that an observed sample $\mathbf{x}$ comes from $C_k$ is expressed as $P(C_k \mid \mathbf{x})$. According to the Bayes rule [70],

$$P(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, P(C_k)}{p(\mathbf{x})}$$

where $p(\mathbf{x})$ is the unconditional probability density function (PDF) of $\mathbf{x}$, and $p(\mathbf{x} \mid C_k)$ is the likelihood function of class $C_k$. The NBC simply assumes that the features are independent given the class $C_k$, so that

$$P(C_k \mid \mathbf{x}) = \frac{P(C_k)}{p(\mathbf{x})} \prod_{j=1}^{n} p(x_j \mid C_k)$$

where $p(\mathbf{x})$ is a scaling factor dependent only on $\mathbf{x}$, i.e. a constant if the values of the feature vector are known. Models of this form are much more manageable, since they factor into a class prior $P(C_k)$ and independent probability distributions $p(x_j \mid C_k)$, whose model parameters can be approximated from the training samples. The decision function of the Bayes classifier is given as

$$\hat{y} = \arg\max_{k} P(C_k) \prod_{j=1}^{n} p(x_j \mid C_k)$$

Since the class-conditional feature distributions are decoupled, each distribution can be estimated as a one-dimensional distribution in itself. This partially alleviates problems caused by dimensionality, such as the need for data sets that grow with the number of features.

2.3.2 Artificial Neural Networks

Research on Artificial Neural Networks (ANN) has been motivated by the recognition that our own brain works in an entirely different way from an ordinary digital computer. The manner in which the brain works is characterised by complexity, parallelism and nonlinearity. Artificial neural networks offer useful properties and capabilities: nonlinearity, input-output mapping, adaptation, evidential response, contextual information, fault tolerance, large-scale implementability, uniformity of analysis and design, and neurobiological analogy [71].

Single-layer Perceptrons

The simplest form of a layered Artificial Neural Network (ANN) is the single-layer perceptron (SLP), with an input layer of source nodes that projects onto an output layer of neurons. Single-layer perceptrons can only classify linearly separable patterns [71]. Figure 2.1 illustrates the structure of an SLP with one neuron in the output layer. Such an SLP built on a single output neuron is limited to performing binary classification; the label for an input sample $\mathbf{x}$ is predicted by

$$\hat{y} = \operatorname{sgn}(\mathbf{w}^{T}\mathbf{x} + b)$$

where $\mathbf{w}$ is the weight vector of the output neuron.

Figure 2.1: The structure of an SLP with one neuron in the output layer.

Multi-layer Perceptrons

A multi-layer perceptron (MLP) [71], with one input layer, one or more hidden layers, and an output layer, has the following distinct characteristics:

- Each neuron in the network embraces a nonlinear activation function, namely the logistic function [71], whose commonly used form is a sigmoidal nonlinearity.
- The hidden neurons enable the network to learn complex tasks by gradually extracting more meaningful features from the given feature vectors.
- The network synapses provide a high degree of connectivity. Any change in the population or weights of the synaptic connections necessarily leads to a change in network connectivity.

Figure 2.2: The overall structure of an MLP with one hidden layer.

The overall structure of an MLP with one hidden layer is shown in Fig. 2.2. The back-propagation algorithm [71] can be employed to train an MLP by minimising the average squared error energy over all the training samples.

Radial Basis Function Networks

To perform a complex pattern classification task, radial basis function (RBF) networks [71] transform the classification task into a high-dimensional space in a nonlinear manner, involving the following three layers:

- Input layer: composed of sensory units, which play the role of connectors between the network and the environment.
Naïve Bayes Classifier

Part of Chapter 5 concerns the naïve Bayes classifier, so it is important to give the reader some background on it. A naïve Bayes classifier (NBC) [69] is a simple probabilistic classifier that applies Bayes' theorem with strong (naïve) independence assumptions. Let P(C_k) be the probability of occurrence of class C_k; this is known as the a priori probability. The a posteriori probability that an observed sample x comes from C_k is expressed as P(C_k | x). According to the Bayes rule [70],

    P(C_k | x) = P(C_k) p(x | C_k) / p(x)    (2.10)

where p(x) is the unconditional probability density function (PDF) of x, and p(x | C_k) is the likelihood function of class C_k. The NBC simply assumes that the features x_1, ..., x_n are independent given the class C_k, that is,

    p(x | C_k) = Π_{j=1}^{n} p(x_j | C_k)    (2.11)

thus

    P(C_k | x) = (1/Z) P(C_k) Π_{j=1}^{n} p(x_j | C_k)    (2.12)

where Z = p(x) is a scaling factor dependent only on x, i.e. a constant if the values of the feature vector are known. Models of this form are much more manageable, since they factor into the class prior P(C_k) and the independent probability distributions p(x_j | C_k), whose model parameters can be estimated from the training samples. The decision function of the naïve Bayes classifier is given as:

    ŷ = argmax_k P(C_k) Π_{j=1}^{n} p(x_j | C_k)    (2.13)

Since the class-conditional feature distributions are decoupled, each distribution can be estimated as a one-dimensional distribution in its own right. Thus we partially avoid some of the problems caused by the curse of dimensionality, such as the need for training sets that scale exponentially with the number of features.

2.3.2 Artificial Neural Networks

Research on artificial neural networks (ANN) reflects the fact that the brain works in an entirely different way from a conventional digital computer: the manner in which the brain operates is characterised by complexity, parallelism, and nonlinearity. Artificial neural networks offer a number of useful properties and capabilities: nonlinearity, input-output mapping, adaptivity, evidential response, contextual information, fault tolerance, VLSI implementability, uniformity of analysis and design, and neurobiological analogy [71].

Single-layer Perceptrons

The simplest form of a layered artificial neural network is the single-layer perceptron (SLP), with an input layer of source nodes that projects onto an output layer of neurons. Single-layer perceptrons can only classify linearly separable patterns [71]. Figure 2.1 illustrates the structure of an SLP with one neuron in the output layer. Such an SLP, built on a single output neuron, is limited to performing binary classification; the label for an input sample x is predicted by ŷ = sgn(w^T x + b), where w is the weight vector of the output neuron.

Figure 2.1: The structure of an SLP with one neuron in the output layer.

Multi-layer Perceptrons

A multi-layer perceptron (MLP) [71], with one input layer, one or more hidden layers, and an output layer, has the following distinctive features:

Each neuron in the network contains a nonlinear activation function, namely the logistic function [71], whose commonly used form is a sigmoidal nonlinearity.

The hidden layers enable the network to learn complex tasks by progressively extracting meaningful features from the given feature vectors.

The network synapses provide a high degree of connectivity; any change in the population or weights of the synaptic connections necessarily leads to a change in network connectivity.

Figure 2.2: The overall structure of an MLP with one hidden layer.

The overall structure of an MLP with one hidden layer is shown in Fig. 2.2. The back-propagation algorithm [71] can be employed to train an MLP by minimizing the average squared error energy over all the training samples.
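As a concrete illustration of such training, here is a minimal sketch assuming scikit-learn's MLPClassifier as tooling; the one-hidden-layer architecture, the synthetic data, and the hyper-parameters are illustrative assumptions, not settings used in this thesis.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 10))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)   # a nonlinear target an SLP cannot fit

    mlp = MLPClassifier(hidden_layer_sizes=(16,),  # one hidden layer, cf. Fig. 2.2
                        activation="logistic",     # sigmoidal nonlinearity
                        solver="sgd",              # gradient-based training,
                        max_iter=2000,             # i.e. back-propagation style
                        random_state=0)
    mlp.fit(X, y)
    print("training accuracy:", mlp.score(X, y))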
Radial Basis Function Networks

To perform a complex pattern classification task, radial basis function (RBF) networks [71] transform the classification task into a high-dimensional space in a nonlinear manner, involving the following three layers:

Input layer: made up of sensory units, which play the role of connectors between the network and its environment.

Hidden layer: acts as a nonlinear transform from the input space X to the hidden space F, like a kernel function (see Section 2.3.3). One or more layers of hidden neurons are involved in the network, which acquires its complexity by means of these hidden neurons.

Output layer: supplies the response of the network to the input pattern, and defines the separating function in the transformed hidden space.

Figure 2.3: The overall structure of the RBF networks.

The nonlinear transformation is made up of m real-valued functions, where m denotes the number of neurons in the hidden layer. For an input x, the mapping takes the following form:

    φ(x) = [φ_1(x), φ_2(x), ..., φ_m(x)]^T    (2.14)

The structure of the RBF network is shown in Fig. 2.3. In the transformed feature space F, the RBF network determines the separating function

    f(x) = w^T φ(x) + b    (2.16)

by minimizing the following cost function:

    E(f) = Σ_{i=1}^{l} (y_i − f(x_i))² + λ ‖D f‖²    (2.17)

where w denotes the weight vector and b the bias of the separating function in the transformed feature space F, λ denotes the regularization parameter, and D denotes a linear differential operator. Letting Φ denote the matrix of hidden-layer responses, with entries Φ_ij = φ_j(x_i), as the regularization parameter λ approaches zero, the optimal value of w approaches the pseudo-inverse solution of the over-determined least-squares data-fitting problem for Φ w = y, provided by [72]

    w = Φ⁺ y    (2.18)

where y denotes a column vector of the labels of all the training samples. The label for an input sample x is then predicted by ŷ = sgn(f(x)), as in Eq. (2.7).

Self-organizing Maps

Self-organizing maps (SOM) [73] map data sets of high dimensionality onto one- or two-dimensional lattices; it is possible, but not common, to use higher-dimensional lattices. SOMs are made up of two layers of neurons: a one-dimensional input layer and a two-dimensional (2D) competitive layer organized as a 2D lattice of neurons. Each neuron j in the competitive layer holds a weight vector w_j of the same dimensionality as the input space. Training the competitive layer moves the weight vectors of the competitive neurons toward the input samples. If the competitive layer includes m neurons, the best-matching neuron, the winner, for a given sample x is the one whose index satisfies

    i(x) = argmin_{1 ≤ j ≤ m} ‖x − w_j‖    (2.19)

At the t-th iteration, the weights of the winning neuron, in addition to those of all the other neurons in the competitive layer, are adapted toward the given sample x in accordance with the following rule:

    w_j(t+1) = w_j(t) + η(t) h_{j,i(x)}(t) [x − w_j(t)]    (2.20)

where η(t) is the learning rate and h_{j,i(x)}(t) indicates the neighbourhood function of the lateral distance between the winning neuron i(x) and the excited neuron j in the 2D lattice; η, h, and the number of iterations are training parameters set by the user. The magnitude of the change decreases with time and with the lateral distance from the winning neuron.
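A toy update step illustrating Eqs. (2.19)-(2.20) follows; this is a minimal numpy sketch in which the lattice size, the learning rate, and the Gaussian form of the neighbourhood are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    grid_h, grid_w, dim = 5, 5, 3
    weights = rng.normal(size=(grid_h, grid_w, dim))   # one weight vector per node

    def som_step(weights, x, eta=0.1, sigma=1.0):
        # Eq. (2.19): the winner is the node closest to x in Euclidean distance.
        d = np.linalg.norm(weights - x, axis=2)
        wi = np.unravel_index(np.argmin(d), d.shape)
        # Gaussian neighbourhood over lattice distance to the winning node.
        rows, cols = np.indices((grid_h, grid_w))
        lat2 = (rows - wi[0]) ** 2 + (cols - wi[1]) ** 2
        h = np.exp(-lat2 / (2.0 * sigma ** 2))
        # Eq. (2.20): pull every node toward x, scaled by the neighbourhood.
        return weights + eta * h[..., None] * (x - weights)

    weights = som_step(weights, rng.normal(size=dim))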
2.3.3 Kernel-based Classifiers

Aizerman was the first to exploit kernel functions in machine learning as inner products in a corresponding feature space. Kernel techniques perform pattern analysis by embedding the data in a suitable feature space, and algorithms based on linear algebra, geometry, and statistics are then used to discover patterns in that data. A kernel technique is made up of two components: a module which carries out the mapping into an empirical feature space F, and a learning algorithm applied in order to detect linear patterns in that space. A kernel function is a computational shortcut that allows linear patterns in high-dimensional empirical feature spaces to be represented efficiently, without giving up representational power. Research on kernel-based classifiers [74] focuses on four main aspects:

Samples in the original feature space X are embedded into an empirical feature space F.

Linear relations are sought among the embedded feature vectors.

The algorithms are implemented in such a way that the coordinates of the embedded feature vectors are never needed, only their pairwise inner products.

A kernel function allows the pairwise inner products to be computed directly from the original feature vectors.

Kernel Functions

Kernel functions provide a powerful and fundamental way to discover nonlinear relations using well-understood linear algorithms in a suitable feature space. In the kernel-based classification model, the kernel matrix acts as a bottleneck: it is the sole source of information available to the learning algorithm.

Inner Product Space

A vector space X over the reals is an inner product space if there exists a real-valued symmetric bilinear (linear in each argument) map ⟨·,·⟩ that satisfies

    ⟨x, x⟩ ≥ 0, with equality only for x = 0    (2.24)

The bilinear map is known as the inner, dot, or scalar product. In the real feature space X ⊆ R^n, the standard inner product between two vectors x and z is given by

    ⟨x, z⟩ = x^T z = Σ_{j=1}^{n} x_j z_j    (2.25)

An inner product space is sometimes referred to as a Hilbert space, though most researchers require the additional properties of completeness and separability, as well as sometimes requiring that the dimension be infinite [74].

Gram Matrix

Given a set of feature vectors {x_1, ..., x_l}, the Gram matrix is defined as the l × l matrix G whose entries are G_ij = ⟨x_i, x_j⟩. If a kernel function k is used to evaluate the inner products in the transformed feature space with nonlinear mapping φ, the associated Gram matrix is referred to as the kernel matrix, denoted by K, with entries given by

    K_ij = ⟨φ(x_i), φ(x_j)⟩ = k(x_i, x_j)    (2.26)

Different kernel functions can be designed based on their closure properties [74].

Kernel Forms

Three types of kernel function are used in this work, namely the Gaussian, Cauchy, and triangle kernels, defined as follows.

Gaussian kernel (RBF kernel):

    k(x, z) = exp(−‖x − z‖² / (2σ²))    (2.27)

Cauchy kernel [75]:

    k(x, z) = 1 / (1 + ‖x − z‖² / σ²)    (2.28)

Triangle kernel [75]:

    k(x, z) = max(0, 1 − ‖x − z‖ / σ)    (2.29)

where σ is the kernel width set by the user. A more versatile RBF kernel with a different kernel width for each feature can also be used, and is given as

    k(x, z) = exp(−Σ_{j=1}^{n} (x_j − z_j)² / (2σ_j²))    (2.30)

where σ_j is the kernel width for the j-th feature, set by the user.

Kernel Fisher Discriminant Analysis

Kernel Fisher discriminant analysis (KFDA), formulated by Mika et al. [76], combines kernel functions with FLDA. By expanding the weight vector of the separating function into a linear combination of all the training samples, the separating function in the kernel-defined feature space is represented by the following formula:

    f(x) = Σ_{i=1}^{l} α_i k(x_i, x) + b    (2.31)

where the α_i denote the expansion weights. KFDA determines the optimal separating function by maximizing the Fisher criterion [74]

    J(α) = (μ̃_+ − μ̃_−)² / σ̃²    (2.32)

where μ̃_+ indicates the mean of the projections f(x_i) of the positive samples, μ̃_− indicates the mean of the projections of the negative samples, and σ̃² represents the corresponding within-class variance. By combining Eq. (2.31) and Eq. (2.32), the optimal values of α and b can be computed by solving the resulting generalized eigenvalue problem [62]. Eq. (2.7) predicts the label for a given sample x.

Support Vector Machines

SVMs [77, 78] set up a hyperplane as the decision surface, maximizing the margin of separation between the positive and negative samples in a suitable feature space; this is called the maximal margin principle. Boser et al. [79] developed the kernel-based SVM by combining the kernel function with large-margin hyperplanes; the SVM thereby handles many nonlinear and non-separable problems in machine learning.
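Before turning to the specific soft-margin variants, the following sketch shows how the Gaussian kernel matrix of Eqs. (2.26)-(2.27) can be built and handed to an SVM with a precomputed kernel; scikit-learn's SVC is assumed as tooling, and the data and kernel width are illustrative.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(3)
    X = rng.normal(size=(60, 4))
    y = np.sign(X[:, 0] ** 2 + X[:, 1] - 1.0)    # a nonlinear labelling rule

    def gaussian_kernel_matrix(A, B, sigma=1.0):
        # K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)), cf. Eqs. (2.26)-(2.27).
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2.0 * sigma ** 2))

    svm = SVC(kernel="precomputed", C=1.0)
    svm.fit(gaussian_kernel_matrix(X, X), y)
    # At test time the kernel is evaluated between test and training samples.
    X_test = rng.normal(size=(5, 4))
    print(svm.predict(gaussian_kernel_matrix(X_test, X)))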
Along with the original C-SVM learning technique [77], Schölkopf et al. [80] developed the v-SVM learning technique, which is very similar to the C-SVM except for the form of the optimization risk. This section describes the hard-margin SVM and three soft-margin SVMs, namely the 1-norm C-SVM (L1-SVM), the 2-norm C-SVM (L2-SVM), and the v-SVM, which will be used in several places in this thesis.

Hard-margin SVM

The hard-margin SVM is used in linearly separable cases in order to determine the separating function

    f(x) = w^T φ(x) + b    (2.33)

in the kernel-defined feature space, by minimizing the following optimization risk:

    min_{w,b} ½ ‖w‖²  subject to  y_i (w^T φ(x_i) + b) ≥ 1,  i = 1, ..., l    (2.34)

where ‖·‖ denotes the norm in the transformed feature space F. By introducing Lagrange multipliers α_i, this is equivalent to solving the following constrained quadratic programming (QP) problem:

    max_α Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j k(x_i, x_j)  subject to  α_i ≥ 0    (2.35)

and

    Σ_i α_i y_i = 0    (2.36)

Soft-margin C-SVM

The C-SVM is an ordinary SVM with a soft margin, dealing with non-separable cases; it introduces the margin slack vector ξ = (ξ_1, ..., ξ_l)^T, which allows samples to violate the following inequality:

    y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0    (2.37)

with the corresponding soft-margin loss. By involving the 1-norm of the margin slack vector, the L1-SVM determines the separating function in Eq. (2.33) by minimizing the following regularized optimization risk:

    min_{w,b,ξ} ½ ‖w‖² + C Σ_i ξ_i    (2.39)

where C is the positive regularization parameter defined by the user. By introducing the Lagrange multipliers α_i, this corresponds to solving the following constrained QP problem:

    max_α Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j k(x_i, x_j)  subject to  0 ≤ α_i ≤ C    (2.40)

By involving the 2-norm of the margin slack vector, the L2-SVM determines the separating function in Eq. (2.33) by minimizing the following regularized optimization risk:

    min_{w,b,ξ} ½ ‖w‖² + (C/2) Σ_i ξ_i²    (2.42)

By introducing the Lagrange multipliers, this is equivalent to solving the following constrained QP problem:

    max_α Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j [k(x_i, x_j) + δ_ij / C]  subject to  α_i ≥ 0    (2.43)

where δ_ij represents the Kronecker delta, which is 1 when i = j and 0 in any other case. The L2-SVM is thus a special instance of the hard-margin SVM with the modified kernel matrix

    K̃ = K + I / C    (2.44)

where I is the identity matrix. Eq. (2.36) is satisfied for both the L1-SVM and the L2-SVM.

Soft-margin v-SVM

The v-SVM is simply an SVM with a soft margin for non-separable cases; it employs the same margin slack vector as in the C-SVM learning formulation, but a different soft-margin loss, provided by

    y_i (w^T φ(x_i) + b) ≥ ρ − ξ_i,  ξ_i ≥ 0    (2.45)

where ρ indicates the width of the margin, which varies over positive values; if the margin width ρ is fixed at 1, the v-SVM reduces to the C-SVM. The v-SVM identifies the separating function by minimizing the following adjusted optimization risk:

    min_{w,b,ξ,ρ} ½ ‖w‖² − v ρ + (1/l) Σ_i ξ_i    (2.46)

where the user sets the adjustment parameter v, which ranges over [0, 1]. By introducing the Lagrange multipliers α_i, this is equivalent to solving the following constrained QP problem:

    max_α −½ Σ_i Σ_j α_i α_j y_i y_j k(x_i, x_j)  subject to  0 ≤ α_i ≤ 1/l,  Σ_i α_i ≥ v    (2.47)

Eq. (2.36) is also satisfied for the v-SVM.

Karush-Kuhn-Tucker Optimality Conditions

Let α* denote the optimal solution of the constrained QP problems in Eq. (2.35), Eq. (2.40), Eq. (2.43), and Eq. (2.47), and let w* and b* denote the optimal weights and bias of the separating function in Eq. (2.33), respectively, calculated with α*. The following Karush-Kuhn-Tucker (KKT) optimality conditions must be satisfied; they are slightly different for the different types of SVM [78,80]:

Hard-margin SVM:

    α_i* [y_i (w*^T φ(x_i) + b*) − 1] = 0    (2.48)

L2-SVM:

    α_i* [y_i (w*^T φ(x_i) + b*) − 1 + ξ_i*] = 0    (2.49)

L1-SVM:

    α_i* [y_i (w*^T φ(x_i) + b*) − 1 + ξ_i*] = 0    (2.50)

    (C − α_i*) ξ_i* = 0    (2.51)

v-SVM:

    α_i* [y_i (w*^T φ(x_i) + b*) − ρ* + ξ_i*] = 0    (2.52)

    (1/l − α_i*) ξ_i* = 0    (2.53)

Support Vectors

The KKT optimality conditions above define the notion of support vectors (SVs). SVs are training samples with non-zero α_i*. For the L1-SVM and the v-SVM, there are two main types of SV: margin and non-margin SVs.
Margin SVs are training samples with α_i* greater than zero but less than C for the L1-SVM, or less than 1/l for the v-SVM; they lie exactly on the margin, as in Figure 2.4. Non-margin SVs, on the other hand, are training samples with α_i* exactly equal to C for the L1-SVM, or exactly equal to 1/l for the v-SVM; they lie inside the margin, either on the right side of the decision surface or on the wrong side of it, as in Figure 2.4. Based on the KKT conditions, the SVs of the hard-margin SVM and the L2-SVM, as well as the margin SVs of the L1-SVM, satisfy y_i f(x_i) = 1, while the margin SVs of the v-SVM satisfy y_i f(x_i) = ρ*, where ρ* indicates the optimal value of the margin width, computed with α*.

Separating Function

The optimal value of the bias of the separating function for the C-SVM and v-SVM can be derived by

    b* = −(1 / 2s) Σ_{x_i ∈ S_+ ∪ S_−} Σ_j α_j* y_j k(x_j, x_i)    (2.54)

where S_+ and S_− represent two sets of margin SVs of the same size s but with different labels of +1 and −1. By merging Eq. (2.36) with Eq. (2.33), the optimal separating function is given by

    f*(x) = Σ_i α_i* y_i k(x_i, x) + b*    (2.56)

The label for an input sample x is then predicted by Eq. (2.7). The optimal value of the margin width ρ* for the v-SVM can be calculated by

    ρ* = (1 / 2s) [ Σ_{x_i ∈ S_+} f*(x_i) − Σ_{x_i ∈ S_−} f*(x_i) ]    (2.57)

Figure 2.4: Illustration of support vectors for linear, non-separable patterns.

The optimal value of the slack vector ξ* can be calculated based on the KKT conditions:

C-SVM:

    ξ_i* = max(0, 1 − y_i f*(x_i))    (2.58)

v-SVM:

    ξ_i* = max(0, ρ* − y_i f*(x_i))    (2.59)

2.3.4 Proximal Classifiers

Proximal classifiers solve the binary classification task by seeking two proximal planes in a corresponding feature space, instead of one separating plane. Bradley and Mangasarian [81] first addressed the topic of multi-plane learning by proposing the unsupervised k-plane clustering method in 2000. Later, a series of studies on multi-plane learning was developed for supervised learning, such as the proximal SVM (PSVM) [82, 83] and its corresponding statistical interpretation [84], parallelized algorithms for classification with the incremental PSVM [85], a fuzzy extension of the PSVM [86], and the multi-surface PSVM (MPSVM) [87].

Proximal Support Vector Machines

The PSVMs seek two parallel proximal planes that are pushed as far apart as possible; samples are classified by assigning them to the closer plane [82,83]. To maintain the parallelism condition and bound samples based on the maximal margin rule, the following proximal planes are employed for the linear PSVM:

    w^T x + b = +1    (2.60)

    w^T x + b = −1    (2.61)

The optimal values of w and b are obtained by minimizing the following regularized optimization risk:

    min_{w,b,ξ} ½ (w^T w + b²) + (C/2) ‖ξ‖²    (2.62)

subject to the equality constraint

    D (A w + 1 b) + ξ = 1    (2.63)

where ξ denotes the vector of error variables (see also the slack vector in Section 2.3.3), C is the non-negative regularization parameter set by the user, A is the matrix whose rows are the training samples, D is the diagonal matrix of their labels, and 1 is a column vector of ones. Substituting for ξ in terms of w and b, based on the linear constraint given in Eq. (2.63), the constrained optimization problem in Eq. (2.62) is reduced to an unconstrained minimization problem, which can be solved by setting the gradients with respect to w and b to zero. For the nonlinear PSVM, the following proximal planes are employed:

    Σ_i α_i k(x_i, x) + b = +1    (2.64)

    Σ_i α_i k(x_i, x) + b = −1    (2.65)

where the α_i are Lagrangian (expansion) multipliers. The constrained optimization problem to be solved becomes

    min_{α,b,ξ} ½ (α^T α + b²) + (C/2) ‖ξ‖²    (2.66)

subject to the equality constraint

    D (K α + 1 b) + ξ = 1    (2.67)

Compared with the L2-SVM, the key idea of the PSVM is a simple, fundamental change: the inequality constraint associated with Eq. (2.42) is replaced by the equality constraints in Eq. (2.63) and Eq. (2.67).
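To illustrate how the equality constraint turns training into a single linear solve, here is a minimal numpy sketch of the linear PSVM of Eqs. (2.60)-(2.63): substituting ξ = 1 − D(Aw + 1b) into the risk and setting the gradient to zero yields the system solved below. The data and the regularization value are illustrative assumptions, and this is a sketch of the idea rather than the reference implementation of [82,83].

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.normal(size=(100, 3))                 # training samples as rows
    y = np.where(A @ np.array([1.0, -2.0, 0.5]) > 0, 1.0, -1.0)

    C = 1.0
    E = np.hstack([A, np.ones((A.shape[0], 1))])  # E = [A  1]
    # (I/C + E^T E) [w; b] = E^T y, using D 1 = y and D^T D = I.
    u = np.linalg.solve(np.eye(E.shape[1]) / C + E.T @ E, E.T @ y)
    w, b = u[:-1], u[-1]

    pred = np.sign(A @ w + b)                     # sign decides the closer plane
    print("training accuracy:", (pred == y).mean())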
Multi-surface Proximal Support Vector Machines

MPSVMs drop the parallelism condition on the proximal planes of the PSVMs and require instead that each plane be as close as possible to one of the two classes of training samples and as far as possible from the other; samples are classified by assigning them to the closer plane [87]. The following formulae denote the two proximal planes in the original feature space X:

    w_1^T x + b_1 = 0    (2.68)

    w_2^T x + b_2 = 0    (2.69)

where w is the weight vector (direction) and b is the bias of a proximal plane; the subscripts 1 and 2 refer to the first and the second plane hereinunder. In the kernel-defined feature space F, with kernel functions introduced in order to incorporate nonlinearity, by expanding the direction vector of each hyperplane into a linear combination of all the training samples, the following formulae denote the two proximal planes:

    Σ_i α_{1,i} k(x_i, x) + b_1 = 0    (2.70)

    Σ_i α_{2,i} k(x_i, x) + b_2 = 0    (2.71)

where α_{1,i} and α_{2,i} are expansion weights, forming the two column vectors α_1 and α_2, respectively. To obtain the two planes, the following objective functions, with the numerator parts given in "sum of squares" form, are maximized:

    J_1(w_1, b_1) = ‖B w_1 + e b_1‖² / ‖A w_1 + e b_1‖²    (2.72)

    J_2(w_2, b_2) = ‖A w_2 + e b_2‖² / ‖B w_2 + e b_2‖²    (2.73)

A Tikhonov regularization term [88], which is often used to regularize least-squares and mathematical programming problems, is employed to improve the classification performance of MPSVMs [87]. For linear classification, by incorporating Eq. (2.68) and Eq. (2.69), as well as the regularization term, into Eq. (2.72) and Eq. (2.73), and letting u_1 = [w_1; b_1] and u_2 = [w_2; b_2], the following objective functions are required to be maximized:

    J_1(u_1) = ‖B w_1 + e b_1‖² / (‖A w_1 + e b_1‖² + δ ‖u_1‖²)    (2.74)

    J_2(u_2) = ‖A w_2 + e b_2‖² / (‖B w_2 + e b_2‖² + δ ‖u_2‖²)    (2.75)

where the non-negative regularization parameter δ is set by the user; the matrix A denotes the samples from the positive class, the matrix B denotes the samples from the negative class, and e represents a column vector with all entries equal to 1. Solving two generalized eigenvalue problems [87] allows the user to compute the optimal values of w_1, b_1, w_2, and b_2. For nonlinear classification, by incorporating Eq. (2.70) and Eq. (2.71), as well as the regularization term, into Eq. (2.72) and Eq. (2.73), the following objective functions are derived and are required to be maximized:

    J_1(α_1, b_1) = ‖K_B α_1 + e b_1‖² / (‖K_A α_1 + e b_1‖² + δ ‖[α_1; b_1]‖²)    (2.76)

    J_2(α_2, b_2) = ‖K_A α_2 + e b_2‖² / (‖K_B α_2 + e b_2‖² + δ ‖[α_2; b_2]‖²)    (2.77)

where K_A denotes the kernel matrix between the samples of the positive class and all the training samples, and K_B denotes the kernel matrix between the samples of the negative class and all the training samples. Solving two generalized eigenvalue problems [87] allows the user to compute the optimal values of α_1, b_1, α_2, and b_2.

2.3.5 Prototype Classifiers

Prototype classifiers are quite distinct from all the other techniques mentioned before in this chapter. A prototype classifier simply keeps several samples (prototypes) for each class, then predicts the label of a new sample from the nearest prototype. By contrast, SVMs construct one separating hyperplane, PSVMs construct two proximal hyperplanes, and ANNs depend on neurons as the principal units of a model for processing data.

k-Nearest Neighbours

The k-nearest neighbours (KNN) method [89] is one extreme end of the scale for prototype classifiers, where each training sample serves as a prototype, leading to l prototypes. Given a query sample, the k prototypes closest to the query sample (those with the smallest Euclidean distances) are found, and the classification uses a majority vote among the labels of these k prototypes. We consider it important to mention the method and give this brief introduction, as it will be used later in Chapter 5.
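A minimal numpy sketch of the majority vote just described; k and the data are illustrative choices.

    import numpy as np

    def knn_predict(X_train, y_train, x_query, k=3):
        dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
        nearest = np.argsort(dists)[:k]                     # k closest prototypes
        values, counts = np.unique(y_train[nearest], return_counts=True)
        return values[np.argmax(counts)]                    # majority label

    rng = np.random.default_rng(5)
    X_train = rng.normal(size=(30, 2))
    y_train = np.where(X_train[:, 0] > 0, 1, -1)
    print(knn_predict(X_train, y_train, np.array([0.2, -0.4]), k=5))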
Minimum Distance Classifier

The minimum distance classifier (MDC) [90,91,92] is the other extreme end of the scale for prototype classifiers, where there is only one prototype for each class, namely the class centre (or mean), denoted m_k. The distance between the query sample x and each prototype is computed as d_k = ‖x − m_k‖, and the label of the nearest prototype, given as argmin_k d_k, is chosen as the label of x.

Learning Vector Quantization

Learning vector quantization (LVQ) [93] is a prototype-based algorithm for supervised classification, and is simply a particular instance of an ANN. LVQ carries out a winner-take-all, Hebbian-learning-based approach [93]. An LVQ network has a first, competitive layer of competitive neurons (prototypes) and a second, linear layer of output neurons (classes). The classes of the competitive layer are mapped by the linear layer onto the target classifications defined by the user: the classes represented by the competitive layer are called subclasses, each connected to a prototype, while the classes of the linear layer are called target classes. Experienced users can derive particular benefit from LVQ because it produces accessible and intelligible prototypes.

Clustering-based Prototypes

Clustering [54] partitions a body of data into distinct groups; specifically, it resolves a set of data into subsets (clusters) such that the samples within each subset are similar according to some chosen distance measure. Clustering algorithms are unsupervised techniques for deriving prototypes [94]: there are as many prototypes as clusters, since each prototype is given by the centre of a cluster.

2.4 Optimization

Research by Bennett and Parrado-Hernandez [95] recently revealed the synergistic link between the fields of machine learning and mathematical programming. They note:

"Optimization lies at the heart of machine learning. Most machine learning problems reduce to optimization problems. Consider the machine learning analyst in action solving a problem for some set of data. The modeller formulates the problem by selecting an appropriate family of models and massages the data into a format amenable to modelling. Then the model is typically trained by solving a core optimization problem that optimizes the variables or parameters of the model with respect to the selected loss function and possibly some regularization function. In the process of model selection and validation, the core optimization problem may be solved many times. The research area of mathematical programming theory intersects with machine learning through these core optimization problems."

Examples of machine learning models with existing optimization methods include QP in SVM [77,78], semi-definite programming (SDP) in model selection and hypothesis selection [96,97,98], and dynamic programming in lightest derivation problems [99]. The actual optimization methods used in this work are briefly described in the following sections.

2.4.1 Genetic Algorithm

A genetic algorithm (GA) [100] is a search method which provides exact or approximate solutions to optimization and search problems. A GA maintains a set of candidate solutions (a population) and evolves successive approximations (individuals) to a solution, following the well-known biological "survival of the fittest" principle (see the tutorial by Chipperfield et al. [101]).
Each individual is represented as a string, or chromosome, over some alphabet; one commonly used representation is the binary alphabet {0,1}. An objective function and a fitness function are used to evaluate the performance achieved by each individual. Fitter individuals have a better chance of being recombined by crossover, with a probability P_c, to give rise to the following generation; crossover exchanges genetic information between pairs, or larger groups, of individuals. A further genetic operator, mutation, is then applied to the new individuals with a low probability P_m: by altering genes at randomly chosen positions, mutation helps the search move away from local optima and guarantees that the probability of exploring any given subspace of the problem space is never zero. Selection, crossover, and mutation make the old and new populations differ by a fraction of their individuals, which we call the generation gap. To keep the population size constant, new individuals with high fitness values are brought into the new population, and the fitter old individuals occupy the remaining positions. The GA terminates after a predetermined number of generations; the user may then test the quality and fitness of the population members, and, if no reasonable solutions are found, may either restart the genetic algorithm or begin a fresh search.

2.4.2 Gradient Descent

Gradient descent (GD) is an optimization algorithm for differentiable functions. To detect a local minimum of a function J with respect to a parameter vector θ using GD, one takes steps proportional to the negative of the gradient (or the approximate gradient) of the function at the current point. To perform GD, the user first initializes the parameter vector to some value θ(0), and then updates each component of θ at each iteration on the basis of the following rule:

    θ_j ← θ_j − η ∂J(θ)/∂θ_j,  j = 1, ..., d

where d is the number of components of the vector θ, while the step size η and the initial value θ(0), determined by the user, control the rapidity of convergence.

2.5 Feature Selection

Classification uses labelled samples, each presented as a vector of numerical or nominal features, to learn a model that is able to assign objects to a definite set of specific categories. The feature set may inadvertently include some irrelevant features, which can negatively affect the precision of the classifier and should therefore be eliminated. Moreover, using fewer features makes the whole process cheaper and makes the resulting classification model more accessible and interpretable. Different methods of feature selection can be employed, such as sequential backward selection [89, 102], GA [46, 47, 102, 103, 104], and recursive feature elimination.

2.5.1 Genetic Algorithm

GA is a common and accessible optimization technique; we can use it as a stochastic global search technique that imitates the biological "survival of the fittest" principle (see Section 2.4.1). Selecting features using a GA requires each feature of the candidate feature set to be encoded as a binary gene. Each candidate feature subset is represented as an n-bit binary string, where n denotes the total number of candidate features: an n-bit individual corresponds to an n-dimensional binary feature vector, in which a 1 denotes the inclusion and a 0 the removal of the corresponding feature. For example, to select a subset from a total of six features, ordered as f_1, ..., f_6, the 6-bit individual 010001 denotes the feature combination {f_2, f_6}.
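The encoding just described can be sketched in a few lines of numpy; the fitness evaluation is deliberately left out, since under the wrapper scheme described below it would be a classifier's estimated risk, and the mutation probability is an illustrative choice.

    import numpy as np

    rng = np.random.default_rng(6)
    individual = np.array([0, 1, 0, 0, 0, 1])       # the 6-bit string 010001

    def decode(individual, X):
        # Keep only the columns whose gene is 1 (features f_2 and f_6 here).
        return X[:, individual.astype(bool)]

    def mutate(individual, p_m=0.1):
        flips = rng.random(individual.shape) < p_m  # flip each gene with prob p_m
        return np.where(flips, 1 - individual, individual)

    X = rng.normal(size=(8, 6))
    print(decode(individual, X).shape)              # (8, 2)
    print(mutate(individual))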
Two schemes can be employed for GA-based feature selection:

Independent selection: searching for the optimal subset of features with an objective function that measures the separability of the data, such as the alignment of the kernel with a target function, a class separability measure, or an ordinary distance, computed in the original feature space. This selection scheme is independent in the sense that it is free of any classifier.

Wrapper-type selection: searching for the optimal combination of features with the objective function set as the estimated risk of a chosen classifier, such as the leave-one-out (LOO) error or the cross-validation error.

2.5.2 Sequential Backward Selection

In sequential backward selection (SBS), starting from the complete set of features, features are discarded one by one until the stopping conditions are met. SBS searches through models of decreasing complexity, where the complexity level is determined by the number of interactions among the feature variables, in other words, the number of edges in the graphical representation of the model. SBS begins by designating the saturated model as the current model; the saturated model has the maximal complexity level, determined by the number of feature variables. At each stage of SBS, the set of admissible models of complexity level c − 1 is generated by eliminating single edges from the current model of complexity level c. Each element of this set is a candidate model; the candidate whose adoption results in the least deterioration compared with the current model is selected as the new current model, allowing the search to go on. The search stops when none of the candidate models achieves a sufficiently low deterioration, or when the complexity level of the current model reaches zero.

2.5.3 Recursive Feature Elimination

A large number of features harms the effectiveness of most algorithms at learning patterns for prediction. Recursive feature elimination (RFE) is one of the techniques used to reduce the number of features. A good feature ranking criterion, however, is not necessarily a good feature subset ranking criterion: ranking criteria estimate the effect of removing one feature at a time on the objective function, and they become very ineffective when it comes to removing several features at a time, which is necessary to obtain a small feature subset. This problem can be overcome by using the following iterative procedure, called recursive feature elimination:

1. Train the classifier (optimize its weights with respect to the objective function).
2. Compute the ranking criterion for all features.
3. Eliminate the feature with the smallest ranking criterion.

The point of RFE is to start with all the features without exclusion, then to determine and eliminate the least useful one, and to keep iterating until some stopping condition is met. Variants of RFE differ in when they stop and in the way the removable feature is selected. Researchers have investigated RFE to determine how it can benefit studies of gene expression. RFE is not guaranteed to produce an optimal feature subset; rather, it reduces the complexity of feature selection by being "greedy": once a feature is selected for removal, it is never reintroduced. Most studies have found RFE able to select very good feature sets; however, with 12,000 or more features to select from and a large number of samples, RFE requires considerable computation time. RFE is extremely computationally expensive when only one least useful feature is removed during each iteration. For computational reasons, it may be more efficient to remove several features at a time, at the expense of a possible degradation in classification performance. In such a case, the method produces a feature subset ranking, as opposed to a feature ranking, and the feature subsets are nested. If features are removed one at a time, there is also a corresponding feature ranking; however, the features that are ranked top (eliminated last) are not necessarily the ones that are individually most relevant: only taken together are the features of a subset optimal in some sense. It should be noted that RFE has no effect on correlation methods, since their ranking criterion is computed from information about a single feature.
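A minimal sketch of the loop in steps 1-3 above, assuming scikit-learn's LinearSVC as the classifier and the squared weights as the ranking criterion (a common choice in the RFE literature, here an illustrative assumption):

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(7)
    X = rng.normal(size=(120, 8))
    y = np.where(X[:, 0] - 2 * X[:, 3] > 0, 1, -1)   # only features 0 and 3 matter

    remaining = list(range(X.shape[1]))
    while len(remaining) > 2:                        # stopping condition: 2 features
        clf = LinearSVC(dual=False).fit(X[:, remaining], y)  # step 1: train
        ranks = clf.coef_.ravel() ** 2                       # step 2: rank features
        remaining.pop(int(np.argmin(ranks)))                 # step 3: drop weakest
    print("selected features:", remaining)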
2.6 Summary

This chapter has surveyed the fundamentals of machine learning. As this work focuses on pattern classification, several well-known classification methods have been reviewed, covering Support Vector Machines in detail, since this thesis focuses mainly on SVM and on improving SVM and TSVM. In addition, the chapter has given an overview, with some details, of linear classifiers, neural networks, kernel-based classifiers, proximal classifiers, and prototype classifiers. These classifiers will be used later in validating the classification, in order to compare the modelling of automated feature and sample ranking against Support Vector Machines (SVM), as one of the aims of this thesis is to improve SVM.

This chapter has also discussed the synergistic relations between the fields of machine learning and optimization, and presented the GA as a means of evolutionary optimization. Moreover, it has outlined feature selection, its different types, and the reasons for using it, which relate directly to the aims and objectives of this thesis: to investigate the effect on classification accuracy of reducing the number of features or samples in various available data sets, using both semi-supervised and supervised methods (TSVM and SVM).

Chapter 3

Improving TSVM vs. SVM: Accordance-Based Sample Selection

This chapter introduces the experimental datasets used in this work and summarizes the commonly used measures to evaluate the separability of features. Section 3.1 provides a brief description of the datasets used. Section 3.2 presents the background and motivation of the accordance-based sample selection methods. Section 3.3 describes the experiment settings. Section 3.4 introduces the derivation of a new algorithm. Sections 3.5 and 3.6 present the results and the discussion of the results, respectively. Lastly, Section 3.7 gives a brief summary of the chapter.

3.1 Experimental Datasets and Feature Analysis

We use two datasets, the public Wisconsin Diagnostic Breast Cancer dataset from the UCI Machine Learning Repository [105] and the in-house Nottingham Tenovus Primary Breast Cancer dataset [106], to evaluate the methods proposed in this work. Information on each dataset is listed in Table 3.1.

Dataset   No. of classes   No. of features   No. of samples   L     U     V
NTBC      6                25                1076             663   413   200/224
WDBC      2                14                569              519   50    200/213

Table 3.1: Benchmark datasets used in this work.

Accordance sampling selects the unlabelled examples on whose label the two view classifiers agree the most. When the two classifiers are sufficient and independent, the sampled examples are more reliably labelled.
Thus, selecting the examples on whose labels the two view classifiers agree is less likely to introduce errors into the expanded labelled dataset.

NTBC contains six breast cancer classes. Three of them present luminal characteristics (luminal biomarkers are over-expressed) and are differentiated by the presence or absence of other markers such as oestrogen receptors and/or progesterone receptors. In one of the six classes the HER2 marker (a human epidermal growth factor receptor associated with higher aggressiveness in breast cancers) is strongly expressed. The last two classes are characterised by the over-expression of basal markers and the corresponding under-expression of the luminal ones; these two groups differ by the presence or absence of a marker called p53. The WDBC dataset has two classes, malignant and benign.

Since the number of features is large, we randomly generate small groups containing 2 features each, and randomly select 200 pairs of views for each run. The last column of Table 3.1 reports the number of view pairs used in our experiments together with the total number of all possible view splits. We use the one-against-all method for the multiclass problem: each class in turn is grouped as positive and the rest as negative. U and L represent the unlabelled and labelled samples, respectively.

3.1.1 Nottingham Tenovus Primary Breast Cancer (NTBC)

A series of 1076 patients from the Nottingham Tenovus Primary Breast Carcinoma Series, presenting with primary operable (stages I, II and III) invasive breast cancer between 1986 and 1998, was used. This dataset is the Nottingham Tenovus Primary Breast Cancer (NTBC) data [106], which has been used in many data mining experiments in biomedical research [107,108]. NTBC is an invasive breast cancer collection, developed by Daniele Soria et al. at the University of Nottingham. All tumours were less than 5 cm in diameter on clinical/pre-operative measurement and/or on operative histology (pT1 and pT2). Women aged over 70 years were not included, because of the increased confounding factor of death from other causes and because primary treatment protocols for these patients often differed from those for younger women. There are in total 1076 patients, on whom six stages of relevance judgments are made, with a total of 25 features for each patient. Protein reactivity for twenty-five proteins with known relevance in breast cancer, including those used in routine clinical practice, was previously determined using the standard techniques for detecting proteins on tumour samples prepared as tissue microarrays [106]. Levels of reactivity were determined by microscopic analysis using the modified H-score (values between 0 and 300), giving a semi-quantitative assessment of both the intensity of staining and the percentage of positive cells. The complete list of variables used in this study is given in Table 3.2. This is a well-characterised series [106] of patients who were treated according to standard clinical protocols. Patient management was based on tumour characteristics, using the Nottingham Prognostic Index (NPI) and hormone receptor status. Patients with an NPI score ≤ 3.4 received no adjuvant therapy; those with an NPI score > 3.4 received hormone therapy if oestrogen receptor (ER) positive, or classical cyclophosphamide, methotrexate and 5-fluorouracil (CMF) if ER negative and fit enough to tolerate chemotherapy.
Table 3.2: Complete list of antibodies used and their dilutions

Antibody, clone                        Short Name   Dilution
Luminal phenotype
CK 7/8 [clone CAM 5.2]                 CK7/8        1:2
CK 18 [clone DC10]                     CK18         1:50
CK 19 [clone BCK 108]                  CK19         1:100
Basal phenotype
CK 5/6 [clone D5/16134]                CK5/6        1:100
CK 14 [clone LL002]                    CK14         1:100
SMA [clone 1A4]                        Actin        1:2000
p63 ab-1 [clone 4A4]                   p63          1:200
Hormone receptors
ER [clone 1D5]                         ER           1:80
PgR [clone PgR 636]                    PgR          1:100
AR [clone F39.4.1]                     AR           1:30
EGFR family members
EGFR [clone EGFR.113]                  EGFR         1:10
HER2/c-erbB-2                          HER2         1:250
HER3/c-erbB-3 [clone RTJ1]             HER3         1:20
HER4/c-erbB-4 [clone HFR1]             HER4         6:4
Tumour suppressor genes
p53 [clone DO7]                        p53          1:50
nBRCA1 Ab-1 [clone MS110]              nBRCA1       1:150
Anti-FHIT [clone ZR44]                 FHIT         1:600
Cell adhesion molecules
Anti E-cad [clone HECD-1]              E-cad        1:10/20
Anti P-cad [clone 56]                  P-cad        1:200
Mucins
NCL-Muc-1 [clone Ma695]                MUC1         1:300
NCL-Muc-1 core [clone Ma552]           MUC1co       1:250
NCL muc2 [clone Ccp58]                 MUC2         1:250
Apocrine differentiation
Anti-GCDFP-15                          GCDFP        1:30
Neuroendocrine differentiation
Chromogranin A [clone DAK-A3]          Chromo       1:100
Synaptophysin [clone SY38]             Synapto      1:30

Hormonal therapy was given to 420 patients (39%) and chemotherapy to 264 (24.5%). Data relating to survival were collated in a prospective manner for those patients presenting after 1989 only, including survival time, defined as the interval (in months) from the date of the primary treatment to the time of death. The overall survival was taken as the time (in months) from the date of the primary surgical treatment to the time of death or censorship.

3.1.2 Wisconsin Diagnostic Breast Cancer Dataset

Breast cancer is the main type of cancer all over the world, as well as the second most common cause of cancer death in women; about 10% of women in Western countries develop breast cancer [109]. Doctors can decisively diagnose breast cancer only through an FNA biopsy or core needle biopsy [110]. FNA, which uses a needle smaller than those used for blood tests to remove fluid, cells, and small fragments of tissue for examination under a microscope, is the easiest and fastest method of obtaining a breast biopsy, and is effective for women who have fluid-filled cysts. With this in mind, we applied the Wisconsin Diagnostic Breast Cancer (WDBC) data [111, 112], in order to precisely identify malignant breast tumours from a set of benign and malignant samples depending only on FNA. To estimate the size, shape and texture of each cell nucleus, previous studies determined the following features:

1. Radius is calculated as the average of the lengths of the radial line segments from the centre of mass of the border to each of the border points.

2. Perimeter is measured as the sum of the distances between consecutive boundary points.

3. Area is calculated by counting the number of pixels on the inside of the border, then adding half of the pixels on the perimeter to correct for the digitization error.

4. Compactness combines the perimeter and the area to give a measure of the compactness of the cell, counted as follows:

    compactness = perimeter² / area

This dimensionless number is at its minimum for a circular disc; irregular borders increase it.

5. Smoothness is specified by measuring the difference between the length of each radial line and the mean length of the two radial lines surrounding it. If that number is small relative to the spacing of consecutive border points, the contour of the region is smooth.
To avoid the numerical instability caused by small divisors, smoothness is computed with the following formula:

    smoothness = Σ_i | r_i − (r_{i−1} + r_{i+1}) / 2 |

where r_i is the length of the line from the centre to the i-th border point.

6. Concavity is observed by measuring the size of any indentations in the border of the cell nucleus.

7. Concave points resemble concavity, but count only the number of border points lying on the concave regions of the border, rather than the magnitude of those concavities.

8. Symmetry is measured as the relative difference in length between pairs of line segments perpendicular to the major axis of the outline of the cell nucleus. The major axis is determined as the longest chord passing through border points and the central point of the nucleus; pairs of perpendicular segments are then drawn at regular intervals. We divide the sums instead of summing the individual ratios, to avoid unreliable numerical outcomes caused by small segments:

    symmetry = Σ_i | l_i − r_i | / Σ_i ( l_i + r_i )

where l_i and r_i denote the lengths of the perpendicular segments on the left and right of the major axis, respectively.

9. Fractal dimension: we employ the "coastline approximation" presented by Mandelbrot [113] in order to approximate the fractal dimension. This relies on increasingly coarser "rulers" to measure the perimeter of the nucleus: increasing the ruler size decreases the precision of the measurement and, correspondingly, the measured perimeter. Plotting these values on a log-log scale and measuring the downward slope approximates the fractal dimension.

10. Texture is measured by finding the variance of the grey-scale intensities of the component pixels.

The mean value, standard error, and the extreme (largest, or "worst") value of each characteristic were computed for each case, which resulted in 14 features for each of the 569 images, yielding a database of 569 × 14 samples representing 357 benign and 212 malignant cases.

3.2 Background and motivation

Cancer diagnosis requires the classification of patients with breast cancer, and scientists have recently applied several algorithms to tackle this task. Biologists are increasingly using modern machine learning techniques for the purpose of deriving reliable information about tumours. This chapter discusses semi-supervised and supervised machine learning methods for classifying such datasets, and a systematic comparison between semi-supervised and supervised machine learning methods is provided hereinafter.

For the first part of this analysis, the Nottingham Tenovus Primary Breast Cancer data [106] will be considered; the full list of variables is reported in Table 3.2. A Support Vector Machine (SVM) classifier and a Transductive Support Vector Machine classifier will be applied to this dataset. The same machine learning techniques have already been used in studies such as Bellaachia and Guven [114] and the revised study by Delen et al. [115], where these methods were applied in the search for the most suitable one for predicting the survivability rate of breast cancer patients.

This study was motivated by the necessity of finding an automated and robust method to validate the classification of breast cancer based on accordance sampling (p. 58). In fact, six classes were obtained using the agreement between different classification algorithms. Starting from these groups, the aim was to reproduce the classification, taking into account the highly non-normal distribution of the data (see Figures 3.1 and 3.2).
For this reason, the Support Vector Machine classifier was used, and the results were compared with the Transductive Support Vector Machine. It is important to note that, of the 1076 patients, only 62% (663 cases) were classified into one of the six core groups, while the remaining 38% presented indeterminate or mixed characteristics. This part of the study focused only on the subset of the 'in-class' cases. The objective was to run the classifiers to find an automated way to justify and reproduce the classification obtained before, based on our accordance sampling methods.

In the second part of the chapter, a dataset taken from the UCI Machine Learning Repository [105] is considered in coping with non-normal data. On this dataset, the performance of the SVM classifier with accordance-based sampling is compared with the TSVM approach to classification. Moreover, as the assumption of normality of the data is strongly violated in many real-world problems, an implementation of the SVM and TSVM classifiers with accordance-based sampling was developed and is presented in this chapter. This method deals with continuous, non-normally distributed variables which, as in the cases presented here, do not follow normal distributions (see Figure 3.3). The algorithm has the same structure as the SVM and TSVM, considering the effect of accordance-based sampling on the SVM and TSVM performance. The results obtained with the new method will be compared both with those found by the SVM algorithm and with those obtained by applying the TSVM.

Figure 3.1: Histogram of variable CK19
Figure 3.2: Histogram of variable P53
Figure 3.3: Histogram of WDBC

3.3 Experimental setting

For our empirical evaluation, two breast cancer datasets were used: the in-house Nottingham Tenovus Primary Breast Cancer Series dataset and the Wisconsin Diagnostic Breast Cancer UCI dataset. A comparison was made between the performance of the original random-sampling SVM and TSVM on the one hand, and the new accordance-sampling method on the other. Accuracy was used to measure the performance strength of the model. The accordance-sampling method was applied to TSVM and SVM on both datasets, and each method was run 10 times on each dataset. The original SVM and TSVM algorithms were then run using the same setup to measure their performance.

To investigate the performance of accordance sampling for breast cancer classification, for each dataset we created a pool of unlabelled data by sampling 10%, 20%, 30%, etc. of the examples from the training data for each class, as mentioned previously. We then randomly selected two examples from the pool to serve as the initial labelled data. Thus, we give the learner the remaining unlabelled examples and the two labelled examples, and then test the classifier.

The attributes of each dataset are split into two sets, which are used as two classifier views; this is a practical approach for generating two views from a single attribute set. To explore SVM and TSVM performance on these views comprehensively, we would like to experiment with all possible combinations of the views. Since the number of attributes is large, we randomly generate small groups containing 2 attributes, and randomly select 200 pairs of views for each run. The last column of Table 3.1 reports the number of view pairs used in our experiments and the total number of all possible view splits. We use the 'one against all' method for the multiclass problem, and we group each class as positive and the rest as negative.
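The setup just described can be sketched as follows; numpy is assumed, and the synthetic labels and counts are illustrative stand-ins for the real datasets.

    import numpy as np

    rng = np.random.default_rng(8)
    n_features, n_pairs = 25, 200
    X = rng.normal(size=(1076, n_features))
    y = rng.integers(1, 7, size=1076)          # six classes, as in NTBC

    # 200 random view pairs, each view holding 2 distinct attributes.
    view_pairs = [tuple(rng.choice(n_features, size=4, replace=False))
                  for _ in range(n_pairs)]
    v1_cols, v2_cols = view_pairs[0][:2], view_pairs[0][2:]
    X_view1, X_view2 = X[:, list(v1_cols)], X[:, list(v2_cols)]

    # One against all: class k becomes positive, the rest negative.
    k = 3
    y_binary = np.where(y == k, 1, -1)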
The performance of the Support Vector Machine classifier and the Transductive Support Vector Machine was evaluated using the SVM algorithm implemented in the LIBSVM package [116]. LIBSVM is a library for Support Vector Machines (SVMs); it has been under active development since the year 2000, its goal being to help users easily apply SVM to their applications, and it has gained wide popularity in machine learning and many other areas. All the techniques analysed were run 10 times using the 10-fold cross-validation option, and the accuracy of the obtained classification was evaluated simply as the percentage of correctly classified instances; the mean of the resulting accuracies was then computed.

3.3.1 Support Vector Machine

Generally, SVM is a binary classification technique. Assume that the training data {x_1, ..., x_l} are vectors in some space X ⊆ R^n, given with their labels {y_1, ..., y_l}, where y_i ∈ {−1, 1}. In their simplest form, SVMs use hyperplanes that separate the training data by a maximal margin: all vectors lying on one side of the hyperplane are labelled −1, and all vectors lying on the other side are labelled 1. The training instances that lie closest to the hyperplane are called support vectors [117].

This works within the framework of induction: using a labelled training set of data, the task is to create a classifier that performs well on unseen test data. In addition to regular induction, SVMs can also be used for transduction: here they are first given a set of both labelled and unlabelled data, and the learning task is to assign labels to the unlabelled data as accurately as possible [118,119]. SVMs can perform transduction by finding the hyperplane that maximizes the margin relative to both the labelled and the unlabelled data [118,120].

Binary (two-class) classification using SVMs presents an interesting and effective approach to solving automated classification tasks. The initial support vector machines were designed for binary classification; this has now been extended to multiclass classification [120]. Almost all current multiclass classification methods fall into two categories: one against one, or one against all [120,121]. We use the one-against-all method because it uses less computational time and is more suitable for semi-supervised data. A bioinformatics classification problem is usually a multiclass one, since more than two classes are usually needed; a clustering-based approach is relied upon to predict labels for unlabelled examples [122,123], and the multiclass SVM is then used to learn with the augmented training set and to classify the test set [123,124].

3.3.2 Measures for predictive accuracy

There are many different measures for assessing the accuracy of a model [125], two of which are calibration and discrimination. When a fraction of about P of the events that are predicted with probability P actually occur, it can be said that the predicted probabilities are well calibrated and a suitable model for P(C|X) has been found [126]. Discrimination, by contrast, measures a predictor's ability to separate patients with different responses [125]. When the outcome variable is dichotomous and predictions are stated as probabilities that an event will occur, calibration and discrimination are more informative than other indices (such as the expected squared error) in measuring accuracy [125].
The calibration plot is a method that shows how well the classifier is calibrated; a perfectly calibrated classifier is represented by a diagonal on the graph [127]. Note that it applies only to probabilistic models, and not to SVM or TSVM. The concordance index, or c index, is a widely applicable measure of predictive discrimination: it applies to ordinary continuous outcomes, dichotomous diagnostic outcomes and ordinal outcomes. This index of predictive discrimination is related to a rank correlation between predicted and observed outcomes, and is defined as the proportion of all patient pairs in which the predictions and outcomes are concordant. For predicting binary outcomes, c is identical to the area under a receiver operating characteristic (ROC) curve [127].

A ROC curve is a tool to measure the quality of a binary classifier independently of the variation in time of the ratio between positive and negative events [127]. In other words, it is a graphical plot of the sensitivity versus (1 − specificity) for a binary classifier system as its discrimination threshold is varied. The ROC can also be represented equivalently by plotting the fraction of true positives (TPR = true positive rate) versus the fraction of false positives (FPR = false positive rate). A completely random guess would give a point along the diagonal line (the so-called line of non-discrimination) from the bottom left to the top right corner. Usually, one is interested in the area under the ROC curve, which gives the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. A random classifier has an area of 0.5, while an ideal one has an area of 1.
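A small check of this rank interpretation, assuming scikit-learn's roc_auc_score as tooling; the labels and scores are illustrative.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    y_true = np.array([1, 1, 1, 0, 0, 0])
    scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.1])   # classifier confidences

    auc = roc_auc_score(y_true, scores)

    # Pairwise fraction of concordant pairs (positive ranked above negative),
    # matching the c-index definition in the text.
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    concordant = (pos[:, None] > neg[None, :]).mean()
    print(auc, concordant)    # both 8/9 here, approximately 0.889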
3.4 Derivation of a New Algorithm

The main idea of the new algorithm is that the closer a variable's value is to its median in a particular class, the higher the probability of the sample being assigned to that specific group. At the beginning of the algorithm, the value of each sample is computed, as well as the prior probabilities. The following step is the main part of the method, in which the single labelled samples are calculated. Accordance sampling selects the unlabelled examples on whose label the two view classifiers agree the most. When the two classifiers are sufficient and independent, the sampled examples are more reliably labelled; thus, selecting those examples on whose labels the two view classifiers agree is less likely to introduce errors into the expanded labelled dataset. Disagreement, by contrast, means that one of the classifiers assigns the wrong label to the example, which may lead to labelling errors in the expanded labelled dataset. This is one approach to investigating why the sampling method works well in limiting labelling errors; however, in our case we cannot calculate the labelling error rates, since the real labels of the unlabelled examples are not known.

The basic idea is to have all the class representatives present in the sample. Say we have a dataset of 1000 records with 5 classes: class 1 → 500 records, class 2 → 250 records, class 3 → 125 records, class 4 → 65 records and class 5 → 60 records. It would not be fair to randomly select a sample of 250, because there is a high probability that class 4 or class 5 would not be represented. Nor would it be fair to take a sample of 50 records from each class, because that leaves only 10 records of class 5 to be tested; in addition, class 1 would not be well trained, as only 50 out of its 500 records would be used to train it. Ideally, we found that we should combine both concepts: all classes should be used, but with the same proportions as in the data. Say we need 20% of the data for training; then 20% of class 1, 20% of class 2, and so on, are selected on a random basis.

We propose a novel sampling method to replace the random sampling used by SVM and TSVM, to find out whether this can substantially reduce the amount of labelled training examples needed and, in addition, to see whether it can improve the performance of both SVM and TSVM. The new method uses redundant views to expand the labelled dataset and build strong learning models. The major difference is that the new method uses a number of feature views (two views in our case) to select and sample unlabelled examples to be labelled by the domain experts, while the original SVM and TSVM randomly sample some unlabelled examples and use classifiers to assign labels to them. We expect that SVM and TSVM will benefit from the accordance sampling method. Let V1 and V2 be the two feature views, learned from the labelled training data L and used to classify all unlabelled examples U, applying both views to each example in U. (3.1)

SVM and TSVM train the redundant view classifiers by learning from the most informative labelled examples [128,129], then use the view classifiers to classify the most informative unlabelled examples. The unlabelled examples on whose classification the two view classifiers agree the most are then sampled. We use a ranking function to rank all the unlabelled instances according to the predictions of the view classifiers. When the two view classifiers assign the same label to an unlabelled instance x_i, its ranking score is taken as the larger of the average predicted probabilities for the positive and the negative class under the two views:

    s(x_i) = max{ ½ [P_V1(+1 | x_i) + P_V2(+1 | x_i)], ½ [P_V1(−1 | x_i) + P_V2(−1 | x_i)] }    (3.2)

The generated scores result in a ranking in which the examples in the highest positions are the ones to which both view classifiers assign the same label with high confidence; these are the most informative unlabelled examples.
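A minimal sketch of this ranking, with scikit-learn logistic-regression view classifiers standing in for the SVM/TSVM views; the data, the column-based views, and the zeroing of disagreements are illustrative assumptions consistent with Eq. (3.2).

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(9)
    X_l = rng.normal(size=(40, 4))                     # labelled data L
    y_l = np.where(X_l[:, 0] + X_l[:, 2] > 0, 1, -1)
    X_u = rng.normal(size=(200, 4))                    # unlabelled pool U

    v1, v2 = [0, 1], [2, 3]                            # two attribute views
    c1 = LogisticRegression().fit(X_l[:, v1], y_l)     # view classifier V1
    c2 = LogisticRegression().fit(X_l[:, v2], y_l)     # view classifier V2

    p1 = c1.predict_proba(X_u[:, v1])                  # columns: [P(-1), P(+1)]
    p2 = c2.predict_proba(X_u[:, v2])
    agree = c1.predict(X_u[:, v1]) == c2.predict(X_u[:, v2])

    avg = (p1 + p2) / 2.0
    score = np.where(agree, avg.max(axis=1), 0.0)      # Eq. (3.2); disagreements sink
    to_label = np.argsort(score)[::-1][:10]            # ask the expert about these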
3.5 Results

First of all, cases with missing values were deleted from the data sets. The experiments were then started by running the SVM and TSVM classifiers in Java using the 10-fold cross-validation option, and evaluating the accuracy of the obtained classification simply by looking at the percentage of correctly classified instances. The classifiers were applied in order to obtain an automated way to justify and reproduce the classification previously obtained. The results obtained from the SVM were quite good: 572 out of 663 cases were correctly classified (86.3%) and just 91 (13.7%) were incorrectly classified. The main concern in using this classifier came from the set of rules produced: these appear to be quite numerous and not straightforward, especially if they were to be used by scientists not familiar with computational analysis. The Transductive Support Vector Machine (TSVM) was then considered. This method performed better than the SVM, succeeding in correctly classifying 632 instances (95.4%) out of 663; just 31 cases (4.6%) were misclassified. A summary of the above results can be found in Table 3.3.

Whole data
Method   Classified      Misclassified
SVM      572 (86.3%)     91 (13.7%)
TSVM     632 (95.4%)     31 (4.6%)

Table 3.3: Comparison of results for the two classifiers using all samples

The strategy was then to select those discriminating samples in the categorisation process whose distribution ranked highest among the classes. These samples were selected based on the accordance sampling method. An exhaustive search for the best combination of 50 samples out of these 663 was then performed, based on the Support Vector Machine classification results. This was done to reduce the number of samples used for classification, a clinical aim: it should both simplify and reduce the cost of clinical testing applying such samples. This 'new' smaller dataset was used to repeat the previous experiments with the above classifiers. With SVM, significant differences could not be seen: 576 cases (86.9%) were correctly classified, with a small reduction in the number of misclassified instances, this time 87 (13.1%). The TSVM, instead, performed very well compared to the previous run: now 651 out of 663 cases (98.2%) were classified properly and just 12 (1.8%) were misclassified. A summary of these results is reported in Table 3.4.

Top 50 samples
Method   Classified      Misclassified
SVM      576 (86.9%)     87 (13.1%)
TSVM     651 (98.2%)     12 (1.8%)

Table 3.4: Comparison of results for the two classifiers using only 50 samples

The 10 accuracies of each algorithm were compared using t-tests, after checking for normality using the Shapiro test [130]. It was found that, for both the whole data and the 50-sample datasets, the TSVM classifier performed significantly better than SVM (p < 0.01). The findings are summarized in Table 3.5.

Average accuracies   SVM          TSVM
Whole data           86.9 (2.5)   94.9 (2.6)
50 samples           87.8 (6.3)   97.6 (1.8)

Table 3.5: Average accuracies for 10 cross-validation experiments for the classifiers (standard deviation in brackets)
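The statistical comparison can be reproduced along the following lines; this is a hedged sketch (the per-fold accuracies are invented placeholders, since only means and standard deviations are reported above, and the paired form of the t-test is an assumption).

```python
from scipy import stats

# Hypothetical per-fold accuracies from one 10-fold cross-validation run.
svm_acc  = [0.84, 0.88, 0.86, 0.90, 0.85, 0.87, 0.89, 0.86, 0.88, 0.86]
tsvm_acc = [0.93, 0.96, 0.94, 0.97, 0.95, 0.94, 0.96, 0.95, 0.93, 0.96]

# Shapiro test: only proceed with the t-test if normality is plausible.
for acc in (svm_acc, tsvm_acc):
    _, p_normal = stats.shapiro(acc)
    assert p_normal > 0.05, "normality assumption is questionable"

# Paired t-test, since both classifiers are evaluated on the same folds.
t_stat, p_value = stats.ttest_rel(tsvm_acc, svm_acc)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```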
Nottingham Tenovus Primary Breast Cancer (NTBC)

Figure 3.4 shows a graph of accuracy across different percentages of the training data for each class, using TSVM and SVM with accordance sampling, for the Nottingham Tenovus Primary Breast Cancer Series dataset. In comparison, Fig. 3.5 shows accuracy across different percentages of training data using random sampling for TSVM and SVM. We observed that TSVM was able to produce higher average classification accuracy than SVM with the use of the accordance sampling method across different amounts of labelled training data ranging between 10% and 90%. Although SVM starts with a relatively small difference in contrast to TSVM, this gap grows wider starting at 40%, where the accuracies become 90.9% and 84.76% for TSVM and SVM respectively. However, from 75% to 90% the difference starts to narrow again. This indicates that active learning based on accordance sampling provides benefit for both supervised and semi-supervised learning methods; nevertheless, the TSVM algorithm outperforms SVM. In practice, when a large amount of training data is used, such as 90%, TSVM achieved an average accuracy of 99.75% with the sampling method, while SVM achieved an average accuracy of 94.56%.

Accuracy was used to measure the performance strength of our model. The results of the SVM and TSVM performance using the new accordance sampling method are compared to random sampling in Tables 3.6 and 3.7. We found that the maximum accuracy of the SVM classifier with random sampling was obtained when using 90% training data, with an accuracy of 90.54%, while the maximum accuracy of SVM with accordance-based sampling using 90% training data was 94.56%. By comparison, the minimum accuracy, obtained at 10% training data, was 80.87%, exactly the same result as obtained with random sampling. Tables 3.6 and 3.7 summarize these findings, and Figures 3.4 and 3.5 present the same results graphically across the different amounts of training data.

Nottingham Tenovus Breast Cancer (random sampling)
Training %   TSVM       SVM
10           81.99862   80.87
20           81.97931   80.32
25           86.44067   82.33
30           86.92682   85.09
40           87.8       85.36
50           89.99      85.92
60           91.54      85.97
70           93.75      86.34
75           94.59459   88.96
80           94.12141   89.72
90           94.11965   90.54

Table 3.6: Comparing SVM and TSVM using random sampling with different percentages of training samples for each class

Nottingham Tenovus Breast Cancer (accordance sampling)
Training %   #Test   TSVM    SVM
10           595     83.12   80.87
20           529     85.66   82.66
25           496     85.88   84.43
30           463     86.34   84.12
40           397     90.9    84.76
50           331     93.07   87.09
60           265     93.27   88.87
70           199     94.37   90.34
75           166     95.64   92.76
80           133     97.37   93.88
90           67      99.75   94.56

Table 3.7: Comparing SVM and TSVM using accordance sampling with different percentages of training samples for each class

Figure 3.4: Accordance-based sampling, TSVM vs. SVM, with different percentages of labelled training data for each class (accuracy % against % training data)

Wisconsin Diagnostic Breast Cancer (WDBC)

Figure 3.6 indicates the accuracy across different percentages of training data of each class using TSVM and SVM with accordance sampling for the Wisconsin Diagnostic Breast Cancer (WDBC) data set. In contrast, Figure 3.7 indicates the accuracy across different percentages of training data using the original random sampling TSVM and SVM, run ten times for each amount of labelled training data.

Figure 3.5: Original random sampling, TSVM vs. SVM, with different percentages of labelled training data (accuracy % against % training data)

It appeared that TSVM was able to provide an advantage over regular SVM, producing higher average classification accuracy than SVM with the accordance sampling method for all the different amounts of training data ranging between 10% and 90%. Moreover, SVM starts with a relatively large gap compared to TSVM, and this gap widened and narrowed irregularly. Specifically, at 20% labelled training data the accuracies were 90.05% and 86.45% for TSVM and SVM respectively, while the original random sampling gave 84.75% and 82.48% for TSVM and SVM with the same amount of labelled training data. At 75% the difference narrows, but later it starts to widen again. This indicates that active learning based on accordance sampling benefits both the supervised and semi-supervised learning methods; however, the TSVM algorithm outperforms SVM. In practice, when measuring the accuracy at 90% training data, TSVM achieved an average accuracy of 96.5% with the sampling method, whereas the original random sampling TSVM achieved an average accuracy of 93.51%; the minimum for TSVM with accordance sampling was 89.36% using 10% training data. For SVM, at 90% training data it recorded 93.93% with accordance-based sampling, while the original random sampling SVM gave around 90%. The results of the performances of SVM and TSVM using the new accordance sampling method are matched to random sampling in Tables 3.8 and 3.9. We found that the maximum accuracy of the SVM classifier with random sampling was achieved when using 90% training data, with an accuracy of 90.97%, while the maximum accuracy of SVM with accordance-based sampling using 90% training data was 93.93%.
Comparing the minimum accuracy obtained at 10% training data, the accuracy with accordance sampling was 85.22%, which is somewhat better than the 80.24% obtained with random sampling.

Breast Cancer Wisconsin (random sampling)
Training %   TSVM    SVM
10           84.37   80.24
20           84.75   82.48
25           85.95   84.75
30           86.23   84.12
40           88.11   84.31
50           88.56   85.19
60           90.44   87.70
70           91.02   87.91
75           91.28   89.92
80           92.37   90.10
90           93.51   90.97

Table 3.8: Comparing SVM and TSVM using random sampling with different percentages of training samples for each class

Breast Cancer Wisconsin (accordance sampling)
Training %   #Test   TSVM    SVM
10           629     89.36   85.22
20           559     90.05   86.45
25           524     91.95   86.77
30           489     92.23   88.09
40           419     93.11   89.34
50           349     93.56   90.22
60           280     94.44   91.67
70           210     94.02   92.88
75           175     94.8    92.98
80           140     95.37   93.07
90           70      96.5    93.93

Table 3.9: Comparing SVM and TSVM using accordance sampling with different percentages of training samples for each class

Figure 3.6: Accordance-based sampling, TSVM vs. SVM, with different percentages of labelled training data for each class (accuracy % against % training data)

Figure 3.7: Original random sampling, TSVM vs. SVM, with different percentages of labelled training data (accuracy % against % training data)

3.6 Discussion of results

In the experiments presented in this chapter, several different results were obtained from the classifiers. Using the whole Nottingham dataset, the best performance was obtained from the TSVM classifier: just 31 cases were incorrectly classified. The SVM returned worse results than the TSVM, with 91 cases incorrect. When just the 50 samples were considered, there was a substantial improvement for both learning methods, supervised and semi-supervised: the number of misclassified instances fell from 31 to 12 for TSVM and from 91 to 87 for SVM. Again, the best results were obtained using the TSVM; this time the SVM did not perform as well as TSVM, with just 4 fewer cases of misclassification than before. Overall, the supervised support vector machine was the worse of the two classifiers used, performing almost identically with all samples and with the reduced set.

Starting from these results, a 'non-parametric' approach to the SVM and TSVM classifiers, able to deal with continuous and non-normal covariates, was developed. The method was presented and its performance on two particular data sets was compared to the original random sampling SVM as well as TSVM. The SVM method did not perform well on all data considered in this part of the work; focusing on the breast cancer data, this reflects a sort of independence between biological markers and clinical information. Moreover, all the dataset samples strongly violated the normality assumption, which justifies taking a 'non-parametric' approach. For each class, the median value and the histogram of its distribution were computed. The different situations that might occur were then considered.
Fixing a particular class and a particular data point, if the value of a generic sample was lower or greater than the extreme values of the same sample in the class considered at that stage, then a probability close to zero of belonging to the specified class was assigned to that data point. Secondly, if the value was identical to the median, the probability was set to one. Finally, if the data point was smaller than the median, the area between the distribution's minimum and the actual value was calculated (or between the value and the distribution's maximum if the value was greater than the median); the value obtained was then divided by half the number of observations. Then, as with the SVM classifier, for each case the product of the probabilities of all samples given the classes was calculated, and data were classified by looking at the class for which the highest value was reached.

With the method just described, a larger number of data points was correctly classified, raising the percentage from 80.24% to 85.22% for the Wisconsin breast cancer dataset using only 10% training data with SVM, and from 85.19% to 90.22% for the same dataset using only 50% training data, also with SVM. An increase from 90.97% to 93.93% for SVM was obtained using 90% training data with accordance-based sampling. When using TSVM, different results were obtained for the Wisconsin breast cancer dataset, and the proposed new model seemed to be more accurate (in terms of percentages of patients correctly classified) and better calibrated with respect to the semi-supervised learning. The improvement was from 84.37% to 89.36% for TSVM using only 10% training data for the Wisconsin Diagnostic Breast Cancer dataset, from 88.56% to 93.56% for the same dataset using only 50% training data with TSVM, and from 93.51% to 96.50% for TSVM using 90% training data with accordance-based sampling.

This was true also when considering the Nottingham Tenovus Breast Cancer data set, for which the new algorithm appeared to be slightly better calibrated and more accurate. A larger number of data points was correctly classified, raising the percentage from 81.99% to 83.12% for the Nottingham breast cancer dataset using only 10% of the training data with TSVM, and from 89.99% to 93.07% for the same dataset using only 50% of the training data with TSVM. Moreover, there was a rise from 94.11% to 99.75% for TSVM using 90% training data with accordance-based sampling. We obtained more varied outcomes using SVM: for the Nottingham breast cancer dataset, the estimated model was less precise in terms of the rate of patients properly classified, and less well calibrated, with respect to the supervised learning. The change was negligible at 80.87% for SVM using only 10% training data for the Nottingham breast cancer dataset, while the improvement was from 85.92% to 87.09% for the same dataset using only 50% training data with SVM, and from 90.54% to 94.56% for SVM using 90% training data with accordance-based sampling. Indeed, for this dataset, when a random sampling SVM was fitted to the data, the number of patients correctly assigned to their class at 10% training data was identical to the one obtained when using SVM with accordance-based sampling. In addition, the ROC curve associated with the method presented here was very similar to the one produced by the SVM, providing two close values for the areas under the curve.
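One plausible reading of the median-based assignment described above can be sketched as follows (a simplified illustration, not the thesis code: counting observations stands in for the 'area' of the histogram, and the small floor constant is an assumption used to represent a near-zero probability).

```python
import numpy as np

def median_score(value, class_values):
    """Probability-like score of one variable for one class: ~1 at the
    class median, ~0 outside the class range, otherwise the fraction of
    the half-distribution between the range boundary and the value."""
    lo, hi = class_values.min(), class_values.max()
    med = np.median(class_values)
    if value < lo or value > hi:
        return 1e-9                              # effectively zero
    if value == med:
        return 1.0
    half = len(class_values) / 2.0               # half the observations
    if value < med:
        area = np.sum(class_values <= value)     # between min and value
    else:
        area = np.sum(class_values >= value)     # between value and max
    return area / half

def classify(x, classes):
    """classes maps a label to an (n_c, n_vars) NumPy array of training
    values; the class with the largest product of per-variable scores wins."""
    def case_score(data):
        return np.prod([median_score(v, data[:, j]) for j, v in enumerate(x)])
    return max(classes, key=lambda label: case_score(classes[label]))
```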
It is important to note that a couple of the datasets presented in this work were also used in [131] for comparing SVM as a supervised method with kernel methods and TSVM as semi-supervised methods, obtaining both better and worse results. Those methods dealt with continuous sampling when using an SVM classifier; the accordance-based sampling method, instead, was developed to replace the random sampling of SVM and TSVM for several dataset samples. Moreover, it outperformed all the other methods proposed in [131] when applied to the breast cancer datasets. It is also worth noting that the newly developed method is not meant to be applicable to all available datasets. In this chapter, several situations were presented in which a classical approach, the TSVM classifier, was outperformed by a more general algorithm that does not assume any particular distribution of the analysed samples. In general, according to our experience, the new method outperforms the classical SVM classifier when datasets with categorical samples are considered, or when the majority of the samples do not follow a normal distribution. In these situations, it is advisable to use the SVM with the accordance-based sampling approach.

3.7 Summary

In this chapter, supervised and semi-supervised learning were applied to several case studies. In particular, two different classifiers, namely the Support Vector Machine and the Transductive Support Vector Machine, were reviewed and used over the 'in-class' patients of the Abd El-Rehim et al. [106] breast cancer dataset, in order to validate the earlier classification derived and characterized in previous studies. Surprisingly, the TSVM classifier performed quite well, especially when just the 50 'most-important' samples were considered. This happened even though one of the underlying assumptions of the TSVM was strongly violated by the data: in fact, none of the samples followed a normal distribution. An accordance-based sampling version of the TSVM was then developed and validated on known datasets. These latter results were presented in this chapter, together with their comparison with the Support Vector Machine approach.

Chapter 4

Automatic Features and Samples Ranking for SVM Classifier

This chapter presents the modelling of automated features and samples ranking using Support Vector Machines (SVM). Section 4.2 presents the motivation and background, and section 4.3 provides a description of how the input data for the SVM algorithm was derived. Different models for ranking are studied and presented in section 4.4, together with two measures, MAP and NDCG. Sections 4.5 and 4.6 present experimental results and discussion respectively. Lastly, section 4.7 gives a brief summary of the chapter.

4.1 Introduction

Ranking is a central issue in data mining for biomedical data, in which a given set of objects (e.g., documents, patients) are ordered in terms of a computed score for each one. Depending on the application, the scores may represent degrees of relevance, preference, or importance. This chapter takes as its examples ranking for relevance and searching for importance. Only a small number of strong features, combined with the most informative samples, were used to represent relevance and to rank biomedical data for breast cancer patients. As mentioned, ranking is one of the most vital topics in data mining for medical data.
While algorithms for learning ranking models have been intensively studied, this is not the case for sample or feature selection, despite its importance. In practice, many sample and feature selection methods developed for classification are applied directly to ranking. We argue that, because of the striking differences between ranking and classification, it is better to develop dedicated feature and sample selection methods for ranking.

4.2 Background

4.2.1 Feature Ranking

In recent years, with the development of supervised learning algorithms like Ranking SVM [132,133] and RankNet [134], it has become possible to incorporate more features and samples (strong or weak) into ranking models. In this situation, feature selection inevitably becomes an important issue, particularly from the following viewpoints. First, feature selection can help enhance accuracy in many machine learning problems, which strongly indicates that feature selection is also necessary for ranking. For example, although the generalization ability of Support Vector Machines (SVM) depends on a margin which does not change with the addition of irrelevant features, it also depends on the radius of the training data points, which can increase as the number of features increases [135,136,137]. Moreover, the probability of over-fitting also increases as the dimensionality of the feature space increases, and feature selection is a powerful means of avoiding over-fitting [138]. Secondly, feature selection may make training more efficient. In data mining, and especially in biomedical data research, the data size is usually very large, and thus training ranking models is computationally costly. For example, when applying Ranking SVM to biomedical datasets, it is easy to encounter a situation in which training cannot be completed in an acceptable time. To deal with such a problem, we can conduct feature selection before training, since the complexity of most learning algorithms depends on the number of features.

Although feature selection is important, to our knowledge there have been no methods of feature selection dedicated specifically to ranking; most of the methods used in ranking were developed for classification. Broadly, feature selection methods in classification fall into three categories [139]. In the first category, named filter methods, feature selection is defined as a preprocessing step and can be independent of learning: a filter method computes a score for each feature and then selects features according to the scores [140]. Yang et al. [141] and Forman [142] conducted comparative studies on filter methods, and found that information gain (IG) and chi-square (CHI) are among the most effective methods of feature selection for classification. The second category, referred to as wrapper methods [142], utilizes the learning system as a black box to score subsets of features, and the third category, the embedded methods [142], performs feature selection within the process of training. Of these three categories, the most comprehensively studied are the filter methods. We therefore also base our discussion on this category in this chapter, and will use 'feature selection' and 'the filter methods for feature selection' interchangeably. The sketch below illustrates the two filter scores named above.
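As a brief illustration of the filter idea (a sketch using scikit-learn on synthetic data, not an experiment from this thesis), both scores reduce to computing one number per feature and keeping the top-scoring ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, mutual_info_classif

# Toy data standing in for a biomedical feature matrix; chi2 requires
# non-negative inputs, so the features are shifted accordingly.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = X - X.min(axis=0)

chi_scores, _ = chi2(X, y)                              # CHI statistic
ig_scores = mutual_info_classif(X, y, random_state=0)   # IG estimate

# A filter method simply keeps the k features with the largest scores.
top_10 = np.argsort(-chi_scores)[:10]
```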
When applying these feature selection methods to ranking, several problems may arise. First, there is a significant gap between classification and ranking. In ranking, a number of ordered categories are used, representing the ranking relationship between instances, while in classification the categories are 'flat'. Obviously, existing feature selection methods for classification are not directly suitable for ranking. Second, the evaluation measures used in ranking problems (e.g. mean average precision (MAP) [143] and normalized discounted cumulative gain (NDCG) [144]) are different from the measures used in classification:

1- In classification, precision and recall are of equal importance, while in ranking, precision is considered more significant than recall.
2- In ranking, it is critical to rank the top-n cases properly, while in classification, it is important to classify all cases correctly.

Due to these distinctions, new and specific methods for feature selection are needed in ranking. Precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved.

4.2.2 Samples Ranking

Learning ranking (or preference) samples has become a pivotal issue within the machine learning community [145,146,147], and as a result many applications have been produced in data mining for biomedical data [148,133]. For data mining applications on biomedical data, distinctions can be made between learning ranking samples and learning classification samples as follows:

1. Unlike classification models, which output a distinct class for a data object, ranking models output a score for each data object, from which a global ordering of the data is constructed.
2. Unlike the training set in classification, which is a set of data objects together with their category labels, the training set in ranking consists of partial orders of the data. For example, let "x_u > x_v" denote that x_u is preferred to x_v; the training data is a set of such partial orders, and the target function F outputs a ranking score F(x) for any data object x.

There are other types of ranking models; however, this model has produced practical applications in data mining for biomedical data [147,133,150]. Other appropriate patterns for ranking are discussed in references [149,118]. SVM methods for selecting samples for binary classification are provided by reference [151] and applied in [118] to conduct effective binary relevance feedback for image retrieval. However, these techniques are proposed within the context of binary classification (of whether an image is relevant or not) and thus do not support the learning of ranking samples from partial orders. Extending selective sampling to ranking [152] requires considerable effort: it depends on pairwise decomposition [145] as well as constraint classification [146]. These approaches extend multi-class classification to ranking and are thus limited to a finite and a priori fixed set of data, and the model is not scalable to the size of the dataset. Our selective sampling is based on the large-margin ranking which has proven effective in practice for learning global ranking functions [133]. Paying due attention to ranking scores, this ranking model orders new instances, and is thus scalable and represents a means of producing several applicable methods for data mining on biomedical data [147,133,150]. SVMs (support vector machines) have proven effective in learning classification and regression functions [61,153,78].
They have also shown excellent performance in learning ranking functions [147,133,150]. They effectively learn ranking functions of high generalization: 'In the context of ranking, a function F of high generalization means that a learned ranking function F not only is concordant with the ordering of the training set (i.e., partial orders) but also generalizes well beyond the training set', based on the 'large-margin' principle, and they also systematically support nonlinear ranking via the 'kernel trick' [147]. SVM ranking learns the function in a supervised batch learning setting, which assumes that a set of training samples (i.e., partial orders) is given. In many applications, however, collecting training samples involves human labour, which is time-consuming and often expensive. This is an even more serious problem in ranking than in classification, since labelling is the central issue there [154]: labelled data in ranking denotes a partial ordering of the data, so users must consider the relative ordering of the data when labelling, while in classification users only consider the absolute class of each data point.

The concept of active learning, or selective sampling, refers to approaches that aim at reducing the labelling effort by selecting only the most informative samples to be labelled. SVM selective sampling techniques have been developed and proven effective in achieving high accuracy with fewer examples in many applications [133,134]. However, they are restricted to classification problems and do not extend to ranking problems. Here, a selective sampling technique is proposed for learning SVM ranking functions; that is, using our selective sampling technique, an accurate SVM ranking function can be learned with fewer partial orders. Our method is 'optimal' in the sense that it selects, at each round, the most concordant set of samples, which is considered most informative in SVM ranking. According to the results of experiments, the labelling effort is significantly reduced. The sampling technique is applied to the data mining application [142]; many experiments were carried out on this application, and their results are consistent. In other words, accurate ranking functions with fewer interactions with users are the outcome of the selective sampling method.

4.3 Methodology

4.3.1 Feature Selection

The contribution of this chapter is the proposal of a new technique for feature selection in the ranking setting. Feature selection is useful in biomedical data classification, since it helps to remove useless or redundant features and thus reduces classifier complexity. In this chapter, we propose a novel method for this purpose with the following properties.

1) The method makes use of ranking information, instead of simply viewing the ranks as flat categories. For example, it uses evaluation measures or loss functions [148,108] in ranking to measure the importance of features.
2) Considering the similarities between features, inspired by the work in [154,155], helps to avoid the redundant selection of features.
3) Feature selection is modelled as a multi-objective optimization problem in ranking: finding the most important and least similar features is the final objective.
4) A greedy search algorithm is presented to solve the optimization problem. Under certain conditions, the solution provided by this method can be considered optimal for the original problem.

Feature selection in ranking needs such properties.
We evaluate how the proposed feature selection technique performs on two datasets, NTBC [95] and MADELON [105], and with two state-of-the-art ranking models, Ranking SVM [132,133] and RankNet [134]. We have carried out many experiments to check the performance of the suggested method; in ranking for medical data, the method proves able to outperform traditional feature selection methods.

To select features from the entire feature set, our method first defines an importance score w_i for each feature f_i, together with a similarity score s_{i,j} between any two features f_i and f_j. An efficient algorithm is then employed to maximize the total importance scores and minimize the total similarity scores of the selected set of features.

First, an importance score is assigned to each feature. An evaluation measure such as MAP or NDCG (defined hereinafter in section 4.4), or a loss function (e.g. pairwise ranking errors [121,122]), is used to define the importance score: we use the single feature to rank the instances, evaluate the performance in terms of the measure, and take the result as the importance score; with a loss function, we instead assign a score inversely related to the corresponding loss. It is worth mentioning that features differ in their orientation: higher ranks correspond to larger feature values in some cases and to smaller values in others. By computing MAP, NDCG or the ranking loss twice (with the instances sorted in both normal and inverse order) and taking the larger score, we obtain the importance score of the feature regardless of its orientation.

Inspired by the work in [154,156,155], we also consider removing redundancy among the selected features. This is particularly necessary when only a small number of features can be used. In this work, we measure the similarity between any two features on the basis of their ranking results: each feature is regarded as a ranking model, and the similarity between two features is represented by the similarity between the ranking results they produce. Many methods have been proposed for measuring the distance between two ranking results (ranking lists), such as Spearman's footrule F, rank correlation R, and Kendall's τ [157,158]; any of them can be used here. Taking Kendall's τ as an example, its value for two features f_i and f_j with respect to a query q can be calculated as

    τ_q(f_i, f_j) = #{(x_u, x_v) ∈ P_q : x_u ≺_{f_i} x_v and x_u ≺_{f_j} x_v} / #P_q

where P_q indicates the set of instance pairs with respect to query q, #{·} represents the number of elements in a set, and x_u ≺_f x_v denotes that instance x_u is ranked ahead of instance x_v by feature f. The Kendall's τ values of all the queries are averaged, and the result is used as the final similarity score between features f_i and f_j. It is easy to see that τ(f_i, f_j) = τ(f_j, f_i) holds. A sketch of the resulting greedy selection procedure follows.
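A minimal sketch of this greedy procedure is given below (an illustration under the stated formulation, not the thesis implementation; the trade-off coefficient c and the use of scipy's kendalltau in place of the per-query averaged similarity are assumptions).

```python
import numpy as np
from scipy.stats import kendalltau

def greedy_select(importance, rankings, k, c=0.5):
    """Greedily pick k features that are important yet dissimilar.

    importance[i] -- MAP/NDCG-based importance score of feature i
    rankings[i]   -- ranking of the instances induced by feature i
    At each step, choose the feature maximizing its importance minus
    c times its total similarity with the already selected features.
    """
    selected = []
    remaining = list(range(len(importance)))
    for _ in range(k):
        def gain(i):
            penalty = sum(kendalltau(rankings[i], rankings[j])[0]
                          for j in selected)
            return importance[i] - c * penalty
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```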
4.3.2 Sample Selection

To establish the context of our discussion, we first present the preliminaries. We represent each data object as a vector in which each element is a numerical value for one attribute of the data. For instance, a patient from a real breast cancer dataset can be represented by the vector (age, recurrence or survival, size, menopausal status, nodal status, histologic subtype). Given an ordering R of a set of vectors, we say that x_u has a higher rank than x_v, written x_u > x_v, when x_u is ordered ahead of x_v in R. We presume that R is a strict ordering, meaning that for all pairs x_u and x_v in a set D, either x_u > x_v or x_v > x_u holds; the discussion can, however, be directly generalized to weak orderings. Let R* be the optimal ranking of the data, in which the data is ordered perfectly according to the patient feature or situation. A ranking function F is evaluated by how closely its ordering approximates R*.

There is wide reliance on Kendall's τ as a method to measure the similarity between two orderings [133]. For two separate orderings R_a and R_b, Kendall's τ can be expressed through the number of concordant pairs P and the number of discordant pairs Q: a pair is concordant if R_a and R_b agree on how they order it, and discordant otherwise. For orderings on a dataset D, we define the similarity function as the proportion of concordant pairs:

    τ(R_a, R_b) = P / (P + Q)

To illustrate, suppose two functions F and F' order five vectors as in (4.2) and (4.3) respectively. Here we calculate τ(R_F, R_F') as 0.7, since the number of discordant pairs is 3 while the remaining 7 pairs are concordant: the two orderings are 70% in harmony. In terms of the τ measure, the degree of accuracy of F is evaluated as the similarity between the ordering R_F produced by F and the optimal ordering R*, i.e., τ(R_F, R*).

SVM Rank Learning

SVM techniques enable us to learn a global ranking function F from partial orders R, presuming F is a linear ranking function such that

    F(x_u) > F(x_v)  if and only if  w · x_u > w · x_v    (4.4)

The learning algorithm amends a weight vector w. A set of partial orders can be ranked linearly if there exists a linear function F (i.e., a weight vector w) that satisfies Eq. (4.4) for all of them. In conclusion, we aim at learning an F which is in harmony with the provided partial orders R and has good generalization beyond R. Finding a weight vector w satisfying Eq. (4.4) for most data pairs is known to be an NP-hard problem [107]; reference [147] achieves an approximate solution based on SVM techniques, by introducing (non-negative) slack variables ξ_uv and minimizing their upper bound, as follows.

QP 1:
    minimize    L(w, ξ) = (1/2) w · w + C Σ ξ_uv    (4.5)
    subject to  w · (x_u − x_v) ≥ 1 − ξ_uv  for all (x_u, x_v) with x_u > x_v in R    (4.6)
                ξ_uv ≥ 0    (4.7)

By the constraint (4.6), and by minimizing the upper bound Σ ξ_uv in (4.5), QP 1 satisfies the orderings on the training set R with minimal error. By minimizing w · w, or equivalently by maximizing the 'margin', it tries to maximize the generalization of the ranking function; how maximizing the margin corresponds to increasing the generalization of ranking is discussed below. The soft margin parameter C controls the trade-off between the size of the margin and the training error. QP 1 is thus equivalent to an SVM classification on the pairwise difference vectors (x_u − x_v). By rearranging the constraint (4.6) as

    w · x_u ≥ w · x_v + 1 − ξ_uv    (4.8)

we can extend an existing SVM implementation to solve the QP, as sketched below.
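The reduction to classification over difference vectors can be sketched as follows (a hedged illustration assuming a linear kernel and scikit-learn's LinearSVC; the thesis experiments used their own SVM implementations).

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_ranking_svm(X, ranks, C=1.0):
    """Learn a linear ranking function F(x) = w . x from ordered data.

    For every pair where instance u should be ranked above v, the
    difference x_u - x_v is a positive training example (and the
    reversed difference a negative one), which is exactly the pairwise
    classification problem that QP 1 reduces to.
    """
    diffs, labels = [], []
    for u in range(len(X)):
        for v in range(len(X)):
            if ranks[u] > ranks[v]:              # u preferred over v
                diffs.append(X[u] - X[v]); labels.append(+1)
                diffs.append(X[v] - X[u]); labels.append(-1)
    svm = LinearSVC(C=C, fit_intercept=False)    # no bias term needed
    svm.fit(np.asarray(diffs), np.asarray(labels))
    return svm.coef_.ravel()                     # the weight vector w

# New instances are then ordered by their scores X_new @ w.
```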
Maximizing Generalization of Ranking

The support vectors in QP 1 denote the data pairs (x_u, x_v) such that w · (x_u − x_v) = 1 − ξ_uv. Assume that the training data is linearly rankable, and thus all ξ_uv = 0. Then, from Eq. (4.8), the support vectors are the closest data pairs when projected onto w. From Eq. (4.4), the linear ranking function F projects data vectors onto the weight vector w; the geometrical distance between two vectors x_u and x_v projected onto w is w · (x_u − x_v) / ||w||. Thus the margin, which in the classification problem denotes the distance from the support vectors to the boundary, denotes in the ranking problem the distance between the closest two projections. This is illustrated in Figure 4.1 by two different linear functions F_1 and F_2 that project four data vectors onto w_1 and w_2 respectively in a two-dimensional space.

Figure 4.1: Linear projection of four data points onto two weight vectors

Both weight vectors produce the same ordering of the four vectors. Let δ_1 and δ_2 denote the distances between the closest two projections onto w_1 and w_2 respectively. We compute the weight vector w so that it is concordant with the given orders and generalizes beyond them, by maximizing the distance between the closest data pairs in the ranking. By minimizing ||w||, QP 1 maximizes the margin, i.e. the distance between the closest data vectors in the ranking. For example, in Figure 4.1, although the two weight vectors order the data identically, w_1 shows better generalization than w_2, because the distance between the closest vectors projected onto w_1 (δ_1) is larger than that onto w_2 (δ_2). Maximizing the margin in ranking is explained in detail in [133]. The learned ranking function F is represented through dot products of data vectors, and thus it is possible to use nonlinear kernels to learn nonlinear ranking functions; see [133] for the nonlinear kernel extension of the ranking function.

Figure 4.2: Diagram of an example feature selection procedure (feature extraction yields an initial feature set; Ranking SVM training produces a model which is evaluated with NDCG/MAP; feature inference is repeated until the NDCG/MAP criterion is satisfied, and the selected features feed the final Ranking SVM training)

4.4 Experiment Settings

4.4.1 Datasets

Our experiments are based on two main benchmark datasets. MADELON is an artificial UCI dataset which we use for classification with random sampling. The dataset includes around 4400 patients with binary relevance judgments, representing a two-class classification problem with sparse binary inputs. The BM25 model [159] is used to recall the top 2000 patients for each feature. In our experiments, 50 features were extracted for each patient, including conventional features such as the patient's age, size of tumour and survival. We use 2000 examples for the training set, 600 for validation, and 1800 for the test set.

The second dataset is the Nottingham Tenovus Primary Breast Cancer (NTBC) data [95], which has been used in many experiments in data mining for biomedical research [107,108]. NTBC is an invasive breast cancer collection, developed by Daniele Soria et al. at the University of Nottingham. In our experiments, we divided each of the two datasets into three parts, for training (both feature selection and model training), validation, and testing. Therefore, for each dataset, we can create different settings corresponding to different training, validation, and testing sets, and run ten trials. The results reported in this chapter are those averaged over the ten trials.
4.4.2 Evaluation measures

Two common measures were chosen to evaluate ranking methods for data mining in biomedical data, namely MAP [143] and NDCG [133,134].

Mean average precision (MAP)

MAP measures the accuracy of ranking results, assuming two kinds of instances: positive and negative (relevant and irrelevant). Precision at position n measures how precise the top n outputs of a run are:

    P@n = (number of positive instances within the top n) / n

The average precision of a run then averages P@n over the positions of the positive instances:

    AP = ( Σ_{n=1}^{N} P@n · pos(n) ) / (number of positive instances)

where N denotes the number of retrieved instances and pos(n) is a binary function indicating whether the instance at position n is positive. MAP is defined as AP averaged over all runs. In our experiments, the NTBC dataset has six groups of labels; the 'first' class was defined as positive and the other five as negative when calculating MAP, as in [107].

Normalized discounted cumulative gain (NDCG)

NDCG measures the accuracy of ranking when there are multiple levels of relevance judgment. Given a run, NDCG at position m is defined as

    NDCG@m = Z_m Σ_{j=1}^{m} (2^{r(j)} − 1) / log_2(j + 1)

where j denotes the position, r(j) denotes the relevance score of the instance at rank j, and Z_m is a normalization factor chosen to guarantee that a perfect ranking's NDCG at position m equals 1. For runs in which the number of retrieved instances is less than m, NDCG is only calculated over the retrieved instances. In evaluation, NDCG is further averaged over all runs. Note that the above measures are not only used for evaluating feature selection methods, but are also used within our method to compute the importance scores of features.
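For concreteness, both measures can be computed with a few lines of Python (a sketch following the definitions above; the relevance lists are assumed to be given best-first).

```python
import numpy as np

def average_precision(rels):
    """rels: binary relevance of a ranked list, best first."""
    rels = np.asarray(rels)
    hits = np.cumsum(rels)
    p_at_n = hits / np.arange(1, len(rels) + 1)      # P@n for each n
    return (p_at_n * rels).sum() / max(rels.sum(), 1)

def ndcg_at(rels, m):
    """rels: graded relevance r(j) of a ranked list, best first."""
    rels = np.asarray(rels, dtype=float)[:m]
    discounts = np.log2(np.arange(2, len(rels) + 2)) # log2(j + 1)
    dcg = ((2 ** rels - 1) / discounts).sum()
    ideal = np.sort(rels)[::-1]                      # perfect ranking
    idcg = ((2 ** ideal - 1) / discounts).sum()
    return dcg / max(idcg, 1e-12)                    # Z_m normalization

# MAP is the mean of average_precision over all runs.
```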
4.4.3 Ranking model

Since feature and sample selection is a preparatory step, it is necessary to evaluate how effective it is once combined with ranking models. We used two ranking models in our experiments, Ranking SVM and RankNet.

Ranking SVM

Ranking SVM [132,133] has proved to be an effective algorithm for ranking in several earlier studies. Ranking SVM extends SVM to ranking: in contrast to traditional SVM, which works on instances, Ranking SVM utilizes instance pairs and their preference labels in training. Its optimization formulation is that of QP 1 in section 4.3.2.

RankNet

Similarly to Ranking SVM, RankNet [134] also uses instance pairs in training. RankNet employs a neural network as its ranking function and cross entropy as its loss function. Let P_ij be the estimated posterior probability that instance i is ranked ahead of instance j, and let P̄_ij be the 'true' posterior probability. The loss for an instance pair in RankNet is defined as

    C_ij = −P̄_ij log P_ij − (1 − P̄_ij) log(1 − P_ij)

Gradient descent is employed to minimize the total loss over the training data. As the gradient descent may be driven toward a local optimum, RankNet selects the best model based on the validation set. RankNet has been shown to be effective, particularly on large-scale datasets.

4.4.4 Experiments

This section reports our extensive experiments studying the effectiveness of our selective sampling algorithm. Due to the lack of labelled real-world datasets for ranking which are completely ordered, we mostly evaluate the method on artificially generated global orderings. First, we evaluate the method on the UCI dataset with generated ranking functions; then we use real-world data from the Nottingham breast cancer dataset to demonstrate the practicality of the method, using the Kendall's τ measure discussed previously for sample and feature ranking evaluation.

The experiments for feature ranking were conducted in the following way. First, we ran a feature selection method on the training set. Then we used the selected features to train a ranking model with the training set, and tuned the parameters of the ranking model (e.g. the combination coefficient in the objective function of Ranking SVM, and the number of epochs in RankNet) with the validation set. These two steps were repeated several times to tune the parameters of the feature selection methods (e.g. the parameter of our method). Finally, we used the obtained ranking model to conduct ranking on the test set, and evaluated the results in terms of MAP and NDCG.

In order to make a comparison, we selected IG and CHI as the baselines. IG measures the reduction in uncertainty (entropy) in classification prediction when the feature is known; CHI measures the degree of independence between the feature and the categories. Since the notion of category in ranking differs, in theory these two methods cannot be directly applied to ranking: in the MADELON data there are two categories, 'relevant' and 'irrelevant', while for the NTBC dataset these are extended to three categories ranging from 'definitely relevant' to 'not relevant'. For this reason, we ignored the order information among the 'categories'. It is worth mentioning that IG and CHI are often used directly as feature selection methods in ranking, and this kind of approximation is always made. In addition, we also used 'With All Features (WAF)' as another baseline, in order to show the benefit of conducting feature selection. The SVM ranking and the selective sampling and feature selection algorithms were implemented in Matlab, and our experiments were carried out on a 64-bit 2800+ PC with 1 GB of RAM.

4.5 Experimental Results

4.5.1 MADELON dataset (Feature Ranking)

Fig. 4.3 shows the performance of the feature selection methods on the MADELON dataset when they work as pre-processors of Ranking SVM; Fig. 4.4 shows the performance when using RankNet as the ranking model. In the figures, the x-axis represents the number of selected features. Take Fig. 4.3 (a) as an example. It is found that, using the Support Vector Machine algorithm with only ten features, Ranking SVM can achieve the same or even better performance compared with the baseline method WAF. With more features selected, the performance can be further enhanced; in particular, when the number of features is 18, the ranking performance is 15% higher than that of WAF. When the number of selected features increases further, the performance does not improve, and in some cases even decreases. Feature selection is therefore valuable; however, as previously mentioned, selecting more features does not automatically increase the ranking performance. Technically, this goes back to the fact that selecting more features may enhance performance on the training set, but on the test set over-fitting may cause it to deteriorate; many other learning tasks, including classification itself, clearly exhibit the same phenomenon. Consequently, both the precision and the efficiency of learning to rank can be enhanced by selecting features effectively. The results of our experiments demonstrate that SVM is often able to outperform CHI, although not significantly. This verifies that our feature selection and the ranking model should go together in training after the preparatory feature selection step; we should also bear in mind that a set of features selected using SVM may be regarded as successful on one level (on the basis of MAP or NDCG), but the same selection may still be a poor choice for training the model. It is worth mentioning that there is very little difference between CHI and IG, which often perform comparably.
Figure 4.3: Ranking accuracy of Ranking SVM with different feature selection methods on the MADELON dataset ((a) MAP; (b) NDCG@10)

Experimental results also indicate that with SVM feature selection the ranking performance of Ranking SVM is more stable than with IG or CHI as the feature selection method. This is particularly true when the number of selected features is small. For example, from Fig. 4.3 (a) it can be seen that with 12 features the MAP values with SVM are more than 0.3, while those of IG and CHI are only 0.22 and 0.25 respectively. Furthermore, IG and CHI do not result in clearly better performance than WAF.

Figure 4.4: Ranking accuracy of RankNet with different feature selection methods on the MADELON dataset ((a) MAP; (b) NDCG@10)

There may be two reasons for this: IG and CHI are not designed for ranking, and the ordinal information between instances may be lost when using them; this happens when we redundantly select features using IG and CHI for NDCG@10 and for RankNet. On the other hand, we can observe similar tendencies leading to similar conclusions.

4.5.2 MADELON dataset (Samples Ranking)

At this point, we evaluate the learning performance of our method against random sampling. We randomly created 1000 data samples of 10 dimensions (i.e., a 1000-by-10 matrix) such that each element is a random number between zero and one. For evaluation, we artificially generated ranking functions: first, we generated arbitrary linear functions F(x) = w · x by randomly generating the weight vector w, the global ordering being constructed from the function; secondly, we constructed second-degree polynomial functions F(x) = (w · x + 1)^2, also by generating w randomly.

Figure 4.5: Accuracy convergence of random and selective sampling on the MADELON dataset for both linear (L) and polynomial (P) functions

Global orderings were generated from the two types of ranking functions, and we then tested the accuracy of our sampling method against random sampling on those orderings. The outcomes were averaged over 20 runs. Figure 4.5 compares the accuracy of selective sampling with that of random sampling for both linear and polynomial functions; the SVM linear and polynomial kernels are used respectively. The selective sampling method consistently outperforms random sampling on both types of functions. The accuracy at the first iteration is the same, because both start from the same random samples; selective sampling then achieves higher accuracy at each subsequent iteration (i.e., for each number of training samples), with four samples selected at each iteration.
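The synthetic set-up just described is straightforward to reproduce; the sketch below (illustrative, with an arbitrary random seed) generates the data and the two ground-truth orderings.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 random samples of 10 dimensions, entries between zero and one.
X = rng.random((1000, 10))

# Arbitrary linear and second-degree polynomial ranking functions.
w = rng.standard_normal(10)
f_linear = X @ w
f_poly = (X @ w + 1.0) ** 2

# Global orderings (indices from highest to lowest score) used as the
# ground truth against which the sampled models are evaluated.
order_linear = np.argsort(-f_linear)
order_poly = np.argsort(-f_poly)
```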
4.5.3 Nottingham Breast Cancer Dataset (Feature Ranking)

The results of the different feature selection methods on the NTBC dataset when they work as pre-processors of Ranking SVM are presented in Fig. 4.6. It can be seen that IG performs the worst this time: when the number of features selected by IG is less than 30, the ranking accuracy is significantly below that of WAF.

Figure 4.6: Ranking accuracy of Ranking SVM with different feature selection methods on the NTBC dataset ((a) MAP; (b) NDCG@10)

On the other hand, 22 features or fewer are enough for both CHI and our algorithms to achieve good ranking accuracies, and our algorithms gradually outperform CHI as more features are added. For instance, in Figure 4.6 (a), selecting more features helps to increase the MAP of Ranking SVM with our algorithms (from 15 to 20 features), whereas with CHI, after 12 features are selected, the MAP begins to decrease. Our experiments demonstrate clearly that our algorithms are able to outperform both IG and CHI. For NDCG@10 and for RankNet, we can observe similar tendencies and reach similar conclusions. In summary, our feature selection algorithms for ranking really do outperform the feature selection methods proposed for classification, and also improve upon the baseline method without feature selection.

Figure 4.7: Ranking accuracy of RankNet with different feature selection methods on the NTBC dataset ((a) MAP; (b) NDCG@10)

4.5.4 Nottingham Breast Cancer Dataset (Samples Ranking)

In this section, we perform experiments with a real-life in-house dataset extracted from Nottingham University. We extracted all immunohistochemistry data for all breast cancer patients, giving N = 1076 breast cancer patients, each with attributes such as id, age, size, grade, and stage. SVM linear kernels are used in this experiment.

Since completely ordered real-world datasets are rare, it is hard to evaluate selective sampling in real situations. This section therefore focuses on presenting the potential of the selective sampling method in data mining applications by demonstrating experimental results with real-life patients. We selected 10 different groups to test our system with different preferences, and collected around 100 real selections. It is worth mentioning that the perfect ordering intended by each selection remains unknown, which makes fair evaluation difficult: it is not feasible for a user to provide a complete ordering over hundreds or thousands of patients or samples. Thus, we evaluated the accuracy of the ranking function learned at one iteration against the partial ordering specified by the selection at the next iteration. That is, the accuracy of the ranking function learned at an iteration is measured by comparing its ordering with the user's partial ordering over the samples presented at the next iteration, with five samples presented at each iteration. This is called a measure of the expected accuracy, as it is an approximation evaluated over ten pairwise orderings (an ordering on five samples generates ten pairwise orderings); '100% expected accuracy' means the function correctly ranks five randomly chosen samples. This measure approximates the generalization performance of the ranking functions, since the evaluated samples are not part of the training data used for learning. Further, this evaluation method also helps to acquire fair evaluations from the selections, since the users are not aware of whether they are providing feedback or evaluating the functions at each round. However, this measure severely disfavours selective sampling: intuitively, selective sampling will be most effective for learning precisely when the user's ordering of the presented samples is not what was expected from the previous iteration. Thus, we used random sampling for the study reported in this section; note, however, that selective sampling is expected to be more effective in practice, as demonstrated in our experiments on the MADELON dataset in the previous section and on real data. For both random sampling (RAN) and selective sampling (SEL), we generated five samples at each iteration of each round, and measured both the accuracy of the learned ranking function and the response time. The experimental results were averaged over 20 runs.
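The expected-accuracy measure can be sketched as follows (a minimal illustration under the description above; the score dictionary and the user's five-sample ordering are assumed inputs).

```python
from itertools import combinations

def expected_accuracy(scores, user_order):
    """Fraction of the C(5,2) = 10 pairwise orderings of the user's
    five samples that the learned ranking function reproduces.

    scores:     maps a sample id to its current ranking score
    user_order: the five sample ids, most preferred first
    """
    pairs = list(combinations(user_order, 2))   # the ten ordered pairs
    agree = sum(scores[a] > scores[b] for a, b in pairs)
    return agree / len(pairs)
```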
The highlights of the experiments are as follows. Figure 4.8 shows how SEL outperforms RAN from the second iteration onward, although they begin with similar accuracy because they start from the same random samples. The size of the dataset does not affect the response time of RAN, and the time is essentially the same for both group 1 and group 2. SEL is capable of working more accurately than RAN for both datasets, at the cost of additional function evaluations, and its accuracy is likewise unaffected by the dataset size.

Figure 4.8: Accuracy convergence of random and selective sampling on the NTBC dataset

4.6 Discussion

From the results on the two datasets, the following observations are made:

1- Ranking performance can be enhanced more significantly by feature selection on the MADELON dataset than on the NTBC dataset. For instance, our feature selection methods can proportionally enhance ranking performance on the MADELON dataset by about 10%, but on the NTBC dataset by just 5-6%.

2- Our proposed methods outperform IG and CHI, and the advantage is more significant on the MADELON dataset than on the NTBC dataset. For instance, on the MADELON dataset SVM is significantly better than IG and CHI, while the enhancement over IG is only modest on the NTBC dataset.

Further analysis identifies the likely reasons. We studied the features as ranking models against their MAP values, sorting the features according to their MAP values. We found that the MADELON dataset includes useless or redundant features: for instance, there are 10 features whose MAP is smaller than 0.5. Therefore, feature selection becomes necessary not only to get rid of ineffective (or noisy) features, but also to enhance the final ranking performance. On the other hand, the relative effectiveness of most of the features in the NTBC dataset is not differentiated, so the benefit of eliminating useless features is not great. Moreover, for both datasets we examined the similarity between any two features (on the basis of Kendall's τ). Features in the MADELON dataset are clustered into many blocks, with highly similar features within the same block and less similar features across blocks. As our proposed technique reduces the total similarity scores between selected features, only representative features are selected from each cluster, and we can thus reduce the redundancy among the features; as a result, our method performs better than other feature selection methods. For the NTBC dataset, there are only two large blocks, with most features similar to each other. In this case, the similarity penalty in our approach does not help much, which is why the improvement of our method over the other methods is less significant. Based on the discussion above, we conclude that if the effects of the features vary greatly and there are redundant features, our method can work very well; when applying the method in practice, therefore, one can first test these two aspects.

For sample ranking, we used only SVM linear kernels in the experiments with the NTBC data, because linear functions, while simple, are often expressive enough, and have thus been a popular model for rank (or top-k) selection processing. However, deploying nonlinear functions might be necessary to deal with complex preferences that are not rankable by linear functions.
Nonlinear ranking functions can be learned directly using SVM nonlinear kernels.

4.7 Summary

In this chapter, we have proposed an optimization method for feature and sample selection in ranking. The contributions of this chapter include the following. We discussed the differences between classification and ranking, and made clear the limitations of existing feature and sample selection methods when applied to ranking. In addition, we proposed a novel method to select features and samples for ranking, in which the problem is formalized as an optimization issue: we maximize the total importance scores of the selected features and samples, and at the same time minimize the total similarity scores between the features as well as the samples. We evaluated the proposed method using two datasets, with two ranking models, and in terms of a number of evaluation measures. The experimental results have validated the effectiveness and efficiency of the proposed method in improving accuracy for both labelled and unlabelled data.

Chapter 5

Ensemble weighted classifiers with accordance-based sampling

The main aim of this chapter is to propose a new weighted voting classification ensemble method, based on a classifier combination scheme and a novel multi-view sampling method, as a response to the high cost of supervised data labelling. Active learning attempts to reduce the human effort needed to learn an accurate result by selecting only the most informative examples for labelling. Our work has focused on diverse ensembles for active learning in semi-supervised learning problems. The chapter is organised as follows. Majority voting classifier weights based on the accordance sampling method are discussed in further detail in the next section, followed by a description of the experiments carried out in section 5.3. In section 5.4, the results are presented and discussed. Lastly, conclusions and future work are drawn out in section 5.5.

5.1 Introduction

Recent developments in storage technology have made it possible for broad areas of applications to rely on stream data for quick responses and rapid decision-making [160]. One of the recent challenges facing data mining is to digest the massive volumes of data collected from data stream environments [160,161]. In the domain of classification, providing a set of labelled training examples is essential for generating predictive models. It is well accepted that labelling training examples is a costly procedure [162], requiring comprehensive and intensive investigation of the instances, and incorrectly labelled examples will significantly degrade the performance of the model built from the data [162,165]. A common practice to address the problem is to use selective sampling methods to label only a number of instances from which an accurate predictive model can be formed [163,164]. Selective sampling is a form of active learning that reduces the cost and number of training examples that need to be labelled by examining unlabelled examples and selecting the most informative ones [166,151]. Such a sampling method generally begins with a very small number of randomly labelled examples, carefully selects a few additional examples for which it requests labels, learns from the results of that request, and then, using its newly gained knowledge, carefully chooses which examples to label next.
The goal of the selective sampling method is to maximize prediction accuracy while labelling only a very limited number of instances, and the main challenge is to identify "important" instances that should be labelled in order to improve model training, given that one cannot afford to label all samples [167]. A general practice for selective sampling is to employ some rule to determine the most needed instances. For example, uncertainty sampling takes the instances about which the current learners are most uncertain as the most needed instances for labelling. The intention is to label the instances on which the current learner(s) have the highest uncertainty, so that providing labels for those instances helps improve model training [168].

Classification is a form of predictive modelling whose target variable is categorical. A multiple classifier model, or ensemble method, is a set of individual classifiers whose decisions are combined when classifying new patterns. In the ensemble approach to classification, many classifiers are combined to make a final prediction. There are many reasons for combining multiple classifiers to solve a given learning problem. Ensemble classifiers generally perform better than a single classifier. Moreover, multiple classifiers can exploit the locally different behaviour of the individual classifiers to improve the accuracy of the overall system, and combining them reduces the risk of picking an inadequate single classifier [167,169]. The final decision is usually made by voting over the predictions of the set of classifiers. The use of an ensemble of classifiers has gained wide acceptance in the machine learning and statistics communities after demonstrating significant improvements in accuracy [151]. Two popular ensemble methods, boosting and bagging, have received particular attention. These methods resample or reweight training sets from the original data; a learning algorithm is then repeatedly applied to each of the resampled or reweighted training sets [166,168].

Simple majority voting is a decision rule that selects one of many alternatives based on the predicted classes with the most votes; it is the decision rule used most often in ensemble methods. Weighted majority voting can be applied if the decision of each classifier is multiplied by a weight reflecting the individual confidence of that decision. Simple majority voting is a special case of weighted majority voting [163,164].

Our new sampling method generates multiple views and compares their results to identify results that agree. Each view is a subset of the dataset in which certain attributes of all the data patterns are chosen and trained on. For instance, in a two-view case, view 1 consists of one subset of the attributes and view 2 consists of the remaining attributes. Once trained, the solutions from the views are compared with each other; if they agree, the corresponding examples are selected as informative. The scheme employed for ranking how informative they are uses two types of weight: a weight vector over classifiers and a weight vector over instances. The instance weight vector identifies the most informative instances and assigns them high weights, based on our new sampling method. The classifier weight vector places large weights on the classifiers that correctly classify the largest number of the most informative instances. Instances with higher weights play a more important role in determining the weights of the classifiers.
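A minimal sketch of the two voting rules just described, assuming nothing beyond NumPy; with uniform weights the function reduces to simple majority voting.

```python
import numpy as np

def weighted_majority_vote(predictions, weights):
    """Combine class predictions (n_classifiers x n_samples) by weighted voting.
    With uniform weights this reduces to simple majority voting."""
    predictions = np.asarray(predictions)
    classes = np.unique(predictions)
    # Sum the weight of every classifier voting for each class, per sample.
    scores = np.array([
        (weights[:, None] * (predictions == c)).sum(axis=0) for c in classes
    ])
    return classes[np.argmax(scores, axis=0)]

# Three classifiers, four test samples, binary labels (toy numbers).
preds = [[1, 0, 1, 1],
         [0, 0, 1, 0],
         [1, 1, 0, 0]]
print(weighted_majority_vote(preds, np.array([1.0, 1.0, 1.0])))  # simple vote
print(weighted_majority_vote(preds, np.array([0.2, 0.2, 0.9])))  # trusts classifier 3
```

In the second call the large weight on the third classifier lets its decisions dominate, which is exactly the effect the classifier weight vector is meant to produce for classifiers that handle the most informative instances well.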
The aim is to fuse the outputs of multiple classifiers in the test phase using weights that are derived during training; this fusion of classifier outputs improves the classifiers' performance. Five classifiers are used: Support Vector Machine, Naïve Bayes, K-Nearest Neighbour, Logistic Regression and Neural Network. The common theme across these classifiers is that each attempts to minimize or maximize a single function. Our combination, in contrast, is computed from the local properties of the data: its main function is to search for an optimal set of weights, based on the results of majority voting of the classifiers on the training data, so as to maximize recognition ability, and then to apply those weights to the classifier outputs during testing.

5.2 Background

5.2.1 Ensemble Weighted Classifier

The discussion here revolves around five classifiers: Support Vector Machine, Naïve Bayes, K-Nearest Neighbour, Logistic Regression and Neural Network. Under plain majority voting, all these classifiers are weighted equally; consequently, if the classifiers predict an instance differently, the final decision can become arbitrary due to tied votes. Assuming the classifiers tend to classify the most informative instances, chosen by the new multi-view sampling method, correctly, then when the classifiers make dissimilar predictions for an unknown instance it is logical to grant more weight to the classifier whose predictions agree with the majority most often. Accurate evaluation of the most informative samples is therefore essential. The most informative samples can be thought of as those on which fewer classifiers make correct predictions. We therefore use the two weight vectors described above, namely a weight vector over classifiers and a weight vector over instances, and our suggested process has two distinct phases. The weights of the instances are proportional to the degree of informativeness of each instance. These must be taken into account when assigning weights to the classifiers; thus the weights for classifiers and the weights for instances depend on, and are linked with, each other. Through this cross-relationship between the performance of the classifiers and the informativeness of the instances, we can find optimal weights for both. The weights are found by an iterative procedure and are determined solely by a matrix relating the instances and the classifiers; we do not need to assume prior knowledge of the behaviour of the individual classifiers.

Suppose we have n instances and k classifiers in an ensemble. Let X be an n × k performance matrix whose entries indicate whether a classification is right (1) or wrong (0), with transpose X′. Let Jij be an i × j matrix consisting of 1's, for any dimensions i and j; let 1n and 1k be n × 1 and k × 1 vectors of 1's; and let Ik denote the k × k identity matrix. The procedure is as follows:

1. Set the initial instance weight vector Q0, which gives higher weights to instances corresponding to the rows of X with the fewest 1's (the most informative instances). The denominator simply normalizes the vector to unit norm.

2. Calculate a classifier weight vector Pm that assigns higher weights to the more accurate classifiers, after merging in the instance weights. Again, the denominator simply normalizes the vector to unit norm.

3. Update the instance weight vector Qm, which assigns higher weights to the most informative instances after merging in the classifier weight vector Pm; as before, the denominator simply normalizes the vector to unit norm.
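The update formulas are given above only in outline, so the following sketch is one plausible reading of the procedure, implemented as a normalized power iteration over the performance matrix X; the exact update rules and normalizations are assumptions of this sketch, not the thesis's formulas. The matrix used is the one from the worked example that follows.

```python
import numpy as np

def ensemble_weights(X, iters=50):
    """One plausible reading of the iterative P/Q procedure.
    X is the n x k performance matrix (1 = correct, 0 = wrong).
    Q weights instances, P weights classifiers; each step merges
    the other vector and renormalizes to unit length."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    # Q0: higher weight for rows with fewest 1's (most informative instances).
    q = k - X.sum(axis=1)
    q /= np.linalg.norm(q)
    for _ in range(iters):
        p = X.T @ q              # classifiers right on heavy instances gain weight
        p /= np.linalg.norm(p)
        q = X @ p                # instances handled by heavy classifiers gain weight
        q /= np.linalg.norm(q)
    return p, q

# Rows = 5 instances, columns = the 5 classifiers x1..x5 of the example below.
X = np.array([[0, 0, 0, 0, 0],
              [1, 1, 1, 1, 1],
              [1, 1, 1, 1, 0],
              [0, 1, 1, 1, 1],
              [0, 0, 0, 0, 1]])
p, q = ensemble_weights(X)
print(np.round(p, 3), np.round(q, 3))
```

Because the iteration couples the two vectors, a classifier's weight reflects which instances it gets right, not just how many, which is the qualitative behaviour described in the worked example.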
We provide a simple example illustrating the above algorithm. Suppose there are five classifiers and five instances in an ensemble. We define X = (x1 | x2 | x3 | x4 | x5), with x1 = (0,1,1,0,0)′, x2 = (0,1,1,1,0)′, x3 = (0,1,1,1,0)′, x4 = (0,1,1,1,0)′ and x5 = (0,1,0,1,1)′, where xi is the performance vector of the ith classifier; 1 represents a correct decision by a classifier and 0 a wrong decision. We obtain the normalized classifier weight vector P* = (0.114, 0.769, 0.433, 0.314, 0.824)′ and the normalized instance weight vector Q* = (0.032, 0.731, 0.439, 0.633, 0.453)′.

The classifier weights P* can be explained as follows. The accuracies of the classifiers are (0.75, 0.67, 0.5, 0.45, 0.75): although the first and the last classifiers have the same error rate, more weight is given to the fifth classifier (0.824) than to the first (0.114), because the former classified the most informative instances correctly; it is also the only classifier that made the correct decision for the fifth instance. The least classifier weight is given to the first classifier (0.114) because it misclassified the instances with higher weights and is the most inaccurate among the five.

Regarding Q*, the highest weight is given to the second instance, on which all the classifiers made correct decisions. The least weight (0.032) is given to the first instance, on which none of the classifiers made a correct decision. Although the first and the fifth instances have similar accuracy, their weights differ, due to the effect of P*: the first instance is misclassified by all the classifiers and therefore has the least weight, whereas the third instance is misclassified only by the most important classifier, the fifth one. When instances are equally informative, an instance on which the more highly weighted classifiers work better must receive a higher value in Q*.

5.2.2 Sampling the Most Informative Samples (Multi-View Sampling, MVS)

We propose a new method, the idea of which is to generate multiple views and to compare their results to identify results that agree. Each view is a subset of the data set in which certain attributes of all the data patterns are chosen and trained on; for instance, view 1 might consist of attributes 1-5 and view 2 of attributes 6-25. Once trained, the solutions from the views are compared with each other; if they agree, the corresponding examples are selected as informative instances. In addition, a scheme for ranking how informative they are is employed, and the instances that rank highly are selected as the most informative ones, to be used for training the classifiers and then for testing. In effect, this uses random feature selection to generate different views, trains on each view, identifies agreement between the resulting predictions, and then uses ranking to find the most informative instances. In this way, less labelled data can be used to achieve higher classification accuracy.
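A small sketch of the two-view agreement-and-ranking rule just outlined, ahead of its formal description below; the LogisticRegression view classifiers, the 5/20 attribute split and the synthetic data are stand-ins for illustration, not the thesis's Java implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data: 25 attributes; view 1 = attributes 0-4, view 2 = attributes 5-24
# (mirroring the 1-5 / 6-25 split used as an example in the text).
X, y = make_classification(n_samples=300, n_features=25, random_state=0)
labelled, unlabelled = np.arange(40), np.arange(40, 300)
v1, v2 = np.arange(0, 5), np.arange(5, 25)

c1 = LogisticRegression(max_iter=1000).fit(X[np.ix_(labelled, v1)], y[labelled])
c2 = LogisticRegression(max_iter=1000).fit(X[np.ix_(labelled, v2)], y[labelled])

p1 = c1.predict_proba(X[np.ix_(unlabelled, v1)])  # columns: P(class 0), P(class 1)
p2 = c2.predict_proba(X[np.ix_(unlabelled, v2)])
agree = p1.argmax(1) == p2.argmax(1)

# Ranking score: the larger of the average predicted probabilities per class.
score = np.maximum((p1[:, 1] + p2[:, 1]) / 2, (p1[:, 0] + p2[:, 0]) / 2)
score[~agree] = -np.inf                  # only agreed-on examples compete
most_informative = unlabelled[np.argsort(score)[::-1][:10]]
print(most_informative)
```

Examples on which the two views agree with high average confidence rise to the top of the ranking; these are the candidates handed to the domain expert for labelling.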
Let V1 and V2 be the two view classifiers learned from the labelled training data L, used to classify all the unlabelled examples U with both views. For each example in U, we train the redundant view classifiers by learning from the most informative labelled examples [177,178], and the view classifiers are then used to classify the most informative unlabelled examples. The unlabelled examples on whose classification the two view classifiers agree the most are then sampled. We use a ranking function to rank all the unlabelled instances according to the predictions of the view classifiers. The ranking score for an unlabelled instance $x_i$ is the larger of the average predicted probabilities, over the two view classifiers, for the positive and negative classes:

$$s(x_i) = \max\Big(\tfrac{1}{2}\big[P_{V_1}(+\,|\,x_i) + P_{V_2}(+\,|\,x_i)\big],\; \tfrac{1}{2}\big[P_{V_1}(-\,|\,x_i) + P_{V_2}(-\,|\,x_i)\big]\Big).$$

The scores generated result in a ranking in which the examples in the highest positions are the ones to which both view classifiers assign the same label with high confidence, which means that those are the most informative unlabelled examples.

5.3 Experimental Design

In our experiments, we use 20 publicly accessible binary data sets and two multiclass data sets. Table 5.1 summarizes their characteristics. Sixteen of them belong to the UCI Machine Learning Repository [105], five are from the UCI KDD Archive [157], and the last is the in-house Nottingham Tenovus Breast Cancer data set. To obtain a better measure of predictive accuracy, we compare the five classifiers using 10-fold cross-validation; the accuracy of a cross-validation is simply the average of its 10 estimates. The 10-fold cross-validation is repeated 100 times with different random partitions each time to give more stable estimates, and the mean of the 100 cross-validation accuracies is the final accuracy estimate for the dataset.

Multiclass data sets: we select two multiclass data sets. One, from the UCI repository, is the dermatology dataset, which contains 34 attributes (33 linear-valued and one nominal), 366 instances and six classes. The second is the Nottingham Tenovus Breast Cancer dataset, which contains three main clinical groups, Luminal, Basal and HER2, divided into six subgroups, for 1076 patients from the period 1986-1998, with immunohistochemical reactivity for 25 proteins of known relevance in breast cancer.

Support Vector Machine, K-Nearest Neighbour, Naïve Bayes, Logistic Regression and Neural Network are employed as the base learning algorithms. The generated classifiers are then combined to form an ensemble using our weighted voting system. We performed all the experiments using Java statistical packages. The SVM algorithm was implemented with the LIBSVM package, and we implemented the K-Nearest Neighbour algorithm available at [169,170]. We used LIBLINEAR to implement Logistic Regression [171]. Naïve Bayes was provided by Classifier4J, a Java library designed for classification, and the Neural Network was implemented from the Java library in [180].

A number of parameters must be decided upon when designing the NN, SVM and KNN. For the NN, the parameters include the number of neurons per layer (50) and the number of training iterations (100); among the more important parameters in terms of training and network capacity are the number of hidden neurons, the learning rate and the momentum parameter (0.1). In addition, kernels are used in the support vector machines to map the learning data (nonlinearly) into a higher-dimensional feature space, in which the computational power of the linear learning machine is increased.
The classifiers use RBF and polynomial kernels. The RBF kernel sets the regularization parameter to 200, while the polynomial kernels set it to 100. The SVM uses cost C = 10 and gamma = 0.1. For K-Nearest Neighbours (KNN) we select only one parameter, the number of neighbours K.

5.4 Experimental Results and Discussion

For comparison purposes, we implemented five algorithms. The results given in Tables 5.2, 5.3 and 5.4 indicate the performance of the five algorithms, as well as our Majority voting system, under different numbers of cross-validation folds without using any special sampling method, in other words before using multi-view sampling. The results in Tables 5.5, 5.6 and 5.7 indicate the performance of all the algorithms under different numbers of cross-validation folds using the new multi-view sampling (MVS) method, to evaluate the effect it has on the five algorithms and on the Majority voting system. We choose different numbers of folds to evaluate our method because a smaller number of folds provides fewer training examples, and sparse training examples generally produce inferior learners; the advantage of a small number of examples is training efficiency.

Table 5.1: Summary of the features of the data sets employed for assessment.

Data Set     Dimensionality  Labelled  Unlabelled  Total  #classes
Dermatology        35           220       146        366      6
NBDC               25           663       413       1076      6
Diabetes            8           268       500        768      2
Heart               9           120       150        270      2
WDBC               14           357       212        569      2
Austra             15           307       383        690      2
House              16           108       124        232      2
Vote               16           168       267        435      2
Vehicle            16           218       217        435      2
Hepatitis          19           123        32        155      2
Labor              26            37        20         57      2
Ethn               30          1310      1320       2630      2
Ionosphere         34           225       126        351      2
kr_vs_kp           40          1527      1669       3196      2
Isolet             51           300       300        600      2
Sonar              60           111        97        208      2
Colic              60           136       232        368      2
Credit-g           61           300       700       1000      2
BCI               117           200       200        400      2
Digital           241           734       766       1500      2
COIL2             241           750       750       1500      2
g241n             241           748       752       1500      2

From our results, SVM gives the best performance compared with the other algorithms when evaluating the algorithms without using any particular sampling method. For two-fold cross-validation, SVM achieves the best predictive accuracy for 16 out of 22 data sets. The performance of SVM is enhanced further with 5- and 10-fold cross-validation, giving the best performance for 18 and 20 data sets out of 22 respectively.
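Before turning to the per-table results, the evaluation protocol of section 5.3 (10-fold cross-validation repeated with fresh random partitions) can be sketched as follows. The thesis's experiments used Java packages; this scikit-learn version is only an illustration, with the WDBC stand-in dataset and the C = 10, gamma = 0.1 settings echoing the text and everything else assumed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the WDBC data set

# 10-fold CV repeated with different random partitions each time;
# increase n_repeats to 100 to match the protocol described in the text.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(SVC(C=10, gamma=0.1), X, y, cv=cv)

# Final estimate = mean over all repetitions of the 10-fold accuracies.
print(f"accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Averaging over many random partitions damps the variance of any single 10-fold split, which is why the repeated estimate is used as the final accuracy for each dataset.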
Table 5.2: Predictive accuracy of each algorithm under two-fold cross-validation compared with the Majority voting system (bold 1st, italic 2nd). [Accuracies of SVM, LR, KNN, NN, NB and the Majority voting system on the 22 data sets.]

Comparing all five algorithms, we can easily see that the Support Vector Machine (SVM) with two-fold cross-validation and multi-view sampling (MVS) gives the best performance on 12 data sets out of 22. In addition, SVM ranked as the second-best algorithm on the remaining data sets, except for the diabetes data set. For five-fold cross-validation, the performance of SVM was enhanced: it gave the best performance for 14 out of 22 data sets and ranked second best for the rest, again except for the diabetes dataset. Using 10-fold cross-validation added two more data sets, with SVM giving the best performance for 16 data sets, although the diabetes data set was still not assigned an appropriate rank.

Table 5.3: Predictive accuracy of each algorithm under five-fold cross-validation compared with the Majority voting system (bold 1st, italic 2nd). [Accuracies of SVM, LR, KNN, NN, NB and the Majority voting system on the 22 data sets.]

The Neural Network (NN) is not an appropriate option for multi-view sampling on almost all the data sets, and its performance is consistently worse than the other methods, regardless of whether the classification problems are binary or multiclass. Although Logistic Regression (LR) with MVS outperforms SVM with MVS quite often, its performance on binary-class data sets is unsatisfactory and almost always inferior to SVM.
The results from LR are surprisingly good, and they generally improve as we increase the number of cross-validation folds, whether we use MVS or not.

Table 5.4: Predictive accuracy of each algorithm under 10-fold cross-validation compared with the Majority voting system (bold 1st, italic 2nd). [Accuracies of SVM, LR, KNN, NN, NB and the Majority voting system on the 22 data sets.]

Naïve Bayes (NB), surprisingly and conversely, achieves the best predictive accuracy with two-fold multi-view sampling for four data sets. Its performance then decreases relative to the other algorithms, obtaining the best predictive accuracy for only one data set when using MVS with 5- and 10-fold cross-validation. From the results, it can be seen that MVS improves the accuracy of all the algorithms; the extent of the improvement differs from one algorithm to another and from one data set to another.

Comparing the different algorithms, the results again confirm that the Majority voting system produces considerably higher performance over all data sets than the other algorithms. This is very encouraging, because it shows that even when the base classifiers are incapable of estimating each instance's class probabilities accurately, our approach may still perform well: Majority voting relies on the variance of the base classifiers' probability estimates, not on the absolute probability values. The results also show that the Majority voting system benefits very clearly from the multi-view sampling method.
Table 5.5: Predictive accuracy of each algorithm under 2-fold cross-validation, comparing the Majority voting system with the multi-view sampling method (bold 1st, italic 2nd).

Data Set       SVM     LR      KNN     NN      NB      Majority
Dermatology    92.82   90.02   86.66   77.22   87.36   94.47
NTBC           90.91   88.12   86.50   80.32   85.43   94.22
Diabetes       80.56   88.26   80.52   84.14   88.72   88.12
Heart          86.20   80.02   76.66   87.23   83.48   87.57
WDBC           93.56   96.26   87.52   81.41   94.72   96.12
Austra         73.66   76.22   66.26   71.15   63.75   76.21
House          92.55   90.36   88.32   89.74   90.89   93.24
Vote           74.79   76.25   71.15   70.17   73.27   75.12
Vehicle        88.76   85.63   81.23   87.47   85.98   83.55
Hepatitis      94.43   91.42   90.24   92.67   89.90   93.23
Labor          67.82   62.46   61.62   63.74   69.88   70.42
Ethn           72.26   70.57   67.32   66.54   69.67   76.37
Ionosphere     93.23   93.24   89.92   91.43   92.59   94.52
kr_vs_kp       85.76   83.44   82.23   82.45   84.29   85.82
Isolet         83.86   86.54   80.12   81.64   82.19   85.37
Sonar          64.66   62.46   62.62   63.22   64.93   67.52
Colic          76.29   75.31   75.32   74.34   75.89   77.54
Credit-g       89.81   87.74   88.52   87.64   88.79   88.72
BCI            95.23   94.32   94.22   91.13   93.95   94.35
Digital        86.26   85.31   85.32   84.34   85.85   87.45
COIL2          89.66   88.66   88.56   86.48   87.99   90.85
g241n          92.24   91.43   92.22   93.33   91.59   94.69

Tables 5.5, 5.6 and 5.7 indicate that the results for the multiclass data sets are no worse than for the binary-class data sets, which shows that multi-view sampling for a multiclass data set is not more challenging than for a binary-class data set. The performance for both binary and multiclass problems shows significant improvement using MVS for all the data sets, compared to the results in Tables 5.2, 5.3 and 5.4, which evaluate the algorithms without any particular sampling method under 2-, 5- and 10-fold cross-validation. In addition, it can be seen from Table 5.8 that MVS improves the accuracy of the Majority voting system for almost all the data sets. Increasing the number of cross-validation folds increases the prediction accuracy of the Majority voting system for all the data sets when no sampling method is used, but this was not the case when we implemented MVS, where some data sets gave worse results as the number of folds increased.

For multi-view sampling selection, we observed that its performance is better than random selection almost all of the time, for all the data sets. Since the instance selection procedure uses a distribution-based measure instead of an uncertainty measure, this leads to the conclusion that instance distribution is more effective than uncertainty-based measures for finding the most "important" samples for labelling. We believe that the reason the two-fold results were less accurate than the others is this reliance on sample distributions to select "important" instances for labelling: sample distributions do not necessarily have a direct connection to whether a sample is "important" for labelling, even if the method does accurately capture the data distributions.

Figure 5.1: Performance of SVM, LR, KNN, NN, NB and Majority under 10-fold cross-validation with random sampling vs. multi-view sampling.
Table 5.6: Predictive accuracy of each algorithm under 5-fold cross-validation, comparing the Majority voting system with the multi-view sampling method (bold 1st, italic 2nd).

Data Set       SVM     LR      KNN     NN      NB      Majority
Dermatology    93.41   90.22   87.23   78.62   87.34   95.66
NTBC           91.33   88.84   86.56   80.30   85.71   94.47
Diabetes       84.58   87.39   85.29   87.44   81.23   85.44
Heart          88.11   80.22   77.45   85.62   81.74   86.96
WDBC           94.01   96.39   87.24   81.41   95.23   97.44
Austra         74.20   76.93   66.49   71.24   64.79   77.54
House          92.52   92.68   88.40   90.18   90.64   95.21
Vote           74.71   78.69   71.27   71.76   74.43   76.73
Vehicle        88.50   86.44   81.04   87.28   85.82   89.62
Hepatitis      93.21   92.45   90.54   92.48   90.73   94.25
Labor          68.64   66.53   62.34   64.98   69.53   71.54
Ethn           74.12   72.55   68.67   67.84   70.37   75.28
Ionosphere     94.34   93.25   91.74   91.55   91.45   95.65
kr_vs_kp       86.48   82.89   83.22   82.38   83.23   87.34
Isolet         84.16   86.59   79.35   80.58   81.23   86.76
Sonar          65.62   62.75   63.43   63.74   64.71   67.45
Colic          76.22   75.43   76.27   74.87   76.15   77.75
Credit-g       89.95   88.61   87.22   87.89   86.44   90.81
BCI            95.31   94.45   93.74   92.58   93.33   96.65
Digital        89.21   85.38   86.41   85.85   85.63   87.13
COIL2          90.42   89.79   88.54   86.45   87.98   90.34
g241n          93.45   91.43   92.57   93.34   91.84   95.71

When a data set experiences a significant change in class distribution, this immediately has an impact on classification accuracy. Of course, the actual accuracy relies not only on the class distributions, but also on the complexity of the decision surfaces. The results here indicate that changing class distributions is a challenge for active learning from data sets, especially for multiclass data sets.

Table 5.7: Predictive accuracy of each algorithm under 10-fold cross-validation, comparing the Majority voting system with the multi-view sampling method (bold 1st, italic 2nd).

Data Set       SVM     LR      KNN     NN      NB      Majority
Dermatology    93.65   91.99   87.52   78.90   89.55   95.78
NTBC           92.67   89.41   86.77   80.43   86.43   94.97
Diabetes       86.22   82.48   89.11   87.52   82.22   89.56
Heart          89.65   81.99   79.43   88.29   85.75   88.90
WDBC           94.27   96.50   88.21   81.52   95.42   97.56
Austra         74.34   75.62   68.71   71.66   65.84   77.23
House          95.77   92.88   89.78   91.25   91.36   95.46
Vote           75.64   80.38   74.51   72.62   74.39   77.65
Vehicle        90.41   86.69   81.56   88.73   84.51   91.82
Hepatitis      93.56   92.76   91.37   92.88   92.44   94.89
Labor          70.87   69.98   62.78   64.26   70.62   71.76
Ethn           75.97   73.56   67.96   67.51   71.62   77.54
Ionosphere     94.76   93.32   91.92   92.76   92.88   95.72
kr_vs_kp       86.97   83.48   83.65   84.42   85.23   87.46
Isolet         84.55   85.68   81.55   82.62   82.26   87.65
Sonar          65.68   63.80   63.28   63.93   65.12   67.64
Colic          77.47   75.56   76.76   75.47   76.52   77.89
Credit-g       89.66   88.85   87.75   87.92   88.42   90.22
BCI            95.36   94.66   94.92   94.66   93.87   96.37
Digital        87.31   86.64   86.66   85.22   86.17   89.91
COIL2          90.77   89.89   88.70   86.54   88.22   91.97
g241n          93.11   91.61   93.54   93.71   93.72   95.88

Table 5.8 summarizes the accuracy statistics from 30 runs, in which the mean, standard error, median, standard deviation, sample variance, minimum, maximum and confidence-level values were examined for eight data sets: Dermatology, NTBC, WDBC, Austra, kr_vs_kp, Ionosphere, COIL2 and g241n. We select the two multiclass data sets, and pairs of data sets of small, medium and large dimensionality respectively.

Table 5.8: Predictive accuracy for each number of cross-validation folds for the Majority voting system with random sampling, compared to the multi-view sampling method, using the whole data.
              Majority                      Majority MVS
Data Set      2-fold   5-fold   10-fold    2-fold   5-fold   10-fold
Dermatology   88.17    91.46    92.08      94.74    95.66    95.78
NTBC          89.62    91.97    92.97      94.22    94.47    94.97
Diabetes      83.62    83.04    87.66      88.12    85.44    89.56
Heart         83.37    84.86    87.30      87.57    86.96    88.90
WDBC          91.82    95.24    95.86      96.12    97.44    97.56
Austra        71.81    75.24    75.43      76.21    77.54    76.23
House         89.04    93.11    93.86      93.22    95.21    95.46
Vote          69.22    72.93    74.35      75.12    76.44    77.56
Vehicle       78.05    86.22    88.92      83.55    89.62    91.82
Hepatitis     88.48    91.60    92.74      93.23    94.25    94.89
Labor         65.22    68.44    69.16      70.42    71.54    71.76
Ethn          71.27    72.28    75.04      76.37    75.28    77.54
Ionosphere    90.02    93.25    93.82      94.52    95.65    95.72
kr_vs_kp      79.52    83.14    83.76      85.82    87.34    87.46
Isolet        80.91    84.40    85.79      85.52    86.76    87.65
Sonar         62.92    64.95    65.64      67.37    67.45    67.64
Colic         72.24    74.55    75.19      77.54    77.75    77.89
Credit-g      83.12    87.31    87.22      88.72    90.81    90.22
BCI           89.35    93.75    93.97      94.35    96.65    96.37
Digital       83.20    84.98    88.26      87.45    87.13    89.91
COIL2         86.73    88.32    90.45      90.85    90.34    91.97
g241n         88.39    91.51    92.18      94.69    95.71    95.88

Figure 5.1 illustrates the performance of Majority vs. Majority MVS on the 22 data sets across different dimensionality sizes. Our empirical experience suggests that the performance of Majority does not change significantly across successive cross-validation settings and gradually levels out; the greatest accuracy growth was for the Vehicle data set, at around 12%. Figure 5.1 also shows the results for the other methods. For SVM vs. SVM MVS, across various dimensionality sizes, the improvement in prediction accuracy was between 2% and 4% for all the data sets, while the accuracy improvement for LR vs. LR MVS was higher, at 2-6%; the Vote dataset showed the greatest accuracy improvement when comparing the results of using multi-view sampling with LR, at 6%. NN vs. NN MVS showed the largest number of enhanced datasets, at four, and the highest improvement in prediction accuracy was 6%.

Figure 5.2: Performance of SVM, LR, KNN, NN, NB and Majority under varying numbers of folds with the multi-view sampling method.

Figure 5.1 additionally illustrates the performance of KNN vs. KNN MVS on the 22 data sets with different dimensionality sizes. The experiments indicate that, in most cases, the performance of KNN changes with successive cross-validation settings, and the improvement when using MVS ranged from 3% to 6% for all the data sets. Lastly, from Figure 5.1 we can see the performance of NB vs. NB MVS on all the data sets across different dimensionality sizes. The results show that, in all cases, the performance of NB increases when using MVS compared to the results before implementing MVS. NB performs very well for some data sets, with a prediction accuracy growth of 8%, while in most cases the improvement was between 2% and 5%.

Figure 5.2 shows a graph of accuracy against different numbers of cross-validation folds using SVM, LR, KNN, NN, NB and the Majority weighted ensemble classifier on different data sets with the multi-view sampling method. The graph indicates that, for datasets like Austra, BCI and g241n, the number of folds affected the prediction accuracy in an unstable way. For the Majority voting system, the best prediction accuracy was with 5-fold cross-validation, with 10-fold second (large dimensionality). For the Ethn dataset, the different numbers of folds did not have any major effect on the Majority voting system.
On the other hand, for the Digital, COIL2 and diabetes data sets, changing the number of folds affects the prediction accuracy in an unstable way, although 10-fold cross-validation is highly recommended because the accuracy improves from 87% to around 90%. Lastly, the prediction accuracy on the dermatology, vehicle and hepatitis data sets increases steadily as the number of folds increases (small dimensionality).

Figure 5.3 plots the mean accuracy (as a percentage) of SVM, LR, KNN, NN, NB and Majority for the above eight data sets with MVS implemented. We observed that Majority produced a higher average classification accuracy than the other methods; for some datasets, Majority achieved a maximum accuracy of 97%. SVM also performed well, achieving an average accuracy of 93%. Majority achieved high accuracy for both binary and multiclass problems. The error bars show the average positive and negative error for the different methods.

The results in Figure 5.4 show that the Majority voting system consistently outperforms the other methods across all data sets. We used the paired t-test to compare the mean differences between Majority and the other methods for the dermatology data set (P < 0.001, paired t-test; Fig. 5.4). We found no significant mean difference between Majority and SVM, LR, KNN, NN and NB, with mean differences of 0.0029, 0.0016, 0.0137, 0.00136 and 0.00424 respectively. The probability is likewise high for NTBC, WDBC, kr_vs_kp and g241n, so those means are not significantly different. On the other hand, the Austra, Ionosphere and COIL2 data sets show a highly significant mean difference (P < 0.001, paired t-test).

Figure 5.3: Error bars and performance of SVM, LR, KNN, NN, NB and Majority for different dimensionality sizes with the multi-view sampling method (d - number of dimensions).

This observation asserts that, for concept-drifting data streams with constantly changing distributions across cross-validation runs and continually evolving decision surfaces, Majority can adaptively label instances and build a superior classifier ensemble. The advantage of Majority can be observed across different types of data streams (binary-class and multiclass), because it takes the distinct advantage of each classifier or method.
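A paired t-test of the kind reported above can be reproduced along the following lines; the per-run accuracies here are synthetic toy numbers, not the thesis's measurements.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-run accuracies over 30 repeated CV runs:
# the Majority ensemble vs. a single SVM on the same splits.
rng = np.random.default_rng(1)
acc_majority = 0.92 + 0.01 * rng.standard_normal(30)
acc_svm = 0.90 + 0.01 * rng.standard_normal(30)

# Paired t-test: the same CV partitions underlie both methods' accuracies,
# so the per-run differences (not the raw scores) are what is tested.
t_stat, p_value = ttest_rel(acc_majority, acc_svm)
mean_diff = np.mean(acc_majority - acc_svm)
print(f"mean difference = {mean_diff:.4f}, t = {t_stat:.2f}, p = {p_value:.4g}")
```

The pairing matters: because both methods are evaluated on identical splits, the split-to-split variation cancels out, giving a more sensitive comparison than an unpaired test would.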
5.4.1 Runtime Performance Study

In Figure 5.4, we report the system runtime in response to different numbers of cross-validation folds; the x-axis denotes the number of folds and the y-axis denotes the average system runtime. Comparing all five methods, the runtimes of KNN and NB are very close to each other, with NB slightly more efficient than KNN. Not surprisingly, LR demonstrates itself to be the most efficient method, due to the nature of its simple random selection. NB and KNN are in the second tier, because instance labelling involves a recursive labelling and retraining process. Similarly to NB and KNN, NN also requires a recursive instance selection process plus a number of local training runs to build classifiers for each fold; consequently, NN is less efficient than KNN and NB. The proposed SVM method is time-consuming, mainly because the calculation of the ensemble variance and the weight updating require additional scans for each fold. On average, when the number of folds is 500 or less, the runtime of SVM is about 2 to 4 times longer than its peers, and the larger the number of folds, the more expensive the SVM method is, because the weight updating and instance labelling require more time.

Figure 5.4: System runtime with respect to different numbers of cross-validation folds.

5.5 Summary

We proposed a new research topic on active learning with different fold data sizes, where the data volumes continuously increase and the data concepts dynamically evolve, and where the objective is to label a portion of the data so as to form a classifier ensemble with the highest accuracy in predicting future samples. We studied the connection between a classifier ensemble's variance and its prediction accuracy, and showed that reducing a classifier ensemble's variance is equivalent to maximizing its prediction accuracy. We derived an optimal Majority weighting method to assign weight values to the base classifiers, such that they form an ensemble with maximum prediction accuracy. Following these derivations, we proposed a Majority voting system for active learning from different fold data sizes, where the key is to label the instances responsible for a large variance value in the classifier ensemble. Our intuition was that providing class labels for such instances significantly reduces the variance of the ensemble classifier, and therefore maximizes the prediction accuracy. Experimental results on synthetic and real-world data showed that the dynamic nature of data streams poses significant challenges to existing active learning algorithms, especially when dealing with multiclass problems, even though applying random sampling globally or locally results in good performance in practice. The proposed Majority voting system and active learning framework address these challenges by using a variance measure to guide the instance classification process, followed by voting weights to ensure that the instance labelling process can classify future samples into the most appropriate class.

Chapter 6

Examination of TSVM Algorithm Classification Accuracy with Feature Selection in Comparison with the GLAD Algorithm

The chapter is organized as follows. Section 6.2 presents the literature background on Support Vector Machines and Transductive Support Vector Machines, and then deals with recursive feature elimination. Section 6.3 provides a clear elaboration of the TSVM algorithm together with RFE; the GLAD algorithm is then summarized in a manner that enables us to compare the estimation precision of the two algorithms. Section 6.4 provides an analysed comparison of the outcomes of the empirical execution of both algorithms, TSVM and GLAD. Finally, a brief conclusion is provided in section 6.5.

6.1 Introduction

Data mining techniques have traditionally been used to extract hidden predictive information in many diverse contexts, usually from datasets containing thousands of examples. Recently, the growth in biology, medical science and DNA analysis has led to the accumulation of vast amounts of biomedical data that require in-depth analysis. After years of research and development, many data mining, machine learning and statistical analysis systems and tools have become available for use in bio-data exploration and bio-data analysis.
Consequently, this chapter examines a relatively new data mining technique called semi-supervised support vector machines (S3VMs), also known as transductive support vector machines [172], which lie between supervised learning with fully labelled training data and unsupervised learning without any labelled training data [173]. In this method we use both labelled and unlabelled samples for training: a small amount of labelled data together with a large amount of unlabelled data. This chapter proposes to observe the performance of transductive SVMs combined with a feature selection method called recursive feature elimination (RFE), used to select molecular descriptors for the transductive support vector machine (TSVM). We used the LIBSVM open-source machine learning libraries; LIBSVM implements support vector machines (SVMs), supporting classification and regression [116]. We modified some of the libraries to extend them for use with TSVM along with RFE.

6.2 Background

6.2.1 Support Vector Machines

A support vector machine (SVM) is a discriminative classifier defined by a separating hyperplane: given labelled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new instances. The SVM algorithm finds the hyperplane that gives the training instances the largest minimum distance; in SVM theory, this distance is called the margin. The optimal separating hyperplane therefore maximizes the margin of the training data. A separating line should not be allowed to pass too near the training points, since it would then be sensitive to noise and would not generalize properly; as a result, a line passing as far as possible from all the points is preferable.

Support vector machines, a supervised machine learning method, are useful in many fields of biomedical research, such as the assessment of microarray expression data [174], the detection of remote protein homologies [175] and the identification of translation initiation sites [176]. SVMs can not only properly categorize the objects they were trained on, but also recognize and classify previously unseen instances [177]. The SVM method relies on training samples that specify in advance which data should be grouped together [174].

Figure 6.1: Multiple possible margins vs. the SVM maximum-margin optimal hyperplane separation.

6.2.2 Transductive Support Vector Machines

Transductive learning is a method closely connected with semi-supervised learning, where semi-supervised learning is intermediate between supervised and unsupervised learning. Vladimir Vapnik introduced semi-supervised learning for support vector machines in the 1990s. This was motivated by his view that transduction (TSVM) is preferable to induction (SVM), since induction needs to solve a more general problem (inferring a function) before being able to solve the more specific one (computing outputs for new cases) [178,179]. TSVM seeks a hyperplane and a labelling of the unlabelled examples such that the SVM objective function is minimized, subject to the constraint that a fraction of the unlabelled data be classified as positive. SVM margin maximization in the presence of unlabelled examples can be interpreted as an implementation of the cluster assumption. The transductive support vector machine attempts to maximize the margin of the hyperplane classifier between two classes using the labelled training data, while at the same time forcing the hyperplane to be far away from the unlabelled samples.
TSVM seems to be a perfect semi-supervised learning algorithm because it combines the regularization of the support vector machine with a direct implementation of the clustering assumption [180].

In semi-supervised learning, a labelled sample $(x_i, y_i)_{i=1}^{n}$ is observed together with an independent unlabelled sample $(x_j^*)_{j=n+1}^{n+m}$, where $x_i$ is a $d$-dimensional input and $y_i \in \{-1, +1\}$ its label; the pairs $(x_i, y_i)$ are independently and identically distributed according to an unknown distribution $P(x, y)$, and the $x_j^*$ are distributed according to the marginal distribution $P(x)$. TSVM is based on the idea of maximizing the separation between labelled and unlabelled data (see Vapnik [178]). It deals with the problem

$$\min_{f,\; y_{n+1}^*,\dots,y_{n+m}^*} \; C \sum_{i=1}^{n} L\big(y_i f(x_i)\big) \;+\; C^* \sum_{j=n+1}^{n+m} L\big(y_j^* f(x_j^*)\big) \;+\; J(f),$$

where $f$ represents a decision function in $\mathcal{F}$, a candidate function class, $L$ indicates the decision loss, and $J(f)$ is the regularizer controlling the geometric separation margin. In the linear case, $f(x) = w \cdot x + b$ and $J(f) = \frac{1}{2}\lVert w \rVert^2$; in the nonlinear case, $K$ is a kernel satisfying Mercer's condition, and $J(f) = \frac{1}{2}\lVert h \rVert_K^2$ for $f = h + b$, with $\lVert \cdot \rVert_K$ being a proper norm (see [176,177] for more details). Minimizing this objective with respect to $(f, y^*)$ is non-convex; it can be solved through integer programming and is known to be NP-hard [182]. To solve it, Joachims [181] proposed an efficient local search algorithm that is the basis of SVMLight. This algorithm may fail to deliver a good local solution, resulting in worse performance of TSVM than SVM; this aspect is confirmed by our numerical results, as well as by empirical studies in the literature. Chapelle and Zien [176] aimed to correct this problem by approximating the objective by a smooth convex problem solved through gradient descent, while [183] used an extended bundle method to treat the non-convexity and non-smoothness of the cost function.

TSVMs [180] enhance the generalization accuracy of SVMs [182] by using unlabelled data. Both TSVMs and SVMs aim to maximize the margin of the hyperplane classifier based on labelled training data, but TSVM is distinguished by pushing the hyperplane away from the unlabelled data. One way of justifying this algorithm in the context of semi-supervised learning is that one is finding a decision boundary that lies in a region of low density, implementing the so-called cluster assumption (see e.g. [176]). In this framework, if one believes the underlying distribution of the two classes is such that there is a "gap" or low-density region between them, then TSVMs can help, because they select a rule with exactly those properties. Vapnik [178] has a different interpretation of the success of TSVMs, rooted in the idea that transduction (labelling a test set) is inherently easier than induction (learning a general rule). In either case, it seems clear experimentally that algorithms such as TSVMs can give considerable improvement in generalization over SVMs if the number of labelled points is small and the number of unlabelled points is large. Unfortunately, TSVM algorithms (like other semi-supervised approaches) are often unable to deal with a large number of unlabelled examples. The first implementation of TSVM appeared in [184], using an integer programming method that is intractable for large problems. Joachims [181] then proposed a combinatorial approach known as SVMLight-TSVM, which is practical for a few thousand examples. A sequential optimization procedure was introduced in [182] that could potentially scale well, although their largest experiment used only 1000 examples; however, their method was for the linear case only, and used a special kind of SVM with a 1-norm regularizer in order to retain linearity.
Finally, Chapelle and Zien [176] proposed a primal method which turned out to show improved generalization performance over the previous approaches, but which still scales as (L+U)³, where L and U are the numbers of labelled and unlabelled examples; this method also stores the entire (L+U) × (L+U) kernel matrix in memory. Other methods [177,179] transform the non-convex transductive problem into a convex semi-definite programming problem that scales as (L+U)⁴ or worse.

Figure 6.2: Separating hyperplane for semi-supervised data.

6.2.3 Recursive Feature Elimination

An enormous data set size negatively affects the performance of most prediction algorithms. Among the many techniques proposed for minimizing the feature set, we highlight recursive feature elimination (RFE). The proposal of RFE is to begin with all the features, select the least useful features and remove them, and repeat until some stopping condition is reached. Detecting the best feature subset is costly, so RFE reduces the difficulty of feature selection by being greedy. RFE worked well in the gene expression studies of Guyon et al. [183].

Recursive feature elimination is a multivariate mapping approach that allows us to detect (sparse) discriminative patterns in the dataset that are not limited to the local neighbourhood of a feature, i.e. features may be spread across the whole sample. The basic principle of RFE is to include initially all the features of a large region, and to gradually exclude features that do not contribute to discriminating patterns from different classes. Whether a feature in the current feature set contributes enough to be kept is determined by the weight value of the feature resulting from training a classifier (e.g. an SVM) with the current set of features. In order to increase the likelihood that the "best" features are selected, feature elimination progresses gradually and includes cross-validation steps. In each feature elimination step, a small proportion of features is discarded, until a core set of features remains with the highest discriminative power. Note that using an SVM to separate "good" from "bad" features implements a multivariate feature selection strategy, as opposed to univariate feature selection, which uses single-feature F or t values from a statistical analysis. Nonetheless, an initial feature reduction step using a univariate method might be useful if one wants to restrict RFE to a subset of "active" features.

The implementation of RFE includes two nested levels of cross-validation to maximize the chance of keeping the "best" features. At the first level, the training data is partitioned and RFE is applied a number of times. In each application, one of the folds is put aside for testing generalization performance, while the remainder together form the training data for the RFE procedure, i.e. each of the RFEs uses a different "split" of the data. When all the separate RFEs have been performed, the final generalization performance is determined as the average of the performance across the NF different splits, separately for each reduction level. The final set of features (for a specific reduction level) is obtained by merging the features with the best weights (highest absolute values) across all splits. The training data from each first-level split is used for a separate RFE procedure, while the split with the test data is set aside and used only for performance testing.
The training data is then partitioned again into L sub-splits, and an SVM is trained on the L splits in order to obtain robust weight rankings for feature elimination. A feature's ranking score is obtained by averaging the weights of that feature across the different second-level splits. The absolute values of these scores are then ranked, and the features with the lowest ranks are removed. The "surviving" features are then used for the next RFE iteration, which starts again with a (new) partitioning of the data into L splits. The whole procedure is repeated R times, until the desired number of features has been reached. As described above, the RFE level that produces the highest generalization performance across all first-level splits is finally selected, and that level's set of features is determined by merging the best features of the respective first-level splits.

The support vector machine based recursive feature elimination (RFE-SVM) approach [173] is a commonly used method to select features, as well as for subsequent classification, especially in the scope of biological data. In each iteration, a linear SVM is trained, followed by the removal of one or more "bad" features from further consideration; the goodness of a feature is determined by the absolute value of the corresponding weight used in the SVM. The features remaining after a number of iterations are deemed to be the most useful for discrimination, and can be used to provide insights into the given data. A similar feature selection strategy was used in the author unmasking approach, proposed for the task of authorship verification [172] (a sub-area within the natural language processing field): instead of excluding the worst features, the best features are iteratively dropped. Recently, it has been observed experimentally on two microarray datasets that using very low values of the regularization constant C can enhance the effectiveness of RFE-SVM [185]. Instead of continually retraining SVMs within the usual iterations, one can rely on the limit C → 0; this limit can be computed exactly using a centroid-based classifier. Moreover, unlike in RFE-SVM, in this limit removing a number of features has no influence on the weights of the remaining features; consequently, the need for multiple recursions is obviated, resulting in considerable computational savings.

6.2.4 Genetic Algorithms

Genetic algorithms are one of the best ways to solve a problem about which little is known. They are very general algorithms and so will work in any search space. All one needs to know is what the solution should be able to do well, and a genetic algorithm will be able to create a high-quality solution. Genetic algorithms use the principles of selection and evolution to produce several solutions to a given problem. They tend to thrive in environments with a very large set of candidate solutions and a search space that is uneven, with many hills and valleys. Although genetic algorithms will do reasonably well in any environment, they are often greatly outclassed by more situation-specific algorithms in simpler search spaces; one must therefore keep in mind that genetic algorithms are not always the best choice. They can sometimes take quite a while to run and are therefore not always feasible for real-time use. They are, however, one of the most powerful methods with which to (relatively) quickly create high-quality solutions to a problem.
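As a concrete, minimal illustration of a GA used as a feature selection wrapper (the role GLAD gives it in section 6.3.4 below), the following sketch evolves binary feature masks; the population size, mutation rate and the LDA-based fitness are all assumptions of this sketch, not details of GLAD.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def ga_feature_selection(X, y, pop_size=20, n_gen=15, p_mut=0.02, seed=0):
    """Minimal GA wrapper: individuals are binary feature masks, fitness is
    3-fold CV accuracy of an LDA classifier on the masked features."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pop = rng.random((pop_size, d)) < 0.1          # sparse initial masks

    def fitness(mask):
        if not mask.any():
            return 0.0
        return cross_val_score(LinearDiscriminantAnalysis(),
                               X[:, mask], y, cv=3).mean()

    for _ in range(n_gen):
        fit = np.array([fitness(m) for m in pop])
        # Tournament selection of parents.
        parents = pop[[max(rng.choice(pop_size, 2, replace=False),
                           key=lambda i: fit[i]) for _ in range(pop_size)]]
        # Uniform crossover between consecutive parents, then bit-flip mutation.
        cross = rng.random((pop_size, d)) < 0.5
        children = np.where(cross, parents, np.roll(parents, 1, axis=0))
        children ^= rng.random((pop_size, d)) < p_mut
        children[0] = pop[np.argmax(fit)]          # elitism: keep the best mask
        pop = children

    fit = np.array([fitness(m) for m in pop])
    return pop[np.argmax(fit)]

X, y = make_classification(n_samples=120, n_features=30, n_informative=5,
                           random_state=0)
print("selected features:", np.flatnonzero(ga_feature_selection(X, y)))
```

The wrapper character is the key point: the fitness function retrains and re-evaluates a classifier for every candidate subset, which is expensive but lets the search respond to feature interactions that univariate filters miss.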
6.3 Methods

This section describes TSVM-RFE, motivated by the problem of classifying biomedical data. The goal is to examine classifier accuracy and classification errors using the transductive support vector machine method. We set out to determine whether this method is an effective model when combined with recursive feature elimination (RFE), compared with another algorithm called Genetic Learning Across Datasets (GLAD).

6.3.1 Support Vector Machines

SVMs aim at creating a classifier with large margins between the samples of two different classes, while the training error is minimized. We employ a set of $d$-dimensional training samples $x_i$, labelled by $y_i \in \{-1, +1\}$, together with their images under the kernel-induced feature map $\Phi$. The SVM has the following primal form:

$$\min_{w,\, b,\, \xi} \; \frac{1}{2}\lVert w \rVert_p^p + C \sum_{i=1}^{n} \xi_i \quad \text{subject to } y_i\big(w \cdot \Phi(x_i) + b\big) \ge 1 - \xi_i, \;\; \xi_i \ge 0.$$

The SVM predictor for a sample $x$,

$$f(x) = \operatorname{sign}\big(w \cdot \Phi(x) + b\big),$$

is settled by the inner product between the weight vector $w$ and the mapped vector $\Phi(x)$, plus the constant $b$.

Figure 6.3: Maximum-margin separating hyperplane for transductive SVM (semi-supervised data).

The predictor corresponds to a separating hyperplane in the mapped feature space. The prediction for each training sample is connected with a violation term $\xi_i$; $C$ is a user-specified constant that manages the penalty on these violation terms. The parameter $p$ specifies the type of norm of $w$ to be evaluated: it is usually set to 1 or 2, forming the 1-norm or the 2-norm SVM respectively. The 1-norm and 2-norm TSVMs have been discussed in [181,184].

6.3.2 Transductive Support Vector Machines

This chapter introduces the extension of the SVM method to the transductive SVM; the 2-norm is adopted for the TSVM. The standard setting can be illustrated as

$$\min_{w,\, b,\, \xi,\, \xi^*,\, y^*} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{K} \xi_j^* \quad \text{subject to } y_i\big(w \cdot \Phi(x_i) + b\big) \ge 1 - \xi_i, \;\; y_j^*\big(w \cdot \Phi(x_j^*) + b\big) \ge 1 - \xi_j^*,$$

where each $y_j^*$ represents the unknown label of $x_j^*$, one of the $K$ unlabelled samples. In contrast with the SVM, the TSVM formula involves the unlabelled data through the violation terms $\xi_j^*$ incurred by predicting each unlabelled sample; the penalty on these violation terms is controlled by the new constant $C^*$, while $C$ concerns the labelled samples only. Precisely solving the transductive problem requires searching all the potential assignments of $y_1^*, \dots, y_K^*$ and evaluating the corresponding objectives, which is usually intractable for big data sets. It is worth mentioning the 2-norm TSVM implemented in SVMLight [178,186].

6.3.3 Recursive Feature Elimination

Recursive feature elimination (RFE) has the advantage of decreasing the number of redundant and recursive features, and reduces the difficulty of feature selection by being greedy. SVM recursive feature elimination (SVM-RFE) is an application of RFE that uses the weight magnitude as the ranking criterion. We present below an outline of the algorithm in the linear case, using the SVM training formulation mentioned earlier.

SVM-RFE algorithm:

Inputs: training examples X0 = [x1, x2, ..., xn]′ and class labels y = [y1, y2, ..., yn]′.
Initialize: subset of surviving features s = [1, 2, ..., d]; feature ranked list r = [ ].
Repeat until s is empty:
1. Restrict the training examples to the surviving feature indices: X = X0(:, s).
2. Train the classifier: SVM-train(X, y).
3. Compute the weight vector w, of dimension length(s).
4. Compute the ranking criteria c_i = (w_i)² for all i.
5. Find the feature with the smallest ranking criterion: f = argmin_i c_i.
6. Update the feature ranked list: r = [s(f), r].
7. Eliminate the feature with the smallest ranking criterion, removing s(f) from s.
Output: the feature ranked list r.

As mentioned before, the algorithm can be generalized to remove more than one feature per step for speed reasons.
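A runnable counterpart of the SVM-RFE outline above, using scikit-learn's RFE (which likewise ranks features by squared weight and eliminates the lowest-ranked ones), might look as follows. Note that this is plain inductive SVM-RFE on synthetic data, not the thesis's transductive TSVM-RFE, and for brevity the selection here is not nested inside cross-validation as the full protocol of section 6.2.3 requires.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for a gene-expression matrix: 100 samples, 500 features,
# only 10 informative (toy data, not one of the thesis's datasets).
X, y = make_classification(n_samples=100, n_features=500, n_informative=10,
                           random_state=0)

# SVM-RFE: repeatedly train a linear SVM and drop the lowest-weight features,
# removing 10% of the remaining features per iteration, down to 60 survivors.
svm = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svm, n_features_to_select=60, step=0.1).fit(X, y)
X_sel = rfe.transform(X)

print("selected feature indices:", rfe.get_support(indices=True)[:10], "...")
print("5-fold CV accuracy:", cross_val_score(svm, X_sel, y, cv=5).mean())
```

The step parameter plays the role of the "more than one feature per step" generalization noted above; removing a proportion rather than a single feature per iteration makes the procedure tractable on microarray-scale feature counts.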
Extending SVM feature selection techniques to transductive feature selection is straightforward. Specifically, we can produce TSVM-RFE by iteratively eliminating features whose weights are calculated from TSVM models. The following steps illustrate how TSVM-RFE evolves from TSVM:

1. Pre-process the data and compute filtering scores, optionally further normalizing the data. This approach first filters some features based on scores such as Pearson correlation coefficients.

2. Initialize the feature indicator as an all-one vector.

3. Threshold the indicator: set its small entries to zero according to a proportion/threshold, and the remaining non-zero entries to 1.

4. Obtain a (sub-optimal) TSVM on the features indicated by the non-zero entries, as evaluated by cross-validation accuracy.

5. In accordance with RFE, estimate the weighting of each feature from the model obtained in step 4, by comparing the model on the given samples with the model on the samples with that feature eliminated; for a linear kernel the feature weights are identical to the components of the SVM weight vector, and the estimation suggested in [160] is easier to compute.

6. Go back to step 3 unless an accepted number of features/iterations has been reached. Output the final predictor and the features indicated by large values of the indicator vector.

Step 3 comprises the selection of a proportion/number of features according to a threshold cutting the indicator vector; for the filtering scores and the RFE method, the indicator is changed to a binary vector, which has the effect of pruning or deactivating some features. The threshold is usually chosen to prune a (fixed) number or proportion of features at each iteration. The value of the remaining features is then measured by the optimality of the TSVM model obtained in step 4. We apply cross-validation accuracy as the performance measure of the TSVM algorithm: for a subset of features selected by choosing a threshold value, we extend the model search over the free parameters of the TSVM and choose the parameter set which results in the highest cross-validation accuracy.

6.3.4 Genetic Learning Across Datasets (GLAD)

The GLAD algorithm is a distinctive semi-supervised learning algorithm, applied as a wrapper method for feature selection. The GA is implemented to generate a population of related feature subsets. The labelled and the unlabelled data samples are scored separately: Linear Discriminant Analysis (LDA) and K-means (K = 2) were used as the learning and clustering algorithms for these two forms of data respectively [182]. A distinctive two-term scoring function scores the labelled and unlabelled data samples independently, and the overall score is calculated as a weighted average of the two terms. The labelled-data score is the typical leave-one-out cross-validation accuracy on the labelled training samples. The unlabelled-data score consists of two terms: a cluster separation term, computed from the cluster centroids and the data samples in each cluster, and a steady-ratio term, comparing the ratio of data in each cluster against the expected ratio, over the number of clusters.

6.4 Experiments and Results

This section empirically presents the outcomes that help us to evaluate the accuracy of the classification patterns introduced earlier.

6.4.1 Datasets

Leukaemia (AML-ALL): involves 7129 genes and detects two different forms of leukaemia: Acute Myeloblastic Leukaemia (AML), 25 samples, and Acute Lymphoblastic Leukaemia (ALL), 47 samples [187].
6.4 Experiments and Results

This section presents the empirical results used to evaluate the accuracy of the classification methods introduced earlier.

6.4.1 Datasets

Leukaemia (AML-ALL): 7129 genes; distinguishes two different forms of leukaemia, Acute Myeloblastic Leukaemia (AML), 25 samples, and Acute Lymphoblastic Leukaemia (ALL), 47 samples [187].

Lymphoma (DLBCL): 7129 genes; 58 samples of Diffuse Large B-Cell Lymphoma (DLBCL) and 19 samples of Follicular Lymphoma (FL) [188].

Chronic Myeloid Leukaemia (CML): 30 samples (18 severe emphysema, 12 mild or no emphysema) measured over a set of 22,283 human genes [189].

6.4.2 TSVM Recursive Feature Elimination (TSVM-RFE) Results

Leukaemia (AML-ALL). The results for the leukaemia ALL/AML data set are summarized in Figure 6.4. TSVM-RFE gives the smallest error of 3.68%, and consistently smaller errors compared to SVM-RFE at 3.97% for 30, 40, . . . , 70 genes. Interestingly, in our experiments both methods give the lowest error when 60 genes are used. This provides a reasonable suggestion for the number of relevant genes that should be used for the leukaemia data.

Lymphoma (DLBCL). The results for the lymphoma (DLBCL) data set are summarized in Figure 6.4. TSVM-RFE gives the smallest error of 3.89%, and considerably smaller errors compared to SVM-RFE at 4.72% for 30, 40, . . . , 70 genes. The TSVM methods give the lowest error with 60 genes, while the SVM methods give the lowest error at 50 genes with 4.72%, compared to 4.97% with 60 genes. This suggests the number of relevant genes that should be used for the lymphoma (DLBCL) data.

Leukaemia (CML). Lastly, the TSVM-RFE and SVM-RFE results for the leukaemia (CML) data set are likewise provided in Figure 6.4. TSVM-RFE gives the smallest error of 6.52%, and markedly smaller errors in contrast to the 7.85% of SVM-RFE for 30, 40, . . . , 70 genes. Both algorithms show the lowest error when 50 genes are used. This represents a sensible number of relevant genes for the leukaemia (CML) data.

6.4.3 Comparing the TSVM Algorithm with the GLAD Algorithm

When implementing Genetic Learning Across Datasets, we conduct three experiments using the previous data sets, each addressing a different cancer diagnostic problem: with ALL/AML the aim is to discriminate the diagnoses; with the CML data set it is to predict the response to imatinib; with DLBCL it is to forecast the outcome.

In the AML-ALL data set, the accuracy using only labelled samples is 73.46%; combining unlabelled and labelled samples increases this to 75.14%. Adding unlabelled samples increases the accuracy from 59.34% to 65.57% in the CML experiments. Adding the unlabelled samples to the labelled samples for DLBCL raises the accuracy from 49.67% to 55.79%.

This shows that the GLAD algorithm outperforms SVM-RFE and TSVM-RFE in some cases when we make use of the labelled data only, without gene selection. Table 6.1 shows, for example, that with the AML-ALL data set the GLAD algorithm gives 73.46%, while the SVM-RFE and TSVM-RFE accuracies were 52.8% and 55.6% respectively. However, the results for the second data set (DLBCL) show that the GLAD algorithm's accuracy was 49.67% against 55.8% for SVM-RFE. Furthermore, for the third data set (CML), SVM-RFE gives 59.02% without gene selection, while GLAD gives 59.34%. On the other hand, TSVM exceeds GLAD when we make use of unlabelled data along with labelled data and gene selection. The results are shown in Table 6.1.
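The significance claim reported in the caption of Figure 6.4 below rests on a 5-fold cross-validated paired t-test. The following is a minimal sketch of such a test, assuming scipy is available; the per-fold error values here are illustrative numbers, not the thesis measurements.

```python
import numpy as np
from scipy import stats

# Per-fold test errors (%) of the two methods on the same five folds
# (illustrative values only).
err_svm_rfe  = np.array([4.1, 4.8, 4.6, 5.0, 5.1])
err_tsvm_rfe = np.array([3.6, 4.0, 3.8, 4.1, 4.0])

t_stat, p_value = stats.ttest_rel(err_svm_rfe, err_tsvm_rfe)  # paired t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, "
      f"significant at 95%: {p_value < 0.05}")
```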
[Figure 6.4: Testing error (%) versus number of genes (30, 40, 50, 60, 70) for SVM-RFE and TSVM-RFE on the three data sets: DLBCL, CML and AML/ALL. The 5-fold cross-validated paired t-test shows the differences between SVM-RFE and TSVM-RFE when comparing the two methods at the 95% confidence level. (Linear kernel, C = 1)]

Table 6.1: Accuracy obtained with SVM-RFE, TSVM-RFE and GLAD ("labelled" marks runs using labelled data only; gene counts of 7129 match the dataset descriptions in Section 6.4.1)

Dataset | Setting                                      | SVM-RFE | TSVM-RFE | GLAD
ALL-AML | Without selection (7129 genes, 72 samples)   | 52.8%   | 55.6%    | 73.46% (labelled)
ALL-AML | With selection (60 genes, 72 samples)        | 96.03%  | 96.32%   | 75.14%
DLBCL   | Without selection (7129 genes, 77 samples)   | 55.8%   | 57.1%    | 49.67% (labelled)
DLBCL   | With selection (60 genes, 77 samples)        | 95.03%  | 96.11%   | 55.79%
CML     | Without selection (22,283 genes, 30 samples) | 59.02%  | 72.6%    | 59.34% (labelled)
CML     | With selection (50 genes, 30 samples)        | 92.15%  | 93.48%   | 65.57%

For instance, with the CML data set using all the samples without gene selection, TSVM gives 72.6%, but with gene selection based on RFE, TSVM reaches 93.48%, while GLAD gives 65.57% with gene selection. In the same vein, the accuracy for the DLBCL data set reaches 96.11% with TSVM and gene selection, whereas the GLAD algorithm gives 55.79%. In addition, TSVM on the AML-ALL data set with gene selection gives 96.32%, while the GLAD algorithm gives 75.14%. This means that TSVM performs better than the GLAD algorithm, and with gene selection the result is superior.

6.5 Discussion of Results

From the results on the three datasets, we made the following observations:

1. TSVM-RFE improves performance more significantly for the CML dataset than for the ALL-AML and DLBCL datasets. For example, our TSVM-RFE method leads to a relative improvement of more than 12% over the other methods, SVM-RFE and GLAD, on the CML dataset, while it yields only a 3-4% improvement on the ALL-AML and DLBCL datasets.

2. Our proposed algorithm outperforms SVM-RFE and GLAD more markedly when we select genes for the CML dataset than for the other datasets, compared to when we use all the genes. On the other hand, GLAD, for example, is significantly better than SVM-RFE and TSVM-RFE on the ALL-AML dataset; in contrast, the improvement over SVM-RFE is modest for the DLBCL dataset.

To determine the reasons, we conducted additional experiments in which we studied individual genes against their scores when each was regarded as a ranking model on its own. More than 10% of the genes can improve the performance; in this case, feature selection can help to remove noisy features and thus improve the performance of the final ranking. In contrast, many features in the ALL-AML, DLBCL and CML datasets are not effective, so the benefit of removing noisy features is great.

Based on the discussion above, we conclude that if the effects of features vary largely and there are redundant features, our method can work very well when applied in practice. It is also worth noting that the newly developed method is meant to be applicable to large-scale datasets. In this chapter, several situations were presented in which the TSVM classifier was outperformed by a more general algorithm that does not assume any particular distribution of the analysed samples.
In general, according to our experience, the new method outperforms the SVM classifier.

6.6 Summary

This chapter has investigated topics in semi-supervised learning, by comparing two different semi-supervised methods on previously classified cancer data sets. The results, on average, for semi-supervised learning surpass supervised learning. However, it was shown that the GLAD algorithm outperforms SVM-RFE when we make use of the labelled data only. On the other hand, TSVM-RFE exceeds GLAD when unlabelled data is used along with labelled data; it performs much better with gene selection, and performs well even if the labelled data set is small.

TSVM still has some drawbacks. When the size of the labelled data set is increased, the results do not improve correspondingly. Moreover, when the number of unlabelled samples is extremely small, the computational effort is extremely high, because a small unlabelled set requires more computation. Like almost all semi-supervised learning algorithms, TSVM shows some instability, and results differ across runs; this happens because unlabelled samples may be wrongly labelled during the learning process. If, in future, we find a way to select and eliminate such unlabelled samples first, then we can limit the number of newly labelled samples used for re-training the classifiers.

Chapter 7
Conclusions and Future Work

This thesis has been devoted to the core problem of pattern classification and its applications. Three stages have been studied: pre-processing of features, classification, and model selection for a classifier.

7.1 Contributions

Improving TSVM vs. SVM: Accordance-Based Sample Selection

In this chapter, supervised and semi-supervised learning were applied over several case studies. In particular, two different classifiers, the Support Vector Machine and the Transductive Support Vector Machine, were reviewed and used over the 'in-class' patients of the Abd El-Rehim et al. [95] breast cancer dataset in order to validate the classification derived and characterised in earlier studies. Surprisingly, the TSVM classifiers performed quite well, especially when only the 50 'most important' samples were considered. This happened even though one of the underlying assumptions of the TSVM was strongly violated by the data: in fact, the samples did not follow a normal distribution. An accordance-based sampling version of the TSVM was then developed and validated over known data sets. These latter results were presented together with their comparison against the Support Vector Machine approach. Using accordance-based sampling improved the accuracy on both data sets, and the results show that the improvement for TSVM was greater than that for SVM, as shown in Chapter 3.

Automatic Feature and Sample Ranking for the SVM Classifier

In this chapter, we proposed an optimization method for feature and sample selection in ranking. The contributions of this chapter include the following. We discussed the differences between classification and ranking, and made clear the limitations of the existing feature and sample selection methods when applied to ranking. In addition, we proposed a novel method to select features and samples for ranking, in which the problem is formalized as an optimization issue.
In this method, we maximize the total importance scores of the selected features and samples and, at the same time, minimize the total similarity scores between the features as well as the samples. We evaluated the proposed method using two datasets, with two ranking models, and in terms of a number of evaluation measures. Experimental results validated the effectiveness and efficiency of the proposed method.

Ensemble Weighted Classifiers with Accordance-Based Sampling

Chapter 5 proposed a new research topic on active learning with different fold data sizes, where the data volumes continuously increase, data concepts dynamically evolve, and the objective is to label a portion of the data so as to form a classifier ensemble with the highest accuracy in predicting future samples. This chapter also studied the connection between a classifier ensemble's variance and its prediction accuracy, and showed that minimizing a classifier ensemble's variance is equivalent to maximizing its accuracy rate. We derived an optimal majority weighting method to assign weight values to the base classifiers, such that they form an ensemble with maximum prediction accuracy. Following these derivations, we proposed a majority voting system for active learning from different fold data sizes, where the key is to label the instances responsible for a large variance value in the classifier ensemble. Our intuition was that providing class labels for such instances can significantly reduce the variance of the ensemble classifier, and therefore maximize the prediction accuracy rates; a minimal sketch of this weighted-voting idea is given at the end of this section.

Experimental results on synthetic and real-world data showed that the dynamic nature of data streams poses significant challenges to existing active learning algorithms, especially when dealing with multiclass problems; simply applying uncertainty sampling globally or locally rarely achieves good performance in practice. The proposed majority voting system and active learning framework address these challenges by using a variance measure to guide the instance selection process, followed by the voting weights, to ensure that the instance labelling process classifies future samples into the most appropriate class.

Examination of TSVM Algorithm Classification Accuracy with Feature Selection in Comparison with the GLAD Algorithm

This chapter has investigated topics in semi-supervised learning. The result in Chapter 6 was achieved by comparing two different methods for semi-supervised learning using previously classified cancer data sets. The results, on average, for semi-supervised learning surpass supervised learning. However, it is shown that the GLAD algorithm outperforms SVM-RFE when we make use of the labelled data only. On the other hand, TSVM-RFE exceeds GLAD when unlabelled data is used along with labelled data; it performs much better with gene selection and performs well even if the labelled data set is small. TSVM still has some drawbacks: when the size of the labelled dataset is increased, the results do not improve accordingly, and when the number of unlabelled samples is extremely small, the computational effort is extremely high, because a small unlabelled set requires more computation time. Like almost all semi-supervised learning algorithms, TSVM shows some instability, and results differ across runs; this happens because unlabelled samples may be wrongly labelled during the learning process. If, in future, we find a way to select and eliminate such unlabelled samples first, then we can limit the number of newly labelled samples used for re-training the classifiers.
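As a concrete illustration of the ensemble contribution above, the following is a minimal sketch of weighted majority voting with variance-guided instance selection, assuming scikit-learn is available. The accuracy-based weights and the vote-variance query rule are simple stand-ins for the optimal weighting derived in Chapter 5, and the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_informative=5, random_state=0)
X_train, y_train, X_pool = X[:150], y[:150], X[150:]   # X_pool plays "unlabelled"

models = [LogisticRegression(max_iter=500).fit(X_train, y_train),
          GaussianNB().fit(X_train, y_train),
          KNeighborsClassifier().fit(X_train, y_train)]
# Weight each base classifier by its cross-validated accuracy (a simple
# stand-in for the optimal weights derived in the thesis).
weights = np.array([cross_val_score(m, X_train, y_train, cv=5).mean()
                    for m in models])

votes = np.array([m.predict(X_pool) for m in models])  # (n_models, n_pool)

# Weighted majority vote over the two classes {0, 1}.
class1_mass = (weights[:, None] * votes).sum(axis=0)
prediction = (class1_mass > weights.sum() / 2).astype(int)

# Vote variance measures base-classifier disagreement; the highest-variance
# pool instances are the ones worth sending to the oracle for labelling.
variance = votes.var(axis=0)
query = np.argsort(-variance)[:10]
print(prediction[:10], query)
```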
7.2 Future Work

The following points could guide the continuation of the present investigation:

Kernel parameter optimisation: the classification performance of comparatively weak features has been improved by using our proposed kernel-based classifier in this work. However, it is difficult to pre-determine the associated optimal kernel parameter, as the robustness is not satisfactory around those values of the kernel parameters that provide high classification accuracy. The formulation and optimization of the kernel parameter is an important issue to explore.

Investigation of 'not-classified' patients: by a form of consensus clustering, six breast cancer classes have been defined in this work. However, as highlighted in Chapter 4, not all the available patients were classified into one of these six groups. A very important future project will be to define a proper classification for those patients, in order to help doctors give them more accurate prognoses, as well as to target patients with more specialised treatments. This represents a big challenge for future work, as finding the proper cure for each patient will decrease hospital costs as well as the patient's pain.

Getting new patients: one of the strategies that may be followed to achieve the previous goal is to increase the number of available patients. This could be done by retrieving medical records or by performing the same biological analyses again in order to recover some missing data. It would be interesting to investigate whether it is feasible to combine different sources of data by merging studies from different research groups in which data have been collected using very similar protocols.

Globally optimum feature set: all wrapper or filter feature selection methods try to find a set of features that performs better than others under certain conditions, but they cannot guarantee that the selected feature set is the globally optimum solution. Searching for a globally optimum set of features has always been a computationally infeasible task: the task is to select one combination out of the $2^D$ possible combinations of $D$ features. To address this problem, we can formulate feature selection as a polynomial mixed 0-1 problem, where the 0-1 variables correspond to the absence or presence of the corresponding features in the selected feature set. Any linear optimiser can then solve this mixed 0-1 linear problem to a global solution. Potentially, this means that we can search for a globally optimum feature set at the running cost of a linear optimization technique. This is a huge improvement in computational cost, and the solution is also globally optimum. These techniques need to be further investigated and tested on several available datasets to verify their effectiveness. (A toy sketch of this formulation is given at the end of this list.)

Importance weighting and variance preservation: all of the importance weighting algorithms discussed in this thesis try to minimise the distance between the distributions. However, these methods do not guarantee to preserve data variance properties. In this regard, kernel PCA does guarantee to preserve the maximum variance of the data. There is a need to develop transfer learning algorithms that minimise the distance between the distributions while preserving the data variance.

Online processing: the weighting algorithms have been developed for problems where a large amount of unlabelled testing data is available and importance weights can be calculated using limited training data. However, these algorithms have been developed for offline processes, which means they cannot be applied to runtime problems where processing time is of the essence. If we wish to make these algorithms integral to SER systems, they have to be modified to work online; this is where weighting algorithms lag behind the CMN and MLLR algorithms. If we consider the example of kernel mean matching, it is a quadratic optimization problem, and we know that SVMs are also solved as a quadratic optimization problem.

Larger datasets: more complex classifiers designed with a small dataset may appear to perform better than simpler classifiers because of overtraining, but they may generalize poorly to unknown cases. Many studies have shown that sometimes even thousands of cases are not enough to ensure generalization. This is particularly true when using powerful nonlinear techniques with multiple stages. As many of the experiments carried out in this study employed small medical datasets, further studies should be conducted with 10x larger sets, such as the Digital Database for Screening Mammography (DDSM).
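To make the mixed 0-1 idea above concrete, here is a toy sketch that selects exactly k features by maximizing total importance minus pairwise similarity (the criterion from the ranking contribution in Section 7.1), with the product $z_i z_j$ linearised as $y_{ij} \ge z_i + z_j - 1$. It assumes the PuLP package and its bundled CBC solver; the importance and similarity scores are illustrative inputs, not thesis data.

```python
import numpy as np
import pulp

def select_features(importance, similarity, k, lam=1.0):
    """Pick exactly k features maximising importance minus pairwise similarity."""
    d = len(importance)
    prob = pulp.LpProblem("FeatureSelection", pulp.LpMaximize)
    z = [pulp.LpVariable(f"z{j}", cat="Binary") for j in range(d)]
    # y_ij linearises the product z_i * z_j; it need not be declared integer,
    # since the objective pushes it down to its lower bound.
    y = {(i, j): pulp.LpVariable(f"y{i}_{j}", lowBound=0, upBound=1)
         for i in range(d) for j in range(i + 1, d)}
    prob += (pulp.lpSum(importance[j] * z[j] for j in range(d))
             - lam * pulp.lpSum(similarity[i, j] * y[i, j] for (i, j) in y))
    for (i, j) in y:                       # force y_ij = 1 when both are chosen
        prob += y[i, j] >= z[i] + z[j] - 1
    prob += pulp.lpSum(z) == k             # select exactly k features
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [j for j in range(d) if z[j].value() > 0.5]

rng = np.random.default_rng(2)
imp = rng.random(10)                       # illustrative importance scores
sim = np.abs(rng.random((10, 10)))         # illustrative similarity matrix
print(select_features(imp, sim, k=4))
```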
7.3 Dissemination

The research work reported in this thesis has been used in various conference and journal papers, as well as in several internal and international talks. What follows is a list of publications derived from this work.

7.3.1 Journal papers

In submission: Hala Helmi, Jonathan M. Garibaldi, Ensemble weighted classifiers with accordance-based sampling, submitted to Data Mining and Knowledge Discovery, 2013.

7.3.2 Conference papers

Hala Helmi, Jonathan M. Garibaldi, Improving SVM and TSVM with Multiclass Accordance Sampling for Breast Cancer, in Proceedings of the 14th International Conference on Bioinformatics and Computational Biology (BIOCOMP 2013), Las Vegas, USA, July 2013.

Hala Helmi, Daphne Teck Ching Lai, Jonathan M. Garibaldi, Semi-Supervised Techniques in Breast Cancer Classification: A Comparison between Transductive SVM and Semi-Supervised FCM, in Proceedings of the 12th Annual Workshop on Computational Intelligence (UKCI), Heriot-Watt University, Edinburgh, 2012.

Hala Helmi, Jonathan M. Garibaldi, Improving SVM with Accordance Sampling in Breast Cancer Classification, in Proceedings of the International Conference on Bioinformatics and Computational Biology (BIOCOMP BG 2012), Varna, Bulgaria, 2012.

Hala Helmi, Jon M. Garibaldi and Uwe Aickelin, Examining the Classification Accuracy of TSVMs with Feature Selection in Comparison with the GLAD Algorithm, in Proceedings of UKCI 2011, the 11th Annual Workshop on Computational Intelligence, Manchester, UK, 2011.

References

[1] J. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998.
[2] Nils J. Nilsson. Introduction to Machine Learning. Artificial Intelligence Laboratory, Department of Computer Science, Stanford University, 2005. Draft of incomplete notes.
[3] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, NY, 1986.
[4] Lluis Marquez. Machine learning and natural language processing. Technical Report LSI-00-45-R, Departament de Llenguatges i Sistemes Informatics (LSI), Universitat Politecnica de Catalunya (UPC), 2000.
[5] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic curve. Radiology, 143: 29-36, 1982.
[6] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proc. of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, 1992.
[7] O. L. Mangasarian. Breast cancer diagnosis and prognosis via linear programming. Cancer Letter, 43(4): 570-577, 1995.
[8] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12): 2639-2664, 2004.
[9] K. O. Ladly, C. B. Frank, G. D. Bell, Y. T. Zhang, and R. M. Rangayyan. The effect of external loads and cyclic loading on normal patellofemoral joint signals. Special Issue on Biomedical Engineering, Defence Science Journal (India), 43: 201-210, July 1993.
[10] R. M. Haralick, K. Shanmugam, and I. Dinstein. Textural features for image classification. IEEE Trans. on Systems, Man, and Cybernetics, SMC-3(6): 610-622, 1973.
[11] R. M. Rangayyan. Biomedical Signal Analysis - A Case-Study Approach. IEEE and Wiley, New York, NY, 2002.
[12] D. M. Abd El-Rehim, G. Ball, S. E. Pinder, E. Rakha, C. Paish, J. F. Robertson, D. Macmillan, R. W. Blamey, and I. O. Ellis. High-throughput protein expression analysis using tissue microarray technology of a large well-characterised series identifies biologically distinct classes of breast cancer confirming recent cDNA expression analyses. Int. Journal of Cancer, 116: 340-350, 2005.
[13] T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
[14] T. M. Mitchell. The Discipline and Future of Machine Learning. Machine Learning Department, School of Computer Science, Carnegie Mellon University, 2007.
[15] P. Day and A. K. Nandi. Robust text-independent speaker verification using genetic programming. IEEE Trans. on Audio, Speech and Language Processing, 15: 285-295, 2007.
[16] Lluis Marquez. Machine learning and natural language processing. Technical Report LSI-00-45-R, Departament de Llenguatges i Sistemes Informatics (LSI), Universitat Politecnica de Catalunya (UPC), 2000.
[17] R. Navigli, P. Velardi, and A. Gangemi. Ontology learning and its application to automated terminology translation. Intelligent Systems, IEEE, 18(1): 1541-1672, 2003.
[18] C.-L. Liu, S. Jaeger, and M. Nakagawa. Online recognition of Chinese characters: the state-of-the-art. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26(2): 198-213, 2004.
[19] A. D. Parkins and A. K. Nandi. Genetic programming techniques for hand written digit recognition. Signal Processing, 84(12): 2345-2365, 2004.
[20] A. D. Parkins and A. K. Nandi. Method for calculating first-order derivative based feature saliency information in a trained neural network and its application to handwritten digit recognition. IEE Proceedings - Part VIS, 152(2): 137-147, 2005.
[21] R. Plamondon and S. N. Srihari. Online and off-line handwriting recognition: a comprehensive survey. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(1): 63-84, 2000.
[22] Y. Jian, A. F. Frangi, J.-Y. Yang, D. Zhang, and Z. Jin. KPCA plus LDA: a complete kernel Fisher discriminant framework for feature extraction and recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(2): 230-244, 2005.
[23] S. Yang, J. Song, H. Rajamani, C. Taewon, Y. Zhang, and R. Mooney. Fast and effective worm fingerprinting via machine learning. In Proc. of the Int'l Conf. on Autonomic Computing, ICAC, pages 311-313, TX, US, 2006.
[24] S. Oyama, T. Kokubo, and T. Ishida. Domain-specific web search with keyword spices. IEEE Trans. on Knowledge and Data Engineering, 16(1): 17-27, 2004.
[25] S. J. Vaughan-Nichols. Researchers make web searches more intelligent. Computer, 39(12): 16-18, 2006.
[26] H. Alto, R. M. Rangayyan, and J. E. L. Desautels. Content-based retrieval and analysis of mammographic masses. Journal of Electronic Imaging, 14(2): 1-17, 2005. Article no. 023026.
[27] T. C. S. S. Andre and R. M. Rangayyan. Classification of breast masses in mammograms using neural networks with shape, edge sharpness, and texture features. Journal of Electronic Imaging, 15(1): 1-10, 2006. Article no. 013010.
[28] H. Guo and A. K. Nandi. Breast cancer diagnosis using genetic programming generated feature. Pattern Recognition, 39: 980-987, 2006.
[29] P. J. Lisboa and A. F. G. Taktak. The use of artificial neural networks in decision support in cancer: a systematic review. Neural Networks, 19(4): 408-415, 2006.
[30] T. Mu and A. K. Nandi. Breast cancer detection from FNA using SVM with different parameter tuning systems and SOM-RBF classifier. Journal of the Franklin Institute, 344(3-4): 285-311, 2007.
[31] T. Mu, A. K. Nandi, and R. M. Rangayyan. Classification of breast masses via nonlinear transformation of features based on a kernel matrix. Medical and Biological Engineering and Computing, 45(8): 769-780, 2007.
[32] R. J. Nandi, A. K. Nandi, R. M. Rangayyan, and D. Scutt. Classification of breast masses in mammograms using genetic programming and feature selection. Medical and Biological Engineering and Computing, 44(8): 693-694, 2006.
[33] W. H. Wolberg, W. N. Street, and O. L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letter, 77: 163-171, 1994.
[34] P. Bertone and M. Gerstein. Integrative data mining: the new direction in bioinformatics. Engineering in Medicine and Biology Magazine, IEEE, 20(4): 33-40, 2001.
[35] H. Hae-Jin, P. Yi, R. Harrison, and P. C. Tai. Improved protein secondary structure prediction using support vector machine with a new encoding scheme and an advanced tertiary classifier. IEEE Trans. on NanoBioscience, 3(4): 265-271, 2004.
[36] S. Winters-Hilt, M. Landry, M. Akeson, M. Tanase, I. Amin, A. Coombs, E. Morales, J. Millet, C. Baribault, and S. Sendamangalam. Cheminformatics methods for novel nanopore analysis of HIV DNA termini. BMC Bioinformatics, 7 Suppl 2: S22, 2006.
[37] Z. Huang, H. Chen, C.-J. Hsu, W.-H. Chen, and S. Wu. Credit rating analysis with support vector machines and neural networks: a market comparative study. Decision Support Systems, 37(4): 543-558, 2004.
[38] P. D. Yoo, M. H. Kim, and T. Jan. Machine learning techniques and use of event information for stock market prediction: A survey and evaluation. In Proc. of the Int'l Conf. on Computational Intelligence for Modelling Control and Automation, CIMCA, and the Int'l Conf. on Intelligent Agents, Web Technologies and Internet Commerce, IAWTIC, volume 2, pages 835-841, Vienna, Austria, 2005.
[39] S. Handley. Predicting whether or not a nucleic acid sequence is an E. coli promoter region using genetic programming. In Proc. of the 1st Int'l Symposium on Intelligence in Neural and Biological Systems, INBS, pages 122-127, Herndon, VA, 1995.
[40] H. Tong-Cheng and D. You-Dong. Generic object recognition via integrating distinct features with SVM. In Proc. of the Int'l Conf. on Machine Learning and Cybernetics, pages 3897-3902, Dalian, 2006.
[41] W. Xu, A. K. Nandi, and J. Zhang. Novel fuzzy reinforced learning vector quantization algorithm and its application in image compression. IEE Proceedings - Part VIS, 150(5): 292-298, 2003.
[42] M. Lent. Game Smarts. Computer, 40(4): 99-101, 2007.
[43] K. O. Stanley, B. D. Bryant, and R. Miikkulainen. Real-time neuroevolution in the NERO video game. IEEE Trans. on Evolutionary Computation, 9(6): 653-668, 2005.
[44] N. Kohl and P. Stone. Machine learning for fast quadrupedal locomotion. 2004.
[45] H. Guo, L. B. Jack, and A. K. Nandi. Feature generation using genetic programming with application to fault classification. IEEE Trans. on Systems, Man, and Cybernetics, B: Cybernetics, 35(1): 89-99, 2005.
[46] L. B. Jack and A. K. Nandi. Genetic algorithms for feature selection in machine condition monitoring with vibration signals. IEE Proc. - Vision, Image and Signal Processing, 147(3): 205-212, 2000.
[47] L. B. Jack and A. K. Nandi. Fault detection using support vector machines and artificial neural networks, augmented by genetic algorithms. Mechanical Systems and Signal Processing, 16(2-3): 373-390, 2002.
[48] M. L. D. Wong, L. B. Jack, and A. K. Nandi. Modified self-organising map for automated novelty detection applied to vibration signal monitoring. Mechanical Systems and Signal Processing, 20(3): 593-610, 2006.
[49] A. C. McCormick and A. K. Nandi. Real time classification of rotating shaft loading conditions using artificial neural networks. IEEE Trans. on Neural Networks, 8(3): 748-757, 1997.
[50] A. Rojas and A. K. Nandi. Practical scheme for fast detection and classification of rolling-element bearing faults using support vector machines. Mechanical Systems and Signal Processing, 20(7): 1523-1536, 2006.
[51] L. Zhang, L. B. Jack, and A. K. Nandi. Fault detection using genetic programming. Mechanical Systems and Signal Processing, 19: 271-289, 2005.
[52] L. Zhang and A. K. Nandi. Fault classification using genetic programming. Mechanical Systems and Signal Processing, 21: 1273-1284, 2007.
[53] G. Hinton and T. J. Sejnowski. Unsupervised Learning and Map Formation: Foundations of Neural Computation. MIT Press, Cambridge, MA, 1999.
[54] S. Kotsiantis and P. Pintelas. Recent advances in clustering: A brief survey. WSEAS Trans. on Information Science and Applications, 1(1): 73-81, 2004.
[55] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
[56] N. Dean, T. B. Murphy, and G. Downey. Updating classification rules with unlabelled data with applications in food authenticity studies. Journal of the Royal Statistical Society, Series C, 55(1): 1-14, 2006.
[57] B. Sahiner, N. Petrick, H. P. Chan, L. M. Hadjiiski, C. Paramagul, M. A. Helvie, and M. N. Gurcan. Computer-aided characterization of mammographic masses: Accuracy of mass segmentation and its effects on characterization. IEEE Trans. Medical Imaging, 20(12): 1275-1284, 2001.
[58] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[59] W. Xu, A. K. Nandi, and J. Zhang. Novel fuzzy reinforced learning vector quantization algorithm and its application in image compression. IEE Proceedings - Part VIS, 150(5): 292-298, 2003.
[60] W. Xu, A. K. Nandi, J. Zhang, and K. G. Evans. Novel vector quantiser design using reinforced learning as a pre-process. Signal Processing, 85(7): 1315-1333, 2005.
[61] V. N. Vapnik. Statistical Learning Theory, pages 339-371. New York: Wiley, 1998.
[62] V. Tresp. A Bayesian committee machine. Neural Computation, 12: 2719-2741, 2000.
[63] R. Caruana. Multitask learning: A knowledge-based source of inductive bias. Machine Learning, 28: 41-75, 1997.
[64] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, London, UK, 1995.
[65] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2): 179-188, 1936.
[66] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, New York, NY, 2nd edition, 2001.
[67] J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. Applied Linear Statistical Models. Irwin, Chicago, IL, 4th edition, 1990.
[68] S. M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, 1993. Chapter 7.
[69] P. Domingos and M. J. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3): 103-130, 1997.
[70] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, NY, 2nd edition, 1984.
[71] S. S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, London, UK, 1999.
[72] D. S. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems, 2: 321-355, 1988.
[73] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9): 1464-1480, 1990.
[74] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.
[75] Y. Chen and J. Z. Wang. Support vector learning for fuzzy rule-based classification systems. IEEE Trans. on Fuzzy Systems, 11(6): 716-728, 2003.
[76] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K. Müller. Fisher discriminant analysis with kernels. In Proc. of the IEEE Neural Networks for Signal Processing Workshop, pages 41-48, 1999.
[77] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3): 273-297, 1995.
[78] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, UK, 2000.
[79] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proc. of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, 1992.
[80] B. Schölkopf, A. J. Smola, R. Williamson, and P. Bartlett. New support vector algorithms. Neural Computation, 12: 1207-1245, May 2000.
[81] B. Cady and M. Chung. Mammographic screening: No longer controversial. American Journal of Clinical Oncology, 28(1): 1-4, 2005.
[82] G. Fung and O. L. Mangasarian. Proximal support vector machine classifiers. In Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 77-86, San Francisco, CA, 2001.
[83] G. Fung and O. L. Mangasarian. Multicategory proximal support vector machine classifiers. Machine Learning, 59: 77-97, May 2005.
[84] D. Agarwal. Shrinkage estimator generalizations of proximal support vector machines. In Proc. of the 8th Int'l Conf. on Knowledge Discovery and Data Mining, pages 173-182, Edmonton, Alberta, Canada, 2002.
[85] A. Tveit and H. Engum. Parallelization of the incremental proximal support vector machine classifier using a heap-based tree topology. In Workshop on Parallel and Distributed Computing for Machine Learning (in conjunction with ECML 2003 and PKDD 2003), Cavtat/Dubrovnik, Croatia, 2003.
[86] S. K. Pal, S. Bandyopadhyay, and S. Biswas. Fuzzy proximal support vector classification via generalized eigenvalues. In Proc. of the 1st Int'l Conf. on Pattern Recognition and Machine Intelligence, pages 360-363, Kolkata, India, 2005.
[87] O. L. Mangasarian and E. W. Wild. Multisurface proximal support vector machine classification via generalized eigenvalues. IEEE Trans. on Pattern Analysis and Machine Intelligence, 28: 69-74, January 2006.
[88] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. John Wiley and Sons, New York, NY, 1977.
[89] Andrew R. Webb. Statistical Pattern Recognition. John Wiley and Sons, Ltd., 2nd edition, 2002.
[90] S. Bleha and D. Gillespie. Computer user identification using the mean and the median as features. In Proc. of the IEEE Int'l Conf. on Systems, Man, and Cybernetics, pages 4379-4381, San Diego, CA, 1998.
[91] H. Lin and A. N. Venetsanopoulos. A weighted minimum distance classifier for pattern recognition. In Proc. of the 6th Canadian Conf. on Electrical and Computer Engineering, pages 904-907, Vancouver, BC, Canada, 1993.
[92] D. Zhang, S. Chen, and Z. Zhou. Learning the kernel parameters in kernel minimum distance classifier. Pattern Recognition, 39(1): 133-135, 2006.
[93] P. Somervuo and T. Kohonen. Self-organizing maps and learning vector quantization for feature sequences. Neural Processing Letters, 10(2): 151-159, 1999.
[94] C. Chou, C. Lin, Y. Liu, and F. Chang. A prototype classification method and its use in a hybrid solution for multiclass pattern recognition. Pattern Recognition, 39(4): 624-634, 2006.
[95] K. P. Bennett and E. Parrado-Hernandez. The interplay of optimization and machine learning research. Journal of Machine Learning Research, 7(Jul): 1265-1281, 2006.
[96] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5: 27-72, 2004.
[97] C. S. Ong, A. Smola, and R. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6: 1045-1071, 2005.
[98] Y. Zhang, S. Burer, and W. N. Street. Ensemble pruning via semi-definite programming. Journal of Machine Learning Research, 7: 1315-1338, 2006.
[99] P. F. Felzenszwalb and D. McAllester. The generalized A* architecture. Journal of Artificial Intelligence Research, 29: 153-190, 2007.
[100] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley, 1989.
[101] A. J. Chipperfield, P. J. Fleming, H. Pohlheim, and C. M. Fonseca. Genetic Algorithm Toolbox for Use with MATLAB (version 1.2). University of Sheffield, Sheffield, UK, 1994.
[102] D. Windridge and J. Kittler. Combined classifier optimisation via feature selection. In Advances in Pattern Recognition: Proc. of the Joint IAPR Int'l Workshops, SSPR 2000 and SPR 2000, volume 1876, pages 687-695, Alicante, Spain, 2000.
[103] H. Vafaie and K. A. De Jong. Improving the performance of rule induction system using genetic algorithms. In Proc. of the 1st Int'l Workshop on Multistrategy Learning, pages 305-315, Harpers Ferry, WV, 1991.
[104] M. L. D. Wong and A. K. Nandi. Automatic digital modulation recognition using the artificial neural network and genetic algorithm. Signal Processing, 84(2): 351-365, 2004.
[105] K. Bache and M. Lichman. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2013.
[106] D. M. Abd El-Rehim, G. Ball, S. E. Pinder, E. Rakha, C. Paish, J. F. Robertson, D. Macmillan, R. W. Blamey, and I. O. Ellis. High-throughput protein expression analysis using tissue microarray technology of a large well-characterised series identifies biologically distinct classes of breast cancer confirming recent cDNA expression analyses. Int. Journal of Cancer, 116: 340-350, 2005.
[107] W. W. Cohen, R. E. Schapire, and Y. Singer. Learning to order things. In Proc. Advances in Neural Information Processing Systems (NIPS'98), 1998.
[108] V. Hristidis, N. Koudas, and Y. Papakonstantinou. PREFER: A system for the efficient execution of multi-parametric ranked queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2001.
[109] L. Duijm, J. H. Groenewoud, F. H. Jansen, J. Fracheboud, M. Beek, and H. J. de Koning. Mammography screening in the Netherlands: delay in the diagnosis of breast cancer after breast cancer screening. British Journal of Cancer, 91: 1795-1799, 2004.
[110] Breast Cancer Diagnostic Algorithms for Primary Care Providers. Breast Expert Workgroup, third edition, Cancer Detection Section, California Department of Health Services, 2005.
[111] W. H. Wolberg, W. N. Street, and O. L. Mangasarian. Breast cytology diagnosis via digital image analysis. Analytical and Quantitative Cytology and Histology, 15(6): 396-404, 1993.
[112] W. H. Wolberg, W. N. Street, and O. L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letter, 77: 163-171, 1994.
[113] B. B. Mandelbrot. The Fractal Geometry of Nature. Chapter 5. W. H. Freeman and Company, New York, 1997.
[114] A. Bellaachia and E. Guven. Predicting breast cancer survivability using data mining techniques. Scientific Data Mining Workshop, in conjunction with the 2006 SIAM Conference on Data Mining, 2006.
[115] D. Delen, G. Walker, and A. Kadam. Predicting breast cancer survivability: a comparison of three data mining methods. Artificial Intelligence in Medicine, 34(2): 113-127, 2005.
[116] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2: 27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[117] N. El Barbri, E. Llobet, N. El Bari, X. Correig, and B. Bouchikhi. Application of a portable electronic nose system to assess the freshness of Moroccan sardines. Materials Science and Engineering: C, 28(5-6): 666-670, 2008.
[118] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. The Journal of Machine Learning Research, 2: 45-66, 2002.
[119] T. Nugent and D. T. Jones. Transmembrane protein topology prediction using support vector machines. BMC Bioinformatics, 10: 159, 2009.
[120] L. M. Fu and C. S. Fu-Liu. Multi-class cancer subtype classification based on gene expression signatures with reliability analysis. FEBS Letters, 561(1): 186-190, 2004.
[121] Y. Liu and Y. F. Zheng. One-against-all multi-class SVM classification using reliability measures. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN'05), volume 2, IEEE, 2005.
[122] P. Mahesh. Multiclass approaches for support vector machine based land cover classification. arXiv preprint arXiv:0802.2411, 2008.
[123] J. Weston and C. Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, May 1998.
[124] C. W. Hsu and C.-J. Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2): 415-425, 2002.
[125] F. E. Harrell Jr., K. L. Lee, and D. B. Mark. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15: 361-387, 1996.
[126] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. New York: Springer, 4th edition, 2002.
[127] M. Vuk and T. Curk. ROC curve, lift chart and calibration plot. Metodološki zvezki, 3(1): 89-108, 2006.
[128] W. Wang and Z. H. Zhou. On multi-view active learning and the combination with semi-supervised learning. In Proceedings of the 25th International Conference on Machine Learning, ACM, 2008.
[129] I. Muslea, S. Minton, and C. A. Knoblock. Selective sampling with redundant views. In Proceedings of the National Conference on Artificial Intelligence, AAAI Press/MIT Press, 2000.
[130] P. Royston. Algorithm AS 181: The W test for normality. Applied Statistics, 31: 176-180, 1982.
[131] R. R. Bouckaert. Naive Bayes classifiers that perform well with continuous variables. In Proceedings of the 17th Australian Conference on AI (AI04). Berlin: Springer, 2004.
[132] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, MIT Press, pages 115-132, 2000.
[133] T. Joachims. Optimizing search engines using clickthrough data. KDD, 2002.
[134] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. ICML, 2005.
[135] W. Lior and S. Bileschi. Combining variable selection with dimensionality reduction. CVPR, 2005.
[136] A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2), 1997.
[137] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. NIPS, 2001.
[138] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. ICML, 2004.
[139] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 2003.
[140] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. ICML, 1997.
[141] G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 2003.
[142] R. Kohavi and G. H. John. Wrappers for feature selection. Artificial Intelligence, 1997.
[143] R. B. Yates and B. R. Neto. Modern Information Retrieval. Addison Wesley, 1999.
[144] K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 2002.
[145] J. Furnkranz and E. Hullermeier. Pairwise preference learning and ranking. In Proc. European Conf. on Machine Learning (ECML'03), 2003.
[146] S. Har-Peled, D. Roth, and D. Zimak. Constraint classification: A new approach to multiclass classification and ranking. In Proc. Advances in Neural Information Processing Systems (NIPS'02), 2002.
[147] R. Herbrich, T. Graepel, and K. Obermayer, editors. Large Margin Rank Boundaries for Ordinal Regression. MIT Press, 2000.
[148] E. Chang and S. Tong. Support vector machine active learning for image retrieval. In ACM Multimedia 2001, 2001.
[149] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proc. Int. Conf. on Machine Learning (ICML'00), pages 839-846, 2000.
[150] H. Yu, S. Hwang, and K. C.-C. Chang. RankFP: A framework for supporting rank formulation and processing. In Proc. Int. Conf. on Data Engineering (ICDE'05), 2005.
[151] D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. In Proc. ICML, pages 148-156, 1994.
[152] K. Brinker. Active learning of label ranking functions. In Proc. Int. Conf. on Machine Learning (ICML'04), 2004.
[153] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2: 121-167, 1998.
[154] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4), July 1994.
[155] S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, New York, 1999.
[156] N. Kwak and C. H. Choi. Input feature selection for classification problems. IEEE Transactions on Neural Networks, 13(1), January 2002.
[157] M. Kendall. Rank Correlation Methods. Oxford University Press, 1990.
[158] A. M. Liebetrau. Measures of Association, volume 32 of Quantitative Applications in the Social Sciences. Sage Publications, Inc., 1983.
[159] S. Robertson. Overview of the Okapi projects. Journal of Documentation, 53(1): 3-7, 1997.
[160] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth and Brooks, 1984.
[161] C. Aggarwal. Data Streams: Models and Algorithms. New York: Springer-Verlag, 2007.
[162] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2): 201-221, May 1994.
[163] X. Zhu, X. Wu, and Q. Chen. Eliminating class noise in large datasets. In Proc. ICML, pages 920-927, 2003.
[164] H. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proc. COLT, pages 287-294, 1992.
[165] M. Culver, D. Kun, and S. Scott. Active learning to maximize area under the ROC curve. In Proc. ICDM, pages 149-158, 2006.
[166] X. Zhu and X. Wu. Class noise vs attribute noise: A quantitative study of their impacts. Artificial Intelligence Review, 22(3/4): 177-210, November 2004.
[167] W. Hu, W. Hu, N. Xie, and S. Maybank. Unsupervised active learning based on hierarchical graph-theoretic clustering. IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(5): 1147-1161, October 2009.
[168] P. Mitra, C. Murthy, and S. Pal. A probabilistic active support vector learning algorithm. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26(3): 413-418, March 2004.
[169] B. Settles. Active learning literature survey. Computer Science Tech. Rep. 1648, University of Wisconsin-Madison, Madison, WI, 2009.
[170] K. Czarnecki. Model Driven Architecture. OOPSLA Tutorial. http://www.sts.tuharburg.de/teaching/ss-07/FMDM/K-NearestNeighbors.pdf
[171] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9: 1871-1874, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf. Laral.istc.cnr.it [online], 2003 [cited 2011-02-02]. Neural Networks Library in Java. Available at URL http://laral.istc.cnr.it/daniele/software/NNLibManual.pdf
[172] X. Zhu. Semi-Supervised Learning Tutorial. Department of Computer Sciences, University of Wisconsin, Madison, USA, 2007.
[173] S. Abney. Semisupervised Learning for Computational Linguistics (1st ed.). Chapman & Hall/CRC, 2007.
[174] M. P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares Jr., and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA, 97: 262-267, 2001.
[175] T. Jaakkola, D. Haussler, and M. Diekhans. Using the Fisher kernel method to detect remote protein homologies. In Proceedings of ISMB, 1999.
"Using the Fisher kernel method to detect remote protein homologies". In Proceedings of ISMB, 1999. [176] Zien, A., Ratsch, G., Mika, S., Scholkopf, B., Lemmen, C., Smola, A., Lengauer, T., and Muller, K.-R. "Engineering support vector machine kernels that recognize translation initiation sites", Bioinformatics, 16:799- 807, 2000. [177] Valafar, F. "Pattern Recognition Techniques in Microarray Data Analysis". Annals of the New York Academy of Sciences 980(1): 41-64, 2002. [178] Gommerman, A., Vovk, V. and Vapnik, V. "Learning by Transduction", In Uncertainty in Artificial Intelligence, pp.148-155,1998. [179] Collobert, R., Sinz, F., Weston, J. and Bottou, L. "Large Scale Transductive SVMs". J. Mach. Learn. Res. 7 (December 2006), 1687-1712, 2006. [180] R. Zhang, W. Wang, Y. Ma, C. Men. "Least Square Transduction Support Vector Machine." Neural Processing Letters 29(2): 133-142, 2009. [181] Joachims, T. " Transductive inference for text classification using support vector machines". In Proceedings of ICML-99, pages 200–209, 1999. [182] Han, J. and Kamber, M. "Data Mining: Concepts and Techniuqes". Morgan Kaufmann Publishers, San Francisco, CA, 2001. [183] Guyon, I., Weston, J., Barnhill, S., Vapnik, V. "Gene Selection for Cancer Classification using Support Vector Machines Machine Learning". 46(1): 389422, 2002. [184] Bennett, K. And Demiriz, A. "Semi-supervised support vector machines". In NIPS,1998. [185] Harris, C. and Ghaffari, N. "Biomarker discovery across annotated and unannotated microarray datasets using semi-supervised learning". BMC Genomics 9 Suppl 2:S7, 2008. 198 [186] Joachims, T. "Making large-scale support vector machine learning practical". In Advances in Kernel Methods: Support Vector Machines, 1999. [187] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh L, Downing JR, Caliguire MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999. [188] Shipp MA, Ross KN, Tamayo P, Weng, AP, Kutok JL. "Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning". Nature Medicine, 8:68-74. 2002. [189] A.S. Yong, R.M. Szydlo, J.M. Goldman, J.F. Apperley and J.V. Melo. "Molecular profiling of CD34+ cells identifies low expression of CD7, along with high expression of proteinase 3 or elastase, as predictors of longer survival in patients with CML" Blood 2006, 107:205-12, 2006.