How to speed up the learning mechanism in a connectionist model

Szilárd Vajda, Abdel Belaïd
Henri Poincaré University, Nancy 1
Loria Research Center, READ Group
Campus Scientifique, BP. 239
Vandoeuvre-lès-Nancy, 54506, France
{vajda,abelaid}@loria.fr

Abstract

In this paper a fast data-driven learning-corpus building algorithm (FDDLCB) is proposed. This generic technique dynamically builds a representative and compact learning corpus for a connectionist model. The constructed dataset contains only a reduced number of patterns, yet they are sufficiently descriptive to characterize the different classes to be separated. The method is based on a double least mean squares (LMS) error minimization mechanism that tries to find the optimal boundaries of the different pattern classes. In the classical learning process the LMS serves to minimize the error during training; this is complemented by a second mechanism, as the selection of new samples is also driven by the goal of minimizing the recognition error. Reinforcing the class boundaries where recognition fails allows a rapid and good generalization without any loss of accuracy. A modified version of the algorithm is also presented. The experiments were performed on the MNIST (Modified NIST) separated digit dataset. The encouraging result (98.51%), obtained using just 1.85% of the available patterns from the original training dataset, is comparable even with state of the art techniques.

1. Introduction

In the last few decades, among the neural network (NN) based applications that have been proposed for pattern recognition, character recognition has been one of the most successful. The most common approach in this field is a multi-layer perceptron (MLP) with a supervised learning scheme based mainly on the least mean squares (LMS) error minimization rule. Generally, to train such a system a huge amount of data is needed in order to cover the different intra-class and inter-class variations.

Hence, neural network based approaches present both advantages and disadvantages. Among the advantages we can enumerate: good generalization properties based on a solid mathematical background, a good convergence rate, a fast recognition process, etc. [2]. Among the disadvantages are some restrictions: the size of the input must be fixed, the appropriate nature of the input is not obvious for the different pattern recognition tasks, and it is hard to implant in the network topology a priori knowledge derived from the data corpus. At the same time, the convergence of the system can be slow, as adjusting the decision surface (hyperplane) as a function of the network's free parameters over a huge amount of data can be a time-costly process. The excessive amount of data, the network architecture and the curse of dimensionality are an endless trade-off in neural network theory [15].

In order to tackle these problems, different techniques have been proposed in the literature. Some of them use a priori knowledge derived from the dataset, some modify the network topology in order to reduce the number of free parameters, some reduce the dimension of the input using feature selection techniques, and others develop so-called active learning techniques, whose goal is to use an optimal training dataset obtained by selecting the most informative patterns from the original dataset.
Our proposition builds on this active learning idea, more precisely on the branch of incremental learning, where the dataset is constructed dynamically during training. Using the FDDLCB algorithm, based mainly on LMS error minimization, we have reduced the training time considerably without any loss of accuracy.

The rest of the paper is organized as follows. Section 2 gives a brief description of existing improvements and active learning techniques, while Section 3 discusses the proposed algorithm in detail. Section 4 is dedicated to the results and finally Section 5 gives some concluding remarks.

2. Related works

Over the years, different classifier systems have been proposed and developed for the different pattern recognition tasks inspired by real-life applications. Nowadays, more or less every baseline recognition system has reached its limits, as no existing system can realistically model human vision. Hence, research has been oriented toward different improvements of the existing systems, by refining the mathematical formalism or by implanting empirical knowledge into the systems. Considering the nature of the improvements, different research axes can be distinguished.

In order to select the most descriptive features, many feature extraction algorithms have been proposed in the last few years. In character recognition these features can be grouped as: statistical features, geometrical features [4, 8, 9, 14], size and rotation invariant features [1, 7, 12, 19], etc. For the combination of these features many feature subset selection mechanisms have been designed. Feature subset selection is based mainly on neural networks (NN) and genetic algorithms (GA) [21] operating with randomized heuristic search techniques [3].

Concerning the network topology, there is a consensus in the NN community that one or two hidden layers are sufficient for the different pattern classification problems, but LeCun and his colleagues showed that it is possible to use multiple hidden layers based on multiple maps using convolution, sub-sampling and weight sharing techniques in order to achieve excellent results on separated digit recognition [4, 18]. Spirkovska and Reid, using a higher order neural network, introduced into the topology a priori information about position, size and rotation invariance [19]. Another solution is optimal brain damage (OBD), proposed by LeCun in [5], which removes the unimportant weights in order to achieve a better generalization and a speed-up of the training/testing procedure. Optimal cell damage (OCD) and its derivatives are also based on the idea of pruning the network structure. All these approaches solve the encountered problems more or less, but each of them has a considerable time complexity.

The most promising approach seems to be so-called active learning and its different derivatives. In such an approach the learner, i.e. the classifier, is guided during the training process by an information control mechanism implanted in the system. Rather than passively accepting all the available training samples, the classifier guides its own learning process by finding the most informative patterns. With such a guided training, where the training patterns are selected dynamically, the training duration can be reduced considerably and a better generalization can be obtained, as all non-interesting data can be discarded. As stated in [16], the generalization error decreases more rapidly for active learning than for passive learning.
Engelbrecht in [6] groups these techniques into two classes according to their action mechanism:

1. Selective learning, where the classifier selects at each selection interval a new training subset from the original candidate set.

2. Incremental learning, where the classifier starts with an initial subset selected somehow from the candidate set. During the learning process, at specified selection intervals, new samples are selected from the candidate set and added to the training set.

While in [6, 13, 16] the authors developed active learning techniques for feedforward neural networks, similar systems have been proposed for SVM approaches in [11, 17, 20]. In this second case, all the pattern selection algorithms are based on the idea that the hyperplane constructed by the SVM depends only on a reduced subset of the training patterns, called support vectors, that lie close to the decision surface. Mostly, the selection methods in that case are based on kNN, clustering, confidence measures, the Hausdorff distance, etc. The drawback of such systems is that it is quite difficult to fix their different parameters, as stated by Shin and Cho in [17]. Another limitation of the approach is that a second training procedure is necessary, while for an NN the training process is applied just once. This drawback can also be found in the different network pruning algorithms proposed by LeCun.

3. FDDLCB algorithm description

Our method is based on incremental learning using error-based selection. The approach uses an MLP type classifier with one hidden layer. The main idea of the FDDLCB algorithm is to build up at run time a data-driven minimal learning corpus based on the LMS, by adding additional patterns to the training corpus at each training level in order to cover maximally the different variations of the patterns and to reduce the recognition error.

Let us denote by GlobalLearningCorpus (GLC) the whole set of patterns which can be used during the training procedure, by GlobalTestingCorpus (GTC) the whole set of patterns which can be used for the test, and by DynamicLearningCorpus (DLC) the minimal set of patterns which can serve to train the network. Let us also denote by NN the neural network, by N the iterator which gives the number of new patterns to be considered at each learning level, and by M the number of classes to be separated.
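The paper never writes the LMS criterion out explicitly. As a point of reference, the quantity that drives both the weight updates and the pattern selection below is the usual per-pattern squared error; in our own notation, with o_{p,k} the k-th network output for pattern p and t_{p,k} the corresponding target,

E_p = \frac{1}{2} \sum_{k=1}^{M} \left( t_{p,k} - o_{p,k} \right)^2, \qquad E = \sum_{p} E_p.

Backpropagation minimizes E over the DLC, while the selection step ranks the patterns of the GLC by E_p and moves those with the largest errors into the DLC; this is the double use of the LMS criterion announced in the abstract.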
Algorithm description:

Initialization:
    DLC = {x_i ∈ GLC | i = 1, ..., M}
    GLC = GLC − DLC

Database building:
    repeat {
        repeat {
            TrainNetwork(NN, DLC)
        } until (NetworkError(NN, DLC, ALL) < Threshold1)
        TestNetwork(NN, GTC)
        if (NetworkError(NN, GTC, ALL) < Threshold2) then STOP
        else {
            TestNetwork(NN, GLC)
            if (NetworkError(NN, GLC, ALL) < Threshold1) then STOP
            else {
                DLC = DLC ∪ {y_i ∈ GLC | i = 1, ..., N}
                GLC = GLC − DLC
            }
        }
    } until (| GLC | = 0)

Results:
    NN contains the modified weight set
    DLC contains the minimal number of patterns which is sufficient to train the NN

where:

• TrainNetwork(NN, DATASET) trains the NN on the given DATASET using classical LMS error minimization and error backpropagation;

• TestNetwork(NN, DATASET) tests the NN on the given DATASET;

• NetworkError(NN, DATASET, SAMPLES_NUMBER) computes the error produced by the NN on SAMPLES_NUMBER patterns of the DATASET using the LMS criterion;

• y_i denotes the pattern of the GLC giving the i-th highest error during the test;

• |DATASET| denotes the cardinality of the DATASET.

The algorithm starts with an initialized DLC set in which one random representative pattern (x_i) has been selected for each class, in order not to favour any class initially. The algorithm trains the network with these samples. Once the training error is smaller than an empirical threshold value, the training process stops and the network is tested on the samples belonging to the GTC. If the error criterion is satisfied the algorithm stops, as the training was successful. Otherwise new samples must be added to the DLC set. To do this, we look in the GLC for the N samples (y_i) giving the highest classification error. If even this error is below the threshold value the algorithm stops, as no extra helpful information can be added to the network. Otherwise these N elements are removed from the GLC, moved into the DLC, and the training is restarted on this extended dataset. The algorithm stops when the error criterion is satisfied or when there are no more available patterns in the GLC. In the second case we are back to classical training, as the whole dataset is eventually used, so the algorithm imposes no restriction: in the worst case we obtain almost the same results as when using the whole dataset.

A modified version of the FDDLCB algorithm consists of feeding the network with class samples having the same distribution. This precaution is necessary, as stated in [10], in order not to bias the system one way or another. For that reason we modified the conditions of the DLC set creation. Now, at each iteration we add N samples for each pattern class, based on their highest error contribution within their class, instead of using the first N samples of the whole dataset giving the highest error. Using this selection process we can guarantee the distribution uniformity for each pattern class.
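The pseudocode above translates almost line for line into the following Python sketch. It is only an illustrative reading of the algorithm, not the authors' implementation: scikit-learn's MLPClassifier (trained incrementally with partial_fit) stands in for their single-hidden-layer MLP, and the hidden layer size, learning rate, thresholds thr1/thr2 and the epoch cap are assumed values. The balanced flag switches between the original global top-N selection and the class-balanced selection of the modified variant.

import numpy as np
from sklearn.neural_network import MLPClassifier

def per_sample_lms(clf, X, y, n_classes):
    # Squared error between the network outputs and the one-hot targets, per pattern.
    targets = np.eye(n_classes)[y]
    return ((clf.predict_proba(X) - targets) ** 2).sum(axis=1)

def select_hardest(err, N):
    # Original FDDLCB selection: the N patterns with the highest error overall.
    return np.argsort(err)[-N:]

def select_hardest_per_class(err, y, n_classes, N):
    # Modified FDDLCB selection: the N worst patterns of every class.
    picked = []
    for c in range(n_classes):
        idx = np.flatnonzero(y == c)
        picked.extend(idx[np.argsort(err[idx])[-N:]])
    return np.array(picked)

def fddlcb(X_glc, y_glc, X_gtc, y_gtc, n_classes=10, N=50,
           thr1=0.05, thr2=0.02, max_inner_epochs=50, balanced=False, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.arange(n_classes)

    # Initialization: one random representative pattern of each class forms the DLC.
    init = np.array([rng.choice(np.flatnonzero(y_glc == c)) for c in classes])
    dlc_X, dlc_y = X_glc[init], y_glc[init]
    keep = np.ones(len(y_glc), dtype=bool)
    keep[init] = False
    X_glc, y_glc = X_glc[keep], y_glc[keep]

    clf = MLPClassifier(hidden_layer_sizes=(200,), solver="sgd",
                        learning_rate_init=0.1, random_state=seed)

    while True:
        # Inner loop: train on the DLC until its mean LMS error drops below
        # Threshold1 (or the epoch cap is reached).
        for _ in range(max_inner_epochs):
            clf.partial_fit(dlc_X, dlc_y, classes=classes)
            if per_sample_lms(clf, dlc_X, dlc_y, n_classes).mean() < thr1:
                break

        # Stop if the error on the test corpus (GTC) is already acceptable.
        if per_sample_lms(clf, X_gtc, y_gtc, n_classes).mean() < thr2:
            break

        # Otherwise rank the remaining GLC patterns by their LMS error.
        err = per_sample_lms(clf, X_glc, y_glc, n_classes)
        if err.max() < thr1:
            break  # nothing informative left to add
        worst = (select_hardest_per_class(err, y_glc, n_classes, N)
                 if balanced else select_hardest(err, N))

        # Move the selected patterns from the GLC into the DLC and retrain.
        dlc_X = np.vstack([dlc_X, X_glc[worst]])
        dlc_y = np.concatenate([dlc_y, y_glc[worst]])
        keep = np.ones(len(y_glc), dtype=bool)
        keep[worst] = False
        X_glc, y_glc = X_glc[keep], y_glc[keep]
        if len(y_glc) == 0:
            break  # candidate set exhausted: we are back to classical training

    return clf, (dlc_X, dlc_y)

Note that the same per_sample_lms function plays both roles of the double LMS mechanism: it decides when the inner training loop may stop and it ranks the candidate patterns of the GLC.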
4. Test results

The experiments with the FDDLCB algorithm used the MNIST reference database as input data. This dataset contains 60,000 samples for learning and 10,000 samples for test. The 28x28 normalized gray-scale images contain separated handwritten digits from 0 to 9. The tests were performed with a fully connected MLP with one hidden layer. Raw images were used as input, so the input vector contains 784 values. In order to achieve a recognition rate of about 98.6% using the whole learning corpus, at least 30 learning epochs are needed. That means at least 30 x 60,000 = 1,800,000 patterns must be presented to the network.

In Table 1 we show the different constructed datasets, the number of patterns presented to the system and the results obtained by the FDDLCB algorithm on the test set as a function of the N parameter. A comparable result can thus be achieved with only 1,110 samples out of the possible 60,000, which means the other patterns can be considered redundant information and it is not necessary to use them. The learning process can be shortened substantially, as it is possible to achieve almost the same results by presenting just 58,690 patterns to the network. So we can speed up the learning process 14 times, which is a considerable gain even on a high-end computer.

N      Generated learning set (DLC)    Patterns presented    Recognition rate
50               1,110                       58,690               98.51%
100              2,210                      113,940               98.47%
150              2,260                      127,600               98.51%
200              2,810                       15,630               98.50%
250              3,010                      139,410               98.48%
300              2,710                      123,180               98.48%
350              3,510                      123,180               98.50%

Table 1. Results obtained with the different datasets constructed by the FDDLCB algorithm

As the authors of [17] provide results of their pattern selection method on the MNIST benchmark dataset, a comparison can be made. Nine SVM-type binary classifiers were used: class 8 is paired with each of the remaining classes. The reported recognition error, averaged over the nine classifiers, is 0.28% using all the available patterns and 0.38% for the pattern selection based technique. The loss of accuracy is similar to ours. Unfortunately no results are reported concerning the recognition accuracy for each separate digit class, so a direct comparison with our method cannot be performed. Their time factor is reduced by a factor of 11.8, which is much less than in our case. Similarly, the fraction of patterns used as support vectors (16.76%) is considerably larger than the 1.85% of patterns we select to train the system.

The modified FDDLCB algorithm result, 98.01%, is close to the result produced by the original algorithm, but it needs many more iterations and samples (9,000 different samples were selected, while 864,600 patterns were presented to the network).

In Figure 1 we present the class distribution for the different datasets built by FDDLCB as a function of the N parameter; the x-axis shows the different classes and the y-axis the distribution percentage of each class.

Figure 1. The sample distribution over the classes for the different constructed datasets

We can see that the variance of the element distribution is not significant across the different datasets, so the N parameter controls only the learning convergence speed and the size of the built dataset. The empirical value N = 50 was established after some trial runs performed with different values of N; we found it to be the optimal value for achieving a considerable speed gain. Similarly, the results presented in Table 1 show that changing the parameter N has no major influence on the results: it affects only the size of the built dataset and the speed of the building process. Using the same pattern distribution as in Figure 1 but choosing the patterns randomly for the dataset creation, the recognition accuracy cannot exceed an average score of 91.01%.

Analyzing the dynamic learning corpus, we can also say something about the intra-class and inter-class variance. In the MNIST database the class "0" has the least variance and the class "9" the most variation, so many more samples of class "9" are needed in order to achieve a good recognition score. In terms of pattern complexity, the classes "0", "1" and "6" are the simplest and the classes "3", "8" and "9" are the most complex, which is natural as the latter can easily be confused.
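As an aside, a hypothetical driver for the fddlcb sketch of Section 3 could look as follows; it assumes that sketch is defined in the same module, fetches MNIST from OpenML instead of the original distribution files, and uses N = 50 as in Table 1. Scaling the pixels to [0, 1] is our own choice, not specified in the paper.

import numpy as np
from sklearn.datasets import fetch_openml

# OpenML's "mnist_784" holds the same 70,000 digits; the customary split keeps
# the first 60,000 samples for training and the last 10,000 for test.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = mnist.data.astype(np.float32) / 255.0   # raw 784-dimensional pixel vectors
y = mnist.target.astype(int)

X_glc, y_glc = X[:60000], y[:60000]         # GlobalLearningCorpus
X_gtc, y_gtc = X[60000:], y[60000:]         # GlobalTestingCorpus

clf, (dlc_X, dlc_y) = fddlcb(X_glc, y_glc, X_gtc, y_gtc, n_classes=10, N=50)
print("DLC size:", len(dlc_y))
print("test accuracy:", clf.score(X_gtc, y_gtc))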
5. Conclusion

We proposed in this paper a generic, simple and fast active learning algorithm that builds a minimal learning corpus at run time, based on an MLP classifier. The algorithm relies on a dual LMS error estimation, which guarantees its convergence. The first LMS minimization is used in the training process, in the error backpropagation; the second one is used when the LMS error of the samples is calculated during recognition. The misclassified patterns are added to the DLC set in order to minimize the recognition error by learning these new items, which contributed to the error accumulation. The method reduces the learning period substantially and discards the redundant information in order to avoid overfitting. The tests performed on MNIST showed that it is possible to achieve 98.51% recognition accuracy using just 1,110 different samples, and that the learning time can be reduced by a factor of 14, a time gain which remains considerable even when the cost of the selection mechanism itself is taken into account. The mechanism cannot, however, be used to improve the system presented in [18], which relies precisely on data redundancy. The algorithm tries to enlarge the different class boundaries by using the extreme patterns in learning. It increases the number of forward steps (propagation) but decreases substantially the number of backward steps (error backpropagation), which are much more costly in computation.

The FDDLCB algorithm can also be used to address the challenge raised by Japkowicz in [10], the class imbalance problem, which often occurs in real-world applications. We frequently deal with learning corpora where the distribution of the samples over the different classes is not uniform: some classes are over-represented while others are under-represented. The methods presented in [10], based on down-sizing and re-sampling, are restrictive, as there is no rigorous selection criterion to choose which elements should be discarded or re-sampled. The FDDLCB can avoid the overfitting effect caused by these methods by using a rigorous selection criterion.

References

[1] S. Adam, J. M. Ogier, C. Cariou, R. Mullot, J. Gardes, and Y. Lecourtier. Multi-scaled and multi-oriented character recognition: An original strategy. In ICDAR, pages 45–48, 1999.
[2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[3] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1995.
[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Intelligent Signal Processing, pages 306–351, 2001.
[5] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems 2 (NIPS*89), 1990.
[6] A. P. Engelbrecht. Selective learning for multilayer feedforward neural networks. In J. Mira and A. Prieto (eds.), Lecture Notes in Computer Science, 2084:386–393, 2001.
[7] A. Goshtasby. Description and discrimination of planar shapes using shape matrices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(6):738–743, 1984.
[8] I. Guyon. Application of neural networks to character recognition. International Journal of Pattern Recognition and Artificial Intelligence, 5(1):353–382, 1991.
[9] M. S. Hoque and M. C. Fairhurst. A moving window classifier for off-line character recognition. In Proceedings of the 7th International Workshop on Frontiers in Handwriting Recognition, pages 595–600, 1998.
[10] N. Japkowicz. The class imbalance problem: significance and strategies. In Proceedings of the International Conference on Artificial Intelligence 2000 (IC-AI 2000), 2000.
[11] R. Koggalage and S. Halgamuge. Reducing the number of training samples for support vector machine classification. Neural Information Processing - Letters and Reviews, 2(3):57–65, 2004.
[12] S.-W. Lee, H.-S. Park, and Y. Y. Tang. Translation-, scale- and rotation-invariant recognition of hangul characters with ring projection. In Proceedings of the 1st International Conference on Document Analysis and Recognition, pages 829–836, 1991.
[13] R. Polikar, L. Udpa, S. S. Udpa, and V. Honavar. Learn++: An incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man and Cybernetics - Part C: Applications and Reviews, 31(4):497–508, 2001.
[14] R. Romero, R. Berger, R. Thibadeau, and D. Touretzky. Neural network classifiers for optical Chinese character recognition. In Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval, pages 385–389, 1995.
[15] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pages 318–362. MIT Press, 1986.
[16] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 287–299, 1992.
[17] H. Shin and S. Cho. Fast pattern selection for support vector classifiers. In Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining, LNCS 2637, 2003.
[18] P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, pages 958–962, 2003.
[19] L. Spirkovska and M. B. Reid. Robust position, scale and rotation invariant object recognition using higher-order neural networks. Pattern Recognition, 25(9):975–985, 1992.
[20] J. Wang, P. Neskovic, and L. N. Cooper. Training data selection for support vector machines. In International Conference on Natural Computation, 2005.
[21] J. Yang and V. Honavar. Feature subset selection using a genetic algorithm. IEEE Intelligent Systems, 13(2):44–49, 1998.