Multi-View EM Algorithm for Finite Mixture Models

Xing Yi, Changshui Zhang, and Yunpeng Xu

State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University, Beijing 100084, P. R. China

Abstract. In this paper we propose the Multi-View Expectation-Maximization (Multi-View EM) algorithm for finite mixture models to handle real-world learning problems that have natural feature splits. Multi-View EM splits features as Co-training and Co-EM do, but it treats multi-view learning problems within the EM framework. Compared with other algorithms in the Co-training setting, the proposed algorithm has several appealing advantages: it can be applied to both unsupervised and semi-supervised learning tasks; it easily handles problems with more than two views; it can simultaneously use different classifiers and different optimization criteria, such as ML and MAP, in different views; and its convergence is theoretically guaranteed. Experiments on synthetic data, USPS data and WebKB data(1) show that Multi-View EM performs well compared with Co-EM, Co-training and standard EM.

1 Introduction

Semi-supervised learning, which combines information from both labeled and unlabeled data, has drawn wide attention. Some related research deals with labeled and unlabeled data in problem domains where the features naturally divide into different subsets (views) [1][2][3][4][5][6]. This is a reasonable approach, because real-world learning problems often have natural feature splits: in web-page classification, features can be divided into two disjoint subsets, one concerning words that appear on the page and the other concerning words that appear in hyperlinks pointing to that page; in audio-visual speech recognition, features consist of an audio part and a visual part; in color image segmentation, features usually involve a coordinate view and a color view; and so on.

In the Co-training setting, where the features naturally partition into two sets, many algorithms have been proposed that exploit this feature division to boost the performance of learning systems using labeled and unlabeled data, such as Co-training [1], Co-EM [7], Co-Testing [3] and Co-EMT [2]. Blum and Mitchell provided a PAC-style analysis of Co-training, which shows that when the two views are compatible and uncorrelated, Co-training will successfully learn the target concept from labeled and unlabeled data [1]. Nigam and Ghani demonstrated that when a natural independent split of the input features exists, algorithms that exploit this split outperform algorithms that do not. They also argued that Co-EM is a closer match to the theoretical argument in [1] than Co-training [7]. Co-EM uses the hypothesis learned in one view to probabilistically label the examples in the other. Intuitively, Co-EM runs the Expectation-Maximization (EM) algorithm in each view and, before each new EM iteration, interchanges the probabilistic labels generated in each view. However, from the viewpoint of the EM framework, this interchange of the two views' labels is only a technical device, and Co-EM does not guarantee convergence.

(1) This data is available at http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo20/www/data/webkb-data.gtar.gz
In this paper, we propose the Multi-View Expectation-Maximization (Multi-View EM) algorithm for finite mixture models, which follows the feature-split scheme but deals with multi-view learning scenarios in the framework of the EM algorithm rather than the PAC model. The proposed algorithm easily handles problems with more than two views and guarantees convergence.

The remainder of the paper is organized as follows. In Section 2 we briefly review the Co-training setting and EM for finite mixture models, and then describe Multi-View EM for finite mixture models in detail. We also provide some implementations of Multi-View EM in Section 2.4, such as a Gaussian mixture model (GMM) based version and a Naive Bayes classifier based version. Section 3 presents experimental results comparing Multi-View EM with standard EM on synthetic datasets and the USPS dataset [8], and comparing Multi-View EM with Co-training, Co-EM and standard EM on the WebKB data [1]. Performance analysis is given in Section 4. Section 5 concludes.

2 Multi-View EM Algorithm for Finite Mixture Models

2.1 The Co-training Setting

The Co-training setting [1] assumes that in real-world learning problems with a natural way to partition the features into two views V1 and V2, an example x can be described by a triple [x_1, x_2, l], where x_1 and x_2 are x's descriptions in the two views and l is its label. This is easily generalized to a multi-view setting with more than two views. In this setting, given a learning algorithm L, the sets T and U of labeled and unlabeled samples, and the number k of iterations to be performed, Table 1 and Table 2 describe the flow charts of Co-training and Co-EM respectively. Intuitively, Co-training labels the data on which the classifier in each view makes the most confident predictions, while Co-EM can be viewed as a probabilistic version of Co-training [7].

2.2 Finite Mixture Models and the EM Algorithm

A d-dimensional random variable x = [x^1, x^2, ..., x^d]^T is said to follow a k-component finite mixture distribution if its probability density function can be written as

    p(x | \theta) = \sum_{m=1}^{k} \alpha_m p(x | \theta_m),    (1)

where \alpha_m is the prior probability of the m-th component and satisfies

    \alpha_m \ge 0  and  \sum_{m=1}^{k} \alpha_m = 1,    (2)

\theta_m is the parameter of the m-th density model, and \theta = {(\alpha_m, \theta_m), m = 1, 2, ..., k} is the parameter set of the mixture model. For a GMM, \theta_m = {\mu_m, \Sigma_m}.

EM has been widely used for the parameter estimation of finite mixture models. Suppose a set Z consists of observed data X and unobserved data Y; then Z = (X, Y) and X are called the complete data and the incomplete data respectively. Under Maximum Likelihood (ML) estimation, the E-step calculates the expected complete-data log-likelihood, defined by the so-called Q function [9],

    Q(\theta, \hat{\theta}(t)) \equiv E[\log p(X, Y | \theta) | X, \hat{\theta}(t)].    (3)

The M-step updates the parameters by

    \hat{\theta}(t+1) = \arg\max_{\theta} Q(\theta, \hat{\theta}(t)).    (4)

The EM algorithm performs the E-step and M-step iteratively, and its convergence is guaranteed. EM is easily generalized to Maximum a Posteriori (MAP) estimation: denoting the log of the prior density of \theta by G(\theta), we simply maximize Q(\theta, \hat{\theta}(t)) + G(\theta) in the M-step [10], which leads to different iterative updating formulas for \theta.

2.3 Multi-View EM Algorithm for Finite Mixture Models

For convenience, we describe the two-view version of Multi-View EM in this paper; it is easily generalized to more views with only slight changes to the corresponding formulas.
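Before presenting the two multi-view versions, the following minimal sketch (in Python with NumPy and SciPy; an illustration under our own assumptions, not the authors' implementation) shows the standard EM loop of Section 2.2 for a GMM. The multi-view versions below change only how the responsibilities p(m | x^i, \hat{\theta}) are formed in the E-step and how the per-view parameters are updated in the M-step.

# A minimal EM loop for a k-component GMM (Eqs. (1)-(4)); illustrative only.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """Fit a k-component GMM to X (N x d) with standard EM under the ML criterion."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    alpha = np.full(k, 1.0 / k)                      # mixing weights alpha_m
    mu = X[rng.choice(N, k, replace=False)]          # random initial means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: responsibilities p(m | x^i, theta_hat)
        dens = np.column_stack([alpha[m] * multivariate_normal.pdf(X, mu[m], Sigma[m])
                                for m in range(k)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximize the Q function of Eq. (4)
        Nm = resp.sum(axis=0)
        alpha = Nm / N
        mu = (resp.T @ X) / Nm[:, None]
        for m in range(k):
            diff = X - mu[m]
            Sigma[m] = (resp[:, m, None] * diff).T @ diff / Nm[m] + 1e-6 * np.eye(d)
    return alpha, mu, Sigma, resp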
Table 1. Flow chart of Co-training

Loop for k iterations or until all samples have been labeled:
  - use L, V1(T), V2(T) to create classifiers h1 and h2
  - for each class Ci:
    - let E1, E2 be the unlabeled examples on which h1 and h2 make the most confident predictions for Ci
    - remove E1, E2 from U, label them according to h1 and h2 respectively, and add them to T
Combine the predictions of h1 and h2.

Table 2. Flow chart of Co-EM

Let S = T ∪ U, and let h1 be the classifier obtained by training L on T.
Loop for k iterations:
  - New1 = ProbabilisticallyLabel(S, h1)
  - use L, V2(New1) to create classifier h2
  - New2 = ProbabilisticallyLabel(S, h2)
  - use L, V1(New2) to create classifier h1
Combine the predictions of h1 and h2.

In the finite mixture models and Co-training setting, it holds that

    p(x_1 | \theta) = \sum_{m=1}^{k} \alpha_m \sum_{x_2} p(x_1, x_2 | \theta_m) = \sum_{m=1}^{k} \alpha_m p(x_1 | \theta_{mV_1}),
    p(x_2 | \theta) = \sum_{m=1}^{k} \alpha_m \sum_{x_1} p(x_1, x_2 | \theta_m) = \sum_{m=1}^{k} \alpha_m p(x_2 | \theta_{mV_2}),    (5)

where \theta_{V_1} = {(\alpha_m, \theta_{mV_1}), m = 1, 2, ..., k} and \theta_{V_2} = {(\alpha_m, \theta_{mV_2}), m = 1, 2, ..., k} denote the models' parameter sets in the two views respectively. Different versions of Multi-View EM for finite mixture models can be derived from different estimation criteria and different assumptions.

Version I of Multi-View EM. The first version of Multi-View EM fits finite mixture models to the observed data according to a criterion other than the ML and MAP criteria, which can be formulated as

    Q''(\theta, \hat{\theta}(t)) \equiv E[\log ( p(X_1, Y | \theta_{V_1})^{w_1} \cdot p(X_2, Y | \theta_{V_2})^{w_2} ) | X, \hat{\theta}(t)],
    \hat{\theta}(t+1) = \arg\max_{\theta} Q''(\theta, \hat{\theta}(t)),    (6)

where w_i denotes the weight of the i-th view. The M-step of Version I of Multi-View EM updates the parameters by maximizing the Q'' function. In the EM parameter estimation of mixture models, the Q function [9] satisfies

    Q(\theta, \hat{\theta}(t)) \equiv E[\log p(X, Y | \theta) | X, \hat{\theta}(t)]
                              = \sum_{m=1}^{k} \sum_{i=1}^{N} \ln(\alpha_m) p(m | x^i, \hat{\theta}) + \sum_{m=1}^{k} \sum_{i=1}^{N} \ln(p(x^i | \theta_m)) p(m | x^i, \hat{\theta}),    (7)

where p(m | x^i, \hat{\theta}) denotes the probability that component m generates sample x^i. Therefore, in Version I of Multi-View EM,

    Q''(\theta, \hat{\theta}(t)) = \sum_{m=1}^{k} \sum_{i=1}^{N} \ln(\alpha_m) ( w_1 p(m | x_1^i, \hat{\theta}_{V_1}) + w_2 p(m | x_2^i, \hat{\theta}_{V_2}) )
        + w_1 \sum_{m=1}^{k} \sum_{i=1}^{N} \ln(p(x_1^i | \theta_{mV_1})) p(m | x_1^i, \hat{\theta}_{V_1})
        + w_2 \sum_{m=1}^{k} \sum_{i=1}^{N} \ln(p(x_2^i | \theta_{mV_2})) p(m | x_2^i, \hat{\theta}_{V_2}).    (8)

The M-step of Multi-View EM then updates the parameters by maximizing this Q'' function, which can be formulated as

    \alpha_m = \frac{1}{N} \sum_{i=1}^{N} p(m | x^i, \hat{\theta}),
    \hat{\theta}_{mV_1} = \arg\max_{\theta_{mV_1}} \sum_{i=1}^{N} \ln(p(x_1^i | \theta_{mV_1})) p(m | x_1^i, \hat{\theta}_{V_1}),
    \hat{\theta}_{mV_2} = \arg\max_{\theta_{mV_2}} \sum_{i=1}^{N} \ln(p(x_2^i | \theta_{mV_2})) p(m | x_2^i, \hat{\theta}_{V_2}),
    p(m | x^i, \hat{\theta}) = \frac{w_1}{w_1 + w_2} p(m | x_1^i, \hat{\theta}_{V_1}) + \frac{w_2}{w_1 + w_2} p(m | x_2^i, \hat{\theta}_{V_2}).    (9)

Intuitively, this version of Multi-View EM runs EM in each view and, before each new EM iteration, combines the weighted probabilistic labels generated in each view.
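As a hedged sketch of how Version I could be realized for two GMM views (one possible reading of Eqs. (8)-(9), not the authors' code; the per-view GMM M-step anticipates Eq. (11) in Section 2.4), one iteration might look as follows.

# Sketch of one Version I iteration for two GMM views; illustrative only.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_resp(X, alpha, mu, Sigma):
    """Per-view responsibilities p(m | x_t^i, theta_hat_Vt) for one GMM view."""
    dens = np.column_stack([alpha[m] * multivariate_normal.pdf(X, mu[m], Sigma[m])
                            for m in range(len(alpha))])
    return dens / dens.sum(axis=1, keepdims=True)

def version1_step(X1, X2, params1, params2, w1, w2):
    """One Multi-View EM (Version I) iteration for views X1 (N x d1) and X2 (N x d2)."""
    (a1, mu1, S1), (a2, mu2, S2) = params1, params2
    # E-step in each view, then combine the weighted probabilistic labels (last line of Eq. (9))
    r1, r2 = gmm_resp(X1, a1, mu1, S1), gmm_resp(X2, a2, mu2, S2)
    r = (w1 * r1 + w2 * r2) / (w1 + w2)          # p(m | x^i, theta_hat)
    alpha = r.mean(axis=0)                       # shared mixing weights (Eq. (9), first line)
    # M-step: each view is updated with its own responsibilities (cf. Eq. (11))
    def m_step(X, resp):
        Nm = resp.sum(axis=0)
        mu = (resp.T @ X) / Nm[:, None]
        Sigma = np.array([(resp[:, m, None] * (X - mu[m])).T @ (X - mu[m]) / Nm[m]
                          for m in range(len(Nm))])
        return mu, Sigma
    mu1, S1 = m_step(X1, r1)
    mu2, S2 = m_step(X2, r2)
    return (alpha, mu1, S1), (alpha, mu2, S2), r

The design point illustrated here is that each view keeps its own component parameters while all views share the mixing weights and the combined responsibilities.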
Version II of Multi-View EM. Instead of estimating the model parameters by changing the data-fitting criterion as in the first version, the second version of Multi-View EM assumes that each probabilistic component of the finite mixture model satisfies

    p(x | \theta_m) = \frac{[p(x_1 | \theta_{mV_1})]^{w_1} \cdot [p(x_2 | \theta_{mV_2})]^{w_2}}{\int [p(x_1 | \theta_{mV_1})]^{w_1} dx_1 \cdot \int [p(x_2 | \theta_{mV_2})]^{w_2} dx_2},

where w_i denotes the weight of the i-th view. This assumption can be viewed as a probabilistic information fusion of the corresponding components of the two views; Garg et al. named it static fusion and discussed it in [11].

By utilizing Equation (7), the parameter updating formulas in the M-step are obtained as

    \alpha_m = \frac{1}{N} \sum_{i=1}^{N} p(m | x^i, \hat{\theta}),
    \hat{\theta}_{mV_1} = \arg\max_{\theta_{mV_1}} \sum_{i=1}^{N} \ln\left( \frac{[p(x_1^i | \theta_{mV_1})]^{w_1}}{\int [p(x_1 | \theta_{mV_1})]^{w_1} dx_1} \right) p(m | x^i, \hat{\theta}),
    \hat{\theta}_{mV_2} = \arg\max_{\theta_{mV_2}} \sum_{i=1}^{N} \ln\left( \frac{[p(x_2^i | \theta_{mV_2})]^{w_2}}{\int [p(x_2 | \theta_{mV_2})]^{w_2} dx_2} \right) p(m | x^i, \hat{\theta}),
    p(x^i | \theta_m) = \frac{[p(x_1^i | \theta_{mV_1})]^{w_1} \cdot [p(x_2^i | \theta_{mV_2})]^{w_2}}{\int [p(x_1 | \theta_{mV_1})]^{w_1} dx_1 \cdot \int [p(x_2 | \theta_{mV_2})]^{w_2} dx_2}.    (10)

It should be pointed out that this version does not correspond to an explicit independence assumption on the different views, i.e. it does not require p(x | \theta) = p(x_1 | \theta_{V_1}) \cdot p(x_2 | \theta_{V_2}), but it corresponds to the uncorrelated assumption [1] when w_1 = 1, w_2 = 1. Also notice that when w_1 = 1, w_2 = 1, Version II of Multi-View EM implicitly reverts to the standard EM algorithm.

Since Co-training and Co-EM deal with semi-supervised learning scenarios, where a labeled data set T and an unlabeled data set U are used to learn a classification system, we also present Multi-View EM for semi-supervised learning. Nigam et al. proposed a scheme to combine information from labeled and unlabeled data for text classification in the EM framework [12]. This algorithm first builds an initial classifier \hat{\theta} with all the labeled data T and then runs EM iterations until convergence:

E-step: for all the unlabeled data x_u^i \in U, calculate the probability p(m | x_u^i, \hat{\theta}) that each mixture component m (or class m, when each class is represented by one component) generated x_u^i; for all the labeled data (x_l^i, y^i) \in T, set p(m | x_l^i, \hat{\theta}) = 1 when m = y^i and p(m | x_l^i, \hat{\theta}) = 0 otherwise.

M-step: re-estimate the classifier \hat{\theta} with all the labeled and unlabeled data x^i and their probabilistic labels p(m | x^i, \hat{\theta}).

Following this scheme, Multi-View EM can be used for semi-supervised learning in the same way.

2.4 Implementations of Multi-View EM

For a GMM, where \theta_m = {\mu_m, \Sigma_m}, the updating formulas in the M-step for \hat{\theta}_{mV_t} are

    \mu_{mt} = \frac{\sum_{i=1}^{N} x_t^i p(m | x_t^i, \hat{\theta}_{V_t})}{\sum_{i=1}^{N} p(m | x_t^i, \hat{\theta}_{V_t})},
    \Sigma_{mt} = \frac{\sum_{i=1}^{N} p(m | x_t^i, \hat{\theta}_{V_t}) (x_t^i - \mu_{mt})(x_t^i - \mu_{mt})^T}{\sum_{i=1}^{N} p(m | x_t^i, \hat{\theta}_{V_t})}    (11)

in Version I, and

    \mu_{mt} = \frac{\sum_{i=1}^{N} x_t^i p(m | x^i, \hat{\theta})}{\sum_{i=1}^{N} p(m | x^i, \hat{\theta})},
    \Sigma_{mt} = w_t \cdot \frac{\sum_{i=1}^{N} p(m | x^i, \hat{\theta}) (x_t^i - \mu_{mt})(x_t^i - \mu_{mt})^T}{\sum_{i=1}^{N} p(m | x^i, \hat{\theta})}    (12)

in Version II.

For the Naive Bayes classifier, which is usually employed for text document classification, consider the simple case in which one document class (or topic) consists of only one component (this can easily be generalized to the case where one topic involves multiple components [12]). Then \theta_m = { \theta_{a_j | m} = P(a_j | m) : j \in {1, 2, ..., n} } with \sum_j P(a_j | m) = 1, where the set {a_j} denotes the input attributes of all the training samples. For Version I of Multi-View EM, the updating formula in the M-step for \hat{\theta}_{mV_t} is [12]

    \theta_{a_j | m V_t} = \frac{1 + \sum_{i=1}^{N} Value(x_t^i, a_j) P(m | x_t^i, \hat{\theta}_{V_t})}{n + \sum_{s=1}^{n} \sum_{i=1}^{N} Value(x_t^i, a_s) P(m | x_t^i, \hat{\theta}_{V_t})},    (13)

where Value(x_t^i, a_j) denotes the value of attribute a_j in the t-th view of sample x^i. For example, in text classification, Value(x_t^i, a_j) is the number of times word a_j occurs in the t-th view of document x^i. For Version II of Multi-View EM, the parameter estimation of the Naive Bayes classifier has no analytical updating formulas and can only be solved by the Generalized EM algorithm [9].
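The following is a minimal sketch (again an illustration under our assumptions, with helper names of our own choosing) of the per-view Naive Bayes M-step of Eq. (13) together with the E-step clamping used in the semi-supervised scheme described above.

# Sketch of the per-view Naive Bayes M-step (Eq. (13)) and the semi-supervised
# E-step clamping of labeled samples; illustrative only.
import numpy as np

def naive_bayes_m_step(counts, resp):
    """counts: N x n matrix of Value(x_t^i, a_j) for one view;
    resp: N x k matrix of P(m | x_t^i, theta_hat_Vt).
    Returns the k x n matrix theta[a_j | m, V_t] with Laplace smoothing as in Eq. (13)."""
    N, n = counts.shape
    weighted = resp.T @ counts                   # k x n: sum_i Value(x_t^i, a_j) P(m | x_t^i)
    return (1.0 + weighted) / (n + weighted.sum(axis=1, keepdims=True))

def clamp_labeled(resp, labels, k):
    """Semi-supervised E-step: labeled samples get hard 0/1 responsibilities.
    labels[i] is the class index of sample i, or -1 if the sample is unlabeled."""
    resp = resp.copy()
    for i, y in enumerate(labels):
        if y >= 0:
            resp[i] = 0.0
            resp[i, y] = 1.0
    return resp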
It can be seen that, besides GMM and the Naive Bayes classifier, Multi-View EM can employ different finite mixture models in different views, which can be intuitively regarded as designing a different classifier in each view and then combining them for classification.

3 Experiments

In Section 3.1, Multi-View EM is first compared with standard EM when fitting GMMs to synthetic data and the USPS dataset in unsupervised learning scenarios. Then, in Section 3.2, Multi-View EM is compared with Co-training, Co-EM and standard EM for designing a Naive Bayes classifier on the WebKB data in a semi-supervised learning scenario.

3.1 Unsupervised Learning Scenario

Synthetic data. Synthetic data are generated as follows: for the finite mixture model, first sample the m-th component according to \alpha_m and record this sample's label as m, and then sample View 1 x_1 and View 2 x_2 of x according to p(x_1 | \theta_{mV_1}) and p(x_2 | \theta_{mV_2}) respectively, thus obtaining the complete data x = (x_1, x_2, m). In the experiments, two synthetic datasets of 4000 samples each were generated according to the parameters in Table 3.

Table 3. Parameters of the models generating the synthetic datasets

Dataset 1: \alpha_1 = 0.6, \alpha_2 = 0.4.
  p(x_1 | \theta_{mV_1}): 2-component GMM with \Sigma_1 = [[1, 0.5], [0.5, 1.5]], \mu_1 = [1, 1]^T and \Sigma_2 = [[0.3, 0], [0, 0.6]], \mu_2 = [2, 2]^T.
  p(x_2 | \theta_{mV_2}): 2-component GMM with \Sigma_1 = [[0.3, 0], [0, 0.6]], \mu_1 = [2, 2]^T and \Sigma_2 = [[1, 0.5], [0.5, 1.5]], \mu_2 = [1, 1]^T.
Dataset 2: \alpha_1 = 0.6, \alpha_2 = 0.4.
  p(x_1 | \theta_{mV_1}): 2-component GMM with \Sigma_1 = [[1, 0.5], [0.5, 1.5]], \mu_1 = [1, 1]^T and \Sigma_2 = [[0.3, 0], [0, 0.6]], \mu_2 = [2, 2]^T.
  p(x_2 | \theta_{mV_2}): 2-component one-dimensional GMM with \Sigma_1 = 0.2, \mu_1 = 0.6 and \Sigma_2 = 0.2, \mu_2 = 0.3.

In the unsupervised learning scenario, the labels of the complete data were removed entirely and only the view parts {x_1, x_2} were used for clustering. The performance of the different algorithms in the unsupervised setting is evaluated by the error rate, obtained as follows. After all the samples have been assigned to clusters by an unsupervised learning algorithm, each sample has a cluster label. Since the ground-truth label of each sample is known in this controlled experiment, for each cluster we find the ground-truth label shared by most samples in that cluster and take it as the cluster label. A sample whose cluster label does not agree with its true label is counted as wrongly classified, and the error rate is computed accordingly.

In the experiments, the two versions of Multi-View EM and standard EM were applied to fit a 2-component GMM to these synthetic datasets. Each algorithm was run 50 times and the corresponding average error rates of the clustering results are presented in Table 4. Note that standard EM used the whole feature set for clustering.

Table 4. Average error rates on the synthetic datasets

                                                 Dataset 1   Dataset 2
Version I (Multi-View EM), w1 = 0.2, w2 = 0.8     0.1770      0.2659
Version I (Multi-View EM), w1 = 0.8, w2 = 0.2     0.2268      0.2414
Version II (Multi-View EM), w1 = 0.2, w2 = 0.8    0.1312      0.1323
Version II (Multi-View EM), w1 = 0.8, w2 = 0.2    0.1320      0.1322
EM using the complete feature set                 0.1326      0.1380

It can be observed that Version II of Multi-View EM, which exploits the uncorrelated property of the synthetic data, achieves the lowest error rate. Standard EM using the complete feature set also performs quite well on both synthetic datasets. This demonstrates that when the dimension of the data is not very high and there are enough samples, a GMM fitted to the full-dimensional data by standard EM captures the distribution of the data fairly well without introducing feature splits.
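For concreteness, the error rate described above (majority ground-truth label per cluster) could be computed as in the following sketch; the helper name is ours.

# Sketch of the clustering error rate: each cluster is labeled by the majority
# ground-truth label of its members; illustrative only.
import numpy as np

def clustering_error_rate(cluster_ids, true_labels):
    """cluster_ids, true_labels: integer arrays of length N."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    errors = 0
    for c in np.unique(cluster_ids):
        members = true_labels[cluster_ids == c]
        majority = np.bincount(members).argmax()   # cluster label = most common true label
        errors += np.sum(members != majority)      # samples disagreeing with the cluster label
    return errors / len(true_labels)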
USPS data set. There are 9298 labeled samples in total in the USPS set of handwritten postal codes (each sample is a 16x16 image matrix with each element lying in the interval [-1, 1]). We randomly selected 3000 labeled samples for the experiments. Two views of the data were created as follows, which may be somewhat artificial. Each sample's image matrix was divided into two parts: the upper half of the image (the first 8x16 entries of the matrix) and the bottom half (the last 8x16 entries). We reduced the dimension of each part by PCA according to the energy of the eigenvalues [13], requiring the cumulative eigenvalue energy to be no less than 75%. In this way, the dimensions of View 1 and View 2 were automatically determined to be 8 and 7 respectively. We then removed the samples' labels and applied Multi-View EM and standard EM to fit a 10-component GMM to the data. Accuracy, obtained in the same way as in Section 3.1, was used to evaluate the performance of these unsupervised learning algorithms. As before, standard EM used the whole 15-dimensional feature set for learning.

Since EM and Multi-View EM only guarantee convergence to local optima, we randomly initialized the model parameters and ran both EM and Multi-View EM 50 times. Table 5 shows how the average accuracy over the 50 runs varies with the weights given to the two views in the two versions of Multi-View EM. It should be pointed out that, for convenience, we set w_2 = 1 - w_1 in the experiments, which is not a necessary constraint in Multi-View EM. The clustering results of standard EM are also given in Table 5. It can be observed that, on the USPS dataset, standard EM using the whole feature set outperformed both versions of Multi-View EM.

Table 5. Averages and variations of accuracy on the USPS dataset for Multi-View EM and EM

w1            0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9
w2            0.9     0.8     0.7     0.6     0.5     0.4     0.3     0.2     0.1
Version I     0.560   0.563   0.562   0.552   0.478   0.571   0.580   0.572   0.576
              ±0.027  ±0.024  ±0.029  ±0.029  ±0.031  ±0.032  ±0.032  ±0.065  ±0.033
Version II    0.663   0.680   0.685   0.677   0.664   0.690   0.664   0.675   0.666
              ±0.030  ±0.046  ±0.029  ±0.026  ±0.106  ±0.034  ±0.106  ±0.067  ±0.082
EM using all 15 dimensions: 0.703 ± 0.080
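A sketch of how such a two-view split of USPS images could be constructed, assuming the 75% cumulative eigenvalue-energy rule described above (the authors' exact preprocessing may differ), is given below.

# Sketch of the two-view construction for USPS: split each 16x16 image into top and
# bottom halves and reduce each half by PCA, keeping enough components to retain at
# least 75% of the eigenvalue energy; illustrative only.
import numpy as np

def pca_reduce(X, energy=0.75):
    """Project the rows of X onto the leading principal components holding `energy` of the variance."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2
    n_comp = int(np.searchsorted(np.cumsum(var) / var.sum(), energy) + 1)
    return Xc @ Vt[:n_comp].T

def make_usps_views(images):
    """images: N x 16 x 16 array; returns the two PCA-reduced views."""
    top = images[:, :8, :].reshape(len(images), -1)      # upper 8x16 block
    bottom = images[:, 8:, :].reshape(len(images), -1)   # lower 8x16 block
    return pca_reduce(top), pca_reduce(bottom)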
3.2 Semi-supervised Learning Scenario

The WebKB data contain 8,282 pages in total, manually classified into 7 categories. In the experiment, we first extracted a page-based view and a hyperlink-based view from these pages using the document feature selection method "chi-square statistics + Document Frequency Cut" [14]. Here 3000 features were selected to form the page-based view and 2000 features to form the hyperlink-based view. We then randomly selected 2000 samples from the "course" category and the "others" category to design a 2-category Naive Bayes classifier for testing the performance of Multi-View EM, Co-training, Co-EM and standard EM (using all 5000 features) in the semi-supervised setting.

After removing all the labels of the data, we randomly re-labeled part of them and then tested the performance of Version I of Multi-View EM, Co-training, Co-EM and EM on the semi-supervised datasets created in this way. We randomly re-labeled 200, 400 and 600 of the 2000 samples, 20 times each, and obtained the averages and variations of the accuracy of the different semi-supervised learning algorithms. In Multi-View EM, the weights of the page-based view and the hyperlink-based view were set to 0.2 and 0.8 respectively. The experimental results are presented in Table 6. It can be observed that in this semi-supervised learning task Multi-View EM outperformed Co-training, Co-EM and EM.

Table 6. Multi-View EM, Co-training, Co-EM and EM for semi-supervised learning (accuracy)

Labeled samples   Version I of Multi-View EM   Co-training    Co-EM          EM
200               0.872±0.013                  0.819±0.017    0.861±0.015    0.748±0.011
400               0.893±0.006                  0.849±0.009    0.881±0.005    0.780±0.008
600               0.907±0.004                  0.869±0.009    0.897±0.004    0.808±0.007

4 Performance Analysis

The experiments on synthetic data and USPS data demonstrate that when the dimension of the data is not high and there are enough samples, a GMM fitted by standard EM can capture the distribution of the data quite well. But for many high-dimensional classification problems, such as text classification, a GMM can hardly be applied, and Naive Bayes classifiers are usually used instead. Nigam and Ghani analyzed the effectiveness of Co-training and Co-EM [7]. They argued that, besides the compatible and uncorrelated assumptions, the performance of the two algorithms is also affected by another important underlying assumption: the Naive Bayes assumption. The fact that algorithms which introduce feature splits outperform those which do not on text classification problems can be intuitively explained as follows: the feature set (the vocabulary) is usually very large, and when feature splits are introduced, the underlying Naive Bayes assumption is better satisfied in each view [7]. However, when a GMM is fitted to data whose distribution does not satisfy the Naive Bayes assumption, as in Section 3.1, algorithms using feature splits are not guaranteed to outperform algorithms using the whole feature set: standard EM, which directly uses the whole feature set for learning, performs fairly well. It can also be observed that, on the synthetic data in Section 3.1, Version II of Multi-View EM, which exploits the uncorrelated property of the two views and uses feature splits, outperforms standard EM by a small margin.

Derived entirely within the EM framework, Multi-View EM is a more appropriate probabilistic version of Co-training than Co-EM. In Table 4, Version II of Multi-View EM achieves the best performance in the unsupervised learning scenario with synthetic data. In Table 6, Multi-View EM achieves the best performance in the semi-supervised learning scenario with WebKB data.

For Multi-View EM, one important issue is how to choose appropriate weights for the different views. This remains an open issue and is closely related to research on how to choose appropriate scale factors for different features [15]. In the semi-supervised learning scenario, the weights can be chosen through cross-validation. In the unsupervised learning scenario, where label information is unavailable, other prior knowledge of the data should be used, such as the importance of the different views for the clustering task at hand. Note that among Multi-View EM, standard EM, Co-training and Co-EM, only the first two can handle the unsupervised learning scenario, and standard EM still requires additional evaluation methods to assess the models obtained.
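As one possible illustration of the cross-validation option mentioned above (an assumed procedure, not one specified in the paper; train_and_score is a hypothetical helper that trains Multi-View EM with the given weights and returns validation accuracy), the weight w_1 could be selected as follows.

# Sketch of choosing the view weight w1 (with w2 = 1 - w1, as in Section 3.1) by
# cross-validation on the labeled part of the data; illustrative only.
import numpy as np

def choose_weight(labeled_data, unlabeled_data, train_and_score, grid=None, n_folds=5, seed=0):
    grid = grid if grid is not None else np.arange(0.1, 1.0, 0.1)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labeled_data))
    folds = np.array_split(idx, n_folds)
    scores = []
    for w1 in grid:
        fold_acc = []
        for f in range(n_folds):
            val = folds[f]
            train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            # hypothetical helper: train on the training split plus unlabeled data,
            # score on the validation split
            fold_acc.append(train_and_score(labeled_data, train, val, unlabeled_data,
                                            w1=w1, w2=1.0 - w1))
        scores.append(np.mean(fold_acc))
    return grid[int(np.argmax(scores))]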
In the semi-supervised learning setting, Multi-View EM, like Co-EM, uses the unlabeled data iteratively and probabilistically labels all the data at each round, whereas Co-training uses the unlabeled data incrementally and labels only the most confidently predicted data at each round. Unlike Co-EM, Multi-View EM does not interchange the probabilistic labels of the two views but combines them for learning at each round, and its convergence is guaranteed within the EM framework.

Notice that Multi-View EM can simultaneously utilize different classifiers or models in different views, which can be viewed as a combination of heterogeneous classifiers from the viewpoint of classifier fusion; it can also simultaneously employ different maximization criteria, such as ML or MAP, in different views. These advantages make Multi-View EM applicable to a broader range of multi-view learning problems. From the viewpoint of classifier fusion, Version I of Multi-View EM can be intuitively regarded as independently designing a classifier in each view before each EM iteration and then using the weighted-sum criterion of classifier fusion [16] to combine the probabilistic labels from all the classifiers. Technically, other classifier fusion criteria, such as the product criterion or the max criterion, could also be used to combine the probabilistic labels, but strictly speaking, within the EM framework these criteria do not guarantee convergence.

5 Conclusions

In this paper, Multi-View EM for finite mixture models is proposed to handle real-world learning problems that have natural feature splits. As a probabilistic version of Co-training that is more principled than Co-EM, Multi-View EM is derived entirely within the EM framework and its convergence is theoretically guaranteed. Like Co-training and Co-EM, Multi-View EM splits features, and it can be applied to both unsupervised and semi-supervised learning tasks. Moreover, Multi-View EM easily handles more than two views and can simultaneously utilize different classifiers and different optimization criteria, such as ML and MAP, in different views, so that it can be employed for a broader range of real-world learning tasks.

Two versions of the Multi-View EM algorithm are provided in this paper. Version I of Multi-View EM is obtained by using another criterion instead of ML or MAP in the M-step; intuitively, it runs EM in each view and, before each new EM iteration, combines the weighted probabilistic labels generated in each view. Version II of Multi-View EM can be viewed as a static probabilistic fusion of the corresponding components of the finite mixture models in the different views. GMM-based and Naive Bayes classifier based implementations of Multi-View EM are also presented.

For text classification problems, algorithms in the Co-training setting have been successfully applied to learning with labeled and unlabeled data. Previous research demonstrated that when the compatible and uncorrelated assumptions are satisfied, these algorithms perform better than algorithms that do not utilize feature splits. However, when the dimension of the data is not very high and there are enough samples, a GMM fitted by standard EM to the whole feature set performs very well. When the underlying distribution of the data does satisfy the uncorrelated assumption, Version II of Multi-View EM performs a little better than EM using the whole feature set.
Following the feature-split scheme, Multi-View EM can be expected to have broad applications in multi-view learning problems. Future work includes introducing prior knowledge of the data and criteria for automatically choosing more appropriate weights for the different views, and applying Multi-View EM to more semi-supervised learning problems. Active learning could also be introduced, so that an active-learning version of Multi-View EM, similar to Co-EMT, could be designed.

References

1. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the 11th Annual Conference on Computational Learning Theory. (1998) 92-100
2. Muslea, I., Minton, S., Knoblock, C.A.: Active + semi-supervised learning = robust multi-view learning. In: Proceedings of ICML-02. (2002)
3. Muslea, I., Minton, S., Knoblock, C.A.: Selective sampling with redundant views. In: Proceedings of the National Conference on Artificial Intelligence. (2000) 621-626
4. Collins, M., Singer, Y.: Unsupervised models for named entity classification. In: Proceedings of EMNLP-99. (1999)
5. Riloff, E., Jones, R.: Learning dictionaries for information extraction using multi-level bootstrapping. In: Proceedings of AAAI-99. (1999)
6. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of ACL-95. (1995)
7. Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: Proceedings of the Conference on Information and Knowledge Management. (2000) 86-93
8. LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., Jackel, L.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1 (1989) 541-551
9. Bilmes, J.A.: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. ICSI Technical Report TR-97-021, U.C. Berkeley (1997)
10. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39 (1977) 1-38
11. Garg, A., Potamianos, G., Neti, C., Huang, T.: Frame-dependent multi-stream reliability indicators for audio-visual speech recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2003)
12. Nigam, K., McCallum, A., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Machine Learning 39 (2000) 103-134
13. Kirby, M.: Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns. John Wiley and Sons, New York (2000)
14. Rogati, M., Yang, Y.: High-performing feature selection for text classification. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, McLean, Virginia, USA (2002)
15. Wang, J.D., Zhang, C.S., Shum, H.Y.: Face image resolution versus face recognition performance based on two global methods. In: Proceedings of the Asian Conference on Computer Vision (ACCV 2004), Jeju Island, Korea (2004)
16. Suen, C.Y.W., Lam, L.: Analyzing the effectiveness and applicability of co-training. In: Proceedings of the First International Workshop on Multiple Classifier Systems (MCS 2000), Cagliari, Italy (2000) 52-66