Multi-View EM Algorithm for Finite Mixture Models
Xing Yi, Changshui Zhang, and Yunpeng Xu
State Key Laboratory of Intelligent Technology and Systems,
Department of Automation, Tsinghua University, Beijing 100084, P. R. China
Abstract. In this paper, we propose the Multi-View Expectation-Maximization (Multi-View EM) algorithm for finite mixture models, designed to handle real-world learning problems that have natural feature splits. Like Co-training and Co-EM, Multi-View EM exploits feature splits, but it treats multi-view learning problems within the EM framework. Compared with other algorithms in the Co-training setting, the proposed algorithm has several appealing advantages: it can be applied to both unsupervised and semi-supervised learning tasks; it easily handles problems with more than two views; it can simultaneously employ different classifiers and different optimization criteria, such as ML and MAP, in different views; and its convergence is theoretically guaranteed. Experiments on synthetic data, the USPS data, and the WebKB data¹ demonstrate that Multi-View EM performs well compared with Co-EM, Co-training, and standard EM.
1 Introduction
Semi-supervised learning, which combines information from both labeled and unlabeled data, has drawn wide attention. Some related research deals with labeled and unlabeled data in problem domains where features naturally divide into different subsets (views) [1][2][3][4][5][6]. This is a reasonable approach, because real-world learning problems often have natural feature splits: in web-page classification, features can be divided into two disjoint subsets, one concerning words that appear on the page and another concerning words that appear in hyperlinks pointing to that page; in audio-visual speech recognition, features consist of an audio part and a visual part; in color image segmentation, features usually involve a coordinate view and a color view; and so on. In the Co-training setting, where features naturally partition into two sets, many algorithms have been proposed to exploit this feature division to boost the performance of learning systems with labeled and unlabeled data, such as Co-training [1], Co-EM [7], Co-Testing [3], and Co-EMT [2]. Blum and Mitchell provided a PAC-style analysis of Co-training, which shows that when the two views are compatible and uncorrelated, Co-training successfully learns the target concept from labeled and unlabeled data [1]. Nigam and Ghani demonstrated
1 This data is available at http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/webkb-data.gtar.gz
that when a natural independent split of input features exists, algorithms utilizing this feature split outperform algorithms that do not. They also argued that Co-EM is a closer match to the theoretical argument in [1] than Co-training [7]. Co-EM uses the hypothesis learned in one view to probabilistically label the examples in the other. Intuitively, Co-EM runs the Expectation-Maximization (EM) algorithm in each view, and before each new EM iteration interchanges the probabilistic labels generated in each view. However, viewed from within the EM framework, the step of interchanging the two views' labels makes Co-EM only an ad hoc design, and its convergence is not guaranteed.
In this paper, we propose the Multi-View Expectation-Maximization (Multi-View EM) algorithm for finite mixture models, which follows the feature-split scheme and deals with these multi-view learning scenarios in the framework of the EM algorithm instead of the PAC model. The proposed algorithm can easily handle problems with more than two views and guarantees convergence.
The remainder of the paper is organized as follows. In Section 2, we briefly review the Co-training setting and EM for finite mixture models, and then describe Multi-View EM for finite mixture models in detail. We also provide some implementations of Multi-View EM in Section 2.4, such as a Gaussian mixture model (GMM) based version and a Naïve Bayes classifier based version. Section 3 presents experimental results comparing Multi-View EM with standard EM on synthetic datasets and the USPS [8] dataset, and comparing Multi-View EM with Co-training, Co-EM, and standard EM on the WebKB data [1]. A performance analysis is given in Section 4. Section 5 concludes.
2 Multi-View EM Algorithm for Finite Mixture Models

2.1 The Co-training Setting
The Co-training setting [1] assumes that in real-world learning problems with a natural way to partition features into two views V1, V2, an example x can be described by a triple [x1, x2, l], where x1, x2 are x's descriptions in the two views and l is its label. This can easily be generalized to the multi-view setting with more than two views.
In this setting, given a learning algorithm L, the sets T and U of labeled and unlabeled samples, and the number k of iterations to be performed, Table 1 and Table 2 describe the flow charts of Co-training and Co-EM, respectively. Intuitively, Co-training labels the data on which the classifier in each view makes the most confident predictions, and Co-EM can be viewed as a probabilistic version of Co-training [7].
2.2 Finite Mixture Models and the EM Algorithm

A d-dimensional random variable x = [x_1, x_2, ..., x_d]^T is said to follow a k-component finite mixture distribution if its probability density function can be written as

p(x|θ) = Σ_{m=1}^{k} α_m p(x|θ_m),    (1)
where α_m is the prior probability of the m-th component and satisfies

α_m ≥ 0 and Σ_{m=1}^{k} α_m = 1,    (2)

θ_m is the parameter of the m-th density model, and θ = {(α_m, θ_m), m = 1, 2, ..., k} is the parameter set of the mixture model. For GMM, θ_m = {μ_m, Σ_m}.
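As a concrete illustration of Eqs. (1)-(2), the following sketch (assuming NumPy; the function names are my own, not the paper's) evaluates a two-component GMM density at a point:

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Density of a multivariate Gaussian N(mu, cov) at point x."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def mixture_pdf(x, alphas, mus, covs):
    """Eq. (1): p(x|theta) = sum_m alpha_m * p(x|theta_m)."""
    return sum(a * gaussian_pdf(x, mu, cov)
               for a, mu, cov in zip(alphas, mus, covs))

# Mixing weights satisfy Eq. (2): nonnegative and summing to 1.
alphas = [0.6, 0.4]
mus = [np.array([1.0, 1.0]), np.array([2.0, 2.0])]
covs = [np.array([[1.0, 0.5], [0.5, 1.5]]),
        np.array([[0.3, 0.0], [0.0, 0.6]])]
p = mixture_pdf(np.array([1.5, 1.5]), alphas, mus, covs)
```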
EM has been widely used for parameter estimation of finite mixture models. Suppose that a set Z consists of observed data X and unobserved data Y; then Z = (X, Y) and X are called the complete data and incomplete data, respectively. Under Maximum Likelihood (ML) estimation, the E-step calculates the expected complete-data log-likelihood, defined by the so-called Q function [9],

Q(θ, θ̂(t)) ≡ E[log p(X, Y|θ) | X, θ̂(t)].    (3)
The M-step updates the parameters by

θ̂(t+1) = arg max_θ Q(θ, θ̂(t)).    (4)
The EM algorithm performs the E-step and M-step iteratively, and its convergence is guaranteed. The EM algorithm can easily be generalized to Maximum a Posteriori (MAP) estimation: denoting the log of the prior density of θ by G(θ), we simply maximize Q(θ, θ̂(t)) + G(θ) in the M-step [10], which leads to different iterative updating formulas for θ.
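The E-step/M-step loop of Eqs. (3)-(4) can be sketched for a 1-D GMM as follows; this is a minimal illustration (not the paper's code), with arbitrary synthetic data and initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two well-separated Gaussians.
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(4.0, 1.0, 200)])

k = 2
alpha = np.full(k, 1.0 / k)
mu = np.array([-1.0, 1.0])
var = np.ones(k)

def loglik():
    dens = alpha * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
        / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum()

lls = []
for _ in range(30):
    # E-step: responsibilities p(m | x_i, theta_hat).
    dens = alpha * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
        / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximize the Q function of Eqs. (3)-(4).
    nm = resp.sum(axis=0)
    alpha = nm / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nm
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nm
    lls.append(loglik())
```

The recorded log-likelihoods are non-decreasing across iterations, illustrating the guaranteed convergence mentioned above.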
2.3 Multi-View EM Algorithm for Finite Mixture Models

For convenience, we describe the two-view version of Multi-View EM in this paper; it can easily be generalized to situations with more views, with only slight changes to the corresponding formulas.
Table 1. Flow chart of Co-training
Loop for k iterations or until all samples have been labeled:
  - use L, V1(T), V2(T) to create classifiers h1 and h2
  - for each class Ci:
    - let E1, E2 be the unlabeled examples on which h1 and h2 make the most confident predictions for Ci
    - remove E1, E2 from U, label them according to h1 and h2 respectively, and add them to T
Combine the predictions of h1 and h2
Table 2. Flow chart of Co-EM
- let S = T ∪ U, and let h1 be the classifier obtained by training L on T
Loop for k iterations:
  - New1 = ProbabilisticallyLabel(S, h1)
  - use L, V2(New1) to create classifier h2
  - New2 = ProbabilisticallyLabel(S, h2)
  - use L, V1(New2) to create classifier h1
Combine the predictions of h1 and h2
In the finite mixture models and Co-training setting, it holds that:

p(x_1|θ) = Σ_{m=1}^{k} α_m Σ_{x_2} p(x_1, x_2|θ_m) = Σ_{m=1}^{k} α_m p(x_1|θ_{mV1}),
p(x_2|θ) = Σ_{m=1}^{k} α_m Σ_{x_1} p(x_1, x_2|θ_m) = Σ_{m=1}^{k} α_m p(x_2|θ_{mV2}),    (5)

where {(α_m, θ_{mV1}), m = 1, 2, ..., k} = θ_{V1} and {(α_m, θ_{mV2}), m = 1, 2, ..., k} = θ_{V2} denote the models' parameter sets in the two views, respectively.
Different versions of Multi-View EM for finite mixture models can be deduced
given different estimation criteria and different assumptions.
Version I of Multi-View EM

The first version of Multi-View EM fits finite mixture models to the observed data according to a criterion other than the ML and MAP criteria, which can be formulated as:

Q''(θ, θ̂(t)) ≡ E[log(p(X_1, Y|θ_{V1})^{w1} · p(X_2, Y|θ_{V2})^{w2}) | X, θ̂(t)],    (6)
θ̂(t+1) = arg max_θ Q''(θ, θ̂(t)),

where w_i denotes the weight of the i-th view. The M-step of Version I of Multi-View EM updates the parameters by maximizing the Q'' function. Note that in the EM parameter estimation of mixture models, for the Q function [9], it holds that:
Q(θ, θ̂(t)) ≡ E[log p(X, Y|θ) | X, θ̂(t)]
= Σ_{m=1}^{k} Σ_{i=1}^{N} ln(α_m) p(m|x^i, θ̂) + Σ_{m=1}^{k} Σ_{i=1}^{N} ln(p(x^i|θ_m)) p(m|x^i, θ̂),    (7)

where p(m|x^i, θ̂) denotes the probability that component m generates sample x^i. Therefore, in Version I of Multi-View EM, it holds that:
Q''(θ, θ̂(t)) = Σ_{m=1}^{k} Σ_{i=1}^{N} ln(α_m)(w1 · p(m|x^i_1, θ̂_{V1}) + w2 · p(m|x^i_2, θ̂_{V2}))
+ w1 · Σ_{m=1}^{k} Σ_{i=1}^{N} ln(p(x^i_1|θ_{mV1})) p(m|x^i_1, θ̂_{V1}) + w2 · Σ_{m=1}^{k} Σ_{i=1}^{N} ln(p(x^i_2|θ_{mV2})) p(m|x^i_2, θ̂_{V2}).    (8)
The M-step of Multi-View EM then updates the parameters by maximizing this Q'' function, which can be formulated as:

α_m = (1/N) Σ_{i=1}^{N} p(m|x^i, θ̂),
θ̂_{mV1} = arg max_{θ_{mV1}} Σ_{i=1}^{N} ln(p(x^i_1|θ_{mV1})) p(m|x^i_1, θ̂_{V1}),
θ̂_{mV2} = arg max_{θ_{mV2}} Σ_{i=1}^{N} ln(p(x^i_2|θ_{mV2})) p(m|x^i_2, θ̂_{V2}),    (9)
p(m|x^i, θ̂) = (w1/(w1 + w2)) · p(m|x^i_1, θ̂_{V1}) + (w2/(w1 + w2)) · p(m|x^i_2, θ̂_{V2}).
This version of the Multi-View EM algorithm can be intuitively regarded as running EM in each view and, before each new EM iteration, combining the weighted probabilistic labels generated in each view.
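This intuition can be sketched as follows: a per-view E-step for a GMM view (a hypothetical NumPy helper of my own), followed by the weighted label combination from the last line of Eq. (9):

```python
import numpy as np

def view_responsibilities(x, alpha, mus, covs):
    """Per-view E-step: p(m | x_t^i, theta_hat_Vt) for a GMM view."""
    n, k = len(x), len(alpha)
    dens = np.empty((n, k))
    for m in range(k):
        diff = x - mus[m]
        inv = np.linalg.inv(covs[m])
        norm = np.sqrt((2 * np.pi) ** x.shape[1] * np.linalg.det(covs[m]))
        dens[:, m] = alpha[m] * np.exp(
            -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)) / norm
    return dens / dens.sum(axis=1, keepdims=True)

def combine_labels(resp1, resp2, w1, w2):
    """Last line of Eq. (9): weighted sum of the views' probabilistic labels."""
    return (w1 * resp1 + w2 * resp2) / (w1 + w2)

# Toy usage: two views of 10 samples under a shared 2-component model.
rng = np.random.default_rng(1)
x1 = rng.normal(size=(10, 2))
alpha = np.array([0.5, 0.5])
mus = [np.zeros(2), np.full(2, 3.0)]
covs = [np.eye(2), np.eye(2)]
r1 = view_responsibilities(x1, alpha, mus, covs)
r2 = view_responsibilities(x1 + 0.1, alpha, mus, covs)
r = combine_labels(r1, r2, 0.2, 0.8)
```

Because each view's responsibilities sum to one per sample, the combined labels do as well.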
Version II of Multi-View EM

Instead of estimating the model parameters by changing the data-fitting criterion as in the first version, the second version of Multi-View EM assumes that each probabilistic component in the finite mixture model satisfies:

p(x|θ_m) = [p(x_1|θ_{mV1})]^{w1} · [p(x_2|θ_{mV2})]^{w2} / (∫[p(x_1|θ_{mV1})]^{w1} dx_1 · ∫[p(x_2|θ_{mV2})]^{w2} dx_2),

where w_i denotes the weight of the i-th view. The assumption in the second version can be viewed as a probabilistic information fusion of the two views' components. Garg et al. named this static fusion and discussed it in [11]. Using Equation (7), it can be shown that the parameter updating formulas in the M-step are:
α_m = (1/N) Σ_{i=1}^{N} p(m|x^i, θ̂),
θ̂_{mV1} = arg max_{θ_{mV1}} Σ_{i=1}^{N} ln([p(x^i_1|θ_{mV1})]^{w1} / ∫[p(x_1|θ_{mV1})]^{w1} dx_1) p(m|x^i, θ̂),
θ̂_{mV2} = arg max_{θ_{mV2}} Σ_{i=1}^{N} ln([p(x^i_2|θ_{mV2})]^{w2} / ∫[p(x_2|θ_{mV2})]^{w2} dx_2) p(m|x^i, θ̂),    (10)
p(x^i|θ_m) = [p(x^i_1|θ_{mV1})]^{w1} · [p(x^i_2|θ_{mV2})]^{w2} / (∫[p(x_1|θ_{mV1})]^{w1} dx_1 · ∫[p(x_2|θ_{mV2})]^{w2} dx_2).
It should be pointed out that this version does not correspond to an explicit independence assumption between the views, i.e. it does not require p(x|θ) = p(x_1|θ_{V1}) · p(x_2|θ_{V2}); rather, it corresponds to the uncorrelated assumption [1] when w1 = 1, w2 = 1. Also notice that when w1 = 1, w2 = 1, Version II of Multi-View EM implicitly reverts to the standard EM algorithm.
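For Gaussian views, the fused component above has a convenient closed form: a normalized power [N(x; μ, Σ)]^w is again Gaussian with covariance Σ/w, so each fused component reduces to N(x_1; μ_{mV1}, Σ_{mV1}/w1) · N(x_2; μ_{mV2}, Σ_{mV2}/w2). The Version II E-step can then be sketched as follows (an illustrative NumPy sketch under that Gaussian assumption, not the paper's code):

```python
import numpy as np

def gauss(x, mu, cov):
    """Multivariate Gaussian density N(x; mu, cov)."""
    d = len(mu)
    diff = x - mu
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / \
        np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def fused_responsibilities(x1, x2, alpha, params1, params2, w1, w2):
    """E-step under the Version II static-fusion assumption for Gaussian
    views: each fused component is N(x1; mu_m1, Sigma_m1/w1) *
    N(x2; mu_m2, Sigma_m2/w2), since raising a Gaussian to power w and
    renormalizing divides its covariance by w."""
    k = len(alpha)
    dens = np.array([
        alpha[m]
        * gauss(x1, params1[m][0], params1[m][1] / w1)
        * gauss(x2, params2[m][0], params2[m][1] / w2)
        for m in range(k)
    ])
    return dens / dens.sum()

# Toy usage: a point near component 0 in both views.
alpha = np.array([0.5, 0.5])
params1 = [(np.zeros(2), np.eye(2)), (np.full(2, 3.0), np.eye(2))]
params2 = [(np.zeros(2), np.eye(2)), (np.full(2, 3.0), np.eye(2))]
r = fused_responsibilities(np.zeros(2), np.zeros(2),
                           alpha, params1, params2, 0.5, 0.5)
```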
Since Co-training and Co-EM deal with semi-supervised learning scenarios, where a labeled data set T and an unlabeled data set U are used to learn a classification system, Multi-View EM for semi-supervised learning is also presented here. Nigam et al. proposed a scheme to combine information from labeled and unlabeled data for text classification in the EM framework [12]. The algorithm first builds an initial classifier θ̂ from all the labeled data T, and then iterates the following EM steps until convergence:

E-step: for all unlabeled data x^i_u ∈ U, calculate the probability p(m|x^i_u, θ̂) that each mixture component m (or class m, when one class is represented by one component) generated x^i_u; for all labeled data (x^i_l, y^i) ∈ T, set p(m|x^i_l, θ̂) = 1 when m = y^i, and p(m|x^i_l, θ̂) = 0 otherwise.

M-step: re-estimate the classifier θ̂ with all the labeled and unlabeled data x^i and their probabilistic labels p(m|x^i, θ̂).

Following this scheme, Multi-View EM can similarly be used for semi-supervised learning.
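The labeled/unlabeled bookkeeping of this E-step can be sketched as follows (the `-1` encoding for unlabeled samples is an assumption of this sketch, not the paper's notation):

```python
import numpy as np

def clamp_labeled(resp, labels):
    """Semi-supervised E-step bookkeeping: unlabeled rows keep their
    model-based posteriors p(m | x_u^i, theta_hat); labeled rows are
    clamped to one-hot indicators of their known class.
    labels[i] is the class index of sample i, or -1 if unlabeled."""
    resp = resp.copy()
    for i, y in enumerate(labels):
        if y >= 0:
            resp[i] = 0.0
            resp[i, y] = 1.0
    return resp

# Toy usage: sample 0 unlabeled, sample 1 labeled with class 1.
resp = np.array([[0.7, 0.3], [0.6, 0.4]])
out = clamp_labeled(resp, [-1, 1])
```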
2.4 Implementations of Multi-View EM
For GMM, where θ_m = {μ_m, Σ_m}, the iterative updating formulas in the M-step for θ̂_{mVt} can be written as:

μ_{mt} = Σ_{i=1}^{N} x^i_t p(m|x^i_t, θ̂_{Vt}) / Σ_{i=1}^{N} p(m|x^i_t, θ̂_{Vt}),
Σ_{mt} = Σ_{i=1}^{N} p(m|x^i_t, θ̂_{Vt})(x^i_t − μ_{mt})(x^i_t − μ_{mt})^T / Σ_{i=1}^{N} p(m|x^i_t, θ̂_{Vt})    (11)

in Version I;
μ_{mt} = Σ_{i=1}^{N} x^i_t p(m|x^i, θ̂) / Σ_{i=1}^{N} p(m|x^i, θ̂),
Σ_{mt} = w_t · Σ_{i=1}^{N} p(m|x^i, θ̂)(x^i_t − μ_{mt})(x^i_t − μ_{mt})^T / Σ_{i=1}^{N} p(m|x^i, θ̂)    (12)

in Version II.
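A minimal sketch of the Version I per-view M-step of Eq. (11) (assuming NumPy; the function and variable names are illustrative, not the paper's):

```python
import numpy as np

def mstep_view(x_t, resp_t):
    """Version I M-step for view t (Eq. 11): weighted mean and covariance
    per component from that view's responsibilities
    resp_t[i, m] = p(m | x_t^i, theta_hat_Vt)."""
    nm = resp_t.sum(axis=0)                    # effective counts per component
    mus = (resp_t.T @ x_t) / nm[:, None]       # weighted means
    covs = []
    for m in range(resp_t.shape[1]):
        diff = x_t - mus[m]
        covs.append((resp_t[:, m, None] * diff).T @ diff / nm[m])
    return mus, covs

# Toy usage: two well-separated point clusters, hard responsibilities.
x = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.2, 4.0]])
resp = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
mus, covs = mstep_view(x, resp)
```

The Version II update of Eq. (12) differs only in using the fused responsibilities p(m|x^i, θ̂) and scaling the covariance by w_t.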
For the Naïve Bayes classifier, which is often employed for text document classification, consider the simple case in which one document class or topic consists of only one component (this can easily be generalized to one topic involving multiple components [12]). Then θ_m = {θ_{aj|m} = P(a_j|m) : j ∈ {1, 2, ..., n}, Σ_j P(a_j|m) = 1}, where the set {a_j} denotes the input attributes of all the training samples. For Version I of Multi-View EM, the iterative updating formulas in the M-step for θ̂_{mVt} can be written as [12]:
θ_{aj|mVt} = (1 + Σ_{i=1}^{N} Value(x^i_t, a_j) P(m|x^i_t, θ̂_{Vt})) / (n + Σ_{s=1}^{n} Σ_{i=1}^{N} Value(x^i_t, a_s) P(m|x^i_t, θ̂_{Vt})),    (13)
where Value(x^i_t, a_j) denotes the value of attribute a_j in the t-th view of sample x^i. For example, in a text classification task, Value(x^i_t, a_j) is the number of times word a_j occurs in the t-th view of document x^i. For Version II of Multi-View EM, the parameter estimation of the Naïve Bayes classifier has no analytical updating formulas and can only be solved by the Generalized EM algorithm [9].
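The smoothed update of Eq. (13) can be sketched for one view as follows (assuming NumPy; Value(·) is represented by a word-count matrix, a choice of this sketch):

```python
import numpy as np

def nb_word_probs(counts_t, resp_t):
    """Eq. (13): Laplace-smoothed estimate of theta_{a_j | m, V_t}.
    counts_t[i, j] = Value(x_t^i, a_j), the count of word a_j in view t
    of document i; resp_t[i, m] = P(m | x_t^i, theta_hat_Vt)."""
    n_words = counts_t.shape[1]
    weighted = resp_t.T @ counts_t     # sum_i Value(x_t^i, a_j) P(m | x_t^i)
    num = 1.0 + weighted
    den = n_words + weighted.sum(axis=1, keepdims=True)
    return num / den                   # shape (k, n_words); rows sum to 1

# Toy usage: 2 documents, 3 words, hard class responsibilities.
counts = np.array([[2, 0, 1], [0, 3, 0]])
resp = np.array([[1.0, 0.0], [0.0, 1.0]])
probs = nb_word_probs(counts, resp)
```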
Note that besides GMM and the Naïve Bayes classifier, Multi-View EM can employ other finite mixture models, different in different views, which can be intuitively regarded as designing a different classifier in each view and then combining them for classification.
Table 3. Parameters of models for generating synthetic datasets

Dataset 1: α1 = 0.6, α2 = 0.4.
  p(x1|θ_{mV1}): 2-component GMM, Σ1 = [1 0.5; 0.5 1.5], μ1 = (1, 1)^T; Σ2 = [0.3 0; 0 0.6], μ2 = (2, 2)^T.
  p(x2|θ_{mV2}): 2-component GMM, Σ1 = [0.3 0; 0 0.6], μ1 = (2, 2)^T; Σ2 = [1 0.5; 0.5 1.5], μ2 = (1, 1)^T.
Dataset 2: α1 = 0.6, α2 = 0.4.
  p(x1|θ_{mV1}): 2-component GMM, Σ1 = [1 0.5; 0.5 1.5], μ1 = (1, 1)^T; Σ2 = [0.3 0; 0 0.6], μ2 = (2, 2)^T.
  p(x2|θ_{mV2}): 2-component GMM (one-dimensional), Σ1 = 0.2, μ1 = 0.6; Σ2 = 0.2, μ2 = 0.3.

3 Experiments
In Section 3.1, Multi-View EM is first compared with standard EM when fitting GMMs to synthetic data and the USPS dataset in unsupervised learning scenarios. Then, in Section 3.2, Multi-View EM is compared with Co-training, Co-EM, and standard EM for designing a Naïve Bayes classifier on the WebKB data in semi-supervised learning scenarios.
3.1 Unsupervised Learning Scenario
Synthetic data. Synthetic data were generated as follows: for the finite mixture model, first sample the m-th component according to α_m and record the sample's label as m, and then sample View 1 x_1 and View 2 x_2 of x according to p(x_1|θ_{mV1}) and p(x_2|θ_{mV2}) respectively, obtaining the complete data x = (x_1, x_2, m). In the experiments, two synthetic datasets, each with 4000 samples, were generated according to the parameters in Table 3.

In the unsupervised learning scenario, the labels of the complete data were removed and only the views {x_1, x_2} were used for clustering. The performance of different algorithms in this setting can be evaluated by the error rate, obtained as follows. After all the samples have been assigned to clusters by an unsupervised learning algorithm, each sample obtains a cluster label. Since the ground-truth label of each sample is known in this controlled experiment, for each cluster we find the ground-truth label that most samples in the cluster share and take it as the cluster label. A sample is then counted as "wrongly" classified when its cluster label does not agree with its true label, from which the error rate is calculated. In the experiments, the two versions of Multi-View EM and standard EM were applied to fit a 2-component GMM to these synthetic datasets. Each
Table 4. Average error rates on the different synthetic datasets

                                             Dataset 1   Dataset 2
Version I (Multi-View EM), w1=0.2, w2=0.8    0.1770      0.2659
Version I (Multi-View EM), w1=0.8, w2=0.2    0.2268      0.2414
Version II (Multi-View EM), w1=0.2, w2=0.8   0.1312      0.1323
Version II (Multi-View EM), w1=0.8, w2=0.2   0.1320      0.1322
EM using the complete feature set            0.1326      0.1380
algorithm was run 50 times, and the corresponding average error rates of the clustering results are presented in Table 4. Note that standard EM used the whole feature set for clustering.
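The majority-label error rate described above can be sketched as (an illustrative NumPy implementation of the procedure, not the paper's code):

```python
import numpy as np

def clustering_error(cluster_ids, true_labels):
    """Error rate of Sec. 3.1: each cluster is named after the majority
    ground-truth label among its members; samples disagreeing with their
    cluster's name count as errors."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    errors = 0
    for c in np.unique(cluster_ids):
        members = true_labels[cluster_ids == c]
        majority = np.bincount(members).argmax()
        errors += int((members != majority).sum())
    return errors / len(true_labels)

# Toy usage: cluster 0 contains one mislabeled sample out of five total.
err = clustering_error([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
```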
It can be observed that Version II of Multi-View EM, which exploits the uncorrelated property of the synthetic data, achieves the lowest error rate. Standard EM using the complete feature set performs quite well on all the synthetic datasets. This demonstrates that when the dimension of the data is not too high and there are enough samples, a GMM based on standard EM can be fitted to the full-dimensional data and capture the distribution of the data fairly well without introducing feature splits.
USPS dataset. There are 9298 labeled samples in the USPS set of handwritten postal codes; each sample is a 16×16 image matrix with each element lying in the interval [−1, 1]. We randomly selected 3000 labeled samples for the experiments. Two views of the data were created as follows, which may be a little artificial. Each sample's image matrix was divided into two parts: the upper part of the image (the first 8×16 entries of the matrix) and the bottom part (the last 8×16 entries). We reduced each part's dimension by PCA according to the energy of the eigenvalues [13], requiring the cumulative eigenvalue energy to be no less than 75%. In this way, the dimensions of View 1 and View 2 were automatically determined to be 8 and 7 respectively. We then removed the samples' labels and applied Multi-View EM and standard EM to fit a 10-component GMM to the data. Accuracy, obtained similarly to the error rate in Section 3.1, was used to evaluate the performance of these unsupervised learning algorithms. As before, standard EM used the whole 15-dimensional feature set for learning.

Since EM and Multi-View EM only guarantee convergence to local optima, we randomly initialized the model parameters and ran EM and Multi-View EM 50 times each. Table 5 describes how the average accuracy over the 50 runs varies with different weights for the two versions of Multi-View EM. It should be pointed out that, for convenience, we set w2 = 1 − w1 in the experiment, which is not a necessary constraint in Multi-View EM. Clustering results by
Table 5. Averages and variations of accuracy on the USPS dataset by Multi-View EM and EM

w1          0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9
w2          0.9     0.8     0.7     0.6     0.5     0.4     0.3     0.2     0.1
Version I   0.560   0.563   0.562   0.552   0.478   0.571   0.580   0.572   0.576
            ±0.027  ±0.024  ±0.029  ±0.029  ±0.031  ±0.032  ±0.032  ±0.065  ±0.033
Version II  0.663   0.680   0.685   0.677   0.664   0.690   0.664   0.675   0.666
            ±0.030  ±0.046  ±0.029  ±0.026  ±0.106  ±0.034  ±0.106  ±0.067  ±0.082
EM using 15 dimensions: 0.703±0.080
Table 6. Accuracy of Multi-View EM, Co-training, Co-EM and EM for semi-supervised learning

Labeled samples   Version I of Multi-View EM   Co-training    Co-EM          EM
200               0.872±0.013                  0.819±0.017    0.861±0.015    0.748±0.011
400               0.893±0.006                  0.849±0.009    0.881±0.005    0.780±0.008
600               0.907±0.004                  0.869±0.009    0.897±0.004    0.808±0.007
standard EM are also presented in Table 5. It can be observed from Table 5 that, on the USPS dataset, standard EM, which employed the whole feature set for clustering, outperformed both versions of Multi-View EM.
3.2 Semi-supervised Learning Scenario
The WebKB data contain 8,282 pages that have been manually classified into 7 categories. In the experiment, we first extracted a page-based view and a hyperlink-based view from these pages using the document feature selection method "χ² statistics + Document Frequency Cut" [14]. Here 3000 features were selected to form the page-based view and 2000 features to form the hyperlink-based view. We then randomly selected 2000 samples from the "course" category and the "others" category to design a 2-category Naïve Bayes classifier for testing the performance of Multi-View EM, Co-training, Co-EM, and standard EM (using all 5000 features) in the semi-supervised setting.

After removing all the labels, we randomly re-labeled part of the data and tested the performance of Version I of Multi-View EM, Co-training, Co-EM, and EM on the semi-supervised datasets thus created. We randomly re-labeled 200, 400, and 600 of the 2000 samples, 20 times each, and obtained the averages and variations of the accuracy of the different semi-supervised learning algorithms. In Multi-View EM, the weights of the page-based view and the hyperlink-based view were set to 0.2 and 0.8 respectively. The experimental results are presented in Table 6.
It can be observed that in this semi-supervised learning task, Multi-View EM outperformed Co-training, Co-EM, and EM.
4 Performance Analysis
The experiments on synthetic data and USPS data demonstrated that when the dimension of the data is not high and there are enough samples, a GMM based on standard EM can capture the distribution of the data very well. But for many high-dimensional classification problems, such as text classification, GMMs can hardly be applied; Naïve Bayes classifiers are usually used instead. Nigam and Ghani analyzed the effectiveness of Co-training and Co-EM [7]. They argued that besides the compatibility and uncorrelatedness assumptions, the performance of the two algorithms is also affected by another important underlying assumption: the Naïve Bayes assumption. The fact that algorithms which introduce feature splits outperform those which do not on text classification problems can be intuitively explained as follows: the feature set, or vocabulary, is usually very large, and when feature splits are introduced, the underlying Naïve Bayes assumption is better satisfied within each view [7].
However, when fitting a GMM to data whose distribution does not satisfy the Naïve Bayes assumption, as in Section 3.1, algorithms using feature splits are not guaranteed to outperform algorithms using the whole feature set: standard EM, which directly uses the whole feature set for learning, performs fairly well. It can also be observed in Section 3.1 that Version II of Multi-View EM, which exploits the uncorrelated property of the two views and uses feature splits, outperforms standard EM slightly.
Deduced entirely within the EM framework, Multi-View EM is a more appropriate probabilistic version of Co-training than Co-EM. In Table 4, Version II of Multi-View EM achieves the best performance in the unsupervised learning scenarios on the synthetic data. In Table 6, Multi-View EM achieves the best performance in the semi-supervised learning scenario on the WebKB data.
For Multi-View EM, one important issue is how to choose appropriate weights for the different views. This remains an open issue and is closely related to the problem of choosing appropriate scale factors for different features [15]. In the semi-supervised learning scenario, the weights can be chosen through cross-validation. In the unsupervised learning scenario, where label information cannot be used, other prior knowledge about the data should be exploited, such as the importance of different views for the particular clustering task. Notice that among Multi-View EM, standard EM, Co-training, and Co-EM, only the first two can deal with the unsupervised learning scenario, and that standard EM also needs some other evaluation method to assess the models obtained.
In semi-supervised learning settings, Multi-View EM, like Co-EM, uses the unlabeled data iteratively, probabilistically labeling all the data at each round, whereas Co-training uses the unlabeled data incrementally, labeling only the most confidently predicted data at each round. Unlike Co-EM, Multi-View EM does not interchange the probabilistic labels between views but combines them for learning at each round, so its convergence is guaranteed within the EM framework.
Notice that Multi-View EM can simultaneously use different classifiers or models in different views, which, from the viewpoint of classifier fusion, can be seen as a combination of hybrid classifiers; Multi-View EM can also simultaneously employ different maximization criteria, such as ML or MAP, in different views. These advantages make Multi-View EM applicable to a broader range of multi-view learning problems.

From the viewpoint of classifier fusion, Version I of Multi-View EM can be intuitively regarded as, before each EM iteration, independently designing a classifier in each view and then using the weighted-sum criterion of classifier fusion [16] to combine the probabilistic labels from all the classifiers. Technically, other classifier fusion criteria, such as the multiplication criterion or the max criterion, could also be used to combine the probabilistic labels; strictly speaking, however, using these criteria within the EM framework would not guarantee convergence.
5 Conclusions
In this paper, Multi-View EM for finite mixture models has been proposed to handle real-world learning problems with natural feature splits. As a more proper probabilistic version of Co-training than Co-EM, Multi-View EM is deduced entirely within the EM framework, in which its convergence is theoretically guaranteed. Like Co-training and Co-EM, Multi-View EM exploits feature splits, and it can be applied to both unsupervised and semi-supervised learning tasks. Moreover, Multi-View EM can easily deal with more than two views, and it can simultaneously use different classifiers and different optimization criteria, such as ML and MAP, in different views, so that it can be employed for a broader range of real-world learning tasks.
Two versions of the Multi-View EM algorithm are also provided in this paper. Version I of Multi-View EM is obtained by using a criterion other than ML and MAP in the M-step. It can be intuitively regarded as running EM in each view and, before each new EM iteration, combining the weighted probabilistic labels generated in each view. Version II of Multi-View EM can be viewed as a static probabilistic fusion of the views' components in the finite mixture model. GMM-based and Naïve Bayes classifier based versions of Multi-View EM are also presented.
For text classification problems, algorithms in the Co-training setting have been successfully applied to learning with labeled and unlabeled data. Previous research demonstrated that when the compatibility and uncorrelatedness assumptions are satisfied, these algorithms perform better than algorithms that do not use feature splits. But when the dimension of the data is not very high and there are enough samples, a GMM based on standard EM using the whole feature set performs very well. When the underlying distribution of the data does satisfy the uncorrelated assumption, Version II of Multi-View EM performs a little better than EM using the whole feature set.
Following the feature-split scheme, Multi-View EM can be expected to have broad applications to multi-view learning problems. Future work includes introducing prior knowledge about the data and criteria to automatically choose more appropriate weights for the different views, and applying Multi-View EM to more semi-supervised learning problems. Active learning could also be introduced, so that an active-learning version of Multi-View EM similar to Co-EMT could be designed.
References
1. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the 11th Annual Conference on Computational Learning Theory. (1998) 92–100
2. Muslea, I., Minton, S., Knoblock, C.A.: Active + semi-supervised learning = robust multi-view learning. In: Proceedings of ICML-02. (2002)
3. Muslea, I., Minton, S., Knoblock, C.A.: Selective sampling with redundant views. In: Proceedings of the National Conference on Artificial Intelligence. (2000) 621–626
4. Collins, M., Singer, Y.: Unsupervised models for named entity classification. In: Proceedings of EMNLP-99. (1999)
5. Riloff, E., Jones, R.: Learning dictionaries for information extraction using multi-level bootstrapping. In: Proceedings of AAAI-99. (1999)
6. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of ACL-95. (1995)
7. Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: Proceedings of the Ninth International Conference on Information and Knowledge Management. (2000) 86–93
8. LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., Jackel, L.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1 (1989) 541–551
9. Bilmes, J.A.: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report TR-97-021, ICSI, UC Berkeley (1997)
10. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39 (1977) 1–38
11. Garg, A., Potamianos, G., Neti, C., Huang, T.: Frame-dependent multi-stream reliability indicators for audio-visual speech recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2003)
12. Nigam, K., McCallum, A., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Machine Learning 39 (2000) 103–134
13. Kirby, M.: Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns. John Wiley and Sons, New York (2000)
14. Rogati, M., Yang, Y.: High-performing feature selection for text classification. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, McLean, Virginia, USA (2002)
15. Wang, J.D., Zhang, C.S., Shum, H.Y.: Face image resolution versus face recognition performance based on two global methods. In: Proceedings of the Asian Conference on Computer Vision (ACCV 2004), Jeju Island, Korea (2004)
16. Suen, C.Y.W., Lam, L.: Analyzing the effectiveness and applicability of co-training. In: Proceedings of the First International Workshop on Multiple Classifier Systems (MCS 2000), Cagliari, Italy (2000) 52–66