IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 7, NO. 1, JANUARY-MARCH 2010
Exploratory Consensus of Hierarchical
Clusterings for Melanoma and Breast Cancer
Pritha Mahata
Abstract—Finding subtypes of heterogeneous diseases is one of the biggest challenges in biology. Often, clustering is used to
provide a hypothesis for the subtypes of a heterogeneous disease. However, there are usually discrepancies between the clusterings
produced by different algorithms. This work introduces a simple method which provides the most consistent clusters across three
different clustering algorithms for a melanoma and a breast cancer data set. The method is validated by showing that the Silhouette,
Dunn's, and Davies-Bouldin's cluster validation indices are better for the proposed algorithm than those obtained by k-means and
another consensus clustering algorithm. The hypotheses of the consensus clusters on both the data sets are corroborated by clear
genetic markers and 100 percent classification accuracy. In Bittner et al.'s melanoma data set, a previously hypothesized primary cluster
is recognized as the largest consensus cluster, and a new partition of this cluster into two subclusters is proposed. In van't Veer et al.'s
breast cancer data set, the previously proposed "basal" and "luminal A" subtypes are clearly recognized as the two predominant clusters.
Furthermore, a new hypothesis is provided about the existence of two subgroups within the "basal" subtype in this data set. The clusters
of van't Veer's data set are also validated by the high classification accuracy obtained in the data set of van de Vijver et al.
Index Terms—Consensus clustering, melanoma, breast cancer.
1 INTRODUCTION
With the advent of microarray data, the last decade has
seen a collective effort for finding subtypes of heterogeneous diseases like melanoma, breast cancer, and ovarian
cancer. The traditional tools for subtype detection include
unsupervised methods, such as hierarchical (e.g., single-link,
average-link, and complete-link agglomerative clustering) or
iterative relocation-based clustering (e.g., k-means [1]).
If a hierarchical clustering algorithm (e.g., [2]) is used, then
it is often difficult to infer the number of actual subtypes
hidden in the data. It can be anything between 1 and n, where
n is the number of samples in the data. It has been a common
technique to cut the tree at different levels, starting below the
root, and to check whether the generated clusters are well defined.
However, the trees produced by different hierarchical
clustering algorithms can be very different from each other,
even if the corresponding dendrograms are cut at the same
level. Not surprisingly, there have been many controversies
about proposed subtypes within different heterogeneous
cancer data sets. For instance, in [3], five subtypes of
sporadic breast cancers were proposed. However, only
two subtypes out of the five were relatively consistently
found in the sporadic breast cancer data set produced by
van’t Veer et al. [4]. Thus, there is a necessity to come up with
better clustering algorithms which can yield more definitive
inferences for the subtypes of heterogeneous diseases.
If an iterative relocation-based method is used, the results
are often dependent on the initialization of the algorithm.
. The author is with the School of Information Technology and Electrical
Engineering, University of Queensland, Australia.
E-mail: [email protected].
Manuscript received 3 Apr. 2007; revised 2 Nov. 2007; accepted 3 Mar. 2008;
published online 14 Mar. 2008.
For information on obtaining reprints of this article, please send e-mail to:
[email protected], and reference IEEECS Log Number TCBB-2007-04-0039.
Digital Object Identifier no. 10.1109/TCBB.2008.33.
1545-5963/10/$26.00 © 2010 IEEE
This tendency is even more visible when the data set
includes noisy gene expressions. Furthermore, the number
of clusters has to be chosen a priori for these methods.
One way to solve this problem is to take consensus of the
trees generated by various algorithms. Now, there can be
various ways to take consensus of such trees. Consensus
trees, first developed by Adams [5], have long been used to
combine information from two or more different classifications of the same set of Operational Taxonomic Units into a
single tree. Adams' method is a top-down method, and it
recursively finds consensus clusters by intersection of
clusters occurring at a number of specified heights on the
trees being compared. In this method, the height at which
clusters are compared is a function of what consensus
clusters were found in the previous intersection step. For
example, consider two binary trees. Then, the intersection of
left and right clusters at the root of both trees results in four
clusters (perhaps, some of these are null sets). Then, the
nearest common ancestor (least upper bound or LUB) of
these clusters is sought in both the trees, and an edge from
it to each nonnull cluster is drawn. A similar method is
described in [6], where consensus of k clusters includes all
clusters formed by intersecting any set of k clusters, all of
the same height and one from each tree. These methods
have the capability of producing consensus clusters that do
not exist in any of the original trees being compared.
Strict consensus [7] and majority consensus [8] are cluster
counting methods, and they rely on tallying the number of
trees in which a particular cluster occurs. In the strict
consensus method, a cluster must occur in every tree in
order to be included in the consensus tree. In the majority
consensus method, the cluster must occur in more than half
of the trees in order to be included.
In the last decade, a number of works have also concentrated
on optimization-based methods for finding "consensus
clusters” (e.g., [9], [10], [11], [12], [13], [14], and [15]). Next,
we recall some of these methods. These methods usually
have high computational complexity as mentioned below.
In [9], a topological consensus of several trees is sought
to infer the best possible topology to be fitted to the data. In
[10], a weighted Rand Measure, representing the degree of
association between any given clustering and the final to-be-found amalgamated clustering, is maximized (notice that
this is an NP-hard problem).
In [11], an adaptive clustering ensemble method is
proposed. This method is essentially an extension of a
supervised boosting algorithm to the context of consensus
clustering. This algorithm generates individual partitions in
the ensemble in a sequential manner by clustering specially
selected subsamples of the given data set. New subsamples
are drawn to focus on the difficult regions of the feature space.
The probability that a sample will be picked depends on the
consistency of its previous assignments in the ensemble.
Monti et al. [12] also use resampling to simulate
perturbations of the original data set. In essence, they apply
a given clustering algorithm on each resampled version of
the data set and compute a connectivity matrix (an n × n matrix,
where n is the number of elements to be clustered and each
element of the matrix takes a value from {0, 1}: the value is 1
if the corresponding pair of elements is in the same cluster, and
0 otherwise). From several such connectivity matrices, a consensus
matrix is then computed as a properly normalized sum of all the
connectivity matrices (see [12] for details). The values in
the consensus matrix range between 0 and 1; a value close to
1 (respectively, 0) means that most (respectively, few) runs cluster
a pair of elements in the same cluster. The consensus matrix can be used as a
visualization tool to find the members of the clusters and the
actual number of clusters.
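The connectivity-and-consensus computation just described can be sketched in a few lines. The label vectors below are made-up stand-ins for the clusterings obtained on resampled data; in [12] the normalization also accounts for how often each pair was jointly sampled, which is omitted here since every element appears in every run.

```python
# Sketch of a Monti-style consensus matrix. Entry (i, j) is the fraction of
# clustering runs in which elements i and j received the same cluster label.

def consensus_matrix(runs):
    n = len(runs[0])
    m = [[0.0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += 1.0
    k = float(len(runs))
    return [[v / k for v in row] for row in m]

# Three hypothetical clustering runs over four elements.
runs = [[0, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0]]
C = consensus_matrix(runs)  # C[0][1] is 1.0: elements 0 and 1 always co-cluster
```

A value such as C[2][3] = 2/3 then signals a pair that co-clusters in most, but not all, runs.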
The authors of [13] employ another "consensus clustering" method and apply it to gene expression data.
They start with an agreement matrix, A, of size n × n, where n
is the number of elements to cluster. Each element A_ij, at the
ith row and jth column of A, denotes the number of
clustering methods which put element i and element j
in the same cluster. Then, a consensus clustering is
performed by maximizing the sum of the scores f(G_i) of
each to-be-found cluster G_i with size s_i, where f(G_i) for
nonzero-sized clusters is defined as follows:

f(G_i) = Σ_{j=1}^{s_i−1} Σ_{k=j+1}^{s_i} ( A_{G_ij G_ik} − α ),

with α being a user-defined parameter for the agreement
threshold. f(G_i) is 0 if s_i = 0. To solve this optimization
problem, the authors of [13] use a probabilistic meta-algorithm
called simulated annealing [16].
In [14], given a number of clusterings (say, C_1, …, C_k) for
a data set, the authors find the optimum clustering C* such that
Σ_{i=1}^{k} d(C_i, C*) is minimized, where d gives the symmetric
difference distance defined over partitions. This measure is
defined on the number of coclustered and not coclustered
pairs in the partitions (see [14] for details). The problem is
NP-complete [17], [18] in general, and thus, a solution to
this optimization problem is based on local search and
simulated annealing.
There are also other works which address the mixture of
labeled and unlabeled data (e.g., [19]).
To take consensus of multiple nonhierarchical clusterings, Strehl and Ghosh [20] propose three heuristics to find
candidate clusterings (the first induces a similarity measure
from the partitionings and then reclusters the objects; the
second is based on hypergraph partitioning; and the third
collapses groups of clusters into metaclusters which then
compete for each object to determine the clustering). Then,
they use a supraconsensus function that evaluates all three
approaches against a given objective function, which
essentially seeks the maximum average normalized mutual
information, and picks the best solution for a given situation.
Recently, in [15], a Bayesian method was given which
averages multiple clustering runs with Variational Bayes
Mixtures of Gaussians or k-means to yield the final clustering
result. They collect reoccurring clustering patterns in a
cooccurrence matrix (which contains the probabilities that a
pair of elements belongs to the same cluster). This matrix can
then be used as a similarity matrix and be given as an input to
any hierarchical clustering algorithm for generating a tree.
In contrast to the recent optimization-based methods and
Bayesian techniques, this work proposes a computationally
simple method, which is thematically more similar to the
works of Adams [5] and Neumann [6]. This work considers
intersections of leaves of various trees generated by several
clustering algorithms and finds one or more "consensus
clusters" which appear in similar-sized subtrees with
slightly different topologies in the considered trees. Thus,
this approach is also very different from traditional consensus tree building approaches that rely on considering
subtrees only at some specified heights. Notice that for our
approach, an assumption is made that there exists more than
one cluster in the data. The motivation for not building the
resulting consensus tree is that different clustering criteria
can lead to differences in the topology of the elements
within the same cluster, and therefore, each cluster in the
consensus tree can potentially contain fewer samples
than the actual number of members in that cluster.
Contributions. The contributions of this paper are as
follows:
• A simple exploratory method for finding consensus
clusters from several hierarchical clustering algorithms is proposed. The method is based on
intersecting leaves of various "similar-sized" subtrees of the given trees generated by some top-down
and bottom-up clustering techniques. The proposed
algorithm has been applied to two public-domain
microarray data sets: 1) the melanoma (skin cancer) data
set produced by Bittner et al. [21] and 2) the breast
cancer data set provided by van't Veer et al. [4].
• Using this method, a new hypothesis can be made:
there might be a new subtype of melanoma in the
data set of Bittner et al. [21], within a commonly
found dense cluster.
• Also, the popularly hypothesized notions of "basal"
and "luminal A" subtypes in breast cancer from the
results on the data set of van't Veer et al. are validated
by the results on the breast cancer data set of van de
Vijver et al. However, other subtypes are rather unclear,
even though there is a possibility of a third cluster
called "luminal B." Furthermore, a new hypothesis
about two strong subtypes within the "basal" subtype is made.
2 CONSENSUS CLUSTERS
2.1 Finding Consensus Clusters
The motivation of just finding consensus clusters instead of
building a consensus tree for the microarrays is that such
data sets are often very noisy and the topology produced by
the candidate classification algorithms cannot provide the
exact relationship of the samples in the data. Intuitively, the
proposed algorithm (called ECHC, exploratory consensus of
hierarchical clusterings) takes every proper subtree of a given
tree and finds a match in terms of the set of leaves in all other
trees, such that the matching trees are of comparable sizes.
Finally, a consensus cluster is formed by taking the
intersection of the set of leaves of the given subtree and
the other matching subtrees, if the proportion of the number
of leaves in common in all the subtrees with respect to the
number of leaves in the given subtree is above some
prespecified value.
More formally, suppose that there are k algorithms
which produce k trees, T_1, …, T_k, for a given data set. The
output of the following algorithm is a collection of sets of
strings, where the elements (strings) in each set are the sample
names defining a "consensus cluster." The algorithm consists of
the following three steps.
Step 1. Consider each proper subtree ti of the tree T1
containing more than l (say, 5) leaves, to generate a candidate
cluster (this implies that one cluster can be a subset of a
larger cluster). Then, for each proper subtree t_i ∈ T_1, find a
subtree t_i^j in each tree T_j with 2 ≤ j ≤ k such that

t_i^j = argmax_{t_p ∈ S_j} |leaves(t_i) ∩ leaves(t_p)|² / ( |leaves(t_i)| · |leaves(t_p)| ),    (1)

where S_j is the set of all proper subtrees in T_j, leaves(t_i) is the
set of strings for the names of the leaves in the tree t_i, and |·|
is the sizeof (cardinality) operator.
Notice that unlike other methods which cut the dendrograms at some height for generating the clusters, all proper
subtrees of T1 containing more than l leaves are considered
here. This is feasible, since usually the data of heterogeneous diseases are taken from a few populations of tens
of patients such that each population has similar symptoms.
Here, comparing subtrees only at similar heights may not
yield good results since, depending on the algorithm, the
branching structures are usually widely different (e.g.,
consider a bipartitioning of samples such that one side
contains only a singleton outlier, while the other side
contains the rest of the samples).
Step 2. For each subtree t_i ∈ T_1, let the matching trees in
sets S_2, …, S_k be m_i^2, …, m_i^k. Discard a subtree t_i from the
set of candidate clusters S_1 if any of the following holds for
any j : 2 ≤ j ≤ k:

0 < |leaves(t_i) ∩ leaves(m_i^j)| / |leaves(t_i)| < α,
0 < |leaves(t_i) ∩ leaves(m_i^j)| / |leaves(m_i^j)| < α,

for some predefined α ≤ 1. An α = 0.75 is chosen in an
experimental manner. Note that k is the number of trees
under consideration.
This step ensures that a subtree in T1 is matched with a
subtree of similar size in other trees T_j. However, α < 1 will
allow one to match subtrees with different structures and
slightly different numbers of leaves, even though there
might be a few mistakes in each individual subtree under
consideration. Nevertheless, if α takes a value close to 0,
there will be a large number of very small clusters found
and a well-defined cluster will stand the chance of being
broken into subclusters.
Step 3. For all remaining trees t_i ∈ T_1, there are
nonempty subtrees m_i^2 ∈ S_2, …, m_i^k ∈ S_k. Now, the
elements in a consensus cluster extracted from t_i are given
by leaves(t_i) ∩ leaves(m_i^2) ∩ ⋯ ∩ leaves(m_i^k).
Notice that the tree created by a top-down clustering
algorithm is chosen to be T_1, since top-down methods are
often more accurate near the root of the tree, while bottom-up
clustering techniques tend to make mistakes in merging
large-sized clusters.
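Steps 1-3 can be illustrated with a small sketch in which each tree is abstracted as the collection of leaf sets of its proper subtrees; the parameter names l and alpha follow the text, while the function names and the toy trees below are invented for illustration.

```python
# Illustrative sketch of ECHC Steps 1-3 over abstracted trees (lists of leaf
# sets). l is the minimum leaf count; alpha is the agreement threshold.

def match_score(a, b):
    # The score maximized in (1): |a ∩ b|^2 / (|a| * |b|).
    inter = len(a & b)
    return inter * inter / (len(a) * len(b))

def echc(t1_subtrees, other_trees, l=5, alpha=0.75):
    clusters = []
    for t in t1_subtrees:
        if len(t) <= l:          # Step 1: only subtrees with more than l leaves
            continue
        matches, ok = [], True
        for subtrees in other_trees:
            m = max(subtrees, key=lambda p: match_score(t, p))
            inter = len(t & m)
            # Step 2: discard if an overlap proportion is nonzero but below alpha
            if 0 < inter / len(t) < alpha or 0 < inter / len(m) < alpha:
                ok = False
                break
            matches.append(m)
        if ok:
            core = set(t)
            for m in matches:    # Step 3: intersect leaves across all matches
                core &= m
            if core:
                clusters.append(core)
    return clusters

t1_subtrees = [{'a', 'b', 'c', 'd', 'e', 'f'}]
other_trees = [[{'a', 'b', 'c', 'd', 'e'}, {'f', 'g'}]]
clusters = echc(t1_subtrees, other_trees)  # -> [{'a', 'b', 'c', 'd', 'e'}]
```

Here the six-leaf candidate subtree matches a five-leaf subtree in the other tree with overlap proportion 5/6 ≥ 0.75, so the consensus cluster is the five common leaves.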
In the following, the two top-down and one bottom-up
clustering algorithms from which the consensus clusters will
be generated are described.
2.2 Arithmetic-Harmonic Cut (AH-CUT)
Recently in [22] and [23], a new top-down hierarchical
clustering algorithm was introduced, which performed very
well in finding trees with well-supported hypotheses for several
diverse data sets, e.g., separation times of 84 Indo-European
languages [24] and microarray data of 64 cancer cell-lines
from National Cancer Institute, USA [25]. This method
recursively builds the hierarchy in a top-down fashion. In
each step of the recursion, the following graph optimization
problem is solved for newly generated partitions of the data.
For instance, initially for each sample (experiment) in the
microarray data, there is a vertex assigned in the initial graph
and the distance (correlation, euclidean, etc.) between a pair
of samples is assigned on an edge between the representative
vertices in the graph. The solution of the following optimization problem partitions the vertices into two sets. Then, a new
graph is constructed and the following optimization problem
is solved for each side of the obtained partition.
AH CUT Problem:
Instance: A pair ðG; wÞ. G ¼ ðV ; EÞ is an undirected graph
where V is the set of vertices, E is the set of edges and w is
a weighting function given by w : E → ℕ⁺, where ℕ⁺ is the
set of nonzero natural numbers.
Goal: To divide the set V of vertices into {S, S̄} with
S̄ = V \ S, such that the following objective function f
is maximized.
Objective:

f(S, (G, w)) = ( Σ_{e∈δ_G(S)} w(e) ) · ( Σ_{e∈δ̄_G(S)} 1/w(e) ),    (2)

where for every subset S of V, δ_G(S) := {uv ∈ E | u ∈ S,
v ∉ S} denotes the cut in G induced by S and δ̄_G(S) :=
E \ δ_G(S). Effectively, every partition of V into S and S̄
generates a partition of the set E into two sets δ_G(S) and
δ̄_G(S).
The name Arithmetic-Harmonic Cut (AH-CUT) follows
from the following:

f(S, (G, w)) = ( A_ext · |δ_G(S)| ) · ( (|E| − |δ_G(S)|) / H_int )
             = (A_ext / H_int) · |δ_G(S)| · ( |E| − |δ_G(S)| ),

where A_ext is the arithmetic mean of the weights of the edges
in the cut, H_int is the harmonic mean of the weights of the
edges not in the cut, and |δ_G(S)| is the cardinality of the cut.
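Once a bipartition S is fixed, the objective in (2) is cheap to evaluate; the following sketch (with a made-up toy graph and weights) computes it directly from the definition.

```python
# Evaluate the AH-CUT objective for a fixed bipartition S: the sum of the cut
# edge weights times the sum of reciprocals of the non-cut edge weights.
# The graph and its weights below are made up for illustration.

def ah_cut_value(edges, s):
    cut = [w for (u, v, w) in edges if (u in s) != (v in s)]
    rest = [w for (u, v, w) in edges if (u in s) == (v in s)]
    return sum(cut) * sum(1.0 / w for w in rest)

# Weights act as distances: vertices 0,1 and 2,3 are close, the groups are far.
edges = [(0, 1, 1), (2, 3, 1), (0, 2, 4), (1, 3, 4)]
value = ah_cut_value(edges, {0, 1})  # cuts the two heavy inter-group edges
```

Cutting the dissimilar pairs gives 8 · (1/1 + 1/1) = 16, whereas the partition {0, 2} scores only 1, so maximizing f favors the natural bipartition.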
In [22] and [23], the authors used a population-based metaheuristic (a memetic algorithm) to solve AH-CUT (since the problem
is shown to be APX-hard [23] by a reduction from the
MAX-CUT problem). The memetic algorithm first creates a
population of feasible solutions using a randomized greedy
algorithm. Then, these solutions are combined (same crossover mechanism of genetic algorithms) using a differential
greedy algorithm such that a better solution is obtained,
and this solution then replaces the parent solution with the
worse objective value. Finally, a variable neighborhood local
search is employed in the neighborhood of each such new
solution to find further improvements. If no better solution is
obtained, then the algorithm is restarted a few times.
2.3 NORMALIZED-CUT
Shi and Malik introduced a top-down technique for divisive
clustering, namely NORMALIZED-CUT [26], and this method
has been applied very successfully to several 2D data sets
and images. The work in [26] uses a similarity matrix and
proposes to minimize the following objective function
NCut(S, S̄), induced by a partition of the vertices for some
S ⊆ V and S̄ = V \ S, where

NCut(S, S̄) = ( Σ_{ij∈δ(S)} s(ij) ) / ( Σ_{i∈S, j∈V} s(ij) )
            + ( Σ_{ij∈δ(S)} s(ij) ) / ( Σ_{i∈S̄, j∈V} s(ij) ),

with s being a function from edges to real numbers which gives
the similarity between any two vertices i and j. The work in
[26] reduces this optimization problem to a generalized
eigenvalue problem and provides an efficient method for
image segmentation. However, the algorithm in [26] for the
NORMALIZED-CUT problem requires specification of the
number of clusters. Thus, an alternative method (a variant
of the memetic algorithm described above) is employed; it
does not require such a specification, but caters to the
objective function of NORMALIZED-CUT. The tree is
generated by applying NORMALIZED-CUT recursively on
each newly obtained set of vertices in the partitions.
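For a fixed bipartition, the NCut objective can likewise be evaluated directly from a similarity matrix; the symmetric matrix below is a made-up example.

```python
# Evaluate NCut(S, S̄) for a fixed bipartition from a symmetric similarity
# matrix (made-up values): cut(S)/assoc(S, V) + cut(S)/assoc(S̄, V).

def ncut(sim, s):
    n = len(sim)
    sbar = [i for i in range(n) if i not in s]
    cut = sum(sim[i][j] for i in s for j in sbar)
    assoc_s = sum(sim[i][j] for i in s for j in range(n))
    assoc_sbar = sum(sim[i][j] for i in sbar for j in range(n))
    return cut / assoc_s + cut / assoc_sbar

sim = [[0.0, 1.0, 0.1, 0.1],
       [1.0, 0.0, 0.1, 0.1],
       [0.1, 0.1, 0.0, 1.0],
       [0.1, 0.1, 1.0, 0.0]]
value = ncut(sim, {0, 1})
```

The natural split {0, 1} versus {2, 3} yields the lowest NCut value here, since only the weak cross-group similarities are cut.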
2.4 Agglomerative Hierarchical Clustering
For large-scale data sets, often a bottom-up approach is
taken to find a hierarchical tree. The reason behind this is
mainly the computational efficiency of the bottom-up
approach. Notice that NORMALIZED-CUT is shown to be
NP-hard in [26] and AH-CUT is proved to be hard to
approximate in [23].
The method of agglomerative hierarchical clustering [2]
starts by considering each object as a separate cluster. At
each iteration, it agglomerates the two most "similar" clusters,
decreases the number of clusters by 1, and updates the
distances between the clusters. The process continues until a
single cluster is obtained. The tree structure is generated by
keeping track of the merging of the clusters. Different trees
arise depending on the different ways of defining distance
between clusters. In this work, the most commonly used
average-linkage method is used. In this method, the
distance between two clusters is defined as the average of
distances between all pairs of objects, where each pair is
made up of one object from each group.
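A naive pure-Python rendering of average-linkage agglomeration over a precomputed distance matrix might look as follows; production implementations (such as the one in [2]) use equivalent but much faster distance updates.

```python
# Naive average-linkage agglomerative clustering over a distance matrix.
# The inter-cluster distance is the mean of all pairwise object distances;
# merges are recorded so the tree can be reconstructed afterwards.

def average_linkage(dist):
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Two tight pairs (0,1) and (2,3) that are far from each other.
dist = [[0, 1, 4, 4], [1, 0, 4, 4], [4, 4, 0, 1], [4, 4, 1, 0]]
merges = average_linkage(dist)
```

The merge log reproduces the expected dendrogram: the two tight pairs merge first at distance 1, and the final merge happens at the average inter-group distance 4.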
3 VALIDATION
The validation of clustering is a hard problem, although
there exist quite a few works [27], [28], [29], [30], [31], which
have addressed this issue in the past and provided
strategies for validating clustering partitions. We consider
the recently proposed Silhouette, Dunn's, and Davies-Bouldin
indices, respectively given in [29], [30], and [31].
3.1 Silhouette Index
For a given cluster C_j, with j : 1 ≤ j ≤ c, where c is the
number of clusters, each sample s_i, i = 1, …, m_j, of C_j can
be assigned a quality measure, known as the Silhouette
width. The Silhouette width is a confidence indicator on the
membership of the ith sample in cluster C_j. The Silhouette
width for the ith sample in the cluster C_j is defined as
s(i) = ( b(i) − a(i) ) / max{a(i), b(i)},
where a(i) is the average distance between the ith sample
and all other samples in C_j excluding it; "max" is the
maximum operator, and b(i) is the minimum average
distance between the ith sample and all samples clustered
in C_k (k = 1, …, c; k ≠ j). From this formula, it follows that
−1 ≤ s(i) ≤ 1. When s(i) takes a value close to 1, one may
infer that the ith sample has been clustered well. When s(i)
is close to zero, it suggests that the ith sample could also be
assigned to the nearest neighboring cluster. If s(i) is close to
−1, it may be argued that such a sample has been
"misclassified" [29].
For a given cluster C_j (j = 1, …, c), it is possible to
calculate a cluster Silhouette S_j, which characterizes the
heterogeneity and isolation properties of such a cluster:

S_j = (1 / m_j) Σ_{i=1}^{m_j} s(i),
where m_j is the number of samples in C_j. In fact, a Global
Silhouette value, GS_u, can be used as an effective validity
index for a partition u, such that u is the union of samples in
clusters C_j for all j : 1 ≤ j ≤ c:

GS_u = (1 / c) Σ_{j=1}^{c} S_j.
Notice that a higher value of GS_u signifies better isolation
properties of the clusters. Furthermore, it is shown in [32]
that the Silhouette index is insensitive to the type of the
distance metric used.
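The Silhouette computation can be sketched directly from these definitions. The points and distance function below are toy stand-ins (the paper itself uses correlation distance), and clusters are assumed to have at least two members so that a(i) is defined.

```python
# Global Silhouette from the definitions: a(i) is the mean distance to the
# other members of i's cluster, b(i) the smallest mean distance to any other
# cluster, s(i) = (b - a) / max(a, b); GS is the mean of per-cluster means.

def mean(xs):
    return sum(xs) / len(xs)

def global_silhouette(points, clusters, dist):
    cluster_sils = []
    for j, cj in enumerate(clusters):
        widths = []
        for i in cj:
            a = mean([dist(points[i], points[k]) for k in cj if k != i])
            b = min(mean([dist(points[i], points[k]) for k in ck])
                    for jj, ck in enumerate(clusters) if jj != j)
            widths.append((b - a) / max(a, b))
        cluster_sils.append(mean(widths))
    return mean(cluster_sils)

# Two well-separated toy clusters in 1D with an absolute-difference distance.
pts = [0.0, 0.1, 5.0, 5.1]
gs = global_silhouette(pts, [[0, 1], [2, 3]], lambda x, y: abs(x - y))
```

For these well-separated points, GS comes out close to 1, as the index intends.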
3.2 Dunn’s Index
Dunn’s index identifies sets of clusters that are compact and
well separated. For any partition u, the Dunn’s validation
index D_u is defined as

D_u = min_{1≤i≤c} min_{1≤j≤c, j≠i} ( δ(C_i, C_j) / max_{1≤k≤c} Δ(C_k) ),

where δ(C_i, C_j) defines the distance between clusters C_i and
C_j (intercluster distance); Δ(C_k) represents the intracluster
distance of cluster Ck , and c is the number of clusters of
partition u. The main goal of this measure is to maximize
intercluster distances while minimizing intracluster distances. Thus, large values of D correspond to good clusters.
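Since the text leaves δ and Δ generic, a direct sketch of Dunn's index has to make one common concrete choice: single-linkage distance between clusters for δ and cluster diameter for Δ (the toy data are made up).

```python
# Dunn's index from the definition: smallest intercluster distance divided by
# the largest intracluster diameter. Single-linkage delta and diameter Delta
# are one common concrete choice for the generic distances in the text.

def dunn(points, clusters, dist):
    def delta(ca, cb):
        return min(dist(points[i], points[j]) for i in ca for j in cb)

    def diam(c):
        return max(dist(points[i], points[j]) for i in c for j in c)

    inter = min(delta(ca, cb)
                for x, ca in enumerate(clusters)
                for y, cb in enumerate(clusters) if x != y)
    intra = max(diam(c) for c in clusters)
    return inter / intra

pts = [0.0, 0.1, 5.0, 5.1]
d = dunn(pts, [[0, 1], [2, 3]], lambda x, y: abs(x - y))
```

The tight, distant clusters give a large value (4.9 / 0.1 = 49), illustrating that large D signals compact, well-separated clusters.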
3.3 Davies-Bouldin's Index
Like Dunn's index, the Davies-Bouldin index tries to
identify compact and well-separated sets of clusters. The
Davies-Bouldin validation index, DB, is defined as
DB_u = (1 / c) Σ_{i=1}^{c} max_{j≠i} ( ( Δ(C_i) + Δ(C_j) ) / δ(C_i, C_j) ),

where u, δ(C_i, C_j), Δ(C_i), Δ(C_j), and c are defined as in
Dunn’s index. Small values of DB correspond to clusters that
are compact, and whose centers are far away from each other.
For each of the above indices, the distance between
samples si and sj is considered to be their correlation distance
(1 − ρ), where ρ is Pearson's correlation coefficient (which can
be computed as in Section 3.5.1) between the gene
expression vectors of samples si and sj .
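The Davies-Bouldin index can be sketched similarly; since Δ and δ are again generic, this sketch uses the mean distance to the centroid as the cluster scatter and the distance between centroids as δ, with made-up 1D points.

```python
# Davies-Bouldin index from the definition, taking Delta as the mean distance
# of members to the cluster centroid and delta as the distance between
# centroids (one common concrete choice for these generic quantities).

def davies_bouldin(points, clusters, dist):
    def centroid(c):
        return sum(points[i] for i in c) / len(c)

    def scatter(c):
        m = centroid(c)
        return sum(dist(points[i], m) for i in c) / len(c)

    c = len(clusters)
    total = 0.0
    for i in range(c):
        total += max((scatter(clusters[i]) + scatter(clusters[j]))
                     / dist(centroid(clusters[i]), centroid(clusters[j]))
                     for j in range(c) if j != i)
    return total / c

pts = [0.0, 0.2, 5.0, 5.2]
db = davies_bouldin(pts, [[0, 1], [2, 3]], lambda x, y: abs(x - y))
```

Compact clusters with distant centers give a small DB value here (0.04), in line with the interpretation above.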
3.4 Comparison with Other Algorithms
In the following sections, the Silhouette, Dunn’s, and
Davies-Bouldin indices for the clustering obtained by
ECHC on the melanoma and breast cancer data sets are
compared, with those obtained using k-means, fuzzy
c-means, and the recent consensus clustering in [15]. In all
of the above algorithms except ECHC, the number of
clusters needs to be specified.
In k-means, k cluster centers are first chosen and then
iteratively updated until the maximum number of
iterations is reached. The updating is done by computing
the new center of a cluster to be the centroid of the samples
which are clustered together when the previous center of
the cluster was considered. In this work, Eisen et al.’s [2]
implementation of k-means is used for this purpose.
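The center-update loop described above (Lloyd's iteration) can be sketched as follows, with made-up 1D data and initial centers; this is an illustrative stand-in, not Eisen et al.'s implementation.

```python
# Lloyd-style k-means: assign each sample to its nearest center, then move
# each center to the centroid of its assigned samples; repeat for a fixed
# number of iterations (toy 1D data and made-up initial centers).

def kmeans(data, centers, iters=10):
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            groups[j].append(x)
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers, groups

data = [0.0, 0.2, 0.4, 9.8, 10.0, 10.2]
centers, groups = kmeans(data, [0.0, 5.0])
```

After the first update the centers land on the two group centroids (0.2 and 10.0) and subsequent iterations leave them unchanged.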
In fuzzy c-means, each sample is assigned to a cluster
with a given probability: the higher the probability, the
more likely the sample is to belong to that cluster. However,
application of fuzzy c-means (using MATLAB with the
default options) on melanoma and breast cancer data
yielded equal probability for each sample to belong to the
specified number of clusters. Thus, fuzzy c-means is not
considered for comparison in the rest of this paper.
In [15], the authors' consensus clustering algorithm (called
CC in the following) takes advantage of multiple runs of
variational Bayes mixtures of Gaussians or k-means to
create a consensus clustering. In some simulated data sets,
this algorithm was able to identify more complex shapes of
clusters than those provided by the input clusterings. They
also provide an information-theoretic metric (average mutual
information) between the clustering runs to signify the
similarity between the clustering solutions.
3.5 Feature Selection and Classification
Another way to realize the importance of clustering the
samples of microarray data is to show that there are genetic
markers which can linearly separate a consensus cluster from
the others, and that a high classification accuracy is obtained
when these markers are used for classifying a randomly chosen
population of samples from the data. For this purpose, this work
uses a simple feature selection method, which is widely
employed by the biologists (e.g., see [33], and also [4]) and a
simple classifier which is widely used by the statisticians and
the machine learning community. In the following, we will
consider that in each validation test, there are two classes a
and b, where a, b can be two consensus clusters or one of them
can correspond to the set of all unclustered samples.
3.5.1 Feature Selection
To find genetic markers separating one consensus cluster
from another (or from the unclustered samples), the
correlation-based method is employed here. In this method, first, each
gene is sorted according to the class label. Let there be
n_a samples in class a and n_b samples in class b. Consider
g = [g_1, …, g_{n_a}, g_{n_a+1}, …, g_{n_a+n_b}] and n = n_a + n_b. Also, let
v = [v_a, v_b], where v_a = l_1, …, l_{n_a} and v_b = m_1, …, m_{n_b}, with
l_1 = ⋯ = l_{n_a} = 1 and m_1 = ⋯ = m_{n_b} = −1. Then, Pearson's
correlation coefficient of a gene g with v is given by
ρ_g = ( Σ_{k=1}^{n} (g_k − ḡ)(v_k − v̄) ) / √( Σ_{k=1}^{n} (g_k − ḡ)² · Σ_{k=1}^{n} (v_k − v̄)² ),

with

ḡ = (1/n) Σ_{k=1}^{n} g_k   and   v̄ = (1/n) Σ_{k=1}^{n} v_k.
Finally, genes are ranked according to the absolute value of
their Pearson's correlation coefficient ρ_g, i.e., rank_g = |ρ_g|.
Intuitively, this coefficient tells the correlation of the gene
expressions of samples from two classes with an ideal gene
which takes the value of 1 in one class and −1 in the other.
Thus, if rankg is close to 1, then the gene g is good for
differentiating between classes a and b.
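The whole ranking procedure can be sketched directly: build the ideal ±1 vector from the class labels and order the genes by |ρ_g|. The expression values below are made up for illustration.

```python
# Rank genes by |Pearson correlation| with the ideal vector v that is +1 on
# class a and -1 on class b (toy expression matrix; rows are genes).

def pearson(g, v):
    n = len(g)
    mg = sum(g) / n
    mv = sum(v) / n
    num = sum((x - mg) * (y - mv) for x, y in zip(g, v))
    den = (sum((x - mg) ** 2 for x in g) * sum((y - mv) ** 2 for y in v)) ** 0.5
    return num / den

def rank_genes(genes, n_a, n_b):
    v = [1.0] * n_a + [-1.0] * n_b
    ranks = [abs(pearson(g, v)) for g in genes]
    return sorted(range(len(genes)), key=lambda i: -ranks[i])

genes = [
    [5.0, 5.1, 1.0, 1.1],   # strongly differential between the two classes
    [3.0, 1.0, 3.1, 1.2],   # roughly uninformative
]
order = rank_genes(genes, 2, 2)  # the differential gene is ranked first
```

The first gene tracks the ideal ±1 pattern almost perfectly (|ρ| near 1), so it is ranked ahead of the uninformative one.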
3.5.2 Classifier
For classification purposes, a well-known classifier, namely
the Naive Bayes classifier (NBC), is used. This classifier is based
on the conditional probability distribution p(c | g_1, …, g_n) of
the class variable c ∈ {a, b} conditioned on the gene
expressions g_1 through g_n. Bayes' theorem gives
p(c | g_1, …, g_n) = p(c, g_1, …, g_n) / p(g_1, …, g_n).
The denominator is effectively constant. Using the assumption
that g_1, …, g_n are conditionally independent given the class, it
is easy to show that the numerator can be rewritten as
p(c, g_1, …, g_n) = p(c) Π_{i=1}^{n} p(g_i | c).
The probability distribution of a gene in a given class c can
be computed using standard steps. Thus, computing
pðc; g1 ; . . . ; gn Þ is straightforward. Given a new sample,
one can compute pða; g1 ; . . . ; gn Þ and pðb; g1 ; . . . ; gn Þ and
assign its class to a if the former is larger, or to b otherwise.
The NBC from the public-domain software WEKA [34] is
used here to validate the consensus clusters obtained in
melanoma and breast cancer data sets. In each case, two
groups a and b are considered, and a randomly chosen
66 percent of each group is used as the training data, while
the rest of the samples are classified. The seed chosen for
creating the random partitioning of the data is 1.
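The comparison of p(c) · Π_i p(g_i | c) for the two classes can be sketched with Gaussian per-feature likelihoods, a common concrete choice; this is an illustrative stand-in, not the WEKA NBC, and the data below are made up.

```python
# Minimal Gaussian Naive Bayes: per class, fit mean/variance of each feature;
# classify a new sample by the larger of log p(c) + sum_i log p(g_i | c).
import math

def fit(samples_by_class):
    model = {}
    total = sum(len(rows) for rows in samples_by_class.values())
    for c, rows in samples_by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        vars_ = [max(sum((x - m) ** 2 for x in col) / n, 1e-9)
                 for col, m in zip(zip(*rows), means)]
        model[c] = (n / total, means, vars_)
    return model

def predict(model, x):
    def score(prior, means, vars_):
        s = math.log(prior)
        for xi, m, v in zip(x, means, vars_):
            s += -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        return s
    return max(model, key=lambda c: score(*model[c]))

# Two tiny made-up "expression" classes with two features each.
model = fit({"a": [[1.0, 0.0], [1.2, 0.1]], "b": [[5.0, 3.0], [5.2, 3.1]]})
label = predict(model, [1.1, 0.05])
```

Working in log space avoids underflow when the product Π_i p(g_i | c) involves many genes.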
Classification of a test set. It is important to validate the
consensus clusters obtained by ECHC across various data
sets. However, when the class of a sample is unknown in
both the training and the test set, independent validation
becomes a tricky issue. To solve this problem, the same
procedure is applied to another test set and the consensus
clusters for it are generated. Still, without prior knowledge,
it might be difficult to map cluster I (II, etc.) of the training
set to cluster I (II, etc.) of the test set.
In the case of the breast cancer data sets, there were a few
common samples in the two sets. This allows us to map
clusters from the training set to the test set. Once the classes
of the samples in the test set were known, it was straightforward
to use NBC for validation.
4 MICROARRAY DATA
The microarray data for each considered data set is given by a 2D matrix D of size N × S, where N is the number of genes and S is the number of samples. Each row Ni, 1 ≤ i ≤ N, gives the expression of one gene across the S samples, and each column Sj, 1 ≤ j ≤ S, contains the expression of all genes for one sample (experiment). From D, a distance matrix of size S × S is first computed, where element (i, j) gives the correlation distance 1 − ρ(Si, Sj) between the columns Si and Sj. Here, ρ(Si, Sj) is computed in the same manner as in Section 3, replacing g and v with Si and Sj, respectively.
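The distance-matrix construction just described can be sketched in a few lines; the matrix below is a random stand-in for the real expression data D.

```python
# Sketch of the S x S correlation-distance matrix described above.
# D is a random stand-in for the real N x S expression matrix.
import numpy as np

rng = np.random.default_rng(0)
N, S = 100, 8                       # N genes, S samples
D = rng.normal(size=(N, S))

rho = np.corrcoef(D, rowvar=False)  # S x S: correlates the sample columns
dist = 1.0 - rho                    # correlation distance, zero on diagonal
```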
In the following sections, the distance matrix corresponding to each microarray data set is considered. Given a distance matrix, it is straightforward to construct a complete graph with S vertices and S(S − 1)/2 edges, whose weighting function w is defined by the distance matrix itself. ECHC is then applied using the trees produced by AH-CUT, NORMALIZED-CUT, and agglomerative clustering. In this work, the tree produced by AH-CUT is taken as T1.
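As a rough illustration of the consensus step (a simplification, not the exact ECHC procedure detailed earlier in the paper), a cluster from the first hypothesis tree can be matched against the subtree clusters of the other trees by set overlap; the Jaccard threshold used below is an assumption for illustration only.

```python
# Rough illustration only, not the exact ECHC procedure: a cluster from
# the first hypothesis tree T1 is kept if every other tree contains a
# subtree cluster agreeing with it by Jaccard overlap above a threshold
# (the threshold value here is a made-up assumption).
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def best_match(cluster, subtree_clusters):
    # subtree cluster with maximal overlap with `cluster`
    return max(subtree_clusters, key=lambda c: jaccard(cluster, c))

def consensus(cluster, other_trees, threshold=0.9):
    matches = [best_match(cluster, t) for t in other_trees]
    if all(jaccard(cluster, m) >= threshold for m in matches):
        # keep only the samples all matched subtrees agree on
        return set.intersection(set(cluster), *map(set, matches))
    return None

# toy example: three trees agree on samples 1-5 up to one extra element
t1_cluster = {1, 2, 3, 4, 5}
tree2 = [{1, 2, 3, 4, 5}, {6, 7}]
tree3 = [{1, 2, 3, 4, 5, 6}, {7, 8}]
print(consensus(t1_cluster, [tree2, tree3], threshold=0.8))  # {1, 2, 3, 4, 5}
```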
5 MELANOMA
The most common human cancers are malignant skin cancers. Melanoma is the least common but deadliest skin cancer, accounting for only about 4 percent of skin cancer cases but 79 percent of skin cancer deaths (http://www.cancer.org). Even though there has been a significant effort
devoted to identification of the clinical/genetic markers
differentiating melanoma samples from the normal ones,
identifying subtypes of melanoma is still a research topic.
Recent works claim the existence of a strong subtype of melanoma [21], [35]. However, there is no conclusive evidence about the markers of this subtype. The work in
[21] provides a melanoma data set of 38 samples, among
which 31 are melanoma-affected, while 7 are nonmelanoma
samples (NilC, UACC-3149, MCF-10A, CRL-1634, SRS-3,
SRS-5, and RMS-13). The authors of [21] employed two
clustering methods, agglomerative hierarchical clustering
[2] and a nonhierarchical technique called cluster affinity
search technique [36] to define the hypothesis about a
primary subtype of melanoma, which is centrally clustered
using both algorithms.
Fig. 1. Hierarchical clustering using AH-CUT on the distance matrix of
Bittner’s melanoma data set. MI denotes the largest consensus
cluster found, while MII denotes the second largest consensus
cluster. Notice that the elements in MI \ MII are not grouped together in AH-CUT, and thus they are not inferred as a third consensus cluster.
5.1 Consensus Clusters
It is not surprising that the same primary cluster in [21] is
inferred to be the largest consensus cluster, called MI
hereafter. Node 1 in Fig. 1 is matched with node 2 in Fig. 2
and node 6 in Fig. 3 to get MI. See the genetic signature of
MI versus rest in Fig. 4. Notice that the primary cluster
given in [21] contains 19 samples out of the 20 in the first
cluster MI. In addition to Bittner’s primary cluster, MI also
contains the sample TC-F027.
Furthermore, the cluster MI is found to be further subdivided into two strongly separated clusters, MII and MI \ MII, as evidenced by the genetic signature shown in Fig. 5. In this case, node 5 in Fig. 1 is matched with node 37 in Fig. 2 and node 17 in Fig. 3 to infer MII. Notice that MII is a subset of MI. Furthermore, no nonmelanoma samples are included in MI.
5.2 AH-CUT
The hierarchical tree obtained by recursive AH-CUT on the
melanoma data set is shown in Fig. 1. Notice that the elements in MI \ MII lie in different subtrees, rooted at nodes 34 and 6, respectively.
Fig. 2. Hierarchical clustering using NORMALIZED-CUT on the distance
matrix of Bittner’s melanoma data set. Notice that the first division in
NORMALIZED-CUT separates MI from the rest. MII and MI \ MII are also
well separated within MI.
5.3 NORMALIZED-CUT
The hierarchical tree obtained using the second top-down
clustering method NORMALIZED-CUT on the melanoma
data set is shown in Fig. 2. Notice that all the samples in MI
are clearly separated from the rest in the first division.
Furthermore, samples in MI \ MII are also shown to be
condensed within one subtree.
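The spirit of NORMALIZED-CUT [26] can be sketched as a spectral bisection: split the similarity graph by the Fiedler vector of the normalized Laplacian. The toy similarity matrix below is illustrative, not the melanoma data.

```python
# A minimal spectral bisection in the spirit of NORMALIZED-CUT [26]:
# split the graph by the Fiedler vector of the normalized Laplacian.
# The similarity matrix W is a toy two-block graph, not the real data.
import numpy as np

def normalized_cut_split(W):
    """W: symmetric nonnegative similarity matrix; returns a boolean
    partition vector approximating the two-way normalized cut."""
    d = W.sum(axis=1)
    D_isqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(W)) - D_isqrt @ W @ D_isqrt
    vals, vecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    fiedler = D_isqrt @ vecs[:, 1]       # generalized Fiedler vector
    return fiedler > np.median(fiedler)

# two triangles joined by two weak edges
W = np.array([[0, 1, 1, .1, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, .1],
              [.1, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, .1, 1, 1, 0]], dtype=float)
side = normalized_cut_split(W)
print(side)  # samples 0-2 on one side, samples 3-5 on the other
```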
5.4 Agglomerative Hierarchical Clustering
In Fig. 3, the hierarchical tree obtained by the bottom-up
agglomerative clustering technique on this data set is
shown. Notice that here also MI \ MII is well defined.
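For reference, average-linkage agglomerative clustering on a precomputed correlation distance matrix, as used for the tree in Fig. 3, can be reproduced with SciPy; the two groups of correlated samples below are synthetic stand-ins.

```python
# Average-linkage agglomerative clustering on a precomputed correlation
# distance matrix, as in Fig. 3. The two groups of correlated samples
# below are synthetic stand-ins for the microarray data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
p1, p2 = rng.normal(size=30), rng.normal(size=30)   # two base profiles
X = np.vstack([p1 + 0.1 * rng.normal(size=(10, 30)),
               p2 + 0.1 * rng.normal(size=(10, 30))])

dist = 1.0 - np.corrcoef(X)        # correlation distance between samples
np.fill_diagonal(dist, 0.0)

# scipy's linkage expects a condensed distance vector
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```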
5.5 Remarks
Instead of AH-CUT, if the tree produced by the agglomerative hierarchical clustering is used to be the first
hypothesis tree T1 , then a third cluster MIII can be found
which is a subset of MII and consists of six samples M92001, UACC-1273, UACC-091, M93-007, UACC-1256, and
UACC-2534. However, this cluster is not considered here,
since the difference between MII and MIII will then be just
three samples, and it is not enough to consider for further
classification.
Fig. 3. Agglomerative hierarchical clustering of the melanoma data set
with average linkage. This algorithm is prone to emitting outliers.
However, node 6 gives the elements in MI and node 17 gives elements
in MII.
5.6 Validation by Clustering Indices
Next, the performance of ECHC is compared with that of k-means and CC using the three validation indices on the respective clusterings of the melanoma data set (see Table 1). To compute these indices for ECHC, MI and MII are taken as the two clusters in this data set and the remaining samples are omitted, since ECHC does not infer any class for them. For the other algorithms, however, all 38 samples are considered, since those algorithms assign a class to every sample, even though some samples might actually be outliers. For k-means, the results are shown for k = 2 and k = 3, and for the algorithm in [15], the results are shown with k = 3.
Using k-means with k = 2, all samples in MI are clustered together, along with four other samples, in one cluster, whereas the clustering with k = 3 mixes some samples from MI with those from the rest. Notice that the validation indices with k = 2 are better than those with k = 3 when k-means is used. However, CC performed very poorly with respect to the considered validation indices; in fact, the mutual information between the runs of CC was only 0.038. Notice
that in all cases, ECHC performs better, since larger values of the Silhouette and Dunne's indices and smaller values of the Davies-Bouldin index signify better clustering.

Fig. 4. The genetic signature of 50 genes discriminating MI from the rest of the samples. The cloneids of the genes are shown on the right-hand side. The color white represents overexpression and the color black represents underexpression of a gene.

Fig. 5. The genetic signature of 50 genes discriminating MII from other samples in MI.
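Two of the three indices used here are available in scikit-learn (Dunn's index is not and is omitted); a minimal example on well-separated synthetic clusters:

```python
# Two of the indices used above, computed with scikit-learn on
# well-separated synthetic clusters (Dunn's index is not in sklearn).
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, .5, (15, 5)), rng.normal(4, .5, (15, 5))])
labels = np.array([0] * 15 + [1] * 15)

sil = silhouette_score(X, labels)      # in [-1, 1], larger is better
dbi = davies_bouldin_score(X, labels)  # >= 0, smaller is better
print(f"Silhouette={sil:.2f}  Davies-Bouldin={dbi:.2f}")
```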
5.7 Genetic Markers and Classification Results
5.7.1 MI versus Rest
To confirm that MI is indeed a primary subtype of melanoma, the data is first divided into two classes, MI versus the rest of the samples, and 50 genetic markers, selected using the correlation-based method described in Section 3, are visualized in Fig. 4. Unsurprisingly, 100 percent classification accuracy is obtained for this partition using NBC.
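Whether this matches the exact scoring of Section 3 is an assumption, but a common correlation-based ranking scores each gene by the absolute Pearson correlation between its expression and a 0/1 class vector, then keeps the 50 top-ranked genes:

```python
# Hypothetical sketch of correlation-based marker selection: score each
# gene by the absolute Pearson correlation between its expression and a
# 0/1 class vector, then keep the 50 top-ranked genes. The data and the
# five planted informative genes are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples = 200, 30
y = np.array([1] * 12 + [0] * 18)               # e.g., MI (1) vs. rest (0)
expr = rng.normal(size=(n_genes, n_samples))
expr[:5] += 3.0 * y                             # plant 5 informative genes

yc = y - y.mean()
ec = expr - expr.mean(axis=1, keepdims=True)
corr = (ec @ yc) / (np.linalg.norm(ec, axis=1) * np.linalg.norm(yc))

top50 = np.argsort(-np.abs(corr))[:50]          # 50 highest-ranked markers
print(sorted(top50[:5].tolist()))
```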
In the following, some of the highly ranked genetic
markers are given for the partition MI versus Rest. The best
marker for MI versus Rest is a gene called MELAN-A
(cloneid 266361, and correlation rank is 0.93), which is
known for melanin synthesis, and is considered to be a
marker for melanoma [37]. Notice that this gene is relatively
overexpressed in MI (white/gray) and underexpressed in
the rest. The second gene, PIRIN (cloneid 234237, rank 0.86), is not well known for differentiating melanoma from other tissues and perhaps should be investigated further. The third gene, WNT5A (cloneid 357278, rank
0.83) is again well known to be overexpressed in aggressive
melanoma tumors [38] and has function in cell growth and
differentiation. However, according to [21], the primary
cluster in this data set (MI cluster) is less aggressive, and
this is reflected by the underexpression of WNT5A in MI.
Notice that the work in [21] omitted the sample TC-F027, which is very similar to the primary cluster, and did not obtain the best marker MELAN-A. However, it also found the gene WNT5A as a marker for MI. The sixth gene, synuclein alpha (symbol SNCA, cloneid 812276, rank 0.80), is again a marker in [21]. Another common gene
between these two studies is CD63 (cloneid 39781). CD63 is
associated with melanoma progression and is known to be a
marker for this disease [37]. Another interesting gene found
by this method is syntenin (cloneid 813533, rank 0.78), which
is known to be overexpressed in aggressive melanoma
tissues [39]. However, as expected, the expression of this
gene is not very high in the MI cluster, since these samples are not overly aggressive, as mentioned earlier.
5.7.2 MII versus MI \ MII
When using NORMALIZED-CUT and agglomerative clustering, notice that MII and MI \ MII are included in two sister subtrees. Thus, it seems that there is a further division of MI into MII and MI \ MII. MII contains nine samples and
TABLE 1
Comparison of Various Validation Indices
for Melanoma Data Set
the remaining 11 samples of MI are included in MI \ MII.
Fig. 5 gives the genetic signature with respect to the first 50
highly ranked genes for this subdivision. Furthermore,
using the same classification method with NBC, an
accuracy of 100 percent is obtained.
The first 50 genetic markers for the partition MII versus
MI \ MII are ranked between 0.96 and 0.8, showing
significant distinction between these two types. However, this partition has not yet been well explored by biologists. The
names of the first six genes with rank 0.9 are tyrosinase
(cloneid 271985) (melanin biosynthesis from tyrosine),
cloneid 42993, cloneid 897563, IFIT1 (cloneid 823696),
RRS1 (cloneid 562867), and ACY1 (cloneid 51293).
6 BREAST CANCER
With one million new cases in the world each year, breast
cancer is the most common malignancy in women and
constitutes 18 percent of all female cancers [40]. Although family history makes only a small minority (fewer than 20 percent) of women prone to breast cancer, a variety of unknown causes are suspected to alter gene expression and give rise to various sporadic breast cancers. Only a fraction of diagnosed cases (as low as 50 to 60 percent) benefit from available medicines such as tamoxifen [41] and Herceptin, a clear indication of the heterogeneity of this disease. In [42], [43], and [44], small sets of genetic markers have been provided to distinguish hereditary breast cancers, caused by mutations in the genes BRCA-1 and BRCA-2, from sporadic breast tumors. However, the taxonomy of sporadic tumors still remains a mystery. A few works have concentrated on finding genetic markers separating patients with good prognosis from those with bad prognosis, e.g., [4], [45], and [46]; however, these markers are mostly not common across studies. In [3] and [47], a classification of sporadic breast cancer into five groups was given: "basal," "luminal A," "luminal B," "ERBB2," and "normal breast-like." However, except for the first two, these groups were not found to be prominent in the data set of van't Veer et al. [4] (perhaps due to a difference in the population). Nevertheless, the quest to find subtypes of sporadic breast cancer, and ultimately to eradicate its inherent causes, continues worldwide.
6.1 Consensus Clusters
In this work, 78 samples of sporadic breast tumors from [4]
are used, from which only 76 samples in total are
considered after removing samples numbered 53 and 54,
due to their large number of missing values. Samples that metastasized in less than five years are prefixed "LessSample," and those that did not are prefixed "MoreSample." Using ECHC, it
can be inferred that there are three “consensus clusters”,
called BI, BII, and BIII. The clusters BI and BII are, in fact,
those corresponding to the subtypes “luminal A” and
“basal,” identified in [3], modulo a couple of inclusions/
exclusions (see the remarks). Notice that BI is generated by
matching the subtree rooted at node 1 in Fig. 6 with the one
at node 1 in Fig. 7 and node 3 in Fig. 8. BII is generated by
Fig. 6. Hierarchical clustering using AH-CUT on the distance matrix of
van't Veer's breast cancer data set. Notice that BI is separated from BII in
the first division, even though some unclustered samples are mixed with
BI samples. Also, see that BIII is a subgroup of BII.
matching the subtree rooted at node 96 in Fig. 6 with the one
at node 80 in Fig. 7 and node 62 in Fig. 8. Finally, BIII is
generated by matching the subtree rooted at node 132 in
Fig. 6 with the one at node 118 in Fig. 7 and node 66 in Fig. 8.
Fig. 7. Hierarchical clustering using NORMALIZED-CUT on the distance
matrix of van’t Veer’s breast cancer data set. Notice that in the first
division by NORMALIZED-CUT, BI is separated alone from BII, except the
occurrence of one sample from UNC. However, samples in BII are
mixed with some samples from UNC, and thus, the clear separation of
BIII and BII \ BIII is not achieved in this tree.
More interestingly, a new consensus cluster BIII is found, which is a subgroup of BII. To the best of the author's knowledge, this subgroup was hitherto unknown, and it can be inferred as a subgroup of the "basal" subtype. See the genetic signature of BIII versus BII \ BIII in Fig. 12.

Fig. 8. Agglomerative hierarchical clustering of van't Veer's breast cancer data set with average linkage. See the clear separation of BII from BI and UNC (unclustered) samples in the first division. Also notice the division of BII into BIII and BII \ BIII.
6.2 AH-CUT
In Fig. 6, the hierarchical tree obtained by using AH-CUT on
van't Veer's data set of sporadic breast tumors is shown.
In this tree, BI is mostly separated from the other two
consensus clusters BII and BIII in the first division.
TABLE 2
Comparison of Various Validation Indices
for Breast Cancer Data Set
However, the samples in BI are mixed with some
unclustered samples. All unclustered samples, which are
not included in BI and BII, are grouped together and called
UNC in the following.
6.3 NORMALIZED-CUT
Next, the second top-down clustering method of
NORMALIZED-CUT is used to find a tree in van’t Veer’s
data set (see Fig. 7 for the resulting tree). In this tree, BI
is totally separated from the consensus cluster BII in the
first division. However, sample MoreSample5 from UNC
is included in the same side as BI.
6.4 Agglomerative Hierarchical Clustering
Fig. 8 shows the tree obtained by using agglomerative
hierarchical clustering on this data set. In this tree, BII is
clearly separated from BI and other unclustered samples in
the first division. Furthermore, BII is nicely divided into BIII
and BII \ BIII in the second division of BII.
A comparison of the above results with those produced in [3] on van't Veer's data set shows only minor differences between BI and their "luminal A" subtype, and between BII and their "basal" subtype. For instance: 1) BII includes the samples LessSample56 and MoreSample28, which [3] does not place in "basal," while this work excludes MoreSample8 and LessSample71 from BII, which [3] includes in "basal"; this difference is probably because Sorlie et al. [3] considered an additional 19 test samples together with the training data in [4], which might affect the hierarchical clustering. 2) "Luminal A" in [3] includes LessSample45, LessSample78, and MoreSample34, whereas ECHC leaves them in UNC, not in BI. 3) All but one of the samples in their "luminal B" are in UNC; the exception, LessSample63, is included in BI.
6.5 Remarks
If the tree produced by either of the other two algorithms is used as T1, the result excludes the three samples LessSample55, LessSample63, and MoreSample9 from the cluster BI. However, an inspection of the three trees shows that, except in the agglomerative hierarchical clustering, these three samples lie well within the group of BI. Therefore, they are not omitted from BI.
6.6 Cluster Validation Using Various Indices
For the breast cancer data set, again the performance of
ECHC is compared with k-means and CC using the three
validation indices (see Table 2).
For the k-means algorithm, the values k = 2, 3, and 4 are considered for clustering the breast samples. It turns out that the validation indices with k = 2 are better than the others (results for k = 3, 4 are not shown here). Notice that
Fig. 9. The genetic signature of 50 genes discriminating BI from BII.
the two clusters for k = 2 using k-means consist mainly of (BII plus seven samples from UNC) versus (BI plus the rest of UNC). Notice that the Silhouette index is better for ECHC than for the other two algorithms. However, Dunne's index and the Davies-Bouldin index for k-means with k = 2 were better than those of ECHC for this data set.
For the algorithm in [15], the minimum and maximum numbers of clusters can be specified as 2 and 3, 4, etc. It turns out that the best results are obtained when the number of clusters is 3. However, the mutual information between runs was only 0.0178, which explains the poor performance of CC on this data set.
6.7 Genetic Markers and Classification Results
6.7.1 BI versus BII
In this data set, the first clusters BI (39 samples) and BII
(14 samples) are found to be disjoint. Fig. 9 shows a clear
genetic signature for the samples in BI versus those in BII.
Again, using NBC, a classification accuracy of 100 percent is
obtained. In the following, a few top-ranked genes are
named, and they can be further explored as potential
markers for BI versus BII. The best gene to distinguish BI
from BII is the androgen receptor gene (AR, unigene ID
NM_000044, rank 0.88). It is known that AR is expressed in a
significant number of ER-negative (no estrogen receptor)
cases and shows significant associations with important
clinical and pathologic prognostic factors [48]. Notice that
most of the samples in BII are ER-negative. The second gene
Fig. 10. The genetic signature of 50 genes discriminating BI from the
unclustered samples.
is FOXA1 (ID NM_004496, rank 0.88). FOXA1 has been
recently promoted as a marker for “luminal A” subtype (the
BI cluster) in [49]. The next known gene is GATA3 (ID
NM_002051, rank 0.84). GATA binding protein 3 (GATA3) is
a transcriptional activator highly expressed by the luminal
epithelial cells in the breast. The study in [50] found that
GATA3 was one of the top genes with low expression in
invasive carcinomas associated with negative estrogen
receptor, progesterone receptor (such as the samples in
BII). Other interesting genes include some known oncogenes
like RAB6 (ID: NM_016577), MYB (ID: NM_005375) and the
genes XBP1 (X-box binding protein 1, ID: NM_005080),
GABRP (ID: NM_014211), SPDEF (ID: NM_012391), TBC1D9
(ID: AB020689), ESR1 (estrogen receptor 1, ID: NM_000125),
TRIM2 (ID: AB011089), FOXC1 (ID: NM_001453), etc.
Among the above genes, estrogen receptor 1 (ESR1) is the
most explored gene associated with breast cancer. Also
notice that many works claim that GABRP is downregulated in 75 percent of breast cancers [51]. However, in BII this gene is always upregulated, and it can thus be considered a marker in addition to the genes from AR to ESR1 listed above.
6.7.2 BI versus UNC
Since there are 39 samples in BI and 14 samples in BII, leaving 23 samples in UNC, it is also explored whether another subgroup besides BI, BII, and BIII might exist in this data set.
data set. In Fig. 10, the genetic signature of 50 genes for BI
Authorized licensd use limted to: IE Xplore. Downlade on May 10,2 at 19:507 UTC from IE Xplore. Restricon aply.
149
Fig. 11. The genetic signature of 50 genes discriminating BII from the
unclustered samples.
versus UNC is given. Surprisingly, again a classification
accuracy of 100 percent is obtained for this case. This implies that with a larger data set, one may actually find further subtypes; otherwise, all these samples might simply be outliers. For this reason, the details of a few genetic markers for this division are provided. The first few top-ranked genes have IDs AF052101, AL050227 (PTGER3, prostaglandin E receptor 3), NM_001255 (CDC20, cell division cycle 20 homolog), NM_007274 (ACOT7, acyl-CoA thioesterase 7), and NM_004701 (CCNB2, cyclin B2). Notice that the correlation ranks of these genes range from 0.75 to 0.71. This may indicate that there are either no strong clusters in UNC or more than one subgroup hidden within it.
6.7.3 BII versus UNC
In Fig. 11, the genetic signature of 50 genes distinguishing
BII from UNC is shown. In this case, a classification accuracy of 92.31 percent is obtained; one sample from UNC is misclassified as a BII sample. This result can be explained by the fact that both AH-CUT and NORMALIZED-CUT place many samples from UNC in a sibling branch of BII in the very first division.
However, BII is distinctly different from those unclustered samples, as found by ECHC. Thus, it is worth noting a
few genetic markers for this division in the following.
The high-ranked genes for this division are given by Contig54477_RC, NM_018014 (BCL11A), NM_012391 (SPDEF), Contig56765_RC, NM_014211 (GABRP), Contig47814_RC, NM_004496 (FOXA1), etc. The correlation ranks of these genes are 0.87 to 0.82. Notice that the genes BCL11A, SPDEF, GABRP, and FOXA1 also appear as markers in BI versus BII.

6.7.4 BIII versus BII \ BIII
Finally, the last consensus cluster BIII (seven samples), which is a subgroup of BII, seems to bring new information to breast cancer. This is also witnessed by the clear genetic signature provided in Fig. 12. Notice that BIII is very different from BII \ BIII in both AH-CUT and agglomerative clustering. Furthermore, again a 100 percent classification accuracy is obtained for this division of BII.

Fig. 12. The genetic signature of 50 genes discriminating BIII from other samples in BII.

To explore this curious finding, note the best-ranked genes for this division: NM_003559 (PIP5K2B), AB029004 (ERC1), NM_003624 (RANBP3), AK001362 (DCBLD2), Contig50879_RC, NM_002339 (LSP1), Contig41262_RC, NM_003809 (TNFSF13, tumor necrosis factor superfamily member 13), NM_016267 (VGLL1), etc. The ranks of these genes range from 0.86 to 0.81. However, not much biological experimentation has been done to find subgroups within BII or "basal" tumors, and we hope that more wet-lab experiments can be designed in the future to support these findings.

6.7.5 Validation of Data from van de Vijver et al.
In the case of breast cancer, NBC is again used to classify the microarray data provided by van de Vijver et al. [52]. First, three consensus clusters are found in this data set, as before. The first cluster BI' consists of 141 samples, the second cluster BII' consists of 34 samples, and the third cluster BIII' consists of just 5 samples. As in van't Veer's data, BI' is disjoint from BII', and BIII' is a subset of BII'. Due to the small number of samples in BIII', only BI' and BII' are considered for validation. Using the genetic markers from BI versus BII, BI versus UNC, and BII versus UNC, classification accuracies of 100 percent, 82.75 percent, and 88.3 percent, respectively, are obtained in the data set of van de Vijver et al.

7 CONCLUSIONS
In this paper, a new method has been introduced for obtaining clusters in a data set (e.g., subtypes of a disease) by taking an exploratory "consensus" of a number of hierarchical clustering algorithms. Each cluster obtained by this method corresponds to subtrees of similar sizes in different hypothesized trees that agree on their elements but may differ in exact topology. This avoids the usual approach of finding clusters by cutting trees at certain heights. Unlike methods such as the k-means algorithm, this method also requires no prior knowledge of the number of clusters.

In the future, this method will be explored further using more clustering algorithms. The hypotheses on melanoma and breast cancer will also be validated using other data sets.

ACKNOWLEDGMENTS
This work was carried out while the author was a Research Academic at the University of Newcastle, Australia, and it was inspired by the author's work while she was supported by the ARC Center for Bioinformatics, Australia. The author is also thankful to Dr. Pablo Moscato and Mr. Wagner Costa from the University of Newcastle for many enlightening discussions on breast cancer during that period.
REFERENCES
[1] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer, 2001.
[2] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, "Cluster Analysis and Display of Genome-Wide Expression Patterns," Proc. Nat'l Academy of Sciences USA, vol. 95, pp. 14863-14868, 1998.
[3] T. Sorlie, R. Tibshirani, J. Parker, T. Hastie, J.S. Marron, A. Nobel, S. Deng, H. Johnsen, R. Pesich, S.G.J. Demeter, C.M. Perou, P.E. Lonning, P.O. Brown, A. Børresen-Dale, and D. Botstein, "Repeated Observation of Breast Tumor Subtypes in Independent Gene Expression Data Sets," Proc. Nat'l Academy of Sciences USA, vol. 100, no. 14, pp. 8418-8423, July 2003.
[4] L.J. van't Veer, H. Dai, M.J. van de Vijver, Y.D. He, A.A.M. Hart, M. Mao, H.L. Peterse, K. van der Kooy, M.J. Marton, A.T. Witteveen, G.J. Schreiber, R.M. Kerkhoven, C. Roberts, P.S. Linsley, R. Bernards, and S.H. Friend, "Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer," Nature, vol. 415, pp. 530-536, 2002.
[5] E.N. Adams, "Consensus Techniques and the Comparison of Taxonomic Trees," Systematic Zoology, vol. 21, no. 4, pp. 390-397, 1972.
[6] D.A. Neumann, "Faithful Consensus Methods for n-Trees," Math. Biosciences, vol. 63, pp. 271-287, 1983.
[7] R.R. Sokal and F.J. Rohlf, "Taxonomic Congruence in the Leptopodomorpha Re-Examined," Systematic Zoology, vol. 30, pp. 304-325, 1981.
[8] T. Margush and F.R. McMorris, "Consensus n-Trees," Bull. Math. Biology, vol. 43, pp. 239-244, 1981.
[9] M. Wilkinson, "Common Cladistic Information and Its Consensus Representation: Reduced Adams and Reduced Cladistic Consensus Trees and Profiles," Systematic Biology, vol. 43, pp. 343-368, 1994.
[10] A.M. Krieger and P.E. Green, "A Generalized Rand-Index Method for Consensus Clustering of Separate Partitions of the Same Data Base," J. Classification, vol. 16, pp. 63-89, 1999.
[11] A. Topchy, B. Minaei-Bidgoli, A.K. Jain, and W.F. Punch, "Adaptive Clustering Ensembles," Proc. 17th Int'l Conf. Pattern Recognition (ICPR '04), 2004.
[12] S. Monti, P. Tamayo, J. Mesirov, and T. Golub, "Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data," Machine Learning, vol. 52, no. 1/2, pp. 91-118, 2003.
[13] S. Swift, A. Tucker, V. Vinciotti, N. Martin, C. Orengo, X. Liu, and P. Kellam, "Consensus Clustering and Functional Interpretation of Gene-Expression Data," Genome Biology, vol. 5, no. R94, 2004.
[14] V. Filkov and S. Skiena, "Heterogeneous Data Integration with the Consensus Clustering Formalism," Proc. Int'l Workshop Data Integration in the Life Sciences (DILS '04), vol. 2994, pp. 110-123, 2004.
[15] T. Grotkjaer, O. Winther, B. Regenberg, J. Nielsen, and L.K. Hansen, "Robust Multi-Scale Clustering of Large DNA Microarray Datasets with the Consensus Algorithm," Bioinformatics, vol. 22, no. 1, pp. 58-67, 2006.
[16] S. Kirkpatrick, C.D.J. Gelatt, and M.P. Vecchi, "Optimization by Simulated Annealing," Science, vol. 220, pp. 671-680, 1983.
[17] Y. Wakabayashi, "The Complexity of Computing Medians of Relations," Resenhas IME-USP, vol. 3, pp. 323-349, 1998.
[18] M. Krivanek and J. Moravek, "Hard Problems in Hierarchical-Tree Clustering," Acta Informatica, vol. 23, pp. 311-323, 1986.
[19] B. Raskutti, H. Ferra, and A. Kowalczyk, "Combining Clustering and Co-Training to Enhance Text Classification Using Unlabelled Data," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), 2002.
[20] A. Strehl and J. Ghosh, "Cluster Ensembles—A Knowledge Reuse Framework for Combining Partitionings," Proc. 18th Nat'l Conf. Artificial Intelligence (AAAI '02), pp. 93-98, 2002.
[21] M. Bittner, P. Meltzer, Y. Chen, Y. Jiang, E. Seftor, M. Hendrix, M. Radmacher, R. Simon, Z. Yakhini, A. Ben-Dor, N. Sampas, E. Dougherty, E. Wang, F. Marincola, C. Gooden, J. Lueders, A. Glatfelter, P. Pollock, J. Carpten, E. Gillanders, D. Leja, K. Dietrich, C. Beaudry, M. Berens, D. Alberts, V. Sondak, N. Hayward, and J. Trent, "Molecular Classification of Cutaneous Malignant Melanoma by Gene Expression Profiling," Nature, vol. 406, no. 3, 2000.
[22] P. Mahata, W. Costa, C. Cotta, and P. Moscato, "Hierarchical Clustering, Languages and Cancer," Proc. EvoWorkshops '06, pp. 67-78, 2006.
[23] R. Rizzi, P. Mahata, W. Costa, and P. Moscato, "Hierarchical Clustering Using Arithmetic-Harmonic Cut," submitted, 2007.
[24] I. Dyen, J.B. Kruskal, and P. Black, "An Indo-European Classification: A Lexicostatistical Experiment," Trans. Am. Philosophical Soc., New Series, vol. 82, no. 5, pp. 1-132, 1992.
[25] D.T. Ross, U. Scherf, M. Eisen, C. Perou, C. Rees, P. Spellman, V. Iyer, S. Jeffrey, M. Rijn, M. Waltham, A. Pergamenschikov, J.C. Lee, D. Lashkari, D. Shalon, T. Myers, J.N. Weinstein, D. Botstein, and P. Brown, "Systematic Variation in Gene Expression Patterns in Human Cancer Cell Lines," Nature Genetics, vol. 24, pp. 227-235, Mar. 2000.
[26] J. Shi and J. Malik, "Normalized Cuts and Image Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.
[27] W.M. Rand, "Objective Criteria for the Evaluation of Clustering Methods," J. Am. Statistical Assoc., vol. 66, pp. 846-850, 1971.
[28] L. Hubert and P. Arabie, "Comparing Partitions," J. Classification, vol. 2, no. 1, pp. 193-218, 1985.
[29] P. Rousseeuw, “Silhouettes: A Graphical Aid to the Interpretation
and Validation of Cluster Analysis,” J. Computational and Applied
Math., vol. 20, pp. 53-65, 1987.
[30] J. Bezdek and N. Pal, “Some New Indexes of Cluster Validity,” IEEE
Trans. Systems, Man, and Cybernetics, vol. 28 B, pp. 301-315, 1998.
[31] D.L. Davies and D.W. Bouldin, “A Cluster Separation Measure,”
IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 1,
no. 2, pp. 224-227, Feb. 1979.
[32] F. Azuaje and N. Bolshakova, “Clustering Genome Expression
Data: Design and Evaluation Principles,” Understanding and Using
Microarray Analysis Techniques: A Practical Guide, D. Berrar,
W. Dubitzky, and M. Granzow, eds., Springer-Verlag, 2002.
[33] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek,
J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri,
C.D. Bloomfield, and E.S. Lander, “Molecular Classification of
Cancer: Class Discovery and Class Prediction by Gene Expression
Monitoring,” Science, vol. 286, pp. 531-537, 1999.
[34] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning
Tools and Techniques, second ed. Morgan Kaufmann, 2005.
[35] M. Smolkin and D. Ghosh, “Cluster Stability Scores for Microarray
Data in Cancer Studies,” BMC Bioinformatics, vol. 4, no. 36, 2003.
[36] A. Ben-Dor, R. Shamir, and Z. Yakhini, “Clustering Gene Expression
Patterns,” J. Computational Biology, vol. 6, pp. 281-297, 1999.
[37] T.B. Lewis, J.E. Robison, R. Bastien, B. Milash, K. Boucher,
W.E. Samlowski, S.A. Leachman, R.D. Noyes, C.T. Wittwer,
L. Perreard, and P.S. Bernard, “Molecular Classification of
Melanoma Using Real-Time Quantitative Reverse TranscriptasePolymerase Chain Reaction,” Cancer, vol. 104, no. 8, pp. 1678-1686,
Oct. 2005.
[38] I.M. Bachmann, O. Straume, H.E. Puntervoll, M.B. Kalvenes, and
L.A. Akslen, “Importance of p-Cadherin, Beta-Catenin, and wnt5a/
frizzled for Progression of Melanocytic Tumors and Prognosis
in Cutaneous Melanoma,” Clinical Cancer Research, vol. 11, no. 24,
Pt. 1, pp. 8606-8614, 2005.
[39] H. Boukerche, Z.Z. Su, L. Emdad, D. Sarkar, and P.B. Fisher,
“mda-9/syntenin Regulates the Metastatic Phenotype in Human
Melanoma Cells by Activating Nuclear Factor-Kappab,” Cancer
Research, vol. 67, no. 4, pp. 1812-1822, Feb. 2007.
[40] K. McPherson, C.M. Steel, and J.M. Dixon, “ABC of Breast
Diseases: Breast Cancer—Epidemiology, Risk Factors, and Genetics,” British Medical J., vol. 321, pp. 624-628, 2000.
[41] L.C. Dorssers, S. van der Flier, A.B.T. van Agthoven, J. Veldscholte,
E.M. Berns, J. Klijn, L.V. Beex, and J.A. Foekens, “Tamoxifen
Resistance in Breast Cancer: Elucidating Mechanisms,” Drugs,
vol. 61, no. 12, pp. 1721-1733, 2001.
[42] I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner,
R. Simon, P. Meltzer, B. Gusterson, M. Esteller, O.P. Kallioniemi,
B. Wilfond, A. Borg, J. Trent, M. Raffeld, Z. Yakhini, A. Ben-Dor,
E. Dougherty, J. Kononen, L. Bubendorf, S. Fehrle, S. Pittaluga,
S. Gruvberger, N. Loman, O. Johannsson, H. Olsson, and G. Sauter, “Gene-Expression Profiles in Hereditary Breast Cancer,” New England J.
Medicine, vol. 344, no. 8, pp. 539-548, 2001.
[43] K.E. Lee, N. Sha, E.R. Dougherty, M. Vannucci, and B.K. Mallick,
“Gene Selection: A Bayesian Variable Selection Approach,”
Bioinformatics, vol. 19, pp. 90-97, 2003.
[44] P. Mahata and K. Mahata, “Selecting Differentially Expressed
Genes Using Minimum Probability of Classification Error,”
J. Biomedical Informatics, vol. 40, no. 6, pp. 775-786, 2007.
[45] Y. Pawitan, J. Bjöhle, L. Amler, A. Borg, S. Egyhazi, P. Hall, X. Han,
L. Holmberg, F. Huang, S. Klaar, E.T. Liu, L. Miller, H. Nordgren,
A. Ploner, K. Sandelin, P.M. Shaw, J. Smeds, L. Skoog, S. Wedrén,
and J. Bergh, “Gene Expression Profiling Spares Early Breast
Cancer Patients from Adjuvant Therapy: Derived and Validated
in Two Population-Based Cohorts,” Breast Cancer Research, vol. 7,
pp. R953-R964, 2005.
[46] S. Paik, S. Shak, G. Tang, C. Kim, J. Baker, M. Cronin, F.L. Baehner,
M.G. Walker, D. Watson, T. Park, W. Hiller, E.R. Fisher,
L. Wickerham, J. Bryant, and N. Wolmark, “A Multigene Assay to
Predict Recurrence of Tamoxifen-Treated, Node-Negative Breast
Cancer,” New England J. Medicine, vol. 351, no. 27, pp. 2817-2826, 2004.
[47] T. Sørlie, C.M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen,
T. Hastie, M.B. Eisen, M. van de Rijn, S.S. Jeffrey, T. Thorsen,
H. Quist, J.C. Matese, P.O. Brown, D. Botstein, P.E. Lønning,
and A.-L. Børresen-Dale, “Gene Expression Patterns of Breast
Carcinomas Distinguish Tumor Subclasses with Clinical Implications,” Proc. Nat’l Academy of Sciences USA, vol. 98, no. 19,
pp. 10869-10874, Sept. 2001.
[48] S.N. Agoff, P.E. Swanson, H. Linden, S.E. Hawes, and T.J. Lawton,
“Androgen Receptor Expression in Estrogen Receptor-Negative
Breast Cancer. Immunohistochemical, Clinical, and Prognostic
Associations,” Am. J. Clinical Pathology, vol. 120, no. 5, pp. 725-731,
Nov. 2003.
[49] H. Nakshatri and S. Badve, “Foxa1 as a Therapeutic Target for
Breast Cancer,” Expert Opinion on Therapeutic Targets, vol. 11, no. 4,
pp. 507-514, Apr. 2007.
[50] R. Mehra, S. Varambally, L. Ding, R. Shen, M.S. Sabel, D. Ghosh,
A.M. Chinnaiyan, and C.G. Kleer, “Identification of GATA3 as a
Breast Cancer Prognostic Marker by Global Gene Expression
Meta-Analysis,” Cancer Research, vol. 65, pp. 11259-11264,
Dec. 2005.
[51] M. Zafrakas, M. Chorovicer, I. Klaman, G. Kristiansen, P. Wild,
U. Heindrichs, R. Knüchel, and E. Dahl, “Systematic Characterization of GABRP Expression in Sporadic Breast Cancer and Normal
Breast Tissue,” Int’l J. Cancer, vol. 118, no. 6, pp. 1453-1459, 2006.
[52] M.J. van de Vijver, Y.D. He, L.J. van’t Veer, H. Dai, A.A.M. Hart,
D.W. Voskuil, G.J. Schreiber, J.L. Peterse, C. Roberts, M.J. Marton,
M. Parrish, D. Atsma, A.T. Witteveen, A. Glas, L. Delahaye,
T. van der Velde, H. Bartelink, S. Rodenhuis, E.T. Rutgers,
S.H. Friend, and R. Bernards, “A Gene-Expression Signature as a
Predictor of Survival in Breast Cancer,” New England J. Medicine,
vol. 347, no. 25, pp. 1999-2009, 2002.
Pritha Mahata received the BE degree in
electrical engineering from Jadavpur University,
Kolkata, India, in 1996, the ME degree in
computer science and automation from the
Indian Institute of Science, Bangalore, India, in
1999, and the PhD degree in computer science
from Uppsala University, Sweden, in 2005.
Currently, she is a research fellow in the School
of Information Technology and Electrical Engineering, University of Queensland, Australia.
She was a postdoctoral fellow from 2005 to 2007 in the School of
Electrical Engineering and Computer Science, University of Newcastle,
Australia. She also worked at Asea Brown Boveri Ltd., India, from 1996
to 1997 and Texas Instruments, India, from 1999 to 2000. Her main
research interests include verification and validation of software
systems, real-time communication protocols, bioinformatics, and
machine learning.
For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.