Exploratory Consensus of Hierarchical Clusterings for Melanoma and Breast Cancer

Pritha Mahata

Abstract—Finding subtypes of heterogeneous diseases is one of the biggest challenges in biology. Often, clustering is used to provide a hypothesis for the subtypes of a heterogeneous disease. However, there are usually discrepancies between the clusterings produced by different algorithms. This work introduces a simple method which provides the most consistent clusters across three different clustering algorithms for a melanoma and a breast cancer data set. The method is validated by showing that the Silhouette, Dunn's, and Davies-Bouldin's cluster validation indices are better for the proposed algorithm than those obtained by k-means and another consensus clustering algorithm. The hypotheses of the consensus clusters on both data sets are corroborated by clear genetic markers and 100 percent classification accuracy. In Bittner et al.'s melanoma data set, a previously hypothesized primary cluster is recognized as the largest consensus cluster, and a new partition of this cluster into two subclusters is proposed. In van't Veer et al.'s breast cancer data set, the previously proposed "basal" and "luminal A" subtypes are clearly recognized as the two predominant clusters. Furthermore, a new hypothesis is provided about the existence of two subgroups within the "basal" subtype in this data set. The clusters of van't Veer's data set are also validated by the high classification accuracy obtained on the data set of van de Vijver et al.

Index Terms—Consensus clustering, melanoma, breast cancer.

The author is with the School of Information Technology and Electrical Engineering, University of Queensland, Australia. E-mail: [email protected]. Manuscript received 3 Apr. 2007; revised 2 Nov. 2007; accepted 3 Mar. 2008; published online 14 Mar. 2008.

1 INTRODUCTION

WITH the advent of microarray data, the last decade has seen a collective effort toward finding subtypes of heterogeneous diseases like melanoma, breast cancer, and ovarian cancer. The traditional tools for subtype detection include unsupervised methods, such as hierarchical clustering (e.g., single-link, average-link, and complete-link agglomerative clustering) or iterative relocation-based clustering (e.g., k-means [1]).

If a hierarchical clustering algorithm (e.g., [2]) is used, then it is often difficult to infer the number of actual subtypes hidden in the data: it can be anything between 1 and n, where n is the number of samples in the data. A common technique is to cut the tree at different levels starting below the root and to check whether the generated clusters are well defined. However, the trees produced by different hierarchical clustering algorithms can be very different from each other, even if the corresponding dendrograms are cut at the same level. Not surprisingly, there have been many controversies about proposed subtypes within different heterogeneous cancer data sets. For instance, in [3], five subtypes of sporadic breast cancers were proposed, but only two of the five were found relatively consistently in the sporadic breast cancer data set produced by van't Veer et al. [4]. Thus, better clustering algorithms are needed which can yield more definitive inferences about the subtypes of heterogeneous diseases.

If an iterative relocation-based method is used, the results are often dependent on the initialization of the algorithm.
This tendency is even more visible when the data set includes noisy gene expressions. Furthermore, the number of clusters has to be chosen a priori for these methods.

One way to solve this problem is to take a consensus of the trees generated by various algorithms, and there are various ways to do so. Consensus trees, first developed by Adams [5], have long been used to combine information from two or more different classifications of the same set of Operational Taxonomic Units into a single tree. Adams' method is a top-down method which recursively finds consensus clusters by intersecting clusters occurring at a number of specified heights in the trees being compared. In this method, the height at which clusters are compared is a function of which consensus clusters were found in the previous intersection step. For example, consider two binary trees. The intersection of the left and right clusters at the roots of both trees results in four clusters (some of which may be empty). Then, the nearest common ancestor (least upper bound, or LUB) of these clusters is sought in both trees, and an edge from it to each nonempty cluster is drawn. A similar method is described in [6], where the consensus of k clusters includes all clusters formed by intersecting any set of k clusters, all of the same height and one from each tree. These methods are capable of producing consensus clusters that do not exist in any of the original trees being compared.

Strict consensus [7] and majority consensus [8] are cluster-counting methods: they rely on tallying the number of trees in which a particular cluster occurs. In the strict consensus method, a cluster must occur in every tree in order to be included in the consensus tree. In the majority consensus method, the cluster must occur in more than half of the trees in order to be included.
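To make these cluster-counting rules concrete, the following minimal Python sketch tallies how often each cluster (viewed simply as a set of leaf names) occurs across a collection of trees. The flat representation of a tree as a collection of leaf sets and the toy leaf names are illustrative assumptions made here, not part of [7] or [8].

```python
from collections import Counter
from itertools import chain

def counted_consensus(trees, rule="majority"):
    # Each tree is represented, for illustration, by the set of clusters
    # it contains, where a cluster is a frozenset of leaf names.
    counts = Counter(chain.from_iterable(set(t) for t in trees))
    # Strict rule: the cluster must occur in every tree;
    # majority rule: in more than half of the trees.
    needed = len(trees) if rule == "strict" else len(trees) // 2 + 1
    return [c for c, k in counts.items() if k >= needed]

# Toy example with three clusterings of the leaves {a, b, c, d}.
t1 = {frozenset("ab"), frozenset("cd")}
t2 = {frozenset("ab"), frozenset("c"), frozenset("d")}
t3 = {frozenset("ab"), frozenset("cd")}
print(counted_consensus([t1, t2, t3], "majority"))  # {a,b} and {c,d} qualify
print(counted_consensus([t1, t2, t3], "strict"))    # only {a,b} qualifies
```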
In the last decade, a number of works again concentrated on optimization-based methods for finding "consensus clusters" (e.g., [9], [10], [11], [12], [13], [14], and [15]). We recall some of these methods next; they usually have high computational complexity, as mentioned below.

In [9], a topological consensus of several trees is sought to infer the best possible topology to be fitted to the data. In [10], a weighted Rand measure, representing the degree of association between any given clustering and the final to-be-found amalgamated clustering, is maximized (notice that this is an NP-hard problem). In [11], an adaptive clustering ensemble method is proposed. This method is essentially an extension of the supervised boosting algorithm to the context of consensus clustering. The algorithm generates the individual partitions in the ensemble sequentially by clustering specially selected subsamples of the given data set. New subsamples are drawn to focus on the difficult regions of the feature space: the probability that a sample will be picked depends on the consistency of its previous assignments in the ensemble.

Monti et al. [12] also use resampling to simulate perturbations of the original data set. In essence, they apply a given clustering algorithm to each resampled version of the data set and compute a connectivity matrix (an $n \times n$ matrix, where $n$ is the number of elements to be clustered; each entry takes a value from $\{0, 1\}$, the value being 1 if the pair of elements is in the same cluster and 0 otherwise). From several such connectivity matrices, a consensus matrix is then computed as a properly normalized sum of all the connectivity matrices (see [12] for details). The values in the consensus matrix range between 0 and 1; a value close to 1 (0) means that most (hardly any) runs cluster the pair of elements together. The consensus matrix can be used as a visualization tool to find the members of the clusters and the actual number of clusters.

The authors of [13] employ another "consensus clustering" method and apply it to gene expression data. They start with an agreement matrix $A$ of size $n \times n$, where $n$ is the number of elements to cluster. Each element $A_{ij}$, at the $i$th row and $j$th column of $A$, denotes the number of clustering methods which put elements $i$ and $j$ in the same cluster. Then, a consensus clustering is obtained by maximizing the sum of the scores $f(G_i)$ of the to-be-found clusters $G_i$ with sizes $s_i$, where $f(G_i)$ for nonzero-sized clusters is defined as follows:

$$f(G_i) = \sum_{j=1}^{s_i - 1} \sum_{k=j+1}^{s_i} \left( A_{G_{ij} G_{ik}} - \alpha \right),$$

with $\alpha$ being a user-defined parameter for the agreement threshold, and $f(G_i) = 0$ if $s_i = 0$. To solve this optimization problem, the authors of [13] use a probabilistic meta-algorithm called simulated annealing [16].
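As an aside, the agreement-matrix score of [13] is easy to state in code. The sketch below builds $A$ from label vectors and evaluates $f(G)$ for one candidate cluster; the label-vector input format and the threshold value are illustrative assumptions, and the summand $(A_{G_{ij} G_{ik}} - \alpha)$ follows the reconstruction given above.

```python
import numpy as np

def agreement_matrix(clusterings, n):
    # A[i, j] = number of clustering methods placing items i and j together.
    A = np.zeros((n, n), dtype=int)
    for labels in clusterings:
        lab = np.asarray(labels)
        A += (lab[:, None] == lab[None, :]).astype(int)
    return A

def cluster_score(A, members, alpha):
    # f(G) = sum over unordered member pairs of (A[i, j] - alpha); 0 if empty.
    members = list(members)
    return sum(A[i, j] - alpha
               for p, i in enumerate(members)
               for j in members[p + 1:])

# Two toy clusterings of four items, given as label vectors.
A = agreement_matrix([[0, 0, 1, 1], [0, 0, 0, 1]], n=4)
print(cluster_score(A, [0, 1], alpha=1.5))   # A[0,1] = 2, so score = 0.5
```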
In [14], given a number of clusterings $C_1, \ldots, C_k$ of a data set, the authors find the optimum clustering $C^*$ such that $\sum_{i=1}^{k} d(C_i, C^*)$ is minimized, where $d$ is the symmetric difference distance defined over partitions. This measure is defined on the number of coclustered and non-coclustered pairs in the partitions (see [14] for details). The problem is NP-complete [17], [18] in general, and thus a solution to this optimization problem is based on local search and simulated annealing. There are also other works which address mixtures of labeled and unlabeled data (e.g., [19]).

To take a consensus of multiple nonhierarchical clusterings, Strehl and Ghosh [20] propose three heuristics for finding candidate clusterings (the first induces a similarity measure from the partitionings and then reclusters the objects; the second is based on hypergraph partitioning; and the third collapses groups of clusters into metaclusters, which then compete for each object to determine the clustering). They then use a supraconsensus function that evaluates all three approaches against a given objective function, which essentially seeks the maximum average normalized mutual information, and picks the best solution for a given situation.

Recently, a Bayesian method was given in [15] which averages multiple clustering runs with Variational Bayes Mixtures of Gaussians or k-means to yield the final clustering result. Reoccurring clustering patterns are collected in a cooccurrence matrix (which contains the probabilities that a pair of elements belongs to the same cluster). This matrix can then be used as a similarity matrix and given as input to any hierarchical clustering algorithm for generating a tree.

In contrast to the recent optimization-based methods and Bayesian techniques, this work proposes a computationally simple method, which is thematically more similar to the works of Adams [5] and Neumann [6]. This work considers intersections of the leaves of various trees generated by several clustering algorithms and finds one or more "consensus clusters" which appear in similar-sized subtrees with slightly different topologies in the considered trees. Thus, this approach is also very different from traditional consensus tree building approaches, which rely on considering subtrees only at some specified heights. Notice that this approach assumes that there exists more than one cluster in the data.

The motivation for not building a resulting consensus tree is that different clustering criteria can lead to differences in the topology of the elements within the same cluster, and therefore each cluster in a consensus tree can potentially contain fewer samples than the actual number of members in that cluster.

Contributions. The contributions of this paper are as follows:

- A simple exploratory method for finding consensus clusters from several hierarchical clustering algorithms is proposed. The method is based on intersecting leaves of various "similar-sized" subtrees of the given trees generated by some top-down and bottom-up clustering techniques.
- The proposed algorithm has been applied to two public-domain microarray data sets: 1) the melanoma (skin cancer) data set produced by Bittner et al. [21] and 2) the breast cancer data set provided by van't Veer et al. [4].
- Using this method, a new hypothesis can be made: there might be a new subtype of melanoma in the data set of Bittner et al. [21], within a commonly found dense cluster. Also, the popularly hypothesized "basal" and "luminal A" subtypes of breast cancer are validated by the results on the breast cancer data set of van't Veer et al. However, other subtypes are rather unclear, even though there is a possibility of a third cluster called "luminal B." Furthermore, a new hypothesis about two strong subtypes within the "basal" subtype is made.

2 CONSENSUS CLUSTERS

2.1 Finding Consensus Clusters

The motivation for finding only consensus clusters instead of building a consensus tree for the microarrays is that such data sets are often very noisy, and the topology produced by the candidate classification algorithms cannot provide the exact relationship of the samples in the data. Intuitively, the proposed algorithm (called ECHC, for exploratory consensus of hierarchical clusterings) takes every proper subtree of a given tree and finds a match, in terms of the set of leaves, in all other trees, such that the matching subtrees are of comparable sizes. Finally, a consensus cluster is formed by taking the intersection of the sets of leaves of the given subtree and the other matching subtrees, if the proportion of the number of leaves common to all the subtrees with respect to the number of leaves in the given subtree is above some prespecified value.

More formally, suppose that there are $k$ algorithms which produce $k$ trees, $T_1, \ldots, T_k$, for a given data set.
The output of the following algorithm is a collection of sets of strings, where the elements (strings) in each set are the names of the samples in one "consensus cluster." The algorithm consists of the following three steps.

Step 1. Consider each proper subtree $t_i$ of the tree $T_1$ containing more than $l$ (say, 5) leaves to generate a candidate cluster (this implies that one cluster can be a subset of a larger cluster). Then, for each such proper subtree $t_i \in T_1$, find a subtree $m_i^j$ in each tree $T_j$ with $2 \le j \le k$ such that

$$m_i^j = \arg\max_{t_p \in S_j} \frac{|leaves(t_i) \cap leaves(t_p)|^2}{|leaves(t_i)| \cdot |leaves(t_p)|}, \qquad (1)$$

where $S_j$ is the set of all proper subtrees of $T_j$, $leaves(t_i)$ is the set of strings naming the leaves of the subtree $t_i$, and $|\cdot|$ is the sizeof (cardinality) operator. Notice that, unlike other methods which cut the dendrograms at some height to generate the clusters, all proper subtrees of $T_1$ containing more than $l$ leaves are considered here. This is feasible, since usually the data of heterogeneous diseases are taken from a few populations of tens of patients such that each population has similar symptoms. Comparing subtrees only at similar heights may not yield good results here, since the branching structures are usually widely different depending on the algorithm (e.g., consider a bipartitioning of samples such that one side contains only a singleton outlier, while the other side contains the rest of the samples).

Step 2. For each subtree $t_i \in T_1$, let the matching subtrees in $S_2, \ldots, S_k$ be $m_i^2, \ldots, m_i^k$. Discard $t_i$ from the set of candidate clusters if any of the following holds for any $j$ with $2 \le j \le k$:

$$0 < \frac{|leaves(t_i) \cap leaves(m_i^j)|}{|leaves(t_i)|} \le \beta, \qquad 0 < \frac{|leaves(t_i) \cap leaves(m_i^j)|}{|leaves(m_i^j)|} \le \beta,$$

for some predefined $\beta \le 1$. A value of $\beta = 0.75$ was chosen in an experimental manner. (Recall that $k$ is the number of trees under consideration.) This step ensures that a subtree in $T_1$ is matched with subtrees of similar size in the other trees $T_j$. A $\beta < 1$ allows one to match subtrees with different structures and slightly different numbers of leaves, even though there might be a few mistakes in each individual subtree under consideration. Nevertheless, if $\beta$ takes a value close to 0, a large number of very small clusters will be found, and a well-defined cluster stands the chance of being broken into subclusters.

Step 3. For every remaining subtree $t_i \in T_1$, there are nonempty matching subtrees $m_i^2 \in S_2, \ldots, m_i^k \in S_k$. The elements of the consensus cluster extracted from $t_i$ are given by $leaves(t_i) \cap leaves(m_i^2) \cap \cdots \cap leaves(m_i^k)$.

Notice that the tree created by a top-down clustering algorithm is chosen as $T_1$, since top-down methods are often more faithful near the root of the tree, while bottom-up clustering techniques make mistakes when merging large-sized clusters. In the following, two top-down and one bottom-up clustering algorithms are described, from which the consensus clusters will be generated.
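To summarize Steps 1-3, a minimal Python sketch of ECHC follows. For illustration, each tree is flattened to the collection of leaf sets of its proper subtrees (frozensets of sample names), with trees[0] playing the role of T1; this input format and the function name are assumptions made here for brevity, not the author's implementation.

```python
def echc(trees, l=5, beta=0.75):
    # trees[0] stands in for T1; every tree is given as a collection of
    # leaf sets (frozensets of sample names) of its proper subtrees.
    consensus = []
    for ti in (t for t in trees[0] if len(t) > l):      # Step 1 candidates
        # Step 1: in every other tree, find the best-matching subtree
        # by the criterion |intersection|^2 / (|t_i| * |t_p|).
        matches = [max(other,
                       key=lambda tp: len(ti & tp) ** 2 / (len(ti) * len(tp)))
                   for other in trees[1:]]
        # Step 2: keep t_i only if every match overlaps it well enough
        # on both sides (both overlap ratios above beta).
        if all(len(ti & m) / len(ti) > beta and len(ti & m) / len(m) > beta
               for m in matches):
            cluster = set(ti)                           # Step 3: intersect
            for m in matches:
                cluster &= m
            consensus.append(cluster)
    return consensus
```

As in Step 2, a candidate whose best matches overlap it in at most a fraction β of the leaves is discarded.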
2.2 Arithmetic-Harmonic Cut (AH-CUT)

Recently, in [22] and [23], a new top-down hierarchical clustering algorithm was introduced which performed very well in finding trees with well-supported hypotheses for several diverse data sets, e.g., the separation times of 84 Indo-European languages [24] and the microarray data of 64 cancer cell lines from the National Cancer Institute, USA [25]. This method builds the hierarchy recursively in a top-down fashion. In each step of the recursion, the following graph optimization problem is solved for the newly generated partitions of the data. Initially, a vertex is assigned to each sample (experiment) in the microarray data, and the distance (correlation, Euclidean, etc.) between a pair of samples is assigned to the edge between the corresponding vertices. The solution of the optimization problem partitions the vertices into two sets; a new graph is then constructed, and the problem is solved again for each side of the obtained partition.

AH-CUT Problem:
Instance: A pair $(G, w)$, where $G = (V, E)$ is an undirected graph with vertex set $V$ and edge set $E$, and $w$ is a weighting function $w : E \mapsto \mathbb{N}^+$, with $\mathbb{N}^+$ the set of nonzero natural numbers.
Goal: To divide the set $V$ of vertices into $\{S, \bar{S}\}$ with $\bar{S} = V \setminus S$, such that the following objective function $f$ is maximized.
Objective:

$$f(S, (G, w)) = \left( \sum_{e \in \delta_G(S)} w(e) \right) \left( \sum_{e \in \bar{\delta}_G(S)} \frac{1}{w(e)} \right), \qquad (2)$$

where for every subset $S$ of $V$, $\delta_G(S) := \{uv \in E \mid u \in S, v \notin S\}$ denotes the cut in $G$ induced by $S$, and $\bar{\delta}_G(S) := E \setminus \delta_G(S)$. Effectively, every partition of $V$ into $S$ and $\bar{S}$ generates a partition of the edge set $E$ into the two sets $\delta_G(S)$ and $\bar{\delta}_G(S)$. The name Arithmetic-Harmonic Cut (AH-CUT) follows from the identity

$$f(S, (G, w)) = \left( A_{ext} \, |\delta_G(S)| \right) \frac{|E| - |\delta_G(S)|}{H_{int}} = \frac{A_{ext}}{H_{int}} \, |\delta_G(S)| \left( |E| - |\delta_G(S)| \right),$$

where $A_{ext}$ is the arithmetic mean of the weights of the edges in the cut, $H_{int}$ is the harmonic mean of the weights of the edges not in the cut, and $|\delta_G(S)|$ is the cardinality of the cut.

In [22] and [23], the authors used a population-based metaheuristic (a memetic algorithm) to solve AH-CUT, since the problem is shown to be APX-hard [23] by a reduction from the MAX-CUT problem. The memetic algorithm first creates a population of feasible solutions using a randomized greedy algorithm. These solutions are then combined (via the crossover mechanism of genetic algorithms) using a differential greedy algorithm such that a better solution is obtained, and this solution replaces the parent solution with the worse objective value. Finally, a variable neighborhood local search is employed in the neighborhood of each such new solution to find further improvements. If no better solution is obtained, the algorithm is restarted a few times.

2.3 NORMALIZED-CUT

Shi and Malik introduced a top-down technique for divisive clustering, namely NORMALIZED-CUT [26], which has been applied very successfully to several 2D data sets and images. The work in [26] uses a similarity matrix and proposes to minimize the following objective function $NCut(S, \bar{S})$ induced by a partition of the vertices, for some $S \subseteq V$ and $\bar{S} = V \setminus S$:

$$NCut(S, \bar{S}) = \frac{\sum_{ij \in \delta(S)} s(ij)}{\sum_{i \in S, j \in V} s(ij)} + \frac{\sum_{ij \in \delta(S)} s(ij)}{\sum_{i \in \bar{S}, j \in V} s(ij)},$$

where $\delta(S)$ is the cut induced by $S$ as before, and $s$ is a function from edges to the real numbers giving the similarity between any two vertices $i$ and $j$. The work in [26] reduces this optimization problem to a generalized eigenvalue problem and provides an efficient method for image segmentation. However, the algorithm in [26] for the NORMALIZED-CUT problem requires the number of clusters to be specified. Thus, an alternative method (a variant of the memetic algorithm described above) is employed here; it does not require such a specification, but caters for the NORMALIZED-CUT objective function. The tree is generated by applying NORMALIZED-CUT recursively on each newly obtained set of vertices in the partitions.
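For concreteness, both divisive objectives can be evaluated directly for a candidate bipartition. In the sketch below the graph is a plain list of weighted edges and the similarity function is a dictionary over vertex pairs; these representations are assumptions chosen for brevity, and the memetic optimization itself is not reproduced here.

```python
def ah_cut_value(edges, S):
    # AH-CUT objective (2): (sum of cut-edge weights) times
    # (sum of reciprocal weights of the uncut edges).
    cut = [w for u, v, w in edges if (u in S) != (v in S)]
    rest = [w for u, v, w in edges if (u in S) == (v in S)]
    return sum(cut) * sum(1.0 / w for w in rest)

def ncut_value(sim, S, V):
    # Normalized-cut objective for the bipartition (S, V \ S),
    # with sim mapping unordered vertex pairs to similarities.
    s = lambda i, j: sim.get(frozenset((i, j)), 0.0)
    cut = sum(s(i, j) for i in S for j in V - S)
    return cut / sum(s(i, j) for i in S for j in V) \
         + cut / sum(s(i, j) for i in V - S for j in V)

# Toy graph: small distances inside {a, b} and {c, d}, large across.
edges = [("a", "b", 1), ("c", "d", 1),
         ("a", "c", 9), ("b", "d", 9), ("a", "d", 9), ("b", "c", 9)]
print(ah_cut_value(edges, {"a", "b"}))   # (9 * 4) * (1/1 + 1/1) = 72.0
```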
2.4 Agglomerative Hierarchical Clustering

For large-scale data sets, a bottom-up approach is often taken to find a hierarchical tree, mainly because of its computational efficiency (recall that NORMALIZED-CUT is shown to be NP-hard in [26], and AH-CUT is proved hard to approximate in [23]). The method of agglomerative hierarchical clustering [2] starts by considering each object as a separate cluster. At each iteration, it merges the two most "similar" clusters, decreases the number of clusters by 1, and updates the distances between the clusters. The process stops when a single cluster is obtained. The tree structure is generated by keeping track of the merging of the clusters. Different trees arise depending on how the distance between clusters is defined. In this work, the most commonly used average-linkage method is adopted: the distance between two clusters is defined as the average of the distances between all pairs of objects, where each pair is made up of one object from each group.

3 VALIDATION

The validation of clustering is a hard problem, although quite a few works [27], [28], [29], [30], [31] have addressed this issue in the past and provided strategies for validating clustering partitions. We consider the Silhouette, Dunn's, and Davies-Bouldin indices, respectively given in [29], [30], and [31].

3.1 Silhouette Index

For a given cluster $C_j$, $1 \le j \le c$, where $c$ is the number of clusters, each sample $s_i$, $i = 1, \ldots, m_j$, of $C_j$ can be assigned a quality measure known as the Silhouette width. The Silhouette width is a confidence indicator on the membership of the $i$th sample in cluster $C_j$, and is defined as

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},$$

where $a(i)$ is the average distance between the $i$th sample and all other samples in $C_j$, "max" is the maximum operator, and $b(i)$ is the minimum average distance between the $i$th sample and the samples in each other cluster $C_k$, $k = 1, \ldots, c$, $k \ne j$. From this formula, it follows that $-1 \le s(i) \le 1$. When $s(i)$ is close to 1, one may infer that the $i$th sample has been clustered well. When $s(i)$ is close to zero, the $i$th sample could equally be assigned to the nearest neighboring cluster. If $s(i)$ is close to $-1$, it may be argued that such a sample has been "misclassified" [29].

For a given cluster $C_j$ ($j = 1, \ldots, c$), it is possible to calculate a cluster Silhouette $S_j$, which characterizes the heterogeneity and isolation properties of the cluster:

$$S_j = \frac{1}{m_j} \sum_{i=1}^{m_j} s(i),$$

where $m_j$ is the number of samples in $C_j$. In fact, a Global Silhouette value, $GS_u$, can be used as an effective validity index for a partition $u$, where $u$ is the union of the samples in the clusters $C_j$ for all $j$, $1 \le j \le c$:

$$GS_u = \frac{1}{c} \sum_{j=1}^{c} S_j.$$

Notice that a higher value of $GS_u$ signifies better isolation properties of the clusters. Furthermore, it is shown in [32] that the Silhouette index is insensitive to the type of distance metric used.
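A direct implementation of the Global Silhouette from a precomputed distance matrix is straightforward. The sketch below is a minimal version written for this paper's setting (distance matrix plus integer cluster labels) and assumes at least two clusters, each with at least one member; it is not the implementation used in the paper.

```python
import numpy as np

def global_silhouette(D, labels):
    # D[i, j] is the correlation distance between samples i and j;
    # labels[i] is the cluster of sample i.
    labels = np.asarray(labels)
    ids = sorted(set(labels))
    s = np.empty(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False
        # a(i): mean distance to the other members of i's own cluster.
        a = D[i, same].mean() if same.any() else 0.0
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean() for c in ids if c != labels[i])
        s[i] = (b - a) / max(a, b)
    # Cluster Silhouettes S_j, then their average GS_u.
    return np.mean([s[labels == c].mean() for c in ids])
```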
3.2 Dunn's Index

Dunn's index identifies sets of clusters that are compact and well separated. For any partition $u$, Dunn's validation index $D_u$ is defined as

$$D_u = \min_{1 \le i \le c} \left\{ \min_{1 \le j \le c,\, j \ne i} \left\{ \frac{\delta(C_i, C_j)}{\max_{1 \le k \le c} \Delta(C_k)} \right\} \right\},$$

where $\delta(C_i, C_j)$ defines the distance between clusters $C_i$ and $C_j$ (intercluster distance), $\Delta(C_k)$ represents the intracluster distance of cluster $C_k$, and $c$ is the number of clusters of partition $u$. The main goal of this measure is to maximize intercluster distances while minimizing intracluster distances. Thus, large values of $D_u$ correspond to good clusters.

3.3 Davies-Bouldin's Index

Like Dunn's index, the Davies-Bouldin index tries to identify compact and well-separated sets of clusters. The Davies-Bouldin validation index, $DB_u$, is defined as

$$DB_u = \frac{1}{c} \sum_{i=1}^{c} \max_{i \ne j} \left\{ \frac{\Delta(C_i) + \Delta(C_j)}{\delta(C_i, C_j)} \right\},$$

where $u$, $\delta(C_i, C_j)$, $\Delta(C_i)$, $\Delta(C_j)$, and $c$ are defined as for Dunn's index. Small values of $DB_u$ correspond to clusters that are compact and whose centers are far away from each other.

For each of the above indices, the distance between samples $s_i$ and $s_j$ is taken to be their correlation distance $(1 - \rho)$, where $\rho$ is the Pearson correlation coefficient (computed as in Section 3.5.1) between the gene expression vectors of samples $s_i$ and $s_j$.
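Both indices can likewise be computed from the distance matrix once concrete choices are made for the intercluster distance δ and the intracluster distance Δ. The single-linkage distance and the cluster diameter used below are common choices, but they are assumptions here, since the text does not fix them.

```python
import numpy as np

def dunn_and_davies_bouldin(D, labels):
    labels = np.asarray(labels)
    ids = sorted(set(labels))
    # delta: single-linkage distance between two clusters;
    # Delta: diameter (largest within-cluster distance).
    delta = {(a, b): D[np.ix_(labels == a, labels == b)].min()
             for a in ids for b in ids if a != b}
    Delta = {a: D[np.ix_(labels == a, labels == a)].max() for a in ids}
    dunn = min(delta.values()) / max(Delta.values())
    db = np.mean([max((Delta[a] + Delta[b]) / delta[a, b]
                      for b in ids if b != a) for a in ids])
    return dunn, db
```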
3.4 Comparison with Other Algorithms

In the following sections, the Silhouette, Dunn's, and Davies-Bouldin indices for the clusterings obtained by ECHC on the melanoma and breast cancer data sets are compared with those obtained using k-means, fuzzy c-means, and the recent consensus clustering in [15]. In all of these algorithms except ECHC, the number of clusters needs to be specified.

In k-means, k cluster centers are first chosen and then iteratively updated until the maximum number of iterations is reached. The update computes the new center of a cluster as the centroid of the samples that were clustered together under the previous center. In this work, Eisen et al.'s [2] implementation of k-means is used for this purpose.

In fuzzy c-means, each sample is assigned to a cluster with a given probability; the higher the probability, the more likely the sample is to belong to that cluster. However, applying fuzzy c-means (using MATLAB with the default options) to the melanoma and breast cancer data yielded equal probabilities for each sample to belong to each of the specified number of clusters. Thus, fuzzy c-means is not considered for comparison in the rest of this paper.

In [15], the authors' consensus clustering algorithm (called CC in the following) takes advantage of multiple runs of variational Bayes mixtures of Gaussians or k-means to create a consensus clustering. On some simulated data sets, this algorithm was able to identify more complex cluster shapes than those provided by the input clusterings. The authors also provide an information-theoretic metric (average mutual information) between the clustering runs to signify the similarity between the clustering solutions.

3.5 Feature Selection and Classification

Another way to demonstrate the importance of clustering the samples of microarray data is to show that there are genetic markers which can linearly separate a consensus cluster from the others, and that high classification accuracy is obtained when these markers are used to classify a randomly chosen population of samples from the data. For this purpose, this work uses a simple feature selection method widely employed by biologists (e.g., see [33], and also [4]) and a simple classifier widely used by statisticians and the machine learning community. In the following, we consider that each validation test involves two classes $a$ and $b$, where $a$ and $b$ can be two consensus clusters, or one of them can correspond to the set of all unclustered samples.

3.5.1 Feature Selection

To find genetic markers separating one consensus cluster from another (or from the unclustered samples), a correlation-based method is employed here. In this method, each gene is first sorted according to the class label. Let there be $n_a$ samples in class $a$ and $n_b$ samples in class $b$. Consider $g = [g_1, \ldots, g_{n_a}, g_{n_a+1}, \ldots, g_{n_a+n_b}]$ and $n = n_a + n_b$. Also, let $v = [v_a, v_b]$, where $v_a = l_1, \ldots, l_{n_a}$ and $v_b = m_1, \ldots, m_{n_b}$ with $l_1 = \cdots = l_{n_a} = 1$ and $m_1 = \cdots = m_{n_b} = -1$. Then, the Pearson correlation coefficient of a gene $g$ with $v$ is given by

$$\rho_g = \frac{\sum_{k=1}^{n} (g_k - \bar{g})(v_k - \bar{v})}{\sqrt{\sum_{k=1}^{n} (g_k - \bar{g})^2 \sum_{k=1}^{n} (v_k - \bar{v})^2}},$$

with

$$\bar{g} = \frac{1}{n} \sum_{k=1}^{n} g_k \quad \text{and} \quad \bar{v} = \frac{1}{n} \sum_{k=1}^{n} v_k.$$

Finally, genes are ranked according to the absolute value of their Pearson correlation coefficient $\rho_g$, i.e., $rank_g = |\rho_g|$. Intuitively, this coefficient gives the correlation of the gene expressions of samples from the two classes with an ideal gene which takes the value 1 in one class and $-1$ in the other. Thus, if $rank_g$ is close to 1, then the gene $g$ is good for differentiating between classes $a$ and $b$.

3.5.2 Classifier

For classification purposes, a well-known classifier, namely the Naive Bayes classifier (NBC), is used. This classifier is based on the conditional probability distribution $p(c \mid g_1, \ldots, g_n)$ of the class variable $c \in \{a, b\}$ conditioned on the gene expressions $g_1$ through $g_n$. Bayes' theorem gives

$$p(c \mid g_1, \ldots, g_n) = \frac{p(c, g_1, \ldots, g_n)}{p(g_1, \ldots, g_n)}.$$

The denominator is effectively constant. Using the assumption that $g_1, \ldots, g_n$ are mutually independent given the class, it is easy to show that the numerator can be rewritten as

$$p(c, g_1, \ldots, g_n) = p(c) \prod_{i=1}^{n} p(g_i \mid c).$$

The probability distribution of a gene in a given class $c$ can be computed using standard steps; thus, computing $p(c, g_1, \ldots, g_n)$ is straightforward. Given a new sample, one can compute $p(a, g_1, \ldots, g_n)$ and $p(b, g_1, \ldots, g_n)$ and assign its class to $a$ if the former is larger, or to $b$ otherwise.

The NBC from the public-domain software WEKA [34] is used here to validate the consensus clusters obtained in the melanoma and breast cancer data sets. In each case, two groups $a$ and $b$ are considered; a randomly chosen 66 percent of each group is used as training data, while the remaining samples are classified. The seed chosen for creating the random partitioning of the data is 1.
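The gene ranking of Section 3.5.1 amounts to correlating every gene with the ideal ±1 class vector, which vectorizes neatly. The sketch below is an assumed re-implementation (the study itself used the selected markers with WEKA's NBC), and the matrix layout (genes by samples per class) is an illustrative choice.

```python
import numpy as np

def rank_genes(X_a, X_b):
    # X_a, X_b: genes-by-samples expression matrices for classes a and b.
    g = np.hstack([X_a, X_b])
    # Ideal gene v: +1 over class a samples, -1 over class b samples.
    v = np.r_[np.ones(X_a.shape[1]), -np.ones(X_b.shape[1])]
    gc = g - g.mean(axis=1, keepdims=True)
    vc = v - v.mean()
    rho = gc @ vc / np.sqrt((gc ** 2).sum(axis=1) * (vc ** 2).sum())
    order = np.argsort(-np.abs(rho))       # rank_g = |rho_g|, descending
    return order, rho
```

The top 50 indices of `order` would then serve as the marker panel fed to a Naive Bayes classifier.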
Classification of a test set. It is important to validate the consensus clusters obtained by ECHC across various data sets. However, when the class of a sample is unknown in both the training and the test set, independent validation becomes a tricky issue. To solve this problem, the same classification method is employed on another test set, from which consensus clusters are generated as well. Still, without prior knowledge, it might be difficult to map cluster I (II, etc.) of the training set to cluster I (II, etc.) of the test set. In the case of the breast cancer data sets, there were a few common samples in the two sets, which allows the clusters of the training set to be mapped to those of the test set. Once the classes of the samples in the test set are known, it is straightforward to use NBC for validation.

4 MICROARRAY DATA

The microarray data for the considered data sets is given by a 2D matrix $D$ of size $N \times S$, where $N$ is the number of genes and $S$ is the number of samples. Each row $N_i$ ($1 \le i \le N$) gives the expressions of one gene across the $S$ samples, and each column $S_i$ ($1 \le i \le S$) contains the expressions of all genes for one sample (experiment). From $D$, a distance matrix of size $S \times S$ is first computed, where element $(i, j)$ gives the correlation distance $1 - \rho(S_i, S_j)$ between the columns $S_i$ and $S_j$. Here, $\rho(S_i, S_j)$ is computed in the same manner as in Section 3.5.1, replacing $g$ and $v$ with $S_i$ and $S_j$, respectively. In the following sections, the distance matrix corresponding to each microarray data set is considered. Given a distance matrix, it is straightforward to construct a complete graph with $S$ vertices and $S(S-1)/2$ edges; the weighting function $w$ is defined by the distance matrix itself. Then, ECHC is applied using the trees produced by AH-CUT, NORMALIZED-CUT, and agglomerative clustering. In this work, the tree produced by AH-CUT is taken as $T_1$.
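The construction of Section 4, an $S \times S$ correlation-distance matrix from the $N \times S$ expression matrix, is essentially a one-liner with NumPy; the random matrix below is only a stand-in for a real expression data set.

```python
import numpy as np

def correlation_distance_matrix(D):
    # Columns of D are samples; entry (i, j) of the result
    # is the correlation distance 1 - rho(S_i, S_j).
    return 1.0 - np.corrcoef(D, rowvar=False)

rng = np.random.default_rng(1)
expr = rng.normal(size=(500, 38))        # e.g., 500 genes x 38 samples
dist = correlation_distance_matrix(expr)
print(dist.shape)                        # (38, 38): one weighted edge per pair
```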
5 MELANOMA

The most common human cancers are malignant skin cancers. Melanoma is the least common but the most deadly skin cancer, accounting for only about 4 percent of all cases but 79 percent of skin cancer deaths (http://www.cancer.org). Even though significant effort has been devoted to identifying clinical/genetic markers differentiating melanoma samples from normal ones, identifying subtypes of melanoma is still a research topic. Recent works claim the existence of a strong subtype of melanoma [21], [35]. However, there is no conclusive evidence about the markers of this subtype.

The work in [21] provides a melanoma data set of 38 samples, among which 31 are melanoma-affected, while 7 are nonmelanoma samples (NilC, UACC-3149, MCF-10A, CRL-1634, SRS-3, SRS-5, and RMS-13). The authors of [21] employed two clustering methods, agglomerative hierarchical clustering [2] and a nonhierarchical technique called the cluster affinity search technique [36], to define the hypothesis about a primary subtype of melanoma, which is centrally clustered using both algorithms.

Fig. 1. Hierarchical clustering using AH-CUT on the distance matrix of Bittner's melanoma data set. MI denotes the largest consensus cluster found, while MII denotes the second largest consensus cluster. Notice that the elements in MI and not in MII are not grouped together in AH-CUT, and they are not inferred as a third consensus cluster.

5.1 Consensus Clusters

It is not surprising that the same primary cluster as in [21] is inferred to be the largest consensus cluster, called MI hereafter. Node 1 in Fig. 1 is matched with node 2 in Fig. 2 and node 6 in Fig. 3 to get MI (see the genetic signature of MI versus the rest in Fig. 4). Notice that the primary cluster given in [21] contains 19 of the 20 samples in the first cluster MI; in addition to Bittner's primary cluster, MI also contains the sample TC-F027. Furthermore, the cluster MI is found to be further subdivided into two more strongly separated clusters, MII and MI \ MII, as evidenced by the genetic signature shown in Fig. 5. In this case, node 5 in Fig. 1 is matched with node 37 in Fig. 2 and node 17 in Fig. 3 to infer MII. Notice that MII is a subset of MI. Furthermore, no nonmelanoma samples are included in MI.

5.2 AH-CUT

The hierarchical tree obtained by recursive AH-CUT on the melanoma data set is shown in Fig. 1. Notice that the elements in MI \ MII are in different subtrees, rooted at nodes 34 and 6, respectively.

Fig. 2. Hierarchical clustering using NORMALIZED-CUT on the distance matrix of Bittner's melanoma data set. Notice that the first division in NORMALIZED-CUT separates MI from the rest. MII and MI \ MII are also well separated within MI.

5.3 NORMALIZED-CUT

The hierarchical tree obtained using the second top-down clustering method, NORMALIZED-CUT, on the melanoma data set is shown in Fig. 2. Notice that all the samples in MI are clearly separated from the rest in the first division. Furthermore, the samples in MI \ MII are also condensed within one subtree.

5.4 Agglomerative Hierarchical Clustering

Fig. 3 shows the hierarchical tree obtained by the bottom-up agglomerative clustering technique on this data set. Notice that here, too, MI \ MII is well defined.

Fig. 3. Agglomerative hierarchical clustering of the melanoma data set with average linkage. This algorithm is prone to emitting outliers. However, node 6 gives the elements in MI and node 17 gives the elements in MII.

5.5 Remarks

If, instead of AH-CUT, the tree produced by agglomerative hierarchical clustering is used as the first hypothesis tree T1, then a third cluster MIII can be found, which is a subset of MII and consists of the six samples M92001, UACC-1273, UACC-091, M93-007, UACC-1256, and UACC-2534. However, this cluster is not considered here, since the difference between MII and MIII would then be just three samples, which is not enough for further classification.
5.6 Validation by Clustering Indices

Next, the performance of ECHC is compared with k-means and CC using the three validation indices for the respective clusterings of the melanoma data set (see Table 1). To compute these indices for ECHC, MI and MII are considered as the two clusters in this data set, and the remaining samples are omitted, since ECHC does not infer any class for them. For the other algorithms, however, all 38 samples are considered, since they assign a class to every sample, even though some of the samples might actually be outliers. For k-means, the results are shown for k = 2 and k = 3, and for the algorithm in [15], the results are shown with k = 3. Using k-means with k = 2, all samples in MI are clustered together with four other samples in one cluster, whereas the clustering with k = 3 mixes some samples from MI with those from the rest. Notice that the validation indices with k = 2 are better than those with k = 3 when k-means is used. CC, however, performed very poorly with respect to the considered validation indices; in fact, the mutual information between the runs of CC was only 0.038. Notice that ECHC performs better in all cases, since larger values of the Silhouette and Dunn's indices and smaller values of the Davies-Bouldin index signify better clustering.

TABLE 1. Comparison of Various Validation Indices for the Melanoma Data Set.

Fig. 4. The genetic signature of 50 genes discriminating MI from the rest of the samples. The cloneids of the genes are shown on the right-hand side. The color white represents overexpression and the color black represents underexpression of a gene.

5.7 Genetic Markers and Classification Results

5.7.1 MI versus Rest

To ensure that MI is definitely a primary subtype of melanoma, the data is first divided into MI and the rest of the samples as the two classes, and 50 genetic markers found using the correlation-based method of Section 3 are visualized in Fig. 4. Unsurprisingly, 100 percent classification accuracy is obtained for this partition using NBC. Some of the highly ranked genetic markers for the partition MI versus Rest are given in the following.

The best marker for MI versus Rest is the gene MELAN-A (cloneid 266361, correlation rank 0.93), which is known for melanin synthesis and is considered to be a marker for melanoma [37]. Notice that this gene is relatively overexpressed in MI (white/gray) and underexpressed in the rest. The second gene, PIRIN (cloneid 234237, rank 0.86), is not well known for differentiating melanoma from other tissues and should perhaps be investigated further. The third gene, WNT5A (cloneid 357278, rank 0.83), is again well known to be overexpressed in aggressive melanoma tumors [38] and has a function in cell growth and differentiation. However, according to [21], the primary cluster in this data set (the MI cluster) is less aggressive, and this is reflected by the underexpression of WNT5A in MI. Notice that the work in [21] omitted the sample TC-F027, which is very similar to the primary cluster, and did not obtain the best marker MELAN-A; however, they also found the gene WNT5A as a marker for MI. The sixth gene, Synuclein, alpha (symbol SNCA, cloneid 812276, rank 0.80), is again a marker in [21]. Another gene common between these two studies is CD63 (cloneid 39781); CD63 is associated with melanoma progression and is known to be a marker for this disease [37]. Another interesting gene found by this method is syntenin (cloneid 813533, rank 0.78), which is known to be overexpressed in aggressive melanoma tissues [39]. However, as expected, the expression of this gene is not very high in the MI cluster, since these samples are not overaggressive, as mentioned earlier.

Fig. 5. The genetic signature of 50 genes discriminating MII from the other samples in MI.

5.7.2 MII versus MI \ MII

With NORMALIZED-CUT and agglomerative clustering, both MII and MI \ MII are included in two sister subtrees. Thus, it seems that there is a further division of MI into MII and MI \ MII. MII contains nine samples, and the remaining 11 samples of MI are included in MI \ MII. Fig. 5 gives the genetic signature with respect to the first 50 highly ranked genes for this subdivision. Furthermore, using the same classification method with NBC, an accuracy of 100 percent is obtained.
The first 50 genetic markers for the partition MII versus MI \ MII are ranked between 0.96 and 0.8, showing a significant distinction between these two types. However, this partition has not been well explored by biologists. The names of the first six genes (rank 0.9 or above) are tyrosinase (cloneid 271985; melanin biosynthesis from tyrosine), cloneid 42993, cloneid 897563, IFIT1 (cloneid 823696), RRS1 (cloneid 562867), and ACY1 (cloneid 51293).

6 BREAST CANCER

With one million new cases in the world each year, breast cancer is the most common malignancy in women and constitutes 18 percent of all female cancers [40]. Even though a family history of breast cancer predisposes only a small minority (fewer than 20 percent) of women, a variety of unknown causes are suspected to change gene expression and give rise to various sporadic breast cancers. Only a fraction of diagnosed cases (as few as 50 to 60 percent) benefit from available medicines like tamoxifen [41] and herceptin. This is an obvious indication of the heterogeneity of the disease. In [42], [43], and [44], a small set of genetic markers has been provided to distinguish between hereditary breast cancers, caused by mutations in the genes BRCA-1 and BRCA-2, and other, sporadic breast tumors. However, the taxonomy of sporadic tumors still remains a mystery. A few works have concentrated on finding genetic markers between patients with good prognosis and bad prognosis, e.g., [4], [45], and [46]; however, the genetic markers are mostly not found to be common between patients from different studies. In the works [3] and [47], a classification of sporadic breast cancer was given which consists of five groups: "basal," "luminal A," "luminal B," "ERBB2," and "normal breast-like." However, except for the first two, these groups were not found to be prominent in the data set of van't Veer et al. [4] (perhaps due to a difference in the population). Nevertheless, a quest for finding subtypes of sporadic breast cancer is underway worldwide, the goal being to find and eradicate the inherent causes of sporadic breast tumors.

6.1 Consensus Clusters

In this work, 78 samples of sporadic breast tumors from [4] are used, of which 76 are retained after removing the samples numbered 53 and 54 due to their large numbers of missing values. The samples that metastasized in less than 5 years are prefixed "LessSample," and those for which this did not happen are prefixed "MoreSample." Using the method of ECHC, it can be inferred that there are three "consensus clusters," called BI, BII, and BIII. The clusters BI and BII are, in fact, those corresponding to the subtypes "luminal A" and "basal" identified in [3], modulo a couple of inclusions/exclusions (see the remarks). Notice that BI is generated by matching the subtree rooted at node 1 in Fig. 6 with the one at node 1 in Fig. 7 and node 3 in Fig. 8. BII is generated by matching the subtree rooted at node 96 in Fig. 6 with the one at node 80 in Fig. 7 and node 62 in Fig. 8. Finally, BIII is generated by matching the subtree rooted at node 132 in Fig. 6 with the one at node 118 in Fig. 7 and node 66 in Fig. 8.

Fig. 6. Hierarchical clustering using AH-CUT on the distance matrix of van't Veer's breast cancer data set. Notice that BI is separated from BII in the first division, even though some unclustered samples are mixed with BI samples. Also, see that BIII is a subgroup of BII.
More interestingly, a new consensus cluster BIII is found, which is a subgroup of BII. This subgroup was hitherto unknown, to the best of the author's knowledge, and can be inferred as a subgroup of the "basal" subtype. See the genetic signature of BIII versus BII \ BIII in Fig. 12.

6.2 AH-CUT

Fig. 6 shows the hierarchical tree obtained by using AH-CUT on van't Veer's data set of sporadic breast tumors. In this tree, BI is mostly separated from the other two consensus clusters, BII and BIII, in the first division. However, the samples in BI are mixed with some unclustered samples. All unclustered samples, which are not included in BI and BII, are grouped together and called UNC in the following.

Fig. 7. Hierarchical clustering using NORMALIZED-CUT on the distance matrix of van't Veer's breast cancer data set. Notice that in the first division by NORMALIZED-CUT, BI is separated alone from BII, except for the occurrence of one sample from UNC. However, samples in BII are mixed with some samples from UNC, and thus the clear separation of BIII and BII \ BIII is not achieved in this tree.

6.3 NORMALIZED-CUT

Next, the second top-down clustering method, NORMALIZED-CUT, is used to find a tree in van't Veer's data set (see Fig. 7 for the resulting tree). In this tree, BI is totally separated from the consensus cluster BII in the first division. However, the sample MoreSample5 from UNC is included on the same side as BI.

6.4 Agglomerative Hierarchical Clustering

Fig. 8 shows the tree obtained by using agglomerative hierarchical clustering on this data set. In this tree, BII is clearly separated from BI and the other unclustered samples in the first division. Furthermore, BII is nicely divided into BIII and BII \ BIII in the second division of BII.

Fig. 8. Agglomerative hierarchical clustering of van't Veer's breast cancer data set with average linkage. See the clear separation of BII from BI and the UNC (unclustered) samples in the first division. Also notice the division of BII into BIII and BII \ BIII.

A comparison of the results obtained above with those produced in [3] on van't Veer's data set shows that there are only minor differences between BI and their "luminal A" subtype and between BII and their "basal" subtype. For instance: 1) samples LessSample56 and MoreSample28 are included in BII, while [3] does not include them in "basal"; conversely, this work excludes MoreSample8 and LessSample71 from BII, while [3] includes them in "basal" (the difference is probably due to the fact that Sorlie et al. [3] considered an additional 19 test samples along with the training data of [4], which might affect the hierarchical clustering); 2) "luminal A" in [3] includes LessSample45, LessSample78, and MoreSample34, while ECHC leaves them in UNC, not in BI; and 3) all but one of the samples in their "luminal B" are in UNC; the exception, LessSample63, is included in BI.

6.5 Remarks

If the trees produced by the other two algorithms are used as T1, the result excludes the three samples LessSample55, LessSample63, and MoreSample9 from the cluster BI.
However, an investigation of the three trees shows that, except in the tree from agglomerative hierarchical clustering, these three samples lie well within the group of BI. Therefore, these three samples are not omitted from BI.

6.6 Cluster Validation Using Various Indices

For the breast cancer data set, the performance of ECHC is again compared with k-means and CC using the three validation indices (see Table 2). For the k-means algorithm, values of k = 2, 3, 4 are considered for clustering the breast samples. It turns out that the validation indices with k = 2 are better than those with the other values (results for k = 3, 4 are not shown here).

TABLE 2. Comparison of Various Validation Indices for the Breast Cancer Data Set.

Notice that the two clusters for k = 2 using k-means consist mainly of (BII and seven samples from UNC) versus (BI and the rest of UNC). The Silhouette index is better for ECHC when compared with the other two algorithms; however, Dunn's index and the Davies-Bouldin index for k-means with k = 2 are better than those of ECHC for this data set. For the algorithm in [15], the minimum and maximum numbers of clusters can be specified as 2 and 3, 4, etc. It turns out that better results are obtained when the number of clusters is 3. However, the mutual information between runs was only 0.0178, revealing the cause of the poor performance of CC on this data set.

Fig. 9. The genetic signature of 50 genes discriminating BI from BII.

6.7 Genetic Markers and Classification Results

6.7.1 BI versus BII

In this data set, the first clusters BI (39 samples) and BII (14 samples) are found to be disjoint. Fig. 9 shows a clear genetic signature for the samples in BI versus those in BII. Again, a classification accuracy of 100 percent is obtained using NBC. In the following, a few top-ranked genes are named; they can be further explored as potential markers for BI versus BII.

The best gene for distinguishing BI from BII is the androgen receptor gene (AR, unigene ID NM_000044, rank 0.88). It is known that AR is expressed in a significant number of ER-negative (no estrogen receptor) cases and shows significant associations with important clinical and pathologic prognostic factors [48]; notice that most of the samples in BII are ER-negative. The second gene is FOXA1 (ID NM_004496, rank 0.88), which has recently been promoted as a marker for the "luminal A" subtype (the BI cluster) in [49]. The next known gene is GATA3 (ID NM_002051, rank 0.84). GATA binding protein 3 (GATA3) is a transcriptional activator highly expressed by the luminal epithelial cells in the breast. The study in [50] found that GATA3 was one of the top genes with low expression in invasive carcinomas associated with negative estrogen and progesterone receptors (such as the samples in BII). Other interesting genes include some known oncogenes like RAB6 (ID NM_016577) and MYB (ID NM_005375), and the genes XBP1 (X-box binding protein 1, ID NM_005080), GABRP (ID NM_014211), SPDEF (ID NM_012391), TBC1D9 (ID AB020689), ESR1 (estrogen receptor 1, ID NM_000125), TRIM2 (ID AB011089), FOXC1 (ID NM_001453), etc. Among these, estrogen receptor 1 (ESR1) is the most explored gene associated with breast cancer. Also notice that many works claim that GABRP is downregulated in 75 percent of breast cancers [51].
However, in BII, this gene is always upregulated, and it can thus be considered a marker in addition to the genes AR through ESR1 listed above.

6.7.2 BI versus UNC

Since there are 39 samples in BI and 14 samples in BII, leaving 23 samples in UNC, it is also explored whether there is any possibility of another subgroup besides BI, BII, and BIII in this data set. Fig. 10 gives the genetic signature of 50 genes for BI versus UNC. Surprisingly, a classification accuracy of 100 percent is again obtained for this case. This implies that with a larger data set, one may actually find some other subtypes; otherwise, all these samples might simply be outliers. For this purpose, the details of a few genetic markers for this division are provided. The first few top-ranked genes have IDs AF052101, AL050227 (PTGER3, prostaglandin E receptor 3), NM_001255 (CDC20, cell division cycle 20 homolog), NM_007274 (ACOT7, acyl-CoA thioesterase 7), NM_004701 (CCNB2, cyclin B2), etc. Notice that the correlation ranks of these genes range from 0.75 to 0.71. This may indicate that there are either really no strong clusters in UNC, or more than one subgroup hidden in it.

Fig. 10. The genetic signature of 50 genes discriminating BI from the unclustered samples.

6.7.3 BII versus UNC

Fig. 11 shows the genetic signature of 50 genes distinguishing BII from UNC. In this case, a classification accuracy of 92.31 percent is obtained; one sample from UNC is misclassified as a BII sample. This result can be explained by the following fact: both AH-CUT and NORMALIZED-CUT put many samples from UNC in a sibling branch of BII in the very first division. However, BII is distinctly different from those unclustered samples, as found by ECHC. Thus, it is worth noting a few genetic markers for this division in the following.

Fig. 11. The genetic signature of 50 genes discriminating BII from the unclustered samples.
This method also does not need to have any previous knowledge about the number of clusters unlike some other methods like k-means algorithm. In the future, this method will be further explored by using more clustering algorithms. The hypothesis on melanoma and breast cancer will be also validated by using other data sets. ACKNOWLEDGMENTS The high-ranked genes for this division is given by Contig54477_RC, NM_018014 (BCL11A), NM_012391 (SPDEF), Contig56765_RC, NM_014211 (GABRP), Contig47814_RC, NM_004496 (FOXA1), etc. The correlation ranks of these genes are 0.87 to 0.82. Notice that the genes BCL11A, SPDEF, GABRP, and FOXA1 also appear as markers in BI versus BII. 6.7.4 BIII versus BII n BIII Finally, the last consensus cluster BIII (seven samples), which is a subgroup of BII, seems to bring new information to breast cancer. This is also witnessed by a clear genetic signature provided in Fig. 12. Notice that BIII is very much different from BII n BIII, in both AH-CUT and agglomerative clustering. Furthermore, again a 100 percent classification accuracy is obtained for this division of BII. To explore this curious finding, note the best-ranked genes for this division as follows: NM_003559 (PIP5K2B), AB029004 (ERC1), NM_003624 (RANBP3), AK001362 (DCBLD2), Contig50879_RC, NM_002339 (LSP1), Contig41262_RC, NM_003809 (TNFSF13, Tumor necrosis factor, superfamily member 13), NM_016267 (VGLL1), etc. The ranks of these genes range from 0.86 to .0.81. However, not much exploration has been done in biological experiments to find subgroups within BII or “basal” tumors and we wish that more wet lab experiments can be designed in the future to support these findings. Authorized licensd use limted to: IE Xplore. Downlade on May 10,2 at 19:507 UTC from IE Xplore. Restricon aply. This work was carried out while the author was a Research Academic at the University of Newcastle, Australia, and it was inspired by the author’s work while she was supported by the ARC Center for Bioinformatics, Australia. The author is also thankful to Dr. Pablo Moscato and Mr. Wagner Costa from the University of Newcastle for many enlightening discussions on breast cancer during that period. REFERENCES [1] [2] [3] [4] [5] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Statistics. Springer, 2001. M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, “Cluster Analysis and Display of Genome-Wide Expression Patterns,” Proc. Nat’l Academy of Sciences USA, vol. 95, pp. 14 863-14 868, 1998. T. Sorlie, R. Tibshirani, J. Parker, T. Hastie, J.S. Marron, A. Nobel, S. Deng, H. Johnsen, R. Pesich, S.G.J. Demeter, C.M. Perou, P.E. Lonning, P.O. Brown, A. Brresen-Dale, and D. Botstein, “Repeated Observation of Breast Tumor Subtypes in Independent Gene Expression Data Sets,” Proc. Nat’l Academy of Sciences USA, vol. 100, no. 14, pp. 8418-8423, July 2003. L.J. van’t Veer, H. Dai, M.J. van de Vijver, Y.D. He, A.A.M. Hart, M. Mao, H.L. Peterse, K. van der Kooy, M.J. Marton, A.T. Witteveen, G.J. Schreiber, R.M. Kerkhoven, C. Roberts, P.S. Linsley, R. Bernards, and S.H. Friend, “Computational Cluster Validation in Post-Genomic Data Analysis,” Nature, vol. 415, pp. 530-536, 2005. E.N. Adams, “Consensus Techniques and the Comparison of Taxonomic Trees,” Systematic Zoology, vol. 21, no. 4, pp. 390-397, 1972. MAHATA: EXPLORATORY CONSENSUS OF HIERARCHICAL CLUSTERINGS FOR MELANOMA AND BREAST CANCER [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] D.A. 
7 CONCLUSIONS
In this paper, a new method has been introduced for obtaining clusters in a data set (e.g., subtypes of a disease) by taking an exploratory "consensus" of a number of hierarchical clustering algorithms. Each cluster obtained by this method corresponds to subtrees of similar sizes in different hypothesized trees that agree on their elements but may differ in exact topology. This removes the need for the usual approach of cutting trees at certain heights and inspecting the resulting clusters. Unlike methods such as the k-means algorithm, this method also requires no prior knowledge of the number of clusters. In the future, the method will be explored further using more clustering algorithms, and the hypotheses on melanoma and breast cancer will be validated on additional data sets.
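To make the consensus idea concrete, the sketch below builds several hierarchical clusterings of the same synthetic data, enumerates the leaf set of every subtree in each tree, and keeps the sets on which all trees agree. This exact-agreement rule is a deliberate simplification in the spirit of strict consensus, adopted here only for brevity; the exploratory consensus introduced in this paper matches similar-sized subtrees rather than demanding identical leaf sets, so the code should be read as a conceptual sketch, not as the proposed algorithm.

# A simplified, strict-consensus-style sketch: enumerate the leaf set of
# every subtree in several hierarchical clusterings and keep the sets on
# which all trees agree. Data, sizes, and the planted groups are synthetic
# assumptions; this is not the paper's exploratory consensus procedure.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def subtree_leaf_sets(Z, n):
    # Collect the leaf set of every internal node of a SciPy linkage matrix.
    sets = {i: frozenset([i]) for i in range(n)}
    out = set()
    for k, row in enumerate(Z):
        sets[n + k] = sets[int(row[0])] | sets[int(row[1])]
        out.add(sets[n + k])
    return out

rng = np.random.default_rng(2)
X = rng.normal(size=(35, 30))        # synthetic samples-by-genes matrix
X[20:, :15] += 3.0                   # second group over-expresses 15 genes
D = pdist(X, metric="correlation")   # condensed pairwise distance matrix

trees = [linkage(D, method=m) for m in ("single", "average", "complete")]
agreed = set.intersection(*(subtree_leaf_sets(Z, X.shape[0]) for Z in trees))

# Report the nontrivial clusters present in every tree, largest first.
for c in sorted(agreed, key=len, reverse=True):
    if 1 < len(c) < X.shape[0]:
        print(len(c), sorted(c))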
ACKNOWLEDGMENTS
This work was carried out while the author was a Research Academic at the University of Newcastle, Australia, and it was inspired by the author's work while she was supported by the ARC Center for Bioinformatics, Australia. The author is also thankful to Dr. Pablo Moscato and Mr. Wagner Costa from the University of Newcastle for many enlightening discussions on breast cancer during that period.

REFERENCES
[1] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer, 2001.
[2] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, "Cluster Analysis and Display of Genome-Wide Expression Patterns," Proc. Nat'l Academy of Sciences USA, vol. 95, pp. 14863-14868, 1998.
[3] T. Sorlie, R. Tibshirani, J. Parker, T. Hastie, J.S. Marron, A. Nobel, S. Deng, H. Johnsen, R. Pesich, S. Geisler, J. Demeter, C.M. Perou, P.E. Lonning, P.O. Brown, A. Borresen-Dale, and D. Botstein, "Repeated Observation of Breast Tumor Subtypes in Independent Gene Expression Data Sets," Proc. Nat'l Academy of Sciences USA, vol. 100, no. 14, pp. 8418-8423, July 2003.
[4] L.J. van't Veer, H. Dai, M.J. van de Vijver, Y.D. He, A.A.M. Hart, M. Mao, H.L. Peterse, K. van der Kooy, M.J. Marton, A.T. Witteveen, G.J. Schreiber, R.M. Kerkhoven, C. Roberts, P.S. Linsley, R. Bernards, and S.H. Friend, "Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer," Nature, vol. 415, pp. 530-536, 2002.
[5] E.N. Adams, "Consensus Techniques and the Comparison of Taxonomic Trees," Systematic Zoology, vol. 21, no. 4, pp. 390-397, 1972.
[6] D.A. Neumann, "Faithful Consensus Methods for n-Trees," Math. Biosciences, vol. 63, pp. 271-287, 1983.
[7] R.R. Sokal and F.J. Rohlf, "Taxonomic Congruence in the Leptopodomorpha Re-Examined," Systematic Zoology, vol. 30, pp. 304-325, 1981.
[8] T. Margush and F.R. McMorris, "Consensus n-Trees," Bull. Math. Biology, vol. 43, pp. 239-244, 1981.
[9] M. Wilkinson, "Common Cladistic Information and Its Consensus Representation: Reduced Adams and Reduced Cladistic Consensus Trees and Profiles," Systematic Biology, vol. 43, pp. 343-368, 1994.
[10] A.M. Krieger and P.E. Green, "A Generalized Rand-Index Method for Consensus Clustering of Separate Partitions of the Same Data Base," J. Classification, vol. 16, pp. 63-89, 1999.
[11] A. Topchy, B. Minaei-Bidgoli, A.K. Jain, and W.F. Punch, "Adaptive Clustering Ensembles," Proc. 17th Int'l Conf. Pattern Recognition (ICPR '04), vol. 1, pp. 272-275, 2004.
[12] S. Monti, P. Tamayo, J. Mesirov, and T. Golub, "Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data," Machine Learning, vol. 52, no. 1/2, pp. 91-118, 2003.
[13] S. Swift, A. Tucker, V. Vinciotti, N. Martin, C. Orengo, X. Liu, and P. Kellam, "Consensus Clustering and Functional Interpretation of Gene-Expression Data," Genome Biology, vol. 5, R94, 2004.
[14] V. Filkov and S. Skiena, "Heterogeneous Data Integration with the Consensus Clustering Formalism," Proc. Int'l Workshop Data Integration in the Life Sciences (DILS '04), vol. 2994, pp. 110-123, 2004.
[15] T. Grotkjaer, O. Winther, B. Regenberg, J. Nielsen, and L.K. Hansen, "Robust Multi-Scale Clustering of Large DNA Microarray Datasets with the Consensus Algorithm," Bioinformatics, vol. 22, no. 1, pp. 58-67, 2006.
[16] S. Kirkpatrick, C.D. Gelatt Jr., and M.P. Vecchi, "Optimization by Simulated Annealing," Science, vol. 220, pp. 671-680, 1983.
[17] Y. Wakabayashi, "The Complexity of Computing Medians of Relations," Resenhas IME-USP, vol. 3, pp. 323-349, 1998.
[18] M. Krivanek and J. Moravek, "Hard Problems in Hierarchical-Tree Clustering," Acta Informatica, vol. 23, pp. 311-323, 1986.
[19] B. Raskutti, H. Ferra, and A. Kowalczyk, "Combining Clustering and Co-Training to Enhance Text Classification Using Unlabelled Data," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), 2002.
[20] A. Strehl and J. Ghosh, "Cluster Ensembles—A Knowledge Reuse Framework for Combining Partitionings," Proc. 18th Nat'l Conf. Artificial Intelligence (AAAI '02), pp. 93-98, 2002.
[21] M. Bittner, P. Meltzer, Y. Chen, Y. Jiang, E. Seftor, M. Hendrix, M. Radmacher, R. Simon, Z. Yakhini, A. Ben-Dor, N. Sampas, E. Dougherty, E. Wang, F. Marincola, C. Gooden, J. Lueders, A. Glatfelter, P. Pollock, J. Carpten, E. Gillanders, D. Leja, K. Dietrich, C. Beaudry, M. Berens, D. Alberts, V. Sondak, N. Hayward, and J. Trent, "Molecular Classification of Cutaneous Malignant Melanoma by Gene Expression Profiling," Nature, vol. 406, pp. 536-540, 2000.
[22] P. Mahata, W. Costa, C. Cotta, and P. Moscato, "Hierarchical Clustering, Languages and Cancer," Proc. EvoWorkshops '06, pp. 67-78, 2006.
[23] R. Rizzi, P. Mahata, W. Costa, and P. Moscato, "Hierarchical Clustering Using Arithmetic-Harmonic Cut," submitted for publication, 2007.
[24] I. Dyen, J.B. Kruskal, and P. Black, "An Indo-European Classification: A Lexicostatistical Experiment," Trans. Am. Philosophical Soc., new series, vol. 82, no. 5, pp. 1-132, 1992.
[25] D.T. Ross, U. Scherf, M. Eisen, C. Perou, C. Rees, P. Spellman, V. Iyer, S. Jeffrey, M. van de Rijn, M. Waltham, A. Pergamenschikov, J.C. Lee, D. Lashkari, D. Shalon, T. Myers, J.N. Weinstein, D. Botstein, and P. Brown, "Systematic Variation in Gene Expression Patterns in Human Cancer Cell Lines," Nature Genetics, vol. 24, pp. 227-235, Mar. 2000.
[26] J. Shi and J. Malik, "Normalized Cuts and Image Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.
[27] W.M. Rand, "Objective Criteria for the Evaluation of Clustering Methods," J. Am. Statistical Assoc., vol. 66, pp. 846-850, 1971.
[28] L. Hubert and P. Arabie, "Comparing Partitions," J. Classification, vol. 2, no. 1, pp. 193-218, 1985.
[29] P. Rousseeuw, "Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis," J. Computational and Applied Math., vol. 20, pp. 53-65, 1987.
[30] J. Bezdek and N. Pal, "Some New Indexes of Cluster Validity," IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 28, pp. 301-315, 1998.
[31] D.L. Davies and D.W. Bouldin, "A Cluster Separation Measure," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224-227, 1979.
[32] F. Azuaje and N. Bolshakova, "Clustering Genome Expression Data: Design and Evaluation Principles," Understanding and Using Microarray Analysis Techniques: A Practical Guide, D. Berrar, W. Dubitzky, and M. Granzow, eds., Springer-Verlag, 2002.
[33] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander, "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, vol. 286, pp. 531-537, 1999.
[34] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, second ed. Morgan Kaufmann, 2005.
[35] M. Smolkin and D. Ghosh, "Cluster Stability Scores for Microarray Data in Cancer Studies," BMC Bioinformatics, vol. 4, no. 36, 2003.
[36] A. Ben-Dor, R. Shamir, and Z. Yakhini, "Clustering Gene Expression Patterns," J. Computational Biology, vol. 6, pp. 281-297, 1999.
[37] T.B. Lewis, J.E. Robison, R. Bastien, B. Milash, K. Boucher, W.E. Samlowski, S.A. Leachman, R.D. Noyes, C.T. Wittwer, L. Perreard, and P.S. Bernard, "Molecular Classification of Melanoma Using Real-Time Quantitative Reverse Transcriptase-Polymerase Chain Reaction," Cancer, vol. 104, no. 8, pp. 1678-1686, Oct. 2005.
[38] I.M. Bachmann, O. Straume, H.E. Puntervoll, M.B. Kalvenes, and L.A. Akslen, "Importance of P-Cadherin, Beta-Catenin, and Wnt5a/Frizzled for Progression of Melanocytic Tumors and Prognosis in Cutaneous Melanoma," Clinical Cancer Research, vol. 11, no. 24, pt. 1, pp. 8606-8614, 2005.
[39] H. Boukerche, Z.Z. Su, L. Emdad, D. Sarkar, and P.B. Fisher, "mda-9/Syntenin Regulates the Metastatic Phenotype in Human Melanoma Cells by Activating Nuclear Factor-KappaB," Cancer Research, vol. 67, no. 4, pp. 1812-1822, Feb. 2007.
[40] K. McPherson, C.M. Steel, and J.M. Dixon, "ABC of Breast Diseases: Breast Cancer—Epidemiology, Risk Factors, and Genetics," British Medical J., vol. 321, pp. 624-628, 2000.
[41] L.C. Dorssers, S. van der Flier, A. Brinkman, T. van Agthoven, J. Veldscholte, E.M. Berns, J. Klijn, L.V. Beex, and J.A. Foekens, "Tamoxifen Resistance in Breast Cancer: Elucidating Mechanisms," Drugs, vol. 61, no. 12, pp. 1721-1733, 2001.
[42] I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, O.P. Kallioniemi, B. Wilfond, A. Borg, J. Trent, M. Raffeld, Z. Yakhini, A. Ben-Dor, E. Dougherty, J. Kononen, L. Bubendorf, W. Fehrle, S. Pittaluga, S. Gruvberger, N. Loman, O. Johannsson, H. Olsson, and G. Sauter, "Gene-Expression Profiles in Hereditary Breast Cancer," New England J. Medicine, vol. 344, no. 8, pp. 539-548, 2001.
[43] K.E. Lee, N. Sha, E.R. Dougherty, M. Vannucci, and B.K. Mallick, "Gene Selection: A Bayesian Variable Selection Approach," Bioinformatics, vol. 19, pp. 90-97, 2003.
[44] P. Mahata and K. Mahata, "Selecting Differentially Expressed Genes Using Minimum Probability of Classification Error," J. Biomedical Informatics, vol. 40, no. 6, pp. 775-786, 2007.
[45] Y. Pawitan, J. Bjöhle, L. Amler, A. Borg, S. Egyhazi, P. Hall, X. Han, L. Holmberg, F. Huang, S. Klaar, E.T. Liu, L. Miller, H. Nordgren, A. Ploner, K. Sandelin, P.M. Shaw, J. Smeds, L. Skoog, S. Wedrén, and J. Bergh, "Gene Expression Profiling Spares Early Breast Cancer Patients from Adjuvant Therapy: Derived and Validated in Two Population-Based Cohorts," Breast Cancer Research, vol. 7, pp. R953-R964, 2005.
[46] S. Paik, S. Shak, G. Tang, C. Kim, J. Baker, M. Cronin, F.L. Baehner, M.G. Walker, D. Watson, T. Park, W. Hiller, E.R. Fisher, L. Wickerham, J. Bryant, and N. Wolmark, "A Multigene Assay to Predict Recurrence of Tamoxifen-Treated, Node-Negative Breast Cancer," New England J. Medicine, vol. 351, no. 27, pp. 2817-2826, 2004.
[47] T. Sorlie, C.M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie, M.B. Eisen, M. van de Rijn, S.S. Jeffrey, T. Thorsen, H. Quist, J.C. Matese, P.O. Brown, D. Botstein, P.E. Lonning, and A. Borresen-Dale, "Gene Expression Patterns of Breast Carcinomas Distinguish Tumor Subclasses with Clinical Implications," Proc. Nat'l Academy of Sciences USA, vol. 98, no. 19, pp. 10869-10874, Sept. 2001.
[48] S.N. Agoff, P.E. Swanson, H. Linden, S.E. Hawes, and T.J. Lawton, "Androgen Receptor Expression in Estrogen Receptor-Negative Breast Cancer: Immunohistochemical, Clinical, and Prognostic Associations," Am. J. Clinical Pathology, vol. 120, no. 5, pp. 725-731, Nov. 2003.
[49] H. Nakshatri and S. Badve, "FOXA1 as a Therapeutic Target for Breast Cancer," Expert Opinion on Therapeutic Targets, vol. 11, no. 4, pp. 507-514, Apr. 2007.
[50] R. Mehra, S. Varambally, L. Ding, R. Shen, M.S. Sabel, D. Ghosh, A.M. Chinnaiyan, and C.G. Kleer, "Identification of GATA3 as a Breast Cancer Prognostic Marker by Global Gene Expression Meta-Analysis," Cancer Research, vol. 65, pp. 11259-11264, Dec. 2005.
[51] M. Zafrakas, M. Chorovicer, I. Klaman, G. Kristiansen, P. Wild, U. Heindrichs, R. Knüchel, and E. Dahl, "Systematic Characterization of GABRP Expression in Sporadic Breast Cancer and Normal Breast Tissue," Int'l J. Cancer, vol. 118, no. 6, pp. 1453-1459, 2006.
[52] M.J. van de Vijver, Y.D. He, L.J. van't Veer, H. Dai, A.A.M. Hart, D.W. Voskuil, G.J. Schreiber, J.L. Peterse, C. Roberts, M.J. Marton, M. Parrish, D. Atsma, A.T. Witteveen, A. Glas, L. Delahaye, T. van der Velde, H. Bartelink, S. Rodenhuis, E.T. Rutgers, S.H. Friend, and R. Bernards, "A Gene-Expression Signature as a Predictor of Survival in Breast Cancer," New England J. Medicine, vol. 347, no. 25, pp. 1999-2009, 2002.
Pritha Mahata received the BE degree in electrical engineering from Jadavpur University, Kolkata, India, in 1996, the ME degree in computer science and automation from the Indian Institute of Science, Bangalore, India, in 1999, and the PhD degree in computer science from Uppsala University, Sweden, in 2005. Currently, she is a research fellow in the School of Information Technology and Electrical Engineering, University of Queensland, Australia. She was a postdoctoral fellow from 2005 to 2007 in the School of Electrical Engineering and Computer Science, University of Newcastle, Australia. She also worked at Asea Brown Boveri Ltd., India, from 1996 to 1997, and at Texas Instruments, India, from 1999 to 2000. Her main research interests include verification and validation of software systems, real-time communication protocols, bioinformatics, and machine learning.