What Is Clustering?
• Group data into clusters
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
  – Unsupervised learning: no predefined classes

Similarity and Dissimilarity
• Distances are the normally used measures
• Minkowski distance: a generalization
  $d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q \right)^{1/q}, \quad q > 0$
• If q = 2, d is the Euclidean distance
• If q = 1, d is the Manhattan distance
• If q = ∞, d is the Chebyshev distance
• Weighted distance
  $d(i,j) = \left( w_1 |x_{i1} - x_{j1}|^q + w_2 |x_{i2} - x_{j2}|^q + \dots + w_p |x_{ip} - x_{jp}|^q \right)^{1/q}, \quad q > 0$

Manhattan and Chebyshev Distance
[Figure: Manhattan distance and Chebyshev distance; when n = 2, the Chebyshev distance is the chess distance. Picture from Wikipedia]

Properties of Minkowski Distance
• Nonnegative: d(i,j) ≥ 0
• The distance of an object to itself is 0
  – d(i,i) = 0
• Symmetric: d(i,j) = d(j,i)
• Triangular inequality (requires q ≥ 1)
  – d(i,j) ≤ d(i,k) + d(k,j)

Clustering Methods
• K-means and partitioning methods
  – K-means: a cluster is represented by the center
  – K-medoids or PAM (partition around medoids): each cluster is represented by one of the objects in the cluster
• Hierarchical clustering
• Density-based clustering
• Grid-based clustering
• Pattern-based clustering
• Other clustering methods

K-Means: Example
[Figure: K = 2; assign each object to the most similar center]

Pros and Cons of K-means
• Relatively efficient: O(tkn)
  – n: # objects, k: # clusters, t: # iterations; k, t << n
• Often terminates at a local optimum
• Applicable only when the mean is defined
  – What about categorical data?
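The Minkowski family of distances and the assign/update loop of k-means described above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function names `minkowski` and `kmeans` are hypothetical, and the initial centers are simply the first k objects, which is an assumption made only to keep the run deterministic.

```python
import math


def minkowski(x, y, q=2.0):
    """Minkowski distance: q=1 Manhattan, q=2 Euclidean, q=math.inf Chebyshev."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(q):
        return max(diffs)
    return sum(d ** q for d in diffs) ** (1.0 / q)


def kmeans(points, k, max_iter=100):
    """Plain k-means with Euclidean distance on a list of numeric tuples."""
    # Assumption: naive initialization with the first k objects.
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its most similar (nearest) center.
        new_labels = [
            min(range(k), key=lambda c: minkowski(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # no object changed cluster: converged
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, 2)
```

Each pass over the data costs O(kn) distance computations, which is where the O(tkn) bound comes from. A different choice of initial centers can end in a different local optimum, so practical implementations usually restart from several initializations.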
• Need to specify the number of clusters
• Unable to handle noisy data and outliers
• Unsuitable for discovering non-convex clusters

Variations of the K-means
• Aspects of variations
  – Selection of the initial k means
  – Dissimilarity calculations
  – Strategies to calculate cluster means
• Handling categorical data: k-modes
  – Use mode instead of mean
  – Mode: the most frequent item(s)
  – A mixture of categorical and numerical data: k-prototype method
• EM (expectation maximization): assign a probability of an object to a cluster

A Problem of K-means
• Sensitive to outliers
  – Outlier: objects with extremely large values
  – May substantially distort the distribution of the data
• K-medoids: the most centrally located object in a cluster

PAM: A K-medoids Method
• PAM: Partitioning Around Medoids
• Arbitrarily choose k objects as the initial medoids
• Until no change, do
  – (Re)assign each object to the cluster of the nearest medoid
  – Randomly select a non-medoid object o', compute the total cost, S, of swapping medoid o with o'
  – If S < 0 then swap o with o' to form the new set of k medoids

Swapping Cost
• Measure whether o' is better than o as a medoid
• Use the squared-error criterion
  $E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, o_i)^2$
  – Compute $E_{o'} - E_o$
  – Negative: swapping brings benefit

PAM: Example
[Figure]

Pros and Cons of PAM
• PAM is more robust than k-means in the presence of noise and outliers
  – Medoids are less influenced by outliers
• PAM is efficient for small data sets but does not scale well for large data sets
  – $O(k(n-k)^2)$ for each iteration
• Sampling based method: CLARA

CLARA
• CLARA: Clustering LARge Applications (Kaufmann and Rousseeuw in 1990)
  – Built in statistical analysis packages, such as S+
• Draw multiple samples of the data set, apply PAM on each sample, give the best clustering
• Performs better than PAM in larger data sets
• Efficiency depends on the sample size
  – A good clustering on samples may not be a good clustering of the whole data set

CLARANS
• Clustering large applications based upon randomized search
• The problem space graph of clustering
  – A vertex is a set of k objects chosen from the n objects, $\binom{n}{k}$ vertices in total
  – PAM searches the whole graph
  – CLARA searches some random sub-graphs
• CLARANS climbs hills
  – Randomly sample a set and select k medoids
  – Consider neighbors of medoids as candidates for new medoids
  – Use the sample set to verify
  – Repeat multiple times to avoid bad samples

Hierarchy
• An arrangement or classification of things according to inclusiveness
• A natural way of abstraction, summarization, compression, and simplification for understanding
• Typical setting: organize a given set of objects to a hierarchy

Hierarchical Clustering
• Group data objects into a tree of clusters
• Top-down versus bottom-up
  – Bottom-up: agglomerative (AGNES)
  – Top-down: divisive (DIANA)

AGNES (Agglomerative Nesting)
• Initially, each object is a cluster
• Step-by-step cluster merging, until all objects form a cluster
  – Single-link approach
  – Each cluster is represented by all of the objects in the cluster
  – The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters

Dendrogram
• Show how to merge clusters hierarchically
• Decompose data objects into a multilevel nested partitioning (a tree of clusters)
• A clustering of the data objects: cutting the dendrogram at the desired level
  – Each connected component forms a cluster

DIANA (Divisive ANAlysis)
• Initially, all objects are in one cluster
• Step-by-step splitting clusters until each cluster contains only one object

Distance Measures
• Minimum distance: $d_{min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d(p, q)$
• Maximum distance: $d_{max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d(p, q)$
• Mean distance: $d_{mean}(C_i, C_j) = d(m_i, m_j)$
• Average distance: $d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$
  – m: mean for a cluster
  – C: a cluster
  – n: the number of objects in a cluster

Multi-Clustering Challenges
• Hard to choose merge/split points
  – Never undo merging/splitting
  – Merging/splitting decisions are critical
• High complexity $O(n^2)$
• Integrating hierarchical clustering with other techniques
  – BIRCH, CURE, CHAMELEON, ROCK

BIRCH
• Balanced Iterative Reducing and Clustering using Hierarchies
• CF (Clustering Feature) tree: a hierarchical data structure summarizing object info
  – Clustering objects → clustering leaf nodes of the CF tree

Clustering Feature Vector
• Clustering feature: CF = (N, LS, SS)
  – N: number of data points
  – LS: $\sum_{i=1}^{N} o_i$
  – SS: $\sum_{i=1}^{N} o_i^2$
  – Summarize the statistics for a cluster
  – Many cluster quality measures (e.g., radius, distance) can be derived
  – Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)

CF-tree in BIRCH
• A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
  – A nonleaf node in a tree has descendants or "children"
  – The nonleaf nodes store sums of the CFs of children

Parameters of a CF-tree
• Branching factor: the maximum number of children
• Threshold: max diameter of sub-clusters stored at the leaf nodes

BIRCH Clustering
• Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
• Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

Pros & Cons of BIRCH
• Linear scalability
  – Good clustering with a single scan
  – Quality can be further improved by a few additional scans
• Can handle only numeric data
• Sensitive to the order of the data records

Drawbacks of Square Error Based Methods
• One representative per cluster
  – Good only for convex-shaped clusters having similar size and density
• K: the number of clusters parameter
  – Good only if k can be reasonably estimated

CURE: the Ideas
• Each cluster has c representatives
  – Choose c well scattered points in the cluster
  – Shrink them towards the mean of the cluster by a fraction of α
  – The representatives capture the physical shape and geometry of the cluster
• Merge the closest two clusters
  – Distance of two clusters: the distance between the two closest representatives

CURE: The Algorithm
• Draw random sample S
• Partition sample to p partitions
• Partially cluster each partition
• Eliminate outliers
  – Random sampling + remove clusters growing too slowly
• Cluster partial clusters until only k clusters left
  – Shrink representatives of clusters towards the cluster center

Data Partitioning and Clustering
[Figure]

Shrinking Representative Points
• Shrink the multiple representative points towards the gravity center by a fraction of α
• Representatives capture the shape

ROCK
• Robust Clustering using links
  – # of common neighbors between two points
  – Use links to measure similarity/proximity
  – Not distance based
  – $O(n^2 + n\, m_m m_a + n^2 \log n)$
• Basic ideas:
  – Similarity function and neighbors: $Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$
  – Let $T_1 = \{1,2,3\}$, $T_2 = \{3,4,5\}$: $Sim(T_1, T_2) = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2$

Limitations
• Merging decision based on static modeling
  – No special characteristics of clusters are considered
[Figure: CURE and BIRCH merge C1 and C2; C1' and C2' are more appropriate for merging]

Chameleon
• Hierarchical clustering using dynamic modeling
• Measures the similarity based on a dynamic model
  – Interconnectivity & closeness (proximity) between two clusters vs interconnectivity of the clusters and closeness of items within the clusters
• A two-phase algorithm
  – Use a graph partitioning algorithm: cluster objects into a large number of relatively small sub-clusters
  – Find the genuine clusters by repeatedly combining sub-clusters

Overall Framework of CHAMELEON
[Figure: Data Set → Construct Sparse Graph → Partition the Graph → Merge Partition → Final Clusters]

Distance-based Methods: Drawbacks
• Hard to find clusters with irregular shapes
• Hard to specify the number of clusters
• Heuristic: a cluster must be dense

How to Find Irregular Clusters?
• Divide the whole space into many small areas
  – The density of an area can be estimated
  – Areas may or may not be exclusive
  – A dense area is likely in a cluster
• Start from a dense area, traverse connected dense areas and discover clusters in irregular shape

Directly Density Reachable
• Parameters
  – Eps: maximum radius of the neighborhood
  – MinPts: minimum number of points in an Eps-neighborhood of that point
  – $N_{Eps}(p) = \{q \mid dist(p,q) \le Eps\}$
• Core object p: $|N_{Eps}(p)| \ge MinPts$
  – A core object is in a dense area
• Point q is directly density-reachable from p iff $q \in N_{Eps}(p)$ and p is a core object
[Figure: MinPts = 3, Eps = 1 cm]

Density-Based Clustering
• Density-reachable
  – Directly density reachable p1 → p2, p2 → p3, …, pn-1 → pn
  – pn is density-reachable from p1
• Density-connected
  – If points p, q are density-reachable from o then p and q are density-connected

DBSCAN
• A cluster: a maximal set of density-connected points
  – Discover clusters of arbitrary shape in spatial databases with noise
[Figure: Eps = 1 cm, MinPts = 5]

DBSCAN: the Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p wrt Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed

Challenges for DBSCAN
• Different clusters may have very different densities
• Clusters may be in hierarchies

OPTICS: A Cluster-ordering Method
• Idea: ordering points to identify the clustering structure
• "Group" points by density connectivity
  – Hierarchies of clusters
• Visualize clusters and the hierarchy

Ordering Points
• Points strongly density-connected should be close to one another
• Clusters density-connected should be close to one another and form a "cluster" of clusters

OPTICS: An Example
[Figure]

DENCLUE: Using Density Functions
• DENsity-based CLUstEring
• Major features
  – Solid mathematical foundation
  – Good for data sets with large
    amounts of noise
  – Allow a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
  – Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
  – But need a large number of parameters

DENCLUE: Techniques
• Use grid cells
  – Only keep grid cells actually containing data points
  – Manage cells in a tree-based access structure
• Influence function: describe the impact of a data point on its neighborhood
• Overall density of the data space is the sum of the influence function of all data points
• Clustering by identifying density attractors
  – Density attractor: local maximum of the overall density function

Density Attractor
[Figure]

Center-defined and Arbitrary Clusters
[Figure]

A Shrinking-based Approach
• Difficulties of Multi-dimensional Clustering
  – Noise (outliers)
  – Clusters of various densities
  – Not well-defined shapes
• A novel preprocessing concept "Shrinking"
• A shrinking-based clustering approach

Intuition & Purpose
• For data points in a data set, what if we could make them move towards the centroid of the natural subgroup they belong to?
• Natural sparse subgroups become denser, thus easier to be detected
  – Noises are further isolated

Inspiration
• Newton's Universal Law of Gravitation
  – Any two objects exert a gravitational force of attraction on each other
  – The direction of the force is along the line joining the objects
  – The magnitude of the force is directly proportional to the product of the gravitational masses of the objects, and inversely proportional to the square of the distance between them
  – G: universal gravitational constant
  $F_g = G \frac{m_1 m_2}{r^2}$
• $G = 6.67 \times 10^{-11}\ \mathrm{N\,m^2/kg^2}$

Apply shrinking into clustering field

The Concept of Shrinking
• A data preprocessing technique
• Shrink the natural sparse clusters to make them much denser to facilitate further cluster-detecting process.
  – Aim to optimize the inner structure of real data sets
• Each data point is "attracted" by other data points and moves in the direction in which the attraction is the strongest
• Can be applied in different fields
[Figure: multiattribute hyperspace]

Jian Pei: Big Data Analytics -- Clustering

Data Shrinking
• Each data point moves along the direction of the density gradient and the data set shrinks towards the inside of the clusters
• Points are "attracted" by their neighbors and move to create denser clusters
• It proceeds iteratively; repeated until the data are stabilized or the number of iterations exceeds a threshold

Approximation & Simplification
• Problem: computing the mutual attraction of each pair of data points is too time consuming, $O(n^2)$
  – Solution: no Newton's constant G; m1 and m2 are set to unit
• Only aggregate the gravitation surrounding each data point
• Use grids to simplify the computation

Termination condition
• Average movement of all points in the current iteration is less than a threshold
• The number of iterations exceeds a threshold

Optics on Pendigits Data
[Figure: before data shrinking; after data shrinking]

Fuzzy Clustering
• Each point $x_i$ takes a probability $w_{ij}$ to belong to a cluster $C_j$
• Requirements
  – For each point $x_i$: $\sum_{j=1}^{k} w_{ij} = 1$
  – For each cluster $C_j$: $0 < \sum_{i=1}^{m} w_{ij} < m$

Fuzzy C-Means (FCM)
Select an initial fuzzy pseudo-partition, i.e., assign values to all the $w_{ij}$
Repeat
  Compute the centroid of each cluster using the fuzzy pseudo-partition
  Recompute the fuzzy pseudo-partition, i.e., the $w_{ij}$
Until the centroids do not change (or the change is below some threshold)

Critical Details
• Optimization on sum of
  the squared error (SSE):
  $SSE(C_1, \dots, C_k) = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}^{p}\, dist(x_i, c_j)^2$
• Computing centroids: $c_j = \sum_{i=1}^{m} w_{ij}^{p} x_i \Big/ \sum_{i=1}^{m} w_{ij}^{p}$
• Updating the fuzzy pseudo-partition
  $w_{ij} = \frac{\left(1 / dist(x_i, c_j)^2\right)^{\frac{1}{p-1}}}{\sum_{q=1}^{k} \left(1 / dist(x_i, c_q)^2\right)^{\frac{1}{p-1}}}$
  – When p = 2: $w_{ij} = \frac{1 / dist(x_i, c_j)^2}{\sum_{q=1}^{k} 1 / dist(x_i, c_q)^2}$

Choice of P
• When p → 1, FCM behaves like traditional k-means
• When p is larger, the cluster centroids approach the global centroid of all data points
• The partition becomes fuzzier as p increases

Effectiveness
[Figure]

Mixture Models
• A cluster can be modeled as a probability distribution
  – Practically, assume a distribution can be approximated well using a multivariate normal distribution
• Multiple clusters are a mixture of different probability distributions
• A data set is a set of observations from a mixture of models

Object Probability
• Suppose there are k clusters and a set X of m objects
  – Let the j-th cluster have parameter $\theta_j = (\mu_j, \sigma_j)$
  – The probability that a point is in the j-th cluster is $w_j$, $w_1 + \dots + w_k = 1$
• The probability of an object x is
  $prob(x \mid \Theta) = \sum_{j=1}^{k} w_j\, p_j(x \mid \theta_j)$
  $prob(X \mid \Theta) = \prod_{i=1}^{m} prob(x_i \mid \Theta) = \prod_{i=1}^{m} \sum_{j=1}^{k} w_j\, p_j(x_i \mid \theta_j)$

Example
  $prob(x_i \mid \Theta) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
  $\theta_1 = (-4, 2)$, $\theta_2 = (4, 2)$
  $prob(x \mid \Theta) = \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x+4)^2}{8}} + \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x-4)^2}{8}}$

Maximal Likelihood Estimation
• Maximum likelihood principle: if we know a set of objects are from one distribution, but do not know the parameter, we can choose the parameter maximizing the probability
• Maximize
  $prob(X \mid \Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$
  – Equivalently, maximize
    $\log prob(X \mid \Theta) = -\sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma$

EM Algorithm
• Expectation Maximization algorithm
Select an initial set of model parameters
Repeat
  Expectation Step: for each object, calculate the probability that it belongs to each distribution $\theta_j$, i.e., $prob(x_i \mid \theta_j)$
  Maximization Step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood
Until the parameters are stable

Advantages and Disadvantages
• Mixture models are more general than k-means and fuzzy c-means
• Clusters can be characterized by a small number of parameters
• The results may satisfy the statistical assumptions of the generative models
• Computationally expensive
• Need large data sets
• Hard to estimate the number of clusters

Grid-based Clustering Methods
• Ideas
  – Using multi-resolution grid data structures
  – Using dense grid cells to form clusters
• Several interesting methods
  – CLIQUE
  – STING
  – WaveCluster

CLIQUE
• Clustering In QUEst
• Automatically identify subspaces of a high dimensional data space
• Both density-based and grid-based

CLIQUE: the Ideas
• Partition each dimension into the same number of equal length intervals
  – Partition an m-dimensional data space into non-overlapping rectangular units
• A unit is dense if the number of data points in the unit exceeds a threshold
• A cluster is a maximal set of connected dense units within a subspace

CLIQUE: the Method
• Partition the data space and find the number of points in each cell of the partition
  – Apriori: a k-d cell cannot be dense if one of its (k-1)-d projections is not dense
• Identify clusters:
  – Determine dense units in all subspaces of interests and connected dense units in all subspaces of interests
• Generate minimal description for the clusters
  – Determine the
    minimal cover for each cluster

CLIQUE: An Example
[Figure: dense units on grids of salary (10,000), 0 to 7, versus age, 20 to 60, and vacation (week), 0 to 7, versus age, 20 to 60, combined in a 3-D view over age, vacation, and salary]

CLIQUE: Pros and Cons
• Automatically find subspaces of the highest dimensionality with high density clusters
• Insensitive to the order of input
  – Not presume any canonical data distribution
• Scale linearly with the size of input
• Scale well with the number of dimensions
• The clustering result may be degraded at the expense of simplicity of the method

Bad Cases for CLIQUE
[Figure: parts of a cluster may be missed; a cluster from CLIQUE may contain noise]

Biclustering
• Clustering both objects and attributes simultaneously
• Four requirements
  – Only a small set of objects in a cluster (bicluster)
  – A bicluster only involves a small number of attributes
  – An object may participate in multiple biclusters or no biclusters
  – An attribute may be involved in multiple biclusters, or no biclusters

Application Examples
• Recommender systems
  – Objects: users
  – Attributes: items
  – Values: user ratings
• Microarray data
  – Objects: genes
  – Attributes: samples
  – Values: expression levels
[Figure: gene by sample/condition matrix with entries w11, w12, …, w1m through wn1, wn2, …, wnm]

Biclusters with Constant Values

        a1   ···  a33  ···  a86  ···
  b6    60   ···  60   ···  60   ···
  b12   60   ···  60   ···  60   ···
  b36   60   ···  60   ···  60   ···
  b99   60   ···  60   ···  60   ···

Figure 11.5: A gene-condition matrix, a submatrix, and a bi-cluster.

On rows

  10  10  10  10  10
  20  20  20  20  20
  50  50  50  50  50
   0   0   0   0   0

Figure 11.6: A bi-cluster with constant values on rows.

Biclusters with Coherent Values
• Also known as pattern-based clusters

  10  50  30   70  20
  20  60  40   80  30
  50  90  70  110  60
   0  40  20   60  10

Figure 11.7: A bi-cluster with coherent values.

Biclusters with Coherent Evolutions
• Only up- or down-regulated changes over rows or columns

Coherent evolutions on rows

  10   50  30    70  20
  20  100  50  1000  30
  50  100  90   120  80
   0   80  20   100  10

Figure 11.8: A bi-cluster with coherent evolutions on rows.

Objects Follow the Same Pattern?
[Figure: pScore of two objects (blue and green) on attributes D1 and D2]
• The less the pScore, the more consistent the objects

Textbook pages shown on the slides (Chapter 11, Advanced Cluster Analysis; Section 11.2, Clustering High-Dimensional Data):

subset of products. For example, AllElectronics is highly interested in finding a group of customers who all like the same group of products. Such a cluster is a submatrix in the customer-product matrix, where all elements have a high value. Using such a cluster, AllElectronics can make recommendations in two directions. First, the company can recommend products to new customers who are similar to the customers in the cluster. Second, the company can recommend to customers new products that are similar to those involved in the cluster.

As with bi-clusters in a gene expression data matrix, the bi-clusters in a customer-product matrix usually have the following characteristics:
• Only a small set of customers participate in a cluster;
• A cluster involves only a small subset of products;
• A customer can participate in multiple clusters, or may not participate in any cluster at all; and
• A product may be involved in multiple clusters, or may not be involved in any cluster at all.
Bi-clustering can be applied to customer-product matrices to mine clusters satisfying the above requirements.

Types of Bi-clusters
"How can we model bi-clusters and mine them?" Let's start with some basic notation. For the sake of simplicity, we'll use "genes" and "conditions" to refer to the two dimensions in our discussion. Our discussion can easily be extended to other applications. For example, we can simply replace "genes" and "conditions" by "customers" and "products" to tackle the customer-product bi-clustering problem.

Let $A = \{a_1, \dots, a_n\}$ be a set of genes and $B = \{b_1, \dots, b_m\}$ be a set of conditions. Let $E = [e_{ij}]$ be a gene expression data matrix, that is, a gene-condition matrix, where $1 \le i \le n$ and $1 \le j \le m$. A submatrix $I \times J$ is defined by a subset $I \subseteq A$ of genes and a subset $J \subseteq B$ of conditions. For example, in the matrix shown in Figure 11.5, $\{a_1, a_{33}, a_{86}\} \times \{b_6, b_{12}, b_{36}, b_{99}\}$ is a submatrix.

A bi-cluster is a submatrix where genes and conditions follow consistent patterns. We can define different types of bi-clusters based on such patterns:
• As the simplest case, a submatrix $I \times J$ ($I \subseteq A$, $J \subseteq B$) is a bi-cluster with constant values if for any $i \in I$ and $j \in J$, $e_{ij} = c$, where c is a constant. For example, the submatrix $\{a_1, a_{33}, a_{86}\} \times \{b_6, b_{12}, b_{36}, b_{99}\}$ in Figure 11.5 is a bi-cluster with constant values.
• A bi-cluster is interesting if each row has a constant value, though different rows may have different values. A bi-cluster with constant values on rows is a submatrix $I \times J$ such that for any $i \in I$ and $j \in J$, $e_{ij} = c + \alpha_i$, where $\alpha_i$ is the adjustment for row i. For example, Figure 11.6 shows a bi-cluster with constant values on rows. Symmetrically, a bi-cluster with constant values on columns is a submatrix $I \times J$ such that for any $i \in I$ and $j \in J$, $e_{ij} = c + \beta_j$, where $\beta_j$ is the adjustment for column j.
• More generally, a bi-cluster is interesting if the rows change in a synchronized way with respect to the columns and vice versa. Mathematically, a bi-cluster with coherent values (also known as a pattern-based cluster) is a submatrix $I \times J$ such that for any $i \in I$ and $j \in J$, $e_{ij} = c + \alpha_i + \beta_j$, where $\alpha_i$ and $\beta_j$ are the adjustment for row i and column j, respectively. For example, Figure 11.7 shows a bi-cluster with coherent values. It can be shown that $I \times J$ is a bi-cluster with coherent values if and only if for any $i_1, i_2 \in I$ and $j_1, j_2 \in J$, $e_{i_1 j_1} - e_{i_2 j_1} = e_{i_1 j_2} - e_{i_2 j_2}$. Moreover, instead of using addition, we can define bi-clusters with coherent values using multiplication, that is, $e_{ij} = c \cdot \alpha_i \cdot \beta_j$. Clearly, bi-clusters with constant values on rows or columns are special cases of bi-clusters with coherent values.
• In some applications, we may only be interested in the up- or down-regulated changes across genes or conditions without constraining the exact values. A bi-cluster with coherent evolutions on rows is a submatrix $I \times J$ such that for any $i_1, i_2 \in I$ and $j_1, j_2 \in J$, $(e_{i_1 j_1} - e_{i_1 j_2})(e_{i_2 j_1} - e_{i_2 j_2}) \ge 0$. For example, Figure 11.8 shows a bi-cluster with coherent evolutions on rows. Symmetrically, we can define bi-clusters with coherent evolutions on columns.

Next, we study how to mine bi-clusters.

Bi-clustering Methods
The above specification of the types of bi-clusters only considers ideal cases. In real data sets, such perfect bi-clusters rarely exist. When they do exist, they are usually very small. Instead, random noise can affect the readings of $e_{ij}$ and thus prevent a bi-cluster in nature from appearing in a perfect shape.

There are two major types of methods for discovering bi-clusters in data that may come with noise. Optimization-based methods conduct an iterative search. At each iteration, the submatrix with the highest significance score is identified as a bi-cluster. The process terminates when a user-specified condition is met. Due to cost concerns in computation, greedy search is often employed to find local optimal bi-clusters.
Enumeration methods use a tolerance threshold to specify the degree of noise allowed in the bi-clusters to be mined, and then try to enumerate all submatrices of bi-clusters that satisfy the requirements. We use the δ-Cluster and MaPle algorithms as examples to illustrate these ideas.

Optimization Using the δ-Cluster Algorithm
For a submatrix, I × J, the mean of the i-th row is

Differences from Subspace Clustering
• Subspace clustering uses global distance/similarity measure
• Pattern-based clustering looks at patterns
• A subspace cluster according to a globally defined similarity measure may not follow the same pattern

Pattern-based Clusters
• pScore: the similarity between two objects $r_x$, $r_y$ on two attributes $a_u$, $a_v$
  $pScore\left(\begin{bmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{bmatrix}\right) = \left| (r_x.a_u - r_y.a_u) - (r_x.a_v - r_y.a_v) \right|$
• δ-pCluster (R, D): for any objects $r_x, r_y \in R$ and any attributes $a_u, a_v \in D$,
  $pScore\left(\begin{bmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{bmatrix}\right) \le \delta \quad (\delta \ge 0)$

Maximal pClusters
• If (R, D) is a δ-pCluster, then every sub-cluster (R', D') is a δ-pCluster, where R' ⊆ R and D' ⊆ D
  – An anti-monotonic property
  – A large pCluster is accompanied with many small pClusters! Inefficacious
• Idea: mining only the maximal pClusters!
  – A δ-pCluster is maximal if there exists no proper super cluster as a δ-pCluster

Maximal pCluster Mining
• Given
  – A cluster threshold δ
  – An attribute threshold $min_a$
  – An object threshold $min_o$
• Task: mine the complete set of significant maximal δ-pClusters
  – A significant δ-pCluster has at least $min_o$ objects on at least $min_a$ attributes
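The pScore and δ-pCluster definitions above translate directly into a brute-force check. The sketch below is illustrative only: the names `pscore` and `is_delta_pcluster` and the matrix `DATA` are hypothetical, and the exhaustive pairwise test is not the MaPle algorithm, which prunes the search instead of enumerating every pair. The first four rows of `DATA` reuse the values of the coherent-values figure; the fifth row is added to break the pattern.

```python
from itertools import combinations


def pscore(rx_u, rx_v, ry_u, ry_v):
    """pScore of objects rx, ry on attributes au, av:
    |(rx.au - ry.au) - (rx.av - ry.av)|."""
    return abs((rx_u - ry_u) - (rx_v - ry_v))


def is_delta_pcluster(data, objects, attributes, delta):
    """True if every pair of objects on every pair of attributes has pScore <= delta."""
    for x, y in combinations(objects, 2):
        for u, v in combinations(attributes, 2):
            if pscore(data[x][u], data[x][v], data[y][u], data[y][v]) > delta:
                return False
    return True


# Rows are objects, columns are attributes (illustrative values).
DATA = [
    [10, 50, 30, 70, 20],
    [20, 60, 40, 80, 30],
    [50, 90, 70, 110, 60],
    [0, 40, 20, 60, 10],
    [20, 100, 50, 1000, 30],  # does not follow the pattern of the rows above
]

coherent = is_delta_pcluster(DATA, [0, 1, 2, 3], range(5), delta=0)
with_extra_row = is_delta_pcluster(DATA, [0, 1, 2, 3, 4], range(5), delta=0)
```

With δ = 0 the check reduces to the coherent-values condition, so the first four rows pass and adding the fifth row fails. The anti-monotonic property is visible here as well: any subset of the objects and attributes of a passing cluster also passes, because it only tests a subset of the same pairs.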
pClusters and Frequent Itemsets
• A transaction database can be modeled as a binary matrix
• Frequent itemset: a sub-matrix of all 1's
  – 0-pCluster on binary data
  – $min_o$: support threshold
  – $min_a$: no less than $min_a$ attributes
  – Maximal pClusters: closed itemsets
• Frequent itemset mining algorithms cannot be extended straightforwardly for mining pClusters on numeric data

Where Should We Start from?
• How about the pClusters having only 2 objects or 2 attributes?
  – MDS (maximal dimension set)
  – A pCluster must have at least 2 objects and 2 attributes
• Finding MDSs

  Attribute   a    b    c    d    e    f    g    h
  x          13   11    9    7    9   13    2   15
  y           7    4   10    1   12    3    4    7
  x-y         6    7   -1    6   -3   10   -2    8

How to Assemble Larger pClusters?
• Systematically enumerate every combination of attributes D
  – For each attribute subset, find the maximal subsets of objects R s.t. (R, D) is a pCluster
  – Check whether (R, D) is maximal
• Why attribute-first-object-later?
  – # of objects >> # attributes
• Algorithm MaPle (Pei et al, 2003)
• Prune search branches as early as possible

More Pruning Techniques
• Only possible attributes should be considered to get larger pClusters
• Pruning local maximal pClusters having insufficient possible attributes
• Extracting common attributes from possible attribute set directly
• Prune non-maximal pClusters

Gene-Sample-Time Series Data
[Figure: gene by sample by time data (gene1, gene2; sample1, sample2; time1, time2) with its Sample-Time Matrix, Gene-Sample Matrix, and Gene-Time Matrix; each entry is the expression level of gene i on sample j at time k]

Mining GST Microarray Data
• Reduce the gene-sample-time series data to gene-sample data
  – Use Pearson's correlation coefficient as the coherence measure

Basic Approaches
• Sample-gene search
  – Enumerate the subsets of samples systematically
  – For each subset of samples, find the genes that are coherent on the samples
• Gene-sample search
  – Enumerate the subsets of genes systematically
  – For each subset of genes, find the samples on which the genes are coherent

Basic Tools
• Set enumeration tree
• Sample-gene search and gene-sample search are not symmetric!
  – Many genes, but a few samples
  – No requirement on samples coherent on genes

Phenotypes and Informative Genes
[Figure: expression matrix of gene1 to gene7 over samples 1 to 7, marking informative genes and noninformative genes]

The Phenotype Mining Problem
• Input: a microarray matrix and k
• Output: phenotypes and informative genes
  – Partitioning the samples into k exclusive subsets (phenotypes)
  – Informative genes discriminating the phenotypes
• Machine learning methods
  – Heuristic search
  – Mutual reinforcing adjustment

Requirements
• The expression levels of each informative gene should be similar over the samples within each phenotype
• The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes

Intra-phenotype Consistency
• In a subset of genes (candidate informative genes), does every gene have good consistency on a set of samples?
• Average of variance of the subset of genes: the smaller the intra-phenotype consistency, the better
  $Con(G', S') = \frac{1}{|G'| \cdot (|S'| - 1)} \sum_{g_i \in G'} \sum_{s_j \in S'} (w_{i,j} - \bar{w}_{i,S'})^2$

Inter-phenotype Divergence
• How can a subset of genes (candidate informative genes) discriminate two phenotypes of samples?
• Sum of the average difference between the phenotypes: the larger the inter-phenotype divergence, the better
  $Div(G', S_1, S_2) = \frac{\sum_{g_i \in G'} \left| \bar{w}_{i,S_1} - \bar{w}_{i,S_2} \right|}{|G'|}$

Quality of Phenotypes and Informative Genes
  $\Omega = \frac{1}{\sum_{S_i, S_j\ (1 \le i, j \le K;\ i \ne j)} \frac{Con(G', S_i) + Con(G', S_j)}{Div(G', S_i, S_j)}}$
• The higher the value, the better the quality

Heuristic Search
• Start from a random subset of genes and an arbitrary partition of the samples
• Iteratively adjust the partition and the gene set toward a better solution
  – For each possible adjustment, compute ΔΩ
  – ΔΩ > 0 → conduct the adjustment
  – ΔΩ < 0 → conduct the adjustment with probability $e^{\frac{\Delta\Omega}{\Omega \cdot T(i)}}$
• T(i) is a decreasing simulated annealing function and i is the iteration number. T(i) = 1/(i+1) in our implementation

Possible Adjustments
• For each gene, try possible insert/remove
• For each sample, try the best movement
[Figure: insert a gene; remove a gene; move a sample]

Disadvantages of Heuristic Search
• Samples and genes are examined and adjusted with equal chances
  – # samples << # genes
  – Samples should play more important roles
• Outliers in the samples should be handled specifically
  – Outliers highly interfere with the quality and the adjustment decisions

Mutual Reinforcing Adjustment
• A two-phase approach
  – Iteration phase
  – Refinement phase
• Mutual reinforcement
  – Use gene partition to improve the sample partition
  – Use the sample partition to improve the gene partition

Dimensionality Reduction
• Clustering a high dimensional data set is challenging

Variance and Covariance
• Given a set of 1-d points, how different are those points?
n – Distance between two points could be dominated by noise n • Dimensionality reduction: choosing the informative dimensions for clustering analysis 2 s= ∑(X i − X )2 i =1 n −1 2 i i =1 n −1 • Given a set of 2-d points, are the two dimensions correlated? – Feature selection: choosing a subset of existing dimensions – Feature construction: construct a new (small) set of informative attributes Jian Pei: Big Data Analytics -- Clustering – Standard deviation: (X − X ) – Variance: s = ∑ n – Covariance: 111 Principal Components cov( X , Y ) = ∑(X i − X )(Yi − Y ) i =1 n −1 Jian Pei: Big Data Analytics -- Clustering 112 Step 1: Mean Subtraction • Subtract the mean from each dimension for each data point • Intuition: centralizing the data set Art work and example from http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf Jian Pei: Big Data Analytics -- Clustering 113 Jian Pei: Big Data Analytics -- Clustering 114 19 Step 2: Covariance Matrix ⎛ cov( D1 , D1 ) cov( D1 , D2 ) ⎜ ⎜ cov( D2 , D1 ) cov( D2 , D2 ) C = ⎜ " " ⎜ ⎜ cov( D , D ) cov( D , D ) n 1 n 2 ⎝ Step 3: Eigenvectors and Eigenvalues • Compute the eigenvectors and the eigenvalues of the covariance matrix ! cov( D1 , Dn ) ⎞ ⎟ ! cov( D2 , Dn ) ⎟ ⎟ # " ⎟ ! 
cov( Dn , Dn ) ⎟⎠ Jian Pei: Big Data Analytics -- Clustering – Intuition: find those direction invariant vectors as candidates of new attributes – Eigenvalues indicate how much the direction invariant vectors are scaled – the larger the better for manifest the data variance 115 Step 4: Forming New Features Jian Pei: Big Data Analytics -- Clustering 116 New Features • Choose the principal components and forme new features NewData = RowFeatureVector x RowDataAdjust – Typically, choose the top-k components The first principal component is used Jian Pei: Big Data Analytics -- Clustering 117 Clustering in Derived Space Jian Pei: Big Data Analytics -- Clustering Spectral Clustering Y Data Affinity matrix [ Wij ] A = f(W) - 0.707x + 0.707y O Jian Pei: Big Data Analytics -- Clustering 118 Computing the leading k eigenvectors of A Clustering in the new space Projecting back to cluster the original data Av = \lamda v X 119 Jian Pei: Big Data Analytics -- Clustering 120 20 Affinity Matrix Clustering • Using a distance measure Wij = e • In the Ng-Jordan-Weiss algorithm, we define a diagonal matrix such that n dist(oi ,oj ) w where σ is a scaling parameter controling how fast the affinity Wij decreases as the distance increases • In the Ng-Jordan-Weiss algorithm, Wii is set to 0 Jian Pei: Big Data Analytics -- Clustering Dii = X Wij j=1 1 1 • Then, A = D 2 W D 2 • Use the k leading eigenvectors to form a new space • Map the original data to the new space and conduct clustering 121 Jian Pei: Big Data Analytics -- Clustering Is a Clustering Good? Major Tasks • Feasibility • Assessing clustering tendency – Applying any clustering methods on a uniformly distributed data set is meaningless – Are there non-random structures in the data? • Determining the number of clusters or other critical parameters • Measuring clustering quality • Quality – Are the clustering results meeting users’ interest? 
– Clustering patients into clusters corresponding various disease or sub-phenotypes is meaningful – Clustering patients into clusters corresponding to male or female is not meaningful Jian Pei: Big Data Analytics -- Clustering 122 123 Uniformly Distributed Data Jian Pei: Big Data Analytics -- Clustering 124 Hopkins Statistic • Clustering uniformly distributed data is • Hypothesis: the data is generated by a meaningless uniform distribution in a space • A uniformly distributed data set is generated • Sample n points, p1, …, pn, uniformly from 504CHAPTER 10. CLUSTER ANALYSIS: BASIC CONCEPTS AND METHODS by a uniform data distribution the space of D • For each point pi, find the nearest neighbor of pi in D, let xi be the distance between pi and its nearest neighbor in D xi = min{dist(pi , v)} v2D Jian Pei: Big Data Analytics -- Clustering Jian Pei: Big Data Analytics -- Clustering Figure 10.21: A data set that is uniformly 125 distributed in the data space. • Measuring clustering quality. After applying a clustering method on a data set, we want to assess how good the resulting clusters are. A number of measures can be used. Some methods measure how well the clusters fit the data set, while others measure how well the clusters match the ground truth, if such truth is available. There are also measures that score clusterings and thus can compare two sets of clustering results on the same data set. In the rest of this section, we discuss each of the above three topics. 10.6.1 Assessing Clustering Tendency Clustering tendency assessment determines whether a given data set has a nonrandom structure, which may lead to meaningful clusters. 
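A minimal sketch of the Hopkins statistic test for clustering tendency, assuming NumPy is available. It draws n points $p_i$ uniformly from the data space and n points $q_i$ from D itself, takes their nearest-neighbor distances $x_i$ and $y_i$ as in the Hopkins Statistic slides, and returns $H = \sum y_i / (\sum x_i + \sum y_i)$. The function name `hopkins_statistic`, its parameters, the toy data, and the choice of the bounding box of D as "the space of D" are illustrative assumptions, not part of the slides.

```python
import numpy as np


def hopkins_statistic(data, n_samples, rng):
    """Return H = sum(y) / (sum(x) + sum(y)) for a data set D.

    Assumption: "the space of D" is approximated by the axis-aligned
    bounding box of D, and distances are Euclidean.
    """
    data = np.asarray(data, dtype=float)
    low, high = data.min(axis=0), data.max(axis=0)

    # p_1..p_n: points drawn uniformly from the (assumed) data space.
    p = rng.uniform(low, high, size=(n_samples, data.shape[1]))
    # x_i: distance from p_i to its nearest neighbor in D.
    x = np.linalg.norm(p[:, None, :] - data[None, :, :], axis=2).min(axis=1)

    # q_1..q_n: points drawn from D itself, without replacement.
    idx = rng.choice(len(data), size=n_samples, replace=False)
    d = np.linalg.norm(data[idx][:, None, :] - data[None, :, :], axis=2)
    d[np.arange(n_samples), idx] = np.inf  # exclude q_i itself: D - {q_i}
    # y_i: distance from q_i to its nearest neighbor in D - {q_i}.
    y = d.min(axis=1)

    return y.sum() / (x.sum() + y.sum())


rng = np.random.default_rng(0)
uniform = rng.uniform(0.0, 1.0, size=(500, 2))
blobs = np.vstack([rng.normal(0.0, 0.05, size=(250, 2)),
                   rng.normal(10.0, 0.05, size=(250, 2))])
h_uniform = hopkins_statistic(uniform, 50, rng)
h_blobs = hopkins_statistic(blobs, 50, rng)
print(h_uniform, h_blobs)
```

Under this definition, H should come out near 0.5 on the uniform sample and close to 0 on the two tight blobs, so a small H is the signal of clustering tendency.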
Consider a data set that does not have any non-random structure, such as a set of uniformly distributed points.

Hopkins Statistic
• Sample n points, $q_1, \ldots, q_n$, uniformly from D
• For each $q_i$, find the nearest neighbor of $q_i$ in $D - \{q_i\}$; let $y_i$ be the distance between $q_i$ and its nearest neighbor in $D - \{q_i\}$
$$y_i = \min_{v \in D,\ v \ne q_i} \{\mathrm{dist}(q_i, v)\}$$
• Calculate the Hopkins statistic H
$$H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$$

Explanation
• If D is uniformly distributed, then $\sum_{i=1}^{n} x_i$ and $\sum_{i=1}^{n} y_i$ would be close to each other, and thus H would be around 0.5
• If D is skewed, then $\sum_{i=1}^{n} y_i$ would be substantially smaller, and thus H would be close to 0
• If H > 0.5, then it is unlikely that D has statistically significant clusters

Finding the Number of Clusters
• Depends on many factors
  – The shape and scale of the distribution in the data set
  – The clustering resolution required by the user
• Many methods exist
  – Set $k = \sqrt{n/2}$, so that each cluster has $\sqrt{2n}$ points on average
  – Plot the sum of within-cluster variances with respect to k and find the first (or the most significant) turning point

A Cross-Validation Method
• Divide the data set D into m parts
• Use m − 1 parts to find a clustering
• Use the remaining part as the test set to test the quality of the clustering
  – For each point in the test set, find the closest centroid or cluster center
  – Use the squared distances between all points in the test set and the corresponding centroids to measure how well the clustering model fits the test set
• Repeat m times for each value of k, and use the average as the quality measure

Measuring Clustering Quality
• Ground truth: the ideal clustering determined by human experts
• Two situations
  – There is a known ground truth: the extrinsic (supervised) methods, comparing the clustering against the ground truth
  – The ground truth is unavailable: the intrinsic (unsupervised) methods, measuring how well the clusters are separated

Quality in Extrinsic Methods
• Cluster homogeneity: the purer the clusters in a clustering are, the better the clustering
• Cluster completeness: objects in the same cluster in the ground truth should be clustered together
• Rag bag: putting a heterogeneous object into a pure cluster is worse than putting it into a rag bag
• Small cluster preservation: splitting a small cluster in the ground truth into pieces is worse than splitting a bigger one

Many clustering quality measures satisfy some of the above four criteria. Here, we introduce the BCubed precision and recall metrics, which satisfy all of the above criteria.

BCubed evaluates the precision and recall for every object in a clustering on a given data set according to the ground truth. The precision of an object indicates how many other objects in the same cluster belong to the same category as the object. The recall of an object reflects how many objects of the same category are assigned to the same cluster.

Formally, let $D = \{o_1, \ldots, o_n\}$ be a set of objects, and C be a clustering on D. Let $L(o_i)$ ($1 \le i \le n$) be the category of $o_i$ given by ground truth, and $C(o_i)$ be the cluster ID of $o_i$ in C. Then, for two objects, $o_i$ and $o_j$ ($1 \le i, j \le n$, $i \ne j$), the correctness of the relation between $o_i$ and $o_j$ in clustering C is given by
$$\mathrm{Correctness}(o_i, o_j) = \begin{cases} 1 & \text{if } L(o_i) = L(o_j) \Leftrightarrow C(o_i) = C(o_j) \\ 0 & \text{otherwise.} \end{cases} \qquad (10.28)$$
BCubed precision is defined as
$$\text{Precision BCubed} = \frac{\displaystyle\sum_{i=1}^{n} \frac{\sum_{o_j:\, i \ne j,\ C(o_i) = C(o_j)} \mathrm{Correctness}(o_i, o_j)}{\|\{o_j \mid i \ne j,\ C(o_i) = C(o_j)\}\|}}{n}. \qquad (10.29)$$
BCubed recall is defined as
$$\text{Recall BCubed} = \frac{\displaystyle\sum_{i=1}^{n} \frac{\sum_{o_j:\, i \ne j,\ L(o_i) = L(o_j)} \mathrm{Correctness}(o_i, o_j)}{\|\{o_j \mid i \ne j,\ L(o_i) = L(o_j)\}\|}}{n}. \qquad (10.30)$$

BCubed Precision and Recall
• $D = \{o_1, \ldots, o_n\}$
  – $L(o_i)$ is the cluster of $o_i$ given by the ground truth
• C is a clustering on D
  – $C(o_i)$ is the cluster ID of $o_i$ in C
• For two objects $o_i$ and $o_j$, the correctness is 1 if $L(o_i) = L(o_j) \Leftrightarrow C(o_i) = C(o_j)$, and 0 otherwise
• Precision: Eq. (10.29)
• Recall: Eq. (10.30)

Silhouette Coefficient
• No ground truth is assumed
• Suppose a data set D of n objects is partitioned into k clusters, $C_1, \ldots, C_k$
• For each object o,
  – Calculate a(o), the average distance between o and every other object in the same cluster: the compactness of a cluster; the smaller, the better
  – Calculate b(o), the minimum average distance from o to every object in a cluster that o does not belong to: the degree of separation from other clusters; the larger, the better

Intrinsic Methods
When the ground truth of a data set is not available, we have to use an intrinsic method to assess the clustering quality. In general, intrinsic methods evaluate a clustering by examining how well the clusters are separated and how compact the clusters are. Many intrinsic methods take advantage of a similarity metric between objects in the data set.

The silhouette coefficient is such a measure. For a data set D of n objects, suppose D is partitioned into k clusters, $C_1, \ldots, C_k$. For each object $o \in D$, we calculate a(o) as the average distance between o and all other objects in the cluster to which o belongs. Similarly, b(o) is the minimum average distance from o to all clusters to which o does not belong.
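The BCubed precision and recall of Eqs. (10.28) to (10.30) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function name `bcubed`, the list-based inputs, and the toy labels are hypothetical, and an object that has no other object in its cluster (or in its category) is assumed to contribute 1.0, because the ratio in the definition would otherwise be 0/0.

```python
def bcubed(truth, clusters):
    """Return (precision, recall) following Eqs. (10.28)-(10.30).

    truth[i] plays the role of L(o_i); clusters[i] plays the role of C(o_i).
    Assumption: an object with an empty comparison set contributes 1.0.
    """
    n = len(truth)
    precision_sum = recall_sum = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n)
                        if j != i and clusters[j] == clusters[i]]
        same_category = [j for j in range(n)
                         if j != i and truth[j] == truth[i]]
        # Inside o_i's cluster, Correctness is 1 exactly when categories match.
        hits = sum(truth[j] == truth[i] for j in same_cluster)
        precision_sum += hits / len(same_cluster) if same_cluster else 1.0
        # Inside o_i's category, Correctness is 1 exactly when cluster IDs match.
        hits = sum(clusters[j] == clusters[i] for j in same_category)
        recall_sum += hits / len(same_category) if same_category else 1.0
    return precision_sum / n, recall_sum / n


perfect = bcubed(["a", "a", "b", "b"], [0, 0, 1, 1])
mixed = bcubed(["a", "a", "b", "b"], [0, 0, 0, 1])
print(perfect)  # (1.0, 1.0)
print(mixed)    # (0.5, 0.5)
```

In the second call one object of category "b" is placed into the cluster of the "a" objects, which lowers both the precision (the cluster is no longer pure) and the recall (category "b" is split).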
Formally, suppose $o \in C_i$ ($1 \le i \le k$). Then
$$a(o) = \frac{\sum_{o' \in C_i,\ o' \ne o} \mathrm{dist}(o, o')}{|C_i| - 1} \qquad (10.31)$$
and
$$b(o) = \min_{C_j:\ 1 \le j \le k,\ j \ne i} \left\{ \frac{\sum_{o' \in C_j} \mathrm{dist}(o, o')}{|C_j|} \right\}. \qquad (10.32)$$
The silhouette coefficient of o is then defined as
$$s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}. \qquad (10.33)$$
• Use the average silhouette coefficient of all objects as the overall measure

The value of the silhouette coefficient is between −1 and 1. The value of a(o) reflects the compactness of the cluster to which o belongs. The smaller the value is, the more compact the cluster is. The value of b(o) captures the degree to which o is separated from other clusters. The larger b(o) is, the more separated o is from other clusters. Therefore, when the silhouette coefficient value of o approaches 1, the cluster containing o is compact and o is far away from other clusters, which is the preferable case. However, when the silhouette coefficient value is negative (that is, b(o) < a(o)), this means that, in expectation, o is closer to the objects in another cluster than to the objects in the same cluster as o. In many cases, this is a bad situation and should be avoided.

To measure the fitness of a cluster within a clustering, we can compute the average silhouette coefficient value of all objects in the cluster. To measure the quality of a clustering, we can use the average silhouette coefficient value of all objects in the data set. The silhouette coefficient and other intrinsic measures …

Multi-Clustering
• A data set may be clustered in different ways
  – In different subspaces, that is, using different attributes
  – Using different similarity measures
  – Using different clustering methods
• Some different clusterings may capture different meanings of categorization
  – Orthogonal clusterings
• Putting users in the loop
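The silhouette coefficient of Eqs. (10.31) to (10.33) can be sketched with NumPy as below. The function name `silhouette_values`, the use of Euclidean distance, the toy points, and the convention that an object alone in its cluster gets s(o) = 0 are assumptions made for this illustration; the text itself only requires some distance measure and at least two clusters.

```python
import numpy as np


def silhouette_values(points, labels):
    """Return s(o) for every object o, following Eqs. (10.31)-(10.33).

    Assumptions: Euclidean distance, at least two clusters, and s(o) = 0
    for an object that is alone in its cluster (a(o) is undefined there).
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    cluster_ids = np.unique(labels)
    s = np.zeros(len(points))
    for o in range(len(points)):
        own = labels == labels[o]
        if own.sum() == 1:
            continue  # singleton cluster: keep s(o) = 0 by assumption
        # a(o): average distance to the other objects of o's own cluster.
        a = dist[o, own].sum() / (own.sum() - 1)
        # b(o): smallest average distance to any cluster o is not in.
        b = min(dist[o, labels == c].mean()
                for c in cluster_ids if c != labels[o])
        s[o] = (b - a) / max(a, b)
    return s


points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_values(points, [0, 0, 1, 1])
bad = silhouette_values(points, [0, 1, 0, 1])
print(good.mean(), bad.mean())
```

With the natural grouping {0, 1} and {10, 11} every s(o) is close to 1, while the interleaved labeling gives negative values, which mirrors the interpretation that a negative coefficient means an object sits closer to another cluster than to its own.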
© Copyright 2025