
What Is Clustering?
•  Clustering: group data into clusters
–  Similar to one another within the same cluster
–  Dissimilar to the objects in other clusters
–  Unsupervised learning: no predefined classes
[Figure: two clusters (Cluster 1 and Cluster 2) plus a few outliers]
Similarity and Dissimilarity
•  Distance measures are commonly used to quantify dissimilarity
•  Minkowski distance: a generalization
   d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q} \quad (q > 0)
•  If q = 2, d is the Euclidean distance
•  If q = 1, d is the Manhattan distance
•  If q = ∞, d is the Chebyshev distance
•  Weighted distance
   d(i, j) = \sqrt[q]{w_1 |x_{i1} - x_{j1}|^q + w_2 |x_{i2} - x_{j2}|^q + \cdots + w_p |x_{ip} - x_{jp}|^q} \quad (q > 0)

Manhattan and Chebyshev Distance
•  In two dimensions, the Chebyshev distance is the chessboard ("chess") distance
[Figure: illustrations of Manhattan distance and Chebyshev distance; picture from Wikipedia]
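To make the definition concrete, here is a minimal Python/NumPy sketch of the Minkowski distance and its special cases (the function name, weight handling, and test points are ours, not from the slides):

```python
import numpy as np

def minkowski(x, y, q=2, w=None):
    """Minkowski distance d(x, y) = (sum_i w_i |x_i - y_i|^q)^(1/q).
    q=1 gives Manhattan, q=2 Euclidean, q=np.inf Chebyshev."""
    diff = np.abs(np.asarray(x, float) - np.asarray(y, float))
    if np.isinf(q):
        return float(diff.max())               # Chebyshev: the limit as q -> infinity
    w = np.ones_like(diff) if w is None else np.asarray(w, float)
    return float(np.sum(w * diff ** q) ** (1.0 / q))

# the three special cases on the same pair of points
p, r = [1, 2, 3], [4, 0, 3]
print(minkowski(p, r, 1), minkowski(p, r, 2), minkowski(p, r, np.inf))
```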
Properties of Minkowski Distance
•  Nonnegative: d(i, j) ≥ 0
•  The distance of an object to itself is 0
–  d(i, i) = 0
•  Symmetric: d(i, j) = d(j, i)
•  Triangle inequality
–  d(i, j) ≤ d(i, k) + d(k, j)

Clustering Methods
•  K-means and partitioning methods
•  Hierarchical clustering
•  Density-based clustering
•  Grid-based clustering
•  Pattern-based clustering
•  Other clustering methods
Partitioning Algorithms: Ideas
•  Partition n objects into k clusters
–  Optimize the chosen partitioning criterion
•  Global optimum: examine all possible partitions
–  k^n − (k−1)^n − … − 1 possible partitions, too expensive!
•  Heuristic methods: k-means and k-medoids
–  K-means: a cluster is represented by its center
–  K-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster

K-means
•  Arbitrarily choose k objects as the initial cluster centers
•  Until no change, do
–  (Re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster
–  Update the cluster means, i.e., calculate the mean value of the objects in each cluster
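The assign/update loop above can be written in a few lines. A minimal NumPy sketch (random initial centers drawn from the data; the function name and toy data are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd-style k-means: (re)assign points to the nearest mean, then update means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # arbitrary initial centers
    for _ in range(max_iter):
        # assignment step: nearest center for each object
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # update step: recompute each cluster mean (keep the old center if a cluster empties)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # no change -> stop
            break
        centers = new_centers
    return labels, centers

# toy data: two well-separated blobs
X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(8, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```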
K-Means: Example
[Figure: with K = 2, arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, then reassign and update repeatedly until no change]

Pros and Cons of K-means
•  Relatively efficient: O(tkn)
–  n: # objects, k: # clusters, t: # iterations; k, t << n
•  Often terminates at a local optimum
•  Applicable only when the mean is defined
–  What about categorical data?
•  Need to specify the number of clusters
•  Unable to handle noisy data and outliers
•  Unsuitable to discover non-convex clusters
Variations of the K-means
•  Aspects of variations
–  Selection of the initial k means
–  Dissimilarity calculations
–  Strategies to calculate cluster means
•  Handling categorical data: k-modes
–  Use mode instead of mean
•  Mode: the most frequent item(s)
–  A mixture of categorical and numerical data: the k-prototype method
•  EM (expectation maximization): assign a probability of an object to a cluster

A Problem of K-means
•  Sensitive to outliers
–  Outlier: objects with extremely large values
•  May substantially distort the distribution of the data
•  K-medoids: use the most centrally located object in a cluster
[Figure: an outlier pulls the cluster mean away from the bulk of the cluster]
PAM: A K-medoids Method
•  PAM: Partitioning Around Medoids
•  Arbitrarily choose k objects as the initial medoids
•  Until no change, do
–  (Re)assign each object to the cluster of its nearest medoid
–  Randomly select a non-medoid object o', and compute the total cost, S, of swapping medoid o with o'
–  If S < 0, then swap o with o' to form the new set of k medoids

Swapping Cost
•  Measure whether o' is better than o as a medoid
•  Use the squared-error criterion
   E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, o_i)^2
–  Compute E_{o'} − E_o
–  Negative: swapping brings benefit
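The swapping test amounts to comparing the squared error E of the current medoid set with the squared error of the candidate set where o is replaced by o'. A hedged sketch (our own helper names, not a full PAM implementation):

```python
import numpy as np

def squared_error(X, medoid_idx):
    """E = sum over all objects of the squared distance to the nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

def swap_cost(X, medoids, o, o_prime):
    """S = E_{o'} - E_o; a negative value means swapping medoid o with non-medoid o' helps."""
    new_medoids = [o_prime if m == o else m for m in medoids]
    return squared_error(X, new_medoids) - squared_error(X, medoids)

X = np.random.default_rng(0).normal(size=(30, 2))
medoids = [0, 1]
print(swap_cost(X, medoids, o=1, o_prime=7))
```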
PAM: Example
[Figure: with K = 2, arbitrarily choose k objects as the initial medoids (total cost = 20), assign each remaining object to the nearest medoid, randomly select a non-medoid object O_random, compute the total cost of swapping (total cost = 26), swap O and O_random if the quality is improved, and loop until no change]

Pros and Cons of PAM
•  PAM is more robust than k-means in the presence of noise and outliers
–  Medoids are less influenced by outliers
•  PAM is efficient for small data sets but does not scale well for large data sets
–  O(k(n−k)^2) for each iteration
•  Sampling-based method: CLARA
CLARA
•  CLARA: Clustering LARge Applications (Kaufmann and Rousseeuw, 1990)
–  Built into statistical analysis packages, such as S+
•  Draw multiple samples of the data set, apply PAM on each sample, and return the best clustering
•  Performs better than PAM on larger data sets
•  Efficiency depends on the sample size
–  A good clustering on the samples may not be a good clustering of the whole data set

CLARANS
•  Clustering large applications based upon randomized search
•  The problem space graph of clustering
–  A vertex is a choice of k medoids from the n objects, so there are \binom{n}{k} vertices in total
–  PAM searches the whole graph
–  CLARA searches some random sub-graphs
•  CLARANS climbs hills
–  Randomly sample a set and select k medoids
–  Consider neighbors of the medoids as candidates for new medoids
–  Use the sample set to verify
–  Repeat multiple times to avoid bad samples
Hierarchy
•  An arrangement or classification of things according to inclusiveness
•  A natural way of abstraction, summarization, compression, and simplification for understanding
•  Typical setting: organize a given set of objects into a hierarchy
–  No or very little supervision
–  Some heuristic guidance on the quality of the hierarchy

Hierarchical Clustering
•  Group data objects into a tree of clusters
•  Top-down versus bottom-up
[Figure: objects a, b, c, d, e merged step by step into ab, de, cde, and abcde by agglomerative clustering (AGNES, steps 0-4), and split in the reverse order by divisive clustering (DIANA, steps 4-0)]
AGNES (Agglomerative Nesting)
•  Initially, each object is a cluster
•  Step-by-step cluster merging, until all objects form one cluster
–  Single-link approach
–  Each cluster is represented by all of the objects in the cluster
–  The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters

Dendrogram
•  Shows how clusters are merged hierarchically
•  Decompose data objects into a multilevel nested partitioning (a tree of clusters)
•  A clustering of the data objects: cut the dendrogram at the desired level
–  Each connected component forms a cluster
DIANA (Divisive ANAlysis)
•  Initially, all objects are in one cluster
•  Step-by-step splitting of clusters until each cluster contains only one object
[Figure: a data set split step by step into smaller and smaller clusters]

Distance Measures
•  Minimum distance: d_{min}(C_i, C_j) = \min_{p \in C_i, q \in C_j} d(p, q)
•  Maximum distance: d_{max}(C_i, C_j) = \max_{p \in C_i, q \in C_j} d(p, q)
•  Mean distance: d_{mean}(C_i, C_j) = d(m_i, m_j)
•  Average distance: d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)
•  m: the mean of a cluster; C: a cluster; n: the number of objects in a cluster
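All four inter-cluster distances can be read off a pairwise distance matrix. A small NumPy sketch (our own helper name and toy clusters):

```python
import numpy as np

def cluster_distances(Ci, Cj):
    """Minimum, maximum, mean, and average distances between clusters Ci and Cj."""
    Ci, Cj = np.asarray(Ci, float), np.asarray(Cj, float)
    pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)
    return {
        "d_min": float(pairwise.min()),                                  # single link
        "d_max": float(pairwise.max()),                                  # complete link
        "d_mean": float(np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))),
        "d_avg": float(pairwise.mean()),                                 # (1 / n_i n_j) * sum of all pairs
    }

A = [[0, 0], [1, 0], [0, 1]]
B = [[5, 5], [6, 5]]
print(cluster_distances(A, B))
```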
Challenges
•  Hard to choose merge/split points
–  Never undo merging/splitting
–  Merging/splitting decisions are critical
•  High complexity: O(n^2)
•  Integrating hierarchical clustering with other techniques
–  BIRCH, CURE, CHAMELEON, ROCK

BIRCH
•  Balanced Iterative Reducing and Clustering using Hierarchies
•  CF (Clustering Feature) tree: a hierarchical data structure summarizing object information
–  Clustering objects → clustering leaf nodes of the CF tree

Clustering Feature Vector
•  Clustering feature: CF = (N, LS, SS)
–  N: the number of data points
–  LS: \sum_{i=1}^{N} o_i (linear sum of the points)
–  SS: \sum_{i=1}^{N} o_i^2 (square sum of the points)
–  Summarizes the statistics for a cluster
–  Many cluster quality measures (e.g., radius, distance) can be derived
–  Additivity: CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)
•  Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190))

CF-tree in BIRCH
•  A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
–  A nonleaf node in the tree has descendants or "children"
–  The nonleaf nodes store the sums of the CFs of their children

CF Tree
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6; the root and non-leaf nodes hold entries CF_1, ..., CF_k with child pointers, and leaf nodes hold CF entries linked by prev/next pointers]

Parameters of a CF-tree
•  Branching factor: the maximum number of children
•  Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
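A clustering feature is just three additive statistics, so it can be kept as a tiny value object. A sketch of the additivity and one derived quantity (the class name and `centroid` helper are ours; this is not BIRCH itself):

```python
import numpy as np

class CF:
    """Clustering feature CF = (N, LS, SS) for a set of d-dimensional points."""
    def __init__(self, points):
        pts = np.asarray(points, float)
        self.N = len(pts)
        self.LS = pts.sum(axis=0)            # linear sum of the points
        self.SS = (pts ** 2).sum(axis=0)     # per-dimension square sum

    def merge(self, other):
        """Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)."""
        out = CF([])
        out.N, out.LS, out.SS = self.N + other.N, self.LS + other.LS, self.SS + other.SS
        return out

    def centroid(self):
        return self.LS / self.N

# the five points used in the slide example
cf = CF([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(cf.N, cf.LS, cf.SS)   # 5, (16, 30), (54, 190)
```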
BIRCH Clustering
•  Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
•  Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

Pros & Cons of BIRCH
•  Linear scalability
–  Good clustering with a single scan
–  Quality can be further improved by a few additional scans
•  Can handle only numeric data
•  Sensitive to the order of the data records
Drawbacks of Square Error Based Methods
•  One representative per cluster
–  Good only for convex-shaped clusters of similar size and density
•  K: the number of clusters as a parameter
–  Good only if k can be reasonably estimated

CURE: the Ideas
•  Each cluster has c representatives
–  Choose c well-scattered points in the cluster
–  Shrink them towards the mean of the cluster by a fraction of α
–  The representatives capture the physical shape and geometry of the cluster
•  Merge the closest two clusters
–  Distance of two clusters: the distance between the two closest representatives

CURE: The Algorithm
•  Draw a random sample S
•  Partition the sample into p partitions
•  Partially cluster each partition
•  Eliminate outliers
–  Random sampling + remove clusters growing too slowly
•  Cluster the partial clusters until only k clusters are left
–  Shrink the representatives of clusters towards the cluster center

Data Partitioning and Clustering
[Figure: a sample is partitioned, each partition is partially clustered, and the partial clusters are merged into the final clusters]
Shrinking Representative Points
•  Shrink the multiple representative points towards the gravity center by a fraction of α
•  Representatives capture the shape
[Figure: representative points of two clusters shrunk towards their cluster centers]

Clustering Categorical Data: ROCK
•  ROCK: Robust Clustering using linKs
–  # of common neighbors between two points
–  Use links to measure similarity/proximity
–  Not distance based
–  Complexity: O(n^2 + n m_m m_a + n^2 \log n)
•  Basic ideas
–  Similarity function and neighbors: Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}
–  Example: let T_1 = {1,2,3} and T_2 = {3,4,5}; then Sim(T_1, T_2) = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2
Limitations
•  Merging decisions based on static modeling
–  No special characteristics of clusters are considered
[Figure: CURE and BIRCH merge clusters C1 and C2, although C1' and C2' are more appropriate for merging]

Chameleon
•  Hierarchical clustering using dynamic modeling
•  Measures the similarity based on a dynamic model
–  Interconnectivity and closeness (proximity) between two clusters vs. the interconnectivity of the clusters and the closeness of items within the clusters
•  A two-phase algorithm
–  Use a graph partitioning algorithm: cluster objects into a large number of relatively small sub-clusters
–  Find the genuine clusters by repeatedly combining sub-clusters

Overall Framework of CHAMELEON
[Figure: Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters]

Distance-based Methods: Drawbacks
•  Hard to find clusters with irregular shapes
•  Hard to specify the number of clusters
•  Heuristic: a cluster must be dense
How to Find Irregular Clusters?
•  Divide the whole space into many small areas
–  The density of an area can be estimated
–  Areas may or may not be exclusive
–  A dense area is likely in a cluster
•  Start from a dense area, traverse connected dense areas, and discover clusters of irregular shape

Directly Density Reachable
•  Parameters
–  Eps: maximum radius of the neighborhood
–  MinPts: minimum number of points in an Eps-neighborhood of that point
–  N_Eps(p) = {q | dist(p, q) ≤ Eps}
•  Core object p: |N_Eps(p)| ≥ MinPts
–  A core object is in a dense area
•  Point q is directly density-reachable from p iff q ∈ N_Eps(p) and p is a core object
[Figure: an Eps-neighborhood with MinPts = 3 and Eps = 1 cm]

Density-Based Clustering
•  Density-reachable
–  Directly density-reachable p1 → p2, p2 → p3, …, pn-1 → pn
–  pn is density-reachable from p1
•  Density-connected
–  If points p and q are both density-reachable from o, then p and q are density-connected

DBSCAN
•  A cluster: a maximal set of density-connected points
–  Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points with Eps = 1 cm and MinPts = 5]

DBSCAN: the Algorithm
•  Arbitrarily select a point p
•  Retrieve all points density-reachable from p w.r.t. Eps and MinPts
•  If p is a core point, a cluster is formed
•  If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
•  Continue the process until all of the points have been processed

Challenges for DBSCAN
•  Different clusters may have very different densities
•  Clusters may be in hierarchies
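The DBSCAN procedure described above (grow a cluster from each unvisited core point, expanding through density-reachable neighbors) can be sketched compactly. A simple O(n^2) illustration in Python, not an optimized implementation (names and toy data are ours):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise/outliers."""
    X = np.asarray(X, float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]   # N_Eps(p), includes p itself
    labels = np.full(n, -1)
    cluster = 0
    for p in range(n):
        if labels[p] != -1 or len(neighbors[p]) < min_pts:
            continue                          # already assigned, or not a core point
        labels[p] = cluster                   # start a new cluster from core point p
        queue = list(neighbors[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster           # border or core point joins the cluster
                if len(neighbors[q]) >= min_pts:
                    queue.extend(neighbors[q])   # q is core: keep expanding
        cluster += 1
    return labels

X = np.vstack([np.random.default_rng(0).normal(0, 0.3, (40, 2)),
               np.random.default_rng(1).normal(3, 0.3, (40, 2))])
print(dbscan(X, eps=0.5, min_pts=5))
```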
OPTICS: A Cluster-ordering Method
•  Idea: ordering points to identify the clustering structure
•  "Group" points by density connectivity
–  Hierarchies of clusters
•  Visualize clusters and the hierarchy

Ordering Points
•  Points strongly density-connected should be close to one another
•  Clusters that are density-connected should be close to one another and form a "cluster" of clusters

OPTICS: An Example
[Figure: reachability-distance plotted over the cluster-order of the objects, with thresholds ε and ε'; some reachability distances are undefined]

DENCLUE: Using Density Functions
•  DENsity-based CLUstEring
•  Major features
–  Solid mathematical foundation
–  Good for data sets with large amounts of noise
–  Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
–  Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
–  But needs a large number of parameters
DENCLUE: Techniques
•  Use grid cells
–  Only keep grid cells actually containing data points
–  Manage cells in a tree-based access structure
•  Influence function: describes the impact of a data point on its neighborhood
•  Overall density of the data space is the sum of the influence functions of all data points
•  Clustering by identifying density attractors
–  Density attractor: a local maximum of the overall density function

Density Attractor
[Figure: density attractors as local maxima of the overall density function]
Center-defined and Arbitrary Clusters
[Figure: center-defined clusters and clusters of arbitrary shape]

A Shrinking-based Approach
•  Difficulties of multi-dimensional clustering
–  Noise (outliers)
–  Clusters of various densities
–  Not well-defined shapes
•  A novel preprocessing concept: "shrinking"
•  A shrinking-based clustering approach

Intuition & Purpose
•  For data points in a data set, what if we could make them move towards the centroid of the natural subgroup they belong to?
•  Natural sparse subgroups become denser, thus easier to detect
–  Noise points are further isolated

Inspiration
•  Newton's Universal Law of Gravitation
–  Any two objects exert a gravitational force of attraction on each other
–  The direction of the force is along the line joining the objects
–  The magnitude of the force is directly proportional to the product of the gravitational masses of the objects, and inversely proportional to the square of the distance between them
–  F_g = G \frac{m_1 m_2}{r^2}
–  G: universal gravitational constant, G = 6.67 × 10^{-11} N m^2/kg^2

The Concept of Shrinking
•  A data preprocessing technique
•  Shrink the natural sparse clusters to make them much denser, to facilitate the subsequent cluster-detecting process
–  Aim to optimize the inner structure of real data sets
•  Each data point is "attracted" by other data points and moves in the direction in which the attraction is the strongest
•  Can be applied in different fields

Apply Shrinking to the Clustering Field
[Figure: data points in a multi-attribute hyperspace moving towards denser regions]
Data Shrinking
•  Each data point moves along the direction of the density gradient and the data set shrinks towards the inside of the clusters
•  Points are "attracted" by their neighbors and move to create denser clusters
•  It proceeds iteratively, repeated until the data are stabilized or the number of iterations exceeds a threshold

Approximation & Simplification
•  Problem: computing the mutual attraction of each pair of data points is too time consuming, O(n^2)
–  Solution: drop Newton's constant G; m_1 and m_2 are set to unit mass
•  Only aggregate the gravitation surrounding each data point
•  Use grids to simplify the computation

Termination Condition
•  The average movement of all points in the current iteration is less than a threshold
•  The number of iterations exceeds a threshold

OPTICS on Pendigits Data
[Figure: OPTICS reachability plots of the Pendigits data before and after data shrinking]

Fuzzy Clustering
•  Each point x_i takes a probability w_{ij} to belong to a cluster C_j
•  Requirements
–  For each point x_i, \sum_{j=1}^{k} w_{ij} = 1
–  For each cluster C_j, 0 < \sum_{i=1}^{m} w_{ij} < m

Fuzzy C-Means (FCM)
Select an initial fuzzy pseudo-partition, i.e., assign values to all the w_{ij}
Repeat
   Compute the centroid of each cluster using the fuzzy pseudo-partition
   Recompute the fuzzy pseudo-partition, i.e., the w_{ij}
Until the centroids do not change (or the change is below some threshold)
Critical Details
•  Optimization on the sum of the squared error (SSE):
   SSE(C_1, \ldots, C_k) = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}^p \, dist(x_i, c_j)^2
•  Computing centroids: c_j = \sum_{i=1}^{m} w_{ij}^p x_i / \sum_{i=1}^{m} w_{ij}^p
•  Updating the fuzzy pseudo-partition:
   w_{ij} = \frac{(1 / dist(x_i, c_j)^2)^{\frac{1}{p-1}}}{\sum_{q=1}^{k} (1 / dist(x_i, c_q)^2)^{\frac{1}{p-1}}}
–  When p = 2: w_{ij} = \frac{1 / dist(x_i, c_j)^2}{\sum_{q=1}^{k} 1 / dist(x_i, c_q)^2}

Choice of P
•  When p → 1, FCM behaves like traditional k-means
•  When p is larger, the cluster centroids approach the global centroid of all data points
•  The partition becomes fuzzier as p increases

Effectiveness
[Figure]
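A minimal NumPy sketch of the FCM loop, implementing the centroid and pseudo-partition updates from "Critical Details" above (the function name, initialization, and toy data are ours):

```python
import numpy as np

def fuzzy_c_means(X, k, p=2.0, max_iter=100, tol=1e-5, seed=0):
    """Alternate the centroid update and the fuzzy pseudo-partition update."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    W = rng.random((len(X), k))
    W /= W.sum(axis=1, keepdims=True)          # each row of weights sums to 1
    for _ in range(max_iter):
        Wp = W ** p
        centers = (Wp.T @ X) / Wp.sum(axis=0)[:, None]       # c_j = sum w_ij^p x_i / sum w_ij^p
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + 1e-12
        inv = (1.0 / d2) ** (1.0 / (p - 1.0))
        W_new = inv / inv.sum(axis=1, keepdims=True)         # w_ij update formula
        if np.abs(W_new - W).max() < tol:
            break
        W = W_new
    return W, centers

X = np.vstack([np.random.default_rng(1).normal(0, 1, (30, 2)),
               np.random.default_rng(2).normal(6, 1, (30, 2))])
W, centers = fuzzy_c_means(X, k=2)
print(centers)
```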
Mixture Models
•  A cluster can be modeled as a probability distribution
–  Practically, assume a distribution can be approximated well using a multivariate normal distribution
•  Multiple clusters form a mixture of different probability distributions
•  A data set is a set of observations from a mixture of models

Object Probability
•  Suppose there are k clusters and a set X of m objects
–  Let the j-th cluster have parameters θ_j = (µ_j, σ_j)
–  The probability that a point is in the j-th cluster is w_j, with w_1 + … + w_k = 1
•  The probability of an object x is
   prob(x \mid \Theta) = \sum_{j=1}^{k} w_j \, p_j(x \mid \theta_j)
•  The probability of the data set X is
   prob(X \mid \Theta) = \prod_{i=1}^{m} prob(x_i \mid \Theta) = \prod_{i=1}^{m} \sum_{j=1}^{k} w_j \, p_j(x_i \mid \theta_j)

Example
•  For a Gaussian cluster, prob(x_i \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
•  With θ_1 = (−4, 2) and θ_2 = (4, 2),
   prob(x \mid \Theta) = \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x+4)^2}{8}} + \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x-4)^2}{8}}
Maximum Likelihood Estimation
•  Maximum likelihood principle: if we know a set of objects are from one distribution but do not know the parameter, we can choose the parameter maximizing the probability
•  Maximize
   prob(X \mid \Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}
–  Equivalently, maximize
   \log prob(X \mid \Theta) = -\sum_{i=1}^{m} \frac{(x_i-\mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma

EM Algorithm
•  Expectation Maximization algorithm
Select an initial set of model parameters
Repeat
   Expectation step: for each object, calculate the probability that it belongs to each distribution θ_i, i.e., prob(x_i | θ_i)
   Maximization step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood
Until the parameters are stable
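For the univariate Gaussian mixture used in the example above, the EM loop can be sketched as follows. This is a simple illustration of the E and M steps for k one-dimensional components, not a general-purpose implementation (names and toy data are ours):

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def em_gmm_1d(x, k=2, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, k)                      # initial model parameters
    sigma = np.full(k, x.std())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E step: probability that each object belongs to each distribution
        resp = np.array([w[j] * gaussian(x, mu[j], sigma[j]) for j in range(k)]).T
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: new parameter estimates that maximize the expected likelihood
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sigma

# data drawn from the two clusters of the example: theta1 = (-4, 2), theta2 = (4, 2)
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-4, 2, 200), rng.normal(4, 2, 200)])
print(em_gmm_1d(x))
```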
Advantages and Disadvantages
•  Mixture models are more general than k-means and fuzzy c-means
•  Clusters can be characterized by a small number of parameters
•  The results may satisfy the statistical assumptions of the generative models
•  Computationally expensive
•  Need large data sets
•  Hard to estimate the number of clusters

Grid-based Clustering Methods
•  Ideas
–  Using multi-resolution grid data structures
–  Using dense grid cells to form clusters
•  Several interesting methods
–  CLIQUE
–  STING
–  WaveCluster

CLIQUE
•  Clustering In QUEst
•  Automatically identify subspaces of a high-dimensional data space
•  Both density-based and grid-based

CLIQUE: the Ideas
•  Partition each dimension into the same number of equal-length intervals
–  Partition an m-dimensional data space into non-overlapping rectangular units
•  A unit is dense if the number of data points in the unit exceeds a threshold
•  A cluster is a maximal set of connected dense units within a subspace
CLIQUE: the Method
•  Partition the data space and find the number of points in each cell of the partition
•  Identify clusters
–  Determine dense units in all subspaces of interest and connected dense units in all subspaces of interest
–  Apriori property: a k-dimensional cell cannot be dense if one of its (k−1)-dimensional projections is not dense
•  Generate a minimal description for the clusters
–  Determine the minimal cover for each cluster

CLIQUE: An Example
[Figure: grids over salary (×$10,000) vs. age and vacation (weeks) vs. age, with dense units identified in each subspace]

CLIQUE: Pros and Cons
•  Automatically finds subspaces of the highest dimensionality with high-density clusters
•  Insensitive to the order of input
–  Does not presume any canonical data distribution
•  Scales linearly with the size of the input
•  Scales well with the number of dimensions
•  The clustering result may be degraded at the expense of the simplicity of the method

Bad Cases for CLIQUE
•  Parts of a cluster may be missed
•  A cluster from CLIQUE may contain noise
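The first step of CLIQUE (partition each dimension into equal-length intervals, count points per cell, keep the dense cells) is easy to sketch. Below is a minimal illustration for 1-d and 2-d subspaces with a simple count threshold; it is our own simplified helper, not the full CLIQUE algorithm with Apriori-style candidate generation and cluster-description steps:

```python
import numpy as np
from collections import Counter
from itertools import combinations

def dense_units(X, intervals=10, threshold=5):
    """For every 1-d and 2-d subspace, bin the data into equal-length intervals
    and return the grid cells whose point count exceeds the threshold."""
    X = np.asarray(X, float)
    n, d = X.shape
    bins = np.empty_like(X, dtype=int)
    for j in range(d):                                     # interval index per dimension
        edges = np.linspace(X[:, j].min(), X[:, j].max(), intervals + 1)
        bins[:, j] = np.clip(np.digitize(X[:, j], edges) - 1, 0, intervals - 1)
    dense = {}
    for dims in list(combinations(range(d), 1)) + list(combinations(range(d), 2)):
        counts = Counter(tuple(row) for row in bins[:, list(dims)])
        dense[dims] = {cell for cell, c in counts.items() if c > threshold}
    return dense

X = np.random.default_rng(0).normal(size=(200, 3))
print({subspace: len(cells) for subspace, cells in dense_units(X).items()})
```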
Biclustering
•  Clustering both objects and attributes simultaneously
•  Four requirements
–  Only a small set of objects in a cluster (bicluster)
–  A bicluster only involves a small number of attributes
–  An object may participate in multiple biclusters or in no bicluster
–  An attribute may be involved in multiple biclusters or in no bicluster

Application Examples
•  Recommender systems
–  Objects: users
–  Attributes: items
–  Values: user ratings
•  Microarray data
–  Objects: genes
–  Attributes: samples
–  Values: expression levels
[Figure: a gene × sample/condition matrix with entries w_{11}, w_{12}, ..., w_{nm}]
Biclusters with Constant Values
•  The simplest case: a submatrix I × J is a bicluster with constant values if e_{ij} = c for any i ∈ I and j ∈ J
[Figure: in a gene-condition matrix, the submatrix {a1, a33, a86} × {b6, b12, b36, b99} has the constant value 60 and forms a bicluster with constant values]
[Figure: a bicluster with constant values on rows, e.g., a 4 × 5 submatrix whose rows are constantly 10, 20, 50, and 0]

Biclusters with Coherent Values
•  Also known as pattern-based clusters
•  A submatrix I × J has coherent values if e_{ij} = c + α_i + β_j, where α_i and β_j are the adjustments for row i and column j
[Figure: a bicluster with coherent values
   10  50  30   70  20
   20  60  40   80  30
   50  90  70  110  60
    0  40  20   60  10]
Biclusters with Coherent Evolutions
•  In some applications, only the up- or down-regulated changes across genes or conditions matter; the exact values are not constrained
•  A bicluster with coherent evolutions on rows is a submatrix I × J such that for any i_1, i_2 ∈ I and j_1, j_2 ∈ J, (e_{i_1 j_1} - e_{i_2 j_1})(e_{i_1 j_2} - e_{i_2 j_2}) \ge 0
[Figure: a bicluster with coherent evolutions on rows
   10   50  30    70  20
   20  100  50  1000  30
   50  100  90   120  80
    0   80  20   100  10]

Differences from Subspace Clustering
•  Subspace clustering uses a global distance/similarity measure
•  Pattern-based clustering looks at patterns
•  A subspace cluster according to a globally defined similarity measure may not follow the same pattern

Follow the Same Pattern?
[Figure: the values of two objects (blue and green) on attributes D1 and D2; the smaller the pScore, the more consistent the two objects]

Pattern-based Clusters
•  pScore: the similarity between two objects r_x, r_y on two attributes a_u, a_v
   pScore\left(\begin{bmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{bmatrix}\right) = |(r_x.a_u - r_y.a_u) - (r_x.a_v - r_y.a_v)|
•  δ-pCluster (R, D): for any objects r_x, r_y ∈ R and any attributes a_u, a_v ∈ D,
   pScore\left(\begin{bmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{bmatrix}\right) \le \delta \quad (\delta \ge 0)
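The δ-pCluster condition above is just a check over all 2×2 submatrices of the candidate objects and attributes. A hedged sketch (our own helper names; the absolute value follows the δ ≥ 0 threshold in the definition):

```python
import numpy as np
from itertools import combinations

def p_score(rx, ry, u, v):
    """pScore of the 2x2 submatrix on objects rx, ry and attributes u, v."""
    return abs((rx[u] - ry[u]) - (rx[v] - ry[v]))

def is_delta_pcluster(data, objects, attrs, delta):
    """(R, D) is a delta-pCluster iff every 2x2 submatrix has pScore <= delta."""
    for x, y in combinations(objects, 2):
        for u, v in combinations(attrs, 2):
            if p_score(data[x], data[y], u, v) > delta:
                return False
    return True

# the rows follow the same shifting pattern on attributes 0-2, so they form a 0-pCluster
data = np.array([[10.0, 50.0, 30.0],
                 [20.0, 60.0, 40.0],
                 [50.0, 90.0, 70.0]])
print(is_delta_pcluster(data, objects=[0, 1, 2], attrs=[0, 1, 2], delta=0.0))
```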
Maximal pCluster
•  If (R, D) is a δ-pCluster, then every sub-cluster (R', D') with R' ⊆ R and D' ⊆ D is a δ-pCluster
–  An anti-monotonic property
–  A large pCluster is accompanied by many small pClusters! Inefficacious
•  Idea: mine only the maximal pClusters!
–  A δ-pCluster is maximal if there exists no proper super-cluster that is a δ-pCluster

Mining Maximal pClusters
•  Given
–  A cluster threshold δ
–  An attribute threshold min_a
–  An object threshold min_o
•  Task: mine the complete set of significant maximal δ-pClusters
–  A significant δ-pCluster has at least min_o objects on at least min_a attributes
pClusters and Frequent Itemsets
•  A transaction database can be modeled as a binary matrix
–  Frequent itemset: a sub-matrix of all 1's
–  0-pCluster on binary data
–  min_o: support threshold
–  min_a: no less than min_a attributes
–  Maximal pClusters – closed itemsets
•  Frequent itemset mining algorithms cannot be extended straightforwardly for mining pClusters on numeric data

Where Should We Start from?
•  How about the pClusters having only 2 objects or 2 attributes?
–  MDS (maximal dimension set)
–  A pCluster must have at least 2 objects and 2 attributes
•  Finding MDSs
   Attribute:   a   b   c   d   e   f   g   h
   x:          13  11   9   7   9  13   2  15
   y:           7   4  10   1  12   3   4   7
   x − y:       6   7  -1   6  -3  10  -2   8

How to Assemble Larger pClusters?
•  Systematically enumerate every combination of attributes D
–  For each attribute subset D, find the maximal subsets of objects R s.t. (R, D) is a pCluster
–  Check whether (R, D) is maximal
•  Why attribute-first-object-later?
–  # of objects >> # of attributes
•  Algorithm MaPle (Pei et al., 2003)
•  Prune search branches as early as possible

More Pruning Techniques
•  Only possible attributes should be considered to get larger pClusters
•  Prune local maximal pClusters having insufficient possible attributes
•  Extract common attributes from the possible attribute set directly
•  Prune non-maximal pClusters
Gene-Sample-Time Series Data
[Figure: a 3-D data cube with Gene, Sample, and Time dimensions; each entry is the expression level of gene i on sample j at time k; its 2-D projections are the Gene-Sample, Gene-Time, and Sample-Time matrices]

Mining GST Microarray Data
•  Reduce the gene-sample-time series data to gene-sample data
–  Use the Pearson correlation coefficient as the coherence measure

Basic Approaches
•  Sample-gene search
–  Enumerate the subsets of samples systematically
–  For each subset of samples, find the genes that are coherent on the samples
•  Gene-sample search
–  Enumerate the subsets of genes systematically
–  For each subset of genes, find the samples on which the genes are coherent

Basic Tools
•  Set enumeration tree
•  Sample-gene search and gene-sample search are not symmetric!
–  Many genes, but a few samples
–  No requirement on samples being coherent on genes

The Phenotype Mining Problem
•  Input: a microarray matrix and k
•  Output: phenotypes and informative genes
–  Partitioning the samples into k exclusive subsets – phenotypes
–  Informative genes discriminating the phenotypes
•  Machine learning methods
–  Heuristic search
–  Mutual reinforcing adjustment

Phenotypes and Informative Genes
[Figure: a microarray matrix in which samples 1-3 and 4-7 form two phenotypes; genes 1-7 are split into informative genes, which discriminate the phenotypes, and non-informative genes]
Requirements
•  The expression levels of each informative gene should be similar over the samples within each phenotype
•  The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes

Intra-phenotype Consistency
•  In a subset of genes (candidate informative genes), does every gene have good consistency on a set of samples?
•  Average variance of the subset of genes – the smaller the intra-phenotype consistency, the better
   Con(G', S') = \frac{1}{|G'| \cdot (|S'| - 1)} \sum_{g_i \in G'} \sum_{s_j \in S'} (w_{i,j} - w_{i,S'})^2

Inter-phenotype Divergence
•  How well can a subset of genes (candidate informative genes) discriminate two phenotypes of samples?
•  Sum of the average difference between the phenotypes – the larger the inter-phenotype divergence, the better
   Div(G', S_1, S_2) = \frac{\sum_{g_i \in G'} |w_{i,S_1} - w_{i,S_2}|}{|G'|}

Quality of Phenotypes and Informative Genes
   \Omega = \sum_{S_i, S_j \,(1 \le i, j \le K;\, i \ne j)} \frac{Div(G', S_i, S_j)}{Con(G', S_i) + Con(G', S_j)}
•  The higher the value, the better the quality

Heuristic Search
•  Start from a random subset of genes and an arbitrary partition of the samples
•  Iteratively adjust the partition and the gene set toward a better solution
–  For each possible adjustment, compute ΔΩ
–  ΔΩ > 0 → conduct the adjustment
–  ΔΩ < 0 → conduct the adjustment with probability e^{ΔΩ·T(i)}
•  T(i) is a decreasing simulated annealing function and i is the iteration number; T(i) = 1/(i+1) in our implementation

Possible Adjustments
•  For each gene, try possible insert/remove
•  For each sample, try the best movement
[Figure: candidate adjustments (insert a gene, remove a gene, move a sample) and their ΔΩ values]
Disadvantages of Heuristic Search
•  Samples and genes are examined and adjusted with equal chances
–  # samples << # genes
–  Samples should play more important roles
•  Outliers in the samples should be handled specifically
–  Outliers highly interfere with the quality and the adjustment decisions

Mutual Reinforcing Adjustment
•  A two-phase approach
–  Iteration phase
–  Refinement phase
•  Mutual reinforcement
–  Use the gene partition to improve the sample partition
–  Use the sample partition to improve the gene partition
Dimensionality Reduction
•  Clustering a high-dimensional data set is challenging
–  The distance between two points could be dominated by noise
•  Dimensionality reduction: choosing the informative dimensions for clustering analysis
–  Feature selection: choose a subset of the existing dimensions
–  Feature construction: construct a new (small) set of informative attributes

Variance and Covariance
•  Given a set of 1-d points, how different are those points?
–  Standard deviation: s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}
–  Variance: s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}
•  Given a set of 2-d points, are the two dimensions correlated?
–  Covariance: cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}

Principal Components
[Figure: a 2-d point cloud with its two principal component directions; artwork and example from http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf]

Step 1: Mean Subtraction
•  Subtract the mean from each dimension for each data point
•  Intuition: centralizing the data set
Step 2: Covariance Matrix
   C = \begin{pmatrix} cov(D_1, D_1) & cov(D_1, D_2) & \cdots & cov(D_1, D_n) \\ cov(D_2, D_1) & cov(D_2, D_2) & \cdots & cov(D_2, D_n) \\ \vdots & \vdots & \ddots & \vdots \\ cov(D_n, D_1) & cov(D_n, D_2) & \cdots & cov(D_n, D_n) \end{pmatrix}

Step 3: Eigenvectors and Eigenvalues
•  Compute the eigenvectors and the eigenvalues of the covariance matrix
–  Intuition: find the direction-invariant vectors as candidates for new attributes
–  Eigenvalues indicate how much the direction-invariant vectors are scaled – the larger, the better for manifesting the data variance

Step 4: Forming New Features
•  Choose the principal components and form new features
–  Typically, choose the top-k components
–  NewData = RowFeatureVector × RowDataAdjust

New Features
[Figure: the data re-expressed using only the first principal component]

Clustering in Derived Space
[Figure: the data projected onto the derived axis −0.707x + 0.707y and clustered in that space]
Y
Data
Affinity matrix
[ Wij ]
A = f(W)
- 0.707x + 0.707y
O
Jian Pei: Big Data Analytics -- Clustering
118
Computing the leading
k eigenvectors of A
Clustering in the
new space
Projecting back to
cluster the original data
Av = \lamda v
X
119
Jian Pei: Big Data Analytics -- Clustering
120
Affinity Matrix
•  Using a distance measure,
   W_{ij} = e^{-\frac{dist(o_i, o_j)}{\sigma^2}}
   where σ is a scaling parameter controlling how fast the affinity W_{ij} decreases as the distance increases
•  In the Ng-Jordan-Weiss algorithm, W_{ii} is set to 0

Clustering
•  In the Ng-Jordan-Weiss algorithm, we define a diagonal matrix D such that
   D_{ii} = \sum_{j=1}^{n} W_{ij}
•  Then, A = D^{-1/2} W D^{-1/2}
•  Use the k leading eigenvectors to form a new space
•  Map the original data to the new space and conduct clustering
Major Tasks
•  Feasibility
•  Assessing clustering tendency
–  Applying any clustering methods on a uniformly
distributed data set is meaningless
–  Are there non-random structures in the data?
•  Determining the number of clusters or other
critical parameters
•  Measuring clustering quality
•  Quality
–  Are the clustering results meeting users’ interest?
–  Clustering patients into clusters corresponding
various disease or sub-phenotypes is meaningful
–  Clustering patients into clusters corresponding to
male or female is not meaningful
Jian Pei: Big Data Analytics -- Clustering
122
123
Uniformly Distributed Data
Jian Pei: Big Data Analytics -- Clustering
124
Hopkins Statistic
•  Clustering uniformly distributed data is
•  Hypothesis: the data is generated by a
meaningless
uniform distribution in a space
•  A uniformly distributed data set is generated
•  Sample n points, p1, …, pn, uniformly from
504CHAPTER 10. CLUSTER ANALYSIS: BASIC CONCEPTS AND METHODS
by a uniform data distribution
the space of D
•  For each point pi, find the nearest neighbor
of pi in D, let xi be the distance between pi
and its nearest neighbor in D
xi = min{dist(pi , v)}
v2D
Jian Pei: Big Data Analytics -- Clustering
Jian Pei: Big Data Analytics -- Clustering
Figure 10.21: A data set that is uniformly 125
distributed in the data space.
• Measuring clustering quality. After applying a clustering method on a
data set, we want to assess how good the resulting clusters are. A number
of measures can be used. Some methods measure how well the clusters
fit the data set, while others measure how well the clusters match the
ground truth, if such truth is available. There are also measures that
score clusterings and thus can compare two sets of clustering results on
the same data set.
In the rest of this section, we discuss each of the above three topics.
10.6.1
Assessing Clustering Tendency
Clustering tendency assessment determines whether a given data set has a nonrandom structure, which may lead to meaningful clusters. Consider a data
set that does not have any non-random structure, such as a set of uniformly
126
21
Hopkins Statistic
•  Sample n points, q_1, …, q_n, uniformly from D
•  For each q_i, find the nearest neighbor of q_i in D − {q_i}, and let y_i be the distance between q_i and its nearest neighbor in D − {q_i}
   y_i = \min_{v \in D, v \ne q_i} \{dist(q_i, v)\}
•  Calculate the Hopkins statistic H
   H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}

Explanation
•  If D is uniformly distributed, then \sum_{i=1}^{n} x_i and \sum_{i=1}^{n} y_i would be close to each other, and thus H would be around 0.5
•  If D is skewed, then \sum_{i=1}^{n} y_i would be substantially smaller, and thus H would be close to 0
•  If H > 0.5, then it is unlikely that D has statistically significant clusters
A Cross-Validation Method
•  Depending on many factors
•  Divide the data set D into m parts
•  Use m – 1 parts to find a clustering
•  Use the remaining part as the test set to test
the quality of the clustering
–  The shape and scale of the distribution in the
data set
–  The clustering resolution required by the user
•  Many methods
exist
r
n
2
p
–  Set k =
, each cluster has 2n points on
average
–  Plot the sum of within-cluster variances with
respect to k, find the first (or the most significant
turning point)
Jian Pei: Big Data Analytics -- Clustering
129
128
–  For each point in the test set, find the closest
centroid or cluster center
–  Use the squared distances between all points in the
test set and the corresponding centroids to
measure how well the clustering model fits the test
set
•  Repeat m times for each value of k, use the
average as the quality measure
Jian Pei: Big Data Analytics -- Clustering
130
Measuring Clustering Quality
•  Ground truth: the ideal clustering determined by human experts
•  Two situations
–  There is a known ground truth – extrinsic (supervised) methods compare the clustering against the ground truth
–  The ground truth is unavailable – intrinsic (unsupervised) methods measure how well the clusters are separated

Quality in Extrinsic Methods
•  Cluster homogeneity: the purer the clusters in a clustering are, the better the clustering
•  Cluster completeness: objects in the same cluster in the ground truth should be clustered together
•  Rag bag: putting a heterogeneous object into a pure cluster is worse than putting it into a rag bag
•  Small cluster preservation: splitting a small cluster in the ground truth into pieces is worse than splitting a bigger one
BCubed Precision and Recall
•  D = {o_1, …, o_n}; C is a clustering on D
–  L(o_i) is the category of o_i given by the ground truth
–  C(o_i) is the cluster-id of o_i in C
•  For two objects o_i and o_j (i ≠ j), the correctness of the relation between o_i and o_j in C is 1 if L(o_i) = L(o_j) ⟺ C(o_i) = C(o_j), and 0 otherwise
•  Precision
   Precision_{BCubed} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{o_j: i \ne j, C(o_i)=C(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \ne j, C(o_i) = C(o_j)\}\|}
•  Recall
   Recall_{BCubed} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{o_j: i \ne j, L(o_i)=L(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \ne j, L(o_i) = L(o_j)\}\|}
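The per-object definitions above can be coded directly. A minimal pure-Python sketch (function name and toy labels are ours; singleton clusters/categories simply contribute zero here):

```python
def bcubed(labels_true, labels_pred):
    """BCubed precision and recall from the per-object definitions above."""
    n = len(labels_true)
    correct = [[int((labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j]))
                for j in range(n)] for i in range(n)]
    prec, rec = 0.0, 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if j != i and labels_pred[i] == labels_pred[j]]
        same_category = [j for j in range(n) if j != i and labels_true[i] == labels_true[j]]
        if same_cluster:                       # average correctness over o_i's cluster mates
            prec += sum(correct[i][j] for j in same_cluster) / len(same_cluster)
        if same_category:                      # average correctness over o_i's category mates
            rec += sum(correct[i][j] for j in same_category) / len(same_category)
    return prec / n, rec / n

truth = ["a", "a", "a", "b", "b"]
pred  = [ 0,   0,   1,   1,   1 ]
print(bcubed(truth, pred))
```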
Intrinsic Methods
•  No ground truth is assumed
•  Evaluate a clustering by examining how well the clusters are separated and how compact the clusters are

Silhouette Coefficient
•  Suppose a data set D of n objects is partitioned into k clusters, C_1, …, C_k
•  For each object o ∈ C_i,
–  Calculate a(o), the average distance between o and every other object in the same cluster – the compactness of the cluster; the smaller, the better
   a(o) = \frac{\sum_{o' \in C_i, o' \ne o} dist(o, o')}{|C_i| - 1}
–  Calculate b(o), the minimum average distance from o to all objects in a cluster that o does not belong to – the degree of separation from other clusters; the larger, the better
   b(o) = \min_{C_j: 1 \le j \le k, j \ne i} \left\{ \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\}
•  Then
   s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}
–  The value of the silhouette coefficient is between −1 and 1
–  When s(o) approaches 1, the cluster containing o is compact and o is far away from other clusters; a negative value (b(o) < a(o)) means o is, in expectation, closer to the objects of another cluster than to those of its own cluster
•  Use the average silhouette coefficient of all objects as the overall measure
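A minimal NumPy sketch of the silhouette coefficient defined above (function name and toy data are ours; singleton clusters are skipped since a(o) is undefined for them):

```python
import numpy as np

def silhouette(X, labels):
    """Average silhouette coefficient s(o) = (b(o) - a(o)) / max(a(o), b(o))."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = (labels == labels[i])
        if same.sum() <= 1:
            continue                                   # a(o) undefined for singleton clusters
        a = dist[i, same].sum() / (same.sum() - 1)     # average distance within o's own cluster
        b = min(dist[i, labels == c].mean()            # minimum average distance to another cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (30, 2)),
               np.random.default_rng(2).normal(3, 0.3, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)
print(silhouette(X, labels))    # close to 1 for compact, well-separated clusters
```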
Multi-Clustering
•  A data set may be clustered in different ways
–  In different subspaces, that is, using different attributes
–  Using different similarity measures
–  Using different clustering methods
•  Some different clusterings may capture different meanings of categorization
–  Orthogonal clusterings
•  Putting users in the loop