University of Alberta Library Release Form

Name of Author: Jianjun Zhou
Title of Thesis: Efficiently Searching and Mining Biological Sequence and Structure Data
Degree: Doctor of Philosophy
Year this Degree Granted: 2009

Permission is hereby granted to the University of Alberta Library to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. The author reserves all other publication and other rights in association with the copyright in the thesis, and except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatever without the author's prior written permission.

Jianjun Zhou
Date:

University of Alberta

Efficiently Searching and Mining Biological Sequence and Structure Data

by

Jianjun Zhou

A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Department of Computing Science
Edmonton, Alberta
Spring 2009

University of Alberta
Faculty of Graduate Studies and Research

The undersigned certify that they have read, and recommend to the Faculty of Graduate Studies and Research for acceptance, a thesis entitled Efficiently Searching and Mining Biological Sequence and Structure Data submitted by Jianjun Zhou in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Jörg Sander (Supervisor)
Guohui Lin (Co-Supervisor)
Mario Nascimento
Davood Rafiei
David Wishart
Caetano Traina Junior (External Examiner)
Date:

To my parents. Thanks for your decade-long support.

Abstract

The rapid growth of biological sequence and structure data imposes significant challenges on searching and mining them. While handling growing data-sets has been a continuously interesting topic in the database and data mining communities, the unique characteristics of biological data make it difficult or even impossible to directly apply traditional database searching and mining methods. In many biological databases, the data objects and the dissimilarity measurement (i.e., distance function) between data objects form a so-called metric space, in which the notions of dimensionality from traditional vector spaces are no longer valid and the dissimilarity measurement can be computationally much more expensive than traditional measurements such as the Euclidean distance in a low-dimensional space.

In this thesis, we study the problems of performing efficient clustering and similarity searches on biological sequence and structure data using an expensive distance function. Efficient solutions to these problems rely on the ability of the searching and mining algorithms to avoid expensive distance computations. For this central challenge, we propose several novel techniques, including directional extent in non-vector Data Bubbles, pairwise ranking, virtual pivots, and partial pivots. In-depth theoretical studies and extensive experimental results on several real-life data-sets confirm the superiority of our methods over previous approaches.

Acknowledgements

I am grateful to my supervisors, Dr. Guohui Lin and Dr. Jörg Sander, who have generously supported me in my PhD study. From them, I have learned a lot. I got to know Dr. Sander before I entered the PhD program. Throughout all these years, he has been patiently guiding me in my study, showing me how to develop an idea, and, more importantly, how to face success and failure.
While being critical, he always respects my opinion, treating me not just as his student, but as his friend as well. Dr. Lin is one of the most hard-working teachers I have ever had. He gives his students a good example of how diligence will lead to a successful career, and how perseverance will eventually remove the blockades on the challenging road of research. He motivates his students to learn by putting not just a high requirement on them, but an even higher requirement on himself.

Table of Contents

1 Introduction
  1.1 Database Support for Efficient Hierarchical Clustering
  1.2 Database Support for Efficient Similarity Search in Metric Spaces

I Speed-up Clustering

2 Speed-up Clustering with Data Summarization
  2.1 Related Work
  2.2 Speed-up Clustering with Data Summarization
    2.2.1 Data Bubbles for Euclidean Vector Data
    2.2.2 Data Bubbles for Non-Vector Spaces
  2.3 Performance Evaluation
    2.3.1 Data-sets and Experimental Setup
    2.3.2 Comparison with Original Data Bubbles
    2.3.3 Comparison with Random Sampling
    2.3.4 An Application to a Real-life Data-set
  2.4 Summary

3 Speed-up Clustering with Pairwise Ranking
  3.1 Preliminaries
    3.1.1 Three Major Clustering Approaches
    3.1.2 Triangle Inequalities in Metric Spaces
  3.2 Motivation
  3.3 k-Close Neighbor Ranking
    3.3.1 Ranking Using Triangle Inequalities
    3.3.2 An Intuition of When Ranking Works
  3.4 Pairwise Hierarchical Ranking
    3.4.1 Partitioning
    3.4.2 Ranking
    3.4.3 Reducing the Overhead in Ranking
  3.5 Experimental Evaluation
    3.5.1 Synthetic Data
    3.5.2 Real-life Data
  3.6 Summary

II Speed-up Similarity Search

4 The Concept of Virtual Pivots for Similarity Search
  4.1 Related Work
    4.1.1 Hierarchical Approaches
    4.1.2 Non-hierarchical Approaches
  4.2 The Pruning Ability of Pivots
    4.2.1 One Pivot versus Several Pivots
    4.2.2 Random Pivots versus Close Pivots
  4.3 Virtual Pivots
    4.3.1 The Pruning Ability of Virtual Pivots
  4.4 Boosting Virtual Pivots
    4.4.1 Why Boosting Works
  4.5 Advantages of Using Virtual Pivots
  4.6 Algorithm
  4.7 Summary

5 Efficient Similarity Search with Virtual Pivots and Partial Pivots
  5.1 Methods
    5.1.1 Partial Pivots
    5.1.2 The Complete Algorithm with Virtual and Partial Pivots
  5.2 Results
    5.2.1 Results on the HIV-1 Dataset
    5.2.2 Results on the HA Gene Dataset
  5.3 Discussion
    5.3.1 Query Performance Dependence on the Effectiveness of Pivots
    5.3.2 Query Performance w.r.t. Ranking
    5.3.3 Query Performance w.r.t. t
    5.3.4 Time and Space Complexity
    5.3.5 Distance Distributions
    5.3.6 HIV-1 Genotyping Accuracy
  5.4 Summary

6 Conclusions and Future Work

Bibliography

List of Tables

5.1 The detailed numbers of distance computations and the runtime (in seconds) by all seven k-nn search methods in the smaller preprocessing, and their resultant query performance in 1-nn search in terms of the average number of distance computations and the average runtime (in seconds) per query.
5.2 The distance intervals associated with the five bins and the numbers of query objects therein.
5.3 The HIV-1 computational genotyping accuracies by k-nn search and majority vote, k = 1, 2, 3, 4, 5.

List of Figures

2.1 Example reachability plot.
2.2 Illustration of the distance between original Data Bubbles for vector data (figure adapted from [9]).
2.3 No object is close to the center of the set.
2.4 “Directional extent” of a Data Bubble.
2.5 Illustration of direction and reverse direction.
2.6 Illustration of border distance.
2.7 “Normal” Data Bubble separation.
2.8 A “gap” in Data Bubble B. Since the cluster in Bubble A spreads across the center line between representatives rA and rB, Bubble B contains points from two clusters. In B, the border distance in direction A is larger than in the reverse direction of A.
2.9 Examples for a “gap” in a Data Bubble B.
2.10 Reachability plots for the whole synthetic data-sets used for the evaluation.
2.11 New versus old method on DS-Vector using N-score.
2.12 New versus old method on DS-Vector using F-score.
2.13 New versus old method on DS-UG using F-score.
2.14 Non-vector data bubbles vs. random sampling on the DS-Tuple data-set.
2.15 Scale-up speed-up w.r.t. number of objects on DS-Tuple.
2.16 Results for DS-Real.
3.1 An OPTICS walk. The arrows represent the ordering in the walk. Although the corresponding reachability-distances are different from the distances between the pairs of objects, the lengths of the edges indicate the level of the reachability-distance values. It shows that most plotted reachability-distances are small in values.
3.2 Ranking with triangle inequalities. Although D(q, o) cannot be estimated by using p′, chances are that D(q, o) can be estimated by using another pivot p; while p′ cannot be used to estimate D(q, o), p′ can be used to estimate D(q′, o).
3.3 Pr[EP(q, o) ≤ D(q, c)] with respect to µ, σ, δ, and |P|.
3.4 An example for hierarchical ranking.
3.5 Hierarchical ranking algorithm.
3.6 Linking the occurrences of each object.
3.7 k-cn ranking algorithm with best-first frontier search.
3.8 The reachability plots from OPTICS (a and b) and OPTICS-Rank (c and d) are almost identical (due to the property of OPTICS clustering, there are switches of cluster positions).
3.9 Clustering accuracy and performance on DS-Vector.
3.10 Clustering accuracy and performance on DS-Tuple.
3.11 OPTICS output for DS-Protein with three cut-lines (y = 0.1, 0.2, 0.3).
3.12 Clustering accuracy on DS-Protein.
3.13 Performance of OPTICS-Rank and OPTICS-Bubble on DS-Protein.
3.14 The distribution of ε values for rankings in OPTICS-Rank.
3.15 Effect of changing parameter k.
3.16 Effect of changing branching factor b.
3.17 Effect of changing step limit s.
3.18 Scalability with respect to the size of the data-set.
3.19 Clustering results on DS-Jaspar.
3.20 Distribution of F-scores. The overall F-score is 0.87.
4.1 Methods to enhance pruning ability.
4.2 The lower bound of m (δ ≤ 0.3).
4.3 Portion of neighbors within a δ radius.
4.4 Illustration of virtual pivots. Dashed lines represent distances to be estimated.
4.5 Pruning schemes.
4.6 The factor f(s, δ).
4.7 Probability of having an LBE smaller than δ.
5.1 Partial pivots can help when there is no close pivot to the query object. Let q be a query object, p be a partial pivot far away from q, and oi (i = 1, 2, 3) be neighbors of p. |D(q, p) − D(p, oi)| is close to D(q, p) and can be larger than the current k-nn distance upper bound, so that oi will be pruned away.
5.2 The average numbers of distance computations per query by all seven k-nn search methods on the HIV-1 dataset, for k = 1, 3, 5, 10.
5.3 The average runtime per query by all seven k-nn search methods on the HIV-1 dataset, for k = 1, 3, 5, 10.
5.4 Performance measures per query by the six k-nn search methods, with different amounts of preprocessing efforts on the HIV-1 dataset using global alignment distance.
5.5 The average numbers of distance computations per query by all seven k-nn search methods with two different amounts of preprocessing efforts, on the HA dataset, for k = 1, 3, 5, 10.
5.6 The average runtime per query by all seven k-nn search methods with two different amounts of preprocessing efforts, on the HA dataset, for k = 1, 3, 5, 10.
5.7 The average numbers of distance computations per query by all six methods in 1-nn search for five bins of query objects with different nearest neighbor distance ranges, on three datasets.
5.8 Performance on HIV-1-GA when the number of fixed pivots to perform ranking, |S|, increases from 1 to 80.
5.9 Performance on HIV-CCV when |S| increases from 1 to 70.
5.10 Performance when the number of predicted close neighbors for each data object, t, increases from 1 to 50.
5.11 The distributions of all pairwise global alignment distances and all pairwise CCV-based Euclidean distances on the complete HIV-1 dataset of 1,198 viral strains.

Chapter 1
Introduction

In the last three decades, computer-assisted biological technologies have been playing an increasingly significant role in our society. For instance, hierarchical clustering of DNA sequences now enables scientists to trace the origins of contagious diseases that have the potential to evolve into global pandemics; DNA forensic analysis is widely used in criminal investigations; and recent advances in protein structure prediction help the drug industry to develop new drugs more efficiently. Given the potential of DNA and protein research, a significant amount of resources has been devoted to it, generating a huge volume of sequence and structure data.
For instance, the Human Genome Project has created a sequence database containing 3 billion chemical base pairs, and each year millions of dollars are spent on protein structure determination via X-ray and NMR methods, leading to a steady growth of protein structure databases. How to search and mine these sequence and structure data to make the best use of them is a significant challenge facing researchers today.

While handling growing data-sets has been a long-standing topic of interest in the database and data mining communities, the unique characteristics of biological data make it difficult or even impossible to directly apply traditional database searching and mining methods. In many biological databases, the (dis-)similarity measurement typically involves time-consuming operations such as sequence or structure alignments, which are computationally much more expensive than traditional dissimilarity measurements such as the Euclidean distance in a low-dimensional space. On the other hand, many traditional searching and mining algorithms do not scale up well, so they require a large number of expensive distance computations, which leads to poor runtime performance.

This dissertation addresses the challenges in two searching and mining problems: (1) clustering, and (2) similarity searches. The rest of this chapter outlines our contributions to each of these two areas.

1.1 Database Support for Efficient Hierarchical Clustering

Data clustering is an important task for knowledge discovery in databases (KDD). The basic goal of a clustering algorithm is to partition a set of data objects into groups so that similar objects belong to the same group and dissimilar objects belong to different groups. There are different types of clustering algorithms for different types of applications. A common distinction is between partitioning and hierarchical clustering algorithms (see e.g. [28]). Partitioning algorithms, for instance the k-means [32] and the k-medoids algorithms [28], decompose a database into a set of k clusters, whereas hierarchical algorithms only compute a representation of the data-set, which reflects its hierarchical clustering structure, but do not explicitly determine clusters. Examples of hierarchical clustering algorithms are the Single-Link method [42] and OPTICS [3].

Clustering algorithms in general, and hierarchical algorithms in particular, do not scale well with the size of the data-set. On the other hand, very fast methods are most desirable for exploratory data analysis, which is what clustering is mostly used for. To speed up cluster analysis on large data-sets, a number of data summarization methods have been proposed recently. Those methods are based on a general strategy that can be used to scale up whole classes of clustering algorithms (rather than inventing a new clustering algorithm), and their basic idea is described in more detail below.

1. Use a data summarization method that produces "sufficient statistics" for subsets of the data-set (using either sampling plus a classification of objects to the closest sample point, or some other technique such as BIRCH [54]). The data summarizations are sometimes also called "micro-clusters" (e.g. in [27]).

2. Apply (an adapted version of) the clustering algorithm to the data summaries only.

3. Using the clustering result for the data summaries, estimate a clustering result for the whole data-set.

Different data summarization methods have different advantages and disadvantages.
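Read as a control-flow skeleton, the three-step strategy above amounts to a simple generic pipeline. The following sketch is only our illustration of that flow; the names summarize, cluster, and expand are placeholders for whatever concrete method is plugged in, not functions defined in this thesis.

```python
def cluster_via_summaries(data, summarize, cluster, expand, n_summaries):
    """Generic speed-up strategy: cluster compact summaries instead of all objects."""
    summaries = summarize(data, n_summaries)   # step 1: build "sufficient statistics"
    summary_result = cluster(summaries)        # step 2: cluster the summaries only
    return expand(summary_result, data)        # step 3: estimate the result for the whole data-set
```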
In [9] it was shown that hierarchical clustering algorithms such as the Single-Link method [42] or OPTICS [3] require special information in order to produce high-quality results for small numbers of data summaries. The proposed data summarizations that meet all the requirements for hierarchical clustering were called "Data Bubbles".

Most techniques to compute data summaries, including Data Bubbles, are based on the assumption that the data is from a vector space. Typically, they compute statistics such as the mean of the set of objects, which requires that vector space operations (addition of objects, multiplication with scalar values) can be applied. In many important applications, however, the data is from a non-vector metric space. A metric space is given by a set of data objects B and a distance function D(., .) that satisfies symmetry, reflexivity, and the triangle inequality (for any x, y, z ∈ B, D(x, y) + D(y, z) ≥ D(x, z)) [12]. Metric spaces include some vector spaces such as the Euclidean space, but generally in a metric space the notions of volume and dimensionality from vector spaces are not valid, rendering most mining and searching methods developed for vector data inapplicable.

For summarization in non-vector spaces, the only information that can be utilized is a similarity or a dissimilarity distance function. Here, we assume a distance function to measure the dissimilarities, i.e., we only have information about distances between objects. This makes it difficult, or at least very expensive, to compute the usual sufficient statistics used to summarize vector data. However, having a data summarization method that allows a very fast (even if only approximate) clustering of non-vector data is highly desirable, since the distance functions for some typical and important applications can be extremely computationally expensive (e.g., a sequence alignment for a set of DNA or amino acid sequences).

In Chapter 2, we propose a novel data summarization method that can be applied to non-vector data to produce high-quality "micro-clusters" that efficiently and effectively support hierarchical clustering. The information produced for each data summary is related to and improves upon the information computed for the Data Bubbles proposed in [9], in the sense that accurate estimations of the information needed by hierarchical clustering algorithms are generated (in our experiments, we found that our new version of Data Bubbles even outperforms the original Data Bubble method for some vector data).

While being a significant improvement over naïve sampling, the approach of using data summarization to derive approximate clustering results still suffers from the problem of inadequate accuracy for some important real-life applications. For these methods, clusters with a size smaller than the number of points in the smallest abstract region, represented by a set of sufficient statistics, will typically be lost in the final clustering result. Even clusters that have a large number of points but are close to other clusters could be obscured by bigger clusters in the output, since gaps between clusters can often not be recovered correctly by BIRCH or Data Bubbles [56].

In Chapter 3, we propose a novel approach to perform approximate clustering with high accuracy.
The method is based on the observation that in some clustering algorithms, such as OPTICS and Single-Link, the final clustering result depends largely on the nearest neighbor distances of data objects, which comprise only a very small portion of the quadratic number of pairwise distances between data objects. We introduce a novel pairwise hierarchical ranking to efficiently determine close neighbors for every data object. The clustering is then performed on the original data objects instead of on sample points or sufficient statistics as in the previous methods. Since a naïve pairwise hierarchical ranking may introduce a large computational overhead, we also propose two techniques to significantly reduce this overhead: 1) a frontier search, rather than the sequential scan of the naïve ranking, to reduce the search space; and 2) an approximate frontier search for pairwise ranking that further reduces the runtime. Empirical results on synthetic and real-life data show a speed-up of up to two orders of magnitude over previous approaches.

1.2 Database Support for Efficient Similarity Search in Metric Spaces

The rapid growth of biological sequence and structure databases poses the challenging problem of how to access them efficiently. For a classic example, when scientists study a new biological sequence or structure, they may want to see whether there is any similar sequence or structure in the existing databases, so that they can use the knowledge about the existing objects to infer the properties of the new object. In the case of primary sequences only, fast homology search tools such as BLAST [2] and FASTA [36] may be used for those queries which have very similar sequences already deposited in the database. Nevertheless, neither of them is guaranteed to identify the most similar sequences for an arbitrary query. The main cause is that these homology search tools are largely heuristic, and they typically fail on queries which do not have very similar sequences (measured solely by a single distance function) deposited in the database [31].

Finding the most similar sequences (or structures) in an existing database can be easily modeled as a similarity search problem such as the k-nearest neighbor (k-nn) search studied in the database community [25]. In k-nn search, the basic task is to efficiently retrieve, from a large set of database objects, the k most similar ones to a given query object, measured by a distance (or similarity) function: the smaller the distance between objects, the more similar they are.

Both the large number of database objects and the time-consuming distance computation between two objects can make k-nn search slow, or even prohibitive, in real-life applications. How k-nn search can be done efficiently has been extensively studied in the database and theory communities, and numerous methods have been proposed. However, while the problem has well-known solutions in low-dimensional vector spaces, for example through R-trees [22], an efficient solution is still an elusive goal for k-nn search in many important real-life applications. In these applications, the data objects usually come from very high dimensional vector spaces or general metric spaces in which the data objects behave, from an indexing point of view, like very high dimensional data.
Such databases often have the following characteristic: each data object typically has only a few close neighbors (objects that are similar in a meaningful way), while the other objects are very far away from it at a similar distance (i.e., approximately to the same degree of large dissimilarity). For example, according to our computational experience, when indexing all protein sequences using a global sequence alignment dissimilarity measure, sequences within the same protein family typically form a very tight cluster and have very small distances to each other, while sequences from different families have very large distances. This interesting characteristic of very high dimensional databases renders most indexing structures ineffective. Currently, there is no exact k-nn search method that can handle such databases efficiently [12, 25, 6, 4]. We call this kind of database a "hard-to-index" database.

In these "hard-to-index" databases, the distance function is typically very expensive in terms of CPU time. For instance, in biological databases the distance (or similarity) functions usually involve optimization operations such as sequence or structure alignments. When using some sequence alignment algorithms [52], computing the distance (or similarity) between two genomic sequences of length around 10,000 nucleotides may take seconds on a state-of-the-art PC. This property makes the naïve solution to k-nn search, which computes the distances between the query and all database objects, unacceptable for interactive applications that require quick response times or data mining applications that require very large numbers of k-nn searches. Note in particular that a modern database, such as GenBank or the Protein Data Bank, may contain thousands or even millions of objects.

Consequently, the primary goal of an efficient k-nn search in these "hard-to-index" databases is to reduce the number of distance computations. For that purpose, many applications have regulated their expensive distance function so that the distance function and the database together form a metric space. In such a space, the triangle inequality (for any x, y, z ∈ B, D(x, y) + D(y, z) ≥ D(x, z)) is the fundamental mechanism to estimate the distance between two data objects, which can be done in time that is often negligible compared to a distance computation. In many existing search methods, the distances between (all or a portion of) the data objects and a set of pre-selected reference data objects are pre-computed. At query time, the distance D(q, p) between the query object q and a reference object p is computed first. Then, the triangle inequality on the triple ⟨o, p, q⟩ can be used to derive a lower bound and an upper bound for the distance D(q, o) between the query object and an arbitrary non-reference object o:

|D(q, p) − D(o, p)| ≤ D(q, o) ≤ D(o, p) + D(q, p).    (1.1)

The pre-selected reference data objects, such as the above object p, are called pivots. When there is more than one pivot, the combination of the bounds estimated from all pivots in a set P using Formula (1.1) leads to the largest lower bound

l_{D(o,q)} = max_{p ∈ P} |D(q, p) − D(o, p)|    (1.2)

and the smallest upper bound

u_{D(o,q)} = min_{p ∈ P} (D(q, p) + D(o, p))    (1.3)

for D(q, o). These bounds can be used to prune away objects that are not within a certain distance threshold from the query object q, thus avoiding the computation of their distances to q.
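To make the use of these bounds concrete, the following is a minimal sketch of pivot-based pruning for a range query (a simpler setting than the k-nn search discussed in this thesis). The names range_query, pivot_dist, and dist are ours and purely illustrative; this is not one of the methods proposed here.

```python
def range_query(q, objects, pivots, dist, pivot_dist, radius):
    """Range search sketch using the pivot bounds of Formulas (1.1)-(1.3).

    pivot_dist[p][o] holds the pre-computed distance D(p, o) for every pivot p
    and database object o; dist(x, y) is the expensive distance function.
    """
    d_q = {p: dist(q, p) for p in pivots}   # one expensive computation per pivot
    result = []
    for o in objects:
        # Largest lower bound (1.2) and smallest upper bound (1.3) for D(q, o).
        lower = max(abs(d_q[p] - pivot_dist[p][o]) for p in pivots)
        upper = min(d_q[p] + pivot_dist[p][o] for p in pivots)
        if lower > radius:           # pruned: D(q, o) cannot be within the radius
            continue
        if upper <= radius:          # accepted: D(q, o) is guaranteed within the radius
            result.append(o)
            continue
        if dist(q, o) <= radius:     # undecided: pay for the expensive distance
            result.append(o)
    return result
```

The same bounds drive k-nn search, with the query radius replaced by the distance of the current k-th best candidate.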
The general goal during query processing in metric data is to use the triangle inequality, with pre-computed distances and distances computed during the query stage, to exclude or prune objects from the set of possible answers. Essential for the effectiveness of such indexing schemes is the ability to produce only a small set of candidates during the pruning stage of the query processing, without computing too many distances. Objects that cannot be excluded from the result of a query using triangle inequalities have to be loaded from disk, and the distances between the query and these candidate objects have to be computed. We will see that for hard-to-index metric data-sets, traditional pruning techniques using a relatively small set of fixed pivots cannot be effective, even if the data satisfies the condition that each object has only a small number of close neighbors while those groups of close objects are far away from each other, as in the given application domains.

Our contributions to this k-nn search problem are the following. We propose a new method for k-nn search in hard-to-index metric data that significantly outperforms previous approaches. The method is based on the novel concepts of virtual pivots and partial pivots. In Chapter 4, we first analyze the pruning ability of pivots and the resulting limitations of existing approaches. We then introduce the novel concept of a virtual pivot, which allows us to select virtually any data object as a pivot. We show formally that a single virtual pivot can have the same pruning power as a large set of fixed pivots, and we propose a query-dependent, dynamic virtual pivot selection method using ranking. Dynamic virtual pivot selection effectively tightens both the lower and the upper bound estimations of unknown distances during query processing, whereas most traditional schemes of triangle inequality pruning focus only on tightening the lower bound estimations. The space and pre-computation overhead of our method is small and similar to approaches using only a relatively small number of fixed pivots. In Chapter 5, we introduce the concept of partial pivots, further extending the concept of virtual pivots to use every data object as a pivot without suffering from a quadratic number of distance computations. Our partial pivots method is based on the pairwise ranking technique developed in Chapter 3, and is shown to outperform virtual pivots in our testing. An extensive experimental evaluation on real-life data-sets, comparing against several of the best-known k-nn search methods including M-Tree [13], OMNI [20], SA-Tree [34], and LAESA [37], confirms that on a database of 10,000 gene sequences our new method uses no more than 40% of the distance computations per query required by the next best method.

Part I
Speed-up Clustering

Chapter 2
Speed-up Clustering with Data Summarization

To speed up clustering algorithms, data summarization methods have been proposed, which first summarize the data-set by computing suitable representative objects. A clustering algorithm is then applied to these representatives only, and a clustering structure for the whole data-set is derived based on the result for the representatives. Most previous summarization methods are, however, limited in their application domain. They are, in general, based on sufficient statistics such as the linear sum of a set of points, which assumes that the data is from a vector space.
On the other hand, in many important applications, the data is from a non-vector metric space, and only distances between objects can be exploited to construct effective data summarizations. In this chapter, we develop a new data summarization method based only on distance information that can be applied directly to non-vector metric data. (Some of the material in this chapter has been published in [56].) An extensive performance evaluation shows that our method is very effective in finding the hierarchical clustering structure of non-vector metric data using only a very small number of data summarizations, thus resulting in a large reduction of runtime while trading only very little clustering quality.

The rest of this chapter is organized as follows. We briefly review related work in Section 2.1. Section 2.2.1 presents the necessary background regarding the original Data Bubbles technique for vector data and the OPTICS clustering algorithm. Section 2.2.2 discusses the problems that arise when trying to generate summary information for sets of non-vector metric data and introduces our new method. The experimental evaluation in Section 2.3 shows that our method not only allows very effective and efficient hierarchical clustering of non-vector metric data, but even outperforms the original Data Bubbles method when applied to vector data. Section 2.4 summarizes the chapter.

2.1 Related Work

The most basic method to speed up expensive data mining algorithms such as hierarchical clustering is random sampling: only a subset of the database is randomly chosen, and the data mining algorithm is applied to this subset instead of to the whole database. Typically, if the sample size is large enough, the result of the data mining method on the samples will be similar enough to the result on the original database.

More specialized data summarization methods have been developed to support clustering algorithms in particular. For k-means type clustering algorithms, summary statistics called "clustering features", originally introduced for the BIRCH method [54], have been used by different approaches. BIRCH incrementally computes compact descriptions of subclusters, called Clustering Features, which are defined as CF = (n, LS, ss), where LS is the linear sum (Σ_{i=1..n} x⃗_i) and ss the square sum (Σ_{i=1..n} x⃗_i²) of the n points in the subcluster represented by the clustering feature CF. The CF-values are sufficient to compute information like the centroid, radius, and diameter of a set of points. They also satisfy an additivity condition that allows the incremental computation of CF-values when inserting points into a set: if CF1 = (n1, LS1, ss1) and CF2 = (n2, LS2, ss2) are the CFs for sets of points S1 and S2 respectively, then CF1 + CF2 = (n1 + n2, LS1 + LS2, ss1 + ss2) is the clustering feature for the union of the points in S1 and S2, S1 ∪ S2.
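As an illustration of the (n, LS, ss) statistics and their additivity, the following minimal sketch (our own naming, not BIRCH's actual implementation) shows how the centroid and one common radius definition can be derived from a clustering feature:

```python
from dataclasses import dataclass
import math

@dataclass
class CF:
    """Clustering feature (n, LS, ss) for a set of d-dimensional points."""
    n: int
    LS: list          # linear sum, one entry per dimension
    ss: float         # square sum over all points and dimensions

    @staticmethod
    def of_point(x):
        return CF(1, list(x), sum(v * v for v in x))

    def __add__(self, other):
        # Additivity condition: CF1 + CF2 summarizes S1 ∪ S2.
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.LS, other.LS)],
                  self.ss + other.ss)

    def centroid(self):
        return [v / self.n for v in self.LS]

    def radius(self):
        # Root mean squared distance of the points to the centroid.
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(v * v for v in c), 0.0))
```

For example, CF.of_point((1.0, 0.0)) + CF.of_point((3.0, 0.0)) yields the feature (2, [4.0, 0.0], 10.0), whose centroid is (2.0, 0.0).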
The CFs are organized in a balanced tree with branching factor b and a threshold t, where a non-leaf node represents all objects in the sub-tree rooted at this node. A leaf node has to contain at most l entries, and the diameter of each leaf node has to be less than t. The generation of a CF-tree is similar to the construction of B+-trees: a point p is inserted into the tree by first finding the leaf in the current CF-tree that is closest to p. If an entry in the leaf can absorb p without violating the threshold condition, p is inserted into this entry and the corresponding CF-value is updated. If p cannot be inserted into an existing entry, a new entry is created in the leaf node. This may lead to an overflow of the leaf node, causing it (and possibly its ancestors) to be split in a similar fashion as in B+-trees. A clustering algorithm is then applied to the entries in the leaf nodes of the CF-tree.

In [8], a very specialized compression technique for scaling up k-means and EM clustering algorithms [24] is proposed. This method basically uses the same type of sufficient statistics as BIRCH, i.e., triples of the form (n, LS, ss). The major difference is only that different sets of data items (points or summaries) are treated and summarized independently: points that are unlikely to change cluster membership in the iterations of the clustering algorithm, data summaries that represent tight subclusters of data points, and a set of regular data points which contains all points that cannot be assigned to other data summarizations.

In [16], a general framework for "squashing" data is proposed, which is intended to scale up a large collection of data mining methods. This method is based on partitioning the dimensions of the data space and grouping the points into the resulting regions. For each region, a number of moments are calculated, such as the mean, minimum, maximum, second-order moments such as X_i² or X_i·X_j, and higher-order moments depending on the desired degree of approximation. Squashed data items are then created for each region in such a way that the moments of the squashed items approximate those of the original data falling into the region. This information can also be used to compute clustering features as above for each constructed region in order to speed up k-means type clustering algorithms.

In [9] it was also proposed to compute sufficient statistics of the form (n, LS, ss) based on a random sample by partitioning the data-set using a k-nearest neighbor classification. This method has several advantages over, for instance, the CF-tree: the number of representative objects for a data-set can be determined exactly, and no other heuristic parameters such as a maximum diameter or a bin size have to be used in order to restrict the number of partitions that are represented by triples (n, LS, ss). The method was proposed as follows:

• Draw a random sample of size s from the database to initialize s sufficient statistics.
• In one pass over the database, classify each object o to the sampled object p to which it is closest, and incrementally add o to the sufficient statistics initialized by p, using the additivity condition given above.

Results in [9] show that the quality of the sufficient statistics obtained by random sampling is much better than that of the CF-values produced by BIRCH, when used to generate the additional information that is needed to get satisfactory results with hierarchical clustering algorithms. The runtime to generate those CF-values using a CF-tree is also significantly larger and makes it almost impossible to beat even a naïve sampling approach to speed up clustering, given the same resources. If it takes too much time to generate data summarizations, naïve sampling may just use a larger sample and obtain superior results with a much less complex implementation.

The only other data summarization method for non-vector metric data that we are aware of is presented in [21], and is based on BIRCH. The authors suggest a generalization of the BIRCH tree that has two variants, BUBBLE and BUBBLE-FM, for metric data.
Both variants keep a number of representatives for each leaf node entry in order to approximate the most centrally located object in a CF-tree leaf. In non-leaf level entries, both methods keep a certain number of sample objects from the sub-tree rooted at that entry in order to guide the search process when building the tree. The basic difference between BUBBLE and BUBBLE-FM is that for BUBBLE-FM the sample points in the non-leaf node entries are mapped to a d-dimensional Euclidean vector space using FastMap [19]. The image space is then used to determine distances between new objects and the CFs, thus replacing possibly expensive distance calculations in the original space by Euclidean distance computations. We argue that this approach has similar drawbacks as the vector version, and we will therefore base our current work for non-vector metric data on a sampling-based approach to produce data summarizations.

2.2 Speed-up Clustering with Data Summarization

In this section, we propose a novel data summarization method that can be applied to non-vector metric data to produce high-quality "micro-clusters" that efficiently and effectively support hierarchical clustering. The information produced for each data summary is related to and improves upon the information computed for the Data Bubbles proposed in [9], in the sense that accurate estimations of the information needed by hierarchical clustering algorithms are generated.

2.2.1 Data Bubbles for Euclidean Vector Data

In this subsection, we briefly review the notion of Data Bubbles for Euclidean vector spaces as proposed in [9]. We discuss the special requirements that hierarchical clustering algorithms such as the Single-Link method and OPTICS pose on data summarization methods, and we illustrate the advantages of Data Bubbles.

While simple statistics, such as the clustering features produced by BIRCH, are effective for k-means type clustering algorithms, they are typically not sufficient to produce good results with a hierarchical clustering algorithm. The main reason is that hierarchical clustering algorithms are based on the distances between sets of data points, which are not represented well by the distances between only the representative objects, especially when the compression rate increases. This type of error typically results in a very distorted clustering structure based on data summaries. The Data Bubbles in [9] were proposed to solve those problems, showing that a data summarization method, in order to support hierarchical clustering, has to take into account the extension and the point density of the data subset being represented.

Basic Definitions

A Data Bubble was defined in [9] as follows:

DEFINITION 1. A Data Bubble for a set of points X = {Xi}, 1 ≤ i ≤ n, is a tuple BX = (rep, n, extent, nnDist), where
• rep is a representative object for X, which is assumed to be close to the center (e.g. the centroid) of the set X;
• n is the number of objects in X;
• extent is the radius of BX around rep that encloses "most" of the objects in X;
• nnDist(k, BX) is a function that estimates the average k-nearest neighbor distances in BX.

For d-dimensional points from a Euclidean vector space, the representative rep, the extent (radius) of the Data Bubble, and the k-nearest neighbor distances nnDist(k, B) can easily be estimated using simple sufficient statistics, which can be incrementally computed during the initial construction of the Data Bubbles.
The representative rep is simply computed as the mean of the set of objects in X, i.e.,

rep = (Σ_{i=1..n} Xi) / n.

The radius of BX around rep can be estimated by the average pairwise distance within BX, i.e.,

extent = sqrt( Σ_{i=1..n} Σ_{j=1..n} (Xi − Xj)² / (n(n−1)) ).

This expression can in turn be computed from the simpler statistics linear sum LS and square sum ss of all objects in X. LS and ss can be incrementally maintained when constructing a Data Bubble (as in the construction of clustering features CF in the BIRCH algorithm). Using these two values, the extent can be calculated as

extent = sqrt( (2·n·ss − 2·LS²) / (n(n−1)) ).

The average k-nearest neighbor distances can be estimated by a simple arithmetic expression, assuming a uniform distribution of the objects within a Data Bubble [9]:

nnDist(k, BX) = (k/n)^(1/d) · extent.

Application to Hierarchical Clustering

Hierarchical clustering algorithms compute a hierarchical representation of the data-set, which reflects its possibly nested clustering structure. The hierarchical clustering algorithm OPTICS is based on the notions of core distance and reachability distance for objects with respect to parameters eps and minPts. For any point p, its core distance is equal to its minPts-nearest neighbor distance if this distance is no greater than eps, or infinity otherwise. The reachability of p with respect to another point o is the greater of the core distance of p and the distance between p and o. The parameter minPts allows the core-distance and reachability-distance of a point p to capture the point density around that point.

Using these distances, OPTICS computes a "walk" through the data-set, and assigns to each object p its core distance and the smallest reachability distance reachDist with respect to an object considered before p in the walk. The algorithm starts with an arbitrary object, assigning it a reachability distance equal to ∞. The next object o in the output is then always the object that has the shortest reachability distance d to any of the objects that were "visited" previously by the algorithm. This reachability value d is assigned to this object o. The output of the algorithm is a reachability plot, which is a bar plot of the reachability values assigned to the objects in the order they were visited. An example reachability plot for a 2-dimensional data-set is depicted in Figure 2.1. Such a plot is interpreted as follows: "valleys" in the plot represent clusters, and the deeper the "valley", the denser the cluster. The tallest bar between two "valleys" is a lower bound on the distance between the two clusters. High bars in the plot that are not at the border of a cluster represent noise, and "nested valleys" represent hierarchically nested clusters (reachability plots can also be converted into dendrograms [39]).

Figure 2.1: Example reachability plot. (a) Data-set; (b) reachability plot.

Clusters in a hierarchical clustering representation are in general obtained manually (e.g., by cutting through the representation). This process is typically guided by a visual inspection of the diagram, which is why a correct representation of the clustering structure is very important, especially when applying the algorithm to data summarizations instead of the whole data-set.
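The estimations given earlier in this subsection can be read directly off the sufficient statistics. The following is a minimal sketch (our own illustration, not the implementation used for the experiments in this thesis) of a vector-space Data Bubble maintained incrementally from (n, LS, ss):

```python
import math

class VectorDataBubble:
    """Data Bubble for d-dimensional vector data, built from (n, LS, ss)."""

    def __init__(self, dim):
        self.n = 0
        self.LS = [0.0] * dim      # linear sum
        self.ss = 0.0              # square sum
        self.dim = dim

    def add(self, x):
        """Incrementally absorb one point (additivity of the statistics)."""
        self.n += 1
        self.LS = [a + b for a, b in zip(self.LS, x)]
        self.ss += sum(v * v for v in x)

    def rep(self):
        """Representative: the mean of the points."""
        return [v / self.n for v in self.LS]

    def extent(self):
        """extent = sqrt((2*n*ss - 2*|LS|^2) / (n*(n-1))), defined for n >= 2."""
        if self.n < 2:
            return 0.0
        ls_sq = sum(v * v for v in self.LS)
        return math.sqrt(max(2 * self.n * self.ss - 2 * ls_sq, 0.0)
                         / (self.n * (self.n - 1)))

    def nn_dist(self, k):
        """Estimated average k-nearest neighbor distance, (k/n)^(1/d) * extent."""
        return (k / self.n) ** (1.0 / self.dim) * self.extent()
```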
In [9] it has been shown that using the distance between representatives and the extent of Data Bubbles for vector data, a distance between Data Bubbles can be computed that dramatically improves the result of hierarchical clustering compared to using only the distance between representatives. This notion of distance that is aware of the extent of Data Bubbles is depicted in Figure 2.2. If the Data Bubbles do not overlap, it is basically the distance between the “borders” of the Data Bubbles (distance between representative objects of the Data Bubbles minus the extents of the Data Bubbles plus the 1-nearest neighbor distances of the Data Bubbles); otherwise, if they overlap, it is the estimated nearest neighbor distance of the Data Bubble that has the larger nn-distance. The second important issue in hierarchical clustering of data summarizations is the adaptation of the graphical result. The reason is that the Data Bubbles typically represent sets of objects that contain a significantly different number of objects, and that can have largely differing point densities. Including only the representative of a Data Bubble in the hierarchical output representation will most often lead to a very distorted picture of the true clustering structure of a data-set. Therefore, for OPTICS, the bar for each Data Bubble in the reachability plot is expanded using the so called “virtual reachability”. More precisely, for each Data Bubble representing n points, n bars are added to the reachability plot. The height of each bar is calculated as the virtual reachability of the Data Bubble, which corresponds to the estimated average reachability distance for points within the Data Bubble (basically the estimated minP ts-nearest neighbor distance). Other hierarchical algorithms such as the Single-Link method can be similarly adapted to work with Data Bubbles. 15 2.2.2 Data Bubbles for Non-Vector Spaces The only information from a non-vector space that can be utilized is the distance function, i.e., distances between objects. Therefore, it is difficult in such spaces to get an accurate and at the same time computationally inexpensive estimation for the important components defined for the original Data Bubbles. We cannot compute new “artificial” objects such as a centroid, which is guaranteed to be in the center of the respective set, and hence would be the best representative for the objects in a Data Bubble. We also cannot compute statistics like the linear sum or the square sum of the objects that would allow us to incrementally maintain a good estimation of the radius of a Data Bubble around the representative. Similarly, there is no inexpensive or incremental way to compute an estimation of the average k-nearest neighbor distances in a Data Bubble. For these reasons the original definition of Data Bubbles has to be significantly changed and adapted to the particular problems of non-vector metric spaces. The main purpose of Data Bubbles is to support effective and highly efficient hierarchical clustering based on the summary information provided by the Data Bubbles. The representative, the extent, and the average k-nearest neighbor distances of a Data Bubble serve only the purpose of defining effective distance notions for hierarchical clustering. For the algorithm OPTICS, which we will use to evaluate our method, these notions are: • The notion of a distance between Data Bubbles, which has to “be aware” of the extent of the Data Bubbles. 
This is the most important notion for effective hierarchical clustering, because the distances between Data Bubbles will determine the shape of the cluster result. • The core-distance of a Data Bubble, which is also used to define the “virtual reachability” for the objects represented by the Bubble. • The reachability-distance of a Data Bubble relative to another Data Bubble, which is needed during the execution of OPTICS. The appropriateness of the reachabilitydistance is dependent on the previous two notions, since it is defined using only coredistance and the distance between Data Bubbles. Errors in estimating a representative, the extent, or the average k-nearest neighbor distances will lead to errors when computing the above distances, which in turn will produce errors in the clustering result using Data Bubbles. To make things worse: errors for different components in a Data Bubble may depend on and amplify each other, e.g., an error in 16 the estimation of the representative will obviously lead to an increased error in the extent around the representative, if we keep the original definition of extent as a radius around the representative that contains most of the objects of the Data Bubble. In the following subsections we will analyze these problems and propose a new and more suitable version of Data Bubbles that solves these problems. In order to discuss the problems, we will assume the following minimal procedure to generate k Data Bubbles for non-vector metric data (the complete method will be given later in this section): 1. Sample k objects from the database randomly. 2. Assign each object in the database to the closest sample object in the set of objects obtained in step 1. This means that using this procedure, the only information we can utilize in our computation of data summarizations are the k objects drawn from the database in step 1 (they may be used, for instance, as candidates for representative objects), and the distances of all objects to all of the sample objects from step 1. These distances have to be computed anyway to determine the closest representative for each object. Representative Objects In a non-vector metric space the representative object for a Data Bubble has to be an object from the Data Bubble itself, since we cannot compute a centroid for the set of objects. Theoretically, the best representative for a set of objects in a non-vector metric space is a medoid, i.e., an object that is located most centrally in the set of objects, in the sense that its overall distance to all other objects in the Data Bubble is minimal. More formally: DEFINITION 2. A medoid for a set of objects X is an object m ∈ X such that for all p ∈ X: X o∈X D(m, o) ≤ X D(p, o). o∈X A medoid, although it seems to be the best choice of a representative, has a severe drawback: determining a medoid for a set of n objects is computationally expensive (O(n2 )), since all pairwise distances have to be computed. Because we want to use very high compression rates in practice (i.e., only a very small number of Data Bubbles, and hence a very large number of objects represented by one Data Bubble on average), it is not feasible to determine a medoid for a Data Bubble with this exhaustive search method. The same number 17 of computations could be better used to cluster a larger subset of objects directly without generating Data Bubbles. 
Using our minimal procedure to construct data summarizations, there are basically three alternatives to determine representative objects for a Data Bubble more efficiently but less optimally, all with advantages and disadvantages:

1. "Initial sample object": keep the initial sample object that is used to generate a Data Bubble as the representative of the Data Bubble.

2. "Relocation using a sub-sample": after the generation of the Data Bubble, take a small sub-sample from the Data Bubble (including the initial sample object), and determine an exact medoid only in this subset.

3. "Maintaining several candidates": while generating the Data Bubble, keep a number of objects as potential representatives in main memory (e.g., the first m objects assigned to the Data Bubble). When assigning objects, compute and sum up distances not only to the initial sample object but also to the additional candidates in the Data Bubble. After the generation of the Data Bubble, select the candidate with the smallest sum of distances.

The first alternative, keeping the initial sample object, is the least expensive, since no additional computation is necessary. But it is also the alternative with the largest error. The quality of the representatives found by the second alternative obviously depends on the size of the sub-sample drawn from the Data Bubble. Our experiments show, however, that a 5% sub-sample of a Data Bubble will result in representatives that are only slightly better approximations of a true medoid than the first approach, and the effect on the quality of the clustering result is not significant. Taking larger sub-samples, however, is also too expensive, in the same sense as the exhaustive method: instead of taking a sub-sample to relocate the representative, we could use a larger sample without bubble generation to improve the clustering result. Alternative 3, i.e., maintaining several candidates during the generation of the bubbles, has basically the same properties as alternative 2.

Note that, even in the best case, i.e., if we could get an exact medoid for the whole set of objects in a Data Bubble, we may produce noticeable errors in the clustering result, because there are limits to the accuracy of a medoid as being in the center of the set of objects that it represents. This is in fact a drawback for any representative object that has to be an element of the set itself (as opposed to a computed mean in the case of vector data). Figure 2.3 depicts a case where the data-set does not contain an object close to the center of the set. Due to this drawback, even the best selection of a representative for a Data Bubble may result in an error when estimating the extent of a Data Bubble, and consequently in the distance between Data Bubbles, to a degree that would not occur for vector Data Bubbles.

Figure 2.3: No object is close to the center of the set.

Using any of the three alternatives, and keeping the original definition of a Data Bubble, we cannot guarantee that our representative will be close enough to the "center" of the data-set to lead to small errors. On the other hand, having a representative close to the center of a Data Bubble is not an objective in its own right for hierarchical clustering. Only the distance notions for Data Bubbles listed above are important. As we will see in the next subsection, we can in fact compensate for a less centered representative by applying a new and much more sophisticated distance function for Data Bubbles. Representatives that are not close to the center of a data-set will only lead to an error in the clustering result when using the original idea of the extent of a Data Bubble around a representative and the original definition of distance that is based on this extent. Therefore, we choose alternative 1 and keep the initial sample object as the representative of a non-vector metric Data Bubble, which has no computational overhead.

Average k-nn-Distances, Core-Distance, and Virtual Reachability Distance

The estimation of the average k-nearest neighbor distances nnDist(k, B) for a Data Bubble B is closely related to the core-distance and the virtual reachability distance of B. The nearest neighbor distance is also used in the original definition of the distance between Data Bubbles. Because there is no notion of volume and dimensionality in a non-vector metric space, we cannot apply a simple function to calculate the average k-nearest neighbor distances as in a vector space. When constructing Data Bubbles for non-vector metric data, we have similar alternatives for estimating the average k-nearest neighbor distances as we have for the selection of a representative object. Using a sub-sample of the objects in a Data Bubble and computing the k-nearest neighbor distances only in this sub-sample is,
Representatives that are not close to the center of a data-set will only lead to an error in the clustering result when using the original idea of extent of a Data Bubble around a representative and the original definition of distance that is based on this extent. Therefore, we choose alternative 1 and keep the initial sample object as the representative of a non-vector metric Data Bubble, which has no computational overhead. Average k-nn-Distances, Core-Distance, and Virtual Reachability Distance The estimation of the average k-nearest neighbor distances nnDist(k, B) for a Data Bubble B is closely related to the core-distance and the virtual reachability distance of B. The nearest neighbor distance is also used in the original definition of the distance between Data Bubbles. Because there is no notion of volume and dimensionality in a non-vector metric space, we cannot apply a simple function to calculate the average k-nearest neighbor distances as in a vector space. When constructing Data Bubbles for non-vector metric data, we have similar alternatives to determine an estimation of the average k-nearest neighbor distances as we have for the selection of a representative object. Using a sub-sample of the objects in a Data Bubble and computing the k-nearest neighbor distances only in this sub-sample is, 19 however, not an option: they would very likely be highly overestimated because the point density of samples is usually much smaller than the density of original data points, as the samples are only a small portion of the original data-set. 1. “k-nn distances w.r.t. the initial sample object”: when assigning objects to Data Bubbles, maintain a list of the k smallest distances relative to each initial sample object. For each Data Bubble, simply use the k-nn distances to its representative as the estimation of the average k-nn distance in the whole Data Bubble. 2. “k-nn distances w.r.t. to several reference objects”: keep a number of objects from a Data Bubble in main memory (e.g. the first m objects assigned to the Data Bubble) and compute distances to these objects for all objects that are assigned to the Data Bubble. For each of the reference objects, maintain a list of the k smallest distances. After the generation of the Data Bubble, compute an estimation of the average knearest neighbor distances by using those values. As for the selection of the representative objects, the first alternative has no significant computational overhead since the distances to the initial sample objects have to be computed anyway. The computational cost and the improvement in the estimation of the k-nn distances for the second alternative depend on the number of reference objects that are kept in main memory. As before, if we keep too many reference objects, the gain in accuracy will not outweigh the increased number of distance computations; for the same number of additional distance computations, we may be able to get a better result by just taking a larger sample size to begin with. The important question regarding the average k-nn distances in a Data Bubble is: for which values of k do we need the estimation and how accurate do they have to be? The most important use of the k-nn distances is for estimating the core-distance of a Data Bubble. The core-distance also defines the virtual reachability value for a Data Bubble, which is used when “expanding” a Data Bubble in the clustering output. 
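A minimal sketch of one reading of alternative 1, which we adopt below: while objects are assigned to their closest sample object, a bounded max-heap per representative keeps only the k smallest distances seen during assignment. The function and parameter names are illustrative, not from the thesis.

```python
import heapq

def assign_and_track_knn(objects, representatives, dist, k):
    """Assign every object to its closest sample object and, per sample
    object, keep only the k smallest distances observed during assignment
    (a bounded max-heap of negated distances). The sorted result later
    serves as the estimate of nnDist(k, B) for each bubble B."""
    members = [[] for _ in representatives]
    heaps = [[] for _ in representatives]              # max-heaps of -distance
    for o in objects:
        dists = [dist(o, r) for r in representatives]  # computed anyway
        best = min(range(len(representatives)), key=dists.__getitem__)
        members[best].append(o)
        d, heap = dists[best], heaps[best]
        if len(heap) < k:
            heapq.heappush(heap, -d)
        elif d < -heap[0]:
            heapq.heapreplace(heap, -d)
    nn_dists = [sorted(-x for x in h) for h in heaps]
    return members, nn_dists
```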
Typically, we don't want to use values of minPts that are too small, in order to avoid the Single-Link effect and to reduce the effect of noise (see [3] for details). In practice, we mostly use values that are significantly larger than 5 for larger data-sets, and we may consider minPts values in the range of 5 only for relatively small data-sets. To estimate the core-distance and virtual reachability distance of a Data Bubble, we therefore only need k-nn distances for the larger values of k = minPts that we want to use for clustering. Fortunately, the larger values of the average k-nn distance in a Data Bubble can be estimated with small errors using only the distances to the initial sample object. In fact, the larger k, the more accurate the estimation using only the initial sample object (or any other reference object, or the average over several reference objects) will be. Only for very small values of k, especially for k = 1, is the actual value nnDist(1, B) for most of the objects in B quite different from the average nearest neighbor distance. The nearest neighbor distance in a Data Bubble B, nnDist(1, B), is only used in the original definition of the distance between Data Bubbles, which we will not use for non-vector metric data for other reasons. Therefore, we don't need the more error-prone estimation of nnDist(k, B) for very small values of k. And, since the use of only a few reference objects does not significantly improve the result for the larger values of k, we again choose the more efficient alternative 1 to estimate the k-nn distances (up to the maximum value of minPts), i.e., we use only the initial sample objects and the distances that we have to compute in the construction of Data Bubbles anyway.

Using the estimated k-nearest neighbor distances, the core-distance and the virtual reachability distance of a Data Bubble (the latter being the distance needed for the expansion of the reachability plot after clustering) are then defined similarly as in [9]:

DEFINITION 3. Let B be a Data Bubble. The virtual reachability and core-distance of B are defined using the estimated k-nn distances, nnDist(k, B), as follows:

virtualReachability(B) = coreDist(B) = nnDist(minPts, B).

The Distance Between Data Bubbles

The distance between Data Bubbles in [9] is based on the extent of the Data Bubbles, as illustrated above in Figure 2.2. The purpose of the extent of a Data Bubble is to be able to define the distance between Data Bubbles as the distance between their borders, which are approximated by the extents. However, the extent as the average pairwise distance in a Data Bubble is expensive to estimate, since there are no supporting statistics that could be collected incrementally while constructing a Data Bubble. The only option to get an estimation of the average pairwise distance would be to draw a sub-sample of objects from a Data Bubble and compute all pairwise distances in this sub-sample. The accuracy of this approach depends on the size of the sub-sample. The resulting value could be used as a radius around the representative within which most of the objects of the Data Bubble are supposed to be located. Since this is the intended interpretation of the extent, we could alternatively use only the distances to the representative and incrementally maintain a distance around the representative such that "most" objects of the Data Bubble fall inside this radius around the representative (similar to maintaining the k-nn distances).
The second alternative for estimating a global extent is much more efficient but also much more error-prone, since it is very sensitive to outliers. To work properly, both approaches have to assume (in addition to having a small error) that the representative is close to the center, which is difficult to guarantee in a non-vector metric space. In fact, errors in choosing representatives and errors in the estimation of the extent amplify each other, resulting in large errors in the clustering result, because the distances between Data Bubbles will be heavily distorted.

As a solution, we propose a new definition of the distance between Data Bubbles, which is based on simple statistics that use only the distances between objects and sample objects (which are computed when constructing a Data Bubble anyway). All the needed notions can be maintained incrementally and without significant computational overhead. Conceptually, in order to compensate for an error in the selection of the representatives, we want to distinguish the extent of a Data Bubble around its representative in different directions - "direction" being defined using only distances between the representative and other objects. For instance, if a representative is not centered well in a Data Bubble, the distances to the "border" of the Data Bubble may be very different in different "directions". Figure 2.4 illustrates this concept using a 2-dimensional Data Bubble B where the extent e1 from the representative rB in the direction of object o1 is much smaller than the extent e2 in the direction of object o2. The notions required to formalize these intuitions are introduced in the following. Please note that all concepts are defined without any reference to vector space properties or operations, and that although we use 2-dimensional point data to illustrate the concepts, the notions are based solely on distances between objects.

Figure 2.4: "Directional extent" of a Data Bubble.

In order to define more accurate distances between Data Bubbles, the goal is to find a more accurate representation of the "border" of a Data Bubble. However, we only need to know the distance between the representative and the border, i.e., the extent of a Data Bubble, in the directions of the (representatives of) other Data Bubbles. Intuitively, given any two Data Bubbles, A and B, and their representatives, rA and rB, we can divide the Data Bubble B into two parts with respect to Data Bubble A: one part contains the objects in B that are "in the direction of A" in the sense that the distance between them and the representative of A, rA, is smaller than the distance between the two representatives rA and rB; the second part of B contains the other objects, which are said to be "in the reverse direction of A". Formally:

DEFINITION 4. Let A and B be two sets of objects, represented by rA and rB, respectively.

• Bubble(B).InDirection(A) = {o ∈ B | D(o, rA) ≤ D(rA, rB)}. For each object o ∈ Bubble(B).InDirection(A) we say that o lies in the direction of A.

• Bubble(B).InRevDirection(A) = {o ∈ B | D(o, rA) > D(rA, rB)}. For each object o ∈ Bubble(B).InRevDirection(A) we say that o lies in the reverse direction of A.

Figure 2.5 illustrates these notions: all objects o ∈ B that lie inside the circle with center rA and radius equal to the distance between rA and rB are in the direction of A, i.e., in Bubble(B).InDirection(A); objects o′ ∈ B that lie outside the circle are in the "reverse" direction of A, i.e., in Bubble(B).InRevDirection(A).

Figure 2.5: Illustration of direction and reverse direction.
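As an illustration (a sketch with hypothetical names; in the actual construction the distances D(o, rA) are already available from the assignment step and would not be recomputed), the split of Definition 4 amounts to a simple threshold test:

```python
def split_by_direction(B_objects, r_A, r_B, dist):
    """Split the objects of bubble B into the part lying 'in the direction
    of' bubble A and the part lying 'in the reverse direction of' A
    (Definition 4): o is in the direction of A iff D(o, r_A) <= D(r_A, r_B)."""
    threshold = dist(r_A, r_B)
    in_dir = [o for o in B_objects if dist(o, r_A) <= threshold]
    in_rev = [o for o in B_objects if dist(o, r_A) > threshold]
    return in_dir, in_rev
```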
Following a similar intuition, the next notion we define is the border distance of a Data Bubble in the direction of another Bubble:

DEFINITION 5. Let A and B be two sets of objects, represented by rA and rB, respectively. The border distance of B in the direction of A is defined as

Bubble(B).borderDistInDirection(A) = D(rA, rB) − min_{o∈B} D(o, rA).

The border distance of B in the direction of A is thus the distance between the two representatives minus the distance between the representative of A, rA, and the object o in B that is closest to rA. Figure 2.6 illustrates our estimation of the border distance of a Data Bubble B in the direction of another Bubble A.

Figure 2.6: Illustration of border distance.

A consequence of our definition of border distance is that - in contrast to what can happen for the original Data Bubbles - the extents of two Data Bubbles can never overlap, i.e., the distance from a representative rB of a Data Bubble B to its "border", in the direction of a Data Bubble A with representative rA, can never be larger than half the distance between the two representatives:

LEMMA 1. Given two Data Bubbles A and B with representatives rA and rB, respectively, let Bubble(B).borderDistInDirection(A) be the border distance of B in the direction of A. If the distance function is a metric, i.e., satisfies the triangle inequality, then Bubble(B).borderDistInDirection(A) ≤ D(rA, rB)/2.

Proof. Suppose the border distance is greater than D(rA, rB)/2. It follows that D(o, rA) < D(rA, rB)/2, where o = argmin_{o∈B} D(o, rA). And because o ∈ B, by the construction of B it must hold that D(o, rB) ≤ D(o, rA). But then it follows that D(o, rA) + D(o, rB) ≤ 2D(o, rA) < D(rA, rB). This inequality violates the assumption that the triangle inequality holds. Hence Bubble(B).borderDistInDirection(A) ≤ D(rA, rB)/2.

Our definition serves well in a "normal" situation of well-separated bubbles as depicted in Figure 2.7, where the distance between the "borders" of the bubbles gives a good estimate of the true distance between the point sets.

Figure 2.7: "Normal" Data Bubble separation.

In practice, however, situations can occur where a Data Bubble contains a "gap" in a particular direction. This can happen if the Data Bubbles represent points from different clusters but their representatives happen to be close enough that one Data Bubble contains points from both clusters. Figure 2.8 illustrates such a case. These situations can lead to errors in the clustering result, because the difference between the borders, and hence the distance between Data Bubbles, may be underestimated. As a consequence, cluster separations may be lost. For vector Data Bubbles, this problem does not occur as frequently as for non-vector Bubbles, since there the extent is estimated by the average pairwise distance, which in the case depicted in Figure 2.8 would still be close to the true extent of B (which is much smaller than the depicted directional border distance of B).

Figure 2.8: A "gap" in Data Bubble B. Since the cluster in Bubble A spreads across the center line between the representatives rA and rB, Bubble B contains points from two clusters. In B, the border distance in the direction of A is larger than in the reverse direction of A.

In order to detect these and similar situations, we maintain certain statistics with respect to the distances of objects in a Data Bubble B to its representative rB.
The values we compute when constructing a Data Bubble are: 1) the average distance of the objects in B in the direction of each other Data Bubble (and in the reverse directions), and 2) the standard deviation of the distances in all these directions.

DEFINITION 6. Let A and B be two sets of objects, represented by rA and rB, respectively. Let BA = Bubble(B).InDirection(A) denote the set of objects in B that lie in the direction of A, and let BrevA = Bubble(B).InRevDirection(A) denote the set of objects in B that lie in the reverse direction of A.

• Bubble(B).aveDistInDirection(A) = (Σ_{o∈BA} D(o, rB)) / |BA|.
Bubble(B).aveDistInDirection(A) is the average distance between the representative of B and the objects in B that lie in the direction of A.

• Bubble(B).aveDistInRevDirection(A) = (Σ_{o∈BrevA} D(o, rB)) / |BrevA|.
Bubble(B).aveDistInRevDirection(A) is the average distance between the representative of B and the objects in B that lie in the reverse direction of A.

DEFINITION 7. Let A and B be two sets of objects, represented by rA and rB, respectively. Let again BA = Bubble(B).InDirection(A) and BrevA = Bubble(B).InRevDirection(A). Let furthermore distBA = Bubble(B).aveDistInDirection(A) and distBrevA = Bubble(B).aveDistInRevDirection(A).

• Bubble(B).stdevInDirection(A) = sqrt( (Σ_{o∈BA} (D(o, rB) − distBA)²) / |BA| ).
Bubble(B).stdevInDirection(A) is the standard deviation of the distances between the representative of B and the objects in B that lie in the direction of A.

• Bubble(B).stdevInRevDirection(A) = sqrt( (Σ_{o∈BrevA} (D(o, rB) − distBrevA)²) / |BrevA| ).
Bubble(B).stdevInRevDirection(A) is the standard deviation of the distances between the representative of B and all objects in B that lie in the reverse direction of A.

Figure 2.9: Examples for a "gap" in a Data Bubble B. (a) The average distance of B in the direction of A is larger than in the reverse direction. (b) The standard deviation of B in the direction of A is much larger than in the reverse direction.

The "directional" versions of border distance, average distance and standard deviation of the distances help us to detect "gaps" in a Data Bubble in many cases. The situation depicted in Figure 2.8, e.g., is indicated by the fact that the border distance of Bubble B in the direction of A is not only much larger than in the reverse direction, but also much larger than the average distance in the direction of A. Two other examples of a "gap" in a bubble are given in Figure 2.9. In order to avoid overestimating the extent of a Data Bubble (and consequently underestimating the distance between Data Bubbles) in the presence of "gaps", we introduce a refined notion of the "directional" border distance, which we call the "directional" extent of the Data Bubble.

DEFINITION 8. Let A and B be two sets of objects, represented by rA and rB, respectively. Let Ave and Stdv be the average and the standard deviation, respectively, of the distances in B in the direction of A or in the reverse direction of A - whichever is smaller. The extent of B in the direction of A, Bubble(B).extentInDirection(A), is then defined as

Bubble(B).extentInDirection(A) = min{ Bubble(B).borderDistInDirection(A), Ave + 2 · Stdv }.

Basically, the (directional) extent of a Data Bubble is either the (directional) border distance or the (directional) average distance plus two times the (directional) standard deviation - whichever is smaller. Taking the average distance plus two times the standard deviation is a way to estimate a ("directional") border around the representative that includes most of the points within that limit.
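A compact sketch of how Definitions 5-8 combine (hypothetical function names; the handling of an empty directional part is an assumption, since this edge case is not spelled out here, and in the actual construction the distances would be maintained incrementally rather than recomputed):

```python
from math import sqrt

def directional_extent(B_objects, r_A, r_B, dist):
    """Extent of bubble B in the direction of bubble A: the smaller of
    (i) the border distance in the direction of A and (ii) the smaller
    directional average distance plus two standard deviations."""
    d_AB = dist(r_A, r_B)
    in_dir = [o for o in B_objects if dist(o, r_A) <= d_AB]
    in_rev = [o for o in B_objects if dist(o, r_A) > d_AB]
    border = d_AB - min(dist(o, r_A) for o in B_objects)

    def ave_std(part):
        ds = [dist(o, r_B) for o in part]
        ave = sum(ds) / len(ds)
        return ave, sqrt(sum((d - ave) ** 2 for d in ds) / len(ds))

    ave_d, std_d = ave_std(in_dir) if in_dir else (0.0, 0.0)
    ave_r, std_r = ave_std(in_rev) if in_rev else (ave_d, std_d)
    return min(border, min(ave_d, ave_r) + 2 * min(std_d, std_r))
```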
Having a good estimation of the extent of a Data Bubble in a certain direction, we can define the distance between two Data Bubbles simply as the distance between their representatives minus their directional extents.

DEFINITION 9. Let A and B be two sets of objects, represented by rA and rB, respectively. The distance between A and B is defined as

D(A, B) = D(rA, rB) − Bubble(A).extentInDirection(B) − Bubble(B).extentInDirection(A).

In summary, our new method for constructing a collection of k Data Bubbles for non-vector metric data is as follows:

1. Draw a random sample of k objects from the database. Each sample object will be the representative object rBi for one of the k Data Bubbles Bi (i = 1, . . . , k).

2. Classify, i.e., assign, each object in the database to the closest representative object rB in the set of objects obtained in step 1, and maintain incrementally the following information about each Data Bubble B:
• The distances to the k-nearest neighbors of the representative object rB, up to a value k = minPts. These k-nearest neighbor distances, nnDist(k, B), are used to define the core-distance and virtual reachability as in [9], i.e., coreDist(B) = virtualReachability(B) = nnDist(minPts, B).
• Relative to each other Data Bubble A:
(a) Bubble(B).borderDistInDirection(A);
(b) the average distance and standard deviation in the direction of A and in the reverse direction of A.

3. Compute the extent of each Data Bubble B in the direction of every other Data Bubble A:
Ave = min{Bubble(B).aveDistInDirection(A), Bubble(B).aveDistInRevDirection(A)};
Dev = min{Bubble(B).stdevInDirection(A), Bubble(B).stdevInRevDirection(A)};
Bubble(B).extentInDirection(A) = min{Bubble(B).borderDistInDirection(A), Ave + 2 · Dev}.

After the construction of the Data Bubbles and the computation of the directional extent values, hierarchical algorithms such as the Single-Link method can be applied to the non-vector Bubbles by using the distance between Data Bubbles defined in Definition 9. The clustering algorithm OPTICS is based on the notion of the reachability distance. For point objects, the reachability distance of an object o1 relative to an object o2 was defined as the maximum of D(o1, o2) and coreDist(o2) (see [3] for details). For Data Bubbles, the notion of the reachability distance of a Data Bubble B1 relative to a Data Bubble B2 can be defined analogously:

DEFINITION 10. Let A and B be two Data Bubbles. The reachability distance of B relative to A is defined as

reachDist(B, A) = max{D(A, B), coreDist(A), coreDist(B)}.

In the following, we argue that this definition is an improved version of the definition used in [9], which estimates the reachability distance of a hypothetical object o in B1 in the direction of B2, relative to an object in B2 in the direction of B1. Analogously to the definition of the reachability distance for points, if the two bubbles are far apart, the reachability distance will be equal to the distance between the bubbles. If the bubbles are very close to each other, which is indicated by at least one of the core-distances being larger than the distance between the bubbles, the hypothetical object o can be considered to be located at the border of both Data Bubbles, and we estimate its core-distance by the larger of the two core-distances, which in turn is used as the estimation of the reachability distance. The definition given in [9] considers only the core-distance of Bubble A when estimating the reachability distance of Bubble B relative to A.
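A minimal sketch of Definitions 9 and 10, assuming the representative distance, the directional extents and the core-distances have already been computed during construction:

```python
def bubble_distance(d_reps, extent_A_toward_B, extent_B_toward_A):
    """Definition 9: distance between two bubbles is the distance between
    their representatives minus both directional extents. By Lemma 1 each
    extent is at most half of d_reps for a metric distance, so the value
    cannot become negative; the clamp is purely defensive."""
    return max(0.0, d_reps - extent_A_toward_B - extent_B_toward_A)

def reach_dist(d_AB, core_A, core_B):
    """Definition 10: reachability distance of bubble B relative to
    bubble A, as used when feeding the bubbles into OPTICS."""
    return max(d_AB, core_A, core_B)
```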
The old definition significantly underestimates the reachability value for points at the border of B if A is relatively dense in its center and close to a less dense Bubble B (resulting in errors in the clustering structure). Furthermore, the new definition allows a much easier integration of Data Bubbles into OPTICS than the original method.

2.3 Performance Evaluation

In this section, we perform an extensive experimental evaluation of our Data Bubbles. The results show that (1) our new method is highly scalable; (2) it produces reliable results even for very small numbers of Data Bubbles; (3) it is significantly better than random sampling; and (4) it even has improved accuracy compared to the original Data Bubbles when applied to some vector data.

2.3.1 Data-sets and Experimental Setup

To evaluate the performance of our new method for non-vector Data Bubbles, in particular its ability to discern clusters that are close to each other, we compared our method with the original Data Bubbles method on the following data-sets.

The first is a synthetic 2-dimensional point data-set (Figure 2.10(a)) with Euclidean distance (L2), called DS-Vector, which is used to show that even for Euclidean vector spaces the new version of Data Bubbles (which uses only the distance information and none of the vector space properties) outperforms the original Data Bubbles (for vector spaces) proposed in [9]. The reachability plot obtained when clustering the whole data-set using OPTICS is depicted in Figure 2.10(c). The data-set contains 50,000 points distributed over 8 clusters and 4% background noise. The eight clusters have similar sizes and most of them are located very close to each other, as can be seen in Figure 2.10(a). Therefore, this data-set is a good test bed for evaluating the new directional definition of extent and the heuristics to handle gaps in bubbles.

The second data-set, called DS-UG, is another synthetic 2-dimensional vector data-set (Figure 2.10(b)) that contains 100,000 points in both uniformly distributed and Gaussian distributed clusters. Figure 2.10(d) shows the reachability plot of the OPTICS clustering result on DS-UG. We used this data-set to study the performance of the new and old Data Bubbles on different types of clusters.

The third data-set, called DS-Tuple, which we use to evaluate the relative performance of our non-vector Data Bubbles, is a synthetic set of 50,000 binary strings. Each object of DS-Tuple is a 100-bit 0/1 sequence and represents a set (given 100 items, if a position in the sequence has value one, then the item corresponding to that position is in the set). The similarity between two such sequences s1 and s2 is measured using the Jaccard coefficient on the corresponding sets, i.e., |s1 ∩ s2|/|s1 ∪ s2|. 80% of the objects form 10 clusters and the remaining 20% are noise. Two of the clusters are very small (123 and 218 objects), making the problem of finding them very challenging for data summarization techniques. The reachability plot obtained when clustering the whole data-set using OPTICS is depicted in Figure 2.10(e) (the two tiny clusters are indicated by arrows).

The fourth data-set, used to illustrate the practical relevance of our method, is a real data-set containing RNA sequences. The application and the data-set, called DS-Real, are explained in more detail in Section 2.3.4.
Figure 2.10: The synthetic data-sets used for the evaluation and their reachability plots: (a) DS-Vector, (b) DS-UG, (c) reachability plot for DS-Vector, (d) reachability plot for DS-UG, (e) reachability plot for DS-Tuple.

The values reported in the following sections are averages over 100 repetitions of each experiment. In order to measure the quality of the hierarchical clustering results based on data summarizations, we used the following two measurements.

We designed the first scoring scheme and denote it by N-score. This score emphasizes the ability of the tested clustering algorithm to fully recover all the clusters (big and small) in the data-set. For each reachability plot obtained for data summarizations, we apply the following heuristic to select the best cut-line through the diagram. Specifically, we evaluate 40 different cut-lines through a reachability plot at equidistant intervals, and select the cut-line that corresponds most closely to the clustering obtained for the whole data-set. This cut-line is assigned a score based on the number of clusters that are present with respect to this cut through the reachability plot. If n clusters are found, and n_max denotes the number of clusters in the original data-set, then the cut-line gets a score of n/n_max if n ≤ n_max, and a score of 0 if n is greater than n_max. Hence missing clusters, finding spurious clusters, and splitting clusters are all penalized. The intention of this score is to penalize especially those results where the algorithm produces structures that do not exist in the original data-set and which may lead to misinterpretations of a data-set. This scoring scheme favors the Data Bubbles for vector data over our new Data Bubbles for non-vector data, since our new method is designed to recover gaps, the cluster structures likely to be missing in the result of the old Data Bubble method.

The second measurement we applied to the clustering results is the F-score measurement [30]. Similar to the first measurement, we used a cut-line to determine the borders of clusters. Denote by X the set of objects in a cluster of the approximate clustering result, by 𝒳 = {X} the set of all such clusters X, and by T the set of objects in a cluster of the OPTICS result. Let N1 = |T ∩ X|, N2 = |X|, N3 = |T ∪ X|, p = N1/N2 and r = N1/N3. Then the F-score for T is defined as F(T) = max_{X∈𝒳} {2pr/(p + r)}. The overall F-score for an approximate clustering result is the weighted average of the F-scores over all clusters T in the OPTICS result. We used multiple cut-lines to cut through the reachability plot of a clustering result, and report the maximal overall F-score as the accuracy. The higher the F-score, the better the result.

All experiments were performed on an AMD Athlon-XP 1800+ workstation with 512 MB of memory.

2.3.2 Comparison with Original Data Bubbles

First, we compare our new method with the original Data Bubble method using the vector data-set DS-Vector. The N-scores for both methods with an increasing number of Data Bubbles are depicted in Figure 2.11. The results clearly show that our new directional definition of extent and the heuristics to handle gaps in Data Bubbles lead to better quality results, even without using any of the vector space properties.
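For reference, the F-score measure described above can be computed as in the following sketch; clusters are represented as sets of object identifiers, and the weighting of each reference cluster by its size is an assumption, since the exact weights are not spelled out here.

```python
def overall_f_score(optics_clusters, approx_clusters):
    """For each reference cluster T (from OPTICS on the full data), find
    the best-matching approximate cluster X with p = |T & X| / |X| and
    r = |T & X| / |T | X|; return the size-weighted average over all T."""
    total = sum(len(T) for T in optics_clusters)
    overall = 0.0
    for T in optics_clusters:
        best = 0.0
        for X in approx_clusters:
            n1 = len(T & X)
            if n1 == 0:
                continue
            p, r = n1 / len(X), n1 / len(T | X)
            best = max(best, 2 * p * r / (p + r))
        overall += len(T) / total * best
    return overall
```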
In Figure 2.12, where we applied the F-score to the clustering results on DS-Vector, we see a similar trend as in Figure 2.11: our new Data Bubbles designed for non-vector data outperform the old Data Bubbles for vector data on DS-Vector.

Figure 2.11: New versus old method on DS-Vector using N-score.

Figure 2.12: New versus old method on DS-Vector using F-score.

We also tested the new and old Data Bubble methods on the data-set DS-UG, which contains more complicated cluster structures. On this vector data-set, we acknowledge that the original Data Bubble method for vector data achieves better F-scores than our new method for non-vector data, as shown in Figure 2.13. Compared with the result on DS-Vector, which consists purely of uniformly distributed clusters and random noise, we conjecture that the original method may have a better ability to handle Gaussian-distributed clusters. How to improve the performance of Data Bubbles for non-vector data on Gaussian clusters is a direction for future study.

Figure 2.13: New versus old method on DS-UG using F-score.

2.3.3 Comparison with Random Sampling

In this section we apply our method to the non-vector metric data-set DS-Tuple. We evaluate both the quality and the runtime benefits of non-vector Data Bubbles. The first set of experiments shows the relative performance of our method compared to a random sampling approach, which simply clusters a random sample of the data-set and then assigns every object in the database to the closest sample object. This approach represents a baseline method, which is in fact difficult to beat since it is very efficient. If a method is computationally relatively expensive in the construction of its data summarizations (such as the BIRCH-based methods BUBBLE and BUBBLE-FM), random sampling can be more effective since it can use the same number of resources to simply cluster a larger sample.

Figure 2.14 shows the result of the comparison on DS-Tuple. The quality of the plots created by our method is consistently better than that of random sampling. For example, when using 50 bubbles/samples, our method is almost always able to recover all the significant clusters and regularly finds one of the 2 small ones, while random sampling recovers only 7 of the 8 big clusters quite consistently. Two example plots are depicted in Figure 2.14(a). Figure 2.14(b) shows the average N-score for both methods when varying the number of bubbles/samples. We obtain up to 40% better quality when using very high compression rates (low numbers of bubbles). In general, we consistently obtain a score that random sampling can match only when the number of sample points is at least twice the number of Data Bubbles (this rate increases with the number of bubbles/samples). Figure 2.14(c) shows the average overall F-score for both methods when changing the number of bubbles/samples. Similar to Figure 2.14(b), it shows that our Data Bubble method outperforms random sampling consistently by a large margin, especially when the number of bubbles/samples is small. Figure 2.14(d) shows the scale-up of both methods with respect to the number of bubbles/samples.
Both methods scale linearly. They have the same number of distance computations and hence their runtimes are very close to each other, especially when the sampling rate is low. Random sampling is slightly faster when using a large sample rate and a relatively cheap distance function (the Jaccard coefficient in this case). In real applications, however, the distance function is typically much more expensive (e.g., a sequence alignment score as in our real-life data-set DS-Real), and the runtime of both methods will be dominated heavily by the distance computations alone (e.g., 638 seconds for Data Bubbles versus 635 seconds for sampling on the DS-Real data-set).

Figure 2.14: Non-vector Data Bubbles vs. random sampling on the DS-Tuple data-set: (a) reachability plots for bubbles and sampling, (b) quality w.r.t. N-score, (c) quality w.r.t. F-score, (d) runtime.

Figure 2.15 shows the absolute runtime and the speed-up factors (compared to OPTICS on the whole database) when varying the database size. The databases used are subsets of DS-Tuple. Our algorithm, using 100 Data Bubbles, scales approximately linearly with the size of the database, and we achieve, as expected, large speed-up values over OPTICS: between 77 and 400. Note that this speed-up also depends on the distance function, and for more expensive distance functions the expected speed-up will be much larger still.

Figure 2.15: Scale-up and speed-up w.r.t. number of objects on DS-Tuple: (a) scale-up, (b) speed-up.

2.3.4 An Application to a Real-life Data-set

The RNase P Database [47] is a compilation of ribonuclease P (RNase P) sequences and other information. In the last few years, the number and diversity of the available RNase P RNA sequences have increased significantly, and analyzing this data-set has become an important issue. Cluster analysis can help detect functional subgroups in this data-set and help in understanding the evolutionary relationships between the sequences.

In this application, we used global sequence alignment under the standard affine gap penalty scoring scheme (as used in BLAST) to cluster the database of 481 sequences. The OPTICS result for the whole data-set (DS-Real) is shown in Figure 2.16(a). Figure 2.16(b) shows a typical result using 50 Data Bubbles. It is easy to verify that the results are very similar. The clustering structure corresponds mostly to the already known evolutionary relationships and matches well with the annotations in the database. An exception is the Bacteria.Gamma family, which is partitioned into two sub-groups that are both mixed with respect to the existing annotations of the sequences. This is an interesting finding that is currently being investigated in more detail.

Figure 2.16: Results for DS-Real: (a) result of OPTICS, runtime = 6578 seconds; (b) result of OPTICS-Bubbles, runtime = 638 seconds.
2.4 Summary

In this chapter, we presented a new data summarization method for non-vector metric data. The method uses only distance information and introduces the novel concept of a directional extent of a set of objects. We showed that the distance between bubbles based on this notion of extent even improves upon Data Bubbles when applied to vector data. An extensive performance evaluation also showed that our method is more effective than a random sampling approach, using only a very small number of data summarizations, and thus resulting in a large reduction of runtime (up to 400 times faster than OPTICS) while trading only very little clustering quality. The method allows us to obtain results even for data-sets where clustering the whole data-set is infeasible because of the prohibitive cost of the distance function.

Chapter 3
Speed-up Clustering with Pairwise Ranking

Many clustering algorithms, in particular hierarchical clustering algorithms, do not scale up well for large data-sets, especially when using an expensive distance function. In this chapter¹, we propose a novel approach to perform approximate clustering with high accuracy. We introduce the concept of a pairwise hierarchical ranking to efficiently determine close neighbors for every data object. We also propose two techniques to significantly reduce the overhead of ranking: 1) a frontier search, rather than the sequential scan of the naïve ranking, to reduce the search space; 2) based on this exact search, an approximate frontier search for pairwise ranking that further reduces the runtime. Empirical results on synthetic and real-life data show a speed-up of up to two orders of magnitude over OPTICS while maintaining high accuracy, and of up to one order of magnitude over the previously proposed Data Bubbles method, which also tries to speed up OPTICS by trading accuracy for speed.

The remainder of this chapter is organized as follows. In Section 3.1 we introduce background knowledge, including the OPTICS clustering algorithm; in Section 3.2, we state the motivation for the new method; Section 3.3 discusses the idea of ranking; in Section 3.4, we introduce our new ranking method; in Section 3.5, we compare our method empirically with the previous methods; finally, we summarize this chapter in Section 3.6.

¹ Some of the material in this chapter has been published in [57].

3.1 Preliminaries

3.1.1 Three Major Clustering Approaches

Clustering algorithms can be categorized based on how they cluster the data objects. In this subsection we briefly introduce three algorithms representing the major categories: partitioning, hierarchical and density-based approaches. For a complete description of all categories, see [24].

The partitioning approach is represented by the k-means algorithm. This approach selects a set of centers and partitions the data-set by assigning data objects to their nearest center. The centers are then adjusted according to the objects in each group, and the assignment process is repeated to refine the result. Each group of objects assigned to a center is considered a cluster.

The hierarchical approach is represented by the Single-Link algorithm. Starting from groups of individual data objects (one data object per group), the method agglomerates the two nearest groups into a new group. The final result is a hierarchical ordering of all data objects that shows the process of the agglomeration.

The density-based approach is represented by the DBSCAN algorithm [18].
The method estimates the density of the region around each data object by counting the number of neighboring objects within a given radius. It then connects dense regions to grow them into clusters.

Although our algorithm can be slightly modified to apply to other clustering methods such as Single-Link and DBSCAN, in this chapter we focus on OPTICS, which is a hierarchical clustering algorithm that uses density-based concepts to measure the dissimilarity between points and is described in more detail below.

3.1.2 Triangle Inequalities in Metric Spaces

The triangle inequality can be used in a technique called pruning to avoid distance computations in data retrieval operations and data-mining applications that require distance computations. To apply the pruning technique, the distances between a selected small set of objects P and all other objects o in a data-set are typically pre-computed in a preprocessing step. The objects p ∈ P are called "pivots" or "reference points" in the literature [12, 25]. In a range query, for example, a query object q is given and the task is to find all objects within a given query radius r of q. For any data object o and pivot p, a derived form of the triangle inequality gives D(q, o) ≥ |D(q, p) − D(p, o)|. Therefore, at query time, the distance D(q, p) is computed in order to determine whether |D(q, p) − D(p, o)| > r for each object o. If this condition is true for o, then it follows that D(q, o) > r, and o can safely be excluded (pruned away) without actually computing the distance D(q, o).

The triangle inequality has been incorporated into several indexing methods for metric data, for instance the M-Tree [13]. It can lead to substantial savings of distance computations in low-dimensional spaces and in metric spaces that can be mapped to a low-dimensional space. In high-dimensional spaces and general metric spaces, however, its effectiveness deteriorates dramatically [25, 12]. Compared with sampling and with methods such as BIRCH and Data Bubbles, the advantage of using triangle inequalities is that they can provide additional speed-up for virtually any method on metric data (including our method), and that pruning is an exact technique that loses no accuracy.

3.2 Motivation

Although the OPTICS algorithm without index support requires the computation of O(n²) distances, its final result depends largely on the minPts-nearest neighbor distances only. Some large distances (larger than typical minPts-nearest neighbor distances) between clusters also count, but OPTICS only needs a few of them, e.g., one per pair of clusters as depicted in Figure 3.1, while most of the reachability-distances plotted in the output are short distances within clusters. The exact values of these large distances can even be replaced by approximate values without significantly changing the cluster structure in the output plot, since, as long as an approximate value is large enough, it fulfills its function of separating a cluster from the rest of the data-set.

Figure 3.1: An OPTICS walk. The arrows represent the ordering in the walk. Although the corresponding reachability-distances differ from the distances between the pairs of objects, the lengths of the edges indicate the magnitude of the reachability-distance values. The figure shows that most plotted reachability-distances are small.

It is also not necessary to determine the exact minPts-nearest neighbor of each object to compute its core-distance (approximately).
Since OPTICS only uses the values of the minPts-nearest neighbor distances, an object whose distance is very similar to that of the minPts-nearest neighbor can serve the same purpose. Overall, in order to preserve the quality of the final output, we are only required to provide OPTICS with values that are similar to the minPts-nearest neighbor distance for each object, and a few large distances between clusters. To achieve this without computing O(n²) distances, we will introduce the method of pairwise hierarchical ranking in Section 3.4.

3.3 k-Close Neighbor Ranking

The problem of ranking a list of objects has been well studied in database research and the social sciences, with typical applications such as ranking top-selling merchandise or political candidates. Different from these ranking problems, in this section we study a special kind of ranking for similarity queries.

DEFINITION 11 (k-cn ranking). Given a list B of data objects, a query q, and a distance function D(., .), the problem is to rank the objects in B so that the top k-th object ok in the ranking has a distance D(q, ok) similar to the true k-th nearest neighbor distance dk, i.e., D(q, ok) = (1 + ε)dk for a small ε (the smaller the ε, the better). This problem is called the k-close neighbor (k-cn) ranking problem.

For our application, we are not required to find the true top k nearest neighbors; as long as the D(q, ok) value returned by the ranking reflects the density around the query object (the core-distance in OPTICS [3]), we can use the ranking to replace the range search in OPTICS and reduce the number of distance computations. In the coming sections, we propose a ranking method based on the triangle inequality to perform k-cn ranking. It does not provide a theoretical bound on ε, but it achieves high accuracy on synthetic and real-life data and is efficient to compute.

3.3.1 Ranking Using Triangle Inequalities

It has long been observed empirically [48] that the triangle inequality in a metric space (D(x, y) + D(y, z) ≥ D(x, z) for data objects x, y, z) can be used to detect close neighbors of a given query object. While the triangle inequality has been used to speed up data-mining applications via the pruning technique [17], as discussed in Section 3.1.2, the use of triangle inequalities to perform ranking has only been gaining the attention of researchers in recent years [4].

DEFINITION 12 (Lower Bound Estimation (LBE)). Given a distance function D(., .), a query object q and data objects o and p, Ep(q, o) = |D(q, p) − D(p, o)| is a lower bound estimation of D(q, o) using p as a pivot.

The estimations of individual pivots in a set of pivots P can naturally be combined in order to obtain the best estimation as the largest lower bound:

EP(q, o) = max_{pi∈P} |D(q, pi) − D(pi, o)|.

Figure 3.2: Ranking with triangle inequalities. Although D(q, o) cannot be estimated well using p′, chances are that D(q, o) can be estimated using another pivot p; and while p′ cannot be used to estimate D(q, o), p′ can be used to estimate D(q′, o).

As shown in the 2-dimensional example of Figure 3.2, when q and o are not on the circle centered at p, the absolute difference |D(q, p) − D(p, o)| will be larger than zero and can indicate the actual D(q, o) value. The closer q, o, and p are to lying on a straight line, the better the estimation. Using several pivots will typically improve the estimation, since if one pivot fails to estimate D(q, o) well, chances are that it can be estimated better using another pivot.
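A small sketch of Definition 12 with made-up numbers, showing how taking the maximum over several pivots tightens the lower bound (the pivot names and distance values are purely illustrative):

```python
def lower_bound(d_q_to_pivots, d_pivots_to_o):
    """E_P(q, o) = max_p |D(q, p) - D(p, o)| (Definition 12), computed
    from precomputed distances: both arguments are dicts keyed by pivot."""
    return max(abs(d_q_to_pivots[p] - d_pivots_to_o[p]) for p in d_q_to_pivots)

# Toy illustration with three pivots: a single pivot may give a weak
# bound (p1 here), but the maximum over all pivots is much tighter.
d_q = {"p1": 4.0, "p2": 7.5, "p3": 2.0}   # D(q, p)
d_o = {"p1": 4.1, "p2": 3.0, "p3": 6.5}   # D(p, o)
print(lower_bound(d_q, d_o))              # 4.5, contributed by p2 (or p3)
```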
DEFINITION 13 (Ranking with LBE). Given a query object q, a set of pivots P, and a distance function D(., .) on a data-set B: in the preprocessing stage, the distances between all data objects in B and all pivots in P are computed; then, in the application, all data objects o in B are ranked non-decreasingly according to EP(q, o).

The merit of this ranking method lies in its ability to save distance computations. All required distances except those between q and the pivots in P have been computed in the preprocessing stage, so that in the application only |P| distance computations and other computationally cheap operations are performed. When the number of pivots is small and the distance function is computationally expensive, the total number of computations is much smaller than in the brute-force approach of computing all distances between q and all data objects to find the close neighbors.

In most scenarios, the runtime of an application is much more important than that of a possible preprocessing step, since the preprocessing is usually performed in advance and only once for several applications. But even when the runtime of the preprocessing stage is counted in the total runtime of an application, the ranking method can still significantly speed up our intended applications, such as hierarchical and density-based clustering, where the runtime is dominated by typically very expensive distance computations and where the closest neighbors have to be determined for each object in the data-set. In these applications, the total number of computed distances is O(n|P|), which is much smaller than O(n²) (since |P| ≪ n).

3.3.2 An Intuition of When Ranking Works

In this subsection, we give a theoretical analysis showing why EP(q, o) can be used to estimate D(q, o) in general metric spaces. Given a pivot p, a query q and a close neighbor c of the query, Ep(q, c) = |D(q, p) − D(p, c)| is bounded by D(q, c) since, by the triangle inequality, |D(q, p) − D(p, c)| ≤ D(q, c). This result extends directly to the case of using a set of pivots P, with EP(q, c) ≤ D(q, c). Therefore, if c is very close to q, then D(q, c) is small, and consequently EP(q, c) must be small. This means that when ranking objects according to their estimated distance to q, c can be expected to be ranked high, provided that not many objects that are farther away from q have estimated distances lower than EP(q, c). The important question is therefore: how large will EP(q, o) be on average for a randomly chosen object o? If EP(q, o) has a high probability of being larger than EP(q, c), then close neighbors will mostly be ranked higher than random objects. Lemma 2 below gives the probability of a random object o getting an EP(q, o) value no greater than a given value.

LEMMA 2. Given a data-set B with metric distance function D(., .), let the query q, the data object o and the pivot set P be selected randomly from B. Let Z = {|D(q, pi) − D(pi, o)| : pi ∈ P}, and let PZ(x) be the probability that z ≤ x for z ∈ Z. Then

Pr[EP(q, o) ≤ x] = (PZ(x))^{|P|}.

Proof. Let S = {v | v = D(q, pi) or v = D(o, pi), pi ∈ P}. Since q, o and the pivots in P are selected randomly from B, the elements in S are independent of each other. Thus the values zi = |D(q, pi) − D(pi, o)| are also independent of each other. Therefore

Pr[EP(q, o) ≤ x] = Pr[max_{pi∈P} |D(q, pi) − D(pi, o)| ≤ x] = ∏_{pi∈P} Pr[|D(q, pi) − D(pi, o)| ≤ x] = (PZ(x))^{|P|}.

Lemma 2 provides us with a clue as to when the ranking will be effective.
Let x = D(q, c) be the distance between a query q and an object c. By Lemma 2,

Pr[EP(q, o) ≤ D(q, c)] = (PZ(D(q, c)))^{|P|}.

Although the distribution of Z is unknown, (PZ(D(q, c)))^{|P|} is always a monotonic function of D(q, c). The smaller D(q, c), the smaller PZ(D(q, c)), and consequently the smaller (PZ(D(q, c)))^{|P|}. It also holds that the larger the number of pivots |P|, the smaller (PZ(D(q, c)))^{|P|}. Therefore, the closer a neighbor c is to a query q, and the more pivots we use, the higher the probability that a random object is ranked lower than c. For instance, if Z follows a normal distribution with mean µ and standard deviation σ, and we let D(q, c) = δ · µ, then by a simple calculation using the Gauss error function erf(.) we can derive

Pr[EP(q, o) ≤ D(q, c)] = ( (erf(δµ/(σ√2)) − erf(−δµ/(σ√2))) / 2 )^{|P|}.   (3.1)

Let |P| = 1 and the close neighbor distance D(q, c) = 0.1µ, i.e., δ = 0.1. The value of Pr[EP(q, o) ≤ D(q, c)] in Formula (3.1) is plotted in Figure 3.3(a). It shows that as the standard deviation σ of Z approaches zero, the probability of a random object being ranked higher than a close neighbor quickly approaches one, leading to a less and less effective ranking. This result is consistent with the theoretical analysis of Beyer et al. [7], which shows that as the dimensionality of a Euclidean space increases, the standard deviation of the distances converges to zero, and the effectiveness of all known indexing methods deteriorates as well. Compared with the standard deviation, Figure 3.3(a) also shows that the mean µ has little effect on Pr[EP(q, o) ≤ D(q, c)].

Figures 3.3(b) and 3.3(c) show how Pr[EP(q, o) ≤ D(q, c)] is correlated with δ and |P|. For µ = 0.5 and |P| = 10, Figure 3.3(b) shows that when the close neighbor is very close to the query object, i.e., δ is small (< 0.1), then for most σ values (σ > 0.1), Pr[EP(q, o) ≤ D(q, c)] in Formula (3.1) is close to zero, indicating an effective ranking. In Figure 3.3(c), µ is set to 0.5 and σ is set to 0.1. It shows that when δ approaches zero, Pr[EP(q, o) ≤ D(q, c)] decreases dramatically, and that the larger the number of pivots |P|, the larger the range of δ values with low Pr[EP(q, o) ≤ D(q, c)] values, allowing more close neighbors to be ranked higher than random objects (objects with a distance to the query object close to µ).

Figure 3.3: Pr[EP(q, o) ≤ D(q, c)] with respect to µ, σ, δ, and |P|: (a) |P| = 1; (b) |P| = 10, µ = 0.5; (c) σ = 0.1, µ = 0.5.

3.4 Pairwise Hierarchical Ranking

In this section, we propose a new method that uses ranking to reduce the number of distance computations in OPTICS. The method performs a "pairwise" ranking to detect close neighbors for each object.

DEFINITION 14 (Pairwise Ranking). In a pairwise ranking of a set of m objects, every object is in turn the query, so that the ranking consists of m sub-rankings of the m objects.

After the pairwise ranking, OPTICS uses only the distances between objects and their detected close neighbors, plus a few additional distances between objects that are far away from each other.
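The effect of the pivot count that Formula (3.1) quantifies is easy to check numerically before describing the hierarchical scheme; the following minimal sketch evaluates the formula as stated, using Python's math.erf (the parameter values are illustrative, not from the experiments):

```python
from math import erf, sqrt

def prob_random_ranked_closer(delta, mu, sigma, num_pivots):
    """Evaluate Formula (3.1): the probability that a random object o has
    E_P(q, o) no larger than the close-neighbor distance D(q, c) = delta*mu,
    under the normality assumption stated in the text."""
    a = delta * mu / (sigma * sqrt(2))
    return ((erf(a) - erf(-a)) / 2) ** num_pivots

# With delta = 0.1, mu = 0.5, sigma = 0.1, adding pivots drives the
# probability down rapidly, i.e., close neighbors get ranked higher.
for num_pivots in (1, 5, 10, 20):
    print(num_pivots, prob_random_ranked_closer(0.1, 0.5, 0.1, num_pivots))
```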
As indicated by Lemma 2, to rank the close neighbors of a query object high, we should use as many pivots as we can in the ranking, since the larger the number of pivots, the larger the probability that a random object is ranked lower than the close neighbors of the query. However, more pivots also mean more distance computations between pivots and data objects, as well as more overhead (i.e., runtime outside of distance computations) for the ranking. Selecting a suitable number of pivots to balance these two factors is not easy.

In order to increase the number of pivots without significantly increasing the number of distance computations, we propose to perform the ranking hierarchically. The idea is to organize the pivots in a way similar to metric trees such as the M-Tree [13], but we will use the pivots in a way that differs from the previous methods. Our method can be roughly described as follows. First, the data-set of n objects is partitioned in a hierarchical way, so that for each group of objects on the leaf level, their distances to O(log n) ancestor objects are computed. Using these O(log n) ancestors as pivots, our method then performs a pairwise ranking for each group of objects to find close neighbors within the group. To find close neighbors across different groups, the method also performs rankings across several groups at a time. Since different groups of objects have different sets of ancestors, the pivots our method uses for these rankings are their common ancestors. In other words, the rankings are performed layer by layer, along the generated hierarchical partitioning.

Our hierarchical ranking method saves distance computations because not every pivot is associated with the same number of distance computations. The top-level pivots have distances to all objects, but for each pivot on the next level, since its set of descendants is only a fraction of the whole data-set, the number of distances associated with it is reduced to the same fraction. In this way, the pivots are organized similarly to the pyramidal hierarchy of a government: some pivots are global pivots, responsible for every member of the data-set, while others are local pivots that are responsible only for members within their territories.

The layer-by-layer ranking in our method also improves the accuracy of finding close neighbors. As the ranking proceeds down towards the leaves, more and more random objects are excluded from the ranking during the partitioning of the data-set, thus reducing the total number of random objects being ranked higher than close neighbors. The processes of partitioning and ranking are explained in more detail in the following subsections.

3.4.1 Partitioning

Our method first partitions the data in a hierarchical way, creating a tree, which we call the "pivot tree". Initially, the root node of the hierarchical tree contains all objects in the data-set. During the construction, if a node v is "split", a set of b representative objects is randomly selected from the set of objects in v, and b associated child nodes (one per representative) are created under v. The set of objects in v is then distributed among its children by assigning each object in v to the child node with the closest associated representative. Each representative and the objects assigned to it form a new node. The construction proceeds recursively with the leaf node that contains the largest number of objects, until the tree has a user-specified number of leaf nodes.
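A minimal sketch of this pivot-tree construction (illustrative names and data layout, not the thesis's implementation; leaves that are too small to split are simply skipped, an assumption made here to keep the sketch short):

```python
import random

def build_pivot_tree(objects, dist, branching, num_leaves):
    """Repeatedly split the largest leaf by drawing `branching` random
    representatives and assigning each object to its closest one. Each
    node is a dict {'rep': ..., 'objects': [...], 'children': [...]};
    the distances computed here are the ones later reused as pivot
    distances in the hierarchical ranking."""
    root = {'rep': None, 'objects': list(objects), 'children': []}
    leaves = [root]
    while len(leaves) < num_leaves:
        leaves.sort(key=lambda n: len(n['objects']), reverse=True)
        node = leaves.pop(0)                       # largest leaf
        if len(node['objects']) < 2:
            break                                  # nothing left to split
        reps = random.sample(node['objects'],
                             min(branching, len(node['objects'])))
        children = [{'rep': r, 'objects': [], 'children': []} for r in reps]
        for o in node['objects']:
            closest = min(children, key=lambda c: dist(o, c['rep']))
            closest['objects'].append(o)
        node['children'] = children
        leaves.extend(children)
    return root
```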
At the end, all data objects are contained in the leaf nodes of the tree, and all nodes except the root contain a representative. For any data object o and a node v, o is said to be under v if (1) v is a leaf that contains o, or (2) v is an internal node and o is contained in a leaf that is a descendant of v.

3.4.2 Ranking

Multiple rankings are performed using subsets of the representatives in the constructed hierarchical tree as pivots. We give an example before the formal description of the algorithm. In Figure 3.4, Ii (i = 1, 2, 3) are internal nodes and Li (i = 1, 2, 3, 4, 5) are leaf nodes in different layers. o1, o2, o3, o4, o5, o6, o7, o8, and o9 are data objects under them. Let Ii.rep and Li.rep be the representatives of internal node Ii and leaf node Li, respectively. For o1, o2, o3, o4, and o5, since the distances between them and the representatives of the leaf nodes L1 and L2 are computed when the algorithm partitions the internal node I2 into L1 and L2, we can use L1.rep and L2.rep as pivots to rank the data objects o1, o2, o3, o4, and o5. Since the distances between these data objects and I2.rep, I3.rep, and L5.rep are also computed in earlier partitioning steps, I2.rep, I3.rep, and L5.rep can also be used as pivots to rank them. Therefore, {L1.rep, L2.rep, I2.rep, I3.rep, L5.rep} is the set of pivots used to perform the ranking for the objects of L1 and L2. In the upper-layer ranking of the objects under I2, I3 and L5, we can only use {I2.rep, I3.rep, L5.rep} as pivots to rank the whole set {o1, o2, o3, o4, o5, o6, o7, o8, o9}, since the distances between the data objects o6, o7, o8, o9 and the representatives L1.rep, L2.rep may not have been computed (they are computed only when L1.rep = I2.rep or L2.rep = I2.rep).

Figure 3.4: An example for hierarchical ranking.

Figure 3.5: Hierarchical ranking algorithm.

The pseudo code for the ranking algorithm is shown in Figure 3.5. Starting from the root as the input node v, the function rankNode performs a pairwise k-cn ranking of the objects under node v, using the child representatives of v and the higher-level pivots with known distances to the objects under v as the current set of pivots. rankNode is then applied recursively to all child nodes of v. Therefore, any object o under v takes part in multiple rankings: the ranking in v as well as the rankings in all descendant nodes of v that o is also under. The lower the level of a node, the more pivots are used in its pairwise ranking and the fewer objects are involved in it.

The method maintains a close neighbor set for each data object o. In any k-cn ranking, the top k objects with the smallest EP(q, o) values are retrieved and stored in this close neighbor set of o. The distances between o and the objects in this set are computed at the end and are used in the clustering. It is easy to prove that the number of distances computed in the partitioning and ranking is O(bn log_b n + kn log_b n), where b is the branching factor of the tree and n is the size of the data-set. However, the overhead of this ranking using naïveKcnRank can have a quadratic time complexity (although on computationally cheap operations), since the function naïveKcnRank essentially scans all objects and all pivots to compute and sort the distance estimation values EP(q, o) for each object o. In the next subsection, we propose two new techniques to reduce this overhead.
3.4.3 Reducing the Overhead in Ranking One issue that arises in the ranking algorithm shown in Figure 3.5 is that for the ranking in each node, the worst case time complexity is O(m2 log n), where m is the number of objects to rank (m decreases as the algorithm proceeds from the root to the bottom layer). This is due to the fact that the algorithm needs to perform a k-cn ranking for each data object and each ranking has a time complexity of O(m log n). (Note, however, that the time complexity is on computationally cheap operations, i.e., simple comparisons of precomputed distance values, rather than expensive distance computations.) To reduce this overhead, we propose two new techniques: 1) a best-first frontier search (rather than the sequential scan in the na¨ıve ranking) to significantly reduce the search space; 2) based on this frontier search (which has the same pairwise ranking result as the na¨ıveKcnRank) an approximate pairwise ranking that further reduces the runtime without sacrificing too much accuracy in the application to hierarchical clustering. Best-First Frontier Search While the na¨ıve k-cn ranking performs a sequential scan of distances between pivots and all data objects to be ranked, we propose to use instead a best-first frontier search, based on a new data structure that organizes the distances associated with pivots in the following way. 49 Given a set of objects R under a particular node of the pivot tree and the corresponding set P of pivots for the k-cn ranking of the objects in R, for each pivot pi ∈ P , we store the distances between pi and o ∈ R in a list of pairs (o, D(o, pi )), and sort the list by the distance value of D(pi , o). Using |P | pivots, we have |P | sorted lists, and each object o ∈ R will have exactly one occurrence in each of these lists. Between the lists of different pivots we link the occurrences of the same object together in order to efficiently access all occurrences of a particular object in all lists. The data structure is illustrated in Figure 3.6. Figure 3.6: Linking the occurrences of each object. When computing a pairwise k-cn ranking, each object q will be used in turn as a query object, and all other objects o ∈ R will be ranked according to their estimated distances to q. Instead of solving this problem with a sequential scan, our new k-cn ranking algorithm first retrieves all occurrences of the current query q from the given data structures. These occurrences virtually form a starting line. Then, our method searches from the starting line, advances upward and downward along the |P | sorted lists, to search for the top k objects with the smallest EP (q, o) distance estimation values. The rationale is as follows. For a query q, let object o be one of the top k objects that is returned by the na¨ıve ranking, i.e., its distance estimation value EP (q, o) (= maxpi ∈P {|D(q, pi )− D(pi , o)|}) is one of the k smallest among all objects to be ranked. That also means that for object o, the values |D(q, pi ) − D(pi , o)| for each pivot pi are all small (since |D(q, pi ) − D(pi , o)| ≤ maxpi ∈P {|D(q, pi ) − D(pi , o)|} = EP (q, o)). Consequently, the occurrences of (a top k object) o in all the lists will in general be close to the occurrences of the query q because the lists are sorted by the distances of the objects to the pivot D(pi , o), and for a difference |D(q, pi ) − D(pi , o)| to be small, D(q, pi ) and D(pi , o) have to be similar values and will hence appear close to each other when sorted. 
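Before continuing, a sketch of this linked sorted-list structure (Figure 3.6) together with the best-first frontier search built on it, as described next, might look as follows; the data layout and names are assumptions made for illustration only, and all object-pivot distances are assumed to be available in the dictionary d.

```python
import heapq
from collections import defaultdict

def build_pivot_lists(objects, pivots, d):
    """One list of (D(o, p), o) pairs per pivot, sorted by distance, plus, for every
    object, its position in each list so the occurrences of a query can be located
    without scanning."""
    lists, positions = [], defaultdict(dict)
    for i, p in enumerate(pivots):
        lst = sorted(((d[(o, p)], o) for o in objects), key=lambda entry: entry[0])
        lists.append(lst)
        for pos, (_, o) in enumerate(lst):
            positions[o][i] = pos
    return lists, positions

def kcn_rank_frontier(q, pivots, d, k, lists, positions):
    """Best-first frontier search: starting at q's occurrence in every sorted list,
    expand outward, always at the occurrence with the smallest |D(q,p_i) - D(p_i,o)|.
    An object all of whose occurrences have been reached is one of the top k."""
    n_piv, heap, counts, topk = len(pivots), [], defaultdict(int), []

    def push(i, pos, step):
        # enqueue the occurrence at 'pos' in list i, to be expanded in direction 'step'
        if 0 <= pos < len(lists[i]):
            dist_po, o = lists[i][pos]
            heapq.heappush(heap, (abs(d[(q, pivots[i])] - dist_po), i, pos, step))

    for i in range(n_piv):                       # frontier starts just above and below q
        push(i, positions[q][i] - 1, -1)
        push(i, positions[q][i] + 1, +1)

    while heap and len(topk) < k:
        _, i, pos, step = heapq.heappop(heap)
        o = lists[i][pos][1]
        counts[o] += 1
        if counts[o] == n_piv:                   # all |P| occurrences of o reached
            topk.append(o)
        push(i, pos + step, step)                # advance the frontier in this list
    return topk
```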
Therefore, we can start from the occurrences of q and look in the nearby positions in the |P| sorted lists for the top k objects by a frontier search. At the end, the number of occurrences we visit will typically be only a fraction of the total number of occurrences in the pivot lists, leading to a speed-up over the sequential scan.

The pseudo-code of the new k-cn ranking algorithm is given in Figure 3.7.

Figure 3.7: k-cn ranking algorithm with best-first frontier search.

Function kcnRank maintains a priority queue as the frontier, such that its top element is a pair (o, D(o, pi)) with |D(q, pi) − D(pi, o)| the smallest in the queue. After all occurrences of q in the pivot lists are retrieved, the frontier is initialized with the occurrences immediately above and below those occurrences of q. Then the function performs a frontier search over all the sorted lists, always advancing in the list of the pivot pi where the current top element of the queue lies. For each object encountered as the frontier advances, the function maintains a count of the number of its occurrences visited so far. If this count is equal to the number of pivots used in the ranking, then the object is one of the top k objects returned in the final ranking. This process continues until all top k objects are found.

In the remainder of this subsection, we prove the correctness of algorithm kcnRank.

LEMMA 3. In algorithm kcnRank, if occurrence (a, D(a, pi)) is popped out of the priority queue before another occurrence (b, D(b, pj)), then |D(q, pi) − D(pi, a)| ≤ |D(q, pj) − D(pj, b)|.

Proof. When (a, D(a, pi)) is popped out of the priority queue, (b, D(b, pj)) is either in the frontier queue or outside the frontier (i.e., not yet visited by the frontier). If (b, D(b, pj)) is in the queue, then by the property of the priority queue, |D(q, pi) − D(pi, a)| ≤ |D(q, pj) − D(pj, b)|. If (b, D(b, pj)) is outside the frontier, then, since the list of pivot pj is sorted and the frontier expands outward from the occurrence of q, there must be a third occurrence (c, D(c, pj)) in the list of pivot pj that is currently in the frontier queue and lies between the occurrences of q and b, so that |D(q, pj) − D(pj, c)| ≤ |D(q, pj) − D(pj, b)|. By the property of the priority queue, |D(q, pi) − D(pi, a)| ≤ |D(q, pj) − D(pj, c)|, and therefore |D(q, pi) − D(pi, a)| ≤ |D(q, pj) − D(pj, b)|.

THEOREM 1. Algorithm kcnRank and the naïve k-cn ranking algorithm naïveKcnRank return the same result.

Proof. Let the last occurrence popped out of the priority queue belong to data object t, and denote this occurrence by (t, D(t, pi)). By Lemma 3, (t, D(t, pi)) has an absolute difference value |D(q, pi) − D(pi, t)| no smaller than those of all previously popped occurrences, including the other occurrences of t; hence EP(q, t) = |D(q, pi) − D(pi, t)|. For any object o other than the returned k objects in topK, o must have an occurrence (o, D(o, pj)) that is still in the queue or has not yet been explored (it would only be popped out of the priority queue after (t, D(t, pi))). By Lemma 3, |D(q, pi) − D(pi, t)| ≤ |D(q, pj) − D(pj, o)|, and thus EP(q, t) ≤ EP(q, o). Since for the other top k objects returned by kcnRank all occurrences are popped before (t, D(t, pi)), their distance estimation values are no greater than EP(q, t). So the elements in topK have the smallest distance estimation values among all objects to be ranked, and therefore they will also be returned by the naïveKcnRank algorithm.

Approximate Pairwise k-cn Ranking

As indicated by Lemma 2 in Section 3.3.2, the larger the number of pivots, the greater the ranking accuracy.
Given a fixed set of pivots, if the number of pivots is too small to effectively perform k-cn ranking, e.g., k = 5 and only 3 of the top 5 objects returned by the ranking are actually close neighbors, then some of the occurrences of the top k objects to be returned by the ranking may be located far away from the starting point of the frontier search. Thus the frontier search in algorithm kcnRank in Figure 3.7 may have to visit many occurrences of not-so-close neighbors of q before finding all the occurrences of all top k objects, and may still incur a large overhead. Our solution to this problem is to limit the steps that the frontier can advance from the starting position. The returned result is then no longer exactly the same as the na¨ıve k-cn ranking, i.e., the new algorithm performs an approximate pairwise k-close neighbor ranking. When the search stops, if only k′ of the top k objects (k′ < k) have all occurrences within the frontier, then the remaining k−k′ objects are selected from those objects (besides the k′ objects already selected) that have the largest numbers of occurrences within the frontier. The rationale behind this idea is that objects with occurrences located far away from the corresponding occurrences of the query objects are more likely to be random neighbors that cannot contribute short distances to be used by OPTICS, even if the frontier search goes all the way to find their occurrences. Thus setting a step limit for the frontier search will not hurt the final clustering accuracy, even if some of the top but not so close objects are not returned by the search. Let the step limit be s. The approximate pairwise k-cn ranking algorithm has worst case time complexity of O(sn log n), where n is the size of the data-set. Empirical results in Section 3.5 show that s can be as small as 2k to generate clustering results with high and robust accuracy. Integration with OPTICS After close neighbors of all objects have been detected by the pairwise ranking based on distance estimations, our method computes the actual distances between each object and these close neighbors. Another set of distances we will use are the distances computed in the partition stage when creating the pivot tree, i.e., the distances between the representatives of nodes and the objects under them. These are the only distances that OPTICS will use in the clustering. All other distances are assumed to be “infinitely” large. 53 The value of k should not be significantly smaller than the minP ts parameter of OPTICS, otherwise the cluster result can be distorted because there are not enough computed distances associated with each object to estimate the core-distances. In the pairwise hierarchical ranking, each object can take part in several sub-ranking, i.e., rankings of different layers, so that the number of distances associated with each object is usually a little larger than k. And since in practice the minP ts parameter only needs to be relatively small to provide good results, k can also be set to a small value (typically ≤ 20). 3.5 Experimental Evaluation In this section, we compared our method and the Data Bubbles method on synthetic as well as real-life data-sets. Both methods were used to speed-up the OPTICS clustering algorithm. We denote our method using approximate pairwise hierarchical ranking by OPTICS-Rank, and the DATA-BUBBLE method by OPTICS-Bubble. All experiments were performed on a Linux workstation with dual AMD Opteron 2.2GHz CPUs and 5GB of RAM, using one CPU only. 
3.5.1 Synthetic Data We used the two synthetic data-sets studied in last chapter, DS-Vector and DS-Tuple, to show that our new method has better accuracy in detecting subtle cluster structures. (a) OPTICS output for DS-Vector (b) OPTICS output for DS-Tuple (c) OPTICS-Rank output for DS-Vector (d) OPTICS-Rank output for DS-Tuple Figure 3.8: The Reachability plots from OPTICS (a and b) and OPTICS-Rank (c and d) are almost identical (due to the property of OPTICS clustering, there are switches of cluster positions). The outputs of OPTICS-Rank for DS-Vector and DS-Tuple are shown in Figure 3.8(c) and (3.8(d)) respectively. For the parameters of OPTICS-Rank, the branching factor and the 54 step limit s for the best-first frontier search is set to 10, the number of leaf nodes for the pivot tree to 5, 000, and the number of top objects to return in each ranking, k, is set to 5, the same as minP ts. The plots generated by OPTICS-Rank are almost identical to those generated by OPTICS, only that some clusters have switched position, which is a normal phenomenon when clustering with OPTICS and which does not affect the clustering accuracy. OPTICSRank uses only a fraction of the total number of distances used by OPTICS. The number of distances computed by OPTICS is 2.5 × 109 for both data-sets, while OPTICS-Rank uses 2.4 × 106 and 2.7 × 106 distances for DS-Vector and DS-Tuple respectively. We used two measurements, N-score and F-score described in the last chapter to compare the clustering accuracy of OPTICS-Rank and OPTICS-Bubble on DS-Vector and DSTuple. The clustering accuracy of N-score on DS-Vector is shown in Figure 3.9(a). OPTICSRank uses a fixed setting as mentioned above while the number of bubbles used by OPTICSBubble varies from 100 to 250. The experiment was repeated 10 times and OPTICS-Rank always succeeds to find all the clusters so that it has a score of 1. This accuracy is consistently better than that of OPTICS-Bubble. The F-scores on DS-Vector are shown in Figure 3.9(b). As the number of bubbles increases, the accuracy of OPTICS-Bubble increases rapidly and stays on a relatively high level (> 0.98). This accuracy, however, consistently laggs behind that of OPTICS-Rank (> 0.999). The corresponding numbers of computed distances by the two algorithms are shown in Figure 3.9(c). As the number of bubbles increases, the number of distances computed by OPTICS-Bubble increases linearly. It uses as many as 12.5 × 106 /2.4 × 106 ≈ 5.2 times the number of distances of OPTICS-Rank. The clustering accuracy on DS-Tuple using N-score and F-score is shown in Figure 3.10(a) and 3.10(b) respectively. OPTICS-Rank uses the same setting as in the previous experiment and the number of bubbles used by OPTICS-Bubble varies from 25 to 250. Similar to the previous experiment, OPTICS-Rank outperforms OPTICS-Bubble and is only matched by the latter when the number of bubbles reaches 200 for N-score and 250 for F-score. The numbers of computed distances for both methods are shown in Figure 3.10(c). It shows that when we use a larger number of bubbles (≥ 250) for OPTICS-Bubble to match the accuracy of OPTICS-Rank, OPTICS-Bubble needs to perform four times the number of distance computations. 
Figure 3.9: Clustering accuracy and performance on DS-Vector: (a) N-score, (b) overall F-score, (c) number of distance computations (millions), versus the number of bubbles.

Figure 3.10: Clustering accuracy and performance on DS-Tuple: (a) N-score, (b) overall F-score, (c) number of distance computations (millions), versus the number of bubbles.

3.5.2 Real-life Data

The first real-life data-set we used, denoted by DS-Protein, consists of 12,010 refined structure models of the NMR structure of Escherichia coli ribosomal protein L25 [43], generated by the protein structure modeling program GaFolder [51]. The distance function we used is the dRMS distance, which is a metric when applied to structure models sharing the same protein sequence (see [29] for a description of dRMS). Another real-life data-set we used, denoted by DS-Jaspar, consists of 73,253 DNA sequence motifs extracted from human chromosome 1 using the transcription factor binding patterns in the JASPAR database [44]. The distance function we used is an adapted edit (Levenshtein) distance over all sequences in two motifs [38]. Since the runtime of OPTICS on these two data-sets is too long for a single PC (each experiment would take a few months), we had to pre-compute the pairwise distances using massive parallel processing, and we report the runtime calculated as follows: runtime = a · d + h, where a is the average runtime of computing one distance, d is the number of distance computations, and h is the algorithm overhead. This estimates the runtime of the algorithms as if they did not have access to the pre-computed resource of all pairwise distances.

Protein Structure Data

The data-set DS-Protein is considered more difficult to cluster than DS-Vector and DS-Tuple; as we can see in Figure 3.11, DS-Protein contains many more clusters. To evaluate the clustering accuracy using the F-score, we defined three cut-lines on the OPTICS reachability plot (y = 0.1, 0.2, 0.3), and for each of these cut-lines, we determined the best corresponding (according to F-score) cut-lines on the OPTICS-Rank and OPTICS-Bubble results from a set of candidate cut-lines.

Figure 3.11: OPTICS output for DS-Protein with three cut-lines (y = 0.1, 0.2, 0.3).

Figures 3.12(a), 3.12(b) and 3.12(c) show the accuracy for the cut-lines y = 0.1, 0.2 and 0.3 on the OPTICS output, as the number of leaves for OPTICS-Rank and the number of bubbles for OPTICS-Bubble range from 100 to 3,000. The other parameters of OPTICS-Rank are set as follows: the branching factor b and the step limit s are set to 10, and the parameters k and minPts are set to 20. The results show that the accuracy of OPTICS-Bubble increases dramatically as the number of bubbles grows from 100 to 1,000, but the curve starts to level off afterwards. The curve of OPTICS-Rank is relatively stable regardless of the number of leaves.
The results also show that OPTICS-Rank has better accuracy than OPTICS-Bubble when the number of leaves/bubbles is small (≤ 200). For larger numbers of leaves/bubbles, the two methods have similar accuracy, with OPTICS-Bubble outperforming OPTICS-Rank on the cut-line y = 0.1 and OPTICS-Rank outperforming OPTICS-Bubble on the cut-lines y = 0.2 and y = 0.3. Nonetheless, both methods achieve high F-scores greater than 0.96.

Figure 3.12: Clustering accuracy on DS-Protein: (a) cut-line y = 0.1, (b) cut-line y = 0.2, (c) cut-line y = 0.3.

The bigger difference between OPTICS-Rank and OPTICS-Bubble lies in their runtime performance. Figure 3.13 shows the performance of the two approximate clustering methods with respect to the number of distance computations and the runtime. For instance, when using 1,000 leaves/bubbles, OPTICS-Rank would run an estimated 15 hours, while OPTICS-Bubble would run 271 hours (> 11 days) and OPTICS would run 1,702 hours (> 70 days), without access to the pre-computed distances.

Figure 3.13: Performance of OPTICS-Rank and OPTICS-Bubble on DS-Protein: (a) number of distance computations, (b) runtime (hours).

Figure 3.14 shows how accurately the rankings in OPTICS-Rank perform. For each data object o, let Dt(o) be the true 20th nearest neighbor distance of o, let Dr(o) be the 20th nearest neighbor distance estimated by OPTICS-Rank, and let Dr(o) = (1 + ε)Dt(o). The distribution of ε in terms of the percentage of data objects is shown in Figure 3.14. As we can see, most objects have a small ε value, which indicates good ranking accuracy. For instance, 75.8% of the objects have an ε value smaller than 0.3, i.e., most objects have an estimated Dr(o) deviating no more than 30% from the true Dt(o) value. This difference can be considered small in a data-set where the standard deviation of Dt(o) is equal to 1.9 times the mean of Dt(o).

Figure 3.14: The distribution of ε values for rankings in OPTICS-Rank.

Figure 3.15: Effect of changing parameter k: (a) clustering accuracy, (b) number of distance computations, (c) overhead.

We also studied the relation between the settings of the OPTICS-Rank parameters and the clustering result. Since the dRMS distance function on DS-Protein is expensive (it takes 0.085 seconds on average to compute one distance), the runtime of OPTICS-Rank is dominated by the distance computations and thus correlates closely with the number of distance computations, as shown in Figure 3.13.
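As a quick sanity check of the runtime model runtime = a · d + h from Section 3.5.2, the OPTICS figure quoted above can be reproduced from the reported numbers. The sketch below neglects the overhead term h and assumes that OPTICS needs essentially all pairwise distances; both are simplifying assumptions made only for this illustration.

```python
def estimated_runtime_hours(a, d, h=0.0):
    """runtime = a * d + h (Section 3.5.2), converted to hours."""
    return (a * d + h) / 3600.0

n = 12010                        # size of DS-Protein
a = 0.085                        # reported average cost of one dRMS computation (seconds)
d_optics = n * (n - 1) // 2      # assuming OPTICS computes essentially all pairwise distances
print(estimated_runtime_hours(a, d_optics))   # roughly 1.7e3 hours, matching the ~1,702 h above
```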
In the remainder of this subsection, we therefore report only the number of distance computations and the algorithm overhead (overhead = total runtime − distance computation runtime). Unless explicitly specified otherwise, in the remainder of this subsection we always set the number of leaves to 1,000 and used the cut-line y = 0.3 on the OPTICS output for the accuracy measurement.

The parameter k denotes the number of top neighbors to return in the ranking for each data object. Figure 3.15(a) shows how changing k from 1 to 100 affects the clustering accuracy, with both the branching factor b and the step limit s set to 10. When k is small (< 5), the accuracy is relatively less stable. As k increases, the accuracy also increases until it levels off at k = 50. Using a larger k can improve the accuracy, but at the price of increasing the number of distance computations as well as the algorithm overhead, as shown in Figure 3.15(b) and Figure 3.15(c) respectively. Figure 3.15(b) shows that the number of distance computations grows almost linearly with k, while Figure 3.15(c) shows that the overhead increases rapidly when k is small but grows more slowly after k reaches 10. Note that the overhead of OPTICS-Rank is small compared with the total runtime of the algorithm: the overhead is on the order of tens of seconds, while the total runtime of OPTICS-Rank is in the tens of hours because of the expensive distance computations. Overall, Figure 3.15 indicates that a k value between 5 and 50 provides the best balance between accuracy and runtime.

The results of changing the branching factor b from 2 to 100 with s = 10 are shown in Figure 3.16. When b increases, Figure 3.16(a) shows that the clustering accuracy stays around an F-score of 0.98, but the number of distance computations and the overhead tend to go up, with the optimal point around b = 5.

Figure 3.16: Effect of changing branching factor b: (a) clustering accuracy, (b) number of distance computations, (c) overhead.

Figure 3.17 shows the results of changing the step limit s from 1 to 5,000 with b = 10. By the design of the algorithm, increasing the step limit does not affect the number of distance computations, so the corresponding plot is omitted. Figure 3.17(a) shows that the step limit affects the accuracy only slightly, but increasing it increases the overhead significantly, as shown in Figure 3.17(b). Our results also show that increasing s does not always improve the accuracy: as we can see in Figure 3.17(a), the run with s = 5,000 actually has slightly lower accuracy than the run with s = 500. One explanation is that some pivots might behave significantly differently from the others, and their scores bring down the ranking of some close neighbors. In that case, pushing the search limit further to incorporate the scores of these pivots might not improve the accuracy.

In our last experiment on DS-Protein, we reduced the size of the data-set to study the scalability of the OPTICS-Rank algorithm. We used the first portion of DS-Protein, varying the size from 2,000 to 12,000 objects.
Figure 3.18 shows the good scalability of OPTICS-Rank: the number of distance computations, the runtime and the overhead all grow linearly with respect to the size of the data-set.

Figure 3.17: Effect of changing step limit s: (a) clustering accuracy, (b) overhead.

Figure 3.18: Scalability with respect to the size of the data-set: (a) number of distance computations, (b) runtime, (c) overhead.

JASPAR Data

The clustering results on DS-Jaspar are depicted in Figure 3.19. While using 1,000 times fewer distances, OPTICS-Rank generates a plot that captures the same cluster structure as the output of the original OPTICS (with some switching of cluster positions due to the property of the OPTICS clustering algorithm). We applied the F-score to measure the accuracy numerically. As we can see in Figure 3.19, the diverse levels of reachability distances between clusters make it difficult to extract the clusters using cut-lines, so we had to extract them manually from both output plots (98 clusters for OPTICS and 101 clusters for OPTICS-Rank). Each cluster in the OPTICS output is then matched to the cluster in the OPTICS-Rank output that has the highest F-score. The F-score distribution of the matched clusters is shown in Figure 3.20. It shows that the majority of the clusters in the OPTICS output can be matched with a cluster in the OPTICS-Rank output with a high F-score of more than 0.95. The overall F-score is 0.87.

Figure 3.19: Clustering results on DS-Jaspar: (a) OPTICS output, using 5.4 billion distances; (b) OPTICS-Rank output, using 4.9 million distances.

Figure 3.20: Distribution of F-scores. The overall F-score is 0.87.

3.6 Summary

In this chapter, we proposed a novel approach to perform approximate clustering with high accuracy. We introduced a novel pairwise hierarchical ranking method to efficiently determine close neighbors for every data object. We also proposed a frontier search, rather than the sequential scan of the naïve ranking, to reduce the search space, as well as a heuristic that approximates the frontier search but further reduces the runtime. Empirical results on synthetic and real-life data show the high efficiency and accuracy of our method in combination with OPTICS, obtaining a speed-up of up to two orders of magnitude over OPTICS while maintaining a very high accuracy, and of up to one order of magnitude over Data Bubbles combined with OPTICS while obtaining a more accurate result.

Part II
Speed-up Similarity Search

Chapter 4
The Concept of Virtual Pivots for Similarity Search

Performing efficient k-nearest neighbor (k-nn) search in many real-world metric data-sets is a very challenging problem, e.g., given a new protein sequence or structure, efficiently finding the most similar protein(s) in a protein database using computationally expensive dissimilarity measures such as sequence or structure alignments.
In this chapter, we propose a new algorithm to efficiently perform k-nn search in metric data based on triangle inequality pruning using pivot objects. Our method is based on a formal theoretical analysis of the pruning ability of pivots. We introduce the novel concept of virtual pivots and show that a single virtual pivot can have the same pruning power as a large set of fixed pivots. Any database object can act as a virtual pivot, and virtual pivots can be dynamically selected, without imposing a quadratic space complexity as in previously proposed dynamic pivot selection schemes. We also propose and analyze a method for boosting the performance of virtual pivots by a method that selects objects close to the query object with high probability (if they exist). In this chapter, we will focus on the theoretical analysis of virtual pivots and leave the testing on real-life data to the next chapter. The remainder of the chapter is organized as following. In the next section, we review related work. In Section 4.2, we analyze the pruning ability of pivots. In Section 4.3 we introduce virtual pivots. We describe boosting of virtual pivots in Section 4.4. In Section 4.5, we discuss the advantages of using virtual pivots. Section 4.6 describes our algorithm of virtual pivot, and Section 4.7 summarizes this chapter. 68 4.1 Related Work Similarity search in general metric spaces where only a metric distance function is available, uses so-called pivots (also called reference objects or foci in some papers) and the triangle inequality of the distance function to speed up similarity search. We can distinguish between two basic approaches: (1) methods that organize the pivots in a hierarchical structure and (2) methods that use a non-hierarchical approach to organize and use the pivots. 4.1.1 Hierarchical Approaches The first proposal for a hierarchical (tree-based) pivot structure is probably the BurkhardKeller Tree [10] (BKT) for indexing discrete-valued distances. The root node of a BKT is an arbitrary data object. The data space is partitioned by grouping data objects with the same distance to this root node object into a child node, and each child node is recursively partitioned in the same way. The objects in the nodes of the tree are called pivots. Later, BKT has been improved by other methods. For instance, Fixed Queries Tree (FQT) [5] reduces the number of distance computations during search by choosing the same pivot for all nodes on the same level of a tree; the method Vantage-Point Tree (VPT) [46, 53] extends the approach to handle continuous distance functions; the M-tree [13] introduces a disk resident version of VPT; the SA-tree [34] uses spatial approximation inspired by the Voronoi diagram to reduce the number of backtracks in the tree. For a comprehensive survey on tree-based k-nn search methods, see Ch´avez et al. [12] and Hjaltason and Samet [25]. Two of the most famous tree pivot structures are the M-Tree [13] and SA-Tree [34], which are further described below. The M-tree [13] method splits the data-set into spheres represented by nodes. Each node can hold a limited number of objects, and (except the root node) is led by a pivot (called routing object in [13]) which is stored in the parent node. Every pivot maintains a covering radius that is the largest distance between itself and all the data objects in its covering tree, which is the subtree rooted at the node led by this pivot. 
During the M-tree construction stage, when a new data object arrives, its distances to pivots stored in the root node are computed and it is inserted into the most suitable node (i.e., led by the closest pivot from the root node) or the node that minimizes the increase of the covering radius, and subsequently descended down to a leaf node. If the size of a node reaches the prespecified limit, this node is split into two new nodes with their pivots elected accordingly and inserted into their parent node. In the query stage, M-tree uses the covering radius to 69 perform triangle inequality pruning. For a node led by pivot p with covering radius r(p), the distance between any data object o in this node and query q, D(q, o), is lower bounded by D(q, p) − r(p) [13]. SA-Tree [34] uses spatial approximation inspired by the Voronoi diagram to reduce the number of visited subtrees during the query stage. In SA-Tree, every node corresponds to a database object (unlike in M-Tree). In the SA-tree construction stage, first, a random data object is chosen as the root and a set of data objects from the whole database are selected as its neighbors. Any object in this neighbor set must be closer to the root than to any other object in this neighbor set. Each neighbor object then starts a subtree of the root and the process goes on recursively. In the query stage, besides using the covering radius to perform triangle inequality pruning the same as in M-Tree, SA-Tree also uses a property derived from the way the tree is constructed. That is, if p1 and p2 are a pair of parent-child nodes (serving as pivots) visited by the query process, then for any data object o in the neighbor set of node p2 , D(o, p1 ) > D(o, p2 ) and thus 12 (D(q, p2 ) − D(q, p1 )) is a lower bound of D(q, o) [34]. A common problem in tree-based k-nn search methods is that close objects always have certain probability of being stored into different branches and this probability goes up quickly as the difficulty of indexing the database increases (e.g., when the dimensionality increases in vector data). These methods then have to search many subtrees to retrieve all close neighbors of the query object, which will involve a large overhead and a large number of distance computations. Another disadvantage of tree-based methods is that there is no easy way to adjust the pruning ability, when allowed, through varying the database preprocessing effort. Results in Weber et al. [50] and Jagadish et al. [26] have shown that tree-based k-nn search methods typically do not work well in very high dimensional metric data. 4.1.2 Non-hierarchical Approaches Pivots can also be selected and organized into a non-hierarchical structure. Non-hierarchical k-nn search methods [40, 33, 20, 11, 37, 15] select in the pre-computing stage a set of pivots and compute the distance between each pivot and every database object; during query processing, all the pivots are used collectively to prune objects from the k-nn candidate list using the triangle inequality. That is, any object whose estimated lower bound is larger than the k-th smallest estimated upper bound cannot be among the k nearest neighbors of the query, and thus can be pruned away. 70 The na¨ıve way to select pivots is to pick them randomly. Some researchers preferred using pivots that are far away from all the other database objects or pivots that are far away from each other [40, 33]. 
One can also choose not to select pivots in batch mode (i.e., select all the pivots at the same time), but incrementally: Filho et al. [20, 45] proposed to select pivots so that they are far away from each other and at the most similar pairwise distances to each other; Bustos et al. [11] proposed to select pivots to maximize the mean of the pairwise distances between pivots; Rico-Juan and Mic´o [37] proposed a similar method LAESA, to be further described below. One recent work by Digout et al. [15] pre-selects a large set of pivots but uses only some of them to do the actual pruning in the query stage. However, this method relies on the distances between the query object and all the pivots, which are computed directly, to determine which subset of the pivots to be used for pruning purpose. Therefore, when the distance function is very expensive, this method cannot afford to use a large number of pre-selected pivots. LAESA [37] is a nearest neighbor search algorithm, which can be easily modified to perform k-nn search (as we did in this work). In the preprocessing stage, a set of fixed pivots are selected and the distance between each pivot and every data object is computed. To select the pivots, LAESA uses a method called maximum minimum distances [37], to avoid selecting pivots that are close to each other. The first pivot is selected randomly. The other pivots are selected iteratively, with the next pivot always being the object that has the maximum distance to the set of already selected pivots. In the query stage, LAESA computes the distance between the query object and a fixed pivot with the lowest lower bound estimation, updates the upper bound δ of the nearest neighbor distance and prunes away objects with a lower bound estimation greater than δ. If there are no more fixed pivots to use, it computes the distances to data objects that have not been pruned yet, one at a time, until the nearest neighbor is found. During this process, each computed distance is used to update δ and to further prune objects. All the methods we mentioned so far have spent minimal effort in the pre-computing stage to build an index structure. Their numbers of pre-computed distances are typically in the level of O(n log n) where n is the size of the data-set. In contrast, another branch of methods spend more effort on pre-computing distances (up to all pairwise distances), in order to obtain a higher pruning ability in the query stage. This approach can be dated back to the 1980s, when Vidal proposed the method AESA [48, 49]. This method pre-computes the distances between all objects in the data-set. When performing k-nn search, an object is selected randomly as the first pivot, and the distance between the query object and the 71 pivot is computed. Using the triangle inequality, objects are then pruned, if possible. If the number of remaining objects is still large, another object with the smallest lower bound distance estimation computed by the triangle inequality is selected as pivot and the process is repeated. Shasha and Wang [41] proposed a method that is similar to AESA in that it tries to exploit all pairwise distances for pruning. In contrast to AESA, however, they assumed that only a subset of all the distances is computed in a first step. Then, a θ(n3 ) time algorithm using two matrices of size n by n is applied to compute the tightest lower and tightest upper bound estimations for unknown distances, given the known distances. 
These bounds are calculated using a generalized form of the triangle inequality considering more than three objects at a time. This pre-computation step is itself computationally very expensive and replacing actual distances with approximate bounds is only worthwhile, if the cost for computing the pairwise distances would be even higher, i.e, for this method we have to assume that the distance function is computationally extremely expensive. The reason why approaches that try to exploit all pairwise distances have been much less popular than approaches using fixed pivots is that they all use quadratic space in the pre-computing and the search stage since the matrices computed in the pre-computing stage which store the distances or the estimated bounds are all needed. “This is unacceptably high for all but very small databases” [12]. 4.2 The Pruning Ability of Pivots For the following analysis of similarity search in metric spaces, we assume a data-set B, a metric distance function D(., .), a query object q ∈ / B, and a fixed set of pivots P = {p|p ∈ B}, for which all distances between database objects and pivots have been pre-computed. Furthermore, we will assume a metric vector space to be able to do a mathematical analysis involving volumes of regions in the data space. Although the notion of volume is not available in general metric data, the conclusions that we derive from the analysis apply approximately to general metric data as well, since it is possible to map a general metric space into a vector space so that the distances between all objects are approximately preserved. Only the properties of those distances and their distribution are relevant for the effectiveness of triangle inequality pruning using pivots. The basic approach to similarity search in metric spaces using pivots is the following. First, compute the distance between the query object and some pivot(s). Some approaches 72 use the pivots in P in some specific order, while others may use a subset or the whole set of pivots “simultaneously”. These differences in algorithms only affect the runtime of the pruning but not the total number of distance computations, if eventually all pivots in P are used. Utilizing the pre-computed distances between the pivots in a currently selected subset P ′ ⊆ P and objects o ∈ B, we can estimate (w.r.t. P ′ ) a largest lower bound lD(o,q) = maxp∈P ′ {|D(q, p) − D(o, p)|}, and a smallest upper bound uD(o,q) = minp∈P ′ {D(q, p) + D(o, p)} of D(q, o) between the query object q and every database object o, i.e., lD(o,q) ≤ D(o, q) ≤ uD(o,q) . These bounds are used in both range search and k nearest neighbor search to prune objects from the list of candidate query answers. For a range search, i.e., for finding all objects within a distance δ from the query q, the lower bound lD(o,q) can be used to exclude objects o for which the lower bound of the distance to q is already larger than the query range, i.e., lD(o,q) > δ. For a k-nn search, i.e., for finding the k closest objects to the query q, typically an iterative scheme is applied in which at every step pruning is done as for a range query, but with increasingly smaller tentative ranges δ around q. For instance, using the current estimated upper bounds uD(o,q) , a tentative range δ can be found by selecting the kth smallest upper bound distance estimation. Any object whose lower bound estimation is already larger than the k-th smallest upper bound estimation cannot be a true k nearest neighbor of query object q. 
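To make this basic scheme concrete, here is a minimal sketch of the candidate pruning for a k-nn query; the names (knn_candidates, q_to_pivots, d) are hypothetical and the code is illustrative only. The surviving candidates would then be refined with actual distance computations.

```python
def knn_candidates(q_to_pivots, objects, pivots, d, k):
    """Keep only objects whose lower bound can still beat the k-th smallest upper bound.
    q_to_pivots[p] : computed distance D(q, p) for every pivot p
    d[(o, p)]      : pre-computed distance D(o, p)"""
    lower, upper = {}, {}
    for o in objects:
        lower[o] = max(abs(q_to_pivots[p] - d[(o, p)]) for p in pivots)   # l_D(o, q)
        upper[o] = min(q_to_pivots[p] + d[(o, p)] for p in pivots)        # u_D(o, q)
    threshold = sorted(upper.values())[k - 1]        # k-th smallest upper bound
    return [o for o in objects if lower[o] <= threshold]
```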
Such pruning for similarity queries does not work well when both the query q and the database objects o are approximately equally far away from all the pivots. The lower bounds will be almost zero and the upper bounds will be very large, so that lD(o,q) of almost every object would be smaller than most query ranges, in particular the pruning ranges for k-nn search derived from upper bound estimations. In those cases, the resulting k-nn candidate list will contain most of the database objects. In the following subsections, we will analyze the pruning ability of pivots for range and k-nn queries, by studying the ability of pivots to exclude objects from a given query radius (fixed for a range query, derived for k-nn queries).

4.2.1 One Pivot versus Several Pivots

To improve pruning ability, one straightforward approach is to increase the number of pivots. The intuition behind this approach is illustrated in Figure 4.1(a) for a 2-d data example using two pivots for a query object q with query radius δ. For each pivot p, by the triangle inequality, any object that can be potentially located within the radius δ around q must have a distance satisfying D(q, p) − δ ≤ D(q, o) ≤ D(q, p) + δ, which means that these data objects fall into a region bounded by a (hyper-)ring centered at p. When combining more pivots, the region for possible locations of objects is the intersection of the corresponding (hyper-)rings (indicated by darker color in Figure 4.1(a)). Intuitively, the more pivots are used, the smaller the volume of this intersection.

Figure 4.1: Methods to enhance pruning ability: (a) more pivots (s = D(q, p2)); (b) a closer pivot (s = D(q, p)).

4.2.2 Random Pivots versus Close Pivots

To improve pruning ability, instead of increasing the number of pivots, we can alternatively try to select pivots more carefully. Ideally, we would like to select pivots close to the query object. The closer a pivot to the query object, the better its pruning ability. This fact is illustrated in Figure 4.1(b) for a 2-d example: for pivot p, the hyper-ring associated with the closer query object q′ has a much smaller volume than the hyper-ring for the query object q which is further away from p. The following lemma shows that this increase in volume when increasing D(q, p) is by a factor more than linear if the dimensionality d > 2, which means that the pruning ability of a pivot deteriorates rapidly with increasing distance to the query object in higher dimensional spaces.

LEMMA 4. In a d-dimensional metric space, d > 2, let q be a query object with query radius δ. Without considering boundary effects, the volume of the region that cannot be pruned via the triangle inequality using an object p as a pivot grows super-linearly as s = D(q, p) increases.

Proof. If s > δ, the region that cannot be pruned away is a hyper-ring defined by the inner radius s − δ and the outer radius s + δ. If s ≤ δ, the region is a hypersphere of radius s + δ. Since the volume of a d-dimensional hypersphere with radius R can be computed as $V_{hypersphere}(R, d) = \pi^{d/2} R^d / g(d)$, where $g(d) = \Gamma(1 + \frac{d}{2})$ and Γ(.) is the gamma function, the volume of the region that cannot be pruned away is

$$V_r = \begin{cases} V_{hypersphere}(s+\delta, d) - V_{hypersphere}(s-\delta, d) & \text{if } s > \delta \\ V_{hypersphere}(s+\delta, d) & \text{if } s \le \delta \end{cases} = \begin{cases} \pi^{d/2}\big((s+\delta)^d - (s-\delta)^d\big)/g(d) & \text{if } s > \delta \\ \pi^{d/2}(s+\delta)^d/g(d) & \text{if } s \le \delta \end{cases}$$

It is easy to see that V_r grows super-linearly as s increases, since the second derivative of V_r with respect to s is greater than zero in both cases (s > δ and s ≤ δ) if d > 2.
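A small numeric illustration of Lemma 4 (a sketch, not part of the thesis): evaluating V_r for a few values of s in a 50-dimensional space shows how quickly the un-prunable volume grows with the pivot-query distance.

```python
import math

def v_hypersphere(r, d):
    """Volume of a d-dimensional hypersphere of radius r: pi^(d/2) * r^d / Gamma(1 + d/2)."""
    return math.pi ** (d / 2) * r ** d / math.gamma(1 + d / 2)

def v_not_pruned(s, delta, d):
    """Volume of the region that a pivot at distance s from the query cannot prune (Lemma 4)."""
    if s > delta:
        return v_hypersphere(s + delta, d) - v_hypersphere(s - delta, d)
    return v_hypersphere(s + delta, d)

d, delta = 50, 0.1
for s in (0.2, 0.4, 0.8):
    print(s, v_not_pruned(s, delta, d))   # grows far faster than linearly in s
```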
In the following, we present a systematic comparison between using a set of random pivots and using closer pivots in terms of pruning ability. We leave the discussion of how to actually find or select close pivots to the next section.

DEFINITION 15 (d-space). A data space where the objects are uniformly distributed in a d-dimensional hypersphere of radius a is denoted by d-space(a). If a = 1 (unit radius), the space is denoted by d-space.

To measure the pruning ability of pivots, we define the following measure, called Random Pivot Power (RPP).

DEFINITION 16 (RPP). When using the triangle inequality to perform pruning, if the probability of a random object o not being pruned away using a particular pivot p is equal to the probability of o not being pruned away using m random pivots, we say that pivot p has a pruning ability of m RPP (Random Pivot Power).

LEMMA 5. In a d-space(a), when performing a range query of radius δ, the asymptotic probability P(m) of a random object not being pruned away by m random pivots is

$$P(m) = \Big(\mathrm{erf}\Big(\frac{\delta}{a}\sqrt{\frac{d}{2}}\Big)\Big)^m,$$

where erf is the Gauss error function.

Proof. Denote the set of random pivots by P and let x and y be two random variables following the distance distribution within a d-space(a). Since the prunings by different random pivots are independent,

$$P(m) = \prod_{p_i \in P} Pr\big(|D(q, p_i) - D(o, p_i)| < \delta\big) = \big(Pr(|x - y| < \delta)\big)^m.$$

When d → ∞, both x and y follow a normal distribution N(µ, σ²), where $\mu = a\sqrt{2}$ and $\sigma^2 = \frac{a^2}{2d}$ [23]. Since x and y are independent of each other, z = x − y also follows a normal distribution, with µ = 0 and $\sigma^2 = \frac{a^2}{d}$. Thus,

$$P(m) = \Big(\int_{-\delta}^{\delta} \sqrt{\frac{d}{2\pi a^2}}\, e^{-d\,t^2/(2a^2)}\, dt\Big)^m = \Big(\mathrm{erf}\Big(\frac{\delta}{a}\sqrt{\frac{d}{2}}\Big)\Big)^m.$$

Using Lemma 5, we can estimate the pruning ability of a particular pivot, depending on its distance to the query object, as stated in the following theorem.

THEOREM 2. For a query of radius δ in a d-space, let the asymptotic pruning ability of a pivot p with s = D(q, p) be m RPP; then

$$m \ge \begin{cases} \ln\big((s+\delta)^d - (s-\delta)^d\big) \,/\, \ln\big(\mathrm{erf}(\delta\sqrt{d/2})\big) & \text{if } s > \delta \\ \ln\big((s+\delta)^d\big) \,/\, \ln\big(\mathrm{erf}(\delta\sqrt{d/2})\big) & \text{if } s \le \delta. \end{cases}$$

Proof. Ignoring boundary effects, the probability of a random object not being pruned away by this single pivot is

$$P' = \begin{cases} \dfrac{V_{hypersphere}(s+\delta, d) - V_{hypersphere}(s-\delta, d)}{V_{hypersphere}(1, d)} & \text{if } s > \delta \\[2mm] \dfrac{V_{hypersphere}(s+\delta, d)}{V_{hypersphere}(1, d)} & \text{if } s \le \delta \end{cases} = \begin{cases} (s+\delta)^d - (s-\delta)^d & \text{if } s > \delta \\ (s+\delta)^d & \text{if } s \le \delta, \end{cases}$$

where V_hypersphere(R, d) is the volume of a d-dimensional hypersphere with radius R and can be computed as in the proof of Lemma 4. If we consider boundary effects, P′ will be even smaller, since some of the volume of the hyper-ring will fall outside the data space. To estimate the pruning ability of one close pivot, let P′ = P(m), where P(m) is the probability computed in Lemma 5 for a = 1. By a simple calculation, we obtain the lower bound of m as stated in the theorem.

The lower bound of m is plotted as a function of s and δ in Figure 4.2.

Figure 4.2: The lower bound of m (δ ≤ 0.3): (a) d = 50; (b) d = 100.

In the 2-d illustration of Figure 4.1, it is quite obvious that in most cases using two pivots, even if they are randomly selected, will give better pruning ability than using a single close pivot in a 2-d space. However, as the dimensionality increases to 50, we need as many as 1,000 random pivots in order to get the same pruning ability as a single close pivot. When the dimensionality reaches 100, the number of needed pivots can be as high as 40,000. This analysis shows that when a pivot is close to the query, it may have a tremendous pruning ability, equivalent to using thousands of random pivots.
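The bound of Theorem 2 can be evaluated directly; the following sketch (illustrative only, with s and δ chosen arbitrarily within the range of Figure 4.2) shows how steeply the required number of random pivots grows with the dimensionality.

```python
import math

def rpp_lower_bound(s, delta, d):
    """Lower bound on m from Theorem 2: how many random pivots a single pivot at
    distance s = D(q, p) is worth, for a query radius delta in a unit-radius d-space."""
    if s > delta:
        numer = math.log((s + delta) ** d - (s - delta) ** d)
    else:
        numer = math.log((s + delta) ** d)
    denom = math.log(math.erf(delta * math.sqrt(d / 2)))
    return numer / denom

for d in (50, 100):
    # example values: s = 0.1, delta = 0.3; the bound is on the order of 10^3 for d = 50
    # and 10^4 for d = 100, consistent with the discussion above
    print(d, rpp_lower_bound(s=0.1, delta=0.3, d=d))
```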
The crucial question is, 76 m 1800 1600 1400 1200 1000 800 600 400 200 00 1800 1600 1400 1200 1000 800 600 400 200 0 0.3 0.2 0.2 delta 0.1 0.4 0.6 s = D(q,p) 0.8 1 0 (a) d = 50 45000 40000 35000 30000 m 25000 20000 15000 10000 5000 00 45000 40000 35000 30000 25000 20000 15000 10000 5000 0 0.3 0.2 0.2 0.1 0.4 0.6 s = D(q,p) 0.8 delta 1 0 (b) d = 100 Figure 4.2: The lower bound of m (δ ≤ 0.3). however, how to select close pivots for all possible queries, when the number of pivots are limited? Assuming that query objects follow the same distribution as the objects in the database, each database object that could act as a pivot is typically close to only a very small number of query objects. In a d-space, the asymptotic distribution of the pairwise distances is a √ 1 normal distribution N ( 2, 2d ). For a random object, the portion of neighbors within a radius of δ can be computed as Z 0 δ r d −d(t−√2)2 e dt. π 77 The values for three different dimensionalities are plotted in Figure 4.3. It shows that most neighbors are uniformly far away. Therefore, any object that is selected as a pivot, is only close to a few neighbors and only proportionally many queries. For the vast majority of objects, it behaves in the same way as a random pivot. Portion of neighbors within radius 1 d=50 d=100 d=200 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.5 1 1.5 2 Radius Figure 4.3: Portion of neighbors within a δ radius. In the traditional schemes of using a fixed number of pivots, this number is usually very small compared to the number of objects in the database. If there is no prior knowledge about the possible locations query objects, it is very unlikely that a pre-selected pivot will be close to a query and can perform efficient pruning. To overcome this problem, it has been proposed to pre-compute or estimate all pairwise distances between data objects, and use this complete knowledge on relations between objects in the data-set to find close pivots. Methods using this approach include AESA [48, 49] and the Shasha-Wang’s method [41]. However, this approach suffers the serious drawback of using quadratic space as well as quadratic or even cubic time complexity. Therefore, although these methods have been shown to work very well on small data-sets, they are not very practical and therefore very few methods are based on this approach. 4.3 Virtual Pivots In this section, we propose a new class of pivots called virtual pivots that exhibit the pruning power of close pivots without suffering quadratic space and time complexity. We first compute the distances between a relatively small number of fixed pivots and all the other data objects, as in the traditional approaches using a fixed number of pivots. Then, we will use these pre-computed distances to approximate the pruning of close pivots using the concept of a virtual pivot. 78 All data objects except the randomly pre-selected fixed pivots are potential virtual pivots, which are selected at query time from the data-set, based on the given query. In Section 4.4 we will discuss a method for selecting objects from the database as virtual pivots so that they are close to the query object with high probability (if close objects exist). In this section, we describe how virtual pivots can be used to prune objects, which is different from the traditional pruning scheme and requires the following simple lemmata. LEMMA 6 (Shortest Path Property of a Metric). 
For a set of n objects with a metric distance function, if we map the objects to the nodes of a complete graph G of size n and set the length of every edge to be the distance between the corresponding two objects, then for any two nodes u and v in G, the shortest path between u and v is the edge uv. Proof. By induction and the triangle inequality. If some of the pairwise distances are unknown, then the corresponding edges will be missing. In this case, we can use the lemma below to estimate the upper and lower bounds of those unknown distances. LEMMA 7 (Upper and Lower Bounds). For a set of n objects with an incomplete distance matrix M , assume that the objects are mapped to the nodes of an undirected graph G of size n and two nodes (x, y) are connected with an edge of length D(x, y) if the distance D(x, y) is in M . Let |.| be the length function. For any two nodes u and v, if D(u, v) ∈ / M, then for any path between u and v in G (if it exists): |path(u, v)| is an upper bound for D(u, v); and 2|st| − |path(u, v)| is a lower bound for D(u, v), where st is any edge on path(u, v). Proof. The upper bound follows directly from Lemma 6. Also by Lemma 6, |st| = D(s, t) ≤ |path(u, v)| − |st| + D(u, v). Therefore, D(u, v) ≥ |st| − (|path(u, v)| − |st|) = 2|st| − |path(u, v)|. To understand the pruning using virtual pivots, consider, for example, in nearest neighbor search a query object q that is close to another data object v that has been selected as virtual pivot (and is also a candidate for the nearest neighbor since it is close to q). Any data 79 p1 Close to 180 degree q v p2 o p3 (a) Using fixed pivots that are almost in a line with virtual pivot v and data object o to help pruning. Either p1 or p2 can be used. p1 q v p1 o (b) Using virtual pivots. D(q, v) is computed to rule out o. q v o (c) Traditional schemes compute D(q, p1 ), but not D(q, v). Figure 4.4: Illustration of virtual pivots. Dashed lines represent distances to be estimated. object o that is farther away from v than q could be easily pruned away if v would be a pivot for which the distance D(v, o) would have been pre-computed. If the distance D(v, o) is unknown, then we can no longer use the triangle inequality to do the pruning. However, we observe that if there is one of the pre-selected fixed pivots in a suitable position, then we can use the above lemma to try to exclude object o. An example that illustrates pruning in this nearest neighbor search is depicted in a 2-d example of Figure 4.4(a). There are several fixed pivots, p1 , p2 and p3 , and there is an object v close to q. p1 and p2 are quite far away from v and o, and they are almost a line with v and o. Another observation is that the estimations |D(o, p1 ) − D(p1 , v)| and |D(o, p2 ) − D(p2 , v)| are quite close to the actual distance D(o, v). Fixed pivot p3 , on the other hand, is almost orthogonal to the line between v and o, and the difference between D(o, p3 ) and D(v, p3 ) is very small. We observe that the fixed pivots p1 and p2 but not p3 can be used to help pruning o. Suppose we are using pivot p1 , as illustrated in Figure 4.4(b). By Lemma 7, we can estimate D(q, o) using |D(o, p1 ) − D(p1 , v)| − D(v, q) ≤ D(q, o). (4.1) Since o is much farther away from v than q, and |D(o, p1 ) − D(p1 , v)| is close to the actual D(o, v), the lower bound of D(q, o) given in above formula will be larger than D(v, q). Therefore, we can exclude o as a candidate of the nearest neighbor using this virtual pivot. 
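A sketch of this pruning step for the nearest-neighbor case, based on Formula (4.1), is given below; the k-nn case would simply use the current k-th smallest upper bound as the radius. The function and variable names are hypothetical, d holds the pre-computed distances to the fixed pivots, and dist denotes the expensive metric (computed here only between q and v).

```python
def virtual_pivot_lower_bound(o, v, q_to_v, fixed_pivots, d):
    """Lower bound on D(q, o) from Formula (4.1):
    max over fixed pivots p of |D(o, p) - D(p, v)|, minus D(v, q)."""
    best_est = max(abs(d[(o, p)] - d[(v, p)]) for p in fixed_pivots)
    return best_est - q_to_v

def prune_with_virtual_pivot(q, v, candidates, fixed_pivots, d, dist, delta):
    """Drop every candidate whose lower bound already exceeds the current radius delta
    (e.g. the best nearest-neighbor distance found so far)."""
    q_to_v = dist(q, v)                       # the only new distance computation
    delta = min(delta, q_to_v)                # v itself is also a nearest-neighbor candidate
    kept = [o for o in candidates
            if o == v or virtual_pivot_lower_bound(o, v, q_to_v, fixed_pivots, d) <= delta]
    return kept, delta
```

In this sketch the tightest usable fixed pivot is found by scanning all of them, which corresponds to the naïve variant discussed below; the B+-tree or sorted-list organization would replace that scan.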
Our pruning scheme is different from the traditional scheme of using fixed pivots, as illustrated in Figure 4.5 for a query q and a random object o. The traditional scheme always computes the distances between the query object and the fixed pivots, all of which are likely to be very large and have a low probability of excluding o. In our scheme, only the distances between the query and some virtual pivots will be computed. If we are able to select virtual pivots close to the query, these distances will be short and generate a tight upper bound for the nearest neighbor distance, since the virtual pivots are part of the database and hence are also candidates for the nearest neighbors.

Figure 4.5: Pruning schemes.

Although the above discussion used a nearest neighbor search example, the arguments apply to k-nn search as well: we can simply replace the nearest neighbor distance with a k-nearest neighbor distance upper bound, and the pruning is otherwise done in the same way.

In our scheme, we assume that the set of fixed pivots that supports the virtual pivots is selected so that the fixed pivots are in general far away from each other (which can be achieved in high-dimensional space by simply selecting them randomly). Given a set of fixed pivots and a virtual pivot v, for effective pruning with v the question is now how to choose one of the fixed pivots (like p1 in the example) that can be used in combination with v to prune objects. A straightforward solution is suggested by Formula (4.1): ideally, the fixed pivot we want to find should give the tightest lower bound for D(o, v) using |D(o, p1) − D(p1, v)|. All distances needed to solve this problem are pre-computed, and in a naïve way we can simply scan the set of distances to the fixed pivots to find the tightest bound for a query object q. Similar to proposals such as in [20], we can also store the distances associated with every fixed pivot in a B+-tree and perform a range query to find distances like D(o, p1) that can be used to prune objects. If the data-set does not have frequent object deletions and insertions, we can also use sorted lists (or sorted files if the main memory is not large enough to store the distances).

4.3.1 The Pruning Ability of Virtual Pivots

For the following analysis of the pruning ability of virtual pivots, we assume that we have no prior knowledge of the data, only that we have x fixed pivots that are randomly distributed.

THEOREM 3. In a d-space, assume a query radius δ around a query object q. For any virtual pivot v, supported by x fixed pivots, let the asymptotic pruning power of v be y RPP, and let s = D(v, q); then

$$y = f(s, \delta) \cdot x, \qquad f(s, \delta) = \frac{\ln\big(\mathrm{erf}((\delta + s)\sqrt{d/2})\big)}{\ln\big(\mathrm{erf}(\delta\sqrt{d/2})\big)}. \qquad (4.2)$$

Proof. Any object o with D(q, o) ≤ δ satisfies Formula (4.1) by Lemma 7. If v were a true pivot, so that all distances between database objects and v were known, then we would be able to exclude all objects except those in the hyper-ring centered at v that contains the sphere centered at q with radius δ (see Figure 4.1(b)). But since v is not a true pivot, we only approximate this region using the fixed pivots, in the following precise sense. Formula (4.1), which is used for the pruning, can be rewritten as |D(o, p1) − D(p1, v)| ≤ D(q, o) + D(v, q). Therefore, since D(q, o) ≤ δ, for all fixed pivots pi, o must satisfy |D(o, pi) − D(pi, v)| ≤ δ + D(v, q).
This means that the original problem of finding objects within distance δ around q using v can be transformed into the problem of finding objects within distance δ + D(v, q) around v using the x fixed reference points. Since the distances between v and all the fixed pivots follow the same distribution as the pairwise distances in the d-space, v can be considered a random query object in this transformed problem. Consequently, the region that cannot be excluded in the original problem is approximated by the intersection of the hyper-rings centered at the fixed pivots. Therefore, the pruning ability of v in the original problem can be estimated as the pruning ability of x random pivots for a query radius δ + D(v, q) and a random query object.

By Lemma 5, the asymptotic probability of a random object not being pruned by y random pivots, for a query object q with query distance δ (by assumption the same as using v for the original problem), is P(y) = (erf(δ·√(d/2)))^y. With s = D(v, q), the corresponding probability using the x fixed pivots in the transformed problem, for query object v with query distance δ + s, is P(x) = (erf((δ + s)·√(d/2)))^x. To estimate the pruning ability y of a single virtual pivot v supported by the x fixed pivots, we set P(x) = P(y). Then y can be obtained by simple transformations, as stated in the theorem (Formula 4.2).

The factor f(s, δ) essentially indicates how much of the pruning ability of the x fixed pivots is "transferred" to a single virtual pivot. The factor f(s, δ) is plotted for different d-spaces (d = 50, 100) in Figure 4.6; the plots show a similar pattern. When s is small, f is close to 1, which means the single virtual pivot has almost the same pruning ability as the x random pivots. Note that this also means that we save x − 1 distance computations in the query stage, since we only compute the distance between the virtual pivot and the query object, and not, as in traditional schemes, the distances between all fixed pivots and q.

[Figure 4.6: The factor f(s, δ) as a function of s = D(q, p) and the query radius, for (a) d = 50 and (b) d = 100.]

What about more than one virtual pivot? Assume we have z virtual pivots, selected randomly and independently. Since each virtual pivot performs an independent pruning for a given query, the asymptotic probability of a random object not being pruned away by z virtual pivots is

(∏_{i=1}^{z} erf((δ + s_i)·√(d/2)))^x,

where v_i is the i-th virtual pivot and s_i = D(v_i, q). Similar to the case of one virtual pivot, we can compute the RPP as y = f_z(s, δ) · x, where

f_z(s, δ) = ( Σ_{i=1}^{z} ln erf((δ + s_i)·√(d/2)) ) / ln erf(δ·√(d/2)).

The above formula indicates that if we use more than one virtual pivot, it is possible to achieve a pruning ability that is a multiple of that of the total set of x fixed pivots. Since in our method we do not compute the distances between the query object and all the fixed pivots, we can use a relatively large number of fixed pivots. This will increase the runtime of the pre-computing stage, which is a one-time effort. However, in the query stage, the number of distance computations will very likely decrease, since the larger number of fixed pivots will give the virtual pivots a better pruning ability.
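To make Formula (4.2) easy to explore, here is a small Python sketch of the transfer factor under the erf-based model of Lemma 5. The function names and the example parameter values are illustrative assumptions, not values taken from the thesis.

```python
import math

def f_single(s, delta, d):
    """Factor f(s, delta) of Formula (4.2): the fraction of the pruning
    ability of the x fixed pivots transferred to one virtual pivot."""
    return (math.log(math.erf((delta + s) * math.sqrt(d / 2)))
            / math.log(math.erf(delta * math.sqrt(d / 2))))

def f_multi(s_list, delta, d):
    """Generalization to z virtual pivots with distances s_i = D(v_i, q)."""
    return (sum(math.log(math.erf((delta + s) * math.sqrt(d / 2))) for s in s_list)
            / math.log(math.erf(delta * math.sqrt(d / 2))))

# With x fixed pivots, a single virtual pivot at distance s from the query
# has an RPP of roughly f_single(s, delta, d) * x.
for s in (0.01, 0.05, 0.2, 0.5):
    print(f"d=50, delta=0.1, s={s}: f = {f_single(s, 0.1, 50):.4f}")
```

Under this model the factor approaches 1 as s shrinks, consistent with the observation above that a virtual pivot close to the query inherits almost the full pruning ability of the x fixed pivots.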
In the traditional scheme, since it needs to compute the distances between the query objects and all the fixed pivots, it can only use a small number of fixed pivots. 84 4.4 Boosting Virtual Pivots In this subsection, we will answer the question of how to select the virtual pivots for any given query object efficiently and effectively. In order to perform efficient k-nn search using virtual pivots, we want the virtual pivots to be as close to the query object as possible. For a typical metric data-set, however, there is no information available about which object could be close to the query object. In the following, we propose a method for boosting the performance of virtual pivots by selecting them one by one, selecting the next virtual pivot using criteria measured by previously selected pivots. A similar idea of pivot selection was first mentioned without analysis in AESA [48, 49] and then later in Shasha and Wang’s paper [41] as part of their search methods using a full matrix of computed or estimated pairwise distances between all objects. Given a query object, their algorithms pick a random object as the first pivot. Starting with the second pivot, they always choose the object with the smallest lower bound estimation as the next pivot. As we have mentioned before, these methods suffer from extremely high pre-computation cost and the quadratic space problem in both the preprocessing stage and the query stage. There are two differences between our boosting method and the previous approaches. (1) Our method uses virtual pivots and thus gets rid of the quadratic space complexity. (2) In a high dimensional space, the distance between an arbitrary query object and a random pivot is almost always very large, as shown in Figure 4.3, which means that the first selected pivot has in general a low pruning ability, and the convergence of the previous methods to a close pivot could be slow when the lower bound estimations are loose. Therefore, instead of using a random object as the first pivot, we propose a method that tries to select a close neighbor of the query already as the first virtual pivot. The main idea in selecting the first virtual pivot for a given query object q is to utilize a small subset S ⊆ P of the fixed pivots to find a potentially close neighbor of q. Given that the distances between all pivots in P and all database objects have been pre-computed, we can first compute the distances between q and the fixed pivots in S, in order to obtain the lower bound estimation (LBE) ES (q, o) = maxp∈S {|D(q, p) − D(o, p)|}. Then, all the LBEs are sorted and the object with the smallest LBE is chosen as the first pivot. The following theoretical analysis will show that this is indeed a very good heuristic to find a close object to a query. Note, however, that this heuristic itself cannot be used to rule out objects as possible k-nn objects, using the traditional scheme of triangle inequality 85 pruning, since the upper bound estimation of the k-th nearest neighbor distance could be very large. We will call this ability of LBE to discriminate close neighbors rather than ruling out other objects its discriminatory ability. 4.4.1 Why Boosting Works The reason why boosting works may seem surprising. It is because the discriminatory ability of the lower bound estimation is growing so fast (exponentially with the number of used pivots) that LBE can rank a close neighbor highly in a sorted list. 
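Before quantifying why this works, here is a minimal Python sketch of the LBE-based selection of the first virtual pivot described in the previous section. The data layout and names are illustrative assumptions rather than the thesis code; only the |S| distances between the query and the chosen subset of fixed pivots are actually computed.

```python
def select_first_virtual_pivot(query, S, pivots, objects, pivot_dist, dist):
    """Pick the first virtual pivot by LBE ranking (Section 4.4).

    pivots: list of fixed pivot objects; S: indices of the small subset used
    pivot_dist[i][j]: pre-computed D(pivots[i], objects[j])
    dist: the expensive metric distance function
    """
    d_q = {i: dist(query, pivots[i]) for i in S}   # only |S| expensive calls

    def lbe(j):
        # E_S(q, o_j) = max_{p in S} |D(q, p) - D(o_j, p)|
        return max(abs(d_q[i] - pivot_dist[i][j]) for i in S)

    # The object with the smallest lower bound estimation is the most
    # promising close neighbor of q; it becomes the first virtual pivot.
    return min(range(len(objects)), key=lbe)
```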
For a random query object q with query radius δ and m random pivots in a d-space, by Lemma 5, the asymptotic probability of a random object o having an LBE smaller than δ is

Pr = (erf(δ·√(d/2)))^m.

[Figure 4.7: Probability of having an LBE smaller than δ, for (a) m = 30 and (b) m = 100, plotted against δ for d = 30, 50, 100, 200.]

The values of Pr for several dimensions are plotted in Figure 4.7 against different values of δ. It shows that when δ is small, the probability of a random object getting a low LBE is also small, and this probability goes down exponentially as the number of pivots m increases. For instance, assuming m = 100, d = 50 and δ = 0.1, Pr will be no greater than 10^−30. Let the size of the data-set be 10 million. Then, on average, there are only 10^−23 objects with an LBE smaller than 0.1. If we reduce m to 30, this average number is still only 10^−2. Therefore, if the query object actually has a close neighbor within δ = 0.1, then this object will be the top object in the sorted list of lower bounds. Once we have found this closer-than-random neighbor, our boosting method will use it as a virtual pivot and prune the other objects. If the query object has more than one close neighbor, we can continue to select a pivot that is even closer to the query object, since the close virtual pivot generates lower LBEs for close neighbors than the LBEs generated by random pivots, as shown in Lemma 5 (when a decreases, P increases, and the close neighbors' LBEs decrease, since erf is a monotonically increasing function).

This analysis also indicates that the boosting can only be effective when the query object actually has some neighbors that are not too far away. This is a reasonable assumption in data-sets where similarity search is performed, since if all objects are as far away from the query object as random noise, it is hard to justify the need to do k-nn search at all. Therefore, we will assume that the user will, in practice, be able to provide a parameter ε so that only k-nearest neighbors within a radius of ε of the query object are considered relevant. There is no hard limit on the value of ε; it only affects the performance of boosting. In the worst case, the boosting will simply select the virtual pivots randomly.

4.5 Advantages of Using Virtual Pivots

Given a virtual pivot v, Formula (4.1) gives a new lower bound estimation |D(p, o) − D(p, v)| − D(q, v) for D(q, o). This estimation is derived from two adjacent triangles on four objects, and thus might be looser than a single triangle inequality such as Formula (1.1). Nevertheless, such a pruning scheme based on a virtual pivot v is significantly different from the traditional methods using many fixed pivots (i.e., Formula (1.2)), where a much larger number of distances to the query have to be computed in the query stage. Additionally, in the traditional methods, Formula (1.3) gives estimated upper bounds which might all be large; in our VP method, if we are able to select the virtual pivot v to be close to the query q, then D(q, v) will be small and serve as a tighter upper bound for the nearest neighbor distance. To summarize, there are several advantages of using a virtual pivot:

• One can freely choose from the non-pivot data objects as virtual pivots during query time, based on the given query.
In our VP method, we choose the one with the smallest estimated lower bound. This query-dependent and dynamic pivot selection will likely enable us to select virtual pivots that are close to the query object. • The distances between the close virtual pivots and the query will provide a much tighter k-nn upper bound, δ, than the other methods. This upper bound and the new lower bound estimation by Formula (4.1) can more effectively prune away distant data objects. • In our VP method, the selection of virtual pivots in the query stage relies on only a small number of fixed pivots whose distances to the query are computed. The 87 subsequent lower bound estimation using Formula (4.1) and a larger number of fixed pivots saves a substantial number of distance computations during the query stage. Consequently, one can opt to pre-compute a large set of fixed pivots to improve the effectiveness of virtual pivot pruning by Formula (4.1). 4.6 Algorithm Preprocessing Algorithm. Select a set of fixed pivots, denoted by P . Compute the distances between these pivots and all the objects in the data-set. Store the distances in |P | sorted lists. If the main memory is not large enough, store them in |P | sorted files or B+-trees. All data objects except the |P | fixed pivots can be used as virtual pivots. Query Algorithm. Given a query object q and threshold value v (the maximal number of virtual pivots to use) and ǫ (only k-nn objects within ǫ are relevant), perform the following operations. Step 1. Randomly select a small subset S ⊆ P of fixed pivots. For all f ∈ S, compute D(q, f ). Use the traditional scheme of k-nn search to perform pruning and rank the remaining set according to LBEs. Denote the current k-nn distance upper bound as u. Let the current k-nn candidate set be C. Denote the set of virtual pivots by V . Step 2. If |V | = v, return with C. Otherwise, use the object that has the lowest LBE but is currently not in P ∪ V as the next virtual pivot (denoted by v). If v does not exist, return with C. Otherwise compute D(q, v). Step 3. For each fixed pivot p ∈ P \ S, perform a range query of radius min{ǫ, u} + D(v, q) in its sorted list/sorted file/B+-tree to find objects with distances that are close to D(p, v). Let the objects in this query result be Q. Update the lower and upper bounds of objects in Q ∩ C using Formula (4.1) Update u and C. Goto Step 2. Let the size of the data-set be n. In the preprocessing stage, our method computes |P | · n distances, so that the space complexity is linear. Since sorting them takes O(n log n) time, the total time complexity for preprocessing is O(n log n). In the query stage, for each virtual pivot, the range query takes O(n) time, so that the total time complexity is O(|V | · |P | · n). However, the total number of distance computations which can be computationally very expensive and dominate the runtime, is only |S| + |V |. We remark that in high dimensional data when ǫ is large, we should use a linear scan rather than range query in step 3 since range queries are not effective in such data. 88 4.7 Summary In this chapter, we analyzed the pruning ability of pivots and showed that pivots close to a query object have a much better pruning ability than random pivots. We introduced the novel concept of virtual pivots and showed formally that a single virtual pivot can have the same pruning power as a large set of fixed pivots. A theoretical analysis showed how and why we can select virtual pivots close to the query object (if they exist) with high probability. 
Our results indicate that the problem of k-nn search in general metric spaces or high dimensional spaces may not be as difficult as we might have thought, as long as the data is not completely random and k-nn search is not completely meaningless. The LBE is a very good heuristic in high dimensional space for finding close neighbors. The actual problem is rather how to rule out far away objects. Our virtual pivots can simulate the pruning of close pivots to address this problem of ruling out far away objects. We conjecture that the reasons why k-nn search had been so challenging in the literature are a combination of (1) k-nn search is performed to find not very close neighbors, (2) the number of fixed pivots is too small, and (3) methods lack adaptability with respect to the query. Suppose in the analysis of Section 4.4.1, using 10 random pivots, we have δ = 0.5 instead of δ = 0.1, then P = 0.56, which means 56% of the data objects will have an LBE smaller than δ. The LBE is now not a good heuristic to find close neighbors. However, if we are using 100 random pivots instead of 10, then P = 0.0033. This indicates that using a larger number of pivots can be a potential solution when searching for neighbors that are farther away. However, in the traditional scheme, this is not affordable since in the query stage we need to compute the distances between the query object and all the pivots. In our “virtual pivot + boosting” method, we can compute a larger set of fixed pivots without increasing the number of distance computations at query time. Only being able use a small number of fixed pivots limits the effectiveness of traditional approaches even when searching for close neighbors in high dimensional data-sets when there are many very small clusters, since we cannot have a fixed pivot in every cluster. In our method, the virtual pivot selection has been designed to be adaptive to the query so that every cluster can have a virtual pivot. 89 Chapter 5 Efficient Similarity Search with Virtual Pivots and Partial Pivots Modern biological applications usually involve the similarity comparison between two objects, which is often computationally very expensive, such as whole genome pairwise alignment and protein three dimensional structure alignment. Nevertheless, being able to quickly identify the closest neighboring objects from very large databases for a newly obtained sequence or structure can provide timely hints to its functions and more. This chapter1 presents a substantial speed-up technique for the well studied k-nearest neighbor (k-nn) search, based on novel concepts of virtual pivots and partial pivots, such that a significant number of the expensive distance computations can be avoided. The new method is able to dynamically locate virtual pivots, according to the query, with increasing pruning ability. Using the same or less amount of database preprocessing effort, the new method outperformed the next best method by using no more than 40% distance computations per query, on a database of 10,000 gene sequences, compared to several of the best known k-nn search methods including M-Tree, OMNI, SA-Tree, and LAESA. We demonstrated the use of this method on two biological sequence datasets, one of which is for HIV-1 viral strain computational genotyping. In the next section, we detail our method that implements the above two novel ideas for speeding up k-nn search. 
We show the computational results on two biological datasets in Section 5.2, where several existing well-known hierarchical and non-hierarchical k-nn search methods are included for comparison. In Section 5.3, we further discuss our method and its application on real-life data. 1 Some of the material in this chapter has been published in [58]. 90 5.1 Methods From Chapter 4, we see that an effective pivot would be one that is close to the query object. However, a pivot can be effective for only a portion of data objects provided that these data objects are all close to the pivot. It is often difficult to select universally effective pivots, given that they have to be selected during the database preprocessing stage. Assuming that query objects follow the same distribution as the database objects, each database object is typically close to only a small number of other database objects. Therefore, when one is selected as a pivot, it becomes effective for only a small number of query objects. In other words, one pivot is ineffective for the vast majority of query objects. Consequently, if an effective pivot is desired for all query objects, one has to have a large pool of pivots, despite many careful considerations on how pivots should be selected [20, 11, 37]. 5.1.1 Partial Pivots In Chapter 4, we introduced virtual pivots which enable us to use virtually any data object as a pivot, which by itself does not have pre-computed distances to other data objects and must be supported by a set of traditional fixed pivots (each has distances to all data object) to perform pruning. In this section, we extend the concept of virtual pivots to use every data object itself as a pivot. The difference between such a pivot and the fixed pivots is that for every data object, we predict a number of its nearest neighbors using a simplified version of the pairwise ranking technique developed in Chapter 3, and then compute the distances between the data object and these predicted nearest neighbors, so that the new pivot only has distances to a small set of data objects (predicted close neighbors). In this sense, such a pivot is only a partial pivot. Partial pivots are associated with two phases of ranking. In the preprocessing stage, to construct partial pivots, a pairwise ranking step predicts the close neighbors for every data object; in the query stage, a ranking with LBE (described in Section 3.3.1) selects pivots that are close to the query object. Our algorithm uses the predicted closest pivots to perform pruning first. The idea of using partial pivots is that, since such a pivot is expected to be close to these predicted nearest neighbors, it will be an effective pivot in estimating the lower (and upper) bounds of the actual distances between the query object and these predicted nearest neighbors. All these steps are done in the database preprocessing stage, and we will demonstrate that these additional preprocessing efforts are paid off by the significantly reduced numbers 91 Figure 5.1: Partial pivots can help when there is no close pivot to the query object. Let q be a query object, p be a partial pivot far away from q, and oi (i = 1, 2, 3) be neighbors of p. |D(q, p) − D(p, oi )| is close to D(q, p) and can be larger than the current k-nn distance upper bound so that oi will be pruned away. of distance computations during the query stage. Even when the query object has no close neighbors and thus no close pivots at all, partial pivots can still help in the pruning. 
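The partial-pivot preprocessing described above can be sketched as follows. This version uses the plain fixed-pivot lower bound of Formula (1.2) and a full sort for the neighbor prediction (the thesis uses the cheaper pairwise ranking of Chapter 3 for this step), and all names are illustrative assumptions.

```python
def build_partial_pivots(objects, pivots, dist, t):
    """Preprocessing sketch: for each object, predict t close neighbors from
    fixed-pivot lower bounds and pre-compute the exact distances to them."""
    # |P| * |B| expensive distance computations.
    pivot_dist = [[dist(p, o) for o in objects] for p in pivots]

    neighborhood = {}                  # N(o): {neighbor index: exact distance}
    for j, o in enumerate(objects):
        def lb(k):
            # Formula (1.2) style lower bound: max_p |D(p, o_k) - D(p, o_j)|
            return max(abs(row[k] - row[j]) for row in pivot_dist)
        ranked = sorted((k for k in range(len(objects)) if k != j), key=lb)
        # t additional expensive distance computations per object; the top-t
        # objects by lower bound are the predicted close neighbors of o.
        neighborhood[j] = {k: dist(o, objects[k]) for k in ranked[:t]}
    return pivot_dist, neighborhood
```

The full sort here costs O(|B| log |B|) per object; it stands in for the frontier-search-based prediction used in the thesis, which keeps the ranking overhead lower.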
Being far away from the query object, by the analysis of Chapter 4, virtual pivots can no longer be effective, but for partial pivots, since they have exact distances to some of their close neighbors, they can be used as pivots to exclude their close neighbors as close objects to the query, as shown in Figure 5.1. From this point of view, partial pivots can have better pruning ability than virtual pivots since a partial pivot has some pruning ability even when it is not close to the query object. The pruning ability of a partial pivot is limited by its number of close pivots predicted. If the number of distances associated with a partial pivot is t, then theoretically using partial pivots alone can at most reduce the number of distance computations in the query stage to 1/t of the total size of the data-set. However, in many real-life metric data such as those studied in our experimental section, performing k-nn queries is challenging such that the performance of most methods cannot approach this limit even for t as small as 20. Our partial pivots are fully integrated with the virtual pivot algorithm described in Section 4.6. In our database preprocessing stage, after the fixed pivots are selected and their distances to all other data objects computed, we estimate the lower bounds on D(o′ , o) for each non-pivot object o and those data objects o′ with yet unknown distances to o, using Formula (1.2). This way, for each non-pivot object o, we can predict a pre-specified number of data objects o′ as its close neighbors using their estimated lower bounds. These objects o′ form the estimated neighborhood of object o, denoted as N (o). Subsequently, we precompute the distances D(o′ , o), for o′ ∈ N (o). In the query stage, when o is selected as a 92 virtual pivot, D(q, o′ ) can be better bounded from below than using Formula (4.1): |D(o′ , o) − D(q, o)| ≤ D(q, o′ ) ≤ D(o′ , o) + D(q, o), ∀ o′ ∈ N (o). (5.1) Also, when the pre-specified number of virtual pivots is reached, o can be used for possible pruning of objects in N (o). That is, for any k-nn candidate o, we compute the distance D(q, o), and can use Formula (5.1) to possibly prune away o′ if o′ ∈ N (o) was also a candidate. 5.1.2 The Complete Algorithm with Virtual and Partial Pivots Preprocessing Algorithm. Given a database B, randomly select a set of data objects as fixed pivots P ⊆ B and compute the distance between each fixed pivot in P and every data object in B. Using these pre-computed distances, for each data object o in B, rank all the other data objects o′ in B in non-decreasing order of their estimated lower bound on D(o′ , o) by Formula (1.2). If o ∈ P , let N (o) = B; Otherwise, compute the distances between o and the top t objects in the ordered list, which form N (o) with t pre-specified. Query Algorithm. Given a query object q (q can be in or not in B), perform the following steps of operations: 1. (Compute the distances between q and a number of fixed pivots:) Randomly select a subset S ⊆ P of the fixed pivots, and compute the distances between q and all the pivots in S. 2. (Estimate the initial k-nn distance upper bound δ:) Compute the set of upper bounds U = {uD(q,o) }, where uD(q,o) = D(q, o), ∀ o ∈ S; minp∈S {D(p, q) + D(p, o)}, ∀ o ∈ B \ S. Set δ to the k-th smallest value in U . 3. (Select the first virtual pivot:) For each data object o ∈ B estimate its lower bound as D(q, o), ∀ o ∈ S; lD(q,o) = maxp∈S {|D(p, q) − D(p, o)|}, ∀ o ∈ B \ S. 
If lD(q,o) > δ, object o is excluded from the k-nn candidate list C, which is initialized to be B and (always) sorted in non-decreasing order of estimated distance lower bounds. Select the non-pivot object in C with the smallest lower bound as the first virtual pivot v. Add v to the set of virtual pivots V . 93 4. (Update δ:) Compute D(q, v). If D(q, v) > δ, remove v from the C. Otherwise, update both uD(q,v) and lD(q,v) as D(q, v), and update δ as the k-th smallest value in U. 5. (Pruning using the virtual pivot v:) For each non-pivot k-nn candidate o ∈ C \ N (v), lD(q,o) = max lD(q,o) , max {|D(p, o) − D(p, v)| − D(q, v)} ; p∈P \S for each non-pivot k-nn candidate o ∈ C ∩ N (v), lD(q,o) = max lD(q,o) , max {|D(p, o) − D(p, v)| − D(q, v)}, |D(o, v) − D(q, v)| . p∈P \S If lD(q,o) > δ, remove o from the k-nn candidate list C. 6. If |C| = k or there is no more virtual pivot to select, return the list of k-nn objects; If |V | reaches the pre-specified limit, go to the next step; Otherwise, select the non-pivot object in C with the smallest lower bound as the next virtual pivot v, add v to V , and go to Step 4). 7. (Pruning using partial pivots:) Repeat the following while |C| > k: (a) Select the non-pivot object in C with the smallest lower bound as the next partial pivot o. If lD(q,o) > δ, return the k objects in C with the smallest distances to the query; Otherwise, compute D(q, o). (b) Update uD(q,o) as D(q, o); For each o′ ∈ N (o) ∪ P , update uD(q,o′ ) using Formula (5.1), uD(q,o′ ) = min uD(q,o′ ) , D(o′ , o) + D(q, o) ; Update δ as the k-th smallest value in U . (c) Update lD(q,o) as D(q, o), and if lD(q,o) > δ, remove o from C; For each o′ ∈ N (o) ∪ P , update lD(q,o′ ) using Formula (5.1), lD(q,o′ ) = max lD(q,o′ ) , |D(o′ , o) − D(q, o)| , and if lD(q,o′ ) > δ, remove o′ from C. 94 5.2 Results We tested our k-nn search method, denoted as VP, on two biological sequence datasets. The first dataset consists of 1,198 human immunodeficiency type 1 virus (HIV-1) complete viral sequences studied by Wu et al. [52] (with an average sequence length of 9,103 bases). The second dataset contains 10,000 HA gene sequences of influenza (FLU) virus (encoding hemagglutinin) downloaded from the NCBI GenBank [1] (with an average sequence length of 1,480 bases). For each of these two datasets, we randomly removed 100 sequences from the dataset and used them as query objects to perform k-nn search on the remaining objects. All reported performance measurements are the average values over these 100 queries. We compared our VP method with several hierarchical and non-hierarchical k-nn search methods, regarding the average number of distance computations and the average running time per query. Note that we tuned our VP method so that its preprocessing efforts, in terms of the number of pre-computed distances, were less than those in the other methods. 1) The sequential scan, denoted by SeqScan, is the baseline method that computes the distances between the query and all database objects. 2) The na¨ıve non-hierarchical pivot method uses a number of randomly selected fixed pivots, denoted by FP. In the query stage, FP computes the distances between the query objects and all the fixed pivots, and uses Formulas (1.2– 1.3) for pruning. Lastly, for each k-nn candidate, FP computes its distance to the query, updates the k-nn distance upper bound, and uses this bound to further prune objects from the candidate list. 3) The non-hierarchical pivot method by Filho et al. 
[20, 45], denoted by OMNI, selects pivots carefully so that they are far away from each other. In the query stage, it performs the same as FP. 4) The fourth method is M-Tree [13]. 5) The fifth method is SA-Tree [34]. 6) The sixth method is LAESA [37]. Brief descriptions of the last three methods are included in the related work section of Chapter 4. All these six methods are implemented in C++. The code for M-Tree was downloaded from its authors’ website, and the code for LAESA was provided by its authors. All experiments were performed using an AMD Opteron 2.2GHz CPU with a 5GB RAM. 5.2.1 Results on the HIV-1 Dataset Wu et al. studied the HIV-1 viral strain computational genotyping problem, where 42 pure subtype HIV-1 strains were used as references and the task is to assign a subtype or a recombination to any given HIV-1 viral strain [52]. In total, there are 1,198 HIV-1 complete viral strains used in the study. The average length of these 1,198 strains is around 9,000 nu- 95 cleotides. These strains form our first dataset, and they are classified into 11 pure subtypes (A, B, C, D, F, G, H, J, K, N, O) and 18 recombinations. One computational genotyping method is, given a query strain, to identify the k most similar strains in the dataset and then use their subtypes to assign a subtype to the query through a majority vote. Similarity or dissimilarity between two genomic sequences can be defined in many ways. In this work we test two of them: the global alignment distance [35] and the Complete Composition Vector (CCV) based euclidean distance, the latter of which was adopted in Wu et al. [52]. In calculating global alignment distances, we adopted the scoring scheme used in Mega BLAST [55], which is designed for aligning nucleotide sequences that differ slightly. In calculating the CCV-based euclidean distances, every HIV-1 viral strain is represented as a high dimensional vector, in which each entry records essentially the amount of evolutionary information carried by a nucleotide string. We considered all nucleotide strings of length 21 or less as suggested in [52]. The distance between two HIV-1 viral strains is taken as the euclidean distance between the two corresponding high, more than two million, dimensional vectors. Both distance functions are metric. Computing the global alignment distance and the CCV-based euclidean distance between two HIV-1 viral strains took about 1.6 and 2.5 seconds on average, respectively. Overall Performance In FP, OMNI, and LAESA, 100 fixed pivots were selected and their distances to all data objects were computed. This way, they had equal preprocessing efforts. In our VP method, only |P | = 80 fixed pivots were randomly selected and similarly their distances to all data objects were computed. Afterwards, t = 20 close neighbors for each non-pivot object were predicted using these 80 pivots and triangle inequality, and subsequently the distance between each non-pivot object and everyone of its predicted close neighbors was computed. Therefore, VP had no more preprocessing efforts than FP, OMNI, and LAESA, in terms of the number of distance computations. In the query stage, for each query object, VP randomly selected and used only 5 out of the 80 fixed pivots (i.e., |S| = 5), and the number of virtual pivots to be selected, |V |, was set to 10. For M-Tree, the input parameters were set according to its original paper [13]. For SeqScan and SA-Tree, there was no parameter to be set. 
Obviously, for each query, SeqScan always made exactly 1,098 distance computations, independent of the value of k. The other six methods, which all take advantage of the triangle inequality, made fewer distance computations, and the detailed numbers of distance computations depended on the query HIV-1 viral strain as well as the value of k.

[Figure 5.2: The average numbers of distance computations per query by all seven k-nn search methods on the HIV-1 dataset, for k = 1, 3, 5, 10. (a) HIV-1 dataset using global alignment distance. (b) HIV-1 dataset using CCV-based euclidean distance.]

Figure 5.2(a) (Figure 5.2(b), respectively) plots the average numbers of global alignment (CCV-based euclidean, respectively) distance computations per query by these seven methods, over the 100 k-nn queries, for k = 1, 3, 5, 10. With respect to k, each of the six methods other than SeqScan made more distance computations with increasing values of k. For each value of k, the plots show that the general performance order is: our VP method performed the best, followed by LAESA, FP, SA-Tree/OMNI, M-Tree, and SeqScan. It was surprising to see that, for all these four values of k, M-Tree, OMNI, and SA-Tree made more distance computations per query than FP, which simply used 100 randomly selected fixed pivots for pruning. On the other hand, not surprisingly, our VP method consistently outperformed all the other methods. Specifically, our VP method was at least 17.7% better than the next best method, LAESA. For instance, for 1-nn search, M-Tree made 523 global alignment distance computations, OMNI made 418, SA-Tree made 387, FP made 351, LAESA made 213, but our VP method made only 181 global alignment distance computations.

[Figure 5.3: The average runtime per query by all seven k-nn search methods on the HIV-1 dataset, for k = 1, 3, 5, 10. (a) HIV-1 dataset using global alignment distance. (b) HIV-1 dataset using CCV-based euclidean distance.]

The actual runtime per query by a k-nn search method includes the time for distance computations and possibly the time for triangle inequality pruning and other operations, and thus it is not necessarily proportional to the number of distance computations. Nevertheless, the collected average runtime per query by each of the seven methods, over the 100 k-nn queries, did correlate quite well with the average number of distance computations per query, as shown in Figure 5.3. Again, for each value of k, the plots show the same general performance order: our VP method performed the best, followed by LAESA, FP, SA-Tree/OMNI, M-Tree, and SeqScan.

Varying Preprocessing Efforts

We dropped SeqScan from this comparison since it performed the worst and its preprocessing effort does not affect its query performance. We mentioned above that the input parameters of the M-Tree method were set according to its original paper and that SA-Tree has no parameter to be tuned.
For the other four methods, VP, LAESA, FP, and OMNI, we are able to tune their parameters such that they have different preprocessing efforts, which subsequently affect their query performance. For LAESA, FP, and OMNI, the parameter is the number of fixed pivots; for our VP method, the parameters are the number of fixed pivots, the number of predicted close neighbors for each data object, the number of fixed pivots used in the query stage, and the number of virtual pivots allowed in the query stage.

As shown in Figure 5.4, M-Tree always pre-computed 81,910 global alignment distances on the database of 1,098 HIV-1 viral strains, in 134,332 seconds, and SA-Tree pre-computed 42,142 global alignment distances in 69,112 seconds. We tested four different amounts of preprocessing effort for the other four methods and their query performance in 1-nn search. We set the numbers of fixed pivots in OMNI, FP and LAESA to 40, 50, 75, and 100 such that their preprocessing efforts fell in the same range as M-Tree and SA-Tree. Correspondingly, we set the number of fixed pivots in our VP method to 10, 20, 50, and 75, and kept (|S|, t, |V|) = (5, 20, 10).

[Figure 5.4: Performance measures per query by the six k-nn search methods, with different amounts of preprocessing effort on the HIV-1 dataset using the global alignment distance. (a) The average numbers of distance computations per query versus the number of distance computations in preprocessing. (b) The average runtime per query (seconds) versus the runtime in preprocessing (seconds).]

Figure 5.4 shows how varying preprocessing efforts affected the average numbers of distance computations and the average runtime per query for VP, LAESA, FP, and OMNI in 1-nn search. In these four settings, the preprocessing efforts of VP were always less than those of OMNI, FP and LAESA, yet VP performed better than them in the query stage. For instance, in the case where the numbers of fixed pivots in OMNI, FP and LAESA were set to 40 (the leftmost point for each method), our VP method computed 32,669 global alignment distances in 53,577 seconds in the preprocessing. This effort was the smallest by comparison: LAESA computed 43,139 global alignment distances in 70,747 seconds, FP computed 43,140 global alignment distances in 70,749 seconds, and OMNI computed 44,198 global alignment distances in 72,484 seconds (M-Tree and SA-Tree as above). Nevertheless, VP computed only 198 global alignment distances per query (in 241 seconds), while LAESA computed 238 global alignment distances per query (in 293 seconds), FP computed 347 global alignment distances per query (in 419 seconds), SA-Tree computed 387 global alignment distances per query (in 453 seconds), OMNI computed 441 global alignment distances per query2 (in 536 seconds), and M-Tree computed 523 global alignment distances per query (in 625 seconds).

2 The author of OMNI suggested using fewer pivots for this experiment. However, since reducing the number of pivots will not increase the pruning ability of OMNI, in the best case OMNI could only reduce the number of distance computations by the number of computed distances between the query object and all the pivots, i.e., from 441 to 401, a performance which is still worse than that of SA-Tree.

5.2.2 Results on the HA Gene Dataset

All seven k-nn search methods were also compared on the FLU HA gene dataset, which contains 10,000 HA gene sequences. In this section we only report the results using
the CCV-based euclidean distance (the results using the global alignment distance were similar).

[Figure 5.5: The average numbers of distance computations per query by all seven k-nn search methods with two different amounts of preprocessing effort, on the HA dataset, for k = 1, 3, 5, 10. (a) Smaller preprocessing effort. (b) Bigger preprocessing effort.]

[Figure 5.6: The average runtime per query by all seven k-nn search methods with two different amounts of preprocessing effort, on the HA dataset, for k = 1, 3, 5, 10. (a) Smaller preprocessing effort. (b) Bigger preprocessing effort.]

Figures 5.5 and 5.6 show the query performance of the seven k-nn search methods on the HA dataset, for k = 1, 3, 5, 10. We only plotted two different amounts of preprocessing effort for those k-nn search methods whose preprocessing efforts can be tuned. For M-Tree, we used its default parameter setting; for OMNI, FP, and LAESA, the numbers of fixed pivots were set at 100 and 200, respectively; correspondingly, for our VP method, the numbers of fixed pivots were set at 75 and 150, and (|S|, t, |V|) = (5, 20, 10). We distinguish these two settings as the smaller and the bigger preprocessing efforts.

Table 5.1: The detailed numbers of distance computations and the runtime (in seconds) by all seven k-nn search methods in the smaller preprocessing, and their resultant query performance in 1-nn search in terms of the average number of distance computations and the average runtime (in seconds) per query.

Method     Preprocessing                  Per query
           #dist comp's    runtime        #dist comp's    runtime
VP         960,705         263,079        485             194
SA-Tree    1,682,030       460,383        1,213           373
FP         985,049         269,614        1,462           468
LAESA      985,049         269,614        2,166           716
OMNI       985,049         269,614        2,388           743
M-Tree     1,280,537       350,492        4,950           2,178

Table 5.1 shows that in the smaller preprocessing setting, our VP method spent the least amount of effort among the six methods, in both the number of distance computations and the actual runtime (SeqScan was excluded since it does not have a preprocessing stage). Nevertheless, both the table and Figures 5.5(a) and 5.6(a) show that our VP method outperformed the other five methods (and SeqScan) in the query stage, in both the average number of distance computations and the actual runtime per query. Furthermore, the results show a slightly different performance tendency compared to that on the HIV-1 dataset: for each value of k, our VP method performed the best, followed by SA-Tree/FP, LAESA, OMNI, M-Tree, and SeqScan. That is, LAESA performed worse than SA-Tree and FP on this FLU HA gene dataset. Our VP method can be 150% better than the next best method, SA-Tree. For instance, for a 1-nn search with the smaller preprocessing efforts (cf.
Table 5.1), the next best method, SA-Tree, made 1,213 CCV-based distance computations per query, but our VP method made only 485.

5.3 Discussion

5.3.1 Query Performance Dependence on the Effectiveness of Pivots

We discussed earlier that the level of effectiveness of pivots is extremely important for the success of a query. Essentially, one effective pivot, fixed or virtual, would have the ability to estimate the distances between the query object and the non-pivot objects accurately, such that a large portion of the database can be pruned away to avoid distance computations.

Table 5.2: The distance intervals associated with the five bins and the numbers of query objects therein.

      HIV-1-GA                       HIV-1-CCV                    HA-CCV
Bin   Distance interval   #Objects   Distance interval  #Objects  Distance interval  #Objects
1     [0, 0.08)           25         [0, 50)            8         [0, 20)            51
2     [0.08, 0.16)        11         [50, 100)          9         [20, 40)           34
3     [0.16, 0.24)        43         [100, 150)         9         [40, 60)           10
4     [0.24, 0.32)        20         [150, 200)         12        [60, 80)           4
5     [0.32, 0.4)         1          [200, 250)         62        [80, 100)          1

The reported performance measurements on the two biological sequence datasets for the six k-nn search methods (excluding SeqScan) that use triangle inequality pruning were obtained on the 100 randomly pre-selected query objects. Certainly, whether or not there exists an effective pivot for each of them affects the query performance of these six methods. In fact, the most effective pivots for these 100 randomly pre-selected query objects have various levels of effectiveness. To study the relation between the query performance of the different methods and the level of effectiveness of pivots, we calculated the nearest neighbor (i.e., 1-nn) distance for each of the 100 query objects, and divided the query objects into 5 bins of equally wide distance intervals. The nearest neighbor distance associated with a query object measures the level of effectiveness of the most effective pivot for that query. Let HIV-1-GA denote the HIV-1 dataset using the global alignment distance; likewise, HIV-1-CCV and HA-CCV denote the HIV-1 and HA gene datasets using the CCV-based euclidean distance, respectively. Table 5.2 collects the distance intervals associated with the five bins and the numbers of query objects therein, for HIV-1-GA, HIV-1-CCV, and HA-CCV.

Separately for each bin, we collected the average numbers of distance computations per query by all six methods in a 1-nn search, and plotted them separately in Figure 5.7. In this experiment, the input parameters of the M-Tree method were set according to its original paper, FP, OMNI, and LAESA were set to select 100 fixed pivots, and VP was set to use a lower preprocessing effort than all the other five methods, with (|P|, |S|, t, |V|) = (80, 5, 20, 10) for HIV-1-GA and HIV-1-CCV, and (|P|, |S|, t, |V|) = (75, 30, 20, 10) for HA-CCV. From these plots, one can see that the performance of all six methods declined as the average nearest neighbor distance associated with the query objects increased. Also, our VP method almost always, except for the middle bin of HIV-1-CCV, outperformed the other five methods. The plots in Figure 5.7 also show that the performance of some methods deteriorates faster than that of the others. For instance, on the HA-CCV dataset (Figure 5.7(c)), LAESA outperformed M-Tree and OMNI when the average nearest neighbor distance associated with the query objects is small (the first two bins), but LAESA gradually lagged behind as the average nearest neighbor distance increases (the last two bins).
Table 5.2 and Figure 5.7(b) together explain why all six methods performed relatively poorly on HIV-1-CCV. On HIV-1-CCV, most query objects (62 out of 100) were in Bin 5, that is, they have large nearest neighbor distances and consequently low levels of effectiveness of the pivots, fixed or virtual. Nevertheless, our VP method still outperformed the other five methods on this bin.

[Figure 5.7: The average numbers of distance computations per query by all six methods in 1-nn search, for five bins of query objects with different nearest neighbor distance ranges, on three datasets: (a) HIV-1-GA, (b) HIV-1-CCV, (c) HA-CCV.]

5.3.2 Query Performance w.r.t. Ranking

Ranking plays an important role in our algorithm. In the preprocessing stage, our algorithm uses pairwise ranking to construct partial pivots; at the beginning of the query stage, it uses a naïve ranking with |S| fixed pivots to select data objects close to the query as virtual pivots. To test the performance of our algorithm with respect to ranking, we applied it on the HIV-1-GA data-set in different scenarios, using virtual pivots only or using both virtual pivots and partial pivots, with |P| set to 80.

Figure 5.8(a) shows the result when using virtual pivots only. For the two curves in the plot, with |S| varied from 0 to 80, the number of virtual pivots |V| was set to 0 and 10, respectively. When |V| = 0, our algorithm uses only the |S| fixed pivots to do the pruning. The result shows that the number of distance computations drops dramatically as |S| increases from 1 to 30; after that, the curves are almost flat. This result also indicates that using just 10 virtual pivots consistently renders better performance than using no virtual pivots.

[Figure 5.8: Performance on HIV-1-GA when the number of fixed pivots used to perform ranking, |S|, increases from 1 to 80, for |V| = 0 and |V| = 10: (a) with virtual pivots only, (b) with virtual pivots and partial pivots.]

Figure 5.8(b) shows the result when using both virtual pivots and partial pivots, under the same setting as the previous experiment. Similar to Figure 5.8(a), the number of distance computations decreases quickly when |S| increases from 0 to 5. After that, increasing |S| only deteriorates the performance slightly. The result also shows that, on this data-set, virtual pivots are superseded by partial pivots: using 0 or 10 virtual pivots does not make a significant difference in performance.

We also tested our algorithm on the larger HA-CCV data-set, using a similar setting to the previous experiment (|P| = 75). The results, shown in Figure 5.9, are similar to those of Figure 5.8.

[Figure 5.9: Performance on HA-CCV when |S| increases from 1 to 70, for |V| = 0 and |V| = 10: (a) with virtual pivots only, (b) with virtual pivots and partial pivots.]

5.3.3 Query Performance w.r.t. t

We also studied how the query performance of our algorithm is affected by the parameter t, the number of predicted close neighbors for each data object.
Figure 5.10 shows the performance of our method when t increases from 0 to 50. For the other parameters, on the HIV-1-GA data-set, (|P|, |S|, |V|) = (80, 5, 10), and on the HA-CCV data-set, (|P|, |S|, |V|) = (75, 30, 10). Both plots in Figure 5.10 show a similar trend: the number of distance computations decreases as t increases, until it levels off around t = 10. The results also show that t has a bigger effect on the larger HA-CCV data-set.

[Figure 5.10: Performance when the number of predicted close neighbors for each data object, t, increases from 1 to 50: (a) HIV-1-GA, (b) HA-CCV.]

5.3.4 Time and Space Complexity

Given a set of data objects B, a set of randomly selected fixed pivots P, and a constant t specifying the number of predicted close neighbors for each data object, in the preprocessing stage our VP method computes (|P| · |B| + t · |B|) distances, with an overhead of O(|P| · |B| + t · |B| · log |B|) for storing, sorting, and processing the computed distances, when the prediction of the t close neighbors for each data object is done by a frontier search [57]. The space complexity is O(|P| · |B| + t · |B|). In the query stage, with |V| virtual pivots allowed, the time complexity is O(|V| · |P| · |B| + t · |B| + |B| · log |B|) for the distance lower and upper bound estimations and for sorting the candidate list, in addition to the time for distance computations. The space complexity is O(|P| · |B| + t · |B|), mostly for storing the pre-computed distances for query processing.

Similarly, as pointed out in [34], there are extreme datasets (such as the query objects in Bin 5 of the HIV-1-CCV dataset) on which no pivot can provide any distance information about data objects whose distances to the query are not yet known. Our VP method would also fail to work efficiently on them, and in the worst case the distances between the query and all data objects have to be computed. A precise theoretical analysis of our VP method under reasonable distance distributions, such as the one in [34], is one of our focuses for future work.

5.3.5 Distance Distributions

Based on a "homogeneous" distance distribution model, Ciaccia et al. [14] provided a theoretical analysis of the CPU cost spent on distance computations and of the I/O overhead. Navarro [34] started from Delaunay graphs and proposed a simplified uniformly distributed distance model to analyze SA-Tree. It should be noted that both assumptions concern the relative distances, not the spatial distribution of the data objects. For the complete HIV-1 dataset of 1,198 viral strains, we have calculated all pairwise global alignment distances and CCV-based euclidean distances. These distances are plotted separately in Figure 5.11. It is difficult to determine which distribution these two distances follow.
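For reference, a distance histogram such as the one in Figure 5.11 can be produced directly from the computed pairwise distances. The sketch below is a generic helper, assuming an in-memory list of objects and an expensive metric dist(); it is not part of the thesis code.

```python
from itertools import combinations

def pairwise_distance_histogram(objects, dist, num_bins=50):
    """Count all pairwise distances into equal-width bins."""
    d = [dist(a, b) for a, b in combinations(objects, 2)]
    lo, hi = min(d), max(d)
    width = (hi - lo) / num_bins or 1.0          # guard against hi == lo
    counts = [0] * num_bins
    for x in d:
        counts[min(int((x - lo) / width), num_bins - 1)] += 1
    # Return (left bin edge, count) pairs, ready for plotting.
    return [(lo + i * width, c) for i, c in enumerate(counts)]
```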
[Figure 5.11: The distributions of all pairwise global alignment distances and all pairwise CCV-based euclidean distances on the complete HIV-1 dataset of 1,198 viral strains (left: edit distance; right: CCV-based euclidean distance).]

5.3.6 HIV-1 Genotyping Accuracy

The 1,198 HIV-1 strains include 867 pure subtype strains: 70 A, 268 B, 419 C, 54 D, 10 F, 13 G, 3 H, 2 J, 2 K, 5 N, and 21 O, and 331 recombinants: 196 CRF01AE, 52 CRF02AG, 3 CRF03AB, 3 CRF04CPX, 3 CRF05DF, 8 CRF06CPX, 7 CRF07BC, 4 CRF08BC, 5 CRF09CPX, 3 CRF10CD, 10 CRF11CPX, 10 CRF12BF, 6 CRF13CPX, 7 CRF14BG, 5 CRF1501B, 2 CRF16A2D, 4 CRF18CPX, and 3 CRF19CPX. Using the majority vote, we examined the subtyping accuracies by k-nn search with k = 1, 3, 5. In more detail, when k = 1, each of the 1,198 strains was used as a query against the other 1,197 strains, and the query strain was assigned the subtype of its nearest neighbor. When the global alignment distance was used, 1,188 queries, or 99.17%, were assigned the correct subtype or recombination; when the CCV-based euclidean distance was used, 8 more queries, summing up to 99.83%, were assigned a subtype or recombination correctly. The only two strains genotyped incorrectly were DQ207943, from pure subtype B to AB recombination, and AY771588, from BF recombination to pure subtype B.

When k = 3 (5, respectively), the subtypes and recombinations consisting of less than or equal to 3 (5, respectively) strains were removed from consideration. That is, only 8 (7, respectively) pure subtypes and 12 (8, respectively) recombinations were used, which include 1,174 (1,151, respectively) strains. When the global alignment distance was used, 1,160 (1,136, respectively) queries, or 98.81% (98.70%, respectively), were assigned the correct subtype or recombination; when the CCV-based euclidean distance was used, 6 (11, respectively) more queries, summing up to 99.32% (99.65%, respectively), were assigned a subtype correctly.

Table 5.3: The HIV-1 computational genotyping accuracies by k-nn search and majority vote, k = 1, 2, 3, 4, 5.

Distance function   k=1        k=2        k=3        k=4        k=5
HIV-1-GA            99.165%    99.328%    98.807%    98.799%    98.696%
HIV-1-CCV           99.833%    99.832%    99.318%    99.313%    99.652%

Table 5.3 summarizes the HIV-1 computational genotyping accuracies by k-nn search coupled with majority vote, for k = 1, 2, 3, 4, and 5. From these five pairs of genotyping accuracies, one can see that the accuracy using the CCV-based euclidean distance was always higher than the corresponding accuracy using the global alignment distance. In this sense, the CCV representation of the whole genomic sequences and the CCV-based euclidean distance seem to better capture the evolutionary information embedded in the whole genomic sequences.

5.4 Summary

This chapter presents a substantial speed-up technique for the well studied k-nearest neighbor (k-nn) search, based on the novel concepts of virtual pivots and partial pivots, such that a significant number of the expensive distance computations can be avoided. Using the same amount of database preprocessing effort, the new method conducted substantially fewer distance computations during the query stage, compared to several of the best known k-nn search methods, including M-Tree, OMNI, SA-Tree and LAESA.
We demonstrated the performance on two real-life data-sets of varied sizes, and the use of this method on HIV-1 subtyping, where the subtype of the query viral strain is assigned through a majority vote among its most similar strains with known subtypes, and the similarity is defined by global global alignment distance between two whole strains or the euclidean distance between the Complete Composition Vectors of the whole strains. 110 Chapter 6 Conclusions and Future Work The rapidly increasing amount of biological sequence and structure data makes it difficult or even impossible to apply the traditional database searching and mining approaches. This is especially true when the distance function to measure the dissimilarity between biological objects is expensive to compute. In this thesis, we studied the challenges we are facing when performing clustering or similarity search on biological sequence and structure data using an expensive distance function. We identified the central issue as avoiding expensive distance computations and proposed several novel solutions, including directional extent in non-vector data bubbles, pairwise ranking, virtual pivots and partial pivots. In-depth theoretical analysis and extensive experimental results confirmed the superiority of our methods over the previous approaches. During our study, we have observed the potential of ranking in particular pairwise ranking on performing clustering and similarity search. The technique is embedded into several algorithms we proposed, and has been shown to play a significant role in the success of our methods (see the experimental sections of Chapter 3 and Chapter 5). Ranking helps deriving an approximate solution to our problem, and can be computed efficiently in terms of both the number of distance computations and overhead (see Chapter 3). Whether and how this (pairwise) ranking technique can be applied to other searching and mining problems is an interesting direction for future research. 111 Bibliography [1] www.ncbi.nlm.nih.gov/genomes/FLU/Database/select.cgi. [2] S. F. Altschul, T. L. Madden, A. A. Sch¨affer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25:3389–3402, 1997. [3] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD’99), pages 49–60, 1999. [4] V. Athitsos, M. Hadjieleftheriou, G. Kollios, and S. Sclaroff. Query-sensitive embeddings. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (SIGMOD’05), pages 706–717, 2005. [5] R. A. Baeza-Yates, W. Cunto, U. Manber, and S. Wu. Proximity matching using fixedqueries trees. In Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching (CPM’94), pages 198–212, 1994. [6] S.-A. Berrani, L. Amsaleg, and P. Gros. Approximate searches: k-neighbors + precision. In Proceedings of the 2003 Conference on Information and Knowledge Management (CIKM’03), pages 24–31, 2003. [7] K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is “nearest neighbor” meaningful? In Proceedings of the 7th International Conference on Database Theory (ICDT’99), pages 217–235, 1999. [8] P. S. Bradley, U. M. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. 
Bibliography

[1] www.ncbi.nlm.nih.gov/genomes/FLU/Database/select.cgi.
[2] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25:3389–3402, 1997.
[3] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD'99), pages 49–60, 1999.
[4] V. Athitsos, M. Hadjieleftheriou, G. Kollios, and S. Sclaroff. Query-sensitive embeddings. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (SIGMOD'05), pages 706–717, 2005.
[5] R. A. Baeza-Yates, W. Cunto, U. Manber, and S. Wu. Proximity matching using fixed-queries trees. In Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching (CPM'94), pages 198–212, 1994.
[6] S.-A. Berrani, L. Amsaleg, and P. Gros. Approximate searches: k-neighbors + precision. In Proceedings of the 2003 Conference on Information and Knowledge Management (CIKM'03), pages 24–31, 2003.
[7] K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is "nearest neighbor" meaningful? In Proceedings of the 7th International Conference on Database Theory (ICDT'99), pages 217–235, 1999.
[8] P. S. Bradley, U. M. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD'98), pages 9–15, 1998.
[9] M. M. Breunig, H.-P. Kriegel, P. Kröger, and J. Sander. Data bubbles: Quality preserving performance boosting for hierarchical clustering. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (SIGMOD'01), pages 79–90, 2001.
[10] W. A. Burkhard and R. M. Keller. Some approaches to best-match file searching. Communications of the ACM, 16:230–236, 1973.
[11] B. Bustos, G. Navarro, and E. Chávez. Pivot selection techniques for proximity searching in metric spaces. Pattern Recognition Letters, 24:2357–2366, 2003.
[12] E. Chávez, G. Navarro, R. A. Baeza-Yates, and J. L. Marroquín. Searching in metric spaces. ACM Computing Surveys, 33:273–321, 2001.
[13] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB'97), pages 426–435, 1997.
[14] P. Ciaccia, M. Patella, and P. Zezula. A cost model for similarity queries in metric spaces. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'98), pages 59–68, 1998.
[15] C. Digout, M. A. Nascimento, and A. Coman. Similarity search and dimensionality reduction: Not all dimensions are equally useful. In Proceedings of the 9th International Conference on Database Systems for Advanced Applications (DASFAA'04), pages 831–842, 2004.
[16] W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, and D. Pregibon. Squashing flat files flatter. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'99), pages 6–15, 1999.
[17] C. Elkan. Using the triangle inequality to accelerate k-means. In Proceedings of the Twentieth International Conference on Machine Learning (ICML'03), pages 147–153, 2003.
[18] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pages 226–231, 1996.
[19] C. Faloutsos and K.-I. Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD'95), pages 163–174, 1995.
[20] R. F. S. Filho, A. J. M. Traina, C. Traina Jr., and C. Faloutsos. Similarity search without tears: The OMNI family of all-purpose access methods. In Proceedings of the 17th International Conference on Data Engineering (ICDE'01), pages 623–630, 2001.
[21] V. Ganti, R. Ramakrishnan, J. Gehrke, A. L. Powell, and J. C. French. Clustering large datasets in arbitrary metric spaces. In Proceedings of the 15th International Conference on Data Engineering (ICDE'99), pages 502–511, 1999.
[22] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (SIGMOD'84), pages 47–57, 1984.
[23] J. M. Hammersley. The distribution of distance in a hypersphere. The Annals of Mathematical Statistics, 21:447–452, 1950.
[24] J. Han and M. Kamber. Data Mining: Concepts and Techniques (2nd Edition). Morgan Kaufmann, San Francisco, CA, 2006.
[25] G. R. Hjaltason and H. Samet. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems, 28:517–580, 2003.
[26] H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems, 30:364–397, 2005.
[27] W. Jin, A. K. H. Tung, and J. Han. Mining top-n local outliers in large databases. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01), pages 293–298, 2001.
[28] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, NY, 1990.
[29] R. Kolodny and N. Linial. Approximate protein structural alignment in polynomial time. Proceedings of the National Academy of Sciences, 101(33):12201–12206, 2004.
[30] B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'99), pages 16–22, 1999.
[31] B. Ma, J. Tromp, and M. Li. PatternHunter: Faster and more sensitive homology search. Bioinformatics, 18:440–445, 2002.
[32] J. MacQueen. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1:281–297, 1967.
[33] M. L. Micó, J. Oncina, and E. Vidal. A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recognition Letters, 15:9–17, 1994.
[34] G. Navarro. Searching in metric spaces by spatial approximation. The VLDB Journal, 11:28–46, 2002.
[35] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[36] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the USA, 85:2444–2448, 1988.
[37] J. R. Rico-Juan and L. Micó. Comparison of AESA and LAESA search algorithms using string and tree-edit-distances. Pattern Recognition Letters, 24:1417–1426, 2003.
[38] G. Robertson, M. Bilenky, K. Lin, A. He, W. Yuen, M. Dagpinar, R. Varhol, K. Teague, O. L. Griffith, X. Zhang, Y. Pan, M. Hassel, M. C. Sleumer, W. Pan, E. Pleasance, M. Chuang, H. Hao, Y. Y. Li, N. Robertson, C. Fjell, B. Li, S. Montgomery, T. Astakhova, J. Zhou, J. Sander, A. S. Siddiqui, and S. J. M. Jones. cisRED: a database system for genome-scale computational discovery of regulatory elements. Nucleic Acids Research, 34(Database Issue):D68–D73, 2006.
[39] J. Sander, X. Qin, Z. Lu, N. Niu, and A. Kovarsky. Automatic extraction of clusters from hierarchical clustering representations. In Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'03), pages 75–87, 2003.
[40] M. Shapiro. The choice of reference points in best-match file searching. Communications of the ACM, 20:339–343, 1977.
[41] D. Shasha and J. T.-L. Wang. New techniques for best-match retrieval. ACM Transactions on Information Systems, 8:140–158, 1990.
[42] R. Sibson. SLINK: An optimally efficient algorithm for the single-link cluster method. Computer Journal, 16:30–34, 1973.
[43] M. Stoldt, J. Wöhnert, M. Görlach, and L. R. Brown. The NMR structure of Escherichia coli ribosomal protein L25 shows homology to general stress proteins and glutaminyl-tRNA synthetases. The EMBO Journal, 17:6377–6384, 1998.
[44] G. D. Stormo. DNA binding sites: representation and discovery. Bioinformatics, 16:16–23, 2000.
[45] C. Traina Jr., R. F. S. Filho, A. J. M. Traina, M. R. Vieira, and C. Faloutsos. The Omni-family of all-purpose access methods: a simple and effective way to make similarity search more efficient. The VLDB Journal, 16:483–505, 2007.
[46] J. K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40:175–179, 1991.
[47] North Carolina State University. The RNase P Database. http://www.mbio.ncsu.edu/RNaseP/home.html.
[48] E. Vidal. An algorithm for finding nearest neighbours in (approximately) constant average time. Pattern Recognition Letters, 4:145–157, 1986.
[49] E. Vidal. New formulation and improvements of the nearest-neighbour approximating and eliminating search algorithm (AESA). Pattern Recognition Letters, 15:1–7, 1994.
[50] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB'98), pages 194–205, 1998.
[51] D. S. Wishart, D. Arndt, M. Berjanskii, P. Tang, J. Zhou, and G. Lin. CS23D: a web server for rapid protein structure generation using NMR chemical shifts and sequence data. Nucleic Acids Research, 36(Web Server Issue):W496–W502, 2008.
[52] X. Wu, Z. Cai, X.-F. Wan, T. Hoang, R. Goebel, and G.-H. Lin. Nucleotide composition string selection in HIV-1 subtyping using whole genomes. Bioinformatics, 23:1744–1752, 2007.
[53] P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'93), pages 311–321, Philadelphia, PA, 1993.
[54] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD'96), pages 103–114, 1996.
[55] Z. Zhang, S. Schwartz, L. Wagner, and W. Miller. A greedy algorithm for aligning DNA sequences. Journal of Computational Biology, 7:203–214, 2000.
[56] J. Zhou and J. Sander. Data bubbles for non-vector data: Speeding-up hierarchical clustering in arbitrary metric spaces. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB'03), pages 452–463, 2003.
[57] J. Zhou and J. Sander. Speedup clustering with hierarchical ranking. In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM'06), pages 1205–1210, 2006.
[58] J. Zhou, J. Sander, Z. Cai, L. Wang, and G. Lin. Finding the nearest neighbors in biological databases using less direct distance computations. IEEE/ACM Transactions on Computational Biology and Bioinformatics. In press.