Mechanisms for Scalable Image Searching: A survey

International Journal of Computer and Advanced Engineering Research (IJCAER)
Volume 02– Issue 02, APRIL 2015
Ansila Henderson
M.Tech Student
Department of Computer Science & Engg.
SCTCE, Pappanamcode, Trivandrum, Kerala
[email protected]

Kavitha K. V.
Assistant Professor
Department of Computer Science & Engg.
SCTCE, Pappanamcode, Trivandrum, Kerala
[email protected]
Abstract—With the increase in the number of images on the Internet, there is a strong need to develop methods for efficient image retrieval. Today, image features are high-dimensional and image databases are huge, so the ability of traditional systems, methods, and computer programs to perform image retrieval in an efficient, useful and timely manner is challenged. The latest techniques use hashing methods to embed high-dimensional image features into Hamming space, so that image search can be performed in real time based on the Hamming distance between compact hash codes. Unlike traditional metrics (e.g., Euclidean) that offer continuous distances, Hamming distances are discrete integer values, so a large number of images may share equal Hamming distances to a query. For an image search mechanism, however, fine-grained ranking is very important.
Keywords—inverted file; hashing; KD-tree; indexing
I. INTRODUCTION
Today, the heterogeneity and size of digital image collections grow exponentially, so there is a strong need to develop methods for efficient image retrieval. The primary goal of an image management system is to search images efficiently and in a timely manner, so that it can keep pace with the demands of current applications. Image searching should be based on visual content. To obtain more accurate results with high retrieval performance, many researchers have devised techniques based on different parameters.
An image can be the representation of a real object or a scene. With the development of the Internet and the availability of image capturing devices such as digital cameras, huge numbers of images are created every day in different areas. Many different fields, such as remote sensing, fashion, crime prevention, medicine and architecture, are in need of an efficient image retrieval system. Image similarity can be highly subjective and depends on how the viewer relates the image to the real world. A search for an image involves generating a set of feature vectors that describe the image's characteristics and comparing this set to the features indexed for multiple stored images; the search result is then produced from this comparison.
In recent techniques, images are usually represented using the popular bag-of-visual-words (BoW) framework. In this framework, local invariant image descriptors are extracted and quantized against a set of visual words, so that methods developed for document retrieval can be adapted to the image retrieval task. The BoW features can then be embedded into compact hash codes for efficient search. Hashing is more suitable for the image retrieval task than tree-based indexing structures such as the KD-tree: it requires greatly reduced memory and also works better for high-dimensional samples. Using hash codes, image similarity can be measured efficiently in Hamming space (using bitwise XOR operations). The Hamming distance is an integer value obtained by counting the number of bits at which the binary codes of two images differ.
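As a simple illustration (not taken from any of the surveyed papers), the Hamming distance between two compact hash codes reduces to a single XOR followed by a bit count:

def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions at which two binary hash codes differ."""
    return bin(code_a ^ code_b).count("1")

# Example: the 8-bit codes 10110100 and 10011100 differ in 2 positions.
print(hamming_distance(0b10110100, 0b10011100))  # -> 2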
II. OBJECTIVE OF SCALABLE IMAGE SEARCHING
As the number of images increases, it becomes difficult to find visually similar content. Exhaustive search is infeasible for large-scale applications because it consumes too much time, so efficient search mechanisms such as indexing methods are needed to provide fast search with good retrieval accuracy. The frequency of similar objects in the data space increases with the number of images, and these objects often have similar semantics, so inferences based on nearest neighbors become more reliable. Images are usually described by sets of descriptor vectors with more than a thousand dimensions, and image similarity is examined by nearest neighbor search.
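The baseline that the indexing methods surveyed below try to avoid is exhaustive nearest neighbor search. A minimal sketch with hypothetical NumPy data shows why its cost, linear in both the number of images and the feature dimensionality, becomes prohibitive:

import numpy as np

def exhaustive_nearest_neighbors(query: np.ndarray, database: np.ndarray, k: int = 5):
    """Brute-force k-NN: compare the query against every database vector.

    query:    (D,) feature vector
    database: (N, D) matrix of stored image features
    Cost is O(N * D), which is prohibitive for large N and high D.
    """
    distances = np.linalg.norm(database - query, axis=1)  # Euclidean distances
    return np.argsort(distances)[:k]                      # indices of the k closest images

# Hypothetical example: 10,000 images described by 1,000-dimensional features.
db = np.random.rand(10_000, 1_000).astype(np.float32)
q = np.random.rand(1_000).astype(np.float32)
print(exhaustive_nearest_neighbors(q, db, k=3))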
III. LITERATURE REVIEW
Today, image features are high-dimensional and image databases are huge, so comparing the query image with every database sample becomes computationally prohibitive, which makes efficient search mechanisms critical. Recent image feature representations such as BoW are similar to the bag-of-words representation of textual documents, so methods developed for document retrieval have been adapted to the image retrieval task. The existing work on efficient search mechanisms can be classified into three main categories:
1. Inverted file
2. Tree-based indexing
3. Hashing
A. Inverted file
J. Zobel and A. Moffat described the inverted file and its use in retrieving similar documents [1]. The inverted file is an index structure used for text
query evaluation. The index structure maps terms to the documents that contain them: for each term, the inverted file keeps a list of the identifiers of the documents containing that term. It has two major components:
1. Search structure or vocabulary
2. Set of inverted lists
For each term, the search structure stores a count and a pointer. The count specifies the number of documents containing the term, and the pointer points to the start of the corresponding inverted list. The inverted list stores the identifiers of the documents containing that particular term, together with the number of times the term occurs in each document; the word positions within documents can also be recorded. Together, these components provide all the information needed for query evaluation.
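As a toy illustration of these two components (not code from [1]), a vocabulary of per-term counts over inverted lists of (document id, frequency) postings can be built as follows:

from collections import defaultdict

def build_inverted_file(documents):
    """Map each term to (document count, list of (doc_id, term_frequency) postings)."""
    inverted_lists = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            inverted_lists[term][doc_id] = inverted_lists[term].get(doc_id, 0) + 1
    # Vocabulary entry: term -> (count, postings), mirroring the
    # count-plus-pointer layout described above.
    return {term: (len(postings), sorted(postings.items()))
            for term, postings in inverted_lists.items()}

docs = {1: "large scale image search", 2: "image hashing for search", 3: "tree based indexing"}
index = build_inverted_file(docs)
print(index["image"])   # (2, [(1, 1), (2, 1)]) -> "image" occurs once in documents 1 and 2
print(index["search"])  # (2, [(1, 1), (2, 1)])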
In a complete text database system there are several other structures, including the documents themselves and a table that maps ordinal document numbers to disk locations. Each inverted list is stored contiguously, rather than as a sequence of blocks that are linked or indexed in some way.
The similarity score for a document is calculated based on how many times a particular term occurs in that document. In a ranked query, a phrase is treated as an ordinary term: a lexical entity that occurs in given documents with given frequencies and contributes to the similarity score of a document when it appears. Similarity can therefore be computed in the usual way, but the inverted lists of the terms in the phrase must first be combined, using a Boolean intersection algorithm, to construct an inverted list for the phrase itself.
The set of identified phrases can be added to the vocabulary with their own inverted lists, and users can then query them without any alteration to the query evaluation procedure. However, such indexing is potentially expensive: the number of distinct two-word phrases grows more rapidly than the number of distinct terms, there is no obvious mechanism for accurately identifying which phrases might be used in queries, and the number of candidate phrases is enormous.
The inverted index was initially proposed, and is still very popular, for document retrieval, and was later introduced to the field of image retrieval. In this structure, a list of references to the images containing each visual word is created, so that the images relevant to a given query can be quickly located. A key difference between document retrieval and visual search is that textual queries usually contain very few words, whereas a single image may contain hundreds of visual words, resulting in a large number of candidate images that need further verification. This largely limits the application of inverted files for large-scale image search. Increasing the visual vocabulary size in BoW reduces the number of candidates, but it also increases memory usage [2].
B. Tree-based indexing
C. Silpa-Anan and R. Hartley proposed optimised KD-trees for fast image descriptor matching [3]. Tree-based indexing dramatically improves image retrieval quality. The elements stored in a KD-tree are high-dimensional vectors. At the root of the tree the data is split into two halves by a hyperplane orthogonal to a chosen dimension at a threshold value; this split is usually made at the midpoint of the dimension with the greatest variance in the data set. By comparing the query vector with the splitting value, one can easily determine to which half of the data the query vector belongs. Each of the two halves is then recursively split in the same way to generate a fully balanced binary tree. At the bottom of the tree, each leaf node corresponds to a single point in the data set, although in some cases leaf nodes may contain more than one point. The height of the tree is log2(N), where N is the number of points in the data set [3].
For a given query vector, descending the tree requires log2(N) comparisons. The data point associated with the leaf node that is reached is the first candidate for the nearest neighbor. Each leaf node of the tree corresponds to a cell of the partitioned space, so any query point lying in a given leaf cell leads to the same leaf node.
KD-trees are effective in low dimensions, but their efficiency decreases for high-dimensional data. For high-dimensional image features there are a large number of nodes to search, so a KD-tree spends a lot of time backtracking through the tree to find the optimal solution. By limiting the amount of backtracking, the certainty of finding the exact nearest neighbor is sacrificed and replaced with probabilistic performance [3]. Recent research aims at increasing the probability of success while keeping the backtracking within reasonable limits.
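This trade-off can be sketched with SciPy's KD-tree, used here as a stand-in for the multi-tree implementation of [3]; the eps parameter bounds the backtracking and yields an approximate answer:

import numpy as np
from scipy.spatial import cKDTree

# Hypothetical data: 10,000 descriptors in a low dimension, where KD-trees work well.
rng = np.random.default_rng(0)
data = rng.random((10_000, 8)).astype(np.float32)
query = rng.random(8).astype(np.float32)

tree = cKDTree(data)                                   # recursive splitting into a balanced tree
d_exact, i_exact = tree.query(query, k=1)              # exact search (full backtracking)
d_approx, i_approx = tree.query(query, k=1, eps=0.5)   # approximate search: less backtracking,
                                                       # distance within (1+eps) of the true nearest
print(i_exact, i_approx)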
C. Hashing Techniques
Binary hash codes are very compact in memory, and the search can be performed efficiently using hash table lookups or bitwise operations, so hashing satisfies both the time and the memory requirements. Hashing methods used for image retrieval can be divided into three main categories [12]: supervised methods, unsupervised methods and semi-supervised methods.
1. Supervised methods:
Supervised hashing is based on the machine learning task known as supervised learning. It generates hash functions from labeled training data. The training data consist of a set of training samples, where each sample is a pair consisting of an input object and a desired output value. The functions derived by the supervised learning algorithm can then be used to map new samples; in the optimal scenario they correctly determine the class labels of unseen samples. The learning algorithm therefore needs to generalize from the training data to unseen situations in a reasonable way.
2. Unsupervised methods:
Unsupervised hashing is based on the machine learning task known as unsupervised learning. It uses unlabeled data to generate binary codes for the given points, trying to find hidden structure in the data; there is no error signal with which to evaluate a possible solution. Many approaches employed in unsupervised learning are based on data mining methods, including clustering (e.g., k-means), mixture models, and blind signal separation using feature extraction techniques for dimensionality reduction.
3. Semi-supervised methods:
Semi-supervised learning is a class of machine learning techniques that also makes use of unlabeled data for training. Generally a small amount of labeled data is used together with a large amount of unlabeled data, so semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine learning researchers have found that a considerable improvement in learning accuracy can be achieved by combining a small amount of labeled data with unlabeled data.
Locality Sensitive Hashing (LSH) is one of the most popular unsupervised hashing methods in computer vision. Another effective method, Spectral Hashing (SH), was proposed by Weiss et al. [4]. Since unsupervised methods do not require any labeled data, their parameters are easy to learn using a pre-specified distance metric. However, in vision problems the similarity between data points sometimes cannot be defined with a simple metric: metric similarity of the image descriptors may not preserve semantic similarity. Instead, pairs of images labeled as 'similar' or 'dissimilar' can be provided, and from such pairwise labeled data a hashing mechanism can automatically generate codes that respect the semantic similarity.
The output of supervised methods may be more meaningful to humans, but it is difficult to label all the images in a huge database, and not everything in the real world has a distinctive label. Image similarity is related more to high-level image features than to low-level features, so it is often better to use semi-supervised learning for the image retrieval task.
Locality Sensitive Hashing (LSH)
P. Indyk and R. Motwani proposed Locality Sensitive Hashing [5]. LSH maps similar samples to the same bucket with high probability, so the locality structure of the original space is preserved in the Hamming space. More precisely, the hash functions h(.) from an LSH family satisfy the following locality preserving property [5]:

P{h(x) = h(y)} = sim(x, y)        (2.1)
where the similarity measure sim(x, y) can be directly linked to a distance function. A typical category of LSH functions consists of random projections and thresholds:

h(x) = sign(w^T x + b)        (2.2)

where w is a random hyperplane and b is a random intercept. The random vector w is data independent and is usually constructed by sampling each of its components from a p-stable distribution. Although there exists an asymptotic theoretical guarantee for random-projection-based LSH, it is not very efficient in practice since it requires multiple tables with long codes. Constructing a total of l hash tables, each of K-bit length with H(x) = [h1(x), . . . , hK(x)], gives for each table the collision probability P{H(x) = H(y)} = sim(x, y)^K, since the K bits are generated independently.
For a large-scale application, the value of K should be fairly large to reduce the size of each hash bucket. However, a large value of K decreases the collision probability between similar samples, so multiple hash tables need to be constructed to maintain recall. This is inefficient because it incurs extra storage cost and longer query time. Kulis and Grauman [6] extended LSH to work in arbitrary kernel space, and Chum et al. [7] proposed min-hashing to extend LSH to sets of features.
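A small self-contained sketch of random-projection LSH as in Eq. (2.2), with hypothetical parameter choices (K bits per table, l tables) rather than settings from [5]:

import numpy as np

class RandomProjectionLSH:
    """Random-projection LSH: l hash tables of K bits each (Eq. 2.2)."""

    def __init__(self, dim, n_bits=16, n_tables=4, seed=0):
        rng = np.random.default_rng(seed)
        # One (K x D) matrix of random hyperplanes and one K-vector of intercepts per table.
        self.w = rng.standard_normal((n_tables, n_bits, dim))
        self.b = rng.standard_normal((n_tables, n_bits))
        self.tables = [dict() for _ in range(n_tables)]

    def _codes(self, x):
        # h(x) = sign(w^T x + b), packed into one integer key per table.
        bits = (np.einsum("tkd,d->tk", self.w, x) + self.b) > 0
        return [int("".join("1" if v else "0" for v in row), 2) for row in bits]

    def insert(self, item_id, x):
        for table, code in zip(self.tables, self._codes(x)):
            table.setdefault(code, []).append(item_id)

    def query(self, x):
        # Union of the buckets the query falls into across all tables.
        candidates = set()
        for table, code in zip(self.tables, self._codes(x)):
            candidates.update(table.get(code, []))
        return candidates

# Hypothetical usage with random 128-dimensional descriptors.
lsh = RandomProjectionLSH(dim=128)
data = np.random.rand(1000, 128)
for i, v in enumerate(data):
    lsh.insert(i, v)
print(len(lsh.query(data[0])))  # data[0] at least collides with itself in every table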
Spectral Hashing (SH)
Weiss et al. proposed the spectral hashing (SH) method, which hashes the input space based on the data distribution [4]. In spectral hashing, bits are computed by thresholding a subset of eigenvectors of the Laplacian of the similarity graph. The SH algorithm consists of three key steps [4]:
1. Extraction of the maximum-variance directions through Principal Component Analysis (PCA) of the data.
2. Direction selection, which prefers to partition projected dimensions with large range and small spatial frequency.
3. Partition of the projected data by a sinusoidal function with the previously computed angular frequency.
SH is very effective at encoding large-scale, low-dimensional data, since the important PCA directions are selected multiple times to create the binary bits. However, for high-dimensional images where many directions contain enough variance, each PCA direction is picked only once, because the top few projections have similar ranges and a low spatial frequency is preferred. In such cases SH approximately reduces to a PCA projection followed by partitioning at the mean. In SH the projection directions are data dependent but learned in an unsupervised manner, and the assumption of a uniform data distribution is usually not true for real-world data.
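A much simplified sketch of the degenerate behaviour described above for high-dimensional data, i.e., a PCA projection followed by thresholding at the mean; the eigenfunction selection and sinusoidal partitioning of the full SH algorithm in [4] are omitted:

import numpy as np

def pca_mean_threshold_codes(X, n_bits):
    """Project data onto the top principal directions and threshold at the mean.

    X: (N, D) matrix of image features. Returns an (N, n_bits) binary matrix.
    """
    Xc = X - X.mean(axis=0)                                 # center the data
    cov = np.cov(Xc, rowvar=False)                          # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)                  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_bits]]    # top-variance directions
    proj = Xc @ top                                         # projected data
    return (proj > proj.mean(axis=0)).astype(np.uint8)      # one bit per direction

codes = pca_mean_threshold_codes(np.random.rand(500, 64), n_bits=16)
print(codes.shape)  # (500, 16)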
Semantic Hashing
R. Salakhutdinov and G. Hinton proposed semantic hashing for the retrieval of similar documents [8]. Learning with deep belief networks (DBNs), originally proposed for dimensionality reduction [9], was later adopted for semantic hashing in large-scale search applications. The deep belief network needs image labels during the training phase to generate hash codes. The DBN structure gradually reduces the number of units in each layer, so the high-dimensional input of original image features can be projected into a compact Hamming space.
A general DBN is a directed acyclic graph in which each node represents a stochastic variable. There are two main steps in hash code generation using a DBN: first, learning the interactions between variables, and second, inferring observations from inputs. Learning a DBN with multiple layers requires estimating millions of parameters. To reduce this difficulty, the DBN is structured as a stack of Restricted Boltzmann Machines (RBMs) [9]. Each RBM consists of two layers containing visible units and hidden units, respectively, and multiple RBMs are stacked to form a deep belief net. The network can be specifically designed to reduce the number of units layer by layer and finally output compact hash codes.
The training process of a DBN consists of two main stages: unsupervised pre-training and supervised fine-tuning. The pre-training phase aims to place the network weights (and biases) in suitable neighborhoods of the parameter space. After the parameters of one layer have converged via Contrastive Divergence, the outputs of that layer are fixed and treated as inputs to drive the training of the next layer. During the fine-tuning stage, labeled data are used to define an objective function and refine the network parameters through back-propagation, with conjugate gradient descent used to maximize that objective. The optimal weights of the entire network are thus obtained.
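A heavily simplified numerical sketch of the greedy, layer-wise pre-training just described, using one-step Contrastive Divergence; biases and the supervised fine-tuning stage are omitted, and the layer sizes are hypothetical rather than taken from [8] or [9]:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=10, lr=0.05, seed=0):
    """One-step Contrastive Divergence (CD-1) for a binary RBM, biases omitted."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(v0 @ W)                           # positive phase
        h_sample = (h0 > rng.random(h0.shape)).astype(float)
        v1 = sigmoid(h_sample @ W.T)                   # reconstruction
        h1 = sigmoid(v1 @ W)                           # negative phase
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)  # CD-1 weight update
    return W

def dbn_hash_codes(features, layer_sizes=(256, 64, 16)):
    """Greedy layer-wise pre-training; threshold the top layer to get binary codes."""
    x = features
    for n_hidden in layer_sizes:                       # gradually reduce the number of units
        W = train_rbm(x, n_hidden)
        x = sigmoid(x @ W)                             # outputs of this layer drive the next
    return (x > 0.5).astype(np.uint8)                  # compact hash codes

codes = dbn_hash_codes(np.random.rand(200, 512), layer_sizes=(128, 32))
print(codes.shape)  # (200, 32)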
Semi-Supervised Hashing
J. Wang et al. proposed semi-supervised hashing (SSH) for scalable image retrieval [10]. Here, for a given set of points, a fraction of the pairs are associated with two categories of label information. A neighbor-pair is a pair of points that are either neighbors in a metric space or share common class labels; similarly, a pair of points is called a non-neighbor-pair if the two samples are far apart in the metric space or have different class labels.
The objective function of SSH consists of two major components: supervised empirical fitness and an unsupervised information-theoretic regularizer. The supervised part tries to minimize an empirical error on the small amount of labeled data, while the unsupervised term provides effective regularization by maximizing desirable properties such as the variance and independence of the individual bits. Formally, given a set of n points S = {p_i}, i = 1, 2, ..., n, p_i in R^D, a small fraction of the pairs are labeled: a pair (p_i, p_j) belongs to the neighbor set M if the two points share common class labels (or are metric neighbors), and to the non-neighbor set C if they have no common class label (or are far apart). For each hash function h_k, the objective combining these two components can be written as
J(h_k) = SUM_{(p_i, p_j) in M} h_k(p_i) h_k(p_j)  -  SUM_{(p_i, p_j) in C} h_k(p_i) h_k(p_j)  +  var[h_k(p)]        (2.3)
where the first term (the two sums over M and C) measures the empirical accuracy over the labeled pairs and the second term realizes the maximum entropy principle.
The empirical accuracy of a family of hash functions H is defined as the difference between the total number of correctly classified pairs and the total number of wrongly classified pairs. Maximizing the empirical accuracy over just a few pairs can lead to severe overfitting, hence a regularizer is used which utilizes both the labeled and the unlabeled data. From an information-theoretic point of view, one would like to maximize the information provided by each bit. By the maximum entropy principle, a binary bit that gives a balanced partition of the points provides maximum information [10]. The maximum variance condition states that a hash function with maximum entropy must maximize the variance of the hash values, and vice versa.
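As an illustration (not the optimization procedure of [10]), the two components of Eq. (2.3) can be evaluated for a single candidate bit over hypothetical labeled pairs:

import numpy as np

def ssh_objective(codes, neighbor_pairs, non_neighbor_pairs):
    """Evaluate Eq. (2.3) for one bit: empirical fitness on labeled pairs
    plus a variance term rewarding balanced (maximum-entropy) bits.

    codes: (N,) array of hash values in {-1, +1} for all points.
    """
    fitness = sum(codes[i] * codes[j] for i, j in neighbor_pairs)       # reward agreeing neighbor-pairs
    fitness -= sum(codes[i] * codes[j] for i, j in non_neighbor_pairs)  # penalize agreeing non-neighbor-pairs
    regularizer = np.var(codes)  # equals 1 (its maximum) when the bit splits the points evenly
    return fitness + regularizer

codes = np.array([+1, +1, -1, -1, +1, -1])
M = [(0, 1), (2, 3)]   # hypothetical neighbor-pairs (same class / close in metric space)
C = [(0, 2), (1, 5)]   # hypothetical non-neighbor-pairs
print(ssh_objective(codes, M, C))  # -> 5.0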
D. Comparison of mechanisms in image retrieval
A comparison of the mechanisms used in image retrieval is given in Table I.
TABLE I. COMPARISON OF IMAGE SEARCHING TECHNIQUES

Technique               | Methodology                    | Advantages                                     | Disadvantages
Inverted Files          | Indexing and query evaluation  | Queries are ranked by statistical similarity   | Increases memory usage
Tree-based Indexing     | Multiple randomized KD-trees   | Dramatic improvement in retrieval quality      | Does not work well with high-dimensional features
Hashing Techniques [5]  | Semi-supervised hashing        | Minimizes the empirical error on labeled data  | Performance degrades with less training data due to overfitting
IV. CONCLUSION
With the increasing demands of multimedia applications over the Internet, the importance of image retrieval has also increased. In this survey, recent image searching techniques were discussed; all of these techniques have their own advantages as well as certain limitations. State-of-the-art solutions often use hashing methods to embed high-dimensional image features into Hamming space, where search can be performed in real time based on the Hamming distance between compact hash codes. Unlike traditional metrics (e.g., Euclidean) that offer continuous distances, Hamming distances are discrete integer values. As a consequence, there are often a large number of images sharing equal Hamming distances to a query, which largely hurts search results where fine-grained ranking is important.
REFERENCES
[1] J. Zobel and A. Moffat, "Inverted files for text search engines," ACM Comput. Surveys, 2006.
[2] H. Jegou, M. Douze, and C. Schmid, "Packing bag-of-features," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[3] C. Silpa-Anan and R. Hartley, "Optimised kd-trees for fast image descriptor matching," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[4] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," Adv. Neural Inf. Process. Syst., 2008.
[5] P. Indyk and R. Motwani, "Approximate nearest neighbors: Towards removing the curse of dimensionality," Proc. Symp. Theory of Computing, 1998.
[6] B. Kulis and K. Grauman, "Kernelized locality-sensitive hashing for scalable image search," Proc. IEEE Int. Conf. Computer Vision, 2009.
[7] O. Chum, M. Perdoch, and J. Matas, "Geometric min-hashing: Finding a (thick) needle in a haystack," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[8] R. Salakhutdinov and G. Hinton, "Semantic hashing," Proc. Workshop of ACM SIGIR Conf. Research and Development in Information Retrieval, 2007.
[9] G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, 2006.
[10] J. Wang, S. Kumar, and S.-F. Chang, "Semi-supervised hashing for scalable image retrieval," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[11] A. Torralba, R. Fergus, and Y. Weiss, "Small codes and large image databases for recognition," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[12] G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, 2002.
[13] E. Hörster and R. Lienhart, "Deep networks for image retrieval on large-scale databases," Proc. ACM Int. Conf. Multimedia, 2008.
[14] J. Goldberger, G. Hinton, S. Roweis, and R. Salakhutdinov, "Neighbourhood components analysis," Proc. of NIPS, 2005.
[15] J. Goldberger, G. Hinton, S. Roweis, and R. Salakhutdinov, "Neighbourhood components analysis," Adv. Neural Inf. Process. Syst., 2004.
[16] Y.-G. Jiang, J. Wang, X. Xue, and S.-F. Chang, "Query-adaptive image search with hash codes," IEEE Trans. Multimedia, 2013.