Relevance Feedback in Content-Based Image Search

Hong-Jiang Zhang, Zheng Chen, Wen-Yin Liu and Mingjing Li
Microsoft Research, China
5F Sigma Center, 49 Zhichun Road, Beijing, 100080, China
E-mail: [email protected]

Abstract: Content-based image retrieval (CBIR) is a research area dedicated to the retrieval and search of multimedia documents for digital libraries. Relevance feedback is a powerful technique in CBIR and has been an active research topic for the past few years. In this paper, we review the current state of the art of research on relevance feedback for CBIR and present the iFind system, developed at Microsoft Research China and equipped with a set of powerful relevance feedback algorithms. We also provide an outlook on the remaining research issues in CBIR, especially on applying learning and data mining technologies to the search of multimedia data on the Web.

1. Introduction: The Challenges in Content-Based Image Retrieval

Efficient image indexing and access tools are essential for making effective use of massive digital image resources, and substantial research effort has been devoted to this issue. However, by and large, the earlier image database systems used in digital libraries [e.g. 1] have all taken keyword or text-based approaches to indexing and retrieving image data. Image annotation is a tedious process, so it is practically impossible to annotate all the images on the Internet. Furthermore, because of the multiplicity of contents in a single image and the subjectivity of human perception and understanding, different users rarely make exactly the same annotations for the same image. To address these limitations, content-based image retrieval (CBIR) approaches have been studied over the last decade [2, 3, 4]. These approaches work with descriptions based on properties that are inherent in the images themselves, such as color, texture, and shape, and use them for retrieval.
Since visual features are automatically extracted from images, automated indexing of image databases becomes possible. However, despite many research efforts, the retrieval accuracy of today’s CBIR algorithms is still limited and often worse than that of keyword-based approaches. The problem stems from the fact that visual similarity measures, such as color histograms, do not necessarily match the perceptual semantics and subjectivity of images. In addition, each type of image feature tends to capture only one of many aspects of image similarity, and it is difficult to require a user to specify clearly which aspect, or what combination of aspects, he or she wants to apply in defining a query. To address these problems, interactive relevance feedback techniques have been proposed. The idea is to incorporate the subjectivity of human perception into the retrieval process by giving users the opportunity to evaluate retrieval results and then automatically refining queries on the basis of those evaluations. Lately, this research topic has become one of the most challenging in CBIR research.

In this paper, we first review the current state of the art of research on relevance feedback in CBIR and discuss a set of representative approaches in Section 2. In Section 3, we present a system called iFind©, developed at Microsoft Research China, to showcase how text-based image search, query by image example, relevance feedback and data mining can be integrated to build a powerful web-based image search engine. Section 4 presents our concluding remarks.

2. Relevance Feedback in CBIR: The State of the Art

In general, the relevance feedback process in CBIR is as follows. For a given query, the CBIR system first retrieves a ranked list of images according to a predefined similarity metric, often defined by the distance between the query vector and the feature vectors of images in a database.
Then, the user selects a set of positive and/or negative examples from the retrieved images, and the system refines the query and retrieves a new list of images. Hence, the key issue in relevance feedback approaches is how to incorporate positive and negative examples into the refinement of the query and/or the similarity measure.

2.1 Classical Relevance Feedback Schemes

The early relevance feedback schemes for CBIR were mainly adopted from text document retrieval research and can be classified into two approaches: query point movement (query refinement) and re-weighting (similarity measure refinement). Both are built upon the vector model in information retrieval theory [5, 6, 7].

The query point movement method essentially tries to improve the estimate of the “ideal query point” by moving it towards good example points and away from bad example points. The technique frequently used to improve this estimate iteratively is Rocchio’s formula, given below for a set of relevant documents $D'_R$ and non-relevant documents $D'_N$ specified by the user [6]:

$$Q' = \alpha Q + \beta \left( \frac{1}{N_{R'}} \sum_{i \in D'_R} D_i \right) - \gamma \left( \frac{1}{N_{N'}} \sum_{i \in D'_N} D_i \right) \tag{1}$$

where $\alpha$, $\beta$, and $\gamma$ are suitable constants, and $N_{R'}$ and $N_{N'}$ are the numbers of documents in $D'_R$ and $D'_N$ respectively. This is the technique implemented in the MARS system [8]. Another implementation of the point movement strategy uses a Bayesian method, such as the work in [9], which uses Bayesian learning to incorporate the user’s feedback and update the probability distribution over all the images in the database. Experiments show that the retrieval performance can be improved considerably by using such relevance feedback approaches.

The central idea behind the re-weighting method is very simple and intuitive. Since each image is represented by an N-dimensional feature vector, we can view it as a point in an N-dimensional space.
Then, the basic idea is to enhance the importance of those dimensions of a feature that help in retrieving the relevant images and to reduce the importance of those dimensions that hinder this process. That is, if the variance of the good examples is high along a principal axis j, we can deduce that the values on this axis are not very relevant to the input query, so we assign a low weight $w_j$ to it. A simple algorithm based on this idea was described in the ImageRover system [10]. This algorithm automatically selects appropriate Minkowski distance metrics that minimize the mean distance between the relevant images specified by the user.

Recently, more computationally robust methods that perform global optimization have been proposed. The MindReader retrieval system designed by Ishikawa et al. [11] formulates the parameter estimation process as a minimization problem. Unlike traditional retrieval systems, whose distance functions can be represented by ellipses aligned with the coordinate axes, the MindReader system proposes a distance function that is not necessarily aligned with the coordinate axes. It therefore allows for correlations between attributes in addition to different weights on each component.

A further improvement over this approach is given by Rui and Huang [12]. The inputs to this query refining system are a query vector $q_i$ corresponding to the ith feature, an N-element vector $\pi = [\pi_1, \ldots, \pi_N]$ that represents the degree of relevance of each of the N input feedback samples, and a set of N training vectors $x_{ni}$ for each feature i. The ideal query vector for each feature i is the weighted average of all positive feedback images:

$$q_i^{T*} = \frac{\pi^T Y_i}{\sum_{n=1}^{N} \pi_n} \tag{2}$$

where $Y_i$ is the $N \times K_i$ training sample matrix for feature i, obtained by stacking the N feedback vectors $x_{ni}$ into a matrix, and $K_i$ is the length of the ith feature vector.
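
As an illustration, the two query-refinement formulas above, Rocchio's update (1) and the relevance-weighted ideal query (2), can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in MARS [8] or in [12]: the function names are hypothetical, feature vectors are plain lists of floats, and the default values of `alpha`, `beta` and `gamma` are illustrative only, since the text merely calls them suitable constants.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]], dim: int) -> Vector:
    """Component-wise mean; returns the zero vector for an empty set."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def rocchio_update(
    query: Sequence[float],
    relevant: Sequence[Sequence[float]],
    non_relevant: Sequence[Sequence[float]],
    alpha: float = 1.0,   # illustrative default, not taken from the paper
    beta: float = 0.75,   # illustrative default, not taken from the paper
    gamma: float = 0.25,  # illustrative default, not taken from the paper
) -> Vector:
    """Equation (1): move the query toward the mean of the relevant
    examples and away from the mean of the non-relevant examples."""
    dim = len(query)
    pos = mean_vector(relevant, dim)
    neg = mean_vector(non_relevant, dim)
    return [alpha * query[d] + beta * pos[d] - gamma * neg[d] for d in range(dim)]


def ideal_query(
    samples: Sequence[Sequence[float]], relevance: Sequence[float]
) -> Vector:
    """Equation (2): relevance-weighted average of the N feedback vectors
    of one feature. `samples` plays the role of the matrix Y_i (one row
    per feedback image) and `relevance` the role of the vector pi."""
    total = sum(relevance)
    if total == 0:
        raise ValueError("at least one feedback sample needs non-zero relevance")
    dim = len(samples[0])
    return [
        sum(p * x[d] for p, x in zip(relevance, samples)) / total
        for d in range(dim)
    ]
```

For example, with the query `[1.0, 0.0]`, relevant examples `[[2.0, 2.0], [4.0, 0.0]]`, one non-relevant example `[[0.0, 4.0]]` and `alpha=1.0, beta=0.5, gamma=0.5`, `rocchio_update` returns `[2.5, -1.5]`. Note that `ideal_query` takes no query argument at all; it depends only on the feedback samples and their relevance scores.
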
It is interesting to note that the original query vector $q_i$ does not appear in (2), which shows that the ideal query with respect to the feedback is not influenced by the initial query.

2.2 Relevance Feedback with Semantics

While all the approaches adapted from text document retrieval presented above do improve the performance of CBIR, they have severe limitations: even with feedback, it is still difficult to capture the high-level semantics of images when only low-level image features are used in queries. The inherent problem with these approaches is that low-level features are often not as powerful in representing the complete semantic content of images as keywords are in representing text documents. In other words, applying the relevance feedback approaches of text document retrieval to image retrieval based on low-level features will not be as successful as in text document retrieval. Low-level features alone are not effective in representing users’ feedback or in describing their intentions. Furthermore, in these algorithms, the semantics potentially captured during relevance feedback in one query session are not retained to continuously improve the retrieval performance of a system.

To overcome these limitations, another school of ideas is to use learning approaches to incorporate semantics into relevance feedback. The PicHunter framework by Cox et al. further extended the relevance feedback and learning idea with a Bayesian approach [13]. With an explicit model of what users would do given the target image they want, PicHunter uses Bayes’ rule to predict the target they want given their actions. This is done via a probability distribution over possible image targets, rather than by refining a query. To achieve this, an entropy-minimizing display algorithm is developed that attempts to maximize the information obtained from the user at each iteration of the search.
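
To make the Bayesian view concrete, the following minimal Python sketch applies Bayes' rule once to a probability distribution over candidate target images. It is a generic illustration only, not PicHunter's actual user model or its display-selection algorithm: the function name `bayes_update` is hypothetical, and the `likelihood` values (the probability of the observed user action under each candidate target) are assumed to be supplied by some user model that is not shown here.

```python
from typing import Dict


def bayes_update(
    prior: Dict[str, float], likelihood: Dict[str, float]
) -> Dict[str, float]:
    """One step of Bayes' rule over candidate targets.

    prior[t]      : current probability that image t is the user's target
    likelihood[t] : probability of the observed user action if t were
                    the target (assumed to come from a user model)
    Returns the normalized posterior distribution over the same targets.
    """
    unnormalized = {t: prior[t] * likelihood[t] for t in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observed action has zero probability under every target")
    return {t: p / total for t, p in unnormalized.items()}
```

Starting from a uniform prior over four images, a user action that is twice as likely under image `a` as under `b` or `c`, and impossible under `d`, shifts the distribution to favor `a` and rules out `d`. Repeating the update after every round of feedback gives the iterative behavior described above, with no query vector being refined.
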
PicHunter also makes use of hidden annotation rather than a possibly inaccurate and inconsistent annotation structure that the user must learn and make queries in. However, this could be a disadvantage, since it excludes the possibility of benefiting from good annotations, which may lead to very slow convergence.

In general, there are two different modes of user interaction in typical retrieval systems: using keywords to represent the semantic contents of the desired images, or query by example based on low-level image features. In most image retrieval systems, these two modes of interaction are mutually exclusive. We argue that combining these two approaches and allowing them to benefit from each other yields substantial advantages in terms of both retrieval accuracy and ease of use.

The framework proposed in [14] attempted to embed semantic information into CBIR processes through relevance feedback, using a semantic correlation matrix and low-level feature distances. In this framework, the semantic relevance between image clusters is learnt from the user’s feedback and used to improve retrieval performance. In other words, the framework maintains the strengths of feature-based image retrieval while incorporating learning and annotation into the relevance feedback process. Experiments have shown that this framework not only improves retrieval performance within a given query session, but also uses the knowledge learnt from previous queries to reduce the number of iterations in subsequent queries.

We have also put forward a framework that performs relevance feedback and query refinement on both the images’ semantic contents, represented by keywords, and the low-level feature vectors. In other words, semantic and low-level feature based relevance feedback are seamlessly integrated.
Only when semantic information is not available does our method reduce to one of the previously described low-level feedback approaches as a special case. This framework has been implemented in the image search engine iFind, as described in detail in the next section.

3. iFind: An Image Search Engine

iFind© is a web-based image retrieval system developed at Microsoft Research China and implemented with Microsoft COM objects [15]. iFind provides keyword-based image search, query by image example, and their combination. Images in this system are represented by low-level visual features, keyword features, and, optionally, annotations when available. The key technologies in the system are integrated semantics- and feature-based image retrieval, the relevance feedback approach, and data mining of the users’ feedback log. The performance improvement of this new approach over traditional CBIR and relevance feedback approaches is significant.

3.1 Document Modeling of Images

Since iFind is an image search engine, most of its images are collected by a crawler, a program that automatically analyzes web pages and downloads the images they contain, together with semantic information, from many websites. From the images and the text content of the web pages we have built the document space model, which is a representation of images using a set of (both visual and semantic) feature vectors. The text features are extracted from image URLs and filenames, page titles, ALT text, hyperlinks, and surrounding text on the web pages according to a set of empirical rules. These text descriptors compose a text feature vector for each image. The TF*IDF method [5] is used to weight each keyword in the text feature vector.

However, a simple combination of traditional text-based retrieval and CBIR is not adequate for image retrieval on the WWW, for several reasons.
First, there is often too much clutter and irrelevant information on the web pages; thus, these text features are less accurate than annotation text. Second, there is a mismatch between the page author’s expression and the user’s understanding and expectation.

To overcome these difficulties, we apply relevance feedback and user log analysis to improve the representation of images in three respects. First, the original document space model built from the images and the text content of the web pages is analyzed to detect and remove clutter and irrelevant text information. The accuracy of the semantic features is thereby improved. Second, the user space model, which consists of the keyword vectors used by the users to represent images in the database, is constructed by analyzing the log data of the users’ relevance feedback. The user space model is then combined with the document space model to reduce the mismatch between the page author’s expression and the user’s understanding and expectation. Third, the relationship between the low-level features and the high-level features is also discovered from the user log analysis.

3.2 iFind Retrieval and Relevance Feedback Framework

The iFind retrieval and relevance feedback framework consists of a semantic network built from an image database that links images to semantic annotations, a similarity measure that integrates both semantic features and image features, and a machine learning algorithm that iteratively updates the semantic network and improves the system’s performance over time [16].

The semantic network is represented by a set of keywords having links to the images in the database. A weight is assigned to each individual link. This representation is shown pictorially in Figure 1. The weight on each link represents the degree of relevance of the keyword to the semantic content of the associated image.
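
The weighted keyword-to-image links described above map naturally onto a nested dictionary. The following Python sketch is a hypothetical illustration of such a data structure, not the iFind implementation: the class name `SemanticNetwork`, its method names, the string image identifiers and the example weights are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class SemanticNetwork:
    """Keywords linked to images by weighted links (illustrative sketch)."""

    def __init__(self) -> None:
        # weights[keyword][image_id] = degree of relevance of the keyword
        # to the semantic content of that image
        self.weights: Dict[str, Dict[str, float]] = defaultdict(dict)

    def link(self, keyword: str, image_id: str, weight: float = 1.0) -> None:
        """Create a link between a keyword and an image, or overwrite its weight."""
        self.weights[keyword][image_id] = weight

    def keywords_of(self, image_id: str) -> Dict[str, float]:
        """All keywords linked to an image, with their link weights."""
        return {
            kw: links[image_id]
            for kw, links in self.weights.items()
            if image_id in links
        }

    def images_for(self, keyword: str) -> List[Tuple[str, float]]:
        """Images linked to a keyword, most relevant first."""
        links = self.weights.get(keyword, {})
        return sorted(links.items(), key=lambda item: item[1], reverse=True)


# Example with hypothetical data: one image carries two keywords
# with different degrees of relevance.
net = SemanticNetwork()
net.link("tiger", "img1", 0.9)
net.link("tiger", "img2", 0.4)
net.link("grass", "img1", 0.3)
```

Here `net.images_for("tiger")` ranks `img1` ahead of `img2`, while `net.keywords_of("img1")` returns both `tiger` and `grass` with their respective weights.
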
It is clear that an image can be associated with multiple keywords, each with a different degree of relevance.

Figure 1: Semantic network of the image database

In our system, initial keyword annotations can be obtained from the Web by the crawler when the images come from the Web. More keywords can be learned from the user’s feedback. Whenever the user marks a set of images as relevant to a keyword, or provides an example image together with a set of keywords, we add the input keywords to the system and link them with these images. This effectively yields a very simple voting scheme for updating the semantic network, in which the keywords with a majority of user consensus emerge as the dominant representation of the semantic content of their associated images. In this way, as more queries are submitted to the system, the system is able to expand its vocabulary. Also, through the voting process, the keywords that represent the actual semantic content of each image receive a large weight.

The iFind framework also extends the feedback algorithm defined by (2) to incorporate the low-level feature based feedback and ranking results into high-level semantic feedback and ranking. We define a unified distance metric function $G_j$ to measure the relevance of any image j within the image database in terms of both semantic and low-level feature content. The function $G_j$ is defined using a modified form of Rocchio’s formula as follows:

$$G_j = \log(1 + \pi_j)\, D_j + \beta \left\{ \frac{1}{N_R} \sum_{k \in N_R} \left[ \left( 1 + \frac{I_1}{A_1} \right) S_{jk} \right] \right\} - \gamma \left\{ \frac{1}{N_N} \sum_{k \in N_N} \left[ \left( 1 + \frac{I_2}{A_2} \right) S_{jk} \right] \right\} \tag{3}$$

where $D_j$ is the distance score computed by the low-level feedback according to (2); $N_R$ and $N_N$ are the numbers of positive and negative feedbacks respectively; $I_1$ and $I_2$ are the numbers of distinct keywords in common between image j and all the positive feedback images, or all the negative feedback images, respectively; $A_1$ and $A_2$ are the total numbers of distinct keywords associated with all the positive and all the negative feedback images respectively; and finally $S_{jk}$ is simply the Euclidean distance between the low-level features of images j and k. We have replaced the first parameter $\alpha$ in Rocchio’s formula with the logarithm of the degree of relevance of the jth image. The other two parameters, $\beta$ and $\gamma$, can be assigned by the user to emphasize the weighting difference between the positive and negative feedbacks. In addition, it can easily be seen that our method degenerates into (1) when no semantic information is available.

The iFind system updates the annotation of feedback images by increasing the linkage to the positive examples’ annotation and decreasing the linkage to the negative examples’ annotation. The updated annotation can further help to improve the image retrieval results of the system in later use. Log mining can also help yield more accurate retrieval results by refining the semantic features.

3.3 iFind Log Mining

To further reduce the ambiguity in the text descriptors extracted from web pages and in the low-level image features, and to improve the search performance, we have proposed a user space model to supplement the original document space model. This is achieved by applying a user log analysis process. The user space model is also a vector space model.
The difference between the user space model and the document space model is that vectors in the user space model are constructed from the information mined from the log data of user interactions. Let Q be the set of all accumulated queries and $T_j$ ($j = 1, \ldots, N_T$) be the keywords that appear in Q. For a query in Q, $I_{ri}$ is one of the relevant images specified by the user and stored in the user log. Based on Bayesian theory, the probability that image $I_{ri}$, which contains keyword $T_j$, is relevant to $T_j$ is

$$P(T_j \mid I_{ri}) = \frac{P(I_{ri} \mid T_j)\, P(T_j)}{P(I_{ri})} \tag{4}$$

where $P(I_{ri} \mid T_j)$ is the probability that image $I_{ri}$ has been retrieved and marked as relevant for those queries that contain the word $T_j$, and $P(T_j)$ is the probability that a query contains $T_j$. For a given image I, the values $P(T_j \mid I)$ ($j = 1, \ldots, N_T$) calculated using (4) form a vector for I. We call this vector the user space model of image I, as opposed to the document space model of image I, which is built from the related features extracted from the web pages.

We have integrated the user space model described above into the original document space model to improve the accuracy of the final document space model. That is, for each image I, with vector U as its user space model and vector D as its document space model, the updated document space model is

$$D_{\mathrm{new}} = \eta U + (1 - \eta) D \tag{5}$$

where $\eta$ is used to adjust the weight between the user space model and the document space model.

Since irrelevant images are also recorded in the user feedback log, we can utilize this information as well. For each irrelevant image $I_{ii}$, we use $P(I_{ii} \mid T_j)$ as the confidence that $I_{ii}$ is irrelevant to query $T_j$ and form a vector $I$. Then, the text feature vector of the image in the final document space model is defined by (6), similar to the TF*IDF method.
$$D_{\mathrm{final}} = D_{\mathrm{new}} \ast (1 - I) \tag{6}$$

With the combination of the document model and the user-perceived model, the retrieval accuracy of the iFind system is significantly improved compared to using only the document model [15].

4. Concluding Remarks

CBIR is an important technology for automating the indexing process and achieving content-based search in digital multimedia libraries. Learning semantics through relevance feedback is the enabling technique, as well as the current research challenge, for improving the retrieval performance of CBIR.

5. References

1. Chang, S. K., and Hsu, A., “Image Information Systems: Where Do We Go From Here,” IEEE Transactions on Knowledge and Data Engineering, Vol. 4, No. 5, October 1992, pp. 431-442.
2. Jain, R., Pentland, A., and Petkovic, D. (editors), Workshop Report: NSF-ARPA Workshop on Visual Information Management Systems, Cambridge, Mass., USA, June 1995.
3. Flickner, M., et al., “Query by Image and Video Content,” IEEE Computer, Vol. 28, No. 9, 1995, pp. 23-32.
4. Furht, B., Smoliar, S. W., and Zhang, H. J., Image and Video Processing in Multimedia Systems, Kluwer Academic Publishers, 1995.
5. Salton, G., and McGill, M. J., Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
6. Rocchio, J. J., “Relevance Feedback in Information Retrieval,” in The SMART Retrieval System, Prentice Hall, 1971, pp. 313-323.
7. Buckley, C., and Salton, G., “Optimization of Relevance Feedback Weights,” in Proc. of SIGIR’95.
8. Rui, Y., Huang, T. S., and Mehrotra, S., “Content-Based Image Retrieval with Relevance Feedback in MARS,” in Proc. IEEE Int. Conf. on Image Proc., 1997.
9. Vasconcelos, N., and Lippman, A., “A Bayesian Framework for Content-Based Indexing and Retrieval,” in Proc. of DCC’98, Snowbird, Utah, 1998.
10. Sclaroff, S., Taycher, L., and Cascia, M. L., “ImageRover: A Content-Based Image Browser for the World Wide Web,” in Proc. of IEEE Workshop on Content-based Access of Image and Video Libraries, 1997.
11. Ishikawa, Y., Subramanya, R., and Faloutsos, C., “MindReader: Querying Databases Through Multiple Examples,” in Proc. of the 24th VLDB Conference, New York, 1998.
12. Rui, Y., and Huang, T. S., “A Novel Relevance Feedback Technique in Image Retrieval,” ACM Multimedia, 1999.
13. Cox, I. J., et al., “The Bayesian Image Retrieval System, PicHunter: Theory, Implementation, and Psychophysical Experiments,” IEEE Trans. on Image Processing, Vol. 9, No. 1, Jan. 2000, pp. 20-37.
14. Lee, C., Ma, W. Y., and Zhang, H. J., “Information Embedding Based on User’s Relevance Feedback for Image Retrieval,” in Proc. of SPIE Photonics East, 1998.
15. Chen, Z., et al., “Web Mining for Web Image Retrieval,” to appear in International Journal of the American Society for Information Science, Special Issue on Visual Based Retrieval Systems and Web Mining.
16. Lu, Y., Hu, C., Zhu, X., Zhang, H. J., and Yang, Q., “A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems,” ACM Multimedia, 2000.