Query Expansion using Wordnet in N-layer Vector Space Model

Jayant R. Gadge, Research Scholar, VJTI Mumbai
S.S. Sane, Professor, VJTI Mumbai
H.B. Kekre, Sr. Professor, MPSTME, Mumbai

2013 Nirma University International Conference on Engineering (NUiCONE)
Presenter: SHIH KAI WUN

Outline
Introduction
Related Work
Wordnet
Query Expansion Based On The Wordnet
Result
Conclusion

Introduction(1/2)
The rapid growth of the web has introduced new challenges.
One of the main difficulties is that users usually submit very short queries. Studies show that the average query length is 2-3 keywords.
A user query may not contain the terms that best express what the user actually intends. Besides, these short queries have a high degree of ambiguity. This is usually referred to as the term mismatch problem and is a crucial research issue in information retrieval.
Existing search engines work well to a certain extent, but they still face problems that arise because query and document terms are compared on a lexical level rather than on a semantic level.

Introduction(2/2)
Short queries and the mismatch between the terms in the user query and the terms in the documents strongly affect the retrieval of relevant documents.
It becomes very difficult to understand the user’s intention and to provide a set of relevant documents.
Query expansion has been widely used as a technique for dealing with query term incompleteness or term mismatch in information retrieval.
Users frequently provide incomplete queries because they are not fully aware of how much precise queries improve retrieval effectiveness. Thus it is imperative to construct robust queries. One way to solve this problem is to expand the queries.

Related Work(1/3)
Query expansion is an effective way of enhancing the performance of web information systems.
Many expansion methods expand queries by adding generated terms.
There are two basic query expansion techniques: query reformulation methods and query reweighting methods.
Query reformulation is also referred to as the global method, and query reweighting as the local method.
In the query reformulation technique, the query terms are reformulated or expanded independently of the query. The expanded query is used for searching and the results are displayed. Only individual query terms are considered for expansion.

Related Work(2/3)
Manually developed resources such as WordNet and the UMLS thesaurus are commonly used for query expansion. It is observed that this approach does not significantly improve information retrieval if the original query is well defined. If the query is not well defined, it shows significant improvement.
Another approach to query expansion is query log mining. Cui et al. used query logs to establish probabilistic correlations between query terms and document terms. These correlations are then used to refine the expansion terms of a new query.
Cui et al. assumed that the documents visited by the user are relevant to the query. It is observed that a large query log improves the accuracy of the information retrieval process.

Related Work(3/3)
Min Song, Yeol Song, Robert B. Allen and Zoran Obradovic proposed a query expansion approach that identifies phrases from the query. The key phrases are extracted from the retrieved documents and weighted with an algorithm based on information gain and co-occurrence of phrases.
In query reweighting methods, the documents are retrieved with the original query and this result is refined afterwards. Initially the user’s query is used to retrieve the ranked documents, and then the user checks the relevance of the result. The user’s feedback is used to refine the query, and a new ranked list is obtained.
The user’s feedback can be implicit feedback, explicit feedback or pseudo relevance feedback.

Wordnet(1/2)
WordNet is a large, manually constructed, comprehensive thesaurus developed at Princeton University.
The basic unit in WordNet is the synset, which represents a lexicalized concept. Synsets are composed of nouns, verbs, adjectives, and adverbs. They are connected by bi-directional pointers denoting semantic relations such as synonymy, antonymy, hyponymy and meronymy.
WordNet superficially resembles a thesaurus. It groups words based on their meaning. WordNet interlinks not only word forms but also specific senses of words. It labels the semantic relations among words.

Wordnet(2/2)
The grouping of words in WordNet does not follow any explicit pattern other than similarity of meaning.
The most frequently encoded relation among synsets is the super-subordinate relation, also referred to as hyperonymy, hyponymy or the ISA relation. It links more general synsets to increasingly specific ones. All noun hierarchies ultimately go up to the root node. The hyponymy relation is transitive.
For nouns, the semantic relations include synonymy, hyponymy, hypernymy and their corresponding counterparts.
The semantic relations between words are used in the query expansion process. The query word supplied by the user is the most significant one, but it may not be sufficient to capture the user’s exact need.

Query Expansion Based On The Wordnet(1/5)
This paper proposes a new query expansion method for the N-layer vector space model based on WordNet.
Terms appearing in special locations such as the title, hyperlinks, body and paragraphs carry more important information in a web document. The document is logically divided into N layers according to its structure, and weights are assigned to terms based on their presence in the different layers of the document.
After the user submits the original query, the first step is preprocessing of the query. Preprocessing is carried out in two steps:
1. All stopwords are removed. Stopwords are words that are used frequently in documents but do not carry any significant meaning.
2. Stemming is applied. Stemming extracts the root form of each word.
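
The two preprocessing steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stopword list is a small hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer (the slides do not say which stemmer is used). The names `STOPWORDS`, `SUFFIXES`, `naive_stem` and `preprocess_query` are assumptions made for this example.

```python
# Hypothetical, deliberately small stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and", "is", "are"}

# Suffixes removed by the stand-in stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_query(query):
    """Step 1: remove stopwords. Step 2: stem the remaining words."""
    tokens = query.lower().split()
    kept = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in kept]


print(preprocess_query("The ranking of web pages"))
```

The stemmed terms, rather than the raw query words, would then be passed on to the expansion step.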

Query Expansion Based On The Wordnet(2/5)
Further, each query word is expanded along three dimensions, namely hypernym, hyponym and synonym, using WordNet to obtain a hierarchy of semantic relations.
WordNet often expands a query with too many words, and some of them are low-frequency or unusual words. These words may introduce noise and degrade retrieval performance, which leads to a decrease in precision and recall.
It is important to avoid noise while expanding queries. Here the term association concept is used to remove from the hierarchy of semantic relations those words that have low support and confidence with respect to the original word.

Query Expansion Based On The Wordnet(3/5)
After words are extracted from the hierarchy relation tree, the term association concept is used to mine out terms.
Term associations are based on term co-occurrence. Term co-occurrence has been used in term-association studies based on the intuition that co-occurring words are more likely to be similar.

Query Expansion Based On The Wordnet(4/5)
The basic idea of association rules is to find the correlation between any two terms in a given dataset. Here a one-to-one mapping is considered between the original query term and a term from the hierarchy relation tree.
‘Support’ and ‘confidence’ are two important definitions in the field of data mining.
The two definitions, confidence (conf) and support (sup), of a term association ti → tk are as follows.
Let D(ti, tk) = D(ti) ∩ D(tk)
where D(ti) and D(tk) stand for the sets of documents containing term ti and term tk respectively.

Query Expansion Based On The Wordnet(5/5)
The confidence is defined as follows:
conf(ti → tk) = ∥D(ti, tk)∥ / ∥D(ti)∥
where ∥D(ti, tk)∥ represents the total number of documents that include both term ti and term tk, and ∥D(ti)∥ represents the total number of documents that include ti.
The support is defined as follows:
sup(ti → tk) = ∥D(ti, tk)∥ / ∥D∥
where ∥D∥ represents the number of documents in the database.

Result(1/5)
For experimental purposes, the UW-CAN-DATASET from the University of Waterloo is used.
The data set consists of 314 web pages from various web sites at the University of Waterloo and some Canadian websites. Ten categories are formed from this dataset. Each category contains between 22 and 54 documents.
The N-layer VSM with query expansion is compared with the N-layer VSM and the standard vector space model. Precision and recall are used to compare the results of these methods.

Result(2/5)
For experimental purposes, 20 queries are used, which are listed in Table I. Results are obtained for all three approaches, and the N-layer VSM with query expansion is compared with the vector space model and the N-layer vector space model.
Table I. User’s Query

Result(3/5)
Table II shows the precision and recall of the results for the 20 user queries obtained with the three methods: vector space model, N-layer vector space model, and N-layer vector space model with query expansion.
Table II. User’s query with Precision and recall

Result(4/5)
Figure 1 shows the precision graph of the 20 user queries for the three approaches, and Figure 2 shows the recall graph of the 20 user queries for the three approaches.
Fig 1. Comparison of Precision of three methods for 20 user’s query
Fig 2. Comparison of recall of three methods for 20 user’s query

Result(5/5)
Table III shows the average precision and average recall of the three methods.
Table III. Average Precision and Average Recall for three methods

Conclusion
This paper presents a new query expansion approach using WordNet.
WordNet’s noun relations such as hypernym, hyponym and synonym are used to expand the query terms. The association between the original query terms and the expanded query terms is checked before new terms are added.
Experimental results indicate that the new query expansion approach leads to better retrieval of documents.
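
As a rough illustration of the association check summarized in the conclusion, the sketch below computes support and confidence for a term association ti → tk as described in the slides: support is the share of all documents that contain both terms, and confidence is the share of documents containing ti that also contain tk. The toy corpus, the candidate terms and the threshold values `min_sup` and `min_conf` are hypothetical; the slides do not state the thresholds used, and this is not the authors' code.

```python
def support_and_confidence(docs, ti, tk):
    """Return (sup, conf) for the association ti -> tk.

    docs is a list of sets, each holding the terms of one document.
    """
    d_ti = [d for d in docs if ti in d]        # D(ti)
    d_both = [d for d in d_ti if tk in d]      # D(ti, tk) = D(ti) ∩ D(tk)
    sup = len(d_both) / len(docs) if docs else 0.0
    conf = len(d_both) / len(d_ti) if d_ti else 0.0
    return sup, conf


def filter_candidates(docs, query_term, candidates, min_sup, min_conf):
    """Keep only candidate expansion terms that meet both thresholds."""
    kept = []
    for cand in candidates:
        sup, conf = support_and_confidence(docs, query_term, cand)
        if sup >= min_sup and conf >= min_conf:
            kept.append(cand)
    return kept


# Hypothetical toy corpus: four documents, each reduced to a set of terms.
docs = [
    {"car", "automobile", "engine"},
    {"car", "automobile"},
    {"car", "vehicle"},
    {"boat", "vehicle"},
]

# Candidates as they might come from a WordNet hierarchy for "car" (made up here).
candidates = ["automobile", "vehicle", "cab"]

# Threshold values are assumptions chosen only for this example.
print(filter_candidates(docs, "car", candidates, min_sup=0.3, min_conf=0.5))
```

With these toy numbers only "automobile" passes both thresholds, so it alone would be added to the expanded query.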