Query Expansion using Wordnet
in N-layer Vector Space Model
Jayant R. Gadge, Research Scholar, VJTI Mumbai
S.S. Sane, Professor, VJTI Mumbai
H.B. Kekre, Sr. Professor, MPSTME, Mumbai
2013 Nirma University International Conference on Engineering (NUiCONE)
Presenter: SHIH KAI WUN
Outline
• Introduction
• Related Work
• Wordnet
• Query Expansion Based On The Wordnet
• Result
• Conclusion
Introduction(1/2)

The rapid growth of the web has introduced new challenges.
One of the main difficulties is that users usually submit very short
queries. Studies show that the average query length is 2-3 keywords.

A user query may not contain the most appropriate terms for what the
user actually intends. Besides, these short queries have a high degree
of ambiguity. This is usually referred to as the term mismatch problem
and is a crucial research issue in information retrieval.

Although existing search engines work well to a certain extent, they
still face problems that arise because query and document terms are
compared at the lexical level rather than at the semantic level.
Introduction(2/2)

Short queries and the mismatch between the terms in the user query and
the documents strongly affect the retrieval of relevant documents.
It becomes very difficult to understand the user’s intention and
provide a set of relevant documents.

Query expansion has been widely used as a technique for dealing with
query term incompleteness or term mismatch in information retrieval.
Users frequently provide incomplete queries because they are not fully
aware of the retrieval effectiveness of precise queries.

Thus it is imperative to construct robust queries. One way to solve
this problem is to expand the queries.
Related Work(1/3)

Query expansion is an effective way of enhancing the performance of web
information systems.

Many expansion methods expand queries by adding generated terms. There
are two basic query expansion techniques: query reformulation and query
reweighting. Query reformulation is also referred to as the global
method, and query reweighting as the local method.

In the query reformulation technique, the query terms are reformulated
or expanded independent of the query. The expanded query is used for
searching and the results are displayed. Only individual query terms
are considered for expansion.
Related Work(2/3)

Manually developed resources such as WordNet and the UMLS thesaurus are
commonly used for query expansion. It is observed that this approach
does not show significant improvement in information retrieval if the
original query is well defined. If the query is not well defined, it
shows significant improvement.

Another approach used for query expansion is query log mining.
Cui et al. used query logs to establish probabilistic correlations
between query terms and document terms. These correlations are then
used to refine the expansion terms of a new query. Cui et al. assumed
that the documents visited by the user are relevant to the query. It is
observed that a large query log improves the accuracy of the
information retrieval process.
Related Work(3/3)

Min Song, Yeol Song, Robert B. Allen and Zoran Obradovic proposed a
query expansion approach that identifies phrases from the query. The
key phrases are extracted from the retrieved documents and weighted
with an algorithm based on information gain and co-occurrence of
phrases.

In query reweighting methods, the documents are retrieved with the
original query and this result is refined later. Initially the user’s
query is used to retrieve the ranked documents, and then the user
checks the relevance of the result. The user’s feedback is used to
refine the query and a new ranked list is obtained. The user’s feedback
can be implicit feedback, explicit feedback or pseudo relevance
feedback.
Wordnet(1/2)

WordNet is a large, manually constructed, comprehensive thesaurus
developed at Princeton University. The basic unit in WordNet is the
synset, which represents a lexicalized concept. Synsets are composed of
nouns, verbs, adjectives, and adverbs.

They are connected by bi-directional pointers denoting semantic
relations such as synonymy, antonymy, hyponymy and meronymy.
WordNet superficially resembles a thesaurus. It groups words based on
their meaning. WordNet interlinks not only word forms but also specific
senses of words. It labels the semantic relations among words.
Wordnet(2/2)

The grouping of words in WordNet does not follow any explicit pattern
other than meaning similarity. The most frequently encoded relation
among synsets is the super-subordinate relation. It is referred to as
hypernymy, hyponymy or the ISA relation. It links more general synsets
to increasingly specific ones. All noun hierarchies ultimately go up to
the root node. The hyponymy relation is transitive.

For nouns, the semantic relations include synonymy, hyponymy,
hypernymy and their corresponding counterparts. The semantic relations
between words are used in the query expansion process. The query word
used by the user is the most significant one, but it may not be
sufficient to capture the user’s exact need.
Query Expansion Based On The Wordnet(1/5)

This paper proposes a new query expansion method for N-layer vector
space model based on WordNet. The terms appearing in special locations
such as title, hyperlinks, body and paragraph represent more important
information in the web document.

The document is logically divided into N layers according to its
structure, and weights are assigned to terms based on their presence in
different layers within the document.
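
A minimal sketch of layer-based term weighting may help here. The layer names and the numeric weights below are assumptions made for illustration, as is the helper name layered_term_weights; the slides do not state the actual values or the exact weighting formula.

```python
from collections import Counter

# Hypothetical layer weights: the slides give no numeric values.
LAYER_WEIGHTS = {"title": 3.0, "hyperlink": 2.0, "body": 1.0}


def layered_term_weights(layers, layer_weights=LAYER_WEIGHTS):
    """Add up each term's occurrences, scaled by the weight of its layer."""
    weights = Counter()
    for layer_name, text in layers.items():
        scale = layer_weights.get(layer_name, 1.0)  # unknown layers count once
        for term in text.lower().split():
            weights[term] += scale
    return dict(weights)


# Made-up document split into three layers.
doc = {
    "title": "query expansion",
    "hyperlink": "wordnet",
    "body": "query expansion adds wordnet terms to a query",
}
weights = layered_term_weights(doc)
```

In this sketch a term that appears in the title contributes more to the document vector than the same term in the body, which is the intuition behind the N-layer model.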

After the user submits the original query, the first step is
preprocessing of the query. Preprocessing is carried out in two steps:
1. All stopwords are removed. Stopwords are words that are frequently
used in documents but do not carry any significant meaning.
2. Stemming is applied to the remaining words. Stemming extracts the
root form of each word.
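
The two preprocessing steps can be sketched as follows. The stopword list is a tiny illustrative sample, and naive_stem is a crude suffix stripper standing in for a real stemmer such as Porter's; both are assumptions, since the slides do not say which list or stemmer was used.

```python
# Tiny illustrative stopword list; a real system would use a larger one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and", "is", "are"}

# Suffixes tried in order by the toy stemmer (assumption, not Porter's rules).
SUFFIXES = ("ing", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_query(query):
    """Step 1: drop stopwords. Step 2: stem the remaining words."""
    tokens = query.lower().split()
    kept = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in kept]


terms = preprocess_query("the ranking of courses in computing")
```

With this toy stemmer the example query reduces to the three stems rank, cours and comput.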
Query Expansion Based On The Wordnet(2/5)

Further, each query word is expanded along three dimensions, namely
hypernym, hyponym and synonym, using WordNet to obtain a hierarchy of
semantic relations.
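
The expansion along the three relations can be illustrated with a small in-memory lexicon. TOY_LEXICON is a hypothetical stand-in for WordNet with invented entries, and expand_term and expand_query are illustrative names; a real system would query the WordNet database instead.

```python
# Hypothetical miniature lexicon standing in for WordNet.
# Entries are illustrative only, not copied from the real database.
TOY_LEXICON = {
    "car": {
        "synonym": ["automobile", "auto"],
        "hypernym": ["vehicle"],
        "hyponym": ["taxi", "coupe"],
    },
}

RELATIONS = ("synonym", "hypernym", "hyponym")


def expand_term(term, lexicon=TOY_LEXICON):
    """Return candidate expansion terms for one word, grouped by relation."""
    entry = lexicon.get(term, {})
    return {rel: list(entry.get(rel, [])) for rel in RELATIONS}


def expand_query(terms, lexicon=TOY_LEXICON):
    """Append every candidate of every query term, skipping duplicates."""
    expanded = list(terms)
    for term in terms:
        for candidates in expand_term(term, lexicon).values():
            for candidate in candidates:
                if candidate not in expanded:
                    expanded.append(candidate)
    return expanded


candidates = expand_query(["car", "rental"])
```

Words missing from the lexicon, such as rental here, are simply kept unexpanded.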

WordNet often expands a query with too many words, and some of them are
low-frequency or unusual words. These words may introduce noise and
degrade retrieval performance. This leads to a decrease in precision
and recall.

It is important to avoid noise while expanding queries. Here the term
association concept is used to remove from the hierarchy of semantic
relations those words that have low support and confidence with respect
to the original word.
Query Expansion Based On The Wordnet(3/5)

Words are extracted from the hierarchy relation tree, and the term
association concept is used to mine out terms. The term associations
are based on term co-occurrence. Term co-occurrence has been used in
term-association studies based on the intuition that co-occurring words
are more likely to be similar.
Query Expansion Based On The Wordnet(4/5)

The basic concept of association rules is to find the correlation
between any two terms in a given dataset.

Here a one-to-one mapping is considered between the original query term
and a term from the hierarchy relation tree. ‘Support’ and ‘confidence’
are two important definitions in the field of data mining.

The two definitions, confidence (conf) and support (sup), of a term
association ti → tk are as follows.
Let
D(ti, tk) = D(ti) ∩ D(tk)
where D(ti) and D(tk) stand for the sets of documents containing term
ti and term tk respectively.
Query Expansion Based On The Wordnet(5/5)

The confidence is defined as follows
conf(ti → tk) = ∥D(ti, tk)∥ / ∥D(ti)∥
where ∥D(ti, tk)∥ represents the total number of documents that include
both term ti and tk. The value ∥D(ti)∥ represents the total number
of documents that include ti.

The support is defined as follows
sup(ti → tk) = ∥D(ti, tk)∥ / ∥D∥
where ∥D∥ represents the number of documents in the database.
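
A minimal sketch of the filtering step follows, taking confidence as ∥D(ti, tk)∥ / ∥D(ti)∥ and support as ∥D(ti, tk)∥ / ∥D∥, the standard association-rule forms that match the definitions above. The toy document collection, the thresholds min_sup and min_conf, and all function names are assumptions made for illustration.

```python
def doc_set(term, docs):
    """D(term): IDs of the documents whose term set contains `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


def support(ti, tk, docs):
    """Share of all documents that contain both ti and tk."""
    both = doc_set(ti, docs) & doc_set(tk, docs)
    return len(both) / len(docs) if docs else 0.0


def confidence(ti, tk, docs):
    """Share of the documents containing ti that also contain tk."""
    d_ti = doc_set(ti, docs)
    both = d_ti & doc_set(tk, docs)
    return len(both) / len(d_ti) if d_ti else 0.0


def filter_candidates(ti, candidates, docs, min_sup, min_conf):
    """Keep only candidates that meet both thresholds for query term ti."""
    return [
        tk for tk in candidates
        if support(ti, tk, docs) >= min_sup
        and confidence(ti, tk, docs) >= min_conf
    ]


# Made-up collection: document ID -> set of terms in that document.
docs = {
    1: {"car", "automobile", "engine"},
    2: {"car", "vehicle"},
    3: {"car", "automobile"},
    4: {"train", "vehicle"},
}
kept = filter_candidates(
    "car", ["automobile", "vehicle", "coupe"], docs, min_sup=0.5, min_conf=0.5
)
```

Here automobile co-occurs with car in two of four documents (support 0.5, confidence 2/3) and is kept, while vehicle (support 0.25) and coupe (support 0) are dropped as noise.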
Result(1/5)

For experimental purposes, the UW-CAN-DATASET from the University of
Waterloo is used. The data set consists of 314 web pages from various
web sites at the University of Waterloo and some Canadian websites. Ten
categories are formed from this dataset. Each category contains between
22 and 54 documents.

The N-layer VSM with query expansion is compared with the N-layer VSM
and the standard vector space model. Precision and recall are used to
compare the results of these methods.
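
For reference, the two measures can be computed per query as in the sketch below, using the usual definitions (relevant retrieved documents divided by retrieved documents, and by relevant documents). The document IDs are made up, and the function name precision_recall is an illustrative choice, not taken from the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: 4 documents retrieved, 3 relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Averaging these per-query values over all queries gives the average precision and average recall used to compare the methods.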
Result(2/5)

For experimental purposes, 20 queries are used, which are listed in
Table I. The results are obtained for three approaches, and the N-layer
VSM with query expansion is compared with the vector space model and
the N-layer vector space model.
Table I. User’s Query
Result(3/5)

Table II shows the precision and recall of the results of the 20 user
queries for three methods: vector space model, N-layer vector space
model, and N-layer vector space model with query expansion.
Table II. User’s query with Precision and recall
Result(4/5)

Figure 1 shows the precision graph of the 20 user queries for the three
approaches, and Figure 2 shows the recall graph of the 20 user queries
for the three approaches.
Fig 1. Comparison of Precision of three methods for 20 user’s query
Fig 2. Comparison of recall of three methods for 20 user’s query
Result(5/5)

Table III shows the average precision and average recall of the three
methods.
Table III. Average Precision and Average Recall for three methods
Conclusion

This paper presents a new query expansion approach using WordNet.
WordNet’s noun relations such as hypernym, hyponym and synonym are
used to expand the query terms. The association between the original
query terms and the expanded query terms is checked before adding new
terms. The experimental results indicate that the new query expansion
approach leads to better retrieval of documents.