RQUERY: Rewriting Text Queries to Alleviate the Vocabulary Mismatch Problem on RDF Knowledge Bases

Saeedeh Shekarpour1 and Sören Auer1,2
1 EIS Research Group, University of Bonn
2 Fraunhofer-Institute IAIS

Abstract. For non-expert users, a textual query is the most popular and simple means of communicating with a retrieval or question answering system. However, there is a risk of receiving queries which do not match the background knowledge. Although query expansion is a solution in traditional information retrieval, it is not the appropriate choice for question answering systems. In this paper, we propose a new method for automatically rewriting input queries leveraging background knowledge. We employ Hidden Markov Models to determine the most suitable derived words from linguistic resources. We introduce the concept of triple-based co-occurrence for recognizing co-occurring words in RDF data. The model was bootstrapped with three different statistical distributions. The training as well as the test datasets were extracted from the QALD benchmark. Our experimental study demonstrates the high accuracy of the approach.

1 Introduction

While the amount of information published on the Web of Data is growing dramatically, retrieving information remains an issue due to several known challenges. A key challenge is the lack of accurate knowledge of the vocabularies used, which even expert users frequently get wrong. Thus, communicating with search systems, particularly through simple interfaces (i.e., textual queries), requires automatic ways of tackling the vocabulary mismatch challenge. This challenge is even more important for schema-aware search systems such as question answering systems, since there a precise interpretation of the input query as well as accurate spotting of the answer is more demanding.
Different origins can be considered as causes of the vocabulary mismatch problem, the main ones being:

– Inflectional form: a variation of a word for different grammatical categories such as tense, aspect, person, number, etc. For example, the keyword ‘actress’ may need to be altered to ‘actor’, or ‘companies’ to ‘company’. Stemming and lemmatization aim at reducing inflectional forms by converting words to a base form.
– Lexical form: relates words based on lexical categories. For example, the keyword ‘wife’ can be altered to ‘spouse’, or ‘altitude’ to ‘elevation’, because they hold the same meaning. Synonymy, hyponymy and hypernymy are examples of lexical relations.
– Abbreviation form: a shortened form of a word or phrase. For example, ‘UN’ is the abbreviation of ‘United Nations’. A common way to deal with abbreviations is using a dictionary.

Various approaches have been proposed to address the vocabulary mismatch problem, the most important ones being:

– Using a controlled vocabulary maintained by human editors. This approach is common for relatively small or restricted domains.
– Automatically deriving a thesaurus. For instance, word co-occurrence statistics over a text corpus are an automatic way to induce a thesaurus.
– Interactive query expansion, which provides a list of recommendations to the end user. The recommendations can come from sources such as query logs or a thesaurus.
– Automatic query expansion, which automatically (without any user intervention) adds derived words to the input query in order to increase recall in retrieval systems.
– Query rewriting based on query log mining, which leverages manual query rewrites. This approach requires comprehensive query logs and is thus particularly appropriate for web search engines.

RDF data is structured data which can be viewed as a directed, labeled graph Gi = (Vi, Ei) where Vi is a set of nodes comprising all entities and literal property values, and Ei is a set of directed edges, i.e.
the set of all properties. In this paper, we exploit the topology as well as the semantics of RDF data for query rewriting. We propose RQUERY3, a method for automatic query rewriting which exploits the internal structure of RDF data. We employ a Hidden Markov Model (HMM) to obtain optimal tuples of derived words. We test different bootstrapping methods applying well-known distributions as well as an algorithm based on Hyperlink-Induced Topic Search (HITS). Our main contributions are as follows:

– We define the concept of triple-based co-occurrence of words in RDF knowledge bases.
– We extend the Hidden Markov Model for producing and ranking query rewrites.
– We extend a benchmark extracted from the QALD benchmark for the query rewriting task.
– We assess and analyse the effectiveness of the proposed approach in two directions: (1) how effective the approach is at addressing the vocabulary mismatch problem (taking into account queries which have the vocabulary mismatch problem) and (2) how effective it is at avoiding noise (taking into account queries which do not have the vocabulary mismatch problem).

The paper is structured as follows: In the following section, we present an overview of RQUERY in more detail along with some of the notations and concepts used in this work. Section 3 presents the proposed approach for query rewriting. Our evaluation results are presented in Section 4; related work is then reviewed in Section 5. Finally, we conclude and give pointers to future work.

3 Demo is at: http://rquery.sina.eis.iai.uni-bonn.de/

Fig. 1: Overview of RQUERY (modules: segment generation, segment expansion, derived word validation, HMM model construction, Viterbi algorithm run, ranked list of rewritten queries; external resources: WordNet, RDF knowledge base).

2 RQUERY Overview

RQUERY obtains a textual query Q as input; as output it provides a ranked list of query rewrites, each of which can be considered as an alternative to the original input.
Figure 1 illustrates a high-level overview of RQUERY, which comprises six main modules. Each module runs sequentially after the previous one and consumes its output. Furthermore, RQUERY relies on external sources, i.e., a linguistic thesaurus (e.g. WordNet [5]) and an RDF knowledge base along with its associated ontology (e.g. DBpedia [9]). In the following, we describe the main objective of each module.

Segment Generation. An input textual query Q is initially preprocessed (i.e. tokenization and stop word removal are applied). Then, the remaining keywords are considered as an n-tuple of keywords q = (k1, k2, ..., kn). Each contiguous subsequence of this tuple is called a segment.

Definition 1 (Segment). For a given n-tuple of keywords q = (k1, k2, ..., kn), the segment s(i,j) is the sequence of keywords from start position i to end position j, i.e., s(i,j) = (ki, ki+1, ..., kj).

We generate all possible segments which can be derived from the given n-tuple of keywords q. Since the number of keywords is low (in most queries less than six4), generating this set is not computationally expensive. This set is represented as S = {s(i,j) | 1 ≤ i ≤ j ≤ n}.

4 http://www.keyworddiscovery.com/keyword-stats.html?date=2012-08-01

Segment Expansion. This module expands the segments derived from the previous module using WordNet as a linguistic thesaurus. The linguistic features of WordNet employed in RQUERY are:

– Synonyms: words having the same or a very similar meaning to the input word.
– Hypernyms: words representing a generalization of the input word.

An observation from our previous research [14] revealed that the hyponym5 relationship leads to deriving a large number of expansion terms whereas their contribution to the vocabulary mismatch task is low. Thus, to prevent a negative influence on efficiency, we do not take the hyponym relationship into account. We employ both the original segment s and its lemma s′ for expansion6.
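The segment enumeration of Definition 1 can be sketched as follows (a minimal illustration; the function name is ours, not from the RQUERY implementation):

```python
def generate_segments(keywords):
    """Enumerate all segments s(i,j) (Definition 1) of the keyword tuple q."""
    n = len(keywords)
    return [tuple(keywords[i:j + 1]) for i in range(n) for j in range(i, n)]

# For q = ("profession", "bandleader"):
print(generate_segments(("profession", "bandleader")))
# [('profession',), ('profession', 'bandleader'), ('bandleader',)]
```

For n keywords this yields n(n+1)/2 segments, which stays small for the short queries considered here.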
In other words, in case the original segment s differs from its lemma s′, we first derive words for the original segment and then derive words for its lemma. The expansion set is formally defined as follows.

Definition 2 (Segment Expansion Set). For a given segment s, we define the associated expansion set, denoted by ESs, as the union of all the following words:
1. Original words: the given segment s extracted from the n-tuple of keywords q.
2. Lemmatized words: words determined as the lemma of the given segment s.
3. Linguistic words: words derived by applying linguistic features to the given segment s as well as its lemma s′.

Derived Word Validation. For every given segment s ∈ S we construct its associated expansion set ESs. Then we form the set of all derived words as the union of all available expansion sets, W = ∪_{s ∈ S} ESs. Subsequently, we validate each word w ∈ W against the existing vocabulary in the underlying RDF knowledge base (KB). More precisely, we check the occurrence of each word w in the KB by sub-string matching in the literal position of all available triples in the RDF knowledge base. If no occurrence is observed, we simply exclude that word from our derived word set. After the validation phase, the remaining words in the derived word set are exploited in the following operations.

Hidden Markov Model Construction. We aim at distinguishing a group of derived words which respects the intention of the input query Q. In other words, these grouped words convey the same meaning but may differ from the input query words. If these grouped words do not suffer from the vocabulary mismatch problem, then substituting them for the input query resolves the vocabulary mismatch problem. The substitution of the input query is called a query rewrite, which is defined as follows:

Definition 3 (Query Rewrite).
For a given n-tuple query q = (k1, k2, ..., kn), a query rewrite is an m-tuple of keywords qr = (k′1, k′2, ..., k′m) where each k′i either matches a keyword kj in q or was linguistically derived from either a keyword kx or a segment s(x,y) of the input query.

We address the problem of finding the appropriate query rewrite by employing a Hidden Markov Model (HMM). In Section 3, we elaborate on all aspects of constructing the model as well as bootstrapping the parameters. Here, we succinctly present an overview. We construct a hidden Markov model in three steps: 1. The state space is populated. 2. Transitions between states are established. 3. Parameters are bootstrapped.

5 Words representing a specialization of the input word.
6 For brevity, we skip the indexes of segment symbols which represent start and end positions.

Fig. 2: The constructed state space for the query "profession of bandleader" (states include job, occupation, profession, vocation, line, business, director, music director, conductor, band leader and bandleader).

Assume the input query is "profession of bandleader" and, according to the benchmark, it should be rewritten as "occupation of bandleader". The state space is populated with all validated words derived from this query. Then, all transitions between states are recognised and established. Figure 2 illustrates this model. Each ellipse represents a state containing a derived word. The dashed arrows originating from the states and pointing to the keywords determine the emitted keyword of each state. Transitions between states are represented via black arrows. Arrows originating from the start point indicate states from which the first input keyword is observable.

Viterbi Algorithm Run. The Viterbi algorithm or Viterbi path [18] is a dynamic programming approach for finding the optimal path through an HMM for a given input query.
It discovers the most likely states through which the sequence of input keywords is observable.

3 Rewriting Queries using HMM

In this section, we describe how we use a Hidden Markov Model (HMM) for rewriting the input query. First, we introduce the notation of the HMM parameters, the construction of the state space and the transitions between states; then we detail how we bootstrap the parameters of our HMM. Formally, a Hidden Markov Model (HMM) is a quintuple λ = (X, Y, A, B, π) where:

– X is a finite set of states. In our case, X equals the set of validated derived words W.
– Y denotes the set of observations. Herein, Y equals the set of all segments Seg derived from the input n-tuple of keywords q.
– A : X × X → [0, 1] is the transition matrix. Each entry aij is the transition probability Pr(Sj|Si) from state i to state j.
– B : X × Y → [0, 1] represents the emission matrix. Each entry bih = Pr(h|Si) is the probability of emitting the symbol h from state i.
– π : X → [0, 1] denotes the initial probability of states.

3.1 State Space

A priori, the state space is populated with as many states as the total number of words existing in the literal positions of the triples available in the underlying RDF knowledge base. In this regard, the number of states is potentially large. To reduce the number of states, we exclude irrelevant states based on the following observation: a relevant state is a state whose associated word is (1) equal to a segment of the input query, (2) equal to a lemma of a segment of the input query, or (3) linguistically derived from a segment of the input query. Thus, we limit the state space X to the set of validated derived words W. From each state, the observable keyword (i.e., the emitted string) is the segment s from which the associated word w of the state is derived. For instance, the word job is derived from the segment profession, so the keyword profession is emitted from the state associated with the word job.
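As a concrete toy illustration, the quintuple λ = (X, Y, A, B, π) for a two-keyword query could be held in a structure like the following; the states come from the running example, but every probability value below is invented for illustration only:

```python
# Toy instantiation of λ = (X, Y, A, B, π); all probabilities are made up.
hmm = {
    "X": ["profession", "job", "occupation", "bandleader", "director"],  # states
    "Y": ["profession", "bandleader"],                                   # observations
    # A[(i, j)] = Pr(S_j | S_i): transition probabilities
    "A": {("profession", "bandleader"): 0.6, ("profession", "director"): 0.4},
    # B[(i, seg)] = Pr(seg | S_i): emission probabilities
    "B": {("profession", "profession"): 0.9, ("job", "profession"): 0.1},
    # pi[i]: initial probability of state i
    "pi": {"profession": 0.5, "occupation": 0.3, "job": 0.2},
}
```

Each row of A and the vector π must sum to one, a property the bootstrapping formulas in Section 3.3 enforce by construction.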
3.2 Transitions between States

We define transitions between states based on the concept of co-occurrence of words. We adapt the concept of co-occurrence of words from the traditional information retrieval context to the context of RDF knowledge bases. Triple-based co-occurrence means co-occurrence of words in literals found in the resource descriptions of two resources of a given triple:

1. Two words w1 and w2 co-occur in literal values of the property rdfs:label of resources placed in the
   – subject as well as predicate of a given triple (subject-predicate co-occurrence),
   – subject as well as object of a given triple (subject-object co-occurrence),
   – predicate as well as object of a given triple (predicate-object co-occurrence).
2. Two words w1 and w2 co-occur in the literal of a given triple as well as in the rdfs:label of the resource placed in the
   – subject of that triple (subject-literal co-occurrence),
   – predicate of that triple (predicate-literal co-occurrence).

Figure 3 illustrates the five graph patterns which we employ to check the co-occurrence of two words w1 and w2. In the following, we present the formal definition along with a sample SPARQL query recognizing co-occurrence in the subject-predicate position.

Definition 4 (Triple-based Co-occurrence). In a given triple t = (s, p, o), two words w1 and w2 are co-occurring if they appear in the labels (rdfs:label) of at least two resources (i.e., (s, p), (s, o) or (o, p)).

The following SPARQL query checks the co-occurrence of w1 and w2 in the subject-predicate position.

ASK WHERE {
  { ?s ?p ?o . ?s rdfs:label ?sl . ?p rdfs:label ?pl .
    FILTER(regex(?sl, "word1") && regex(?pl, "word2")) }
  UNION
  { ?s ?p ?o . ?s rdfs:label ?sl . ?p rdfs:label ?pl .
    FILTER(regex(?sl, "word2") && regex(?pl, "word1")) }
  FILTER(langMatches(lang(?sl), "en") && langMatches(lang(?pl), "en"))
}
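The ASK pattern above can be instantiated for arbitrary word pairs; a small sketch (template string and helper name are ours):

```python
# Hypothetical helper: instantiate the subject-predicate ASK pattern
# shown above for an arbitrary pair of words.
ASK_TEMPLATE = """ASK WHERE {{
  {{ ?s ?p ?o . ?s rdfs:label ?sl . ?p rdfs:label ?pl .
     FILTER(regex(?sl, "{w1}") && regex(?pl, "{w2}")) }}
  UNION
  {{ ?s ?p ?o . ?s rdfs:label ?sl . ?p rdfs:label ?pl .
     FILTER(regex(?sl, "{w2}") && regex(?pl, "{w1}")) }}
  FILTER(langMatches(lang(?sl), "en") && langMatches(lang(?pl), "en"))
}}"""

def subject_predicate_ask(w1, w2):
    """Build the ASK query testing subject-predicate co-occurrence."""
    return ASK_TEMPLATE.format(w1=w1, w2=w2)

query = subject_predicate_ask("profession", "bandleader")
```

The resulting string can be sent to the knowledge base's SPARQL endpoint; a true result establishes a transition between the two corresponding states.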
Fig. 3: The graph patterns employed for recognising co-occurrence of the two given words w1 and w2: (a) subject-predicate, (b) subject-object, (c) subject-literal, (d) predicate-object, (e) predicate-literal. Please note that the letters s, p, o, c, l, a respectively stand for subject, predicate, object, class, rdfs:label and rdf:type.

3.3 Bootstrapping Parameters

Commonly, supervised learning is employed for estimating Hidden Markov Model parameters. An important consideration here is that we encounter a dynamic model, meaning that the state space as well as the issued observation (i.e., the sequence of input keywords) vary from query to query. Thus, the learned probability values should be generic and not query-dependent, because learning model probabilities for each individual query is not feasible. Instead, we rely on bootstrapping, a technique used to estimate an unknown probability distribution function. We apply three distributions (i.e., normal, uniform and Zipfian) to find out the most appropriate one. For bootstrapping the model parameters A and π, we take into account the co-occurrence between words as well as the frequency of words in the RDF knowledge base. Word frequency is generally defined as the number of times a word appears in the knowledge base. Herein, we adopt an implicit word frequency which considers the type of the resources in whose rdfs:label literals the given word appears. In our previous research [13], we observed that generally the connectivity degree of resources of type class and property is higher than that of instance resources. In DBpedia, for example, classes have an average connectivity degree of 14,022, while properties have on average 1,243 and instances 37. We assign a static word frequency, denoted by wf, based on the word's appearance position. In other words, if a given word w appears in the label of a class, it obtains a higher word frequency value. Equation 1 specifies the word frequency values according to their appearance position.
With respect to our underlying knowledge base (i.e., DBpedia), we assign the logarithm of the average connectivity degree (as mentioned above) to each α; for instance, α2 approximates log(1243) ≈ 3.

    wf = α1  if w ∈ ClassLabels
         α2  if w ∈ PropertyLabels     (1)
         α3  if w ∈ InstanceLiterals

We transform these parameters, i.e., word co-occurrence as well as word frequency, into hub and authority values computed by the HITS algorithm. Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that was originally developed for ranking Web pages [8]. It assigns a hub and an authority value to each Web page. The hub value estimates the value of links to other pages and the authority value estimates the value of the content on a page. Hub and authority values are mutually interdependent and computed in a series of iterations. In each iteration, the authority value is updated to the sum of the hub scores of the referring pages, and the hub value is updated to the sum of the authority scores of the referred pages. After each iteration, hub and authority values are normalized. This normalization process causes these values to eventually converge. Since RDF data is a graph of linked entities, we employ a weighted version of the HITS algorithm in order to implicitly take the co-occurrence as well as the frequency of words into account. The weight wfi is the frequency of the word associated with the state Si. Authority and hub values are computed as follows:

    auth(Sj) = Σ_{Si} wfi · hub(Si)
    hub(Sj)  = Σ_{Si} wfi · auth(Si)

Transition Probability. The transition probability of state Sj following state Si is denoted aij = Pr(Sj|Si). Note that the condition Σ_{Sj} Pr(Sj|Si) = 1 holds. The transition probability from the state Si to the state Sj is computed by:

    aij = Pr(Sj|Si) = ( auth(Sj) / Σ_{∀aik>0} auth(Sk) ) × hub(Si)

Here, the probability mass from state Si to the neighbouring states is distributed proportionally to their authority values.
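The weighted HITS iteration and the transition formula can be sketched as follows; the function and variable names are ours, `edges` stands for the recognised co-occurrence transitions and `wf` for the static word frequencies of Equation 1:

```python
def weighted_hits(edges, wf, iterations=50):
    """Weighted HITS sketch: auth(S_j) sums wf_i * hub(S_i) over states
    S_i linking to S_j; hub(S_i) sums wf_j * auth(S_j) over the states
    S_j that S_i links to; both vectors are normalised after each round."""
    nodes = {n for e in edges for n in e}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        auth = {j: sum(wf[i] * hub[i] for i, k in edges if k == j) for j in nodes}
        norm = sum(auth.values()) or 1.0
        auth = {j: v / norm for j, v in auth.items()}
        hub = {i: sum(wf[j] * auth[j] for k, j in edges if k == i) for i in nodes}
        norm = sum(hub.values()) or 1.0
        hub = {i: v / norm for i, v in hub.items()}
    return auth, hub

def transition_prob(i, j, edges, auth, hub):
    """a_ij per the paper's formula: auth(S_j) normalised over the
    neighbours of S_i, scaled by hub(S_i)."""
    neighbours = [k for a, k in edges if a == i]
    return auth[j] / sum(auth[k] for k in neighbours) * hub[i]
```

On a small graph such as edges = [("a", "b"), ("a", "c"), ("b", "c")] with unit weights, the state "c" (pointed to by both others) ends up with the highest authority.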
Consequently, states with higher authority values are more likely to be visited.

Initial Probability. The initial probability π(Si) is the probability that the model assigns to the initial state Si at the beginning. The initial probabilities fulfill the condition Σ_{Si} π(Si) = 1. We denote the states for which the first keyword is observable by InitialStates. The initial probabilities are defined as follows:

    π(Si) = hub(Si) / Σ_{Sj ∈ InitialStates} hub(Sj)

In fact, π(Si) of an initial state is distributed proportionally to the hub values.

Emission Probability. The probability of emitting a given segment seg from the state Si depends on the linguistic relation between the word wSi associated with the state and the given segment seg. This probability has the value θ if the state's associated word wSi equals either seg or its lemma seg′, and the value η if wSi is linguistically derived from seg. This assignment is denoted below:

    bik = Pr(seg|Si) = θ  if wSi = seg ∨ wSi = seg′
                       η  if wSi ∈ ESseg − {seg, seg′}     (2)
                       0  if wSi ∉ ESseg

Intuitively, θ should be larger than η. A statistical analysis of our query corpus confirms this assumption. Accordingly, around 82% of the words do not have a vocabulary mismatch problem. Hence, taking either the original words or the lemmatised words into account suffices to a large extent. Only around 12% of the words have a vocabulary mismatch problem. However, we cannot rely solely on these statistics. We consider the difference γ between θ and η and perform a parameter sensitivity evaluation on γ.

3.4 Viterbi Algorithm

The Viterbi algorithm is a dynamic programming approach for finding the optimal path through an HMM for a given input. It discovers the most likely sequence of hidden states through which the input query is observable. The most likely path has the maximum joint emission and transition probability of the involved states. Each subpath of this path also has the maximum probability for the corresponding sub-sequence of input keywords.
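A minimal sketch of the standard (single-path) Viterbi computation over the λ components defined above; the dictionary-based interface is ours, and missing A, B or π entries default to zero:

```python
def viterbi(states, observations, A, B, pi):
    """Return the most likely state sequence for the observation
    sequence, plus its joint probability. A, B, pi are dicts;
    absent entries count as probability 0."""
    # Initialise with pi(s) * Pr(first observation | s).
    V = [{s: pi.get(s, 0.0) * B.get((s, observations[0]), 0.0) for s in states}]
    back = []
    for obs in observations[1:]:
        prev = V[-1]
        col, ptr = {}, {}
        for s in states:
            # Best predecessor of s for this observation.
            best = max(states, key=lambda r: prev[r] * A.get((r, s), 0.0))
            col[s] = prev[best] * A.get((best, s), 0.0) * B.get((s, obs), 0.0)
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, V[-1][last]
```

Enumerating all paths through which the query is observable, as RQUERY does, requires keeping every back-pointer set rather than only the maximising one.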
The common version of this algorithm tracks only the most likely path. We extended it so that it detects and lists all possible paths through which the given input query is observable.

3.5 Ranking Mechanisms

A hidden Markov model is prone to detecting paths with equal probability values. In other words, two paths might tie for a place in the corresponding ranking. We adopt two ranking mechanisms (i.e. dense ranking and modified competition ranking)7. Although both of them assign the same ranking number to items with equal probability values, the latter leaves a gap in the ranking numbers. They are defined as follows:

7 http://www.merriam-webster.com/dictionary/ranking

1. Dense ranking assigns the same rank number to paths with equal probability values; it does not leave any gap either after or before items with the same values. The next item(s) are simply assigned the following ranking number. Dense ranking is denoted by R. For instance, assume the output is four paths labeled A, B, C, D. The highest probability value belongs to A, B and C obtained equal probability values, and D acquired the lowest probability value. Dense ranking places A 1st; both B and C are placed 2nd, and D is placed 3rd.
2. Modified competition ranking places the gap before the items with equal probability values; this gap covers a range which is one less than the number of items with equal probability values. Modified competition ranking is denoted by R′. With respect to the example above, applying modified competition ranking places A 1st, then one gap is left; subsequently both B and C are placed 3rd, and ultimately D is placed 4th.

Example 1. Let us continue with the given query "profession of bandleader".
The columns of Table 1 respectively show a subset of (1) the segments, (2) the states of the state space which are associated with derived words and (3) the semantic relation between the derived words and the segments. After running the Viterbi algorithm, the generated top-6 outputs are as follows:

    R   R′  Probability  Query Rewrite
    1   1   0.0327       profession bandleader
    2   2   0.0138       profession director
    3   3   0.0036       profession conductor
    4   5   0.00327      profession music director
    4   5   0.00327      occupation bandleader
    5   6   0.00138      occupation director

Table 1: A subset of segments, state space and deriving relation type for the given query "profession of bandleader" (e.g. for the segment profession, derived states include profession (original keyword) and job, occupation, business and line (hypernyms); for the segment bandleader, bandleader (original keyword) and director and music director (hypernyms)).

4 Evaluation

Expansion and rewriting methods are prone to yielding a large number of irrelevant words which negatively influence runtime as well as accuracy. For instance, in [14] we showed that for short queries the number of derived words is significantly high. Thus, the goal of our evaluation is to investigate positive as well as negative impacts by raising the following two questions: (1) How effective is the approach at addressing the vocabulary mismatch problem, employing queries which have the vocabulary mismatch problem? (2) How effective is the approach at avoiding noise, employing queries which do not have the vocabulary mismatch problem? We employ the Mean Reciprocal Rank (MRR) as our metric for measuring the accuracy8 of the outputs. This metric takes the order of results into account. The mean reciprocal rank is defined as the average of the reciprocal ranks of the outputs for a set of queries and is calculated as:

    MRR = (1/|Q|) · Σ_{i=1..|Q|} 1/rank_i
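The two ranking schemes of Section 3.5 and the MRR metric can be sketched as follows (a minimal illustration; function names are ours):

```python
def dense_rank(scores):
    """Dense ranking R: equal scores share one rank, with no gaps."""
    order = sorted(set(scores), reverse=True)
    return [order.index(s) + 1 for s in scores]

def modified_competition_rank(scores):
    """Modified competition ranking R': equal scores share the lowest
    position of their group, leaving the gap before them."""
    return [sum(1 for t in scores if t > s) + scores.count(s) for s in scores]

def mean_reciprocal_rank(ranks):
    """MRR = (1/|Q|) * sum of 1/rank_i over the query set; a rank of
    None (failed query) contributes 0."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)

# The four-path example from Section 3.5 (A > B = C > D):
probs = [0.4, 0.3, 0.3, 0.1]
print(dense_rank(probs))                 # [1, 2, 2, 3]
print(modified_competition_rank(probs))  # [1, 3, 3, 4]
```

The two schemes disagree exactly on the ties, which is why MRR and MRR′ are reported separately below.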
Mean reciprocal rank is computed on the dense ranking as well as the modified competition ranking; the former is denoted by MRR and the latter by MRR′. The QALD series benchmarks9 are the only available benchmarks tailored for question answering on Linked Data. In these benchmarks, every textual query is associated with an individual SPARQL query. Generally, both the textual and the SPARQL queries range over different complexities; thus, the transformation of textual queries into SPARQL queries poses various challenges, the vocabulary mismatch problem being an important one. In [14] we extracted a dataset from the QALD series benchmarks. Herein, we reuse and extend that benchmark. This dataset contains queries of which half have the vocabulary mismatch problem and half do not. Our experimental study is divided into two parts. First, we perform an evaluation of the bootstrapping parameters of the Hidden Markov Model; we use 10 initial queries from the benchmark for this purpose. Second, we evaluate the overall accuracy using the remaining queries.

4.1 Evaluation of Bootstrapping

We perform an experimental study to discover the optimum settings for the emission, initial and transition probabilities. These parameters are interrelated and equally influence the output. Thus, we run a multi-dimensional experiment. The main dimensions as well as goals of this experiment are as follows:

– To discover a suitable distribution function for the transition and initial probabilities. In this respect, we apply and compare the uniform distribution versus the normal distribution.
– To find the optimum value for γ, the difference between θ and η in the emission probability. We ran a parameter sensitivity evaluation over γ ranging from 0 to 0.9.
– To measure the effectiveness of the HITS algorithm.
In this respect, we ran the distribution functions separately with two input random variables X and Y, defined respectively as the word frequency and the sum of the hub and authority values of the sink state10.

8 Since all our computations are carried out on the fly, runtime is not the subject of our evaluation in this work.
9 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/index.php?x=task1&q=3n for n = 1, 2, 3, 4, 5.

    ID   Query                                    Mismatch word  Match word
    Q1   profession bandleader                    profession     occupation
    Q2   Barak Obama wife                         wife           spouse
    Q3   Is Natalie Portman an actress?           actress        actor
    Q4   Lawrence of Arabia conflict              conflict       battle
    Q5   children of Benjamin Franklin            children       child
    Q6   movies with Tom Cruise                   movie          film
    Q7   husband of Amanda Palmer                 husband        spouse
    Q8   altitude of Everest                      altitude       elevation
    Q9   companies in California                  company        organisation
    Q10  writer of Wikipedia                      writer         author
    Q11  soccer clubs in Spain                    -              -
    Q12  employees of Google                      -              -
    Q13  successor of John F. Kennedy             -              -
    Q14  nicknames of San Francisco               -              -
    Q15  Statue of Liberty built                  -              -
    Q16  capital of Canada                        -              -
    Q17  companies in Munich                      -              -
    Q18  governor of Texas                        -              -
    Q19  official languages of the Philippines    -              -
    Q20  Who founded Intel?                       -              -

Table 2: Queries of the training dataset.

Table 2 shows the queries of the training dataset. Half of the queries have the vocabulary mismatch problem and the other half do not. Figure 4 presents the MRR and MRR′ achieved in the bootstrapping experiment over the different settings. The substantial findings of this experiment can be summarised as follows:

– The uniform distribution significantly outperforms the normal distribution in the majority of settings.
– Increasing the value of γ consistently results in a more precise ranking. The optimum value is γ = 0.9. This experimental finding confirms our previous observation that around 80% of the keywords of queries do not have the vocabulary mismatch problem.
– While employing the input variable Y has a more substantial positive impact for queries which do not have the vocabulary mismatch problem, its effect is slightly below that of the input variable X for queries which have the vocabulary mismatch problem. On average, the input variable Y has the higher impact.

4.2 Evaluation

In this part, we show the results of the experiment over the test queries, taking the best learnt setting into account. Table 3 presents MRR and MRR′ for the queries of the test dataset. In four cases, Q1, Q2, Q3 and Q4, RQUERY failed. The reason for the failures is simply that the match word cannot be derived via lexical or inflectional relations. For instance, the keyword ‘extinct’ should be matched to ‘conservation status’, which is not possible with the current relations. For the rest of the queries, no failure was observed.

10 Note that for a given edge the source state is the one from which the edge originates and the sink state is the one where the edge ends.

Fig. 4: Monitoring MRR and MRR′ for various settings on the training dataset (uniform and normal distributions on X and Y, γ ranging from 0 to 0.9).

Fig. 5: MRR and MRR′ of uniform and normal distributions for γ = 0.9.

    ID   Query                                              Mismatch word  Match word            RR     RR′
    Q1   actors casted in film                              cast           starring              -      -
    Q2   animals that are extinct                           extinct        conservation status   -      -
    Q3   Who owns Aldi?                                     owns           key person            -      -
    Q4   city inhabitants                                   inhabitants    population            -      -
    Q5   television shows were created by Walt Disney       created        creator               1      1
    Q6   cars produced in germany                           car            automobile            0.11   0.11
    Q7   launch pads operated by NASA                       operated       operator              1      1
    Q8   official languages of Countries                    countries      country               1      1
    Q9   Who developed the video game World of Warcraft?    developed      developer             1      1
    Q10  Greek goddesses dwell                              dwell          abide                 0.125  0.125
    Q11  Abraham Lincoln's death place                      -              -                     1      1
    Q12  area code of Berlin                                -              -                     1      1
    Q13  Is proinsulin a protein?                           -              -                     1      1
    Q14  owner of universal studios                         -              -                     1      1
    Q15  currency of the Czech Republic                     -              -                     1      1
    Q16  spoken language in Estonia                         -              -                     0.33   0.33
    Q17  Tim Burton's films                                 -              -                     1      1
    Q18  breeds of the German Shepherd dog                  -              -                     0.5    0.5
    Q19  ingredients of carrot cake                         -              -                     1      1
    Q20  largest city in Canada                             -              -                     1      1

Table 3: MRR and MRR′ on the queries of the test dataset.

5 Related Work

Automatic Query Expansion (AQE) has been the focus of researchers for a long time. It aims at improving retrieval efficiency. Here we divide the available works into two parts. The first part discusses approaches employed for query expansion on the Web of Documents. The second part presents the state of the art of query expansion on the Semantic Web.

Query Expansion on the Web of Documents. The various approaches differ in the choice of data sources and features. A data source is a collection of words or documents. The choice of the data source influences the size of the vocabulary (expansion terms) as well as the available features. Furthermore, a data source is often restricted to a certain domain and thus constitutes a certain context for the search terms. Common choices for data sources are text corpora, WordNet synsets, hyponyms and hypernyms, anchor texts, search engine query logs or top ranked documents.
A popular way of choosing the data source is Pseudo Relevance Feedback (PRF) based query expansion. In this method, the top-n retrieved documents are assumed to be relevant and are employed as the data source. As one of the early works, [10] expands queries based on word similarity using the cosine coefficient. [16, 17] give the theoretical basis for detecting similarity based on co-occurrence. Feature selection [11] consists of two parts: (1) feature weighting assigns a score to each feature and (2) the feature selection criterion determines which of the features to retain based on the weights. Some common feature selection methods are mutual information, information gain, divergence from randomness and relevance models. A framework for feature weighting and selection methods is presented in [11]. The authors compare different feature ranking schemes and show that SVM achieves the highest F1-score of the examined methods. [4] presents a comprehensive survey of AQE in information retrieval and details a large number of candidate feature extraction methods.

Query Expansion on the Semantic Web Since the Semantic Web publishes structured data, expansion methods can be enhanced by taking the structure of the data into account. A very important choice is defining and employing new features. In [14] we presented semantic features which can perform as well as linguistic features. A similar approach is [2], which relies on supervised learning and uses only semantic expansion features instead of a combination of both semantic and linguistic ones. [20] mines equivalent relations from Linked Data, relying on three measures of equivalency: triple overlap, subject agreement and cardinality ratio. However, while AQE is prevalent in traditional search engines, the existing semantic search engines either do not address the vocabulary mismatch problem or address it in a naive way.
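The PRF scheme described above can be sketched as follows (a minimal illustration using plain term-frequency weighting; the function and the toy documents are ours, not taken from any of the cited systems, which use more elaborate feature weighting):

```python
from collections import Counter

def prf_expansion_terms(query, top_docs, k=5):
    """Pseudo Relevance Feedback: treat the top-n retrieved documents as
    relevant and pick the k most frequent non-query terms as expansion terms."""
    query_terms = set(query.lower().split())
    counts = Counter(
        term
        for doc in top_docs
        for term in doc.lower().split()
        if term not in query_terms
    )
    return [term for term, _ in counts.most_common(k)]

# Toy example: expand a query with frequent terms from the top-ranked documents.
docs = ["berlin area code telephone", "telephone area code for berlin city"]
print(prf_expansion_terms("area code of berlin", docs, k=2))  # -> ['telephone', 'for']
```

In practice, a stop-word filter and a weighting scheme such as mutual information or divergence from randomness (mentioned above) would replace the raw frequency counts.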
SemSeK [1], SINA [15] and QAKiS [3] are semantic search engines which still do not apply query expansion. Alexandria [19] uses Freebase to include synonyms and different surface forms. MHE11 combines query expansion and entity recognition by using textual references to a concept and extracting Wikipedia anchor texts of links. This approach takes advantage of a large amount of hand-made mappings that emerge as a byproduct. However, it is only viable for Wikipedia-DBpedia or other text corpora whose links are mapped to resources. Eager [7] expands a set of resources with resources of the same type using DBpedia and Wikipedia categories (instead of linguistic and semantic features as in our case). Eager extracts implicit category and synonym information from abstracts and redirect information, determines additional categories from the DBpedia category hierarchy and then extracts additional resources which have the same categories in common. PowerAqua [12] is an ontology-based system that answers natural language queries and uses WordNet synonyms and hypernyms as well as resources related via the owl:sameAs property.

6 Conclusion and Future Work

In this paper, we presented a method for automatic query rewriting. The proposed approach benefits from the structure of the data for recognising the best query rewrites. It employs a Hidden Markov Model whose transitions between states are defined based on the concept of triple-based co-occurrence. An experimental study was performed on a training dataset in order to detect the optimum setting for the model parameters. While the results of the evaluation show the feasibility as well as the high accuracy of the approach, we still need to take more semantic relations into account for tackling the vocabulary mismatch problem. Furthermore, since all computations are carried out on the fly, in the future we are going to construct a semantic index of word co-occurrences in order to speed up retrieval requests.
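The decoding step of the model summarized above can be illustrated with generic Viterbi decoding [18] (a schematic sketch only: the states, the toy probabilities and the mapping of query keywords to observations are ours for illustration, and do not reproduce the paper's actual transition and emission estimates derived from triple-based co-occurrence):

```python
def viterbi(states, start_p, trans_p, emit_p, observations):
    """Return the most probable state sequence for an observation sequence.

    States play the role of candidate vocabulary words from the knowledge
    base; observations are the keywords of the input query.
    """
    # Best probability and path ending in each state after the first keyword.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        best = {
            s: max(
                (prob * trans_p[prev][s] * emit_p[s][obs], path + [s])
                for prev, (prob, path) in best.items()
            )
            for s in states
        }
    return max(best.values())[1]

# Toy model: rewrite the keywords 'wife' and 'inhabitants' to KB vocabulary.
states = ["spouse", "population"]
start = {"spouse": 0.5, "population": 0.5}
trans = {"spouse": {"spouse": 0.1, "population": 0.9},
         "population": {"spouse": 0.9, "population": 0.1}}
emit = {"spouse": {"wife": 0.8, "inhabitants": 0.1},
        "population": {"wife": 0.1, "inhabitants": 0.8}}
print(viterbi(states, start, trans, emit, ["wife", "inhabitants"]))
# -> ['spouse', 'population']
```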
Another extension is to develop our benchmark further by including more queries as well as more datasets.

11 http://ups.savba.sk/~marek

References

1. N. Aggarwal and P. Buitelaar. A system description of natural language query over DBpedia. In C. Unger, P. Cimiano, V. Lopez, E. Motta, P. Buitelaar, and R. Cyganiak, editors, Proceedings of Interacting with Linked Data (ILD 2012), workshop co-located with the 9th Extended Semantic Web Conference, May 28, 2012, Heraklion, Greece, pages 97–100, 2012.
2. I. Augenstein, A. L. Gentile, B. Norton, Z. Zhang, and F. Ciravegna. Mapping keywords to linked data resources for automatic query expansion. In The Semantic Web: Semantics and Big Data, 10th International Conference, ESWC 2013, Montpellier, France, May 26-30, 2013, Proceedings, Lecture Notes in Computer Science. Springer, 2013.
3. E. Cabrio, A. P. Aprosio, J. Cojan, B. Magnini, F. Gandon, and A. Lavelli. QAKiS @ QALD-2. In C. Unger, P. Cimiano, V. Lopez, E. Motta, P. Buitelaar, and R. Cyganiak, editors, Proceedings of Interacting with Linked Data (ILD 2012), workshop co-located with the 9th Extended Semantic Web Conference, May 28, 2012, Heraklion, Greece, pages 88–96, 2012.
4. C. Carpineto and G. Romano. A survey of automatic query expansion in information retrieval. ACM Comput. Surv., 44(1):1:1–1:50, January 2012.
5. C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.
6. J. E. L. Gayo, D. Kontokostas, and S. Auer. Multilingual linked data patterns. Semantic Web Journal, 2013.
7. O. Gunes, C. Schallhart, T. Furche, J. Lehmann, and A.-C. N. Ngomo. Eager: extending automatically gazetteers for entity recognition. In Proceedings of the 3rd Workshop on the People’s Web Meets NLP: Collaboratively Constructed Semantic Resources and their Applications to NLP, 2012.
8. J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5), 1999.
9. J. Lehmann, C. Bizer, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
10. M. E. Lesk. Word-word associations in document retrieval systems. American Documentation, 20(1):27–38, 1969.
11. S. Li, R. Xia, C. Zong, and C.-R. Huang. A framework of feature selection methods for text categorization. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL ’09, pages 692–700, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
12. V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: supporting users in querying and exploring the semantic web content. Semantic Web Journal, March 2011.
13. S. Shekarpour, S. Auer, A.-C. Ngonga Ngomo, D. Gerber, S. Hellmann, and C. Stadler. Keyword-driven SPARQL query generation leveraging background knowledge. In International Conference on Web Intelligence, 2011.
14. S. Shekarpour, K. Höffner, J. Lehmann, and S. Auer. Keyword query expansion on linked data using linguistic and semantic features. In 7th IEEE International Conference on Semantic Computing, September 16-18, 2013, Irvine, California, USA, 2013.
15. S. Shekarpour, A.-C. N. Ngomo, and S. Auer. Question answering on interlinked data. In D. Schwabe, V. A. F. Almeida, H. Glaser, R. A. Baeza-Yates, and S. B. Moon, editors, WWW, pages 1145–1156. International World Wide Web Conferences Steering Committee / ACM, 2013.
16. C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1989.
17. C. J. van Rijsbergen. The Geometry of Information Retrieval. Cambridge University Press, 2004.
18. A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, IT-13(2), 1967.
19. M. Wendt, M. Gerlach, and H. Duwiger. Linguistic modeling of linked open data for question answering. In C. Unger, P. Cimiano, V. Lopez, E. Motta, P. Buitelaar, and R. Cyganiak, editors, Proceedings of Interacting with Linked Data (ILD 2012), workshop co-located with the 9th Extended Semantic Web Conference, May 28, 2012, Heraklion, Greece, pages 75–87, 2012.
20. Z. Zhang, A. L. Gentile, I. Augenstein, E. Blomqvist, and F. Ciravegna. Mining equivalent relations from linked data. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), Sofia, Bulgaria, August 2013.