Document 172993

Question Answering

Jörg Tiedemann
[email protected]
Department of Linguistics and Philology
Uppsala University
Information Retrieval

- Search relevant documents
- Given a query (usually some keywords)
- Return documents that contain the requested information
- Clustering & classification can be used to organize the collection

Now, something completely different:
Question Answering!

"All right," said Deep Thought. "The Answer to the Great Question..."
"Yes..!"
"Of Life, the Universe and Everything..." said Deep Thought.
"Yes...!"
"Is..." said Deep Thought, and paused.
"Yes...!"
"Is..."
"Yes...!!!...?"
"Forty-two," said Deep Thought, with infinite majesty and calm.

Question Answering

What is the task?
- automatically find answers
- to a natural language question
- in pre-structured data (databases) and/or
- unstructured data (news texts, Wikipedia, ...)
Question Answering

Question Answering (QA) is answering a question posed in natural language and has to deal with a wide range of question types including: fact, list, definition, how, why, hypothetical, semantically constrained, and cross-lingual questions.

Motivation:
- retrieve specific information (general or domain-specific)
- do it in a more natural (human) way
- possibly integrate this in a dialogue system
  - follow-up questions (with coreferences)
  - interactive topic-specific dialogues
  - multi-modal, speech recognition & synthesis
→ Very ambitious!

Answering Questions with Text Snippets
Question Answering from Unstructured Data

- When did Boris Yeltsin die?
- When was the Ebola virus first encountered?
  "244 persons died of the Ebola virus, that was first found in Zaire in 1976"
- How did Jimi Hendrix die?
  "...and when on September 18, 1970, Jimi Hendrix died of an overdose, her reaction..."
- What is the capital of Russia?
  "The riders had a tour of Moscow this morning. Tomorrow morning they are leaving the Russian capital.."
Question Answering: Why is this difficult?

- What is the capital of Ireland?
  - "It is a common joke in Ireland to call Cork the 'real capital of Ireland'"
  - "In the middle ages Kilkenny was the capital of Ireland"
- When was the unification of Germany?
  - "Already in 1961 he predicted the unification of Germany."
- What is RSI?
  - Website with "Common misconceptions about RSI": "RSI is the same as a mouse arm."

Information Extraction

Information Extraction (IE) is extracting structured information from unstructured machine-readable documents by means of natural language processing (NLP).

Motivation:
- unstructured, diverse data collections are full of information
- extract & store world knowledge from those collections
  → for example, turn Wikipedia into a well-structured fact-DB
- structured collections (databases) for many purposes
  - searchable fact databases
  - question answering

Information Extraction

Sub-tasks:
- named entity recognition
- coreference resolution
- terminology extraction
- relationship extraction

Find patterns like (see the sketch below):
- PERSON works for ORGANIZATION
- PERSON lives in LOCATION

Information Extraction

- extract dates of people's deaths
  "...and when on September 18, 1970, Jimi Hendrix died of an overdose, her reaction..."
- extract capitals of countries in the world
  "The riders had a tour of Moscow this morning. Tomorrow morning they are leaving the Russian capital.."
- IE often requires heavy NLP and semantic inference
- Simple cases exist (e.g. infoboxes at Wikipedia)
- Precision is usually more important than Recall!
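The two relation patterns above can be illustrated with a very small sketch. The snippet below is not part of the original slides: it assumes a hypothetical input format in which a named entity recognizer has already marked entities as [PER ...], [ORG ...] and [LOC ...]; the sentences and tags are made up for illustration.

    import re

    # Hypothetical input format: entities are already marked as [PER ...],
    # [ORG ...], [LOC ...] (a real system would get these from a NE recognizer).
    TEXT = ("[PER John Smith] works for [ORG Acme Corp]. "
            "[PER Mary Jones] lives in [LOC Uppsala].")

    PATTERNS = [
        # PERSON works for ORGANIZATION
        (re.compile(r"\[PER ([^\]]+)\] works for \[ORG ([^\]]+)\]"), "works_for"),
        # PERSON lives in LOCATION
        (re.compile(r"\[PER ([^\]]+)\] lives in \[LOC ([^\]]+)\]"), "lives_in"),
    ]

    def extract_relations(text):
        """Return (relation, arg1, arg2) triples for every pattern match."""
        triples = []
        for pattern, relation in PATTERNS:
            for arg1, arg2 in pattern.findall(text):
                triples.append((relation, arg1, arg2))
        return triples

    print(extract_relations(TEXT))
    # [('works_for', 'John Smith', 'Acme Corp'), ('lives_in', 'Mary Jones', 'Uppsala')]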
QA Systems: Do IE on-line with arbitrary questions
Challenges

- Natural Language Understanding
- Precision and Efficiency!
- Natural Interface, Dialog, Domain Flexibility

Evaluation in QA

Evaluation with mean reciprocal ranks (considering the first 5 answers per question):

    MRR_QA = (1/N) * Σ_{i=1..N} 1 / rank_i(first correct answer)

- accuracy: compute the precision of the first answer
- often: distinguish between
  - correct
  - inexact
  - unsupported (correct answer but not in its context)
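As a concrete illustration of these measures, the following sketch computes MRR over the first 5 answers together with first-answer accuracy. The data format (one ranked answer list and one set of accepted answer strings per question) and the toy example are assumptions, not an official evaluation script.

    def mrr_and_accuracy(ranked_answers, gold_answers, max_rank=5):
        """ranked_answers: one list of answers per question (best first);
        gold_answers: one set of accepted answer strings per question."""
        reciprocal_ranks, first_correct = [], 0
        for answers, gold in zip(ranked_answers, gold_answers):
            rr = 0.0
            for rank, answer in enumerate(answers[:max_rank], start=1):
                if answer in gold:
                    rr = 1.0 / rank          # rank of the first correct answer
                    break
            reciprocal_ranks.append(rr)
            if answers and answers[0] in gold:
                first_correct += 1           # accuracy only counts rank 1
        n = len(gold_answers)
        return sum(reciprocal_ranks) / n, first_correct / n

    # Toy example: 3 questions, correct answers at ranks 1, 3, and not found.
    system = [["Moscow", "St Petersburg"], ["1976", "1995", "1976-09-01"], ["Cork"]]
    gold = [{"Moscow"}, {"1976-09-01"}, {"Dublin"}]
    print(mrr_and_accuracy(system, gold))    # (0.444..., 0.333...)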
Step 1: Question Analysis

What type of question?
- question type → predict the type of answer
- define a question typology
  - often quite fine-grained
  - location, date, capital, currency, founder, definition, born_date, abbreviation, ...
- even more fine-grained: types may take arguments
  - inhabitants(Sweden)
  - function(president, USA)
Step 1: Question Analysis

Patterns for question analysis
- hand-crafted patterns
  - very effective
  - easy to extend and to tune
  - easy to add new Q-types
- could use machine learning
  - requires annotated training data
  - difficult to define the features to be used

Step 1: Question Analysis

Use syntactic information (e.g. dependency relations):
- When was the Rome Treaty signed?
  ⟨when, wh, Verb⟩, ⟨Verb, su, Event⟩
  → event_date(Rome Treaty)
- In which city did the G7 take place?
  ⟨in, obj, Geotype⟩, ⟨Geotype, det, which⟩, ⟨in, wh, Verb⟩, ⟨Verb, su, Event⟩
  → location(G7, city)
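The triple patterns above can be matched with a few lines of code. The sketch below is only an illustration: it assumes the parser delivers (head, relation, dependent) triples and that slots in a pattern start with a capital letter; the example parse is hand-written.

    # Minimal sketch of pattern-based question analysis over dependency triples.

    def unify(slot_or_word, value, bindings):
        """Bind a slot to a value, or require a literal word to match."""
        if slot_or_word[0].isupper():                    # slot variable (Verb, Event, ...)
            if slot_or_word in bindings and bindings[slot_or_word] != value:
                return False
            bindings[slot_or_word] = value
            return True
        return slot_or_word == value                     # literal word

    def match(pattern, triples):
        """Greedily match a pattern against the question triples (no backtracking);
        return the slot bindings, or None if some pattern triple has no match."""
        bindings = {}
        for p_head, p_rel, p_dep in pattern:
            for head, rel, dep in triples:
                trial = dict(bindings)
                if rel == p_rel and unify(p_head, head, trial) and unify(p_dep, dep, trial):
                    bindings = trial
                    break
            else:
                return None
        return bindings

    # "When was the Rome Treaty signed?" → ⟨when, wh, signed⟩, ⟨signed, su, Rome Treaty⟩
    question = [("when", "wh", "signed"), ("signed", "su", "Rome Treaty")]
    pattern = [("when", "wh", "Verb"), ("Verb", "su", "Event")]

    bindings = match(pattern, question)
    if bindings is not None:
        print("event_date(%s)" % bindings["Event"])      # event_date(Rome Treaty)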
Next steps

Given a question type,
- search through relevant documents for sentences containing a phrase that is a potential answer
- rank potential answers & select the top n answers

Step 2: Passage Retrieval

- Information Retrieval: search relevant documents
- QA: need the text snippet that contains an answer!
- answer extraction is expensive (→ heavy NLP)

Passage retrieval is used as a filtering component:
- keywords and keyphrases from the question
- question analysis → create an appropriate query
- narrow down the search space
  → smaller segments (e.g. paragraphs)
  → only a few but relevant matches (including the answer)
Step 2: Evaluating Passage Retrieval

Mean reciprocal rank: the mean of the reciprocal rank of the first retrieved passage that contains a correct answer.

    MRR_IR = (1/N) * Σ_{i=1..N} 1 / rank_i(first relevant passage)

Coverage: the percentage of questions for which at least one passage is retrieved that contains a correct answer.

Redundancy: the average number of retrieved passages per question that contain a correct answer.
Evaluation Example: Retrieve 3 Paragraphs

Question: Wie is de leider van Bosnië? (Who is the leader of Bosnia?)
Accepted answers: Ratko Mladic, Radovan Karadzic

rank 1, IR-score: 12.754168
Croatia complies with Bosnia's request for support in the north-western enclave of Bihac, which is in danger of being taken by Serbs from Bosnia and Croatia, assisted by renegade militias of the Muslim leader Abdic.

rank 2, IR-score: 12.581567
The leaders of the Bosnian and Croatian Serbs yesterday appealed to the Serbian government to come to their aid in countering the Croatian offensive in western Bosnia and the Krajina. "The Serbs in Bosnia and Croatia are fighting with their hands tied and urgently need support from Serbia and Yugoslavia," according to the Bosnian-Serb leader Radovan Karadzic and the Croatian-Serb leader Milan Martic.

rank 3, IR-score: 12.418680
In an interview with the Greek newspaper Ta Nea, the Bosnian-Serb leader Radovan Karadzic said that he will not accept a new ultimatum from NATO. Karadzic wants to visit Greece soon to ask Athens to put the question of Bosnia on the agenda of the European Union. The Serbian leader said he was not happy with the American involvement in Bosnia, because the US threatens to take the matter out of the hands of the UN.

- coverage: 1
- redundancy: 2
- MRR_IR: 1/2 = 0.5
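For this example, the three measures can be computed from a per-question list of relevance flags, one flag per retrieved passage in rank order. The sketch below works under that assumption and reproduces the values of the Bosnia example; it is an illustration, not the evaluation code used in the experiments.

    def passage_retrieval_metrics(relevance):
        """relevance: one list per question, with True/False per retrieved passage
        (in rank order), marking whether the passage contains a correct answer."""
        n = len(relevance)
        coverage = sum(1 for flags in relevance if any(flags)) / n
        redundancy = sum(sum(flags) for flags in relevance) / n
        mrr = sum(1.0 / (flags.index(True) + 1) for flags in relevance if any(flags)) / n
        return coverage, redundancy, mrr

    # The Bosnia example: 3 paragraphs retrieved, the accepted answer
    # (Radovan Karadzic) appears in the passages at ranks 2 and 3.
    print(passage_retrieval_metrics([[False, True, True]]))   # (1.0, 2.0, 0.5)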
Coverage & Redundancy: Experiments with Joost

[Figure: IR coverage, QA MRR (in %), and IR redundancy plotted against the number of paragraphs retrieved]

→ strong correlation between coverage and QA MRR
→ coverage is more important than redundancy

Step 2: Two Types of Passage Retrieval

- index passages instead of documents
  - retrieve relevant passages for a given query
  - can use IR techniques
  - rank passages according to standard tf-idf or similar
  → index-time passaging (see the sketch below)
- two-step procedure:
  1. standard document retrieval
  2. segmentation + passage selection
  → search-time passaging
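A minimal sketch of index-time passaging: documents are split into paragraph-sized passages before indexing and ranked with plain tf-idf cosine similarity. The use of scikit-learn and the toy documents are assumptions made for illustration only.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = {
        "doc1": "Jimi Hendrix died of an overdose on September 18, 1970.\n\n"
                "He was buried in Renton, Washington.",
        "doc2": "The riders had a tour of Moscow this morning.\n\n"
                "Tomorrow they are leaving the Russian capital.",
    }

    # Index-time passaging: one index entry per paragraph, remembering its source.
    passages, sources = [], []
    for doc_id, text in documents.items():
        for paragraph in text.split("\n\n"):
            passages.append(paragraph)
            sources.append(doc_id)

    vectorizer = TfidfVectorizer()
    index = vectorizer.fit_transform(passages)

    def retrieve(query, top_n=2):
        """Rank the indexed passages by tf-idf cosine similarity to the query."""
        scores = cosine_similarity(vectorizer.transform([query]), index)[0]
        return sorted(zip(scores, sources, passages), reverse=True)[:top_n]

    for score, doc_id, passage in retrieve("How did Jimi Hendrix die?"):
        print(f"{score:.3f} [{doc_id}] {passage}")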
Index-time versus Search-time Passaging

[Figure: coverage and answer redundancy plotted against rank for Okapi and Approaches 1-4; dotted lines = index-time passaging. "Fig. 1. Results of passage retrieval experiments" (Roberts & Gaizauskas, 2004)]

→ retrieving smaller units is better than document retrieval
→ not much to gain with the two-step procedure

Step 2: Passage Retrieval: What is a Passage?

How much should we retrieve?

unit         #sent     cov     red    MRR IR   MRR QA   CLEF accuracy
sentences    16,737    0.784   2.95   0.490    0.487    0.430
paragraphs   80,046    0.842   4.17   0.565    0.483    0.416
documents    618,865   0.877   6.13   0.666    0.457    0.387

(CLEF experiments with Joost, retrieving 20 units per question)
Step 2: Passage Retrieval: Text Segmentation

Different document segmentation approaches:

segmentation             # sentences   MRR IR   MRR QA
sentences                16,737        0.490    0.487
paragraphs               80,046        0.565    0.483
TextTiling               107,879       0.586    ▲ 0.503
2 sentences              33,468        0.545    ▲ 0.506
3 sentences              50,190        0.554    0.504
4 sentences              66,800        0.581    ▲ 0.512
2 sentences (sliding)    29,095        0.548    ▲ 0.516
3 sentences (sliding)    36,415        0.549    0.484
4 sentences (sliding)    41,565        0.546    0.476
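The fixed and sliding n-sentence segmentations in the table can be produced with a small helper like the one below. It assumes the text is already split into sentences and is only a sketch of the idea, not the segmentation code used in the experiments.

    def windows(sentences, size, sliding=False):
        """Group sentences into n-sentence passages; sliding windows advance one
        sentence at a time and therefore overlap."""
        if sliding:
            starts = range(0, max(len(sentences) - size + 1, 1))
        else:
            starts = range(0, len(sentences), size)
        return [" ".join(sentences[i:i + size]) for i in starts]

    sents = ["S1.", "S2.", "S3.", "S4.", "S5."]
    print(windows(sents, 2))                # ['S1. S2.', 'S3. S4.', 'S5.']
    print(windows(sents, 2, sliding=True))  # ['S1. S2.', 'S2. S3.', 'S3. S4.', 'S4. S5.']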
Step 2: Improved Passage Retrieval

Augment passage retrieval with additional information from question analysis and from linguistic annotation:
- include an index of named entity labels → match the question type
- use syntactic information to include phrase queries → match dependency relation triples
- use weights to boost certain keyword types → named entities, nouns, entities in specific relations
- include linguistic annotation in a zone/tiered index
- query expansion → increase recall
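A hypothetical sketch of how such an augmented query could be built: keywords from question analysis are boosted by type, and the expected answer type is added as a constraint on a named-entity field. The field names, boost values and the Lucene-style term^boost syntax are illustrative assumptions, not the actual Joost configuration.

    # Boost values are made-up illustration values.
    BOOSTS = {"NE": 3.0, "NOUN": 2.0, "OTHER": 1.0}

    def build_query(keywords, question_type):
        """keywords: list of (token, category) pairs from question analysis."""
        terms = [f"text:{token}^{BOOSTS.get(category, 1.0)}"
                 for token, category in keywords]
        terms.append(f"netype:{question_type}")   # match the expected answer type
        return " ".join(terms)

    keywords = [("Rome", "NE"), ("Treaty", "NE"), ("signed", "OTHER")]
    print(build_query(keywords, "date"))
    # text:Rome^3.0 text:Treaty^3.0 text:signed^1.0 netype:date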
Step 2: Passage Retrieval: Summary

- use IR techniques, but
- ... focus on smaller units (sentences or passages)
- ... focus on coverage (but also ranking)
- ... include extra information (q-type, annotation)
- retrieve less
  → lower risk of wrong answer extraction
  → increased efficiency

Step 3: Answer Extraction & Ranking

Now we have
- the expected answer type (→ question type)
- relevant passages
- retrieval scores (ranking)
- a (syntactically) analysed question

→ Need to extract answer candidates
→ Rank candidates and return answers
Step 3a: Answer Extraction

Can use syntactic information again:
- Where did the meeting of the G7-countries take place?
  → location(meeting, nil)
- ".. after a three-day meeting of the G7-countries in Napels."
- "in Napels" is a potential answer for location(meeting, nil)
  - if the NE class = LOC
  - if the answer is syntactically related to the Event
  - if the modifiers of the Event in Q and A overlap
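The candidate test on the G7 example can be written down roughly as below; the dependency triples, relation labels and NE classes are hand-written assumptions for illustration.

    # Hand-written analysis of ".. after a three-day meeting of the
    # G7-countries in Napels."
    sentence_triples = [("meeting", "mod", "three-day"),
                        ("meeting", "mod", "G7-countries"),
                        ("meeting", "mod", "in Napels")]
    ne_classes = {"in Napels": "LOC"}

    def is_candidate(phrase, expected_class, event, triples, ne_classes):
        """Accept a phrase if its NE class matches the expected answer type and
        it is syntactically related to the event word of the question."""
        related = any(head == event and dep == phrase for head, rel, dep in triples)
        return ne_classes.get(phrase) == expected_class and related

    # location(meeting, nil): expected NE class LOC, event word "meeting"
    print(is_candidate("in Napels", "LOC", "meeting", sentence_triples, ne_classes))  # True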
Step 3b: Answer Ranking

Combine various knowledge sources: for example, the final score of an answer is a weighted sum of
- TypeScore
- syntactic similarity of the Q and A sentence
- overlap in names, nouns, and adjectives between Q and the A sentence and its preceding sentence
- IR score
- frequency of A

→ tune the combination weights (by experts or machine learning)
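A sketch of the weighted sum: each candidate answer gets one (normalised) value per knowledge source. The weights and feature values shown are made-up illustration numbers, not the tuned Joost weights.

    WEIGHTS = {"type_score": 3.0, "syntactic_similarity": 2.0,
               "name_overlap": 1.0, "ir_score": 0.5, "frequency": 1.5}

    def answer_score(features):
        """features: dict with one (normalised) value per knowledge source."""
        return sum(WEIGHTS[name] * value for name, value in features.items())

    candidates = {
        "in Napels": {"type_score": 1.0, "syntactic_similarity": 0.8,
                      "name_overlap": 0.6, "ir_score": 0.7, "frequency": 0.4},
        "in Moscow": {"type_score": 1.0, "syntactic_similarity": 0.2,
                      "name_overlap": 0.1, "ir_score": 0.9, "frequency": 0.6},
    }
    ranked = sorted(candidates, key=lambda a: answer_score(candidates[a]), reverse=True)
    print(ranked)   # ['in Napels', 'in Moscow']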
Step 2b: Match Fact Databases

- Answers for most questions are located in relevant paragraphs returned by IR
- For frequently asked question types, answers are searched off-line

Frequently Asked Question Types

- How many inhabitants does Location have?
- When was Person born?
- Who won the Nobel prize for literature in 1990?
- What does the abbreviation ADHD mean?
- What causes Frei syndrome?
- What are the symptoms of poisoning by mushrooms?
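A table look-up of this kind can be as simple as a dictionary keyed by question type and argument; the sketch below uses hypothetical table contents and hypothetical question-analysis output.

    # Fact table built off-line: (question type, argument) → list of
    # (answer, frequency, doc-id) entries.  Contents are hypothetical.
    fact_table = {
        ("inhabitants", "Sweden"): [("9.5 million", 12, "doc-4711")],
        ("born_date", "Astrid Lindgren"): [("14 November 1907", 8, "doc-0815")],
    }

    def lookup(question_type, argument):
        """Return (answer, frequency, doc-id) entries, most frequent first."""
        entries = fact_table.get((question_type, argument), [])
        return sorted(entries, key=lambda entry: entry[1], reverse=True)

    # "How many inhabitants does Sweden have?" → inhabitants(Sweden)
    print(lookup("inhabitants", "Sweden"))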
Extension: Cross-lingual QA

Be able to search information written in a different language!
- Who is working on computational linguistics in Uppsala?
- Where is the oldest university in Scandinavia?
→ Often the answers do not even need to be translated!

Cross-lingual QA with Joost
Question Answering vs. IR

Information Retrieval:
- input: keywords (+ boolean operators, ...)
- output: links to relevant documents (maybe snippets)
- techniques: vector-space model, bag-of-words, tf-idf

Question Answering:
- input: a natural language question
- output: a concrete answer (facts or text)
- techniques: shallow/deep NLP, passage retrieval (IR), information extraction

→ IR is just a component of question answering!
→ QA → more NLP → more fun (?!)

Summary of Terminology

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Information Extraction (IE) is extracting structured information from unstructured machine-readable documents by means of natural language processing (NLP).

Question Answering (QA) is answering a question posed in natural language and has to deal with a wide range of question types including: fact, list, definition, how, why, hypothetical, semantically constrained, and cross-lingual questions.
Coming Next

Access to historical documents
- Handwritten Text Recognition for Historical Documents
- Language Technology for Historical Documents

Seminars with student presentations
- Selected chapters from the book
- Language technology for Information Extraction

Off-line Answer Extraction

- Search the corpus exhaustively
- Using hand-crafted syntactic patterns
- Create a table of ⟨Question-Key, Answer, Frequency, Doc-Id⟩
Off-line Answer Extraction

Founder — Organization

Pattern: ⟨oprichten, su, Founder⟩, ⟨oprichten, obj, Organization⟩   ("oprichten" is Dutch for "to found")

- Minderop richtte de Tros op toen ....
  ("Minderop founded the Tros when ....")
- Op last van generaal De Gaulle in Londen richtte verzetsheld Jean Moulin in mei 1943 de Conseil National de la Résistance (CNR) op.
  ("On the orders of General De Gaulle in London, resistance hero Jean Moulin founded the Conseil National de la Résistance (CNR) in May 1943.")
- Het Algemeen Ouderen Verbond is op 1 december opgericht door de nu 75-jarige Martin Batenburg.
  ("The Algemeen Ouderen Verbond was founded on 1 December by the now 75-year-old Martin Batenburg.")
- .... toen de Generale Bank bekend maakte met de Belgische Post een "postbank" op te richten.
  ("... when the Generale Bank announced that it would set up a 'postbank' with the Belgian Post.")
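Putting the pattern to work off-line might look roughly like this: apply the ⟨oprichten, su, Founder⟩ / ⟨oprichten, obj, Organization⟩ pattern to parsed sentences and aggregate the extracted facts with their frequencies. The hand-written dependency triples and document ids below are assumptions for illustration, not parser output.

    from collections import Counter

    # Hand-written dependency triples per document; a real system would parse
    # the whole corpus and keep the document ids of every match.
    corpus = {
        "doc-1": [("oprichten", "su", "Minderop"), ("oprichten", "obj", "Tros")],
        "doc-2": [("oprichten", "su", "Jean Moulin"),
                  ("oprichten", "obj", "Conseil National de la Résistance")],
        "doc-3": [("oprichten", "su", "Minderop"), ("oprichten", "obj", "Tros")],
    }

    facts = Counter()
    for doc_id, triples in corpus.items():
        founders = [dep for head, rel, dep in triples
                    if head == "oprichten" and rel == "su"]
        organisations = [dep for head, rel, dep in triples
                         if head == "oprichten" and rel == "obj"]
        for founder in founders:
            for organisation in organisations:
                facts[(organisation, founder)] += 1   # who founded the organisation?

    # Rank answers per question key by frequency (→ good precision).
    for (organisation, founder), freq in facts.most_common():
        print(f"founder({organisation}) = {founder}   [frequency {freq}]")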
Off-line Answer Extraction

- rank by frequency → good precision
- lots of facts → very effective
- augment by anaphora resolution

Number of facts found for some types in Joost:

type        original   anaphora
age         17,038     20,119
born_date   1,941      2,034
born_loc    753        891
died_age    847        885
died_date   892        1,061
died_how    1,470      1,886
died_loc    642        646

Results: Table look-up vs. IR-based

Some results for Joost (TAL, 2006):

              Table look-up          IR-based               All
Data set      #Q    MRR     % ok     #Q    MRR     % ok     #Q    MRR     % ok
CLEF 2003     84    0.879   84.5     293   0.565   52.2     377   0.635   59.4
CLEF 2004     69    0.801   75.4     131   0.530   48.9     200   0.623   58.0
CLEF 2005     88    0.747   68.2     111   0.622   57.7     199   0.677   62.3