Question Answering
Jörg Tiedemann, [email protected]
Department of Linguistics and Philology, Uppsala University

Information Retrieval (so far):
- search relevant documents
- given a query (usually some keywords)
- return documents that contain the requested information
- clustering & classification can be used to organize the collection

Now, something completely different: Question Answering!

"All right," said Deep Thought. "The Answer to the Great Question..."
"Yes..!"
"Of Life, the Universe and Everything..." said Deep Thought.
"Yes...!"
"Is..." said Deep Thought, and paused.
"Yes...!"
"Is..."
"Yes...!!!...?"
"Forty-two," said Deep Thought, with infinite majesty and calm.

What is the task?
- automatically find answers
- to a natural language question
- in pre-structured data (databases) and/or
- unstructured data (news texts, Wikipedia, ...)

Question Answering
Question Answering (QA) is answering a question posed in natural language and has to deal with a wide range of question types including: fact, list, definition, how, why, hypothetical, semantically constrained, and cross-lingual questions.

Motivation:
- retrieve specific information (general or domain-specific)
- do it in a more natural (human) way
- possibly integrate this in a dialogue system:
  - follow-up questions (with coreferences)
  - interactive topic-specific dialogues
  - multi-modal, speech recognition & synthesis
→ Very ambitious!

Answering Questions with Text Snippets (QA from unstructured data)
- When did Boris Yeltsin die?
- When was the Ebola virus first encountered?
  "244 persons died of the Ebola virus, that was first found in Zaire in 1976"
- How did Jimi Hendrix die?
  "...and when on September 18, 1970, Jimi Hendrix died of an overdose, her reaction..."
- What is the capital of Russia?
  "The riders had a tour of Moscow this morning. Tomorrow morning they are leaving the Russian capital."

Question Answering: Why is this difficult?
- When was the unification of Germany?
  "Already in 1961 he predicted the unification of Germany."
- What is the capital of Ireland?
  "It is a common joke in Ireland to call Cork the 'real capital of Ireland'"
  "In the middle ages Kilkenny was the capital of Ireland"
- What is RSI?
  Website with "Common misconceptions about RSI": "... RSI is the same as a mouse arm."
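The snippet examples above already hint at a naive recipe: guess the expected answer type from the question word and look for a matching expression in a snippet that shares keywords with the question. A minimal sketch of that recipe for date questions follows; the question, the date pattern and the helper name are illustrative assumptions, and the Ireland and Moscow examples show exactly where such surface matching breaks down (joke readings, coreference).

```python
import re

# crude date pattern: "September 18, 1970" or a bare year
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b|\b\d{4}\b"
)

def naive_when_answer(question, snippets):
    """Return a date from the snippet sharing the most keywords with the question."""
    keywords = {w for w in re.findall(r"\w+", question.lower()) if len(w) > 3}
    best, best_overlap = None, 0
    for snippet in snippets:
        overlap = len(keywords & set(re.findall(r"\w+", snippet.lower())))
        match = DATE_RE.search(snippet)
        if match and overlap > best_overlap:
            best, best_overlap = match.group(), overlap
    return best

snippets = [
    "...and when on September 18, 1970, Jimi Hendrix died of an overdose, her reaction...",
    "244 persons died of the Ebola virus, that was first found in Zaire in 1976",
]
print(naive_when_answer("When did Jimi Hendrix die?", snippets))  # September 18, 1970
```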
Information Extraction

Motivation:
- unstructured, diverse data collections are full of information
- extract & store world knowledge from those collections in structured collections (databases) for many purposes
- for example, turn Wikipedia into a well-structured fact-DB
→ searchable fact databases
→ question answering

Information Extraction (IE) is extracting structured information from unstructured machine-readable documents by means of natural language processing (NLP).

Sub-tasks:
- named entity recognition
- coreference resolution
- terminology extraction
- relationship extraction

Find patterns like:
- PERSON works for ORGANIZATION
- PERSON lives in LOCATION

IE often requires heavy NLP and semantic inference:
- extract dates of people's death
  "...and when on September 18, 1970, Jimi Hendrix died of an overdose, her reaction..."
- extract capitals of countries in the world
  "The riders had a tour of Moscow this morning. Tomorrow morning they are leaving the Russian capital."
Simple cases exist (e.g. infoboxes at Wikipedia).
Precision is usually more important than recall!

QA Systems: Do IE on-line with arbitrary questions

Challenges:
- precision and efficiency!
- natural interface, dialogue, domain flexibility
- natural language understanding

Evaluation in QA
Evaluation with mean reciprocal ranks (considering the first 5 answers per question):
MRR_QA = (1/N) * Σ_i 1/rank_i(first correct answer)
where N is the number of questions and rank_i is the rank of the first correct answer returned for question i.
Accuracy: compute the precision of the first answer.
Often: distinguish between
- correct
- inexact
- unsupported (correct answer, but not supported by its context)

Step 1: Question Analysis
What type of question?
- question type → predict the type of answer
- define a question typology, often quite fine-grained:
  location, date, capital, currency, founder, definition, born_date, abbreviation, ...
- even more fine-grained: types may take arguments:
  inhabitants(Sweden)
  function(president, USA)

Patterns for question analysis:
- could use machine learning
  - requires annotated training data
  - difficult to define the features to be used
- hand-crafted patterns
  - very effective
  - easy to extend and to tune
  - easy to add new question types

Use syntactic information (e.g. dependency relations):
- When was the Rome Treaty signed?
  ⟨when, wh, Verb⟩, ⟨Verb, su, Event⟩ → event_date(Rome Treaty)
- In which city did the G7 take place?
  ⟨in, obj, Geotype⟩, ⟨Geotype, det, which⟩, ⟨in, wh, Verb⟩, ⟨Verb, su, Event⟩ → location(G7, city)
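Hand-crafted question analysis is easy to prototype. The sketch below uses surface regular expressions instead of the dependency triples shown above; the pattern set and type names are illustrative assumptions, not the rules of any actual system. Real systems match patterns against the parsed question, which makes them robust to word-order variation that surface regexes cannot handle.

```python
import re

# (pattern, question type) pairs; a real system would match dependency triples,
# e.g. <when, wh, Verb>, <Verb, su, Event>, rather than surface strings.
PATTERNS = [
    (re.compile(r"^when was (?P<event>.+?) (signed|founded|born)\??$", re.I), "event_date"),
    (re.compile(r"^in which (?P<type>city|country) did (?P<event>.+?) take place\??$", re.I), "location"),
    (re.compile(r"^how many inhabitants does (?P<loc>.+?) have\??$", re.I), "inhabitants"),
]

def analyse(question):
    """Return (question_type, arguments), or a fallback type if nothing matches."""
    for pattern, qtype in PATTERNS:
        m = pattern.match(question.strip())
        if m:
            return qtype, m.groupdict()
    return "unknown", {}

print(analyse("When was the Rome Treaty signed?"))
# ('event_date', {'event': 'the Rome Treaty'})
print(analyse("In which city did the G7 take place?"))
# ('location', {'type': 'city', 'event': 'the G7'})
```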
Next Steps
Given a question type:
- search through the relevant documents for sentences containing a phrase that is a potential answer
- rank the potential answers & select the top n answers

Step 2: Passage Retrieval
- Information retrieval: search relevant documents
- QA: we need the text snippet that contains an answer!
Passage retrieval is used as a filtering component:
- keywords and key phrases from the question
- question analysis → create an appropriate query
- answer extraction is expensive (→ heavy NLP)
- narrow down the search space
  → smaller segments (e.g. paragraphs)
  → only a few, but relevant, matches (including the answer)

Step 2: Evaluating Passage Retrieval
Mean reciprocal rank: the mean of the reciprocal rank of the first retrieved passage that contains a correct answer:
MRR_IR = (1/N) * Σ_i 1/rank_i(first relevant passage)
Coverage: the percentage of questions for which at least one passage is retrieved that contains a correct answer.
Redundancy: the average number of passages retrieved per question that contain a correct answer.

Evaluation Example: Retrieve 3 Paragraphs
Question (Dutch): "Wie is de leider van Bosnië?" — "Who is the leader of Bosnia?"
Accepted answers: Ratko Mladic, Radovan Karadzic
- Rank 1 (IR score 12.75): "Croatia is complying with Bosnia's request for support in the north-western enclave of Bihac, which is in danger of being taken by Serbs from Bosnia and Croatia, assisted by renegade militias of Muslim leader Abdic." — no accepted answer
- Rank 2 (IR score 12.58): "The leaders of the Bosnian and Croatian Serbs yesterday appealed to the Serbian government for help in countering the Croatian offensive in western Bosnia and the Krajina. 'The Serbs in Bosnia and Croatia are fighting with their hands tied and urgently need support from Serbia and Yugoslavia,' said Bosnian-Serb leader Radovan Karadzic and Croatian-Serb leader Milan Martic." — contains an accepted answer
- Rank 3 (IR score 12.42): "In an interview with the Greek newspaper Ta Nea, Bosnian-Serb leader Radovan Karadzic said he will not accept a new NATO ultimatum. Karadzic wants to visit Greece soon to ask Athens to put the question of Bosnia on the European Union's agenda. The Serbian leader said he was unhappy with the American involvement in Bosnia because the US threatens to take the matter out of the hands of the UN." — contains an accepted answer
→ coverage: 1, redundancy: 2, MRR_IR: 1/2 = 0.5

Coverage & Redundancy: Experiments with Joost
[Plot: IR coverage, QA MRR (in %) and IR redundancy against the number of paragraphs retrieved]
→ strong correlation between coverage and QA MRR
→ coverage is more important than redundancy

Step 2: Two Types of Passage Retrieval
Retrieve relevant passages for a given query; can use IR techniques.
- index passages instead of documents; rank passages according to standard tf-idf or similar
  → index-time passaging
- two-step procedure: 1. standard document retrieval, 2. segmentation + passage selection
  → search-time passaging
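Given, for each question, the ranks of the retrieved passages that contain a correct answer, the three metrics defined above take only a few lines of code. A minimal sketch, where the input format (one list of answer-bearing ranks per question) is an assumption made for illustration:

```python
def evaluate_passage_retrieval(correct_ranks_per_question):
    """Each element lists the 1-based ranks of retrieved passages that contain
    a correct answer for one question; an empty list means none was retrieved."""
    n = len(correct_ranks_per_question)
    coverage = sum(1 for ranks in correct_ranks_per_question if ranks) / n
    redundancy = sum(len(ranks) for ranks in correct_ranks_per_question) / n
    mrr = sum(1 / min(ranks) for ranks in correct_ranks_per_question if ranks) / n
    return coverage, redundancy, mrr

# The Bosnia example above: 3 paragraphs retrieved, accepted answers at ranks 2 and 3.
print(evaluate_passage_retrieval([[2, 3]]))  # (1.0, 2.0, 0.5)
```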
Index-time versus Search-time Passaging
[Fig. 1 from Roberts & Gaizauskas (2004), "Results of passage retrieval experiments": % coverage and answer redundancy against rank for Okapi document retrieval and passage-retrieval Approaches 1-4; dotted lines = index-time passaging]
→ retrieving smaller units is better than document retrieval
→ not much to gain with the two-step procedure

Step 2: Passage Retrieval: What is a Passage?
How much should we retrieve? (CLEF experiments with Joost, retrieving 20 units per question)

unit          # sentences   coverage   redundancy   MRR_IR   MRR_QA   CLEF accuracy
sentences          16,737      0.784         2.95    0.490    0.487           0.430
paragraphs         80,046      0.842         4.17    0.565    0.483           0.416
documents         618,865      0.877         6.13    0.666    0.457           0.387

Step 2: Passage Retrieval: Text Segmentation
Different document segmentation approaches:

segmentation            # sentences   MRR_IR   MRR_QA
sentences                    16,737    0.490    0.487
paragraphs                   80,046    0.565    0.483
TextTiling                  107,879    0.586    0.503
2 sentences                  33,468    0.545    0.506
3 sentences                  50,190    0.554    0.504
4 sentences                  66,800    0.581    0.512
2 sentences (sliding)        29,095    0.548    0.516
3 sentences (sliding)        36,415    0.549    0.484
4 sentences (sliding)        41,565    0.546    0.476

Step 2: Improved Passage Retrieval
Augment passage retrieval with additional information from question analysis and from linguistic annotation:
- include an index of named-entity labels → match the question type
- use syntactic information to include phrase queries → match dependency relation triples
- use weights to boost certain keyword types → named entities, nouns, entities in specific relations
→ include linguistic annotation in a zone/tiered index
- query expansion → increase recall

Step 2: Passage Retrieval: Summary
- use IR techniques, but
- ... focus on smaller units (sentences or passages)
- ... focus on coverage (but also ranking)
- ... include extra information (question type, annotation)
- retrieve less → lower risk of wrong answer extraction → increased efficiency

Step 3: Answer Extraction & Ranking
Now we have:
- the expected answer type (→ question type)
- relevant passages
- retrieval scores (ranking)
- a (syntactically) analysed question
→ Need to extract answer candidates
→ Rank candidates and return answers

Step 3a: Answer Extraction
Can use syntactic information again:
- Where did the meeting of the G7-countries take place? → location(meeting, nil)
- ".. after a three-day meeting of the G7-countries in Napels."
- "in Napels" is a potential answer for location(meeting, nil):
  - if the NE class = LOC
  - if the answer is syntactically related to the Event
  - if the modifiers of the Event in Q and A overlap

Step 3b: Answer Ranking
Combine various knowledge sources: the final score of an answer is a weighted sum of, for example,
- TypeScore
- syntactic similarity of the Q and A sentences
- overlap in names, nouns, adjectives between Q and the A sentence (and the preceding sentence)
- IR score
- frequency of A
→ tune the combination weights (experts or machine learning)
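The ranking step above is a plain weighted sum over per-candidate scores. A minimal sketch with made-up weights and feature values; the feature names follow the slide, but all numbers are illustrative assumptions, since the slides only name the knowledge sources:

```python
# Illustrative weights; in practice they are tuned by experts or by machine learning.
WEIGHTS = {
    "type_score": 3.0,     # does the candidate match the expected answer type?
    "syntactic_sim": 2.0,  # similarity of the question and answer-sentence analyses
    "overlap": 1.0,        # shared names/nouns/adjectives with the question
    "ir_score": 0.5,       # retrieval score of the passage
    "frequency": 1.5,      # how often the candidate occurs among all candidates
}

def score(features):
    """Weighted sum of the knowledge sources for one answer candidate."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

candidates = {
    "in Napels": {"type_score": 1.0, "syntactic_sim": 0.8, "overlap": 0.6,
                  "ir_score": 0.7, "frequency": 0.4},
    "three-day": {"type_score": 0.0, "syntactic_sim": 0.5, "overlap": 0.2,
                  "ir_score": 0.7, "frequency": 0.1},
}
ranked = sorted(candidates, key=lambda c: score(candidates[c]), reverse=True)
print(ranked)  # ['in Napels', 'three-day']
```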
Step 2b: Match Fact Databases
- Answers for most questions are located in relevant paragraphs returned by IR.
- For frequently asked question types, answers are searched off-line.

Frequently Asked Question Types
- How many inhabitants does Location have?
- When was Person born?
- Who won the Nobel Prize for literature in 1990?
- What does the abbreviation ADHD mean?
- What causes Frei syndrome?
- What are the symptoms of poisoning by mushrooms?

Extension: Cross-lingual QA (with Joost)
Be able to search in information written in a different language!
- Who is working on computational linguistics in Uppsala?
- Where is the oldest university in Scandinavia?
→ Often we don't need to translate the answers!

Question Answering vs. IR
Information Retrieval:
- input: keywords (+ boolean operators, ...)
- output: links to relevant documents (maybe snippets)
- techniques: vector-space model, bag-of-words, tf-idf
Question Answering:
- input: a natural language question
- output: a concrete answer (facts or text)
- techniques: shallow/deep NLP, passage retrieval (IR), information extraction
→ IR is just a component of question answering!
→ QA → more NLP → more fun (?!)

Summary of Terminology
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Information extraction (IE) is extracting structured information from unstructured machine-readable documents by means of natural language processing (NLP).
Question answering (QA) is answering a question posed in natural language and has to deal with a wide range of question types including: fact, list, definition, how, why, hypothetical, semantically constrained, and cross-lingual questions.
Coming Next
- Access to historical documents
- Handwritten Text Recognition for Historical Documents
- Language Technology for Historical Documents
- Seminars with student presentations
- Selected chapters from the book
- Language technology for Information Extraction

Off-line Answer Extraction
- Search the corpus exhaustively
- Using hand-crafted syntactic patterns
- Create a table of ⟨Question-Key, Answer, Frequency, Doc-Id⟩

Founder — Organization
Pattern: ⟨oprichten, su, Founder⟩, ⟨oprichten, obj, Organization⟩ (Dutch "oprichten" = "to found")
- "Minderop richtte de Tros op toen ...." (Minderop founded the Tros when ...)
- "Op last van generaal De Gaulle in Londen richtte verzetsheld Jean Moulin in mei 1943 de Conseil National de la Résistance (CNR) op." (On the orders of General De Gaulle in London, resistance hero Jean Moulin founded the Conseil National de la Résistance (CNR) in May 1943.)
- "Het Algemeen Ouderen Verbond is op 1 december opgericht door de nu 75-jarige Martin Batenburg." (The Algemeen Ouderen Verbond was founded on 1 December by Martin Batenburg, now 75 years old.)
- "... toen de Generale Bank bekend maakte met de Belgische Post een 'postbank' op te richten." (... when the Generale Bank announced that it would set up a 'postbank' together with the Belgian Post.)

Off-line Answer Extraction
- lots of facts → very effective
- rank by frequency → good precision
- augment by anaphora resolution
Number of facts found for some types in Joost:

fact type    original   + anaphora resolution
age            17,038                  20,119
born_date       1,941                   2,034
born_loc          753                     891
died_age          847                     885
died_date         892                   1,061
died_how        1,470                   1,886
died_loc          642                     646

Results: Table Look-up vs. IR-based
Some results for Joost (TAL, 2006):

             Table look-up           IR-based                All
Data set     #Q    MRR   % ok    #Q    MRR   % ok    #Q    MRR   % ok
CLEF 2003    84  0.879   84.5   293  0.565   52.2   377  0.635   59.4
CLEF 2004    69  0.801   75.4   131  0.530   48.9   200  0.623   58.0
CLEF 2005    88  0.747   68.2   111  0.622   57.7   199  0.677   62.3
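The table look-up results above rely on a fact table of the kind described on the off-line extraction slides: pattern-extracted facts stored as ⟨Question-Key, Answer, Frequency, Doc-Id⟩ and ranked by frequency. A minimal sketch of building and querying such a table; the key format and example facts are illustrative assumptions, not the actual Joost data:

```python
from collections import defaultdict

# facts extracted off-line, e.g. by matching <oprichten, su, Founder>, <oprichten, obj, Org>:
# (question_key, answer, doc_id)
extracted = [
    (("founder", "Tros"), "Minderop", "doc-017"),
    (("founder", "CNR"), "Jean Moulin", "doc-101"),
    (("founder", "CNR"), "Jean Moulin", "doc-233"),
    (("founder", "CNR"), "De Gaulle", "doc-408"),   # noisy extraction
]

# build the <Question-Key, Answer, Frequency, Doc-Ids> table
table = defaultdict(lambda: defaultdict(lambda: {"freq": 0, "docs": []}))
for key, answer, doc in extracted:
    entry = table[key][answer]
    entry["freq"] += 1
    entry["docs"].append(doc)

def lookup(key):
    """Rank the candidate answers for a question key by frequency (good precision)."""
    candidates = table.get(key, {})
    return sorted(candidates.items(), key=lambda kv: kv[1]["freq"], reverse=True)

print(lookup(("founder", "CNR")))
# [('Jean Moulin', {'freq': 2, 'docs': ['doc-101', 'doc-233']}),
#  ('De Gaulle', {'freq': 1, 'docs': ['doc-408']})]
```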