Automatically Finding Answers to "Why" and "How to" Questions for Arabic Language

Ziad Salem, Jawad Sadek, Fairouz Chakkour, and Nadia Haskkour*

Aleppo University, Electrical and Electronic Engineering Faculty, Computer Engineering Department, *Faculty of Arts and Humanities, Aleppo, Syria

Abstract. This paper addresses the task of extracting answers to why and how to-questions from Arabic texts, a task that has not yet been addressed for the Arabic language in the field of question answering (QA) systems. The system developed here uses Rhetorical Structure Theory (RST), one of the leading theories in computational linguistics, and relies on cue phrases to determine both the elementary units and the set of rhetorical relations relevant to the targeted questions. Our experiment was conducted on raw Arabic texts (automatically annotated) taken from Arabic websites, and it gave a good result compared with an earlier experiment on why-question answering for English.

Keywords: Rhetorical Structure Theory, Natural Language Processing, Question Answering for Arabic, why and how to questions, Discourse analysis.

1 Introduction

The amount of information available on the internet grows day by day, and it is becoming more and more difficult to find answers on the WWW using standard search engines. As a consequence, question answering (QA) systems will become increasingly important. The main aim of a QA system is to give the user flexible access to information: the user writes a question in natural language and receives a short answer rather than a list of possibly relevant documents that contain the answer. Arabic is the sixth most widely spoken language in the world [1], yet relatively few studies aim to improve Arabic information search and retrieval compared with other languages, and this is also true for the QA task.
Nevertheless, a few researchers have built QA systems oriented to the Arabic language. These systems focused on factoid questions such as who, what, where and when questions [2][3], for which named entity recognition can make a substantial contribution to identifying potential answers in a source document. None of those systems addressed why and how to-questions, for which different techniques are needed. The present research aims at developing a system for answering why and how to-questions for the Arabic language, including a proper evaluation method, as a first attempt to address this type of question. The system uses RST, which has been applied in a large number of computational linguistics applications.

R. Setchi et al. (Eds.): KES 2010, Part IV, LNAI 6279, pp. 586–593, 2010. © Springer-Verlag Berlin Heidelberg 2010

2 Rhetorical Structure Theory

Rhetorical Structure Theory was developed at USC (University of Southern California) by William Mann and Sandra Thompson. The aim was to find a theory of discourse structure or function that provides enough detail to guide a computer program in generating texts. Based on their observation of edited text from a wide variety of sources, Mann and Thompson made several assumptions about how written text functions and how it involves words, phrases and grammatical structure. These assumptions can be summarized as follows [4]:

● Organization: Texts consist of functionally significant parts.
● Unity and coherence: There must be a sense of unity to which every part contributes.
● Hierarchy: Elementary parts of a text are composed into larger parts, which in turn are composed into yet larger parts, up to the scale of the text as a whole.
● Relation Composition: Relations hold between parts of a text. Every part of a text has a role, a function to play, with respect to the other parts of the text.
A small finite set of highly recurrent relations holding between pairs of parts of text is used to link parts together to form larger parts. All rhetorical relations that can possibly occur in a text can be categorized into a finite set of relation types.
● Asymmetry of Relations: RST establishes two different types of units. Nuclei are the most important parts of a text, whereas satellites contribute to the nuclei and are secondary. The most common type of text-structuring relation is an asymmetric class called nucleus-satellite relations: the nucleus is considered the basic information and is more essential to the writer's purpose than the satellite. The satellite contains additional information about the nucleus and is often incomprehensible without it, whereas a text from which the satellites have been deleted can still be understood to a certain extent.

Table 1 illustrates some of the relations identified by Mann and Thompson.

Table 1. Some of the relations used in RST

Relation name | Nucleus | Satellite
--------------|---------|----------
Background | Text whose understanding is being facilitated | Text for facilitating understanding
Elaboration | Basic information | Additional information
Antithesis | Ideas favored by the author | Ideas disfavored by the author
Enablement | An action | Information intended to aid the reader in performing an action
Evaluation | A situation | An evaluative comment about the situation

Years of text analysis using RST have shown that RST is useful for capturing the underlying structure of texts. Furthermore, RST has proven to be adequate in computational implementations, in the automatic analysis of texts and in the generation of coherent text [5].

3 Using Rhetorical Relations for Question Answering

Some types of rhetorical relations are relevant to why and how to-questions and can help in finding answers to those questions.
Let us consider the following two examples, taken from Arabic websites, which clarify the method used to extract answers.

3.1 Example 1

[قالت دراسة نشرت في صحيفة بريتش ميديكال إن الشاي الأسود الذي تم إعداده عند درجة حرارة تزيد عن 70 درجة مئوية يزيد من خطر الإصابة بالسرطان،]1 [وإن ذلك يفسر ارتفاع الإصابة بسرطان المري بين بعض الشعوب الغير غربية.]2

[A study published in the British Medical Journal found that black tea prepared at a temperature above 70 °C can raise the risk of cancer,]1 [and that this may explain the high rates of esophageal cancer among some non-western peoples.]2

In this example, unit 1 gives information about the cause of the problem presented in unit 2, so we can say that an Interpretation relation holds between the two units, as illustrated in Fig. 1.

[Figure: RST schema in which a relation over the span 1-2 links unit 1 and unit 2]
Fig. 1. The schema of the Arabic text in Example 1

Now consider the following question:

{لماذا تعد الإصابة بسرطان المري مرتفعة بين الشعوب الغير غربية؟}
{Why does esophageal cancer have high rates among non-western peoples?}

The question corresponds to unit 2, so the other part of the relation, unit 1, is the answer to the question.

3.2 Example 2

[يمكنك أن تصل إلى حالة من الاسترخاء العميق]1 [من خلال تنفس متعادل، يكون فيه كل شهيق وكل زفير متساويين في الطول ويساوي كل منهما الآخر في الطول، أغمض عينيك واستنشق وأنت ..........]2

[You can reach a state of deep relaxation]1 [through equal breathing, where each inhalation and exhalation are long and of equal length. Close your eyes and inhale while…...]2

In this example, unit 2 explains how to achieve the notion mentioned in unit 1, so we can say that an Explanation relation holds between the two units, as illustrated in Fig. 2.
Given the following question:

{كيف يمكن الوصول إلى الاسترخاء العميق؟}
{How to reach a state of deep relaxation?}

The question corresponds to unit 1, so we can consider the other part of the relation, unit 2, as the answer to the question.

[Figure: RST schema in which a relation over the span 1-2 links unit 1 and unit 2]
Fig. 2. The schema of the Arabic text in Example 2

We analyzed Arabic texts in order to extract a set of rhetorical relations that can lead to answers to why and how to-questions. Al-Sanie et al. [6] identified eleven rhetorical relations and applied them in an Arabic text summarization system. We chose four rhetorical relations from that work (Interpretation, Base, Result, Antithesis) and added four others (Causal, Evidence, Explanation, Purpose), obtaining the set of relations and corresponding question types shown in Table 2.

In order to derive the text structure automatically, one first needs to determine the elementary units of a text and then find the rhetorical relations that hold between these units. Marcu [7] relied on cue phrases to perform these two steps, treating them as sufficiently accurate indicators of the boundaries between elementary textual units and of the rhetorical relations that hold between them. We use the same method in the present work. Cue phrases are words and phrases that writers use as cohesive ties between adjacent clauses and sentences, and they are crucial to the reader's understanding of the text.

By analyzing an Arabic corpus and studying the way Arabic writers convey their thoughts to the reader [8][9], we generated a set of cue phrases that signal each relation shown in Table 2. For example, the relation Explanation can be hypothesized on the basis of the occurrence of the cue phrases ("من خلال"، "عن طريق"، "بواسطة"، ......). Likewise, ("وأشار"، "أكد"، "وقال"، ....) can signal an Evidence relation.

Table 2. The set of Arabic rhetorical relations used to answer Arabic why and how to-questions

اسم العلاقة | English equivalent | نوع السؤال | Question type
-----------|--------------------|-----------|--------------
تفسير | Interpretation | لماذا – كيف | Why - how to
سببية | Causal | لماذا | Why
نتيجة | Result | لماذا | Why
قاعدة | Base | لماذا | Why
استدراك | Antithesis | لماذا – كيف | Why - how to
غاية | Purpose | لماذا – كيف | Why - how to
شرح | Explanation | كيف | How to
اثبات | Evidence | لماذا | Why

4 Textual Units and Question Processing

Before starting the answer retrieval task, we need to process and tokenize both the question and the text in which the answer may be found. This involves the following steps:

● Normalization: Certain combinations of characters can be written in different ways in Arabic. For instance, glyphs that combine HAMZA or MADDA with ALEF (أ، إ، آ) are sometimes written as a plain ALEF (ا), and the letter TAA MARBUTA (ة) is sometimes changed to HAA (ه) at the end of a word. This makes some Arabic words difficult to recognize, so we normalize all orthographic variations.
● Stemming: Arabic, like all Semitic languages, is highly inflected and has a very complex morphology; a given headword can be found in a huge number of different forms. This abundance of forms increases the likelihood of a mismatch between the form of a word in a question and the forms found in the text relevant to the question. Stemming is therefore a basic step in this context, and many research studies have attempted to develop Arabic stemmers. In our system we use Larkey's light stemmer [10] when the word is a noun and Khoja's root-based stemmer [11] when it is a verb, a combination proposed as more efficient by Al-Shammari [12].
● Stop words removal: Because there is no standardized list of Arabic stop words, we dropped 300 high-frequency common words, chosen on the basis of Arabic literature and excluding the cue phrase list. These words give no benefit to the matching results, and removing them may save space and speed up searching.
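The normalization and stop-word steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the system: the function names, the four-word stop list and the `protected` parameter (standing in for the cue phrase list) are hypothetical, and stemming is left out because it depends on the external stemmers [10][11].

```python
import re

# ALEF with MADDA / HAMZA above / HAMZA below -> plain ALEF.
ALEF_VARIANTS = str.maketrans({"\u0622": "\u0627", "\u0623": "\u0627", "\u0625": "\u0627"})
TAA_MARBUTA, HAA = "\u0629", "\u0647"


def normalize_token(token):
    """Collapse the orthographic variants named in the text for a single word."""
    token = token.translate(ALEF_VARIANTS)
    if token.endswith(TAA_MARBUTA):
        token = token[:-1] + HAA  # word-final TAA MARBUTA -> HAA
    return token


# Tiny illustrative stop list (assumption); the 300-word list is not reproduced here.
STOP_WORDS = frozenset(normalize_token(w) for w in ("في", "على", "هذا", "التي"))


def preprocess(text, stop_words=STOP_WORDS, protected=frozenset()):
    """Tokenize, normalize, and drop stop words unless they are protected.

    `protected` is a hypothetical hook for words that belong to cue phrases.
    The tokenizer assumes undiacritized text.
    """
    tokens = [normalize_token(t) for t in re.findall(r"\w+", text)]
    return [t for t in tokens if t not in stop_words or t in protected]
```

Applying the same `preprocess` function to both the question and the textual units keeps their vocabularies comparable before any similarity is computed.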
We compute the similarity between the question and the textual units by applying the Vector Space Model, and we rank the textual units in descending order of their similarity values using the formula below:

\[
\mathrm{Sim}(Q, U_i) = \mathrm{Cosine}(Q, U_i) = \frac{\sum_{j} W_{Q,j}\, W_{i,j}}{\sqrt{\sum_{j} W_{Q,j}^{2}}\;\sqrt{\sum_{j} W_{i,j}^{2}}} \tag{1}
\]

where W_{Q,j} and W_{i,j} are the weights of the jth keyword in the question Q and in the textual unit U_i, respectively.

The algorithm presented in Fig. 3 takes as input a sequence of textual units belonging to a text and a question related to the text, and returns a set of ranked answers.

Input: a question Q, a sequence U[n] of textual units, and a list RR of relations that hold among the units in U.
Output: a set A of candidate answers.

A := null;
Identify the type of Q;
Identify the set of relations rr in RR corresponding to the type of Q;
Match Q against the textual units U[n];
for each match Ui
    if (Ui has a relation rri of one of the types in rr)
        sp := related span of rri;
        A := A ∪ {sp};
    else
        discard the current Ui;
    end if
end for
Rank the answers;

Fig. 3. Algorithm that selects answers for a given question

5 Experiments and Results

We implemented our system in the Java programming language. To measure its performance, we followed the experimental setup of Verberne et al. [13]. We selected a number of texts of 150-350 words each, extracted from Arabic news websites. We then distributed those texts to 15 people from different disciplines and asked them to read some of the texts and to formulate why and how to-questions whose answers could be found in the text. The subjects were also asked to formulate an answer to each of their questions. This resulted in a set of 98 why and how to-question and answer pairs.
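Before turning to the results, the ranking of Eq. (1) and the selection procedure of Fig. 3 can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the Java system itself: raw term counts stand in for the keyword weights (the weighting scheme is not specified), a "match" is taken to be any unit with non-zero similarity, relations are given as hypothetical `(type, i, j)` triples over unit indices, and the question-type mapping is transcribed from Table 2.

```python
import math
from collections import Counter


def cosine(q_tokens, u_tokens):
    """Eq. (1), using raw term counts as weights (an assumption)."""
    wq, wu = Counter(q_tokens), Counter(u_tokens)
    dot = sum(wq[t] * wu[t] for t in wq)
    norm = math.sqrt(sum(v * v for v in wq.values())) * math.sqrt(sum(v * v for v in wu.values()))
    return dot / norm if norm else 0.0


# Question type -> relation types, transcribed from Table 2.
RELATIONS_FOR = {
    "why": {"Interpretation", "Causal", "Result", "Base", "Antithesis", "Purpose", "Evidence"},
    "how to": {"Interpretation", "Antithesis", "Purpose", "Explanation"},
}


def select_answers(q_tokens, q_type, units, relations):
    """Return indices of candidate answer units, best match first.

    units: list of token lists; relations: list of (relation_type, i, j)
    triples linking units i and j (a hypothetical representation).
    """
    allowed = RELATIONS_FOR[q_type]
    scored = []
    for i, unit in enumerate(units):
        score = cosine(q_tokens, unit)
        if score == 0.0:
            continue  # the unit does not match the question
        for rel, a, b in relations:
            if rel in allowed and i in (a, b):
                other = b if i == a else a  # the related span is the answer
                scored.append((score, other))
        # matched units without a relevant relation are discarded
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]


# Example 1 in translation, tokenized by hand for illustration.
units = [
    ["black", "tea", "raise", "risk", "cancer"],
    ["explain", "high", "rates", "esophageal", "cancer", "non", "western", "people"],
]
question = ["esophageal", "cancer", "high", "rates", "non", "western", "people"]
ranked = select_answers(question, "why", units, [("Interpretation", 0, 1)])
```

In this toy run the question is most similar to the second unit, so the first unit, which is its partner in the Interpretation relation, is ranked as the top answer, mirroring Example 1.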
We ran our system on the 98 questions we collected and then compared the answers found by the system with the user-formulated answers; if the answer found matched the answer formulated by the subject, we judged it correct. The system found the correct answer for 54 questions, which is 55% of all questions. The result is given in Table 3.

Verberne et al. collected a set of 336 why-question and answer pairs connected to seven manually annotated English texts from the RST Treebank, of 350-550 words each. When they evaluated their system, they obtained a recall of 53.3%. Comparing our result with theirs (Table 4), it can be seen that they selected longer texts than we did. On the other hand, we dealt with raw text (the structure was derived automatically), whereas they dealt with manually annotated data. Additionally, they reported that performance would decline if they used automatically created annotation [13]. Consequently, using the rhetorical relations proposed in this research for answering Arabic why and how to-questions shows promising results.

Table 3. The outcome of the system

 | # questions | % of all questions
---|---|---
Questions handled | 98 | 100
Correctly answered | 54 | 55.1
Wrongly answered | 44 | 44.9

Table 4. Comparison between the two question answering systems

System | # Questions | # Words | Structure derivation | Source | Recall
---|---|---|---|---|---
Arabic QA | 98 | 150-350 | Automatic | Arabic websites | 55%
English QA | 336 | 350-550 | Manual | RST Treebank | 53.3%

6 Conclusion and Future Work

In this paper we presented the first study on automatically finding answers to why and how to-questions for the Arabic language based on Rhetorical Structure Theory. We performed a manual analysis of a set of Arabic texts to select the relation types that are relevant to those kinds of questions, and we selected a set of cue phrases that signal the extracted relations.
Additionally, we carried out an evaluation of the system and compared it with the study of Verberne et al. [13]. The results are promising; a direction for future work is to handle longer texts than those used in this study.

References

1. The Bridge Language Report (2007), http://www.bridgelanguagecenter.com
2. Benajiba, Y., Rosso, P., Lyhyaoui, A.: Implementation of the ArabiQA Question Answering System's Components. In: ICTIS (2007)
3. Hammo, B., Abu-Salem, H., Lytinen, S., Evens, M.: QARAB: A question answering system to support the Arabic language. In: Workshop on Computational Approaches to Semitic Languages, ACL (2002)
4. Mann, W., Matthiessen, C., Thompson, S.: Rhetorical Structure Theory and Text Analysis. In: A Framework for the Analysis of Texts, pp. 79–195 (1992)
5. Mann, W., Taboada, M.: Rhetorical Structure Theory: Looking back and moving ahead. Discourse Studies 8, 423–459 (2006)
6. Al-Sanie, W., Touir, A., Mathkour, H.: Towards a Rhetorical Parsing of Arabic Text. In: International Conference on Computational Intelligence for Modeling, Web Technologies and Internet Commerce (CIMCA-IAWTIC 2005) (2005)
7. Marcu, D.: The Theory and Practice of Discourse Parsing and Summarization. The MIT Press, London (2000)
8. Jattal, M.: Nezam al-Jumlah, pp. 127–140. Aleppo University (1979)
9. Haskour, N.: Al-Sababieh fe Tarkeb al-Jumlah Al-Arabih. Aleppo University (1990)
10. Larkey, L.S., Ballesteros, L., Connell, M.E.: Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In: 25th SIGIR International Conference on Research and Development in Information Retrieval, pp. 275–282. Tampere, Finland (2002)
11. Khoja, S., Garside, R.: Stemming Arabic Text. Computing Department, Lancaster University, Lancaster (1999)
12. Al-Shammari, E.: Towards an Error Free Stemming. In: IADIS European Conference on Data Mining (ECDM 2008). Amsterdam, The Netherlands (2008)
13. Verberne, S., Boves, L., Oostdijk, N.: Discourse-based answering of why-questions. Traitement Automatique des Langues, Special Issue on Computational Approaches to Discourse and Document Processing 47(2), 21–41 (2007)