Building medical ontologies based on terminology extraction from texts: methodological propositions

Audrey Baneyx¹, Jean Charlet¹,², and Marie-Christine Jaulent¹

¹ Inserm, U729, Paris, F-75006 France; 15 rue de l'Ecole de Médecine, 75006 Paris, France
² STIM - DSI/AP-HP
{Audrey.Baneyx,Jean.Charlet,Marie-Christine.Jaulent}@spim.jussieu.fr

Abstract. In the medical field, it is now established that the maintenance of unambiguous thesauri is accomplished by the building of ontologies. Our task in the PertoMed project is to help pneumologists code acts and diagnoses with software that represents medical knowledge by an ontology of the specialty concerned. We apply natural language processing tools to corpora to develop the resources needed to build this ontology. In this paper, our objective is to develop a methodology for the knowledge engineer to build various types of medical ontologies based on terminology extraction from texts, according to the differential semantics theory. Our main research hypothesis concerns the joint use of two methods: distributional analysis and recognition of semantic relationships by lexico-syntactic patterns. The expected result is the building of an ontology of pneumology.

1 Introduction

For about ten years, French hospitals have had to communicate information about their medical activities. For each patient, information is gathered in a patient discharge summary, using the international classification of diseases Cim10¹ for the codification of diagnoses and CCAM² for the acts. The French PMSI³ coding process is usually done manually by physicians using medical specialty thesauri based on common terminologies. However, it has become obvious that the wording of thesauri is ambiguous and non-exhaustive. We think that the automation of the coding task requires a conceptual modelling (a non-contextual and unambiguous ontology) of medical items whose meaning would be written inside the model's structure itself [1] [2].
The main difficulty is to identify and classify the concepts of a given domain. Since classification criteria depend on purposes and are not universal, we do not seek to build a universal ontology, but merely a specific ontology of pneumology [3] [4]. This work is part of the PertoMed⁴ research project, whose objective is to develop methods and tools to produce and use terminological and ontological resources in the medical field. We build our specific ontology by applying natural language processing (NLP) tools to analyse textual corpora. We hypothesize that this kind of resource is the best source to characterize the notions useful for ontological modelling and the semantic content associated with them.

In this paper, we describe an experiment that combines two methods in order to build an ontology according to the differential semantics principles [4]. A differential ontology is a hierarchy of concepts and relationships organized according to the similarities and differences between them. We expect to derive from this experiment a methodology that the knowledge engineer can apply to building medical ontologies.

Section 2 presents the material and tools used, and Section 3 details the different steps of the methodology. Section 4 presents the results we obtained and, finally, Section 5 concludes the paper by discussing the perspectives expected from this work.

¹ The French version of the International Classification of Diseases
² Common classification of medical acts
³ Program of medicalization of information systems

2 Material

In order to cover the whole area of pneumology, we have gathered 1,038 patient discharge summaries (corpus named [PDS]) from six hospitals in Paris. We added a teaching book (corpus named [Book]) to that first corpus. The [PDS] corpus has about 417,000 words and the [Book] corpus has about 823,000 words. We use Syntex-Upery as an NLP tool.
Syntex is a syntactic analyser module based on the hypothesis that terms with a close meaning have similar dependencies. This module thus allows us to identify syntactic dependency relationships between terms or syntagmas (noun vs. noun phrase, verb vs. verbal syntagma, etc.). At the end of the process, we have a network of syntactic dependencies, or terminological network, whose elements are the candidate terms that will be used to build the ontology. Then, the Upery module performs the distributional analysis [5]: it computes distributional proximities between candidate terms on the basis of shared syntactic contexts and exploits all the network data to cluster terms. We obtain a network of candidate terms, their contextual associations, and their links to the corpus. The results of the analysis can be viewed with Termonto, the software interface for data access and data processing.

The DOE⁵ editor allows us to build our ontology according to differential semantics. This software also lets us complete the ontology by adding the English translation of concepts and relationships, and their encyclopaedic definitions. Finally, the ontology is exported into OWL, a knowledge representation language recommended by the W3C.

⁴ http://www.spim.jussieu.fr/Pertomed
⁵ The Differential Ontology Editor, http://opales.ina.fr/public

3 Method

Bachimont's methodology for designing differential ontologies allows us to describe variations of the meaning of words in context and stresses the importance of the textual corpus [4]. There are four successive steps in this methodology: 1) the constitution of the corpus of knowledge and its analysis by NLP tools, 2) the semantic normalization of the set of terms through the application of differential principles, 3) the ontological engagement to formalize defined concepts, 4) the finalization of the process in a language based on description logic and understandable to the computer.
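To make the distributional analysis described in Section 2 more concrete, the following minimal sketch groups terms by the syntactic contexts they share. It is an illustration under stated assumptions, not the Upery implementation: the proximity measure actually used by Upery is not specified here, so a Jaccard coefficient is used as a stand-in, and the (head, expansion) pairs, the function names and the threshold value are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Toy (head, expansion) pairs. Hypothetical data, not taken from the [PDS] corpus.
PAIRS = [
    ("effusion", "of pleura"),
    ("effusion", "of pericardium"),
    ("lesion", "of pleura"),
    ("lesion", "of lung"),
    ("infection", "of lung"),
    ("infection", "of pleura"),
    ("radiography", "of thorax"),
]


def contexts_by_head(pairs):
    """Map each head to the set of expansions it occurs with."""
    contexts = defaultdict(set)
    for head, expansion in pairs:
        contexts[head].add(expansion)
    return dict(contexts)


def jaccard(a, b):
    """Share of contexts common to both sets (stand-in proximity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def neighbours(pairs, threshold=0.3):
    """Return pairs of heads whose shared-context score reaches the threshold."""
    contexts = contexts_by_head(pairs)
    result = []
    for first, second in combinations(sorted(contexts), 2):
        score = jaccard(contexts[first], contexts[second])
        if score >= threshold:
            result.append((first, second, round(score, 2)))
    return result


if __name__ == "__main__":
    for first, second, score in neighbours(PAIRS):
        print(first, second, score)
```

On this toy input, "effusion", "infection" and "lesion" come out as mutual neighbours because they share the context "of pleura", while "radiography" stays apart. A real analysis would work on the full dependency network and would also compare contexts by the terms they share.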

Our research in pneumology refines the first two steps of the methodology and adapts them to the knowledge engineer. This experiment combines two methods to enrich the ontology building: a) the building of terminological resources by distributional analysis applied to the [PDS] corpus [6], and b) the recognition of semantic relationships through the observation of corpus sequences that illustrate the desired relationship, applied to the [Book] corpus [7].

The [PDS] and [Book] corpora are processed to obtain an anonymized [PDS] corpus and a [Book] corpus, both in XML format. Next, the [PDS] corpus is processed by Syntex-Upery. The results of the analysis allow us to build the basic elements, i.e. the primitives, of the ontology. The second corpus, [Book], is analysed to identify semantic relationships (hyperonymy, synonymy, etc.) between candidate terms, using previously defined lexico-syntactic patterns [7]. These links help us to control and enrich the hierarchy.

Candidate terms⁶ (CT) representative of pneumology are chosen among the results provided by Syntex-Upery from the [PDS] corpus in two steps:
- We select the noun phrases (NP) provided by the syntactic analysis that appear in the [PDS] corpus more than 12 times (2% of the corpus). Four conceptual axes are distinguished: symptoms, pathologies, treatments and examinations. The CT are linked with one of these axes. In this way, 35% of the classified CT are used in the second step.
- The distributional analysis connects terms sharing the same contexts (descendants in head and descendants in expansion). It also connects the contexts according to the terms they share (neighbours in head and neighbours in expansion). For instance, "effusion" is the head of the NP "effusion of pleura" and "of pleura" is its expansion. Descendants in head yield information on what could be child-concepts or defined concepts. Descendants in expansion provide information about the concept's position in the hierarchy.
Neighbours in head and in expansion allow us to constitute the groupings of CT semantically close to the one under study, "effusion of pleura".

Groupings are a great help for developing the hierarchical structure of the ontology, along both the horizontal and vertical axes. As a first possible connection, we can link group A {effusion, lesion, infection, decompensation} with {symptoms}. For this example, our first hypothesis is to consider the CT of group A as sibling-concepts (sharing the same semantic context) and symptoms as the parent-concept of the group.

⁶ A candidate term is a noun phrase composed of a head and an expansion. For instance, in the NP "Opacity in the left lung", the term "Opacity" is the head and "in the left lung" is its expansion.

In order to work out the hierarchy, the CT are organised by refining the differential principles that define them. We have to express in natural language the similarities and differences of each concept with respect to its parent-concept and its sibling-concepts. The meaning of a node is given by gathering all the similarities and differences defined for each concept between the target node and the root. The four axes are refined in that way.

In addition, we use the results of the processing of the [Book] corpus to help us apply the differential principles according to the method described in [3]. The analysis based on lexico-syntactic pattern recognition yields clues on how to apply the differential principles. The lexico-syntactic patterns are representative of specific semantic relationships [8]. These patterns are built around a marker which is the lexical indicator of the relationship, such as "kind of" for hyperonymy relationships. To build differential ontologies, we apply this method to look for definition statements in the corpus (for instance: "Dry cough is a symptom for bronchitis and it is also a pathology"). The patterns used were developed by Malaisé et al. [7]. The extracted lexical units are manually confirmed.
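
The marker-based pattern recognition described above can be sketched as follows. This is a minimal illustration, not the patterns of Malaisé et al. [7], which are not reproduced here: the English markers, the approximation of a term as one to four lowercase words, the stop condition and the function names are all assumptions made for the example.

```python
import re

# A term is approximated as one to four lowercase words (assumption).
TERM = r"[a-z]+(?: [a-z]+){0,3}?"
# A term ends before "for", "and", punctuation or the end of the sentence (assumption).
STOP = r"(?= for\b| and\b|[.,;]|$)"

# Hypothetical English markers, listed longest first so that
# "is a kind of" is tried before the shorter "is a".
MARKERS = {
    "hyperonymy": ["is a kind of", "is a type of", "is an?"],
    "synonymy": ["is also called", "is also known as"],
}


def compile_patterns(markers):
    """Build one regular expression per semantic relationship."""
    compiled = {}
    for relation, alternatives in markers.items():
        marker = "|".join(alternatives)
        compiled[relation] = re.compile(
            rf"(?P<left>{TERM}) (?:{marker}) (?P<right>{TERM}){STOP}"
        )
    return compiled


PATTERNS = compile_patterns(MARKERS)


def extract_relations(sentence):
    """Return (left term, relation, right term) candidates found in a sentence."""
    text = sentence.lower()
    found = []
    for relation, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.group("left"), relation, match.group("right")))
    return found


if __name__ == "__main__":
    example = "Dry cough is a symptom for bronchitis and it is also a pathology."
    print(extract_relations(example))
```

On the definition statement quoted above, the sketch proposes the single candidate ("dry cough", "hyperonymy", "symptom"). It misses the second statement ("it is also a pathology"), which shows why the extracted units still need manual confirmation.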

At the end of this stage, we have a semantic normalization of the set of terms of the specialty and we have represented the hierarchy of primitive concepts and relationships with DOE.

4 Current results

After processing with Syntex, the [PDS] corpus gives 36,881 NP and the [Book] corpus gives 17,666 NP. Using the lexico-syntactic patterns for definition recognition, 799 lexical units were extracted, of which 15% were confirmed. This study adds precision and helps us to order the hierarchy. Our ontology currently includes 600 primitive concepts stemming from the first analysis of the CT. Given that building stages 1 and 2 are iterative, we will rapidly extend the hierarchy by examining the CT which appear fewer than 12 times in the [PDS] corpus. The ontology is to be evaluated in terms of both quality and coverage. The conceptual hierarchy must be corrected and validated by pneumologists from the French Pneumology Society, with whom we collaborate.

5 Discussion and conclusion

The purpose of this work is to show that our methodology allows a knowledge engineer who is not a specialist in the medical field to build an ontology based on texts using NLP tools. The initial results of our research, coupled with recent work on surgical intensive care, give us reasons to think we are moving in the right direction [6]. This work of knowledge processing shows that it is necessary to use NLP tools (Syntex-Upery) and modelling tools (DOE) jointly. The physicians we are working with are interested in formal representations of knowledge based on patient records, in order to perform epidemiological studies from patient data. To do that, an ontological representation is essential. We develop specific ontologies, connected to existing thesauri of the domain. From that point of view, our work is closer to the Galen terminological server⁷ [9] than to the SnomedCT approach, which plays the role of both thesaurus and ontology.
The evaluation will also test the completeness of the ontology compared to the specialty thesauri. To estimate that completeness, we will check whether a conceptual representation of medical knowledge can be built by combining the primitive concepts and the relationships in the ontology. It is important to note that the last validation stage of the ontology will be to use it in concrete applications available in the framework of the PertoMed terminological platform.

To conclude, we have presented a range of methodological principles for the building of differential medical ontologies based on texts. We plan to complete this ontology by connecting it to the head (conceptual high level) of the Menelas project ontology⁸ [10]. This will allow us to verify whether there exists, in part, a common high level in the medical field that can be used in other contexts.

References

1. Rector, A.: Thesauri and formal classifications: Terminologies for people and machines. Methods of Information in Medicine 37 (1998) 501–509
2. Staab, S., Studer, R.: Handbook on Ontologies. 1st edn. Springer, Germany (2003)
3. Charlet, J., Bachimont, B., Jaulent, M.: Building medical ontologies by terminology extraction from texts: An experiment for the intensive care units. Computers in Biology and Medicine (2005) To appear.
4. Bachimont, B., Isaac, A., Troncy, R.: Semantic commitment for designing ontologies: A proposal. In: Proceedings of EKAW, Sigüenza, Spain, Springer (2002) 114–121
5. Harris, Z.: Mathematical Structures of Language. John Wiley and Sons, New York, USA (1968)
6. Le Moigno, S., Charlet, J., Bourigault, D., Degoulet, P., Jaulent, M.: Terminology extraction from text to build an ontology in surgical intensive care. In: Proceedings of the AMIA Annual Symposium 2002, San Antonio, Texas (2002) 430–435
7. Malaisé, V., Zweigenbaum, P., Bachimont, B.: Mining defining contexts to help structuring differential ontologies. In Ibekwe-San, J., Condamines, A., Cabré, T., eds.: Application-Driven Terminology Engineering, Terminology. John Benjamins (2005) 21–53
8. Hearst, M.: Automatic acquisition of hyponyms from large text corpora. In Zampolli, A., ed.: Proceedings of the 14th COLING, Nantes, France (1992) 539–545
9. Trombert-Paviot, B., Rodrigues, J., Rogers, J., Baud, R., van der Haring, E., Rassinoux, A., Abrial, V., Clavel, L., Idir, H.: Galen: a third generation terminology tool to support a multipurpose national coding system for surgical procedures. International Journal of Medical Informatics 58-59 (2000) 71–85
10. Zweigenbaum, P., Bouaud, J., Bachimont, B., Charlet, J., Séroussi, B., Boisvieux, J.F.: From text to knowledge: a unifying document-oriented view of analyzed medical language. Methods of Information in Medicine 37 (1998) 384–393

⁷ http://www.openclinical.org/prj_galen.html
⁸ http://www.biomath.jussieu.fr/Menelas/Ontologie