
Building medical ontologies based on terminology
extraction from texts: methodological
propositions
Audrey Baneyx¹, Jean Charlet¹,², and Marie-Christine Jaulent¹
¹ Inserm, U729, Paris, F-75006 France;
15 rue de l'Ecole de Médecine, 75006 Paris, France;
² STIM - DSI/AP-HP;
{Audrey.Baneyx,Jean.Charlet,Marie-Christine.Jaulent}@spim.jussieu.fr
Abstract. In the medical field, it is now established that maintaining
unambiguous thesauri requires building ontologies. Our task in the PertoMed
project is to help pneumologists code acts and diagnoses with software that
represents medical knowledge through an ontology of the specialty concerned.
We apply natural language processing tools to corpora to develop the
resources needed to build this ontology. In this paper, our objective is to
develop a methodology that lets the knowledge engineer build various types of
medical ontologies based on terminology extraction from texts, according to
the differential semantics theory. Our main research hypothesis concerns the
joint use of two methods: distributional analysis and recognition of semantic
relationships by lexico-syntactic patterns. The expected result is an
ontology of pneumology.
1 Introduction
For about ten years, French hospitals have had to report information about
their medical activities. For each patient, information is gathered in a
patient discharge summary, using the international classification of diseases
Cim10¹ to code diagnoses and CCAM² to code acts. The French PMSI³ coding
process is usually done manually by physicians using medical specialty
thesauri based on common terminologies. However, it has become obvious that
the wording of thesauri is ambiguous and non-exhaustive. We think that
automating the coding task requires a conceptual model (a non-contextual and
unambiguous ontology) of medical items whose meaning is written into the
model's structure itself [1] [2]. The main difficulty is to identify and
classify the concepts of a given domain. Since classification criteria depend
on purposes and are not universal, we do not seek to build a universal
ontology, but merely a specific ontology of pneumology [3] [4]. This work is
part of the PertoMed⁴ research project, whose objective is to develop methods
and tools to produce and use terminological and ontological resources in the
medical field.

¹ The French version of the International Classification of Diseases
² Common classification of medical acts
³ Program of medicalization of information systems

We build our specific ontology by applying natural language processing (NLP)
tools to analyse textual corpora. We hypothesize that this kind of resource is
the best source for characterizing the notions useful for ontological
modelling and the semantic content associated with them. In this paper, we
describe an experiment that combines two methods in order to build an ontology
according to the principles of differential semantics [4]. A differential
ontology is a hierarchy of concepts and relationships organized according to
the similarities and differences between them. From this experiment we expect
to derive a methodology that the knowledge engineer can apply to building
medical ontologies.
Section 2 presents the material and tools used, and Section 3 details the
steps of the methodology. Section 4 presents the results we obtained and,
finally, Section 5 concludes the paper by discussing the perspectives opened
by this work.
2 Material
In order to cover the whole area of pneumology, we gathered 1 038 patient
discharge summaries (corpus named [PDS]) from six hospitals in Paris. We
added a textbook (corpus named [Book]) to that first corpus. The [PDS]
corpus has about 417 000 words and the [Book] corpus has about 823 000 words.
We use Syntex-Upery as an NLP tool. Syntex is a syntactic analyser module
based on the hypothesis that terms with a close meaning have similar
dependencies. This module thus allows us to identify syntactic dependency
relationships between terms or syntagms (nouns and noun phrases, verbs and
verb phrases, etc.). At the end of the process, we have a network of syntactic
dependencies, or terminological network, whose elements are the candidate
terms that will be used to build the ontology. Then, the Upery module performs
the distributional analysis [5]: it computes distributional proximities
between candidate terms on the basis of shared syntactic contexts and exploits
all the network data to cluster terms. We obtain a network of candidate terms,
their contextual associations, and their links to the corpus. The results of
the analysis can be viewed with Termonto, the software interface for accessing
and processing the data. The editor DOE⁵ allows us to build our ontology
according to differential semantics. This software also lets us complete the
ontology with English translations of concepts and relationships and with
their encyclopaedic definitions. Finally, the ontology is exported into OWL, a
knowledge representation language recommended by the W3C.
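
To illustrate what a distributional proximity computed from shared syntactic
contexts can look like, here is a minimal Python sketch. It is not the Upery
algorithm: the Jaccard measure, the function names and the sample noun phrases
are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (head, expansion) pairs standing in for parsed noun phrases.
noun_phrases = [
    ("effusion", "of pleura"),
    ("lesion", "of pleura"),
    ("infection", "of pleura"),
    ("effusion", "of pericardium"),
    ("lesion", "of lung"),
    ("infection", "of lung"),
    ("scanner", "of thorax"),
]


def contexts_by_head(phrases):
    """Map each head term to the set of expansions it occurs with."""
    contexts = defaultdict(set)
    for head, expansion in phrases:
        contexts[head].add(expansion)
    return dict(contexts)


def jaccard(a, b):
    """Share of contexts common to both terms (assumed proximity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def proximities(phrases):
    """Proximity score for every pair of head terms."""
    contexts = contexts_by_head(phrases)
    return {
        (t1, t2): jaccard(contexts[t1], contexts[t2])
        for t1, t2 in combinations(sorted(contexts), 2)
    }


scores = proximities(noun_phrases)
```

In this toy data, lesion and infection occur with exactly the same expansions
and therefore get the highest score, while scanner shares no context with the
other terms. A clustering step would then group the terms whose scores are
high.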
⁴ http://www.spim.jussieu.fr/Pertomed
⁵ The Differential Ontology Editor, http://opales.ina.fr/public
3 Method
Bachimont's methodology for designing differential ontologies allows us to
describe variations in the meaning of words in context and stresses the
importance of the textual corpus [4]. The methodology has four successive
steps: 1) the constitution of the corpus of knowledge and its analysis by NLP
tools, 2) the semantic normalization of the set of terms through the
application of differential principles, 3) the ontological commitment to
formalize defined concepts, 4) the finalization of the process in a
machine-readable language based on description logic. Our research in
pneumology both refines the first two steps of the methodology and adapts them
to the knowledge engineer. This experiment combines two methods to enrich
ontology building: a) the building of terminological resources by
distributional analysis applied to the [PDS] corpus [6], and b) the
recognition of semantic relationships through the observation of corpus
sequences illustrating the desired relationship, applied to the [Book]
corpus [7].
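
To make step 2 more concrete, the Python sketch below shows one way a
differential hierarchy could be stored: each concept carries a
natural-language statement of what it shares with its parent and of what
distinguishes it from its siblings. The class Concept, its fields and the
example statements are assumptions made for illustration; they are not the DOE
data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)  # eq=False: nodes are compared by identity, not content
class Concept:
    label: str
    parent: Optional["Concept"] = None
    similarity: str = ""  # what the concept shares with its parent
    difference: str = ""  # what distinguishes it from its siblings
    children: List["Concept"] = field(default_factory=list)

    def add_child(self, label: str, similarity: str, difference: str) -> "Concept":
        """Create a child concept and attach it to this node."""
        child = Concept(label, self, similarity, difference)
        self.children.append(child)
        return child

    def meaning(self) -> List[Tuple[str, str, str]]:
        """Principles gathered on the path from the root down to this node."""
        path = []
        node = self
        while node.parent is not None:
            path.append((node.label, node.similarity, node.difference))
            node = node.parent
        return list(reversed(path))


# Invented example statements, for illustration only.
root = Concept("pneumology object")
symptom = root.add_child(
    "symptom", "is a pneumology object", "is a sign reported or observed"
)
cough = symptom.add_child("dry cough", "is a symptom", "is a cough without sputum")
```

Calling cough.meaning() returns the statements for symptom and then for dry
cough, so the meaning of a node is read from the chain of similarities and
differences that leads to it from the root.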
The [PDS] and [Book] corpora are preprocessed to obtain an anonymized [PDS]
corpus and a [Book] corpus, both in XML format. Next, the [PDS] corpus is
processed by Syntex-Upery. The results of the analysis allow us to build the
basic elements, i.e. the primitives, of the ontology. The second corpus,
[Book], is analysed to identify semantic relationships (hyperonymy, synonymy,
etc.) between candidate terms, using previously defined lexico-syntactic
patterns [7]. These links help us to check and enrich the hierarchy. Candidate
terms⁶ (CT) representative of pneumology are chosen from the results provided
by Syntex-Upery on the [PDS] corpus in two steps:
- We select the noun phrases (NP) provided by the syntactic analysis that
appear in the [PDS] corpus more than 12 times (2% of the corpus). Four
conceptual axes are distinguished: symptoms, pathologies, treatments and
examinations. Each CT is linked with one of these axes. In this way, 35% of
the classified CT are used in the second step.
- The distributional analysis connects terms sharing the same contexts
(descendants in head and descendants in expansion). It also connects the
contexts according to the terms they share (neighbours in head and neighbours
in expansion). For instance, effusion is the head of the NP effusion of pleura
and of pleura is its expansion. Descendants in head yield information on what
could be child-concepts or defined concepts. Descendants in expansion provide
information about the concept's position in the hierarchy. Neighbours in head
and in expansion allow us to form groupings of CT semantically close to the
one under study, effusion of pleura. Groupings are a great help in developing
the hierarchical structure of the ontology, along both the horizontal and
vertical axes. As an example of a first possible connection, we can link
group A {effusion, lesion, infection, decompensation} with {symptoms}.
⁶ A candidate term is a noun phrase composed of a head and an expansion. For
instance, in the NP Opacity in the left lung, the term Opacity is the head and
in the left lung is its expansion.
For this example, our first hypothesis is to consider the CT of group A as
sibling-concepts (sharing the same semantic context) and symptoms as the
parent-concept of the group.
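
The two selection steps can be sketched in Python as follows. This is an
illustration under stated assumptions, not the Syntex-Upery procedure: the
function names, the occurrence counts and the hand-made link to the symptoms
axis are hypothetical, and only a subset of group A is used.

```python
from collections import Counter, defaultdict

MIN_OCCURRENCES = 12  # threshold from the text: strictly more than 12


def frequent_terms(occurrences, threshold=MIN_OCCURRENCES):
    """Keep noun phrases occurring strictly more than `threshold` times."""
    counts = Counter(occurrences)
    return {term for term, n in counts.items() if n > threshold}


def group_heads_by_expansion(phrases):
    """Group heads that share an expansion: candidate sibling concepts."""
    groups = defaultdict(set)
    for head, expansion in phrases:
        groups[expansion].add(head)
    return dict(groups)


# Step 1: frequency filter on invented counts.
occurrences = ["effusion of pleura"] * 13 + ["lesion of pleura"] * 12
kept = frequent_terms(occurrences)

# Step 2: heads sharing the context "of pleura" are grouped together.
phrases = [
    ("effusion", "of pleura"),
    ("lesion", "of pleura"),
    ("infection", "of pleura"),
]
groups = group_heads_by_expansion(phrases)

# Working hypothesis: the grouped heads are siblings under a common parent,
# here linked by hand to the "symptoms" axis.
hierarchy = {"symptoms": sorted(groups["of pleura"])}
```

Only the noun phrase seen 13 times passes the filter, since the threshold is
strict. The resulting hierarchy is a first hypothesis that the knowledge
engineer then revises.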
In order to work out the hierarchy, the CT are organised by refining the
differential principles that define them. We have to express in natural
language the similarities and differences of each concept with respect to its
parent-concept and its sibling-concepts. The meaning of a node is given by
gathering all the similarities and differences defined for each concept
between the target node and the root. The four axes are refined in that way.
In addition, we use the results of processing the [Book] corpus to help us
apply the differential principles according to the method described in [3].
The analysis based on lexico-syntactic pattern recognition yields clues on how
to apply the differential principles. Lexico-syntactic patterns are
representative of specific semantic relationships [8]. These patterns are
built around a marker, the lexical indicator of the relationship, such as
kind of for hyperonymy relationships. To build differential ontologies, we
apply this method to look for definition statements in the corpus (for
instance: Dry cough is a symptom for bronchitis and it is also a pathology).
The patterns used were developed by Malaisé et al. [7]. The extracted lexical
units are manually confirmed. At the end of this stage, we have a semantic
normalization of the set of terms of the specialty, and we have represented
the hierarchy of primitive concepts and relationships with DOE.
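
As an illustration of marker-based pattern recognition, the Python sketch
below extracts candidate hyperonymy pairs from definition-like sentences. The
two regular expressions are hypothetical English patterns written for this
example; they are not the patterns of Malaisé et al. [7], which are not
reproduced here.

```python
import re

TERM = r"[A-Za-z][A-Za-z -]*?"  # lazy run of letters, spaces and hyphens

# (relation, pattern) pairs; the marker is the fixed wording in the middle.
# Order matters: the more specific "is a kind of" marker is tried first.
PATTERNS = [
    (
        "hyperonymy",
        re.compile(rf"^(?P<child>{TERM}) is a kind of (?P<parent>{TERM})[.,;]"),
    ),
    (
        "hyperonymy",
        re.compile(rf"^(?P<child>{TERM}) is an? (?P<parent>{TERM}) (?:for|of) "),
    ),
]


def extract_relations(sentences):
    """Return (child, relation, parent) candidates, one at most per sentence."""
    found = []
    for sentence in sentences:
        for relation, pattern in PATTERNS:
            match = pattern.search(sentence)
            if match:
                child = match["child"].lower()
                parent = match["parent"].lower()
                found.append((child, relation, parent))
                break  # the first matching pattern wins for this sentence
    return found


sentences = [
    "Pleurisy is a kind of inflammation.",
    "Dry cough is a symptom for bronchitis and it is also a pathology.",
    "The patient was discharged.",
]
candidates = extract_relations(sentences)
```

The output is only a list of candidates. As in the method described above,
each extracted pair would still need to be confirmed manually before it is
used to order the hierarchy.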
4 Current results
After Syntex processing, the [PDS] corpus yields 36 881 NP and the [Book]
corpus yields 17 666 NP. Using the lexico-syntactic patterns for definition
recognition, 799 lexical units were extracted, of which 15% were confirmed.
This study adds precision and helps us to order the hierarchy. Our ontology
currently includes 600 primitive concepts stemming from the first analysis of
the CT. Given that building stages 1 and 2 are iterative, we will rapidly
extend the hierarchy by examining the CT which appear fewer than 12 times in
the [PDS] corpus. The ontology is to be evaluated in terms of both quality and
coverage. The conceptual hierarchy must be corrected and validated by
pneumologists from the French Pneumology Society, with whom we collaborate.
5 Discussion and conclusion
The purpose of this work is to show that our methodology allows a knowledge
engineer who is not a specialist in the medical field to build an ontology
based on texts using NLP tools. The initial results of our research, coupled
with recent work on surgical intensive care, give us reason to think we are
moving in the right direction [6]. This work of knowledge processing shows
that it is necessary to use NLP tools (Syntex-Upery) and modelling tools (DOE)
jointly. The physicians we are working with are interested in formal
representations of knowledge based on patient records, so as to be able to
perform epidemiological studies from patient data. For that purpose, an
ontological representation is essential. We develop specific ontologies,
connected to existing thesauri of the domain. From that point of view, our
work is closer to the Galen terminological server⁷ [9] than to the SNOMED CT
approach, which plays the role of both thesaurus and ontology. The evaluation
will also test the completeness of the ontology compared with the specialty
thesauri. To estimate that completeness, we will check whether a conceptual
representation of medical knowledge can be built by combining the primitive
concepts and the relationships in the ontology. It is important to note that
the last validation stage of the ontology will be to use it in concrete
applications available in the framework of the PertoMed terminological
platform. To conclude, we have presented a range of methodological principles
for building differential medical ontologies based on texts. We plan to
complete this ontology by connecting it to the head (conceptual high level) of
the Menelas project ontology⁸ [10]. This will allow us to verify whether there
exists, in part, a common high level in the medical field that can be used in
other contexts.
References
1. Rector, A.: Thesauri and formal classifications: Terminologies for people
and machines. Methods of Information in Medicine 37 (1998) 501-509
2. Staab, S., Studer, R.: Handbook on Ontologies. 1st edn. Springer, Germany
(2003)
3. Charlet, J., Bachimont, B., Jaulent, M.: Building medical ontologies by
terminology extraction from texts: An experiment for the intensive care units.
Computers in Biology and Medicine (2005) To appear.
4. Bachimont, B., Isaac, A., Troncy, R.: Semantic commitment for designing
ontologies: A proposal. In: Proceedings of EKAW, Sigüenza, Spain, Springer
(2002) 114-121
5. Harris, Z.: Mathematical Structures of Language. John Wiley and Sons,
New York, USA (1968)
6. Le Moigno, S., Charlet, J., Bourigault, D., Degoulet, P., Jaulent, M.:
Terminology extraction from text to build an ontology in surgical intensive
care. In: Proceedings of the AMIA Annual Symposium 2002, San Antonio, Texas
(2002) 430-435
7. Malaisé, V., Zweigenbaum, P., Bachimont, B.: Mining defining contexts to
help structuring differential ontologies. In Ibekwe-San, J., Condamines, A.,
Cabré, T., eds.: Application-Driven Terminology Engineering, Terminology.
John Benjamins (2005) 21-53
8. Hearst, M.: Automatic acquisition of hyponyms from large text corpora. In
Zampolli, A., ed.: Proceedings of the 14th COLING, Nantes, France (1992)
539-545
9. Trombert-Paviot, B., Rodrigues, J., Rogers, J., Baud, R., van der Haring,
E., Rassinoux, A., Abrial, V., Clavel, L., Idir, H.: Galen: a third generation
terminology tool to support a multipurpose national coding system for surgical
procedures. International Journal of Medical Informatics 58-59 (2000) 71-85
10. Zweigenbaum, P., Bouaud, J., Bachimont, B., Charlet, J., Séroussi, B.,
Boisvieux, J.F.: From text to knowledge: a unifying document-oriented view of
analyzed medical language. Methods of Information in Medicine 37 (1998)
384-393
⁷ http://www.openclinical.org/prj_galen.html
⁸ http://www.biomath.jussieu.fr/Menelas/Ontologie