Extended study for extracting events between genes/proteins and drugs/active pharmaceutical
ingredients (API) with JReX
Ekaterina Buyko, Kerstin Hornbostel and Udo Hahn
Jena University Language and Information Engineering (JULIE) Lab
Friedrich-Schiller-Universität Jena, Germany
http://www.julielab.de
Continuing the work of Buyko and Hahn (2010), we conducted a study to assess more precisely the
prospects and difficulties of automatically extracting events in which genes/proteins are regulated
by drugs or active pharmaceutical ingredients (API). For this purpose we used the osteoporosis
corpus described in Buyko and Hahn (2010), which contains more than 1,000 abstracts with
co-occurrences of genes/proteins on the one hand and drug/API names on the other. Both entity types
were automatically pre-tagged with our JulieLab tools. Our previous experiment showed that, out of
5,000 events extracted with JReX (Buyko et al., 2010) trained on the BioNLP Shared Task corpus
(Kim et al., 2009), 120 contained drugs/API as arguments with a Cause role in regulation events,
and only 60 of these 120 events were biologically correct. We therefore set out to determine
whether our corpus contains further drug/API-gene/protein relations, how they can be characterized,
and how recall can be increased. To this end, an expert biologist manually evaluated a random
sample of 112 sentences that contain co-occurrences of gene/protein names and drug/API names but
in which no relations between them had been tagged. In the following, we present the results and
conclusions of this evaluation.
In 46 of the 112 analyzed sentences, drugs/API influence genes/proteins in a biologically
relevant way, but in only half of these 46 sentences were the involved drugs/API or genes/proteins
tagged correctly. The first task is therefore to improve the recognition of both entity types.
Our gene mapping system GeNo, trained on biomolecular texts, performs well in tagging
genes/proteins (Wermter et al., 2009). However, this corpus contains many medical abstracts about
clinical trials, which use biomedical terms differently (e.g., other synonyms and long or chemical
names for proteins), and GeNo's performance drops considerably on them.
This evaluation has also shown that the drug/API group needs to be redefined. While drugs are
easy to find with a dictionary approach, most of the API (which occur at a preliminary stage of
drug development) cannot be detected properly because of their ambiguity. In many cases, they
share their names with endogenous (the body’s own) substances, e.g., hormones, metabolic or
nutritive substances (and can even belong to the gene/protein group).
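The ambiguity problem for dictionary-based drug/API tagging can be illustrated with a minimal
sketch. It is not the JulieLab tagger: the two lexicons and the function name tag_drug_mentions
are hypothetical, and the entries are common textbook examples chosen for illustration, not items
from the dictionaries used in this study. A mention found in the drug/API lexicon is flagged as
ambiguous when the same name is also listed as an endogenous substance.

```python
import re

# Hypothetical toy lexicons (illustrative entries only, not the study's dictionaries).
DRUG_API_LEXICON = {"alendronate", "raloxifene", "estrogen", "calcitonin", "vitamin d"}
ENDOGENOUS_LEXICON = {"estrogen", "calcitonin", "vitamin d", "parathyroid hormone"}


def tag_drug_mentions(sentence):
    """Return (name, is_ambiguous) pairs for drug/API lexicon entries in the sentence.

    A name is marked ambiguous when it also appears in the endogenous lexicon,
    i.e. when a plain dictionary match cannot tell a drug from the body's own substance.
    """
    text = sentence.lower()
    hits = []
    for name in sorted(DRUG_API_LEXICON):
        # Match whole words only, so that short names do not fire inside longer tokens.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            hits.append((name, name in ENDOGENOUS_LEXICON))
    return hits


if __name__ == "__main__":
    for hit in tag_drug_mentions("Alendronate increased serum calcitonin levels."):
        print(hit)
```

In such a setup, unambiguous hits could be accepted directly, while ambiguous ones would need
context-based disambiguation or manual review.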
Once entity recognition has been optimized, event extraction should be less of a problem, even
though JReX was trained to extract gene regulation events. Although the focus of the corpus is
more on metabolic pathways, their descriptions differ in most cases only slightly from those of
gene regulation events. Our study revealed two complexity groups. (1) Events with simple linguistic
structures, such as events inside noun phrases with words such as “modulator” or “inhibitor”, or
subject-object relations indicating events such as Suppression, Regulation, Increase, etc.
(2) Events expressed in phrasings with “therapy”, “genotype” or “mutation”, e.g., “estrogen
replacement therapy”, “apoE genotype” or “patients with the Sp1 mutation”, which require further
consideration. The second group should be manually annotated for JReX re-training.
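As a rough illustration of how candidate sentences might be routed to the two complexity groups
before annotation, the following sketch uses only the cue words quoted above. The function name
complexity_group and the prefix-matching heuristic are assumptions made for this example; the
subject-object relations of group (1) would need syntactic analysis and are not covered here.

```python
import re

# Cue stems derived from the words quoted in the text; prefix matching also catches
# simple inflections such as "inhibitors" or "therapies".
GROUP1_CUES = ("modulator", "inhibitor")
GROUP2_CUES = ("therap", "genotyp", "mutation")


def complexity_group(sentence):
    """Return 2, 1 or None for a sentence, based on surface cue words only.

    Group 2 is checked first, so a sentence containing cues from both groups
    is routed to the harder group that needs manual annotation.
    """
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    if any(tok.startswith(cue) for tok in tokens for cue in GROUP2_CUES):
        return 2
    if any(tok.startswith(cue) for tok in tokens for cue in GROUP1_CUES):
        return 1
    return None


if __name__ == "__main__":
    print(complexity_group("patients with the Sp1 mutation"))
    print(complexity_group("a selective estrogen receptor modulator"))
```

Checking group (2) first is a design choice of this sketch: it errs on the side of sending mixed
cases to manual annotation.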
This detailed assessment revealed that the automatic extraction of gene-drug relations poses
difficulties at various levels and should be approached from several angles. This study served
as a preparatory step for the annotation of regulation events involving drugs and genes/proteins.
References
Buyko, Ekaterina and Hahn, Udo. Linking a gene/protein-focused relation extractor to the pharmacogenomic domain. In: Abstracts of the PSB 2010
Workshop on Mining the Pharmacogenomics Literature.
Buyko, Ekaterina, Faessler, Erik, Wermter, Joachim and Hahn, Udo. Event extraction from trimmed dependency graphs. Computational Intelligence,
in press.
Kim, Jin-Dong, Ohta, Tomoko, Pyysalo, Sampo, Kano, Yoshinobu and Tsujii, Jun'ichi. Overview of BioNLP'09 Shared Task on Event Extraction. In:
Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, pages 1-9. Association for Computational Linguistics, 2009.
Wermter, Joachim, Tomanek, Katrin and Hahn, Udo. High-Performance Gene Name Normalization with GeNo. Bioinformatics, 25(6):815-821, 2009.