Extended study for extracting events between genes/proteins and drugs/active pharmaceutical ingredients (API) with JReX

Ekaterina Buyko, Kerstin Hornbostel and Udo Hahn
Jena University Language and Information Engineering (JULIE) Lab
Friedrich-Schiller-Universität Jena, Germany
http://www.julielab.de

Continuing the work of Buyko and Hahn (2010), we conducted a study to assess more precisely the opportunities and difficulties of automatically extracting events that describe the regulation of genes/proteins by drugs or active pharmaceutical ingredients (API). For this purpose, we used the osteoporosis corpus described in Buyko and Hahn (2010), which contains more than 1,000 abstracts with co-occurrences of gene/protein names on the one hand and drug/API names on the other. Both entity types were automatically pre-tagged with our JulieLab tools. Our previous experiment showed that, out of 5,000 events extracted with JReX (Buyko et al., 2010) trained on the BioNLP Shared Task corpus (Kim et al., 2009), 120 events contained drugs/API as arguments with a Cause role in regulation events, and only 60 of these 120 events were biologically correct. We now tried to find out whether there are more drug/API-gene/protein relations in our corpus, how they can be characterized, and how to increase recall. To this end, an expert biologist manually evaluated a random sample of 112 sentences that contain co-occurrences of gene/protein names and drug/API names but without any tagged relations between them. In the following, we present the results and conclusions of this evaluation.

In 46 of the 112 analyzed sentences, drugs/API influence genes/proteins in a biologically relevant way, but in only half of these 46 sentences were the involved drugs/API or genes/proteins tagged correctly. The first task is therefore to improve the recognition of both entity types. Our gene mapping system GeNo, trained on biomolecular texts, achieves high performance in tagging genes/proteins (Wermter et al., 2009).
However, this corpus contains many medical abstracts about clinical trials, which implies a different usage of biomedical terms (such as other synonyms and long or chemical names for proteins), and as a result GeNo's performance drops considerably. This evaluation has also shown that we need to reconsider the definition of the drug/API group. While drugs are easy to find with a dictionary approach, most API (which occur at an early stage of drug development) cannot be properly detected because of their ambiguity. In many cases, they share their names with endogenous (the body’s own) substances, e.g., hormones or metabolic or nutritive substances (and may even belong to the gene/protein group).

Once entity recognition is optimized, event extraction should be less of a problem, even though JReX was trained to extract gene regulation events. Although the focus of the corpus is more on metabolic pathways, their descriptions differ in most cases only slightly from those of gene regulation events. Our study revealed two complexity groups. In group (1), events show simple linguistic structures, such as events inside noun phrases with words such as “modulator” or “inhibitor”, or subject-object relations indicating events such as Suppression, Regulation, Increase, etc. Further thought has to be given to complexity group (2), which contains phrasings with “therapy”, “genotype” or “mutation”, e.g., “estrogen replacement therapy”, “apoE genotype” or “patients with the Sp1 mutation”. The second group should be manually annotated for JReX re-training.

This detailed assessment revealed that the automatic extraction of gene-drug relations poses difficulties at various levels and should be approached from different angles. This study served as a preparatory step for annotating regulation events involving drugs and genes/proteins.

References

Buyko, Ekaterina and Hahn, Udo. Linking a gene/protein-focused relation extractor to the pharmacogenomic domain.
In Abstracts of the PSB 2010 Workshop on Mining the Pharmacogenomics Literature.
Buyko, Ekaterina, Faessler, Erik, Wermter, Joachim and Hahn, Udo. Event extraction from trimmed dependency graphs. Computational Intelligence, in press.
Kim, Jin-Dong, Ohta, Tomoko, Pyysalo, Sampo, Kano, Yoshinobu and Tsujii, Jun'ichi. Overview of BioNLP'09 Shared Task on Event Extraction. In Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, pages 1-9. Association for Computational Linguistics, 2009.
Wermter, Joachim, Tomanek, Katrin and Hahn, Udo. High-Performance Gene Name Normalization with GeNo. Bioinformatics, 25(6):815-821, 2009.