DIAB: An Ontology of Type 2 Diabetes Stages and Associated Phenotypes

Drashtti Vasant1, Frauke Neff2, Philipp Gormanns3, Nathalie Conte1, Andreas Fritsche4,5,6, Harald Staiger4,5,6, Jee-Hyub Kim1, James Malone1, Michael Raess7, Martin Hrabe de Angelis3,6,8, Peter Robinson9 and Helen Parkinson1*

1. EMBL-EBI, the European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK.
2. Institute of Pathology, Helmholtz Centre Munich - German Research Center for Environmental Health (GmbH), Neuherberg, Germany.
3. Institute of Experimental Genetics, Helmholtz Centre Munich - German Research Center for Environmental Health (GmbH), Neuherberg, Germany.
4. Institute for Diabetes Research and Metabolic Diseases of the Helmholtz Centre Munich at the University of Tübingen, Tübingen, Germany.
5. Division of Endocrinology, Diabetology, Vascular Disease, Nephrology and Clinical Chemistry, Department of Internal Medicine, University of Tübingen, Tübingen, Germany.
6. German Center for Diabetes Research (DZD), Neuherberg, Germany.
7. INFRAFRONTIER GmbH, Neuherberg, Germany.
8. Chair of Experimental Genetics, Center of Life and Food Sciences Weihenstephan, Technische University Munich, Freising-Weihenstephan, Germany.
9. Institute for Medical Genetics, Universitätsklinikum Charité, Humboldt-Universität, Augustenburger Platz 1, 13353 Berlin, Germany.

ABSTRACT
Motivation: Integration of phenotypic data between animal models and humans is critical to improving our understanding of disease. Here we present a process for ontology generation using text mining and expert review, and the resulting ontology of disease-phenotype associations, which links terms from mouse and human phenotype ontologies to the disease stages of Type 2 Diabetes Mellitus.

1 INTRODUCTION
Mouse models are a critical tool for translational researchers, and access to rich information about phenotypes as well as genotypes informs the selection of a mouse for future experiments.
However, differences in how human and mouse phenotypes are represented, and the difficulty of understanding the relationships between measured phenotypes and disease, make it challenging to translate mouse data for human researchers. Our chosen disease area is Type 2 Diabetes Mellitus (T2D), a common chronic disease affecting 9% of adults in 2014. As part of the BioMedBridges project we sought to improve the ontological representation of T2D and related phenotypes using both mouse and human phenotype ontologies, and we propose a rapid and lightweight process that enables domain experts to contribute directly to ontology development. In this paper we (1) illustrate how diabetes-related phenotypes were text mined from the literature, then reviewed and organised by domain experts, and (2) present an ontology representation of the mined terms, curated and categorised into temporal stages by domain experts. The curated terms are available as an ontology in OWL format and are also expressed in OBAN format as disease-phenotype relationships.

The first step in the process was a workshop in which domain experts and ontologists reviewed existing ontology content in this domain for completeness and structure. Four widely used resources were reviewed: the Human Disease Ontology (DO; Kibbe et al., 2015), the Human Phenotype Ontology (HP; Robinson & Mundlos, 2010), Online Mendelian Inheritance in Man (OMIM; Amberger, Bocchini, Schiettecatte, Scott, & Hamosh, 2015) and the Experimental Factor Ontology (Malone et al., 2010), which imports the Orphanet Rare Genetic Disease Classification and the Mammalian Phenotype Ontology (Smith & Eppig, 2012). The participants concluded that while all the resources deliver components of the knowledge in the domain, these are insufficiently integrated and critical information is missing. For example, the temporal nature of the disease in stages and the distinction between core phenotypes vs.
those associated with late-stage complications are unavailable, and there is no complete, explicit representation linking T2D phenotypes with the disease. We have therefore extended existing ontologies to create a T2D-specific ontological resource, with cross-references to existing ontologies, which enables integration of mouse and human datasets.

2 IDENTIFYING DISEASE PHENOTYPE RELATIONSHIP USING TEXT MINING
In order to leverage knowledge in the literature and to limit the amount of human effort required in generating terms for an ontology, we identified T2D-phenotype associations by text-mining Europe PubMed Central (EuropePMC) abstracts.

* To whom correspondence should be addressed.

2.1 Text mining
Our first step was to search EuropePMC for papers in the journals listed in Table 1 annotated with the MeSH term ‘Type 2 Diabetes’ and its synonyms. Excluding patents and reviews, 7480 abstracts were retrieved for these journals and used as the corpus for text mining. As few clinical journals are fully open access, we elected to mine abstracts only and to focus on specialist journals, where we expected to find rich information on phenotypes within abstracts. In total 609 T2D-phenotype associations were retrieved; 370 of these were confirmed as T2D associated by domain experts, a precision of 60.8% (370/609).

Table 1: Title and ISSN number of journals used for text mining

Journal Title                                ISSN number
Diabetes                                     1939-327X
Diabetes Care                                1935-5548
Diabetologia                                 1432-0428
Diabetes, Metabolic Syndrome and Obesity     1178-7007
Diabetes, Obesity and Metabolism             1463-1326
Diabetes Research and Clinical Practice      1872-8227
Diabetology and Metabolic Syndrome           1758-5996

Whatizit (Rebholz-Schuhmann, Arregui, Gaudan, Kirsch, & Jimeno, 2008), accessed via SOAP web services, was used as the text mining service to mine abstracts for phenotypic terms, using a dictionary constructed from the Human Phenotype Ontology and the Mammalian Phenotype Ontology, both accessed in November 2012. The latter was included because experts had already identified during the workshop that the HP provided incomplete coverage of type 2 diabetes associated phenotypes. The output of the text mining process was a simple spreadsheet designed for presentation to the domain experts, reporting: ontology id, term, abstract count, citation count, term frequency, term frequency–inverse document frequency (TF-IDF) and parent-child relationship, the last providing a low-level indication of term granularity. 572 terms were mined from 7480 abstracts; an expert curator cleaned these of uninformative terms, e.g. ‘All’, prior to review by domain experts. Initial inspection revealed terms representing etiology, secondary complications and diagnostic terms, as well as phenotypes associated with T2D.

2.2 Manual curation
Mined terms were provided in spreadsheet format in two separate iterations to a review team consisting of clinical diabetologists and a clinical pathologist with experience of human and mouse data. The review process involved definition of categories of disease stages, organisation of phenotypic terms into those categories, deletion of 199 terms considered to be out of scope, and addition of two new terms: hyperosmolarity and macular edema. Additionally, 375 terms associated with T2D, Type 1 Diabetes (T1D) and other diseases were retained as the domain experts considered them in scope; 21 terms were associated only with T1D. The following categories were established and the experts organised the terms as follows:
• Type 2 diabetes has three disease progression stages: prediabetes, manifest diabetes, and consequences or complications.
• A phenotype is manifest_in at least one Type 2 diabetes stage, e.g. apnea in prediabetes.
• A phenotype is a cause_of or symptom_of Type 2 diabetes, e.g. lipemia is a cause and ketonuria is a symptom.
• A phenotype is manifest_in Type 1 diabetes and may also be associated_with other diseases, e.g. premature aging.

Table 2: Total count of MP/HP terms in all categories

Category                        HP terms   MP terms   New terms   Total terms
Diabetes cause                  45         48         0           93
IFG/IGT* (Prediabetes)          98         73         0           171
Manifest Diabetes               237        115        2           354
Diabetes symptom                102        61         0           163
Consequences/complications      200        87         2           289
Type 1 Diabetes                 172        75         1           248
Type 2 Diabetes                 248        125        2           375
Assoc. w/ other diseases too    230        108        2           340
Total (any temporal stage)      248        125        2           375

*IFG/IGT: impaired fasting glucose (level) / impaired glucose tolerance

3 KNOWLEDGE REPRESENTATION
The knowledge captured by the domain experts is available in two forms. First, we generated an OWL representation of the hierarchy that the experts had initially organised as a spreadsheet. We used the OWL-API (https://github.com/owlcs/owlapi/wiki) to transform the disease stages and phenotypes from the spreadsheet into OWL. The information was modelled as 386 distinct classes, and the URIs were retained for MP- and HP-derived terms. Associations between disease stages and phenotypes were described by object properties that were defined during the curation process. The ontology is available from BioPortal (http://purl.bioontology.org/ontology/DIAB); a screenshot is shown in Figure 1.

Figure 1: The ontology visualised in Protégé, showing the representation of the HP-derived term ‘abdominal aortic aneurysms’.

Secondly, we represented the disease-phenotype associations using the Open Biomedical AssociatioN (OBAN, https://github.com/jamesmalone/OBAN) framework.
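As an illustration only, a curated spreadsheet row can be turned into an OBAN-style association: an association node that points to a disease stage as its subject, a phenotype as its object, and a provenance node recording who made the assertion. The sketch below uses only the Python standard library and is not the OWL-API code used for DIAB. The OBAN namespace URI, the `example.org` URIs, the `curated_by` property and the `CuratedRow` fields are all assumptions made for this example.

```python
import json
from dataclasses import dataclass

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OBAN = "http://purl.org/oban/"    # assumed OBAN namespace
EX = "http://example.org/diab/"   # hypothetical namespace for this sketch


class Literal(str):
    """Marks an object value as a string literal rather than a URI."""


@dataclass(frozen=True)
class CuratedRow:
    """One expert-reviewed spreadsheet row (hypothetical column names)."""
    disease_stage: str   # local name of the disease stage, e.g. "manifest_diabetes"
    phenotype_uri: str   # URI of the phenotype term (MP/HP URIs are retained)
    curator: str         # who asserted the association


def association_triples(index, row):
    """Build the triples for one disease-phenotype association."""
    assoc = f"{EX}association_{index}"
    prov = f"{EX}provenance_{index}"
    return [
        (assoc, RDF_TYPE, OBAN + "association"),
        (assoc, OBAN + "association_has_subject", EX + row.disease_stage),
        (assoc, OBAN + "association_has_object", row.phenotype_uri),
        (assoc, OBAN + "association_has_provenance", prov),
        (prov, RDF_TYPE, OBAN + "provenance"),
        # "curated_by" is a made-up property standing in for the "who".
        (prov, EX + "curated_by", Literal(row.curator)),
    ]


def to_ntriples(triples):
    """Serialise triples as N-Triples-style lines."""
    lines = []
    for subject, predicate, obj in triples:
        if isinstance(obj, Literal):
            rendered = json.dumps(str(obj))  # quoted, escaped string literal
        else:
            rendered = f"<{obj}>"
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder phenotype URI; a real row would carry the HP or MP term URI.
    row = CuratedRow("manifest_diabetes", "http://example.org/pheno/ketonuria", "expert_1")
    print(to_ntriples(association_triples(0, row)))
```

Keeping the association as its own node, rather than a direct disease-to-phenotype edge, is what leaves room for provenance to be attached to each assertion.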
OBAN provides a generic association model, in which association_has_subject points to the disease and association_has_object to the phenotype, and also allows provenance supporting the associations (who, when, etc.) to be expressed. The OBAN class ‘provenance’ reuses the object property has_evidence from Miriam (Juty et al., 2012) and the different evidence subclasses from the Evidence Codes Ontology (Chibucos et al., 2014). An example of the OBAN representation is provided in Figure 2.

Figure 2: Protégé screenshot of the OWL/RDF file generated to represent the DIAB ontology based on

4 CONCLUSION
We have developed an ontological representation of expert knowledge about T2D based on a simple text mining and expert review process. This process provides a pragmatic and rapid method to extract information from the literature and present it to domain experts for review, who then refined the association types to represent the domain more precisely. We and others (e.g. Brewster et al., 2009) have found text mining to be a useful tool in building ontologies, and specifically for making associations between ontologies, here between disease and phenotype. The process has since been used successfully to perform disease-phenotype association in different disease areas with new domain experts. We have determined that terms from both mouse and human phenotype ontologies provide a more complete representation of human phenotypes for T2D, and this is also confirmed in other disease areas. An expert mapping of terms between the mouse and human ontologies is underway to determine whether new terms could be added to the HP, which had an initial focus on rare disease phenotypes rather than those associated with common disease. Future improvements to the process include modification of input vocabularies to improve recall (e.g. phenotype terms from UMLS could be added) and sub-selection of ontology terms excluding upper-level nodes to reduce noise for experts; for example, HP has the root term ‘All’.
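One possible way to exclude upper-level nodes from a mining dictionary is to drop every term that sits within a fixed distance of the root. The following is a minimal sketch of that idea, not part of the published pipeline. The toy hierarchy, the function names and the `min_depth` threshold are assumptions chosen for illustration, and the hierarchy does not reproduce the real HP structure.

```python
from collections import deque


def depth_from_root(term, parents):
    """Shortest number of parent links from `term` up to a term with no parents."""
    seen = {term}
    queue = deque([(term, 0)])
    while queue:
        node, distance = queue.popleft()
        node_parents = parents.get(node, [])
        if not node_parents:
            return distance  # breadth-first, so the first root found is the nearest
        for parent in node_parents:
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, distance + 1))
    raise ValueError(f"no root reachable from {term!r}")


def drop_upper_levels(terms, parents, min_depth=2):
    """Keep only terms that lie at least `min_depth` links below a root."""
    return [t for t in terms if depth_from_root(t, parents) >= min_depth]


# Toy parent map (illustrative only).
TOY_PARENTS = {
    "All": [],
    "Phenotypic abnormality": ["All"],
    "Abnormality of metabolism": ["Phenotypic abnormality"],
    "Ketonuria": ["Abnormality of metabolism"],
}

if __name__ == "__main__":
    print(drop_upper_levels(list(TOY_PARENTS), TOY_PARENTS, min_depth=2))
```

A depth cut-off is crude, since useful and uninformative terms can share a level; a curated stop list of upper-level terms would be an alternative.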
We note that our process was likely aided by a corpus of literature for T2D from specialist journals, which presumably is the reason only two additional T2D-phenotype associated terms were added by domain experts during review. One of these terms, ‘macular edema’, is now present in HP, but was not when the text mining was performed and was therefore not retrieved. The second, ‘hyperosmolarity’, is present in SNOMED CT as a child of ‘metabolic finding’ and as a compound term with T2D in MedDRA, but is absent from both the versions of HP and MP used in the text mining and the current versions. HP and MP have different foci and as such are not equivalent. For example, the terms confusion, mood changes and nausea are present in HP but not in MP, while vaginitis and hypopnea are present in MP but not in HP. As both the HP and MP have been revised since we completed this work, we plan to re-run the text mining to evaluate whether they now provide an improved representation of this domain, for example by inclusion of a richer set of synonyms, prior to suggesting new terms.

The ontology is available in OWL and has also been transformed into the OBAN model as a generic phenotype-disease association for community use. Planned improvements to the ontology representation include capturing the relevance level of each phenotype for each disease stage. Moreover, the OWL representation would benefit from more specific modelling of non-diabetic disease relationships, currently represented as ‘disease other than diabetes’ by the domain experts, and a complete import of the class information from MP/HP, including synonyms and definitions, is required.

Our work represents a first step in the process to traverse mouse and human data and to define a process for the association of common disease with phenotypes in human and mouse. To evaluate the ontology in the integration of human and mouse data, we will use it in the PhenoDigm tool (Smedley et al., 2013).
PhenoDigm uses a semantic approach to map between clinical features observed in humans and mice, currently represented using the HP and MP ontologies. Using the improved disease-phenotype representation we have produced, we will evaluate whether identification of mouse models is improved by the use of this ontology, and therefore whether we are able to more widely improve matching between phenotypic data from mouse models, for example those generated by the International Mouse Phenotyping Consortium (Koscielny et al., 2013), and human disease data. Finally, a manual mapping between the HP and MP phenotype terms we have detected is underway. We are aware that both ontologies have evolved since we performed the text mining, and that expert-verified mappings between the terms are useful in data integration; these will therefore be made public when they are available. The text mining code and the associated spreadsheets can be found at https://github.com/EBISPOT/T2D. In future we will explore whether the use of the T2D-associated phenotype terms can improve the retrieval of T2D-associated genes from the literature, and will refine the ontology further by the inclusion of clinical tests used in diagnosis to improve its use in electronic health record mining.

ACKNOWLEDGEMENTS
This work was supported by the BioMedBridges Project funded by the European Commission within Research Infrastructures of the FP7 Capacities Specific Programme, grant agreement number 284209, and EMBL Core funds to HP. Thanks to Anika Oellrich for discussion of text mining approaches and Damian Smedley for comments on the manuscript.

REFERENCES
Amberger, J. S., Bocchini, C. A., Schiettecatte, F., Scott, A. F., & Hamosh, A. (2015). OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Research, 43(Database issue), D789–98. doi:10.1093/nar/gku1205
Brewster, C., Jupp, S., Luciano, J., Shotton, D., Stevens, R. D., & Zhang, Z. (2009). Issues in learning an ontology from text. BMC Bioinformatics, 10 Suppl 5, S1. doi:10.1186/1471-2105-10-S5-S1
Chibucos, M. C., Mungall, C. J., Balakrishnan, R., Christie, K. R., Huntley, R. P., White, O., … Giglio, M. (2014). Standardized description of scientific evidence using the Evidence Ontology (ECO). Database, 2014, bau075. doi:10.1093/database/bau075
Juty, N., Le Novère, N., & Laibe, C. (2012). Identifiers.org and MIRIAM Registry: community resources to provide persistent identification. Nucleic Acids Research, 40(Database issue), D580–6. doi:10.1093/nar/gkr1097
Kibbe, W. A., Arze, C., Felix, V., Mitraka, E., Bolton, E., Fu, G., … Schriml, L. M. (2015). Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Research, 43(Database issue), D1071–8. doi:10.1093/nar/gku1011
Koscielny, G., Yaikhom, G., Iyer, V., Meehan, T. F., Morgan, H., Atienza-Herrero, J., … Parkinson, H. (2013). The International Mouse Phenotyping Consortium Web Portal, a unified point of access for knockout mice and related phenotyping data. Nucleic Acids Research. doi:10.1093/nar/gkt977
Malone, J., Holloway, E., Adamusiak, T., Kapushesky, M., Zheng, J., Kolesnikov, N., … Parkinson, H. (2010). Modeling sample variables with an Experimental Factor Ontology. Bioinformatics, 26(8), 1112–1118.
Rebholz-Schuhmann, D., Arregui, M., Gaudan, S., Kirsch, H., & Jimeno, A. (2008). Text processing through Web services: calling Whatizit. Bioinformatics (Oxford, England), 24(2), 296–8. doi:10.1093/bioinformatics/btm557
Robinson, P. N., & Mundlos, S. (2010). The human phenotype ontology. Clinical Genetics, 77, 525–534.
Smedley, D., Oellrich, A., Köhler, S., Ruef, B., Westerfield, M., Robinson, P., … Mungall, C. (2013). PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database: The Journal of Biological Databases and Curation, 2013, bat025. doi:10.1093/database/bat025
Smith, C. L., & Eppig, J. T. (2012). The Mammalian Phenotype Ontology as a unifying standard for experimental and high-throughput phenotyping data. Mammalian Genome, 23(9-10), 653–668. doi:10.1007/s00335-012-9421-3