DIAB: An Ontology of Type 2 Diabetes Stages and Associated
Phenotypes
Drashtti Vasant1, Frauke Neff2, Philipp Gormanns3, Nathalie Conte1, Andreas Fritsche4,5,6,
Harald Staiger4,5,6, Jee-Hyub Kim1, James Malone1, Michael Raess7, Martin Hrabe de Angelis3,6,8,
Peter Robinson9 and Helen Parkinson1*
1. EMBL-EBI, the European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK.
2. Institute of Pathology, Helmholtz Centre Munich - German Research Center for Environmental Health (GmbH), Neuherberg, Germany.
3. Institute of Experimental Genetics, Helmholtz Centre Munich - German Research Center for Environmental Health (GmbH), Neuherberg, Germany.
4. Institute for Diabetes Research and Metabolic Diseases of the Helmholtz Centre Munich at the University of Tübingen, Tübingen, Germany.
5. Division of Endocrinology, Diabetology, Vascular Disease, Nephrology and Clinical Chemistry, Department of Internal Medicine, University of Tübingen, Tübingen, Germany.
6. German Center for Diabetes Research (DZD), Neuherberg, Germany.
7. INFRAFRONTIER GmbH, Neuherberg, Germany.
8. Chair of Experimental Genetics, Center of Life and Food Sciences Weihenstephan, Technische University Munich, Freising-Weihenstephan, Germany.
9. Institute for Medical Genetics, Universitätsklinikum Charité, Humboldt-Universität, Augustenburger Platz 1, 13353 Berlin, Germany.
ABSTRACT
Motivation: Integration of phenotypic data between animal
models and humans is critical to improve our understanding of
disease. Here we present a process for ontology generation
using text mining and expert review, and the resulting ontology
of disease-phenotype associations, which links terms from
mouse and human phenotype ontologies with the disease
stages of Type 2 Diabetes Mellitus.
1 INTRODUCTION
Mouse models are a critical tool for translational researchers, and access to rich information
about phenotypes as well as genotypes informs the selection of a mouse for future experiments.
However, the different representations of human and mouse phenotypes, and the difficulty of
understanding the relationships between measured phenotypes and disease, present a challenge
in translating mouse data for researchers working on human disease. Our chosen disease area is
Type 2 Diabetes Mellitus (T2D), a common chronic disease affecting 9% of adults in 2014. As
part of the BioMedBridges project we sought to improve the ontological representation of T2D
and related phenotypes using both mouse and human phenotype ontologies, and to propose a rapid
and lightweight process for ontology development that enables domain experts to contribute
directly. In this paper we (1) illustrate how diabetes-related phenotypes were text mined from
the literature, then reviewed and organised by domain experts, and (2) present an ontology
representation of the mined terms, curated and categorized into temporal stages by domain
experts. The curated terms are available as an ontology in OWL format and are also expressed
in OBAN format as disease-phenotype relationships.
The first step in the process was a workshop in which domain experts and ontologists reviewed
existing ontology content in this domain for completeness and structure. Four widely used
resources were reviewed: the Human Disease Ontology (DO; Kibbe et al., 2015), the Human
Phenotype Ontology (HP; Robinson & Mundlos, 2010), Online Mendelian Inheritance in Man (OMIM;
Amberger, Bocchini, Schiettecatte, Scott, & Hamosh, 2015) and the Experimental Factor Ontology
(Malone et al., 2010), which imports the Orphanet Rare Genetic Disease Classification and the
Mammalian Phenotype Ontology (MP; Smith & Eppig, 2012). The participants concluded that while
all the resources deliver components of the knowledge in the domain, these are insufficiently
integrated and critical information is missing. For example, the temporal nature of the disease
(its stages) and the distinction between core phenotypes and those associated with late-stage
complications are not represented, and there is no complete, explicit representation of the
association of T2D phenotypes with the disease. We have therefore extended existing ontologies
to create a T2D-specific ontological resource that cross-references existing ontologies and
enables integration of mouse and human datasets.
2 IDENTIFYING DISEASE-PHENOTYPE RELATIONSHIPS USING TEXT MINING
In order to leverage knowledge in the literature and to limit the amount of human effort
required in generating terms for an ontology, we identified T2D-phenotype associations by
text-mining Europe PubMed Central (EuropePMC) abstracts.
* To whom correspondence should be addressed.
2.1 Text mining
Our first step was to search EuropePMC for papers in the journals listed in Table 1 annotated
with the MeSH term ‘Type 2 Diabetes’ and its synonyms. Excluding patents and reviews, 7480
abstracts were retrieved for these journals and used as the corpus for text mining. As few
clinical journals are fully open access, we elected to mine abstracts only and to focus on
specialist journals, where we expected to find rich information on phenotypes within abstracts.
In total 609 T2D-phenotype associations were retrieved, of which 370 were identified as
T2D-associated by domain experts (60.7%); strictly, this ratio of confirmed to retrieved
associations is a precision rather than a recall.
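The percentage quoted above is the share of retrieved associations that the domain experts confirmed, which is a precision-style measure. The following minimal sketch reproduces that arithmetic; the function and variable names are illustrative and are not taken from the authors' code. The exact ratio is about 0.6076, which gives 60.7% when truncated and 60.8% when rounded to one decimal place.

```python
# Counts as reported in the text; names are illustrative assumptions.
RETRIEVED_ASSOCIATIONS = 609   # T2D-phenotype associations returned by text mining
CONFIRMED_ASSOCIATIONS = 370   # associations judged T2D-related by domain experts


def confirmed_fraction(confirmed: int, retrieved: int) -> float:
    """Return the fraction of retrieved associations that experts confirmed."""
    if retrieved <= 0:
        raise ValueError("retrieved must be a positive count")
    if not 0 <= confirmed <= retrieved:
        raise ValueError("confirmed must lie between 0 and retrieved")
    return confirmed / retrieved


fraction = confirmed_fraction(CONFIRMED_ASSOCIATIONS, RETRIEVED_ASSOCIATIONS)
print(f"{fraction:.1%}")  # rounds to one decimal place
```

A true recall figure would additionally need the number of relevant associations that the text mining missed, which is not reported here.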
Table 1: Title and ISSN of journals used for text mining
Journal Title                                ISSN
Diabetes                                     1939-327X
Diabetes Care                                1935-5548
Diabetologia                                 1432-0428
Diabetes, Metabolic Syndrome and Obesity     1178-7007
Diabetes, Obesity and Metabolism             1463-1326
Diabetes Research and Clinical Practice      1872-8227
Diabetology and Metabolic Syndrome           1758-5996
Whatizit (Rebholz-Schuhmann, Arregui, Gaudan, Kirsch, & Jimeno, 2008), accessed via SOAP web
services, was used as the text mining service to mine abstracts for phenotypic terms, using a
dictionary constructed from the Human Phenotype Ontology and the Mammalian Phenotype Ontology
(both accessed November 2012). The latter was included because experts had already identified
during the workshop that the HP provided incomplete coverage of type 2 diabetes associated
phenotypes. The output of the text mining process was a simple spreadsheet designed for
presentation to the domain experts, reporting: ontology id, term, abstract count, citation
count, term frequency, term frequency–inverse document frequency (TF-IDF) and parent-child
relationship, the last providing a low-level indication of term granularity. 572 terms were
mined from 7480 abstracts; an expert curator removed uninformative terms (e.g. ‘All’) prior to
review by domain experts. Initial inspection revealed terms representing etiology, secondary
complications and diagnostic terms as well as phenotypes associated with T2D.
2.2 Manual curation
Mined terms were provided in spreadsheet format in two separate iterations to a review team
consisting of clinical diabetologists and a clinical pathologist with experience of human and
mouse data. The review process involved definition of categories of disease stages,
organization of phenotypic terms into those categories, deletion of 199 terms considered to be
out of scope, and addition of two new terms, hyperosmolarity and macular edema. Additionally,
375 terms associated with T2D, Type 1 Diabetes (T1D) and other diseases were retained as the
domain experts considered them in scope; 21 terms were associated only with T1D. The experts
established the following categories and relationships and organized the terms accordingly:
• Type 2 diabetes has three disease progression stages: prediabetes, manifest diabetes, and
consequences or complications.
• A phenotype is manifest_in at least one Type 2 diabetes stage, e.g. apnea in prediabetes.
• A phenotype is a cause_of or symptom_of Type 2 diabetes, e.g. lipemia is a cause and
ketonuria is a symptom.
• A phenotype is manifest_in Type 1 diabetes and may also be associated_with other diseases,
e.g. premature aging.
Table 2: Total count of MP/HP terms in all categories
Category                        HP terms   MP terms   New terms   Total terms
Diabetes cause                  45         48         0           93
IFG/IGT* (Prediabetes)          98         73         0           171
Manifest Diabetes               237        115        2           354
Diabetes symptom                102        61         0           163
Consequences/complications      200        87         2           289
Type 1 Diabetes                 172        75         1           248
Type 2 Diabetes                 248        125        2           375
Assoc. w/ other diseases too    230        108        2           340
Total (any temporal stage)      248        125        2           375
*IFG/IGT: impaired fasting glucose (level) / impaired glucose tolerance
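As a consistency check on Table 2, each category total should equal the sum of its HP, MP and new term counts. The sketch below transcribes the rows of Table 2 and tests that property; the data structure and function names are illustrative assumptions and not part of the published spreadsheets.

```python
from typing import List, NamedTuple


class CategoryCount(NamedTuple):
    """One row of Table 2: term counts per curation category."""
    category: str
    hp_terms: int
    mp_terms: int
    new_terms: int
    total_terms: int


# Rows transcribed from Table 2 (column order: HP, MP, new, total).
TABLE_2 = [
    CategoryCount("Diabetes cause", 45, 48, 0, 93),
    CategoryCount("IFG/IGT (Prediabetes)", 98, 73, 0, 171),
    CategoryCount("Manifest Diabetes", 237, 115, 2, 354),
    CategoryCount("Diabetes symptom", 102, 61, 0, 163),
    CategoryCount("Consequences/complications", 200, 87, 2, 289),
    CategoryCount("Type 1 Diabetes", 172, 75, 1, 248),
    CategoryCount("Type 2 Diabetes", 248, 125, 2, 375),
    CategoryCount("Assoc. w/ other diseases too", 230, 108, 2, 340),
    CategoryCount("Total (any temporal stage)", 248, 125, 2, 375),
]


def inconsistent_rows(rows: List[CategoryCount]) -> List[str]:
    """Return categories whose HP + MP + new counts differ from the stated total."""
    return [
        row.category
        for row in rows
        if row.hp_terms + row.mp_terms + row.new_terms != row.total_terms
    ]


print(inconsistent_rows(TABLE_2))  # an empty list means every row adds up
```

Because a phenotype can fall into several categories, the per-category totals are not expected to sum to the overall total of 375.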
3 KNOWLEDGE REPRESENTATION
The knowledge captured by the domain experts is available in two forms. First, we generated an
OWL representation of the hierarchy that the experts had initially organised as a spreadsheet.
We used the OWL-API (https://github.com/owlcs/owlapi/wiki) to transform the disease stages and
phenotypes from the spreadsheet into OWL. The information was modelled as 386 distinct classes,
and the URIs were retained for MP- and HP-derived terms. Associations between disease stages
and phenotypes were described by object properties that were defined during the curation
process. The ontology is available from BioPortal (http://purl.bioontology.org/ontology/DIAB);
a screenshot is shown in Figure 1.
Figure 1: The ontology visualised in Protégé, showing the representation of the HP-derived term ‘abdominal aortic aneurysms’.
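To make the spreadsheet-to-OWL step concrete, the following sketch turns a few spreadsheet rows into Turtle using only the Python standard library. It is not the authors' code, which used the Java OWL-API. The namespace, the column names, the phenotype IRIs and the choice of an existential restriction (owl:someValuesFrom) as the modelling pattern are all assumptions made for illustration; only the relation names and example phenotypes come from the curation categories described above.

```python
import csv
import io

# Hypothetical namespace for stages and properties (not the published DIAB IRIs).
DIAB = "http://example.org/diab#"

# Hypothetical spreadsheet export; real rows would carry the retained HP/MP URIs.
SPREADSHEET = """phenotype_iri,label,relation,stage
http://example.org/phenotype/apnea,apnea,manifest_in,prediabetes
http://example.org/phenotype/ketonuria,ketonuria,symptom_of,type_2_diabetes
"""

PREFIXES = (
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
)


def row_to_turtle(row: dict) -> str:
    """Render one row as an OWL class with an existential restriction.

    Labels are assumed to contain no double quotes or backslashes.
    """
    return (
        f"<{row['phenotype_iri']}> a owl:Class ;\n"
        f"    rdfs:label \"{row['label']}\" ;\n"
        f"    rdfs:subClassOf [ a owl:Restriction ;\n"
        f"        owl:onProperty <{DIAB}{row['relation']}> ;\n"
        f"        owl:someValuesFrom <{DIAB}{row['stage']}> ] .\n"
    )


def spreadsheet_to_turtle(text: str) -> str:
    """Convert the whole spreadsheet into a Turtle document."""
    rows = list(csv.DictReader(io.StringIO(text)))
    properties = sorted({row["relation"] for row in rows})
    stages = sorted({row["stage"] for row in rows})
    parts = [PREFIXES]
    # Declare each object property and stage class once.
    parts += [f"<{DIAB}{name}> a owl:ObjectProperty .\n" for name in properties]
    parts += [f"<{DIAB}{name}> a owl:Class .\n" for name in stages]
    parts += [row_to_turtle(row) for row in rows]
    return "\n".join(parts)


turtle_document = spreadsheet_to_turtle(SPREADSHEET)
print(turtle_document)
```

The resulting text can be loaded into an ontology editor such as Protégé for inspection.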
Secondly, we represented the disease-phenotype associations using the Open Biomedical
AssociatioN (OBAN, https://github.com/jamesmalone/OBAN) framework. This provides a generic
association model, in which association_has_subject points to the disease and
association_has_object points to the phenotype, and also allows the provenance supporting an
association (who, when, etc.) to be expressed. The OBAN class ‘provenance’ reuses the object
property has_evidence from MIRIAM (Juty et al., 2012) and the different evidence subclasses
from the Evidence Codes Ontology (Chibucos et al., 2014). An example of the OBAN representation
is provided in Figure 2.
Figure 2: Protégé screenshot of the OWL/RDF file
generated to represent the DIAB ontology based on
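The association model described above can be sketched as a small data structure. The property names association_has_subject, association_has_object and has_evidence come from the text; the linking property has_provenance, the field names, the evidence label and the placeholder date are hypothetical and serve only to illustrate the shape of the model.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Provenance:
    """Who asserted an association, when, and with what kind of evidence."""
    asserted_by: str
    asserted_on: date
    evidence: str  # stands in for an evidence class; the value used below is hypothetical


@dataclass(frozen=True)
class Association:
    """OBAN-style association linking a disease (subject) to a phenotype (object)."""
    subject_disease: str
    object_phenotype: str
    provenance: Provenance

    def as_triples(self, association_id: str, provenance_id: str) -> List[Triple]:
        """Flatten the association into subject-predicate-object triples."""
        return [
            (association_id, "association_has_subject", self.subject_disease),
            (association_id, "association_has_object", self.object_phenotype),
            # 'has_provenance' is an assumed name for the linking property.
            (association_id, "has_provenance", provenance_id),
            (provenance_id, "has_evidence", self.provenance.evidence),
        ]


example = Association(
    subject_disease="type 2 diabetes",
    object_phenotype="ketonuria",
    provenance=Provenance(
        asserted_by="domain expert review",
        asserted_on=date(2000, 1, 1),  # placeholder date, not from the source
        evidence="expert curation",    # hypothetical evidence label
    ),
)

for triple in example.as_triples("association_1", "provenance_1"):
    print(triple)
```

Keeping provenance as a separate node means the same disease-phenotype pair can carry several independent pieces of supporting evidence.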
4 CONCLUSION
We have developed an ontological representation of expert knowledge about T2D based on a simple
text mining and expert review process. The process provides a pragmatic and rapid method to
extract information from the literature and present it to domain experts for review, who then
refine the association types to represent the domain more precisely. We and others (e.g.
Brewster et al., 2009) have found text mining to be a useful tool in building ontologies, and
specifically for making associations between ontologies, here between disease and phenotype.
The process has since been used successfully to perform disease-phenotype association in
different disease areas with new domain experts. We have determined that terms from the mouse
and human phenotype ontologies together provide a more complete representation of human
phenotypes for T2D, and this is also confirmed in other disease areas. An expert mapping of
terms between the mouse and human ontologies is underway to determine whether new terms could
be added to the HP, which had an initial focus on rare disease phenotypes rather than those
associated with common disease. Future improvements to the process include modification of
input vocabularies to improve recall (e.g. phenotype terms from UMLS could be added) and
sub-selection of ontology terms excluding upper-level nodes to reduce noise for experts; for
example, HP has the root term ‘All’. We note that our process was likely aided by a corpus of
T2D literature from specialist journals, which presumably is the reason only two additional
T2D-associated phenotype terms were added by domain experts during review. One of these terms,
‘macular edema’, is now present in HP, but was not when the text mining was performed and was
therefore not retrieved. The second, ‘hyperosmolarity’, is present in SNOMED CT as a child of
‘metabolic finding’ and as a compound term with T2D in MedDRA, but is absent both from the
versions of HP and MP used in the text mining and from the current versions. HP and MP have
different foci and as such are not equivalent. For example, the terms confusion, mood changes
and nausea are present in HP but not in MP, whereas vaginitis and hypopnea are present in MP
but not in HP. As both the HP and MP have been revised since we completed this work, we plan to
re-run the text mining to evaluate whether they now provide an improved representation of this
domain, for example by inclusion of a richer set of synonyms, prior to suggesting new
terms.
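The difference in coverage between HP and MP noted above amounts to a set comparison over term labels. The sketch below uses only the example terms named in the text; the function name and the report keys are illustrative, and a real comparison would run over the full term lists of both ontologies.

```python
from typing import Dict, Iterable, List


def compare_term_sets(hp_terms: Iterable[str], mp_terms: Iterable[str]) -> Dict[str, List[str]]:
    """Report which term labels are unique to each ontology, shared, or in either."""
    hp, mp = set(hp_terms), set(mp_terms)
    return {
        "hp_only": sorted(hp - mp),
        "mp_only": sorted(mp - hp),
        "shared": sorted(hp & mp),
        "combined": sorted(hp | mp),  # the dictionary obtained by using both ontologies
    }


# Example terms named in the text as present in one ontology but not the other.
HP_EXAMPLES = {"confusion", "mood changes", "nausea"}
MP_EXAMPLES = {"vaginitis", "hypopnea"}

report = compare_term_sets(HP_EXAMPLES, MP_EXAMPLES)
print(report["combined"])
```

Matching on exact labels is a simplification; synonyms and spelling variants would need normalising before the sets are compared.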
The ontology is available in OWL and has also been transformed into the OBAN model as a generic
phenotype-disease association for community use. Possible improvements to the ontology
representation include capture of the relevance level of each phenotype for each disease stage.
Moreover, the OWL representation would benefit from more specific modelling of non-diabetic
disease relationships, currently represented as ‘disease other than diabetes’ by the domain
experts, and a complete import of the class information from MP/HP, including synonyms and
definitions, is required.
Our work represents a first step towards traversing mouse and human data and towards defining a
process for the association of common disease with phenotypes in human and mouse. To evaluate
the ontology in the integration of human and mouse data we will use it in the PhenoDigm tool
(Smedley et al., 2013). PhenoDigm uses a semantic approach to map between clinical features
observed in humans and mice, currently represented using the HP and MP ontologies. Using the
improved disease-phenotype representation we have produced, we will evaluate whether
identification of mouse models is improved and therefore whether we are able to improve more
widely the matching between phenotypic data from mouse models, for example those generated by
the International Mouse Phenotyping Consortium (Koscielny et al., 2013), and human disease
data. Finally, a manual mapping between the HP and MP phenotype terms we have detected is
underway. We are aware that both ontologies have evolved since we performed the text mining,
and that expert-verified mappings between terms are useful in data integration; these will
therefore be made public when they are available. The text mining code and the associated
spreadsheets can be found at https://github.com/EBISPOT/T2D. In future we will explore whether
the use of the T2D-associated phenotype terms can improve the retrieval of T2D-associated genes
from the literature, and will refine the ontology further by the inclusion of clinical tests
used in diagnosis to improve its use in electronic health record mining.
ACKNOWLEDGEMENTS
This work was supported by BioMedBridges Project funded by the European Commission within
Research Infrastructures of the FP7 Capacities Specific Programme, grant agreement number
284209 and EMBL Core funds to HP. Thanks to Anika Oellrich for discussion of text mining
approaches and Damian Smedley for comments on the manuscript.
REFERENCES
Amberger, J. S., Bocchini, C. A., Schiettecatte, F., Scott, A. F., & Hamosh, A. (2015). OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Research, 43(Database issue), D789–98. doi:10.1093/nar/gku1205
Brewster, C., Jupp, S., Luciano, J., Shotton, D., Stevens, R. D., & Zhang, Z. (2009). Issues in learning an ontology from text. BMC Bioinformatics, 10 Suppl 5, S1. doi:10.1186/1471-2105-10-S5-S1
Chibucos, M. C., Mungall, C. J., Balakrishnan, R., Christie, K. R., Huntley, R. P., White, O., … Giglio, M. (2014). Standardized description of scientific evidence using the Evidence Ontology (ECO). Database, 2014, bau075. doi:10.1093/database/bau075
Juty, N., Le Novère, N., & Laibe, C. (2012). Identifiers.org and MIRIAM Registry: community resources to provide persistent identification. Nucleic Acids Research, 40(Database issue), D580–6. doi:10.1093/nar/gkr1097
Kibbe, W. A., Arze, C., Felix, V., Mitraka, E., Bolton, E., Fu, G., … Schriml, L. M. (2015). Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Research, 43(Database issue), D1071–8. doi:10.1093/nar/gku1011
Koscielny, G., Yaikhom, G., Iyer, V., Meehan, T. F., Morgan, H., Atienza-Herrero, J., … Parkinson, H. (2013). The International Mouse Phenotyping Consortium Web Portal, a unified point of access for knockout mice and related phenotyping data. Nucleic Acids Research. doi:10.1093/nar/gkt977
Malone, J., Holloway, E., Adamusiak, T., Kapushesky, M., Zheng, J., Kolesnikov, N., … Parkinson, H. (2010). Modeling sample variables with an Experimental Factor Ontology. Bioinformatics, 26(8), 1112–1118.
Rebholz-Schuhmann, D., Arregui, M., Gaudan, S., Kirsch, H., & Jimeno, A. (2008). Text processing through Web services: calling Whatizit. Bioinformatics (Oxford, England), 24(2), 296–8. doi:10.1093/bioinformatics/btm557
Robinson, P. N., & Mundlos, S. (2010). The human phenotype ontology. Clinical Genetics, 77, 525–534.
Smedley, D., Oellrich, A., Köhler, S., Ruef, B., Westerfield, M., Robinson, P., … Mungall, C. (2013). PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database: The Journal of Biological Databases and Curation, 2013, bat025. doi:10.1093/database/bat025
Smith, C. L., & Eppig, J. T. (2012). The Mammalian Phenotype Ontology as a unifying standard for experimental and high-throughput phenotyping data. Mammalian Genome, 23(9-10), 653–668. doi:10.1007/s00335-012-9421-3