D3.5 High-Level Domain Model Version 2, with Sample/Phenotype Focus

HEALTH-F4-2007-200754
www.gen2phen.org
WP3 – Standard data models and terminologies
V5.0
Final draft
Lead beneficiary: EMBL
Date: 10/08/2009
Nature: Report
Dissemination level: PU (Public)
© Copyright 2009 GEN2PHEN Consortium
Authors: Tomasz Adamusiak, Juha Muilu,
Morris Swertz, Helen Parkinson
TABLE OF CONTENTS
DOCUMENT INFORMATION
DOCUMENT HISTORY
1. INTRODUCTION
2. DESCRIPTION OF WORK
3. EXISTING MODEL EVALUATION
3.1. GENOMEUTWIN
3.2. PAGE-OM
3.3. XGAP
4. GEN2PHEN PHENOTYPE MODEL
4.1. PHENOTYPE MODEL CLASS DESCRIPTIONS
4.2. OBJECT INSTANCE
5. PHENOTYPE MODEL IMPLEMENTATION AND TESTING
6. FUTURE PLANS
6.1. A HIGH-LEVEL DOMAIN MODEL VERSION 3 (D3.6)
6.2. DERIVATION AND SPECIFICATION OF EXCHANGE FORMAT (D3.7)
7. ABBREVIATIONS
REFERENCES
Document Information
Grant Agreement Number: HEALTH-F4-2007-200754
Acronym: GEN2PHEN
Full title: Genotype-To-Phenotype Databases: A Holistic Solution
Project URL: http://www.gen2phen.org
EU Project officer: Frederick Marcus ([email protected])
Deliverable: Number D3.5, Title High-Level Domain Model Version 2, with Sample/Phenotype Focus
Work package: Number 3, Title WP3 – Standard data models and terminologies
Delivery date: Contractual June 2009, Actual August 2009
Status: Version 5.0, final [x]
Nature: Report [x]  Prototype [ ]  Other [ ]
Dissemination Level: Public [x]  Confidential [ ]
Authors (Partner): Tomasz Adamusiak (EMBL), Juha Muilu (UH.FGC), Morris Swertz (EMBL),
Helen Parkinson (EMBL)
Responsible Author: Helen Parkinson
Email: [email protected]
Partner: EMBL-EBI
Phone: +44 (0)1223 494 672
Document History
Name               Date        Version   Description
Tomasz Adamusiak   16/6/2009   1
Helen Parkinson    7/7/2009    2
Tomasz Adamusiak   12/7/2009   3
Helen Parkinson    14/7/2009   4
Review             10/8/2009   5
1. INTRODUCTION
Work package 3, ‘Standard data models and terminologies’, provides domain standards to develop
the GEN2PHEN-specific architecture, facilitate data exchange and integrate data across existing
and emerging resources. The work package focuses on providing standards that act as the
foundation for much of the database development activity in other work packages.
The work package objectives include the rapid development of one or more standard data models
capable of representing the minimum agreed content standard (as determined by WP2), together
with a derived data exchange format. Data models developed in coordination with WP3 will have
several uses in GEN2PHEN. In particular, data from pre-existing databases will be mapped to the
models to generate data in the derived exchange format, offering a flexible solution for
integrating and exchanging existing and new data. Data model development is therefore a
necessary prerequisite, and is initially kept separate from implementation details.
2. DESCRIPTION OF WORK
The aims of the development process for the GEN2PHEN High-Level Domain Model Version 2, with
Sample/Phenotype Focus, are:
• To evaluate relevant public phenotype models
• To develop a core GEN2PHEN phenotype model
• To support primary GEN2PHEN use cases, especially in the LSDB and HTP domains
The two GEN2PHEN modelling workshops, in Hinxton (April 9-11, 2008) and Helsinki (January
19-22, 2009), laid the groundwork for specific sub-domain development. This work was
continued during the first GEN2PHEN Phenotype Workshop (Geneva, May 7-8, 2009, hosted
by SIB). At that workshop use cases were gathered, models were developed, and the minimum
content standards to be used in exchanging data between partners were discussed in the context
of specific phenotype extensions. See Appendix 1 for detailed workshop proceedings. External
invited participants from the epidemiology, medical genetics, ontology development and model
organism communities provided expertise and use cases beyond those of Consortium Partners.
3. Existing model evaluation
Several public data models currently exist in the phenotype space. Those closely aligned to
GEN2PHEN were evaluated during the First Phenotype Workshop for relevance, domain coverage
compared to existing resources, ease of use and complexity. Some of the data models have been
documented at www.schemalet.org, an experimental wiki site for documenting use-case-specific
data models.
3.1. GenomEUtwin
GenomEUtwin is a 5th Framework Programme project aimed at unifying studies of European
volunteer twins to identify genes underlying common diseases. The GenomEUtwin object model has
already been tested on large population cohorts by UH.FGC. See paragraph 4.1 in Appendix 1 for
a diagram and more details of the model.
3.2. PaGE-OM
PaGE-OM is a complete OMG standard reference model that represents genotype data both in
summary form and at the level of the individual. It also represents LSDB-type data and
phenotype, and supports some legacy technology use cases. PaGE-OM is very detailed and is
useful as a reference model, meaning that GEN2PHEN-specific models can be aligned to it and
that it can be used as a meta-mapping model for mapping external data representations. It is,
however, rather complex, and one aim of the WP3 modelling activities is to develop ‘modules’
whereby domain-specific models can be developed, used alone, implemented and made
interoperable. See paragraph 4.2 in Appendix 1 for a diagram and more details of the model.
3.3. XGAP
XGAP (http://www.xgap.org) addresses the challenges of data management, querying and
integration in system-wide genetics experiments. It provides a simple tabular text file format
to exchange data between collaborators, a customizable data infrastructure to store, query and
integrate data, and a foundation for analysis tools. See paragraph 4.3 in Appendix 1 for a
diagram and more details of the model.
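As a rough illustration of what exchanging data in a simple tabular text file can look like,
the Python sketch below writes and re-reads tab-delimited rows. The column names and sample
rows are hypothetical and chosen only for this example; this is not the XGAP file
specification.

```python
import csv
import io

# Hypothetical column layout for illustration; NOT the actual XGAP format.
COLUMNS = ["target", "feature", "value", "unit"]


def write_table(rows):
    """Serialise a list of dicts as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_table(text):
    """Parse tab-delimited text back into a list of dicts (all values are strings)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text), delimiter="\t")]


# Illustrative sample rows (hypothetical data).
rows = [
    {"target": "individual_1", "feature": "systolic blood pressure",
     "value": "120", "unit": "mmHg"},
    {"target": "individual_1", "feature": "diastolic blood pressure",
     "value": "80", "unit": "mmHg"},
]
text = write_table(rows)
round_tripped = read_table(text)
```

A plain text table of this kind can be opened in a spreadsheet or parsed by a script without
any special tooling, which is the main attraction of such a format for collaborators.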
4. GEN2PHEN Phenotype Model
Figure 1. GEN2PHEN Phenotype Model
A GEN2PHEN phenotype model was developed during the Phenotype Workshop in Geneva
based on Partners’ input and invited domain experts’ opinions. It was later iterated through a
series of face-to-face meetings and teleconferences among Partners. Figure 1 presents the 1.0
version of the model, constructed in Enterprise Architect. It is also available from the
schemalet.org website, as well as in Enterprise Architect and XML formats from the GEN2PHEN
SVN (https://svn.gene.le.ac.uk/gen2phen/trunk/object_models/).
4.1. Phenotype Model class descriptions
• Individual – Subject of a study.
• Inferred_value – Inferred conclusion, derived from zero or many Observed_value instances.
• Observable_feature – A measurable feature of an Individual, e.g. blood pressure.
• Observation_target – Superclass of all observation targets, such as Individual or Panel.
• Ontology_term – Term defined in a specific namespace (ontology source). All names
and terms should be defined using ontology terms whenever possible.
• Observed_value – Specific value measured in an experiment, e.g. 120 (systolic BP, mmHg).
• Panel – Collection of Individual instances.
• Protocol – Describes how a measurement is to be performed, or a specific Standard
Operating Procedure.
• Protocol_application – Describes how a Protocol was instantiated in a particular case, i.e.
how the measurement was done, e.g. on 16/6/2009 by Tomasz Adamusiak.
• Variable_definition – Extends the Observable_feature class to enable precise definition
of the feature in the applications that use it (for example, it has a unit).
Mappings to PaGE-OM and XGAP are available on the schemalet wiki at:
http://www.schemalet.org/mediawiki/index.php/COMMON:Phenotype
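The class descriptions above can be sketched as Python dataclasses. This is a minimal
illustration only: the class names follow the list above, but the attribute names and the
associations between classes are assumptions made for this example and are not taken from
the Enterprise Architect model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OntologyTerm:
    name: str
    source: str  # namespace (ontology source) in which the term is defined


@dataclass
class ObservationTarget:
    """Superclass of all observation targets."""
    name: str


@dataclass
class Individual(ObservationTarget):
    """Subject of a study."""


@dataclass
class Panel(ObservationTarget):
    """Collection of Individual instances."""
    individuals: List[Individual] = field(default_factory=list)


@dataclass
class ObservableFeature:
    name: str


@dataclass
class VariableDefinition(ObservableFeature):
    unit: Optional[str] = None  # assumed attribute: "for example, it has a unit"


@dataclass
class Protocol:
    name: str


@dataclass
class ProtocolApplication:
    protocol: Protocol
    performed_on: str                # assumed attribute, e.g. "16/6/2009"
    performer: Optional[str] = None  # assumed attribute


@dataclass
class ObservedValue:
    target: ObservationTarget
    feature: ObservableFeature
    value: str
    protocol_application: ProtocolApplication


@dataclass
class InferredValue:
    target: ObservationTarget
    feature: ObservableFeature
    value: str
    # "derived from zero or many Observed_value instances"
    derived_from: List[ObservedValue] = field(default_factory=list)


panel = Panel(name="example panel", individuals=[Individual(name="individual_1")])
systolic = VariableDefinition(name="systolic blood pressure", unit="mmHg")
```

Modelling Individual and Panel as subclasses of a common observation target lets a value be
attached to either a single subject or a group through the same reference.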
4.2. Object instance
Figure 2. GEN2PHEN Phenotype Model object instance
An example instance of the model is shown in Figure 2. A blood pressure measuring protocol
was applied to observation target Juha on 25/5/2009. Two values were measured at 10am: 150
and 90, which were systolic and diastolic blood pressure in mmHg respectively.
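The instance described above can be written down as plain records. The following Python
sketch uses the values from the text; the record and field names (ProtocolApplication,
performed_on, measured_at and so on) are hypothetical and chosen for illustration, not
taken from Figure 2.

```python
from datetime import date, time
from typing import NamedTuple


class ProtocolApplication(NamedTuple):
    protocol: str        # name of the protocol that was applied
    target: str          # observation target the protocol was applied to
    performed_on: date


class ObservedValue(NamedTuple):
    application: ProtocolApplication
    feature: str
    value: float
    unit: str
    measured_at: time


# One protocol application on 25/5/2009 ...
application = ProtocolApplication(
    protocol="blood pressure measurement",
    target="Juha",
    performed_on=date(2009, 5, 25),
)

# ... yielding two observed values, both measured at 10am.
observed = [
    ObservedValue(application, "systolic blood pressure", 150, "mmHg", time(10, 0)),
    ObservedValue(application, "diastolic blood pressure", 90, "mmHg", time(10, 0)),
]

for item in observed:
    print(item.application.target, item.feature, item.value, item.unit)
```

Both values point at the same protocol application, so the date and target are recorded once
rather than repeated for each measurement.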
Figure 3. Inferred value example
The instance depicted in Figure 3 extends the previous one to show how a previously measured
blood pressure can be used to infer disease status. A separate inference protocol was applied on
31/5/2009, and a high blood pressure was observed at 2pm.
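An inference step of this kind can be sketched in Python as a function that derives a status
from previously observed values and keeps a reference to the values it used. The cut-off
values below are hypothetical and chosen only for this illustration; the document does not
specify the inference protocol, and the record and field names are likewise assumptions.

```python
from typing import List, NamedTuple


class ObservedValue(NamedTuple):
    feature: str
    value: float
    unit: str


class InferredValue(NamedTuple):
    feature: str
    value: str
    derived_from: List[ObservedValue]  # zero or many observed values


# Hypothetical cut-offs in mmHg, for illustration only.
THRESHOLDS = {
    "systolic blood pressure": 140,
    "diastolic blood pressure": 90,
}


def infer_blood_pressure_status(observations):
    """Derive a blood pressure status from the relevant observed values."""
    used = [o for o in observations if o.feature in THRESHOLDS]
    if not used:
        status = "unknown"
    elif any(o.value >= THRESHOLDS[o.feature] for o in used):
        status = "high"
    else:
        status = "not high"
    return InferredValue("blood pressure status", status, used)


observations = [
    ObservedValue("systolic blood pressure", 150, "mmHg"),
    ObservedValue("diastolic blood pressure", 90, "mmHg"),
]
result = infer_blood_pressure_status(observations)
```

Keeping the list of source values on the inferred record preserves the link between a
conclusion and the measurements behind it.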
5. Phenotype Model implementation and testing
Figure 4. GEN2PHEN Phenotype Model implementation in Molgenis notation
To test and develop the GEN2PHEN phenotype model we have collaborated with the
developers of MOLGENIS [1, 2]. MOLGENIS is an open source software platform to efficiently
design, implement and autogenerate databases, APIs and web applications from object models.
Its power lies in the use of models and generators, so that the best solutions are easily
reused between applications. In one simple step MOLGENIS generates a database (MySQL or
PostgreSQL), a web-based GUI, and programmatic interfaces including a Java API and SOAP web
services usable in tools like Taverna (http://taverna.sourceforge.net) and by statistical
scripts written in the R language (http://www.r-project.org), as well as full documentation of
the object model. Several Java plug-in mechanisms are also available to customize the generated
software. By developing smaller models and ensuring interoperability using MOLGENIS, some or
all of the models can be consumed by the various partners, the majority of whom have use cases
that encompass only some of the models.
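The general idea of generating a database from a model can be illustrated with a toy Python
sketch. This is not MOLGENIS code and does not use its model notation: the dictionary-based
model description, the entity and field names, and the generate_ddl helper are all
hypothetical, and SQLite is used here only so that the example runs without a database server.

```python
import sqlite3

# Hypothetical model description: entity name -> {field name: SQL column type}.
MODEL = {
    "observable_feature": {"name": "TEXT NOT NULL"},
    "observed_value": {"feature_id": "INTEGER NOT NULL", "value": "TEXT"},
}


def generate_ddl(model):
    """Return one CREATE TABLE statement per entity in the model."""
    statements = []
    for entity, fields in model.items():
        columns = ["id INTEGER PRIMARY KEY"]
        columns += [f"{name} {sql_type}" for name, sql_type in fields.items()]
        statements.append(f"CREATE TABLE {entity} ({', '.join(columns)})")
    return statements


# Apply the generated statements to an in-memory database.
connection = sqlite3.connect(":memory:")
for statement in generate_ddl(MODEL):
    connection.execute(statement)

tables = [
    row[0]
    for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
```

Because the tables are derived from the model description, adding an entity to the model is
enough to extend the database, which is the reuse benefit that model-driven generators aim for.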
MOLGENIS has been successfully used within the GEN2PHEN Consortium for:
1. MAGE-TAB OM:
http://magetab-om.sourceforge.net
2. LSDB object model developed in the course of the Second Modelling Workshop:
http://magetab-om.sourceforge.net/lsdb/1.0/object_model.html
3. An example LSDB - Findis, the Finnish National Mutation Database (NMDB):
http://www.schemalet.org/mediawiki/index.php/FINDIS:Database
Figure 4 depicts the GEN2PHEN Phenotype Model as implemented on the MOLGENIS platform.
Full documentation is available in Appendix 2, and a working implementation of the model,
comprising a back-end database, GUI, etc., is available from:
http://wwwdev.ebi.ac.uk/microarray-srv/pheno/
6. FUTURE PLANS
6.1. A High-Level Domain Model Version 3 (D3.6)
This will be an improved and tested set of standard UML data models for all required domains,
ready to be implemented by all Partners. Feedback from Partners will then be used to provide
the ultimate design underpinnings for all GEN2PHEN databases in Iterative Specialized Domain
Modelling Complete (D3.9).
These sub-domain models, including the GEN2PHEN Phenotype Model, will all be extensively
tested, and a reference implementation will be provided on the MOLGENIS platform.
6.2. Derivation and Specification of Exchange Format (D3.7)
The priorities for data formats in GEN2PHEN are data exchange between locus-specific
databases and central repositories, and HTP data. The modelling work to date has separated
these domains to support immediate needs for data exchange. The models developed will
eventually support the phenotype extension reported here as well.
Validation of the LSDB data model commenced in 2009 by working with the existing LSDBs inside
and outside the GEN2PHEN consortium, most of whom have existing data formats. Those formats
will support the data content of the GEN2PHEN Phenotype Model.
Validation of the MAGE-TAB OM is underway and progress is promising. We envisage that
phenotypic descriptors, e.g. membership of a cohort through a shared phenotype or trait, will
require an extension of MAGE-TAB, and that the requirement to provide details of markers in
the context of HTP data will also require an extension.
7. Abbreviations
HGVS – Human Genome Variation Society
LSDB – Locus Specific Database
XGAP – Xtensible Genotype And Phenotype data platform
PaGE-OM – Phenotype and Genotype Experiment object model
REFERENCES
1. Swertz, M.A., et al., Molecular Genetics Information System (MOLGENIS): alternatives
in developing local experimental genomics databases. Bioinformatics, 2004. 20(13): p.
2075-83.
2. Swertz, M.A. and R.C. Jansen, Beyond standardization: dynamic software infrastructures
for systems biology. Nat Rev Genet, 2007. 8(3): p. 235-43.
3. Wildeman, M., et al., Improving sequence variant descriptions in mutation databases and
literature using the Mutalyzer sequence variation nomenclature checker. Hum Mutat,
2008. 29(1): p. 6-13.