D3.5 High-Level Domain Model Version 2, with Sample/Phenotype Focus

HEALTH-F4-2007-200754
www.gen2phen.org
WP3 – Standard data models and terminologies
V5.0
Final
Lead beneficiary: EMBL
Date: 10/08/2009
Nature: Report
Dissemination level: PU (Public)
© Copyright 2009 GEN2PHEN Consortium
D3.5 High-Level Domain Model Version 2, with
Sample/Phenotype Focus
HEALTH-200754
WP3 – Standard data models and
terminologies
Authors: Tomasz Adamusiak, Juha Muilu,
Morris Swertz, Helen Parkinson
Security: PU
Version: v5.0
Final
TABLE OF CONTENTS

APPENDIX I - Report on the First GEN2PHEN Phenotype Workshop
APPENDIX II - GEN2PHEN Phenotype Model Reference Implementation
Document Information
Grant Agreement Number: HEALTH-F4-2007-200754
Acronym: GEN2PHEN
Full title: Genotype-To-Phenotype Databases: A Holistic Solution
Project URL: http://www.gen2phen.org
EU Project officer: Frederick Marcus ([email protected])
Deliverable: Number D3.5, Title: High-Level Domain Model Version 2, with Sample/Phenotype Focus
Work package: Number 3, Title: WP3 – Standard data models and terminologies
Delivery date: Contractual June 2009, Actual August 2009
Status: Version 5.0, final
Nature: Report
Dissemination Level: Public
Authors (Partner): Tomasz Adamusiak (EMBL), Juha Muilu (UH.FGC), Morris Swertz (EMBL), Helen Parkinson (EMBL)
Responsible Author: Helen Parkinson
Email: [email protected]
Partner: EMBL-EBI
Phone: +44 (0)1223 494 672
Document History
Name               Date        Version   Description
Tomasz Adamusiak   16/6/2009   1         First Draft Created
Helen Parkinson    7/7/2009    2         Internal Review
Tomasz Adamusiak   12/7/2009   3         Corrections
Helen Parkinson    14/7/2009   4         Comments
Helen Parkinson    10/8/2009   5         Review
Definitions
• Partners of the GEN2PHEN Consortium are referred to herein according to the following codes:
ULEIC – University of Leicester (UK) – Coordinator
EMBL – European Molecular Biology Laboratory (Germany) – Beneficiary
FIMIM – Fundació IMIM (Spain) – Beneficiary
LUMC – Leiden University Medical Center (Netherlands) – Beneficiary
INSERM – Institut National de la Santé et de la Recherche Médicale (France) – Beneficiary
KI – Karolinska Institutet (Sweden) – Beneficiary
FORTH – Foundation for Research and Technology Hellas (Greece) – Beneficiary
CEA – Commissariat à l’Energie Atomique (France) – Beneficiary
EMC – Erasmus Universitair Medisch Centrum Rotterdam (Netherlands) – Beneficiary
UH.FGC – Helsingin Yliopisto (Finland) – Beneficiary
UAVR – Universidade de Aveiro (Portugal) – Beneficiary
UWC – University of the Western Cape (South Africa) – Beneficiary
CSIR – Council of Scientific and Industrial Research (India) – Beneficiary
SIB – Swiss Institute of Bioinformatics (Switzerland) – Beneficiary
UNIMAN – The University of Manchester (UK) – Beneficiary
BIOBASE – BioBase GmbH. (Germany) – Beneficiary
deCODE – Íslensk erfðagreining ehf. (Iceland) – Beneficiary
PHENO – Phenosystems S.A. (Belgium) – Beneficiary
BCP – Biocomputing Platforms Ltd. Oy (Finland) – Beneficiary
• Grant Agreement: The agreement signed between the beneficiaries and the European Commission for the undertaking of the GEN2PHEN project (HEALTH-200754).
• Project: The sum of all activities carried out in the framework of the Grant Agreement by the Consortium.
• Work plan: Schedule of tasks, deliverables, efforts, dates and responsibilities corresponding to the work to be carried out for the GEN2PHEN project, as specified in Annex I to the Grant Agreement.
• Consortium: The GEN2PHEN Consortium, formed by the above-mentioned legal entities.
• Consortium agreement: Agreement concluded amongst GEN2PHEN participants for the implementation of the Grant Agreement. Such an agreement shall not affect the parties’ obligations to the Community and/or to one another arising from the Grant Agreement.
1. INTRODUCTION
Work package 3 ‘Standard data models and terminologies’ provides domain standards to develop
GEN2PHEN specific architecture, facilitate data exchange and integrate data across existing and
emerging resources. This work package is focused on providing standards to act as the
foundation for much of the database development activities of other work packages.
The work package objectives include the rapid development of a standard data model(s) capable
of representing the minimum agreed content standard (as determined by WP2) and a derived data
exchange format. Data models developed in coordination with WP3 will have several uses in
GEN2PHEN: data from pre-existing databases will be mapped to generate data in a derived data
exchange format, thus offering a flexible solution for integrating and exchanging existing and
new data. In this respect, data model development is a necessary prerequisite, initially separated
from implementation details.
2. DESCRIPTION OF WORK
The focus of the GEN2PHEN High-Level Domain Model Version 2, with Sample/Phenotype
Focus development process is:
• To evaluate relevant public phenotype models
• To develop a core GEN2PHEN phenotype model
• To support primary GEN2PHEN use cases, especially in LSDB and HTP domains
The two GEN2PHEN modelling workshops, Hinxton (April 9-11, 2008) and Helsinki (January 19-22, 2009), laid the groundwork for specific sub-domain development. Work continued during the first GEN2PHEN Phenotype Workshop (Geneva, May 7-8, 2009), hosted by SIB. Use cases were gathered, models were developed, and minimum content standards for exchanging data between partners were discussed in the context of specific phenotype extensions. See Appendix 1 for detailed workshop proceedings. External invited participants from the epidemiology, medical genetics, ontology development and model organism communities provided expertise and use cases beyond those of Consortium Partners.
3. Existing model evaluation
Several public data models currently exist in the phenotype space. Those closely aligned to GEN2PHEN were evaluated during the First Phenotype Workshop for relevance, domain coverage compared to existing resources, ease of use and complexity. Some of the data models have been documented at www.schemalet.org, an experimental wiki site for documenting use case specific data models.
3.1. GenomEUtwin
GenomEUtwin is a 5th Framework Programme project aimed at unifying studies of European volunteer twins to identify genes underlying common diseases. The GenomEUtwin object model has already been tested on large population cohorts by UH.FGC. See paragraph 4.1 in Appendix 1 for a diagram and more details of the model.
3.2. PaGE-OM
PaGE-OM is a complete OMG standard reference model that represents genotype data both in summary form and at the level of the individual. It also represents LSDB-type data and phenotype, and supports some legacy technology use cases. PaGE-OM is very detailed and is useful as a reference model, meaning that GEN2PHEN-specific models can be aligned to it and that it can be used as a meta-mapping model for mapping external data representations. It is, however, rather complex, and one aim of WP3 modelling activities is to develop ‘modules’ whereby domain-specific models can be developed, used alone, implemented and made interoperable. See paragraph 4.2 in Appendix 1 for a diagram and more details of the model.
3.3. XGAP
XGAP (http://www.xgap.org) addresses the challenges of system-wide genetics experiments in data management, querying and integration. It provides a simple tabular text file format to exchange data between collaborators, a customizable data infrastructure to store, query and integrate data, and a foundation for analysis tools. See paragraph 4.3 in Appendix 1 for a diagram and more details of the model.
4. GEN2PHEN Phenotype Model
Figure 1. GEN2PHEN Phenotype Model
A GEN2PHEN phenotype model was developed during the Phenotype Workshop in Geneva, based on Partners’ input and invited domain experts’ opinions. It was later iterated through a series of face-to-face meetings and teleconferences among Partners. Figure 1 presents the 1.0 version of the model, constructed in Enterprise Architect. It is also available from the schemalet.org website, as well as in Enterprise Architect and XML formats from the GEN2PHEN SVN (https://svn.gene.le.ac.uk/gen2phen/trunk/object_models/).
4.1. Phenotype Model class descriptions
• Individual – Individual. Subject of a study.
• Inferred_value – Inferred conclusion, derived from zero or many Observed_value instances.
• Observable_feature – A measurable feature of an Individual, e.g. blood pressure.
• Observation_target – Super class of all observation targets like Individual or Panel.
• Ontology_term – Term defined in a specific namespace (ontology source). All names and terms should be defined using ontology terms whenever possible.
• Observed_value – Specific value measured in an experiment, e.g. 120 (systolic BP, mmHg).
• Panel – Collection of Individual instances.
• Protocol – Describes how a measurement is to be performed, or a specific Standard Operating Procedure.
• Protocol_application – Describes how a Protocol was instantiated in a particular case, i.e. how the measurement was done, e.g. on 16/6/2009 by Tomasz Adamusiak.
• Variable_definition – Extends the Observable_feature class to enable a precise definition of the feature as used in applications (for example, it has a unit).
Mappings to PaGE-OM and XGAP are available on the schemalet wiki at:
http://www.schemalet.org/mediawiki/index.php/COMMON:Phenotype
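The class descriptions above can be read as a small type hierarchy. The following is a minimal Python sketch of one possible reading, not the reference implementation: the class names follow the model, but every attribute name (for example `individuals`, `performed_on`, `derived_from`) is an assumption made for illustration, and Inferred_value is treated as a subclass of Observed_value.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OntologyTerm:
    term: str    # the term itself
    source: str  # namespace (ontology source) that defines the term


@dataclass
class ObservationTarget:
    """Super class of all observation targets."""
    name: str


@dataclass
class Individual(ObservationTarget):
    """Subject of a study."""


@dataclass
class Panel(ObservationTarget):
    """Collection of Individual instances."""
    individuals: List[Individual] = field(default_factory=list)


@dataclass
class ObservableFeature:
    """A measurable feature, e.g. blood pressure."""
    name: str
    ontology_term: Optional[OntologyTerm] = None


@dataclass
class VariableDefinition(ObservableFeature):
    """Observable feature made precise for an application, e.g. with a unit."""
    unit: Optional[str] = None


@dataclass
class Protocol:
    """How a measurement is to be performed."""
    name: str


@dataclass
class ProtocolApplication:
    """How a Protocol was instantiated in a particular case."""
    protocol: Protocol
    performed_on: str                 # assumed free-text date, e.g. "16/6/2009"
    performer: Optional[str] = None


@dataclass
class ObservedValue:
    target: ObservationTarget
    feature: ObservableFeature
    value: str
    protocol_application: Optional[ProtocolApplication] = None


@dataclass
class InferredValue(ObservedValue):
    """Conclusion derived from zero or many ObservedValue instances."""
    derived_from: List[ObservedValue] = field(default_factory=list)


juha = Individual(name="Juha")
cohort = Panel(name="example cohort", individuals=[juha])
systolic = VariableDefinition(name="systolic blood pressure", unit="mmHg")
```

Because Individual and Panel share the ObservationTarget super class, a value can be attached to either a single subject or a whole panel without changing ObservedValue.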
4.2. Object instance
Figure 2. GEN2PHEN Phenotype Model object instance
An example instance of the model is shown in Figure 2. A blood pressure measuring protocol
was applied to observation target Juha on 25/5/2009. Two values were measured at 10am: 150
and 90, which were systolic and diastolic blood pressure in mmHg respectively.
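The instance described above can also be written down as data. The sketch below is illustrative only: it uses two simplified, hypothetical record types rather than the full model, and the field names are assumptions.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class ProtocolApplication:
    protocol: str      # name of the Protocol that was applied
    target: str        # name of the Observation_target
    applied_on: date


@dataclass(frozen=True)
class ObservedValue:
    feature: str       # Observable_feature being measured
    value: float
    unit: str
    time: datetime
    application: ProtocolApplication


# One protocol application: blood pressure measured on Juha on 25/5/2009.
application = ProtocolApplication(
    protocol="blood pressure measurement",
    target="Juha",
    applied_on=date(2009, 5, 25),
)

# Two values measured at 10am under that single application.
measured_at = datetime(2009, 5, 25, 10, 0)
systolic = ObservedValue("systolic blood pressure", 150, "mmHg", measured_at, application)
diastolic = ObservedValue("diastolic blood pressure", 90, "mmHg", measured_at, application)

for v in (systolic, diastolic):
    print(f"{v.application.target}: {v.feature} = {v.value} {v.unit} at {v.time:%H:%M}")
```

Both values point at the same ProtocolApplication, which is how the model groups values that were measured together.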
Figure 3. Inferred value example
The instance depicted in Figure 3 extends the previous one to show how a previously measured
blood pressure can be used to infer disease status. A separate inference protocol was applied on
31/5/2009, and a high blood pressure was observed at 2pm.
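A sketch of how such an inference might be expressed in code follows. The function `infer_blood_pressure_status` stands in for the inference protocol, and the cut-off values are hypothetical, chosen only for illustration; the report does not specify how the inference is made.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class ObservedValue:
    target: str
    feature: str
    value: object
    time: datetime


@dataclass
class InferredValue(ObservedValue):
    # The observed values the conclusion was derived from (zero or many).
    derived_from: List[ObservedValue] = field(default_factory=list)


# Hypothetical cut-offs in mmHg, assumed for this illustration only.
SYSTOLIC_LIMIT = 140
DIASTOLIC_LIMIT = 90


def infer_blood_pressure_status(systolic, diastolic, when):
    """Derive a disease-status value from two earlier measurements."""
    high = systolic.value >= SYSTOLIC_LIMIT or diastolic.value >= DIASTOLIC_LIMIT
    return InferredValue(
        target=systolic.target,
        feature="blood pressure status",
        value="high blood pressure" if high else "normal blood pressure",
        time=when,
        derived_from=[systolic, diastolic],
    )


measured = datetime(2009, 5, 25, 10, 0)
systolic = ObservedValue("Juha", "systolic blood pressure", 150, measured)
diastolic = ObservedValue("Juha", "diastolic blood pressure", 90, measured)

# The inference is made later than the measurements: 31/5/2009 at 2pm.
status = infer_blood_pressure_status(systolic, diastolic, datetime(2009, 5, 31, 14, 0))
print(status.value, "derived from", len(status.derived_from), "observed values")
```

Keeping `derived_from` on the inferred value preserves the link back to the measurements, so the conclusion can be re-checked if the inference rule changes.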
5. Phenotype Model implementation and testing
Figure 4. GEN2PHEN Phenotype Model implementation in Molgenis notation
In order to test and develop the GEN2PHEN phenotype model we have collaborated with the developers of MOLGENIS [1, 2]. MOLGENIS is an open source software platform to efficiently design, implement, and autogenerate databases, APIs, and web applications from object models. Its power is in the use of models and generators, so that the best solutions are easily reused between applications. In one simple step MOLGENIS generates a database (MySQL or PostgreSQL), a web-based GUI, programmatic interfaces including a Java API, SOAP web services usable in tools like Taverna (http://taverna.sourceforge.net) and by statistical scripts written in the R language (http://www.r-project.org), as well as full documentation of the object model. Several Java plug-in mechanisms are also available to customize the generated software. By developing smaller models and ensuring interoperability using MOLGENIS, some or all of the models can be consumed by various partners, the majority of whom have use cases which encompass only some of the models.
MOLGENIS has been successfully used within the GEN2PHEN Consortium for:
1. MAGE-TAB OM:
http://magetab-om.sourceforge.net
2. LSDB object model developed in the course of the Second Modelling Workshop:
http://magetab-om.sourceforge.net/lsdb/1.0/object_model.html
3. An example LSDB - Findis, the Finnish National Mutation Database (NMDB):
http://www.schemalet.org/mediawiki/index.php/FINDIS:Database
Figure 4 depicts the GEN2PHEN Phenotype Model as implemented on the MOLGENIS platform. Full documentation is available in Appendix 2, and a working implementation of the model, comprising a back-end database, GUI, etc., is available from:
http://wwwdev.ebi.ac.uk/microarray-srv/pheno/
6. FUTURE PLANS
6.1. A High-Level Domain Model Version 3 (D3.6)
This will be an improved and tested set of standard UML data models for all required domains, ready to be implemented by all Partners. Feedback from Partners will then be used to provide the ultimate design underpinnings for all GEN2PHEN databases in Iterative Specialized Domain Modelling Complete (D3.9).
These sub-domain models, including the GEN2PHEN Phenotype Model, will all be extensively tested and a reference implementation will be provided on the MOLGENIS platform.
6.2. Derivation and Specification of Exchange Format (D3.7)
The priorities for data formats in GEN2PHEN are the data exchange between locus specific
databases and central repositories and HTP data. The modelling work to date has separated these
domains to support immediate needs for data exchange. The models developed will eventually
support the phenotype extension reported here as well.
Validation of the LSDB data model commenced in 2009 by working with the existing LSDBs inside and outside the GEN2PHEN consortium, most of whom have existing data formats. Those formats will support the data content of the GEN2PHEN Phenotype Model.
Validation of the MAGE-TAB OM is underway and progress is promising. We envisage that the phenotypic descriptors, e.g. membership of a cohort through a shared phenotype or trait, will require an extension of MAGE-TAB, and the requirement to provide details of markers in the context of HTP data will also require an extension.
7. Abbreviations
HGVS – Human Genome Variation Society
LSDB – Locus Specific Database
XGAP – Xtensible Genotype And Phenotype data platform
PaGE-OM – Phenotype and Genotype Experiment object model
REFERENCES
1. Swertz, M.A., et al., Molecular Genetics Information System (MOLGENIS): alternatives in developing local experimental genomics databases. Bioinformatics, 2004. 20(13): p. 2075-83.
2. Swertz, M.A. and R.C. Jansen, Beyond standardization: dynamic software infrastructures for systems biology. Nat Rev Genet, 2007. 8(3): p. 235-43.
3. Wildeman, M., et al., Improving sequence variant descriptions in mutation databases and literature using the Mutalyzer sequence variation nomenclature checker. Hum Mutat, 2008. 29(1): p. 6-13.
D3.5 High-Level Domain Model Version 2, with
Sample/Phenotype Focus. Appendix 1 - Report on the First
GEN2PHEN Phenotype Workshop
HEALTH-200754
WP3 – Standard data models and
terminologies
Authors: Tomasz Adamusiak, Juha Muilu,
Morris Swertz, Gudmundur Thorisson, Helen
Parkinson
Security: PU
Version: v1.0
Appendix 1 Report on the First GEN2PHEN Phenotype Workshop
Host: Swiss Institute of Bioinformatics (SIB)
Venue: Centre Medicale Universitaire (CMU), 1 Rue Michel-Servet, CH1211 Geneva
Dates: 7-8 May 2009
1. Overview
The First GEN2PHEN Phenotype Workshop (Geneva, 7-8 May 2009) was hosted by SIB as a follow-up to the Second Modelling Workshop hosted by UH.FGC (Helsinki, 19-22 January 2009). See http://askja.gene.le.ac.uk/drupal5/Modelling_Workshop_2_Report for details on the previous workshop. Use cases and models evaluated previously served as a basis for developing minimum content standards for exchanging phenotypic information among partners, as well as for building and evaluating a preliminary phenotype model in partial fulfilment of WP3 deliverable D3.5.
Use cases identified in the Genotype to Phenotype domain in a previous deliverable, D3.1, were subsequently refined by contact with the wider community and used to drive the development of a domain-independent phenotype model. Various domain models already exist, and the workshop began the process of evaluating these for GEN2PHEN needs. This report describes the workshop content.
2. Participants
Consortium members
Name                      Organisation
Andrew Devereau           UNIMAN
Mike Cornell              UNIMAN
Veronique Humbertclaude   INSERM
Christophe Beroud         INSERM
Anna Pigeon               INSERM
David Atlan               PHENO
Gudmundur Thorisson       ULEIC
Sergio Matos              UAVR
Anne-Lise Veuthey         SIB
Lydie Bougueleret         SIB
Annais Mottaz             SIB
Lina Yip                  SIB
Juha Muilu                UH.FGC
Helen Parkinson           EMBL
James Malone              EMBL
Tomasz Adamusiak          EMBL
Invited domain experts
Domain experts represented, among others, the following consortia: CASIMIR (www.casimir.org.uk), ENGAGE (www.euengage.org), GenomEUtwin (www.genomeutwin.org), BBMRI (www.bbmri.eu) and P3G (www.p3g.org).
Name                  Organisation
Alan Rector           UNIMAN
Peter Robinson        Charite Universitaetsmedizin
John Hancock          MRC
Paul Burton           ULEIC
Isabel Fortier        ENEP
Morris Swertz         University Medical Center Groningen
Mauno Vihinen         EMBL
Maria Krestyaninowa   EMBL
Mike Gostev           EMBL
Ilkka Lappalainen     EMBL
Sraboni Ghost         Genionics
Abriel Hugues         Universitaet Bern
3. Agenda and slides
Agenda and speakers' slides are available from http://askja.gene.le.ac.uk/drupal5/content/firstphenotype-workshop-agenda
4. Models evaluated
4.1. TWIN:Phenotype
Observation is a phenotypic observation made by a specific method, which is documented under an observation framework. Classification is an inferred or classified conclusion drawn from one or more measurements (here blood pressure). Ontology is the name space (e.g. EUTwin) used for the vocabulary (i.e. high blood pressure, low blood pressure), and Classification method provides information on the classification specification. Time_accuracy is needed because it is not always possible to know the time exactly (e.g. in some cases the exact time cannot be given, and the date and month must be coded using an agreed convention).
More information on the model is available on the Schemalet website:
http://www.schemalet.org/mediawiki/index.php/TWIN:Phenotype
4.2. PAGEOM:Phenotype
Observable features (e.g. nose size) can be measured using different observation methods (e.g. a ruler), leading to single or multiple observed values (nose size) over one or more observation targets (individuals). Features can be categorised under different feature categories (e.g. clinical test, heart function, etc.).
More information on the model is available on the Schemalet website:
http://www.schemalet.org/mediawiki/index.php/PAGEOM:Phenotype
4.3. XGAP:Trait
XGAP-OM is the conceptual model behind the XGAP platform. It can be used to consistently model a wide variety of organisms, experimental designs, and biomolecular profiling technologies:
• Describe core experimental data using only four core data types: Trait, Subject, Data and DataElement.
• Add experimental design annotations using core FuGE data types: Investigation, Protocols and ProtocolApplications, OntologyTerms, etc.
• Consistently annotate Traits and Subjects using standardized extensions of Trait (e.g. Probe, Marker) and Subject (e.g. Individual, Strain).
• Consistently extend XGAP for new types of annotations by adding more types of Trait and Subject (e.g. add 'MassPeak' as a new Trait to annotate 'retentiontime' and 'mz').
More information on the model is available from http://www.xgap.org/objectmodel.html
5. Models developed
5.1. COMMON:Phenotype
Note: the attributes were not added during the workshop and the model will be amended with
them after a cooperative iteration effort.
• Individual - Individual. Subject of study.
• Inferred_value - Inferred conclusion, derived from zero or many observed values.
• Observable_feature - Something we can measure in relation to an individual, for example blood pressure.
• Observation_target - Super class of all observation targets like Individual or Panel.
• Observed_value - Measured value.
• Panel - Collection of individuals.
• Protocol - Description of how a measurement is planned to be done.
• Protocol_Application - Description of how an actual measurement was done (optionally different from the protocol).
• Ontology_term - Term defined in a specific name space (ontology source). All names and terms will be defined using ontology terms.
More information on the model is available on the Schemalet website:
http://www.schemalet.org/mediawiki/index.php/COMMON:Phenotype
The model is also available for download in the following formats:
• Enterprise Architect: http://bio-models.svn.sourceforge.net/viewvc/bio-models/trunk/object_models/enterprise_architect/phenotype.eap?view=log
• XML: http://bio-models.svn.sourceforge.net/viewvc/bio-models/trunk/object_models/enterprise_architect/phenotype.xml?view=log
5.2. MOLGENIS:Pheno implementation
This is a preliminary evaluation of the model, which will be further developed among Partners. More detailed documentation is available from http://bio-models.svn.sourceforge.net/viewvc/bio-models/molgenis4phenotype/WebContent/doc/objectmodel.html
6. Minimal information on phenotype
It was agreed that reporting of phenotypes is inconsistent. For example, only some of the observation targets may be annotated: a report may state that an ultrasound of the liver was significant in one of the subjects, but give no information for the other observation targets. It is then unclear whether they have also been tested. There are also a number of ethical ramifications, which will be followed up in the Ethics Session during the upcoming Fourth GEN2PHEN General Assembly Meeting. It was also suggested that minimal information should be content specific, e.g. obligatory smoking status in reporting of hypertension.
It was agreed that published phenotypic information should at least contain the following information about observation targets:
• Age
• Gender
• Age of onset
• Ontology (controlled vocabulary) term for signs and symptoms
Optional information would include:
• Therapy information (ontology coverage is coming up short in this domain)
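The agreed minimum lends itself to a simple completeness check. The sketch below is an illustration under stated assumptions: the field names and the helper `missing_minimal_information` are hypothetical, and the ontology identifier is a placeholder rather than a real term.

```python
# Required and optional items as agreed at the workshop; the key names are assumed.
REQUIRED_FIELDS = ("age", "gender", "age_of_onset", "sign_or_symptom_terms")
OPTIONAL_FIELDS = ("therapy",)


def missing_minimal_information(record):
    """Return the required fields that are absent or empty in `record`."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "", [])]


complete = {
    "age": 54,
    "gender": "female",
    "age_of_onset": 47,
    "sign_or_symptom_terms": ["EXAMPLE:0000001"],  # placeholder identifier
}
incomplete = {"age": 54, "gender": "female"}

print(missing_minimal_information(complete))
print(missing_minimal_information(incomplete))
```

A check of this kind makes explicit which observation targets lack information, which addresses the ambiguity noted above between "not tested" and "not reported".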
7. Pathogenicity
Agreeing on the meaning of pathogenicity was a challenging task, as different communities use the term in slightly different ways. It was proposed to distinguish between pathogenicity modifiers (positive/negative) and directly pathogenic factors. Pathogenicity could refer to a variant causing disease or risk, but in a medical setting it rather refers to a mutation causing a disease. A definition for diagnostic labs would also have to be different. A definition stating that pathogenicity leads to disease was found too broad, and the final version defined pathogenicity as an ability to cause disease.
Issues raised during the discussion
• Laboratory testing aims to link the existence of a variant to the occurrence of a disease (bias in over-reporting of pathogenicity).
• Pathogenicity is not recorded often enough, although it is hugely important and extremely useful.
• How to record values? It was proposed to use a continuous scale (e.g. p-values) to represent pathogenicity values. It was agreed that from a practical point of view it is more feasible to deal with four levels. This should, however, be extended with two further values: not known and unclassified.
• Pathogenicity values should be backed up by an evidence reference, e.g. journal paper.
• In some cases a context is required, e.g. it is pathogenic only in association with...
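The points raised above suggest a small record structure: four levels plus the two extra values, an evidence reference, and an optional context. The following is a hypothetical sketch only. The workshop report does not name the four levels, so `LEVEL_1` to `LEVEL_4` are placeholders, and requiring evidence for graded values is one possible policy rather than an agreed rule.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Pathogenicity(Enum):
    # Placeholder names: the four levels are not named in the report.
    LEVEL_1 = 1
    LEVEL_2 = 2
    LEVEL_3 = 3
    LEVEL_4 = 4
    # The two additional values proposed during the discussion.
    NOT_KNOWN = "not known"
    UNCLASSIFIED = "unclassified"


@dataclass
class PathogenicityAssertion:
    variant: str
    level: Pathogenicity
    evidence: Optional[str] = None  # e.g. a journal paper reference
    context: Optional[str] = None   # e.g. a condition under which it applies

    def __post_init__(self):
        # Assumed policy: a graded level must carry an evidence reference.
        graded = self.level not in (Pathogenicity.NOT_KNOWN, Pathogenicity.UNCLASSIFIED)
        if graded and not self.evidence:
            raise ValueError("a graded pathogenicity value needs an evidence reference")


supported = PathogenicityAssertion(
    "variant-A", Pathogenicity.LEVEL_4, evidence="journal paper reference"
)
unknown = PathogenicityAssertion("variant-B", Pathogenicity.NOT_KNOWN)
```

Using an enumeration keeps "not known" and "unclassified" distinct from a missing value, which matters when pathogenicity is under-recorded.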
Appendix 2. GEN2PHEN Phenotype Model reference
implementation
HEALTH-200754
WP3 – Standard data models and
terminologies
Security: PU
Authors: Morris Swertz
Version: 1
Appendix 2 GEN2PHEN Phenotype Model Reference Implementation
The GEN2PHEN Phenotype Model is a minimal data model to represent a data set of phenotypic observations resulting from one or more investigations. The objective is to harmonize the exchange of phenotype descriptions between various repositories and to host phenotype information ranging from annotations in locus specific databases to rich clinical reports from cohort studies. The initial version of this model was compiled at the GEN2PHEN phenotype workshop (Geneva, 7th-8th May 2009), building on previous modeling efforts from the XGAP, PaGE, FuGE, LOVD, and MAGE-TAB projects. Where appropriate, mappings to these models are provided.
This document was created by: Morris Swertz, Juha Muilu, Gudmundur Thorisson, Tomasz Adamusiak, Isabel Fortier, Paul Burton, John Hancock, Ilkka Lappalainen, Anthony Brookes, other members of the GEN2PHEN collaboration and Helen Parkinson. This work is sponsored by EU-GEN2PHEN, EU-CASIMIR, P3G, NWO-Rubicon, NBIC BioAssist/Biobanking.
Changelog/decisions 11-06-2009 (following G2P AM4):
1. Added a self-reference on Protocol to create aggregated protocols. Use case: a study is a set of questionnaires, each questionnaire being a protocol.
2. Added VariableDefinition as a subclass of ObservableFeature and moved the attribute 'unit' from ObservedValue to VariableDefinition. VariableDefinition can refer to one (?) ObservableFeature concept. Use case: a questionnaire (protocol) is defined to measure 'length' in cm; 'length' is the observable feature, 'length in cm' the VariableDefinition. Motivation: if unit was defined on ObservedValue then one cannot define the unit for a protocol. If unit was defined in two places (protocol and value level) then they can conflict with each other.
3. Added a timestamp to both ProtocolApplication and ObservedValue. Use case: blood pressure was measured at ten-minute intervals, at 8:00, 8:10, 8:20. The motivation for this is that protocols often include repeated measurements. A positive example is the use case of a blood pressure time series. A negative example is 'blood pressure standing' and 'blood pressure lying down', which are different ObservableFeatures.
4. Adapted the description of ProtocolApplication to say it is an 'instance' of the protocol usage.
5. Did not change observableFeature.name into observableFeature.description; this is not advisable as it is inconsistent.
6. Did not replace subclass InferredValue with a directional self-reference on ObservedValue, for clarity.
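Decisions 2 and 3 above can be illustrated together. In this hedged sketch the unit is declared once, on the VariableDefinition, and each ObservedValue carries its own timestamp; the field names, the calendar date and the second and third readings are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ObservableFeature:
    name: str


@dataclass(frozen=True)
class VariableDefinition(ObservableFeature):
    unit: str = ""  # the unit lives here, not on ObservedValue (decision 2)


@dataclass(frozen=True)
class ObservedValue:
    variable: VariableDefinition
    value: float
    time: datetime  # timestamp on the value itself (decision 3)


systolic_mmhg = VariableDefinition(name="systolic blood pressure", unit="mmHg")

# A time series at 8:00, 8:10 and 8:20; date and readings are illustrative.
series = [
    ObservedValue(systolic_mmhg, value, datetime(2009, 6, 11, 8, minute))
    for minute, value in [(0, 150), (10, 146), (20, 141)]
]

# Every value resolves its unit through the shared definition, so units cannot conflict.
units = {v.variable.unit for v in series}
print(units)
```

Because the three values share one VariableDefinition they form a time series of the same feature, whereas 'blood pressure standing' and 'blood pressure lying down' would each need their own definition.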
Changelog/decisions 12-06-2009 (following meeting Juha Muilu, Morris Swertz, Tomasz
Adamusiak):
1. Protocol.name is not unique within an investigation, as it can be reused in multiple studies; a relationship is definable via ProtocolApplication.
2. ObservationTargets are not unique to one investigation, as they can be observed in multiple studies; a relationship is definable via the ObservedValue.
3. Self-recursion on ObservedValue for multivalue and derived values was dropped for reasons of simplicity. Until shown otherwise, multivalue features can be grouped by protocol.
4. ObservedValue name is not made unique within an investigation, as that would defeat its purpose of integrating between studies.
5. There is no explicit relationship between ObservedValue.value and Code.term; such constraint checking is outside the scope of this model.
6. Added a 'value' to ParameterValue, which was missing.
7. Changed Code so that it does not extend the OntologyTerm class but instead refers to an instance.
8. InferredValue seems not normalized, in the sense that one has to repeat the ObservationTarget, which is implied via the ObservedValues it refers to. However, this is not changed, because an inference may be provided without providing the ObservedValues, or a Panel-level inference may be derived from a set of individual-level ObservedValues.
Table of contents
pheno.system package: Identifiable, Nameable, OntologySource, OntologyTerm
pheno.observation package: Investigation, ObservableFeature, ObservedValue, InferredValue, ObservationTarget
pheno.target package: Individual, Panel
pheno.variable package: VariableDefinition, CodeList, Code
pheno.protocol package: Protocol, ProtocolApplication, ProtocolParameter, ParameterValue
1. pheno.system package
This package describes basic classes that are used as building blocks for the pheno.core model.
1.1. Identifiable (interface)
(For implementation purposes) The Identifiable interface provides its sub-classes with a unique
numeric identifier within the scope of one database. This class maps to FuGE::Identifiable
(together with the Nameable interface).
Attributes:
id: int (required)
Automatically generated id-field
1.2. Nameable (interface)
(For modeling purposes) The Nameable interface provides its sub-classes with a meaningful name
that need not be unique. This class maps to FuGE::Identifiable (together with the Identifiable
interface).
Attributes:
name: string (required)
A human-readable and potentially ambiguous common identifier
1.3. OntologySource
implements Identifiable, Nameable
The OntologySource class defines a reference to an existing ontology or controlled vocabulary
from which well-defined and stable (ontology) terms can be obtained. For instance: MO, GO,
EFO, UMLS, etc. Use of existing ontologies/vocabularies is recommended to harmonize
phenotypic descriptions. This class maps to FuGE::OntologySource, MAGETAB::TermSourceREF.
Attributes:
ontologyURI: hyperlink (required)
A URI that references the location of the ontology.
1.4. OntologyTerm
implements Identifiable
The OntologyTerm class defines references to a single entry from an ontology or a controlled
vocabulary. Other classes can reference this OntologyTerm to harmonize naming of concepts.
Each term should have a local, unique label. Good practice is to label it 'sourceid:term', e.g.
'MO:cell'. If no suitable ontology term exists, one can define new terms locally, in which case
there is no formal accession for the term. In those cases the local name should be repeated in
both term and termAccession. Maps to FuGE::OntologyIndividual; in MAGE-TAB there is no
separate entity to model terms.
Attributes:
term: string (required)
The ontology term itself, also known as the 'local name' in some ontologies.
termLabel: string (required)
The label that is used to refer to this term inside this data set. For instance 'MO:cell'
termAccession: string (optional)
The accession number assigned to the ontology term in the source ontology. If empty it is
assumed to be a locally defined term.
Associations:
termSource: OntologySource (0..1)
The source ontology or controlled vocabulary list that ontology terms have been obtained
from.
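The labelling convention described for OntologyTerm can be illustrated with a minimal Python sketch. It is not part of the reference implementation: the snake_case field names, the helper make_term, the placeholder URI and the placeholder accession are all assumptions made for illustration, and the numeric id from Identifiable is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OntologySource:
    name: str          # Nameable.name, e.g. "MO"
    ontology_uri: str  # ontologyURI (required)


@dataclass(frozen=True)
class OntologyTerm:
    term: str                                      # the term itself ("local name")
    term_label: str                                # label used inside this data set
    term_accession: Optional[str] = None           # accession in the source ontology
    term_source: Optional[OntologySource] = None   # termSource (0..1)


def make_term(term: str,
              source: Optional[OntologySource] = None,
              accession: Optional[str] = None) -> OntologyTerm:
    """Hypothetical helper that applies the 'sourceid:term' labelling practice."""
    if source is None:
        # Locally defined term: the local name is repeated in both
        # term and termAccession, as the class description suggests.
        return OntologyTerm(term=term, term_label=term, term_accession=term)
    return OntologyTerm(term, f"{source.name}:{term}", accession, source)


mo = OntologySource("MO", "http://example.org/mo")        # placeholder URI
cell = make_term("cell", mo, accession="ACC-PLACEHOLDER")  # placeholder accession
local = make_term("my local phenotype")                    # no source ontology

print(cell.term_label)       # MO:cell
print(local.term_accession)  # my local phenotype
```

Using the bare term as the label of a local term is a choice made here; the text only requires that the label be unique within the data set.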
2. pheno.observation package
This package describes the minimal model for phenotypes.
2.1. Investigation
implements Identifiable, Nameable
The Investigation class defines self-contained units of study, each having a unique name and a
group of actions (protocol applications) and/or results (in ObservedValues). For instance:
Framingham study. Maps to XGAP/FuGE Investigation and MAGE-TAB experiment.
Discussion: should we adopt MAGE-TAB::IDF type of minimal information about an
investigation?
2.2. ObservableFeature
implements Identifiable, Nameable
The ObservableFeature class defines anything that can be observed (there may be many
alternative protocols to measure it). For instance: systolic blood pressure, diastolic blood
pressure, treatment for hypertension. These names are unique within a data set. Preferably each
ObservableFeature should be named according to a well-defined ontology. This class maps to
XGAP Trait, FuGE DimensionElement and PaGE ObservableFeature. Multi-value features can
be grouped by protocol. For instance: blood pressure consists of observations for features
systolic and diastolic blood pressure.
Associations:
ontologyReference: OntologyTerm (0..1)
Reference to the formal ontology definition for this feature
2.3. ObservedValue
implements Identifiable
The ObservedValue class defines the actual observation. For instance: 160 mmHg, 90 mmHg,
"no treatment". This class has no FuGE equivalent because in FuGE the data-protocolapplication
association is reversed, i.e. the ProtocolApplication has input/output Data
(which could be ObservedValues). Maps to XGAP DataElement, which uses the FuGE approach, so
observed values are grouped into 'Data'; maps to PaGE observed value.
Attributes:
time: datetime (required)
time when the protocol was applied.
value: string (required)
The value observed
Associations:
investigation: Investigation (1..1)
Reference to the Investigation this observedValue belongs to.
observationTarget: ObservationTarget (1..1)
Reference to the subject that has been observed
observableFeature: ObservableFeature (1..1)
Reference to the feature that was observed
protocolApplication: ProtocolApplication (0..1)
Reference to the protocol application that produced this observation
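As an illustration of the attributes and associations listed above, the following Python sketch records the blood pressure example as ObservedValues. It is a simplified reading of the model, not the reference implementation: the snake_case names and the timestamp are assumptions, the numeric id is omitted, and ProtocolApplication is reduced to a name and a time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Investigation:
    name: str


@dataclass
class ObservationTarget:
    name: str


@dataclass
class ObservableFeature:
    name: str


@dataclass
class ProtocolApplication:
    name: str
    time: datetime


@dataclass
class ObservedValue:
    time: datetime                                # required
    value: str                                    # required
    investigation: Investigation                  # 1..1
    observation_target: ObservationTarget         # 1..1
    observable_feature: ObservableFeature         # 1..1
    protocol_application: Optional[ProtocolApplication] = None  # 0..1


framingham = Investigation("Framingham study")
subject = ObservationTarget("individual 1")
systolic = ObservableFeature("systolic blood pressure")
diastolic = ObservableFeature("diastolic blood pressure")
treatment = ObservableFeature("treatment for hypertension")
when = datetime(2009, 6, 12, 8, 0)  # placeholder timestamp
measurement = ProtocolApplication("blood pressure measurement", when)

values = [
    ObservedValue(when, "160 mmHg", framingham, subject, systolic, measurement),
    ObservedValue(when, "90 mmHg", framingham, subject, diastolic, measurement),
    ObservedValue(when, "no treatment", framingham, subject, treatment),
]

# Multi-value features are grouped through the shared protocol application.
grouped = [v.observable_feature.name for v in values
           if v.protocol_application is measurement]
print(grouped)
```

The third value has no protocol application, which the 0..1 multiplicity allows.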
2.4. InferredValue
extends ObservedValue
The InferredValue class defines ObservedValues that are inferred as result of human or
computational post-processing of previous ObservedValues. The protocol used for this inference
can be defined via the protocolApplication association that is inherited from ObservedValue. For
instance: hypertensive = yes when mean arterial pressure = 135 AND no hypertension affecting
medicine is taken. This class has no direct mapping to other models: XGAP would use
input/output Data; PaGE would use a self-reference on ObservedValue.
Implementation discussion: how to make the derivedFrom relationship understandable in the UI.
This would need a multicolumn lookup including target, feature, value, and unit. Now one just
gets a value.
Associations:
derivedFrom: ObservedValue (1..n)
References to one or more observed values that were used to infer this observation
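The derivedFrom association can be sketched in Python as follows. This is an illustrative reduction, not the reference implementation: targets and features are plain strings, the function infer_hypertensive is hypothetical, and both the ">=" comparison against 135 and the "no" result in all other cases are assumptions (the text only states when the value is "yes").

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObservedValue:
    target: str   # simplified: name of the ObservationTarget
    feature: str  # simplified: name of the ObservableFeature
    value: str


@dataclass
class InferredValue(ObservedValue):
    # derivedFrom (1..n): the observations this inference is based on
    derived_from: List[ObservedValue] = field(default_factory=list)


def infer_hypertensive(map_obs: ObservedValue,
                       treatment: ObservedValue) -> InferredValue:
    """Hypothetical inference rule following the example in the text."""
    if map_obs.target != treatment.target:
        raise ValueError("inputs must describe the same observation target")
    high = float(map_obs.value) >= 135            # assumed threshold comparison
    untreated = treatment.value == "no treatment"
    return InferredValue(
        target=map_obs.target,  # repeated, although implied by derived_from
        feature="hypertensive",
        value="yes" if high and untreated else "no",
        derived_from=[map_obs, treatment],
    )


pressure = ObservedValue("individual 1", "mean arterial pressure", "135")
medication = ObservedValue("individual 1", "treatment for hypertension",
                           "no treatment")
result = infer_hypertensive(pressure, medication)
print(result.value, len(result.derived_from))
```

The repeated target on the inferred value mirrors the normalization trade-off noted in the changelog: it is redundant here, but it lets an inference stand on its own when the source observations are not supplied.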
2.5. ObservationTarget
implements Identifiable, Nameable
The ObservationTarget class defines the subjects of observation. For instance: individual 1 from
study x. This class maps to XGAP subject and to PaGE Abstract_Observation_Target. The
name of an ObservationTarget is unique within its Investigation.
3. pheno.target package
3.1. Individual
extends ObservationTarget
The Individual class defines human cases that are used as observation targets. This class maps to
XGAP and PaGE individual.
Discussion: what minimal properties should be hard-coded? E.g. sex is assumed to be an
ObservableFeature, while in PaGE/XGAP it is a direct property of individual.
Attributes:
sex: enum (required)
Associations:
species: OntologyTerm (1..1)
mother: Individual (0..1)
Refers to the mother of the individual.
father: Individual (0..1)
Refers to the father of the individual.
3.2. Panel
extends ObservationTarget
The Panel class defines groups of individuals that can act as a single ObservationTarget. Thus a
whole group can have ObservedValues such as 'middle aged man' or 'recombinant mouse inbred
Line dba x b6'. This class maps to XGAP/PaGE panel classes.
Associations:
individuals: Individual (1..n)
The list of individuals in this panel
4. pheno.variable package
The variable package provides classes to define variables as used within a protocol/questionnaire.
Variables are specific types of observable features in that they have a unit attached.
4.1. VariableDefinition
extends ObservableFeature
The VariableDefinition class extends the ObservableFeature class to enable precise definition of
the unit of ObservableFeature.
Associations:
unit: OntologyTerm (1..1)
Reference to the well-defined measurement unit used to observe this feature (if the feature is
that concrete). E.g. mmHg
codeList: CodeList (0..1)
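The relation between an ObservableFeature and a VariableDefinition can be sketched with the 'length' versus 'length in cm' use case. This Python fragment is only an illustration: the unit is simplified to a string instead of an OntologyTerm reference, the codeList association is left out, and format_value is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class ObservableFeature:
    name: str


@dataclass
class VariableDefinition(ObservableFeature):
    unit: str  # simplified: the model references an OntologyTerm (1..1)


def format_value(variable: VariableDefinition, value: str) -> str:
    """Hypothetical helper: the unit comes from the variable, not the value."""
    return f"{value} {variable.unit}"


length = ObservableFeature("length")                # the abstract feature
length_in_cm = VariableDefinition("length in cm", unit="cm")  # as used in a questionnaire

print(format_value(length_in_cm, "172"))  # 172 cm
```

Because the unit lives on the variable, every value recorded against 'length in cm' shares one unit definition and cannot contradict the protocol.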
4.2. CodeList
implements Identifiable, Nameable
The CodeList class names lists of discrete values that are available as options for a particular
VariableDefinition.
4.3. Code
implements Identifiable
The Code class names the code values for a particular CodeList. It can refer to an OntologyTerm
(rather than extending it), adding the option to define pretty labels. For instance 'f=female',
'm=male'.
Attributes:
value: string (required)
The value that represents the code in the data
label: string (required)
The pretty label that represents the human-understandable meaning of the code. For
instance the label on a CRF.
Associations:
codeList: CodeList (1..1)
The code-list this code is defined to be part of
ontologyTerm: OntologyTerm (0..1)
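A code list such as 'f=female', 'm=male' can be sketched in Python as below. This is an assumption-laden illustration rather than the reference implementation: the model lets each Code point to its CodeList, whereas here the list holds its codes in a dictionary to make lookup easy, and the add and decode methods are hypothetical. The optional ontologyTerm reference is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Code:
    value: str  # the value that represents the code in the data
    label: str  # the human-readable label, e.g. as printed on a CRF


@dataclass
class CodeList:
    name: str
    codes: Dict[str, Code] = field(default_factory=dict)

    def add(self, value: str, label: str) -> None:
        # Assumed rule: a value may appear only once per code list.
        if value in self.codes:
            raise ValueError(f"duplicate code value: {value!r}")
        self.codes[value] = Code(value, label)

    def decode(self, value: str) -> str:
        """Translate a stored code value into its label."""
        return self.codes[value].label


sex = CodeList("sex")
sex.add("f", "female")
sex.add("m", "male")

print(sex.decode("f"))  # female
```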
5. pheno.protocol package
The protocol package provides classes to describe protocols that are planned, or have been used,
for observation. This can include questionnaires, wet-lab protocols and dry-lab protocols. The
package is very similar to FuGE/XGAP and MAGE-TAB.
5.1. Protocol
implements Identifiable, Nameable
The Protocol class defines parameterizable descriptions of methods; each protocol has a unique
name within a dataset. Each Protocol can define the ObservableFeatures it can
observe as well as the optional Parameters. For instance: SOP for blood pressure measurement
used by UK Biobank. This class maps to FuGE/XGAP/MAGE-TAB Protocol, but in contrast to
FuGE it is not required to extend protocol before use. Note that FuGE's mechanism of
parameters (for protocol) and parametervalues (for application) is not shown. Has no equivalent
in PaGE.
Associations:
observableFeatures: ObservableFeature (0..n)
The features that can be observed using this protocol.
protocolComponents: Protocol (0..n)
The set of protocols that together make up this protocol. For instance: a set of
questionnaires.
5.2. ProtocolApplication
implements Identifiable, Nameable
The ProtocolApplication class defines the actual action of observation by instantiating a protocol
and optional ParameterValues. For example: the action of blood pressure measurement on 1000
individuals, using a particular protocol, resulting in 1000 associated observed values. This class
maps to FuGE/XGAP ProtocolApplication, but in FuGE ProtocolApplications can take Material
or Data (or both) as input and produce Material or Data (or both) as output. Similar to
PaGE.ObservationMethod.
Attributes:
time: datetime (required)
time when the protocol was applied.
Associations:
protocol: Protocol (1..1)
Reference to the protocol that is being used.
investigation: Investigation (1..1)
Reference to the Investigation this ProtocolApplication belongs to.
5.3. ProtocolParameter
implements Identifiable, Nameable
ProtocolParameter represents a variable of a Protocol that is instantiated as a ParameterValue
(see ParameterValue). For instance 'growth temperature' in a protocol where yeast are grown at
permissive and non-permissive temperatures. It implements Unit to define the parameter type and
allowed values.
ProtocolParameter maps to FuGE::Parameter
Associations:
protocol: Protocol (0..1)
5.4. ParameterValue
implements Identifiable
A ParameterValue is instantiated when a ProtocolApplication applies a Protocol with
Parameters. ParameterValue implements Measurement to provide values and Units for
ParameterValues. The FuGE equivalent to ParameterValue is FuGE::ParameterValue
Attributes:
value: string (required)
The chosen value of the parameter within this protocol application
Associations:
protocolApplication: ProtocolApplication (1..1)
Reference to the protocol application for which this parameter value was chosen
protocolParameter: ProtocolParameter (1..1)
Reference to the protocol parameter that is being bound by this value
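How a ParameterValue binds a ProtocolParameter within one ProtocolApplication can be sketched as follows. The Python below is a simplified illustration, not the reference implementation: the investigation association and numeric ids are omitted, the timestamp and names are placeholders, and the consistency check in the hypothetical bind_parameter helper is an assumption that the model itself does not state.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Protocol:
    name: str


@dataclass
class ProtocolParameter:
    name: str
    protocol: Optional[Protocol] = None  # protocol (0..1)


@dataclass
class ProtocolApplication:
    name: str
    time: datetime
    protocol: Protocol                   # protocol (1..1)


@dataclass
class ParameterValue:
    value: str
    protocol_application: ProtocolApplication  # 1..1
    protocol_parameter: ProtocolParameter      # 1..1


def bind_parameter(application: ProtocolApplication,
                   parameter: ProtocolParameter,
                   value: str) -> ParameterValue:
    """Hypothetical helper: bind a parameter for one protocol application."""
    # Assumed check: a parameter tied to a protocol may only be bound by
    # applications of that same protocol.
    if parameter.protocol is not None and parameter.protocol is not application.protocol:
        raise ValueError("parameter belongs to a different protocol")
    return ParameterValue(value, application, parameter)


growth = Protocol("yeast growth")                       # placeholder name
growth_temperature = ProtocolParameter("growth temperature", growth)
run = ProtocolApplication("growth run 1", datetime(2009, 6, 12, 8, 0), growth)

bound = bind_parameter(run, growth_temperature, "permissive")
print(bound.protocol_parameter.name, "=", bound.value)
```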
6. Supplementary figure: complete data model