Ontology application and use at the ENCODE DCC

Venkat Malladi
Data Wrangler, ENCODE DCC
Department of Genetics
Stanford University School of Medicine
Venkat Malladi ENCODE DCC
Overview
• Intro to ENCODE and the DCC
• Metadata Model
• Ontologies
• Search
• Future directions
What is ENCODE?
Approximately 30 different assays
Modified from PLoS Biol 9:e1001046, 2011 (M. Pazin)
Role of the Data Coordination Center
Data files and metadata flow from the production labs and analysis groups to the DCC, and from the ENCODE portal (DCC) to the Genome Browser, integrative websites, and the scientific community.
• Production labs and analysis groups
o Role: data generation
o Tasks: perform assays, perform analyses, validate data, submit data files, submit metadata
• DCC
o Role: data organization
o Tasks: data processing & validation, data file storage, metadata curation
• ENCODE portal (DCC)
o Role: data access
o Tasks: web-based searches, data downloads
Challenge: Find common biosamples from
data generated by two consortia
ENCODE cell types (http://encodeproject.org/ENCODE/cellTypes.html): 356 terms
GEO characteristics (common_name, tissue_type, cell_type, lines): 314 terms
Projects are internally consistent …
Simple text match
Cell type: 360 terms
GEO: 314 terms
Exact matches: IMR90, PBMC, Th17
… but only 3 biosample names match exactly between projects
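The simple text match can be sketched as a set intersection. This is a minimal illustration, not the DCC's code: the two lists below are small made-up subsets (the real lists have hundreds of terms), and which project each non-matching name belongs to is an assumption.

```python
def exact_matches(terms_a, terms_b):
    """Return the names that appear verbatim in both collections."""
    return sorted(set(terms_a) & set(terms_b))


# Illustrative subsets only; list membership of the non-matching names is assumed.
cell_type_terms = ["IMR90", "PBMC", "Th17", "HCF", "HCM", "Heart_OC"]
geo_terms = ["IMR90", "PBMC", "Th17", "Fetal Heart", "Right Atrium", "Heart"]

shared = exact_matches(cell_type_terms, geo_terms)
print(shared)
```

An exact match is case- and spelling-sensitive, so two names for the same biosample (for example "Heart_OC" and "Heart") never meet; this is the gap that ontology annotation is meant to close.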
Metadata annotation using Ontologies
An ontology is a set of words and relationships …
… All relationships must be true.
Parent term: cell
Intermediate terms: mitochondrion, nucleus, chromosome
Child term: mitochondrial chromosome
Relationship types: part_of, is_a (one relationship in the diagram is marked X)
Inferences can be based upon these relationships.
http://www.geneontology.org/GO.ontology.relations.shtml
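How such inferences work can be sketched in a few lines of Python. The edge list is an assumed reconstruction of the diagram, not a copy of it: the relationship marked X is taken to be "chromosome part_of nucleus" (not every chromosome is in a nucleus) and is left out. The composition rule (a chain containing any part_of yields part_of) follows the usual Gene Ontology reasoning rules; the function names are hypothetical.

```python
from collections import deque

# Assumed reconstruction of the diagram: (child, relation, parent).
EDGES = [
    ("mitochondrion", "part_of", "cell"),
    ("nucleus", "part_of", "cell"),
    ("chromosome", "part_of", "cell"),
    ("mitochondrial chromosome", "is_a", "chromosome"),
    ("mitochondrial chromosome", "part_of", "mitochondrion"),
]


def compose(first, second):
    """is_a followed by is_a stays is_a; any chain with part_of is part_of."""
    return "is_a" if first == second == "is_a" else "part_of"


def inferred_relations(term, edges=EDGES):
    """Return {ancestor: set of relations} that hold for `term`."""
    found = {}
    queue = deque([(term, None)])
    while queue:
        node, rel = queue.popleft()
        for child, edge_rel, parent in edges:
            if child != node:
                continue
            new_rel = edge_rel if rel is None else compose(rel, edge_rel)
            if new_rel not in found.setdefault(parent, set()):
                found[parent].add(new_rel)
                queue.append((parent, new_rel))
    return found


relations = inferred_relations("mitochondrial chromosome")
print(relations)
```

Under these assumptions the sketch infers that a mitochondrial chromosome is part_of a cell, although no edge states this directly, and it infers nothing about the nucleus.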
Why use ontologies?
Reason 1:
A consistent way of describing biological concepts
Reason 2:
Consistent language makes related data easy to identify
Reason 3:
Consistent data analysis: relationships between terms allow flexible grouping
while everyone uses the same set of metadata
What metadata is annotated with ontologies?
1. The biological sample serving as input (Biosample)
2. The reagents and conditions applied to the biological input (Treatment)
3. The set of methods and conditions to survey the biological input (Assay)
Biosample ontologies
1. Uber anatomy ontology (Uberon) - structure, location and heterogeneous mixture of cells
2. Cell Ontology (CL) - primary cells or stem cells
3. Experimental Factor Ontology (EFO) - biosamples with no direct corresponding anatomical structure or physiological cell type
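The choice of ontology per biosample can be sketched as a lookup from biosample category to the expected term ID prefix. The category names ("tissue", "cell line", and so on), the mapping of cell lines to EFO, and the helper name are assumptions made for illustration; the term IDs in the usage lines are examples to be verified against the ontologies themselves.

```python
# Hypothetical biosample categories mapped to the ontology prefix expected for them.
ONTOLOGY_FOR_TYPE = {
    "tissue": "UBERON",
    "primary cell": "CL",
    "stem cell": "CL",
    "cell line": "EFO",  # assumption: cell lines have no anatomical or cell-type term
}


def has_expected_prefix(biosample_type, term_id):
    """Check that a term ID such as 'UBERON:0000948' uses the expected ontology."""
    expected = ONTOLOGY_FOR_TYPE.get(biosample_type)
    if expected is None:
        raise ValueError(f"unknown biosample type: {biosample_type!r}")
    prefix, sep, local_id = term_id.partition(":")
    return bool(sep) and local_id.isdigit() and prefix == expected


print(has_expected_prefix("tissue", "UBERON:0000948"))
print(has_expected_prefix("primary cell", "UBERON:0000948"))
```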
Challenge: Find all heart-related tissues?
Heart_OC
HCF
HCFaa
HCM
Others?
Fetal Heart
Heart
Right Atrium
Right Ventricle
Others?
Searching ENCODE metadata
Searching using “heart”
Heart_OC
HCF
HCFaa
HCM
Others?
Fetal Heart
Heart
Right Atrium
Right Ventricle
Others?
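A plain text search for "heart" can be sketched as a case-insensitive substring match over the biosample names listed above. This is an illustration only, with a hypothetical function name; it is not the portal's search implementation.

```python
BIOSAMPLE_NAMES = [
    "Heart_OC", "HCF", "HCFaa", "HCM",
    "Fetal Heart", "Heart", "Right Atrium", "Right Ventricle",
]


def text_search(query, names):
    """Return the names that contain the query, ignoring case."""
    needle = query.lower()
    return [name for name in names if needle in name.lower()]


hits = text_search("heart", BIOSAMPLE_NAMES)
missed = [name for name in BIOSAMPLE_NAMES if name not in hits]
print(hits)
print(missed)
```

The search finds only the names that literally contain "heart"; abbreviations such as HCF and substructures such as Right Atrium are missed even though they are heart-related.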
Ontology driven search
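One way to picture an ontology-driven search: each biosample is annotated with an ontology term, and a query for "heart" returns every biosample whose term is "heart" or has "heart" among its ancestors. Everything below is a toy: the term labels, the parent links, and the annotations of each biosample name are assumptions made for illustration, not ENCODE's actual annotations or search code.

```python
# Toy ontology: term -> parent terms (is_a or part_of links, all assumed).
PARENTS = {
    "right cardiac atrium": ["heart"],
    "heart right ventricle": ["heart"],
    "cardiac fibroblast": ["heart"],
    "cardiac muscle cell": ["heart"],
    "heart": ["organ"],
    "lung": ["organ"],
    "organ": [],
}

# Hypothetical annotations: biosample name -> ontology term.
ANNOTATIONS = {
    "Heart_OC": "heart",
    "HCF": "cardiac fibroblast",
    "HCFaa": "cardiac fibroblast",
    "HCM": "cardiac muscle cell",
    "Fetal Heart": "heart",
    "Heart": "heart",
    "Right Atrium": "right cardiac atrium",
    "Right Ventricle": "heart right ventricle",
    "IMR90": "lung",
}


def ancestors(term):
    """Collect every term reachable by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ontology_search(query_term):
    """Return biosamples annotated with the query term or any descendant of it."""
    return sorted(
        name
        for name, term in ANNOTATIONS.items()
        if term == query_term or query_term in ancestors(term)
    )


heart_samples = ontology_search("heart")
print(heart_samples)
```

With these toy annotations the query returns all eight heart-related names, whatever they are called, and leaves out IMR90.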
Additional ontologies
• Protein Ontology (PRO, http://pir.georgetown.edu/pro/pro.shtml)
o transforming growth factor beta-1 (human) - PR:P01137
• EDAM Ontology (EDAM, http://edamontology.org)
o FASTQ - format:1930; BAM - format:2572
o sequence alignment - data:0863
Ontology based validations
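The slide gives only a title, so the specific checks the DCC runs are not stated here. One plausible check, sketched under that caveat, is that a submitted term ID exists in the ontology and that the submitted term name matches the ontology's label. The lookup table is a tiny stand-in for a real ontology file, and the function name is hypothetical.

```python
# Stand-in for an ontology lookup: term ID -> label.
ONTOLOGY_LABELS = {
    "UBERON:0000948": "heart",
    "format:1930": "FASTQ",
    "format:2572": "BAM",
}


def validate_term(term_id, term_name, labels=ONTOLOGY_LABELS):
    """Return a list of error messages; an empty list means the pair is valid."""
    errors = []
    label = labels.get(term_id)
    if label is None:
        errors.append(f"unknown term ID: {term_id}")
    elif label != term_name:
        errors.append(f"{term_id} is labeled '{label}', not '{term_name}'")
    return errors


print(validate_term("UBERON:0000948", "heart"))
print(validate_term("UBERON:0000948", "Heart_OC"))
print(validate_term("UBERON:9999999", "heart"))
```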
Acknowledgments
Eurie Hong, Mike Cherry (PI), Jim Kent (co-PI), Ben Hitz
Data Wranglers: Esther Chan, Jean Davidson, Venkat Malladi, Cricket Sloan, J. Seth Strattan
Software Engineers: Nikhil Podduturi, Laurence Rowe, Forrest Tanaka
QA, administration, biocuration: Brian Lee, Stuart Miyasato, Matt Simison, Zhenhua Wang, Marcus Ho
Funding: National Institute of General Medical Sciences of the United States National Institutes of Health (GM10331601); U41 grant from the National Human Genome Research Institute at the U.S. National Institutes of Health (HG006992)
We’re hiring!!!
You? Data Wranglers; Software Engineers; QA, administration, biocuration