Interoperable metadata leads to integrative analyses

Biocuration 2015
April 25, 2015
Mike Cherry
Department of Genetics
Stanford University, School of Medicine
Information curated by SGD
• Biochemical pathways
• Cellular pathways
• Chromosomal feature annotation
• Full-text papers and abstracts
• Functional genomics
• Gene Ontology
• Gene expression
• Gene regulation
• Genetic interactions
• Mutant phenotypes
• Post-translational modification
• Protein complex
• Protein domains
• Protein interactions
• Sequence
• Strain differences
Balakrishnan, poster #96
Engel, poster #75
Genes by GO term
Genes by publication
Genes by the number of datasets in which their expression profiles are highly correlated
Genes by interaction
Role of a Genomic Resource
Experimental data
Genomic Resource
Publications
Computational analyses
Data generation
Literature curation
Data wrangling
Data integration
Hypothesis generation
Additional integration
Used in analysis
Google Trends
http://google.com/trends
ENCODE Assays and Elements
Questions we want to answer
1. ChIP-seq results on K562 targeting RNA-binding proteins
2. Which fastq files were used to create this integrated analysis file?
3. Which version of bwa was used to process this file?
4. Show experiments that have a TF bound near my gene of interest
5. Find all RNA-seq experiments completed on liver tissue or primary cells from liver
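Questions 2 and 3 are provenance questions: they can be answered by walking back from a file to everything it was derived from. The following is a minimal sketch, assuming each file record lists its parent files and the software that produced it. The accessions, the field names (`derived_from`, `software`) and the version strings are hypothetical placeholders, not the actual ENCODE schema.

```python
# Hypothetical file records: accessions, field names and versions are
# placeholders for illustration only.
FILES = {
    "FILE_A": {"format": "fastq", "derived_from": [], "software": None},
    "FILE_B": {"format": "fastq", "derived_from": [], "software": None},
    "FILE_C": {"format": "bam", "derived_from": ["FILE_A", "FILE_B"],
               "software": ("bwa", "x.y.z")},
    "FILE_D": {"format": "bed", "derived_from": ["FILE_C"],
               "software": ("peak-caller", "x.y")},
}


def upstream(accession, files):
    """Yield every file the given file was derived from, directly or indirectly."""
    stack = list(files[accession]["derived_from"])
    seen = set()
    while stack:
        acc = stack.pop()
        if acc in seen:
            continue
        seen.add(acc)
        yield acc
        stack.extend(files[acc]["derived_from"])


def source_fastqs(accession, files):
    """Question 2: which fastq files were used to create this file?"""
    return sorted(a for a in upstream(accession, files)
                  if files[a]["format"] == "fastq")


def software_versions(accession, files, name):
    """Question 3: which versions of tool `name` touched this file or its inputs?"""
    chain = [accession, *upstream(accession, files)]
    return sorted({files[a]["software"][1] for a in chain
                   if files[a]["software"] and files[a]["software"][0] == name})


fastqs = source_fastqs("FILE_D", FILES)
bwa_versions = software_versions("FILE_D", FILES, "bwa")
```

Both answers come from the same traversal, which is why recording the parent files and the software on every file record is enough to support them.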
An ontology is a set of words with different types of relationships to each other.
All relationships must be true, because inferences can be made based on these relationships.
[Figure: example ontology graph with the terms cell, mitochondrion, nucleus, chromosome and mitochondrial chromosome, linked by part_of and is_a relationships; some part_of links are marked with an X. Labels: Parent term, Child term.]
http://www.geneontology.org/GO.ontology.relations.shtml
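Why every relationship must be true can be shown with a small inference sketch. It assumes two rules: part_of is transitive, and a term inherits the part_of relationships of its is_a parents. The edges below are one assumed reading of the diagram, not a copy of the Gene Ontology, and the function name is hypothetical.

```python
from collections import deque


def inferred_part_of(term, is_a, part_of):
    """Return every term that `term` is part_of, explicitly or by inference.

    A term counts as a "whole" once at least one part_of edge has been
    crossed on the way to it; is_a edges may be crossed before or after.
    """
    found = set()
    seen = {(term, False)}
    queue = deque(seen)
    while queue:
        current, crossed = queue.popleft()
        steps = [(parent, crossed) for parent in is_a.get(current, ())]
        steps += [(whole, True) for whole in part_of.get(current, ())]
        for state in steps:
            if state not in seen:
                seen.add(state)
                queue.append(state)
                if state[1]:
                    found.add(state[0])
    return found


# Assumed edges (child -> parents).
is_a = {"mitochondrial chromosome": ["chromosome"]}
part_of = {
    "mitochondrion": ["cell"],
    "nucleus": ["cell"],
    "mitochondrial chromosome": ["mitochondrion"],
}
good = inferred_part_of("mitochondrial chromosome", is_a, part_of)

# Adding a relationship that is not always true produces a false inference:
# the mitochondrial chromosome would be inferred to be part of the nucleus.
bad_part_of = dict(part_of, chromosome=["nucleus"])
bad = inferred_part_of("mitochondrial chromosome", is_a, bad_part_of)
```

With the assumed edges, `good` contains only mitochondrion and cell, while the single untrue edge in `bad_part_of` adds nucleus to the result.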
Impact of using ontologies:
Common ontologies = instant interoperability
[Figure: circulatory system, mesoderm, heart, cardiac muscle cell and myoblast linked by part_of and develops_from relationships; the legend distinguishes explicit relationships from inferred relationships.]
http://uberon.org/
http://cellontology.org/
Project integration using ontologies
Malladi, talk #26
Other projects
OBI (for assays): http://obi-ontology.org
EFO (for cell lines): http://www.ebi.ac.uk/efo/
UBERON (for tissues): http://uberon.org/
CL (for primary cells): http://cellontology.org/
DCC
ENCODE portal (DCC)
Find common biosamples between ENCODE2 and REMC
356 terms
314 terms
http://genome.ucsc.edu/ENCODE/cellTypes.html
GEO characteristics: common_name, tissue_type, cell_type, lines
Labs were internally consistent
After curating biosample identifiers there are 33 in common between ENCODE2 & REMC:
20 UBERON
10 CL
2 EFO
1 NTR
217 terms
154 terms
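Once each project's free-text biosample names have been curated to ontology term IDs, finding the common biosamples reduces to a set intersection, and the breakdown by ontology is a count of ID prefixes. A minimal sketch with made-up data follows: the names and the IDs (`UBERON:EX1` and so on) are placeholders, not real identifiers or the actual curation results.

```python
from collections import Counter

# Hypothetical curated mappings: lab-specific name -> ontology term ID.
# The IDs are placeholders, not real ontology identifiers.
encode2 = {
    "Liver": "UBERON:EX1",
    "heart": "UBERON:EX2",
    "K562": "EFO:EX1",
    "cardiac myocyte": "CL:EX1",
}
remc = {
    "Adult Liver": "UBERON:EX1",
    "Fetal Heart": "UBERON:EX2",
    "K562 Leukemia Cells": "EFO:EX1",
    "Lung": "UBERON:EX3",
}


def common_terms(a, b):
    """Ontology term IDs used by both projects, whatever the local names were."""
    return set(a.values()) & set(b.values())


def tally_by_ontology(term_ids):
    """Count term IDs per ontology, using the prefix before the colon."""
    return Counter(term_id.split(":", 1)[0] for term_id in term_ids)


shared = common_terms(encode2, remc)
breakdown = tally_by_ontology(shared)
```

The local names never match between the two dictionaries; only the shared identifiers make the overlap visible.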
ENCODE Project Portal
https://encodeproject.org
Davidson, poster #77
Ontology-driven searches
http://www.encodeproject.org/
Query for estradiol-treated human samples
Track hubs displayed on UCSC browser
Track Hubs on the Fly
1. User finds data to view
2. DB constructs track-hub files
3. Browser pulls files from DB
4. Track hub displayed
Thousands of experiments (multiple files each) available from ENCODE Portal.
Primarily previous ENCODE
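The "DB constructs track-hub files" step can be pictured as rendering file metadata into trackDb stanzas. The sketch below is an illustration only: the metadata records, the example.org URLs and the function names are hypothetical, and it emits just a few basic UCSC trackDb settings (`track`, `bigDataUrl`, `shortLabel`, `longLabel`, `type`). The 17-character truncation of the short label is an assumed limit.

```python
def trackdb_stanza(name, url, label, file_type="bigWig"):
    """Render one trackDb stanza for a single remote file."""
    lines = [
        f"track {name}",
        f"bigDataUrl {url}",
        f"shortLabel {label[:17]}",  # assumed short-label length limit
        f"longLabel {label}",
        f"type {file_type}",
    ]
    return "\n".join(lines)


def build_trackdb(files):
    """Join the stanzas for all selected files, separated by blank lines."""
    return "\n\n".join(trackdb_stanza(**f) for f in files) + "\n"


# Hypothetical metadata for the files a user selected.
selected = [
    {"name": "FILE_X", "url": "https://example.org/files/FILE_X.bigWig",
     "label": "Hypothetical RNA-seq signal, replicate 1"},
    {"name": "FILE_Y", "url": "https://example.org/files/FILE_Y.bigWig",
     "label": "Hypothetical RNA-seq signal, replicate 2"},
]
trackdb_text = build_trackdb(selected)
```

Because the text is generated from the metadata at request time, no hub files need to be maintained by hand.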
Construct URLs to Search ENCODE data
curl -H 'Accept: application/json' -X GET \
"https://www.encodeproject.org/search/?type=experiment&assay_term_name=RNA-seq&organ_slims=lung&replicates.library.biosample.life_stage=fetal"
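The same search URL can be assembled programmatically from field/value pairs. This is a small sketch using only the Python standard library; the helper name `build_search_url` is hypothetical, and the filters are the ones shown in the curl command. No request is sent here: a client would fetch the URL with the `Accept: application/json` header.

```python
from urllib.parse import urlencode

BASE = "https://www.encodeproject.org/search/"


def build_search_url(filters):
    """Build a search URL from (field, value) pairs, URL-encoding the values."""
    return f"{BASE}?{urlencode(filters)}"


url = build_search_url([
    ("type", "experiment"),
    ("assay_term_name", "RNA-seq"),
    ("organ_slims", "lung"),
    ("replicates.library.biosample.life_stage", "fetal"),
])
```

Passing a list of pairs rather than keyword arguments keeps dotted field names usable and preserves the order of the filters.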
ENCODE standard analysis pipelines
Labs
Submission and Processing of ENCODE data
DNAnexus
Amazon S3
Metadata DB
Portal
Chan, poster #73
TF ChIP-seq: a relatively complicated processing pipeline
How would you deploy this pipeline
• With the same versions of all the software components;
• The same parameters;
• With access to all 40+ TB of ENCODE data;
To integrate or compare your results with ENCODE?
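One way to check that two runs really used the same software versions, parameters and inputs is to reduce each run's description to a canonical fingerprint and compare the two. A minimal sketch, assuming a run can be described by three plain structures; the tool names, versions, parameter names and accessions below are hypothetical.

```python
import hashlib
import json


def run_fingerprint(software_versions, parameters, input_accessions):
    """Hash a canonical JSON form of the run description.

    Keys are sorted and inputs are ordered, so two descriptions that differ
    only in ordering produce the same fingerprint.
    """
    record = {
        "software": software_versions,
        "parameters": parameters,
        "inputs": sorted(input_accessions),
    }
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical run descriptions.
reference_run = run_fingerprint(
    {"aligner": "x.y.z", "peak-caller": "x.y"},
    {"threads": 4, "min_quality": 30},
    ["FILE_A", "FILE_B"],
)
my_run = run_fingerprint(
    {"peak-caller": "x.y", "aligner": "x.y.z"},
    {"min_quality": 30, "threads": 4},
    ["FILE_B", "FILE_A"],
)
drifted_run = run_fingerprint(
    {"aligner": "x.y.w", "peak-caller": "x.y"},
    {"threads": 4, "min_quality": 30},
    ["FILE_A", "FILE_B"],
)
```

Here `my_run` matches `reference_run` despite the different ordering, while the changed aligner version in `drifted_run` yields a different fingerprint.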
State of the Art in Pipeline Metadata & Distribution
• ENCODE at UCSC’s Genome Browser
• Materials and Methods
• Galaxy/Globus (Galaxy on the Cloud)
• Seven Bridges Genomics
• tarball of scripts
• DNAnexus
Deploy Analysis Pipelines to the Cloud
Replicable
Provenance
On the web
to re-run.
Accessioned inputs
Pipeline metadata in database
Ease of use
Drop in your files
Scalable: 1000s of runs will be populated from the metadata database. Re-run on the web for a few datasets. Either way it's exactly the same pipeline.
Input Files
Outputs plumbed to inputs
Output Files
How much did that run cost?
Software used in pipelines
How to find a region of interest
Search by Region of Interest
Find ENCODE datasets overlapping a region of interest by its genomic coordinates, or rs ID (SNP), or gene name, etc.
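At its core, a region search is an interval-overlap test against the features of each dataset. The sketch below assumes half-open coordinates (start inclusive, end exclusive) and a small in-memory list of features; the dataset names and coordinates are hypothetical. An rs ID or gene name would first have to be resolved to coordinates, which is not shown.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def datasets_overlapping(region, features):
    """Return the sorted names of datasets with a feature overlapping `region`.

    `region` is a (chromosome, start, end) tuple.
    """
    chrom, start, end = region
    return sorted({
        f["dataset"] for f in features
        if f["chrom"] == chrom and overlaps(start, end, f["start"], f["end"])
    })


# Hypothetical features (for example, peaks) from three datasets.
features = [
    {"dataset": "DATASET_1", "chrom": "chr1", "start": 100, "end": 200},
    {"dataset": "DATASET_2", "chrom": "chr1", "start": 200, "end": 300},
    {"dataset": "DATASET_3", "chrom": "chr2", "start": 150, "end": 250},
]
hits = datasets_overlapping(("chr1", 150, 200), features)
```

A linear scan like this is fine for illustration; a service indexing many datasets would normally use an interval index instead.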
Figure 1 from Boyle et al., Genome Res. 22:1790-1797
Acknowledgements
SGD:
• Biocuration Scientists: Stacia Engel, Rama Balakrishnan, Maria Costanzo,
Janos Demeter, Rob Nash, Marek Skrzypek, Edith Wong, Sage Hellerstedt,
Kyla Dalusag
• Software Developers: Ben Hitz, Kelley Paskov, Travis Sheppard, Shuai Weng
• Systems Admins: Stuart Miyasato, Matt Simison
• Project Manager: Gail Binkley
ENCODE:
Stanford:
• Data Wranglers: Esther Chan, Jean Davidson, Venkat Malladi, Cricket
Sloan, J. Seth Strattan, Marcus Ho
• Software Developers: Ben Hitz, Laurence Rowe, Nikhil Podduturi, Forrest
Tanaka, Tim Dreszer
• Project Manager: Eurie Hong
UCSC:
• Jim Kent, Brian Lee
• previous members of the UCSC ENCODE DCC
ClinGen:
Stanford: Carlos Bustamante, Tam Snedden, Selina Dwight
Baylor: Sharon Plon, Aleks Milosavljevic, Ronak Patel, Xin Feng
Harvard: Heidi Rehm
UNC: Jonathan Berg
Members of the many working groups
SGD, GOC, ENCODE & ClinGen
Talk
Venkat Malladi : #26. Sunday 10:40-11:00. Ontology application and use at the ENCODE DCC
Posters ENCODE
Esther Chan : #73. Towards reproducible computational analyses: the ENCODE approach
Cricket Sloan : #74. Tracking data provenance to compare, reproduce, and interpret ENCODE results
Jean Davidson : #77. The role of the ENCODE Data Coordination Center
Posters SGD
Stacia Engel : #75. The war on disease: Homology curation at SGD to promote budding yeast as a model for eukaryotic biology
Rama Balakrishnan : #96. Collection and curation of whole genome studies of budding yeast at the Saccharomyces Genome Database (SGD)
Workshops
Workshop 2 : Data Visualization & Annotation
Chairs: Rama Balakrishnan (SGD) and Monica Munoz-Torres
Workshop 3 : Biocuration in big data to knowledge: new strategy, process & framework
Participant : Mike Cherry
Workshop 4 : International collaboration in biocuration: projects & data/expertise sharing
Participant : Rama Balakrishnan (SGD)
Workshop 5 : Genotype-2-phenotype: Curation challenges in translational & reverse translational informatics
Participant : Stacia Engel (SGD)