Interoperable metadata
leads to integrative
analyses
Biocuration 2015
April 25, 2015
Mike Cherry
Department of Genetics
Stanford University, School of Medicine
Information curated by SGD
• Biochemical pathways
• Cellular pathways
• Chromosomal feature annotation
• Full-text papers and abstracts
• Functional genomics
• Gene Ontology
• Gene expression
• Gene regulation
• Genetic interactions
• Mutant phenotypes
• Post-translational modification
• Protein complex
• Protein domains
• Protein interactions
• Sequence
• Strain differences
Balakrishnan, poster #96 Engel, poster #75
Genes by GO term
Genes by publication
Genes by the number of datasets in which their expression profiles are highly correlated
Genes by interaction
Role of a Genomic Resource
Experimental
data
Genomic
Resource
Publications
Computational
analyses
Data
generation
Literature
curation
Data wrangling
Data integration
Hypothesis generation
Additional
integration
Used in analysis
Google Trends
http://google.com/trends
ENCODE Assays and Elements
Questions we want to answer
1. ChIP-seq results on K562 targeting RNA-binding proteins
2. Which fastq files were used to create this integrated analysis file?
3. Which version of bwa was used to process this file?
4. Show experiments that have a TF bound near my gene of interest
5. Find all RNA-seq experiments completed on liver tissue or primary cells from liver
An ontology is a set of words with different types of relationships to each other.
All relationships must be true, because inferences can be made based on these relationships.
[Diagram: the terms cell, nucleus, mitochondrion, chromosome, and mitochondrial chromosome, linked from child term to parent term by part_of and is_a relationships; relationships that do not hold are marked with an X]
http://www.geneontology.org/GO.ontology.relations.shtml
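The reason every relationship must be true is that new relationships are derived by chaining existing ones. The following is a minimal sketch of that kind of inference, not the Gene Ontology's own implementation. The edge list is an assumption reconstructed loosely from the terms named on the slide, and the composition rule (any chain that contains a part_of step yields part_of, a chain of only is_a steps yields is_a) is the commonly documented one for these two relation types.

```python
from collections import deque

# Illustrative edges only: (child, relation, parent). The exact arrows in the
# slide's diagram are not recoverable, so this edge list is an assumption.
EDGES = [
    ("mitochondrion", "part_of", "cell"),
    ("nucleus", "part_of", "cell"),
    ("mitochondrial chromosome", "is_a", "chromosome"),
    ("mitochondrial chromosome", "part_of", "mitochondrion"),
]


def compose(first, second):
    """Combine two relations met in sequence along a path.

    is_a followed by is_a stays is_a; any chain containing part_of is part_of.
    """
    return "is_a" if first == second == "is_a" else "part_of"


def inferred_relations(term, edges=EDGES):
    """Return {ancestor: {relations}} for everything reachable from term."""
    found = {}
    queue = deque((parent, rel) for child, rel, parent in edges if child == term)
    while queue:
        node, rel = queue.popleft()
        if rel in found.setdefault(node, set()):
            continue  # this (ancestor, relation) pair was already recorded
        found[node].add(rel)
        for child, next_rel, parent in edges:
            if child == node:
                queue.append((parent, compose(rel, next_rel)))
    return found


result = inferred_relations("mitochondrial chromosome")
```

With these assumed edges, the mitochondrial chromosome is inferred to be part_of the cell, while no relationship to the nucleus is produced. A single false edge (for example, asserting that every chromosome is part_of the nucleus) would propagate into a false inference, which is the point the slide makes.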
Impact of using ontologies:
Common ontologies = instant interoperability
[Diagram: explicit and inferred part_of and develops_from relationships among circulatory system, mesoderm, heart, cardiac muscle cell, and myoblast]
http://uberon.org/
http://cellontology.org/
Project
integration using
ontologies
Malladi, talk #26
Other
projects
OBI (for assays): http://obi-ontology.org
EFO (for cell lines): http://www.ebi.ac.uk/efo/
UBERON (for tissues): http://uberon.org/
CL (for primary cells): http://cellontology.org/
DCC
ENCODE portal
(DCC)
Find common biosamples between ENCODE2 and REMC
356 terms
314 terms
http://genome.ucsc.edu/ENCODE/cellTypes.html
GEO characteristics: common_name, tissue_type, cell_type, lines
Labs were internally consistent
After curating biosample identifiers there are 33 in common between ENCODE2 & REMC
20 UBERON
10 CL
2 EFO
1 NTR
217
terms
154
terms
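The comparison above can be sketched as follows: each project's free-text biosample labels are first mapped to ontology term identifiers, and only then are the two sets intersected. This is an illustration under stated assumptions, not the curation procedure actually used. The labels are hypothetical, and the term identifiers are included for illustration and should be checked against the ontologies themselves.

```python
from collections import Counter

# Hypothetical lab-supplied labels, each mapped by a curator to one ontology
# term ID. Labels and mappings are illustrative, not taken from either project.
ENCODE_LABELS = {
    "K562": "EFO:0002067",
    "fetal lung": "UBERON:0002048",
    "B cell": "CL:0000236",
}
REMC_LABELS = {
    "K562 Leukemia Cells": "EFO:0002067",
    "Fetal Lung": "UBERON:0002048",
    "Adult Liver": "UBERON:0002107",
}


def common_biosamples(labels_a, labels_b):
    """Return the sorted term IDs that both projects map to."""
    return sorted(set(labels_a.values()) & set(labels_b.values()))


def count_by_ontology(term_ids):
    """Count term IDs per ontology prefix (the part before the colon)."""
    return dict(Counter(term_id.split(":")[0] for term_id in term_ids))


shared = common_biosamples(ENCODE_LABELS, REMC_LABELS)
breakdown = count_by_ontology(shared)
```

Comparing the raw label strings would find nothing in common here, because no two labels are spelled identically. Comparing term IDs finds two shared biosamples, and the per-ontology breakdown has the same shape as the 20 UBERON / 10 CL / 2 EFO / 1 NTR tally on the slide.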
ENCODE Project Portal
https://encodeproject.org
Davidson, poster #77
Ontology-driven searches
http://www.encodeproject.org/
Query for estradiol
treated human samples
track hubs displayed on UCSC browser
Track Hubs on the Fly
Browser pulls
files from DB
Track-hub
displayed
DB constructs
track-hub
files
User Finds
data to view
Thousands of experiments
(multiple files each)
available from ENCODE
Portal.
Primarily previous ENCODE
Construct URLs to Search ENCODE data
curl -H 'Accept: application/json' -X GET \
  "https://www.encodeproject.org/search/?type=experiment&assay_term_name=RNA-seq&organ_slims=lung&replicates.library.biosample.life_stage=fetal"
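The same search URL can be assembled programmatically rather than typed by hand. The sketch below only builds the URL string from the filters shown in the curl command and makes no network request. The helper name `build_search_url` is hypothetical, and whether the portal still accepts exactly these parameter names is an assumption.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.encodeproject.org/search/"

# Filters copied from the curl example on the slide.
FETAL_LUNG_RNA_SEQ = {
    "type": "experiment",
    "assay_term_name": "RNA-seq",
    "organ_slims": "lung",
    "replicates.library.biosample.life_stage": "fetal",
}


def build_search_url(filters, base=SEARCH_URL):
    """Return a search URL with the given field=value filters URL-encoded."""
    return f"{base}?{urlencode(filters)}"


url = build_search_url(FETAL_LUNG_RNA_SEQ)
```

The resulting string can be passed to curl or to any HTTP client together with the `Accept: application/json` header from the example. Keeping the filters in a dictionary makes it easy to vary one field at a time, such as the organ or the assay.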
ENCODE standard analysis pipelines
Labs
Submission and
Processing of
ENCODE data
DNAnexus
Amazon S3
Metadata
DB
Portal
Chan, poster #73
TF ChIP-seq: a relatively complicated processing pipeline
How would you deploy this pipeline
• With the same versions of all the software components;
• The same parameters;
• With access to all 40+TB of ENCODE data;
To integrate or compare your results with ENCODE?
State of the Art in Pipeline Metadata &
Distribution
• ENCODE at UCSC’s Genome Browser
• Materials and Methods
• Galaxy/Globus (Galaxy on the Cloud)
• Seven Bridges Genomics
• tarball of scripts
• DNAnexus
Deploy Analysis Pipelines to the Cloud
Replicable
Provenance
On the web
to re-run.
Accessioned
inputs
Pipeline
metadata in
database
Ease of
use
Drop in
your files
Scalable: 1000’s of runs will be populated from the metadata database. Re-run on the web for a few datasets. Either way it’s exactly the same pipeline.
Input
Files
Outputs plumbed to inputs
Output
Files
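Recording each output file's inputs and software versions is what makes questions such as "which fastq files were used to create this file" and "which version of bwa was used" answerable. The sketch below shows one way such provenance could be stored and walked. It is not the ENCODE schema: the `FileRecord` fields, the accessions, and the version string are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    accession: str
    file_format: str
    derived_from: list = field(default_factory=list)  # accessions of input files
    software: dict = field(default_factory=dict)      # tool name -> version used


# Hypothetical records: two fastq files, one alignment each, one pooled peak file.
RECORDS = {
    r.accession: r
    for r in [
        FileRecord("fastq_rep1", "fastq"),
        FileRecord("fastq_rep2", "fastq"),
        FileRecord("bam_rep1", "bam", ["fastq_rep1"], {"bwa": "0.0.0-example"}),
        FileRecord("bam_rep2", "bam", ["fastq_rep2"], {"bwa": "0.0.0-example"}),
        FileRecord("peaks_pooled", "bed", ["bam_rep1", "bam_rep2"]),
    ]
}


def source_fastqs(accession, records):
    """Follow derived_from links back to every fastq file behind a file."""
    seen, stack, fastqs = set(), [accession], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        record = records[current]
        if record.file_format == "fastq":
            fastqs.add(current)
        stack.extend(record.derived_from)
    return sorted(fastqs)


inputs_for_peaks = source_fastqs("peaks_pooled", RECORDS)
bwa_version = RECORDS["bam_rep1"].software["bwa"]
```

Because outputs are plumbed to inputs, every file's history is a walk over `derived_from` links, and the software version is a lookup on the record of the step that produced the file.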
How much did that run cost?
Software used in pipelines
How to find a region of interest
Search by Region of
Interest
Find ENCODE datasets
overlapping a region of
interest by its genomic
coordinates, or rs ID
(SNP), or gene name,
etc.
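Once an rs ID or gene name has been resolved to genomic coordinates, a region search reduces to an interval-overlap test. The sketch below shows that test on hypothetical data. The dataset names and coordinates are invented, the 0-based half-open coordinate convention is an assumption, and a production service would use an index rather than this linear scan.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlaps(a, b):
    """True if two regions on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def datasets_overlapping(query, peaks_by_dataset):
    """Return sorted names of datasets with at least one peak overlapping query."""
    return sorted(
        name
        for name, peaks in peaks_by_dataset.items()
        if any(overlaps(query, peak) for peak in peaks)
    )


# Hypothetical peak calls keyed by hypothetical dataset names.
PEAKS = {
    "dataset_A": [Region("chr1", 100, 200), Region("chr2", 50, 80)],
    "dataset_B": [Region("chr1", 500, 600)],
    "dataset_C": [Region("chr1", 190, 260)],
}

hits = datasets_overlapping(Region("chr1", 195, 210), PEAKS)
```

Here the query region overlaps one peak in each of `dataset_A` and `dataset_C`, so those two names are returned.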
Figure 1 from Boyle et al., Genome Res. 22:1790-1797
Acknowledgements
SGD:
• Biocuration Scientists: Stacia Engel, Rama Balakrishnan, Maria Costanzo,
Janos Demeter, Rob Nash, Marek Skrzypek, Edith Wong, Sage Hellerstedt,
Kyla Dalusag
• Software Developers: Ben Hitz, Kelley Paskov, Travis Sheppard, Shuai Weng
• Systems Admins: Stuart Miyasato, Matt Simison
• Project Manager: Gail Binkley
ENCODE:
Stanford:
• Data Wranglers: Esther Chan, Jean Davidson, Venkat Malladi, Cricket
Sloan, J. Seth Strattan, Marcus Ho
• Software Developers: Ben Hitz, Laurence Rowe, Nikhil Podduturi, Forrest
Tanaka, Tim Dreszer
• Project Manager: Eurie Hong
UCSC:
• Jim Kent, Brian Lee
• previous members of the UCSC ENCODE DCC
ClinGen:
Stanford: Carlos Bustamante, Tam Snedden, Selina Dwight
Baylor: Sharon Plon, Aleks Milosavljevic, Ronak Patel, Xin Feng
Harvard: Heidi Rehm
UNC: Jonathan Berg
Members of the many working groups
SGD, GOC, ENCODE & ClinGen
Talk
Venkat Malladi : #26. Sunday 10:40-11:00. Ontology application and use at the ENCODE DCC
Posters ENCODE
Esther Chan : #73. Towards reproducible computational analyses: the ENCODE approach
Cricket Sloan : #74. Tracking data provenance to compare, reproduce, and interpret ENCODE
results
Jean Davidson : #77. The role of the ENCODE Data Coordination Center
Posters SGD
Stacia Engel : #75. The war on disease: Homology curation at SGD to promote budding yeast as a
model for eukaryotic biology
Rama Balakrishnan : #96. Collection and curation of whole genome studies of budding yeast at
the Saccharomyces Genome Database (SGD)
Workshops
Workshop 2 : Data Visualization & Annotation
Chairs: Rama Balakrishnan (SGD) and Monica Munoz-Torres
Workshop 3 : Biocuration in big data to knowledge: new strategy, process & framework
Participant : Mike Cherry
Workshop 4 : International collaboration in biocuration: projects & data/expertise sharing
Participant : Rama Balakrishnan (SGD)
Workshop 5 : Genotype-2-phenotype: Curation challenges in translational & reverse
translational informatics. Participant : Stacia Engel (SGD)