Interoperable metadata leads to integrative analyses
Biocuration 2015
April 25, 2015
Mike Cherry
Department of Genetics
Stanford University, School of Medicine

Information curated by SGD
• Biochemical pathways
• Cellular pathways
• Chromosomal feature annotation
• Full-text papers and abstracts
• Functional genomics
• Gene Ontology
• Gene expression
• Gene regulation
• Genetic interactions
• Mutant phenotypes
• Post-translational modification
• Protein complex
• Protein domains
• Protein interactions
• Sequence
• Strain differences
Balakrishnan, poster #96
Engel, poster #75

[Figure panels: Genes; GO term; Genes by publication; Genes by the number of datasets in which their expression profiles are highly correlated; Genes by interaction]

Role of a Genomic Resource
[Diagram labels: Experimental data; Genomic Resource; Publications; Computational analyses; Data generation; Literature curation; Data wrangling; Data integration; Hypothesis generation; Additional integration; Used in analysis]

Google Trends
http://google.com/trends

ENCODE Assays and Elements

Questions we want to answer
1. ChIP-seq results on K562 targeting RNA-binding proteins
2. Which fastq files were used to create this integrated analysis file
3. Which version of bwa was used to process this file
4. Show experiments that have a TF bound near my gene of interest
5. Find all RNA-seq experiments completed on liver tissue or primary cells from liver

An ontology is a set of words...
...with different types of relationships to each other.
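A minimal sketch of what "words with typed relationships" can look like in practice, and of the kind of inference those relationships allow. The terms and edges are a simplified illustration loosely modeled on Gene Ontology cellular components, not a copy of the real ontology, and the names `build_index` and `inferred_part_of` are hypothetical. The sketch assumes that part_of is transitive and that a term inherits the part_of relationships of its is_a parents.

```python
from collections import defaultdict

# (child, relationship, parent) triples. Illustrative and simplified,
# not taken verbatim from any ontology release.
EDGES = [
    ("nucleus", "part_of", "cell"),
    ("mitochondrion", "part_of", "cell"),
    ("mitochondrial chromosome", "is_a", "chromosome"),
    ("mitochondrial chromosome", "part_of", "mitochondrion"),
]


def build_index(edges):
    """Map each child term to its list of (relationship, parent) pairs."""
    index = defaultdict(list)
    for child, rel, parent in edges:
        index[child].append((rel, parent))
    return index


def inferred_part_of(term, index):
    """Return every term that `term` is part of, directly or by inference.

    Any path of is_a and part_of edges that crosses at least one part_of
    edge is treated as an inferred part_of relationship.
    """
    found = set()
    seen = set()
    # Each stack item: (current term, has a part_of edge been crossed yet?)
    stack = [(term, False)]
    while stack:
        current, crossed = stack.pop()
        if (current, crossed) in seen:
            continue
        seen.add((current, crossed))
        for rel, parent in index.get(current, []):
            now_crossed = crossed or rel == "part_of"
            if now_crossed:
                found.add(parent)
            stack.append((parent, now_crossed))
    return found


if __name__ == "__main__":
    index = build_index(EDGES)
    print(sorted(inferred_part_of("mitochondrial chromosome", index)))
```

With these edges the mitochondrial chromosome is inferred to be part of the mitochondrion and of the cell. If an edge ("chromosome", "part_of", "nucleus") were added, the same function would also place the mitochondrial chromosome in the nucleus, so a single asserted relationship that is not always true propagates into wrong inferences.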
All relationships must be true because inferences can be made based on these relationships
[Diagram: parent term to child term; terms cell, mitochondrion, nucleus, chromosome, mitochondrial chromosome; relationships labeled part_of and is_a; a part_of relationship is marked with an X]
http://www.geneontology.org/GO.ontology.relations.shtml

Impact of using ontologies: Common ontologies = instant interoperability
[Diagram: terms circulatory system, mesoderm, heart, cardiac muscle cell, myoblast; relationships labeled part_of and develops_from; legend: Explicit relationships, Inferred relationships]
http://uberon.org/
http://cellontology.org/

Project integration using ontologies
Malladi, talk #26
Other projects
OBI (for assays): http://obi-ontology.org
EFO (for cell lines): http://www.ebi.ac.uk/efo/
UBERON (for tissues): http://uberon.org/
CL (for primary cells): http://cellontology.org/
DCC
ENCODE portal (DCC)

Find common biosamples between ENCODE2 and REMC
356 terms, 314 terms
http://genome.ucsc.edu/ENCODE/cellTypes.html
GEO characteristics: common_name, tissue_type, cell_type, lines
Labs were internally consistent
After curating biosample identifiers there are 33 in common between ENCODE2 & REMC
20 UBERON, 10 CL, 2 EFO, 1 NTR
217 terms, 154 terms

ENCODE Project Portal
https://encodeproject.org
Davidson, poster #77

Ontology-driven searches
http://www.encodeproject.org/
Query for estradiol-treated human samples
Track hubs displayed on UCSC browser

Track Hubs on the Fly
[Diagram labels: Browser pulls files from DB; Track-hub displayed; DB constructs track-hub files; User finds data to view]
Thousands of experiments (multiple files each) available from ENCODE Portal.
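The "DB constructs track-hub files" step can be pictured with a small sketch that turns file records into trackDb stanzas in the UCSC track hub text format. This is not the ENCODE portal's code: the record fields, accessions and example.org URLs are made up for illustration, and restricting output to bigWig and bigBed is an assumption of the sketch. Only the stanza keys (`track`, `bigDataUrl`, `shortLabel`, `longLabel`, `type`) come from the track hub format itself.

```python
# Hypothetical file records, standing in for rows from a metadata database.
RECORDS = [
    {
        "accession": "FILE0001",
        "url": "https://example.org/files/FILE0001.bigWig",
        "file_type": "bigWig",
        "assay": "RNA-seq",
        "biosample": "lung",
    },
    {
        "accession": "FILE0002",
        "url": "https://example.org/files/FILE0002.fastq.gz",
        "file_type": "fastq",
        "assay": "RNA-seq",
        "biosample": "lung",
    },
]

# Assumption for this sketch: only these formats are turned into tracks.
DISPLAYABLE = {"bigWig", "bigBed"}


def trackdb_stanza(record):
    """Format one file record as a trackDb stanza."""
    lines = [
        f"track {record['accession']}",
        f"bigDataUrl {record['url']}",
        f"shortLabel {record['accession']}",
        f"longLabel {record['assay']} on {record['biosample']}",
        f"type {record['file_type']}",
    ]
    return "\n".join(lines)


def build_trackdb(records):
    """Build trackDb text for every displayable file, one stanza per file."""
    stanzas = [
        trackdb_stanza(r) for r in records if r["file_type"] in DISPLAYABLE
    ]
    # Stanzas are separated by a blank line.
    return "\n\n".join(stanzas) + "\n"


if __name__ == "__main__":
    print(build_trackdb(RECORDS))
```

Generating the text on request from database records, rather than keeping static hub files, is what lets a search result become a browser view without a manual export step.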
Primarily previous ENCODE

Construct URLs to Search ENCODE data
    curl -H 'Accept: application/json' -X GET "https://www.encodeproject.org/search/?type=experiment&assay_term_name=RNA-seq&organ_slims=lung&replicates.library.biosample.life_stage=fetal"

Project integration using ontologies
ENCODE standard analysis pipelines

Submission and Processing of ENCODE data
[Diagram labels: Labs; DNAnexus; Amazon S3; Metadata DB; Portal]
Chan, poster #73

TF ChIP-seq: A relatively complicated processing pipeline

How would you deploy this pipeline
• With the same versions of all the software components;
• The same parameters;
• With access to all 40+ TB of ENCODE data;
To integrate or compare your results with ENCODE?

State of the Art in Pipeline Metadata & Distribution
• ENCODE at UCSC's Genome Browser
• Materials and Methods
• Galaxy/Globus (Galaxy on the Cloud)
• Seven Bridges Genomics
• tarball of scripts
• DNAnexus

Deploy Analysis Pipelines to the Cloud
Replicable
Provenance
On the web to re-run.
Accessioned inputs
Pipeline metadata in database
Ease of use
Drop in your files
Scalable to 1000's of runs
will be populated from the metadata database.
Re-run on the web for a few datasets.
Either way it's exactly the same pipeline.

[Diagram labels: Input Files; Outputs plumbed to inputs; Output Files]
How much did that run cost?
Software used in pipelines

How to find a region of interest
Search by Region of Interest
Find ENCODE datasets overlapping a region of interest by its genomic coordinates, or rs ID (SNP), or gene name, etc.
Figure 1 from Boyle et al., Genome Res.
22:1790-1797

Acknowledgements
SGD:
• Biocuration Scientists: Stacia Engel, Rama Balakrishnan, Maria Costanzo, Janos Demeter, Rob Nash, Marek Skrzypek, Edith Wong, Sage Hellerstedt, Kyla Dalusag
• Software Developers: Ben Hitz, Kelley Paskov, Travis Sheppard, Shuai Weng
• Systems Admins: Stuart Miyasato, Matt Simison
• Project Manager: Gail Binkley
ENCODE:
Stanford:
• Data Wranglers: Esther Chan, Jean Davidson, Venkat Malladi, Cricket Sloan, J. Seth Strattan, Marcus Ho
• Software Developers: Ben Hitz, Laurence Rowe, Nikhil Podduturi, Forrest Tanaka, Tim Dreszer
• Project Manager: Eurie Hong
UCSC:
• Jim Kent, Brian Lee
• previous members of the UCSC ENCODE DCC
ClinGen:
Stanford: Carlos Bustamante, Tam Snedden, Selina Dwight
Baylor: Sharon Plon, Aleks Milosavljevic, Ronak Patel, Xin Feng
Harvard: Heidi Rehm
UNC: Jonathan Berg
Members of the many working groups

SGD, GOC, ENCODE & ClinGen
Talk
Venkat Malladi: #26. Sunday 10:40-11:00. Ontology application and use at the ENCODE DCC
Posters ENCODE
Esther Chan: #73. Towards reproducible computational analyses: the ENCODE approach
Cricket Sloan: #74. Tracking data provenance to compare, reproduce, and interpret ENCODE results
Jean Davidson: #77. The role of the ENCODE Data Coordination Center
Posters SGD
Stacia Engel: #75. The war on disease: Homology curation at SGD to promote budding yeast as a model for eukaryotic biology
Rama Balakrishnan: #96. Collection and curation of whole genome studies of budding yeast at the Saccharomyces Genome Database (SGD)
Workshops
Workshop 2: Data Visualization & Annotation. Chairs: Rama Balakrishnan (SGD) and Monica Munoz-Torres
Workshop 3: Biocuration in big data to knowledge: new strategy, process & framework. Participant: Mike Cherry
Workshop 4: International collaboration in biocuration: projects & data/expertise sharing. Participant: Rama Balakrishnan (SGD)
Workshop 5: Genotype-2-phenotype: Curation challenges in translational & reverse translational informatics. Participant: Stacia Engel (SGD)