Ontology application and use at the ENCODE DCC

Venkat Malladi
Data Wrangler, ENCODE DCC
Department of Genetics
Stanford University School of Medicine

Venkat Malladi ENCODE DCC

Overview
• Intro to ENCODE and the DCC
• Metadata Model
• Ontologies
• Search
• Future directions

What is ENCODE?
Approximately 30 different assays
Modified from PLoS Biol 9:e1001046, 2011 (M. Pazin)

Role of the Data Coordination Center
[Diagram labels: Genome Browser; Data files; Production labs; Analysis groups; Metadata; DCC; ENCODE portal (DCC); Integrative websites; Scientific community]
Role: Data generation
Tasks: Perform assays; Perform analyses; Validate data; Submit data files; Submit metadata
Role: Data organization
Tasks: Data processing & validation; Data file storage; Metadata curation
Role: Data access
Tasks: Web-based searches; Data downloads

Challenge: Find common biosamples from data generated by two consortia
356 terms: http://encodeproject.org/ENCODE/cellTypes.html
314 terms: GEO characteristics (common_name, tissue_type, cell_type, lines)
Projects are internally consistent...

Simple text match
360 terms (Cell type) vs. 314 terms (GEO)
... but only 3 biosample names match exactly between projects
Names shown: IMR90, PBMC, Th17

Metadata annotation using Ontologies

An ontology is a set of words and relationships ...
All relationships must be true.
[Diagram: the terms cell (parent term), mitochondrion, nucleus, chromosome and mitochondrial chromosome (child term), connected by part_of and is_a relationships; an "X" mark also appears in the figure]
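A minimal sketch of how the parent/child relationships on the ontology slide could be stored and traversed. The edge list is an assumed reading of the figure (the extracted slide does not preserve which terms each edge connects), and `EDGES`, `build_index` and `infer_part_of` are hypothetical names, not part of any ENCODE tool. The composition rules used here (part_of followed by part_of, or mixed with is_a, yields part_of) follow the usual Gene Ontology convention.

```python
from collections import defaultdict

# Assumed edges (child, relation, parent); an illustrative reading of the figure.
EDGES = [
    ("mitochondrion", "part_of", "cell"),
    ("nucleus", "part_of", "cell"),
    ("mitochondrial chromosome", "is_a", "chromosome"),
    ("mitochondrial chromosome", "part_of", "mitochondrion"),
]


def build_index(edges):
    """Map each child term to its list of (relation, parent) pairs."""
    parents = defaultdict(list)
    for child, relation, parent in edges:
        parents[child].append((relation, parent))
    return parents


def infer_part_of(term, parents):
    """Return every term that `term` is part_of, directly or by inference.

    A path counts as part_of once it has crossed at least one part_of edge;
    is_a edges before or after that edge do not change the result.
    """
    inferred = set()
    visited = set()
    stack = [(term, False)]  # (current term, has the path crossed part_of?)
    while stack:
        node, crossed = stack.pop()
        if (node, crossed) in visited:
            continue
        visited.add((node, crossed))
        for relation, parent in parents.get(node, []):
            now_crossed = crossed or relation == "part_of"
            if now_crossed:
                inferred.add(parent)
            stack.append((parent, now_crossed))
    return inferred


index = build_index(EDGES)
result = infer_part_of("mitochondrial chromosome", index)
print(sorted(result))
```

Under these assumed edges, "mitochondrial chromosome part_of cell" is never stated directly; it follows from the two part_of edges. The is_a edge to "chromosome" alone does not produce a part_of relationship.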
[Diagram repeated from the previous slide, with one additional part_of relationship shown]
We can make inferences based upon these relationships.
http://www.geneontology.org/GO.ontology.relations.shtml

Why use ontologies?
Reason 1: They give a consistent way of describing biological concepts.
Reason 2: Consistent language makes related data easy to identify.
Reason 3: Data analysis stays consistent: relationships between terms allow flexible grouping while everyone uses the same set of metadata.

What metadata is annotated with ontologies?
1. The biological sample serving as input (Biosample)
2. The reagents and conditions applied to the biological input (Treatment)
3. The set of methods and conditions used to survey the biological input (Assay)

Biosample ontologies
1. Uber anatomy ontology (Uberon): structure, location and heterogeneous mixtures of cells
2. Cell Ontology (CL): primary cells or stem cells
3. Experimental Factor Ontology (EFO): biosamples with no directly corresponding anatomical structure or physiological cell type

Challenge: Find all heart-related tissues?
Heart_OC, HCF, HCFaa, HCM, Others?
Fetal Heart, Heart, Right Atrium, Right Ventricle, Others?

Searching ENCODE metadata

Searching using "heart"
Heart_OC, HCF, HCFaa, HCM, Others?
Fetal Heart, Heart, Right Atrium, Right Ventricle, Others?
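The heart example can be sketched as a comparison between a plain text match on biosample names and a search over ontology annotations. Everything below is illustrative: the term names, the parent links and the biosample-to-term assignments are assumptions made for this sketch, not the DCC's actual annotations, and `PARENTS`, `BIOSAMPLES`, `text_search` and `ontology_search` are hypothetical names.

```python
# Assumed child -> parents links (is_a and part_of collapsed into one map).
PARENTS = {
    "right cardiac atrium": ["heart"],
    "heart right ventricle": ["heart"],
    "cardiac muscle cell": ["heart"],
    "fibroblast of cardiac tissue": ["heart"],
    "fibroblast of lung": ["lung"],
}

# Assumed annotation of biosample names with ontology terms.
BIOSAMPLES = {
    "Heart": "heart",
    "Right Atrium": "right cardiac atrium",
    "Right Ventricle": "heart right ventricle",
    "HCM": "cardiac muscle cell",
    "HCF": "fibroblast of cardiac tissue",
    "IMR90": "fibroblast of lung",
}


def ancestors(term):
    """Collect every term reachable by following parent links upward."""
    found = set()
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def text_search(query):
    """Match the query against biosample names only (case-insensitive)."""
    return sorted(n for n in BIOSAMPLES if query.lower() in n.lower())


def ontology_search(term):
    """Match biosamples annotated with the term or any of its descendants."""
    return sorted(
        name
        for name, annotated in BIOSAMPLES.items()
        if annotated == term or term in ancestors(annotated)
    )


print(text_search("heart"))
print(ontology_search("heart"))
```

With these assumed annotations, the text match finds only the biosample literally named "Heart", while the ontology search also returns HCF, HCM, Right Atrium and Right Ventricle, and still excludes IMR90.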
Ontology driven search

Additional ontologies
• Protein Ontology (PRO, http://pir.georgetown.edu/pro/pro.shtml)
  o transforming growth factor beta-1 (human): PR:P01137
• EDAM Ontology (EDAM, http://edamontology.org)
  o FASTQ: format:1930, BAM: format:2572
  o sequence alignment: data:0863

Ontology based validations

Acknowledgments
Eurie Hong, Mike Cherry (PI), Jim Kent (co-PI), Ben Hitz
Data Wranglers: Esther Chan, Jean Davidson, Venkat Malladi, Cricket Sloan, J. Seth Strattan
Software Engineers: Nikhil Podduturi, Laurence Rowe, Forrest Tanaka
QA, administration, biocuration: Brian Lee, Stuart Miyasato, Matt Simison, Zhenhua Wang, Marcus Ho
Funding: National Institute of General Medical Sciences of the United States National Institutes of Health (GM10331601); U41 grant from National Human Genome Research Institute at the U.S. National Institutes of Health (HG006992)

We're hiring!
Open positions: Data Wranglers; Software Engineers; QA, administration, biocuration