BI class 2010

Gene Ontology Overview and Perspective

What is Ontology?
[Timeline on slide: 1606, 1700s]
• Dictionary: A branch of metaphysics concerned with the nature and relations of being

What is the Gene Ontology?
• Allows biologists to make queries across large numbers of genes without researching each one individually

So what does that mean?
From a practical view, an ontology is the representation of something we know about. “Ontologies” consist of a representation of things that are detectable or directly observable, and the relationships between those things.

Car?

Ontology: definition

Gene Ontology (GO) Consortium
www.geneontology.org
• Formed to develop a shared language adequate for the annotation of molecular characteristics across organisms; a common language to share knowledge.
• Seeks to achieve a mutual understanding of the definition and meaning of any word used; thus we are able to support cross-database queries.
• Members agree to contribute gene product annotations and associated sequences to the GO database.

How does GO work?
What information might we want to capture about a gene product?
• What does the gene product do?
• Where and when does it act?
• Why does it perform these activities?

What is the Gene Ontology?
• Set of standard biological phrases (terms) which are applied to genes/proteins:
– protein kinase
– apoptosis
– membrane

GO represents three biological domains
Molecular Function = elemental activity/task
– the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
Biological Process = biological goal or objective
– broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
Cellular Component = location or complex
– subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme

The GO is Actually Three Ontologies

Molecular Function
GO term: Malate dehydrogenase
GO id: GO:0030060
(S)-malate + NAD(+) = oxaloacetate + NADH
[Figure: structural diagram of the reaction, with NAD+ converted to NADH + H+]

Biological Process
GO term: tricarboxylic acid cycle
Synonym: Krebs cycle
Synonym: citric acid cycle
GO id: GO:0006099

Cellular Component
GO term: mitochondrion
GO id: GO:0005739
Definition: A semiautonomous, self-replicating organelle that occurs in varying numbers, shapes, and sizes in the cytoplasm of virtually all eukaryotic cells. It is notably the site of tissue respiration.

Cellular Component
• where a gene product acts
• Enzyme complexes in the component ontology refer to places, not activities.

Molecular Function
• activities or “jobs” of a gene product
– glucose-6-phosphate isomerase activity
– insulin binding
– insulin receptor activity
• A gene product may have several functions
• Sets of functions make up a biological process.

Biological Process
a commonly recognized series of events
– cell division
– transcription
– regulation of gluconeogenesis
– limb development

Ontology Structure
• Terms are linked by two relationships
– is-a
– part-of
[Figure: example graph of the terms cell, membrane, mitochondrial membrane, chloroplast and chloroplast membrane, linked by is-a and part-of]
• Ontologies are structured as a hierarchical directed acyclic graph (DAG)
• Terms can have more than one parent and zero, one or more children
• Directed Acyclic Graph (DAG): multiple parentage allowed

A biological ontology is:
• A (machine) interpretable representation of some aspect of biological reality
– what kinds of things exist?
– what are the relationships between these things?
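The DAG idea from the Ontology Structure slides can be sketched in a few lines of Python. This is only an illustration: the slide names the terms (cell, membrane, mitochondrial membrane, chloroplast, chloroplast membrane) and the two relationship types, but the specific edges below are assumptions, and `PARENTS` and `ancestors` are hypothetical names, not part of any GO tool.

```python
from typing import Dict, List, Set, Tuple

# Each term maps to its (relationship, parent) pairs.
# ASSUMPTION: these edges are illustrative; the slide only names the terms.
PARENTS: Dict[str, List[Tuple[str, str]]] = {
    "cell": [],
    "membrane": [("part_of", "cell")],
    "chloroplast": [("part_of", "cell")],
    # Multiple parentage: one is_a parent and one part_of parent.
    "chloroplast membrane": [("is_a", "membrane"), ("part_of", "chloroplast")],
    "mitochondrial membrane": [("is_a", "membrane")],
}


def ancestors(term: str, parents: Dict[str, List[Tuple[str, str]]] = PARENTS) -> Set[str]:
    """Return every term reachable by following parent links upward."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    for term in PARENTS:
        print(term, "->", sorted(ancestors(term)))
```

Because a term can have more than one parent, the upward walk collects ancestors along every path, which is what lets a query on a broad term also reach genes annotated to its more specific descendants.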
[Figure: example graph relating optic placode, sense organ, eye and lens through develops from, is_a and part_of]

GO Definitions: Each GO term has 2 Definitions
• A definition written by a biologist: necessary & sufficient conditions; written definition (not computable)
• Graph structure: necessary conditions; formal (computable)

Appropriate Relationships to Parents
• GO currently has 2 relationship types
– Is_a
• An is_a child of a parent means that the child is a complete type of its parent, but can be discriminated in some way from other children of the parent.
– Part_of
• A part_of child of a parent means that the child is always a constituent of the parent that, in combination with other constituents of the parent, makes up the parent.

Placement in the Graph: Selecting Parents
• To make the most precise definitions, new terms should be placed as children of the parent that is closest in meaning to the term.
• To make the most complete definitions, terms should have all of the parents that are appropriate.
• In an ontology as complicated as the GO this is not as easy as it seems.

True Path Violations Create Incorrect Definitions
“the pathway from a child term all the way up to its top-level parent(s) must always be true”
• chromosome has a part_of relationship to nucleus
• mitochondrial chromosome has an is_a relationship to chromosome
• Following both links places mitochondrial chromosome under nucleus. A mitochondrial chromosome is not part of a nucleus!
• Corrected structure:
– nuclear chromosome: is_a chromosome, part_of nucleus
– mitochondrial chromosome: is_a chromosome, part_of mitochondrion

The Development Node (an example of consistent definitions)
Cell level
[i] y cell differentiation
---[p] y cell fate commitment
------[p] y cell fate specification
------[p] y cell fate determination
---[p] y cell development
------[p] y cellular morphogenesis during differentiation
------[p] y cell maturation

y cell differentiation
The process whereby a relatively unspecialized cell acquires specialized features of a y cell.
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=cmed6.figgrp.41173

y cell fate commitment
The process whereby the developmental fate of a cell becomes restricted such that it will develop into a y cell.

y cell fate specification
The process whereby a cell becomes capable of differentiating autonomously into a y cell in an environment that is neutral with respect to the developmental pathway. Upon specification, the cell fate can be reversed.

y cell fate determination
The process whereby a cell becomes capable of differentiating autonomously into a y cell regardless of its environment; upon determination, the cell fate cannot be reversed.

Gene Ontology widely adopted
AgBase

Terms are defined graphically relative to other terms

The Gene Ontology (GO)
1. Build and maintain logically rigorous and biologically accurate ontologies
2. Comprehensively annotate reference genomes
3. Support genome annotation projects for all organisms
4. Freely provide ontologies, annotations and tools to the research community

Building the ontologies
• The GO is still developing daily both in ontological structures and in domain knowledge
• Ontology development workshops focus on specific domains needing experts
• 2 workshops / year
1. Metabolism and cell cycle
2. Immunology and defense response
3. Early CNS development
4. Peripheral nervous system development
5. Blood Pressure Regulation
6. Muscle Development

Mappings files
External entry                                                                              GO term
Fatty acid biosynthesis (Swiss-Prot Keyword)                                                GO:Fatty acid biosynthesis (GO:0006633)
EC:6.4.1.2 (EC number)                                                                      GO:acetyl-CoA carboxylase activity (GO:0003989)
IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry)        GO:acetyl-CoA carboxylase activity (GO:0003989)

Building the ontology: Immune System Process
• 725 new terms related to immunology
• 127 new terms added to cell type ontology
[Figure legend: red = part_of, blue = is_a]

Annotating Gene Products using GO
Gene Product: P05147
GO Term: GO:0047519
Evidence: IDA
Reference: PMID:2976880
Gene protein inherits GO term
• There is evidence that this gene product can be best classified using this term
• The source of the evidence and other information is included
• There is agreement on the meaning of the term

Annotations are assertions
Annotations for APP: amyloid beta (A4) precursor protein

We use evidence codes to describe the basis of the annotation
Direct experiment in organism:
• IDA: Inferred from direct assay
• IPI: Inferred from physical interaction
• IMP: Inferred from mutant phenotype
• IGI: Inferred from genetic interaction
• IEP: Inferred from expression pattern
No direct experiment (inferred from evidence):
• IEA: Inferred from electronic annotation
• ISS: Inferred from sequence or structural similarity
• TAS: Traceable author statement
• NAS: Non-traceable author statement
• IC: Inferred by curator
• RCA: Reviewed Computational Analysis
• ND: no data available

GO structure
• GO isn’t just a flat list of biological terms
• terms are related within a hierarchy
[Figure: gene A placed within the GO hierarchy]
• This means genes can be grouped according to user-defined levels
• Allows broad overview of gene set or genome

GO Annotation Stats (2007)
GO Annotations
Total manual GO annotations: 388,633
Total proteins with manual annotations: 80,402
Contributing Groups (including MGI): 19
Total PubMed References: 346,002
Total number predicted annotations: 17,029,553
Total number taxa: 129,318
Total number distinct proteins: 2,971,374

GO annotations
[Figure: GO database linked to genome and protein databases through gene -> GO term associations (associated genes)]
Annotations of gene products to GO are genome specific.
Now we can query across all annotations based on shared biological activity.

GO browser
Search on ‘mesoderm development’
[Screenshot: the term mesoderm development, the definition of mesoderm development, and the gene products involved in mesoderm development]

Traditional analysis
Gene 1: Apoptosis, Cell-cell signaling, Protein phosphorylation, Mitosis, …
Gene 2: Growth control, Mitosis, Oncogenesis, Protein phosphorylation, …
Gene 3: Growth control, Mitosis, Oncogenesis, Protein phosphorylation, …
Gene 4: Nervous system, Pregnancy, Oncogenesis, Mitosis, …
Gene 100: Positive ctrl. of cell prolif., Mitosis, Oncogenesis, Glucose transport, …

Using GO annotations
• But by using GO annotations, this work has already been done
GO:0006915 : apoptosis

Grouping by process
Apoptosis: Gene 1, Gene 53
Positive ctrl. of cell prolif.: Gene 7, Gene 3, Gene 12, …
Mitosis: Gene 2, Gene 5, Gene 45, Gene 7, Gene 35, …
Glucose transport: Gene 7, Gene 3, Gene 6, …
Growth: Gene 5, Gene 2, Gene 6, …

Anatomy of a GO term
id: GO:0006094                        (unique GO ID)
name: gluconeogenesis                 (term name)
namespace: process                    (ontology)
def: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol. [http://cancerweb.ncl.ac.uk/omd/index.html]    (definition)
exact_synonym: glucose biosynthesis   (synonym)
xref_analog: MetaCyc:GLUCONEO-PWY     (database ref)
is_a: GO:0006006                      (parentage)
is_a: GO:0006092                      (parentage)

GO is a functional annotation system of great utility to the data-driven biologist

GO enables genomic data analysis
• Microarrays allow biologists to record changes in gene function across entire genomes
• Result: Vast amounts of gene expression data desperately needing cataloging and tagging
• Many data analysis tools use GO graph structure to statistically evaluate clusters of co-expressed genes based on shared functional annotations

GO supports functional classifications
[Figures: OCT 13, 2006; Cancer Genome Projects; Nature: January 2007]

GO is wildly successful
FIGURE 3. Representative cell-type-specific genes and corresponding molecular functions.
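As a rough illustration of the "Anatomy of a GO term" slide, the tag: value layout of the gluconeogenesis stanza can be read into a dictionary. This is a minimal sketch under stated assumptions: `parse_term_stanza` is a hypothetical helper written for this example, not a function from any GO library, and it only handles the simple one-tag-per-line layout shown on the slide.

```python
from collections import defaultdict
from typing import Dict, List

# The stanza exactly as shown on the slide.
STANZA = """\
id: GO:0006094
name: gluconeogenesis
namespace: process
def: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol. [http://cancerweb.ncl.ac.uk/omd/index.html]
exact_synonym: glucose biosynthesis
xref_analog: MetaCyc:GLUCONEO-PWY
is_a: GO:0006006
is_a: GO:0006092
"""


def parse_term_stanza(text: str) -> Dict[str, List[str]]:
    """Split each 'tag: value' line at the first colon.

    Values are collected in lists because a tag such as is_a can repeat.
    """
    fields: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        tag, _, value = line.partition(":")
        fields[tag.strip()].append(value.strip())
    return dict(fields)


term = parse_term_stanza(STANZA)

if __name__ == "__main__":
    print(term["id"][0], term["name"][0], "parents:", term["is_a"])
```

Keeping every value in a list is a deliberate choice here: the two is_a lines are the term's parentage, and a plain dictionary assignment would silently keep only the last one.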
Comprehensively annotate Reference Genomes
• Human
• Mouse
• Fly
• Rat
• Chicken
• Zebrafish
• Worm
• Dicty
• E. coli
• Saccharomyces cerevisiae
• Schizosaccharomyces pombe
• Arabidopsis thaliana

Species coverage
• All major eukaryotic model organism species
• Human via GOA group at UniProt
• Several bacterial and parasite species through TIGR and GeneDB at Sanger
– many more in pipeline

Annotation coverage

GO tools
• GO resources are freely available to anyone to use without restriction
– Includes the ontologies, gene associations and tools developed by GO
• Other groups have used GO to create tools for many purposes: http://www.geneontology.org/GO.tools
• Affymetrix also provides a Gene Ontology Mining Tool as part of their NetAffx™ Analysis Center, which returns GO terms for probe sets
• Many tools exist that use GO to find common biological functions from a list of genes: http://www.geneontology.org/GO.tools.microarray.shtml
• Most of these tools work in a similar way:
– input a gene list and a subset of ‘interesting’ genes
– tool shows which GO categories have most interesting genes associated with them, i.e. which categories are ‘enriched’ for interesting genes
– tool provides a statistical measure to determine whether enrichment is significant

GO for microarray analysis
• Annotations give ‘function’ label to genes
• Ask meaningful questions of microarray data, e.g.
– genes involved in the same process, same/different expression patterns?

Using GO in practice
• statistical measure
– how likely it is that your differentially regulated genes fall into that category by chance
Microarray: 1000 genes. Experiment: 100 genes differentially regulated.
[Bar chart: number of differentially regulated genes per process]
– mitosis: 80/100
– apoptosis: 40/100
– p. ctrl. cell prol.: 30/100
– glucose transp.: 20/100

Using GO in practice
• However, when you look at the distribution of all genes on the microarray:

Process               Genes on array   # genes expected in 100 random genes   Occurred
mitosis               800/1000         80                                     80
apoptosis             400/1000         40                                     40
p. ctrl. cell prol.   100/1000         10                                     30
glucose transp.       50/1000          5                                      20

Enrichment tools
• GO is developing its own enrichment tool as part of the GO browser AmiGO
• Currently in testing phase, should be released next month
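The comparison in the mitosis / apoptosis example above can be worked through in code. The sketch below uses the slide's counts (1000 genes on the array, 100 differentially regulated) and a one-sided hypergeometric test as the statistical measure. That test is an assumption chosen for illustration; the slides do not say which statistic any particular tool uses, and the function names are hypothetical.

```python
from math import comb

ARRAY_SIZE = 1000   # genes on the microarray (from the slide)
SAMPLE_SIZE = 100   # differentially regulated genes (from the slide)

# process -> (genes on array in this category, genes observed in the sample)
COUNTS = {
    "mitosis": (800, 80),
    "apoptosis": (400, 40),
    "p. ctrl. cell prol.": (100, 30),
    "glucose transp.": (50, 20),
}


def expected_count(category_on_array: int, array_size: int, sample_size: int) -> float:
    """Genes expected in the category if the sample were drawn at random."""
    return sample_size * category_on_array / array_size


def enrichment_p_value(observed: int, category_on_array: int,
                       array_size: int, sample_size: int) -> float:
    """One-sided hypergeometric tail: P(at least `observed` category genes
    in a random sample of `sample_size` genes drawn without replacement)."""
    upper = min(category_on_array, sample_size)
    favourable = sum(
        comb(category_on_array, i)
        * comb(array_size - category_on_array, sample_size - i)
        for i in range(observed, upper + 1)
    )
    return favourable / comb(array_size, sample_size)


if __name__ == "__main__":
    for process, (on_array, observed) in COUNTS.items():
        expected = expected_count(on_array, ARRAY_SIZE, SAMPLE_SIZE)
        p = enrichment_p_value(observed, on_array, ARRAY_SIZE, SAMPLE_SIZE)
        print(f"{process}: expected {expected:g}, observed {observed}, p = {p:.3g}")
```

Mitosis and apoptosis occur about as often as chance predicts, so their tail probabilities are large, while positive control of cell proliferation and glucose transport occur far more often than expected and get very small p-values. Real tools typically also correct for testing many GO categories at once, which this sketch leaves out.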