BI class 2010 Gene Ontology Overview and Perspective

What is Ontology?
1606
1700s
• Dictionary:
A branch of metaphysics concerned with
the nature and relations of being
What is the Gene Ontology?
• Allows biologists to make queries across
large numbers of genes without
researching each one individually
So what does that mean?
From a practical view, ontology is the
representation of something we know about.
“Ontologies” consist of a representation of things,
that are detectable or directly observable, and the
relationships between those things.
Car?
Ontology -definition
Gene Ontology (GO)
Consortium
www.geneontology.org
• Formed to develop a shared language
adequate for the annotation of molecular
characteristics across organisms; a common language
to share knowledge.
• Seeks to achieve a mutual understanding of the
definition and meaning of any word used; thus we are
able to support cross-database queries.
• Members agree to contribute gene product annotations
and associated sequences to GO database.
How does GO work?
What information might we want to
capture about a gene product?
• What does the gene product do?
• Where and when does it act?
• Why does it perform these activities?
What is the Gene Ontology?
• Set of standard biological phrases (terms)
which are applied to genes/proteins:
– protein kinase
– apoptosis
– membrane
GO represents three biological domains
Molecular Function = elemental activity/task
– the tasks performed by individual gene products;
examples are carbohydrate binding and ATPase activity
Biological Process = biological goal or
objective
– broad biological goals, such as mitosis or purine metabolism, that are
accomplished by ordered assemblies of molecular functions
Cellular Component = location or complex
– subcellular structures, locations, and macromolecular complexes;
examples include nucleus, telomere, and RNA polymerase II
holoenzyme
The GO is Actually Three
Ontologies
Molecular Function
GO term: Malate dehydrogenase.
GO id: GO:0030060
(S)-malate + NAD(+) = oxaloacetate + NADH.
[Figure: structural diagram of the reaction, (S)-malate + NAD+ to oxaloacetate + NADH + H+]
Biological Process
GO term: tricarboxylic acid
cycle
Synonym: Krebs cycle
Synonym: citric acid cycle
GO id:
GO:0006099
Cellular Component
GO term: mitochondrion
GO id: GO:0005739
Definition: A semiautonomous, self-replicating
organelle that occurs in
varying numbers, shapes, and sizes in
the cytoplasm of virtually all eukaryotic
cells. It is notably the site of tissue
respiration.
Cellular Component
• where a gene product acts
Cellular Component
• Enzyme complexes in the component
ontology refer to places, not activities.
Molecular Function
• activities or “jobs” of a gene product
glucose-6-phosphate isomerase activity
Molecular Function
insulin binding
insulin receptor activity
Molecular Function
• A gene product may have several
functions
• Sets of functions make up a biological
process.
Biological Process
a commonly recognized series of events
cell division
Biological Process
transcription
Biological Process
regulation of gluconeogenesis
Biological Process
limb development
Ontology Structure
• Terms are linked by two relationships
– is-a
– part-of 

Ontology Structure
[Figure: graph of the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, joined by is-a and part-of links]
Ontology Structure
• Ontologies are structured as a hierarchical
directed acyclic graph (DAG)
• Terms can have more than one parent and
zero, one or more children
Ontology Structure
[Figure: graph of the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane]
Directed Acyclic Graph (DAG) - multiple parentage allowed
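A DAG with multiple parentage can be sketched in Python as a list of labelled edges. This is only an illustration: the edge labels below are assumed, since the figure itself does not survive in this text, and the helper names `build_parents` and `ancestors` are hypothetical.

```python
from collections import defaultdict

# Each edge is (child, relationship, parent).
# The relationship labels are assumptions made for this sketch.
EDGES = [
    ("membrane", "part_of", "cell"),
    ("chloroplast", "part_of", "cell"),
    ("mitochondrial membrane", "is_a", "membrane"),
    ("chloroplast membrane", "is_a", "membrane"),
    ("chloroplast membrane", "part_of", "chloroplast"),
]


def build_parents(edges):
    """Map each term to the list of (relationship, parent) pairs above it."""
    parents = defaultdict(list)
    for child, rel, parent in edges:
        parents[child].append((rel, parent))
    return parents


def ancestors(term, parents):
    """Collect every term reachable by walking upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _rel, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


PARENTS = build_parents(EDGES)

# "chloroplast membrane" has two parents, which a strict tree would not allow.
print(PARENTS["chloroplast membrane"])
print(sorted(ancestors("chloroplast membrane", PARENTS)))
```

Storing parents rather than children keeps the upward walk simple, which is the direction most GO queries need.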
A biological ontology is:
• A (machine) interpretable representation of
some aspect of biological reality
– what kinds of
things exist?
– what are the
relationships
between these
things?
[Figure: example graph linking optic placode, sense organ, eye and lens through develops from, is_a and part_of relationships]
GO Definitions: Each GO term has 2 Definitions
• A definition written by a biologist: necessary & sufficient conditions; written definition (not computable)
• Graph structure: necessary conditions; formal (computable)
Appropriate Relationships to Parents
• GO currently has 2 relationship types
– Is_a
• An is_a child of a parent means that the child is a
complete type of its parent, but can be
discriminated in some way from other children of
the parent.
– Part_of
• A part_of child of a parent means that the child is
always a constituent of the parent that in
combination with other constituents of the parent
make up the parent.
Placement in the Graph: Selecting Parents
• To make the most precise definitions, new terms
should be placed as children of the parent that is
closest in meaning to the term.
• To make the most complete definitions, terms
should have all of the parents that are
appropriate.
• In an ontology as complicated as the GO this is
not as easy as it seems.
True Path Violations Create Incorrect
Definitions
“…the pathway from a child term all the way up to its top-level parent(s) must always be true.”
nucleus
Part_of relationship
chromosome
True Path Violations
“…the pathway from a child term all the way up to its top-level parent(s) must always be true.”
chromosome
Is_a relationship
Mitochondrial
chromosome
True Path Violations
“…the pathway from a child term all the way up to its top-level parent(s) must always be true.”
nucleus
Part_of relationship
A mitochondrial chromosome is not part of a nucleus!
chromosome
Is_a relationship
Mitochondrial
chromosome
True Path Violations
“…the pathway from a child term all the way up to its top-level parent(s) must always be true.”
nucleus
Part_of
relationship
Nuclear
chromosome
chromosome
Is_a relationships
mitochondrion
Part_of
relationship
Mitochondrial
chromosome
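The true path rule can be checked mechanically by listing every upward path from a term. The sketch below is a minimal illustration, not GO's own tooling: relationship types are dropped for brevity, and the names `upward_paths`, `reaches`, `FLAWED` and `REPAIRED` are hypothetical.

```python
def upward_paths(term, parents):
    """Return every path from `term` up to a term that has no parents."""
    if not parents.get(term):
        return [[term]]
    paths = []
    for parent in parents[term]:
        for tail in upward_paths(parent, parents):
            paths.append([term] + tail)
    return paths


def reaches(term, target, parents):
    """True if some upward path from `term` passes through `target`."""
    return any(target in path for path in upward_paths(term, parents))


# Flawed graph: chromosome itself is placed under nucleus.
FLAWED = {
    "mitochondrial chromosome": ["chromosome"],
    "chromosome": ["nucleus"],
}

# Repaired graph: each kind of chromosome gets its own location parent.
REPAIRED = {
    "nuclear chromosome": ["chromosome", "nucleus"],
    "mitochondrial chromosome": ["chromosome", "mitochondrion"],
}

# In the flawed graph the path runs through nucleus, which is false biology.
print(upward_paths("mitochondrial chromosome", FLAWED))
print(reaches("mitochondrial chromosome", "nucleus", FLAWED))
print(reaches("mitochondrial chromosome", "nucleus", REPAIRED))
```

A curator would still have to judge whether each path is biologically true; the code only makes the paths visible.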
The Development Node
(some example for consistent definitions)
Cell level
[i] y cell differentiation
---[p] y cell fate commitment
------[p] y cell fate specification
------[p] y cell fate determination
---[p] y cell development
------[p] y cellular morphogenesis during differentiation
------[p] y cell maturation
y cell differentiation
The process
whereby a relatively
unspecialized cell
acquires specialized
features of a y cell.
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=cmed6.figgrp.41173
y cell fate commitment
The process whereby the
developmental fate of a cell
becomes restricted such that it
will develop into a y cell.
y cell fate specification
The process whereby a cell becomes
capable of differentiating autonomously
into a y cell in an environment that is
neutral with respect to the developmental
pathway. Upon specification, the cell fate
can be reversed.
y cell fate determination
The process whereby a cell becomes
capable of differentiating autonomously
into a y cell regardless of its environment;
upon determination, the cell fate cannot be
reversed.
Gene Ontology widely adopted
AgBase
Terms are defined graphically relative to other terms
The Gene Ontology (GO)
1. Build and maintain logically rigorous and
biologically accurate ontologies
2. Comprehensively annotate reference genomes
3. Support genome annotation projects for all
organisms
4. Freely provide ontologies, annotations and
tools to the research community
Building the ontologies
• The GO is still developing daily both in ontological
structures and in domain knowledge
• Ontology development workshops focus on specific
domains needing experts
• 2 workshops / year
1. Metabolism and cell cycle
2. Immunology and defense response
3. Early CNS development
4. Peripheral nervous system development
5. Blood Pressure Regulation
6. Muscle Development
Mappings files
Fatty acid biosynthesis (Swiss-Prot Keyword) -> GO:Fatty acid biosynthesis (GO:0006633)
EC:6.4.1.2 (EC number) -> GO:acetyl-CoA carboxylase activity (GO:0003989)
IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry) -> GO:acetyl-CoA carboxylase activity (GO:0003989)
Building the ontology: Immune System Process
725 new terms related to immunology
Red part_of
Blue is_a
127 new terms added to cell type ontology
Annotating Gene Products using GO
Gene Product: P05147
GO Term: GO:0047519
Evidence: IDA
Reference: PMID:2976880
Gene protein inherits GO term
• There is evidence that this gene product
can be best classified using this term
• The source of the evidence and other
information is included
• There is agreement on the meaning of
the term
Annotations are assertions
Annotations for APP: amyloid beta (A4) precursor protein
We use evidence codes to describe the
basis of the annotation
Direct experiment in organism:
• IDA: Inferred from direct assay
• IPI: Inferred from physical interaction
• IMP: Inferred from mutant phenotype
• IGI: Inferred from genetic interaction
• IEP: Inferred from expression pattern
No direct experiment (inferred from evidence):
• IEA: Inferred from electronic annotation
• ISS: Inferred from sequence or structural similarity
• TAS: Traceable author statement
• NAS: Non-traceable author statement
• IC: Inferred by curator
• RCA: Reviewed computational analysis
• ND: No data available
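An annotation can be pictured as a small record of gene product, GO term, evidence code and reference, which makes filtering by evidence straightforward. The sketch below is an assumption-laden illustration: the `Annotation` class and `experimental_only` helper are hypothetical, the first record comes from the slide above, and the second record is invented purely to show the filter.

```python
from dataclasses import dataclass

# Evidence codes that rest on a direct experiment in the organism.
EXPERIMENTAL = {"IDA", "IPI", "IMP", "IGI", "IEP"}


@dataclass(frozen=True)
class Annotation:
    gene_product: str
    go_id: str
    evidence: str
    reference: str


annotations = [
    # Taken from the slide: P05147 annotated to GO:0047519 by direct assay.
    Annotation("P05147", "GO:0047519", "IDA", "PMID:2976880"),
    # Invented record, for illustration only.
    Annotation("HYPOTHETICAL_1", "GO:0006099", "IEA", "hypothetical-reference"),
]


def experimental_only(records):
    """Keep annotations whose evidence code is experimental."""
    return [a for a in records if a.evidence in EXPERIMENTAL]


for a in experimental_only(annotations):
    print(a.gene_product, a.go_id, a.evidence, a.reference)
```

Restricting an analysis to experimental codes trades coverage for confidence, since electronic annotations far outnumber manual ones.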
GO structure
• GO isn’t just a flat list of
biological terms
• terms are related within a
hierarchy
GO structure
gene
A
GO structure
• This means genes
can be grouped
according to user-defined levels
• Allows broad
overview of gene set
or genome
GO Annotation Stats (2007)
GO Annotations
Total manual GO annotations - 388,633
Total proteins with manual annotations – 80,402
Contributing Groups (including MGI): - 19
Total Pub Med References – 346,002
Total number predicted annotations – 17,029,553
Total number taxa – 129,318
Total number distinct proteins – 2,971,374
GO annotations
GO database
gene ->
GO term
associated genes
genome and protein
databases
Annotations of gene products to GO are genome specific
Now we can query across all annotations based on shared biological activity.
GO browser
Search on ‘mesoderm development’
mesoderm development
Definition of
mesoderm
development
Gene
products
involved in
mesoderm
development
Traditional analysis
Gene 1
Apoptosis
Cell-cell signaling
Protein phosphorylation
Mitosis
…
Gene 3
Growth control
Gene 4
Mitosis
Nervous system
Oncogenesis
Pregnancy
Protein phosphorylation
Oncogenesis
…
Mitosis
…
Gene 2
Growth control
Mitosis
Oncogenesis
Protein phosphorylation
…
Gene 100
Positive ctrl. of cell prolif
Mitosis
Oncogenesis
Glucose transport
…
Using GO annotations
• But by using GO annotations, this work
has already been done
GO:0006915 : apoptosis
Grouping by process
Apoptosis
Gene 1
Gene 53
Positive ctrl. of
cell prolif.
Gene 7
Gene 3
Gene 12
…
Mitosis
Gene 2
Gene 5
Gene 45
Gene 7
Gene 35
…
Glucose transport
Gene 7
Gene 3
Gene 6
…
Growth
Gene 5
Gene 2
Gene 6
…
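Grouping by process amounts to inverting a gene-to-terms mapping into a term-to-genes mapping. A minimal sketch follows, using three of the gene lists from the "Traditional analysis" slide; the function name `group_by_term` is hypothetical.

```python
from collections import defaultdict

# Per-gene process lists, as a biologist would collect them one gene at a time.
gene_to_terms = {
    "Gene 1": ["apoptosis", "cell-cell signaling",
               "protein phosphorylation", "mitosis"],
    "Gene 2": ["growth control", "mitosis",
               "oncogenesis", "protein phosphorylation"],
    "Gene 100": ["positive control of cell proliferation", "mitosis",
                 "oncogenesis", "glucose transport"],
}


def group_by_term(mapping):
    """Invert gene -> terms into term -> genes."""
    groups = defaultdict(list)
    for gene, terms in mapping.items():
        for term in terms:
            groups[term].append(gene)
    return dict(groups)


by_term = group_by_term(gene_to_terms)
print(by_term["mitosis"])
print(by_term["oncogenesis"])
```

With shared GO terms in place of free-text labels, this inversion works across databases without manual reconciliation of vocabulary.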
Anatomy of a GO term
id: GO:0006094
name: gluconeogenesis
namespace: process
def: The formation of glucose from
noncarbohydrate precursors, such as
pyruvate, amino acids and glycerol.
[http://cancerweb.ncl.ac.uk/omd/index.html]
exact_synonym: glucose biosynthesis
xref_analog: MetaCyc:GLUCONEO-PWY
is_a: GO:0006006
is_a: GO:0006092
Fields, in order: unique GO ID, term name, ontology, definition, synonym, database ref, parentage
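A term record of this tag-and-value shape is easy to read programmatically. The sketch below parses a copy of the stanza (the `def` line is left out for brevity) into a dictionary; `parse_stanza` is a hypothetical helper, not part of any GO library, and real ontology files carry more tags and escaping rules than it handles.

```python
STANZA = """\
id: GO:0006094
name: gluconeogenesis
namespace: process
exact_synonym: glucose biosynthesis
xref_analog: MetaCyc:GLUCONEO-PWY
is_a: GO:0006006
is_a: GO:0006092
"""


def parse_stanza(text):
    """Parse 'tag: value' lines into a dict of tag -> list of values."""
    term = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first ': ' only, so values such as GO:0006094 stay whole.
        tag, _, value = line.partition(": ")
        term.setdefault(tag, []).append(value.strip())
    return term


term = parse_stanza(STANZA)
print(term["name"][0])
print(term["is_a"])
```

Values are kept as lists because tags such as `is_a` may repeat, which is how a term records more than one parent.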
GO is a functional annotation
system of great utility to the data-driven biologist
GO enables genomic data
analysis
• Microarrays allow biologists to
record changes in gene
function across entire
genomes
• Result: Vast amounts of gene
expression data desperately
needing cataloging and
tagging
• Many data analysis tools use
GO graph structure to
statistically evaluate clusters of
co-expressed genes based on
shared functional annotations
GO supports functional classifications
OCT 13, 2006
Cancer Genome Projects
Nature: January 2007
GO is wildly successful
FIGURE 3. Representative cell-type-specific genes and corresponding molecular functions.
Comprehensively annotate Reference Genomes
• Human
• Mouse
• Fly
• Rat
• Chicken
• Zebrafish
• Worm
• Dicty
• E. coli
• Saccharomyces cerevisiae
• Schizosaccharomyces pombe
• Arabidopsis thaliana
Species coverage
• All major eukaryotic model organism
species
• Human via GOA group at UniProt
• Several bacterial and parasite species
through TIGR and GeneDB at Sanger
– many more in pipeline
Annotation coverage
GO tools
• GO resources are freely available to
anyone to use without restriction
– Includes the ontologies, gene associations
and tools developed by GO
• Other groups have used GO to create
tools for many purposes:
http://www.geneontology.org/GO.tools
GO tools
• Affymetrix also provide a Gene Ontology
Mining Tool as part of their NetAffx™
Analysis Center which returns GO terms
for probe sets
GO tools
• Many tools exist that use GO to find
common biological functions from a list of
genes:
http://www.geneontology.org/GO.tools.microarray.shtml
GO tools
• Most of these tools work in a similar way:
– input a gene list and a subset of ‘interesting’
genes
– tool shows which GO categories have most
interesting genes associated with them i.e.
which categories are ‘enriched’ for interesting
genes
– tool provides a statistical measure to
determine whether enrichment is significant
GO for microarray analysis
• Annotations give ‘function’ label to genes
• Ask meaningful questions of microarray
data e.g.
– genes involved in the same process,
same/different expression patterns?
Using GO in practice
• statistical measure
– how likely your differentially regulated genes
fall into that category by chance
[Figure: bar chart of a microarray experiment with 1000 genes, of which 100 are differentially regulated, counted by process]
mitosis – 80/100
apoptosis – 40/100
p. ctrl. cell prol. – 30/100
glucose transp. – 20/100
Using GO in practice
• However, when you look at the distribution
of all genes on the microarray:
Process               Genes on array   # genes expected in 100 random genes   occurred
mitosis               800/1000         80                                     80
apoptosis             400/1000         40                                     40
p. ctrl. cell prol.   100/1000         10                                     30
glucose transp.       50/1000          5                                      20
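The comparison of expected and observed counts can be turned into a statistical measure. One common choice, assumed here and not necessarily the test any particular GO tool uses, is the one-sided hypergeometric test. The sketch below applies it to the numbers on these slides using only the Python standard library; `hypergeom_tail` and `PROCESSES` are hypothetical names.

```python
from math import comb

ARRAY_SIZE = 1000   # genes on the microarray
SELECTED = 100      # differentially regulated genes

# process -> (genes on array with this term, genes among the selected 100)
PROCESSES = {
    "mitosis": (800, 80),
    "apoptosis": (400, 40),
    "p. ctrl. cell prol.": (100, 30),
    "glucose transp.": (50, 20),
}


def hypergeom_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from N,
    of which K carry the annotation."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


results = {}
for name, (on_array, observed) in PROCESSES.items():
    expected = SELECTED * on_array / ARRAY_SIZE
    fold = observed / expected
    p_value = hypergeom_tail(observed, ARRAY_SIZE, on_array, SELECTED)
    results[name] = (expected, fold, p_value)
    print(f"{name:22s} expected={expected:5.1f} observed={observed:3d} "
          f"fold={fold:.1f} p={p_value:.3g}")
```

Mitosis and apoptosis have the largest raw counts but match what chance alone predicts, while the two smaller categories occur three and four times more often than expected. In practice a correction for testing many GO terms at once would also be needed.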
Enrichment tools
• GO is developing its own enrichment tool
as part of the GO browser AmiGO
• Currently in testing phase, should be
released next month