NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 , 2012

NGS Bioinformatics Workshop
1.5 Genome Annotation
April 4th, 2012
IRMACS 10900, SFU
Facilitator: Richard Bruskiewich
Adjunct Professor, MBB
Acknowledgment:
This week’s slides partly courtesy of Professor Fiona Brinkman, MBB,
with material from Wyeth W. Wasserman and Shannan Ho Sui
Overview
General observations about genomes
Repeat masking
Gene finding
Gene regulation and promoter analysis
NGS Bioinformatics Workshop 1.5 Genome Annotation
SOME GENERAL OBSERVATIONS
ABOUT GENOMES
3
General Variables of Genomes
Prokaryote versus Eukaryote versus Organelle
Genome size:
Number of chromosomes
Number of base pairs
Number of genes
GC/AT relative content
Repeat content
Genome duplications and polyploidy
Gene content
See: Genomes, 2nd edition Terence A Brown. ISBN-10: 0-471-25046-5
See NCBI Bookshelve: http://www.ncbi.nlm.nih.gov/books/NBK21128/
Eukaryote versus Prokaryote Genomes
Eukaryote
Prokaryote
 Large (10 Mb – 100,000 Mb)
Size
Content
 There is not generally a
relationship between organism
complexity and its genome size
(many plants have larger
genomes than human!)
 Most DNA is non-coding
 Generally small (<10 Mb; most < 5Mb)
 Complexity (as measured by # of genes
and metabolism) generally proportional
to genome size
 DNA is “coding gene dense”
 Circular DNA, doesn't need telomeres
Telomeres/
Centromeres
 Present (Linear DNA)
Number of
chromosomes
 More than one, (often) including
those discriminating sexual
identity
Chromatin
 Histone bound (which serves as a
genome regulation point)
 Don’t have mitosis, hence, no
centromeres.
 Often one, sometimes more, -but
plasmids, not true chromosome.
 No histones
 Uses supercoiling to pack genome
Eukaryote versus Prokaryote Genomes
Eukaryote
Prokaryote
 Often have introns
 Intraspecific gene order and number
generally relatively stable
Genes
 many non-coding (RNA) genes
 There is NOT generally a relationship
between organism complexity and gene
number
Gene regulation
 Promoters, often with distal long range
enhancers/silencers, MARS, transcriptional
domains
 Generally mono-cistronic
Repetitive sequences
Organelle
(subgenomes)
 Generally highly repetitive with genome wide
families from transposable element
propagation
 Mitochondrial (all)
 chloroplasts (in plants)
 No introns
 Gene order and number may
vary between strains of a species
 Promoters
 Enhancers/silencers rare
 Genes often regulated as
polycistronic operons
 Generally few repeated
sequences
 Relatively few transposons
 Absent
Genome Size
Physical:
Amount of DNA / number of base pairs
Number of chromosomes/linkage groups
Information resources:
NCBI: http://www.ncbi.nlm.nih.gov/genome
Animals: http://www.genomesize.com/
Plants: http://data.kew.org/cvalues/
Fungi: http://www.zbi.ee/fungal-genomesize/
Genetic:
Number of genes in the genome
Gregory TR. 2002. Genome size and developmental
complexity. Genetica. May;115(1):131-46.
Size of Organelle Genomes
Species
Type of organism
Genome size (kb)
Mitochondrial genomes
Plasmodium falciparum
Protozoan (malaria parasite)
6
Chlamydomonas reinhardtii
Green alga
16
Mus musculus
Vertebrate (mouse)
16
Homo sapiens
Vertebrate (human)
17
Metridium senile
Invertebrate (sea anemone)
17
Drosophila melanogaster
Invertebrate (fruit fly)
19
Chondrus crispus
Red alga
26
Aspergillus nidulans
Ascomycete fungus
33
Reclinomonas americana
Protozoa
69
Saccharomyces cerevisiae
Yeast
75
Suillus grisellus
Basidiomycete fungus
121
Brassica oleracea
Flowering plant (cabbage)
160
Arabidopsis thaliana
Flowering plant (vetch)
367
Zea mays
Flowering plant (maize)
570
Cucumis melo
Flowering plant (melon)
2500
Chloroplast genomes
Pisum sativum
Flowering plant (pea)
120
Marchantia polymorpha
Liverwort
121
Oryza sativa
Flowering plant (rice)
136
Nicotiana tabacum
Flowering plant (tobacco)
156
Chlamydomonas reinhardtii
Green alga
195
http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5511
Size of Prokaryote Genomes
Species
Escherichia coli K-12
DNA molecules
One circular molecule
Size (Mb)
4.639
Number of genes
4397
Vibrio cholerae El Tor N16961
Two circular molecules
Main chromosome
2.961
2770
Megaplasmid
Four circular molecules
1.073
1115
Deinococcus radiodurans R1
2.649
0.412
0.177
0.046
2633
369
145
40
Borrelia burgdorferi B31
Chromosome 1
Chromosome 2
Megaplasmid
Plasmid
seven or eight circular molecules, 11 linear molecules
Linear chromosome
0.911
853
Circular plasmid cp9
0.009
12
Circular plasmid cp26
0.026
29
Circular plasmid cp32*
0.032
Not known
Linear plasmid lp17
0.017
25
Linear plasmid lp25
0.024
32
Linear plasmid lp28-1
0.027
32
Linear plasmid lp28-2
0.030
34
Linear plasmid lp28-3
0.029
41
Linear plasmid lp28-4
0.027
43
Linear plasmid lp36
0.037
54
Linear plasmid lp38
0.039
52
Linear plasmid lp54
0.054
76
Linear plasmid lp56
0.056
Not known
http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5524
Size of Eukaryote Genomes
Species
Genome size (Mb)
Fungi
Saccharomyces cerevisiae
12.1
Aspergillus nidulans
25.4
Protozoa
Tetrahymena pyriformis
190
Invertebrates
Caenorhabditis elegans
97
Drosophila melanogaster
180
Bombyx mori (silkworm)
490
Strongylocentrotus purpuratus (sea urchin)
845
Locusta migratoria (locust)
5000
Vertebrates
Takifugu rubripes (pufferfish)
400
Homo sapiens
3200
Mus musculus (mouse)
3300
Plants
Arabidopsis thaliana (vetch)
125
Oryza sativa (rice)
430
Zea mays (maize)
2500
Pisum sativum (pea)
4800
Triticum aestivum (wheat)
Fritillaria assyriaca (fritillary)
16 000
120 000
http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5471
http://en.wikipedia.org/wiki/Genome_size
http://en.wikipedia.org/wiki/Genome#Comparison_of_different_genome_sizes
Number of Genes
Species
Ploidy
Cs
Size (Mb)
No. Genes
Saccharomyces cerevisiae
2
16
12
6,281
Plasmodium falciparum
2
14
23
5,509
Caenorhabditis elegans
2
6
100
21,175
Drosophila melanogaster
2
6
139
15,016
Oryza sativa
2
12
410
30,294
Canis lupus familaris
2
39
2,445
24,044
Homo sapiens
2
24
3,100
36,036
Zea mays
2
10
2,046
42,000-56,000 (*)
Protopterus aethiopicus
?
?
130,000
?
Paris japonica
8
40
150,000
?
Polychaos dubium
?
?
670,000
?
http://www.ncbi.nlm.nih.gov/genome
(*) Haberer et al., Structure and architecture of the maize
genome. Plant Physiol. 2005 Dec139(4):1612-24
AT/GC content
 Regional variations correlates with genomic content
and function like transposable element distribution,
gene density, gene regulation, methylation, etc.
 Often introduces bias in sequencing processes (e.g.
library yields, PCR amplification, NGS sequencing)
Species
GC%
Streptomyces coelicolor A3(2)
72
Plasmodium falciparum
20
Arabidopsis thaliana
36
Saccharomyces cerevisiae
38
Arabidopsis thaliana
36
Homo sapiens
41 (35 – 60)
Romiguier et al. 2010. Contrasting GC-content dynamics across 33 mammalian genomes:
Relationship with life-history traits and chromosome sizes. Genome Res. 20: 1001-1009
Repeat Content
 Large genomes generally reflect evolutionary expansion of
large families of repetitive DNA (by RNA/DNA transposon
amplification/insertion, genetic recombination)
 Repeats drive genome mutational processes:
 Recombination resulting in insertion, deletion, translocation,
segmental duplication of DNA
 Insertional mutagenesis, possibly including de novo creation of genes
 Insert novel regulatory signals
Jurka et al. 2007. Repetitive sequences in complex genomes: structure and evolution.
Annu Rev Genomics Hum Genet. 2007;8:241-59.
 Repeats generally confound genome sequence assembly
(especially for NGS, due to short reads). Gene annotation can
also be problematic as transposons mimic gene structures.
Genome Duplications/Polyploidy
 Segmental duplications (i.e. by recombination)
Tandem: direct and inverted
 Whole genome duplication & loss, e.g.
Ancestral vertebrate: 2 rounds
o HOX gene clusters…
Dehal P and Boore JL.2005. Two Rounds of Whole Genome Duplication in the
Ancestral Vertebrate. PLoS Biol 3(10) : e314. doi:10.1371
 Polyploidy - ~70% of all angiosperms
Genomic hybridization (allopolyploids)
Can lead to immediate and extensive changes in gene
expression
Mapping of homeologous gene loci can be tricky
Adams and Wendel. 2005. Polyploidy and genome evolution in plants.
Curr. Opin. Plant Biol. 8(2):135–141
Case Study: Rice
 10 duplicated blocks identified on all 12 chromosomes
of Oryza sativa subspecies indica, that contained 47%
of the total predicted genes.
 Possible genome duplication occurred ~70 million
years ago, supporting the polyploidy origin of rice.
 Additional segmental duplication identified involving
chromosomes 11 and 12, ~5 million years ago.
 Following the duplications, there have been large-scale
chromosomal rearrangements and deletions. About
30–65% of duplicated genes were lost shortly after the
duplications, leading to a rapid diploidization.
Wang et al. 2005. Duplication and DNA segmental loss in the rice genome:
implications for diploidization New Phytologist 165: 937–946
Gene Content
http://www.ncbi.nlm.nih.gov/books/NBK21120/figure/A5501
The Bottom Line
All of these genomic variables:
Type of organism: i.e. prokaryote versus eukaryote
Genome size
GC/AT relative content
Repeat content
Genome duplications and polyploidy
Gene content
are important factors that can drive the strategy,
expected outcome and efficacy of genome
sequence assembly and annotation.
NGS Bioinformatics Workshop 1.5 Genome Annotation
GENOME REPEAT MASKING
19
Genomic (DNA) Sequence Repeat Masking
 Classic approach: search against repeat libraries
 RepeatMasker
http://www.repeatmasker.org/
Uses a previously compiled library of repeat families
Uses (user configured) external sequence search program
Computationally intensive but…
…the project web site also provides “pre-masked” genomic
data for many completed genomes, complete with some
statistical characterization.
More Repeat Masking …
 de novo identification and classification:
RECON: http://www.genetics.wustl.edu/eddy/recon
RepeatGluer: http://nbcr.sdsc.edu/euler/
PILER: http://www.drive5.com/piler
 Repeat databases:
Repbase: http://www.girinst.org/repbase/index.html
plants: http://plantrepeats.plantbiology.msu.edu/
 Related algorithms:
“probability clouds”
Gu et al. 2008. Identification of repeat structure in large genomes using
repeat probability clouds. Anal Biochem. 380(1): 77–83
NGS Bioinformatics Workshop 1.5 Genome Annotation
GENE FINDING
…OR “WHAT IS A GENE?”
23
Objectives
 Review of differences in prokaryotic and
eukaryotic gene organization.
 Understand consequences and challenges
for gene finding algorithms for Prokaryotes
and Eukaryotes.
 Appreciate HMM as powerful tool (in many
areas of computational biology!)
 Be reminded that not all genes encode
proteins and predictions of such genes have
their own computational challenges.
24
Gene Annotation Questions
Which genes are present?
How did they get there (evolution)?
Are the genes present in more than one copy?
Which genes are not there that we would
expect to be present?
What order are the genes in, and does this
have any significance?
How similar is the genome of one organism to
that of another?
Why Gene-finding?
Whole-genome annotation
Genome sequence does not give you list of all
genes
Fully characterizing Yfg (“your favourite gene”)
example: A disease is associated with a SNP in a
location in the human genome. BLAST finds
similarity to a protein coding gene in the area, but
its only similar to part of the whole protein.
What’s the whole gene?
26
After completing the human genome we faced 3 Gigabytes of this
Not immediately apparent where the genes are…
What is a gene?
 A distinct functional unit encoded (at minimum
transcribed) in the genome
 Eukaryotes: The region that is transcribed –
the mRNA defines it
 Prokaryotes: The coding region – because
multiple genes may be on one transcript as an
operon
 Both eukaryotes and prokaryotes: non-coding
RNAs are treated relatively the same
29
Raw Biological Materials
Prokaryotes
Eukaryotes
• High gene density
• mRNA transcriptiontranslation is coupled
• Genes are usually
contiguous stretches of
coding DNA
• mRNAs often polycistronic
• Low gene density
• mRNA transcribed then
transported to cytoplasm for
translation
• Genes’ coding DNA often
split by non-coding introns
• mRNAs are generally
monocistronic
gene
gene
____________________
___________
 transcript 
Great real-time Transcription-Translation video: http://www.youtube.com/watch?v=41_Ne5mS2ls
30
How many genes in human genome?
• 2000: must be at least 100,000 (Rice has ~40,000,
C. elegans has ~19,000)
• 2001: only 35,000?
• 2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts)
• 2006: Ensembl NCBI 36 release: 23,710 protein coding genes, plus
4421 RNA genes (48,851 transcripts)
• Today: Ensembl 64 release, Sept 2011, is 20,900 coding genes +
14,266 RNA genes - but with alternative splicing these produce
likely many more…
How many genes in human genome?
• 2000: must be at least 100,000 (Rice has ~40,000,
C. elegans has ~19,000)
• 2001: only 35,000?
• 2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts)
• 2006: Ensembl NCBI 36 release: 23,710 protein coding genes, plus
4421 RNA genes (48,851 transcripts)
• Today: Ensembl 64 release, Sept 2011, is 20,900 coding genes +
14,266 RNA genes - but with alternative splicing these produce
likely >100,000 proteins (178,191 currently annotated in Ensembl)
Gene Density
1 gene in how many basepairs?...
a.
b.
c.
d.
e.
f.
g.
1:10,000,000
1:1,000,000
1:100,000 roughly for human
1:10,000 (1:5000 for C. elegans)
1:1000 roughly for most bacteria
1:100
1:10
33
Approaches to Finding Genes
 AB INITIO (‘from the beginning’):
search by content: find genes by statistical properties that
distinguish protein-coding DNA from non-coding DNA
Search by signals/sites: splice donor/acceptor sites, promoters,
etc.
 HOMOLOGY:
search by sequence similarity to homologous sequences
state-of-the-art gene finding combines these strategies and
also uses other data like EST/RNA-seq/microarray data
34
Gene-finding in Prokaryotes:
Easy? ….or not?
ORF Finder
Open reading frame (ORF) from methionine
codon to first Stop codon
ORFs linked to BLAST
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Problem: All ORFs are not genes. Why?
How can this be improved?
35
Gene Finding: Ab Initio Search by Content
 Protein code affects the statistical properties of a DNA
sequence
– some amino acids are used more frequently than others
(Leu more popular than Trp)
– different numbers of codons for different amino acids
(Leu has 6, Trp has 1)
– for a given amino acid, usually one codon is used more
frequently than others
• this is termed codon preference
• these preferences vary by species
36
Gene-finding in Prokaryotes:
Improving predictions…
Common way to search by content
 build Markov models of coding & noncoding regions
 apply to ORFs or fixed-sized sequence windows
Markov Model approaches: prokaryotic gene prediction
 Glimmer
 http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
 http://cbcb.umd.edu/software/glimmer/
 open source
 GeneMark
 http://opal.biology.gatech.edu/GeneMark/
 not open source but slightly more accurate
37
i.e. If its sunny then the
next weather state is
not likely to become
suddenly rainy.
If its cloudy then it is
more likely the next
weather state will be
rainy.
Reproduced with permission,
Chris Burge
38
For gene prediction:
Certain bases occur
after others in coding
sequence versus noncoding sequence
Reproduced with permission,
Chris Burge
39
What’s Hidden in a Hidden Markov Model?
The HMM is a model of what actually
happened (the true gene structure) but all you
can see is the sequence
Each nucleotide is not labeled with its state;
it’s hidden
Based on training data – which is critical!
Glimmer version 3
 annotates microbial genomes with a sensitivity of 99% and
accuracy of >98%
 resolves overlapping genes
 use appropriate gene dataset for training program
 no annotation of rRNA and tRNA genes
 employs interpolated Markov model
NAR Vol.26 p1107-1115
42
GeneMark.hmm
 HMM plus ribosome binding site signals in primary nucleotide
sequence
 training also required
 Most accurate prokaryotic gene predictor (by a bit over
Glimmer)
 GeneMarkS version - translation start prediction: 83.2% of
Bacillus subtilis genes
94.4% of Escherichia coli genes
43
Main issues with prokaryotic gene prediction
 Which ATG or GTG start to choose? Not that accurate
 Errors in sequence  frame shifts that lead to errors in gene
prediction
 Very small genes either not predicted at all (most prokaryotic
genome projects don’t annotate any genes < 90 bp) or poorly
predicted
 Non-coding RNA genes tend to be ignored!
 RNA-seq data being incorporated now…
44
RNA-seq
 Sequence cDNA from
RNA using “next
generation”
sequencing
 Map sequence reads
to ref genome
 Count number of
sequence reads
(depth of reads) for a
given window of ref
sequence
45
Rne (Rnase E) Upstream Region in P. aeruginosa
Rne (Rnase E) Upstream Region
• Upstream region of Rne in E. coli forms stem loop to regulate
mRNA synthesis
• Region aligns reasonably well between spp.
• Also predicted to encode a small non-coding RNA using QRNA
and RNAz
E. coli
PAO1
Gene prediction in eukaryotes
Overview of eukaryotic gene structure will
illustrate the gene-finding problem
Components described are some of those that
gene-finding tools must model
These signals are never 100% reliable because
there are (almost) as many exceptions as rules
Signals (and noise) vary with organism
3% of human genome is “coding” vs. 25% in fugu
The problem with eukaryotes:
the easy stuff?
49
The problem with eukaryotes:
the hard stuff
~50% of human genes undergo alternative
splicing
Genes can be nested: overlapping on same or
opposite strand or inside an intron
Pseudogenes: “dead” genes
Raw Biological Materials: Eukaryotes
Fig 10.1 Baxevanis and Ouellette, 2000
Transcription Start
core promoter
...
TSS – transcription start site
• ~70% of promoter regions have TATA box
Fig 11.25 from Introduction to Genetic Analysis, 7th ed. Griffiths et al. New York: W H Freeman & Co; c1999.
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=iga.figgrp.d1e46911
End Modification
5’cap added at 5’ end of transcript
polyA-tail added at AATAAA consensus site
Splice site recognition
5’ (a.k.a. donor)
exon
intron
Logos from:
http://genio.informatik.uni-stuttgart.de/GENIO/splice/
3’ (a.k.a. acceptor)
intron
exon
The problem with splice signals
~0.5% of splice sites are non-canonical, i.e. not
GT-AG
~5% of human genes may have a noncanonical splice site
~50% of human genes are alternatively spliced
Intron Phase
A codon can be interrupted by an intron in one of three places
and still code for the same amino acid
Example CAG -> glutamine
exon
Phase 0:
Phase 1:
Phase 2:
C
CA
intron
exon
CAG
AG
G
Translation signals
Start AUG
Note: Bacteria also use GUG
(~10% of starts) and UUG (rarely)
Stop
UAA, UAG, or UGA
Translation signals
Translation Start: “usually” first AUG after 5’
cap
But Kozak consensus sequence aids the
selection so the 2nd or 3rd AUG might be used
Translation Stop: first in-frame stop codon
(simple!)
Gene-finding Strategies
Similarity-based a.k.a. lookup method
genome:protein
genome:genome
ab initio a.k.a. template method
Experimental
microarrays
RNA Sequencing
Similarity – genome:protein
blastx finds similarity between translated
genomic region and known proteins
Problem: on human chr22 only 50% of genes
had similarity to known proteins when first
annotated
Similarity – genome:genome
 Functional sequences will be conserved
 Learn different things from different evolutionary
distances (signal:noise ratio)
coding sequences
regulatory sequences
 Compare mouse:human, pufferfish:human (Exofish)
 Problems:
you need something “similar”
won’t find precise boundaries
ab initio Gene-finding
Based on intrinsic information
Coding statistics plus signal sensors
No prior knowledge of similar genes needed
Problem: gene models can only approach the
biological truth (recall eukaryotic gene
structure variability)
Coding Statistics
Exon/intron base composition differences
indicate protein-coding regions
Use frequency of occurrence of hexamers
Signal Sensors
 Model features recognized by transcriptional, splicing
and translational machinery
Promoter elements
Splice sites
PolyA consensus sites
Start and stop codons
 Use pattern recognition methods as signal detectors
Weight matrices
Neural networks
Decision trees
What don’t (coding) gene finders do?
Explicit alternative splicing models
Overlapping or nested genes
Non-protein-coding genes
Tools for finding genes…
Check out the Bioinformatics Links Directory at
Bioinformatics.ca
http://bioinformatics.ca/links_directory/
Some tools for finding genes…
http://bioinformatics.ca/links_directory/category/dna/gene-prediction
Tools for eukaryotic gene prediction
Ab Initio
genscan, fgenesh
Homology-based prediction
Genewise (uses predicted proteins from one
genome/seq to drive exon finding in a second
genome/seq, combined with splice-site model)
Twinscan (builds two genome predictions together,
using conserved DNA signal as one criterion)
Gene Finding in General: Training Sets
ab initio gene-finders must be trained on real
sequences
Mostly single genes from specific groups of
organisms (or the organism being studied)
Bias of training set will affect a tool’s
prediction accuracy
Example: HMR195 has 195 single- or multiexon murine genes with typical splice sites
and confirmed by cDNA (Rogic et al, 2001)
GENSCAN Dissection
Reproduced with permission,
Chris Burge
Reproduced with permission,
Chris Burge
Genscan HSMM
• Circles and diamonds are
states e.g. exon, phase 0
on plus strand
• Arrows are probabilities of
transition between states
• Sequence you observe is
generated from a series
of states
72
Reproduced with permission, Chris Burge
Reproduced with permission,
Chris Burge
Reproduced with permission,
Chris Burge
74
Reproduced with permission,
Chris Burge
75
Genscan HSMM
• Circles and diamonds are
states e.g. exon, phase 0
on plus strand
• Arrows are probabilities of
transition between states
• Sequence you observe is
generated from a series
of states
76
Reproduced with permission, Chris Burge
Signal Sensors
Model features recognized by transcriptional,
splicing and translational machinery
Start codon
12bp WMM (Weight Matrix Model); probabilities
estimated from GenBank CDS feature annotations
5’ splice site
MDD (Maximal Depedence Decomposition) developed
by Burge and Karlin (1997) to allow for dependencies
between adjacent and non-adjacent nucleotides
Coding Statistics
Exon/intron base composition differences
indicate protein-coding regions
Use frequency of occurrence of hexamers
Model with 5th order HMMs
Non-coding: homogeneous
Coding: heterogeneous
different model, depending on whether it’s 1st, 2nd, 3rd
codon position
The predicted gene structure is
the optimal combination of the above
Reproduced with permission,
Chris Burge
GENSCAN
Strengths
Works well from fly to human (and plants with
some tweaks)
Weaknesses
Practical
Initial exons; HMMgene does (a bit) better
Gene boundaries; tends to join or split genes
Models
Promoter and polyA site models
Transcription start sites (TSS)
Homology Based Approach
 Use evolutionary divergence as a tool
 Based on the principal that introns and other
non-coding regions usually diverge much
faster over time
 Helped by the fact that many splice patterns
have been determined experimentally (cDNA)
 Fails when the divergence of distant
sequences is too extreme (<50%; due to
rapidly evolving proteins or large evolutionary
distances)
GeneWise (Ewan Birney)
• GeneWise aligns a protein sequence (or HMM) to
genomic DNA taking into account splicing
information
See also “Exonerate”
The birth of comparative genomics…
February 2001
December 2002
Comparative Genefinders
• Twinscan -based on Genscan (Korf, 2001)
• ROSETTA – global alignment of orthologous
regions (Batzoglou et al, 2000)
• SLAM (Alexandersson) and Doublescan
(Meyer)
– jointly align while finding predicting genes
– computationally expensive (extra search
dimension) and hard to parameterise
– still effectively requires a full alignment
• SGP2 (Parra, 2003)
– Combines ab initio gene prediction with TBLASTX
Evaluation of Twinscan
Paul Flicek et al. Genome Res. 2003: TWINSCAN pred (red), GENSCAN (green), and
an aligned RefSeq transcript (blue). Yellow: low-complexity regions, black: mouse
alignments
Evaluation of Twinscan
Paul Flicek et al. (Genome Res. 2003): Accuracy of GENSCAN
and TWINSCAN by the exact gene, exact exon, and coding
nucleotide measures
Twinscan (N-SCAN) in UCSC genome browser
Region of seq1 (from HMR195 dataset) presented
Twinscan URL: http://ardor.wustl.edu
What about non-coding RNAs (ncRNAs)?
88
What are non-coding RNAs (ncRNAs)?
 ncRNA genes make transcripts that function directly
as RNA, rather than encoding proteins.
– ribosomal RNA, transfer RNA, RNase P, (tmRNA)
– E.coli translational regulatory RNAs, e.g. OxyS
– Eukaryotes: snRNA, snoRNA (& Archaea)
89
Example predictor: QRNA
- probabilistic model III
RNA: pair stochastic context-free grammar
base stacking, loop length (MFOLD)
Prediction of non-coding RNAs
1. Identification of ncRNAs with known RNA families
RFAM: a database of RNA families (Griffiths-Jones et
al., 2005)
2. Identification of novel ncRNAs
Obtain orthologous sequences using BLASTN
Multiple sequence alignment using MAFFT
ncRNA prediction using RNAz and QRNA
Analyze overlapping predictions and filter false
positives
Need lab validation (RT-PCR, RNA-seq…)
Analysis of non-coding RNAs: Examples of
Research Leaders
1. Sean Eddy (HHMI)
RFAM, QRNA
http://selab.janelia.org/
2. Cenk Sahinalp (SFU)
TaveRNA suite for structural analysis and
interaction prediction
http://compbio.cs.sfu.ca/
Rne (Rnase E) Upstream Region in P. aeruginosa
Rne (Rnase E) Upstream Region
• Upstream region of Rne in E. coli forms stem loop to regulate
mRNA synthesis
• Region aligns reasonably well between spp.
• Also predicted to encode a small non-coding RNA using QRNA
and RNAz
E. coli
PAO1
Summary
• Genes are complex structures, which are
difficult to predict
• Different approaches to gene finding:
– Ab Initio : i.e. GenScan (use appropriate training
dataset!)
– Homology guided: i.e. GeneWise
Outstanding Issues
Most Gene finders cannot handle untranslated
regions (UTRs)
~40% of human genes have non-coding 1st exons
(UTRs)
Most gene finders cannot handle alternative
splicing
Most gene finders cannot handle overlapping or
nested genes
Bottom Line
Gene finding in eukaryotes is not solved
Accuracy of the best methods approaches 80%
at the exon level (90% at the nucleotide level) in
coding-rich regions (much lower for whole
genomes)
Gene predictions should always be verified by
other means (cDNA sequencing RNA-seq/EST,
BLAST search)
Whole Genome Annotation
A lot of work has been done for you!
NCBI Map Viewers
Ensembl
UCSC Genome Browser
98
Top 10 Future Challenges
Ewan Birney, Chris Burge, Jim Fickett in Genome Technology
1. Precise, predictive model of transcription initiation and
termination: ability to predict where and when
transcription will occur in a genome
2. Precise, predictive model of RNA splicing/ alternative
splicing: ability to predict the splicing pattern of any
primary transcript in any tissue
3. Precise, quantitative models of signal transduction
pathways: ability to predict cellular responses to external
stimuli
4. Determining effective protein:DNA, protein:RNA and
protein:protein recognition codes
5. Accurate ab initio protein structure prediction
5 minute break?
NGS Bioinformatics Workshop
1.5 Genome Annotation
GENE REGULATION
AND PROMOTER ANALYSIS
With material from Wyeth W. Wasserman and Shannan Ho Sui
Objectives
Note: Focus on Eukaryotic cells and Polymerase II Promoters
 Overview of transcription
 Structure of eukaryotic promoter
 Laboratory methods for identifying binding motifs
 Prediction of transcription factor binding sites using binding
profiles (“Discrimination”)
 TFBS scanning – a great example of how bioinformatics can be
tweaked when we don’t fully understand the biology!
 Interrogation of sets of co-expressed genes to identify
mediating transcription factors
 TFBS Over-representation
 Detection of novel motifs (TFBS) over-represented in
regulatory regions of co-expressed genes (“Discovery”)
 Motif Discovery!
Transcription Simplified
Three-step Process:
1. Transcription factor (TF; a
protein) binds to transcription
factor binding site (TFBS; in
DNA)
2. TF catalyzes recruitment of polII
complex
3. Production of RNA from
transcription start site (TSS)
TF
Pol-II
TFBS
TATA
TSS
Anatomy of Transcriptional Regulation
WARNING: Terms vary widely in meaning between scientists
Core Promoter/Initiation Region (Inr)
Distal Regulatory Region
TFBS
TFBS
TFBS
Proximal Regulatory Region
TFBS
TFBS
TATA
TSR
EXON
Distal Regulatory region
TFBS
TFBS
EXON
 Core Promoter – Sufficient for initiation of transcription;
orientation dependent (i.e. immediately upstream, functions in
only one orientation)
 TSR – transcription start region
o Refers to a region rather than specific start site (TSS)
 TFBS – single transcription factor binding site
 Regulatory Regions




Proximal/Distal – vague reference to distance from TSR
May be positive (enhancing) or negative (repressing)
Orientation independent and location independent (generally)
Modules – Sets of TFBS within a region that function together
Complexity in Transcription
Chromatin
Distal enhancer
Proximal enhancer
105
Core Promoter
Distal enhancer
Refer again to Transcription in video: http://www.youtube.com/watch?v=41_Ne5mS2ls
Transcription Factors
 Bind to enhancers found in vicinity of gene
 These proteins promote transcription by (i) binding DNA and
(ii) activating transcription
 These two functions usually reside on separate structural domains of
the transcription factor protein
 Activation occurs via interactions with other transcription factors and
/or RNA polymerase
106
DNA Binding Domains – Conserved Motifs
Leucine Zipper Motif
Zinc Finger Motif
Helix-turn-Helix Motif
Helix-loop-Helix
Motif
Structural motifs within different types of transcription factors.
See also http://www.youtube.com/watch?v=7EkSBBDQmpE
Lab Discovery of TF Binding Sites
0%
Reporter Gene Activity
100%
LUCIFERASE
LUCIFERASE
LUCIFERASE
LUCIFERASE
LUCIFERASE
LUCIFERASE
LUCIFERASE
mutation
Identify functional regulatory region within a sequence and delineate
specific TFBS through mutagenesis (and in vitro binding studies)
EMSA/Gel Shift Assays
to Identify Binding Proteins
TF + DNA
DNA
http://www.biomedcentral.com/content/figures/1741-7015-4-28-8.jpg
High-throughput Methods
 SELEX
mix random ds DNA oligonucleotides with TF protein,
recover TF-DNA complexes and sequence DNA
 Protein Binding Arrays (UniProbe Database)
prepare arrays with ds DNA attached, label protein with a
fluorescent mark and observe DNA bound by protein
 ChIP (Chromatin Immunoprecipitation)
covalently link proteins to DNA in cell, shear DNA, recover
protein-DNA complexes and identify DNA (PCR, array or
sequencing)
Representing Binding Sites for a TF
• A single site
• AAGTTAATGA
• A set of sites represented as a consensus
• VDRTWRWWSHD (IUPAC degenerate DNA)
IUPAC (International Union of Pure and Applied
Chemistry) nomenclature
• A way to represent ambiguity when more than one
value is possible at a particular position
For DNA
A
C
G
T
R
Y
K
M
=
=
=
=
=
=
=
=
adenine
cytosine
guanine
thymine
G or A (purine)
T or C (pyrimidine)
G or T (keto)
A or C (amino)
S
W
B
D
H
V
N
=
=
=
=
=
=
=
G or C (strong bonds)
A or T (weak bonds)
G, T or C (all but A)
G, A or T (all but C)
A, C or T (all but G)
G, C or A (all but T)
A, G, C or T (any)
Set of
binding
sites
AAGTTAATGA
CAGTTAATAA
GAGTTAAACA
CAGTTAATTA
GAGTTAATAA
CAGTTATTCA
GAGTTAATAA
CAGTTAATCA
AGATTAAAGA
AAGTTAACGA
AGGTTAACGA
ATGTTGATGA
AAGTTAATGA
AAGTTAACGA
AAATTAATGA
GAGTTAATGA
AAGTTAATCA
AAGTTGATGA
AAATTAATGA
ATGTTAATGA
AAGTAAATGA
AAGTTAATGA
AAGTTAATGA
AAATTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA
Representing Binding Sites for a TF
• A single site
• AAGTTAATGA
• A set of sites represented as a consensus
• VDRTWRWWSHD (IUPAC degenerate DNA)
• A matrix describing a set of sites:
A
C
G
T
14 16 4 0 1 19 20 1
3 0 0 0 0 0 0 0
4 3 17 0 0 2 0 0
0 2 0 21 20 0 1 20
4 13 4 4 13 12 3
7 3 1 0 3 1 12
9 1 3 0 5 2 2
1 4 13 17 0 6 4
Logo – A graphical
representation of frequency
matrix. Y-axis is information
content , which reflects the
strength of the pattern in each
column of the matrix
Set of
binding
sites
AAGTTAATGA
CAGTTAATAA
GAGTTAAACA
CAGTTAATTA
GAGTTAATAA
CAGTTATTCA
GAGTTAATAA
CAGTTAATCA
AGATTAAAGA
AAGTTAACGA
AGGTTAACGA
ATGTTGATGA
AAGTTAATGA
AAGTTAACGA
AAATTAATGA
GAGTTAATGA
AAGTTAATCA
AAGTTGATGA
AAATTAATGA
ATGTTAATGA
AAGTAAATGA
AAGTTAATGA
AAGTTAATGA
AAATTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA
Conversion of PFMs to Position Specific Scoring
Matrices (PSSM)
PSSMs also known as Position Weight Matrices(PWMs)
Add the following features to the matrix profile:
1. Correct for nucleotide frequencies in genome
2. Weight for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic
pssm
pfm
A
C
G
T
5
0
0
0
0
2
3
0
1
2
1
1
0
4
0
1
0
0
4
1
Log2
(
f(b,i) + s(n)
p(b)
)
A
C
G
T
1.6
-1.7
-1.7
-1.7
-1.7
0.5
1.0
-1.7
-0.2
0.5
-0.2
-0.2
-1.7
1.3
-1.7
-0.2
-1.7
-1.7
1.3
-0.2
TGCTG = 0.9
f(b,i) = frequency of base b at position i
s(n)
= pseudocount correction (optionally applied to small samples of binding sites)
113
p(b) = background probability of base b
PSSM Scoring Scales
Raw scores
Sum of values from indicated cells of the matrix
Relative Scores (most common)
Normalize the scores to range of 0-1 or 0%-100%
Empirical p-values
Based on distribution of scores for some DNA
sequence, determine a p-value (see next slide)
Detecting binding sites in a single sequence
Raw Scores
Sp1
ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC
A
C
G
T
[-0.2284 0.4368
[-0.2284 -0.2284
[ 1.2348 1.2348
[ 0.4368 -0.2284
-1.5
-1.5
2.1222
-1.5
-1.5
-1.5
-1.5 1.5128
2.1222 0.4368
-1.5 -0.2284
0.4368
-1.5
-1.5 -0.2284
1.2348 1.5128
0.4368 0.4368
-1.5 -0.2284
-1.5 -0.2284
1.7457 1.7457
0.4368
-1.5
0.4368
-1.5
-1.5
1.7457
Abs_score = 13.4 (sum of column scores)
Relative Scores
A
C
G
T
[-0.2284 0.4368
[-0.2284 -0.2284
[ 1.2348 1.2348
[ 0.4368 -0.2284
-1.5
-1.5
2.1222
-1.5
-1.5
-1.5
-1.5 1.5128
2.1222 0.4368
-1.5 -0.2284
0.4368
-1.5
-1.5 -0.2284
1.2348 1.5128
0.4368 0.4368
Empirical p-value Scores
-1.5 -0.2284
-1.5 -0.2284
1.7457 1.7457
0.4368
-1.5
0.4368
-1.5
-1.5
1.7457
]
]
]
]
0.3
Max_score = 15.2 (sum of highest column scores)
[-0.2284 0.4368
[-0.2284 -0.2284
[ 1.2348 1.2348
[ 0.4368 -0.2284
-1.5
-1.5
2.1222
-1.5
-1.5
-1.5
-1.5 1.5128
2.1222 0.4368
-1.5 -0.2284
0.4368
-1.5
-1.5 -0.2284
1.2348 1.5128
0.4368 0.4368
-1.5 -0.2284
-1.5 -0.2284
1.7457 1.7457
0.4368
-1.5
0.4368
-1.5
-1.5
1.7457
Min_score = -10.3 (sum of lowest column scores)
Abs_score - Min_score
100 %
Max_score - Min_score
13.4 - (-10.3)

100%  93%
15.2  (10.3)
Rel_score 
]
]
]
]
Frequency
A
C
G
T
Area to right of value
Area under entire curve
0.2
0.1
0.0
0.0
0.2
0.4
0.6
Relative Score
0.8 1.0
]
]
]
]
JASPAR:
AN OPEN-ACCESS DATABASE
OF TF BINDING PROFILES
( jaspar.genereg.net )
(Some other databases with TF profiles include
Transfac, TRRD, mPromDB, SCPD (yeast), dbTBS
and EcoTFS (bacteria))
116
The Good…
 Tronche (1997) tested 50 predicted HNF1 TFBS using an in
vitro binding test and found that 96% of the predicted
sites were bound!
BINDING
ENERGY
 Stormo and Fields (1998) found in detailed biochemical
studies that the best weight matrices produce scores
highly correlated with in vitro binding energy
PSSM SCORE
…the Bad…
Fickett (1995) found that a profile for the
myoD TF made predictions at a rate of 1
per ~500bp of human DNA sequence
This corresponds to an average of 20 sites /
gene (assuming 10,000 bp as average human
gene size)
…and the Ugly!
Human Cardiac a-Actin gene analyzed
with a set of profiles
(each line represents a TFBS prediction)
Futility Conjecture: TFBS
predictions, as a collective group,
are almost always wrong!
True binding sites are defined by
properties not incorporated into
the PSSM profile scores
Red boxes are protein coding exons TFBS predictions excluded in this analysis
What have we learned?
 PSSMs accurately reflect in vitro binding properties of DNA
binding proteins
 Suitable binding sites occur at a rate far too frequent to
reflect in vivo function
 Bioinformatics methods that use PSSMs for binding site
studies must incorporate additional information to
enhance specificity
 Unfiltered predictions are too noisy for most applications
 Note: Organisms with short regulatory sequences are less
problematic (e.g. yeast and bacteria)
Example of a bioinformatics challenge that needs more
biological information to make predictions
70,000,000 years of evolution can
reveal regulatory regions
USING PHYLOGENETIC FOOTPRINTING
TO IMPROVE TFBS DISCRIMINATION
Phylogenetic Footprinting
% Identity
200 bp Window Start Position (human sequence)
Actin gene compared between human and mouse
•
•
•
Align orthologous gene sequences (e.g. use LAGAN). For 1st window of x
bp (i.e. 200 bp), of sequence#1, determine % identity with sequence#2.
Step across (slide window across) the first sequence, recording % identity
in each window with the second sequence.
Observe high identity with exons, lower identity in 5’ and 3’ UTRs
Additional conserved region could be regulatory region
Multi-species
Phylogenetic Footprinting
Phylogenetic Footprinting Dramatically Reduces
Spurious Hits
Human
Mouse
124
Actin, alpha cardiac
TFBS Prediction with Human & Mouse Pairwise Phylogenetic
Footprinting
SELECTIVITY
SENSITIVITY
 Testing set: 40 experimentally defined sites in 15 well studied genes
(Replicated with 100+ site set)
 75-80% of defined sites detected with “sequence conservation filter”
(incorporating orthologous sequence data), while only 11-16% of total
predictions retained
1kbp insulin receptor promoter screened with
footprinting
126
Choosing the ”right” species for pairwise
comparison...
CHICKEN
HUMAN
MOUSE
COW
HUMAN
HUMAN
ConSite
http://consite.genereg.net/
TFBS Discrimination Tools I
 Phylogenetic Footprinting Servers
FOOTER http://biodev.hgen.pitt.edu/footer_php/Footerv2_0.php
CONSITE http://asp.ii.uib.no:8090/cgi-bin/CONSITE/consite/ or
http://consite.genereg.net/
rVISTA http://rvista.dcode.org/
ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk
 SNPs in TFBS Analysis
 RAVEN http://burgundy.cmmt.ubc.ca/cgi-bin/RAVEN/a?rm=home
TFBS Discrimination Tools II
 Prokaryotes or Yeast
PRODORIC http://prodoric.tu-bs.de/
YEASTRACT http://www.yeastract.com/index.php
 Software Packages
TOUCAN http://homes.esat.kuleuven.be/~saerts/software/toucan.php
 Programming Tools
TFBS http://tfbs.genereg.net/
ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk
What have we learned?
 TFBS discrimination coupled with phylogenetic
footprinting has greater specificity with tolerable loss
of sensitivity
 As with any purification process, some true binding sites will be
lost
 There are several online resources that provide
precomputed analyses using phylogenetic
footprinting
TFBS Over-representation
We seek to determine if a set of co-expressed
genes contains an over-abundance of
predicted binding sites for a known TF
Phylogenetic footprinting to reduce false prediction
rate
Similar to looking for functional over-representation
(i.e. identify annotations of gene function in sets of coexpressed genes that are over-represented)
Two Examples of
TFBS Over-Representation
Foreground
Foreground
More Total TFBS
More Genes with TFBS
Background
Background
Statistical Methods for Identifying Overrepresented TFBS
 Binomial test (Z scores)
Based on the number of occurrences of the TFBS relative
to background
Normalized for sequence length
Simple binomial distribution model
 Fisher exact probability scores
Based on the number of genes containing the TFBS relative
to background
Hypergeometric probability distribution
oPOSSUM Server
Validation using Reference Gene Sets
A. Muscle-specific (23 input; 16 analyzed)
Rank
Z-score
B. Liver-specific (20 input; 12 analyzed)
Fisher
Rank
Z-score
Fisher
SRF
1
21.41
1.18e-02
HNF-1
1
38.21
8.83e-08
MEF2
2
18.12
8.05e-04
HLF
2
11.00
9.50e-03
c-MYB_1
3
14.41
1.25e-03
Sox-5
3
9.822
1.22e-01
Myf
4
13.54
3.83e-03
FREAC-4
4
7.101
1.60e-01
TEF-1
5
11.22
2.87e-03
HNF-3beta
5
4.494
4.66e-02
deltaEF1
6
10.88
1.09e-02
SOX17
6
4.229
4.20e-01
S8
7
5.874
2.93e-01
Yin-Yang
7
4.070
1.16e-01
Irf-1
8
5.245
2.63e-01
S8
8
3.821
1.61e-02
Thing1-E47
9
4.485
4.97e-02
Irf-1
9
3.477
1.69e-01
HNF-1
10
3.353
2.93e-01
COUP-TF
10
3.286
2.97e-01
TFs with experimentally-verified sites in the reference sets.
Structurally-related TFs with Indistinguishable
TFBS
Most
structurally
related TFs
bind to highly
similar
patterns
Zn-finger is a
big exception
Notes
 TFBS over-representation analysis can identify increased
number of sites in a set of co-expressed genes, or increased
number of genes in a co-expressed gene set that have a
given site.
 Generally best performance has been with data directly
linked to a transcription factor
 Highly dependent on the experimental design – cannot overcome noisy
data from poor design
Note that the identity of a mediating TF may not be
apparent when many proteins can bind to the same motif
de novo Pattern Discovery
 String-based
e.g. YMF (Sinha & Tompa)
Generalization: Identify over-represented oligomers in
comparison of “+” and “-” (or complete) promoter
collections
Used often for yeast promoter analysis
 Profile-based
e.g. AnnSpec (Workman & Stormo) or MEME (Bailey &
Elkin)
Generalization: Identify strong patterns (position weight
matrices) in “+” promoter collection vs. background model
of expected sequence characteristics
String-based methods(1)
How likely are X words in a set of
sequences, given background sequence
characteristics?
CCCGCCGGAATGAAATCTGATTGACATTTTCC
TTCAAATTTTAACGCCGGAATAATCTCCTATT
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG
TATCGTCATTCTCCGCCTCTTTTCTT
GCTTATCAATGCGCCCGGAATAAAACGCTATA
CATTGACTTTATCGAATAAATCTGTT
ATCTATTTACAATGATAAAACTTCAA
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT
TTTCAAATCCGGAATTTCCACCCGGAATTACT
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC
ACTGAACTTGTCTTCAAATTTCAACACCGGAA
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT
>EP71002
>EP63009
>EP63010
>EP11013
>EP11014
>EP11015
>EP11016
>EP11017
>EP63007
>EP63008
>EP17012
>EP55011
(+)
(+)
(+)
(+)
(+)
(-)
(+)
(+)
(-)
(+)
(+)
(-)
Ce[IV] msp-56 B; range -100 to -75
Ce Cuticle Col-12; range -100 to -75
Ce Cuticle Col-13; range -100 to -75
Ce vitellogenin 2; range -100 to -75
Ce vitellogenin 5; range -100 to -75
Ce vitellogenin 4; range -100 to -75
Ce vitellogenin 6; range -100 to -75
Ce calmodulin cal-2; range -100 to -75
Ce cAMP-dep. PKR P1+; range -100 to -75
Ce cAMP-dep. PKR P2; range -100 to -75
Ce hsp 16K-1 A; range -100 to -75
Ce hsp 16K-1 B; range
String-based methods(2)
Find all words of length n in the yeast promoters (e.g. n=7)
GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGG
ACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGAC
GGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGC
CAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGT
CTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAA
ATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTG
GGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAA
GTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTT
TGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGA
AAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCA
AAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATC
ACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGT
CGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAA
GGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG
ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAG
ATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCAT
CATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAA
ACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTG
AAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTG
TCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTC
TCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGT
ATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCT
AGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGT
CTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCT
GCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTG
CTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA
Make a lookup table:
AAACCTTT
TTTTTTTT
GATAGGCA
Etc...
456
57788
589
String-based methods(3)
Xw: Instances of a
word w within our set
of X genes
X w  EX w 
Zw 
VarX w 
E[Xw]: Average
number of instances of
w based on number of
genes in our set
Var[Xw]: Variance –
how much deviation
from the average is
expected for w
Limitations of String-based Methods
 Longer word lengths not possible
 Number of possible strings of length n is 4n without any
ambiguities and more if we allow ambiguous bases. Thus, as
n increases, the computational time notably increases.
 While degeneracy codes can be used, TFBS are not words
– we lose quantitation for variable positions with
consensus sequences
 Imagine column in PFM with 7 A’s and 1 T --- in a consensus
sequence we would represent as W or throw out the
instance with T
 Recently the string-based method has found renewed
utility in the analysis of 3’UTRs for the presence of
microRNA target sequences...
Aside: microRNA Target Sequences
 Lim et al expressed miRNAs in cells and observed that the
overall pattern of gene expression shifted toward the pattern
of expression observed in cells which naturally express the
miRNA
 The genes with reduced expression in response to miRNA
exposure shared 7nt motifs the 3’UTR of their transcripts
 Nice website tutorial:
 http://www.ambion.com/main/explorations/mirna.html
Probabilistic Methods
Overview:
Find a local alignment of width x of sites that maximizes information
content (or related measure) in reasonable time
Usually by Gibbs sampling (i.e. Gibbs Motif Sampler) or Expectation
Maximization methods (i.e. MEME)
Motivation:
TFBS are not words, but rather sequence patterns
Efficiency – can handle longer patterns than string-based
methods
Can be intentionally influenced to reflect prior knowledge
Ab initio motif finding:
Gibbs sampling
Popular “Markov chain Monte Carlo”
algorithm for motif discovery
Motif model: Position Weight Matrix
Local search algorithm
Gibbs sampling: basic idea
Current motif = PWM formed
by circled substrings
Gibbs sampling: basic idea
Delete one substring
Gibbs sampling: basic idea
Try a replacement:
Compute its score,
Accept the replacement
depending on the score.
Gibbs sampling: basic idea
New motif
Gibbs Sampling
(grossly over-simplified)
ttcgctcc
cgatacgc
tgctacct
tgacttcc
agacctca
ctgtagtg
acgcatct
A
C
G
T
1 2 3 4 5 6 7 8
2 0 2 2 2 1 0 1
0 2 3 3 2 1 6 2
0 4 1 0 1 0 1 1
4 1 1 2 2 5 0 2
i.e. Gibbs Motif Sampler: http://bayesweb.wadsworth.org/gibbs/gibbs.html
http://www.youtube.com/watch?v=_4yrlzVTr_k
Pattern Discovery
Gibbs sampling is guaranteed to return an
optimal pattern if repeated sufficiently often
Procedure is fast, so running many 1000s of times
is feasible
Problems if the TFBS’s are not strongly overrepresented relative to other patterns…
Applied Pattern Discovery is Acutely
Sensitive to Noise
PATTERN SIMILARITY
vs. TRUE MEF2 PROFILE
18
Pink line is negative control
with no Mef2 sites included
16
14
12
10
0
100
200
300
400
500
SEQUENCE LENGTH
True Mef2 Binding Sites
600
Ab initio motif finding:
Expectation Maximization
Popular algorithm for motif discovery
Motif model: Position Weight Matrix
Local search algorithm
Move from current choice of motif to a new
similar motif, so as to improve the score
Keep doing this until no more improvement is
obtained: converge to local optima
Basic idea of iteration
PWM
1.
Current motif
2.
Scan sequence for good matches to the
current motif.
3.
Build a new PWM out of these matches, and
make it the new motif
The basic idea can be formalized in the language of probability
MEME
Popular motif finding program that uses
Expectation-Maximization
Web site
http://meme.sdsc.edu/meme/website/meme.html
Four Approaches to Improve Sensitivity
 Better background models
-Higher-order properties of DNA
 Phylogenetic Footprinting
 Human:Mouse comparison eliminates ~75% of sequence
 Regulatory Modules
 Architectural rules
 Limit the types of binding profiles allowed
 TFBS patterns are NOT random
(Also, for prokaryotes in particular, accurate ID of transcription
start sites can greatly aid identification of promoter sequences)
Pattern Discovery Summary
Pattern discovery methods can recover overrepresented patterns in the promoters of coexpressed genes
Methods are acutely sensitive to noise –
notable problem if the signal we seek is weak
TFs tolerate great variability between binding sites
Supplementary information/approaches (such
as phylogenetic footprinting) may be required
to over-come the noise
REFLECTIONS
 TFBS prediction
 Futility Theorem – Essentially predictions of individual TFBS
have no relationship to an in vivo function
 Current successful bioinformatics methods incorporate
additional information (clusters, sequence conservation)
 TFBS over-representation
 TFBS over-representation is powerful way to identify TFs
likely to contribute to observed patterns of co-expression
 Novel motif discovery (also useful in general)
 Pattern discovery methods are restricted by the Signal-toNoise problem
 Observed patterns must be carefully considered
 Again, successful methods incorporate additional
information (conservation, structural constraints on TFs, and
(for prokaryotes) positional information relative to TSS)