Computational Gene Prediction dvancing e Sta-of-e-art i Bill Majoros ()

dvancing e Sta-of-e-art i
Computational Gene Prediction
Bill Majoros ([email protected])
Overview
•
Part I : Background and Current Methods
– Definition of the Problem
– HMM’s, GHMM’s, PairHMM’s, PhyloHMM’s
– Combiners & Expression-based Methods
•
Part II : Limitations of Current Methods
–
–
–
–
•
Alternative Splicing
Suboptimality of Pure HMM’s
Lack of Practical Discriminative Training Methods
Reliance on Pre-computed Alignments
Part III : Future Directions
–
–
–
–
Redefining the Problem
Greater Reliance on Machine-learning Methods
Focus on Integrative Approaches
Importance of Interoperability
I. Background and Current
Methods
Cells, Chromosomes, and DNA
Chromosome
Nucleus
Telomere
Centromere
Cell
histones
base pairs
DNA
(double helix)
Telomere
Nucleotide Composition of DNA
Sugar phosphate backbone
Base pair
Adenine
(A)
Nitrogenous base
Thymine
(T)
Guanine
(G)
Cytosine
(C)
Exons, Introns, and Genes
Exon
Intron
Gene
The human genome:
Exon
23 pairs of chromosomes
2.9 billion A’s, T’s, C’s, G’s
~25,000 genes (?)
~1.4% of genome is coding
The Central Dogma of Molecular Biology
cellular structure / function
protein
protein folding
(via chaparones)
polypeptide
RNA
DNA
“messenger”
translation
(via ribosome)
transcription
(via RNA polymerase)
RNA
amino
acid
AGC
CGA
UUR
GCU
GUU
...
S
R
L
A
V
...
The mod-3 Nature of the Genetic Code
transf
er RN
A
Chain of
amino
acids
one amino acid
Anti-codon (3 bases)
Codon (3 bases)
messenger RNA
(mRNA)
Ribosome
(performs translation)
Phase Constraints
phase:
sequence:
coordinates:
phase:
sequence:
coordinates:
012012012012012012
+ATGCGATATGATCGCTAG
|
|
|
|
0
5
10
forward
strand
15
210210210210210210
+CTAGCGATCATATCGCAT
-GATCGCTAGTATAGCGTA
|
|
|
|
0
5
10
reverse
strand
15
phase:
01201201
2012012012
sequence: + GTATGCGATAGTCAAGAGTGATCGCTAGACC
coordinates:
|
|
|
|
|
|
|
0
5
10
15
20
25
30
forward
strand,
spliced
Splicing of Eukaryotic Messenger RNAs
pre-mRNA
exon
intron
mature
mRNA
exon
discarded
intron
The Eukaryotic Gene Finding Problem
complete mRNA
coding segment
ATG
exon
intron
ATG . . . GT
start codon
TGA
exon
AG
donor site acceptor
site
...
intron
GT
exon
AG . . . TGA
donor site acceptor stop codon
site
>4271863
GAATTCTCCAACCCCAGGTGAGGATCTGACTACCTGGAACAGAACCCCGCTCTTCCAGGTGAGAATATGACAGAATAAAGCAG
CACCCTGCACCCCCAGTTGAGCATCTGACAGCCTGGGGCAGCACCCACACTCCCAGGTGAGCATCTGACAGCCTGGAGCAGCA
CCCACACCCCCAGGTGTGCATCTGACAGCCTGAAACAGCACCCTCCACCACCAGATGAGCATCTGACAACCAGAACCTGCACC
ACACACCCCAAGGTGGGCATCCGATGGCATGGAACAGCACCCCCACTCACAGGTGATGTGACTGCGTGGAACAGCACATCCCC
TCAGGTGAGCATCTGACAGCATAAAACAGCACCCCACAACCCCAGGTGATCATTTGCCAGCCAGGAACGGCAACCCACATCCC
CAGGTAAGTGTCTGACAGCCTAGAGCGGCACCTGCACACTTAGGTAAGAATCTGAAAGCCTGGATCAACACTCGAACCCCCAG
GTGAGCATCTGACAGCCTGGAGCAGCACCCCACACCTCCAGGGGAGCATCTGACATCCTGGAACAGCACCCCACACCCCCAGT
GAGCATCTGACAACCTGGAGCAGCACCCCACACCTCCAGGGGAGCATCTGACATCCTGGAACAGCACCCCACACCTCCAGGGG
GAGCATCTGACAGCCTGGAGCAACACCCCACACCTCCAGGGGAGCATCTGACAGCCTGGAACAGCACCCCACACCTCCAGGGG
AGCATCTGACAGCATGGACAAGTCCTGCCCCCCGGTTAGTGTCTGAATTCCTGGAATATGTGCTGTCCTTTTCCACCAGGTGA
GCATATGACCGCCTGGAAGAAGCACCCCTGCATGTTACCTGTGGTGAAACCAAGGCTGAGAGACAGGACAGGGTTGTTGGCCA
GGAGGAGGGGCCTGCTGCTGAGCCCCAGCGCTGAGTCAGAGCTCACAGCCTTGAGCCTGTGCCATGCCTCCTCTCAGGGTGAA
GAGGCAGAGGGCATGGGGGGGGGGGGCGCAGCATCAGCCCATATCCTGGATGCACAAATTCACAGGCATGACGGGGCGAGGGC
CTGCTGCTCCTACCGTCACTCCCACATGCTAGCCCTCCAACGTCCTGGCTGACTTTCCCTGCCTCTGGTCCTGCGGCCCTGGA
CACAGCGGGAGGAGGGGACAGGATCCTGTGGTACCCCTTGGAGGAGGTTCCGGCACCTGTAGGCAGTTTCCAAGGATCCCTTG
TGGCCACATCCAGCTGTTAAATGGGCGTGTCCCTGGCAGCCACAGATGTCTGTCCAGTGCAGGAAAGTCTGTCCAGTGCAGGA
AAGGGCAGGCAGAGAGCTGGCTCCCAGCCCCAAATGCATGTCTCCCTCCCTGGGGCCCAGGCTGGCACAGAAGTCAGGCCTGC
CAGTGGGAAACTTGGGGGAAACTTGGTGCCCTGGTCCAGCAGCCTGCCCTGCAGCAGCAAGCATGGCCTCTGGAGGCTGTCGT
CCTCCTGGCCTCCAGGATTGCTTTTCCTTTCTTCCTAGAACTCCAGCCCTTAAGAAAATCAGAAGCCCTGGCAGGCACATTGC
CTCTGTGCTGTGCTTTTACCCAGCGAAGCATCAGGGCAGACAGCCAATTTCAACACTGCTCTTGGCTGGGAAGTGCCCTCATC
TCTGGCAGCCCCCACAGAGAAAGTGCAGGGCCCCGGGGCTGTGGCTGCCTCAGGGCAGGTCTCCCTTGTGACAGCCTCTTGTC
ATGGGCCTGGGAGTGGACCCCTCCCATCCCTGCCGTGCATCCTGTTGAGTAGACAGCTCAGGCTAGTACCCAAGAGGGTGGCC
AGCAGATCACAGGGGATGTCCCTTTTGTCTTAGCTGTTTATGGGCTGGAGGAACCACTGTTCAGCCACATCTCCTCTCCTGCC
CCTCCTGCATTCACCCGAGGTTGATGGGAGCTCTGTGGGGGGAGACAGGCAGGGGAGGGGCCAGGCAACACCTGGGCATATCC
CCAGGGTGCCTCCTCTGATCCCCAGGAGGGGCAGCACCCACCCTCTTCTTTTTCCAATTTGTTTTTATTGTGGTTAAATATAC
ATAACGTAAAATTTACCATCTTAATTATCTTTATGTGTACGGTGTGGTGGCATTAAATACATTCATCATGTTGTGTGGCCGTC
ACCACCATCCCTTTCCAGAACTAGCTCATCTTCCCAAACTGAAACTCTGTCCCCGTTAAATACTAACTCTCCGTTCCCCAGGC
ACTCTCTGCCCCCAACCCCAGGCACCCACCATTCTGCTTTCTGTCTCTGTGATTCGATGACTCTAGGGACTTCATATAAGGGA
AATCACACAGTGTTTGTCCTTTTGTGGTGGCTGCTTATTTTGCTGAGCACAATGTCCTTGAGGTTCATCCATGTTGTAGTGTG
TACCAGGAATCCCTTCCTTTTTAATGTTGAATAATTCCCCATTGTATGGATGGATCATGTTTGGCTTATCCACCCATCCATCG
GTGGACACCTGGGTGCCTTCCACCTCCAAGCTCTTGTGAACAATGATGCTATCTATGAATATGGTGTACAAATGTCTCTAAAA
The Stochastic Nature of Signal Sensing
(stop codons)
(start codons)
A T G
T G A
T A A
T A G
(donor splice sites)
(acceptor splice sites)
G T
A G
Common Assumptions in Gene Finding
•No overlapping genes
•No nested genes
•No frame shifts or sequencing errors
•Optimal parse only
•No split start codons (ATGT...AGG)
•No split stop codons (TGT...AGAG)
•No alternative splicing
•No selenocysteine codons (TGA)
•No ambiguity codes (Y,R,N, etc.)
Definition of a Hidden Markov Model
M = (Q, , Pt, q0, Pe)
Q = (q0,q1,q2,q3,q4)
={A,T,C,G}
A=11%
T=17%
C=43%
G=29%
q3
q4
35%
20%
A=27%
T=14%
C=22%
G=37%
%
50%
A=35%
T=25%
C=15%
G=25%
%
80
100
q0
100%
A=10%
T=30%
C=40%
G=20%
50%
%
q1
65
q2
Finding the Most Probable Path
q0
100%
bottom: 2.810-9
A=10%
T=30%
C=40%
G=20%
50%
A=11%
T=17%
C=43%
G=29%
q3
q4
35%
A=27%
T=14%
C=22%
G=37%
20%
%
top:
7.010-7
50%
A=35%
T=25%
C=15%
G=25%
%
80
100
q1
%
Example: CATTAATAG
65
q2
The most probable path is:
States:
122222224
Sequence: CATTAATAG
feature 1: C
resulting in this parse:
States:
122222224
Sequence: CATTAATAG
feature 2: ATTAATA
feature 3: G
Decoding with an HMM
max
argmax
argmax P( S )
=
P ( | S ) =
P( S )
argmax
=
P ( S )
argmax
=
P( S | ) P()
L 1
P( S | ) = Pe ( xi | q(i +1) )
i =0
max
emission prob.
L
P() = Pt (q(i +1) | q(i ) )
i =0
transition prob.
L 1
argmax
=
Pt (q0 | q( L ) ) Pe ( xi | q(i +1) ) Pt (q(i +1) | q(i ) )
i =0
Decoding with the Viterbi Algorithm
max V ( j , k 1) P (q | q ) P ( x , q ) if k > 0,
t
i
j
e
k
i
V (i, k ) = j
Pt (qi | q0 ) Pe ( x0 | qi )
if k = 0.
arg max
V (i, L 1) Pt (q0 | qi )
=
i , L 1
*
Training an HMM from Labeled Sequence
CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC
011111112222222111111222211111112222111110
transitions
to state
0
from
state
0
0 (0%)
1
2
1 (100%) 0 (0%)
1
1 (4%)
21 (84%)
3 (12%)
2
0 (0%)
3 (20%)
12 (80%)
ai , j =
Ai , j
|Q|1
h =0
Ai ,h
emissions
symbol
A
C
G
T
in
1
state
6
(24%)
7
(28%)
5
(20%)
7
(28%)
2
3
(20%)
3
(20%)
2
(13%)
7
(47%)
ei ,k =
Ei ,k
||1
E
h =0 i ,h
Using an HMM for Gene Prediction
Intron
Donor
Acceptor
Exon
the Markov model:
Start
codon
Stop
codon
Intergenic
q0
the input sequence:
the most probable path:
the gene prediction:
AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACGCGA
exon 1
exon 2
exon 3
Higher Order Markovian Eukaryotic Recognizer
(HOMER)
H3
H5
H17
H95
H27
H77
HOMER, version H3
Intron
I=intron state
Donor
E=exon state
Acceptor
N=intergenic state
Exon
Start
codon
Stop
codon
Intergenic
tested on 500
Arabidopsis genes:
q0
splice
sites
nucleotides
start/stop
codons
exons
genes
Sn
Sp
F
Sn
Sp
Sn
Sp
Sn
Sp
F
Sn
#
baseline 50
28
44
0
0
0
0
0
0
0
0
0
H3
88
66
0
0
0
0
0
0
0
0
0
53
HOMER, version H5
three exon states, for
the three codon
positions
splice
sites
nucleotides
start/stop
codons
exons
genes
Sn
Sp F
Sn
Sp
Sn
Sp
Sn
Sp F
Sn #
H3
53
88
66
0
0
0
0
0
0
0
0
0
H5
65
91
76
1
3
3
3
0
0
0
0
0
HOMER
version H17
donor
site
acceptor
site
stop codon
start codon
splice
sites
nucleotides
start/stop
codons
exons
genes
Sn
Sp F
Sn
Sp
Sn
Sp
Sn
Sp F
Sn #
H5
65
91
76
1
3
3
3
0
0
0
0
0
H17
81
93
87
34
48
43
37
19
24
21
7
35
HOMER
version H27
three
separate
intron
models
nucleotides
splice
start/stop
exons
genes
Sn
Sp
F
Sn
Sp
Sn
Sp
Sn
Sp
F
Sn
#
H17
81
93
87
34
48
43
37
19
24
21
7
35
H27
83
93
88
40
49
41
36
23
27
25
8
38
HOMER
version H77
positional biases
near splice sites
nucleotides
splice
start/stop
exons
genes
Sn
Sp
F
Sn
Sp
Sn
Sp
Sn
Sp
F
Sn #
H27
83
93
88
40
49
41
36
23
27
25
8
38
H77
88
96
92
66
67
51
46
47
46
46
13
65
HOMER
version H95
nucleotides
splice
start/stop
exons
genes
Sn
Sp
F
Sn
Sp
Sn
Sp
Sn
Sp
F
Sn
#
H77
88
96
92
66
67
51
46
47
46
46
13
65
H95
92
97
94
79
76
57
53
62
59
60
19
93
Higher-order Markov Models
P(G)
0th order:
ACGCTA
P(G|C)
1st order:
ACGCTA
P(G|AC)
2nd order:
ACGCTA
Higher-order Markov Models
order
nucleotides
Sn
Sp
F
splice
sites
Sn
Sp
starts/
stops
Sn
Sp
exons
Sn
Sp
genes
F
Sn
#
0
92 97 94 79 76 57 53 62 59 60 19 93
95 98 97 87 81 64 61 72 68 70 25 127
H95 2
2
98 98 98 91 82 65 62 76 69 72 27 136
H395 3
98 98 98 91 82 67 63 76 69 72 28 140
H495 4
98 97 98 90 81 69 64 76 68 72 29 143
H595 5
98 97 98 90 81 66 62 74 67 70 27 137
H95 0
H195 1
Generalized Hidden Markov Models
Advantages:
* Submodel abstraction
* Architectural simplicity
* State duration modeling
Disadvantages:
* Decoding complexity
HMMs & Geometric Feature Lengths
d 1
d 1
P( x0 ...xd 1 | ) = Pe ( xi | ) p (1 p )
i =0
geometric
distribution
ge
om
etr
ic
exon length
E0
Fixed-length
states are called
signal states (red).
E1
D
A
D
Each state has a
separate submodel
or “sensor”
UTR5
TATA
D
A
I0
E2
I1
A
D
I2
A
Einitial
ATG
D
A
Variable-length
states are called
content states (blue).
D
A
Efinal
Esng
TAG
N
UTR3
AATAAA
+strand
-strand
Some GHMM Submodel Types
L 1
1. WMM (Weight Matrix)
P (x )
i
i
i =0
n 1
2. Nth-order Markov Chain (MC)
L 1
P( x | x ...x ) P( x | x
i
i 1
0
i =0
i
i =n
i n
...xi 1 )
L 1
3. Three-Periodic Markov Chain (3PMC)
P
i =0
( f + i )(mod 3)
( xi )
n
max Pi ( x
4. Nonstationary Markov Chain (NSMC)
i =1
n 1
5. Codon Bias
P( x
i =0
6. MDD
x
x
+ 3i + 3i +1 + 3i + 2
)
7. Interpolated Markov Models
i 1
j =1
dj
...x
(
i
j =1
d j ) 1
)
Recall: Decoding with an HMM
max
argmax
argmax P( S )
=
P ( | S ) =
P( S )
argmax
=
P ( S )
argmax
=
P( S | ) P()
L 1
P( S | ) = Pe ( xi | q(i +1) )
i =0
max
emission prob.
L
P() = Pt (q(i +1) | q(i ) )
i =0
transition prob.
L 1
argmax
=
Pt (q0 | q( L ) ) Pe ( xi | q(i +1) ) Pt (q(i +1) | q(i ) )
i =0
Decoding with a GHMM
max
argmax
argmax P( S )
=
P ( | S ) =
P( S )
argmax
=
P ( S )
argmax
=
P( S | ) P()
||2
P( S | ) = Pe ( Si | q(i ) , d i )
i =1
max =
P() = Pt (q(i +1) | q(i ) ) Pd (d i | q(i ) )
emission prob.
||2
||2
i =0
transition
prob.
duration
prob.
argmax
Pe ( Si | q(i ) , d i ) Pt (q(i +1) | q(i ) ) Pd (d i | q(i ) )
i =0
Efficient Decoding via Signal Sensors
Each signal state has a signal sensor:
The “trellis” or “ORF graph”:
.0045
.0032 .0031 .0021 .002
.0072
.0023
.0034 .0082
Accuracy
nucleotides
splice
start/stop
exons
genes
Sn
Sp
F
Sn
Sp
Sn
Sp
Sn
Sp
F
Sn
HMM
98
97
98
90
81
66
62
74
67
70
27
GHMM
94
96
95
87
89
77
74
79
80
80
45
#
Comparative Methods
Problem: Predict genes in a target genome G based on the
contents of G and also based on the contents of
one or more informant genomes I(1)... I(n):
{
GC-ATCGGTCTTA
...|:.|:|.||:|:|...
target genome: ATCGGTAAC-GTGTAATGC
informant genome:
alignment
Rationale: Natural selection should operate more
strongly on protein-coding DNA than on nonfunctional DNA such as introns.
Pair HMM’s (PHMM’s)
IX : emit a symbol into output channel X
IX
IY
μ
M
IY : emit a symbol into output channel Y
M : emit a symbol into both X and Y
IX can be called an insertion state
IY can be called a deletion state
M can be called a match state
Decoding for PHMM’s
argmax
={q 0 =q0 ,...,qm1 =q 0 }
Pt (q0 | qm 2 ) Pe (ai,1 ,ai,2 | qi )Pt (qi | qi1 )
i=1
max V
for qk QM
i1, j1,h Pt (qk | qh )Pe (si1,1 ,s j1,2 | qk )
h
for qk QI
Vi, j,k = max Vi1, j,h Pt (qk | qh )Pe (si1,1 , | qk )
h
Vi, j1,h Pt (qk | qh )Pe ( ,s j1,2 | qk )
for qk QD
max
h
argmax Vi1, j1,h Pt (qk | qh )Pe (si1,1 ,s j1,2 | qk ) for qk QM
(i1, j1,h)
Ti, j,k = argmax Vi1, j,h Pt (qk | qh )Pe (si1,1 , | qk )
for qk QI
(i1,
j,h)
argmax Vi, j1,h Pt (qk | qh )Pe ( ,s j1,2 | qk )
for qk QD
(i, j1,h)
1 for q = q
k
V0,0,k = 0 otherwise
max Vi1,0,h Pt (qk | q h ) Pe (si1,1 , | qk )
Vi>0,0,k = h
0
max V0, j1,h Pt (qk | qh ) Pe (, s j1,2 | q k )
V0, j>0,k = h
0
states
* =
m 2
s
u
eq
seq
e
e
nc
uen
2
ce
1
0
for qk Q D
arg max Vi1,0,h Pt (q k | qh ) Pe (si1,1 , | qk )
Ti>0,0,k = ( i1,0,h )
NIL
arg max V0, j1,h Pt (qk | qh ) Pe (, s j1,2 | qk )
T0, j>0,k = (0, j1,h )
NIL
otherwise
T0,0,k = NIL
for qk Q I
otherwise
for q k Q I
otherwise
for qk Q D
otherwise
genome 2
Pruning the Search Space for PHMM’s
E
T
RE
NO
A
U
IG
V
E
L
A
RE
NO
IG
genome 1
•Find significantly conserved regions (thick bars) using BLAST
•Force the DP algorithm to select a path which passes through these regions
•Allow more flexibility in the regions not aligned
•Do not evaluate regions of the matrix far from the conserved regions
Pair GHMM’s (PGHMM’s)
Each state in the
GHMM now
contains a
PHMM, and emits
a pair of sequence
features rather
than a single
sequence feature.
-
+
-
=
=
+
+
-
-
=
-
+
+
=
-
+
=
+
-
=
+
-
+
+
-
=
+
-
=
=
+
-
+
=
=
+
-
+
+
=
+
-
+
-
=
=
+
=
-
=
=
-
+
-
+
-
=
=
-
+
=
+
-
=
=
=
=
+
-
+
+
-
-
=
=
-
+
=
-
+
=
Recall: GHMM Decoding
Finding the optimal parse, max:
max
arg max
arg max P(, S )
P ( | S ) =
=
P( S )
arg max
arg max
=
P(, S ) =
P( S | ) P()
n 1
arg max
=
Pe ( Si | qi , d i ) Pt (qi | qi 1 ) Pd (d i | qi )
i =1
=emission
=transition
=duration
Decoding with a GPHMM
n2
argmax
*
0
= Pt (q | qn2 )
Pe (S i,1 ,S i,2 | qi ,di,1 ,di,2 )
i=1
Pt (qi | qi1 )Pd (di,1 ,di,2 | qi )
=transition
Pe (Si,1 , Si,2 | qi , d i,1 ,d i,2 ) Pe (Si,1 | qi , d i,1 ) Pe (Si,2 | Si,1 ,q i ,d i,2 )
alignment score
=single-genome
emission score
Pd (d i,1 ,d i,2 | qi ) = P(d i,1 | q i ) P(d i,2 | d i,1 ,q i )
P(d i,1 | q i ) P( d | qi ),
=single-genome
duration score
d = d i,2 d i,1
ignore
(implicit in alignment score)
genome 1
Practical GPHMM Decoding
build ORF
graph
Align ORF
graphs
genome 2
“guide”
alignments
sparse
alignment
matrix
build ORF
graph
Extract best
alignment
predictions
for genome 1
predictions for
genome 2
Aligning Parse Graphs
The two parse graphs can be aligned using a global alignment
algorithm. The optimal alignment corresponds to the chosen pair of
orthologous gene predictions.
The alignment is
constrained by the
topologies of the two
parse graphs:
1.Only like signals can
align,
2.Two signals can align
only if they have
neighbors which also
align,
3.Standard phase
constraints apply.
Using a Sparse Alignment Matrix
unpruned cells
guide alignment
Accuracy
Data set: 147 high-confidence Aspergillus fumigatus A. nidulans
orthologs (493 exons, 564kb).
nucleotide
accuracy
exon
sensitivity
exon
specificity
exact
genes
GHMM
99%
78%
73%
54%
GPHMM
99%
89%
85%
74%
How Does Homology Help?
ATG
1
TGA
2
3
4
ATG
1
A. fumigatus
TGA
2
3
feature
amino acid
alignment score
exon 1
100%
intron 1
14%
exon 2
98%
intron 2
29%
exon 3
97%
intron 3
9%
exon 4
96%
4
<,>
>
<
>
<
>
<
>
A. nidulans
nucleotide
alignment score
71%
51%
85%
49%
82%
49%
83%
Phylogenetic HMM’s (PhyloHMM’s)
model of gene structure
model of phylogeny
I
A1
GT
A2
A3
AG
Eint
Einit
I1
S
Efin
Esng
I2
ATG
TAG
N
= a model of gene structure informed by
observed evolutionary divergence
Evolutionary Sequence Conservation
• Using multiple
genomes increases
effective sample size
chicken
• However, we have to
control for the nonindependence of
informant genomes
galago
chimpanzee human
rat
mouse
dog
human:
chimp:
cow:
dog:
galago:
rat:
mouse:
• The “ideal
evolutionary distance”
for informant
genomes usually is
not known
cow
AAGGGAAGACAGGTGAGGGTCAAGCCCCAGCAAGTGCACCCAG------------ACACC
AAGGGAAGACAGGTGAGGGTCAAGCCCCAGCAAGTGCACCCAG------------ACACC
AAGGGAAGACATTTACGAGTCAAGCCACAGAAAGAGCCCCTGAG-----------GTGCC
AAAGGAGGACATGTGAGGGCCAAACTACTGAAGGTTCAACCAGG-----------ATGCT
AAGGGGAGACAGGGGAGGGTCACACCATGGCAGAGG--CCAAG------------ACAGC
AAAGGAAACAATGGGAAGGTTA-TCAACTCCAAGTATGCCCAAGATCAAGGGAACCCCTT
AAAGGAAACCACTGGGAGGTTA-GAAATCACAGGTGCACCCAAGATCAAGGAA--CCCCT
Decoding with a PhyloHMM
arg max
* =
P( | S , I (1) ,..., I ( n ) )
(1)
(n)
arg max P(, S , I ,..., I )
=
P( S , I (1) ,..., I ( n ) )
arg max
=
P(, S , I (1) ,..., I ( n ) )
arg max
(1)
(n)
=
P() P( S , I ,..., I | )
arg max
=
P() P( S | ) P( I (1) ,..., I ( n ) | S , )
standard GHMM computation
tree likelihood (Felsenstein’s algorithm)
Phylogenies as Bayes Networks
2
S
Bayesian network
28
29
32
93
28
29
32
93
43
49
11
23
28
15
95
45
A3
93
38
32
90
43
49
11
23
28
15
95
45
re-root and
attach rate
matrices
93
38
32
90
28
29
32
93
43
49
11
23
28
15
95
45
A1
A2
93
38
32
90
I1
A1
28
29
32
93
28
29
32
93
43
49
11
23
28
15
95
45
93
38
32
90
43
49
11
23
28
15
95
45
S
I2
1
93
38
32
90
(1)
P( I ,..., I
A2
3
I1
=
tree likelihood
(Felsenstein’s
algorithm)
P( I
A1 , A2 , A3
A3
1
(n)
I2
phylogeny
| S , ) =
P(v | parent (v), ) unobservables nonroot
v
| A2 ) P( A2 | A1 ) P( A1 | A3 ) P( I 2 | A3 ) P( A3 | S )
Likelihood rapidly decreases with larger numbers of mutations
Tree Likelihood
(x10)
(five
species
aligned over
5000
columns)
Number of different residues in column
Modeling Evolutionary Change in both Nucleotides and
Amino Acids
Recall from the earlier GHMM example:
feature
amino acid
alignment score
<,>
nucleotide
alignment score
exon 1
100%
>
71%
intron 1
14%
<
51%
exon 2
98%
>
85%
intron 2
29%
<
49%
exon 3
97%
>
82%
intron 3
9%
<
49%
exon 4
96%
>
83%
Thus, we model separately the rate of evolutionary change in coding regions (ideally at
the amino acid level) and noncoding regions (at the nucleotide level).
human
human
coding
tree
chimp
noncoding
tree
chimp
rat mouse
galago
chicken
dog cow
chicken
cow galago
rat
dog
mouse
Expression-based Methods
proteins and
mRNA’s aligned to
the genome
outputs of other
prediction programs
Proteins and sequenced mRNA’s can be aligned to the genome using dynamic
programming algorithms.
Various ad hoc methods have been explored for utilizing this information during
gene prediction.
Outputs from other gene finders can also be used as input to a “combiner”
program.
(figure courtesy of J. Allen, University of Maryland)
Ad hoc “Combiner” Methods
boundaries of putative exons
0.6
Splice Predictions
Gene Finder 1
evidence
tracks
0.9
0.8
Gene Finder 2
Protein Alignment
mRNA Alignment
0.89
0.49
0.35
0.32
combining
function
weighted ORF graph
decoder
(dyn. prog.)
gene prediction
(J. Allen, University of Maryland)
II. Limitations of Current Methods
1. MLE+Viterbi Is Not Optimal
Currently, most gene finders are trained via maximum likelihood estimation (MLE):
arg max P(S, | ) =
( S , )T
N 1
arg max P ( i | qi ,d i ) P (q i | q i1 ) P (d i | qi ) =
( S , )T i=1
*
The advantage of MLE is that it is easy to perform, since the transition, emission, and
duration parameters can be optimized separately:
P(q
log
(S i , i )T
j
| qj1 )
(transitions)
q j i , j>0
logP(
j
| q)
(emissions)
j S i T
(S i , i )T
log
P(d
(q j ,d j ) i
j
| qj )
(durations)
anecdotal evidence
Anecdotal evidence:
• folklore about 1/ 3 in the source code of a certain popular gene finder*
• fudge factor in: NSCAN (“conservation score coefficient”; Gross & Brent, 2005)
• fudge factor in: ExoniPhy (“tuning parameter”; Siepel & Haussler, 2004)
• fudge factor in TWAIN (“percent identity”; Majoros et al., 2005)
• fudge factor in GlimmerHMM (“optimism”; M. Pertea, pers. communication)
• fudge factor in TIGRscan (“optimism”; Majoros et al., 2004)
• lack of fudge factors in EvoGene (Pedersen & Hein, 2003)
If MLE+Viterbi produced optimal parsers, why would “tweaking” &
the use of “fudge factors” be necessary? (...apart from sample size
issues...)
* folklore also states that this programs’s author made pact with the devil in exchange for gene-finding accuracy,
but I have encountered no independent verification of this.
a simple experiment
M1
M1
I
I
E
N
generate
synthetic
genes
input
E
parse
accuracy:
predictions
N
78%
q0
q0
I
input
parse
E
discri
minat
ive training
N
q0
M2
accuracy:
predictions
87%
changes selected by the hill-climber
changes to the emission parameters
changes to the transition parameters
symbol
state
A
intergenic
12%
exon
+1%
intron
+3%
to state
C
G
+5%
3%
+2%
1%
+3%
6%
T
+10%
6%
from state
0
1
2
3
0
1
1.5%
+1.5%
2
+0.4%
0.8%
+0.4%
3
+0.9%
0.9%
+4%
“These changes are consistent with the notion that superior discr imination requires exaggeration
of (or emphasis on) those model parameters which most reliably separate the classes of interest.
In this case, the Arabidopsis training genes showed both lower T content and higher C and G
content in the exonic regions versus th e other two regions, and these biases, which were already
present in the same proportions in the MLE model (data not shown) are obviously exaggerated in
the discriminative model (Table 3). The most extreme exaggeration, the -12% for A in th e
intergenic state, is among the most subtle differences in the training data, in which the intergenic
regions showed ~29% A versus ~27% in the exonic and intronic regions; in this way, the
discriminative trainer has identified a subtle but consis tent difference and boosted the “weight” of
this feature in the model by exaggerating the corresponding emission probabilities.”
another experiment: GHMM
NUCLEOTIDE
100%
90
MLE
discrim
94
83
87
EXON
81
80%
ACCURACY
71
60%
51
40%
GENE
48
33
31
30
18
20%
0%
arab asp
arab asp
arab asp
what about conditional maximum likelihood?
arg max Instead of: =
P(S, | ) ( S , )T
*
arg max we might consider: =
P( | S, ) ( S , )T
*
This is called Conditional Maximum Likelihood. It expands into:
arg max P( | S, ) =
( S , )T
arg max P(S, | ) =
( S , )T P(S | ) N 1
P ( i | qi ,d i ) P (q i | q i1 ) P (d i | qi ) arg max i=1
=
P (S)
( S , )T
*
Unfortunately, EM-like update equations which have been derived for pure HMM’s
tend to be unstable (e.g., Reichl and Ruske, 1995; Normandin, 1996). Update
equations for GHMM’s, PHMM’s, GPHMM’s, and PhyloHMM’s have (as far as I know)
not yet been derived.
2. Newer Decoders Also Not Optimal
Posterior Viterbi (Fariselli et al., 2005), OAD (Käll et al., 2005):
argmax
Pe ( i | si ) (q i | q i1 )
i algorithm
200 training
genes, 50 test
genes
1000 training
genes, 200 test
genes
200 training
genes, 50 test
genes
AVit
APV
AOAD
algorithm
AVit
APV
AOAD
algorithm
AVit
APV
AOAD
argmax
Pe ( i | si ) (q i | q i1 )
i nuc
85%
89%
89%
exon
42%
41%
41%
gene
24%
26%
26%
(Oryza sativa)
nuc
96%
97%
97%
exon
68%
64%
61%
gene
36%
33%
30%
(Arabidopsis thaliana)
nuc
94%
96%
96%
exon
50%
39%
39%
gene
30%
22%
22%
(Aspergillus fumigatus)
...even for GHMM’s
G/OAD and G/PV dynamic programming update formulas:
0
if deg in (q j ) = 0
Wj = Wi + fibji jP(s j | q j ) otherwise
imax
pred ( j)
1
if deg in (q j ) = 0
Vj = Vi fibji jP(s j | q j ) otherwise
imax
pred ( j)
5-fold cross
validation;
(600 training
genes & 200
test genes per
run)
(Arabidopsis
thaliana)
3. Discriminative Training is Expensive
training
data
MLE
trainer
Gradient Ascent Optimizer
new
model
problems:
• slow (esp. for
pair & phylo
HMM’s)
• perverse: no
longer a
descriptive
model
GHMM
Decoder
SLOW!
predictions
keep
the
best
modify
parameters
evaluate
accuracy
4. Intron Evolution is Rarely Modeled
Many Aspergillus orthologs have unequal numbers of
exons.
(intron evolution, cont.)
5. Dependence on Pre-computed Alignments
genome 2
The PHMM case:
E
T
RE
NO
IG
V
E
The GPHMM case:
A
U
L
A
unpruned cells
guide alignment
RE
NO
IG
genome 1
As for the PhyloHMM & Combiner cases...
(pre-computed alignments, cont.)
N genomes:
aligner
?
human:
chimp:
cow:
dog:
galago:
rat:
mouse:
AAGGGAAGACAGGTGAGGGTCAAGCCCCAGCAAGTGCACCCAG------------ACACC
AAGGGAAGACAGGTGAGGGTCAAGCCCCAGCAAGTGCACCCAG------------ACACC
AAGGGAAGACATTTACGAGTCAAGCCACAGAAAGAGCCCCTGAG-----------GTGCC
AAAGGAGGACATGTGAGGGCCAAACTACTGAAGGTTCAACCAGG-----------ATGCT
AAGGGGAGACAGGGGAGGGTCACACCATGGCAGAGG--CCAAG------------ACAGC
AAAGGAAACAATGGGAAGGTTA-TCAACTCCAAGTATGCCCAAGATCAAGGGAACCCCTT
AAAGGAAACCACTGGGAGGTTA-GAAATCACAGGTGCACCCAAGATCAAGGAA--CCCCT
PhyloHMM or
Combiner
predictions
6. Alternative Splicing
alternative
alternative
start codons ATG donor sites
ATG
GT GT
AG
GT
alternative
exon
alternative
alternative
stop codons
acceptor sites
AG AG
TGA
TGA
III. Future Directions
1. Redefine the Problem
• Predict exons rather than whole genes -- a step backward?
• Classification rather than parsing
• This “solves” the alternative splicing problem (in a manner of speaking...)
• Possibly use a statistical ensemble representation of parses, and develop more
sophisticated browsers that can use such a representation
GHMM
submodels
sequence
modified
decoder
ep
acc
scores of
putative
exon
features
reje
classifier
homology evidence
t
predict an
exon
ct
do
nothing
2. A Greater Role for Machine Learning
• Use established, general-purpose ML algorithms
• Let the learner attend to the goal of maximal discrimination
training
data
MLE
trainer
pos
feature
extraction
neg
get negative
examples
hybrid model
classifier
trainer
3. Focus on Combiners
• Combiners integrate all available evidence
• Results from EGASP:
NUCLEOTIDE
NS n
NS p
EXON
N CC
ES p
GS n
GS p
74.67%
75.18%
80.61%
76.85%
76.76%
69.31%
89.33%
88.91%
47.97%
69.93%
72.64%
69.59%
35.59%
42.09%
65.95%
61.32%
0.76
0.53
0.66
52.39%
50.58%
62.08%
62.93%
29.01%
50.25%
24.32%
15.20%
19.59%
17.22%
3.24%
8.84%
79.14%
83.45%
92.02%
94.33%
59.67%
92.77%
0.84
0.88
0.91
0.89
0.73
0.90
85.75%
74.10%
77.53%
79.34%
64.44%
76.63%
56.98%
77.40%
82.65%
83.45%
41.77%
88.95%
63.51%
47.64%
71.62%
63.18%
21.96%
69.59%
48.65%
37.01%
67.32%
80.82%
6.33%
61.71%
80.15%
88.24%
74.13%
89.02%
81.39%
0.84
0.74
0.78
0.87
0.65
63.06%
53.11%
65.56%
67.66%
38.82%
69.14%
77.34%
61.65%
82.05%
50.73%
26.01%
10.81%
33.45%
35.47%
4.39%
18.64%
14.61%
24.94%
36.71%
3.44%
Combiners
AUGUSTUS-any
FGENESH++
JIGSAW
PAIRAGON-any
94.42%
91.09%
94.56%
87.77%
82.43%
76.89%
92.19%
92.78%
0.88
0.83
0.93
0.90
HMM’s & GHMM’s
AUGUSTUS-abinit
GENEMARK.hmm
GENEZILLA
78.65%
78.43%
87.56%
75.29%
37.97%
50.93%
Expression-based
ACEVIEW
AUGUSTUS-EST
ENSEMBL
EXOGEAN
EXONHUNTER
PAIRAGON+NSCAN
90.94%
92.62%
90.18%
84.18%
90.46%
87.56%
Pair & Phylo HMM’s
AUGUSTUS-dual
DOGFISH
MARS
NSCAN
SAGA
88.86%
64.81%
84.25%
85.38%
52.54%
Source: Guigo et al., 2006 (to appear in Genome Biology)
ES n
GENE
4. Interoperability
universal
decoder
gene predictions
re-weighted ORF graph
“one decoder to rule
them all, one decoder to
find them...”
various other evidence
(possibly including
other ORF graphs)
next generation
gene finder
weighted ORF graph
presumably the best
ORF graph generator
available for this
organism
ab initio gene
finder XYZ
ATCGGCATTCGCGATTACGATATCTAGCGATGCTA
Contemplating the Future
pure
alignmentbased
Pair
(G)HMM’s
Phylo(G)HMM’s
“Discriminative Bayes
Net Combiner GHMM’s”
(DBNCGHMM’s)
...
combiners
2000
...
?
2002
2003
2006
20xx
ACKNOWLEDGEMENTS
TIGR: S. Salzberg, M. Pertea, J. Allen, A. Delcher, J. Wortman, B. Haas
Duke: U. Ohler, S. Mukherjee
Elsewhere: M. Yandell, I. Korf, J. Eisen
Methods for Computational
Gene Prediction
W.H. Majoros
Expected publication date:
late 2006
www.geneprediction.org/book