dvancing e Sta-of-e-art i Computational Gene Prediction Bill Majoros ([email protected]) Overview • Part I : Background and Current Methods – Definition of the Problem – HMM’s, GHMM’s, PairHMM’s, PhyloHMM’s – Combiners & Expression-based Methods • Part II : Limitations of Current Methods – – – – • Alternative Splicing Suboptimality of Pure HMM’s Lack of Practical Discriminative Training Methods Reliance on Pre-computed Alignments Part III : Future Directions – – – – Redefining the Problem Greater Reliance on Machine-learning Methods Focus on Integrative Approaches Importance of Interoperability I. Background and Current Methods Cells, Chromosomes, and DNA Chromosome Nucleus Telomere Centromere Cell histones base pairs DNA (double helix) Telomere Nucleotide Composition of DNA Sugar phosphate backbone Base pair Adenine (A) Nitrogenous base Thymine (T) Guanine (G) Cytosine (C) Exons, Introns, and Genes Exon Intron Gene The human genome: Exon 23 pairs of chromosomes 2.9 billion A’s, T’s, C’s, G’s ~25,000 genes (?) ~1.4% of genome is coding The Central Dogma of Molecular Biology cellular structure / function protein protein folding (via chaparones) polypeptide RNA DNA “messenger” translation (via ribosome) transcription (via RNA polymerase) RNA amino acid AGC CGA UUR GCU GUU ... S R L A V ... The mod-3 Nature of the Genetic Code transf er RN A Chain of amino acids one amino acid Anti-codon (3 bases) Codon (3 bases) messenger RNA (mRNA) Ribosome (performs translation) Phase Constraints phase: sequence: coordinates: phase: sequence: coordinates: 012012012012012012 +ATGCGATATGATCGCTAG | | | | 0 5 10 forward strand 15 210210210210210210 +CTAGCGATCATATCGCAT -GATCGCTAGTATAGCGTA | | | | 0 5 10 reverse strand 15 phase: 01201201 2012012012 sequence: + GTATGCGATAGTCAAGAGTGATCGCTAGACC coordinates: | | | | | | | 0 5 10 15 20 25 30 forward strand, spliced Splicing of Eukaryotic Messenger RNAs pre-mRNA exon intron mature mRNA exon discarded intron The Eukaryotic Gene Finding Problem complete mRNA coding segment ATG exon intron ATG . . . GT start codon TGA exon AG donor site acceptor site ... intron GT exon AG . . . TGA donor site acceptor stop codon site >4271863 GAATTCTCCAACCCCAGGTGAGGATCTGACTACCTGGAACAGAACCCCGCTCTTCCAGGTGAGAATATGACAGAATAAAGCAG CACCCTGCACCCCCAGTTGAGCATCTGACAGCCTGGGGCAGCACCCACACTCCCAGGTGAGCATCTGACAGCCTGGAGCAGCA CCCACACCCCCAGGTGTGCATCTGACAGCCTGAAACAGCACCCTCCACCACCAGATGAGCATCTGACAACCAGAACCTGCACC ACACACCCCAAGGTGGGCATCCGATGGCATGGAACAGCACCCCCACTCACAGGTGATGTGACTGCGTGGAACAGCACATCCCC TCAGGTGAGCATCTGACAGCATAAAACAGCACCCCACAACCCCAGGTGATCATTTGCCAGCCAGGAACGGCAACCCACATCCC CAGGTAAGTGTCTGACAGCCTAGAGCGGCACCTGCACACTTAGGTAAGAATCTGAAAGCCTGGATCAACACTCGAACCCCCAG GTGAGCATCTGACAGCCTGGAGCAGCACCCCACACCTCCAGGGGAGCATCTGACATCCTGGAACAGCACCCCACACCCCCAGT GAGCATCTGACAACCTGGAGCAGCACCCCACACCTCCAGGGGAGCATCTGACATCCTGGAACAGCACCCCACACCTCCAGGGG GAGCATCTGACAGCCTGGAGCAACACCCCACACCTCCAGGGGAGCATCTGACAGCCTGGAACAGCACCCCACACCTCCAGGGG AGCATCTGACAGCATGGACAAGTCCTGCCCCCCGGTTAGTGTCTGAATTCCTGGAATATGTGCTGTCCTTTTCCACCAGGTGA GCATATGACCGCCTGGAAGAAGCACCCCTGCATGTTACCTGTGGTGAAACCAAGGCTGAGAGACAGGACAGGGTTGTTGGCCA GGAGGAGGGGCCTGCTGCTGAGCCCCAGCGCTGAGTCAGAGCTCACAGCCTTGAGCCTGTGCCATGCCTCCTCTCAGGGTGAA GAGGCAGAGGGCATGGGGGGGGGGGGCGCAGCATCAGCCCATATCCTGGATGCACAAATTCACAGGCATGACGGGGCGAGGGC CTGCTGCTCCTACCGTCACTCCCACATGCTAGCCCTCCAACGTCCTGGCTGACTTTCCCTGCCTCTGGTCCTGCGGCCCTGGA CACAGCGGGAGGAGGGGACAGGATCCTGTGGTACCCCTTGGAGGAGGTTCCGGCACCTGTAGGCAGTTTCCAAGGATCCCTTG TGGCCACATCCAGCTGTTAAATGGGCGTGTCCCTGGCAGCCACAGATGTCTGTCCAGTGCAGGAAAGTCTGTCCAGTGCAGGA AAGGGCAGGCAGAGAGCTGGCTCCCAGCCCCAAATGCATGTCTCCCTCCCTGGGGCCCAGGCTGGCACAGAAGTCAGGCCTGC CAGTGGGAAACTTGGGGGAAACTTGGTGCCCTGGTCCAGCAGCCTGCCCTGCAGCAGCAAGCATGGCCTCTGGAGGCTGTCGT CCTCCTGGCCTCCAGGATTGCTTTTCCTTTCTTCCTAGAACTCCAGCCCTTAAGAAAATCAGAAGCCCTGGCAGGCACATTGC CTCTGTGCTGTGCTTTTACCCAGCGAAGCATCAGGGCAGACAGCCAATTTCAACACTGCTCTTGGCTGGGAAGTGCCCTCATC TCTGGCAGCCCCCACAGAGAAAGTGCAGGGCCCCGGGGCTGTGGCTGCCTCAGGGCAGGTCTCCCTTGTGACAGCCTCTTGTC ATGGGCCTGGGAGTGGACCCCTCCCATCCCTGCCGTGCATCCTGTTGAGTAGACAGCTCAGGCTAGTACCCAAGAGGGTGGCC AGCAGATCACAGGGGATGTCCCTTTTGTCTTAGCTGTTTATGGGCTGGAGGAACCACTGTTCAGCCACATCTCCTCTCCTGCC CCTCCTGCATTCACCCGAGGTTGATGGGAGCTCTGTGGGGGGAGACAGGCAGGGGAGGGGCCAGGCAACACCTGGGCATATCC CCAGGGTGCCTCCTCTGATCCCCAGGAGGGGCAGCACCCACCCTCTTCTTTTTCCAATTTGTTTTTATTGTGGTTAAATATAC ATAACGTAAAATTTACCATCTTAATTATCTTTATGTGTACGGTGTGGTGGCATTAAATACATTCATCATGTTGTGTGGCCGTC ACCACCATCCCTTTCCAGAACTAGCTCATCTTCCCAAACTGAAACTCTGTCCCCGTTAAATACTAACTCTCCGTTCCCCAGGC ACTCTCTGCCCCCAACCCCAGGCACCCACCATTCTGCTTTCTGTCTCTGTGATTCGATGACTCTAGGGACTTCATATAAGGGA AATCACACAGTGTTTGTCCTTTTGTGGTGGCTGCTTATTTTGCTGAGCACAATGTCCTTGAGGTTCATCCATGTTGTAGTGTG TACCAGGAATCCCTTCCTTTTTAATGTTGAATAATTCCCCATTGTATGGATGGATCATGTTTGGCTTATCCACCCATCCATCG GTGGACACCTGGGTGCCTTCCACCTCCAAGCTCTTGTGAACAATGATGCTATCTATGAATATGGTGTACAAATGTCTCTAAAA The Stochastic Nature of Signal Sensing (stop codons) (start codons) A T G T G A T A A T A G (donor splice sites) (acceptor splice sites) G T A G Common Assumptions in Gene Finding •No overlapping genes •No nested genes •No frame shifts or sequencing errors •Optimal parse only •No split start codons (ATGT...AGG) •No split stop codons (TGT...AGAG) •No alternative splicing •No selenocysteine codons (TGA) •No ambiguity codes (Y,R,N, etc.) Definition of a Hidden Markov Model M = (Q, , Pt, q0, Pe) Q = (q0,q1,q2,q3,q4) ={A,T,C,G} A=11% T=17% C=43% G=29% q3 q4 35% 20% A=27% T=14% C=22% G=37% % 50% A=35% T=25% C=15% G=25% % 80 100 q0 100% A=10% T=30% C=40% G=20% 50% % q1 65 q2 Finding the Most Probable Path q0 100% bottom: 2.810-9 A=10% T=30% C=40% G=20% 50% A=11% T=17% C=43% G=29% q3 q4 35% A=27% T=14% C=22% G=37% 20% % top: 7.010-7 50% A=35% T=25% C=15% G=25% % 80 100 q1 % Example: CATTAATAG 65 q2 The most probable path is: States: 122222224 Sequence: CATTAATAG feature 1: C resulting in this parse: States: 122222224 Sequence: CATTAATAG feature 2: ATTAATA feature 3: G Decoding with an HMM max argmax argmax P( S ) = P ( | S ) = P( S ) argmax = P ( S ) argmax = P( S | ) P() L 1 P( S | ) = Pe ( xi | q(i +1) ) i =0 max emission prob. L P() = Pt (q(i +1) | q(i ) ) i =0 transition prob. L 1 argmax = Pt (q0 | q( L ) ) Pe ( xi | q(i +1) ) Pt (q(i +1) | q(i ) ) i =0 Decoding with the Viterbi Algorithm max V ( j , k 1) P (q | q ) P ( x , q ) if k > 0, t i j e k i V (i, k ) = j Pt (qi | q0 ) Pe ( x0 | qi ) if k = 0. arg max V (i, L 1) Pt (q0 | qi ) = i , L 1 * Training an HMM from Labeled Sequence CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC 011111112222222111111222211111112222111110 transitions to state 0 from state 0 0 (0%) 1 2 1 (100%) 0 (0%) 1 1 (4%) 21 (84%) 3 (12%) 2 0 (0%) 3 (20%) 12 (80%) ai , j = Ai , j |Q|1 h =0 Ai ,h emissions symbol A C G T in 1 state 6 (24%) 7 (28%) 5 (20%) 7 (28%) 2 3 (20%) 3 (20%) 2 (13%) 7 (47%) ei ,k = Ei ,k ||1 E h =0 i ,h Using an HMM for Gene Prediction Intron Donor Acceptor Exon the Markov model: Start codon Stop codon Intergenic q0 the input sequence: the most probable path: the gene prediction: AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACGCGA exon 1 exon 2 exon 3 Higher Order Markovian Eukaryotic Recognizer (HOMER) H3 H5 H17 H95 H27 H77 HOMER, version H3 Intron I=intron state Donor E=exon state Acceptor N=intergenic state Exon Start codon Stop codon Intergenic tested on 500 Arabidopsis genes: q0 splice sites nucleotides start/stop codons exons genes Sn Sp F Sn Sp Sn Sp Sn Sp F Sn # baseline 50 28 44 0 0 0 0 0 0 0 0 0 H3 88 66 0 0 0 0 0 0 0 0 0 53 HOMER, version H5 three exon states, for the three codon positions splice sites nucleotides start/stop codons exons genes Sn Sp F Sn Sp Sn Sp Sn Sp F Sn # H3 53 88 66 0 0 0 0 0 0 0 0 0 H5 65 91 76 1 3 3 3 0 0 0 0 0 HOMER version H17 donor site acceptor site stop codon start codon splice sites nucleotides start/stop codons exons genes Sn Sp F Sn Sp Sn Sp Sn Sp F Sn # H5 65 91 76 1 3 3 3 0 0 0 0 0 H17 81 93 87 34 48 43 37 19 24 21 7 35 HOMER version H27 three separate intron models nucleotides splice start/stop exons genes Sn Sp F Sn Sp Sn Sp Sn Sp F Sn # H17 81 93 87 34 48 43 37 19 24 21 7 35 H27 83 93 88 40 49 41 36 23 27 25 8 38 HOMER version H77 positional biases near splice sites nucleotides splice start/stop exons genes Sn Sp F Sn Sp Sn Sp Sn Sp F Sn # H27 83 93 88 40 49 41 36 23 27 25 8 38 H77 88 96 92 66 67 51 46 47 46 46 13 65 HOMER version H95 nucleotides splice start/stop exons genes Sn Sp F Sn Sp Sn Sp Sn Sp F Sn # H77 88 96 92 66 67 51 46 47 46 46 13 65 H95 92 97 94 79 76 57 53 62 59 60 19 93 Higher-order Markov Models P(G) 0th order: ACGCTA P(G|C) 1st order: ACGCTA P(G|AC) 2nd order: ACGCTA Higher-order Markov Models order nucleotides Sn Sp F splice sites Sn Sp starts/ stops Sn Sp exons Sn Sp genes F Sn # 0 92 97 94 79 76 57 53 62 59 60 19 93 95 98 97 87 81 64 61 72 68 70 25 127 H95 2 2 98 98 98 91 82 65 62 76 69 72 27 136 H395 3 98 98 98 91 82 67 63 76 69 72 28 140 H495 4 98 97 98 90 81 69 64 76 68 72 29 143 H595 5 98 97 98 90 81 66 62 74 67 70 27 137 H95 0 H195 1 Generalized Hidden Markov Models Advantages: * Submodel abstraction * Architectural simplicity * State duration modeling Disadvantages: * Decoding complexity HMMs & Geometric Feature Lengths d 1 d 1 P( x0 ...xd 1 | ) = Pe ( xi | ) p (1 p ) i =0 geometric distribution ge om etr ic exon length E0 Fixed-length states are called signal states (red). E1 D A D Each state has a separate submodel or “sensor” UTR5 TATA D A I0 E2 I1 A D I2 A Einitial ATG D A Variable-length states are called content states (blue). D A Efinal Esng TAG N UTR3 AATAAA +strand -strand Some GHMM Submodel Types L 1 1. WMM (Weight Matrix) P (x ) i i i =0 n 1 2. Nth-order Markov Chain (MC) L 1 P( x | x ...x ) P( x | x i i 1 0 i =0 i i =n i n ...xi 1 ) L 1 3. Three-Periodic Markov Chain (3PMC) P i =0 ( f + i )(mod 3) ( xi ) n max Pi ( x 4. Nonstationary Markov Chain (NSMC) i =1 n 1 5. Codon Bias P( x i =0 6. MDD x x + 3i + 3i +1 + 3i + 2 ) 7. Interpolated Markov Models i 1 j =1 dj ...x ( i j =1 d j ) 1 ) Recall: Decoding with an HMM max argmax argmax P( S ) = P ( | S ) = P( S ) argmax = P ( S ) argmax = P( S | ) P() L 1 P( S | ) = Pe ( xi | q(i +1) ) i =0 max emission prob. L P() = Pt (q(i +1) | q(i ) ) i =0 transition prob. L 1 argmax = Pt (q0 | q( L ) ) Pe ( xi | q(i +1) ) Pt (q(i +1) | q(i ) ) i =0 Decoding with a GHMM max argmax argmax P( S ) = P ( | S ) = P( S ) argmax = P ( S ) argmax = P( S | ) P() ||2 P( S | ) = Pe ( Si | q(i ) , d i ) i =1 max = P() = Pt (q(i +1) | q(i ) ) Pd (d i | q(i ) ) emission prob. ||2 ||2 i =0 transition prob. duration prob. argmax Pe ( Si | q(i ) , d i ) Pt (q(i +1) | q(i ) ) Pd (d i | q(i ) ) i =0 Efficient Decoding via Signal Sensors Each signal state has a signal sensor: The “trellis” or “ORF graph”: .0045 .0032 .0031 .0021 .002 .0072 .0023 .0034 .0082 Accuracy nucleotides splice start/stop exons genes Sn Sp F Sn Sp Sn Sp Sn Sp F Sn HMM 98 97 98 90 81 66 62 74 67 70 27 GHMM 94 96 95 87 89 77 74 79 80 80 45 # Comparative Methods Problem: Predict genes in a target genome G based on the contents of G and also based on the contents of one or more informant genomes I(1)... I(n): { GC-ATCGGTCTTA ...|:.|:|.||:|:|... target genome: ATCGGTAAC-GTGTAATGC informant genome: alignment Rationale: Natural selection should operate more strongly on protein-coding DNA than on nonfunctional DNA such as introns. Pair HMM’s (PHMM’s) IX : emit a symbol into output channel X IX IY μ M IY : emit a symbol into output channel Y M : emit a symbol into both X and Y IX can be called an insertion state IY can be called a deletion state M can be called a match state Decoding for PHMM’s argmax ={q 0 =q0 ,...,qm1 =q 0 } Pt (q0 | qm 2 ) Pe (ai,1 ,ai,2 | qi )Pt (qi | qi1 ) i=1 max V for qk QM i1, j1,h Pt (qk | qh )Pe (si1,1 ,s j1,2 | qk ) h for qk QI Vi, j,k = max Vi1, j,h Pt (qk | qh )Pe (si1,1 , | qk ) h Vi, j1,h Pt (qk | qh )Pe ( ,s j1,2 | qk ) for qk QD max h argmax Vi1, j1,h Pt (qk | qh )Pe (si1,1 ,s j1,2 | qk ) for qk QM (i1, j1,h) Ti, j,k = argmax Vi1, j,h Pt (qk | qh )Pe (si1,1 , | qk ) for qk QI (i1, j,h) argmax Vi, j1,h Pt (qk | qh )Pe ( ,s j1,2 | qk ) for qk QD (i, j1,h) 1 for q = q k V0,0,k = 0 otherwise max Vi1,0,h Pt (qk | q h ) Pe (si1,1 , | qk ) Vi>0,0,k = h 0 max V0, j1,h Pt (qk | qh ) Pe (, s j1,2 | q k ) V0, j>0,k = h 0 states * = m 2 s u eq seq e e nc uen 2 ce 1 0 for qk Q D arg max Vi1,0,h Pt (q k | qh ) Pe (si1,1 , | qk ) Ti>0,0,k = ( i1,0,h ) NIL arg max V0, j1,h Pt (qk | qh ) Pe (, s j1,2 | qk ) T0, j>0,k = (0, j1,h ) NIL otherwise T0,0,k = NIL for qk Q I otherwise for q k Q I otherwise for qk Q D otherwise genome 2 Pruning the Search Space for PHMM’s E T RE NO A U IG V E L A RE NO IG genome 1 •Find significantly conserved regions (thick bars) using BLAST •Force the DP algorithm to select a path which passes through these regions •Allow more flexibility in the regions not aligned •Do not evaluate regions of the matrix far from the conserved regions Pair GHMM’s (PGHMM’s) Each state in the GHMM now contains a PHMM, and emits a pair of sequence features rather than a single sequence feature. - + - = = + + - - = - + + = - + = + - = + - + + - = + - = = + - + = = + - + + = + - + - = = + = - = = - + - + - = = - + = + - = = = = + - + + - - = = - + = - + = Recall: GHMM Decoding Finding the optimal parse, max: max arg max arg max P(, S ) P ( | S ) = = P( S ) arg max arg max = P(, S ) = P( S | ) P() n 1 arg max = Pe ( Si | qi , d i ) Pt (qi | qi 1 ) Pd (d i | qi ) i =1 =emission =transition =duration Decoding with a GPHMM n2 argmax * 0 = Pt (q | qn2 ) Pe (S i,1 ,S i,2 | qi ,di,1 ,di,2 ) i=1 Pt (qi | qi1 )Pd (di,1 ,di,2 | qi ) =transition Pe (Si,1 , Si,2 | qi , d i,1 ,d i,2 ) Pe (Si,1 | qi , d i,1 ) Pe (Si,2 | Si,1 ,q i ,d i,2 ) alignment score =single-genome emission score Pd (d i,1 ,d i,2 | qi ) = P(d i,1 | q i ) P(d i,2 | d i,1 ,q i ) P(d i,1 | q i ) P( d | qi ), =single-genome duration score d = d i,2 d i,1 ignore (implicit in alignment score) genome 1 Practical GPHMM Decoding build ORF graph Align ORF graphs genome 2 “guide” alignments sparse alignment matrix build ORF graph Extract best alignment predictions for genome 1 predictions for genome 2 Aligning Parse Graphs The two parse graphs can be aligned using a global alignment algorithm. The optimal alignment corresponds to the chosen pair of orthologous gene predictions. The alignment is constrained by the topologies of the two parse graphs: 1.Only like signals can align, 2.Two signals can align only if they have neighbors which also align, 3.Standard phase constraints apply. Using a Sparse Alignment Matrix unpruned cells guide alignment Accuracy Data set: 147 high-confidence Aspergillus fumigatus A. nidulans orthologs (493 exons, 564kb). nucleotide accuracy exon sensitivity exon specificity exact genes GHMM 99% 78% 73% 54% GPHMM 99% 89% 85% 74% How Does Homology Help? ATG 1 TGA 2 3 4 ATG 1 A. fumigatus TGA 2 3 feature amino acid alignment score exon 1 100% intron 1 14% exon 2 98% intron 2 29% exon 3 97% intron 3 9% exon 4 96% 4 <,> > < > < > < > A. nidulans nucleotide alignment score 71% 51% 85% 49% 82% 49% 83% Phylogenetic HMM’s (PhyloHMM’s) model of gene structure model of phylogeny I A1 GT A2 A3 AG Eint Einit I1 S Efin Esng I2 ATG TAG N = a model of gene structure informed by observed evolutionary divergence Evolutionary Sequence Conservation • Using multiple genomes increases effective sample size chicken • However, we have to control for the nonindependence of informant genomes galago chimpanzee human rat mouse dog human: chimp: cow: dog: galago: rat: mouse: • The “ideal evolutionary distance” for informant genomes usually is not known cow AAGGGAAGACAGGTGAGGGTCAAGCCCCAGCAAGTGCACCCAG------------ACACC AAGGGAAGACAGGTGAGGGTCAAGCCCCAGCAAGTGCACCCAG------------ACACC AAGGGAAGACATTTACGAGTCAAGCCACAGAAAGAGCCCCTGAG-----------GTGCC AAAGGAGGACATGTGAGGGCCAAACTACTGAAGGTTCAACCAGG-----------ATGCT AAGGGGAGACAGGGGAGGGTCACACCATGGCAGAGG--CCAAG------------ACAGC AAAGGAAACAATGGGAAGGTTA-TCAACTCCAAGTATGCCCAAGATCAAGGGAACCCCTT AAAGGAAACCACTGGGAGGTTA-GAAATCACAGGTGCACCCAAGATCAAGGAA--CCCCT Decoding with a PhyloHMM arg max * = P( | S , I (1) ,..., I ( n ) ) (1) (n) arg max P(, S , I ,..., I ) = P( S , I (1) ,..., I ( n ) ) arg max = P(, S , I (1) ,..., I ( n ) ) arg max (1) (n) = P() P( S , I ,..., I | ) arg max = P() P( S | ) P( I (1) ,..., I ( n ) | S , ) standard GHMM computation tree likelihood (Felsenstein’s algorithm) Phylogenies as Bayes Networks 2 S Bayesian network 28 29 32 93 28 29 32 93 43 49 11 23 28 15 95 45 A3 93 38 32 90 43 49 11 23 28 15 95 45 re-root and attach rate matrices 93 38 32 90 28 29 32 93 43 49 11 23 28 15 95 45 A1 A2 93 38 32 90 I1 A1 28 29 32 93 28 29 32 93 43 49 11 23 28 15 95 45 93 38 32 90 43 49 11 23 28 15 95 45 S I2 1 93 38 32 90 (1) P( I ,..., I A2 3 I1 = tree likelihood (Felsenstein’s algorithm) P( I A1 , A2 , A3 A3 1 (n) I2 phylogeny | S , ) = P(v | parent (v), ) unobservables nonroot v | A2 ) P( A2 | A1 ) P( A1 | A3 ) P( I 2 | A3 ) P( A3 | S ) Likelihood rapidly decreases with larger numbers of mutations Tree Likelihood (x10) (five species aligned over 5000 columns) Number of different residues in column Modeling Evolutionary Change in both Nucleotides and Amino Acids Recall from the earlier GHMM example: feature amino acid alignment score <,> nucleotide alignment score exon 1 100% > 71% intron 1 14% < 51% exon 2 98% > 85% intron 2 29% < 49% exon 3 97% > 82% intron 3 9% < 49% exon 4 96% > 83% Thus, we model separately the rate of evolutionary change in coding regions (ideally at the amino acid level) and noncoding regions (at the nucleotide level). human human coding tree chimp noncoding tree chimp rat mouse galago chicken dog cow chicken cow galago rat dog mouse Expression-based Methods proteins and mRNA’s aligned to the genome outputs of other prediction programs Proteins and sequenced mRNA’s can be aligned to the genome using dynamic programming algorithms. Various ad hoc methods have been explored for utilizing this information during gene prediction. Outputs from other gene finders can also be used as input to a “combiner” program. (figure courtesy of J. Allen, University of Maryland) Ad hoc “Combiner” Methods boundaries of putative exons 0.6 Splice Predictions Gene Finder 1 evidence tracks 0.9 0.8 Gene Finder 2 Protein Alignment mRNA Alignment 0.89 0.49 0.35 0.32 combining function weighted ORF graph decoder (dyn. prog.) gene prediction (J. Allen, University of Maryland) II. Limitations of Current Methods 1. MLE+Viterbi Is Not Optimal Currently, most gene finders are trained via maximum likelihood estimation (MLE): arg max P(S, | ) = ( S , )T N 1 arg max P ( i | qi ,d i ) P (q i | q i1 ) P (d i | qi ) = ( S , )T i=1 * The advantage of MLE is that it is easy to perform, since the transition, emission, and duration parameters can be optimized separately: P(q log (S i , i )T j | qj1 ) (transitions) q j i , j>0 logP( j | q) (emissions) j S i T (S i , i )T log P(d (q j ,d j ) i j | qj ) (durations) anecdotal evidence Anecdotal evidence: • folklore about 1/ 3 in the source code of a certain popular gene finder* • fudge factor in: NSCAN (“conservation score coefficient”; Gross & Brent, 2005) • fudge factor in: ExoniPhy (“tuning parameter”; Siepel & Haussler, 2004) • fudge factor in TWAIN (“percent identity”; Majoros et al., 2005) • fudge factor in GlimmerHMM (“optimism”; M. Pertea, pers. communication) • fudge factor in TIGRscan (“optimism”; Majoros et al., 2004) • lack of fudge factors in EvoGene (Pedersen & Hein, 2003) If MLE+Viterbi produced optimal parsers, why would “tweaking” & the use of “fudge factors” be necessary? (...apart from sample size issues...) * folklore also states that this programs’s author made pact with the devil in exchange for gene-finding accuracy, but I have encountered no independent verification of this. a simple experiment M1 M1 I I E N generate synthetic genes input E parse accuracy: predictions N 78% q0 q0 I input parse E discri minat ive training N q0 M2 accuracy: predictions 87% changes selected by the hill-climber changes to the emission parameters changes to the transition parameters symbol state A intergenic 12% exon +1% intron +3% to state C G +5% 3% +2% 1% +3% 6% T +10% 6% from state 0 1 2 3 0 1 1.5% +1.5% 2 +0.4% 0.8% +0.4% 3 +0.9% 0.9% +4% “These changes are consistent with the notion that superior discr imination requires exaggeration of (or emphasis on) those model parameters which most reliably separate the classes of interest. In this case, the Arabidopsis training genes showed both lower T content and higher C and G content in the exonic regions versus th e other two regions, and these biases, which were already present in the same proportions in the MLE model (data not shown) are obviously exaggerated in the discriminative model (Table 3). The most extreme exaggeration, the -12% for A in th e intergenic state, is among the most subtle differences in the training data, in which the intergenic regions showed ~29% A versus ~27% in the exonic and intronic regions; in this way, the discriminative trainer has identified a subtle but consis tent difference and boosted the “weight” of this feature in the model by exaggerating the corresponding emission probabilities.” another experiment: GHMM NUCLEOTIDE 100% 90 MLE discrim 94 83 87 EXON 81 80% ACCURACY 71 60% 51 40% GENE 48 33 31 30 18 20% 0% arab asp arab asp arab asp what about conditional maximum likelihood? arg max Instead of: = P(S, | ) ( S , )T * arg max we might consider: = P( | S, ) ( S , )T * This is called Conditional Maximum Likelihood. It expands into: arg max P( | S, ) = ( S , )T arg max P(S, | ) = ( S , )T P(S | ) N 1 P ( i | qi ,d i ) P (q i | q i1 ) P (d i | qi ) arg max i=1 = P (S) ( S , )T * Unfortunately, EM-like update equations which have been derived for pure HMM’s tend to be unstable (e.g., Reichl and Ruske, 1995; Normandin, 1996). Update equations for GHMM’s, PHMM’s, GPHMM’s, and PhyloHMM’s have (as far as I know) not yet been derived. 2. Newer Decoders Also Not Optimal Posterior Viterbi (Fariselli et al., 2005), OAD (Käll et al., 2005): argmax Pe ( i | si ) (q i | q i1 ) i algorithm 200 training genes, 50 test genes 1000 training genes, 200 test genes 200 training genes, 50 test genes AVit APV AOAD algorithm AVit APV AOAD algorithm AVit APV AOAD argmax Pe ( i | si ) (q i | q i1 ) i nuc 85% 89% 89% exon 42% 41% 41% gene 24% 26% 26% (Oryza sativa) nuc 96% 97% 97% exon 68% 64% 61% gene 36% 33% 30% (Arabidopsis thaliana) nuc 94% 96% 96% exon 50% 39% 39% gene 30% 22% 22% (Aspergillus fumigatus) ...even for GHMM’s G/OAD and G/PV dynamic programming update formulas: 0 if deg in (q j ) = 0 Wj = Wi + fibji jP(s j | q j ) otherwise imax pred ( j) 1 if deg in (q j ) = 0 Vj = Vi fibji jP(s j | q j ) otherwise imax pred ( j) 5-fold cross validation; (600 training genes & 200 test genes per run) (Arabidopsis thaliana) 3. Discriminative Training is Expensive training data MLE trainer Gradient Ascent Optimizer new model problems: • slow (esp. for pair & phylo HMM’s) • perverse: no longer a descriptive model GHMM Decoder SLOW! predictions keep the best modify parameters evaluate accuracy 4. Intron Evolution is Rarely Modeled Many Aspergillus orthologs have unequal numbers of exons. (intron evolution, cont.) 5. Dependence on Pre-computed Alignments genome 2 The PHMM case: E T RE NO IG V E The GPHMM case: A U L A unpruned cells guide alignment RE NO IG genome 1 As for the PhyloHMM & Combiner cases... (pre-computed alignments, cont.) N genomes: aligner ? human: chimp: cow: dog: galago: rat: mouse: AAGGGAAGACAGGTGAGGGTCAAGCCCCAGCAAGTGCACCCAG------------ACACC AAGGGAAGACAGGTGAGGGTCAAGCCCCAGCAAGTGCACCCAG------------ACACC AAGGGAAGACATTTACGAGTCAAGCCACAGAAAGAGCCCCTGAG-----------GTGCC AAAGGAGGACATGTGAGGGCCAAACTACTGAAGGTTCAACCAGG-----------ATGCT AAGGGGAGACAGGGGAGGGTCACACCATGGCAGAGG--CCAAG------------ACAGC AAAGGAAACAATGGGAAGGTTA-TCAACTCCAAGTATGCCCAAGATCAAGGGAACCCCTT AAAGGAAACCACTGGGAGGTTA-GAAATCACAGGTGCACCCAAGATCAAGGAA--CCCCT PhyloHMM or Combiner predictions 6. Alternative Splicing alternative alternative start codons ATG donor sites ATG GT GT AG GT alternative exon alternative alternative stop codons acceptor sites AG AG TGA TGA III. Future Directions 1. Redefine the Problem • Predict exons rather than whole genes -- a step backward? • Classification rather than parsing • This “solves” the alternative splicing problem (in a manner of speaking...) • Possibly use a statistical ensemble representation of parses, and develop more sophisticated browsers that can use such a representation GHMM submodels sequence modified decoder ep acc scores of putative exon features reje classifier homology evidence t predict an exon ct do nothing 2. A Greater Role for Machine Learning • Use established, general-purpose ML algorithms • Let the learner attend to the goal of maximal discrimination training data MLE trainer pos feature extraction neg get negative examples hybrid model classifier trainer 3. Focus on Combiners • Combiners integrate all available evidence • Results from EGASP: NUCLEOTIDE NS n NS p EXON N CC ES p GS n GS p 74.67% 75.18% 80.61% 76.85% 76.76% 69.31% 89.33% 88.91% 47.97% 69.93% 72.64% 69.59% 35.59% 42.09% 65.95% 61.32% 0.76 0.53 0.66 52.39% 50.58% 62.08% 62.93% 29.01% 50.25% 24.32% 15.20% 19.59% 17.22% 3.24% 8.84% 79.14% 83.45% 92.02% 94.33% 59.67% 92.77% 0.84 0.88 0.91 0.89 0.73 0.90 85.75% 74.10% 77.53% 79.34% 64.44% 76.63% 56.98% 77.40% 82.65% 83.45% 41.77% 88.95% 63.51% 47.64% 71.62% 63.18% 21.96% 69.59% 48.65% 37.01% 67.32% 80.82% 6.33% 61.71% 80.15% 88.24% 74.13% 89.02% 81.39% 0.84 0.74 0.78 0.87 0.65 63.06% 53.11% 65.56% 67.66% 38.82% 69.14% 77.34% 61.65% 82.05% 50.73% 26.01% 10.81% 33.45% 35.47% 4.39% 18.64% 14.61% 24.94% 36.71% 3.44% Combiners AUGUSTUS-any FGENESH++ JIGSAW PAIRAGON-any 94.42% 91.09% 94.56% 87.77% 82.43% 76.89% 92.19% 92.78% 0.88 0.83 0.93 0.90 HMM’s & GHMM’s AUGUSTUS-abinit GENEMARK.hmm GENEZILLA 78.65% 78.43% 87.56% 75.29% 37.97% 50.93% Expression-based ACEVIEW AUGUSTUS-EST ENSEMBL EXOGEAN EXONHUNTER PAIRAGON+NSCAN 90.94% 92.62% 90.18% 84.18% 90.46% 87.56% Pair & Phylo HMM’s AUGUSTUS-dual DOGFISH MARS NSCAN SAGA 88.86% 64.81% 84.25% 85.38% 52.54% Source: Guigo et al., 2006 (to appear in Genome Biology) ES n GENE 4. Interoperability universal decoder gene predictions re-weighted ORF graph “one decoder to rule them all, one decoder to find them...” various other evidence (possibly including other ORF graphs) next generation gene finder weighted ORF graph presumably the best ORF graph generator available for this organism ab initio gene finder XYZ ATCGGCATTCGCGATTACGATATCTAGCGATGCTA Contemplating the Future pure alignmentbased Pair (G)HMM’s Phylo(G)HMM’s “Discriminative Bayes Net Combiner GHMM’s” (DBNCGHMM’s) ... combiners 2000 ... ? 2002 2003 2006 20xx ACKNOWLEDGEMENTS TIGR: S. Salzberg, M. Pertea, J. Allen, A. Delcher, J. Wortman, B. Haas Duke: U. Ohler, S. Mukherjee Elsewhere: M. Yandell, I. Korf, J. Eisen Methods for Computational Gene Prediction W.H. Majoros Expected publication date: late 2006 www.geneprediction.org/book
© Copyright 2024