NGS Bioinformatics Workshop 1.5 Genome Annotation April 4th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment: This week’s slides partly courtesy of Professor Fiona Brinkman, MBB, with material from Wyeth W. Wasserman and Shannan Ho Sui Overview General observations about genomes Repeat masking Gene finding Gene regulation and promoter analysis NGS Bioinformatics Workshop 1.5 Genome Annotation SOME GENERAL OBSERVATIONS ABOUT GENOMES 3 General Variables of Genomes Prokaryote versus Eukaryote versus Organelle Genome size: Number of chromosomes Number of base pairs Number of genes GC/AT relative content Repeat content Genome duplications and polyploidy Gene content See: Genomes, 2nd edition Terence A Brown. ISBN-10: 0-471-25046-5 See NCBI Bookshelve: http://www.ncbi.nlm.nih.gov/books/NBK21128/ Eukaryote versus Prokaryote Genomes Eukaryote Prokaryote Large (10 Mb – 100,000 Mb) Size Content There is not generally a relationship between organism complexity and its genome size (many plants have larger genomes than human!) Most DNA is non-coding Generally small (<10 Mb; most < 5Mb) Complexity (as measured by # of genes and metabolism) generally proportional to genome size DNA is “coding gene dense” Circular DNA, doesn't need telomeres Telomeres/ Centromeres Present (Linear DNA) Number of chromosomes More than one, (often) including those discriminating sexual identity Chromatin Histone bound (which serves as a genome regulation point) Don’t have mitosis, hence, no centromeres. Often one, sometimes more, -but plasmids, not true chromosome. No histones Uses supercoiling to pack genome Eukaryote versus Prokaryote Genomes Eukaryote Prokaryote Often have introns Intraspecific gene order and number generally relatively stable Genes many non-coding (RNA) genes There is NOT generally a relationship between organism complexity and gene number Gene regulation Promoters, often with distal long range enhancers/silencers, MARS, transcriptional domains Generally mono-cistronic Repetitive sequences Organelle (subgenomes) Generally highly repetitive with genome wide families from transposable element propagation Mitochondrial (all) chloroplasts (in plants) No introns Gene order and number may vary between strains of a species Promoters Enhancers/silencers rare Genes often regulated as polycistronic operons Generally few repeated sequences Relatively few transposons Absent Genome Size Physical: Amount of DNA / number of base pairs Number of chromosomes/linkage groups Information resources: NCBI: http://www.ncbi.nlm.nih.gov/genome Animals: http://www.genomesize.com/ Plants: http://data.kew.org/cvalues/ Fungi: http://www.zbi.ee/fungal-genomesize/ Genetic: Number of genes in the genome Gregory TR. 2002. Genome size and developmental complexity. Genetica. May;115(1):131-46. Size of Organelle Genomes Species Type of organism Genome size (kb) Mitochondrial genomes Plasmodium falciparum Protozoan (malaria parasite) 6 Chlamydomonas reinhardtii Green alga 16 Mus musculus Vertebrate (mouse) 16 Homo sapiens Vertebrate (human) 17 Metridium senile Invertebrate (sea anemone) 17 Drosophila melanogaster Invertebrate (fruit fly) 19 Chondrus crispus Red alga 26 Aspergillus nidulans Ascomycete fungus 33 Reclinomonas americana Protozoa 69 Saccharomyces cerevisiae Yeast 75 Suillus grisellus Basidiomycete fungus 121 Brassica oleracea Flowering plant (cabbage) 160 Arabidopsis thaliana Flowering plant (vetch) 367 Zea mays Flowering plant (maize) 570 Cucumis melo Flowering plant (melon) 2500 Chloroplast genomes Pisum sativum Flowering plant (pea) 120 Marchantia polymorpha Liverwort 121 Oryza sativa Flowering plant (rice) 136 Nicotiana tabacum Flowering plant (tobacco) 156 Chlamydomonas reinhardtii Green alga 195 http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5511 Size of Prokaryote Genomes Species Escherichia coli K-12 DNA molecules One circular molecule Size (Mb) 4.639 Number of genes 4397 Vibrio cholerae El Tor N16961 Two circular molecules Main chromosome 2.961 2770 Megaplasmid Four circular molecules 1.073 1115 Deinococcus radiodurans R1 2.649 0.412 0.177 0.046 2633 369 145 40 Borrelia burgdorferi B31 Chromosome 1 Chromosome 2 Megaplasmid Plasmid seven or eight circular molecules, 11 linear molecules Linear chromosome 0.911 853 Circular plasmid cp9 0.009 12 Circular plasmid cp26 0.026 29 Circular plasmid cp32* 0.032 Not known Linear plasmid lp17 0.017 25 Linear plasmid lp25 0.024 32 Linear plasmid lp28-1 0.027 32 Linear plasmid lp28-2 0.030 34 Linear plasmid lp28-3 0.029 41 Linear plasmid lp28-4 0.027 43 Linear plasmid lp36 0.037 54 Linear plasmid lp38 0.039 52 Linear plasmid lp54 0.054 76 Linear plasmid lp56 0.056 Not known http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5524 Size of Eukaryote Genomes Species Genome size (Mb) Fungi Saccharomyces cerevisiae 12.1 Aspergillus nidulans 25.4 Protozoa Tetrahymena pyriformis 190 Invertebrates Caenorhabditis elegans 97 Drosophila melanogaster 180 Bombyx mori (silkworm) 490 Strongylocentrotus purpuratus (sea urchin) 845 Locusta migratoria (locust) 5000 Vertebrates Takifugu rubripes (pufferfish) 400 Homo sapiens 3200 Mus musculus (mouse) 3300 Plants Arabidopsis thaliana (vetch) 125 Oryza sativa (rice) 430 Zea mays (maize) 2500 Pisum sativum (pea) 4800 Triticum aestivum (wheat) Fritillaria assyriaca (fritillary) 16 000 120 000 http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5471 http://en.wikipedia.org/wiki/Genome_size http://en.wikipedia.org/wiki/Genome#Comparison_of_different_genome_sizes Number of Genes Species Ploidy Cs Size (Mb) No. Genes Saccharomyces cerevisiae 2 16 12 6,281 Plasmodium falciparum 2 14 23 5,509 Caenorhabditis elegans 2 6 100 21,175 Drosophila melanogaster 2 6 139 15,016 Oryza sativa 2 12 410 30,294 Canis lupus familaris 2 39 2,445 24,044 Homo sapiens 2 24 3,100 36,036 Zea mays 2 10 2,046 42,000-56,000 (*) Protopterus aethiopicus ? ? 130,000 ? Paris japonica 8 40 150,000 ? Polychaos dubium ? ? 670,000 ? http://www.ncbi.nlm.nih.gov/genome (*) Haberer et al., Structure and architecture of the maize genome. Plant Physiol. 2005 Dec139(4):1612-24 AT/GC content Regional variations correlates with genomic content and function like transposable element distribution, gene density, gene regulation, methylation, etc. Often introduces bias in sequencing processes (e.g. library yields, PCR amplification, NGS sequencing) Species GC% Streptomyces coelicolor A3(2) 72 Plasmodium falciparum 20 Arabidopsis thaliana 36 Saccharomyces cerevisiae 38 Arabidopsis thaliana 36 Homo sapiens 41 (35 – 60) Romiguier et al. 2010. Contrasting GC-content dynamics across 33 mammalian genomes: Relationship with life-history traits and chromosome sizes. Genome Res. 20: 1001-1009 Repeat Content Large genomes generally reflect evolutionary expansion of large families of repetitive DNA (by RNA/DNA transposon amplification/insertion, genetic recombination) Repeats drive genome mutational processes: Recombination resulting in insertion, deletion, translocation, segmental duplication of DNA Insertional mutagenesis, possibly including de novo creation of genes Insert novel regulatory signals Jurka et al. 2007. Repetitive sequences in complex genomes: structure and evolution. Annu Rev Genomics Hum Genet. 2007;8:241-59. Repeats generally confound genome sequence assembly (especially for NGS, due to short reads). Gene annotation can also be problematic as transposons mimic gene structures. Genome Duplications/Polyploidy Segmental duplications (i.e. by recombination) Tandem: direct and inverted Whole genome duplication & loss, e.g. Ancestral vertebrate: 2 rounds o HOX gene clusters… Dehal P and Boore JL.2005. Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate. PLoS Biol 3(10) : e314. doi:10.1371 Polyploidy - ~70% of all angiosperms Genomic hybridization (allopolyploids) Can lead to immediate and extensive changes in gene expression Mapping of homeologous gene loci can be tricky Adams and Wendel. 2005. Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8(2):135–141 Case Study: Rice 10 duplicated blocks identified on all 12 chromosomes of Oryza sativa subspecies indica, that contained 47% of the total predicted genes. Possible genome duplication occurred ~70 million years ago, supporting the polyploidy origin of rice. Additional segmental duplication identified involving chromosomes 11 and 12, ~5 million years ago. Following the duplications, there have been large-scale chromosomal rearrangements and deletions. About 30–65% of duplicated genes were lost shortly after the duplications, leading to a rapid diploidization. Wang et al. 2005. Duplication and DNA segmental loss in the rice genome: implications for diploidization New Phytologist 165: 937–946 Gene Content http://www.ncbi.nlm.nih.gov/books/NBK21120/figure/A5501 The Bottom Line All of these genomic variables: Type of organism: i.e. prokaryote versus eukaryote Genome size GC/AT relative content Repeat content Genome duplications and polyploidy Gene content are important factors that can drive the strategy, expected outcome and efficacy of genome sequence assembly and annotation. NGS Bioinformatics Workshop 1.5 Genome Annotation GENOME REPEAT MASKING 19 Genomic (DNA) Sequence Repeat Masking Classic approach: search against repeat libraries RepeatMasker http://www.repeatmasker.org/ Uses a previously compiled library of repeat families Uses (user configured) external sequence search program Computationally intensive but… …the project web site also provides “pre-masked” genomic data for many completed genomes, complete with some statistical characterization. More Repeat Masking … de novo identification and classification: RECON: http://www.genetics.wustl.edu/eddy/recon RepeatGluer: http://nbcr.sdsc.edu/euler/ PILER: http://www.drive5.com/piler Repeat databases: Repbase: http://www.girinst.org/repbase/index.html plants: http://plantrepeats.plantbiology.msu.edu/ Related algorithms: “probability clouds” Gu et al. 2008. Identification of repeat structure in large genomes using repeat probability clouds. Anal Biochem. 380(1): 77–83 NGS Bioinformatics Workshop 1.5 Genome Annotation GENE FINDING …OR “WHAT IS A GENE?” 23 Objectives Review of differences in prokaryotic and eukaryotic gene organization. Understand consequences and challenges for gene finding algorithms for Prokaryotes and Eukaryotes. Appreciate HMM as powerful tool (in many areas of computational biology!) Be reminded that not all genes encode proteins and predictions of such genes have their own computational challenges. 24 Gene Annotation Questions Which genes are present? How did they get there (evolution)? Are the genes present in more than one copy? Which genes are not there that we would expect to be present? What order are the genes in, and does this have any significance? How similar is the genome of one organism to that of another? Why Gene-finding? Whole-genome annotation Genome sequence does not give you list of all genes Fully characterizing Yfg (“your favourite gene”) example: A disease is associated with a SNP in a location in the human genome. BLAST finds similarity to a protein coding gene in the area, but its only similar to part of the whole protein. What’s the whole gene? 26 After completing the human genome we faced 3 Gigabytes of this Not immediately apparent where the genes are… What is a gene? A distinct functional unit encoded (at minimum transcribed) in the genome Eukaryotes: The region that is transcribed – the mRNA defines it Prokaryotes: The coding region – because multiple genes may be on one transcript as an operon Both eukaryotes and prokaryotes: non-coding RNAs are treated relatively the same 29 Raw Biological Materials Prokaryotes Eukaryotes • High gene density • mRNA transcriptiontranslation is coupled • Genes are usually contiguous stretches of coding DNA • mRNAs often polycistronic • Low gene density • mRNA transcribed then transported to cytoplasm for translation • Genes’ coding DNA often split by non-coding introns • mRNAs are generally monocistronic gene gene ____________________ ___________ transcript Great real-time Transcription-Translation video: http://www.youtube.com/watch?v=41_Ne5mS2ls 30 How many genes in human genome? • 2000: must be at least 100,000 (Rice has ~40,000, C. elegans has ~19,000) • 2001: only 35,000? • 2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts) • 2006: Ensembl NCBI 36 release: 23,710 protein coding genes, plus 4421 RNA genes (48,851 transcripts) • Today: Ensembl 64 release, Sept 2011, is 20,900 coding genes + 14,266 RNA genes - but with alternative splicing these produce likely many more… How many genes in human genome? • 2000: must be at least 100,000 (Rice has ~40,000, C. elegans has ~19,000) • 2001: only 35,000? • 2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts) • 2006: Ensembl NCBI 36 release: 23,710 protein coding genes, plus 4421 RNA genes (48,851 transcripts) • Today: Ensembl 64 release, Sept 2011, is 20,900 coding genes + 14,266 RNA genes - but with alternative splicing these produce likely >100,000 proteins (178,191 currently annotated in Ensembl) Gene Density 1 gene in how many basepairs?... a. b. c. d. e. f. g. 1:10,000,000 1:1,000,000 1:100,000 roughly for human 1:10,000 (1:5000 for C. elegans) 1:1000 roughly for most bacteria 1:100 1:10 33 Approaches to Finding Genes AB INITIO (‘from the beginning’): search by content: find genes by statistical properties that distinguish protein-coding DNA from non-coding DNA Search by signals/sites: splice donor/acceptor sites, promoters, etc. HOMOLOGY: search by sequence similarity to homologous sequences state-of-the-art gene finding combines these strategies and also uses other data like EST/RNA-seq/microarray data 34 Gene-finding in Prokaryotes: Easy? ….or not? ORF Finder Open reading frame (ORF) from methionine codon to first Stop codon ORFs linked to BLAST http://www.ncbi.nlm.nih.gov/gorf/gorf.html Problem: All ORFs are not genes. Why? How can this be improved? 35 Gene Finding: Ab Initio Search by Content Protein code affects the statistical properties of a DNA sequence – some amino acids are used more frequently than others (Leu more popular than Trp) – different numbers of codons for different amino acids (Leu has 6, Trp has 1) – for a given amino acid, usually one codon is used more frequently than others • this is termed codon preference • these preferences vary by species 36 Gene-finding in Prokaryotes: Improving predictions… Common way to search by content build Markov models of coding & noncoding regions apply to ORFs or fixed-sized sequence windows Markov Model approaches: prokaryotic gene prediction Glimmer http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi http://cbcb.umd.edu/software/glimmer/ open source GeneMark http://opal.biology.gatech.edu/GeneMark/ not open source but slightly more accurate 37 i.e. If its sunny then the next weather state is not likely to become suddenly rainy. If its cloudy then it is more likely the next weather state will be rainy. Reproduced with permission, Chris Burge 38 For gene prediction: Certain bases occur after others in coding sequence versus noncoding sequence Reproduced with permission, Chris Burge 39 What’s Hidden in a Hidden Markov Model? The HMM is a model of what actually happened (the true gene structure) but all you can see is the sequence Each nucleotide is not labeled with its state; it’s hidden Based on training data – which is critical! Glimmer version 3 annotates microbial genomes with a sensitivity of 99% and accuracy of >98% resolves overlapping genes use appropriate gene dataset for training program no annotation of rRNA and tRNA genes employs interpolated Markov model NAR Vol.26 p1107-1115 42 GeneMark.hmm HMM plus ribosome binding site signals in primary nucleotide sequence training also required Most accurate prokaryotic gene predictor (by a bit over Glimmer) GeneMarkS version - translation start prediction: 83.2% of Bacillus subtilis genes 94.4% of Escherichia coli genes 43 Main issues with prokaryotic gene prediction Which ATG or GTG start to choose? Not that accurate Errors in sequence frame shifts that lead to errors in gene prediction Very small genes either not predicted at all (most prokaryotic genome projects don’t annotate any genes < 90 bp) or poorly predicted Non-coding RNA genes tend to be ignored! RNA-seq data being incorporated now… 44 RNA-seq Sequence cDNA from RNA using “next generation” sequencing Map sequence reads to ref genome Count number of sequence reads (depth of reads) for a given window of ref sequence 45 Rne (Rnase E) Upstream Region in P. aeruginosa Rne (Rnase E) Upstream Region • Upstream region of Rne in E. coli forms stem loop to regulate mRNA synthesis • Region aligns reasonably well between spp. • Also predicted to encode a small non-coding RNA using QRNA and RNAz E. coli PAO1 Gene prediction in eukaryotes Overview of eukaryotic gene structure will illustrate the gene-finding problem Components described are some of those that gene-finding tools must model These signals are never 100% reliable because there are (almost) as many exceptions as rules Signals (and noise) vary with organism 3% of human genome is “coding” vs. 25% in fugu The problem with eukaryotes: the easy stuff? 49 The problem with eukaryotes: the hard stuff ~50% of human genes undergo alternative splicing Genes can be nested: overlapping on same or opposite strand or inside an intron Pseudogenes: “dead” genes Raw Biological Materials: Eukaryotes Fig 10.1 Baxevanis and Ouellette, 2000 Transcription Start core promoter ... TSS – transcription start site • ~70% of promoter regions have TATA box Fig 11.25 from Introduction to Genetic Analysis, 7th ed. Griffiths et al. New York: W H Freeman & Co; c1999. http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=iga.figgrp.d1e46911 End Modification 5’cap added at 5’ end of transcript polyA-tail added at AATAAA consensus site Splice site recognition 5’ (a.k.a. donor) exon intron Logos from: http://genio.informatik.uni-stuttgart.de/GENIO/splice/ 3’ (a.k.a. acceptor) intron exon The problem with splice signals ~0.5% of splice sites are non-canonical, i.e. not GT-AG ~5% of human genes may have a noncanonical splice site ~50% of human genes are alternatively spliced Intron Phase A codon can be interrupted by an intron in one of three places and still code for the same amino acid Example CAG -> glutamine exon Phase 0: Phase 1: Phase 2: C CA intron exon CAG AG G Translation signals Start AUG Note: Bacteria also use GUG (~10% of starts) and UUG (rarely) Stop UAA, UAG, or UGA Translation signals Translation Start: “usually” first AUG after 5’ cap But Kozak consensus sequence aids the selection so the 2nd or 3rd AUG might be used Translation Stop: first in-frame stop codon (simple!) Gene-finding Strategies Similarity-based a.k.a. lookup method genome:protein genome:genome ab initio a.k.a. template method Experimental microarrays RNA Sequencing Similarity – genome:protein blastx finds similarity between translated genomic region and known proteins Problem: on human chr22 only 50% of genes had similarity to known proteins when first annotated Similarity – genome:genome Functional sequences will be conserved Learn different things from different evolutionary distances (signal:noise ratio) coding sequences regulatory sequences Compare mouse:human, pufferfish:human (Exofish) Problems: you need something “similar” won’t find precise boundaries ab initio Gene-finding Based on intrinsic information Coding statistics plus signal sensors No prior knowledge of similar genes needed Problem: gene models can only approach the biological truth (recall eukaryotic gene structure variability) Coding Statistics Exon/intron base composition differences indicate protein-coding regions Use frequency of occurrence of hexamers Signal Sensors Model features recognized by transcriptional, splicing and translational machinery Promoter elements Splice sites PolyA consensus sites Start and stop codons Use pattern recognition methods as signal detectors Weight matrices Neural networks Decision trees What don’t (coding) gene finders do? Explicit alternative splicing models Overlapping or nested genes Non-protein-coding genes Tools for finding genes… Check out the Bioinformatics Links Directory at Bioinformatics.ca http://bioinformatics.ca/links_directory/ Some tools for finding genes… http://bioinformatics.ca/links_directory/category/dna/gene-prediction Tools for eukaryotic gene prediction Ab Initio genscan, fgenesh Homology-based prediction Genewise (uses predicted proteins from one genome/seq to drive exon finding in a second genome/seq, combined with splice-site model) Twinscan (builds two genome predictions together, using conserved DNA signal as one criterion) Gene Finding in General: Training Sets ab initio gene-finders must be trained on real sequences Mostly single genes from specific groups of organisms (or the organism being studied) Bias of training set will affect a tool’s prediction accuracy Example: HMR195 has 195 single- or multiexon murine genes with typical splice sites and confirmed by cDNA (Rogic et al, 2001) GENSCAN Dissection Reproduced with permission, Chris Burge Reproduced with permission, Chris Burge Genscan HSMM • Circles and diamonds are states e.g. exon, phase 0 on plus strand • Arrows are probabilities of transition between states • Sequence you observe is generated from a series of states 72 Reproduced with permission, Chris Burge Reproduced with permission, Chris Burge Reproduced with permission, Chris Burge 74 Reproduced with permission, Chris Burge 75 Genscan HSMM • Circles and diamonds are states e.g. exon, phase 0 on plus strand • Arrows are probabilities of transition between states • Sequence you observe is generated from a series of states 76 Reproduced with permission, Chris Burge Signal Sensors Model features recognized by transcriptional, splicing and translational machinery Start codon 12bp WMM (Weight Matrix Model); probabilities estimated from GenBank CDS feature annotations 5’ splice site MDD (Maximal Depedence Decomposition) developed by Burge and Karlin (1997) to allow for dependencies between adjacent and non-adjacent nucleotides Coding Statistics Exon/intron base composition differences indicate protein-coding regions Use frequency of occurrence of hexamers Model with 5th order HMMs Non-coding: homogeneous Coding: heterogeneous different model, depending on whether it’s 1st, 2nd, 3rd codon position The predicted gene structure is the optimal combination of the above Reproduced with permission, Chris Burge GENSCAN Strengths Works well from fly to human (and plants with some tweaks) Weaknesses Practical Initial exons; HMMgene does (a bit) better Gene boundaries; tends to join or split genes Models Promoter and polyA site models Transcription start sites (TSS) Homology Based Approach Use evolutionary divergence as a tool Based on the principal that introns and other non-coding regions usually diverge much faster over time Helped by the fact that many splice patterns have been determined experimentally (cDNA) Fails when the divergence of distant sequences is too extreme (<50%; due to rapidly evolving proteins or large evolutionary distances) GeneWise (Ewan Birney) • GeneWise aligns a protein sequence (or HMM) to genomic DNA taking into account splicing information See also “Exonerate” The birth of comparative genomics… February 2001 December 2002 Comparative Genefinders • Twinscan -based on Genscan (Korf, 2001) • ROSETTA – global alignment of orthologous regions (Batzoglou et al, 2000) • SLAM (Alexandersson) and Doublescan (Meyer) – jointly align while finding predicting genes – computationally expensive (extra search dimension) and hard to parameterise – still effectively requires a full alignment • SGP2 (Parra, 2003) – Combines ab initio gene prediction with TBLASTX Evaluation of Twinscan Paul Flicek et al. Genome Res. 2003: TWINSCAN pred (red), GENSCAN (green), and an aligned RefSeq transcript (blue). Yellow: low-complexity regions, black: mouse alignments Evaluation of Twinscan Paul Flicek et al. (Genome Res. 2003): Accuracy of GENSCAN and TWINSCAN by the exact gene, exact exon, and coding nucleotide measures Twinscan (N-SCAN) in UCSC genome browser Region of seq1 (from HMR195 dataset) presented Twinscan URL: http://ardor.wustl.edu What about non-coding RNAs (ncRNAs)? 88 What are non-coding RNAs (ncRNAs)? ncRNA genes make transcripts that function directly as RNA, rather than encoding proteins. – ribosomal RNA, transfer RNA, RNase P, (tmRNA) – E.coli translational regulatory RNAs, e.g. OxyS – Eukaryotes: snRNA, snoRNA (& Archaea) 89 Example predictor: QRNA - probabilistic model III RNA: pair stochastic context-free grammar base stacking, loop length (MFOLD) Prediction of non-coding RNAs 1. Identification of ncRNAs with known RNA families RFAM: a database of RNA families (Griffiths-Jones et al., 2005) 2. Identification of novel ncRNAs Obtain orthologous sequences using BLASTN Multiple sequence alignment using MAFFT ncRNA prediction using RNAz and QRNA Analyze overlapping predictions and filter false positives Need lab validation (RT-PCR, RNA-seq…) Analysis of non-coding RNAs: Examples of Research Leaders 1. Sean Eddy (HHMI) RFAM, QRNA http://selab.janelia.org/ 2. Cenk Sahinalp (SFU) TaveRNA suite for structural analysis and interaction prediction http://compbio.cs.sfu.ca/ Rne (Rnase E) Upstream Region in P. aeruginosa Rne (Rnase E) Upstream Region • Upstream region of Rne in E. coli forms stem loop to regulate mRNA synthesis • Region aligns reasonably well between spp. • Also predicted to encode a small non-coding RNA using QRNA and RNAz E. coli PAO1 Summary • Genes are complex structures, which are difficult to predict • Different approaches to gene finding: – Ab Initio : i.e. GenScan (use appropriate training dataset!) – Homology guided: i.e. GeneWise Outstanding Issues Most Gene finders cannot handle untranslated regions (UTRs) ~40% of human genes have non-coding 1st exons (UTRs) Most gene finders cannot handle alternative splicing Most gene finders cannot handle overlapping or nested genes Bottom Line Gene finding in eukaryotes is not solved Accuracy of the best methods approaches 80% at the exon level (90% at the nucleotide level) in coding-rich regions (much lower for whole genomes) Gene predictions should always be verified by other means (cDNA sequencing RNA-seq/EST, BLAST search) Whole Genome Annotation A lot of work has been done for you! NCBI Map Viewers Ensembl UCSC Genome Browser 98 Top 10 Future Challenges Ewan Birney, Chris Burge, Jim Fickett in Genome Technology 1. Precise, predictive model of transcription initiation and termination: ability to predict where and when transcription will occur in a genome 2. Precise, predictive model of RNA splicing/ alternative splicing: ability to predict the splicing pattern of any primary transcript in any tissue 3. Precise, quantitative models of signal transduction pathways: ability to predict cellular responses to external stimuli 4. Determining effective protein:DNA, protein:RNA and protein:protein recognition codes 5. Accurate ab initio protein structure prediction 5 minute break? NGS Bioinformatics Workshop 1.5 Genome Annotation GENE REGULATION AND PROMOTER ANALYSIS With material from Wyeth W. Wasserman and Shannan Ho Sui Objectives Note: Focus on Eukaryotic cells and Polymerase II Promoters Overview of transcription Structure of eukaryotic promoter Laboratory methods for identifying binding motifs Prediction of transcription factor binding sites using binding profiles (“Discrimination”) TFBS scanning – a great example of how bioinformatics can be tweaked when we don’t fully understand the biology! Interrogation of sets of co-expressed genes to identify mediating transcription factors TFBS Over-representation Detection of novel motifs (TFBS) over-represented in regulatory regions of co-expressed genes (“Discovery”) Motif Discovery! Transcription Simplified Three-step Process: 1. Transcription factor (TF; a protein) binds to transcription factor binding site (TFBS; in DNA) 2. TF catalyzes recruitment of polII complex 3. Production of RNA from transcription start site (TSS) TF Pol-II TFBS TATA TSS Anatomy of Transcriptional Regulation WARNING: Terms vary widely in meaning between scientists Core Promoter/Initiation Region (Inr) Distal Regulatory Region TFBS TFBS TFBS Proximal Regulatory Region TFBS TFBS TATA TSR EXON Distal Regulatory region TFBS TFBS EXON Core Promoter – Sufficient for initiation of transcription; orientation dependent (i.e. immediately upstream, functions in only one orientation) TSR – transcription start region o Refers to a region rather than specific start site (TSS) TFBS – single transcription factor binding site Regulatory Regions Proximal/Distal – vague reference to distance from TSR May be positive (enhancing) or negative (repressing) Orientation independent and location independent (generally) Modules – Sets of TFBS within a region that function together Complexity in Transcription Chromatin Distal enhancer Proximal enhancer 105 Core Promoter Distal enhancer Refer again to Transcription in video: http://www.youtube.com/watch?v=41_Ne5mS2ls Transcription Factors Bind to enhancers found in vicinity of gene These proteins promote transcription by (i) binding DNA and (ii) activating transcription These two functions usually reside on separate structural domains of the transcription factor protein Activation occurs via interactions with other transcription factors and /or RNA polymerase 106 DNA Binding Domains – Conserved Motifs Leucine Zipper Motif Zinc Finger Motif Helix-turn-Helix Motif Helix-loop-Helix Motif Structural motifs within different types of transcription factors. See also http://www.youtube.com/watch?v=7EkSBBDQmpE Lab Discovery of TF Binding Sites 0% Reporter Gene Activity 100% LUCIFERASE LUCIFERASE LUCIFERASE LUCIFERASE LUCIFERASE LUCIFERASE LUCIFERASE mutation Identify functional regulatory region within a sequence and delineate specific TFBS through mutagenesis (and in vitro binding studies) EMSA/Gel Shift Assays to Identify Binding Proteins TF + DNA DNA http://www.biomedcentral.com/content/figures/1741-7015-4-28-8.jpg High-throughput Methods SELEX mix random ds DNA oligonucleotides with TF protein, recover TF-DNA complexes and sequence DNA Protein Binding Arrays (UniProbe Database) prepare arrays with ds DNA attached, label protein with a fluorescent mark and observe DNA bound by protein ChIP (Chromatin Immunoprecipitation) covalently link proteins to DNA in cell, shear DNA, recover protein-DNA complexes and identify DNA (PCR, array or sequencing) Representing Binding Sites for a TF • A single site • AAGTTAATGA • A set of sites represented as a consensus • VDRTWRWWSHD (IUPAC degenerate DNA) IUPAC (International Union of Pure and Applied Chemistry) nomenclature • A way to represent ambiguity when more than one value is possible at a particular position For DNA A C G T R Y K M = = = = = = = = adenine cytosine guanine thymine G or A (purine) T or C (pyrimidine) G or T (keto) A or C (amino) S W B D H V N = = = = = = = G or C (strong bonds) A or T (weak bonds) G, T or C (all but A) G, A or T (all but C) A, C or T (all but G) G, C or A (all but T) A, G, C or T (any) Set of binding sites AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA Representing Binding Sites for a TF • A single site • AAGTTAATGA • A set of sites represented as a consensus • VDRTWRWWSHD (IUPAC degenerate DNA) • A matrix describing a set of sites: A C G T 14 16 4 0 1 19 20 1 3 0 0 0 0 0 0 0 4 3 17 0 0 2 0 0 0 2 0 21 20 0 1 20 4 13 4 4 13 12 3 7 3 1 0 3 1 12 9 1 3 0 5 2 2 1 4 13 17 0 6 4 Logo – A graphical representation of frequency matrix. Y-axis is information content , which reflects the strength of the pattern in each column of the matrix Set of binding sites AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA Conversion of PFMs to Position Specific Scoring Matrices (PSSM) PSSMs also known as Position Weight Matrices(PWMs) Add the following features to the matrix profile: 1. Correct for nucleotide frequencies in genome 2. Weight for the confidence (depth) in the pattern 3. Convert to log-scale probability for easy arithmetic pssm pfm A C G T 5 0 0 0 0 2 3 0 1 2 1 1 0 4 0 1 0 0 4 1 Log2 ( f(b,i) + s(n) p(b) ) A C G T 1.6 -1.7 -1.7 -1.7 -1.7 0.5 1.0 -1.7 -0.2 0.5 -0.2 -0.2 -1.7 1.3 -1.7 -0.2 -1.7 -1.7 1.3 -0.2 TGCTG = 0.9 f(b,i) = frequency of base b at position i s(n) = pseudocount correction (optionally applied to small samples of binding sites) 113 p(b) = background probability of base b PSSM Scoring Scales Raw scores Sum of values from indicated cells of the matrix Relative Scores (most common) Normalize the scores to range of 0-1 or 0%-100% Empirical p-values Based on distribution of scores for some DNA sequence, determine a p-value (see next slide) Detecting binding sites in a single sequence Raw Scores Sp1 ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC A C G T [-0.2284 0.4368 [-0.2284 -0.2284 [ 1.2348 1.2348 [ 0.4368 -0.2284 -1.5 -1.5 2.1222 -1.5 -1.5 -1.5 -1.5 1.5128 2.1222 0.4368 -1.5 -0.2284 0.4368 -1.5 -1.5 -0.2284 1.2348 1.5128 0.4368 0.4368 -1.5 -0.2284 -1.5 -0.2284 1.7457 1.7457 0.4368 -1.5 0.4368 -1.5 -1.5 1.7457 Abs_score = 13.4 (sum of column scores) Relative Scores A C G T [-0.2284 0.4368 [-0.2284 -0.2284 [ 1.2348 1.2348 [ 0.4368 -0.2284 -1.5 -1.5 2.1222 -1.5 -1.5 -1.5 -1.5 1.5128 2.1222 0.4368 -1.5 -0.2284 0.4368 -1.5 -1.5 -0.2284 1.2348 1.5128 0.4368 0.4368 Empirical p-value Scores -1.5 -0.2284 -1.5 -0.2284 1.7457 1.7457 0.4368 -1.5 0.4368 -1.5 -1.5 1.7457 ] ] ] ] 0.3 Max_score = 15.2 (sum of highest column scores) [-0.2284 0.4368 [-0.2284 -0.2284 [ 1.2348 1.2348 [ 0.4368 -0.2284 -1.5 -1.5 2.1222 -1.5 -1.5 -1.5 -1.5 1.5128 2.1222 0.4368 -1.5 -0.2284 0.4368 -1.5 -1.5 -0.2284 1.2348 1.5128 0.4368 0.4368 -1.5 -0.2284 -1.5 -0.2284 1.7457 1.7457 0.4368 -1.5 0.4368 -1.5 -1.5 1.7457 Min_score = -10.3 (sum of lowest column scores) Abs_score - Min_score 100 % Max_score - Min_score 13.4 - (-10.3) 100% 93% 15.2 (10.3) Rel_score ] ] ] ] Frequency A C G T Area to right of value Area under entire curve 0.2 0.1 0.0 0.0 0.2 0.4 0.6 Relative Score 0.8 1.0 ] ] ] ] JASPAR: AN OPEN-ACCESS DATABASE OF TF BINDING PROFILES ( jaspar.genereg.net ) (Some other databases with TF profiles include Transfac, TRRD, mPromDB, SCPD (yeast), dbTBS and EcoTFS (bacteria)) 116 The Good… Tronche (1997) tested 50 predicted HNF1 TFBS using an in vitro binding test and found that 96% of the predicted sites were bound! BINDING ENERGY Stormo and Fields (1998) found in detailed biochemical studies that the best weight matrices produce scores highly correlated with in vitro binding energy PSSM SCORE …the Bad… Fickett (1995) found that a profile for the myoD TF made predictions at a rate of 1 per ~500bp of human DNA sequence This corresponds to an average of 20 sites / gene (assuming 10,000 bp as average human gene size) …and the Ugly! Human Cardiac a-Actin gene analyzed with a set of profiles (each line represents a TFBS prediction) Futility Conjecture: TFBS predictions, as a collective group, are almost always wrong! True binding sites are defined by properties not incorporated into the PSSM profile scores Red boxes are protein coding exons TFBS predictions excluded in this analysis What have we learned? PSSMs accurately reflect in vitro binding properties of DNA binding proteins Suitable binding sites occur at a rate far too frequent to reflect in vivo function Bioinformatics methods that use PSSMs for binding site studies must incorporate additional information to enhance specificity Unfiltered predictions are too noisy for most applications Note: Organisms with short regulatory sequences are less problematic (e.g. yeast and bacteria) Example of a bioinformatics challenge that needs more biological information to make predictions 70,000,000 years of evolution can reveal regulatory regions USING PHYLOGENETIC FOOTPRINTING TO IMPROVE TFBS DISCRIMINATION Phylogenetic Footprinting % Identity 200 bp Window Start Position (human sequence) Actin gene compared between human and mouse • • • Align orthologous gene sequences (e.g. use LAGAN). For 1st window of x bp (i.e. 200 bp), of sequence#1, determine % identity with sequence#2. Step across (slide window across) the first sequence, recording % identity in each window with the second sequence. Observe high identity with exons, lower identity in 5’ and 3’ UTRs Additional conserved region could be regulatory region Multi-species Phylogenetic Footprinting Phylogenetic Footprinting Dramatically Reduces Spurious Hits Human Mouse 124 Actin, alpha cardiac TFBS Prediction with Human & Mouse Pairwise Phylogenetic Footprinting SELECTIVITY SENSITIVITY Testing set: 40 experimentally defined sites in 15 well studied genes (Replicated with 100+ site set) 75-80% of defined sites detected with “sequence conservation filter” (incorporating orthologous sequence data), while only 11-16% of total predictions retained 1kbp insulin receptor promoter screened with footprinting 126 Choosing the ”right” species for pairwise comparison... CHICKEN HUMAN MOUSE COW HUMAN HUMAN ConSite http://consite.genereg.net/ TFBS Discrimination Tools I Phylogenetic Footprinting Servers FOOTER http://biodev.hgen.pitt.edu/footer_php/Footerv2_0.php CONSITE http://asp.ii.uib.no:8090/cgi-bin/CONSITE/consite/ or http://consite.genereg.net/ rVISTA http://rvista.dcode.org/ ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk SNPs in TFBS Analysis RAVEN http://burgundy.cmmt.ubc.ca/cgi-bin/RAVEN/a?rm=home TFBS Discrimination Tools II Prokaryotes or Yeast PRODORIC http://prodoric.tu-bs.de/ YEASTRACT http://www.yeastract.com/index.php Software Packages TOUCAN http://homes.esat.kuleuven.be/~saerts/software/toucan.php Programming Tools TFBS http://tfbs.genereg.net/ ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk What have we learned? TFBS discrimination coupled with phylogenetic footprinting has greater specificity with tolerable loss of sensitivity As with any purification process, some true binding sites will be lost There are several online resources that provide precomputed analyses using phylogenetic footprinting TFBS Over-representation We seek to determine if a set of co-expressed genes contains an over-abundance of predicted binding sites for a known TF Phylogenetic footprinting to reduce false prediction rate Similar to looking for functional over-representation (i.e. identify annotations of gene function in sets of coexpressed genes that are over-represented) Two Examples of TFBS Over-Representation Foreground Foreground More Total TFBS More Genes with TFBS Background Background Statistical Methods for Identifying Overrepresented TFBS Binomial test (Z scores) Based on the number of occurrences of the TFBS relative to background Normalized for sequence length Simple binomial distribution model Fisher exact probability scores Based on the number of genes containing the TFBS relative to background Hypergeometric probability distribution oPOSSUM Server Validation using Reference Gene Sets A. Muscle-specific (23 input; 16 analyzed) Rank Z-score B. Liver-specific (20 input; 12 analyzed) Fisher Rank Z-score Fisher SRF 1 21.41 1.18e-02 HNF-1 1 38.21 8.83e-08 MEF2 2 18.12 8.05e-04 HLF 2 11.00 9.50e-03 c-MYB_1 3 14.41 1.25e-03 Sox-5 3 9.822 1.22e-01 Myf 4 13.54 3.83e-03 FREAC-4 4 7.101 1.60e-01 TEF-1 5 11.22 2.87e-03 HNF-3beta 5 4.494 4.66e-02 deltaEF1 6 10.88 1.09e-02 SOX17 6 4.229 4.20e-01 S8 7 5.874 2.93e-01 Yin-Yang 7 4.070 1.16e-01 Irf-1 8 5.245 2.63e-01 S8 8 3.821 1.61e-02 Thing1-E47 9 4.485 4.97e-02 Irf-1 9 3.477 1.69e-01 HNF-1 10 3.353 2.93e-01 COUP-TF 10 3.286 2.97e-01 TFs with experimentally-verified sites in the reference sets. Structurally-related TFs with Indistinguishable TFBS Most structurally related TFs bind to highly similar patterns Zn-finger is a big exception Notes TFBS over-representation analysis can identify increased number of sites in a set of co-expressed genes, or increased number of genes in a co-expressed gene set that have a given site. Generally best performance has been with data directly linked to a transcription factor Highly dependent on the experimental design – cannot overcome noisy data from poor design Note that the identity of a mediating TF may not be apparent when many proteins can bind to the same motif de novo Pattern Discovery String-based e.g. YMF (Sinha & Tompa) Generalization: Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections Used often for yeast promoter analysis Profile-based e.g. AnnSpec (Workman & Stormo) or MEME (Bailey & Elkin) Generalization: Identify strong patterns (position weight matrices) in “+” promoter collection vs. background model of expected sequence characteristics String-based methods(1) How likely are X words in a set of sequences, given background sequence characteristics? CCCGCCGGAATGAAATCTGATTGACATTTTCC TTCAAATTTTAACGCCGGAATAATCTCCTATT TCGCTGTAACCGGAATATTTAGTCAGTTTTTG TATCGTCATTCTCCGCCTCTTTTCTT GCTTATCAATGCGCCCGGAATAAAACGCTATA CATTGACTTTATCGAATAAATCTGTT ATCTATTTACAATGATAAAACTTCAA ATGGTCTCTACCGGAAAGCTACTTTCAGAATT TTTCAAATCCGGAATTTCCACCCGGAATTACT TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC ACTGAACTTGTCTTCAAATTTCAACACCGGAA TCAATGCCGGAATTCTGAATGTGAGTCGCCCT >EP71002 >EP63009 >EP63010 >EP11013 >EP11014 >EP11015 >EP11016 >EP11017 >EP63007 >EP63008 >EP17012 >EP55011 (+) (+) (+) (+) (+) (-) (+) (+) (-) (+) (+) (-) Ce[IV] msp-56 B; range -100 to -75 Ce Cuticle Col-12; range -100 to -75 Ce Cuticle Col-13; range -100 to -75 Ce vitellogenin 2; range -100 to -75 Ce vitellogenin 5; range -100 to -75 Ce vitellogenin 4; range -100 to -75 Ce vitellogenin 6; range -100 to -75 Ce calmodulin cal-2; range -100 to -75 Ce cAMP-dep. PKR P1+; range -100 to -75 Ce cAMP-dep. PKR P2; range -100 to -75 Ce hsp 16K-1 A; range -100 to -75 Ce hsp 16K-1 B; range String-based methods(2) Find all words of length n in the yeast promoters (e.g. n=7) GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGG ACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGAC GGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGC CAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGT CTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAA ATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTG GGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAA GTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTT TGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGA AAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCA AAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATC ACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGT CGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAA GGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAG ATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCAT CATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAA ACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTG AAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTG TCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTC TCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGT ATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCT AGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGT CTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCT GCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTG CTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA Make a lookup table: AAACCTTT TTTTTTTT GATAGGCA Etc... 456 57788 589 String-based methods(3) Xw: Instances of a word w within our set of X genes X w EX w Zw VarX w E[Xw]: Average number of instances of w based on number of genes in our set Var[Xw]: Variance – how much deviation from the average is expected for w Limitations of String-based Methods Longer word lengths not possible Number of possible strings of length n is 4n without any ambiguities and more if we allow ambiguous bases. Thus, as n increases, the computational time notably increases. While degeneracy codes can be used, TFBS are not words – we lose quantitation for variable positions with consensus sequences Imagine column in PFM with 7 A’s and 1 T --- in a consensus sequence we would represent as W or throw out the instance with T Recently the string-based method has found renewed utility in the analysis of 3’UTRs for the presence of microRNA target sequences... Aside: microRNA Target Sequences Lim et al expressed miRNAs in cells and observed that the overall pattern of gene expression shifted toward the pattern of expression observed in cells which naturally express the miRNA The genes with reduced expression in response to miRNA exposure shared 7nt motifs the 3’UTR of their transcripts Nice website tutorial: http://www.ambion.com/main/explorations/mirna.html Probabilistic Methods Overview: Find a local alignment of width x of sites that maximizes information content (or related measure) in reasonable time Usually by Gibbs sampling (i.e. Gibbs Motif Sampler) or Expectation Maximization methods (i.e. MEME) Motivation: TFBS are not words, but rather sequence patterns Efficiency – can handle longer patterns than string-based methods Can be intentionally influenced to reflect prior knowledge Ab initio motif finding: Gibbs sampling Popular “Markov chain Monte Carlo” algorithm for motif discovery Motif model: Position Weight Matrix Local search algorithm Gibbs sampling: basic idea Current motif = PWM formed by circled substrings Gibbs sampling: basic idea Delete one substring Gibbs sampling: basic idea Try a replacement: Compute its score, Accept the replacement depending on the score. Gibbs sampling: basic idea New motif Gibbs Sampling (grossly over-simplified) ttcgctcc cgatacgc tgctacct tgacttcc agacctca ctgtagtg acgcatct A C G T 1 2 3 4 5 6 7 8 2 0 2 2 2 1 0 1 0 2 3 3 2 1 6 2 0 4 1 0 1 0 1 1 4 1 1 2 2 5 0 2 i.e. Gibbs Motif Sampler: http://bayesweb.wadsworth.org/gibbs/gibbs.html http://www.youtube.com/watch?v=_4yrlzVTr_k Pattern Discovery Gibbs sampling is guaranteed to return an optimal pattern if repeated sufficiently often Procedure is fast, so running many 1000s of times is feasible Problems if the TFBS’s are not strongly overrepresented relative to other patterns… Applied Pattern Discovery is Acutely Sensitive to Noise PATTERN SIMILARITY vs. TRUE MEF2 PROFILE 18 Pink line is negative control with no Mef2 sites included 16 14 12 10 0 100 200 300 400 500 SEQUENCE LENGTH True Mef2 Binding Sites 600 Ab initio motif finding: Expectation Maximization Popular algorithm for motif discovery Motif model: Position Weight Matrix Local search algorithm Move from current choice of motif to a new similar motif, so as to improve the score Keep doing this until no more improvement is obtained: converge to local optima Basic idea of iteration PWM 1. Current motif 2. Scan sequence for good matches to the current motif. 3. Build a new PWM out of these matches, and make it the new motif The basic idea can be formalized in the language of probability MEME Popular motif finding program that uses Expectation-Maximization Web site http://meme.sdsc.edu/meme/website/meme.html Four Approaches to Improve Sensitivity Better background models -Higher-order properties of DNA Phylogenetic Footprinting Human:Mouse comparison eliminates ~75% of sequence Regulatory Modules Architectural rules Limit the types of binding profiles allowed TFBS patterns are NOT random (Also, for prokaryotes in particular, accurate ID of transcription start sites can greatly aid identification of promoter sequences) Pattern Discovery Summary Pattern discovery methods can recover overrepresented patterns in the promoters of coexpressed genes Methods are acutely sensitive to noise – notable problem if the signal we seek is weak TFs tolerate great variability between binding sites Supplementary information/approaches (such as phylogenetic footprinting) may be required to over-come the noise REFLECTIONS TFBS prediction Futility Theorem – Essentially predictions of individual TFBS have no relationship to an in vivo function Current successful bioinformatics methods incorporate additional information (clusters, sequence conservation) TFBS over-representation TFBS over-representation is powerful way to identify TFs likely to contribute to observed patterns of co-expression Novel motif discovery (also useful in general) Pattern discovery methods are restricted by the Signal-toNoise problem Observed patterns must be carefully considered Again, successful methods incorporate additional information (conservation, structural constraints on TFs, and (for prokaryotes) positional information relative to TSS)
© Copyright 2024