Summation: Principles of Bioinformatics

Summation: Principles of Bioinformatics
• Review the key ingredients of the
Recipe for Bioinformatics.
• Use the Human Genome results as
examples for understanding the
importance of these ingredients in future
genomics and bioinformatics problems.
• Integrate these principles with all the
specifics you’ve learned this quarter:
these principles are present in
everything we’ve done in this class.
What is Genomics?
G = (MB)^n
Genomics is any (molecular) biology
experiment taken to the whole genome
scale.
• Ideally in a single experiment.
• E.g. genome sequencing.
• E.g. DNA microarray analysis of gene
expression.
• E.g. mass spectrometry protein mixture
analyses: quantity, phosphorylation, etc.
Genomics Foundation: High
Throughput Technology
• Automation: any human step is a
bottleneck.
• Multiplexing & parallelization.
• Miniaturization.
• Read-out speed, sensitivity.
• “GMP” Q/A, reproducibility, “production
line” mindset.
In Genomics every question is
really an information
problem
• In molecular biology, experiments are
small and designed to test a specific
hypothesis clearly and directly.
• In genomics, experiments are massive
and not designed for a single
hypothesis.
• Every biology question about genomics
data corresponds to a computer science
problem: how to find the desired pattern
in a dataset.
Human Genome Sequencing
• The experimental part (the actual
sequencing) was easy. It was the
information problem that was hard.
• Assembly: the high frequency of repeats
in the human genome can fool you into
joining the wrong fragments.
Purely Sequence-Based
Assembly
Celera believed they could assemble the whole human
genome from shotgun sequence fragments in this way.
But this approach failed. They had to use the public
domain map data to resolve problems in their assembly.
Genome Annotation
• Genes are what biologists really want, not just
the genome sequence.
• Unfortunately, most of the ~32,000 gene
annotations are based on gene prediction,
not on direct experimental evidence.
• It is likely that 50% of the reported genes are
wrong in details (individual exons,
boundaries) or entirely.
• The Drosophila annotation has recently been
shown to be deeply flawed.
• An information problem that is still not solved.
Definition of Bioinformatics
Bioinformatics is the study of the inherent
structure of biological information.
• Data-driven: let the data speak for
themselves.
• non-random patterns in the data.
• Measure significance of patterns as
evidence for competing hypotheses.
Computational Challenges
• Cluster genes by expression pattern
over the course of the cell cycle.
• Identify groups of genes that are co-expressed, co-regulated.
• Identify regulatory elements in common
to the promoters of these genes, that
make them be expressed at the same
time.
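The clustering step above can be sketched as correlation-based grouping of expression profiles. This is a minimal illustration: the expression values and gene labels below are invented, not real cell-cycle data.

```python
# Sketch: group genes whose expression profiles correlate strongly
# across cell-cycle time points. All values and labels are invented.

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

expr = {
    "geneA": [1.0, 2.0, 3.0, 2.0, 1.0],   # peaks mid-cycle
    "geneB": [1.1, 2.1, 3.2, 2.0, 0.9],   # tracks geneA: co-expressed
    "geneC": [3.0, 2.0, 1.0, 2.0, 3.0],   # anti-phase to geneA
}

print(pearson(expr["geneA"], expr["geneB"]))  # high positive: same cluster
print(pearson(expr["geneA"], expr["geneC"]))  # strongly negative: anti-phase
```

Genes whose pairwise correlation exceeds a chosen threshold would then be merged into one cluster, whose shared promoter elements can be searched for.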
Solving the Information
Problem
• Modeling the problem: choosing what to
include and how to describe it.
• Relating this to known information
problems.
• Algorithms for solution.
• Complexity: amount of time & memory
the algorithm requires.
Completeness Changes
Everything
• In molecular biology cleverness is
finding a way to answer a question
definitively by ignoring 99.99% of
genes. You can’t see them, so the
experiment must exclude them.
• In genomics cleverness is discovering
what becomes possible when you can
see everything.
We have to switch our deepest assumptions.
What specifically can you
learn from Everything?
E.g. Protein function prediction:
• genomic neighbors method
• phylogenetic profiles
• domain fusion (Rosetta Stone method)
Microarray gene expression analysis
• meaningful signal not in just a few
genes
Example: Ortholog
Prediction
• Orthologs: two genes related by
speciation events alone. “the same
gene in two species”, typically, same
function.
• Paralogs: two genes related by at least
one gene-duplication & divergence
event.
Homology: is it an ortholog or a paralog?
• Experimentally very hard to answer.
Genomics Requires
Statistical Measures of
Evidence
• Evaluate competing hypotheses under
uncertainty, automatically?
• based on statistical tendencies, not
“proofs”
• false positives, false negatives
• the need for cross-validation
• the need for experimental validation
• best role: experiment interpretation and
planning
Measures of Evidence
• SNP identification from sequence data
• Genome annotation: Gene evidence?
Keys
• explicit, realistic likelihood models,
priors measured from tons of real data.
• Explicit evaluation of alternative models.
• Real posteriors, w/ measures of
uncertainty.
Integrating Independent
Evidence
• Typically, a calculation works with one
kind of data; hard to integrate very
different data.
• Likelihood Models provide easy way to
integrate many different types of data: if
they are really different, just multiply
them
• Independence; Factorization!
Statistical Problems
• Microarray analysis: hierarchical
clustering?
• Genome annotation: gene prediction?
• COGs: no statistics at all…
• Protein function prediction: distance
metrics instead of probabilistic modeling,
no posteriors.
From Reductionism to
Systems Analysis
• Mol. Biol.: dissect a complex
phenomenon into its smallest pieces;
characterize each.
• Very hard to put the pieces back together
again: Given AB, AC: A+B+C = ?
• Genomics: The cell as test-tube. Able to
see A+B+C (+D+E+…) working together.
• Study how all the components work
together as a system. Study system
behavior.
Cell-Cycle Regulated Genes
by whole genome mArray
Automated Discovery of Cell
Cycle Regulatory Elements
Rosetta Stone Assumption:
Fusion of functionally-linked domains
[Figure: in one organism, domains A and B occur as separate proteins; in another organism, their homologs A′ and B′ are fused into a single protein.]
Implies proteins A and B may be functionally linked
Phylogenetic Profile Method
From Hypothesis-Driven
to Data-Driven Science
• Mol. Biol.: can’t see 99.99% of genes, so
use black-box logic based on controls:
keep everything the same except for one
small change. Isolate a specific cause-effect.
• In reality you rarely have the perfect
control.
• Hypothesis driven: can only see what
you look for: a few genes, a few controls.
• Interpretable: ask a YES-NO question.
From Hypothesis-driven
to Data-driven Science
• Genomics: measure all genes at once.
• Don’t have to assume a hypothesis as
basis for designing the experiment.
• Objective: let the data speak for
themselves.
• Reality: vast amounts of data, very
complex, hard to interpret.
“System Science” or just “Stupid
Science”?
Stupid Science: Data-driven
Science Done Wrong
• No hypothesis.
• Assumptions: alternative models not explicitly
enumerated, weighed.
• Statistical basis of model either neglected or
only implicit (and therefore poor).
• No cross-validation: just one form of evidence.
• Greedy algorithms, sensitive to noise.
• Measures of significance weak or absent, both
computationally and experimentally.
Data-driven Science Done
Right
• Multiple competing hypotheses.
• Alternative models explicitly included,
computed, to eliminate assumptions.
• Statistical models clear, well-justified.
• Multiple, independent types of evidence.
• Robust algorithms w/ well demonstrated
convergence to global optimum.
• Rigorous posterior probability calculated for
all possible models of the data. Priors
derived from data. False +/- measured.
Implications of Data-driven
Science
• To get strong posteriors that can distinguish
multiple models, you need LOTS of data.
• Genomics is creating an unprecedented
avalanche of data, opportunities.
• A change in the nature of data: “lost” data in old
notebooks, journals, heads; vs. electronic
databases that can be queried, analyzed.
• The end of (purely) human analysis.
• Don’t confuse observations & interpretations.
Bioinformatics as Prediction
• Given a protein sequence,
bioinformatics would seek to predict its
fold.
• Given a genome sequence,
bioinformatics would seek to predict the
locations and exon-intron structures of
its genes.
• The ultimate test: make a blind
prediction (when no experimental data
are yet available).
A new kind of Bioinformatics
• The massive experimental data
produced by genomics projects has
created a demand for a fundamentally
different kind of bioinformatics, which
we can characterize (with some
exaggeration) as a mix of three
principles:
CHEAT
• Don’t even try to predict anything.
• Just say, “Give us all your experimental
data that contain the answer to this
question, and THEN we’ll tell you what
we think the answer is!”
• Focus is on statistically accurate
measurement of the strength of the
evidence for different interpretations of
the experimental data.
Steal Other People’s Data
• The massive amount of data being
produced in the public domain is an
opportunity for heavy duty data-mining,
using statistics to expose patterns that
would otherwise be missed in this huge
dataset.
Trust No One
What kind of data do we want?
• RAW EXPERIMENTAL DATA, ideally
straight from the sequencing machines.
• INTERPRETED DATA is untrustworthy.
• Actually, bioinformatics PREDICTIONS
are contaminating the experimental
databases!
Chromatographic Evidence
[Figure: sequencing chromatogram traces for ESTs Hs#S785496 (zu42c08.r1) and Hs#S1065649 (oz03ho7.x1*), showing the raw base calls that support a reported discrepancy.]
Science by Computer?
• No human scientist will ever look at all these
data.
• To make discoveries in these data, scientific
judgment of evidence must be formalized as a
computation.
• Computational inference about hidden states H
from observable states O.
Bayes' Law:

    p(H | O) = p(O | H) p(H) / Σ_h p(O | h) p(h)

(likelihood × prior in the numerator; the denominator sums over all hidden states h; the result is the posterior.)
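Bayes' Law over enumerable hidden states is a few lines of computation. The coin example below is invented purely to exercise the formula; it is not from the lecture.

```python
from math import comb

# Generic sketch of Bayes' law over a finite set of hidden states.
def bayes_posterior(prior, likelihood):
    """prior: {h: p(h)}; likelihood: {h: p(obs | h)} for fixed observations.
    Returns {h: p(h | obs)} = p(obs|h) p(h) / sum_h' p(obs|h') p(h')."""
    z = sum(likelihood[h] * prior[h] for h in prior)  # sum over hidden states
    return {h: likelihood[h] * prior[h] / z for h in prior}

# Invented example. Hidden state: is a coin fair or biased?
# Observation: 8 heads in 10 flips.
prior = {"fair": 0.9, "biased": 0.1}
likelihood = {
    "fair":   comb(10, 8) * 0.5**8 * 0.5**2,   # p(8 heads | p_heads = 0.5)
    "biased": comb(10, 8) * 0.8**8 * 0.2**2,   # p(8 heads | p_heads = 0.8)
}
post = bayes_posterior(prior, likelihood)
print(post)  # posterior still favors "fair" thanks to the strong prior
```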
Diversity Kills Bayes Law
• Posterior probability p(M|obs) assumes
all observations came from one model.
• e.g. gene prediction: predict “best gene
structure”. Completely ignores
possibility of alternative splicing.
• What if there are multiple, different
models in reality? Observations will
appear contradictory…
• Must treat the world as a hidden mixture
of models: we don't know how many there
are, nor their weights.
When data is Mixed Up…
[Figure: height vs. weight scatter of a pooled sample; no correlation?]
Real Results are Hidden
[Figure: the same data split into baseball players and basketball players; good correlation within each group!]
height
…or Completely Wrong
[Figure: height vs. weight for sumo wrestlers plus basketball players; the overall correlation line runs opposite to the trend within each group.]
Mixture Evidence must be convincing!
[Figure: the same height vs. weight data. Without convincing evidence for a mixture, we could arbitrarily split up the data any way we like, to generate any desired (ridiculous) conclusion!]
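The height/weight example can be reproduced numerically. The data below are invented, but they show the effect described above: positive correlation within each group, negative correlation when the groups are pooled.

```python
# Invented data illustrating a hidden mixture: within each group,
# weight rises with height, but sumo wrestlers are both shorter and
# heavier than basketball players, so pooling reverses the trend.

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

sumo_h,   sumo_w   = [170, 175, 180, 185], [140, 143, 146, 149]
basket_h, basket_w = [190, 195, 200, 205], [85, 88, 91, 94]

print(pearson(sumo_h, sumo_w))        # +1.0 within the sumo group
print(pearson(basket_h, basket_w))    # +1.0 within the basketball group
print(pearson(sumo_h + basket_h,
              sumo_w + basket_w))     # negative when the groups are pooled!
```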
Not just one Genome!
a Hidden Mixture
• Generalize the linear model S to a partial
order graph with hidden edge probabilities r_ij.
[Figure: nodes S1–S11 in a chain, with branch edges such as r67 at a SNP and r69 at a splice.]
Cf. gene prediction: assume only one model is possible (no
alternative splicing), so treat r_ij as binary (0 or 1).
Evidence Confidence
[Figure: a branch point in the alignment graph, one path (…C A G G T C…) taken with probability 1−r, the other (…T A G G C G…) with probability r.]
• Odds ratio that a feature exists:

    log [ p(r > 0 | obs) / p(r = 0 | obs) ]
      = log [ ∫₀¹ p(obs | r) p(r) dr / ( p(obs | r = 0) Pr(r = 0) ) ]

• Does the probability of the observations drop catastrophically
when we eliminate a given model feature (i.e., set its r = 0)?
• That means there are some observations that cannot be
explained well any other way! This is strong evidence.
Evidence – Confidence
[Figure: an edge taken with probability t vs. 1−t.]
p(obs | t > 0) = 10⁻³
p(obs | t = 0) = 10⁻⁷·²
→ LOD VALUE of 4.2
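The LOD arithmetic on this slide is just a base-10 log ratio of the two likelihoods:

```python
import math

# LOD (log-odds) score for the branch feature, using the slide's numbers.
p_obs_feature = 10 ** -3     # p(obs | t > 0): feature included in the model
p_obs_none    = 10 ** -7.2   # p(obs | t = 0): feature eliminated

lod = math.log10(p_obs_feature / p_obs_none)
print(lod)  # 4.2: the observations become ~10^4.2 times less likely
            # when the feature is removed -- strong evidence it is real
```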
Sorting out EST complexity
• Need to allow for real divergences within an
alignment e.g. chimeras, paralogs, alt.
splicing…
• Detect clustering errors by dividing alignment
into groups of divergent sequences (possible
paralogs).
• Apply graph theory (mathematics of
branching structures) to deal with this.
• Developed a new multiple sequence alignment
method to do this: Partial Order Alignment (POA).
• Two cases: with genomic sequence, or from ESTs alone.
Linear Alignment: assumes NO
Structural Divergences
[Figure: linear MSA of EST cluster AA702884 showing a C vs. T polymorphism; a novel SNP, not previously identified.]
Linear MSA handles:
• simple assembly
• substitutions
• simple indels
Major Divergences within an Alignment: “Partial Order”
Branching:
• multiple domains
• chimeric sequences
• paralogous genes
• alternative splicing or polyadenylation
Loops (not simple indels):
• alternative splicing
• paralogous genes
• multiple domains
Find Optimal Traversals
Use graph theory to find the minimum number of traversals needed
to completely encode the alignment.
Completely encoded by a
single traversal
Can only be encoded by
two distinct traversals
Assign each EST to the traversal that encodes it
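The "assign each EST to the traversal that encodes it" step can be sketched with a toy branching alignment. The column labels and the two traversals below are invented; a real partial order alignment graph would supply them.

```python
# Toy partial-order alignment: two traversals through a branch point,
# e.g. an exon-inclusion path vs. an alternative path. Labels invented.
traversals = [
    ["A", "B", "C", "D"],   # traversal 0: goes through column C
    ["A", "B", "E", "D"],   # traversal 1: alternative column E
]

def encodes(traversal, est):
    """True if the EST's columns appear, in order, along the traversal."""
    it = iter(traversal)
    return all(col in it for col in est)   # in-order subsequence test

def assign(est, traversals):
    """Assign an EST to the first traversal that encodes it (None if none)."""
    for i, t in enumerate(traversals):
        if encodes(t, est):
            return i
    return None

print(assign(["B", "C", "D"], traversals))  # 0
print(assign(["A", "E"], traversals))       # 1
print(assign(["C", "E"], traversals))       # None: internally inconsistent EST
```

ESTs that no single traversal can encode are exactly the mutually inconsistent sequences flagged in the Unigene analysis below.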
Separation of Paralogous
Groups
[Figure: “Partial Order structure in Unigene”: histogram of the number of Unigene clusters (y-axis, 0–4500) vs. the number of distinct bundles per cluster (x-axis, 1–10).]
Most Unigene clusters contain mutually inconsistent sequences (e.g.
branching, correlated substitutions suggesting paralogs, etc.)
Pairwise Multiple Sequence
Alignment
O(N2) pairwise distances;
find minimum spanning tree;
iteratively align via shortest edges.
[Figure: minimum spanning tree connecting sequences 1–4 by their shortest pairwise distances.]
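The steps above can be sketched directly: compute pairwise distances, then grow a minimum spanning tree (Prim's algorithm here), aligning along its edges shortest-first. The distance matrix below is invented for illustration.

```python
# Sketch: choose the order of pairwise alignments from a minimum
# spanning tree over the O(N^2) pairwise distances (Prim's algorithm).
# The distance values below are invented.

dist = {
    (1, 2): 0.10, (1, 3): 0.40, (1, 4): 0.50,
    (2, 3): 0.15, (2, 4): 0.45,
    (3, 4): 0.20,
}

def d(a, b):
    """Symmetric lookup into the pairwise distance table."""
    return dist[(min(a, b), max(a, b))]

def mst_edges(nodes):
    """Prim's algorithm: repeatedly add the shortest edge leaving the tree."""
    nodes = list(nodes)
    in_tree = {nodes[0]}
    edges = []
    while len(in_tree) < len(nodes):
        a, b = min(((a, b) for a in in_tree for b in nodes if b not in in_tree),
                   key=lambda e: d(*e))
        edges.append((a, b))
        in_tree.add(b)
    return edges

# Align sequences pairwise along these edges, shortest edges first:
print(mst_edges([1, 2, 3, 4]))  # [(1, 2), (2, 3), (3, 4)]
```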
Multiple Domains
Proteins may share a domain, but differ elsewhere:
one domain may be observed in many proteins, and
one protein may contain many domains.
[Figure: two alternative ways to store the shared domain as a linear MSA.]
In linear MSAs, indel placement is often arbitrary.
This leads to arbitrary differences in gap penalties
charged in later alignments.
PO Alignment of Multi-Domain Sequences
[Figure: partial order alignment of MATK, ABL1, GRB2 and CRKL; shared KINASE, SH2 and SH3 domains appear as common paths through the graph.]
Data Convergence: the
Genome is the Glue
• Biology has become highly specialized,
fragmented: natural result of reductionism.
• But ultimately most activities attach to a gene
or group of genes.
• Because of evolution, genes connect to each
other through orthology (across species) and
paralogy (duplication events).
• Discovery is connecting previously unrelated
facts about phenotypes and causes.
Examples
• Human proteins work in yeast. Use
yeast to figure out protein-protein
interactions…
• Drosophila get high on cocaine. Use
them to study addiction, therapies.
• C. elegans on Prozac...
• Spiders build crazy webs on caffeine...
Expansion of Gene Families
[Figure: expansion of gene families, human vs. mouse.]
Star Topologies:
O(N²) power in O(N) links
[Figure: star graph with genes at the hub, linked to mutants, regulatory sites, phenotypes, alternative splicing, model organisms, activities, ligands, expression, domains, proteins, polymorphisms, screens, development, mapping, disease associations, folds and motifs.]
Genomics & Bioinformatics
• Incredible increase in experimental data
production is making possible entirely
new analyses.
• Bioinformatics required for interpretation
of the meaning of the raw data.
• A lot of discovery is possible, but…
Evidence Matters!
• The genome is a very complex place.
• Every possible trap for naïve
bioinformatics analysis is actually there:
repeats, paralogs, polymorphisms
• Rigorous statistical measurement of our
confidence is the only thing that will
keep us from making silly mistakes.
Sources of Uncertainty
Experimental Factors (type of errors caused: (+) false positive, (−) false negative):
• Chimeric ESTs: (+) in methods that simply compare ESTs.
• Genomic contamination: (+) in methods that don't screen for fully valid splice sites (which requires genomic mapping, intronic sequence).
• EST fragmentation: where ESTs end cannot be treated as significant.
• RT / PCR artifacts: (+) in methods that don't screen for pairs of mutually exclusive splices.
• EST orientation error, uncertainty: (+/−) in methods that don't correct misreported orientation, or don't distinguish overlapping genes on opposite strands.
• Sequencing error: (+/−). Single-pass EST sequencing error can be very high locally (e.g. >10% at the ends). Need chromatograms.
• EST coverage limitations, bias: (−). Most genes have very few ESTs, from even fewer tissues. The main barrier to alternative splice detection.
• Genomic coverage, assembly errors: (−) in methods that map ESTs on the genome. Short contigs may cause >25% false negatives.
Modrek & Lee, Nature Genetics (in press).
Bioinformatics Factors
• Alignment size limitations: (−) in methods that can't align >10², >10³ sequences.
• Alignment degeneracy: (+/−). Alignment of ESTs to genomic sequence is frequently degenerate around splice sites.
• “Pathological” assemblies: (+/−). What should assembly programs do when the assembled reads disagree in regions (e.g. alt. splicing)? Programs vary.
• Non-standard splice sites: (+) in methods that don't fully check splice sites; (−) in methods that do restrict to standard splice sites.
• Arbitrary cutoff thresholds: (+/−) in methods that use cutoffs (e.g. “99% identity”).
• Rigorous measures of evidence: (+/−). How can the strength of experimental evidence for a specific splice form be measured rigorously?
• Mapping ESTs to the genome: (−) in methods that map genomic location for each EST.
• Paralogous genes: (+) in all current methods, but mostly in those that don't map genomic location or don't check all possible locations.
Biological Interpretation Factors
• Defining the coding region: predicting the ORF in novel genes; splicing may change the ORF.
• Predicting impact in the UTR: relatively little has been proved for UTR effects.
• Predicting impact in the protein: motif, signal and domain prediction, and functional effects.
• Assessing and correcting for bias: our genome-wide view of function is under construction; until then, we have unknown selection bias.
• Spliceosome errors: is splicing perfect, i.e. does it only make correct forms?
• What's truly functional? Just because a splice form is real (i.e. present in the cell) doesn't mean it's biologically functional. Conversely, even an mRNA isoform that makes a truncated, inactive protein might be a biologically valid form of functional regulation.