Genomes: The good, the bad and the ugly
The limits of automatic biocuration
Michal Linial
Institute of Life Sciences, The Hebrew University of Jerusalem, Israel
The Sudarsky Center for Computational Biology (SCCB)
The Israel Institute for Advanced Studies (IIAS)
Beijing, China, April 26th, 2015

Genomes: The good, the bad and the ugly
A classical movie (1966). The story, set in the Civil War (1862), traces how 3 men gain information about the location of a buried treasure of gold, and then uncover that treasure.
A community-based expedition
The “compass”: Follow the footsteps of evolution

Biocuration - it is about changing the TOOLS
Compass - Sextant - GPS - WAZE

Where are we going?
• Create a MAP (for proteins)
• Develop NAVIGATION tools (for functions)
• Insights on HIDDEN FUNCTIONS
• The LESSON…

A treasure hunt for hidden functions
What for:
• Maximizing knowledge
• Understanding living systems
• Exposing design principles
Instructions - some guidelines:
• Listen (carefully) to “big data”
• Listen (carefully) to “biological uniqueness”

It is a Tsunami. You better accept it.
Figure: growth in the number of protein sequences (y-axis: 10 M to 70 M)
Biocuration, April 2015: 47 million sequences (after proteome redundancy removal), 1.5 × 10^10 amino acids

It is an unexplored territory. You better accept it.
Figure: the same growth curve (y-axis: 10 M to 70 M), with the 2015 sequences split by evidence level
1: Protein level 0.13%
2: Transcript level 2.05%
3: Inferred homology 20.8%
4: Predicted/Hypothetical 77.0%
Only 1 out of 770 proteins has protein-level evidence

1/770 is a small number
China and Israel: 1/180 in population, 1/460 in area

The goal: Closing the gap
Learn from the experts: Observe the space and classify
Image: Abell 1689, Virgo Cluster, 12 Sept 2013, 17:00. Distance: 2 billion light years
Houston, we have a problem:
• High dimensional data
• What distance?
• Dark matter (hypotheticals)
• No “gold standard”
• The “known” is negligible

The challenge: Sequence based function
Known drawbacks in “search” for protein function (1): dominated by local alignments
Manual vs. Automatic vs. Hybrid: Is it a realistic task?
• Domination by local alignments
• Statistical confidence score (meaningless?)
• Minimal robustness, multi-parameters
• Faulty annotation propagation (1)
• Community based competition, i.e. CAFA (2)
(1) Kaplan N & Linial M (2005) On automatic detection of false annotations. BMC Bioinformatics
(2) Radivojac P et al. (2013) On computational protein function prediction. Nature Methods

Create a MAP - Classifying the space
The Goal: Charting the Protein Space
GUIDING PRINCIPLES
• Use only sequence information
• Treat all proteins equally (>5 million “putative”)
• Use unbiased “automatic” methods
Homology: derived from a common ancestor

ProtoNet as a Family Tree & Map
Guidelines:
• All sequences are included (UniProtKB / UniRef)
• Preprocess: all-against-all BLAST (E-value)
• Include very remote distances, E-value = 100!
• Bottom-up clustering: the ProtoNet tree
• Prune the tree: report on STABLE clusters
N Rappoport, N Linial, M Linial (2013) ProtoNet: charting the expanding universe of protein sequences. Nature Biotechnology

Bottom Up: Agglomerative Clustering
• A clustering algorithm based on pre-calculated local ‘distances’
• Seek appropriate rules according to which two clusters merge
• Test the robustness to data and the sensitivity to ‘merging rules’
• Prune the tree to keep only stable clusters (data driven)

Creating a Map – A Historical View
Timeline: 1995, 2000, 2005, 2010, 2014, 2015
Data sizes along the timeline:
• 37K proteins (SWP), ProtoMap
• 94K proteins (SWP)
• 114K (total 1M, UniProtKB)
• 3M proteins (UniRef90)
• 18M proteins (UniRef50)
• + expansion + Complete Genomes
M. Fromer, N. Linial, O. Sasson, E. Portugaly, N. Rappoport, N.
Kaplan

The More the Merrier: Leveraging huge scale by clustering
Swiss-Prot similarities: weak similarities between remote families
Figure: Swiss-Prot sequences (S) scattered in the sequence space
Adding missing sequences: UniRef90 sample (> 2.5 million)
Figure: the same space, with UniRef90 sequences added around the Swiss-Prot sequences (S)
• Similarities amplified (similar on average), detect consistency
• Signal > noise

Boosting the sampling size
More is better: signal >> noise
Figure: clusters A, B, C

The “bad wolf” is still there (false transitivity)
Figure: A and B linked at E = 1e-42; B and C linked at E = 8e-78
• A and B are similar (homologous)
• B and C are similar (homologous)
• A and C are not similar and not homologous: false transitivity
Mostly due to local similarities in a multi-class scenario
Triangle inequality invalidated: similarities are inherently non-metric

The next merge: take the correct exit…
Figure: candidate merges labelled Avg. E=96, 8e-78, Avg. E=1e-07 and 1e-42

ProtoNet Clusters & Expert Families
Correspondence Score (CS) = Jaccard Score (J): size of the intersection divided by the size of the union
1.0 = perfect, 0 = no match
Figure: Venn diagram of an expert family (CATH, InterPro..) and a ProtoNet cluster, with regions TP, FP, FN and TN
Specificity = TP / (TP + FP)
Sensitivity = TP / (TP + FN)
Jaccard (CS) = TP / (TP + FP + FN)

Quality assessment for the tree
Clusters tested w.r.t. external families: InterPro, Pfam, E.C. (enzymes), SCOP, CATH SF, Gene3D, GO …
• Evolutionary relatedness is captured
• Continuous granularities of superfamilies

Average UPGMA* vs. ‘naïve’ clustering
10,000 Pfam keywords
• PNet-UPGMA: Jw = 0.86
• Single-linkage: Jw = 0.59
*UPGMA - Unweighted Pair Group Method with Arithmetic Mean

Merging rules make a difference
Figure: quality measure w.r.t. Pfam (>10 proteins) for the Geometric and the Arithmetic merging rules

The numbers keep growing…
*Efficient / exact clustering algorithm for (most) sparse data
• Input: similarity data (cluster similarities)
• Clusterer (MC-UPGMA): partial clustering. Problem: too big, so each round yields a (partial) tree
• Merger: edge re-calculation, creating the next round’s input
• Output: complete tree
Loewenstein et al.
(2008) Efficient Algorithms for Exact Hierarchical Clustering of Huge Datasets. Bioinformatics

A critical assessment on 1.8 M sequences
Method | InterPro Families: J, spec., sens. | InterPro Domains: J, spec., sens.
MC-UPGMA | .90, .97, .93 | .74, .90, .80
ProtoNet4 | .80, .94, .83 | .67, .90, .72
Single-linkage | .81, .95, .84 | .57, .88, .62
CluSTr Slim | .28, .93, .29 | .24, .89, .25

A MAP – What for?
• New discoveries: family definition, superfamilies, hidden connections, evolution of genomes…
• Improved data quality: detecting faulty annotations
• New insights: the ‘unexplored’ territory for 3D, target selection

The family MAP / Isolated families
• Family models: no internal distances
• Family clusters: internal distances

Follow the footsteps of evolution
• Identity of >35% in sequence ensures homology
• Identity of 20-35% in sequence: very hard (Twilight)
• Identity of <20%: deep in the noise (Midnight)
Globin-fold proteins:
• Short (~120-160 aa)
• Oxygen transport
• Low sequence similarity (<15%)
• Partition to subfamilies (no FP)
• ~1000 proteins, >50 3D ‘globin-like fold’ structures
• 78% of all potential BLAST pairs have an E-value of 100 or worse
Figure: globin subfamilies, labelled a, b, d, e

Functional Road Map of Enzymatic Activity - UREASE
Figure: ProtoNet subtree with 17 subfamilies:
• Hypothetical: Urease-like
• Urease alpha subunit 3.5.1.5
• Urease beta subunit 3.5.1.5
• Dihydropyrimidinase 3.5.2.2
• D-hydantoinase 3.5.2.2
• Allantoinase 3.5.2.5
• Imidazolonepropionase 3.5.2.7
• N-acyl-D-aspartate deacylase 3.5.1.83
• Adenine deaminase 3.5.4.2
• Hypothetical
• Guanine deaminase 3.5.4.3
• N-acetylglucosamine-6-phosphate deacetylase 3.5.1.25
• Dihydroorotase 3.5.2.3
• Hypothetical
• AMP deaminase 3.5.4.6
• Adenosine deaminase 3.5.4.4
Shachar and Linial (2004), on remote homologues, in Proteins

ProtoNet navigation tool
Overlooked connections: Connect the Dots
Functional connection?
Figure: tree with leaves labelled i-iv, a-d and 1-4; C_B is C_A’s sibling; Cluster A, Cluster B
Beware of false inference…
A and C coincide on a multi-domain protein = false transitivity

Connect the dots: A-B pairs with J_A > 0.6 and J_B > 0.6
• 710 A-B pairs
• 480 (non-coinciding) pairs

Connect the Dots: Passing the structural test
SCOP Superfamily: Aerolisin / ETX pore-forming
• 1preB - Pfam: Aerolysin toxin (PF01117); SCOP (SF & Fold): Aerolisin/ETX pore-forming domain
• 1uyjA - Pfam: ETX_MTX2 (PF03318); SCOP (SF & Fold): Aerolisin/ETX pore-forming domain

A-B: Estimating Safe Predictions
480 safe A-B pairs
Figure legend: Correct A-B; Wrong A-B; Pfam Clans; Unknown A-?; Unknown ?-?

Let the data speak
Figure: distribution of Pfam A-B pairs, J(A) vs. J(B), both axes from 0.6 to 1; categories: No clan, One clan, Correct, Wrong

Functional insight on DUFs
32% are DUFs (Domains of Unknown Function)
Unknown to Known
Figure: DUF1429 and the YabP family, annotated “???” and “Spore surface determinant”; cluster scores J=1.0 (24/0) and J=1.0 (21/0); node labels 47, 20, 2, 22, 3; taxonomic composition 50% in Bacillales / 40% in Clostridia and 40% in Bacillales / 50% in Clostridia
Small packages: 50% of A-B pairs (most top scoring)

A parallel space: Virus world
On viral evolution
Evolutionary forces that are unique to virus ‘biology’:
• Fast recombination (co-infected host cell)
• High mutation rate (specialized polymerase)
• Fast selection (antigenic shift)
• Host co-evolution
Viral proteins are often isolated in the protein family tree
Rappoport, Linial (2012) PLoS Computational Biology

Herpes Virus: No 3D / No external support
• A: PF04541 Herpesvirus virion protein U34
• B: PF05900 Gammaherpesvirus BFRF1 protein
Connect Herpes and Gamma-Herpes (Epstein-Barr virus (EBV) and Kaposi’s sarcoma herpesvirus (KSHV))

Assessment of the hidden connections
Figure legend: Clan-True; Clan-Wrong; Pfam-Obsolete; Name-True; A-B True; A-B Possible; A-B Wrong

Protein sequences come in bulk…
The complete genomes flood: >30 insect genomes!!
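The correspondence scores defined earlier in the talk (specificity, sensitivity, Jaccard) and the J_A > 0.6, J_B > 0.6 rule for A-B pairs can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the tuple layout of a pair and the toy protein identifiers are hypothetical and are not taken from ProtoNet.

```python
def correspondence(cluster, family):
    """Score a cluster against an expert family, using the slide definitions:
    specificity = TP/(TP+FP), sensitivity = TP/(TP+FN), Jaccard = TP/(TP+FP+FN)."""
    cluster, family = set(cluster), set(family)
    tp = len(cluster & family)   # proteins in both the cluster and the family
    fp = len(cluster - family)   # proteins in the cluster only
    fn = len(family - cluster)   # proteins in the family only
    specificity = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return specificity, sensitivity, jaccard


def confident_pairs(pairs, threshold=0.6):
    """Keep (A, B) pairs whose two Jaccard scores both exceed the threshold.
    Each input item is a hypothetical tuple (name_a, name_b, j_a, j_b)."""
    return [(a, b) for a, b, j_a, j_b in pairs if j_a > threshold and j_b > threshold]


# Toy data (made up for illustration only).
cluster = {"p1", "p2", "p3", "p4"}
family = {"p2", "p3", "p4", "p5"}
spec, sens, jac = correspondence(cluster, family)

pairs = [("famA", "famB", 0.9, 0.7), ("famC", "famD", 0.95, 0.5)]
kept = confident_pairs(pairs)
```

In the toy case the cluster and the family share three of five distinct proteins, so the Jaccard score is 3/5 while specificity and sensitivity are both 3/4; the second pair is dropped because one of its scores is below the threshold.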
Gaining insight on the life style of an insect
~1,100,000 species of insects: the most diverged known class
HUGE variation in: morphology, life span, life cycle, development, sex determination, medical impact, genome size, behaviour...
18 complete genomes: insects & crustacean
• Removed redundancy (10 Drosophilae)
• Not yet included: Butterfly and Moth ….
Arthropod and insect genomes: 300 MY of evolution
Figure: proteins per genome, split into Hypotheticals and Others

Ants complete genomes
• Leaf cutter ant, Atta cephalotes
• Fungus-growing ant, Acromyrmex echinatior
• Florida carpenter ant, Camponotus floridanus
• Jerdon's jumping ant, Harpegnathos saltator
• Argentine ant, Linepithema humile
• Red harvester ant, Pogonomyrmex barbatus
• Fire ant, Solenopsis invicta
14,000 species, dominating the biomass of animals

Lesson 1: 300 K proteins, 300 M years
Secured functional inference to 1000s of sequences
Figure: ProtoBug tree from the leaves to the root; axis: ProtoLevel (PL) depth / Life Time; 1398 root superfamilies (SF); 20,134 families; 77,988 unannotated; marked levels PL70 and PL50

ProtoBug: Insights from COMPLETE proteomes - ‘secure’ inference
Pure families w.r.t. Pfam
CS median: 0.932; CS average: 0.888
Figure: specificity (0 to 1) of the Pfam keywords assigned to ProtoBug families (PL70, ≥ 10 proteins), keywords ranked 1 to 3201; annotation: 100% pure, no FP

ProtoBug: Automatic view on expanded & contracted families
Figure: ProtoBug tree from the leaves to the root; axis: ProtoLevel (PL) depth / Life Time
18 representatives, 300,000 proteins, 20,000 families, ~5000 clusters / species
22% - annotation inference
Figure: rows of binary (0/1) vectors
The basis for GAIN and LOSS of a family: vectors of families
Gained and lost families are traceable
Figure: species tree. Hymenoptera: H. saltator, C. floridanus, L. humile, P. barbatus, S. invicta, A. echinatior, A. cephalotes, A. mellifera, N. vitripennis. Diptera: D. virilis, D. melanogaster, A. darlingi, A. gambiae, C. quinquefasciatus, A. aegypti. Others: T. castaneum, P. humanus, D. pulex
Figure: inference at node n from its Parent, Uncle and Sibling; family sets shown: {b,c}, {a,b,c}, {b,c,e}, {a,b,d}, {a,c}

Lesson 2: 1000s of gained and lost families at a (surprisingly) variable rate
Figure: the same species tree, annotated with Gain and Loss counts (values as extracted from the figure, in figure order)
Gain: 628, 973, 1215, 1327, 579, 1441, 2214, 494, 777, 926, 512, 1656, 427, 896, 685, 2329, 674, 4969
Loss: 351, 334, 245, 657, 1568, 1623, 556, 629, 322, 74, 89, 838, 492, 296, 380, 330, 353, 0

Lesson 2: 1000s of gained and lost families at a (surprisingly) variable rate
Turnover Rate (dynamics of functions)
Figure: normalized TOR per species on the same species tree (axis 0 to 800); values as extracted from the figure: 787, 750, 412, 410, 247, 245, 241, 194, 171, 169, 163, 159, 151, 132, 24, 12

Benefit from the ProtoBug MAP
Seek a phylogenetic signal
114 Root SF, at least 200 proteins each, 82.3K proteins
Figure: ProtoBug tree from the leaves to the root; axis: ProtoLevel (PL) depth / Life Time; 1398 root superfamilies (SF); 20,134 families; 77,988 unannotated. Marked levels: PL70
PL50

Lesson 3: Transposition / DNA dynamics functions are enriched in Hymenoptera. A secret for adaptation?

ProtoBug: Navigation tools
1. Cluster ID
2. Cluster name
3. Tree view
4. Cluster summary
5. Species view
6. LT
7. PANDORA
8. Kw statistics
9. Proteins / features

Insights from Genome-Centric Curation
Evolution insight:
• 100s of families / functions were gained / lost
• DNA dynamics is enriched in some taxa
Resource: ProtoBug, a genome view on 18 insect representatives
Rappoport and Linial (2015) ProtoBug… DATABASE (Oxford)

Bio-curators: A tsunami is coming
Insects & crustaceans …. 5000 genomes
The i5k initiative is a transformative project that aims to sequence and analyze the genomes of 5,000 arthropod species.
Hexapoda 702, Chelicerata 64, Crustacea 20, Myriapoda 6

Not all genomes are born equal: the case of Daphnia pulex
~200 Mb, 31 K genes
Hypotheses:
• A phenotypic adaptability for changeable environments?
• A fitness in Daphnia’s toxic aquatic environment?
Colbourne et al. (2011) Science

Proteins arrive in bulk (a new genome..)
31,000 genes, 10,000 are unknown: ×2 w.r.t. insects…

Expansion of Daphnia’s paralogs (D. pulex)
• Extreme number of paralogs
• Many clusters have >10 paralogs
• Low divergence: similar functions…

Divergence among Daphnia’s paralogs
Figure: number of clusters vs. Tree Score (TS), from high divergence to high similarity, for Daphnia (≥ 10 paralogs) and Drosophila (≥ 10 paralogs)
Expansion in signaling cascade

Some General Conclusions
• The MORE, the Merrier.
Improving S/N
• LET the DATA lead you
• SCALABILITY is critical
• CONNECTED MAP is a key for DISCOVERY
• NAVIGATION TOOLS are important
Fascinating Biology is always in front of you

Hidden functions
New Genomes, New functions
• Having function: experiments, literature, expert view
• Inferring function: models, maps, predictions
• ‘Maybe’: borderline similarity, only part of the protein, conflicting experiments / literature
• ‘Wrong’: faulty annotation, wrong inference
• No function: no similarity, no evidence

Hidden niches: Unique “life style”
Hypothesis: Unexplored world of functions
• Short peptides
• Signaling molecules
• Viruses & pathogens
• Metagenomics
Innovation, novelty, evolutionary games

Short proteins: Hidden niche
Hypothesis: Short peptides have hidden functions
Figure: fraction of uncharacterized proteins vs. protein length; uncharacterized: average 24.2%
Short (<100 aa, NOT fragments): SwissProt 9.7%, TrEMBL 8.0%

Short Proteins: When sequence similarity fails
• Genomics:
– Similarity-based searches: short proteins give non-significant results (weak signal)
– NGS: few reads
• Proteomics:
– “Global” MS experiments work best for long sequences: low coverage, low DB scores ..
– Missed spectra (modification?)
– DB search: if the proteins do not exist in the database, they cannot be found.
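The selection behind the short-protein statistics above (shorter than 100 aa, fragments excluded) amounts to a simple filter. The sketch below is illustrative only: the `Protein` record, its `is_fragment` flag and the toy entries are hypothetical, and how a fragment is flagged depends on the source database.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    accession: str
    seq: str
    is_fragment: bool  # hypothetical flag, e.g. parsed from a database annotation


def short_proteins(proteins, max_len=100):
    """Return the proteins shorter than max_len residues that are not fragments."""
    return [p for p in proteins if len(p.seq) < max_len and not p.is_fragment]


# Toy entries (made-up sequences, for illustration only).
entries = [
    Protein("X1", "MKCCLLAVFLCG" * 5, False),   # 60 aa, complete: kept
    Protein("X2", "MKCCLLAVFLCG" * 5, True),    # 60 aa, fragment: dropped
    Protein("X3", "MKCCLLAVFLCG" * 20, False),  # 240 aa: dropped
]
selected = [p.accession for p in short_proteins(entries)]
```

Excluding fragments matters here because a truncated entry of a long protein would otherwise be counted as a short protein.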
Many animal toxins are short
Eukaryotic short proteins (<100 aa, no fragments, SWP; 15,224 proteins)
Property | Amount | P-value
Toxin | 2421/3342 | 0
Neurotoxin | 1468/1505 | 0
Ion channel inhibitor | 270/304 | 1E-247

Why search for toxin-like sequences?
• Secrets of evolution
• Unexplored niche
• Unexpected surprises
• Peptide therapy
Source | Name | Action | Application
Cone snail | ω-conotoxin | Ca2+ channel inhibitor | Chronic pain
Cone snail | μ-conotoxin | Na+ channel inhibitor | Epilepsy, pain, arrhythmias
Scorpion, Sea anemone | margatoxin, ShK | K+ channel inhibitor | Immunosuppressant, Multiple Sclerosis
Cone snail | α-conotoxins | nAChR inhibitor | Pain
Cone snail | Conantokins | NMDA inhibitor | Pain

ICI in sporadic metazoa
Figure: taxonomic tree. Eukaryota, Bacteria, Archaea; within Eukaryota, Metazoa (animals): Cnidaria (Sea anemone) and Bilateria: Arthropods (Spiders, Scorpions, Insects), Molluscs (Cone snails), Vertebrata (Mammals, Reptiles)

Breaking the rules from the 3D perspective
• MANY folds - ONE target: K+
• MANY targets - ONE fold: Ca2+, Na+, Cl-, K+
No simple relation between the folds and the targets of the ICIs

Ion channel inhibitor (ICI) toxins
Weak (no) signal in sequence:
• No sequence consensus
• No phylogenetic tree
• No 3D fold specificity
Figure: Conotoxin, 112 seed proteins

What’s common among ICIs?
• All EXTRACELLULAR expression
• SHORT & COMPACT
• Act by blocking a wide range of RECEPTORS & CHANNELS
• STABLE protein (stored in the venom gland, often modified)
• SPECIFIC, HIGH AFFINITY

Alternative “curation” tools: A search for features
A collection of 600 relevant features
• Amino acid frequency (20)
• Amino acid pair frequency (20^2 = 400)
• Length (1)
• Hydrophobic binary pattern (2^5 = 32)
• Charged entropy
• Amino acid entropy
• Cysteine binary pattern (2^5 = 32)
• Amino acid “center of mass” (40)
• More…
The Key: Design features that capture your biological intuition
Noam Kaplan

From sequence to a prediction machine
Short proteins, supervised learning, a classifier machine
Predicting a label for each protein sequence: a CONFIDENCE SCORE for being an ICI

ClanTox: the “Classifier Machine”
Applied to SWP short proteins
3-fold cross validation results: AUC mean 0.9934, sd 0.0026
Kaplan, Morpurgo, Linial (2007) J. Mol. Biol.
Naamati, Askenazi, Linial (2010) Bioinformatics

TOLIP = toxin-like protein
TOLIPs discovery platform
Figure: discovery pipeline with the elements: Input (Genome, Proteome); Short proteins; TOLIPs Predictor: ClanTox (P1, P2, P3), with one Positive Set and several Negative Sets; Candidate TOLIPs; Annotation; Functional TOLIPs

ClanTox: a “Classifier Machine”
Guy Naamati
www.clantox.cs.huji.ac.il

Where should we search: Unexplored Proteomes
• INSECT: Honey Bee (10K proteins). NEW toxin-like (TOLIP): OCLP1. Not in the venom gland, similar to ω-conotoxin
• AMPHIBIA (121K, 2.4K short proteins). Huge expansion in Amphibia

When needed… D. pulex TOLIPs
• 31K ORFs, 11% < 100 aa
• ClanTox identified 6 top-score predictions
Figure labels: Signal peptide; Metallothionein

Daphnia pulex and Arthropods
TOLIP paralogs - A cassette for protection (local duplication)
Metallothionein (MT) - Cys-rich short proteins. Localized to the membrane of the Golgi apparatus.
MTs bind to metal ions (Zn, Cu, Se, Ag, Cd, As, Hg)

TOLIPs and New functions: Brain & Immune system
The playground of creativity and innovation
Toxin-like sequences without a venom??
Some (wild) thoughts as a Biocurator…
1. Short proteins: EASY to create (evolution)
2. The 3D constraints are simple: analogy to A CORK

Mouse TOLIPs - RNA-Seq data
• Defensin
• Testis expressed
• Cancer related
• Immunological signaling
• Developmental cues
Figure labels: reproductive; HIGHLY RESPONSIVE; Immunity

Beetle, Spider and Human: the use of a successful scaffold
• Rodent-Primate signaling molecules
• Regulator of GPCR
• Regulators of energy consumption
ASIP (Agouti Signaling), AgRP (Agouti Related); Mammals / Primate

A statement from EVOLUTION: “If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck”
ANLPs: Shared 3D, but minimal similarity
Yitzhak Tirosh

Treasure hunt: An overlooked locus
1.1 Mb, Chr 9q4A, 19 mouse genes - testis related
19 TOLIPs in 1.1M bp; unknown function; testis expression
ANLP: Lynx1-2, SLURP1-2
• UPAR/LY6: 254 (M=60; H=40)
• Activin receptor: 203 (M=29; H=22)
• BAMBI: 16 (M=2; H=3)
• PLA2 inh: 45 (M=2; H=0)
• Toxin: 375 (M=0; H=0)

TOLIPs: “the power of many”
WAP proteins (e.g., Elafin), Chr 2, 0.55 Mb
• Serine-proteinase inhibitors
• Antimicrobial and anti-inflammatory activity
• Amplified in cancers
• Carcinogenesis? Tumor progression?

Yet another TOLIP cluster
Ly6 gene expansion: mouse Chr 15 & human Chr 8
Cell identity code?
B cells, T cells, NK cells, Monocytes, Neutrophils, Dendritic cells: all defense lines of the cellular immune system
J Leukoc Biol. 2013

ANLP-LY6-Toxin functions in the clinic
• Skin disease (psoriasis)
• Pain control ‘drug’
• Immune system (sorting)
• Alzheimer (soluble marker)

TOLIPs in Metazoa
~100 TOLIPs, real toxins | ~100 TOLIPs, no toxins
Target receptors | Target receptors
Very short (<5 kDa) | Short (~10 kDa)
Simple modifications | Rich modifications
Chr. amplification | Chr. amplification
Irreversible action | Reversible action

Feature based curation
Neuropeptides - Powerful modulators
Dan Ofer
neuropid.cs.huji.ac.il
Ofer D. and Linial M. (2013) Bioinformatics
Ofer D. et al. (2014) Nucl. Acids Res.

Closing remarks
• Maps: the way to follow the footsteps of evolution
• Innovators and replicators work hand in hand
• Plasticity / adaptation calls for ‘innovation’
• Unification in function: a common phenomenon
• BIOCURATION is essential (sequence & features)

My “take home message” for knowledge sharing
• Develop methodologies, tools and resources
• Share ideas and share data
• Keep crossing from Data to Biology & back
Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime
10-14 July 2015
Thank You