Functional curation of sequence data for RefSeq
Kim D. Pruitt
8th International Biocuration Conference, April 24, 2015
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, DHHS, USA

What is RefSeq?
An NCBI project to provide reference sequence standards that incorporate current knowledge for genomes, transcripts, and proteins.
Vertebrates:
• Genomes: 169
• Genes: 4 million
• Transcripts: 5.6 million
• Proteins: 4.9 million
Most curated organisms: human, mouse, rat, zebrafish, cow, chicken

RefSeq curation focus for vertebrate genomes
• Gene type
  - Protein-coding
  - ncRNAs
  - Pseudogene
  - Unknown (???)
• Gene location/length
• Transcript variants
• Names
• Publications
• Functional annotation
  - Functional regions of sequence
  - Gene summary
  - References and GeneRIFs (Gene References Into Function)
RefSeq: www.ncbi.nlm.nih.gov/refseq/
Gene: www.ncbi.nlm.nih.gov/gene/

RefSeq's unique contribution
• Correct transcript/protein sequence
• Clear data source & evidence
• Connect DNA<>RNA<>Protein
• Connect sequence regions to function, for both transcripts and proteins
Example record: NM_001033952.2

Curation Examples (2014)
• Antimicrobial peptides (human genes)
• Histones (human and mouse genes)
• Regulatory upstream open reading frames (uORFs; human, mouse and rat genes)

Antimicrobial peptides (AMPs)
• Short peptides (typically 15-100 aa)
• Found in most multicellular organisms
• Not highly conserved at sequence level
• Cationic residues bind with negatively-charged bacterial membranes
A new approach is needed to fight antibiotic-resistant bacterial infections.
Advantages of AMPs:
• Bacteria generally cannot become resistant to AMPs
• Low toxicity
• Kill pathogens efficiently

Curating human AMPs
• PubMed + AMP databases
• AMP sequence
• BLASTP
• RefSeq & Gene
• Manual curation
  - Annotate AMP peptide
  - RefSeq attribute
  - RefSeq Summary
  - Store GeneRIF
191 proteins, 139 genes
Query Nucleotide or Protein: refseq[filter] AND "protein has antimicrobial activity"[properties]

Human calcitonin-related polypeptide alpha
• CALCA encodes three peptide hormones:
  - Calcitonin: calcium & phosphorus regulation
  - Calcitonin gene-related peptide (CGRP): AMP and vasodilator
  - Katacalcin: calcium regulation
CALCA, NCBI Gene ID: 796
PubMed: 18603306

Human calcitonin-related polypeptide alpha
NCBI annotation updated March 2015
[Screen shot from NCBI Gene, CALCA, Gene ID: 796. Labeled peptides: Calcitonin, Katacalcin, Calcitonin gene-related peptide (AMP). RefSeq NP_001029125.1]

Examples of Curation
• Antimicrobial peptides
• Histones
• Regulatory upstream open reading frames (uORFs)

Replication-dependent Histones
• Most eukaryotic mRNAs have poly(A) tails
  - Precursor RNA undergoes endonucleolytic cleavage
  - Poly(A) addition
• Replication-independent histones have a poly(A) tail
• Replication-dependent histones do not have poly(A) tails
  - Precursor RNA undergoes endonucleolytic cleavage between a conserved hairpin and a purine-rich histone downstream element (HDE)
• Expression is cell cycle regulated (G1/S phase)

Replication-dependent Histones
Precursor transcripts are processed at an endonucleolytic cleavage site located between a conserved 16 nucleotide stem-loop structure and a purine-rich histone downstream element (HDE).
Curation (human & mouse genes):
• Confirm no polyA
• Confirm 3' UTR end
• Annotate stem loop
Results:
• 133 Transcript records
• 133 Genes
[Figure: polyadenylated mRNA schematic, 5' UTR-CDS-3' UTR…AAA, and stem-loop diagram with loop bases T T C T]
Dominski & Marzluff (2007) Gene.
PMID: 17531405
[Figure, continued: paired stem arms T C C C G G and A G G G C C; red = ultra-conserved bases; precursor RNA cleavage site after ACNNN; histone downstream element (HDE) toward the 3' end]
NCBI Reference Sequence: NM_003539.3 (shown in Graphics Format and GenBank Format)
Query Nucleotide: refseq[filter] AND stem_loop[feature key] AND histone[title]

Examples of Curation
• Antimicrobial peptides
• Histones
• Regulatory upstream open reading frames (uORFs)

Regulatory uORFs
• Found in about 40% of all mRNAs
• Thought to regulate translation of the primary open reading frame (pORF)
• The uORF competes for ribosomes, down-regulating translation from the pORF
• Translation of the pORF may rely on leaky scanning or reinitiation events

Human major vault protein (MVP)
Graphical display of MVP gene annotation on human chromosome 16
NC_000016.10, Reference Assembly GRCh38.p2, GCF_000001405.28
Legend:
* uORF is found in one alternate 5' UTR (NM_017458.3)
# 5' UTR is partial
PMID: 11297743

MVP GeneID: 9961
• NM_017458.3, variant 1 (uORF start codon): Exon 1 3' end: AT; Exon 2 5' end: G
• NM_005115.4, variant 2: Exon 1 3' end: AT; Exon 2 5' end: T

RefSeq transcript record (MVP variant 1) NM_017458.3
[Screen shot with labeled regions: Review status, Data source, Functional Summary, Transcript Variant Description, RefSeq Attributes, Support evidence]
Query Nucleotide: refseq[filter] AND "regulatory uORF"[properties]

GenBank format (NM_017458.3 MVP variant 1)
Curation:
• Annotate uORF region
• Add RefSeq attribute
• Add publications
Results: uORF
• 260 Transcript records
• 150 Genes
• Human > Mouse > Rat

Acknowledgements
RefSeq Vertebrate Curation Group, especially:
• Mike Murphy (AMPs)
• Lillian Riddick (AMPs, Histones)
• Wendy Wu (uORFs)
DB &
programming: Terence Murphy, Mike DiCuccio, Andrei Shkeda
NCBI Leadership:
• David Lipman
• James Ostell
Eric Cox, Catherine Farrell, Tamara Goldfarb, Tripti Gupta, Vinita Joardar, Vamsi Kodali, Kelly McGarvey, Nuala O'Leary, Shashi Pujar, Bhanu Rajput, Sanjida Rangwala, Dave Webb, Matt Wright
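
The three Entrez queries quoted on the query slides (AMPs, histone stem loops, regulatory uORFs) can also be issued programmatically. The sketch below only builds NCBI E-utilities ESearch URLs for them, written in the standard Entrez `term[field]` syntax; it sends no request. The database choices (`nuccore` for Nucleotide, `protein` for Protein), the `retmax` value, and the helper name `build_esearch_url` are assumptions made for illustration. Result counts returned today will likely differ from the 2015 figures on the slides.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query strings as quoted on the slides; database names are assumptions.
QUERIES = {
    "amp": ("protein",
            'refseq[filter] AND "protein has antimicrobial activity"[properties]'),
    "histone_stem_loop": ("nuccore",
            "refseq[filter] AND stem_loop[feature key] AND histone[title]"),
    "regulatory_uorf": ("nuccore",
            'refseq[filter] AND "regulatory uORF"[properties]'),
}


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL for `term` in Entrez database `db` (hypothetical helper)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    for name, (db, term) in QUERIES.items():
        print(name, build_esearch_url(db, term))
```

Keeping URL construction separate from any network call makes the query strings easy to inspect before they are sent, and lets the same queries be pasted into the Entrez web interface unchanged.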
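
The histone curation step "Annotate stem loop" relies on recognizing a 16 nucleotide hairpin near the 3' end of the transcript. As a rough illustration, the sketch below tests whether a 16-nt window can fold into a hairpin. It assumes a 6-bp stem and a 4-nt loop (which adds up to 16 nt), considers Watson-Crick pairs only, and ignores the ultra-conserved base positions. The function names and the test sequence are hypothetical and are not taken from a RefSeq record or from the curation pipeline.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_stem_loop(seq, stem_len=6, loop_len=4):
    """True if `seq` is exactly stem + loop + stem with a fully paired stem.

    Assumption: Watson-Crick pairs only; G-U wobble pairs are not allowed.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA spelling
    if len(seq) != 2 * stem_len + loop_len:
        return False
    if any(base not in COMPLEMENT for base in seq):
        return False
    five_prime_arm = seq[:stem_len]
    three_prime_arm = seq[-stem_len:]
    return five_prime_arm == reverse_complement(three_prime_arm)


def find_stem_loops(utr, stem_len=6, loop_len=4):
    """Return 0-based start offsets of every candidate hairpin in `utr`."""
    size = 2 * stem_len + loop_len
    return [i for i in range(len(utr) - size + 1)
            if is_stem_loop(utr[i:i + size], stem_len, loop_len)]


if __name__ == "__main__":
    # Hypothetical 3' UTR fragment, for illustration only.
    fragment = "AAAA" + "GGCCCTTTTCAGGGCC" + "ACCCA"
    print(find_stem_loops(fragment))
```

A check like this only flags candidates; confirming the 3' UTR end and the absence of a poly(A) tail still requires the transcript evidence described on the slide.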
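
The MVP slide shows a uORF start codon that exists only when exon 1 (ending in AT) is spliced to an exon 2 beginning with G, as in variant 1; in variant 2 the junction reads ATT instead. The sketch below checks for a start codon spanning a splice junction and, separately, scans a 5' UTR for ATG-initiated ORFs that end at an in-frame stop codon within the UTR. It is a simplified illustration: function names and the example UTR are hypothetical, only ATG starts are considered, and uORFs that run past the end of the UTR are not reported.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def junction_creates_start(exon1_end, exon2_start):
    """True if joining the two exon ends creates an ATG that spans the junction."""
    joined = (exon1_end + exon2_start).upper()
    junction = len(exon1_end)
    # A codon spans the junction if it starts 1 or 2 bases before it.
    for start in (junction - 2, junction - 1):
        if start >= 0 and joined[start:start + 3] == START_CODON:
            return True
    return False


def find_uorfs(five_prime_utr):
    """Return (start, end) 0-based half-open spans of uORFs fully inside the UTR."""
    utr = five_prime_utr.upper()
    spans = []
    for start in range(len(utr) - 2):
        if utr[start:start + 3] != START_CODON:
            continue
        # Walk codon by codon until the first in-frame stop.
        for pos in range(start + 3, len(utr) - 2, 3):
            if utr[pos:pos + 3] in STOP_CODONS:
                spans.append((start, pos + 3))
                break
    return spans


if __name__ == "__main__":
    print(junction_creates_start("AT", "G"))  # exon ends from variant 1
    print(junction_creates_start("AT", "T"))  # exon ends from variant 2
    print(find_uorfs("CCATGAAATAGCC"))        # hypothetical 5' UTR
```

Finding an ORF this way says nothing about whether it is regulatory; on the slides that attribute is assigned by curators and backed by publications.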