Curating sequence and literature data for RefSeq and Gene
Kim D. Pruitt
8th International Biocuration Conference, Training workshop, April 23, 2015
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, DHHS, USA

Outline
• RefSeq overview
  • What is RefSeq?
  • How does it compare to GenBank?
  • What are the advantages?
  • How is the dataset built?
• Curated data
• Sequence analysis
• Curation in-depth – examples
• Data access

What is RefSeq?
An NCBI project to provide reference sequence standards that incorporate current knowledge for genomes, transcripts, and proteins.

              Vertebrates   Eukaryotes    Prokaryotes   Virus
Genomes       169           503           31,000        4,538
Genes         4 million     9.2 million   2 million     200,000
Transcripts   5.6 million   11 million    20,000        na
Proteins      10 million    38 million    214,287       4.9 million

Counts taken in early March 2015

RefSeq versus GenBank
• Is archival (member of INSDC)
  • GenBank: Yes
  • RefSeq: No
• Source of sequence
  • GenBank: Submitter
  • RefSeq: GenBank (INSDC)
• Source of annotation
  • GenBank: Submitter
  • RefSeq: GenBank, Collaboration, Literature, Curation, Computation
• Genome is always annotated
  • GenBank: No
  • RefSeq: Yes for archaea, bacteria, eukaryotes
• 'Owner' of sequence records and annotation
  • GenBank: Submitter
  • RefSeq: NCBI
• NCBI staff can update based on user requests
  • GenBank: Submitter must authorize
  • RefSeq: RefSeq may drop contamination; RefSeq may add transcript/protein/pseudogene based on data analysis and curation; RefSeq may update annotation
• Annotation may be curated by NCBI staff
  • GenBank: No
  • RefSeq: Yes

15 years of building RefSeq
www.ncbi.nlm.nih.gov/refseq/
Advantages:
• Consistency
• Non-redundant
• Use current names
• Expanded feature annotation
• Connected to Gene information
Products & Access:
• Annotated genomes, transcripts, proteins
• Gene, BLAST, FTP, programming API
Curation:
• Correct errors
• Add new records
• Add functional information
• Connect sequence to function
• Gene & protein names
• Functional sequence elements
Curation focus: Human, Mouse, Rat, Zebrafish, Cow, Chicken

RefSeq's unique contribution for vertebrates
• Correct transcript/protein sequence even if genome is incomplete/wrong
• Clear information on data source & evidence (NM_001033952.2)
• Connect DNA<>RNA<>Protein
• Connect sequence regions to function - for both transcripts and proteins

RefSeq Genomes in a Nutshell
[Diagram. Labels: Submitter; Sequence; Assembly; (Annotate); GenBank/INSDC; Genome; Protein; SRA (reads); Assembly; BioSample; BioProject; BLAST; FTP; Web; eUtils; Submit Data; Submissions; Resources; RefSeq; Gene; BLAST; Genome; Tracks; FTP; Reports; Assembly; HomoloGene; Nucleotide; RefSeq Creation; Annotation Pipeline; RefSeq Curation; Collaboration; Sequence; Meta-data; Access; RefSeq Process Flows]

RefSeq genomes: Leveraging computation & curation
www.ncbi.nlm.nih.gov/genome/annotation_euk/process/
[Diagram. Labels: International CCDS Collaboration; UniProtKB/SwissProt; Model Organism Databases; Nomenclature Groups; miRBase; Genome Reference Consortium (GRC); User Feedback!; Annotation Pipeline (Align: RefSeq cDNAs, Proteins, RNA-Seq; Filter: Best hits; Interpret: Build models; Call orthologs: vs. human; Quality Checks; Assign GeneID; Assign Accession; Public release); Curation (Literature Review; Sequence Analysis); Curated RefSeqs; Model RefSeqs; RefSeqs; Genes; Iterative process; Gene; FTP; Nucleotide; Protein]

Annotation - a conservative approach
Annotate every exon that is observed once? X
1. STAG3L5P-PVRIG2P-PILRB readthrough
2. stromal antigen 3-like 5 pseudogene
3. poliovirus receptor related immunoglobulin domain pseudogene
4. paired immunoglobin-like type 2 receptor beta (regulation of inflammatory responses)
Consolidate information to represent supported genes and transcripts!

Annotation pipeline results in NCBI Gene
Access genome annotation information including RNA-Seq tracks
Rabbit - GeneID:103352519 - Assembly: OryCun2.0
[Screenshot. Labels: Configure; Model RefSeqs; Not annotated in Ensembl 76; Ensembl track; RNA-Seq tracks; Interpreted introns; Curated Track names; Exon coverage; Log2 scale graphs]

How to identify a RefSeq sequence record
Keyword:
• RefSeq
Accession format:
• Two alpha + _ + 6-9 digits
• or: Two alpha + _ + GenBank accession
RefSeq categories (transcripts & proteins):
• Known RefSeq
  • Subject to curation
  • Accession prefix N*_
• Model RefSeq
  • Evidence-based predictions
  • Accession prefix X*_
www.ncbi.nlm.nih.gov/nucleotide/NM_002197.2

Outline
• RefSeq overview
• Curated data
  • Genes
  • Sequence
  • Publications
  • Imported data
• Sequence analysis
• Curation in-depth – examples
• Data access

BULK PROCESSES
• Import
  • Add data from collaborators
• Update DB
  • Add, update, remove accessions to match GenBank
• QA
  • Identify data conflicts for curator review
CURATION
• Review data
  • Gene information
  • Gene-2-sequence associations
  • Publications
  • Data from collaborators
• Resolve Errors
  • Remove wrong name synonyms, publications
  • Fix sequence associations
  • Update gene type
  • Correct collaborator Gene: NCBI Gene associations
• Add data
  • Create RefSeq records
  • RefSeq Attributes & Summary
  • Transcript variant description
  • Alternate names, publications

Vertebrate transcripts
How do we curate?
• Collaborations
  • Nomenclature, MODs, UniProt, Genome Reference Consortium, individual scientists
• In-depth sequence analysis
  • Genome, transcript and protein sequence
  • Alignments
  • RNA-Seq
  • QA tests
  • Epigenomics
  • Clinical variants
• Literature review
[Diagram. Labels: Collaboration; Sequence Analysis; Validation; Literature; Guidelines; Curation; mRNA, ncRNA, protein, and pseudogene records; Genome Annotation; WWW – FTP – BLAST]

Tracking data & curation consistency
Data management / Curation management
• Specifications for the product
• Standard operating procedures
• Relational database to track data and curation decisions over time
• Process flows
• Curation decision trees
  • ncRNA <> pseudo <> protein-coding?
  • 5' complete transcript <> partial?
• Data validation
• Sequence analysis tools and CGIs
• Disaster recovery/backup
• Support collaborations
• Public access

What do we curate?
• Genes:
  • Type, location, length
  • Names, Summary
  • Publications
  • Gene-2-accession bins
  • Imported data
• Sequence:
  • Accuracy, length
  • Alternate splice products
  • Sequence features
  • Functional regions
[Figure labels: Protein-coding; ncRNAs; Pseudogene; Unknown ???]
RefSeq: www.ncbi.nlm.nih.gov/refseq/
Gene: www.ncbi.nlm.nih.gov/gene/

Curating Literature
• Curation Review for Genes
  • Move to correct gene
  • Add functional citations
  • Mark to include on RefSeq
  • GeneRIF submissions from public
  • Add RefSeq attribute and citation
• Most publications are added from:
  • National Library of Medicine MeSH indexing service
  • Sequence records
  • Nomenclature groups, MODs, GO, OMIM, GWAS catalog, more…

GeneRIFs – an annotated bibliography
RefSeq curators review GeneRIF submissions from individuals to correct spelling, check the gene association, and remove irrelevant submissions.
http://www.ncbi.nlm.nih.gov/gene/10309

Curation supports data import processes
[Diagram. Data sources: HGNC; Pseudogene.org; MGD; RGD; OMIM; ZFIN; XenBase; QTL db; CGNC; miRBase. Transfer: FTP/API. Generic Processing Dataflow: Compare to known data; Update if OK; Report for curation if conflicts found. Target: Gene Backend Database]

Curating data import errors
• Manually add or update some data
  • HGNC may have: HGNC ID 1 = genome location 'x' = ENSG ID 1
  • Processing can't identify the corresponding GeneID
  • Curator reviews the genomic location and either updates or creates a Gene record.
• Coordinate with data sources to reconcile data association conflicts between sites
  • NCBI may have: Gene ID 1 = HGNC ID 1 = Accession 123
  • HGNC may have: HGNC ID 1 = Gene ID 1 = Accession 234
  • NCBI may have: Accession 234 = GeneID 2 = HGNC ID 2 (a paralog)

Outline
• RefSeq overview
• Curated data
• Sequence analysis
  • Tools
  • Quality assurance checks
• Curation in-depth - examples
• Data access

Quick access to stored BLAST results
In-house:
• Set of BLAST searches per accession
• Results are stored for 3 months
• View hits in NCBI's genome browser
[Screenshot. Labels: Gene back-end curation database; Quick access to results; UniVec; EST; NR; Genome; Blastn; Blastx; blastp]

Sequence and alignment analysis using NCBI's Genome Workbench
An application for viewing and analyzing sequence data from NCBI databases, or upload your data for analysis
• Compiled for several operating systems
• Analysis: BLAST and more
• Supports many display options
  • graphical
  • alignments
  • dot plot
  • phylogenetic trees
  • more
www.ncbi.nlm.nih.gov/tools/gbench/

General layout
• Data display area
• Project Tree shows loaded data
• Search for features, search the sequence, search for open reading frames
• Monitor the progress of analysis tasks

Multi-pane cross alignment view
[Screenshot. Labels: Turkey_2.01 Chromosome 1; Turkey_5.0 Chromosome 1]

Search
• Load a set of protein accession.version numbers
• Select accessions to include in your analysis
• Select the analysis option from the Tool menu
• Display the phylogenetic tree calculated from selected CELF proteins.

Genome Workbench - Multiple protein alignment display
Curation use:
- Orthology review
- Gene type review
- Sequence conservation

RADAR – a Genome Workbench plug-in for RefSeq Curation
RefSeq Analysis, Display, and Recommendation
Displays information on:
• Genomic region, gene annotation
• RNA-seq called introns
• CpG Islands, Repeats, variation, more
• QA results for newly built RefSeq
• Aligned RefSeqs, cDNAs, ESTs
  • Coding sequence region (green)
  • Strain data
  • Clone library
  • Stored in DB with quality concern (D)
  • Multiple alignments to the genome (M)
  • Consensus splice sites ('a', 'd')
  • Mismatches
  • Indels
  • Unaligned ends (not shown)
[Screenshot. Labels: New RefSeq; Strain; QA; Library]

RADAR
• Functions
  • RNAseq supported intron
  • ORF finder
  • Signal peptides
  • Transmembrane regions
  • Compare/diff transcripts
  • Find similar transcripts
  • Integrated QA tests
  • View nucleotide
  • View translation
  • Links to web for details

PROCESS
• Import
  • Add data from collaborators
• Update DB
  • Add, update, remove accessions to match GenBank
• QA
  • Identify data conflicts for curator review
CURATION
• Review data
  • Gene information
  • Gene-2-sequence associations
  • Publications
  • Data from collaborators
• Resolve Errors
  • Remove wrong name synonyms, publications
  • Fix sequence associations
  • Update gene type
  • Correct collaborator Gene: NCBI Gene associations
• Add data
  • Create RefSeq records
  • RefSeq Attributes & Summary
  • Transcript variant description
  • Alternate names, publications and GeneRIF

Quality assurance tests
Transcript tests – protein tests – genome tests – alignment tests
[Screenshot. Labels: Sequence tested; Results over time; Results summary; Details (not shown)]
Tests are available in the NCBI C++ toolkit – http://www.ncbi.nlm.nih.gov/toolkit/

Outline
• RefSeq overview
• Curated data
• Sequence analysis
• Curation in-depth – examples
  • Work flow
  • Making decisions
  • Working with collaborators
  • RefSeq curated data is in Gene
  • Annotating RefSeq records
• Data access

General process flow for manual transcript-based curation
• Identify quality full-length cDNAs or ESTs
• Identify splice variants and assess their protein-coding capacity
• Extend 5' and 3' ends using all aligning transcript data
• Determine the supported complete CDS
  • Protein-coding variant that encodes an alternate C-terminus
  • Non-coding variant that is subject to nonsense-mediated decay (NMD)
[Figure. Labels: gt/ag splice sites; poly(A) tails (AAAAAA); Representative RefSeqs: NMs, NR]

Transcript-based curation process
Example: Human DNAJC22 gene (Gene ID: 79962) - RefSeqs are constructed using RADAR
NCBI RADAR: NC_000012.12 Chromosome 12 GRCh38.p2 (similar to UCSC hg38)
• Curated NMs are based on full-length transcripts
• UTRs are extended
• Model XMs are created computationally based on transcript and RNA-seq data and often lack full-length support.
[Screenshot. Labels: RNA-seq alignments; Chr 12; Known; Model; Aligned cDNAs]
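
The accession conventions used in the examples above (two letters, an underscore, 6-9 digits; Known RefSeqs with prefix N*_, Model RefSeqs with prefix X*_) can be sketched as a small classifier. This is an illustrative sketch only, not an NCBI tool: the function name `classify_refseq` and the returned dictionary are hypothetical, the optional `.version` suffix is taken from accessions shown in this talk (for example NM_002197.2), and the alternative "two alpha + _ + GenBank accession" form is deliberately not handled.

```python
import re

# Two uppercase letters, underscore, 6-9 digits, optional ".version".
# The "two alpha + _ + GenBank accession" form is NOT covered by this pattern.
REFSEQ_RE = re.compile(r"^([A-Z]{2})_(\d{6,9})(?:\.(\d+))?$")

# Transcript and protein prefixes only; the Known/Model categories on the
# slide apply to transcripts and proteins, so genomic prefixes fall in "other".
KNOWN_PREFIXES = {"NM", "NR", "NP"}
MODEL_PREFIXES = {"XM", "XR", "XP"}


def classify_refseq(accession):
    """Return a dict describing a RefSeq-style accession, or None if the
    string does not match the simple two-letter + digits pattern."""
    match = REFSEQ_RE.match(accession.strip())
    if match is None:
        return None
    prefix, _digits, version = match.groups()
    if prefix in KNOWN_PREFIXES:
        category = "known"
    elif prefix in MODEL_PREFIXES:
        category = "model"
    else:
        category = "other"
    return {
        "prefix": prefix,
        "category": category,
        "version": int(version) if version is not None else None,
    }


if __name__ == "__main__":
    for acc in ["NM_002197.2", "NR_029668.1", "NC_000012.12", "not-an-accession"]:
        print(acc, classify_refseq(acc))
```

A check like this is only a first filter on the accession string; whether a record is curated is stated on the record itself.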

Determining protein-coding potential of a variant
Example: Human CCNO gene (Gene ID: 10309) – Three non-coding RefSeqs (NRs) were made to represent full-length transcript variants that either lack an open reading frame (ORF) that meets our quality criteria, or have an ORF that renders the transcript a candidate for nonsense-mediated decay (NMD).
NCBI RADAR: NC_000005.10 Chromosome 5 GRCh38.p2 (similar to UCSC hg38)
[Screenshot. Labels: protein-coding variant (NM_); non-coding variants (NR_); NMD candidate; ORFs are short, < 60 aa]

Detailed documentation improves consistency
Protein-coding RNA loci
• 1 long cDNA
• Or, 2 lines of support:
  • Overlapping partial transcripts + more support
  • Protein homology or ORF conservation or publication
• Consensus splice sites
• ORF length >= 100 aa
  • If < 100 aa, require more support
• Not apparently pseudogene
Non-coding RNA loci
• 1 long cDNA if > 2 exons
• 2 independent lines of support if 2 exons
• 5 lines of support if 1 exon
• ORF length < 100 aa
• No quality protein hits (blastX)
• Consensus splice
• Consider if syntenic region in human, mouse
• No other data (publication) indicates it is protein-coding
• 3' end does not correspond to genomic polyA

Using epigenomic data to determine 5' completeness
Example: mouse Fgd4 gene (Gene ID: 224014).
NCBI RADAR: NC_000082.6 Chromosome 16 GRCm38
UCSC Browser: H3K4me3 tracks from the UCSC Genome Browser

Representing genes based on published data
Example: Human APELA gene (Gene ID: 100506013) – transcript data support an independent gene with a short ORF (54 aa) that typically would not meet RefSeq criteria for a protein-coding locus. Literature review confirms the short ORF is functional.
Assembly: GRCh38.p2, chromosome 4.
NCBI RADAR: NC_000004.12 Chromosome 4 GRCh38.p2
[Screenshot. Labels: 54 aa ORF; Functional data support the 54 aa ORF]

Gene type decisions depend on transcript data, epigenomics and functional studies
Example: Human FALEC gene (Gene ID: 100874054)
Assembly: GRCh38.p2; chromosome 1
NCBI RADAR: NC_000001.11 Chromosome 1 GRCh38.p2 (hg38)
UCSC: NC_000001.10 Chromosome 1 GRCh37 (hg19)
• The locus is supported by a single two-exon EST (AL713297.1)
• Epigenomic marks support the 5' completeness of the transcript data
• Published data support a functional role for this lncRNA

Working with nomenclature groups to coordinate changes
Example: Non-coding gene LINC00948 was updated to a protein-coding gene MRLN (GeneID: 100507027).
Private comments in the in-house Gene database record the curation history
[Screenshot. Labels: Human Annotation Release 107; RefSeq proteins (red)]

Functional annotation on the RefSeq record
Example: Human GHRL gene (Gene ID: 51738) - ghrelin/obestatin prepropeptide
[Figure. Labels: GHRL gene; Prepro-ghrelin; Signal peptide; pro-ghrelin; Ghrelin; C-Ghrelin; Mature peptides; Ghrelin-28; Obestatin]
http://www.ncbi.nlm.nih.gov/protein/NP_057446.1

GHRL annotation display in NCBI's Gene resource
• Mature peptides were annotated on protein products of 8 alternatively spliced transcripts (red arrows).
• The Graphics display shown in NCBI's Gene resource was reconfigured to show all transcripts and proteins, and to show the protein features.
http://www.ncbi.nlm.nih.gov/gene/51738

Micro RNA annotation – collaboration with miRBase
Example: Human MIR124-1 (Gene ID: 406907), miRBase ID: MI0000443
• NCBI imports data directly from miRBase (mirbase.org)
• RefSeq represents the miRNA stem-loop precursor: NR_029668.1
• RefSeq annotates the mature microRNAs
[Screenshot: Gene Graphics view]
http://www.ncbi.nlm.nih.gov/gene/406907

RefSeq record – feature annotation for miRNAs
RefSeq NR_029668.1 - Human MIR124-1 - Gene ID: 406907
http://www.ncbi.nlm.nih.gov/nuccore/NR_029668.1

Feature annotation
More examples of feature annotation will be provided in Session 1

RefSeq collaborates to improve genome annotation
• GRCh37 – Several exons of the human COPG2 RefSeq were missing in the reference genome assembly. Curators constructed the RefSeq from transcripts and reported the assembly gap to the Genome Reference Consortium (GRC).
• GRCh38 – The gap is fixed in the updated assembly. RefSeq and Sanger collaborate to produce matching annotation on the new assembly.
• CCDS – The annotated CDS is tracked by the Consensus CDS (CCDS) collaboration once NCBI and Ensembl have both annotated the protein
[Screenshots. Labels: Chromosome 7 GRCh37/hg19 NC_000007.13; Chromosome 7 GRCh38/hg38 NC_000007.14]

Caution: using RefSeq data from non-NCBI resources
[Screenshots. Labels: NCBI's Graphics Viewer GRCh38/hg38; UCSC's Genome Browser RefSeq Genes track GRCh37/hg19; missing pseudogene locus; missing locus - also missing for UCSC GRCh38/hg38; missing XM_ variant]

Outline
• RefSeq overview
• Curated data
• Sequence analysis
• Curation in-depth – examples
• Data access

Finding RefSeq data in NCBI's Gene resource
• NCBI's Gene resource is primarily based on RefSeq
• Gene integrates data from many sources:
  • RefSeq & GeneRIF
  • Official Nomenclature
  • Gene Ontology
  • Orthologs, Pathways, Phenotypes, Variation, Protein interactions, and more
• Gene provides a unique ID and includes RefSeq details:
  • RefSeq genome annotation
  • RefSeq details including transcript variant descriptions
  • Report of exon coordinates

RefSeq data in Gene
• Genomic regions, transcripts, proteins
  • Find genome annotation details
• NCBI Reference Sequences (RefSeqs)
  • Find information for individual accessions

Manual curation provides annotation for Gene
Example: human GHRL (GeneID: 51738)
[Screenshot. Labels: Nomenclature; Summary; Publications; RefSeq transcript variant descriptions]

Navigating from Gene to Sequence to download

Nucleotide & Protein queries
• Build a query starting with: refseq[filter]
• Add an organism: AND human[organism]
• Add a name, a RefSeq attribute, or a specific feature type
  • AND ghrelin-27[protein name]
  • Or … AND mat_peptide[feature key]
  • Or … AND obestatin[protein name]
Protein database query example:
refseq[filter] AND human[orgn] AND ghrelin-27[protein name] AND mat_peptide[feature key]
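
The query-building steps above can be sketched in a few lines of Python. This is a minimal sketch, not NCBI code: the helper names `build_query` and `esearch_url` are hypothetical, the `retmax` default is an arbitrary choice, and the sketch assumes the public E-utilities `esearch.fcgi` endpoint with its `db`, `term`, and `retmax` parameters. It only constructs the query string and a URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint (assumed here; see the eUtils help pages).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(terms):
    """Join (value, field) pairs into an Entrez query: value[field] AND ..."""
    return " AND ".join(f"{value}[{field}]" for value, field in terms)


def esearch_url(db, query, retmax=20):
    """Build an ESearch URL for the given database and query (no request is sent)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


# The protein database example from the slide, assembled step by step.
query = build_query([
    ("refseq", "filter"),            # start with the RefSeq filter
    ("human", "orgn"),               # add an organism
    ("ghrelin-27", "protein name"),  # add a name
    ("mat_peptide", "feature key"),  # add a feature type
])
url = esearch_url("protein", query)

if __name__ == "__main__":
    print(query)
    print(url)
```

Keeping the terms as (value, field) pairs makes it easy to swap in a different name or feature key, such as obestatin[protein name], without retyping the whole query.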

RefSeq in BLAST

Bulk retrievals
• RefSeq FTP site – ftp://ftp.ncbi.nlm.nih.gov/refseq/
  • Comprehensive bi-monthly release organized by major groups (e.g., vertebrate_mammals, etc.)
  • Weekly updates of transcript/protein records for some organisms
• Genomes FTP site – ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/
  • Releases of genome assembly and annotation data. Updated to add new file formats, when an assembly updates, and when there is a major annotation update.
• Gene FTP site – ftp://ftp.ncbi.nlm.nih.gov/gene/
  • Reports Gene to RefSeq accession associations, and more.
• NCBI Programming Utilities (eUtils) – supports scripted retrievals
  • Introduction: http://www.ncbi.nlm.nih.gov/books/NBK25497/
  • Help: http://www.ncbi.nlm.nih.gov/books/NBK25501/

User feedback and RefSeq updates
• Feedback: http://www.ncbi.nlm.nih.gov/projects/RefSeq/update.cgi
  • RefSeq Home page
  • Gene report pages
• RefSeq Updates: subscribe to the refseq-announce mail list
  http://www.ncbi.nlm.nih.gov/mailman/listinfo/refseq-announce/
• NCBI News
  http://www.ncbi.nlm.nih.gov/news/

Acknowledgements
RefSeq Curators (Vertebrates & Other taxa)
Stacy Ciufo, Eric Cox, Diana Haddad, Catherine Farrell, Tamara Goldfarb, Tripti Gupta, Vinita Joardar, Vamsi Kodali, Wenjun Li, Kelly McGarvey, Mike Murphy, Nuala O'Leary, Kathleen O'Neill, Shashi Pujar, Bhanu Rajput, Sanjida Rangwala, Lillian Riddick, Barbara Robberts, Brian Smith-White, Anjana Raina Vatsan, Dave Webb, Matt Wright
NCBI Leadership
• David Lipman
• James Ostell
Databases & programming
• Terence Murphy
• Olga Ermolaeva
• Craig Wallin
• Alex Astashyn
• David Maganadze
• Mike DiCuccio
• Andrei Shkeda
• Donna Maglott
Genome Workbench & RADAR
• Anatoliy Kuznetsov
• David Falk
• Andrei Shkeda