RefSeq - 8th International Biocuration Conference

Curating sequence and literature
data for RefSeq and Gene
Kim D. Pruitt
8th International Biocuration Conference
Training workshop
April 23, 2015
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, DHHS, USA
RefSeq overview
What is RefSeq?
How does it compare to GenBank?
What are the advantages?
How is the dataset built?
• Curated data
• Sequence analysis
• Curation in-depth – examples
• Data access
What is RefSeq?
An NCBI project to provide reference sequence standards that incorporate
current knowledge for genomes, transcripts, and proteins.
Vertebrates Eukaryotes
Prokaryotes
Virus
Genomes
169
503
31,000
4,538
Genes
4 million
9.2 million
2 million
200,000
Transcripts 5.6 million
11 million
20,000
na
Proteins
10 million
38 million
214,287
4.9 million
Counts taken in early March 2015
RefSeq versus GenBank
• Is archival (member of INSDC)
- GenBank: Yes
- RefSeq: No
• Source of sequence
- GenBank: Submitter
- RefSeq: GenBank (INSDC)
• Source of annotation
- GenBank: Submitter
- RefSeq: GenBank, Collaboration, Literature, Curation, Computation
• Genome is always annotated
- GenBank: No
- RefSeq: Yes for archaea, bacteria, eukaryotes
• ‘Owner’ of sequence records and annotation
- GenBank: Submitter
- RefSeq: NCBI
• NCBI staff can update based on user requests
- GenBank: Submitter must authorize
- RefSeq: may drop contamination; may add transcript/protein/pseudogene based on data analysis and curation; may update annotation
• Annotation may be curated by NCBI staff
- GenBank: No
- RefSeq: Yes
15 years of building RefSeq
www.ncbi.nlm.nih.gov/refseq/
Advantages:
Consistency
Non-redundant
Use current names
Expanded feature annotation
Connected to Gene information
Products & Access:
Annotated genomes, transcripts, proteins
Gene, BLAST, FTP, programming API
Curation:
Correct errors
Add new records
Add functional information
Connect sequence to function
Gene & protein names
Functional sequence elements
Curation focus
Human
Mouse
Rat
Zebrafish
Cow
Chicken
RefSeq's unique contribution for vertebrates
• Correct transcript/protein sequence even if genome is incomplete/wrong
• Clear information on data source & evidence
NM_001033952.2
• Connect DNA<>RNA<>Protein
• Connect sequence regions to function
- for both transcripts and proteins
RefSeq Genomes in a Nutshell
[Diagram. Data submissions: a submitter provides sequence and assembly, optionally with annotation, to GenBank/INSDC resources (Genome, Nucleotide, Protein, SRA reads, Assembly, BioSample, BioProject), which are accessible through BLAST, FTP, Web, and eUtils. RefSeq creation: sequence and meta-data pass through the annotation pipeline, RefSeq curation, and collaboration process flows. RefSeq access: Gene, BLAST, Genome, tracks, FTP, reports, Assembly, and HomoloGene.]
RefSeq genomes: Leveraging computation & curation
www.ncbi.nlm.nih.gov/genome/annotation_euk/process/
[Diagram. Annotation pipeline: align RefSeqs, cDNAs, proteins, and RNA-Seq; filter for best hits; interpret the alignments to build models; call orthologs vs. human; assign GeneID and accession; run quality checks; publicly release model RefSeqs and genes through Gene, Nucleotide, Protein, and FTP. Curation: sequence analysis and literature review produce curated RefSeqs that feed back into the pipeline. Both are iterative processes with input from the International CCDS Collaboration, UniProtKB/SwissProt, model organism databases, nomenclature groups, miRBase, the Genome Reference Consortium (GRC), and user feedback.]
Annotation - a conservative approach
Annotate every exon that is observed once? X
1. STAG3L5P-PVRIG2P-PILRB readthrough
2. stromal antigen 3-like 5 pseudogene
3. poliovirus receptor related immunoglobulin domain pseudogene
4. paired immunoglobin-like type 2 receptor beta (regulation of inflammatory responses)
Consolidate information to represent supported genes and transcripts!
Annotation pipeline results in NCBI Gene
Access genome annotation information including RNA-Seq tracks
Rabbit - GeneID:103352519 - Assembly: OryCun2.0
Configure
Model RefSeqs
Not annotated in Ensembl 76
Ensembl track
RNA-Seq tracks
Interpreted introns
Curated
Track names
Exon coverage
Log2 scale graphs
How to identify a RefSeq sequence record
Keyword:
• RefSeq
Accession format:
• Two letters + _ + 6-9 digits, or
• Two letters + _ + GenBank accession
RefSeq categories (transcripts & proteins):
• Known RefSeq
• Subject to curation
• Accession prefix N*_
• Model RefSeq
• Evidence-based predictions
• Accession prefix X*_
www.ncbi.nlm.nih.gov/nucleotide/NM_002197.2
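The accession conventions on this slide lend themselves to a simple pattern check. The following is a minimal Python sketch, not an NCBI tool: the function name `classify_refseq_accession` is hypothetical, the optional `.version` suffix is inferred from the example NM_002197.2, and the prefix sets (NM_, NR_, NP_ for known; XM_, XR_, XP_ for model) are the standard transcript and protein prefixes rather than something listed on the slide. It handles only the "two letters + _ + digits" form, not the "two letters + _ + GenBank accession" form.

```python
import re

# Two uppercase letters, an underscore, 6-9 digits, optional ".version".
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d{6,9})(?:\.(\d+))?$")

KNOWN_PREFIXES = {"NM", "NR", "NP"}   # curated transcripts and proteins
MODEL_PREFIXES = {"XM", "XR", "XP"}   # evidence-based predictions

def classify_refseq_accession(accession):
    """Return 'known', 'model', 'other', or None if not RefSeq-like."""
    match = REFSEQ_PATTERN.match(accession)
    if match is None:
        return None
    prefix = match.group(1)
    if prefix in KNOWN_PREFIXES:
        return "known"
    if prefix in MODEL_PREFIXES:
        return "model"
    return "other"  # e.g. genomic records such as NC_
```

For instance, `classify_refseq_accession("NM_002197.2")` falls in the known category, while a GenBank accession such as AL713297.1 has no underscore and is not matched at all.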
RefSeq overview
Curated data
Genes
Sequence
Publications
Imported data
• Sequence analysis
• Curation in-depth – examples
• Data access
BULK PROCESSES
Import
• Gene information
• Gene-2-sequence associations
• Publications
• Data from collaborators
Update DB
• Add data from collaborators
• Add, update, remove accessions to match GenBank
QA
• Identify data conflicts for curator review
CURATION
Review data
Resolve Errors
• Remove wrong name synonyms, publications
• Fix sequence associations
• Update gene type
• Correct collaborator Gene: NCBI Gene associations
Add data
• Create RefSeq records
• RefSeq Attributes & Summary
• Transcript variant description
• Alternate names, publications
Vertebrate transcripts
How do we curate?
• Collaborations
• Nomenclature, MODs, UniProt, Genome Reference Consortium, individual scientists
• In-depth sequence analysis
• Genome, transcript and protein sequence
• Alignments
• RNA-Seq
• QA tests
• Epigenomics
• Clinical variants
• Literature review
[Diagram: Collaboration, Sequence Analysis, Validation, Literature, and Guidelines feed Curation, which produces mRNA, ncRNA, protein, and pseudogene records for Genome Annotation, available via WWW, FTP, and BLAST]
Tracking data & curation consistency
Data management
Curation management
• Specifications for the product
• Standard operating procedures
• Relational database to track data and curation
decisions over time
• Process flows
• Curation decision trees
• ncRNA <> pseudo <> protein-coding?
• 5’ complete transcript <>partial?
• Data validation
• Sequence analysis tools and CGIs
• Disaster recovery/backup
• Support collaborations
• Public access
What do we curate?
• Genes:
• Type, location, length
• Names, Summary
• Publications
• Gene-2-accession bins
Protein-coding, ncRNAs, Pseudogene, Unknown ???
• Imported data
• Sequence:
• Accuracy, length
• Alternate splice products
• Sequence features
• Functional regions
RefSeq: www.ncbi.nlm.nih.gov/refseq/
Gene: www.ncbi.nlm.nih.gov/gene/
Curating Literature
• Curation Review for Genes
• Move to correct gene
• Add functional citations
• Mark to include on RefSeq
• GeneRIF submissions from public
• Add RefSeq attribute and citation
• Most publications are added from:
• National Library of Medicine MeSH indexing service
• Sequence records
• Nomenclature groups, MODs, GO, OMIM, GWAS catalog, more…
GeneRIFs – an annotated bibliography
RefSeq curators review GeneRIF submissions from individuals to correct spelling, check the gene association, and remove irrelevant submissions.
http://www.ncbi.nlm.nih.gov/gene/10309
Curation supports data import processes
[Diagram: data from HGNC, Pseudogene.org, MGD, RGD, OMIM, ZFIN, XenBase, QTL db, CGNC, and miRBase arrive by FTP/API and pass through a generic processing dataflow. Each record is compared to known data; the Gene backend database is updated if OK, and a report for curation is generated if conflicts are found.]
Curating data import errors
• Manually add or update some data
• HGNC may have: HGNC ID 1 = genome location ‘x’ = ENSG ID 1
• Processing can’t identify corresponding GeneID
• Curator reviews genomic location and either updates or creates a Gene record.
• Coordinate with data sources to reconcile data association conflicts between sites
• NCBI may have: Gene ID 1 = HGNC ID 1 = Accession 123
• HGNC may have: HGNC ID 1 = Gene ID 1 = Accession 234
• NCBI may also have:
Accession 234 = GeneID 2 = HGNC ID 2 (a paralog)
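The conflict in this example (one accession tied to two different GeneIDs across sites) is the kind of thing a bulk check can flag for curator review. Below is a minimal Python sketch under stated assumptions: the tuple layout and the function name `find_accession_conflicts` are hypothetical, and the records are the toy IDs from the slide, not real data or NCBI's actual processing code.

```python
def find_accession_conflicts(associations):
    """Report accessions that are linked to more than one GeneID.

    `associations` is an iterable of (source, gene_id, hgnc_id, accession)
    tuples. Returns {accession: sorted list of gene_ids}, conflicts only.
    """
    genes_by_accession = {}
    for source, gene_id, hgnc_id, accession in associations:
        genes_by_accession.setdefault(accession, set()).add(gene_id)
    return {
        accession: sorted(gene_ids)
        for accession, gene_ids in genes_by_accession.items()
        if len(gene_ids) > 1
    }

# Toy records mirroring the slide's example (illustrative placeholders).
records = [
    ("NCBI", "GeneID 1", "HGNC ID 1", "Accession 123"),
    ("HGNC", "GeneID 1", "HGNC ID 1", "Accession 234"),
    ("NCBI", "GeneID 2", "HGNC ID 2", "Accession 234"),
]
conflicts = find_accession_conflicts(records)
```

Here only Accession 234 is reported, because it is associated with both GeneID 1 and GeneID 2; deciding which association is correct (for example, recognizing the paralog) remains a curator's judgment.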
RefSeq overview
Curated data
Sequence analysis
Tools
Quality assurance checks
• Curation in-depth - examples
• Data access
Quick access to stored BLAST results
Gene back-end curation database
In-house: Set of BLAST searches per accession
• Databases: UniVec, EST, NR, Genome
• Programs: blastn, blastx, blastp
Results are stored for 3 months
Quick access to results
View hits in NCBI’s genome browser
Sequence and alignment analysis using NCBI’s
Genome Workbench
An application for viewing and
analyzing sequence data from
NCBI databases, or upload your
data for analysis
• Compiled for several
operating systems
• Analysis: BLAST and more
• Supports many display
options
• graphical
• alignments
• dot plot
• phylogenetic trees
• more
www.ncbi.nlm.nih.gov/tools/gbench/
General layout
Data display area
Project Tree shows loaded data
Search for features, search the sequence, search for open reading frames
Monitor the progress of analysis tasks
Multi-pane cross alignment view
Turkey_2.01
Chromosome 1
Turkey_5.0
Chromosome 1
Search
Load a set of protein accession.version numbers
Select accessions to include in your analysis
Select the analysis option from the Tool menu
Display the phylogenetic tree calculated
from selected CELF proteins.
Genome workbench - Multiple protein
alignment display
Curation use:
- Orthology review
- Gene type review
- Sequence conservation
RADAR – a Genome Workbench plug-in for RefSeq Curation
RefSeq Analysis, Display, and Recommendation
New RefSeq Strain
QA
Library
Displays Information on:
Genomic region, gene annotation
RNA-seq called introns
CpG Islands, Repeats, variation, more
QA results for newly built RefSeq
Aligned RefSeqs, cDNAs, ESTs
Coding sequence region (green)
Strain data
Clone library
Stored in DB with quality concern (D)
Multiple alignments to the genome (M)
Consensus splice sites (‘a’, ‘d’)
Mismatches
Indels
Unaligned ends (not shown)
RADAR
• Functions
• RNAseq supported intron
• ORF finder
• Signal peptides
• Transmembrane regions
• Compare/diff transcripts
• Find similar transcripts
• Integrated QA tests
• View nucleotide
• View translation
• Links to web for details
PROCESS
Import
• Gene information
• Gene-2-sequence associations
• Publications
• Data from collaborators
Update DB
• Add data from collaborators
• Add, update, remove accessions to match GenBank
QA
• Identify data conflicts for curator review
CURATION
Review data
Resolve Errors
• Remove wrong name synonyms, publications
• Fix sequence associations
• Update gene type
• Correct collaborator Gene: NCBI Gene associations
Add data
• Create RefSeq records
• RefSeq Attributes & Summary
• Transcript variant description
• Alternate names, publications and GeneRIF
Quality assurance tests
Transcript tests – protein tests – genome tests – alignment tests
[Screenshot labels: sequence tested; results summary; results over time; details (not shown)]
Tests are available in the NCBI C++ toolkit – http://www.ncbi.nlm.nih.gov/toolkit/
RefSeq overview
Curated data
Sequence analysis
Curation in-depth – examples
Work flow
Making decisions
Working with collaborators
RefSeq curated data is in Gene
Annotating RefSeq records
• Data access
General process flow for manual transcript-based curation
• Identify quality full-length cDNAs or ESTs
• Identify splice variants and assess their protein-coding capacity
• Extend 5’ and 3’ ends using all aligning transcript data
• Determine the supported complete CDS
Representative RefSeqs (NMs and NR) include:
• Protein-coding variant that encodes an alternate C-terminus
• Non-coding variant that is subject to nonsense-mediated decay (NMD)
[Diagram labels: gt/ag splice sites; AAAAAA poly(A) tails]
Transcript-based curation process
Example: Human DNAJC22 gene (Gene ID:79962)- RefSeqs are constructed using RADAR
NCBI RADAR: NC_000012.12 Chromosome 12 GRCh38.p2 (similar to UCSC hg38)
Curated NMs are
based on full-length transcripts
RNA-seq
alignments
Chr 12
Known
Model
UTRs are
extended
Aligned
cDNAs
Model XMs are created
computationally based on
transcript and RNA-seq data and
often lack full-length support.
Determining protein-coding potential of a variant
Example: Human CCNO gene (Gene ID: 10309) – Three non-coding RefSeqs (NRs) were made to represent full-length transcript variants that either lack an open reading frame (ORF) that meets our quality criteria or have an ORF that renders the transcript a candidate for nonsense-mediated decay (NMD).
NCBI RADAR: NC_000005.10 Chromosome 5 GRCh38.p2 (similar to UCSC hg38)
protein-coding variant (NM_)
non-coding variants (NR_)
NMD candidate
ORFs are short < 60 aa
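The ORF-length part of this decision can be illustrated with a small calculation. The sketch below is an assumption-laden simplification, not RADAR's ORF finder: it scans only the three forward frames of a transcript (transcripts are assumed to be in sense orientation), counts an ORF from the first ATG to the next in-frame stop, and leaves the length cutoff to the caller. The function names are hypothetical, and NMD assessment is not modelled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf_aa(sequence):
    """Length in amino acids of the longest ATG-to-stop ORF (forward frames)."""
    seq = sequence.upper()
    best = 0
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count codons from the ATG up to, not including, the stop.
                best = max(best, (i - start) // 3)
                start = None
    return best

def meets_orf_length(sequence, min_aa):
    """True if the longest complete ORF has at least `min_aa` residues."""
    return longest_orf_aa(sequence) >= min_aa

# A toy transcript: ATG + five alanine codons + stop = 6 residues.
toy = "ATG" + "GCT" * 5 + "TAA"
```

With a cutoff of 60 aa, as in the CCNO note above, the toy sequence would be classed as having only a short ORF; an ORF with no in-frame stop is treated as incomplete and ignored.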
Detailed documentation improves consistency
Protein-coding RNA loci
• 1 long cDNA
• Or, 2 lines of support:
• Overlapping partial transcripts + more support
• Protein homology or ORF conservation or publication
• Consensus splice sites
• ORF length >=100 aa
• If <100 aa require more support
• Not apparently pseudogene
Non-coding RNA loci
• 1 long cDNA if > 2 exons
• 2 independent lines of support if 2 exons
• 5 lines of support if 1 exon
• ORF length <100 aa
• No quality protein hits (blastX)
• Consensus splice sites
• Consider if syntenic region in human, mouse
• No other data (publication) indicates it is protein-coding
• 3’ end does not correspond to genomic polyA
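Written guidelines like these can be encoded so that the same thresholds are applied every time. The sketch below covers only the exon-count rule, assuming the "1 long cDNA if > 2 exons", "2 independent lines of support if 2 exons", and "5 lines of support if 1 exon" items belong to the non-coding column of the slide; the function names are hypothetical and the other criteria (ORF length, protein hits, synteny, polyA) are left out.

```python
def required_support_noncoding(exon_count):
    """Minimum lines of transcript support for a non-coding locus."""
    if exon_count < 1:
        raise ValueError("exon_count must be at least 1")
    if exon_count == 1:
        return 5   # single-exon loci need the most support
    if exon_count == 2:
        return 2   # two independent lines of support
    return 1       # more than 2 exons: one long cDNA

def has_enough_support(exon_count, support_lines):
    """True if the observed support meets the exon-count threshold."""
    return support_lines >= required_support_noncoding(exon_count)
```

A two-exon locus backed by a single EST would fail this check on its own, which is why additional evidence such as epigenomic marks or publications matters for loci of that kind.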
Using Epigenomic data to determine 5’ completeness
Example: mouse Fgd4 gene (Gene ID: 224014).
NCBI RADAR: NC_000082.6 Chromosome 16 GRCm38
UCSC Browser
H3K4me3 tracks
from the UCSC
Genome Browser
Representing genes based on published data
Example: Human APELA gene (Gene ID: 100506013) – transcript data supports an independent gene
with a short ORF (54 aa) that typically would not meet RefSeq criteria for a protein-coding locus.
Literature review confirms the short ORF is functional.
NCBI RADAR: NC_000004.12 Chromosome 4 GRCh38.p2
Assembly: GRCh38.p2, chromosome 4.
54 aa ORF
Functional data support the 54 aa ORF
Gene type decisions depend on transcript data,
epigenomics and functional studies
Example: Human FALEC gene (Gene ID: 100874054)
Assembly: GRCh38.p2; chromosome 1
NCBI RADAR: NC_000001.11 Chromosome 1 GRCh38.p2 (hg38)
The locus is supported by a single
two-exon EST (AL713297.1)
UCSC - NC_000001.10 Chromosome 1 GRCh37 (hg19)
Epigenomic marks support the 5’
completeness of the transcript data
Published data support a functional
role for this lncRNA
Working with nomenclature groups to coordinate changes
Example: Non-coding gene LINC00948 was updated to a protein-coding gene MRLN (GeneID: 100507027).
Private comments in the in-house Gene database record the curation history
Human Annotation Release 107
RefSeq
proteins
(red)
Functional annotation on the RefSeq record
Example: Human GHRL gene (Gene ID: 51738)
- ghrelin/obestatin prepropeptide
AAAAAA
GHRL gene
Prepro-ghrelin
Ghrelin
C-Ghrelin
Ghrelin
C-Ghrelin
Signal
peptide
pro-ghrelin
Mature
peptides
Ghrelin-28
Obestatin
http://www.ncbi.nlm.nih.gov/protein/NP_057446.1
GHRL annotation display
in NCBI’s Gene resource
• Mature peptides were annotated on protein products of 8
alternatively spliced transcripts (red arrows).
• The Graphics display shown in NCBI’s Gene resource was
reconfigured to show all transcripts and proteins, and to
show the protein features.
http://www.ncbi.nlm.nih.gov/gene/51738
Micro RNA annotation – collaboration with miRBase
Example: Human MIR124-1 (Gene ID: 406907)
miRBase ID:
MI0000443
NCBI imports data directly from miRBase (mirbase.org)
Gene Graphics view
RefSeq represents
the miRNA stem-loop precursor
NR_029668.1
RefSeq annotates the mature microRNAs
http://www.ncbi.nlm.nih.gov/gene/406907
RefSeq record – feature annotation for miRNAs
RefSeq NR_029668.1
- Human MIR124-1
- Gene ID: 406907
http://www.ncbi.nlm.nih.gov/nuccore/NR_029668.1
Feature annotation –
More examples of feature annotation will be provided in Session 1
RefSeq collaborates to improve genome annotation
GRCh37 (Chromosome 7, GRCh37/hg19, NC_000007.13) – Several exons of the human COPG2 RefSeq were missing in the reference genome assembly. Curators constructed the RefSeq from transcripts and reported the assembly gap to the Genome Reference Consortium (GRC).
GRCh38 (Chromosome 7, GRCh38/hg38, NC_000007.14) – The gap is fixed in the updated assembly. RefSeq and Sanger collaborate to produce matching annotation on the new assembly.
CCDS – The annotated CDS is tracked by the Consensus CDS (CCDS) collaboration once NCBI and Ensembl have both annotated the protein.
Caution: using RefSeq data from non-NCBI resources
NCBI’s Graphics Viewer: GRCh38/hg38
UCSC’s Genome Browser, RefSeq Genes track: GRCh37/hg19
• missing pseudogene locus
• missing locus (also missing for UCSC GRCh38/hg38)
• missing XM_ variant
RefSeq overview
Curated data
Sequence analysis
Curation in-depth – examples
Data access
Finding RefSeq data in NCBI’s Gene resource
• NCBI’s Gene resource is primarily based on RefSeq
• Gene integrates data from many sources:
• RefSeq & GeneRIF
• Official Nomenclature
• Gene Ontology
• Orthologs, Pathways, Phenotypes, Variation, Protein interactions, and more
• Gene provides a unique ID and includes RefSeq details:
• RefSeq genome annotation
• RefSeq details including transcript variant descriptions
• Report of exon coordinates
RefSeq data in Gene
• Genomic regions, transcripts, proteins
• Find genome annotation details
• NCBI Reference Sequences (RefSeqs)
• Find information for individual accessions
Manual curation provides annotation for Gene
Example: human GHRL (GeneID:51738)
Nomenclature
Summary
Publications
RefSeq transcript
variant
descriptions
Navigating from Gene to Sequence to download
Nucleotide & Protein queries
• Build a query starting with: refseq[filter]
• Add an organism: AND human[organism]
• Add a name, a RefSeq attribute, or a specific feature type
• AND ghrelin-27[protein name]
• Or… ‘AND mat_peptide[feature key]’ Or… ‘AND obestatin[protein name]’
Protein database query example:
refseq[filter] AND human[orgn] AND ghrelin-27[protein name] AND mat_peptide[feature key]
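The query-building steps above can be expressed as a small helper. This is a minimal Python sketch: the function name `build_refseq_query` and its keyword-argument convention (underscores standing in for the spaces in Entrez field tags) are illustrative assumptions, while the field tags themselves come from the slide's example.

```python
def build_refseq_query(organism=None, **fields):
    """Assemble an Entrez query string restricted to RefSeq records."""
    terms = ["refseq[filter]"]
    if organism:
        terms.append(f"{organism}[orgn]")
    for field, value in fields.items():
        # Keyword names use underscores; Entrez field tags use spaces.
        terms.append(f"{value}[{field.replace('_', ' ')}]")
    return " AND ".join(terms)

query = build_refseq_query(
    "human",
    protein_name="ghrelin-27",
    feature_key="mat_peptide",
)
```

The resulting string is the same as the Protein database example on the slide and can be pasted into the search box as-is.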
RefSeq in BLAST
Bulk retrievals
• RefSeq FTP site – ftp://ftp.ncbi.nlm.nih.gov/refseq/
• Comprehensive bi-monthly release organized by major groups (e.g.,
vertebrate_mammals, etc.)
• Weekly updates of transcript/protein records for some organisms
• Genomes FTP site – ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/
• Releases of genome assembly and annotation data. Updated when new file formats are added,
when an assembly is updated, or when there is a major annotation update.
• Gene FTP site – ftp://ftp.ncbi.nlm.nih.gov/gene/
• Reports Gene to RefSeq accession associations, and more.
• NCBI Programming Utilities (eUtils) – supports scripted retrievals
• Introduction: http://www.ncbi.nlm.nih.gov/books/NBK25497/
• Help: http://www.ncbi.nlm.nih.gov/books/NBK25501/
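As a starting point for scripted retrievals, the sketch below only constructs an ESearch request URL and does not contact NCBI. The base URL and the `db`, `term`, `retmax`, and `retmode` parameter names are taken from the public E-utilities documentation linked above rather than from the slide, and the helper name `esearch_url` is hypothetical. Check the Help page for current usage guidelines before sending requests in bulk.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for an Entrez database and query term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"

url = esearch_url(
    "protein",
    "refseq[filter] AND human[orgn] AND ghrelin-27[protein name]",
)
```

The URL can then be fetched with any HTTP client; the IDs returned by ESearch are typically passed on to EFetch to download the records themselves.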
User feedback and RefSeq updates
• Feedback:
http://www.ncbi.nlm.nih.gov/projects/RefSeq/update.cgi
RefSeq Home page
Gene report pages
• RefSeq Updates: subscribe to the refseq-announce mail list
http://www.ncbi.nlm.nih.gov/mailman/listinfo/refseq-announce/
• NCBI News
http://www.ncbi.nlm.nih.gov/news/
Acknowledgements
RefSeq Curators (Vertebrates & Other taxa)
Stacy Ciufo
Eric Cox
Diana Haddad
Catherine Farrell
Tamara Goldfarb
Tripti Gupta
Vinita Joardar
Vamsi Kodali
Wenjun Li
Kelly McGarvey
Mike Murphy
Nuala O'Leary
Kathleen O’Neill
Shashi Pujar
Bhanu Rajput
Sanjida Rangwala
Lillian Riddick
Barbara Robberts
Brian Smith-White
Anjana Raina Vatsan
Dave Webb
Matt Wright
NCBI Leadership
• David Lipman
• James Ostell
Databases & programming
• Terence Murphy
• Olga Ermolaeva
• Craig Wallin
• Alex Astashyn
• David Maganadze
• Mike DiCuccio
• Andrei Shkeda
• Donna Maglott
Genome Workbench & RADAR
• Anatoliy Kuznetsov
• David Falk
• Andrei Shkeda