RefSeq - 8th International Biocuration Conference

Functional curation of sequence data for RefSeq
Kim D. Pruitt
8th International Biocuration Conference
April 24, 2015
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, DHHS, USA
What is RefSeq?
An NCBI project to provide reference sequence standards that incorporate
current knowledge for genomes, transcripts, and proteins.
Vertebrates
Genomes: 169
Genes: 4 million
Transcripts: 5.6 million
Proteins: 4.9 million
Most curated organisms: human, mouse, rat, zebrafish, cow, chicken
RefSeq curation focus for vertebrate genomes
• Gene type: protein-coding, ncRNA, pseudogene, unknown
• Gene location/length
• Transcript variants
• Names
• Publications
• Functional annotation
• Functional regions of sequence
• Gene summary
• References and GeneRIFs (Gene References Into Function)
RefSeq: www.ncbi.nlm.nih.gov/refseq/
Gene: www.ncbi.nlm.nih.gov/gene/
RefSeq's unique contribution
• Correct transcript/protein sequence
• Clear data source & evidence
• Connect DNA<>RNA<>Protein
• Connect sequence regions to function
- for both transcripts and proteins
NM_001033952.2
Curation Examples (2014)
• Antimicrobial peptides (human genes)
• Histones (human and mouse genes)
• Regulatory upstream open reading frames (uORFs; human, mouse and rat genes)
Antimicrobial peptides (AMPs)
• Short peptides (typically 15-100 aa)
• Found in most multicellular organisms
• Not highly conserved at sequence level
• Cationic residues bind to negatively charged bacterial membranes
A new approach is needed to fight antibiotic-resistant bacterial infections
Advantages of AMPs:
• Bacteria generally cannot become resistant to AMPs
• Low toxicity
• Kill pathogens efficiently
Curating human AMPs
PubMed + AMP databases → AMP sequence → BLASTP → RefSeq & Gene
Manual curation:
• Annotate AMP peptide
• RefSeq attribute
• RefSeq Summary
• Store GeneRIF
191 proteins, 139 genes
Query Nucleotide or Protein:
refseq[filter] AND "protein has antimicrobial activity"[properties]
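A query like the one above can also be submitted programmatically. The following is a minimal sketch, assuming the field tags are written in square brackets and that the NCBI E-utilities ESearch endpoint is used; the helper name `esearch_url` and the `retmax` default are illustrative choices, not part of the talk. It only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for an Entrez database and query string.

    esearch_url is a hypothetical helper written for this example.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)

# Query text from the slide, with field tags in square brackets (assumed).
AMP_QUERY = 'refseq[filter] AND "protein has antimicrobial activity"[properties]'

url = esearch_url("protein", AMP_QUERY)
print(url)
```

Passing `"nuccore"` instead of `"protein"` as `db` would target the Nucleotide database with the same query.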
Human calcitonin-related polypeptide alpha
• CALCA encodes three peptide hormones:
- Calcitonin: calcium & phosphorus regulation
- Calcitonin gene-related peptide (CGRP): AMP and vasodilator
- Katacalcin: calcium regulation
CALCA, NCBI Gene ID: 796
PubMed: 18603306
Human calcitonin-related polypeptide alpha
NCBI annotation updated March 2015
CALCA: screenshot from NCBI Gene ID: 796
RefSeq NP_001029125.1
Calcitonin
Katacalcin
Calcitonin gene-related peptide (AMP)
Examples of Curation
• Antimicrobial peptides
• Histones
• Regulatory upstream open reading frames (uORFs)
Replication-dependent Histones
• Most eukaryotic mRNAs have poly(A) tails
- Precursor RNA undergoes endonucleolytic cleavage
- Poly(A) addition
- Replication-independent histones have a poly(A) tail
• Replication-dependent histones do not have poly(A) tails
- Precursor RNA undergoes endonucleolytic cleavage between a conserved hairpin and a purine-rich histone downstream element (HDE)
- Expression is cell cycle regulated (G1/S phase)
Replication-dependent Histones
Precursor transcripts are processed at an endonucleolytic cleavage
site located between a conserved 16 nucleotide stem-loop
structure and a purine-rich histone downstream element (HDE).
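To make the 16-nucleotide stem-loop concrete, here is a rough sketch of how a candidate hairpin could be located in a 3’ UTR sequence. It assumes the 16 nt consist of a 6-bp stem flanking a 4-nt loop and accepts only strict Watson-Crick pairing; the function names and the toy sequence are invented for illustration and do not represent the RefSeq curation pipeline.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_stem_loops(seq, stem_len=6, loop_len=4):
    """Return (start, hairpin) for every window whose two stem arms pair.

    Assumption: stem_len=6 and loop_len=4 give the 16-nt structure;
    only perfect Watson-Crick stems are reported (no G-U wobble).
    Start coordinates are 0-based.
    """
    seq = seq.upper()
    size = 2 * stem_len + loop_len
    hits = []
    for i in range(len(seq) - size + 1):
        left = seq[i:i + stem_len]
        right = seq[i + stem_len + loop_len:i + size]
        if reverse_complement(left) == right:
            hits.append((i, seq[i:i + size]))
    return hits

# Toy sequence (made up): 4 nt of flank, a 16-nt hairpin, 5 nt of flank.
toy_utr = "AAAA" + "GGCTCTTTTCAGAGCC" + "ACCCA"
print(find_stem_loops(toy_utr))  # [(4, 'GGCTCTTTTCAGAGCC')]
```

A scan like this only proposes candidates; confirming the 3’ UTR end and the absence of a poly(A) tail still requires transcript evidence.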
Curation (human & mouse genes):
• Confirm no polyA
• Confirm 3’UTR end
• Annotate stem loop
Results:
• 133 Transcript records
• 133 Genes
5’ UTR - CDS - 3’ UTR…AAA
Dominski & Marzluff (2007) Gene. PMID: 17531405
[Figure: conserved histone stem-loop structure. Labels: red = ultra-conserved bases; precursor RNA cleavage site; ACNNN; histone downstream element (HDE).]
NCBI Reference Sequence: NM_003539.3
Graphics Format
GenBank Format
Query nucleotide:
refseq[filter] AND stem_loop[feature key] AND histone[title]
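Once a record is retrieved in GenBank format, the annotated stem-loop appears as a `stem_loop` entry in the feature table. The sketch below pulls feature locations out of flat-file text with a simple regular expression. It assumes the standard layout in which feature keys are indented by five spaces, and the sample text uses made-up coordinates rather than the real content of NM_003539.3.

```python
import re

# A feature line: five spaces, the feature key, padding, then the location.
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+(\S+)$")

def features_by_key(flatfile_text, key):
    """Return location strings for all features with the given key."""
    hits = []
    for line in flatfile_text.splitlines():
        match = FEATURE_LINE.match(line)
        if match and match.group(1) == key:
            hits.append(match.group(2))
    return hits

# Illustrative excerpt only; the coordinates are invented.
SAMPLE = """\
FEATURES             Location/Qualifiers
     source          1..400
     CDS             20..330
     stem_loop       360..375
                     /note="illustrative only"
"""

print(features_by_key(SAMPLE, "stem_loop"))  # ['360..375']
```

For real records, a dedicated GenBank parser is a more robust choice, since locations can span multiple lines.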
Examples of Curation
• Antimicrobial peptides
• Histones
• Regulatory upstream open reading frames (uORFs)
Regulatory uORFs
• Found in about 40% of all mRNAs
• Thought to regulate translation of the primary open reading frame (pORF)
• The uORF competes for ribosomes, down-regulating translation of the pORF
• Translation of the pORF may rely on leaky scanning or reinitiation events
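As a rough illustration of what a uORF is at the sequence level, the sketch below scans a transcript for ATG codons upstream of the pORF start and reports those followed by an in-frame stop codon. The function name, the coordinate convention, and the toy transcript are assumptions made for this example; it is not the curation method described in the talk, which relies on published evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_uorfs(mrna, porf_start):
    """Return (start, end) pairs for ORFs whose ATG lies upstream of porf_start.

    Coordinates are 0-based and end-exclusive; end includes the stop codon.
    find_uorfs is a hypothetical helper for this example.
    """
    mrna = mrna.upper()
    uorfs = []
    for start in range(porf_start):
        if mrna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the reading frame set by this ATG.
        for pos in range(start + 3, len(mrna) - 2, 3):
            if mrna[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs

# Toy transcript (made up): a 9-nt uORF at 2..11, pORF starting at 13.
toy_mrna = "GG" + "ATGAAATAG" + "CC" + "ATGGCCTAA"
print(find_uorfs(toy_mrna, porf_start=13))  # [(2, 11)]
```

An upstream ATG that is in frame with the pORF and has no intervening stop is also reported here, ending at the pORF stop codon. Telling such N-terminal extensions apart from true uORFs would need an extra check.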
Human major vault protein (MVP)
Graphical display of MVP gene annotation on human chromosome 16
NC_000016.10, Reference Assembly GRCh38.p2, GCF_000001405.28
* uORF is found in one alternate 5’ UTR (NM_017458.3)
# 5’ UTR is partial
PMID: 11297743
MVP GeneID: 9961
NM_017458.3, variant 1
uORF start codon:
Exon 1 3’: AT
Exon 2 5’: G
NM_005115.4, variant 2
Exon 1 3’: AT
Exon 2 5’: T
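The exon-end bases listed above show why only variant 1 carries the uORF start codon: splicing joins the last two bases of exon 1 to the first base of exon 2. The following small sketch spells that out; `junction_triplet` is a hypothetical helper name, and only the bases shown on the slide are used.

```python
def junction_triplet(exon1_3prime_end, exon2_5prime_start):
    """Join the bases on either side of a splice junction."""
    return (exon1_3prime_end + exon2_5prime_start).upper()

# Exon-end bases as listed for the two MVP transcript variants.
variants = {
    "NM_017458.3": ("AT", "G"),
    "NM_005115.4": ("AT", "T"),
}

for accession, (exon1_end, exon2_start) in variants.items():
    triplet = junction_triplet(exon1_end, exon2_start)
    verdict = "start codon" if triplet == "ATG" else "no start codon"
    print(accession, triplet, verdict)
```

Variant 1 yields ATG across the junction, while variant 2 yields ATT, which matches the uORF being found in only one alternate 5’ UTR.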
RefSeq transcript record (MVP variant 1)
NM_017458.3
Review status
Data source
Functional Summary
Transcript Variant Description
RefSeq Attributes
Support evidence
Query Nucleotide:
refseq[filter] AND "regulatory uORF"[properties]
GenBank format (NM_017458.3 MVP variant 1)
Curation:
• Annotate uORF region
• Add RefSeq attribute
• Add publications
Results (uORF):
• 260 Transcript records
• 150 Genes
• Human > Mouse > Rat
Acknowledgements
RefSeq Vertebrate Curation Group, especially:
Mike Murphy (AMPs)
Lillian Riddick (AMPs, Histones)
Wendy Wu (uORFs)
DB & programming:
Terence Murphy
Mike DiCuccio
Andrei Shkeda
NCBI Leadership:
• David Lipman
• James Ostell
Eric Cox
Catherine Farrell
Tamara Goldfarb
Tripti Gupta
Vinita Joardar
Vamsi Kodali
Kelly McGarvey
Nuala O'Leary
Shashi Pujar
Bhanu Rajput
Sanjida Rangwala
Dave Webb
Matt Wright