
InnoMol Proteomics Workshop
April 8, 2014
Principles of Shotgun Proteomics
and Proteogenomics
Boris Maček
Proteome Center Tuebingen
General MS-based proteomics workflow
Aebersold R and Mann M. 2003. Nature 422: 198-207
Principle of protein database search
[Figure: a measured fragment ion spectrum (intensity vs. m/z) is compared with theoretical spectra for proteins derived from the translated genomic sequence (database)]
Theoretical spectra that fall into the defined mass range
are each compared to our fragment ion spectrum.
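The comparison step can be illustrated with a minimal Python sketch. It assumes unmodified peptides, singly charged b- and y-ions, and a simple shared-peak count as the score; real search engines such as MaxQuant use more elaborate scoring. The function names and tolerances are hypothetical and chosen only for illustration.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O, monoisotopic
PROTON = 1.007276   # mass of a proton


def peptide_mass(seq):
    """Monoisotopic mass of the neutral, unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def fragment_ions(seq):
    """Singly charged b- and y-ion m/z values (the theoretical spectrum)."""
    b_ions, y_ions = [], []
    for i in range(1, len(seq)):
        b_ions.append(sum(RESIDUE_MASS[aa] for aa in seq[:i]) + PROTON)
        y_ions.append(sum(RESIDUE_MASS[aa] for aa in seq[i:]) + WATER + PROTON)
    return b_ions, y_ions


def candidates_in_mass_window(peptides, precursor_mass, tol_ppm=10.0):
    """Keep only database peptides whose mass matches the measured precursor."""
    tol = precursor_mass * tol_ppm / 1e6
    return [p for p in peptides if abs(peptide_mass(p) - precursor_mass) <= tol]


def shared_peak_count(seq, observed_mz, tol=0.5):
    """Number of theoretical fragments found among the observed peaks."""
    b_ions, y_ions = fragment_ions(seq)
    return sum(any(abs(t - o) <= tol for o in observed_mz)
               for t in b_ions + y_ions)
```

The candidate with the highest score among those in the precursor mass window would be reported as the match for the spectrum.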
Principle of protein database search
[Figure: fragment ion spectrum (intensity vs. m/z) searched with MaxQuant software against a protein sequence database]
>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGAR
RSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQP
ESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLG
LALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLT
LWTSENQGDEGDAGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARR
ASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANT
GESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRL
GLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNL
TLWTSDMQGDGEEQNKEALQDVEDENQ
>sp|P62258-2|1433E_HUMAN Isoform SV of 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE
MVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKL
KMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFA
TGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVFYYEILNSPDRACR
LAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDV
EDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH PE=1 SV=4
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARR
SSWRVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCND
FQYESKVFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPI
RLGLALNFSVFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRD
NLTLWTSDQQDEEAGEGN
>tr|F2Z3E5|F2Z3E5_HUMAN Hydroxyacid-oxoacid transhydrogenase, mitochondrial OS=Homo sapiens GN=ADHFE1 PE=4 SV=1
MAAAARARVAYLLRQLQRAACQCPTHSHTYSQDGCFKY
>tr|Q5SS58|Q5SS58_HUMAN MHC class I polypeptide-related sequence A OS=Homo sapiens GN=MICA PE=4 SV=2
MGQRDQGLDRERKGPQDDPGSYQGPERRNFLKEDAMKTKTHYHAMHADCLQELRRYL
ESGVVLRRTVPPMVNVTRSEASEGNITVTCRASSFYPRNIILTWRQDGVSLSHDTQQ
WGDVLPDGNGTYQTWVATRICRGEEQRFTCYMEHSGNHSTHPVPSGKVLVLQSHWQT
FHVSAVAAGCCYFCYYYFLCPLL
>tr|Q5T409|Q5T409_HUMAN Disrupted in schizophrenia 1 OS=Homo sapiens GN=DISC1 PE=2 SV=1
MPGGGPQGAPAAAGGGGVSHRAGSRDCLPPAACFRRRRLARRPGYMRSSTGPGIGFL
SPAVGTLFRFPGGVSGEESHHSESRARQCGLDSRGLLVRSPVSKSAAAPTVTSVRGT
SAHFGIQLRGGTRLPDRLSWPCGPGSAGWQQEFAAMDSSETLDASWEAACSDGARRV
RAAGSLPSAELSSNSCSPGCGPEVPPTPPGSHSAFTSSFSFIRLSLGSAGERGEAEG
CPPSREAESHCQSPQEMGAKAASLDGPHEDPRCLSRPFSLLATRVSADLAQAARNSS
RPERDMHSLPDMDPGSSSSLDPSLAGCGGDGSSGSGDAHSWDTLLRKWEPVLRDCLL
RNRRQMEVISLRLKLQKLQEDAVENDDYDKAETLQQRLEDLEQEKISLHFQLPSRQP
ALSSFLGHLAAQVQAALRRGATQQASGDDTHTPLRMEPRLLEPTAQDSLHVSITRRD
WLLQEKQQLQKEIEALQARMFVLEAKDQQLRREIEEQEQQLQWQGCDLTPLVGQLSL
GQLQEVSKALQDTLASAGQIPFHAEPPETIRSLQERIKSLNLSLKEITTKVCMSEKF
CSTLRKKVNDIETQLPALLEAKMHAISGNHFWTAKDLTEEIRSLTSEREGLEGLLSK
LLVLSSRNVKKLGSVKEDYNRLRREVEHQETAYETSVKENTMKYMETLKNKLCSCKC
PLLGKVWEADLEACRLLIQSLQLQEARGSLSVEDERQMDDLEGAAPPIPPRLHSEDK
RKTPLKESYILSAELGEKCEDIGKKLLYLEDQLHTAIHSHDEDLIHSLRRELQMVKE
TLQAMILQLQPAKEAGEREAAASCMTAGVHEAQA
[Figure: translated genomic sequence (database) → theoretical spectra for proteins]
Homo sapiens Reference Proteome
71,434 entries
(20,246 reviewed proteins)
(51,188 unreviewed)
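How a FASTA database of this kind is turned into candidate peptides can be sketched as follows. This is a minimal illustration, assuming fully specific trypsin cleavage (after K or R, not before P) and no missed cleavages; the function names are hypothetical, and the example sequence is only the first 20 residues of the P31946 entry.

```python
def read_fasta(text):
    """Parse FASTA text into {header: sequence}; sequence lines are joined."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def tryptic_digest(protein, min_len=1):
    """Cleave C-terminal to K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return [p for p in peptides if len(p) >= min_len]


# Truncated example record (first 20 residues only, for illustration).
example = """>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha
MTMDKSELVQ
KAKLAEQAER
"""
for header, sequence in read_fasta(example).items():
    print(header.split("|")[1], tryptic_digest(sequence))
```

Each resulting peptide would then be used to generate a theoretical spectrum for comparison with the measured fragment ion spectra.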
MS instrumentation in proteomics
Aebersold R and Mann M. 2003. Nature 422: 198-207
Coupling LC to MS for complex mixture analysis
Nanoflow LC/MS interface set-up:
Column (75 µm)/spray tip (8 μm)
Proxeon Easy nLC
nanoflow LC System
Reverse-phase C18 beads, 3 μm
LTQ-Orbitrap
No precolumn or split!
Platinum wire
2.0 kV
12-15 cm
Sample loading: ~700 nl/min
Gradient elution: ~200 nl/min
Coupling LC to MS for complex mixture analysis
BSA tryptic
in-solution digest
50 fmol on column
LTQ-Orbitrap (2005)
[Figure: instrument schematic with source, linear ion trap (LTQ), C-trap, octopole collision cell and Orbitrap]
LTQ-FT MS/MS optimized scan cycle:
Orbitrap-MS: MS full scan → peptide mass measurement
LTQ-MS: MS2 scans → peptide sequencing
[Figure: scan cycle timeline; x-axis: Time [msec], 0 to 1800]
Data processing workflow: MaxQuant
Acquisition speed
[Figure: CID spectra identified vs. not identified; LTQ Orbitrap XL vs. LTQ Orbitrap Velos]
Acquisition speed
[Figure: number of MS/MS scans (0 to 120,000) at 60, 100, 140 and 240 min; LTQ Orbitrap XL (2007), LTQ Orbitrap Velos (2009), LTQ Orbitrap Elite (2011)]
Stable Isotope Labeling by Amino Acids in
Cell Culture (SILAC)
"normal AA" (Lys-12C6): resting cells
"heavy AA" (Lys-13C6): treated cells (drug, GF)
Combine and lyse,
protein purification
or fractionation
Proteolysis
(trypsin, Lys-C, etc.)
Quantitation and identification by MS
(nanoscale LC-MS/MS)
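The mass difference that separates the light and heavy peptide forms follows directly from the label. The sketch below assumes Lys-13C6 labeling only (six 13C atoms per lysine) and a heavy/light intensity ratio as the quantitative readout; the function names are hypothetical.

```python
C12 = 12.0          # mass of 12C (Da), by definition
C13 = 13.0033548    # mass of 13C (Da)

# Six carbons exchanged per labeled lysine.
LYS6_SHIFT = 6 * (C13 - C12)   # about 6.0201 Da


def silac_mz_shift(n_lysines, charge):
    """m/z spacing between the light and heavy peak of a SILAC pair."""
    return n_lysines * LYS6_SHIFT / charge


def silac_ratio(light_intensity, heavy_intensity):
    """Heavy/light ratio, e.g. treated vs. resting cells."""
    return heavy_intensity / light_intensity


# A doubly charged peptide with one lysine: peaks are about 3.01 m/z apart.
print(round(silac_mz_shift(1, 2), 4))
```

Because both forms co-elute and are measured in the same scan, the ratio of their intensities reflects the relative protein abundance in the two cell populations.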
Current research at the PCT
• Proteogenomics
• B. subtilis, E. coli (Krug et al, 2011, Mol Biosystems; 2013 MCP)
• Pristionchus pacificus (Borchert et al, 2010, Genome Res)
• cancer cell lines/tissues
• Proteomics for systems biology
• In-depth sequencing and quantitation of model organisms (B. subtilis,
E. coli, S. pombe, A. thaliana) (Soufi et al, 2010, J Prot Res; Schütz et al, 2011,
Plant Cell; Soufi et al, 2012, Curr Opinion Microbiol; Soares et al, 2013, JPR)
• Phosphoproteomics
• targets of Aurora kinase in S. pombe (Koch et al, 2011, Science Signaling)
• targets of protein kinase D in human cells (Franz-Wachtel et al., 2012, MCP)
• targets of S/T/Y kinases and phosphatases in B. subtilis and E. coli
• Protein modifications
• ubiquitylation (Ikeda et al, 2011, Nature)
• lysine acetylation (Carpy et al., in preparation)
• Clinical proteomics
• genetic rescue of Fragile X phenotype in FMR1 KO mice
Super-SILAC in Bacteria
E. coli: Replicate 1 and 2
Parameter                     Number
Total MS/MS                   757,835
Total Peptides Identified     18,273
Total Proteins Identified     2,292
Single Peptide Hits           6.5%
Total Proteins Quantified*    1,923
*in all phases of growth
Soufi et al. in preparation
Biological reproducibility
Soufi et al. in preparation
Proteome dynamics during growth
Soufi et al. in preparation
Dynamics of stress proteins during growth
Soufi et al. in preparation
Estimation of absolute copy numbers
[Figure: growth curve, OD600 (0 to 1.1) vs. time (min; 0 to 800, 1800, 5760) with sampling points T1 to T7; UPS standard (iBAQ)]
Soufi et al. in preparation
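A rough sketch of how an iBAQ-based estimate with a spiked-in standard is commonly set up is given below. It assumes the usual definition of iBAQ (summed peptide intensity divided by the number of theoretically observable peptides) and a linear calibration in log-log space against standard proteins of known amount; whether the study used exactly this procedure is not stated on the slide. All function names and the numbers in the usage note are hypothetical.

```python
import math


def ibaq(summed_intensity, n_observable_peptides):
    """iBAQ value: summed intensity per theoretically observable peptide."""
    return summed_intensity / n_observable_peptides


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrate(standard_ibaq, standard_amount):
    """Fit log10(amount) against log10(iBAQ) for the standard proteins."""
    xs = [math.log10(v) for v in standard_ibaq]
    ys = [math.log10(v) for v in standard_amount]
    return fit_line(xs, ys)


def estimate_amount(ibaq_value, slope, intercept):
    """Apply the calibration to a protein of unknown amount."""
    return 10 ** (slope * math.log10(ibaq_value) + intercept)
```

For example, with made-up standard values `calibrate([1e5, 1e6, 1e7], [1, 10, 100])`, a protein with an iBAQ value of 1e8 would be estimated at about 1000 units. Converting such amounts into copies per cell additionally requires the number of cells analyzed.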
Summary of absolutely quantified proteins
                          During Growth   Membrane Proteins
Identified                2,292           684
Quantified (All Phases)   1,923           588
Absolutely Quantified     2,096           494
Soufi et al. in preparation
Most abundant Proteins (ES)
Protein                                         Copies per cell (ES)
Elongation factor Tu 1;P-43                     341,047.56
Outer membrane protein A                        313,464.22
Braun lipoprotein                               216,037.00
Cysteine synthase A;O                           187,791.26
Enolase                                         164,914.38
DNA-binding protein HU-alpha                    136,208.45
Scavengase P20;Thiol peroxidase                 131,599.61
Glyceraldehyde-3-phosphate dehydrogenase A      127,416.09
Malate dehydrogenase                            123,943.77
IDP;Isocitrate dehydrogenase [NADP]             117,787.02
High-affinity zinc uptake system protein znuA   111,748.80
Cadmium-induced protein yodA                    107,098.12
Outer membrane protein C                        106,108.02
50S ribosomal protein L6                        98,724.11
Universal stress protein A                      94,784.63
Soufi et al. in preparation
Dynamic range of protein abundance
[Figure: histogram of Log2 Protein Copy Number (x-axis) vs. Count (y-axis); blue: all proteins, red: membrane proteins]
Soufi et al. in preparation
Proteogenomics
• Application of tandem mass spectrometry to genome re-annotation
• Search MS/MS spectra against a database containing the complete genome
translated in 6 reading frames
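A six-frame translation can be sketched in a few lines of Python. The example assumes the standard genetic code, an input containing only A, C, G and T, and the convention that frames 1 to 3 lie on the forward strand and frames 4 to 6 on the reverse complement; the frame names mirror those used in the database layout of this talk, while the function names are hypothetical.

```python
from itertools import product

# Standard genetic code; codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; '*' marks a stop, 'X' an unknown codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the six translated reading frames of a DNA sequence."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"Frame{offset + 1}"] = translate(dna[offset:])
        frames[f"Frame{offset + 4}"] = translate(rc[offset:])
    return frames


print(six_frame_translation("ATGGCCTAA"))
```

In practice each frame would be split at stop codons into open reading frames, and only stretches above a minimum length would be written to the search database.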
Problem: database size and structure
"Usual" Proteomics applications
Predicted ORFs
REV_Predicted ORFs
• Incompatibility with some data
processing programs
• Long search times
Proteogenomics applications
Predicted ORFs
Frame1
Frame2
Frame3
Frame4
Frame5
Frame6
REV_Predicted ORFs
REV_Frame1
REV_Frame2
REV_Frame3
REV_Frame4
REV_Frame5
REV_Frame6
• Decreased sensitivity of database
search
• Unequal target and decoy search
spaces
• Most translated frames are in fact
decoy sequences
• Overestimation of the FDR
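The target-decoy logic behind these points can be sketched as follows. The example assumes reversed sequences as decoys, the REV_ prefix used in the database layout of this talk, and the simple estimate FDR = decoy hits / target hits, which presumes that a random match is equally likely to land in the target and the decoy part of the database. The function names and the scored hits in the usage note are hypothetical.

```python
def add_decoys(proteins, prefix="REV_"):
    """Append a reversed decoy entry for every target sequence."""
    database = dict(proteins)
    for name, sequence in proteins.items():
        database[prefix + name] = sequence[::-1]
    return database


def estimate_fdr(hits, score_threshold, prefix="REV_"):
    """hits: list of (protein_name, score); a higher score is better.

    Returns decoy hits / target hits among the accepted identifications.
    """
    accepted = [name for name, score in hits if score >= score_threshold]
    decoys = sum(name.startswith(prefix) for name in accepted)
    targets = len(accepted) - decoys
    return decoys / targets if targets else 0.0
```

With the made-up hits `[("P1", 90), ("P2", 80), ("REV_P3", 75), ("P4", 60), ("REV_P5", 20)]` and a threshold of 50, one decoy and three targets pass, giving an estimated FDR of 1/3. When most six-frame entries encode no real protein, the equal-likelihood assumption no longer holds, which is the source of the bias noted in the list above.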
Proteogenomics of E. coli
• Model Gram-negative bacterium
• Small (4.6 Mb) and well characterized genome
• ~4,300 protein coding genes (manually annotated and reviewed)
• Comprehensive high accuracy MS dataset comprising >42,000 unique
peptide sequences from >2,600 proteins
• Hypothesis: genome annotation approaches completeness
• Assessment of general properties of a simple proteogenomic experiment
Results I
                               MQ          TPP
MS/MS spectra acquired         1,941,724   1,941,724
MS/MS spectra identified       370,231     162,028
MS/MS spectra identified (%)   19.1        8.3
Peptide sequences              33,964      25,724
Novel peptides                 263         59
Decoy peptides                 336         0
Lab contaminant peptides       306         209
E. coli proteins               2,653       2,524
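The reported identification percentages can be checked with a short calculation. The sketch assumes that the first set of figures belongs to the MQ workflow and the second to TPP, as the order of the values suggests; the function name is hypothetical.

```python
def identification_rate(identified, acquired):
    """Percentage of acquired MS/MS spectra that were identified."""
    return 100.0 * identified / acquired


rates = {
    "MQ": identification_rate(370_231, 1_941_724),
    "TPP": identification_rate(162_028, 1_941_724),
}
for workflow, rate in rates.items():
    print(f"{workflow}: {rate:.1f}% of MS/MS spectra identified")
```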
Proteogenomics of E. coli
1.9M peptide mass spectra
Proteogenomics of E. coli
[Figure, panels A and B: genomic region with fes, fepa and ybdz; tracks: annotated genes, detected peptides, six-frame ORFs; x-axis: Position (Mb)]
PEP = 4.02E-08
PP = 0.9999
fes: MFEVTFWWRDPQGSEEY...
VGSESWWQSK
TWGYGVTALKVGSESWWQSKHGPEWQRLNDEMFEVTFWWRDPQGSEEY...
[Figure, panels C and D: genomic region with yhja, yhjb and tref; tracks: annotated genes, detected peptides, six-frame ORFs; x-axis: Position (Mb)]
PEP = 0.027976
PP = 0.9504
tref: MLNQKIQNPNPDELMIEVDLCYELDPYELKLDEMIEAEP...
KPPQIRISL
...NAVFKPPQIRISL
LATNFGGWILMLNQKIQNPNPDELMIEVDLCYELDPYELKLDEMIEAEP...
Krug et al. Mol Cell Proteomics, 2013
Majority of Novel Peptides are False Positives
Krug et al. Mol Cell Proteomics, 2013
Assessment of Processing Workflows
Krug et al. Mol Cell Proteomics, 2013
Deep Proteome Coverage of Escherichia coli
[Figure: distribution of MS/MS scans; axis: 0 to 100]
Mean: 20 scans
Median: 7 scans
20-fold base coverage of 27.5% genome sequence
Krug et al. Mol Cell Proteomics, 2013
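Coverage by detected peptides can be illustrated at the level of a single protein sequence. This is a minimal sketch, assuming exact string matches of peptide sequences and counting every residue once regardless of how many peptides cover it; the function name is hypothetical and this is not the procedure used to derive the genome-level figures above.

```python
def sequence_coverage(protein, peptides):
    """Fraction of protein residues covered by at least one peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        start = protein.find(peptide)
        while start != -1:  # mark every occurrence of the peptide
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


# One peptide of six residues covers 6 of 13 positions.
print(sequence_coverage("MTMDKSELVQKAK", ["SELVQK"]))
```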
Conclusions
• proteomics reaches analytical capacity to identify and quantify all gene
products in microorganisms grown in culture
• several regulatory protein modifications (e.g. S/T/Y-phosphorylation, lysine
acetylation) can routinely be analyzed on a global scale
• many challenges ahead:
• analysis of H/D-phosphorylation
• analysis of environmental samples
• coverage of genome/protein sequence by detected peptides
• future developments:
• faster MS/MS acquisition
• smarter acquisition software
• large-scale targeted proteomics
• metaproteomics and individual proteomics
Acknowledgements
Proteome Center Tuebingen
Boumediene Soufi
Nelson C. Soares
Philipp Spät
Karsten Krug
Alejandro Carpy
Sasa Popic
Silke Wahl
Funding