BIONF/BENG 203: Functional Genomics Sources of Functional Data Lectures 1 and 2

Trey Ideker
UCSD Departments of Medicine & Bioengineering
Instructors
• Trey Ideker
• Vineet Bafna
• Anand Patel (TA)
Grading
 40%
Problem Sets (best 4 of 5)
 30% Midterm
 30% Final Project
Topics Covered By This Course
① Signal detection in bioinformatics
② Large-scale data generation platforms
③ Understanding next-gen sequencing data
④ Understanding mass spectrometry data
⑤ Clustering and Classification
⑥ Genotype-phenotype association
⑦ Understanding physical & genetic networks
⑧ Gene network inference and evolution
Bioinformatics as Signal Detection
Ideker, Dutkowski, Hood. Cell 2011
Power, FDR, and all that...
Test Statistic t
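To make "power" and "FDR" concrete, here is a minimal sketch that thresholds a test statistic t and counts the outcomes. The statistic values, the threshold, and the function name `power_and_fdr` are hypothetical and chosen only for illustration; they do not come from the cited paper.

```python
def power_and_fdr(null_stats, alt_stats, threshold):
    """Call a test significant when its statistic t is >= threshold.

    null_stats: statistics from tests where there is no real signal.
    alt_stats:  statistics from tests where a real signal exists.
    """
    false_pos = sum(t >= threshold for t in null_stats)
    true_pos = sum(t >= threshold for t in alt_stats)
    called = false_pos + true_pos
    power = true_pos / len(alt_stats)             # fraction of real signals detected
    fdr = false_pos / called if called else 0.0   # fraction of calls that are false
    return power, fdr


# Hypothetical test statistics, for illustration only.
null_stats = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.1, 1.9, 2.3]
alt_stats = [1.5, 2.1, 2.8, 3.4]

power, fdr = power_and_fdr(null_stats, alt_stats, threshold=2.0)
print(power, fdr)  # 0.75 0.25
```

Raising the threshold generally lowers the FDR at the cost of power, which is the trade-off the slide title points at.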
An Example: Pathway-Level Integration of Genome-wide Association Studies
Segrè et al., 2010: A.V. Segrè, L. Groop, V.K. Mootha, M.J. Daly and D. Altshuler, PLoS Genet. 6 (2010), e1001058.
Classes of biological measurements
1) Molecular States
• DNA sequence / genotype: Next-gen sequencing, SNP & CNV arrays
• Gene expression: DNA microarrays, mRNA sequencing
• Protein levels, locations, mods: Mass spectrometry, fluorescence microscopy, protein arrays
2) Molecular Networks
• Protein-protein interactions: Two-hybrid system, coIP, protein antibody array
• Protein-DNA interactions: Chromatin IP (ChIP) sequencing
• Protein-compound
3) Phenotypic traits
• Physiological or disease state, binary or quantitative
• Growth rate, response to stimulus or stress
• Behaviors
Sequencing By Synthesis
(Illumina Genome Analyzer or HiSeq)
Bridge Amplification
Pyrosequencing
Note: No actual houses are burned down in pyrosequencing
Pyrosequencing
(454 Life Sciences / Roche)

• A luciferase is an enzyme that emits light in the presence of ATP.
• Several organisms, such as the American firefly and the poisonous jack-o'-lantern mushroom, produce luciferases.
Detecting polymerase activity



• Recall: Pyrophosphate is also known as PPi, also known as “two phosphate groups stuck together”. During replication, each addition of a dNTP releases pyrophosphate.
• In the reaction mixture, PPi allows adenosine phosphosulfate (APS) to be converted to ATP (by the enzyme ATP sulfurylase); this ATP allows luciferase to luciferate (emit light).
• Measures strand extension as it happens
Pyrosequencing cycle




• Add dATP. If light is emitted, your sequence starts with A. If not, the dATP is degraded (or elutes past the immobilized primer).
• Add dGTP. If light is emitted, the next base must be a G.
• Then add T, then C. You now know at least one (maybe more) base of the sequence.
• Repeat!
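The cycle above can be sketched as a small simulation. This is an idealized model, assuming the light signal for each nucleotide flow equals the number of bases incorporated in that flow; the function name `flowgram` and the `max_cycles` cap are hypothetical.

```python
def flowgram(read, flow_order="AGTC", max_cycles=20):
    """Simulate idealized pyrosequencing flows for the strand being synthesized.

    Returns (nucleotide, signal) pairs, where signal is the number of bases
    added during that flow (0 means no light was emitted).
    """
    flows = []
    pos = 0
    for _ in range(max_cycles):
        for base in flow_order:
            run = 0
            # The polymerase keeps extending while the next base matches.
            while pos < len(read) and read[pos] == base:
                run += 1
                pos += 1
            flows.append((base, run))
            if pos == len(read):
                return flows
    return flows


print(flowgram("GGCCCTTG"))
# [('A', 0), ('G', 2), ('T', 0), ('C', 3), ('A', 0), ('G', 0), ('T', 2), ('C', 0), ('A', 0), ('G', 1)]
```

Flows with signal 0 are the "no light" steps in which the added nucleotide is degraded or washed away.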
Pyrosequencing output
Runs of bases produce higher peaks – for instance, the sequence for (a) is GGCCCTTG. Sample (c) comes from a heterozygous individual (hence the peak heights in multiples of ½).
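Reading a sequence back from peak heights can be sketched as follows. The peak heights below are hypothetical noisy values, and rounding each height to the nearest integer is an assumed, simplified base-calling rule rather than what any instrument software actually does.

```python
def decode_flowgram(flows):
    """Turn (nucleotide, peak height) pairs into a sequence.

    Each peak height is rounded to the nearest whole number of bases.
    """
    return "".join(base * round(height) for base, height in flows)


def looks_heterozygous(height, tol=0.2):
    """Flag peaks whose height sits near a half-integer (e.g. 0.5, 1.5)."""
    return abs(height % 1 - 0.5) < tol


# Hypothetical noisy peak heights for the homozygous sequence GGCCCTTG.
flows = [("A", 0.05), ("G", 1.9), ("T", 0.1), ("C", 3.1), ("A", 0.0),
         ("G", 0.1), ("T", 2.05), ("C", 0.02), ("A", 0.1), ("G", 0.95)]

print(decode_flowgram(flows))   # GGCCCTTG
print(looks_heterozygous(1.5))  # True
```

A peak near a half-integer is consistent with two alleles that differ at that position, as in sample (c).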
The X Prize Foundation
In October 2006, the X Prize Foundation established an initiative to promote the development of full genome sequencing technologies, called the Archon X Prize, intending to award $10 million to “the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 (US) per genome.”
http://genomics.xprize.org/
Gene and Protein Expression






• The transcriptome is the full complement of RNA molecules produced by a genome
• The proteome is the full complement of proteins enabled by the transcriptome
• DNA → RNA → protein
• Genome → transcriptome → proteome
• 30,000 genes → ??? RNAs → ??? proteins?
• For example, the Drosophila gene Dscam can generate 40,000 distinct transcripts through alternative splicing.
• What is the minimum number of exons that would be required?
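One way to approach the question above, as a sketch under a stated assumption: if every alternative exon is independently either included or skipped, n such exons allow at most 2^n distinct transcripts. Other splicing models (for example, clusters of mutually exclusive exons) would give a different answer. The function name `min_cassette_exons` is hypothetical.

```python
def min_cassette_exons(n_transcripts):
    """Smallest n with 2**n >= n_transcripts.

    Assumes each exon is independently included or skipped.
    """
    exons = 0
    while 2 ** exons < n_transcripts:
        exons += 1
    return exons


print(min_cassette_exons(40_000))  # 16, since 2**15 = 32,768 < 40,000 <= 2**16 = 65,536
```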
mRNA Expression: Two dominant approaches
RNA sequencing
DNA Microarrays
Others / older approaches:
• EST sequencing
• RT-PCR
• Differential display
• SAGE
• Massively parallel signature sequencing (MPSS)
Microarrays
Monitors the level of each gene:
Is it turned on or off in a particular biological condition?
Is this on/off state different between two biological conditions?
A microarray is a rectangular grid of spots printed on a glass microscope slide, where each spot contains DNA for a different gene
Two-color DNA microarray design
Reverse Transcription
Types of microarrays

• Spotted (cDNA)
– Robotic transfer of cDNA clones or PCR products
– Spotting on nylon membranes or glass slides coated with poly-lysine
• Synthetic (oligo)
– Direct oligo synthesis on solid microarray substrate
– Uses photolithography (Affymetrix) or ink-jet printing (Agilent)
– 100,000 features per cm²
• All configurations assume the DNA on the array is in excess of the hybridized sample; thus the kinetics are linear and the spot intensity reflects the amount of hybridized sample.
• Labeling can be radioactive, fluorescent (one-color), or two-color
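Because spot intensity is taken to be proportional to the amount of hybridized sample, a two-color spot is commonly summarized as a log ratio of the two channels. The sketch below assumes background-corrected intensities; the gene names and intensity values are hypothetical.

```python
import math


def log2_ratio(red, green):
    """log2 of the red/green intensity ratio for one spot.

    Positive: more signal in the red-labeled sample; negative: more in green.
    """
    return math.log2(red / green)


# Hypothetical (red, green) intensities per spot, for illustration only.
spots = {"geneA": (1600.0, 400.0), "geneB": (500.0, 500.0), "geneC": (250.0, 1000.0)}
ratios = {gene: log2_ratio(red, green) for gene, (red, green) in spots.items()}
print(ratios)  # {'geneA': 2.0, 'geneB': 0.0, 'geneC': -2.0}
```

Using the logarithm makes a 4-fold increase and a 4-fold decrease symmetric around zero.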
Microarray Spotter
Affymetrix High Density Arrays
Microarray confocal scanner
• Collects sharply defined optical sections from which 3D renderings can be created
• The key is spatial filtering to eliminate out-of-focus light or glare in specimens whose thickness exceeds the immediate plane of focus.
• Two lasers for excitation
• Two color scan in less than 10 minutes
• High resolution, 10 micron pixel size
Next-Gen Sequencing of mRNAs
cDNA = complementary or copy DNA
EST = Expressed Sequence Tag





• The microarray could be described as a “closed system” because information about RNAs is limited by the targets available for hybridization. RNAs not represented on the array are not interrogated.
• Direct sequencing of cDNAs overcomes this problem by large-scale random sampling of sequences from a whole-cell RNA extract
• Statistical counting of distinct sequences provides a precise estimate of expression level
• cDNA library can be normalized to capture rare messages
• Has been dramatically enabled by large scale sequencing
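The "statistical counting" idea can be sketched as tallying reads per gene and scaling by the total number of reads. This assumes each read has already been assigned to a gene (the alignment step is not shown); the read list is hypothetical, and reads-per-million is just one common normalization.

```python
from collections import Counter


def reads_per_million(read_genes):
    """Count reads per gene and scale to reads per million sequenced reads."""
    counts = Counter(read_genes)
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


# Hypothetical gene assignments for ten sequenced reads.
reads = ["ACT1"] * 6 + ["TUB1"] * 3 + ["NIC96"] * 1
rpm = reads_per_million(reads)
print(rpm)  # {'ACT1': 600000.0, 'TUB1': 300000.0, 'NIC96': 100000.0}
```

Unlike an array, any transcript that was sampled appears in the counts, whether or not it was anticipated in advance.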
mRNA Sequencing:
Preparation of a cDNA library in phage λ vector
Proteomics
MS / MS
1D and 2D SDS PAGE
Mass spectrometry
Mass spectrometers consist of 3 essential parts
– Ionization source: Converts peptides into gas-phase ions (MALDI + ESI)
– Mass analyzer: Separates ions by mass to charge (m/z) ratio (Ion trap, time of flight, quadrupole)
– Ion detector: Current over time indicates amount of signal at each m/z value
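As a sketch of what the mass analyzer separates on, the m/z of a protonated peptide can be computed from standard monoisotopic residue masses. The peptide EACDPLR is borrowed from the ICAT slide later in this lecture; it is treated here as unmodified (no ICAT label), which is an assumption made for illustration.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free N- and C-termini
PROTON = 1.00728   # mass of each charge-carrying proton


def peptide_mz(sequence, charge):
    """m/z of an unmodified peptide carrying `charge` extra protons."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (neutral_mass + charge * PROTON) / charge


print(round(peptide_mz("EACDPLR", 1), 2))  # roughly 803.37
print(round(peptide_mz("EACDPLR", 2), 2))  # roughly 402.19
```

The same peptide appears at a different m/z for each charge state, which is why spectra are plotted against m/z rather than mass.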
MS/MS Overview
A raw fragmentation spectrum
By calculating the molecular weight difference between ions of the same type, the sequence can be determined.
Algorithms like SEQUEST use the fragmentation pattern to search through a complete protein database to identify the sequence which best fits the pattern.
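The mass-difference idea can be sketched for a ladder of same-type fragment ions: each gap between consecutive ions is matched to the closest residue mass. This is a toy illustration, not how SEQUEST works (SEQUEST scores database peptides against the spectrum). The ladder values are hypothetical singly charged b-ions computed for the peptide EACDPLR, and the tolerance is an assumed value.

```python
# Monoisotopic residue masses in daltons; L and I are indistinguishable by mass.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L/I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residues_from_ladder(ion_mz, tol=0.01):
    """Name the residue that explains each gap between consecutive ions."""
    calls = []
    for lighter, heavier in zip(ion_mz, ion_mz[1:]):
        diff = heavier - lighter
        label, mass = min(RESIDUE_MASS.items(), key=lambda kv: abs(kv[1] - diff))
        calls.append(label if abs(mass - diff) <= tol else "?")
    return calls


# Hypothetical b1..b6 ions (charge 1+) for EACDPLR.
b_ions = [130.04987, 201.08698, 304.09617, 419.12311, 516.17587, 629.25993]
print(residues_from_ladder(b_ions))  # ['A', 'C', 'D', 'P', 'L/I']
```

The first residue is not recovered from differences alone, and real spectra have missing and extra peaks, which is one reason database search is used in practice.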
Tandem Mass Spec (MS/MS)
Isotope Coded Affinity Tags (ICAT)
Mass spec based method for measuring relative protein abundances between two samples
ICAT Reagents:
Heavy reagent: d8-ICAT (X=deuterium)
Normal reagent: d0-ICAT (X=hydrogen)
[Figure: ICAT reagent structure, consisting of a biotin tag, a linker (d0 or d8) carrying eight X positions, and a thiol-specific reactive group]
Protein Quantification & Identification via ICAT Strategy
[Figure: ICAT workflow. Mixture 1 (light) and Mixture 2 (heavy) carry ICAT-labeled cysteines; combine and proteolyze (trypsin); affinity separation (avidin); quantitation from the light/heavy peak pair (relative intensity vs. m/z, 550 to 580); protein identification from the fragment spectrum (relative intensity vs. m/z, 0 to 800) of peptide NH2-EACDPLR-COOH]
ICAT Flash animation:
http://occawlonline.pearsoned.com/bookbind/pubbooks/bc_mcampbell_genomics_1/medialib/method/ICAT/ICAT.html
ICAT continued


• The heavy (blue) and light (gray) peptides are separated and quantified to produce a ratio for each peptide – here, a single peptide ratio is shown
• Each peptide is subjected to CID fragmentation in the second MS stage in order to identify it
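The quantitation step can be sketched in two parts: where to expect the heavy partner of a light peak, and the abundance ratio itself. The sketch assumes the d8 and d0 reagents differ only by eight deuterium-for-hydrogen substitutions per labeled cysteine; the intensities are hypothetical, and single peak heights stand in for the integrated peak areas that real analyses use.

```python
DEUTERIUM = 2.01410   # atomic mass of 2H, daltons
HYDROGEN = 1.00783    # atomic mass of 1H, daltons


def icat_pair_spacing(charge, n_cysteines=1):
    """Expected m/z gap between the light (d0) and heavy (d8) peptide peaks."""
    mass_shift = 8 * (DEUTERIUM - HYDROGEN) * n_cysteines
    return mass_shift / charge


def heavy_to_light_ratio(light_intensity, heavy_intensity):
    """Relative abundance of the heavy-labeled sample versus the light one."""
    return heavy_intensity / light_intensity


print(round(icat_pair_spacing(charge=1), 2))  # 8.05
print(round(icat_pair_spacing(charge=2), 2))  # 4.03
print(heavy_to_light_ratio(light_intensity=40.0, heavy_intensity=100.0))  # 2.5
```

A ratio of 2.5 would suggest the protein is about 2.5 times more abundant in the heavy-labeled mixture, provided labeling and recovery were equal for both samples.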
Gene replacement for yeast & other model species
Using homologous recombination (HR)-based gene replacement, genes can be replaced with a drug resistance cassette, tagged with GFP, epitope tagged, etc.
Systematic phenotyping
Deletion strain: yfg1, yfg2, yfg3, …
Barcode (UPTAG): CTAACTC, TCGCGCA, TCATAAT
Rich media
Growth 6 hrs in minimal media (how many doublings?)
Harvest and label genomic DNA
Systematic phenotyping with a barcode array
Ron Davis and friends…
• These oligo barcodes are also spotted on a DNA microarray
• Growth time in minimal media:
– Red: 0 hours
– Green: 6 hours
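Reading out the barcode array can be sketched as a per-strain log ratio of the 6-hour (green) signal to the 0-hour (red) signal. The intensities below are hypothetical and assumed to be already normalized between channels; the strain names follow the yfg placeholders used on the slide.

```python
import math


def barcode_log_ratios(red_0h, green_6h):
    """log2(green/red) per strain; negative values mean the strain was
    depleted from the pool during growth in minimal media."""
    return {strain: math.log2(green_6h[strain] / red_0h[strain])
            for strain in red_0h}


# Hypothetical normalized barcode intensities.
red_0h = {"yfg1": 1000.0, "yfg2": 1000.0, "yfg3": 1000.0}
green_6h = {"yfg1": 1000.0, "yfg2": 250.0, "yfg3": 500.0}

ratios = barcode_log_ratios(red_0h, green_6h)
print(ratios)  # {'yfg1': 0.0, 'yfg2': -2.0, 'yfg3': -1.0}
```

In this made-up example the yfg2 deletion strain falls furthest behind the pool, so that gene would be a candidate for being required in minimal media.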
YFP tagging for protein localization
YFP is green, transmitted light is red
NIC96: Nuclear pore
TUB1: Tubulin cytoskeleton
HHF2: Histone, nucleus
BNI4: Bud neck
Images courtesy T. Davis lab
See also work by Weissman and O’Shea labs at UCSF
Molecular Interactions
Among proteins, mRNA, small molecules, and so on…
Protein→DNA interactions: ▲ Chromatin IP
Gene levels (on/off): ▼ DNA microarray
Protein-protein interactions: ▲ Protein coIP
Protein levels (present/absent): ▼ Mass spectrometry
Biochemical reactions: ▲ Not yet!!!
Biochemical levels: ▼ Metabolic flux measurements
Measurements of molecular interactions
Protein-protein interactions
• Yeast-two-hybrid
• Kinase-substrate assays
• Co-immunoprecipitation w/ mass spec
Protein-DNA interactions
• ChIP-on-chip and ChIP-seq
Genetic interactions
• Systematic Genetic Analysis
Yeast two-hybrid method
Fields and Song
Kinase-target interactions
Mike Snyder and colleagues
Protein interactions by protein immunoprecipitation
followed by mass spectrometry
TEV = Tobacco Etch Virus proteolytic site
CBP = Calmodulin binding peptide
Protein A = IgG-binding protein from Staphylococcus
Gavin / Cellzome
ChIP measurement of protein→DNA interactions
From Figure 1 of Simon et al. Cell 2001
Genetic interactions: synthetic lethals and suppressors

Genetic Interactions:
• Widespread method used by geneticists to discover pathways in yeast, fly, and worm
• Implications for drug targeting and drug development for human disease
• Thousands are now reported in literature and systematic studies
• As with other types, the number of known genetic interactions is exponentially increasing…
Adapted from Tong et al., Science 2001
Most recorded genetic interactions are synthetic lethal relationships
[Figure: four diagrams of gene pairs A and B]
Adapted from Hartman, Garvik, and Hartwell, Science 2001
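The two interaction types named in this section can be sketched as a rule over viability observations. This is a deliberately simplified, all-or-nothing illustration (real screens measure quantitative fitness), and the function name `classify_interaction` is hypothetical.

```python
def classify_interaction(a_viable, b_viable, ab_viable):
    """Classify a gene pair from the viability of the single mutants
    (a_viable, b_viable) and of the double mutant (ab_viable)."""
    if a_viable and b_viable and not ab_viable:
        # Each mutation alone is tolerated, but the combination is lethal.
        return "synthetic lethal"
    if (not a_viable or not b_viable) and ab_viable:
        # A lethal single mutation is rescued by the second mutation.
        return "suppression"
    return "no interaction called"


print(classify_interaction(True, True, False))  # synthetic lethal
print(classify_interaction(False, True, True))  # suppression
print(classify_interaction(True, True, True))   # no interaction called
```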
Interpretation of genetic interactions (Guarente T.I.G. 1990)
Parallel Effects (Redundant or Additive)
• Single A or B mutations typically abolish their biochemical activities
Sequential Effects (Additive)
• Single A or B mutations typically reduce their biochemical activities
GOAL: Identify downstream physical pathways