PWM Genome Scan

Max-Planck-Institut
für molekulare Genetik
Comparing Methods for Identifying
Transcription Factor Target Genes
Alena van Bömmel (R 3.3.73)
Matthew Huska (R 3.3.18)
Max Planck Institute for Molecular Genetics
Software Praktikum, 1.2.2013
Folie 1
Max-Planck-Institut
für molekulare Genetik
Transcriptional Regulation
TF not bound =
no gene expression
TF bound =
gene expression
Software Praktikum, 1.2.2013
Max-Planck-Institut
für molekulare Genetik
Transcriptional Regulation
TF not bound =
no gene expression
TF bound =
gene expression
Problem: There are many genes and many TF's,
how do we identify the targets of a TF?
Software Praktikum, 1.2.2013
Max-Planck-Institut
für molekulare Genetik
Methods for Identifying TF
Target Genes
PWM Genome Scan
Microarray
ChIP-seq
Software Praktikum, 1.2.2013
Max-Planck-Institut
für molekulare Genetik
PWM Genome Scan
•Purely computational method
•Input:
o
o
position weight matrix for your TF
genomic region(s) of interest
Score threshold
•Pros:
o
No need to do wet lab experiments
•Cons:
o
Many false positives, not able to take biological conditions into account
Software Praktikum, 1.2.2013
Max-Planck-Institut
für molekulare Genetik
PWM genome scan
1)
Download the PWMs of your TF
of interest from the database
(they might include >1 motif)
1)
Define the sequences to analyze
(promoter sequences)
1)
Run the PWM genome scan (hitbased method or affinity
prediction method)
1)
Rank the genomic sequences by
the affinity signal
Software Praktikum, 1.2.2013
Suggested Reading:
• Roider et al.: Predicting transcription factor
affinities to DNA from a biophysical model.
Bioinformatics (2007).
• Thomas-Chollier et al. Transcription factor
binding predictions using TRAP for the
analysis of ChIP-seq data and regulatory
SNPs. Nature Protocols (2011).
Folie 6
Max-Planck-Institut
für molekulare Genetik
PWM-PSCM
Stat3 pscm
Binding motif for the transcription factor:
Stat3
from ChIP-seq experiment in mouse
(Jaspar ID: MA0144.1)
Software Praktikum, 1.2.2013
Folie 7
Max-Planck-Institut
für molekulare Genetik
TRAP
1) Convert the PSSM(position
2)
3)
4)
5)
specific scoring matrix) to PSEM
(position specific energy matrix)
Scan the sequences of interest
with TRAP
Results in 1 score per
sequence=binding affinity
Doesn’t separate the exact TF
binding sites (easier for ranking)
Sequences must have the same
length!
ANNOTATE=/project/gbrowse/Pipeline/ANNOTATE_v3.02/Release
TRAP trap.molgen.mpg.de/cgi-bin/home.cgi
Software Praktikum, 1.2.2013
Folie 8
Max-Planck-Institut
für molekulare Genetik
Matrix-scan
1) Use directly the PSSM
2) Finds all TFBS which exceed a predefined threshold (e.g. p-value)
3) More complicated to create ranked lists of genomic sequences (more
hits in the sequence)
4) Exact location of the binding site reported
matrix-scan http://rsat.ulb.ac.be/
Software Praktikum, 1.2.2013
Folie 9
Max-Planck-Institut
für molekulare Genetik
Finding the target genes
•
target genes will be the top-ranked genes (promoters)
•
which are the top-ranked genes? (top-100,500,1000...?)
•
There’s no exact definition of promoters, usually 2000bp upstream,
500bp downstream of the TSS
Software Praktikum, 1.2.2013
Folie 10
Max-Planck-Institut
für molekulare Genetik
Microarrays
→ R/Bioconductor (details later)
Software Praktikum, 1.2.2013
Max-Planck-Institut
Folie
12 Genetik
für molekulare
Microarrays (2)
•Pros:
o
o
o
There is a lot of microarray data already available (might not have to
generate the data yourself)
Inexpensive and not very difficult to perform
Computational workflow is well established
•Cons:
o
Can not distinguish between indirect regulation and direct regulation
Software Praktikum, 1.2.2013
Max-Planck-Institut
für molekulare Genetik
ChIP-seq
Map reads to the genome
Call peaks to determine most likely TF binding locations
Software Praktikum, 1.2.2013
Max-Planck-Institut
Folie
14 Genetik
für molekulare
ChIP-seq (2)
•Pros:
o
Direct measure of genome-wide protein-DNA interaction(*)
•Cons:
o
o
o
o
o
Don't know whether binding causes changes in gene expression
More complicated experimentally and in terms of computational analysis
Most expensive
Need an antibody against your protein of interest
Biases are not as well understood as with microarrays
Software Praktikum, 1.2.2013
Max-Planck-Institut
für molekulare Genetik
ChIP-seq analysis
1) Download the reads from
given source (experiments and
controls)
2) Quality control of the reads
and statistics (fastqc)
3) Mapping the reads to the
reference genome
(bwa/Bowtie)
4) Peak calling (MACS)
5) Visualization of the peaks in a
genome browser (genome
browser, IGV)
6) Finding the closest genes to
the
peaks(Bioconductor/ChIPp
eakAnno)
Software Praktikum, 1.2.2013
Visualised peaks in a genome browser
Suggested Reading:
• Bailey et alPractical Guidelines for the
Comprehensive Analysis of ChIP-seq Data.
PLoS Comput Biol (2013).
• Thomas-Chollier et al. A complete workflow
for the analysis of full-size ChIP-seq (and
similar) data sets using peak-motifs. Nature
Protocols (2012).
Folie 15
Max-Planck-Institut
für molekulare Genetik
Sequencing data
•
•
raw data=reads usually very
large file (few GB)
format fastq (ENCODE) or SRA
(Sequence Read Archive of NCBI)
Analysis
1) Quality control with fastqc
2) Filtering of reads with adapter
sequences
3) Mapping of the reads to the
reference genome (bwa or Bowtie)
Example of fastq data file
Software Praktikum, 1.2.2013
Folie 16
Max-Planck-Institut
für molekulare Genetik
Quality control with fastqc
•
•
•
•
•
•
•
•
per base quality
sequence quality (avg. > 20)
sequence length
sequence duplication level
(duplication by PCR)
overrepresented
sequences/kmers (adapter
sequences)
produces a html report
manual (read it!)
Example of per base seq quality scores
software at the MPI
FASTQC=/scratch/ngsvin/bin/chip-seq/fastqc/FastQC/fastqc
Software Praktikum, 1.2.2013
Folie 17
Max-Planck-Institut
für molekulare Genetik
Mapping with bwa
•
•
•
1)
2)
mapping the sequencing reads to a reference genome
manual (read it!)
map the experiments and the controls
reference genome in fasta format (hg19)
create an index of the reference file for faster mapping (only if not
available)
3) align the reads (specify parameters e.g. for # of mismatches, read
trimming, threads used...)
4) generate alignments in the SAM format (different commands for
single-end and pair-end reads!)
software and data at the MPI:
BWA = /scratch/ngsvin/bin/executables/bwa
hg19: /scratch/ngsvin/MappingIndices/hg19.fa
bwa index: /scratch/ngsvin/MappingIndices/BWA/hg19
Software Praktikum, 1.2.2013
Folie 18
Max-Planck-Institut
für molekulare Genetik
File manipulation with samtools
•
•
1)
2)
3)
utilities that manipulate SAM/BAM files
manual (read it!)
merge the replicates in one file (still separate experiment and control)
convert the SAM file into BAM file (binary version of SAM, smaller)
sort and index the BAM file
now the sequencing files are ready for further analysis
software at the MPI:
SAMTOOLS = /scratch/ngsvin/bin/executables/samtools
Software Praktikum, 1.2.2013
Folie 19
Max-Planck-Institut
für molekulare Genetik
Peak finding with MACS
•
find the peaks, i.e. the regions with a high density of reads,
where the studied TF was bound
•
manual (read it!)
1) call the peaks using the experiment (treatment) data vs. control
2) set the parameters e.g. fragment length, treatment of duplication reads
3) analyse the MACS results (BED file with peaks/summits)
software at the MPI:
MACS = /scratch/ngsvin/bin/executables/macs
Software Praktikum, 1.2.2013
Folie 20
Max-Planck-Institut
für molekulare Genetik
Finding the target genes
•
•
•
find the genes which are in the closest distance to the
(significant) peaks
how to define the closest distance? (+- X kb)
use ChIPpeakAnno in Bioconductor or bedtools
Scale
chr10:
69,200,000
78 _
GM12878 c-Myc Sg
100 kb
hg18
69,250,000
69,300,000
69,350,000
UCSC Genes (RefSeq, GenBank, tRNAs & Comparative Genomics)
DNAJC12
SIRT1
DNAJC12
SIRT1
SIRT1
HERC4
HERC4
HERC4
HERC4
HERC4
KIAA1593
ENCODE TFBS, Yale/UCD/Harvard ChIP-seq Peaks (c-Myc in GM12878 cells)
HERC4
ENCODE TFBS, Yale/UCD/Harvard ChIP-seq Signal (c-Myc in GM12878 cells)
0_
ENCODE TFBS, Yale/UCD/Harvard ChIP-seq Peaks (c-Myc in K562 cells)
78 _
ENCODE TFBS, Yale/UCD/Harvard ChIP-seq Signal (c-Myc in K562 cells)
K562 c-Myc Sig
0_
RepeatMasker
Software Praktikum, 1.2.2013
Repeating Elements by RepeatMasker
Folie 21
Max-Planck-Institut
für molekulare Genetik
Methods for Identifying TF
Target Genes
PWM Genome Scan
Microarray
ChIP-seq
Software Praktikum, 1.2.2013
Threshold
s
Max-Planck-Institut
für molekulare Genetik
Bioinformatics
•
•
•
•
Read mapping
(Bowtie/bwa)
Peak Calling
(MACS/Bioconduct •
or)
•
Peak-Target
Analysis
(Bioconductor)
Microarray data
analysis
(Bioconductor)
Differential Genes
(R)
GSEA
•
•
Software Praktikum, 1.2.2013
•
•
PWM Genome
Scan
(TRAP/MatScan)
Statistics (R)
Data Integration
(R/Python/Perl)
Statistical
Analysis (R)
Folie 23
Max-Planck-Institut
für molekulare Genetik
Bioinformatics tools
READ THE MANUALS!
•
•
•
•
•
•
Bowtie bowtie-bio.sourceforge.net/manual.shtml
bwa bio-bwa.sourceforge.net/bwa.shtml
MACS github.com/taoliu/MACS/blob/macs_v1/README.rst
TRAP trap.molgen.mpg.de/cgi-bin/home.cgi
matrix-scan http://rsat.ulb.ac.be/
Bioconductor www.bioconductor.org/ (more info in R course)
Databases
•
•
•
•
GEO www.ncbi.nlm.nih.gov/geo/
ENCODE genome.ucsc.edu/ENCODE/
SRA www.ncbi.nlm.nih.gov/sra
JASPAR http://jaspar.genereg.net/
Software Praktikum, 1.2.2013
Folie 24
Max-Planck-Institut
für molekulare Genetik
Schedule
•
•
•
•
•
•
•
03.03. Introduction lecture, R course
04.03. R & Bioconductor homework submission
11.03. Presentation of the detailed plan of each group
(which TF, cell line, tools, data, data integration, team
work ) 10:30am, 11:30am
every Tuesday 10:30am, 11:30am progress meetings
17.04. Final report deadline
24.04. (tentative) Presentations
28.04. Final meeting, discussion of final reports
Software Praktikum, 1.2.2013
Folie 25
Max-Planck-Institut
für molekulare Genetik
GR Group
•
Expression and ChIP-seq data: Luca F, Maranville JC,
et al., PLoS ONE, 2013
• PWM database: jaspar.genereg.net
Software Praktikum, 1.2.2013
Folie 26
Max-Planck-Institut
für molekulare Genetik
c-Myc Group
•
Expression data: Cappellen, Schlange, Bauer et al.,
EMBO reports, 2007
• Musgrove et al., PLoS One, 2008
• ChIP-seq data: ENCODE Project
• PWM database: jaspar.genereg.net
Software Praktikum, 1.2.2013
Folie 27
Max-Planck-Institut
für molekulare Genetik
Additional analysis
Binding motifs
binding motifs
• are the overrepresented
motifs in the ChIP-peak
regions different?
• do we find any co-factors?
Recommended tool:
RSAT rsat.ulb.ac.be
binding motifs
Software Praktikum, 1.2.2013
binding motifs
Folie 28