BS-Seq data processing lecture

Bisulfite-Sequencing theory
and Quality Control
Felix Krueger
[email protected]
January 2015
•
•
•
•
•
•
•
•
•
•
9:15-10:15
10:15-10:30
10:30-11:30
11:30-12:30
12:30-13:30
13:30-14:00
14:00-15:30
15:30-15:45
15:45-16:35
16:35-16:45
2
Bisulfite-Seq theory and QC
coffee
Mapping and QC practical
Visualising and Exploring talk
Lunch
Methylation tools in SeqMonk
Visualising and Exploring practical
coffee
Differential methylation talk & practical
Introduction to other cytosine
modifications and oxBS
DNA Methylation
DNA methyl-transferases
DNA-demethylase(s)?
Hydroxymethyl cytosine? TETs?
Passive demethylation?
5-methyl Cytosine
Cytosine
•
•
•
•
•
•
•
3
Dynamic patterning
Correlation with gene expression
Reset during development
Suppression of repetitive elements
X-chromosome inactivation
Imprinting
Carcinogenesis
DNA methylation is sequence context dependent
sequence
CCAGTCGCTATA
CCAGTCGCTATA
CCAGTCGCTATA
context mammals plants
CpG
YES
YES
CHG
no**
YES
CHH
no**
YES
* H being anything but G
** somewhat more relevant in certain cell types
4
Promoter methylation causes transcriptional silencing
Developmentally important gene or tumour suppressor gene
Adapted from Stephen B Baylin
Nature Clinical Practice Oncology (2005)
5
Imprinted Genes: mono-allelic expression
Differential allelic
DNA methylation
X
CGI (CpG island)
methylated CpG
unmethylated CpG
Imprinted Genes: Mono-allelic expression with parent-of-origin specificity.
Have key roles in energy metabolism, placenta functions.
6
Imprinted Genes: example
ICR: Imprinting Control Region
DMR: Differentially Methylated Region
methylated CGI
ICR / DMR
unmethylated CGI
Rather unusual case where a non-methylated DMR causes silencing of a locus
from Ferguson-Smith A. Nat. Rev. Genet. 2011
7
DNA methylation is maintained
from W. Reik & J. Walter, Nat. Rev. Genet. 2001
8
DNA methylation is reset during reprogramming
9
Measuring DNA methylation by Bisulfite-sequencing
Image by Illumina
10
Bisulfite Informatics
me
me
CCAGTCGCTATAGCGCGATATCGTA
Convert
TTAGTTGCTATAGTGCGATATTGTA
Map
TTAGTTGCTATAGTGCGATATTGTA
|||||||||||||||||||||||||
...CCAGTCGCTATAGCGCGATATCGTA...
11
Bisulfite conversion of a genomic locus
mC
|
mC
|
>>CCGGCATGTTTAAACGCT>>
<<GGCCGTACAAATTTGCGA<<
Top strand
|
mC
|
hmC
|
mC
Bisulfite conversion
>>UCGGUATGTTTAAACGUT>>
Bottom strand
<<GGUCGTACAAATTTGCGA<<
PCR amplification
OT >>TCGGTATGTTTAAACGTT>>
CTOT <<AGCCATACAAATTTGCAA<<
>>CCAGCATGTTTAAACGCT>> CTOB
<<GGTCGTACAAATTTGCGA<< OB
- 2 different PCR products and 4 possible different sequence strands from one genomic locus
- each of these 4 sequence strands can theoretically exist in any possible conversion state
12
Mapping a Bisulfite-Seq read
sequence of interest
TTGGCATGTTTAAACGTT
bisulfite convert read (treat sequence as both
forward and reverse strand)
5’…TTGGTATGTTTAAATGTT…3’ 5’…TTAACATATTTAAACATT…3’
(1)
(2)
align to bisulfite converted genomes
Bismark
(3)
…TTGGTATGTTTAAATGTT…
…AACCATACAAATTTACAA…
forward strand C -> T converted genome
(1)
(2)
(4)
…CCAACATATTTAAACACT…
…GGTTGTATAAATTTGTGA…
forward strand G -> A converted genome
(equals reverse strand C -> T conversion)
(3)
(4)
read all 4 alignment outputs and extract
the unmodified genomic sequence if the
sequence could be mapped uniquely
5’…CCGGCATGTTTAAACGCT…3’
methylation call
genomic sequence
TTGGCATGTTTAAACGTT
CCGGCATGTTTAAACGCT
methylation call
cc..C.........Z.c.
read sequence
c
C
z
Z
unmethylated C
methylated C
unmethylated C in CpG context
methylated C in CpG context
Common sequencing protocols
mC
|
mC
|
>>CCGGCATGTTTAAACGCT>>
<<GGCCGTACAAATTTGCGA<<
|
mC
Top strand
|
hmC
|
mC
Bottom strand
>>UCGGUATGTTTAAACGUT>>
<<GGUCGTACAAATTTGCGA<<
1) Directional libraries
(vast majority)
2) PBAT libraries
3) Non-directional libraries
14
OT
>>TCGGTATGTTTAAACGTT>>
<<GGTCGTACAAATTTGCGA<<
OB
CTOT <<AGCCATACAAATTTGCAA<<
>>CCAGCATGTTTAAACGCT>> CTOB
OT >>TCGGTATGTTTAAACGTT>>
CTOT <<AGCCATACAAATTTGCAA<<
>>CCAGCATGTTTAAACGCT>> CTOB
<<GGTCGTACAAATTTGCGA<< OB
80
biased
unbiased
60
40
20
0
25
50
75
100
Read length (bp)
15
125
150
CpG
CHG
CHH
Mapping efficiency
100
100
80
80
60
60
40
40
20
20
0
0
0
20
40
60
80
Bisulfite conversion rate (%)
100
Mapping efficiency (%)
Mapping efficiency (%)
100
Cytosines called unmethylated (%)
Validation
BS-Seq Analysis Workflow
16
QC
Trimming
Mapping
Analysis
Mapped QC
Methylation
extraction
Raw Sequence Data
...
up to 1,000,000,000 lines per lane
17
Part I: Initial QC What does QC tell you about your library?
•
•
•
•
•
# of sequences
Basecall qualities
Base composition
Potential contaminants
Expected duplication rate
18
QC Raw data: Sequence Quality
Error rate
0.1%
1%
10%
19
QC: Base Composition
WGSBS
RRBS
20
QC: Duplication rate
21
QC: Overrepresented sequences
22
Phred score
Common problems in BS-Seq
Base content (%)
Position in read (bp)
Not observed
in ‘normal’ libraries,
e.g. ChIP or RNA-Seq
Position in read (bp)
23
Removing poor quality basecalls
Phred score
before trimming
24
after trimming
Removing adapter contamination
before trimming
25
after trimming
Summary Adapter/Quality Trimming
Important to trim because failure to do so might result in:
 Low mapping efficiency
 Mis-alignments
 Errors in methylation calls since adapters are methylated
 Basecall errors tend toward 50% (C:mC)
26
Part II: Sequence alignment –
Bismark primary alignment output (BAM file)
chromosome
position
Read 1
HISEQ2000-06:366:C3G4NACXX:3:1101:1316:2067_1:N:0:
99
16
71322125
255
100M
=
71322232
207
sequence
NTTATTTAGTTTTTTAGGGTTTGTGTGTAGGAGTGTGGGAATTATGTTTTTTATGGTTGATATTTATTTAAAAGTGAGTATAAATTATATATATTTTTTT
#1=DDDDDAAFFHIIIA:<FGHCCEFGHD?CFFBBBGEHHGHIII<FEHIIIII==DE??EHHFHEEEEEEEC>;>66;@CDEEEDCEEEEEEEDDDCBB
quality
NM:i:14 XX:Z:G8C2C7C21C13C6CC1C17CC3C4CC4
XM:Z:.........h..h.......x.....................h.............x......hh.h.................hh...h....hh....
XR:Z:CT XG:Z:CT XA:Z:1
HISEQ2000-06:366:C3G4NACXX:3:1101:1316:2067_1:N:0:
147
16
71322232
255
100M
=
71322125
-207
GGTTATTTTATTTAGGGTTATTGTTTTAGAGTTTTATTGTTGTGAACAGATATATGATTAAGGTAATTTTTATAAGGATAATATTTAATTGGAGTTGGTT
CCCEEECADCFFFFHHGHGHIIGIHFIJJIJIHFGHGGGEHIJIIJGIGFJJJJJJJJJJIGJJJJGJJJJIIIJJIJIJJJJJJIJHHHHHFFFFFCCC
NM:i:21 XX:Z:2G2CC1C1C1C11C11C2C10C1C4CC4C2C1C3C5C2C12C3C1
XM:Z:.....hh.h.h.x...........h...........x..x......X...h.h....hh....h..h.h...h.....h..h............x...h.
XR:Z:GA XG:Z:CT XB:Z:1
Read 2
methylation call
27
Sequence duplication
Complex/diverse library:
Duplicated library:
28
percent methylation
55
17 100
100 100 71 100
percent methylation
33
50 100
100 100 50 100
deduplication
Deduplication - considerations
Advisable for large genomes and moderate coverage
- unlikely to sequence several genuine copies of the same fragment
amongst >5bn possible fragments with different start sites
- maximum coverage with duplication may still be
(read length)-fold (even more with paired-end reads)
NOT advisable for RRBS or other target enrichment methods
where higher coverage is either desired or expected
RRBS
CCGG
CCGG
deduplication
29
Methylation extraction
Read 1
....Z.....h..h.......x.....Z.........x......hh.h.............z....hx...h....hh.Z...
...........hh.h.............z....hx...h....hh.Z...hh....x.....Z.h.....h..h......x...h........
redundant methylation calls
Read 2
Read 1
....Z.....h..h.......x.....Z.........x......hh.h.............z....hx...h....hh.Z...
hh....x.....Z.h.....h..h......x...h........
Read 2
CpG methylation
output
read ID
30
meth
state
chr
pos context
Methylation extraction
CpG methylation output
bedGraph/coverage output
chr
31
pos
methylation
percentage
meth
unmeth
Part III: Mapped QC Methylation bias
good opportunity to look at conversion efficiency
32
Artificial methylation calls in paired-end libraries
end repair + A-tailing
5’3’-
33
GGGNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCA
ACCCNNNNNNNNNNNNNNNNNNNNNNNNNNNGGG
-3’
-5’
Specialist applications (I):
Reduced representation BS-Seq (RRBS)
Sequence composition bias
34
High duplication rate
Fragment size distribution in RRBS
identical (redundant) methylation calls
35
Artificial methylation calls in RRBS libraries
C genomic cytosine
C unmethylated cytosine
36
Specialist application (II):
Post-bisulfite adapter tagging (PBAT)
WGBS
PBAT
suitable for low input material
37
PBAT-Seq
trim off/ ignore first couple of basepairs
38
Bismark workflow
Pre Alignment
FastQC
Trim Galore
Initial quality control
Adapter/quality trimming using Cutadapt; handles RRBS
and paired-end reads; Trim Galore and RRBS User guide
Alignment
Bismark
Output BAM
Post Alignment
Deduplication
Methylation extractor
optional
Output individual cytosine methylation calls; optionally
bedGraph or genome-wide cytosine report
M-bias analysis
bismark2report
Graphical HTML report generation
Example: http://www.bioinformatics.babraham.ac.uk/projects/bismark/PE_report.html
protocol: Quality Control, trimming and alignment of Bisulfite-Seq data
39
Useful links
•
FastQC
www.bioinformatics.babraham.ac.uk/projects/fastqc/
•
Trim Galore
www.bioinformatics.babraham.ac.uk/projects/trim_galore/
•
Cutadapt
https://code.google.com/p/cutadapt/
•
Bismark
www.bioinformatics.babraham.ac.uk/projects/bismark/
•
Bowtie
http://bowtie-bio.sourceforge.net/
•
Bowtie 2
http://bowtie-bio.sourceforge.net/bowtie2/
•
SeqMonk
www.bioinformatics.babraham.ac.uk/projects/seqmonk/
•
Cluster Flow
www.bioinformatics.babraham.ac.uk/projects/clusterflow/
protocol: Quality control, trimming and alignment of Bisulfite-Seq data
http://www.epigenesys.eu/en/protocols/bio-informatics/483-quality-control-trimming-and-alignment-of-bisulfite-seq-data-prot-57
40
Sierra: A web-based LIMS system
for small sequencing facilities
SeqMonk: Genome browser,
quantitation and data analysis
Trim Galore! Quality and adapter
trimming for (RRBS) libraries
FastQ Screen: organism and
contamination detection
Bismark: Bisulfite-sequencing
alignments and methylation calls
Hi-C mapping
ASAP: Allele-specific alignments
FastQC: quality control for
high throughput sequencing
41
www.bioinformatics.babraham.ac.uk