Document 399154

Quality control
Thoughts on how to assess quality and detect problems

SEQUENCE DATA QC

Sequence data QC is a vital first step
•  FastQC on the FastQ
•  QualiMap on the BAM (GUI on top of Picard tools)
•  DepthOfCoverage (mainly for WGS)
•  DiagnoseTargets (mainly for exome)
•  Pay attention to read counts and filtering summaries!

FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

QualiMap
http://qualimap.bioinfo.cipf.es/

GATK DepthOfCoverage

Example of coverage analysis
Increasing concentrations of betaine improve coverage
•  GC-rich genes are badly covered by the new protocol without betaine
•  Betaine rescues high-GC genes
•  Wide distribution => uneven coverage
•  Narrow distribution => even coverage
[Figure panels: High GC intervals (>0.6); Med. GC intervals (0.4 < GC < 0.6); Low GC intervals (<0.4)]
Normalization: Norm(X) = X / Mean(x)

Visualizing typical cases helps document issues
[Figure panels: Old protocol; New protocol (no betaine); New protocol +1M betaine; New protocol +2M betaine. GC content = 0.69]
•  Coverage in GC-rich regions increases with betaine concentration
•  Coverage is similar between GC-rich and average regions

Follow-up experiments to track improvements
[Figure columns: Previous experiment (0M betaine samples); Control (0M) of new experiment. Rows: High GC intervals (>0.6); Med. GC intervals (0.4 < GC < 0.6); Low GC intervals (<0.4)]
•  Coverage of high-GC genes is grossly inadequate
•  Wide distribution => uneven coverage
•  Lopsided distribution => uneven coverage
Normalization: Norm(X) = X / Mean(x)

GATK DiagnoseTargets

Example of exome targeting failure
[Figure tracks: Tech 1; Tech 2; intron and exon; Tech 1 interval; Tech 2 interval]
•  Tech 1 provided decent coverage, so the sequence context is fine
•  Tech 2 produces abundant coverage in the intron region
•  Tech 2 produces bad coverage in the area of interest
Caveat: raw sequence data displayed here are by definition not normalized, so comparisons should be limited to relative amounts of coverage between areas per technology, rather than absolute amounts between technologies.

Example of problematic sequence context
[Figure tracks: Old Tech 1 (CEU Trio); Tech 1; Tech 2; exon; Tech 1 interval; Tech 2 interval]
•  Tech 1 coverage was very low in this area
•  But deeper coverage in the rest inflates the overall coverage score for the exon, allowing it to pass filters
•  Tech 2 performs worst in the area where Tech 1 also fails
•  The sequence context is rich in G and C, repeats and homopolymers
•  Tech 2 interval extends to the intron (250 bp upstream, also void of coverage)
Caveat: as on the previous slide, the raw data are not normalized, so compare relative coverage between areas within each technology only.

Check GATK's read counts and summaries!
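The per-filter summary that GATK logs at the end of a run, like the MicroScheduler excerpt on this slide, can be tallied by a script rather than read by eye. What follows is a minimal sketch that assumes the log line format of that excerpt; the function name `summarize_filters` and the embedded sample string are illustrative and not part of GATK.

```python
import re

# Per-filter lines look like:
#   "... MicroScheduler -   -> 6 reads (5.00% of total) failing DuplicateReadFilter"
FILTER_LINE = re.compile(r"->\s*(\d+) reads \(([\d.]+)% of total\) failing (\w+)")
# The headline line reports the filtered and total read counts.
TOTAL_LINE = re.compile(
    r"(\d+) reads were filtered out .* approximately (\d+) total reads"
)


def summarize_filters(log_text):
    """Return (filtered, total, {filter_name: read_count}) from a GATK log."""
    filtered = total = None
    per_filter = {}
    for line in log_text.splitlines():
        match = TOTAL_LINE.search(line)
        if match:
            filtered, total = int(match.group(1)), int(match.group(2))
            continue
        match = FILTER_LINE.search(line)
        if match:
            per_filter[match.group(3)] = int(match.group(1))
    return filtered, total, per_filter


# Sample text copied from the slide's excerpt (assumed format).
sample_log = """\
INFO 18:35:23,731 MicroScheduler - 6 reads were filtered out during the traversal out of approximately 120 total reads (5.00%)
INFO 18:35:23,731 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 18:35:23,731 MicroScheduler -   -> 6 reads (5.00% of total) failing DuplicateReadFilter
INFO 18:35:23,732 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter
"""

filtered, total, per_filter = summarize_filters(sample_log)
print(f"{filtered}/{total} reads filtered")
for name, count in sorted(per_filter.items(), key=lambda kv: -kv[1]):
    print(f"  {name}: {count}")
```

A check such as "the per-filter counts add up to the headline count" or "no single filter removes more than some project-specific fraction of reads" can then be layered on top of the returned dictionary.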
INFO 18:35:23,731 MicroScheduler - 6 reads were filtered out during the traversal out of approximately 120 total reads (5.00%)
INFO 18:35:23,731 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 18:35:23,731 MicroScheduler -   -> 6 reads (5.00% of total) failing DuplicateReadFilter
INFO 18:35:23,731 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 18:35:23,732 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 18:35:23,732 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO 18:35:23,732 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter
INFO 18:35:24,964 GATKRunReport - Uploaded run statistics report to AWS S3

VARIANT QC

Two levels of variant QC
•  Sample QC -> filter out samples with dubious quality
   –  Infer phenotypes (esp. ethnicity and sex) with a measure of likelihood if possible
   –  QC metrics -> alone or together with phenotypes (inferred or known) to make filtering decisions
•  Site QC -> filter out sites that are unreliable for further analysis

Sample QC
•  Contamination / swap
•  GATK -T GenotypeConcordance: proportion of sites with same genotype in same sample sequenced earlier
•  http://genome.sph.umich.edu/wiki/VerifyBamID
•  Sample Match Likelihood Test (fingerprinting chip)
•  ContEst http://www.ncbi.nlm.nih.gov/pubmed/21803805

Site QC
•  Quality of coverage -> base quals in BAM -> GQ in GVCFs
•  Percent missing (proportion of samples with missing data) -> indicates issue in region / problematic context
•  Synonymous / Non-synonymous Substitution ratio -> expectation of selective pressure on coding genes
•  Hardy-Weinberg Equilibrium -> assumption of "population not evolving"

QC of singletons using TiTv ratio
T2D 26K marathon, singletons overall
[Figure: ti/tv ratio (y-axis, 0 to 3) versus decreasing VQSLOD bins (x-axis, 0 to 1,500,000) for population ALL. Points are colored by whether vqslod > cut (TRUE / FALSE), and the LOD cutoff is marked. Annotation: TiTv ratio values crash.]
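The quantity plotted here can be sketched in a few lines. The ti/tv ratio is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other single-base changes). The sketch below sorts calls by decreasing VQSLOD and reports the ratio per bin; the function names, the bin size, and the toy calls are assumptions for illustration and are not taken from the T2D call set or from GATK.

```python
from typing import Iterable, List, Tuple

# Transitions swap purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Transitions / transversions for (ref, alt) pairs of single bases.

    Assumes ref != alt for every pair; returns NaN if there are no transversions.
    """
    ti = tv = 0
    for ref, alt in snps:
        if (ref.upper(), alt.upper()) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("nan")


def titv_by_vqslod_bin(
    calls: List[Tuple[float, str, str]], bin_size: int
) -> List[float]:
    """Sort (vqslod, ref, alt) calls by decreasing VQSLOD and return the
    ti/tv ratio of each consecutive bin of `bin_size` calls."""
    ordered = sorted(calls, key=lambda call: call[0], reverse=True)
    ratios = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        ratios.append(titv_ratio((ref, alt) for _, ref, alt in chunk))
    return ratios


# Hypothetical toy calls: (VQSLOD, ref, alt).
example_calls = [
    (9.1, "A", "G"), (8.7, "C", "T"), (7.9, "G", "A"), (6.5, "A", "C"),
    (1.2, "A", "T"), (0.4, "C", "T"), (-0.8, "G", "C"), (-2.3, "C", "A"),
]
ratios = titv_by_vqslod_bin(example_calls, bin_size=4)
print(ratios)
```

In this toy input the high-VQSLOD bin has three transitions to one transversion while the low-VQSLOD bin has one to three, which mimics the drop in ratio that the figure uses to place the LOD cutoff.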
Phenotypic inference
•  Kinship -> degree of relation between samples (King / PLINK)
•  Pedigree -> reconstruct family structure (trios)
•  Sex -> coverage / clustering analysis over X and Y
   Note: many projects discard samples with non-standard sex genotypes (e.g. X0, XXY)
•  Ethnicity inference -> PCA + clustering on subset of conserved sites (S. Purcell)

Kinship inference
[Figure (Monkol Lek, 2014): sample pairs clustered as Duplicates, Parent-Offspring, Siblings]

Ethnicity is super important: adjust expectations accordingly!
[Figure (Monkol Lek, 2014): SNP count (18K, 21K, 24K) per sample. Cohorts: 1000G, Bipolar, ATV, BUP, ESP, Ottawa, NFBC, T2D-GENES, GoT2D, SCZ, SIGMA, TAT. Populations: African, European, American, East Asian, South Asian. Annotation: Old populations & founder effects]

Benchmarking against a knowledge base
•  GSA's favorite sample: NA12878 (+ parents)
•  Sequenced with multiple technologies
•  Manual review over extensive stretches of genome
Benchmark for testing changes to GATK tools

NA12878 Knowledge Base
[Figure: assigned truth status (TruthStatus: TRUE_POSITIVE, FALSE_POSITIVE) of reviewed sites by position along chromosome (0 to 250,000,000), for chromosomes 1 to 22 and X. Reviews by: ami, chartl, delangel, delangel_fosmids, depristo, ebanks, gauthier, gege, haasb, justinzook, multiple, rpoplin, thibault, valentin.]
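Benchmarking a call set against such a knowledge base amounts to looking up each call's assigned truth status. The sketch below is only an illustration: the dictionary layout, the site key `(chrom, pos, ref, alt)`, the function name, and the extra labels `NOT_REVIEWED` and `MISSED_TRUE_POSITIVE` are assumptions, not the actual schema of the NA12878 Knowledge Base. Only the `TRUE_POSITIVE` and `FALSE_POSITIVE` labels come from the figure legend.

```python
from collections import Counter


def benchmark_calls(calls, knowledge_base):
    """Tally the truth status of each call.

    calls: set of (chrom, pos, ref, alt) tuples from the tool under test.
    knowledge_base: dict mapping the same kind of tuple to
        "TRUE_POSITIVE" or "FALSE_POSITIVE" (hypothetical layout).
    """
    counts = Counter()
    for site in calls:
        # Calls at sites nobody has reviewed cannot be scored either way.
        counts[knowledge_base.get(site, "NOT_REVIEWED")] += 1
    # Reviewed true variants that the tool failed to call.
    counts["MISSED_TRUE_POSITIVE"] = sum(
        1
        for site, status in knowledge_base.items()
        if status == "TRUE_POSITIVE" and site not in calls
    )
    return counts


# Hypothetical toy data.
kb = {
    ("1", 1000, "A", "G"): "TRUE_POSITIVE",
    ("1", 2000, "C", "T"): "FALSE_POSITIVE",
    ("2", 500, "G", "A"): "TRUE_POSITIVE",
}
calls = {("1", 1000, "A", "G"), ("1", 2000, "C", "T"), ("3", 42, "T", "C")}

result = benchmark_calls(calls, kb)
for status in sorted(result):
    print(status, result[status])
```

Running the same tally before and after a change to a tool gives a simple regression check: fewer `FALSE_POSITIVE` and `MISSED_TRUE_POSITIVE` sites suggest an improvement on the reviewed portion of the genome.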
Further reading
http://www.broadinstitute.org/gatk/guide/