Document 400532

talks Variant Calling + Genotyping Examine the evidence for varia9on from reference via Bayesian inference We are here in the Best Practices workflow!
Variant Calling PURPOSE Real muta9ons are hidden in the noise PRINCIPLES Muta9on or noise? A large scale Bayesian inference problem Variant callers in GATK •  UnifiedGenotyper Call SNPs and indels separately by considering each variant locus independently –  Accepts any ploidy –  Pooled calling •  HaplotypeCaller Call SNPs, indels, and some SVs simultaneously by doing local re-­‐assembly and considering haplotypes –  More accurate (esp. indels) –  Reference confidence model –  Replaces UG HaplotypeCaller method overview •  Call SNPs, indels, and some SVs simultaneously by doing local re-­‐assembly and considering haplotypes –  Determine if a region has poten1al varia1on –  Make deBruijn assembly graph of the region –  Paths in the graph = poten1al haplotypes to evaluate –  Calculate haplotype likelihoods given the data (PairHMM) –  Iden1fy variants on most likely haplotypes –  Compute allele frequency distribu1on to determine most likely allele count, and emit a variant call if appropriate –  If emiTng a variant, assign genotype to each sample HC method illustrated Assemble plausible haplotypes
Identify ActiveRegions
READS
READS
A
A
A
A
C
C
C
C
T
T
T
A
T
T
T
T
T
C
G
G
A
A
A
A
A
A
T
(-bamOut)
A
A
pruned
T
A
A
C
C
T
T
T
T
?
T
G
A
G
A
G
0/0 0/1 1/1
A/ C
G
A/ G
?
A
GGC T
G
Genotype sample
?
A
T
A
T A T GA A A T T G G T A T AGG C T
?
C
A
GGT A T
AAT
TA T G
Determine per-read likelihoods (PairHMM)
A
T
A
REF T A T G A A A T T G G T A T A G G C T
C
T
G
GLs + annotations
Determining Ac9veRegions Assemble plausible haplotypes
•  Sliding window along the genome reference T
A
A
GGC T
GGT A T
AAT
TA T G
A
C
G
•  Count mismatches, indels and soU pruned
T A T GA A A T T G G T A T AGG C T
clips Identify ActiveRegions
REF T A T G A A A T T G G T A T A G G C T
READS
READS
A
A
A
A
C
C
C
C
T
T
T
A
T
T
T
T
T
G
G
A
A
A
A
A
A
(-bamOut)
A
A
C
C
T
T
T
T
A
G
A
G
 Measure of entropy Determine per-read likelihoods (PairHMM)
A
T
A
A
?
?
?
T
G
Over threshold: trigger “Ac1veRegion” where HC w0/0
ill proceed 0/1 1/1
A
C
T
Genotype sample
T
A/ C
G
A/ G
?
A
C
T
G
Can specify a per-cular interval using the –forceAc-ve parameter GLs + annotations
HC method illustrated Assemble plausible haplotypes
Identify ActiveRegions
READS
READS
A
A
A
A
C
C
C
C
T
T
T
A
T
T
T
T
T
C
G
G
A
A
A
A
A
A
T
(-bamOut)
A
A
pruned
T
A
A
C
C
T
T
T
T
?
T
G
A
G
A
G
0/0 0/1 1/1
A/ C
G
A/ G
?
A
GGC T
G
Genotype sample
?
A
T
A
T A T GA A A T T G G T A T AGG C T
?
C
A
GGT A T
AAT
TA T G
Determine per-read likelihoods (PairHMM)
A
T
A
REF T A T G A A A T T G G T A T A G G C T
C
T
G
GLs + annotations
HC method illustrated Assemble plausible haplotypes
Identify ActiveRegions
READS
READS
A
A
A
A
C
C
C
C
T
T
T
A
T
T
T
T
T
C
G
G
A
A
A
A
A
A
T
(-bamOut)
A
A
pruned
T
A
A
C
C
T
T
T
T
?
T
G
A
G
A
G
0/0 0/1 1/1
A/ C
G
A/ G
?
A
GGC T
G
Genotype sample
?
A
T
A
T A T GA A A T T G G T A T AGG C T
?
C
A
GGT A T
AAT
TA T G
Determine per-read likelihoods (PairHMM)
A
T
A
REF T A T G A A A T T G G T A T A G G C T
C
T
G
GLs + annotations
Assemble plausible haplotypes Identify
ActiveRegions
•  Local re-­‐assembly Assemble plausible haplotypes
T A T GA A A T T G G T A T AGG C T
•  Traverse graph to collect A
T
A
T
G
most likely h
aplotypes T
A
G
A
A
A
READS
READS
T
T
T
T
T
C
C
C
C
C
(-bamOut)
pruned
A
A
C
C
T
T
T
T
A
G
A
G
Genotype sample
?
A
?
T
GGC T
G
A
T + candidate G
A
Likely haplotypes variant sites 0/0 0/1
?
C
A
T A T GA A A T T G G T A T AGG C T
A
A
A
A
A
Determine per-read likelihoods (PairHMM)
T
A
GGT A T
AAT
TA T G
•  Align haplotypes to ref using Smith-­‐Waterman A
T
A
REF
T
A/ C
G
A/ G
?
A
1/1
C
T
G
GLs + annotations
Can have HC output the reassembled reads and selected halpotypes using the –bamOut parameter Assemble plausible haplotypes
Identify ActiveRegions
READS
READS
A
A
A
A
C
C
C
C
T
T
T
A
T
T
T
T
T
T
C
(-bamOut)
A
A
T
?
T
A
A
C
C
T
T
T
T
GGC T
G
A
G
A
G
Genotype sample
G
0/0 0/1 1/1
A/ C
G
A/ G
?
A
Traverse the graph to enumerate the possible haplotypes Assembly of large genomes using second-­‐genera-on sequencing. Schatz. Genome Research. 2010. pruned
?
A
T
A
T A T GA A A T T G G T A T AGG C T
?
C
A
GGT A T
AAT
TA T G
G
G
A
A
A
A
A
A
Determine per-read likelihoods (PairHMM)
A
T
A
REF T A T G A A A T T G G T A T A G G C T
C
T
G
GLs + annotations
Ar9factual SNPs and small indels caused by large indel can be recovered by assembly NA12878 original read data Haplotype Caller (validated) 15 Mul9ple caller ar9facts that are hard to filter out, since they are well supported by read data Allele determination is much more accurate through local
assembly of candidate haplotypes
Original BWA alignments -assembly:
+assembly:
BAM read bases are all iden9cal; individual alignments differ based on the the whims of the mapper 1 multi-allelic SNP and two 1bp indels are called
Only the complex substitution (TT to TAC) is called
16 HC method illustrated Assemble plausible haplotypes
Identify ActiveRegions
READS
READS
A
A
A
A
C
C
C
C
T
T
T
A
T
T
T
T
T
C
G
G
A
A
A
A
A
A
T
(-bamOut)
A
A
pruned
T
A
A
C
C
T
T
T
T
?
T
G
A
G
A
G
0/0 0/1 1/1
A/ C
G
A/ G
?
A
GGC T
G
Genotype sample
?
A
T
A
T A T GA A A T T G G T A T AGG C T
?
C
A
GGT A T
AAT
TA T G
Determine per-read likelihoods (PairHMM)
A
T
A
REF T A T G A A A T T G G T A T A G G C T
C
T
G
GLs + annotations
HC method illustrated Assemble plausible haplotypes
Identify ActiveRegions
READS
READS
A
A
A
A
C
C
C
C
T
T
T
A
T
T
T
T
T
C
G
G
A
A
A
A
A
A
T
(-bamOut)
A
A
pruned
T
A
A
C
C
T
T
T
T
?
T
G
A
G
A
G
0/0 0/1 1/1
A/ C
G
A/ G
?
A
GGC T
G
Genotype sample
?
A
T
A
T A T GA A A T T G G T A T AGG C T
?
C
A
GGT A T
AAT
TA T G
Determine per-read likelihoods (PairHMM)
A
T
A
REF T A T G A A A T T G G T A T A G G C T
C
T
G
GLs + annotations
Score haplotypes using PairHMM Assemble plausible haplotypes
Identify ActiveRegions
•  Calculate haplotype likelihoods given the data A
REF T A T G A A A T T G G T A T A G G C T
READS
READS
A
–  PairHMM aTTligns eGach read to each haplotype C
A
A
A
C
C
C
C
T
A
T
T
T
T
T
TA T G
G
A
A
A
A
A
A
GGC T
pruned
G
T
T
T
T
A
G
A
G
Likelihood of the read being observed if a given (-bamOut)
haplotype is considered true T
A
A
T
Genotype sample
G
0/0 0/1 1/1
?
A
?
T
A
A
A
C
C
?
C
A
GGT A T
T A T GA A A T T G G T A T AGG C T
Determine per-read likelihoods (PairHMM)
A
T
AAT
T
A/ C
G
A/ G
?
A
C
T
G
GLs + annotations
READS
READS
A
A
A
C
C
C
C
T
T
A
T
T
T
T
T
G
G
A
A
A
A
A
A
pruned
T A T GA A A T T G G T A T AG
(-bamOut)
Determine per-read likelihoods (PairHMM)
A
T
A
A
?
T
PairHMM Empirical gap penal9es = derived from data by BQSR Base mismatch penal9es = base quality scores -­‐> likelihoods of the reads vs the haplotypes -­‐> store in matrix 0/0
A/ C
A/ G
?
A
Reads Haplotypes Aij = probability of read vs haplotype A
G
A
G
Genotype s
G
G
?
T
T
T
T
T
?
A
C
T
A
A
C
C
C
T
G
GLs + anno
HC method illustrated Assemble plausible haplotypes
Identify ActiveRegions
READS
READS
A
A
A
A
C
C
C
C
T
T
T
A
T
T
T
T
T
C
G
G
A
A
A
A
A
A
T
(-bamOut)
A
A
pruned
T
A
A
C
C
T
T
T
T
?
T
G
A
G
A
G
0/0 0/1 1/1
A/ C
G
A/ G
?
A
GGC T
G
Genotype sample
?
A
T
A
T A T GA A A T T G G T A T AGG C T
?
C
A
GGT A T
AAT
TA T G
Determine per-read likelihoods (PairHMM)
A
T
A
REF T A T G A A A T T G G T A T A G G C T
C
T
G
GLs + annotations
HC method illustrated Assemble plausible haplotypes
Identify ActiveRegions
READS
READS
A
A
A
A
C
C
C
C
T
T
T
A
T
T
T
T
T
C
G
G
A
A
A
A
A
A
T
(-bamOut)
A
A
pruned
T
A
A
C
C
T
T
T
T
?
T
G
A
G
A
G
0/0 0/1 1/1
A/ C
G
A/ G
?
A
GGC T
G
Genotype sample
?
A
T
A
T A T GA A A T T G G T A T AGG C T
?
C
A
GGT A T
AAT
TA T G
Determine per-read likelihoods (PairHMM)
A
T
A
REF T A T G A A A T T G G T A T A G G C T
C
T
G
GLs + annotations
Genotype samples Assemble plausible haplotypes
Identify ActiveRegions
•  Determine likelihoods of reads vs. alleles from PairHMM output T
A
REF T A T G A A A T T G G T A T A G G C T
A
GGC T
GGT A T
AAT
•  Apply Bayes’ rule tGo assign a genotype for each variant posi9on pruned
READS
READS
A
A
A
A
C
C
C
C
T
T
T
A
T
T
T
T
T
TA T G
C
G
A
A
A
A
A
A
(-bamOut)
A
A
C
C
T
T
T
T
A
G
A
G
Genotype likelihoods given the data T
A
A
?
?
T
T
Genotype sample
G
0/0 0/1 1/1
?
A
C
G
T A T GA A A T T G G T A T AGG C T
Determine per-read likelihoods (PairHMM)
A
A
T
A/ C
G
A/ G
?
A
C
T
G
GLs + annotations
READS
READS
A
A
A
C
C
C
C
T
T
A
T
T
T
T
T
G
G
A
A
A
A
A
A
pruned
T A T GA A A T T G G T A T AG
(-bamOut)
Determine per-read likelihoods (PairHMM)
PairHMM A
T
A
A
?
T
READS
READS
T
T
T
A
T
T
T
T
T
C
G
G
A
A
A
A
A
A
C
T
G
A
GGC T
GGT A T
A
A/ G
?
A
T
AAT
TA T G
pruned
G
T A T GA A A T T G G T A T AGG C T
A
A
C
C
T
T
T
T
A
G
A
G
likelihoods of the reads vs the haplotypes (-bamOut)
Determine per-read likelihoods (PairHMM)
Reference: ATCGATCATAGCTAGCTGCG
?
?
Haplotype 1: ATCGA-CATAGCTAGCTGCG Haplotype 2: ATGGATCATAGCTTGCTGCG ?
?
Haplotype 3: ATCGA-CATAGCTTGCTGCG A
T
A
A
A
C
R A
3 -­‐ G
0/0 0/1 1/1
A/ C
G
C
Genotype sample
A/ G
T
G
GLs + annotations
Alleles 1 0.01 0.02 0.03 0.04 0.04 T 0.03 2 0.09 0.06 0.07 0.08 0.08 0.09 2 3 0.10 0.11 0.01 0.02 0.11 0.10 3 Take highest probability of read vs any of the haplotypes that contain the allele (for each variant posi1on) * These numbers are made up to give a sense of how the process works. In reality the numbers would be much smaller. 1 Reads Reads * Haplotypes 1 2 T
T
T
0/0
A/ C
G
?
T
C
A
A
G
A
G
Genotype s
?
A
REF T A T G A A A T T G G T A T A G G C T
T
T
T
T
G
Assemble plausible haplotypes
Identify ActiveRegions
A
A
A
A
C
C
C
C
T
A
A
C
C
GLs + anno
C
A
A
T
T
(-bamOut)
Determine per-read likelihoods (PairHMM)
-­‐ A
T
T 0.03 A
0.04 ?
C
T
A
A
?
T
A
T
T
T
T
G
?
C
T
G
1 0.08 0.09 2 0.11 0.10 3 G
A
G
Genotype sample
0/0 0/1 1/1
?
A/ C
A/ G
G
Reads Just plug in the numbers! Alleles A
C
C
GLs + annotations
T
T
T
C
C
A
A
A
A
C
C
(-bamOut)
Genotype sample
Determine per-read likelihoods (PairHMM)
A
T
Summed up formally: 4
C
T
G
A/ C
A/ G
?
A
T
C
G
GLs + annotations
Prior of the Likelihood
of
Simple genotype
likelihoods
for presentations
genotype! the genotype!
Pr{G} Pr{D|G}
Diploid
Pr{G|D} = Identify ActiveRegions
, Assemble
[Bayes’
rule]
plausible
haplotypes
assumption!
i Pr{Gi } Pr{D|Gi }
⇥
⇧ Pr{Dj |H1 } Pr{Dj |H2 }
Pr{D|G} =
+
where G = H1 H2
2
2
j
READS
READS
A
A
A
A
C
C
C
C
T
T
T
A
T
T
T
T
T
T
A
REF T A T G A A A T T G G T A T A G G C T
Bayesian model 0/0 0/1 1/1
G
?
T
T
?
A
SNP calling
4.1
A
A
?
G
A
G
T
T
T
C
G
G
A
A
A
A
A
A
A
T
pruned
A
A
C
C
SNP haploid
likelihood
Determine per-read likelihoods (PairHMM)
A
GGC T
G
T A T GA A A T T G G T A T AGG C T
(-bamOut)
Pr{D|H} is the haploid likelihood
function
4.1.1
A
GGT A T
AAT
TA T G
A
A
?
T
G
?
T
T
T
T
A
G
A
G
Genotype sample
0/0 0/1 1/1
Pr{Dj |H} = Pr{Dj |b}, [single baseA/ Cpileup]
⇤ ?
A/ G
?
1
j Dj = b,
GLs + annotations
Pr{Dj |b} =
otherwise.
j
A
C
4.1.2
T
A
T
G
C
T
G
Indel haploid likelihood
⌅
Joint Genotyping Add a joint analysis step to take advantage of cohort / pop gene1cs data READS
READS
T
T
T
A
T
T
T
T
T
AAT
C
G
G
A
A
A
A
A
A
READS
READS
T
A
TA T G
A
A
A
A
A
C
C
C
C
T
T
T
A
T
T
T
T
T
G
G
A
A
A
A
A
A
(-bamOut)
A
A
C
C
T
TT
T
T
A
?
A
?
Determine per-read likelihoods (PairHMM)
T
A
A
?
?
C
T
T
T
T
A
T
C
C
G
T
T
T
T
A
G
A
G
Genotype sample
0/0 0/1 1/1
A/ G
T
0/0 0/1 1/1
A/ G
T
GGC T
G
A/ C
A/ C
G
A
A
C
C
?
?
A
pruned
G
G
mul9-­‐ G
?
A
A
G
A
GT
Genotype sample
?
A
(-bamOut)
Determine per-read likelihoods (PairHMM)
A
A
T A T GA A A T T G G T A T AGG C T
GGC T
T A T GA A A T T G G T A T AGG C T
C
A
C
G
pruned
A
GGT A T
AAT
TA T G
A
GGT A T
T
A
REF T A T G A A A T T G G T A T A G G C T
REF T A T G A A A T T G G T A T A G G C T
A
A
A
A
C
C
C
C
Assemble plausible haplotypes
Identify ActiveRegions
Assemble plausible haplotypes
Identify ActiveRegions
GLs + annotations
G
GLs + annotations
PROTOCOL (HC+GGVCFS) Variant calling + joint genotyping workflow Analysis-­‐ready BAM file Analysis-­‐ready Analysis-­‐ready BAM file Analysis-­‐ready BAM file BAM file HaplotypeCaller Raw gVCF* file Raw gVCF* file Raw gVCF* file Raw gVCF* file GenotypeGVCFs Raw VCF file * Genomic VCF – see Day 2 presenta-on on cohort analysis Call variants per sample using HaplotypeCaller Analysis-­‐ready BAM file Analysis-­‐ready Analysis-­‐ready BAM file Analysis-­‐ready BAM file BAM file HaplotypeCaller Raw gVCF* file Raw gVCF* file Raw gVCF* file Raw gVCF* file GenotypeGVCFs Raw VCF file * Genomic VCF – see Day 2 presenta-on on cohort analysis TOOL TIPS HaplotypeCaller •  In GVCF mode, outputs records for all sites in gVCF format java –jar GenomeAnalysisTK.jar –T HaplotypeCaller \ –R human.fasta \ –I sample1.bam \ –o sample1.g.vcf \ –ERC GVCF \ -­‐-­‐variant_index_type LINEAR \ -­‐-­‐variant_index_parameter 128000 –  Extension must be .vcf, NOT .gvcf –  ERC mode also available in BP_RESOLUTION –  variant index arguments are related to file compression HC –ERC GVCF outputs a raw genomic VCF Analysis-­‐ready BAM file Analysis-­‐ready Analysis-­‐ready BAM file Analysis-­‐ready BAM file BAM file HaplotypeCaller Raw gVCF* file Raw gVCF* file Raw gVCF* file Raw gVCF* file GenotypeGVCFs Raw VCF file * Genomic VCF – see Day 2 presenta-on on cohort analysis Do joint analysis of all gVCFs for the cohort Analysis-­‐ready BAM file Analysis-­‐ready Analysis-­‐ready BAM file Analysis-­‐ready BAM file BAM file HaplotypeCaller Raw gVCF* file Raw gVCF* file Raw gVCF* file Raw gVCF* file GenotypeGVCFs Raw VCF file * Genomic VCF – see Day 2 presenta-on on cohort analysis TOOL TIPS GenotypeGVCFs •  Performs joint genotyping on all samples together java –jar GenomeAnalysisTK.jar –T GenotypeGVCFs\ –R human.fasta \ –V sample1.g.vcf \ –V sample2.g.vcf \ –V sampleN.g.vcf \ –o output.vcf •  If >200 samples, combine in batches first using CombineGVCFs Final result is a regular mul9sample VCF Analysis-­‐ready BAM file Analysis-­‐ready Analysis-­‐ready BAM file Analysis-­‐ready BAM file BAM file HaplotypeCaller Raw gVCF* file Raw gVCF* file Raw gVCF* file Raw gVCF* file GenotypeGVCFs Raw VCF file * Genomic VCF – see Day 2 presenta-on on cohort analysis RESULTS Did the muta9on calling work properly? •  Raw callsets are oUen very large and full of false posi9ve muta9on calls •  Further work is needed before this callset can be used for any meaningful analysis! •  See downstream steps (VQSR etc.) on how to assess the quality of a variant callset talks Further reading hpp://www.broadins9tute.org/gatk/guide/best-­‐prac9ces hpp://www.broadins9tute.org/gatk/guide/ar9cle?id=1237 hpp://www.broadins9tute.org/gatk/gatkdocs/
org_broadins9tute_s9ng_gatk_walkers_genotyper_UnifiedGenotyper.html hpp://www.broadins9tute.org/gatk/gatkdocs/
org_broadins9tute_s9ng_gatk_walkers_haplotypecaller_HaplotypeCaller.html