talks Variant Calling + Genotyping Examine the evidence for varia9on from reference via Bayesian inference We are here in the Best Practices workflow! Variant Calling PURPOSE Real muta9ons are hidden in the noise PRINCIPLES Muta9on or noise? A large scale Bayesian inference problem Variant callers in GATK • UnifiedGenotyper Call SNPs and indels separately by considering each variant locus independently – Accepts any ploidy – Pooled calling • HaplotypeCaller Call SNPs, indels, and some SVs simultaneously by doing local re-‐assembly and considering haplotypes – More accurate (esp. indels) – Reference confidence model – Replaces UG HaplotypeCaller method overview • Call SNPs, indels, and some SVs simultaneously by doing local re-‐assembly and considering haplotypes – Determine if a region has poten1al varia1on – Make deBruijn assembly graph of the region – Paths in the graph = poten1al haplotypes to evaluate – Calculate haplotype likelihoods given the data (PairHMM) – Iden1fy variants on most likely haplotypes – Compute allele frequency distribu1on to determine most likely allele count, and emit a variant call if appropriate – If emiTng a variant, assign genotype to each sample HC method illustrated Assemble plausible haplotypes Identify ActiveRegions READS READS A A A A C C C C T T T A T T T T T C G G A A A A A A T (-bamOut) A A pruned T A A C C T T T T ? T G A G A G 0/0 0/1 1/1 A/ C G A/ G ? A GGC T G Genotype sample ? A T A T A T GA A A T T G G T A T AGG C T ? C A GGT A T AAT TA T G Determine per-read likelihoods (PairHMM) A T A REF T A T G A A A T T G G T A T A G G C T C T G GLs + annotations Determining Ac9veRegions Assemble plausible haplotypes • Sliding window along the genome reference T A A GGC T GGT A T AAT TA T G A C G • Count mismatches, indels and soU pruned T A T GA A A T T G G T A T AGG C T clips Identify ActiveRegions REF T A T G A A A T T G G T A T A G G C T READS READS A A A A C C C C T T T A T T T T T G G A A A A A A (-bamOut) A A C C T T T T A G A G Measure of entropy Determine per-read likelihoods (PairHMM) A T A A ? ? ? T G Over threshold: trigger “Ac1veRegion” where HC w0/0 ill proceed 0/1 1/1 A C T Genotype sample T A/ C G A/ G ? A C T G Can specify a per-cular interval using the –forceAc-ve parameter GLs + annotations HC method illustrated Assemble plausible haplotypes Identify ActiveRegions READS READS A A A A C C C C T T T A T T T T T C G G A A A A A A T (-bamOut) A A pruned T A A C C T T T T ? T G A G A G 0/0 0/1 1/1 A/ C G A/ G ? A GGC T G Genotype sample ? A T A T A T GA A A T T G G T A T AGG C T ? C A GGT A T AAT TA T G Determine per-read likelihoods (PairHMM) A T A REF T A T G A A A T T G G T A T A G G C T C T G GLs + annotations HC method illustrated Assemble plausible haplotypes Identify ActiveRegions READS READS A A A A C C C C T T T A T T T T T C G G A A A A A A T (-bamOut) A A pruned T A A C C T T T T ? T G A G A G 0/0 0/1 1/1 A/ C G A/ G ? A GGC T G Genotype sample ? A T A T A T GA A A T T G G T A T AGG C T ? C A GGT A T AAT TA T G Determine per-read likelihoods (PairHMM) A T A REF T A T G A A A T T G G T A T A G G C T C T G GLs + annotations Assemble plausible haplotypes Identify ActiveRegions • Local re-‐assembly Assemble plausible haplotypes T A T GA A A T T G G T A T AGG C T • Traverse graph to collect A T A T G most likely h aplotypes T A G A A A READS READS T T T T T C C C C C (-bamOut) pruned A A C C T T T T A G A G Genotype sample ? A ? T GGC T G A T + candidate G A Likely haplotypes variant sites 0/0 0/1 ? C A T A T GA A A T T G G T A T AGG C T A A A A A Determine per-read likelihoods (PairHMM) T A GGT A T AAT TA T G • Align haplotypes to ref using Smith-‐Waterman A T A REF T A/ C G A/ G ? A 1/1 C T G GLs + annotations Can have HC output the reassembled reads and selected halpotypes using the –bamOut parameter Assemble plausible haplotypes Identify ActiveRegions READS READS A A A A C C C C T T T A T T T T T T C (-bamOut) A A T ? T A A C C T T T T GGC T G A G A G Genotype sample G 0/0 0/1 1/1 A/ C G A/ G ? A Traverse the graph to enumerate the possible haplotypes Assembly of large genomes using second-‐genera-on sequencing. Schatz. Genome Research. 2010. pruned ? A T A T A T GA A A T T G G T A T AGG C T ? C A GGT A T AAT TA T G G G A A A A A A Determine per-read likelihoods (PairHMM) A T A REF T A T G A A A T T G G T A T A G G C T C T G GLs + annotations Ar9factual SNPs and small indels caused by large indel can be recovered by assembly NA12878 original read data Haplotype Caller (validated) 15 Mul9ple caller ar9facts that are hard to filter out, since they are well supported by read data Allele determination is much more accurate through local assembly of candidate haplotypes Original BWA alignments -assembly: +assembly: BAM read bases are all iden9cal; individual alignments differ based on the the whims of the mapper 1 multi-allelic SNP and two 1bp indels are called Only the complex substitution (TT to TAC) is called 16 HC method illustrated Assemble plausible haplotypes Identify ActiveRegions READS READS A A A A C C C C T T T A T T T T T C G G A A A A A A T (-bamOut) A A pruned T A A C C T T T T ? T G A G A G 0/0 0/1 1/1 A/ C G A/ G ? A GGC T G Genotype sample ? A T A T A T GA A A T T G G T A T AGG C T ? C A GGT A T AAT TA T G Determine per-read likelihoods (PairHMM) A T A REF T A T G A A A T T G G T A T A G G C T C T G GLs + annotations HC method illustrated Assemble plausible haplotypes Identify ActiveRegions READS READS A A A A C C C C T T T A T T T T T C G G A A A A A A T (-bamOut) A A pruned T A A C C T T T T ? T G A G A G 0/0 0/1 1/1 A/ C G A/ G ? A GGC T G Genotype sample ? A T A T A T GA A A T T G G T A T AGG C T ? C A GGT A T AAT TA T G Determine per-read likelihoods (PairHMM) A T A REF T A T G A A A T T G G T A T A G G C T C T G GLs + annotations Score haplotypes using PairHMM Assemble plausible haplotypes Identify ActiveRegions • Calculate haplotype likelihoods given the data A REF T A T G A A A T T G G T A T A G G C T READS READS A – PairHMM aTTligns eGach read to each haplotype C A A A C C C C T A T T T T T TA T G G A A A A A A GGC T pruned G T T T T A G A G Likelihood of the read being observed if a given (-bamOut) haplotype is considered true T A A T Genotype sample G 0/0 0/1 1/1 ? A ? T A A A C C ? C A GGT A T T A T GA A A T T G G T A T AGG C T Determine per-read likelihoods (PairHMM) A T AAT T A/ C G A/ G ? A C T G GLs + annotations READS READS A A A C C C C T T A T T T T T G G A A A A A A pruned T A T GA A A T T G G T A T AG (-bamOut) Determine per-read likelihoods (PairHMM) A T A A ? T PairHMM Empirical gap penal9es = derived from data by BQSR Base mismatch penal9es = base quality scores -‐> likelihoods of the reads vs the haplotypes -‐> store in matrix 0/0 A/ C A/ G ? A Reads Haplotypes Aij = probability of read vs haplotype A G A G Genotype s G G ? T T T T T ? A C T A A C C C T G GLs + anno HC method illustrated Assemble plausible haplotypes Identify ActiveRegions READS READS A A A A C C C C T T T A T T T T T C G G A A A A A A T (-bamOut) A A pruned T A A C C T T T T ? T G A G A G 0/0 0/1 1/1 A/ C G A/ G ? A GGC T G Genotype sample ? A T A T A T GA A A T T G G T A T AGG C T ? C A GGT A T AAT TA T G Determine per-read likelihoods (PairHMM) A T A REF T A T G A A A T T G G T A T A G G C T C T G GLs + annotations HC method illustrated Assemble plausible haplotypes Identify ActiveRegions READS READS A A A A C C C C T T T A T T T T T C G G A A A A A A T (-bamOut) A A pruned T A A C C T T T T ? T G A G A G 0/0 0/1 1/1 A/ C G A/ G ? A GGC T G Genotype sample ? A T A T A T GA A A T T G G T A T AGG C T ? C A GGT A T AAT TA T G Determine per-read likelihoods (PairHMM) A T A REF T A T G A A A T T G G T A T A G G C T C T G GLs + annotations Genotype samples Assemble plausible haplotypes Identify ActiveRegions • Determine likelihoods of reads vs. alleles from PairHMM output T A REF T A T G A A A T T G G T A T A G G C T A GGC T GGT A T AAT • Apply Bayes’ rule tGo assign a genotype for each variant posi9on pruned READS READS A A A A C C C C T T T A T T T T T TA T G C G A A A A A A (-bamOut) A A C C T T T T A G A G Genotype likelihoods given the data T A A ? ? T T Genotype sample G 0/0 0/1 1/1 ? A C G T A T GA A A T T G G T A T AGG C T Determine per-read likelihoods (PairHMM) A A T A/ C G A/ G ? A C T G GLs + annotations READS READS A A A C C C C T T A T T T T T G G A A A A A A pruned T A T GA A A T T G G T A T AG (-bamOut) Determine per-read likelihoods (PairHMM) PairHMM A T A A ? T READS READS T T T A T T T T T C G G A A A A A A C T G A GGC T GGT A T A A/ G ? A T AAT TA T G pruned G T A T GA A A T T G G T A T AGG C T A A C C T T T T A G A G likelihoods of the reads vs the haplotypes (-bamOut) Determine per-read likelihoods (PairHMM) Reference: ATCGATCATAGCTAGCTGCG ? ? Haplotype 1: ATCGA-CATAGCTAGCTGCG Haplotype 2: ATGGATCATAGCTTGCTGCG ? ? Haplotype 3: ATCGA-CATAGCTTGCTGCG A T A A A C R A 3 -‐ G 0/0 0/1 1/1 A/ C G C Genotype sample A/ G T G GLs + annotations Alleles 1 0.01 0.02 0.03 0.04 0.04 T 0.03 2 0.09 0.06 0.07 0.08 0.08 0.09 2 3 0.10 0.11 0.01 0.02 0.11 0.10 3 Take highest probability of read vs any of the haplotypes that contain the allele (for each variant posi1on) * These numbers are made up to give a sense of how the process works. In reality the numbers would be much smaller. 1 Reads Reads * Haplotypes 1 2 T T T 0/0 A/ C G ? T C A A G A G Genotype s ? A REF T A T G A A A T T G G T A T A G G C T T T T T G Assemble plausible haplotypes Identify ActiveRegions A A A A C C C C T A A C C GLs + anno C A A T T (-bamOut) Determine per-read likelihoods (PairHMM) -‐ A T T 0.03 A 0.04 ? C T A A ? T A T T T T G ? C T G 1 0.08 0.09 2 0.11 0.10 3 G A G Genotype sample 0/0 0/1 1/1 ? A/ C A/ G G Reads Just plug in the numbers! Alleles A C C GLs + annotations T T T C C A A A A C C (-bamOut) Genotype sample Determine per-read likelihoods (PairHMM) A T Summed up formally: 4 C T G A/ C A/ G ? A T C G GLs + annotations Prior of the Likelihood of Simple genotype likelihoods for presentations genotype! the genotype! Pr{G} Pr{D|G} Diploid Pr{G|D} = Identify ActiveRegions , Assemble [Bayes’ rule] plausible haplotypes assumption! i Pr{Gi } Pr{D|Gi } ⇥ ⇧ Pr{Dj |H1 } Pr{Dj |H2 } Pr{D|G} = + where G = H1 H2 2 2 j READS READS A A A A C C C C T T T A T T T T T T A REF T A T G A A A T T G G T A T A G G C T Bayesian model 0/0 0/1 1/1 G ? T T ? A SNP calling 4.1 A A ? G A G T T T C G G A A A A A A A T pruned A A C C SNP haploid likelihood Determine per-read likelihoods (PairHMM) A GGC T G T A T GA A A T T G G T A T AGG C T (-bamOut) Pr{D|H} is the haploid likelihood function 4.1.1 A GGT A T AAT TA T G A A ? T G ? T T T T A G A G Genotype sample 0/0 0/1 1/1 Pr{Dj |H} = Pr{Dj |b}, [single baseA/ Cpileup] ⇤ ? A/ G ? 1 j Dj = b, GLs + annotations Pr{Dj |b} = otherwise. j A C 4.1.2 T A T G C T G Indel haploid likelihood ⌅ Joint Genotyping Add a joint analysis step to take advantage of cohort / pop gene1cs data READS READS T T T A T T T T T AAT C G G A A A A A A READS READS T A TA T G A A A A A C C C C T T T A T T T T T G G A A A A A A (-bamOut) A A C C T TT T T A ? A ? Determine per-read likelihoods (PairHMM) T A A ? ? C T T T T A T C C G T T T T A G A G Genotype sample 0/0 0/1 1/1 A/ G T 0/0 0/1 1/1 A/ G T GGC T G A/ C A/ C G A A C C ? ? A pruned G G mul9-‐ G ? A A G A GT Genotype sample ? A (-bamOut) Determine per-read likelihoods (PairHMM) A A T A T GA A A T T G G T A T AGG C T GGC T T A T GA A A T T G G T A T AGG C T C A C G pruned A GGT A T AAT TA T G A GGT A T T A REF T A T G A A A T T G G T A T A G G C T REF T A T G A A A T T G G T A T A G G C T A A A A C C C C Assemble plausible haplotypes Identify ActiveRegions Assemble plausible haplotypes Identify ActiveRegions GLs + annotations G GLs + annotations PROTOCOL (HC+GGVCFS) Variant calling + joint genotyping workflow Analysis-‐ready BAM file Analysis-‐ready Analysis-‐ready BAM file Analysis-‐ready BAM file BAM file HaplotypeCaller Raw gVCF* file Raw gVCF* file Raw gVCF* file Raw gVCF* file GenotypeGVCFs Raw VCF file * Genomic VCF – see Day 2 presenta-on on cohort analysis Call variants per sample using HaplotypeCaller Analysis-‐ready BAM file Analysis-‐ready Analysis-‐ready BAM file Analysis-‐ready BAM file BAM file HaplotypeCaller Raw gVCF* file Raw gVCF* file Raw gVCF* file Raw gVCF* file GenotypeGVCFs Raw VCF file * Genomic VCF – see Day 2 presenta-on on cohort analysis TOOL TIPS HaplotypeCaller • In GVCF mode, outputs records for all sites in gVCF format java –jar GenomeAnalysisTK.jar –T HaplotypeCaller \ –R human.fasta \ –I sample1.bam \ –o sample1.g.vcf \ –ERC GVCF \ -‐-‐variant_index_type LINEAR \ -‐-‐variant_index_parameter 128000 – Extension must be .vcf, NOT .gvcf – ERC mode also available in BP_RESOLUTION – variant index arguments are related to file compression HC –ERC GVCF outputs a raw genomic VCF Analysis-‐ready BAM file Analysis-‐ready Analysis-‐ready BAM file Analysis-‐ready BAM file BAM file HaplotypeCaller Raw gVCF* file Raw gVCF* file Raw gVCF* file Raw gVCF* file GenotypeGVCFs Raw VCF file * Genomic VCF – see Day 2 presenta-on on cohort analysis Do joint analysis of all gVCFs for the cohort Analysis-‐ready BAM file Analysis-‐ready Analysis-‐ready BAM file Analysis-‐ready BAM file BAM file HaplotypeCaller Raw gVCF* file Raw gVCF* file Raw gVCF* file Raw gVCF* file GenotypeGVCFs Raw VCF file * Genomic VCF – see Day 2 presenta-on on cohort analysis TOOL TIPS GenotypeGVCFs • Performs joint genotyping on all samples together java –jar GenomeAnalysisTK.jar –T GenotypeGVCFs\ –R human.fasta \ –V sample1.g.vcf \ –V sample2.g.vcf \ –V sampleN.g.vcf \ –o output.vcf • If >200 samples, combine in batches first using CombineGVCFs Final result is a regular mul9sample VCF Analysis-‐ready BAM file Analysis-‐ready Analysis-‐ready BAM file Analysis-‐ready BAM file BAM file HaplotypeCaller Raw gVCF* file Raw gVCF* file Raw gVCF* file Raw gVCF* file GenotypeGVCFs Raw VCF file * Genomic VCF – see Day 2 presenta-on on cohort analysis RESULTS Did the muta9on calling work properly? • Raw callsets are oUen very large and full of false posi9ve muta9on calls • Further work is needed before this callset can be used for any meaningful analysis! • See downstream steps (VQSR etc.) on how to assess the quality of a variant callset talks Further reading hpp://www.broadins9tute.org/gatk/guide/best-‐prac9ces hpp://www.broadins9tute.org/gatk/guide/ar9cle?id=1237 hpp://www.broadins9tute.org/gatk/gatkdocs/ org_broadins9tute_s9ng_gatk_walkers_genotyper_UnifiedGenotyper.html hpp://www.broadins9tute.org/gatk/gatkdocs/ org_broadins9tute_s9ng_gatk_walkers_haplotypecaller_HaplotypeCaller.html
© Copyright 2024