Johan Dahlberg: Primary data analysis provided by NGI

Primary data analysis
provided by NGI
Johan Dahlberg, NGI
[email protected]
Why?
Sequencing output
Centralize
● Processing this much data is difficult
… and boring
● Better usage of existing resources
● Don’t do what others can do for you
● Make your needs known
● Know the limits of automation!
How does it work?
The NGI pipeline
[Diagram: NGI_Pipeline]
● Components: Trigger, Engine (Piper), Helper, Datastore (Charon)
● Arrow labels: "Manual commands" (twice), "Fetches info from", "Automatically triggers"
Piper
Best practice analysis workflow (WGS)
[Workflow diagram]
● Inputs:
o Raw data (fastq)
o Genotyping data
● Steps:
o Map to genome (bwa + samtools)
o Alignment quality control (Qualimap)
o Indel realignment (GATK)
o Duplicate marking (Picard)
o Base quality recalibration (GATK)
o Variant calling (GATK)
o Variant quality recalibration and evaluation (GATK)
o Variant annotation (SnpEff)
o Verify sample identity (GATK)
● Outputs:
o Processed aligned reads (bam)
o Variant calls (vcf)
o Quality control data
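As a rough illustration, the stages named in the workflow diagram could be written down as an ordered list of (step, tool) pairs. The ordering below is an assumption based on the diagram labels, not Piper's actual configuration, and `WORKFLOW_STEPS` and `steps_using` are hypothetical names.

```python
# Hypothetical sketch: the WGS workflow stages from the slide as plain data.
# The order is an assumption; it is not taken from Piper's source code.
WORKFLOW_STEPS = [
    ("Map to genome", "bwa + samtools"),
    ("Alignment quality control", "Qualimap"),
    ("Indel realignment", "GATK"),
    ("Duplicate marking", "Picard"),
    ("Base quality recalibration", "GATK"),
    ("Variant calling", "GATK"),
    ("Variant quality recalibration and evaluation", "GATK"),
    ("Variant annotation", "SnpEff"),
    ("Verify sample identity", "GATK"),
]


def steps_using(tool):
    """Return the names of all steps whose tool label mentions `tool`."""
    return [
        name
        for name, tools in WORKFLOW_STEPS
        if tool.lower() in tools.lower()
    ]


if __name__ == "__main__":
    # Print the stages as a numbered list, then the GATK-based ones.
    for number, (name, tools) in enumerate(WORKFLOW_STEPS, start=1):
        print(f"{number}. {name} ({tools})")
    print("GATK steps:", ", ".join(steps_using("GATK")))
```

Keeping the stages as data makes it easy to see which tools a given step depends on.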
References and versions
bwa: 0.7.5a
samtools: 0.1.19
qualimap: v2.0
snpEff: 4.0
gatk: 3.3-0-geee94ec
reference: human_g1k_v37
resources: GATK bundle 2.8
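A minimal sketch of how the versions listed above could be checked against what is installed on a system. The version strings come from the slide; the helper `find_mismatches` and the `installed` dictionary are hypothetical, and how the installed versions are collected is left out.

```python
# Tool versions as listed on the slide.
PINNED_VERSIONS = {
    "bwa": "0.7.5a",
    "samtools": "0.1.19",
    "qualimap": "v2.0",
    "snpEff": "4.0",
    "gatk": "3.3-0-geee94ec",
}


def find_mismatches(installed):
    """Return {tool: (expected, found)} for tools that differ from the pinned version.

    `installed` maps tool names to version strings; a tool that is
    missing from it is reported with `found` set to None.
    """
    mismatches = {}
    for tool, expected in PINNED_VERSIONS.items():
        found = installed.get(tool)
        if found != expected:
            mismatches[tool] = (expected, found)
    return mismatches


if __name__ == "__main__":
    # Hypothetical example: one tool has a different version, one is missing.
    installed = dict(PINNED_VERSIONS, samtools="1.2")
    del installed["gatk"]
    print(find_mismatches(installed))
```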
What will you get?
● For each sample:
o Raw sequencing data in fastq format
o Genotyping data in vcf and idat format (optional)
o Processed alignments in bam format
o Variant calls in gvcf and vcf format
o Quality control data (alignment statistics, GC content, etc.)
● Per project:
o Project quality control summary statistics
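A small sketch of how a delivered sample directory could be checked for the file formats listed above. The file suffixes and the one-directory-per-sample layout are assumptions for illustration, not NGI's actual delivery structure; `EXPECTED_SUFFIXES` and `missing_deliverables` are hypothetical names.

```python
from pathlib import Path

# Formats named on the slide; the suffixes are an assumed naming convention.
EXPECTED_SUFFIXES = {
    "raw reads": (".fastq", ".fastq.gz"),
    "alignments": (".bam",),
    "variant calls": (".vcf", ".vcf.gz"),
}


def missing_deliverables(sample_dir):
    """Return the labels of expected file types not found under `sample_dir`."""
    # Collect the names of all regular files, searching subdirectories too.
    names = [p.name for p in Path(sample_dir).rglob("*") if p.is_file()]
    return [
        label
        for label, suffixes in EXPECTED_SUFFIXES.items()
        if not any(name.endswith(suffixes) for name in names)
    ]
```

Calling `missing_deliverables` on a sample directory returns an empty list when at least one file of each expected type is present.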
Where will you get it?
● Delivered to an Uppmax resource (at the moment, Milou)
The team
NGI Stockholm
Francesco Vezzi
Per Kraulis
Mario Giovacchini
Guillermo Carrasco
Denis Moreno
Pär Lundin
Pelin Akan
Phil Ewels
NGI Uppsala
Jessica Nordlund
Patrik Smeds
Per Lundmark
Pontus Larsson
Johan Dahlberg
It’s all out there,
let us know what you think!
github.com/NationalGenomicsInfrastructure
[email protected]
Questions?