De novo comparison of huge metagenomics experiments coming from NGS technologies

Application to the Tara Oceans project
Nicolas MAILLET - October 7th 2014
Work carried out under the supervision of
Dominique LAVENIER & Pierre PETERLONGO
Credit: Sauveur D.
1. Introduction
Comparative Metagenomics: What do we want?
How to compare two samples?
Mapping sequences on current knowledge
De novo assembly
De novo comparative metagenomics!
Quantify similarity between two metagenomic datasets at the read level, i.e. find similar sequences between two datasets
Highly efficient approach to scale with huge metagenomic datasets (100-500 million reads)
2. Methodology
But… How to define "similarity"?
A rough but efficient notion of "similar sequences":
Given integers k and t, two sequences s1 and s2 are said to be similar if and only if they share at least t non-overlapping k-mers.
Based on this definition, the Compareads software was developed.
Nicolas Maillet, Claire Lemaitre, Rayan Chikhi, Dominique Lavenier and Pierre Peterlongo, “Compareads: comparing huge metagenomic experiments.”, 2012.
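The similarity definition above can be illustrated with a minimal Python sketch. This is a direct, unoptimized reading of the definition and not the Compareads implementation. The function names are hypothetical, non-overlap is enforced on s1 only, and reverse complements are ignored; these are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def count_shared_kmers(s1, s2, k):
    """Greedily count non-overlapping k-mers of s1 that also occur in s2.

    Assumption: non-overlap is checked on s1 only. Scanning left to right
    and jumping k positions after each hit maximizes that count.
    """
    index = kmers(s2, k)
    count = 0
    i = 0
    while i + k <= len(s1):
        if s1[i:i + k] in index:
            count += 1
            i += k  # skip past this k-mer so the next hit cannot overlap it
        else:
            i += 1
    return count


def are_similar(s1, s2, k, t):
    """True if s1 and s2 share at least t non-overlapping k-mers."""
    return count_shared_kmers(s1, s2, k) >= t


s1 = "ACGTACGTAC"
s2 = "TTACGTACGG"
print(are_similar(s1, s2, k=4, t=2))  # True
print(are_similar(s1, s2, k=4, t=3))  # False
```

Building a plain set of k-mers per sequence is only practical for small inputs; it is used here to keep the definition readable.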
3. Some results
Tara Oceans expedition
0.8 to 5 μm
180 to 2000 μm
In collaboration with Thomas Vannier & Olivier Jaillon (Genoscope)
4. New methodology
Compareads enhanced: COMMET
Direct intersection of multiple samples
Logical operations to combine comparison results
Blue subset: reads from A not similar to any read from B.
Green subset: reads from A similar to at least one read from B and one read from C.
Orange subset: reads from B similar to at least one read from A or one read from C.
Red subset: reads from C similar to at least one read from A, but not similar to any read from B.
2x faster for small datasets, more for huge datasets.
Storage footprint divided by a factor of 100 using bit vectors.
Nicolas Maillet, Guillaume Collet, Thomas Vannier, Dominique Lavenier and Pierre Peterlongo, “COMMET: comparing and combining multiple metagenomic datasets.”, in press.
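The idea of storing comparison results as bit vectors and combining them with logical operations can be sketched as follows. This is a toy illustration, not COMMET's actual data structure or file format. The `ReadSubset` class and its methods are hypothetical, and a Python integer stands in for the bit vector (bit i set means read i belongs to the subset).

```python
class ReadSubset:
    """Subset of a read set, stored as one bit per read (hypothetical helper)."""

    def __init__(self, n_reads, bits=0):
        self.n_reads = n_reads
        # Keep only the n_reads lowest bits.
        self.bits = bits & ((1 << n_reads) - 1)

    @classmethod
    def from_indices(cls, n_reads, indices):
        """Build a subset from the indices of the selected reads."""
        bits = 0
        for i in indices:
            bits |= 1 << i
        return cls(n_reads, bits)

    def __and__(self, other):   # intersection (AND)
        return ReadSubset(self.n_reads, self.bits & other.bits)

    def __or__(self, other):    # union (OR)
        return ReadSubset(self.n_reads, self.bits | other.bits)

    def __invert__(self):       # complement (NOT)
        return ReadSubset(self.n_reads, ~self.bits)

    def indices(self):
        """Return the sorted indices of the selected reads."""
        return [i for i in range(self.n_reads) if (self.bits >> i) & 1]


# Made-up example: sample A has 6 reads.
a_sim_b = ReadSubset.from_indices(6, [0, 1, 3])  # reads of A similar to B
a_sim_c = ReadSubset.from_indices(6, [1, 2, 3])  # reads of A similar to C

print((~a_sim_b).indices())           # [2, 4, 5]  reads of A not similar to B
print((a_sim_b & a_sim_c).indices())  # [1, 3]     similar to both B and C
print((a_sim_b | a_sim_c).indices())  # [0, 1, 2, 3]  similar to B or C
print((a_sim_c & ~a_sim_b).indices()) # [2]        similar to C but not to B
```

The four expressions mirror the shape of the blue, green, orange and red subsets described above (NOT, AND, OR, AND NOT). Keeping one bit per read, instead of a copy of each selected read, is what makes such a representation compact.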
5. Some new results!
Compareads enhanced: COMMET
New outputs
Dendrogram and heatmaps representing the similarity
Metasoil study
T. O. Delmont et al., “Structure, fluctuation and magnitude of a natural grassland soil metagenome”, 2012
Original study
COMMET output (3.5x faster)
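As a rough illustration of how a heatmap and a dendrogram can be derived from pairwise comparison results, the sketch below builds a matrix of similarity percentages and clusters the samples. The slides do not say how COMMET computes these outputs, so every choice here is an assumption: the percentage measure, the symmetrization by averaging, the naive average-linkage clustering, the function names and the made-up counts.

```python
def similarity_matrix(n_reads, n_similar):
    """Percentage of reads of sample i that are similar to sample j.

    n_reads[i] is the size of sample i; n_similar[i][j] is the number of
    reads of sample i found similar to at least one read of sample j.
    """
    n = len(n_reads)
    return [[100.0 if i == j else 100.0 * n_similar[i][j] / n_reads[i]
             for j in range(n)] for i in range(n)]


def symmetric_distance(sim):
    """Turn the asymmetric percentages into a symmetric distance in [0, 1]."""
    n = len(sim)
    return [[1.0 - (sim[i][j] + sim[j][i]) / 200.0 for j in range(n)]
            for i in range(n)]


def average_linkage(names, dist):
    """Naive average-linkage clustering; returns the tree as nested tuples."""
    clusters = {i: (names[i], [i]) for i in range(len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = clusters[ids[x]][1], clusters[ids[y]][1]
                # Mean distance over all pairs of members of the two clusters.
                d = sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))
                if best is None or d < best[0]:
                    best = (d, ids[x], ids[y])
        _, x, y = best
        tree = (clusters[x][0], clusters[y][0])
        members = clusters[x][1] + clusters[y][1]
        del clusters[x], clusters[y]
        clusters[next_id] = (tree, members)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Made-up counts for three samples A, B and C.
names = ["A", "B", "C"]
n_reads = [100, 200, 100]
n_similar = [[0, 80, 10], [150, 0, 30], [20, 40, 0]]

sim = similarity_matrix(n_reads, n_similar)    # values behind a heatmap
tree = average_linkage(names, symmetric_distance(sim))
print(tree)  # ('C', ('A', 'B'))
```

The matrix `sim` holds the values a heatmap would display, and the nested tuples give the merge order a dendrogram would draw: A and B, the most similar pair, are joined first.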
Thank you for your attention!
http://github.com/pierrepeterlongo/commet