Supplementary Material of:
IVA: accurate de novo assembly of RNA virus genomes
Martin Hunt 1, Astrid Gall 1, Swee Hoe Ong 1, Jacqui Brener 2, Bridget Ferns 3,
Philip Goulder 2, Eleni Nastouli 4, Jacqueline A Keane 1, Paul Kellam 1,3 and Thomas
D Otto 1
1 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK
2 Department of Paediatrics, University of Oxford, Oxford, UK
3 Division of Infection and Immunity, Faculty of Medical Sciences, University College London, London, UK
4 Department of Virology, University College London Hospital NHS Foundation Trust, London, UK
Contents
1 Data and scripts
2 IVA assembly methods
2.1 Third-party software
2.2 Illumina adapter and PCR primer trimming
2.3 Seed generation
2.4 Contig cleaning and merging
3 Sample QC
3.1 Description of test data
3.2 Removing low quality samples
3.3 Reference databases
4 Benchmarking
4.1 IVA
4.2 VICUNA
4.3 PRICE
4.4 Trinity
5 Assembly validation
6 Run time and memory usage
7 Example assemblies
1 Data and scripts
The supplementary data and scripts developed specifically for this project can be
found in the GitHub repository https://github.com/sanger-pathogens/iva-publication.
The Python 3 package Fastaq (https://github.com/sanger-pathogens/Fastaq) must
be installed for the scripts in the iva-publication repository to work. The script
fastaq is distributed with the Fastaq package; all other scripts are found in the
Scripts/ directory of the iva-publication repository.
The supplementary file supplementary_tables.xls contains supplementary Tables S3–S8. These tables are also available as tab-delimited plain text files in the
iva-publication GitHub repository.
2 IVA assembly methods
A flowchart describing the assembly algorithm used by IVA is shown in Figure S1.
2.1 Third-party software
The installation of IVA requires the user to install a set of third-party dependencies.
IVA version 0.11.0 requires the following programs and versions:
• kmc version 2.1
• MUMmer version 3.23
• samtools version 0.1.19-44428cd
• SMALT version 0.7.6
• Trimmomatic version 0.32.
2.2 Illumina adapter and PCR primer trimming
IVA can optionally trim Illumina adapters and PCR primer sequences from the
reads before assembling. Illumina adapters are removed from a pair of FASTQ files
(in.1.fq and in.2.fq) using the following call to Trimmomatic:
java -jar trimmomatic-0.32.jar PE in.1.fq in.2.fq \
trimmo.1.fq out.up.1.fq trimmo.2.fq out.2.up.2.fq \
ILLUMINACLIP:adapters.fasta:2:10:7:1 MINLEN:50
and the resulting paired reads trimmo.1.fq and trimmo.2.fq are retained. A file
of Illumina adapters is distributed with IVA, but the user can provide their own file.
Next, perfect matches to PCR primers are removed using:
fastaq sequence_trim --revcomp trimmo.1.fq trimmo.2.fq \
out.1.fq out.2.fq pcr_primers.fa
and the resulting paired reads out.1.fq and out.2.fq are then used for assembly.
The user must provide a FASTA file of PCR primers.
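As a rough illustration of what perfect-match primer trimming involves, the following Python sketch removes a primer found at the start of a read and the reverse complement of a primer found at the end of a read. It is not the Fastaq implementation: the function names and the exact trimming behaviour (one match per read end, sequences only, no quality strings) are assumptions made for this example.

```python
# Illustrative sketch only: NOT the Fastaq implementation.
# Assumption: a primer is trimmed only when it matches the read start exactly,
# and its reverse complement only when it matches the read end exactly.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def trim_primers(read, primers):
    """Trim at most one exact primer match from each end of a read."""
    primers = [p for p in primers if p]  # ignore empty primer strings
    for primer in primers:
        if read.startswith(primer):
            read = read[len(primer):]
            break
    for primer in primers:
        rc = reverse_complement(primer)
        if read.endswith(rc):
            read = read[:-len(rc)]
            break
    return read
```

With the hypothetical primer ACGTAC, the read ACGTACGGGTTTGTACGT is trimmed to GGGTTT. A real FASTQ trimmer would also have to trim the quality string by the same amounts.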
2.3 Seed generation
When generating a seed kmer, the length is by default set to two-thirds of the read
length, up to a maximum of 95. To save time, the first 100,000 reads are used as
input to kmc, which is run with
kmc -fa -m4 -k $k -sf 1 -ci 200 -cs 1000000000 -cx 1000000000 \
input_reads.fa kmc_out $PWD
kmc_dump kmc_out count
to make a file, count, of kmer counts, where
$k = \min\left(\tfrac{2}{3} \times \text{read length},\ 95\right)$.
2.4 Contig cleaning and merging
First, low quality contig ends are trimmed by removing bases that have more than
80% of their coverage on either the forward or reverse strand only. These were
found by mapping the reads with SMALT with index options -k 19 -s 11 and
map options -r 1 -x -y 0.9 -i 1000 and then running samtools mpileup with
each of --rf 0x10 and --ff 0x10 to get the read depth on each strand.
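The end-trimming rule can be sketched in Python as follows, assuming the per-position forward and reverse read depths have already been extracted from the two mpileup runs. The function names, the half-open coordinate convention and the treatment of zero-depth positions as low quality are assumptions of this sketch, not details taken from IVA.

```python
# Illustrative sketch of strand-bias trimming at contig ends.
# Assumption (not from the source): positions with zero depth count as biased.


def is_strand_biased(fwd, rev, max_fraction=0.8):
    """True if more than max_fraction of the depth comes from one strand."""
    total = fwd + rev
    if total == 0:
        return True
    return max(fwd, rev) / total > max_fraction


def trim_biased_ends(fwd_depth, rev_depth, max_fraction=0.8):
    """Return half-open (start, end) coordinates of the contig after
    removing strand-biased positions from both ends."""
    start, end = 0, len(fwd_depth)
    while start < end and is_strand_biased(
        fwd_depth[start], rev_depth[start], max_fraction
    ):
        start += 1
    while end > start and is_strand_biased(
        fwd_depth[end - 1], rev_depth[end - 1], max_fraction
    ):
        end -= 1
    return start, end
```

Only the ends are trimmed: a biased position in the middle of a contig is left alone, because the loops stop at the first unbiased position from each side.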
Next, the set of contigs is aligned to itself using nucmer with settings
nucmer --maxmatch -p p contigs.fasta contigs.fasta
delta-filter -i 95 -l 100 p.delta > p.delta.filter
show-coords -dTlro p.delta.filter > out.coords
to make an output file of hits called out.coords. In particular, the minimum
identity is 95% and minimum length is 100. In order of shortest to longest, any
contig that has hits of total length at least 95% of its own length is discarded. Finally,
overlapping contigs are merged using nucmer hits linking one contig to another contig
end. A merge is only made when there is exactly one hit between two contig ends,
to minimize the introduction of new errors into the assembly.
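The discard step can be sketched as below. The sketch assumes that hits have already been parsed from out.coords into (query, reference, query_start, query_end) tuples with half-open coordinates, that self-hits are ignored, that hits to already-discarded contigs no longer count, and that "total length" means the number of distinct query positions covered. These are illustrative assumptions, and IVA's actual bookkeeping may differ.

```python
# Illustrative sketch of discarding contigs that are contained in others.
# The hit representation and the overlap handling are assumptions.


def covered_length(intervals):
    """Total length of the union of half-open (start, end) intervals."""
    total, current_end = 0, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            total += end - start
            current_end = end
        elif end > current_end:
            total += end - current_end
            current_end = end
    return total


def remove_contained_contigs(lengths, hits, min_fraction=0.95):
    """lengths: {contig name: length}. hits: iterable of
    (query, ref, q_start, q_end). Returns the set of contig names kept."""
    kept = set(lengths)
    # Visit contigs from shortest to longest, as described in the text.
    for name in sorted(lengths, key=lambda n: (lengths[n], n)):
        intervals = [
            (s, e) for q, r, s, e in hits
            if q == name and r != name and r in kept
        ]
        if covered_length(intervals) >= min_fraction * lengths[name]:
            kept.discard(name)
    return kept
```

Processing from shortest to longest means that when two contigs are near-identical, the shorter one is removed first and the longer one then has no remaining hit to justify its own removal.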
3 Sample QC
3.1 Description of test data
172 Influenza virus and 68 HIV-1 clinical samples were initially considered (ENA
accessions are in Tables S3 and S4). Prior to submission to the ENA, the reads
were mapped to the human genome reference build GRCh37 with BWA version
0.5.10 (BWA-backtrack algorithm “aln”) and option -q 15. Reads mapping to
the human genome, and any relevant unmapped reads paired with a mapped read,
were then removed with the custom tool AlignmentFilter (https://github.com/wtsi-npg/illumina2bam).
3.2 Removing low quality samples
The analysis for this study started with the read sets from the ENA, which meant
that host contamination was already removed from the reads. Many samples had
large regions of the virus genome unrepresented in the reads due to failure of RT-PCR amplification. To determine which samples had unrepresented regions, the
reads of each sample (after trimming Illumina adapter and PCR primer sequences
as described in supplementary Section 2.2) were mapped to the closest reference
using SMALT version 0.7.0.1 with index options -k 13 -s 2 and map options -r 1
-x -y 0.5 -i 1000. In particular, this only required at least 50% of each read to
map (-y 0.5). Any sample that did not have at least 90% of the reference genome
positions covered by a minimum of 5 reads on each strand, and at least 1% of its
reads mapped to the reference genome, was removed from further analysis. This
was determined for each sample using the script
bam_is_genome_covered.pl in.bam 90 5 1
where in.bam contained the results of the SMALT mapping. After removing low
quality samples, there were 98 Influenza virus and 42 HIV-1 samples remaining.
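The pass/fail rule applied to each sample can be sketched in Python as follows. The sketch assumes per-position forward and reverse read depths and the mapped and total read counts are already available. It is not the code of bam_is_genome_covered.pl, and the function and parameter names are hypothetical. The defaults mirror the arguments 90 5 1 used above.

```python
# Illustrative sketch of the sample QC rule; not bam_is_genome_covered.pl.


def sample_passes_qc(fwd_depth, rev_depth, reads_mapped, reads_total,
                     min_pc_covered=90, min_depth=5, min_pc_mapped=1):
    """True if enough reference positions have at least min_depth reads on
    each strand AND enough of the reads mapped to the reference."""
    if not fwd_depth or reads_total == 0:
        return False
    covered = sum(
        1 for f, r in zip(fwd_depth, rev_depth)
        if f >= min_depth and r >= min_depth
    )
    pc_covered = 100 * covered / len(fwd_depth)
    pc_mapped = 100 * reads_mapped / reads_total
    return pc_covered >= min_pc_covered and pc_mapped >= min_pc_mapped
```

Both conditions must hold, so a sample with full coverage but fewer than 1% of its reads mapped is still removed.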
3.3 Reference databases
In order to assess the quality of each de novo assembly, Kraken was used to choose
a closely related reference genome from public databases for each of the HIV-1 and
Influenza virus samples. These reference genomes were solely used for evaluation
purposes and were not used to aid the assembly process. The reference genomes were
chosen using an automated method in order to be reproducible, and are sufficient
for the analysis of the assemblies in this study. However, researchers may wish to
manually build their own database to tailor to a particular project’s needs.
3.3.1 HIV-1 reference database
HIV-1 reference genomes were chosen from the LANL database
(http://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html) using the following method. The options Alignment type ‘Compendium’ and year ‘2012’ were
selected, and then ‘Get Alignment’ was used to download a multi-FASTA alignment file called
HIV1_COM_2012_genome_DNA.fasta. To make the file hiv.ids, the following command was run:
hiv_get_good_genomes.pl HIV1_COM_2012_genome_DNA.fasta > hiv.ids
This selected only the genomes that had all 9 HIV genes (env, gag, nef, pol, rev,
tat, vif, vpr, vpu) annotated and resulted in a file of 162 GenBank IDs. A Kraken
database was made using the command
iva_qc_make_db --skip_viruses --add_to_ref hiv.ids HIV_db
Finally, the closest reference to each HIV-1 sample was chosen using Kraken by
running the script
choose_ref.py HIV_db reads.fastq prefix_of_output_files.
A summary of the similarity between the assembly of each sample and the reference
chosen is given in Tables S1 and S2.
3.3.2 Influenza reference database
Meta information for complete Influenza virus genomes was downloaded from the
NCBI and 100 genomes of each of A and B type human Influenza virus were chosen
using the script:
get_flu_genomes.pl > flu.ids
Similarly to that of HIV-1, a Kraken database was made with
iva_qc_make_db --skip_viruses --add_to_ref flu.ids Flu_db
and the closest reference to each Influenza virus sample was chosen using Kraken
by running the script
choose_ref.py Flu_db reads.fastq prefix_of_output_files.
4 Benchmarking
Prior to assembly, all reads had Illumina adapter and PCR primer sequences trimmed
using the methods described in supplementary Section 2.2. The default Illumina
adapters file distributed with IVA was used with Trimmomatic. FASTA files of the
PCR primer sequences are provided in the GitHub repository (hiv_pcr_primers.fa
and flu_pcr_primers.fa).
For each sample, the same trimmed reads were used as input to each of the
assemblers IVA, PRICE, Trinity and VICUNA.
4.1 IVA
For each sample, IVA version 0.11.0 was run on paired FASTQ files reads_1.fq and
reads_2.fq with:
iva --pcr_primers fasta_file_of_pcr_primers.fasta \
--threads 8 -f reads_1.fq -r reads_2.fq Output_directory
Although the input reads already had PCR primers trimmed, the process is not
perfect and using the option --pcr_primers makes IVA trim PCR primers off the
ends of any contigs, as a final assembly stage.
Recall from the main text that IVA uses the longest available kmer of length
k to extend a contig if that kmer appears at least 10 times and is at least four
times as abundant as the next most common kmer of length k. In order to test the
robustness of these values, we also ran IVA requiring that the kmer appeared at least
five times and was at least two times as abundant as the next most common kmer.
This required running IVA as above, but with the extra parameters --ext_min_cov
5 --ext_min_ratio 2 --seed_ext_min_cov 5 and --seed_ext_min_ratio 2. The
results in the main text refer to IVA run with the default settings; however, both
sets of results are reported in this supplementary information. The default and
alternative runs are referred to as ‘IVA’ and ‘IVA.c5r2’ throughout this text and
supplementary tables.
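The two thresholds can be illustrated with a short Python sketch that, for a single kmer length, decides whether the most common kmer is acceptable for extension. This is a simplified illustration and not IVA's code: the function name and the dictionary input are hypothetical, and the search from the longest kmer length downwards is not shown.

```python
# Illustrative sketch of the coverage and ratio thresholds for one kmer length.
from collections import Counter


def choose_extension_kmer(kmer_counts, min_cov=10, min_ratio=4):
    """Return the most common kmer if it occurs at least min_cov times and is
    at least min_ratio times as abundant as the runner-up; otherwise None."""
    ranked = Counter(kmer_counts).most_common(2)
    if not ranked:
        return None
    best, best_count = ranked[0]
    if best_count < min_cov:
        return None
    if len(ranked) == 2 and best_count < min_ratio * ranked[1][1]:
        return None
    return best
```

For example, with counts of 30 and 10 for two competing kmers, the default thresholds (10 and 4) reject the extension because 30 is less than 4 × 10, whereas the relaxed ‘IVA.c5r2’ thresholds (5 and 2) accept it.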
4.2 VICUNA
VICUNA version 1.3 was run using a wrapper script:
vicuna-wrapper.pl -min_identify 80 \
reads_1.fq reads_2.fq VICUNA 100 1000
where 100 and 1000 are the minimum and maximum insert sizes, and min_identify
80 set the minimum percent identity to merge two contigs to 80%, instead of the
default of 90%. These were the only values that were changed from the default settings. We also ran VICUNA with the default setting of min_identify 90, which on
average produced lower quality assemblies. The numbers in the main text are with
the value set to 80%; however, both sets of results are reported in the supplementary information. The two runs are referred to as ‘VICUNA.80’ and ‘VICUNA.90’
throughout this text and supplementary tables.
After assembling, we modified each contig name by retaining everything before
the first whitespace with
awk '{print $1}' contig.fasta > contigs.fasta
This was to allow processing with downstream analysis tools.
4.3 PRICE
Version 1.2 of PRICE was used. PRICE needs starting seed sequences. These can be
a subset of the input reads, which makes the assembly truly de novo. We
chose 200 reads at random (but evenly spaced throughout the file) from the FASTQ
file reads_1.fq containing trimmed forward reads with:
fastq_to_equally_spaced_sample.py reads_1.fq seeds.out.fq
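One simple way to pick evenly spaced reads is sketched below in Python. It only computes which read indices to keep. It is not the code of fastq_to_equally_spaced_sample.py, and the function name and the rounding scheme are assumptions made for illustration.

```python
# Illustrative sketch of choosing evenly spaced read indices from a file.
# Not the implementation of fastq_to_equally_spaced_sample.py.


def evenly_spaced_indices(total_reads, sample_size=200):
    """Return 0-based indices of sample_size reads spread evenly over a file
    of total_reads reads (all reads if the file is too small)."""
    if sample_size <= 0 or total_reads <= 0:
        return []
    if sample_size >= total_reads:
        return list(range(total_reads))
    step = total_reads / sample_size
    return [int(i * step) for i in range(sample_size)]
```

For a file of 1000 reads and a sample size of 4, this selects the reads at indices 0, 250, 500 and 750.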
The seeds were converted to FASTA format (these FASTA files are available from the
GitHub repository). PRICE requires the insert size of the input reads. The median insert size was calculated from the SMALT output created when removing low quality
samples (see the methods described in supplementary Section 3.2) using an in-house
pipeline. The insert size used for each sample is given in Tables S3 and S4 (and
supplementary files table.S3.flu_samples.tsv and table.S4.hiv_samples.tsv).
PRICE was run with the following options:
PriceTI -a 8 -fpp reads_1.fq reads_2.fq $m 95 \
-icf seed_reads.fasta 1 1 5 -nc 30 -dbmax 250 \
-target 90 2 1 1 -o contigs.fasta
where $m was set to the appropriate median insert size for each sample.
We remark that we tried PRICE without the option -target 90 2 1 1 and
found that this exacerbated the problem of producing multiple copies of the same sequence
over many contigs. This was as expected because the -target option, according to
the manual, should limit the final contigs to extensions of the input seeds. Further,
we tried PRICE without -target 90 2 1 1 and also with 10 seed reads instead of
200. It exhibited the same behaviour of multiple contigs covering the same region of
the genome (and since some of the 8 Influenza virus segments were not represented
in the 10 seed reads, those segments were understandably not present in the output
contigs).
After assembling, only contigs of length at least 50 bp were kept, and the contig
names were modified in the same way as for VICUNA, using
fastaq filter --min_length 50 contigs.cycle30.fasta - \
| awk '{print $1}' > contigs.fasta
As for VICUNA, this was to allow processing with downstream analysis tools.
The assembly of one Influenza sample (ERR732356) was manually terminated
after running for six days. Since PRICE writes a FASTA file of contigs at the end
of each iteration, we used the last file written as the final assembly. This was the
15th iteration.
4.4 Trinity
Trinity version 20140717 was run with the command:
Trinity --seqType fq --JM 16G \
--left trimmo.pcr_trim_1.fq --right trimmo.pcr_trim_2.fq --CPU 8 \
--output trinity.out
The final output from Trinity and the output from its first stage, ‘Inchworm’, were
evaluated. To allow processing with downstream analysis tools the final contigs file
output by Trinity was modified using
fastaq filter --min_length 50 Trinity.fasta - \
| awk '{print $1}' > contigs.fasta
The Inchworm stage produces a file called target.fa and the above command was
also applied to it, with target.fa in place of the input Trinity.fasta.
5 Assembly validation
Two scripts for running assembly QC are distributed with IVA:
1. iva_qc_make_db – makes a custom database
2. iva_qc – runs all the QC analysis.
These both rely on Kraken being installed. We used version 0.10.4-beta of Kraken.
The GAGE analysis code is also distributed with IVA, which we modified so that the
minimum percent identity was parameterized and given a default of 80%, instead of
the hard-coded 95%.
The QC script was run on each set of contigs with:
iva_qc -f trimmed_reads_1.fq -r trimmed_reads_2.fq \
--embl_dir EMBL contigs.fasta output_prefix
where trimmed_reads_1.fq and trimmed_reads_2.fq were the forward and reverse
trimmed reads and EMBL was the directory of EMBL files chosen to be the closest
reference (as described in supplementary Section 3.3).
The complete results of running the QC are given in Tables S5 and S6 (and
supplementary files table.S5.qc_summary.flu.tsv and
table.S6.qc_summary.hiv.tsv). A summary of the results of all assemblies is
given in Tables S1 and S2. Box plots of the proportion of genome assembled and
success of annotation transfer are given in Figure S3.
6 Run time and memory usage
None of the assemblers had significant memory requirements. The values used for
peak memory usage were those reported by the compute farm job scheduling software
Platform Load Sharing Facility (LSF). It polls the memory usage every minute and
reports the maximum value when a job finishes.
The option --JM 16G was used with Trinity, which requested 16 GB of
memory for Jellyfish (a k-mer counting program used as part of the
Trinity pipeline). Although for some samples a lower value could have been used
to reduce the peak memory usage, we found that many of the assemblies crashed
with --JM 8G and therefore --JM 16G was used for all samples. We do not report
values for the Inchworm stage of Trinity because it is one part of the entire Trinity
assembly pipeline and therefore the numbers were not available.
The run time varied between assemblers and samples. See Figure S4 for summary
plots of the CPU and memory usage. The complete data are in Tables S7 and S8
(and supplementary files table.S7.resources_flu.tsv and
table.S8.resources_hiv.tsv). Note that IVA, PRICE and Trinity were used with
8 threads and VICUNA has no threading option.
7 Example assemblies
Typical results of assembling an Influenza virus sample are shown in supplementary
Figures S5 and S6. These plots are produced by the script iva_qc. IVA assembles
all segments, with most segments assembled into one unique contig. PRICE and
VICUNA assemble most segments, but with many duplications. Inchworm also has
many duplications, but most of these are removed in the final output of Trinity.
Each plot shows the following information.
• Vertical grey lines mark the boundaries between segments of the genome (only
applicable to Influenza virus).
• The top panel shows the contigs aligned to the closest reference genome (chosen
for each sample as described earlier), using nucmer hits. There is one row per
contig, and one column per genome segment. More than one rectangle on a
row represents a chimeric contig. Blue means that each part of the contig had
a single nucmer match to the reference. Otherwise, the contig is coloured red.
Light and dark shades correspond to forward and reverse orientation, respectively.
• The middle panel is in two sections, the upper section shows contig information
and the lower section shows read information. See Figure S7 for an example
where all of the tracks are visible (they are not all visible in Figure S5). The
upper three tracks show contig coverage of the reference. The first and second
tracks (black) show presence and absence of contig coverage. The third track
(red) shows absence of contig coverage, but where there was at least 5X read
coverage on each strand, i.e. assembly should have been possible. The tracks
are not all visible in Figure S5 because there was good read coverage across
the entire genome.
• The lower three tracks of the middle panel show properties of the read depth.
The first track (black) shows where there was at least 5X read depth on both
strands. The second and third tracks (red) show read depth less than 5X on
the forward and reverse strands respectively.
• The bottom panel shows line plots of the read depth on the forward and reverse
strand above and below the y axis respectively.
[Figure S1: flowchart. Steps: Sequencing reads; Trim Illumina adapters and PCR primers from reads; Generate new contig, using read pairs that do not map to any contig; Extend contigs; Trim and merge current contigs; Final assembled contigs. Decision points: New contig successfully made? At least one contig end extended? Max contigs limit reached?]
Figure S1: Flowchart of the IVA assembly algorithm
Table S1: Summary of HIV-1 QC results. Numbers in parentheses are the standard deviations. See Figures 2a, 2b, S2 (an expanded version of Figure 2) and S3 for box plots.

| Metric | IVA | IVA.c5r2 | PRICE | Inchworm | Trinity | VICUNA.80 | VICUNA.90 |
|---|---|---|---|---|---|---|---|
| Ideal assemblies (%) [1] | 57.1 | 57.1 | 11.9 | 0.0 | 14.3 | 2.4 | 0.0 |
| Mean reference bases assembled (%) | 98.1 (2.51) | 97.9 (2.53) | 97.2 (4.98) | 98.4 (2.39) | 89.8 (20.9) | 98.3 (2.37) | 98.4 (2.39) |
| Longest contig(s) sum (% of reference) | 101.5 (3.81) | 99.9 (7.64) | 94.4 (25.96) | 90.1 (20.4) | 76.2 (29.98) | 99.5 (11.03) | 101.5 (7.0) |
| Assembly length (% of reference) | 109.0 (20.05) | 110.5 (22.99) | 127.2 (33.36) | 312.9 (97.19) | 154.0 (79.7) | 146.1 (38.09) | 160.7 (54.16) |
| Mean duplication rate [2] | 0.1 (0.13) | 0.1 (0.14) | 0.2 (0.23) | 1.8 (0.93) | 0.5 (0.6) | 0.2 (0.22) | 0.3 (0.3) |
| Mean % annotation transferred | 98.0 (8.04) | 99.0 (5.04) | 90.0 (17.91) | 98.7 (4.11) | 86.2 (22.62) | 97.3 (9.24) | 99.2 (3.55) |
| Total assembly errors [3] | 4 | 1 | 4 | 2 | 0 | 1 | 3 |
| Mean per-sample identity to reference (%) [4] | 91.0 (1.44) | 91.0 (1.33) | 91.1 (1.41) | 91.1 (1.4) | 91.2 (1.41) | 91.2 (1.38) | 91.1 (1.34) |

[1] The entire genome must be assembled into a unique contig.
[2] Number of duplicated reference bases reported by GAGE divided by the length of the reference.
[3] An error is an inversion, relocation or translocation reported by GAGE. Numbers reported are the total across all assemblies.
[4] As reported by GAGE.
Table S2: Summary of Influenza QC results. Numbers in parentheses are the standard deviations. See Figures 2a, 2b, S2 (an expanded version of Figure 2) and S3 for box plots.

| Metric | IVA | IVA.c5r2 | PRICE | Inchworm | Trinity | VICUNA.80 | VICUNA.90 |
|---|---|---|---|---|---|---|---|
| Ideal assemblies (%) [1] | 18.4 | 21.4 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| Mean reference bases assembled (%) | 99.0 (2.66) | 98.8 (2.84) | 89.8 (13.81) | 99.6 (1.39) | 97.6 (4.64) | 94.3 (4.38) | 92.4 (4.81) |
| Longest contig(s) sum (% of reference) | 99.0 (5.49) | 96.6 (6.51) | 91.8 (16.61) | 94.9 (10.18) | 94.4 (8.97) | 82.0 (10.95) | 75.7 (11.16) |
| Assembly length (% of reference) | 107.7 (8.26) | 107.2 (8.09) | 290.7 (92.12) | 229.4 (48.32) | 131.7 (25.92) | 136.9 (21.0) | 135.4 (18.8) |
| Mean duplication rate [2] | 0.1 (0.07) | 0.1 (0.07) | 1.8 (0.84) | 0.4 (0.21) | 0.3 (0.27) | 0.4 (0.2) | 0.4 (0.18) |
| Mean % annotation transferred | 99.1 (2.63) | 99.0 (2.87) | 92.1 (11.12) | 98.3 (3.14) | 96.1 (4.45) | 95.3 (5.77) | 92.9 (7.84) |
| Total assembly errors [3] | 6 | 0 | 6 | 4 | 0 | 0 | 0 |
| Mean per-sample identity to reference (%) [4] | 99.0 (1.07) | 99.0 (1.07) | 98.8 (1.1) | 99.0 (1.0) | 99.0 (1.03) | 98.9 (1.1) | 99.0 (1.06) |

[1] Each segment must be assembled into a unique contig.
[2] Number of duplicated reference bases reported by GAGE divided by the length of the reference.
[3] An error is an inversion, relocation or translocation reported by GAGE. Numbers reported are the total across all assemblies.
[4] As reported by GAGE.
[Figure S2: box plots for HIV-1 and Influenza, one box per assembler (IVA, IVA.c5r2, PRICE, Inchworm, Trinity, VICUNA.80, VICUNA.90). Panel (a) y axis: Longest contig(s) / reference length (%). Panel (b) y axis: Assembly length / reference length (%).]
Figure S2: Comparison of assembly success. This is an expanded version of Figure
2. (a) For each segment of the reference, the longest matching contig was found.
This plot shows the total length of these contigs for each assembly, as a percentage
of the reference length. (b) Total assembly lengths, excluding contamination by only
counting contigs that match the reference, as a percentage of the reference length.
[Figure S3: box plots, one box per assembler (IVA, IVA.c5r2, PRICE, Inchworm, Trinity, VICUNA.80, VICUNA.90). Panels (a), (b) y axis: Per cent of reference assembled. Panel (c) y axis: Assembled into a unique contig (%). Panel (d) y axis: Uniquely assembled segments. Panels (e), (f) y axis: Annotation features transferred (%).]
Figure S3: Box plots of data summarised in Tables S1 and S2. (a), (b) Per cent of
genome assembled, using GAGE output, for (a) HIV-1 and (b) Influenza. (c) Per
cent of HIV-1 samples that were assembled into a single unique contig. (d) Number
of segments assembled into a single unique contig, for each Influenza sample. (e),
(f) Per cent of annotation elements transferred from reference for (e) HIV-1 and (f)
Influenza.
[Figure S4: box plots for HIV and Influenza, one box per assembler (IVA, IVA.c5r2, PRICE, Trinity, VICUNA.80, VICUNA.90). Panel (a) y axis: Total CPU time (hours). Panel (b) y axis: Wall clock time (hours). Panel (c) y axis: Peak RAM (GB).]
Figure S4: Resource usage of the assemblers per sample. (a) Total CPU time. Two
PRICE outliers were removed from the Influenza plot, with values 177 and 650. (b)
Wall clock time. One PRICE outlier was removed from the Influenza plot, with
value 144. (c) Peak RAM usage.
[Figure S5: four panels (a) to (d), each titled "IVA QC contig layout and read depth". Tracks in each panel: Contigs; Contig/Read coverage OK (Contigs, Reads); Read depth. x axis: Position in reference.]
Figure S5: Example Influenza virus assemblies for sample ERR732276. (a) IVA,
(b) PRICE, (c) Inchworm, (d) Trinity. The corresponding plot for VICUNA is in
Figure S6. See text for an explanation.
[Figure S6: one panel titled "IVA QC contig layout and read depth". Tracks: Contigs; Contig/Read coverage OK (Contigs, Reads); Read depth. x axis: Position in reference.]
Figure S6: VICUNA assembly for Influenza virus sample ERR732276. See
Figure S5 for plots with the other assemblers.
[Figure S7: one panel titled "IVA QC contig layout and read depth", showing a single contig (contig.00001). Tracks: Contigs; Contig/Read coverage OK (Contigs, Reads); Read depth. x axis: Position in reference.]
Figure S7: IVA contig layout, contig coverage and read depth for HIV sample
ERR732130. See main text for an explanation.