Supplementary Material of: IVA: accurate de novo assembly of RNA virus genomes

Martin Hunt 1, Astrid Gall 1, Swee Hoe Ong 1, Jacqui Brener 2, Bridget Ferns 3, Philip Goulder 2, Eleni Nastouli 4, Jacqueline A Keane 1, Paul Kellam 1,3 and Thomas D Otto 1

1 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK
2 Department of Paediatrics, University of Oxford, Oxford, UK
3 Division of Infection and Immunity, Faculty of Medical Sciences, University College London, London, UK
4 Department of Virology, University College London Hospital NHS Foundation Trust, London, UK

Contents

1 Data and scripts
2 IVA assembly methods
    2.1 Third-party software
    2.2 Illumina adapter and PCR primer trimming
    2.3 Seed generation
    2.4 Contig cleaning and merging
3 Sample QC
    3.1 Description of test data
    3.2 Removing low quality samples
    3.3 Reference databases
4 Benchmarking
    4.1 IVA
    4.2 VICUNA
    4.3 PRICE
    4.4 Trinity
5 Assembly validation
6 Run time and memory usage
7 Example assemblies

1 Data and scripts

The supplementary data and scripts developed specifically for this project can be found in the github repository https://github.com/sanger-pathogens/iva-publication.
The Python 3 package Fastaq (https://github.com/sanger-pathogens/Fastaq) must be installed for the scripts in the iva-publication repository to work. The script fastaq is distributed with the Fastaq package; all other scripts are found in the Scripts/ directory of the iva-publication repository.

The supplementary file supplementary tables.xls contains supplementary Tables S3–S8. These tables are also available as tab-delimited plain text files in the github repository iva-publication.

2 IVA assembly methods

A flowchart describing the assembly algorithm used by IVA is shown in Figure S1.

2.1 Third-party software

The installation of IVA requires the user to install a set of third-party dependencies. IVA version 0.11.0 requires the following programs and versions:

• kmc version 2.1
• MUMmer version 3.23
• samtools version 0.1.19-44428cd
• SMALT version 0.7.6
• Trimmomatic version 0.32.

2.2 Illumina adapter and PCR primer trimming

IVA can optionally trim Illumina adapters and PCR primer sequences from the reads before assembling. Illumina adapters are removed from a pair of FASTQ files (in.1.fq and in.2.fq) using the following call to Trimmomatic:

    java -jar trimmomatic-0.32.jar PE in.1.fq in.2.fq \
        trimmo.1.fq out.up.1.fq trimmo.2.fq out.2.up.2.fq \
        ILLUMINACLIP:adapters.fasta:2:10:7:1 MINLEN:50

and the resulting paired reads trimmo.1.fq and trimmo.2.fq are retained. A file of Illumina adapters is distributed with IVA, but the user can provide their own file. Next, perfect matches to PCR primers are removed using:

    fastaq sequence_trim --revcomp trimmo.1.fq trimmo.2.fq \
        out.1.fq out.2.fq pcr_primers.fa

and the resulting paired reads out.1.fq and out.2.fq are then used for assembly. The user must provide a FASTA file of PCR primers.

2.3 Seed generation

When generating a seed kmer, the length is by default set to two-thirds of the read length, up to a maximum of 95.
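The default seed length rule can be written as a one-line calculation. The following is a minimal sketch, not code taken from IVA: the function name seed_kmer_length is hypothetical, and rounding the two-thirds value down to an integer is an assumption, since the rounding rule is not stated here.

```python
def seed_kmer_length(read_length: int, max_length: int = 95) -> int:
    """Default seed kmer length: two-thirds of the read length, capped at max_length."""
    # Assumption: the two-thirds value is rounded down to a whole number of bases.
    return min((2 * read_length) // 3, max_length)


# Short reads give a seed below the cap; longer reads are capped at 95.
for read_length in (100, 150, 250):
    print(read_length, seed_kmer_length(read_length))
```

For reads longer than about 143 bases the cap of 95 applies, so the seed length becomes independent of the read length.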
To save time, the first 100,000 reads are used as input to kmc, which is run with

    kmc -fa -m4 -k $k -sf 1 -ci 200 -cs 1000000000 -cx 1000000000 \
        input_reads.fa kmc_out $PWD
    kmc_dump kmc_out count

to make a file, count, of kmer counts, where $k = min(2/3 × read length, 95).

2.4 Contig cleaning and merging

First, low quality contig ends are trimmed by removing bases that have more than 80% of their coverage on either the forward or reverse strand only. These are found by mapping the reads with SMALT with index options -k 19 -s 11 and map options -r 1 -x -y 0.9 -i 1000 and then running samtools mpileup with each of --rf 0x10 and --ff 0x10 to get the read depth on each strand. Next, the set of contigs is aligned to itself using nucmer with settings

    nucmer --maxmatch -p p contigs.fasta contigs.fasta
    delta-filter -i 95 -l 100 p.delta > p.delta.filter
    show-coords -dTlro p.delta.filter > out.coords

to make an output file of hits called out.coords. In particular, the minimum identity is 95% and the minimum length is 100. In order of shortest to longest, any contig that has hits of total length at least 95% of its own length is discarded. Finally, overlapping contigs are merged using nucmer hits linking one contig end to another contig end. A merge is only made when there is exactly one hit between two contig ends, to minimize the introduction of new errors into the assembly.

3 Sample QC

3.1 Description of test data

172 Influenza virus and 68 HIV-1 clinical samples were initially considered (ENA accessions are in Tables S3 and S4). Prior to submission to the ENA, the reads were mapped to the human genome reference build GRCh37 with BWA version 0.5.10 (BWA-backtrack algorithm “aln”) and option -q 15. Reads mapping to the human genome, and any relevant unmapped reads paired with a mapped read, were then removed with the custom tool AlignmentFilter (https://github.com/wtsinpg/illumina2bam).
3.2 Removing low quality samples

The analysis for this study started with the read sets from the ENA, which meant that host contamination was already removed from the reads. Many samples had large regions of the virus genome unrepresented in the reads due to failure of RT-PCR amplification. To determine which samples had unrepresented regions, the reads of each sample (after trimming Illumina adapter and PCR primer sequences as described in supplementary Section 2.2) were mapped to the closest reference using SMALT version 0.7.0.1 with index options -k 13 -s 2 and map options -r 1 -x -y 0.5 -i 1000. In particular, this only required at least 50% of each read to map (-y 0.5). A sample was kept only if at least 90% of the reference genome positions were covered by a minimum of 5 reads on each strand and at least 1% of its reads mapped to the reference genome; all other samples were removed from further analysis. This was determined for each sample using the script

    bam_is_genome_covered.pl in.bam 90 5 1

where in.bam contained the results of the SMALT mapping. After removing low quality samples, there were 98 Influenza virus and 42 HIV-1 samples remaining.

3.3 Reference databases

In order to assess the quality of each de novo assembly, Kraken was used to choose a closely related reference genome from public databases for each of the HIV-1 and Influenza virus samples. These reference genomes were solely used for evaluation purposes and were not used to aid the assembly process. The reference genomes were chosen using an automated method in order to be reproducible, and are sufficient for the analysis of the assemblies in this study. However, researchers may wish to manually build their own database to tailor to a particular project’s needs.

3.3.1 HIV-1 reference database

HIV-1 reference genomes were chosen from the LANL database (http://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html) using the following method.
The options Alignment type ‘Compendium’ and year ‘2012’ were selected, and ‘Get Alignment’ was then used to download a multi-fasta alignment file called HIV1_COM_2012_genome_DNA.fasta. To make the file hiv.ids, the following command was run:

    hiv_get_good_genomes.pl HIV1_COM_2012_genome_DNA.fasta > hiv.ids

This selected only the genomes that had all 9 HIV genes (env, gag, nef, pol, rev, tat, vif, vpr, vpu) annotated and resulted in a file of 162 GenBank IDs. A Kraken database was made using the command

    iva_qc_make_db --skip_viruses --add_to_ref hiv.ids HIV_db

Finally, the closest reference to each HIV-1 sample was chosen using Kraken by running the script

    choose_ref.py HIV_db reads.fastq prefix_of_output_files

A summary of the similarity between the assembly of each sample and the reference chosen is given in Tables S1 and S2.

3.3.2 Influenza reference database

Meta information for complete Influenza virus genomes was downloaded from the NCBI and 100 genomes of each of A and B type human Influenza virus were chosen using the script:

    get_flu_genomes.pl > flu.ids

As for HIV-1, a Kraken database was made with

    iva_qc_make_db --skip_viruses --add_to_ref flu.ids Flu_db

and the closest reference to each Influenza virus sample was chosen using Kraken by running the script

    choose_ref.py Flu_db reads.fastq prefix_of_output_files

4 Benchmarking

Prior to assembly, all reads had Illumina adapter and PCR primer sequences trimmed using the methods described in supplementary Section 2.2. The default Illumina adapters file distributed with IVA was used with Trimmomatic. FASTA files of the PCR primer sequences are provided in the github repository (hiv pcr primers.fa and flu pcr primers.fa). For each sample, the same trimmed reads were used as input to each of the assemblers IVA, PRICE, Trinity and VICUNA.
4.1 IVA

For each sample, IVA version 0.11.0 was run on paired FASTQ files reads_1.fq and reads_2.fq with:

    iva --pcr_primers fasta_file_of_pcr_primers.fasta \
        --threads 8 -f reads_1.fq -r reads_2.fq Output_directory

Although the input reads already had PCR primers trimmed, the process is not perfect, and using the option --pcr_primers makes IVA trim PCR primers off the ends of any contigs as a final assembly stage.

Recall from the main text that IVA uses the longest available kmer of length k to extend a contig if that kmer appears at least 10 times and is at least four times as abundant as the next most common kmer of length k. In order to test the robustness of these values, we also ran IVA requiring that the kmer appeared at least five times and was at least two times as abundant as the next most common kmer. This required running IVA as above, but with the extra parameters --ext_min_cov 5, --ext_min_ratio 2, --seed_ext_min_cov 5 and --seed_ext_min_ratio 2. The results in the main text refer to IVA run with the default settings; however, both sets of results are reported in this supplementary information. The default and alternative runs are referred to as ‘IVA’ and ‘IVA.c5r2’ throughout this text and supplementary tables.

4.2 VICUNA

VICUNA version 1.3 was run using a wrapper script:

    vicuna-wrapper.pl -min_identify 80 \
        reads_1.fq reads_2.fq VICUNA 100 1000

where 100 and 1000 are the minimum and maximum insert sizes, and -min_identify 80 sets the minimum percent identity to merge two contigs to 80%, instead of the default of 90%. These were the only values that were changed from the default settings. We also ran VICUNA with the default setting of -min_identify 90, which on average produced lower quality assemblies. The numbers in the main text are with the value set to 80%; however, both sets of results are reported in the supplementary information. The two runs are referred to as ‘VICUNA.80’ and ‘VICUNA.90’ throughout this text and supplementary tables.
After assembling, we modified each contig name by retaining everything before the first whitespace with

    awk '{print $1}' contig.fasta > contigs.fasta

This was to allow processing with downstream analysis tools.

4.3 PRICE

Version 1.2 of PRICE was used. PRICE needs starting seed sequences. These can be a subset of the input reads, and using such a subset makes the assembly truly de novo. We chose 200 reads at random (but evenly spaced throughout the file) from the FASTQ file reads_1.fq containing trimmed forward reads with:

    fastq_to_equally_spaced_sample.py reads_1.fq seeds.out.fq

The seeds were converted to FASTA format (these FASTA files are available from the github repository). PRICE requires the insert size of the input reads. The median insert size was calculated from the SMALT output created when removing low quality samples (see the methods described in supplementary Section 3.2) using an in-house pipeline. The insert size used for each sample is given in Tables S3 and S4 (and supplementary files table.S3.flu samples.tsv and table.S4.hiv samples.tsv). PRICE was run with the following options:

    PriceTI -a 8 -fpp reads_1.fq reads_2.fq $m 95 \
        -icf seed_reads.fasta 1 1 5 -nc 30 -dbmax 250 \
        -target 90 2 1 1 -o contigs.fasta

where $m was set to the appropriate median insert size for each sample. We remark that we tried PRICE without the option -target 90 2 1 1 and found it to exacerbate the problem of producing multiple copies of the same sequence over many contigs. This was as expected because the -target option, according to the manual, should limit the final contigs to extensions of the input seeds. Further, we tried PRICE without -target 90 2 1 1 and also with 10 seed reads instead of 200. It exhibited the same behaviour of multiple contigs covering the same region of the genome (and since some of the 8 Influenza virus segments were not represented in the 10 seed reads, those segments were understandably not present in the output contigs).
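To illustrate the idea of taking seed reads at evenly spaced positions in a FASTQ file and converting them to FASTA, here is a minimal sketch. It is not the script fastq_to_equally_spaced_sample.py from the repository: the function names are hypothetical, the sampling is deterministic and starts at the first read (an assumption), and each record is assumed to occupy exactly four lines.

```python
import io
from typing import Iterator, List, TextIO, Tuple

Record = Tuple[str, str, str, str]  # (name line, sequence, '+' line, qualities)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield 4-line FASTQ records; assumes sequences are not line-wrapped."""
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:  # end of file
            return
        yield (lines[0], lines[1], lines[2], lines[3])


def equally_spaced_sample(records: List[Record], n: int) -> List[Record]:
    """Pick n records at evenly spaced positions (all records if there are at most n)."""
    if n <= 0:
        return []
    if len(records) <= n:
        return list(records)
    # Index i * len / n (rounded down) gives n distinct, evenly spread positions.
    return [records[(i * len(records)) // n] for i in range(n)]


def to_fasta(record: Record) -> str:
    """Convert one FASTQ record to a two-line FASTA entry."""
    return ">" + record[0][1:] + "\n" + record[1]


# Toy input: ten reads named r0..r9 (hypothetical data, for illustration only).
toy_fastq = "".join(f"@r{i}\nACGT\n+\nIIII\n" for i in range(10))
records = list(read_fastq(io.StringIO(toy_fastq)))
seeds = equally_spaced_sample(records, 5)
print("\n".join(to_fasta(r) for r in seeds))
```

With the toy input, requesting 5 seeds selects every second read (r0, r2, r4, r6, r8). In the benchmarking described here, 200 seeds were taken per sample.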
After assembling, only contigs of length at least 50 bp were kept, and the contig names were modified in the same way as for VICUNA, using

    fastaq filter --min_length 50 contigs.cycle30.fasta - \
        | awk '{print $1}' > contigs.fasta

As for VICUNA, this was to allow processing with downstream analysis tools. The assembly of one Influenza sample (ERR732356) was manually terminated after running for six days. Since PRICE writes a FASTA file of contigs at the end of each iteration, we used the last file written as the final assembly. This was the 15th iteration.

4.4 Trinity

Trinity version 20140717 was run with the command:

    Trinity --seqType fq --JM 16G \
        --left trimmo.pcr_trim_1.fq --right trimmo.pcr_trim_2.fq --CPU 8 \
        --output trinity.out

The final output from Trinity and the output from its first stage, ‘Inchworm’, were evaluated. To allow processing with downstream analysis tools, the final contigs file output by Trinity was modified using

    fastaq filter --min_length 50 Trinity.fasta - \
        | awk '{print $1}' > contigs.fasta

The Inchworm stage produces a file called target.fa, and the above command was also applied to this file by replacing the input Trinity.fasta with target.fa.

5 Assembly validation

Two scripts for running assembly QC are distributed with IVA:

1. iva_qc_make_db – makes a custom database
2. iva_qc – runs all the QC analysis.

These both rely on Kraken being installed. We used version 0.10.4-beta of Kraken. The GAGE analysis code is also distributed with IVA, which we modified so that the minimum percent identity was parameterized and given a default of 80%, instead of the hard-coded 95%. The QC script was run on each set of contigs with:

    iva_qc -f trimmed_reads_1.fq -r trimmed_reads_2.fq \
        --embl_dir EMBL contigs.fasta output_prefix

where trimmed_reads_1.fq and trimmed_reads_2.fq were the forward and reverse trimmed reads and EMBL was the directory of EMBL files chosen to be the closest reference (as described earlier).
The complete results of running the QC are given in Tables S5 and S6 (and supplementary files table.S5.qc summary.flu.tsv and table.S6.qc summary.hiv.tsv). A summary of the results of all assemblies is given in Tables S1 and S2. Box plots of the proportion of genome assembled and success of annotation transfer are given in Figure S3.

6 Run time and memory usage

None of the assemblers had significant memory requirements. The values used for peak memory usage were those reported by the compute farm job scheduling software Platform Load Sharing Facility (LSF). It polls the memory usage every minute and reports the maximum value when a job finishes. The option --JM 16G was used with Trinity, which requested 16 GB of memory when running Jellyfish (a k-mer counting program used as part of the Trinity pipeline). Although for some samples a lower value could have been used to reduce the peak memory usage, we found that many of the assemblies crashed with --JM 8GB and therefore --JM 16G was used for all samples. We do not report values for the Inchworm stage of Trinity because it is one part of the entire Trinity assembly pipeline and therefore the numbers were not available.

The run time varied between assemblers and samples. See Figure S4 for summary plots of the CPU and memory usage. The complete data are in Tables S7 and S8 (and supplementary files table.S7.resources flu.tsv and table.S8.resources hiv.tsv). Note that IVA, PRICE and Trinity were used with 8 threads and VICUNA has no threading option.

7 Example assemblies

Typical results of assembling an Influenza virus sample are shown in supplementary Figures S5 and S6. These plots are produced by the script iva_qc. IVA assembles all segments, with most segments assembled into one unique contig. PRICE and VICUNA assemble most segments, but with many duplications. Inchworm also has many duplications, but most of these are removed in the final output of Trinity. Each plot shows the following information.
• Vertical grey lines mark the boundaries between segments of the genome (only applicable to Influenza virus).
• The top panel shows the contigs aligned to the closest reference genome (chosen for each sample as described earlier), using nucmer hits. There is one row per contig, and one column per genome segment. More than one rectangle on a row represents a chimeric contig. Blue means that each part of the contig had a single nucmer match to the reference. Otherwise, the contig is coloured red. Light and dark correspond to forward and reverse orientation respectively.
• The middle panel is in two sections: the upper section shows contig information and the lower section shows read information. See Figure S7 for an example where all of the tracks are visible (they are not all visible in Figure S5). The upper three tracks show contig coverage of the reference. The first and second tracks (black) show presence and absence of contig coverage. The third track (red) shows absence of contig coverage, but where there was at least 5X read coverage on each strand, i.e. assembly should have been possible. The tracks are not all visible in Figure S5 because there was good read coverage across the entire genome.
• The lower three tracks of the middle panel show properties of the read depth. The first track (black) shows where there was at least 5X read depth on both strands. The second and third tracks (red) show read depth less than 5X on the forward and reverse strands respectively.
• The bottom panel shows line plots of the read depth on the forward and reverse strand above and below the y axis respectively.

[Figure S1: flowchart. Boxes: Sequencing reads; Trim Illumina adapters and PCR primers from reads; Generate new contig (use read pairs that do not map to any contig); Extend contigs; Trim and merge current contigs; Final assembled contigs. Yes/No decisions: New contig successfully made?; At least one contig end extended?; Max contigs limit reached?]

Figure S1: Flowchart of the IVA assembly algorithm

Metric                                         IVA            IVA.c5r2       PRICE          Inchworm       Trinity        VICUNA.80      VICUNA.90
Ideal assemblies (%) [1]                       57.1           57.1           11.9           0.0            14.3           2.4            0.0
Mean reference bases assembled (%)             98.1 (2.51)    97.9 (2.53)    97.2 (4.98)    98.4 (2.39)    89.8 (20.9)    98.3 (2.37)    98.4 (2.39)
Longest contig(s) sum (% of reference)         101.5 (3.81)   99.9 (7.64)    94.4 (25.96)   90.1 (20.4)    76.2 (29.98)   99.5 (11.03)   101.5 (7.0)
Assembly length (% of reference)               109.0 (20.05)  110.5 (22.99)  127.2 (33.36)  312.9 (97.19)  154.0 (79.7)   146.1 (38.09)  160.7 (54.16)
Mean duplication rate [2]                      0.1 (0.13)     0.1 (0.14)     0.2 (0.23)     1.8 (0.93)     0.5 (0.6)      0.2 (0.22)     0.3 (0.3)
Mean % annotation transferred                  98.0 (8.04)    99.0 (5.04)    90.0 (17.91)   98.7 (4.11)    86.2 (22.62)   97.3 (9.24)    99.2 (3.55)
Total assembly errors [3]                      4              1              4              2              0              1              3
Mean per-sample identity to reference (%) [4]  91.0 (1.44)    91.0 (1.33)    91.1 (1.41)    91.1 (1.4)     91.2 (1.41)    91.2 (1.38)    91.1 (1.34)

[1] The entire genome must be assembled into a unique contig.
[2] Number of duplicated reference bases reported by GAGE divided by the length of the reference.
[3] An error is an inversion, relocation or translocation reported by GAGE. Numbers reported are the total across all assemblies.
[4] As reported by GAGE.

Table S1: Summary of HIV-1 QC results. Numbers in parentheses are the standard deviations. See Figures 2a, 2b, S2 (an expanded version of Figure 2) and S3 for Box plots.

Metric                                         IVA            IVA.c5r2       PRICE          Inchworm       Trinity        VICUNA.80      VICUNA.90
Ideal assemblies (%) [1]                       18.4           21.4           0.0            0.0            1.0            0.0            0.0
Mean reference bases assembled (%)             99.0 (2.66)    98.8 (2.84)    89.8 (13.81)   99.6 (1.39)    97.6 (4.64)    94.3 (4.38)    92.4 (4.81)
Longest contig(s) sum (% of reference)         99.0 (5.49)    96.6 (6.51)    91.8 (16.61)   94.9 (10.18)   94.4 (8.97)    82.0 (10.95)   75.7 (11.16)
Assembly length (% of reference)               107.7 (8.26)   107.2 (8.09)   290.7 (92.12)  229.4 (48.32)  131.7 (25.92)  136.9 (21.0)   135.4 (18.8)
Mean duplication rate [2]                      0.1 (0.07)     0.1 (0.07)     1.8 (0.84)     0.4 (0.21)     0.3 (0.27)     0.4 (0.2)      0.4 (0.18)
Mean % annotation transferred                  99.1 (2.63)    99.0 (2.87)    92.1 (11.12)   98.3 (3.14)    96.1 (4.45)    95.3 (5.77)    92.9 (7.84)
Total assembly errors [3]                      6              0              6              4              0              0              0
Mean per-sample identity to reference (%) [4]  99.0 (1.07)    99.0 (1.07)    98.8 (1.1)     99.0 (1.0)     99.0 (1.03)    98.9 (1.1)     99.0 (1.06)

[1] Each segment must be assembled into a unique contig.
[2] Number of duplicated reference bases reported by GAGE divided by the length of the reference.
[3] An error is an inversion, relocation or translocation reported by GAGE. Numbers reported are the total across all assemblies.
[4] As reported by GAGE.

Table S2: Summary of Influenza QC results. Numbers in parentheses are the standard deviations. See Figures 2a, 2b, S2 (an expanded version of Figure 2) and S3 for Box plots.

[Figure S2: box plots with separate HIV-1 and Influenza panels. x axis: Assembler (IVA, IVA.c5r2, PRICE, Inchworm, Trinity, VICUNA.80, VICUNA.90). y axes: (a) Longest contig(s) / reference length (%); (b) Assembly length / reference length (%).]

Figure S2: Comparison of assembly success. This is an expanded version of Figure 2. (a) For each segment of the reference, the longest matching contig was found. This plot shows the total length of these contigs for each assembly, as a percentage of the reference length. (b) Total assembly lengths, excluding contamination by only counting contigs that match the reference, as a percentage of the reference length.
[Figure S3: box plots, panels (a) to (f). x axis: assembler (IVA, IVA.c5r2, PRICE, Inchworm, Trinity, VICUNA.80, VICUNA.90). y axes: Per cent of reference assembled (a, b); Assembled into a unique contig (%) (c); Uniquely assembled segments (d); Annotation features transferred (%) (e, f).]

Figure S3: Box plots of data summarised in Tables S1 and S2. (a), (b) Per cent of genome assembled, using GAGE output, for (a) HIV-1 and (b) Influenza. (c) Per cent of HIV-1 samples that were assembled into a single unique contig. (d) Number of segments assembled into a single unique contig, for each Influenza sample. (e), (f) Per cent of annotation elements transferred from reference for (e) HIV-1 and (f) Influenza.

[Figure S4: box plots with separate HIV and Influenza panels. x axis: Assembler (IVA, IVA.c5r2, PRICE, Trinity, VICUNA.80, VICUNA.90). y axes: (a) Total CPU time (hours); (b) Wall clock time (hours); (c) Peak RAM (GB).]

Figure S4: Resource usage of the assemblers per sample. (a) Total CPU time. Two PRICE outliers were removed from the Influenza plot, with values 177 and 650. (b) Wall clock time. One PRICE outlier was removed from the Influenza plot, with value 144. (c) Peak RAM usage.

[Figure S5: four plots titled "IVA QC contig layout and read depth", panels (a) to (d). Tracks: Contigs; Contig/Read coverage OK (Contigs, Reads); Read depth. x axis: Position in reference.]

Figure S5: Example Influenza virus assemblies for sample ERR732276. (a) IVA, (b) PRICE, (c) Inchworm, (d) Trinity. The corresponding plot for VICUNA is in Figure S6. See text for an explanation.

[Figure S6: plot titled "IVA QC contig layout and read depth". Tracks: Contigs; Contig/Read coverage OK (Contigs, Reads); Read depth. x axis: Position in reference.]

Figure S6: VICUNA assembly for Influenza virus assembly sample ERR732276. See Figure S5 for plots with the other assemblers.

[Figure S7: plot titled "IVA QC contig layout and read depth" showing a single contig, contig.00001. Tracks: Contigs; Contig/Read coverage OK (Contigs, Reads); Read depth. x axis: Position in reference.]

Figure S7: IVA contig layout, contig coverage and read depth for HIV sample ERR732130. See main text for an explanation.