ESSENTIALS manual. How to effectively detect your (conditional) essential genes from a TnSeq experiment. Summary: We present EssenTials (Essential gene detection by Transposon sequencing), an easy to use web-interface and an open source software tool for determining conditionally essential genes from Tn-Seq derived next-generation sequencing data. It features different normalization methods, statistical tests, genomic location bias correction and visualization options. We successfully applied it to in-house and published Tn-Seq, INseq, TraDIS and HITS datasets. GSF performs equally well or better in detecting the conditionally essential genes of literature datasets showing concordance with previously detected conditionally essential genes and experimental data of gene essentiality. Availability: The webinterface of GSF and the example files are freely accessible at: http://bamics.cmbi.ru.nl/websoftware/essentials/ Contact: [email protected] 1 Table of Contents Table of Contents .................................................................................................................................... 2 1 Introduction ..................................................................................................................................... 3 2 Part 1. Running the program ........................................................................................................... 4 2.1 Running the software............................................................................................................... 4 2.2 Creating configuration files. .................................................................................................... 5 2.3 Parameter selection.................................................................................................................. 6 3 Run diagnostics ............................................................................................................................. 12 4 Output generated by ESSENTIALS .............................................................................................. 13 4.1 Conditionally essential genes ................................................................................................ 13 4.2 Conditionally essential insertion sites ................................................................................... 18 4.3 Essential genes ...................................................................................................................... 19 5 MINOMICS .................................................................................................................................. 20 5.1 Entering (conditional) gene essentiality data in the database ................................................ 20 5.2 Entering (conditional) insertion site essentiality data in the database. .................................. 22 5.3 Visualizing all data on gene essentiality ............................................................................... 23 2 1 Introduction Essentials was developed to aid us in our analysis of TN-Seq datasets. Tn-Seq utilizes genome-wide transposon mutant libraries, of which the presence of each mutant is determined by amplification of DNA flanking transposon insertions and massively parallel sequencing. Results are mapped on the genome and summarized for each insertion and gene, generating a measurement of fitness for every knockout in comparison to the expected values based on the library size and number of insertion sites per gene. Similarly, to identify conditionally essential genes, these libraries are exposed to a challenge condition that will induce lost of mutants of genes essential for survival in these conditions. A measurement of fitness for every knockout comparison the challenge to the control condition can then be calculated. We’ve successfully applied our tool to analyze datasets from Tn-Seq experiments on various species of S. pneumoniae, H. influenzae and M. catarrhalis in house, but also on literature data sets of H. influenzae, P. aeruginosa, S. typhi and S. pneumoniae which were generated by Tn-Seq or Tn-seq like techniques. Essentials is implemented in Perl and R. Its web-interface is generated by the FG-web framework (van Hijum et al., unpublished; https://trac.nbic.nl/fgweb/). There are three major sections in the webinterface: (i) configuration file upload and genome selection; (ii) parameter settings; and (iii) displaying the results. The web tool works with major web browsers such as Internet Explorer, Firefox, Safari and Opera. It can be tuned to the needs of a researcher by modifying several parameters controlling the alignments, normalization, statistical tests and visualization. 3 2 Part 1. Running the program 2.1 Running the software The software is part of the FG-web framework, a bioinformatics work flow system for web based tools. Go to http://bamics2.cmbi.ru.nl/websoftware/essentials/essentials_start.php and select your genome of interest or select your own genbank on your computer. Select if you want to fill in the sample details manually, or if you want to upload a configuration file (e.g. from a previous run). Details on editing this file is given in 2.2. If you want to use the demo reads, just select demo mode and data from our lab on Streptococcus pneumoniae R6 is used in which a single library was sequenced four times under normal conditions . The configuration file and the genbank for the demo mode can also be downloaded from this site. ESSENTIALS has the ability to use premapped reads, either through BAM or SAM files, or through the use of a tab delimited file, which should look like: Genomic position<tab>number of reads Example: 44 809 1279 534 44 211 Figure 1. First page of Essentials webinterface 4 2.2 Creating configuration files. Although it is possible to fill in the sample characteristics in the website, this can get quite tedious if a very large number of samples have to be analysed. An alternative is to create the configuration file manually. Create a “configuration file” which contains the links to the files containing the samples, the barcodes, the presence of a transposon sequence, whether they are target or control samples, whether you used different libraries (for pairwise comparisons), the file format and the compression. The file is tab delimited. I prefer to use Microsoft Excel (see Figure 1). • • • • • • • Create one 1ine per sample. Use the same link to a file if multiple samples are present in the file. Barcodes can be of different lengths. If you did not use barcodes, please fill in “N”. This will assign the entire file to a single sample. If there is transposon sequence that needs to be removed, please fill in here. Use the sequence as in the sequence reads, don’t reverse complement etc. If there’s no transposon sequence present in your reads, fill in “N”. Sampletype can either be “control”, “target” or “ignore” If you used a single library, just fill in “lib1” or “library1” etc. If you used multiple libraries for target and control and you want to do pairwise comparisons, couple the target and control samples by giving the libraries the same name. Don’t use numbers only for naming the libraries. Sampleformat can be “fasta”, “fastq”, “scarf”, “export”, “csfasta” (SOLiD color space fasta), sam, bam and a custom wiggle tract format “wiggle”. Compression can be “gz” (gzipped), “bz2” (bzipped) or “zip” (normal zip). Password protected archives are not supported. Compressed files are expected to contain only one single file containing the (barcoded) reads. Save the file as a tab delimited text file. Figure 2. Configuration file in Microsoft Excel 5 2.3 Parameter selection Now we’ve arrived at one most important pages in the webinterface, what parameters to select (Fig 3). Quite some settings can be adapted to suit the experiment. The link “Help, all these settings, what should I change” gives a concise overview of all the options, but below we will give a more detailed explanation on the why and how of these settings. Transposon insertion site options Selection of librarysize and type of mutagenesis system Figure 3. Parameter selection 1. Select what kind of integration system was used. Mariner inserts into TA inverted repeat sequences and upon insertion generates a duplication of the TA site. We can use this to assign the reads to these specific insertion sites. If you used Tn5 or another method that integrates randomly, select random. If a new preference site needs to be added for your mutagenesis system, please contact the authors. 2. The setting library size allows the filtering of spurious reads which could be a common source of error with Tn-Seq experiments. The insertion sites are sorted according to their number of reads. When a mutant library of 30k mutants is selected, any data of insertion site 30.0001 and higher is removed. This filtering is particularly important when no transposon sequence or TA site is present in your sequence reads, which allows the filtering of actual transposon insertion reads from spurious reads. 6 Filtering, splitting and alignment options This section is used to modify the alignments of your read to your reference. Figure 4. Parameter selection 1. We can select if we want to ignore insertion sites that are not unique. This is recommended, else insertion sites will be detected and their associated reads will be counted in essential genes, just because a non-unique piece of DNA was present in the essential gene. 2. The number of mismatches to your barcode can be selected. We use a modified version of the fastx barcode splitter. The splitter uses Levenshtein (edit) distances instead of hamming distances to allow for insertions/deletions in barcodes and barcodes of unequal size. Only go above 2 if you have very large barcodes. 3. Please select how much sequence is left after removal of the barcode and removal of the transposon sequence. 4. Select on which end the barcode is present. Although commonly on the 5’ end, we cannot rule out the possibility that someone sequences from the 3’ end but has their barcode on the 5’ end. 5. Select on which end the transposon sequence is (or was!) present. This is quite important, since this defines what strand was sequenced relative to the transposon cassette. For most TnSeq experiments, the transposon sequence is on the end of the line (3’) which means that you should also select the reverse strand, instead of the forward strand. For TraDIS for instance, the transposon sequence is/was present on the 5’ end, resulting in reads on the forward strand relative to the transposon cassette. 7 6. Select how much sequence triggers a sequence read count in an alignment. The number should be lower or the same to the number given in step 3. 7. Select what strand to align with. For 5’ (bol) transposon sequence select 0, for 3’ (eol) transposon sequence select 1 (see 5). If you don’t know (for instance for lit. data), you can select 2. Settings 5 and 7 determine which side of the read determines the exact insertion site. 8. Select if you want to ignore insertion sites very close to the end of the gene, since these likely do not cause a loss of functionality. Specifically in the case of mariner which integrates in the TA insertion and duplicates the TA site. Integration in a TAA stopcodon occurs, the duplication restores the TAA stopcodon and the transposon insertion, although still detected as being inside the gene, will not change the gene. Normalization and statistics options These settings control several options used in the various R scripts for bias correction, normalization and statistics. Figure 5. Parameter selection 1. Presumably, because of genomic replication during growth and the resulting increase in available DNA close to the origin of replication (ORI), read counts and number of insertions are higher closer to the ORI, producing a bias in sequencing coverage of insertion sites. We correct for this bias by using locally weighted scatterplot smoothing (LOESS) on read counts per insertion site to genomic position. Without this correction, more reads will be found for genes close to the ORI, potentially causing these to be detected as non-essential when comparing it to the expected number of reads. This bias can easily cause a 3 fold difference in sequence reads, which could for 8 instance change the ratio between observed and expected read counts from 10 fold to 30 fold. Consequently, close to the terminus, non-essential genes could be detected as being essential because of their perceived lower number of reads compared to what is expected. Reads per insertion site Reads per gene X – axis: genomic position Y – axis log 10 read counts Figuur 6 A. Bacterial replication. Unlike eukaryotic DNA replication, bacterial replication occurs at one position on the genome, the origin of replication (ORI) and often multiple replication forks are present in fast dividing cells. B. Effect of Loess correction. Top: uncorrected, Bottom: corrected. Up 3 fold difference in average read counts can be seen between insertional positions close to the ORI and close to the terminus. 2. Which normalization to use, TMM, RLE, Quantile or just correction on the total read count per sample. TMM and RLE are robust methods normally used for RNA-seq data, Quantile is the well known microarray normalization method from Bolstadt. 3. qCML or CR. The statistical model EdgeR uses is based on the negative binomial distribution. When a negative binomial model is fitted, we need to estimate the dispersion(s) before we carry out the analysis. edgeR provides two ways of estimating the dispersion(s), the quantile-adjusted conditional maximum likelihood (qCML) method and the Cox-Reid profile-adjusted likelyhood (CR) method. Since the exact test used in the qCML method does not allow the use of confounders EdgeR uses a Generalized Linear Model to model variances if multiple conditions are used. In general, we apply the qCML method to experiments with single factor (two conditions, one single mutant library) and the CR method to experiments with multiple factors (two conditions, multiple mutant libraries). 4. Common or tagwise dispersion. This parameter controls how the variance is modeled. Common dispersion means that the same dispersion is used for each gene while moderated tagwise dispersion allows the use of different dispersion values per gene. Moderated tagwise dispersion includes the variance in your replicates of each gene per gene, while common only includes the variance between your replicates between all genes. 5. Prior.n . The value for prior.n determines the amount of smoothing of tagwise dispersions towards the common dispersion. Quoted from one of the EdgeR developers: “You can think of it as like a weight for the common value. (It is actually the weight for the common likelihood in the weighted likelihood equation). The larger the value for prior.n, the more smoothing, i.e. the closer your tagwise dispersion estimates will be to the common dispersion. If you use a prior.n of 1, then that gives the common likelihood the weight of one observation” 6. P value adjustment method. The adjustment methods include the Bonferroni correction (bonferroni) in which the p-values are multiplied by the number of comparisons. Less conservative corrections are also included by Holm (1979) (holm), Hochberg (1988) (hochberg), Hommel (1988) (hommel), Benjamini & Hochberg (1995) (BH), and Benjamini & Yekutieli (2001) (BY), respectively. A pass-through option (none) is also included 9 Minomics visualization option Figuur 7 Parameter selection Controls which P value is shown in the info box in the visualization and how many reads will trigger the generation of the visualization of an insertion site. Figure 8. Minomics visualization example 10 Other options Figure 9. Minomics visualization example Modifies output for debug purposes. At present a zip archive of the results can be generated for quick download and archival. Additionally an email address can be provided which will be used to report on the progress of the run. Press proceed when done. 11 3 Run diagnostics Steps of the procedure are shown in the running window. This page can be bookmarked and retrieved at a later stage when the results are ready. Results take between 15 minutes to a couple of hours to complete. Mostly depends on the speed of download of the sequence files and the size of the genome. Various statistics are displayed such as the numbers of reads in the file, reads after splitting, reads after filtering, reads aligned to insertion sites etc. If there’s a problem with your files, likely it will be shown here. Please contact me ([email protected]) or Sacha van Hijum ([email protected] ) with the link to the failed analysis on FG-web. Figure 10. Run diagnostics 12 4 Output generated by ESSENTIALS Here, we describe the output that can be obtained from an ESSENTIALS run. Various files are presented to the user, as well as an option to store and visualize the results in MINOMICS, the genomics visualization tool available in FG-web. Parameters, run log and the original configuration file containing the download links can also be viewed. 4.1 Conditionally essential genes If there were target and control samples present, various graphs and a tab delimited table are generated on gene conditional essentiality. Click on the appropriate links to show them. Figure 11. Output page 13 PCA plots. A PCA visualization to discover if the replicates of your datasets are comparable. PCA needs at least three samples. If only two samples were provided, an x-plot of the samples is produced instead. This will allow for some comparison between the samples. Figure 12. Example of a PCA plot with 12 samples. Numbers that are close to each other are samples with similar data characteristics. In grey the locustags or insertion sites plotted on the same axis. 14 MA plots. A standard MA plot is produced. The signal is calculated as the log fraction of the number of reads of a gene divided by the total number of reads. Fold change (y-axis) is the log2 of the number of reads of the target sample divided by the number of reads of the control sample Figure 13. Example MA plot. The blue line shows where the log2 2fold change is and genes marked in red are the top 200 significant genes. Density plot The log2 transformed ratio of target over control or measured over expected and signal of (conditionally) essential genes or insertion sites is then used to generate kernel density plots using a Gaussian model with stepwise increasing bandwidth and 2048 bins until a single (in the case of essential genes) or four (in the case of conditional essential genes) local minima are found. Local minima are detected by calculating the first derivative of the density and by locating the position where it traverses from values below to values above zero. This fold change value corresponds to a value closest to the minimum between the peaks of essential and non-essential genes and can be used as a cut-off to determine whether a gene is essential or not. 15 Figure 14. Example density plot. the red dot depics the optimal cutoff point between essential and non-essential genes Combined table A table is produced containing the data before between-sample normalization, after between-sample normalization ,output from the statistical test and annotation of the genes. This file can be opened in Microsoft Excel or any other useful utility to read tab delimited files. 16 Figure 15. Tab delimited conditional gene essentiality output file opened in Microsoft Excel. Column B contains the locus tags, columns marked with .x contain the data before between-sample normalization. Columns marked with .y contains data normalized (pseudocounts). The next four columns contain data generated by the statistical test. logConc: The signal calculated as the log fraction of the number of reads of a gene divided by the total number of reads, logFC: log2 of the number of reads of the target sample divided by the number of reads of the control sample. P-value: probability of obtaining a test statistic as the one that was actually observed. Adj. P val: P value adjusted for multiple testing. The final seven columns contain information about the location, size and annotation of the gene. Genes marked with NA have less than 10 reads per sample assigned to them either because they are essential, because they are not hit in the mutant library, or because the gene has no unique insertion sites. 17 4.2 Conditionally essential insertion sites If there were target and control samples present, various graphs and a tab delimited table are generated on insertion site conditional essentiality. Click on the appropriate links to show them. Please see 4.1 for an explanation of the output generated. Figure 16 Tab delimited conditional insertion site essentiality output file opened in Microsoft Excel. Column C contains the positions of the insertion sites, columns marked with .x contain the data before between-sample normalization. Column marked with .y contains data normalized (pseudocounts). The next four columns contain data generated by the statistical test. logConc: The signal calculated as the log fraction of the number of reads of a gene divided by the total number of reads, logFC: log2 of the number of reads of the target sample divided by the number of reads of the control sample. P-value: probability of obtaining a test statistic as the one that was actually observed. Adj. P val: P value adjusted for multiple testing. The final eight columns contain information about the location, size and annotation of the gene 18 4.3 Essential genes Finally, the last section contains information about genes that are always essential. Predictions are made based on comparison of the expected number of reads of a gene vs the measured number of reads per gene. The expected number of reads is based on number of (unique) insertion sites in a gene, the total number of insertion sites in the genome and the number of reads in the sequencing library. All the outputs are the same as for the conditionally essential genes (see 4.1), except that the final .x column contains the number of putative insertion sites, the final .y column contains the expected number of reads. Figure 1. Tab delimited conditional gene essentiality output file opened in Microsoft Excel. Column B contains the locus tags, columns marked with .x contain the data before between-sample normalization. The final .x column contains the number of unique insertion site flanking sequences. Column marked with .y contain data normalized (pseudocounts). The final .y column contains the expected number of reads. The next four columns contain data generated by the statistical test. logConc: The signal calculated as the log fraction of the number of reads of a gene divided by the total number of reads, logFC: log2 of the number of reads of the target sample divided by the number of reads of the control sample. P-value: probability of obtaining a test statistic as the one that was actually observed. Adj. P val: P value adjusted for multiple testing. The final seven columns contain information about the location, size and annotation of the gene. Genes marked with NA have less than 2 reads per sample and less than 2 insertion sites assigned to them because the gene has no unique insertion sites or because the gene is very small. 19 5 MINOMICS The data on gene essentiality can be visualized in MINOMICS, a genomics visualization tool developed by Rutger Brouwer. For more information about MINOMICS, please see http://bioinformatics.biol.rug.nl/supplementary/minomics/main.php or the journal publication. MINOMICS relies on count or expression data entered into the Add-2-DB database used in FG-web. 5.1 Entering (conditional) gene essentiality data in the database To visualize data in MINOMICS, it has to be entered into the database from which the images are generated. Please follow the steps outlined below 1. Press the purple arrow on the black circle to add data about (conditional) gene essentiality. If you have not selected an organism for minomics during this session, the organism selection window will open before data can be added. 2. In the window that opens, select new organism if the organism has not been entered in the database yet. 20 3. Select the organism of choice and give it a short name. Click proceed. After finishing close the window. 4. Click on the purple arrow on the black circle again. Now you should be able to add data to the organism of choice. Select “new” and give the database a description. Press proceed in the next window 21 5.2 Entering (conditional) insertion site essentiality data in the database. To visualize insertion site flanking sequences, please follow the steps below: 1. Click on the purple arrow in the black circle in the conditional essential insertion site section for control insertion sites 2. Add the control insertion sites as regulatory motifs. Use a new database. Name it “control”. Click proceed, close the next window. 3. Click on the purple arrow in the black circle in the conditional essential insertion site section for target insertion sites. Add the control insertion sites as regulatory motifs. Use a new database. Name it “target”. Click proceed, close the next window. 22 5.3 Visualizing all data on gene essentiality 1. Click on Run MINOMICS 2. Click proceed to MINOMICS. We don’t need to upload files since we entered them in the database already in the previous steps 23 3. Select ratio/P, select the uploaded runs, select operons, motif data, select target and control, select transcriptional terminators. Click proceed 4. Select if you want to visualize both essential and conditionally essential genes. Click proceed 5. Click proceed in the next window 6. Click proceed in the correlation window. Currently there is nothing to correlate. 24 7. Press genome map in the output window of MINOMICS. A visualization window will open. The visualization uses SVG and javascript. Please have both enabled. The tool works best in Firefox. For more information about MINOMICS please visit http://bioinformatics.biol.rug.nl/supplementary/minomics/main.php 25
© Copyright 2025