Transcriptor: a web-tool for analysis of genome-wide transcriptome data of prokaryotes

Summary: Transcriptor is a web-tool for the analysis of RNA-Seq data of prokaryotes. It uses a transcript detection method that allows pinpointing untranslated regions, operons and small RNAs. Transcriptor detects, annotates and compares transcripts from different samples and/or experiments. Results can be downloaded as tab-delimited files and in GFF format for visualization in, e.g., Artemis.

Availability: Transcriptor and its guide are accessible from: http://bamics2.cmbi.ru.nl/websoftware/transcriptor.

Contact: [email protected]

TABLE OF CONTENTS

Introduction
Part 1: Running Transcriptor
    Step 1 – Select reference genome
    Step 2 – Select run mode
    Step 3a – Upload data sets using existing configuration file
    Step 3b – Create configuration file
    Step 4 – Change Transcriptor settings
    Step 5 – Change Bowtie2 settings
    Step 6 – Run phase
Part 2: Output generated by Transcriptor
    Visualisation of results in Artemis
    Annotation of Operons
    Annotation of RNA
    Studying gene expression
References

INTRODUCTION

This guide provides a short overview of the application of Transcriptor to the analysis of an RNA-Seq dataset. Transcriptor was developed for the annotation and comparison of bacterial transcriptomes derived from RNA-Seq data (see Fig. 1 for its flow-chart). It is designed to let researchers easily pinpoint similarities or differences between multiple genome-wide transcriptomes from different samples or conditions, taking into account prokaryotic genomic features such as operon structures and small RNAs. To guide this process, the tool generates various results that support comparative transcriptomics.

Figure 1. Transcriptor flow-chart.

As depicted in Figure 1, the transcript annotation starts with the detection of transcripts based on expression data. Transcripts are annotated using the existing genome annotation as reference. Additional features are inferred from the expression data, including untranslated regions (UTRs), transcription start sites (TSSs), novel small RNAs (sRNAs) and operon structures.
Annotated transcriptomes are then compared with each other to generate additional information that lets the user straightforwardly pinpoint transcriptional differences across samples and conditions. From the results, the user can identify conditionally expressed transcripts and operon structures that can be used in downstream analysis of gene regulation, promoters or regulatory RNAs.

The Transcriptor web-interface is generated by the FG-web framework (http://trac.nbic.nl/fgweb). The web-interface has three sections: 1) data upload, 2) parameter settings, and 3) display of the results. The web tool does not require a login and works with major web browsers such as Internet Explorer, Firefox, Safari and Opera. The results can easily be viewed in genome browsers (such as Artemis) and further analyzed with standard spreadsheet software.

In this guide, the application of Transcriptor is exemplified by the analysis of samples of single-stranded RNA-Seq data (Passalacqua et al., 2012). The samples are derived from cultures of Bacillus anthracis grown in rich medium with no treatment, under cold stress, with ethanol, and with NaCl.

PART 1: RUNNING TRANSCRIPTOR

Transcriptor is part of the FG-web framework, a bioinformatics workflow system for web-based tools. To start the software, go to http://bamics2.cmbi.ru.nl/websoftware/transcriptor.

STEP 1 – SELECT REFERENCE GENOME

Select a reference genome from a list of finished genomes or upload your own reference genome file, which includes the genome sequence and annotation (GenBank format).

STEP 2 – SELECT RUN MODE

On the same page, select one of the following options for the run mode (see Fig. 2):

Figure 2. Select run mode.

There are three ways to provide information about the data sets:

1. Upload existing configuration file. You can create your own configuration file in a text editor or spreadsheet application and upload this file. This option is preferable when there are many data sets.
See Step 3a (further below) for details on how to create a configuration file in a text editor or spreadsheet editor (such as MS Excel).

2. Create a configuration file. Alternatively, you can enter all data sets one by one via the web application. In this case, you must enter the number of samples that you want to include in your analysis.

3. Run in demo mode. This option is provided as a demonstration of Transcriptor. In this case, Transcriptor uses four different samples of single-stranded RNA-Seq data of Bacillus anthracis (Passalacqua et al., 2012). The RNA-Seq data was originally used to study the transcriptional response of B. anthracis to different growth conditions. For the demo, RNA-Seq data (including replicates) from two conditions (grown in rich medium versus treated with ethanol) are used. To reduce run time, the reads were mapped against the first 350,000 bases of the B. anthracis reference genome.

Press Next when done.

STEP 3A – UPLOAD DATA SETS USING EXISTING CONFIGURATION FILE

If you have chosen the run mode “upload existing configuration file”, Transcriptor downloads the data sets specified in the configuration file to the server for further calculations. The configuration file can be created using a text editor or spreadsheet application.

The configuration file is a tab-delimited text file (see Table 1). Each line contains the information on one sample; create a new line for each new sample / data set. If a line starts with a hash sign (#), the line is ignored, so you can use the hash sign to include comments.

Enter the following information for each sample, separated by tabs. The first and second columns contain the names of the experiment and the sample. All samples with the same experiment name are considered (technical or biological) replicates. During the annotation process, Transcriptor uses information retrieved from all replicates to generate annotation results for each experiment.
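
As a rough illustration of how rows that share an experiment name end up in one replicate group, the following Python sketch reads configuration lines, skips lines starting with a hash sign, and collects the sample names per experiment. The function name `group_replicates` and the example rows are assumptions made for illustration only; this is not Transcriptor's own code, and only the first two columns are inspected.

```python
from collections import defaultdict


def group_replicates(config_text):
    """Map each experiment name to the list of its sample names.

    Assumes the layout described in the guide: tab-separated columns with
    the experiment name first and the sample name second, and comment
    lines starting with '#'. Hypothetical helper, not part of Transcriptor.
    """
    groups = defaultdict(list)
    for line in config_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected at least two columns: {line!r}")
        experiment, sample = fields[0].strip(), fields[1].strip()
        groups[experiment].append(sample)
    return dict(groups)


# Hypothetical rows; the experiment and sample names are made up.
example = (
    "# experiment\tsample\n"
    "Control\tcontrol_rep1\n"
    "Control\tcontrol_rep2\n"
    "Ethanol\tethanol_rep1\n"
)
replicates = group_replicates(example)
```

Running a check like this on a configuration file before uploading it can help confirm that the intended samples, and only those, share an experiment name.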
Because replicates are pooled in this way, please consider carefully which samples should be treated as replicates.

Next, enter the strand information: “forward” (data derived from the forward strand), “reverse” (data derived from the reverse complement strand) or “unstranded” (data derived from both strand orientations, i.e. non-strand-specific data). If you want Transcriptor to choose the strand orientation of the data, enter “stranded” (strand-specific data).

Next, enter the sequence format. Alignments of read data can be submitted in the binary version of the standard alignment format (BAM). Alternatively, RNA-Seq read data can be submitted in FASTQ format.

Finally, enter the link (URL) to a file that contains the sample data. Transcriptor uses this link to download the sample data. A sample data file in FASTQ format may be compressed. The following compression formats are supported: gzip and zip.

Table 1. Example of a configuration file.

# Experiment    Sample     Strand        Format    URL
B.anthracis     NaCl       Unstranded    BAM       http://bamics2.cmbi.ru.nl/websoftware/transcriptor/sample1
B.anthracis     Control    Forward       SAM       http://bamics2.cmbi.ru.nl/websoftware/transcriptor/sample2
B.anthracis     Control    Reverse       FASTQ     http://bamics2.cmbi.ru.nl/websoftware/transcriptor/sample3

STEP 3B – CREATE CONFIGURATION FILE

If you have chosen the run mode option to create a configuration file, you proceed to the following web page (see Fig. 3).

Figure 3. Create config file.

First, choose a name for your experiment. This name is used as a prefix for result files created by Transcriptor. Next, provide the following information for each sample (the numbers correspond to those in Fig. 3):

1. Select in the drop-down box the option to upload a file or to let Transcriptor download the file.
2. In the following text field, provide the link to the sample data.
3. Select the compression format in the following drop-down box.
The following formats are supported: “Uncompressed”, “GZip” for data compressed with gzip, or “Zip” for data compressed with zip (note that only PKZIP versions 4.5 or earlier are supported).
4. Select the sample format. For read data that needs to be mapped to a reference genome, use FASTQ-formatted data. Alternatively, you can use mapped read data in BAM or SAM format.
5. Provide a sample name. This name is used as a prefix for sample-specific output files created by Transcriptor.
6. Provide the name of the experiment. This name is used as a prefix for experiment-specific output files. Samples that have the same experiment name are considered (technical or biological) replicates. During the annotation process, Transcriptor joins information retrieved from different replicates. Please consider this when you decide which samples should be joined in this manner.
7. Select the strand information of the sample: “Forward”, “Reverse”, or “Unstranded” if the data is not strand-specific. If you want Transcriptor to choose the strand orientation of the data, select “stranded”.

Steps 1 to 7 are repeated for each sample. Press Proceed when done.

STEP 4 – CHANGE TRANSCRIPTOR SETTINGS

After you have uploaded an existing configuration file (Step 3a) or created a new configuration file (Step 3b), you can change the following general settings before starting the analysis (the numbers correspond to those in Fig. 4):

Figure 4. Parameter selection.

1. Enter the minimum length of transcripts. Transcriptor detects transcripts based on the expression profiles. Transcripts that are shorter than the minimum length are not reported.
2. Select whether Transcriptor should annotate transcripts. If no annotation is performed, Transcriptor only reports the profile-derived transcripts with positional information. Otherwise, transcripts are annotated by determining coding regions (CDSs), UTRs and TSSs.
For the annotation, the genome annotation provided by the reference GenBank file is used.
3. Enter the minimum coverage of transcripts. Transcriptor determines whether each CDS reaches a minimum read coverage. If a CDS has less read coverage than the threshold, the reads partially covering the CDS are considered an additional transcript, which is reported separately by Transcriptor. This procedure allows regulatory RNAs to be detected and reported. The coverage is calculated by dividing the number of nucleotides of the transcript that overlap with the CDS by the length of the CDS.
4. Select “Yes” if you want Transcriptor to provide RPKM values for each transcript. This option requires read data in FASTQ, SAM or BAM format.
5. If you have entered samples from different experiments, Transcriptor can compare transcripts across different experiments. Additional results are generated that make it easy to pinpoint transcriptional variations across experiments. For details, see further below.
6. If Transcriptor compares transcripts across different experiments, this threshold is the minimum percentage of nucleotides that must overlap between transcripts. If the percentage of overlap is less than the threshold, Transcriptor considers the transcripts as two separate transcripts. The percentage is calculated by dividing the number of overlapping nucleotides of the two transcripts by the length of the shorter transcript.
7. Transcripts are determined by applying a previously described segmentation algorithm (Todt et al., 2012). Segments with read coverage above a minimum read count threshold are identified and adjacent segments are joined. The threshold is obtained from the frequency distribution of mapped non-overlapping reads. Alternatively, the user can select a value for this threshold.
8. Enter the read length of the RNA-Seq data. If you have data with different read lengths, enter the minimum read length.
This option is used by the segmentation algorithm to identify isolated reads.
9. If you want Transcriptor to choose the strand orientation of the data, enter “yes”. Otherwise, the provided strand information is used.
10. If you want Transcriptor to determine differentially expressed genes across experiments, enter “yes”.

STEP 5 – CHANGE BOWTIE2 SETTINGS

If you provide read sequence data for your analysis, Transcriptor maps the reads to the selected reference genome using Bowtie2 (Langmead and Salzberg, 2012). You can change some Bowtie2-specific parameters (see Fig. 5). For detailed information, see the Bowtie2 documentation.

Figure 5. Parameter selection.

1. Choose the platform-specific Phred quality offset of the read data. The default is 33, which is the standard offset used by most platforms.
2. Select the Bowtie2 preset alignment option. The default is “fast-local”, which provides good results in almost all cases (see the Bowtie2 documentation).

Press Proceed when done.

STEP 6 – RUN PHASE

The progress of the analysis is shown in the running window (see Fig. 6). The URL of this page can be bookmarked and revisited later when the results are ready. The analysis may take from 10 minutes to a couple of hours to complete. The run time mostly depends on the download speed of the data files and the size of the genome. During the analysis, various statistics are displayed, such as the number of features in the reference genome file, the number of mapped reads and the number of features detected in each sample. If there is a problem with your files, it will most likely be shown here. A link with contact information is shown when something is wrong with the run. Please include the URL of the failed analysis on FG-web in your email to the contact person.

Figure 6. Run diagnostics.

PART 2: OUTPUT GENERATED BY TRANSCRIPTOR

The results of the analysis can be downloaded as a compressed archive file.
This file contains different result files depending on the number of samples and the parameter settings. The following tables (Tables 2, 3, 4 and 5) give an overview of the result files. In these examples, the name of the run is “Demo”. This name is used as a prefix for the names of the result files.

Table 2. Results files on general summaries.

Overview: the following files give an overview of the main results of the analysis.

results.html
    This file can be viewed in a web browser and summarizes the main results of the run.
Demo_anno_sum.txt
    This file contains a table that lists, for each sample and experiment, the number of features detected by Transcriptor. The following features are included: genes, RNAs (including small RNAs and anti-sense RNAs), untranslated regions (5’UTRs and 3’UTRs), transcription start sites (TSSs) and termination sites (TESs).

Table 3. Results files on operon information.

Operon: the following files contain information on operon structures derived from expression profiles.

Demo_ops_sum.txt
    This file gives a summary of additional transcripts grouped by experiment and strand orientation. Three types of transcripts are distinguished:
    - RNA: partially overlaps a CDS in sense orientation
    - asRNA: overlaps a CDS in anti-sense orientation
    - ncRNA: does not overlap with a CDS
Demo_ops_match.txt
    This file contains a tab-separated table of all additional transcripts grouped by experiment and strand orientation. The name of a transcript indicates its type (RNA, asRNA or ncRNA, see above). The table lists for each transcript the experiments in which the transcript was detected.
Demo_ops_match.gff
    This file is formatted in GFF and is used for visualization in a genome browser (such as Artemis). When this file is viewed in Artemis (Carver et al., 2012), genes that are detected in all experiments are colored green; the other genes are colored red. Genes that are not expressed are colored grey.

Table 4. Results files on RNA information.
RNA: the following files contain information on transcripts that were not linked to genes of the reference genome annotation.

Demo_rna_sum.txt
    This file gives a summary of additional transcripts grouped by experiment and strand orientation. Three types of transcripts are distinguished:
    - RNA: partially overlaps a CDS in sense orientation
    - asRNA: overlaps a CDS in anti-sense orientation
    - ncRNA: does not overlap with a CDS
Demo_rna_match.txt
    This file contains a tab-separated table of all additional transcripts grouped by experiment and strand orientation. The name of a transcript indicates its type (RNA, asRNA or ncRNA, see above). The table lists for each transcript the experiments in which the transcript was detected.
Demo_rna_match.gff
    This file is formatted in GFF and is used for visualization in a genome browser (such as Artemis). When this file is viewed in Artemis, transcripts that are detected in all experiments are colored green; the other transcripts are colored red.

Table 5. Results files on differentially expressed genes.

Gene expression: the following files include information on differentially expressed genes.

Demo_FC.txt
    This file (see Fig. 11) contains, for each gene, annotation information, read count data, fold changes (FC) and false-discovery rate (FDR) values (computed with the Benjamini and Hochberg algorithm) for the different experiments. The file can be opened in spreadsheet software or in a genome browser (such as Artemis).
Demo_FC.pdf
    Fold changes (FC) and false-discovery rates are determined using edgeR (Robinson et al., 2010). The PDF includes plots generated by edgeR:
    - a plot of a principal components analysis of log FC values of all samples
    - a plot of the dispersion (for the variance parameter) of the read data
    - a hierarchical clustering of samples by CPM values. CPMs are the read counts per gene normalized by the sample’s library size (total counts), expressed per million counts.
Demo_FC.jpeg
    This plot is the same hierarchical clustering of CPM values included in the PDF file.

VISUALISATION OF RESULTS IN ARTEMIS

The results include various GFF-formatted result files that can be used for visualization in a genome browser such as Artemis (Carver et al., 2012). The GFF-formatted files include annotation results for all experiments (see Fig. 7). File names are generated by concatenating the name of the run (such as “Demo”) with the names of the experiments. In addition, GFF-formatted files are generated that provide matching information between experiments for operons and RNAs.

Figure 6 shows the visualization in Artemis of the GFF-formatted annotation files produced by Transcriptor. Here, the results of the transcriptional analysis of RNA-Seq data obtained for B. anthracis grown under four different growth conditions are shown: rich medium with no treatment (A: control), cold stress (B: stress), with ethanol (C: EtOH), and with NaCl (D: NaCl). In Figure 6, the read coverage is plotted for the forward and reverse complement strands in the upper two panels. The lower two panels show the annotations of the forward and reverse complement strands.

Figure 6. Visualization of transcript annotation viewed in Artemis. Transcriptional analysis using Transcriptor of RNA-Seq data obtained for B. anthracis grown under four different growth conditions: rich medium with no treatment (A: control), cold stress (B: stress), with ethanol (C: EtOH), and with NaCl (D: NaCl). The read coverage is plotted for the forward and reverse complement strands in the upper two panels. The lower two panels show the annotations of the forward and reverse complement strands.

Colors highlight different genome features. CDSs retrieved from the genome annotation of the reference GenBank file are highlighted blue.
However, if the number of reads covering the CDS is less than the minimum read coverage threshold parameter, the CDS is colored grey. UTR-related regions are colored brown. All additional transcripts are colored orange. GFF files on operon matching and RNA matching use green for transcripts found in all experiments and red in all other cases. Tools like Artemis can display many annotation files simultaneously by using different tracks in the visualization.

ANNOTATION OF OPERONS

Transcriptor identifies putative operon structures ab initio, based on transcripts derived from the expression profiles. A transcript covering two or more annotated genes is regarded as an operon. Operons are determined for the different experiments and compared with each other. The matching information (see Table 3) is stored in a tab-delimited file that can be viewed in a spreadsheet application (see Fig. 7) and in a GFF file for visualization (see Fig. 8). The following example illustrates how matching information and visualisation results can be used to pinpoint variations in expression patterns of transcription units across different experiments.

Figure 6. Matching operon information. Operons are named after the first gene (5’ -> 3’) in the operon. The GBAA_0034 to GBAA_0036 genes are expressed as a single transcriptional unit under all conditions. Genes GBAA_0038 and GBAA_0039 are expressed in the operon (op_GBAA_0038) under Cold and Ethanol stress.

The file with matching results is shown in Figure 7. The first column lists the names of the genes. The second column gives the strand information. For every experiment, a column is included indicating to which operon each gene is assigned. A dash indicates that for a given gene in a given experiment no or not enough read data was available. Usually, this means that the gene is not expressed. The last column in the table contains the name of the operon if it is found in all experiments.
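
One plausible reading of the last column can be sketched in Python: for each gene, the operon name is kept only when every experiment assigns the gene to the same operon, and a dash is used in every other case. The function name `consensus_operon` and the dictionary layout are assumptions for illustration, and the example values are loosely modelled on the genes named in the figure caption above; this is not Transcriptor's actual implementation.

```python
def consensus_operon(assignments):
    """Return the shared operon name, or '-' if the experiments disagree.

    `assignments` maps experiment name -> operon name, with '-' meaning
    that no or not enough read data was available. Hypothetical helper.
    """
    names = set(assignments.values())
    if len(names) == 1 and "-" not in names:
        return names.pop()
    return "-"


# Illustrative per-experiment operon assignments for two genes.
table = {
    "GBAA_0038": {
        "Cold": "op_GBAA_0038",
        "Control": "op_GBAA_0038",
        "Ethanol": "op_GBAA_0038",
        "NaCl": "op_GBAA_0038",
    },
    "GBAA_0039": {
        "Cold": "op_GBAA_0038",
        "Control": "-",
        "Ethanol": "op_GBAA_0038",
        "NaCl": "-",
    },
}
consensus = {gene: consensus_operon(row) for gene, row in table.items()}
```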
If the operon is not found in all experiments, a dash is inserted instead. Transcription units and operons are assigned by determining transcripts derived from expression profiles. A transcription unit or operon is named after the first gene on the reference genome sequence.

Figure 7. Annotation viewed in Artemis. Transcriptional analysis using Transcriptor of RNA-Seq data obtained for B. anthracis grown under four different growth conditions: rich medium with no treatment (A: control), cold stress (B: stress), with ethanol (C: EtOH), and with NaCl (D: NaCl). The read coverage is plotted for the forward and reverse complement strands in the upper two panels. The lower two panels show the annotations of the forward and reverse complement strands. The two genes GBAA_0038 and GBAA_0039 (see boxed region) probably form an operon. The expression of GBAA_0039 is low (grey-coloured genes) in two experiments (Control and NaCl), resulting in the different operon detection results.

In Figure 8, operon op_GBAA_0038, starting with gene GBAA_0038, was determined in four experiments. In two experiments (Cold and Ethanol), operon op_GBAA_0038 consists of GBAA_0038 and GBAA_0039. In the other two experiments (Control and NaCl), op_GBAA_0038 consists only of GBAA_0038, and GBAA_0039 is not expressed. Therefore, in the last column of the table, a match is indicated by the name of the transcription unit only for gene GBAA_0038. In Figure 9, the annotation of this region shows that GBAA_0038 and GBAA_0039 probably form an operon. The expression of GBAA_0039 is low in two experiments (Control and NaCl), resulting in the different operon detection results.

ANNOTATION OF RNA

Transcriptor provides comparison information on transcripts that were not linked to existing gene annotation information (e.g. novel transcripts or transcripts partly overlapping genes).
These transcripts are grouped into transcripts partly overlapping genes (RNA), anti-sense RNA (asRNA) and non-coding RNA (ncRNA). Transcriptor also provides summary and matching information on the grouped transcripts. This information makes it easy to pinpoint differences in ncRNAs across different experiments. The regions of interest can be visualized by using the annotation information provided by Transcriptor. The following example illustrates how matching information and visualisation results can be used to pinpoint variations in expression patterns of transcription units across different experiments (or conditions).

Using the tabular RNA matching information created by Transcriptor, we selected two non-coding RNAs, ncRNA_17 and ncRNA_20, that are expressed under all four growth conditions (Cold, Control, Ethanol and NaCl, see Figure 9). The expression profiles (read coverage profiles) and annotation (viewed in Artemis) are shown in Figure 10.

Figure 8. Results on RNA matching viewed in MS Excel. Transcriptional analysis using Transcriptor of B. anthracis grown under four different growth conditions. Here, only a part of the original table is shown. First column: names of the reference RNAs. These RNAs are all transcripts, over all conditions, that are not linked to genes. Second column: strand information. For every condition, a column is included. If the condition-specific RNA overlaps on the genome with the reference RNA (the left-most column), the condition is stated (here: Cold, Control, Ethanol and NaCl). A dash indicates that the RNA is not detected under the specific condition because the read coverage is below the minimum read coverage of 80%. The final column is the consensus information indicating expression in all conditions (asterisk) or not in all conditions (dash).

The first column of the tabular RNA matching information contains the names of the reference RNAs. These RNAs are all transcripts, over all conditions, that are not linked to genes.
The second column contains strand information. For every condition, a column is included. If the condition-specific RNA overlaps on the genome with the reference RNA (the left-most column), the condition is stated (here: Control, Cold, Ethanol and NaCl). A dash indicates that the RNA is not detected under the specific condition because the read coverage is below the minimum read coverage (see Part 1 for the parameter settings of Transcriptor). The final column contains the consensus information indicating expression in all experiments (asterisk) or not in all experiments (dash).

Figure 9. RNA annotation viewed in Artemis. Transcriptional analysis using Transcriptor of RNA-Seq data obtained for B. anthracis grown under four different growth conditions: rich medium with no treatment (A: control), cold stress (B: cold), with ethanol (C: Ethanol), and with NaCl (D: NaCl). The read coverage is plotted for the forward and reverse complement strands in the upper two panels. The lower two panels show the annotations of the forward and reverse complement strands.

STUDYING GENE EXPRESSION

If the data was submitted in FASTQ, BAM or SAM format and consists of at least two experiments (each with two or more replicates), Transcriptor determines fold changes (FC) and false-discovery rates (FDR) across the experiments using edgeR (Robinson et al., 2010). The result file is GFF-formatted but can also be viewed in spreadsheet software such as MS Excel (see Fig. 11). This makes it possible to select genes by FC or FDR and view the selected genes in a genome browser.

Figure 10. Gene expression data viewed in MS Excel.

REFERENCES

Carver,T., Harris,S.R., et al. (2012) Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics, 28, 464-469.
Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357-359.
Mao,X., Ma,Q., et al. (2014) DOOR 2.0: presenting operons and their functions through dynamic and integrated views. Nucleic Acids Res., 42, D654-9.
McClure,R., Balasubramanian,D., et al. (2013) Computational analysis of bacterial RNA-Seq data. Nucleic Acids Res., 41, e140.
Passalacqua,K.D., Varadarajan,A., et al. (2012) Strand-specific RNA-seq reveals ordered patterns of sense and antisense transcription in Bacillus anthracis. PLoS One, 7, e43350.
Robinson,M.D., McCarthy,D.J., et al. (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26, 139-140.
Salgado,H., Peralta-Gil,M., et al. (2013) RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more. Nucleic Acids Res., 41, D203-13.
Todt,T.J., Wels,M., et al. (2012) Genome-wide prediction and validation of sigma70 promoters in Lactobacillus plantarum WCFS1. PLoS One, 7, e45097.