Guide - Bacterial genomics

Transcriptor: a web-tool for analysis of genome-wide
transcriptome data of prokaryotes
Summary: Transcriptor is a web-tool for the analysis of RNA-Seq data of prokaryotes. It uses a method
for detecting transcripts that allows pinpointing untranslated regions, operons and small RNAs.
Transcriptor detects, annotates and compares transcripts from different samples and/or experiments.
Results can be downloaded as tab-delimited files and in GFF format for visualization in, for example, Artemis.
Availability: Transcriptor and its guide are accessible from:
http://bamics2.cmbi.ru.nl/websoftware/transcriptor.
Contact: [email protected]
TABLE OF CONTENTS
Introduction
Part 1: Running Transcriptor
    Step 1 – Select reference genome
    Step 2 – Select run mode
    Step 3a – Upload data sets using existing configuration file
    Step 3b – Create configuration file
    Step 4 – Change Transcriptor settings
    Step 5 – Change Bowtie2 settings
    Step 6 – Run phase
Part 2: Output generated by Transcriptor
    Visualisation of results in Artemis
    Annotation of Operons
    Annotation of RNA
    Studying gene expression
References
INTRODUCTION
This guide provides a short overview of the application of Transcriptor to the analysis of an RNA-Seq
dataset. Transcriptor was developed for the annotation and comparison of bacterial transcriptomes
derived from RNA-seq data (see Fig. 1 for its flow-chart). It is designed to allow researchers to easily
pinpoint similarities or differences in multiple genome-wide transcriptomes from different samples /
conditions, accounting for different (prokaryotic) genomic features such as operon structures and small
RNAs. To support this comparison, the tool generates several types of result files.
Figure 1. Transcriptor flow-chart.
As depicted in Figure 1, the transcript annotation starts with the detection of transcripts based on
expression data. Transcripts are annotated using existing genome annotation as reference. Additional
features are inferred from the expression data including untranslated regions (UTRs), transcription start
sites (TSSs), novel small RNAs (sRNA) and operon structures. Annotated transcriptomes are then
compared with each other to generate additional information that allows the user to readily
pinpoint transcriptional differences across different samples and conditions. From the results, the user
can identify conditionally expressed transcripts and operon structures that can be used in downstream
analysis of gene regulation, promoters or regulatory RNAs.
The Transcriptor web-interface is generated by the FG-web framework (http://trac.nbic.nl/fgweb). The
web-interface has three sections: 1) data upload, 2) parameter settings, and 3) display of the
results. The web tool does not require a login and works with major web browsers such as Internet Explorer,
Firefox, Safari and Opera. The results can easily be viewed in genome browsers (such as
Artemis) and further analyzed with standard spreadsheet software.
In this guide, the application of Transcriptor is exemplified for the analysis of samples of single-stranded
RNA-seq data (Passalacqua et al., 2012). The samples are derived from cultures of Bacillus anthracis
grown in rich medium with no treatment, cold stress, with ethanol, and with NaCl.
PART 1: RUNNING TRANSCRIPTOR
Transcriptor is part of the FG-web framework, a bioinformatics workflow system for web-based tools.
To start the software, go to http://bamics2.cmbi.ru.nl/websoftware/transcriptor.
STEP 1 – SELECT REFERENCE GENOME
Select a reference genome from a list of finished genomes or upload your own reference genome file,
which must include the genome sequence and annotation (GenBank format).
STEP 2 – SELECT RUN MODE
On the same page, select one of the following options for the run mode (see Fig. 2):
Figure 2. Select run mode.
There are three ways to provide information about the datasets:
1. Upload existing configuration file. You can create your own configuration file in a text editor or
spreadsheet application and upload this file. This option is preferable if there are many
data sets. See Step 3a (further below) for details on how to create a configuration file in a text
editor or spreadsheet editor (like MS Excel).
2. Create a configuration file. Alternatively, you can enter all data sets one by one via the web
application. In this case, you must enter the number of samples that you want to include in your
analysis.
3. Run in demo mode. This option is provided as a demo of Transcriptor. In this case,
Transcriptor uses four different samples of single-stranded RNA-Seq data of Bacillus anthracis
(Passalacqua et al., 2012). The RNA-Seq data was originally used to study the transcriptional
response of B. anthracis to different growth conditions. For the demo, RNA-seq data (including
replicates) from two conditions (grown in rich medium versus treated with ethanol) are used.
To reduce run-time, the reads were mapped against the first 350,000 bases of the reference
genome of B. anthracis.
Press Next when done.
STEP 3A – UPLOAD DATA SETS USING EXISTING CONFIGURATION FILE
If you have chosen the run mode “upload existing configuration file”, Transcriptor downloads the data
sets specified in the configuration file to the server for further calculations. The configuration file can be
created using a text editor or spreadsheet application.
The configuration file is a tab-delimited text file (see Table 1). Each line contains the information on one
sample. If a line starts with a hash sign (#), the line is ignored. You can use the hash sign to
include comments.
Create a new line for each new sample / data set. For each sample, enter the following information,
separated by tabs:
- In the first and second columns, provide the names of the experiment and the sample. All
  samples with the same experiment name are considered (technical or biological) replicates.
  During the annotation process, Transcriptor uses information retrieved from all replicates to
  generate annotation results for each experiment. Please consider this when deciding which
  samples should be treated as replicates.
- Next, enter the strand information: “forward” (data derived from forward strand), “reverse”
  (data derived from reverse complement strand) or “unstranded” (data derived from both strand
  orientations, e.g. non-strand-specific data). If you want Transcriptor to choose the strand
  orientation of the data, enter “stranded” (strand-specific data).
- Next, enter the sequence format. Various formats are supported: alignments of read data can
  be submitted in the binary version of the standard alignment format (BAM). Alternatively,
  RNA-seq read data can be submitted in the FASTQ format.
- Finally, enter the link (URL) to a file that contains the sample data. Transcriptor uses this link to
  download the sample data. The sample data file (of FASTQ formatted data) may be compressed.
  The following compression formats are supported: gzip and zip.
Table 1. Example of a configuration file.
Experiment   Sample   Strand      Format  # URL
B.anthracis  NaCl     Unstranded  BAM     http://bamics2.cmbi.ru.nl/websoftware/transcriptor/sample1
B.anthracis  Control  Forward     SAM     http://bamics2.cmbi.ru.nl/websoftware/transcriptor/sample2
B.anthracis  Control  Reverse     FASTQ   http://bamics2.cmbi.ru.nl/websoftware/transcriptor/sample3
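A configuration file in this format can also be written and checked with a short script before uploading it. The following Python code is a minimal sketch and not part of Transcriptor: the function names, the example URLs and the sample names are hypothetical, and treating the strand and format keywords as case-insensitive is an assumption.

```python
import csv
import io

# Keywords taken from this guide; case-insensitive matching is an assumption.
STRANDS = {"forward", "reverse", "unstranded", "stranded"}
FORMATS = {"bam", "sam", "fastq"}


def write_config(rows, handle):
    """Write one tab-delimited line per sample."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow(row)


def read_config(handle):
    """Parse a configuration file, skipping comment lines and blank lines."""
    samples = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"expected 5 tab-separated fields: {line!r}")
        experiment, sample, strand, fmt, url = fields
        if strand.lower() not in STRANDS:
            raise ValueError(f"unknown strand value: {strand!r}")
        if fmt.lower() not in FORMATS:
            raise ValueError(f"unknown format: {fmt!r}")
        samples.append({"experiment": experiment, "sample": sample,
                        "strand": strand, "format": fmt, "url": url})
    return samples


# Hypothetical example: a comment line and two replicates of one experiment.
rows = [
    ("# experiment", "sample", "strand", "format", "url"),
    ("Control", "rep1", "forward", "FASTQ", "http://example.org/rep1.fastq.gz"),
    ("Control", "rep2", "forward", "FASTQ", "http://example.org/rep2.fastq.gz"),
]
buffer = io.StringIO()
write_config(rows, buffer)
buffer.seek(0)
samples = read_config(buffer)
print(len(samples), "samples parsed")
```

Both samples share the experiment name "Control", so they would be treated as replicates.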
STEP 3B – CREATE CONFIGURATION FILE
If you have chosen the run mode option to create a configuration file, you proceed with the following
web page (see Fig. 3).
Figure 3. Create config file.
First, choose a name for your experiment. This name is used as a prefix for result files created by
Transcriptor. Next, you provide for each sample the following information (the numbers correspond to
those in Fig. 3):
1. Select in the drop-down box the option to upload a file or to let Transcriptor download the file.
2. In the following text field you provide the link to the sample data.
3. Select in the following drop-down box the compression format. The following formats are
supported: “Uncompressed”, “GZip” for data that is compressed with gzip or “Zip” for data that
is compressed with zip (Note that only PKZIP versions 4.5 or earlier are supported).
4. Select the sample format. The following formats are supported: for read data that needs to be
mapped to a reference genome use FASTQ formatted data. Alternatively, you can use mapped
read data formatted in BAM or SAM.
5. Provide a sample name. This name is used as a prefix for sample-specific output files created by
Transcriptor.
6. Provide the name of the experiment. This name is used as a prefix for experiment-specific
output files. Samples that have the same experiment name are considered (technical or
biological) replicates. During the annotation process, Transcriptor joins information retrieved
from different replicates. Please consider this when you decide which samples should be joined
in this manner.
7. Select the strand information of the sample: “Forward”, “Reverse” or “Unstranded” if the data is
not strand-specific. If you want Transcriptor to choose the strand orientation of the data, enter
“stranded”.
Repeat steps 1 to 7 for each sample.
Press Proceed when done.
STEP 4 – CHANGE TRANSCRIPTOR SETTINGS
After you have uploaded an existing configuration file (Step 3a) or created a new configuration file (Step 3b),
you can change the following general settings before starting the analysis (the numbers correspond to
those in Fig. 4):
Figure 4. Parameter selection.
1. Enter the minimum length of transcripts. Transcriptor detects transcripts based on the
expression profiles. Transcripts that are shorter than the minimum length are not reported.
2. Select whether Transcriptor should annotate transcripts. If no annotation is performed,
Transcriptor only reports the profile derived transcripts with positional information. Otherwise,
transcripts are annotated by determining coding regions (CDS), UTRs and TSSs. For the
annotation, the genome annotation provided by the reference GenBank file is used.
3. Enter the minimum coverage of transcripts. Transcriptor determines if a CDS has a minimum
coverage of reads. If a CDS has less read coverage than the threshold, the reads partially
covering the CDS are considered an additional transcript. This additional transcript is reported
separately by Transcriptor. This procedure allows detecting and reporting regulatory RNAs. The
coverage is calculated by dividing the number of nucleotides of the transcript that overlap with
the CDS by the length of the CDS.
4. Select “Yes” if you want Transcriptor to provide RPKM values for each transcript. This option
requires read data in FASTQ, SAM or BAM format.
5. If you have entered samples from different experiments, Transcriptor can compare transcripts
across different experiments. Additional results are generated that make it easy to pinpoint
transcriptional variations across different experiments. For details see further below.
6. If Transcriptor compares transcripts across different experiments, this threshold is
used as the minimum percentage of nucleotides that must overlap between transcripts. If the
percentage of overlap is less than the threshold, Transcriptor considers these transcripts as two
separate transcripts. The percentage is calculated by dividing the number of overlapping
nucleotides of the two transcripts by the length of the shorter transcript.
7. Transcripts are determined by applying a previously described segmentation algorithm (Todt et
al., 2012). Segments with read coverage above a minimum read count threshold are identified
and adjacent segments are joined. The threshold is obtained from the frequency distribution of
mapped non-overlapping reads. Alternatively, you can enter a value for this threshold.
8. Enter the read length of the RNA-seq data. If you have data with different read lengths, enter the
minimum read length. This option is used by the segmentation algorithm to identify isolated
reads.
9. If you want Transcriptor to choose the strand orientation of the data, select “yes”. Otherwise,
the provided strand information is used.
10. If you want Transcriptor to determine differentially expressed genes across experiments,
select “yes”.
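The two ratios described in items 3 and 6 can be illustrated with a few lines of Python. This is a minimal sketch of the calculations as worded above, not Transcriptor's own code. The function names are hypothetical, and 1-based inclusive coordinates (as used in GenBank and GFF files) are an assumption.

```python
def overlap_length(start_a, end_a, start_b, end_b):
    """Number of nucleotides shared by two intervals (1-based, inclusive)."""
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


def cds_coverage(transcript, cds):
    """Item 3: overlapping nucleotides divided by the length of the CDS."""
    shared = overlap_length(transcript[0], transcript[1], cds[0], cds[1])
    return shared / (cds[1] - cds[0] + 1)


def transcript_overlap_percent(first, second):
    """Item 6: overlapping nucleotides as a percentage of the shorter transcript."""
    shared = overlap_length(first[0], first[1], second[0], second[1])
    shorter = min(first[1] - first[0] + 1, second[1] - second[0] + 1)
    return 100.0 * shared / shorter


# A transcript at 100..500 covers 200 of the 400 nucleotides of a CDS at 301..700.
print(cds_coverage((100, 500), (301, 700)))                  # 0.5
# Transcripts at 101..500 and 401..1000 share 100 nucleotides; the shorter is 400 long.
print(transcript_overlap_percent((101, 500), (401, 1000)))   # 25.0
```

With an overlap threshold above 25%, the two transcripts in the second example would be treated as separate transcripts.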
STEP 5 – CHANGE BOWTIE2 SETTINGS
If you provide read sequence data for your analysis, Transcriptor maps the reads to the selected
reference genome using Bowtie2 (Langmead and Salzberg, 2012). You can change some Bowtie2-specific
parameters (see Fig. 5). For detailed information on Bowtie2 see the documentation of Bowtie2.
Figure 5. Parameter selection.
1. Choose the platform specific Phred quality offset of the read data. Default is 33, which is the
standard offset used by most platforms.
2. Select the Bowtie2 preset alignment option. The default is “fast-local”, which provides good
results in almost all cases (see the documentation of Bowtie2).
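To show what the quality offset means, the sketch below decodes a FASTQ quality string into Phred scores. It is an illustration only: the function names are hypothetical, and the quality string is made up.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(phred_score):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-phred_score / 10)


scores = decode_qualities("II5!")
print(scores)  # [40, 40, 20, 0]
print(error_probability(scores[2]))
```

If the data were encoded with an offset of 64 instead, decoding with 33 would shift every score up by 31, which is why the offset has to match the sequencing platform.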
Press Proceed when done.
STEP 6 – RUN PHASE
The progress of the analysis is shown in the running window (see Fig. 6). The URL of this page can be
bookmarked and revisited later, when the results are ready. The analysis may take between
10 minutes and a couple of hours to complete. The run time mostly depends on the download speed of
the data files and the size of the genome. During the analysis, various statistics are displayed, such as the
number of features in the reference genome file, the number of mapped reads, the number of features
detected in each sample, etc.
If there is a problem with your files, it will most likely be reported here. A link with contact information is
shown when something goes wrong with the run. Please include the URL of the failed analysis on FG-web in
your email to the contact person.
Figure 6. Run diagnostics.
PART 2: OUTPUT GENERATED BY TRANSCRIPTOR
The results of the analysis can be downloaded as a compressed archive file. This file contains different
result files depending on the number of samples and the parameter settings. The following tables
(see Tables 2, 3, 4 and 5) give an overview of the result files. In these examples, the name of the run is
“Demo”. This name is used as a prefix for the names of the result files.
Table 2. Results files on general summaries.
Overview
The following files give an overview of the main results of the analysis.
results.html
    This file can be viewed in a web browser and summarizes the main results of the run.
Demo_anno_sum.txt
    The file contains a table that lists for each sample and experiment the number of features
    detected by Transcriptor. The following features are included: genes, RNA (including small
    RNAs and anti-sense RNAs), untranslated regions (including 5’UTRs and 3’UTRs), and
    transcription start sites (TSS) and termination sites (TES).
Table 3. Results files on operon information.
Operon
The following files contain information on operon structures derived from expression profiles.
Demo_ops_sum.txt
    The file gives a summary of additional transcripts grouped by experiment and strand
    orientation. Three types of transcripts are distinguished:
    - RNA: partially overlaps a CDS in sense orientation
    - asRNA: overlaps a CDS in anti-sense orientation
    - ncRNA: does not overlap with a CDS
Demo_ops_match.txt
    The file contains a tab-separated table of all additional transcripts grouped by experiment
    and strand orientation. The name of a transcript indicates the type of the transcript (RNA,
    asRNA or ncRNA, see above). The table lists for each transcript the experiments in which the
    transcript was detected.
Demo_ops_match.gff
    The file is formatted in GFF and used for visualization in a genome browser (such as
    Artemis). When this file is viewed in Artemis (Carver et al., 2012), genes that are detected in
    all experiments are colored green; the other genes are colored red. Genes that are not
    expressed are colored grey.
Table 4. Results files on RNA information.
RNA
The following files contain information on transcripts that were not linked to genes of the
reference genome annotation.
Demo_rna_sum.txt
    The file gives a summary of additional transcripts grouped by experiment and strand
    orientation. Three types of transcripts are distinguished:
    - RNA: partially overlaps a CDS in sense orientation
    - asRNA: overlaps a CDS in anti-sense orientation
    - ncRNA: does not overlap with a CDS
Demo_rna_match.txt
    The file contains a tab-separated table of all additional transcripts grouped by experiment
    and strand orientation. The name of a transcript indicates the type of the transcript (RNA,
    asRNA or ncRNA, see above). The table lists for each transcript the experiments in which the
    transcript was detected.
Demo_rna_match.gff
    The file is formatted in GFF and used for visualization in a genome browser (such as
    Artemis). When this file is viewed in Artemis, transcripts that are detected in all
    experiments are colored green; the other transcripts are colored red.
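The three transcript types listed in Table 4 can be told apart by comparing a transcript's position and strand with the annotated CDSs. The following Python sketch illustrates that rule under stated assumptions and is not Transcriptor's implementation: the function name and coordinates are hypothetical, and giving a sense overlap precedence over an anti-sense overlap is an assumption.

```python
def classify_transcript(transcript, cds_list):
    """Label a transcript that is not linked to a gene as RNA, asRNA or ncRNA.

    transcript: (start, end, strand); cds_list: list of (start, end, strand).
    """
    t_start, t_end, t_strand = transcript
    label = "ncRNA"  # default: no overlap with any CDS
    for c_start, c_end, c_strand in cds_list:
        if t_start <= c_end and c_start <= t_end:  # the intervals overlap
            if c_strand == t_strand:
                return "RNA"  # assumption: a sense overlap takes precedence
            label = "asRNA"
    return label


# Hypothetical annotation with one CDS on each strand.
cds_list = [(100, 400, "+"), (600, 900, "-")]
print(classify_transcript((350, 450, "+"), cds_list))  # RNA
print(classify_transcript((350, 450, "-"), cds_list))  # asRNA
print(classify_transcript((450, 550, "+"), cds_list))  # ncRNA
```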
Table 5. Results files on differentially expressed genes.
Gene expression
The following files include information on differentially expressed genes.
Demo_FC.txt
    This file (see Fig. 12) contains for each gene annotation information, read count data, fold
    changes (fc) and false-discovery rate values (fdr, using the Benjamini and Hochberg
    algorithm) for the different experiments. The file can be opened in spreadsheet software or in
    a genome browser (such as Artemis).
Demo_FC.pdf
    Fold changes (FC) and false-discovery rate values are determined using edgeR (Robinson et
    al., 2010). The PDF includes plots generated by edgeR. The file includes:
    - a plot of a principal components analysis of log FC values of all samples
    - a plot of the dispersion (for the variance parameter) of the read data
    - a hierarchical clustering of samples by CPM values. CPMs are the read counts per gene
      normalized by the sample’s library size (total counts) in million counts.
Demo_FC.jpeg
    This plot is the same hierarchical clustering of CPM values included in the PDF file.
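The CPM normalization mentioned in Table 5 can be written out as a small calculation. The following Python code is a sketch of that definition only. The read counts are made up, and edgeR may additionally apply normalization factors to the library size, which this sketch leaves out.

```python
def counts_per_million(gene_counts):
    """Scale the raw read counts of one sample by its library size."""
    library_size = sum(gene_counts.values())
    if library_size == 0:
        raise ValueError("sample has no counted reads")
    return {gene: count * 1_000_000 / library_size
            for gene, count in gene_counts.items()}


# Made-up read counts for three genes of one sample.
sample = {"GBAA_0034": 150, "GBAA_0035": 50, "GBAA_0036": 300}
cpm = counts_per_million(sample)
for gene, value in cpm.items():
    print(gene, value)
```

Because every sample is scaled to the same total, CPM values of samples with different sequencing depths become comparable, which is what the hierarchical clustering relies on.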
VISUALISATION OF RESULTS IN ARTEMIS
The results include various GFF-formatted result files that can be used for visualization in a genome
browser (such as Artemis (Carver et al., 2012)). The GFF-formatted files include annotation results for all
experiments (see Fig. 7). File names are generated by concatenating the name of the run (such as
“Demo”) with the names of the experiments. In addition, GFF-formatted files are generated providing
matching information between experiments for operons and RNAs.
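Besides viewing them in Artemis, the GFF files can be read with a short script, because GFF is a tab-delimited format with nine columns per feature. The following Python code is a minimal sketch: the function name and the example lines are hypothetical, and the content of the attributes column in Transcriptor's files is not described in this guide, so it is kept as raw text.

```python
import io


def read_gff(handle):
    """Return the feature lines of a GFF file as a list of dictionaries."""
    features = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an optional sequence section follows; no more features
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # skip lines that do not have the nine GFF columns
        features.append({
            "seqid": columns[0],
            "source": columns[1],
            "type": columns[2],
            "start": int(columns[3]),
            "end": int(columns[4]),
            "strand": columns[6],
            "attributes": columns[8],
        })
    return features


# Hypothetical content; real files come from the Transcriptor result archive.
example = io.StringIO(
    "##gff-version 3\n"
    "chr1\texample\tCDS\t100\t400\t.\t+\t0\tID=gene_a\n"
    "chr1\texample\tncRNA\t900\t1000\t.\t-\t.\tID=rna_b\n"
)
features = read_gff(example)
for feature in features:
    print(feature["type"], feature["start"], feature["end"], feature["strand"])
```

To read a downloaded file, pass an opened file handle instead of the `io.StringIO` object.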
Figure 7 shows the visualization in Artemis using the GFF formatted annotation files produced by
Transcriptor. Here, results of the transcriptional analysis using Transcriptor of RNA-Seq data obtained for
B. anthracis grown under four different growth conditions are shown: grown in rich medium with
no treatment (A: control), cold stress (B: stress), with ethanol (C: EtOH), and with NaCl (D: NaCl).
In Figure 7, the read coverage is plotted for the forward and reverse
complement strands in the upper two panels. In the lower two panels the annotations of the forward
and reverse complement strands are shown.
Figure 7. Visualization of transcript annotation viewed in Artemis. Transcriptional analysis using Transcriptor of
RNA-Seq data obtained for B. anthracis grown under four different growth conditions: grown in rich medium with
no treatment (A: control), cold stress (B: stress), with ethanol (C: EtOH), and with NaCl (D: NaCl). The read coverage
is plotted for the forward and reverse complement strands in the upper two panels. In the lower two panels the
annotations of the forward and reverse complement strands are shown.
Colors highlight different genome features. CDSs retrieved from the genome annotation of the reference
GenBank file are highlighted in blue. However, if the number of reads covering a CDS is less than the
minimum read coverage threshold parameter, the CDS is colored grey. UTR-related regions are colored
brown. All additional transcripts are colored orange. GFF files on operon matching and RNA matching
use green for transcripts found in all experiments and red in all other cases. Tools like Artemis
allow many annotation files to be viewed simultaneously by using different tracks in the visualization.
ANNOTATION OF OPERONS
Transcriptor identifies putative operon structures ab initio, based on transcripts derived from the
expression profiles. A transcript covering two or more annotated genes is regarded as an operon. Operons
are determined for different experiments and are compared to each other. The matching results (see
Table 3) are stored in a tab-delimited file that can be viewed in a spreadsheet application
(see Fig. 8) and in a GFF file for visualization (see Fig. 9).
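The rule that a transcript covering two or more annotated genes is regarded as an operon can be sketched in Python as follows. This is an illustration, not Transcriptor's implementation. It assumes that "covering" means the gene lies entirely within the transcript on the same strand, and that the operon is named after its first gene in 5' to 3' direction with the prefix "op_". The coordinates are made up.

```python
def find_operons(transcripts, genes):
    """Return {operon name: [gene names]} for transcripts spanning >= 2 genes.

    transcripts: list of (start, end, strand)
    genes: list of (name, start, end, strand)
    """
    operons = {}
    for t_start, t_end, t_strand in transcripts:
        covered = [
            gene for gene in genes
            if gene[3] == t_strand and t_start <= gene[1] and gene[2] <= t_end
        ]
        if len(covered) < 2:
            continue  # a single gene is a transcription unit, not an operon
        # Order genes 5' -> 3': ascending on "+", descending on "-".
        covered.sort(key=lambda gene: gene[1], reverse=(t_strand == "-"))
        operons["op_" + covered[0][0]] = [gene[0] for gene in covered]
    return operons


# Gene names are taken from the example in this guide; coordinates are made up.
genes = [("GBAA_0038", 1000, 1500, "+"), ("GBAA_0039", 1550, 2200, "+")]
print(find_operons([(950, 2250, "+")], genes))  # both genes in one operon
print(find_operons([(950, 1520, "+")], genes))  # only one gene covered: {}
```

The second call mirrors the situation in which a weakly expressed downstream gene is not covered by the detected transcript, so no operon is reported for that experiment.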
The following example illustrates how matching information and visualisation results can be used to
pinpoint variations in expression patterns of transcription units across different experiments.
Figure 8. Matching operon information. Operon naming is according to the first gene (5’ -> 3’)
in the operon. The GBAA_0034 to GBAA_0036 genes are expressed as a single transcriptional unit
under all conditions. Genes GBAA_0038 and GBAA_0039 are expressed as an operon (op_GBAA_0038)
under cold and ethanol stress.
The file with matching results is shown in Figure 8. The first column lists the names of the genes. The
second column gives the strand information. For every experiment, a column is included indicating to
which operon each gene is assigned. A dash indicates that, for a given gene in a given experiment, no
or not enough read data was available. Usually, this means that the gene is not expressed. The last
column in the table contains the name of the operon if it is found in all experiments. Otherwise a dash is
inserted. Transcription units and operons are assigned by determining transcripts derived from
expression profiles. A transcription unit or operon is named after the first gene on the
reference genome sequence.
Figure 9. Annotation viewed in Artemis. Transcriptional analysis using Transcriptor of RNA-Seq data
obtained for B. anthracis grown under four different growth conditions: grown in rich medium with no
treatment (A: control), cold stress (B: stress), with ethanol (C: EtOH), and with NaCl (D: NaCl). The read
coverage is plotted for the forward and reverse complement strands in the upper two panels. In the lower two
panels the annotations of the forward and reverse complement strands are shown. The two genes GBAA_0038
and GBAA_0039 (see boxed region) probably form an operon. The expression of GBAA_0039 is low (grey
coloured genes) in two experiments (Control and NaCl), resulting in the different operon detection results.
In Figure 8, operon op_GBAA_0038, starting with gene GBAA_0038, was determined in all four experiments.
In two experiments (Cold and Ethanol), operon op_GBAA_0038 consists of GBAA_0038 and GBAA_0039.
In the other two experiments (Control and NaCl), op_GBAA_0038 consists only of GBAA_0038, and
GBAA_0039 is not expressed. Therefore, in the last column of the table, a match is indicated by the
name of the transcription unit only for gene GBAA_0038. In Figure 9, the annotation of this region
shows that GBAA_0038 and GBAA_0039 probably form an operon. The expression of GBAA_0039
is low in two experiments (Control and NaCl), resulting in the different operon detection results.
ANNOTATION OF RNA
Transcriptor provides comparison information on transcripts that were not linked to existing gene
annotation information (e.g. novel transcripts or transcripts partly overlapping genes). These transcripts
are grouped into transcripts partly overlapping genes (RNA), anti-sense RNA (asRNA) and non-coding
RNA (ncRNA). Transcriptor also provides summary and matching information on grouped transcripts.
This information makes it easy to pinpoint differences in ncRNAs across different experiments. The
regions of interest can be visualized by using the annotation information provided by Transcriptor. The
following example illustrates how matching information and visualisation results can be used to pinpoint
variations in expression patterns of transcription units across different experiments (or conditions).
Using the tabular RNA matching information created by Transcriptor, we selected two non-coding RNAs,
ncRNA_17 and ncRNA_20, that are expressed under all four growth conditions (Cold, Control, Ethanol
and NaCl; see Figure 10). The expression profiles (read coverage profiles) and annotation (viewed in
Artemis) are shown in Figure 11.
Figure 10. Results on RNA matching viewed in MS Excel. Transcriptional analysis using
Transcriptor of B. anthracis grown under four different growth conditions. Here, only a part of
the original table is shown. First column: name of the reference RNAs. These RNAs are all
transcripts over all conditions not linked to genes. Second column: strand information. For every
condition, a column is included. If the condition-specific RNA overlaps on the genome with the
reference RNA (the left-most column), the condition is stated (here: Cold, Control, Ethanol and
NaCl). A dash indicates that under the specific condition the RNA is not detected because the
read coverage is below the minimum read coverage of 80%. The final column is the consensus
information indicating expression in all (by an asterisk) or not in all (dash).
The first column in the tabular RNA matching information contains the names of the reference RNAs.
These RNAs are all transcripts over all conditions not linked to genes. The second column contains
strand information. For every condition, a column is included. If the condition-specific RNA overlaps on
the genome with the reference RNA (the left-most column), the condition is stated (here: Control, Cold,
Ethanol and NaCl). A dash indicates that under the specific condition the RNA is not detected because
the read coverage is below the minimum read coverage (see Part 1 for parameter settings of
Transcriptor). The final column contains the consensus information indicating expression in all (by an
asterisk) or not in all (dash) experiments.
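The consensus column follows directly from the per-condition columns. The Python sketch below shows the rule as described above. The in-memory table layout and the row "ncRNA_example" are hypothetical, and the actual column headers of the result file may differ.

```python
CONDITIONS = ["Cold", "Control", "Ethanol", "NaCl"]


def consensus(row, conditions=CONDITIONS):
    """Return '*' if the RNA is detected under every condition, else '-'."""
    detected = [row.get(condition, "-") != "-" for condition in conditions]
    return "*" if all(detected) else "-"


# ncRNA_17 is reported in this guide as detected under all four conditions;
# "ncRNA_example" is a made-up row without a match under NaCl.
table = {
    "ncRNA_17": {"Cold": "Cold", "Control": "Control",
                 "Ethanol": "Ethanol", "NaCl": "NaCl"},
    "ncRNA_example": {"Cold": "Cold", "Control": "Control",
                      "Ethanol": "Ethanol", "NaCl": "-"},
}
for name, row in table.items():
    print(name, consensus(row))
```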
Figure 11. RNA annotation viewed in Artemis. Transcriptional analysis was done using
Transcriptor of RNA-Seq data obtained for B. anthracis grown under four different conditions.
Four different growth conditions were applied: rich medium with no treatment (A: control), cold
stress (B: cold), with ethanol (C: Ethanol), and with NaCl (D: NaCl). The read coverage is plotted
for the forward and reverse complement strands in the upper two panels. In the lower two
panels the annotations of the forward and reverse complement strands are shown.
STUDYING GENE EXPRESSION
If the data was submitted in FASTQ, BAM or SAM format and consists of at least two experiments (with
two or more replicates), Transcriptor determines fold changes (FC) and false-discovery rates (FDR) across
the experiments using edgeR (Robinson et al., 2010). The result file is GFF-formatted but can also be
viewed in spreadsheet software such as MS Excel (see Fig. 12). This makes it possible to select genes
by FC or FDR and view the selected genes in a genome browser.
Figure 12. Gene expression data viewed in MS Excel.
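The FDR values reported by Transcriptor come from edgeR. To illustrate what the Benjamini and Hochberg adjustment does, the following Python code is a minimal sketch of that procedure applied to made-up p-values. It is not edgeR's implementation, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    total = len(pvalues)
    order = sorted(range(total), key=lambda index: pvalues[index])
    adjusted = [0.0] * total
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(total, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * total / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up p-values for four genes.
pvalues = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(pvalues)
selected = [index for index, value in enumerate(fdr) if value < 0.03]
print(fdr)
print(selected)
```

Selecting genes with an FDR below a chosen cut-off, as in the last lines, corresponds to filtering the FDR column of the result file in a spreadsheet.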
REFERENCES
Carver,T., Harris,S.R., et al. (2012) Artemis: an integrated platform for visualization and analysis of high-throughput
sequence-based experimental data. Bioinformatics, 28, 464-469.
Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat.Methods, 9, 357-359.
Mao,X., Ma,Q., et al. (2014) DOOR 2.0: presenting operons and their functions through dynamic and integrated
views. Nucleic Acids Res., 42, D654-9.
McClure,R., Balasubramanian,D., et al. (2013) Computational analysis of bacterial RNA-Seq data. Nucleic Acids Res.,
41, e140.
Passalacqua,K.D., Varadarajan,A., et al. (2012) Strand-specific RNA-seq reveals ordered patterns of sense and
antisense transcription in Bacillus anthracis. PLoS One, 7, e43350.
Robinson,M.D., McCarthy,D.J., et al. (2010) edgeR: a Bioconductor package for differential expression analysis of
digital gene expression data. Bioinformatics, 26, 139-140.
Salgado,H., Peralta-Gil,M., et al. (2013) RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory
phrases, cross-validated gold standards and more. Nucleic Acids Res., 41, D203-13.
Todt,T.J., Wels,M., et al. (2012) Genome-wide prediction and validation of sigma70 promoters in Lactobacillus
plantarum WCFS1. PLoS One, 7, e45097.