Examining evolution in bacteria: Mycobacterium tuberculosis
Tuberculosis (TB) is a major global health problem and the second leading cause of
death from an infectious disease, after HIV[1]. It is an ancient disease that has
plagued people for thousands of years, and our relationship with its causative agent,
Mycobacterium tuberculosis, is so intimate that the pathogen and the host appear to
have coevolved[2, 3]. The M. tuberculosis bacterium is spread through the inhalation
of infectious aerosols, and newly infected individuals progress to one of two states: a
symptomatic and potentially infectious state, known as active TB, or an
asymptomatic, noninfectious state called latent TB infection (LTBI). Approximately
one-third of the global human population has LTBI[1], and although LTBI does not
manifest with any clinical symptoms, it comes with the risk of developing into active
TB disease. In 2013, an estimated 9 million people developed TB and 1.5 million
died from the disease[4]. Despite evidence that TB is slowly declining, the
emergence and spread of multidrug-resistant strains of Mycobacterium tuberculosis
(MDR-TB) represent a major challenge to the global control of the disease[4].
Looking at how bacteria respond to antibiotics is a good way to study evolution.
Many studies have found that drug resistance can be associated with a single
nucleotide change that results in a different amino acid being encoded at one
position. A simple change like this can have drastic consequences for a human patient.
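To make the idea of a single nucleotide change concrete, here is a minimal Python sketch that lists every one-base substitution that turns a codon for one amino acid into a codon for another. It assumes the standard genetic code (the sense codons are the same in the bacterial translation table), and the alanine-to-valine pair is used purely as an illustration. The function name is my own and is not part of PATRIC or TBDream.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_nucleotide_changes(aa_from, aa_to):
    """Return (codon_from, codon_to) pairs that differ by exactly one base
    and change the encoded amino acid from aa_from to aa_to."""
    changes = []
    for codon_from, residue in CODON_TABLE.items():
        if residue != aa_from:
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon_from[pos]:
                    continue
                codon_to = codon_from[:pos] + base + codon_from[pos + 1:]
                if CODON_TABLE[codon_to] == aa_to:
                    changes.append((codon_from, codon_to))
    return changes


# Illustration only: alanine (A) to valine (V).
for codon_from, codon_to in single_nucleotide_changes("A", "V"):
    print(codon_from, "->", codon_to)
```

Every pair printed differs at one base only, which is the point of the paragraph above: one substitution in the DNA is enough to swap one amino acid for another.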
We will examine genes that are associated with antibiotic resistance in
Mycobacterium tuberculosis. The list of genes comes from the Tuberculosis Drug
Resistance Mutation Database (TBDream, https://tbdreamdb.ki.se/Info/)[5] and is
provided in the text file that I have also sent.
The strategy of this exercise is first to locate the genes that have been associated
with antibiotic resistance across different strains of M. tuberculosis, to see if the
strains all have the same genes. The next step is to find the single amino acid changes
that result in antibiotic resistance. Something as small as this can have devastating
consequences for a human infected with M. tuberculosis, and can in fact be a matter of
life or death.
Creating a group of genomes
1. Enter the word “Belarus” into the global search box at PATRIC and click the
search icon.
2. This will return 145 genomes in the Search Results page. These genomes all
are associated with the word “Belarus,” either in the comment section
provided for each genome, or in the isolation country metadata.
3. To create the genome group, click on the number 146 in the parenthesis.
4. This will return a table that lists the first 20 of 146 genomes.
5. To create a group with all 146 of the genomes, you will need to resize the
table. Scroll down to the bottom of the table and change the number in the
text box from 20 to 146, then hit return.
6. To create the genome group, scroll up to the top of the table and click in the
checkbox next to the text that says “Select all (145) displayed genome(s)”.
7. Above that button, click on the Workspace Folder that says “Add Genome(s)”.
8. This will generate a pop-up window that asks you to log in. You must log in
to continue.
9. You will be asked if you want to name the group.
10. Clicking on the down arrow next to “None” gives you the option of creating a
new group.
11. You can name the group by typing the name you want (I used Belarus) in the
text box. Click on the “Save to Workspace” button at the bottom of the pop-up
window, and this will save the 145 genomes into your workspace.
Finding the Protein Families that include the TBDream genes
1. The Rv locus tags belong to the Mycobacterium tuberculosis H37Rv genome
that was sequenced by the Wellcome Trust Sanger Institute in 1998. Currently,
8 different laboratories have sequenced this same strain, and all of their genomes
are in PATRIC. We need to find the original genome that will map to the locus
tags from TBDream in PATRIC. In the global search, enter the text H37Rv
and Sanger.
2. On the landing page that returns, click on the Mycobacterium tuberculosis
H37Rv name.
3. This will take you to the landing page for that genome.
4. On the left hand side under the tabs are the Search Tools. Click on Protein
Family Sorter, which is the fourth one down.
5. This will take you to the landing page for that tool, which is preselected to
search the Wellcome Sanger H37Rv genome.
6. Enter all 39 of the TBDream locus tags into the text box. To get exact
matches, each locus tag must be enclosed in quotation marks (e.g., “Rv0006”).
Click the Search button.
7. This will return a protein family table that includes a lot of information about
the genes that were entered. A filter is on the left side, but won’t be used in
this example.
8. We need the protein family identifiers that are found in the first column and
begin with FIG. In the toolbar immediately above the table, find the
Download box and click on the down arrow next to it to choose the
download format for the data.
9. To get just the information in this table, select Excel file (.xlsx). This will give
you just the summary information for the protein families. If you want all the
information across all the 8 M. tuberculosis H37Rv genomes, including all the
locus tags across those genomes that belong to each family, select Family
Details: Excel file (.xlsx).
10. From the Excel file, you need to select the column that has all the protein
family ids. Each of these ids needs to be enclosed in quotation marks (e.g.,
“FIG00000080”) to find an exact match in the next step.
11. There were 39 Rv locus tags that were originally used to find families, but
only 37 protein families were returned. Using the original list, plus the
protein family list with all the locus tags (see step 9 above, Family Details:
Excel file (.xlsx)), you can highlight duplicate values (the locus tags) in Excel
and see that protein families for two genes were not returned. These are
Rv2427A and Rv3126. To get the protein families, individually enter each of
these genes in the global search box. To get an exact match, put the search
term in quotation marks.
12. When you enter “Rv3126” into the global search, the results page shows two
possible hits. You will need to add the protein family id (“FIG00823168”) for
this family to your list.
13. The last locus tag (Rv2427A) returns no results at PATRIC. The “A” next to
the locus number is non-standard. If you remove the “A” and enter “Rv2427”
in global search, four results are returned.
14. Looking at the original list from TBDream, Rv2427A had a gene name of
oxyR. On the search results page, the first hit is to Rv2427Ac, which is
identified as a pseudogene. This is why this gene does not have a protein
family in PATRIC.
15. If you’re curious about Rv2427A, you can click on the hyperlink for the first
hit (2725571..2726087 oxyR') and this will take you to the landing page for
that feature.
16. If you click on the genome browser tab, you can see how the RefSeq
annotation compares to the PATRIC annotation.
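Two of the steps above are easy to do by hand for a few identifiers but tedious for 39: wrapping every identifier in quotation marks, and working out which locus tags did not come back in the downloaded family table. The Python sketch below shows one way to do both. The function names are my own, the short lists are an illustrative subset rather than the full TBDream list, and joining the quoted terms with spaces is an assumption about what the search box accepts. Note that the code uses plain straight quotes.

```python
def quote_terms(terms):
    """Wrap each identifier in straight double quotes and join with spaces."""
    return " ".join(f'"{term}"' for term in terms)


def missing_locus_tags(requested, returned):
    """Return the requested locus tags that are absent from the returned
    list, keeping the order of the original request."""
    returned_set = set(returned)
    return [tag for tag in requested if tag not in returned_set]


# Illustrative subset only; the real input is the full TBDream list.
requested = ["Rv0005", "Rv0006", "Rv3126", "Rv2427A"]
# Locus tags read from the Family Details download (hypothetical here).
returned = ["Rv0006", "Rv0005"]

print(quote_terms(requested))
print(missing_locus_tags(requested, returned))
```

The second function does the same job as highlighting duplicate values in Excel: anything in the original list that never shows up in the download is a gene whose protein family still has to be found by hand.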
Do the Belarus genomes have all of the protein families that contain the TBDream genes?
1. To look for presence or absence of the protein families within a genome
group that you have created, click on the Tools tab and under Comparative
Genomics, select the Protein Family Sorter tool.
2. This will take you to the landing page for that tool. If you are a registered
user and are logged in, you will see groups you have previously created and
saved in the Select Groups box.
3. Scroll down and select the group you want to examine. In this case, I selected
the “Belarus Mycobacterium” group.
4. In the text box on the Protein Family Sorter tool page, enter all the protein
family ids, each one inside quotation marks. This will give you an exact
match. Then click the Search button.
5. This will return the results page for the tool. Thirty-seven families were
found. The protein family table shows a lot of information about these
families, including which families contain more (or fewer) proteins than the 144
genomes examined.
6. To see presence and absence of genes within these protein families across
the selected genomes, click on the Heatmap tab.
7. This will take you to the heatmap view, where absence (black cells) and
presence (yellow, mustard and orange cells) can be seen across all 144
genomes. The genomes are on the y-axis, and the 37 protein families on the
x-axis.
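The heatmap is essentially a table of counts: for each genome and each protein family, how many proteins from that genome belong to that family. The Python sketch below builds such a table from a list of (genome, family) pairs and then lists the families that are absent from each genome. The genome names and the rows are made up for illustration, and the function names are my own. A count of 0 corresponds to a black cell, and any count above 0 to one of the presence colours.

```python
from collections import Counter


def family_counts(rows, genomes, families):
    """Build a genome-by-family table of protein counts from
    (genome, family) pairs, one pair per protein."""
    counts = Counter(rows)
    return {g: [counts[(g, f)] for f in families] for g in genomes}


def absent_families(matrix, families):
    """For each genome, list the families with a count of zero."""
    return {
        genome: [f for f, n in zip(families, row) if n == 0]
        for genome, row in matrix.items()
    }


# Hypothetical genomes; the family ids are two of those used in this exercise.
genomes = ["genome_A", "genome_B"]
families = ["FIG00000080", "FIG00823168"]
rows = [
    ("genome_A", "FIG00000080"),
    ("genome_A", "FIG00823168"),
    ("genome_A", "FIG00823168"),  # a second protein in the same family
    ("genome_B", "FIG00000080"),
]

matrix = family_counts(rows, genomes, families)
print(matrix)
print(absent_families(matrix, families))
```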
Examining an alignment to look for amino acid changes associated with antibiotic resistance
1. The data at TBDream describe a specific amino acid change in the Rv0006 gene,
at position 90, that is associated with a change in antibiotic resistance: a change
from an alanine (A) to a valine (V) at that location. We can look across the
Belarus genome group to see how many of the 144 genomes have a valine at
position 90. Rv0006
is DNA gyrase subunit A, and reading the names of the genes across the top of
the heatmap shows that the first column contains DNA gyrase subunit A.
Clicking on the name in the column will select all the genes within the
column.
2. This generates a pop-up window that gives the user choices on what they
want to do with the selected data. Click the Show Proteins button at the
bottom of the pop-up window.
3. This will open a new window that shows the genes found across the 144
Belarus genomes.
4. Resize the table so that you can see all the genes by entering 144 and hitting
return.
5. To generate a multiple sequence alignment, first select the checkbox at the
top of the first column, Genome Name. This will select all of the genes.
6. Next, in the toolbar heading for the table, go to the Tools section and click on
the Multiple Sequence Alignment icon.
7. This will open up the Protein Alignment page that has a gene tree for the
selected proteins on the left hand side, and the actual alignment on the right
hand side. You can scroll along the alignment using the slider at the bottom
of the page.
8. To see the alignment in a more traditional manner, click on the Printable
Alignment button that is above the alignment on the right hand side of the
page.
9. This will open up a window that has the alignment.
10. Scroll down the alignment, find position 90, and then continue scrolling
down. Towards the end of the 144 genomes you can see that 25 of the
Belarus genomes have a valine in that specific position.
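Counting residues by eye works, but gaps in the alignment mean that position 90 of the protein is not necessarily column 90 of the alignment. The Python sketch below shows the bookkeeping on a toy alignment: it finds the alignment column that corresponds to a given residue position in a chosen reference sequence, then tallies what every sequence has in that column. The sequences, names and function names are all made up for illustration. For real data you would load the alignment downloaded from PATRIC into the same kind of dictionary.

```python
from collections import Counter


def alignment_column(aligned_reference, position):
    """Return the 0-based alignment column that holds the residue at the
    given 1-based position of the (ungapped) reference sequence."""
    seen = 0
    for column, char in enumerate(aligned_reference):
        if char != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError("position is beyond the end of the reference sequence")


def residues_at(alignment, reference_id, position):
    """Count the residues every sequence has in the column that matches
    the given position of the reference sequence."""
    column = alignment_column(alignment[reference_id], position)
    return Counter(seq[column] for seq in alignment.values())


# Toy alignment; "-" marks a gap. All rows must have the same length.
alignment = {
    "ref": "MT-DAL",
    "g1": "MTKDVL",
    "g2": "MT-DAL",
    "g3": "M--DAL",
}

# Residue 4 of "ref" is A, which sits in alignment column 4 (0-based).
print(residues_at(alignment, "ref", 4))
```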
How broadly are these protein families shared across all the Mycobacterium tuberculosis genomes?
1. Click on the Organism tab and then click on Mycobacterium.
2. This will take you to the Mycobacterium landing page.
3. Click on the Taxonomy tab.
4. Scroll down the list until you reach the Mycobacterium tuberculosis complex.
5. Open the folder for the Mycobacterium tuberculosis complex, find
Mycobacterium tuberculosis, and click on the first icon on the left hand side.
6. This takes you to the landing page for M. tuberculosis. There are 1952 M.
tuberculosis genomes currently available in PATRIC.
7. On the left hand side under the tabs are the Search Tools. Click on Protein
Family Sorter, which is the fourth one down.
8. This will take you to the landing page for that tool, which is preselected to
search across all 1952 M. tuberculosis genomes.
9. In the text box on the Protein Family Sorter tool page, enter all the protein
family ids, each one inside quotation marks. This will give you an exact
match. Then click the Search button.
10. This will return the results page for the tool. Thirty-eight families were
found. The protein family table shows a lot of information about these
families, including which families contain more (or fewer) proteins than the
1952 genomes examined. The filter on the left hand side does not have a list
of genomes, as that filter can only display 500 genomes.
11. To see presence and absence of genes within these protein families across
the selected genomes, click on the Heatmap tab.
12. This will take you to the heatmap view, where absence (black cells) and
presence (yellow, mustard and orange cells) can be seen across all 1833
genomes. The genomes are on the y-axis, and the 38 protein families on the
x-axis.
Assignment: Answer the following questions using the PATRIC website.
Rv0005 encodes DNA gyrase subunit B (EC 5.99.1.3).
a. What protein family does this gene belong to (FIG….?)?
b. In how many of the Belarus genomes is this gene 675 aa long?
c. A change in amino acid sequence from an Arginine (Arg, R) to a
Cysteine (Cys, C) around position 457 has resulted in resistance to
the drug ofloxacin. How many of the Belarus genomes have the
“C” in that approximate position that might make them resistant to
this antibiotic? You will find that some of the proteins have
different lengths, differing by as much as 20 aa. You will have to take
this into account while looking for this specific change.
d. In this same example, scientists have found antibiotic resistance
associated with a change from Serine (Ser, S) to Phenylalanine (Phe, F)
at position 458. How many of the Belarus genomes have this
specific change?
e. A study in Thailand showed that a change from Aspartic Acid (Asp,
D) to Asparagine (Asn, N) is associated with resistance to multiple
drugs. How many of the Belarus genomes have this specific
change?
f. Finally, if resistance to any antibiotic depended on not having any
of the three changes mentioned above, how many of the Belarus
genomes could be said to be susceptible?
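For question f, a genome may carry more than one of the three changes, so the three counts cannot simply be subtracted from the total. One way to keep the bookkeeping straight is with sets, as in this minimal Python sketch. The genome names are placeholders and the function name is my own. You would fill in the sets from your own answers to questions c, d and e.

```python
def susceptible_genomes(all_genomes, *resistant_sets):
    """Return the genomes that appear in none of the resistant sets."""
    resistant = set().union(*resistant_sets)
    return sorted(set(all_genomes) - resistant)


# Placeholder names; replace with the genomes you identified.
all_genomes = ["g1", "g2", "g3", "g4", "g5"]
has_change_c = {"g1", "g2"}
has_change_d = {"g2"}  # g2 carries two of the changes
has_change_e = {"g4"}

print(susceptible_genomes(all_genomes, has_change_c, has_change_d, has_change_e))
```

Because g2 is in two sets, adding up the three set sizes would count it twice. The set union counts it once.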
Bonus Question.
Scientists have found that resistance to ethambutol has been associated with
a single amino acid change in tuberculosis genomes isolated from India. This
change, found in Rv3797, was found at position 270 and involved a change
from Isoleucine (Ile, I) to Threonine (Thr, T).
A. How many genomes from India show a change from I > T?
B. How many genomes from India have an Isoleucine in that approximate
position?
C. Considering when these specific genomes were collected, what is the
earliest time that this particular mutation (T instead of I in the position
around 270) could have first appeared in India?
D. What is the specific nucleotide change that is responsible for this change?
(Hint: Use the genome browser).
References
1. Wlodarska, M., et al., A microbiological revolution meets an ancient disease:
improving the management of tuberculosis with genomics. Clin Microbiol Rev,
2015. 28(2): p. 523-39.
2. Brosch, R., et al., A new evolutionary scenario for the Mycobacterium
tuberculosis complex. Proc Natl Acad Sci U S A, 2002. 99(6): p. 3684-9.
3. Bos, K.I., et al., Pre-Columbian mycobacterial genomes reveal seals as a source
of New World human tuberculosis. Nature, 2014. 514(7523): p. 494-7.
4. Fonseca, J.D., G.M. Knight, and T.D. McHugh, The complex evolution of
antibiotic resistance in Mycobacterium tuberculosis. Int J Infect Dis, 2015. 32:
p. 94-100.
5. Sandgren, A., et al., Tuberculosis drug resistance mutation database. PLoS
Med, 2009. 6(2): p. e2.