C341: Introduction to Bioinformatics Sample Solution for Coursework - Spring 2013

I MPERIAL C OLLEGE L ONDON , D EPARMENT OF C OMPUTING
C341: Introduction to Bioinformatics
Sample Solution for Coursework - Spring 2013
Q1 - SEQUENCE SIMILARITY / CLUSTERING /
COMPUTATIONAL COMPLEXITY (30 MARKS)
1. (10 marks) Needleman-Wunsch algorithm is a dynamic programming algorithm that
identifies the optimal global alignment among two sequences. The Needleman-Wunsch
algorithm may introduce three different types of gaps in the alignment:
• Opening gap: The first occurrence of a gap in the alignment.
• Extension gap: The consecutive gaps that follows an opening gap in the alignment.
• Begin/End gap: The gaps that are introduced at the beginning/end of the alignment.
1
V
2
A
3
A
A
4
V
V
5
F
6
F
7
A
8
L
L
9
I
10
F
-
11
V
-
For the above alignment, positions {1, 2, 10, 11} show begin/end gaps, positions {5, 9}
show opening gaps, and positions {6, 7} show extension gaps as they follow the opening
gap in position 5.
The three gap types are sometimes penalized different from each other. Explain which
modifications you need to make in the Needleman-Wunsch algorithm in order to compute the optimal alignment correctly when the gap penalties are different for the gap
types. Do these modifications change the computational complexity of the algorithm?
1
A NSWER : The Needleman-Wunsch algorithm can be used in the cases of different
opening gap, extension gap, and begin/end gap penalties by adding some extra check
conditions for filling the cost matrix, C . The modifications that are required for the
correct computation are:
• Begin Gap: When filling the cost matrix, the first column and row of C should be
penalized with the begin gap penalty, P beg i n .
• Opening/Extension Gap: When computing the gap penalty of a cell C (i + 1, j )
or C (i , j + 1), the move taken for reaching C (i , j ) needs to be considered. If we
are considering the move C (i + 1, j ) and the previous move for reaching C (i , j )
departs from C (i − 1, j ), then we add a gap extension penalty. Similarly, if we are
considering the move C (i , j +1) and the previous move for reaching C (i , j ) departs
from C (i , j − 1), then this is also penalized with a gap extension penalty. Other
cases are penalized with a gap open penalty.
• End Gap: The number of aminoacids that are aligned in the alignment until reaching cell C (i , j ) can be obtained from the i and j values. If i is equal to the length
of the horizontal sequence, then adding a gap to reach C (i , j +1) should be penalized with an end gap penalty. Similarly, if j is equal to the length of the vertical
sequence, then adding a gap to reach C (i + 1, j ) should be penalized with an end
gap penalty.
These changes to not cause any increase in the computational complexity of the algorithm.
2. (15 marks) K-medoids clustering algorithm is a variation of the K-means clustering algorithm, where the sample having the minimum distance to all other samples in the
cluster is chosen as the cluster center during the iterations 1 . For the microarray expression values for 10 genes that are provided in Table 1, compute the following:
Gene Name
Gene 1
Gene 2
Gene 3
Gene 4
Gene 5
Gene 6
Gene 7
Gene 8
Gene 9
Gene 10
Exp.1
1.1
2.5
2.45
1.8
2.3
4.8
1.9
1.2
0.7
4.9
Exp.2
1.9
3.8
2.4
4.8
3.7
0.8
1.1
0.6
0.7
0.4
True Cluster
C1
C2
C3
C2
C2
C3
C1
C1
C1
C3
Table 1: Microarray Expression Values
1 K-medoids:
Kmedoids.html
http://www.math.le.ac.uk/people/ag153/homepage/KmeansKmedoids/Kmeans_
2
a) Compute the Euclidean distances between the genes. Show these distances in a
10 x 10 matrix. Draw a 2-dimensional plot of expression values of the genes in
Table 1 (x-axis is Exp.1 and y-axis is Exp.2 values).
b) Initiate the K-medoids clustering with Gene 1, Gene 2, and Gene 3. Iterate the
clustering 3 times. Show the iteration steps of the algorithm. Does the algorithm
converge? Report the obtained clusters.
c) Compute the precision, accuracy, and recall for the obtained clustering.
A NSWER :
a) The Euclidean distances among these genes are:
Gene
1
2
3
4
5
6
7
8
9
10
1
0
2
2.36
0
3
1.44
1.40
0
4
2.98
1.22
2.49
0
5
2.16
0.22
1.31
1.21
0
6
3.86
3.78
2.84
5.00
3.83
0
7
1.13
2.77
1.41
3.70
2.63
2.92
0
8
1.30
3.45
2.19
4.24
3.29
3.61
0.86
0
9
1.26
3.58
2.44
4.24
3.40
4.10
1.26
0.51
0
10
4.09
4.16
3.16
5.38
4.20
0.41
0.38
3.71
4.21
0
The 2-dimensional plot of the expression values:
b) Iteration 1: First, the cluster members are identified using the shortest distance.
The clusters are:
• Cluster 1 (Centroid 1) - 1, 7, 8, 9
• Cluster 2 (Centroid 2) - 2, 4, 5
3
• Cluster 3 (Centroid 3) - 3, 6, 10
The cluster centers are updated by computing the total distances of each member to other elements. The element having the minimum distance to all other
elements is chosen as the new cluster center. The new cluster centers are:
• Centroid of Cluster 1: 8
• Centroid of Cluster 2: 5
• Centroid of Cluster 3: 6
Iteration 2: The clusters are redefined based on the new cluster centers:
• Cluster 1 (Centroid 8): 1, 7, 8, 9
• Cluster 2 (Centroid 5): 2, 3, 4, 5
• Cluster 3 (Centroid 6): 6, 10
When we re-compute the cluster centers, the centroids are still 8, 5, and 6.
Iteration 3: The centroids did not change. Therefore, the clusters stay the same.
The algorithm converged.
The final clusters are {1, 7, 8, 9}, {2, 3, 4, 5}, and {6, 10}.
c) The precision, accuracy and recall needs to be evaluated in terms of network pairs
that are correctly grouped. The network pairs that are originally from the same
cluster form the true set and vice versa defines the false set. The network pairs
that are clustered together form the positive set and vice versa forms the negative
set. In the light of this information, the true positive(TP), false positive (FP), true
negative (TN), and false negative(FN) counts are as follows:
• TP: 10
• TN: 30
• FP: 3
• FN: 2
As the obtained clusters are identical with the true clusters, the FP and FN values
are 0. So the precision, accuracy, and recall are :
TP
10
T P +F P = 10+3
P +T N
Accur ac y = T P +TT N
+F P +F N
TP
10
Recal l = T P +F N = 10+2
• P r eci si on =
•
•
=
10+30
10+30+3+2
3. (5 marks) Let G(V, E) be a graph. If the running time of an algorithm on G is O(l og (|V |2 |E |)),
then which of the following is a better running time than O(l og (|V |2 |E |)) and why?
• O(|V |2 + |E |)
• O(|E |2 )
• O(l og (|E |3 ))
• O(l og (|E ||V |15 ))
4
¡ ¢
A NSWER : The maximum number of edges in a network that have |V | nodes is |V2 | =
|V ||V −1|
. In order to be able to fairly compare the provided complexities, we need to
2
convert the |E | terms into |V |. The complexities become:
• O(l og (|V |2 |E |))
= O(l og (|V |2 ( |V |(|V2 |−1) )))
2
= O(l og (|V |2 ( |V2| − |V2 | )))
4
3
= O(l og ( |V2| − |V2| ))
= O(l og (|V |4 ))
= O(4l og (|V |))
• O(|V |2 + |E |)
= O(|V |2 + |V |(|V2 |−1) )
= O(|V |2 )
• O(|E |2 )
= O(( |V |(|V2 |−1) )2 )
= O(|V |4 )
• O(l og (|E |3 ))
= O(l og (( |V |(|V2 |−1) )3 ))
= O(l og (|V |6 ))
= O(6l og (|V |))
• O(l og (|E ||V |15 ))
= O(l og ( |V |(|V2 |−1) |V |15 ))
= O(l og (|V |17 ))
= O(17l og (|V |))
In terms of computational complexity, O(l og (|E |3 )) and O(l og (|E ||V |15 )) have the
same computational complexity O(l og (|V |2 |E |)) which is O(l og (|V |)). Still, in
terms of running time, none of the other alternatives would be any faster than
O(l og (|V |2 |E |)).
Q2 - BIOLOGICAL NETWORKS / GRAPH THEORY (30 MARKS)
1. (5 marks) Describe the relationship between protein-protein interaction networks and
metabolic networks? How can they be linked to each other? Provide a detailed explanation.
A NSWER : Protein-protein interaction (PPI) networks represent the physical binding
information of protein pairs. Metabolic networks represent the set of chemical reactions that occur within a cell. Proteins enter the reactions in metabolic networks as
enzymes that catalyse the reaction. Both proteins in PPI networks and enzymes in
metabolic networks are synthesized from the genes. The two networks can be linked
with each other using these genes.
5
2. (15 marks) For the graph below, identify the following:
a) Degree distribution
b) Average clustering coefficient and Clustering coefficient spectrum
c) Shortest path length distribution
d) Betweenness centrality distribution
e) The list of all graphlets of type G 5 and G 6
f) All automorphisms and automorphism orbits.
g) Graphlet signatures of nodes 3, 6, 7, and 9 using only the graphlets with 2, 3, and
4 nodes (i.e., omit orbits 15 to 72).
A NSWERS :
a) Degree distribution:
b) Clustering coefficients of the nodes are as follows:
6
• Nodes 1, 2, 7, 8, 9: 1
• Nodes 4, 5: 0
• Node 3:
• Node 6:
1
6
3
10
Therefore the average clustering coefficient of the network is : 0.607.
The clustering specrum is:
c) Shortest path length distribution:
d) The betweenness centralities of the nodes are:
• Nodes 1, 2, 7, 8, 9: 0
• Nodes 4, 5: 6 (Normalized : 0.214)
7
• Node 3: 12.5 (Normalized: 0.446)
• Node 6: 15.5 (Normalized: 0.554)
This distribution can be plotted as:
e) The list of G 5 are: {3, 4, 6, 5}.
The list of G 6 are: {1, 2, 3, 5}, {1, 2, 3, 4}, {7, 9, 6, 5}, {7, 9, 6, 4}, {7, 8, 6, 5}, {7, 8, 6, 4},
{8, 9, 6, 5}, {8, 9, 6, 4}.
f) The autonomorphism orbits of this graph are : {1, 2}, {3}, {4, 5}, {6}, {7, 8, 9}.
There are 2 ∗ 2 ∗ 6 = 24 automorphisms of this graph. These mappings are listed
as follows (assuming that the mapping order is [1, 2, 3, 4, 5, 6, 7, 8, 9]):
• [1, 2, 3, 4, 5, 6, 7, 8, 9]
• [1, 2, 3, 4, 5, 6, 7, 9, 8]
• [1, 2, 3, 4, 5, 6, 8, 7, 9]
• [1, 2, 3, 4, 5, 6, 8, 9, 7]
• [1, 2, 3, 4, 5, 6, 9, 7, 8]
• [1, 2, 3, 4, 5, 6, 9, 8, 7]
• [1, 2, 3, 5, 4, 6, 7, 8, 9]
• [1, 2, 3, 5, 4, 6, 7, 9, 8]
• [1, 2, 3, 5, 4, 6, 8, 7, 9]
• [1, 2, 3, 5, 4, 6, 8, 9, 7]
• [1, 2, 3, 5, 4, 6, 9, 7, 8]
• [1, 2, 3, 5, 4, 6, 9, 8, 7]
• [2, 1, 3, 4, 5, 6, 7, 8, 9]
8
• [2, 1, 3, 4, 5, 6, 7, 9, 8]
• [2, 1, 3, 4, 5, 6, 8, 7, 9]
• [2, 1, 3, 4, 5, 6, 8, 9, 7]
• [2, 1, 3, 4, 5, 6, 9, 7, 8]
• [2, 1, 3, 4, 5, 6, 9, 8, 7]
• [2, 1, 3, 5, 4, 6, 7, 8, 9]
• [2, 1, 3, 5, 4, 6, 7, 9, 8]
• [2, 1, 3, 5, 4, 6, 8, 7, 9]
• [2, 1, 3, 5, 4, 6, 8, 9, 7]
• [2, 1, 3, 5, 4, 6, 9, 7, 8]
• [2, 1, 3, 5, 4, 6, 9, 8, 7]
g) The graphlet signatures of nodes 3, 6, 7, and 8 are summarized as follows:
node3
node6
node7
node9
O. 0
4
5
3
3
O. 1
2
2
2
2
O. 2
5
7
0
0
O. 3
1
3
3
3
O. 4
6
4
2
2
O. 5
4
6
0
0
O. 6
0
0
1
1
O. 7
2
3
0
0
O. 8
1
1
0
0
O. 9
0
0
0
0
O. 10
0
0
4
4
O. 11
2
6
0
0
O. 12
0
0
0
0
O. 13
0
0
0
0
3. (5 marks) What is the maximum number of graphlets of type G 6 that a connected graph
with 10 nodes and 10 edges can have? Show the resulting graph.
A NSWER : With the given conditions, the maximum number of G 6 is 7. For forming
G 6 a triangle is needed, which uses 3 nodes and 3 edges. The rest of the forming a
second triangle would require 1 node and 2 more edges. In this case, the graph would
be unconnected since we are left with 6 nodes and 5 edges. Therefore, the maximum
number of triangles that we can include in this graph is 1, if we want it to be connected.
The rest of the nodes and edges can be attached to any of the nodes. The maximum
number of G 6 will be 7, one G 6 being formed by each newly attached node.
4. (5 marks) If a graph contains 10 triangles, what is the maximum number of 4-node
cliques that it can have? Show the steps of your computation.
A NSWER : We can obtain the maximum number of 4-node cliques when we form the
¡ ¢
clique with the maximum possible size. A 4-node clique contains 43 = 4 triangles. A
¡ ¢
5-node clique contains 53 = 10 triangles. Therefore, we can form a 5-node clique with
the 10 triangles in the graph.
¡ ¢
The 5-node clique contains 54 = 5 4-node cliques. Therefore, the maximum number of
4-node cliques is 5.
9
O. 14
0
1
1
1
Q3 - DATA ANALYSIS / PROGRAMMING (40 MARKS )
This question contains two sections; Section A and B. Section A contains data analysis questions that aim to provide you experience with accessing the public databases and analysing
the data that you can obtain from these databases. Section B contains a bioinformatics programming question that requires computational thinking and programming. You can choose
to answer either of the sections to get full marks. If you choose to answer both sections, both
of your answers will be evaluated and you will get the highest of the two marks.
A - D ATA A NALYSIS
Osteocalcin, also known as bone gamma-carboxyglutamic acid-containing protein (BGLAP),
is a noncollagenous protein found in bone and dentin. As osteocalcin is produced by osteoblasts, it is often used as a marker for the bone formation process. In this question, you
are going to analyse the properties of this protein over different species.
1. (5 marks) From UniProt database 2 , obtain the following information on Osteocalcin
for Human (Homo Sapiens), Mouse (Mus Musculus), Pig (Sus Scrofa), Dog (Canis Familiaris), and Cat (Felis Catus):
a) UniProt IDs
b) Gene IDs
c) Gene Names
d) KEGG Database IDs
e) Sequence lengths
f) FASTA sequences
g) Gene Ontology (GO) Annotations for biological process, molecular function, and
cellular component
A NSWER : The information is summarized as follows:
Uniprot Id
Gene Id
Gene Name
KEGG Database ID
Sequence Length
Human
(Homo Sapiens)
P02818
632
BGLAP
hsa:632
100
Mouse
(Mus Musculus)
P86546
12096
BGLAP
mmu:12096
95
Pig
(Sus scrofa)
Q8HYY9
397530
BGLAP
ssc:397530
49
Dog
(Canis Familiaris)
P81455
403762
BGLAP
cfa:403762
49
Cat
(Felis Catus)
P02821
101085947
BGLAP
fca:101085947
49
Sequences in FASTA format:
>sp|P02818|OSTCN_HUMAN Osteocalcin OS=Homo sapiens GN=BGLAP PE=1 SV=2
MRALTLLALLALAALCIAGQAGAKPSGAESSKGAAFVSKQEGSEVVKRPRRYLYQWLGAP
2 Uniprot:
http://www.uniprot.org/
10
VPYPDPLEPRREVCELNPDCDELADHIGFQEAYRRFYGPV
>sp|P86546|OSTCN_MOUSE Osteocalcin OS=Mus musculus GN=Bglap PE=2 SV=1
MRTIFLLTLLTLAALCLSDLTDAKPSGPESDKAFMSKQEGNKVVNRLRRYLGASVPSPDP
LEPTREQCELNPACDELSDQYGLKTAYKRIYGITI
>sp|Q8HYY9|OSTCN_PIG Osteocalcin OS=Sus scrofa GN=BGLAP PE=1 SV=2
YLDHGLGAPAPYPDPLEPRREVCELNPDCDELADHIGFQEAYRRFYGIA
>sp|P81455|OSTCN_CANFA Osteocalcin OS=Canis familiaris GN=BGLAP PE=1 SV=1
YLDSGLGAPVPYPDPLEPKREVCELNPNCDELADHIGFQEAYQRFYGPV
>sp|P02821|OSTCN_FELCA Osteocalcin OS=Felis catus GN=BGLAP PE=1 SV=1
YLAPGLGAPAPYPDPLEPKREICELNPDCDELADHIGFQDAYRRFYGTV
Gene Ontology Annotations:
Human:
Biological Process:
• skeletal system development
• ossification
• osteoblast differentiation
• osteoblast development
• cell adhesion
• aging
• cell aging
• response to mechanical stimulus
• response to gravity
• response to inorganic substance
• response to zinc ion
• response to organic cyclic compound
• response to activity
• bone mineralization
• regulation of bone mineralization
• biomineral tissue development
• response to nutrient levels
11
• response to vitamin K
• response to vitamin D
• response to testosterone stimulus
• response to hydroxyisoflavone
• odontogenesis
• response to drug
• response to estrogen stimulus
• regulation of bone resorption
• response to ethanol
• regulation of osteoclast differentiation
• response to glucocorticoid stimulus
• bone development
• cellular response to vitamin D
• cellular response to growth factor stimulus
Molecular Function:
• structural molecule activity
• calcium ion binding
• structural constituent of bone
• hydroxyapatite binding
• metal ion binding
Cellular Component:
• extracellular region
• extracellular space
• cytoplasm
• rough endoplasmic reticulum
• Golgi apparatus
• dendrite
• membrane-bounded vesicle
• cell projection
• perikaryon
Mouse:
Biological Process:
• regulation of bone mineralization
12
• biomineral tissue development
Molecular Function:
• calcium ion binding
• metal ion binding
Cellular Component:
• extracellular region
• cytoplasm
Pig & Dog & Cat:
Biological Process:
• osteoblast differentiation
• regulation of bone mineralization
• biomineral tissue development
• response to vitamin D
Molecular Function:
• calcium ion binding
• metal ion binding
Cellular Component:
• extracellular region
2. (10 marks) Align the Osteocalcin sequences for the 5 species using the BLAST server 3 .
a) Create a table showing the bit scores and the E-values of all pairwise BLAST alignments.
A NSWER : The following table shows that bit scores of BLAST alignments:
Human
Mouse
Pig
Dog
Cat
Human
203
100
92
93.2
90.1
Mouse
95.9
195
66.6
65.9
63.9
Pig
92.8
67
102
91.7
90.5
Dog
94.4
66.2
91.7
102
89
Cat
90.9
64.7
90.5
89
102
Note that, the matrix is not symetric (aligning A with B produces different results
than aligning B with A). This is because of the heuristic approach of BLAST. The
initial seed that the alignment starts with defines the alignment. Therefore, the
output of the two results might differ.
3 BLAST Server:
http://blast.ncbi.nlm.nih.gov/Blast.cgi
13
b) Show the phylogenetic similarities among the Osteocalcin proteins by constructing a dendrogram using average linkage clustering on the bit scores of BLAST
alignments. (Hint: be careful about the definition of the bit scores of BLAST alignments.)
A NSWER : The higher bit scores mean that the two sequences are more similar to
eeach other. The asymmetry of bit-score matrix should be eliminated by taking
the maximum of the two values, as it is the maximum reachable similarity of the
two sequences. The symmetric similarity matrix is as follows:
Human
Mouse
Pig
Dog
Cat
Human
X
X
X
X
X
Mouse
100
X
X
X
X
Pig
92.8
67
X
X
X
Dog
94.4
66.2
91.7
X
X
Cat
90.9
64.7
90.5
89
X
The dendrogram that will be produced from this similarity matrix using average
linkage clustering is as follows:
c) Show the BLAST alignment on the Osteocalcin proteins of Dog and Cat. How
many matches, mismatches and gaps are there in this alignment?
A NSWER : The alignment is:
14
There are 41 matches, 8 mismatches and 0 gaps in this alignment.
d) Is the BLAST alignment on the Osteocalcin proteins of Dog and Cat optimal? Prove
your answer. (Hint: BLAST server provides support for identifying the optimal
global alignment.)
A NSWER : The optimal alignment obtained by Needleman Wunsch algorithm is:
There are 41 matches, 8 mismatches and 0 gaps in this optimal alignment. For this particular example, BLAST identifies the same alignment with Needlemam-Wunsch Algorithm. Therefore, the BLAST alignment for these two sequences is optimal. But note
that, this is not always guaranteed to be true.
3. (25 marks) Obtain the human protein-protein interaction network (PPI) from BioGRID
database 4 and analyse this network using Cytoscape (version 2.8.3) 5 .
a) Collect the human PPI data from BioGRID database. Use database version 3.2.97
and include only the data from yeast-2-hybrid and affinity capture / mass - spectrometry experiments and annotate the nodes with their Entrez Gene IDs. How
many nodes and edges are there when duplicate edges and self-loops are eliminated? What is the edge density of this network ? (Hint: BioGRID database provides a Cytoscape plug-in named BiogridPlugin2. This plug-in may ease the data
collection from BioGRID database).
4 BioGRID:http://thebiogrid.org/
5 CytoScape:
http://www.cytoscape.org/
15
A NSWER : The PPI network that is collected using only Yeast-2-Hybrid and Affinity Capture - MS experiments contains 12,234 nodes and 62,607 edges. Therefore,
the edge-density of this network is:
#o f ed g es
Densi t y = ¡#o f nod es ¢ = 0.000837
(0.1)
2
b) Identify the first and second neighborhoods of the human Osteocalcin protein.
Construct the induced subgraph on Osteocalcin and the identified neighborhoods.
Visualize the induced subgraph using Cytoscape and include the visualization
into your report. What are the number of nodes and edges in this network?
A NSWER : The induced subgraph contains 101 nodes and 203 edges. The illustration of this network is as below:
16
c) Compute the following network properties for the induced subgraph obtained
above:
• degree distribution
• average clustering coefficient
• clustering spectrum
• network diameter
• spectrum of shortest path lengths
• betweenness centrality distribution
• eccentricity centrality distribution
You may use the NetworkAnalyzer plug-in of Cytoscape for performing this analysis.
A NSWER : Degree Distribution:
Average clustering coefficient: 0.633
Clustering Spectrum:
17
Network Diameter: 2
Spectrum of Shortest Path Lengths:
Betweenness Centrality Distribution:
Eccentricity Centrality Distribution:
d) Compute the number of graphlets in the induced subgraph for each 2-to-5 node
18
graphlet. You may use GraphCrunch2 6 package for this purpose.
A NSWER : The graphlet counts are summarized as follows:
Graphlet
1
2
3
4
5
6
7
8
9
10
Count
7761
357
2401
194901
94
18040
3889
256
3070
29122
Graphlet
11
12
13
14
15
16
17
18
19
20
Count
3977258
9292
1482
546201
35
1864
159978
3640
1417
97
Graphlet
21
22
23
24
25
26
27
28
29
Count
356
51066
11402
4225
112
7722
200
1112
100
e) Which proteins are topologically identical to the Osteocalcin protein in the induced subgraph? Use graphlet signature similarities of the nodes for identifying
the topological similarities among the nodes. (Hint: GraphCrunch2 package can
compute the graphlet signature similarities.)
A NSWER : The Gene IDs of the proteins that are topologically identical to Osteocalcin are listed as follows: 10782, 146542, 199704, 148266, 115509, 65251, 79818,
284406, 253639, 92285, 7770, 632, 92595, 286205, 642954, 65988, 55900, 25799,
653147, 374899, 57337, 58499, 2492, 6874, 51224.
f) What are the three highest degree nodes (proteins) in the induced subgraph? Write
their Gene Names and Entrez Gene IDs. From KEGG database 7 , identify the pathways these proteins are associated with. Write the names and KEGG ID’s of these
pathways.
A NSWER : The three highest degree nodes in the induced subgraph are CBX5,
UBC, and SUMO2. The pathways that they are associated are:
CBX5: No pathway associations in KEGG.
UBC: PPAR signaling pathway (hsa03320)
SUMO2: RNA transport (hsa03013)
B - B IOINFORMATICS P ROGRAMMING
Smith-Waterman algorithm is a dynamic programming algorithm which identifies the optimal local alignment among two protein sequences. In this question, you are asked to imple6 GraphCrunch2:
http://bio-nets.doc.ic.ac.uk/graphcrunch2/
http://www.genome.jp/kegg/
7 KEGG Database:
19
ment a modified version of the Smith-Waterman algorithm that can identify the optimal local
alignment among three protein sequences.
1. (5 marks) How would you modify the Smith-Waterman algorithm to find the optimal
local alignment of 3 sequences rather than 2? Explain.
A NSWER : Smith-Waterman algorithm considers 3 directions when aligning two sequences; for the cost matrix cell (x, y), checks the cost of reaching from (x −1, y), (x, y −
1), and (x − 1, y − 1). For aligning 3-sequences, the cost matrix should be modified to
be 3 dimensional. For checking the cost of a cell (x, y, z) in the cost matrix, 7 different
directions need to be evaluated. These directions and their meaning are as follows:
• (x − 1, y, z) : Means aligning an amino acid from the first sequence with gaps for
the second and third sequences
• (x, y −1, z) :Means aligning an amino acid from the second sequence with gaps for
the first and third sequences
• (x, y, z − 1) :Means aligning an amino acid from the third sequence with gaps for
the first and second sequences.
• (x − 1, y − 1, z) :Means aligning amino acids from the first and second sequences
with a gap for the third sequence.
• (x −1, y, z −1) :Means aligning amino acids from the first and third sequences with
a gap for the second sequence.
• (x, y − 1, z − 1) :Means aligning amino acids from the second and third sequences
with a gap for the third sequence.
• (x − 1, y − 1, z − 1) :Means aligning amino acids from all sequences.
At these steps, the cost of gap penalties and matches/mismatches should be computed for each pair of aminoacid alignments and averaged as described in the
question text. The maximum score of the 7 options should be chosen to fill the
cost matrix cell.
2. (20 marks) Implement a modified version of Smith-Waterman algorithm that can identify the optimal local alignment among three sequences. Use BLOSUM62 matrix8 for
scoring the alignments. Compute the similarity score of 3 aminoacids (or gaps) as the
average score of their pairwise distances, i.e., for a match (a, b, c) the alignment score is
S(a, b, c) = S(a,b)+S(a,c)+S(b,c)
. Your implementation should read a file containing three
3
FASTA sequences (an example is provided in part 3). You are allowed to use any wellknown programming language (preferably C, C++, Java, or Python). You will be penalized 5 marks if your code is not clearly commented. You are supposed to include your
implementation both in your report and also as a soft copy on CATE. You are supposed
to write a clear README file containing easy-to-follow instructions on the compilation/execution process. Failing to do so will be penalized by 10 marks.
8 BLOSUM62 :
http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt
20
A NSWER : The implementation is not provided since the implementation depends on
the choices of programming languages.
3. (5 marks) What is the optimal local alignment of Insulin protein sequences from Human, Pig and Chicken? What is the alignment score for this alignment? The sequences
for theseInsulin proteins are provided in FASTA format as follows (this is the format that
your code in part 1 should read):
• Human:
>sp|P01308|INS_HUMAN Insulin OS=Homo sapiens GN=INS PE=1 SV=1
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTR
REAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
• Pig:
>sp|P01315|INS_PIG Insulin OS=Sus scrofa GN=INS PE=1 SV=2
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKAR
REAENPQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN
• Chicken:
>sp|P67970|INS_CHICK Insulin OS=Gallus gallus GN=INS PE=1 SV=1
MALWIRSLPLLALLVFSGPGTSYAAANQHLCGSHLVEALYLVCGERGFFYSPKARR
DVEQPLVSSPLRGEAGVLPFQQEEYEKVKRGIVEQCCHNTCSLYQLENYCN
A NSWER : The optimal alignment between these three sequences are as follows:
The optimal alignment score is for this alignment 396.
There were slight differences in the answers that are mainly caused because of
different way of threating gaps. We have been very flexible for these differences.
Any sound alignment is accepted as true.
4. (5 marks) What is the computational complexity of your implementation? Explain.
21
A NSWER : The highest computational complexity is caused at the cost matrix filling
step. For three sequences of lengths m, n, and k, the computational complexity of the
algorithm is O(mnk). This is the theoretical complexity of the algorithm. Depending
on the quality of the implementation, this complexity might get higher.
5. (5 marks) Can the pairwise Smith-Waterman algorithm be directly used for identifying
the optimal local alignment among three sequences? If so, explain how and compare
the computational complexity of the two approaches? If not, explain why not?
A NSWER : No, pairwise version of the Smith-Waterman algorithm can not be used for
identifying the optimal alignment among three sequences. This is mainly because,
when you are performing pairwise alignment, you optimize the alignment to get the
best alignment for the two sequences that you have chosen. The second alignment
which includes the third sequence preserves the first alignment which was optimized
for the first two sequences. For this reason, the resulting alignment might be very different from the optimal alignment of the three sequences.
22