I MPERIAL C OLLEGE L ONDON , D EPARMENT OF C OMPUTING C341: Introduction to Bioinformatics Sample Solution for Coursework - Spring 2013 Q1 - SEQUENCE SIMILARITY / CLUSTERING / COMPUTATIONAL COMPLEXITY (30 MARKS) 1. (10 marks) Needleman-Wunsch algorithm is a dynamic programming algorithm that identifies the optimal global alignment among two sequences. The Needleman-Wunsch algorithm may introduce three different types of gaps in the alignment: • Opening gap: The first occurrence of a gap in the alignment. • Extension gap: The consecutive gaps that follows an opening gap in the alignment. • Begin/End gap: The gaps that are introduced at the beginning/end of the alignment. 1 V 2 A 3 A A 4 V V 5 F 6 F 7 A 8 L L 9 I 10 F - 11 V - For the above alignment, positions {1, 2, 10, 11} show begin/end gaps, positions {5, 9} show opening gaps, and positions {6, 7} show extension gaps as they follow the opening gap in position 5. The three gap types are sometimes penalized different from each other. Explain which modifications you need to make in the Needleman-Wunsch algorithm in order to compute the optimal alignment correctly when the gap penalties are different for the gap types. Do these modifications change the computational complexity of the algorithm? 1 A NSWER : The Needleman-Wunsch algorithm can be used in the cases of different opening gap, extension gap, and begin/end gap penalties by adding some extra check conditions for filling the cost matrix, C . The modifications that are required for the correct computation are: • Begin Gap: When filling the cost matrix, the first column and row of C should be penalized with the begin gap penalty, P beg i n . • Opening/Extension Gap: When computing the gap penalty of a cell C (i + 1, j ) or C (i , j + 1), the move taken for reaching C (i , j ) needs to be considered. If we are considering the move C (i + 1, j ) and the previous move for reaching C (i , j ) departs from C (i − 1, j ), then we add a gap extension penalty. Similarly, if we are considering the move C (i , j +1) and the previous move for reaching C (i , j ) departs from C (i , j − 1), then this is also penalized with a gap extension penalty. Other cases are penalized with a gap open penalty. • End Gap: The number of aminoacids that are aligned in the alignment until reaching cell C (i , j ) can be obtained from the i and j values. If i is equal to the length of the horizontal sequence, then adding a gap to reach C (i , j +1) should be penalized with an end gap penalty. Similarly, if j is equal to the length of the vertical sequence, then adding a gap to reach C (i + 1, j ) should be penalized with an end gap penalty. These changes to not cause any increase in the computational complexity of the algorithm. 2. (15 marks) K-medoids clustering algorithm is a variation of the K-means clustering algorithm, where the sample having the minimum distance to all other samples in the cluster is chosen as the cluster center during the iterations 1 . For the microarray expression values for 10 genes that are provided in Table 1, compute the following: Gene Name Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Gene 7 Gene 8 Gene 9 Gene 10 Exp.1 1.1 2.5 2.45 1.8 2.3 4.8 1.9 1.2 0.7 4.9 Exp.2 1.9 3.8 2.4 4.8 3.7 0.8 1.1 0.6 0.7 0.4 True Cluster C1 C2 C3 C2 C2 C3 C1 C1 C1 C3 Table 1: Microarray Expression Values 1 K-medoids: Kmedoids.html http://www.math.le.ac.uk/people/ag153/homepage/KmeansKmedoids/Kmeans_ 2 a) Compute the Euclidean distances between the genes. Show these distances in a 10 x 10 matrix. Draw a 2-dimensional plot of expression values of the genes in Table 1 (x-axis is Exp.1 and y-axis is Exp.2 values). b) Initiate the K-medoids clustering with Gene 1, Gene 2, and Gene 3. Iterate the clustering 3 times. Show the iteration steps of the algorithm. Does the algorithm converge? Report the obtained clusters. c) Compute the precision, accuracy, and recall for the obtained clustering. A NSWER : a) The Euclidean distances among these genes are: Gene 1 2 3 4 5 6 7 8 9 10 1 0 2 2.36 0 3 1.44 1.40 0 4 2.98 1.22 2.49 0 5 2.16 0.22 1.31 1.21 0 6 3.86 3.78 2.84 5.00 3.83 0 7 1.13 2.77 1.41 3.70 2.63 2.92 0 8 1.30 3.45 2.19 4.24 3.29 3.61 0.86 0 9 1.26 3.58 2.44 4.24 3.40 4.10 1.26 0.51 0 10 4.09 4.16 3.16 5.38 4.20 0.41 0.38 3.71 4.21 0 The 2-dimensional plot of the expression values: b) Iteration 1: First, the cluster members are identified using the shortest distance. The clusters are: • Cluster 1 (Centroid 1) - 1, 7, 8, 9 • Cluster 2 (Centroid 2) - 2, 4, 5 3 • Cluster 3 (Centroid 3) - 3, 6, 10 The cluster centers are updated by computing the total distances of each member to other elements. The element having the minimum distance to all other elements is chosen as the new cluster center. The new cluster centers are: • Centroid of Cluster 1: 8 • Centroid of Cluster 2: 5 • Centroid of Cluster 3: 6 Iteration 2: The clusters are redefined based on the new cluster centers: • Cluster 1 (Centroid 8): 1, 7, 8, 9 • Cluster 2 (Centroid 5): 2, 3, 4, 5 • Cluster 3 (Centroid 6): 6, 10 When we re-compute the cluster centers, the centroids are still 8, 5, and 6. Iteration 3: The centroids did not change. Therefore, the clusters stay the same. The algorithm converged. The final clusters are {1, 7, 8, 9}, {2, 3, 4, 5}, and {6, 10}. c) The precision, accuracy and recall needs to be evaluated in terms of network pairs that are correctly grouped. The network pairs that are originally from the same cluster form the true set and vice versa defines the false set. The network pairs that are clustered together form the positive set and vice versa forms the negative set. In the light of this information, the true positive(TP), false positive (FP), true negative (TN), and false negative(FN) counts are as follows: • TP: 10 • TN: 30 • FP: 3 • FN: 2 As the obtained clusters are identical with the true clusters, the FP and FN values are 0. So the precision, accuracy, and recall are : TP 10 T P +F P = 10+3 P +T N Accur ac y = T P +TT N +F P +F N TP 10 Recal l = T P +F N = 10+2 • P r eci si on = • • = 10+30 10+30+3+2 3. (5 marks) Let G(V, E) be a graph. If the running time of an algorithm on G is O(l og (|V |2 |E |)), then which of the following is a better running time than O(l og (|V |2 |E |)) and why? • O(|V |2 + |E |) • O(|E |2 ) • O(l og (|E |3 )) • O(l og (|E ||V |15 )) 4 ¡ ¢ A NSWER : The maximum number of edges in a network that have |V | nodes is |V2 | = |V ||V −1| . In order to be able to fairly compare the provided complexities, we need to 2 convert the |E | terms into |V |. The complexities become: • O(l og (|V |2 |E |)) = O(l og (|V |2 ( |V |(|V2 |−1) ))) 2 = O(l og (|V |2 ( |V2| − |V2 | ))) 4 3 = O(l og ( |V2| − |V2| )) = O(l og (|V |4 )) = O(4l og (|V |)) • O(|V |2 + |E |) = O(|V |2 + |V |(|V2 |−1) ) = O(|V |2 ) • O(|E |2 ) = O(( |V |(|V2 |−1) )2 ) = O(|V |4 ) • O(l og (|E |3 )) = O(l og (( |V |(|V2 |−1) )3 )) = O(l og (|V |6 )) = O(6l og (|V |)) • O(l og (|E ||V |15 )) = O(l og ( |V |(|V2 |−1) |V |15 )) = O(l og (|V |17 )) = O(17l og (|V |)) In terms of computational complexity, O(l og (|E |3 )) and O(l og (|E ||V |15 )) have the same computational complexity O(l og (|V |2 |E |)) which is O(l og (|V |)). Still, in terms of running time, none of the other alternatives would be any faster than O(l og (|V |2 |E |)). Q2 - BIOLOGICAL NETWORKS / GRAPH THEORY (30 MARKS) 1. (5 marks) Describe the relationship between protein-protein interaction networks and metabolic networks? How can they be linked to each other? Provide a detailed explanation. A NSWER : Protein-protein interaction (PPI) networks represent the physical binding information of protein pairs. Metabolic networks represent the set of chemical reactions that occur within a cell. Proteins enter the reactions in metabolic networks as enzymes that catalyse the reaction. Both proteins in PPI networks and enzymes in metabolic networks are synthesized from the genes. The two networks can be linked with each other using these genes. 5 2. (15 marks) For the graph below, identify the following: a) Degree distribution b) Average clustering coefficient and Clustering coefficient spectrum c) Shortest path length distribution d) Betweenness centrality distribution e) The list of all graphlets of type G 5 and G 6 f) All automorphisms and automorphism orbits. g) Graphlet signatures of nodes 3, 6, 7, and 9 using only the graphlets with 2, 3, and 4 nodes (i.e., omit orbits 15 to 72). A NSWERS : a) Degree distribution: b) Clustering coefficients of the nodes are as follows: 6 • Nodes 1, 2, 7, 8, 9: 1 • Nodes 4, 5: 0 • Node 3: • Node 6: 1 6 3 10 Therefore the average clustering coefficient of the network is : 0.607. The clustering specrum is: c) Shortest path length distribution: d) The betweenness centralities of the nodes are: • Nodes 1, 2, 7, 8, 9: 0 • Nodes 4, 5: 6 (Normalized : 0.214) 7 • Node 3: 12.5 (Normalized: 0.446) • Node 6: 15.5 (Normalized: 0.554) This distribution can be plotted as: e) The list of G 5 are: {3, 4, 6, 5}. The list of G 6 are: {1, 2, 3, 5}, {1, 2, 3, 4}, {7, 9, 6, 5}, {7, 9, 6, 4}, {7, 8, 6, 5}, {7, 8, 6, 4}, {8, 9, 6, 5}, {8, 9, 6, 4}. f) The autonomorphism orbits of this graph are : {1, 2}, {3}, {4, 5}, {6}, {7, 8, 9}. There are 2 ∗ 2 ∗ 6 = 24 automorphisms of this graph. These mappings are listed as follows (assuming that the mapping order is [1, 2, 3, 4, 5, 6, 7, 8, 9]): • [1, 2, 3, 4, 5, 6, 7, 8, 9] • [1, 2, 3, 4, 5, 6, 7, 9, 8] • [1, 2, 3, 4, 5, 6, 8, 7, 9] • [1, 2, 3, 4, 5, 6, 8, 9, 7] • [1, 2, 3, 4, 5, 6, 9, 7, 8] • [1, 2, 3, 4, 5, 6, 9, 8, 7] • [1, 2, 3, 5, 4, 6, 7, 8, 9] • [1, 2, 3, 5, 4, 6, 7, 9, 8] • [1, 2, 3, 5, 4, 6, 8, 7, 9] • [1, 2, 3, 5, 4, 6, 8, 9, 7] • [1, 2, 3, 5, 4, 6, 9, 7, 8] • [1, 2, 3, 5, 4, 6, 9, 8, 7] • [2, 1, 3, 4, 5, 6, 7, 8, 9] 8 • [2, 1, 3, 4, 5, 6, 7, 9, 8] • [2, 1, 3, 4, 5, 6, 8, 7, 9] • [2, 1, 3, 4, 5, 6, 8, 9, 7] • [2, 1, 3, 4, 5, 6, 9, 7, 8] • [2, 1, 3, 4, 5, 6, 9, 8, 7] • [2, 1, 3, 5, 4, 6, 7, 8, 9] • [2, 1, 3, 5, 4, 6, 7, 9, 8] • [2, 1, 3, 5, 4, 6, 8, 7, 9] • [2, 1, 3, 5, 4, 6, 8, 9, 7] • [2, 1, 3, 5, 4, 6, 9, 7, 8] • [2, 1, 3, 5, 4, 6, 9, 8, 7] g) The graphlet signatures of nodes 3, 6, 7, and 8 are summarized as follows: node3 node6 node7 node9 O. 0 4 5 3 3 O. 1 2 2 2 2 O. 2 5 7 0 0 O. 3 1 3 3 3 O. 4 6 4 2 2 O. 5 4 6 0 0 O. 6 0 0 1 1 O. 7 2 3 0 0 O. 8 1 1 0 0 O. 9 0 0 0 0 O. 10 0 0 4 4 O. 11 2 6 0 0 O. 12 0 0 0 0 O. 13 0 0 0 0 3. (5 marks) What is the maximum number of graphlets of type G 6 that a connected graph with 10 nodes and 10 edges can have? Show the resulting graph. A NSWER : With the given conditions, the maximum number of G 6 is 7. For forming G 6 a triangle is needed, which uses 3 nodes and 3 edges. The rest of the forming a second triangle would require 1 node and 2 more edges. In this case, the graph would be unconnected since we are left with 6 nodes and 5 edges. Therefore, the maximum number of triangles that we can include in this graph is 1, if we want it to be connected. The rest of the nodes and edges can be attached to any of the nodes. The maximum number of G 6 will be 7, one G 6 being formed by each newly attached node. 4. (5 marks) If a graph contains 10 triangles, what is the maximum number of 4-node cliques that it can have? Show the steps of your computation. A NSWER : We can obtain the maximum number of 4-node cliques when we form the ¡ ¢ clique with the maximum possible size. A 4-node clique contains 43 = 4 triangles. A ¡ ¢ 5-node clique contains 53 = 10 triangles. Therefore, we can form a 5-node clique with the 10 triangles in the graph. ¡ ¢ The 5-node clique contains 54 = 5 4-node cliques. Therefore, the maximum number of 4-node cliques is 5. 9 O. 14 0 1 1 1 Q3 - DATA ANALYSIS / PROGRAMMING (40 MARKS ) This question contains two sections; Section A and B. Section A contains data analysis questions that aim to provide you experience with accessing the public databases and analysing the data that you can obtain from these databases. Section B contains a bioinformatics programming question that requires computational thinking and programming. You can choose to answer either of the sections to get full marks. If you choose to answer both sections, both of your answers will be evaluated and you will get the highest of the two marks. A - D ATA A NALYSIS Osteocalcin, also known as bone gamma-carboxyglutamic acid-containing protein (BGLAP), is a noncollagenous protein found in bone and dentin. As osteocalcin is produced by osteoblasts, it is often used as a marker for the bone formation process. In this question, you are going to analyse the properties of this protein over different species. 1. (5 marks) From UniProt database 2 , obtain the following information on Osteocalcin for Human (Homo Sapiens), Mouse (Mus Musculus), Pig (Sus Scrofa), Dog (Canis Familiaris), and Cat (Felis Catus): a) UniProt IDs b) Gene IDs c) Gene Names d) KEGG Database IDs e) Sequence lengths f) FASTA sequences g) Gene Ontology (GO) Annotations for biological process, molecular function, and cellular component A NSWER : The information is summarized as follows: Uniprot Id Gene Id Gene Name KEGG Database ID Sequence Length Human (Homo Sapiens) P02818 632 BGLAP hsa:632 100 Mouse (Mus Musculus) P86546 12096 BGLAP mmu:12096 95 Pig (Sus scrofa) Q8HYY9 397530 BGLAP ssc:397530 49 Dog (Canis Familiaris) P81455 403762 BGLAP cfa:403762 49 Cat (Felis Catus) P02821 101085947 BGLAP fca:101085947 49 Sequences in FASTA format: >sp|P02818|OSTCN_HUMAN Osteocalcin OS=Homo sapiens GN=BGLAP PE=1 SV=2 MRALTLLALLALAALCIAGQAGAKPSGAESSKGAAFVSKQEGSEVVKRPRRYLYQWLGAP 2 Uniprot: http://www.uniprot.org/ 10 VPYPDPLEPRREVCELNPDCDELADHIGFQEAYRRFYGPV >sp|P86546|OSTCN_MOUSE Osteocalcin OS=Mus musculus GN=Bglap PE=2 SV=1 MRTIFLLTLLTLAALCLSDLTDAKPSGPESDKAFMSKQEGNKVVNRLRRYLGASVPSPDP LEPTREQCELNPACDELSDQYGLKTAYKRIYGITI >sp|Q8HYY9|OSTCN_PIG Osteocalcin OS=Sus scrofa GN=BGLAP PE=1 SV=2 YLDHGLGAPAPYPDPLEPRREVCELNPDCDELADHIGFQEAYRRFYGIA >sp|P81455|OSTCN_CANFA Osteocalcin OS=Canis familiaris GN=BGLAP PE=1 SV=1 YLDSGLGAPVPYPDPLEPKREVCELNPNCDELADHIGFQEAYQRFYGPV >sp|P02821|OSTCN_FELCA Osteocalcin OS=Felis catus GN=BGLAP PE=1 SV=1 YLAPGLGAPAPYPDPLEPKREICELNPDCDELADHIGFQDAYRRFYGTV Gene Ontology Annotations: Human: Biological Process: • skeletal system development • ossification • osteoblast differentiation • osteoblast development • cell adhesion • aging • cell aging • response to mechanical stimulus • response to gravity • response to inorganic substance • response to zinc ion • response to organic cyclic compound • response to activity • bone mineralization • regulation of bone mineralization • biomineral tissue development • response to nutrient levels 11 • response to vitamin K • response to vitamin D • response to testosterone stimulus • response to hydroxyisoflavone • odontogenesis • response to drug • response to estrogen stimulus • regulation of bone resorption • response to ethanol • regulation of osteoclast differentiation • response to glucocorticoid stimulus • bone development • cellular response to vitamin D • cellular response to growth factor stimulus Molecular Function: • structural molecule activity • calcium ion binding • structural constituent of bone • hydroxyapatite binding • metal ion binding Cellular Component: • extracellular region • extracellular space • cytoplasm • rough endoplasmic reticulum • Golgi apparatus • dendrite • membrane-bounded vesicle • cell projection • perikaryon Mouse: Biological Process: • regulation of bone mineralization 12 • biomineral tissue development Molecular Function: • calcium ion binding • metal ion binding Cellular Component: • extracellular region • cytoplasm Pig & Dog & Cat: Biological Process: • osteoblast differentiation • regulation of bone mineralization • biomineral tissue development • response to vitamin D Molecular Function: • calcium ion binding • metal ion binding Cellular Component: • extracellular region 2. (10 marks) Align the Osteocalcin sequences for the 5 species using the BLAST server 3 . a) Create a table showing the bit scores and the E-values of all pairwise BLAST alignments. A NSWER : The following table shows that bit scores of BLAST alignments: Human Mouse Pig Dog Cat Human 203 100 92 93.2 90.1 Mouse 95.9 195 66.6 65.9 63.9 Pig 92.8 67 102 91.7 90.5 Dog 94.4 66.2 91.7 102 89 Cat 90.9 64.7 90.5 89 102 Note that, the matrix is not symetric (aligning A with B produces different results than aligning B with A). This is because of the heuristic approach of BLAST. The initial seed that the alignment starts with defines the alignment. Therefore, the output of the two results might differ. 3 BLAST Server: http://blast.ncbi.nlm.nih.gov/Blast.cgi 13 b) Show the phylogenetic similarities among the Osteocalcin proteins by constructing a dendrogram using average linkage clustering on the bit scores of BLAST alignments. (Hint: be careful about the definition of the bit scores of BLAST alignments.) A NSWER : The higher bit scores mean that the two sequences are more similar to eeach other. The asymmetry of bit-score matrix should be eliminated by taking the maximum of the two values, as it is the maximum reachable similarity of the two sequences. The symmetric similarity matrix is as follows: Human Mouse Pig Dog Cat Human X X X X X Mouse 100 X X X X Pig 92.8 67 X X X Dog 94.4 66.2 91.7 X X Cat 90.9 64.7 90.5 89 X The dendrogram that will be produced from this similarity matrix using average linkage clustering is as follows: c) Show the BLAST alignment on the Osteocalcin proteins of Dog and Cat. How many matches, mismatches and gaps are there in this alignment? A NSWER : The alignment is: 14 There are 41 matches, 8 mismatches and 0 gaps in this alignment. d) Is the BLAST alignment on the Osteocalcin proteins of Dog and Cat optimal? Prove your answer. (Hint: BLAST server provides support for identifying the optimal global alignment.) A NSWER : The optimal alignment obtained by Needleman Wunsch algorithm is: There are 41 matches, 8 mismatches and 0 gaps in this optimal alignment. For this particular example, BLAST identifies the same alignment with Needlemam-Wunsch Algorithm. Therefore, the BLAST alignment for these two sequences is optimal. But note that, this is not always guaranteed to be true. 3. (25 marks) Obtain the human protein-protein interaction network (PPI) from BioGRID database 4 and analyse this network using Cytoscape (version 2.8.3) 5 . a) Collect the human PPI data from BioGRID database. Use database version 3.2.97 and include only the data from yeast-2-hybrid and affinity capture / mass - spectrometry experiments and annotate the nodes with their Entrez Gene IDs. How many nodes and edges are there when duplicate edges and self-loops are eliminated? What is the edge density of this network ? (Hint: BioGRID database provides a Cytoscape plug-in named BiogridPlugin2. This plug-in may ease the data collection from BioGRID database). 4 BioGRID:http://thebiogrid.org/ 5 CytoScape: http://www.cytoscape.org/ 15 A NSWER : The PPI network that is collected using only Yeast-2-Hybrid and Affinity Capture - MS experiments contains 12,234 nodes and 62,607 edges. Therefore, the edge-density of this network is: #o f ed g es Densi t y = ¡#o f nod es ¢ = 0.000837 (0.1) 2 b) Identify the first and second neighborhoods of the human Osteocalcin protein. Construct the induced subgraph on Osteocalcin and the identified neighborhoods. Visualize the induced subgraph using Cytoscape and include the visualization into your report. What are the number of nodes and edges in this network? A NSWER : The induced subgraph contains 101 nodes and 203 edges. The illustration of this network is as below: 16 c) Compute the following network properties for the induced subgraph obtained above: • degree distribution • average clustering coefficient • clustering spectrum • network diameter • spectrum of shortest path lengths • betweenness centrality distribution • eccentricity centrality distribution You may use the NetworkAnalyzer plug-in of Cytoscape for performing this analysis. A NSWER : Degree Distribution: Average clustering coefficient: 0.633 Clustering Spectrum: 17 Network Diameter: 2 Spectrum of Shortest Path Lengths: Betweenness Centrality Distribution: Eccentricity Centrality Distribution: d) Compute the number of graphlets in the induced subgraph for each 2-to-5 node 18 graphlet. You may use GraphCrunch2 6 package for this purpose. A NSWER : The graphlet counts are summarized as follows: Graphlet 1 2 3 4 5 6 7 8 9 10 Count 7761 357 2401 194901 94 18040 3889 256 3070 29122 Graphlet 11 12 13 14 15 16 17 18 19 20 Count 3977258 9292 1482 546201 35 1864 159978 3640 1417 97 Graphlet 21 22 23 24 25 26 27 28 29 Count 356 51066 11402 4225 112 7722 200 1112 100 e) Which proteins are topologically identical to the Osteocalcin protein in the induced subgraph? Use graphlet signature similarities of the nodes for identifying the topological similarities among the nodes. (Hint: GraphCrunch2 package can compute the graphlet signature similarities.) A NSWER : The Gene IDs of the proteins that are topologically identical to Osteocalcin are listed as follows: 10782, 146542, 199704, 148266, 115509, 65251, 79818, 284406, 253639, 92285, 7770, 632, 92595, 286205, 642954, 65988, 55900, 25799, 653147, 374899, 57337, 58499, 2492, 6874, 51224. f) What are the three highest degree nodes (proteins) in the induced subgraph? Write their Gene Names and Entrez Gene IDs. From KEGG database 7 , identify the pathways these proteins are associated with. Write the names and KEGG ID’s of these pathways. A NSWER : The three highest degree nodes in the induced subgraph are CBX5, UBC, and SUMO2. The pathways that they are associated are: CBX5: No pathway associations in KEGG. UBC: PPAR signaling pathway (hsa03320) SUMO2: RNA transport (hsa03013) B - B IOINFORMATICS P ROGRAMMING Smith-Waterman algorithm is a dynamic programming algorithm which identifies the optimal local alignment among two protein sequences. In this question, you are asked to imple6 GraphCrunch2: http://bio-nets.doc.ic.ac.uk/graphcrunch2/ http://www.genome.jp/kegg/ 7 KEGG Database: 19 ment a modified version of the Smith-Waterman algorithm that can identify the optimal local alignment among three protein sequences. 1. (5 marks) How would you modify the Smith-Waterman algorithm to find the optimal local alignment of 3 sequences rather than 2? Explain. A NSWER : Smith-Waterman algorithm considers 3 directions when aligning two sequences; for the cost matrix cell (x, y), checks the cost of reaching from (x −1, y), (x, y − 1), and (x − 1, y − 1). For aligning 3-sequences, the cost matrix should be modified to be 3 dimensional. For checking the cost of a cell (x, y, z) in the cost matrix, 7 different directions need to be evaluated. These directions and their meaning are as follows: • (x − 1, y, z) : Means aligning an amino acid from the first sequence with gaps for the second and third sequences • (x, y −1, z) :Means aligning an amino acid from the second sequence with gaps for the first and third sequences • (x, y, z − 1) :Means aligning an amino acid from the third sequence with gaps for the first and second sequences. • (x − 1, y − 1, z) :Means aligning amino acids from the first and second sequences with a gap for the third sequence. • (x −1, y, z −1) :Means aligning amino acids from the first and third sequences with a gap for the second sequence. • (x, y − 1, z − 1) :Means aligning amino acids from the second and third sequences with a gap for the third sequence. • (x − 1, y − 1, z − 1) :Means aligning amino acids from all sequences. At these steps, the cost of gap penalties and matches/mismatches should be computed for each pair of aminoacid alignments and averaged as described in the question text. The maximum score of the 7 options should be chosen to fill the cost matrix cell. 2. (20 marks) Implement a modified version of Smith-Waterman algorithm that can identify the optimal local alignment among three sequences. Use BLOSUM62 matrix8 for scoring the alignments. Compute the similarity score of 3 aminoacids (or gaps) as the average score of their pairwise distances, i.e., for a match (a, b, c) the alignment score is S(a, b, c) = S(a,b)+S(a,c)+S(b,c) . Your implementation should read a file containing three 3 FASTA sequences (an example is provided in part 3). You are allowed to use any wellknown programming language (preferably C, C++, Java, or Python). You will be penalized 5 marks if your code is not clearly commented. You are supposed to include your implementation both in your report and also as a soft copy on CATE. You are supposed to write a clear README file containing easy-to-follow instructions on the compilation/execution process. Failing to do so will be penalized by 10 marks. 8 BLOSUM62 : http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt 20 A NSWER : The implementation is not provided since the implementation depends on the choices of programming languages. 3. (5 marks) What is the optimal local alignment of Insulin protein sequences from Human, Pig and Chicken? What is the alignment score for this alignment? The sequences for theseInsulin proteins are provided in FASTA format as follows (this is the format that your code in part 1 should read): • Human: >sp|P01308|INS_HUMAN Insulin OS=Homo sapiens GN=INS PE=1 SV=1 MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTR REAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN • Pig: >sp|P01315|INS_PIG Insulin OS=Sus scrofa GN=INS PE=1 SV=2 MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKAR REAENPQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN • Chicken: >sp|P67970|INS_CHICK Insulin OS=Gallus gallus GN=INS PE=1 SV=1 MALWIRSLPLLALLVFSGPGTSYAAANQHLCGSHLVEALYLVCGERGFFYSPKARR DVEQPLVSSPLRGEAGVLPFQQEEYEKVKRGIVEQCCHNTCSLYQLENYCN A NSWER : The optimal alignment between these three sequences are as follows: The optimal alignment score is for this alignment 396. There were slight differences in the answers that are mainly caused because of different way of threating gaps. We have been very flexible for these differences. Any sound alignment is accepted as true. 4. (5 marks) What is the computational complexity of your implementation? Explain. 21 A NSWER : The highest computational complexity is caused at the cost matrix filling step. For three sequences of lengths m, n, and k, the computational complexity of the algorithm is O(mnk). This is the theoretical complexity of the algorithm. Depending on the quality of the implementation, this complexity might get higher. 5. (5 marks) Can the pairwise Smith-Waterman algorithm be directly used for identifying the optimal local alignment among three sequences? If so, explain how and compare the computational complexity of the two approaches? If not, explain why not? A NSWER : No, pairwise version of the Smith-Waterman algorithm can not be used for identifying the optimal alignment among three sequences. This is mainly because, when you are performing pairwise alignment, you optimize the alignment to get the best alignment for the two sequences that you have chosen. The second alignment which includes the third sequence preserves the first alignment which was optimized for the first two sequences. For this reason, the resulting alignment might be very different from the optimal alignment of the three sequences. 22
© Copyright 2025