REVIEW ARTICLE

A Comprehensive Review of HIV-1 and Human Protein-Protein Interaction Prediction

Debasmita Pal and Kartick Chandra Mondal
Department of Information Technology, Jadavpur University, Kolkata - 700032, West Bengal, India
E-mails: [email protected], [email protected]

Abstract: Human Immunodeficiency Virus-Type 1 (HIV-1), the etiologic agent of AIDS, has been the centre of attention of virologists in recent times due to its life-threatening nature and epidemic spread throughout the globe. The virus infects host cells for replication by exploiting a complex interaction network of HIV-1 and human proteins and destroys the human immune system, gradually leading to death. Antiviral drugs are designed using information on viral-host protein-protein interactions (PPIs), so that viral replication and infection can be prevented. Therefore, the prediction of novel interactions based on experimentally validated interactions that are curated in public PPI databases could help in discovering new therapeutic targets. In this article, an overview of HIV-1 proteins and their role in virus replication and pathogenesis is given, followed by a discussion of the different types of antiretroviral drugs and the HIV-1-human PPI database. Thereafter, we present a brief explanation of the different approaches adopted to predict new PPIs and their predicted results, along with the overlap of the interactions predicted by different studies.

Keywords: HIV-1 Proteins, Antiretroviral Drugs, HIV-1, Human Protein Interaction Database (HHPID), Interactions Prediction, Association Rule Mining

1 Introduction

In recent years, our society has been highly perturbed by some virulent viruses; one of them is Human Immunodeficiency Virus - Type 1 (HIV-1), which gradually ruins the immune system of the human body, making it susceptible to infections and diseases.
The terminal stage of HIV-1 infection is Acquired Immunodeficiency Syndrome (AIDS), which eventually becomes fatal [1]. According to the report published by the Joint United Nations Programme on HIV/AIDS and the World Health Organization, the estimated worldwide number of deaths caused by HIV/AIDS was approximately 2.3 million in 2005 and approximately 1.6 million in 2012 [2]. Although AIDS-related mortality is decreasing steadily due to antiretroviral therapy, HIV/AIDS is still a global pandemic. Around 35.3 million people are now living with HIV or AIDS worldwide [3]. The anti-HIV drugs discovered so far can only prevent the early stages of HIV-1 infection to some extent rather than cure it. Further, no vaccine has been discovered yet. Therefore, research on HIV-1 pathogenesis and on improving antiviral treatment is ongoing and is one of the most challenging areas of medical science.

HIV-1 is a retrovirus, a member of a family of RNA viruses that cannot grow or reproduce on their own without a living host cell since they do not contain DNA; it invades the host cell in order to produce progeny. HIV-1 mostly infects vital cells of the human immune system such as CD4+ T cells, macrophages and dendritic cells [4]. It enters the cell cytoplasm by binding to the receptor (CD4) and coreceptors (CXCR4, CCR5) on the host cell surface, followed by fusion with the cellular membrane. Upon entry, a DNA copy of the viral RNA genome is produced by the HIV-1 enzyme Reverse Transcriptase; this process is called Reverse Transcription. The resulting viral DNA is eventually transported into the cell nucleus as part of the pre-integration complex and integrated into the cellular DNA by the viral enzyme Integrase [5]. The integrated viral DNA is referred to as the provirus, which may enter a clinical latency stage during which no signs or symptoms of HIV infection are noticeable.
Alternatively, the provirus may take part in the transcription process and create a new RNA genome and viral proteins that are assembled and released from the cell as a new virion [6]. To accomplish all this, that is, to enter the host cell and gain control over host cell processes, HIV-1 proteins play an important role through a complex network of molecular reactions that includes virus-host protein-protein interactions (PPIs). Therefore, knowledge of the PPIs between HIV-1 and human cellular proteins is the key to understanding HIV-1 replication and pathogenesis, and it marks the beginning of a new era in designing therapeutics and optimizing therapy.

The single-stranded RNA genome of HIV-1 consists of nine genes that encode the structural proteins along with two regulatory and four accessory (auxiliary) proteins (Figure 1) [7]. The three major genes gag, pol and env first encode polyprotein precursors that are processed to produce the structural proteins of the mature virus particle. The regulatory and accessory proteins are also essential for viral replication and for causing disease [8, 9]. Table 1 describes the function of each HIV-1 protein in viral replication [6, 10, 11].

HIV-1 viral proteins interact with cellular proteins during different stages of the viral life cycle in order to replicate successfully and cause disease. Therefore, the study of HIV-1-human PPIs helps us to gain knowledge of the HIV-1 infection and replication process, which in turn supports the development of antiretroviral drugs. The drugs are designed to hinder HIV-1 proteins from interacting with human proteins at different stages of the viral life cycle, preventing the virus from replicating. This is well illustrated by the design of an antiretroviral drug [12]. Broadly, antiretroviral drugs can be categorized into six major classes, which are depicted in Figure 2.
The grouping of the drugs is based on how they interfere with the steps of HIV-1 replication, impeding the virus from replicating and infecting the host cell [13].

In general, PPIs are identified by various experimental methods. However, the prediction of possible interactions based on experimentally observed interactions is one of the major issues in PPI research and could advance the treatment process. Numerous studies have been carried out to predict the PPIs within a single organism, i.e., intra-species interactions. However, analyzing and predicting inter-species interactions, especially the PPIs between a virus and its host, is more promising for antiviral drug discovery. Recently, some articles have been published on the prediction of PPIs between HIV-1 and human proteins. All of these studies were based on the publicly available PPI database, the concept of which is elaborated in Section 2. In Section 3, several approaches taken to predict HIV-1-human PPIs, such as classifier based, structural similarity based and association rule mining (ARM) based approaches, are explained with a brief comparison of their predicted results. Finally, we conclude in Section 4.

Figure 1: Relation between HIV-1 Genes and Proteins

Figure 2: Categorization of Antiretroviral Drugs and Their Prevention Strategies

Table 1: HIV-1 Proteins and Their Functions

HIV-1 Protein | Protein Functions
Matrix (MA, [p17]) | Supports Gag interaction with the plasma membrane, mediating binding and virus assembly, and incorporates envelope proteins into virions.
Nucleocapsid (NC, [p7]) | Surrounds the RNA genome of the virus in order to protect it by forming a stable complex with the viral RNA and helps in delivering the RNA to the virus particle during assembly.
Capsid (CA, [p24]) | Forms a coating around the viral RNA, inserting it into the target cell during infection, and helps in virion assembly and maturation.
p6 | Incorporates Vpr into new virions and promotes budding of virions from the infected cell.
Spacer Peptide 1 (SP1, [p2]) | Essential for basal transcription and Tat-mediated transcription of HIV-1.
Spacer Peptide 2 (SP2, [p1]) | Currently, its function is unknown.
Reverse Transcriptase (RT, [p51,p66]) | Carries out the reverse transcription process, that is, produces a DNA copy of the viral RNA genome.
Integrase (IN, [p32]) | Integrates the viral DNA into the infected cell's DNA, forming the provirus.
Protease (PR, [p11]) | Cleaves the Gag and Gag-pol polyprotein precursors into proper functional pieces, mediating the maturation of virus particles.
Surface Glycoprotein (gp120) | Binds to the host cell surface receptor and coreceptors, mediating virus entry into the host cell.
Transmembrane Glycoprotein (gp41) | Contains the fusion peptide and supports the fusion of the virus with the host cell plasma membrane.
Tat (Trans-Activator of Transcription) | Regulates reverse transcription, ensuring efficient synthesis of viral mRNA as well as the release of virions from the infected cell.
Rev (Regulator of Expression of Virion Proteins) | Exports RNA from the nucleus to the cytoplasm before it can be spliced, so that structural proteins and the RNA genome for the new virion can be produced.
Nef (Negative Factor) | Compels the infected cell to stop producing several cell defence proteins and enhances the progression of HIV infection to AIDS.
Vif (Viral Infectivity Factor) | Attacks the cell's antiretroviral factors that inhibit infection.
Vpr (Viral Protein R) | Assists the viral genome after entry into the nucleus and enhances the infection.
Vpu (Viral Protein U) | Downregulates the cell receptor CD4 and mediates the release of new virions from the host cell surface by weakening the interaction between new envelope proteins and cell receptors.

2 HIV-1 - Human PPI Database

A PPI database provides information about interacting protein pairs. Such databases can be grouped into three categories:

i) Primary Databases - Contain information about PPIs whose existence has been identified by experimental methods.
For example, Database of Interacting Proteins (DIP)¹, Biomolecular Interaction Network Database (BIND)², etc.

ii) Meta-databases - These are made by integrating primary databases. For example, Agile Protein Interaction Data Analyzer (APID)³.

iii) Prediction Databases - Collect information about interacting protein pairs that have been predicted using several techniques based on known interactions. For example, Known and Predicted Protein-Protein Interactions (STRING)⁴.

Many articles have been published focusing on the interactions between HIV-1 and human proteins. However, each individual publication concentrated on only one or a few interactions; hence it was laborious to gather or access all the information about the interactions of a particular HIV-1 protein with human proteins in a compact way. The Division of Acquired Immunodeficiency Syndrome (DAIDS) of the National Institute of Allergy and Infectious Diseases (NIAID) first took the initiative to create the HIV-1, Human Protein Interaction Database (HHPID)⁵, which catalogs all the known interactions of HIV-1 proteins with human proteins. The project of developing the database was started in 2000 in collaboration with the Southern Research Institute and the National Center for Biotechnology Information (NCBI). The database has been prepared by collecting information on interactions from numerous peer-reviewed articles (over 100,000) available in PubMed. Where different publications reach conflicting conclusions, the ambiguity is reported in the description of the interaction. Moreover, newly published literature is reviewed periodically to keep the database up to date [14].

To provide a concise yet detailed summary of all HIV-1-human PPIs for HIV/AIDS research, the HHPID contains the following information for each identified PPI:

• NCBI Reference Sequence (RefSeq) protein accession numbers: RefSeq⁶ is a public database built by NCBI that provides a non-redundant, annotated collection of nucleotide sequences (DNA, RNA) including genomic data, transcripts and proteins.
• NCBI Entrez Gene ID numbers: Entrez Gene⁷ is a database of gene-specific information that generates a stable identifier for genes. This identifier can be used to integrate multiple types of information including nomenclature, accessions of gene-specific and gene product-specific sequences, etc.
• Amino acids from each protein that are known to be involved in the interaction
• Brief description of the PPI
• Keywords for searching the interactions: A total of 68 unique interaction keywords were listed initially to categorize the interactions between HIV-1 and human proteins. Some of the interactions can be considered direct viral-host interactions, whereas most of them are indirect, such as regulatory interactions that are responsible for the alteration of human gene expression [15]. All interaction keywords are listed in Table 2 [20, 30]. Currently, the database includes eight additional interaction types: dephosphorylated by, destabilized by, disrupted by, induces reorganization of, induces ubiquitination of, rescued by, stabilized by, sumoylated by. Based on these keywords, the query interface of the NCBI website can find the cellular proteins that have a specific type of interaction with a viral protein.
• National Library of Medicine (NLM) PubMed identification numbers (PMIDs): Can be used to identify all journal articles describing the interaction.

¹ http://dip.doe-mbi.ucla.edu
² http://bind.ca
³ http://bioinfow.dep.usal.es/apid/index.htm
⁴ http://string-db.org/
⁵ http://www.ncbi.nlm.nih.gov/projects/RefSeq/HIVInteractions/
⁶ http://www.ncbi.nlm.nih.gov/refseq/
⁷ http://www.ncbi.nlm.nih.gov/gene

Currently, the database comprises a total of 12785 HIV-1-human PPIs that involve 21 HIV-1 proteins (including processed proteins as well as polyprotein precursors) and 3142 human genes encoding 3183 human proteins. It also contains 5549 PMID references to the original articles describing the interactions. Among all the interactions reported, only 32% are direct physical interactions (such as binds, cleaves) and the rest are indirect (such as upregulates, modifies). Interactions in the database are mostly associated with the interaction types interacts with (20.6%), upregulates (11.6%), binds (7.9%), activates (7%), downregulates (9.13%) and inhibits (5.6%). Furthermore, the envelope and Tat proteins are involved in 26.45% and 25.37% of the total interactions, respectively. Around 37% of the human proteins interact with more than one HIV-1 protein, and 58% of all the interactions have been reported by more than one article [15].

The HHPID greatly contributes to the HIV-1/AIDS research community by presenting an elaborate picture of HIV-1 replication and pathogenesis. Additionally, it has become a more valuable source of information by providing cross-references to other public databases. The database supports the work done by Brass et al. in 2008. They performed a genomic siRNA (small interfering RNA) screen to recognize and analyze the human proteins required for HIV-1 replication. More than 250 cellular proteins were identified in this work, and these proteins have been termed HIV Dependency Factors (HDFs) [16]. A total of 36 proteins that are involved in the interactions present in the database have also been recognized as HDFs [17].
Furthermore, the database supports new experimental and computational results by allowing a direct comparison with what is known about existing interactions. It should be noted that if no interaction between a particular protein pair is recorded in the database, we cannot conclude that the pair does not interact, because substantive evidence for non-interaction is lacking. Thus, the prediction of PPIs based on the known interactions will help in the discovery of new therapeutic targets and prevention strategies.

3 HIV-1 - Human PPI Prediction

Researchers have so far taken several approaches to predict novel interactions between HIV-1 and human proteins, as shown in Figure 3. In 2006, Park et al. developed an online prediction system, HPID 2.0⁸, that provides all the interacting protein partners for a given protein submitted by the user [19]. The system utilizes several types of information such as protein domain, protein function and subcellular localization in order to predict the human proteins that might interact with an HIV-1 protein. Along with the name of the protein of interest, the user should input the superfamilies of the protein using any of the Superfamily, InterPro or Pfam databases. Protein Structural Interactome MAP (PSIMAP) is used to predict the interacting protein partners at the superfamily level, and a homology search of the protein sequence of the query protein in databases like Ensembl, DIP, HPRD and NCBI is used to find the interacting protein partners.

⁸ http://www.hpid.org

Table 2: Direct and Indirect Interaction Types Reported in HHPID

HIV1-to-Human Protein Interaction Types
Direct Interaction: acetylates, binds*, cleaves, degrades, interacts with*, phosphorylates
Indirect Interaction: activates, associates with*, co-localizes with*, competes with*, complexes with*, cooperates with*, decreases phosphorylation of, deglycosylates, depolymerizes, disrupts, downregulates, enhances, enhances polymerization of, fractionates with*, inactivates, incorporates, induces accumulation of, induces acetylation of, induces cleavage of, induces complex with, induces phosphorylation of, induces rearrangement of, induces release of, inhibits, inhibits acetylation of, modulates, polarizes, recruits, regulates, relocalizes, requires*, sensitizes, stabilizes, stimulates, synergizes with*, upregulates

Human-to-HIV1 Protein Interaction Types
Direct Interaction: acetylated by, binds*, cleaved by, degraded by, interacts with*, methylated by, myristoylated by, phosphorylated by, ubiquitinated by
Indirect Interaction: activated by, associates with*, cleavage induced by, co-localizes with*, competes with*, complexes with*, cooperates with*, downregulated by, enhanced by, exported by, fractionates with*, glycosylated by, imported by, inhibited by, isomerized by, mediated by, modified by, modulated by, palmitoylated by, processed by, recruited by, regulated by, relocalized by, requires*, stimulated by, synergizes with*, upregulated by

Interaction types common to both directions are denoted by an asterisk (*). These types of interactions are referred to as bidirectional or undirectional.

Figure 3: Different Approaches Taken For Predicting HIV-1-Human PPI

In 2009, Tastan et al. attempted to predict the global set of interactions between HIV-1 and human proteins [20]. They treated PPI prediction as a classification problem and adopted a supervised learning framework to solve it. In 2010, the authors extended their approach by integrating a semi-supervised multi-task learning method that considered labeled data (protein pairs that are known to interact by experimental evidence) along with partially labeled data (protein pairs that are associated, but not experimentally validated) [21]. These experiments were based on the dataset retrieved from the HHPID.
Besides this, Dyer et al. proposed a supervised learning approach using a Support Vector Machine (SVM) that was trained with information about HIV-1-human PPIs from public PPI databases other than HHPID [22]. In [23], the authors suggested a prediction framework based on the conformal method that was used to predict novel interactions and to estimate the confidence of each prediction. Recently, a probability weighted ensemble transfer learning model (PWEN-TLM) was developed that utilizes homolog GO information and information on subcellular co-localized proteins to improve the performance of the classifier based approach [24].

However, treating PPI prediction as a classification problem demands both interacting and non-interacting protein pairs. Although the databases of HIV-1-human PPIs provide the set of interacting protein pairs as positive samples, no resource can provide the set of non-interacting protein pairs, as mentioned in Section 2. Thus, the performance of the classifier based approach depends heavily on the choice of the random protein pairs used as negative samples. This fact has driven some researchers to take an unsupervised approach based on ARM for predicting novel PPIs between HIV-1 and the host. In 2010, an attempt was made to extract association rules among human proteins based only on the HHPID dataset using the apriori algorithm, which relies on the concept of frequent itemsets [27]. The generated association rules were then used to predict a set of new interactions. In [28], the authors exploited the concept of frequent closed itemsets (FCIs) in ARM using a biclustering method, and the association rules generated among HIV-1 proteins as well as among human proteins were utilized to discover new interactions. Recently, this biclustering based approach has been extended to provide the interaction type and the direction of regulation for each of the predicted interactions [30].
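The rule-based idea described above can be sketched in a few lines. This is a minimal illustration, not the procedure of [27]: it assumes that each HIV-1 protein is treated as a transaction whose items are its known human partners, it enumerates only single-item rules by brute force instead of running the full apriori algorithm, and the function names and the protein labels (V1, H1, ...) are hypothetical.

```python
from itertools import permutations


def single_item_rules(transactions, min_support, min_confidence):
    """Return rules (a, b, support, confidence) meaning 'a implies b'.

    transactions maps each viral protein to the set of human proteins
    it is known to interact with (an assumed encoding).
    """
    n = len(transactions)
    items = sorted(set().union(*transactions.values()))
    rules = []
    for a, b in permutations(items, 2):
        n_a = sum(1 for t in transactions.values() if a in t)
        n_ab = sum(1 for t in transactions.values() if a in t and b in t)
        if n_a == 0:
            continue
        support = n_ab / n        # fraction of transactions containing both
        confidence = n_ab / n_a   # fraction of 'a' transactions that also hold b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


def predict_from_rules(transactions, rules):
    """Predict (viral, human) pairs where a rule fires but the pair is unknown."""
    predicted = set()
    for viral, partners in transactions.items():
        for a, b, _, _ in rules:
            if a in partners and b not in partners:
                predicted.add((viral, b))
    return predicted


# Hypothetical toy data, for illustration only.
toy = {"V1": {"H1", "H2"}, "V2": {"H1", "H2"}, "V3": {"H1"}, "V4": {"H3"}}
toy_rules = single_item_rules(toy, min_support=0.5, min_confidence=0.6)
print(predict_from_rules(toy, toy_rules))
```

Because only pairs absent from the input can be predicted, every output of such a procedure is novel by construction, which is why validation has to rely on external evidence.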
In [31], a multiobjective genetic algorithm-based biclustering technique was proposed to find a strong interaction module in the HIV-1-human PPI network using the concept of quasi-biclique. In [32], the authors suggested a new algorithm, FIST, that mines FCIs to generate a minimal non-redundant cover of association rules using a closure based approach and extracts the hierarchy of biclusters in a single process; the algorithm was then utilized to predict new PPIs [33]. One downside of ARM based approaches is that they do not predict interactions involving human proteins that are not covered by frequent itemsets or FCIs. Moreover, since this model does not learn the pattern of non-interacting protein pairs, the rate of false positives may be high.

In 2010, Doolittle et al. tried to predict HIV-1-human PPIs based on the structural similarity of HIV-1 proteins to human proteins, motivated by the idea that HIV-1 proteins might interact with the binding partners of those human proteins that are structurally similar to HIV-1 proteins [25]. This methodology used the interactions found in HPRD (Human Protein Reference Database) for prediction and the structural information of proteins from PDB (Protein Data Bank). However, in this method, only the geometric structures of proteins were considered for finding similarity. In [26], the authors introduced an evolution-aware structural alignment method (Unialign) that incorporates evolutionary relationships of proteins along with geometric structural similarity. Even so, the set of interactions predicted by this approach does not include human proteins that have no structural alignment with any HIV-1 protein, or host proteins for which crystal structures are not available in PDB.
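The inference step behind the structural similarity approach can be sketched as a simple lookup. This is a minimal sketch under stated assumptions: the structural alignment itself is taken as already computed and supplied as a mapping, the human interaction network is a plain dictionary, and the function name and the protein labels are hypothetical rather than taken from [25] or [26].

```python
def predict_by_structural_similarity(similar_human, human_ppi):
    """Transfer human-human interactions to structurally similar viral proteins.

    similar_human: viral protein -> set of human proteins with similar structure
    human_ppi:     human protein -> set of its known human binding partners
    """
    predictions = set()
    for viral, lookalikes in similar_human.items():
        for human in lookalikes:
            # Partners of the look-alike become candidate partners of the virus protein.
            for partner in human_ppi.get(human, ()):
                predictions.add((viral, partner))
    return predictions


# Hypothetical toy data, for illustration only.
toy_similarity = {"V1": {"HA"}, "V2": set()}
toy_network = {"HA": {"HB", "HC"}, "HB": {"HA"}}
print(sorted(predict_by_structural_similarity(toy_similarity, toy_network)))
```

The sketch also makes the stated limitation visible: a viral protein with no structurally similar human protein (V2 in the toy data) receives no predictions at all.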
Figure 4a compares the number of interactions predicted by the different studies with the number of true positive interactions (predicted interactions that are known to exist experimentally). For the classifier based approach, the predictions include the known interactions as well as the novel interactions, making it easy to calculate the number of true interactions in the prediction set. The structural similarity based approach used the HHPID and PIG (Pathogen Interaction Gateway) databases for validation. For the ARM based approaches using the apriori and biclustering methods, it is difficult to validate the predicted interactions computationally, since the predictions comprise only novel interactions that are not reported in the database. Therefore, the authors tried to validate their predictions by exhaustively searching recently published PubMed entries that provide experimental evidence for interactions not reported in HHPID. We have found that some of the interactions that were considered novel in [20], [27] and [28] are now experimentally validated. Based on the list of recent literature provided in [28] and [30], we have updated the number of true positive interactions in the prediction sets, as reflected in Figure 4b. Table 3 shows the interactions that were predicted earlier and are now experimentally validated.

The predicted interactions were also validated by finding the overlap with other studies. Figure 5 shows the overlap of the interactions predicted by the five studies listed in the figure. We have found a total of four interactions that were predicted by four studies (marked by i, ii in Figure 5) and a total of eight predicted interactions common to three studies (marked by iii, iv, v). No predicted protein pair has been found to be common to all five studies (marked by vi). Table 4 lists the interactions predicted by at least three of the studies.
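The overlap analysis described above amounts to counting, for each predicted pair, how many studies report it. The following is a minimal sketch; the function names, the study labels and the assignment of pairs to studies are hypothetical and do not reproduce the actual prediction sets.

```python
from collections import Counter


def support_counts(predictions):
    """Count, for every (HIV-1 protein, human protein) pair, the studies predicting it.

    predictions maps a study label to an iterable of predicted pairs.
    """
    counts = Counter()
    for pairs in predictions.values():
        counts.update(set(pairs))  # each study counts at most once per pair
    return counts


def pairs_with_min_support(predictions, min_studies):
    """Return the sorted pairs predicted by at least min_studies studies."""
    counts = support_counts(predictions)
    return sorted(pair for pair, c in counts.items() if c >= min_studies)


# Hypothetical assignment of pairs to studies, for illustration only.
toy = {
    "study_A": [("gp41", "MAPK1"), ("capsid", "MAPK1"), ("vif", "PRKCA")],
    "study_B": [("gp41", "MAPK1"), ("capsid", "MAPK1")],
    "study_C": [("gp41", "MAPK1"), ("vif", "PRKCA")],
}
print(pairs_with_min_support(toy, 3))
```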
Higher priority can be given to the overlapping interactions in further analysis, and they would be of great interest to virologists. In the following subsections, all of these approaches and their results are explained in detail. Lastly, for each approach, we provide a list of URLs where the prediction method and the predicted dataset are publicly available (Table 13).

Figure 4: Comparison of Predicted Interactions by Different Approaches (a) (b)

Table 3: Previously Predicted, Now Experimentally Validated By Different Studies

Predictions by Tastan et al. [20]: Tat ↔ MAPK14, Tat ↔ CASP9, Vpu ↔ BCL2, RT ↔ CD4, Vif ↔ UBB, Vif ↔ TP53, env_gp160 ↔ TP53, env_gp120 ↔ CASP8, env_gp120 ↔ SRC, env_gp41 ↔ MAPK1, Gag_Pr55 ↔ MAPK1
Predictions by Mukhopadhyay et al. [27]: env_gp41 ↔ MAPK1
Predictions by Mukhopadhyay et al. [28]: Tat ↔ MAPK14, Tat ↔ CASP9, Rev ↔ CD4, RT ↔ CD4

Figure 5: Overlapping of Predicted Interactions By Different Studies

Table 4: Interacting Protein Pairs Shared by Different Literatures

Common to Four Studies†
HIV-1 Protein | Human Protein
capsid | MAPK1
gp41 | MAPK1
gp160 | CASP3
p6 | PRKCA

Common to Three Studies‡
HIV-1 Protein | Human Protein
capsid | PRKCA
gp120 | ACTB
gp120 | CALM1
integrase | CD4
integrase | MAPK1
nucleocapsid | PRKCA
vif | PRKCA
vpr | PRKCA

†Interactions marked by (i) and (ii) in Figure 5
‡Interactions marked by (iii), (iv) and (v) in Figure 5

3.1 Classifier Based Approach by Information Integration

The first attempt to predict the global interaction set between HIV-1 and human proteins, by Tastan et al., utilized a supervised learning framework that integrates multiple heterogeneous biological information sources [20]. Since the virus exploits the existing communication pathways within the cell in order to infect it, the interaction relationships between human proteins can be used to find the proteins that the pathogen might target. Thus, the authors derived a number of features that consolidate the existing knowledge of the human interactome (the set of intra-human PPIs) with other available information to predict HIV-1 and human PPIs.

Here, PPI prediction was considered a binary classification problem, since a protein can either interact with another protein or not. Hence, each protein pair was a member of either the interaction class or the non-interaction class. The set of interacting HIV-1-human protein pairs taken from HHPID was divided into two exclusive groups: i) most likely direct physical interactions; ii) indirect interactions (Table 2). Group 1 consisted of 1063 protein pairs involving 721 human proteins, and Group 2 comprised 1447 protein pairs involving 914 human proteins. The Group 1 interactions formed the interaction class, which was used to build the base model. Group 2 was used for mining the final predictions. As mentioned, since the set of non-interacting protein pairs is not available, protein pairs that are not known to interact were chosen randomly as negative samples. A feature vector consisting of 35 features was used to describe each protein pair, and a Random Forest (RF) classifier was utilized to solve the classification problem.
The features were derived from one or more biological information sources such as Gene Ontology (GO) annotations, graph properties of the human interactome (degree, clustering coefficient and betweenness centrality), gene expression, tissue feature, HIV-1 protein type features (ptf), post-translational modifications, sequence similarity features and the ELM-ligand feature. The Gini index was used to construct the decision trees in the RF and to evaluate the importance of the features used in the classification problem. Each possible protein pair was assigned an RF prediction score; protein pairs with a positive RF score were expected to interact, with a higher score indicating a higher probability of interaction. Several tests were carried out on the RF classifier model to evaluate its performance, and it was expected to achieve an average MAP (Mean Average Precision) score of 0.23, i.e., on average 23% of all the predicted PPIs are true positive interactions. For PPI prediction, this result is considered good. A total of 3372 interactions were predicted with RF score ≥ 0.00, among which 2084 are novel interactions. The predicted interactions were compared with the human genes reported in the siRNA screen done by Brass et al. [16] and with the human proteins that are hijacked by HIV-1 in its virion [18]. Additionally, the features that contributed most to the classification of protein pairs were identified based on their Gini importance in the RF classifier. The top 3 Gini features are degree, betweenness centrality and neighbour GO process similarity; the top 6 additionally include clustering coefficient, neighbour GO function similarity and cellular location similarity. However, the performance of this model is influenced by the availability of truly interacting protein pairs.
Thus the authors extended their supervised approach by integrating a semi-supervised multi-task learning framework that included protein pairs for which an association between the two partners exists, but without enough experimental evidence to support it as a direct interaction [21]. Here, 158 HIV-1-human protein pairs suggested by experts were taken as positive samples and 2119 pairs were considered partial positives, among which 552 are Group 1 interactions and 1567 are Group 2 interactions. Instead of 35 features, 18 features were associated with each HIV-1-human protein pair. Using the semi-supervised approach, 3428 interactions were predicted, among which 259 interactions were validated by partial positive interactions and 3123 novel interactions were discovered. The predicted interactions were examined in the same way as before to identify the most interesting pairs to concentrate upon. Table 5 gives the number of predictions found by Tastan et al. using the supervised and semi-supervised approaches for some of the prediction score cut-offs.

Table 5: Result Found Using Classifier Based Approach by Tastan et al.

Prediction Score Cut-off | Number of Predicted Interactions | Number of Novel Interactions Predicted | Human Genes Involved in Predicted Interactions | Overlap with siRNA (282 genes) | Overlap with Virion (316 genes)
Using Supervised Learning Framework [20]
≥ 0.00 | 3372 | 2084 | 1010 | 46 | 240
≥ 2.50 | 279 | 28 | 22 | 0 | 4
Using Semi-supervised Learning Framework [21]
≥ −1.8 | 3428 | 3123 | 1027 | 24 | 72
≥ −1.5 | 2434 | 2172 | 721 | 21 | 61

In [22], a supervised learning approach to predict physical interactions between HIV-1 and human proteins using an SVM was proposed. The SVM was trained with datasets retrieved from the databases as positive samples and with random protein pairs that are not known to interact as negative samples. Here, the authors did not use the HHPID datasets as positive samples; rather, they collected 1028 HIV-1-human PPIs as positive samples from four other public databases: the Biomolecular Interaction Network Database (BIND), the Database of Interacting Proteins (DIP), IntAct⁹ and Reactome¹⁰. The performance of the model was evaluated by computing the area under the precision-recall curve (PR-AUC score), where a high AUC score indicates good performance of the predictor. Different combinations of features such as domains, protein sequence 4-mers and network properties of the human interaction network were chosen to analyze the performance of the SVM, and the model trained with all the feature sets was found to provide the highest AUC score. Moreover, different ratios of positive samples to negative samples (PS:NS = 1:25, 1:50 and 1:100) were used to prepare the complete list of predicted interactions. The predicted interactions were compared with the HDFs recognized by Brass et al. [16], and the interactions involving human proteins reported as HDFs were identified. The result found with this method is given in Table 6.

Table 6: Result Found Using Supervised Learning Method By Dyer et al. [22]

Ratio of Positive to Negative Samples (PS:NS) | Number of Predicted Interactions | Number of Predicted Interactions Involving HDFs | PR-AUC Score
1:25 | 1111 | 46 | 0.707
1:50 | 506 | 33 | 0.630
1:100 | 182 | 16 | 0.505

The classifier based approach for predicting PPIs requires both positive and negative samples for training and testing purposes.
Positive samples are readily available from interaction databases; however, there is no such resource for non-interacting protein pairs. Protein pairs that are not known to interact, i.e., not present in the interaction database, are therefore sampled at random and treated as negative samples, on the assumption that such random pairs are unlikely to interact physically. This assumption may not always hold. Hence, selecting negative samples from random protein pairs that are not known to interact is the more challenging task when predicting a potential set of HIV-1-human PPIs with a classifier based approach.

⁹ http://www.ebi.ac.uk/intact/
¹⁰ http://www.reactome.org/

In 2012, a conformal prediction framework was proposed to predict novel interactions between HIV-1 and human proteins and to estimate a confidence for each of the predicted interactions [23]. The conformal predictor was used to deal with only one class, labeled with interacting proteins, while no set of non-interacting protein pairs was explicitly defined. All the protein pairs that are not known to interact formed the ‘background set’ of unlabeled examples. The algorithm was based on the ‘exchangeability’ assumption of data, i.e., any of the undiscovered interactions is equally likely to be discovered next, and on the relative ‘strangeness’ of protein pairs. A protein pair is strange with respect to others if it has a very small or very large value for one of its features. The authors applied the algorithm to 1063 interacting protein pairs retrieved from HHPID along with 353778 possible protein pairs with unknown labels. A p-value was assigned to each protein pair, with a lower p-value indicating that the interaction is less likely. To obtain a prediction list with a certain confidence level (γ), all the protein pairs with a p-value of at least (1 − γ) need to be included in the list. The number of predictions found for some of the p-values is given in Table 7.
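The way a p-value can be derived from strangeness scores may be illustrated with a minimal sketch. It assumes a deliberately simple strangeness measure (the distance of a pair's feature vector from the centre of the known interacting pairs); this measure, the function names and the toy feature vectors are illustrative assumptions and are not the ones used in [23].

```python
import math

def strangeness(features, center):
    # Illustrative nonconformity measure (an assumption): Euclidean distance
    # from the centre of the known interacting pairs; larger means stranger.
    return math.sqrt(sum((f - c) ** 2 for f, c in zip(features, center)))

def conformal_p_value(candidate, known_pairs):
    # Share of examples (known pairs plus the candidate itself) that are
    # at least as strange as the candidate.
    dim = len(candidate)
    center = [sum(p[i] for p in known_pairs) / len(known_pairs) for i in range(dim)]
    cand_score = strangeness(candidate, center)
    as_strange = sum(1 for p in known_pairs if strangeness(p, center) >= cand_score)
    return (as_strange + 1) / (len(known_pairs) + 1)

def prediction_list(background, known_pairs, min_p_value):
    # Keep the unlabeled pairs whose p-value reaches the chosen threshold.
    return [name for name, feats in background.items()
            if conformal_p_value(feats, known_pairs) >= min_p_value]

# Toy data: four known interacting pairs and two unlabeled background pairs.
known = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
background = {"pair_typical": (0.5, 0.5), "pair_strange": (5.0, 5.0)}
selected = prediction_list(background, known, min_p_value=0.5)
```

In this toy setting the pair close to the known interactions receives a high p-value and is retained, whereas the outlying pair is excluded.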
The prediction set includes more interactions than the one obtained from the RF classifier model by Tastan et al. [20]. Moreover, a large overlap with the ‘siRNA’ and ‘virion’ datasets was observed, which suggests that the conformal method yields a larger number of potential interactions than the RF classifier model.

Table 7 Result Found Using Conformal Prediction Framework By Nouretdinov et al. [23]

Conformal     Number of      Number of Novel   Number of Predictions   Overlap      Overlap
Prediction    Predicted      Interactions      with RF score ≥ 0       with siRNA   with Virion
p-value       Interactions   Predicted         [20]
≥ 0.95        295            241               267                     NA           NA
≥ 0.90        711            604               548                     11           38
≥ 0.80        2398           2185              1156                    26           78
≥ 0.50        19521          18988             2376                    85           173

In [24], the author suggested PWEN-TLM, with SVM as the classifier, to discover novel interactions between HIV-1 and human proteins. This model addressed three major difficulties in the computational prediction of PPIs: data scarcity, data unavailability, and negative data sampling. Homolog GO information was utilized to deal with data scarcity and data unavailability. To validate the effectiveness of using homolog GO information for model training and testing, three experimental settings were developed: i) an optimistic case, which assumed that target GO information was available for both training and testing data, ii) a moderate case, which assumed that target GO information was not available for test data, and iii) a pessimistic case, which did not consider target GO information for either training or testing data. Regarding negative data sampling, the author constructed two sets of negative data: one by random sampling of protein pairs that are not known to interact, as previously done by other researchers, and the other with the exclusion of subcellular co-localized proteins.
The exclusion of subcellular co-localized proteins from the negative samples was based on the fact that subcellular co-localized protein pairs are more likely to interact physically. For the experiment, a total of 3638 PPIs retrieved from HHPID were considered as positive samples, and an equal number of negative samples excluding subcellular co-localized protein pairs were extracted to form the dataset D1. Similarly, the dataset D2 was created with the same positive samples and an equal number of randomly sampled negative samples. The Receiver Operating Characteristic - Area Under Curve (ROC-AUC) metric was used to measure the significance of homolog GO information, while the Matthews correlation coefficient (MCC) and accuracy were used to evaluate the effectiveness of excluding subcellular co-localized proteins. The values of the different metrics obtained for the different cases and datasets are shown in Table 8. The relatively small difference in ROC-AUC score between the optimistic and pessimistic cases indicated that homolog GO information could be a good substitute where target GO information is not available. Moreover, dataset D1 was found to have a better predictive balance than dataset D2, with the largest MCC difference being 0.1053. Therefore, excluding subcellular co-localized proteins is the more reliable way of constructing the negative dataset. Another important aspect of the classifier based approach is the choice of the ratio of positive to negative samples. In [22], Dyer et al. achieved their highest PR-AUC score of 0.707 for the 1:25 ratio (Table 6). In [24], the author took the ratio of positive to negative samples as 1:1 and obtained a higher PR-AUC score, which showed that skewed training data might produce a biased model.
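For readers less familiar with the two balance metrics, the following minimal sketch computes accuracy and MCC from the four cells of a confusion matrix using their standard definitions. The counts in the usage lines are hypothetical and are not taken from [24].

```python
import math

def accuracy(tp, tn, fp, fn):
    # Fraction of all protein pairs that were classified correctly.
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient: ranges from -1 to +1 and, unlike
    # accuracy, stays informative when one class dominates.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when any marginal count is zero
    return (tp * tn - fp * fn) / denominator

# Hypothetical counts for a balanced test set of 100 positives and 100 negatives.
acc_value = accuracy(tp=90, tn=80, fp=20, fn=10)
mcc_value = mcc(tp=90, tn=80, fp=20, fn=10)
```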
Table 8 Performance Metric Score of PWEN-TLM [24]

                    Dataset D1                                Dataset D2
                    ROC-AUC   PR-AUC   MCC      Accuracy      ROC-AUC   PR-AUC   MCC      Accuracy
Optimistic Case     0.9326    0.9361   0.7446   85.62%        0.9005    0.8989   0.6393   82.41%
Moderate Case       0.8155    0.8172   0.4606   66.22%        0.7661    0.7480   0.4258   63.63%
Pessimistic Case    0.8735    0.8799   0.6605   80.22%        0.8158    0.8478   0.6188   77.43%

The author validated the model on the 180 interactions predicted in [28], since recent literature evidence exists for some of these predictions (Section 3, Figure 4a). Moreover, 80 of these 180 interactions were found to be common with the predictions made by Tastan et al. [20] (Figure 5). Using the 180 interactions as a test set without overlap with the training data, PWEN-TLM predicted 132 interactions in the optimistic case and 165 interactions in the pessimistic case. Among the 132 predictions, 46 interactions were also predicted in [20], and among the 165 predictions, 61 were. Since the optimistic and pessimistic PWEN-TLM performed better according to the ROC-AUC score, the author considered only these two cases for validation. In addition, the author tried to find novel interactions using PWEN-TLM with HIV-targeted human proteins as test candidates. In this case, the model predicted 718 interactions in the optimistic case and 61 interactions in the pessimistic case.

3.2 Based on Structural Similarity

In [25], Doolittle et al. proposed a computational approach to predict HIV-1-human PPIs based on the structural similarity of HIV-1 and human proteins. They first retrieved the human proteins having significant structural similarity with HIV-1 proteins using the Dali database¹¹, which comprises 3D structure comparisons of all protein structures present in PDB¹².
PDB contains the published crystal structures of proteins, which cover most of the HIV-1 proteins (PR, RT, IN, CA, MA, NC, Gag p2, gp120, gp41, Nef, Tat, Vpr, and Vpu); however, structures for many human proteins are not available in PDB. The HIV-1 proteins identified as having high structural similarity with at least one human protein are gp41, gp120, CA, MA, p2, PR, IN, RT, and Vpr, and the human proteins having structural similarity with HIV-1 proteins are referred to as HIV-similar proteins. The next step was to identify, from HPRD¹³, the intra-human PPIs in which HIV-similar proteins are known to participate. The prediction approach is based on the assumption that HIV-1 proteins are likely to take part in the same interactions as their human, HIV-similar counterparts, which allows them to attach to the host cell protein interaction network. The predicted interactions were filtered for functional evidence and biological relevance. The authors retained those predicted interactions in which the target proteins (human proteins with which HIV-similar proteins are known to interact) satisfy at least one of two criteria: i) they impair HIV-1 infection or replication according to siRNA or shRNA screens, ii) they are present in HIV-1 virions. The filter based on these two types of dataset is referred to as “Literature Filters”. After application of this filter, the prediction set consisted of a total of 2143 interactions, among which 62 were verified as true positive interactions based on the datasets retrieved from the host-pathogen interaction databases HHPID and PIG¹⁴. A total of 347 human proteins were predicted to have a similar structure to at least one HIV-1 protein, and 406 unique human proteins were predicted to potentially interact with HIV-1.

¹¹ http://ekhidna.biocenter.helsinki.fi/dali/start
¹² http://www.rcsb.org/
The quality of the prediction set was further improved by considering protein co-localization, which requires both the HIV-1 protein and the target human protein to be present in the same location within the cell according to GO cellular component (CC) annotation. The number of unique interactions in this refined list of predictions is 502, among which 31 interactions are known to exist experimentally. In this list, 189 HIV-similar proteins with 137 different known binding partners were encountered. Application of this filter not only reduced the set of likely interactions, but also increased the percentage of true interactions from ~3% to ~6%. Moreover, gp41 was found to have the most predicted interactions. This is expected, because a large number of GO cellular component terms are annotated to gp41 and it is found in more parts of the cell, which increases the probability of satisfying the co-localization criterion with a larger number of human proteins. This method of prediction is based entirely on protein structures, so different structures for a single protein may produce different predictions about its interactions. Hence, some predictions are lost if the prediction is done at the gene level. The prediction done at the gene level produced 265 interactions after the CC filters. The summary of the result is tabulated in Table 9. The predicted set of interactions was examined from functional and biological aspects based on GO annotations. The properties of the human proteins that were predicted to interact with HIV-1 were inspected using biological process and molecular function GO terms. They showed a significant enrichment in the processes of protein and nucleic acid transport, cell death, and post-translational modification; HIV-1 proteins are known to alter or manipulate these processes during infection. In addition, it has been observed that the predicted interactions are supported by various studies.
Moreover, the interactions predicted by structural similarity were compared with the result found by Tastan et al. [20]. 10% of the predicted interactions are common to both studies.

¹³ http://www.hprd.org/
¹⁴ http://pathogenportal.net/pig/

Table 9 Result Found Using Structural Similarity Based Approach By Doolittle et al. [25]

                                               Predictions Done at                  Predictions Done at
                                               Processed Proteins Level             Gene Level
                                               After         After Literature       After         After Literature
                                               Literature    and CC Filters         Literature    and CC Filters
                                               Filter                               Filter
Number of predicted interactions               2143          502                    883           265
Number of true positive interactions
  predicted, verified from HHPID and PIG       62            31                     56            22
Number of human proteins found having
  structural similarity with HIV-1 proteins    347           189                    NA            NA
Number of human proteins found to interact
  with HIV-1 proteins                          406           137                    NA            NA

In [25], the authors used the Dali database for protein structure comparison, which considers only the geometric structural information of the protein. Therefore, to incorporate evolutionary relationships among proteins along with the structural alignments, Zhao et al. proposed a novel evolution-aware structural alignment method (Unialign) to predict the possible interacting protein partners of HIV-1 proteins [26]. They considered only the HIV-1 protein gp41 in their experiment and predicted 922 unique human target proteins potentially interacting with gp41. Since they considered only direct physical interactions in their experiment, they validated their predictions against the direct interactions reported in HHPID. A total of 15 of these predicted interactions were among the 68 experimentally validated interactions. They also compared the performance of the Dali database and the Unialign method, which showed that Unialign is more effective than Dali at finding structurally similar proteins.
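The transfer of interactions from HIV-similar human proteins to HIV-1 proteins, followed by a co-localization filter, can be sketched as follows. This is a simplified illustration of the idea only: the dictionaries, the placeholder protein names (H1, T1, ...) and the function name are hypothetical, and the sketch does not reproduce the Dali similarity search or the literature filters of [25].

```python
def predict_by_structural_similarity(similar, human_ppi, colocalized=None):
    # similar:     HIV-1 protein -> set of HIV-similar human proteins
    # human_ppi:   human protein -> set of its known human binding partners
    # colocalized: optional set of (HIV-1 protein, human protein) pairs that
    #              share a cellular component; None disables the filter
    predictions = set()
    for hiv_protein, similar_humans in similar.items():
        for human in similar_humans:
            # The HIV-1 protein is assumed to inherit the partners of its
            # structurally similar human counterpart.
            for target in human_ppi.get(human, ()):
                if colocalized is None or (hiv_protein, target) in colocalized:
                    predictions.add((hiv_protein, target))
    return predictions

# Placeholder data: H* are HIV-similar human proteins, T* are their partners.
similar = {"gp41": {"H1"}, "CA": {"H2"}}
human_ppi = {"H1": {"T1", "T2"}, "H2": {"T2"}}
unfiltered = predict_by_structural_similarity(similar, human_ppi)
filtered = predict_by_structural_similarity(similar, human_ppi,
                                            colocalized={("gp41", "T1")})
```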
3.3 Based on ARM Technique

This type of approach mines association rules among viral or human proteins from the known interactions of HIV-1 proteins with human proteins found in a PPI database. Subsequently, the generated association rules with high confidence are used to predict the potential set of viral-host PPIs. This methodology relies only on the information about interacting protein pairs, thus avoiding the negative sample selection problem faced by the classifier based approach. Figure 6 shows the pipeline of applying ARM in predicting novel interactions. An itemset can be defined as a set of human proteins interacting with a particular HIV-1 protein, or vice versa. Frequent itemsets are those itemsets that satisfy a minimum support threshold, i.e., a set of human proteins is a frequent itemset if its members are known to interact with a minimum number of HIV-1 proteins, and vice versa. From the frequent itemsets, association rules are generated. The subsequent subsections explain the published methods for extracting association rules from experimentally validated interactions between HIV-1 and human proteins, and discuss the predictions made from these association rules along with the predicted results.

Figure 6: Pipeline of Applying ARM in Prediction of PPI

3.3.1 ARM Using Apriori Algorithm

In [27], the authors extracted association rules among human proteins that are known to interact with HIV-1 proteins using the well-known apriori algorithm. The information about the PPIs between HIV-1 and human proteins found in HHPID was organized into a binary adjacency matrix with rows representing the viral proteins and columns representing the human proteins. They used 1288 interactions (direct or indirect) between 17 HIV-1 proteins and 773 human proteins to predict new viral-host interactions. The dimension of the matrix is 17 × 773.
An entry of 1 in the matrix indicates that the corresponding HIV-1 and human protein pair is known to interact, and 0 indicates that the database contains no information about an interaction between the corresponding HIV-1 and human protein pair. To apply the apriori algorithm to this matrix, each row (HIV-1 protein) was treated as a transaction and each column (human protein) was treated as an item; applying the algorithm yielded the frequent itemsets. From these frequent itemsets, association rules among human proteins satisfying a minimum threshold, called minimum confidence, were extracted. Here, the authors were interested in finding association rules with a single consequent. Thus, an extracted rule can be represented as {HP1, HP2, HP3, HP4 ⇒ HP5}, with HPi (i = 1, 2, 3, 4, 5) denoting human proteins. This rule can be interpreted as follows: if the human proteins HP1, HP2, HP3 and HP4 physically interact with a set of HIV-1 proteins, then the human protein HP5 possibly interacts with the same set of HIV-1 proteins. Hence, each rule obtained from the matrix was associated with a set of HIV-1 proteins for which the rule was valid. The possible viral-host interactions were then predicted from the extracted association rules with high confidence. Suppose the antecedent of the rule {HP1, HP2, HP3, HP4 ⇒ HP5} is true for 10 viral proteins (VP1, VP2, VP3, ..., VP10) and the consequent of the rule is valid for the first 8 of these HIV-1 proteins (VP1, VP2, ..., VP8). The support of the rule is then 8 and its confidence is 80% (support of the rule / support of the antecedent = 8 out of 10), which can be considered high. Therefore, two new interactions can be predicted from this rule: VP9 ↔ HP5 and VP10 ↔ HP5. Rules with 100% confidence were not considered in the prediction process, since no new interaction can be predicted from them.
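The worked example above can be reproduced with a short sketch. It evaluates one given rule against a toy interaction table and lists the interactions the rule would predict; it does not implement the apriori search for frequent itemsets itself, and the function name and the toy data are illustrative assumptions.

```python
def evaluate_rule(transactions, antecedent, consequent):
    # transactions: viral protein -> set of human proteins it interacts with
    #               (one row of the binary adjacency matrix per viral protein)
    antecedent_rows = [vp for vp, items in transactions.items()
                       if antecedent <= items]
    rule_rows = [vp for vp in antecedent_rows if consequent in transactions[vp]]
    support = len(rule_rows)
    confidence = support / len(antecedent_rows) if antecedent_rows else 0.0
    # Viral proteins that satisfy the antecedent but lack the consequent
    # are the candidates for new interactions.
    predicted = [(vp, consequent) for vp in antecedent_rows
                 if consequent not in transactions[vp]]
    return support, confidence, predicted

# Toy data mirroring the example: the antecedent holds for VP1..VP10,
# the consequent HP5 only for VP1..VP8; VP11 does not satisfy the antecedent.
antecedent = {"HP1", "HP2", "HP3", "HP4"}
transactions = {f"VP{i}": antecedent | {"HP5"} for i in range(1, 9)}
transactions["VP9"] = set(antecedent)
transactions["VP10"] = set(antecedent)
transactions["VP11"] = {"HP1"}

support, confidence, predicted = evaluate_rule(transactions, antecedent, "HP5")
```

A rule with 100% confidence returns an empty prediction list here, which matches the reason given above for excluding such rules.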
With this process, 22 new interactions were predicted from 34 extracted association rules with a single consequent. The HIV-1 proteins involved in the predicted interactions are rev, tat, vif, vpr, gp120, gp160, gp41, protease, nef, matrix, nucleocapsid, p6 and capsid, and the human proteins involved are ACTG1, PRKCA, MAPK1, ACTB, PRKCB1, PRKCQ, MAPK3, PRKCD, PRKCE, CASP3 and CD4. The summary of the result is given in Table 10.

Table 10 Result Found Using Apriori Algorithm By Mukhopadhyay et al. [27]

Minimum Support=40%, Minimum Confidence=70%
Number of extracted association rules among human proteins              34
Number of new interactions predicted                                    22
Number of human proteins involved in the predicted interactions         11
Number of HIV-1 proteins involved in the predicted interactions         13

The resulting set of predictions was compared with the novel predictions made by Tastan et al. in [20], described in Section 3.1. Among the 22 interactions predicted by this apriori approach, 14 were shared by both studies.

3.3.2 ARM Using Biclustering Approach

In [28], the authors suggested a novel approach for predicting PPIs based on ARM using a biclustering method that allows simultaneous clustering of the rows and columns of a matrix. They extracted association rules among HIV-1 proteins as well as among human proteins using the information about known interactions between HIV-1 and human proteins. These rules were used to predict new viral-host PPIs that are not present in the database. The information about the PPIs found in HHPID was arranged in the same way as in [27]. The authors considered 19 HIV-1 proteins and 1432 human proteins to form the matrix HV, with rows indicating the human proteins and columns indicating the HIV-1 proteins. The transpose of the matrix HV, denoted by VH, with size 19 × 1432, was also computed.
In both matrices HV and VH, each row was treated as a transaction and each column was treated as an item in order to apply the algorithm for obtaining maximal biclusters (biclusters that are not a proper subset of any other bicluster). The set of items of each maximal all-1 bicluster satisfying the minimum threshold value constitutes an FCI. Thus, after applying the BiMax algorithm (Prelic et al., 2006) to the matrices HV and VH individually, all maximal biclusters were obtained and hence all FCIs, which are a condensed (non-redundant, minimal) representation of all frequent itemsets. The FCIs extracted from the matrix HV gave the association rules among the viral proteins, and the association rules among the human proteins were obtained from the FCIs found from the VH matrix. Each rule generated from the matrix HV was associated with a set of human proteins for which the rule is valid. Similarly, a set of HIV-1 proteins was associated with each rule extracted from the matrix VH. The representation and interpretation of the generated rules are the same as described in Section 3.3.1. The association rules extracted from both matrices were filtered to obtain the set of most useful association rules by removing less-confident and redundant rules. A rule R2 is called redundant if there exists another rule R1 with the same consequent as R2 such that the antecedent of R2 is a proper subset of the antecedent of R1, while both R1 and R2 have confidence greater than or equal to the minimum threshold. From this high-confident, non-redundant set of association rules, new interactions between HIV-1 and human proteins were predicted based on the interpretation of the association rules. A total of 140 novel interactions were predicted from the HV matrix and 43 from the VH matrix. The step-by-step result is shown in Table 11. Only three interactions were common to the two sets of new predictions from the HV and VH matrices, which reveals the necessity of taking both matrices into account.
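The rule filtering step can be made concrete with a small sketch that applies the confidence threshold and then the redundancy criterion exactly as worded above. The rule representation (antecedent set, single consequent, confidence) and the toy rules are illustrative assumptions, not data from [28].

```python
def filter_rules(rules, min_confidence):
    # rules: list of (antecedent frozenset, consequent, confidence)
    confident = [rule for rule in rules if rule[2] >= min_confidence]
    kept = []
    for antecedent2, consequent2, confidence2 in confident:
        # R2 is redundant if some confident rule R1 has the same consequent
        # and an antecedent that properly contains the antecedent of R2.
        redundant = any(
            consequent1 == consequent2 and antecedent2 < antecedent1
            for antecedent1, consequent1, _ in confident
        )
        if not redundant:
            kept.append((antecedent2, consequent2, confidence2))
    return kept

# Toy rules over placeholder proteins A, B, C, D.
rules = [
    (frozenset({"A"}), "C", 0.80),       # redundant: covered by {A, B} => C
    (frozenset({"A", "B"}), "C", 0.90),  # kept
    (frozenset({"A"}), "D", 0.75),       # kept: different consequent
    (frozenset({"B"}), "C", 0.50),       # dropped: below minimum confidence
]
useful_rules = filter_rules(rules, min_confidence=0.70)
```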
Taking the union of the HV and VH prediction sets yielded 180 unique novel interactions (140 + 43 − 3), which involve 17 HIV-1 proteins (capsid, gp120, gp160, gp41, Tat, integrase, Gag_Pr55, matrix, Nef, nucleocapsid, p6, protease, Rev, RT, Vif, Vpr, Vpu) and 140 human proteins.

Table 11 Result Found Using Biclustering Approach By Mukhopadhyay et al. [28]

                                                         From HV matrix           From VH matrix
                                                         min_support=20           min_support=5
                                                         min_confidence=70%       min_confidence=70%
Number of all-1 maximal biclusters/FCIs generated        48                       74
Number of generated association rules with
  single consequent                                      123                      361
Number of high-confident association rules generated     26                       50
Number of high-confident and non-redundant rules
  generated                                              15                       36
Number of novel interactions predicted                   140                      43
Number of HIV-1/human proteins involved in the
  predicted interactions                                 130 (Human Proteins)     15 (HIV-1 Proteins)

In [29] and [30], the biclustering based ARM approach to predicting PPIs was extended by integrating information on the interaction type (Section 2) and the direction of regulation of the interaction (HIV-1-to-Human or Human-to-HIV-1), which provides valuable additional information regarding the type and regulation pattern of each predicted interaction. There are 68 unique interaction types reported in HHPID, which can be divided into three classes: regulating (direction of regulation is from HIV-1 to human proteins), regulated by (direction of regulation is from human to HIV-1 proteins) and bidirectional (Table 2). In [29], only regulating and bidirectional interactions were considered to predict HIV-1-human PPIs; the authors then extended their research in [30] by considering all three classes of interactions. Each human protein that is known to interact with HIV-1 proteins was annotated with its corresponding interaction type.
A total of 2564 annotated human proteins were obtained under the two classes regulating and bidirectional, and 1271 annotated human proteins under the two classes regulated by and bidirectional. Two matrices, HV_positive of dimension 19 × 2564 and HV_negative of dimension 19 × 1271, with rows representing viral proteins and columns representing annotated human proteins, were computed. The presence of an interaction between the corresponding human and HIV-1 protein pair is indicated by 1 in the HV_positive matrix and by −1 in the HV_negative matrix; in both matrices, × and 0 represent a bidirectional interaction and the absence of an interaction respectively. The algorithm for finding maximal biclusters with a minimum support threshold value was applied to the HV_positive and HV_negative matrices individually to obtain the FCIs, and hence the association rules, in the same way as described previously. The rules generated from HV_positive and HV_negative are of the form of Rule 1 and Rule 2 respectively. Moreover, in [30] the association rules among HIV-1 proteins were also extracted from the maximal biclusters obtained from the transposes of the matrices HV_positive and HV_negative. This type of rule is described by Rule 3.

Rule 1: {HP1_cleaves, HP2_inhibits, HP3_stimulates ⇒ HP4_upregulates, HP5_inhibits}. It can be interpreted as follows: if the human protein HP1 is cleaved, HP2 is inhibited and HP3 is stimulated by some HIV-1 proteins, then there is a possibility that the same set of viral proteins upregulates HP4 and inhibits HP5.

Rule 2: {HP6_activatedby, HP7_enhancedby ⇒ HP8_modifiedby, HP9_inhibitedby}. It can be interpreted as follows: if the human protein HP6 activates and HP7 enhances some HIV-1 proteins, then HP8 and HP9 possibly modify and inhibit the same set of HIV-1 proteins respectively.
Rule 3: {VP1, VP2, VP3 ⇒ VP4, VP5}. It can be interpreted as follows: if the viral proteins VP1, VP2 and VP3 physically interact with some human proteins, then VP4 and VP5 are also likely to interact with the same set of human proteins.

After extracting association rules among human proteins annotated with interaction type, novel interactions between HIV-1 and human proteins, each associated with an interaction type, were predicted. For example, suppose that the antecedent of Rule 1 above is valid for a set of viral proteins (VP1, VP2, VP3, VP4, VP5) and the consequent is true for (VP1, VP2, VP3, VP4). The confidence of the rule is then 80% (4 out of 5). Hence, it can be predicted that the viral protein VP5 is likely to interact with the human proteins HP4 and HP5 with the interaction types ‘upregulates’ and ‘inhibits’ respectively. In a similar way, new interactions were also predicted from rules of the types represented by Rule 2 and Rule 3. In [29], 17 maximal biclusters were found with the minimum threshold value, from which 46 association rules were generated. Among the 46 predicted interactions, 26 were found to be true positives. In [30], 19 biclusters were obtained from the matrix HV_positive, including those that were found in [29], and the matrix HV_negative also generated 19 biclusters. A total of 93 association rules among human proteins and 33 association rules among viral proteins were generated from the two matrices. The total number of new interactions predicted from the biclusters extracted from HV_positive and HV_negative was 114, among which 59 predictions were found to be experimentally validated. Table 12 describes the individual results found from the two matrices. However, the above mentioned approaches modeled the HIV-1-human PPI network as a binary matrix without considering the interaction strengths, and mined association rules only from the complete bicliques (all-1 biclusters).
Table 12 Result Found Using Biclustering Method Incorporating Interaction Type and Regulation Direction by Mukhopadhyay et al. [30]

                                                              From HV_positive   From HV_negative
                                                              Matrix             Matrix
Number of maximal biclusters obtained (Minimum number of
  HIV-1 proteins, Minimum number of human proteins)           19 (4, 2)          19 (3, 2)
Number of association rules generated among human proteins    62                 31
Number of association rules generated among HIV-1 proteins    26                 7
Number of predicted interactions                              64                 50
Number of human proteins involved in the predicted
  interactions                                                31                 32
Number of HIV-1 proteins involved in the predicted
  interactions                                                8                  13
Number of predicted interactions which have been found
  to be experimentally validated                              35                 24

In [31], the authors searched for quasi-bicliques in the weighted interaction graph of HIV-1 and human proteins in order to generate more relevant information about the proteins. A quasi-biclique is equivalent to a bicluster having a small number of zero elements and a large mean interaction strength (MIS), thus relaxing the stringent requirement of having all non-zero entries as in bicliques. Here, the process of finding quasi-bicliques was treated as a biclustering problem in a weighted graph. Since the BiMax algorithm cannot be applied to a weighted graph, a multiobjective genetic algorithm-based biclustering technique (MOBICLUST) was proposed to generate dense biclusters having high MIS from the HIV-1-human bipartite PPI network, where the two sets of nodes represent HIV-1 and human proteins respectively and the edges denote the interactions. The authors categorized the interactions into three types: direct physical interactions, indirect interactions, and novel interactions that were predicted by Tastan et al. [20]. Each interaction was assigned a prediction score in [20], as described in Section 3.1.
Based on these scores, the authors filtered the interactions using a lower threshold of 2.0 and obtained a PPI network with 17 nodes of HIV-1 proteins, 1403 nodes of human proteins and 617 weighted edges of interactions. The application of the MOBICLUST algorithm to this PPI network generated 26 biclusters. After filtering out biclusters with MIS less than 2.0, 14 biclusters were obtained. Due to the large overlap of interactions within these biclusters, the union of the 14 biclusters was taken, which represents a strong interaction module. This interaction module involves 7 HIV-1 proteins (gp120, gp160, gp41, matrix, nef, RT, tat) and 15 human proteins (AP2B1, CALM1, CALM3, CD4, CXCR4, LCK, MAPK1, PRKACA, PRKCB1, PRKCD, PRKCE, PRKCG, PRKCI, PRKCQ, TP53). The number of interactions contained in this module is 75, out of which 57 are direct physical interactions.

3.3.3 Integration of ARM with Biclusters Based on Closure Lattice

In [32], the authors proposed the algorithm FIST (Frequent Itemset mining using Suffix Trees), which utilizes the concept of the subset lattice in ARM. The algorithm combines the generation of FCIs, of generators for each FCI, of a minimal non-redundant cover of association rules, and of hierarchical conceptual biclusters into a single process. It was applied to the information about known PPIs along with some biological and bibliographical annotations to extract association rules and predict new interactions between HIV-1 and human proteins [33]. By applying FIST, the authors obtained knowledge patterns that represent three kinds of relationships: among HIV-1 proteins, among human proteins, and between HIV-1 and human proteins.

1. VP1, VP2, ..., VPm ⇔ HP1, HP2, ..., HPn
2. VP1, VP2, ..., VPm ⇒ VPm+1, VPm+2, ..., VPm+n
3. HP1, HP2, ..., HPm ⇒ HPm+1, HPm+2, ..., HPm+n

where VPi (i = 1, 2, ..., m, m + 1, ..., m + n) and HPi (i = 1, 2, ..., m, m + 1, ..., m + n) denote viral and human proteins respectively.
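The link between closed itemsets and patterns of the first kind can be illustrated with a brute-force sketch. It enumerates every frequent closed set of viral proteins together with the human proteins that support it, so that each result is one conceptual bicluster (VP set ⇔ HP set). This exhaustive enumeration is an illustrative assumption suitable only for tiny inputs; it is not the suffix-tree based procedure of FIST, and the names and toy data are hypothetical.

```python
from itertools import combinations

def closed_itemsets(transactions, min_support):
    # transactions: object (human protein) -> set of items (viral proteins)
    # Returns: closed itemset -> set of objects supporting it.
    items = sorted(set().union(*transactions.values()))
    concepts = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            objects = frozenset(obj for obj, row in transactions.items()
                                if set(combo) <= row)
            if len(objects) < min_support:
                continue
            # Closure: the items shared by every supporting object.
            closure = frozenset(set.intersection(
                *(set(transactions[obj]) for obj in objects)))
            concepts[closure] = objects
    return concepts

# Toy data: three human proteins (rows) and two viral proteins (items).
toy = {"HP1": {"VP1", "VP2"}, "HP2": {"VP1", "VP2"}, "HP3": {"VP1"}}
concepts = closed_itemsets(toy, min_support=2)
```

In the toy data the itemset {VP2} is not closed because its closure is {VP1, VP2}, which corresponds to a rule of the second kind, VP2 ⇒ VP1, holding for HP1 and HP2.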
The first input dataset to the algorithm was a matrix with 1433 human proteins as rows and 19 HIV-1 proteins as columns. A cell was marked 1 when the corresponding HIV-1 and human proteins interact, and with a question mark for no interaction. Additional attributes were then merged into this matrix to form a second dataset that carries, alongside the viral-host interactions, the GO annotations and bibliographic references of each human protein (row). These annotations are nominal data comprising 1149 unique GO terms taken from the GO website and 2670 unique publications collected from the NCBI website. Both FIST and apriori were applied to the first dataset for performance comparison. FIST generated far fewer association rules than apriori, which lets analysts focus on the relevant rules. Unlike apriori, FIST can also generate biclusters and provide the set of valid objects (human proteins) for each association rule. Apriori could not be executed on the second dataset because the frequent itemsets it generated consumed too much memory. Since FIST mines FCIs rather than frequent itemsets, it performed well on the second dataset. A total of 1346 unique HIV-1-human protein pairs were predicted to interact by the FIST approach. The authors validated their predicted interactions by comparing them with those predicted by Tastan et al. using the supervised approach (Section 3.1). Figure 7 depicts the interacting protein pairs predicted by Tastan et al. that are covered by at least one FCI generated by FIST.

Figure 7: Overlapping of Supervised Learning Approach by Tastan et al. [20] and Integrated Biclustering Approach by FIST [33]

Table 13: List of URLs Providing Prediction Method and Predicted PPI Dataset

| References | Prediction Method/Supplementary Materials (†) and PPI Prediction Datasets (‡) |
|---|---|
| Supervised approach by Tastan et al. [20] | †http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3263379/bin/NIHMS345575supplement-supplement.pdf |
| | ‡http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3263379/bin/NIHMS345575supplement-predictions.xls |
| Semi-supervised approach by Tastan et al. [21] | ‡http://www.cs.cmu.edu/~qyj/HIVsemi/HIVPPI.embOmodel.all.ave.cutList.cvs |
| Supervised approach by Dyer et al. [22] | ‡http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3134873/bin/NIHMS289876supplement-02.txt |
| | ‡http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3134873/bin/NIHMS289876supplement-03.txt |
| | ‡http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3134873/bin/NIHMS289876supplement-04.txt |
| Conformal prediction approach by Nouretdinov et al. [23] | ‡http://www.clrc.rhul.ac.uk/people/alex/HIVpsb12/supp_files.zip |
| PWEN-TLM approach by Mei [24] | ‡http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0079606.s005 |
| | ‡http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0079606.s006 |
| Structural similarity based approach by Doolittle et al. [25] | ‡http://www.virologyj.com/content/supplementary/1743-422x-7-82-s4.txt |
| ARM based approach using biclustering method by Mukhopadhyay et al. [28] | ‡http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0032289.s003 |
| ARM based approach using biclustering method incorporating type and direction information by Mukhopadhyay et al. [30] | †http://www.biomedcentral.com/imedia/1211858693119566/supp1.pdf |
| | ‡http://www.biomedcentral.com/imedia/1432252521195668/supp2.zip |
| Closure based integrated biclustering approach (FIST) by Mondal et al. [33] | †http://www.i3s.unice.fr/~pasquier/web/?Research___Softwares___FIST |

4 Conclusion

In this article, we have attempted to outline the significance of interactions between the HIV-1 virus and host proteins in the pathogenesis of the lethal disease HIV/AIDS and, consequently, in designing antiviral drugs that can suppress HIV-1 infection in the human body. A drug's antagonistic structure interferes with different stages of the virus replication cycle. Therefore, the drug discovery process can be improved by predicting new interactions between HIV-1 and human proteins; this has motivated researchers to take different approaches to predicting novel interactions. All methodologies are mainly based on PPIs that have been proved experimentally to exist. However, they use the information on different PPIs in different ways. For example, ARM based approaches utilize only the viral-host PPIs as the backbone for predicting new PPIs, the structural similarity based approach relies entirely on the PPI information among human proteins, whereas classifier based approaches consider viral-host PPIs along with intra-human PPIs represented as features. From this study, we have found that the classifier based approaches predicted a greater number of new interactions than any other approach; additionally, a greater percentage of predicted interactions (23%) was found to be true positives in the supervised approach by Tastan et al. However, the classifier model suffers from the lack of experimentally validated non-interacting protein pairs, although PWEN-TLM was devised to exclude subcellular co-localized protein pairs from negative samples to achieve better performance and derive a more promising set of likely interactions.
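The negative-sample issue raised above can be made concrete with a small sketch. Assuming a set of known interacting pairs and a hypothetical mapping from each protein to its subcellular compartments, putative non-interacting pairs are drawn only from pairs that are neither known to interact nor co-localized. The function name, the toy protein lists and the localization table are illustrative assumptions; this is not the PWEN-TLM procedure itself, only the exclusion idea it builds on.

```python
import random


def sample_negatives(hiv_proteins, human_proteins, positives, localization, n, seed=0):
    """Draw up to n putative non-interacting (HIV-1, human) pairs.

    A pair is a candidate only if it is not a known interaction and the two
    proteins share no compartment in `localization`. Proteins missing from
    `localization` are treated as having no known compartment.
    """
    rng = random.Random(seed)
    candidates = [
        (v, h)
        for v in hiv_proteins
        for h in human_proteins
        if (v, h) not in positives
        and not (localization.get(v, set()) & localization.get(h, set()))
    ]
    rng.shuffle(candidates)
    return candidates[:n]


# Toy data: interactions and compartments are invented for illustration only.
hiv = ["tat", "nef"]
human = ["CD4", "SP1", "ACTB", "LCK"]
positives = {("tat", "SP1"), ("nef", "CD4")}
localization = {
    "tat": {"nucleus"},
    "nef": {"membrane"},
    "CD4": {"membrane"},
    "SP1": {"nucleus"},
    "ACTB": {"cytoplasm"},
    "LCK": {"membrane"},
}

negatives = sample_negatives(hiv, human, positives, localization, n=10)
print(sorted(negatives))
```

In this toy setting the pair (nef, LCK) is dropped from the negative pool even though it is not a known interaction, because both proteins are assigned to the same compartment. Such pairs are plausible but unobserved interactions, so labelling them as negatives would add noise to a classifier's training set.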
Secondly, the structural similarity based approach is not widely applicable, since crystal structures are not available for all human proteins and a single protein can be represented by more than one structure. Therefore, much interest has shifted to ARM based approaches, which use only positive samples for prediction. However, since these approaches do not consider the pattern of non-interaction between proteins, they might suffer from a high rate of false positives.

References

[1] AIDSinfo | Information on HIV/AIDS Treatment, Prevention, and Research, http://www.aidsinfo.nih.gov/education-materials/fact-sheets
[2] UNAIDS 2013 | AIDS by the numbers, http://www.unaids.org/en/media/unaids/contentassets/documents/unaidspublication/2013/JC2571_AIDS_by_the_numbers_en.pdf
[3] WHO | Data and statistics, http://www.who.int/hiv/data/en/
[4] Alimonti, J.B., Ball, T.B. and Fowke, K.R. (2003) ‘Mechanisms of CD4+ T lymphocyte cell death in human immunodeficiency virus infection and AIDS’, Journal of General Virology, Vol. 84, No. 7, pp. 1649–1661.
[5] Wu, Y. (2004) ‘HIV-1 gene expression: lessons from provirus and non-integrated DNA’, Retrovirology, 1:13.
[6] Freed, E.O. (2001) ‘HIV-1 replication’, Somatic Cell and Molecular Genetics, 26(16):13-33.
[7] Frankel, A.D. and Young, J.A.T. (1998) ‘HIV-1: fifteen proteins and an RNA’, Annu. Rev. Biochem., Vol. 67, pp. 1–25.
[8] Karn, J. and Stoltzfus, C.M. (2012) ‘Transcriptional and Posttranscriptional Regulation of HIV-1 Gene Expression’, Cold Spring Harbor Perspectives in Medicine, 4:a006916.
[9] Malim, M.H. and Emerman, M. (2008) ‘HIV-1 accessory proteins–ensuring viral survival in a hostile environment’, Cell Host and Microbe, Vol. 3, No. 6, pp. 388–398.
[10] Freed, E.O. (1998) ‘HIV-1 gag proteins: diverse functions in the virus life cycle’, Virology, Vol. 251, No. 1, pp. 1–15.
[11] Suñé, C. and García-Blanco, M.A. (1995) ‘Sp1 transcription factor is required for in vitro basal and Tat-activated transcription from the human immunodeficiency virus type 1 long terminal repeat’, Journal of Virology, Vol. 69, No. 10, pp. 6572–6576.
[12] Dorr, P., Westby, M., Dobbs, S., Griffin, P., Irvine, B. et al. (2005) ‘Maraviroc (UK-427,857), a potent, orally bioavailable, and selective small-molecule inhibitor of Chemokine receptor CCR5 with broad-spectrum anti-Human Immunodeficiency Virus Type 1 activity’, Antimicrobial Agents and Chemotherapy, Vol. 49, No. 11, pp. 4721–4732.
[13] HIV/AIDS, NIAID, NIH, http://www.niaid.nih.gov/topics/HIVAIDS/
[14] Fu, W., Sanders-Beer, B., Katz, K., Maglott, D. and Pruitt, K. (2009) ‘Human immunodeficiency virus type-1, human protein interaction database at NCBI’, Nucleic Acids Research, Vol. 37, Database Issue, pp. D417–D422.
[15] Ptak, R.G., Fu, W., Sanders-Beer, B.E., Dickerson, J.E., Pinney, J.W. et al. (2008) ‘Cataloguing the HIV Type 1 Human Protein Interaction Network’, AIDS Research and Human Retroviruses, Vol. 24, No. 12, pp. 1497–1502.
[16] Brass, A.L., Dykxhoorn, D.M., Benita, Y., Yan, N., Engelman, A. et al. (2008) ‘Identification of host proteins required for HIV infection through a functional genomic screen’, Cell, Vol. 135, pp. 49–60.
[17] Pinney, J.W., Dickerson, J.E., Fu, W., Sanders-Beer, B.E., Ptak, R.G. and Robertson, D.L. (2009) ‘HIV-host interactions: a map of viral perturbation of the host system’, AIDS, Vol. 23, No. 5, pp. 549–554.
[18] Ott, D.E. (2008) ‘Cellular proteins detected in HIV-1’, Reviews in Medical Virology, Vol. 18, No. 3, pp. 159–175.
[19] Park, B. and Han, K. (2006) ‘Web service for predicting interacting proteins and application to human and HIV-1 Proteins’, Computational Intelligence and Bioinformatics, Vol. 4155, pp. 631–640.
[20] Tastan, O., Qi, Y., Carbonell, J. and Klein-Seetharaman, J. (2009) ‘Prediction of interactions between HIV-1 and Human proteins by information integration’, Pacific Symposium on Biocomputing, 2009:516–527.
[21] Qi, Y., Tastan, O., Carbonell, J., Klein-Seetharaman, J. and Weston, J. (2010) ‘Semi-supervised multi-task learning for predicting interactions between HIV-1 and human proteins’, Bioinformatics, Vol. 26, No. 18, pp. i645–i652.
[22] Dyer, M., Murali, T. and Sobral, B. (2011) ‘Supervised learning and prediction of physical interactions between human and HIV proteins’, Infection, genetics and evolution: journal of molecular epidemiology and evolutionary genetics in infectious diseases, Vol. 11, No. 5, pp. 917–923.
[23] Nouretdinov, I., Gammerman, A., Qi, Y. and Klein-Seetharaman, J. (2012) ‘Determining confidence of predicted interactions between HIV-1 and human proteins using conformal method’, Pacific Symposium on Biocomputing, 2012:311–322.
[24] Mei, S. (2013) ‘Probability weighted ensemble transfer learning for predicting interactions between HIV-1 and human proteins’, PLoS ONE, 8(11):e79606.
[25] Doolittle, J.M. and Gomez, S.M. (2010) ‘Structural similarity-based predictions of protein interactions between HIV-1 and Homo sapiens’, Virology Journal, 7:82.
[26] Zhao, C. and Sacan, A. (2013) ‘Prediction of HIV-1 and human protein interactions based on a novel evolution-aware structure alignment method’, BIOCOMP 2013.
[27] Mukhopadhyay, A., Maulik, U., Bandyopadhyay, S. and Eils, R. (2010) ‘Mining association rules from HIV-Human protein interactions’, International Conference on Systems in Medicine and Biology (ICSMB), IEEE, pp. 344–348.
[28] Mukhopadhyay, A., Maulik, U. and Bandyopadhyay, S. (2012) ‘A novel biclustering approach to association rule mining for predicting HIV-1-Human protein interactions’, PLoS ONE, 7:e32289.
[29] Mukhopadhyay, A., Ray, S. and Maulik, U. (2012) ‘Predicting annotated HIV-1-Human PPIs using a biclustering approach to association rule mining’, Third International Conference on Emerging Applications of Information Technology (EAIT), IEEE, pp. 28–31.
[30] Mukhopadhyay, A., Ray, S. and Maulik, U. (2014) ‘Incorporating the type and direction information in predicting novel regulatory interactions between HIV-1 and human proteins using a biclustering approach’, BMC Bioinformatics, Vol. 15, No. 26.
[31] Maulik, U., Mukhopadhyay, A., Bhattacharyya, M., Kaderali, L., Brors, B., Bandyopadhyay, S. and Eils, R. (2012) ‘Mining quasi-bicliques from HIV-1-human protein interaction network: a multiobjective biclustering approach’, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 10, No. 2, pp. 423–435.
[32] Mondal, K.C., Pasquier, N., Mukhopadhyay, A., Maulik, U. and Bandyopadhyay, S. (2012) ‘A new approach for association rule mining and bi-clustering using formal concept analysis’, Machine Learning and Data Mining in Pattern Recognition, Springer, Vol. 7376, pp. 86–101.
[33] Mondal, K.C., Pasquier, N., Mukhopadhyay, A., Pereira, C.C., Maulik, U. and Tettamanzi, A. (2012) ‘Prediction of protein interactions on HIV-1-human PPI data using a novel closure-based integrated approach’, Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms, INSTICC, pp. 164–173.