A Comprehensive Review of HIV-1 and Human Protein-Protein Interaction Prediction

REVIEW ARTICLE
Debasmita Pal and Kartick Chandra Mondal
Department of Information Technology,
Jadavpur University,
Kolkata - 700032, West Bengal, India
E-mails: [email protected], [email protected]
Abstract: Human Immunodeficiency Virus-Type 1 (HIV-1), the etiologic
agent of AIDS, has been the centre of attention of virologists in recent times
due to its life-threatening nature and epidemic spread throughout the globe. The
virus infects host cells for replication by exploiting a complex interaction
network of HIV-1 and human proteins and progressively destroys the
human immune system, gradually leading to death. Antiviral drugs are designed
using information on viral-host protein-protein interactions (PPIs), so that
viral replication and infection can be prevented. Therefore, the prediction
of novel interactions based on experimentally validated interactions that are
curated in public PPI databases could help in discovering new therapeutic targets.
In this article, an overview of HIV-1 proteins and their role in virus replication
and pathogenesis is given, followed by a discussion of the different types
of antiretroviral drugs and the HIV-1-human PPI database. Thereafter, we
present a brief explanation of the different approaches adopted to predict new PPIs
and their predicted results, along with the overlap of the interactions predicted by
different studies.
Keywords: HIV-1 Proteins, Antiretroviral Drugs, HIV-1, Human Protein
Interaction Database (HHPID), Interactions Prediction, Association Rule Mining
1 Introduction
In recent years, our society has been highly perturbed by some virulent viruses; one of them
is Human Immunodeficiency Virus - Type 1 (HIV-1), which gradually ruins the immune system
of the human body, making it susceptible to infections and diseases. The terminal stage of HIV-1
infection is Acquired Immunodeficiency Syndrome (AIDS), which eventually becomes fatal
[1]. According to the report published by the Joint United Nations Programme on HIV/AIDS and
the World Health Organization, the worldwide annual estimate of deaths caused by HIV/AIDS
was approximately 2.3 million in 2005 and approximately 1.6 million in 2012 [2].
Although AIDS-related mortality is decreasing steadily due to antiretroviral therapy,
HIV/AIDS is still a global pandemic. Around 35.3 million people are now living with HIV or
AIDS worldwide [3]. The anti-HIV drugs discovered so far can only suppress the
early stages of HIV-1 infection to some extent rather than cure it. Further, no vaccine has
been discovered yet. Therefore, research on HIV-1 pathogenesis and the improvement of
antiviral treatment is ongoing and is one of the most challenging areas of medical science.
HIV-1 is a retrovirus, a member of a family of RNA viruses that cannot grow or reproduce on
their own without a living host cell, since they carry an RNA genome rather than DNA; it
invades the host cell in order to produce progeny. Mostly the vital cells of the human immune
system, such as CD4+ T cells, macrophages and dendritic cells, are infected by HIV-1 [4]. It
enters the cell cytoplasm by binding to the receptor (CD4) and coreceptors (CXCR4, CCR5) on
the host cell surface, followed by fusion with the cellular membrane. Upon entry, a DNA copy
of the viral RNA genome is produced by the HIV-1 enzyme Reverse Transcriptase; this
process is called Reverse Transcription. The resulting viral DNA is eventually imported into
the cell nucleus as part of the pre-integration complex and integrated into the cellular DNA by
the viral enzyme Integrase [5]. The integrated viral DNA is referred to as the provirus, which
may enter a clinical latency stage during which no signs or symptoms of HIV infection
are noticeable. Alternatively, the provirus may be transcribed
to create new RNA genomes and viral proteins that are assembled and released from the
cell as new virions [6]. To accomplish all this, that is, to enter the host cell and gain
control over host cell processes, HIV-1 proteins play an important role through a complex
network of molecular reactions that includes virus-host protein-protein interactions (PPIs).
Therefore, knowledge of the PPIs between HIV-1 and human cellular proteins is the key to
understanding HIV-1 replication and pathogenesis, and it marks the beginning of a
new era in designing therapeutics and optimizing therapy.
The single-stranded RNA genome of HIV-1 consists of nine genes that encode
the structural proteins along with two regulatory and four accessory (auxiliary) proteins
(Figure 1) [7]. The three major genes gag, pol and env first encode polyprotein precursors
that are processed to produce the structural proteins of the mature virus particle. Regulatory
and accessory proteins are also essential for viral replication and for causing disease [8, 9].
Table 1 describes the function of each HIV-1 protein in viral replication [6, 10, 11].
HIV-1 viral proteins interact with cellular proteins during different stages of the viral life
cycle in order to replicate successfully and cause disease. Therefore, the
study of HIV-1-human PPIs helps us gain knowledge of the HIV-1 infection and replication
process, which in turn supports the development of antiretroviral drugs. The drugs are
designed to hinder HIV-1 proteins from interacting with human
proteins at different stages of the viral life cycle, preventing the virus from replicating. This is
best understood by analyzing the design of an antiretroviral drug [12] as an example.
Broadly, antiretroviral drugs can be categorized into six major classes, which are depicted
in Figure 2. The drugs are grouped according to the step of HIV-1 replication they
interfere with to keep the virus from replicating and infecting the host cell [13].
In general, PPIs are identified by various experimental methods. However, the
prediction of possible interactions based on experimentally observed interactions is one
of the major issues in PPI research and could speed up the development of treatments. Numerous
studies have been carried out to predict the PPIs within a single organism, i.e., intra-species
interactions. However, analyzing and predicting inter-species interactions, especially the
PPIs between a virus and its host, is more promising in medical science for antiviral
drug discovery. Recently, some articles have been published on the prediction of PPIs between
HIV-1 and human proteins. All of these studies were based on the publicly available PPI
database, the concept of which is elaborated in Section 2. In Section 3, several approaches
taken to predict HIV-1-human PPIs, such as classifier based, structural similarity based and
association rule mining (ARM) based approaches, are explained with a brief comparison of
their predicted results. Finally, we conclude in Section 4.
Figure 1: Relation between HIV-1 Genes and Proteins
Figure 2: Categorization of Antiretroviral Drugs and Their Prevention Strategies
Table 1 HIV-1 Proteins and Their Functions
HIV-1 Protein | Protein Functions
Matrix (MA, [p17]) | Supports Gag interaction with plasma membrane mediating the binding and virus assembly and incorporates envelope proteins into virions.
Nucleocapsid (NC, [p7]) | Surrounds RNA genome of the virus in order to protect it by forming a stable complex with the viral RNA and helps in delivering the RNA to the virus particle during assembly.
Capsid (CA, [p24]) | Forms a coating around viral RNA, inserting it into the target cell during infection and helps in virion assembly and maturation.
p6 | Incorporates Vpr into new virions and promotes budding of virions from the infected cell.
Spacer Peptide 1 (SP1, [p2]) | Essential for basal transcription and Tat-mediated transcription of HIV-1.
Spacer Peptide 2 (SP2, [p1]) | Currently, its function is unknown.
Reverse Transcriptase (RT, [p51,p66]) | Carries out the reverse transcription process, that is, produces a DNA copy of viral RNA genome.
Integrase (IN, [p32]) | Integrates the viral DNA with infected cellular DNA forming provirus.
Protease (PR, [p11]) | Cleaves Gag and Gag-pol polyprotein precursors into proper functional pieces, mediating the maturation of virus particles.
Surface Glycoprotein (gp120) | Binds to the host cell surface receptor and coreceptors mediating virus entry into the host cell.
Transmembrane Glycoprotein (gp41) | Contains fusion peptide and supports the virus to fuse with host cell plasma membrane.
Tat (Trans-Activator of Transcription) | Regulates reverse transcription ensuring efficient synthesis of viral mRNA as well as the release of virions from infected cell.
Rev (Regulator of Expression of Virion Proteins) | Exports RNA from the nucleus to the cytoplasm before it can be spliced so that structural proteins and RNA genome for new virion can be produced.
Nef (Negative Factor) | Compels the infected cell to stop producing several cell defence proteins and enhances the progression of HIV infection to AIDS.
Vif (Viral Infectivity Factor) | Attacks cell's antiretroviral factors that inhibit infection.
Vpr (Viral Protein R) | Assists the viral genome after entry into the nucleus and enhances the infection.
Vpu (Viral Protein U) | Downregulates cell receptor CD4 and mediates the release of new virions from host cell surface by enfeebling the interaction between new envelope proteins and cell receptors.
2 HIV-1 - Human PPI Database
A PPI database provides information about interacting protein pairs. PPI databases can be
grouped into three categories:
i) Primary Databases - Contain information about PPIs whose existence has
been identified by experimental methods. For example, the Database of Interacting Proteins
(DIP)1, the Biomolecular Interaction Network Database (BIND)2, etc.
ii) Meta-databases - These are made by integrating primary databases. For example, the Agile
Protein Interaction Data Analyzer (APID)3.
iii) Prediction Databases - Collect information about interacting protein pairs that
have been predicted using several techniques based on known interactions. For
example, Known and Predicted Protein-Protein Interactions (STRING)4.
Many articles have been published focusing on the interactions between HIV-1 and
human proteins. However, each individual publication concentrated on only one or a few
interactions; hence it was laborious to gather or access all the information about
the interactions of a particular HIV-1 protein with human proteins in a compact way. The
Division of Acquired Immunodeficiency Syndrome (DAIDS) of the National Institute of
Allergy and Infectious Diseases (NIAID) first took the initiative to create the HIV-1, Human
Protein Interaction Database (HHPID)5, which catalogs all the known interactions of
HIV-1 proteins with human proteins. The project of developing the
database was started in 2000 in collaboration with the Southern Research Institute and the
National Center for Biotechnology Information (NCBI). The database has been prepared by collecting
information on interactions from numerous peer-reviewed articles (over 100,000) available
in PubMed. Where different publications reach conflicting conclusions, the conflict
is reported in the description of the interaction. Moreover, newly published
literature is reviewed on a periodic basis to keep the database up to date
[14].
To provide a concise yet detailed summary of all HIV-1-human PPIs in the
field of HIV/AIDS research, the HHPID database contains the following information for
each identified PPI:
• NCBI Reference Sequence (RefSeq) protein accession numbers: RefSeq6 is a public
database built by NCBI that provides a non-redundant, annotated collection of
sequences (DNA, RNA and protein), including genomic data, transcripts and proteins.
• NCBI Entrez Gene ID numbers: Entrez Gene7 is a database of gene-specific
information that generates a stable identifier for each gene. This identifier can be used to
integrate multiple types of information, including nomenclature and accessions of
gene-specific and gene product-specific sequences.
• Amino acids from each protein that are known to be involved in the interaction
• Brief description of the PPI
• Keywords for searching the interactions: A total of 68 unique interaction keywords
were listed initially to categorize the interactions between HIV-1 and human proteins.
Some of the interactions can be considered direct viral-host interactions, whereas
most of them are indirect, such as regulatory interactions that are responsible for
the alteration of human gene expression [15]. All interaction keywords are listed in Table
2 [20, 30]. Currently, the database includes eight additional interaction types:
dephosphorylated by, destabilized by, disrupted by, induces reorganization of, induces
ubiquitination of, rescued by, stabilized by, sumoylated by. The query interface of the
NCBI website makes it possible to find the cellular proteins that have a specific type of
interaction with a viral protein based on these keywords.
• National Library of Medicine (NLM) PubMed identification numbers (PMIDs): These can
be used to identify all journal articles describing the interaction.
1 http://dip.doe-mbi.ucla.edu
2 http://bind.ca
3 http://bioinfow.dep.usal.es/apid/index.htm
4 http://string-db.org/
5 http://www.ncbi.nlm.nih.gov/projects/RefSeq/HIVInteractions/
6 http://www.ncbi.nlm.nih.gov/refseq/
7 http://www.ncbi.nlm.nih.gov/gene
Currently, the database comprises a total of 12785 HIV-1-human PPIs that involve 21
HIV-1 proteins (including processed proteins as well as polyprotein precursors) and 3142
human genes encoding 3183 human proteins. It also contains 5549 PMID references to the
original articles describing the interactions. Among all the interactions reported, only 32% are
direct physical interactions (such as binds, cleaves) and the rest are indirect (such as
upregulates, modifies). The most frequent interaction types in the database are
interacts with (20.6%), upregulates (11.6%), downregulates (9.13%), binds (7.9%), activates (7%)
and inhibits (5.6%). Furthermore, the envelope and Tat proteins are involved in 26.45% and
25.37% of the total interactions, respectively. It has been found that around 37% of the human
proteins interact with more than one HIV-1 protein and 58% of all the interactions have been
reported by more than one article [15].
The HHPID greatly contributes to the HIV-1/AIDS research community by presenting
an elaborate view of HIV-1 replication and pathogenesis. Additionally, it has
become a more valuable source of information by providing cross-references
to other public databases. The database supports the work done by Brass et al. in 2008.
They performed a genome-wide siRNA (small interfering RNA) screen to recognize and analyze
the human proteins required for HIV-1 replication. More than 250 cellular proteins were
identified in this work, and these proteins have been termed HIV Dependency Factors
(HDFs) [16]. It was found that a total of 36 proteins involved in the interactions
present in the database have also been recognized as HDFs [17]. Furthermore, the database
assists new experimental and computational studies by allowing a direct comparison of their
results with the knowledge on known interactions.
Moreover, it should be noted that the absence of a
particular protein pair from the database does not allow us to conclude that the
pair does not interact, since substantive evidence of non-interaction is lacking. Thus, the
prediction of PPIs based on the known interactions will help in the discovery of new
therapeutic targets and prevention strategies.
3 HIV-1 - Human PPI Prediction
Several approaches have been taken by researchers to date to predict novel
interactions between HIV-1 and human proteins; they are shown in Figure 3.
In 2006, Park et al. attempted to develop an online prediction system, HPID 2.0 (see footnote 8),
that provides all the interacting protein partners for a given protein submitted by users [19].
The system utilizes several types of information, such as protein domain, protein function and
subcellular localization, in order to predict the human proteins that might interact with an
HIV-1 protein. Along with the name of the protein of interest, the user should input the
superfamilies of the protein using any of the Superfamily, InterPro or Pfam databases.
Protein Structural Interactome MAP (PSIMAP) is used to predict the interacting protein
partners at the superfamily level, and a homology search of the protein sequence of the query
protein in databases such as Ensembl, DIP, HPRD and NCBI is used to find the interacting
protein partners.
8 http://www.hpid.org
Table 2 Direct and Indirect Interaction Types Reported in HHPID
Direction | Category | Interaction Types
HIV1-to-Human | Direct Interaction | acetylates, binds*, cleaves, degrades, interacts with*, phosphorylates
HIV1-to-Human | Indirect Interaction | activates, associates with*, co-localizes with*, competes with*, complexes with*, cooperates with*, decreases phosphorylation of, deglycosylates, depolymerizes, disrupts, downregulates, enhances, enhances polymerization of, fractionates with*, inactivates, incorporates, induces accumulation of, induces acetylation of, induces cleavage of, induces complex with, induces phosphorylation of, induces rearrangement of, induces release of, inhibits, inhibits acetylation of, modulates, polarizes, recruits, regulates, relocalizes, requires*, sensitizes, stabilizes, stimulates, synergizes with*, upregulates
Human-to-HIV1 | Direct Interaction | acetylated by, binds*, cleaved by, degraded by, interacts with*, methylated by, myristoylated by, phosphorylated by, ubiquitinated by
Human-to-HIV1 | Indirect Interaction | activated by, associates with*, cleavage induced by, co-localizes with*, competes with*, complexes with*, cooperates with*, downregulated by, enhanced by, exported by, fractionates with*, glycosylated by, imported by, inhibited by, isomerized by, mediated by, modified by, modulated by, palmitoylated by, processed by, recruited by, regulated by, relocalized by, requires*, stimulated by, synergizes with*, upregulated by
Interaction types common to both directions are marked with an asterisk (*). These types of interactions are
referred to as bidirectional or undirected.
Figure 3: Different Approaches Taken For Predicting HIV-1-Human PPI
In 2009, Tastan et al. attempted to predict the global set of interactions between HIV-1
and human proteins [20]. They treated the prediction of PPIs as a classification
problem and adopted a supervised learning framework to solve it.
In 2010, the authors extended their approach by integrating a semi-supervised multi-task
learning method that considered labeled data (protein pairs
that are known to interact by experimental evidence) along with partially labeled data
(protein pairs that are associated, but not experimentally validated) [21]. These experiments
were based on the dataset retrieved from the HHPID. Besides this, Dyer et al. proposed a
supervised learning approach using a Support Vector Machine (SVM), which was trained
with information about HIV-1-human PPIs from other public PPI databases rather than
HHPID [22]. In [23], the authors suggested a prediction framework using the conformal method,
which was utilized to predict novel interactions and estimate the confidence of each prediction.
Recently, a probability weighted ensemble transfer learning model (PWEN-TLM) was
developed that utilizes homolog GO information and information on subcellular co-localized
proteins to improve the performance of the classifier based approach [24].
However, treating the prediction of PPIs as a classification problem demands both interacting
and non-interacting protein pairs. Although the databases of HIV-1-human PPIs
provide the set of interacting protein pairs as positive samples, no resource can provide
the set of non-interacting protein pairs, as mentioned in Section 2. Thus, the performance
of the classifier based approach depends heavily on the choice of the random protein pairs
used as negative samples. This fact has driven some researchers to take an unsupervised approach
based on ARM in predicting novel PPIs between HIV-1 and the host. In 2010, an attempt
was made to extract association rules among human proteins based only on the HHPID
dataset using the Apriori algorithm, which uses the concept of frequent itemsets [27]. The
generated association rules were then used to predict a set of new interactions. In [28], the authors
exploited the concept of frequent closed itemsets (FCIs) in ARM using a biclustering method,
and the generated association rules among HIV-1 proteins as well as among human proteins
were utilized to discover new interactions. Recently, this biclustering based approach has
been extended to provide the interaction type and the direction of regulation for each of the
predicted interactions [30]. In [31], a multiobjective genetic algorithm-based biclustering
technique was proposed to find a strong interaction module in the HIV-1-human PPI network
using the concept of quasi-biclique. In [32], the authors suggested a new algorithm, FIST, that
mines FCIs to generate a minimal non-redundant cover of association rules using a closure
based approach and extracts a hierarchy of biclusters in a single process; the algorithm
was utilized to predict new PPIs [33]. One downside of ARM based approaches is that they
do not predict interactions involving human proteins that are not covered by frequent
itemsets or FCIs. Moreover, since these models do not learn the pattern of non-interacting
protein pairs, they may produce a high rate of false positives.
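The ARM idea described above can be illustrated with a minimal sketch. It is not the implementation used in [27] or [28]: the protein names, the thresholds and the choice of treating each HIV-1 protein as a transaction whose items are its human partners are all assumptions made for illustration. The sketch mines frequent itemsets level by level in the Apriori style, derives rules with a single consequent, and predicts a pair whenever a rule fires for a viral protein that is not yet known to interact with the consequent.

```python
from itertools import combinations

# Toy data (hypothetical): each HIV-1 protein is a transaction whose items
# are the human proteins it is known to interact with.
known = {
    "hiv_p1": {"H1", "H2", "H3"},
    "hiv_p2": {"H1", "H2", "H3"},
    "hiv_p3": {"H1", "H2"},
    "hiv_p4": {"H2", "H4"},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search: grow only from frequent itemsets."""
    items = sorted(set().union(*transactions.values()))
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    frequent = []
    while level:
        frequent.extend(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates
                 if support(c, transactions) >= min_support]
    return frequent

def rules(transactions, min_support, min_confidence):
    """Rules (antecedent -> single consequent) above both thresholds."""
    out = []
    for itemset in frequent_itemsets(transactions, min_support):
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            conf = support(itemset, transactions) / support(antecedent, transactions)
            if conf >= min_confidence:
                out.append((antecedent, consequent, conf))
    return out

def predict(transactions, mined_rules):
    """Predict (viral, human) pairs where a rule fires but the pair is unknown."""
    predicted = set()
    for viral, items in transactions.items():
        for antecedent, consequent, _ in mined_rules:
            if antecedent <= items and consequent not in items:
                predicted.add((viral, consequent))
    return predicted

mined = rules(known, min_support=0.5, min_confidence=0.6)
print(sorted(predict(known, mined)))
```

In this toy data the human protein H4 is never part of a prediction because it does not reach the support threshold, which mirrors the coverage limitation noted above.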
In 2010, Doolittle et al. tried to predict HIV-1-human PPIs based on the structural similarity
of HIV-1 proteins to human proteins, motivated by the idea that HIV-1 proteins might
interact with the binding partners of those human proteins
that are structurally similar to HIV-1 proteins [25]. This methodology utilized
interaction information from HPRD (Human Protein Reference Database) for prediction and
structural information of proteins from PDB (Protein Data Bank). However, in this method,
only the geometric structures of proteins were considered for finding similarity. In [26],
the authors introduced an evolution-aware structural alignment method (Unialign) that
incorporates the evolutionary relationships of proteins along with geometric structural
similarity. Even so, the set of interactions predicted by this approach does not include
human proteins that have no structural alignment with any HIV-1 protein, or host
proteins for which crystal structures are not available in PDB.
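The transfer step behind the structural similarity approach can be sketched as follows. This is only an illustration under stated assumptions: the similarity map and the human interaction list are hypothetical toy inputs, and the structural alignment itself (the hard part of [25] and [26]) is taken as given rather than computed.

```python
# Hypothetical inputs; none of these pairs come from the cited studies.
# similar_to[v]: human proteins judged structurally similar to viral protein v.
similar_to = {
    "viral_A": {"human_X"},
    "viral_B": {"human_Y", "human_Z"},
}
# Known human-human interactions (undirected), e.g. as curated in HPRD.
human_ppi = [("human_X", "human_P"), ("human_Y", "human_Q"), ("human_Q", "human_Z")]

def partners(protein, edges):
    """All proteins that share an undirected edge with `protein`."""
    found = set()
    for a, b in edges:
        if a == protein:
            found.add(b)
        elif b == protein:
            found.add(a)
    return found

def predict_by_similarity(similar_to, edges):
    """Map each viral protein to the partners of its similar human proteins."""
    predictions = {}
    for viral, analogs in similar_to.items():
        targets = set()
        for analog in analogs:
            targets |= partners(analog, edges)
        predictions[viral] = targets
    return predictions

print(predict_by_similarity(similar_to, human_ppi))
```

A viral protein with no entry in the similarity map receives no predictions at all, which is the coverage limitation described above.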
We have compared the number of interactions predicted by different
studies with the number of true positive interactions (predicted interactions that are known
to exist experimentally) in Figure 4a. For the classifier based approach, the predictions include
known interactions as well as novel interactions, making it easy to calculate the
number of true interactions in the prediction set. The structural similarity based approach used
the HHPID and PIG (Pathogen Interaction Gateway) databases for validation. For
ARM based approaches using the Apriori and biclustering methods, it is difficult to validate
the predicted interactions computationally, since the predictions comprise only novel
interactions that are not reported in the database. Therefore, the authors tried to validate
their predictions by exhaustively searching recently published PubMed entries that
provide experimental evidence for interactions not reported in HHPID. We have found
that some of the interactions that were considered novel in the articles [20], [27] and
[28] are now experimentally validated. Based on the list of recent literature provided in
[28] and [30], we have updated the number of true positive interactions in the prediction
sets, as reflected in Figure 4b. Table 3 shows the interactions that were predicted
earlier and are now experimentally validated. The predicted interactions were also validated
by finding the overlap with other studies. Figure 5 presents the overlap of the
interactions predicted by the five studies listed in the figure. We have found a total of four
interactions that were predicted by four studies (marked by i, ii in Figure 5) and a total of
eight predicted interactions common to three studies (marked by iii, iv, v). No predicted
protein pair has been found to be common to all five studies (marked by vi). Table 4
lists the interactions predicted by at least three of the studies. Higher priority can be given to
the overlapping interactions for analysis, and they would be of great interest to virologists.
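The kind of overlap analysis described above can be sketched in a few lines. The study names and protein pairs below are placeholders, not the predictions of the cited studies; the sketch only shows how predicted pairs can be grouped by the number of studies that report them.

```python
from collections import Counter

# Hypothetical prediction sets; study names and pairs are placeholders.
predictions = {
    "study_1": {("v1", "h1"), ("v1", "h2"), ("v2", "h3")},
    "study_2": {("v1", "h1"), ("v2", "h3")},
    "study_3": {("v1", "h1"), ("v3", "h4")},
}

def overlap_by_support(predictions):
    """Group predicted pairs by the number of studies that report them."""
    counts = Counter(pair for pairs in predictions.values() for pair in pairs)
    grouped = {}
    for pair, n in counts.items():
        grouped.setdefault(n, set()).add(pair)
    return grouped

grouped = overlap_by_support(predictions)
for n in sorted(grouped, reverse=True):
    print(n, sorted(grouped[n]))
```

Pairs reported by the largest number of studies appear first, which matches the prioritization suggested above.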
In the following subsections, all of these approaches and their results are explained in
detail. Lastly, for each of these approaches, we provide a list of URLs where
the prediction method and the predicted dataset are publicly available (Table 13).
Figure 4: Comparison of Predicted Interactions by Different Approaches, panels (a) and (b)
3.1 Classifier Based Approach by Information Integration
The first attempt to predict the global interaction set between HIV-1 and human proteins, by
Tastan et al., utilized a supervised learning framework that integrates multiple heterogeneous
biological information sources [20]. Since the virus exploits the existing communication
pathways within the cell in order to infect it, the interaction relationships between human
proteins can be used to find the proteins that the pathogen might target. Thus, the authors
derived a number of features that combine the existing knowledge on the human interactome
(the set of intra-human PPIs) with other available information to predict HIV-1-human
PPIs.
Table 3 Previously Predicted, Now Experimentally Validated By Different Studies
Predictions by Tastan et al. [20]
Tat ↔ MAPK14, Tat ↔ CASP9, Vpu ↔ BCL2, RT ↔ CD4,
Vif ↔ UBB, Vif ↔ TP53, env_gp160 ↔ TP53, env_gp120 ↔ CASP8,
env_gp120 ↔ SRC, env_gp41 ↔ MAPK1, Gag_Pr55 ↔ MAPK1
Predictions by Mukhopadhyay et al. [27]
env_gp41 ↔ MAPK1
Predictions by Mukhopadhyay et al. [28]
Tat ↔ MAPK14, Tat ↔ CASP9, Rev ↔ CD4, RT ↔ CD4
Figure 5: Overlapping of Predicted Interactions By Different Studies
Table 4 Interacting Protein Pairs Shared by Different Literatures
Common to Four Studies†
HIV-1 Protein | Human Protein
capsid | MAPK1
gp41 | MAPK1
gp160 | CASP3
p6 | PRKCA
Common to Three Studies‡
HIV-1 Protein | Human Protein
capsid | PRKCA
gp120 | ACTB
gp120 | CALM1
integrase | CD4
integrase | MAPK1
nucleocapsid | PRKCA
vif | PRKCA
vpr | PRKCA
†Interactions marked by (i) and (ii) in Figure 5
‡Interactions marked by (iii), (iv) and (v) in Figure 5
Here, the prediction of PPIs was considered as a binary classification problem, since a
protein can either interact with another protein or not. Hence, each protein pair was a
member of either the interaction class or the non-interaction class. The set of interacting
protein pairs of HIV-1 and human, taken from HHPID, was divided into two exclusive groups: i) most
likely direct physical interactions; ii) indirect interactions (Table 2). Group 1
consisted of 1063 protein pairs involving 721 human proteins, and Group 2 comprised
1447 protein pairs involving 914 human proteins. The Group 1 interactions formed the
interaction class, which was used as the basis for building the model. Group 2 was used for mining
the final predictions. As mentioned, since a set of non-interacting protein pairs is not
available, protein pairs that are not known to interact were chosen randomly as negative
samples. A feature vector consisting of 35 features was used to describe each protein pair,
and a Random Forest (RF) classifier was used to solve the classification
problem. The features were derived from one or more biological information sources such as
Gene Ontology (GO) annotations, graph properties of the human interactome (degree, clustering
coefficient and betweenness centrality), gene expression, tissue feature, HIV-1 protein type
features (ptf), post-translational modifications, sequence similarity features and the ELM-ligand
feature.
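Two of the graph properties named above, degree and clustering coefficient, can be computed directly from an edge list. The following is a minimal sketch on a hypothetical toy interactome; the node names are placeholders, and betweenness centrality is left out because it requires shortest-path computations that would make the example considerably longer.

```python
from itertools import combinations

# Hypothetical toy interactome; edges are undirected human-human interactions.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]

def build_adjacency(edges):
    """Adjacency sets for an undirected graph given as a list of edges."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def degree(node, adjacency):
    """Number of interaction partners of `node`."""
    return len(adjacency.get(node, ()))

def clustering_coefficient(node, adjacency):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = adjacency.get(node, set())
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))

adjacency = build_adjacency(edges)
for node in sorted(adjacency):
    print(node, degree(node, adjacency), clustering_coefficient(node, adjacency))
```

Values of this kind, computed for the human protein in each HIV-1-human pair, could serve as entries of the feature vector.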
The Gini index was used to construct the decision trees in the RF and to
evaluate the importance of the features used in the classification problem. Each possible
protein pair was assigned an RF prediction score, and the protein pairs with a positive RF
score were expected to interact, with a higher score indicating a higher probability of interaction.
Several tests were done on the RF classifier model to evaluate its performance, and it was
expected to achieve an average MAP (Mean Average Precision) score of 0.23, i.e., on
average about 23% of all the predicted PPIs are true positive interactions. For PPI prediction, this
result is considered good. A total of 3372 interactions were predicted with RF score ≥ 0.00,
among which 2084 are novel interactions. The predicted interactions were compared
with the human genes reported in the siRNA screen done by Brass et al. [16] and with the
human proteins that are hijacked by HIV-1 in its virion [18]. Additionally, the features
that contributed most to the classification of protein pairs were identified based on their
Gini importance in the RF classifier. The top 3 Gini features are degree, betweenness
centrality and neighbour GO process similarity; the top 6 additionally include clustering
coefficient, neighbour GO function similarity and cellular location similarity.
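To make the role of the Gini index concrete, the sketch below computes the Gini impurity of a set of labels and the impurity decrease obtained by one threshold split on a single feature. The feature values and labels are hypothetical. In a random forest, the Gini importance of a feature is commonly obtained by accumulating such decreases over all splits that use the feature; the sketch shows a single split only and is not the procedure of [20].

```python
def gini_impurity(labels):
    """Gini impurity 1 - sum(p_c^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_decrease(values, labels, threshold):
    """Impurity decrease obtained by splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = ((len(left) / n) * gini_impurity(left)
                + (len(right) / n) * gini_impurity(right))
    return gini_impurity(labels) - weighted

# Hypothetical toy feature (e.g. node degree) and labels (1 = interacting).
degree_values = [1, 2, 8, 9]
labels = [0, 0, 1, 1]
print(gini_impurity(labels))
print(gini_decrease(degree_values, labels, threshold=5))
```

A split that separates the two classes perfectly, as in this toy case, removes all of the impurity of the parent node.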
However, the performance of this model is influenced by the availability of truly
interacting protein pairs. Thus, the authors extended their supervised approach by integrating
a semi-supervised multi-task learning framework that included protein pairs for which
an association between the two partners exists, but without enough experimental evidence to
support it as a direct interaction [21]. Here, 158 HIV-1-human protein pairs suggested by
experts were taken as positive samples and 2119 pairs were considered partial positives,
among which 552 are Group 1 interactions and 1567 are Group 2 interactions. Instead of
35 features, 18 features were associated with each HIV-1-human protein pair. Using the
semi-supervised approach, 3428 interactions were predicted, among which 259 interactions
were validated by partial positive interactions and 3123 were novel interactions.
The predicted interactions were examined in the same way as in the previous study to identify the
most interesting pairs to concentrate upon. Table 5 gives the number of predictions found
by Tastan et al. using the supervised and semi-supervised approaches for some prediction
score cut-offs.
Table 5 Result Found Using Classifier Based Approach by Tastan et al.

| Prediction Score Cut-off | Number of Predicted Interactions | Number of Novel Interactions Predicted | Human Genes Involved in Predicted Interactions | Overlap with siRNA (282 genes) | Overlap with Virion (316 genes) |
| --- | --- | --- | --- | --- | --- |
| Using Supervised Learning Framework [20] | | | | | |
| ≥ 0.00 | 3372 | 2084 | 1010 | 46 | 240 |
| ≥ 2.50 | 279 | 28 | 22 | 0 | 4 |
| Using Semi-supervised Learning Framework [21] | | | | | |
| ≥ −1.8 | 3428 | 3123 | 1027 | 24 | 72 |
| ≥ −1.5 | 2434 | 2172 | 721 | 21 | 61 |

In [22], a supervised learning approach using SVM was proposed to predict physical
interactions between HIV-1 and human proteins. The SVM was trained with known interactions
retrieved from databases as positive samples and with random protein pairs that are
not known to interact as negative samples. The authors did not use the HHPID dataset
as positive samples; rather, they collected 1028 HIV-1-human PPIs as positive samples from
four other public databases: the Biomolecular Interaction Network Database (BIND),
the Database of Interacting Proteins (DIP), IntAct⁹, and Reactome¹⁰. The performance of
the model was evaluated by computing the area under the precision-recall curve (PR-AUC score),
where a higher score indicates better performance of the predictor. Different combinations
of features, such as domains, protein sequence 4-mers, and network properties of the human
interaction network, were chosen to analyze the performance of the SVM, and the model
trained with all the feature sets provided the highest PR-AUC score. Moreover,
different ratios of positive samples to negative samples (PS:NS = 1:25, 1:50 and 1:100)
were taken to prepare the complete list of predicted interactions. The predicted interactions
were compared with the HDFs recognized by Brass et al. [16] to identify those interactions
that involve human proteins reported as HDFs. The results obtained with this
method are given in Table 6.
Table 6 Result Found Using Supervised Learning Method By Dyer et al. [22]

| Ratio of Positive to Negative Samples (PS:NS) | Number of Predicted Interactions | Number of Predicted Interactions Involving HDFs | PR-AUC Score |
| --- | --- | --- | --- |
| 1:25 | 1111 | 46 | 0.707 |
| 1:50 | 506 | 33 | 0.630 |
| 1:100 | 182 | 16 | 0.505 |
The classifier based approach for predicting PPIs requires both positive and negative
samples for training and testing. Positive samples are readily available from
interaction databases; however, there is no such resource for non-interacting protein pairs.
Protein pairs that are not known to interact, i.e., not present in the interaction database,
are therefore taken at random and considered as negative samples, on the assumption that these
random protein pairs are not likely to interact physically. This assumption might not always hold.
Hence, selecting the random protein pairs that serve as negative samples is the more
challenging task when predicting a potential set of HIV-1-human PPIs with the classifier
based approach.
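The random negative sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited study: the function name `sample_negatives`, the `ratio` parameter and the toy protein pairs are all hypothetical.

```python
import random

def sample_negatives(hiv_proteins, human_proteins, known_pairs, ratio, seed=0):
    """Draw len(known_pairs) * ratio random pairs that are absent from known_pairs."""
    rng = random.Random(seed)
    known = set(known_pairs)
    # Pairs with no recorded interaction. "Not recorded" is only assumed to
    # mean "non-interacting", which is exactly the weakness discussed above.
    candidates = [(v, h) for v in hiv_proteins for h in human_proteins
                  if (v, h) not in known]
    n_wanted = min(len(candidates), len(known) * ratio)
    return rng.sample(candidates, n_wanted)

# Toy data for illustration only (not taken from any database).
hiv = ["VP1", "VP2", "VP3"]
human = ["HP1", "HP2", "HP3", "HP4", "HP5"]
positives = [("VP1", "HP3"), ("VP2", "HP1"), ("VP2", "HP4")]
negatives = sample_negatives(hiv, human, positives, ratio=2)
```

Raising `ratio` mimics the skewed PS:NS settings, at the cost of drawing more pairs whose true status is unknown.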
⁹ http://www.ebi.ac.uk/intact/
¹⁰ http://www.reactome.org/

In 2012, a conformal prediction framework was proposed to predict novel
interactions between HIV-1 and human proteins and to estimate a confidence for each
predicted interaction [23]. The conformal predictor dealt with only one labeled class,
the interacting protein pairs, while no set of non-interacting protein pairs was explicitly
defined. All the protein pairs that are not known to interact formed the ‘background set’
of unlabeled examples. The algorithm was based on the ‘exchangeability’ assumption of
data, i.e., any of the undiscovered interactions is equally likely to be discovered next, and on the
relative ‘strangeness’ of protein pairs. A protein pair is strange with respect to others if it has
a very small or very large value for one of its features. The authors applied the algorithm
to 1063 interacting protein pairs retrieved from HHPID along with 353778 possible protein
pairs with unknown labels. A p-value was assigned to each protein pair, with a
lower p-value indicating that the interaction is less likely. To obtain a prediction list with a
certain confidence level (γ), all the protein pairs with a p-value of at least (1 − γ) need to be
included in the list. The number of predictions found for some of the p-value thresholds is given in
Table 7. The prediction set includes more interactions than that obtained from the RF
classifier model by Tastan et al. [20]. Moreover, a large overlap with the ‘siRNA’ and ‘virion’
datasets was observed, which suggests that the conformal method yields more
potential interactions than the RF classifier model.
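The p-value idea can be illustrated with a small one-class sketch. It is not the algorithm of [23]: the strangeness measure used here (the largest absolute z-score over the features) is an assumption chosen to match the informal description above, and the sketch simplifies by not recomputing the scores of the known pairs with the candidate included. All names and feature values are hypothetical.

```python
from statistics import mean, pstdev

def strangeness(x, reference):
    """Hypothetical nonconformity score: largest absolute z-score over the features of x."""
    score = 0.0
    for j, value in enumerate(x):
        column = [row[j] for row in reference]
        spread = pstdev(column) or 1.0  # avoid division by zero for constant features
        score = max(score, abs(value - mean(column)) / spread)
    return score

def conformal_p_value(x, positives):
    """Share of examples (known positives plus x itself) at least as strange as x."""
    alpha_x = strangeness(x, positives)
    alphas = [strangeness(p, positives) for p in positives]
    at_least_as_strange = sum(1 for a in alphas if a >= alpha_x) + 1  # +1 counts x itself
    return at_least_as_strange / (len(positives) + 1)

def prediction_list(candidates, positives, gamma):
    """Keep every candidate whose p-value is at least 1 - gamma."""
    threshold = 1.0 - gamma
    return [name for name, x in candidates.items()
            if conformal_p_value(x, positives) >= threshold]

# Toy feature vectors for illustration only.
positives = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5), (0.2, 0.8), (0.8, 0.2)]
candidates = {"pair_A": (0.5, 0.5), "pair_B": (5.0, 5.0)}
selected = prediction_list(candidates, positives, gamma=0.5)
```

Here `pair_A` resembles the known interactions and receives a high p-value, while the extreme `pair_B` receives a low one and is left out of the list.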
Table 7 Result Found Using Conformal Prediction Framework By Nouretdinov et al. [23]

| Conformal Prediction p-value | Number of Predicted Interactions | Number of Novel Interactions Predicted | Number of Predictions with RF score ≥ 0 [20] | Overlap with siRNA | Overlap with Virion |
| --- | --- | --- | --- | --- | --- |
| ≥ 0.95 | 295 | 241 | 267 | NA | NA |
| ≥ 0.90 | 711 | 604 | 548 | 11 | 38 |
| ≥ 0.80 | 2398 | 2185 | 1156 | 26 | 78 |
| ≥ 0.50 | 19521 | 18988 | 2376 | 85 | 173 |
In [24], the author suggested PWEN-TLM, with SVM as the classifier, to discover
novel interactions between HIV-1 and human proteins. This model addressed three major
difficulties in the computational prediction of PPIs: data scarcity, data unavailability, and
negative data sampling. Homolog GO information was utilized to deal with
data scarcity and data unavailability. To validate the effectiveness of using homolog
GO information for model training and testing, three experimental settings were
developed: i) the optimistic case, which assumed that target GO information was available for
both training and testing data, ii) the moderate case, which assumed that target GO information
was not available for test data, and iii) the pessimistic case, which did not consider target GO
information for either training or testing data. Regarding negative data sampling, the author
constructed two sets of negative data: one by random sampling of protein pairs that are
not known to interact, as done by earlier researchers, and the other by additionally
excluding subcellular co-localized proteins. The exclusion of subcellular co-localized
proteins from the negative data samples was based on the observation that subcellular
co-localized protein pairs are more likely to interact physically. For the experiment,
a total of 3638 PPIs retrieved from HHPID were taken as positive samples, and an equal
number of negative samples excluding subcellular co-localized protein pairs were extracted
to form the dataset D1. Similarly, the dataset D2 was created with the same positive samples and
an equal number of randomly sampled negative samples. The Receiver Operating Characteristic
- Area Under Curve (ROC-AUC) metric was used to measure the significance of homolog
GO information, and the Matthews correlation coefficient (MCC) and accuracy
were used to evaluate the effectiveness of excluding subcellular co-localized proteins.
The values of the metrics obtained for the different cases and datasets are shown in
Table 8. The relatively small difference in ROC-AUC score between the optimistic and pessimistic
cases indicated that homolog GO information could be a good substitute where target GO
information is not available. Moreover, dataset D1 was found to have better predictive
balance than dataset D2, with the largest MCC difference being 0.1053. Therefore, excluding
subcellular co-localized proteins is the more reliable way of constructing the negative dataset.
Another important aspect of the classifier based approach is the choice of the
ratio of positive to negative samples. In [22], Dyer et al. achieved the highest PR-AUC
score of 0.707 for the 1:25 ratio (Table 6). Here, the author took the ratio of positive to negative
samples as 1:1 and obtained a higher PR-AUC score, which suggests that skewed training
data might produce a biased model.
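For readers unfamiliar with the two balance metrics, the following sketch computes MCC and accuracy from the four confusion-matrix counts using their standard definitions. The counts in the example are invented for illustration and are unrelated to the values reported in [24].

```python
from math import sqrt

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the score is set to 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def accuracy(tp, fp, tn, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical counts on a balanced (1:1) test set of 100 pairs.
example_mcc = mcc(tp=40, fp=10, tn=40, fn=10)
example_acc = accuracy(tp=40, fp=10, tn=40, fn=10)
```

Unlike accuracy, MCC uses all four counts symmetrically, which is why it is often preferred for judging the predictive balance of a classifier.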
Table 8 Performance Metric Score of PWEN-TLM [24]

| Case | D1 ROC-AUC | D1 PR-AUC | D1 MCC | D1 Accuracy | D2 ROC-AUC | D2 PR-AUC | D2 MCC | D2 Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Optimistic Case | 0.9326 | 0.9361 | 0.7446 | 85.62% | 0.9005 | 0.8989 | 0.6393 | 82.41% |
| Moderate Case | 0.8155 | 0.8172 | 0.4606 | 66.22% | 0.7661 | 0.7480 | 0.4258 | 63.63% |
| Pessimistic Case | 0.8735 | 0.8799 | 0.6605 | 80.22% | 0.8158 | 0.8478 | 0.6188 | 77.43% |
The author validated the model on the 180 interactions predicted in [28], since recent
literature evidence exists for some of these predictions (Section
3, Figure 4a). Moreover, 80 of these 180 interactions were found to be common with
the predictions by Tastan et al. [20] (Figure 5). Using the 180 interactions as a test
set without overlap with the training data, PWEN-TLM predicted 132 interactions in the
optimistic case and 165 interactions in the pessimistic case. Of the 132 predictions, 46
were also predicted in [20], as were 61 of the 165. Since the
optimistic and pessimistic PWEN-TLM performed better according to ROC-AUC score,
the author considered only these two cases for validation. Besides this, the
author also tried to find novel interactions using PWEN-TLM with HIV-targeted human
proteins taken as test candidates. In this case, the model predicted 718 interactions in the
optimistic case and 61 interactions in the pessimistic case.
3.2 Based on Structural Similarity
In [25], Doolittle et al. developed a computational approach to predict HIV-1-human PPIs based
on the structural similarity of HIV-1 and human proteins. They first retrieved the human
proteins having significant structural similarity with HIV-1 proteins using the Dali database¹¹,
which comprises 3D structure comparisons of all protein structures present in PDB¹². PDB
contains the published crystal structures of proteins, which cover most of the HIV-1 proteins
(PR, RT, IN, CA, MA, NC, Gag p2, gp120, gp41, Nef, Tat, Vpr, and Vpu); however,
structures for many human proteins are not available in PDB. The HIV-1 proteins identified
as having high structural similarity with at least one human protein are gp41,
gp120, CA, MA, p2, PR, IN, RT, and Vpr, and the human proteins structurally
similar to HIV-1 proteins are referred to as HIV-similar proteins. The next step
was to identify, from HPRD¹³, the intra-human PPIs in which HIV-similar proteins are known
to participate. The prediction approach is based on the assumption that HIV-1 proteins
are likely to take part in the same interactions as their HIV-similar human counterparts,
which allows them to attach to the host cell protein interaction
network.

¹¹ http://ekhidna.biocenter.helsinki.fi/dali/start
¹² http://www.rcsb.org/
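The transfer step described above amounts to a join between two mappings, which the following sketch makes explicit. The function name, the dictionaries and the placeholder identifiers are hypothetical; real inputs would come from the structure comparison and from the human PPI database.

```python
def predict_by_structural_similarity(similar_to, human_ppi):
    """Transfer the known partners of HIV-similar human proteins to the viral protein.

    similar_to: HIV-1 protein -> set of structurally similar human proteins
    human_ppi:  human protein -> set of its known human interaction partners
    """
    predictions = set()
    for viral, hiv_similar in similar_to.items():
        for human in hiv_similar:
            for target in human_ppi.get(human, ()):
                predictions.add((viral, target))
    return predictions

# Placeholder identifiers for illustration only.
similar_to = {"VPa": {"H1", "H2"}, "VPb": {"H3"}}
human_ppi = {"H1": {"T1", "T2"}, "H2": {"T2"}, "H3": {"T3"}}
predicted = predict_by_structural_similarity(similar_to, human_ppi)
```

Because the predictions are stored in a set, a target reached through several HIV-similar proteins (such as `T2` here) is counted only once per viral protein.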
The predicted interactions were filtered for functional evidence and
biological relevance. The authors retained those predicted interactions in which the target
proteins (human proteins with which HIV-similar proteins are known to interact) satisfy at
least one criterion: i) they impair HIV-1 infection or replication according to siRNA or shRNA, or
ii) they are present in HIV-1 virions. The filter based on these two types of dataset is
referred to as the “Literature Filter”. After application of this filter, the prediction set consisted
of a total of 2143 interactions, among which 62 were verified as true positive interactions
based on the datasets retrieved from the host-pathogen interaction databases HHPID and PIG¹⁴.
A total of 347 human proteins were predicted to be structurally similar to at least one
HIV-1 protein, and 406 unique human proteins were predicted to potentially interact with
HIV-1.
The prediction set was further refined by considering protein
co-localization, which requires both the HIV-1 protein and the target human protein to be present in
the same location within the cell according to GO cellular component (CC) annotation. The
number of unique interactions in this refined list of predictions is 502, among which 31
interactions are known to exist experimentally. 189 HIV-similar proteins that have 137
known different binding partners were encountered. Application of this filter not only
reduced the set of likely interactions but also increased the percentage of true interactions from
~3% to ~6%. Moreover, gp41 was found to have more predicted interactions than the other
HIV-1 proteins. This is expected because a large number of GO cellular component terms are
annotated to gp41 and it is found in more parts of the cell, which increases the probability of
satisfying the co-localization criterion with more human proteins.
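A co-localization filter of this kind can be sketched as a set intersection over cellular component annotations. The names, the annotation dictionaries and the location labels below are hypothetical placeholders rather than real GO terms, and the rule "share at least one CC term" is an assumed reading of the criterion.

```python
def colocalization_filter(predictions, viral_cc, human_cc):
    """Keep predicted pairs whose two proteins share at least one CC annotation."""
    kept = set()
    for viral, human in predictions:
        shared = viral_cc.get(viral, set()) & human_cc.get(human, set())
        if shared:
            kept.add((viral, human))
    return kept

def true_positive_fraction(predictions, known):
    """Fraction of the predicted pairs that are already known interactions."""
    return len(predictions & known) / len(predictions) if predictions else 0.0

# Placeholder data for illustration only.
predictions = {("VPa", "T1"), ("VPa", "T2"), ("VPb", "T1"), ("VPb", "T3")}
viral_cc = {"VPa": {"loc1", "loc2"}, "VPb": {"loc2"}}
human_cc = {"T1": {"loc1"}, "T2": {"loc3"}, "T3": {"loc2"}}
known = {("VPa", "T1")}

filtered = colocalization_filter(predictions, viral_cc, human_cc)
before = true_positive_fraction(predictions, known)
after = true_positive_fraction(filtered, known)
```

In this toy case the filter halves the prediction set and doubles the fraction of known interactions. A protein annotated with many locations, like `VPa`, passes the filter more easily, which is the effect noted above for gp41.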
This method of prediction is based entirely on protein structures, so different
structures of a single protein may produce different predictions about its interactions.
Hence, some predictions are lost if the prediction is done at the gene level. The prediction
done at the gene level produced 265 interactions after the literature and CC filters. The summary
of the results is tabulated in Table 9.
The predicted set of interactions was examined from functional and biological aspects
based on GO annotations. The properties of the human proteins that were predicted to interact
with HIV-1 were inspected using biological process and molecular function GO terms.
They showed a significant enrichment in the processes of protein and nucleic acid
transport, cell death and post-translational modification; HIV-1 proteins are known
to alter or manipulate these processes during infection. In addition, the predicted
interactions were observed to be supported by various studies. Moreover, the interactions
predicted by structural similarity were compared with the results of Tastan et al. [20]:
10% of the predicted interactions are common to both studies.
¹³ http://www.hprd.org/
¹⁴ http://pathogenportal.net/pig/
Table 9 Result Found Using Structural Similarity Based Approach By Doolittle et al. [25]

| | Processed Proteins Level: After Literature Filter | Processed Proteins Level: After Literature and CC Filters | Gene Level: After Literature Filter | Gene Level: After Literature and CC Filters |
| --- | --- | --- | --- | --- |
| Number of predicted interactions | 2143 | 502 | 883 | 265 |
| Number of true positive interactions predicted, verified from HHPID and PIG | 62 | 31 | 56 | 22 |
| Number of human proteins found having structural similarity with HIV-1 proteins | 347 | 189 | NA | NA |
| Number of human proteins found to interact with HIV-1 proteins | 406 | 137 | NA | NA |
In [25], the authors used the Dali database for protein structure comparison, which
only considers the geometric structural information of the protein. Therefore, to incorporate
evolutionary relationships among proteins along with the structural alignments, Zhao et al.
proposed a novel evolution-aware structural alignment method (Unialign) to predict
the possible interacting protein partners of HIV-1 proteins [26]. They considered only the HIV-1
protein gp41 in their experiment and predicted 922 unique human target proteins potentially
interacting with gp41. Since they considered only direct physical interactions in their
experiment, they validated their predictions against the direct interactions
reported in HHPID. A total of 15 of the predicted interactions were among the 68 experimentally
validated interactions. They also compared the performance of the Dali database and the Unialign
method and reported that Unialign is more efficient than Dali at finding structurally
similar proteins.
3.3 Based on ARM Technique
This type of approach mines association rules among viral or human proteins from
the known interactions of HIV-1 proteins with human proteins found in PPI databases.
Subsequently, the generated association rules with high confidence are utilized to predict the
potential set of viral-host PPIs. This methodology relies only on the information of
interacting protein pairs, thus overcoming the negative sampling problem faced by the
classifier based approach.
Figure 6 shows the pipeline of applying ARM in predicting novel interactions. An itemset
can be referred to as a set of human proteins interacting with a particular HIV-1 protein or
vice versa. Frequent itemsets are those itemsets that satisfy a minimum support threshold,
i.e., a set of human proteins is a frequent itemset if all of them are known
to interact with at least a minimum number of HIV-1 proteins, and vice versa. From the frequent
itemsets, association rules are generated. The following subsections explain the methods
published in different articles for extracting association rules from experimentally validated
interactions between HIV-1 and human proteins, and discuss the predictions made from
these association rules along with the predicted results.
Figure 6: Pipeline of Applying ARM in Prediction of PPI
3.3.1 ARM Using Apriori Algorithm
In [27], the authors extracted association rules among human proteins that are
known to interact with HIV-1 proteins using the well-known apriori algorithm. The information
about the PPIs between HIV-1 and human proteins found in HHPID was organized into a
binary adjacency matrix with rows representing the viral proteins and columns representing
the human proteins. They used 1288 interactions (direct or indirect) between 17 HIV-1
proteins and 773 human proteins to predict new viral-host interactions, so the dimension of
the matrix is 17 × 773. An entry of 1 in the matrix indicates that the corresponding HIV-1
and human protein pair is known to interact, and 0 indicates that no information about an
interaction between the corresponding HIV-1 and human protein pair is present in the database.
To apply the apriori algorithm to this matrix, each row (HIV-1 protein) was treated as
a transaction and each column (human protein) was treated as an item; applying
the algorithm yielded the frequent itemsets. From these frequent itemsets, association
rules among human proteins satisfying a minimum threshold, called minimum confidence,
were extracted. Here, the authors were interested in finding association rules with a single
consequent. Thus, an extracted rule can be represented as $\{HP_1, HP_2, HP_3, HP_4 \Rightarrow HP_5\}$,
with $HP_i$ ($i = 1, 2, 3, 4, 5$) denoting human proteins. This rule can be interpreted as follows: if
the human proteins $HP_1$, $HP_2$, $HP_3$, $HP_4$ physically interact with a set
of HIV-1 proteins, then human protein $HP_5$ possibly interacts with
the same set of HIV-1 proteins. Hence, each rule obtained from the matrix was associated
with a set of HIV-1 proteins for which the rule was valid.
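The transaction view of the interaction matrix can be sketched as follows. The code is a simplified level-wise search written for illustration, not the apriori implementation used in [27]: it enumerates all item combinations of each size instead of generating candidates from the previous level, which is only practical for toy data. The protein names and interactions are placeholders.

```python
from itertools import combinations

# Each HIV-1 protein (matrix row) is a transaction; its items are the human
# proteins (matrix columns) holding a 1. Toy data for illustration only.
transactions = {
    "VP1": {"HP1", "HP2", "HP3"},
    "VP2": {"HP1", "HP2", "HP3"},
    "VP3": {"HP1", "HP2"},
    "VP4": {"HP2", "HP3"},
    "VP5": {"HP4"},
}

def support(itemset, transactions):
    """Number of transactions (HIV-1 proteins) that contain every item of the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def frequent_itemsets(transactions, min_support):
    """Level-wise search for all itemsets with support >= min_support."""
    items = sorted(set().union(*transactions.values()))
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            count = support(frozenset(combo), transactions)
            if count >= min_support:
                frequent[frozenset(combo)] = count
                found = True
        if not found:
            break  # no frequent itemset of this size, so none of a larger size exists
    return frequent

frequent = frequent_itemsets(transactions, min_support=2)
```

With `min_support=2`, the itemset {HP1, HP2, HP3} is frequent because it is shared by VP1 and VP2, whereas HP4 is dropped because only VP5 interacts with it.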
The possible viral-host interactions were then predicted from the extracted
association rules with high confidence. Suppose that the
antecedent of the rule $\{HP_1, HP_2, HP_3, HP_4 \Rightarrow HP_5\}$ is true for 10 viral proteins
($VP_1, VP_2, VP_3, \ldots, VP_{10}$) and the consequent of the rule is valid for the first 8 HIV-1 proteins
($VP_1, VP_2, \ldots, VP_8$). The support of the rule is then 8 and the confidence of the rule is
80% (support of the rule / support of the antecedent = 8 out of 10), which can be
considered high. Therefore, two new interactions can be predicted from this rule:
$VP_9 \leftrightarrow HP_5$ and $VP_{10} \leftrightarrow HP_5$. Rules with 100% confidence were not considered in the
prediction process, since no new interaction can be predicted from them. With this process, 22 new
interactions were predicted from 34 extracted association rules with a single consequent. The
HIV-1 proteins involved in the predicted interactions are rev, tat, vif, vpr, gp120, gp160, gp41,
protease, nef, matrix, nucleocapsid, p6 and capsid, and the human proteins involved are ACTG1,
PRKCA, MAPK1, ACTB, PRKCB1, PRKCQ, MAPK3, PRKCD, PRKCE, CASP3 and CD4.
The summary of the results is given in Table 10.
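The worked example above (antecedent true for ten viral proteins, consequent true for the first eight) can be reproduced with a short sketch. The function name and the data layout are illustrative assumptions; the 70% minimum confidence matches the threshold reported for this study.

```python
def predict_from_rule(antecedent, consequent, transactions, min_confidence):
    """Return the rule confidence and the new (viral, human) pairs it suggests."""
    # Viral proteins that interact with every human protein in the antecedent.
    covered = [vp for vp, items in transactions.items() if antecedent <= items]
    # Of those, the viral proteins already known to interact with the consequent.
    satisfied = [vp for vp in covered if consequent in transactions[vp]]
    confidence = len(satisfied) / len(covered) if covered else 0.0
    # Low-confidence rules are discarded; 100% rules leave nothing new to predict.
    if confidence < min_confidence or confidence == 1.0:
        return confidence, []
    predictions = [(vp, consequent) for vp in covered if vp not in satisfied]
    return confidence, predictions

# Rebuild the example: VP1..VP10 satisfy the antecedent, VP1..VP8 also have HP5.
antecedent = {"HP1", "HP2", "HP3", "HP4"}
transactions = {f"VP{i}": set(antecedent) for i in range(1, 11)}
for i in range(1, 9):
    transactions[f"VP{i}"].add("HP5")

confidence, predictions = predict_from_rule(antecedent, "HP5", transactions, 0.7)
```

The rule has confidence 0.8, and the two viral proteins that satisfy the antecedent but lack the consequent, VP9 and VP10, become the predicted partners of HP5.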
Table 10 Result Found Using Apriori Algorithm By Mukhopadhyay et al. [27]

| Minimum Support=40%, Minimum Confidence=70% | |
| --- | --- |
| Number of extracted association rules among human proteins | 34 |
| Number of new interactions predicted | 22 |
| Number of human proteins involved in the predicted interactions | 11 |
| Number of HIV-1 proteins involved in the predicted interactions | 13 |
The resulting set of predictions was compared with the novel predictions by
Tastan et al. [20] described in Section 3.1. Of the 22 interactions predicted by this
apriori approach, 14 were shared by both studies.
3.3.2 ARM Using Biclustering Approach
In [28], the authors suggested a novel approach for predicting PPIs based on ARM using a
biclustering method, which allows simultaneous clustering of the rows and columns of a matrix.
They extracted association rules among HIV-1 proteins as well as among human proteins
using the information about known interactions between HIV-1 and human proteins. These
rules were used to predict new viral-host PPIs that are not present in the database.
The information about the PPIs found in HHPID was arranged in the same way as
in [27]. The authors considered 19 HIV-1 proteins and 1432 human proteins to form the
matrix HV, with rows indicating the human proteins and columns indicating the HIV-1
proteins. The transpose of the matrix HV, denoted by VH and of size 19 × 1432, was also computed.
In both matrices HV and VH, each row was treated as a transaction and each column
was treated as an item, and an algorithm for obtaining maximal biclusters (biclusters
that are not a proper subset of any other bicluster) was applied. The set of items of each maximal all-1
bicluster satisfying the minimum threshold value constitutes an FCI. Thus, after applying the
BiMax algorithm (Prelic et al., 2006) to the matrices HV and VH individually, all maximal
biclusters were obtained and hence all FCIs, which are the condensed representation
(non-redundant minimal representation) of all frequent itemsets. The FCIs extracted from the
matrix HV gave the association rules among the viral proteins, and the association rules
among the human proteins were obtained from the FCIs found from the VH matrix. Each rule
generated from the matrix HV was associated with a set of human proteins for which the
rule is valid. Similarly, a set of HIV-1 proteins was associated with each rule extracted from
the matrix VH. The representation and interpretation of the generated rules are the same as
described in Section 3.3.1.
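The correspondence between maximal all-1 biclusters and closed itemsets can be illustrated on a toy matrix. The sketch below is not the BiMax algorithm: it simply closes the set of row itemsets under intersection, which is an assumed shortcut that works for small inputs. The row and column names are placeholders.

```python
def maximal_all_ones_biclusters(rows, min_rows=1):
    """Enumerate maximal all-1 biclusters of a binary matrix.

    rows: row name -> set of column names that hold a 1 in that row.
    Returns (row set, column set) pairs; each column set is a closed itemset.
    """
    closed = set()
    frontier = {frozenset(items) for items in rows.values()}
    # Close the family of row itemsets under pairwise intersection.
    while frontier:
        closed |= frontier
        frontier = {a & b for a in closed for b in closed} - closed
    closed.discard(frozenset())
    biclusters = []
    for items in closed:
        # Every row containing all the columns belongs to the bicluster.
        supporting = frozenset(r for r, cols in rows.items() if items <= cols)
        if len(supporting) >= min_rows:
            biclusters.append((supporting, items))
    return biclusters

# Toy HV-style matrix: rows are human proteins, columns are HIV-1 proteins.
hv_rows = {
    "HPa": {"VP1", "VP2"},
    "HPb": {"VP1", "VP2", "VP3"},
    "HPc": {"VP2", "VP3"},
}
biclusters = maximal_all_ones_biclusters(hv_rows, min_rows=2)
```

Each returned column set cannot be extended without losing a supporting row, and each row set already contains every row that fits, which is what makes the bicluster maximal. The `min_rows` argument plays the role of the minimum support threshold.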
The association rules extracted from both matrices were filtered to obtain the set
of most useful association rules by removing less-confident and redundant rules. A rule
R2 is called redundant if there exists another rule R1 with the same consequent as R2 such that the
antecedent of R2 is a proper subset of the antecedent of R1, while both R1 and R2 have confidence
greater than or equal to the minimum threshold. From this high-confident, non-redundant set
of association rules, new interactions between HIV-1 and human proteins were predicted
based on the interpretation of the association rules. A total of 140 novel interactions were
predicted from the HV matrix and 43 from the VH matrix. The step-by-step results are shown in
Table 11. Only three interactions were common to the two sets of new
predictions from the HV and VH matrices, which reveals the necessity of taking both matrices into
account. Hence, after taking the union of the two sets, 180 unique novel interactions were obtained,
which involve 17 HIV-1 proteins (capsid, gp120, gp160, gp41, Tat, integrase, Gag_Pr55,
matrix, Nef, nucleocapsid, p6, protease, Rev, RT, Vif, Vpr, Vpu) and 140 human proteins.
Table 11 Result Found Using Biclustering Approach By Mukhopadhyay et al. [28]

| | From HV matrix (min_support=20, min_confidence=70%) | From VH matrix (min_support=5, min_confidence=70%) |
| --- | --- | --- |
| Number of all-1 maximal biclusters/FCIs generated | 48 | 74 |
| Number of generated association rules with single consequent | 123 | 361 |
| Number of high-confident association rules generated | 26 | 50 |
| Number of high-confident and nonredundant rules generated | 15 | 36 |
| Number of novel interactions predicted | 140 | 43 |
| Number of HIV-1/human proteins involved in the predicted interactions | 130 (Human Proteins) | 15 (HIV-1 Proteins) |
In [29] and [30], the biclustering based ARM approach to predicting PPIs was
extended by integrating information on the interaction type (Section 2) and the direction
of regulation of the interaction (HIV-1-to-human or human-to-HIV-1), which provides
valuable additional information regarding the type and regulation pattern of each
predicted interaction. There are 68 unique interaction types reported in HHPID, which can
be divided into three classes: regulating (the direction of regulation is from HIV-1 to human
proteins), regulated by (the direction of regulation is from human to HIV-1 proteins) and
bidirectional (Table 2). In [29], only regulating and bidirectional interactions
were considered to predict HIV-1-human PPIs; the authors then extended their research
in [30] by considering all three classes of interactions in predicting viral-host PPIs.
Each human protein that is known to interact with HIV-1 proteins was annotated with its
corresponding interaction type. A total of 2564 annotated human proteins were obtained
under the regulating and bidirectional classes, and 1271 annotated human proteins under
the regulated by and bidirectional classes. Two matrices, HV_positive of dimension
19 × 2564 and HV_negative of dimension 19 × 1271, with rows representing viral proteins
and columns representing annotated human proteins, were computed. The presence of an
interaction between the corresponding human and HIV-1 protein pair is indicated by 1
in the HV_positive matrix and −1 in the HV_negative matrix; in both matrices, × and 0
represent a bidirectional interaction and the absence of interaction, respectively. The algorithm
for finding maximal biclusters with a minimum support threshold value was applied to the
HV_positive and HV_negative matrices individually to obtain the FCIs, and hence the association
rules, in a similar way as described previously. The rules generated from HV_positive and
HV_negative take the form of Rule 1 and Rule 2, respectively. Moreover, in [30] the
association rules among HIV-1 proteins were also extracted from the maximal biclusters
obtained from the transposes of the matrices HV_positive and HV_negative. This type of rule is
illustrated by Rule 3.
Rule 1: $\{HP_1\_\text{cleaves}, HP_2\_\text{inhibits}, HP_3\_\text{stimulates} \Rightarrow HP_4\_\text{upregulates}, HP_5\_\text{inhibits}\}$.
It can be interpreted as follows: if the human protein $HP_1$ is cleaved, $HP_2$ is inhibited and $HP_3$
is stimulated by some HIV-1 proteins, then there is a possibility that the same set of viral
proteins upregulates $HP_4$ and inhibits $HP_5$.
Rule 2: $\{HP_6\_\text{activatedby}, HP_7\_\text{enhancedby} \Rightarrow HP_8\_\text{modifiedby}, HP_9\_\text{inhibitedby}\}$.
It can be interpreted as follows: if the human protein $HP_6$ activates and $HP_7$ enhances some HIV-1
proteins, then $HP_8$ and $HP_9$ have the possibility to modify and inhibit the same set of
HIV-1 proteins, respectively.
Rule 3: $\{VP_1, VP_2, VP_3 \Rightarrow VP_4, VP_5\}$. It can be interpreted as follows: if there exist physical
interactions of viral proteins $VP_1$, $VP_2$ and $VP_3$ with some human proteins, then $VP_4$ and
$VP_5$ are also likely to interact with the same set of human proteins.
After extracting association rules among human proteins annotated with interaction type,
novel interactions, each associated with an interaction type, were predicted between HIV-1
and human proteins. For example, suppose that the antecedent of Rule 1, mentioned
above, is valid for a set of viral proteins ($VP_1, VP_2, VP_3, VP_4, VP_5$) and the consequent
is true for ($VP_1, VP_2, VP_3, VP_4$). Thus, the confidence of the rule is 80% (4 out of 5).
Hence, it can be predicted that the viral protein $VP_5$ is likely to interact with the human
proteins $HP_4$ and $HP_5$ with the interaction types ‘upregulates’ and ‘inhibits’, respectively. In a similar
way, new interactions were also predicted from rules of the types represented by Rule 2 and
Rule 3. In [29], 17 maximal biclusters were found with the minimum threshold value, from which
46 association rules were predicted. Among the 46 predicted interactions, 26 interactions
were found to be true positive interactions. In [30], 19 biclusters were obtained from the
matrix HV_positive, including those found in [29], and the matrix HV_negative
also generated 19 biclusters. A total of 93 association rules among human proteins and
33 association rules among viral proteins were generated from the two matrices. The
total number of new interactions predicted from the biclusters extracted from HV_positive
and HV_negative was 114, among which 59 predictions were found to be experimentally
validated. Table 12 describes the individual results obtained from the two matrices.
Table 12 Result Found Using Biclustering Method Incorporating Interaction Type and Regulation
Direction by Mukhopadhyay et al. [30]

| | From HV_positive Matrix | From HV_negative Matrix |
| --- | --- | --- |
| Number of maximal biclusters obtained (Minimum number of HIV-1 proteins, Minimum number of human proteins) | 19 (4, 2) | 19 (3, 2) |
| Number of association rules generated among human proteins | 62 | 31 |
| Number of association rules generated among HIV-1 proteins | 26 | 7 |
| Number of predicted interactions | 64 | 50 |
| Number of human proteins involved in the predicted interactions | 31 | 32 |
| Number of HIV-1 proteins involved in the predicted interactions | 8 | 13 |
| Number of predicted interactions which have been found to be experimentally validated | 35 | 24 |

However, the above-mentioned approaches modeled the HIV-1-human PPI network as
a binary matrix without considering the interaction strengths and mined association rules
only from complete bicliques (all-1 biclusters). In [31], the authors searched for
quasi-bicliques in the weighted interaction graph of HIV-1 and human proteins in order
to generate more relevant information about the proteins. A quasi-biclique is equivalent to a
bicluster having a small number of zero elements and a large mean interaction strength (MIS),
thus relaxing the stringent requirement of bicliques that all entries be non-zero. Here,
the process of finding quasi-bicliques was treated as a biclustering problem in a weighted
graph. Since the BiMax algorithm cannot be applied to a weighted graph, a multiobjective
genetic algorithm-based biclustering technique (MOBICLUST) was proposed to generate
dense biclusters having high MIS from the HIV-1-human bipartite PPI network, where the two sets
of nodes represent HIV-1 and human proteins respectively and the edges denote the interactions.
The authors categorized the interactions into three types: direct physical interactions,
indirect interactions and novel interactions predicted by Tastan et al. [20]. Each
interaction was assigned a prediction score in [20], as described in Section 3.1. According
to these scores, the authors filtered the interactions using a lower threshold of 2.0 and
obtained a PPI network with 17 nodes of HIV-1 proteins, 1403 nodes of human proteins
and 617 weighted edges of interactions. The application of the MOBICLUST algorithm to
this PPI network generated 26 biclusters. After removing the biclusters with MIS less than
2.0, 14 biclusters were obtained. Due to the large overlap of interactions within these
biclusters, the union of the 14 biclusters was taken, which represents a strong interaction module.
This interaction module involves 7 HIV-1 proteins (gp120, gp160, gp41, matrix, nef, RT,
tat) and 15 human proteins (AP2B1, CALM1, CALM3, CD4, CXCR4, LCK, MAPK1,
PRKACA, PRKCB1, PRKCD, PRKCE, PRKCG, PRKCI, PRKCQ, TP53). The number of
interactions contained in this module is 75, out of which 57 are direct physical interactions.
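The MIS filter can be sketched as follows. The exact definition of MIS used in [31] is not given in this review, so the sketch assumes it is the mean edge weight over all cells of the bicluster, with missing edges counted as zero. The weights and protein names are placeholders, and the multiobjective search of MOBICLUST itself is not reproduced.

```python
def mean_interaction_strength(weights, viral, human):
    """Mean edge weight over every (viral, human) cell of a bicluster.

    weights: dict mapping (viral, human) to an edge weight; a missing edge counts as 0.
    """
    total = sum(weights.get((v, h), 0.0) for v in viral for h in human)
    return total / (len(viral) * len(human))

def filter_by_mis(biclusters, weights, min_mis):
    """Keep the biclusters whose mean interaction strength reaches min_mis."""
    return [(viral, human) for viral, human in biclusters
            if mean_interaction_strength(weights, viral, human) >= min_mis]

# Toy weighted bipartite network for illustration only.
weights = {("VP1", "HP1"): 3.0, ("VP1", "HP2"): 2.0, ("VP2", "HP1"): 1.0}
candidate_biclusters = [
    (("VP1",), ("HP1", "HP2")),         # dense: mean of 3.0 and 2.0
    (("VP1", "VP2"), ("HP1", "HP2")),   # contains one missing edge
]
strong = filter_by_mis(candidate_biclusters, weights, min_mis=2.0)
```

The second candidate is a quasi-biclique with one zero entry; its MIS of 1.5 falls below the cut-off of 2.0, so only the first candidate is kept.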
3.3.3 Integration of ARM with Biclusters Based on Closure Lattice
In [32], the authors proposed the algorithm FIST (Frequent Itemset mining using Suffix
Trees), which utilizes the concept of the subset lattice in ARM. The algorithm combines the
generation of FCIs, the generators of each FCI, the minimal non-redundant cover of association
rules and hierarchical conceptual biclusters into a single process. It was applied to the
information about known PPIs along with some biological and bibliographical annotations
to extract association rules and predict new interactions between HIV-1 and human proteins
[33]. By applying FIST, the authors obtained knowledge patterns that represent three
kinds of relationships: between HIV-1 and human proteins, among HIV-1 proteins, and among
human proteins.
1. $VP_1, VP_2, \ldots, VP_m \Leftrightarrow HP_1, HP_2, \ldots, HP_n$
2. $VP_1, VP_2, \ldots, VP_m \Rightarrow VP_{m+1}, VP_{m+2}, \ldots, VP_{m+n}$
3. $HP_1, HP_2, \ldots, HP_m \Rightarrow HP_{m+1}, HP_{m+2}, \ldots, HP_{m+n}$
where $VP_i$ and $HP_i$ ($i = 1, 2, \ldots, m, m+1, \ldots, m+n$) denote viral and human
proteins respectively.
The first input dataset to the algorithm was created by taking 19 HIV-1 proteins as
columns and 1433 human proteins as rows. Each cell of the matrix was marked as 1 if the
corresponding pair of HIV-1 and human proteins interact, and as a question mark otherwise.
Thereafter, some additional attributes were merged into the first matrix to form a
second dataset that represents GO annotations and bibliographic references corresponding
to each row of human proteins along with the information on viral-host interactions. These
annotations form a nominal dataset that includes 1149 unique GO terms taken from the GO
website and 2670 unique publications collected from the NCBI website. Both FIST and Apriori
were applied to the first dataset for performance comparison, and it was found that FIST
generated far fewer association rules than Apriori, allowing analysts to focus on the
relevant rules. Additionally, unlike Apriori, FIST can generate biclusters and
provide a set of valid objects (human proteins) with each association rule.
Apriori could not be executed on the second dataset because of the high memory consumption
of the frequent itemsets generated. Since FIST mines FCIs rather than all frequent itemsets,
it performed well on the second dataset.
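The saving obtained by mining closed itemsets instead of all frequent itemsets can be seen on a very small example. The sketch below is a brute-force enumeration, not FIST and not Apriori; the matrix contents, the threshold, and the function names are assumptions made purely for illustration. It uses the same encoding as the first dataset, with "1" for an interaction and "?" for none.

```python
from itertools import combinations

# Toy matrix in the "1" / "?" encoding; rows are human proteins, columns are
# HIV-1 proteins. All identifiers and values are invented placeholders.
viral = ["VP1", "VP2", "VP3"]
matrix = {
    "HP1": ["1", "1", "1"],
    "HP2": ["1", "1", "?"],
    "HP3": ["1", "1", "1"],
    "HP4": ["?", "1", "?"],
    "HP5": ["1", "?", "1"],
}

# One transaction (set of interacting viral proteins) per human protein.
transactions = [
    {vp for vp, cell in zip(viral, cells) if cell == "1"}
    for cells in matrix.values()
]


def support_count(itemset, rows):
    """Number of rows that contain every item of `itemset`."""
    return sum(1 for row in rows if itemset <= row)


def frequent_itemsets(rows, min_count):
    """Enumerate every non-empty itemset with support >= min_count."""
    items = sorted(set().union(*rows))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            count = support_count(itemset, rows)
            if count >= min_count:
                result[itemset] = count
    return result


def closed_itemsets(frequent):
    """Keep itemsets that have no proper superset with the same support."""
    return {
        itemset: count
        for itemset, count in frequent.items()
        if not any(itemset < other and count == frequent[other] for other in frequent)
    }


frequent = frequent_itemsets(transactions, min_count=2)
closed = closed_itemsets(frequent)
print(len(frequent), "frequent itemsets,", len(closed), "closed itemsets")
```

On this toy matrix the closed itemsets are a strict subset of the frequent ones, and the gap typically widens as columns are added, which is consistent with the memory problem reported for Apriori on the annotated dataset.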
A total of 1346 unique HIV-1-human protein pairs were predicted to interact by the FIST
approach. The authors validated their predicted interactions by comparing them with those
predicted by Tastan et al. using a supervised approach (Section 3.1). Figure 7 depicts the
interacting protein pairs predicted by Tastan et al. that are covered by at least one FCI
generated from FIST.
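A comparison of this kind can be sketched as a coverage check. The code below assumes, as an interpretation rather than the authors' stated procedure, that a predicted pair is "covered" when its viral protein belongs to a closed itemset and its human protein belongs to the set of objects supporting that itemset. All identifiers and data are placeholders.

```python
def covered_pairs(predicted_pairs, biclusters):
    """Return the predicted (viral, human) pairs that fall inside some bicluster.

    Each bicluster is a (viral_set, human_set) tuple: a closed itemset of viral
    proteins together with the human proteins that support it.
    """
    covered = set()
    for viral_set, human_set in biclusters:
        covered |= {
            (vp, hp)
            for vp, hp in predicted_pairs
            if vp in viral_set and hp in human_set
        }
    return covered


# Placeholder biclusters and placeholder predictions from another method.
biclusters = [
    ({"VP1", "VP2"}, {"HP1", "HP2"}),
    ({"VP3"}, {"HP3"}),
]
predicted = {("VP1", "HP1"), ("VP2", "HP4"), ("VP3", "HP3"), ("VP3", "HP1")}

covered = covered_pairs(predicted, biclusters)
fraction = len(covered) / len(predicted)
print(sorted(covered), fraction)
```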
Table 13: List of URLs Providing Prediction Method and Predicted PPI Dataset
References, each followed by Prediction Method/Supplementary Materials(†) and
PPI Prediction Datasets(‡)

Supervised approach by Tastan et al. [20]
  †http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3263379/bin/NIHMS345575supplement-supplement.pdf
  ‡http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3263379/bin/NIHMS345575supplement-predictions.xls

Semi-supervised approach by Tastan et al. [21]
  ‡http://www.cs.cmu.edu/~qyj/HIVsemi/HIVPPI.embOmodel.all.ave.cutList.cvs

Supervised approach by Dyer et al. [22]
  ‡http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3134873/bin/NIHMS289876supplement-02.txt
  ‡http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3134873/bin/NIHMS289876supplement-03.txt
  ‡http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3134873/bin/NIHMS289876supplement-04.txt

Conformal prediction approach by Nouretdinov et al. [23]
  ‡http://www.clrc.rhul.ac.uk/people/alex/HIVpsb12/supp_files.zip

PWEN-TLM approach by Mei [24]
  ‡http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0079606.s005
  ‡http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0079606.s006

Structural similarity based approach by Doolittle et al. [25]
  ‡http://www.virologyj.com/content/supplementary/1743-422x-7-82-s4.txt

ARM based approach using biclustering method by Mukhopadhyay et al. [28]
  ‡http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0032289.s003

ARM based approach using biclustering method incorporating type and direction
information by Mukhopadhyay et al. [30]
  †http://www.biomedcentral.com/imedia/1211858693119566/supp1.pdf
  ‡http://www.biomedcentral.com/imedia/1432252521195668/supp2.zip

Closure based integrated biclustering approach (FIST) by Mondal et al. [33]
  †http://www.i3s.unice.fr/~pasquier/web/?Research___Softwares___FIST
Figure 7: Overlapping of Supervised Learning Approach by Tastan et al. [20] and
Integrated Biclustering Approach by FIST [33]

4 Conclusion
In this article, we have attempted to outline the significance of interactions between
HIV-1 and the host proteins in the pathogenesis of the lethal disease HIV/AIDS and
consequently in designing antiviral drugs that can suppress HIV-1 infection in the human
body. The antagonistic structure of a drug interferes with different stages of the virus
replication cycle. Therefore, the drug discovery process can be improved by predicting new
interactions between HIV-1 and human proteins, and this has motivated researchers to
take different approaches to predicting novel interactions. All methodologies
are mainly based on PPIs that have been shown experimentally to exist. However, they
utilize information on different PPIs in different ways for predicting interactions.
For example, ARM based approaches utilize only the viral-host PPIs as the backbone for
predicting new PPIs, the structural similarity based approach is completely based on the PPI
information among human proteins, whereas classifier based approaches consider viral-host
PPIs along with intra-human PPIs represented as features. From this study, we have
found that the classifier based approaches predicted a greater number of new interactions
than any other approach; additionally, a greater percentage of predicted interactions (23%)
was found to be true positives in the supervised approach by Tastan et al. However,
classifier models suffer from the lack of experimentally validated non-interacting protein
pairs, although PWEN-TLM was devised to exclude subcellular co-localized protein pairs
from negative samples to achieve better performance and derive a more promising set of
likely interactions. Secondly, the structural similarity based approach is not widely
applicable since crystal structures are not available for all human proteins and a single
protein can be represented by more than one structure. Therefore, a lot of interest has
shifted to ARM based approaches because they utilize only positive samples for prediction.
However, since these approaches do not consider the pattern of non-interaction of proteins
for prediction, they might suffer from a high rate of false positives.
References
[1] AIDSinfo | Information on HIV/AIDS Treatment, Prevention, and Research,
http://www.aidsinfo.nih.gov/education-materials/fact-sheets
[2] UNAIDS 2013 | AIDS by the numbers,
http://www.unaids.org/en/media/unaids/contentassets/documents/unaidspublication/2013/JC2571_AIDS_by_the_numbers_en.pdf
[3] WHO | Data and statistics, http://www.who.int/hiv/data/en/
[4] Alimonti, J.B., Ball, T.B. and Fowke, K.R. (2003) ‘Mechanisms of CD4+ T
lymphocyte cell death in human immunodeficiency virus infection and AIDS’, Journal
of General Virology, Vol. 84, No. 7, pp. 1649–1661.
[5] Wu, Y. (2004) ‘HIV-1 gene expression: lessons from provirus and non-integrated
DNA’, Retrovirology, 1:13.
[6] Freed, E.O. (2001) ‘HIV-1 replication’, Somatic Cell and Molecular Genetics, 26(1-6):13-33.
[7] Frankel, A.D. and Young, J.A.T. (1998) ‘HIV-1: fifteen proteins and an RNA’, Annu.
Rev. Biochem., Vol. 67, pp. 1–25.
[8] Karn, J. and Stoltzfus, C.M. (2012) ‘Transcriptional and Posttranscriptional Regulation
of HIV-1 Gene Expression’, Cold Spring Harbor Perspectives in Medicine, 4:a006916.
[9] Malim, M.H. and Emerman, M. (2008) ‘HIV-1 accessory proteins–ensuring viral
survival in a hostile environment’, Cell Host and Microbe, Vol. 3, No. 6 pp. 388–398.
[10] Freed, E.O. (1998) ‘HIV-1 gag proteins: diverse functions in the virus life cycle’,
Virology, Vol. 251, No. 1, pp. 1–15.
[11] Suñé, C. and García-Blanco, M.A. (1995) ‘Sp1 transcription factor is required for in
vitro basal and Tat-activated transcription from the human immunodeficiency virus
type 1 long terminal repeat’, Journal of Virology, Vol. 69, No. 10, pp. 6572–6576.
[12] Dorr, P., Westby, M., Dobbs, S., Griffin, P., Irvine, B. et al. (2005) ‘Maraviroc
(UK-427,857), a potent, orally bioavailable, and selective small-molecule inhibitor of
Chemokine receptor CCR5 with broad-spectrum anti-Human Immunodeficiency Virus
Type 1 activity’, Antimicrobial Agents and Chemotherapy, Vol. 49, No. 11, pp. 4721–4732.
[13] HIV/AIDS, NIAID, NIH, http://www.niaid.nih.gov/topics/HIVAIDS/
[14] Fu, W., Sanders-Beer, B., Katz, K., Maglott, D. and Pruitt, K. (2009) ‘Human
immunodeficiency virus type-1, human protein interaction database at NCBI’, Nucleic
Acids Research, Vol. 37, Database Issue, D417–D422.
[15] Ptak, R.G., Fu, W., Sanders-Beer, B.E., Dickerson, J.E., Pinney, J.W. et al. (2008)
‘Cataloguing the HIV Type 1 Human Protein Interaction Network’, AIDS Research
and Human Retroviruses, Vol. 24, No. 12, pp. 1497–1502.
[16] Brass, A.L., Dykxhoorn, D.M., Benita, Y., Yan, N., Engelman, A. et al. (2008)
‘Identification of host proteins required for HIV infection through a functional genomic
screen’, Cell, Vol. 135, pp. 49–60.
[17] Pinney, J.W., Dickerson, J.E., Fu, W., Sanders-Beer, B.E., Ptak, R.G. and Robertson,
D.L. (2009) ‘HIV-host interactions: a map of viral perturbation of the host system’,
AIDS, Vol. 23, No. 5, pp. 549–554.
[18] Ott, D.E. (2008) ‘Cellular proteins detected in HIV-1’, Reviews in Medical Virology,
Vol. 18, No. 3, pp. 159–175.
[19] Park, B. and Han, K. (2006) ‘Web service for predicting interacting proteins
and application to human and HIV-1 Proteins’, Computational Intelligence and
Bioinformatics, Vol. 4155, pp. 631–640.
[20] Tastan, O., Qi, Y., Carbonell, J. and Klein-Seetharaman, J. (2009) ‘Prediction of
interactions between HIV-1 and Human proteins by information integration’, Pacific
Symposium on Biocomputing, 2009:516–527.
[21] Qi, Y., Tastan, O., Carbonell, J., Klein-Seetharaman, J. and Weston, J. (2010)
‘Semi-supervised multi-task learning for predicting interactions between HIV-1 and human
proteins’, Bioinformatics, Vol. 26, No. 18, pp. i645–i652.
[22] Dyer, M., Murali, T. and Sobral, B. (2011) ‘Supervised learning and prediction
of physical interactions between human and HIV proteins’, Infection, genetics and
evolution: journal of molecular epidemiology and evolutionary genetics in infectious
diseases, Vol. 11, No. 5, pp. 917–923.
[23] Nouretdinov, I., Gammerman, A., Qi, Y. and Klein-Seetharaman, J. (2012)
‘Determining confidence of predicted interactions between HIV-1 and human proteins
using conformal method’, Pacific Symposium on Biocomputing, 2012:311–322.
[24] Mei, S. (2013) ‘Probability weighted ensemble transfer learning for predicting
interactions between HIV-1 and human proteins’, PLoS ONE, 8(11):e79606.
[25] Doolittle, J.M. and Gomez, S.M. (2010) ‘Structural similarity-based predictions of
protein interactions between HIV-1 and Homo sapiens’, Virology Journal, 7:82.
[26] Zhao, C. and Sacan, A. (2013) ‘Prediction of HIV-1 and human protein interactions
based on a novel evolution-aware structure alignment method’, BIOCOMP 2013.
[27] Mukhopadhyay, A., Maulik, U., Bandyopadhyay, S. and Eils, R. (2010) ‘Mining
association rules from HIV-Human protein interactions’, International Conference on
Systems in Medicine and Biology (ICSMB), IEEE, pp. 344–348.
[28] Mukhopadhyay, A., Maulik, U. and Bandyopadhyay, S. (2012) ‘A novel biclustering
approach to association rule mining for predicting HIV-1-Human protein interactions’,
PLoS ONE, 7:e32289.
[29] Mukhopadhyay, A., Ray, S. and Maulik, U. (2012) ‘Predicting annotated HIV-1-Human
PPIs using a biclustering approach to association rule mining’, Third International
Conference on Emerging Applications of Information Technology (EAIT), IEEE, pp.
28–31.
[30] Mukhopadhyay, A., Ray, S. and Maulik, U. (2014) ‘Incorporating the type and direction
information in predicting novel regulatory interactions between HIV-1 and human
proteins using a biclustering approach’, BMC Bioinformatics, Vol. 15, No. 26.
[31] Maulik, U., Mukhopadhyay, A., Bhattacharyya, M., Kaderali, L., Brors, B.,
Bandyopadhyay, S. and Eils, R. (2012) ‘Mining quasi-bicliques from HIV-1-human
protein interaction network: a multiobjective biclustering approach’, IEEE/ACM
Transactions on Computational Biology and Bioinformatics, Vol. 10, No. 2, pp. 423–435.
[32] Mondal, K.C., Pasquier, N., Mukhopadhyay, A., Maulik, U. and Bandyopadhyay,
S. (2012) ‘A new approach for association rule mining and bi-clustering using
formal concept analysis’, Machine Learning and Data Mining in Pattern Recognition,
Springer, Vol. 7376, pp. 86–101.
[33] Mondal, K.C., Pasquier, N., Mukhopadhyay, A., Pereira, C.C., Maulik, U. and
Tettamanzi, A. (2012) ‘Prediction of protein interactions on HIV-1-human PPI data
using a novel closure-based integrated approach’, Proceedings of the International
Conference on Bioinformatics Models, Methods and Algorithms, INSTICC, pp. 164–
173.