EURASIP Journal on Bioinformatics and Systems Biology
Genetic Regulatory Networks
Guest Editors: Edward R. Dougherty, Tatsuya Akutsu,
Paul Dan Cristea, and Ahmed H. Tewfik
Copyright © 2007 Hindawi Publishing Corporation. All rights reserved.
This is a special issue published in volume 2007 of “EURASIP Journal on Bioinformatics and Systems Biology.” All articles are open
access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Editor-in-Chief
Ioan Tabus, Tampere University of Technology, Finland
Associate Editors
Jaakko Astola, Finland
Junior Barrera, Brazil
Michael Bittner, USA
Yidong Chen, USA
Paul Dan Cristea, Romania
Aniruddha Datta, USA
Bart De Moor, Belgium
Edward R. Dougherty, USA
Javier Garcia-Frias, USA
Debashis Ghosh, USA
John Goutsias, USA
Roderic Guigo, Spain
Yufei Huang, USA
Seungchan Kim, USA
John Quackenbush, USA
Jorma Rissanen, Finland
Stéphane Robin, France
Paola Sebastiani, USA
Erchin Serpedin, USA
Ilya Shmulevich, USA
Ahmed H. Tewfik, USA
Sabine Van Huffel, Belgium
Yue Wang, USA
Z. Jane Wang, Canada
Contents
Genetic Regulatory Networks, Edward R. Dougherty, Tatsuya Akutsu, Paul Dan Cristea,
and Ahmed H. Tewfik
Volume 2007, Article ID 17321, 2 pages
Analysis of Gene Coexpression by B-Spline Based CoD Estimation, Huai Li, Yu Sun, and Ming Zhan
Volume 2007, Article ID 49478, 10 pages
Gene Systems Network Inferred from Expression Profiles in Hepatocellular Carcinogenesis by
Graphical Gaussian Model, Sachiyo Aburatani, Fuyan Sun, Shigeru Saito, Masao Honda, Shu-ichi Kaneko,
and Katsuhisa Horimoto
Volume 2007, Article ID 47214, 11 pages
Uncovering Gene Regulatory Networks from Time-Series Microarray Data with Variational Bayesian
Structural Expectation Maximization, Isabel Tienda Luna, Yufei Huang, Yufang Yin, Diego P. Ruiz Padillo,
and M. Carmen Carrion Perez
Volume 2007, Article ID 71312, 14 pages
Inferring Time-Varying Network Topologies from Gene Expression Data, Arvind Rao, Alfred O. Hero III,
David J. States, and James Douglas Engel
Volume 2007, Article ID 51947, 12 pages
Inference of a Probabilistic Boolean Network from a Single Observed Temporal Sequence,
Stephen Marshall, Le Yu, Yufei Xiao, and Edward R. Dougherty
Volume 2007, Article ID 32454, 15 pages
Algorithms for Finding Small Attractors in Boolean Networks, Shu-Qin Zhang, Morihiro Hayashida,
Tatsuya Akutsu, Wai-Ki Ching, and Michael K. Ng
Volume 2007, Article ID 20180, 13 pages
Fixed Points in Discrete Models for Regulatory Genetic Networks, Dorothy Bollman, Omar Colón-Reyes,
and Edusmildo Orozco
Volume 2007, Article ID 97356, 8 pages
Comparison of Gene Regulatory Networks via Steady-State Trajectories, Marcel Brun, Seungchan Kim,
Woonjung Choi, and Edward R. Dougherty
Volume 2007, Article ID 82702, 11 pages
A Robust Structural PGN Model for Control of Cell-Cycle Progression Stabilized by Negative Feedbacks,
Nestor Walter Trepode, Hugo Aguirre Armelin, Michael Bittner, Junior Barrera, Marco Dimas Gubitoso,
and Ronaldo Fumio Hashimoto
Volume 2007, Article ID 73109, 11 pages
Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 17321, 2 pages
doi:10.1155/2007/17321
Editorial
Genetic Regulatory Networks
Edward R. Dougherty,1, 2 Tatsuya Akutsu,3 Paul Dan Cristea,4 and Ahmed H. Tewfik5
1 Department of Electrical & Computer Engineering, College of Engineering, Texas A&M University, College Station, TX 77843-3128, USA
2 Translational Genomics Research Institute, Phoenix, AZ 85004, USA
3 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
4 Digital Signal Processing Laboratory, Department of Electrical Engineering, “Politehnica” University of Bucharest, 060032 Bucharest, Romania
5 Department of Electrical and Computer Engineering, Institute of Technology, University of Minnesota, Minneapolis, MN 55455, USA
Received 3 June 2007; Accepted 3 June 2007
Copyright © 2007 Edward R. Dougherty et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
Systems biology aims to understand the manner in which the parts of an organism interact in complex networks, and systems medicine aims to base diagnosis and treatment on a systems-level understanding of molecular interactions, both intra- and intercellular. Ultimately, the enterprise rests on characterizing the interaction of the macromolecules constituting cellular machinery. Genomics, a key driver in this enterprise, involves the study of large sets of genes and proteins, with the goal of understanding systems, not simply components. The major goal of translational genomics is to characterize genetic regulation and its effects on cellular behavior and function, thereby leading to a functional understanding of disease and the development of systems-based medical solutions.
To achieve this goal it is necessary to develop nonlinear dynamical models that adequately represent genomic
regulation and to develop mathematically grounded diagnostic and therapeutic tools based on these models. Signals
generated by the genome must be processed to characterize their regulatory effects and their relationship to changes
at both the genotypic and phenotypic levels. Owing to the
complex regulatory activity within the cell, a full understanding of regulation would involve characterizing signals
at both the transcriptional (RNA) and translational (protein) levels; however, owing to the tight connection between
the levels, a goodly portion of the information is available
at the transcriptional level, and owing to the availability of
transcription-based microarray technologies, most current
studies utilize mRNA expression measurements. Since transcriptional (and posttranscriptional) regulation involves the
processing of numerous and different kinds of signals, mathematical and computational methods are required to model
the multivariate influences on decision-making in genetic
networks.
Construction of a network model is only the beginning
of biological analysis. Understanding a gene network means
understanding its dynamics, especially its long-run behavior.
For instance, it has been conjectured that the stationary distribution characterizes phenotype. It is in terms of dynamics
that issues such as stability, robustness, and therapeutic effects must be examined. Indeed, it seems virtually impossible
to design targeted treatment regimens that address a patient’s
individual regulatory structure without taking into account
the stochastic dynamics of cell regulation. From the perspective of systems medicine, perhaps the most important issue
to be addressed is the design of treatment policies based on
the external control of regulatory network models, since this
is the route to the design of optimal therapies, both in terms
of achieving desired changes and avoiding deleterious side
effects.
As a discipline, signal processing involves the construction of model mathematical systems, including systems of
differential equations, graphical networks, stochastic functional relations, and simulation models. And if we view signal processing in the wide sense, to include estimation, classification, automatic control, information theory, networks,
and coding, we see that genomic signal processing will play
a central role in the development of systems medicine. There
is a host of important and difficult problems, ranging over
issues such as inference, complexity reduction, and the control of high-dimensional systems. These represent an exciting
challenge for the signal processing community and a chance
for the community to play a leading role in the future of
medicine.
Edward R. Dougherty
Tatsuya Akutsu
Paul Dan Cristea
Ahmed H. Tewfik
Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 49478, 10 pages
doi:10.1155/2007/49478
Research Article
Analysis of Gene Coexpression by B-Spline Based
CoD Estimation
Huai Li, Yu Sun, and Ming Zhan
Bioinformatics Unit, Branch of Research Resources, National Institute on Aging, National Institutes of Health,
Baltimore, MD 21224, USA
Received 31 July 2006; Revised 3 January 2007; Accepted 6 January 2007
Recommended by Edward R. Dougherty
The gene coexpression study has emerged as a novel holistic approach for microarray data analysis. Different indices have been used in exploring coexpression relationships, but each is associated with certain pitfalls. Pearson's correlation coefficient, for example, is not capable of uncovering nonlinear patterns or the directionality of coexpression. Mutual information can detect nonlinearity but fails to show directionality. The coefficient of determination (CoD) is unique in exploring different patterns of gene coexpression, but has so far been applied only to discrete data, and the conversion of continuous microarray data to a discrete format could lead to information loss. Here, we propose an effective algorithm, CoexPro, for gene coexpression analysis. The new algorithm is based on B-spline approximation of the coexpression between a pair of genes, followed by CoD estimation. The algorithm was justified by simulation studies and by functional semantic similarity analysis. The proposed algorithm is capable of uncovering both linear and a specific class of nonlinear relationships from continuous microarray data. It can also provide suggestions for possible directionality of coexpression to the researchers. The new algorithm presents a novel model for gene coexpression and will be a valuable tool for a variety of gene expression and network studies. The application of the algorithm was demonstrated by an analysis of ligand-receptor coexpression in cancerous and noncancerous cells. The software implementing the algorithm is available upon request to the authors.
Copyright © 2007 Huai Li et al. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. INTRODUCTION
The utilization of high-throughput data generated by microarray gives rise to a picture of transcriptome, the complete set of genes being expressed in a given cell or organism under a particular set of conditions. With recent interests in biological networks, the gene coexpression study has
emerged as a novel holistic approach for microarray data
analysis [1–4]. The coexpression study by microarray data allows exploration of transcriptional responses that involve coordinated expression of genes encoding proteins which work
in concert in the cell. Most coexpression studies have been based on Pearson's correlation coefficient [1, 2, 5]. The
linear model-based correlation coefficient provides a good
first approximation of coexpression, but is also associated
with certain pitfalls. When the relationship between logexpression levels of two genes is nonlinear, the degree of coexpression would be underestimated [6]. Since the correlation coefficient is a symmetrical measurement, it cannot provide evidence of directional relationship in which one gene
is upstream of another [7]. Similarly, mutual information is
also not suitable for modeling directional relationship, although applied in various coexpression studies [8, 9]. The
coefficient of determination (CoD), on the other hand, is
capable of uncovering nonlinear relationship in microarray
data and suggesting the directionality, thus has been used in
prediction analysis of gene expression, determination of connectivity in regulatory pathways, and network inference [10–
14]. However, CoD has so far been applied only to discrete data in microarray analysis, and continuous microarray data must be converted by quantization to a discrete format prior to application. The conversion by quantization could lead to the loss of important biological information, especially for a dataset with a small sample size and low data quality. Moreover, quantization is a coarse-grained approximation of the gene expression pattern, and the resulting data may represent only a “qualitative” relationship and lead to biologically erroneous conclusions [15].
B-spline is a flexible mathematical formulation for curve
fitting due to a number of desirable properties [16]. Under
the smoothness constraint, B-spline gives the “optimal”
curve fitting in terms of minimum mean-square error [16,
17]. Recently, B-spline has been widely used in microarray
data analysis, including inference of genetic networks, estimation of mutual information, and modeling of time-series
gene expression data [7, 17–23]. In a Bayesian network model
for genetic network construction from microarray data [7],
B-spline has been used as a basis function for nonparametric
regression to capture nonlinear relationships between genes.
In numerical estimation of mutual information from continuous microarray data [23], a generalized indicator function
based on B-spline has been proposed to get more accurate
estimation of probabilities. By treating the gene expression
level as a continuous function of time, B-spline approaches
have been used to cluster genes based on mixture models
[17, 19, 22], and to identify differentially expressed genes over time [18, 21]. All these studies have shown the great usefulness of the B-spline approach for microarray data analysis.
In this study, we proposed a new algorithm, CoexPro,
which is based on B-spline approximation followed by CoD
estimation, for gene coexpression analysis. Given a pair
of genes gx and g y with expression values {(xi , yi ), i =
1, . . . , N }, we first employed B-spline to construct the function relationship y = F(x) of the expression level y of gene
g y given the expression level x of gene gx in the (x, y) plane.
We then computed CoD to determine how well the expression of gene g y is predicted by the expression of gene gx
based on the B-spline model. The proposed model is able to address specific nonlinear relationships in gene coexpression, in addition to linear correlation; it can suggest possible directionality of interactions, and it can be calculated directly from microarray data. We demonstrated the effectiveness of
the new algorithm in disclosing different patterns of coexpression using both simulated and real gene-expression data.
We validated the identified gene coexpression by examining
the biological and physiological significances. We finally used
the proposed method to analyze expression profiles of ligands and receptors in leukemia, lung cancer, prostate cancer, and their normal tissue counterparts. The algorithm correctly identified coexpressed ligand-receptor pairs specific to
cancerous tissues and provided new clues for the understanding of cancer development.
2. METHODS

2.1. Model for gene coexpression of mixed patterns

Given a two-dimensional scatter plot of expression for a pair of genes gx and gy with expression values {(xi, yi), i = 1, ..., N}, it allows us to explore whether there are hidden coexpression patterns between the two genes through modeling the plotted pattern. Here, we propose to use B-spline to model the functional relationship y = F(x) of the expression level y of gene gy given the expression level x of gene gx in the (x, y) plane. Mathematically, it is most convenient to express the curve in the form x = f(t) and y = g(t), where t is some parameter, instead of using an implicit equation involving just x and y. This is called a parametric representation of the curve and has been commonly used in B-spline curve fitting [16].

Once we have the model, we compute CoD to determine how well the expression of gene gy is predicted by the expression of gene gx. The CoD allows measurement of both linear and specific nonlinear patterns and suggests possible directionality of coexpression. Continuous data from microarray can be directly used in the calculation without transformation into the discrete format, hence avoiding potential loss or misrepresentation of biological information.

2.1.1. Two-dimensional B-spline approximation

The two-dimensional (2D) B-spline is a set of piecewise polynomial functions [16]. Using the notion of parametric representation, the 2D B-spline curve can be defined as follows:

  (x, y) = (f(t), g(t)) = Σ_{j=1}^{n+1} B_{j,k}(t) (x̃_j, ỹ_j),  t_min ≤ t < t_max.  (1)

In (1), (x̃_j, ỹ_j), j = 1, ..., n+1 are the n+1 control points assigned from data samples. t is a parameter and lies in the range between the minimum and maximum values of the elements in a knot vector. A knot vector, t_1, t_2, ..., t_{k+(n+1)}, is specified for a given number of control points n+1 and B-spline order k. It is necessary that t_j ≤ t_{j+1}, for all j. For an open curve, an open-uniform knot vector should be used, which is defined as

  t_j = t_1 = 0,  j ≤ k,
  t_j = j − k,  k < j < n + 2,
  t_j = t_{k+(n+1)} = n − k + 2,  j ≥ n + 2.  (2)

For example, if k = 3 and n + 1 = 10, the open-uniform knot vector is equal to [0 0 0 1 2 3 4 5 6 7 8 8 8]. In this case, t_min = 0, t_max = 8, and 0 ≤ t < 8.

The B_{j,k}(t) basis functions are of order k. k must be at least 2, and can be no more than n + 1. The B_{j,k}(t) depend only on the value of k and the values in the knot vector. The B_{j,k}(t) are defined recursively as

  B_{j,1}(t) = 1 if t_j ≤ t < t_{j+1}, and 0 otherwise,
  B_{j,k}(t) = ((t − t_j)/(t_{j+k−1} − t_j)) B_{j,k−1}(t) + ((t_{j+k} − t)/(t_{j+k} − t_{j+1})) B_{j+1,k−1}(t).  (3)

Given a pair of genes gx and gy with expression values {(xi, yi), i = 1, ..., N}, n+1 control points {(x̃_j, ỹ_j), j = 1, ..., n+1} selected from {(xi, yi), i = 1, ..., N}, a knot vector t_1, t_2, ..., t_{k+(n+1)}, and the order k, the plotted pattern can be modeled by (1). In (1), f(t) and g(t) are the x and y components of a point on the curve, and t is a parameter in the parametric representation of the curve.

2.1.2. CoD estimation

If one uses the MSE metric, then CoD is the ratio of the explained variation to the total variation and denotes the strength of association between predictor genes and the target gene. Mathematically, for any feature set X, CoD relative to the target variable Y is defined as CoD_{X→Y} = (ε0 − εX)/ε0, where ε0 is the prediction error in the absence of predictors and εX is the error for the optimal predictor. For the purpose of exploring coexpression patterns, we consider only a pair of genes gx and gy, where gy is the target gene that is predicted by the predictor gene gx. The errors are estimated based on the available samples (resubstitution method) for simplicity.

Given a pair of genes gx and gy with expression values xi and yi, i = 1, ..., N, where N is the number of samples, we construct the predictor y = F(x) for predicting the target expression value y. If the error is the mean-square error (MSE), then the CoD of gene gy predicted by gene gx can be computed according to the definition

  CoD_{gx→gy} = (ε0 − εX)/ε0 = [Σ_{i=1}^{N} (yi − ȳ)² − Σ_{i=1}^{N} (yi − F(xi))²] / Σ_{i=1}^{N} (yi − ȳ)².  (4)

When the relationship is linear or approximately linear, CoD and the correlation coefficient are equivalent measurements, since CoD is equal to R² if F(xi) = m xi + b. As the relationship departs from linearity, however, CoD can capture some specific nonlinear information whereas the correlation coefficient fails. In terms of prediction of direction, both the correlation coefficient and mutual information are symmetrical measurements that cannot provide evidence of which way causation flows. CoD, however, can suggest the direction of a gene relationship. In other words, CoD_{gx→gy} is not necessarily equal to CoD_{gy→gx}. This feature makes CoD uniquely useful, especially in network inference.

The key point in computing CoD from (4) is to find the predictor y = F(x) from continuous data samples (xi, yi). Motivated by the spirit of B-spline, we formulate an algorithm to estimate the CoD from continuous gene expression data. The proposed algorithm is summarized as follows.

Input
(i) A pair of genes gx and gy with expression values xi and yi, i = 1, ..., N. N is the number of samples.
(ii) M, the interval of control points. Given N and M, the number of control points (n + 1) is determined by n = ⌊N/M⌋, where ⌊·⌋ is the floor function.
(iii) Spline order k.

Output
(i) CoD of gene gy predicted by gene gx.

Algorithm
(i) Fit a two-dimensional B-spline curve (x, y) = (f(t), g(t)) in the (x, y) plane based on the (n + 1) control points {(x̃_j, ỹ_j), j = 1, ..., n+1}, a knot vector t_1, t_2, ..., t_{k+(n+1)}, and the order k.
  (1) Find the indices of yi, i = 1, ..., N, such that the xi are ordered as monotonically increasing, (x1 ≤ x2 ≤ · · · ≤ xN); yi is the value corresponding to the same index as xi.
  (2) Assign the (n + 1) control points as (x̃_j, ỹ_j) = (x_{1+(j−1)×M}, y_{1+(j−1)×M}), j = 1, ..., n, and (x̃_{n+1}, ỹ_{n+1}) = (xN, yN).
  (3) Compute the B_{j,k}(t) basis functions recursively from (3).
  (4) Formulate (x, y) = (f(t), g(t)) = Σ_{j=1}^{n+1} B_{j,k}(t) (x̃_j, ỹ_j) based on (1).
(ii) Calculate the CoD of gene gy predicted by gene gx.
  (1) Compute the mean expression value of gy as ȳ = Σ_{i=1}^{N} yi / N.
  (2) For i = 1, ..., N, find ŷi = F(xi) by eliminating t between x = f(t) and y = g(t): first find ti = arg min_t |f(t) − xi|, then compute ŷi = g(ti).
  (3) Calculate CoD from (4) based on the ordered sequence {(xi, ŷi), i = 1, ..., N}; by (4), the CoD value is the same as that calculated from the unordered samples. Including the special cases, we have: (1) if ε0 > 0, compute CoD from (4) when ε0 ≥ εX, and set CoD to 0 otherwise; (2) if ε0 = 0, set CoD to 1 when εX = 0, and to 0 otherwise.
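The algorithm above can be sketched in Python. This is an illustrative reimplementation, not the authors' software: the function names, the 400-point t-grid, and the dense-grid inversion of f(t) in step (ii)(2) are our own choices, made under the parameter conventions of Section 2.1.1 (order k, control-point interval M, open-uniform knots).

```python
import numpy as np

def open_uniform_knots(n, k):
    # Open-uniform knot vector of eq. (2): n+1 control points, order k,
    # vector length k + (n + 1); the parameter t runs over [0, n - k + 2).
    knots = np.empty(k + n + 1)
    for i in range(k + n + 1):        # 0-based index, i = j - 1
        if i < k:
            knots[i] = 0.0
        elif i <= n:
            knots[i] = i - k + 1
        else:
            knots[i] = n - k + 2
    return knots

def bspline_basis(j, k, t, knots):
    # Cox-de Boor recursion of eq. (3); j is 0-based here.
    if k == 1:
        return 1.0 if knots[j] <= t < knots[j + 1] else 0.0
    out = 0.0
    d1 = knots[j + k - 1] - knots[j]
    if d1 > 0:
        out += (t - knots[j]) / d1 * bspline_basis(j, k - 1, t, knots)
    d2 = knots[j + k] - knots[j + 1]
    if d2 > 0:
        out += (knots[j + k] - t) / d2 * bspline_basis(j + 1, k - 1, t, knots)
    return out

def cod_b(x, y, M=3, k=4, grid=400):
    # CoD of gene g_y predicted by gene g_x via a parametric B-spline fit
    # (steps (i)-(ii) of the algorithm in Section 2.1.2).
    x, y = np.asarray(x, float), np.asarray(y, float)
    order = np.argsort(x)                          # step (i)(1): sort by x
    xs, ys = x[order], y[order]
    N = len(xs)
    n = N // M                                     # n+1 control points
    idx = [(j - 1) * M for j in range(1, n + 1)]   # x_{1+(j-1)M}, 0-based
    ctrl = np.array([[xs[i], ys[i]] for i in idx] + [[xs[-1], ys[-1]]])
    knots = open_uniform_knots(n, k)
    ts = np.linspace(knots[0], knots[-1] - 1e-9, grid)
    pts = np.array([[sum(bspline_basis(j, k, t, knots) * ctrl[j, d]
                         for j in range(n + 1)) for d in (0, 1)]
                    for t in ts])                  # (f(t), g(t)) on a t-grid
    # step (ii)(2): y_hat_i = g(t_i) with t_i = argmin_t |f(t) - x_i|
    yhat = np.array([pts[np.argmin(np.abs(pts[:, 0] - xi)), 1] for xi in xs])
    eps0 = np.sum((ys - ys.mean()) ** 2)           # error without a predictor
    epsX = np.sum((ys - yhat) ** 2)                # resubstitution error of F
    if eps0 == 0:                                  # special cases, step (ii)(3)
        return 1.0 if epsX == 0 else 0.0
    return max(0.0, (eps0 - epsX) / eps0)          # eq. (4)
```

On a collinear pair the fitted curve lies on the line, so the sketch returns a CoD near 1, while it also scores highly on a parabola, which Pearson's R² misses.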
2.1.3. Statistical significance
For a given CoD value estimated on the basis of B-spline
approximation (referred to as CoD-B in the following), the
probability (Pshuffle ) of obtaining a larger CoD-B at random
between gene gx and g y is calculated by randomly shuffling
one of the expression profiles through Monte Carlo simulation. In the simulation, a random dataset is created by shuffling the expression profiles of the predictor gene gx and the
target gene g y , and CoD-B is estimated based on the random
dataset. This process is repeated 10,000 times under the condition that the parameters k and M are kept constant, and
the resulting histogram of CoD-B shows that it can be approximated by the half-normal distribution. We then determine Pshuffle according to the derived probability distribution
of CoD-B from the simulation.
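The shuffling procedure can be sketched as a generic Monte Carlo permutation test. The function and parameter names (`p_shuffle`, `n_perm`) are our own, and we use squared Pearson correlation as a stand-in score to keep the sketch self-contained; the paper scores each shuffle with CoD-B and uses 10,000 shuffles.

```python
import numpy as np

def p_shuffle(x, y, score, n_perm=2000, seed=0):
    # Estimate the probability of obtaining a larger score at random by
    # repeatedly shuffling one of the two expression profiles.
    rng = np.random.default_rng(seed)
    observed = score(x, y)
    hits = sum(score(x, rng.permutation(y)) >= observed
               for _ in range(n_perm))
    return hits / n_perm

def r_squared(a, b):
    # Stand-in score; the paper uses CoD-B here instead.
    return float(np.corrcoef(a, b)[0, 1] ** 2)
```

A strongly coexpressed pair should yield a small Pshuffle, since shuffled profiles almost never reach the observed score.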
2.2. Scheme for coexpression identification
Based on the new algorithm developed, we propose a scheme for identifying coexpression of mixed patterns by using CoD-B as the measuring score. We first calculate CoD-B from gene
expression data for each pair of genes under experimental
conditions A and B. For example, condition A represents
the cancer state and condition B represents the normal state.
Then under the cutoff values of CoD-B (e.g., 0.50) and Pshuffle
(e.g., 0.05), we select the set of gene pairs that are significantly
coexpressed under condition A and the set of gene pairs that
are not significantly coexpressed under condition B as follows:
setA := (Coexpressed pairs, satisfy CoD-B ≥ 0.50 AND
Pshuffle < 0.05),
setB := (Coexpressed pairs, satisfy CoD-B < 0.50 AND
Pshuffle < 0.05).
The set of significantly coexpressed gene pairs that differentiates condition A from condition B is chosen as the intersection of setA and setB: setC = setA ∩ setB.
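The selection of setA, setB, and setC can be sketched as set operations over per-pair score tables. The dictionaries, pair labels, and function name below are illustrative placeholders; the 0.50 and 0.05 cutoffs follow the text.

```python
def select_differential_pairs(cod_a, p_a, cod_b, p_b,
                              cod_cut=0.50, p_cut=0.05):
    # setA: pairs significantly coexpressed under condition A
    #       (CoD-B >= cutoff, Pshuffle < cutoff);
    # setB: pairs with CoD-B below the cutoff under condition B;
    # setC = setA ∩ setB differentiates condition A from condition B.
    set_a = {g for g, c in cod_a.items() if c >= cod_cut and p_a[g] < p_cut}
    set_b = {g for g, c in cod_b.items() if c < cod_cut and p_b[g] < p_cut}
    return set_a & set_b
```

For example, a ligand-receptor pair with high CoD-B in the cancer state but low CoD-B in the normal state is retained, while a pair coexpressed in both states is not.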
2.3. Software and experimental validation
We have implemented a Java-based interactive computational tool for the CoexPro algorithm that we have developed. All computations were conducted using the software.
The effects of the number of control points and the order k of the B-spline function for CoD estimation were assessed from the simulated datasets which contain four different coexpression patterns: (1) linear pattern, (2) nonlinear
pattern I (piecewise pattern), (3) nonlinear pattern II (sigmoid pattern), and (4) random pattern for control. Each
dataset contained 31 data points. The coexpression profiles
of the four simulated patterns are shown in Supplementary
Figures S1A, S1C, S1E, and S1G (supplementary figures are
available at doi:10.1155/2007/49478). For each pattern, the
averaged CoD (CoD̄) and Z-Score (Z) values were calculated under different B-spline orders (k) and control-point intervals (M). For computing CoD̄ and the Z-Score, the original dataset was shuffled 10,000 times. CoD̄ was obtained by averaging the CoD values of the shuffled data. The Z-Score was calculated as Z = (CoD − CoD̄)/σ, where CoD was estimated from the original dataset and σ was the standard deviation of the shuffled CoD values.
The CoexPro algorithm was first validated for its ability to capture different coexpression patterns by comparing the results from CoD-B, CoD estimated from quantized data (referred to as CoD-Q in the following), and the correlation coefficient (R). The validation was conducted on
the four simulated datasets described above and four real
expression datasets representing four different coexpression
patterns (normal tissue array data; obtained from the GEO
database with the accession number GSE 1987). The coexpression profiles of the four real-data patterns are shown in
Supplementary Figures S1B, S1D, S1F, and S1H. To obtain quantized data, gene expression values were discretized into three categories: over-expressed, equivalently expressed, and under-expressed, depending on whether the expression level was significantly greater than, similar to, or lower than the respective control threshold [11, 14]. Since some genes had a small natural range of variation, z-transformation was used to normalize the expression of genes across experiments, so that the relative expression levels of all genes had the same mean and standard deviation. The control threshold was then set to one standard deviation for the quantization.
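The quantization used for CoD-Q can be sketched as follows. The +1/0/−1 coding and the function name are our own conventions; the per-gene z-transformation and the one-standard-deviation threshold follow the text.

```python
import numpy as np

def quantize_ternary(expr, thresh=1.0):
    # z-transform each gene (row) across experiments (columns), then map
    # to +1 (over-expressed), 0 (equivalently expressed), or -1
    # (under-expressed) using a one-standard-deviation control threshold.
    expr = np.asarray(expr, dtype=float)
    z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
    q = np.zeros(z.shape, dtype=int)
    q[z > thresh] = 1
    q[z < -thresh] = -1
    return q
```

This is exactly the kind of coarse-graining the introduction warns about: a single extreme sample dominates the gene's standard deviation, and most of the continuous variation collapses to 0.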
The proposed algorithm was next validated for its ability to identify biologically significant coexpression. The validation was conducted by functional semantic similarity analysis. The analysis was based on the Gene Ontology (GO), in which each gene is described by the set of GO terms for molecular function, biological process, or cellular component with which the gene is associated (http://www.geneontology.org). The
functional semantic similarity of a pair of genes gx and g y
was measured by the number of GO terms that they shared
(GOgx ∩ GOg y ), where GOgx denotes the set of GO terms for
gene gx and GOg y denotes the set of GO terms for gene g y .
The semantic similarity was set to zero if one or both genes
EURASIP Journal on Bioinformatics and Systems Biology
had no GO terms. The semantic similarity was calculated
from six sets of coexpression gene pairs: (1) those nonlinear coexpression pairs identified by CoD-B; (2) those linear
coexpression pairs identified by CoD-B; (3) those nonlinear
coexpression pairs identified by CoD-Q; (4) those linear coexpression pairs identified by CoD-Q; (5) those coexpression
pairs identified by correlation coefficient (R); and (6) those
from randomly selected gene pairs. The real gene expression
data used in this analysis were Affymetrix microarray data
derived from the normal white blood cell (obtained from the
GEO database with the accession number GSE137). The resulting distributions of similarity scores from the six gene
pair data sets were examined by the Kolmogorov-Smirnov
test for the statistical differences.
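The shared-term similarity measure can be sketched directly; the GO identifiers used in the example are arbitrary placeholders.

```python
def semantic_similarity(go_x, go_y):
    # Number of GO terms shared by the two genes, |GO_gx ∩ GO_gy|;
    # defined as zero when either gene has no GO annotation.
    if not go_x or not go_y:
        return 0
    return len(set(go_x) & set(go_y))
```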
The proposed algorithm was finally validated by a case
study on ligand-receptor coexpression in cancerous and normal tissues. The ligand-receptor cognate pair data were obtained from the database of ligand-receptor partners (DLRP)
[5]. The gene expression data used in this study included
Affymetrix microarray data derived from dissected tissues of
acute myeloid leukemia (AML), lung cancer, prostate cancer, and their normal tissue counterparts (downloaded from
the GEO database with accession numbers GSE 995, GSE
1987, GSE 1431, resp.). Each of these microarray datasets
contained about 30 patient cancer samples and 10 normal
tissue samples. The array data were normalized by the robust
multiarray analysis (RMA) method [24].
3. RESULTS AND DISCUSSION

3.1. B-spline function and optimization
We applied the B-spline function for approximation of the
plotted pattern of a pair of genes, prior to CoD estimation of
coexpression. The shape of a curve fitted by B-spline is specified by two major parameters: the number of control points
sampled from data and the B-spline order k. Under different control points, the shape of a modeling curve would be
different. On the other hand, increasing the order k would
increase the smoothness of a modeling curve. We assessed
these parameters for their influence on the CoD estimation.
The assessment was conducted based on four coexpression
patterns derived by simulation: (1) linear pattern, (2) nonlinear pattern I (piecewise pattern), (3) nonlinear pattern II
(sigmoid pattern), and (4) random pattern (see Section 2).
The coexpression profiles of the four simulated patterns are
shown in Supplementary Figure S1. Figures 1(a) and 1(b)
show plots of the averaged CoD (CoD̄) and Z-Score, respectively, under different B-spline orders (k) at fixed M = 3. CoD̄ was computed based on 10,000 shuffled data sets, and the Z-Score was calculated as Z = (CoD − CoD̄)/σ, where CoD was estimated from the original dataset and σ was the standard deviation. A high Z-Score value indicated that the CoD estimated
from the real pattern was beyond random expectation. As
indicated, Z-Score showed no sign of improvement when k
increased up to 4 or above in both linear and nonlinear coexpression patterns. Figures 1(c) and 1(d) show plots of CoD
and Z-Score, respectively, under different number M of control point intervals at fixed k = 4. As indicated, at M = 1
Figure 1: Estimation of averaged CoD and significance at different spline orders k and control point intervals M under linear, nonlinear I
(piecewise pattern), nonlinear II (sigmoid pattern), and random coexpression patterns. The data sets of the four patterns were generated by
simulation. The averaged CoD and significance were calculated from 10,000 shuffled realizations of the dataset. (a) and (b) show averaged
CoD and significance calculated under different spline orders k at fixed M = 3. (c) and (d) show averaged CoD and significance calculated
under different number M of control point intervals at fixed k = 4.
(i.e., all data points from samples were used as the control
points), a data over-fitting phenomenon was observed, where
CoD was high but Z-Score was low in all data patterns. The
increase of M led to a decrease of CoD and an increase of Z-Score. Based on these results, and taking into account the small sample sizes of microarray data, we set M = 3 and k = 4 empirically for the identification of coexpression in this study.
3.2. Justification of algorithm
In order to justify our algorithm, we compared CoD-B, CoD-Q, and the correlation coefficient (R) for their power of capturing different coexpression patterns, particularly nonlinear and directional relationships. Four different coexpression
patterns were analyzed: linear, nonlinear I (piecewise pattern), nonlinear II (sigmoid pattern), and random patterns
(see Section 2; Supplementary Figure S1). Table 1 shows the
results. As expected, for the linear coexpression pattern,
CoD-B, CoD-Q, and R2 values were all significantly high
and CoD-B performed well in both simulated and real data
(p-value < 1.0E-6) (see Table 1). For the random pattern,
both CoD-B and R2 were very low as expected. But CoD-Q
failed to uncover the random pattern, showing significantly
high values (0.68 in the simulated data set and 0.65 in the
Table 1: Comparison of CoD estimated by our algorithm (CoD-B), CoD estimated from quantized data (CoD-Q), and the correlation coefficient (R2) under different coexpression patterns. Pshuffle values are given in parentheses.

Coregulated     Simulated data                                          Real data
pattern         CoD-B            CoD-Q           R2                     CoD-B            CoD-Q           R2
Linear          0.98 (1.0E-6)    0.98 (1.0E-6)   0.99 (1.0E-6)          0.65 (1.0E-6)    0.68 (3.3E-2)   0.68 (4.7E-3)
Nonlinear-I     0.94 (1.0E-6)    0.80 (1.0E-6)   1.8E-5 (9.5E-2)        0.68 (4.6E-3)    0.84 (1.2E-3)   0.31 (2.1E-3)
Nonlinear-II    0.98 (1.0E-6)    0.93 (1.0E-6)   0.57 (1.0E-6)          0.79 (8.2E-3)    0.79 (6.8E-3)   0.10 (1.9E-2)
Random          1.0E-5 (6.2E-1)  0.68 (7.4E-1)   0.0026 (4.3E-1)        1.0E-5 (6.6E-1)  0.65 (3.3E-1)   0.051 (2.5E-1)
real-array data). For the nonlinear patterns, both CoD-B and
CoD-Q performed well with significantly high values, while
R2 was low and unable to reveal the patterns. As shown in
Table 1, for the nonlinear pattern I, CoD-B was 0.94 with
p-value 1.0E-6, CoD-Q was 0.80 with p-value 1.0E-6, while
R2 was 1.8E-5 with p-value 9.5E-2 in the simulated data. In
the real data, CoD-B was 0.68 with p-value 4.6E-3, CoD-Q
was 0.84 with p-value 1.2E-3, while R2 was 0.31 with p-value
2.1E-3. A similar trend was also observed for the nonlinear
pattern II (see Table 1).
Exploring nonlinear coexpression patterns and directional relationships in gene expression is important for gene regulation and pathway studies. The two nonlinear patterns examined in this study can represent different biological events. Nonlinear pattern I (piecewise pattern; Supplementary Figures S1C–S1D) may represent a negative feedback event: genes gx and gy initially have a positive correlation until gx reaches a certain expression level, after which the correlation becomes negative. Nonlinear pattern II (sigmoid pattern; Supplementary Figures S1E–S1F) may represent two consecutive biological events: threshold and saturation. Initially, gene gx's expression level increases without affecting gene gy's expression activity. When the level of gx reaches a certain threshold, gy's expression starts to increase with gx. After gx's level reaches a second threshold, its effect on gy becomes saturated and gy's level plateaus. The directional relationship, particularly the interaction between transcription factors and their targets, is in turn an important component of gene regulatory networks and pathways. Our algorithm provides an effective means to analyze nonlinear coexpression patterns and to uncover directional relationships from microarray gene expression data.
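The two nonlinear patterns described above can be generated for simulation. This is a hedged sketch: the breakpoint, thresholds, sample size, and noise level below are illustrative assumptions, not the parameters used for Supplementary Figure S1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
gx = np.sort(rng.uniform(0.0, 1.0, n))        # predictor gene levels
noise = lambda: rng.normal(0.0, 0.05, n)

# Linear pattern: gy tracks gx over the whole range.
gy_linear = gx + noise()

# Nonlinear pattern I (piecewise): positive correlation until gx
# passes a breakpoint, negative afterwards (negative-feedback-like).
bp = 0.5
gy_piecewise = np.where(gx < bp, gx, 2 * bp - gx) + noise()

# Nonlinear pattern II (sigmoid): flat below a first threshold,
# rising with gx, then saturating above a second threshold.
gy_sigmoid = 1.0 / (1.0 + np.exp(-20.0 * (gx - 0.5))) + noise()

# Random pattern: no relation between gx and gy.
gy_random = rng.uniform(0.0, 1.0, n)
```

Note that the piecewise pattern has near-zero overall correlation despite a strong functional relationship, which is exactly the situation where CoD-based measures outperform R2.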
In this study, for simplicity, we estimated the errors arising in the CoD-B and CoD-Q calculations by the resubstitution method based on the available samples. Other methods, such as bootstrapping, could also be applied for the error estimation, especially when the sample size is small. In exploring coexpression patterns, the current version of our algorithm deals with a pair of genes gx and gy, where gy is the target gene predicted by the predictor gene gx. In the future, we plan to extend our algorithm to explore multivariate gene relations as well.
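The resubstitution CoD described above can be sketched as follows. This is an illustrative stand-in, not the paper's method: a cubic polynomial replaces the B-spline regression, and the tent-shaped example data are assumed.

```python
import numpy as np

def cod_resubstitution(gx, gy, fit_predict):
    """CoD = (eps0 - eps) / eps0 by resubstitution.

    eps0 is the MSE of the best constant predictor of the target gy
    (its mean); eps is the resubstitution MSE of a regression of gy
    on the predictor gx.  fit_predict(gx, gy) returns fitted values.
    """
    eps0 = np.mean((gy - gy.mean()) ** 2)
    eps = np.mean((gy - fit_predict(gx, gy)) ** 2)
    return max(0.0, (eps0 - eps) / eps0)

# A cubic polynomial stands in for the paper's B-spline regression.
cubic = lambda x, y: np.polyval(np.polyfit(x, y, 3), x)

gx = np.linspace(0.0, 1.0, 60)
gy = np.abs(gx - 0.5)       # tent-shaped target: nonlinear pattern I
cod = cod_resubstitution(gx, gy, cubic)   # high, although R^2 is ~0
```

Because the predictor and target play asymmetric roles in eps, swapping gx and gy generally changes the CoD, which is what suggests the directionality of an interaction.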
3.3. Biological significance of coexpression identified by CoD-B
We validated our algorithm's ability to capture biologically meaningful coexpression by functional semantic similarity analysis of the identified coexpressed genes. The semantic similarity measures the number of gene ontology (GO) terms shared by two coexpressed genes [2, 25]. Six sets of coexpressed gene pairs were subjected to the semantic similarity analysis: (1) 9419 nonlinear coexpression pairs picked up by CoD-B but not by the correlation coefficient (R) (cutoff value of 0.70 for both CoD-B and R2); (2) 8225 linear coexpression pairs picked up by both CoD-B and R2 using the
same cutoff; (3) 39406 nonlinear coexpression pairs picked
up by CoD-Q but not by R2 using the same cutoff; (4) 8408
linear coexpression pairs picked up by both CoD-Q and R2
using the same cutoff; (5) 11596 coexpression pairs picked
up by R2 using the same cutoff; and (6) 250000 randomly selected gene pairs used for control. The gene expression data
from normal white blood cells were used for the analysis. Figure 2 shows the distributions of semantic similarity scores for these datasets. For the random gene pairs, the cumulative probability reached 1 at a functional similarity of 8, indicating that all of the random gene pairs had a functional similarity of 8 or below. In contrast, for the coexpressed genes identified by CoD-B, the cumulative probability of 1 (i.e., 100% of gene pairs) corresponded to semantic similarities above 30, indicative of much higher functional similarities between the
coexpressed genes identified. The distributions of similarity
scores derived from the two coexpressed gene datasets were
very similar to each other while both were significantly different from that of randomly generated gene pairs (P < 10E10 by the Kolmogorov-Smirnov test). For the coexpressed
genes identified by CoD-Q, the curves of cumulative probability lay between the CoD-B curves and the random curve; the cumulative probability of 1 corresponded to semantic similarities above 25. For the coexpressed genes identified by R2, the curves of cumulative probability also lay between the CoD-B curves and the random curve. The results suggest that the new algorithm is effective in identifying biologically significant coexpression of both linear and nonlinear patterns.

Figure 2: The distributions of functional similarity scores in six sets of gene pairs. The square line represents the distribution of randomly selected gene pairs, the circle line that of linearly coexpressed gene pairs picked up by CoD-B, the triangle line that of nonlinearly coexpressed gene pairs picked up by CoD-B, the star line that of linearly coexpressed gene pairs picked up by CoD-Q, the diamond line that of nonlinearly coexpressed gene pairs picked up by CoD-Q, and the downward-pointing triangle line that of coexpressed gene pairs picked up by the correlation coefficient (R). The x-axis indicates functional semantic similarity scores (GO term overlap; see Section 2). For the random gene pairs, the cumulative probability reached 1 at a functional similarity of 8, meaning that all the random gene pairs had a functional similarity of 8 or below. In contrast, for coexpressed genes picked up by CoD-B, the cumulative probability did not reach 1 (i.e., 100% of gene pairs) until the functional similarity exceeded 30, indicative of high functional similarities among the coexpressed genes. The cumulative distributions were significantly different from that of randomly generated gene pairs (P < 10E-10 by the Kolmogorov-Smirnov test). For the coexpressed genes identified by CoD-Q, the curves of cumulative probability lay between the CoD-B curves and the random curve, with the cumulative probability of 1 corresponding to semantic similarities above 25. For the coexpressed genes identified by R, the curves of cumulative probability also lay between the CoD-B curves and the random curve.
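The GO-term-overlap score used above can be sketched directly. The annotations below are hypothetical placeholders, not the genes' actual GO assignments; a real analysis would load the full gene-to-GO mapping.

```python
from itertools import combinations

# Hypothetical GO annotations: illustrative term sets only.
go_terms = {
    "BMP7":   {"GO:0030509", "GO:0001501", "GO:0040007"},
    "ACVR2B": {"GO:0030509", "GO:0004675", "GO:0040007"},
    "CCL23":  {"GO:0006935", "GO:0070098"},
    "CCR1":   {"GO:0006935", "GO:0070098", "GO:0007186"},
}

def semantic_similarity(gene_a, gene_b):
    """Number of GO terms shared by two genes (the overlap score
    plotted in Figure 2)."""
    return len(go_terms[gene_a] & go_terms[gene_b])

scores = {pair: semantic_similarity(*pair)
          for pair in combinations(sorted(go_terms), 2)}
```

Comparing the score distribution of a coexpressed-pair set against that of random pairs (e.g., by a Kolmogorov-Smirnov test) then quantifies whether the identified pairs are functionally closer than chance.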
3.4. Case study: coexpression of ligand-receptor pairs

We finally used our new algorithm to analyze the coexpression of ligands and their corresponding receptors in lung cancer, prostate cancer, leukemia, and their normal tissue counterparts. Significantly coexpressed ligand-receptor pairs
were identified in the cancer and normal tissue groups at thresholds of 0.50 for R2 and CoD-B and 0.05 for Pshuffle. The results are shown in Supplementary Tables S1 to S6. By applying the criteria of differential coexpression (see Section 2), we identified ligand-receptor pairs that showed differential coexpression between cancerous and normal tissues, as well as among different cancers. Table 2 lists the differentially coexpressed genes between lung cancer and normal tissues; the values of CoD-Q and R2 are also listed for comparison. Supplementary Tables S7 and S8 list the differentially coexpressed genes in AML and prostate cancer, respectively. Twelve ligand-receptor pairs were differentially coexpressed between lung cancer and normal tissues (CoD-B difference > 0.40) (see Table 2). The ligand BMP7 (bone morphogenetic protein 7), related to cancer development [26, 27], was one of the differentially coexpressed genes. For BMP7 and its receptor ACVR2B (activin receptor IIB), CoD-B was 0.76 (Pshuffle < 2.8E-2) in the lung cancer and 0.00 (Pshuffle < 5.8E-1) in the normal tissue, CoD-Q was 0.75 (Pshuffle < 2.9E-2) in the lung cancer and 0.00 (Pshuffle < 5.7E-1) in the normal tissue, and the R2 value was 0.043 (Pshuffle < 2.9E-2) in the lung cancer and 0.0012 (Pshuffle < 1.0E-1) in the normal tissue (see Table 2). BMP7 and ACVR2B therefore showed nonlinear coexpression in the lung cancer but were not coexpressed in the normal tissue. The nonlinear coexpression relationship was detected by both CoD-B and CoD-Q but not by R2. The coexpression profile (see Figure 3(a)) further showed that the two genes displayed approximately nonlinear pattern I of coexpression, and that BMP7 was overexpressed in the lung cancer compared with the normal tissue. These results suggest a certain level of negative feedback in the interaction between BMP7 and ACVR2B, and the findings advance our understanding of the role of BMP7 in cancer development.
The ligand CCL23 (chemokine ligand 23) and its receptor CCR1 (chemokine receptor 1), on the other hand, exhibited high linear coexpression in the normal lung tissue but were not coexpressed in the cancerous lung samples. As shown in Table 2, the CoD-B value of the gene pair was 0.85 in the normal tissue but 0.00 in the lung cancer, the CoD-Q value was 0.87 in the normal tissue but 0.62 in the lung cancer, and the R2 value was 0.92 in the normal tissue and 0.054 in the lung cancer. In this case, CoD-B and R2 differentiated the coexpression patterns of the two genes under the different conditions, but CoD-Q failed to. The coexpression profile (see Figure 3(b)) further showed that the two genes displayed approximately the linear pattern of coexpression in the normal condition. Similarly, CCL23 and CCR1 were highly coexpressed in the normal prostate samples (CoD-B = 0.85) but not in the cancerous prostate samples (CoD-B = 0.00) (see Supplementary Table S8). However, CCL23 and CCR1 were not coexpressed
Table 2: Ligand-receptor pairs showing differential coexpression between the lung cancer and normal tissue based on CoD-B. The CoD-Q and R2 values of the ligand-receptor pairs are also listed for comparison. Pshuffle values are given in parentheses.

Ligand   Receptor   CoD-B                              CoD-Q                              R2
                    Cancer          Normal             Cancer          Normal             Cancer           Normal
BMP7     ACVR2B     0.76 (2.8E-2)   0.00 (5.8E-1)      0.75 (2.9E-2)   0.00 (5.7E-1)      0.043 (2.9E-2)   0.0012 (1.0E-1)
EFNA3    EPHA5      0.84 (6.7E-6)   0.00 (6.9E-1)      0.66 (3.4E-1)   0.52 (1.6E-1)      0.22 (1.7E-2)    0.0072 (8.1E-1)
EGF      EGFR       0.50 (9.1E-4)   0.00 (6.6E-1)      0.64 (9.1E-1)   0.55 (2.2E-1)      0.20 (1.2E-2)    0.0034 (8.8E-1)
EPO      EPOR       0.49 (1.6E-5)   0.00 (7.1E-1)      0.092 (5.7E-2)  0.00 (5.0E-1)      0.14 (3.3E-2)    0.0022 (8.9E-1)
FGF8     FGFR2      0.55 (1.5E-7)   0.00 (6.6E-1)      0.70 (2.1E-1)   0.71 (4.0E-1)      0.30 (3.4E-3)    0.19 (2.5E-1)
IL16     CD4        0.62 (2.7E-6)   0.031 (6.8E-1)     0.76 (4.2E-2)   0.56 (2.7E-1)      0.40 (4.9E-4)    0.21 (2.1E-1)
CCL7     CCBP2      0.48 (4.7E-5)   0.00 (6.7E-1)      0.44 (7.4E-2)   0.61 (5.0E-1)      0.028 (3.5E-1)   0.086 (4.2E-1)
CCL23    CCR1       0.00 (7.3E-1)   0.85 (2.1E-9)      0.62 (8.0E-1)   0.87 (1.5E-2)      0.054 (2.3E-1)   0.92 (3.0E-4)
IL1RN    IL1R1      0.23 (7.7E-2)   0.83 (8.4E-7)      0.61 (7.2E-1)   0.81 (7.1E-2)      0.00 (9.6E-1)    0.90 (2.3E-4)
IL18     IL18R1     0.18 (9.7E-2)   0.71 (4.5E-6)      0.69 (8.1E-1)   0.67 (1.9E-1)      0.23 (9.0E-3)    0.64 (9.3E-3)
IL13     IL13RA2    0.00 (6.2E-1)   0.69 (1.5E-4)      0.59 (4.7E-1)   0.64 (2.2E-1)      0.0071 (6.7E-1)  0.69 (2.0E-2)
BMP5     BMPR2      0.00 (6.9E-1)   0.61 (1.7E-4)      0.58 (3.3E-1)   0.61 (2.8E-1)      0.12 (7.2E-2)    0.60 (1.7E-2)
in either the normal (CoD-B = 0.00) or the AML samples (CoD-B = 0.00). The results suggest that CCL23 and CCR1 show differential coexpression not only between cancerous and normal tissues, but also among different cancers. It has been reported that chemokine members and their receptors contribute to tumor proliferation, mobility, and invasiveness [28]. Some chemokines help to enhance immunity against tumor implantation, while others promote tumor proliferation [29]. Our results revealed the absence of a specific type of nonlinear interaction (for example, as described in Section 2.3) between CCL23 and CCR1 in lung and prostate cancer samples but not in AML samples, shedding light on the involvement of chemokine signaling in tumor development.
We further identified different patterns of ligand-receptor coexpression in cancer and normal tissues. In the lung cancer, for example, 11 ligand-receptor pairs showed a linear coexpression pattern, significant in both CoD-B and R2, while 28 pairs showed a nonlinear pattern, significant only in CoD-B (see Supplementary Table S1). In the counterpart normal tissue, however, 35 ligand-receptor pairs showed a linear coexpression pattern, while 6 pairs showed a nonlinear pattern (see Supplementary Table S2). Such differences in coexpression pattern were not identified in previous coexpression studies based on the correlation coefficient [5].
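The pattern calls and the differential-coexpression screen described in this section can be sketched as simple threshold rules. This is a schematic reading of the criteria (significance filtering on Pshuffle is omitted), not the authors' code.

```python
def classify_pair(cod_b, r2, cutoff=0.50):
    """Coexpression pattern call: 'linear' if both CoD-B and R^2
    pass the cutoff, 'nonlinear' if only CoD-B does, 'none'
    otherwise (Pshuffle significance filtering omitted here)."""
    if cod_b >= cutoff and r2 >= cutoff:
        return "linear"
    if cod_b >= cutoff:
        return "nonlinear"
    return "none"

def differentially_coexpressed(cod_b_cancer, cod_b_normal, delta=0.40):
    """Differential-coexpression call of Table 2 (CoD-B difference)."""
    return abs(cod_b_cancer - cod_b_normal) > delta

# BMP7/ACVR2B in lung cancer (values from Table 2): nonlinear in
# cancer, and differentially coexpressed versus normal tissue.
pattern = classify_pair(0.76, 0.043)
is_diff = differentially_coexpressed(0.76, 0.00)
```

Applying the two rules to every ligand-receptor pair in both tissue groups reproduces the kind of linear/nonlinear counts reported for Supplementary Tables S1 and S2.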
4. CONCLUSION
In summary, we proposed an effective algorithm based on CoD estimation with B-spline approximation for modeling and measuring gene coexpression patterns. The model can address both linear and some specific nonlinear relationships, can suggest the directionality of an interaction, and can be calculated directly from microarray data without the quantization that could lead to information loss or misrepresentation. The newly proposed algorithm can be very useful for analyzing a variety of gene expression data in pathway or network
studies, especially when there are specific nonlinear relations between the gene expression profiles.

Figure 3: Coexpression profiles of two representative ligand-receptor pairs in lung cancer cells and normal cells (ligand expression on the x-axis, receptor expression on the y-axis). (a) BMP7 and ACVR2B in lung cancer samples (Pshuffle < 2.8E-2) and normal samples (Pshuffle < 5.8E-1); (b) CCL23 and CCR1 in lung cancer samples (Pshuffle < 7.3E-1) and normal samples (Pshuffle < 2.1E-9).
ACKNOWLEDGEMENT
This study was supported, at least in part, by the Intramural
Research Program, National Institute on Aging, NIH.
REFERENCES
[1] J. M. Stuart, E. Segal, D. Koller, and S. K. Kim, “A gene-coexpression network for global discovery of conserved genetic modules,” Science, vol. 302, no. 5643, pp. 249–255, 2003.
[2] H. K. Lee, A. K. Hsu, J. Sajdak, J. Qin, and P. Pavlidis, “Coexpression analysis of human genes across many microarray data
sets,” Genome Research, vol. 14, no. 6, pp. 1085–1094, 2004.
[3] V. van Noort, B. Snel, and M. A. Huynen, “The yeast coexpression network has a small-world, scale-free architecture and
can be explained by a simple model,” EMBO Reports, vol. 5,
no. 3, pp. 280–284, 2004.
[4] S. L. Carter, C. M. Brechbuhler, M. Griffin, and A. T. Bond,
“Gene co-expression network topology provides a framework
for molecular characterization of cellular state,” Bioinformatics, vol. 20, no. 14, pp. 2242–2250, 2004.
[5] T. G. Graeber and D. Eisenberg, “Bioinformatic identification
of potential autocrine signaling loops in cancers from gene expression profiles,” Nature Genetics, vol. 29, no. 3, pp. 295–300,
2001.
[6] M. J. Herrgård, M. W. Covert, and B. Ø. Palsson, “Reconciling gene expression data with known genome-scale regulatory network structures,” Genome Research, vol. 13, no. 11, pp.
2423–2434, 2003.
[7] S. Imoto, T. Goto, and S. Miyano, “Estimation of genetic
networks and functional structures between genes by using Bayesian networks and nonparametric regression,” Pacific
Symposium on Biocomputing, pp. 175–186, 2002.
[8] A. J. Butte and I. S. Kohane, “Mutual information relevance
networks: functional genomic clustering using pairwise entropy measurements,” Pacific Symposium on Biocomputing, pp.
418–429, 2000.
[9] X. Zhou, X. Wang, and E. R. Dougherty, “Construction
of genomic networks using mutual-information clustering
and reversible-jump Markov-chain-Monte-Carlo predictor
design,” Signal Processing, vol. 83, no. 4, pp. 745–761, 2003.
[10] S. Kim, H. Li, E. R. Dougherty, et al., “Can Markov chain models mimic biological regulation?” Journal of Biological Systems,
vol. 10, no. 4, pp. 337–357, 2002.
[11] R. F. Hashimoto, S. Kim, I. Shmulevich, W. Zhang, M. L.
Bittner, and E. R. Dougherty, “Growing genetic regulatory
networks from seed genes,” Bioinformatics, vol. 20, no. 8, pp.
1241–1247, 2004.
[12] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, “Probabilistic Boolean networks: a rule-based uncertainty model for
gene regulatory networks,” Bioinformatics, vol. 18, no. 2, pp.
261–274, 2002.
[13] E. R. Dougherty, S. Kim, and Y. Chen, “Coefficient of determination in nonlinear signal processing,” Signal Processing,
vol. 80, no. 10, pp. 2219–2235, 2000.
[14] H. Li and M. Zhan, “Systematic intervention of transcription
for identifying network response to disease and cellular phenotypes,” Bioinformatics, vol. 22, no. 1, pp. 96–102, 2006.
[15] V. Hatzimanikatis and K. H. Lee, “Dynamical analysis of gene
networks requires both mRNA and protein expression information,” Metabolic Engineering, vol. 1, no. 4, pp. 275–281,
1999.
[16] H. Prautzsch, W. Boehm, and M. Paluszny, Bézier and B-Spline
Techniques, Springer, Berlin, Germany, 2002.
[17] P. Ma, C. I. Castillo-Davis, W. Zhong, and J. S. Liu, “A datadriven clustering method for time course gene expression
data,” Nucleic Acids Research, vol. 34, no. 4, pp. 1261–1269,
2006.
10
[18] J. D. Storey, W. Xiao, J. T. Leek, R. G. Tompkins, and R. W.
Davis, “Significance analysis of time course microarray experiments,” Proceedings of the National Academy of Sciences of the
United States of America, vol. 102, no. 36, pp. 12837–12842,
2005.
[19] Z. Bar-Joseph, G. K. Gerber, D. K. Gifford, T. S. Jaakkola, and
I. Simon, “Continuous representations of time-series gene expression data,” Journal of Computational Biology, vol. 10, no. 3-4, pp. 341–356, 2003.
[20] K. Bhasi, A. Forrest, and M. Ramanathan, “SPLINDID: a semiparametric, model-based method for obtaining transcription
rates and gene regulation parameters from genomic and proteomic expression profiles,” Bioinformatics, vol. 21, no. 20, pp.
3873–3879, 2005.
[21] W. He, “A spline function approach for detecting differentially
expressed genes in microarray data analysis,” Bioinformatics,
vol. 20, no. 17, pp. 2954–2963, 2004.
[22] Y. Luan and H. Li, “Clustering of time-course gene expression
data using a mixed-effects model with B-splines,” Bioinformatics, vol. 19, no. 4, pp. 474–482, 2003.
[23] C. O. Daub, R. Steuer, J. Selbig, and S. Kloska, “Estimating
mutual information using B-spline functions—an improved
similarity measure for analysing gene expression data,” BMC
Bioinformatics, vol. 5, no. 1, p. 118, 2004.
[24] R. A. Irizarry, B. M. Bolstad, F. Collin, L. M. Cope, B. Hobbs,
and T. P. Speed, “Summaries of Affymetrix GeneChip probe
level data,” Nucleic Acids Research, vol. 31, no. 4, p. e15, 2003.
[25] P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble, “Investigating semantic similarity measures across the gene ontology:
the relationship between sequence and annotation,” Bioinformatics, vol. 19, no. 10, pp. 1275–1283, 2003.
[26] K. D. Brubaker, E. Corey, L. G. Brown, and R. L. Vessella,
“Bone morphogenetic protein signaling in prostate cancer cell
lines,” Journal of Cellular Biochemistry, vol. 91, no. 1, pp. 151–
160, 2004.
[27] S. Yang, C. Zhong, B. Frenkel, A. H. Reddi, and P. Roy-Burman, “Diverse biological effect and Smad signaling of bone
morphogenetic protein 7 in prostate tumor cells,” Cancer Research, vol. 65, no. 13, pp. 5769–5777, 2005.
[28] A. Müller, B. Homey, H. Soto, et al., “Involvement of
chemokine receptors in breast cancer metastasis,” Nature,
vol. 410, no. 6824, pp. 50–56, 2001.
[29] J. M. Wang, X. Deng, W. Gong, and S. Su, “Chemokines and
their role in tumor growth and metastasis,” Journal of Immunological Methods, vol. 220, no. 1-2, pp. 1–17, 1998.
Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 47214, 11 pages
doi:10.1155/2007/47214
Research Article
Gene Systems Network Inferred from Expression Profiles in
Hepatocellular Carcinogenesis by Graphical Gaussian Model
Sachiyo Aburatani,1 Fuyan Sun,1 Shigeru Saito,2 Masao Honda,3 Shu-ichi Kaneko,3 and
Katsuhisa Horimoto1
1 Biological Network Team, Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
2 Chemo & Bio Informatics Department, INFOCOM CORPORATION, Mitsui Sumitomo Insurance Surugadai Annex Building,
3-11, Kanda-Surugadai, Chiyoda-ku, Tokyo 101-0062, Japan
3 Department of Gastroenterology, Graduate School of Medical Science, Kanazawa University, 13-1 Takara-machi, Kanazawa,
Ishikawa 920-8641, Japan
Received 28 June 2006; Revised 27 February 2007; Accepted 1 May 2007
Recommended by Paul Dan Cristea
Hepatocellular carcinoma (HCC) in a liver with advanced-stage chronic hepatitis C (CHC) is induced by hepatitis C virus, which
chronically infects about 170 million people worldwide. To elucidate the associations between gene groups in hepatocellular carcinogenesis, we analyzed the profiles of the genes characteristically expressed in the CHC and HCC cell stages by a statistical
method for inferring the network between gene systems based on the graphical Gaussian model. A systematic evaluation of the inferred network in terms of biological knowledge revealed that the inferred network was strongly involved in known gene-gene interactions with high significance (P < 10^-4), and that the clusters characterized by different cancer-related responses were associated with those of the gene groups related to metabolic pathways and morphological events. Although some relationships in the network remain to be interpreted, the analyses revealed a snapshot of the orchestrated expression of cancer-related groups and some pathways related to metabolism and morphological events in hepatocellular carcinogenesis, and thus provide possible clues to the disease mechanism and insights that address the gap between molecular and clinical assessments.
Copyright © 2007 Sachiyo Aburatani et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. INTRODUCTION
Hepatitis C virus (HCV) is the major etiologic agent of non-A, non-B hepatitis, and chronically infects about 170 million
people worldwide [1–3]. Many HCV carriers develop chronic
hepatitis C (CHC), and finally are afflicted with hepatocellular carcinoma (HCC) in livers with advanced-stage CHC.
Thus, the CHC and HCC cell stages are essential in hepatocellular carcinogenesis.
To elucidate the mechanism of hepatocellular carcinogenesis at a molecular level, many experiments have been
performed from various approaches. In particular, recent
advances in techniques to monitor simultaneously the expression levels of genes on a genomic scale have facilitated
the identification of genes involved in the tumorigenesis
[4]. Indeed, some relationships between the disease and the
tumor-related genes were proposed from the gene expression analyses [5–7]. Apart from the relationship between
tumor-related genes and the disease at the molecular level,
the information about the pathogenesis and the clinical characteristics of hepatocellular carcinogenesis has accumulated
steadily [8, 9]. However, there is a gap between the information about hepatocellular carcinogenesis at the molecular level and that at more macroscopic levels, such as the
clinical level. Furthermore, the relationships between tumor-related genes and other genes also remain to be investigated.
Thus, an approach to describe the perspective of carcinogenesis from measurements at the molecular level is desirable to
bridge the gap between the information at the two different
levels.
Recently, we have developed an approach to infer a regulatory network, which is based on graphical Gaussian modeling (GGM) [10, 11]. Graphical Gaussian modeling is one of
the graphical models that includes the Boolean and Bayesian
models [12, 13]. Among the graphical models, GGM has the
simplest structure in a mathematical sense; only the inverse
of the correlation coefficient matrix between the variables is needed,
and therefore, GGM can be easily applied to a wide variety
of data. However, straightforward applications of statistical
theory to practical data fail in some cases, and GGM also
fails frequently when applied to gene expression profiles; here
the expression profile indicates a set of the expression degrees of one gene, measured under various conditions. This
is because the profiles often share similar expression patterns, which indicate that the correlation coefficient matrix
between the genes is not regular. Thus, we have devised a procedure, named ASIAN (automatic system for inferring a network), to apply GGM to gene expression profiles, by a combination of hierarchical clustering [14]. First, the large number
of profiles is grouped into clusters, according to the standard
approach of profile analysis [15]. To avoid the generation
of a nonregular correlation coefficient matrix from the expression profiles, we adopted a stopping rule for hierarchical
clustering [10]. Then, the relationship between the clusters is
inferred by GGM. Thus, our method generates a framework
of gene regulatory relationships by inferring the relationships
between the clusters [11, 16], and provides clues toward estimating the global relationships between genes on a large
scale.
Methods for extracting biological knowledge from large
amounts of literature and arranging it in terms of gene
function have been developed. Indeed, ontologies have been
made available by the gene ontology (GO) consortium [17]
to construct a functional categorization of genes and gene
products, and by using the GO terms, the software determines whether any GO terms annotate a specified list of
genes at a frequency greater than that expected by chance
[18]. Furthermore, various software applications, most of
which are commercial software, such as MetaCore from
GeneGo http://www.genego.com/, have been developed for
the navigation and analysis of biological pathways, gene regulation networks, and protein interaction maps [19]. Thus,
advances in the processing of biological knowledge have
enabled us to correspond to the results of gene expression analyses for a large amount of data with the biological
functions.
In this study, we analyzed the gene expression profiles
from the CHC and HCC cell stages, by ASIAN based on the
graphical Gaussian Model, to reveal the framework of gene
group associations in hepatocellular carcinogenesis. For this
purpose, first, the genes characteristically expressed in hepatocellular carcinogenesis were selected, and then, the profiles of the genes thus selected were subjected to the association inference method. In addition to the association inference, which was presented by the network between the
clusters, the network was further interpreted systematically
by the biological knowledge of the gene interactions and by
the functional categories with GO terms. The combination
of the statistical network inference from the profiles with the
systematic network interpretation by the biological knowledge in the literature provides a snapshot of the orchestration
of gene systems in hepatocellular carcinogenesis, especially
for bridging the gap between the information on the disease
mechanisms at the molecular level and at more macroscopic
levels.
EURASIP Journal on Bioinformatics and Systems Biology
2. MATERIALS AND METHODS

2.1. Gene selection
We selected the up- and downregulated genes characteristically expressed in the CHC and HCC stages, as a prerequisite for defining the variables in the network inference by graphical Gaussian modeling. This involved the following steps. (1) The average and the standard deviation in each condition, AV_j and SD_j, for j = 1, ..., N_c, are calculated. (2) The expression degree of the ith gene in the jth condition, e_ij, is compared with |AV_j ± SD_j|. (3) The gene is regarded as characteristically expressed if the number of conditions in which e_ij ≥ |AV_j ± SD_j| is more than N_c/2. Although the criterion for a characteristically expressed gene is usually |AV_j ± 2SD_j|, the selection procedure described above is deliberately designed to gather as many characteristically expressed genes as possible, and is suitable for capturing a macroscopic relationship between the gene systems estimated by the subsequent cluster analysis.
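The selection steps above can be sketched as follows. One point is interpretive: the comparison with |AV_j ± SD_j| is read here as the gene deviating from the per-condition mean by at least one standard deviation, which is an assumption about the intended criterion.

```python
import numpy as np

def select_characteristic_genes(expr):
    """Select characteristically expressed genes (genes x conditions).

    Reading of the criterion above (an assumption): a gene is kept
    when its level deviates from the per-condition mean AV_j by at
    least one per-condition standard deviation SD_j in more than
    N_c/2 of the N_c conditions (a deliberately loose 1-SD cutoff).
    """
    av = expr.mean(axis=0)                 # AV_j, averaged over genes
    sd = expr.std(axis=0)                  # SD_j
    deviant = np.abs(expr - av) >= sd      # per gene, per condition
    n_c = expr.shape[1]
    return np.where(deviant.sum(axis=1) > n_c / 2)[0]

# Nine flat genes plus one strongly expressed gene: only the latter
# passes the criterion in all four conditions.
expr = np.vstack([np.zeros((9, 4)), np.full((1, 4), 10.0)])
selected = select_characteristic_genes(expr)
```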
2.2. Gene systems network inference
The present analysis is composed of three parts. First, the profiles selected in the preceding section are subjected to clustering analysis with automatic determination of the cluster number; then the profiles of the clusters are subjected to graphical Gaussian modeling. Finally, the network inferred by GGM is rearranged according to the magnitude of the partial correlation coefficients between the clusters, which can be regarded as association strengths. The details of the analysis are as follows.
2.2.1. Clustering with automatic determination of cluster number
In clustering the gene profiles, the Euclidean distance between Pearson's correlation coefficients of the profiles and the unweighted pair-group method using arithmetic averages (UPGMA, or group average method) were adopted as the metric and the technique, respectively, with reference to previous analyses by GGM [11, 16]. In particular, the present metric between two genes is designed to reflect the similarity of their expression profile patterns with respect to the other genes as well as across the measured conditions, that is,
$$ d_{ij} = \sqrt{\sum_{l=1}^{n} \left( r_{il} - r_{jl} \right)^{2}}, \tag{1} $$

where n is the total number of genes, and r_ij is the Pearson correlation coefficient between the expression profiles of genes i and j, measured under Nc conditions p_ik (k = 1, 2, ..., Nc):

$$ r_{ij} = \frac{\sum_{k=1}^{N_c} \left( p_{ik} - \bar{p}_i \right)\left( p_{jk} - \bar{p}_j \right)}{\sqrt{\sum_{k=1}^{N_c} \left( p_{ik} - \bar{p}_i \right)^{2}}\,\sqrt{\sum_{k=1}^{N_c} \left( p_{jk} - \bar{p}_j \right)^{2}}}, \tag{2} $$

where \bar{p}_i is the arithmetic average of p_ik over the Nc conditions.
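Equations (1) and (2) amount to a "correlation of correlations" metric: the gene-by-gene Pearson matrix is computed first, and two genes are then compared by the Euclidean distance between their rows of that matrix. A minimal sketch (illustrative, not the ASIAN implementation):

```python
import numpy as np

def correlation_distance_matrix(profiles):
    """profiles: array of shape (n_genes, n_conditions).
    Returns the Euclidean distance matrix between rows of the
    gene-gene Pearson correlation matrix (equations (1)-(2))."""
    r = np.corrcoef(profiles)                # r_ij, equation (2)
    # d_ij = sqrt(sum_l (r_il - r_jl)^2), equation (1)
    diff = r[:, None, :] - r[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

The resulting matrix can then be fed to UPGMA, for example via SciPy's hierarchical clustering with method='average'.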
In the cluster number estimation, various stopping rules for hierarchical clustering have been developed [20]. Recently, we developed a method for estimating the cluster number in hierarchical clustering by considering the subsequent application of the graphical model to the clusters [10]. In our approach, the variance inflation factor (VIF) is adopted as a stopping rule, defined by

$$ \mathrm{VIF}_i = r_{ii}^{-1}, \tag{3} $$

where r_{ii}^{-1} is the ith diagonal element of the inverse of the correlation coefficient matrix between explanatory variables [21]. In the cluster number determination, the popular cutoff value of 10.0 [21] was adopted as the threshold in the present analysis, also with reference to the previous analyses.

After the cluster number determination, the average expression profile is calculated over the members of each cluster, and the correlation coefficient matrix between the clusters is then calculated from these average profiles. Finally, this between-cluster correlation coefficient matrix is subjected to graphical Gaussian modeling. Note that the use of the average correlation coefficient matrix avoids the difficulty in the numerical calculation described above, owing to the distinctive patterns of the average expression profiles of the clusters; that is, the GGM works well for the average correlation coefficient matrix.

2.2.2. Graphical Gaussian modeling

The concept of conditional independence is fundamental to graphical Gaussian modeling (GGM). The conditional independence structure of the data is characterized by a conditional independence graph. In this graph, each variable is represented by a vertex, and two vertices are connected by an edge if there is a direct association between them. In contrast, a pair of vertices that are not connected in the graph are conditionally independent.

In the procedure for applying GGM to the profile data [11], a graph G = (V, E) is used to represent the relationships among the M clusters, where V is a finite set of nodes, each corresponding to one of the M clusters, and E is a finite set of edges between the nodes. E consists of the edges between cluster pairs that are conditionally dependent. Conditional independence is assessed by the partial correlation coefficient, expressed by

$$ r_{ij \mid \mathrm{rest}} = - \frac{r^{ij}}{\sqrt{r^{ii}}\,\sqrt{r^{jj}}}, \tag{4} $$

where r_{ij|rest} is the partial correlation coefficient between variables i and j given the remaining variables, and r^{ij} is the (i, j) element of the inverse of the correlation coefficient matrix.

In order to evaluate which pairs of clusters are conditionally independent, we applied covariance selection [22], attained by the stepwise, iterative algorithm developed by Wermuth and Scheidt [23]. The algorithm is presented as Algorithm 1.

Step 1. Prepare a complete graph G(0) = (V, E). The nodes correspond to the M clusters, and all of the nodes are connected; G(0) is called the full model. Based on the expression profile data, construct an initial correlation coefficient matrix C(0).

Step 2. Calculate the partial correlation coefficient matrix P(τ) from the correlation coefficient matrix C(τ), where τ indicates the iteration number.

Step 3. Find the element with the smallest absolute value among all of the nonzero elements of P(τ), and replace that element in P(τ) with zero.

Step 4. Reconstruct the correlation coefficient matrix C(τ + 1) from P(τ). In C(τ + 1), the element corresponding to the element set to zero in P(τ) is revised, while all of the other elements remain the same as those in C(τ).

Step 5. In the Wermuth and Scheidt algorithm, termination of the iteration is judged by "deviance" values. Here, we used two types of deviance, dev1 and dev2:

$$ \mathrm{dev1} = N_c \log \frac{\left| C(\tau+1) \right|}{\left| C(0) \right|}, \qquad \mathrm{dev2} = N_c \log \frac{\left| C(\tau+1) \right|}{\left| C(\tau) \right|}. \tag{5} $$

Calculate dev1 and dev2. The two deviances asymptotically follow χ² distributions with n degrees of freedom and with 1 degree of freedom, respectively, where n is the number of elements set to zero up to the (τ + 1)th iteration; in our approach, n is equal to τ + 1. |C(τ)| indicates the determinant of C(τ), and Nc is the number of different conditions under which the expression levels of the M clusters are measured.

Step 6. If the probability value corresponding to dev1 is ≤ 0.05, or the probability value corresponding to dev2 is ≤ 0.05, then the model C(τ + 1) is rejected and the iteration is stopped. Otherwise, the edge between the pair of clusters whose partial correlation coefficient was set to zero in P(τ) is omitted from G(τ) to generate G(τ + 1), τ is increased by 1, and the procedure returns to Step 2.

Algorithm 1

The graph obtained by the above procedure is an undirected graph, called an independence graph. The independence graph represents which pairs of clusters are conditionally independent: when the partial correlation coefficient for a cluster pair is equal to 0, the cluster pair is conditionally independent, and this relationship is expressed as the absence of an edge between the corresponding nodes in the independence graph.
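The covariance-selection iteration of Algorithm 1 can be sketched as follows. This is an illustrative reimplementation (function names, and the single-element update in Step 4, are our reading of the published algorithm, not the ASIAN code), using NumPy and SciPy:

```python
import numpy as np
from scipy.stats import chi2

def partial_correlations(c):
    """Equation (4): r_ij|rest = -r^ij / (sqrt(r^ii) * sqrt(r^jj)),
    with r^ij the elements of the inverse of C."""
    inv = np.linalg.inv(c)
    d = np.sqrt(np.diag(inv))
    p = -inv / np.outer(d, d)
    np.fill_diagonal(p, 1.0)
    return p

def zero_one_edge(c, i, j):
    """Step 4: revise only c[i, j] so that the (i, j) partial correlation
    vanishes; the revised value is the one implied by conditional
    independence of i and j given the remaining variables."""
    rest = [k for k in range(len(c)) if k not in (i, j)]
    c = c.copy()
    v = c[np.ix_([i], rest)] @ np.linalg.solve(c[np.ix_(rest, rest)],
                                               c[np.ix_(rest, [j])])
    c[i, j] = c[j, i] = v[0, 0]
    return c

def covariance_selection(c0, n_cond, alpha=0.05):
    """Steps 2-6: iteratively zero the weakest partial correlation and
    stop when either deviance test (equation (5)) rejects the model.
    Returns the surviving edges of the independence graph."""
    c0 = np.asarray(c0, dtype=float)
    c_prev = c0.copy()
    m = len(c0)
    removed = []
    edges = {(i, j) for i in range(m) for j in range(i + 1, m)}
    while edges:
        p = partial_correlations(c_prev)                          # Step 2
        _, i, j = min((abs(p[a, b]), a, b) for (a, b) in edges)   # Step 3
        c_next = zero_one_edge(c_prev, i, j)                      # Step 4
        dev1 = n_cond * np.log(np.linalg.det(c_next) / np.linalg.det(c0))
        dev2 = n_cond * np.log(np.linalg.det(c_next) / np.linalg.det(c_prev))
        # Step 6: chi-square tests with df = n and df = 1, respectively
        if (chi2.sf(dev1, df=len(removed) + 1) <= alpha
                or chi2.sf(dev2, df=1) <= alpha):
            break
        edges.discard((i, j))
        removed.append((i, j))
        c_prev = c_next
    return edges
```

Applied to the between-cluster correlation matrix, the surviving edges define the independence graph of Section 2.2.2.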
The genes grouped into each cluster are expected to share similar biological functions, in addition to a regulatory mechanism [24]. Thus, a network between the clusters can be approximately regarded, from a macroscopic viewpoint, as a network between gene systems, each with similar functions. Note that the number of connections at one vertex is not limited, whereas each gene belongs to only one cluster in the cluster analysis. This
feature of the network reflects the multiple relationships of a
gene or a gene group in terms of the biological function.
2.2.3. Rearrangement of the inferred network

When there are many edges, drawing them all on one graph produces a messy, "spaghetti" pattern that is difficult to read. Indeed, in some applications of GGM to actual profiles, the intact networks obtained by GGM showed complicated forms with many edges [11, 16]. Since the magnitude of the partial correlation coefficient indicates the strength of the association between clusters, the intact network can be rearranged according to the partial correlation coefficient values, to aid the interpretation of the associations between the clusters. The strength of an association can be assessed by a standard test for the partial correlation coefficient [25], based on Fisher's Z transformation of the partial correlation coefficients, that is,

$$ Z = \frac{1}{2} \log \frac{1 + r_{ij \cdot \mathrm{rest}}}{1 - r_{ij \cdot \mathrm{rest}}}. \tag{6} $$

Z is approximately distributed according to the normal distribution

$$ N\!\left( \frac{1}{2} \log \frac{1 + r_{ij \cdot \mathrm{rest}}}{1 - r_{ij \cdot \mathrm{rest}}},\; \frac{1}{N_c - (M - 2) - 3} \right), \tag{7} $$

where Nc and M are the number of conditions and the number of clusters, respectively. Thus, we can statistically test the observed partial correlation coefficients under the null hypothesis with a significance probability.

2.3. Statistical significance of the inferred network with the biological knowledge

The inferred network can be statistically evaluated in terms of the gene-gene interactions. The chance probability was estimated from the correspondence between the inferred cluster network and the information about gene interactions, in the following steps. (1) The known gene pairs with interactions in the database were overlaid onto the inferred network. (2) The number of cluster pairs upon which the gene interactions were overlaid was counted. (3) The chance probability that the cluster pairs connected by the established edges in the network would be found among all possible pairs was calculated by using the following equation:

$$ P = 1 - \sum_{i=0}^{f-1} \frac{\binom{g}{i} \binom{N-g}{n-i}}{\binom{N}{n}}, \tag{8} $$

where N is the number of possible cluster pairs in the network, n is the number of cluster pairs with edges in the inferred network, f is the number of cluster pairs with edges in the inferred network that include known interacting gene pairs, and g is the number of cluster pairs that include known interacting gene pairs.

2.4. Evaluation of the inferred network in terms of the biological knowledge

The inferred network can also be evaluated in terms of the biological knowledge. For this purpose, we characterize the clusters by GO terms and overlay the knowledge about gene interactions onto the network. We first use GO::TermFinder [18] to characterize the clusters by GO terms with a user-defined significance probability (http://search.cpan.org/dist/GO-TermFinder). Then, Pathway Studio [19] is used to survey the biological information about gene interactions between the selected genes.

2.5. Software

All calculations for the present clustering and GGM were performed with the ASIAN web site [26, 27] (http://www.eureka.cbrc.jp/asian) and "Auto Net Finder," the commercialized PC version of ASIAN, from INFOCOM CORPORATION, Tokyo, Japan (http://www.infocom.co.jp/bio/download).

2.6. Expression profile data

The expression profiles of 8516 genes were monitored in 27 CHC samples and 17 HCC samples [28].

3. RESULTS AND DISCUSSION

3.1. Clustering
Among the 8516 genes with expression profiles that were
measured in the previous studies [28], 661 genes were selected as those characteristically expressed in the CHC and
HCC stages. As a preprocessing step for the association inference, the genes thus selected were automatically divided
into 18 groups by ASIAN [26, 27]. Furthermore, each cluster
was characterized in terms of the GO terms, which define the
macroscopic features of the cluster in terms of the biological
function.
Figure 1 shows the dendrogram of clusters, together with
their expression patterns. As seen in Figure 1, the genes were
grouped into 18 clusters, in terms of the number of members and the expression patterns in the clusters. The average
number of cluster members was 36.7 genes (SD, 14.2), and
the maximum and minimum numbers of members were 69
in cluster 14 and 18 in cluster 9, respectively. As for the expression pattern, five clusters (10, 12, 14, 15, and 18) and
ten clusters (1–7, 9, 16, and 17) were composed of up- and
downregulated genes, respectively, and three clusters (8, 11,
and 13) showed similar mixtures of up- and downregulated
genes.
Table 1 shows the GO terms for the clusters (clusterGOB), which characterized them well (see details at
http://www.cbrc.jp/∼horimoto/HCGO.pdf). Among the 661
genes analyzed in this study, 525 genes were characterized by
the GO terms, and among the 18 clusters, 11 clusters were
characterized by GO terms with P < .05. In addition, 188
genes (28.3% of all characterized genes) corresponded to the
GO terms listed in Table 1. As seen in the table, although
most clusters are characterized by several GO terms, reflecting the fact that the genes function generally in multiple
pathways, the clusters are not composed of a mixture of genes
with distinctive functions. For example, cluster 2 is characterized by 10 terms, and most of the terms are related to
the energy metabolism. Thus, the GO terms in the respective clusters share similar features of biological function, reflecting the hierarchical structure of the GO term definitions.
In Table 1, most of the clusters characterized by GO
terms with P < .05 are related to response function and to
metabolism. Clusters 1, 6, 8, 12, and 13 are characterized by
GO terms related to different responses, and clusters 2, 3, 4,
and 7 are characterized by GO terms related to different aspects of metabolism. Although the genes in two clusters, 14
and 16, did not adhere to this dichotomy, the genes characteristically expressed in HCC in the above nine clusters were
related to the responses and the metabolic pathways. As for
the remaining clusters with lower significance, three clusters
(9, 10, and 11) were also characterized by response functions,
and four clusters (5, 15, 17, and 18) were related to morphological events at the cellular level. Note that none of the clusters characterized by cellular level events attained the significance level. This may be because the genes related to cellular
level events represent only a small fraction of genes relative
to all genes with known functions, in comparison with the
genes related to molecular level events in the definition of
GO terms.
It is interesting to determine the correspondence between
the up- and downregulated genes and the GO terms in the
clusters. In the five clusters of upregulated genes, clusters 10
and 12 were characterized by different responses, and two
clusters were characterized by morphological events, which
were the categories of “cell proliferation” in cluster 15 and of
“development” in cluster 18. The remaining cluster, 14, was
characterized by regulation, development, and metabolism.
As for the clusters of downregulated genes, four of the ten
clusters were characterized by GO terms related to various
aspects of metabolism. In the remaining six clusters, three
clusters were characterized by GO terms related to responses,
two clusters were characterized by morphological events, and
one cluster was characterized by mixed categories.
In summary, the present gene selection and the following automatic clustering produced a macroscopic view of
gene expression in hepatocellular carcinogenesis. Although
the clusters contain many genes that do not always share the
same functions, the clusters were characterized by their responses, morphological events, and metabolic aspects from
a macroscopic viewpoint. The clusters of upregulated genes
were characterized by the former two categories, and those
of the downregulated genes represented all three categories.
Thus, the present clustering serves to interpret the network
between the clusters in terms of the biological function and
the gene expression pattern.
3.2. Known gene interactions in the inferred network
The association between the 18 clusters inferred by GGM is
shown in Figure 2. In the intact network by ASIAN, 96 of 153
possible edges between the 18 clusters (about 63%) were established by GGM.

[Figure 1; dendrogram leaves, top to bottom, as cluster (members): 10 (38), 11 (31), 12 (30), 13 (56), 8 (32), 9 (18), 4 (25), 5 (24), 17 (24), 14 (69), 15 (28), 18 (28), 16 (50), 6 (42), 7 (48), 1 (32), 2 (59), 3 (27).]

Figure 1: Dendrogram of genes and profiles. The dendrogram was constructed by hierarchical clustering, with the Euclidean distance between the correlation coefficients as the metric and UPGMA as the technique. The blue line on the dendrogram indicates the cluster boundary estimated automatically by ASIAN. The gene expression patterns of the respective clusters in the CHC and HCC stages are shown by the degree of intensity: red and green indicate relatively higher and lower intensities, respectively. The cluster number and the number of member genes in each cluster (in parentheses) are denoted on the right side of the figure.

Since the intact network is still messy, the
network was rearranged to interpret its biological meaning
by extracting the relatively strong associations between the
clusters, according to the procedure in Section 2.2.3. After
the rearrangement, 34 edges remained by the statistical test
of the partial correlation coefficients with 5% significance.
In the rearranged network, all of the clusters were nested,
but each cluster was connected to a few other clusters. Indeed, the average number of edges per cluster was 2.3, and
the maximum and minimum numbers of edges were seven
in cluster 15 and one in cluster 9, respectively. In particular,
the numbers of edges are not proportional to the numbers
of constituent genes in each cluster. For example, while the
numbers of genes in clusters 15 and 17 are equal to each other
(24 genes), the number of edges from cluster 15 (2 edges) differs from that from cluster 17 (5 edges). Thus, the number of
edges does not depend on the number of genes belonging to
the cluster, but rather on the gene associations between the
cluster pairs.
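The 5% test applied in this rearrangement is the Fisher Z test of equations (6) and (7); a minimal sketch (illustrative function name, not the ASIAN implementation):

```python
import math

def edge_is_significant(pcor, n_cond, n_clusters, alpha=0.05):
    """Two-sided test of a partial correlation via Fisher's Z
    (equations (6)-(7)): Z is compared with a normal distribution of
    variance 1 / (Nc - (M - 2) - 3) under the null of zero partial
    correlation."""
    z = 0.5 * math.log((1 + pcor) / (1 - pcor))
    se = 1.0 / math.sqrt(n_cond - (n_clusters - 2) - 3)
    # two-sided normal tail probability
    p_value = math.erfc(abs(z) / (se * math.sqrt(2)))
    return p_value < alpha
```

With the paper's M = 18 clusters, this requires Nc > 19 conditions for the variance in equation (7) to be positive.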
To test the validity of the inferred network in terms of
biological function, the biological knowledge about the gene
interactions is overlaid onto the inferred network. For this
purpose, all of the gene pairs belonging to cluster pairs are
surveyed by Pathway Assist, which is a database for biological knowledge about molecular interactions, compiled
based on the gene ontology [17]. Among the 661 genes analyzed in this study, the interactions between 90 gene pairs
were detected by Pathway Assist, and 50 of these pairs were
found in Figure 2. Notice that the number of gene pairs reported in the literature does not directly reflect the importance of the gene interactions; rather, it depends strongly on the number of scientists studying the corresponding genes. Thus, we counted the number of cluster pairs in which at least one gene pair was known, by
projecting the gene pairs with known interactions onto the
network. By this projection, the interactions were found in
35 (g in the equation of Section 2.3) cluster pairs among
153 (N) possible pairs (see details of the gene pair projection
at http://www.cbrc.jp/∼horimoto/GPPN.pdf). Then, 19 ( f )
of the 35 cluster pairs were overlapped with 34 (n) cluster
pairs in the rearranged network. The chance probability that
a known interaction was found in the connected cluster pairs
in the rearranged network was calculated as P < 10^{-4.3}. Thus,
the rearranged network faithfully captures the known interactions between the constituent genes.
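With the counts above (N = 153, g = 35, n = 34, f = 19), equation (8) can be evaluated exactly with integer arithmetic; a sketch:

```python
from math import comb

def chance_probability(N, g, n, f):
    """Equation (8): probability of overlapping at least f of the g
    'known-interaction' cluster pairs when n of the N possible cluster
    pairs carry edges (a hypergeometric tail)."""
    total = comb(N, n)
    return 1 - sum(comb(g, i) * comb(N - g, n - i) for i in range(f)) / total

# counts reported in Section 3.2
p = chance_probability(N=153, g=35, n=34, f=19)
```

The hypergeometric mean here is n·g/N ≈ 7.8 overlapping cluster pairs, so an observed overlap of 19 lies far in the tail, consistent with the very small reported probability.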
Furthermore, the genes with known interactions were matched against the genes responsible for the GO terms of
each cluster, as shown in Table 1. The genes responsible for
the GO terms were distributed over all cluster pairs, including gene pairs with known interactions, except for only two
pairs, clusters 15 and 17, and 15 and 18. Thus, the network
can be interpreted not only by the known gene interactions
but also by the GO terms characterizing the clusters.
3.3. Gene systems network characterized by GO terms
3.3.1. Coarse associations between the clusters
To elucidate the associations between the clusters, the cluster associations with 1% significance probability were further
discriminated from those with 5% probability. This generated four groups of clusters, shown in Figure 3(a).
First, we will focus on the groups including the clusters that were characterized by GO terms with a significant probability, and that were clearly occupied by up- or downregulated genes (clusters depicted by triangles with
bold lines in the figure). Groups I and III attained the above
criteria. In group I, the clusters were a mixture of the clusters
of the up- and downregulated genes. Note that three of the
six clusters were composed of upregulated genes, which were
characterized by responses (cluster 12), mixed categories
(cluster 14), and morphological events (cluster 15). In group
III, all three clusters were of downregulated genes. One cluster was characterized by responses, and two were characterized by amino-acid-related metabolism. In contrast, groups
II and IV were composed of the clusters that were somewhat
inadequately characterized by GO terms and expression patterns. Thus, groups I and III provide the characteristic features of the orchestration of gene expression in hepatocellular carcinogenesis.
Secondly, a coarse graining of the group associations provides another viewpoint, shown in Figure 3(b). When the
groups with at least one edge between the clusters in the respective groups were presented, regardless of the number of
edges, groups I, II, and IV were nested, and group III was
connected with only group I. In the second view, group I,
which includes three of the five clusters of upregulated genes
in all clusters, was associated with all of the other groups.
This suggests that group I represents a positive part of the
gene expression in hepatocellular carcinogenesis, which is
consistent with the interpretation by the first view, from the
significant GO terms and the clear expression patterns. Interestingly, among the clusters characterized by morphological
events (clusters 5, 15, 17, and 18), three of the four clusters
were distributed over groups I, II, and IV, and the distribution was consistent with the nested groups. This suggests that
the upregulated genes of the clusters in group I are responsible for the events at the cellular level.
Thirdly, the clusters not belonging to the four groups
were clusters 1, 3, and 5. Clusters 1, 3, and 5 were directly
connected with groups I, III, and IV, groups I and III, and
group IV, respectively. Interestingly, cluster 1, characterized
by only “anti-inflammatory response,” was connected with
five clusters belonging to three groups, in which four clusters were downregulated clusters. Although cluster 5 was not
clearly characterized by the GO terms, cluster 3 was characterized by metabolic terms that were quite similar to those
for cluster 2, a downregulated cluster. Thus, the three clusters may be concerned with downregulation in hepatocellular carcinogenesis.
3.3.2. Interpretations of the inferred network
in terms of pathogenesis
The coarse associations between the clusters in the preceding
section can be interpreted on the macroscopic level, such as
the pathological level. The interpretation of the network inferred based on the information at the molecular level will be
useful to bridge the gap between the information about the
disease mechanisms at the molecular and more macroscopic
levels.
One of the most remarkable associations is found in
group I. Cluster 12, with upregulation, was associated at a
1% significance level with cluster 2, with downregulation.
The former cluster is characterized by the GO terms related
to the immune response, and the latter is characterized by
those involved with metabolism. In general, CHC and HCC
result in serious damage to hepatocytes, which are important
cells for nutrient metabolism, and the damage induces different responses. Indeed, HCC is a suitable target for testing
active immunotherapy [29]. Furthermore, cluster 2 was also
associated at a 1% significance level with cluster 14, characterized by prostaglandin-related terms. This may reflect
the fact that one mediator of inflammation, prostaglandin,
shows elevated expression in human and animal HCCs [30].
Thus, the associations in group I are involved in the molecular pathogenesis of the CHC and HCC stages.
Table 1: Cluster characterization by GO terms#.

Cluster no. | GO no. | Category | P-value | Fraction
1 | GO:0030236 | Anti-inflammatory response | 0.18% | 2 of 22/6 of 26081
2 | GO:0006094 | Gluconeogenesis | 0.06% | 3 of 37/19 of 26081
2 | GO:0006066 | Alcohol metabolism | 0.12% | 6 of 37/312 of 26081
2 | GO:0006091 | Generation of precursor metabolites and energy | 0.14% | 9 of 37/961 of 26081
2 | GO:0019319 | Hexose biosynthesis | 0.34% | 3 of 37/33 of 26081
2 | GO:0046165 | Alcohol biosynthesis | 0.34% | 3 of 37/33 of 26081
2 | GO:0046364 | Monosaccharide biosynthesis | 0.34% | 3 of 37/33 of 26081
2 | GO:0006067 | Ethanol metabolism | 0.48% | 2 of 37/5 of 26081
2 | GO:0006069 | Ethanol oxidation | 0.48% | 2 of 37/5 of 26081
2 | GO:0006629 | Lipid metabolism | 1.47% | 7 of 37/722 of 26081
2 | GO:0009618 | Response to pathogenic bacteria | 4.96% | 2 of 37/15 of 26081
3 | GO:0006094 | Gluconeogenesis | 0.61% | 2 of 15/19 of 26081
3 | GO:0019319 | Hexose biosynthesis | 1.87% | 2 of 15/33 of 26081
3 | GO:0046165 | Alcohol biosynthesis | 1.87% | 2 of 15/33 of 26081
3 | GO:0046364 | Monosaccharide biosynthesis | 1.87% | 2 of 15/33 of 26081
3 | GO:0009069 | Serine family amino acid metabolism | 4.49% | 2 of 15/51 of 26081
4 | GO:0006725 | Aromatic compound metabolism | 0.07% | 4 of 20/140 of 26081
4 | GO:0009308 | Amine metabolism | 0.38% | 5 of 20/454 of 26081
4 | GO:0006570 | Tyrosine metabolism | 0.59% | 2 of 20/11 of 26081
4 | GO:0050878 | Regulation of body fluids | 1.65% | 3 of 20/113 of 26081
4 | GO:0006950 | Response to stress | 2.70% | 6 of 20/1116 of 26081
4 | GO:0006519 | Amino acid and derivative metabolism | 4.12% | 4 of 20/398 of 26081
4 | GO:0007582 | Physiological process | 4.63% | 20 of 20/17195 of 26081
5 | GO:0006917 | Induction of apoptosis∗ | 16.06% | 2 of 13/132 of 26081
5 | GO:0012502 | Induction of programmed cell death∗ | 16.06% | 2 of 13/132 of 26081
6 | GO:0009613 | Response to pest, pathogen, or parasite | 0.00% | 8 of 29/522 of 26081
6 | GO:0043207 | Response to external biotic stimulus | 0.00% | 8 of 29/557 of 26081
6 | GO:0006950 | Response to stress | 0.00% | 10 of 29/1116 of 26081
6 | GO:0009605 | Response to external stimulus | 0.05% | 10 of 29/1488 of 26081
6 | GO:0006953 | Acute-phase response | 0.05% | 3 of 29/25 of 26081
6 | GO:0006955 | Immune response | 0.34% | 8 of 29/1098 of 26081
6 | GO:0006956 | Complement activation | 0.48% | 3 of 29/52 of 26081
6 | GO:0006952 | Defense response | 0.68% | 8 of 29/1209 of 26081
6 | GO:0050896 | Response to stimulus | 1.15% | 11 of 29/2619 of 26081
6 | GO:0009607 | Response to biotic stimulus | 1.65% | 8 of 29/1372 of 26081
6 | GO:0006629 | Lipid metabolism | 2.20% | 6 of 29/722 of 26081
7 | GO:0006559 | L-phenylalanine catabolism | 0.83% | 2 of 31/9 of 26081
7 | GO:0019752 | Carboxylic acid metabolism | 1.00% | 6 of 31/590 of 26081
7 | GO:0006082 | Organic acid metabolism | 1.02% | 6 of 31/592 of 26081
7 | GO:0006558 | L-phenylalanine metabolism | 1.26% | 2 of 31/11 of 26081
7 | GO:0009074 | Aromatic amino acid family catabolism | 1.26% | 2 of 31/11 of 26081
7 | GO:0006519 | Amino acid and derivative metabolism | 1.67% | 5 of 31/398 of 26081
7 | GO:0019439 | Aromatic compound catabolism | 1.79% | 2 of 31/13 of 26081
7 | GO:0006629 | Lipid metabolism | 3.04% | 6 of 31/722 of 26081
7 | GO:0009308 | Amine metabolism | 3.09% | 5 of 31/454 of 26081
8 | GO:0001570 | Vasculogenesis | 0.09% | 2 of 21/4 of 26081
8 | GO:0006950 | Response to stress | 0.42% | 7 of 21/1116 of 26081
8 | GO:0050896 | Response to stimulus | 2.33% | 9 of 21/2619 of 26081
9 | GO:0009611 | Response to wounding∗ | 11.19% | 3 of 13/394 of 26081
10 | GO:0009607 | Response to biotic stimulus∗ | 6.66% | 6 of 19/1372 of 26081
11 | GO:0050896 | Response to stimulus∗ | 72.68% | 6 of 17/2619 of 26081
12 | GO:0006955 | Immune response | 0.01% | 8 of 18/1098 of 26081
12 | GO:0006952 | Defense response | 0.01% | 8 of 18/1209 of 26081
12 | GO:0050874 | Organismal physiological process | 0.02% | 10 of 18/2432 of 26081
12 | GO:0009607 | Response to biotic stimulus | 0.03% | 8 of 18/1372 of 26081
12 | GO:0050896 | Response to stimulus | 0.39% | 9 of 18/2619 of 26081
12 | GO:0030333 | Antigen processing | 0.97% | 3 of 18/108 of 26081
12 | GO:0019882 | Antigen presentation | 2.62% | 3 of 18/151 of 26081
12 | GO:0019884 | Antigen presentation, exogenous antigen | 3.97% | 2 of 18/32 of 26081
12 | GO:0019886 | Antigen processing, exogenous antigen via MHC class II | 4.22% | 2 of 18/33 of 26081
13 | GO:0009611 | Response to wounding | 0.08% | 6 of 30/394 of 26081
13 | GO:0009613 | Response to pest, pathogen, or parasite | 0.38% | 6 of 30/522 of 26081
13 | GO:0043207 | Response to external biotic stimulus | 0.55% | 6 of 30/557 of 26081
13 | GO:0006955 | Immune response | 3.12% | 7 of 30/1098 of 26081
13 | GO:0006950 | Response to stress | 3.44% | 7 of 30/1116 of 26081
13 | GO:0050874 | Organismal physiological process | 3.98% | 10 of 30/2432 of 26081
14 | GO:0051244 | Regulation of cellular physiological process | 0.51% | 8 of 45/665 of 26081
14 | GO:0007275 | Development | 0.94% | 13 of 45/2060 of 26081
14 | GO:0001516 | Prostaglandin biosynthesis | 3.30% | 2 of 45/9 of 26081
14 | GO:0046457 | Prostanoid biosynthesis | 3.30% | 2 of 45/9 of 26081
14 | GO:0051242 | Positive regulation of cellular physiological process | 4.35% | 5 of 45/289 of 26081
15 | GO:0008283 | Cell proliferation∗ | 29.37% | 4 of 26/488 of 26081
16 | GO:0042221 | Response to chemical substance | 0.16% | 5 of 31/237 of 26081
16 | GO:0008152 | Metabolism | 1.29% | 25 of 31/11891 of 26081
16 | GO:0009628 | Response to abiotic stimulus | 1.89% | 5 of 31/400 of 26081
16 | GO:0006445 | Regulation of translation | 2.82% | 3 of 31/87 of 26081
17 | GO:0050817 | Coagulation∗ | 13.92% | 2 of 12/118 of 26081
18 | GO:0007275 | Development∗ | 11.67% | 6 of 16/2060 of 26081

# The gene ontology terms detected in each cluster with 5% significance probability by using GO::TermFinder [18] are listed. When no terms attained that significance probability in a cluster, the terms with the smallest probability are listed, marked with an asterisk. In the last column, "Fraction," the numbers of genes belonging to the corresponding category in the cluster, of genes belonging to the cluster, of genes belonging to the corresponding category among all genes of the GO term data set, and of all genes are listed.
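Each "Fraction" entry ("a of b / c of d") has the shape of a one-sided hypergeometric enrichment test, the kind of test GO::TermFinder performs. A minimal sketch (an illustration of the test form, not GO::TermFinder itself; the listed P-values may include multiple-testing corrections, so exact agreement with the table is not expected):

```python
from math import comb

def enrichment_p(a, b, c, d):
    """One-sided hypergeometric P(X >= a): drawing b annotated cluster
    genes from d genes in total, of which c carry the GO term, and
    observing a term-carrying genes in the cluster."""
    total = comb(d, b)
    return sum(comb(c, i) * comb(d - c, b - i)
               for i in range(a, min(b, c) + 1)) / total

# e.g. cluster 1, "2 of 22/6 of 26081"
p = enrichment_p(2, 22, 6, 26081)
```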
The associated clusters 4 and 7 in group III, which were
characterized by GO terms related to amino acid and lipid
metabolism, also show downregulation. Indeed, the products of dysregulated (aberrantly regulated) metabolism are
widely used to examine liver function in common clinical
tests [8]. In addition, the connection between the clusters
in groups III and I implies that the downregulation of the
clusters in group III may be related to abnormal hepatocyte
function.
In addition, cluster 15 in group I, which is characterized
by the GO term “proliferation,” was associated with different clusters in groups I, II, and IV. It is known that abnormal
proliferation is one of the obvious features of cancer [31].
This broad association may be responsible for the cellular
level events in hepatocellular carcinogenesis.
In summary, the inferred network reveals a coarse snapshot of the gene systems related to the molecular pathogenesis and clinical characteristics of hepatocellular carcinogenesis. Although the resolution of the network is still low, due to
the cluster network, the present network may provide some
clues for further investigations of the pathogenic relationships involved in hepatocellular carcinoma.
3.3.3. Interpretations of the inferred network in terms of gene-gene interactions

In addition to the macroscopic interpretations above, the gene functionality suggested by the gene-gene interactions listed in Figure 2 is also discussed in the context of hepatocellular carcinoma. Although the consideration of gene-gene interactions is beyond the aim of the present study,
[Figure 2; known gene interactions overlaid on the cluster network: ALB-MTP, CYP2C9-CYP2C18, PLG-CPB2, THBD-CPB2, TF-CDH1, TF-HPX, GNG5-AEBP1, PRELP-SPARC, COL1A2-RFX5, CYP2E1-COL1A2, ALB-OCRL, FBP1-MAN1A1, LPA-MAP2K1, CYP2E1-MAP2K1, ALB-BCHE, IGFBP3-IRS1, SDC2-CXCL12, MAOA-MAOB, BAAT-NAT2, MAGED1-BIRC4, B2M-ARAF1, B2M-TIMP1, F8-VWF, ZFP36-VWF, B2M-RFX5, HTATIP2-NME2, SHC1-MAP3K10, DNCH1-CDKN2A, ASCL1-BMP4, CITED2-CDKN2A, FOS-ODC1, PCK1-PCK2, PLG-SERPINF2, THBD-SERPINF2, PLG-KLKB1, SPINK1-CTSB, FOXA3-CYP3A4, AMBP-MAP2K1, CRAT-AR, SORL1-CSF2, DIABLO-HSPB1, VEGF-A2M, NTRK2-A2M, JUN-A2M, VEGF-HSPB1, VEGF-THBS2, VEGF-CTF1, VEGF-CSF2, JUN-CSF2, JUN-WEE1.]

Figure 2: Network between clusters, together with a projection of biological knowledge about the gene interactions. The clusters are indicated by triangles and circles, in which the cluster numbers correspond to those in Figure 1, and the edges between the clusters are associations with 5% significance probability. The red triangles, the green upside-down triangles, and the circles indicate the clusters of upregulated genes, the clusters of downregulated genes, and the clusters with a mixture of the two, respectively, and the dotted triangles indicate the clusters that were not characterized by GO terms with less than 5% significance probability. The known gene interactions in Pathway Assist are indicated between the clusters, in which the genes highlighted by bold letters are characterized by the GO terms in Table 1.
some examples may provide possible clues about the disease
mechanisms.
First, we surveyed the frequencies of GO terms (geneGOB, listed in the supplemental data at http://www.cbrc.jp/∼horimoto/suppl/HCGO.pdf) in the selected genes
in the present analysis, to investigate the features of
gene-gene interactions in the inferred network. A few
general terms appeared frequently, such as “response” (122
times in the geneGOB column of the supplemental data
at http://www.cbrc.jp/∼horimoto/suppl/HCGO.pdf) and
“metabolism” (183), as expected from the coarse associations
between the clusters in the preceding section. As for more
specific terms about the gene function, "lipid" (46), "apoptosis" (31), and "cell growth" (27) stand out in the list. The "lipid" term is expected from the relationship between
groups I and III, and the “apoptosis” and the “cell growth”
are also expected from the frequent appearance of GO terms
(clusterGOB listed in Table 1) related to the morphological
events. Since the frequent appearance of “lipid” may be a
sensitive reflection of the protein-protein interactions in
lipid metabolic pathways to the expression profiles, here,
we focus on the gene-gene interactions characterized by the
“apoptosis” and the “cell growth.”
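The kind of keyword-frequency survey described above can be sketched as follows. The annotations here are hypothetical stand-ins; the real lists come from the geneGOB column of the supplemental data.

```python
from collections import Counter

# Hypothetical gene -> GO-term annotations (illustrative only).
gene_go = {
    "HTATIP2": ["apoptosis", "response to stimulus"],
    "BIRC4":   ["apoptosis", "cell growth"],
    "FBP1":    ["metabolism", "lipid metabolism"],
    "ALB":     ["metabolism", "response to stress"],
}

# Count how often each keyword occurs across all annotations.
keywords = ["response", "metabolism", "lipid", "apoptosis", "cell growth"]
counts = Counter()
for terms in gene_go.values():
    for term in terms:
        for kw in keywords:
            if kw in term:
                counts[kw] += 1

print(counts.most_common())
```

With real annotation lists, the same loop reproduces the style of tallies quoted in the text (e.g., “metabolism” counted once per matching term).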
Among the gene-gene interactions listed in Figure 2, those characterized by cell growth or death are found within the coarse associations between the clusters. Group I contains the gene-gene interactions related to apoptosis. The expression of HTATIP2 (HIV-1 Tat interactive protein 2, 30 kDa) in cluster 14 induces the expression of a number of genes, including NME2 (nonmetastatic cells 2, protein) in cluster 15 as well as the apoptosis-related genes Bad and Siva [32]. MAGED1 (melanoma antigen, family D, 1) in cluster 13 and its binding partner BIRC4 (baculoviral IAP repeat-containing 4) in cluster 14 are known to play roles in apoptosis [33]. In addition, the expression of COL1A2 (collagen, type I, alpha 2) in cluster 12, which is related to cell adhesion and skeletal development, is regulated by RFX5 (regulatory factor X, 5) in cluster 14 [29, 34]. In group IV, the expression of CSF2 (colony-stimulating factor 2) in cluster 8 depends on the cooperation between NFAT (nuclear factor of activated T cells) and JUN (Jun oncogene) in cluster 10 [35]. Between groups I and II, ASCL1 (achaete-scute complex-like 1) in cluster 13 and BMP4 (bone morphogenetic protein 4) in cluster 18 share the function of cell differentiation [36].
As a result, the gene-gene interactions listed above are related to the mechanisms of cell growth or death at the molecular level. On the other hand, the cluster associations reveal the relationship between the cancer-induced events and various aspects of metabolism at the levels of pathogenesis and clinical characteristics. Thus, the metabolic pathways might directly influence the mechanisms of cancer-induced cell growth or death at the molecular level in as yet unknown ways.

[Figure 3 graphic: (a) the cluster association network over clusters 1–18, with groups I–IV outlined; (b) a schematic of the connections among groups I–IV.]

Figure 3: Orchestration of gene systems. (a) Associations with 1% significance probability are indicated by bold lines, and the clusters with 1% significance associations naturally divide into four groups, which are enclosed by broken lines. (b) The connections between the groups are drawn schematically, as a coarse graining of the cluster association.

3.4. Merits and pitfalls of the present approach

The present analysis reveals a framework of gene system associations in hepatocellular carcinogenesis. The inferred network provides a bridge between events at the molecular level and those at macroscopic levels: the associations between clusters characterized by cancer-related responses and those characterized by metabolic and morphological events can be interpreted from pathological and clinical viewpoints. In addition, the gene-gene interactions in the inferred network indicate the relationship between cancer and cell growth/death. Thus, the gene systems network may also be useful as a bridge between the gene-gene interactions and observations at macroscopic levels, such as clinical tests.

The present method assumes linearity in the cluster associations by using partial correlation coefficients to identify independence between clusters. It is well known that the interactions among genes and other molecular components are often nonlinear, and the assumption of linearity misses many important relationships among genes. In the present study, however, our aim was not the inference of detailed gene-gene interactions but of coarse gene system interactions. Indeed, the partial correlation coefficient has been employed as a feasible first approximation for gene association inference in several studies [37, 38]. Thus, the linearity assumption is not suitable for a fine analysis of dynamic gene behaviors, but it may be useful for the approximate analysis of static gene associations.

ACKNOWLEDGMENTS

S. Aburatani was supported by a Grant-in-Aid for Scientific Research (Grant 18681031) from the Ministry of Education, Culture, Sports, Science, and Technology of Japan, and K. Horimoto was partly supported by a Grant-in-Aid for Scientific Research on Priority Areas “Systems Genomics” (Grant 18016008) and by a Grant-in-Aid for Scientific Research (Grant 19201039) from the Ministry of Education, Culture, Sports, Science, and Technology of Japan. This study was supported in part by the New Energy and Industrial Technology Development Organization (NEDO) of Japan and by the Ministry of Health, Labour, and Welfare of Japan.
REFERENCES
[1] M. J. Alter, H. S. Margolis, K. Krawczynski, et al., “The natural history of community-acquired hepatitis C in the United
States. The sentinel counties chronic non-A, non-B hepatitis
study team,” The New England Journal of Medicine, vol. 327,
no. 27, pp. 1899–1905, 1992.
[2] A. M. Di Bisceglie, “Hepatitis C,” The Lancet, vol. 351,
no. 9099, pp. 351–355, 1998.
[3] S. Zeuzem, S. V. Feinman, J. Rasenack, et al., “Peginterferon
alfa-2a in patients with chronic hepatitis C,” The New England
Journal of Medicine, vol. 343, no. 23, pp. 1666–1672, 2000.
[4] S. S. Thorgeirsson, J.-S. Lee, and J. W. Grisham, “Molecular
prognostication of liver cancer: end of the beginning,” Journal
of Hepatology, vol. 44, no. 4, pp. 798–805, 2006.
[5] N. Iizuka, M. Oka, H. Yamada-Okabe, et al., “Oligonucleotide
microarray for prediction of early intrahepatic recurrence of
hepatocellular carcinoma after curative resection,” The Lancet,
vol. 361, no. 9361, pp. 923–929, 2003.
[6] H. Okabe, S. Satoh, T. Kato, et al., “Genome-wide analysis
of gene expression in human hepatocellular carcinomas using
cDNA microarray: identification of genes involved in viral carcinogenesis and tumor progression,” Cancer Research, vol. 61,
no. 5, pp. 2129–2137, 2001.
[7] L.-H. Zhang and J.-F. Ji, “Molecular profiling of hepatocellular
carcinomas by cDNA microarray,” World Journal of Gastroenterology, vol. 11, no. 4, pp. 463–468, 2005.
[8] J. Jiang, P. Nilsson-Ehle, and N. Xu, “Influence of liver cancer on lipid and lipoprotein metabolism,” Lipids in Health and
Disease, vol. 5, p. 4, 2006.
[9] A. Zerbini, M. Pilli, C. Ferrari, and G. Missale, “Is there a role
for immunotherapy in hepatocellular carcinoma?” Digestive
and Liver Disease, vol. 38, no. 4, pp. 221–225, 2006.
[10] K. Horimoto and H. Toh, “Statistical estimation of cluster
boundaries in gene expression profile data,” Bioinformatics,
vol. 17, no. 12, pp. 1143–1151, 2001.
[11] H. Toh and K. Horimoto, “Inference of a genetic network by a
combined approach of cluster analysis and graphical Gaussian
modeling,” Bioinformatics, vol. 18, no. 2, pp. 287–297, 2002.
[12] S. Lauritzen, Graphical Models, Oxford University Press, Oxford, UK, 1996.
[13] J. Whittaker, Graphical Models in Applied Multivariate Statistics, John Wiley & Sons, New York, NY, USA, 1990.
[14] H. Toh and K. Horimoto, “System for automatically inferring a
genetic network from expression profiles,” Journal of Biological
Physics, vol. 28, no. 3, pp. 449–464, 2002.
[15] D. K. Slonim, “From patterns to pathways: gene expression
data analysis comes of age,” Nature Genetics, vol. 32, no. 5, pp.
502–508, 2002.
[16] S. Aburatani, S. Kuhara, H. Toh, and K. Horimoto, “Deduction
of a gene regulatory relationship framework from gene expression data by the application of graphical Gaussian modeling,”
Signal Processing, vol. 83, no. 4, pp. 777–788, 2003.
[17] M. Ashburner, C. A. Ball, J. A. Blake, et al., “Gene ontology:
tool for the unification of biology,” Nature Genetics, vol. 25,
no. 1, pp. 25–29, 2000.
[18] E. I. Boyle, S. Weng, J. Gollub, et al., “GO::TermFinder—open
source software for accessing gene ontology information and
finding significantly enriched gene ontology terms associated
with a list of genes,” Bioinformatics, vol. 20, no. 18, pp. 3710–
3715, 2004.
[19] A. Nikitin, S. Egorov, N. Daraselia, and I. Mazo, “Pathway
studio—the analysis and navigation of molecular networks,”
Bioinformatics, vol. 19, no. 16, pp. 2155–2157, 2003.
[20] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An
Introduction to Cluster Analysis, John Wiley & Sons, New York,
NY, USA, 1990.
[21] R. J. Freund and W. J. Wilson, Regression Analysis: Statistical
Modeling of a Response Variable, Academic Press, San Diego,
Calif, USA, 1998.
[22] A. P. Dempster, “Covariance selection,” Biometrics, vol. 28,
no. 1, pp. 157–175, 1972.
[23] N. Wermuth and E. Scheidt, “Algorithm AS 105: fitting a
covariance selection model to a matrix,” Applied Statistics,
vol. 26, no. 1, pp. 88–92, 1977.
[24] L. F. Wu, T. R. Hughes, A. P. Davierwala, M. D. Robinson, R.
Stoughton, and S. J. Altschuler, “Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters,” Nature Genetics, vol. 31, no. 3, pp. 255–
265, 2002.
[25] T. W. Anderson, An Introduction to Multivariate Statistical
Analysis, John Wiley & Sons, New York, NY, USA, 2nd edition,
1984.
[26] S. Aburatani, K. Goto, S. Saito, et al., “ASIAN: a website for
network inference,” Bioinformatics, vol. 20, no. 16, pp. 2853–
2856, 2004.
[27] S. Aburatani, K. Goto, S. Saito, H. Toh, and K. Horimoto,
“ASIAN: a web server for inferring a regulatory network
framework from gene expression profiles,” Nucleic Acids Research, vol. 33, pp. W659–W664, 2005.
[28] M. Honda, S. Kaneko, H. Kawai, Y. Shirota, and K. Kobayashi,
“Differential gene expression between chronic hepatitis B and
C hepatic lesion,” Gastroenterology, vol. 120, no. 4, pp. 955–
966, 2001.
[29] T. Wu, “Cyclooxygenase-2 in hepatocellular carcinoma,” Cancer Treatment Reviews, vol. 32, no. 1, pp. 28–44, 2006.
[30] H. Xiao, V. Palhan, Y. Yang, and R. G. Roeder, “TIP30 has an
intrinsic kinase activity required for up-regulation of a subset
of apoptotic genes,” The EMBO Journal, vol. 19, no. 5, pp. 956–
963, 2000.
[31] W. B. Coleman, “Mechanisms of human hepatocarcinogenesis,” Current Molecular Medicine, vol. 3, no. 6, pp. 573–588,
2003.
[32] Y. Xu, P. K. Sengupta, E. Seto, and B. D. Smith, “Regulatory
factor for X-box family proteins differentially interact with histone deacetylases to repress collagen α2(I) gene (COL1A2) expression,” Journal of Biological Chemistry, vol. 281, no. 14, pp.
9260–9270, 2006.
[33] P. A. Barker and A. Salehi, “The MAGE proteins: emerging
roles in cell cycle progression, apoptosis, and neurogenetic disease,” Journal of Neuroscience Research, vol. 67, no. 6, pp. 705–
712, 2002.
[34] Y. Xu, L. Wang, G. Buttice, P. K. Sengupta, and B. D. Smith,
“Interferon γ repression of collagen (COL1A2) transcription
is mediated by the RFX5 complex,” The Journal of Biological
Chemistry, vol. 278, no. 49, pp. 49134–49144, 2003.
[35] F. Macian, C. Garcia-Rodriguez, and A. Rao, “Gene expression
elicited by NFAT in the presence or absence of cooperative recruitment of Fos and Jun,” The EMBO Journal, vol. 19, no. 17,
pp. 4783–4795, 2000.
[36] J. Fu, S. S. W. Tay, E. A. Ling, and S. T. Dheen, “High glucose alters the expression of genes involved in proliferation and cellfate specification of embryonic neural stem cells,” Diabetologia, vol. 49, no. 5, pp. 1027–1038, 2006.
[37] J. Schäfer and K. Strimmer, “An empirical Bayes approach to
inferring large-scale gene association networks,” Bioinformatics, vol. 21, no. 6, pp. 754–764, 2005.
[38] A. de la Fuente, N. Bing, I. Hoeschele, and P. Mendes, “Discovery of meaningful associations in genomic data using partial correlation coefficients,” Bioinformatics, vol. 20, no. 18, pp.
3565–3574, 2004.
Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 71312, 14 pages
doi:10.1155/2007/71312
Research Article
Uncovering Gene Regulatory Networks from Time-Series
Microarray Data with Variational Bayesian Structural
Expectation Maximization
Isabel Tienda Luna,1 Yufei Huang,2 Yufang Yin,2 Diego P. Ruiz Padillo,1 and M. Carmen Carrion Perez1
1 Department of Applied Physics, University of Granada, 18071 Granada, Spain
2 Department of Electrical and Computer Engineering, University of Texas at San Antonio (UTSA), San Antonio, TX 78249-0669, USA
Received 1 July 2006; Revised 4 December 2006; Accepted 11 May 2007
Recommended by Ahmed H. Tewfik
We investigate in this paper reverse engineering of gene regulatory networks from time-series microarray data. We apply dynamic
Bayesian networks (DBNs) for modeling cell cycle regulations. In developing a network inference algorithm, we focus on soft
solutions that can provide a posteriori probability (APP) of network topology. In particular, we propose a variational Bayesian
structural expectation maximization algorithm that can learn the posterior distribution of the network model parameters and
topology jointly. We also show how the obtained APPs of the network topology can be used in a Bayesian data integration strategy
to integrate two different microarray data sets. The proposed VBSEM algorithm has been tested on yeast cell cycle data sets. To
evaluate the confidence of the inferred networks, we apply a moving block bootstrap method. The inferred network is validated by
comparing it to the KEGG pathway map.
Copyright © 2007 Isabel Tienda Luna et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. INTRODUCTION
With the completion of the Human Genome Project and the successful sequencing of the genomes of many other organisms, the emphasis of postgenomic research has shifted to understanding the functions of genes [1]. We investigate in this
paper reverse engineering gene regulatory networks (GRNs)
based on time-series microarray data. GRNs are the functioning circuitry in living organisms at the gene level. They
display the regulatory relationships among genes in a cellular
system. These regulatory relationships are involved directly
and indirectly in controlling the production of protein and
in mediating metabolic processes. Understanding GRNs can
provide new ideas for treating complex diseases and breakthroughs for designing new drugs.
GRNs cannot be measured directly but can be inferred
based on their inputs and outputs. This process of recovering
GRNs from their inputs and outputs is referred to as reverse
engineering GRNs [2]. The inputs of GRNs are a sequence
of signals and the outputs are gene expressions at either the
mRNA level or the protein level. One popular technology
that measures the expression of a large number of genes at the mRNA level is the microarray. It is not surprising that microarray data have been a popular source for uncovering GRNs
[3, 4]. Of particular interest to this paper are time-series microarray data, which are generated from a cell cycle process.
Using the time-series microarray data, we aim to uncover the
underlying GRNs that govern the process of cell cycles.
Mathematically, reverse engineering GRNs is a classical inverse problem whose solution requires proper modeling and learning from data. Despite the many existing methods for solving inverse problems, the GRN problem is not trivial: special attention must be paid to the enormously large number of unknowns and to the difficulties posed by the small sample size, not to mention inherent experimental defects, noisy readings, and so forth. These
call for powerful mathematic modeling together with reliable
inference. At the same time, approaches for integrating different types of relevant data are desirable. In the literature,
many different models have been proposed for both static and cell cycle networks, including probabilistic Boolean networks [5, 6], (dynamic) Bayesian networks [7–9], differential
equations [10], and others [11, 12]. Unlike in the case of
static experiments, extra effort is needed to model temporal dependency between samples for the time-series experiments. Such time-series models can in turn complicate the
inference, thus making the task of reverse engineering even
tougher than it already is.
In this paper, we apply dynamic Bayesian networks
(DBNs) to model time-series microarray data. DBNs have
been applied to reverse engineering GRNs in the past [13–
18]. Differences among the existing work are the specific
models used for gene regulations and the detailed inference
objectives and algorithms. These existing models include
discrete binomial models [14, 17], linear Gaussian models
[16, 17], and spline function with Gaussian noise [18]. We
choose to use the linear Gaussian regulatory model in this
paper. Linear Gaussian models describe the continuous gene expression level directly, thus avoiding the loss of information incurred by discretized models. Even though linear Gaussian models may be less realistic, network inference over them is considerably easier than over nonlinear and/or non-Gaussian models, and it therefore leads to more robust results. It has been shown in [19] that, when both computational complexity and inference accuracy are taken into consideration, linear Gaussian models are favored over nonlinear regulatory models. In addition, this model captures the joint effect of gene regulation and the microarray experiment itself, and its validity is better evaluated from the data directly. In this paper, we provide a statistical test of the validity of the linear Gaussian model.
To learn the proposed DBNs from time-series data, we
aim at soft Bayesian solutions, that is, the solutions that
provide the a posteriori probabilities (APPs) of the network
topology. This requirement separates the proposed solutions
from most of the existing approaches, such as greedy search
and simulated-annealing-based algorithms, all of which produce only point estimates of the networks and are considered
as “hard” solutions. The advantage of soft solutions has been
demonstrated in digital communications [20]. In the context of GRNs, the APPs from the soft solutions provide valuable measurements of confidence on inference, which is difficult with hard solutions. Moreover, the obtained APPs can
be used for Bayesian data integration, which will be demonstrated in the paper. Soft solutions including Markov chain
Monte Carlo (MCMC) sampling [21, 22] and variational
Bayesian expectation maximization (VBEM) [16] have been
proposed for learning the GRNs. However, MCMC sampling
is only feasible for small networks due to its high complexity.
In contrast, VBEM has been shown to be much more efficient. However, the VBEM algorithm in [16] was developed
only for parameter learning. It therefore cannot provide the
desired APPs of topology. In this paper, we propose a new
variational Bayesian structural EM (VBSEM) algorithm that
can learn both parameters and topology of a network. The algorithm still maintains the general feature of VBEM for having low complexity, thus it is appropriate for learning large
networks. In addition, it estimates the APPs of topology directly and is suitable for Bayesian data integration. To this
end, we discuss a simple Bayesian strategy for integrating two
microarray data sets by using the APPs obtained from VBSEM.
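Section 4 develops the paper's integration strategy in detail; as a generic sketch of the underlying idea, the APP of an edge learned from one data set can serve as the prior when processing a second data set. The function below is an illustrative odds-form update with a hypothetical likelihood ratio, not the paper's exact procedure.

```python
def integrate_app(app_from_data1, likelihood_ratio_data2):
    """Combine the APP of an edge learned from data set 1 with evidence
    from data set 2, treating the first APP as the prior for the second.
    likelihood_ratio_data2 = p(Y2 | edge present) / p(Y2 | edge absent)."""
    prior_odds = app_from_data1 / (1.0 - app_from_data1)
    posterior_odds = prior_odds * likelihood_ratio_data2
    return posterior_odds / (1.0 + posterior_odds)

# An edge with APP 0.7 from the first experiment, reinforced by the
# second experiment: odds 7/3 times 3 gives odds 7, i.e. probability 0.875.
print(integrate_app(0.7, 3.0))
```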
We apply the VBSEM algorithm to uncover the yeast cell
cycle networks. To obtain the statistics of the VBSEM inference results and to overcome the difficulty of the small sample size, we apply a moving block bootstrap method. Unlike conventional bootstrap strategy, this method is specifically designed for time-series data. In particular, we propose
a practical strategy for determining the block length. Also, to
serve our objective of obtaining soft solutions, we apply the
bootstrap samples for estimating the desired APPs. Instead
of making a decision of the network from each bootstrapped
data set, we make a decision based on the bootstrapped APPs.
This practice relieves the problem of small sample size, making the solution more robust.
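A generic moving block bootstrap resamples contiguous blocks of the series rather than individual samples, preserving short-range temporal dependence. The sketch below illustrates the mechanism only; the block-length choice and the APP estimation used by the authors are described in Section 5.

```python
import random

def moving_block_bootstrap(series, block_len, rng=random.Random(0)):
    """Resample a time series by concatenating randomly chosen
    contiguous blocks, preserving short-range temporal dependence."""
    n = len(series)
    n_blocks = -(-n // block_len)          # ceil(n / block_len)
    starts = [rng.randrange(n - block_len + 1) for _ in range(n_blocks)]
    resample = []
    for s in starts:
        resample.extend(series[s:s + block_len])
    return resample[:n]                     # trim to the original length

data = list(range(20))
boot = moving_block_bootstrap(data, block_len=4)
print(len(boot))  # 20
```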
The rest of the paper is organized as follows. In Section 2,
DBNs modeling of the time-series data is discussed. The
detailed linear Gaussian model for gene regulation is also
provided. In Section 3, objectives on learning the networks
are discussed and the VBSEM algorithm is developed. In
Section 4, a Bayesian integration strategy is illustrated. In
Section 5, the test results of the proposed VBSEM on the simulated networks and yeast cell cycle data are provided. A bootstrap method for estimating the APPs is also discussed. The
paper concludes in Section 6.
2. MODELING WITH DYNAMIC BAYESIAN NETWORKS
Like all graphical models, a DBN is a marriage of graphical
and probabilistic theories. In particular, DBNs are a class of
directed acyclic graphs (DAGs) that model probabilistic distributions of stochastic dynamic processes. DBNs enable easy
factorization on joint distributions of dynamic processes into
products of simpler conditional distributions according to
the inherent Markov properties, and thus greatly facilitate the
task of inference. DBNs are shown to be a generalization of a
wide range of popular models, which include hidden Markov
models (HMMs) and Kalman filtering models, or state-space
models. They have been successfully applied in computer vision, speech processing, target tracking, and wireless communications. Refer to [23] for a comprehensive discussion
on DBNs.
A DBN consists of nodes and directed edges. Each node
represents a variable in the problem while a directed edge
indicates the direct association between the two connected
nodes. In a DBN, the direction of an edge can carry the temporal information. To model the gene regulation from cell
cycle using DBNs, we assume to have a microarray that measures the expression levels of G genes at N + 1 evenly sampled
consecutive time instances. We then define a random variable
matrix Y ∈ RG×(N+1) with the (i, n)th element yi (n − 1) denoting the expression level of gene i measured at time n − 1
(see Figure 1). We further assume that the gene regulation
follows a first-order time-homogeneous Markov process. As
a result, we need only to consider regulatory relationships
between two consecutive time instances and this relationship remains unchanged over the course of the microarray
experiment. This assumption may be insufficient, but it will
[Figure 1 graphic: a microarray data matrix of expression levels yi(n), genes 1, 2, 3, . . . , G by time instances 0, 1, . . . , N, shown alongside the corresponding first-order dynamic Bayesian network, in which each node yi(n) receives directed edges only from nodes in the previous time slice.]

Figure 1: A dynamic Bayesian network modeling of time-series expression data.
facilitate the modeling and inference. Also, we call the regulating genes the “parent genes,” or “parents” for short.
Based on these definitions and assumptions, the joint probability p(Y) can be factorized as p(Y) = ∏_{1≤n≤N} p(y(n) | y(n − 1)), where y(n) is the vector of expression levels of all genes at time n. In addition, we assume that, given y(n − 1), the expression levels at time n become independent. As a result, p(y(n) | y(n − 1)), for all n, can be further factorized as p(y(n) | y(n − 1)) = ∏_{1≤i≤G} p(yi(n) | y(n − 1)). These factorizations suggest the structure of the proposed DBNs illustrated in Figure 1 for modeling the cell cycle regulations. In
this DBN, each node denotes a random variable in Y and all
the nodes are arranged the same way as the corresponding
variables in the matrix Y. An edge between two nodes denotes the regulatory relationship between the two associated
genes and the arrow indicates the direction of regulation. For
example, we see from Figure 1 that genes 1, 3, and G regulate
gene i. Even though, like all Bayesian networks, DBNs do not allow directed cycles in the graph, they are nevertheless capable of modeling circular regulatory relationships, an important property that regular Bayesian networks do not possess. As an example, a circular regulation can be seen in Figure 1 between genes 1 and 2, even though no directed loops appear in the graph.
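The point that circular regulation induces no graph cycle can be made concrete by unrolling a two-gene mutual regulation over time. This is an illustrative sketch; the parent sets are invented.

```python
# Mutual regulation between genes 1 and 2: each regulates the other
# across one time step. In the unrolled DBN the nodes are (gene, time),
# so the "circular" relationship never creates a directed cycle.
regulators = {1: [2], 2: [1]}  # gene -> regulating genes (previous slice)

N = 3  # time instances 0..N
edges = [((p, t - 1), (g, t))
         for t in range(1, N + 1)
         for g, parents in regulators.items()
         for p in parents]

# Every edge goes strictly forward in time, so no cycle is possible.
assert all(src[1] < dst[1] for src, dst in edges)
print(edges[:2])  # [((2, 0), (1, 1)), ((1, 0), (2, 1))]
```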
To complete modeling with DBNs, we need to define the
conditional distributions of each child node over the graph.
Then the desired joint distribution can be represented as a
product of these conditional distributions. To define the conditional distributions, we let pai (n) denote a column vector of the expression levels of all the parent genes that regulate gene i measured at time n. As an example in Figure 1,
pai (n)T = [y1 (n), y3 (n), yG (n)]. Then, the conditional distribution of each child node over the DBNs can be expressed as
p(yi (n) | pai (n − 1)), for all i. To determine the expression of
the distributions, we assume linear regulatory relationship,
that is, the expression level of gene i is the result of linear
combination of the expression levels of the regulating genes
at the previous sample time. As a further simplification, we assume that the regulation is a time-homogeneous process.
Mathematically, we have the following expression:

yi(n) = wiᵀ pai(n − 1) + ei(n),   n = 1, 2, . . . , N,   (1)

where wi is the weight vector, independent of time n, and ei(n) is assumed to be white Gaussian noise with variance σi².
We provide in Section 5 the statistical test of the validity of
white Gaussian noise. The weight vector is indicative of the
degree and the types of the regulation [16]. A gene is upregulated if the weight is positive and is down-regulated otherwise. The magnitude (absolute value) of the weight indicates
the degree of regulation. The noise variable is introduced to
account for modeling and experimental errors. From (1), we
obtain that the conditional distribution is a Gaussian distribution, that is,
p(yi(n) | pai(n − 1)) = N(wiᵀ pai(n − 1), σi²).   (2)
In (1), the weight vector wi and the noise variance σi2 are the
unknown parameters to be determined.
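Equation (1) can be simulated directly. The sketch below uses invented weights and parent expression profiles to generate one gene's trajectory under the linear Gaussian model.

```python
import random

def simulate_gene(w, parents_series, sigma, rng=random.Random(1)):
    """Simulate y_i(n) = w^T pa_i(n-1) + e_i(n), e_i(n) ~ N(0, sigma^2).
    parents_series[n] holds the parents' expression levels at time n."""
    y = []
    for n in range(1, len(parents_series)):
        pa = parents_series[n - 1]
        mean = sum(wj * pj for wj, pj in zip(w, pa))
        y.append(mean + rng.gauss(0.0, sigma))
    return y

# Gene i regulated by two parents with weights +0.8 (up) and -0.5 (down).
parents = [[1.0, 0.2], [0.9, 0.4], [1.1, 0.1]]  # times 0, 1, 2
print(simulate_gene([0.8, -0.5], parents, sigma=0.05))
```

Setting sigma to zero recovers the noiseless linear combination, which is a quick way to check the weights' interpretation as up- and down-regulation.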
2.1. Objectives
Based on the above dynamic Bayesian networks formulation,
our work has two objectives. First, given a set of time-series
data from a single experiment, we aim at uncovering the
underlying gene regulatory networks. This is equivalent to
learning the structure of the DBNs. Specifically, if we can determine that genes 2 and 3 are the parents of gene 1 in the
DBNs, there will be directed links going from gene 2 and
3 to gene 1 in the uncovered GRNs. Second, we are also
concerned with integrating two data sets of the same network from different experiments. Through integrating the
two data sets, we expect to improve the confidence of the
inferred networks obtained from a single experiment. To
achieve these two objectives, we propose in the following
an efficient variational Bayesian structural EM algorithm to
learn the network and a Bayesian approach for data integration.
3. LEARNING THE DBN WITH VBSEM
Given a set of microarray measurements on the expression
levels in cell cycles, the task of learning the above DBN consists of two parts: structure learning and parameter learning.
The objective of structure learning is to determine the topology of the network or the parents of each gene. This is essentially a problem of model or variable selection. Under a given
structure, parameter learning involves the estimation of the
unknown model coefficients of each gene: the weight vector
wi and the noise variance σi2 , for all i. Since the network is
fully observed and, given parent genes, the gene expression
levels at any given time are independent, we can learn the
parents and the associated model parameters of each gene
separately. Thus we only discuss in the following the learning
process on gene i.
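Because the genes decouple given their parents, the parameters for one gene under a fixed candidate parent set can be estimated independently. As a simpler stand-in for the Bayesian parameter learning developed below, the sketch fits wi and σi² by ordinary least squares on synthetic data; the function name and data are invented.

```python
import numpy as np

def fit_gene(y, Pa):
    """Least-squares fit of y_i = Pa_i w_i + e_i for one gene given a
    candidate parent matrix Pa (N x P); returns w_hat and noise variance."""
    w_hat, *_ = np.linalg.lstsq(Pa, y, rcond=None)
    resid = y - Pa @ w_hat
    sigma2_hat = float(resid @ resid) / len(y)
    return w_hat, sigma2_hat

rng = np.random.default_rng(0)
Pa = rng.normal(size=(50, 2))            # two parent genes, 50 time steps
y = Pa @ np.array([0.8, -0.5]) + 0.05 * rng.normal(size=50)
w_hat, s2 = fit_gene(y, Pa)
print(w_hat.round(2))   # close to [0.8, -0.5]
```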
3.1. A Bayesian criterion for network structural learning
Let Si = {Si^(1), Si^(2), . . . , Si^(K)} denote a set of K possible network topologies for gene i, where each element represents a topology derived from a possible combination of the parents of gene i. The problem of structure learning is to select the topology from Si that is best supported by the microarray data.

For a particular topology Si^(k), we use wi^(k), pai^(k), ei^(k), and σik² to denote the associated model variables. We can then express (1) for Si^(k) in a more compact matrix-vector form:

yi = Pai^(k) wi^(k) + ei^(k),   (3)

where yi = [yi(1), . . . , yi(N)]ᵀ, Pai^(k) = [pai^(k)(0), pai^(k)(1), . . . , pai^(k)(N − 1)]ᵀ, ei^(k) = [ei^(k)(1), ei^(k)(2), . . . , ei^(k)(N)]ᵀ, and wi^(k) is independent of time n.
The structural learning can be performed under the
Bayesian paradigm. In particular, we are interested in calculating the a posteriori probabilities of the network topology
p(S(k)
| Y), for all k. The APPs will be important for the
i
data integration tasks. They also provide a measurement of
confidence on inferred networks. Once we obtain the APPs,
we can select the most probable topology Si according to the
maximum a posteriori (MAP) criterion [24], that is,
Si = arg max p
(k)
Si ∈Si
S(k)
i
|Y .
(k)
p yi | S(k)
i ,Y−i p Si | Y−i
p yi | Y−i
3.2.
Variational Bayesian structural
expectation maximization
To develop the VBSEM algorithm, we define a G-dimensional binary vector bi ∈ {0, 1}G , where bi ( j) = 1 if gene j is a
parent of gene i in the topology Si and bi ( j) = 0 otherwise.
We can actually consider bi as an equivalent representation of
Si and finding the structure Si can thus equate to determining
the values of bi . Consequently, we can replace Si in all the
above expressions by bi and turn our attention to estimate
the equivalent APPs p(bi | Y).
The APPs are calculated according to Bayes' theorem:

$$p\bigl(S_i^{(k)} \mid \mathbf{Y}\bigr) = \frac{p\bigl(\mathbf{y}_i \mid \mathbf{Y}_{-i}, S_i^{(k)}\bigr)\,p\bigl(S_i^{(k)}\bigr)}{p\bigl(\mathbf{y}_i \mid \mathbf{Y}_{-i}\bigr)} = \frac{p\bigl(\mathbf{y}_i \mid \mathrm{Pa}_i^{(k)}\bigr)\,p\bigl(S_i^{(k)}\bigr)}{p\bigl(\mathbf{y}_i \mid \mathbf{Y}_{-i}\bigr)}, \qquad (5)$$

where Y−i represents the matrix obtained by removing yi from Y. The second equality follows from the fact that, given S_i^(k), yi depends on Y−i only through Pa_i^(k); the last equality holds because, given Pa_i^(k), S_i^(k) is known automatically, whereas S_i^(k) cannot be determined from Y−i. Note also that there is a slight abuse of notation in (4): Y in p(S_i^(k) | Y) denotes a realization of expression levels measured in a microarray experiment.

To calculate the APPs according to (5), the marginal likelihood p(yi | Pa_i^(k)) and the normalizing constant p(yi | Y−i) need to be determined. It has been shown that, with conjugate priors on the parameters, p(yi | Pa_i^(k)) can be obtained analytically [21]. However, computing p(yi | Y−i) becomes prohibitive for large networks, because it involves a summation over 2^G terms. This difficulty makes the exact calculation of the APPs infeasible, so a numerical approximation must be employed instead. Monte Carlo sampling-based algorithms have been reported in the literature for this approximation [21]; they are, however, computationally very expensive and do not scale well with the size of the network. In what follows, we propose a much more efficient solution based on variational Bayesian EM.

The basic idea behind VBSEM is to approximate the intractable APPs of the topology with a tractable distribution q(bi). To do so, we start with a lower bound on the normalizing constant p(yi | Y−i) based on Jensen's inequality:

$$\ln p\bigl(\mathbf{y}_i \mid \mathbf{Y}_{-i}\bigr) = \ln \sum_{\mathbf{b}_i} \int d\boldsymbol{\theta}_i\, p\bigl(\mathbf{y}_i \mid \mathbf{b}_i, \boldsymbol{\theta}_i\bigr)\,p\bigl(\mathbf{b}_i\bigr)\,p\bigl(\boldsymbol{\theta}_i\bigr) \qquad (6)$$

$$\geq \sum_{\mathbf{b}_i} \int d\boldsymbol{\theta}_i\, q\bigl(\boldsymbol{\theta}_i\bigr)\, q\bigl(\mathbf{b}_i\bigr) \left[\ln \frac{p\bigl(\mathbf{b}_i, \mathbf{y}_i \mid \boldsymbol{\theta}_i\bigr)}{q\bigl(\mathbf{b}_i\bigr)} + \ln \frac{p\bigl(\boldsymbol{\theta}_i\bigr)}{q\bigl(\boldsymbol{\theta}_i\bigr)}\right], \qquad (7)$$

where θi = {wi, σi²} and q(θi) is a distribution introduced to approximate the likewise intractable marginal posterior distribution of the parameters p(θi | Y). The lower bound in (7) can serve as a cost function for determining the approximate distributions q(bi) and q(θi); that is, we choose q(bi) and q(θi) such that the lower bound in (7) is maximized. The solution can be obtained by variational derivatives and a coordinate-ascent iterative procedure, and it comprises the following two steps in each iteration.

VBE step:

$$q^{(t+1)}\bigl(\mathbf{b}_i\bigr) = \frac{1}{Z_{\mathbf{b}_i}} \exp\left\{\int d\boldsymbol{\theta}_i\, q^{(t)}\bigl(\boldsymbol{\theta}_i\bigr) \ln p\bigl(\mathbf{b}_i, \mathbf{y}_i \mid \boldsymbol{\theta}_i\bigr)\right\}, \qquad (8)$$
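The Jensen bound (6)-(7) that drives these updates can be checked numerically on a toy problem: for any choice of q, the variational lower bound never exceeds the exact log-evidence obtained by enumeration. The sketch below (our illustration, with arbitrary likelihood values, not the paper's model) demonstrates this for G = 3 binary topology indicators:

```python
import itertools
import math

# Toy check of the variational lower bound. G = 3 binary topology
# indicators b; parameters are assumed integrated out, so p(y | b) is
# just a fixed positive number per configuration (hypothetical values).
G = 3
configs = list(itertools.product([0, 1], repeat=G))
lik = {b: 0.1 + 0.9 * sum(b) / G for b in configs}   # p(y | b), made up
prior = 1.0 / len(configs)                           # uniform p(b)

# Exact log-evidence: ln p(y) = ln sum_b p(y | b) p(b)
log_evidence = math.log(sum(lik[b] * prior for b in configs))

# A factorized approximation q(b) = prod_l q(b_l), with q(b_l = 1) = 0.3.
q1 = 0.3
def q(b):
    return math.prod(q1 if bl else 1.0 - q1 for bl in b)

# Lower bound: sum_b q(b) [ln p(y | b) p(b) - ln q(b)]
elbo = sum(q(b) * (math.log(lik[b] * prior) - math.log(q(b)))
           for b in configs)

assert elbo <= log_evidence   # Jensen's inequality holds for any q
```

Maximizing the bound over q is exactly what the VBE/VBM iteration does; equality is reached only when q matches the true posterior.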
Isabel Tienda Luna et al.
VBM step:

$$q^{(t+1)}\bigl(\boldsymbol{\theta}_i\bigr) = \frac{1}{Z_{\boldsymbol{\theta}_i}}\, p\bigl(\boldsymbol{\theta}_i\bigr) \exp\left\{\sum_{\mathbf{b}_i} q^{(t+1)}\bigl(\mathbf{b}_i\bigr) \ln p\bigl(\mathbf{b}_i, \mathbf{y}_i \mid \boldsymbol{\theta}_i\bigr)\right\}, \qquad (9)$$

where t and t + 1 are iteration numbers and Z_{b_i} and Z_{θ_i} are normalizing constants to be determined. The above procedure is commonly referred to as the variational Bayesian expectation-maximization (VBEM) algorithm [25]. VBEM can be considered a probabilistic version of the popular EM algorithm, in the sense that it learns a distribution instead of finding a point solution as EM does. Clearly, to carry out this iterative approximation, analytical expressions must exist for both the VBE and VBM steps. However, it is difficult to obtain an analytical expression, at least in the VBM step, since the summation is NP-hard. To overcome this problem, we constrain the approximation q(bi) to be a multivariate Gaussian distribution. The Gaussian assumption on the discrete variable bi facilitates the computation in the VBEM algorithm, circumventing the 2^G summations. Although p(bi | Y) is a high-dimensional discrete distribution, the Gaussian approximation guarantees that the approximations fall in the exponential family, and as a result the subsequent computations in the VBEM iterations can be carried out exactly [25]. Specifically, by choosing conjugate priors for both θi and bi, as described in Appendix A, we can show that the calculations in both the VBE and VBM steps can be performed analytically. The detailed derivations are included in Appendix B. Unlike the common VBEM algorithm, which learns only the distributions of the parameters, the proposed algorithm learns the distributions of both structure and parameters. We therefore call it VB structural EM (VBSEM). The VBSEM algorithm for learning the DBNs under study is summarized in Algorithm 1.

The VBSEM algorithm:
(1) Initialization. Initialize the mean and the covariance matrices of the approximate distributions as described in Appendix A.
(2) VBE step (structural learning). Calculate the approximate posterior distributions of the topology q(bi) using (B.1).
(3) VBM step (parameter learning). Calculate the approximate parameter posterior distributions q(θi) using (B.5).
(4) Compute F. Compute the lower bound as described in Appendix A. If F increases, go to (2); otherwise, terminate the algorithm.

Algorithm 1: Summary of the VBSEM algorithm.

When the algorithm converges, we obtain q(bi), a multivariate Gaussian distribution, and q(θi). Based on q(bi), we then need to produce a discrete distribution as a final estimate of p(bi). Direct discretization in the variable space is computationally difficult. Instead, we propose to work with the marginal APPs from model averaging. To this end, we first obtain q(bi(l)), for all l, from q(bi) and then approximate the marginal APPs p(bi(l) | Y), for all l, by

$$p\bigl(b_i(l) = 1 \mid \mathbf{Y}\bigr) = \frac{q\bigl(b_i(l) = 1\bigr)}{q\bigl(b_i(l) = 1\bigr) + q\bigl(b_i(l) = 0\bigr)}. \qquad (10)$$

Instead of the MAP criterion, decisions on bi can then be made in a bitwise fashion based on the marginal APPs. Specifically, we have

$$b_i(l) = \begin{cases} 1 & \text{if } p\bigl(b_i(l) \mid \mathbf{Y}\bigr) \geq \rho, \\ 0 & \text{otherwise}, \end{cases} \qquad (11)$$

where ρ is a threshold. When bi(l) = 1, gene l is a regulator of gene i in the topology of gene i. Meanwhile, the parameters can easily be learned from q(θi) based on the minimum mean-squared-error (MMSE) criterion, and they are

$$\widehat{\mathbf{w}}_i = \mathbf{m}_{w_i}, \qquad \widehat{\sigma}_i^2 = \frac{\beta}{\alpha - 2}, \qquad (12)$$

where m_{w_i}, β, and α are defined in Appendix B according to (B.5).
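The control flow of Algorithm 1 can be sketched schematically as follows. The update rules here are placeholders standing in for the real ones, (B.1) and (B.5); only the alternate-update and lower-bound-termination structure is illustrated:

```python
import numpy as np

# Schematic of Algorithm 1's control flow (placeholder updates, for
# illustration only; the actual VBE/VBM rules are (B.1) and (B.5)).
def vb_structural_em(G, max_iter=50, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    m_b = rng.normal(size=G)      # mean of q(b_i)      (structure)
    m_w = np.zeros(G)             # mean of q(theta_i)  (parameters)
    F_old = -np.inf
    for t in range(max_iter):
        m_b = 0.5 * (m_b + m_w)   # "VBE step" placeholder
        m_w = 0.5 * (m_w + m_b)   # "VBM step" placeholder
        # Surrogate lower bound that increases as the updates settle.
        F = -float(np.sum((m_b - m_w) ** 2)) - 1.0 / (t + 1)
        if F <= F_old + tol:      # F stopped increasing: terminate
            break
        F_old = F
    return m_b, m_w

m_b, m_w = vb_structural_em(G=4)
```

The key design point is that each step maximizes the same lower bound F with one factor held fixed, so F never decreases and can be used directly as the convergence test.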
4. BAYESIAN INTEGRATION OF TWO DATA SETS
A major task in gene network research is to integrate all available data sets about the same network from different sources, so as to improve the confidence of the inference. As indicated before, the values of bi define the parent sets of gene i, and thus the topology of the network. The APPs obtained from the VBSEM algorithm provide us with an avenue for pursuing Bayesian data integration.
We illustrate here an approach for integrating two microarray data sets Y1 and Y2, each produced from an experiment under possibly different conditions. The premise for combining the two data sets is that they are experimental outcomes of the same underlying gene network; that is, the topologies Si, or bi, for all i, are the same in the respective data models. Direct combination of the two data sets at the data level requires many preprocessing steps, including scaling, alignment, and so forth, and these steps introduce noise and potential errors into the original data sets. Instead, we propose to perform the data integration at the topology level. The objective of topology-level data integration is to obtain the APPs of bi from the combined data sets, p(bi | Y1, Y2), and then make inferences on the gene network structure accordingly.
To obtain p(bi | Y1, Y2), we factor it according to Bayes' rule as

$$p\bigl(\mathbf{b}_i \mid \mathbf{Y}^1, \mathbf{Y}^2\bigr) = \frac{p\bigl(\mathbf{Y}^2 \mid \mathbf{b}_i\bigr)\,p\bigl(\mathbf{Y}^1 \mid \mathbf{b}_i\bigr)\,p\bigl(\mathbf{b}_i\bigr)}{p\bigl(\mathbf{Y}^1\bigr)\,p\bigl(\mathbf{Y}^2\bigr)} = \frac{p\bigl(\mathbf{Y}^2 \mid \mathbf{b}_i\bigr)\,p\bigl(\mathbf{b}_i \mid \mathbf{Y}^1\bigr)}{p\bigl(\mathbf{Y}^2\bigr)}, \qquad (13)$$

where p(Y2 | bi) is the marginalized likelihood function of data set 2 and p(bi | Y1) is the APP obtained from data set 1.
The above equation suggests a simple scheme for integrating the two data sets: we start with one data set, say Y1, and calculate the APPs p(bi | Y1); then, treating p(bi | Y1) as the prior distribution, we integrate it with Y2 according to (13). In this way, we obtain the desired APPs p(bi | Y1, Y2) from the combined data sets. To implement this scheme, the APPs of the topology must be computed, and the proposed VBSEM can be applied for the task. This scheme provides a viable and efficient framework for Bayesian data integration.
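The sequential use of (13), in which the posterior from the first data set becomes the prior for the second, can be illustrated with a conjugate Gaussian toy example (our own illustration, not the paper's model; the same principle applies to the APPs of bi):

```python
import numpy as np

# Toy illustration of topology-level data integration: the posterior
# computed from data set 1 is used as the prior when processing data
# set 2. The unknown here is a scalar mean with known noise variance
# (conjugate Gaussian case), so each update has a closed form.
def gaussian_posterior(prior_mean, prior_var, data, noise_var=1.0):
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + np.sum(data) / noise_var)
    return post_mean, post_var

rng = np.random.default_rng(1)
truth = 2.0
Y1 = rng.normal(truth, 1.0, size=20)   # "data set 1"
Y2 = rng.normal(truth, 1.0, size=20)   # "data set 2"

# Stage 1: vague prior -> posterior from Y1.
m1, v1 = gaussian_posterior(0.0, 100.0, Y1)
# Stage 2: posterior from Y1 becomes the prior for Y2.
m12, v12 = gaussian_posterior(m1, v1, Y2)

# The sequential result matches processing the pooled data in one shot.
m_all, v_all = gaussian_posterior(0.0, 100.0, np.concatenate([Y1, Y2]))
assert abs(m12 - m_all) < 1e-9 and abs(v12 - v_all) < 1e-9
```

The final assertion is the point of the scheme: chaining posteriors through the prior is equivalent to a joint analysis of both data sets, while avoiding any data-level preprocessing.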
5. RESULTS

5.1. Test on simulated systems
Table 1: Area under each curve.

Setting   G = 30    G = 100   G = 150   G = 200
AUC       0.8007    0.7253    0.6315    0.5872
5.1.1. Study based on precision-recall curves
In this section, we validate the performance of the proposed VBSEM algorithm using synthetic networks whose characteristics are as realistic as possible. This study was carried out by computing precision-recall curves. In this field it is common to employ ROC analysis to study the performance of a proposed algorithm. However, since genetic networks are sparse, the number of false positives far exceeds the number of true positives; specificity is therefore inappropriate, as even a small deviation from a value of 1 results in a large number of false positives. For this reason, we choose precision-recall curves for evaluating the performance. Precision corresponds to the expected success rate in the experimental validation of the predicted interactions and is calculated as TP/(TP + FP), where TP is the number of true positives and FP the number of false positives. Recall indicates the probability of correctly detecting a true positive and is calculated as TP/(TP + FN), where FN is the number of false negatives. In a good system, precision decreases as recall increases, and the higher the area under the curve, the better the system.
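These two quantities can be computed directly from a thresholded APP matrix; a minimal sketch (our illustration, with made-up matrices; the decision rule mirrors (11)):

```python
import numpy as np

# Precision = TP/(TP+FP) and recall = TP/(TP+FN), computed by comparing
# a thresholded APP matrix against a known true adjacency matrix.
def precision_recall(app, truth, rho):
    pred = app >= rho                   # decision rule (11)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 1.0
    return prec, rec

# Toy 2-gene example (hypothetical APPs and ground truth).
truth = np.array([[0, 1], [0, 0]], dtype=bool)
app = np.array([[0.1, 0.9], [0.3, 0.2]])
assert precision_recall(app, truth, 0.5) == (1.0, 1.0)
assert precision_recall(app, truth, 0.05) == (0.25, 1.0)
```

Sweeping rho from 0 to 1 and collecting the (recall, precision) pairs traces out exactly the curves plotted in Figure 2.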
To accomplish our objective, we simulated 4 networks with 30, 100, 150, and 200 genes, respectively. For each tested network, we collected only 30 time samples per gene, which mimics the realistic small-sample scenario. Regarding the regulation process, each gene had zero, one, two, or three parents, with the number of parents selected randomly for each gene. The weights associated with each regulation process were also chosen randomly, from an interval containing the typical estimated values observed when working with real microarray data. As for the nature of the regulation, the signs of the weights were selected randomly as well. Finally, the data values of the network outputs were calculated using the linear Gaussian model proposed in (1). These values were taken after the system had reached stationarity, and they lie in the range of observations corresponding to real microarray data.
In Figure 2, the precision-recall curves are plotted for different settings. To construct these curves, we started by setting a threshold ρ for the APPs. This threshold ρ lies between 0 and 1 and was used as in (11): for each possible regulatory relationship between two genes, if its APP is greater than ρ, the link is considered to exist; otherwise, it is not. We calculated the precision and the recall for each selected threshold between 0 and 1. The results are plotted in blue for G = 30, black for G = 100, red for G = 150, and green for G = 200. As expected, the performance degrades as the number of genes increases. One measure of this degradation is given in Table 1, which lists the area under each curve (AUC).

Figure 2: Precision-recall curves for G = 30 (blue), G = 100 (black), G = 150 (red), and G = 200 (green).
To further quantify the performance of the algorithms, we calculated the F-score, an evaluation measure that combines precision and recall:

$$F_\alpha = \frac{1}{\alpha(1/\text{precision}) + (1 - \alpha)(1/\text{recall})}, \qquad (14)$$
where α is a weighting factor: a large α weights precision more heavily, whereas a small α weights recall more heavily. In general, α = 0.5 is used, giving precision and recall equal importance; Fα is then called the harmonic mean. This value equals 1 when both precision and recall are 100%, and approaches 0 when either of them is close to 0. Figure 3 depicts the value of the harmonic mean as a function of the APP threshold ρ for the VBSEM algorithm. As can be seen, the performance of the algorithm for G = 30 is better than for any other setting. However, there is almost no performance degradation between the curve for G = 30 and the one for G = 100 in the APP threshold interval from 0.5 to 0.7. The same observation can be obtained for
curves G = 150 and G = 200 in the interval from 0.5 to 0.6. In general, in the interval from 0.5 to 0.7, the degradation of the algorithm performance is small for reasonable harmonic-mean values (i.e., > 0.5).

Figure 3: Harmonic mean as a function of the APP threshold (curves for G = 30, 100, 150, and 200).

To demonstrate the scalability of the VBSEM algorithm, we studied the harmonic mean for simulated networks characterized by the following settings: (G1 = 1000, N1 = 400), (G2 = 500, N2 = 200), (G3 = 200, N3 = 80), and (G4 = 100, N4 = 40). Note that the ratio Gi/Ni was kept constant in order to maintain the proportion between the number of nodes in the network and the amount of information (samples). The results are plotted in Figure 4, where we represent the harmonic mean as a function of the APP threshold. The closeness of the curves at an APP threshold of 0.5 supports the good scalability of the proposed algorithm. We also recorded the computation time of VBSEM for each network and list it in Table 2. The results were obtained on a standard PC with a 3.4 GHz processor and 2 GB of RAM.

Table 2: Computation time for different sizes of networks.

Setting               G = 100    G = 200     G = 500    G = 1000
Computation time (s)  19.2871    206.5132    889.8120   12891.8732

Figure 4: Harmonic mean as a function of the APP threshold, demonstrating the scalability of the algorithm (curves for G = 100, N = 40; G = 200, N = 80; G = 500, N = 200; G = 1000, N = 400).

5.1.2. Comparison with Gibbs sampling

In this subsection, we tested the VBSEM algorithm on a simulated network in order to compare it with Gibbs sampling [26]. We simulated a network of 20 genes and generated their expressions based on the proposed DBNs and the linear Gaussian regulatory model with Gaussian-distributed weights. We focused on a particular gene in the simulated networks, which was assumed to have two parents, and compared the performance of VBSEM and Gibbs sampling in recovering the true networks. In Table 3, we present the number of errors in 100 Monte Carlo tests. For the Gibbs sampling, 500 Monte Carlo samples were used. We tested the algorithms under different settings; in the table, N stands for the number of time samples and G for the number of genes. As can be seen, VBSEM outperforms Gibbs sampling even in an underdetermined system. Since VBSEM has much lower complexity than Gibbs sampling, the proposed VBSEM algorithm is better suited for uncovering large networks.

Table 3: Number of errors in 100 Monte Carlo trials.

No. of errors                   G = 5, N = 5    G = 5, N = 10
VBEM (no. of iterations = 10)   62              1
Gibbs sampling (500 samples)    126             5
5.2. Test on real data
We applied the proposed VBSEM algorithm to cDNA microarray data sets of 62 genes in the yeast cell cycle reported in [27, 28]. Data set 1 [27] contains 18 samples evenly measured over a period of 119 minutes, with a synchronization treatment based on the α mating factor. Data set 2 [28] contains 17 samples evenly measured over 160 minutes, with a temperature-sensitive CDC15 mutant used for synchronization. For each gene, the data are represented as log2{(expression at time t)/(expression in a mixture of control cells)}. Missing
Figure 5: Inferred network using the α data set of [27]. Solid lines denote links with weights between 0 and 0.4, dotted lines weights between 0.4 and 0.8, and dash-dot lines weights between 0.8 and 1.5; red indicates downregulation.
values exist in both data sets, indicating that the signal in the corresponding spot was not strong enough. In this case, simple spline interpolation was used to fill in the missing data. Note that the differing time step in each data set can be neglected, since we assume a time-homogeneous regulating process.

When validating the results, the main objective is to determine the level of confidence of the connections in the inferred network. The underlying intuition is that we should be more confident in features that would still be inferred when we perturb the data. Intuitively, this can be done with multiple independent data sets generated from repeated experiments. However, in this case, as in many other practical scenarios, only one or very few data replicates are available, and the sample size in each data set is small. The question is then how to produce the perturbed data from the limited available data sets while maintaining the underlying statistical features of the data. One way to achieve this is to apply the bootstrap method [29]. By bootstrapping the data set, we can generate multiple pseudo-independent data sets, each of which maintains the statistics of the original data. Bootstrap methods have been used extensively for static data sets. When applied to time-series data, an additional requirement is to maintain, as much as possible, the inherent time dependency between samples in the bootstrapped data sets. This is important, since the proposed DBN modeling and the VBSEM algorithm exploit this time dependency. Approaches to handling time-dependent samples have been studied in the bootstrap literature, and we adopt the popular moving block bootstrap method [30]. In the moving block bootstrap, we create pseudo-data sets from the original data set by first randomly sampling blocks of sub-data and then putting them together to generate a new data set. The detailed steps can be summarized as follows.
(1) Select the length of the block, L.
(2) Create the set of n = N − L + 1 possible blocks from the data. These blocks are created in the following way:

$$\mathbf{Z}_i = \mathbf{Y}(:, i : i + L - 1). \qquad (15)$$

(3) Randomly sample, with replacement, N/L blocks from the set of blocks {Z_i}, i = 1, ..., N − L + 1.
(4) Create the pseudo-data set by putting all the sampled blocks together, and trim its size to N by removing the extra data samples.
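The four steps above can be sketched as follows (a minimal version for a G × N expression matrix; the function and variable names are ours):

```python
import numpy as np

def moving_block_bootstrap(Y, L, rng):
    """One pseudo-data set from a G x N matrix Y using blocks of length L."""
    G, N = Y.shape
    n = N - L + 1                                  # number of possible blocks
    blocks = [Y[:, i:i + L] for i in range(n)]     # Z_i = Y(:, i:i+L-1)
    k = int(np.ceil(N / L))                        # blocks needed to cover N
    picks = rng.integers(0, n, size=k)             # sample with replacement
    pseudo = np.concatenate([blocks[i] for i in picks], axis=1)
    return pseudo[:, :N]                           # trim back to N samples

rng = np.random.default_rng(0)
Y = np.arange(40, dtype=float).reshape(4, 10)      # toy 4-gene, 10-sample set
Yb = moving_block_bootstrap(Y, L=3, rng=rng)
assert Yb.shape == Y.shape
```

Because whole contiguous blocks are resampled, dependencies at lags shorter than L survive in each pseudo-data set, which is the property the DBN modeling relies on.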
A key issue in the moving block bootstrap is determining the block length L. The idea is to choose a block length L large enough that observations more than L time units apart are nearly independent. Many theoretical and practical results have been developed on choosing the block length; however, they rely on large sample sizes and are computationally intensive. Here, we adopt an easy and practical approach: we compute the autocorrelation function (ACF) of the data and choose the block length as the delay at which the ACF becomes smallest. The ACF in this case may not be fully reliable, but it provides at least some measure of independence.
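The ACF heuristic for choosing L can be sketched as follows (our illustration; the lag with the smallest absolute sample autocorrelation is picked):

```python
import numpy as np

def choose_block_length(x, max_lag=10):
    """Pick the lag at which the sample ACF is smallest in magnitude."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    acf = np.array([np.dot(x[:-k], x[k:]) / denom
                    for k in range(1, max_lag + 1)])
    return int(np.argmin(np.abs(acf))) + 1          # lags are 1-based

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=200))                 # a correlated toy series
L = choose_block_length(x, max_lag=20)
assert 1 <= L <= 20
```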
In Figure 5, we show the inferred network when the data
set from [27] was considered and the moving block bootstrap
Figure 6: Inferred network using the CDC28 data set of [28].
was used to resample the observations. The total number of resampled data sets was 500. In this plot, we only drew links with an estimated APP higher than 0.6. We used solid lines to represent links with weights between 0 and 0.4, dotted lines for links with weights between 0.4 and 0.8, and dash-dot lines for those with weights higher than 0.8. Red is used to represent downregulation. A circle enclosing several genes means that the corresponding proteins compose a complex; the edges inside these circles are considered correct, since genes inside the same circle coexpress with some delay. In Table 4, we show the connections with some of the highest APPs found from the α data set of [27]. We compared them with the links in the KEGG pathway [31], and some of the links inferred by the proposed algorithm are predicted in it. We considered a connection as predicted when the parent is upstream of the child in KEGG. Furthermore, the proposed algorithm is also capable of predicting, through the weight, the nature of the relationship represented by a link. For example, the connection between CDC5 and CLB1 has a weight equal to 0.6568, which is positive, so it represents an upregulation, as predicted in the KEGG pathway. Another example is the connection from CLB1 to CDC20; its APP is 0.6069 and its weight is 0.4505, again positive, so it stands for an upregulation, as predicted by the KEGG pathway.
In Figure 6, we depict the inferred network when the CDC28 data set of [28] was used. A moving block bootstrap was again employed, with the number of bootstrap data sets equal to 500. As before, the links presented in this plot are those with an APP higher than 0.6. In Table 5, we show some of the connections with the highest APPs. We also compared them with the links in the KEGG pathway, and some of the links inferred by the proposed algorithm are also predicted in it. Furthermore, the proposed algorithm is
Table 4: Links with the highest APPs obtained from the α data set of [27].

From   To      APPs    Comparison with KEGG
CDC5   CLB1    0.6558  Predicted
CLB6   CDC45   0.6562  Predicted
CLB6   SMC3    0.7991  Not predicted
SWI4   SMC3    0.6738  Not predicted
CLB6   HSL1    0.6989  Not predicted
CLB6   CLN1    0.7044  Predicted the other way round
CLN1   CLN3    0.6989  Predicted the other way round
PHO5   SIC1    0.6735  Not predicted
CLB6   RAD53   0.6974  Not predicted
CDC5   CLB2    0.6566  Predicted
FUS3   GIN4    0.6495  Not predicted
PHO5   PHO5    0.6441  Not predicted
CLB2   CDC5    0.6390  Predicted the other way round
CLB6   SWI4    0.6336  Predicted the other way round
also capable of predicting, through the weight, the nature of the relationship represented by a link. For example, the connection between TEM1 and DDC1 has a weight equal to −0.3034; the negative sign represents a downregulation, as predicted in the KEGG pathway. Another example is the connection from CLB2 to CDC20: its APP is 0.6069 and its weight is 0.7763, this time positive, so it stands for an upregulation, as predicted by the KEGG pathway.
Model validation
To validate the proposed linear Gaussian model, we tested
the normality of the prediction errors. If the prediction errors
Figure 7: Histograms of the prediction errors for genes DDC1, MEC3, and GRF10 in the α data set.
Figure 8: Histograms of the prediction errors for genes DDC1, MEC3, and GRF10 in the CDC28 data set.
yield Gaussian distributions as in the linear model (1), this supports the feasibility of the linear Gaussian assumption on the data. Given the estimates b̂_i and ŵ_i of gene i, the prediction error e_i is obtained as

$$\mathbf{e}_i = \mathbf{R}\widehat{\mathbf{W}}_i\widehat{\mathbf{b}}_i - \mathbf{y}_i, \qquad (16)$$

where Ŵ_i = diag(ŵ_i) and R = TY, with

$$\mathbf{T} = \begin{pmatrix} 1 & & & 0 \\ & \ddots & & \vdots \\ & & 1 & 0 \end{pmatrix} \qquad (17)$$

being an N × (N + 1) matrix.

We show in Figures 7 and 8 examples of the histograms of the prediction errors for genes DDC1, MEC3, and GRF10 in the α and CDC28 data sets. These histograms exhibit the bell shape expected for the distribution of the prediction errors, and this pattern is consistent over all the genes. To examine normality, we performed the Kolmogorov-Smirnov goodness-of-fit hypothesis test (KSTEST) on the prediction errors of each gene. All the prediction errors pass the normality test at the significance level of 0.05, which demonstrates the validity of the proposed linear Gaussian assumption.
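The normality check can be reproduced schematically as follows (a sketch with simulated stand-in residuals, not the paper's actual per-gene errors; `scipy.stats.kstest` is the standard KS implementation):

```python
import numpy as np
from scipy import stats

# Kolmogorov-Smirnov normality check of prediction errors. The residuals
# here are simulated stand-ins; in the paper, e_i = R W_i b_i - y_i for
# each gene i would be used instead.
rng = np.random.default_rng(0)
errors = rng.normal(0.0, 0.5, size=100)

# Standardize the residuals, then test against the standard normal CDF.
z = (errors - errors.mean()) / errors.std(ddof=1)
stat, p_value = stats.kstest(z, 'norm')

# Normality is not rejected at the 0.05 level when p_value >= 0.05.
normal = bool(p_value >= 0.05)
```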
Results validation

To systematically present the results, we treated the KEGG map as the ground truth and calculated the statistics of the results. Even though there are still uncertainties, the KEGG map represents up-to-date knowledge about the dynamics of gene interaction, and it is reasonable to use it as a benchmark for results validation. In Tables 6 and 7, we list the numbers of true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn) for the α and CDC28 data sets, respectively. We also varied the
Figure 9: Inferred network obtained by integrating the α and CDC28 data sets. Solid lines denote links with weights between 0 and 0.4, dotted lines weights between 0.4 and 0.8, and dash-dot lines weights between 0.8 and 1.5; red indicates downregulation.
Table 5: Links with the highest APPs obtained from the CDC28 data set of [28].

From   To      APPs    Comparison with KEGG
CLB1   CDC20   0.7876  Predicted the other way round
BUB1   ESP1    0.6678  Predicted
BUB2   CDC5    0.7145  Predicted
SIC1   GIN4    0.6700  Not predicted
SMC3   HSL1    0.6689  Not predicted
CLN1   CLN3    0.7723  Predicted the other way round
FAR1   SIC1    0.6763  Predicted
CLN1   SIC1    0.6640  Predicted
CDC5   PCL1    0.7094  Not predicted
DBF2   FAR1    0.7003  Not predicted
SIC1   CLN1    0.8174  Predicted the other way round
PBS1   MBP1    0.7219  Not predicted
FAR1   MET30   0.873   Not predicted
CLB2   DBF2    0.7172  Predicted the other way round
APP threshold for the decisions (the thresholds are listed in the first column of the tables). A general observation is that we do not have high confidence in the inference results, since a high tp cannot be achieved at a low fp. Since the VBSEM algorithm has been tested with acceptable performance on simulated networks, and the model has also been validated, this may well indicate that the two data sets were not very informative about the causal relationships between genes.

Table 6: The α data set.

APPs threshold   tp    tn     fp     fn
0.4              411   177    3247   9
0.5              58    3116   308    362
0.6              8     3405   19     412

Table 7: The CDC28 data set.

APPs threshold   tp    tn     fp     fn
0.4              405   163    3261   15
0.5              74    3019   405    346
0.6              14    3363   61     406
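For reference, the counts in Table 6 translate directly into precision and recall at each threshold; a small computation over the α-data-set numbers listed above:

```python
# Precision and recall implied by the Table 6 counts (alpha data set).
# Each entry maps an APP threshold to its (tp, fp, fn) counts.
counts = {0.4: (411, 3247, 9), 0.5: (58, 308, 362), 0.6: (8, 19, 412)}
for rho, (tp, fp, fn) in counts.items():
    prec = tp / (tp + fp)         # TP / (TP + FP)
    rec = tp / (tp + fn)          # TP / (TP + FN)
    print(rho, round(prec, 3), round(rec, 3))
# prints:
# 0.4 0.112 0.979
# 0.5 0.158 0.138
# 0.6 0.296 0.019
```

The numbers make the trade-off concrete: raising the threshold improves precision but collapses recall, which is the low-confidence regime the text describes.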
Data integration
In order to improvethe accuracy of the inference, we applied
the Bayesian integration scheme described in Section 4 to
combine the two data sets, trying to use information provided from both data sets to improve the inference confidence. The Bayesian integration includes two stages. In the
first stage, the proposed VBSEM algorithm is run on the data
set 1 that contains larger number of samples. In the second
Table 8: Links with the highest APPs obtained from the integrated data set.

From    To      APPs    Comparison with KEGG
CLB1    CDC20   0.7969  Predicted the other way round
CLB2    CDC5    0.6898  Predicted the other way round
CDC6    CLB6    0.7486  Predicted the other way round
HSL7    CLB1    0.6878  Predicted
CDC5    CLB1    0.7454  Predicted
CLB2    CLB1    0.6795  Predicted
CLB6    HSL1    0.7564  Not predicted
RAD17   CLN3    0.7324  Not predicted
FAR1    SIC1    0.7329  Predicted
FAR1    MET30   0.7742  Not predicted
MET30   RAD9    0.7534  Not predicted
CDC5    CLB2    0.7033  Predicted
CLB6    CDC45   0.6299  Predicted
CLB6    CLN1    0.6912  Predicted the other way round
CLN1    CLN3    0.8680  Predicted the other way round
BUB1    ESP1    0.6394  Predicted
BUB2    CDC5    0.6142  Predicted
SIC1    CLN1    0.6793  Predicted the other way round
stage, the APPs of the latent variables bi obtained in the first stage are used as the priors in the VBSEM algorithm run on the second data set, from [28]. In Figure 9, we plot the inferred network obtained from the integration process. We also performed bootstrap resampling in the integration process: we first obtained a sampled data set from data set 1 and then used its calculated APPs as the prior for integrating a bootstrap-sampled data set from data set 2.
In Table 8, we present the links with the highest APPs inferred by integrating the data sets. We again compared these links with the ones shown in the KEGG pathway map. As can be seen, the proposed algorithm is able to predict many relationships. For instance, the link between CDC5 and CLB1 is predicted correctly by our algorithm, with a posterior probability of 0.7454. The weight associated with this connection is −0.1245, which is negative, so there is a downregulation relationship, confirmed in the KEGG pathway. We also observed improvements from integrating the two data sets. Regarding the link between CDC5 and CLB1, if we compare the result obtained from the integrated data set with that shown in Table 4, we see that this relationship was not predicted when using the CDC28 data set 2 alone. Even though this link was predicted from the α data set, its APP was lower, and the weight was positive, indicating an inconsistency with the KEGG map; the inconsistency has been fixed by data integration. As another example, the relationship between HSL7 and CLB1 was predicted from the integrated data sets but not from the CDC28 data set. This link was predicted when only the α data set was used, but its APP was then 0.6108, lower than the APP obtained with integration. A similar phenomenon can be observed for the link from FAR1 to SIC1.
Table 9: Integrated data set.

APPs threshold   tp    tn     fp     fn
0.4              252   1655   1769   168
0.5              50    3175   249    370
0.6              17    3374   50     403
We also list the statistics of the results, compared with the KEGG map, in Table 9. Compared with Tables 6 and 7, data integration almost halved the fp at the thresholds 0.4 and 0.5 and also reduced the fp at 0.6, while tp increased at the 0.6 threshold. This implies increased confidence in the results after data integration, which demonstrates the advantages of Bayesian data integration.
Another way of looking at the benefits of the integration process is by examining the lower bound of the VBSEM. If data integration benefits the performance of the algorithm, we should see higher lower-bound values than with a single data set: if the data contain more information after integration, the lower bound should be closer to the quantity it approximates. In Figure 10, we plot the evolution of the lower bound over the VBSEM iterations for each gene, for the α data set, the CDC28 data set, and the integrated data sets. The increase of the lower bound when the integrated data sets were used supports the advantages of Bayesian data integration.
6. CONCLUSION
We investigated the DBN modeling of cell-cycle GRNs and the VBSEM learning of the network topology. The proposed VBSEM solution is able to estimate the APPs of the topology, and we showed how these estimated APPs can be used in a Bayesian data integration strategy. The low complexity of the VBSEM algorithm shows its potential for working with large networks. We also showed how the bootstrap method can be used to obtain the confidence of the inferred networks; this approach has proved very useful in the case of small data size, a common situation in computational biology research.
APPENDICES

A. CONJUGATE PRIORS OF TOPOLOGY AND PARAMETERS
We choose conjugate priors for the topology and the parameters:

$$p\bigl(\mathbf{b}_i\bigr) = \mathcal{N}\bigl(\mathbf{b}_i \mid \boldsymbol{\mu}_0, \mathbf{C}_0\bigr), \qquad (A.1)$$

$$p\bigl(\boldsymbol{\theta}_i\bigr) = p\bigl(\mathbf{w}_i, \sigma_i^2\bigr) = p\bigl(\mathbf{w}_i \mid \sigma_i^2\bigr)\,p\bigl(\sigma_i^2\bigr) = \mathcal{N}\bigl(\mathbf{w}_i \mid \boldsymbol{\mu}_{w_i}, \sigma_i^2\mathbf{I}_G\bigr)\, \mathrm{IG}\!\left(\sigma_i^2 \,\Big|\, \frac{\gamma_0}{2}, \frac{\nu_0}{2}\right), \qquad (A.2)$$
where μ0 and C0 are the mean and the covariance of the prior probability density p(bi). In general, μ_{w_i} and μ0 are simply set to zero vectors, while ν0 and γ0 are set to small positive real values. Moreover, the covariance matrix C0 needs to be chosen carefully; it is usually set to a diagonal matrix with a relatively large constant in each diagonal element.

Figure 10: Evolution of the VBSEM lower bound over the iterations, for the α data set, the CDC28 data set, and the integrated data set.
These priors satisfy the conditions for conjugate-exponential (CE) models [25]. For CE models, formulae exist in [25] for analytically solving the integrals in the VBE and VBM steps.
B. DERIVATION OF VBE AND VBM STEPS

Let us start with the VBE step. Suppose that q(θi) obtained in the previous VBM step follows a Gaussian-inverse-gamma distribution with the expression (B.5). The VBE step calculates the approximation of the APPs of the topology p(bi). By applying the theorems of the CE model [25], q(bi) can be shown to have the following expression:

$$q\bigl(\mathbf{b}_i\bigr) = \mathcal{N}\bigl(\mathbf{b}_i \mid \mathbf{m}_{b_i}, \boldsymbol{\Sigma}_{b_i}\bigr), \qquad (B.1)$$

where

$$\mathbf{m}_{b_i} = \boldsymbol{\Sigma}_{b_i}\bigl(\mathbf{C}_0^{-1}\boldsymbol{\mu}_0 + \mathbf{f}\bigr), \qquad \boldsymbol{\Sigma}_{b_i} = \bigl(\mathbf{C}_0^{-1} + \mathbf{D}\bigr)^{-1}, \qquad (B.2)$$

with

$$\mathbf{D} = \mathbf{B} \otimes \Bigl(\mathbf{m}_{w_i}\mathbf{m}_{w_i}^{T}\bigl\langle\sigma_i^{-2}\bigr\rangle_{q(\theta_i)} + \mathbf{A}^{-1}\Bigr), \qquad \mathbf{f}^{T} = \mathbf{y}_i^{T}\mathbf{R}\,\mathrm{diag}\bigl(\mathbf{m}_{w_i}\bigr)\bigl\langle\sigma_i^{-2}\bigr\rangle_{q(\theta_i)}, \qquad (B.3)$$

B = R^T R, and R = TY, with

$$\mathbf{T} = \begin{pmatrix} 1 & & & 0 \\ & \ddots & & \vdots \\ & & 1 & 0 \end{pmatrix} \qquad (B.4)$$

being an N × (N + 1) matrix.

We now turn to the VBM step, in which we compute q(θi). Again, from the CE model and q(bi) obtained in (B.1), we have

$$q\bigl(\boldsymbol{\theta}_i\bigr) = \mathcal{N}\bigl(\mathbf{w}_i \mid \mathbf{m}_{w_i}, \boldsymbol{\Sigma}_{w_i}\bigr)\, \mathrm{IG}\!\left(\sigma_i^2 \,\Big|\, \frac{\alpha}{2}, \frac{\beta}{2}\right), \qquad (B.5)$$

where

$$\mathbf{m}_{w_i} = \bigl(\mathbf{I}_G + \mathbf{K}\bigr)^{-1}\mathbf{M}_x\mathbf{R}^{T}\mathbf{y}_i, \qquad \boldsymbol{\Sigma}_{w_i} = \sigma_i^2\bigl(\mathbf{I}_G + \mathbf{K}\bigr)^{-1}, \qquad \alpha = N(\eta + 1) - G - 2, \qquad \beta = -c, \qquad (B.6)$$

with

$$\mathbf{M}_x = \mathrm{diag}\bigl(\mathbf{m}_{b_i}\bigr), \qquad \mathbf{K} = \mathbf{B} \otimes \bigl(\boldsymbol{\Sigma}_{b_i} + \mathbf{m}_{b_i}\mathbf{m}_{b_i}^{T}\bigr), \qquad \mathbf{A} = \mathbf{I}_G + \mathbf{K}, \qquad c = \mathbf{y}_i^{T}\mathbf{R}\mathbf{M}_x\mathbf{m}_{w_i} - \mathbf{y}_i^{T}\mathbf{y}_i - \nu_0, \qquad (B.7)$$

and η is a hyperparameter of the parameter prior p(θi) based on the CE models (A.2).
Computation of the lower bound F
The convergence of the VBEM algorithm is tested using a
lower bound of ln p(yi ). In this paper, we use F to denote
this lower bound and we calculate it using the newest q(bi )
and q(θ i ) obtained in the iterative process. F can be written
more succinctly using the definition of the KL divergence. Let
us first review the definition of the KL divergence and then
derive an analytical expression for F .
The KL divergence measures the difference between two
probability distributions and it is also termed relative entropy. Thus, using this definition we can write the difference
between the real and the approximate distributions in the following way:
KL( q(b_i) ∥ p(b_i, y_i | θ_i) ) = − ∫ db_i q(b_i) ln [ p(b_i, y_i | θ_i) / q(b_i) ],
KL( q(θ_i) ∥ p(θ_i) ) = − ∫ dθ_i q(θ_i) ln [ p(θ_i) / q(θ_i) ].    (C.1)
And finally, the lower bound F can be written in terms of the previous definitions as

F = ∫ dθ_i q(θ_i) ∫ db_i q(b_i) [ ln ( p(b_i, y_i | θ_i) / q(b_i) ) + ln ( p(θ_i) / q(θ_i) ) ]
  = − ∫ dθ_i q(θ_i) KL( q(b_i) ∥ p(b_i, y_i | θ_i) ) − KL( q(θ_i) ∥ p(θ_i) ).    (C.2)
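The role of the KL terms in (C.2) can be checked numerically. The sketch below is our own illustration, not part of the paper (the function name is ours): it evaluates the closed-form KL divergence between two univariate Gaussians, whose nonnegativity is exactly what makes F a lower bound on ln p(y_i).

```python
import numpy as np

def kl_gaussian(m_q, s2_q, m_p, s2_p):
    """KL(q || p) for univariate Gaussians q = N(m_q, s2_q) and p = N(m_p, s2_p)."""
    return 0.5 * (np.log(s2_p / s2_q) + (s2_q + (m_q - m_p) ** 2) / s2_p - 1.0)

# KL vanishes iff q and p coincide and is positive otherwise, so subtracting
# KL terms from ln p(y) always yields a lower bound.
assert kl_gaussian(0.0, 1.0, 0.0, 1.0) == 0.0
assert kl_gaussian(1.0, 1.0, 0.0, 2.0) > 0.0
```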
ACKNOWLEDGMENTS
Yufei Huang is supported by an NSF Grant CCF-0546345.
Also, M. Carmen Carrion Perez thanks MCyT for funding under project TEC 2004-06096-C03-02/TCM.
REFERENCES
[1] H. Kitano, “Looking beyond the details: a rise in system-oriented approaches in genetics and molecular biology,” Current Genetics, vol. 41, no. 1, pp. 1–10, 2002.
[2] P. D’haeseleer, S. Liang, and R. Somogyi, “Genetic network inference: from co-expression clustering to reverse engineering,”
Bioinformatics, vol. 16, no. 8, pp. 707–726, 2000.
[3] P. Brazhnik, A. de la Fuente, and P. Mendes, “Gene networks:
how to put the function in genomics,” Trends in Biotechnology,
vol. 20, no. 11, pp. 467–472, 2002.
[4] N. Friedman, “Inferring cellular networks using probabilistic
graphical models,” Science, vol. 303, no. 5659, pp. 799–805,
2004.
[5] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, “Probabilistic Boolean networks: a rule-based uncertainty model for
gene regulatory networks,” Bioinformatics, vol. 18, no. 2, pp.
261–274, 2002.
[6] X. Zhou, X. Wang, and E. R. Dougherty, “Construction
of genomic networks using mutual-information clustering
and reversible-jump Markov-chain-Monte-Carlo predictor
design,” Signal Processing, vol. 83, no. 4, pp. 745–761, 2003.
[7] A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young,
“Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks,” in
Proceedings of the 6th Pacific Symposium on Biocomputing (PSB
’01), pp. 422–433, The Big Island of Hawaii, Hawaii, USA, January 2001.
[8] E. J. Moler, D. C. Radisky, and I. S. Mian, “Integrating naive
Bayes models and external knowledge to examine copper and
iron homeostasis in S. cerevisiae,” Physiological Genomics, vol. 4,
no. 2, pp. 127–135, 2000.
[9] E. Segal, Rich probabilistic models for genomic data, Ph.D. thesis, Stanford University, Stanford, Calif, USA, 2004.
[10] H. de Jong, “Modeling and simulation of genetic regulatory
systems: a literature review,” Journal of Computational Biology,
vol. 9, no. 1, pp. 67–103, 2002.
[11] Z. Bar-Joseph, “Analyzing time series gene expression data,”
Bioinformatics, vol. 20, no. 16, pp. 2493–2503, 2004.
[12] N. Simonis, S. J. Wodak, G. N. Cohen, and J. van Helden,
“Combining pattern discovery and discriminant analysis to
predict gene co-regulation,” Bioinformatics, vol. 20, no. 15, pp.
2370–2379, 2004.
[13] K. Murphy and S. Mian, “Modelling gene expression data using dynamic Bayesian networks,” Tech. Rep., Computer Science Division, University of California, Berkeley, Calif, USA,
1999.
[14] N. Friedman, M. Linial, I. Nachman, and D. Pe’er, “Using
Bayesian networks to analyze expression data,” Journal of Computational Biology, vol. 7, no. 3-4, pp. 601–620, 2000.
[15] R. J. P. van Berlo, E. P. van Someren, and M. J. T. Reinders,
“Studying the conditions for learning dynamic Bayesian networks to discover genetic regulatory networks,” Simulation,
vol. 79, no. 12, pp. 689–702, 2003.
[16] M. J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D. L.
Wild, “A Bayesian approach to reconstructing genetic regulatory networks with hidden factors,” Bioinformatics, vol. 21,
no. 3, pp. 349–356, 2005.
[17] B.-E. Perrin, L. Ralaivola, A. Mazurie, S. Bottani, J. Mallet,
and F. d’Alché-Buc, “Gene networks inference using dynamic
Bayesian networks,” Bioinformatics, vol. 19, supplement 2, pp.
ii138–ii148, 2003.
[18] S. Y. Kim, S. Imoto, and S. Miyano, “Inferring gene networks
from time series microarray data using dynamic Bayesian networks,” Briefings in Bioinformatics, vol. 4, no. 3, pp. 228–235,
2003.
[19] F. Ferrazzi, R. Amici, P. Sebastiani, I. S. Kohane, M. F. Ramoni, and R. Bellazzi, “Can we use linear Gaussian networks
to model dynamic interactions among genes? Results from a
simulation study,” in Proceedings of IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS
’06), pp. 13–14, College Station, Tex, USA, May 2006.
[20] X. Wang and H. V. Poor, Wireless Communication Systems: Advanced Techniques for Signal Reception, Prentice Hall PTR, Englewood Cliffs, NJ, USA, 2004.
[21] J. Wang, Y. Huang, M. Sanchez, Y. Wang, and J. Zhang, “Reverse engineering yeast gene regulatory networks using graphical models,” in Proceedings of IEEE International Conference
on Acoustics, Speech, and Signal Processing (ICASSP ’06), vol. 2,
pp. 1088–1091, Toulouse, France, May 2006.
[22] D. Husmeier, “Sensitivity and specificity of inferring genetic
regulatory interactions from microarray experiments with dynamic Bayesian networks,” Bioinformatics, vol. 19, no. 17, pp.
2271–2282, 2003.
[23] K. P. Murphy, Dynamic Bayesian networks: representation, inference and learning, Ph.D. thesis, University of California,
Berkeley, Calif, USA, 2004.
[24] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice-Hall, Englewood Cliffs, NJ, USA, 1997.
[25] M. J. Beal, Variational algorithms for approximate Bayesian inference, Ph.D. thesis, The Gatsby Computational Neuroscience
Unit, University College London, London, UK, May 2003.
[26] S. P. Brooks, “Markov chain Monte Carlo method and its application,” Journal of the Royal Statistical Society: Series D, The
Statistician, vol. 47, no. 1, pp. 69–100, 1998.
[27] P. T. Spellman, G. Sherlock, M. Q. Zhang, et al., “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular
Biology of the Cell, vol. 9, no. 12, pp. 3273–3297, 1998.
[28] R. J. Cho, M. J. Campbell, E. A. Winzeler, et al., “A genomewide transcriptional analysis of the mitotic cell cycle,” Molecular Cell, vol. 2, no. 1, pp. 65–73, 1998.
[29] B. Efron and R. Tibshirani, An Introduction to the Bootstrap,
Monographs on Statistics and Applied Probability, no. 57,
Chapman & Hall, New York, NY, USA, 1993.
[30] S. N. Lahiri, Resampling Methods for Dependent Data,
Springer, New York, NY, USA, 2003.
[31] “KEGG: Kyoto Encyclopedia of Genes and Genomes,” http://www.genome.jp/kegg/.
Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 51947, 12 pages
doi:10.1155/2007/51947
Research Article
Inferring Time-Varying Network Topologies from
Gene Expression Data
Arvind Rao,1,2 Alfred O. Hero III,1,2 David J. States,2,3 and James Douglas Engel4
1 Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109-2122, USA
2 Bioinformatics Graduate Program, Center for Computational Medicine and Biology, School of Medicine, University of Michigan, Ann Arbor, MI 48109-2218, USA
3 Department of Human Genetics, School of Medicine, University of Michigan, Ann Arbor, MI 48109-0618, USA
4 Department of Cell and Developmental Biology, School of Medicine, University of Michigan, Ann Arbor, MI 48109-2200, USA
Received 24 June 2006; Revised 4 December 2006; Accepted 17 February 2007
Recommended by Edward R. Dougherty
Most current methods for gene regulatory network identification lead to the inference of steady-state networks, that is, networks prevalent over all times, a hypothesis which has been challenged. There has been a need to infer and represent networks in a dynamic, that is, time-varying, fashion, in order to account for different cellular states affecting the interactions amongst genes. In this work, we present an approach, regime-SSM, to understand gene regulatory networks within such a dynamic setting. The approach uses a clustering method based on the underlying dynamics, followed by system identification using a state-space model for each learnt cluster, to infer a network adjacency matrix. We finally present our results on the mouse embryonic kidney dataset as well as the T-cell activation-based expression dataset and demonstrate conformity with reported experimental evidence.
Copyright © 2007 Arvind Rao et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. INTRODUCTION
Most methods of graph inference work very well on stationary time-series data, in that the generating structure for the time series does not exhibit switching. In [1, 2], a useful method to learn network topologies from T-cell gene expression data using linear state-space models (SSMs) has been presented. However, it is known that regulatory pathways do not persist over all time. An important recent finding in which this is seen to be true comes from the examination of regulatory networks during the yeast cell cycle [3], wherein topologies change depending on the underlying (endogenous or exogenous) cell condition. This brings out a need to identify the variation of the “hidden states” regulating gene network topologies and to incorporate them into the network inference framework [4]. This hidden state at time t (denoted by x_t) might be related to the level of some key metabolite(s) governing the activity (g_t) of the gene(s). These states present a notion of condition specificity which influences the dynamics of the various genes active during that regime (condition). From time-series microarray data, we aim to partition each gene’s expression profile into such regimes of expression, during which the underlying dynamics of the gene’s controlling state (x_t) can be assumed to be stationary. In [5], the powerful notion of context-sensitive Boolean networks for gene relationships has been presented. However, at least for short time-series data, such a Boolean characterization of gene state requires a one-bit quantization of the continuous state, which is difficult without expert biological knowledge of the activation threshold and knowledge of the precise evolution of gene expression. Here, we work with gene profiles as continuous variables conditioned on the regime of expression. Each regime is related to the state of a state-space model that is estimated from the data.
Our method (regime-SSM) comprises three components. To find the switches in gene dynamics, we use a change-point detection (CPD) approach based on singular spectrum analysis (SSA). Following the hypothesis that genes switching at the same time are driven by a common underlying input [3, 6], we group genes having similar change points. This clustering borrows from a mixture-of-Gaussians (MoG) model [7]. The inference of the network adjacency matrix follows from a state-space representation of expression dynamics among these coclustered genes [1, 2]. Finally, we present analyses on the publicly available embryonic kidney gene expression dataset [8] and the T-cell
activation dataset [1], using a combination of the above-developed methods, and we validate our findings against previously published literature as well as experimental data.
For the embryonic kidney dataset, the biological problem motivating our network inference approach is one of
identifying gene interactions during mammalian nephrogenesis (kidney formation). Nephrogenesis, like several other
developmental processes, involves the precise temporal interaction of several growth factors, differentiation signals, and
transcription factors for the generation and maturation of
progenitor cells. One such key set of transcription factors
is the GATA family, comprising six members, all containing the (–GATA–) binding domain. Among these, Gata2 and
Gata3 have been shown to play a functional role [8, 9] in
nephric development between days 10–12 after fertilization.
From a set of differentially expressed genes pertinent to this
time window (identified from microarray data), our goal is to
prospectively discover regulatory interactions between them
and the Gata2/3 genes. These interactions can then be further
resolved into transcriptional, or signaling interactions on the
basis of additional biological information.
In the T-cell activation dataset, the question is whether events downstream of T-cell activation can be partitioned into early and late response behaviors and, if so, which genes are active in a particular phase. Finally, can a network-level influence be inferred among the genes of each phase, and does it correlate with known data? We note here that we are not looking for the behavior of any particular gene, but are only interested in genes from each phase.
As will be shown in this paper, regime-SSM generates biologically relevant hypotheses regarding time-varying gene interactions during nephric development and T-cell activation. Several interesting transcripts are seen to be involved in the process, and the influence network thus generated resolves cyclic dependencies.
The main assumption for the formulation of a linear
state-space model to examine the possibility of gene-gene interactions is that gene expression is a function of the underlying cell state and the expression of other genes at the previous
time step. If longer-range dependencies are to be considered,
the complexity of the model would increase. Another criticism of the model might be that nonlinear interactions cannot be adequately modeled by such a framework. However,
around the equilibrium point (steady state), we can recover a
locally linearized version of this nonlinear behavior.
2. SSA AND CHANGE-POINT DETECTION
First we introduce some notation. Consider N gene expression profiles g^(1), g^(2), . . . , g^(N) ∈ R^T, T being the length of each gene’s temporal expression profile (as obtained from microarray expression). The jth time instant of gene i’s expression profile will be denoted by g_j^(i).
State-space partitioning is done using singular spectrum analysis (SSA) [10]. SSA identifies structural change points in time-series data using a sequential procedure [11]. We will briefly review this method.
briefly review this method.
Consider the “windowed” (width NW ) time-series data
given by {g1(i) , g2(i) , . . . , gN(i)W }, with M (M ≤ NW /2) as some
integer-valued lag parameter, and a replication parameter
K = NW − M + 1. The SSA procedure in CPD involves the
following.
(i) Construction of an l-dimensional subspace: here, a “trajectory matrix” for the time series over the interval [n + 1, n + T] is constructed:

G_B^{i,(n)} =
( g^(i)_{n+1}   g^(i)_{n+2}     . . .   g^(i)_{n+K}   )
( g^(i)_{n+2}   g^(i)_{n+3}     . . .   g^(i)_{n+K+1} )
(     ⋮             ⋮            ⋱          ⋮         )
( g^(i)_{n+M}   g^(i)_{n+M+1}   . . .   g^(i)_{n+N_W} )    (1)

where K = N_W − M + 1. The columns of the matrix G_B^{i,(n)} are the vectors G_j^(i) = (g^(i)_{n+j}, . . . , g^(i)_{n+j+M−1})^T, with j = 1, . . . , K.
(ii) Singular value decomposition of the lag-covariance matrix R_{i,n} = G_B^{i,(n)} (G_B^{i,(n)})^T yields a collection of singular vectors; a grouping of l of these singular vectors, corresponding to the l highest eigenvalues and denoted by I = {1, . . . , l}, establishes a subspace L_{n,I} of R^M.
(iii) Construction of the test matrix: use G_test^{i,(n)} defined by

G_test^{i,(n)} =
( g^(i)_{n+p+1}   g^(i)_{n+p+2}     . . .   g^(i)_{n+q}     )
( g^(i)_{n+p+2}   g^(i)_{n+p+3}     . . .   g^(i)_{n+q+1}   )
(      ⋮               ⋮             ⋱           ⋮          )
( g^(i)_{n+p+M}   g^(i)_{n+p+M+1}   . . .   g^(i)_{n+q+M−1} )    (2)

Here, p and q determine the location and length of the test sample. We choose p ≥ K, with K = N_W − M + 1, and q > p; here we take q = p + 1. From this construction, the matrix columns are the vectors G_j^{i,(n)}, j = p + 1, . . . , q. The matrix has dimension M × Q, with Q = (q − p) = 1.
(iv) Computation of the detection statistic: the detection statistics used in the CPD are

(a) the normed Euclidean distance between the column span of the test matrix, that is, the vectors G_j^{i,(n)}, and the l-dimensional subspace L_{n,I} of R^M; this is denoted by D_{n,I,p,q};
(b) the normalized sum of squares of distances, denoted by S_n = D_{n,I,p,q} / (MQ μ_{n,I}), with μ_{n,I} = D_{m,I,0,K}, where m is the largest value of m ≤ n such that the hypothesis of no change is accepted;
(c) a cumulative sum- (CUSUM-) type statistic W_1 = S_1, W_{n+1} = max{(W_n + S_{n+1} − S_n − 1/(3MQ)), 0}, n ≥ 1.

The CPD procedure declares a structural change in the time-series dynamics if for some time instant n we observe W_n > h, with the threshold h = (2 t_α /(MQ)) √((1/3) q (3MQ − Q² + 1)), t_α being the (1 − α) quantile of the standard normal distribution.
(v) Choice of algorithm parameters:

(a) window width (N_W): here we choose N_W ≈ T/5, T being the length of the original time series; with this choice the algorithm provides a reliable method of extracting most structural changes. This relatively small window might lead to some outliers being classified as potential change points, but in our set-up this is preferred to losing genuine structural changes, as can happen with a larger N_W;
(b) choice of lag M: in most cases, choose M = N_W /2.
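Steps (i)-(iv) above can be sketched in a few lines. The following is a minimal illustration (our own code, not the authors' implementation; the function names and the synthetic test signal are ours): build the trajectory matrix, take the l leading left singular vectors as the subspace L_{n,I}, and measure the squared distance of a test column from that subspace.

```python
import numpy as np

def trajectory_matrix(g, n, M, K):
    """M x K trajectory (Hankel) matrix of the window starting at index n (0-based)."""
    return np.column_stack([g[n + j : n + j + M] for j in range(K)])

def ssa_distance(g, n, M, K, l, p, q):
    """Squared distance of test columns from the l-dimensional SSA subspace L_{n,I}."""
    G = trajectory_matrix(g, n, M, K)
    U, _, _ = np.linalg.svd(G, full_matrices=False)   # left singular vectors of G
    U_l = U[:, :l]                                    # basis of the signal subspace
    G_test = np.column_stack([g[n + j : n + j + M] for j in range(p, q)])
    resid = G_test - U_l @ (U_l.T @ G_test)           # component outside the subspace
    return float(np.sum(resid ** 2))

rng = np.random.default_rng(0)
# Regime 1: a sinusoid (rank-2 trajectory matrix); regime 2: a shifted constant.
g = np.concatenate([np.sin(0.3 * np.arange(40)), 5.0 * np.ones(40)])
g += 0.01 * rng.standard_normal(g.size)
M, K, l = 10, 11, 2
d_stationary = ssa_distance(g, 0, M, K, l, K, K + 1)   # test column still inside regime 1
d_change = ssa_distance(g, 0, M, K, l, 35, 36)         # test column straddles the switch
assert d_change > d_stationary
```

On this synthetic series the distance statistic stays small while the test column obeys the regime-1 dynamics and inflates once it straddles the switch, which is the behavior that the CUSUM statistic W_n accumulates.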
3. MIXTURE-OF-GAUSSIANS (MoG) CLUSTERING
Having found change points (and thus, regimes) from the
gene trajectories of the differentially expressed genes, our
goal is to now group (cluster) genes with similar temporal
profiles within each regime. In this section, we derive the parameter update equations for a mixture-of-Gaussian clustering paradigm. As will be seen later, the Gaussian assumptions
on the gene expression permit the use of coclustered genes
for the SSM-based network parameter estimation.
We now consider the group of gene expression profiles G = {g^(1), g^(2), . . . , g^(n)}, all of which share a common change point (time of switch) c_1. Consider gene profile i, g^(i) = [g_1^(i), g_2^(i), . . . , g_{T_{c_1}}^(i)]^T, a T_{c_1}-dimensional random vector which follows a k-component finite mixture distribution described by

p(g | θ) = Σ_{m=1}^{k} α_m p(g | φ_m),    (3)

where α_1, . . . , α_k are the mixing probabilities, each φ_m is the set of parameters defining the mth component, and θ ≡ {φ_1, . . . , φ_k, α_1, . . . , α_k} is the complete set of parameters needed to specify the mixture. We have

α_m ≥ 0,  m = 1, . . . , k,    Σ_{m=1}^{k} α_m = 1.    (4)

For a set of n independently and identically distributed samples

G = {g^(1), g^(2), . . . , g^(n)},    (5)

the log-likelihood of a k-component mixture is given by

log p(G | θ) = log Π_{i=1}^{n} p(g^(i) | θ) = Σ_{i=1}^{n} log Σ_{m=1}^{k} α_m p(g^(i) | φ_m).    (6)

We treat the labels Z = {z^(1), . . . , z^(n)} associated with the n samples as missing data. Each label is a binary vector z^(i) = [z_1^(i), . . . , z_k^(i)], where z_m^(i) = 1 and z_p^(i) = 0, for p ≠ m, indicate that sample g^(i) was produced by the mth component. In this setting, the expectation-maximization (EM) algorithm can be used to derive the cluster parameter (θ) update equations.

In the E-step of the EM algorithm, the function Q(θ, θ̂(t)) ≡ E[log p(G, Z | θ) | G, θ̂(t)] is computed. This yields

w_m^(i) ≡ E[z_m^(i) | G, θ̂(t)] = α̂_m(t) p(g^(i) | θ̂_m(t)) / Σ_{j=1}^{k} α̂_j(t) p(g^(i) | θ̂_j(t)),    (7)

where w_m^(i) is the posterior probability of the event z_m^(i) = 1 on observing g^(i).

The estimate of the number of components (k) is chosen using a minimum message length (MML) criterion [7]. The MML criterion borrows from algorithmic information theory and serves to select models of lowest complexity to explain the data. As can be seen below, this complexity has two components: the first encodes the observed data as a function of the model and the second encodes the model itself. Hence, the MML criterion in our setup becomes

k_MML = arg min_k { − log p(G | θ̂(k)) + ((k N_p + 1)/2) log n },    (8)

where N_p is the number of parameters per component in the k-component mixture, given the number of clusters k_min ≤ k ≤ k_max.

In the M-step, θ̂_m(t + 1) = arg max_{φ_m} Q(θ, θ̂(t)) for m : α̂_m(t + 1) > 0. The elements φ̂ of the parameter vector estimate θ̂ are typically not closed form and depend on the specific parametrization of the densities in the mixture, that is, p(g^(i) | φ_m). If p(g^(i) | φ_m) belongs to the Gaussian density class N(μ_m, Σ_m), we have φ = (μ, Σ) and the EM updates yield [7]

α̂_m(t + 1) = (1/n) Σ_{i=1}^{n} w_m^(i),
μ̂_m(t + 1) = Σ_{i=1}^{n} w_m^(i) g^(i) / Σ_{i=1}^{n} w_m^(i),
Σ̂_m(t + 1) = Σ_{i=1}^{n} w_m^(i) (g^(i) − μ̂_m(t + 1))(g^(i) − μ̂_m(t + 1))^T / Σ_{i=1}^{n} w_m^(i).    (9)

Equations (7) and (9) are the parameter update equations for each of the m = 1, . . . , k cluster components.
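The E- and M-steps (7) and (9) can be sketched as follows for one-dimensional profiles (our own illustration, not the authors' code; the MML model-selection step (8) is omitted and k is fixed at 2).

```python
import numpy as np

def mog_em_step(G, alpha, mu, var):
    """One EM iteration of (7) and (9) for a k-component 1-D Gaussian mixture.
    G: (n,) samples; alpha, mu, var: (k,) weights, means, variances."""
    # E-step (7): responsibilities w_m^(i).
    dens = np.exp(-0.5 * (G[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    w = alpha * dens
    w /= w.sum(axis=1, keepdims=True)
    # M-step (9): reweighted mixing proportions, means, and variances.
    Nm = w.sum(axis=0)
    alpha_new = Nm / G.shape[0]
    mu_new = (w * G[:, None]).sum(axis=0) / Nm
    var_new = (w * (G[:, None] - mu_new) ** 2).sum(axis=0) / Nm
    return alpha_new, mu_new, var_new

rng = np.random.default_rng(1)
G = np.concatenate([rng.normal(-4, 1, 200), rng.normal(4, 1, 200)])
alpha, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([4.0, 4.0])
for _ in range(50):
    alpha, mu, var = mog_em_step(G, alpha, mu, var)
# The two recovered component means settle near the true values -4 and +4.
assert abs(mu.min() + 4.0) < 0.5 and abs(mu.max() - 4.0) < 0.5
```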
For the kidney expression data, since we are interested
in the role of Gata2 and Gata3 during early kidney development, we consider all the genes which have similar change
points as the Gata2 and Gata3 genes, respectively. We perform an MoG clustering within such genes and look at
those coclustered with Gata2 or Gata3. Coclustering within a
regime potentially suggests that the governing dynamics are
the same, even to the extent of coregulation. We note that
just because a gene is coclustered with Gata2 in one regime,
it does not mean that it will cocluster in a different regime.
This approach suggests a way to localize regimes of correlation instead of the traditional global correlation measure that
can mask transient and condition-specific dynamics. For this
gene expression data, the MML penalized criterion indicates
that an adequate number of clusters to describe this data is
two (k = 2). In Tables 1 and 2, we indicate some of the genes
with similar coexpression dynamics as Gata2/Gata3 and a
cluster assignment of such genes. We observe that this clustering corresponds to the first phase of embryonic development (days 10–12 dpc), the phase where Gata2 and Gata3 are
perhaps most relevant to kidney development [12–15].
A word about Table 1 is in order. The entries in each column of a row (gene) indicate the change points (as found
by the SSA-CPD procedure) in the time series of the interpolated gene expression profile. Our simulation studies with
the T-cell data indicate that the SSM and CoD performance
is not much worse with the interpolated data compared to
the original time series (Table 7). We note that because of the present choice of the parameter N_W, some false-positive change points may be detected, but this is preferable to the loss of genuine change points. An examination of
the change points of the various genes in Table 1 indicates
three regimes—between points approximately 1–5, 5–11 and
12–20. The missing entries mean that there was no change
point identified for a certain regime and are thus treated as
such. Since our focus is early Gata3 behavior, we are interested in time points 1–12, and hence we examine the evolution of network-level interactions over the first two regimes
for the genes coclustered in these regimes.
To validate the presented approach further, we present a similar analysis on another data set, the T-cell expression data of [1]. This data set tracks the expression of various genes after T-cell activation using stimulation with the phorbol ester PMA and ionomycin [16]. It contains the profiles of about 58 genes over 10 time points, with 44 (34 + 10) replicate measurements for each time point.
Since here we have no specific gene in mind (unlike earlier, where we were particularly interested in Gata3 behavior), the change-point procedure (CPD) yields two distinct regimes: one from time points 1 to 4 and the other from time points 5 to 10. The MoG clustering procedure then yields an optimal number of clusters of one (from MML) in each regime.
We therefore call these two clusters “early response” and “late
response” genes and then proceed to learn a network relationship amongst them, within each cluster. The CPD and
cluster information for the early and late responses are summarized in Table 3.
4. STATE-SPACE MODEL
For a given regime, we treat gene expression as an observation related to an underlying hidden cell state (xt ), which is
assumed to govern regime-specific gene expression dynamics for that biological process, globally within the cell. Suppose there are N genes whose expression is related to a single process. The ith gene’s expression vector is denoted as
gt(i) , t = 1, . . . T, where T is the number of time points for
which the data is available. The state-space model (SSM) is
used to model the gene expression (gt(i) , i = 1, 2, . . . , N and
t = 1, 2, . . . , T) as a function of this underlying cell state (xt )
as well as some external inputs. A notion of influence among
genes can be integrated into this model by considering the
SSM inputs to be the gene expression values at the previous
Table 1: Change-point analysis of some key genes, prior to clustering (annotations in Table 8). The numbers indicate the time points at which regime changes occur for each gene.

Gene symbol   Change point I   Change point II   Change point III
Bmp7          6                10                12
Rara          5                11                16
Pax2          6                12                15
Gata3         5                9                 12
Gata2         —                —                 18
Gdf11         —                10                20
Npnt          —                12                16
Cd44          5                11                15
Pgf           5                11                —
Pbx1          5                12                20
Ret           —                10                —
Table 2: Some of the genes coclustered with Gata2 and Gata3 after MoG clustering (annotations in Table 8).

Genes with the same dynamics as Gata3   Genes with the same dynamics as Gata2
Bmp7                                    Lamc2
Nrtn                                    Cldn3
Pax2                                    Ros1
Ros1                                    Ptprd
Pbx1                                    Npnt
Rara                                    Cdh16
Gdf11                                   Cldn4
Table 3: Some of the genes related to early and late responses in T-cell activation (annotations in Table 9).

Genes related to early response   Genes related to late response
(time points: 1–4)                (time points: 5–10)
CD69                              CCNA2
Mcp1                              CDC2
Mcl1                              EGR1
EGR1                              IL2r gamma
JunD                              IL6
CKR1                              —
time step. The state and observation equations of the state-space model [17] are
(i) state equation:
xt+1 = Axt + Bgt + es,t ; es,t ∼ N (0, Q),
i = 1, . . . , N; t = 1, . . . , T;
(10)
(ii) observation equation:
gt = Cxt + Dgt−1 + eo,t ;
eo,t ∼ N (0, R),
(11)
Table 4: Assumptions and log-likelihood calculations in the state-space model. The (≡) symbol indicates a definition. Here T is the number of time points and Rg the number of replicates.

P(g_t | x_t) ≡ Π_{t=1}^{T} e^{−(1/2)[g_t − C x_t − D g_{t−1}]^T R^{−1} [g_t − C x_t − D g_{t−1}]} · (2π)^{−p/2} det(R)^{−1/2}

P(x_t | x_{t−1}) ≡ Π_{t=2}^{T} e^{−(1/2)[x_t − A x_{t−1} − B g_{t−1}]^T Q^{−1} [x_t − A x_{t−1} − B g_{t−1}]} · (2π)^{−k/2} det(Q)^{−1/2}

P(x_1) ≡ e^{−(1/2)[x_1 − π_1]^T V_1^{−1} [x_1 − π_1]} · (2π)^{−k/2} det(V_1)^{−1/2}   (initial state density assumption)

P({x}, {g}) = Π_{i=1}^{Rg} P(x_1^(i)) Π_{t=2}^{T} P(x_t^(i) | x_{t−1}^(i), g_{t−1}^(i)) · Π_{t=1}^{T} P(g_t^(i) | x_t^(i), g_{t−1}^(i))   (Markov property)

log P({x}, {g}) = − Σ_{i=1}^{Rg} Σ_{t=1}^{T} (1/2)[g_t^(i) − C x_t^(i) − D g_{t−1}^(i)]^T R^{−1} [g_t^(i) − C x_t^(i) − D g_{t−1}^(i)] − (T/2) log det(R)
  − Σ_{i=1}^{Rg} Σ_{t=2}^{T} (1/2)[x_t^(i) − A x_{t−1}^(i) − B g_{t−1}^(i)]^T Q^{−1} [x_t^(i) − A x_{t−1}^(i) − B g_{t−1}^(i)] − ((T − 1)/2) log det(Q)
  − (1/2)[x_1 − π_1]^T V_1^{−1} [x_1 − π_1] − (1/2) log det(V_1) − (T(p + k)/2) log(2π)   (joint log probability)
with xt = [xt(1) , xt(2) , . . . , xt(K) ]T and gt = [gt(1) , gt(2) , . . . ,
gt(N) ]T . A likelihood method [1] is used to estimate the state
dimension K. The noise vectors es,t and eo,t are Gaussian distributed with mean 0 and covariance matrices Q and R, respectively.
From the state and observation equations (10) and (11), we notice that the matrix-valued parameter D = [D_{i,j}], i = 1, . . . , N; j = 1, . . . , N, quantifies the influence among genes i and j from one time instant to the next, within a specific regime. To infer a biological network using D, we use bootstrapping to estimate the
distribution of the strength of association estimates amongst
genes and infer network linkage for those associations that
are observed to be significant.
Within this proposed framework, we segment the overall
gene expression time trajectories into smaller, approximately
stationary, gene expression regimes. We note that the MoG clustering framework is a nonlinear one in that the regime-specific state space is partitioned into clusters. These cluster
assignments of correlated gene expression vectors can change
with regime, allowing us to capture the sets of genes that interact under changing cell condition.
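A small simulation illustrates how the model (10)-(11) encodes influence (our own sketch; the matrices are arbitrary toy values, not estimates from the data): a nonzero entry D[1, 0] makes gene 0's expression at time t − 1 drive gene 1's expression at time t.

```python
import numpy as np

def simulate_ssm(A, B, C, D, Q, R, T, rng):
    """Simulate (10)-(11): x_{t+1} = A x_t + B g_t + e_s, g_t = C x_t + D g_{t-1} + e_o."""
    k, N = A.shape[0], C.shape[0]
    x, g_prev = np.zeros(k), np.zeros(N)
    gs = []
    for _ in range(T):
        g = C @ x + D @ g_prev + rng.multivariate_normal(np.zeros(N), R)
        x = A @ x + B @ g + rng.multivariate_normal(np.zeros(k), Q)
        gs.append(g)
        g_prev = g
    return np.array(gs)   # (T, N) expression trajectories

rng = np.random.default_rng(2)
# D encodes gene-to-gene influence: here gene 0 drives gene 1 via D[1, 0] = 0.8.
A = np.array([[0.9]]); B = np.array([[0.05, 0.0]])
C = np.array([[1.0], [0.0]]); D = np.array([[0.0, 0.0], [0.8, 0.0]])
Q = np.array([[0.01]]); R = 0.01 * np.eye(2)
gs = simulate_ssm(A, B, C, D, Q, R, 200, rng)
# The lagged influence shows up as a strong lag-1 cross-correlation.
r = np.corrcoef(gs[:-1, 0], gs[1:, 1])[0, 1]
assert r > 0.3
```

This lag-one cross-dependence between profiles is exactly what the bootstrapped test on the entries of D is designed to detect.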
5. SYSTEM IDENTIFICATION
We consider the case where we have Rg = B × P realizations of expression data available for each gene. Here, mRNA level is taken as a measure of gene expression, B (= 2) denotes the number of biological replicates, and P (= 16 perfect-match probes) denotes the number of probes per gene transcript. Each of these Rg realizations is T time points long and is obtained from Affymetrix U74Av2 murine microarray raw CEL files. In the section below, we derive the update
and is obtained from Affymetrix U74Av2 murine microarray raw CEL files. In the section below, we derive the update
equations for maximum-likelihood estimates of the parameters A, B, C, D, Q and R (in (10) and (11)) using an EM
algorithm, based on [17, 18]. The assumptions underlying
this model are outlined in Table 4. A sequence of T output vectors (g_1, g_2, . . . , g_T) is denoted by {g}, and a subsequence {g_{t0}, g_{t0+1}, . . . , g_{t1}} by {g}_{t0}^{t1}. We treat the (x_t, g_t) vector as the complete data and find the log-likelihood log P({x}, {g}) under the above assumptions. The complete E- and M-steps involved in the parameter updates are outlined in Tables 5
and 6.
6. BOOTSTRAPPED CONFIDENCE INTERVALS
As suggested above, the entries of the D matrix indicate the
strength of influence among the genes, from one time step to
the next (within each regime). We use bootstrapping to find
confidence intervals for each entry in the D matrix and if it is
significant, we assign a positive or negative direction (+1 or
−1) to this influence.
The bootstrapping procedure [19] is adapted to our situation as follows.
Table 5: M-step of the EM algorithm for state-space parameter estimation. The (≡) symbol indicates a definition.

π_1^new (initial state mean) = x̂_1
V_1^new (initial state covariance) = P_1 − x̂_1 x̂_1^T + (1/Rg) Σ_{i=1}^{Rg} (x̂_1^(i) − x̂_1)(x̂_1^(i) − x̂_1)^T
C^new (output matrix) = [Σ_{i=1}^{Rg} Σ_{t=1}^{T} (g_t^(i) x̂_t^(i)T − D g_{t−1}^(i) x̂_t^(i)T)] · [Σ_{i=1}^{Rg} Σ_{t=1}^{T} P_t^(i)]^{−1}
R^new (output noise covariance) = (1/(Rg × T)) Σ_{i=1}^{Rg} Σ_{t=1}^{T} (g_t^(i) g_t^(i)T − C^new x̂_t^(i) g_t^(i)T − D^new g_{t−1}^(i) g_t^(i)T)
A^new (state dynamics matrix) = [Σ_{i=1}^{Rg} Σ_{t=2}^{T} (P_{t,t−1}^(i) − B g_{t−1}^(i) x̂_{t−1}^(i)T)] · [Σ_{i=1}^{Rg} Σ_{t=2}^{T} P_{t−1}^(i)]^{−1}
B^new (input-to-state matrix) = [Σ_{i=1}^{Rg} Σ_{t=2}^{T} (x̂_t^(i) g_{t−1}^(i)T − A x̂_{t−1}^(i) g_{t−1}^(i)T)] · [Σ_{i=1}^{Rg} Σ_{t=2}^{T} g_{t−1}^(i) g_{t−1}^(i)T]^{−1}
D^new (input-to-observation matrix) = [Σ_{i=1}^{Rg} Σ_{t=1}^{T} (g_t^(i) g_{t−1}^(i)T − C x̂_t^(i) g_{t−1}^(i)T)] · [Σ_{i=1}^{Rg} Σ_{t=1}^{T} g_{t−1}^(i) g_{t−1}^(i)T]^{−1}
Q^new (state noise covariance) = (1/(Rg × (T − 1))) Σ_{i=1}^{Rg} Σ_{t=2}^{T} (P_t^(i) − A^new P_{t−1,t}^(i) − B^new g_{t−1}^(i) x̂_t^(i)T)

(i) Suppose there are R regimes in the data with change points (c_1, c_2, . . . , c_R) identified from SSA. For the rth regime, generate B independent bootstrap samples of size N (the original number of genes under consideration), (Y*_1, Y*_2, . . . , Y*_B), from the original data by random resampling from g^(i) = [g_{c_r}^(i), . . . , g_{c_{r+1}}^(i)]^T.

(ii) Using the EM algorithm for parameter estimation, estimate the value of D (the influence parameter). Denote the estimate of D for the ith bootstrap sample by D*_i.

(iii) Compute the sample mean and sample variance of the estimates of D over all the B bootstrap samples. That is,

mean = D̄ = (1/B) Σ_{i=1}^{B} D*_i,
variance = (1/(B − 1)) Σ_{i=1}^{B} (D*_i − D̄)².    (12)

(iv) Using the above obtained sample mean and variance, estimate confidence intervals for the elements of D. If D lies in this bootstrapped confidence interval, we infer a potential influence; if not, we discard it. Note that even though we write D, we carry out this hypothesis test for each D_{i,j}, i = 1, . . . , n; j = 1, . . . , n, for each of the n genes under consideration in every regime.
7. SUMMARY OF ALGORITHM
Within each regime identified by CPD, we model gene expression as Gaussian distributed vectors. We cluster the genes
using a mixture-of-Gaussians (MoG) clustering algorithm [7]
to identify sets of genes that have similar “dynamics of expression,” in that they are correlated within that regime. We
then proceed to learn the dynamic system parameters (matrices A, B, C, D, Q, and R) for the state-space model (SSM)
underlying each of the clusters. We note two important ideas:
(i) we might obtain different cluster assignments for the
genes depending on the regime;
(ii) since all these genes (across clusters within a regime)
are still related to the same biological process, the hidden state xt is shared among these clusters.
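The MoG clustering step mentioned above can be sketched with a minimal expectation-maximization loop for spherical Gaussians. The paper uses the MoG method of [7], which also selects the number of components by a minimum-message-length criterion; this simplified version fixes k and uses a deterministic spread-out initialization.

```python
import numpy as np

def mog_cluster(X, k, iters=50):
    """Minimal EM for a mixture of spherical Gaussians (sketch of the
    regime-specific clustering step). X: (N, d) per-gene expression
    profiles; returns hard cluster labels."""
    N, d = X.shape
    mu = X[np.linspace(0, N - 1, k).astype(int)].copy()   # spread-out init
    var = np.full(k, X.var() + 1e-6)
    pi = np.full(k, 1.0 / k)

    def resp():
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)    # (N, k) sq. distances
        logp = -0.5 * d2 / var - 0.5 * d * np.log(var) + np.log(pi)
        logp -= logp.max(axis=1, keepdims=True)           # stabilize exp
        r = np.exp(logp)
        return r / r.sum(axis=1, keepdims=True)

    for _ in range(iters):
        r = resp()                                        # E-step
        nk = r.sum(0) + 1e-12                             # M-step
        pi = nk / N
        mu = (r.T @ X) / nk[:, None]
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        var = (r * d2).sum(0) / (nk * d) + 1e-6
    return resp().argmax(axis=1)
```

Genes with correlated trajectories end up in the same component, which is the grouping that the per-cluster SSM learning then operates on.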
Therefore, we learn the SSM parameters in an alternating
manner by updating the estimates from cluster to cluster
Arvind Rao et al.
while still retaining the form of the state vector x_t. The learning is done using an expectation-maximization- (EM-) type algorithm. The number of components during regime-specific clustering is estimated using a minimum message length criterion. Typically, O(N) iterations suffice to infer the mixture model in each regime with N genes under consideration.

Table 6: E-step of the EM algorithm for state-space parameter estimation.

Forward (filter), initialized with x_1^0 ≡ π_1 and V_1^0 ≡ V_1:
x_t^{t−1} = A x_{t−1}^{t−1} + B g_{t−1}
V_t^{t−1} = A V_{t−1}^{t−1} A^T + Q
K_t = V_t^{t−1} C^T (C V_t^{t−1} C^T + R)^{−1}
x_t^t = x_t^{t−1} + K_t (g_t − C x_t^{t−1} − D g_{t−1})
V_t^t = V_t^{t−1} − K_t C V_t^{t−1}

Backward (smoother), initialized with V_{T,T−1}^T = (I − K_T C) A V_{T−1}^{T−1}, and writing x̂_t ≡ x_t^T, P_t ≡ V_t^T + x_t^T (x_t^T)^T, P_{t,t−1} ≡ V_{t,t−1}^T + x_t^T (x_{t−1}^T)^T:
J_{t−1} = V_{t−1}^{t−1} A^T (V_t^{t−1})^{−1}
x_{t−1}^T = x_{t−1}^{t−1} + J_{t−1} (x_t^T − A x_{t−1}^{t−1} − B g_{t−1})
V_{t−1}^T = V_{t−1}^{t−1} + J_{t−1} (V_t^T − V_t^{t−1}) J_{t−1}^T
V_{t−1,t−2}^T = V_{t−1}^{t−1} J_{t−2}^T + J_{t−1} (V_{t,t−1}^T − A V_{t−1}^{t−1}) J_{t−2}^T

The discussion of the network inference procedure would be incomplete in the absence of any other algorithms for comparison. For this purpose, we implement the CoD- (coefficient-of-determination-) based approach [20, 21] along with the models proposed in [1] (SSM) and [22] (GGM). The CoD method allows us to determine the association between two genes within a regime via an R^2 goodness-of-fit statistic. The methods of [1, 22] are implemented on the time-series data (without regard to the underlying regimes). Such a study is useful for determining the relative merits of each approach. We believe that no one procedure can work for every application; the choice of an appropriate procedure is governed by the biological question under investigation. Each of these methods uses some underlying assumptions, and if these are consistent with the question that we ask, then that method has great utility. These individual results, their evaluation, and their comparison are summarized in Section 8.

Thus, our proposed approach is as follows.
(i) Identify the N key genes based on the required phenotypical characteristic using fold-change studies. Preprocess the gene expression profiles by standardization and cubic-spline interpolation.
(ii) Segment each gene’s expression profile into a sequence of state-dependent trajectories (regime change
points), from underlying dynamics, using SSA.
(iii) For each regime (as identified in step 2),
cluster genes using an MoG model so that genes
with correlated expression trajectories cluster together. Learn an SSM [17, 18] for each cluster (from (10) and (11) for estimation of the
mean and covariance matrices of the state vector)
within that regime. The input to observation matrix (D) is indicative of the topology of the network in that regime.
(iv) Examine the network matrices D (by bootstrapping
to find thresholds on strength of influence estimates)
across all regimes to build the time-varying network.
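The SSM learning in step (iii) begins with the E-step's forward (Kalman) filtering pass. A minimal single-replicate sketch of that recursion, for the input-driven model assumed above (all names illustrative), is:

```python
import numpy as np

def kalman_forward(g, A, B, C, D, Q, R, pi1, V1):
    """Forward (filtering) pass for the input-driven state-space model
        x_t = A x_{t-1} + B g_{t-1} + w_t,  w_t ~ N(0, Q),
        g_t = C x_t + D g_{t-1} + v_t,      v_t ~ N(0, R).
    g: (T, p) observed expression for one replicate.
    Returns filtered means x_t^t and covariances V_t^t."""
    T = g.shape[0]
    n = pi1.shape[0]
    x = np.zeros((T, n))
    V = np.zeros((T, n, n))
    for t in range(T):
        if t == 0:
            x_pred, V_pred = pi1, V1              # initialization
            resid = g[0] - C @ x_pred             # no g_0 input term
        else:
            x_pred = A @ x[t - 1] + B @ g[t - 1]  # one-step prediction
            V_pred = A @ V[t - 1] @ A.T + Q
            resid = g[t] - C @ x_pred - D @ g[t - 1]
        S = C @ V_pred @ C.T + R                  # innovation covariance
        K = V_pred @ C.T @ np.linalg.inv(S)       # Kalman gain
        x[t] = x_pred + K @ resid                 # measurement update
        V[t] = V_pred - K @ C @ V_pred
    return x, V
```

The backward (smoothing) pass of Table 6 then runs over these filtered quantities to produce the expectations consumed by the M-step.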
8. RESULTS

8.1. Application to the GATA pathway
To illustrate our approach (regime-SSM), we consider the
embryonic kidney gene expression dataset [8] and study the
set of genes known to have a possible role in early nephric development. An interruption of any gene in this signaling cascade potentially leads to early embryonic lethality or abnormal organ development. An influence network among these
genes would reveal which genes (and their products) become important at a certain phase of nephric development.
The N (= 47) genes are chosen using FDR fold-change studies [23] between ureteric bud and metanephric mesenchyme tissue types, since this spatial tissue expression is of relevance during early embryonic development. The dataset is obtained by daily sampling of mRNA expression from 11.5 to 16.5 days post coitum (dpc). Detailed studies of the phenotypes characterizing each of these days are available from the Mouse Genome Informatics Database at http://www.informatics.jax.org/. We follow [24] and preprocess the expression data by interpolation for cluster analysis.
We resample this interpolated profile to obtain twenty points
per gene expression profile. Two key aspects were confirmed
after interpolation [24, 25]: (1) there were no negative expression values introduced, (2) the differences in fold change
were not smoothed out.
Initial experimental studies have suggested that days 10.5–12.5 dpc are relatively more important in determining the course of metanephric development. We chose to explore which genes (out of the 47 considered) might be relevant in this specific time window. The SSA-CPD procedure identified several genes which exhibit similar dynamics (approximately the same change points, for any given regime) in the early phase and distinctly different dynamics in later phases (Table 1).
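The SSA-CPD idea of comparing lagged windows against a dominant singular subspace can be sketched as follows. This is an illustrative simplification of [10, 11], not their exact test statistic: a base window's trajectory (Hankel) matrix supplies a k-dimensional subspace, and later lagged vectors are scored by their residual outside it.

```python
import numpy as np

def hankel(x, L):
    """L x K trajectory matrix of the series x."""
    K = len(x) - L + 1
    return np.column_stack([x[k:k + L] for k in range(K)])

def ssa_change_scores(x, L=8, base=24, k=2):
    """Normalized residual of each lagged vector of x outside the
    k-dimensional dominant SSA subspace of the first `base` samples.
    Peaks in the score suggest change points (sketch of SSA-based CPD)."""
    U, _, _ = np.linalg.svd(hankel(x[:base], L), full_matrices=False)
    Uk = U[:, :k]                        # dominant subspace of the base window
    P = np.eye(L) - Uk @ Uk.T            # projector onto its complement
    scores = []
    for t in range(base, len(x) - L + 1):
        v = x[t:t + L]
        scores.append(float(np.linalg.norm(P @ v) ** 2 / np.linalg.norm(v) ** 2))
    return np.array(scores)
```

For a profile that switches dynamics partway through, the score stays near zero in the base regime and jumps once windows enter the new regime.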
Our approach to influence determination using the state-space model yields up to three distinct regimes of expression over all the 47 genes identified from fold-change studies between bud and mesenchyme. MoG clustering followed by
EURASIP Journal on Bioinformatics and Systems Biology

Figure 1: Network topology over regimes (solid lines represent the first regime, and the dotted lines indicate the second regime).

Figure 2: Steady-state network inferred over all time, using [1].

Figure 3: Steady-state network inferred using CoD (solid lines represent the first regime, and the dotted lines indicate the second regime).
state-space modeling yields three regime topologies, of which we are interested in the early regime (days 10.5–12.5). This influence topology is shown in Figure 1.
We compare our obtained network (using regime-SSM)
with the one obtained using the approach outlined in [1],
shown in Figure 2. We note that the network presented in
Figure 2 extends over all time, that is, days 10.5–16.5 for
which basal influences are represented but transient and
condition-specific influences may be missed. Some of these
transient influences are recaptured by our method (Figure 1) and are in conformity (with lower false positives in network connectivity) with pathway entries in Entrez Gene [15] as well as with recent reviews on kidney expression [8, 12] (also, see Table 8). For example, the Mapk1-Rara [26] and Pax2-Gdf11 [27] interactions are completely missed in Figure 2; this is because these interactions occur only during the 10.5–12.5 dpc regime. We also see that the
Acvr2b-Lamc2 [28] interaction is observed in the steady state
but not in the first regime. This interaction becomes active
in the second regime (first via the Acvr2b-Gdf11 and then via
the Gdf11-Lamc2), indicating that it might not have particular relevance in the day 10.5–12.5 dpc stage. Several of these
predicted interactions need to be experimentally characterized in the laboratory. It is especially interesting to see the
Rara gene in this network, because it is known that Gata3
[29, 30] has tissue-specific expression in some cells of the developing eye. Also Gdf11 exhibits growth factor activity and
is extremely important during organ formation.
In Figure 3, we give the results of the CoD approach to network inference. Here the Gata3-Pax2 interaction appears reversed, which is counterintuitive. Some of the interactions (e.g., Pax2-Gata3) are recovered here (via other nodes: Mapk1-Wnt11), but there is a need to resolve cycles (Ros1-Wnt11-Mapk1) and feedback/feedforward loops
(Bmp7-Gata3-Wnt11). Both of these topologies can convey potentially useful information about nephric development. Thus, one way to combine these two methods
is to “seed” the network using CoD and then try to resolve
cycles using regime-SSM.
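The CoD association measure used for such seeding can be sketched as an R^2 statistic for a simple predictor. The CoD of [20, 21] is defined for general, possibly nonlinear, predictors; a least-squares linear fit is used here only for illustration.

```python
import numpy as np

def cod(target, predictor):
    """Coefficient of determination (CoD) of predicting `target` from
    `predictor` with a least-squares linear fit, relative to the best
    constant predictor (the mean). Sketch of the R^2 association
    measure used to seed the network."""
    t = np.asarray(target, float)
    p = np.asarray(predictor, float)
    err0 = np.mean((t - t.mean()) ** 2)     # baseline: constant predictor
    a, b = np.polyfit(p, t, 1)              # linear predictor t ~ a*p + b
    err = np.mean((t - (a * p + b)) ** 2)
    return (err0 - err) / err0 if err0 > 0 else 0.0
```

A CoD near 1 indicates a strong association within the regime; thresholding such scores gives the candidate edges that regime-SSM can then refine.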
Figure 4: Steady-state network inferred using SSM (solid lines represent the first regime, and the dotted lines indicate the second regime).
8.2. T-cell activation
The regime-SSM network is shown in Figure 4. The corresponding network learnt in each regime using CoD is also
shown (Figure 5). The study of this network using GGM
(for the whole time-series data) is already available in [22].
Though there are several interactions of interest discovered
in both the SSM and CoD procedures, we point out a few
of interest. It is already known that synergistic interactions
between IL-6 and IL-1 are involved in T-cell activation [31].
IL-2 receptor transcription is affected by EGR1 [32]. An examination of the topology of these two networks (CoD and
SSM) would indicate some matches and is worth pursuing
for experimental investigation. However, as already alluded
to above, we have to find a way to resolve cycles from the
CoD network [33]. Several of these match the interactions
reported in [1, 22]. However, the additional information that
we can glean is that some of the key interactions occur during
“early response” to stimulation and some occur subsequently
(interleukin-6 mediated T-cell activation) in the “late phase.”
An examination of the gene ontology (GO) terms represented in each cluster as well as the functional annotations
in Entrez Gene shows concordance with literature findings
(Table 9). Because this dataset has been the subject of several
interesting investigations, it would be ideal to ask other questions related to network inference procedures, for the purpose of comparison. One of the primary questions we seek
Figure 5: Steady-state network inferred using CoD (solid lines represent the first regime, and the dotted lines indicate the second regime).

Figure 6: Steady-state network inferred using GGMs.
to answer is the following: what is the performance of the network inference procedure if a subsampled trajectory is used instead?
In Table 7, the performances of the CoD and SSM algorithms are summarized. Using the T-cell (10 points, 44 replicates) data, we infer a network using the SSM procedure.
With the identified edges as the gold standard for comparison, we now use SSM network inference on an undersampled version of this time series (5 points, 44 replicates) and
check for any new edges ( fnew ) or deletion of edges ( flost ).
Ideally, we would want both these numbers to be zero. fnew is the number of new edges added to the original set, and flost is the number of edges lost from the original-data network over both regimes. Further, we now interpolate this undersampled
data to 10 points and carry out network inference. This is
done for each of the identified regimes. The same is done for
the CoD method. We note that this is not a comparison between SSM and CoD (the two work under very different assumptions), but of the effect of undersampling the data and subsequently interpolating the undersampled data back to the original length (via resampling). Table 7 suggests that, as expected, there is degradation in performance (SSM/CoD) in
the absence of all the available information. However, it is preferable to infer some false positives rather than to lose true-positive edges. This also indicates that the interpolated data do not do worse than the undersampled data in terms of true positives (flost).
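The fnew/flost bookkeeping used in this comparison can be sketched as simple edge-set differences (edge counts against the gold-standard network; names are illustrative).

```python
def edge_set_changes(reference, inferred):
    """Count edge changes between an inferred network and a reference
    network (sketch of the f_new / f_lost bookkeeping). Edges are
    directed (source, target) tuples."""
    ref, inf = set(reference), set(inferred)
    f_new = len(inf - ref)    # edges present only in the new inference
    f_lost = len(ref - inf)   # reference edges that were not recovered
    return f_new, f_lost
```

Running this against the original-data network for each undersampled or interpolated run yields the entries tabulated for both SSM and CoD.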
We make three observations regarding this method of
network inference.
(i) It is not necessary for the target gene (Gata2/Gata3) to be present in the inferred network. We can obtain insight into the mechanisms underlying transcription in each regime as long as some of the genes with coexpression dynamics similar to the target gene(s) are present in the inferred network.
(ii) Probe-level observations from a small number of biological replicates seem to be very informative for network inference. This is because the LDS parameter estimation algorithm uses these multiple expression realizations to iteratively estimate the state mean, covariance, and other parameters, notably D [17]. Hence, in spite of the few time points, we can use multiple measurements (biological, technical, and probe-level replicates) for reliable network inference. This follows similar observations in [34] that probe-level replicates are very useful for understanding intergene relationships.
(iii) Following [24], it would seem that several network
hypotheses can individually explain the time evolution behavior captured by the expression data. The
LDS parameter estimation procedure seeks to find a
maximum-likelihood (ML) estimate of the system parameters A, B, C, and D, and then finally uses bootstrapping to infer only high-confidence interactions. This ML estimation of the parameters uses an EM algorithm with multiple starts to avoid initialization-related issues [17], and thus finds the “most consistent” hypothesis that would explain the evolution
of expression data. It is this network hypothesis that
we investigate. Since this network already contains our
gene of interest Gata3, we can proceed to verify these
interactions from literature and experimentally.
9. DISCUSSION
One of the primary motivations for computational inference
of state specific gene influence networks is the understanding
of transcriptional regulatory mechanisms [36]. The networks
inferred via this approach are fairly general, and thus there is
a need to “decompose” these networks into transcriptional, signal-transduction, or metabolic components using a combination of biological knowledge and chemical kinetics. Depending on the
insights expected, the tools for dissection of these predicted
influences might vary.
For comparison, we additionally investigated a graphical Gaussian model (GGM) approach as suggested in [35]
using partial correlation as a metric to quantify influence
(Figure 6). This method works for short time-series data but
we could not find a way to incorporate previous expression values as inputs to the evolution of state or individual
observations, something we could do explicitly in the state-space approach. However, we are now in the process of examining the networks inferred by the GGM approach over
the regimes that we have identified from SSA. Again, we observe that the network connections reflect a steady-state behavior and that transient (state-specific) changes in influence
are not fully revealed. The same is observed in the case of
the T-cell data, from the results reported in [22]. A comparison of all the presented methods, along with regime-SSM,
has been presented in Table 10. The comparisons are based
Table 7: Functional annotations (Entrez Gene) of some of the genes coclustered with Gata2 and Gata3.

Gene symbol | Gene name | Possible role in nephrogenesis (function)
Bmp7 | Bone morphogenetic protein | Cell signaling
Rara | Retinoic acid receptor | Retinoic acid pathway, related to eye phenotype
Gata2 | GATA binding protein 2 | Hematopoiesis, urogenital development
Gata3 | GATA binding protein 3 | Hematopoiesis, urogenital development
Pax2 | Paired homeobox-2 | Direct target of Gata2
Lamc2 | Laminin | Cell adhesion molecule
Npnt | Nephronectin | Cell adhesion molecule
Ros1 | Ros1 proto-oncogene | Signaling, epithelial differentiation
Ptprd | Protein tyrosine phosphatase | Cell adhesion
Ret-Gdnf | Ret proto-oncogene, glial neurotrophic factor | Metanephros development
Gdf11 | Growth development factor | Cell-cell signaling and adhesion
Mapk1 | Mitogen-activated protein kinase 1 | Role in growth factor activity, cell adhesion
Kcnj8 | Potassium inwardly rectifying channel, subfamily J, member 8 | Potassium ion transport
Acvr2b | Activin receptor IIB | Transforming growth factor-beta receptor activity
Table 8: Functional annotations of some of the coclustered genes (early and late responses) following T-cell activation.

Gene symbol | Gene name | Possible role in T-cell activation (function)
CD69 | CD69 antigen | Early T-cell activation antigen
Mcl1 | Myeloid cell leukemia sequence 1 (BCL2-related) | Mediates cell proliferation and survival
IL6 | Interleukin 6 | Accessory factor signal
LAT | Linker for activation of T cells | Membrane adapter protein involved in T-cell activation
EGR1 | Early growth response gene 1 | Activates nFKB signaling
CDC2 | Cell division control protein 2 | Involved in cell-cycle control
Casp7 | Caspase 7 | Involved in apoptosis
JunD | Jun D proto-oncogene | Regulatory role in T-lymphocyte proliferation and Th-cell differentiation
CKR1 | Chemokine receptor 1 | Negative regulator of the antiviral CD8+ T-cell response
CYP19A1 | Cytochrome P450, member 19 | Cell proliferation
Intgam | Integrin alpha M | Mediates phagocytosis-induced apoptosis
nFKB | nFKB protein | Signal transduction activity
IL2Rg | Interleukin-2 receptor gamma | Signaling activity
Pde4b | Phosphodiesterase 4B, cAMP-specific | Mediator of cellular response to extracellular signal
Mcp1 | Monocyte chemotactic protein 1 | Cytokine gene involved in immunoregulation
CCNA2 | Cyclin A2 | Involved in cell-cycle control
Table 9: Results of network inference on original, subsampled, and interpolated data.

Method (T-cell data) | Edges inferred | fnew | flost
SSM on original data | 14 | — | —
SSM on undersampled data | — | 3 | 3
SSM on interpolated data | — | 4 | 2
CoD on original data | 12 | — | —
CoD on undersampled data | — | 3 | 2
CoD on interpolated data | — | 4 | 2
on whether these frameworks permit the inference of directional influences, regime specificity, resolution of cycles, and
modeling of higher lags.
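The partial-correlation metric underlying the GGM comparison can be sketched from the inverse covariance (precision) matrix. This is a minimal version; [35] uses a regularized shrinkage estimator of the covariance rather than the plain inverse, which matters when the number of samples is small relative to the number of genes.

```python
import numpy as np

def partial_correlations(X):
    """Partial-correlation matrix from the inverse covariance matrix,
    the influence metric of the GGM comparison. X: (T, n) matrix of
    T expression samples for n genes (assumes T > n so the sample
    covariance is invertible)."""
    prec = np.linalg.inv(np.cov(X, rowvar=False))   # precision matrix
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)                   # scale to correlations
    np.fill_diagonal(pcor, 1.0)
    return pcor
```

Thresholding the off-diagonal entries of this matrix gives the undirected GGM edges; as noted above, such edges reflect steady-state association rather than regime-specific influence.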
10. CONCLUSIONS
In this work, we have developed an approach (regime-SSM)
to infer the time-varying nature of gene influence network
topologies, using gene expression data. The proposed approach integrates change-point detection to delineate phases
Table 10: Comparison of various network inference methods (Y: Yes, N: No).

Method | Direction | Regime-specific | Resolve cycles | Higher lags (> 1) | Nonlinear/locally linear
CoD [20, 21] | Y | Y | N | N | Y
GGM [35] | Y | N | N | N | Y
SSM [1] | Y | N | Y | Y | Y
Regime-SSM | Y | Y | Y | Y | Y
of gene coexpression, MoG clustering implying possible
coregulation, and network inference amongst the regime-specific coclustered genes using a state-space framework. We can thus incorporate condition specificity of gene expression dynamics for understanding gene influences. Comparison of the proposed approach with other current procedures such as GGM or CoD reveals some of its strengths and suggests that it would complement existing approaches well (Table 10). We believe that
this approach, in conjunction with sequence and transcription factor binding information, can give very valuable clues
to understand the mechanisms of transcriptional regulation
in higher eukaryotes.
ACKNOWLEDGMENTS
The authors gratefully acknowledge the support of the NIH
under Award 5R01-GM028896-21 (JDE). The authors also
thank the three anonymous reviewers for constructive comments to improve this manuscript. The material in this paper
was presented in part at the IEEE International Workshop
on Genomic Signal Processing and Statistics 2005 (GENSIPS05).
REFERENCES
[1] C. Rangel, J. Angus, Z. Ghahramani, et al., “Modeling T-cell activation using gene expression profiling and state-space models,” Bioinformatics, vol. 20, no. 9, pp. 1361–1372, 2004.
[2] B.-E. Perrin, L. Ralaivola, A. Mazurie, S. Bottani, J. Mallet,
and F. D’Alché-Buc, “Gene networks inference using dynamic
Bayesian networks,” Bioinformatics, vol. 19, supplement 2, pp.
II138–II148, 2003.
[3] N. M. Luscombe, M. M. Babu, H. Yu, M. Snyder, S. A. Teichmann, and M. Gerstein, “Genomic analysis of regulatory
network dynamics reveals large topological changes,” Nature,
vol. 431, no. 7006, pp. 308–312, 2004.
[4] E. Sontag, A. Kiyatkin, and B. N. Kholodenko, “Inferring dynamic architecture of cellular networks using time series of
gene expression, protein and metabolite data,” Bioinformatics,
vol. 20, no. 12, pp. 1877–1886, 2004.
[5] S. Kim, H. Li, D. Russ, et al., “Context-sensitive probabilistic
Boolean networks to mimic biological regulation,” in Proceedings of Oncogenomics, Phoenix, Ariz, USA, January-February
2003.
[6] H. Li, C. L. Wood, Y. Liu, T. V. Getchell, M. L. Getchell, and A.
J. Stromberg, “Identification of gene expression patterns using
planned linear contrasts,” BMC Bioinformatics, vol. 7, p. 245,
2006.
[7] M. A. T. Figueiredo and A. K. Jain, “Unsupervised learning of
finite mixture models,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 24, no. 3, pp. 381–396, 2002.
[8] R. O. Stuart, K. T. Bush, and S. K. Nigam, “Changes in gene
expression patterns in the ureteric bud and metanephric mesenchyme in models of kidney development,” Kidney International, vol. 64, no. 6, pp. 1997–2008, 2003.
[9] M. Khandekar, N. Suzuki, J. Lewton, M. Yamamoto, and J.
D. Engel, “Multiple, distant Gata2 enhancers specify temporally and tissue-specific patterning in the developing urogenital system,” Molecular and Cellular Biology, vol. 24, no. 23, pp.
10263–10276, 2004.
[10] N. Golyandina, V. Nekrutkin, and A. Zhigljavsky, Analysis of
Time Series Structure—SSA and Related Techniques, Chapman
& Hall/CRC, New York, NY, USA, 2001.
[11] V. Moskvina and A. Zhigljavsky, “An algorithm based on singular spectrum analysis for change-point detection,” Communications in Statistics Part B: Simulation and Computation,
vol. 32, no. 2, pp. 319–352, 2003.
[12] K. Schwab, L. T. Patterson, B. J. Aronow, R. Luckas, H.-C.
Liang, and S. S. Potter, “A catalogue of gene expression in the
developing kidney,” Kidney International, vol. 64, no. 5, pp.
1588–1604, 2003.
[13] Y. Zhou, K.-C. Lim, K. Onodera, et al., “Rescue of the embryonic lethal hematopoietic defect reveals a critical role
for GATA-2 in urogenital development,” The EMBO Journal,
vol. 17, no. 22, pp. 6689–6700, 1998.
[14] G. A. Challen, G. Martinez, M. J. Davis, et al., “Identifying the
molecular phenotype of renal progenitor cells,” Journal of the
American Society of Nephrology, vol. 15, no. 9, pp. 2344–2357,
2004.
[15] NCBI Pubmed, http://www.ncbi.nlm.nih.gov/entrez/query.
fcgi.
[16] H. H. Zadeh, S. Tanavoli, D. D. Haines, and D. L. Kreutzer,
“Despite large-scale T cell activation, only a minor subset of T
cells responding in vitro to Actinobacillus actinomycetemcomitans differentiate into effector T cells,” Journal of Periodontal
Research, vol. 35, no. 3, pp. 127–136, 2000.
[17] Z. Ghahramani and G. E. Hinton, “Parameter estimation for
linear dynamical systems,” Tech. Rep., University of Toronto,
Toronto, Ontario, Canada, 1996.
[18] R. H. Shumway and D. S. Stoffer, Time Series Analysis and Its Applications, Springer Texts in Statistics, Springer, New York, NY,
USA, 2000.
[19] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Chapman &
Hall/CRC, New York, NY, USA, 1993.
[20] E. R. Dougherty, S. Kim, and Y. Chen, “Coefficient of determination in nonlinear signal processing,” Signal Processing,
vol. 80, no. 10, pp. 2219–2235, 2000.
[21] S. Kim, E. R. Dougherty, M. L. Bittner, et al., “General nonlinear framework for the analysis of gene interaction via multivariate expression arrays,” Journal of Biomedical Optics, vol. 5,
no. 4, pp. 411–424, 2000.
[22] R. Opgen-Rhein and K. Strimmer, “Using regularized dynamic correlation to infer gene dependency networks from
time-series microarray data,” in Proceedings of the 4th International Workshop on Computational Systems Biology (WCSB ’06), Tampere, Finland, June 2006.
[23] A. O. Hero III, G. Fleury, A. J. Mears, and A. Swaroop, “Multicriteria gene screening for analysis of differential expression with DNA microarrays,” EURASIP Journal on Applied Signal Processing, vol. 2004, no. 1, pp. 43–52, 2004, special issue on genomic signal processing.
[24] Z. Bar-Joseph, “Analyzing time series gene expression data,” Bioinformatics, vol. 20, no. 16, pp. 2493–2503, 2004.
[25] A. Kundaje, O. Antar, T. Jebara, and C. Leslie, “Learning regulatory networks from sparsely sampled time series expression data,” Tech. Rep., Columbia University, New York, NY, USA, 2002.
[26] J. E. Balmer and R. Blomhoff, “Gene expression regulation by retinoic acid,” Journal of Lipid Research, vol. 43, no. 11, pp. 1773–1808, 2002.
[27] A. F. Esquela and S.-J. Lee, “Regulation of metanephric kidney development by growth/differentiation factor 11,” Developmental Biology, vol. 257, no. 2, pp. 356–370, 2003.
[28] A. Maeshima, S. Yamashita, K. Maeshima, I. Kojima, and Y. Nojima, “Activin A produced by ureteric bud is a differentiation factor for metanephric mesenchyme,” Journal of the American Society of Nephrology, vol. 14, no. 6, pp. 1523–1534, 2003.
[29] M. Mori, N. B. Ghyselinck, P. Chambon, and M. Mark, “Systematic immunolocalization of retinoid receptors in developing and adult mouse eyes,” Investigative Ophthalmology and Visual Science, vol. 42, no. 6, pp. 1312–1318, 2001.
[30] K.-C. Lim, G. Lakshmanan, S. E. Crawford, Y. Gu, F. Grosveld, and J. D. Engel, “Gata3 loss leads to embryonic lethality due to noradrenaline deficiency of the sympathetic nervous system,” Nature Genetics, vol. 25, no. 2, pp. 209–212, 2000.
[31] H. Mizutani, L. T. May, P. B. Sehgal, and T. S. Kupper, “Synergistic interactions of IL-1 and IL-6 in T cell activation. Mitogen but not antigen receptor-induced proliferation of a cloned T helper cell line is enhanced by exogenous IL-6,” Journal of Immunology, vol. 143, no. 3, pp. 896–901, 1989.
[32] J.-X. Lin and W. J. Leonard, “The immediate-early gene product Egr-1 regulates the human interleukin-2 receptor β-chain promoter through noncanonical Egr and Sp1 binding sites,” Molecular and Cellular Biology, vol. 17, no. 7, pp. 3714–3722, 1997.
[33] M. J. Herrgård, M. W. Covert, and B. Ø. Palsson, “Reconciling gene expression data with known genome-scale regulatory network structures,” Genome Research, vol. 13, no. 11, pp. 2423–2434, 2003.
[34] C. Li and W. H. Wong, “Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 1, pp. 31–36, 2001.
[35] J. Schäfer and K. Strimmer, “An empirical Bayes approach to inferring large-scale gene association networks,” Bioinformatics, vol. 21, no. 6, pp. 754–764, 2005.
[36] A. Rao, A. O. Hero III, D. J. States, and J. D. Engel, “Inference of biologically relevant gene influence networks using the directed information criterion,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’06), vol. 2, pp. 1028–1031, Toulouse, France, May 2006.
Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 32454, 15 pages
doi:10.1155/2007/32454
Research Article
Inference of a Probabilistic Boolean Network from
a Single Observed Temporal Sequence
Stephen Marshall,1 Le Yu,1 Yufei Xiao,2 and Edward R. Dougherty2, 3, 4
1 Department of Electronic and Electrical Engineering, Faculty of Engineering, University of Strathclyde, Glasgow, G1 1XW, UK
2 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA
3 Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004, USA
4 Department of Pathology, University of Texas M. D. Anderson Cancer Center, Houston, TX 77030, USA
Received 10 July 2006; Revised 29 January 2007; Accepted 26 February 2007
Recommended by Tatsuya Akutsu
The inference of gene regulatory networks is a key issue for genomic signal processing. This paper addresses the inference of probabilistic Boolean networks (PBNs) from observed temporal sequences of network states. Since a PBN is composed of a finite number
of Boolean networks, a basic observation is that the characteristics of a single Boolean network without perturbation may be determined by its pairwise transitions. Because the network function is fixed and there are no perturbations, a given state will always
be followed by a unique state at the succeeding time point. Thus, a transition counting matrix compiled over a data sequence will
be sparse and contain only one nonzero entry per row. If the network also has perturbations, with small perturbation probability, then the
transition counting matrix would have some insignificant nonzero entries replacing some (or all) of the zeros. If a data sequence
is sufficiently long to adequately populate the matrix, then determination of the functions and inputs underlying the model is
straightforward. The difficulty comes when the transition counting matrix consists of data derived from more than one Boolean
network. We address the PBN inference procedure in several steps: (1) separate the data sequence into “pure” subsequences corresponding to constituent Boolean networks; (2) given a subsequence, infer a Boolean network; and (3) infer the probabilities of
perturbation, the probability of there being a switch between constituent Boolean networks, and the selection probabilities governing which network is to be selected given a switch. Capturing the full dynamic behavior of probabilistic Boolean networks,
be they binary or multivalued, will require the use of temporal data, and a great deal of it. This should not be surprising given
the complexity of the model and the number of parameters, both transitional and static, that must be estimated. In addition to
providing an inference algorithm, this paper demonstrates that the data requirement is much smaller if one does not wish to infer
the switching, perturbation, and selection probabilities, and that constituent-network connectivity can be discovered with decent
accuracy for relatively small time-course sequences.
Copyright © 2007 Stephen Marshall et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
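The transition counting matrix described above can be sketched as follows (states are integers encoding the Boolean gene-activity vectors; names are illustrative):

```python
import numpy as np

def transition_counts(states, n_states):
    """Transition counting matrix compiled over an observed state
    sequence. For a single Boolean network without perturbation each
    row has at most one nonzero entry; small perturbations add minor
    off-pattern counts."""
    M = np.zeros((n_states, n_states), dtype=int)
    for a, b in zip(states[:-1], states[1:]):
        M[a, b] += 1
    return M
```

For a deterministic network the dominant entry in each row identifies the successor state, from which the functions and inputs can be read off; mixing data from several constituent networks is what destroys this one-entry-per-row structure.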
1. INTRODUCTION
A key issue in genomic signal processing is the inference of
gene regulatory networks [1]. Many methods have been proposed and these are specific to the network model, for instance, Boolean networks [2–5], probabilistic Boolean networks [6–9], and Bayesian networks [10–12], the latter being
related to probabilistic Boolean networks [13]. The manner
of inference depends on the kind of data available and the
constraints one imposes on the inference. For instance, patient data do not consist of time-course measurements and
are assumed to come from the steady state of the network,
so that inference procedures cannot be expected to yield net-
works that accurately reflect dynamic behavior. Instead, one
might just hope to obtain a set of networks whose steady state
distributions are concordant, in some way, with the data.
Since inference involves selecting a network from a family
of networks, it can be beneficial to constrain the problem
by placing restrictions on the family, such as limited attractor structure and limited connectivity [5]. Alternatively one
might impose a structure on a probabilistic Boolean network
that resolves inconsistencies in the data arising from mixing
of data from several contexts [9].
This paper concerns inference of a probabilistic Boolean
network (PBN) from a single temporal sequence of network
states. Given a sufficiently long observation sequence, the
goal is to infer a PBN that is a good candidate to have generated it. This situation is analogous to that of designing a
Wiener filter from a single sufficiently long observation of
a wide-sense stationary stochastic process. Here, we will be
dealing with an ergodic process so that all transitional relations will be observed numerous times if the observed sequence is sufficiently long. Should one have the opportunity to observe multiple sequences, these can be used individually in the manner proposed and the results combined
to provide the desired inference. Note that we say we desire a good candidate, not the only candidate. Even with
constraints and a long sequence, there are many PBNs that
could have produced the sequence. This is typical in statistical inference. For instance, point estimation of the mean of
a distribution identifies a single value as the candidate for
the mean, and typically the probability of exactly estimating the mean is zero. What this paper provides, and what
is being provided in other papers on network inference, is
an inference procedure that generates a network that is to
some extent, and in some way, consistent with the observed
sequence.
We will not delve into arguments about Boolean or probabilistic Boolean network modeling, these issues having been
extensively discussed elsewhere [14–21]; however, we do note
that PBN modeling is being used as a framework in which to
apply control theory, in particular, dynamic programming,
to design optimal intervention strategies based on the gene
regulatory structure [22–25]. With current technology it is
not possible to obtain sufficiently long data sequences to estimate the model parameters; however, in addition to using randomly generated networks, we will apply the inference to data generated from a PBN derived from a Boolean
network model for the segment polarity genes in Drosophila
melanogaster [26], this being done by assuming that some
genes in the existing model cannot be observed, so that
they become latent variables outside the observable model
and therefore cause the kind of stochasticity associated with
PBNs.
It should be recognized that a key purpose of this paper is to present the PBN inference problem in a rigorous
framework so that observational requirements become clear.
In addition, it is hoped that a crisp analysis of the problem
will lead to more approximate solutions based on the kind
of temporal data that will become available; indeed, in this
paper we propose a subsampling strategy that greatly mitigates the number of observations needed for the construction of the network functions and their associated regulatory
gene sets.
2. PROBABILISTIC BOOLEAN NETWORKS
A Boolean network (BN) consists of a set of n variables,
{x0 , x1 , . . . , xn−1 }, where each variable can take on one of
two binary values, 0 or 1 [14, 15]. At any time point t
(t = 0, 1, 2, . . . ), the state of the network is defined by the
vector x(t) = (x0 (t), x1 (t), . . . , xn−1 (t)). For each variable xi ,
there exist a predictor set {xi0, xi1, . . . , xi,k(i)−1} and a transition function fi determining the value of xi at the next time point,

    xi(t + 1) = fi(xi0(t), xi1(t), . . . , xi,k(i)−1(t)),   (1)

where 0 ≤ i0 < i1 < · · · < i(k(i)−1) ≤ n − 1. It is typically the case that, relative to the transition function fi, many of the variables are nonessential, so that k(i) < n (or even k(i) ≪ n). Since the transition function is homogeneous in time, meaning that it is time invariant, we can simplify the notation by writing

    xi+ = fi(xi0, xi1, . . . , xi,k(i)−1).   (2)
The n transition functions, together with the associated predictor sets, supply all the information necessary to determine the time evolution of the states of a Boolean network,
x(0) → x(1) → · · · → x(t) → · · · . The set of transition functions constitutes the network function, denoted as
f = ( f0 , . . . , fn−1 ).
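To make these definitions concrete, the following minimal Python sketch (illustrative only; the three transition functions are arbitrary examples, not taken from the paper) evolves a 3-variable Boolean network from an initial state:

```python
# Minimal sketch of a Boolean network (BN): n = 3 variables, each with
# its own predictor set and transition function (functions illustrative).
def f0(x1):        # x0+ depends only on x1
    return x1

def f1(x0, x2):    # x1+ = x0 AND x2
    return x0 & x2

def f2(x0):        # x2+ = NOT x0
    return 1 - x0

def step(x):
    """Apply the network function f = (f0, f1, f2) to state x = (x0, x1, x2)."""
    x0, x1, x2 = x
    return (f0(x1), f1(x0, x2), f2(x0))

# Evolve the trajectory x(0) -> x(1) -> ... from an initial state.
x = (1, 0, 1)
trajectory = [x]
for _ in range(5):
    x = step(x)
    trajectory.append(x)
print(trajectory)
```

For this toy choice of functions the state (1, 0, 1) already lies on a length-2 attractor cycle with (0, 1, 0).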
Attractors play a key role in Boolean networks. Given a
starting state, within a finite number of steps, the network
will transition into a cycle of states, called an attractor cycle
(or simply, attractor), and will continue to cycle thereafter.
Nonattractor states are transient and are visited at most once
on any network trajectory. The level of a state is the number
of transitions required for the network to transition from the
state into an attractor cycle. In gene regulatory modeling, attractors are often identified with phenotypes [16].
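The attractor structure of the same illustrative 3-variable network can be computed by exhaustive iteration (a sketch, not the authors' code): each state is followed until a state repeats, which yields its attractor cycle and its level.

```python
from itertools import product

# Sketch: find the attractor cycles of the illustrative 3-variable BN by
# iterating every state until the trajectory first revisits a state.
def step(x):
    x0, x1, x2 = x
    return (x1, x0 & x2, 1 - x0)   # same illustrative functions as above

def attractor_and_level(x):
    """Return (attractor cycle as a tuple of states, level of x)."""
    seen = {}
    t = 0
    while x not in seen:
        seen[x] = t
        x = step(x)
        t += 1
    first = seen[x]                 # time at which the cycle was entered
    cycle = [s for s, i in sorted(seen.items(), key=lambda kv: kv[1]) if i >= first]
    return tuple(cycle), first      # level = number of transient steps

attractors = set()
for state in product((0, 1), repeat=3):
    cyc, level = attractor_and_level(state)
    k = cyc.index(min(cyc))         # normalize rotation so each cycle counts once
    attractors.add(cyc[k:] + cyc[:k])
print(attractors)
```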
A Boolean network with perturbation (BNp) is a Boolean network altered so that, at any moment t, there is a probability p of randomly flipping a variable of the current state x(t)
of the BN. An ordinary BN possesses a stationary distribution but except in very special circumstances does not possess
a steady-state distribution. The state space is partitioned into
sets of states called basins, each basin corresponding to the
attractor into which its states will transition in due time. On
the other hand, for a BNp there is the possibility of flipping
from the current state into any other state at each moment.
Hence, the BNp is ergodic as a random process and possesses
a steady-state distribution. By definition, the attractor cycles
of a BNp are the attractor cycles of the BN obtained by setting
p = 0.
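A single BNp time step can be sketched as follows (illustrative Python; the network function is the same toy example as above, and the perturbation probability is arbitrary):

```python
import random

# Sketch of one time step of a BN with perturbation (BNp): with
# probability p a randomly chosen variable of the state is flipped;
# otherwise the network function is applied.
def step(x):
    x0, x1, x2 = x
    return (x1, x0 & x2, 1 - x0)   # illustrative 3-variable network

def bnp_step(x, p, rng=random):
    if rng.random() < p:           # perturbation event
        i = rng.randrange(len(x))  # flip one uniformly chosen variable
        y = list(x)
        y[i] = 1 - y[i]
        return tuple(y)
    return step(x)                 # normal BN transition

random.seed(0)
x = (0, 0, 1)                      # a fixed point of the toy BN
seq = [x]
for _ in range(1000):
    x = bnp_step(x, p=0.2)
    seq.append(x)
# ergodicity: perturbation eventually moves the chain off the attractor
print(len(set(seq)))
```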
A probabilistic Boolean network (PBN) consists of a finite collection of Boolean networks with perturbation over
a fixed set of variables, where each Boolean network is defined by a fixed network function and all possess common
perturbation probability p [18, 20]. Moreover, at each moment, there is a probability q of switching out of the current
Boolean network to a different constituent Boolean network,
where each Boolean network composing the PBN has a probability (called selection probability) of being selected. If q = 1,
then a new network function is randomly selected at each
time point, and the PBN is said to be instantaneously random,
the idea being to model uncertainty in model selection; if q <
1, then the PBN remains in a given constituent Boolean network until a network switch and the PBN is said to be context
sensitive. The original introduction of PBNs considered only
instantaneously random PBNs [18] and using this model
PBNs were first used as the basis of applying control theory to
Stephen Marshall et al.
optimal intervention strategies to drive network dynamics in
favorable directions, such as away from metastatic states in
cancer [22]. Subsequently, context-sensitive PBNs were introduced to model the randomizing effect of latent variables
outside the network model and this leads to the development
of optimal intervention strategies that take into account the
effect of latent variables [23]. We defer to the literature for
a discussion of the role of latent variables [1]. Our interest
here is with context-sensitive PBNs, where q is assumed to be
small, so that on average, the network is governed by a constituent Boolean network for some amount of time before
switching to another constituent network. The perturbation
parameter p and the switching parameter q will be seen to
have effects on the proposed network-inference procedure.
By definition, the attractor cycles of a PBN are the attractor cycles of its constituent Boolean networks. While the
attractor cycles of a single Boolean network must be disjoint,
those of a PBN need not be disjoint since attractor cycles
from different constituent Boolean networks can intersect.
Owing to the possibility of perturbation, a PBN is ergodic
and possesses a steady-state distribution. We note that one
can define a PBN without perturbation but we will not do so.
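A context-sensitive PBN step can be sketched likewise (illustrative Python; the two constituent networks and the selection probabilities are invented for the example):

```python
import random

# Sketch of a context-sensitive PBN over 3 variables with two
# constituent BNs (functions illustrative).  At each step: switch
# context with probability q (choosing a different network via the
# selection probabilities), then perturb with probability p, else apply
# the current network function.
NETS = {
    "A": lambda x: (x[1], x[0] & x[2], 1 - x[0]),
    "B": lambda x: (x[0] | x[1], x[2], x[1]),
}
SELECT = {"A": {"B": 1.0}, "B": {"A": 1.0}}   # cannot switch into itself

def pbn_step(x, net, p, q, rng=random):
    if rng.random() < q:                       # network switch
        choices, weights = zip(*SELECT[net].items())
        net = rng.choices(choices, weights)[0]
    if rng.random() < p:                       # random perturbation
        i = rng.randrange(len(x))
        x = tuple(v if j != i else 1 - v for j, v in enumerate(x))
    else:
        x = NETS[net](x)
    return x, net

random.seed(1)
x, net = (1, 0, 1), "A"
contexts = set()
for _ in range(200):
    x, net = pbn_step(x, net, p=0.01, q=0.05)
    contexts.add(net)
print(contexts)
```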
Let us close this section by noting that there is nothing inherently necessary about the quantization {0, 1} for a PBN;
indeed, PBN modeling is often done with the ternary quantization corresponding to a gene being down regulated (−1),
up regulated (1), or invariant (0). For any finite quantization
the model is still referred to as a PBN. In this paper we stay
with binary quantization for simplicity but it should be evident that the methodology applies to any finite quantization,
albeit, with greater complexity.
3. INFERENCE PROCEDURE FOR BOOLEAN NETWORKS WITH PERTURBATION
We first consider the inference of a single Boolean network
with perturbation. Once this is accomplished, our task in the
context of PBNs will be reduced to locating the data in the
observed sequence corresponding to the various constituent
Boolean networks.
3.1. Inference based on the transition counting matrix and a cost function
The characteristics of a Boolean network, with or without
perturbation, can be estimated by observing its pairwise state
transitions, x(t) → x(t + 1), where x(t) can be an arbitrary vector from the n-dimensional state space Bn = {0, 1}n .
The states in Bn are ordered lexicographically according to
{00 · · · 0, 00 · · · 1, . . . , 11 · · · 1}. Given a temporal data sequence x(0), . . . , x(N), a transition counting matrix C can be
compiled over the data sequence showing the number ci j of
state transitions from the ith state to the jth state having occurred,
        ⎡ c0,0        c0,1        · · ·   c0,2^n−1       ⎤
    C = ⎢ c1,0        c1,1        · · ·   c1,2^n−1       ⎥ .   (3)
        ⎢ · · ·       · · ·       · · ·   · · ·          ⎥
        ⎣ c2^n−1,0    c2^n−1,1    · · ·   c2^n−1,2^n−1   ⎦
If the temporal data sequence results from a BN without perturbations, then a given state will always be followed by a
unique state at the next time point, and each row of matrix C
contains at most one nonzero value. A typical nonzero entry
will correspond to a transition of the form a0 a1 · · · am →
b0 b1 · · · bm . If {xi0 , xi1 , . . . , xi,k(i)−1 } is the predictor set for
xi , because the variables outside the set {xi0 , xi1 , . . . , xi,k(i)−1 }
have no effect on fi , this tells us that fi (ai0 , ai1 , . . . , ai,k(i)−1 ) =
bi and one row of the truth table defining fi is obtained. The
single transition a0 a1 · · · am → b0 b1 · · · bm gives one row of
each transition function for the BN. Given the deterministic nature of a BN, we will not be able to sufficiently populate the
matrix C on a single observed sequence because, based on
the initial state, the BN will transition into an attractor cycle
and remain there. Therefore, we need to observe many runs
from different initial states.
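Compiling C from an observed sequence is straightforward; the sketch below (illustrative Python, 2-variable example) indexes states lexicographically by reading the state vector as a binary number:

```python
# Sketch: compile the 2^n x 2^n transition counting matrix C of (3)
# from an observed state sequence x(0), ..., x(N).
def state_index(x):
    idx = 0
    for v in x:
        idx = (idx << 1) | v       # lexicographic (binary) state index
    return idx

def transition_counts(seq, n):
    C = [[0] * (2 ** n) for _ in range(2 ** n)]
    for a, b in zip(seq, seq[1:]):
        C[state_index(a)][state_index(b)] += 1
    return C

# tiny invented observation sequence over n = 2 variables
seq = [(0, 0), (0, 1), (1, 0), (0, 1), (1, 0), (0, 0)]
C = transition_counts(seq, n=2)
for row in C:
    print(row)
```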
For a BNp with small perturbation probability, C will
likely have some nonzero entries replacing some (or all) of
the 0 entries. Owing to perturbation and the consequent
ergodicity, a sufficiently long data sequence will sufficiently
populate the matrix to determine the entries caused by perturbation, as well as the functions and inputs underlying the
model. A mapping x(t) → x(t+1) will have been derived linking pairs of state vectors. This mapping induces n transition
functions determining the state of each variable at time t + 1
as a function of its predictors at time t, which are precisely
shown in (1) or (2). Given sufficient data, the functions and
the set of essential predictors may be determined by Boolean
reduction.
The task is facilitated by treating one variable at a time.
Given any variable, xi , and keeping in mind that some observed state transitions arise from random perturbations
rather than transition functions, we wish to find the k(i)
variables that control xi . The k(i) input variables that most
closely correlate with the behavior of xi will be identified as
the predictors. Specifically, the next state of variable xi is a
function of k(i) variables, as in (2). The transition counting matrix will contain one large single value on each line
(plus some “noise”). This value indicates the next state that
follows the current state of the sequence. It is therefore possible to create a two-column next-state table with current-state
column x0 x1 · · · xn−1 and next-state column x0+ x1+ · · · xn−1+, there being 2^n rows in the table, a typical entry looking like
00101 → 11001 in the case of 5 variables. If the states are
written in terms of their individual variables, then a mapping is produced from n variables to n variables, where the
next state of any variable may be written as a function of all
n input variables. The problem is to determine which subset consisting of k(i) out of the n variables is the minimal set
needed to predict xi , for i = 0, 1, . . . , n − 1. We refer to the k(i)
variables in the minimal predictor set as essential predictors.
To determine the essential predictors for a given variable,
xi , we will define a cost function. Assuming k variables are
used to predict xi, there are n!/((n − k)!k!) ways of choosing them. Each k, together with a choice of variables, has a cost. By
minimizing the cost function, we can identify k such that
k = k(i), as well as the predictor set. In a Boolean network
without perturbation, if the value of xi is fully determined
Table 1: Effect of essential variables. [Table showing current states (x0, x1, x2, x3, x4) and next states (x0+, x1+, x2+, x3+, x4+); all inputs with the same values of x0, x2, and x3 should result in the same output.]
by the predictor set, {xi0, xi1, . . . , xi,k−1}, then this value will not change for different combinations of the remaining variables, which are nonessential insofar as xi is concerned.
Hence, so long as xi0 , xi1 , . . . , xi,k−1 are fixed, the value of xi
should remain 0 or 1, regardless of the values of the remaining variables. For any given realization (xi0 , xi1 , . . . , xi,k−1 ) =
(ai0, ai1, . . . , ai,k−1), aij ∈ {0, 1}, let

    ui0,i1,...,i(k−1)(ai0, ai1, . . . , ai,k−1) = Σ_{xi0 = ai0, ..., xi,k−1 = ai,k−1} xi+(x0, x1, . . . , xn−1).   (4)

According to this equation, ui0,i1,...,i(k−1)(ai0, . . . , ai,k−1) is the sum of the next-state values assuming xi0, . . . , xi,k−1 are held fixed at ai0, . . . , ai,k−1, respectively. There will be 2^(n−k) lines in the next-state table where (xi0, . . . , xi,k−1) = (ai0, . . . , ai,k−1), while the other variables can vary; thus, there will be 2^(n−k) terms in the summation. For instance, for the example in Table 1, when xi = x1, k = 3, i0 = 0, i1 = 2, and i2 = 3, that is, x1+ = f1(x0, ∗, x2, x3, ∗), we have

    u0,2,3(0, 1, 1) = x1+(0, 0, 1, 1, 0) + x1+(0, 0, 1, 1, 1) + x1+(0, 1, 1, 1, 0) + x1+(0, 1, 1, 1, 1).   (5)

The term ui0,i1,...,i(k−1)(ai0, . . . , ai,k−1) attains its maximum (2^(n−k)) or minimum (0) if the value of xi+ remains unchanged on the 2^(n−k) lines in the next-state table, which is the case in the above example. Hence, the k inputs are good predictors of the function if ui0,i1,...,i(k−1)(ai0, . . . , ai,k−1) is close to either 0 or 2^(n−k).

The cost function is based on the quantity ri0,i1,...,i(k−1)(ai0, . . . , ai,k−1), where, abbreviating u = ui0,i1,...,i(k−1)(ai0, . . . , ai,k−1),

    ri0,i1,...,i(k−1)(ai0, . . . , ai,k−1) = u · I[u ≤ 2^(n−k)/2] + (2^(n−k) − u) · I[u > 2^(n−k)/2],   (6)

and I is the characteristic function: I(w) = 1 if w is true and I(w) = 0 if w is false. The term ri0,i1,...,i(k−1)(ai0, . . . , ai,k−1) is designed to be minimized if ui0,i1,...,i(k−1)(ai0, . . . , ai,k−1) is close to either 0 or 2^(n−k). It represents a summation over one single realization of the variables xi0, xi1, . . . , xi,k−1. Therefore, we define the cost function R by summing the individual costs over all possible realizations of xi0, xi1, . . . , xi,k−1:

    R(xi0, xi1, . . . , xi,k−1) = Σ_{ai0, ai1, ..., ai,k−1 ∈ {0,1}} ri0,i1,...,i(k−1)(ai0, . . . , ai,k−1).   (7)
The essential predictors for variable xi are chosen to be the
k variables that minimize the cost R(xi0 , xi1 , . . . , xi,k−1 ) and k
is selected as the smallest integer to achieve the minimum.
We emphasize the smallest because if k (k < n) variables
can perfectly predict xi , then adding one more variable also
achieves the minimum cost. For small numbers of variables,
the k inputs may be chosen by a full search, with the cost
function being evaluated for every combination. For larger
numbers of variables, genetic algorithms can be used to minimize the cost function.
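The cost-based predictor search of (4), (6), and (7) can be sketched as follows (illustrative Python; the target function x1+ = x0 AND x2 in a 4-variable network and a complete next-state table are assumed for the example):

```python
from itertools import combinations, product

# Sketch of the predictor search: for a target variable x_i, score every
# candidate predictor set by the cost R of (7), built from u of (4) and
# r of (6), over a complete next-state table.
n = 4
# next-state column x1+ for the illustrative target x1+ = x0 AND x2
next_x1 = {x: x[0] & x[2] for x in product((0, 1), repeat=n)}

def cost(pred):
    """R(x_{i0},...,x_{i,k-1}) for a candidate predictor index set."""
    k = len(pred)
    R = 0
    for a in product((0, 1), repeat=k):            # each realization of (6)
        u = sum(v for x, v in next_x1.items()
                if all(x[j] == aj for j, aj in zip(pred, a)))   # u of (4)
        R += u if u <= 2 ** (n - k) / 2 else 2 ** (n - k) - u   # r of (6)
    return R

# smallest k whose best predictor set achieves zero cost
for k in range(1, n + 1):
    best = min(combinations(range(n), k), key=cost)
    if cost(best) == 0:
        break
print(k, best)
```

On this example the search correctly stops at k = 2 with the essential predictor set {x0, x2}.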
In some cases the next-state table is not fully defined, owing to insufficient temporal data. This means that there are do-not-care outputs. Tests have shown that the input variables may still be identified correctly even with 90% of the data missing.
Once the input set of variables is determined, it is
straightforward to determine the functional relationship by
Boolean minimization [27]. In many cases the observed data
are insufficient to specify the behavior of the function for every combination of input variables; however, by setting the
unknown states as do-not-care terms, an accurate approximation of the true function may be achieved. The task is
simplified when the number k of input variables is small.
3.2. Complexity of the procedure
We now consider the complexity of the proposed inference
procedure. The truth table consists of n genes and therefore
Table 2: Values of Ξn,k.

n      k = 2           k = 3           k = 4
5      11430           16480           17545
6      86898           141210          159060
7      5.84 × 10^5     1.06 × 10^6     1.28 × 10^5
8      3.61 × 10^6     7.17 × 10^6     9.32 × 10^6
9      2.11 × 10^7     4.55 × 10^7     6.35 × 10^7
10     1.18 × 10^8     2.74 × 10^8     4.09 × 10^8
15     4.23 × 10^11    1.34 × 10^12    2.71 × 10^12
20     1.04 × 10^15    4.17 × 10^15    1.08 × 10^16
30     3.76 × 10^21    5.52 × 10^21    6.47 × 10^22
50     1.94 × 10^34    1.74 × 10^35    1.09 × 10^35
Table 3: Computation times.

n      k = 2    k = 3    k = 4
5      < 1 s    < 1 s    < 1 s
6      < 1 s    < 1 s    < 1 s
7      < 1 s    < 1 s    < 1 s
8      2 s      6 s      9 s
9      12 s     36 s     68 s
10     69 s     214 s    472 s
11     476 s    2109 s   3097 s
has 2^n lines. We wish to identify the k predictors which best describe the behavior of each gene. Each gene has a total of C(n, k) = n!/((n − k)!k!) possible sets of k predictors. Each of these sets of k predictors has 2^k different combinations of values. For every specific combination there are 2^(n−k) lines of the truth table. These are lines where the predictors are fixed but the values of the other (nonpredictor) genes change. These must be processed according to (5), (6), and (7).
The individual terms in (5) are binary values, 0 or 1. The cost function in (7) is designed to be minimized when the terms in (5) are either all 0 or all 1; that is, when the sum is at either its minimum or maximum value. Simulations have shown that this may be more efficiently computed by carrying out all pairwise comparisons of terms and recording the number of times they differ. Hence a summation has been replaced by a computationally more efficient series of comparison operations. The number of pairs in a set of 2^(n−k) values is 2^(n−k−1)(2^(n−k) − 1). Therefore, the total number of comparisons for a given n and k is given by

    ξn,k = n · (n!/((n − k)! k!)) · 2^k · 2^(n−k) · 2^(n−k−1) (2^(n−k) − 1) = n · (n!/((n − k)! k!)) · 2^(2n−k−1) (2^(n−k) − 1).   (8)
This expression gives the number of comparisons for a fixed
value of k; however, if we wish to compute the number of
comparisons for all values of predictors, up to and including
k, then this is given by
    Ξn,k = Σ_{j=1}^{k} n · (n!/((n − j)! j!)) · 2^(2n−j−1) (2^(n−j) − 1).   (9)
Values for Ξn,k are given in Table 2 and actual computation
times taken on an Intel Pentium 4 with a 2.0 GHz clock and
768 MB of RAM are given in Table 3.
The values are quite consistent given the additional computational overheads not accounted for in (9). Even for 10
genes and up to 4 selectors, the computation time is less than
8 minutes. Because the inference procedure for one BN does not depend on any other BN, the inference of multiple BNs can be run in parallel, so that time complexity is not an issue.
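For reference, (9) can be evaluated directly (a transcription of the formula as printed, not the authors' optimized comparison-counting implementation):

```python
from math import comb

# Sketch: evaluate the comparison count of (9),
#   Xi(n, k) = sum_{j=1..k} n * C(n, j) * 2^(2n-j-1) * (2^(n-j) - 1).
def xi_total(n, k):
    return sum(n * comb(n, j) * 2 ** (2 * n - j - 1) * (2 ** (n - j) - 1)
               for j in range(1, k + 1))

print(xi_total(10, 4))
```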
4. INFERENCE PROCEDURE FOR PROBABILISTIC BOOLEAN NETWORKS
PBN inference is addressed in three steps: (1) split the temporal data sequence into subsequences corresponding to constituent Boolean networks; (2) apply the preceding inference
procedure to each subsequence; and (3) infer the perturbation, switching, and selection probabilities. Having already
treated estimation of a BNp, in this section we address the
first and third steps.
4.1. Determining pure subsequences
The first objective is to identify points within the temporal
data sequence where there is a switch of constituent Boolean
networks. Between any two successive switch points there
will lie a pure temporal subsequence generated by a single
constituent network. The transition counting matrix resulting from a sufficiently long pure temporal subsequence will
have one large value in each row, with the remainder in each
row being small (resulting from perturbation). Any measure
of purity should therefore be maximized when the largest
value in each row is significantly larger than any other value.
The value of the transition counting matrix at row i and column j has already been defined in (3) as cij. Let the largest value of cij in row i be denoted ci(1) and the second largest value ci(2). The quantity ci(1) − ci(2) is proposed as the basis of a purity function to determine the likelihood that the temporal subsequence lying between two data points is pure. As the quantity relates to an individual row of the transition matrix, it is summed over all rows and normalized by the total value of the elements to give a single value P for each matrix:

    P = [ Σ_{i=0}^{2^n−1} (ci(1) − ci(2)) ] / [ Σ_{i=0}^{2^n−1} Σ_{j=0}^{2^n−1} cij ].   (10)
The purity function P is maximized for a state transition matrix when each row contains only one single large value and
the remaining values on each row are zero.
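The purity measure of (10) can be sketched as follows (illustrative Python; the two small matrices are invented examples):

```python
# Sketch: the purity measure of (10) for a transition counting matrix --
# the gap between the largest and second-largest entry of each row,
# summed over rows and normalized by the total count.
def purity(C):
    num = 0
    total = 0
    for row in C:
        top = sorted(row, reverse=True)
        num += top[0] - top[1]
        total += sum(row)
    return num / total if total else 0.0

pure = [[9, 1, 0], [0, 10, 1], [8, 0, 1]]      # one dominant entry per row
mixed = [[5, 5, 0], [4, 3, 3], [3, 4, 3]]      # no dominant entries
print(purity(pure), purity(mixed))
```

As expected, the matrix with one dominant entry per row scores much higher than the mixed one.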
To illustrate the purity function, consider a temporal data
sequence of length N generated from two Boolean networks.
The first section of the sequence, from 0 to N1 , has been
generated from the first network and the remainder of the
sequence, from N1 + 1 to N − 1, has been generated from
the second network. We desire an estimate η of the switch
point N1 . The variable η splits the data sequence into two
parts and 0 ≤ η ≤ N − 1. The problem of locating the
switch point, and hence partitioning the data sequence, reduces to a search to locate N1 . To accomplish this, a trial
switch point, G, is varied and the data sets before and after
Figure 1: Switch point estimation: (a) data sequence divided by a sliding point G and transition matrices produced for the data on each side of the partition; (b) purity functions from W and V; (c) simple function of the two purity functions indicating the switch point between models.
it are mapped into two different transition counting matrices, W and V . The ideal purity factor is a function which
is maximized for both W and V when G = N1 . The procedure is illustrated in Figure 1. Figure 1(a) shows how the
data are mapped from either side of a sliding point into the
transition matrices. Figure 1(b) shows the purity functions
derived from the transition counting matrices of W and V .
Figure 1(c) shows a simple functional of W and V (in this
case their product), which gives a peak at the correct switch
point. The estimate η of the switch point is detected via a
threshold.
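The switch-point search can be sketched end to end (illustrative Python on a synthetic two-network sequence with the true switch at index 100; the counting and purity helpers are restated so the block is self-contained):

```python
# Sketch of the switch-point search: slide a trial point G through the
# sequence, build counting matrices W and V for the data before and
# after G, and take the product of their purity values; the switch-point
# estimate is where the product peaks.
def state_index(x):
    i = 0
    for v in x:
        i = (i << 1) | v
    return i

def transition_counts(seq, n):
    C = [[0] * (2 ** n) for _ in range(2 ** n)]
    for a, b in zip(seq, seq[1:]):
        C[state_index(a)][state_index(b)] += 1
    return C

def purity(C):
    num = total = 0
    for row in C:
        top = sorted(row, reverse=True)
        num += top[0] - top[1]
        total += sum(row)
    return num / total if total else 0.0

# network 1 cycles (0,0)->(0,1); network 2 cycles (1,1)->(1,0);
# the true switch point is at index 100
seq = [(0, 0), (0, 1)] * 50 + [(1, 1), (1, 0)] * 50
best_G = max(range(10, len(seq) - 10),
             key=lambda G: purity(transition_counts(seq[:G], 2))
                         * purity(transition_counts(seq[G:], 2)))
print(best_G)   # peaks within a couple of steps of the true switch point
```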
Figure 2: Passes for partitioning: the overall sequence is divided at
the first pass into two shorter subsequences for testing. This is repeated in a second pass with the start and end points of the subsequences offset in order to avoid missing a switch point due to
chaotic behavior.
The method described so far works well provided the sequence to be partitioned derives from two networks and the
switch point does not lie close to the edge of the sequence. If
the switch point lies close to the start or end of the sequence,
then one of the transition counting matrices will be insufficiently populated, thereby causing the purity function to exhibit chaotic behavior.
If the data sequence is long and there is possibly a large
number of switch points, then the sequence can be divided
into a series of shorter subsequences that are individually
tested by the method described. Owing to the effects of
chaotic behavior near subsequence borders, the method is
repeated in a second pass in which the sequence is again divided into shorter subsequences but with the start and end
points offset (see Figure 2). This ensures that a switch point
will not be missed simply because it lies close to the edge of
the data subsequence being tested.
The purity function provides a measure of the difference
in the relative behavior of two Boolean networks. It is possible that two Boolean networks can be different but still
have many common transitions between their states. In this
case the purity function will indicate a smaller distinction between the two models. This is particularly true where the two
models have common attractors. Moreover, on average, the
value of the purity function may vary greatly between subsequences. Hence, we apply the following normalization to
obtain a normalized purity value:
    Pnorm = (P − T) / T,   (11)
where P is the purity value in the window and T is either the
mean or geometric mean of the window values. The normalization removes differences in the ranges and average values
of points in different subsequences, thereby making it easier
to identify genuine peaks resulting from switches between
Boolean networks.
If two constituent Boolean networks are very similar,
then it is more difficult to distinguish them and they may
be identified as being the same on account of insufficient
or noisy data. This kind of problem is inherent to any inference procedure. If two networks are identified with each other during inference, this will affect the switching probability because it will be based on the inferred model, which will have fewer constituent Boolean networks because some have been merged. In practice, noisy data are typically problematic owing to overfitting, the result being spurious constituent Boolean networks in the inferred model. This overfitting problem has been addressed elsewhere by using Hamming-distance filters to identify close data profiles [9]. By identifying similar networks, the currently proposed procedure
acts like a lowpass filter and thereby mitigates overfitting. As with any lowpass filter, discrimination capacity is
diminished.
4.2. Estimation of the switching, selection, and perturbation probabilities

So far we have been concerned with identifying a family of Boolean networks composing a PBN; much longer data sequences are required to estimate the switching, selection, and perturbation probabilities. The switching probability may be estimated simply by dividing the number of switch points found by the total sequence length. The perturbation probability is estimated by identifying those transitions in the sequence not determined by a constituent-network function. For every data point, the next state is predicted using the model that has been found. If the predicted state does not match the actual state, then it is recorded as being caused by perturbation. Switch points are omitted from this process. The perturbation rate is then calculated by dividing the total instances of perturbation by the length of the data sequence.

Regarding the selection probabilities, we assume that a constituent network cannot switch into itself; otherwise there would be no switch. This assumption is consistent with the heuristic that a switch results from the change of a latent variable that in turn results in a change of the network structure. Thus, the selection probabilities are conditional, depending on the current network. The conditional probabilities are of the form qAB, which gives the probability of selecting network B during a switch, given the current network is A, and qAB is estimated by dividing the number of times the data sequence switches from A to B by the number of times it switches out of A.

In all cases, the length N of the sequence necessary to obtain good estimates is key. This issue is related to how often we expect to observe a perturbation, network switch, or network selection during a data sequence. It can be addressed in terms of the relevant network parameters.

We first consider estimation of the perturbation probability p. Note that we have defined p as the probability of making a random state selection, whereas in some papers each variable is given a probability of randomly changing. If the observed sequence has length N and we let X denote the number of perturbations (0 or 1) at a given time point, then the mean of X is p and the estimate p̂ we are using for p is the sample mean of X for a random sample of size N, the sample being random because perturbations are independent. The expected number of perturbations is Np, which is the mean of the random variable S given by an independent sum of N random variables identically distributed to X. S possesses a binomial distribution with variance Np(1 − p). A measure of goodness of the estimator is given by

    P(| p̂ − p | < ε) = P(| Np − S | < Nε)   (12)

for ε > 0. Because S possesses a binomial distribution, this probability is directly expressible in terms of the binomial density, which means that the goodness of our estimator is completely characterized. This computation is problematic for large N, but if N is sufficiently large so that the rule-of-thumb min{Np, N(1 − p)} > 5 is satisfied, then the normal approximation to the binomial distribution can be used. Chebyshev's inequality provides a lower bound:

    P(| p̂ − p | < ε) = 1 − P(| Np − S | ≥ Nε) ≥ 1 − p(1 − p)/(Nε²).   (13)
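The behavior of the estimator and the bound in (13) can be checked by simulation (a sketch; perturbation events are drawn directly as independent Bernoulli(p) trials, and the seed, N, and ε are arbitrary choices):

```python
import random

# Sketch: empirical check of the perturbation-probability estimator
# p_hat = S/N against the Chebyshev lower bound of (13).
random.seed(2)
p, N, eps = 0.01, 50_000, 0.005
S = sum(random.random() < p for _ in range(N))   # binomial S
p_hat = S / N
chebyshev_lower = 1 - p * (1 - p) / (N * eps ** 2)
print(p_hat, chebyshev_lower)
```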
A good estimate is very likely if N is sufficiently large to
make the fraction very small. Although often loose, Chebyshev’s inequality provides an asymptotic guarantee of goodness. The salient issue is that the expected number of perturbations (in the denominator) becomes large.
A completely analogous analysis applies to the switching
probability q, with q replacing p and q̂ replacing p̂ in (12)
and (13), with Nq being the expected number of switches.
To estimate the selection probabilities, let pij be the probability of selecting network Bj given a switch is called for and the current network is Bi, p̂ij its estimator, ri the probability of observing a switch out of network Bi, r̂i the estimator of ri formed by dividing the number of times the PBN is observed switching out of Bi by N, sij the probability of observing a switch from network Bi to network Bj, and ŝij the estimator of sij formed by dividing the number of times the PBN is observed switching out of Bi into Bj by N. The estimator of interest, p̂ij, can be expressed as ŝij / r̂i. The probability of observing a switch out of Bi is given by qP(Bi), where
P(Bi ) is the probability that the PBN is in Bi , so that the expected number of times such a switch is observed is given by
NqP(Bi ). There is an obvious issue here: P(Bi ) is not a model
parameter. We will return to this issue.
Let us first consider sij. Define the following events: A^t is a switch at time t, Bi^t is the event of the PBN being in network Bi at time t, and [Bi → Bj]^t is the event that Bi switches to Bj at time t. Then, because the occurrence of a switch is independent of the current network,

    P([Bi → Bj]^t) = P(A^t) P(Bi^{t−1}) P(Bi → Bj | Bi^{t−1}) = q P(Bi^{t−1}) pij.   (14)
The probability of interest depends on the time, as does
the probability of being in a particular constituent network;
however, if we assume the PBN is in the steady state, then the
time parameters drop out to yield
    P(Bi → Bj) = q P(Bi) pij.   (15)
Therefore the number of times we expect to see a switch from
Bi to B j is given by NqP(Bi )pi j .
Let us now return to the issue of $P(B_i)$ not being a model parameter. In fact, although it is not directly a model parameter, it can be expressed in terms of the model parameters so long as we assume we are in the steady state. Since

$$B_i^t = \Bigl[(A^t)^c \cap B_i^{t-1}\Bigr] \cup \bigcup_{j \neq i} \Bigl[A^t \cap B_j^{t-1} \cap [B_j \to B_i]^t\Bigr], \tag{16}$$

a straightforward probability analysis yields

$$P\bigl(B_i^t\bigr) = (1-q)\,P\bigl(B_i^{t-1}\bigr) + q \sum_{j \neq i} P\bigl(B_j^{t-1}\bigr)\,P\bigl([B_j \to B_i]^t \mid B_j^{t-1}\bigr). \tag{17}$$

Under the steady-state assumption the time parameters may be dropped to yield

$$P(B_i) = \sum_{j \neq i} p_{ji}\,P(B_j). \tag{18}$$

Hence, the network probabilities are given in terms of the selection probabilities by

$$0 = \begin{pmatrix} -1 & p_{21} & \cdots & p_{m1} \\ p_{12} & -1 & \cdots & p_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ p_{1m} & p_{2m} & \cdots & -1 \end{pmatrix} \begin{pmatrix} P(B_1) \\ P(B_2) \\ \vdots \\ P(B_m) \end{pmatrix}. \tag{19}$$

Table 4: Average percentage of predictors and functions recovered from 10^4 BN sequences consisting of n = 7 variables for k = 2 and k = 3, and P = .01.

                   Predictors recovered (%)    Functions recovered (%)
Sequence length      k = 2      k = 3            k = 2      k = 3
500                  46.27      21.85            34.59      12.26
1000                 54.33      28.24            45.22      19.98
2000                 71.71      29.84            64.28      22.03
4000                 98.08      34.87            96.73      28.53
6000                 98.11      50.12            97.75      42.53
8000                 98.18      50.69            97.87      43.23
10 000               98.80      51.39            98.25      43.74
20 000               100        78.39            98.33      69.29
30 000               100        85.89            99.67      79.66
40 000               100        87.98            99.75      80.25

5. EXPERIMENTAL RESULTS
A variety of experiments have been performed to assess the proposed algorithm. These include experiments on single BNs, on PBNs, and on real data. Insofar as the switching, selection, and perturbation probabilities are concerned, their estimation has been characterized analytically in the previous section, so we will not be concerned with them here.
Thus, we are concerned with the percentages of the predictors and functions recovered from a generated sequence. Letting $c_p$ and $t_p$ be the number of predictors correctly identified and the total number of predictors in the network, respectively, the percentage, $\pi_p$, of predictors correctly identified is given by

$$\pi_p = \frac{c_p}{t_p} \times 100. \tag{20}$$

Letting $c_f$ and $t_f$ be the number of function outputs correctly identified and the total number of function outputs in the network, respectively, the percentage, $\pi_f$, of function outputs correctly identified is given by

$$\pi_f = \frac{c_f}{t_f} \times 100. \tag{21}$$

The functions may be written as truth tables, and $\pi_f$ corresponds to the percentage of lines in all the truth tables recovered from the data that correctly match the lines of the truth tables for the original functions.

5.1. Single Boolean networks

When inferring the parameters of single BNs from data sequences by our method, it was found that the predictors and functions underlying the data could be determined very accurately from a limited number of observations. This means that even when only a small number of the total states and possible transitions of the model are observed, the parameters can still be extracted.

These tests have been conducted using a database of 80 sequences generated by single BNs with perturbation. These have been constructed by randomly generating 16 BNs with n = 7 variables and connectivity k = 2 or k = 3, and P = .01. The sequence lengths vary in 10 steps from 500 to 40 000, as shown in Table 4. The table shows the percentages of the predictors and functions recovered from a sequence generated by a single BN, that is, a pure sequence with n = 7, for k = 2 or k = 3, expressed as a function of the overall sequence length. The average percentages of predictors and functions recovered from BN sequences with k = 2 are much higher than for k = 3 at the same sequence length.

5.2. Probabilistic Boolean networks

For the analysis of PBN inference, we have constructed two databases consisting of sequences generated by PBNs with n = 7 genes.

(i) Database A: the sequences are generated by 80 randomly generated PBNs, and the sequence lengths vary in 10 steps from 2000 to 500 000, each with different values of p and q and two different levels of connectivity k.

(ii) Database B: 200 sequences of length 100 000 are generated from 200 randomly generated PBNs, each having 4 constituent BNs with k = 3 predictors. The switching probability q varies over 10 values: .0001, .0002, .0005, .001, .002, .005, .01, .02, .05, .1.
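The recovery percentages (20) and (21) are simple ratios once the true and inferred predictor sets and truth tables are available; a minimal sketch (the dictionary layout is our own assumption, not the paper's implementation):

```python
def predictor_recovery(true_preds, inferred_preds):
    """pi_p: percentage of predictors correctly identified.
    Each argument maps a gene to its set of predictor genes."""
    t_p = sum(len(s) for s in true_preds.values())
    c_p = sum(len(true_preds[g] & inferred_preds.get(g, set()))
              for g in true_preds)
    return 100.0 * c_p / t_p

def function_recovery(true_tables, inferred_tables):
    """pi_f: percentage of truth-table lines correctly recovered.
    Each argument maps a gene to a tuple of truth-table output lines."""
    t_f = sum(len(t) for t in true_tables.values())
    c_f = sum(a == b
              for g, t in true_tables.items()
              for a, b in zip(t, inferred_tables.get(g, ())))
    return 100.0 * c_f / t_f

# Toy example with two genes:
true_p = {"x1": {"x1", "x2"}, "x2": {"x1", "x3"}}
inf_p = {"x1": {"x1", "x2"}, "x2": {"x1", "x2"}}
pi_p = predictor_recovery(true_p, inf_p)  # 3 of 4 predictors -> 75.0
```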
Stephen Marshall et al.
The key issue for PBNs is how the inference algorithm works relative to the identification of switch points via the purity function. If the data sequence is successfully partitioned into pure sequences, each generated by a constituent BN, then the BN results show that the predictors and functions can be accurately determined from a limited number of observations. Hence, our main concern with PBNs is apprehending the effects of the switching probability q, perturbation probability p, connectivity k, and sequence length. For instance, if there is a low switching probability, say q = .001, then the resulting pure subsequences may be several hundred data points long. So while each BN may be characterized from a few hundred data points, it may be necessary to observe a very long sequence simply to encounter all of the constituent BNs.

When analyzing long sequences there are two strategies that can be applied after the data have been partitioned into pure subsequences.

(1) Select one subsequence for each BN and analyze that only.

(2) Collate all subsequences generated by the same BN and analyze each set.

Using the first strategy, the accuracy of the recovery of the predictors and functions tends to go down as the switching probability goes up, because the lengths of the subsequences get shorter as the switching probability increases. Using the second strategy, the recovery rate is almost independent of the switching probability, because the same number of data points from each BN is encountered; they are just cut up into smaller subsequences. Past a certain threshold, when the switching probability is very high, the subsequences are so short that they are hard to classify.

Figure 3 shows a graph of predictor recovery as a function of switching probability for the two strategies using database B. Both strategies give poor recovery for low switching probability because not all of the BNs are seen. Strategy 2 is more effective in recovering the underlying model parameters over a wider range of switching values. For higher values of q, the results from strategy 1 decline as the subsequences get shorter. The results for strategy 2 eventually decline as the sequences become so short that they cannot be effectively classified.

Figure 3: The percentage of predictors recovered from fixed-length PBN sequences (of 100 000 sample points). The sequence is generated from 4 BNs, with n = 7 variables and k = 3 predictors, and P = .01.

These observations are borne out by the results in Figure 4, which show the percentage of predictors recovered using strategy 2 from a PBN-generated sequence with 4 BNs consisting of n = 7 variables with k = 3, P = .01, and switching probabilities q = .001 and q = .005 for various length sequences using database A. It can be seen that for low sequence lengths and low switching probability, only 21% of the predictors are recovered, because only one BN has been observed. As sequence length increases, the percentage of predictors recovered increases, and at all times the higher switching probability does best, with the gap closing for very long sequence lengths.

Figure 4: The percentage of predictors recovered using strategy 2 from a sequence generated from a PBN with 4 BNs consisting of n = 7 variables with k = 3, P = .01, and switching probabilities q = .001 and q = .005, for various length sequences.

More comparisons are given in Figures 5 and 6, which compare the percentage predictor recovery for two different connectivity values and for two different perturbation values, respectively. They both result from strategy 2 applied to database A. It can be seen that it is easier to recover predictors for smaller values of k and larger values of p.
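In code, the two strategies differ only in which classified pure subsequences are retained before parameter estimation; a schematic sketch (assuming partitioning and BN classification have already been performed, with a hypothetical data layout):

```python
def strategy_1(classified_subseqs):
    """Keep a single subsequence per BN (here, the longest) and analyze it alone."""
    best = {}
    for bn, subseq in classified_subseqs:
        if bn not in best or len(subseq) > len(best[bn]):
            best[bn] = subseq
    return {bn: [s] for bn, s in best.items()}

def strategy_2(classified_subseqs):
    """Collate all subsequences attributed to the same BN and analyze each set."""
    groups = {}
    for bn, subseq in classified_subseqs:
        groups.setdefault(bn, []).append(subseq)
    return groups

# Toy (bn_label, subsequence) pairs:
subs = [(1, [0, 1, 1]), (2, [1, 0]), (1, [0, 1, 1, 0, 1])]
# strategy_1 keeps only one run per BN; strategy_2 keeps every run,
# which is why its recovery rate is nearly independent of q.
```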
A fuller picture of the recovery of predictors and functions from a PBN sequence of varying length, varying k, and
varying switching probability is given in Table 5 for database
A, where P = .01 and there are three different switching
probabilities: q = .001, .005, .03. As expected, it is easier to
recover predictors for low values of k. Also over this range
the percentage recovery of both functions and predictors increases with increasing switching probability.
Table 5: The percentages of predictors and functions recovered by strategy 2 as a function of sequence length, from sequences generated by experimental design A with P = .01, switching probabilities q = .001, .005, .03, and connectivities k = 2 and k = 3.

                       q = .001                        q = .005                        q = .03
                 Predictors     Functions        Predictors     Functions        Predictors     Functions
                 recovered (%)  recovered (%)    recovered (%)  recovered (%)    recovered (%)  recovered (%)
Sequence length  k=2    k=3     k=2    k=3       k=2    k=3     k=2    k=3       k=2    k=3     k=2    k=3
2000             22.07  20.94   20.15  12.95     50.74  41.79   37.27  25.44     65.25  48.84   53.52  34.01
4000             36.90  36.31   33.13  23.89     55.43  52.54   42.49  37.06     74.88  56.08   66.31  42.72
6000             53.59  38.80   43.23  26.79     76.08  54.92   66.74  42.02     75.69  64.33   67.20  51.97
8000             54.75  44.54   47.15  29.42     77.02  59.77   67.48  45.07     76.22  67.86   67.72  55.10
10 000           58.69  45.63   53.57  36.29     79.10  65.37   69.47  51.94     86.36  73.82   80.92  61.84
50 000           91.50  75.03   88.22  65.29     94.58  80.07   92.59  71.55     96.70  86.64   94.71  78.32
100 000          97.28  79.68   95.43  71.19     97.97  85.51   96.47  78.34     98.47  90.71   96.68  85.06
200 000          97.69  83.65   96.39  76.23     98.68  86.76   97.75  80.24     99.27  94.02   98.03  90.79
300 000          97.98  85.62   96.82  79.00     99.00  92.37   98.19  88.28     99.40  95.50   98.97  92.50
500 000          99.40  89.88   98.67  84.85     99.68  93.90   99.18  90.30     99.83  96.69   99.25  94.21

Figure 5: The percentage of predictors recovered using strategy 2 and experimental design A as a function of sequence length for connectivities k = 2 and k = 3.

Figure 6: The percentage of predictors recovered using strategy 2 and experimental design A as a function of sequence length for perturbation probabilities P = .02 and P = .005.
We have seen the marginal effects of the switching and
perturbation probabilities, but what about their combined
effects? To understand this interaction, and to do so taking
into account both the number of genes and the sequence
length, we have conducted a series of experiments using randomly generated PBNs composed of either n = 7 or n = 10
genes, and possessing different switching and perturbation
values. The result is a set of surfaces giving the percentages of
predictors recovered as a function of p and q.
The PBNs have been generated according to the following
protocol.
(1) Randomly generate 80 BNs with n = 7 variables and
connectivity k = 3 (each variable has at most 3 predictors,
the number for each variable being randomly selected). Randomly order the BNs as A1, A2, . . . , A80.
(2) Consider the following perturbation and switching
probabilities: P = .005, P = .01, P = .015, P = .02, q = .001,
q = .005, q = .01, q = .02, q = .03.
(3) For each pair (p, q), do the following: (a) construct a PBN from A1, A2, A3, A4 with selection probabilities 0.1, 0.2, 0.3, 0.4, respectively; (b) construct a PBN from A5, A6, A7, A8 with selection probabilities 0.1, 0.2, 0.3, 0.4, respectively; (c) continue until the BNs are used up.
(4) Apply the inference algorithm to all PBNs using data
sequences of length N = 4000, 6000, 8000, 10 000, 50 000.
(5) Repeat the same procedure from (1)–(4) using 10
variables.
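Steps (1) and (3) of the protocol can be sketched as follows (hedged: the BN encoding and the generator below are stand-ins for the paper's own generation procedure):

```python
import random

def random_bn(n=7, k=3, rng=random):
    """A random BN: each gene gets at most k randomly chosen predictors
    and a random truth table over those predictors."""
    bn = []
    for _ in range(n):
        preds = rng.sample(range(n), rng.randint(1, k))
        table = [rng.randint(0, 1) for _ in range(2 ** len(preds))]
        bn.append((preds, table))
    return bn

def build_pbns(num_bns=80, group=4, sel_probs=(0.1, 0.2, 0.3, 0.4), rng=random):
    """Randomly generate and order the BNs A1..A80, then form one PBN
    from each consecutive group of 4 with the fixed selection probabilities."""
    bns = [random_bn(rng=rng) for _ in range(num_bns)]
    return [(bns[i:i + group], sel_probs) for i in range(0, num_bns, group)]

pbns = build_pbns(rng=random.Random(0))
# 80 BNs grouped four at a time -> 20 PBNs, selection probabilities .1, .2, .3, .4
```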
Figures 7 and 8 show fitted surfaces for n = 7 and n = 10,
respectively. We can make several observations in the parameter region considered: (a) as expected, the surface heights
increase with increasing sequence length; (b) as expected, the
surface heights are lower for more genes, meaning that longer
sequences are needed for more genes; (c) the surfaces tend
to increase in height for both p and q, but if q is too large,
then recovery percentages begin to decline. The trends are
the same for both numbers of genes, but recovery requires
increasingly long sequences for larger numbers of genes.
Figure 7: Predictor recovery as a function of switching and perturbation probabilities for n = 7 genes: (a) N = 4000, (b) N = 6000, (c) N = 8000, (d) N = 10 000, (e) N = 50 000.
Figure 8: Predictor recovery as a function of switching and perturbation probabilities for n = 10 genes: (a) N = 4000, (b) N = 6000, (c) N = 8000, (d) N = 10 000, (e) N = 50 000.

6. A SUBSAMPLING STRATEGY

It is usually only necessary to observe a few thousand sample points in order to determine the underlying predictors and functions of a single BN. Moreover, it is usually only necessary to observe a few hundred sample points to classify a BN as being BN1, BN2, and so forth. However, in analyzing a PBN-generated sequence with a low switching probability, say q = .001, it is necessary on average to observe 1,000 points before a switch to the second BN occurs. This requires huge data lengths, not for deriving the parameters (predictors and functions) of the underlying model, but in order for switches to occur so that the other constituent BNs can be observed.

This motivates consideration of subsampling. Rather than analyzing the full sequence, we analyze a small subsequence of data points, skip a large run of points, analyze another sample, skip more points, and so forth. If each sample is sufficiently long to classify it correctly, then the samples from the same BN may be collated to produce good parameter estimates. The subsampling strategy, which is intended for data possessing a low switching probability, is illustrated in Figure 9. It is only necessary to see a small number of sample points of each BN in order to identify that BN. The length of the sampled subsequences is fixed at some value S.

Figure 9: Subsampling strategy.

To test the subsampling strategy, a set of 20 data sequences, each consisting of 100 000 sample points, was generated from a PBN consisting of 4 BNs, with n = 7 variables, k = 2, P = .01, and q = .001, as in database A. We define a sampling space to consist of a sampling window and a nonsampling interval, so that the length of a sampling space is given by L = S + I, where I is the length of the nonsampling interval. We have considered sampling spaces of lengths L = 200, 400, 600, 800, 1000, 2000, 3000, 4000, 5000, and 10 000, and sampling windows (subsequences) of lengths S = 50, 100, 150, and 200. When S = L, there is no subsampling. The results, shown in Figure 10, give the percentage of predictors recovered. The recovery percentage obtained by processing all 100 000 points in the full sequence is 97.28%.

Figure 10: Predictor recovery percentages using various subsampling regimes.

Subsampling represents an effort at complexity reduction and is commonly used in engineering applications to gain speed and reduce cost. From a larger perspective, the entire investigation of gene regulatory networks needs to take complexity reduction into consideration, because in the natural state the networks are extremely complex. The issue is whether goals can be accomplished better using fine- or coarse-grained analysis [28]. For instance, a stochastic differential equation model might provide a more complete description in principle, but a low-quantized discrete network might give better results owing to reduced inference requirements and computational complexity. Indeed, in this paper we have seen the inference difficulty that arises from taking into account the stochasticity caused by latent variables in a coarse binary model. Not only does complexity reduction motivate the use of models possessing smaller numbers of critical parameters and relations, for instance, by network reduction [29] or by suppressing functional relations in favor of a straight transitional probabilistic model [30], it also motivates suboptimal inference, as in the case of the subsampling discussed herein, or the application of suboptimal intervention strategies to network models such as PBNs [31].

7. REAL-DATA NETWORK EXPERIMENT
To test the inference technique on real data, we have considered an experiment based on a model affected by latent variables, these being a key reason for PBN modeling. Latent variables are variables outside the model whose behavior causes the model to appear random, that is, to switch between constituent networks.

The real-gene PBN is derived from the Drosophila segment polarity genes, for which a Boolean network consisting of 8 genes has been derived: wg1, wg2, wg3, wg4, PTC1, PTC2, PTC3, and PTC4 [26]. The genes are controlled by the following equations:
wg1 = wg1 and not wg2 and not wg4,
wg2 = wg2 and not wg1 and not wg3,
wg3 = wg1 or wg3,
wg4 = wg2 or wg4,
PTC1 = (not wg2 and not wg4) or (PTC1 and not wg1 and not wg3),
PTC2 = (not wg1 and not wg3) or (PTC2 and not wg2 and not wg4),
PTC3 = 1,
PTC4 = 1.        (22)
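The update rules in (22) translate directly into a synchronous update function; a minimal sketch (the dictionary-based state encoding is our own choice, not taken from [26]):

```python
def step(s):
    """One synchronous update of the 8-gene segment polarity BN of (22).
    s maps gene names to 0/1; returns the successor state."""
    wg1, wg2, wg3, wg4 = s["wg1"], s["wg2"], s["wg3"], s["wg4"]
    return {
        "wg1": int(wg1 and not wg2 and not wg4),
        "wg2": int(wg2 and not wg1 and not wg3),
        "wg3": int(wg1 or wg3),
        "wg4": int(wg2 or wg4),
        "PTC1": int((not wg2 and not wg4) or (s["PTC1"] and not wg1 and not wg3)),
        "PTC2": int((not wg1 and not wg3) or (s["PTC2"] and not wg2 and not wg4)),
        "PTC3": 1,
        "PTC4": 1,
    }

genes = ["wg1", "wg2", "wg3", "wg4", "PTC1", "PTC2", "PTC3", "PTC4"]
s0 = {g: 0 for g in genes}
s1 = step(s0)  # with all wg genes off, PTC1 and PTC2 switch on
# s1 is then a fixed point of the dynamics: step(s1) == s1
```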
Now let wg4 and PTC4 be hidden variables (not observable). Since PTC4 has a constant value, its being a hidden
variable has no effect on the network. However, if we let
wg4 = 0 or 1, we will arrive at a 6-gene PBN consisting of two BNs. When wg4 = 0, we have the following BN:

wg1 = wg1 and not wg2,
wg2 = wg2 and not wg1 and not wg3,
wg3 = wg1 or wg3,
PTC1 = (not wg2) or (PTC1 and not wg1 and not wg3),
PTC2 = (not wg1 and not wg3) or (PTC2 and not wg2),
PTC3 = 1.        (23)

When wg4 = 1, we have the following BN:

wg1 = wg1 and not wg2 and 0,
wg2 = wg2 and not wg1 and not wg3,
wg3 = wg1 or wg3,
PTC1 = (not wg2 and 0) or (PTC1 and not wg1 and not wg3),
PTC2 = (not wg1 and not wg3) or (PTC2 and not wg2 and 0),
PTC3 = 1.        (24)

Together, these compose a 6-gene PBN. Note that in the second BN we do not simplify the functions for wg1, PTC1, and PTC2, so that they have the same predictors as in the first BN. There are 6 genes considered here: wg1, wg2, wg3, PTC1, PTC2, and PTC3. The maximum number of predictor genes is k = 4. The two constituent networks are regulated by the same predictor sets. Based on this real-gene regulatory network, synthetic sequences have been generated and the inference procedure has been applied to them: 600 sequences with 10 different lengths (between 2000 and 100 000) have been generated, with P = .01 and three switching probabilities, q = .001, .005, .02. Table 6 shows the average percentages of the predictors and functions recovered.

Table 6: Percentages of the predictors and functions recovered from the segment polarity PBN.

                    q = .001                       q = .005                       q = .02
              Predictor      Function        Predictor      Function        Predictor      Function
Length        recovered (%)  recovered (%)   recovered (%)  recovered (%)   recovered (%)  recovered (%)
2000          51.94          36.83           56.95          39.81           68.74          49.43
4000          58.39          38.49           59.86          44.53           70.26          52.86
6000          65.78          50.77           80.42          65.40           85.60          68.11
8000          72.74          59.23           83.97          69.47           86.82          70.28
10 000        76.03          63.98           88.10          74.31           92.80          77.83
20 000        87.81          76.86           95.68          81.60           96.98          83.48
30 000        97.35          84.61           97.65          88.28           99.17          88.82
40 000        98.64          85.74           99.19          90.03           99.66          91.05
50 000        99.59          90.18           99.59          90.35           99.79          91.94
100 000       99.69          90.85           99.87          91.19           100            93.97

8. CONCLUSION
Capturing the full dynamic behavior of probabilistic Boolean
networks, whether they are binary or multivalued, will require the use of temporal data, and a goodly amount of it.
This should not be surprising given the complexity of the
model and the number of parameters, both transitional and
static, that must be estimated. This paper has proposed an algorithm that works well, but one whose performance makes the data requirement evident. It also demonstrates that the data requirement is much smaller if one does not wish to infer the switching, perturbation, and selection probabilities, and that constituent-network connectivity can be discovered with decent accuracy from relatively
small time-course sequences. The switching and perturbation probabilities are key factors, since if they are very small,
then large amounts of time are needed to escape attractors;
on the other hand, if they are large, estimation accuracy is
hurt. Were we to restrict our goal to functional descriptions
of state transitions when in attractor cycles, then the necessary amount of data would be enormously reduced; however,
our goal in this paper is to capture as much of the PBN structure as possible, including transient regulation.
Among the implications of the issues raised in this paper, there is a clear message regarding the tradeoff between
fine- and coarse-grain models. Even if we consider a binary
PBN, which is considered to be a coarse-grain model, and a
small number of genes, the added complexity of accounting
for function switches owing to latent variables significantly
increases the data requirement. This is the kind of complexity problem indicative of what one must confront when using
solely data-driven learning algorithms. Further study should
include mitigation of data requirements by prior knowledge,
such as transcriptional knowledge of connectivity or regulatory functions for some genes involved in the network. It
is also important to consider the reduction in complexity
resulting from prior constraints on the network generating
the data. These might include: connectivity, attractor structure, the effect of canalizing functions, and regulatory bias.
In the other direction, one can consider complicating factors
such as missing data and inference when data measurements
cannot be placed into direct relation with the synchronous
temporal dynamics of the model.
ACKNOWLEDGMENTS
We thank the National Science Foundation (CCF-0514644 and BES-0536679) and the National Cancer Institute (R01 CA-104620) for partly supporting this research.
We would also like to thank Edward Suh, Jianping Hua, and
James Lowey of the Translational Genomics Research Institute for providing high-performance computing support.
REFERENCES
[1] E. R. Dougherty, A. Datta, and C. Sima, “Research issues in
genomic signal processing,” IEEE Signal Processing Magazine,
vol. 22, no. 6, pp. 46–68, 2005.
[2] T. Akutsu, S. Miyano, and S. Kuhara, “Identification of genetic
networks from a small number of gene expression patterns under the Boolean network model,” in Proceedings of the 4th Pacific Symposium on Biocomputing (PSB ’99), pp. 17–28, Mauna
Lani, Hawaii, USA, January 1999.
[3] H. Lähdesmäki, I. Shmulevich, and O. Yli-Harja, “On learning
gene regulatory networks under the Boolean network model,”
Machine Learning, vol. 52, no. 1-2, pp. 147–167, 2003.
[4] S. Liang, S. Fuhrman, and R. Somogyi, “REVEAL, a general reverse engineering algorithm for inference of genetic network
architectures,” in Proceedings of the 3rd Pacific Symposium on
Biocomputing (PSB ’98), pp. 18–29, Maui, Hawaii, USA, January 1998.
[5] R. Pal, I. Ivanov, A. Datta, M. L. Bittner, and E. R. Dougherty,
“Generating Boolean networks with a prescribed attractor
structure,” Bioinformatics, vol. 21, no. 21, pp. 4021–4025,
2005.
[6] X. Zhou, X. Wang, and E. R. Dougherty, “Construction
of genomic networks using mutual-information clustering
and reversible-jump Markov-chain-Monte-Carlo predictor
design,” Signal Processing, vol. 83, no. 4, pp. 745–761, 2003.
[7] R. F. Hashimoto, S. Kim, I. Shmulevich, W. Zhang, M. L. Bittner, and E. R. Dougherty, “Growing genetic regulatory networks from seed genes,” Bioinformatics, vol. 20, no. 8, pp.
1241–1247, 2004.
[8] X. Zhou, X. Wang, R. Pal, I. Ivanov, M. L. Bittner, and E.
R. Dougherty, “A Bayesian connectivity-based approach to
constructing probabilistic gene regulatory networks,” Bioinformatics, vol. 20, no. 17, pp. 2918–2927, 2004.
[9] E. R. Dougherty and Y. Xiao, “Design of probabilistic Boolean
networks under the requirement of contextual data consistency,” IEEE Transactions on Signal Processing, vol. 54, no. 9,
pp. 3603–3613, 2006.
[10] D. Pe’er, A. Regev, G. Elidan, and N. Friedman, “Inferring subnetworks from perturbed expression profiles,” Bioinformatics,
vol. 17, supplement 1, pp. S215–S224, 2001.
[11] D. Husmeier, “Sensitivity and specificity of inferring genetic
regulatory interactions from microarray experiments with dynamic Bayesian networks,” Bioinformatics, vol. 19, no. 17, pp.
2271–2282, 2003.
[12] J. M. Peña, J. Björkegren, and J. Tegnér, “Growing Bayesian
network models of gene networks from seed genes,” Bioinformatics, vol. 21, supplement 2, pp. ii224–ii229, 2005.
[13] H. Lähdesmäki, S. Hautaniemi, I. Shmulevich, and O. Yli-Harja, "Relationships between probabilistic Boolean networks
and dynamic Bayesian networks as models of gene regulatory
networks,” Signal Processing, vol. 86, no. 4, pp. 814–834, 2006.
15
[14] S. A. Kauffman, “Metabolic stability and epigenesis in randomly constructed genetic nets,” Journal of Theoretical Biology,
vol. 22, no. 3, pp. 437–467, 1969.
[15] S. A. Kauffman, “Homeostasis and differentiation in random
genetic control networks,” Nature, vol. 224, no. 5215, pp. 177–
178, 1969.
[16] S. A. Kauffman, The Origins of Order: Self-Organization and
Selection in Evolution, Oxford University Press, New York, NY,
USA, 1993.
[17] S. Huang, “Gene expression profiling, genetic networks, and
cellular states: an integrating concept for tumorigenesis and
drug discovery,” Journal of Molecular Medicine, vol. 77, no. 6,
pp. 469–480, 1999.
[18] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, “Probabilistic Boolean networks: a rule-based uncertainty model for
gene regulatory networks,” Bioinformatics, vol. 18, no. 2, pp.
261–274, 2002.
[19] S. Kim, H. Li, E. R. Dougherty, et al., “Can Markov chain models mimic biological regulation?” Journal of Biological Systems,
vol. 10, no. 4, pp. 337–357, 2002.
[20] I. Shmulevich, E. R. Dougherty, and W. Zhang, “From
Boolean to probabilistic Boolean networks as models of genetic regulatory networks,” Proceedings of the IEEE, vol. 90,
no. 11, pp. 1778–1792, 2002.
[21] I. Shmulevich and E. R. Dougherty, “Modeling genetic regulatory networks with probabilistic Boolean networks,” in
Genomic Signal Processing and Statistics, E. R. Dougherty, I.
Shmulevich, J. Chen, and Z. J. Wang, Eds., EURASIP Book Series on Signal Processing and Communication, pp. 241–279,
Hindawi, New York, NY, USA, 2005.
[22] A. Datta, A. Choudhary, M. L. Bittner, and E. R. Dougherty,
“External control in Markovian genetic regulatory networks,”
Machine Learning, vol. 52, no. 1-2, pp. 169–191, 2003.
[23] R. Pal, A. Datta, M. L. Bittner, and E. R. Dougherty, “Intervention in context-sensitive probabilistic Boolean networks,”
Bioinformatics, vol. 21, no. 7, pp. 1211–1218, 2005.
[24] R. Pal, A. Datta, and E. R. Dougherty, “Optimal infinitehorizon control for probabilistic Boolean networks,” IEEE
Transactions on Signal Processing, vol. 54, no. 6, part 2, pp.
2375–2387, 2006.
[25] A. Datta, R. Pal, and E. R. Dougherty, “Intervention in probabilistic gene regulatory networks,” Current Bioinformatics,
vol. 1, no. 2, pp. 167–184, 2006.
[26] R. Albert and H. G. Othmer, “The topology of the regulatory
interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster,” Journal of Theoretical
Biology, vol. 223, no. 1, pp. 1–18, 2003.
[27] G. Langholz, A. Kandel, and J. L. Mott, Foundations of Digital
Logic Design, World Scientific, River Edge, NJ, USA, 1998.
[28] I. Ivanov and E. R. Dougherty, “Modeling genetic regulatory
networks: continuous or discrete?” Journal of Biological Systems, vol. 14, no. 2, pp. 219–229, 2006.
[29] I. Ivanov and E. R. Dougherty, “Reduction mappings between
probabilistic Boolean networks,” EURASIP Journal on Applied
Signal Processing, vol. 2004, no. 1, pp. 125–131, 2004.
[30] W.-K. Ching, M. K. Ng, E. S. Fung, and T. Akutsu, “On construction of stochastic genetic networks based on gene expression sequences,” International Journal of Neural Systems,
vol. 15, no. 4, pp. 297–310, 2005.
[31] M. K. Ng, S.-Q. Zhang, W.-K. Ching, and T. Akutsu, “A control
model for Markovian genetic regulatory networks,” in Transactions on Computational Systems Biology V, Lecture Notes in
Computer Science, pp. 36–48, 2006.
Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 20180, 13 pages
doi:10.1155/2007/20180
Research Article
Algorithms for Finding Small Attractors in Boolean Networks
Shu-Qin Zhang,1 Morihiro Hayashida,2 Tatsuya Akutsu,2 Wai-Ki Ching,1 and Michael K. Ng3
1 Advanced
Modeling and Applied Computing Laboratory, Department of Mathematics, The University of Hong Kong,
Pokfulam Road, Hong Kong
2 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
3 Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong
Received 29 June 2006; Revised 24 November 2006; Accepted 13 February 2007
Recommended by Edward R. Dougherty
A Boolean network is a model used to study the interactions between different genes in genetic regulatory networks. In this paper,
we present several algorithms using gene ordering and feedback vertex sets to identify singleton attractors and small attractors in
Boolean networks. We analyze the average case time complexities of some of the proposed algorithms. For instance, it is shown that
the outdegree-based ordering algorithm for finding singleton attractors works in O(1.19n ) time for K = 2, which is much faster
than the naive O(2n ) time algorithm, where n is the number of genes and K is the maximum indegree. We performed extensive
computational experiments on these algorithms, which resulted in good agreement with theoretical results. In contrast, we give a
simple and complete proof for showing that finding an attractor with the shortest period is NP-hard.
Copyright © 2007 Shu-Qin Zhang et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. INTRODUCTION
The advent of DNA microarrays and oligonucleotide chips
has significantly sped up the systematic study of gene interactions [1–4]. Based on microarray data, different kinds
of mathematical models and computational methods have
been developed, such as Bayesian networks, Boolean networks and probabilistic Boolean networks, ordinary and partial differential equations, qualitative differential equations,
and other mathematical models [5]. Among all the models,
the Boolean network model has received much attention. It
was originally introduced by Kauffman [6–9] and reviews
can be found in [10–12]. In a Boolean network, gene expression states are quantized to only two levels: 1 (expressed)
and 0 (unexpressed). Although such binary expression is very
simple, it can retain meaningful biological information contained in the real continuous-domain gene expression profiles. For instance, it can be applied to separation between
types of gliomas and types of sarcomas [13].
In a Boolean network, genes interact through some logical rules called Boolean functions. The state of a target gene is
determined by the states of its regulating genes (input genes)
and its Boolean function. Given the states of the input genes,
the Boolean function transforms them into an output, which
is the state of the target gene. Although the Boolean network
model is very simple, its dynamic process is complex and can
yield insight to the global behavior of large genetic regulatory
networks [14].
The total number of possible global states for a Boolean
network with n genes is 2^n. However, for any initial condition, the system will eventually evolve into a limited set of stable states called attractors. The set of states that can lead
stable states called attractors. The set of states that can lead
the system to a specific attractor is called the basin of attraction. There can be one or many states for each attractor. An
attractor having only one state is called a singleton attractor.
Otherwise, it is called a cyclic attractor.
There are two different interpretations for the function
of attractors. One intuition that follows Kauffman is that
one attractor should correspond to a cell type [11]. Another interpretation of attractors is that they correspond to
the cell states of growth, differentiation, and apoptosis [10].
Cyclic attractors should correspond to cell cycles (growth)
and singleton attractors should correspond to differentiated
or apoptosis states. These two interpretations are complementary since one cell type can consist of several neighboring
attractors and each of them corresponds to different cellular
functional states [15].
The number and length of attractors are important features of networks. Extensive studies have been done for analyzing them. Starting from [11], a fast increase of the number
of attractors has been seen in [16–19]. Many studies have also
been done on the mean length of attractors [11, 17], although
there is no conclusive result.
It is also important to identify attractors of a given
Boolean network. In particular, identification of all singleton
attractors is important because singleton attractors correspond to steady states in Boolean networks and have close relation with steady states in other mathematical models of biological networks [10, 20–23]. As mentioned before, Huang
wrote that singleton attractors correspond to differentiation
and apoptosis states of a cell [10]. Devloo et al. transform
the problem of finding steady states for some types of biological networks to a constraint satisfaction problem [20]. The
resulting constraint satisfaction problem is very close to the
problem of identification of singleton attractors in Boolean
networks. Mochizuki introduced a general model of genetic
networks based on nonlinear differential equations [21]. He
analyzed the number of steady states in that model, where
steady states are again closely related to singleton attractors in
Boolean networks. Zhou et al. proposed a Bayesian-based approach to constructing probabilistic genetic networks [23].
Pal et al. proposed algorithms for generating Boolean networks with a prescribed attractor structure [22]. These studies focus on singleton attractors, and it is mentioned that real-world attractors are most likely to be singleton attractors,
rather than cyclic attractors.
Therefore, it is meaningful to identify singleton attractors. Of course, this can be done by examining all possible
states of a Boolean network. However, that would be too time
consuming even for small n, since 2^n states have to be examined. Alternatively, if we want to find any one (not necessarily singleton) attractor, we may find it by following the trajectory to the attractor beginning from a randomly selected
state. If the basin of attraction is large, the possibility to find
the corresponding attractor would be high. However, it is not
guaranteed that a singleton attractor can be found. In order
to find a singleton attractor, a lot of trajectories may be examined. Indeed, Akutsu et al. proved in 1998 that finding a
singleton attractor is NP-hard [24]. Independently, Milano
and Roli showed in 2000 that the satisfiability problem can be
transformed into the problem of finding a singleton attractor
[25], which provides a proof of NP-hardness of the singleton
attractor problem. Thus, it is not plausible that the singleton
attractor problem can be solved efficiently (i.e., polynomial
time) in all cases. However, it may be possible to develop algorithms that are fast in practice and/or in the average case.
Therefore, this paper studies algorithms for identifying singleton attractors that are fast in many practical cases and have
concrete theoretical backgrounds.
Some studies have been done on fast identification of singleton attractors. Akutsu et al. proposed an algorithm for
finding singleton attractors based on a feedback vertex set
[24]. Devloo et al. proposed algorithms for finding steady
states of various biological networks using constraint programming [20], which can also be applied to identification
of singleton attractors in Boolean networks. In particular, the
algorithms proposed by Devloo et al. are efficient in practice.
However, there are no theoretical results on the efficiency of
their algorithms. Thus, we aim at developing algorithms that
are fast in practice and have a theoretical guarantee on their
efficiency (more precisely, the average case time complexity).
In this paper, we propose several algorithms for identifying all singleton attractors. We first present a basic recursive
algorithm. In this algorithm, a partial solution is extended
one by one according to a given gene ordering that leads to
a complete solution. If it is found that a partial solution cannot be extended to a complete solution, the next partial solution is examined. This algorithm is quite similar to the backtracking method employed in [20]. The important difference
of this paper from [20] is that we perform some theoretical
analysis of the average case time complexity. For example, we
show that the basic recursive algorithm works in O(1.23^n)
time in the average case under the condition that Boolean
networks with maximum indegree 2 are given uniformly at
random. It should be noted that O(1.23^n) is much smaller
than O(2^n), though it is not polynomial.
Next, we develop improved algorithms using the outdegree-based ordering and the breadth-first search (BFS)
based ordering. For these algorithms, we perform theoretical analysis of the average case time complexity, which shows
that these are better than the basic recursive algorithm.
Moreover, we examine the algorithm based on feedback vertex sets (FVS) and its combination with the outdegree-based
ordering, where the idea of use of FVS was previously proposed in our previous work [24]. We also perform computational experiments using these algorithms, which show that
the FVS-based algorithm with the outdegree-based gene ordering is the most efficient in practice among these algorithms. Then, we extend the gene-ordering-based algorithms
for finding cyclic attractors with short periods along with
theoretical analysis and computational experiments. Though
we do not have strong evidence that small attractors are more
important than those with long periods, it seems that cell cycles correspond to small attractors and large attractors are
not so common (with the exception of circadian rhythms)
in real biological networks. As a minimum, these extensions
show that application of the proposed techniques is not limited to the singleton attractor problem.
As mentioned before, NP-hardness results on finding a
singleton attractor (or the smallest attractor) were already
presented in [24, 25]. However, both papers appeared as conference papers, the detailed proof is not given in [24], and the
transformation given in [25] is a bit complicated. Therefore,
we describe a simple and complete proof. We believe that it is
worthy to include a simple and complete proof in this paper.
Finally, we conclude with future work.
2. ANALYSIS OF ALGORITHMS USING GENE ORDERING FOR FINDING SINGLETON ATTRACTORS
In this section, we present algorithms using gene ordering
for identification of singleton attractors along with theoretical analysis of the average case time complexity. Experimental results will be given later along with those of FVS-based
Table 1: Example of a truth table of a Boolean network.

v1 v2 v3 | f1 f2 f3
 0  0  0 |  0  1  1
 0  0  1 |  1  0  1
 0  1  0 |  1  1  0
 0  1  1 |  0  1  1
 1  0  0 |  0  1  0
 1  0  1 |  1  0  0
 1  1  0 |  1  0  1
 1  1  1 |  1  1  0
methods. Before presenting the algorithms, we briefly review
the Boolean network model.
Figure 1: State transitions of the Boolean network shown in
Table 1.
2.1. Boolean network and attractor
A Boolean network G(V, F) consists of a set of n nodes (vertices) V and a set of n Boolean functions F, where

V = {v1, v2, ..., vn},  F = {f1, f2, ..., fn}.  (1)

In general, V and F correspond to a set of genes and a set
of gene regulatory rules, respectively. Let vi(t) represent the
state of vi at time t. The overall expression level of all the
genes in the network at time step t is given by the following
vector:

v(t) = [v1(t), v2(t), ..., vn(t)].  (2)

This vector is referred to as the Gene Activity Profile (GAP)
of the network at time t, where vi(t) = 0 means that the
ith gene is not expressed and vi(t) = 1 means that it is expressed. Since v(t) ranges from [0, 0, ..., 0] (all entries are 0)
to [1, 1, ..., 1] (all entries are 1), there are 2^n possible states.
The regulatory rules among the genes are given as follows:

vi(t + 1) = fi(vi1(t), vi2(t), ..., viki(t)),  i = 1, 2, ..., n.  (3)
This rule means that the state of gene vi at time t + 1 depends
on the states of ki genes at time t, where ki is called the indegree of gene vi. The maximum indegree of a Boolean network is defined as

K = max_i ki.  (4)

The number of genes that are directly affected by gene vi
is called the outdegree of gene vi. The states of all genes
are updated synchronously according to the corresponding
Boolean functions.

Algorithm 1:
Input: a Boolean network G(V, F)
Output: all the singleton attractors
Initialize m := 1;
Procedure IdentSingletonAttractor(v, m)
  if m = n + 1 then Output v1(t), v2(t), ..., vn(t), return;
  for b = 0 to 1 do
    vm(t) := b;
    if it is found that vj(t + 1) ≠ vj(t) for some j ≤ m then
      continue;
    else IdentSingletonAttractor(v, m + 1);
  return.
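As an illustration, the recursion in Algorithm 1 can be sketched in Python. This is a simplified sketch; the network encoding (per-gene input lists plus Boolean functions) and the function names are our own, not from the paper:

```python
# Sketch of Algorithm 1: each gene i has a tuple of input genes and a
# Boolean function over those inputs. A partial assignment v (entry i
# holds 0, 1, or None) is extended gene by gene; a branch is pruned as
# soon as some fully determined gene j <= m would violate v_j(t+1) = v_j(t).

def singleton_attractors(inputs, funcs, n):
    """inputs[i]: tuple of input gene indices; funcs[i]: Boolean function."""
    found = []

    def consistent(v, m):
        # Check every gene j <= m whose inputs are all already assigned.
        for j in range(m):
            if all(v[k] is not None for k in inputs[j]):
                if funcs[j](*(v[k] for k in inputs[j])) != v[j]:
                    return False
        return True

    def rec(v, m):
        if m == n:
            found.append(tuple(v))   # complete GAP that survived all checks
            return
        for b in (0, 1):
            v[m] = b
            if consistent(v, m + 1):
                rec(v, m + 1)
        v[m] = None                  # undo before backtracking
    rec([None] * n, 0)
    return found
```

Run on the 3-gene network of Table 1 (with every gene reading all three genes), this returns only the singleton attractor [0, 1, 1].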
A consecutive sequence of GAPs v(t), v(t +1), . . . , v(t + p)
is called an attractor with period p if v(t) = v(t + p). An
attractor with period 1 is called a singleton attractor and an
attractor with period > 1 is called a cyclic attractor.
Table 1 gives an example of a truth table of a Boolean network. Each gene will update its state according to the states
of some other genes in the previous step. The state transitions of this Boolean network can be seen in Figure 1. The
system will eventually evolve into two attractors. One attractor is [0, 1, 1], which is a singleton attractor, and the other
one is

[1, 0, 1] → [1, 0, 0] → [0, 1, 0] → [1, 1, 0] → [1, 0, 1],  (5)

which is a cyclic attractor with period 4.
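The example above can be checked by exhaustive simulation; the following sketch (transition map transcribed from Table 1, helper names ours) enumerates all 2^n states and follows each trajectory until it revisits a state:

```python
from itertools import product

# Transition map transcribed from Table 1: (v1, v2, v3) -> (f1, f2, f3).
STEP = {
    (0, 0, 0): (0, 1, 1), (0, 0, 1): (1, 0, 1),
    (0, 1, 0): (1, 1, 0), (0, 1, 1): (0, 1, 1),
    (1, 0, 0): (0, 1, 0), (1, 0, 1): (1, 0, 0),
    (1, 1, 0): (1, 0, 1), (1, 1, 1): (1, 1, 0),
}

def attractors(step, n):
    """Return the set of attractors, each as a frozenset of its states."""
    result = set()
    for state in product((0, 1), repeat=n):
        seen = []
        while state not in seen:          # follow the trajectory
            seen.append(state)
            state = step[state]
        cycle = seen[seen.index(state):]  # states from the first revisit onward
        result.add(frozenset(cycle))
    return result
```

For Table 1 this yields exactly the two attractors above: the singleton {[0, 1, 1]} and the period-4 cycle through [1, 0, 1], [1, 0, 0], [0, 1, 0], [1, 1, 0].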
2.2. Basic recursive algorithm
The number of singleton attractors in a Boolean network depends on the regulatory rules of the network. If the regulatory rules are given as vi (t + 1) = vi (t) for all i, the number of
singleton attractors is 2^n. Thus, it would take O(2^n) time in
the worst case if we want to identify all the singleton attractors. On the other hand, it is known that the average number
of singleton attractors is 1 regardless of the number of genes
n and the maximum indegree K [21]. Therefore, it is useful
to develop algorithms for identifying all singleton attractors
without examining all 2^n states (in the average case).
For that purpose, we propose a very simple algorithm,
which is referred to as the basic recursive algorithm in this paper. In the algorithm, a partial GAP (i.e., profile with m (< n)
genes) is extended one by one towards a complete GAP (i.e.,
singleton attractor), according to a given gene ordering. If it
is found that a partial GAP cannot be extended to a singleton
attractor, the next partial GAP is examined. The pseudocode
of the algorithm is given as shown in Algorithm 1.
The algorithm extends a partial GAP by one gene at a
time. At the mth recursive step, the states of the first m − 1
genes are determined. Then, the algorithm extends the partial GAP by adding vm (t) = 0. If v j (t + 1) = v j (t) holds or the
value of v j (t + 1) is not determined for all j = 1, . . . , m, the
algorithm proceeds to the next recursive step. That is, if there
is a possibility that the current partial GAP can be extended
to a singleton attractor, it goes to the next recursive step.
Otherwise, it extends the partial GAP by adding vm (t) = 1
and executes a similar procedure. After examining vm (t) = 0
and vm (t) = 1, the algorithm returns to the previous recursive step. Since the number of singleton attractors is small in
most cases, it is expected that the algorithm does not examine many partial GAPs with large m. The average case time
complexity is estimated as follows.
Suppose that Boolean networks with maximum indegree
K are given uniformly at random. Then the average case time
complexity of the algorithm for K = 1 to K = 10 is given in the
first row of Table 2.

Theoretical analysis

Assume that we have tested the first m out of n genes, where
m ≥ K. For all i ≤ m, vi(t) ≠ vi(t + 1) is detected with probability

P[vi(t) ≠ vi(t + 1)] = 0.5 · C(m, ki)/C(n, ki) ≈ 0.5 · (m/n)^ki ≥ 0.5 · (m/n)^K.  (6)

If vi(t) ≠ vi(t + 1) is not detected, the algorithm can continue.
Therefore, the probability that the algorithm examines the
(m + 1)th gene is not more than

[1 − P(vi(t) ≠ vi(t + 1))]^m = [1 − 0.5 · (m/n)^K]^m.  (7)

Thus, the number of recursive calls executed for the first m
genes is at most

f(m) = 2^m · [1 − 0.5 · (m/n)^K]^m.  (8)

Let s = m/n; then f(s) = [2^s · (1 − 0.5 · s^K)^s]^n = [(2 − s^K)^s]^n.
The average case time complexity is estimated by the maximum value of f(s). Though an additional O(nm) factor is required, it can be ignored since O(n^2 · a^n) ⊆ O((a + ε)^n) holds
for any a > 1 and ε > 0.

Since the time complexity should be a function with respect to n, we only need to compute the maximum value of
the function g(s) = (2 − s^K)^s. With simple numerical calculations, we can get its maximum value for fixed K. Then,
the average case time complexity of the algorithm can be estimated as O((max g)^n). We list the time complexities from
K = 1 to K = 10 in the first row of Table 2. As K gets larger, the
complexity increases.

2.3. Outdegree-based ordering algorithm

In the basic recursive algorithm, the original ordering of
genes was used. If we sort the genes according to their outdegrees (genes are ordered from larger outdegree to smaller
outdegree), it is expected that the values of vj(t + 1) for a larger
number of genes are determined at each recursive step than
in the basic recursive algorithm, and thus
fewer partial GAPs are examined. This intuition
is justified by the following theoretical analysis.

Suppose that Boolean networks with maximum indegree K
are given uniformly at random. After reordering all genes according to their outdegrees from largest to smallest, the average
case time complexity of the algorithm for K = 1 to K = 10 is
given in the second row of Table 2.

Theoretical analysis

We assume without loss of generality (w.l.o.g.) that the indegrees of all genes are K. If the input genes for any gene are
randomly selected from all the genes, the outdegree of the genes
follows the Poisson distribution with mean approximately λ.
In this case, λ = K holds since the total indegree must be
equal to the total outdegree. Thus, λ and K are used interchangeably in
the following. The probability that a gene has outdegree k is

P(k) = λ^k · exp(−λ) / k!.  (9)

We reorder the genes according to their outdegrees from
largest to smallest. Assume that the first m genes have been
tested and gene m is the uth gene among the genes with outdegree l. Then

m − u = n · Σ_{k=l+1}^{∞} λ^k exp(−λ)/k!,  (10)

and therefore

n − m = n · Σ_{k=0}^{l} λ^k exp(−λ)/k! − u.  (11)

The total outdegree of these n − m genes is

n · Σ_{k=0}^{l} [λ^k exp(−λ)/k!] · k − u · l.  (12)

The total outdegree of the first m genes is therefore

λn − n · Σ_{k=0}^{l} [λ^k exp(−λ)/k!] · k + u · l
  = λn − λn · Σ_{k=0}^{l−1} λ^k exp(−λ)/k! + u · l
  = λn − λ · [n − (m − u) − n · λ^l exp(−λ)/l!] + u · l
  = λm + λn · λ^l exp(−λ)/l! + u(l − λ).  (13)
Thus, for i ≤ m, we have

P[vi(t) ≠ vi(t + 1)] = 0.5 · ([λm + λn · λ^l exp(−λ)/l! + u(l − λ)] / (λn))^λ
  = 0.5 · [m/n + λ^l exp(−λ)/l! + (l − λ)u/(λn)]^λ.  (14)

The number of recursive calls executed for the first m genes
is

f(m) = 2^m · (1 − 0.5 · [m/n + λ^l exp(−λ)/l! + (l − λ)u/(λn)]^λ)^m.  (15)

Letting s = m/n, f(m) can be rewritten as

f(m) = [2^s · (1 − 0.5 · [s + λ^l exp(−λ)/l! + (l − λ)u/(λn)]^λ)^s]^n
  = [(2 − [s + λ^l exp(−λ)/l! + (l − λ)u/(λn)]^λ)^s]^n.  (16)

As in Section 2.2, we estimate the maximum value of g(s),
where it is defined here as g(s) = (2 − [s + λ^l exp(−λ)/l! +
(l − λ)u/(λn)]^λ)^s. We also must consider the relationship between l and λ.

(1) If l > λ,

g(s) ≤ (2 − [s + λ^l exp(−λ)/l!]^λ)^s = g1(s).  (17)

Since λ^l exp(−λ)/l! tends to zero if l is large, we only
need to examine several small values of l. The upper
bound of g(s) can be obtained by computing the maximum value of g1(s) with some numerical methods.
However, we should be careful so that

P(k ≥ l + 1) ≤ s ≤ P(k ≥ l)  (18)

holds. That is, it should be guaranteed that the maximum value obtained is for the gene with outdegree l.

(2) If l = λ,

g(s) = (2 − [s + λ^l exp(−λ)/l!]^λ)^s.  (19)

Similar to above, we can get an upper bound for g(s).

(3) If l < λ,

g(s) = (2 − [s + λ^l exp(−λ)/l! + (l − λ)u/(λn)]^λ)^s.  (20)

Since gene m is the uth gene among the genes with outdegree l,

u ≤ n · λ^l exp(−λ)/l!.  (21)

Thus,

g(s) ≤ (2 − [s + λ^l exp(−λ)/l! + (l − λ) · λ^{l−1} exp(−λ)/l!]^λ)^s.  (22)

There are only a few values of l that are less than λ. Using a
method similar to the one above, we can get an upper
bound for g(s).

It should be noted that l must belong to exactly one of these
three cases when g(s) reaches its maximum value. Summarizing the three different cases above, we can get an approximation of the average case time complexity of the algorithm.
The second row of Table 2 shows the time complexity of the
algorithm for K = 1 to K = 10. As in Section 2.2, the complexity increases as K increases.

We remark that the difference between this improved algorithm and the basic recursive algorithm lies only in that we
need to sort all the genes according to their outdegrees from
largest to smallest before executing the main procedure of the
basic recursive algorithm.
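Since the only change is this preprocessing step, the reordering itself is a few lines; a sketch (network encoding and names ours):

```python
def outdegree_order(inputs, n):
    """Return gene indices sorted by outdegree, largest first.

    inputs[i] lists the genes regulating gene i, so the outdegree of
    gene k is the number of genes in which k appears as an input.
    """
    outdeg = [0] * n
    for regs in inputs:
        for k in regs:
            outdeg[k] += 1
    return sorted(range(n), key=lambda k: outdeg[k], reverse=True)
```

The basic recursive procedure is then run on the genes in this order (relabeling the network accordingly).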
2.4. Breadth-first search-based ordering algorithm
Breadth-first search is a general technique for traversing a
graph. It visits all the nodes and edges of a graph in a manner that all the nodes at depth (distance from the root node)
d are visited before visiting nodes at depth d + 1. For example, suppose that node a has outgoing edges to nodes b and
c, b has outgoing edges to nodes d and e, and c has outgoing edges to nodes f and g, where other edges (e.g., an edge
from d to f ) can exist. In this case, nodes are visited in the
order of a, b, c, d, e, f, g. In this way, all of the nodes are totally ordered according to the visiting order. The algorithm
for implementing BFS can be found in many text books. The
computation time for BFS on a graph with n nodes and m
edges is O(n+m). If we use this BFS-based ordering, as in the
case of outdegree-based ordering, it is expected that values of
v j (t + 1) for a larger number of genes are determined at each
recursive step, and thus, lower numbers of partial GAPs are
examined. We can estimate the average case time complexity
as follows.
Suppose that Boolean networks with maximum indegree K
are given uniformly at random. After reordering all genes according to the BFS-ordering, the average case time complexity
of the algorithm for K = 1 to K = 10 is given in the third row
of Table 2.
Theoretical analysis
As in Section 2.3, we assume w.l.o.g. that all n genes have
the same indegree K. Suppose that we have tested m genes.
Since the input genes of the ith gene must be among the first
K · i + 1 genes, whether vi (t + 1) = vi (t) or not can be determined before visiting the (K · i + 2)th gene. According to
the determination pattern of the states of the m genes, we consider 3
cases.

(1) The states of the first ⌊(m − 1)/K⌋ genes are determined and they must satisfy vi(t + 1) = vi(t), where ⌊a⌋
denotes the standard floor function. Then, we have

P[vi(t) ≠ vi(t + 1)] = 0.5,  i ≤ ⌊(m − 1)/K⌋.  (23)

(2) For any gene i between the ⌊m/K⌋th gene and the
⌊(n − 1)/K⌋th gene, whether vi(t + 1) is equal to vi(t)
can be determined before examining the (m + j · K)th
gene, where j = 1, 2, ..., ⌊(n − m)/K⌋. Then, we have

P[vi(t) ≠ vi(t + 1)] = 0.5 · [m/(m + j · K)]^K,  ⌊m/K⌋ ≤ i ≤ ⌊(n − 1)/K⌋.  (24)

The algorithm can continue for any such gene i with probability

1 − P[vi(t) ≠ vi(t + 1)] = 1 − 0.5 · [m/(m + j · K)]^K,  ⌊m/K⌋ ≤ i ≤ ⌊(n − 1)/K⌋.  (25)

(3) From the ⌊n/K⌋th gene to the mth gene, the input
genes to them can be any gene; thus

P[vi(t) ≠ vi(t + 1)] = 0.5 · (m/n)^K,  ⌊(n − 1)/K⌋ ≤ i ≤ m.  (26)

Here, the algorithm can continue for each gene with
probability

1 − 0.5 · (m/n)^K,  ⌊(n − 1)/K⌋ ≤ i ≤ m.  (27)

The probability that the algorithm can be executed for all
m genes is

Π_{i=1}^{⌊(m−1)/K⌋} [1 − P(vi(t) ≠ vi(t + 1))] · Π_{i=⌊(m−1)/K⌋}^{⌊(n−1)/K⌋} [1 − P(vi(t) ≠ vi(t + 1))] · Π_{i=⌊(n−1)/K⌋}^{m} [1 − P(vi(t) ≠ vi(t + 1))]
  = 0.5^{⌊(m−1)/K⌋} · Π_{j=1}^{⌊(n−m)/K⌋} [1 − 0.5 · (m/(m + j · K))^K] · [1 − 0.5 · (m/n)^K]^{m−⌊(n−1)/K⌋}.  (28)

Then, the total number of recursive calls is

f(m) = 2^m · 0.5^{⌊(m−1)/K⌋} · Π_{j=1}^{⌊(n−m)/K⌋} [1 − 0.5 · (m/(m + j · K))^K] · [1 − 0.5 · (m/n)^K]^{m−⌊(n−1)/K⌋}
  ≤ 2^m · 0.5^{⌊(m−1)/K⌋} · [1 − 0.5 · (m/n)^K]^{m−⌊(m−1)/K⌋}
  = [2 − (m/n)^K]^{m−⌊(m−1)/K⌋}
  = [2 − (m/n)^K]^{([m−⌊(m−1)/K⌋]/n) · n}
  ≈ [2 − (m/n)^K]^{(m/n)(1−1/K) · n}.  (29)

Let s = m/n and g(s) = (2 − s^K)^{s(1−1/K)}. Using numerical
methods, we can get the maximum value of g. From K = 1 to
K = 10, the upper bound of the average case time complexity
of the algorithm is given in the third row of Table 2.

Table 2: Theoretical time complexities of basic, outdegree-based, and BFS-based algorithms.

K                1       2       3       4       5       6       7       8       9       10
Basic            1.23^n  1.35^n  1.43^n  1.49^n  1.53^n  1.57^n  1.60^n  1.62^n  1.65^n  1.67^n
Outdegree-based  1.09^n  1.19^n  1.27^n  1.34^n  1.41^n  1.45^n  1.48^n  1.51^n  1.56^n  1.57^n
BFS-based        ≈O(n)   1.16^n  1.27^n  1.35^n  1.41^n  1.45^n  1.50^n  1.53^n  1.56^n  1.58^n

It is to be noted that in the estimation of the upper bound
of f(m), we overestimated the probability that genes belong
to the second case, and thus the upper bound obtained here is
not tight. More accurate time complexities can be estimated
from the results of computational experiments.
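The "simple numerical calculations" used throughout Section 2 amount to maximizing a one-dimensional function over s ∈ [0, 1]. For example, for the basic recursive algorithm a plain grid search over g(s) = (2 − s^K)^s recovers the first row of Table 2 (a sketch, our own code; the grid resolution is an arbitrary choice):

```python
def base_of_complexity(K, steps=100000):
    """Maximize g(s) = (2 - s**K)**s over s in [0, 1] by grid search.

    The average case time complexity of the basic recursive algorithm
    is then estimated as O((max g)**n).
    """
    return max((2 - s**K) ** s
               for s in (i / steps for i in range(steps + 1)))
```

For K = 1 and K = 2 this gives roughly 1.23 and 1.35, matching the first row of Table 2.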
3. FINDING SINGLETON ATTRACTORS USING FEEDBACK VERTEX SET
In this section, we present algorithms based on the feedback
vertex set and the results of computational experiments on
all of our proposed algorithms for identification of singleton
attractors. The algorithms in this section are based on a simple and interesting property on acyclic Boolean networks although they can be applied to general Boolean networks with
cycles. Though an algorithm based on the feedback vertex set
was already proposed in our previous work [24], some improvements (ordering based on connected components and
ordering based on outdegree) are achieved in this section.
Algorithm 2:
Input: a Boolean network G(V, F)
Output: an ordered feedback vertex set F' = v1^(FVS), ..., vM^(FVS)
Procedure FindFeedbackVertexSet
  let F' := ∅, M := 1;
  let C := (all the connected components of G);
  for each connected component C' ∈ C do
    let V' := (the set of vertices in C');
    while V' ≠ ∅ do
      let vM^(FVS) := (a vertex selected randomly from V');
      remove vM^(FVS) and vertices whose truth values can be fixed only from F' in V';
      increment M.
3.1. Acyclic network
As will be shown in Section 5, the problem of finding a singleton attractor in a Boolean network is NP-hard. However, we
have a positive result for acyclic networks as follows.
Proposition 1. If the network is acyclic, there exists a unique
singleton attractor. Moreover, the unique attractor can be computed in polynomial time.
Proof. In an acyclic network, there exists at least one node
without incoming edges. Such nodes should have fixed
Boolean values. The values of the other nodes are uniquely
determined from these nodes by the nth time step in polynomial time. Since the state of any node does not change after
the nth step, there exists only one singleton attractor.
As shown below, this property is also useful for identifying singleton attractors in cyclic networks.
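The proof of Proposition 1 is constructive: propagate values in topological order, starting from the nodes without incoming edges. A sketch (source nodes are encoded as genes whose Boolean function takes an empty input list; the encoding and names are ours):

```python
# Sketch of Proposition 1: in an acyclic network, repeatedly fix every
# gene whose inputs are all already fixed. After at most n sweeps all
# genes are fixed, giving the unique singleton attractor.

def acyclic_singleton_attractor(inputs, funcs, n):
    v = [None] * n
    for _ in range(n):                      # at most n propagation sweeps
        for i in range(n):
            if v[i] is None and all(v[k] is not None for k in inputs[i]):
                v[i] = funcs[i](*(v[k] for k in inputs[i]))
    assert all(x is not None for x in v), "network must be acyclic"
    return tuple(v)
```

For example, with gene 0 constant 1, gene 1 = NOT gene 0, and gene 2 = gene 0 AND gene 1, the unique singleton attractor is (1, 0, 0).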
Algorithm 3:
Input: a Boolean network G(V, F) and an ordered feedback vertex set F' = v1^(FVS), ..., vM^(FVS)
Output: all the singleton attractors
Initialize m := 1;
Procedure IdentSingletonAttractorWithFVS(v, m)
  if m = M + 1 then Output v1(t), v2(t), ..., vn(t), return;
  for b = 0 to 1 do
    vm^(FVS)(t) := b;
    propagate truth values of v1^(FVS)(t), ..., vm^(FVS)(t) to all possible v(t) except F';
    compute v1^(FVS)(t + 1), ..., vm^(FVS)(t + 1) from v(t);
    if it is found that vj^(FVS)(t + 1) ≠ vj^(FVS)(t) for some j ≤ m then
      continue;
    else IdentSingletonAttractorWithFVS(v, m + 1);
  return.
3.2. Algorithm
In the basic recursive algorithm, we must consider truth assignments to all the nodes in the network. On the other
hand, Proposition 1 indicates that if the network is acyclic,
the truth values of all nodes are uniquely determined from
the values of the nodes with no incoming edges. Thus, it is
enough to examine truth assignments only to the nodes with
no incoming edges, if we can decompose the network into
acyclic graphs. Such a set of nodes is called a feedback vertex
set (FVS). The problem of finding a minimum feedback vertex set is known to be NP-hard [26]. Some algorithms which
approximate the minimum feedback vertex set have been developed [27]. However, such algorithms are usually complicated. Thus, we use a simple greedy algorithm (shown in
Algorithm 2) for finding a (not necessarily minimum) feedback vertex set, where a similar algorithm was already presented in [24]. In our proposed algorithm, nodes in FVS are
ordered according to the connected components of the original network in order to reduce the number of iterations. In
other words, nodes in the same connected component are
ordered sequentially.
Then, we modify the procedure IdentSingletonAttractor(v, m) for FVS as shown in Algorithm 3.
Furthermore, we can combine the outdegree-based ordering with FVS. In FindFeedbackVertexSet, we select a node
randomly from a connected component. When combined
with the outdegree-based ordering, we can instead select the
node with the maximum outdegree in a connected component.
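The greedy construction in Algorithm 2 (with the outdegree-based selection rule of the FVS + outdegree variant) can be approximated in a few lines: strip vertices that cannot lie on a cycle, then delete a remaining vertex of maximum outdegree, and repeat. This simplified sketch (names ours) finds some feedback vertex set, not a minimum one:

```python
def greedy_fvs(edges, n):
    """Return a feedback vertex set of the digraph on vertices 0..n-1.

    edges: set of (u, v) pairs. Greedily strips vertices with no
    incoming or no outgoing edge (they cannot be on a cycle), then
    removes a remaining vertex of maximum outdegree into the FVS.
    """
    alive, e, fvs = set(range(n)), set(edges), []
    while True:
        changed = True
        while changed:                      # strip cycle-free vertices
            changed = False
            for x in list(alive):
                if not any(u == x for u, v in e) or \
                   not any(v == x for u, v in e):
                    alive.discard(x)
                    e = {(u, v) for u, v in e if x not in (u, v)}
                    changed = True
        if not alive:
            return fvs
        x = max(alive, key=lambda y: sum(1 for u, v in e if u == y))
        fvs.append(x)                       # x breaks at least one cycle
        alive.discard(x)
        e = {(u, v) for u, v in e if x not in (u, v)}
```

On a single 3-cycle this returns one vertex; on two disjoint 2-cycles it returns two, one per cycle.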
3.3. Computational experiments
In this section, we evaluate the proposed algorithms by performing a number of computational experiments on both
random networks and scale-free networks [28].
3.3.1. Experiments on random networks
For each K (K = 1, . . . , 10) and each n (n = 1, . . . , 20),
we randomly generated 10 000 Boolean networks with maximum indegree K and took the average values. All of these
computational experiments were done on a PC with Opteron
Table 3: Empirical time complexities of basic, outdegree, BFS, feedback vertex set, and FVS + outdegree algorithms.

K                1       2       3       4       5       6       7       8       9       10
Basic            1.27^n  1.39^n  1.46^n  1.53^n  1.57^n  1.60^n  1.63^n  1.67^n  1.69^n  1.70^n
Outdegree        1.14^n  1.23^n  1.30^n  1.37^n  1.42^n  1.47^n  1.51^n  1.54^n  1.56^n  1.59^n
BFS              1.09^n  1.16^n  1.24^n  1.31^n  1.37^n  1.42^n  1.45^n  1.49^n  1.52^n  1.53^n
Feedback         1.10^n  1.28^n  1.39^n  1.47^n  1.53^n  1.56^n  1.60^n  1.64^n  1.66^n  1.68^n
FVS + Outdegree  1.05^n  1.13^n  1.21^n  1.29^n  1.35^n  1.41^n  1.46^n  1.49^n  1.52^n  1.55^n

Figure 2: Base of the empirical time complexity (the value a of a^n) of the proposed algorithms for finding singleton attractors, plotted against the indegree K for the Basic, Outdegree, BFS, Feedback, and FVS + outdegree algorithms.

Figure 3: Number of iterations done by the proposed algorithms for K = 2, plotted against the number of nodes: Basic O(1.39^n), Outdegree O(1.23^n), BFS O(1.16^n), Feedback O(1.28^n), FVS + outdegree O(1.13^n).
2.4 GHz CPUs and 4 GB RAM running under the Linux (version 2.6.9) operating system, where the gcc compiler (version
3.4.5) was used with optimization option -O3.

Table 3 shows the empirical time complexity of each proposed method for each K. We used a tool for GNUPLOT to fit
the function b · a^n to the experimental results. The tool uses
the nonlinear least-squares (NLLS) Marquardt-Levenberg algorithm. Figure 2 is a graphical representation of the results
of Table 3. It is seen that the FVS + Outdegree method is the
fastest in most cases.

Figure 3 shows, as an example, the average number of
iterations with respect to the number of genes for K = 2.
Figure 4 shows the average computation time with respect to
the number of genes when K = 2; similar results were
obtained for other values of K.

The time complexities estimated from the results of computational experiments are a little different from those obtained by theoretical analysis. However, this is reasonable
since, in our theoretical analysis, we assumed that the number of genes is very large, we made some approximations,
and there were also small numerical errors in computing the
maximum values of g(s).

3.3.2. Experiments on scale-free networks

It is known that many real biological networks have the scale-free property (i.e., the degree distribution approximately follows a power law) [28]. Furthermore, it is observed that in
gene regulatory networks, the outdegree distribution follows
a power law and the indegree distribution follows a Poisson
distribution [29]. Thus, we examined networks with scale-free topology.
We generated scale-free networks with a power-law outdegree distribution (∝ k−2 ) and a Poisson indegree distribution (with the average indegree 2) as follows. We first choose
the number of outputs for each gene from a power-law distribution. That is, gene vi has Li outputs where all the Li are
drawn from a power-law distribution. Then, we choose the Li
outputs of each gene vi randomly with uniform probability
from the n genes. Once each gene has been assigned a set of
outputs, the inputs of all genes are fully determined, because
v j is an input of vi if vi is an output of v j . Since Li output
genes are chosen randomly for each gene vi , the indegree distribution should follow a Poisson distribution.
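This generation scheme can be sketched directly. The sampling helper below is our own; the exponent default of 2 follows the ∝ k^−2 distribution above, while the outdegree cutoff and seed are arbitrary choices for illustration:

```python
import random

def scale_free_network(n, gamma=2.0, max_out=None, seed=0):
    """Assign each gene an outdegree L drawn from P(L) proportional to
    L**(-gamma), then wire each gene's L outputs to uniformly chosen
    distinct genes. Returns inputs[i] = set of genes regulating gene i."""
    rng = random.Random(seed)
    max_out = max_out or n - 1
    weights = [L ** (-gamma) for L in range(1, max_out + 1)]
    inputs = [set() for _ in range(n)]
    for i in range(n):
        L = rng.choices(range(1, max_out + 1), weights=weights)[0]
        for j in rng.sample(range(n), L):   # outputs of gene i
            inputs[j].add(i)                # so i is an input of j
    return inputs
```

Since each gene's outputs are chosen uniformly at random, the resulting indegrees are approximately Poisson distributed, as described above.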
Figure 5 compares the outdegree-based algorithm, the
BFS-based algorithm and the FVS + Outdegree algorithm for
scale-free networks generated as above and for random networks with constant indegree 2, where the average CPU time
Algorithm 4:
Input: a Boolean network G(V, F) and a period p
Output: all of the small attractors with period p
Initialize m := 1;
Procedure IdentSmallAttractor(v, m)
  if m = n + 1 then Output v1(t), v2(t), ..., vn(t), return;
  for b = 0 to 1 do
    vm(t) := b;
    for p' = 0 to p − 1 do compute v(t + p' + 1) from v(t + p');
    if it is found that vj(t + p) ≠ vj(t) for some j ≤ m then
      continue;
    else IdentSmallAttractor(v, m + 1);
  return.

Figure 4: Elapsed time (in seconds) of the proposed algorithms (Basic, Outdegree, BFS, Feedback, FVS + outdegree) for random networks with K = 2, plotted against the number of nodes.
1000
100
Elapsed time (s)
10
1
0.1
0.01
0.001
1e-04
1e-05
40
50
60
70
80
90
100
110
120
The number of nodes
Fix/outdegree
Fix/BFS
Fix/FVS + outdegree
PS/outdegree
PS/BFS
PS/FVS + outdegree
Figure 5: Elapsed time (in seconds) of some of the proposed algorithms for random networks with K = 2 (Fix) and scale-free networks (PS).
BFS-based algorithm, and O(1.12n ) versus O(1.05n ) for the
FVS + Outdegree algorithm, where (random) versus (scalefree) is shown for each case. The average case complexities
for random networks are better than those in Table 3 and are
closer to the theoretical time complexities shown in Table 2.
These results are reasonable because networks with much
larger number of nodes were examined in this case.
It should be noted that Devloo et al. proposed constraint
programming based methods for finding steady-states in
some kinds of biological networks [20]. Their methods use a
backtracking technique, which is very close to our proposed
recursive algorithms, and may also be applied to Boolean networks. Their methods were applied to networks up to several
thousand nodes with indegree = outdegree = 2. Since different types of networks were used, our proposed methods cannot be directly compared with their methods. Their methods
include various heuristics and may be more useful in practice
than our proposed methods. However, no theoretical analysis was performed on the computational complexity of their
methods.
4.
In this section, we modify the gene-ordering-based algorithms presented in Section 2 to find cyclic attractors with
short periods. We also perform a theoretical analysis and
computational experiments.
4.1.
was taken over 100 networks for each case and a PC with
Xeon 5160 3 GHz CPUs with 8 GB RAM was used. The result
is interesting and we observed that all algorithms work much
faster for scale-free networks than for random networks. This
result is reasonable because scale-free networks have a much
larger number of high degree nodes than random networks
and thus heuristics based on the outdegree-based ordering
or the BFS-based ordering should work efficiently. The average case time complexities estimated from this experimental result are as follows: O(1.19n ) versus O(1.09n ) for the
outdegree-based algorithm, O(1.12n ) versus O(1.09n ) for the
FINDING SMALL ATTRACTORS
Modifications of algorithms
The basic idea of our modifications is very simple. Instead
of checking whether or not vi (t + 1) = vi (t) holds, we check
whether or not vi (t + p) = vi (t) holds. The pseudocode of the
modified basic recursive algorithm is given in Algorithm 4.
This procedure computes v(t + p) from the truth assignments on the first m genes of v(t). Values of some genes of
v(t + p) may not be determined because these genes may also
depend on the last (n − m) genes of v(t). If either v j (t + p) =
v j (t) holds or the value of v j (t + p) is not determined for
each j = 1, . . . , m, the algorithm will continue to the next
EURASIP Journal on Bioinformatics and Systems Biology
recursive step. As in Section 2, we can combine this algorithm
with the outdegree-based ordering and the BFS-based ordering.
In these algorithms, it is assumed that the period p is
given in advance. However, the algorithms can be modified
for identifying all cyclic attractors with period at most P. For
that purpose, we simply need to execute the algorithms for
each of p = 1, 2, . . . , P. Though this method does not seem to
be practical, its theoretical time complexity is still better than
O(2n ) for small P. Suppose that the average case time complexity for p is O(T p (n)). Then, this simple method would
take O( Pp=1 T p (n)) ≤ O(P · TP (n)) time, which is still faster
than O(2n ) if TP (n) = o(2n ) and P is bounded by some polynomial of n.
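A runnable sketch of the modified recursive procedure, using None for gene values that are not yet determined; the encoding of a network as truth-table dictionaries plus input lists is an illustrative assumption.

```python
def small_attractor_states(funcs, inputs, p):
    """Enumerate all states v with v(t + p) = v(t) by assigning genes one at
    a time and pruning as soon as some determined v_j(t + p) differs from
    v_j(t), as in Algorithm 4. funcs[i] maps the value tuple of gene i's
    inputs to 0/1; inputs[i] lists the genes feeding gene i."""
    n = len(funcs)

    def step(state):                 # one synchronous update; None = unknown
        return [None if any(state[j] is None for j in ins) else
                f[tuple(state[j] for j in ins)]
                for f, ins in zip(funcs, inputs)]

    results = []

    def rec(state, m):               # genes 0..m-1 assigned, the rest None
        if m == n:
            results.append(tuple(state))
            return
        for b in (0, 1):
            cand = state[:m] + [b] + [None] * (n - m - 1)
            cur = cand
            for _ in range(p):       # compute v(t + p) from v(t)
                cur = step(cur)
            if all(cur[j] is None or cur[j] == cand[j] for j in range(m + 1)):
                rec(cand, m + 1)     # no contradiction found: recurse deeper

    rec([None] * n, 0)
    return results
```

For the two-gene network v1(t + 1) = v2(t), v2(t + 1) = v1(t), calling this with p = 1 returns the two singleton attractors (0, 0) and (1, 1), while p = 2 returns all four states, since every state lies on a cycle of length at most 2.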
4.2. Theoretical analysis

Before giving the experimental results, we perform a theoretical analysis of the modified basic recursive algorithm. Suppose that Boolean networks with maximum indegree K are given uniformly at random. Then the average case time complexity of the modified basic recursive algorithm for period 1 to 5 and K = 1 to K = 10 is given in Table 4.

Let the period of the attractor be p. We assume w.l.o.g. as before that the indegree of all genes is K. As in Section 2.2, we consider the first m genes among all n genes. Given the states of all m genes at time t, we need to know the states of all these genes at time t + p. The probability that vi(t) = vi(t + p) holds for each i ≤ m is approximated by

  P[vi(t) = vi(t + p)] = 0.5 · (m/n)^K · (m/n)^{K^2} · · · (m/n)^{K^p},   (30)

where (m/n)^K means that the K input genes to gene vi at time t + p − 1 are among the first m genes, (m/n)^{K^2} means that at time t + p − 2 the input genes to the K input genes to gene vi are also among the first m genes, and so on.

Then, the probability that the algorithm examines some specific truth assignment on m genes is approximately given by

  [1 − P(vi(t) = vi(t + p))]^m = [1 − 0.5 · (m/n)^K · (m/n)^{K^2} · · · (m/n)^{K^p}]^m.   (31)

Therefore, the number of total recursive calls executed for these m genes is

  f(m) = 2^m · [1 − P(vi(t) = vi(t + p))]^m
       = 2^m · [1 − 0.5 · (m/n)^K · (m/n)^{K^2} · · · (m/n)^{K^p}]^m.   (32)

As in Section 2.2, we can compute the maximum value of f(m). The results are given in Table 4.

[Figure 6: Base of the empirical time complexity (the value a in a^n) of the proposed algorithms (Basic, Outdegree, BFS) for finding cyclic attractors with period 2, plotted against the indegree K.]

[Figure 7: Base of the empirical time complexity (the value a in a^n) of the proposed algorithms (Basic, Outdegree, BFS) for finding cyclic attractors with period 3, plotted against the indegree K.]
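Numerically maximizing f(m) over m reproduces the bases reported in Table 4; the following sketch (the function name and grid size n are ours) evaluates log f(m) on a grid and returns the estimated base a.

```python
import math

def avg_case_base(K, p, n=10_000):
    """Estimate the base a of the O(a^n) average-case bound by maximizing
    f(m) = 2^m * (1 - 0.5 * (m/n)^(K + K^2 + ... + K^p))^m over m."""
    e = sum(K ** j for j in range(1, p + 1))    # total exponent of m/n
    best = 0.0
    for m in range(1, n + 1):
        q = 0.5 * (m / n) ** e                  # ≈ P[v_i(t) = v_i(t + p)]
        best = max(best, m * math.log(2.0 * (1.0 - q)))
    return math.exp(best / n)                   # a such that max_m f(m) ≈ a^n
```

For example, avg_case_base(2, 1) ≈ 1.35 and avg_case_base(2, 2) ≈ 1.57, matching the K = 2 entries of Table 4.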
4.3. Computational experiments

Computational experiments were also performed to examine the time complexity of the algorithms for finding small attractors. The environment and parameters of the experiments were the same as in Section 3.3.1. Though FVS-based algorithms can also be modified for small attractors, they are not efficient for p > 1. Therefore, we only examined gene-ordering-based algorithms.

Figures 6 to 8 show the time complexity of the algorithms estimated from the results of computational experiments for p = 2 to p = 4 and for K = 1 to K = 10. When K is comparatively small, the outdegree-based ordering method is the
most efficient. But as K increases, all three methods perform the same, which is equivalent to the worst case in finding the attractors, that is, O(2^n). The results obtained from the numerical experiments for the modified basic recursive algorithm are consistent with the theoretical results presented in Section 4.2.

Table 4: Theoretical time complexities for the modified basic algorithm for finding small attractors with period p.

K      p=1     p=2     p=3     p=4     p=5
1      1.23^n  1.35^n  1.43^n  1.49^n  1.53^n
2      1.35^n  1.57^n  1.72^n  1.83^n  1.90^n
3      1.43^n  1.70^n  1.86^n  1.94^n  1.97^n
4      1.49^n  1.78^n  1.92^n  1.97^n  1.99^n
5      1.53^n  1.83^n  1.95^n  1.99^n  1.99^n
6      1.57^n  1.87^n  1.97^n  1.99^n  1.99^n
7      1.60^n  1.89^n  1.97^n  1.99^n  1.99^n
8      1.62^n  1.91^n  1.98^n  1.99^n  1.99^n
9      1.65^n  1.92^n  1.99^n  1.99^n  1.99^n
10     1.67^n  1.93^n  1.99^n  1.99^n  1.99^n

[Figure 8: Base of the empirical time complexity (the value a in a^n) of the proposed algorithms (Basic, Outdegree, BFS) for finding cyclic attractors with period 4, plotted against the indegree K.]

5. HARDNESS RESULT

As mentioned in Section 1, Akutsu et al. [24] and Milano and Roli [25] showed that finding a singleton attractor (or an attractor with the shortest period) is NP-hard. Those results justify our proposed algorithms, which take exponential time in the worst case (and even in the average case). However, the proof is omitted in [24], and the proof in [25] is a bit complicated: the Boolean functions assigned in the transformed Boolean network are much longer than those in the original satisfiability problem. Here we give a simpler and complete proof.

Theorem 1. Finding an attractor with the shortest period is NP-hard.

Proof. We show that deciding whether or not there exists a singleton attractor is NP-hard, from which the theorem follows, since a singleton attractor is an attractor with the shortest period (if one exists).

We use a simple polynomial time reduction from 3SAT [26] to the singleton attractor problem. Let x1, . . . , xN be Boolean variables (i.e., 0-1 variables). Let c1, . . . , cL be a set of clauses over x1, . . . , xN, where each clause is a logical OR of at most three literals. It should be noted that a literal is a variable or its negation (logical NOT). Then, 3SAT is the problem of asking whether or not there exists an assignment of 0-1 values to x1, . . . , xN which satisfies all the clauses (i.e., the values of all clauses are 1).

From an instance of 3SAT, we construct an instance of the singleton attractor problem. We let the set of vertices (nodes) be V = {v1, . . . , vN+L}, where each vi for i = 1, . . . , N corresponds to xi and each vN+i for i = 1, . . . , L corresponds to ci. For each vi such that i ≤ N, we make the following assignment:

  vi(t + 1) = vi(t).   (33)

Suppose that fi(xi1, xi2, xi3) is the Boolean function corresponding to ci in 3SAT. Then, for each vN+i, we assign the following function:

  vN+i(t + 1) = fi(vi1(t), vi2(t), vi3(t)) ∨ ¬vN+i(t).   (34)

Figure 9 is an example of a reduction from 3SAT to the singleton attractor problem.

Here, we show that the 3SAT instance is satisfiable if and only if there exists a singleton attractor. Suppose that there exists an assignment of Boolean values b1, . . . , bN to x1, . . . , xN which satisfies all clauses c1, . . . , cL. Then, we let

  vi(0) = bi   for i = 1, . . . , N,
  vi(0) = 1    for i = N + 1, . . . , N + L.   (35)

It is straightforward to see that v(0) = (v1(0), . . . , vN+L(0)) is a singleton attractor (i.e., v(0) = v(1)).

Conversely, suppose that there exists a singleton attractor. Let v(0) = (v1(0), . . . , vN+L(0)) be the state of the singleton attractor. Then, vN+i(0) must be 1 for all i = 1, . . . , L. Otherwise (i.e., vN+i(0) = 0), vN+i(1) would be 1, which contradicts the assumption that v(0) is a singleton attractor. Furthermore, fi(vi1(0), vi2(0), vi3(0)) = 1 must hold. Otherwise, vN+i(1) would be 0, since vN+i(0) = 1 and fi(vi1(0), vi2(0), vi3(0)) = 0 hold. This again contradicts the assumption that v(0) is a singleton attractor. Therefore, by assigning vi(0) to xi for i = 1, . . . , N, all the clauses are satisfied.

Since the reduction can trivially be done in polynomial time, we have the theorem.
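The reduction can be exercised directly. The check below is our illustrative encoding (a literal is +k for xk and −k for its negation): it builds v(1) from v(0) = (b1, . . . , bN, 1, . . . , 1) under equations (33) and (34) and verifies that v(0) is a fixed point exactly when the assignment satisfies every clause.

```python
def is_singleton_attractor(clauses, assignment):
    """Return True iff (assignment, 1, ..., 1) is a fixed point of the
    Boolean network constructed in the proof of Theorem 1."""
    N = len(assignment)
    v0 = list(assignment) + [1] * len(clauses)   # v_1..v_N, v_{N+1}..v_{N+L}
    v1 = list(assignment)                        # (33): v_i(t+1) = v_i(t)
    for i, clause in enumerate(clauses):
        # f_i = 1 iff some literal of clause c_i is satisfied by v0
        f_i = int(any(v0[abs(l) - 1] == (1 if l > 0 else 0) for l in clause))
        v1.append(f_i | (1 - v0[N + i]))         # (34): f_i OR (NOT v_{N+i})
    return v1 == v0
```

A satisfying assignment yields a singleton attractor, while an unsatisfied clause forces the corresponding clause node to oscillate, so no such fixed point exists.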
[Figure 9: Example of a reduction from 3SAT to the singleton attractor problem. An instance of 3SAT {x1 ∨ x2 ∨ x3, x1 ∨ x3 ∨ x4, x2 ∨ x3 ∨ x4} is transformed into this Boolean network.]
6. CONCLUSION
In this paper, we have presented fast algorithms for identifying singleton attractors and cyclic attractors with short periods. The proposed algorithms are much faster than the naive enumeration-based algorithm. However, they cannot be applied to random networks with several hundred or more genes, and they may not be faster than the constraint-programming-based algorithms in [20]. Nevertheless, the most important point of this work is that the average case time complexities of the ordering-based algorithms are analyzed and shown to be better than O(2^n). We hope that our work stimulates further development of faster algorithms and deeper theoretical analysis.
It is interesting that the computational experiments suggest that our proposed algorithms are much faster for scale-free networks than for random networks. However, we have not yet performed a theoretical analysis for scale-free networks. Thus, theoretical analysis of the average case time complexity for scale-free networks (precisely, networks with a power-law outdegree distribution and a Poisson indegree distribution) is left as future work.
Although this paper focused on the Boolean network as a
model of biological networks, the techniques proposed here
may be useful for designing algorithms for finding steady
states in other models and for theoretical analysis of such
algorithms. For instance, Mochizuki performed theoretical
analysis on the number of steady states in some continuous biological networks that are based on nonlinear differential equations [21]. However, the core part of the analysis
is done in a combinatorial manner and is very close to that
for Boolean networks. Thus, it may be possible to develop
fast algorithms for finding steady states in such continuous
network models. Application and extension of the proposed
techniques to other types of biological networks are important future research topics.
Finally, it is interesting to compare the complexities of
four problems for three classes of networks: simulation of
network behavior (almost trivial), identification of attractors
(this paper), identification of networks [30, 31], and finding control strategies [32] for trees, acyclic graphs, and general graphs. These four problems constitute a more complete picture of modeling genetic regulatory networks with
a Boolean network.

Table 5: Comparison of time complexities for simulation of network behavior, identification of attractors, finding control strategies, and identification of networks. P means that the problem can be solved in polynomial time.

                                              Tree      Acyclic graph  General graph
Simulation of network                         P         P              P
Identification of attractor                   P         P              NP-hard
Finding control strategies                    P         NP-hard        NP-hard
Identification of network                     NP-hard   NP-hard        NP-hard
Identification of network (bounded indegree)  P         P              P

Simulation of a Boolean network is a
trivial but important step in analyzing the model. Attractors describe the long-run behavior of the Boolean network system. Finding a control strategy asks how the system can be made to evolve in a desired way. Identification of genetic regulatory networks is the first step in obtaining the model from data. Table 5 shows the complexities of these problems for several network structures. Although much work has been done on these problems, computational complexity remains an important issue. It is also left as future work to study how to cope with the high computational complexity (e.g., NP-hardness) of these problems.
ACKNOWLEDGMENTS
We thank anonymous reviewers for helpful comments. TA
was partially supported by a Grant-in-Aid “Systems Genomics” from MEXT, Japan and by the Cell Array Project
from NEDO, Japan. WKC was partially supported by Hung
Hing Ying Physical Research Fund, HKU GRCC Grants nos.
10206647, 10206483, and 10206147. MKN was partially supported by RGC 7046/03P, 7035/04P, 7035/05P, and HKBU
FRGs. S.-Q. Zhang and M. Hayashida contributed equally to
this work.
REFERENCES
[1] J. E. Celis, M. Kruhøffer, I. Gromova, et al., “Gene expression profiling: monitoring transcription and translation products using DNA microarrays and proteomics,” FEBS Letters,
vol. 480, no. 1, pp. 2–16, 2000.
[2] T. R. Hughes, M. Mao, A. R. Jones, et al., “Expression profiling using microarrays fabricated by an ink-jet oligonucleotide
synthesizer,” Nature Biotechnology, vol. 19, no. 4, pp. 342–347,
2001.
[3] R. J. Lipshutz, S. P. A. Fodor, T. R. Gingeras, and D. J. Lockhart, “High density synthetic oligonucleotide arrays,” Nature
Genetics, vol. 21, supplement 1, pp. 20–24, 1999.
[4] D. J. Lockhart and E. A. Winzeler, “Genomics, gene expression and DNA arrays,” Nature, vol. 405, no. 6788, pp. 827–836,
2000.
[5] H. de Jong, “Modeling and simulation of genetic regulatory
systems: a literature review,” Journal of Computational Biology,
vol. 9, no. 1, pp. 67–103, 2002.
[6] L. Glass and S. A. Kauffman, “The logical analysis of continuous, nonlinear biochemical control networks,” Journal of Theoretical Biology, vol. 39, no. 1, pp. 103–129, 1973.
[7] S. A. Kauffman, “Metabolic stability and epigenesis in randomly constructed genetic nets,” Journal of Theoretical Biology,
vol. 22, no. 3, pp. 437–467, 1969.
[8] S. A. Kauffman, “Homeostasis and differentiation in random
genetic control networks,” Nature, vol. 224, no. 215, pp. 177–
178, 1969.
[9] S. A. Kauffman, “The large scale structure and dynamics of genetic control circuits: an ensemble approach,” Journal of Theoretical Biology, vol. 44, no. 1, pp. 167–190, 1974.
[10] S. Huang, “Gene expression profiling, genetic networks, and
cellular states: an integrating concept for tumorigenesis and
drug discovery,” Journal of Molecular Medicine, vol. 77, no. 6,
pp. 469–480, 1999.
[11] S. A. Kauffman, The Origins of Order: Self-Organization and
Selection in Evolution, Oxford University Press, New York, NY,
USA, 1993.
[12] R. Somogyi and C. Sniegoski, “Modeling the complexity of genetic networks: understanding multigenic and pleiotropic regulation,” Complexity, vol. 1, no. 6, pp. 45–63, 1996.
[13] I. Shmulevich and W. Zhang, “Binary analysis and
optimization-based normalization of gene expression
data,” Bioinformatics, vol. 18, no. 4, pp. 555–565, 2002.
[14] D. Thieffry, A. M. Huerta, E. Pérez-Rueda, and J. ColladoVides, “From specific gene regulation to genomic networks:
a global analysis of transcriptional regulation in Escherichia
coli,” BioEssays, vol. 20, no. 5, pp. 433–440, 1998.
[15] S. Huang, “Cell state dynamics and tumorigenesis in
Boolean regulatory networks,” InterJournal Genetics, MS: 416,
http://www.interjournal.org/
[16] B. Drossel, “Number of attractors in random Boolean networks,” Physical Review E, vol. 72, no. 1, Article ID 016110,
5 pages, 2005.
[17] B. Drossel, T. Mihaljev, and F. Greil, “Number and length of attractors in a critical Kauffman model with connectivity one,”
Physical Review Letters, vol. 94, no. 8, Article ID 088701, 4
pages, 2005.
[18] B. Samuelsson and C. Troein, “Superpolynomial growth in the
number of attractors in Kauffman networks,” Physical Review
Letters, vol. 90, no. 9, Article ID 098701, 4 pages, 2003.
[19] J. E. S. Socolar and S. A. Kauffman, “Scaling in ordered and
critical random Boolean networks,” Physical Review Letters,
vol. 90, no. 6, Article ID 068702, 4 pages, 2003.
[20] V. Devloo, P. Hansen, and M. Labbé, “Identification of all
steady states in large networks by logical analysis,” Bulletin of
Mathematical Biology, vol. 65, no. 6, pp. 1025–1051, 2003.
[21] A. Mochizuki, “An analytical study of the number of steady
states in gene regulatory networks,” Journal of Theoretical Biology, vol. 236, no. 3, pp. 291–310, 2005.
[22] R. Pal, I. Ivanov, A. Datta, M. L. Bittner, and E. R. Dougherty,
“Generating Boolean networks with a prescribed attractor
structure,” Bioinformatics, vol. 21, no. 21, pp. 4021–4025,
2005.
[23] X. Zhou, X. Wang, R. Pal, I. Ivanov, M. Bittner, and E. R.
Dougherty, “A Bayesian connectivity-based approach to constructing probabilistic gene regulatory networks,” Bioinformatics, vol. 20, no. 17, pp. 2918–2927, 2004.
[24] T. Akutsu, S. Kuhara, O. Maruyama, and S. Miyano, “A system
for identifying genetic networks from gene expression patterns
produced by gene disruptions and overexpressions,” Genome
Informatics, vol. 9, pp. 151–160, 1998.
[25] M. Milano and A. Roli, “Solving the satisfiability problem
through Boolean networks,” in Proceedings of the 6th Congress
of the Italian Association for Artificial Intelligence on Advances
in Artificial Intelligence, vol. 1792 of Lecture Notes in Artificial Intelligence, pp. 72–83, Springer, Bologna, Italy, September
1999.
[26] M. R. Garey and D. S. Johnson, Computers and Intractability: A
Guide to the Theory of NP-Completeness, W.H. Freeman, New
York, NY, USA, 1979.
[27] G. Even, J. Naor, B. Schieber, and M. Sudan, “Approximating
minimum feedback sets and multicuts in directed graphs,” Algorithmica, vol. 20, no. 2, pp. 151–174, 1998.
[28] A.-L. Barabási and R. Albert, “Emergence of scaling in random
networks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.
[29] N. Guelzim, S. Bottani, P. Bourgine, and F. Képès, “Topological and causal structure of the yeast transcriptional regulatory
network,” Nature Genetics, vol. 31, no. 1, pp. 60–63, 2002.
[30] T. Akutsu, S. Miyano, and S. Kuhara, “Identification of genetic
networks from a small number of gene expression patterns under the Boolean network model,” in Proceedings of the 4th Pacific Symposium on Biocomputing (PSB ’99), vol. 4, pp. 17–28,
Big Island of Hawaii, Hawaii, USA, January 1999.
[31] T. Akutsu, S. Kuhara, O. Maruyama, and S. Miyano, “Identification of genetic networks by strategic gene disruptions
and gene overexpressions under a Boolean model,” Theoretical Computer Science, vol. 298, no. 1, pp. 235–251, 2003.
[32] T. Akutsu, M. Hayashida, W.-K. Ching, and M. K. Ng, “Control of Boolean networks: hardness results and algorithms
for tree structured networks,” Journal of Theoretical Biology,
vol. 244, no. 4, pp. 670–679, 2007.
Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 97356, 8 pages
doi:10.1155/2007/97356
Research Article
Fixed Points in Discrete Models for
Regulatory Genetic Networks
Dorothy Bollman,1 Omar Colón-Reyes,1 and Edusmildo Orozco2
1 Department of Mathematical Sciences, University of Puerto Rico, Mayaguez, PR 00681, USA
2 Department of Computer Science, University of Puerto Rico, Río Piedras, San Juan, PR 00931-3355, USA
Received 1 July 2006; Revised 22 November 2006; Accepted 20 February 2007
Recommended by Tatsuya Akutsu
It is desirable to have efficient mathematical methods to extract information about regulatory interactions between genes from repeated measurements of gene transcript concentrations. One piece of information of interest is whether the dynamics reaches a steady state. In this paper we develop tools that enable the detection of steady states that are modeled by fixed points in discrete finite
dynamical systems. We discuss two algebraic models, a univariate model and a multivariate model. We show that these two models
are equivalent and that one can be converted to the other by means of a discrete Fourier transform. We give a new, more general
definition of a linear finite dynamical system and we give a necessary and sufficient condition for such a system to be a fixed point
system, that is, all cycles are of length one. We show how this result for generalized linear systems can be used to determine when
certain nonlinear systems (monomial dynamical systems over finite fields) are fixed point systems. We also show how it is possible
to determine in polynomial time when an ordinary linear system (defined over a finite field) is a fixed point system. We conclude
with a necessary condition for a univariate finite dynamical system to be a fixed point system.
Copyright © 2007 Dorothy Bollman et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. INTRODUCTION
Finite dynamical systems are dynamical systems on finite sets. Examples include cellular automata and Boolean
networks, (e.g., [1]) with applications in many areas of
science and engineering (e.g., [2, 3]), and more recently
in computational biology (e.g., [4–6]). A common question in all of these applications is how to analyze the dynamics of the models without enumerating all state transitions. This paper presents partial solutions to this problem.
Because of technological advances such as DNA microarrays, it is possible to measure gene transcripts from a large
number of genes. It is desirable to have efficient mathematical methods to extract information about regulatory interactions between genes from repeated measurements of gene transcript concentrations.
One piece of information of interest about regulatory interactions is whether the dynamics reaches a steady state. In the
words of Fuller (see [7]): “this paradigm closely parallels
the goal of professionals who aim to understand the flow of
molecular events during the progression of an illness and to
predict how the disease will develop and how the patient will
respond to certain therapies.”
The work of Fuller et al. [7] serves as an example. When
the gene expression profile of human brain tumors was analyzed, these were divided into three classes—high grade,
medium grade, and low grade. A key gene expression event
was identified, which was a high expression of insulin-like
growth factor binding protein 2 (IGFBP2) occurring only
in high-grade brain tumors. It can be assumed that gene
expression events were initiated at some stages in low-level
tumors and may have led to the state when IGFBP2 is activated. The activation of IGFBP2 can be understood to
be a steady state. If we model the kinetics and construct
a model that reconstructs the genetic regulatory network
that activates during the brain tumor process, then we may
be able to predict the convergence of events that lead to
the activation of IGFBP2. In the same way, we also want
to know what happens in the next step following the activation of IGFBP2. Our goal is to develop tools that will
enable this type of analysis in the case of modeling gene
regulatory networks by means of discrete dynamical systems.
The use of polynomial dynamical systems to model biological phenomena, in particular gene regulatory networks, has proved to be as valid as the use of continuous models. Laubenbacher and Stigler (see [6]) point out, for example, that most ordinary differential equation models cannot be solved analytically and that numerical solutions of
such time-continuous systems necessitate approximations by
time-discrete systems, so that ultimately, the two types of
models are not that different.
Once a gene regulatory network is modeled, in our case
by finite fields, or by finitely generated modules, we obtain a
finite dynamical system. Our goal is to determine if the dynamical system represents a steady-state gene regulatory network (i.e., if every state eventually enters a steady state). This
is a crucial task. Shmulevich et al. (see [8]) have shown that
the steady-state distribution is necessary in order to compute
the long term influence that is a measure of gene impact over
other genes.
The rest of the paper is organized as follows. In Section 2
we give some basic definitions and facts about finite dynamical systems and their associated state spaces. In Section 3
we discuss multivariate and univariate finite field models for
genetic networks and show that they are equivalent. Each
of the models can be converted to the other by a discrete
Fourier transform. Section 4 is devoted to fixed point systems. We give a new definition of linear finite dynamical systems and give necessary and sufficient conditions for such a
system to be a fixed point system. We review results concerning monomial fixed point systems and show how our results
concerning linear systems can be used to determine when
a monomial finite dynamical system over an arbitrary finite
field is a fixed point system. We show how fixed points can be
determined in the univariable model by solving a polynomial
equation over a finite field and we give a necessary condition
for a finite dynamical system to be a fixed point system. Finally, in Section 5 we discuss some implementation issues.
2. PRELIMINARIES
A finite dynamical system (fds) is an ordered pair (X, f )
where X is a finite set and f is a function that maps X into
itself, that is, f : X → X. The state space of an fds (X, f ) is a
digraph (i.e., directed graph) whose nodes are labeled by the
elements of X and whose edges consist of all ordered pairs
(x, y) ∈ X × X such that f (x) = y. We say that two finite
dynamical systems are isomorphic if there exists a graph isomorphism between their state spaces.
Let G = (V , E) be a digraph. A path in G of the form
(v1 , v2 ), (v2 , v3 ), . . . , (vn−1 , vn ), (vn , v1 ), where v1 , v2 , . . . , vn are
distinct members of V is a cycle of length n. We define a tree to be a digraph T = (V, E) which has a unique node v0, called the root of T, such that (a) (v0, v0) ∈ E, (b) for any node v ≠ v0, there is a path from v to v0, and (c) T has no “semicycles” (i.e., alternating sequences of nodes and edges v1, x1, v2, . . . , xn, vn+1, n ≠ 0, where v1 = vn+1 and each xi is either (vi, vi+1) or (vi+1, vi)) other than the trivial one (v0, (v0, v0), v0). (Such a tree with the edge (v0, v0) deleted is sometimes called an “in-tree” with “sink” v0.)
Let T be a tree, let nT = T1 ∪ T2 ∪ · · · ∪ Tn be the union of n copies T1, T2, . . . , Tn of T, and let ri be the root of Ti. Define T(n) to be the digraph obtained from nT by deleting the edges (ri, ri), i = 1, 2, . . . , n, and adjoining the edges (ri, rj), i, j = 1, 2, . . . , n, where j = i + 1 mod n. We call T(n) an n-cycled tree. Note that by definition, every tree is an n-cycled tree with n = 1. Note also that by definition, a digraph Tr = ({r}, {(r, r)}) consisting of a single trivial cycle is a tree, and hence every cycle of length n is isomorphic to Tr(n) and hence is an n-cycled tree.
The product of two digraphs G1 = (V1 , E1 ) and G2 =
(V2 , E2 ), denoted G1 × G2 , is the digraph G = (V , E) where
V = V1 × V2 (the Cartesian product of V1 by V2 ) and E =
{((x1 , y1 ), (x2 , y2 )) ∈ V × V : (x1 , x2 ) ∈ E1 and (y1 , y2 ) ∈
E2 }. The following facts follow easily from the definitions.
Lemmas 1, 3, and 4 have been noted in [9].
Lemma 1. The state space of an fds is the disjoint union of cycled trees.
Of special interest are those fds whose state space consists
entirely of trees. Such an fds is called a fixed point system (fps).
For any finite set X, we call f : X → X nilpotent if there exists a unique x0 ∈ X such that f^k(X) = {x0} for some positive integer k.
Lemma 2. The state space of an fds (X, f) is a tree if and only if f is nilpotent. Hence (X, f) is an fps if f is nilpotent.

Proof. Suppose that the state space of (X, f) is a tree with root x0 and height k. Then f^k(x) = x0 for all x ∈ X, and x0 is the only node with this property. Hence f is nilpotent. Conversely, if f is nilpotent and f^k(X) = {x0}, then by Lemma 1 the state space consists of an n-cycled tree, and since x0 is unique, n = 1.
Example 1. Consider the fds (F2^3, f), where f : F2^3 → F2^3 is defined by f(x, y, z) = (y, 0, x) and F2 is the binary field. In this case f is a nilpotent function, and the state space of (F2^3, f) is the tree shown in Figure 1.
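Since the state space is finite, whether an fds is an fps can be checked by iterating f from every state until the trajectory revisits a state, which necessarily lies on a cycle; the cycle is trivial exactly when that state is fixed. A small sketch (function names ours), applied to Example 1:

```python
def is_fixed_point_system(X, f):
    """True iff every trajectory of the fds (X, f) ends in a cycle of length one."""
    for x in X:
        seen = set()
        while x not in seen:        # follow the trajectory until it repeats
            seen.add(x)
            x = f(x)
        if f(x) != x:               # x lies on a cycle; fps requires f(x) = x
            return False
    return True

# Example 1: f(x, y, z) = (y, 0, x) over F_2 is nilpotent, hence an fps
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
assert is_fixed_point_system(X, lambda s: (s[1], 0, s[0]))
# a nontrivial permutation has a cycle of length > 1, so it is not an fps
assert not is_fixed_point_system(X, lambda s: (1 - s[0], s[1], s[2]))
```

This brute-force check enumerates all state transitions, which is exactly what the algebraic criteria developed later in the paper are meant to avoid.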
Lemma 3. The state space of an fds (X, f ) is the union of cycles
if and only if f is one-to-one.
Lemma 4. The product of a tree and a cycle of length l is a
cycled tree whose cycle has length l.
3. FINITE FIELD MODELS
A finite dynamical system constitutes a very natural discrete
model for regulatory processes (see [10]), in particular genetic networks. Experimental data can be discretized into a
finite set X of expression levels. A network consisting of n
genes is then represented by an fds (X n , f ). The dynamics of
the network is described by a discrete time series

  f(s0) = s1, f(s1) = s2, . . . , f(sk−2) = sk−1.   (1)
Special cases of the finite dynamical model are the
Boolean model and finite field models. In the Boolean model,
Dorothy Bollman et al.
Figure 1: State space of (F_2^3, f), where f(x, y, z) = (y, 0, x) over F_2.
either a gene can affect another gene or not. In a finite field
model, one is able to capture graded differences in gene
expression. A finite field model can be considered as a generalization of the Boolean model since each Boolean operation
can be expressed in terms of the sum and product in Z_2. In particular,

x ∧ y = x · y,    x ∨ y = x + y + x · y,    x̄ = x + 1.   (2)
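The identities in (2) can be checked exhaustively; a small illustrative script (ours, not the paper's):

```python
# Exhaustive check of the identities (2) relating Boolean operations to
# arithmetic in Z_2.
for x in (0, 1):
    assert (x + 1) % 2 == 1 - x                 # NOT:  x-bar = x + 1
    for y in (0, 1):
        assert (x * y) % 2 == (x and y)         # AND:  x * y
        assert (x + y + x * y) % 2 == (x or y)  # OR:   x + y + x*y
print("Boolean AND, OR, NOT agree with their Z_2 expressions")
```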
Two types of finite field models have emerged: the multivariate model [6] and the univariate model [11]. The multivariate model is given by the fds (F_q^n, f), where F_q^n denotes the set of n-tuples over the finite field F_q with q elements. Each coordinate function f_i gives the next state of gene i, given the states of the other genes. The univariate model is given by the fds (F_{q^n}, f), where F_{q^n} is the field with q^n elements. In this case, each value of f represents the next states of the n genes, given the present states.
The two types of finite field models can be considered
equivalent in the following sense.
Definition 1. An fds (X, f) is equivalent to an fds (Y, g) if there is an epimorphism φ : X → Y such that φ ∘ f = g ∘ φ.

Usually it is most convenient to choose the most appropriate model at the outset. However, at the cost of computing all possible values of the map, it is possible to convert one model to the other. The rest of this section is devoted to developing such an algorithm.

It is easy to see that if two fds's are equivalent, then their state spaces are the same up to isomorphism. We can show that for any n-dimensional dynamical system (F_q^n, f) there is an equivalent one-dimensional system (F_{q^n}, g). To see this, consider a primitive element α of F_{q^n}, that is, a generator of the multiplicative group F_{q^n} − {0}. Then there is a natural correspondence between F_q^n and F_{q^n}, given by

φ_α(x_0, . . . , x_{n−1}) = x_0 + x_1α + x_2α^2 + · · · + x_{n−1}α^{n−1}.   (3)

Since for each a ∈ F_{q^n} there exist unique y_i ∈ F_q such that a = y_0 + y_1α + y_2α^2 + · · · + y_{n−1}α^{n−1}, we can define g : F_{q^n} → F_{q^n} by g(a) = (φ_α ∘ f)(y_0, . . . , y_{n−1}). Notice then that g ∘ φ_α = φ_α ∘ f, and therefore the dynamical systems g and f are equivalent.

One important consideration in choosing an appropriate finite field model for a genetic network is the complexity of the needed computational tasks. For example, the evaluation of a polynomial in n variables over F_q, q prime, can be done with O(q^n/n) operations (see [12]) and hence evaluating f in all n of its coordinates is O(q^n), the same number of operations needed for the evaluation of a univariate polynomial over F_{q^n}. However, the complexity of the comparison of two values in F_q^n is O(n), whereas the complexity of the comparison of two values in F_{q^n}, represented as described below, is O(1).

Arithmetic in F_q^n, q prime, is integer arithmetic modulo q. Arithmetic in F_{q^n} is efficiently performed by table lookup methods, as shown below. Nonzero elements of F_{q^n} are represented by powers of a primitive element α. Multiplication is then performed by adding exponents modulo q^n − 1. For addition we make use of a precomputed table of values defined as follows. Every nonzero element of F_{q^n} has a unique representation in the form 1 + α^i, and the unique number z(i), 0 ≤ z(i) ≤ q^n − 2, such that 1 + α^i = α^{z(i)}, is called the Zech log of i. Note that for a ≤ b, α^a + α^b = α^a(1 + α^{b−a}) = α^{a+z(b−a) mod (q^n−1)}. Addition is thus performed by adding one exponent to the Zech log of the difference, which is found in a precomputed table. In order to construct a table of Zech logs for F_{q^n}, we first need a primitive polynomial, which can be found in any one of various tables (e.g., [13]).

Example 2. Let us construct a table of Zech logs for F_{2^5} using the primitive polynomial x^5 + x^2 + 1. Thus we have α^5 = α^2 + 1, where α is a root of x^5 + x^2 + 1. Continuing to compute the powers and making use of this fact, we have α^6 = α^3 + α, α^7 = α^4 + α^2, α^8 = α^5 + α^3 = α^3 + α^2 + 1, . . . , α^31 = 1. Now use these results to compute, for each i = 1, . . . , 30, the number z(i) such that α^i + 1 = α^{z(i)}. For example, since α^5 = α^2 + 1, we have α^5 + 1 = α^2 and so z(5) = 2, and so forth. See Table 1.

Definition 2. Let F be a field and let α ∈ F be an element of order d, that is, α^d = 1 and no smaller power of α equals 1. The discrete Fourier transform (DFT) of blocklength d over F is defined by the matrix T = [α^{ij}], i, j = 0, 1, . . . , d − 1. The inverse discrete Fourier transform is given by T^{−1} = d^{−1}[α^{−ij}], i, j = 0, 1, . . . , d − 1, where d^{−1} denotes the inverse of the field element d = 1 + 1 + · · · + 1 (d times).

It is easy to show that TT^{−1} = I_d, where I_d denotes the d × d identity matrix (see, e.g., [14]). Now an element of F_q is of order d if and only if d divides q − 1. Thus, for every finite field F_q there is a DFT over F_q with blocklength q − 1, defined by [α^{ij}], i, j = 0, 1, . . . , q − 2, where α is a primitive element of F_q. We denote such a DFT by T_{q,α}.
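The Zech-log construction just described is mechanical; the following illustrative sketch (our addition, with field elements encoded as 5-bit integers) builds the tables for F_{2^5} from x^5 + x^2 + 1 and reproduces the values of Table 1:

```python
# Zech logs for F_{2^5}: elements are 5-bit integers, bit k = coefficient of
# alpha^k, with alpha a root of the primitive polynomial x^5 + x^2 + 1.
MOD = 0b100101            # x^5 + x^2 + 1
alog = [0] * 31           # alog[e] = alpha^e
log = {}                  # element -> exponent
a = 1
for e in range(31):
    alog[e] = a
    log[a] = e
    a <<= 1               # multiply by alpha
    if a & 0b100000:
        a ^= MOD          # reduce using alpha^5 = alpha^2 + 1
assert a == 1             # alpha^31 = 1, as in Example 2

def zech(i):
    # z(i) is defined by 1 + alpha^i = alpha^{z(i)}; adding 1 is XOR in char 2
    return log[alog[i] ^ 1]

assert zech(5) == 2       # since alpha^5 + 1 = alpha^2
assert [zech(i) for i in range(1, 16)] == [18, 5, 29, 10, 2, 27, 22, 20, 16, 4, 19, 23, 14, 13, 24]
assert [zech(i) for i in range(16, 31)] == [9, 30, 1, 11, 8, 25, 7, 12, 15, 21, 28, 6, 26, 3, 17]
print("Zech log table verified against Table 1")
```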
Table 1: Zech logs for F_{2^5}.

i      1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
z(i)   18 5  29 10 2  27 22 20 16 4  19 23 14 13 24

i      16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
z(i)   9  30 1  11 8  25 7  12 15 21 28 6  26 3  17

Theorem 1. Let B_0 = (φ_α ∘ f)(0, . . . , 0) and, for each i = 1, 2, . . . , q^n − 1, let B_i = (φ_α ∘ f)(a_{0,i}, a_{1,i}, . . . , a_{n−1,i}), where α is a primitive element of F_{q^n} and where a_{n−1,i}α^{n−1} + · · · + a_{1,i}α + a_{0,i} = α^{i−1}. Then g is given by the polynomial

A_{q^n−1}x^{q^n−1} + A_{q^n−2}x^{q^n−2} + · · · + A_1x + A_0,   (4)

where A_0 = B_0 and

[A_{q^n−1}, A_{q^n−2}, . . . , A_1]^T = −T_{q^n,α} [B_1 − A_0, B_2 − A_0, . . . , B_{q^n−1} − A_0]^T.   (5)

Proof. For each i = 0, 1, . . . , q^n − 2, we have

B_{i+1} = φ_α(f(a_{0,i+1}, a_{1,i+1}, . . . , a_{n−1,i+1}))
        = g(φ_α(a_{0,i+1}, a_{1,i+1}, . . . , a_{n−1,i+1}))
        = g(a_{0,i+1} + a_{1,i+1}α + · · · + a_{n−1,i+1}α^{n−1}) = g(α^i).   (6)

Now every function defined on a finite field F_{q^n} can be expressed as a polynomial of degree not more than q^n − 1. Hence g is of the form (4) and it remains to show that the A_i are given by (5). For this we need only to solve the following system of equations:

g(α^i) = A_{q^n−1}(α^i)^{q^n−1} + A_{q^n−2}(α^i)^{q^n−2} + · · · + A_1α^i + A_0,   i = 0, 1, . . . , q^n − 2.   (7)

Since α is a primitive element of F_{q^n}, we have (α^i)^{q^n−1} = 1 and so

B_{i+1} − A_0 = g(α^i) − A_0 = A_{q^n−1}(α^i)^{q^n−1} + A_{q^n−2}(α^i)^{q^n−2} + · · · + A_1α^i,   i = 0, 1, . . . , q^n − 2.   (8)

Thus,

[B_1 − A_0, B_2 − A_0, . . . , B_{q^n−1} − A_0]^T = d^{−1} T_{q^n,α}^{−1} [A_{q^n−1}, A_{q^n−2}, . . . , A_1]^T,   (9)

where d^{−1} = (q^n − 1)^{−1} = −1. The theorem then follows by applying T_{q^n,α} to both sides of this last equation.

As previously mentioned, the complexity of evaluating a polynomial in n variables over a finite field F_q is O(q^n/n). The complexity of evaluating f in all of its n coordinates is thus O(q^n) and the complexity of evaluating f in all points of F_q^n is thus O(q^{2n}). The computation of the matrix-vector product in (5) involves O(q^{2n}) operations over the field F_{q^n}. However, using any one of a number of classical fast algorithms, such as Cooley-Tukey (see, e.g., [14]), the number of operations can be reduced to O(q^n n).

We illustrate the algorithm given by Theorem 1 with an example.

Example 3. A recent application involves the study and creation of a model for the lac operon [15]. When the bacterium E. coli is in an environment with lactose, the lac operon turns on the enzymes that are needed in order to degrade lactose. These enzymes are beta-galactosidase, lactose permease, and thiogalactoside transacetylase. In [15], a continuous model is proposed that measures the rate of change in the concentration of these enzymes as well as the concentration of mRNA and intracellular lactose. In [16, 17], Laubenbacher and Stigler provide a discrete model for the lac operon given by

(F_2^5, f),  f(x_1, x_2, x_3, x_4, x_5) = (x_3, x_1, x_3 + x_2x_4 + x_2x_3x_4, (1 + x_2)x_4 + x_5 + (1 + x_2)x_4x_5, x_1),   (10)

where x_1 represents mRNA, x_2 represents beta-galactosidase, x_3 represents allolactose, x_4 represents lactose, and x_5 represents permease. In order to find an equivalent univariate fds (F_{2^5}, g), we first find a primitive element α in F_{2^5}. This can be done by finding a "primitive polynomial," that is, an irreducible polynomial of degree 5 over F_2 that has a zero α in F_{2^5} that generates the multiplicative cyclic group F_{2^5} − {0}. Such an α can be found either by trial and error or by the use of tables (see, e.g., [13]).

In our case, we choose α to be a zero of x^5 + x^2 + 1. Next, we compute B_i, i = 0, 1, . . . , 31. By definition B_0 = φ_α(f(0, 0, 0, 0, 0)) = φ_α(0, 0, 0, 0, 0) = 0 and B_i = φ_α(f(a_{0,i}, a_{1,i}, a_{2,i}, a_{3,i}, a_{4,i})) where α^{i−1} = a_{0,i} + a_{1,i}α + a_{2,i}α^2 + a_{3,i}α^3 + a_{4,i}α^4 for i = 1, 2, . . . , 31. So, for example,

B_1 = φ_α(f(1, 0, 0, 0, 0)) = φ_α(0, 1, 0, 0, 1) = α + α^4 = α^{1+z(3)} = α^{30},
B_2 = φ_α(f(0, 1, 0, 0, 0)) = φ_α(0, 0, 0, 0, 0) = 0.   (11)

Continuing we find that [B_1, B_2, . . . , B_{31}] = [α^{30}, 0, α^5, α^3, α^3, α^{26}, α^2, α^8, α^{15}, α^{20}, α^9, α^{26}, α^5, α^8, α^{15}, α^{15}, α^{24}, α^9, α^{30}, α^5, α^8, α^3, α^{15}, α^{26}, α^8, α^9, α^{15}, α^{28}, α^8, α^9, α^3].

Multiplying by the 31 × 31 matrix T_{32,α} = [α^{ij}], 0 ≤ i, j ≤ 30, we obtain [A_{31}, A_{30}, . . . , A_1] and hence the equivalent univariate polynomial, which is g(x) = x + α^{22}x^2 + α^2x^3 + α^{11}x^4 + α^{20}x^5 + x^6 + α^{12}x^7 + α^{25}x^8 + α^{20}x^9 + α^{20}x^{10} + α^2x^{11} + α^5x^{12} + α^5x^{13} + α^{23}x^{14} + α^7x^{16} + α^{27}x^{17} + α^{20}x^{18} + α^6x^{19} + α^{20}x^{20} + α^{27}x^{21} + x^{22} + α^{27}x^{24} + x^{25} + α^{27}x^{26} + α^{16}x^{28}.

4. FIXED POINT SYSTEMS

A fixed point system (fps) is defined to be an fds whose state space consists of trees, that is, contains no cycles other than trivial ones (of length one). The fixed point system problem is the problem of determining whether or not a given fds is an fps. Of course one such method would be the brute force method whereby we examine sequences determined by successive applications of the state map to determine if any such sequence contains a cycle of length greater than one. The worst case occurs when the state space consists of one cycle. Consider such a multivariate fds (F_q^n, f). In order to recognize a maximal cycle f(a_1, a_2, . . . , a_n), f^2(a_1, a_2, . . . , a_n), . . . , such an approach would require backtracking at each step in order to compare the most recent value f^i(a_1, a_2, . . . , a_n) with all previous values. An evaluation requires O(q^n) operations, there are q^n such evaluations, and a comparison of two values requires n steps. The complexity of the complete process is thus O(q^{2n} + n^2) = O(q^{2n}).
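As a sanity check on this brute-force approach, the steady states of the discrete lac operon model of Example 3 can be enumerated directly (an illustrative sketch of ours; the coordinate maps follow the model of Example 3, and the three steady states agree with the fixed points 0, α^3, and α^15 found in Example 5):

```python
from itertools import product

# Brute-force enumeration of the fixed points of the discrete lac operon
# model of Example 3; x1 = mRNA, x2 = beta-galactosidase, x3 = allolactose,
# x4 = lactose, x5 = permease, all arithmetic in F_2.
def f(s):
    x1, x2, x3, x4, x5 = s
    return (
        x3,
        x1,
        (x3 + x2 * x4 + x2 * x3 * x4) % 2,
        ((1 + x2) * x4 + x5 + (1 + x2) * x4 * x5) % 2,
        x1,
    )

fixed = [s for s in product((0, 1), repeat=5) if f(s) == s]
# Three steady states: all-off, lactose-only, and all-on.
assert fixed == [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (1, 1, 1, 1, 1)]
print(fixed)
```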
To date, all results concerning the fixed point system problem have been formulated in terms of multivariate fds. A solution to the fixed point system problem consists of characterizing such an fds (F_q^n, f) in terms of the structure of f. Ideally, such conditions should be amenable to implementations in polynomial time in n. In recent work, Just [18] claims that if the class of regulatory functions contains the quadratic monotone functions x_i ∨ x_j and x_i ∧ x_j, then the fixed point problem for Boolean dynamical systems is NP-hard. In view of this result, it is unlikely that we can achieve the goal of a polynomial time solution to the fixed point problem, at least in the general case. However, the question arises whether the above O(q^{2n}) result can be improved (to, say, O(q^n)) and which special cases of the fixed point problem have polynomial time solutions.
In this section we give a polynomial time solution to the special case of the fixed point problem for linear finite dynamical systems, we review known results for the nonlinear case, and we point out how our results concerning the general linear case for fds over finitely generated modules give a more complete solution to the case of monomial finite field dynamical systems over arbitrary finite fields. We conclude by proposing a new approach to the problem via univariate systems, giving an algorithm for determining the fixed points of a univariate system, and giving a necessary condition for a univariate fds to be an fps.
4.1. Linear fixed point systems
Finite dynamical systems over finite fields that are linear are
very amenable to analysis and have been studied extensively
in the literature (see [2, 9]).
In the multivariate case, a linear system over a finite field is represented by an fds (F_q^n, f) where f can be represented by an n × n matrix A over F_q. The fixed points of a multivariate linear fds (F_q^n, A) are simply the solutions of the homogeneous system of equations (A − I)x = 0.
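This reduction of fixed points to a homogeneous linear system is easy to illustrate; the following sketch (ours, with a small hypothetical 2 × 2 example over F_3, checked by brute force) confirms that the fixed points of x ↦ Ax coincide with the kernel of A − I:

```python
from itertools import product

# Fixed points of a linear fds (F_q^n, A) versus solutions of (A - I)x = 0,
# checked by brute force for a small 2 x 2 example over F_3.
q, n = 3, 2
A = [[1, 1],
     [0, 2]]

def apply(M, x):
    return tuple(sum(M[i][j] * x[j] for j in range(n)) % q for i in range(n))

# A - I, entrywise modulo q.
AmI = [[(A[i][j] - (1 if i == j else 0)) % q for j in range(n)] for i in range(n)]

fixed = {x for x in product(range(q), repeat=n) if apply(A, x) == x}
kernel = {x for x in product(range(q), repeat=n) if apply(AmI, x) == (0,) * n}
assert fixed == kernel            # the two sets coincide
print(sorted(fixed))
```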
In the finite field model for genetic networks, we assume
that the number of states of each gene is a power of a prime.
However, we will give a more general model that eliminates
this assumption.
A module M over a ring R is said to be finitely generated if there exists a set of elements {s_1, s_2, . . . , s_n} ⊂ M such that M = {r_1s_1 + r_2s_2 + · · · + r_ns_n | r_i ∈ R}. Finitely generated modules are generalizations of vector spaces. Examples are F_q^n and the set Z_m^n of n-tuples over the ring of integers modulo an arbitrary integer m.
A linear finite dynamical module system (lfdms) consists
of an ordered pair (M(R), f ) where M(R) is a finitely generated module over a finite commutative ring R with unity
and f : M(R) → M(R) is linear. Let (M1 (R), f1 ) and
(M2 (R), f2 ) be lfdms. We define the direct sum of (M1 (R), f1 )
and (M2 (R), f2 ) to be the fds (M1 ⊕M2 , f1 ⊕ f2 ) where M1 ⊕M2
is the direct sum of the modules M1 (R) and M2 (R) and
f1 ⊕ f2 : M1 ⊕ M2 → M1 ⊕ M2 is defined by ( f1 ⊕ f2 )(u + v) =
f1 (u) + f2 (v), for each u ∈ M1 (R) and v ∈ M2 (R). The state
space of the direct sum is related to the component fds as
follows.
Lemma 5. Let G_1 be the state space of the lfdms (M_1(R), f_1) and let G_2 be the state space of the lfdms (M_2(R), f_2). Then the state space of the direct sum of (M_1(R), f_1) and (M_2(R), f_2) is G_1 × G_2.
This result has been noted in [9] for lfdms over fields.
We use the following well-known result (see, e.g., [19])
in order to establish necessary and sufficient conditions for
an lfdms to be a fixed point system.
Lemma 6 (Fitting's lemma). Let (M(R), f) be an lfdms. Then there exist an integer n > 0 and submodules N and P satisfying

(i) N = f^n(M(R)),
(ii) P = f^{−n}(0),
(iii) (M(R), f) = (N(R), f_1) ⊕ (P(R), f_2), where f_1 = f|_N (the restriction of f to N) is invertible and f_2 = f|_P is nilpotent.
Theorem 2. Let (M(R), f) be an lfdms and let N be defined as above. Then (M(R), f) is a fixed point system if and only if either f is nilpotent or f|_N is the identity map.

Proof. By Fitting's lemma, we have (M(R), f) = (N(R), f|_N) ⊕ (P(R), f|_P), where N = f^n(M(R)) and P = f^{−n}(0). Suppose that f is nilpotent. Then by Lemma 2, the state space of (M(R), f) is a tree. Next suppose that f|_N is the identity. Then by Lemma 3, the state space of (N(R), f|_N) is a union of cycles, each of length one, and by Lemma 2, the state space of (P(R), f|_P) is a tree. Hence by Lemma 4, the state space of (M(R), f) is a union of trees and so (M(R), f) is a fixed point system.

Conversely, suppose that (M(R), f) is a fixed point system. Then the state space of (M(R), f) is a union U of trees. If U consists of only one tree, then by Lemma 2, f is nilpotent. Now suppose that U is the union of at least two trees. Since f is invertible on N, it is also one-to-one on N. By Lemma 3, the state space of (N(R), f|_N) is a union of cycles. Each of these cycles must be of length one. For if not, the state space of (M(R), f) would contain at least one n-cycled tree with n > 1, contradicting that (M(R), f) is a fixed point system.
Theorem 2 can be used to prove the following result, which is suggested in [20, 21].

Corollary 1. A linear finite dynamical system (F_q^n, f) over a field is a fixed point system if and only if the characteristic polynomial of f is of the form x^{n_0}(x − 1)^{n_1} and the minimal polynomial is of the form x^{n_0'}(x − 1)^{n_1'}, where n_1' is either zero or one.

Proof. Suppose (F_q^n, f) is an fps. Then either f is nilpotent or f|_N is the identity. If f is nilpotent, then the characteristic polynomial of f is of the form x^{n_0} and the minimal polynomial of f is of the form x^{n_0'}. If f|_N is the identity, then the characteristic polynomial of f|_N is of the form (x − 1)^{n_1} and the minimal polynomial of f|_N is of the form (x − 1)^{n_1'}, where 0 ≤ n_1' ≤ n_1. Furthermore, n_1' ≤ 1 since otherwise (F_q^n, f) would not be an fps [2]. Therefore the characteristic and minimal polynomials of f are of the desired forms.

Conversely, suppose that the characteristic polynomial of f is of the form x^{n_0}(x − 1)^{n_1} and its minimal polynomial is of the form x^{n_0'}(x − 1)^{n_1'}, where n_0' ≤ n_0 and n_1' is either zero or one. If n_0 = 0, then the characteristic polynomial of f is (x − 1)^{n_1} and so the minimal polynomial of f is x − 1, which implies that f is the identity and hence (F_q^n, f) is an fps. Next, suppose that n_0 > 0. Then either n_1' = 0 or n_1' = 1. If n_1' = 0, then the state space of (F_q^n, f) is a tree. If n_1' = 1, then the state space of (F_q^n, f) is the product of a tree and cycles of length one and hence the union of trees.

The corollary gives us a polynomial time algorithm to determine of a linear fds (F_q^n, f), where f is given by an n × n matrix, whether or not it is an fps. The characteristic polynomial of f can be determined in time O(n^3) using the definition. The minimal polynomial of f can be determined in time O(n^3) using an algorithm of Storjohann [22]. Both polynomials can be factored in subquadratic time using an algorithm of Kaltofen and Shoup [23].

4.2. Monomial systems

The simplest nonlinear multivariate fds (F_q^n, f) is one in which each component function f_i of f is a monomial, that is, a product of powers of the variables. In [24], Colón-Reyes et al. provide necessary and sufficient conditions that allow one to determine in polynomial time when an fds of the form (F_2^n, f), where f is a monomial map, is an fps. In [25], Colón-Reyes et al. give necessary and sufficient conditions for (F_q^n, f), where f is a monomial map and q an arbitrary prime, to be an fps. However, one of these conditions is that a certain linear fds over a ring be an fps, but no criterion is given for such an fds to be an fps. Theorem 2 gives such a criterion. Let us describe the situation in more detail.

Definition 3. If f = (f_1, f_2, . . . , f_n) where each f_j is of the form x_1^{a_{1j}} x_2^{a_{2j}} · · · x_n^{a_{nj}}, j = 1, 2, . . . , n, where each a_{ij} belongs to the ring Z_{q−1} of integers modulo q − 1, then (F_q^n, f) is called a monomial finite dynamical system. The log map of (F_q^n, f) is defined by the n × n matrix L_f = [a_{ij}], 1 ≤ i, j ≤ n. The support map is defined by S_f = (h_1, h_2, . . . , h_n), where each h_i = x_1^{δ_{i1}} x_2^{δ_{i2}} · · · x_n^{δ_{in}} and where δ_{ij} is one if the exponent of x_j in f_i is positive and is zero otherwise.

The following theorem was published in [25].

Theorem 3 (Colón-Reyes, Jarrah, Laubenbacher, and Sturmfels). A monomial fds (F_q^n, f) is an fps if and only if (Z_{q−1}^n, L_f) and (Z_2^n, S_f) are fixed point systems.

Example 4. Consider the monomial fds (F_5^2, f) where f = (xy, y) = (x^1y^1, x^0y^1). The matrix L_f = [ 1 0 ; 1 1 ] over Z_4 is nonsingular and hence not nilpotent. Furthermore, the n of Theorem 2 is 1, N = Z_4^2, and L_f is not the identity. By Theorem 2, (Z_4^2, L_f) is not an fps and, by the previous theorem, (F_5^2, f) is not an fps.

The problem of determining in polynomial time (in n) when an lfdms (R^n, f) is an fps, where R is a finite ring, is open.

4.3. A univariate approach

The fixed point problem is an important problem, suitable solutions for which have been obtained only in certain special cases. All of the work done so far has been for multivariate fds. By considering the problem in the univariate domain, it is possible to gain insight that is not evident in the multivariate domain. The results in the remainder of this section are examples of this.

Lemma 7. (F_{q^n}, g) has fixed points if and only if h(x) = gcd(g(x) − x, x^{q^n} − x) ≠ 1 and, in such a case, the fixed points are the zeros of h(x).

Proof. An element a of F_{q^n} is a fixed point of (F_{q^n}, g) if and only if a is a zero of g(x) − x. Since x^{q^n} = x for all x ∈ F_{q^n}, x^{q^n} − x contains all linear factors of the form x − a, a ∈ F_{q^n}, and so a is a zero of g(x) − x if and only if x − a is a factor of both g(x) − x and x^{q^n} − x, that is, if and only if it is a factor of h(x).
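Lemma 7 can be exercised with a short computation; the sketch below (ours, using naive polynomial arithmetic over a prime field and the system of Example 6, with q^n = 5 and g(x) = x^3) computes h(x) by the Euclidean algorithm and recovers the fixed points:

```python
# Lemma 7 in action for q^n = 5 and g(x) = x^3: compute
# h(x) = gcd(g(x) - x, x^5 - x) over F_5.
# Polynomials over F_p are coefficient lists, constant term first.
p = 5

def trim(f):
    while f and f[-1] == 0:
        f.pop()
    return f

def poly_mod(f, g):
    # remainder of f divided by g over F_p (g must be nonzero)
    f = trim(f[:])
    inv = pow(g[-1], p - 2, p)        # inverse of g's leading coefficient
    while len(f) >= len(g):
        c = f[-1] * inv % p
        shift = len(f) - len(g)
        for i, gc in enumerate(g):
            f[shift + i] = (f[shift + i] - c * gc) % p
        f = trim(f)
    return f

def poly_gcd(f, g):
    f, g = trim(f[:]), trim(g[:])
    while g:
        f, g = g, poly_mod(f, g)
    return f

g_minus_x = [0, p - 1, 0, 1]          # g(x) - x    = x^3 - x
field_poly = [0, p - 1, 0, 0, 0, 1]   # x^{q^n} - x = x^5 - x
h = poly_gcd(field_poly, g_minus_x)   # here h(x) = x^3 - x, so h != 1

# By Lemma 7 the fixed points of g are exactly the zeros of h.
roots = [a for a in range(p) if sum(c * a**i for i, c in enumerate(h)) % p == 0]
assert roots == [a for a in range(p) if pow(a, 3, p) == a] == [0, 1, 4]
print("deg h =", len(h) - 1, "fixed points:", roots)
```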
Lemma 7 gives us algorithms for determining whether or not a given univariate fds has fixed points and, if so, a method to find all such points. For the first part, we note that the greatest common divisor of two univariate polynomials of degree no more than d can be determined using no more than O(d log^2 d) operations [26]. Since g has degree at most q^n − 1, the complexity of calculating h(x), that is, of determining whether or not a given univariate fds (F_{q^n}, g) has fixed points, is O(n^2 q^n).

When h(x) ≠ 1, h(x) can be factored in order to determine the set of all fixed points. At worst, using the algorithm in [23], the complexity of determining the factors of h(x) is O(d^{1.815} n), where d is the degree of h(x). Clearly, d is less than or equal to the degree of g(x), which in practice is determined by experimental data (e.g., from microarrays) and is thus considerably less than the total number of possible points q^n. If we assume that the degree of g(x) is not more than the square root of q^n, then d^{1.815} n ≤ n^2 q^n and the total complexity of the algorithm for determining all fixed points is thus O(n^2 q^n).
Figure 2: State space of (F_5, g), where g(x) = x^3 over F_5.
In contrast, the only known method for determining the fixed points of a multivariate fds (F_q^n, f) is the brute force method of enumerating all state transitions and, for each value f(a_1, a_2, . . . , a_n) so generated, checking to see if f(a_1, a_2, . . . , a_n) = (a_1, a_2, . . . , a_n). The number of operations in this method is O(q^{2n}).
In many cases, the degree of h(x) of Lemma 7 is small and its zeros can be found by inspection or by a few trials. The lac operon example illustrates this.
Example 5. Let (F_{2^5}, g) be the fds describing the lac operon (Example 3). We have h(x) = gcd(g(x) − x, x^{32} − x) = x^4 + α^{26}x^3 + α^{18}x^2 = x^2(x − α^3)(x − α^{15}) and thus the fixed points are x = 0, x = α^3, and x = α^{15}.
Lemma 7 also gives a necessary condition for an fds to be an fps, which for emphasis we state as a theorem.

Theorem 4. With the notation of Lemma 7, if (F_{q^n}, g) is an fps, then h(x) ≠ 1.

Proof. If h(x) = 1, then by Lemma 7, (F_{q^n}, g) has no fixed points and all cycles are nontrivial. Hence (F_{q^n}, g) is not an fps.
The converse of Theorem 4 is not true.

Example 6. Consider the fds (F_5, g) where g(x) = x^3. Then h(x) = gcd(x^5 − x, x^3 − x) = x^3 − x ≠ 1, but (F_5, g) is not an fps (see Figure 2).
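The counterexample can be confirmed in a couple of lines (an illustrative check, not from the paper):

```python
# g(x) = x^3 over F_5 (Example 6): h(x) != 1, yet the system is not an fps,
# because 2 -> 3 -> 2 is a cycle of length two (Figure 2).
g = lambda x: pow(x, 3, 5)
assert [x for x in range(5) if g(x) == x] == [0, 1, 4]   # the fixed points
assert g(2) == 3 and g(3) == 2                           # a nontrivial cycle
print("fixed points 0, 1, 4; states 2 and 3 form a 2-cycle")
```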
5. IMPLEMENTATION ISSUES
One of the difficulties of implementing algorithms for the
multivariate model is the choice of data structures, which
can, in fact, affect complexity. For example, no algorithm is
known for factoring multivariate polynomials that runs in
time polynomial in the length of the “sparse” representation.
However, such an algorithm exists for the “black box” representation (see, e.g., [27]).
On the other hand, the data structures needed for algorithms for the univariate model are well known and simple to implement. In this case, one can also take advantage of well-known methods used in cryptography and coding theory. Table lookup methods for carrying out finite field arithmetic are an example. By using lookup tables we can perform arithmetic operations at almost no cost. However, for very large fields, memory space becomes a limitation. Ferrer [28] has implemented table lookup arithmetic for fields of characteristic 2 on a Hewlett-Packard Itanium machine with two 900 MHz ia64 CPU modules and 4 GB of RAM. On this machine, we can create lookup tables of up to 2^29 elements.
Multiplication is by far the most costly finite field operation and also the most often used, since other operations such as computing powers and computing inverses make use of multiplication. In other experiments on the Hewlett-Packard Itanium, Ferrer [28] takes advantage of machine hardware in order to implement a "direct" multiplication algorithm for F_{2^n} that runs in time linear in n for n = 2 up to n = 63. Here the field size is limited by the word length of the computer architecture.
For larger fields, we can make use of "composite" fields (see, e.g., [29]), that is, fields F_{2^n} where n is composite, say n = rs. Making use of the isomorphism of F_{2^{rs}} and F_{(2^r)^s}, we can use table lookup for a suitable "ground field" F_{2^r} and the direct method mentioned above for multiplication in the extension field F_{(2^r)^s}. Using the ground field F_{2^5} and selected values of s, Ferrer [28] obtains running time O(s^2).
Still another approach to implementing finite field arithmetic, one that is especially efficient for fields of characteristic 2, is the use of reconfigurable hardware, or "field programmable gate arrays" (FPGAs). In [30], Ferrer, Moreno, and the first author obtain a multiplication algorithm which outperforms all other known FPGA multiplication algorithms for fields of characteristic 2.
6. CONCLUSIONS
One piece of information that is of utmost interest when modeling biological events, in particular gene regulation networks, is when the dynamics reaches a steady state. If the modeling of such networks is done by discrete finite dynamical systems, such information is given by the fixed points of the underlying system. We have shown that we can choose between a multivariate and a univariate polynomial representation. Here we introduce a new tool, the discrete Fourier transform, which helps us change from one representation to the other without altering the dynamics of the system.
We provide a criterion to determine when a linear finite
dynamical system over an arbitrary finitely generated module
over a commutative ring with unity is a fixed point system.
When a gene regulation network is modeled by a linear finite dynamical system, we can then use our results to decide whether the network reaches a steady state. When the finitely generated module is a finite field, we can decide in polynomial time.
Gene regulation networks, as suggested in the literature, seem to obey very complex mechanisms whose rules appear to be of a nonlinear nature (see [31]). In this regard, we have made explicit some useful facts concerning fixed points and fixed point systems. We have given algorithms for determining when a univariate fds has at least one fixed point and for finding all such points. We have also given a necessary condition for a univariate fds to be a fixed point system. However, there is still much to be done and a number of open problems remain. In particular, which families of fds admit polynomial time algorithms for determining whether or not a given fds is an fps? This work is a first step towards the aim of designing theories and practical tools to tackle the general problem of fixed points in finite dynamical systems.
ACKNOWLEDGMENTS
This work was partially supported by Grant number S06GM08102 NIH-MBRS (SCORE). The figures in this paper were created using the Discrete Visualizer of Dynamics software, from the Virginia Bioinformatics Institute
(http://dvd.vbi.vt.edu/network visualizer). The authors are
grateful to Dr. Oscar Moreno for sharing his ideas on the univariate model and composite fields.
REFERENCES
[1] J. F. Lynch, “On the threshold of chaos in random Boolean
cellular automata,” Random Structures & Algorithms, vol. 6,
no. 2-3, pp. 239–260, 1995.
[2] B. Elspas, “The theory of autonomous linear sequential networks,” IRE Transactions on Circuit Theory, vol. 6, no. 1, pp.
45–60, 1959.
[3] J. Plantin, J. Gunnarsson, and R. Germundsson, “Symbolic algebraic discrete systems theory—applied to a fighter aircraft,”
in Proceedings of the 34th IEEE Conference on Decision and
Control, vol. 2, pp. 1863–1864, New Orleans, La, USA, December 1995.
[4] D. Bollman, E. Orozco, and O. Moreno, “A parallel solution to
reverse engineering genetic networks,” in Computational Science and Its Applications—ICCSA 2004—Part 3, A. Laganà, M.
L. Gavrilova, V. Kumar, et al., Eds., vol. 3045 of Lecture Notes
in Computer Science, pp. 490–497, Springer, Berlin, Germany,
2004.
[5] A. S. Jarrah, H. Vastani, K. Duca, and R. Laubenbacher, “An
optimal control problem for in vitro virus competition,” in
Proceedings of the 43rd IEEE Conference on Decision and Control (CDC '04), vol. 1, pp. 579–584, Nassau, Bahamas, December
2004.
[6] R. Laubenbacher and B. Stigler, “A computational algebra
approach to the reverse engineering of gene regulatory networks,” Journal of Theoretical Biology, vol. 229, no. 4, pp. 523–
537, 2004.
[7] G. N. Fuller, C. H. Rhee, K. R. Hess, et al., “Reactivation
of insulin-like growth factor binding protein 2 expression in
glioblastoma multiforme,” Cancer Research, vol. 59, no. 17, pp.
4228–4232, 1999.
[8] I. Shmulevich, E. R. Dougherty, and W. Zhang, “Gene perturbation and intervention in probabilistic Boolean networks,”
Bioinformatics, vol. 18, no. 10, pp. 1319–1331, 2002.
[9] R. A. Hernández Toledo, “Linear finite dynamical systems,”
Communications in Algebra, vol. 33, no. 9, pp. 2977–2989,
2005.
[10] J. Bähler and S. Svetina, “A logical circuit for the regulation
of fission yeast growth modes,” Journal of Theoretical Biology,
vol. 237, no. 2, pp. 210–218, 2005.
[11] O. Moreno, D. Bollman, and M. Aviño, “Finite dynamical systems, linear automata, and finite fields,” in Proceedings of the
WSEAS International Conference on System Science, Applied
Mathematics and Computer Science, and Power Engineering
Systems, pp. 1481–1483, Copacabana, Rio de Janeiro, Brazil,
October 2002.
[12] B. Sunar and D. Cyganski, “Comparison of bit and word level
algorithms for evaluating unstructured functions over finite
rings,” in Proceedings of the 7th International Workshop Cryptographic Hardware and Embedded Systems (CHES ’05), J. R. Rao
and B. Sunar, Eds., vol. 3659 of Lecture Notes in Computer Science, pp. 237–249, Edinburgh, UK, August-September 2005.
EURASIP Journal on Bioinformatics and Systems Biology
[13] M. Zivkovic, “A table of primitive binary polynomials,” Mathematics of Computation, vol. 62, no. 205, pp. 385–386, 1994.
[14] R. E. Blahut, Algebraic Methods for Signal Processing and Communications Coding, Springer, New York, NY, USA, 1991.
[15] N. Yildirim and M. C. Mackey, “Feedback regulation in the
lactose operon: a mathematical modeling study and comparison with experimental data,” Biophysical Journal, vol. 84, no. 5,
pp. 2841–2851, 2003.
[16] R. Laubenbacher, “Network Inference, with an application
to yeast system biology,” Presentation at the Center for
Genomics Science, Cuernavaca, Mexico, September 2006,
http://mitla.lcg.unam.mx/.
[17] R. Laubenbacher and B. Stigler, “Mathematical Tools for Systems Biology,” http://people.mbi.ohio-state.edu/bstigler/sbworkshop.pdf.
[18] W. Just, “The steady state system problem is NP-hard even for
monotone quadratic Boolean dynamical systems,” submitted
to Annals of Combinatorics.
[19] B. R. Macdonald, Finite Rings with Identity, Marcel Dekker,
New York, NY, USA, 1974.
[20] O. Colón-Reyes, Monomial dynamical systems, Ph.D. thesis,
Virginia Polytechnic Institute and State University, Blacksburg, Va, USA, 2005.
[21] O. Colón-Reyes, Monomial Dynamical Systems over Finite
Fields, ProQuest, Ann Arbor, Mich, USA, 2005.
[22] A. Storjohann, "An O(n^3) algorithm for the Frobenius normal
form,” in Proceedings of the 23rd International Symposium on
Symbolic and Algebraic Computation (ISSAC ’98), pp. 101–104,
Rostock, Germany, August 1998.
[23] E. Kaltofen and V. Shoup, “Subquadratic-time factoring of
polynomials over finite fields,” Mathematics of Computation,
vol. 67, no. 223, pp. 1179–1197, 1998.
[24] O. Colón-Reyes, R. Laubenbacher, and B. Pareigis, “Boolean
monomial dynamical systems,” Annals of Combinatorics,
vol. 8, no. 4, pp. 425–439, 2004.
[25] O. Colón-Reyes, A. S. Jarrah, R. Laubenbacher, and B. Sturmfels, “Monomial dynamical systems over finite fields,” Journal
of Complex Systems, vol. 16, no. 4, pp. 333–342, 2006.
[26] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design
and Analysis of Computer Algorithms, Addison Wesley, Boston,
Mass, USA, 1974.
[27] J. von zur Gathen and J. Gerhard, Modern Computer Algebra, Cambridge University Press, Cambridge, UK, 2nd edition,
2003.
[28] E. Ferrer, “A co-design approach to the reverse engineering
problem,” CISE Ph.D. thesis proposal, University of Puerto
Rico, Mayaguez, Puerto Rico, USA, 2006.
[29] E. Savas and C. K. Koc, “Efficient method for composite field
arithmetic,” Tech. Rep., Electrical and Computer Engineering,
Oregon State University, Corvallis, Ore, USA, December 1999.
[30] E. Ferrer, D. Bollman, and O. Moreno, “Toward a solution of
the reverse engineering problem usings FPGAs,” in Proceedings
of the International Euro-Par Workshops, Lehner, et al., Eds.,
vol. 4375 of Lecture Notes in Computer Science, pp. 301–309,
Springer, Dresden, Germany, September 2006.
[31] R. Thomas, “Laws for the dynamics of regulatory networks,”
International Journal of Developmental Biology, vol. 42, no. 3,
pp. 479–485, 1998.
Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 82702, 11 pages
doi:10.1155/2007/82702
Research Article
Comparison of Gene Regulatory Networks via
Steady-State Trajectories
Marcel Brun,1 Seungchan Kim,1,2 Woonjung Choi,3 and Edward R. Dougherty1,4,5
1 Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004, USA
2 School of Computing and Informatics, Ira A. Fulton School of Engineering, Arizona State University, Tempe, AZ 85287, USA
3 Department of Mathematics and Statistics, College of Liberal Arts and Sciences, Arizona State University, Tempe, AZ 85287, USA
4 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
5 Cancer Genomics Laboratory, Department of Pathology, University of Texas M.D. Anderson Cancer Center, Houston, TX 77030, USA
Received 31 July 2006; Accepted 24 February 2007
Recommended by Ahmed H. Tewfik
The modeling of genetic regulatory networks is becoming increasingly widespread in the study of biological systems. In the abstract, one would prefer quantitatively comprehensive models, such as a differential-equation model, to coarse models; however,
in practice, detailed models require more accurate measurements for inference and more computational power to analyze than
coarse-scale models. It is crucial to address the issue of model complexity in the framework of a basic scientific paradigm: the model
should be of minimal complexity to provide the necessary predictive power. Addressing this issue requires a metric by which to
compare networks. This paper proposes the use of a classical measure of difference between amplitude distributions for periodic
signals to compare two networks according to the differences of their trajectories in the steady state. The metric is applicable to
networks with both continuous and discrete values for both time and state, and it possesses the critical property that it allows
the comparison of networks of different natures. We demonstrate application of the metric by comparing a continuous-valued
reference network against simplified versions obtained via quantization.
Copyright © 2007 Marcel Brun et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. INTRODUCTION
The modeling of genetic regulatory networks (GRNs) is becoming increasingly widespread for gaining insight into the
underlying processes of living systems. The computational
biology literature abounds in various network modeling approaches, all of which have particular goals, along with their
strengths and weaknesses [1, 2]. They may be deterministic
or stochastic. Network models have been studied to gain insight into various cellular properties, such as cellular state
dynamics and transcriptional regulation [3–8], and to derive
intervention strategies based on state-space dynamics [9, 10].
Complexity is a critical issue in the synthesis, analysis,
and application of GRNs. In principle, one would prefer
the construction and analysis of a quantitatively comprehensive model such as a differential equation-based model to a
coarsely quantized discrete model; however, in practice, the available data and computational resources do not always suffice to support such a model.
Quantitatively detailed (fine-scale) models require signifi-
cantly more complex mathematics and computational power
for analysis and more accurate measurements for inference
than coarse-scale models. The network complexity issue has
similarities with the issue of classifier complexity [11]. One
must decide whether to use a fine-scale or coarse-scale model
[12]. The issue should be addressed in the framework of the
standard engineering paradigm: the model should be of minimal complexity to solve the problem at hand.
To quantify network approximation and reduction, one
would like a metric to compare networks. For instance, it
may be beneficial for computational or inferential purposes
to approximate a system by a discrete model instead of a continuous model. The goodness of the approximation is measured by a metric and the precise formulation of the properties will depend on the chosen metric.
Comparison of GRN models needs to be based on salient
aspects of the models. One study used the L1 norm between
the steady-state distributions of different networks in the
context of the reduction of probabilistic Boolean networks
2
EURASIP Journal on Bioinformatics and Systems Biology
[13]. Another study compared networks based on their
topologies, that is, connectivity graphs [14]. This method
suffers from the fact that networks with the same topology
may possess very different dynamic behaviors. A third study
involved a comprehensive comparison of continuous models based on their inferential power, prediction power, robustness, and consistency in the framework of simulations,
where a network is used to generate gene expression data,
which is then used to reconstruct the network [15]. A key
drawback of most approaches is that the comparison is applicable only to networks with similar representations; it is
difficult to compare networks of different natures, for instance, a differential-equation model to a Boolean model. A
salient property of the metric proposed in this study is that it
can compare networks of different natures in both value and
time.
We propose a metric to compare deterministic GRNs via
their steady-state behaviors. This is a reasonable approach
because, in the absence of external intervention, a cell operates mainly in its steady state, which characterizes its phenotype, that is, cell cycle, disease, cell differentiation, and so forth [16–19]. A cell's phenotypic status is maintained
through a variety of regulatory mechanisms. Disruption of
this tight steady-state regulation may lead to an abnormal
cellular status, for example, cancer. Studying steady-state behavior of a cellular system and its disruption can provide significant insight into cellular regulatory mechanisms underlying disease development.
We first introduce a metric to compare GRNs based on
their steady-state behaviors, discuss its characteristics, and
treat the empirical estimation of the metric. Then we provide
a detailed application to quantization utilizing the mathematical framework of reference and projected networks. We
close with some remarks on the efficacy of the proposed
metric.
2. METRIC BETWEEN NETWORKS
In this section, we construct the distance metric between networks using a bottom-up approach. Following a description
of how trajectories are decomposed into their transient and
steady-state parts, we define a metric between two periodic
or constant functions and then extend this definition to a
more general family of functions that can be decomposed between transient and steady-state parts.
2.1. Steady-state trajectory
Since biological networks are understood to exhibit steady-state behavior, we confine ourselves to networks possessing this property. Moreover, since a cell uses nutrients such as amino acids and nucleotides in the cytoplasm to
synthesize various molecular components, that is, RNAs and
proteins [18], and since there are only limited supplies of nutrients available, the amount of molecules present in a cell
is bounded. Thus, the existence of steady-state behavior implies that each individual gene trajectory can be modeled as a
bounded function f (t) that can be decomposed into a transient trajectory plus a steady-state trajectory:
f(t) = ftran(t) + fss(t),   (1)
where limt→∞ ftran (t) = 0 and fss (t) is either a periodic function or a constant function.
The limit condition on the transient part of the trajectory
indicates that for large values of t, the trajectory is very close
to its steady-state part. This can be expressed in the following
manner: for any ε > 0, there exists a time tss such that |f(t) − fss(t)| < ε for t > tss. This property is useful for identifying fss(t) from simulated data by finding an instant tss such that f(t) is almost periodical or constant for t > tss.
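This ε-criterion is easy to apply numerically. The following is a small sketch (function and variable names are ours, not the paper's): given a sampled trajectory f and its known steady-state part fss, it returns the earliest grid time after which the transient stays below ε.

```python
import numpy as np

# Sketch of the epsilon-criterion above (names hypothetical): find the
# earliest time t_ss such that |f(t) - f_ss(t)| < eps for all t > t_ss.
def find_t_ss(f, f_ss, t_grid, eps):
    """Return the first grid time after which |f - f_ss| stays below eps."""
    resid = np.abs(f(t_grid) - f_ss(t_grid))
    bad = np.nonzero(resid >= eps)[0]   # grid indices violating the bound
    return t_grid[bad[-1] + 1] if bad.size else t_grid[0]

# Example: transient e^{-t/2} on top of the periodic part sin(t); the
# transient drops below eps = 1e-3 near t = 2 ln(1000), about 13.8.
t = np.linspace(0.0, 50.0, 5001)
f = lambda s: np.exp(-s / 2.0) + np.sin(s)
f_ss = lambda s: np.sin(s)
t_ss = find_t_ss(f, f_ss, t, eps=1e-3)
```

In practice fss is not known in advance; the paper's suggestion is to look for a tail of the simulation that is almost periodic or constant, for which this residual test is one ingredient.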
A deterministic gene regulatory network, whether it is represented by a set of differential equations or by state transition equations, produces different dynamic behaviors depending on the starting point. If ψ is a network with N genes and x0 is an initial state, then its trajectory is

f(ψ,x0)(t) = ( f(1)(ψ,x0)(t), . . . , f(N)(ψ,x0)(t) ),   (2)

where f(i)(ψ,x0)(t) is a trajectory for an individual gene (denoted by f(i)(t), or f(t) where there is no ambiguity) generated by the dynamic behavior of the network ψ when starting at x0. For a differential-equation model, the trajectory f(ψ,x0)(t) can be obtained as a solution of a system of differential equations; for a discrete model, it can be obtained by iterating the system's transition equations. Trajectories may be continuous-time functions or discrete-time functions, depending on the model.
The decomposition of (1) applies to f(ψ,x0)(t) via its application to the individual trajectories f(i)(ψ,x0)(t). In the case of discrete-valued networks (with bounded values), the system must enter an attractor cycle or an attractor state at some time point tss. In the first case f(ψ,x0),ss(t) is periodical, and in the second case it is constant. In both cases, f(ψ,x0),tran(t) = 0 for t ≥ tss.
2.2. Distance based on the amplitude cumulative distribution
Different metrics have been proposed to compare two real-valued trajectories f(t) and g(t), including the correlation ⟨f, g⟩, the cross-correlation Γf,g(τ), the cross-spectral density pf,g(ω), the difference between their amplitude cumulative distributions F(x) = pf(x) and G(x) = pg(x), and the difference between their statistical moments [20]. Each has
its benefits and drawbacks depending on one’s purpose. In
this paper, we propose using the difference between the amplitude cumulative distributions of the steady-state trajectories.
Let fss (t) and gss (t) be two measurable functions that are
either periodical or constant, representing the steady-state
parts of two functions, f (t) and g(t), respectively. Our goal
is to define a metric (distance) between them by using the
amplitude cumulative distribution (ACD), which measures the probability density of a function [20].

Figure 1: Example of (a) periodical and constant functions f(t) (the plotted curves are 2 sin(t), 2 cos(2t + 1), 2 sin(t) + 2 sin(2t), 2 sin(t) + 2 sin(2t) + 2, and the constants 3 and 4) and (b) their amplitude cumulative distributions F(x).
If fss(t) is periodic with period tp > 0, its cumulative density function F(x) over R is defined by

F(x) = λ(M(x)) / tp,   (3)

where λ(A) is the Lebesgue measure of the set A and

M(x) = {ts ≤ t < te | fss(t) ≤ x},   (4)

where te = ts + tp, for any point ts.
If fss is constant, given by fss (t) = a for any t, then we
define F(x) as a unit step function located at x = a. Figure 1
shows an example of some periodical functions and their amplitude cumulative distributions.
Given two steady-state trajectories, fss(t) and gss(t), and their respective amplitude cumulative distributions, F(x) and G(x), we define the distance between fss and gss as the distance between the distributions

dss(fss, gss) = ||F − G||   (5)

for some suitable norm || · ||. Examples of norms include L∞, defined by the supremum of their differences,

dL∞(f, g) = sup 0≤x<∞ |F(x) − G(x)|,   (6)

and L1, defined by the area of the absolute value of their difference,

dL1(f, g) = ∫ 0≤x<∞ |F(x) − G(x)| dx.   (7)
In both cases, we apply the biological constraint that the amplitudes are nonnegative.
The L1 norm is well suited to the steady-state behavior because, in the case of constant functions f(t) = a and g(t) = b, their distributions are unit step functions at x = a and x = b, respectively, so that dL1(f, g) = |a − b|, the distance, in amplitude, between the two functions. Hence, we can interpret the distance dL1(f, g) as an extension of the distance, in amplitude, between two constant signals to the general case of periodic functions, taking into consideration the differences in their shapes.
2.3. Network metric
Once a distance between their steady-state trajectories is defined, we can extend it to two trajectories f(t) and g(t) by

dtr(f, g) = dss(fss, gss),   (8)

where dss is defined by (5).
The next step is to define the distance between two multivariate trajectories f(t) and g(t) by

dtr(f, g) = (1/N) Σ i=1..N dtr( f(i), g(i) ),   (9)

where f(i)(t) and g(i)(t) are the component trajectories of
f(t) and g(t), respectively. Owing to the manner in which a
norm is used to define dss , in conjunction with the manner
in which dtr is constructed from dss , the triangle inequality
dtr(f, h) ≤ dtr(f, g) + dtr(g, h)   (10)
holds, and dtr is a metric.
The last step is to define the metric between two networks
as the expected distance between the trajectories over all possible initial states. For networks ψ1 and ψ2 , we define
d(ψ1, ψ2) = ES[ dtr( f(ψ1,x0), f(ψ2,x0) ) ],   (11)
where the expectation is taken with respect to the space S of
initial states.
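The averages in (9) and (11) can be sketched directly. In the code below (all function and variable names are ours, and the per-gene steady-state distance is a stand-in), dtr averages a per-gene distance over the N genes, and the expectation in (11) is estimated by a Monte Carlo average over randomly drawn initial states.

```python
import numpy as np

# Sketch of (9) and (11) (names hypothetical): d_tr averages a per-gene
# steady-state distance over the N genes, and d(psi1, psi2) is estimated by
# a Monte Carlo average of d_tr over random initial states x0.
def d_tr(traj_f, traj_g, d_ss):
    """Multivariate distance (9): mean of per-gene distances; (N, T) arrays."""
    return float(np.mean([d_ss(f, g) for f, g in zip(traj_f, traj_g)]))

def network_distance(run1, run2, d_ss, sample_state, n_states=30):
    """Monte Carlo estimate of (11); run_k(x0) returns an (N, T) trajectory."""
    x0s = [sample_state() for _ in range(n_states)]
    return float(np.mean([d_tr(run1(x0), run2(x0), d_ss) for x0 in x0s]))

# Toy check with a placeholder distance (mean absolute gap): a network
# compared with itself must be at distance 0.
rng = np.random.default_rng(0)
d_abs = lambda f, g: float(np.mean(np.abs(f - g)))
run = lambda x0: np.tile(np.asarray(x0)[:, None], (1, 50))  # constant traj.
zero = network_distance(run, run, d_abs, lambda: rng.random(3))
```

In the paper's setting, d_ss would be the ACD-based distance (5); any steady-state distance satisfying the triangle inequality yields a metric here.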
The use of a metric, in particular, the triangle inequality,
is essential for the problem of estimating complex networks
by using simpler models. This is akin to the pattern recognition problem of estimating a complex classifier via a constrained classifier to mitigate the data requirement. In this
situation, there is a complex model that represents a broad
family of networks and a simpler model that represents a
smaller class of networks. Given a reference network from the
complex model and a sampled trajectory from it, we want to
estimate the optimal constrained network. We can identify
the optimal constrained network, that is, projected network,
as the one that best approximates the complex one, and the
goal of the inference process should be to obtain a network
close to the optimal constrained network. Let ψ be a reference
network (e.g., a continuous-valued ODE-based network), let
P(ψ) be the optimal constrained network (e.g., a discrete-valued network), and let ω be an estimator of P(ψ) estimated
from data sampled from ψ. Then

d(ω, ψ) ≤ d(ω, P(ψ)) + d(P(ψ), ψ),   (12)

where the following distances have natural interpretations:
(i) d(ω, ψ) is the overall distance and quantifies the approximation of the reference network by the estimated optimal constrained network;
(ii) d(ω, P(ψ)) is the estimation distance for the constrained network and quantifies the inference of the optimal constrained network;
(iii) d(P(ψ), ψ) is the projection distance and quantifies how well the optimal constrained network approximates the reference network.
This structure is analogous to the classical constrained regression problem, where constraints are used to facilitate better inference via reduction of the estimation error (so long as this reduction exceeds the projection error) [11]. In the case of networks, the constraint problem becomes one of finding a projection mapping for models representing biological processes for which the loss defined by d(P(ψ), ψ) may be maintained within manageable bounds so that, with good inference techniques, the estimation error defined by d(ω, P(ψ)) will be minimized.

2.4. Estimation of the amplitude cumulative distribution

The amplitude cumulative distribution of a trajectory can be estimated by simulating the trajectory and then estimating the ACD from it. Assuming that the steady-state trajectory fss(t) is periodic with period tp, we can analyze fss(t) between two points, ts and te = ts + tp. For a continuous function fss(t), we assume that any amplitude value x is visited only a finite number of times by fss(t) in a period ts ≤ t < te. In accordance with (3), we define the cumulative distribution

F(x) = λ({ts ≤ t ≤ te | fss(t) ≤ x}) / tp.   (13)

To calculate F(x) from a sampled trajectory, for each value x, let Sx be the set of points where fss(t) = x:

Sx = {ts ≤ t ≤ te | fss(t) = x} ∪ {ts, te}.   (14)

The set Sx is finite. Let n = |Sx| denote the number of elements t0, . . . , tn−1. These can be sorted so that ts = t0 < t1 < t2 < · · · < tn−1 = te. Now we define the set of values mi, i = 0, . . . , n − 2, intermediate between two consecutive points where fss(t) crosses x (see Figure 2), by

mi = fss( (ti + ti+1)/2 ).   (15)

Figure 2: Example of determination of the values mi.
Let Ix be the set of the indices of points ti such that the function f(t) is below x in the interval [ti, ti+1]:

Ix = {0 ≤ i ≤ n − 2 | mi ≤ x}.   (16)
Finally, the cumulative distribution F(x), defined by the measure of the set {ts ≤ t ≤ te | f(t) ≤ x}, can be computed as the sum of the lengths of the intervals where f(t) ≤ x:

F(x) = Σ i∈Ix (ti+1 − ti) / tp.   (17)
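Steps (14)–(17) can be sketched numerically for a densely sampled period (all names below are ours): locate the level crossings, take the midpoint value mi of each inter-crossing interval per (15), keep the intervals with mi ≤ x per (16), and sum their lengths per (17).

```python
import numpy as np

# Numerical sketch of (14)-(17) (names hypothetical).
def acd_by_intervals(f, ts, tp, x, n=100001):
    """Approximate F(x) = lambda({t : f(t) <= x}) / tp over one period."""
    t = np.linspace(ts, ts + tp, n)
    sign = f(t) - x
    # indices where the sign changes between adjacent samples: one crossing
    # of the level x lies in each such gap (the set S_x of (14), approx.)
    c = np.nonzero(sign[:-1] * sign[1:] < 0)[0]
    # linear interpolation of each crossing, plus the period endpoints
    tc = t[c] + (t[1] - t[0]) * sign[c] / (sign[c] - sign[c + 1])
    ti = np.sort(np.concatenate(([t[0], t[-1]], tc)))
    mi = f((ti[:-1] + ti[1:]) / 2.0)        # midpoint values, per (15)
    keep = mi <= x                          # the index set I_x of (16)
    return float(np.sum((ti[1:] - ti[:-1])[keep]) / tp)   # (17)

# For f_ss(t) = sin(t) and x = 0, sin is below zero on exactly half of the
# period, so F(0) should be 1/2.
F0 = acd_by_intervals(np.sin, 0.0, 2.0 * np.pi, 0.0)
```

The sketch assumes, as the text does, that each level x is crossed only finitely often in a period; levels that coincide exactly with sample values would need extra care.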
The estimation of F(x) from a finite set {a1, . . . , am} representing the function f(t) at points t1, . . . , tm reduces to estimating the values in (17):

F̂(x) = |{1 ≤ i ≤ m | ai ≤ x}| / m   (18)

at the points ai, i = 1, . . . , m.
In the case of computing the distance between two functions f(t) and g(t), where the only information available consists of two samples, {a1, . . . , am} and {b1, . . . , br}, for f and g, respectively, both cumulative distributions F̂(x) and Ĝ(x) need only be defined at the points in the set

S = {a1, . . . , am} ∪ {b1, . . . , br}.   (19)
Figure 3: Block diagram of a model for transcriptional regulation (transcription, translation, and cis-regulation linking the mRNA concentrations r1(t), r2(t), r3(t) and the protein concentrations p1(t), p2(t)).
In this case, if we sort the set S so that 0 = s0 < s1 < · · · < sk = T (with T being the upper limit for the amplitude values, and k ≤ r + m), then (6) can be approximated by

dL∞(f, g) = max 0≤i≤k |F̂(si) − Ĝ(si)|   (20)

and (7) can be approximated by

dL1(f, g) = Σ 0≤i≤k−1 (si+1 − si) |F̂(si) − Ĝ(si)|.   (21)
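The sample-based estimates (18) and (21) can be sketched as follows (function names are ours). As noted in Section 2.2, for constant signals the L1 distance should recover the amplitude gap |a − b|.

```python
import numpy as np

# Sketch of the empirical ACD distance (names hypothetical): given samples
# a_1..a_m of f and b_1..b_r of g over one steady-state period, estimate
# F and G by (18) and the L1 distance by (21).
def acd(samples, x):
    """Empirical amplitude cumulative distribution at level x, per (18)."""
    samples = np.asarray(samples, dtype=float)
    return np.count_nonzero(samples <= x) / samples.size

def d_l1(a, b):
    """Approximate dL1(f, g) via (21) on the pooled sorted amplitudes."""
    s = np.sort(np.concatenate([a, b]))   # pooled levels s_0 <= ... <= s_k
    F = np.array([acd(a, x) for x in s])
    G = np.array([acd(b, x) for x in s])
    return float(np.sum(np.diff(s) * np.abs(F - G)[:-1]))

# Constant signals f = 2 and g = 5: the ACDs are unit steps at 2 and 5,
# so dL1 should recover the amplitude gap |2 - 5| = 3.
a = np.full(200, 2.0)
b = np.full(300, 5.0)
```

Because the ACD discards time ordering, two steady-state trajectories that differ only by a phase shift are at distance (approximately) zero under this estimate, which is consistent with the metric's intent of comparing amplitude distributions rather than aligned waveforms.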
3. APPLICATION TO QUANTIZATION
To illustrate application of the network metric, we will analyze how different degrees of quantization affect model accuracy. Quantization is an important issue in network modeling because it is imperative to balance the desire for fine description against the need for reduced complexity for both inference and computation. Since it is difficult, if not impossible, to directly evaluate the goodness of a model against a real biological system, we will study the problem using a standard engineering approach. First, an in numero reference network model, or system, is formulated. Then, a second network model with a different level of abstraction is introduced to approximate the reference system. The objective is to investigate how different levels of abstraction, quantization levels in this study, impact the accuracy of the model prediction. The first model is called the reference model. From it, reference networks will be instantiated with appropriate sets of model parameters. This model will be continuous-valued so as to approximate the reference system as closely as possible. The second model is called the projected model, and projected networks will be instantiated from it. It will be a discrete-valued model at a given level of quantization.
The ability of a projected network, an instance of the
projected model, to approximate a reference network, an instance of the reference model, can be evaluated by comparing
the trajectories generated from each network with different
initial states and computing the distances between the networks as given by (11).
3.1. Reference model
The origin of our reference model is a differential-equation model that quantitatively represents transcription, translation, cis-regulation, and chemical reactions [7, 15, 21]. Specifically, we consider a differential-equation model that approximates the process of transcription and translation for a set of genes and their associated proteins (as illustrated in Figure 3) [7]. The model comprises the following differential equations:

dpi(t)/dt = λi ri(t − τp,i) − γi pi(t), i ∈ G,
dri(t)/dt = κi ci(t − τr,i) − βi ri(t), i ∈ G,
ci(t) = φi( pj(t − τc,j), j ∈ Ri ), i ∈ G,   (22)

where ri and pi are the concentrations of mRNA and protein induced by gene i, respectively, ci(t) is the fraction of DNA fragments committed to transcription of gene i, κi is the transcription rate of gene i, and τp,i, τr,i, and τc,i are the time delays for each process to start when the conditions are given. The most general form for the function φi is a real-valued (usually nonlinear) function with domain in R^|Ri| and range in R, φi : R^|Ri| → R. The functions are defined by the equations

φi( pj, j ∈ Ri ) = ( 1 − Π j∈Ri+ ρ(pj, Sij, θij) ) × Π j∈Ri− ρ(pj, Sij, θij),
ρ(p, S, θ) = 1 / (1 + θp)^S,   (23)

where the parameters θij are the affinity constants and the parameters Sij are the numbers of distinct sites on gene i where promoter j can bind. The functions depend on the discrete parameter Sij, the number of binding sites for protein j on gene i, and θij, the affinity constant between gene i and protein j.
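The cis-regulation functions (23) can be sketched directly (the function names and the simplified indexing, one shared S and θ for all pairs, are ours): ρ decays from 1 toward 0 as the protein concentration rises, so the activator factor 1 − Πρ approaches 1 when activators are abundant, while the repressor factor Πρ approaches 0 when repressors are abundant.

```python
import numpy as np

# Sketch of (23) (names hypothetical; scalar S and theta shared across
# pairs for brevity). The empty-product convention (product over an empty
# set equals 1) is used for genes lacking activators or repressors.
def rho(p, S, theta):
    """Binding function rho(p, S, theta) = 1 / (1 + theta*p)^S, per (23)."""
    return 1.0 / (1.0 + theta * p) ** S

def phi(p, R_plus, R_minus, S, theta):
    """Fraction of DNA committed to transcription of a gene, per (23)."""
    act = 1.0 - np.prod([rho(p[j], S, theta) for j in R_plus])
    rep = np.prod([rho(p[j], S, theta) for j in R_minus])
    return act * rep

# Gene with activator 1 and repressor 2: phi is near 1 when only the
# activator is abundant, and near 0 when the repressor is also abundant.
v_act = phi({1: 1000.0, 2: 0.0}, [1], [2], 1, 1.0)
v_rep = phi({1: 1000.0, 2: 1000.0}, [1], [2], 1, 1.0)
```

Note that, taken literally with the empty-product convention, (23) gives φ = 0 for a gene with no activators; how the paper handles activator-free genes is not specified in this excerpt.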
A discrete-time model results from the preceding continuous-time model by discretizing the time t on intervals nδt and assuming that the fraction of DNA
fragments committed to transcription and the concentration of mRNA remain constant in the time interval [t − δt, t) [7]. In place of the differential equations for ri, pi, and ci, at time t = nδt, we have the equations

ri(n) = e^(−βi δt) ri(n − 1) + κi s(βi, δt) ci(n − nr,i − 1),
pi(n) = e^(−γi δt) pi(n − 1) + λi s(λi, δt) ri(n − np,i − 1),
ci(n) = φi( pj(n − nc,j), j ∈ Ri ),   (24)

where nr,i = τr,i/δt, np,i = τp,i/δt, nc,j = τc,j/δt, and

s(x, y) = (1 − e^(−xy)) / x.   (25)

This model, which will serve as our reference model, is called a (discrete) transcriptional regulatory system (tRS).
We generate networks using this model and a fixed set θ of parameters. We call these networks reference networks. A reference network is identified by its set θ of parameters,

θ = ( α1, β1, λ1, γ1, κ1, τp,1, τr,1, τc,1, φ1, R1, . . . , αN, βN, λN, γN, κN, τp,N, τr,N, τc,N, φN, RN ).   (26)

Table 1: Parameter values used in simulations.
Parameter                   Value
Number of binding sites     S = 1
Affinity constant           θ = 10^8 M^−1
mRNA half-life              ρ = 1200 s
Protein half-life           π = 3600 s
Transcription rates         κ1 = 0.001 pM s^−1; κ2 = κ3 = κ4 = 0.05 pM s^−1
Translation rate            λ = 0.20 s^−1
Time delays                 transcription τr = 2000 s; cis-regulation τc = 200 s; translation τp = 2400 s

Figure 4: Example of a tRS of a hypothetical metabolic pathway that consists of four genes (an input substrate is metabolized to an output product); each regulatory edge in the figure is marked as an activator or a repressor.

3.2. Projected model

The next step is to reduce the reference network model to a projected network model. This is accomplished by applying constraints in the reference model. The application of constraints modifies the original model, thereby obtaining a simpler one. We focus on quantization of the gene expression levels (which are continuous-valued in the reference model) via uniform quantization, which is defined by a finite or denumerable set L of intervals, L1 = [0, Δx), L2 = [Δx, 2Δx), . . . , Li = [(i − 1)Δx, iΔx), . . . , and a mapping ΠL : R → R such that Π(x) = ai for some collection of points ai ∈ Li.
The equations for ri, pi, and ci (24) are replaced by

r̄i(n) = Π( e^(−βi δt) r̄i(n − 1) + κi s(βi, δt) c̄i(n − nr,i − 1) ),   (27)
p̄i(n) = Π( e^(−γi δt) p̄i(n − 1) + λi s(λi, δt) r̄i(n − np,i − 1) ),   (28)
c̄i(n) = φi( p̄j(n − nc,j), j ∈ Ri ), i ∈ G.   (29)

Issues to be investigated include (1) how different quantization techniques (specification of the partition L) affect the quality of the model; (2) which quantization technique (mapping Π) is the best for the model; and (3) the similarity of the attractors of the dynamical system defined by (27) and (28) to the steady state of the original system, as a function of Δx. We consider the first issue.

3.3. A hypothetical metabolic pathway
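A minimal sketch of the uniform quantizer ΠL of Section 3.2 follows; the choice of representative points ai (the midpoint of each interval Li) is ours, since the model only requires some point ai ∈ Li.

```python
import numpy as np

# Sketch of the uniform quantizer Pi_L (the midpoint representatives are a
# hypothetical choice, not the paper's).
def quantize(x, dx):
    """Map x >= 0 onto the midpoint of its interval L_i = [(i-1)dx, i*dx)."""
    if dx == 0.0:                  # Delta_x = 0 denotes "no quantization"
        return x
    return (np.floor(np.asarray(x) / dx) + 0.5) * dx

# quantize(0.237, 0.1) lands on 0.25, the midpoint of [0.2, 0.3).
```

Applying this map after every update, as in (27)–(29), makes the state space countable (and, with bounded trajectories, finite), which is what forces the projected system into an attractor cycle or fixed point.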
To illustrate the proposed metric in the framework of the
reference and projected models, we compare two networks
based on a hypothetical metabolic pathway. We first briefly
describe the hypothetical metabolic pathway with necessary
biochemical parameters to set up a reference system. Then,
the simulation study shows the impacts of various quantization levels in both time and trajectory based on the proposed
metric.
We consider a gene regulatory network consisting of four genes. A graphical representation of the system is depicted in Figure 4, in which each regulatory edge is marked as an activator or a repressor. We assume that the GRN regulates a hypothetical
pathway, which metabolizes an input substrate to an output
product. This is done by means of enzymes whose transcriptional control is regulated by the protein produced from gene
3. Moreover, we assume that the effect of a higher input substrate concentration is to increase the transcription rate κ1 ,
Figure 5: Example of trajectories from the first simulation of the 4-gene network. Each panel shows the trajectory for one of the four genes over the first 10 000 seconds and the last 10 000 seconds, for several values of the quantization level Δx, represented by the lines Q = 0, Q = 0.001, Q = 0.01, and Q = 0.1 (Q = 0 represents the original network without quantization). The values S displayed in the graphs show the distance computed between each trajectory and the one with Q = 0. The vertical axis shows the concentration level x in pM; the horizontal axis shows the time t in seconds.
whereas the effect of a lower substrate concentration is to reduce κ1. Unless otherwise specified, the parameters are assumed to be gene-independent. These parameters are summarized in Table 1.
We assume that each cis-regulator is controlled by one module with four binding sites, and set S = 4, θ = 10^8 M^−1, κ2 = κ3 = κ4 = 0.05 pM s^−1, and λ = 0.05 s^−1. The value of the affinity constant θ corresponds to a binding free energy
Figure 6: Example of estimated cumulative distribution functions (CDFs) from the first simulation of the 4-gene network, computed from the trajectories in Figure 5. Each panel shows the CDF for one of the four genes, for several values of the quantization level Δx, represented by the lines Q = 0, Q = 0.001, Q = 0.01, and Q = 0.1 (Q = 0 represents the original network without quantization). The values S displayed in the graphs show the distance computed between each trajectory and the one with Q = 0. The vertical axis shows the cumulative distribution F(x); the horizontal axis shows the concentration level x in pM.
of ΔU = −11.35 kcal/mol at temperature T = 310.15 K (37 °C). The values of the transcription rates κ2, κ3, and κ4 correspond to transcriptional machinery that, on the average,
produces one mRNA molecule every 8 seconds. This value
turns out to be typical for yeast cells [22]. We also assume
that on the average, the volume of each cell in C equals 4 pL
[18]. The translation rate λ is taken to be 10-fold larger than
the rate of 0.3/minute for translation initiation observed in
vitro using a semipurified rabbit reticulocyte system [23].
The degradation parameters β and γ are specified by means of the mRNA and protein half-life parameters ρ and π, respectively, which satisfy

e^(−βρ) = 1/2,   e^(−γπ) = 1/2.   (30)

In this case,

β = ln 2 / ρ,   γ = ln 2 / π.   (31)
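As a quick check of (30)–(31), solving e^(−βρ) = 1/2 for β indeed gives a decay rate that halves the mRNA concentration after one half-life:

```python
import math

# Check of (30)-(31): with half-life rho, the decay rate beta = ln 2 / rho
# halves the signal after one half-life.
rho_hl = 1200.0                  # mRNA half-life in seconds (Table 1)
beta = math.log(2) / rho_hl      # (31)
# e^{-beta * rho_hl} = 1/2, per (30)
```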
Figure 7: Results for the first simulation: the vertical axis shows the distance dL1( f(Δx,δt), f(Δx=0,δt) ) as a function of the quantization levels for both the values (axis labeled "Δx") and the time (axis labeled "δt").
3.4. Results and discussion
It is expected that the finer the quantization (smaller values of Δx), the more similar the projected networks will be to the reference networks. This similarity should be reflected in the trajectories as measured by the proposed metric. A
straightforward simulation consists of the design of a reference network, the design of a projected network (for some
value of Δx ), the generation of several trajectories for both
networks from randomly selected starting points, and the
computation of the average distance between trajectories, using (9) and (21). Each process is repeated for different time
intervals δt to study how the time intervals used in the simulation affect the analysis.
The first simulation is based on the same 4-gene model presented in [7]. We use 6 different quantization levels,
Δx = 0, 0.001, 0.01, 0.1, 1, and 10, where Δx = 0 means
no quantization, and designates the reference network. For
each quantization level Δx and starting point x0 , we generate the simulated time series expression and compare it to
the time-series generated with Δx = 0 (the reference network), estimating the proposed metric using (21). The process is repeated using a total of 10 different time intervals,
δt = 1 second, 5 seconds, 10 seconds, 30 seconds, 1 minute,
2 minutes, 5 minutes, 10 minutes, 30 minutes, and 1 hour.
The simulation is repeated and the distances are averaged for
30 different starting points x0 .
Figures 5 and 6 show the trajectories and empirical cumulative distribution functions estimated from the simulated system as illustrated in the previous section. Several quantization levels are used in the simulation. The last graph in Figure 5 shows the mRNA concentration for the fourth gene over the first 10 000 seconds (transient) and over the last 10 000 seconds (steady state). We can see that for quantizations 0 and 0.001 the steady-state solutions are periodic, and for quantizations 0.01 and 0.1 the solutions are constant. This is reflected in the associated plot of F(x) in Figure 6.
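One plausible way to implement this kind of comparison is sketched below. The function names are ours and the estimator is simplified (the actual metric is defined by (9) and (21) in the paper); the sketch quantizes a trajectory and measures an L1 distance between empirical distribution functions:

```python
import numpy as np

def l1_cdf_distance(traj_a, traj_b, grid_points=512):
    """L1 distance between the empirical CDFs of two sampled
    trajectories: the integral of |F_a(x) - F_b(x)| over x."""
    a = np.sort(np.asarray(traj_a, dtype=float))
    b = np.sort(np.asarray(traj_b, dtype=float))
    xs = np.linspace(min(a[0], b[0]), max(a[-1], b[-1]), grid_points)
    # Empirical CDFs evaluated on a common grid
    Fa = np.searchsorted(a, xs, side="right") / a.size
    Fb = np.searchsorted(b, xs, side="right") / b.size
    return float(np.sum(np.abs(Fa - Fb)) * (xs[1] - xs[0]))

def quantize(traj, dx):
    """Amplitude quantization with step dx (dx = 0: no quantization)."""
    t = np.asarray(traj, dtype=float)
    return t if dx == 0 else np.round(t / dx) * dx

# Toy reference trajectory: finer quantization should lie closer to it.
ref = np.sin(np.linspace(0.0, 20.0, 2000))
d_fine = l1_cdf_distance(quantize(ref, 0.01), ref)
d_coarse = l1_cdf_distance(quantize(ref, 1.0), ref)
```

For this toy signal the coarse quantization gives the larger distance, matching the qualitative behavior reported in the simulations.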
Figure 8: Results for the first simulation: the vertical axis shows the distance dL1(f(Δx,δt), f(Δx=0,δt)) as a function of the quantization levels for both the values (labeled "Δx") and the time (labeled "δt"). Part (a) shows the distance as a function of Δx for several values of δt. Part (b) shows the distance as a function of δt for several values of Δx.
Figure 7 shows that strong quantization (high values of Δx) yields large distances and that the distance decreases as the time interval δt increases. The z-axis in the figure represents the distance dL1(f(Δx,δt), f(Δx=0,δt)).
In our second simulation, we use a different connectivity (all other kinetic parameters are unchanged), and we
Figure 9: Results for the second simulation: the vertical axis shows the distance dL1(f(Δx,δt), f(Δx=0,δt)) as a function of the quantization levels for both the values (axis labeled "Δx") and the time (axis labeled "δt").
again use 10 different time intervals, δt = 1 second, 5 seconds, 10 seconds, 30 seconds, 1 minute, 2 minutes, 5 minutes, 10 minutes, 30 minutes, and 1 hour, and 6 different quantization levels, Δx = 0, 0.001, 0.01, 0.1, 1, and 10, where Δx = 0 means no quantization. The simulation is repeated and the distances are averaged over 30 different starting points.
Analogous to the first simulation, Figure 9 shows how strong
quantization (high values of Δx ) yields high distance, which
decreases when the time interval (δt ) increases.
An important observation regarding Figures 8 and 10 is
that the error decreases as δt increases. This is due to the fact
that the coarser the amplitude quantization is, the more difficult it is for small time intervals to capture the dynamics of
slowly changing sequences.
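This effect can be illustrated with a toy computation (our own construction, not from the paper): under coarse amplitude quantization, a slowly changing signal sampled at small intervals yields long runs of identical quantized values, so closely spaced samples reveal few changes.

```python
import numpy as np

def quantized_changes(signal, dx, stride):
    """Fraction of consecutive samples (taken every `stride` steps)
    that differ after amplitude quantization with step dx."""
    q = np.round(np.asarray(signal, dtype=float) / dx) * dx
    s = q[::stride]
    return float(np.mean(s[1:] != s[:-1]))

# Slowly varying toy signal: with coarse quantization, dense sampling
# sees mostly repeated values, while sparser sampling sees more changes.
slow = np.sin(np.linspace(0.0, 2.0 * np.pi, 10_000))
dense = quantized_changes(slow, dx=0.5, stride=1)
sparse = quantized_changes(slow, dx=0.5, stride=500)
```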
4. CONCLUSION
This study has proposed a metric to quantitatively compare
two networks and has demonstrated the utility of the metric via a simulation study involving different quantizations of
the reference network. A key property of the proposed metric
is that it allows comparison of networks of different natures.
It also takes into consideration differences in the steady-state
behavior and is invariant under time shifting and scaling.
The metric can be used for various purposes besides quantization issues. Possibilities include the generation of a projected network from a reference network by removing proteins from the equations and connectivity reduction by removing edges in the connectivity matrix.
The metric facilitates systematic study of the ability
of discrete dynamical models, such as Boolean networks,
to approximately represent more complex models, such as
differential-equation models. This can be particularly important in the framework of network inference, where the parameters for projected models can be inferred from the reference model, either analytically or via synthetic data generated via simulation of the reference model. Then, given the
Figure 10: Results for the second simulation: the vertical axis shows the distance dL1(f(Δx,δt), f(Δx=0,δt)) as a function of the quantization levels for both the values (labeled "Δx") and the time (labeled "δt"). Part (a) shows the distance as a function of Δx for several values of δt. Part (b) shows the distance as a function of δt for several values of Δx.
reference and projected models, the metric can be used to
determine the level of abstraction that provides the best inference given the number of observations available; this approach corresponds to constraining the classification rule for classifier inference in pattern recognition.
Marcel Brun et al.
11
NOMENCLATURE
Trajectory: A function f (t)
Distance function: The proposed distance between networks
NOTATIONS
t: Time
ψ: Network
x0: Starting point
f (t), g(t), h(t): Trajectories
fss , gss : Steady-state trajectories
fψ,x0 (t): Trajectory
ftran : Transient part of the trajectory
fss : Steady-state part of the trajectory
F(x), G(x), H(x): Cumulative distribution functions
dtr (·, ·): Distance between two trajectories
dss (·, ·): Distance between two periodic or constant trajectories
λ(A): Lebesgue measure of set A
f(t): Multivariate trajectory
ACKNOWLEDGMENTS
We would like to thank the National Science Foundation (CCF-0514644) and the National Cancer Institute (R01 CA104620) for sponsoring this research in part.
REFERENCES
[1] H. De Jong, “Modeling and simulation of genetic regulatory
systems: a literature review,” Journal of Computational Biology,
vol. 9, no. 1, pp. 67–103, 2002.
[2] R. Srivastava, L. You, J. Summers, and J. Yin, “Stochastic vs.
deterministic modeling of intracellular viral kinetics,” Journal
of Theoretical Biology, vol. 218, no. 3, pp. 309–321, 2002.
[3] R. Albert and A.-L. Barabási, “Statistical mechanics of complex networks,” Reviews of Modern Physics, vol. 74, no. 1, pp.
47–97, 2002.
[4] S. Kim, H. Li, E. R. Dougherty, et al., “Can Markov chain models mimic biological regulation?” Journal of Biological Systems,
vol. 10, no. 4, pp. 337–357, 2002.
[5] R. Albert and H. G. Othmer, “The topology of the regulatory
interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster,” Journal of Theoretical
Biology, vol. 223, no. 1, pp. 1–18, 2003.
[6] S. Aburatani, K. Tashiro, C. J. Savoie, et al., “Discovery of
novel transcription control relationships with gene regulatory
networks generated from multiple-disruption full genome expression libraries,” DNA Research, vol. 10, no. 1, pp. 1–8, 2003.
[7] J. Goutsias and S. Kim, “A nonlinear discrete dynamical model
for transcriptional regulation: construction and properties,”
Biophysical Journal, vol. 86, no. 4, pp. 1922–1945, 2004.
[8] H. Li and M. Zhan, “Systematic intervention of transcription
for identifying network response to disease and cellular phenotypes,” Bioinformatics, vol. 22, no. 1, pp. 96–102, 2006.
[9] A. Datta, A. Choudhary, M. L. Bittner, and E. R. Dougherty,
“External control in Markovian genetic regulatory networks,”
Machine Learning, vol. 52, no. 1-2, pp. 169–191, 2003.
[10] A. Choudhary, A. Datta, M. L. Bittner, and E. R. Dougherty,
“Control in a family of boolean networks,” in IEEE International Workshop on Genomic Signal Processing and Statistics
(GENSIPS ’06), College Station, Tex, USA, May 2006.
[11] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of
Pattern Recognition, Springer, New York, NY, USA, 1996.
[12] I. Ivanov and E. R. Dougherty, “Modeling genetic regulatory
networks: continuous or discrete?” Journal of Biological Systems, vol. 14, no. 2, pp. 219–229, 2006.
[13] I. Ivanov and E. R. Dougherty, “Reduction mappings between
probabilistic boolean networks,” EURASIP Journal on Applied
Signal Processing, vol. 2004, no. 1, pp. 125–131, 2004.
[14] S. Ott, S. Imoto, and S. Miyano, “Finding optimal models for
small gene networks,” in Proceedings of the Pacific Symposium
on Biocomputing (PSB ’04), pp. 557–567, Big Island, Hawaii,
USA, January 2004.
[15] L. F. Wessels, E. P. van Someren, and M. J. Reinders, “A comparison of genetic network models,” in Proceedings of the Pacific Symposium on Biocomputing (PSB ’01), pp. 508–519, Lihue, Hawaii, USA, January 2001.
[16] M. B. Elowitz, A. J. Levine, E. D. Siggia, and P. S. Swain,
“Stochastic gene expression in a single cell,” Science, vol. 297,
no. 5584, pp. 1183–1186, 2002.
[17] S. A. Kauffman, The Origins of Order: Self-Organization and
Selection in Evolution, Oxford University Press, New York, NY,
USA, 1993.
[18] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P.
Walter, Molecular Biology of the Cell, Garland Science, New
York, NY, USA, 4th edition, 2002.
[19] S. A. Kauffman, “Metabolic stability and epigenesis in randomly constructed genetic nets,” Journal of Theoretical Biology,
vol. 22, no. 3, pp. 437–467, 1969.
[20] P. A. Lynn, An Introduction to the Analysis and Processing of
Signals, John Wiley & Sons, New York, NY, USA, 1973.
[21] A. Arkin, J. Ross, and H. H. McAdams, “Stochastic kinetic
analysis of developmental pathway bifurcation in phage λ-infected Escherichia coli cells,” Genetics, vol. 149, no. 4, pp.
1633–1648, 1998.
[22] V. Iyer and K. Struhl, “Absolute mRNA levels and transcriptional initiation rates in Saccharomyces cerevisiae,” Proceedings
of the National Academy of Sciences of the United States of America, vol. 93, no. 11, pp. 5208–5212, 1996.
[23] J. R. Lorsch and D. Herschlag, “Kinetic dissection of fundamental processes of eukaryotic translation initiation in vitro,”
EMBO Journal, vol. 18, no. 23, pp. 6705–6717, 1999.
Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 73109, 11 pages
doi:10.1155/2007/73109
Research Article
A Robust Structural PGN Model for Control of Cell-Cycle
Progression Stabilized by Negative Feedbacks
Nestor Walter Trepode,1 Hugo Aguirre Armelin,2 Michael Bittner,3 Junior Barrera,1
Marco Dimas Gubitoso,1 and Ronaldo Fumio Hashimoto1
1 Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão 1010, 05508-090 São Paulo, SP, Brazil
2 Institute of Chemistry, University of São Paulo, Avenue Professor Lineu Prestes 748, 05508-900 São Paulo, SP, Brazil
3 Translational Genomics Research Institute, 445 N. Fifth Street, Phoenix, AZ 85004, USA
Received 27 July 2006; Revised 24 November 2006; Accepted 10 March 2007
Recommended by Tatsuya Akutsu
The cell division cycle comprises a sequence of phenomena controlled by a stable and robust genetic network. We applied a probabilistic genetic network (PGN) to construct a hypothetical model with a dynamical behavior displaying the degree of robustness typical of the biological cell cycle. The structure of our PGN model was inspired by well-established biological facts such as the existence of integrator subsystems, negative and positive feedback loops, and redundant signaling pathways. Our model represents gene interactions as stochastic processes and presents strong robustness in the presence of moderate noise and parameter fluctuations. A recently published deterministic yeast cell-cycle model does not perform as well as our PGN model, even under moderate noise conditions. In addition, self-stimulatory mechanisms can give our PGN model a pacemaker activity similar to that observed in the oscillatory embryonic cell cycle.
Copyright © 2007 Nestor Walter Trepode et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. INTRODUCTION
A complex genetic network is the central controller of the
cell-cycle process, by which a cell grows, replicates its genetic
material, and divides into two daughter cells. The cell-cycle
control system shows adaptability to specific environmental
conditions or cell types, exhibits stability in the presence of
variable excitation, is robust to parameter fluctuation and is
fault tolerant due to replications of network structures. It also
receives information from the processes being regulated and
is able to arrest the cell cycle at specific “checkpoints” if some
events have not been correctly completed. This is achieved by
means of intracellular negative feedback signals [1, 2].
Recently, two models were proposed to describe this control system. After exhaustive literature studies, Li et al. proposed a deterministic discrete binary model of the yeast cell-cycle control system, based entirely on documented data [3]. They studied the signal wave generated by the model, which goes through all the consecutive phases of cell-cycle progression, and verified by simulation that almost all state transitions of this deterministic model converge to this “biological pathway,” showing stability under different activation signal waveforms. Based on experimental data, Pomerening et al. proposed a continuous deterministic model for the self-stimulated embryonic cell cycle, which performs one division after another without the need for external stimuli or waiting to grow [4].
We recently proposed the probabilistic genetic network
(PGN) model, where the influence between genes is represented by a stochastic process. A PGN is a particular family
of Markov Chains with some additional properties (axioms)
inspired in biological phenomena. Some of the implications
of these axioms are: stationarity; all states are reachable; one
variable’s transition is conditionally independent of the other
variables’ transitions; the probability of the most probable
state trajectory is much higher than the probabilities of the
other possible trajectories (i.e., the system is almost deterministic); a gene is seen as a nonlinear stochastic gate whose
expression depends on a linear combination of activator and
inhibitory signals and the system is built by compiling these
elementary gates. This model was successfully applied for designing malaria parasite genetic networks [5, 6].
Here we propose a hypothetical structural PGN model
for the eukaryote control of cell-cycle progression, that aims
to reproduce the typical robustness observed in the dynamical behavior of biological systems. Control structures inspired by well-known biological facts, such as the existence of
integrators, negative and positive feedbacks, and biological
redundancies, were included in the model architecture. After adjusting its parameters heuristically, the model was able
to represent dynamical properties of real biological systems,
such as sequential propagation of gene expression waves, stability in the presence of variable excitation and robustness in
the presence of noise [7].
We carried out extensive simulations—under different
stimulus and noise conditions—in order to analyze stability
and robustness in our proposed model. We also analyzed the
performance of the yeast cell cycle control model constructed
by Li et al. [3] under similar simulations. Under mildly noisy conditions, our PGN model exhibited remarkable robustness, whereas Li’s yeast model did not perform as well. We infer that our PGN model very likely possesses structural features ensuring robustness that Li’s model lacks. To further emulate cellular environment conditions, we extended
our model to include random delays in its regulatory signals
without degrading its previous stability and robustness. Finally, with the addition of positive feedback, our model became self-stimulated, showing an oscillatory behavior similar to the one displayed by the embryonic cell-cycle [4]. Besides being able to represent the observed behavior of the
other two models, our PGN model showed strong robustness
to system parameter fluctuation. The dynamical structure of
the proposed model is composed of: (i) prediction by an almost deterministic stochastic rule (i.e., gene model), and (ii)
stochastic choice of an almost deterministic stochastic prediction rule (i.e., random delays).
After this introduction, in Section 2, we present our
mathematical modeling of a gene regulatory network by a
PGN. In Section 3, we briefly describe Li’s yeast cell-cycle
model and present the simulation, in the presence of noise,
of our PGN version of it. Sections 4 and 5 describe the architecture and dynamics of our model for control of cell-cycle
progression and analyze its simulations in the presence of
noise and random delays in the regulatory signals (the same
noise pattern was applied to both our model and Li’s yeastmodel). Section 6 shows the inclusion of positive feedback in
our model to obtain a pacemaker activity, similar to the one
found in embryonic cells. Finally, in Section 7 we discuss our
results and the continuity of this research.
2. MATHEMATICAL MODELING OF GENETIC NETWORKS
2.1. Genetic regulatory networks
The cell cycle control system is a complex network comprising many forward and feedback signals acting at specific
times. Figure 1 is a schematic representation of such a network, usually called a gene regulatory network. Proteins produced as a consequence of gene expression (i.e., after transcription and translation) form multiprotein complexes, that
interact with each other, integrating extracellular signals (not shown), regulating metabolic pathways (arrow 3), receiving (arrow 4), and sending (arrows 1 and 2) feedback signals.

Figure 1: Gene regulatory network.

In this way, genes and their protein products form a signaling network that controls function, cell division cycle, and
programmed cell death. In that network, the level of expression of each gene depends on both its own expression value
and the expression values of other genes at previous instants
of time, and on previous external stimuli. Such interactions between genes form networks that may be very
complex. The dynamical behavior of these networks can be
adequately represented by discrete stochastic dynamical systems. In the following subsections, we present a model of this
kind.
2.2. Discrete dynamical systems
Discrete dynamical systems, discrete in time and finite in
range, can model the behavior of gene networks [8–12]. In
this model, we represent each gene or protein by a variable
which takes the value of the gene expression or the protein
concentration. All these variables, taken collectively, are the
components of a vector called the state of the system. Each
component (i.e., gene or protein) of the state vector has an associated function that calculates its next value (i.e., expression value or protein concentration) from the states at previous instants of time. These functions are the components of
a function vector, called transition function, that defines the
transition from one state to the next and represents the actual
regulatory mechanisms.
Let R be the range of all state components. For example,
R = {0, 1} in binary systems, R = {−1, 0, 1} or R = {0, 1, 2}
in three levels systems. The transition function φ, for a network of N variables and memory m, is a function from RmN
to RN . This means that the transition function φ maps the
previous m states, x(t − 1), x(t − 2), . . . , x(t − m), into the
state x(t) with x(t) = [x1 (t), x2 (t), . . . , xN (t)]T ∈ RN . A discrete dynamical system is given by, for every time t ≥ 0,
x(t) = φ(x(t − 1), x(t − 2), . . . , x(t − m)).   (1)
A component of x is a value xi ∈ R. Systems defined as above
are time translation invariant, that is, the transition function
is the same for all discrete time t. The system architecture—
or structure—is the wiring diagram of the dependencies
between the variables (state vector components). The system
dynamics is the temporal evolution of the state vector, given
by the transition function.
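As a minimal illustration (our own toy example, not a model from the paper), a discrete dynamical system with memory m can be simulated by keeping a sliding window of the last m states:

```python
def simulate(phi, history, steps):
    """Iterate a discrete dynamical system with memory m:
    x(t) = phi(x(t-1), ..., x(t-m)).
    `history` is a list of the m most recent states, newest first;
    `phi` maps that list to the next state vector."""
    m = len(history)
    past = list(history)
    trajectory = []
    for _ in range(steps):
        x_next = phi(past)
        trajectory.append(x_next)
        past = [x_next] + past[:m - 1]  # shift the memory window
    return trajectory

# Toy binary network with N = 2 genes and memory m = 1:
# gene 0 copies gene 1; gene 1 negates gene 0.
phi = lambda past: (past[0][1], 1 - past[0][0])
traj = simulate(phi, [(0, 0)], 6)
```

For this toy transition function the trajectory is periodic with period 4, a simple instance of the dynamics discussed above.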
2.3. Probabilistic genetic networks
When the transition function φ is a stochastic function (i.e.,
for each sequence of states x(t − m), . . . , x(t − 2), x(t − 1),
the next state x(t) is a realization of a random vector), the
dynamical system is a stochastic process. Here we represent gene regulatory networks by stochastic processes, where
the stochastic transition function is a particular family of
Markov chains, that is called probabilistic genetic network
(PGN).
Consider a sequence of random vectors X0, X1, X2, . . . assuming values in RN, with realizations denoted, respectively, x(0), x(1), x(2), . . . . A sequence of random states (Xt), t ≥ 0, is called a Markov chain if for every t ≥ 1,
P(Xt = x(t) | X0 = x(0), . . . , Xt−1 = x(t − 1)) = P(Xt = x(t) | Xt−1 = x(t − 1)).   (2)
That is, the conditional probability of the future event, given
the past history, depends only upon the last instant of time.
Let X, with realization x, represent the state before a transition, and let Y , with realization y be the first state after
that transition. A Markov chain is characterized by a transition matrix πY |X of conditional probabilities between states,
whose elements are denoted p y|x , and the probability distribution π0 of the random vector representing the initial state.
The stochastic transition function φ at time t, is given by, for
every t ≥ 1,
φ[x] = φ(x(t − 1)) = y,   (3)

where y is a realization of a random vector with distribution p(• | x).
An mth-order Markov chain, which depends on the m previous instants of time, is equivalent to a Markov chain whose states have dimension m × N.
Let the sequence X = Xt−1 , . . . , Xt−m with realization x =
x(t − 1), . . . , x(t − m) represent the sequence of m states before
a transition. A probabilistic genetic network (PGN) is an mth-order Markov chain (πY |X , π0 ) such that
(i) πY |X is homogeneous, that is, p y|x is independent of t,
(ii) p y|x > 0 for all states x ∈ RmN , y ∈ RN ,
(iii) πY |X is conditionally independent, that is, for all states
x ∈ RmN , y ∈ RN ,
p(y | x) = Π_{i=1}^{N} p(yi | x),   (4)
(iv) πY |X is almost deterministic, that is, for every sequence
of states x ∈ RmN , there exists a state y ∈ RN such that
p y|x ≈ 1,
(v) for every variable i there exists a matrix ai and a vector bi of real numbers such that, for every x, z ∈ RmN and yi ∈ R, if

Σ_{j=1}^{N} Σ_{k=1}^{m} a^k_{ji} xj(t − k) = Σ_{j=1}^{N} Σ_{k=1}^{m} a^k_{ji} zj(t − k),
Σ_{k=1}^{pi} b^k_i xi(t − k) = Σ_{k=1}^{pi} b^k_i zi(t − k),   (5)

then p(yi | x) = p(yi | z), with 0 ≤ pi ≤ m.
These axioms imply that each variable xi is characterized
by a matrix and a vector of coefficients and a stochastic function gi from Z, a subset of integer numbers, to R.
If a^k_{ji} is positive, then the target variable xi is activated by the variable xj at time t − k; if a^k_{ji} is negative, then it is inhibited by variable xj at time t − k; if a^k_{ji} is zero, then it is not affected by variable xj at time t − k. We say that variable xi is predicted by the variable xj when some a^k_{ji} is different from zero. Similarly, if b^k_i is zero, the value of xi at time t is not affected by its previous value at time t − k. The constant parameter pi, for the state variable xi, represents the number of previous instants of time at which the values of xi can affect the value of xi(t). If pi = 0, previous values of xi have no effect on the value of xi(t) and the summation Σ_{k=1}^{pi} b^k_i xi(t − k) is defined to be zero.
The component i of the stochastic transition function φ, denoted φi, is built by the composition of a stochastic function gi with two linear combinations: (i) ai and the previous states x(t − 1), . . . , x(t − m), and (ii) bi and the values of xi(t − 1), . . . , xi(t − pi). This means that, for every t ≥ 1,

φi(x(t − 1), . . . , x(t − m)) = gi(α, β),   (6)

where

α = Σ_{j=1}^{N} Σ_{k=1}^{m} a^k_{ji} xj(t − k),   β = Σ_{k=1}^{pi} b^k_i xi(t − k),   (7)

and gi(α, β) is a realization of a random variable in R, with distribution p(• | α, β). This restriction on gi means that the components of a PGN transition function vector are random variables with a probability distribution conditioned on two linear combinations, α and β, from the fifth PGN axiom.
The PGN model reflects the properties of a gene as a nonlinear stochastic gate. Systems are built by compiling these
gates.
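A sketch of such a gate in code (our own construction; the weight layout and the particular choice of gi are illustrative assumptions, not the paper's specification):

```python
import random

def gene_gate(a_i, b_i, past_states, past_self, g_i):
    """One PGN gene update, phi_i = g_i(alpha, beta):
    alpha aggregates activator/inhibitor inputs (weights a_i[k][j]
    for gene j at delay k), beta aggregates the gene's own recent
    values (weights b_i[k]); g_i draws the next value."""
    alpha = sum(a_i[k][j] * past_states[k][j]
                for k in range(len(a_i))
                for j in range(len(a_i[k])))
    beta = sum(b_i[k] * past_self[k] for k in range(len(b_i)))
    return g_i(alpha, beta)

def almost_deterministic(alpha, beta, P=0.99, levels=(0, 1)):
    """Axiom (iv): the most probable outcome has probability ~1.
    Here the deterministic tendency is 1 when alpha + beta > 0
    (an assumed toy rule, not the paper's)."""
    y = 1 if alpha + beta > 0 else 0
    if random.random() < P:
        return y
    return random.choice([v for v in levels if v != y])
```

For example, a gene with one activator (weight +1) and one inhibitor (weight −1) fires when the activator alone is on, and stays off otherwise.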
Biological rationale for PGN axioms
The axioms that define the PGN model are inspired by biological phenomena. The dynamical system structure is justified by the necessity of representing a sequential process. The
discrete representation is sufficient since the interactions between genes and proteins occur at the molecular level [13].
The stochastic aspects represent perturbations or lack of detailed knowledge about the system dynamics. Axiom (i) is
just a constraint to simplify the model. In general, real systems are not homogeneous, but may be homogeneous by
parts, that is, in time intervals. Axiom (ii) imposes that all
states are reachable, that is, noise may lead the system to
any state. It is a quite general model that reflects our lack
of knowledge about the kind of noise that may affect the system. Axiom (iii) implies that the prediction of each gene can
be computed independently of the prediction of the other
genes, which is a kind of system decomposition consistent
with what is observed in nature. Axiom (iv) means that the
system has a main trajectory, that is, one that is much more
probable than the others. Axiom (v) means that genes act as a
nonlinear gate triggered by a balance between inhibitory and
excitatory inputs, analogous to neurons.
3. YEAST CELL-CYCLE MODEL
The eukaryotic cell-cycle process is an ordered sequence of
events by which the cell grows and divides in two daughter cells. It is organized in four phases: G1 (the cell progressively grows and by the end of this phase becomes irreversibly
committed to division), S (phase of DNA synthesis and chromosome replication), G2 (bridging “gap” between S and M),
and M (period of chromosomes separation and cell division)
[1, 2]. The cell-cycle basic organization and control system
have been highly conserved during evolution and are essentially the same in all eukaryotic cells, which makes the study of a simple organism like yeast all the more relevant.
We studied stability and robustness in a recently published deterministic binary control model of the yeast cell cycle, which was built entirely from real biological knowledge after extensive literature studies [3]. From the
≈ 800 genes involved in the yeast cell-cycle process [14],
only a small number of key regulators, responsible for the
control of the cell-cycle process, were selected to construct
a model where each interaction between its variables is documented in the literature. A dynamic model of these interactions would involve various binding constants and rates
[15, 16], but inspired by the on-off characteristic of many
of the cell-cycle control network components, and focusing
mainly on the overall dynamic properties and stability, they
constructed a simple discrete binary model. In this work we
refer to its simplified version, whose architecture is shown in
Figure 1B of [3].
The simulation1 in Figure 2(a) shows the state variables’
temporal evolution along the biological pathway, which goes
through all the sequential phases of the cell cycle, from the
excited G1 state (activated when CS—cell size—grows beyond a certain threshold), to the S phase, the G2 phase, the M
phase, and finally to the stationary G1 state where it remains.
The cell-cycle sequence has a total length of 13 discrete time
steps (period of the cycle). Under simulations driven by CS
pulses of increasing frequency,2 this system behaved well,
(a) Simulation of the deterministic binary yeast cell-cycle model with only one activator pulse of CS = 1 at t = −1. After the START state at t = 0, the system goes through the biological pathway, passing through all the sequential cell-cycle phases: G1 at t = 1, 2, 3; S at t = 4; G2 at t = 5; M at t = 6, . . . , 10; G1 at t = 11; and from t = 12 the system remains in the G1 stationary state (all variables at zero level except Sic1 = Cdh1 = 1).

1 All simulations in this work were performed using SGEN (simulator for gene expression networks) [17].
2 Simulations are not shown here.
(b) Simulation of the three-level PGN yeast cell-cycle model with 1% noise (PGN with P = 0.99) activated by a single pulse of CS = 2 at t = −1. After 13 time steps (the period of the cycle), the system should remain in the G1 stationary state, all variables at zero level except Sic1 = Cdh1 = 2 (compare with Figure 2(a)). Instead, this small amount of noise is enough to take the system completely out of its expected normal behavior.
Figure 2: Yeast cell-cycle model simulations.
showing strong stability, with all initiated cycles systematically going to conclusion, and new cycles being initiated only
after the previous one had finished.
3.1. PGN yeast cell-cycle model
In order to study the effect of noise and of increasing the number of signal levels on the performance of Li's yeast model [3], we translated it into a three-level PGN model. Initially, we mapped Li's binary deterministic model into a three-
Table 1: Threshold values for variables without self-degradation in
the PGN yeast cell-cycle model.
            xi(t − 1) = 0    xi(t − 1) = 1    xi(t − 1) = 2
th(1)_xi         1                0               −1
th(2)_xi         2                1                0
level deterministic one, with range of values R = {0, 1, 2} for
the state variables. By PGN axiom (iv), the PGN transition matrix πY |X is almost deterministic; that is, at every time step one of the transition probabilities satisfies p(y|x) ≈ 1. The deterministic case corresponds to this most probable transition having p(y|x) → 1 at every time step, or, in real terms, to the total absence of noise in the system. In this mapping, binary value 1 was mapped to 2 and binary value 0 to 0 in the three-level model. Intermediate values (in the driving and transition functions) were mapped so that they lie between the ones that have an exact correspondence. From this deterministic three-level model (having exactly the same dynamical behavior as the binary model from which it was derived) we specified the following PGN.
3.1.1. PGN specification and simulation
The total input signal driving a generic variable xi (t) ∈
{0, 1, 2} (1 ≤ i ≤ N) is given by its associated driving function:
di(t − 1) = Σ_{j=1}^{N} a_{ji} xj(t − 1).   (8)
Here, the system has memory m = 1 and a_{ji} is the weight of variable xj at time t − 1 in the driving function of variable xi. If variable xj is an activator of variable xi, then a_{ji} = 1; if variable xj is an inhibitor of variable xi, then a_{ji} = −1; otherwise, a_{ji} = 0.
Let

yi(t) = 2 if di(t − 1) ≥ th(2)_xi,
yi(t) = 1 if th(1)_xi ≤ di(t − 1) < th(2)_xi,   (9)
yi(t) = 0 if di(t − 1) < th(1)_xi.
The stochastic transition function chooses the next value of each variable to be (i) x_i(t) = y_i(t) with probability P ≈ 1, (ii) x_i(t) = a with probability (1 − P)/2, or (iii) x_i(t) = b with probability (1 − P)/2, where a, b ∈ {0, 1, 2} − {y_i} and th^{(1)}_{x_i}, th^{(2)}_{x_i} are the threshold values for levels one and two in the transition function of variable x_i. For this model to converge, when P → 1, to the deterministic one in the previous subsection, these thresholds must take the values indicated in Table 1, depending on the value of x_i(t − 1). If variable x_i has the self-degradation property, its threshold values are those in the column of x_i(t − 1) = 0, regardless of the actual value of x_i(t − 1).
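The quantization (9) followed by the stochastic choice can be sketched as follows; the threshold values used here are illustrative, not those of Table 1:

```python
import random

# One PGN step for a single gene: quantize the driving input d with the
# two thresholds (equation (9)), then keep that value with probability P
# or jump to one of the two remaining levels, each with probability (1-P)/2.

def next_value(d, th1, th2, P=0.99, rng=random):
    if d >= th2:
        y = 2
    elif d >= th1:
        y = 1
    else:
        y = 0
    if rng.random() < P:          # with probability P keep the quantized value
        return y
    # otherwise pick one of the two remaining levels uniformly
    return rng.choice([v for v in (0, 1, 2) if v != y])

# With P = 1 the rule is deterministic, as in the noiseless model:
assert next_value(5, th1=2, th2=4, P=1.0) == 2
assert next_value(3, th1=2, th2=4, P=1.0) == 1
assert next_value(0, th1=2, th2=4, P=1.0) == 0
```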
We simulated the three-level PGN version of Li's yeast model with probability P = 0.99 to represent the presence of 1% of noise in the system. Figure 2(b) shows a 200-step simulation of the system when the G1 stationary state is activated by a single start pulse of CS = 2 at t = −1. Comparing with Figure 2(a), we observe that this moderate noise is sufficient to degrade the system's performance. In particular, the system should remain in the G1 stationary state after the 13-step cycle period; however, numerous spurious waveforms are generated. Furthermore, when we simulated this system increasing the frequency of the CS activator pulses, noise seriously disturbed the normal signal wave propagation [18]. We conclude that this system does not have a robust performance under 1% of noise.
4. OUR STRUCTURAL MODEL FOR CONTROL OF CELL-CYCLE PROGRESSION
The PGN was applied to construct a hypothetical model based on components and structural features found in biological systems (integrators, redundancy, positive forward signals, positive and negative feedback signals, etc.) having a dynamical behavior (waves of control signals, stability to changes in the input signal, robustness to some kinds of noise, etc.) similar to that observed in real cell-cycle control systems.
During cell-cycle progression, families of genes show either brief or sustained expression during specific cell-cycle phases or transitions between phases (see, e.g., Figure 7 in [14]). In mammalian cells, the G0/G1 transition of the cell cycle requires sequential expression of genes encoding families of master transcription factors, for instance the fos and jun families of proto-oncogenes. Among the fos genes, c-fos and fosB are regulated essentially at the transcription level and are expressed for a brief period of time (0.5 to 1 h), displaying mRNAs and proteins of very short half-life. In addition, G1 progression and the G1/S transition are controlled by the cell-cycle regulatory machinery, comprising proteins of sustained (cyclin-dependent kinases, CDKs, and the Rb protein) and transient expression (cyclins D and E). The genes encoding cyclins D and E are transcribed at middle and late G1 phase, respectively. In fact, there are several CDKs regulating progression along all cell-cycle phases and transitions, whose activities depend on cyclins that are transiently expressed following a rigid sequential order. This basic regulation of cell-cycle progression is highly conserved in eukaryotes, from yeast to mammals. Accordingly, we organized our model into successive gene layers expressed sequentially in time. This wave of gene expression controls timing and progression through the cell-cycle process.
The architecture of our cell-cycle control model is depicted in Figure 3, showing the forward and feedback regulatory signals between gene layers (s, T, v, w, x, y, and z) that determine the system's dynamic behavior. These gene layers represent consecutive stages taking place along the classical cell-cycle phases G1, S, G2, and M. The layers comprise the genes (state variables) expressed during the execution of each stage and are grouped into two main parts: (i) the G1 phase (layer s), which represents the cell growth phase immediately before the onset of DNA
[Figure 3: Cell-cycle network architecture. External stimuli act on layer s (the G1 phase); F denotes the integration, at the trigger gene T, of the signals from layer s; the gene layers T, v, w, x, y, and z cover the S, G2, and M phases. Arrows indicate the forward signals, the feedbacks to T, and the feedbacks to the previous layer.]

Table 2: PGN weight values and transition function thresholds.

    Weights                              Thresholds
    a^k_FT = 6, k = 5, 6, ..., 9;        th^(1)_T = 9,  th^(2)_T = 12
    a^1_jT = -2, j = v, w, x, y, z
    a^k_Tv = 4, k = 5, 6, ..., 9;        th^(1)_v = 11, th^(2)_v = 22
    a^k_wv = -2, k = 1, 2
    a^k_vw = 6, k = 5, 6, ..., 9;        th^(1)_w = 20, th^(2)_w = 35
    a^k_xw = -1, k = 1, 2
    a^k_wx = 5, k = 5, 6, ..., 9;        th^(1)_x = 20, th^(2)_x = 28
    a^k_yx = -1, k = 1, 2
    a^5_xy = 2                           th^(1)_y = 6,  th^(2)_y = 12
    a^5_yz = 2                           th^(1)_z = 4,  th^(2)_z = 8

replication (i.e., the S phase), during which the cell responds to external regulatory stimuli (I); and (ii) the S, G2, and M phases (layers T, v, w, x, y, and z), which go from DNA replication to mitosis. The S-phase trigger gene T represents an important cell-cycle checkpoint, interfacing G1-phase regulatory signals and the initiation of DNA replication. The signal F (Figure 3) stands for the integration, at the trigger gene T, of activator signals from layer s. Our basic assumption implies that the cell-cycle control system is comprised of modules of parallel sequential waves of gene expression (layers s to z) organized around a checkpoint (trigger gene T) that integrates forward and feedback signals. For example, within a module, the trigger gene T balances forward and feedback signals to avoid initiation of a new wave of gene expression while a first one is still going through the cell cycle. A number of checkpoint modules, across the cell cycle, regulate cell growth and genome replication during the sequential G1, S, and G2 phases and cell duplication via mitosis.

In our model, the expression of one of the genes in layers v to z (i.e., after the trigger gene T; see Figure 3) typically yields three types of signals in the system: (i) a forward activator signal to genes in the next layer, which tends to make the cell cycle progress in its sequence; (ii) an inhibitory feedback signal to the genes in the previous layer, aiming to stop the propagation of a new forward signal for some time; and (iii) an inhibitory feedback signal to the trigger gene T, which tends to prevent the triggering of a new wave of gene expression while the current cycle is unfinished. The negative feedback signals perform an important regulatory action, tending to ensure that a new forward signal wave is neither initiated nor propagated through the system while the previous one is still going on. This imposes on the model essential robustness features of the biological cell cycle; for example, a cycle must be completed before initiating another cycle of cell duplication and division. Parallel signaling also provides robustness, acting as a backup mechanism in case of parts malfunction.

4.1. Complete PGN specification

This PGN is specified in the same way as the one in Section 3.1.1, changing the driving function to the following:

$$d_i(t-1) = \sum_{j=1}^{N} \sum_{k=1}^{m} a^{k}_{ji}\, x_j(t-k), \tag{10}$$

where m is the memory of the system and a^k_{ji} is the weight for variable x_j at time t − k in the driving function of variable x_i; and using the weight and threshold values shown in Table 2, where a^k_{ji} is the weight for the expression values of genes in layer j at time t − k in the driving function at time t of genes in layer i. Weight values not shown in the table are zero. Thresholds are the same for all genes in the same layer.
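Equation (10) extends (8) with a memory window. A sketch, with a hypothetical two-gene, memory-2 weight tensor (the numbers are illustrative, not those of Table 2):

```python
# Driving function with memory m, equation (10): hist[k-1] holds the
# state x(t-k), and W[k][j][i] is the weight a^k_ji (0-indexed lags here).

def driving_m(W, hist):
    n = len(hist[0])
    m = len(hist)
    return [sum(W[k][j][i] * hist[k][j]
                for k in range(m) for j in range(n))
            for i in range(n)]

# Two genes, memory m = 2: gene 0 drives gene 1 with weight 6 at lag 1
# and weight -2 at lag 2 (hypothetical numbers).
W = [
    [[0, 6], [0, 0]],   # lag k = 1
    [[0, -2], [0, 0]],  # lag k = 2
]
hist = [[2, 0],  # x(t-1)
        [1, 0]]  # x(t-2)
print(driving_m(W, hist))  # -> [0, 10]
```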
4.2. Experimental results
We simulated our hypothetical cell-cycle control model as a PGN with probability P = 0.99, driven by different excitation signals F (integration of the signals from layer s driving the trigger gene T): beginning with a single activation pulse (F = 2), then pulses of F of increasing frequency (i.e., pulses arriving more and more frequently in each simulation), and, finally, a constant signal F = 2. As the initial condition for the simulations of our model, we chose all variables from layers T to z at zero value in the m (memory of the system) previous instants of time. This represents, in our model, the G1 stationary state, where the system remains after a previous cycle has ended and when there is no activator signal F strong enough to commit the cell to division. For simplicity, when plotting these simulations, we show only one representative gene for each gene layer.
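The simulation protocol just described (start from the G1 stationary state, drive the trigger layer with periodic F pulses) can be sketched as a generic driver; `step` is a placeholder for one full PGN transition, and all names here are ours, not the paper's:

```python
import random

# Generic simulation driver: the trigger layer is excited by pulses of
# F = 2 arriving with a given period, starting from the all-zero (G1
# stationary) state held in an m-step history window.

def simulate(step, n_genes, m, steps=200, f_period=30, seed=0):
    rng = random.Random(seed)
    hist = [[0] * n_genes for _ in range(m)]   # G1 stationary state
    trajectory = []
    for t in range(steps):
        F = 2 if t % f_period == 0 else 0      # periodic activator pulse
        state = step(hist, F, rng)             # one PGN transition (placeholder)
        hist = [state] + hist[:-1]             # shift the memory window
        trajectory.append(state)
    return trajectory

# Trivial stand-in dynamics: every gene copies the F input.
traj = simulate(lambda h, F, rng: [F] * 3, n_genes=3, m=2, steps=5, f_period=2)
print(traj)  # -> [[2, 2, 2], [0, 0, 0], [2, 2, 2], [0, 0, 0], [2, 2, 2]]
```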
[Figure 4: Simulation of our three-level PGN cell-cycle progression control model with 1% of noise (PGN with P = 0.99) when activator pulses of F arrive after the previous cycle has ended. (a) One single start pulse of F = 2 at t = −1; (b) F = period 50 oscillator.]

[Figure 5: Simulation of our three-level PGN cell-cycle progression control model with 1% of noise (PGN with P = 0.99) when activator pulses of F can arrive before the previous cycle has ended. (a) F = period 30 oscillator; (b) F = period 3 oscillator.]

A single pulse of F (Figure 4(a)) makes the system go through all the cycle stages; then all signals remain at zero level (the G1 stationary state) with a very small amount of
noise. Comparing this simulation with the one in Figure 2(b) (the three-level PGN model of the yeast cell cycle under the same noise and activation conditions), we see that this system is almost unaffected by this amount of noise, during cycle progression as well as in the stationary state. The small extra pulses that arise outside the signal trains in the simulations of our model are the observable effect of the presence of 1% of noise (they do not appear when the system is simulated without noise [19]; not shown here). Figure 4(b) shows that when new F activator pulses are applied after each cycle is finished, cycles start and are completed normally.
For F pulses arriving more frequently, a new cycle is
started only if the previous one has finished (Figure 5(a)).
This control action is performed by the inhibitory negative
feedback signals—from layers v to z—acting on the trigger
gene T, carrying the information that a previous cycle is still
unfinished. We see, in these simulations, that noise neither generates spurious signal waves nor stops the forward cell-cycle signal (i.e., all normally initiated cycles finish). If a very frequent train of pulses triggers gene T before the end of the ongoing cycle, that signal is stopped at the following gene layers by the negative interlayer feedbacks. The regulation performed by these interlayer feedbacks provides another timing effect, assigning each stage (or layer) a given amount of time for the processes it controls by stopping the propagation of a new forward signal wave, coming from the previous layer, for some time. By means of two types of negative feedback (to the previous layer and to gene T), this system is able to resist the excessive activation signal, maintaining its natural period and thus mimicking the biological cell cycle in nature. But, as in biological systems, robustness has its limits: in our model, a very frequent excitation (a short-period train of F pulses, Figure 5(b), or constant F = 2, not shown here) surpasses the resistance of the negative feedbacks, taking the system out of its normal behavior.
For comparison purposes, we simulated both Li’s model
and ours with 1% of noise. In other simulations, not shown here, we gradually increased the noise in our model to see how much it can resist, and gradually decreased the noise in Li's model to determine the smallest amount of it that
can lead to undesired dynamical behavior. In the first case, we observed that in our model a noise above 3% is needed for a noise pulse to propagate through the consecutive layers as a spurious signal train (5% of noise is needed to stop the normal signal wave, preventing it from finishing an ongoing cell cycle) [19]. On the other hand, when simulating Li's binary model, we observed spurious pulse propagation even at 0.05% noise [18].

5. CELL-CYCLE PROGRESSION CONTROL MODEL WITH RANDOM DELAYS

We modified our model in order to admit random delays in signal propagation, maintaining its overall behavior and robustness.

5.1. PGN specification

Table 3: Delay probabilities.

    td       0    1    2
    P(td)   .2   .6   .2

Table 4: PGN weight values and transition function thresholds in the model with random delays in the regulatory signals (k' = k + td).

    Weights                                  Thresholds
    a^k'_FT = 6, k = 5, ..., 9               th^(1)_T = 9,  th^(2)_T = 12
    a^k'_jT = -1.33; j = v, w, x, y, z; k = 1    —
    a^k'_jT = -0.67; j = v, w, x, y, z; k = 2    —
    a^k'_Tv = 5, k = 5, ..., 9;              th^(1)_v = 11, th^(2)_v = 22
    a^k'_wv = -0.77, k = 1, ..., 9
    a^k'_vw = 7, k = 3, ..., 7;              th^(1)_w = 15, th^(2)_w = 25
    a^k'_xw = -0.83, k = 1, ..., 9
    a^k'_wx = 6, k = 4, ..., 8;              th^(1)_x = 20, th^(2)_x = 28
    a^k'_yx = -1.77, k = 1, ..., 9
    a^k'_xy = 3, k = 6                       th^(1)_y = 6,  th^(2)_y = 12
    a^k'_yz = 3, k = 6                       th^(1)_z = 4,  th^(2)_z = 8

[Figure 6: Simulation of our three-level PGN cell-cycle progression control model with random delays and 1% of noise (PGN with P = 0.99), when activator pulses of F arrive after the previous cycle has ended. (a) One single start pulse of F = 2 at t = −1, −2; (b) F = period 60 oscillator.]

In this version, before computing the driving function of a variable, the model chooses a random delay td for its arguments, with the probability distribution of Table 3. Once these delays are chosen, the stochastic transition function defined in Section 4.1 calculates the temporal evolution of the system, with the weights and thresholds indicated in Table 4. The transition function parameters, specifically its PGN weight values, depend on these variable delays. As shown in Table 4, these delays produce a time displacement of the weights, and thus of the inputs to the driving function of each variable. This system is no longer time-translation invariant, but adaptive. At each time step, it chooses a PGN
from a set of candidate PGNs (each one determined by one of the possible combinations of delays for its variables).

In Table 4, a^k'_{ji} denotes the weight for the expression values of genes in layer j at time t − k' (where k' = k + td) in the driving function of layer-i genes at time t. Weight values not shown in the table are zero. Thresholds are the same for all genes in the same layer, but td is not: it is chosen individually for each gene, by its associated component of the transition function, at each step of discrete time.
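The per-gene delay drawing of Table 3 and the resulting weight shift can be sketched as follows (function names and the history layout are assumptions, not the paper's code):

```python
import random

# Per-gene random delay of Section 5: at every time step each gene draws
# its own t_d from the Table 3 distribution, and the lags of its driving
# function are shifted by that amount (k' = k + t_d).

DELAYS = (0, 1, 2)
DELAY_PROBS = (0.2, 0.6, 0.2)   # P(t_d) from Table 3

def draw_delay(rng):
    return rng.choices(DELAYS, weights=DELAY_PROBS)[0]

def delayed_driving(weights, hist, rng):
    """One gene's driving input; weights[k][j] = a^k_j, hist[k][j] = x_j(t-1-k)."""
    td = draw_delay(rng)          # one delay for this gene, at this step
    d = 0
    for k in range(len(weights)):
        kk = k + td               # shifted lag k' = k + t_d
        if kk >= len(hist):
            continue              # history window exhausted
        for j, w in enumerate(weights[k]):
            d += w * hist[kk][j]
    return d
```

With a uniform history the result is independent of the drawn delay, e.g. `delayed_driving([[1, 1]], [[1, 1], [1, 1], [1, 1]], random.Random(2))` returns 2.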
5.2. Experimental results
We simulated this new model, with random delays, under the same conditions as the previous one, obtaining a similar dynamical behavior. Due to the random delays applied at every time step to the signals, the waveform widths and the period of the cycle are somewhat variable and longer than they were in the previous model.

Figure 6 shows the behavior of the system when it is driven by a single pulse of F = 2 or by a train of pulses whose
period is greater than the cycle period. The system behaves normally, with a small amount of noise, much weaker than the regulatory signals. When F pulses arrive more frequently and the period of the activator signal is shorter than the period of the cycle (Figure 7(a)), a new cycle is not started if the activator pulse arrives when the previous cycle has not been completed. Finally, when the activation F becomes very frequent or constant (Figure 7(b)), the negative feedbacks can no longer exert their regulatory action and the system undergoes dysregulation.

These simulations show the degree of robustness of our model system under noise and random delays, when driven by a wide variety of activator signals [20].

[Figure 7: Simulation of our three-level PGN cell-cycle progression control model with random delays and 1% of noise (PGN with P = 0.99), when activator pulses of F arrive before the previous cycle has ended, and with constant activation F = 2. (a) F = period 20 oscillator; (b) constant F = 2.]

6. CELL-CYCLE PROGRESSION CONTROL MODEL WITH RANDOM DELAYS AND POSITIVE FEEDBACK

Our model can exhibit a pacemaker activity, initiating one cell-division cycle after the previous one has finished without the requirement of external stimuli, if we include positive feedback in it. This oscillatory behavior is observed in nature during proliferation of embryonic cells [4]. For our model to present this oscillatory behavior, it suffices to include a positive feedback signal from gene z (the last layer) to the trigger gene T. The system is exactly the same as the previous random-delay PGN model, except for one additional nonzero weight: a^k'_{zT} = 7 (where k' = 5 + td).

[Figure 8: PGN cell-cycle progression control model with positive feedback from gene z to the trigger gene T (a^k'_{zT} = 7, k' = 5 + td), 1% of noise, and only one initial activator pulse F = 2 at t = −1, −2. (a) Due to the positive feedback from z to T, a new cycle is started right after the previous one has finished, without the need of a new F activator signal; this behavior is typical of the embryonic cell cycle, which depends on positive feedback loops to maintain undamped oscillations with the correct timing. (b) The second cycle in this figure is somewhat weakened (by the effect of noise and random delays), but the positive feedback overcomes this (without the need of F activation) and the system recovers its normal cyclical activity.]
6.1. Experimental results
In the simulation of Figure 8, the system is initially driven by
a single pulse of F = 2 at t = −1, −2. As in the embryonic cell
cycle, the positive feedback loop induces a pacemaker activity
where all cycles are completed normally with the correct timing for all the different phases. A new cycle starts right after
the completion of the previous one without the need of any
activator signal F. Figure 8(b) shows that when a signal wave
is weakened by the combined effect of noise and random delays, the positive feedback (without the need of any F activation) is sufficient to overcome this signal failure, putting the
system back into a normal-amplitude cyclical activity. These
simulations show the flexibility of our PGN model to represent different types of dynamical behavior, including the embryonic cell cycle, which is induced by positive feedback loops.
7. DISCUSSION
We designed a hypothetical PGN model for the control of cell-cycle progression, inspired by qualitative descriptions of well-known biological phenomena: the cell cycle is a sequence of events triggered by a control signal that propagates as a wave; there are signal-integrating subsystems and (positive and negative) feedback loops; and parallel replicated structures make the cell-cycle control fault tolerant. Furthermore, important real-world nonbiological control systems are usually designed to be stable, robust, and fault tolerant and to admit small probabilistic parameter fluctuations.
Our model's parameters were adjusted guided by the expected behavior of the system and exhaustive simulation. This modeling effort had no intention of representing details of molecular mechanisms, such as the kinetics and thermodynamics of protein interactions, the functioning of the transcription machinery, or microRNA and transcription factor regulation, but rather their concerted effects on the control of gene expression [13].
Our cell-cycle progression control model was able to represent some behavioral properties of the real biological system, such as: (i) sequential waves of gene expression; (ii) stability in the presence of variable excitation; (iii) robustness under noisy parameters, through (iii-i) prediction by an almost deterministic stochastic rule and (iii-ii) stochastic choice of an almost deterministic stochastic prediction rule (random delays); and (iv) self-stimulation by means of positive feedback.
The presence of numerous negative feedback loops in the model provides stability and robustness. They ensure that, under multiple noisy perturbation patterns, the system is able to automatically correct external stimuli that could destroy the cell. This kind of mechanism has commonly been found in nature. In particular, we think that the robustness of Li's yeast cell-cycle model [3] would be improved by the addition of critical negative feedback loops, which we suspect should exist in the biological system. The inclusion of positive feedback makes our model capable of exhibiting a pacemaker activity, like the one observed in embryonic cells. The parallel structure of the system architecture represents biological redundancy, which increases the system's fault tolerance.
Our discrete stochastic model qualitatively reproduces the behavior of both the Li et al. [3] and Pomerening et al. [4] models, exhibiting remarkable robustness under noise and random parameter variation. The natural follow-up of this research is to infer the PGN model from available dynamical data of cell-cycle progression, analogously to what we have done for the regulatory system of the malaria parasite [5, 6]. We anticipate that, very likely, analysis of these dynamical data will uncover unknown negative feedback loops in cell-cycle control mechanisms.
ACKNOWLEDGMENTS
This work was partially supported by Grants 99/07390-0, 01/14115-7, 03/02717-8, and 05/00587-5 from FAPESP,
Brazil, and by Grant 1 D43 TW07015-01 from The National
Institutes of Health, USA.
REFERENCES
[1] A. Murray and T. Hunt, The Cell Cycle, Oxford University
Press, New York, NY, USA, 1993.
[2] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P.
Walter, Molecular Biology of the Cell, Garland Science, New
York, NY, USA, 4th edition, 2002.
[3] F. Li, T. Long, Y. Lu, Q. Ouyang, and C. Tang, “The yeast cellcycle network is robustly designed,” Proceedings of the National
Academy of Sciences of the United States of America, vol. 101,
no. 14, pp. 4781–4786, 2004.
[4] J. R. Pomerening, S. Y. Kim, and J. E. Ferrell Jr., “Systems-level
dissection of the cell-cycle oscillator: bypassing positive feedback produces damped oscillations,” Cell, vol. 122, no. 4, pp.
565–578, 2005.
[5] J. Barrera, R. M. Cesar Jr., D. C. Martins Jr., et al., “A new annotation tool for malaria based on inference of probabilistic
genetic networks,” in Proceedings of the 5th International Conference for the Critical Assessment of Microarray Data Analysis (CAMDA ’04), pp. 36–40, Durham, NC, USA, November
2004.
[6] J. Barrera, R. M. Cesar Jr., D. C. Martins Jr., et al., “Constructing probabilistic genetic networks of Plasmodium falciparum from dynamical expression signals of the intraerythrocytic development cycle,” in Methods of Microarray Data Analysis V, chapter 2, Springer, New York, NY, USA, 2007.
[7] N. W. Trepode, H. A. Armelin, M. Bittner, J. Barrera, M. D.
Gubitoso, and R. F. Hashimoto, “Modeling cell-cycle regulation by discrete dynamical systems,” in Proceedings of IEEE
Workshop on Genomic Signal Processing and Statistics (GENSIPS ’05), Newport, RI, USA, May 2005.
[8] S. A. Kauffman, The Origins of Order, Oxford University Press,
New York, NY, USA, 1993.
[9] N. Friedman, M. Linial, I. Nachman, and D. Pe’er, “Using
Bayesian networks to analyze expression data,” Journal of Computational Biology, vol. 7, no. 3-4, pp. 601–620, 2000.
[10] H. De Jong, “Modeling and simulation of genetic regulatory
systems: a literature review,” Journal of Computational Biology,
vol. 9, no. 1, pp. 67–103, 2002.
[11] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, “Probabilistic Boolean networks: a rule-based uncertainty model for
gene regulatory networks,” Bioinformatics, vol. 18, no. 2, pp.
261–274, 2002.
[12] J. Goutsias and S. Kim, “A nonlinear discrete dynamical model
for transcriptional regulation: construction and properties,”
Biophysical Journal, vol. 86, no. 4, pp. 1922–1945, 2004.
[13] S. Bornholdt, “Less is more in modeling large genetic networks,” Science, vol. 310, no. 5747, pp. 449–451, 2005.
[14] P. T. Spellman, G. Sherlock, M. Q. Zhang, et al., “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular
Biology of the Cell, vol. 9, no. 12, pp. 3273–3297, 1998.
[15] J. J. Tyson, K. Chen, and B. Novak, “Network dynamics and
cell physiology,” Nature Reviews Molecular Cell Biology, vol. 2,
no. 12, pp. 908–916, 2001.
[16] K. C. Chen, L. Calzone, A. Csikasz-Nagy, F. R. Cross, B. Novak, and J. J. Tyson, “Integrative analysis of cell cycle control
in budding yeast,” Molecular Biology of the Cell, vol. 15, no. 8,
pp. 3841–3862, 2004.
[17] H. A. Armelin, J. Barrera, E. R. Dougherty, et al., “Simulator
for gene expression networks,” in Microarrays: Optical Technologies and Informatics, vol. 4266 of Proceedings of SPIE, pp.
248–259, San Jose, Calif, USA, January 2001.
[18] http://www.vision.ime.usp.br/~walter/pgn_cell_cycle/ycc_info.pdf.
[19] http://www.vision.ime.usp.br/~walter/pgn_cell_cycle/pgn_ccm_add_info.pdf.
[20] http://www.vision.ime.usp.br/~walter/pgn_cell_cycle/pgn_ccmrd_add_info.pdf.