EURASIP Journal on Bioinformatics and Systems Biology

Genetic Regulatory Networks

Guest Editors: Edward R. Dougherty, Tatsuya Akutsu, Paul Dan Cristea, and Ahmed H. Tewfik

Copyright © 2007 Hindawi Publishing Corporation. All rights reserved. This is a special issue published in volume 2007 of “EURASIP Journal on Bioinformatics and Systems Biology.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Editor-in-Chief: Ioan Tabus, Tampere University of Technology, Finland

Associate Editors: Jaakko Astola, Finland; Junior Barrera, Brazil; Michael Bittner, USA; Yidong Chen, USA; Paul Dan Cristea, Romania; Aniruddha Datta, USA; Bart De Moor, Belgium; Edward R. Dougherty, USA; Javier Garcia-Frias, USA; Debashis Ghosh, USA; John Goutsias, USA; Roderic Guigo, Spain; Yufei Huang, USA; Seungchan Kim, USA; John Quackenbush, USA; Jorma Rissanen, Finland; Stéphane Robin, France; Paola Sebastiani, USA; Erchin Serpedin, USA; Ilya Shmulevich, USA; Ahmed H. Tewfik, USA; Sabine Van Huffel, Belgium; Yue Wang, USA; Z. Jane Wang, Canada

Contents

Genetic Regulatory Networks, Edward R. Dougherty, Tatsuya Akutsu, Paul Dan Cristea, and Ahmed H. Tewfik. Volume 2007, Article ID 17321, 2 pages
Analysis of Gene Coexpression by B-Spline Based CoD Estimation, Huai Li, Yu Sun, and Ming Zhan. Volume 2007, Article ID 49478, 10 pages
Gene Systems Network Inferred from Expression Profiles in Hepatocellular Carcinogenesis by Graphical Gaussian Model, Sachiyo Aburatani, Fuyan Sun, Shigeru Saito, Masao Honda, Shu-ichi Kaneko, and Katsuhisa Horimoto. Volume 2007, Article ID 47214, 11 pages
Uncovering Gene Regulatory Networks from Time-Series Microarray Data with Variational Bayesian Structural Expectation Maximization, Isabel Tienda Luna, Yufei Huang, Yufang Yin, Diego P. Ruiz Padillo, and M. Carmen Carrion Perez. Volume 2007, Article ID 71312, 14 pages
Inferring Time-Varying Network Topologies from Gene Expression Data, Arvind Rao, Alfred O. Hero III, David J. States, and James Douglas Engel. Volume 2007, Article ID 51947, 12 pages
Inference of a Probabilistic Boolean Network from a Single Observed Temporal Sequence, Stephen Marshall, Le Yu, Yufei Xiao, and Edward R. Dougherty. Volume 2007, Article ID 32454, 15 pages
Algorithms for Finding Small Attractors in Boolean Networks, Shu-Qin Zhang, Morihiro Hayashida, Tatsuya Akutsu, Wai-Ki Ching, and Michael K. Ng. Volume 2007, Article ID 20180, 13 pages
Fixed Points in Discrete Models for Regulatory Genetic Networks, Dorothy Bollman, Omar Colón-Reyes, and Edusmildo Orozco. Volume 2007, Article ID 97356, 8 pages
Comparison of Gene Regulatory Networks via Steady-State Trajectories, Marcel Brun, Seungchan Kim, Woonjung Choi, and Edward R. Dougherty. Volume 2007, Article ID 82702, 11 pages
A Robust Structural PGN Model for Control of Cell-Cycle Progression Stabilized by Negative Feedbacks, Nestor Walter Trepode, Hugo Aguirre Armelin, Michael Bittner, Junior Barrera, Marco Dimas Gubitoso, and Ronaldo Fumio Hashimoto. Volume 2007, Article ID 73109, 11 pages

Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 17321, 2 pages
doi:10.1155/2007/17321

Editorial

Genetic Regulatory Networks

Edward R. Dougherty,1,2 Tatsuya Akutsu,3 Paul Dan Cristea,4 and Ahmed H. Tewfik5

1 Department of Electrical & Computer Engineering, College of Engineering, Texas A&M University, College Station, TX 77843-3128, USA
2 Translational Genomics Research Institute, Phoenix, AZ 85004, USA
3 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
4 Digital Signal Processing Laboratory, Department of Electrical Engineering, “Politehnica” University of Bucharest, 060032 Bucharest, Romania
5 Department of Electrical and Computer Engineering, Institute of Technology, University of Minnesota, Minneapolis, MN 55455, USA

Received 3 June 2007; Accepted 3 June 2007

Copyright © 2007 Edward R. Dougherty et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Systems biology aims to understand the manner in which the parts of an organism interact in complex networks, and systems medicine aims to base diagnosis and treatment on a systems-level understanding of molecular interaction, both intra- and intercellular. Ultimately, the enterprise rests on characterizing the interaction of the macromolecules constituting cellular machinery.
Genomics, a key driver in this enterprise, involves the study of large sets of genes and proteins, with the goal of understanding systems, not simply components. The major goal of translational genomics is to characterize genetic regulation and its effects on cellular behavior and function, thereby leading to a functional understanding of disease and the development of systems-based medical solutions. To achieve this goal it is necessary to develop nonlinear dynamical models that adequately represent genomic regulation and to develop mathematically grounded diagnostic and therapeutic tools based on these models.

Signals generated by the genome must be processed to characterize their regulatory effects and their relationship to changes at both the genotypic and phenotypic levels. Owing to the complex regulatory activity within the cell, a full understanding of regulation would involve characterizing signals at both the transcriptional (RNA) and translational (protein) levels; however, owing to the tight connection between the levels, a good portion of the information is available at the transcriptional level, and owing to the availability of transcription-based microarray technologies, most current studies utilize mRNA expression measurements. Since transcriptional (and posttranscriptional) regulation involves the processing of numerous and different kinds of signals, mathematical and computational methods are required to model the multivariate influences on decision making in genetic networks.

Construction of a network model is only the beginning of biological analysis. Understanding a gene network means understanding its dynamics, especially its long-run behavior. For instance, it has been conjectured that the stationary distribution characterizes phenotype. It is in terms of dynamics that issues such as stability, robustness, and therapeutic effects must be examined.
Indeed, it seems virtually impossible to design targeted treatment regimens that address a patient’s individual regulatory structure without taking into account the stochastic dynamics of cell regulation. From the perspective of systems medicine, perhaps the most important issue to be addressed is the design of treatment policies based on the external control of regulatory network models, since this is the route to the design of optimal therapies, both in terms of achieving desired changes and avoiding deleterious side effects.

As a discipline, signal processing involves the construction of model mathematical systems, including systems of differential equations, graphical networks, stochastic functional relations, and simulation models. And if we view signal processing in the wide sense, to include estimation, classification, automatic control, information theory, networks, and coding, we see that genomic signal processing will play a central role in the development of systems medicine. There is a host of important and difficult problems, ranging over issues such as inference, complexity reduction, and the control of high-dimensional systems. These represent an exciting challenge for the signal processing community and a chance for the community to play a leading role in the future of medicine.

Edward R. Dougherty
Tatsuya Akutsu
Paul Dan Cristea
Ahmed H. Tewfik

Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 49478, 10 pages
doi:10.1155/2007/49478

Research Article

Analysis of Gene Coexpression by B-Spline Based CoD Estimation

Huai Li, Yu Sun, and Ming Zhan

Bioinformatics Unit, Branch of Research Resources, National Institute on Aging, National Institutes of Health, Baltimore, MD 21224, USA

Received 31 July 2006; Revised 3 January 2007; Accepted 6 January 2007

Recommended by Edward R. Dougherty

The gene coexpression study has emerged as a novel holistic approach for microarray data analysis. Different indices have been used to explore coexpression relationships, but each is associated with certain pitfalls. The Pearson correlation coefficient, for example, cannot uncover nonlinear patterns or the directionality of coexpression. Mutual information can detect nonlinearity but fails to show directionality. The coefficient of determination (CoD) is unique in exploring different patterns of gene coexpression, but it has so far been applied only to discrete data, and converting continuous microarray data to the discrete format can lead to information loss. Here, we propose an effective algorithm, CoexPro, for gene coexpression analysis. The new algorithm is based on B-spline approximation of the coexpression between a pair of genes, followed by CoD estimation. The algorithm was justified by simulation studies and by functional semantic similarity analysis. The proposed algorithm is capable of uncovering both linear and a specific class of nonlinear relationships from continuous microarray data. It can also suggest possible directionality of coexpression to the researcher. The new algorithm presents a novel model for gene coexpression and will be a valuable tool for a variety of gene expression and network studies. The application of the algorithm was demonstrated by an analysis of ligand-receptor coexpression in cancerous and noncancerous cells. The software implementing the algorithm is available upon request to the authors.

Copyright © 2007 Huai Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. INTRODUCTION

The utilization of high-throughput microarray data gives rise to a picture of the transcriptome, the complete set of genes expressed in a given cell or organism under a particular set of conditions. With recent interest in biological networks, the gene coexpression study has emerged as a novel holistic approach for microarray data analysis [1–4]. Coexpression analysis of microarray data allows exploration of transcriptional responses that involve coordinated expression of genes encoding proteins which work in concert in the cell. Most coexpression studies have been based on the Pearson correlation coefficient [1, 2, 5]. The linear model-based correlation coefficient provides a good first approximation of coexpression, but it is also associated with certain pitfalls. When the relationship between the log-expression levels of two genes is nonlinear, the degree of coexpression is underestimated [6]. Since the correlation coefficient is a symmetric measure, it cannot provide evidence of a directional relationship in which one gene is upstream of another [7]. Similarly, mutual information is also not suitable for modeling directional relationships, although it has been applied in various coexpression studies [8, 9]. The coefficient of determination (CoD), on the other hand, is capable of uncovering nonlinear relationships in microarray data and suggesting directionality, and thus has been used in prediction analysis of gene expression, determination of connectivity in regulatory pathways, and network inference [10–14]. However, CoD has so far been applied only to discrete data, and continuous microarray data must be converted by quantization to the discrete format prior to application. The conversion by quantization can lead to the loss of important biological information, especially for a dataset with a small sample size and low data quality.
Moreover, quantization is a coarse-grained approximation of the gene expression pattern, and the resulting data may represent only a “qualitative” relationship and lead to biologically erroneous conclusions [15]. The B-spline is a flexible mathematical formulation for curve fitting with a number of desirable properties [16]. Under the smoothness constraint, the B-spline gives the “optimal” curve fit in terms of minimum mean-square error [16, 17]. Recently, B-splines have been widely used in microarray data analysis, including inference of genetic networks, estimation of mutual information, and modeling of time-series gene expression data [7, 17–23]. In a Bayesian network model for genetic network construction from microarray data [7], the B-spline has been used as a basis function for nonparametric regression to capture nonlinear relationships between genes. In numerical estimation of mutual information from continuous microarray data [23], a generalized indicator function based on B-splines has been proposed to obtain more accurate estimates of probabilities. By treating the gene expression level as a continuous function of time, B-spline approaches have been used to cluster genes based on mixture models [17, 19, 22] and to identify differentially expressed genes over time [18, 21]. All these studies have shown the great usefulness of the B-spline approach for microarray data analysis.

In this study, we propose a new algorithm, CoexPro, which is based on B-spline approximation followed by CoD estimation, for gene coexpression analysis. Given a pair of genes gx and gy with expression values {(x_i, y_i), i = 1, ..., N}, we first employ a B-spline to construct the functional relationship y = F(x) of the expression level y of gene gy given the expression level x of gene gx in the (x, y) plane.
We then compute the CoD to determine how well the expression of gene gy is predicted by the expression of gene gx based on the B-spline model. The proposed modeling is able to address a specific class of nonlinear relationships in gene coexpression in addition to linear correlation, it can suggest possible directionality of interactions, and it can be calculated directly from microarray data. We demonstrate the effectiveness of the new algorithm in disclosing different patterns of coexpression using both simulated and real gene-expression data. We validate the identified gene coexpression by examining its biological and physiological significance. We finally use the proposed method to analyze expression profiles of ligands and receptors in leukemia, lung cancer, prostate cancer, and their normal tissue counterparts. The algorithm correctly identified coexpressed ligand-receptor pairs specific to cancerous tissues and provided new clues for the understanding of cancer development.

2. METHODS

2.1. Model for gene coexpression of mixed patterns

Given a two-dimensional scatter plot of expression for a pair of genes gx and gy with expression values {(x_i, y_i), i = 1, ..., N}, we can explore whether there are hidden coexpression patterns between the two genes by modeling the plotted pattern. Here, we propose to use a B-spline to model the functional relationship y = F(x) of the expression level y of gene gy given the expression level x of gene gx in the (x, y) plane. Once we have the model, we compute the CoD to determine how well the expression of gene gy is predicted by the expression of gene gx. The CoD allows measurement of both linear and specific nonlinear patterns and suggests possible directionality of coexpression. Continuous microarray data can be used directly in the calculation without transformation into the discrete format, hence avoiding potential loss or misrepresentation of biological information.

2.1.1. Two-dimensional B-spline approximation

The two-dimensional (2D) B-spline is a set of piecewise polynomial functions [16]. Mathematically, it is most convenient to express the curve in the form x = f(t) and y = g(t), where t is some parameter, instead of using an implicit equation involving only x and y. This is called a parametric representation of the curve and is commonly used in B-spline curve fitting [16]. In this representation, the 2D B-spline curve is defined as

  (x, y) = (f(t), g(t)) = Σ_{j=1}^{n+1} B_{j,k}(t) (x_j, y_j),   t_min ≤ t < t_max.   (1)

In (1), (x_j, y_j), j = 1, ..., n + 1, are the n + 1 control points assigned from the data samples, and t is a parameter ranging between the minimum and maximum values of the elements of a knot vector. A knot vector, t_1, t_2, ..., t_{k+(n+1)}, is specified for a given number of control points n + 1 and B-spline order k; it is necessary that t_j ≤ t_{j+1} for all j. For an open curve, an open-uniform knot vector should be used, defined as

  t_j = t_1 = 0,   j ≤ k,
  t_j = j − k,   k < j < n + 2,
  t_j = t_{k+(n+1)} = n − k + 2,   j ≥ n + 2.   (2)

For example, if k = 3 and n + 1 = 10, the open-uniform knot vector is [0 0 0 1 2 3 4 5 6 7 8 8 8]. In this case, t_min = 0, t_max = 8, and 0 ≤ t < 8. The basis functions B_{j,k}(t) are of order k; k must be at least 2 and can be no more than n + 1. The B_{j,k}(t) depend only on the value of k and the values in the knot vector, and are defined recursively as

  B_{j,1}(t) = 1 if t_j ≤ t < t_{j+1}, and 0 otherwise,
  B_{j,k}(t) = ((t − t_j)/(t_{j+k−1} − t_j)) B_{j,k−1}(t) + ((t_{j+k} − t)/(t_{j+k} − t_{j+1})) B_{j+1,k−1}(t).   (3)

Given a pair of genes gx and gy with expression values {(x_i, y_i), i = 1, ..., N}, n + 1 control points {(x_j, y_j), j = 1, ..., n + 1} selected from {(x_i, y_i), i = 1, ..., N}, a knot vector t_1, t_2, ..., t_{k+(n+1)}, and the order k, the plotted pattern can be modeled by (1).
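As a concrete illustration, the open-uniform knot vector of (2) and the Cox-de Boor recursion of (3) can be sketched in Python. This is an illustrative sketch with our own function names, not the authors' released software:

```python
def open_uniform_knots(n_plus_1, k):
    """Open-uniform knot vector of length k + (n+1), per equation (2)."""
    n = n_plus_1 - 1
    knots = []
    for j in range(1, k + n_plus_1 + 1):  # j = 1, ..., k + (n+1)
        if j <= k:
            knots.append(0)
        elif j < n + 2:
            knots.append(j - k)
        else:
            knots.append(n - k + 2)
    return knots

def bspline_basis(j, k, t, knots):
    """B_{j,k}(t) by the recursion of equation (3); j is 1-based."""
    tj = lambda i: knots[i - 1]
    if k == 1:
        return 1.0 if tj(j) <= t < tj(j + 1) else 0.0
    left = right = 0.0
    if tj(j + k - 1) != tj(j):  # guard repeated knots (0/0 terms drop out)
        left = (t - tj(j)) / (tj(j + k - 1) - tj(j)) * bspline_basis(j, k - 1, t, knots)
    if tj(j + k) != tj(j + 1):
        right = (tj(j + k) - t) / (tj(j + k) - tj(j + 1)) * bspline_basis(j + 1, k - 1, t, knots)
    return left + right

def bspline_point(t, ctrl, k, knots):
    """Evaluate (f(t), g(t)) = sum_j B_{j,k}(t) (x_j, y_j), equation (1)."""
    x = sum(bspline_basis(j, k, t, knots) * ctrl[j - 1][0] for j in range(1, len(ctrl) + 1))
    y = sum(bspline_basis(j, k, t, knots) * ctrl[j - 1][1] for j in range(1, len(ctrl) + 1))
    return x, y
```

With k = 3 and n + 1 = 10, `open_uniform_knots(10, 3)` reproduces the example knot vector above, and the basis functions sum to one everywhere on [t_min, t_max).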
In (1), f(t) and g(t) are the x and y components of a point on the curve, and t is a parameter in the parametric representation of the curve.

2.1.2. CoD estimation

Under the MSE metric, the CoD is the ratio of the explained variation to the total variation and denotes the strength of association between predictor genes and the target gene. Mathematically, for any feature set X, the CoD relative to the target variable Y is defined as CoD_{X→Y} = (ε_0 − ε_X)/ε_0, where ε_0 is the prediction error in the absence of predictors and ε_X is the error of the optimal predictor. For the purpose of exploring coexpression patterns, we consider only a pair of genes gx and gy, where gy is the target gene predicted by the predictor gene gx. The errors are estimated from the available samples (resubstitution method) for simplicity. Given a pair of genes gx and gy with expression values x_i and y_i, i = 1, ..., N, where N is the number of samples, we construct the predictor y = F(x) for predicting the target expression value y. If the error is the mean-square error (MSE), then the CoD of gene gy predicted by gene gx can be computed according to the definition

  CoD_{gx→gy} = (ε_0 − ε_X)/ε_0 = [Σ_{i=1}^{N} (y_i − ȳ)² − Σ_{i=1}^{N} (y_i − F(x_i))²] / Σ_{i=1}^{N} (y_i − ȳ)².   (4)

When the relationship is linear or approximately linear, the CoD and the correlation coefficient are equivalent measurements, since the CoD equals R² if F(x_i) = m x_i + b. As the relationship departs from linearity, however, the CoD can capture some specific nonlinear information where the correlation coefficient fails. In terms of prediction of direction, both the correlation coefficient and mutual information are symmetric measurements that cannot provide evidence of which way causation flows. The CoD, however, can suggest the direction of a gene relationship: CoD_{gx→gy} is not necessarily equal to CoD_{gy→gx}. This feature makes the CoD uniquely useful, especially in network inference.

The key point in computing the CoD from (4) is to find the predictor y = F(x) from the continuous data samples (x_i, y_i). Motivated by the spirit of the B-spline, we formulate an algorithm to estimate the CoD from continuous gene expression data. The proposed algorithm is summarized as follows.

Input:
(i) A pair of genes gx and gy with expression values x_i and y_i, i = 1, ..., N, where N is the number of samples.
(ii) M, the interval between control points. Given N and M, the number of control points n + 1 is determined by n = ⌊N/M⌋, where ⌊·⌋ is the floor function.
(iii) Spline order k.

Output:
(i) CoD of gene gy predicted by gene gx.

Algorithm:
(i) Fit a two-dimensional B-spline curve (x, y) = (f(t), g(t)) in the (x, y) plane based on the n + 1 control points (x_j, y_j), j = 1, ..., n + 1, a knot vector t_1, ..., t_{k+(n+1)}, and the order k.
  (1) Reorder the samples so that (x_1 ≤ x_2 ≤ ... ≤ x_N) is monotonically increasing, with each y_i carrying the same index as its x_i.
  (2) Assign the n + 1 control points as (x_j, y_j) = (x_{1+(j−1)M}, y_{1+(j−1)M}), j = 1, ..., n, and (x_{n+1}, y_{n+1}) = (x_N, y_N).
  (3) Compute the B_{j,k}(t) basis functions recursively from (3).
  (4) Formulate (x, y) = (f(t), g(t)) = Σ_{j=1}^{n+1} B_{j,k}(t) (x_j, y_j) based on (1).
(ii) Calculate the CoD of gene gy predicted by gene gx.
  (1) Compute the mean expression value of gy as ȳ = Σ_{i=1}^{N} y_i / N.
  (2) For i = 1, ..., N, find ŷ_i = F(x_i) by eliminating t between x = f(t) and y = g(t): first find t_i = arg min_t |f(t) − x_i|, then compute ŷ_i = g(t_i).
  (3) Calculate the CoD from (4) based on the ordered sequence (x_i, y_i), i = 1, ..., N; by (4), the CoD value is the same as that calculated from the original ordering. Including the special cases: if ε_0 > 0, compute the CoD from (4) when ε_0 ≥ ε_X and set the CoD to 0 otherwise; if ε_0 = 0, set the CoD to 1 when ε_X = 0 and to 0 otherwise.
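A minimal Python sketch of the control-point assignment of step (i) and the CoD of equation (4) with its special cases; the t-elimination that produces the predictions F(x_i) is omitted, and all names are ours, not the authors' implementation:

```python
def control_points(xs, ys, M):
    """Steps (i)(1)-(2): order samples by x, then take every M-th sample
    plus the last one as the n+1 control points, with n = floor(N/M)."""
    pairs = sorted(zip(xs, ys))            # x_1 <= x_2 <= ... <= x_N
    n = len(xs) // M
    ctrl = [pairs[(j - 1) * M] for j in range(1, n + 1)]
    ctrl.append(pairs[-1])                 # (x_{n+1}, y_{n+1}) = (x_N, y_N)
    return ctrl

def cod(ys, y_pred):
    """CoD of the target given predictions F(x_i), per equation (4),
    including the special cases of step (ii)(3)."""
    y_mean = sum(ys) / len(ys)
    eps0 = sum((y - y_mean) ** 2 for y in ys)              # no-predictor error
    eps_x = sum((y - f) ** 2 for y, f in zip(ys, y_pred))  # predictor error
    if eps0 > 0:
        return (eps0 - eps_x) / eps0 if eps0 >= eps_x else 0.0
    return 1.0 if eps_x == 0 else 0.0
```

For N = 31 and M = 3, `control_points` yields n + 1 = 11 control points, matching the setting used later in the paper.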
2.1.3. Statistical significance

For a given CoD value estimated on the basis of the B-spline approximation (referred to as CoD-B in the following), the probability (P_shuffle) of obtaining a larger CoD-B at random between genes gx and gy is calculated by randomly shuffling one of the expression profiles through Monte Carlo simulation. In the simulation, a random dataset is created by shuffling the expression profiles of the predictor gene gx and the target gene gy, and CoD-B is estimated from the random dataset. This process is repeated 10,000 times with the parameters k and M held constant, and the resulting histogram of CoD-B shows that it can be approximated by the half-normal distribution. We then determine P_shuffle according to the probability distribution of CoD-B derived from the simulation.

2.2. Scheme for coexpression identification

Based on the new algorithm, we propose a scheme for identifying coexpression of mixed patterns using CoD-B as the measuring score. We first calculate CoD-B from gene expression data for each pair of genes under experimental conditions A and B; for example, condition A may represent the cancer state and condition B the normal state. Then, under cutoff values for CoD-B (e.g., 0.50) and P_shuffle (e.g., 0.05), we select the set of gene pairs that are significantly coexpressed under condition A and the set of gene pairs that are not significantly coexpressed under condition B as follows:

  setA := (coexpressed pairs satisfying CoD-B ≥ 0.50 AND P_shuffle < 0.05),
  setB := (coexpressed pairs satisfying CoD-B < 0.50 AND P_shuffle < 0.05).

The set of significantly coexpressed gene pairs differentiating condition A from condition B is chosen as the intersection of setA and setB: setC = setA ∩ setB.

2.3. Software and experimental validation

We have implemented a Java-based interactive computational tool for the CoexPro algorithm. All computations were conducted using this software.
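The Monte Carlo significance procedure of Section 2.1.3 amounts to a permutation test. A sketch follows; it returns the raw empirical fraction rather than the half-normal fit used in the paper, and the `score` argument stands in for the full B-spline CoD-B pipeline:

```python
import random

def p_shuffle(xs, ys, score, n_perm=1000, seed=0):
    """Empirical P_shuffle: the fraction of randomly shuffled target
    profiles whose coexpression score is at least the observed score.
    `score` is any coexpression index taking (xs, ys); in the paper it
    is CoD-B with k and M held constant across permutations."""
    observed = score(xs, ys)
    rng = random.Random(seed)
    perm = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)           # shuffle one expression profile
        if score(xs, perm) >= observed:
            hits += 1
    return hits / n_perm
```

Because only one profile is permuted, any pairwise score, symmetric or directional, fits the same interface.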
The effects of the number of control points and the order k of the B-spline function on CoD estimation were assessed on simulated datasets containing four different coexpression patterns: (1) a linear pattern, (2) nonlinear pattern I (piecewise pattern), (3) nonlinear pattern II (sigmoid pattern), and (4) a random pattern as a control. Each dataset contained 31 data points. The coexpression profiles of the four simulated patterns are shown in Supplementary Figures S1A, S1C, S1E, and S1G (supplementary figures are available at doi:10.1155/2007/49478). For each pattern, the averaged CoD (denoted CoD-bar below) and Z-score (Z) were calculated under different B-spline orders (k) and control-point intervals (M). To compute CoD-bar and the Z-score, the original dataset was shuffled 10,000 times; CoD-bar was obtained by averaging the CoD values of the shuffled data, and the Z-score was calculated as Z = (CoD − CoD-bar)/σ, where CoD was estimated from the original dataset and σ is the standard deviation of the shuffled CoD values.

The CoexPro algorithm was first validated for its ability to capture different coexpression patterns by comparing the results from CoD-B, CoD estimated from quantized data (referred to as CoD-Q in the following), and the correlation coefficient (R). The validation was conducted on the four simulated datasets described above and on four real expression datasets representing the four coexpression patterns (normal tissue array data obtained from the GEO database under accession number GSE1987). The coexpression profiles of the four real-data patterns are shown in Supplementary Figures S1B, S1D, S1F, and S1H. To obtain quantized data, gene expression values were discretized into three categories (overexpressed, equivalently expressed, and underexpressed), depending on whether the expression level was significantly greater than, similar to, or lower than the respective control threshold [11, 14].
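The paper shows the four simulated patterns only graphically (Supplementary Figure S1), so the generating functions below are our assumptions chosen to match the verbal descriptions (linear, piecewise, sigmoid, random), not the authors' exact simulators:

```python
import math
import random

def simulate_patterns(n=31, noise=0.05, seed=1):
    """Illustrative 31-point versions of the four simulated coexpression
    patterns. Forms and noise level are assumptions for illustration."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]
    linear = [(x, 2 * x + noise * rng.gauss(0, 1)) for x in xs]
    # nonlinear I: piecewise pattern -- rises, then falls past a breakpoint
    piecewise = [(x, (2 * x if x < 0.5 else 2 * (1 - x)) + noise * rng.gauss(0, 1))
                 for x in xs]
    # nonlinear II: sigmoid pattern -- threshold followed by saturation
    sigmoid = [(x, 1 / (1 + math.exp(-12 * (x - 0.5))) + noise * rng.gauss(0, 1))
               for x in xs]
    rand_pat = [(x, rng.random()) for x in xs]  # random control pattern
    return linear, piecewise, sigmoid, rand_pat
```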
Since some genes had a small natural range of variation, a z-transformation was used to normalize the expression of genes across experiments, so that the relative expression levels of all genes had the same mean and standard deviation. The control threshold for quantization was then set to one standard deviation.

The proposed algorithm was next validated for its ability to identify biologically significant coexpression. The validation was conducted by functional semantic similarity analysis. The analysis was based on the Gene Ontology (GO), in which each gene is described by a set of GO terms for the molecular functions, biological processes, or cellular components that the gene is associated with (http://www.geneontology.org). The functional semantic similarity of a pair of genes gx and gy was measured by the number of GO terms that they shared (GO_gx ∩ GO_gy), where GO_gx denotes the set of GO terms for gene gx and GO_gy the set for gene gy. The semantic similarity was set to zero if one or both genes had no GO terms. The semantic similarity was calculated for six sets of coexpressed gene pairs: (1) nonlinear coexpression pairs identified by CoD-B; (2) linear coexpression pairs identified by CoD-B; (3) nonlinear coexpression pairs identified by CoD-Q; (4) linear coexpression pairs identified by CoD-Q; (5) coexpression pairs identified by the correlation coefficient (R); and (6) randomly selected gene pairs. The real gene expression data used in this analysis were Affymetrix microarray data derived from normal white blood cells (obtained from the GEO database under accession number GSE137). The resulting distributions of similarity scores from the six gene-pair datasets were examined for statistical differences by the Kolmogorov-Smirnov test.
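The ternary quantization used to produce CoD-Q, as described above (z-transform per gene, then a one-standard-deviation control threshold), might be sketched as:

```python
def quantize(profile):
    """Ternary quantization for CoD-Q: z-transform a gene's expression
    profile, then label each sample +1 (overexpressed), 0 (equivalently
    expressed), or -1 (underexpressed) against a one-standard-deviation
    control threshold. A sketch of the scheme described in Section 2.3."""
    n = len(profile)
    mean = sum(profile) / n
    sd = (sum((e - mean) ** 2 for e in profile) / n) ** 0.5
    if sd == 0:
        return [0] * n  # flat profile: everything "equivalently expressed"
    z = [(e - mean) / sd for e in profile]
    return [1 if v > 1 else -1 if v < -1 else 0 for v in z]
```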
The proposed algorithm was finally validated by a case study of ligand-receptor coexpression in cancerous and normal tissues. The ligand-receptor cognate pair data were obtained from the database of ligand-receptor partners (DLRP) [5]. The gene expression data used in this study included Affymetrix microarray data derived from dissected tissues of acute myeloid leukemia (AML), lung cancer, prostate cancer, and their normal tissue counterparts (downloaded from the GEO database under accession numbers GSE995, GSE1987, and GSE1431, respectively). Each of these microarray datasets contained about 30 patient cancer samples and 10 normal tissue samples. The array data were normalized by the robust multiarray analysis (RMA) method [24].

3. RESULTS AND DISCUSSION

3.1. B-spline function and optimization

We applied the B-spline function to approximate the plotted pattern of a pair of genes prior to CoD estimation of coexpression. The shape of a curve fitted by a B-spline is specified by two major parameters: the number of control points sampled from the data and the B-spline order k. Different control points yield different modeling curves, while increasing the order k increases the smoothness of the modeling curve. We assessed the influence of these parameters on CoD estimation. The assessment was based on the four coexpression patterns derived by simulation: (1) linear, (2) nonlinear I (piecewise), (3) nonlinear II (sigmoid), and (4) random (see Section 2). The coexpression profiles of the four simulated patterns are shown in Supplementary Figure S1. Figures 1(a) and 1(b) show plots of the averaged CoD (CoD-bar) and the Z-score, respectively, under different B-spline orders (k) at fixed M = 3. CoD-bar was computed from 10,000 shuffled datasets, and the Z-score was calculated as Z = (CoD − CoD-bar)/σ, where CoD was estimated from the original dataset and σ is the standard deviation of the shuffled CoD values.
A high Z-score indicated that the CoD estimated from the real pattern was beyond random expectation. As indicated, the Z-score showed no sign of improvement when k increased to 4 or above for both linear and nonlinear coexpression patterns. Figures 1(c) and 1(d) show plots of CoD-bar and the Z-score, respectively, under different control-point intervals M at fixed k = 4. As indicated, at M = 1 (i.e., when all data points from the samples were used as control points), a data-overfitting phenomenon was observed, in which CoD was high but the Z-score was low for all data patterns. Increasing M decreased CoD-bar and increased the Z-score. Based on these results, and taking into account the small sample sizes of microarray data, we empirically set M = 3 and k = 4 for the identification of coexpression in this study.

[Figure 1: Estimation of averaged CoD and significance at different spline orders k and control-point intervals M under linear, nonlinear I (piecewise pattern), nonlinear II (sigmoid pattern), and random coexpression patterns. The datasets of the four patterns were generated by simulation. The averaged CoD and significance were calculated from 10,000 shuffled realizations of the dataset. (a) and (b) show averaged CoD and significance under different spline orders k at fixed M = 3; (c) and (d) show averaged CoD and significance under different control-point intervals M at fixed k = 4.]
3.2. Justification of the algorithm

To justify our algorithm, we compared CoD-B, CoD-Q, and the correlation coefficient (R) for their power in capturing different coexpression patterns, particularly nonlinear and directional relationships. Four coexpression patterns were analyzed: linear, nonlinear I (piecewise), nonlinear II (sigmoid), and random (see Section 2; Supplementary Figure S1). Table 1 shows the results. As expected, for the linear coexpression pattern, the CoD-B, CoD-Q, and R² values were all significantly high, and CoD-B performed well on both simulated and real data (p-value < 1.0E-6; see Table 1). For the random pattern, both CoD-B and R² were very low, as expected, but CoD-Q failed to uncover the random pattern, showing significantly high values (0.68 in the simulated dataset and 0.65 in the real-array data).

Table 1: Comparison of CoD estimated by our algorithm (CoD-B), CoD estimated from quantized data (CoD-Q), and the correlation coefficient (R²) under different coexpression patterns. Each entry gives the estimate with P_shuffle in parentheses.

                         Simulated data                                        Real data
Pattern        CoD-B            CoD-Q            R²                CoD-B            CoD-Q            R²
Linear         0.98 (1.0E-6)    0.98 (1.0E-6)    0.99 (1.0E-6)     0.65 (1.0E-6)    0.68 (3.3E-2)    0.68 (4.7E-3)
Nonlinear-I    0.94 (1.0E-6)    0.80 (1.0E-6)    1.8E-5 (9.5E-2)   0.68 (4.6E-3)    0.84 (1.2E-3)    0.31 (2.1E-3)
Nonlinear-II   0.98 (1.0E-6)    0.93 (1.0E-6)    0.57 (1.0E-6)     0.79 (8.2E-3)    0.79 (6.8E-3)    0.10 (1.9E-2)
Random         1.0E-5 (6.2E-1)  0.68 (7.4E-1)    0.0026 (4.3E-1)   1.0E-5 (6.6E-1)  0.65 (3.3E-1)    0.051 (2.5E-1)

For the nonlinear patterns, both CoD-B and CoD-Q performed well, with significantly high values, while R² was low and unable to reveal the patterns. As shown in Table 1, for nonlinear pattern I, CoD-B was 0.94 with p-value 1.0E-6 and CoD-Q was 0.80 with p-value 1.0E-6, while R² was 1.8E-5 with p-value 9.5E-2 in the simulated data.
In the real data, CoD-B was 0.68 with p-value 4.6E-3 and CoD-Q was 0.84 with p-value 1.2E-3, while R2 was 0.31 with p-value 2.1E-3. A similar trend was also observed for the nonlinear pattern II (see Table 1). It is important to explore nonlinear coexpression patterns and directional relationships in gene expression for gene regulation or pathway studies. The two nonlinear patterns that we examined in this study can represent different biological events. The nonlinear pattern I (piecewise pattern; Supplementary Figures S1C-S1D) may represent a negative feedback event: gene gx and gene gy initially have a positive correlation until gene gx reaches a certain expression level, after which the correlation becomes negative. The nonlinear pattern II (sigmoid pattern; Supplementary Figures S1E-S1F) may represent two consecutive biological events: threshold and saturation. Initially, gene gx's expression level increases without affecting gene gy's expression activity. When the level of gene gx reaches a certain threshold, gene gy's expression starts to increase with gx. But after gene gx's level reaches a second threshold, its effect on gene gy becomes saturated and gene gy's level plateaus. The directional relationship, particularly the interaction between transcription factors and their targets, is in turn an important component of gene regulatory networks or pathways. Our algorithm provides an effective means to analyze nonlinear coexpression patterns and to uncover directional relationships from microarray gene expression data. In this study, we estimated the errors arising from the CoD-B and CoD-Q calculations by the resubstitution method based on the available samples, for simplicity. Other methods, such as bootstrapping, could also be applied for the error estimation, especially when the sample size is small.
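The two nonlinear patterns just described can be simulated in a few lines, which also shows why the correlation coefficient misses pattern I. The breakpoint, slopes, and noise level below are illustrative assumptions, not the paper's simulation parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
gx = np.sort(rng.uniform(0.0, 1.0, 100))  # predictor gene expression

# Nonlinear pattern I (piecewise): positive relation until gx reaches a
# breakpoint, negative afterwards -- a negative-feedback-like shape.
bp = 0.5
gy_piecewise = np.where(gx < bp, gx, 2 * bp - gx) + 0.02 * rng.normal(size=gx.size)

# Nonlinear pattern II (sigmoid): threshold followed by saturation.
gy_sigmoid = 1.0 / (1.0 + np.exp(-12 * (gx - 0.5))) + 0.02 * rng.normal(size=gx.size)

# Pearson correlation misses the symmetric rise-and-fall of pattern I
# almost entirely, while pattern II retains substantial linear correlation.
r2_piecewise = np.corrcoef(gx, gy_piecewise)[0, 1] ** 2
r2_sigmoid = np.corrcoef(gx, gy_sigmoid)[0, 1] ** 2
```

For the tent-shaped pattern I the population covariance between gx and gy is exactly zero, so R2 stays near zero even though the functional dependence is perfect; this is the case the CoD-based measures are designed to catch.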
In exploring coexpression patterns, the current version of our algorithm deals with a pair of genes gx and gy, where gy is the target gene predicted by the predictor gene gx. In the future, we will extend the algorithm to explore multivariate gene relations as well.

3.3. Biological significance of coexpression identified by CoD-B

We validated our algorithm for its ability to capture biologically meaningful coexpression by functional semantic similarity analysis of the identified coexpressed genes. The semantic similarity measures the number of gene ontology (GO) terms shared by two coexpressed genes [2, 25]. Six sets of coexpressed gene pairs were subjected to the semantic similarity analysis: (1) 9419 nonlinear coexpression pairs picked up by CoD-B but not by the correlation coefficient (R) (cutoff value 0.70 for both CoD-B and R2); (2) 8225 linear coexpression pairs picked up by both CoD-B and R2 using the same cutoff; (3) 39406 nonlinear coexpression pairs picked up by CoD-Q but not by R2 using the same cutoff; (4) 8408 linear coexpression pairs picked up by both CoD-Q and R2 using the same cutoff; (5) 11596 coexpression pairs picked up by R2 using the same cutoff; and (6) 250000 randomly selected gene pairs used as a control. The gene expression data from normal white blood cells were used for the analysis. Figure 2 shows the distributions of semantic similarity scores for these datasets. For the random gene pairs, the cumulative probability reached 1 when the functional similarity was as high as 8, indicating that all of the random gene pairs had a functional similarity of 8 or below. In contrast, for the coexpressed genes identified by CoD-B, the cumulative probability did not reach 1 (i.e., 100% of gene pairs) until the semantic similarity was above 30, indicative of much higher functional similarities between the identified coexpressed genes.
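The GO-term-overlap score used here counts the annotations shared by a gene pair. A minimal sketch, with made-up gene names and GO terms:

```python
# Sketch of the GO-term-overlap semantic similarity described above:
# the score for a gene pair is the number of GO annotations the two
# genes share. Gene names and annotations below are illustrative only.

go_annotations = {
    "geneA": {"GO:0006915", "GO:0008283", "GO:0007165"},
    "geneB": {"GO:0006915", "GO:0007165", "GO:0016020"},
    "geneC": {"GO:0005840"},
}

def semantic_similarity(g1, g2, annotations):
    """Count the GO terms annotating both genes (shared-term overlap)."""
    return len(annotations[g1] & annotations[g2])

score_ab = semantic_similarity("geneA", "geneB", go_annotations)
score_ac = semantic_similarity("geneA", "geneC", go_annotations)
```

In the study's Figure 2, the cumulative distribution of exactly such scores separates coexpressed pairs from random pairs.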
The distributions of similarity scores derived from the two coexpressed gene datasets were very similar to each other, while both were significantly different from that of the randomly generated gene pairs (P < 1.0E-10 by the Kolmogorov-Smirnov test).

Figure 2: The distributions of functional similarity scores in six sets of gene pairs. The square line represents the distribution for randomly selected gene pairs; the circle line, linearly coexpressed gene pairs picked up by CoD-B; the triangle line, nonlinearly coexpressed gene pairs picked up by CoD-B; the star line, linearly coexpressed gene pairs picked up by CoD-Q; the diamond line, nonlinearly coexpressed gene pairs picked up by CoD-Q; and the downward-pointing triangle line, coexpressed gene pairs picked up by the correlation coefficient (R). The x-axis indicates functional semantic similarity scores (GO term overlap; see Section 2). For the random gene pairs, the cumulative probability reached 1 when the functional similarity was only 8, meaning that all the random gene pairs had a functional similarity of 8 or below. In contrast, for coexpressed genes picked up by CoD-B, the cumulative probability did not reach 1 (i.e., 100% of gene pairs) until the functional similarity was over 30, indicative of high functional similarities in the coexpressed genes. The cumulative distributions were significantly different from that of randomly generated gene pairs (P < 1.0E-10 by the Kolmogorov-Smirnov test).
For the coexpressed genes identified by CoD-Q, the curves of cumulative probability lay between the curves of the CoD-B case and the curve of the random case; the cumulative probability reached 1 at a semantic similarity above 25. For the coexpressed genes identified by R2, the curves of cumulative probability also lay between those of the CoD-B case and the random case. The results suggest that the new algorithm is effective in identifying biologically significant coexpression of both linear and nonlinear patterns.

3.4. Case study: coexpression of ligand-receptor pairs

We finally used our new algorithm to analyze the coexpression of ligands and their corresponding receptors in lung cancer, prostate cancer, leukemia, and their normal tissue counterparts. Significantly coexpressed ligand-receptor pairs were identified in the cancer and normal tissue groups at thresholds of 0.50 for R2 and CoD-B and 0.05 for Pshuffle. The results are shown in Supplementary Tables S1 to S6. By applying the criteria of differential coexpression (see Section 2), we identified ligand-receptor pairs that showed differential coexpression between cancerous and normal tissues, as well as among different cancers. Table 2 lists the differentially coexpressed genes between lung cancer and normal tissues; the values of CoD-Q and R2 are also listed for comparison. Supplementary Tables S7 and S8 list the differentially coexpressed genes in AML and prostate cancer, respectively. Twelve ligand-receptor pairs were differentially coexpressed between lung cancer and normal tissues (CoD-B difference > 0.40) (see Table 2).
The ligand BMP7 (bone morphogenetic protein 7), which is related to cancer development [26, 27], was among the differentially coexpressed genes. For BMP7 and its receptor ACVR2B (activin receptor IIB), the CoD-B was 0.76 (Pshuffle < 2.8E-2) in the lung cancer and 0.00 (Pshuffle < 5.8E-1) in the normal tissue, the CoD-Q was 0.75 (Pshuffle < 2.9E-2) in the lung cancer and 0.00 (Pshuffle < 5.7E-1) in the normal tissue, and the R2 value was 0.043 (Pshuffle < 2.9E-2) in the lung cancer and 0.0012 (Pshuffle < 1.0E-1) in the normal tissue (see Table 2). BMP7 and ACVR2B therefore showed nonlinear coexpression in the lung cancer, while they were not coexpressed in the normal tissue. The nonlinear coexpression relationship was detected by both CoD-B and CoD-Q, but not by R2. The coexpression profile (see Figure 3(a)) further showed that the two genes displayed approximately the nonlinear pattern I of coexpression, and that BMP7 was overexpressed in the lung cancer as compared with the normal tissue. These results are suggestive of a certain level of negative feedback in the interaction between BMP7 and ACVR2B, and the findings facilitate our understanding of the role of BMP7 in cancer development. The ligand CCL23 (chemokine ligand 23) and its receptor CCR1 (chemokine receptor 1), on the other hand, exhibited high linear coexpression in the normal lung tissue, while they were not coexpressed in the cancerous lung samples. As shown in Table 2, the CoD-B value of the gene pair was 0.85 in the normal tissue but 0.00 in the lung cancer, the CoD-Q value was 0.87 in the normal tissue but 0.62 in the lung cancer, and the R2 value was 0.92 in the normal tissue and 0.054 in the lung cancer. In this case, CoD-B and R2 differentiated the coexpression patterns of the two genes under the different conditions, but CoD-Q failed. The coexpression profile (see Figure 3(b)) further showed that the two genes displayed approximately the linear pattern of coexpression in the normal condition.
Table 2: Ligand-receptor pairs showing differential coexpression between the lung cancer and normal tissue based on CoD-B. The CoD-Q and R2 values of the pairs are also listed for comparison. Entries are value (Pshuffle).

Ligand   Receptor   CoD-B, cancer    CoD-B, normal    CoD-Q, cancer    CoD-Q, normal    R2, cancer       R2, normal
BMP7     ACVR2B     0.76 (2.8E-2)    0.00 (5.8E-1)    0.75 (2.9E-2)    0.00 (5.7E-1)    0.043 (2.9E-2)   0.0012 (1.0E-1)
EFNA3    EPHA5      0.84 (6.7E-6)    0.00 (6.9E-1)    0.66 (3.4E-1)    0.52 (1.6E-1)    0.22 (1.7E-2)    0.0072 (8.1E-1)
EGF      EGFR       0.50 (9.1E-4)    0.00 (6.6E-1)    0.64 (9.1E-1)    0.55 (2.2E-1)    0.20 (1.2E-2)    0.0034 (8.8E-1)
EPO      EPOR       0.49 (1.6E-5)    0.00 (7.1E-1)    0.092 (5.7E-2)   0.00 (5.0E-1)    0.14 (3.3E-2)    0.0022 (8.9E-1)
FGF8     FGFR2      0.55 (1.5E-7)    0.00 (6.6E-1)    0.70 (2.1E-1)    0.71 (4.0E-1)    0.30 (3.4E-3)    0.19 (2.5E-1)
IL16     CD4        0.62 (2.7E-6)    0.031 (6.8E-1)   0.76 (4.2E-2)    0.56 (2.7E-1)    0.40 (4.9E-4)    0.21 (2.1E-1)
CCL7     CCBP2      0.48 (4.7E-5)    0.00 (6.7E-1)    0.44 (7.4E-2)    0.61 (5.0E-1)    0.028 (3.5E-1)   0.086 (4.2E-1)
CCL23    CCR1       0.00 (7.3E-1)    0.85 (2.1E-9)    0.62 (8.0E-1)    0.87 (1.5E-2)    0.054 (2.3E-1)   0.92 (3.0E-4)
IL1RN    IL1R1      0.23 (7.7E-2)    0.83 (8.4E-7)    0.61 (7.2E-1)    0.81 (7.1E-2)    0.00 (9.6E-1)    0.90 (2.3E-4)
IL18     IL18R1     0.18 (9.7E-2)    0.71 (4.5E-6)    0.69 (8.1E-1)    0.67 (1.9E-1)    0.23 (9.0E-3)    0.64 (9.3E-3)
IL13     IL13RA2    0.00 (6.2E-1)    0.69 (1.5E-4)    0.59 (4.7E-1)    0.64 (2.2E-1)    0.0071 (6.7E-1)  0.69 (2.0E-2)
BMP5     BMPR2      0.00 (6.9E-1)    0.61 (1.7E-4)    0.58 (3.3E-1)    0.61 (2.8E-1)    0.12 (7.2E-2)    0.60 (1.7E-2)

Similarly, CCL23 and CCR1 were also highly coexpressed in the normal prostate samples (CoD-B = 0.85) but not coexpressed in the cancerous prostate samples (CoD-B = 0.00) (see Supplementary Table S8). However, CCL23 and CCR1 were not coexpressed in either normal (CoD-B = 0.00) or AML samples (CoD-B = 0.00).
The results suggest that CCL23 and CCR1 show differential coexpression not only between cancerous and normal tissues, but also among different cancers. It has been reported that chemokine family members and their receptors contribute to tumor proliferation, mobility, and invasiveness [28]. Some chemokines help to enhance immunity against tumor implantation, while others promote tumor proliferation [29]. Our results revealed the absence of a specific type of nonlinear interaction (for example, as described in Section 2.3) between CCL23 and CCR1 in lung and prostate cancer samples but not in AML samples, shedding light on the involvement of chemokine signaling in tumor development. We further identified different patterns of ligand-receptor coexpression in cancer and normal tissues. In the lung cancer, for example, 11 ligand-receptor pairs showed a linear coexpression pattern, significant in both CoD-B and R2, while 28 pairs showed a nonlinear pattern, significant only in CoD-B (see Supplementary Table S1). In the counterpart normal tissue, however, 35 ligand-receptor pairs showed a linear coexpression pattern, while 6 pairs showed a nonlinear pattern (see Supplementary Table S2). Such differences in coexpression pattern were not identified in previous coexpression studies based on the correlation coefficient [5].

4. CONCLUSION

In summary, we proposed an effective algorithm based on CoD estimation with B-spline approximation for modeling and measuring gene coexpression patterns. The model can address both linear and some specific nonlinear relationships, can suggest the directionality of an interaction, and can be calculated directly from microarray data without the quantization that could lead to information loss or misrepresentation. The newly proposed algorithm can be very useful in analyzing a variety of gene expression data in pathway or network
studies, especially in cases where there are specific nonlinear relations between the gene expression profiles.

Figure 3: Coexpression profiles of two representative ligand-receptor pairs in lung cancer cells and normal cells. (a) BMP7 and ACVR2B in lung cancer samples (Pshuffle < 2.8E-2) and normal samples (Pshuffle < 5.8E-1); (b) CCL23 and CCR1 in lung cancer samples (Pshuffle < 7.3E-1) and normal samples (Pshuffle < 2.1E-9).

ACKNOWLEDGEMENT

This study was supported, at least in part, by the Intramural Research Program, National Institute on Aging, NIH.

REFERENCES

[1] J. M. Stuart, E. Segal, D. Koller, and S. K. Kim, “A gene-coexpression network for global discovery of conserved genetic modules,” Science, vol. 302, no. 5643, pp. 249–255, 2003.
[2] H. K. Lee, A. K. Hsu, J. Sajdak, J. Qin, and P. Pavlidis, “Coexpression analysis of human genes across many microarray data sets,” Genome Research, vol. 14, no. 6, pp. 1085–1094, 2004.
[3] V. van Noort, B. Snel, and M. A. Huynen, “The yeast coexpression network has a small-world, scale-free architecture and can be explained by a simple model,” EMBO Reports, vol. 5, no. 3, pp. 280–284, 2004.
[4] S. L. Carter, C. M. Brechbuhler, M. Griffin, and A. T. Bond, “Gene co-expression network topology provides a framework for molecular characterization of cellular state,” Bioinformatics, vol. 20, no. 14, pp. 2242–2250, 2004.
[5] T. G. Graeber and D. Eisenberg, “Bioinformatic identification of potential autocrine signaling loops in cancers from gene expression profiles,” Nature Genetics, vol. 29, no. 3, pp. 295–300, 2001.
[6] M. J. Herrgård, M. W. Covert, and B. Ø. Palsson, “Reconciling gene expression data with known genome-scale regulatory network structures,” Genome Research, vol. 13, no. 11, pp. 2423–2434, 2003.
[7] S. Imoto, T.
Goto, and S. Miyano, “Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression,” Pacific Symposium on Biocomputing, pp. 175–186, 2002.
[8] A. J. Butte and I. S. Kohane, “Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements,” Pacific Symposium on Biocomputing, pp. 418–429, 2000.
[9] X. Zhou, X. Wang, and E. R. Dougherty, “Construction of genomic networks using mutual-information clustering and reversible-jump Markov-chain-Monte-Carlo predictor design,” Signal Processing, vol. 83, no. 4, pp. 745–761, 2003.
[10] S. Kim, H. Li, E. R. Dougherty, et al., “Can Markov chain models mimic biological regulation?” Journal of Biological Systems, vol. 10, no. 4, pp. 337–357, 2002.
[11] R. F. Hashimoto, S. Kim, I. Shmulevich, W. Zhang, M. L. Bittner, and E. R. Dougherty, “Growing genetic regulatory networks from seed genes,” Bioinformatics, vol. 20, no. 8, pp. 1241–1247, 2004.
[12] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, “Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks,” Bioinformatics, vol. 18, no. 2, pp. 261–274, 2002.
[13] E. R. Dougherty, S. Kim, and Y. Chen, “Coefficient of determination in nonlinear signal processing,” Signal Processing, vol. 80, no. 10, pp. 2219–2235, 2000.
[14] H. Li and M. Zhan, “Systematic intervention of transcription for identifying network response to disease and cellular phenotypes,” Bioinformatics, vol. 22, no. 1, pp. 96–102, 2006.
[15] V. Hatzimanikatis and K. H. Lee, “Dynamical analysis of gene networks requires both mRNA and protein expression information,” Metabolic Engineering, vol. 1, no. 4, pp. 275–281, 1999.
[16] H. Prautzsch, W. Boehm, and M. Paluszny, Bézier and B-Spline Techniques, Springer, Berlin, Germany, 2002.
[17] P. Ma, C. I. Castillo-Davis, W. Zhong, and J. S.
Liu, “A data-driven clustering method for time course gene expression data,” Nucleic Acids Research, vol. 34, no. 4, pp. 1261–1269, 2006.
[18] J. D. Storey, W. Xiao, J. T. Leek, R. G. Tompkins, and R. W. Davis, “Significance analysis of time course microarray experiments,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12837–12842, 2005.
[19] Z. Bar-Joseph, G. K. Gerber, D. K. Gifford, T. S. Jaakkola, and I. Simon, “Continuous representations of time-series gene expression data,” Journal of Computational Biology, vol. 10, no. 3-4, pp. 341–356, 2003.
[20] K. Bhasi, A. Forrest, and M. Ramanathan, “SPLINDID: a semiparametric, model-based method for obtaining transcription rates and gene regulation parameters from genomic and proteomic expression profiles,” Bioinformatics, vol. 21, no. 20, pp. 3873–3879, 2005.
[21] W. He, “A spline function approach for detecting differentially expressed genes in microarray data analysis,” Bioinformatics, vol. 20, no. 17, pp. 2954–2963, 2004.
[22] Y. Luan and H. Li, “Clustering of time-course gene expression data using a mixed-effects model with B-splines,” Bioinformatics, vol. 19, no. 4, pp. 474–482, 2003.
[23] C. O. Daub, R. Steuer, J. Selbig, and S. Kloska, “Estimating mutual information using B-spline functions—an improved similarity measure for analysing gene expression data,” BMC Bioinformatics, vol. 5, no. 1, p. 118, 2004.
[24] R. A. Irizarry, B. M. Bolstad, F. Collin, L. M. Cope, B. Hobbs, and T. P. Speed, “Summaries of Affymetrix GeneChip probe level data,” Nucleic Acids Research, vol. 31, no. 4, p. e15, 2003.
[25] P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble, “Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation,” Bioinformatics, vol. 19, no. 10, pp. 1275–1283, 2003.
[26] K. D. Brubaker, E. Corey, L. G. Brown, and R. L.
Vessella, “Bone morphogenetic protein signaling in prostate cancer cell lines,” Journal of Cellular Biochemistry, vol. 91, no. 1, pp. 151–160, 2004.
[27] S. Yang, C. Zhong, B. Frenkel, A. H. Reddi, and P. Roy-Burman, “Diverse biological effect and Smad signaling of bone morphogenetic protein 7 in prostate tumor cells,” Cancer Research, vol. 65, no. 13, pp. 5769–5777, 2005.
[28] A. Müller, B. Homey, H. Soto, et al., “Involvement of chemokine receptors in breast cancer metastasis,” Nature, vol. 410, no. 6824, pp. 50–56, 2001.
[29] J. M. Wang, X. Deng, W. Gong, and S. Su, “Chemokines and their role in tumor growth and metastasis,” Journal of Immunological Methods, vol. 220, no. 1-2, pp. 1–17, 1998.

EURASIP Journal on Bioinformatics and Systems Biology, Volume 2007, Article ID 47214, 11 pages, doi:10.1155/2007/47214

Research Article

Gene Systems Network Inferred from Expression Profiles in Hepatocellular Carcinogenesis by Graphical Gaussian Model

Sachiyo Aburatani,1 Fuyan Sun,1 Shigeru Saito,2 Masao Honda,3 Shu-ichi Kaneko,3 and Katsuhisa Horimoto1

1 Biological Network Team, Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
2 Chemo & Bio Informatics Department, INFOCOM CORPORATION, Mitsui Sumitomo Insurance Surugadai Annex Building, 3-11, Kanda-Surugadai, Chiyoda-ku, Tokyo 101-0062, Japan
3 Department of Gastroenterology, Graduate School of Medical Science, Kanazawa University, 13-1 Takara-machi, Kanazawa, Ishikawa 920-8641, Japan

Received 28 June 2006; Revised 27 February 2007; Accepted 1 May 2007

Recommended by Paul Dan Cristea

Hepatocellular carcinoma (HCC) in a liver with advanced-stage chronic hepatitis C (CHC) is induced by hepatitis C virus, which chronically infects about 170 million people worldwide.
To elucidate the associations between gene groups in hepatocellular carcinogenesis, we analyzed the profiles of the genes characteristically expressed in the CHC and HCC cell stages by a statistical method for inferring the network between gene systems based on the graphical Gaussian model. A systematic evaluation of the inferred network in terms of biological knowledge revealed that the inferred network was strongly involved in known gene-gene interactions with high significance (P < 10−4), and that the clusters characterized by different cancer-related responses were associated with those of the gene groups related to metabolic pathways and morphological events. Although some relationships in the network remain to be interpreted, the analyses revealed a snapshot of the orchestrated expression of cancer-related groups and of some pathways related to metabolism and morphological events in hepatocellular carcinogenesis, and thus provide possible clues about the disease mechanism and insights that address the gap between molecular and clinical assessments.

Copyright © 2007 Sachiyo Aburatani et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Hepatitis C virus (HCV) is the major etiologic agent of non-A, non-B hepatitis, and chronically infects about 170 million people worldwide [1–3]. Many HCV carriers develop chronic hepatitis C (CHC), and are finally afflicted with hepatocellular carcinoma (HCC) in livers with advanced-stage CHC. Thus, the CHC and HCC cell stages are essential in hepatocellular carcinogenesis. To elucidate the mechanism of hepatocellular carcinogenesis at the molecular level, many experiments have been performed using various approaches.
In particular, recent advances in techniques for simultaneously monitoring the expression levels of genes on a genomic scale have facilitated the identification of genes involved in tumorigenesis [4]. Indeed, some relationships between the disease and tumor-related genes have been proposed from gene expression analyses [5–7]. Apart from the relationship between tumor-related genes and the disease at the molecular level, information about the pathogenesis and the clinical characteristics of hepatocellular carcinogenesis has accumulated steadily [8, 9]. However, there is a gap between the information about hepatocellular carcinogenesis at the molecular level and that at more macroscopic levels, such as the clinical level. Furthermore, the relationships between tumor-related genes and other genes also remain to be investigated. Thus, an approach to describe the perspective of carcinogenesis from measurements at the molecular level is desirable, to bridge the gap between the information at the two different levels. Recently, we have developed an approach to infer a regulatory network, based on graphical Gaussian modeling (GGM) [10, 11]. Graphical Gaussian modeling is one of the graphical models, a family that also includes the Boolean and Bayesian models [12, 13]. Among the graphical models, GGM has the simplest structure in a mathematical sense: only the inverse of the correlation coefficient matrix between the variables is needed, and therefore GGM can be easily applied to a wide variety of data. However, straightforward applications of statistical theory to practical data fail in some cases, and GGM also fails frequently when applied to gene expression profiles; here, an expression profile indicates the set of expression degrees of one gene, measured under various conditions. This is because the profiles often share similar expression patterns, which means that the correlation coefficient matrix between the genes is not regular.
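The failure mode described above is easy to reproduce: when two profiles share the same expression pattern, the gene-gene correlation matrix becomes singular, so the inverse that GGM needs does not exist. A small NumPy illustration (not code from the paper):

```python
import numpy as np

# Three "genes": the second is an exact copy of the first, so two rows
# of the gene-gene correlation matrix coincide and the matrix is
# singular (not regular) -- the situation that defeats a
# straightforward application of GGM to expression profiles.
rng = np.random.default_rng(0)
g1 = rng.normal(size=20)
g2 = g1.copy()            # identical profile
g3 = rng.normal(size=20)

C = np.corrcoef(np.vstack([g1, g2, g3]))
det = np.linalg.det(C)    # ~0: C cannot be inverted stably
```

The ASIAN procedure described next sidesteps this by clustering similar profiles together before inverting anything.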
Thus, we have devised a procedure, named ASIAN (automatic system for inferring a network), to apply GGM to gene expression profiles in combination with hierarchical clustering [14]. First, the large number of profiles is grouped into clusters, according to the standard approach of profile analysis [15]. To avoid the generation of a nonregular correlation coefficient matrix from the expression profiles, we adopted a stopping rule for the hierarchical clustering [10]. Then, the relationships between the clusters are inferred by GGM. Thus, our method generates a framework of gene regulatory relationships by inferring the relationships between the clusters [11, 16], and provides clues toward estimating the global relationships between genes on a large scale. Methods for extracting biological knowledge from large amounts of literature and arranging it in terms of gene function have also been developed. Indeed, ontologies have been made available by the gene ontology (GO) consortium [17] to construct a functional categorization of genes and gene products, and by using the GO terms, software can determine whether any GO terms annotate a specified list of genes at a frequency greater than that expected by chance [18]. Furthermore, various software applications, most of them commercial, such as MetaCore from GeneGo (http://www.genego.com/), have been developed for the navigation and analysis of biological pathways, gene regulation networks, and protein interaction maps [19]. Thus, advances in the processing of biological knowledge have enabled us to relate the results of large-scale gene expression analyses to biological functions. In this study, we analyzed the gene expression profiles from the CHC and HCC cell stages with ASIAN, which is based on the graphical Gaussian model, to reveal the framework of gene group associations in hepatocellular carcinogenesis.
For this purpose, first, the genes characteristically expressed in hepatocellular carcinogenesis were selected, and then the profiles of the genes thus selected were subjected to the association inference method. In addition to the association inference, which was presented as the network between the clusters, the network was further interpreted systematically by using the biological knowledge of gene interactions and the functional categories with GO terms. The combination of the statistical network inference from the profiles with the systematic network interpretation based on the biological knowledge in the literature provides a snapshot of the orchestration of gene systems in hepatocellular carcinogenesis, especially for bridging the gap between the information on the disease mechanisms at the molecular level and at more macroscopic levels.

2. MATERIALS AND METHODS

2.1. Gene selection

We selected the up- and downregulated genes characteristically expressed in the CHC and HCC stages, as a prerequisite for defining the variables in the network inference by graphical Gaussian modeling. This involved the following steps. (1) The averages and the standard deviations in the respective conditions, AV_j and SD_j, for j = 1, ..., Nc, are calculated. (2) The expression degree of the ith gene in the jth condition, e_ij, is compared with |AV_j ± SD_j|. (3) The gene is regarded as a characteristically expressed gene if the number of conditions in which e_ij ≥ |AV_j ± SD_j| is more than Nc/2. Although the criterion for a characteristically expressed gene is usually |AV_j ± 2SD_j|, the present selection procedure is simply designed to gather as many characteristically expressed genes as possible, and is suitable for capturing a macroscopic relationship between the gene systems estimated by the following cluster analysis.
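The three selection steps can be sketched as follows. Note that the criterion e_ij ≥ |AV_j ± SD_j| is read here as "e_ij lies outside AV_j ± SD_j"; this reading, the function name, and the toy matrix are assumptions for illustration, not the authors' code.

```python
import numpy as np

def select_characteristic_genes(E):
    """Select characteristically expressed genes from an expression
    matrix E (genes x conditions), following the three steps above:
    per-condition mean AV_j and standard deviation SD_j, a per-entry
    comparison, and a keep-if-outside-in-more-than-Nc/2-conditions rule.
    """
    av = E.mean(axis=0)                  # AV_j for each condition j
    sd = E.std(axis=0)                   # SD_j for each condition j
    outside = np.abs(E - av) >= sd       # assumed reading of the criterion
    return np.where(outside.sum(axis=1) > E.shape[1] / 2)[0]

# Toy example: gene 1 is consistently far from the condition means,
# genes 0 and 2 sit close to them.
E = np.array([[0.0, 0.0, 0.0, 0.0],
              [5.0, 5.0, 5.0, 5.0],
              [0.1, -0.1, 0.0, 0.1]])
selected = select_characteristic_genes(E)
```

Under this reading, only the consistently deviating gene survives the filter, which matches the stated intent of gathering characteristically expressed genes.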
2.2. Gene systems network inference

The present analysis is composed of three parts: first, the profiles selected in the preceding section are subjected to a clustering analysis with automatic determination of the cluster number; then, the profiles of the clusters are subjected to graphical Gaussian modeling; finally, the network inferred by GGM is rearranged according to the magnitude of the partial correlation coefficients between the clusters, which can be regarded as the association strengths. The details of the analysis are as follows.

2.2.1. Clustering with automatic determination of cluster number

In clustering the gene profiles, the Euclidean distance between Pearson's correlation coefficients of the profiles and the unweighted pair group method using arithmetic averages (UPGMA, or group average method) were adopted as the metric and the technique, respectively, with reference to the previous analyses by GGM [11, 16]. In particular, the present metric between two genes is designed to reflect the similarity of their expression profile patterns with respect to all other genes, as well as between the measured conditions, that is,

d_{ij} = \sqrt{ \sum_{l=1}^{n} ( r_{il} - r_{jl} )^2 },    (1)

where n is the total number of genes, and r_{ij} is the Pearson correlation coefficient between the expression profiles of genes i and j, measured at Nc conditions p_{ik} (k = 1, 2, ..., Nc):

r_{ij} = \frac{ \sum_{k=1}^{N_c} (p_{ik} - \bar{p}_i)(p_{jk} - \bar{p}_j) }{ \sqrt{ \sum_{k=1}^{N_c} (p_{ik} - \bar{p}_i)^2 \cdot \sum_{k=1}^{N_c} (p_{jk} - \bar{p}_j)^2 } },    (2)

where \bar{p}_i is the arithmetic average of p_{ik} over the Nc conditions.

In the cluster number estimation, various stopping rules for hierarchical clustering have been developed [20]. Recently, we developed a method for estimating the cluster number in hierarchical clustering by considering the subsequent application of the graphical model to the clusters [10]. In our approach, the variance inflation factor (VIF) is adopted as a stopping rule, defined by

\mathrm{VIF}_i = r_{ii}^{-1},    (3)
where r_{ii}^{-1} is the ith diagonal element of the inverse of the correlation coefficient matrix between the explanatory variables [21]. In the cluster number determination, the popular cutoff value of 10.0 [21] was adopted as a threshold in the present analysis, again with reference to the previous analyses. After the cluster number determination, the average expression profiles are calculated for the members of each cluster, and then the average correlation coefficient matrix between the clusters is calculated from them. Finally, the average correlation coefficient matrix between the clusters is subjected to the graphical Gaussian modeling. Note that the average correlation coefficient matrix avoids the numerical difficulty described above, owing to the distinctive patterns of the average expression profiles of the clusters; this means that GGM works well for the average correlation coefficient matrix.

2.2.2. Graphical Gaussian modeling

The concept of conditional independence is fundamental to graphical Gaussian modeling (GGM). The conditional independence structure of the data is characterized by a conditional independence graph. In this graph, each variable is represented by a vertex, and two vertices are connected by an edge if there is a direct association between them. In contrast, a pair of vertices that are not connected in the graph is conditionally independent. In the procedure for applying GGM to the profile data [11], a graph G = (V, E) is used to represent the relationships among the M clusters, where V is a finite set of nodes, each corresponding to one of the M clusters, and E is a finite set of edges between the nodes. E consists of the edges between cluster pairs that are conditionally dependent.
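As a concrete illustration of the clustering machinery, the profile-pattern distance of (1) and the VIF stopping quantity of (3) can be sketched with NumPy. The function names and the toy correlation matrix are illustrative assumptions; a well-conditioned matrix keeps all VIFs far below the cutoff of 10.0.

```python
import numpy as np

def profile_distance_matrix(R):
    """Equation (1): Euclidean distance between rows of the gene-gene
    correlation matrix R, so two genes are close when they correlate
    similarly with every other gene."""
    diff = R[:, None, :] - R[None, :, :]     # pairwise row differences
    return np.sqrt((diff ** 2).sum(axis=2))  # d_ij

def vif(C):
    """Equation (3): variance inflation factors as the diagonal of the
    inverse correlation matrix; clustering stops once all VIFs fall
    below the usual cutoff of 10.0."""
    return np.diag(np.linalg.inv(C))

# Toy, well-conditioned correlation matrix between three variables.
C = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
D = profile_distance_matrix(C)
v = vif(C)
```

When profiles are nearly collinear, the inverse in `vif` blows up and the diagonal exceeds the cutoff, which is exactly the signal ASIAN uses to stop merging clusters.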
The conditional independence is estimated by the partial correlation coefficient, expressed by ri j ri, j |rest = − √ ii √ j j , r r Step 1. Prepare a complete graph of G(0) = (V , E). The nodes correspond to M clusters. All of the nodes are connected. G(0) is called a full model. Based on the expression profile data, construct an initial correlation coefficient matrix C(0). (4) where ri j |rest is the partial correlation coefficient between variables i and j, given the rest variables, and ri j is the (i, j) element in the reverse of the correlation coefficient matrix. In order to evaluate which pair of clusters is conditionally independent, we applied the covariance selection [22], which was attained by the stepwise and iterative algorithm developed by Wermuth and Scheidt [23]. The algorithm is presented as Algorithm 1. The graph obtained by the above procedure is an undirected graph, which is called an independence graph. The in- Step 3. Find an element that has the smallest absolute value among all of the nonzero elements of P(τ). Then, replace the element in P(τ) with zero. Step 4. Reconstruct the correlation coefficient matrix, C(τ + 1), from P(τ). In C(τ + 1), the element corresponding to the element set to zero in P(τ) is revised, while all of the other elements are left to be the same as those in C(τ). Step 5. In the Wermuth and Sheidt algorithm, the termination of the iteration is judged by the “deviance” values. Here, we used two types of deviance, dev1 and dev2, with the following: C(τ + 1) , dev1 = Nc log C(0) C(τ + 1) . dev2 = Nc log C(τ) (5) Calculate dev1 and dev2. The two deviances follow an asymptotic χ 2 distribution with a degree of freedom = n, and that with a degree of freedom = 1, respectively. n is the number of elements that are set to zero until the (τ + 1)th iteration. In our approach, n is equal to (τ + 1). |C(τ)| indicates the determinant of C(τ). 
Nc is the number of different conditions under which the expression levels of M clusters are measured. Step 6. If the probability value corresponding to dev1 ≤ 0.05, or the probability value corresponding to dev2 ≤ 0.05, then the model C(τ + 1) is rejected, and the iteration is stopped. Otherwise, the edge between a pair of clusters with a partial correlation coefficient set to zero in P(τ) is omitted from G(τ) to generate G(τ + 1), and τ is increased by 1. Then, go to Step 1. Algorithm 1 dependence graph represents which pair of clusters is conditionally independent. That is, when the partial correlation coefficient for a cluster pair is equal to 0, the cluster pair is conditionally independent, and the relationship is expressed as no edge between the nodes corresponding to the clusters in the independence graph. The genes grouped into each cluster are expected to share similar biological functions, in addition to the regulatory mechanism [24]. Thus, a network between the clusters can be approximately regarded as a network between gene systems, each with similar functions, from a macroscopic viewpoint. Note that the number of connections in one vertex is not limited, while it is only one in the cluster analysis. This 4 EURASIP Journal on Bioinformatics and Systems Biology feature of the network reflects the multiple relationships of a gene or a gene group in terms of the biological function. 2.2.3. Rearrangement of the inferred network When there are many edges, drawing them all on one graph produces a mess or “spaghetti” pattern, which would be difficult to read. Indeed, in some examples of the application of GGM to actual profiles, the intact networks by GGM still showed complicated forms with many edges [11, 16]. 
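The covariance-selection loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration on synthetic data, not the ASIAN implementation; in particular, the single-element update in Step 4 uses the standard closed-form adjustment delta = K_ij / (K_ii K_jj − K_ij²), with K the inverse of C, which zeroes one precision (and hence partial correlation) entry while leaving the other correlation entries fixed.

```python
import numpy as np
from scipy.stats import chi2


def partial_corr(C):
    """Partial correlation matrix from a correlation matrix, Eq. (4)."""
    K = np.linalg.inv(C)
    d = np.sqrt(np.diag(K))
    P = -K / np.outer(d, d)
    np.fill_diagonal(P, 1.0)
    return P


def covariance_selection(C0, Nc, alpha=0.05):
    """Stepwise edge removal in the spirit of Algorithm 1 (Wermuth-Scheidt).

    Returns the removed edges and the final fitted correlation matrix.
    """
    M = C0.shape[0]
    C = C0.copy()
    removed = []                        # edges whose partial correlation was zeroed
    candidate = ~np.eye(M, dtype=bool)  # entries still eligible for removal
    while True:
        P = partial_corr(C)             # Step 2
        A = np.where(np.triu(candidate, 1), np.abs(P), np.inf)
        i, j = np.unravel_index(np.argmin(A), A.shape)   # Step 3
        if not np.isfinite(A[i, j]):
            break                       # no eligible edges left
        # Step 4: revise only C[i, j] so that the (i, j) precision entry vanishes
        K = np.linalg.inv(C)
        delta = K[i, j] / (K[i, i] * K[j, j] - K[i, j] ** 2)
        Cnew = C.copy()
        Cnew[i, j] += delta
        Cnew[j, i] += delta
        # Step 5: deviances against the full model and the previous model
        n_zero = len(removed) + 1
        dev1 = Nc * np.log(np.linalg.det(Cnew) / np.linalg.det(C0))
        dev2 = Nc * np.log(np.linalg.det(Cnew) / np.linalg.det(C))
        # Step 6: stop when either deviance becomes significant
        if chi2.sf(dev1, df=n_zero) <= alpha or chi2.sf(dev2, df=1) <= alpha:
            break
        C = Cnew
        candidate[i, j] = candidate[j, i] = False
        removed.append((i, j))
    return removed, C


# Illustrative run on synthetic, mutually independent profiles.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))          # 30 conditions x 5 "clusters"
C0 = np.corrcoef(X, rowvar=False)
removed, Cfit = covariance_selection(C0, Nc=30)
```

For independent synthetic profiles, most edges are eventually pruned; the surviving edges define the independence graph of Section 2.2.2.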
Since the magnitude of the partial correlation coefficient indicates the strength of the association between clusters, the intact network can be rearranged according to the partial correlation coefficient values, to aid the interpretation of the associations between the clusters. The strength of the association can be assessed by a standard test for the partial correlation coefficient [25]. By Fisher's Z transformation of the partial correlation coefficients, that is,

Z = \frac{1}{2} \log \frac{ 1 + r_{ij \cdot rest} }{ 1 - r_{ij \cdot rest} },    (6)

Z is approximately distributed according to the normal distribution

N\left( \frac{1}{2} \log \frac{ 1 + r_{ij \cdot rest} }{ 1 - r_{ij \cdot rest} }, \; \frac{1}{ N_c - (M - 2) - 3 } \right),    (7)

where N_c and M are the number of conditions and the number of clusters, respectively. Thus, we can statistically test the observed partial correlation coefficients under the null hypothesis with a significance probability.

2.3. Statistical significance of the inferred network with the biological knowledge

The inferred network can be statistically evaluated in terms of the gene-gene interactions. The chance probability was estimated from the correspondence between the inferred cluster network and the information about gene interactions, by the following steps. (1) The known gene pairs with interactions in the database were overlaid onto the inferred network. (2) The number of cluster pairs upon which the gene interactions were overlaid was counted. (3) The chance probability that the cluster pairs connected by the established edges in the network were found among all possible pairs was calculated by using the following equation:

P = 1 - \sum_{i=0}^{f-1} \binom{g}{i} \binom{N-g}{n-i} \Big/ \binom{N}{n},    (8)

where N is the number of possible cluster pairs in the network, n is the number of cluster pairs with edges in the inferred network, f is the number of cluster pairs with edges in the inferred network that include the known gene pairs with interactions, and g is the number of cluster pairs that include the known gene pairs with interactions.

2.4. Evaluation of the inferred network in terms of the biological knowledge

The inferred network can be evaluated in terms of the biological knowledge. For this purpose, we characterize the clusters by GO terms, and overlay the knowledge about the gene interactions onto the network. We first use GO::TermFinder [18] to characterize the clusters by GO terms with a user-defined significance probability (http://search.cpan.org/dist/GO-TermFinder). Then, Pathway Studio [19] is used to survey the biological information about the gene interactions between the selected genes.

2.5. Software

All calculations of the present clustering and GGM were performed on the ASIAN web site [26, 27] (http://www.eureka.cbrc.jp/asian) and with "Auto Net Finder," the commercialized PC version of ASIAN, from INFOCOM CORPORATION, Tokyo, Japan (http://www.infocom.co.jp/bio/download).

2.6. Expression profile data

The expression profiles of 8516 genes were monitored in 27 CHC samples and 17 HCC samples [28].

3. RESULTS AND DISCUSSION

3.1. Clustering

Among the 8516 genes with expression profiles measured in the previous studies [28], 661 genes were selected as characteristically expressed in the CHC and HCC stages. As a preprocessing step for the association inference, the selected genes were automatically divided into 18 groups by ASIAN [26, 27]. Furthermore, each cluster was characterized in terms of GO terms, which define the macroscopic features of the cluster in terms of biological function. Figure 1 shows the dendrogram of the clusters, together with their expression patterns. As seen in Figure 1, the genes were grouped into 18 clusters, in terms of the number of members and the expression patterns in the clusters. The average number of cluster members was 36.7 genes (SD, 14.2), and the maximum and minimum numbers of members were 69 in cluster 14 and 18 in cluster 9, respectively.
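The clustering step of Section 2.2.1 (the metric of Eqs. (1)-(2), UPGMA, and the VIF rule of Eq. (3)) can be sketched as follows. This is an illustrative reading on synthetic profiles, not the ASIAN implementation; in particular, scanning cluster numbers until the maximum VIF of the cluster-average profiles falls below the cutoff of 10 is our interpretation of the stopping rule.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def profile_distance(X):
    """Distance of Eq. (1): Euclidean distance between the rows of the
    gene-gene Pearson correlation matrix (Eq. (2))."""
    R = np.corrcoef(X)                     # genes x genes correlations
    diff = R[:, None, :] - R[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def max_vif(X, labels):
    """Largest VIF (Eq. (3)) over the cluster-average profiles: the
    diagonal of the inverse of their correlation matrix."""
    means = np.array([X[labels == c].mean(axis=0) for c in np.unique(labels)])
    K = np.linalg.inv(np.corrcoef(means))
    return np.diag(K).max()


# Illustrative use on synthetic profiles: cut the UPGMA dendrogram at the
# smallest cluster number whose average profiles keep max VIF below 10.
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 12))          # 40 genes x 12 conditions
D = profile_distance(X)
Z = linkage(D[np.triu_indices(40, 1)], method="average")  # UPGMA
for k in range(2, 40):
    labels = fcluster(Z, t=k, criterion="maxclust")
    if len(np.unique(labels)) == k and max_vif(X, labels) < 10.0:
        break
```

The resulting cluster-average profiles are then what the GGM of Section 2.2.2 operates on.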
As for the expression pattern, five clusters (10, 12, 14, 15, and 18) were composed of upregulated genes, ten clusters (1-7, 9, 16, and 17) were composed of downregulated genes, and three clusters (8, 11, and 13) showed similar mixtures of up- and downregulated genes. Table 1 shows the GO terms for the clusters (clusterGOB), which characterized them well (see details at http://www.cbrc.jp/∼horimoto/HCGO.pdf). Among the 661 genes analyzed in this study, 525 genes were characterized by GO terms, and among the 18 clusters, 11 clusters were characterized by GO terms with P < .05. In addition, 188 genes (28.3% of all characterized genes) corresponded to the GO terms listed in Table 1. As seen in the table, although most clusters are characterized by several GO terms, reflecting the fact that genes generally function in multiple pathways, the clusters are not composed of mixtures of genes with distinctive functions. For example, cluster 2 is characterized by 10 terms, and most of the terms are related to energy metabolism. Thus, the GO terms in the respective clusters share similar features of biological function, reflecting the hierarchical structure of the GO term definitions. In Table 1, most of the clusters characterized by GO terms with P < .05 are related to response functions and to metabolism. Clusters 1, 6, 8, 12, and 13 are characterized by GO terms related to different responses, and clusters 2, 3, 4, and 7 are characterized by GO terms related to different aspects of metabolism. Although the genes in two clusters, 14 and 16, did not adhere to this dichotomy, the genes characteristically expressed in HCC in the above nine clusters were related to the responses and the metabolic pathways. As for the remaining clusters with lower significance, three clusters (9, 10, and 11) were also characterized by response functions, and four clusters (5, 15, 17, and 18) were related to morphological events at the cellular level.
Note that none of the clusters characterized by cellular-level events attained the significance level. This may be because the genes related to cellular-level events represent only a small fraction of all genes with known functions, in comparison with the genes related to molecular-level events in the definition of GO terms. It is interesting to examine the correspondence between the up- and downregulated genes and the GO terms in the clusters. In the five clusters of upregulated genes, clusters 10 and 12 were characterized by different responses, and two clusters were characterized by morphological events, namely the categories of "cell proliferation" in cluster 15 and of "development" in cluster 18. The remaining cluster, 14, was characterized by regulation, development, and metabolism. As for the clusters of downregulated genes, four of the ten clusters were characterized by GO terms related to various aspects of metabolism. Among the remaining six clusters, three were characterized by GO terms related to responses, two by morphological events, and one by mixed categories. In summary, the present gene selection and the subsequent automatic clustering produced a macroscopic view of gene expression in hepatocellular carcinogenesis. Although the clusters contain many genes that do not always share the same functions, the clusters were characterized by their responses, morphological events, and metabolic aspects from a macroscopic viewpoint. The clusters of upregulated genes were characterized by the former two categories, and those of downregulated genes represented all three categories. Thus, the present clustering serves to interpret the network between the clusters in terms of biological function and gene expression pattern.

3.2. Known gene interactions in the inferred network

The association between the 18 clusters inferred by GGM is shown in Figure 2.
In the intact network by ASIAN, 96 of the 153 possible edges between the 18 clusters (about 63%) were established by GGM.

Figure 1: Dendrogram of genes and profiles. The dendrogram was constructed by hierarchical clustering with the metric of the Euclidean distances between the correlation coefficients and the UPGMA. The blue line on the dendrogram indicates the cluster boundary estimated automatically by ASIAN. The gene expression patterns of the respective clusters in the CHC and HCC stages are shown by the degree of intensity: the red and green colors indicate relatively higher and lower intensities. The cluster number and the number of member genes in each cluster (in parentheses) are denoted on the right side of the figure.

Since the intact network is still messy, the network was rearranged to interpret its biological meaning by extracting the relatively strong associations between the clusters, according to the procedure in Section 2.2.3. After the rearrangement, 34 edges remained under the statistical test of the partial correlation coefficients at 5% significance. In the rearranged network, all of the clusters were nested, but each cluster was connected to only a few other clusters. Indeed, the average number of edges per cluster was 2.3, and the maximum and minimum numbers of edges were seven in cluster 15 and one in cluster 9, respectively. In particular, the numbers of edges are not proportional to the numbers of constituent genes in each cluster. For example, while the numbers of genes in clusters 15 and 17 are equal to each other (24 genes), the number of edges from cluster 15 (2 edges) differs from that from cluster 17 (5 edges). Thus, the number of edges does not depend on the number of genes belonging to the cluster, but rather on the gene associations between the cluster pairs.
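The 5% edge test applied here follows Eqs. (6) and (7): under the null hypothesis of no association, Fisher's Z of a partial correlation, multiplied by sqrt(N_c - (M - 2) - 3), is approximately standard normal. A minimal sketch follows; taking N_c = 44 (the 27 CHC plus 17 HCC samples) and M = 18 is our assumption, as the paper does not spell these numbers out next to the test.

```python
import math


def edge_p_value(r_partial, Nc, M):
    """Two-sided p-value for a partial correlation under the null of zero
    association, via Fisher's Z transformation (Eqs. (6)-(7)).
    Nc = number of conditions, M = number of clusters."""
    z = 0.5 * math.log((1 + r_partial) / (1 - r_partial))
    se = 1.0 / math.sqrt(Nc - (M - 2) - 3)
    stat = z / se
    # two-sided normal tail probability via the complementary error function
    return math.erfc(abs(stat) / math.sqrt(2))


# Assumed sample counts: 27 CHC + 17 HCC = 44 conditions, 18 clusters.
p = edge_p_value(0.45, Nc=44, M=18)
```

An edge is kept in the rearranged network when its p-value falls below the chosen significance level (here 0.05).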
To test the validity of the inferred network in terms of biological function, the biological knowledge about gene interactions was overlaid onto the inferred network. For this purpose, all of the gene pairs belonging to the cluster pairs were surveyed by Pathway Assist, a database of biological knowledge about molecular interactions, compiled based on the gene ontology [17]. Among the 661 genes analyzed in this study, interactions between 90 gene pairs were detected by Pathway Assist, and 50 of these pairs were found in Figure 2. Notice that the number of gene pairs reported in the literature does not directly reflect the importance of the gene interactions; instead, it is highly dependent on the number of scientists who are studying the corresponding genes. Thus, we counted the numbers of cluster pairs in which at least one gene pair was known, by projecting the gene pairs with known interactions onto the network. By this projection, interactions were found in 35 (g in the equation of Section 2.3) cluster pairs among the 153 (N) possible pairs (see details of the gene pair projection at http://www.cbrc.jp/∼horimoto/GPPN.pdf). Then, 19 (f) of the 35 cluster pairs overlapped with the 34 (n) cluster pairs in the rearranged network. The chance probability that a known interaction was found in the connected cluster pairs in the rearranged network was calculated as P < 10^{-4.3}. Thus, the rearranged network faithfully captures the known interactions between the constituent genes. Furthermore, the genes with known interactions were matched to the genes responsible for the GO terms of each cluster, as shown in Table 1. The genes responsible for the GO terms were distributed over all cluster pairs including gene pairs with known interactions, except for only two pairs: clusters 15 and 17, and clusters 15 and 18. Thus, the network can be interpreted not only by the known gene interactions but also by the GO terms characterizing the clusters.
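The chance probability of Eq. (8) is the upper tail of a hypergeometric distribution, so it can be computed directly from the counts given here (N = 153, g = 35, n = 34, f = 19); a small sketch:

```python
from math import comb


def chance_probability(N, g, n, f):
    """Upper-tail hypergeometric probability of Eq. (8): the chance that at
    least f of the n connected cluster pairs contain a known interaction,
    when g of the N possible pairs do."""
    total = comb(N, n)
    return 1.0 - sum(comb(g, i) * comb(N - g, n - i) for i in range(f)) / total


# Counts from the text: 153 possible cluster pairs, 35 containing known
# interactions, 34 edges in the rearranged network, 19 overlapping.
p = chance_probability(N=153, g=35, n=34, f=19)
```

The expected overlap by chance is only n·g/N ≈ 7.8 pairs, so an observed overlap of 19 yields a very small tail probability, consistent with the P < 10^{-4.3} reported in the text.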
3.3. Gene systems network characterized by GO terms

3.3.1. Coarse associations between the clusters

To elucidate the associations between the clusters, the cluster associations with 1% significance probability were further discriminated from those with 5% probability. This generated four groups of clusters, shown in Figure 3(a). First, we focus on the groups including clusters that were characterized by GO terms with a significant probability and that were definitely occupied by up- or downregulated genes (clusters depicted by triangles with bold lines in the figure). Groups I and III met these criteria. In group I, the clusters were a mixture of clusters of up- and downregulated genes. Note that three of the six clusters were composed of upregulated genes, which were characterized by responses (cluster 12), mixed categories (cluster 14), and morphological events (cluster 15). In group III, all three clusters consisted of downregulated genes. One cluster was characterized by responses, and two were characterized by amino-acid-related metabolism. In contrast, groups II and IV were composed of clusters that were only weakly characterized by GO terms and expression patterns. Thus, groups I and III provide the characteristic features of the orchestration of gene expression in hepatocellular carcinogenesis.

Secondly, a coarse graining of the group associations provides another viewpoint, shown in Figure 3(b). When the groups with at least one edge between the clusters in the respective groups are considered, regardless of the number of edges, groups I, II, and IV were nested, and group III was connected with only group I. In this second view, group I, which includes three of the five clusters of upregulated genes among all clusters, was associated with all of the other groups.
This suggests that group I represents a positive part of the gene expression in hepatocellular carcinogenesis, which is consistent with the interpretation from the first view, based on the significant GO terms and the clear expression patterns. Interestingly, among the clusters characterized by morphological events (clusters 5, 15, 17, and 18), three of the four clusters were distributed over groups I, II, and IV, and the distribution was consistent with the nested groups. This suggests that the upregulated genes of the clusters in group I are responsible for the events at the cellular level.

Thirdly, the clusters not belonging to the four groups were clusters 1, 3, and 5. Clusters 1, 3, and 5 were directly connected with groups I, III, and IV; groups I and III; and group IV, respectively. Interestingly, cluster 1, characterized only by "anti-inflammatory response," was connected with five clusters belonging to three groups, of which four were downregulated clusters. Although cluster 5 was not clearly characterized by the GO terms, cluster 3 was characterized by metabolic terms quite similar to those for cluster 2, a downregulated cluster. Thus, these three clusters may be involved in downregulation in hepatocellular carcinogenesis.

3.3.2. Interpretations of the inferred network in terms of pathogenesis

The coarse associations between the clusters in the preceding section can be interpreted at a macroscopic level, such as the pathological level. The interpretation of a network inferred from information at the molecular level will be useful to bridge the gap between descriptions of the disease mechanisms at the molecular and more macroscopic levels. One of the most remarkable associations is found in group I. Cluster 12, with upregulation, was associated at a 1% significance level with cluster 2, with downregulation.
The former cluster is characterized by the GO terms related to the immune response, and the latter is characterized by those involved with metabolism. In general, CHC and HCC result in serious damage to hepatocytes, which are important cells for nutrient metabolism, and the damage induces different responses. Indeed, HCC is a suitable target for testing active immunotherapy [29]. Furthermore, cluster 2 was also associated at a 1% significance level with cluster 14, characterized by prostaglandin-related terms. This may reflect the fact that one mediator of inflammation, prostaglandin, shows elevated expression in human and animal HCCs [30]. Thus, the associations in group I are involved in the molecular pathogenesis of the CHC and HCC stages.

Table 1: Cluster characterization by GO terms#.

Cluster no. | GO no. | Category | P-value | Fraction
1 | GO:0030236 | Anti-inflammatory response | 0.18% | 2 of 22/6 of 26081
2 | GO:0006094 | Gluconeogenesis | 0.06% | 3 of 37/19 of 26081
2 | GO:0006066 | Alcohol metabolism | 0.12% | 6 of 37/312 of 26081
2 | GO:0006091 | Generation of precursor metabolites and energy | 0.14% | 9 of 37/961 of 26081
2 | GO:0019319 | Hexose biosynthesis | 0.34% | 3 of 37/33 of 26081
2 | GO:0046165 | Alcohol biosynthesis | 0.34% | 3 of 37/33 of 26081
2 | GO:0046364 | Monosaccharide biosynthesis | 0.34% | 3 of 37/33 of 26081
2 | GO:0006067 | Ethanol metabolism | 0.48% | 2 of 37/5 of 26081
2 | GO:0006069 | Ethanol oxidation | 0.48% | 2 of 37/5 of 26081
2 | GO:0006629 | Lipid metabolism | 1.47% | 7 of 37/722 of 26081
2 | GO:0009618 | Response to pathogenic bacteria | 4.96% | 2 of 37/15 of 26081
3 | GO:0006094 | Gluconeogenesis | 0.61% | 2 of 15/19 of 26081
3 | GO:0019319 | Hexose biosynthesis | 1.87% | 2 of 15/33 of 26081
3 | GO:0046165 | Alcohol biosynthesis | 1.87% | 2 of 15/33 of 26081
3 | GO:0046364 | Monosaccharide biosynthesis | 1.87% | 2 of 15/33 of 26081
3 | GO:0009069 | Serine family amino acid metabolism | 4.49% | 2 of 15/51 of 26081
4 | GO:0006725 | Aromatic compound metabolism | 0.07% | 4 of 20/140 of 26081
4 | GO:0009308 | Amine metabolism | 0.38% | 5 of 20/454 of 26081
4 | GO:0006570 | Tyrosine metabolism | 0.59% | 2 of 20/11 of 26081
4 | GO:0050878 | Regulation of body fluids | 1.65% | 3 of 20/113 of 26081
4 | GO:0006950 | Response to stress | 2.70% | 6 of 20/1116 of 26081
4 | GO:0006519 | Amino acid and derivative metabolism | 4.12% | 4 of 20/398 of 26081
4 | GO:0007582 | Physiological process | 4.63% | 20 of 20/17195 of 26081
5 | GO:0006917 | Induction of apoptosis* | 16.06% | 2 of 13/132 of 26081
5 | GO:0012502 | Induction of programmed cell death* | 16.06% | 2 of 13/132 of 26081
6 | GO:0009613 | Response to pest, pathogen, or parasite | 0.00% | 8 of 29/522 of 26081
6 | GO:0043207 | Response to external biotic stimulus | 0.00% | 8 of 29/557 of 26081
6 | GO:0006950 | Response to stress | 0.00% | 10 of 29/1116 of 26081
6 | GO:0009605 | Response to external stimulus | 0.05% | 10 of 29/1488 of 26081
6 | GO:0006953 | Acute-phase response | 0.05% | 3 of 29/25 of 26081
6 | GO:0006955 | Immune response | 0.34% | 8 of 29/1098 of 26081
6 | GO:0006956 | Complement activation | 0.48% | 3 of 29/52 of 26081
6 | GO:0006952 | Defense response | 0.68% | 8 of 29/1209 of 26081
6 | GO:0050896 | Response to stimulus | 1.15% | 11 of 29/2619 of 26081
6 | GO:0009607 | Response to biotic stimulus | 1.65% | 8 of 29/1372 of 26081
6 | GO:0006629 | Lipid metabolism | 2.20% | 6 of 29/722 of 26081
7 | GO:0006559 | L-phenylalanine catabolism | 0.83% | 2 of 31/9 of 26081
7 | GO:0019752 | Carboxylic acid metabolism | 1.00% | 6 of 31/590 of 26081
7 | GO:0006082 | Organic acid metabolism | 1.02% | 6 of 31/592 of 26081
7 | GO:0006558 | L-phenylalanine metabolism | 1.26% | 2 of 31/11 of 26081
7 | GO:0009074 | Aromatic amino acid family catabolism | 1.26% | 2 of 31/11 of 26081
7 | GO:0006519 | Amino acid and derivative metabolism | 1.67% | 5 of 31/398 of 26081
7 | GO:0019439 | Aromatic compound catabolism | 1.79% | 2 of 31/13 of 26081
7 | GO:0006629 | Lipid metabolism | 3.04% | 6 of 31/722 of 26081
7 | GO:0009308 | Amine metabolism | 3.09% | 5 of 31/454 of 26081
8 | GO:0001570 | Vasculogenesis | 0.09% | 2 of 21/4 of 26081
8 | GO:0006950 | Response to stress | 0.42% | 7 of 21/1116 of 26081
8 | GO:0050896 | Response to stimulus | 2.33% | 9 of 21/2619 of 26081
9 | GO:0009611 | Response to wounding* | 11.19% | 3 of 13/394 of 26081
10 | GO:0009607 | Response to biotic stimulus* | 6.66% | 6 of 19/1372 of 26081
11 | GO:0050896 | Response to stimulus* | 72.68% | 6 of 17/2619 of 26081
12 | GO:0006955 | Immune response | 0.01% | 8 of 18/1098 of 26081
12 | GO:0006952 | Defense response | 0.01% | 8 of 18/1209 of 26081
12 | GO:0050874 | Organismal physiological process | 0.02% | 10 of 18/2432 of 26081
12 | GO:0009607 | Response to biotic stimulus | 0.03% | 8 of 18/1372 of 26081
12 | GO:0050896 | Response to stimulus | 0.39% | 9 of 18/2619 of 26081
12 | GO:0030333 | Antigen processing | 0.97% | 3 of 18/108 of 26081
12 | GO:0019882 | Antigen presentation | 2.62% | 3 of 18/151 of 26081
12 | GO:0019884 | Antigen presentation, exogenous antigen | 3.97% | 2 of 18/32 of 26081
12 | GO:0019886 | Antigen processing, exogenous antigen via MHC class II | 4.22% | 2 of 18/33 of 26081
13 | GO:0009611 | Response to wounding | 0.08% | 6 of 30/394 of 26081
13 | GO:0009613 | Response to pest, pathogen, or parasite | 0.38% | 6 of 30/522 of 26081
13 | GO:0043207 | Response to external biotic stimulus | 0.55% | 6 of 30/557 of 26081
13 | GO:0006955 | Immune response | 3.12% | 7 of 30/1098 of 26081
13 | GO:0006950 | Response to stress | 3.44% | 7 of 30/1116 of 26081
13 | GO:0050874 | Organismal physiological process | 3.98% | 10 of 30/2432 of 26081
14 | GO:0051244 | Regulation of cellular physiological process | 0.51% | 8 of 45/665 of 26081
14 | GO:0007275 | Development | 0.94% | 13 of 45/2060 of 26081
14 | GO:0001516 | Prostaglandin biosynthesis | 3.30% | 2 of 45/9 of 26081
14 | GO:0046457 | Prostanoid biosynthesis | 3.30% | 2 of 45/9 of 26081
14 | GO:0051242 | Positive regulation of cellular physiological process | 4.35% | 5 of 45/289 of 26081
15 | GO:0008283 | Cell proliferation* | 29.37% | 4 of 26/488 of 26081
16 | GO:0042221 | Response to chemical substance | 0.16% | 5 of 31/237 of 26081
16 | GO:0008152 | Metabolism | 1.29% | 25 of 31/11891 of 26081
16 | GO:0009628 | Response to abiotic stimulus | 1.89% | 5 of 31/400 of 26081
16 | GO:0006445 | Regulation of translation | 2.82% | 3 of 31/87 of 26081
17 | GO:0050817 | Coagulation* | 13.92% | 2 of 12/118 of 26081
18 | GO:0007275 | Development* | 11.67% | 6 of 16/2060 of 26081

# The gene ontology terms in each cluster, detected with 5% significance probability by using GO::TermFinder [18], are listed. When terms with that significance probability were not found in a cluster, the terms with the smallest probability are listed, as indicated by an asterisk. In the last column, "Fraction," the numbers of genes belonging to the corresponding category in the cluster, of genes belonging to the cluster, of genes belonging to the corresponding category among all genes of the GO term data set, and of all genes are listed.

The associated clusters 4 and 7 in group III, which were characterized by GO terms related to amino acid and lipid metabolism, also show downregulation. Indeed, the products of dysregulated (aberrantly regulated) metabolism are widely used to examine liver function in common clinical tests [8]. In addition, the connection between the clusters in groups III and I implies that the downregulation of the clusters in group III may be related to abnormal hepatocyte function. Moreover, cluster 15 in group I, which is characterized by the GO term "proliferation," was associated with different clusters in groups I, II, and IV. It is known that abnormal proliferation is one of the obvious features of cancer [31]. This broad association may be responsible for the cellular level events in hepatocellular carcinogenesis. In summary, the inferred network reveals a coarse snapshot of the gene systems related to the molecular pathogenesis and clinical characteristics of hepatocellular carcinogenesis. Although the resolution of the network is still low, owing to its cluster-level representation, the present network may provide some clues for further investigations of the pathogenic relationships involved in hepatocellular carcinoma.

3.3.3.
Interpretations of the inferred network in terms of gene-gene interactions

In addition to the macroscopic interpretations above, the gene functionality suggested by the gene-gene interactions listed in Figure 2 is also discussed in the context of hepatocellular carcinoma. Although the consideration of gene-gene interactions is beyond the aim of the present study, some examples may provide possible clues about the disease mechanisms.

Figure 2: Network between clusters, together with a projection of biological knowledge about the gene interactions. The clusters are indicated by triangles and circles, in which the cluster numbers correspond to those in Figure 1, and the edges between the clusters are associations with 5% significance probability. The red triangles, the green upside-down triangles, and the circles indicate the clusters of upregulated genes, of downregulated genes, and of mixtures of them, respectively, and the dotted triangles indicate the clusters that were not characterized by GO terms with less than 5% significance probability. The known gene interactions in Pathway Assist are indicated between the clusters, in which the genes highlighted by bold letters are characterized by the GO terms in Table 1.
First, we surveyed the frequencies of GO terms (geneGOB, listed in the supplemental data at http://www.cbrc.jp/∼horimoto/suppl/HCGO.pdf) among the selected genes in the present analysis, to investigate the features of the gene-gene interactions in the inferred network. A few general terms appeared frequently, such as "response" (122 times in the geneGOB column of the supplemental data) and "metabolism" (183), as expected from the coarse associations between the clusters in the preceding section. As for more specific terms about gene function, "lipid" (46), "apoptosis" (31), and "cell growth" (27) are notable in the list. The "lipid" term is expected from the relationship between groups I and III, and the "apoptosis" and "cell growth" terms are also expected from the frequent appearance of GO terms (clusterGOB, listed in Table 1) related to morphological events. Since the frequent appearance of "lipid" may simply reflect the sensitivity of the expression profiles to protein-protein interactions in lipid metabolic pathways, here we focus on the gene-gene interactions characterized by "apoptosis" and "cell growth." Among the gene-gene interactions listed in Figure 2, those characterized by cell growth or death are found in the coarse associations between the clusters. Group I contains gene-gene interactions related to apoptosis. The expression of HTATIP2 (HIV-1 Tat interactive protein 2, 30 kDa) in cluster 14 induces the expression of a number of genes, including NME2 (nonmetastatic cells 2, protein) in cluster 15, as well as the apoptosis-related genes Bad and Siva [32]. MAGED1 (melanoma antigen, family D, 1) in cluster 13 and its binding partner BIRC4 (baculoviral IAP repeat-containing 4) in cluster 14 are known to play roles in apoptosis [33].
In addition, the expression of COL1A2 (collagen, type I, alpha 2) in cluster 12, which is related to cell adhesion and skeletal development, is regulated by RFX5 (regulatory factor X, 5) in cluster 14 [29, 34]. In group IV, the expression of CSF2 (colony-stimulating factor 2) in cluster 8 depends on the cooperation between NFAT (nuclear factor of activated T cells) and JUN (Jun oncogene) in cluster 10 [35]. Between groups I and II, ASCL1 (achaete-scute complex-like 1) in cluster 13 and BMP4 (bone morphogenetic protein 4) in cluster 18 share the function of cell differentiation [36]. As a result, the gene-gene interactions listed above are related to the mechanisms of cell growth or death at the molecular level. On the other hand, the cluster associations reveal the relationship between the cancer-induced events and various aspects of metabolism in terms of pathogenesis and clinical characteristics. Thus, the metabolic pathways might directly influence the mechanisms of cancer-induced cell growth or death at the molecular level in unknown ways.

Figure 3: Orchestration of gene systems. (a) The associations with 1% significance probability are indicated by bold lines, and the clusters with 1% significance association are naturally divided into four groups (I-IV), which are enclosed by broken lines. (b) The connections between the groups are drawn schematically, as a coarse graining of the cluster associations.

3.4. Merits and pitfalls of the present approach

The present analysis reveals a framework of gene system associations in hepatocellular carcinogenesis. The inferred network provides a bridge between the events at the molecular level and those at macroscopic levels: the associations between clusters characterized by cancer-related responses and those characterized by metabolic and morphological events can be interpreted from pathological and clinical views. In addition, the gene-gene interactions in the inferred network indicate the relationship between cancer and cell growth/death. Thus, the gene systems network may also serve as a bridge between the gene-gene interactions and observations at macroscopic levels, such as clinical tests.

The present method assumes linearity in the cluster associations, using partial correlation coefficients to identify independence between clusters. It is well known that the interactions among genes and other molecular components are often nonlinear, so the linearity assumption misses many important relationships among genes. In the present study, however, our aim was not the inference of detailed gene-gene interactions, but of coarse gene system interactions. Indeed, partial correlation coefficients have been employed as a feasible first approximation for inferring gene associations in several studies [37, 38]. Thus, the linearity assumption is not suitable for a fine analysis of dynamic gene behaviors, but may be useful for an approximate analysis of static gene associations.

ACKNOWLEDGMENTS

S. Aburatani was supported by a Grant-in-Aid for Scientific Research (Grant 18681031) from the Ministry of Education, Culture, Sports, Science, and Technology of Japan, and K. Horimoto was partly supported by a Grant-in-Aid for Scientific Research on Priority Areas “Systems Genomics” (Grant 18016008) and by a Grant-in-Aid for Scientific Research (Grant 19201039) from the Ministry of Education, Culture, Sports, Science, and Technology of Japan. This study was supported in part by the New Energy and Industrial Technology Development Organization (NEDO) of Japan and by the Ministry of Health, Labour, and Welfare of Japan.

REFERENCES

[1] M. J. Alter, H. S. Margolis, K.
Krawczynski, et al., “The natural history of community-acquired hepatitis C in the United States. The sentinel counties chronic non-A, non-B hepatitis study team,” The New England Journal of Medicine, vol. 327, no. 27, pp. 1899–1905, 1992. [2] A. M. Di Bisceglie, “Hepatitis C,” The Lancet, vol. 351, no. 9099, pp. 351–355, 1998. [3] S. Zeuzem, S. V. Feinman, J. Rasenack, et al., “Peginterferon alfa-2a in patients with chronic hepatitis C,” The New England Journal of Medicine, vol. 343, no. 23, pp. 1666–1672, 2000. [4] S. S. Thorgeirsson, J.-S. Lee, and J. W. Grisham, “Molecular prognostication of liver cancer: end of the beginning,” Journal of Hepatology, vol. 44, no. 4, pp. 798–805, 2006. [5] N. Iizuka, M. Oka, H. Yamada-Okabe, et al., “Oligonucleotide microarray for prediction of early intrahepatic recurrence of hepatocellular carcinoma after curative resection,” The Lancet, vol. 361, no. 9361, pp. 923–929, 2003. [6] H. Okabe, S. Satoh, T. Kato, et al., “Genome-wide analysis of gene expression in human hepatocellular carcinomas using cDNA microarray: identification of genes involved in viral carcinogenesis and tumor progression,” Cancer Research, vol. 61, no. 5, pp. 2129–2137, 2001. [7] L.-H. Zhang and J.-F. Ji, “Molecular profiling of hepatocellular carcinomas by cDNA microarray,” World Journal of Gastroenterology, vol. 11, no. 4, pp. 463–468, 2005. [8] J. Jiang, P. Nilsson-Ehle, and N. Xu, “Influence of liver cancer on lipid and lipoprotein metabolism,” Lipids in Health and Disease, vol. 5, p. 4, 2006. [9] A. Zerbini, M. Pilli, C. Ferrari, and G. Missale, “Is there a role for immunotherapy in hepatocellular carcinoma?” Digestive and Liver Disease, vol. 38, no. 4, pp. 221–225, 2006. [10] K. Horimoto and H. Toh, “Statistical estimation of cluster boundaries in gene expression profile data,” Bioinformatics, vol. 17, no. 12, pp. 1143–1151, 2001. [11] H. Toh and K. 
Horimoto, “Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling,” Bioinformatics, vol. 18, no. 2, pp. 287–297, 2002. Sachiyo Aburatani et al. [12] S. Lauritzen, Graphical Models, Oxford University Press, Oxford, UK, 1996. [13] J. Whittaker, Graphical Models in Applied Multivariate Statistics, John Wiley & Sons, New York, NY, USA, 1990. [14] H. Toh and K. Horimoto, “System for automatically inferring a genetic network from expression profiles,” Journal of Biological Physics, vol. 28, no. 3, pp. 449–464, 2002. [15] D. K. Slonim, “From patterns to pathways: gene expression data analysis comes of age,” Nature Genetics, vol. 32, no. 5, pp. 502–508, 2002. [16] S. Aburatani, S. Kuhara, H. Toh, and K. Horimoto, “Deduction of a gene regulatory relationship framework from gene expression data by the application of graphical Gaussian modeling,” Signal Processing, vol. 83, no. 4, pp. 777–788, 2003. [17] M. Ashburner, C. A. Ball, J. A. Blake, et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000. [18] E. I. Boyle, S. Weng, J. Gollub, et al., “GO::TermFinder—open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes,” Bioinformatics, vol. 20, no. 18, pp. 3710– 3715, 2004. [19] A. Nikitin, S. Egorov, N. Daraselia, and I. Mazo, “Pathway studio—the analysis and navigation of molecular networks,” Bioinformatics, vol. 19, no. 16, pp. 2155–2157, 2003. [20] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York, NY, USA, 1990. [21] R. J. Freund and W. J. Wilson, Regression Analysis: Statistical Modeling of a Response Variable, Academic Press, San Diego, Calif, USA, 1998. [22] A. P. Dempster, “Covariance selection,” Biometrics, vol. 28, no. 1, pp. 157–175, 1972. [23] N. Wermuth and E. 
Scheidt, “Algorithm AS 105: fitting a covariance selection model to a matrix,” Applied Statistics, vol. 26, no. 1, pp. 88–92, 1977. [24] L. F. Wu, T. R. Hughes, A. P. Davierwala, M. D. Robinson, R. Stoughton, and S. J. Altschuler, “Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters,” Nature Genetics, vol. 31, no. 3, pp. 255– 265, 2002. [25] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, John Wiley & Sons, New York, NY, USA, 2nd edition, 1984. [26] S. Aburatani, K. Goto, S. Saito, et al., “ASIAN: a website for network inference,” Bioinformatics, vol. 20, no. 16, pp. 2853– 2856, 2004. [27] S. Aburatani, K. Goto, S. Saito, H. Toh, and K. Horimoto, “ASIAN: a web server for inferring a regulatory network framework from gene expression profiles,” Nucleic Acids Research, vol. 33, pp. W659–W664, 2005. [28] M. Honda, S. Kaneko, H. Kawai, Y. Shirota, and K. Kobayashi, “Differential gene expression between chronic hepatitis B and C hepatic lesion,” Gastroenterology, vol. 120, no. 4, pp. 955– 966, 2001. [29] T. Wu, “Cyclooxygenase-2 in hepatocellular carcinoma,” Cancer Treatment Reviews, vol. 32, no. 1, pp. 28–44, 2006. [30] H. Xiao, V. Palhan, Y. Yang, and R. G. Roeder, “TIP30 has an intrinsic kinase activity required for up-regulation of a subset of apoptotic genes,” The EMBO Journal, vol. 19, no. 5, pp. 956– 963, 2000. [31] W. B. Coleman, “Mechanisms of human hepatocarcinogenesis,” Current Molecular Medicine, vol. 3, no. 6, pp. 573–588, 2003. 11 [32] Y. Xu, P. K. Sengupta, E. Seto, and B. D. Smith, “Regulatory factor for X-box family proteins differentially interact with histone deacetylases to repress collagen α2(I) gene (COL1A2) expression,” Journal of Biological Chemistry, vol. 281, no. 14, pp. 9260–9270, 2006. [33] P. A. Barker and A. Salehi, “The MAGE proteins: emerging roles in cell cycle progression, apoptosis, and neurogenetic disease,” Journal of Neuroscience Research, vol. 67, no. 
6, pp. 705–712, 2002.
[34] Y. Xu, L. Wang, G. Buttice, P. K. Sengupta, and B. D. Smith, “Interferon γ repression of collagen (COL1A2) transcription is mediated by the RFX5 complex,” The Journal of Biological Chemistry, vol. 278, no. 49, pp. 49134–49144, 2003.
[35] F. Macian, C. Garcia-Rodriguez, and A. Rao, “Gene expression elicited by NFAT in the presence or absence of cooperative recruitment of Fos and Jun,” The EMBO Journal, vol. 19, no. 17, pp. 4783–4795, 2000.
[36] J. Fu, S. S. W. Tay, E. A. Ling, and S. T. Dheen, “High glucose alters the expression of genes involved in proliferation and cell-fate specification of embryonic neural stem cells,” Diabetologia, vol. 49, no. 5, pp. 1027–1038, 2006.
[37] J. Schäfer and K. Strimmer, “An empirical Bayes approach to inferring large-scale gene association networks,” Bioinformatics, vol. 21, no. 6, pp. 754–764, 2005.
[38] A. de la Fuente, N. Bing, I. Hoeschele, and P. Mendes, “Discovery of meaningful associations in genomic data using partial correlation coefficients,” Bioinformatics, vol. 20, no. 18, pp. 3565–3574, 2004.

Hindawi Publishing Corporation, EURASIP Journal on Bioinformatics and Systems Biology, Volume 2007, Article ID 71312, 14 pages, doi:10.1155/2007/71312

Research Article

Uncovering Gene Regulatory Networks from Time-Series Microarray Data with Variational Bayesian Structural Expectation Maximization

Isabel Tienda Luna,1 Yufei Huang,2 Yufang Yin,2 Diego P. Ruiz Padillo,1 and M. Carmen Carrion Perez1
1 Department of Applied Physics, University of Granada, 18071 Granada, Spain
2 Department of Electrical and Computer Engineering, University of Texas at San Antonio (UTSA), San Antonio, TX 78249-0669, USA

Received 1 July 2006; Revised 4 December 2006; Accepted 11 May 2007

Recommended by Ahmed H. Tewfik

We investigate in this paper reverse engineering of gene regulatory networks from time-series microarray data. We apply dynamic Bayesian networks (DBNs) for modeling cell cycle regulations.
In developing a network inference algorithm, we focus on soft solutions that can provide the a posteriori probability (APP) of the network topology. In particular, we propose a variational Bayesian structural expectation maximization (VBSEM) algorithm that can learn the posterior distribution of the network model parameters and topology jointly. We also show how the obtained APPs of the network topology can be used in a Bayesian data integration strategy to integrate two different microarray data sets. The proposed VBSEM algorithm has been tested on yeast cell cycle data sets. To evaluate the confidence of the inferred networks, we apply a moving block bootstrap method. The inferred network is validated by comparing it to the KEGG pathway map.

Copyright © 2007 Isabel Tienda Luna et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

With the completion of the human genome project and the successful sequencing of the genomes of many other organisms, the emphasis of postgenomic research has shifted to understanding the functions of genes [1]. We investigate in this paper reverse engineering of gene regulatory networks (GRNs) based on time-series microarray data. GRNs are the functioning circuitry of living organisms at the gene level. They display the regulatory relationships among genes in a cellular system. These regulatory relationships are involved directly and indirectly in controlling the production of proteins and in mediating metabolic processes. Understanding GRNs can provide new ideas for treating complex diseases and breakthroughs for designing new drugs. GRNs cannot be measured directly but can be inferred from their inputs and outputs. This process of recovering GRNs from their inputs and outputs is referred to as reverse engineering GRNs [2].
The inputs of GRNs are sequences of signals, and the outputs are gene expressions at either the mRNA level or the protein level. One popular technology that measures the expression of a large number of genes at the mRNA level is the microarray. It is not surprising that microarray data have been a popular source for uncovering GRNs [3, 4]. Of particular interest to this paper are time-series microarray data, which are generated from a cell cycle process. Using the time-series microarray data, we aim to uncover the underlying GRNs that govern the process of cell cycles. Mathematically, reverse engineering GRNs is a classical inverse problem, whose solution requires proper modeling and learning from data. Despite the many existing methods for solving inverse problems, solutions to the GRN problem are, however, not trivial. Special attention must be paid to the enormously large scale of the unknowns and the difficulty posed by the small sample size, not to mention the inherent experimental defects, noisy readings, and so forth. These call for powerful mathematical modeling together with reliable inference. At the same time, approaches for integrating different types of relevant data are desirable. In the literature, many different models have been proposed for both static and cell cycle networks, including probabilistic Boolean networks [5, 6], (dynamic) Bayesian networks [7–9], differential equations [10], and others [11, 12]. Unlike in the case of static experiments, extra effort is needed to model the temporal dependency between samples in time-series experiments. Such time-series models can in turn complicate the inference, thus making the task of reverse engineering even tougher than it already is. In this paper, we apply dynamic Bayesian networks (DBNs) to model time-series microarray data. DBNs have been applied to reverse engineering GRNs in the past [13–18].
Differences among the existing work lie in the specific models used for gene regulation and in the detailed inference objectives and algorithms. The existing models include discrete binomial models [14, 17], linear Gaussian models [16, 17], and spline functions with Gaussian noise [18]. We choose to use the linear Gaussian regulatory model in this paper. Linear Gaussian models describe the continuous gene expression levels directly, thus avoiding the information loss incurred by discrete models. Even though linear Gaussian models may be less realistic, network inference over linear Gaussian models is considerably easier than over nonlinear and/or non-Gaussian models, therefore leading to more robust results. It has been shown in [19] that, when both computational complexity and inference accuracy are taken into consideration, linear Gaussian models are favored over nonlinear regulatory models. In addition, this model actually captures the joint effect of gene regulation and the microarray experiment, and its validity is better evaluated from the data directly. In this paper, we provide a statistical test of the validity of the linear Gaussian model. To learn the proposed DBNs from time-series data, we aim at soft Bayesian solutions, that is, solutions that provide the a posteriori probabilities (APPs) of the network topology. This requirement separates the proposed solutions from most of the existing approaches, such as greedy search and simulated-annealing-based algorithms, all of which produce only point estimates of the networks and are considered “hard” solutions. The advantage of soft solutions has been demonstrated in digital communications [20]. In the context of GRNs, the APPs from soft solutions provide a valuable measure of confidence in the inference, which is difficult to obtain with hard solutions. Moreover, the obtained APPs can be used for Bayesian data integration, which will be demonstrated in the paper.
Soft solutions including Markov chain Monte Carlo (MCMC) sampling [21, 22] and variational Bayesian expectation maximization (VBEM) [16] have been proposed for learning GRNs. However, MCMC sampling is only feasible for small networks due to its high complexity. In contrast, VBEM has been shown to be much more efficient. However, the VBEM algorithm in [16] was developed only for parameter learning; it therefore cannot provide the desired APPs of topology. In this paper, we propose a new variational Bayesian structural EM (VBSEM) algorithm that can learn both the parameters and the topology of a network. The algorithm maintains the low complexity characteristic of VBEM, and thus it is appropriate for learning large networks. In addition, it estimates the APPs of topology directly and is suitable for Bayesian data integration. To this end, we discuss a simple Bayesian strategy for integrating two microarray data sets by using the APPs obtained from VBSEM. We apply the VBSEM algorithm to uncover the yeast cell cycle networks. To obtain the statistics of the VBSEM inference results and to overcome the difficulty of the small sample size, we apply a moving block bootstrap method. Unlike the conventional bootstrap strategy, this method is specifically designed for time-series data. In particular, we propose a practical strategy for determining the block length. Also, to serve our objective of obtaining soft solutions, we apply the bootstrap samples to estimating the desired APPs. Instead of making a decision on the network from each bootstrapped data set, we make a decision based on the bootstrapped APPs. This practice alleviates the problem of small sample size, making the solution more robust. The rest of the paper is organized as follows. In Section 2, DBN modeling of the time-series data is discussed. The detailed linear Gaussian model for gene regulation is also provided.
In Section 3, the objectives of learning the networks are discussed and the VBSEM algorithm is developed. In Section 4, a Bayesian integration strategy is illustrated. In Section 5, the test results of the proposed VBSEM on simulated networks and on yeast cell cycle data are provided. A bootstrap method for estimating the APPs is also discussed. The paper concludes in Section 6.

2. MODELING WITH DYNAMIC BAYESIAN NETWORKS

Like all graphical models, a DBN is a marriage of graph theory and probability theory. In particular, DBNs are a class of directed acyclic graphs (DAGs) that model the probability distributions of stochastic dynamic processes. DBNs enable easy factorization of the joint distributions of dynamic processes into products of simpler conditional distributions according to the inherent Markov properties, and thus greatly facilitate the task of inference. DBNs are a generalization of a wide range of popular models, including hidden Markov models (HMMs) and Kalman filtering (state-space) models. They have been successfully applied in computer vision, speech processing, target tracking, and wireless communications. Refer to [23] for a comprehensive discussion of DBNs. A DBN consists of nodes and directed edges. Each node represents a variable in the problem, while a directed edge indicates a direct association between the two connected nodes. In a DBN, the direction of an edge can carry temporal information. To model gene regulation in the cell cycle using DBNs, we assume a microarray experiment that measures the expression levels of G genes at N + 1 evenly sampled consecutive time instances. We then define a random variable matrix Y ∈ R^{G×(N+1)} whose (i, n)th element y_i(n − 1) denotes the expression level of gene i measured at time n − 1 (see Figure 1). We further assume that gene regulation follows a first-order time-homogeneous Markov process.
As a result, we need only to consider regulatory relationships between two consecutive time instances, and this relationship remains unchanged over the course of the microarray experiment. This assumption may be insufficient, but it will facilitate the modeling and inference. Also, we call the regulating genes the “parent genes,” or “parents” for short.

Figure 1: A dynamic Bayesian network modeling of time-series expression data. (The figure shows the G × (N + 1) microarray matrix of expression levels y_i(n), n = 0, . . . , N, and the corresponding first-order Markov DBN, in which directed edges connect genes at consecutive time instances.)

Based on these definitions and assumptions, the joint probability p(Y) can be factorized as p(Y) = ∏_{1≤n≤N} p(y(n) | y(n − 1)), where y(n) is the vector of expression levels of all genes at time n. In addition, we assume that, given y(n − 1), the expression levels at time n are independent. As a result, p(y(n) | y(n − 1)), for all n, can be further factorized as p(y(n) | y(n − 1)) = ∏_{1≤i≤G} p(y_i(n) | y(n − 1)). These factorizations suggest the structure of the proposed DBN illustrated in Figure 1 for modeling the cell cycle regulations. In this DBN, each node denotes a random variable in Y, and all the nodes are arranged the same way as the corresponding variables in the matrix Y. An edge between two nodes denotes the regulatory relationship between the two associated genes, and the arrow indicates the direction of regulation. For example, we see from Figure 1 that genes 1, 3, and G regulate gene i.
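The factorization above decomposes the joint log-probability into a sum of per-gene, per-time conditional terms, which is what makes inference tractable gene by gene. A minimal sketch of that decomposition (the unit-variance "self-parent" conditional `toy_cond` is a hypothetical placeholder, not the paper's regulatory model, which is introduced below):

```python
import numpy as np

def dbn_log_joint(Y, cond_logpdf):
    """Factorized DBN log-joint:
    log p(Y) = sum_{n=1..N} sum_{i=1..G} log p(y_i(n) | y(n-1)).
    Y is a (G, N+1) expression matrix; cond_logpdf(i, y, y_prev)
    returns log p(y_i(n) = y | y(n-1) = y_prev) for gene i."""
    G, T = Y.shape
    return sum(cond_logpdf(i, Y[i, n], Y[:, n - 1])
               for n in range(1, T) for i in range(G))

# Placeholder conditional: gene i follows its own previous value with unit variance.
def toy_cond(i, y, y_prev):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - y_prev[i]) ** 2

rng = np.random.default_rng(0)
Y = rng.standard_normal((3, 5))       # G = 3 genes, N + 1 = 5 time points
ll = dbn_log_joint(Y, toy_cond)
```

Any conditional model can be plugged in through `cond_logpdf`; the linear Gaussian conditional of (1)-(2) below is one such choice.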
Even though, like all Bayesian networks, DBNs do not allow cycles in the graph, they are nevertheless capable of modeling circular regulatory relationships, an important property not possessed by regular Bayesian networks. As an example, a circular regulation can be seen in Figure 1 between genes 1 and 2, even though no directed cycle appears in the graph. To complete the modeling with DBNs, we need to define the conditional distribution of each child node over the graph. Then the desired joint distribution can be represented as a product of these conditional distributions. To define the conditional distributions, we let pa_i(n) denote a column vector of the expression levels, measured at time n, of all the parent genes that regulate gene i. For the example in Figure 1, pa_i(n)^T = [y_1(n), y_3(n), y_G(n)]. The conditional distribution of each child node over the DBN can then be expressed as p(y_i(n) | pa_i(n − 1)), for all i. To determine the form of these distributions, we assume a linear regulatory relationship, that is, the expression level of gene i is the result of a linear combination of the expression levels of its regulating genes at the previous sample time. As a further simplification, we assume the regulation is a time-homogeneous process. Mathematically, we have the following expression:

y_i(n) = w_i^T pa_i(n − 1) + e_i(n),  n = 1, 2, . . . , N,  (1)

where w_i is a weight vector independent of time n and e_i(n) is assumed to be white Gaussian noise with variance σ_i^2. We provide in Section 5 a statistical test of the validity of the white Gaussian noise assumption. The weight vector is indicative of the degree and the type of regulation [16]. A gene is up-regulated if the weight is positive and down-regulated otherwise. The magnitude (absolute value) of the weight indicates the degree of regulation. The noise variable is introduced to account for modeling and experimental errors.
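Equation (1) is straightforward to simulate. The sketch below generates a small synthetic time series under an assumed parent set, weight vector, and noise level (all values hypothetical), illustrating how w_i and σ_i^2 enter the model:

```python
import numpy as np

rng = np.random.default_rng(1)
G, N = 5, 30                        # genes and time transitions (hypothetical sizes)
gene_i = 1                          # the regulated gene in this illustration
parents_of_i = [0, 2, 4]            # assumed parent set pa_i
w_i = np.array([0.8, -0.5, 0.3])    # positive weight: up-regulation; negative: down
sigma_i = 0.1                       # noise standard deviation

Y = rng.standard_normal((G, N + 1))     # background expression (random placeholder)
for n in range(1, N + 1):
    pa = Y[parents_of_i, n - 1]                                  # pa_i(n - 1)
    Y[gene_i, n] = w_i @ pa + sigma_i * rng.standard_normal()    # eq. (1)
```

With such data, the weights can be recovered by least squares on the parent columns, which is closely related to what parameter learning does under the linear Gaussian model.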
From (1), we obtain that the conditional distribution is Gaussian, that is,

p(y_i(n) | pa_i(n − 1)) = N(w_i^T pa_i(n − 1), σ_i^2).  (2)

In (1), the weight vector w_i and the noise variance σ_i^2 are the unknown parameters to be determined.

2.1. Objectives

Based on the above dynamic Bayesian network formulation, our work has two objectives. First, given a set of time-series data from a single experiment, we aim at uncovering the underlying gene regulatory network. This is equivalent to learning the structure of the DBN. Specifically, if we can determine that genes 2 and 3 are the parents of gene 1 in the DBN, there will be directed links going from genes 2 and 3 to gene 1 in the uncovered GRN. Second, we are also concerned with integrating two data sets of the same network from different experiments. By integrating the two data sets, we expect to improve the confidence of the inferred networks obtained from a single experiment. To achieve these two objectives, we propose in the following an efficient variational Bayesian structural EM algorithm to learn the network and a Bayesian approach for data integration.

3. LEARNING THE DBN WITH VBSEM

Given a set of microarray measurements of the expression levels in cell cycles, the task of learning the above DBN consists of two parts: structure learning and parameter learning. The objective of structure learning is to determine the topology of the network, or the parents of each gene. This is essentially a problem of model or variable selection. Under a given structure, parameter learning involves the estimation of the unknown model coefficients of each gene: the weight vector w_i and the noise variance σ_i^2, for all i. Since the network is fully observed and, given the parent genes, the gene expression levels at any given time are independent, we can learn the parents and the associated model parameters of each gene separately.
Thus, we discuss in the following only the learning process for gene i.

3.1. A Bayesian criterion for network structural learning

Let S_i = {S_i^(1), S_i^(2), . . . , S_i^(K)} denote a set of K possible network topologies for gene i, where each element represents a topology derived from a possible combination of the parents of gene i. The problem of structure learning is to select the topology from S_i that is best supported by the microarray data. For a particular topology S_i^(k), we use w_i^(k), pa_i^(k), e_i^(k), and σ_ik^2 to denote the associated model variables. We can then express (1) for S_i^(k) in the more compact matrix-vector form

y_i = Pa_i^(k) w_i^(k) + e_i^(k),  (3)

where y_i = [y_i(1), . . . , y_i(N)]^T, Pa_i^(k) = [pa_i^(k)(0), pa_i^(k)(1), . . . , pa_i^(k)(N − 1)]^T, e_i^(k) = [e_i^(k)(1), e_i^(k)(2), . . . , e_i^(k)(N)]^T, and w_i^(k) is independent of time n.

The structural learning can be performed under the Bayesian paradigm. In particular, we are interested in calculating the a posteriori probabilities of the network topology, p(S_i^(k) | Y), for all k. The APPs will be important for the data integration tasks. They also provide a measure of confidence in the inferred networks. Once we obtain the APPs, we can select the most probable topology Ŝ_i according to the maximum a posteriori (MAP) criterion [24], that is,

Ŝ_i = arg max_{S_i^(k) ∈ S_i} p(S_i^(k) | Y).  (4)

The APPs are calculated according to the Bayes theorem,

p(S_i^(k) | Y) = p(y_i | S_i^(k), Y_{−i}) p(S_i^(k) | Y_{−i}) / p(y_i | Y_{−i}) = p(y_i | Pa_i^(k)) p(S_i^(k)) / p(y_i | Y_{−i}),  (5)

where Y_{−i} represents the matrix obtained by removing y_i from Y. The second equality follows from the facts that, given S_i^(k), y_i depends on Y_{−i} only through Pa_i^(k); that, given Pa_i^(k), S_i^(k) is known automatically; and that S_i^(k) cannot be determined from Y_{−i} alone. Note also that there is a slight abuse of notation in (4): Y in p(S_i^(k) | Y) denotes a realization of expression levels measured from a microarray experiment.

To calculate the APPs according to (5), the marginal likelihood p(y_i | Pa_i^(k)) and the normalizing constant p(y_i | Y_{−i}) need to be determined. It has been shown that, with conjugate priors on the parameters, p(y_i | Pa_i^(k)) can be obtained analytically [21]. However, computing p(y_i | Y_{−i}) becomes computationally prohibitive for large networks because it involves a summation over 2^G terms. This difficulty makes the exact calculation of the APPs infeasible, and numerical approximation must therefore be employed to estimate them. Monte Carlo sampling-based algorithms have been reported in the literature for this approximation [21]. They are, however, computationally very expensive and do not scale well with the size of the network. In what follows, we propose a much more efficient solution based on variational Bayesian EM.

3.2. Variational Bayesian structural expectation maximization

To develop the VBSEM algorithm, we define a G-dimensional binary vector b_i ∈ {0, 1}^G, where b_i(j) = 1 if gene j is a parent of gene i in the topology S_i and b_i(j) = 0 otherwise. We can consider b_i an equivalent representation of S_i, and finding the structure S_i thus equates to determining the values of b_i. Consequently, we can replace S_i in all the above expressions by b_i and turn our attention to estimating the equivalent APPs p(b_i | Y).

The basic idea behind VBSEM is to approximate the intractable APPs of topology with a tractable distribution q(b_i). To do so, we start with a lower bound on the normalizing constant p(y_i | Y_{−i}) based on Jensen's inequality:

ln p(y_i | Y_{−i}) = ln Σ_{b_i} ∫ dθ_i p(y_i | b_i, θ_i) p(b_i) p(θ_i)  (6)

≥ Σ_{b_i} ∫ dθ_i q(θ_i) q(b_i) [ ln (p(b_i, y_i | θ_i) / q(b_i)) + ln (p(θ_i) / q(θ_i)) ],  (7)

where θ_i = {w_i, σ_i^2} and q(θ_i) is a distribution introduced to approximate the (also intractable) marginal posterior distribution of the parameters, p(θ_i | Y).
The lower bound in (7) can serve as a cost function for determining the approximate distributions q(b_i) and q(θ_i); that is, we choose q(b_i) and q(θ_i) such that the lower bound in (7) is maximized. The solution can be obtained by variational derivatives and a coordinate-ascent iterative procedure, and is shown to consist of the following two steps in each iteration:

VBE step: q^(t+1)(b_i) = (1/Z_{b_i}) exp( ∫ dθ_i q^(t)(θ_i) ln p(b_i, y_i | θ_i) ),  (8)

VBM step: q^(t+1)(θ_i) = (1/Z_{θ_i}) p(θ_i) exp( Σ_{b_i} q^(t+1)(b_i) ln p(b_i, y_i | θ_i) ),  (9)

where t and t + 1 are iteration numbers, and Z_{b_i} and Z_{θ_i} are the normalizing constants to be determined. The above procedure is commonly referred to as the variational Bayesian expectation maximization (VBEM) algorithm [25]. The VBEM can be considered a probabilistic version of the popular EM algorithm, in the sense that it learns a distribution instead of finding a point solution as in EM. Apparently, to carry out this iterative approximation, analytical expressions must exist in both the VBE and VBM steps. However, it is difficult to come up with an analytical expression, at least in the VBM step, since the summation over b_i involves 2^G terms. To overcome this problem, we enforce the approximation q(b_i) to be a multivariate Gaussian distribution. The Gaussian assumption on the discrete variable b_i facilitates the computation in the VBEM algorithm, circumventing the 2^G summations. Although p(b_i | Y) is a high-dimensional discrete distribution, the defined Gaussian approximation guarantees that the approximations fall in the exponential family, and as a result the subsequent computations in the VBEM iterations can be carried out exactly [25]. Specifically, by choosing conjugate priors for both θ_i and b_i as described in Appendix A, we can show that the calculations in both the VBE and VBM steps can be performed analytically. The detailed derivations are included in Appendix B.

Unlike the common VBEM algorithm, which learns only the distributions of the parameters, the proposed algorithm learns the distributions of both the structure and the parameters. We therefore call it the VB structural EM (VBSEM) algorithm. The VBSEM algorithm for learning the DBNs under study is summarized in Algorithm 1.

Algorithm 1: The summary of the VBSEM algorithm.
(1) Initialization: initialize the mean and the covariance matrices of the approximate distributions as described in Appendix A.
(2) VBE step (structural learning): calculate the approximate posterior distribution of topology, q(b_i), using (B.1).
(3) VBM step (parameter learning): calculate the approximate parameter posterior distribution q(θ_i) using (B.5).
(4) Compute F: compute the lower bound as described in Appendix A. If F increases, go to (2); otherwise, terminate the algorithm.

When the algorithm converges, we obtain q(b_i), a multivariate Gaussian distribution, and q(θ_i). Based on q(b_i), we then need to produce a discrete distribution as a final estimate of p(b_i). Direct discretization in the variable space is computationally difficult. Instead, we propose to work with the marginal APPs from model averaging. To this end, we first obtain q(b_i(l)), for all l, from q(b_i), and then approximate the marginal APPs p(b_i(l) | Y), for all l, by

p(b_i(l) = 1 | Y) = q(b_i(l) = 1) / [ q(b_i(l) = 1) + q(b_i(l) = 0) ].  (10)

Instead of the MAP criterion, decisions on b_i can then be made in a bitwise fashion based on the marginal APPs. Specifically, we have

b̂_i(l) = 1 if p(b_i(l) | Y) ≥ ρ, and b̂_i(l) = 0 otherwise,  (11)

where ρ is a threshold. When b̂_i(l) = 1, it implies that gene l is a regulator of gene i. Meanwhile, the parameters can be learned from q(θ_i) easily based on the minimum mean-squared-error (MMSE) criterion, and they are

ŵ_i = m_{w_i},  σ̂_i^2 = β/(α − 2),  (12)

where m_{w_i}, β, and α are defined in Appendix B according to (B.5).
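For a toy network, the marginal APPs in (10) and the bitwise decisions in (11) can be computed exactly by brute-force enumeration of all 2^G parent vectors b_i — precisely the computation whose exponential cost motivates VBSEM for large G. A self-contained sketch under simplifying assumptions (known noise variance, a zero-mean Gaussian prior on the weights, and a uniform prior over topologies, rather than the conjugate-prior treatment of Appendices A and B):

```python
import numpy as np
from itertools import product

def topology_apps(Y, i, sigma2=0.1, tau2=1.0, rho=0.5):
    """Exact APPs p(b_i | Y) for gene i by enumerating all 2^G parent vectors.
    Marginal likelihood: y_i ~ N(0, tau2 * Pa Pa^T + sigma2 * I) after the
    Gaussian weight prior N(0, tau2 * I) is integrated out (assumed priors)."""
    G, T = Y.shape
    y = Y[i, 1:]                         # y_i(1), ..., y_i(N)
    X = Y[:, :-1].T                      # row n-1 holds y(n-1)
    logps = {}
    for b in product([0, 1], repeat=G):
        Pa = X[:, np.array(b, bool)]     # columns of the selected parents
        C = tau2 * Pa @ Pa.T + sigma2 * np.eye(len(y))
        _, logdet = np.linalg.slogdet(C)
        logps[b] = -0.5 * (logdet + y @ np.linalg.solve(C, y))
    m = max(logps.values())
    Z = sum(np.exp(v - m) for v in logps.values())
    post = {b: np.exp(v - m) / Z for b, v in logps.items()}   # uniform prior on b
    # marginal APPs p(b_i(l) = 1 | Y), cf. (10), then bitwise threshold, cf. (11)
    margs = np.array([sum(p for b, p in post.items() if b[l]) for l in range(G)])
    return margs, (margs >= rho).astype(int)

# hypothetical example: gene 2 is driven by gene 0
rng = np.random.default_rng(2)
G, N = 3, 25
Y = rng.standard_normal((G, N + 1))
for n in range(1, N + 1):
    Y[2, n] = 1.0 * Y[0, n - 1] + 0.1 * rng.standard_normal()
margs, decisions = topology_apps(Y, i=2)
```

The enumeration loop is the 2^G summation that the Gaussian approximation of q(b_i) is designed to avoid.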
4. BAYESIAN INTEGRATION OF TWO DATA SETS

A major task in gene network research is to integrate all available data sets about the same network from different sources so as to improve the confidence of inference. As indicated before, the values of b_i define the parent sets of gene i, and thus the topology of the network. The APPs obtained from the VBSEM algorithm provide us with an avenue to pursue Bayesian data integration. We illustrate here an approach for integrating two microarray data sets Y^1 and Y^2, each produced from an experiment under possibly different conditions. The premise for combining the two data sets is that they are experimental outcomes of the same underlying gene network; that is, the topologies S_i, or b_i, for all i are the same in the respective data models.

Direct combination of the two data sets at the data level requires many preprocessing steps, including scaling, alignment, and so forth, and these steps introduce noise and potential errors into the original data sets. Instead, we propose to perform data integration at the topology level. The objective of topology-level data integration is to obtain the APPs of b_i from the combined data sets, p(b_i | Y^1, Y^2), and then make inference on the gene network structure accordingly. To obtain p(b_i | Y^1, Y^2), we factor it according to the Bayes rule as

    p(b_i | Y^1, Y^2) = p(Y^2 | b_i) p(Y^1 | b_i) p(b_i) / [p(Y^1) p(Y^2)]
                      = p(Y^2 | b_i) p(b_i | Y^1) / p(Y^2),    (13)

where p(Y^2 | b_i) is the marginalized likelihood function of data set 2 and p(b_i | Y^1) is the APP obtained from data set 1. The above equation suggests a simple scheme to integrate the two data sets: we start with one data set, say Y^1, and calculate the APPs p(b_i | Y^1); then, taking p(b_i | Y^1) as the prior distribution, Y^1 is integrated with Y^2 according to (13). In this way, we obtain the desired APPs p(b_i | Y^1, Y^2) from the combined data sets.
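The sequential scheme in (13) can be sketched for a single binary link variable b. This is a hypothetical illustration, not the paper's implementation; the likelihood values are placeholders.

```python
# Sketch of sequential Bayesian integration for one binary link b, in the
# spirit of (13): the posterior p(b = 1 | Y1) from data set 1 serves as the
# prior when processing data set 2. Likelihood values are placeholders.

def integrate_link(post_y1, lik_y2_b1, lik_y2_b0):
    """Return p(b = 1 | Y1, Y2) given p(b = 1 | Y1) and the Y2 likelihoods."""
    num = lik_y2_b1 * post_y1
    return num / (num + lik_y2_b0 * (1.0 - post_y1))

p12 = integrate_link(0.7, lik_y2_b1=0.9, lik_y2_b0=0.4)
print(round(p12, 2))  # 0.84
```

Because data set 2 also favors the link here, the combined APP (0.84) exceeds the single-data-set APP (0.7), mirroring the increase in confidence reported after integration.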
To implement this scheme, the APPs of the topology must be computed, and the proposed VBSEM can be applied for this task. This scheme thus provides a viable and efficient framework for Bayesian data integration.

5. RESULTS

5.1. Test on simulated systems

5.1.1. Study based on precision-recall curves

In this section, we validate the performance of the proposed VBSEM algorithm using synthetic networks whose characteristics are as realistic as possible. This study was carried out through the calculation of precision-recall curves. In this field, it is common to employ ROC analysis to study the performance of a proposed algorithm. However, since genetic networks are sparse, the number of false positives far exceeds the number of true positives, and the specificity is therefore inappropriate: even a small deviation from a value of 1 results in a large number of false positives. We therefore chose precision-recall curves to evaluate the performance. Precision corresponds to the expected success rate in the experimental validation of the predicted interactions and is calculated as T_P/(T_P + F_P), where T_P is the number of true positives and F_P is the number of false positives. Recall, on the other hand, indicates the probability of correctly detecting a true positive and is calculated as T_P/(T_P + F_N), where F_N is the number of false negatives. In a good system, precision decreases as recall increases, and the higher the area under the curve, the better the system.

Table 1: Area under each curve.

    Setting | G = 30 | G = 100 | G = 150 | G = 200
    AUC     | 0.8007 | 0.7253  | 0.6315  | 0.5872

To accomplish our objective, we simulated four networks with 30, 100, 150, and 200 genes, respectively. For each tested network, we collected only 30 time samples per gene, which mimics the realistic small-sample scenario. Regarding the regulation process, each gene had either none, one, two, or three parents.
The number of parents was selected randomly for each gene. The weights associated with each regulation process were also chosen randomly, from an interval containing the typical values estimated when working with real microarray data, and the signs of the weights, which determine the nature of the regulation, were selected randomly as well. Finally, the data values of the network outputs were calculated using the linear Gaussian model proposed in (1). These data values were taken after the system had reached stationarity, and they were in the range of the observations corresponding to real microarray data.

In Figure 2, the precision-recall curves are plotted for the different settings. To construct these curves, we started by setting a threshold ρ for the APPs. This threshold ρ lies between 0 and 1 and was used as in (11): for each possible regulatory relationship between two genes, if its APP is greater than ρ, the link is considered to exist; if the APP is lower than ρ, the link is not considered. We calculated the precision and the recall for each selected threshold between 0 and 1 and plotted the results in blue for G = 30, black for G = 100, red for G = 150, and green for G = 200.

Figure 2: Precision-recall curves (recall versus precision) for the networks with 30, 100, 150, and 200 genes.

As expected, the performance degrades as the number of genes increases. One measure of this degradation is given in Table 1, which lists the area under each curve (AUC). To further quantify the performance of the algorithms, we calculated the F-score, an evaluation measure that combines precision and recall:

    F_α = 1 / [α(1/precision) + (1 − α)(1/recall)],    (14)

where α is a weighting factor; a large α means that recall is more important, whereas a small α means that precision is more important.
In general, α = 0.5 is used, giving equal importance to precision and recall; F_α is then called the harmonic mean. This value equals 1 when both precision and recall are 100%, and it approaches 0 when either of them is close to 0. Figure 3 depicts the value of the harmonic mean as a function of the APP threshold ρ for the VBSEM algorithm. As can be seen, the performance of the algorithm for G = 30 is better than for any other setting. However, there is almost no performance degradation between the curve for G = 30 and the one for G = 100 in the APP threshold interval from 0.5 to 0.7, and the same observation holds for the curves for G = 150 and G = 200 in the interval from 0.5 to 0.6. In general, in the interval from 0.5 to 0.7, the degradation of the algorithm performance is small for reasonable harmonic mean values (i.e., > 0.5).

Figure 3: Harmonic mean as a function of the APP threshold for G = 30, 100, 150, and 200.

Table 2: Computation time for different sizes of networks.

    Setting              | G = 100 | G = 200  | G = 500  | G = 1000
    Computation time (s) | 19.2871 | 206.5132 | 889.8120 | 12891.8732

Table 3: Number of errors in 100 Monte Carlo trials.

    Setting                       | G = 5, N = 5 | G = 5, N = 10
    VBEM (no. of iterations = 10) | 62           | 1
    Gibbs sampling (500 samples)  | 126          | 5

To demonstrate the scalability of the VBSEM algorithm, we studied the harmonic mean for simulated networks characterized by the following settings: (G_1 = 1000, N_1 = 400), (G_2 = 500, N_2 = 200), (G_3 = 200, N_3 = 80), and (G_4 = 100, N_4 = 40). The ratio G_i/N_i was kept constant in order to maintain the proportion between the number of nodes in the network and the amount of information (samples). The results are plotted in Figure 4, which represents the harmonic mean as a function of the APP threshold.
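The evaluation measures used in this subsection can be sketched in Python. This is an illustrative sketch, not the authors' code; the PR-curve points in the AUC example are arbitrary, while the example counts come from Table 6 at threshold 0.5.

```python
# Sketch of the evaluation measures: precision, recall, the F-score of (14),
# and the area under a precision-recall curve via the trapezoidal rule.

def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def f_score(precision, recall, alpha=0.5):
    """F_alpha = 1 / (alpha/precision + (1 - alpha)/recall);
    alpha = 0.5 gives the harmonic mean."""
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

def auc_trapezoid(recalls, precisions):
    """Area under the precision-recall curve, integrating precision over recall."""
    pts = sorted(zip(recalls, precisions))
    return sum((r1 - r0) * (p0 + p1) / 2.0
               for (r0, p0), (r1, p1) in zip(pts, pts[1:]))

p, r = precision_recall(tp=58, fp=308, fn=362)   # counts as in Table 6, threshold 0.5
print(round(f_score(p, r), 4))                   # harmonic mean of p and r
print(round(auc_trapezoid([0.0, 0.5, 1.0], [1.0, 0.8, 0.4]), 2))  # 0.75
```

Sweeping the APP threshold ρ from 0 to 1, recomputing (p, r) at each value, and feeding the resulting points to `auc_trapezoid` reproduces the kind of AUC summary reported in Table 1.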
The closeness of the curves at an APP threshold of 0.5 supports the good scalability of the proposed algorithm. We also recorded the computation time of VBSEM for each network and list the times in Table 2. The results were obtained on a standard PC with a 3.4 GHz processor and 2 GB of RAM.

Figure 4: Harmonic mean as a function of the APP threshold for (G = 100, N = 40), (G = 200, N = 80), (G = 500, N = 200), and (G = 1000, N = 400), demonstrating scalability.

5.1.2. Comparison with Gibbs sampling

In this subsection, we tested the VBSEM algorithm on a simulated network in order to compare it with Gibbs sampling [26]. We simulated a network of 20 genes and generated their expressions based on the proposed DBNs and the linear Gaussian regulatory model with Gaussian distributed weights. We focused on a particular gene in the simulated networks, assumed to have two parents, and compared the performance of VBSEM and Gibbs sampling in recovering the true networks. In Table 3, we present the number of errors in 100 Monte Carlo tests. For the Gibbs sampling, 500 Monte Carlo samples were used. We tested the algorithms under different settings; in the table, N stands for the number of time samples and G for the number of genes. As can be seen, VBSEM outperforms Gibbs sampling even in an underdetermined system. Since VBSEM has much lower complexity than Gibbs sampling, the proposed algorithm is better suited for uncovering large networks.

5.2. Test on real data

We applied the proposed VBSEM algorithm to cDNA microarray data sets of 62 genes in the yeast cell cycle reported in [27, 28]. Data set 1 [27] contains 18 samples evenly measured over a period of 119 minutes, with a synchronization treatment based on the α mating factor. Data set 2 [28] contains 17 samples evenly measured over 160 minutes, with a temperature-sensitive CDC15 mutant used for synchronization.
For each gene, the data are represented as log2{(expression at time t)/(expression in mixture of control cells)}. Missing values exist in both data sets, indicating that the signal in the spot was not strong enough; in this case, simple spline interpolation was used to fill in the missing data. Note that the time step, which differs between the two data sets, can be neglected, since we assume a time-homogeneous regulating process.

Figure 5: Inferred network using the α data set of [27]. Solid lines denote weights between 0 and 0.4, dotted lines weights between 0.4 and 0.8, and dash-dotted lines weights between 0.8 and 1.5; red denotes downregulation.

When validating the results, the main objective is to determine the level of confidence of the connections in the inferred network. The underlying intuition is that we should be more confident in features that would still be inferred when the data are perturbed. Ideally, this can be done with multiple independent data sets generated from repeated experiments. However, in this case and in many other practical scenarios, only one or very few data replicates are available, and the sample size of each data set is small. The question is then how to produce perturbed data from the limited available data sets while maintaining the underlying statistical features of the data. One way to achieve this is to apply the bootstrap method [29]. By bootstrapping the data set, we can generate multiple pseudo-independent data sets, each of which still maintains the statistics of the original data. Bootstrap methods have been used extensively for static data sets. When applied to time-series data, an additional requirement is to maintain, as much as possible, the inherent time dependency between samples in the bootstrapped data sets. This is important since the proposed DBN modeling and the VBSEM algorithm exploit this time dependency.
Approaches have been studied in the bootstrap literature to handle time-dependent samples, and we adopt the popular moving block bootstrap method [30]. In the moving block bootstrap, we create pseudo-data sets from the original data set by first randomly sampling blocks of sub-data and then putting them together to generate a new data set. The detailed steps can be summarized as follows.

(1) Select the length of the block, L.
(2) Create the set of n = N − L + 1 possible blocks from the data:

    Z_i = Y(:, i : i + L − 1).    (15)

(3) Randomly sample, with replacement, N/L blocks from the set of blocks {Z_i}, i = 1, ..., N − L + 1.
(4) Create the pseudo-data set by putting all the sampled blocks together and trim the size to N by removing the extra data samples.

A key issue in the moving block bootstrap is determining the block length L. The idea is to choose L large enough that observations more than L time units apart are nearly independent. Many theoretical and practical results have been developed for choosing the block length; however, they rely on large data samples and are computationally intensive. Here, we develop an easy and practical approach: we compute the autocorrelation function (ACF) of the data and choose the block length as the delay at which the ACF becomes smallest. The ACF in this case may not be reliable, but it provides at least some measure of independence.

In Figure 5, we show the network inferred when the data set from [27] was considered and the moving block bootstrap was used to resample the observations. The total number of resampled data sets was 500.

Figure 6: Inferred network using the CDC28 data set of [28].
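The resampling steps (1)-(4) can be sketched as follows. This is an illustrative Python sketch under the stated steps, not the authors' implementation; the input series stands in for one gene's time course.

```python
import random

# Sketch of the moving block bootstrap, steps (1)-(4): build all n = N - L + 1
# overlapping blocks, sample blocks with replacement, concatenate, trim to N.
# The input series is an illustrative stand-in for one gene's expression profile.

def moving_block_bootstrap(series, L, seed=0):
    rng = random.Random(seed)
    N = len(series)
    blocks = [series[i:i + L] for i in range(N - L + 1)]  # step (2)
    pseudo = []
    while len(pseudo) < N:                                # step (3)
        pseudo.extend(rng.choice(blocks))
    return pseudo[:N]                                     # step (4): trim to N

series = [0.1, -0.3, 0.5, 0.2, -0.1, 0.4, 0.0, -0.2]
resampled = moving_block_bootstrap(series, L=3)
print(len(resampled) == len(series))  # True
```

Because whole blocks of consecutive samples are copied, dependencies shorter than L time steps survive in the pseudo-data set, which is the property the DBN/VBSEM inference relies on.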
In this plot, we draw only those links with an estimated APP higher than 0.6. We use solid lines to represent links with weights between 0 and 0.4, dotted lines for links with weights between 0.4 and 0.8, and dash-dotted lines for links with weights higher than 0.8. Red represents downregulation. A circle enclosing several genes means that the corresponding proteins form a complex; the edges inside these circles are considered correct, since genes inside the same circle coexpress with some delay.

In Table 4, we show the connections with some of the highest APPs found from the α data set of [27]. We compared them with the links in the KEGG pathway [31], and some of the links inferred by the proposed algorithm are predicted in it. We consider a connection as predicted when the parent is upstream of the child in the KEGG pathway. Furthermore, the proposed algorithm is also capable of predicting, through the weight, the nature of the relationship represented by a link. For example, the connection between CDC5 and CLB1 has a weight equal to 0.6568, which is positive, so it represents an upregulation, as predicted in the KEGG pathway. Another example is the connection from CLB1 to CDC20; its APP is 0.6069 and its weight is 0.4505, again positive, so it stands for an upregulation, as predicted by the KEGG pathway.

In Figure 6, we depict the network inferred when the CDC28 data set of [28] was used. A moving block bootstrap was also used, again with 500 bootstrap data sets. As before, the links presented in this plot are those with an APP higher than 0.6. In Table 5, we show some of the connections with the highest APPs. We also compared them with the links in the KEGG pathway, and some of the links inferred by the proposed algorithm are also predicted in it. Furthermore, the proposed algorithm is

Table 4: Links with higher APPs obtained from the α data set of [27].
    From | To    | APP    | Comparison with KEGG
    CDC5 | CLB1  | 0.6558 | Predicted
    CLB6 | CDC45 | 0.6562 | Predicted
    CLB6 | SMC3  | 0.7991 | Not predicted
    SWI4 | SMC3  | 0.6738 | Not predicted
    CLB6 | HSL1  | 0.6989 | Not predicted
    CLB6 | CLN1  | 0.7044 | Predicted the other way round
    CLN1 | CLN3  | 0.6989 | Predicted the other way round
    PHO5 | SIC1  | 0.6735 | Not predicted
    CLB6 | RAD53 | 0.6974 | Not predicted
    CDC5 | CLB2  | 0.6566 | Predicted
    FUS3 | GIN4  | 0.6495 | Not predicted
    PHO5 | PHO5  | 0.6441 | Not predicted
    CLB2 | CDC5  | 0.6390 | Predicted the other way round
    CLB6 | SWI4  | 0.6336 | Predicted the other way round

also capable of predicting, through the weight, the nature of the relationship represented by a link. For example, the connection between TEM1 and DDC1 has a weight equal to −0.3034; the negative sign represents a downregulation, as predicted in the KEGG pathway. Another example is the connection from CLB2 to CDC20; its APP is 0.6069 and its weight is 0.7763, this time positive, so it stands for an upregulation, as predicted by the KEGG pathway.

Model validation

To validate the proposed linear Gaussian model, we tested the normality of the prediction errors: if the prediction errors yield Gaussian distributions, as in the linear model (1), this supports the feasibility of the linear Gaussian assumption on the data. Given the estimates b̂_i and ŵ_i of gene i, the prediction error e_i is obtained as

    e_i = R Ŵ_i b̂_i − y_i,    (16)

where Ŵ_i = diag(ŵ_i) and R = TY, with

    T = (I_N  0)    (17)

an N × (N + 1) matrix.

Figure 7: Histograms of the prediction errors for the genes DDC1, MEC3, and GRF10 in the α data set.

Figure 8: Histograms of the prediction errors for the genes DDC1, MEC3, and GRF10 in the CDC28 data set.

We show in Figures 7 and 8 examples of the histograms of the prediction errors for the genes DDC1, MEC3, and GRF10 in the α and CDC28 data sets. The histograms exhibit the bell shape of the distribution of the prediction errors, and this pattern is consistent over all the genes. To examine the normality, we performed the Kolmogorov-Smirnov goodness-of-fit hypothesis test (KS test) on the prediction errors of each gene. All the prediction errors pass the normality test at the significance level of 0.05, which demonstrates the validity of the proposed linear Gaussian assumption.

Results validation

To systematically present the results, we treated the KEGG map as the ground truth and calculated the statistics of the results. Even though there are still uncertainties, the KEGG map represents up-to-date knowledge about the dynamics of gene interaction, and it is reasonable to use it as a benchmark for results validation. In Tables 6 and 7, we list the numbers of true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn) for the α and CDC28 data sets, respectively. We also varied the

Figure 9: Inferred network by integrating the α and CDC28 data sets.

Table 5: Links with higher APPs obtained from the CDC28 data set of [28].
    From | To    | APP    | Comparison with KEGG
    CLB1 | CDC20 | 0.7876 | Predicted the other way round
    BUB1 | ESP1  | 0.6678 | Predicted
    BUB2 | CDC5  | 0.7145 | Predicted
    SIC1 | GIN4  | 0.6700 | Not predicted
    SMC3 | HSL1  | 0.6689 | Not predicted
    CLN1 | CLN3  | 0.7723 | Predicted the other way round
    FAR1 | SIC1  | 0.6763 | Predicted
    CLN1 | SIC1  | 0.6640 | Predicted
    CDC5 | PCL1  | 0.7094 | Not predicted
    DBF2 | FAR1  | 0.7003 | Not predicted
    SIC1 | CLN1  | 0.8174 | Predicted the other way round
    PBS1 | MBP1  | 0.7219 | Not predicted
    FAR1 | MET30 | 0.873  | Not predicted
    CLB2 | DBF2  | 0.7172 | Predicted the other way round

APP threshold for the decision. (The thresholds are listed in the threshold column of the tables.) A general observation is that we do not have high confidence in the inference results, since a high tp cannot be achieved at a low fp. Since the VBSEM algorithm has been tested with acceptable performance on simulated networks, and the model has also been validated, this may well indicate that the two data sets were not very informative about the causal relationships between genes.

Table 6: The α data set.

    APP threshold | tp  | tn   | fp   | fn
    0.4           | 411 | 177  | 3247 | 9
    0.5           | 58  | 3116 | 308  | 362
    0.6           | 8   | 3405 | 19   | 412

Table 7: The CDC28 data set.

    APP threshold | tp  | tn   | fp   | fn
    0.4           | 405 | 163  | 3261 | 15
    0.5           | 74  | 3019 | 405  | 346
    0.6           | 14  | 3363 | 61   | 406

Data integration

In order to improve the accuracy of the inference, we applied the Bayesian integration scheme described in Section 4 to combine the two data sets, using the information provided by both data sets to improve the inference confidence. The Bayesian integration includes two stages. In the first stage, the proposed VBSEM algorithm is run on data set 1, which contains the larger number of samples. In the second

Table 8: Links with higher APPs obtained based on the integrated data set.
    From  | To    | APP    | Comparison with KEGG
    CLB1  | CDC20 | 0.7969 | Predicted the other way round
    CLB2  | CDC5  | 0.6898 | Predicted the other way round
    CDC6  | CLB6  | 0.7486 | Predicted the other way round
    HSL7  | CLB1  | 0.6878 | Predicted
    CDC5  | CLB1  | 0.7454 | Predicted
    CLB2  | CLB1  | 0.6795 | Predicted
    CLB6  | HSL1  | 0.7564 | Not predicted
    RAD17 | CLN3  | 0.7324 | Not predicted
    FAR1  | SIC1  | 0.7329 | Predicted
    FAR1  | MET30 | 0.7742 | Not predicted
    MET30 | RAD9  | 0.7534 | Not predicted
    CDC5  | CLB2  | 0.7033 | Predicted
    CLB6  | CDC45 | 0.6299 | Predicted
    CLB6  | CLN1  | 0.6912 | Predicted the other way round
    CLN1  | CLN3  | 0.8680 | Predicted the other way round
    BUB1  | ESP1  | 0.6394 | Predicted
    BUB2  | CDC5  | 0.6142 | Predicted
    SIC1  | CLN1  | 0.6793 | Predicted the other way round

stage, the APPs of the latent variables b_i obtained in the first stage are used as the priors for the VBSEM algorithm run on the second data set, from [28]. In Figure 9, we plot the inferred network obtained from the integration process. We also performed bootstrap resampling in the integration process: we first obtained a sampled data set from data set 1 and then used its calculated APPs as the prior when integrating a bootstrap-sampled data set from data set 2. In Table 8, we present the links with the highest APPs inferred by integrating the data sets, again compared with the links shown in the KEGG pathway map. As can be seen, the proposed algorithm is able to predict many relationships. For instance, the link between CDC5 and CLB1 is predicted correctly by our algorithm with a posterior probability of 0.7454. The weight associated with this connection is −0.1245, which is negative, so it indicates a downregulation relationship, confirmed in the KEGG pathway. We also observed improvements from integrating the two data sets. Regarding the link between CDC5 and CLB1, if we compare the result obtained from the integrated data set with that shown in Table 4, we see that this relationship was not predicted when using the CDC28 data set 2.
Even though this link was predicted from the α data set, its APP was lower and the weight was positive, indicating an inconsistency with the KEGG map; this inconsistency has been fixed by the data integration. As another example, the relationship between HSL7 and CLB1 was predicted based on the integrated data sets but not from the CDC28 data set. This link was predicted when only the α data set was used, but its APP was 0.6108, lower than the APP obtained by integration. A similar phenomenon can be observed for the link from FAR1 to SIC1.

Table 9: Integrated data set.

    APP threshold | tp  | tn   | fp   | fn
    0.4           | 252 | 1655 | 1769 | 168
    0.5           | 50  | 3175 | 249  | 370
    0.6           | 17  | 3374 | 50   | 403

We also list in Table 9 the statistics of the results compared with the KEGG map. Compared with Tables 6 and 7, data integration almost halved the fp at the thresholds 0.4 and 0.5 and also reduced the fp at 0.6. Meanwhile, tp increased. This implies increased confidence in the results after data integration, which demonstrates the advantages of Bayesian data integration.

Another way of looking at the benefits of the integration process is to examine the lower bound of the VBSEM. If the data integration benefits the performance of the algorithm, we should see higher lower-bound values than those obtained with a single data set: if the data contain more information after integration, the lower bound should be closer to the value it approximates. In Figure 10, we plot the evolution of the lower bound over the VBSEM iterations for each gene from the α data set, the CDC28 data set, and the integrated data sets. The increase of the lower bound when the integrated data sets were used supports the advantages of Bayesian data integration.

6. CONCLUSION

We investigated the DBN modeling of cell cycle GRNs and the VBSEM learning of network topology. The proposed VBSEM solution is able to estimate the APPs of the topology.
We showed how the estimated APPs can be used in a Bayesian data integration strategy. The low complexity of the VBSEM algorithm shows its potential to work with large networks. We also showed how the bootstrap method can be used to obtain the confidence of the inferred networks; this approach has proved very useful in the case of small data size, a common situation in computational biology research.

Figure 10: Evolution of the VBSEM lower bound over the iterations for the α data set, the CDC28 data set, and the integrated data set.

APPENDICES

A. CONJUGATE PRIORS OF TOPOLOGY AND PARAMETERS

We choose conjugate priors for the topology and the parameters:

    p(b_i) = N(b_i | μ_0, C_0),    (A.1)

    p(θ_i) = p(w_i, σ_i²) = p(w_i | σ_i²) p(σ_i²) = N(w_i | μ_{w_i}, σ_i² I_G) IG(σ_i² | ν_0/2, γ_0/2),    (A.2)

where μ_0 and C_0 are the mean and the covariance of the prior probability density p(b_i). In general, μ_{w_i} and μ_0 are simply set as zero vectors, while ν_0 and γ_0 are set equal to small positive real values. Moreover, the covariance matrix C_0 needs to be chosen carefully and is usually set as a diagonal matrix with a relatively large constant at each diagonal element. These priors satisfy the conditions for conjugate-exponential (CE) models [25]. For CE models, formulae exist in [25] for solving analytically the integrals in the VBE and VBM steps.

B. DERIVATION OF THE VBE AND VBM STEPS

Let us first start with the VBE step. Suppose that q(θ_i), obtained in the previous VBM step, follows the Gaussian-inverse-gamma distribution in (B.5). The VBE step calculates the approximation of the APPs of the topology p(b_i). By applying the theorems of the CE model [25], q(b_i) can be shown to have the expression

    q(b_i) = N(b_i | m_{b_i}, Σ_{b_i}),    (B.1)

where

    m_{b_i} = Σ_{b_i} (C_0^{-1} μ_0 + f),    (B.2)

    Σ_{b_i} = (C_0^{-1} + D)^{-1},    (B.3)

with

    D = B ⊗ ⟨m_{w_i} m_{w_i}^T σ_i^{-2}⟩_{q(θ_i)},    f = ⟨σ_i^{-2}⟩_{q(θ_i)} diag(m_{w_i}) R^T y_i,    B = R^T R,    (B.4)

where ⊗ denotes the elementwise product and R = TY, with T = (I_N  0) an N × (N + 1) matrix.

We now turn to the VBM step, in which we compute q(θ_i). Again, from the CE model and the q(b_i) obtained in (B.1), we have

    q(θ_i) = N(w_i | m_{w_i}, Σ_{w_i}) IG(σ_i² | α/2, β/2),    (B.5)

where

    m_{w_i} = A^{-1} M_x R^T y_i,    Σ_{w_i} = σ_i² A^{-1},    (B.6)

with

    A = I_G + K,    K = B ⊗ (Σ_{b_i} + m_{b_i} m_{b_i}^T),    M_x = diag(m_{b_i}),    (B.7)

and

    α = N(η + 1) − G − 2,    β = −c,    c = y_i^T R M_x m_{w_i} − y_i^T y_i − ν_0,

where η is a hyperparameter of the parameter prior p(θ_i) based on the CE models (A.2).

C. COMPUTATION OF THE LOWER BOUND F

The convergence of the VBEM algorithm is tested using a lower bound of ln p(y_i). In this paper, we use F to denote this lower bound, and we calculate it using the newest q(b_i) and q(θ_i) obtained in the iterative process. F can be written succinctly using the definition of the Kullback-Leibler (KL) divergence, which measures the difference between two probability distributions and is also termed relative entropy. Using this definition, we can write the differences between the true and the approximate distributions as

    KL(q(b_i) ∥ p(b_i, y_i | θ_i)) = − ∫ db_i q(b_i) ln [p(b_i, y_i | θ_i) / q(b_i)],

    KL(q(θ_i) ∥ p(θ_i)) = − ∫ dθ_i q(θ_i) ln [p(θ_i) / q(θ_i)].    (C.1)

Finally, the lower bound F can be written in terms of the previous definitions as

    F = ∫ dθ_i q(θ_i) { ∫ db_i q(b_i) ln [p(b_i, y_i | θ_i) / q(b_i)] + ln [p(θ_i) / q(θ_i)] }
      = − ∫ dθ_i q(θ_i) KL(q(b_i) ∥ p(b_i, y_i | θ_i)) − KL(q(θ_i) ∥ p(θ_i)).    (C.2)

ACKNOWLEDGMENTS

Yufei Huang is supported by NSF Grant CCF-0546345. Also M.
Carmen Carrion Perez thanks MCyT under project TEC 2004-06096-C03-02/TCM for funding.

REFERENCES

[1] H. Kitano, “Looking beyond the details: a rise in system-oriented approaches in genetics and molecular biology,” Current Genetics, vol. 41, no. 1, pp. 1–10, 2002.
[2] P. D’haeseleer, S. Liang, and R. Somogyi, “Genetic network inference: from co-expression clustering to reverse engineering,” Bioinformatics, vol. 16, no. 8, pp. 707–726, 2000.
[3] P. Brazhnik, A. de la Fuente, and P. Mendes, “Gene networks: how to put the function in genomics,” Trends in Biotechnology, vol. 20, no. 11, pp. 467–472, 2002.
[4] N. Friedman, “Inferring cellular networks using probabilistic graphical models,” Science, vol. 303, no. 5659, pp. 799–805, 2004.
[5] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, “Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks,” Bioinformatics, vol. 18, no. 2, pp. 261–274, 2002.
[6] X. Zhou, X. Wang, and E. R. Dougherty, “Construction of genomic networks using mutual-information clustering and reversible-jump Markov-chain-Monte-Carlo predictor design,” Signal Processing, vol. 83, no. 4, pp. 745–761, 2003.
[7] A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young, “Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks,” in Proceedings of the 6th Pacific Symposium on Biocomputing (PSB ’01), pp. 422–433, The Big Island of Hawaii, Hawaii, USA, January 2001.
[8] E. J. Moler, D. C. Radisky, and I. S. Mian, “Integrating naive Bayes models and external knowledge to examine copper and iron homeostasis in S. cerevisiae,” Physiological Genomics, vol. 4, no. 2, pp. 127–135, 2000.
[9] E. Segal, Rich probabilistic models for genomic data, Ph.D. thesis, Stanford University, Stanford, Calif, USA, 2004.
[10] H. de Jong, “Modeling and simulation of genetic regulatory systems: a literature review,” Journal of Computational Biology, vol. 9, no. 1, pp. 67–103, 2002.
[11] Z. Bar-Joseph, “Analyzing time series gene expression data,” Bioinformatics, vol. 20, no. 16, pp. 2493–2503, 2004. [12] N. Simonis, S. J. Wodak, G. N. Cohen, and J. van Helden, “Combining pattern discovery and discriminant analysis to predict gene co-regulation,” Bioinformatics, vol. 20, no. 15, pp. 2370–2379, 2004. [13] K. Murphy and S. Mian, “Modelling gene expression data using dynamic Bayesian networks,” Tech. Rep., Computer Science Division, University of California, Berkeley, Calif, USA, 1999. [14] N. Friedman, M. Linial, I. Nachman, and D. Pe’er, “Using Bayesian networks to analyze expression data,” Journal of Computational Biology, vol. 7, no. 3-4, pp. 601–620, 2000. [15] R. J. P. van Berlo, E. P. van Someren, and M. J. T. Reinders, “Studying the conditions for learning dynamic Bayesian networks to discover genetic regulatory networks,” Simulation, vol. 79, no. 12, pp. 689–702, 2003. [16] M. J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D. L. Wild, “A Bayesian approach to reconstructing genetic regulatory networks with hidden factors,” Bioinformatics, vol. 21, no. 3, pp. 349–356, 2005. [17] B.-E. Perrin, L. Ralaivola, A. Mazurie, S. Bottani, J. Mallet, and F. d’Alché-Buc, “Gene networks inference using dynamic Bayesian networks,” Bioinformatics, vol. 19, supplement 2, pp. ii138–ii148, 2003. [18] S. Y. Kim, S. Imoto, and S. Miyano, “Inferring gene networks from time series microarray data using dynamic Bayesian networks,” Briefings in Bioinformatics, vol. 4, no. 3, pp. 228–235, 2003. [19] F. Ferrazzi, R. Amici, P. Sebastiani, I. S. Kohane, M. F. Ramoni, and R. Bellazzi, “Can we use linear Gaussian networks to model dynamic interactions among genes? Results from a simulation study,” in Proceedings of IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS ’06), pp. 13–14, College Station, Tex, USA, May 2006. [20] X. Wang and H. V. 
Poor, Wireless Communication Systems: Advanced Techniques for Signal Reception, Prentice Hall PTR, Englewood Cliffs, NJ, USA, 2004. [21] J. Wang, Y. Huang, M. Sanchez, Y. Wang, and J. Zhang, “Reverse engineering yeast gene regulatory networks using graphical models,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’06), vol. 2, pp. 1088–1091, Toulouse, France, May 2006. [22] D. Husmeier, “Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks,” Bioinformatics, vol. 19, no. 17, pp. 2271–2282, 2003. [23] K. P. Murphy, Dynamic Bayesian networks: representation, inference and learning, Ph.D. thesis, University of California, Berkeley, Calif, USA, 2004. [24] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice-Hall, Englewood Cliffs, NJ, USA, 1997. [25] M. J. Beal, Variational algorithms for approximate Bayesian inference, Ph.D. thesis, The Gatsby Computational Neuroscience Unit, University College London, London, UK, May 2003. [26] S. P. Brooks, “Markov chain Monte Carlo method and its application,” Journal of the Royal Statistical Society: Series D, The Statistician, vol. 47, no. 1, pp. 69–100, 1998. [27] P. T. Spellman, G. Sherlock, M. Q. Zhang, et al., “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular Biology of the Cell, vol. 9, no. 12, pp. 3273–3297, 1998. [28] R. J. Cho, M. J. Campbell, E. A. Winzeler, et al., “A genomewide transcriptional analysis of the mitotic cell cycle,” Molecular Cell, vol. 2, no. 1, pp. 65–73, 1998. [29] B. Efron and R. Tibshirani, An Introduction to Bootstrap, Monographs on Statistics and Applied Probability, no. 57, Chapman & Hall, New York, NY, USA, 1993. [30] S. N. Lahiri, Resampling Methods for Dependent Data, Springer, New York, NY, USA, 2003. 
[31] “Kegg: Kyoto encyclopedia of genes and genomes,” http://www .genome.jp/kegg/. Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2007, Article ID 51947, 12 pages doi:10.1155/2007/51947 Research Article Inferring Time-Varying Network Topologies from Gene Expression Data Arvind Rao,1, 2 Alfred O. Hero III,1, 2 David J. States,2, 3 and James Douglas Engel4 1 Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109-2122, USA Graduate Program, Center for Computational Medicine and Biology, School of Medicine, University of Michigan, Ann Arbor, MI 48109-2218, USA 3 Department of Human Genetics, School of Medicine, University of Michigan, Ann Arbor, MI 48109-0618, USA 4 Department of Cell and Developmental Biology, School of Medicine, University of Michigan, Ann Arbor, MI 48109-2200, USA 2 Bioinformatics Received 24 June 2006; Revised 4 December 2006; Accepted 17 February 2007 Recommended by Edward R. Dougherty Most current methods for gene regulatory network identification lead to the inference of steady-state networks, that is, networks prevalent over all times, a hypothesis which has been challenged. There has been a need to infer and represent networks in a dynamic, that is, time-varying fashion, in order to account for different cellular states affecting the interactions amongst genes. In this work, we present an approach, regime-SSM, to understand gene regulatory networks within such a dynamic setting. The approach uses a clustering method based on these underlying dynamics, followed by system identification using a state-space model for each learnt cluster—to infer a network adjacency matrix. We finally indicate our results on the mouse embryonic kidney dataset as well as the T-cell activation-based expression dataset and demonstrate conformity with reported experimental evidence. Copyright © 2007 Arvind Rao et al. 
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Most methods of graph inference work very well on stationary time-series data, in that the generating structure for the time series does not exhibit switching. In [1, 2], useful methods for learning network topologies from T-cell gene expression data using linear state-space models (SSMs) have been presented. However, it is known that regulatory pathways do not persist over all time. An important recent finding in which this is seen to be true comes from the examination of regulatory networks during the yeast cell cycle [3], wherein topologies change depending on the underlying (endogenous or exogenous) cell condition. This highlights the need to identify the variation of the "hidden states" regulating gene network topologies and to incorporate them into the network inference framework [4]. This hidden state at time t (denoted by x_t) might be related to the level of some key metabolite(s) governing the activity (g_t) of the gene(s). This presents a notion of condition specificity that influences the dynamics of the various genes active during that regime (condition). From time-series microarray data, we aim to partition each gene's expression profile into such regimes of expression, during which the underlying dynamics of the gene's controlling state (x_t) can be assumed to be stationary. In [5], the powerful notion of context-sensitive Boolean networks for gene relationships has been presented. However, at least for short time-series data, such a Boolean characterization of gene state requires a one-bit quantization of the continuous state, which is difficult without expert biological knowledge of the activation threshold and of the precise evolution of gene expression.
Here, we work with gene profiles as continuous variables conditioned on the regime of expression. Each regime is related to the state of a state-space model that is estimated from the data. Our method (regime-SSM) comprises three components: to find the switches in gene dynamics, we use a change-point detection (CPD) approach based on singular spectrum analysis (SSA). Following the hypothesis that the mechanism causing genes to switch at the same time arises from a common underlying input [3, 6], we group genes having similar change points. This clustering borrows from a mixture-of-Gaussians (MoG) model [7]. The inference of the network adjacency matrix follows from a state-space representation of expression dynamics among these coclustered genes [1, 2]. Finally, we present analyses on the publicly available embryonic kidney gene expression dataset [8] and the T-cell activation dataset [1], using a combination of the methods developed above, and we validate our findings with previously published literature as well as experimental data. For the embryonic kidney dataset, the biological problem motivating our network inference approach is that of identifying gene interactions during mammalian nephrogenesis (kidney formation). Nephrogenesis, like several other developmental processes, involves the precise temporal interaction of several growth factors, differentiation signals, and transcription factors for the generation and maturation of progenitor cells. One such key set of transcription factors is the GATA family, comprising six members, all containing the GATA-binding domain. Among these, Gata2 and Gata3 have been shown to play a functional role [8, 9] in nephric development between days 10–12 after fertilization.
From a set of differentially expressed genes pertinent to this time window (identified from microarray data), our goal is to prospectively discover regulatory interactions between them and the Gata2/3 genes. These interactions can then be further resolved into transcriptional or signaling interactions on the basis of additional biological information. In the T-cell activation dataset, the question is whether events downstream of T-cell activation can be partitioned into early and late response behaviors and, if so, which genes are active in a particular phase. Finally, can a network-level influence be inferred among the genes of each phase, and does it correlate with known data? We note here that we are not looking for the behavior of any particular gene, but are only interested in genes from each phase. As will be shown in this paper, regime-SSM generates biologically relevant hypotheses regarding time-varying gene interactions during nephric development and T-cell activation. Several interesting transcripts are seen to be involved in the process, and the influence network hereby generated resolves cyclic dependencies. The main assumption in the formulation of a linear state-space model to examine the possibility of gene-gene interactions is that gene expression is a function of the underlying cell state and of the expression of other genes at the previous time step. If longer-range dependencies were to be considered, the complexity of the model would increase. Another criticism of the model might be that nonlinear interactions cannot be adequately modeled by such a framework. However, around the equilibrium point (steady state), we can recover a locally linearized version of this nonlinear behavior.

2. SSA AND CHANGE-POINT DETECTION

First we introduce some notation. Consider N gene expression profiles, g^{(1)}, g^{(2)}, ..., g^{(N)} ∈ R^T, T being the length of each gene's temporal expression profile (as obtained from microarray expression).
The jth time instant of gene i's expression profile will be denoted by g_j^{(i)}. State-space partitioning is done using singular spectrum analysis (SSA) [10]. SSA identifies structural change points in time-series data using a sequential procedure [11]. We briefly review this method. Consider the "windowed" (width N_W) time-series data given by {g_1^{(i)}, g_2^{(i)}, ..., g_{N_W}^{(i)}}, with M (M ≤ N_W/2) an integer-valued lag parameter, and a replication parameter K = N_W − M + 1. The SSA procedure for CPD involves the following.

(i) Construction of an l-dimensional subspace: here, a "trajectory matrix" for the time series over the interval [n + 1, n + T] is constructed,

$$
G_B^{i,(n)} =
\begin{pmatrix}
g_{n+1}^{(i)} & g_{n+2}^{(i)} & \cdots & g_{n+K}^{(i)} \\
g_{n+2}^{(i)} & g_{n+3}^{(i)} & \cdots & g_{n+K+1}^{(i)} \\
\vdots & \vdots & \ddots & \vdots \\
g_{n+M}^{(i)} & g_{n+M+1}^{(i)} & \cdots & g_{n+N_W}^{(i)}
\end{pmatrix},
\tag{1}
$$

where K = N_W − M + 1. The columns of the matrix G_B^{i,(n)} are the vectors G_j^{(i)} = (g_{n+j}^{(i)}, ..., g_{n+j+M−1}^{(i)})^T, with j = 1, ..., K.

(ii) Singular value decomposition of the lag-covariance matrix R_{i,n} = G_B^{i,(n)} (G_B^{i,(n)})^T yields a collection of singular vectors; a grouping of l of these singular vectors, corresponding to the l highest eigenvalues and indexed by I = {1, ..., l}, establishes a subspace L_{n,I} of R^M.

(iii) Construction of the test matrix: use G_test^{i,(n)} defined by

$$
G_{\mathrm{test}}^{i,(n)} =
\begin{pmatrix}
g_{n+p+1}^{(i)} & g_{n+p+2}^{(i)} & \cdots & g_{n+q}^{(i)} \\
g_{n+p+2}^{(i)} & g_{n+p+3}^{(i)} & \cdots & g_{n+q+1}^{(i)} \\
\vdots & \vdots & \ddots & \vdots \\
g_{n+p+M}^{(i)} & g_{n+p+M+1}^{(i)} & \cdots & g_{n+q+M-1}^{(i)}
\end{pmatrix}.
\tag{2}
$$

Here, the parameters p and q set the location and length of the test sample. We choose p ≥ K, with K = N_W − M + 1, and q > p; here we take q = p + 1. From this construction, the matrix columns are the vectors G_j^{i,(n)}, j = p + 1, ..., q, and the matrix has dimension M × Q, with Q = q − p = 1.
(iv) Computation of the detection statistic: the detection statistics used in the CPD are

(a) the normed Euclidean distance between the column span of the test matrix and the l-dimensional subspace L_{n,I} of R^M, denoted by D_{n,I,p,q};

(b) the normalized sum of squares of distances, S_n = D_{n,I,p,q}/(MQ μ_{n,I}), with μ_{n,I} = D_{m,I,0,K}, where m is the largest value of m ≤ n such that the hypothesis of no change is accepted;

(c) a cumulative sum- (CUSUM-) type statistic W_1 = S_1, W_{n+1} = max{W_n + S_{n+1} − S_n − 1/(3MQ), 0}, n ≥ 1.

The CPD procedure declares a structural change in the time-series dynamics if for some time instant n we observe W_n > h, with the threshold

$$
h = \frac{2 t_\alpha}{MQ} \sqrt{\frac{1}{3}\, q \left( 3MQ - Q^2 + 1 \right)},
$$

t_α being the (1 − α) quantile of the standard normal distribution.

(v) Choice of algorithm parameters:

(a) window width N_W: with the choice N_W ≈ T/5, T being the length of the original time series, the algorithm provides a reliable method of extracting most structural changes. Choosing a much smaller N_W might lead to some outliers being classified as potential change points, but in our set-up this is preferred to losing genuine structural changes through choosing a larger N_W;

(b) choice of lag M: in most cases, choose M = N_W/2.

3. MIXTURE-OF-GAUSSIANS (MoG) CLUSTERING

Having found change points (and thus regimes) in the gene trajectories of the differentially expressed genes, our goal is now to group (cluster) genes with similar temporal profiles within each regime. In this section, we derive the parameter update equations for a mixture-of-Gaussians clustering paradigm. As will be seen later, the Gaussian assumptions on the gene expression permit the use of coclustered genes for the SSM-based network parameter estimation. We now consider the group of gene expression profiles G = {g^{(1)}, g^{(2)}, ..., g^{(n)}}, all of which share a common change point (time of switch) c_1.
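Before turning to the clustering step, the subspace-distance computation at the heart of the SSA statistic can be sketched in a few lines of numpy. The function names, the choice l = 2, and the synthetic mean-shift series below are illustrative assumptions, not values or code from the paper:

```python
import numpy as np

def trajectory_matrix(x, M):
    """M x K matrix whose columns are lagged length-M windows of x (K = len(x) - M + 1)."""
    return np.column_stack([x[j:j + M] for j in range(len(x) - M + 1)])

def ssa_distance(base, test, l):
    """Squared Euclidean distance of the test columns from the l-dimensional
    subspace spanned by the leading left singular vectors of the base
    trajectory matrix (the quantity behind D_{n,I,p,q})."""
    M = len(base) // 2                      # lag parameter, M = N_W / 2
    G = trajectory_matrix(base, M)
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    U_l = U[:, :l]                          # basis of the subspace L_{n,I}
    cols = trajectory_matrix(test, M)
    resid = cols - U_l @ (U_l.T @ cols)     # component outside the subspace
    return float(np.sum(resid ** 2))

# A synthetic series whose mean jumps at t = 40: the distance statistic is
# small for a test interval inside the base regime and large for one
# straddling the change point.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 0.1, 40), rng.normal(3, 0.1, 40)])
d_inside = ssa_distance(x[:20], x[20:31], l=2)
d_across = ssa_distance(x[:20], x[35:46], l=2)
print(d_inside < d_across)  # True
```

The sharp growth of this distance once the test columns straddle the change point is what the CUSUM statistic W_n accumulates.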
Consider gene profile i, g^{(i)} = [g_1^{(i)}, g_2^{(i)}, ..., g_{T_{c_1}}^{(i)}]^T, a T_{c_1}-dimensional random vector which follows a k-component finite mixture distribution described by

$$
p(g \mid \theta) = \sum_{m=1}^{k} \alpha_m\, p(g \mid \phi_m),
\tag{3}
$$

where α_1, ..., α_k are the mixing probabilities, each φ_m is the set of parameters defining the mth component, and θ ≡ {φ_1, ..., φ_k, α_1, ..., α_k} is the complete set of parameters needed to specify the mixture. The mixing probabilities satisfy

$$
\alpha_m \ge 0, \quad m = 1, \ldots, k, \qquad \sum_{m=1}^{k} \alpha_m = 1.
\tag{4}
$$

For a set of n independently and identically distributed samples

$$
G = \{ g^{(1)}, g^{(2)}, \ldots, g^{(n)} \},
\tag{5}
$$

the log-likelihood of a k-component mixture is given by

$$
\log p(G \mid \theta) = \log \prod_{i=1}^{n} p(g^{(i)} \mid \theta) = \sum_{i=1}^{n} \log \sum_{m=1}^{k} \alpha_m\, p(g^{(i)} \mid \phi_m).
\tag{6}
$$

Treat the labels Z = {z^{(1)}, ..., z^{(n)}} associated with the n samples as missing data. Each label is a binary vector z^{(i)} = [z_1^{(i)}, ..., z_k^{(i)}], where z_m^{(i)} = 1 and z_p^{(i)} = 0 for p ≠ m indicate that sample g^{(i)} was produced by the mth component. In this setting, the expectation-maximization (EM) algorithm can be used to derive the cluster parameter (θ) update equations.

In the E-step of the EM algorithm, the function Q(θ, θ(t)) ≡ E[log p(G, Z | θ) | G, θ(t)] is computed. This yields

$$
w_m^{(i)} \equiv E\big[ z_m^{(i)} \mid G, \theta(t) \big] = \frac{ \alpha_m(t)\, p\big(g^{(i)} \mid \theta_m(t)\big) }{ \sum_{j=1}^{k} \alpha_j(t)\, p\big(g^{(i)} \mid \theta_j(t)\big) },
\tag{7}
$$

where w_m^{(i)} is the posterior probability of the event z_m^{(i)} = 1 on observing g^{(i)}. The estimate of the number of components k is chosen using a minimum message length (MML) criterion [7]. The MML criterion borrows from algorithmic information theory and serves to select models of lowest complexity to explain the data. This complexity has two components: the first encodes the observed data as a function of the model and the second encodes the model itself. Hence, the MML criterion in our setup becomes

$$
k_{\mathrm{MML}} = \arg\min_k \Big\{ -\log p\big(G \mid \theta(k)\big) + \frac{k\,(N_p + 1)}{2} \log n \Big\},
\tag{8}
$$

where N_p is the number of parameters per component in the k-component mixture, and the number of clusters is searched over k_min ≤ k ≤ k_max.

In the M-step, θ_m(t + 1) = arg max_{φ_m} Q(θ, θ(t)) for each m with α_m(t + 1) > 0. The elements φ_m of the parameter vector estimate θ are typically not closed form and depend on the specific parametrization of the densities in the mixture, that is, p(g^{(i)} | φ_m). If p(g^{(i)} | φ_m) belongs to the Gaussian density class N(μ_m, Σ_m), we have φ = (μ, Σ) and the EM updates yield [7]

$$
\alpha_m(t+1) = \frac{1}{n} \sum_{i=1}^{n} w_m^{(i)}, \qquad
\mu_m(t+1) = \frac{ \sum_{i=1}^{n} w_m^{(i)}\, g^{(i)} }{ \sum_{i=1}^{n} w_m^{(i)} },
$$
$$
\Sigma_m(t+1) = \frac{ \sum_{i=1}^{n} w_m^{(i)} \big( g^{(i)} - \mu_m(t+1) \big) \big( g^{(i)} - \mu_m(t+1) \big)^T }{ \sum_{i=1}^{n} w_m^{(i)} }.
\tag{9}
$$

Equations (7) and (9) are the parameter update equations for each of the m = 1, ..., k cluster components.

For the kidney expression data, since we are interested in the role of Gata2 and Gata3 during early kidney development, we consider all genes which have change points similar to those of the Gata2 and Gata3 genes, respectively. We perform MoG clustering within such genes and look at those coclustered with Gata2 or Gata3. Coclustering within a regime potentially suggests that the governing dynamics are the same, possibly to the extent of coregulation. We note that just because a gene is coclustered with Gata2 in one regime, it does not mean that it will cocluster in a different regime. This approach suggests a way to localize regimes of correlation instead of the traditional global correlation measure, which can mask transient and condition-specific dynamics. For this gene expression data, the MML-penalized criterion indicates that an adequate number of clusters to describe the data is two (k = 2). In Tables 1 and 2, we indicate some of the genes with similar coexpression dynamics as Gata2/Gata3 and a cluster assignment of such genes.
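For concreteness, the E-step (7) and the Gaussian M-step (9) can be sketched as follows. The synthetic two-cluster data, the spread-out initialization, and the small diagonal regularization are illustrative assumptions, and the sketch omits the MML-based choice of k:

```python
import numpy as np

def mog_em(G, k, n_iter=100):
    """EM for a k-component Gaussian mixture: E-step as in (7), M-step as in (9).

    G : (n, d) array of gene expression profiles sharing a change point.
    """
    n, d = G.shape
    alpha = np.full(k, 1.0 / k)
    # Rough initialization (an assumption, not from the paper): spread the
    # initial means along the ordering of profile sums.
    order = np.argsort(G @ np.ones(d))
    mu = G[order[np.linspace(0, n - 1, k).astype(int)]].copy()
    Sigma = np.stack([np.cov(G.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step (7): w[i, m] proportional to alpha_m * N(g_i | mu_m, Sigma_m)
        log_w = np.empty((n, k))
        for m in range(k):
            diff = G - mu[m]
            _, logdet = np.linalg.slogdet(Sigma[m])
            mahal = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma[m]), diff)
            log_w[:, m] = np.log(alpha[m]) - 0.5 * (d * np.log(2 * np.pi) + logdet + mahal)
        log_w -= log_w.max(axis=1, keepdims=True)
        w = np.exp(log_w)
        w /= w.sum(axis=1, keepdims=True)
        # M-step (9): posterior-weighted updates of alpha, mu, Sigma
        Nm = w.sum(axis=0)
        alpha = Nm / n
        mu = (w.T @ G) / Nm[:, None]
        for m in range(k):
            diff = G - mu[m]
            Sigma[m] = (w[:, m, None] * diff).T @ diff / Nm[m] + 1e-6 * np.eye(d)
    return alpha, mu, Sigma, w

# Two well-separated synthetic profile clusters are recovered.
rng = np.random.default_rng(1)
G = np.vstack([rng.normal(0, 0.2, (20, 3)), rng.normal(4, 0.2, (20, 3))])
alpha, mu, Sigma, w = mog_em(G, k=2)
labels = w.argmax(axis=1)
print(labels[0] != labels[20], abs(alpha.sum() - 1.0) < 1e-9)  # True True
```

The posteriors w play the role of soft cluster assignments; hardening them with argmax gives the regime-specific coclusters discussed above.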
We observe that this clustering corresponds to the first phase of embryonic development (days 10–12 dpc), the phase where Gata2 and Gata3 are perhaps most relevant to kidney development [12–15]. A word about Table 1 is in order. The entries in each column of a row (gene) indicate the change points (as found by the SSA-CPD procedure) in the time series of the interpolated gene expression profile. Our simulation studies with the T-cell data indicate that the SSM and CoD performance is not much worse with the interpolated data compared to the original time series (Table 7). We note that because of the present choice of the parameter N_W, we might detect some false positive change points, but this is preferable to the loss of genuine change points. An examination of the change points of the various genes in Table 1 indicates three regimes: between points approximately 1–5, 5–11, and 12–20. The missing entries mean that no change point was identified for a certain regime, and they are treated as such. Since our focus is early Gata3 behavior, we are interested in time points 1–12, and hence we examine the evolution of network-level interactions over the first two regimes for the genes coclustered in these regimes. To clarify the validity of the presented approach, we present a similar analysis on another data set: the T-cell expression data presented in [1]. This dataset records the expression of various genes after T-cell activation using stimulation with the phorbol ester PMA and ionomycin [16]. It contains the profiles of about 58 genes over 10 time points, with 44 (34 + 10) replicate measurements for each time point. Since here we have no specific gene in mind (unlike earlier, where we were particularly interested in Gata3 behavior), the change-point procedure (CPD) yields two distinct regimes: one from time points 1 to 4 and the other from time points 5 to 10.
Applying the MoG clustering procedure yields an optimal number of clusters (from MML) of one in each regime. We therefore call these two clusters "early response" and "late response" genes and then proceed to learn a network relationship amongst them, within each cluster. The CPD and cluster information for the early and late responses are summarized in Table 3.

Table 1: Change-point analysis of some key genes, prior to clustering (annotations in Table 8). The numbers indicate the time points at which regime changes occur for each gene.

Gene symbol | Change point I | Change point II | Change point III
Bmp7        | 6              | 10              | 12
Rara        | 5              | 11              | 16
Pax2        | 6              | 12              | 15
Gata3       | 5              | 9               | 12
Gata2       | —              | —               | 18
Gdf11       | —              | 10              | 20
Npnt        | —              | 12              | 16
Cd44        | 5              | 11              | 15
Pgf         | 5              | 11              | —
Pbx1        | 5              | 12              | 20
Ret         | —              | 10              | —

Table 2: Some of the genes coclustered with Gata2 and Gata3 after MoG clustering (annotations in Table 8).

Genes with the same dynamics as Gata3: Bmp7, Nrtn, Pax2, Ros1, Pbx1, Rara, Gdf11
Genes with the same dynamics as Gata2: Lamc2, Cldn3, Ros1, Ptprd, Npnt, Cdh16, Cldn4

Table 3: Some of the genes related to early and late responses in T-cell activation (annotations in Table 9).

Genes related to early response (time points 1–4): CD69, Mcp1, Mcl1, EGR1, JunD, CKR1
Genes related to late response (time points 5–10): CCNA2, CDC2, EGR1, IL2r gamma, IL6

4. STATE-SPACE MODEL

For a given regime, we treat gene expression as an observation related to an underlying hidden cell state (x_t), which is assumed to govern regime-specific gene expression dynamics for that biological process, globally within the cell. Suppose there are N genes whose expression is related to a single process. The ith gene's expression at time t is denoted by g_t^{(i)}, t = 1, ..., T, where T is the number of time points for which the data is available. The state-space model (SSM) is used to model the gene expression (g_t^{(i)}, i = 1, 2, ..., N and t = 1, 2, ..., T) as a function of this underlying cell state (x_t) as well as some external inputs. A notion of influence among genes can be integrated into this model by considering the SSM inputs to be the gene expression values at the previous time step. The state and observation equations of the state-space model [17] are

(i) state equation:

$$
x_{t+1} = A x_t + B g_t + e_{s,t}; \quad e_{s,t} \sim \mathcal{N}(0, Q), \; i = 1, \ldots, N; \; t = 1, \ldots, T;
\tag{10}
$$

(ii) observation equation:

$$
g_t = C x_t + D g_{t-1} + e_{o,t}; \quad e_{o,t} \sim \mathcal{N}(0, R),
\tag{11}
$$

with x_t = [x_t^{(1)}, x_t^{(2)}, ..., x_t^{(K)}]^T and g_t = [g_t^{(1)}, g_t^{(2)}, ..., g_t^{(N)}]^T.

Table 4: Assumptions and log-likelihood calculations in the state-space model. The (≡) symbol indicates a definition; T is the number of time points and R_g the number of replicates.

$$
P(g_t \mid x_t) = (2\pi)^{-p/2} \det(R)^{-1/2}\, e^{-\frac{1}{2} [g_t - C x_t - D g_{t-1}]^T R^{-1} [g_t - C x_t - D g_{t-1}]},
$$
$$
P(x_t \mid x_{t-1}) = (2\pi)^{-k/2} \det(Q)^{-1/2}\, e^{-\frac{1}{2} [x_t - A x_{t-1} - B g_{t-1}]^T Q^{-1} [x_t - A x_{t-1} - B g_{t-1}]},
$$
$$
P(x_1) = (2\pi)^{-k/2} \det(V_1)^{-1/2}\, e^{-\frac{1}{2} [x_1 - \pi_1]^T V_1^{-1} [x_1 - \pi_1]} \quad \text{(initial state density)},
$$
$$
P(\{x\}, \{g\}) = \prod_{i=1}^{R_g} P\big(x_1^{(i)}\big) \prod_{t=2}^{T} P\big(x_t^{(i)} \mid x_{t-1}^{(i)}, g_{t-1}^{(i)}\big) \prod_{t=1}^{T} P\big(g_t^{(i)} \mid x_t^{(i)}, g_{t-1}^{(i)}\big) \quad \text{(Markov property)},
$$

and the joint log probability, summing over the R_g replicates, is

$$
\log P(\{x\}, \{g\}) = \sum_{i=1}^{R_g} \Big\{ - \sum_{t=1}^{T} \frac{1}{2} \big[g_t^{(i)} - C x_t^{(i)} - D g_{t-1}^{(i)}\big]^T R^{-1} \big[g_t^{(i)} - C x_t^{(i)} - D g_{t-1}^{(i)}\big] - \frac{T}{2} \log \det(R)
$$
$$
- \sum_{t=2}^{T} \frac{1}{2} \big[x_t^{(i)} - A x_{t-1}^{(i)} - B g_{t-1}^{(i)}\big]^T Q^{-1} \big[x_t^{(i)} - A x_{t-1}^{(i)} - B g_{t-1}^{(i)}\big] - \frac{T-1}{2} \log \det(Q)
$$
$$
- \frac{1}{2} \big[x_1^{(i)} - \pi_1\big]^T V_1^{-1} \big[x_1^{(i)} - \pi_1\big] - \frac{1}{2} \log \det(V_1) \Big\} - \frac{T(p+k)}{2} \log(2\pi).
$$

A likelihood method [1] is used to estimate the state dimension K. The noise vectors e_{s,t} and e_{o,t} are Gaussian distributed with mean 0 and covariance matrices Q and R, respectively. From the state and observation equations (10) and (11), we notice that the matrix-valued parameter D = [D_{i,j}], i, j = 1, ..., N, quantifies the influence between genes i and j from one time instant to the next, within a specific regime.
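The generative model in (10) and (11) is straightforward to simulate, which is a useful sanity check before estimation. All parameter values below (A, B, C, D, Q, R and the dimensions) are hypothetical, chosen only to illustrate a single D entry coupling one gene to another:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, T = 4, 2, 30               # genes, hidden state dimension, time points

# Hypothetical parameter values, for illustration only; in the paper
# A, B, C, D, Q, R are estimated by EM (Section 5).
A = 0.5 * np.eye(K)              # state dynamics
B = rng.normal(0, 0.1, (K, N))   # expression-to-state input matrix
C = rng.normal(0, 0.5, (N, K))   # state-to-expression output matrix
D = np.zeros((N, N))
D[1, 0] = 0.9                    # gene 0 influences gene 1 at the next step
Q = 0.01 * np.eye(K)             # state noise covariance
R = 0.01 * np.eye(N)             # observation noise covariance

x = np.zeros(K)
g = rng.normal(0, 1, N)
traj = [g]
for t in range(T - 1):
    x = A @ x + B @ g + rng.multivariate_normal(np.zeros(K), Q)    # (10)
    g = C @ x + D @ g + rng.multivariate_normal(np.zeros(N), R)    # (11)
    traj.append(g)
traj = np.asarray(traj)          # T x N matrix of simulated expression
print(traj.shape)                # (30, 4)
```

Fitting the EM procedure of Section 5 to trajectories simulated this way and checking that the nonzero entry of D is recovered is a standard way to validate an implementation.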
To infer a biological network using D, we use bootstrapping to estimate the distribution of the strength-of-association estimates amongst genes and infer network linkage for those associations that are observed to be significant. Within this proposed framework, we segment the overall gene expression time trajectories into smaller, approximately stationary, gene expression regimes. We note that the MoG clustering framework is a nonlinear one, in that the regime-specific state space is partitioned into clusters. These cluster assignments of correlated gene expression vectors can change with regime, allowing us to capture the sets of genes that interact under changing cell condition.

5. SYSTEM IDENTIFICATION

We consider the case where we have R_g = B × P realizations of expression data for each gene available. Arguably, mRNA level is a measure of gene expression; B (= 2) denotes the number of biological replicates, and P (= 16 perfect-match probes) denotes the number of probes per gene transcript. Each of these R_g realizations is T time points long and is obtained from Affymetrix U74Av2 murine microarray raw CEL files. In the section below, we derive the update equations for maximum-likelihood estimates of the parameters A, B, C, D, Q, and R (in (10) and (11)) using an EM algorithm, based on [17, 18]. The assumptions underlying this model are outlined in Table 4. A sequence of T output vectors (g_1, g_2, ..., g_T) is denoted by {g}, and a subsequence {g_{t_0}, g_{t_0+1}, ..., g_{t_1}} by {g}_{t_0}^{t_1}. We treat the (x_t, g_t) vector as the complete data and find the log-likelihood log P({x}, {g}) under the above assumptions. The complete E- and M-steps involved in the parameter updates are outlined in Tables 5 and 6.

6. BOOTSTRAPPED CONFIDENCE INTERVALS

As suggested above, the entries of the D matrix indicate the strength of influence among the genes, from one time step to the next (within each regime).
We use bootstrapping to find confidence intervals for each entry in the D matrix and, if it is significant, we assign a positive or negative direction (+1 or −1) to this influence. The bootstrapping procedure [19] is adapted to our situation as follows.

(i) Suppose there are R regimes in the data with change points (c_1, c_2, ..., c_R) identified from SSA. For the rth regime, generate B independent bootstrap samples of size N (the original number of genes under consideration), (Y*_1, Y*_2, ..., Y*_B), from the original data, by random resampling from g^{(i)} = [g_{c_r}^{(i)}, ..., g_{c_{r+1}}^{(i)}]^T.

(ii) Using the EM algorithm for parameter estimation, estimate the value of D (the influence parameter). Denote the estimate of D for the ith bootstrap sample by D_i*.

(iii) Compute the sample mean and sample variance of the estimates of D over all the B bootstrap samples. That is,

$$
\text{mean} = \bar{D}^* = \frac{1}{B} \sum_{i=1}^{B} D_i^*, \qquad
\text{variance} = \frac{1}{B-1} \sum_{i=1}^{B} \big( D_i^* - \bar{D}^* \big)^2.
\tag{12}
$$

(iv) Using the above obtained sample mean and variance, estimate confidence intervals for the elements of D. If D lies in this bootstrapped confidence interval, we infer a potential influence; if not, we discard it.

Note that even though we write D, we carry out this hypothesis test for each D_{i,j}, i = 1, ..., n; j = 1, ..., n, for each of the n genes under consideration in every regime.

Table 5: M-step of the EM algorithm for state-space parameter estimation. The (≡) symbol indicates a definition. The table gives the update equations for the initial state mean π_1, the initial state covariance V_1, the output matrix C, the input-to-observation matrix D, the output noise covariance R, the state dynamics matrix A, the input-to-state matrix B, and the state noise covariance Q.

Table 6: E-step of the EM algorithm for state-space parameter estimation. The (≡) symbol indicates a definition. The table gives the forward (filtering) recursions for the predicted and filtered state means and covariances and the Kalman gain K_t, and the backward (smoothing) recursions for the smoother gain J_{t−1}, the smoothed state means and covariances, and the cross-covariances P_t and P_{t,t−1}.

7. SUMMARY OF ALGORITHM

Within each regime identified by CPD, we model gene expression as Gaussian distributed vectors. We cluster the genes using a mixture-of-Gaussians (MoG) clustering algorithm [7] to identify sets of genes which have similar "dynamics of expression," in that they are correlated within that regime. We then proceed to learn the dynamic system parameters (matrices A, B, C, D, Q, and R) for the state-space model (SSM) underlying each of the clusters. We note two important ideas: (i) we might obtain different cluster assignments for the genes depending on the regime; (ii) since all these genes (across clusters within a regime) are still related to the same biological process, the hidden state x_t is shared among these clusters. Therefore, we learn the SSM parameters in an alternating manner by updating the estimates from cluster to cluster while still retaining the form of the state vector x_t. The learning is done using an expectation-maximization-type algorithm. The number of components during regime-specific clustering is estimated using a minimum message length criterion. Typically, O(N) iterations suffice to infer the mixture model in each regime with N genes under consideration.

The discussion of the network inference procedure would be incomplete in the absence of any other algorithms for comparison. For this purpose, we implement the CoD- (coefficient-of-determination-) based approach [20, 21] along with the models proposed in [1] (SSM) and [22] (GGM). The CoD method allows us to determine the association between two genes within a regime via an R² goodness-of-fit statistic. The methods of [1, 22] are implemented on the time-series data (with regard to the underlying regime). Such a study is useful for determining the relative merits of each approach. We believe that no one procedure can work for every application, and the choice of an appropriate procedure is governed by the biological question under investigation. Each of these methods uses some underlying assumptions, and if these are consistent with the question that we ask, then that method has great utility. These individual results, their evaluation, and their comparison are summarized in Section 8.

Thus, our proposed approach is as follows.
(i) Identify the N key genes based on the required phenotypic characteristics using fold-change studies. Preprocess the gene expression profiles by standardization and cubic-spline interpolation.

(ii) Segment each gene's expression profile into a sequence of state-dependent trajectories (regime change points), from the underlying dynamics, using SSA.

(iii) For each regime (as identified in step (ii)), cluster genes using an MoG model so that genes with correlated expression trajectories cluster together. Learn an SSM [17, 18] for each cluster within that regime (from (10) and (11), for estimation of the mean and covariance matrices of the state vector). The input-to-observation matrix (D) is indicative of the topology of the network in that regime.

(iv) Examine the network matrices D (by bootstrapping, to find thresholds on the strength-of-influence estimates) across all regimes to build the time-varying network.

8. RESULTS

8.1. Application to the GATA pathway

To illustrate our approach (regime-SSM), we consider the embryonic kidney gene expression dataset [8] and study the set of genes known to have a possible role in early nephric development. An interruption of any gene in this signaling cascade potentially leads to early embryonic lethality or abnormal organ development. An influence network among these genes would reveal which genes (and their products) become important at a certain phase of nephric development. The choice of the N (= 47) genes is made using FDR fold-change studies [23] between ureteric bud and metanephric mesenchyme tissue types, since this spatial tissue expression is of relevance during early embryonic development. The dataset is obtained by daily sampling of the mRNA expression ranging from 11.5–16.5 days post coitus (dpc). Detailed studies of the phenotypes characterizing each of these days are available from the Mouse Genome Informatics Database at http://www.informatics.jax.org/.
We follow [24] and use interpolated expression data preprocessing for cluster analysis. We resample the interpolated profile to obtain twenty points per gene expression profile. Two key aspects were confirmed after interpolation [24, 25]: (1) no negative expression values were introduced, and (2) the differences in fold change were not smoothed out. Initial experimental studies have suggested that the 10.5–12.5 dpc window is relatively more important in determining the course of metanephric development. We chose to explore which genes (out of the 47 considered) might be relevant in this specific time window. The SSA-CPD procedure identified several genes which exhibit similar dynamics (approximately the same change points, for any given regime) in the early phase and distinctly different dynamics in later phases (Table 1). Our approach to influence determination using the state-space model yields up to three distinct regimes of expression over all the 47 genes identified from fold change studies between bud and mesenchyme. MoG clustering followed by state-space modeling yields three regime topologies, of which we are interested in the early regime (days 10.5–12.5). This influence topology is shown in Figure 1. We compare our obtained network (using regime-SSM) with the one obtained using the approach outlined in [1], shown in Figure 2.

[Figure 1: Network topology over regimes (solid lines represent the first regime, and the dotted lines indicate the second regime).]
[Figure 2: Steady-state network inferred over all time, using [1].]
[Figure 3: Steady-state network inferred using CoD (solid lines represent the first regime, and the dotted lines indicate the second regime).]
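The resampling and the two post-interpolation checks above can be sketched as follows; linear interpolation is used here as a stand-in for the cubic spline of the paper, and the 10% fold-change tolerance is an illustrative choice:

```python
import numpy as np

def resample_profile(t, y, n_points=20):
    """Resample an expression profile onto a uniform grid of n_points
    and verify the two conditions from the text: (1) no negative
    expression values introduced, (2) fold change not smoothed out."""
    t_new = np.linspace(t[0], t[-1], n_points)
    y_new = np.interp(t_new, t, y)     # stand-in for cubic spline
    assert y_new.min() >= 0, "negative expression values introduced"
    fold_old = y.max() / y.min()
    fold_new = y_new.max() / y_new.min()
    assert fold_new >= 0.9 * fold_old, "fold change smoothed out"
    return t_new, y_new
```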
We note that the network presented in Figure 2 extends over all time, that is, days 10.5–16.5, for which basal influences are represented but transient, condition-specific influences may be missed. Some of these transient influences are recaptured by our method (Figure 1) and are in conformity (fewer false positives in network connectivity) with pathway entries in Entrez Gene [15] as well as with recent reviews on kidney expression [8, 12] (also, see Table 8). For example, the Mapk1-Rara [26] and Pax2-Gdf11 [27] interactions are completely missed in Figure 2; this is because these interactions only occur during the 10.5–12.5 dpc regime. We also see that the Acvr2b-Lamc2 [28] interaction is observed in the steady state but not in the first regime. This interaction becomes active in the second regime (first via the Acvr2b-Gdf11 interaction and then via Gdf11-Lamc2), indicating that it might not have particular relevance in the 10.5–12.5 dpc stage. Several of these predicted interactions need to be experimentally characterized in the laboratory. It is especially interesting to see the Rara gene in this network, because it is known that Gata3 [29, 30] has tissue-specific expression in some cells of the developing eye. Also, Gdf11 exhibits growth factor activity and is extremely important during organ formation. In Figure 3, we give the results of the CoD approach to network inference. Here the Gata3-Pax2 interaction seems reversed and counterintuitive. Some of the interactions (e.g., Pax2-Gata3) can be seen here (via other nodes: Mapk1-Wnt11), but there is a need to resolve cycles (Ros1-Wnt11-Mapk1) and feedback/feedforward loops (Bmp7-Gata3-Wnt11). Both of these topologies can convey potentially useful information about nephric development. Thus a potentially useful way to combine these two methods is to “seed” the network using CoD and then try to resolve cycles using regime-SSM.
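The CoD statistic used for Figure 3 can be illustrated with a simple regression-based R²; the least-squares linear predictor below is an illustrative choice, since the framework of [20, 21] admits general nonlinear predictors:

```python
import numpy as np

def cod(x, y):
    """Coefficient of determination: relative improvement of a
    least-squares linear predictor of y from x over the best
    constant predictor (the mean of y)."""
    e0 = np.mean((y - y.mean()) ** 2)      # error of constant predictor
    a, b = np.polyfit(x, y, 1)             # fitted linear predictor
    e1 = np.mean((y - (a * x + b)) ** 2)   # its residual error
    return (e0 - e1) / e0
```

A CoD near 1 indicates that one gene's expression is strongly predictive of the other's within the regime; thresholding pairwise CoD values yields candidate edges.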
[Figure 4: Steady-state network inferred using SSM (solid lines represent the first regime, and the dotted lines indicate the second regime).]

8.2. T-cell activation

The regime-SSM network is shown in Figure 4. The corresponding network learnt in each regime using CoD is also shown (Figure 5). A study of this network using GGM (for the whole time-series data) is already available in [22]. Though there are several interactions of interest discovered by both the SSM and CoD procedures, we point out a few. It is already known that synergistic interactions between IL-6 and IL-1 are involved in T-cell activation [31]. IL-2 receptor transcription is affected by EGR1 [32]. An examination of the topology of these two networks (CoD and SSM) indicates some matches and is worth pursuing for experimental investigation. However, as already alluded to above, we have to find a way to resolve cycles in the CoD network [33]. Several of these match the interactions reported in [1, 22]. However, the additional information that we can glean is that some of the key interactions occur during the “early response” to stimulation and some occur subsequently (interleukin-6-mediated T-cell activation) in the “late phase.” An examination of the gene ontology (GO) terms represented in each cluster as well as the functional annotations in Entrez Gene shows concordance with literature findings (Table 9).

[Figure 6: Steady-state network inferred using GGMs.]

Because this dataset has been the subject of several interesting investigations, it would be ideal to ask other questions related to network inference procedures, for the purpose of comparison. One of the primary questions we seek
to answer is how the network inference procedure performs if a subsampled trajectory is used instead. In Table 7, the performances of the CoD and SSM algorithms are summarized. Using the T-cell (10 points, 44 replicates) data, we infer a network using the SSM procedure. With the identified edges as the gold standard for comparison, we then run SSM network inference on an undersampled version of this time series (5 points, 44 replicates) and check for any new edges (fnew) or deleted edges (flost). Ideally, we would want both these numbers to be zero: fnew is the number of new edges added relative to the original set, and flost is the number of edges lost from the original-data network, over both regimes. Further, we interpolate this undersampled data to 10 points and carry out network inference. This is done for each of the identified regimes. The same is done for the CoD method.

[Figure 5: Steady-state network inferred using CoD (solid lines represent the first regime, and the dotted lines indicate the second regime).]

We note that this is not a comparison between SSM and CoD (they work with very different assumptions), but of the effect of undersampling the data and subsequently interpolating the undersampled data to the original data length (via resampling). Table 7 suggests that, as expected, there is a degradation in performance (SSM/CoD) in the absence of all the available information. However, it is preferable to infer some false positives rather than to lose true-positive edges. This also indicates that interpolated data does no worse than undersampled data in terms of true positives (flost). We make three observations regarding this method of network inference. (i) It is not necessary for the target gene (Gata2/Gata3) to be present as part of the inferred network.
We can obtain insight into the mechanisms underlying transcription in each regime as long as some genes with coexpression dynamics similar to the target gene(s) are present in the inferred network. (ii) Probe-level observations from a small number of biological replicates seem to be very informative for network inference. This is because the LDS parameter estimation algorithm uses these multiple expression realizations to iteratively estimate the state mean, covariance, and other parameters, notably D [17]. Hence, in spite of the few time points, we can use multiple measurements (biological, technical, and probe-level replicates) for reliable network inference. This follows similar observations in [34] that probe-level replicates are very useful for understanding intergene relationships. (iii) Following [24], it would seem that several network hypotheses can individually explain the time-evolution behavior captured by the expression data. The LDS parameter estimation procedure seeks a maximum-likelihood (ML) estimate of the system parameters A, B, C, and D and then uses bootstrapping to infer only high-confidence interactions. This ML estimation uses an EM algorithm with multiple starts to avoid initialization-related issues [17], and thus finds the “most consistent” hypothesis explaining the evolution of the expression data. It is this network hypothesis that we investigate. Since this network already contains our gene of interest, Gata3, we can proceed to verify these interactions from the literature and experimentally.

9. DISCUSSION

One of the primary motivations for computational inference of state-specific gene influence networks is the understanding of transcriptional regulatory mechanisms [36]. The networks inferred via this approach are fairly general, and thus there is a need to “decompose” them into transcriptional, signal-transduction, or metabolic components using a combination of biological knowledge and chemical kinetics.
Depending on the insights expected, the tools for dissection of these predicted influences might vary. For comparison, we additionally investigated a graphical Gaussian model (GGM) approach, as suggested in [35], using partial correlation as a metric to quantify influence (Figure 6). This method works for short time-series data, but we could not find a way to incorporate previous expression values as inputs to the evolution of the state or of individual observations, something we could do explicitly in the state-space approach. However, we are now in the process of examining the networks inferred by the GGM approach over the regimes that we have identified from SSA. Again, we observe that the network connections reflect a steady-state behavior and that transient (state-specific) changes in influence are not fully revealed. The same is observed in the case of the T-cell data, from the results reported in [22]. A comparison of all the presented methods, along with regime-SSM, is presented in Table 10. The comparisons are based

Table 8: Functional annotations (Entrez Gene) of some of the genes coclustered with Gata2 and Gata3.
Gene symbol | Gene name | Possible role in nephrogenesis (function)
Bmp7 | Bone morphogenetic protein | Cell signaling
Rara | Retinoic acid receptor | Retinoic acid pathway, related to eye phenotype
Gata2 | GATA binding protein 2 | Hematopoiesis, urogenital development
Gata3 | GATA binding protein 3 | Hematopoiesis, urogenital development
Pax2 | Paired homeobox-2 | Direct target of Gata2
Lamc2 | Laminin | Cell adhesion molecule
Npnt | Nephronectin | Cell adhesion molecule
Ros1 | Ros1 proto-oncogene | Signaling, epithelial differentiation
Ptprd | Protein tyrosine phosphatase | Cell adhesion
Ret-Gdnf | Ret proto-oncogene, glial neurotrophic factor | Metanephros development
Gdf11 | Growth/differentiation factor | Role in growth factor activity, cell adhesion
Mapk1 | Mitogen-activated protein kinase 1 | Cell-cell signaling and adhesion
Kcnj8 | Potassium inwardly rectifying channel, subfamily J, member 8 | Potassium ion transport
Acvr2b | Activin receptor IIB | Transforming growth factor-beta receptor activity

Table 9: Functional annotations of some of the coclustered genes (early and late responses) following T-cell activation.
Gene symbol | Gene name | Possible role in T-cell activation (function)
CD69 | CD69 antigen | Early T-cell activation antigen
Mcl1 | Myeloid cell leukemia sequence 1 (BCL2-related) | Mediates cell proliferation and survival
IL6 | Interleukin 6 | Accessory factor signaling
LAT | Linker for activation of T cells | Membrane adapter protein involved in T-cell activation
EGR1 | Early growth response gene 1 | Activates nFKB signaling
CDC2 | Cell division control protein 2 | Involved in cell-cycle control
Casp7 | Caspase 7 | Involved in apoptosis
JunD | Jun D proto-oncogene | Regulatory role in T lymphocyte proliferation and Th cell differentiation
CKR1 | Chemokine receptor 1 | Negative regulator of the antiviral CD8+ T-cell response
CYP19A1 | Cytochrome P450, member 19 | Cell proliferation
Intgam | Integrin alpha M | Mediates phagocytosis-induced apoptosis
nFKB | nFKB protein | Signal transduction activity
IL2Rg | Interleukin-2 receptor gamma | Signaling activity
Pde4b | Phosphodiesterase 4B, cAMP-specific | Mediator of cellular response to extracellular signals
Mcp1 | Monocyte chemotactic protein 1 | Cytokine gene involved in immunoregulation
CCNA2 | Cyclin A2 | Involved in cell-cycle control

Table 7: Results of network inference on original, subsampled, and interpolated data.

Method (T-cell data) | Edges inferred | fnew | flost
SSM on original data | 14 | — | —
SSM on undersampled data | — | 3 | 3
SSM on interpolated data | — | 4 | 2
CoD on original data | 12 | — | —
CoD on undersampled data | — | 3 | 2
CoD on interpolated data | — | 4 | 2

on whether these frameworks permit the inference of directional influences, regime specificity, resolution of cycles, and modeling of higher lags.

10. CONCLUSIONS

In this work, we have developed an approach (regime-SSM) to infer the time-varying nature of gene influence network topologies, using gene expression data. The proposed approach integrates change-point detection to delineate phases

Table 10: Comparison of various network inference methods (Y: Yes, N: No).
Method | Direction | Regime-specific | Resolve cycles | Higher lags (> 1) | Nonlinear/locally linear
CoD [20, 21] | Y | Y | N | N | Y
GGM [35] | Y | N | N | N | Y
SSM [1] | Y | N | Y | Y | Y
Regime-SSM | Y | Y | Y | Y | Y

of gene coexpression, MoG clustering implying possible coregulation, and network inference amongst the regime-specific coclustered genes using a state-space framework. We can thus incorporate condition specificity of gene expression dynamics for understanding gene influences. Comparison of the proposed approach with other current procedures such as GGM or CoD reveals some of its strengths and suggests that it would complement existing approaches well (Table 10). We believe that this approach, in conjunction with sequence and transcription factor binding information, can give very valuable clues toward understanding the mechanisms of transcriptional regulation in higher eukaryotes.

ACKNOWLEDGMENTS

The authors gratefully acknowledge the support of the NIH under Award 5R01-GM028896-21 (JDE). The authors also thank the three anonymous reviewers for constructive comments to improve this manuscript. The material in this paper was presented in part at the IEEE International Workshop on Genomic Signal Processing and Statistics 2005 (GENSIPS05).

REFERENCES

[1] C. Rangel, J. Angus, Z. Ghahramani, et al., “Modeling T-cell activation using gene expression profiling and state-space models,” Bioinformatics, vol. 20, no. 9, pp. 1361–1372, 2004.
[2] B.-E. Perrin, L. Ralaivola, A. Mazurie, S. Bottani, J. Mallet, and F. D’Alché-Buc, “Gene networks inference using dynamic Bayesian networks,” Bioinformatics, vol. 19, supplement 2, pp. II138–II148, 2003.
[3] N. M. Luscombe, M. M. Babu, H. Yu, M. Snyder, S. A. Teichmann, and M. Gerstein, “Genomic analysis of regulatory network dynamics reveals large topological changes,” Nature, vol. 431, no. 7006, pp. 308–312, 2004.
[4] E. Sontag, A. Kiyatkin, and B. N.
Kholodenko, “Inferring dynamic architecture of cellular networks using time series of gene expression, protein and metabolite data,” Bioinformatics, vol. 20, no. 12, pp. 1877–1886, 2004.
[5] S. Kim, H. Li, D. Russ, et al., “Context-sensitive probabilistic Boolean networks to mimic biological regulation,” in Proceedings of Oncogenomics, Phoenix, Ariz, USA, January–February 2003.
[6] H. Li, C. L. Wood, Y. Liu, T. V. Getchell, M. L. Getchell, and A. J. Stromberg, “Identification of gene expression patterns using planned linear contrasts,” BMC Bioinformatics, vol. 7, p. 245, 2006.
[7] M. A. T. Figueiredo and A. K. Jain, “Unsupervised learning of finite mixture models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381–396, 2002.
[8] R. O. Stuart, K. T. Bush, and S. K. Nigam, “Changes in gene expression patterns in the ureteric bud and metanephric mesenchyme in models of kidney development,” Kidney International, vol. 64, no. 6, pp. 1997–2008, 2003.
[9] M. Khandekar, N. Suzuki, J. Lewton, M. Yamamoto, and J. D. Engel, “Multiple, distant Gata2 enhancers specify temporally and tissue-specific patterning in the developing urogenital system,” Molecular and Cellular Biology, vol. 24, no. 23, pp. 10263–10276, 2004.
[10] N. Golyandina, V. Nekrutkin, and A. Zhigljavsky, Analysis of Time Series Structure—SSA and Related Techniques, Chapman & Hall/CRC, New York, NY, USA, 2001.
[11] V. Moskvina and A. Zhigljavsky, “An algorithm based on singular spectrum analysis for change-point detection,” Communications in Statistics Part B: Simulation and Computation, vol. 32, no. 2, pp. 319–352, 2003.
[12] K. Schwab, L. T. Patterson, B. J. Aronow, R. Luckas, H.-C. Liang, and S. S. Potter, “A catalogue of gene expression in the developing kidney,” Kidney International, vol. 64, no. 5, pp. 1588–1604, 2003.
[13] Y. Zhou, K.-C. Lim, K.
Onodera, et al., “Rescue of the embryonic lethal hematopoietic defect reveals a critical role for GATA-2 in urogenital development,” The EMBO Journal, vol. 17, no. 22, pp. 6689–6700, 1998.
[14] G. A. Challen, G. Martinez, M. J. Davis, et al., “Identifying the molecular phenotype of renal progenitor cells,” Journal of the American Society of Nephrology, vol. 15, no. 9, pp. 2344–2357, 2004.
[15] NCBI Pubmed, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi.
[16] H. H. Zadeh, S. Tanavoli, D. D. Haines, and D. L. Kreutzer, “Despite large-scale T cell activation, only a minor subset of T cells responding in vitro to Actinobacillus actinomycetemcomitans differentiate into effector T cells,” Journal of Periodontal Research, vol. 35, no. 3, pp. 127–136, 2000.
[17] Z. Ghahramani and G. E. Hinton, “Parameter estimation for linear dynamical systems,” Tech. Rep., University of Toronto, Toronto, Ontario, Canada, 1996.
[18] R. H. Shumway and D. S. Stoffer, Time Series Analysis and Its Applications, Springer Texts in Statistics, Springer, New York, NY, USA, 2000.
[19] B. Efron, An Introduction to the Bootstrap, Chapman & Hall/CRC, New York, NY, USA, 1993.
[20] E. R. Dougherty, S. Kim, and Y. Chen, “Coefficient of determination in nonlinear signal processing,” Signal Processing, vol. 80, no. 10, pp. 2219–2235, 2000.
[21] S. Kim, E. R. Dougherty, M. L. Bittner, et al., “General nonlinear framework for the analysis of gene interaction via multivariate expression arrays,” Journal of Biomedical Optics, vol. 5, no. 4, pp. 411–424, 2000.
[22] R. Opgen-Rhein and K. Strimmer, “Using regularized dynamic correlation to infer gene dependency networks from time-series microarray data,” in Proceedings of the 4th International Workshop on Computational Systems Biology (WCSB ’06), Tampere, Finland, June 2006.
[23] A. O. Hero III, G. Fleury, A. J. Mears, and A.
Swaroop, “Multicriteria gene screening for analysis of differential expression with DNA microarrays,” EURASIP Journal on Applied Signal Processing, vol. 2004, no. 1, pp. 43–52, 2004, special issue on genomic signal processing.
[24] Z. Bar-Joseph, “Analyzing time series gene expression data,” Bioinformatics, vol. 20, no. 16, pp. 2493–2503, 2004.
[25] A. Kundaje, O. Antar, T. Jebara, and C. Leslie, “Learning regulatory networks from sparsely sampled time series expression data,” Tech. Rep., Columbia University, New York, NY, USA, 2002.
[26] J. E. Balmer and R. Blomhoff, “Gene expression regulation by retinoic acid,” Journal of Lipid Research, vol. 43, no. 11, pp. 1773–1808, 2002.
[27] A. F. Esquela and S.-J. Lee, “Regulation of metanephric kidney development by growth/differentiation factor 11,” Developmental Biology, vol. 257, no. 2, pp. 356–370, 2003.
[28] A. Maeshima, S. Yamashita, K. Maeshima, I. Kojima, and Y. Nojima, “Activin A produced by ureteric bud is a differentiation factor for metanephric mesenchyme,” Journal of the American Society of Nephrology, vol. 14, no. 6, pp. 1523–1534, 2003.
[29] M. Mori, N. B. Ghyselinck, P. Chambon, and M. Mark, “Systematic immunolocalization of retinoid receptors in developing and adult mouse eyes,” Investigative Ophthalmology and Visual Science, vol. 42, no. 6, pp. 1312–1318, 2001.
[30] K.-C. Lim, G. Lakshmanan, S. E. Crawford, Y. Gu, F. Grosveld, and J. D. Engel, “Gata3 loss leads to embryonic lethality due to noradrenaline deficiency of the sympathetic nervous system,” Nature Genetics, vol. 25, no. 2, pp. 209–212, 2000.
[31] H. Mizutani, L. T. May, P. B. Sehgal, and T. S. Kupper, “Synergistic interactions of IL-1 and IL-6 in T cell activation. Mitogen but not antigen receptor-induced proliferation of a cloned T helper cell line is enhanced by exogenous IL-6,” Journal of Immunology, vol. 143, no. 3, pp. 896–901, 1989.
[32] J.-X. Lin and W. J.
Leonard, “The immediate-early gene product Egr-1 regulates the human interleukin-2 receptor β-chain promoter through noncanonical Egr and Sp1 binding sites,” Molecular and Cellular Biology, vol. 17, no. 7, pp. 3714–3722, 1997.
[33] M. J. Herrgård, M. W. Covert, and B. Ø. Palsson, “Reconciling gene expression data with known genome-scale regulatory network structures,” Genome Research, vol. 13, no. 11, pp. 2423–2434, 2003.
[34] C. Li and W. H. Wong, “Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 1, pp. 31–36, 2001.
[35] J. Schäfer and K. Strimmer, “An empirical Bayes approach to inferring large-scale gene association networks,” Bioinformatics, vol. 21, no. 6, pp. 754–764, 2005.
[36] A. Rao, A. O. Hero III, D. J. States, and J. D. Engel, “Inference of biologically relevant gene influence networks using the directed information criterion,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’06), vol. 2, pp. 1028–1031, Toulouse, France, May 2006.

Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 32454, 15 pages
doi:10.1155/2007/32454

Research Article

Inference of a Probabilistic Boolean Network from a Single Observed Temporal Sequence

Stephen Marshall,1 Le Yu,1 Yufei Xiao,2 and Edward R. Dougherty2,3,4
1 Department of Electronic and Electrical Engineering, Faculty of Engineering, University of Strathclyde, Glasgow G1 1XW, UK
2 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA
3 Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004, USA
4 Department of Pathology, University of Texas M. D.
Anderson Cancer Center, Houston, TX 77030, USA

Received 10 July 2006; Revised 29 January 2007; Accepted 26 February 2007

Recommended by Tatsuya Akutsu

The inference of gene regulatory networks is a key issue for genomic signal processing. This paper addresses the inference of probabilistic Boolean networks (PBNs) from observed temporal sequences of network states. Since a PBN is composed of a finite number of Boolean networks, a basic observation is that the characteristics of a single Boolean network without perturbation may be determined by its pairwise transitions. Because the network function is fixed and there are no perturbations, a given state will always be followed by a unique state at the succeeding time point. Thus, a transition counting matrix compiled over a data sequence will be sparse and contain only one entry per row. If the network also has perturbations, with small perturbation probability, then the transition counting matrix will have some insignificant nonzero entries replacing some (or all) of the zeros. If a data sequence is sufficiently long to adequately populate the matrix, then determination of the functions and inputs underlying the model is straightforward. The difficulty comes when the transition counting matrix consists of data derived from more than one Boolean network. We address the PBN inference procedure in several steps: (1) separate the data sequence into “pure” subsequences corresponding to constituent Boolean networks; (2) given a subsequence, infer a Boolean network; and (3) infer the perturbation probability, the probability of there being a switch between constituent Boolean networks, and the selection probabilities governing which network is to be selected given a switch. Capturing the full dynamic behavior of probabilistic Boolean networks, be they binary or multivalued, will require the use of temporal data, and a great deal of it.
This should not be surprising given the complexity of the model and the number of parameters, both transitional and static, that must be estimated. In addition to providing an inference algorithm, this paper demonstrates that the data requirement is much smaller if one does not wish to infer the switching, perturbation, and selection probabilities, and that constituent-network connectivity can be discovered with decent accuracy for relatively small time-course sequences.

Copyright © 2007 Stephen Marshall et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

A key issue in genomic signal processing is the inference of gene regulatory networks [1]. Many methods have been proposed, and these are specific to the network model, for instance, Boolean networks [2–5], probabilistic Boolean networks [6–9], and Bayesian networks [10–12], the latter being related to probabilistic Boolean networks [13]. The manner of inference depends on the kind of data available and the constraints one imposes on the inference. For instance, patient data do not consist of time-course measurements and are assumed to come from the steady state of the network, so that inference procedures cannot be expected to yield networks that accurately reflect dynamic behavior. Instead, one might just hope to obtain a set of networks whose steady-state distributions are concordant, in some way, with the data. Since inference involves selecting a network from a family of networks, it can be beneficial to constrain the problem by placing restrictions on the family, such as limited attractor structure and limited connectivity [5]. Alternatively, one might impose a structure on a probabilistic Boolean network that resolves inconsistencies in the data arising from the mixing of data from several contexts [9].
This paper concerns the inference of a probabilistic Boolean network (PBN) from a single temporal sequence of network states. Given a sufficiently long observation sequence, the goal is to infer a PBN that is a good candidate to have generated it. This situation is analogous to that of designing a Wiener filter from a single sufficiently long observation of a wide-sense stationary stochastic process. Here, we will be dealing with an ergodic process, so that all transitional relations will be observed numerous times if the observed sequence is sufficiently long. Should one have the opportunity to observe multiple sequences, these can be used individually in the manner proposed and the results combined to provide the desired inference. Note that we say we desire a good candidate, not the only candidate. Even with constraints and a long sequence, there are many PBNs that could have produced the sequence. This is typical in statistical inference. For instance, point estimation of the mean of a distribution identifies a single value as the candidate for the mean, and typically the probability of exactly estimating the mean is zero. What this paper provides, and what is being provided in other papers on network inference, is an inference procedure that generates a network that is, to some extent and in some way, consistent with the observed sequence. We will not delve into arguments about Boolean or probabilistic Boolean network modeling, these issues having been extensively discussed elsewhere [14–21]; however, we do note that PBN modeling is being used as a framework in which to apply control theory, in particular, dynamic programming, to design optimal intervention strategies based on the gene regulatory structure [22–25].
With current technology it is not possible to obtain sufficiently long data sequences to estimate the model parameters; however, in addition to using randomly generated networks, we will apply the inference to data generated from a PBN derived from a Boolean network model for the segment polarity genes in Drosophila melanogaster [26], this being done by assuming that some genes in the existing model cannot be observed, so that they become latent variables outside the observable model and therefore cause the kind of stochasticity associated with PBNs. It should be recognized that a key purpose of this paper is to present the PBN inference problem in a rigorous framework so that observational requirements become clear. In addition, it is hoped that a crisp analysis of the problem will lead to more approximate solutions based on the kind of temporal data that will become available; indeed, in this paper we propose a subsampling strategy that greatly mitigates the number of observations needed for the construction of the network functions and their associated regulatory gene sets.

2. PROBABILISTIC BOOLEAN NETWORKS

A Boolean network (BN) consists of a set of n variables, {x_0, x_1, . . . , x_{n−1}}, where each variable can take on one of two binary values, 0 or 1 [14, 15]. At any time point t (t = 0, 1, 2, . . .), the state of the network is defined by the vector x(t) = (x_0(t), x_1(t), . . . , x_{n−1}(t)). For each variable x_i, there exist a predictor set {x_{i0}, x_{i1}, . . . , x_{i,k(i)−1}} and a transition function f_i determining the value of x_i at the next time point,

x_i(t + 1) = f_i(x_{i0}(t), x_{i1}(t), . . . , x_{i,k(i)−1}(t)),  (1)

where 0 ≤ i0 < i1 < · · · < i,k(i)−1 ≤ n − 1. It is typically the case that, relative to the transition function f_i, many of the variables are nonessential, so that k(i) < n (or even k(i) ≪ n). Since the transition function is homogeneous in time, meaning that it is time invariant, we can simplify the notation by writing
x_i^+ = f_i(x_{i0}, x_{i1}, . . . , x_{i,k(i)−1}).  (2)

The n transition functions, together with the associated predictor sets, supply all the information necessary to determine the time evolution of the states of a Boolean network, x(0) → x(1) → · · · → x(t) → · · · . The set of transition functions constitutes the network function, denoted f = (f_0, . . . , f_{n−1}). Attractors play a key role in Boolean networks. Given a starting state, within a finite number of steps the network will transition into a cycle of states, called an attractor cycle (or, simply, attractor), and will continue to cycle thereafter. Nonattractor states are transient and are visited at most once on any network trajectory. The level of a state is the number of transitions required for the network to transition from that state into an attractor cycle. In gene regulatory modeling, attractors are often identified with phenotypes [16]. A Boolean network with perturbation (BNp) is a Boolean network altered so that, at any moment t, there is a probability P of randomly flipping a variable of the current state x(t) of the BN. An ordinary BN possesses a stationary distribution but, except in very special circumstances, does not possess a steady-state distribution. The state space is partitioned into sets of states called basins, each basin corresponding to the attractor into which its states will transition in due time. On the other hand, for a BNp there is the possibility of flipping from the current state into any other state at each moment. Hence, the BNp is ergodic as a random process and possesses a steady-state distribution. By definition, the attractor cycles of a BNp are the attractor cycles of the BN obtained by setting P = 0. A probabilistic Boolean network (PBN) consists of a finite collection of Boolean networks with perturbation over a fixed set of variables, where each Boolean network is defined by a fixed network function and all possess a common perturbation probability P [18, 20].
Moreover, at each moment, there is a probability q of switching out of the current Boolean network to a different constituent Boolean network, where each Boolean network composing the PBN has a probability (called the selection probability) of being selected. If q = 1, then a new network function is randomly selected at each time point and the PBN is said to be instantaneously random, the idea being to model uncertainty in model selection; if q < 1, then the PBN remains in a given constituent Boolean network until a network switch, and the PBN is said to be context-sensitive. The original introduction of PBNs considered only instantaneously random PBNs [18], and using this model PBNs were first used as the basis for applying control theory to derive optimal intervention strategies to drive network dynamics in favorable directions, such as away from metastatic states in cancer [22]. Subsequently, context-sensitive PBNs were introduced to model the randomizing effect of latent variables outside the network model, and this led to the development of optimal intervention strategies that take into account the effect of latent variables [23]. We defer to the literature for a discussion of the role of latent variables [1]. Our interest here is with context-sensitive PBNs, where q is assumed to be small, so that, on average, the network is governed by a constituent Boolean network for some amount of time before switching to another constituent network. The perturbation parameter p and the switching parameter q will be seen to have effects on the proposed network-inference procedure. By definition, the attractor cycles of a PBN are the attractor cycles of its constituent Boolean networks. While the attractor cycles of a single Boolean network must be disjoint, those of a PBN need not be disjoint, since attractor cycles from different constituent Boolean networks can intersect.
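A single time step of a context-sensitive PBN can likewise be sketched; the two constituent network functions and the parameter values below are illustrative assumptions, not models from the paper.

```python
import random

# Illustrative context-sensitive PBN over 2 variables with two constituent
# network functions (hypothetical, not from the paper).
NETWORKS = [
    lambda s: (s[1], s[0]),         # network function of BN 0: swap
    lambda s: (s[0] & s[1], s[1]),  # network function of BN 1
]

def pbn_step(state, current, q=0.01, p=0.01, selection=(0.5, 0.5), rng=random):
    """One step of a context-sensitive PBN: with probability q switch to a
    constituent network drawn according to the selection probabilities,
    with probability p perturb a randomly chosen variable, and otherwise
    apply the current constituent network function."""
    if rng.random() < q:
        current = rng.choices(range(len(NETWORKS)), weights=selection)[0]
    if rng.random() < p:
        i = rng.randrange(len(state))
        state = tuple(v ^ (j == i) for j, v in enumerate(state))
    else:
        state = NETWORKS[current](state)
    return state, current
```

With q = 0 and p = 0 this reduces to an ordinary BN governed by the current network function; small q produces the long pure subsequences exploited by the inference procedure of Section 4.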
Owing to the possibility of perturbation, a PBN is ergodic and possesses a steady-state distribution. We note that one can define a PBN without perturbation, but we will not do so. Let us close this section by noting that there is nothing inherently necessary about the quantization {0, 1} for a PBN; indeed, PBN modeling is often done with the ternary quantization corresponding to a gene being down-regulated (−1), up-regulated (1), or invariant (0). For any finite quantization the model is still referred to as a PBN. In this paper we stay with binary quantization for simplicity, but it should be evident that the methodology applies to any finite quantization, albeit with greater complexity.

3. INFERENCE PROCEDURE FOR BOOLEAN NETWORKS WITH PERTURBATION

We first consider the inference of a single Boolean network with perturbation. Once this is accomplished, our task in the context of PBNs will be reduced to locating the data in the observed sequence corresponding to the various constituent Boolean networks.

3.1. Inference based on the transition counting matrix and a cost function

The characteristics of a Boolean network, with or without perturbation, can be estimated by observing its pairwise state transitions, x(t) \to x(t + 1), where x(t) can be an arbitrary vector from the n-dimensional state space B^n = {0, 1}^n. The states in B^n are ordered lexicographically according to {00···0, 00···1, ..., 11···1}. Given a temporal data sequence x(0), ..., x(N), a transition counting matrix C can be compiled over the data sequence, showing the number c_{ij} of observed state transitions from the ith state to the jth state:

    C = \begin{bmatrix}
        c_{00} & c_{01} & \cdots & c_{0,2^n-1} \\
        c_{10} & c_{11} & \cdots & c_{1,2^n-1} \\
        \vdots & \vdots & \ddots & \vdots \\
        c_{2^n-1,0} & c_{2^n-1,1} & \cdots & c_{2^n-1,2^n-1}
    \end{bmatrix}.    (3)

If the temporal data sequence results from a BN without perturbations, then a given state will always be followed by a unique state at the next time point, and each row of the matrix C contains at most one nonzero value. A typical nonzero entry will correspond to a transition of the form a_0 a_1 ··· a_{n-1} \to b_0 b_1 ··· b_{n-1}. If {x_{i_0}, x_{i_1}, ..., x_{i,k(i)-1}} is the predictor set for x_i, then, because the variables outside this set have no effect on f_i, such a transition tells us that f_i(a_{i_0}, a_{i_1}, ..., a_{i,k(i)-1}) = b_i, and one row of the truth table defining f_i is obtained. The single transition a_0 a_1 ··· a_{n-1} \to b_0 b_1 ··· b_{n-1} thus gives one row of each transition function of the BN. Given the deterministic nature of a BN, we will not be able to sufficiently populate the matrix C from a single observed sequence because, from its initial state, the BN will transition into an attractor cycle and remain there. Therefore, we need to observe many runs from different initial states. For a BNp with small perturbation probability, C will likely have some nonzero entries replacing some (or all) of the 0 entries. Owing to perturbation and the consequent ergodicity, a sufficiently long data sequence will sufficiently populate the matrix to determine the entries caused by perturbation, as well as the functions and inputs underlying the model. A mapping x(t) \to x(t + 1) will have been derived linking pairs of state vectors. This mapping induces n transition functions determining the state of each variable at time t + 1 as a function of its predictors at time t, precisely as in (1) or (2). Given sufficient data, the functions and the sets of essential predictors may be determined by Boolean reduction. The task is facilitated by treating one variable at a time.
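The compilation of the transition counting matrix in (3) is direct; a minimal Python sketch, with states as n-bit tuples indexed lexicographically as in the text:

```python
def transition_counts(seq, n):
    """Compile the 2^n x 2^n transition counting matrix C of (3) from a
    temporal sequence of n-bit state tuples; states are indexed
    lexicographically, so (0,...,0) is row 0 and (1,...,1) is row 2^n - 1."""
    idx = lambda s: int("".join(map(str, s)), 2)
    size = 1 << n
    C = [[0] * size for _ in range(size)]
    for x, x_next in zip(seq, seq[1:]):
        C[idx(x)][idx(x_next)] += 1
    return C
```

For a pure, perturbation-free sequence every populated row ends up with a single nonzero entry, which is exactly the property the purity analysis of Section 4 exploits.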
Given any variable, x_i, and keeping in mind that some observed state transitions arise from random perturbations rather than from transition functions, we wish to find the k(i) variables that control x_i. The k(i) input variables that most closely correlate with the behavior of x_i will be identified as the predictors. Specifically, the next state of variable x_i is a function of k(i) variables, as in (2). The transition counting matrix will contain one single large value on each row (plus some "noise"). This value indicates the next state that follows the current state in the sequence. It is therefore possible to create a two-column next-state table with current-state column x_0 x_1 ··· x_{n-1} and next-state column x_0^+ x_1^+ ··· x_{n-1}^+, there being 2^n rows in the table; a typical entry looks like 00101 \to 11001 in the case of 5 variables. If the states are written in terms of their individual variables, then a mapping is produced from n variables to n variables, where the next state of any variable may be written as a function of all n input variables. The problem is to determine which subset consisting of k(i) out of the n variables is the minimal set needed to predict x_i, for i = 0, 1, ..., n − 1. We refer to the k(i) variables in the minimal predictor set as essential predictors. To determine the essential predictors for a given variable x_i, we will define a cost function. Assuming k variables are used to predict x_i, there are n!/((n − k)!k!) ways of choosing them, and each choice of k variables has a cost. By minimizing the cost function, we can identify k such that k = k(i), as well as the predictor set. In a Boolean network without perturbation, if the value of x_i is fully determined
by the predictor set {x_{i_0}, x_{i_1}, ..., x_{i,k-1}}, then that value will not change for different combinations of the remaining variables, which are nonessential insofar as x_i is concerned. Hence, so long as x_{i_0}, x_{i_1}, ..., x_{i,k-1} are fixed, the value of x_i^+ should remain 0 or 1, regardless of the values of the remaining variables, as illustrated in Table 1.

Table 1: Effect of essential variables. (All inputs with the same values of x_0, x_2, x_3 should result in the same output.)

    Current state (x_0 x_1 x_2 x_3 x_4)    Next state (x_0^+ x_1^+ x_2^+ x_3^+ x_4^+)
    0 0 1 1 0                              · 1 · · ·
    0 0 1 1 1                              · 1 · · ·
    0 1 1 1 0                              · 1 · · ·
    0 1 1 1 1                              · 1 · · ·

For any given realization (x_{i_0}, x_{i_1}, ..., x_{i,k-1}) = (a_{i_0}, a_{i_1}, ..., a_{i,k-1}), a_{i_j} \in {0, 1}, let

    u_{i_0,i_1,\ldots,i_{k-1}}\bigl(a_{i_0}, a_{i_1}, \ldots, a_{i,k-1}\bigr) = \sum_{x_{i_0}=a_{i_0},\ldots,x_{i,k-1}=a_{i,k-1}} x_i^+\bigl(x_0, x_1, \ldots, x_{n-1}\bigr).    (4)

According to this equation, u_{i_0,i_1,...,i_{k-1}}(a_{i_0}, a_{i_1}, ..., a_{i,k-1}) is the sum of the next-state values when x_{i_0}, x_{i_1}, ..., x_{i,k-1} are held fixed at a_{i_0}, a_{i_1}, ..., a_{i,k-1}, respectively. There will be 2^{n-k} rows in the next-state table where (x_{i_0}, x_{i_1}, ..., x_{i,k-1}) = (a_{i_0}, a_{i_1}, ..., a_{i,k-1}) while the other variables vary; thus, there will be 2^{n-k} terms in the summation. For instance, for the example in Table 1, when x_i = x_1, k = 3, i_0 = 0, i_1 = 2, and i_2 = 3, that is, x_1^+ = f_1(x_0, ∗, x_2, x_3, ∗), we have

    u_{0,2,3}(0, 1, 1) = x_1^+(0, 0, 1, 1, 0) + x_1^+(0, 0, 1, 1, 1) + x_1^+(0, 1, 1, 1, 0) + x_1^+(0, 1, 1, 1, 1).    (5)

The term u_{i_0,i_1,...,i_{k-1}}(a_{i_0}, ..., a_{i,k-1}) attains its maximum (2^{n-k}) or minimum (0) if the value of x_i^+ remains unchanged on the corresponding 2^{n-k} rows of the next-state table, which is the case in the above example. Hence, the k inputs are good predictors if u_{i_0,i_1,...,i_{k-1}}(a_{i_0}, ..., a_{i,k-1}) is close to either 0 or 2^{n-k}. The cost function is based on the quantity

    r_{i_0,\ldots,i_{k-1}}\bigl(a_{i_0}, \ldots, a_{i,k-1}\bigr)
        = u_{i_0,\ldots,i_{k-1}}\bigl(a_{i_0}, \ldots, a_{i,k-1}\bigr)\, I\Bigl(u_{i_0,\ldots,i_{k-1}}\bigl(a_{i_0}, \ldots, a_{i,k-1}\bigr) \le \tfrac{2^{n-k}}{2}\Bigr)
        + \Bigl(2^{n-k} - u_{i_0,\ldots,i_{k-1}}\bigl(a_{i_0}, \ldots, a_{i,k-1}\bigr)\Bigr)\, I\Bigl(u_{i_0,\ldots,i_{k-1}}\bigl(a_{i_0}, \ldots, a_{i,k-1}\bigr) > \tfrac{2^{n-k}}{2}\Bigr),    (6)

where I is the characteristic function: I(w) = 1 if w is true and I(w) = 0 if w is false. The term r_{i_0,...,i_{k-1}}(a_{i_0}, ..., a_{i,k-1}) is designed to be minimized when u_{i_0,...,i_{k-1}}(a_{i_0}, ..., a_{i,k-1}) is close to either 0 or 2^{n-k}. It represents the cost over one single realization of the variables x_{i_0}, x_{i_1}, ..., x_{i,k-1}. Therefore, we define the cost function R by summing the individual costs over all possible realizations of x_{i_0}, x_{i_1}, ..., x_{i,k-1}:

    R\bigl(x_{i_0}, x_{i_1}, \ldots, x_{i,k-1}\bigr) = \sum_{a_{i_0},a_{i_1},\ldots,a_{i,k-1} \in \{0,1\}} r_{i_0,i_1,\ldots,i_{k-1}}\bigl(a_{i_0}, a_{i_1}, \ldots, a_{i,k-1}\bigr).    (7)

The essential predictors for variable x_i are chosen to be the k variables that minimize the cost R(x_{i_0}, x_{i_1}, ..., x_{i,k-1}), and k is selected as the smallest integer achieving the minimum. We emphasize the smallest because, if k (k < n) variables can perfectly predict x_i, then adding one more variable also achieves the minimum cost. For small numbers of variables, the k inputs may be chosen by a full search, with the cost function being evaluated for every combination. For larger numbers of variables, genetic algorithms can be used to minimize the cost function. In some cases the next-state table is not fully defined owing to insufficient temporal data, which means that there are do-not-care outputs. Tests have shown that the input variables may still be identified correctly even with 90% of the data missing. Once the input set of variables is determined, it is straightforward to determine the functional relationship by Boolean minimization [27].
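The predictor search described above can be sketched by brute force: for each candidate predictor set, accumulate u over the rows agreeing with each realization (eq. (4)), fold it into the realization cost (eq. (6)), total the costs (eq. (7)), and keep the smallest set achieving the minimum. A minimal Python version, assuming a fully populated next-state table for a single variable x_i:

```python
from itertools import combinations, product

def cost(next_state, n, preds):
    """Cost R of (7) for one candidate predictor set `preds` of variable x_i.
    `next_state` maps every full n-bit state tuple to the observed next value
    of x_i.  For each realization a of the predictors, u of (4) sums x_i+
    over the 2^(n-k) states agreeing with a; the realization's cost r of (6)
    is u when u <= 2^(n-k)/2 and 2^(n-k) - u otherwise."""
    k = len(preds)
    total = 0
    for a in product((0, 1), repeat=k):
        u = sum(v for s, v in next_state.items()
                if all(s[p] == a[j] for j, p in enumerate(preds)))
        total += u if u <= 2 ** (n - k) / 2 else 2 ** (n - k) - u
    return total

def essential_predictors(next_state, n):
    """Smallest predictor set achieving zero cost (perfect prediction)."""
    for k in range(1, n + 1):
        best = min(combinations(range(n), k),
                   key=lambda c: cost(next_state, n, c))
        if cost(next_state, n, best) == 0:
            return best
    return tuple(range(n))
```

For example, if x_i^+ = x_0 XOR x_2 in a 3-variable network, the search returns (0, 2): no single variable suffices, while the pair drives u to 0 or 2^{n-k} on every realization.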
In many cases the observed data are insufficient to specify the behavior of the function for every combination of input variables; however, by setting the unknown states as do-not-care terms, an accurate approximation of the true function may be achieved. The task is simplified when the number k of input variables is small.

3.2. Complexity of the procedure

We now consider the complexity of the proposed inference procedure. The truth table consists of n genes and therefore has 2^n rows. We wish to identify the k predictors which best describe the behavior of each gene. Each gene has a total of C_k^n = n!/((n − k)!k!) possible sets of k predictors, and each of these sets of k predictors has 2^k different combinations of values. For every specific combination there are 2^{n-k} rows of the truth table; these are rows where the predictors are fixed but the values of the other (nonpredictor) genes change. These must be processed according to (5), (6), and (7). The individual terms in (5) are binary values, 0 or 1. The cost function in (7) is designed to be minimized when the terms in (5) are either all 0 or all 1, that is, when the sum is at either its minimum or its maximum value. Simulations have shown that this may be computed more efficiently by carrying out all pairwise comparisons of terms and recording the number of times they differ; hence, a summation is replaced by a computationally more efficient series of comparison operations. The number of pairs in a set of 2^{n-k} values is 2^{n-k-1}(2^{n-k} − 1). Therefore, the total number of comparisons for a given n and k is

    \xi_{n,k} = n \frac{n!}{(n-k)!\,k!}\, 2^k\, 2^{n-k-1}\bigl(2^{n-k} - 1\bigr) = n \frac{n!}{(n-k)!\,k!}\, 2^{n-1}\bigl(2^{n-k} - 1\bigr).    (8)

This expression gives the number of comparisons for a fixed value of k; if we wish to compute the number of comparisons for all numbers of predictors up to and including k, then this is given by

    \Xi_{n,k} = \sum_{j=1}^{k} n \frac{n!}{(n-j)!\,j!}\, 2^{n-1}\bigl(2^{n-j} - 1\bigr).    (9)

Values of \Xi_{n,k} are given in Table 2, and actual computation times, measured on an Intel Pentium 4 with a 2.0 GHz clock and 768 MB of RAM, are given in Table 3. The measured times are quite consistent, given the additional computational overheads not accounted for in (9). Even for 10 genes and up to 4 predictors, the computation time is less than 8 minutes. Because the procedure for one BN does not depend on any other BN, the inference of multiple BNs can be run in parallel, so that time complexity is not an issue.

Table 2: Values of \Xi_{n,k}.

           n = 5   n = 6    n = 7        n = 8        n = 9        n = 10       n = 15        n = 20        n = 30        n = 50
    k = 2  11430   86898    5.84 × 10^5  3.61 × 10^6  2.11 × 10^7  1.18 × 10^8  4.23 × 10^11  1.04 × 10^15  3.76 × 10^21  1.94 × 10^34
    k = 3  16480   141210   1.06 × 10^6  7.17 × 10^6  4.55 × 10^7  2.74 × 10^8  1.34 × 10^12  4.17 × 10^15  5.52 × 10^21  1.74 × 10^35
    k = 4  17545   159060   1.28 × 10^6  9.32 × 10^6  6.35 × 10^7  4.09 × 10^8  2.71 × 10^12  1.08 × 10^16  6.47 × 10^22  1.09 × 10^35

Table 3: Computation times.

           n = 5   n = 6   n = 7   n = 8   n = 9   n = 10   n = 11
    k = 2  < 1 s   < 1 s   < 1 s   2 s     12 s    69 s     476 s
    k = 3  < 1 s   < 1 s   < 1 s   6 s     36 s    214 s    2109 s
    k = 4  < 1 s   < 1 s   < 1 s   9 s     68 s    472 s    3097 s

4. INFERENCE PROCEDURE FOR PROBABILISTIC BOOLEAN NETWORKS

PBN inference is addressed in three steps: (1) split the temporal data sequence into subsequences corresponding to constituent Boolean networks; (2) apply the preceding inference procedure to each subsequence; and (3) infer the perturbation, switching, and selection probabilities. Having already treated the estimation of a BNp, in this section we address the first and third steps.

4.1. Determining pure subsequences

The first objective is to identify points within the temporal data sequence where there is a switch of constituent Boolean networks. Between any two successive switch points there will lie a pure temporal subsequence generated by a single constituent network. The transition counting matrix resulting from a sufficiently long pure temporal subsequence will have one large value in each row, with the remaining entries in each row being small (resulting from perturbation).
Any measure of purity should therefore be maximized when the largest value in each row is significantly larger than any other value in that row. The value of the transition counting matrix at row i and column j has already been defined in (3) as c_{ij}. Let the largest value of c_{ij} in row i be denoted c_i^{(1)} and the second largest value c_i^{(2)}. The quantity c_i^{(1)} − c_i^{(2)} is proposed as the basis of a purity function to determine the likelihood that the temporal subsequence lying between two data points is pure. As this quantity relates to an individual row of the transition matrix, it is summed over all rows and normalized by the total of the matrix elements to give a single value P for each matrix:

    P = \frac{\sum_{i=0}^{2^n-1} \bigl(c_i^{(1)} - c_i^{(2)}\bigr)}{\sum_{i=0}^{2^n-1} \sum_{j=0}^{2^n-1} c_{ij}}.    (10)

The purity function P is maximized for a state transition matrix in which each row contains only one single large value and the remaining values on each row are zero. To illustrate the purity function, consider a temporal data sequence of length N generated from two Boolean networks. The first section of the sequence, from 0 to N_1, has been generated from the first network, and the remainder of the sequence, from N_1 + 1 to N − 1, has been generated from the second network. We desire an estimate η of the switch point N_1. The variable η splits the data sequence into two parts, with 0 ≤ η ≤ N − 1. The problem of locating the switch point, and hence partitioning the data sequence, reduces to a search to locate N_1.
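The purity measure of (10) translates directly into code; a small Python sketch taking the counting matrix as a list of rows:

```python
def purity(C):
    """Purity P of (10): the gap between the largest and second-largest
    entries of each row, summed over rows and normalized by the total
    number of counted transitions."""
    total = sum(sum(row) for row in C)
    if total == 0:
        return 0.0
    gap = 0
    for row in C:
        top = sorted(row, reverse=True)
        gap += top[0] - top[1]
    return gap / total
```

A deterministic (pure, perturbation-free) matrix gives P = 1, while a uniformly mixed one gives P = 0.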
To accomplish this, a trial switch point, G, is varied, and the data sets before and after it are mapped into two different transition counting matrices, W and V. The ideal purity factor is a function which is maximized for both W and V when G = N_1. The procedure is illustrated in Figure 1. Figure 1(a) shows how the data are mapped from either side of the sliding point into the transition matrices. Figure 1(b) shows the purity functions derived from the transition counting matrices W and V. Figure 1(c) shows a simple functional of the two purity functions (in this case their product), which gives a peak at the correct switch point. The estimate η of the switch point is detected via a threshold.

Figure 1: Switch point estimation: (a) data sequence divided by a sliding point G, and transition matrices produced for the data on each side of the partition; (b) purity functions from W and V; (c) a simple function of the two purity functions indicating the switch point between models.

Figure 2: Passes for partitioning: the overall sequence is divided at the first pass into two shorter subsequences for testing. This is repeated in a second pass with the start and end points of the subsequences offset, in order to avoid missing a switch point due to chaotic behavior near subsequence borders.
The method described so far works well provided the sequence to be partitioned derives from two networks and the switch point does not lie close to the edge of the sequence. If the switch point lies close to the start or end of the sequence, then one of the transition counting matrices will be insufficiently populated, thereby causing the purity function to exhibit chaotic behavior. If the data sequence is long and there is possibly a large number of switch points, then the sequence can be divided into a series of shorter subsequences that are individually tested by the method described. Owing to the effects of chaotic behavior near subsequence borders, the method is repeated in a second pass in which the sequence is again divided into shorter subsequences but with the start and end points offset (see Figure 2). This ensures that a switch point will not be missed simply because it lies close to the edge of the data subsequence being tested. The purity function provides a measure of the difference in the relative behavior of two Boolean networks. It is possible for two Boolean networks to be different but still have many common transitions between their states. In this case the purity function will indicate a smaller distinction between the two models. This is particularly true where the two models have common attractors. Moreover, on average, the value of the purity function may vary greatly between subsequences. Hence, we apply the following normalization to obtain a normalized purity value:

    P_{norm} = \frac{P - T}{T},    (11)

where P is the purity value in the window and T is either the mean or the geometric mean of the window values. The normalization removes differences in the ranges and average values of points in different subsequences, thereby making it easier to identify genuine peaks resulting from switches between Boolean networks.
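The sliding-point search of Figure 1 can then be sketched as follows. This illustrative Python version recomputes the purity directly from each candidate subsequence, uses the product of the two purities as the functional, and simply keeps the trial point away from the sequence edges; the `margin` parameter is an assumption of this sketch, not a quantity from the paper.

```python
from collections import Counter

def seq_purity(seq):
    """Purity of the transition counts compiled from one subsequence of
    (hashable) states, computed row by row as in (10)."""
    rows = {}
    for x, y in zip(seq, seq[1:]):
        rows.setdefault(x, Counter())[y] += 1
    gap = 0
    for row in rows.values():
        top = row.most_common(2)
        gap += top[0][1] - (top[1][1] if len(top) > 1 else 0)
    return gap / (len(seq) - 1)

def switch_point(seq, margin=32):
    """Slide a trial switch point G and return the position maximizing the
    product of the purities on either side; `margin` keeps G away from the
    sequence edges, where the counting matrices are too sparsely populated
    for the purity to be meaningful."""
    best_g, best_score = None, -1.0
    for g in range(margin, len(seq) - margin):
        score = seq_purity(seq[:g]) * seq_purity(seq[g:])
        if score > best_score:
            best_g, best_score = g, score
    return best_g
```

On a toy sequence that follows one deterministic 3-cycle for 60 points and a different 3-cycle thereafter, the estimate lands on the true boundary.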
If two constituent Boolean networks are very similar, then it is more difficult to distinguish them, and they may be identified as being the same on account of insufficient or noisy data. This kind of problem is inherent to any inference procedure. If two networks are so identified during inference, this will affect the switching probability, because the switching probability will be based on the inferred model, which will have fewer constituent Boolean networks owing to some having been identified with each other. In practice, noisy data are typically problematic owing to overfitting, the result being spurious constituent Boolean networks in the inferred model. This overfitting problem has been addressed elsewhere by using Hamming-distance filters to identify close data profiles [9]. By identifying similar networks, the currently proposed procedure acts like a lowpass filter and thereby mitigates overfitting. As with any lowpass filter, discrimination capacity is diminished.

4.2. Estimation of the switching, selection, and perturbation probabilities

So far we have been concerned with identifying the family of Boolean networks composing a PBN; much longer data sequences are required to estimate the switching, selection, and perturbation probabilities. The switching probability may be estimated simply by dividing the number of switch points found by the total sequence length. The perturbation probability is estimated by identifying those transitions in the sequence not determined by a constituent-network function. For every data point, the next state is predicted using the model that has been found. If the predicted state does not match the actual state, then the transition is recorded as being caused by perturbation. Switch points are omitted from this process. The perturbation rate is then calculated by dividing the total number of perturbation instances by the length of the data sequence.

Regarding the selection probabilities, we assume that a constituent network cannot switch into itself; otherwise there would be no switch. This assumption is consistent with the heuristic that a switch results from the change of a latent variable that in turn results in a change of the network structure. Thus, the selection probabilities are conditional, depending on the current network. The conditional probabilities are of the form q_{AB}, the probability of selecting network B during a switch, given that the current network is A; q_{AB} is estimated by dividing the number of times the data sequence switches from A to B by the number of times it switches out of A.

In all cases, the length N of the sequence necessary to obtain good estimates is key. This issue is related to how often we expect to observe a perturbation, network switch, or network selection during a data sequence, and it can be addressed in terms of the relevant network parameters. We first consider estimation of the perturbation probability p. Note that we have defined p as the probability of making a random state selection, whereas in some papers each variable is given a probability of randomly changing. If the observed sequence has length N and we let X denote the number of perturbations (0 or 1) at a given time point, then the mean of X is p, and the estimate \hat{p} we are using for p is the sample mean of X for a random sample of size N, the sample being random because perturbations are independent. The expected number of perturbations is Np, which is the mean of the random variable S given by an independent sum of N random variables identically distributed to X; S possesses a binomial distribution with variance Np(1 − p). A measure of the goodness of the estimator is given by

    P\bigl(|\hat{p} - p| < \varepsilon\bigr) = P\bigl(|Np - S| < N\varepsilon\bigr)    (12)

for ε > 0. Because S possesses a binomial distribution, this probability is directly expressible in terms of the binomial density, which means that the goodness of our estimator is completely characterized. The computation is problematic for large N, but if N is sufficiently large that the rule of thumb min{Np, N(1 − p)} > 5 is satisfied, then the normal approximation to the binomial distribution can be used. Chebyshev's inequality provides a lower bound:

    P\bigl(|\hat{p} - p| < \varepsilon\bigr) = 1 - P\bigl(|Np - S| \ge N\varepsilon\bigr) \ge 1 - \frac{p(1-p)}{N\varepsilon^2}.    (13)

A good estimate is very likely if N is sufficiently large to make the fraction very small. Although often loose, Chebyshev's inequality provides an asymptotic guarantee of goodness. The salient issue is that the expected number of perturbations (in the denominator) becomes large. A completely analogous analysis applies to the switching probability q, with q replacing p and \hat{q} replacing \hat{p} in (12) and (13), Nq being the expected number of switches.

To estimate the selection probabilities, let p_{ij} be the probability of selecting network B_j, given that a switch is called for and the current network is B_i, and let \hat{p}_{ij} be its estimator; let r_i be the probability of observing a switch out of network B_i, and \hat{r}_i the estimator of r_i formed by dividing the number of times the PBN is observed switching out of B_i by N; and let s_{ij} be the probability of observing a switch from network B_i to network B_j, and \hat{s}_{ij} the estimator of s_{ij} formed by dividing the number of times the PBN is observed switching out of B_i into B_j by N. The estimator of interest, \hat{p}_{ij}, can be expressed as \hat{s}_{ij}/\hat{r}_i. The probability of observing a switch out of B_i is given by qP(B_i), where P(B_i) is the probability that the PBN is in B_i, so that the expected number of times such a switch is observed is given by NqP(B_i). There is an obvious issue here: P(B_i) is not a model parameter. We will return to this issue. Let us first consider \hat{s}_{ij}. Define the following events: A_t is the event of a switch at time t, B_i^t is the event of the PBN being in network B_i at time t, and [B_i \to B_j]_t is the event that B_i switches to B_j at time t.
Then, because the occurrence of a switch is independent of the current network,

    P\bigl([B_i \to B_j]_t\bigr) = P(A_t)\, P\bigl(B_i^{t-1}\bigr)\, P\bigl(B_i \to B_j \mid B_i^{t-1}\bigr) = q\, P\bigl(B_i^{t-1}\bigr)\, p_{ij}.    (14)

The probability of interest depends on the time, as does the probability of being in a particular constituent network; however, if we assume the PBN is in the steady state, then the time parameters drop out to yield

    P\bigl([B_i \to B_j]_t\bigr) = q\, P(B_i)\, p_{ij}.    (15)

Therefore, the number of times we expect to see a switch from B_i to B_j is given by NqP(B_i)p_{ij}.

Let us now return to the issue of P(B_i) not being a model parameter. Although it is not directly a model parameter, it can be expressed in terms of the model parameters so long as we assume we are in the steady state. Since

    B_i^t = \bigl(A_t^c \cap B_i^{t-1}\bigr) \cup \bigcup_{j \ne i} \bigl(A_t \cap B_j^{t-1} \cap [B_j \to B_i]_t\bigr),    (16)

a straightforward probability analysis yields

    P\bigl(B_i^t\bigr) = (1 - q)\, P\bigl(B_i^{t-1}\bigr) + q \sum_{j \ne i} P\bigl(B_j^{t-1}\bigr)\, P\bigl(B_j \to B_i \mid B_j^{t-1}\bigr).    (17)

Under the steady-state assumption the time parameters may be dropped to yield

    P(B_i) = \sum_{j \ne i} p_{ji}\, P(B_j).    (18)

Hence, the network probabilities are given in terms of the selection probabilities by

    0 = \begin{bmatrix}
        -1 & p_{21} & \cdots & p_{m1} \\
        p_{12} & -1 & \cdots & p_{m2} \\
        \vdots & \vdots & \ddots & \vdots \\
        p_{1m} & p_{2m} & \cdots & -1
    \end{bmatrix}
    \begin{bmatrix} P(B_1) \\ P(B_2) \\ \vdots \\ P(B_m) \end{bmatrix}.    (19)

Table 4: Average percentage of predictors and functions recovered from 104 BN sequences consisting of n = 7 variables, for k = 2 and k = 3, with p = .01.

    Sequence    Predictors recovered (%)    Functions recovered (%)
    length      k = 2      k = 3            k = 2      k = 3
    500         46.27      21.85            34.59      12.26
    1000        54.33      28.24            45.22      19.98
    2000        71.71      29.84            64.28      22.03
    4000        98.08      34.87            96.73      28.53
    6000        98.11      50.12            97.75      42.53
    8000        98.18      50.69            97.87      43.23
    10 000      98.80      51.39            98.25      43.74
    20 000      100        78.39            98.33      69.29
    30 000      100        85.89            99.67      79.66
    40 000      100        87.98            99.75      80.25

5. EXPERIMENTAL RESULTS

A variety of experiments have been performed to assess the proposed algorithm.
These include experiments on single BNs, PBNs, and real data. Insofar as the switching, selection, and perturbation probabilities are concerned, their estimation has been characterized analytically in the previous section so we will not be concerned with them here. Thus, we are concerned with the percentages of the predictors and functions recovered from a generated sequence. Letting c p and t p be the number of predictors correctly identified and the total number of predictors in the network, respectively, the percentage, π p , of predictors correctly identified is given by πp = cp × 100. tp cf × 100. tf Single Boolean networks When inferring the parameters of single BNs from data sequences by our method, it was found that the predictors and functions underlying the data could be determined very accurately from a limited number of observations. This means that even when only a small number of the total states and possible transitions of the model are observed, the parameters can still be extracted. These tests have been conducted using a database of 80 sequences generated by single BNs with perturbation. These have been constructed by randomly generating 16 BNs with n = 7 variables and connectivity k = 2 or k = 3, and P = .01. The sequence lengths vary in 10 steps from 500 to 40 000, as shown in Table 4. The table shows the percentages of the predictors and functions recovered from a sequence generated by a single BN, that is, a pure sequence with n = 7, for k = 2 or k = 3, expressed as a function of the overall sequence length. The average percentages of predictors and functions recovered from BN sequences with k = 2 is much higher than for k = 3 in the same sequence length. 5.2. Probabilistic Boolean networks (20) Letting c f and t f be the number of function outputs correctly identified and the total number of function outputs in network, respectively, the percentage, π f , of function outputs correctly identified is given by πf = 5.1. 
(21) The functions may be written as truth tables and π f corresponds to the percentage of lines in all the truth tables recovered from the data which correctly match the lines of the truth tables for the original function. For the analysis of PBN inference, we have constructed two databases consisting of sequences generated by PBNs with n = 7 genes. (i) Database A: the sequences are generated by 80 randomly generated PBNs and sequence lengths vary in 10 steps from 2000 to 500 000, each with different values of p and q, and two different levels of connectivity k. (ii) Database B: 200 sequences of length 100 000 are generated from 200 randomly generated PBNs, each having 4 constituent BNs with k = 3 predictors. The switching probability q varies in 10 values: .0001, .0002, .0005, .001, .002, .005, .01, .02, .05, 0.1. Stephen Marshall et al. (1) Select one subsequence for each BN and analyze that only. (2) Collate all subsequences generated by the same BN and analyze each set. Using the first strategy, the accuracy of the recovery of the predictors and functions tends to go down as the switching probability goes up because the lengths of the subsequences get shorter as the switching probability increases. Using the second strategy, the recovery rate is almost independent of the switching probability because the same number of data points from each BN is encountered. They are just cut up into smaller subsequences. Past a certain threshold, when the switching probability is very high the subsequences are so short that they are hard to classify. Figure 3 shows a graph of predictor recovery as a function of switching probability for the two strategies using database B. Both strategies give poor recovery for low switching probability because not all of the BNs are seen. Strategy 2 is more effective in recovering the underlying model parameters over a wider range of switching values. 
For higher values of q, the results from strategy 1 decline as the subsequences get shorter. The results for strategy 2 eventually decline as the subsequences become so short that they cannot be effectively classified. These observations are borne out by the results in Figure 4, which shows the percentage of predictors recovered using strategy 2 from a PBN-generated sequence with 4 BNs consisting of n = 7 variables with k = 3, p = .01, and switching probabilities q = .001 and q = .005 for various sequence lengths using database A. It can be seen that for low sequence lengths and low switching probability, only 21% of the predictors are recovered because only one BN has been observed. As sequence length increases, the percentage of predictors recovered increases, and at all times the higher switching probability does best, with the gap closing for very long sequences. More comparisons are given in Figures 5 and 6, which compare the percentage predictor recovery for two different connectivity values and for two different perturbation values, respectively. Both result from strategy 2 applied to database A. It can be seen that it is easier to recover predictors for smaller values of k and larger values of p.

Figure 3: The percentage of predictors recovered from fixed-length PBN sequences (of 100 000 sample points). The sequence is generated from 4 BNs, with n = 7 variables and k = 3 predictors, and p = .01.

The key issue for PBNs is how the inference algorithm works relative to the identification of switch points via the purity function. If the data sequence is successfully partitioned into pure subsequences, each generated by a constituent BN, then the BN results show that the predictors and functions can be accurately determined from a limited number of observations. Hence, our main concern with PBNs is apprehending the effects of the switching probability q, perturbation probability p, connectivity k, and sequence length. For instance, if there is a low switching probability, say q = .001, then the resulting pure subsequences may be several hundred data points long. So while each BN may be characterized from a few hundred data points, it may be necessary to observe a very long sequence simply to encounter all of the constituent BNs. When analyzing long sequences, two strategies can be applied after the data have been partitioned into pure subsequences.

Figure 4: The percentage of predictors recovered using strategy 2 from a sequence generated from a PBN with 4 BNs consisting of n = 7 variables with k = 3, p = .01, and switching probabilities q = .001 and q = .005 for various sequence lengths.

A fuller picture of the recovery of predictors and functions from a PBN sequence of varying length, varying k, and varying switching probability is given in Table 5 for database A, where p = .01 and there are three different switching probabilities: q = .001, .005, .03. As expected, it is easier to recover predictors for low values of k. Also, over this range the percentage recovery of both functions and predictors increases with increasing switching probability.

Table 5: The percentages of predictors and functions recovered by strategy 2 for various sequence lengths, from sequences generated by experimental design A with p = .01, switching probabilities q = .001, .005, .03, and for k = 2 and k = 3.
                   q = .001               q = .005               q = .03
Sequence    Predictors  Functions   Predictors  Functions   Predictors  Functions
length      recovered   recovered   recovered   recovered   recovered   recovered
               (%)         (%)         (%)         (%)         (%)         (%)
k = 2
2000          22.07       20.15       50.74       37.27       65.25       53.52
4000          36.90       33.13       55.43       42.49       74.88       66.31
6000          53.59       43.23       76.08       66.74       75.69       67.20
8000          54.75       47.15       77.02       67.48       76.22       67.72
10 000        58.69       53.57       79.10       69.47       86.36       80.92
50 000        91.50       88.22       94.58       92.59       96.70       94.71
100 000       97.28       95.43       97.97       96.47       98.47       96.68
200 000       97.69       96.39       98.68       97.75       99.27       98.03
300 000       97.98       96.82       99.00       98.19       99.40       98.97
500 000       99.40       98.67       99.68       99.18       99.83       99.25
k = 3
2000          20.94       12.95       41.79       25.44       48.84       34.01
4000          36.31       23.89       52.54       37.06       56.08       42.72
6000          38.80       26.79       54.92       42.02       64.33       51.97
8000          44.54       29.42       59.77       45.07       67.86       55.10
10 000        45.63       36.29       65.37       51.94       73.82       61.84
50 000        75.03       65.29       80.07       71.55       86.64       78.32
100 000       79.68       71.19       85.51       78.34       90.71       85.06
200 000       83.65       76.23       86.76       80.24       94.02       90.79
300 000       85.62       79.00       92.37       88.28       95.50       92.50
500 000       89.88       84.85       93.90       90.30       96.69       94.21

Figure 5: The percentage of predictors recovered using strategy 2 and experimental design A as a function of sequence length for connectivities k = 2 and k = 3.

Figure 6: The percentage of predictors recovered using strategy 2 and experimental design A as a function of sequence length for perturbation probabilities p = .02 and p = .005.

We have seen the marginal effects of the switching and perturbation probabilities, but what about their combined effects? To understand this interaction, and to do so taking into account both the number of genes and the sequence length, we have conducted a series of experiments using randomly generated PBNs composed of either n = 7 or n = 10 genes, and possessing different switching and perturbation values.
The result is a set of surfaces giving the percentages of predictors recovered as a function of p and q. The PBNs have been generated according to the following protocol. (1) Randomly generate 80 BNs with n = 7 variables and connectivity k = 3 (each variable has at most 3 predictors, the number for each variable being randomly selected). Randomly order the BNs as A1, A2, . . . , A80. (2) Consider the following perturbation and switching probabilities: p = .005, .01, .015, .02 and q = .001, .005, .01, .02, .03. (3) For each pair p, q, do the following: (a) construct a PBN from A1, A2, A3, A4 with selection probabilities 0.1, 0.2, 0.3, 0.4, respectively; (b) construct a PBN from A5, A6, A7, A8 with selection probabilities 0.1, 0.2, 0.3, 0.4, respectively; (c) continue until the BNs are used up. (4) Apply the inference algorithm to all PBNs using data sequences of length N = 4000, 6000, 8000, 10 000, 50 000. (5) Repeat steps (1)–(4) using 10 variables. Figures 7 and 8 show fitted surfaces for n = 7 and n = 10, respectively. We can make several observations in the parameter region considered: (a) as expected, the surface heights increase with increasing sequence length; (b) as expected, the surface heights are lower for more genes, meaning that longer sequences are needed for more genes; (c) the surfaces tend to increase in height for both p and q, but if q is too large, then recovery percentages begin to decline. The trends are the same for both numbers of genes, but recovery requires increasingly long sequences for larger numbers of genes.
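Steps (1)–(3) of the protocol can be sketched as follows. This is a minimal sketch under our own assumptions, not the authors' code: the helper names (`random_bn`, `build_pbns`) and the predictor-list-plus-truth-table representation of a BN are ours.

```python
import random

def random_bn(n=7, k=3, rng=random):
    # One random BN: each gene gets at most k predictor genes
    # (the number randomly selected) and a random truth table over them.
    bn = []
    for _ in range(n):
        ki = rng.randint(1, k)
        preds = rng.sample(range(n), ki)
        table = [rng.randint(0, 1) for _ in range(2 ** ki)]
        bn.append((preds, table))
    return bn

def build_pbns(num_bns=80, group=4, probs=(0.1, 0.2, 0.3, 0.4),
               n=7, k=3, rng=random):
    # Steps (1)-(3): generate BNs A1..A80 and bundle each consecutive
    # group of 4 into a PBN with the given selection probabilities.
    bns = [random_bn(n, k, rng) for _ in range(num_bns)]
    return [list(zip(bns[i:i + group], probs))
            for i in range(0, num_bns, group)]
```

With the default settings this yields 20 PBNs of 4 constituent BNs each, matching the grouping A1–A4, A5–A8, and so on described in step (3).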
Figure 7: Predictor recovery as a function of switching and perturbation probabilities for n = 7 genes: (a) N = 4000, (b) N = 6000, (c) N = 8000, (d) N = 10 000, (e) N = 50 000.

Figure 8: Predictor recovery as a function of switching and perturbation probabilities for n = 10 genes: (a) N = 4000, (b) N = 6000, (c) N = 8000, (d) N = 10 000, (e) N = 50 000.

6. A SUBSAMPLING STRATEGY

It is usually only necessary to observe a few thousand sample points in order to determine the underlying predictors and functions of a single BN. Moreover, it is usually only necessary to observe a few hundred sample points to classify a BN as being BN1, BN2, and so forth. However, in analyzing a PBN-generated sequence with low switching probability, say
q = .001, it is necessary on average to observe 1000 points before a switch to the second BN occurs. This requires huge data lengths, not for deriving the parameters (predictors and functions) of the underlying model, but simply for switches to occur so that the other constituent BNs can be observed. This motivates consideration of subsampling.

Figure 9: Subsampling strategy.

Subsampling represents an effort at complexity reduction and is commonly used in engineering applications to gain speed and reduce cost. From a larger perspective, the entire investigation of gene regulatory networks needs to take complexity reduction into consideration because in the natural state the networks are extremely complex. The issue is whether goals can be accomplished better using fine- or coarse-grained analysis [28]. For instance, a stochastic differential equation model might provide a more complete description in principle, but a low-quantized discrete network might give better results owing to reduced inference requirements or computational complexity. Indeed, in this paper we have seen the inference difficulty that occurs by taking into account the stochasticity caused by latent variables on a coarse binary model. Not only does complexity reduction motivate the use of models possessing smaller numbers of critical parameters and relations, for instance by network reduction [29] or by suppressing functional relations in favor of a purely transitional probabilistic model [30]; it also motivates suboptimal inference, as in the case of the subsampling discussed herein, or the application of suboptimal intervention strategies to network models such as PBNs [31].

Figure 10: Predictor recovery percentages using various subsampling regimes.
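The subsampling decomposition described next, a sampling window of S points at the start of each sampling space of length L = S + I (where I points are skipped), can be sketched as follows. The function name and the list-of-windows representation are our own assumptions, not from the paper.

```python
def subsample(sequence, S, L):
    # Keep the sampling window (first S points) of every sampling
    # space of length L = S + I; skip the remaining I points.
    assert S <= L
    return [sequence[i:i + S] for i in range((0), len(sequence), L)]
```

When S = L the windows tile the sequence and no points are skipped, which corresponds to the no-subsampling baseline.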
Rather than analyzing the full sequence, we analyze a small subsequence of data points, skip a large run of points, analyze another subsequence, skip more points, and so forth. If each sampled subsequence is sufficiently long to be classified correctly, then the samples from the same BN may be collated to produce good parameter estimates. The subsampling strategy is illustrated in Figure 9. It is intended for data possessing a low switching probability, since it is only necessary to see a small number of sample points of each BN in order to identify that BN. The length of the sampled subsequences is fixed at some value S. To test the subsampling strategy, a set of 20 data sequences, each consisting of 100 000 sample points, was generated from a PBN consisting of 4 BNs, n = 7 variables, k = 2, p = .01, and q = .001 in database A. We define a sampling space to consist of a sampling window and a nonsampling interval, so that the length of a sampling space is given by L = S + I, where I is the length of the nonsampling interval. We have considered sampling spaces of lengths L = 200, 400, 600, 800, 1000, 2000, 3000, 4000, 5000, and 10 000, and sampling windows (subsequences) of lengths S = 50, 100, 150, and 200. When S = L, there is no subsampling. The results are shown in Figure 10, which gives the percentage of predictors recovered. The recovery percentage obtained by processing all 100 000 points in the full sequence is 97.28%.

7. REAL-DATA NETWORK EXPERIMENT

To test the inference technique on real data, we have considered an experiment based on a model affected by latent variables, these being a key reason for PBN modeling. Latent variables are variables outside the model whose behavior causes the model to appear random, that is, to switch between constituent networks. The real-gene PBN is derived from the Drosophila segment polarity genes, for which a Boolean network has been derived that consists of 8 genes: wg1, wg2, wg3, wg4, PTC1, PTC2, PTC3, and PTC4 [26].
The genes are controlled by the following equations:

wg1 = wg1 and not wg2 and not wg4,
wg2 = wg2 and not wg1 and not wg3,
wg3 = wg1 or wg3,
wg4 = wg2 or wg4,
PTC1 = (not wg2 and not wg4) or (PTC1 and not wg1 and not wg3),          (22)
PTC2 = (not wg1 and not wg3) or (PTC2 and not wg2 and not wg4),
PTC3 = 1,
PTC4 = 1.

Now let wg4 and PTC4 be hidden variables (not observable). Since PTC4 has a constant value, its being a hidden variable has no effect on the network. However, if we let wg4 = 0 or 1, we will arrive at a 6-gene PBN consisting of two BNs. When wg4 = 0, we have the following BN:

wg1 = wg1 and not wg2,
wg2 = wg2 and not wg1 and not wg3,
wg3 = wg1 or wg3,
PTC1 = (not wg2) or (PTC1 and not wg1 and not wg3),                      (23)
PTC2 = (not wg1 and not wg3) or (PTC2 and not wg2),
PTC3 = 1.

When wg4 = 1, we have the following BN:

wg1 = wg1 and not wg2 and 0,
wg2 = wg2 and not wg1 and not wg3,
wg3 = wg1 or wg3,
PTC1 = (not wg2 and 0) or (PTC1 and not wg1 and not wg3),                (24)
PTC2 = (not wg1 and not wg3) or (PTC2 and not wg2 and 0),
PTC3 = 1.

Table 6: Percentages of the predictors and functions recovered from the segment polarity PBN.

                q = .001               q = .005               q = .02
Length    Predictor  Function   Predictor  Function   Predictor  Function
          recovered  recovered  recovered  recovered  recovered  recovered
             (%)        (%)        (%)        (%)        (%)        (%)
2000        51.94      36.83      56.95      39.81      68.74      49.43
4000        58.39      38.49      59.86      44.53      70.26      52.86
6000        65.78      50.77      80.42      65.40      85.60      68.11
8000        72.74      59.23      83.97      69.47      86.82      70.28
10 000      76.03      63.98      88.10      74.31      92.80      77.83
20 000      87.81      76.86      95.68      81.60      96.98      83.48
30 000      97.35      84.61      97.65      88.28      99.17      88.82
40 000      98.64      85.74      99.19      90.03      99.66      91.05
50 000      99.59      90.18      99.59      90.35      99.79      91.94
100 000     99.69      90.85      99.87      91.19      100        93.97

Together, these compose a 6-gene PBN. Note that in the second BN we do not simplify the functions for wg1, PTC1, and PTC2, so that they have the same predictors as in the first BN. There are 6 genes considered here: wg1, wg2, wg3, PTC1, PTC2, and PTC3. The maximum number of predictor genes is k = 4. The two constituent networks are regulated by the same predictor sets.
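The two constituent BNs can be written as a single update function parameterized by the hidden variable. This is our own Python rendering of equations (23) and (24) (the function name and tuple-state representation are assumptions, not from the paper); holding wg4 fixed at 0 or 1 selects the first or second constituent network.

```python
def bn_wg4(state, wg4):
    # One synchronous update of the 6 observable genes
    # (wg1, wg2, wg3, PTC1, PTC2, PTC3) with the hidden variable wg4
    # held fixed: equations (23) when wg4 = 0, (24) when wg4 = 1.
    wg1, wg2, wg3, ptc1, ptc2, ptc3 = state
    not_wg4 = 1 - wg4
    return (
        wg1 & (1 - wg2) & not_wg4,                               # wg1
        wg2 & (1 - wg1) & (1 - wg3),                             # wg2
        wg1 | wg3,                                               # wg3
        ((1 - wg2) & not_wg4) | (ptc1 & (1 - wg1) & (1 - wg3)),  # PTC1
        ((1 - wg1) & (1 - wg3)) | (ptc2 & (1 - wg2) & not_wg4),  # PTC2
        1,                                                       # PTC3
    )
```

Because the "and not wg4" terms are kept rather than simplified away, both constituent networks read exactly the same predictor sets, as noted above.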
Based on this real-gene regulatory network, synthetic sequences have been generated and the inference procedure has been applied to them. 600 sequences with 10 different lengths (between 2000 and 100 000) have been generated with p = .01 and three switching probabilities, q = .001, .005, .02. Table 6 shows the average percentages of the predictors and functions recovered.

8. CONCLUSION

Capturing the full dynamic behavior of probabilistic Boolean networks, whether they are binary or multivalued, will require the use of temporal data, and a goodly amount of it. This should not be surprising given the complexity of the model and the number of parameters, both transitional and static, that must be estimated. This paper proposed an algorithm that works well, but shows the data requirement. It also demonstrates that the data requirement is much smaller if one does not wish to infer the switching, perturbation, and selection probabilities, and that constituent-network connectivity can be discovered with decent accuracy from relatively short time-course sequences. The switching and perturbation probabilities are key factors, since if they are very small, then large amounts of time are needed to escape attractors; on the other hand, if they are large, estimation accuracy is hurt. Were we to restrict our goal to functional descriptions of state transitions when in attractor cycles, then the necessary amount of data would be enormously reduced; however, our goal in this paper is to capture as much of the PBN structure as possible, including transient regulation. Among the implications of the issues raised in this paper, there is a clear message regarding the tradeoff between fine- and coarse-grain models.
Even if we consider a binary PBN, which is considered to be a coarse-grain model, and a small number of genes, the added complexity of accounting for function switches owing to latent variables significantly increases the data requirement. This is the kind of complexity problem indicative of what one must confront when using solely data-driven learning algorithms. Further study should include mitigation of data requirements by prior knowledge, such as transcriptional knowledge of connectivity or regulatory functions for some genes involved in the network. It is also important to consider the reduction in complexity resulting from prior constraints on the network generating the data. These might include connectivity, attractor structure, the effect of canalizing functions, and regulatory bias. In the other direction, one can consider complicating factors such as missing data and inference when data measurements cannot be placed into direct relation with the synchronous temporal dynamics of the model.

ACKNOWLEDGMENTS

We appreciate the National Science Foundation (CCF-0514644 and BES-0536679) and the National Cancer Institute (R01 CA-104620) for partly supporting this research. We would also like to thank Edward Suh, Jianping Hua, and James Lowey of the Translational Genomics Research Institute for providing high-performance computing support.

REFERENCES

[1] E. R. Dougherty, A. Datta, and C. Sima, “Research issues in genomic signal processing,” IEEE Signal Processing Magazine, vol. 22, no. 6, pp. 46–68, 2005.
[2] T. Akutsu, S. Miyano, and S. Kuhara, “Identification of genetic networks from a small number of gene expression patterns under the Boolean network model,” in Proceedings of the 4th Pacific Symposium on Biocomputing (PSB ’99), pp. 17–28, Mauna Lani, Hawaii, USA, January 1999.
[3] H. Lähdesmäki, I. Shmulevich, and O. Yli-Harja, “On learning gene regulatory networks under the Boolean network model,” Machine Learning, vol. 52, no. 1-2, pp.
147–167, 2003. [4] S. Liang, S. Fuhrman, and R. Somogyi, “REVEAL, a general reverse engineering algorithm for inference of genetic network architectures,” in Proceedings of the 3rd Pacific Symposium on Biocomputing (PSB ’98), pp. 18–29, Maui, Hawaii, USA, January 1998. [5] R. Pal, I. Ivanov, A. Datta, M. L. Bittner, and E. R. Dougherty, “Generating Boolean networks with a prescribed attractor structure,” Bioinformatics, vol. 21, no. 21, pp. 4021–4025, 2005. [6] X. Zhou, X. Wang, and E. R. Dougherty, “Construction of genomic networks using mutual-information clustering and reversible-jump Markov-chain-Monte-Carlo predictor design,” Signal Processing, vol. 83, no. 4, pp. 745–761, 2003. [7] R. F. Hashimoto, S. Kim, I. Shmulevich, W. Zhang, M. L. Bittner, and E. R. Dougherty, “Growing genetic regulatory networks from seed genes,” Bioinformatics, vol. 20, no. 8, pp. 1241–1247, 2004. [8] X. Zhou, X. Wang, R. Pal, I. Ivanov, M. L. Bittner, and E. R. Dougherty, “A Bayesian connectivity-based approach to constructing probabilistic gene regulatory networks,” Bioinformatics, vol. 20, no. 17, pp. 2918–2927, 2004. [9] E. R. Dougherty and Y. Xiao, “Design of probabilistic Boolean networks under the requirement of contextual data consistency,” IEEE Transactions on Signal Processing, vol. 54, no. 9, pp. 3603–3613, 2006. [10] D. Pe’er, A. Regev, G. Elidan, and N. Friedman, “Inferring subnetworks from perturbed expression profiles,” Bioinformatics, vol. 17, supplement 1, pp. S215–S224, 2001. [11] D. Husmeier, “Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks,” Bioinformatics, vol. 19, no. 17, pp. 2271–2282, 2003. [12] J. M. Peña, J. Björkegren, and J. Tegnér, “Growing Bayesian network models of gene networks from seed genes,” Bioinformatics, vol. 21, supplement 2, pp. ii224–ii229, 2005. [13] H. Lähdesmäki, S. Hautaniemi, I. Shmulevich, and O. 
Yli-Harja, “Relationships between probabilistic Boolean networks and dynamic Bayesian networks as models of gene regulatory networks,” Signal Processing, vol. 86, no. 4, pp. 814–834, 2006.
[14] S. A. Kauffman, “Metabolic stability and epigenesis in randomly constructed genetic nets,” Journal of Theoretical Biology, vol. 22, no. 3, pp. 437–467, 1969.
[15] S. A. Kauffman, “Homeostasis and differentiation in random genetic control networks,” Nature, vol. 224, no. 5215, pp. 177–178, 1969.
[16] S. A. Kauffman, The Origins of Order: Self-Organization and Selection in Evolution, Oxford University Press, New York, NY, USA, 1993.
[17] S. Huang, “Gene expression profiling, genetic networks, and cellular states: an integrating concept for tumorigenesis and drug discovery,” Journal of Molecular Medicine, vol. 77, no. 6, pp. 469–480, 1999.
[18] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, “Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks,” Bioinformatics, vol. 18, no. 2, pp. 261–274, 2002.
[19] S. Kim, H. Li, E. R. Dougherty, et al., “Can Markov chain models mimic biological regulation?” Journal of Biological Systems, vol. 10, no. 4, pp. 337–357, 2002.
[20] I. Shmulevich, E. R. Dougherty, and W. Zhang, “From Boolean to probabilistic Boolean networks as models of genetic regulatory networks,” Proceedings of the IEEE, vol. 90, no. 11, pp. 1778–1792, 2002.
[21] I. Shmulevich and E. R. Dougherty, “Modeling genetic regulatory networks with probabilistic Boolean networks,” in Genomic Signal Processing and Statistics, E. R. Dougherty, I. Shmulevich, J. Chen, and Z. J. Wang, Eds., EURASIP Book Series on Signal Processing and Communication, pp. 241–279, Hindawi, New York, NY, USA, 2005.
[22] A. Datta, A. Choudhary, M. L. Bittner, and E. R. Dougherty, “External control in Markovian genetic regulatory networks,” Machine Learning, vol. 52, no. 1-2, pp. 169–191, 2003.
[23] R. Pal, A. Datta, M. L. Bittner, and E. R.
Dougherty, “Intervention in context-sensitive probabilistic Boolean networks,” Bioinformatics, vol. 21, no. 7, pp. 1211–1218, 2005. [24] R. Pal, A. Datta, and E. R. Dougherty, “Optimal infinitehorizon control for probabilistic Boolean networks,” IEEE Transactions on Signal Processing, vol. 54, no. 6, part 2, pp. 2375–2387, 2006. [25] A. Datta, R. Pal, and E. R. Dougherty, “Intervention in probabilistic gene regulatory networks,” Current Bioinformatics, vol. 1, no. 2, pp. 167–184, 2006. [26] R. Albert and H. G. Othmer, “The topology of the regulatory interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster,” Journal of Theoretical Biology, vol. 223, no. 1, pp. 1–18, 2003. [27] G. Langholz, A. Kandel, and J. L. Mott, Foundations of Digital Logic Design, World Scientific, River Edge, NJ, USA, 1998. [28] I. Ivanov and E. R. Dougherty, “Modeling genetic regulatory networks: continuous or discrete?” Journal of Biological Systems, vol. 14, no. 2, pp. 219–229, 2006. [29] I. Ivanov and E. R. Dougherty, “Reduction mappings between probabilistic Boolean networks,” EURASIP Journal on Applied Signal Processing, vol. 2004, no. 1, pp. 125–131, 2004. [30] W.-K. Ching, M. K. Ng, E. S. Fung, and T. Akutsu, “On construction of stochastic genetic networks based on gene expression sequences,” International Journal of Neural Systems, vol. 15, no. 4, pp. 297–310, 2005. [31] M. K. Ng, S.-Q. Zhang, W.-K. Ching, and T. Akutsu, “A control model for Markovian genetic regulatory networks,” in Transactions on Computational Systems Biology V, Lecture Notes in Computer Science, pp. 36–48, 2006. Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2007, Article ID 20180, 13 pages doi:10.1155/2007/20180 Research Article Algorithms for Finding Small Attractors in Boolean Networks Shu-Qin Zhang,1 Morihiro Hayashida,2 Tatsuya Akutsu,2 Wai-Ki Ching,1 and Michael K. 
Ng3

1 Advanced Modeling and Applied Computing Laboratory, Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong
2 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
3 Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong

Received 29 June 2006; Revised 24 November 2006; Accepted 13 February 2007

Recommended by Edward R. Dougherty

A Boolean network is a model used to study the interactions between different genes in genetic regulatory networks. In this paper, we present several algorithms using gene ordering and feedback vertex sets to identify singleton attractors and small attractors in Boolean networks. We analyze the average case time complexities of some of the proposed algorithms. For instance, it is shown that the outdegree-based ordering algorithm for finding singleton attractors works in O(1.19^n) time for K = 2, which is much faster than the naive O(2^n) time algorithm, where n is the number of genes and K is the maximum indegree. We performed extensive computational experiments on these algorithms, which resulted in good agreement with theoretical results. In addition, we give a simple and complete proof that finding an attractor with the shortest period is NP-hard.

Copyright © 2007 Shu-Qin Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

The advent of DNA microarrays and oligonucleotide chips has significantly sped up the systematic study of gene interactions [1–4].
Based on microarray data, different kinds of mathematical models and computational methods have been developed, such as Bayesian networks, Boolean networks and probabilistic Boolean networks, ordinary and partial differential equations, qualitative differential equations, and other mathematical models [5]. Among all these models, the Boolean network model has received much attention. It was originally introduced by Kauffman [6–9], and reviews can be found in [10–12]. In a Boolean network, gene expression states are quantized to only two levels: 1 (expressed) and 0 (unexpressed). Although such binary expression is very simple, it can retain meaningful biological information contained in the real continuous-domain gene expression profiles. For instance, it can be applied to separation between types of gliomas and types of sarcomas [13]. In a Boolean network, genes interact through logical rules called Boolean functions. The state of a target gene is determined by the states of its regulating genes (input genes) and its Boolean function. Given the states of the input genes, the Boolean function transforms them into an output, which is the state of the target gene. Although the Boolean network model is very simple, its dynamic process is complex and can yield insight into the global behavior of large genetic regulatory networks [14]. The total number of possible global states for a Boolean network with n genes is 2^n. However, for any initial condition, the system will eventually evolve into a limited set of stable states called attractors. The set of states that can lead the system to a specific attractor is called the basin of attraction. An attractor can consist of one or many states. An attractor having only one state is called a singleton attractor; otherwise, it is called a cyclic attractor. There are two different interpretations for the function of attractors. One intuition, following Kauffman, is that an attractor should correspond to a cell type [11].
Another interpretation of attractors is that they correspond to the cell states of growth, differentiation, and apoptosis [10]. Cyclic attractors should correspond to cell cycles (growth) and singleton attractors should correspond to differentiated or apoptotic states. These two interpretations are complementary, since one cell type can consist of several neighboring attractors, each of which corresponds to a different cellular functional state [15]. The number and length of attractors are important features of networks, and extensive studies have been devoted to analyzing them. Starting from [11], a fast increase of the number of attractors has been seen in [16–19]. Many studies have also been done on the mean length of attractors [11, 17], although there is no conclusive result. It is also important to identify the attractors of a given Boolean network. In particular, identification of all singleton attractors is important because singleton attractors correspond to steady states in Boolean networks and have a close relation with steady states in other mathematical models of biological networks [10, 20–23]. As mentioned before, Huang wrote that singleton attractors correspond to differentiation and apoptosis states of a cell [10]. Devloo et al. transform the problem of finding steady states for some types of biological networks into a constraint satisfaction problem [20]. The resulting constraint satisfaction problem is very close to the problem of identification of singleton attractors in Boolean networks. Mochizuki introduced a general model of genetic networks based on nonlinear differential equations [21]. He analyzed the number of steady states in that model, where steady states are again closely related to singleton attractors in Boolean networks. Zhou et al. proposed a Bayesian-based approach to constructing probabilistic genetic networks [23]. Pal et al. proposed algorithms for generating Boolean networks with a prescribed attractor structure [22].
These studies focus on singleton attractors, and it is mentioned that real-world attractors are most likely to be singleton attractors rather than cyclic attractors. Therefore, it is meaningful to identify singleton attractors. Of course, this can be done by examining all possible states of a Boolean network. However, it would be too time consuming even for small n, since 2^n states have to be examined. Of course, if we want to find any one (not necessarily singleton) attractor, we may find it by following the trajectory to the attractor beginning from a randomly selected state. If the basin of attraction is large, the probability of finding the corresponding attractor would be high. However, it is not guaranteed that a singleton attractor can be found this way, and in order to find one, many trajectories may have to be examined. Indeed, Akutsu et al. proved in 1998 that finding a singleton attractor is NP-hard [24]. Independently, Milano and Roli showed in 2000 that the satisfiability problem can be transformed into the problem of finding a singleton attractor [25], which provides another proof of NP-hardness of the singleton attractor problem. Thus, it is not plausible that the singleton attractor problem can be solved efficiently (i.e., in polynomial time) in all cases. However, it may be possible to develop algorithms that are fast in practice and/or in the average case. Therefore, this paper studies algorithms for identifying singleton attractors that are fast in many practical cases and have concrete theoretical backgrounds. Some studies have been done on fast identification of singleton attractors. Akutsu et al. proposed an algorithm for finding singleton attractors based on a feedback vertex set [24]. Devloo et al. proposed algorithms for finding steady states of various biological networks using constraint programming [20], which can also be applied to identification of singleton attractors in Boolean networks. In particular, the algorithms proposed by Devloo et al.
are efficient in practice. However, there are no theoretical results on the efficiency of their algorithms. Thus, we aim at developing algorithms that are fast in practice and have a theoretical guarantee on their efficiency (more precisely, on the average case time complexity). In this paper, we propose several algorithms for identifying all singleton attractors. We first present a basic recursive algorithm. In this algorithm, a partial solution is extended one gene at a time, according to a given gene ordering, until it leads to a complete solution. If it is found that a partial solution cannot be extended to a complete solution, the next partial solution is examined. This algorithm is quite similar to the backtracking method employed in [20]. The important difference between this paper and [20] is that we perform theoretical analysis of the average case time complexity. For example, we show that the basic recursive algorithm works in O(1.23^n) time in the average case under the condition that Boolean networks with maximum indegree 2 are given uniformly at random. It should be noted that O(1.23^n) is much smaller than O(2^n), though it is not polynomial. Next, we develop improved algorithms using the outdegree-based ordering and the breadth-first search (BFS) based ordering. For these algorithms, we perform theoretical analysis of the average case time complexity, which shows that they are better than the basic recursive algorithm. Moreover, we examine the algorithm based on feedback vertex sets (FVS) and its combination with the outdegree-based ordering, where the idea of using an FVS was proposed in our previous work [24]. We also perform computational experiments using these algorithms, which show that the FVS-based algorithm with the outdegree-based gene ordering is the most efficient in practice among these algorithms.
Then, we extend the gene-ordering-based algorithms to the problem of finding cyclic attractors with short periods, again with theoretical analysis and computational experiments. Though we do not have strong evidence that attractors with short periods are more important than those with long periods, it appears that cell cycles correspond to small attractors, and that large attractors are not so common (with the exception of circadian rhythms) in real biological networks. At a minimum, these extensions show that the application of the proposed techniques is not limited to the singleton attractor problem. As mentioned before, NP-hardness results on finding a singleton attractor (or the smallest attractor) were already presented in [24, 25]. However, both papers appeared as conference papers, the detailed proof is not given in [24], and the transformation given in [25] is somewhat complicated. Therefore, we describe a simple and complete proof, which we believe is worth including in this paper. Finally, we conclude with future work.

2. ANALYSIS OF ALGORITHMS USING GENE ORDERING FOR FINDING SINGLETON ATTRACTORS

In this section, we present algorithms that use gene ordering for the identification of singleton attractors, along with theoretical analysis of the average case time complexity. Experimental results will be given later, together with those of the FVS-based methods. Before presenting the algorithms, we briefly review the Boolean network model.

Table 1: Example of a truth table of a Boolean network.

v1 v2 v3 | f1 f2 f3
 0  0  0 |  0  1  1
 0  0  1 |  1  0  1
 0  1  0 |  1  1  0
 0  1  1 |  0  1  1
 1  0  0 |  0  1  0
 1  0  1 |  1  0  0
 1  1  0 |  1  0  1
 1  1  1 |  1  1  0

Figure 1: State transitions of the Boolean network shown in Table 1.

2.1. Boolean network and attractor

A Boolean network G(V, F) consists of a set V of n nodes (vertices) and a set F of n Boolean functions, where

V = {v1, v2, ..., vn},  F = {f1, f2, ..., fn}.   (1)
In general, V and F correspond to a set of genes and a set of gene regulatory rules, respectively. Let vi(t) represent the state of vi at time t. The overall expression level of all the genes in the network at time step t is given by the vector

v(t) = (v1(t), v2(t), ..., vn(t)).   (2)

This vector is referred to as the Gene Activity Profile (GAP) of the network at time t, where vi(t) = 0 means that the ith gene is not expressed and vi(t) = 1 means that it is expressed. Since v(t) ranges from [0, 0, ..., 0] (all entries 0) to [1, 1, ..., 1] (all entries 1), there are 2^n possible states. The regulatory rules among the genes are given as follows:

vi(t + 1) = fi(vi1(t), vi2(t), ..., viki(t)),  i = 1, 2, ..., n.   (3)

This rule means that the state of gene vi at time t + 1 is determined by the states of ki genes at time t, where ki is called the indegree of gene vi. The maximum indegree of a Boolean network is defined as

K = max_i ki.   (4)

The number of genes that are directly affected by gene vi is called the outdegree of gene vi. The states of all genes are updated synchronously according to the corresponding Boolean functions. A consecutive sequence of GAPs v(t), v(t+1), ..., v(t+p) is called an attractor with period p if v(t) = v(t+p). An attractor with period 1 is called a singleton attractor, and an attractor with period greater than 1 is called a cyclic attractor.

Algorithm 1:
Input: a Boolean network G(V, F)
Output: all the singleton attractors
Initialize m := 1;
Procedure IdentSingletonAttractor(v, m)
  if m = n + 1 then
    Output v1(t), v2(t), ..., vn(t), return;
  for b = 0 to 1 do
    vm(t) := b;
    if it is found that vj(t + 1) ≠ vj(t) for some j ≤ m then
      continue;
    else
      IdentSingletonAttractor(v, m + 1);
  return.

Table 1 gives an example of a truth table of a Boolean network. Each gene updates its state according to the states of some of the genes in the previous step. The state transitions of this Boolean network can be seen in Figure 1.
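As an illustration, the dynamics of this example can be checked directly in code. The following sketch (our own encoding of the truth table of Table 1, not code from the paper) follows the trajectory from each of the 2^3 = 8 states and collects the attractors:

```python
from itertools import product

# Truth table of the example network (Table 1): GAP (v1, v2, v3) -> next GAP.
TABLE = {
    (0, 0, 0): (0, 1, 1), (0, 0, 1): (1, 0, 1),
    (0, 1, 0): (1, 1, 0), (0, 1, 1): (0, 1, 1),
    (1, 0, 0): (0, 1, 0), (1, 0, 1): (1, 0, 0),
    (1, 1, 0): (1, 0, 1), (1, 1, 1): (1, 1, 0),
}

def attractors(step, n):
    """Follow the trajectory from every state; the states seen from the first
    repeated state onward form the attractor reached from that start."""
    found = set()
    for start in product((0, 1), repeat=n):
        seen = {}
        state, time = start, 0
        while state not in seen:
            seen[state] = time
            state, time = step(state), time + 1
        # states visited at or after the first visit of the repeated state
        found.add(frozenset(s for s, t in seen.items() if t >= seen[state]))
    return found

atts = attractors(TABLE.get, 3)
```

Running this yields exactly the two attractors discussed in the text: the singleton attractor [0, 1, 1] and a cyclic attractor of period 4.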
The system will eventually evolve into one of two attractors. One attractor is [0, 1, 1], which is a singleton attractor; the other is

[1, 0, 1] -> [1, 0, 0] -> [0, 1, 0] -> [1, 1, 0] -> [1, 0, 1],   (5)

which is a cyclic attractor with period 4.

2.2. Basic recursive algorithm

The number of singleton attractors in a Boolean network depends on the regulatory rules of the network. If the regulatory rules are given as vi(t+1) = vi(t) for all i, the number of singleton attractors is 2^n. Thus, it would take O(2^n) time in the worst case to identify all the singleton attractors. On the other hand, it is known that the average number of singleton attractors is 1, regardless of the number of genes n and the maximum indegree K [21]. Therefore, it is useful to develop algorithms that identify all singleton attractors without examining all 2^n states (in the average case). For that purpose, we propose a very simple algorithm, referred to as the basic recursive algorithm in this paper. In the algorithm, a partial GAP (i.e., a profile over m (< n) genes) is extended one gene at a time toward a complete GAP (i.e., a singleton attractor), according to a given gene ordering. If it is found that a partial GAP cannot be extended to a singleton attractor, the next partial GAP is examined. The pseudocode of the algorithm is given in Algorithm 1. The algorithm extends a partial GAP by one gene at a time. At the mth recursive step, the states of the first m - 1 genes have been determined. The algorithm first extends the partial GAP by setting vm(t) = 0. If, for every j = 1, ..., m, either vj(t+1) = vj(t) holds or the value of vj(t+1) is not yet determined, the algorithm proceeds to the next recursive step. That is, if there is a possibility that the current partial GAP can be extended to a singleton attractor, it goes to the next recursive step.
Otherwise, it extends the partial GAP by setting vm(t) = 1 instead and performs the same check. After examining both vm(t) = 0 and vm(t) = 1, the algorithm returns to the previous recursive step. Since the number of singleton attractors is small in most cases, it is expected that the algorithm does not examine many partial GAPs with large m. The average case time complexity is estimated as follows. Suppose that Boolean networks with maximum indegree K are given uniformly at random. Then the average case time complexity of the algorithm for K = 1 to K = 10 is given in the first row of Table 2.

Theoretical analysis

We assume w.l.o.g. that the indegrees of all genes are exactly K. Assume that we have tested the first m out of n genes, where m ≥ K. For each i ≤ m, a violation vi(t) ≠ vi(t+1) is detected with probability

P[vi(t) ≠ vi(t+1)] = 0.5 · C(m, ki)/C(n, ki) ≈ 0.5 · (m/n)^ki ≥ 0.5 · (m/n)^K,   (6)

since all ki input genes of vi must be among the first m genes and the output value must mismatch. If vi(t) ≠ vi(t+1) is not detected, the algorithm can continue. Therefore, the probability that the algorithm examines the (m+1)th gene is not more than

[1 - P[vi(t) ≠ vi(t+1)]]^m = [1 - 0.5 · (m/n)^K]^m.   (7)

Thus, the number of recursive calls executed for the first m genes is at most

f(m) = 2^m · [1 - 0.5 · (m/n)^K]^m.   (8)

Let s = m/n. Then

f(s) = [2^s · (1 - 0.5 · s^K)^s]^n = [(2 - s^K)^s]^n.   (9)

The average case time complexity is estimated by the maximum value of f(s). Though an additional O(nm) factor is required, it can be ignored, since O(n^2 · a^n) ⊆ O((a + ε)^n) holds for any a > 1 and ε > 0. Since the time complexity should be a function of n, we only need to compute the maximum value of the function g(s) = (2 - s^K)^s. With simple numerical calculations, we can get its maximum value for fixed K. The average case time complexity of the algorithm can then be estimated as O((max g)^n). We list the time complexities for K = 1 to 10 in the first row of Table 2. As K gets larger, the complexity increases.

2.3. Outdegree-based ordering algorithm

In the basic recursive algorithm, the original ordering of the genes is used. If we instead sort the genes in decreasing order of outdegree, it is expected that the values of vj(t+1) for a larger number of genes are determined at each recursive step than in the basic recursive algorithm, and thus that fewer partial GAPs are examined. This intuition is justified by the following theoretical analysis. Suppose that Boolean networks with maximum indegree K are given uniformly at random. After reordering all genes by decreasing outdegree, the average case time complexity of the algorithm for K = 1 to K = 10 is given in the second row of Table 2.

Theoretical analysis

We assume w.l.o.g. that the indegrees of all genes are K. If the input genes of every gene are randomly selected from all the genes, the outdegrees of the genes approximately follow a Poisson distribution with mean λ, where λ = K holds since the total indegree must equal the total outdegree; thus λ and K are used interchangeably in the following. The probability that a gene has outdegree k is

P(k) = λ^k exp(-λ)/k!.

We reorder the genes by decreasing outdegree. Assume that the first m genes have been tested and that gene m is the uth gene among the genes with outdegree l. Then

m - u = n · Σ_{k=l+1}^{∞} λ^k exp(-λ)/k!,   (10)

n - m = n · Σ_{k=0}^{l} λ^k exp(-λ)/k! - u.   (11)

The total outdegree of these n - m genes is

n · Σ_{k=0}^{l} (λ^k exp(-λ)/k!) · k - u · l,   (12)

and therefore the total outdegree of the first m genes is

λn - n · Σ_{k=0}^{l} (λ^k exp(-λ)/k!) · k + u · l
  = λn - λn · Σ_{k=0}^{l-1} λ^k exp(-λ)/k! + u · l
  = λn - λ · [n - (m - u) - n · λ^l exp(-λ)/l!] + u · l
  = λm + λn · λ^l exp(-λ)/l! + u · (l - λ).   (13)

Thus, since the probability that a specified input gene of vi lies among the first m genes is approximately the total outdegree of the first m genes divided by the total outdegree λn, we have, for i ≤ m,

P[vi(t) ≠ vi(t+1)] = 0.5 · ([λm + λn · λ^l exp(-λ)/l! + u(l - λ)]/(λn))^λ
  = 0.5 · (m/n + λ^l exp(-λ)/l! + (l - λ)u/(λn))^λ.   (14)

The number of recursive calls executed for the first m genes is

f(m) = 2^m · [1 - 0.5 · (m/n + λ^l exp(-λ)/l! + (l - λ)u/(λn))^λ]^m.   (15)

Letting s = m/n, f(m) can be rewritten as

f(m) = [2^s · (1 - 0.5 · (s + λ^l exp(-λ)/l! + (l - λ)u/(λn))^λ)^s]^n
  = [(2 - (s + λ^l exp(-λ)/l! + (l - λ)u/(λn))^λ)^s]^n.   (16)

As in Section 2.2, we estimate the maximum value of g(s), defined here as g(s) = [2 - (s + λ^l exp(-λ)/l! + (l - λ)u/(λn))^λ]^s. We must also consider the relationship between l and λ.

(1) If l > λ, then

g(s) ≤ [2 - (s + λ^l exp(-λ)/l!)^λ]^s = g1(s).   (17)

Since λ^l exp(-λ)/l! tends to zero as l grows, we only need to examine several small values of l. The upper bound of g(s) can be obtained by computing the maximum value of g1(s) with some numerical method. However, we must take care that

P(k ≥ l + 1) ≤ s ≤ P(k ≥ l)   (18)

holds; that is, it should be guaranteed that the maximum value obtained is attained for a gene with outdegree l.

(2) If l = λ, then

g(s) = [2 - (s + λ^l exp(-λ)/l!)^λ]^s.   (19)

As in case (1), we can get an upper bound for g(s).

(3) If l < λ, then

g(s) = [2 - (s + λ^l exp(-λ)/l! + (l - λ)u/(λn))^λ]^s.   (20)

Since gene m is the uth gene among the genes with outdegree l,

u ≤ n · λ^l exp(-λ)/l!.   (21)

Only a few values of l are less than λ, and using a method similar to the one above we get the upper bound

g(s) ≤ [2 - (s + λ^l exp(-λ)/l! + (l - λ) · λ^l exp(-λ)/(λ · l!))^λ]^s.   (22)

It should be noted that l must belong to exactly one of these three cases when g(s) attains its maximum value. Summarizing the three cases above, we obtain an approximation of the average case time complexity of the algorithm. The second row of Table 2 shows the time complexity of the algorithm for K = 1 to K = 10. As in Section 2.2, the complexity increases as K increases.
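The basic recursive algorithm and the ordering-based variant differ only in the order in which genes are assigned. A compact Python sketch (our own rendering with hypothetical names, not code from the paper) makes the pruning explicit:

```python
def singleton_attractors(n, inputs, funcs, order=None):
    """Extend a partial GAP gene by gene in the given order, pruning as soon
    as some assigned gene j with fully assigned inputs violates the fixed
    point condition f_j(inputs of j) == v_j.  inputs[i] lists the input gene
    indices of gene i; funcs[i] is its Boolean function.  order=None gives
    the basic algorithm; passing genes sorted by decreasing outdegree gives
    the outdegree-based variant of Section 2.3."""
    order = list(range(n)) if order is None else order
    state, found = {}, []

    def consistent():
        for j in state:
            if all(g in state for g in inputs[j]):
                if funcs[j](*(state[g] for g in inputs[j])) != state[j]:
                    return False  # partial GAP cannot reach a fixed point
        return True

    def extend(m):
        if m == n:
            found.append(tuple(state[i] for i in range(n)))
            return
        for b in (0, 1):
            state[order[m]] = b
            if consistent():
                extend(m + 1)
        del state[order[m]]

    extend(0)
    return found

# Example: the Table 1 network, where every gene reads all three genes.
TABLE = {
    (0, 0, 0): (0, 1, 1), (0, 0, 1): (1, 0, 1),
    (0, 1, 0): (1, 1, 0), (0, 1, 1): (0, 1, 1),
    (1, 0, 0): (0, 1, 0), (1, 0, 1): (1, 0, 0),
    (1, 1, 0): (1, 0, 1), (1, 1, 1): (1, 1, 0),
}
inputs = [(0, 1, 2)] * 3
funcs = [lambda a, b, c, i=i: TABLE[(a, b, c)][i] for i in range(3)]
result = singleton_attractors(3, inputs, funcs)
```

On this example, the search returns the single fixed point [0, 1, 1].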
We remark that the only difference between this improved algorithm and the basic recursive algorithm is that we sort all the genes in decreasing order of outdegree before executing the main procedure of the basic recursive algorithm.

2.4. Breadth-first search-based ordering algorithm

Breadth-first search (BFS) is a general technique for traversing a graph. It visits all the nodes and edges of a graph in such a manner that every node at depth d (distance from the root node) is visited before any node at depth d + 1. For example, suppose that node a has outgoing edges to nodes b and c, b has outgoing edges to nodes d and e, and c has outgoing edges to nodes f and g, where other edges (e.g., an edge from d to f) may exist. In this case, the nodes are visited in the order a, b, c, d, e, f, g. In this way, all the nodes are totally ordered according to the visiting order. The algorithm implementing BFS can be found in many textbooks; its computation time on a graph with n nodes and m edges is O(n + m). If we use this BFS-based ordering then, as with the outdegree-based ordering, it is expected that the values of vj(t+1) for a larger number of genes are determined at each recursive step, and thus fewer partial GAPs are examined. We can estimate the average case time complexity as follows. Suppose that Boolean networks with maximum indegree K are given uniformly at random. After reordering all genes according to the BFS ordering, the average case time complexity of the algorithm for K = 1 to K = 10 is given in the third row of Table 2.

Theoretical analysis

As in Section 2.3, we assume w.l.o.g. that all n genes have the same indegree K. Suppose that we have tested m genes. Since the input genes of the ith gene must be among the first K · i + 1 genes, whether vi(t+1) = vi(t) or not can be determined before visiting the (K · i + 2)th gene.
Table 2: Theoretical time complexities of the basic, outdegree-based, and BFS-based algorithms.

K            1       2       3       4       5       6       7       8       9       10
Basic        1.23^n  1.35^n  1.43^n  1.49^n  1.53^n  1.57^n  1.60^n  1.62^n  1.65^n  1.67^n
Outdegree    1.09^n  1.19^n  1.27^n  1.34^n  1.41^n  1.45^n  1.48^n  1.51^n  1.56^n  1.57^n
BFS-based    ≈ O(n)  1.16^n  1.27^n  1.35^n  1.41^n  1.45^n  1.50^n  1.53^n  1.56^n  1.58^n

According to the pattern in which the states of the m genes are determined, we consider three cases.

(1) For i ≤ ⌊(m - 1)/K⌋ (where ⌊a⌋ denotes the standard floor function), the states of all the input genes are determined, and gene i must satisfy vi(t+1) = vi(t); hence

P[vi(t) ≠ vi(t+1)] = 0.5,  i ≤ ⌊(m - 1)/K⌋.   (23)-(24)

The probability that the algorithm can continue past all of these genes is

Π_{i=1}^{⌊(m-1)/K⌋} (1 - P[vi(t) ≠ vi(t+1)]) = 0.5^{⌊(m-1)/K⌋}.   (25)

(2) For any gene i between the ⌊m/K⌋th gene and the ⌊(n-1)/K⌋th gene, whether vi(t+1) equals vi(t) can be determined before examining the (m + j·K)th gene, where j = 1, 2, ..., ⌊(n - m)/K⌋. For these genes,

P[vi(t) ≠ vi(t+1)] = 0.5 · (m/(m + i·K))^K,  ⌊m/K⌋ ≤ i ≤ ⌊(n-1)/K⌋,   (26)

and the algorithm can continue past each of them with probability 1 - 0.5 · (m/(m + i·K))^K.

(3) From the ⌊(n-1)/K⌋th gene to the mth gene, the input genes can be any genes; thus

P[vi(t) ≠ vi(t+1)] = 0.5 · (m/n)^K,  ⌊(n-1)/K⌋ ≤ i ≤ m,   (27)

and the algorithm can continue past each of these genes with probability 1 - 0.5 · (m/n)^K.

Then, the total number of recursive calls is

f(m) = 2^m · 0.5^{⌊(m-1)/K⌋} · Π_{i=⌊m/K⌋}^{⌊(n-1)/K⌋} [1 - 0.5 · (m/(m + i·K))^K] · [1 - 0.5 · (m/n)^K]^{m - ⌊(n-1)/K⌋}   (28)

  ≤ 2^m · 0.5^{(m-1)/K} · [1 - 0.5 · (m/n)^K]^{m - (m-1)/K}
  = [2 - (m/n)^K]^{m - (m-1)/K}
  ≈ [2 - (m/n)^K]^{(m/n)(1 - 1/K)·n}.   (29)

Let s = m/n and g(s) = (2 - s^K)^{s(1-1/K)}. Using numerical methods, we can get the maximum value of g.
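The maximization of g(s) can be reproduced with a simple grid search; the following is a sketch (the paper does not state which numerical method it used). Note that the BFS bound (2 - s^K)^{s(1-1/K)} is a positive power of the basic bound (2 - s^K)^s, so both attain their maximum at the same s.

```python
def complexity_base(K, bfs=False, steps=200000):
    """Grid search for the maximum over s in [0, 1] of
    g(s) = (2 - s**K)**s             (basic recursive algorithm), or
    g(s) = (2 - s**K)**(s*(1-1/K))   (BFS-based upper bound).
    The maximum is the base a of the O(a**n) average-case estimate."""
    expo = 1.0 - 1.0 / K if bfs else 1.0
    best = 0.0
    for i in range(steps + 1):
        s = i / steps
        best = max(best, (2.0 - s ** K) ** (s * expo))
    return best
```

Rounding to two digits reproduces the corresponding rows of Table 2, for example 1.35^n (basic) and 1.16^n (BFS) for K = 2.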
From K = 1 to K = 10, the upper bound of the average case time complexity of the algorithm is given in the third row of Table 2. It should be noted that, in the estimation of the upper bound of f(m), we overestimated the probability that genes belong to the second case, and thus the upper bound obtained here is not tight. More accurate time complexities can be estimated from the results of computational experiments.

3. FINDING SINGLETON ATTRACTORS USING FEEDBACK VERTEX SET

In this section, we present algorithms based on the feedback vertex set, together with the results of computational experiments on all of our proposed algorithms for the identification of singleton attractors. The algorithms in this section are based on a simple and interesting property of acyclic Boolean networks, although they can be applied to general Boolean networks with cycles. Though an algorithm based on the feedback vertex set was already proposed in our previous work [24], some improvements (ordering based on connected components and ordering based on outdegree) are introduced in this section.

Algorithm 2:
Input: a Boolean network G(V, F)
Output: an ordered feedback vertex set F' = v1^(FVS), ..., vM^(FVS)
Procedure FindFeedbackVertexSet
  let F' := ∅, M := 1;
  let C := (the set of all connected components of G);
  for each connected component C' ∈ C do
    let V' := (the set of vertices in C');
    while V' ≠ ∅ do
      let vM^(FVS) := (a vertex selected randomly from V');
      remove from V' the vertex vM^(FVS) and all vertices whose truth values can be fixed from F' alone;
      increment M.

3.1. Acyclic network

As will be shown in Section 5, the problem of finding a singleton attractor in a Boolean network is NP-hard. However, we have the following positive result for acyclic networks.

Proposition 1. If the network is acyclic, there exists a unique singleton attractor. Moreover, this unique attractor can be computed in polynomial time.

Proof. In an acyclic network, there exists at least one node without incoming edges.
Such nodes have fixed Boolean values. The values of all the other nodes are uniquely determined from these nodes by the nth time step, in polynomial time. Since the state of no node changes after the nth step, there exists exactly one singleton attractor.

As shown below, this property is also useful for identifying singleton attractors in cyclic networks.

3.2. Algorithm

In the basic recursive algorithm, we must consider truth assignments to all the nodes in the network. On the other hand, Proposition 1 indicates that if the network is acyclic, the truth values of all nodes are uniquely determined by the values of the nodes with no incoming edges. Thus, it is enough to examine truth assignments only to a set of nodes whose removal makes the network acyclic, since once their values are fixed, the remaining nodes behave as in an acyclic network. Such a set of nodes is called a feedback vertex set (FVS). The problem of finding a minimum feedback vertex set is known to be NP-hard [26]. Some algorithms that approximate the minimum feedback vertex set have been developed [27], but such algorithms are usually complicated. Thus, we use a simple greedy algorithm (shown in Algorithm 2) for finding a (not necessarily minimum) feedback vertex set; a similar algorithm was already presented in [24].

Algorithm 3:
Input: a Boolean network G(V, F) and an ordered feedback vertex set F' = v1^(FVS), ..., vM^(FVS)
Output: all the singleton attractors
Initialize m := 1;
Procedure IdentSingletonAttractorWithFVS(v, m)
  if m = M + 1 then
    Output v1(t), v2(t), ..., vn(t), return;
  for b = 0 to 1 do
    vm^(FVS)(t) := b;
    propagate the truth values of v1^(FVS)(t), ..., vm^(FVS)(t) to all determinable entries of v(t) outside F';
    compute v1^(FVS)(t + 1), ..., vm^(FVS)(t + 1) from v(t);
    if it is found that vj^(FVS)(t + 1) ≠ vj^(FVS)(t) for some j ≤ m then
      continue;
    else
      IdentSingletonAttractorWithFVS(v, m + 1);
  return.
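The propagation underlying Proposition 1 (and the propagation step of the FVS-based procedure) can be sketched as follows; the function name and interface are ours, not the paper's:

```python
def acyclic_fixed_point(n, inputs, funcs):
    """Unique singleton attractor of an ACYCLIC Boolean network: nodes with
    no inputs carry constant functions, and every other node's value is
    propagated from its inputs.  In an acyclic network each pass over the
    genes fixes at least one new value, so at most n passes are needed."""
    value = {}
    while len(value) < n:
        progressed = False
        for i in range(n):
            if i not in value and all(g in value for g in inputs[i]):
                value[i] = funcs[i](*(value[g] for g in inputs[i]))
                progressed = True
        if not progressed:
            raise ValueError("the network contains a cycle")
    return tuple(value[i] for i in range(n))
```

For instance, the chain v1 = 1 (constant), v2 = NOT v1, v3 = v1 AND v2 has the unique singleton attractor (1, 0, 0).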
In our proposed algorithm, the nodes in the FVS are ordered according to the connected components of the original network in order to reduce the number of iterations; in other words, nodes in the same connected component are ordered consecutively. We then modify the procedure IdentSingletonAttractor(v, m) for the FVS as shown in Algorithm 3. Furthermore, we can combine the outdegree-based ordering with the FVS: in FindFeedbackVertexSet, instead of selecting a node randomly from a connected component, we select the node with the maximum outdegree in that component.

3.3. Computational experiments

In this section, we evaluate the proposed algorithms by performing a number of computational experiments on both random networks and scale-free networks [28].

3.3.1. Experiments on random networks

For each K (K = 1, ..., 10) and each n (n = 1, ..., 20), we randomly generated 10 000 Boolean networks with maximum indegree K and took the average values.

Table 3: Empirical time complexities of the basic, outdegree, BFS, feedback vertex set, and FVS + outdegree algorithms.

K                1       2       3       4       5       6       7       8       9       10
Basic            1.27^n  1.39^n  1.46^n  1.53^n  1.57^n  1.60^n  1.63^n  1.67^n  1.69^n  1.70^n
Outdegree        1.14^n  1.23^n  1.30^n  1.37^n  1.42^n  1.47^n  1.51^n  1.54^n  1.56^n  1.59^n
BFS              1.09^n  1.16^n  1.24^n  1.31^n  1.37^n  1.42^n  1.45^n  1.49^n  1.52^n  1.53^n
Feedback         1.10^n  1.28^n  1.39^n  1.47^n  1.53^n  1.56^n  1.60^n  1.64^n  1.66^n  1.68^n
FVS + Outdegree  1.05^n  1.13^n  1.21^n  1.29^n  1.35^n  1.41^n  1.46^n  1.49^n  1.52^n  1.55^n

Figure 2: Base of the empirical time complexity (the value a of a^n) of the proposed algorithms for finding singleton attractors.

All of these computational experiments were done on a PC with Opteron
2.4 GHz CPUs and 4 GB RAM running under the Linux (version 2.6.9) operating system, where the gcc compiler (version 3.4.5) was used with optimization option -O3. Table 3 shows the empirical time complexity of each proposed method for each K. We used a tool for GNUPLOT to fit the function b · a^n to the experimental results; the tool uses the nonlinear least-squares (NLLS) Marquardt-Levenberg algorithm. Figure 2 is a graphical representation of the results of Table 3. It can be seen that the FVS + Outdegree method is the fastest in most cases. Figure 3 shows, as an example, the average number of iterations with respect to the number of genes for K = 2 (with fitted complexities Basic O(1.39^n), Outdegree O(1.23^n), BFS O(1.16^n), Feedback O(1.28^n), and FVS + outdegree O(1.13^n)). Figure 4 shows the average computation time with respect to the number of genes for K = 2; similar results were obtained for other values of K. The time complexities estimated from the results of the computational experiments differ slightly from those obtained by theoretical analysis. However, this is reasonable since, in our theoretical analysis, we assumed that the number of genes is very large, we made some approximations, and there were also small numerical errors in computing the maximum values of g(s).

3.3.2. Experiments on scale-free networks

It is known that many real biological networks have the scale-free property (i.e., the degree distribution approximately follows a power law) [28]. Furthermore, it has been observed that in gene regulatory networks the outdegree distribution follows a power law while the indegree distribution follows a Poisson distribution [29]. Thus, we also examined networks with scale-free topology. We generated scale-free networks with a power-law outdegree distribution (∝ k^-2) and a Poisson indegree distribution (with average indegree 2) as follows.
We first choose the number of outputs of each gene from a power-law distribution; that is, gene vi has Li outputs, where the Li are drawn from a power-law distribution. Then, we choose the Li outputs of each gene vi randomly, with uniform probability, from the n genes. Once each gene has been assigned a set of outputs, the inputs of all genes are fully determined, because vj is an input of vi if and only if vi is an output of vj. Since the Li output genes are chosen randomly for each gene vi, the indegree distribution follows a Poisson distribution.

Figure 5 compares the outdegree-based algorithm, the BFS-based algorithm, and the FVS + Outdegree algorithm on scale-free networks generated as above and on random networks with constant indegree 2, where the average CPU time was taken over 100 networks for each case and a PC with Xeon 5160 3 GHz CPUs and 8 GB RAM was used.

Figure 4: Elapsed time (in seconds) of the proposed algorithms for random networks with K = 2.

Figure 5: Elapsed time (in seconds) of some of the proposed algorithms for random networks with K = 2 (Fix) and scale-free networks (PS).

The result is interesting: we observed that all the algorithms work much faster for scale-free networks than for random networks. This result is reasonable because scale-free networks have a much larger number of high-degree nodes than random networks, and thus heuristics based on the outdegree-based ordering or the BFS-based ordering work efficiently. The average case time complexities estimated from this experimental result are as follows: O(1.19^n) versus O(1.09^n) for the outdegree-based algorithm, O(1.12^n) versus O(1.09^n) for the BFS-based algorithm, and O(1.12^n) versus O(1.05^n) for the FVS + Outdegree algorithm, where (random) versus (scale-free) is shown in each case. The average case complexities for random networks are better than those in Table 3 and closer to the theoretical time complexities shown in Table 2. These results are reasonable because networks with a much larger number of nodes were examined in this case.

It should be noted that Devloo et al. proposed constraint-programming-based methods for finding steady states in some kinds of biological networks [20]. Their methods use a backtracking technique, which is very close to our proposed recursive algorithms, and may also be applied to Boolean networks. Their methods were applied to networks with up to several thousand nodes with indegree = outdegree = 2. Since different types of networks were used, our proposed methods cannot be directly compared with theirs. Their methods include various heuristics and may be more useful in practice than our proposed methods; however, no theoretical analysis has been performed on the computational complexity of their methods.

4. FINDING SMALL ATTRACTORS

In this section, we modify the gene-ordering-based algorithms presented in Section 2 to find cyclic attractors with short periods. We also perform a theoretical analysis and computational experiments.

4.1. Modifications of algorithms

The basic idea of our modifications is very simple.

Algorithm 4:
Input: a Boolean network G(V, F) and a period p
Output: all of the small attractors with period p
Initialize m := 1;
Procedure IdentSmallAttractor(v, m)
  if m = n + 1 then
    Output v1(t), v2(t), ..., vn(t), return;
  for b = 0 to 1 do
    vm(t) := b;
    for p' = 0 to p - 1 do
      compute v(t + p' + 1) from v(t + p');
    if it is found that vj(t + p) ≠ vj(t) for some j ≤ m then
      continue;
    else
      IdentSmallAttractor(v, m + 1);
  return.
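Concretely, the modified check asks whether applying the update p times returns the starting state. On a small network this can be verified exhaustively; the sketch below (our own encoding of the Table 1 example; exponential enumeration, for illustration only) lists all states lying on an attractor whose period divides p:

```python
from itertools import product

# Truth table of the Table 1 example network: (v1, v2, v3) -> next state.
TABLE = {
    (0, 0, 0): (0, 1, 1), (0, 0, 1): (1, 0, 1),
    (0, 1, 0): (1, 1, 0), (0, 1, 1): (0, 1, 1),
    (1, 0, 0): (0, 1, 0), (1, 0, 1): (1, 0, 0),
    (1, 1, 0): (1, 0, 1), (1, 1, 1): (1, 1, 0),
}

def states_with_period_dividing(step, n, p):
    """All states v such that applying step p times returns v, i.e. states
    on an attractor whose period divides p (the condition being tested)."""
    result = []
    for v in product((0, 1), repeat=n):
        w = v
        for _ in range(p):
            w = step(w)
        if w == v:
            result.append(v)
    return result
```

For p = 1 only the singleton attractor [0, 1, 1] is returned; for p = 4 the four states of the period-4 attractor appear as well.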
Instead of checking whether or not vi(t+1) = vi(t) holds, we check whether or not vi(t+p) = vi(t) holds. The pseudocode of the modified basic recursive algorithm is given in Algorithm 4. The procedure computes v(t+p) from the truth assignment on the first m genes of v(t). The values of some genes of v(t+p) may not be determined, because these genes may also depend on the last n - m genes of v(t). If, for each j = 1, ..., m, either vj(t+p) = vj(t) holds or the value of vj(t+p) is not determined, the algorithm continues to the next recursive step. As in Section 2, we can combine this algorithm with the outdegree-based ordering and the BFS-based ordering.

In these algorithms, it is assumed that the period p is given in advance. However, the algorithms can be modified to identify all cyclic attractors with period at most P: we simply execute the algorithms for each of p = 1, 2, ..., P. Though this method does not seem very practical, its theoretical time complexity is still better than O(2^n) for small P. Suppose that the average case time complexity for period p is O(T_p(n)). Then this simple method takes O(Σ_{p=1}^{P} T_p(n)) ≤ O(P · T_P(n)) time, which is still faster than O(2^n) if T_P(n) = o(2^n) and P is bounded by some polynomial of n.

4.2. Theoretical analysis

Before giving the experimental results, we perform a theoretical analysis of the modified basic recursive algorithm. Suppose that Boolean networks with maximum indegree K are given uniformly at random. Then the average case time complexity of the modified basic recursive algorithm for periods p = 1 to 5 and K = 1 to K = 10 is given in Table 4.

Let the period of the attractor be p. We assume w.l.o.g., as before, that the indegree of all genes is K. As in Section 2.2, we consider the first m genes among all n genes. Given the states of all m genes at time t, we need to know the states of these genes at time t + p. The probability that a violation vi(t) ≠ vi(t+p) is detected for each i ≤ m is approximated by

P[vi(t) ≠ vi(t+p)] = 0.5 · (m/n)^K · (m/n)^{K^2} ··· (m/n)^{K^p},   (30)

where the factor (m/n)^K means that the K input genes of gene vi at time t+p-1 are among the first m genes, the factor (m/n)^{K^2} means that at time t+p-2 the input genes of those K input genes are also among the first m genes, and so on. Then, the probability that the algorithm examines some specific truth assignment on the m genes is approximately given by

[1 - P[vi(t) ≠ vi(t+p)]]^m = [1 - 0.5 · (m/n)^K · (m/n)^{K^2} ··· (m/n)^{K^p}]^m.   (31)

Therefore, the total number of recursive calls executed for these m genes is

f(m) = 2^m · [1 - 0.5 · (m/n)^K · (m/n)^{K^2} ··· (m/n)^{K^p}]^m.   (32)

As in Section 2.2, we can compute the maximum value of f(m). The results are given in Table 4.

Figure 6: Base of the empirical time complexity (the value a of a^n) of the proposed algorithms for finding cyclic attractors with period 2.

Figure 7: Base of the empirical time complexity (the value a of a^n) of the proposed algorithms for finding cyclic attractors with period 3.

4.3. Computational experiments

Computational experiments were also performed to examine the time complexity of the algorithms for finding small attractors. The environment and parameters of the experiments were the same as in Section 3.3.1.
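With s = m/n, the maximization of f(m) in (32) reduces to maximizing (2 - s^E)^s, where E = K + K^2 + ··· + K^p. The following grid-search sketch (our own code, mirroring the earlier numerical estimate for Table 2) reproduces entries of Table 4:

```python
def small_attractor_base(K, p, steps=200000):
    """Grid search for the maximum over s in [0, 1] of g(s) = (2 - s**E)**s,
    where E = K + K**2 + ... + K**p comes from rewriting eq. (32) with
    s = m/n:  f(m) = (2**s * (1 - 0.5*s**E)**s)**n = ((2 - s**E)**s)**n."""
    E = sum(K ** j for j in range(1, p + 1))
    best = 0.0
    for i in range(steps + 1):
        s = i / steps
        best = max(best, (2.0 - s ** E) ** s)
    return best
```

For p = 1 this reduces to the basic bound of Section 2.2; rounding to two digits reproduces entries of Table 4 such as 1.57^n for K = 2, p = 2.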
Though FVS-based algorithms can also be modified for small attractors, they are not efficient for p > 1. Therefore, we only examined the gene-ordering-based algorithms. Figures 6 to 8 show the time complexity of the algorithms estimated from the results of computational experiments for p = 2 to p = 4 and for K = 1 to K = 10. When K is comparatively small, the outdegree-based ordering method is the most efficient. But when K increases, all three methods perform the same, which is equivalent to the worst case in finding the attractors, that is, O(2^n). The results obtained from the numerical experiments for the modified basic recursive algorithm are consistent with the theoretical results presented in Section 4.2.

Table 4: Theoretical time complexities for the modified basic algorithm for finding small attractors with period p.

K  | p = 1  | p = 2  | p = 3  | p = 4  | p = 5
1  | 1.23^n | 1.35^n | 1.43^n | 1.49^n | 1.53^n
2  | 1.35^n | 1.57^n | 1.72^n | 1.83^n | 1.90^n
3  | 1.43^n | 1.70^n | 1.86^n | 1.94^n | 1.97^n
4  | 1.49^n | 1.78^n | 1.92^n | 1.97^n | 1.99^n
5  | 1.53^n | 1.83^n | 1.95^n | 1.99^n | 1.99^n
6  | 1.57^n | 1.87^n | 1.97^n | 1.99^n | 1.99^n
7  | 1.60^n | 1.89^n | 1.97^n | 1.99^n | 1.99^n
8  | 1.62^n | 1.91^n | 1.98^n | 1.99^n | 1.99^n
9  | 1.65^n | 1.92^n | 1.99^n | 1.99^n | 1.99^n
10 | 1.67^n | 1.93^n | 1.99^n | 1.99^n | 1.99^n

Figure 8: Base a of the empirical time complexity a^n of the proposed algorithms (basic, outdegree-based, and BFS-based) for finding cyclic attractors with period 4, as a function of the indegree K.

5. HARDNESS RESULT

As mentioned in Section 1, Akutsu et al. [24] and Milano and Roli [25] showed that finding a singleton attractor (or an attractor with the shortest period) is NP-hard. Those results justify our proposed algorithms, which take exponential time in the worst case (and even in the average case). However, the proof is omitted in [24], and the proof in [25] is rather complicated: the Boolean functions assigned in the transformed Boolean network are much longer than those in the original satisfiability problem. Here we give a simpler and complete proof.

Theorem 1. Finding an attractor with the shortest period is NP-hard.

Proof. We show that deciding whether or not there exists a singleton attractor is NP-hard, from which the theorem follows, since a singleton attractor is an attractor with the shortest period (if one exists).

We use a simple polynomial-time reduction from 3SAT [26] to the singleton attractor problem. Let x_1, ..., x_N be Boolean variables (i.e., 0-1 variables). Let c_1, ..., c_L be a set of clauses over x_1, ..., x_N, where each clause is a logical OR of at most three literals. It should be noted that a literal is a variable or its negation (logical NOT). Then, 3SAT is the problem of asking whether or not there exists an assignment of 0-1 values to x_1, ..., x_N which satisfies all the clauses (i.e., the values of all clauses are 1).

From an instance of 3SAT, we construct an instance of the singleton attractor problem. We let the set of vertices (nodes) be V = {v_1, ..., v_{N+L}}, where each v_i for i = 1, ..., N corresponds to x_i and each v_{N+i} for i = 1, ..., L corresponds to c_i. For each v_i such that i ≤ N, we make the following assignment:

v_i(t + 1) = v_i(t).   (33)

Suppose that f_i(x_{i1}, x_{i2}, x_{i3}) is the Boolean function assigned to c_i in 3SAT. Then, for each v_{N+i}, we assign the following function:

v_{N+i}(t + 1) = f_i(v_{i1}(t), v_{i2}(t), v_{i3}(t)) ∨ ¬v_{N+i}(t).   (34)

Figure 9 is an example of the reduction from 3SAT to the singleton attractor problem.

Here, we show that the 3SAT instance is satisfiable if and only if there exists a singleton attractor. Suppose that there exists an assignment of Boolean values b_1, ..., b_N to x_1, ..., x_N which satisfies all clauses c_1, ..., c_L. Then, we let

v_i(0) = b_i for i = 1, ..., N,   v_i(0) = 1 for i = N + 1, ..., N + L.   (35)

It is straightforward to see that v(0) = (v_1(0), ..., v_{N+L}(0)) is a singleton attractor (i.e., v(0) = v(1)).

Conversely, suppose that there exists a singleton attractor. Let v(0) = (v_1(0), .
. . , v_{N+L}(0)) be the state of the singleton attractor. Then v_{N+i}(0) must be 1 for all i = 1, ..., L: otherwise (i.e., v_{N+i}(0) = 0), v_{N+i}(1) would be 1, which contradicts the assumption that v(0) is a singleton attractor. Furthermore, f_i(v_{i1}(0), v_{i2}(0), v_{i3}(0)) = 1 must hold: otherwise, v_{N+i}(1) would be 0, since v_{N+i}(0) = 1 and f_i(v_{i1}(0), v_{i2}(0), v_{i3}(0)) = 0 would hold, again contradicting the assumption that v(0) is a singleton attractor. Therefore, by assigning v_i(0) to x_i for i = 1, ..., N, all the clauses are satisfied. Since the reduction can trivially be done in polynomial time, we have the theorem.

Figure 9: Example of a reduction from 3SAT to the singleton attractor problem. An instance of 3SAT {x1 ∨ x2 ∨ x3, x1 ∨ x3 ∨ x4, x2 ∨ x3 ∨ x4} is transformed into a Boolean network on nodes v1, ..., v7.

6. CONCLUSION

In this paper, we have presented fast algorithms for identifying singleton attractors and cyclic attractors with short periods. The proposed algorithms are much faster than the naive enumeration-based algorithm. However, the proposed algorithms cannot be applied to random networks with several hundred or more genes, and they may not be faster than the constraint-programming-based algorithms in [20]. The most important point of this work, however, is that the average-case time complexities of the ordering-based algorithms are analyzed and shown to be better than O(2^n). We hope that our work stimulates further development of faster algorithms and deeper theoretical analysis.
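The reduction in the proof of Theorem 1 is easy to exercise on small instances. The sketch below is our illustration (the signed-index clause encoding is our own convention, not the paper's): variable nodes copy themselves, each clause node updates to f_i(v(t)) OR NOT v_{N+i}(t), and a singleton attractor is sought by brute force; one exists exactly when the 3SAT instance is satisfiable.

```python
from itertools import product

def has_singleton_attractor(num_vars, clauses):
    """Clauses are lists of signed 1-based variable indices,
    e.g. [1, -2, 3] encodes x1 OR (NOT x2) OR x3."""
    n = num_vars + len(clauses)

    def step(state):
        nxt = list(state[:num_vars])            # v_i(t+1) = v_i(t)
        for i, clause in enumerate(clauses):
            val = any(state[abs(l) - 1] == (1 if l > 0 else 0) for l in clause)
            # clause node: f_i(v(t)) OR NOT v_{N+i}(t)
            nxt.append(1 if (val or state[num_vars + i] == 0) else 0)
        return tuple(nxt)

    return any(step(s) == s for s in product((0, 1), repeat=n))

print(has_singleton_attractor(2, [[1, 2], [-1, 2]]))   # True: satisfied by x2 = 1
print(has_singleton_attractor(1, [[1], [-1]]))         # False: unsatisfiable
```

In any fixed state, a clause node at value 0 flips to 1 on the next step, and a clause node at 1 stays at 1 only if its clause evaluates to 1, which is exactly the argument used in the proof.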
It is interesting that the results of the computational experiments suggest that our proposed algorithms are much faster for scale-free networks than for random networks. However, we have not yet performed a theoretical analysis for scale-free networks. Thus, theoretical analysis of the average-case time complexity for scale-free networks (precisely, networks with a power-law outdegree distribution and a Poisson indegree distribution) is left as future work.

Although this paper focused on the Boolean network as a model of biological networks, the techniques proposed here may be useful for designing algorithms for finding steady states in other models and for the theoretical analysis of such algorithms. For instance, Mochizuki performed a theoretical analysis of the number of steady states in some continuous biological networks that are based on nonlinear differential equations [21]. However, the core part of that analysis is done in a combinatorial manner and is very close to the analysis for Boolean networks. Thus, it may be possible to develop fast algorithms for finding steady states in such continuous network models. Application and extension of the proposed techniques to other types of biological networks are important future research topics.

Finally, it is interesting to compare the complexities of four problems for three classes of networks: simulation of network behavior (almost trivial), identification of attractors (this paper), identification of networks [30, 31], and finding control strategies [32], for trees, acyclic graphs, and general graphs. These four problems constitute a more complete picture of modeling genetic regulatory networks with a Boolean network.

Table 5: Comparison of time complexities for simulation of network behavior, identification of attractors, finding control strategies, and identification of networks. P means that the problem can be solved in polynomial time.

Problem                                      | Tree    | Acyclic graph | General graph
Simulation of network                        | P       | P             | P
Identification of attractor                  | P       | P             | NP-hard
Finding control strategies                   | P       | NP-hard       | NP-hard
Identification of network                    | NP-hard | NP-hard       | NP-hard
Identification of network (bounded indegree) | P       | P             | P
Simulation of a Boolean network is a trivial but important step in analyzing the model. Attractors describe the long-run behavior of the Boolean network system. Finding a control strategy asks how the system can be made to evolve in a desired way. Identification of genetic regulatory networks is the first step in obtaining the model from data. Table 5 shows the complexities of these problems for several network structures. Although much work has been done on these problems, computational complexity remains an important issue. It is also left as future work to study how to cope with the high computational complexity (e.g., NP-hardness) of these problems.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for helpful comments. TA was partially supported by a Grant-in-Aid “Systems Genomics” from MEXT, Japan, and by the Cell Array Project from NEDO, Japan. WKC was partially supported by the Hung Hing Ying Physical Research Fund and HKU GRCC Grants nos. 10206647, 10206483, and 10206147. MKN was partially supported by RGC 7046/03P, 7035/04P, 7035/05P, and HKBU FRGs. S.-Q. Zhang and M. Hayashida contributed equally to this work.

REFERENCES

[1] J. E. Celis, M. Kruhøffer, I. Gromova, et al., “Gene expression profiling: monitoring transcription and translation products using DNA microarrays and proteomics,” FEBS Letters, vol. 480, no. 1, pp. 2–16, 2000.
[2] T. R. Hughes, M. Mao, A. R. Jones, et al., “Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer,” Nature Biotechnology, vol. 19, no. 4, pp. 342–347, 2001.
[3] R. J. Lipshutz, S. P. A. Fodor, T. R. Gingeras, and D. J. Lockhart, “High density synthetic oligonucleotide arrays,” Nature Genetics, vol. 21, supplement 1, pp. 20–24, 1999.
[4] D. J. Lockhart and E. A. Winzeler, “Genomics, gene expression and DNA arrays,” Nature, vol. 405, no. 6788, pp. 827–836, 2000.
[5] H. D. Jong, “Modeling and simulation of genetic regulatory systems: a literature review,” Journal of Computational Biology, vol.
9, no. 1, pp. 67–103, 2002.
[6] L. Glass and S. A. Kauffman, “The logical analysis of continuous, nonlinear biochemical control networks,” Journal of Theoretical Biology, vol. 39, no. 1, pp. 103–129, 1973.
[7] S. A. Kauffman, “Metabolic stability and epigenesis in randomly constructed genetic nets,” Journal of Theoretical Biology, vol. 22, no. 3, pp. 437–467, 1969.
[8] S. A. Kauffman, “Homeostasis and differentiation in random genetic control networks,” Nature, vol. 224, no. 215, pp. 177–178, 1969.
[9] S. A. Kauffman, “The large scale structure and dynamics of genetic control circuits: an ensemble approach,” Journal of Theoretical Biology, vol. 44, no. 1, pp. 167–190, 1974.
[10] S. Huang, “Gene expression profiling, genetic networks, and cellular states: an integrating concept for tumorigenesis and drug discovery,” Journal of Molecular Medicine, vol. 77, no. 6, pp. 469–480, 1999.
[11] S. A. Kauffman, The Origins of Order: Self-Organization and Selection in Evolution, Oxford University Press, New York, NY, USA, 1993.
[12] R. Somogyi and C. Sniegoski, “Modeling the complexity of genetic networks: understanding multigenic and pleiotropic regulation,” Complexity, vol. 1, no. 6, pp. 45–63, 1996.
[13] I. Shmulevich and W. Zhang, “Binary analysis and optimization-based normalization of gene expression data,” Bioinformatics, vol. 18, no. 4, pp. 555–565, 2002.
[14] D. Thieffry, A. M. Huerta, E. Pérez-Rueda, and J. Collado-Vides, “From specific gene regulation to genomic networks: a global analysis of transcriptional regulation in Escherichia coli,” BioEssays, vol. 20, no. 5, pp. 433–440, 1998.
[15] S. Huang, “Cell state dynamics and tumorigenesis in Boolean regulatory networks,” InterJournal Genetics, MS: 416, http://www.interjournal.org/
[16] B. Drossel, “Number of attractors in random Boolean networks,” Physical Review E, vol. 72, no. 1, Article ID 016110, 5 pages, 2005.
[17] B. Drossel, T. Mihaljev, and F.
Greil, “Number and length of attractors in a critical Kauffman model with connectivity one,” Physical Review Letters, vol. 94, no. 8, Article ID 088701, 4 pages, 2005.
[18] B. Samuelsson and C. Troein, “Superpolynomial growth in the number of attractors in Kauffman networks,” Physical Review Letters, vol. 90, no. 9, Article ID 098701, 4 pages, 2003.
[19] J. E. S. Socolar and S. A. Kauffman, “Scaling in ordered and critical random Boolean networks,” Physical Review Letters, vol. 90, no. 6, Article ID 068702, 4 pages, 2003.
[20] V. Devloo, P. Hansen, and M. Labbé, “Identification of all steady states in large networks by logical analysis,” Bulletin of Mathematical Biology, vol. 65, no. 6, pp. 1025–1051, 2003.
[21] A. Mochizuki, “An analytical study of the number of steady states in gene regulatory networks,” Journal of Theoretical Biology, vol. 236, no. 3, pp. 291–310, 2005.
[22] R. Pal, I. Ivanov, A. Datta, M. L. Bittner, and E. R. Dougherty, “Generating Boolean networks with a prescribed attractor structure,” Bioinformatics, vol. 21, no. 21, pp. 4021–4025, 2005.
[23] X. Zhou, X. Wang, R. Pal, I. Ivanov, M. Bittner, and E. R. Dougherty, “A Bayesian connectivity-based approach to constructing probabilistic gene regulatory networks,” Bioinformatics, vol. 20, no. 17, pp. 2918–2927, 2004.
[24] T. Akutsu, S. Kuhara, O. Maruyama, and S. Miyano, “A system for identifying genetic networks from gene expression patterns produced by gene disruptions and overexpressions,” Genome Informatics, vol. 9, pp. 151–160, 1998.
[25] M. Milano and A. Roli, “Solving the satisfiability problem through Boolean networks,” in Proceedings of the 6th Congress of the Italian Association for Artificial Intelligence on Advances in Artificial Intelligence, vol. 1792 of Lecture Notes in Artificial Intelligence, pp. 72–83, Springer, Bologna, Italy, September 1999.
[26] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H.
Freeman, New York, NY, USA, 1979.
[27] G. Even, J. Naor, B. Schieber, and M. Sudan, “Approximating minimum feedback sets and multicuts in directed graphs,” Algorithmica, vol. 20, no. 2, pp. 151–174, 1998.
[28] A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.
[29] N. Guelzim, S. Bottani, P. Bourgine, and F. Képès, “Topological and causal structure of the yeast transcriptional regulatory network,” Nature Genetics, vol. 31, no. 1, pp. 60–63, 2002.
[30] T. Akutsu, S. Miyano, and S. Kuhara, “Identification of genetic networks from a small number of gene expression patterns under the Boolean network model,” in Proceedings of the 4th Pacific Symposium on Biocomputing (PSB ’99), vol. 4, pp. 17–28, Big Island of Hawaii, Hawaii, USA, January 1999.
[31] T. Akutsu, S. Kuhara, O. Maruyama, and S. Miyano, “Identification of genetic networks by strategic gene disruptions and gene overexpressions under a Boolean model,” Theoretical Computer Science, vol. 298, no. 1, pp. 235–251, 2003.
[32] T. Akutsu, M. Hayashida, W.-K. Ching, and M. K. Ng, “Control of Boolean networks: hardness results and algorithms for tree structured networks,” Journal of Theoretical Biology, vol. 244, no. 4, pp. 670–679, 2007.
Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 97356, 8 pages
doi:10.1155/2007/97356

Research Article

Fixed Points in Discrete Models for Regulatory Genetic Networks

Dorothy Bollman,1 Omar Colón-Reyes,1 and Edusmildo Orozco2

1 Department of Mathematical Sciences, University of Puerto Rico, Mayaguez, PR 00681, USA
2 Department of Computer Science, University of Puerto Rico, Río Piedras, San Juan, PR 00931-3355, USA

Received 1 July 2006; Revised 22 November 2006; Accepted 20 February 2007

Recommended by Tatsuya Akutsu

It is desirable to have efficient mathematical methods to extract information about regulatory interactions between genes from repeated measurements of gene transcript concentrations. One piece of information of interest is when the dynamics reaches a steady state. In this paper we develop tools that enable the detection of steady states that are modeled by fixed points in discrete finite dynamical systems. We discuss two algebraic models, a univariate model and a multivariate model. We show that these two models are equivalent and that one can be converted to the other by means of a discrete Fourier transform. We give a new, more general definition of a linear finite dynamical system and we give a necessary and sufficient condition for such a system to be a fixed point system, that is, all cycles are of length one. We show how this result for generalized linear systems can be used to determine when certain nonlinear systems (monomial dynamical systems over finite fields) are fixed point systems. We also show how it is possible to determine in polynomial time when an ordinary linear system (defined over a finite field) is a fixed point system. We conclude with a necessary condition for a univariate finite dynamical system to be a fixed point system.

Copyright © 2007 Dorothy Bollman et al.
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Finite dynamical systems are dynamical systems on finite sets. Examples include cellular automata and Boolean networks (e.g., [1]), with applications in many areas of science and engineering (e.g., [2, 3]) and, more recently, in computational biology (e.g., [4–6]). A common question in all of these applications is how to analyze the dynamics of the models without enumerating all state transitions. This paper presents partial solutions to this problem.

Because of technological advances such as DNA microarrays, it is possible to measure gene transcripts from a large number of genes. It is desirable to have efficient mathematical methods to extract information about regulatory interactions between genes from repeated measurements of gene transcript concentrations. One piece of information of interest is when the dynamics reaches a steady state. In the words of Fuller (see [7]): “this paradigm closely parallels the goal of professionals who aim to understand the flow of molecular events during the progression of an illness and to predict how the disease will develop and how the patient will respond to certain therapies.”

The work of Fuller et al. [7] serves as an example. When the gene expression profiles of human brain tumors were analyzed, the tumors were divided into three classes: high grade, medium grade, and low grade. A key gene expression event was identified, namely a high expression of insulin-like growth factor binding protein 2 (IGFBP2) occurring only in high-grade brain tumors. It can be assumed that gene expression events were initiated at some stage in low-grade tumors and may have led to the state in which IGFBP2 is activated. The activation of IGFBP2 can be understood to be a steady state.
If we model the kinetics and construct a model that reconstructs the genetic regulatory network that is activated during the brain tumor process, then we may be able to predict the convergence of events that lead to the activation of IGFBP2. In the same way, we also want to know what happens in the next step following the activation of IGFBP2. Our goal is to develop tools that enable this type of analysis in the case of modeling gene regulatory networks by means of discrete dynamical systems.

The use of polynomial dynamical systems to model biological phenomena, in particular gene regulatory networks, has proved to be as valid as the use of continuous models. Laubenbacher and Stigler (see [6]) point out, for example, that most ordinary differential equation models cannot be solved analytically and that numerical solutions of such time-continuous systems necessitate approximations by time-discrete systems, so that ultimately the two types of models are not that different. Once a gene regulatory network is modeled, in our case over finite fields or over finitely generated modules, we obtain a finite dynamical system. Our goal is to determine whether the dynamical system represents a steady-state gene regulatory network (i.e., whether every state eventually enters a steady state). This is a crucial task. Shmulevich et al. (see [8]) have shown that the steady-state distribution is necessary in order to compute the long-term influence, which is a measure of a gene's impact on other genes.

The rest of the paper is organized as follows. In Section 2 we give some basic definitions and facts about finite dynamical systems and their associated state spaces. In Section 3 we discuss multivariate and univariate finite field models for genetic networks and show that they are equivalent; each of the models can be converted to the other by a discrete Fourier transform. Section 4 is devoted to fixed point systems.
We give a new definition of linear finite dynamical systems and give necessary and sufficient conditions for such a system to be a fixed point system. We review results concerning monomial fixed point systems and show how our results concerning linear systems can be used to determine when a monomial finite dynamical system over an arbitrary finite field is a fixed point system. We show how fixed points can be determined in the univariate model by solving a polynomial equation over a finite field, and we give a necessary condition for a finite dynamical system to be a fixed point system. Finally, in Section 5 we discuss some implementation issues.

2. PRELIMINARIES

A finite dynamical system (fds) is an ordered pair (X, f) where X is a finite set and f is a function that maps X into itself, that is, f : X → X. The state space of an fds (X, f) is a digraph (i.e., directed graph) whose nodes are labeled by the elements of X and whose edges consist of all ordered pairs (x, y) ∈ X × X such that f(x) = y. We say that two finite dynamical systems are isomorphic if there exists a graph isomorphism between their state spaces.

Let G = (V, E) be a digraph. A path in G of the form (v_1, v_2), (v_2, v_3), ..., (v_{n−1}, v_n), (v_n, v_1), where v_1, v_2, ..., v_n are distinct members of V, is a cycle of length n. We define a tree to be a digraph T = (V, E) which has a unique node v_0, called the root of T, such that (a) (v_0, v_0) ∈ E, (b) for any node v ≠ v_0, there is a path from v to v_0, and (c) T has no “semicycles” (i.e., alternating sequences of nodes and edges v_1, x_1, v_2, ..., x_n, v_{n+1}, n ≠ 0, where v_1 = v_{n+1} and each x_i is either (v_i, v_{i+1}) or (v_{i+1}, v_i)) other than the trivial one (v_0, (v_0, v_0), v_0). (Such a tree with the edge (v_0, v_0) deleted is sometimes called an “in-tree” with “sink” v_0.)

Let T be a tree, let nT = ∪_{i=1}^{n} T_i be the union of n copies T_1, T_2, ..., T_n of T, and let r_i be the root of T_i.
Define T(n) to be the digraph obtained from nT by deleting the edges (r_i, r_i), i = 1, 2, ..., n, and adjoining the edges (r_i, r_j), i, j = 1, 2, ..., n, where j = i + 1 mod n. We call T(n) an n-cycled tree. Note that, by definition, every tree is an n-cycled tree with n = 1. Note also that, by definition, the digraph T_r = ({r}, {(r, r)}) consisting of a single trivial cycle is a tree, and hence every cycle of length n is isomorphic to T_r(n) and hence is an n-cycled tree.

The product of two digraphs G_1 = (V_1, E_1) and G_2 = (V_2, E_2), denoted G_1 × G_2, is the digraph G = (V, E) where V = V_1 × V_2 (the Cartesian product of V_1 by V_2) and E = {((x_1, y_1), (x_2, y_2)) ∈ V × V : (x_1, x_2) ∈ E_1 and (y_1, y_2) ∈ E_2}.

The following facts follow easily from the definitions. Lemmas 1, 3, and 4 have been noted in [9].

Lemma 1. The state space of an fds is the disjoint union of cycled trees.

Of special interest are those fds whose state space consists entirely of trees. Such an fds is called a fixed point system (fps). For any finite set X, we call f : X → X nilpotent if there exists a unique x_0 ∈ X such that f^k(X) = {x_0} for some positive integer k.

Lemma 2. The state space of an fds (X, f) is a tree if and only if f is nilpotent. Hence (X, f) is an fps if f is nilpotent.

Proof. Suppose that the state space of (X, f) is a tree with root x_0 and height k. Then f^k(x) = x_0 for all x ∈ X, and x_0 is the only node with this property. Hence f is nilpotent. Conversely, if f is nilpotent and f^k(X) = {x_0}, then by Lemma 1 the state space consists of an n-cycled tree, and since x_0 is unique, n = 1.

Example 1. Consider the fds (F_2^3, f), where f : F_2^3 → F_2^3 is defined by f(x, y, z) = (y, 0, x) and F_2 is the binary field. In this case f is a nilpotent function, and the state space of (F_2^3, f) is the tree shown in Figure 1.

Lemma 3. The state space of an fds (X, f) is a union of cycles if and only if f is one-to-one.

Lemma 4.
The product of a tree and a cycle of length l is a cycled tree whose cycle has length l.

Figure 1: State space of (F_2^3, f), where f(x, y, z) = (y, 0, x) over F_2.

3. FINITE FIELD MODELS

A finite dynamical system constitutes a very natural discrete model for regulatory processes (see [10]), in particular genetic networks. Experimental data can be discretized into a finite set X of expression levels. A network consisting of n genes is then represented by an fds (X^n, f). The dynamics of the network is described by a discrete time series

f(s_0) = s_1, f(s_1) = s_2, ..., f(s_{k−2}) = s_{k−1}.   (1)

Special cases of the finite dynamical model are the Boolean model and finite field models. In the Boolean model, either a gene can affect another gene or not. In a finite field model, one is able to capture graded differences in gene expression. A finite field model can be considered a generalization of the Boolean model, since each Boolean operation can be expressed in terms of the sum and product in Z_2. In particular,

x ∧ y = x·y,   x ∨ y = x + y + x·y,   ¬x = x + 1.   (2)

Two types of finite field models have emerged, the multivariate model [6] and the univariate model [11]. The multivariate model is given by an fds (F_q^n, f), where F_q^n represents the set of n-tuples over the finite field F_q with q elements. Each coordinate function f_i gives the next state of gene i, given the states of the other genes. The univariate model is given by an fds (F_{q^n}, f). In this case, each value of f represents the next states of the n genes, given the present states. The two types of finite field models can be considered equivalent in the following sense.

As for complexity, the evaluation of a polynomial in n variables over F_q, q prime, can be done with O(q^n/n) operations (see [12]); hence, evaluating f in all n of its coordinates requires O(q^n) operations, the same number of operations needed for the evaluation of a univariate polynomial over F_{q^n}.
However, the complexity of the comparison of two values in F_q^n is O(n), whereas the complexity of the comparison of two values in F_{q^n}, represented as described below, is O(1).

Arithmetic in F_q^n, q prime, is componentwise integer arithmetic modulo q. Arithmetic in F_{q^n} is efficiently performed by table-lookup methods, as shown below. Nonzero elements of F_{q^n} are represented by powers of a primitive element α. Multiplication is then performed by adding exponents modulo q^n − 1. For addition we make use of a precomputed table of values defined as follows. Each nonzero element of F_{q^n} of the form 1 + α^i determines a unique number z(i), 0 ≤ z(i) ≤ q^n − 2, such that 1 + α^i = α^{z(i)}; z(i) is called the Zech log of i. Note that for a ≤ b, α^a + α^b = α^a(1 + α^{b−a}) = α^{a + z(b−a) mod (q^n − 1)}. Addition is thus performed by adding one exponent to the Zech log of the difference, which is found in a precomputed table. In order to construct a table of Zech logs for F_{q^n}, we first need a primitive polynomial, which can be found in any one of various tables (e.g., [13]).

Example 2. Let us construct a table of Zech logs for F_32 using the primitive polynomial x^5 + x^2 + 1. Thus we have α^5 = α^2 + 1, where α is a root of x^5 + x^2 + 1. Continuing to compute the powers and making use of this fact, we have α^6 = α^3 + α, α^7 = α^4 + α^2, α^8 = α^5 + α^3 = α^3 + α^2 + 1, ..., α^31 = 1. Now we use these results to compute, for each i = 1, ..., 30, the number z(i) such that α^i + 1 = α^{z(i)}. For example, since α^5 = α^2 + 1, we have α^5 + 1 = α^2 and so z(5) = 2, and so forth. See Table 1.

Definition 1. An fds (X, f) is equivalent to an fds (Y, g) if there is an epimorphism φ : X → Y such that φ ◦ f = g ◦ φ.

Usually it is most convenient to choose the most appropriate model at the outset. However, at the cost of computing all possible values of the map, it is possible to convert one model to the other. The rest of this section is devoted to developing such an algorithm.
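A Zech-log table such as the one in Example 2 can be generated mechanically. The sketch below is our illustration (not code from the paper): field elements are bit masks of polynomials over F_2, reduced modulo the primitive polynomial x^5 + x^2 + 1 (encoded as 0b100101), and z(i) is read off from the discrete-log table.

```python
def zech_logs(poly_bits=0b100101, deg=5):
    """Zech logs for F_{2^deg}, with alpha a root of the primitive
    polynomial encoded by poly_bits (default: x^5 + x^2 + 1)."""
    order = (1 << deg) - 1                 # 31 nonzero field elements
    power, log = [0] * order, {}
    a = 1
    for e in range(order):                 # a holds alpha^e as a bit mask
        power[e], log[a] = a, e
        a <<= 1                            # multiply by alpha
        if a >> deg:                       # reduce modulo the polynomial
            a ^= poly_bits
    # z(i) is defined by 1 + alpha^i = alpha^{z(i)}; addition over F_2 is XOR
    return {i: log[power[i] ^ 1] for i in range(1, order)}

z = zech_logs()
print(z[5], z[3], z[1])   # 2 29 18, matching Table 1
```

The same routine works for any F_{2^n} once a primitive polynomial of degree n is supplied.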
It is easy to see that if two fds's are equivalent, then their state spaces are the same up to isomorphism. We can show that for any n-dimensional dynamical system (F_q^n, f) there is an equivalent one-dimensional system (F_{q^n}, g). To see this, consider a primitive element α of F_{q^n}, that is, a generator of the multiplicative group of F_{q^n} − {0}. Then there is a natural correspondence between F_q^n and F_{q^n}, given by

φ_α(x_0, ..., x_{n−1}) = x_0 + x_1α + x_2α^2 + ··· + x_{n−1}α^{n−1}.   (3)

Since for each a ∈ F_{q^n} there exist unique y_i ∈ F_q such that a = y_0 + y_1α + y_2α^2 + ··· + y_{n−1}α^{n−1}, we can define g : F_{q^n} → F_{q^n} by g(a) = (φ_α ◦ f)(y_0, ..., y_{n−1}). Notice then that g ◦ φ_α = φ_α ◦ f, and therefore the dynamical systems g and f are equivalent.

One important consideration in choosing an appropriate finite field model for a genetic network is the complexity of the needed computational tasks.

Definition 2. Let F be a field and let α ∈ F be an element of order d, that is, α^d = 1 and no smaller positive power of α equals 1. The discrete Fourier transform (DFT) of blocklength d over F is defined by the matrix T = [α^{ij}], i, j = 0, 1, ..., d − 1. The inverse discrete Fourier transform is given by T^{−1} = d^{−1}[α^{−ij}], i, j = 0, 1, ..., d − 1, where d^{−1} denotes the inverse of the field element d = 1 + 1 + ··· + 1 (d times).

It is easy to show that TT^{−1} = I_d, where I_d denotes the d × d identity matrix (see, e.g., [14]). Now F_q contains an element of order d if and only if d divides q − 1. Thus, for every finite field F_q there is a DFT over F_q of blocklength q − 1, defined by [α^{ij}], i, j = 0, 1, ..., q − 2, where α is a primitive element of F_q. We denote such a DFT by T_{q,α}.

Theorem 1. Let B_0 = (φ_α ◦ f)(0, ..., 0) and, for each i = 1, 2, ..., q^n − 1, let B_i = (φ_α ◦ f)(a_{0,i}, a_{1,i}, ..., a_{n−1,i}), where
α is a primitive element of F_{q^n} and where a_{n−1,i}α^{n−1} + ··· + a_{1,i}α + a_{0,i} = α^{i−1}. Then g is given by the polynomial

A_{q^n−1}x^{q^n−1} + A_{q^n−2}x^{q^n−2} + ··· + A_1x + A_0,   (4)

where A_0 = B_0 and

[A_{q^n−1}, A_{q^n−2}, ..., A_1]^T = −T_{q^n,α} [B_1 − A_0, B_2 − A_0, ..., B_{q^n−1} − A_0]^T.   (5)

Proof. For each i = 0, 1, ..., q^n − 2, we have

B_{i+1} = φ_α(f(a_{0,i+1}, a_{1,i+1}, ..., a_{n−1,i+1})) = g(φ_α(a_{0,i+1}, a_{1,i+1}, ..., a_{n−1,i+1})) = g(a_{0,i+1} + a_{1,i+1}α + ··· + a_{n−1,i+1}α^{n−1}) = g(α^i).   (6)

Now every function defined on a finite field F_{q^n} can be expressed as a polynomial of degree not more than q^n − 1. Hence g is of the form (4), and it remains to show that the A_i are given by (5). For this we need only solve the following system of equations:

g(α^i) = A_{q^n−1}(α^i)^{q^n−1} + A_{q^n−2}(α^i)^{q^n−2} + ··· + A_1α^i + A_0,   i = 0, 1, ..., q^n − 2.   (7)

Since α is a primitive element of F_{q^n}, we have (α^i)^{q^n−1} = 1, and so

B_{i+1} − A_0 = g(α^i) − A_0 = A_{q^n−1}(α^i)^{q^n−1} + A_{q^n−2}(α^i)^{q^n−2} + ··· + A_1α^i,   i = 0, 1, ..., q^n − 2.   (8)

Thus,

[B_1 − A_0, B_2 − A_0, ..., B_{q^n−1} − A_0]^T = d^{−1} T^{−1}_{q^n,α} [A_{q^n−1}, A_{q^n−2}, ..., A_1]^T,   (9)

where d^{−1} = (q^n − 1)^{−1} = −1. The theorem then follows by applying T_{q^n,α} to both sides of this last equation.

Table 1: Zech logs for F_32.

i    | 1  | 2 | 3  | 4  | 5 | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15
z(i) | 18 | 5 | 29 | 10 | 2 | 27 | 22 | 20 | 16 | 4  | 19 | 23 | 14 | 13 | 24

i    | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30
z(i) | 9  | 30 | 1  | 11 | 8  | 25 | 7  | 12 | 15 | 21 | 28 | 6  | 26 | 3  | 17

Example 3. When the bacterium E. coli is in an environment with lactose, the lac operon turns on the enzymes that are needed in order to degrade lactose. These enzymes are beta-galactosidase, lactose permease, and thiogalactoside transacetylase. In [15], a continuous model is proposed that measures the rate of change in the concentration of these enzymes as well as the concentration of mRNA and intracellular lactose. In [16, 17], Laubenbacher and Stigler provide a discrete model for the lac operon given by
(F_2^5, f), where

f(x_1, x_2, x_3, x_4, x_5) = (x_3, x_1, x_3 + x_2x_4 + x_2x_3x_4, (1 + x_2)x_4 + x_5 + (1 + x_2)x_4x_5, x_1),   (10)

and where x_1 represents mRNA, x_2 represents beta-galactosidase, x_3 represents allolactose, x_4 represents lactose, and x_5 represents permease.

In order to find an equivalent univariate fds (F_{2^5}, g), we first find a primitive element α in F_{2^5}. This can be done by finding a “primitive polynomial,” that is, an irreducible polynomial of degree 5 over F_2 that has a zero α in F_{2^5} that generates the multiplicative cyclic group of F_{2^5} − {0}. Such an α can be found either by trial and error or by the use of tables (see, e.g., [13]). In our case, we choose α to be a zero of x^5 + x^2 + 1.

Next, we compute B_i, i = 0, 1, ..., 31. By definition, B_0 = φ_α(f(0, 0, 0, 0, 0)) = φ_α(0, 0, 0, 0, 0) = 0, and B_i = φ_α(f(a_{0,i}, a_{1,i}, a_{2,i}, a_{3,i}, a_{4,i})), where α^{i−1} = a_{0,i} + a_{1,i}α + a_{2,i}α^2 + a_{3,i}α^3 + a_{4,i}α^4 for i = 1, 2, ..., 31. So, for example,

B_1 = φ_α(f(1, 0, 0, 0, 0)) = φ_α(0, 1, 0, 0, 1) = α + α^4 = α^{1+z(3)} = α^30,
B_2 = φ_α(f(0, 1, 0, 0, 0)) = φ_α(0, 0, 0, 0, 0) = 0.   (11)

Continuing, we find that [B_1, B_2, ..., B_31] = [α^30, 0, α^5, α^3, α^3, α^26, α^2, α^8, α^15, α^20, α^9, α^26, α^5, α^8, α^15, α^15, α^24, α^9, α^30, α^5, α^8, α^3, α^15, α^26, α^8, α^9, α^15, α^28, α^8, α^9, α^3]. Multiplying by the 31 × 31 matrix T_{32,α} = [α^{ij}], 0 ≤ i, j ≤ 30, we obtain [A_31, A_30, ..., A_1] and hence the equivalent univariate polynomial, which is

g(x) = x + α^22 x^2 + α^2 x^3 + α^11 x^4 + α^20 x^5 + x^6 + α^12 x^7 + α^25 x^8 + α^20 x^9 + α^20 x^10 + α^2 x^11 + α^5 x^12 + α^5 x^13 + α^23 x^14 + α^7 x^16 + α^27 x^17 + α^20 x^18 + α^6 x^19 + α^20 x^20 + α^27 x^21 + x^22 + α^27 x^24 + x^25 + α^27 x^26 + α^16 x^28.

As previously mentioned, the complexity of evaluating a polynomial in n variables over a finite field F_q is O(q^n/n). The complexity of evaluating f in all of its n coordinates is thus O(q^n), and the complexity of evaluating f at all points of F_q^n is thus O(q^{2n}).
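Formula (5) of Theorem 1 can be checked directly on a toy system. The sketch below is our own construction, not the paper's lac operon example: it works in F_4 with α^2 = α + 1 and the hypothetical 2-gene map f(x_0, x_1) = (x_1, x_0 + x_1), and recovers the coefficients of g by multiplying T = [α^{ij}] into the vector of the B_i − A_0 (note that −1 = 1 in characteristic 2).

```python
# F_4 = {0, 1, 2, 3}: bit 0 = coefficient of 1, bit 1 = coefficient of alpha,
# with alpha^2 = alpha + 1 (primitive polynomial x^2 + x + 1)
POW = [1, 2, 3]                       # alpha^0, alpha^1, alpha^2
LOG = {1: 0, 2: 1, 3: 2}

def mul(a, b):                        # field multiplication via discrete logs
    return 0 if 0 in (a, b) else POW[(LOG[a] + LOG[b]) % 3]

def phi(x):                           # phi_alpha(x0, x1) = x0 + x1 * alpha
    return x[0] | (x[1] << 1)

def f(x):                             # hypothetical 2-gene multivariate system
    return (x[1], x[0] ^ x[1])

def g(a):                             # the equivalent univariate map on F_4
    return phi(f((a & 1, a >> 1)))

# Theorem 1: A_0 = B_0 and [A3, A2, A1]^T = -T [B1 - A0, B2 - A0, B3 - A0]^T
B = [g(0)] + [g(POW[i]) for i in range(3)]    # B_{i+1} = g(alpha^i)
A0 = B[0]
T = [[POW[(i * j) % 3] for j in range(3)] for i in range(3)]
A = []
for i in range(3):
    s = 0
    for j in range(3):                # field addition in F_4 is XOR
        s ^= mul(T[i][j], B[j + 1] ^ A0)
    A.append(s)
A3, A2, A1 = A

def poly(x):                          # g(x) = A3 x^3 + A2 x^2 + A1 x + A0
    x2 = mul(x, x)
    return mul(A3, mul(x2, x)) ^ mul(A2, x2) ^ mul(A1, x) ^ A0

assert all(poly(x) == g(x) for x in range(4))
print(A3, A2, A1, A0)                 # 0 0 2 0: the recovered polynomial is g(x) = alpha * x
```

Here the recovered polynomial is g(x) = αx: under φ_α, the chosen f is exactly multiplication by α, so the transform correctly recovers a degree-one polynomial.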
The computation of the matrix-vector product in (5) involves O(q^{2n}) operations over the field F_{q^n}. However, using any one of a number of classical fast algorithms, such as Cooley-Tukey (see, e.g., [14]), the number of operations can be reduced to O(q^n n). We illustrate the algorithm given by Theorem 1 with an example.

Example 3. A recent application involves the study and creation of a model for the lac operon [15]; the discrete model f of (10) and its equivalent univariate polynomial g were computed above.

4. FIXED POINT SYSTEMS

A fixed point system (fps) is defined to be an fds whose state space consists of trees, that is, contains no cycles other than trivial ones (of length one). The fixed point system problem is the problem of determining whether or not a given fds is an fps. Of course, one such method would be the brute force method, whereby we examine sequences determined by successive applications of the state map to determine whether any such sequence contains a cycle of length greater than one. The worst case occurs when the state space consists of one cycle. Consider such a multivariate fds (F_{q^n}, f). In order to recognize a maximal cycle f(a_1, a_2, . . . , a_n), f^2(a_1, a_2, . . . , a_n), . . . , such an approach would require backtracking at each step in order to compare the most recent value f^i(a_1, a_2, . . . , a_n) with all previous values. An evaluation requires O(q^n) operations, there are q^n such evaluations, and a comparison of two values requires n steps. The complexity of the complete process is thus O(q^{2n} + n^2) = O(q^{2n}). To date, all results concerning the fixed point system problem are characterized in terms of multivariate fds. A solution to the fixed point system problem consists of characterizing such an fds (F_{q^n}, f) in terms of the structure of f. Ideally, such conditions should be amenable to implementation in time polynomial in n.
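The brute-force test described above can be sketched as follows; the helper name and the strategy (iterate |S| steps so every trajectory lands in its terminal cycle, then test whether that cycle is trivial) are our own, and the sketch performs O(|S|^2) map evaluations, matching the discussion above.

```python
# Brute-force fps test for an fds given as a state map on a finite set.
def is_fixed_point_system(step, states):
    states = list(states)
    for s in states:
        t = s
        for _ in range(len(states)):   # walk far enough to enter a cycle
            t = step(t)
        if step(t) != t:               # the cycle has length > 1
            return False
    return True

# g(x) = x^3 on F_5 has the 2-cycle {2, 3} (see Example 6 and Figure 2),
# so it is not an fps, while x -> x^2 on F_5 drives every state into a
# fixed point (0 or 1) and therefore is an fps.
assert not is_fixed_point_system(lambda x: pow(x, 3, 5), range(5))
assert is_fixed_point_system(lambda x: pow(x, 2, 5), range(5))
```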
In a recent work, Just [18] claims that if the class of regulatory functions contains the quadratic monotone functions x_i ∨ x_j and x_i ∧ x_j, then the fixed point problem for Boolean dynamical systems is NP-hard. In view of this result, it is unlikely that we can achieve the goal of a polynomial time solution to the fixed point problem, at least in the general case. However, the question arises whether the above O(q^{2n}) result can be improved (say, to O(q^n)) and also which special cases of the fixed point problem have polynomial time solutions. In this section we give a polynomial time solution to the special case of the fixed point problem for linear finite dynamical systems, we review known results for the nonlinear case, and we point out how our results concerning the general linear case for fds over finitely generated modules give a more complete solution to the case of monomial finite field dynamical systems over arbitrary finite fields. We conclude by proposing a new approach to the problem via univariate systems, giving an algorithm for determining the fixed points of a univariate system, and giving a necessary condition for a univariate fds to be an fps.

4.1. Linear fixed point systems

Finite dynamical systems over finite fields that are linear are very amenable to analysis and have been studied extensively in the literature (see [2, 9]). In the multivariate case, a linear system over a finite field is represented by an fds (F_{q^n}, f), where f can be represented by an n × n matrix A over F_q. The fixed points of a multivariate fds (F_{q^n}, A) are simply the solutions to the homogeneous system of equations (A - I)x = 0. In the finite field model for genetic networks, we assume that the number of states of each gene is a power of a prime. However, we will give a more general model that eliminates this assumption. A module M over a ring R is said to be finitely generated if there exists a set of elements {s_1, s_2, . . .
, s_n} ⊂ M such that M = {r_1 s_1 + r_2 s_2 + · · · + r_n s_n | r_i ∈ R}. Finitely generated modules generalize vector spaces. Examples are F_{q^n} and the set Z_m^n of n-tuples over the ring of integers modulo an arbitrary integer m. A linear finite dynamical module system (lfdms) consists of an ordered pair (M(R), f), where M(R) is a finitely generated module over a finite commutative ring R with unity and f : M(R) → M(R) is linear. Let (M_1(R), f_1) and (M_2(R), f_2) be lfdms. We define the direct sum of (M_1(R), f_1) and (M_2(R), f_2) to be the fds (M_1 ⊕ M_2, f_1 ⊕ f_2), where M_1 ⊕ M_2 is the direct sum of the modules M_1(R) and M_2(R) and f_1 ⊕ f_2 : M_1 ⊕ M_2 → M_1 ⊕ M_2 is defined by (f_1 ⊕ f_2)(u + v) = f_1(u) + f_2(v) for each u ∈ M_1(R) and v ∈ M_2(R). The state space of the direct sum is related to those of the component fds as follows.

Lemma 5. Let G_1 be the state space of the lfdms (M_1(R), f_1) and let G_2 be the state space of the lfdms (M_2(R), f_2). Then the state space of the direct sum of (M_1(R), f_1) and (M_2(R), f_2) is G_1 × G_2.

This result has been noted in [9] for lfdms over fields. We use the following well-known result (see, e.g., [19]) in order to establish necessary and sufficient conditions for an lfdms to be a fixed point system.

Lemma 6 (Fitting's lemma). Let (M(R), f) be an lfdms. Then there exist an integer n > 0 and submodules N and P satisfying (i) N = f^n(M(R)), (ii) P = f^{-n}(0), (iii) (M(R), f) = (N(R), f_1) ⊕ (P(R), f_2), where f_1 = f|_N (the restriction of f to N) is invertible and f_2 = f|_P is nilpotent.

Theorem 2. Let (M(R), f) be an lfdms and let N be defined as above. Then (M(R), f) is a fixed point system if and only if either f is nilpotent or f|_N is the identity map.

Proof. By Fitting's lemma, we have (M(R), f) = (N(R), f|_N) ⊕ (P(R), f|_P), where N = f^n(M(R)) and P = f^{-n}(0). Suppose that f is nilpotent. Then by Lemma 2, the state space of (M(R), f) is a tree. Next suppose that f|_N is the identity.
Then by Lemma 3, the state space of (N(R), f|_N) is a union of cycles each of length one and, by Lemma 2, the state space of (P(R), f|_P) is a tree. Hence by Lemma 4, the state space of (M(R), f) is a union of trees and so (M(R), f) is a fixed point system. Conversely, suppose that (M(R), f) is a fixed point system. Then the state space of (M(R), f) is a union U of trees. If U consists of only one tree, then by Lemma 3, f is nilpotent. Now suppose that U is the union of at least two trees. Since f is invertible on N, it is also one-to-one on N. By Lemma 2, the state space of (N(R), f|_N) is a union of cycles. Each of these cycles must be of length one, for if not, the state space of (M(R), f) would contain at least one n-cycled tree with n > 1, contradicting that (M(R), f) is a fixed point system.

Theorem 2 can be used to prove the following result, which is suggested in [20, 21].

Corollary 1. A linear finite dynamical system (F_{q^n}, f) over a field is a fixed point system if and only if the characteristic polynomial of f is of the form x^{n_0}(x - 1)^{n_1} and the minimal polynomial is of the form x^{n_0'}(x - 1)^{n_1'}, where n_1' is either zero or one.

The following theorem was published in [25]; in its statement, S_f = (h_1, h_2, . . . , h_n), where each h_i = x_1^{δ_{i1}} x_2^{δ_{i2}} · · · x_n^{δ_{in}} and δ_{ij} is one if the corresponding exponent ε_{ij} is positive and is zero otherwise (see Definition 3 below).

Theorem 3 (Colón-Reyes, Jarrah, Laubenbacher, and Sturmfels). A monomial fds (F_{q^n}, f) is an fps if and only if (Z_{q-1}^n, L_f) and (Z_2^n, S_f) are fixed point systems.

Proof of Corollary 1. Suppose (F_{q^n}, f) is an fps. Then either f is nilpotent or f|_N is the identity. If f is nilpotent, then the characteristic polynomial of f is of the form x^{n_0} and the minimal polynomial of f is of the form x^{n_0'}. If f|_N is the identity, then the characteristic polynomial of f|_N is of the form (x - 1)^{n_1} and the minimal polynomial of f|_N is of the form (x - 1)^{n_1'}, where 0 ≤ n_1' ≤ n_1. Furthermore, n_1' ≤ 1, since otherwise (F_{q^n}, f) would not be an fps [2].
Therefore the characteristic and minimal polynomials of f are of the desired forms. Conversely, suppose that the characteristic polynomial of f is of the form x^{n_0}(x - 1)^{n_1} and its minimal polynomial is of the form x^{n_0'}(x - 1)^{n_1'}, where n_0' ≤ n_0 and n_1' is either zero or one. If n_0 = 0, then the characteristic polynomial of f is (x - 1)^{n_1} and so the minimal polynomial of f is x - 1, which implies that f is the identity and hence (F_{q^n}, f) is an fps. Next, suppose that n_0 > 0. Then either n_1' = 0 or n_1' = 1. If n_1' = 0, then the state space of (F_{q^n}, f) is a tree. If n_1' = 1, then the state space of (F_{q^n}, f) is the product of a tree and cycles of length one, and hence a union of trees.

Example 4. Consider the monomial fds (F_{5^2}, f), where f = (xy, y) = (x^1 y^1, x^0 y^1). The matrix L_f = [ 1 1 ; 0 1 ] over Z_4 is nonsingular and hence not nilpotent. Furthermore, the n of Theorem 2 is 1, N = Z_4^2, and L_f is not the identity. By Theorem 2, (Z_4^2, L_f) is not an fps, and by the previous theorem, (F_{5^2}, f) is not an fps.

The corollary gives us a polynomial time algorithm to determine whether or not a linear fds (F_{q^n}, f), where f is given by an n × n matrix, is an fps. The characteristic polynomial of f can be determined in time O(n^3) using the definition. The minimal polynomial of f can be determined in time O(n^3) using an algorithm of Storjohann [22]. Both polynomials can be factored in subquadratic time using an algorithm of Kaltofen and Shoup [23].

Lemma 7. (F_{q^n}, g) has fixed points if and only if h(x) = gcd(g(x) - x, x^{q^n} - x) ≠ 1, and in such a case the fixed points are the zeros of h(x).

4.2. Monomial systems

The simplest nonlinear multivariate fds (F_{q^n}, f) is one in which each component function f_i of f is a monomial, that is, a product of powers of the variables. In [24], Colón-Reyes et al. provide necessary and sufficient conditions that allow one to determine in polynomial time when an fds of the form (F_{2^n}, f), where f is a monomial, is an fps.
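Returning to the linear case, the test enabled by Corollary 1 can be sketched directly. Its condition is equivalent to requiring that the minimal polynomial of f divide x^n(x - 1), that is, that A^n(A - I) = 0 over F_q; this reformulation and the code below are ours, and the direct check costs O(n^4) field operations rather than the O(n^3) route through [22], but it is simple to state.

```python
# Sketch: (F_q^n, A) is a fixed point system iff A^n (A - I) = 0 mod q,
# an equivalent restatement of Corollary 1's polynomial conditions.
def mat_mul(X, Y, q):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) % q
             for j in range(n)] for i in range(n)]

def is_linear_fps(A, q):
    n = len(A)
    I = [[int(i == j) for j in range(n)] for i in range(n)]
    AmI = [[(A[i][j] - I[i][j]) % q for j in range(n)] for i in range(n)]
    P = I
    for _ in range(n):                 # P = A^n over F_q
        P = mat_mul(P, A, q)
    Z = mat_mul(P, AmI, q)
    return all(v == 0 for row in Z for v in row)

# The nilpotent map (x1, x2) -> (x2, 0) over F_2 is an fps, while the
# invertible non-identity map [[1,1],[0,1]] over F_2 has 2-cycles and is not.
assert is_linear_fps([[0, 1], [0, 0]], 2)
assert not is_linear_fps([[1, 1], [0, 1]], 2)
```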
In [25], Colón-Reyes et al. give necessary and sufficient conditions for (F_{q^n}, f), where f is a monomial and q an arbitrary prime, to be an fps. However, one of these conditions is that a certain linear fds over a ring be an fps, and no criterion is given for such an fds to be an fps. Theorem 2 gives such a criterion. Let us describe the situation in more detail.

Definition 3. If f = (f_1, f_2, . . . , f_n), where each f_j is of the form x_1^{ε_{1j}} x_2^{ε_{2j}} · · · x_n^{ε_{nj}}, j = 1, 2, . . . , n, and each ε_{ij} belongs to the ring Z_{q-1} of integers modulo q - 1, then (F_{q^n}, f) is called a monomial finite dynamical system. The log map of (F_{q^n}, f) is defined by the n × n matrix L_f = [ε_{ij}], where 1 ≤ i, j ≤ n. The support map is defined by S_f = (h_1, h_2, . . . , h_n), where each h_i = x_1^{δ_{i1}} x_2^{δ_{i2}} · · · x_n^{δ_{in}} and δ_{ij} is one if ε_{ij} > 0 and is zero otherwise.

The problem of determining in polynomial time (in n) when an lfdms (R^n, f) is an fps, where R is a finite ring, is open.

4.3. A univariate approach

The fixed point problem is an important problem, and suitable solutions have been obtained only in certain special cases. All of the work done so far has been in terms of multivariate fds. By considering the problem in the univariate domain, it is possible to gain insight that is not evident in the multivariate domain. The results in the remainder of this section are examples of this.

Proof of Lemma 7. An element a of F_{q^n} is a fixed point of (F_{q^n}, g) if and only if a is a zero of g(x) - x. Since a^{q^n} = a for all a ∈ F_{q^n}, x^{q^n} - x contains all linear factors of the form x - a, a ∈ F_{q^n}, and so a is a zero of g(x) - x if and only if x - a is a factor of both g(x) - x and x^{q^n} - x, that is, if and only if it is a factor of h(x).

Lemma 7 gives us an algorithm for determining whether or not a given univariate fds has fixed points and, if so, a method to find all such points. For the first part, we note that the greatest common divisor of two univariate polynomials of degree no more than d can be determined using no more than O(d log^2 d) operations [26].
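The computation in Lemma 7 can be sketched for a prime field F_p with textbook coefficient-list arithmetic (helper names and layout are ours; coefficients are stored lowest degree first). We use g(x) = x^3 over F_5, the fds of Example 6 below.

```python
# Compute h(x) = gcd(g(x) - x, x^p - x) over F_p and read off fixed points.
def poly_mod(a, b, p):
    a = a[:]
    while len(a) >= len(b) and any(a):
        while a and a[-1] == 0:
            a.pop()
        if len(a) < len(b):
            break
        c = (a[-1] * pow(b[-1], p - 2, p)) % p   # cancel leading term
        shift = len(a) - len(b)
        for i in range(len(b)):
            a[shift + i] = (a[shift + i] - c * b[i]) % p
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_gcd(a, b, p):                            # Euclid's algorithm
    while b:
        a, b = b, poly_mod(a, b, p)
    inv = pow(a[-1], p - 2, p)                    # make the gcd monic
    return [(c * inv) % p for c in a]

p = 5
g_minus_x = [0, -1 % p, 0, 1]                     # g(x) - x = x^3 - x
xp_minus_x = [0, -1 % p] + [0] * (p - 2) + [1]    # x^5 - x
h = poly_gcd(xp_minus_x, g_minus_x, p)            # here h(x) = x^3 - x != 1
fixed = [a for a in range(p)
         if sum(c * pow(a, i, p) for i, c in enumerate(h)) % p == 0]
assert fixed == [0, 1, 4]                         # fixed points of x^3 on F_5
```

Since h ≠ 1, the system has fixed points (0, 1, and 4), even though, as Example 6 below shows, it is not an fps.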
Since g has degree at most q^n - 1, this means that the complexity of calculating h(x), that is, of determining whether or not a given univariate fds (F_{q^n}, g) has fixed points, is O(n^2 q^n). When h(x) ≠ 1, h(x) can be factored in order to determine the set of all fixed points. At worst, using the algorithm in [23], the complexity of determining the factors of h(x) is O(d^{1.815} n), where d is the degree of h(x). Clearly, d is less than or equal to the degree of g(x), which in practice is determined by experimental data (e.g., from microarrays) and is thus considerably less than the total number of possible points q^n. If we assume that the degree of g(x) is not more than the square root of q^n, then d^{1.815} n ≤ n^2 q^n and the total complexity of the algorithm for determining all fixed points is thus O(n^2 q^n).

Figure 2: State space of (F_5, g), where g(x) = x^3 over F_5.

In contrast, the only known method for determining the fixed points of a multivariate fds (F_{q^n}, f) is the brute force method of enumerating all state transitions and, for each value f(a_1, a_2, . . . , a_n) so generated, checking whether f(a_1, a_2, . . . , a_n) = (a_1, a_2, . . . , a_n). The number of operations in this method is O(q^{2n}). In many cases, the degree of h(x) of Lemma 7 is small and its zeros can be found by inspection or after only a few trials. The lac operon example illustrates this.

Example 5. Let (F_{2^5}, g) be the fds describing the lac operon (Example 3). We have h(x) = gcd(g(x) - x, x^{32} - x) = x^4 + α^{26} x^3 + α^{18} x^2 = x^2 (x - α^3)(x - α^{15}), and thus the fixed points are x = 0, x = α^3, and x = α^{15}.

Lemma 7 also gives a necessary condition for an fds to be an fps, which for emphasis we state as a theorem.

Theorem 4. With the notation of Lemma 7, if (F_{q^n}, g) is an fps, then h(x) ≠ 1.

Proof. If h(x) = 1, then by Lemma 7, (F_{q^n}, g) has no fixed points and all cycles are nontrivial. Hence (F_{q^n}, g) is not an fps.

The converse of Theorem 4 is not true.
Example 6. Consider the fds (F_5, g), where g(x) = x^3. Then h(x) = gcd(x^3 - x, x^5 - x) = x^3 - x ≠ 1, but (F_5, g) is not an fps (see Figure 2).

5. IMPLEMENTATION ISSUES

One of the difficulties of implementing algorithms for the multivariate model is the choice of data structures, which can, in fact, affect complexity. For example, no algorithm is known for factoring multivariate polynomials that runs in time polynomial in the length of the "sparse" representation. However, such an algorithm exists for the "black box" representation (see, e.g., [27]). On the other hand, the data structures needed for algorithms for the univariate model are well known and simple to implement. In this case, one can also take advantage of well-known methods used in cryptography and coding theory. Table lookup methods for carrying out finite field arithmetic are an example. By using lookup tables we can perform arithmetic operations at almost no cost. However, for very large fields, memory space becomes a limitation. Ferrer [28] has implemented table lookup arithmetic for fields of characteristic 2 on a Hewlett-Packard Itanium machine with two 900 MHz ia64 CPU modules and 4 GB of RAM. On this machine, we can create lookup tables of up to 2^29 elements. Multiplication is by far the most costly finite field operation and also the most often used, since other operations such as computing powers and computing inverses make use of multiplication. In other experiments on the Hewlett-Packard Itanium, Ferrer [28] takes advantage of machine hardware in order to implement a "direct" multiplication algorithm for F_{2^n} that runs in time linear in n for n = 2 up to n = 63. Here the field size is limited by the word length of the computer architecture. For larger fields, we can make use of "composite" fields (see, e.g., [29]), that is, fields F_{2^n} where n is composite, say n = rs.
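The table-lookup idea can be sketched for F_{2^5} with log/antilog tables built from a primitive polynomial, here x^5 + x^2 + 1; the table layout and names are ours.

```python
# Table-lookup arithmetic in F_{2^5}: elements are 5-bit ints, and alpha
# is a root of x^5 + x^2 + 1, so alpha^5 = alpha^2 + 1 (bit mask 0b00101).
EXP = [0] * 62          # EXP[i] = alpha^i; doubled length avoids a mod
LOG = [0] * 32          # LOG[a] = i with alpha^i = a, for a != 0
a = 1
for i in range(31):     # alpha has multiplicative order 31
    EXP[i] = EXP[i + 31] = a
    LOG[a] = i
    a <<= 1
    if a & 0b100000:    # reduce using x^5 = x^2 + 1
        a ^= 0b100101

def gf32_mul(x, y):
    if x == 0 or y == 0:
        return 0
    return EXP[LOG[x] + LOG[y]]   # alpha^(i+j); index stays below 62

assert gf32_mul(0b00010, 0b10000) == 0b00101   # alpha * alpha^4 = alpha^2 + 1
assert all(gf32_mul(x, 1) == x for x in range(32))
```

As a consistency check, `LOG[1 ^ EXP[3]] == 29`, reproducing z(3) = 29 from the table given earlier, assuming z is the Zech logarithm defined by 1 + α^i = α^{z(i)}.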
Making use of the isomorphism of F_{2^{rs}} and F_{(2^r)^s}, we can use table lookup for a suitable "ground field" F_{2^r} and the direct method mentioned above for multiplication in the extension field F_{(2^r)^s}. Using the ground field F_{2^5} and selected values of s, Ferrer [28] obtains running time O(s^2). Still another approach to implementing finite field arithmetic, one that is especially efficient for fields of characteristic 2, is the use of reconfigurable hardware, or "field programmable gate arrays" (FPGAs). In [30], Ferrer, Moreno, and the first author obtain a multiplication algorithm that outperforms all other known FPGA multiplication algorithms for fields of characteristic 2.

6. CONCLUSIONS

One piece of information that is of utmost interest when modeling biological events, in particular gene regulation networks, is when the dynamics reaches a steady state. If the modeling of such networks is done by discrete finite dynamical systems, this information is given by the fixed points of the underlying system. We have shown that we can choose between a multivariate and a univariate polynomial representation. Here we have introduced a new tool, the discrete Fourier transform, which helps us change from one representation to the other without altering the dynamics of the system. We have provided a criterion to determine when a linear finite dynamical system over an arbitrary finitely generated module over a commutative ring with unity is a fixed point system. When a gene regulation network is modeled by a linear finite dynamical system, we can then decide whether such a network reaches a steady state using our results. When the finitely generated module is a finite field, we can decide in polynomial time. Gene regulation networks, as suggested in the literature, seem to obey very complex mechanisms whose rules appear to be of a nonlinear nature (see [31]). In this regard, we have made explicit some useful facts concerning fixed points and fixed point systems.
We have given algorithms for determining when a univariate fds has at least one fixed point and for finding all such points. We have also given a necessary condition for a univariate fds to be a fixed point system. However, there is still much to be done and a number of open problems remain. In particular, which families of fds admit polynomial time algorithms for determining whether or not a given fds is an fps? This work is a first step toward the aim of designing theories and practical tools to tackle the general problem of fixed points in finite dynamical systems.

ACKNOWLEDGMENTS

This work was partially supported by Grant number S06GM08102 NIH-MBRS (SCORE). The figures in this paper were created using the Discrete Visualizer of Dynamics software from the Virginia Bioinformatics Institute (http://dvd.vbi.vt.edu/network visualizer). The authors are grateful to Dr. Oscar Moreno for sharing his ideas on the univariate model and composite fields.

REFERENCES

[1] J. F. Lynch, "On the threshold of chaos in random Boolean cellular automata," Random Structures & Algorithms, vol. 6, no. 2-3, pp. 239–260, 1995.
[2] B. Elspas, "The theory of autonomous linear sequential networks," IRE Transactions on Circuit Theory, vol. 6, no. 1, pp. 45–60, 1959.
[3] J. Plantin, J. Gunnarsson, and R. Germundsson, "Symbolic algebraic discrete systems theory—applied to a fighter aircraft," in Proceedings of the 34th IEEE Conference on Decision and Control, vol. 2, pp. 1863–1864, New Orleans, La, USA, December 1995.
[4] D. Bollman, E. Orozco, and O. Moreno, "A parallel solution to reverse engineering genetic networks," in Computational Science and Its Applications—ICCSA 2004—Part 3, A. Laganà, M. L. Gavrilova, V. Kumar, et al., Eds., vol. 3045 of Lecture Notes in Computer Science, pp. 490–497, Springer, Berlin, Germany, 2004.
[5] A. S. Jarrah, H. Vastani, K. Duca, and R.
Laubenbacher, "An optimal control problem for in vitro virus competition," in Proceedings of the 43rd IEEE Conference on Decision and Control (CDC '04), vol. 1, pp. 579–584, Nassau, Bahamas, December 2004.
[6] R. Laubenbacher and B. Stigler, "A computational algebra approach to the reverse engineering of gene regulatory networks," Journal of Theoretical Biology, vol. 229, no. 4, pp. 523–537, 2004.
[7] G. N. Fuller, C. H. Rhee, K. R. Hess, et al., "Reactivation of insulin-like growth factor binding protein 2 expression in glioblastoma multiforme," Cancer Research, vol. 59, no. 17, pp. 4228–4232, 1999.
[8] I. Shmulevich, E. R. Dougherty, and W. Zhang, "Gene perturbation and intervention in probabilistic Boolean networks," Bioinformatics, vol. 18, no. 10, pp. 1319–1331, 2002.
[9] R. A. Hernández Toledo, "Linear finite dynamical systems," Communications in Algebra, vol. 33, no. 9, pp. 2977–2989, 2005.
[10] J. Bähler and S. Svetina, "A logical circuit for the regulation of fission yeast growth modes," Journal of Theoretical Biology, vol. 237, no. 2, pp. 210–218, 2005.
[11] O. Moreno, D. Bollman, and M. Aviño, "Finite dynamical systems, linear automata, and finite fields," in Proceedings of the WSEAS International Conference on System Science, Applied Mathematics and Computer Science, and Power Engineering Systems, pp. 1481–1483, Copacabana, Rio de Janeiro, Brazil, October 2002.
[12] B. Sunar and D. Cyganski, "Comparison of bit and word level algorithms for evaluating unstructured functions over finite rings," in Proceedings of the 7th International Workshop on Cryptographic Hardware and Embedded Systems (CHES '05), J. R. Rao and B. Sunar, Eds., vol. 3659 of Lecture Notes in Computer Science, pp. 237–249, Edinburgh, UK, August-September 2005.
[13] M. Zivkovic, "A table of primitive binary polynomials," Mathematics of Computation, vol. 62, no. 205, pp. 385–386, 1994.
[14] R. E.
Blahut, Algebraic Methods for Signal Processing and Communications Coding, Springer, New York, NY, USA, 1991.
[15] N. Yildirim and M. C. Mackey, "Feedback regulation in the lactose operon: a mathematical modeling study and comparison with experimental data," Biophysical Journal, vol. 84, no. 5, pp. 2841–2851, 2003.
[16] R. Laubenbacher, "Network inference, with an application to yeast system biology," Presentation at the Center for Genomics Science, Cuernavaca, Mexico, September 2006, http://mitla.lcg.unam.mx/.
[17] R. Laubenbacher and B. Stigler, "Mathematical Tools for Systems Biology," http://people.mbi.ohio-state.edu/bstigler/sbworkshop.pdf.
[18] W. Just, "The steady state system problem is NP-hard even for monotone quadratic Boolean dynamical systems," submitted to Annals of Combinatorics.
[19] B. R. McDonald, Finite Rings with Identity, Marcel Dekker, New York, NY, USA, 1974.
[20] O. Colón-Reyes, Monomial dynamical systems, Ph.D. thesis, Virginia Polytechnic Institute and State University, Blacksburg, Va, USA, 2005.
[21] O. Colón-Reyes, Monomial Dynamical Systems over Finite Fields, ProQuest, Ann Arbor, Mich, USA, 2005.
[22] A. Storjohann, "An O(n^3) algorithm for the Frobenius normal form," in Proceedings of the 23rd International Symposium on Symbolic and Algebraic Computation (ISSAC '98), pp. 101–104, Rostock, Germany, August 1998.
[23] E. Kaltofen and V. Shoup, "Subquadratic-time factoring of polynomials over finite fields," Mathematics of Computation, vol. 67, no. 223, pp. 1179–1197, 1998.
[24] O. Colón-Reyes, R. Laubenbacher, and B. Pareigis, "Boolean monomial dynamical systems," Annals of Combinatorics, vol. 8, no. 4, pp. 425–439, 2004.
[25] O. Colón-Reyes, A. S. Jarrah, R. Laubenbacher, and B. Sturmfels, "Monomial dynamical systems over finite fields," Complex Systems, vol. 16, no. 4, pp. 333–342, 2006.
[26] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Boston, Mass, USA, 1974.
[27] J. von zur Gathen and J. Gerhard, Modern Computer Algebra, Cambridge University Press, Cambridge, UK, 2nd edition, 2003.
[28] E. Ferrer, "A co-design approach to the reverse engineering problem," CISE Ph.D. thesis proposal, University of Puerto Rico, Mayaguez, Puerto Rico, USA, 2006.
[29] E. Savas and C. K. Koc, "Efficient method for composite field arithmetic," Tech. Rep., Electrical and Computer Engineering, Oregon State University, Corvallis, Ore, USA, December 1999.
[30] E. Ferrer, D. Bollman, and O. Moreno, "Toward a solution of the reverse engineering problem using FPGAs," in Proceedings of the International Euro-Par Workshops, Lehner, et al., Eds., vol. 4375 of Lecture Notes in Computer Science, pp. 301–309, Springer, Dresden, Germany, September 2006.
[31] R. Thomas, "Laws for the dynamics of regulatory networks," International Journal of Developmental Biology, vol. 42, no. 3, pp. 479–485, 1998.

Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 82702, 11 pages
doi:10.1155/2007/82702

Research Article

Comparison of Gene Regulatory Networks via Steady-State Trajectories

Marcel Brun,1 Seungchan Kim,1,2 Woonjung Choi,3 and Edward R. Dougherty1,4,5

1 Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004, USA
2 School of Computing and Informatics, Ira A. Fulton School of Engineering, Arizona State University, Tempe, AZ 85287, USA
3 Department of Mathematics and Statistics, College of Liberal Arts and Sciences, Arizona State University, Tempe, AZ 85287, USA
4 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
5 Cancer Genomics Laboratory, Department of Pathology, University of Texas M.D. Anderson Cancer Center, Houston, TX 77030, USA

Received 31 July 2006; Accepted 24 February 2007

Recommended by Ahmed H.
Tewfik

The modeling of genetic regulatory networks is becoming increasingly widespread in the study of biological systems. In the abstract, one would prefer quantitatively comprehensive models, such as a differential-equation model, to coarse models; however, in practice, detailed models require more accurate measurements for inference and more computational power to analyze than coarse-scale models. It is crucial to address the issue of model complexity in the framework of a basic scientific paradigm: the model should be of minimal complexity to provide the necessary predictive power. Addressing this issue requires a metric by which to compare networks. This paper proposes the use of a classical measure of difference between amplitude distributions for periodic signals to compare two networks according to the differences of their trajectories in the steady state. The metric is applicable to networks with both continuous and discrete values for both time and state, and it possesses the critical property that it allows the comparison of networks of different natures. We demonstrate application of the metric by comparing a continuous-valued reference network against simplified versions obtained via quantization.

Copyright © 2007 Marcel Brun et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

The modeling of genetic regulatory networks (GRNs) is becoming increasingly widespread for gaining insight into the underlying processes of living systems. The computational biology literature abounds in various network modeling approaches, all of which have particular goals, along with their strengths and weaknesses [1, 2]. They may be deterministic or stochastic.
Network models have been studied to gain insight into various cellular properties, such as cellular state dynamics and transcriptional regulation [3–8], and to derive intervention strategies based on state-space dynamics [9, 10]. Complexity is a critical issue in the synthesis, analysis, and application of GRNs. In principle, one would prefer the construction and analysis of a quantitatively comprehensive model, such as a differential-equation-based model, to a coarsely quantized discrete model; however, in practice, the situation does not always suffice to support such a model. Quantitatively detailed (fine-scale) models require significantly more complex mathematics and computational power for analysis, and more accurate measurements for inference, than coarse-scale models. The network complexity issue has similarities with the issue of classifier complexity [11]. One must decide whether to use a fine-scale or coarse-scale model [12]. The issue should be addressed in the framework of the standard engineering paradigm: the model should be of minimal complexity to solve the problem at hand. To quantify network approximation and reduction, one would like a metric to compare networks. For instance, it may be beneficial for computational or inferential purposes to approximate a system by a discrete model instead of a continuous model. The goodness of the approximation is measured by a metric, and the precise formulation of the properties will depend on the chosen metric. Comparison of GRN models needs to be based on salient aspects of the models. One study used the L1 norm between the steady-state distributions of different networks in the context of the reduction of probabilistic Boolean networks [13]. Another study compared networks based on their topologies, that is, connectivity graphs [14]. This method suffers from the fact that networks with the same topology may possess very different dynamic behaviors.
A third study involved a comprehensive comparison of continuous models based on their inferential power, prediction power, robustness, and consistency in the framework of simulations, where a network is used to generate gene expression data, which are then used to reconstruct the network [15]. A key drawback of most approaches is that the comparison is applicable only to networks with similar representations; it is difficult to compare networks of different natures, for instance, a differential-equation model to a Boolean model. A salient property of the metric proposed in this study is that it can compare networks of different natures in both value and time. We propose a metric to compare deterministic GRNs via their steady-state behaviors. This is a reasonable approach because in the absence of external intervention, a cell operates mainly in its steady state, which characterizes its phenotype, that is, cell cycle, disease, cell differentiation, and so forth [16–19]. A cell's phenotypic status is maintained through a variety of regulatory mechanisms. Disruption of this tight steady-state regulation may lead to an abnormal cellular status, for example, cancer. Studying the steady-state behavior of a cellular system and its disruption can provide significant insight into the cellular regulatory mechanisms underlying disease development. We first introduce a metric to compare GRNs based on their steady-state behaviors, discuss its characteristics, and treat the empirical estimation of the metric. Then we provide a detailed application to quantization utilizing the mathematical framework of reference and projected networks. We close with some remarks on the efficacy of the proposed metric.

2. METRIC BETWEEN NETWORKS

In this section, we construct the distance metric between networks using a bottom-up approach.
Following a description of how trajectories are decomposed into their transient and steady-state parts, we define a metric between two periodic or constant functions and then extend this definition to a more general family of functions that can be decomposed into transient and steady-state parts.

2.1. Steady-state trajectory

Given the understanding that biological networks exhibit steady-state behavior, we confine ourselves to networks exhibiting steady-state behavior. Moreover, since a cell uses nutrients such as amino acids and nucleotides in the cytoplasm to synthesize various molecular components, that is, RNAs and proteins [18], and since there are only limited supplies of nutrients available, the amount of molecules present in a cell is bounded. Thus, the existence of steady-state behavior implies that each individual gene trajectory can be modeled as a bounded function f(t) that can be decomposed into a transient trajectory plus a steady-state trajectory:

f(t) = f_tran(t) + f_ss(t),   (1)

where lim_{t→∞} f_tran(t) = 0 and f_ss(t) is either a periodic function or a constant function. The limit condition on the transient part of the trajectory indicates that for large values of t, the trajectory is very close to its steady-state part. This can be expressed in the following manner: for any ε > 0, there exists a time t_ss such that |f(t) − f_ss(t)| < ε for t > t_ss. This property is useful for identifying f_ss(t) from simulated data by finding an instant t_ss such that f(t) is almost periodic or constant for t > t_ss. A deterministic gene regulatory network, whether it is represented by a set of differential equations or state transition equations, produces different dynamic behaviors depending on the starting point. If ψ is a network with N genes and x_0 is an initial state, then its trajectory is
, f(ψ,x (t) , f(ψ,x0 ) (t) = f(ψ,x 0) 0) (2) (i) where f(ψ,x (t) is a trajectory for an individual gene (denoted 0) by f (i) (t) or f (t) where there is no ambiguity) generated by the dynamic behavior of the network ψ when starting at x0 . For a differential-equation model, the trajectory f(ψ,x0 ) (t) can be obtained as a solution of a system of differential equations; for a discrete model, it can be obtained by iterating the system’s transition equations. Trajectories may be continuoustime functions or discrete-time functions, depending on the model. The decomposition of (1) applies to f(ψ,x0 ) (t) via its ap(i) (t). In the case plication to the individual trajectories f(ψ,x 0) of discrete-valued networks (with bounded values), the system must enter an attractor cycle or an attractor state at some time point tss . In the first case f(ψ,x0 ),ss (t) is periodical, and in the second case it is constant. In both cases, f(ψ,x0 ),tran (t) = 0 for t ≥ tss . 2.2. Distance based on the amplitude cumulative distribution Different metrics have been proposed to compare two realvalued trajectories f (t) and g(t), including the correlation f , g , the cross-correlation Γ f ,g (τ), the cross-spectral density p f ,g (ω), the difference between their amplitude cumulative distributions F(x) = p f (x) and G(x) = pg (x), and the difference between their statistical moments [20]. Each has its benefits and drawbacks depending on one’s purpose. In this paper, we propose using the difference between the amplitude cumulative distributions of the steady-state trajectories. Let fss (t) and gss (t) be two measurable functions that are either periodical or constant, representing the steady-state parts of two functions, f (t) and g(t), respectively. Our goal is to define a metric (distance) between them by using the 3 6 1 5 0.9 4 0.8 3 0.7 2 0.6 F(x) x = f (t) Marcel Brun et al. 
amplitude cumulative distribution (ACD), which measures the probability density of a function [20]. If f_ss(t) is periodic with period t_p > 0, its cumulative density function F(x) over R is defined by

F(x) = λ(M(x)) / t_p,  (3)

where λ(A) is the Lebesgue measure of the set A and

M(x) = { t_s ≤ t < t_e | f_ss(t) ≤ x },  (4)

where t_e = t_s + t_p, for any point t_s. If f_ss is constant, given by f_ss(t) = a for any t, then we define F(x) as a unit step function located at x = a.

Figure 1: Example of (a) periodic and constant functions f(t) and (b) their amplitude cumulative distributions F(x).

Figure 1 shows an example of some periodic functions and their amplitude cumulative distributions. Given two steady-state trajectories, f_ss(t) and g_ss(t), and their respective amplitude cumulative distributions, F(x) and G(x), we define the distance between f_ss and g_ss as the distance between the distributions,

d_ss(f_ss, g_ss) = ||F − G||,  (5)

for some suitable norm ||·||. Examples of norms include L∞, defined by the supremum of their differences,

d_L∞(f, g) = sup_{0 ≤ x < ∞} |F(x) − G(x)|,  (6)

and L1, defined by the area of the absolute value of their difference,

d_L1(f, g) = ∫_{0 ≤ x < ∞} |F(x) − G(x)| dx.  (7)

In both cases, we apply the biological constraint that the amplitudes are nonnegative. The L1 norm is well suited to the steady-state behavior because, in the case of constant functions f(t) = a and g(t) = b, their distributions are unit step functions at x = a and x = b, respectively, so that d_L1(f, g) = |a − b|, the distance, in amplitude, between the two functions.
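To make (3) concrete, the ACD of a densely sampled steady-state trajectory can be estimated as the fraction of a period during which the trajectory lies at or below a given amplitude. The following minimal Python sketch (the grid size and the test function 2 sin(t) are our own illustrative choices, not values from the paper) checks that F(0) ≈ 0.5 for f_ss(t) = 2 sin(t), which is nonpositive for exactly half of its period:

```python
import math

def acd(samples):
    # Empirical amplitude cumulative distribution: F(x) is the fraction
    # of the sampled period during which the trajectory is <= x.
    n = len(samples)
    def F(x):
        return sum(1 for a in samples if a <= x) / n
    return F

# One period of f_ss(t) = 2 sin(t), sampled on a uniform grid.
t_p = 2 * math.pi
m = 10000
samples = [2 * math.sin(i * t_p / m) for i in range(m)]
F = acd(samples)
# f_ss(t) <= 0 for exactly half of its period, so F(0) is about 0.5;
# the amplitude never exceeds 2, so F(x) = 1 for x >= 2.
```

For a constant trajectory the same estimator reduces to a unit step at the constant's value, matching the definition above.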
Hence, we can interpret the distance d_L1(f, g) as an extension of the distance, in amplitude, between two constant signals to the general case of periodic functions, taking into consideration the differences in their shapes.

2.3. Network metric

Once a distance between steady-state trajectories is defined, we can extend it to two trajectories f(t) and g(t) by

d_tr(f, g) = d_ss(f_ss, g_ss),  (8)

where d_ss is defined by (5). The next step is to define the distance between two multivariate trajectories f(t) and g(t) by

d_tr(f, g) = (1/N) Σ_{i=1}^{N} d_tr(f^(i), g^(i)),  (9)

where f^(i)(t) and g^(i)(t) are the component trajectories of f(t) and g(t), respectively. Owing to the manner in which a norm is used to define d_ss, in conjunction with the manner in which d_tr is constructed from d_ss, the triangle inequality

d_tr(f, h) ≤ d_tr(f, g) + d_tr(g, h)  (10)

holds, and d_tr is a metric. The last step is to define the metric between two networks as the expected distance between the trajectories over all possible initial states. For networks ψ1 and ψ2, we define

d(ψ1, ψ2) = E_S[ d_tr( f_(ψ1,x_0), f_(ψ2,x_0) ) ],  (11)

where the expectation is taken with respect to the space S of initial states.

The use of a metric, in particular the triangle inequality, is essential for the problem of estimating complex networks by using simpler models. This is akin to the pattern-recognition problem of estimating a complex classifier via a constrained classifier to mitigate the data requirement. In this situation, there is a complex model that represents a broad family of networks and a simpler model that represents a smaller class of networks. Given a reference network from the complex model and a sampled trajectory from it, we want to estimate the optimal constrained network.
We can identify the optimal constrained network, that is, the projected network, as the one that best approximates the complex one, and the goal of the inference process should be to obtain a network close to the optimal constrained network. Let ψ be a reference network (e.g., a continuous-valued ODE-based network), let P(ψ) be the optimal constrained network (e.g., a discrete-valued network), and let ω be an estimator of P(ψ) estimated from data sampled from ψ. Then

d(ω, ψ) ≤ d(ω, P(ψ)) + d(P(ψ), ψ),  (12)

where the following distances have natural interpretations:

(i) d(ω, ψ) is the overall distance and quantifies the approximation of the reference network by the estimated optimal constrained network;
(ii) d(ω, P(ψ)) is the estimation distance for the constrained network and quantifies the inference of the optimal constrained network;
(iii) d(P(ψ), ψ) is the projection distance and quantifies how well the optimal constrained network approximates the reference network.

This structure is analogous to the classical constrained regression problem, where constraints are used to facilitate better inference via reduction of the estimation error (so long as this reduction exceeds the projection error) [11]. In the case of networks, the constraint problem becomes one of finding a projection mapping for models representing biological processes for which the loss defined by d(P(ψ), ψ) may be maintained within manageable bounds so that, with good inference techniques, the estimation error defined by d(ω, P(ψ)) will be minimized.

2.4. Estimation of the amplitude cumulative distribution

The amplitude cumulative distribution of a trajectory can be estimated by simulating the trajectory and then estimating the ACD from the simulated data. Assuming that the steady-state trajectory f_ss(t) is periodic with period t_p, we can analyze f_ss(t) between two points, t_s and t_e = t_s + t_p. For a continuous function f_ss(t), we assume that any amplitude value x is visited only a finite number of times by f_ss(t) in a period t_s ≤ t < t_e. In accordance with (3), we define the cumulative distribution

F(x) = λ({t_s ≤ t ≤ t_e | f_ss(t) ≤ x}) / t_p.  (13)

To calculate F(x) from a sampled trajectory, for each value x, let S_x be the set of points where f_ss(t) = x:

S_x = {t_s ≤ t ≤ t_e | f_ss(t) = x} ∪ {t_s, t_e}.  (14)

The set S_x is finite. Let n = |S_x| denote the number of elements t_0, . . . , t_{n−1}. These can be sorted so that t_s = t_0 < t_1 < t_2 < ··· < t_{n−1} = t_e. Now we define the set of intermediate values m_i, i = 0, . . . , n − 2, between two consecutive points where f_ss(t) crosses x (see Figure 2), by

m_i = f_ss((t_i + t_{i+1}) / 2).  (15)

Figure 2: Example of determination of the values m_i.

Let I_x be the set of indices of points t_i such that the function f(t) is below x in the interval [t_i, t_{i+1}],

I_x = {0 ≤ i ≤ n − 2 | m_i ≤ x}.  (16)

Finally, the cumulative distribution F(x), defined by the measure of the set {t_s ≤ t ≤ t_e | f(t) ≤ x}, can be computed as the sum of the lengths of the intervals where f(t) ≤ x:

F(x) = Σ_{i ∈ I_x} (t_{i+1} − t_i) / t_p.  (17)

The estimation of F(x) from a finite set {a_1, . . . , a_m} representing the function f(t) at points t_1, . . . , t_m reduces to estimating the values in (17):

F(x) = |{1 ≤ i ≤ m | a_i ≤ x}| / m  (18)

at the points a_i, i = 1, . . . , m. In the case of computing the distance between two functions f(t) and g(t), where the only information available consists of two samples, {a_1, . . . , a_m} and {b_1, . . . , b_r}, for f and g, respectively, both cumulative distributions F(x) and G(x) need only be defined at the points in the set

S = {a_1, . . . , a_m} ∪ {b_1, . . . , b_r}.  (19)

Figure 3: Block diagram of a model for transcriptional regulation.
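The sample-based estimate (18), evaluated on the merged support set (19), is enough to compute the distance between two sampled trajectories. A minimal Python sketch (function and variable names are ours, not the paper's); as a sanity check, two constant trajectories at amplitudes 3 and 4 should be at L1 distance |3 − 4| = 1:

```python
def empirical_cdf(sample):
    # Fraction of sample values <= x, per (18).
    n = len(sample)
    def F(x):
        return sum(1 for a in sample if a <= x) / n
    return F

def d_l1(sample_f, sample_g):
    # Approximate the L1 distance between the amplitude cumulative
    # distributions of two sampled steady-state trajectories.
    F, G = empirical_cdf(sample_f), empirical_cdf(sample_g)
    s = sorted(set(sample_f) | set(sample_g))   # merged support set (19)
    # Riemann sum of |F - G| over the merged support intervals.
    return sum((s[i + 1] - s[i]) * abs(F(s[i]) - G(s[i]))
               for i in range(len(s) - 1))

# Constant trajectories at amplitudes 3 and 4: the distance is |3 - 4| = 1.
print(d_l1([3.0] * 50, [4.0] * 50))   # -> 1.0
```

The two step-function CDFs differ only on [3, 4), so the sum collapses to a single term of width 1 and height 1.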
In this case, if we sort the set S so that 0 = s_0 < s_1 < ··· < s_k = T (with T being the upper limit for the amplitude values, and k ≤ r + m), then (6) can be approximated by

d_L∞(f, g) = max_{0 ≤ i ≤ k} |F(s_i) − G(s_i)|  (20)

and (7) can be approximated by

d_L1(f, g) = Σ_{i=0}^{k−1} (s_{i+1} − s_i) |F(s_i) − G(s_i)|.  (21)

3. APPLICATION TO QUANTIZATION

To illustrate application of the network metric, we analyze how different degrees of quantization affect model accuracy. Quantization is an important issue in network modeling because it is imperative to balance the desire for fine description against the need for reduced complexity, for both inference and computation. Since it is difficult, if not impossible, to directly evaluate the goodness of a model against a real biological system, we study the problem using a standard engineering approach. First, an in numero reference network model or system is formulated. Then, a second network model with a different level of abstraction is introduced to approximate the reference system. The objective is to investigate how different levels of abstraction, quantization levels in this study, impact the accuracy of the model prediction. The first model is called the reference model. From it, reference networks will be instantiated with appropriate sets of model parameters. This model is continuous-valued so as to approximate the reference system as closely as possible. The second model is called the projected model, and projected networks will be instantiated from it. This model is a discrete-valued model at a given level of quantization. The ability of a projected network, an instance of the projected model, to approximate a reference network, an instance of the reference model, can be evaluated by comparing the trajectories generated from each network with different initial states and computing the distances between the networks as given by (11).

3.1.
Reference model

The origin of our reference model is a differential-equation model that quantitatively represents transcription, translation, cis-regulation, and chemical reactions [7, 15, 21]. Specifically, we consider a differential-equation model that approximates the process of transcription and translation for a set of genes and their associated proteins (as illustrated in Figure 3) [7]. The model comprises the following differential equations:

dp_i(t)/dt = λ_i r_i(t − τ_p,i) − γ_i p_i(t), i ∈ G,
dr_i(t)/dt = κ_i c_i(t − τ_r,i) − β_i r_i(t), i ∈ G,  (22)
c_i(t) = φ_i(p_j(t − τ_c,j), j ∈ R_i), i ∈ G,

where r_i and p_i are the concentrations of mRNA and protein induced by gene i, respectively, c_i(t) is the fraction of DNA fragments committed to transcription of gene i, κ_i is the transcription rate of gene i, and τ_p,i, τ_r,i, and τ_c,i are the time delays for each process to start when the conditions are given. The most general form for the function φ_i is a real-valued (usually nonlinear) function with domain in R^|R_i| and range in R, φ_i : R^|R_i| → R. The functions are defined by the equations

φ_i(p_j, j ∈ R_i) = Π_{j ∈ R_i+} [1 − ρ(p_j, S_ij, θ_ij)] × Π_{j ∈ R_i−} ρ(p_j, S_ij, θ_ij),  (23)
ρ(p, S, θ) = 1 / (1 + θp)^S,

where the discrete parameter S_ij is the number of distinct binding sites on gene i to which protein j can bind and θ_ij is the affinity constant between gene i and protein j.

A discrete-time model results from the preceding continuous-time model by discretizing the time t on intervals nδt, under the assumption that the fraction of DNA fragments committed to transcription and the concentration of mRNA remain constant in the time interval [t − δt, t) [7]. In place of the differential equations for r_i, p_i, and c_i, at time t = nδt, we have the equations

r_i(n) = e^{−β_i δt} r_i(n − 1) + κ_i s(β_i, δt) c_i(n − n_r,i − 1), i ∈ G,
p_i(n) = e^{−γ_i δt} p_i(n − 1) + λ_i s(λ_i, δt) r_i(n − n_p,i − 1), i ∈ G,  (24)
c_i(n) = φ_i(p_j(n − n_c,j), j ∈ R_i), i ∈ G,  (25)

where n_r,i = τ_r,i/δt, n_p,i = τ_p,i/δt, n_c,j = τ_c,j/δt, and s(x, y) = (1 − e^{−xy})/x. This model, which will serve as our reference model, is called a (discrete) transcriptional regulatory system (tRS). We generate networks using this model and a fixed set θ of parameters. We call these networks reference networks. A reference network is identified by its set θ of parameters,

θ = (α_1, β_1, λ_1, γ_1, κ_1, τ_p,1, τ_r,1, τ_c,1, φ_1, R_1, . . . , α_N, β_N, λ_N, γ_N, κ_N, τ_p,N, τ_r,N, τ_c,N, φ_N, R_N).  (26)

Table 1: Parameter values used in simulations.
Affinity constant: θ = 10^8 M^−1
Number of binding sites: S = 1
mRNA and protein half-lives: ρ = 1200 s, π = 3600 s
Transcription rates: κ_1 = 0.001 pM s^−1; κ_2 = κ_3 = κ_4 = 0.05 pM s^−1
Translation rate: λ = 0.20 s^−1
Time delays: τ_r = 2000 s (transcription), τ_p = 2400 s (translation), τ_c = 200 s (cis-regulation)

3.2. Projected model

The next step is to reduce the reference network model to a projected network model. This is accomplished by applying constraints to the reference model; the application of constraints modifies the original model, thereby yielding a simpler one. We focus on quantization of the gene expression levels (which are continuous-valued in the reference model) via uniform quantization, which is defined by a finite or denumerable set L of intervals, L_1 = [0, Δx), L_2 = [Δx, 2Δx), . . . , L_i = [(i − 1)Δx, iΔx), . . . , and a mapping Π_L : R → R such that Π(x) = a_i for some collection of points a_i ∈ L_i. The equations (24) and (25) for r_i, p_i, and c_i are replaced by

r̄_i(n) = Π(e^{−β_i δt} r̄_i(n − 1) + κ_i s(β_i, δt) c̄_i(n − n_r,i − 1)),  (27)
p̄_i(n) = Π(e^{−γ_i δt} p̄_i(n − 1) + λ_i s(λ_i, δt) r̄_i(n − n_p,i − 1)),  (28)
c̄_i(n) = φ_i(p̄_j(n − n_c,j), j ∈ R_i), i ∈ G.  (29)

Issues to be investigated include (1) how different quantization techniques (specification of the partition L) affect the quality of the model; (2) which quantization technique (mapping Π) is the best for the model; and (3) the similarity of the attractors of the dynamical system defined by (27) and (28) to the steady state of the original system, as a function of Δx. We consider the first issue.

3.3. A hypothetical metabolic pathway

To illustrate the proposed metric in the framework of the reference and projected models, we compare two networks based on a hypothetical metabolic pathway. We first briefly describe the hypothetical metabolic pathway with the necessary biochemical parameters to set up a reference system. Then, a simulation study shows the impact of various quantization levels, in both time and trajectory, based on the proposed metric. We consider a gene regulatory network consisting of four genes. A graphical representation of the system is depicted in Figure 4, with activating and repressing interactions indicated by distinct edge symbols.

Figure 4: Example of a tRS of a hypothetical metabolic pathway that consists of four genes, with activating and repressing interactions marked by distinct edge symbols.

We assume that the GRN regulates a hypothetical pathway, which metabolizes an input substrate to an output product. This is done by means of enzymes whose transcriptional control is regulated by the protein produced from gene 3. Moreover, we assume that the effect of a higher input substrate concentration is to increase the transcription rate κ_1,
whereas the effect of a lower substrate concentration is to reduce κ_1.

Figure 5: Example of trajectories from the first simulation of the 4-gene network. Each panel shows the trajectory for one of the four genes, for several values of the quantization level Δx, represented by the lines Q = 0, Q = 0.001, Q = 0.01, and Q = 0.1 (Q = 0 represents the original network without quantization). The value S displayed in each panel is the distance computed between the trajectory and the one with Q = 0. The vertical axis shows the concentration levels x in pM; the horizontal axis shows the time t in seconds.

Unless otherwise specified, the parameters are assumed to be gene-independent. These parameters are summarized in Table 1. We assume that each cis-regulator is controlled by one module with four binding sites, and set S = 4, θ = 10^8 M^−1, κ_2 = κ_3 = κ_4 = 0.05 pM s^−1, and λ = 0.05 s^−1. The value of the affinity constant θ corresponds to a binding free energy
of ΔU = −11.35 kcal/mol at temperature T = 310.15 K (or 37°C). The values of the transcription rates κ_2, κ_3, and κ_4 correspond to transcriptional machinery that, on average, produces one mRNA molecule every 8 seconds. This value turns out to be typical for yeast cells [22]. We also assume that, on average, the volume of each cell equals 4 pL [18]. The translation rate λ is taken to be 10-fold larger than the rate of 0.3/minute for translation initiation observed in vitro using a semipurified rabbit reticulocyte system [23].

Figure 6: Example of estimated cumulative distribution functions (CDFs) from the first simulation of the 4-gene network, computed from the trajectories in Figure 5. Each panel shows the CDF for one of the four genes, for several values of the quantization level Δx, represented by the lines Q = 0, Q = 0.001, Q = 0.01, and Q = 0.1 (Q = 0 represents the original network without quantization). The value S displayed in each panel is the distance computed between the trajectory and the one with Q = 0. The vertical axis shows the cumulative distribution F(x); the horizontal axis shows the concentration levels x in pM.

The degradation parameters β and γ are specified by means of the mRNA and protein half-life parameters ρ and π, respectively, which satisfy

e^{−βρ} = 1/2,  e^{−γπ} = 1/2.  (30)
In this case,

β = ln 2 / ρ,  γ = ln 2 / π.  (31)

Figure 7: Results for the first simulation: the vertical axis shows the distance d_L1(f_(Δx,δt), f_(Δx=0,δt)) as a function of the quantization levels for both the values (axis labeled “Δx”) and the time (axis labeled “δt”).

3.4. Results and discussion

It is expected that the finer the quantization (smaller values of Δx), the more similar the projected networks will be to the reference networks. This similarity should be reflected by the trajectories as measured by the proposed metric. A straightforward simulation consists of the design of a reference network, the design of a projected network (for some value of Δx), the generation of several trajectories for both networks from randomly selected starting points, and the computation of the average distance between trajectories, using (9) and (21). Each process is repeated for different time intervals δt to study how the time intervals used in the simulation affect the analysis. The first simulation is based on the same 4-gene model presented in [7]. We use 6 different quantization levels, Δx = 0, 0.001, 0.01, 0.1, 1, and 10, where Δx = 0 means no quantization and designates the reference network. For each quantization level Δx and starting point x_0, we generate the simulated time-series expression and compare it to the time series generated with Δx = 0 (the reference network), estimating the proposed metric using (21). The process is repeated using a total of 10 different time intervals, δt = 1 second, 5 seconds, 10 seconds, 30 seconds, 1 minute, 2 minutes, 5 minutes, 10 minutes, 30 minutes, and 1 hour. The simulation is repeated and the distances are averaged over 30 different starting points x_0.
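The simulation protocol above can be sketched on a drastically simplified stand-in for the tRS (a hedged illustration, not the four-gene model of [7]): a single gene following the mRNA update of (24) with the cis-regulation input held at 1, and a lower-edge uniform quantizer as one possible choice for the mapping Π in (27). All parameter values here are our own toy choices. For each quantization level Δx, the quantized and reference updates are run from random starting points and the steady-state errors are averaged:

```python
import math
import random

def s(x, y):
    # s(x, y) = (1 - e^(-x*y)) / x, as in the discrete tRS equations.
    return (1.0 - math.exp(-x * y)) / x

def quantize(x, dx):
    # Uniform quantization: map x to the lower edge of its interval
    # L_i = [(i-1)*dx, i*dx); one possible choice of representative a_i.
    return x if dx == 0 else dx * math.floor(x / dx)

def run(x0, dx, beta=0.01, kappa=0.05, dt=1.0, n_steps=2000):
    # Iterate a one-gene analogue of (24)/(27) with the cis-regulation
    # input c fixed at 1, and return the final (steady-state) value.
    x = x0
    decay = math.exp(-beta * dt)
    drive = kappa * s(beta, dt)
    for _ in range(n_steps):
        x = quantize(decay * x + drive, dx)
    return x

def mean_distance(dx, n_starts=30):
    # Average |quantized steady state - reference steady state| over
    # random initial states: a scalar stand-in for the expectation (11).
    rng = random.Random(0)
    total = 0.0
    for _ in range(n_starts):
        x0 = rng.uniform(0.0, 10.0)
        total += abs(run(x0, dx) - run(x0, 0.0))
    return total / n_starts

# Coarser quantization tends to increase the steady-state error;
# dx = 0 reproduces the reference network exactly.
errs = [mean_distance(dx) for dx in (0.0, 0.01, 0.1, 1.0)]
```

Even this scalar toy reproduces the qualitative trend reported below: the error vanishes for Δx = 0 and grows as the quantization coarsens, because a coarse lower-edge quantizer can trap the slowly rising trajectory below its true fixed point.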
Figures 5 and 6 show the trajectories and empirical cumulative distribution functions estimated from the simulated system described in the previous section. Several quantization levels are used in the simulation. The last graph in Figure 5 shows the mRNA concentration for the fourth gene over the first 10 000 seconds (transient) and over the last 10 000 seconds (steady state). We can see that for quantizations 0 and 0.001 the steady-state solutions are periodic, and for quantizations 0.01 and 0.1 the solutions are constant. This is reflected by the associated plots of F(x) in Figure 6.

Figure 8: Results for the first simulation: the vertical axis shows the distance d_L1(f_(Δx,δt), f_(Δx=0,δt)) as a function of the quantization levels for both the values (labeled “Δx”) and the time (labeled “δt”). Part (a) shows the distance as a function of Δx for several values of δt. Part (b) shows the distance as a function of δt for several values of Δx.

Figure 7 shows how strong quantization (high values of Δx) yields high distance, with the distance decreasing again as the time interval (δt) increases. The z-axis in the figure represents the distance d_L1(f_(Δx,δt), f_(Δx=0,δt)).

Figure 9: Results for the second simulation: the vertical axis shows the distance d_L1(f_(Δx,δt), f_(Δx=0,δt)) as a function of the quantization levels for both the values (axis labeled “Δx”) and the time (axis labeled “δt”).

In our second simulation, we use a different connectivity (all other kinetic parameters are unchanged), and we
again use 10 different time intervals, δt = 1 second, 5 seconds, 10 seconds, 30 seconds, 1 minute, 2 minutes, 5 minutes, 10 minutes, 30 minutes, and 1 hour, and 6 different quantization levels, Δx = 0, 0.001, 0.01, 0.1, 1, and 10 (Δx = 0 meaning no quantization). The simulation is repeated and the distances are averaged over 30 different starting points. Analogous to the first simulation, Figure 9 shows how strong quantization (high values of Δx) yields high distance, which decreases when the time interval (δt) increases. An important observation regarding Figures 8 and 10 is that the error decreases as δt increases. This is due to the fact that the coarser the amplitude quantization is, the more difficult it is for small time intervals to capture the dynamics of slowly changing sequences.

4. CONCLUSION

This study has proposed a metric to quantitatively compare two networks and has demonstrated the utility of the metric via a simulation study involving different quantizations of the reference network. A key property of the proposed metric is that it allows comparison of networks of different natures. It also takes into consideration differences in the steady-state behavior and is invariant under time shifting and scaling. The metric can be used for various purposes besides quantization issues. Possibilities include the generation of a projected network from a reference network by removing proteins from the equations, and connectivity reduction by removing edges in the connectivity matrix. The metric facilitates systematic study of the ability of discrete dynamical models, such as Boolean networks, to approximately represent more complex models, such as differential-equation models.
This can be particularly important in the framework of network inference, where the parameters for projected models can be inferred from the reference model, either analytically or via synthetic data generated by simulation of the reference model.

Figure 10: Results for the second simulation: the vertical axis shows the distance d_L1(f_(Δx,δt), f_(Δx=0,δt)) as a function of the quantization levels for both the values (labeled “Δx”) and the time (labeled “δt”). Part (a) shows the distance as a function of Δx for several values of δt. Part (b) shows the distance as a function of δt for several values of Δx.

Then, given the reference and projected models, the metric can be used to determine the level of abstraction that provides the best inference given the amount of observations available; this approach corresponds to classification-rule constraint for classifier inference in pattern recognition.

NOMENCLATURE

Trajectory: A function f(t)
Distance function: The proposed distance between networks

NOTATIONS

t: Time
ψ: Network
x_0: Starting point
f(t), g(t), h(t): Trajectories
f_ss, g_ss: Steady-state trajectories
f_(ψ,x_0)(t): Trajectory
f_tran: Transient part of the trajectory
f_ss: Steady-state part of the trajectory
F(x), G(x), H(x): Cumulative distribution functions
d_tr(·, ·): Distance between two trajectories
d_ss(·, ·): Distance between two periodic or constant trajectories
λ(A): Lebesgue measure of the set A
f(t): Multivariate trajectory

ACKNOWLEDGMENTS

We would like to thank the National Science Foundation (CCF-0514644) and the National Cancer Institute (R01 CA104620) for sponsoring this research in part.

REFERENCES

[1] H. De Jong, “Modeling and simulation of genetic regulatory systems: a literature review,” Journal of Computational Biology, vol. 9, no. 1, pp. 67–103, 2002.
[2] R. Srivastava, L. You, J. Summers, and J. Yin, “Stochastic vs.
deterministic modeling of intracellular viral kinetics,” Journal of Theoretical Biology, vol. 218, no. 3, pp. 309–321, 2002.
[3] R. Albert and A.-L. Barabási, “Statistical mechanics of complex networks,” Reviews of Modern Physics, vol. 74, no. 1, pp. 47–97, 2002.
[4] S. Kim, H. Li, E. R. Dougherty, et al., “Can Markov chain models mimic biological regulation?” Journal of Biological Systems, vol. 10, no. 4, pp. 337–357, 2002.
[5] R. Albert and H. G. Othmer, “The topology of the regulatory interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster,” Journal of Theoretical Biology, vol. 223, no. 1, pp. 1–18, 2003.
[6] S. Aburatani, K. Tashiro, C. J. Savoie, et al., “Discovery of novel transcription control relationships with gene regulatory networks generated from multiple-disruption full genome expression libraries,” DNA Research, vol. 10, no. 1, pp. 1–8, 2003.
[7] J. Goutsias and S. Kim, “A nonlinear discrete dynamical model for transcriptional regulation: construction and properties,” Biophysical Journal, vol. 86, no. 4, pp. 1922–1945, 2004.
[8] H. Li and M. Zhan, “Systematic intervention of transcription for identifying network response to disease and cellular phenotypes,” Bioinformatics, vol. 22, no. 1, pp. 96–102, 2006.
[9] A. Datta, A. Choudhary, M. L. Bittner, and E. R. Dougherty, “External control in Markovian genetic regulatory networks,” Machine Learning, vol. 52, no. 1-2, pp. 169–191, 2003.
[10] A. Choudhary, A. Datta, M. L. Bittner, and E. R. Dougherty, “Control in a family of Boolean networks,” in IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS ’06), College Station, Tex, USA, May 2006.
[11] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, New York, NY, USA, 1996.
[12] I. Ivanov and E. R. Dougherty, “Modeling genetic regulatory networks: continuous or discrete?” Journal of Biological Systems, vol. 14, no. 2, pp. 219–229, 2006.
[13] I. Ivanov and E. R. Dougherty, “Reduction mappings between probabilistic Boolean networks,” EURASIP Journal on Applied Signal Processing, vol. 2004, no. 1, pp. 125–131, 2004.
[14] S. Ott, S. Imoto, and S. Miyano, “Finding optimal models for small gene networks,” in Proceedings of the Pacific Symposium on Biocomputing (PSB ’04), pp. 557–567, Big Island, Hawaii, USA, January 2004.
[15] L. F. Wessels, E. P. van Someren, and M. J. Reinders, “A comparison of genetic network models,” in Proceedings of the Pacific Symposium on Biocomputing (PSB ’01), pp. 508–519, Lihue, Hawaii, USA, January 2001.
[16] M. B. Elowitz, A. J. Levine, E. D. Siggia, and P. S. Swain, “Stochastic gene expression in a single cell,” Science, vol. 297, no. 5584, pp. 1183–1186, 2002.
[17] S. A. Kauffman, The Origins of Order: Self-Organization and Selection in Evolution, Oxford University Press, New York, NY, USA, 1993.
[18] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter, Molecular Biology of the Cell, Garland Science, New York, NY, USA, 4th edition, 2002.
[19] S. A. Kauffman, “Metabolic stability and epigenesis in randomly constructed genetic nets,” Journal of Theoretical Biology, vol. 22, no. 3, pp. 437–467, 1969.
[20] P. A. Lynn, An Introduction to the Analysis and Processing of Signals, John Wiley & Sons, New York, NY, USA, 1973.
[21] A. Arkin, J. Ross, and H. H. McAdams, “Stochastic kinetic analysis of developmental pathway bifurcation in phage λ-infected Escherichia coli cells,” Genetics, vol. 149, no. 4, pp. 1633–1648, 1998.
[22] V. Iyer and K. Struhl, “Absolute mRNA levels and transcriptional initiation rates in Saccharomyces cerevisiae,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 11, pp. 5208–5212, 1996.
[23] J. R. Lorsch and D. Herschlag, “Kinetic dissection of fundamental processes of eukaryotic translation initiation in vitro,” EMBO Journal, vol. 18, no. 23, pp. 6705–6717, 1999.
Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 73109, 11 pages
doi:10.1155/2007/73109

Research Article
A Robust Structural PGN Model for Control of Cell-Cycle Progression Stabilized by Negative Feedbacks

Nestor Walter Trepode,1 Hugo Aguirre Armelin,2 Michael Bittner,3 Junior Barrera,1 Marco Dimas Gubitoso,1 and Ronaldo Fumio Hashimoto1

1 Institute of Mathematics and Statistics, University of São Paulo, Rua do Matao 1010, 05508-090 São Paulo, SP, Brazil
2 Institute of Chemistry, University of São Paulo, Avenue Professor Lineu Prestes 748, 05508-900 São Paulo, SP, Brazil
3 Translational Genomics Research Institute, 445 N. Fifth Street, Phoenix, AZ 85004, USA

Received 27 July 2006; Revised 24 November 2006; Accepted 10 March 2007

Recommended by Tatsuya Akutsu

The cell division cycle comprises a sequence of phenomena controlled by a stable and robust genetic network. We applied a probabilistic genetic network (PGN) to construct a hypothetical model with a dynamical behavior displaying the degree of robustness typical of the biological cell cycle. The structure of our PGN model was inspired by well-established biological facts such as the existence of integrator subsystems, negative and positive feedback loops, and redundant signaling pathways. Our model represents gene interactions as stochastic processes and presents strong robustness in the presence of moderate noise and parameter fluctuations. A recently published deterministic yeast cell-cycle model does not perform as well as our PGN model, even under moderate noise conditions. In addition, self-stimulatory mechanisms can give our PGN model the possibility of a pacemaker activity similar to that observed in the oscillatory embryonic cell cycle.

Copyright © 2007 Nestor Walter Trepode et al.
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

A complex genetic network is the central controller of the cell-cycle process, by which a cell grows, replicates its genetic material, and divides into two daughter cells. The cell-cycle control system adapts to specific environmental conditions or cell types, exhibits stability in the presence of variable excitation, is robust to parameter fluctuation, and is fault tolerant thanks to replicated network structures. It also receives information from the processes being regulated and is able to arrest the cell cycle at specific "checkpoints" if some events have not been completed correctly. This is achieved by means of intracellular negative feedback signals [1, 2]. Recently, two models were proposed to describe this control system. After exhaustive literature studies, Li et al. proposed a deterministic discrete binary model of the yeast cell-cycle control system, completely based on documented data [3]. They studied the signal wave generated by the model, which goes through all the consecutive phases of cell-cycle progression, and verified by simulation that almost all state transitions of this deterministic model converge to this "biological pathway," showing stability under different activation signal waveforms. Based on experimental data, Pomerening et al. proposed a continuous deterministic model for the self-stimulated embryonic cell cycle, which performs one division after another, without the need for external stimuli or waiting to grow [4]. We recently proposed the probabilistic genetic network (PGN) model, in which the influence between genes is represented by a stochastic process. A PGN is a particular family of Markov chains with some additional properties (axioms) inspired by biological phenomena.
Some implications of these axioms are: stationarity; all states are reachable; one variable's transition is conditionally independent of the other variables' transitions; the probability of the most probable state trajectory is much higher than the probabilities of the other possible trajectories (i.e., the system is almost deterministic); and a gene is seen as a nonlinear stochastic gate whose expression depends on a linear combination of activator and inhibitory signals, the system being built by compiling these elementary gates. This model was successfully applied to the design of malaria parasite genetic networks [5, 6]. Here we propose a hypothetical structural PGN model for the eukaryotic control of cell-cycle progression that aims to reproduce the typical robustness observed in the dynamical behavior of biological systems. Control structures inspired by well-known biological facts, such as the existence of integrators, negative and positive feedbacks, and biological redundancies, were included in the model architecture. After adjusting its parameters heuristically, the model was able to reproduce dynamical properties of real biological systems, such as sequential propagation of gene expression waves, stability in the presence of variable excitation, and robustness in the presence of noise [7]. We carried out extensive simulations—under different stimulus and noise conditions—in order to analyze stability and robustness in our proposed model. We also analyzed the performance of the yeast cell-cycle control model constructed by Li et al. [3] under similar simulations. Under mild noise conditions, our PGN model exhibited remarkable robustness, whereas Li's yeast model did not perform as well. We infer that our PGN model very likely possesses structural features ensuring robustness which Li's model lacks.
To further emulate cellular environment conditions, we extended our model to include random delays in its regulatory signals without degrading its previous stability and robustness. Finally, with the addition of positive feedback, our model became self-stimulated, showing an oscillatory behavior similar to the one displayed by the embryonic cell cycle [4]. Besides being able to reproduce the observed behavior of the other two models, our PGN model showed strong robustness to system parameter fluctuation. The dynamical structure of the proposed model is composed of: (i) prediction by an almost deterministic stochastic rule (i.e., the gene model), and (ii) stochastic choice of an almost deterministic stochastic prediction rule (i.e., random delays). After this introduction, in Section 2 we present our mathematical modeling of a gene regulatory network by a PGN. In Section 3, we briefly describe Li's yeast cell-cycle model and present the simulation, in the presence of noise, of our PGN version of it. Sections 4 and 5 describe the architecture and dynamics of our model for control of cell-cycle progression and analyze its simulations in the presence of noise and random delays in the regulatory signals (the same noise pattern was applied to both our model and Li's yeast model). Section 6 shows the inclusion of positive feedback in our model to obtain a pacemaker activity similar to the one found in embryonic cells. Finally, in Section 7 we discuss our results and future directions of this research.

2. MATHEMATICAL MODELING OF GENETIC NETWORKS

2.1. Genetic regulatory networks

The cell-cycle control system is a complex network comprising many forward and feedback signals acting at specific times. Figure 1 is a schematic representation of such a network, usually called a gene regulatory network.
Proteins produced as a consequence of gene expression (i.e., after transcription and translation) form multiprotein complexes that interact with each other, integrating extracellular signals (not shown), regulating metabolic pathways (arrow 3), receiving (arrow 4) and sending (arrows 1 and 2) feedback signals.

[Figure 1: Gene regulatory network. Labels: feedback signals (arrows 1, 2); DNA, transcription, RNA, translation, proteins; metabolic pathways (arrows 3, 4); microarray measurements.]

In this way, genes and their protein products form a signaling network that controls function, the cell division cycle, and programmed cell death. In that network, the level of expression of each gene depends on its own expression value, on the expression values of other genes at previous instants of time, and on previous external stimuli. These kinds of interactions between genes form networks that may be very complex. The dynamical behavior of these networks can be adequately represented by discrete stochastic dynamical systems. In the following subsections, we present a model of this kind.

2.2. Discrete dynamical systems

Discrete dynamical systems, discrete in time and finite in range, can model the behavior of gene networks [8–12]. In this model, we represent each gene or protein by a variable which takes the value of the gene expression or the protein concentration. All these variables, taken collectively, are the components of a vector called the state of the system. Each component (i.e., gene or protein) of the state vector has an associated function that calculates its next value (i.e., expression value or protein concentration) from the state at previous instants of time. These functions are the components of a function vector, called the transition function, that defines the transition from one state to the next and represents the actual regulatory mechanisms. Let R be the range of all state components. For example, R = {0, 1} in binary systems, and R = {−1, 0, 1} or R = {0, 1, 2} in three-level systems.
The transition function φ, for a network of N variables and memory m, is a function from R^mN to R^N. This means that the transition function φ maps the previous m states, x(t − 1), x(t − 2), ..., x(t − m), into the state x(t), with x(t) = [x1(t), x2(t), ..., xN(t)]^T ∈ R^N. A discrete dynamical system is given by, for every time t ≥ 0,

x(t) = φ(x(t − 1), x(t − 2), ..., x(t − m)). (1)

A component of x is a value xi ∈ R. Systems defined as above are time translation invariant, that is, the transition function is the same for all discrete times t. The system architecture—or structure—is the wiring diagram of the dependencies between the variables (state vector components). The system dynamics is the temporal evolution of the state vector, given by the transition function.

2.3. Probabilistic genetic networks

When the transition function φ is a stochastic function (i.e., for each sequence of states x(t − m), ..., x(t − 2), x(t − 1), the next state x(t) is a realization of a random vector), the dynamical system is a stochastic process. Here we represent gene regulatory networks by stochastic processes, where the stochastic transition function is a particular family of Markov chains called probabilistic genetic network (PGN). Consider a sequence of random vectors X0, X1, X2, ... assuming values in R^N, denoted, respectively, x(0), x(1), x(2), .... A sequence of random states (Xt), t ≥ 0, is called a Markov chain if for every t ≥ 1,

P(Xt = x(t) | X0 = x(0), ..., Xt−1 = x(t − 1)) = P(Xt = x(t) | Xt−1 = x(t − 1)). (2)

That is, the conditional probability of the future event, given the past history, depends only upon the last instant of time. Let X, with realization x, represent the state before a transition, and let Y, with realization y, be the first state after that transition. A Markov chain is characterized by a transition matrix πY|X of conditional probabilities between states, whose elements are denoted p(y|x), and by the probability distribution π0 of the random vector representing the initial state. The stochastic transition function φ at time t is given by, for every t ≥ 1,

φ[x] = φ(x(t − 1)) = y, (3)

where y is a realization of a random vector with distribution p(•|x). An order-m Markov chain—which depends on the m previous instants of time—is equivalent to a Markov chain whose states have dimension m × N. Let the sequence X = Xt−1, ..., Xt−m, with realization x = x(t − 1), ..., x(t − m), represent the sequence of m states before a transition. A probabilistic genetic network (PGN) is an order-m Markov chain (πY|X, π0) such that

(i) πY|X is homogeneous, that is, p(y|x) is independent of t;
(ii) p(y|x) > 0 for all states x ∈ R^mN, y ∈ R^N;
(iii) πY|X is conditionally independent, that is, for all states x ∈ R^mN, y ∈ R^N,

p(y|x) = Π_{i=1}^{N} p(yi | x); (4)

(iv) πY|X is almost deterministic, that is, for every sequence of states x ∈ R^mN, there exists a state y ∈ R^N such that p(y|x) ≈ 1;
(v) for every variable i there exist a matrix ai and a vector bi of real numbers such that, for every x, z ∈ R^mN and yi ∈ R, if

Σ_{j=1}^{N} Σ_{k=1}^{m} a^k_ji xj(t − k) = Σ_{j=1}^{N} Σ_{k=1}^{m} a^k_ji zj(t − k) and Σ_{k=1}^{pi} b^k_i xi(t − k) = Σ_{k=1}^{pi} b^k_i zi(t − k), (5)

then p(yi | x) = p(yi | z), with 0 ≤ pi ≤ m.

These axioms imply that each variable xi is characterized by a matrix and a vector of coefficients and by a stochastic function gi from Z, a subset of the integer numbers, to R. If a^k_ji is positive, then the target variable xi is activated by the variable xj at time t − k; if a^k_ji is negative, then it is inhibited by variable xj at time t − k; if a^k_ji is zero, then it is not affected by variable xj at time t − k. We say that variable xi is predicted by the variable xj when some a^k_ji is different from zero. Similarly, if b^k_i is zero, the value of xi at time t is not affected by its previous value at time t − k.
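Axioms (ii)-(iv) can be exercised mechanically on a toy example. The sketch below (pure Python; the two-gene network, its conditionals, and the noise level are illustrative and not taken from the paper) builds the joint transition matrix from per-gene conditionals, as axiom (iii) prescribes, and checks positivity and near-determinism:

```python
# Toy two-gene binary network (R = {0, 1}, N = 2, memory m = 1); the four
# joint states are indexed 0..3 as (x1 x2) = 00, 01, 10, 11. All numbers
# are illustrative, not taken from the paper.
EPS = 0.01
# p(y_i = 1 | x) for each gene i and each predecessor state x
P_GENE = [
    [EPS, 1 - EPS, EPS, 1 - EPS],   # gene 1 almost surely copies gene 2
    [1 - EPS, 1 - EPS, EPS, EPS],   # gene 2 almost surely negates gene 1
]

def joint_row(x_idx):
    """One row of the joint transition matrix, built by axiom (iii):
    p(y | x) = product over genes i of p(y_i | x)."""
    row = []
    for y_idx in range(4):
        y1, y2 = y_idx >> 1, y_idx & 1
        p1 = P_GENE[0][x_idx] if y1 else 1 - P_GENE[0][x_idx]
        p2 = P_GENE[1][x_idx] if y2 else 1 - P_GENE[1][x_idx]
        row.append(p1 * p2)
    return row

PI = [joint_row(x) for x in range(4)]  # the transition matrix pi_{Y|X}
# Axiom (ii): every transition has positive probability.
assert all(p > 0 for row in PI for p in row)
# Axiom (iv): each row has one near-certain successor state.
assert all(max(row) >= (1 - EPS) ** 2 for row in PI)
```

As EPS tends to 0 the matrix tends to a 0/1 (deterministic) transition rule, which is the sense in which a PGN is "almost deterministic."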
The constant parameter pi, for the state variable xi, represents the number of previous instants of time at which the values of xi can affect the value of xi(t). If pi = 0, previous values of xi have no effect on the value of xi(t), and the summation Σ_{k=1}^{pi} b^k_i xi(t − k) is defined to be zero. The component i of the stochastic transition function φ, denoted φi, is built by the composition of a stochastic function gi with two linear combinations: (i) ai and the previous states x(t − 1), ..., x(t − m), and (ii) bi and the values of xi(t − 1), ..., xi(t − pi). This means that, for every t ≥ 1,

φi(x(t − 1), ..., x(t − m)) = gi(α, β), (6)

where

α = Σ_{j=1}^{N} Σ_{k=1}^{m} a^k_ji xj(t − k), β = Σ_{k=1}^{pi} b^k_i xi(t − k), (7)

and gi(α, β) is a realization of a random variable in R with distribution p(• | α, β). This restriction on gi means that the components of a PGN transition function vector are random variables with a probability distribution conditioned on the two linear combinations, α and β, of the fifth PGN axiom. The PGN model reflects the properties of a gene as a nonlinear stochastic gate. Systems are built by compiling these gates.

Biological rationale for PGN axioms

The axioms that define the PGN model are inspired by biological phenomena. The dynamical system structure is justified by the necessity of representing a sequential process. The discrete representation is sufficient since the interactions between genes and proteins occur at the molecular level [13]. The stochastic aspects represent perturbations or lack of detailed knowledge about the system dynamics. Axiom (i) is just a constraint to simplify the model. In general, real systems are not homogeneous, but may be homogeneous by parts, that is, in time intervals. Axiom (ii) imposes that all states are reachable, that is, noise may lead the system to any state.
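The gate construction of Eqs. (6) and (7) can be sketched as follows; the array layout, thresholds, and noise level P here are illustrative assumptions, not values from the paper:

```python
import random

def alpha_beta(a_i, b_i, hist, i, m, p_i):
    """Linear combinations of Eq. (7). hist[k][j] holds x_j(t-1-k), so
    hist[k-1][j] is x_j(t-k); a_i[k-1][j] stands for a^k_ji and b_i[k-1]
    for b^k_i (layout chosen for this sketch)."""
    N = len(hist[0])
    alpha = sum(a_i[k][j] * hist[k][j] for k in range(m) for j in range(N))
    beta = sum(b_i[k] * hist[k][i] for k in range(p_i))
    return alpha, beta

def g_i(alpha, beta, th=(1, 3), P=0.99, rng=random):
    """Nonlinear stochastic gate: threshold alpha + beta into a level of
    R = {0, 1, 2}, keep it with probability P ~ 1 (axiom (iv)), otherwise
    jump to one of the other levels. Thresholds th are illustrative."""
    d = alpha + beta
    level = 2 if d >= th[1] else (1 if d >= th[0] else 0)
    if rng.random() < P:
        return level
    return rng.choice([v for v in (0, 1, 2) if v != level])
```

For example, with a_i = [[1, −1]], b_i = [1], and one remembered state [2, 1], the gate for gene 0 sees α = 1 and β = 2; with P = 1 the draw is deterministic.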
Axiom (ii) is quite general, reflecting our lack of knowledge about the kind of noise that may affect the system. Axiom (iii) implies that the prediction of each gene can be computed independently of the prediction of the other genes, which is a kind of system decomposition consistent with what is observed in nature. Axiom (iv) means that the system has a main trajectory, that is, one that is much more probable than the others. Axiom (v) means that genes act as nonlinear gates triggered by a balance between inhibitory and excitatory inputs, analogously to neurons.

3. YEAST CELL-CYCLE MODEL

The eukaryotic cell-cycle process is an ordered sequence of events by which the cell grows and divides into two daughter cells. It is organized in four phases: G1 (the cell progressively grows and by the end of this phase becomes irreversibly committed to division), S (phase of DNA synthesis and chromosome replication), G2 (bridging "gap" between S and M), and M (period of chromosome separation and cell division) [1, 2]. The basic organization and control system of the cell cycle have been highly conserved during evolution and are essentially the same in all eukaryotic cells, which makes the study of a simple organism, like yeast, all the more relevant. We made studies of stability and robustness on a recently published deterministic binary control model of the yeast cell cycle, which was entirely built from real biological knowledge after extensive literature studies [3]. From the ≈ 800 genes involved in the yeast cell-cycle process [14], only a small number of key regulators, responsible for the control of the cell-cycle process, were selected to construct a model where each interaction between its variables is documented in the literature. A dynamic model of these interactions would involve various binding constants and rates [15, 16], but, inspired by the on-off characteristic of many of the cell-cycle control network components, and focusing mainly on the overall dynamic properties and stability, they constructed a simple discrete binary model. In this work we refer to its simplified version, whose architecture is shown in Figure 1B of [3]. The simulation in Figure 2(a) shows the state variables' temporal evolution over the biological pathway, which goes through all the sequential phases of the cell cycle, from the excited G1 state (activated when CS—cell size—grows beyond a certain threshold), to the S phase, the G2 phase, the M phase, and finally to the stationary G1 state where it remains. (All simulations in this work were performed using SGEN, a simulator for gene expression networks [17].) The cell-cycle sequence has a total length of 13 discrete time steps (the period of the cycle). Under simulations driven by CS pulses of increasing frequency (not shown here), this system behaved well, showing strong stability, with all initiated cycles systematically going to conclusion, and new cycles being initiated only after the previous one had finished.

[Figure 2: Yeast cell-cycle model simulations; state variables CS, Cln3, MBF, SBF, Cln12, Cdh1, Swi5, Cdc20Cdc14, clb56, Sic1, Clb12, and Mcm1SFF. (a) Simulation of the deterministic binary yeast cell-cycle model with only one activator pulse of CS = 1 at t = −1. After the START state at t = 0, the system goes through the biological pathway, passing by all the sequential cell-cycle phases: G1 at t = 1, 2, 3; S at t = 4; G2 at t = 5; M at t = 6, ..., 10; G1 at t = 11; and from t = 12 the system remains in the G1 stationary state (all variables at zero level except Sic1 = Cdh1 = 1). (b) Simulation of the three-level PGN yeast cell-cycle model with 1% of noise (PGN with P = .99), activated by a single pulse of CS = 2 at t = −1. After 13 time steps (the period of the cycle), the system should remain in the G1 stationary state—all variables at zero level except Sic1 = Cdh1 = 2—(compare with Figure 2(a)). Instead, this small amount of noise is enough to take the system completely out of its expected normal behavior.]

3.1. PGN yeast cell-cycle model

In order to study the effect of noise and of an increased number of signal levels on the performance of Li's yeast model [3], we translated it into a three-level PGN model. Initially, we mapped Li's binary deterministic model into a three-level deterministic one, with range of values R = {0, 1, 2} for the state variables. By PGN axiom (iv), the PGN transition matrix πY|X is almost deterministic, that is, at every time step one of the transition probabilities satisfies p(y|x) ≈ 1. The deterministic case would be the one where, at every time step, this most probable transition has p(y|x) → 1, or, in real terms, the case corresponding to a total absence of noise in the system. In this mapping, binary value 1 was mapped to 2, and binary value 0 was mapped to 0, of the three-level model. Intermediate values (in the driving and transition functions) were mapped in a convenient way, so that they lie between the ones that have an exact correspondence. From this deterministic three-level model (having exactly the same dynamical behavior as the binary model from which it was derived) we specified the following PGN.

Table 1: Threshold values for variables without self-degradation in the PGN yeast cell-cycle model.

            xi(t − 1) = 0   xi(t − 1) = 1   xi(t − 1) = 2
th(1)_xi          1               0              −1
th(2)_xi          2               1               0

3.1.1.
PGN specification and simulation

The total input signal driving a generic variable xi(t) ∈ {0, 1, 2} (1 ≤ i ≤ N) is given by its associated driving function:

di(t − 1) = Σ_{j=1}^{N} a_ji xj(t − 1). (8)

Here, the system has memory m = 1 and a_ji is the weight for variable xj at time t − 1 in the driving function of variable xi. If variable xj is an activator of variable xi, then a_ji = 1; if variable xj is an inhibitor of variable xi, then a_ji = −1; otherwise, a_ji = 0. Let

yi(t) = 2 if di(t − 1) ≥ th(2)_xi,
        1 if th(1)_xi ≤ di(t − 1) < th(2)_xi,    (9)
        0 if di(t − 1) < th(1)_xi.

The stochastic transition function chooses the next value of each variable to be (i) xi(t) = yi(t) with probability P ≈ 1, (ii) xi(t) = a with probability (1 − P)/2, or (iii) xi(t) = b with probability (1 − P)/2, where a, b ∈ {0, 1, 2} − {yi} and th(1)_xi, th(2)_xi are the threshold values for one and two in the transition function of variable xi. For this model to converge, when P → 1, to the deterministic one of the previous subsection, these thresholds must have the values indicated in Table 1, depending on the value of xi(t − 1). If variable xi has the self-degradation property, its threshold values are those in the column of xi(t − 1) = 0, regardless of the actual value of xi(t − 1). We simulated the three-level PGN version of Li's yeast model with probability P = 0.99 to represent the presence of 1% of noise in the system. Figure 2(b) shows a 200-step simulation of the system when the G1 stationary state is activated by a single start pulse of CS = 2 at t = −1. Comparing with Figure 2(a), we observe that this moderate noise is sufficient to degrade the system's performance. In particular, the system should remain in the G1 stationary state after the 13-step cycle period; however, numerous spurious waveforms are generated.
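A minimal sketch of this stochastic transition rule (Eqs. (8) and (9)); the two-gene wiring, thresholds, and seed below are illustrative, not Li's network:

```python
import random

def pgn_step(x_prev, A, th1, th2, P=0.99, rng=random.Random(7)):
    """One synchronous update of the three-level PGN of Eqs. (8)-(9):
    d_i(t-1) = sum_j A[j][i] * x_j(t-1) is thresholded into the most
    likely level y_i, which is kept with probability P; with probability
    1 - P the gene jumps to one of the two other levels. A, th1, th2 are
    illustrative stand-ins for a_ji and th(1)_xi, th(2)_xi."""
    N = len(x_prev)
    nxt = []
    for i in range(N):
        d = sum(A[j][i] * x_prev[j] for j in range(N))
        y = 2 if d >= th2[i] else (1 if d >= th1[i] else 0)
        if rng.random() >= P:  # noise: leave the most probable branch
            y = rng.choice([v for v in (0, 1, 2) if v != y])
        nxt.append(y)
    return nxt
```

Setting P = 1 recovers the deterministic three-level model, which is how the convergence claim above can be checked numerically.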
Furthermore, when we simulated this system with increasing frequency of the CS activator pulses, noise seriously disturbed the normal signal wave propagation [18]. We conclude that this system does not have a robust performance under 1% of noise.

4. OUR STRUCTURAL MODEL FOR CONTROL OF CELL-CYCLE PROGRESSION

The PGN was applied to construct a hypothetical model, based on components and structural features found in biological systems (integrators, redundancy, positive forward signals, positive and negative feedback signals, etc.), having a dynamical behavior (waves of control signals, stability to changes in the input signal, robustness to some kinds of noise, etc.) similar to those observed in real cell-cycle control systems. During cell-cycle progression, families of genes have either brief or sustained expression during specific cell-cycle phases or transitions between phases (see, e.g., Figure 7 in [14]). In mammalian cells, the G0/G1 transition of the cell cycle requires sequential expression of genes encoding families of master transcription factors, for instance the fos and jun families of proto-oncogenes. Among the fos genes, c-fos and fosB are essentially regulated at the transcription level and are expressed for a brief period of time (0.5 to 1 h), displaying mRNAs and proteins of very short half-life. In addition, G1 progression and the G1/S transition are controlled by the cell-cycle regulatory machine, comprising proteins of sustained (cyclin-dependent kinases—CDKs—and the Rb protein) and transient expression (cyclins D and E). The genes encoding cyclins D and E are transcribed at middle and late G1 phase, respectively. Actually, there are several CDKs regulating progression along all cell-cycle phases and transitions, whose activities depend on cyclins that are transiently expressed following a rigid sequential order. This basic regulation of cell-cycle progression is highly conserved in eukaryotes, from yeast to mammals.
Accordingly, we organized our model into successive gene layers expressed sequentially in time. This wave of gene expression controls timing and progression through the cell-cycle process. The architecture of our cell-cycle control model is depicted in Figure 3, showing the forward and feedback regulatory signals between gene layers (s, T, v, w, x, y, and z) that determine the system's dynamic behavior. These gene layers represent consecutive stages taking place along the classical cell-cycle phases G1, S, G2, and M. These layers comprise the genes—state variables—expressed during the execution of each stage and are grouped into two main parts: (i) the G1 phase—layer s—which represents the cell growth phase immediately before the onset of DNA replication (i.e., the S phase), during which the cell responds to external regulatory stimuli (I), and (ii) the S, G2, and M phases—layers T, v, w, x, y, and z—which go from DNA replication to mitosis. The S phase trigger gene T represents an important cell-cycle checkpoint, interfacing G1 phase regulatory signals and the initiation of DNA replication. The signal F (Figure 3) stands for the integration, at the trigger gene T, of activator signals from layer s. Our basic assumption implies that the cell-cycle control system is comprised of modules of parallel sequential waves of gene expression (layers s to z) organized around a checkpoint (trigger gene T) that integrates forward and feedback signals. For example, within a module, the trigger gene T balances forward and feedback signals to avoid initiation of a new wave of gene expression while a first one is still going through the cell cycle. A number of checkpoint modules, across the cell cycle, regulate cell growth and genome replication during the sequential G1, S, and G2 phases and cell duplication via mitosis. In our model, the expression of one of the genes in layers v to z (i.e., after the trigger gene T—see Figure 3) typically yields three types of signals in the system: (i) a forward activator signal to genes in the next layer, which tends to make the cell cycle progress in its sequence; (ii) an inhibitory feedback signal to the genes in the previous layer, aiming to stop the propagation of a new forward signal for some time; and (iii) an inhibitory feedback signal to the trigger gene T, which tends to avoid the triggering of a new wave of gene expression while the current cycle is unfinished. The negative feedback signals perform an important regulatory action, tending to ensure that a new forward signal wave is not initiated nor propagated through the system while the previous one is still going on. This imposes on the model essential robustness features of the biological cell cycle; for example, a cycle must be completed before initiating another cycle of cell duplication and division. Parallel signaling also provides robustness, acting as a backup mechanism in case of parts malfunction.

[Figure 3: Cell-cycle network architecture. External stimuli (I) drive layer s (G1 phase, genes s1, ..., s5); F denotes the integration of signals from layer s at the trigger gene T; the gene layers T, v, w, x, y, and z implement the S, G2, and M phases, with forward signals between consecutive layers, feedback signals to the trigger gene T, and feedback signals to the previous layer; time runs through the successive layers.]

Table 2: PGN weight values and transition function thresholds.

Weights                                                    Thresholds
a^k_FT = 6, k = 5, 6, ..., 9;  a^1_jT = −2, j = v, w, x, y, z    th(1)_T = 9,  th(2)_T = 12
a^k_Tv = 4, k = 5, 6, ..., 9;  a^k_wv = −2, k = 1, 2             th(1)_v = 11, th(2)_v = 22
a^k_vw = 6, k = 5, 6, ..., 9;  a^k_xw = −1, k = 1, 2             th(1)_w = 20, th(2)_w = 35
a^k_wx = 5, k = 5, 6, ..., 9;  a^k_yx = −1, k = 1, 2             th(1)_x = 20, th(2)_x = 28
a^5_xy = 2                                                        th(1)_y = 6,  th(2)_y = 12
a^5_yz = 2                                                        th(1)_z = 4,  th(2)_z = 8

4.1. Complete PGN specification

This PGN is specified in the same way as the one in Section 3.1.1, changing the driving function to the following:

di(t − 1) = Σ_{j=1}^{N} Σ_{k=1}^{m} a^k_ji xj(t − k), (10)

where m is the memory of the system and a^k_ji is the weight for variable xj at time t − k in the driving function of variable xi; and using the weight and threshold values shown in Table 2, where a^k_ji is the weight for the expression values of genes in layer j at time t − k in the driving function at time t of genes in layer i. Weight values not shown in the table are zero. Thresholds are the same for all genes in the same layer.

4.2. Experimental results

We simulated our hypothetical cell-cycle control model as a PGN with probability P = .99, driven by different excitation signals F (integration of signals from layer s driving the trigger gene T): beginning with a single activation pulse (F = 2), then pulses of F of increasing frequency—that is, pulses arriving more and more frequently in each simulation—and, finally, a constant signal F = 2. As the initial condition for the simulations of our model, we chose all variables from layers T to z at zero value in the m—memory of the system—previous instants of time. This represents, in our model, the G1 stationary state, where the system remains after a previous cycle has ended and when there is no activator signal F strong enough to commit the cell to division. For simplicity, when plotting these simulations, we show only one representative gene for each gene layer. A single pulse of F (Figure 4(a)) makes the system go through all the cycle stages and then all signals remain at
zero level—the G1 stationary state—with a very small amount of noise.

[Figure 4: Simulation of our three-level PGN cell-cycle progression control model with 1% of noise (PGN with P = .99) when activator pulses of F arrive after the previous cycle has ended. (a) One single start pulse of F = 2 at t = −1. (b) F = period 50 oscillator.]

[Figure 5: Simulation of our three-level PGN cell-cycle progression control model with 1% of noise (PGN with P = .99) when activator pulses of F can arrive before the previous cycle has ended. (a) F = period 30 oscillator. (b) F = period 3 oscillator.]

Comparing this simulation with the one in Figure 2(b) (three-level PGN model of the yeast cell cycle under the same noise and activation conditions), we see that this system is almost unaffected by this amount of noise, during the cycle progression as well as in the stationary state. The small extra pulses that arise outside the signal trains in the simulations of our model are the observable effect of the presence of 1% of noise (they do not appear when the system is simulated without noise [19]—not shown here). Figure 4(b) shows that when new F activator pulses are applied after each cycle is finished, cycles start and are completed normally. For F pulses arriving more frequently, a new cycle is started only if the previous one has finished (Figure 5(a)). This control action is performed by the inhibitory negative feedback signals—from layers v to z—acting on the trigger gene T, carrying the information that a previous cycle is still unfinished.
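The layered update of Section 4.1 generalizes the Section 3.1.1 rule to memory m (Eq. (10)). A minimal sketch, using an illustrative two-gene, memory-2 coefficient tensor rather than the Table 2 values:

```python
import random

def pgn_step_m(history, A, th, P=0.99, rng=random.Random(3)):
    """One synchronous update of an N-gene PGN with memory m (Eq. (10)):
    d_i = sum over j and k of A[k][j][i] * x_j(t - k), thresholded into
    {0, 1, 2} and randomized with probability 1 - P. history[k-1][j]
    holds x_j(t - k); A[k-1][j][i] stands for a^k_ji (sketch layout)."""
    m, N = len(A), len(A[0])
    nxt = []
    for i in range(N):
        d = sum(A[k][j][i] * history[k][j] for k in range(m) for j in range(N))
        y = 2 if d >= th[i][1] else (1 if d >= th[i][0] else 0)
        if rng.random() >= P:  # noise: leave the most probable branch
            y = rng.choice([v for v in (0, 1, 2) if v != y])
        nxt.append(y)
    return nxt
```

Iterating this step while pushing the new state onto the front of `history` (and dropping the oldest entry) yields a trajectory like the ones plotted in Figures 4 and 5.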
We see in these simulations that no spurious signal waves are generated by noise, nor is the forward cell-cycle signal stopped by it (i.e., all normally initiated cycles finish). If a very frequent train of pulses triggers gene T before the end of the ongoing cycle, that signal is stopped at the following gene layers by the negative interlayer feedbacks. The regulation performed by these interlayer feedbacks provides another timing effect, assigning each stage—or layer—a given amount of time for the processes it controls, stopping the propagation of a new forward signal wave—coming from the previous layer—for some time. By means of two types of negative feedbacks (to the previous layer and to gene T), this system is able to resist the excessive activation signal, maintaining its natural period, and thus mimicking the biological cell cycle in nature. But, as in biological systems, robustness has its limits: in our model, a very frequent excitation (a short-period train of F pulses—Figure 5(b)—or a constant F = 2—not shown here) surpasses the resistance of the negative feedbacks, taking the system out of its normal behavior. For comparison purposes, we simulated both Li's model and ours with 1% of noise. In other simulations, not shown here, we gradually increased the noise in our model to see how much it can resist, and gradually decreased the noise in Li's model to determine the smallest amount that can lead to undesired dynamical behavior. In the first case, we observed that in our model a noise above 3% is needed for a noise pulse to propagate through the consecutive layers as a spurious signal train (5% of noise is needed to stop the normal signal wave, preventing it from finishing an ongoing cell cycle) [19]. On the other hand, when simulating Li's binary model, we observed spurious pulse propagation even at 0.05% noise [18].

5. CELL-CYCLE PROGRESSION CONTROL MODEL WITH RANDOM DELAYS

We modified our model to admit random delays in signal propagation, maintaining its overall behavior and robustness.

5.1. PGN specification

In this version, before computing the driving function of a variable, the model chooses a random delay td for its arguments, with the probability distribution of Table 3. Once these delays are chosen, the stochastic transition function defined in Section 4.1 calculates the temporal evolution of the system, with the weights and thresholds indicated in Table 4.

Table 3: Delay probabilities.

td       0    1    2
P(td)   .2   .6   .2

Table 4: PGN weight values and transition function thresholds in the model with random delays in the regulatory signals (k′ = k + td).

Weights                                                                        Thresholds
a^k′_FT = 6, k = 5, ..., 9;  a^k′_jT = −1.33 (k = 1) and −0.67 (k = 2), j = v, w, x, y, z    th(1)_T = 9,  th(2)_T = 12
a^k′_Tv = 5, k = 5, ..., 9;  a^k′_wv = −0.77, k = 1, ..., 9                    th(1)_v = 11, th(2)_v = 22
a^k′_vw = 7, k = 3, ..., 7;  a^k′_xw = −0.83, k = 1, ..., 9                    th(1)_w = 15, th(2)_w = 25
a^k′_wx = 6, k = 4, ..., 8;  a^k′_yx = −1.77, k = 1, ..., 9                    th(1)_x = 20, th(2)_x = 28
a^k′_xy = 3, k = 6                                                             th(1)_y = 6,  th(2)_y = 12
a^k′_yz = 3, k = 6                                                             th(1)_z = 4,  th(2)_z = 8

[Figure 6: Simulation of our three-level PGN cell-cycle progression control model with random delays and 1% of noise (PGN with P = .99) when activator pulses of F arrive after the previous cycle has ended. (a) One single start pulse of F = 2 at t = −1, −2. (b) F = period 60 oscillator.]

The transition function parameters, specifically the PGN weight values, depend on these variable delays. As shown in Table 4, these delays produce a time displacement of the weights, and so of the inputs to the driving function of each variable. This system is no longer time translation invariant, but adaptive. At each time step, it chooses a PGN
from a set of candidate PGNs (each one determined by one of the possible combinations of delays for its variables). In Table 4, akji denotes the weight of the expression values of the genes in layer j at time t − k′ (where k′ = k + td) in the driving function of the layer-i genes at time t. Weight values not shown in the table are zero. Thresholds are the same for all genes in the same layer, but td is not: it is chosen individually for each gene—by its associated component of the transition function—at each step of discrete time.

5.2. Experimental results

We simulated this new model—with random delays—under the same conditions as the previous one, obtaining a similar dynamical behavior. Due to the random delays applied to the signals at every time step, the waveform widths and the period of the cycle are somewhat variable and longer than they were in the previous model. Figure 6 shows the behavior of the system when it is driven by a single pulse of F = 2 or by a train of pulses whose period is greater than the cycle period. The system behaves normally, with a small amount of noise, much weaker than the regulatory signals. When F pulses arrive more frequently and the period of the activator signal is shorter than the period of the cycle (Figure 7(a)), a new cycle is not started if the activator pulse arrives when the previous cycle has not been completed. Finally, when the activation F becomes very frequent or constant (Figure 7(b)), the negative feedbacks can no longer exert their regulatory action and the system undergoes dysregulation. These simulations show the degree of robustness of our model system under noise and random delays, when driven by a wide variety of activator signals [20].

Figure 7: Simulation of our three-level PGN cell-cycle progression control model with random delays and 1% noise (PGN with P = .99), when activator pulses of F arrive before the previous cycle has ended, and with constant activation F = 2. (a) F = period-20 oscillator. (b) Constant F = 2.

Figure 8: PGN cell-cycle progression control model with positive feedback from gene z to the trigger gene T (akzT = 7, k = 5 + td), 1% noise, and only one initial activator pulse F = 2 at t = −1, −2. (a) Due to the positive feedback from z to T, a new cycle is started right after the previous one has finished, without the need of a new F activator signal. This behavior is typical of the embryonic cell cycle, which depends on positive feedback loops to maintain undamped oscillations with the correct timing. (b) The second cycle in this figure is somewhat weakened (by the effect of noise and random delays), but the positive feedback manages to overcome this (without the need of F activation) and the system recovers its normal cyclical activity.

6. CELL-CYCLE PROGRESSION CONTROL MODEL WITH RANDOM DELAYS AND POSITIVE FEEDBACK

Our model can exhibit a pacemaker activity, initiating one cell-division cycle after the previous one has finished without the requirement of external stimuli, if we include positive feedback in it.
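Both the random-delay model of Section 5 and this positive-feedback variant rely on the same delayed transition step. The following is a minimal, hypothetical sketch of that step, assuming a weight-table layout and function names of our own choosing (not the paper's notation): each gene's next ternary value (0, 1, or 2) is obtained by sampling a delay td with the Table 3 distribution, summing the time-displaced weighted inputs, comparing the sum against the gene's two thresholds, and keeping that prediction with probability P.

```python
import random

DELAY_VALUES = (0, 1, 2)        # Table 3: possible delays td
DELAY_PROBS = (0.2, 0.6, 0.2)   # Table 3: P(td)

def sample_delay(rng):
    """Draw one delay td for one gene at the current time step."""
    return rng.choices(DELAY_VALUES, weights=DELAY_PROBS, k=1)[0]

def next_state(history, weights, th1, th2, gene, p=0.99, rng=random):
    """Compute one gene's next ternary value.

    history[-k] is the state vector k steps in the past; weights[gene]
    maps (base_lag k, source_gene j) -> the Table 4 weight a_kji.
    """
    td = sample_delay(rng)                      # one td per gene per step
    s = 0.0
    for (k, src), a in weights[gene].items():
        lag = k + td                            # displaced lag k' = k + td
        if lag <= len(history):
            s += a * history[-lag][src]
    # two thresholds map the weighted sum to a ternary expression value
    value = 0 if s < th1 else (1 if s < th2 else 2)
    # almost deterministic stochastic rule: keep prediction with prob. p
    return value if rng.random() < p else rng.choice((0, 1, 2))
```

With a constant history the delay displacement is invisible, which makes the thresholding easy to check: a single input held at 2 with weight 6 at base lag 1 gives s = 12, which the Table 4 T-layer thresholds (th(1)T = 9, th(2)T = 12) map to the value 2.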
This oscillatory behavior is observed in nature during the proliferation of embryonic cells [4]. For our model to present this oscillatory behavior, it suffices to include a positive feedback signal from gene z—the last layer—to the trigger gene T. The system is exactly the same as the previous random-delay PGN model, except for an additional weight different from zero: akzT = 7 (where k = 5 + td).

6.1. Experimental results

In the simulation of Figure 8, the system is initially driven by a single pulse of F = 2 at t = −1, −2. As in the embryonic cell cycle, the positive feedback loop induces a pacemaker activity where all cycles are completed normally, with the correct timing for all the different phases. A new cycle starts right after the completion of the previous one, without the need of any activator signal F. Figure 8(b) shows that when a signal wave is weakened by the combined effect of noise and random delays, the positive feedback (without the need of any F activation) is sufficient to overcome this signal failure, putting the system back into normal-amplitude cyclical activity. These simulations show the flexibility of our PGN model in representing different types of dynamical behavior, including the embryonic cell cycle, which is induced by positive feedback loops.

7. DISCUSSION

We designed a hypothetical PGN model for the control of cell-cycle progression, inspired by a qualitative description of well-known biological phenomena: the cell cycle is a sequence of events triggered by a control signal that propagates as a wave; there are signal-integrating subsystems and (positive and negative) feedback loops; and parallel replicated structures make cell-cycle control fault tolerant. Furthermore, important real-world nonbiological control systems are usually designed to be stable, robust, and fault tolerant, and to admit small probabilistic parameter fluctuations.
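The positive-feedback variant described in Section 6 differs from the random-delay model by a single nonzero weight, akzT = 7 at base lag k = 5 (the effective lag being 5 + td). A minimal sketch of that one-entry change, assuming a hypothetical weight-table layout {gene: {(base_lag, source): value}} and illustrative gene indices that are not taken from the paper:

```python
# Illustrative gene indices for the six model genes (our convention).
T, v, w, x, y, z = range(6)

def add_positive_feedback(weights, value=7.0, base_lag=5):
    """Return a copy of the weight table with the z -> T feedback added.

    The copy leaves the original random-delay model's table untouched,
    so both variants can be simulated side by side.
    """
    patched = {g: dict(entries) for g, entries in weights.items()}
    patched.setdefault(T, {})[(base_lag, z)] = value
    return patched
```

Everything else—delay distribution, thresholds, and the stochastic transition rule—stays identical, which is what makes the pacemaker behavior attributable to this single feedback loop.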
Our model’s parameters were adjusted guided by the expected behavior of the system and by exhaustive simulation. This modeling effort had no intention of representing details of molecular mechanisms such as the kinetics and thermodynamics of protein interactions, the functioning of the transcription machinery, or microRNA and transcription factor regulation, but rather their concerted effects on the control of gene expression [13]. Our cell-cycle progression control model was able to represent several behavioral properties of the real biological system: (i) sequential waves of gene expression; (ii) stability in the presence of variable excitation; (iii) robustness under noisy parameters, via (iii-i) prediction by an almost deterministic stochastic rule and (iii-ii) stochastic choice of an almost deterministic stochastic prediction rule (random delays); and (iv) self-stimulation by means of positive feedback. The presence of numerous negative feedback loops in the model provides stability and robustness. They ensure that, under multiple noisy perturbation patterns, the system is able to automatically correct external stimuli that could destroy the cell. This kind of mechanism is commonly found in nature. In particular, we think that the robustness of Li’s yeast cell-cycle model [3] would be improved by the addition of critical negative feedback loops, which we suspect exist in the biological system. The inclusion of positive feedback makes our model capable of exhibiting a pacemaker activity, like the one observed in embryonic cells. The parallel structure of the system architecture represents biological redundancy, which increases the system’s fault tolerance. Our discrete stochastic model qualitatively reproduces the behavior of both the Li et al. [3] and Pomerening et al. [4] models, exhibiting remarkable robustness under noise and random parameter variation.
The natural follow-up of this research is to infer the PGN model from available dynamical data of cell-cycle progression, analogously to what we have done for the regulatory system of the malaria parasite [5, 6]. We anticipate that, very likely, analysis of these dynamical data will uncover unknown negative feedback loops in cell-cycle control mechanisms.

ACKNOWLEDGMENTS

This work was partially supported by Grants 99/073900, 01/14115-7, 03/02717-8, and 05/00587-5 from FAPESP, Brazil, and by Grant 1 D43 TW07015-01 from the National Institutes of Health, USA.

REFERENCES

[1] A. Murray and T. Hunt, The Cell Cycle, Oxford University Press, New York, NY, USA, 1993.
[2] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter, Molecular Biology of the Cell, Garland Science, New York, NY, USA, 4th edition, 2002.
[3] F. Li, T. Long, Y. Lu, Q. Ouyang, and C. Tang, “The yeast cell-cycle network is robustly designed,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 14, pp. 4781–4786, 2004.
[4] J. R. Pomerening, S. Y. Kim, and J. E. Ferrell Jr., “Systems-level dissection of the cell-cycle oscillator: bypassing positive feedback produces damped oscillations,” Cell, vol. 122, no. 4, pp. 565–578, 2005.
[5] J. Barrera, R. M. Cesar Jr., D. C. Martins Jr., et al., “A new annotation tool for malaria based on inference of probabilistic genetic networks,” in Proceedings of the 5th International Conference for the Critical Assessment of Microarray Data Analysis (CAMDA ’04), pp. 36–40, Durham, NC, USA, November 2004.
[6] J. Barrera, R. M. Cesar Jr., D. C. Martins Jr., et al., “Constructing probabilistic genetic networks of Plasmodium falciparum from dynamical expression signals of the intraerythrocytic development cycle,” in Methods of Microarray Data Analysis V, chapter 2, Springer, New York, NY, USA, 2007.
[7] N. W. Trepode, H. A. Armelin, M. Bittner, J. Barrera, M. D. Gubitoso, and R. F.
Hashimoto, “Modeling cell-cycle regulation by discrete dynamical systems,” in Proceedings of IEEE Workshop on Genomic Signal Processing and Statistics (GENSIPS ’05), Newport, RI, USA, May 2005.
[8] S. A. Kauffman, The Origins of Order, Oxford University Press, New York, NY, USA, 1993.
[9] N. Friedman, M. Linial, I. Nachman, and D. Pe’er, “Using Bayesian networks to analyze expression data,” Journal of Computational Biology, vol. 7, no. 3-4, pp. 601–620, 2000.
[10] H. de Jong, “Modeling and simulation of genetic regulatory systems: a literature review,” Journal of Computational Biology, vol. 9, no. 1, pp. 67–103, 2002.
[11] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, “Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks,” Bioinformatics, vol. 18, no. 2, pp. 261–274, 2002.
[12] J. Goutsias and S. Kim, “A nonlinear discrete dynamical model for transcriptional regulation: construction and properties,” Biophysical Journal, vol. 86, no. 4, pp. 1922–1945, 2004.
[13] S. Bornholdt, “Less is more in modeling large genetic networks,” Science, vol. 310, no. 5747, pp. 449–451, 2005.
[14] P. T. Spellman, G. Sherlock, M. Q. Zhang, et al., “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular Biology of the Cell, vol. 9, no. 12, pp. 3273–3297, 1998.
[15] J. J. Tyson, K. Chen, and B. Novak, “Network dynamics and cell physiology,” Nature Reviews Molecular Cell Biology, vol. 2, no. 12, pp. 908–916, 2001.
[16] K. C. Chen, L. Calzone, A. Csikasz-Nagy, F. R. Cross, B. Novak, and J. J. Tyson, “Integrative analysis of cell cycle control in budding yeast,” Molecular Biology of the Cell, vol. 15, no. 8, pp. 3841–3862, 2004.
[17] H. A. Armelin, J. Barrera, E. R. Dougherty, et al., “Simulator for gene expression networks,” in Microarrays: Optical Technologies and Informatics, vol. 4266 of Proceedings of SPIE, pp.
248–259, San Jose, Calif, USA, January 2001.
[18] http://www.vision.ime.usp.br/~walter/pgn cell cycle/ycc info.pdf.
[19] http://www.vision.ime.usp.br/~walter/pgn cell cycle/pgn ccm add info.pdf.
[20] http://www.vision.ime.usp.br/~walter/pgn cell cycle/pgn ccmrd add info.pdf.