FAST ABSTRACT: Replication vs. Failure Prevention — How to Boost Service Availability?

Felix Salfner
International Computer Science Institute, Berkeley
[email protected]

Katinka Wolter
Humboldt-Universität zu Berlin
[email protected]

Abstract—The objective of this paper is to provide a first analysis of the effectiveness of simple server replication vs. failure prevention in non-high-availability applications. We analyze service availability for a system with N servers where each server is modeled as a finite queue subject to failures. A Petri net analysis suggests that service availability is most effectively improved by server duplication, but that for further improvement the combination with failure prevention seems most effective.

I. INTRODUCTION

Excluding the area of high-availability computing, two trends can currently be observed in achieving better service availability. The first approach is to rely on simple replication of commodity hardware, while the second approach makes use of algorithms in order to anticipate and avoid upcoming failures. This work provides a first analysis of which approach is more effective with respect to service availability.

II. THE MODEL

We analyze a service-providing system consisting of a dispatcher / load balancer and N independent servers (see Figure 1). Each server is modeled as a finite queue that processes one job/request at a time and that is subject to failures. We use Petri nets as the modeling technique (see Figure 2).

Fig. 1. Modeled system: each job is assigned once to one of N independent servers.

We assume that jobs arrive following an exponential distribution characterized by the parameter arrival time (the mean time between arrivals). Each server has a finite queue of capacity K. Once a job arrives, the dispatcher assigns it to a queue that has available capacity. If all queues are full, the job is lost. Once a job is assigned to a server, it cannot be reassigned. On average, each server is assigned the same number of jobs such that the utilization of each of the servers is identical. The jobs in a queue are processed sequentially, and the time needed to complete a job is exponentially distributed (parameter service time). We assume that the server is either up and running or down. If the server fails, it loses all jobs currently in its queue. While it is down, no arriving jobs can be enqueued at this server until the server is repaired. Repair times are known to be non-exponentially distributed. The same holds for the distribution of time-to-failure (TTF), as is discussed in Section IV. Therefore, "fail" and "repair" are general transitions. We assume repair times to be uniformly distributed between a lower and an upper limit.

Fig. 2. Petri net model for a single server with finite queue of capacity K that is subject to failures.

From these assumptions it is clear that our modeling does not address high-end dependable computing systems with reliable messaging and queueing. We rather focus on scenarios where a single point of failure causes the entire server to crash and to lose all jobs currently in the queue. Such scenarios typically occur in low-budget scientific computing or possibly cloud computing.

III. COMPUTING SERVICE AVAILABILITY

Several definitions for service availability have been proposed in the literature. We focus on a definition that is applicable to atomic jobs/requests, i.e., there are many distinct jobs and each job is either considered to be successfully completed or lost. Then service availability is given by:

$$A_s = \frac{E[\text{no. of completed jobs}]}{E[\text{total no. of jobs}]} \tag{1}$$

$$A_s = \frac{\text{completion rate}}{\text{completion rate} + \text{loss rate}} \tag{2}$$

where E[·] denotes the expected value. The task is hence to compute completion rate and loss rate from the queueing model. If all servers worked on jobs without any interruption, the completion rate would be N/service time. However, each server only completes jobs if jobs are in the queue, hence we have:

$$r_c = N \, \frac{P(\text{serve})}{\text{service time}} \tag{3}$$

There are two reasons why jobs can be lost in the modeled system: (a) an arriving job cannot be enqueued when some (or all) queues are full and the rest of the servers are down, (b) when a server fails, all jobs currently in its queue are lost. The loss rate related to enqueueing failures is given by:

$$r_e = \frac{\sum_{i=0}^{N} \binom{N}{i} \, P(\text{down})^{i} \, P(\text{full})^{N-i}}{\text{arrival time}} \tag{4}$$

The loss rate due to server failures is given by:

$$r_f = N \, \frac{E[\text{jobs in queue}]}{E[TTF]} \tag{5}$$

where E[jobs in queue] denotes the expected number of jobs in the queue of a single server, and E[TTF] denotes its expected time-to-failure. Defining utilization ρ = service time/(N · arrival time) and substituting (3) to (5) into (2), with completion rate r_c and loss rate r_e + r_f, yields:

$$A_s = \frac{P(\text{serve})}{P(\text{serve}) + \rho \sum_{i=0}^{N} \binom{N}{i} \, P(\text{down})^{i} \, P(\text{full})^{N-i} + \frac{\text{service time}}{E[TTF]} \, E[\text{jobs in queue}]} \tag{6}$$

P(serve), P(down), P(full), and E[jobs in queue] can be determined from the Petri net (see Table I).

TABLE I. Mapping of measures occurring in (6) to the Petri net shown in Figure 2. # denotes the number of tokens in a place.

Measure             Equivalent Petri net measure
P(serve)            P(#in queue ≥ 1)
P(full)             P(#queue places = 0)
P(down)             P(#down > 0)
E[jobs in queue]    E[#queue places]

IV. INCORPORATING FAILURE PREVENTION

Failure prevention affects time-to-failure. We assume that each of the N servers has a separate failure prevention mechanism. We summarize the effects of all sophisticated failure prediction and prevention methods by one real-valued number p, which is the overall probability that an upcoming failure can be prevented. Let f(t) denote the original distribution of TTF (without prevention). The time-to-failure with prevention, written f_p(t) here, is then given by a superposition of convolved f(t) weighted by a geometric distribution:

$$f_p(t) = \sum_{n=1}^{\infty} p^{\,n-1} (1-p) \, f(t)^{n*} \tag{7}$$

where f(t)^{n*} denotes the (n − 1)-th convolution of f(t) with itself. f_p(t) defines the distribution of the "fail" transition in the model. In order to compute service availability by (6), we need to determine E[TTF]. The expected value of f_p(t) is given by:

$$E[TTF] = \frac{E[f(t)]}{1-p} \tag{8}$$

since the n-th term of (7) has mean n · E[f(t)] and the expected value of the geometric part of (7) is 1/(1 − p).

V. EVALUATION

We analyzed the Petri net using TimeNET [1] and performed experiments for various combinations of N and p. On average, jobs arrive every ten hours and take eight hours to complete, failures occur every 180 days, and each repair takes between half an hour and three hours (hence the steady-state system availability of a single server is 0.999595). Queue capacity K has been set to 20. Figure 3 shows the resulting logarithmic service unavailability log10(1 − A_s).

Fig. 3. Logarithmic service unavailability for the modeled system for N equal to 1, 2, 3, and 5 servers, as well as failure prevention probability p equal to 0 (no prevention), 0.1, 0.3, 0.5, 0.7, and 0.9. (Axes: number of servers, prevention probability, logarithmic service unavailability.)

VI. CONCLUSIONS AND DISCUSSION

We have investigated the effect of server replication and failure prevention on service availability for a system with N independent queueing servers subject to failures, as it typically appears in low-budget scientific computing or cloud computing scenarios. The presented preliminary analysis suggests that if only one of the two approaches can be applied, replication seems to boost service availability more than failure prevention, owing to the significant reduction of utilization. However, the effect of replication fades as more servers are added, so that further improvement of service availability can best be achieved by failure prevention.

There undoubtedly is room for discussion. For example, it needs to be clarified how well the modeled scenario fits real applications, what other metrics of service availability could be used, or how the model can be extended to incorporate degraded service delivery, just to name a few.

REFERENCES

[1] A. Zimmermann, J. Freiheit, R. German, and G. Hommel, Petri net modelling and performability evaluation with TimeNET 3.0, ser. LNCS. Springer, 2000, vol. 1786, pp. 188–202.
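
The single-server availability quoted in Section V and the relations in Eqs. (3) to (8) can be checked numerically. The following Python sketch is illustrative only and is not the authors' tooling. All function names are hypothetical, and the values passed for P(serve), P(full), P(down), and E[jobs in queue] are made-up placeholders rather than TimeNET results. Only the arrival time, service time, failure interval, and repair bounds are taken from the text. The sketch also holds the Petri net measures fixed while E[TTF] changes, whereas in the model they would have to be re-evaluated for each p.

```python
from math import comb


def single_server_availability(mttf_h: float, repair_lo_h: float, repair_hi_h: float) -> float:
    """Steady-state availability MTTF / (MTTF + MTTR), repair time uniform on [lo, hi]."""
    mttr_h = (repair_lo_h + repair_hi_h) / 2.0  # mean of a uniform distribution
    return mttf_h / (mttf_h + mttr_h)


def expected_ttf(mean_ttf_h: float, p: float) -> float:
    """Eq. (8): expected TTF when an upcoming failure is prevented with probability p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return mean_ttf_h / (1.0 - p)


def service_availability(n: int, arrival_h: float, service_h: float, ettf_h: float,
                         p_serve: float, p_full: float, p_down: float,
                         mean_jobs: float) -> float:
    """Eq. (2) assembled from the rates in Eqs. (3) to (5); equivalent to Eq. (6)."""
    r_c = n * p_serve / service_h                      # Eq. (3): completion rate
    blocked = sum(comb(n, i) * p_down**i * p_full**(n - i) for i in range(n + 1))
    r_e = blocked / arrival_h                          # Eq. (4): enqueueing losses
    r_f = n * mean_jobs / ettf_h                       # Eq. (5): losses due to failures
    return r_c / (r_c + r_e + r_f)


# Parameters from Section V: failures every 180 days, repair between 0.5 h and 3 h.
MTTF_H = 180 * 24
print(f"{single_server_availability(MTTF_H, 0.5, 3.0):.6f}")  # 0.999595

# Placeholder measures (assumptions, NOT taken from the paper or from TimeNET).
for p in (0.0, 0.5, 0.9):
    a_s = service_availability(n=2, arrival_h=10.0, service_h=8.0,
                               ettf_h=expected_ttf(MTTF_H, p),
                               p_serve=0.4, p_full=0.01, p_down=0.0004, mean_jobs=0.6)
    print(p, a_s)
```

The first printed value reproduces the availability of 0.999595 stated in Section V. The loop only illustrates the direction of the effect of p through E[TTF]; it makes no claim about the magnitudes shown in Figure 3.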