FAST ABSTRACT: Replication vs. Failure
Prevention — How to Boost Service Availability?
Felix Salfner
International Computer Science Institute, Berkeley
[email protected]
Katinka Wolter
Humboldt-Universität zu Berlin
[email protected]
Abstract—The objective of this paper is to provide a first
analysis of the effectiveness of simple server replication vs. failure
prevention in non-high-availability applications. We analyze service availability for a system with N servers where each server is
modeled as a finite queue subject to failures. A Petri net analysis
suggests that service availability is most effectively improved by
server duplication, but for further improvement the combination
with failure prevention seems most effective.
I. INTRODUCTION
Excluding the area of high availability computing, two
trends can currently be observed in achieving better service
availability. The first approach is to rely on simple replication
of commodity hardware, while the second approach makes
use of algorithms in order to anticipate and avoid upcoming
failures. This work provides a first analysis of which approach is
more effective with respect to service availability.
II. THE MODEL
We analyze a service-providing system consisting of a
dispatcher / load balancer and N independent servers (see Figure 1). Each server is modeled as a finite queue that processes
one job/request at a time and that is subject to failures. We
use Petri nets as the modeling technique (see Figure 2).
Fig. 1. Modeled system: each job is assigned once to one of N independent
servers.
We assume that job interarrival times are exponentially distributed,
characterized by the parameter arrival time. Each server
has a finite queue of capacity K. Once a job arrives, the
dispatcher assigns it to a queue that has available capacity.
If all queues are full, the job is lost. Once a job is assigned
to a server, it cannot be reassigned. On average, each server
is assigned the same number of jobs, so that the utilization of
all servers is identical. The jobs in a queue are
processed sequentially, and the time needed to complete a job
is exponentially distributed (parameter service time).
We assume that a server is either up and running or down.
If a server fails, it loses all jobs currently in its
queue. While it is down, no arriving jobs can be enqueued
at this server until the server is repaired. Repair times are
known to be non-exponentially distributed. The same holds
for the distribution of time-to-failure (TTF), as is discussed
in Section IV. Therefore, “fail” and “repair” are general
transitions. We assume repair times to be uniformly distributed
between a lower and an upper limit.
Fig. 2. Petri net model for a single server with finite queue of capacity K
that is subject to failures.
From these assumptions it is clear that our modeling does
not address high-end dependable computing systems with
reliable messaging and queueing. We rather focus on scenarios
where a single point of failure causes the entire server to crash
and to lose all jobs currently in the queue. Such scenarios
typically occur in low-budget scientific computing or possibly
cloud computing.
III. COMPUTING SERVICE AVAILABILITY
Several definitions for service availability have been proposed in the literature. We focus on a definition that is
applicable to atomic jobs/requests, i.e., there are many distinct
jobs and each job is either considered to be successfully
completed or lost. Then service availability is given by:
\begin{align}
A_s &= \frac{E[\text{no. of completed jobs}]}{E[\text{total no. of jobs}]} \tag{1} \\
    &= \frac{\text{completion rate}}{\text{completion rate} + \text{loss rate}} \tag{2}
\end{align}
where E [·] denotes the expected value. The task is hence
to compute completion rate and loss rate from the queueing
model.
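To make definition (1) concrete, the following is a rough Monte Carlo sketch that counts completed and lost jobs for a single server (N = 1) under the assumptions of Section II. It is not the Petri net analysis used in this paper. The function name `simulate_server` and its arguments are hypothetical, and the exponentially distributed time-to-failure is a simplifying assumption of the sketch only. The example call uses the parameter values reported in Section V.

```python
import random


def simulate_server(arrival_time, service_time, mean_ttf, repair_lo, repair_hi,
                    capacity, horizon, seed=0):
    """Estimate Eq. (1) for ONE server by next-event simulation (times in hours).

    Assumption of this sketch: time-to-failure is exponential with mean mean_ttf.
    Jobs still queued when the horizon is reached are counted as neither
    completed nor lost.
    """
    rng = random.Random(seed)
    inf = float("inf")
    queued, up = 0, True
    completed = lost = 0
    next_arrival = rng.expovariate(1.0 / arrival_time)
    next_done = inf                       # no job in service yet
    next_fail = rng.expovariate(1.0 / mean_ttf)
    next_repair = inf                     # server starts in the "up" state

    while True:
        t = min(next_arrival, next_done, next_fail, next_repair)
        if t >= horizon:
            break
        if t == next_arrival:
            if up and queued < capacity:
                queued += 1
                if queued == 1:           # server was idle: start service
                    next_done = t + rng.expovariate(1.0 / service_time)
            else:
                lost += 1                 # queue full or server down
            next_arrival = t + rng.expovariate(1.0 / arrival_time)
        elif t == next_done:
            queued -= 1
            completed += 1
            next_done = (t + rng.expovariate(1.0 / service_time)
                         if queued else inf)
        elif t == next_fail:
            lost += queued                # a crash drops every queued job
            queued, up = 0, False
            next_done = next_fail = inf
            next_repair = t + rng.uniform(repair_lo, repair_hi)
        else:                             # repair finished
            up = True
            next_repair = inf
            next_fail = t + rng.expovariate(1.0 / mean_ttf)

    total = completed + lost
    return completed / total if total else float("nan")


# Section V values: arrivals every 10 h, 8 h service, failures every 180 days,
# repairs between 0.5 h and 3 h, queue capacity K = 20.
estimate = simulate_server(arrival_time=10, service_time=8, mean_ttf=180 * 24,
                           repair_lo=0.5, repair_hi=3, capacity=20,
                           horizon=200_000)
print(f"estimated service availability: {estimate:.4f}")
```

The estimate is the ratio of completed jobs to all jobs that were either completed or lost, so it approximates (1) directly rather than going through the rates in (2). A fixed seed makes the run reproducible; longer horizons reduce the sampling noise.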
If all servers worked on jobs without any interruption,
the completion rate would be N/service time. However, each server
only completes jobs if jobs are in its queue, hence the completion rate r_c is:
\[ r_c = N \, \frac{P(\text{serve})}{\text{service time}} \tag{3} \]
There are two reasons why jobs can be lost in the modeled
system: (a) an arriving job cannot be enqueued when some (or
all) queues are full and the rest of the servers are down, and (b)
when a server fails, all jobs currently in its queue are lost.
The loss rate related to enqueueing failures is given by:
\[ r_e = \frac{1}{\text{arrival time}} \sum_{i=0}^{N} \binom{N}{i} P(\text{down})^{i} \, P(\text{full})^{N-i} \tag{4} \]
The loss rate due to server failures is given by:
\[ r_f = N \, \frac{E[\text{jobs in queue}]}{E[TTF]} \tag{5} \]
where E[jobs in queue] denotes the expected number of jobs
in the queue of a single server, and E[TTF] denotes its
expected time-to-failure.
Defining utilization ρ = service time/(N · arrival time),
substituting (3) to (5) into (2) with loss rate = r_e + r_f, and dividing
numerator and denominator by N/service time yields:
\[ A_s = \frac{P(\text{serve})}{P(\text{serve}) + \rho \sum_{i=0}^{N} \binom{N}{i} P(\text{down})^{i} \, P(\text{full})^{N-i} + \frac{\text{service time}}{E[TTF]} \, E[\text{jobs in queue}]} \tag{6} \]
P(serve), P(down), P(full), and E[jobs in queue] can be determined from the
Petri net (see Table I).
TABLE I
MAPPING OF MEASURES OCCURRING IN (6) TO THE PETRI NET SHOWN IN
FIGURE 2. A PLACE NAME DENOTES THE NUMBER OF TOKENS IN THAT PLACE.
Measure              Equivalent Petri Net Measure
P(serve)             P(in queue ≥ 1)
P(full)              P(queue places = 0)
P(down)              P(down > 0)
E[jobs in queue]     E[queue places]
IV. INCORPORATING FAILURE PREVENTION
Failure prevention affects time-to-failure. We assume that
each of the N servers has a separate failure prevention mechanism. We summarize the effects of all sophisticated failure
prediction and prevention methods by one real-valued number
p, which is the overall probability that an upcoming failure
can be prevented. Let f(t) denote the original distribution of
TTF (without prevention). Time-to-failure with prevention,
written f_p(t) below, is then given by a superposition of convolved f(t) weighted by a
geometric distribution:
\[ f_p(t) = \sum_{n=1}^{\infty} p^{\,n-1} (1 - p) \, f(t)^{n*} \tag{7} \]
where f(t)^{n*} denotes the (n − 1)-fold convolution of f(t) with
itself, so that f(t)^{1*} = f(t). f_p(t) defines the distribution of the “fail” transition in
the model.
In order to compute service availability by (6), we need to
determine E[TTF]. The expected value of f_p(t) is given by:
\[ E[TTF] = \frac{E[f(t)]}{1 - p} \tag{8} \]
since the expected value of the geometric part of (7) is 1/(1 − p).
V. EVALUATION
We analyzed the Petri net using TimeNET [1] and performed experiments for various combinations of N and p. On
average, jobs arrive every ten hours and take eight hours to
complete, failures occur every 180 days, and each repair takes
between half an hour and three hours. With a mean repair time of
1.75 hours, the steady-state system availability of a single server is
hence 4320/(4320 + 1.75) ≈ 0.999595. Queue capacity K
has been set to 20. Figure 3 shows the resulting logarithmic
service unavailability log10(1 − A_s).
[Figure 3: surface plot; axes are number of servers, prevention probability, and logarithmic service unavailability.]
Fig. 3. Logarithmic service unavailability for the modeled system for N
equal to 1, 2, 3, and 5 servers, as well as failure prevention probability p equal
to 0 (no prevention), 0.1, 0.3, 0.5, 0.7, and 0.9.
VI. CONCLUSIONS AND DISCUSSION
We have investigated the effect of server replication and
failure prevention on service availability for a system with N
independent queueing servers subject to failures, as typically
found in low-budget scientific computing or cloud computing
scenarios.
The preliminary analysis suggests that, if only
one of the two approaches can be applied, replication seems to
boost service availability more than failure prevention because it
significantly reduces utilization. However, the effect
of replication fades as more servers are added, so that further
improvement of service availability is best achieved by
failure prevention.
There undoubtedly is room for discussion. For example, it
needs to be clarified how well the modeled scenario fits real
applications, what other metrics of service availability could
be used, or how the model can be extended to incorporate
degraded service delivery, just to name a few.
REFERENCES
[1] A. Zimmermann, J. Freiheit, R. German, and G. Hommel, Petri net
modelling and performability evaluation with TimeNET 3.0, ser. LNCS.
Springer, 2000, vol. 1786, pp. 188–202.