Business Continuity Planning
& Disaster Recovery
Solutions
Achieving Continuous Availability
Introduction
Today’s business continuity problem
Planning for disaster recovery and business continuity
Regulatory squeeze and ubiquitous deployment
The ideal enterprise continuity solution
Achieving continuous availability with Sybase
Conclusion
Appendix: Business Continuity Planning and Disaster Recovery in practice
First Published 2002
www.datamonitor.com
Datamonitor USA
1 Park Avenue
14th Floor
New York, NY 10016-5802
USA
Datamonitor Europe
Charles House
108-110 Finchley Road
London NW3 5JJ
United Kingdom
Datamonitor Germany
Messe Turm
Box 23
60308 Frankfurt
Deutschland
Datamonitor Asia Pacific
Room 2413-18, 24/F
Shui On Centre
6-8 Harbour Road
Hong Kong
t: +1 212 686 7400
f: +1 212 686 2626
e: [email protected]
t: +44 20 7675 7000
f: +44 20 7675 7500
e: [email protected]
t: +49 69 9754 4517
f: +49 69 9754 4900
e: [email protected]
t: +852 2520 1177
f: +852 2520 1165
e: [email protected]
ABOUT DATAMONITOR
Datamonitor plc is a premium business information company specializing in industry
analysis.
We help our clients, 5000 of the world’s leading companies, to address complex
strategic issues.
Through our proprietary databases and wealth of expertise, we provide clients with
unbiased expert analysis and in-depth forecasts for six industry sectors: Automotive,
Consumer Markets, Energy, Financial Services, Healthcare, Technology.
Datamonitor maintains its headquarters in London and has regional offices in New
York, Frankfurt and Hong Kong.
All Rights Reserved.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form by
any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission
of the publisher, Datamonitor plc.
The facts of this report are believed to be correct at the time of publication but cannot be guaranteed.
Please note that the findings, conclusions and recommendations that Datamonitor delivers will be based
on information gathered in good faith from both primary and secondary sources, whose accuracy we are
not always in a position to guarantee. As such Datamonitor can accept no liability whatever for actions
taken based on any information that may subsequently prove to be incorrect.
Business Continuity Planning and Disaster Recovery – Achieving Continuous Availability
Page 2
BCP & DR – Achieving Continuous
Availability
Introduction
As a solution, Disaster Recovery has been around for some time - the need to back
up IT assets in the event of incidents such as flooding, fire, power failure or other
physical and electronic attack is clear. Business continuity is an extension of this to
ensure processes and transactions can continue under all circumstances. Recent
events – notably those of September 11th 2001 – have made it clear that BCP and
DR (Business Continuity Planning and Disaster Recovery) are increasingly vital for all
large enterprises.
The solutions themselves have also moved on; it has become apparent that classic
back-up and restore, where tapes containing data assets are shipped to a
remote location, is no longer feasible. The complexity of the modern IT environment
exposes it to an increased internal technical risk of failure as well as to external
risks, and a ‘real-time’ IT world demands continuous data access.
There has been massive investment over the past few years in enterprise
applications, such as CRM, ERP and SCM, to name but a few. This investment has
greatly affected business continuity and disaster recovery, as many organisations are
integrating their business processes with those of their customers, suppliers and
business partners. Recovery times have shrunk to minutes and hours, and in some
cases moved to zero – this means 24x7x52 continuous business process availability.
Furthermore, scenario plans have broadened to take on the new risks of eBusiness,
for example, downtime due to:
• operational risk (such as the Microsoft.com three-day outage);
• security risk (denial-of-service attacks bringing down Yahoo);
• lack of capacity (the launch of the UK Government’s 1901 census site resulted in
1.2 million hits per hour, causing the site to crash and be withdrawn for several
months);
• application failure (such as the full day outage last year by the London Stock
Exchange);
• partner / outsourcer unavailability (such as ISP network failure or links from a
web site to those of partners that are unavailable);
• natural disasters (such as floods on the River Oder and the Rhine, earthquakes in
Bulgaria and Turkey);
• terrorist attacks (such as those common to the United Kingdom and Spain).
Any downtime risk must be a concern to every business, as any downtime today
results in a press event which could impact the image and reputation of the
enterprise.
Therefore, the aim of this paper is to increase the awareness and understanding of
the benefits surrounding Business Continuity and Disaster Recovery technologies,
by:
• discussing the increasing need for BCP and DR solutions by organisations;
• highlighting the business benefits of business continuity solutions;
• outlining the various technology solutions that can be used to achieve BCP and DR;
• explaining the advantages of using a 3-tiered, continuously available software-based solution.
Today’s business continuity problem
The issue surrounding business continuity planning (BCP) is essentially the same for
a traditional bricks and mortar company or an innovative ‘dot com’; business
processes and information systems must be continually available, 24x7x52. However,
what has changed over time is the speed at which recovery is required and the
increased willingness of competitors to capitalise on a company’s downtime. With
increasing reliance on electronic markets and the impact of natural disasters and
criminal activity on the company’s technology base, its supply chain and the customer
base, companies are becoming more and more concerned about business continuity
planning. The main drivers and factors that have created the need for comprehensive
business continuity solutions are:
• the global nature of modern business practices;
• corporate, internal and external supply chain interdependency;
• speed and timeliness;
• IT-dependent business processes;
• the increasing value of data;
• legislation / regulation.
Business continuity planning means formalising a company’s strategy for dealing with
the unexpected and unknown by planning, training and testing for the recovery of
critical business processes and IT systems in a timely fashion to minimise the impact
of any disruption on the business and the customer. Typical natural disaster
scenarios addressed by many businesses include fire, floods, tornadoes, hurricanes,
earthquakes, snow / ice and extreme heat.
However, many businesses are willing to accept the risk of the above natural
disasters due to their perceived unlikelihood and hence they feel business continuity
planning is not necessary. Many threats that have nothing to do with geographic
location or threat of natural disaster are often overlooked, such as:
• work-place violence;
• terrorism;
• workforce unavailability;
• computer / Internet based crime (including denial of service attacks);
• geographical restrictions caused by events, such as chemical contamination;
• power outage;
• computer viruses;
• telecommunications failure;
• asset malfunction (including hardware and software failures);
• human error.
Business continuity planning is not simply ‘ticking a box’ by purchasing a hot-site
contract or a business continuity planning software package. Similarly, having a
documented plan is not enough.
In the beginning
As illustrated in Figure 1, the modern concepts of disaster recovery and business
continuity were hatched around the beginning of the nineties mainly in the form of
legal requirements for banks and financial institutions to set up contingency plans in
the event of a disaster.
Figure 1: The evolution of disaster recovery and business continuity
[Chart: recovery timescale (days, 24 hours, hours, minutes, seconds) plotted against time (1990, 1995, 2000, 2000+), with stages labelled Disaster mitigation, Disaster recovery, Business Continuity and Application recovery & continuity.]
Source: Datamonitor
These solutions mainly mitigated the impact of a major disaster. Consequently, these
contingency arrangements were designed to enable the restoration of business and
data over a period of a few days, protecting the business from a permanent or long-term loss. As such the contingencies developed in this period did not allow for any
immediate continuity in the running of the business.
Over the following half-decade the concept of being able to recover the operations of
the business in the case of a disaster gained further ground with enterprises
establishing remote locations where workers could be relocated for the restoration of
a business in the case of a disaster. These facilities are to this day the ultimate
protection against major disasters, but are only available to enterprises
whose wealth and data value make a facility of this nature possible.
The dawn of the millennium
At the end of the last millennium, the spread of the Internet and the advent of
eCommerce saw the timeframes of business transactions shrink to levels never seen
before. For example the average patience of people surfing the web is less than 10
seconds until they give up and move to another site, unlikely to return to the site that
did not work. At the same time IT has been integrated into business processes,
leading to the increased importance of maintaining the availability of applications and
IT infrastructure at all times. These solutions still mainly involved back-up and restore
implementations, typically to tape, with the aim of minimising any data loss in case of
a systems failure. Furthermore, a backup and restore facility eased the process of
restoring the last known working configuration of a system, thereby enabling higher
levels of availability.
This process was perfected with more frequent back-up windows and redundant
application servers enabling the failover of applications to occur in a matter of
minutes. The fall in the price of back-up and restore solutions over this period to
levels that were affordable for most enterprises meant that the majority of companies
with an IT department could afford a simple tape library for the backup of critical data.
The birth of Business Continuity
In the scare relating to the Y2K bug, many enterprises performed a close audit of
their business involving contingency plans in the event of downtime or disaster. The
resulting solutions for the first time saw the widespread acceptance of the concept of
business continuity whereby the business operations were to be maintained at all
times in some form. Business continuity is meant to encompass the whole business,
not just the web-site or key applications.
Today the main issue in business continuity and disaster recovery is no longer merely
the attainment of higher rates of availability, as the economics of moving
from the highly expensive availability level of ‘four or five 9s’ of reliability (i.e. 99.99%
or 99.999% reliable) to even higher levels rarely make sense to even the most
downtime-sensitive organization. Instead, the main focus in business continuity and
disaster recovery should be on the level of deployment within the
enterprise. Furthermore, in today’s world of interdependence and collaboration it is
also important to think about partner companies, most notably suppliers, to ensure
that agreed levels of performance can be maintained. Most organizations at the
moment still only have quite rudimentary solutions in use for disaster recovery and
business continuity.
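To put the ‘four or five 9s’ mentioned above in concrete terms, the short Python sketch below converts an availability percentage into the downtime it permits per year. It assumes a 365-day year of round-the-clock operation; the function name is illustrative and does not come from the report.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year of round-the-clock operation


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted at a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


for level in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(level)
    print(f"{level}% available: about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, four 9s allows roughly 53 minutes of downtime a year and five 9s roughly 5 minutes, which suggests why each additional 9 becomes progressively harder to justify.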
Planning for disaster recovery and business continuity
Datamonitor research on issues relating to business continuity and disaster recovery
indicated that enterprises had a very poor idea of what these solutions entail and that
levels of deployment seem to be quite low. It is this low level of deployment that
needs to be addressed, as with so many IT issues, through education.
Despite the higher priority that business continuity and disaster recovery have been
able to gain in overall IT budgets, there is still a clear shortfall in the level of spending
in enterprises resulting in inadequate cover against disruptive events. Furthermore,
many BCP mistakes or assumptions are made by companies dependent on
traditional technology infrastructures, as well as those relying more heavily on the
Internet, and include:
• over-reliance – relying on a business continuity plan can lead to a false sense of
security and potential business failure if the plan is not updated regularly and fully
tested. Formal mechanisms should be in place to force a plan update on a regular
basis or when a significant systems or business process change occurs. A
comprehensive business continuity plan should include mechanisms to ensure periodic updates;
• segregated planning processes – companies often limit the scope of their efforts
to systems recovery or consider IT assets and business processes separately. BCP
requires consideration of both business process and systems recovery together,
given that technology often plays a critical role in the business processes. The plan
must address those processes that coincide with corporate strategy and objectives;
• lack of planning prioritisation – prioritising key business processes is a critical
step that often does not get the appropriate attention. Without prioritisation, a plan
may recover less-than-critical business processes rather than the ones crucial for
survival. Furthermore, vulnerability should also be taken into
account when prioritising which processes will be planned first. Because more
and more business processes are interdependent, the order in which processes
and systems are recovered is important;
• safety deficiency – at all stages of the BCP process, employee, business partner
and customer safety must be taken into account. The plan should include safety
controls to minimise casualties in the event of a disaster and ways to contain the
situation to minimise or eliminate risk;
• inadequate communications – communication issues are often overlooked.
Often, businesses lack formal communications plans to contact employees,
business partners and clients, in the event of problems. Strategies to address how
these groups obtain recovery status updates are usually inadequate;
• poor security – physical and information systems security controls are often
disregarded during plan development and implementation, resulting in greater risk
exposure during recovery operations. In order to recover equipment quickly and
without interference, as well as to process insurance claims in a timely manner,
physical security over a disaster site is an important consideration. Also,
logical security controls similar to those of the primary processing environment
should be in place for the back-up information systems;
• ineffective response to insurance requirements – many business continuity
plans fail to adequately plan to support the filing of insurance claims, resulting in
delayed or reduced settlements;
• poor recovery services evaluation – many companies poorly evaluate recovery
products (hot sites, cold sites, off site storage and planning software), relying on
vendor information. Furthermore, some companies eliminate the product or service
simply due to cost without understanding how it could significantly affect the timely
recovery of critical business processes. Lack of foresight may lead to a solution that
does not adequately address a company’s needs;
• regulatory / legislative compliance – adherence to legislative issues is also often
overlooked, with many companies being unaware of their legal requirement to plan
for the continuity of business processes. Companies must understand the local statutes and
industry regulations governing business resumption and disaster recovery.
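The prioritisation point in the list above, that interdependent processes and systems must be recovered in a sensible order, can be sketched as a dependency-ordering problem. The Python example below is illustrative only: the process names and their dependencies are hypothetical, and it uses the standard library's graphlib module (Python 3.9 or later) to derive an order in which each item is recovered after everything it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical processes and systems, each mapped to the items it depends on.
dependencies = {
    "order_processing": {"customer_database", "payment_gateway"},
    "payment_gateway": {"network"},
    "customer_database": {"storage", "network"},
    "storage": set(),
    "network": set(),
}


def recovery_order(deps):
    """Return a recovery order in which each item follows all of its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = recovery_order(dependencies)
print(order)
```

A real plan would also weigh business criticality and vulnerability, as the text notes; this sketch covers only the ordering constraint. A circular dependency raises graphlib.CycleError, which is itself a useful planning warning.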
Regulatory squeeze and ubiquitous deployment
Until now the regulation relating to business continuity and disaster recovery has
been quite limited, relating mostly to financial companies and financial records within
an enterprise. Very little legislation is actually in place that forces businesses in
normal circumstances to implement measures for business continuity.
However, driven by the political will created following the events of September 11,
regulatory bodies across the Western world will most likely step up their requirements
on business continuity features of businesses, especially publicly listed businesses.
Primarily, this would be to ensure the going concern of the business but also would
set down standards for business availability – in essence setting down a legally
enforceable SLA. In the meantime, two regulatory bodies are providing
guidance on the adoption of risk-based approaches to business operations: the
Basel Committee for the financial services sector and the Turnbull
Committee for companies of any industry. A discussion of both regulatory bodies is
given below.
Basel Committee
In 1988, the Basel Committee of the Bank for International Settlements (BIS)
classified the risk levels of different types of credit and the minimum amount of capital
that should be held as a cushion against the risks of each. The revised Accord (Basel
II), conceived in June 1999, attempts to tackle the shortfalls of its predecessor but,
more importantly, encourages banks to develop internal models rather than have
capital reserves imposed by regulators. Proposals focus on incentivizing banks to
develop more sophisticated approaches to credit and operational risk based on the
banks’ own internal ratings and measurement systems. To benefit from
capital incentives, banks must be able to demonstrate high levels of data quality,
accuracy and integrity in order to obtain approval for the internal ratings-based
approach that will give the institutions the ability to self-regulate risk management.
Basel II also recognises that risks other than credit and the market can be substantial,
and therefore also incorporates directives on operational risk, i.e. the risk of losses
from inadequate or failed internal processes, people, systems or external events.
These losses can be catastrophic, especially with the increased reliance on electronic
payment processing, electronic trading and the automation of other middle and back
office functions.
The increased complexity and granularity of Basel II over its predecessor forces
banks to revise existing credit and operational risk management systems and data
standards. Critical to this is the management of large volumes of high quality loss
data and detailed documentation to enable audit trails and third party verification.
Figure 2: The Basel II data challenge
Data management:
• Data collection – assimilation of data from across the enterprise at a business unit and geographical level
• Data standardisation – imposing a uniform discipline on the captured data
• Data cleansing – quality control
• Data consolidation – developing and maintaining a dynamic database
Source: Datamonitor
A significant factor in a Basel II project will be the process for collecting, standardizing
and consolidating data through the business units to model risk. Banks need to
establish a workable definition of risk and for this definition to be applied across all
geographies and business units. While presenting a significant technological and
business process challenge, possibly the single greatest factor will be cultural. Many
banks are not culturally accustomed to amassing large volumes of historical
operational loss data given that operational risk by definition has traditionally been
regarded as a risk category immune to standard linear risk management
methodologies.
The links between Basel II and Business Continuity are clear. Basel II integrates
business risk and the need to collect and store information, and Business Continuity as a
concept is tightly integrated into the requirements for Basel II as a method of
mitigating risk and ensuring data integrity against low-frequency, high-consequence
and high-frequency, low-risk events.
Turnbull Committee Report
The Turnbull Report is the popular name given to the ICAEW's Guidance on Internal
Control which was produced in clarification of The Combined Code on Corporate
Governance. The Turnbull guidance is linked via the Combined Code on Corporate
Governance to the Listing Rule disclosure requirements of the London Stock
Exchange. The guidance is concerned with the adoption of a risk-based approach to
establishing a system of internal control and reviewing its effectiveness and can be
instituted by any company in any industry.
The guidance is intended to reflect sound business practice whereby internal control
is embedded within the business processes themselves. This ensures that rather
than being merely an exercise to meet regulatory hurdles it is incorporated into the
company’s normal management and governance processes. Crucial to this is the
system of internal control; sound internal control systems should ensure the
enterprise will not be hindered in achieving its objectives. Building on the internal
control aspect, the Turnbull guidance mandates the need for an ongoing internal audit
stating how the enterprise identifies, evaluates and manages risks. As with Basel II,
there is a requirement for transparency of the results of this and the enterprise is
required to state what actions are being taken to rectify or improve any major failings
or weaknesses with the control system or risk management system.
Although this requirement involves non-IT functions it is clear that in any enterprise
where information handling or processing is paramount, the institution of a business
continuity capability is key. An effective system should not only mitigate risk, but also
satisfy the IT internal control requirements for Turnbull.
Business Continuity for the masses
Traditionally, Business Continuity solutions have been associated with the financial
services sector. However, with the evolution of eBusiness and the widespread use of
mission critical applications, organisations from all verticals are investing in continuity
and recovery solutions. In 2002, financial services institutions will account for 35% of
a €2.9 billion market in Europe. Datamonitor expects the dominance of financial
services end-users to remain, due to various initiatives such as GSTPA, T+1 and
multi-channel banking. The uptake of BC and DR solutions in other verticals is
relatively similar to the security market – with the public sector, utilities and
telecommunications organisations showing an increasing spend pattern. By 2005,
Datamonitor expects European businesses to spend a total of €7.7 billion in achieving
continuous availability of their operations.
The ideal enterprise continuity solution
The need to ensure the availability and continuous operation of business systems, in
spite of planned downtime and maintenance and of potential failures ranging from
disk crashes and CPU failures to catastrophic losses of computing facilities or
communications networks, is of critical importance to today’s businesses. Typically, many
companies also have geographically dispersed data centres. Information must be
available across the data centres and steps should be taken to ensure data
continuity, particularly when a data centre has an outage, planned or unplanned.
Although solutions exist to provide tolerance to component failure, the issue of site
loss is often overlooked, with potentially disastrous consequences – business
interruption and loss of information. Traditionally, off-site tape dumps have satisfied
the requirements for disaster recovery for batch systems but they are typically
inadequate for protecting the information stored in real-time eBusinesses and online
transaction processing systems. Asynchronous replication facilities can provide
continuous duplication of critical eBusiness applications to off-site backup facilities,
without the high latency inherent in tape backup strategies.
Once established, such an environment can be automated to ensure that information
is replicated in a timely manner and the switch to backup systems is accomplished
with minimal or no business interruption. Companies must develop a solution to
provide continuous availability of their systems, or risk losing business to their
competitors. It is essential that high availability solutions address the availability of
data in three areas:
• manage planned downtime (e.g. routine maintenance downtime, software
upgrades, etc.);
• protect data during unplanned downtime (machine / network outages);
• provide disaster recovery.
Any system providing high availability (HA) should provide continuous availability of
data in all three scenarios. The remainder of this section will provide an
understanding of the various components needed to achieve continuously available
systems. In combination, these components can provide a near-continuous service.
Hardware redundancy and physical replication
Hardware redundancy is often thought of as the first line of defence in continuous
systems and it protects against computer and disk failure. Although this can be an
expensive option, it is the only way to protect against the failure of a machine or disk
drive. Multiple processors in a single machine can protect against processor failure,
while multiple machines in a cluster formation protect against machine failure, and
these form the basis of hardware redundancy.
Drive redundancy can be achieved by using RAID (redundant array of inexpensive
disks) and disk mirroring solutions:
• RAID – provides redundancy within a single RAID cabinet that contains extra
storage / disk drives;
• Disk mirroring – a subset of RAID in which two separate disk storage devices are used to
provide redundancy.
Both of these solutions protect against disk media failure as long as the redundant
storage contains a valid copy of the data; this arrangement is also known as ‘cold standby’.
However, these solutions only secure data in its raw low-level format and do not
address the actual meaning and usage of the data, or the interaction with
applications. Today’s most critical eBusiness systems depend on data that has a
complicated internal structure and is used in ways that depend on the state of
key applications and business processes, rather than on a conventional set of data files.
The delay and perception of unreliability involved in such incidents can represent
millions of Euros in direct and indirect costs or loss of opportunity. Reputations are
also at risk due to the external impression that the organisation is not in full control of
its data, or that strict audit trails may have been compromised.
Logical replication
Logical replication is a process whereby logical operations / processes are replicated
at the level of the database instead of at the disk level. Typically, database operations
are replicated between two or more databases. Therefore, it replicates meaningful
operations such as ‘Customer Y with account number 00192 has placed an order that
needs to be delivered in 36 hours’. Using logical replication allows the customer information
to be validated before the redundant copy is written to disk.
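A minimal sketch of this idea follows, assuming a simplified in-memory model rather than any particular database product. Each logical operation is validated before it is applied to the redundant copy, whereas disk-level replication would copy the bytes regardless of their meaning. The class names, field names and validation rule are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    customer: str
    account_number: str
    delivery_hours: int


@dataclass
class ReplicaDatabase:
    """Toy stand-in for the redundant database: a list of applied orders."""
    orders: list = field(default_factory=list)

    def apply(self, order: Order) -> None:
        self.orders.append(order)


def is_valid(order: Order) -> bool:
    # Hypothetical business rule: numeric account number and a positive delivery window.
    return order.account_number.isdigit() and order.delivery_hours > 0


def replicate(order: Order, replica: ReplicaDatabase) -> bool:
    """Apply a logical operation to the replica only if it passes validation."""
    if not is_valid(order):
        return False  # a malformed operation never reaches the redundant copy
    replica.apply(order)
    return True


replica = ReplicaDatabase()
accepted = replicate(Order("Customer Y", "00192", 36), replica)
rejected = replicate(Order("Customer Z", "00X93", 36), replica)
print(accepted, rejected, len(replica.orders))
```

In this sketch the malformed second order is refused, so the replica holds only the valid operation.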
Redundant hardware is used in conjunction with logical replication, with two or more
databases running on separate machines, writing to separate disk devices. This
solution can be combined with physical replication: physical replication provides a line of defence against
loss of availability, while logical replication provides coverage in the event of physical
replication failure. This is also known as a ‘warm standby’ system and can be located
anywhere in the world, technology and reliable bandwidth permitting.
Surviving planned downtime
Generally, high availability is thought of in the context of server failures or disasters
that are actually quite rare. Today’s administrators have to consider another measure
of availability – keeping systems / applications available during normal planned
maintenance. Although this downtime is planned, it accounts for the majority of time
that systems are unavailable. This becomes critical for businesses that are operating
a true 24x7 system that serves customers globally, and taking any data offline can
have severe consequences. Therefore, systems and software must be designed so
that maintenance activities do not disrupt normal usage. This process is usually
known as switchover.
Previous solutions involved minimising the level of data impacted by maintenance
operations. Although this reduces the number of end-users affected, it still leaves
some data unavailable, making it far from an ideal solution.
Application and business process continuity
Combining hardware redundancy, physical replication and logical replication reduces
the amount of time it takes to recover in the event of an outage. However, it takes
more than just fast recovery to provide continuous availability. Even if there is an
immediately available copy of data when there is a disruption, users and applications
must be switched over to the copy. This process of switching over to a good copy of
data or another copy of an application is known as failover. The modes of failover
include:
• active-passive – a failover mechanism using a cold standby copy of the
data, either a genuine copy or switchable disks attached to both machines in a pair.
However, this is not the most rapid and seamless approach, as a second server
has to be started, which involves its own overheads, and resources are wasted because
the second server is not available for other work while there is no
emergency. Therefore, although the data can be switched over rapidly, achieving true business continuity
can be a slow procedure;
• active-active – this method solves the active-passive problem by running a second
server alongside the main server. This server is available to perform other duties
and means that resources are not wasted during normal operation. Also, the time to
take over from the main server during failure is significantly shorter;
• connection transparency – allows the various protection and failover solutions to
work without any of the users or applications noticing that there has been a failure
in the system.
Supporting continuous availability requires that failover is automatic and immediate
so that it is transparent to users and applications accessing the failed component.
Also, once this component has been repaired or replaced and brought back online,
users and applications will need to be switched back to the original component.
Again, this process must be immediate and transparent. Achieving automatic,
immediate and transparent failover requires appropriate failover support in both
database management systems and application servers.
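The automatic failover and failback behaviour described above can be sketched as follows. This is a minimal illustration only, not how any particular product works: the `FailoverRouter` class, the server names and the `is_healthy` probe are all hypothetical, and a real system would also have to preserve connection and transaction state.

```python
from typing import Callable, List


class FailoverRouter:
    """Sends work to the primary while it is healthy, otherwise to the standby."""

    def __init__(self, primary: str, standby: str,
                 is_healthy: Callable[[str], bool]) -> None:
        self.primary = primary
        self.standby = standby
        self.is_healthy = is_healthy      # hypothetical health probe
        self.active = primary
        self.events: List[str] = []       # record of every switch made

    def route(self) -> str:
        """Return the server that should handle the next request."""
        primary_ok = self.is_healthy(self.primary)
        if self.active == self.primary and not primary_ok:
            self.active = self.standby    # automatic failover
            self.events.append("failover")
        elif self.active == self.standby and primary_ok:
            self.active = self.primary    # automatic failback after repair
            self.events.append("failback")
        return self.active
```

Callers only ever ask `route()` where to send work, so the switch away from a failed component and the switch back after repair are both hidden from them.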
What about the network?
A common mistake in BCP and DR deployments is to ignore the vulnerability
of the network itself, whether private or public. Typically organisations will use a
secure VPN between the onsite and offsite facilities to mirror / RAID their information
assets. However, what happens if this link goes down during planned maintenance,
or when network traffic becomes excessive? Any loss of data,
transactions or applications could prove very costly to the organisation. Businesses
must accept that networks are unreliable and plan for this in their BCP and DR
solution. It is essential that store-and-forward queuing techniques are used, where the
most recent version of the replicated data is queued at the point where the network
failed. Once the network is restored, this information can then be passed securely to
the offsite facility.
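A minimal sketch of the queuing idea, under stated assumptions: the `StoreAndForwardLink` class and its `send` callback are hypothetical stand-ins for the link to the offsite facility, and the queue is held in memory, whereas a production queue would need to be persistent.

```python
from collections import deque
from typing import Callable, Deque


class StoreAndForwardLink:
    """Queues replicated changes locally and forwards them in order."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send                     # hypothetical transport to the offsite site
        self._pending: Deque[str] = deque()   # changes not yet delivered

    def replicate(self, change: str) -> None:
        """Store the change first, then try to forward everything queued."""
        self._pending.append(change)
        self.flush()

    def flush(self) -> int:
        """Forward queued changes until the queue is empty or the link fails."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break                         # link is down: keep the backlog, in order
            self._pending.popleft()           # remove only after a successful send
            delivered += 1
        return delivered

    @property
    def backlog(self) -> int:
        return len(self._pending)
```

A change leaves the queue only after it has been sent successfully, so an outage delays delivery but neither loses changes nor reorders them.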
Restoration – the weakest link?
Backup of information has to be fast enough to keep up with the size of the database,
must scale with the database as it grows and must have minimal performance impact
on active database operations. One method of minimising this impact is to offload as
much of the backup processing from the database itself to a backup server.
Restoring the database is the last step in recovering from an outage. However, this
can be the limiting step in the whole recovery process. Typically in day-to-day
operations the key issue in a smoothly running system is backup performance, not
restoration performance and it is often the case that not enough attention is paid to
restoration time.
For example, a database management system that backs up at a rate of 500GB /
hour and restores at a rate of 40GB / hour does little more than provide a false sense
of security. Under these conditions, a one terabyte database can be backed up in two
hours, but restoration would take 25 hours. These restoration time frames are now
unacceptable for most businesses who wish to retain their customers, partners and
suppliers, without losing competitive edge. Therefore, it is essential that fast
restoration systems are used.
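The arithmetic behind this example is simple division. For a one terabyte database, taken here as 1,000 gigabytes (an assumption), two hours of backup and 25 hours of restoration correspond to rates of 500 and 40 gigabytes per hour. The function name below is illustrative only.

```python
def hours_to_transfer(database_gb: float, rate_gb_per_hour: float) -> float:
    """Time needed to move a database of the given size at a steady rate."""
    if rate_gb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return database_gb / rate_gb_per_hour


ONE_TERABYTE_GB = 1000  # assumption: decimal terabyte

backup_hours = hours_to_transfer(ONE_TERABYTE_GB, 500)   # 2.0 hours
restore_hours = hours_to_transfer(ONE_TERABYTE_GB, 40)   # 25.0 hours
```

The same calculation can be run against an organisation's own measured restore rate to check whether its recovery time target is achievable.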
Common misconceptions about BCP
Many businesses have spent millions on technology infrastructure and the resilience
solutions to support it, yet many areas still exist where they are left exposed.
This is usually a result of misperceptions and a lack of education and
understanding among the end-user base. Points that are often misunderstood include:
• throwing hardware at the problem will not solve it – disk mirroring, clustered
machine pairs and offsite disk or tape backups may secure critical data, but are not
enough to allow the continuous availability of applications that customers demand;
• successful failover is only the beginning – mirroring or taking snapshots
preserves critical data, but unless there is a systematic, rapid and reliable way of
going back to normality, organisations will be exposed to further failures;
• clustering is not sufficient – clustered pairs will not protect against all emergency
situations, especially those that would be categorised as disasters. A pair of
machines will not guarantee continuous availability when fires, floods or terrorist
attacks occur;
• dual level resilience is not perfect – this becomes significant during planned
maintenance. The applications and users are dependent on the second
environment while the first is undergoing maintenance. What happens if the second
system fails during the maintenance session? The answer is to have three layers
(or more) of resilience, preferably combined with offsite secure copies of the data;
• messaging has its limitations – the use of messaging to some extent insulates
one application from a failure in other applications feeding it. However, this does
not guarantee that an entire chain of related systems will continue to function
smoothly whatever the combination of emergencies.
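The case for a third layer of resilience can be illustrated with a rough calculation. The sketch below assumes that layers fail independently and uses an illustrative 99% availability per layer; both are assumptions for the sake of the example, not figures from this report.

```python
from math import prod
from typing import Sequence


def combined_availability(layers: Sequence[float]) -> float:
    """Availability of a service that is up while at least one layer is up.

    Assumes independent failures, which real systems only approximate.
    """
    if not layers:
        raise ValueError("at least one layer is required")
    return 1 - prod(1 - availability for availability in layers)


# Illustrative figure only: each layer is available 99% of the time.
LAYER = 0.99

# Dual level resilience with one layer down for maintenance: one layer remains.
dual_during_maintenance = combined_availability([LAYER])

# Triple level resilience with one layer down for maintenance: two layers remain.
triple_during_maintenance = combined_availability([LAYER, LAYER])
```

Under these assumptions the dual level system drops to 99% availability during maintenance, while the triple level system still offers about 99.99%.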
Achieving continuous availability with Sybase
This paper has, so far, discussed numerous continuity planning solutions, such as
warm standby, avoidance of low level database corruption, triple layer resilience and
connection transparency. Combinations of the above solutions are essential to make
an enterprise truly resilient.
Depending on the business need, Sybase can provide a variety of levels of
availability that address the various degrees of severity in disasters and emergencies,
in various combinations. Its solution offering is illustrated in Figure 3. It can be
observed that Sybase offers a complete solutions architecture – the products, design
and implementation templates to make them work. Each element of the architecture
raises the bar in terms of continuous availability to higher levels on the 4x9s, 5x9s
scale. Its three key product offerings are:
• Adaptive Server Enterprise - High Availability (ASE HA);
• Replication Server;
• OpenSwitch.
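For orientation, the "nines" scale mentioned above can be converted into permitted downtime per year. The sketch below assumes a 365-day year and counts all downtime, planned or unplanned; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def availability_for_nines(nines: int) -> float:
    """Availability as a fraction, e.g. 4 nines is 0.9999."""
    return 1 - 10 ** -nines


def downtime_minutes_per_year(nines: int) -> float:
    """Maximum downtime per year that still meets the given number of nines."""
    return MINUTES_PER_YEAR * 10 ** -nines


four_nines = downtime_minutes_per_year(4)   # about 52.6 minutes per year
five_nines = downtime_minutes_per_year(5)   # about 5.3 minutes per year
```

Each additional nine cuts the downtime budget by a factor of ten, which is why every layer of the architecture matters at the top of the scale.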
Figure 3: Sybase triple layer resilience solution
[Figure: a database on a dual ported disk array with mirroring, hardware clustering and cluster control software, two ASE HA servers, warm standby through Replication Server, OpenSwitch, and an offsite standby ASE with an offsite replicated database. Together these layers keep applications continuously available (24x7x52 business operation) in the face of hardware failures, software failures, database corruption and disasters such as fire and flood.]
Source: Datamonitor
Adaptive Server Enterprise (High Availability)
Sybase ASE (HA) 12.5 is a data management platform for mission-critical,
transaction-intensive enterprise applications. It can be used to provide the continuous
availability of database systems to customers, partners and suppliers, without
creating any headaches, confusion or requiring them to reconnect. It has the ability to
manage each level of a BCP and DR solution, including hardware redundancy, cold
standby (backup and restore), warm standby (replication) and active / active hot
standby.
In the event of a failure, ASE (HA) enables the movement of end-users from a
primary system to a back-up system at any point in time, without causing any
disruption to the applications or information that they are using. Essentially, it
insulates the user from the complexities of back-end systems.
Sybase ASE (HA) leverages the cluster architecture discussed previously and
provides failover to a backup server without losing any committed data or
severing user connections. It also works in concert with existing hardware and
software high availability solutions from third party vendors, to deliver maximum
systems availability.
Furthermore, Sybase ASE (HA) incorporates a Companion Server that allows the
configuration of two ASE (HA) servers as companions in either asymmetric (master-slave)
or symmetric (active / active hot standby) mode to create a hot standby
capability that can further reduce unplanned downtime. This approach involves the
deployment of a two-node hardware cluster with two ASE (HA) databases running as
companion servers to each other. In this configuration, both servers run applications,
so that if Server 1 fails, Server 2 opens up the devices on which the primary’s
databases are built and performs a fast recovery to bring them online, while continuing
to handle its own clients.
The failover process is simplified through the use of ASE (HA), as it sends the end-user
an error message indicating that failover has occurred and that the current
transaction must be resubmitted.
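On the application side, handling such a message amounts to resubmitting the interrupted transaction. The sketch below shows the general pattern only: `FailoverError` and `run_with_resubmit` are hypothetical names, not part of any Sybase client library, and the real error would be whatever the database client reports.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class FailoverError(Exception):
    """Hypothetical error meaning: failover occurred, resubmit the transaction."""


def run_with_resubmit(transaction: Callable[[], T], max_attempts: int = 3) -> T:
    """Run a transaction, resubmitting it if a failover interrupts it."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[FailoverError] = None
    for _ in range(max_attempts):
        try:
            return transaction()          # the whole transaction is re-run
        except FailoverError as error:
            last_error = error            # connection survives; try again
    assert last_error is not None
    raise last_error
```

Because the whole transaction is re-run, the callable passed in should contain every statement of the unit of work rather than only the statement that failed.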
Replication Server
Sybase’s replication server provides automatic server failover solutions that reach
across LAN and WAN, providing simple replication of data over extensive geographic
regions. The replication server provides high availability and disaster recovery
services, giving greater protection against failures through asynchronous, wide-area
delivery of database transactions. It is also possible to share information across an
organization by replicating data to and from heterogeneous hardware and data
sources without losing transactional integrity of the data.
The replication server operates in a warm standby mode where databases are kept in
close synchronisation using a database replication system. This provides simple
configuration, database mirroring with geographical reach of thousands of kilometres, the
ability to rapidly switch the direction of replication and automatic switchover in
the event of failure (when used in combination with OpenSwitch).
This warm standby feature enables customers to more easily configure and manage
a high-availability, distributed recovery environment at lower administrative cost than
traditional replication products. Also, this feature simplifies the configuration and
operation of a warm standby environment by eliminating the traditional requirement to
define individual objects eligible for replication and explicit subscriptions to the
objects.
With the replication server operating in the warm standby mode, it is possible to
switch the direction of replication when the primary system is unavailable, with the
applications being routed to the standby system. Activation of the switch results in the
reverse flow of transaction traffic from the standby system to the active system (which
may be currently unavailable). This enables the replication server to queue up any
transactions applied to the standby system by the clients, until the primary system is
brought online. Once online, the primary system can be re-synchronised from the
queued data.
This functionality is critical in situations where failback is as important as failover, for
example when managing planned downtime, or unplanned downtime where the
primary systems are recovered relatively quickly and need to be resynchronised
rather than built from scratch.
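The switch of replication direction and the later resynchronisation can be sketched with a toy model. It is an illustration under simplifying assumptions, not a description of Replication Server internals: the `WarmStandbyPair` class is hypothetical, each database is just a list of transactions, and the queue lives in memory.

```python
from typing import Dict, List


class WarmStandbyPair:
    """Toy model of a primary and standby kept in step by a replication queue."""

    def __init__(self) -> None:
        self.data: Dict[str, List[str]] = {"primary": [], "standby": []}
        self.online: Dict[str, bool] = {"primary": True, "standby": True}
        self.active = "primary"           # side currently serving clients
        self.queue: List[str] = []        # transactions awaiting the other side

    @property
    def inactive(self) -> str:
        return "standby" if self.active == "primary" else "primary"

    def apply(self, txn: str) -> None:
        """Apply a client transaction to the active side and replicate it."""
        self.data[self.active].append(txn)
        self.queue.append(txn)
        self.drain()

    def drain(self) -> None:
        """Deliver queued transactions if the inactive side is reachable."""
        if self.online[self.inactive]:
            self.data[self.inactive].extend(self.queue)
            self.queue.clear()

    def switch_active(self) -> None:
        """Reverse the direction of replication (failover or failback)."""
        if self.queue:
            raise RuntimeError("resynchronise the other side before switching")
        self.active = self.inactive
```

Failback is then a matter of bringing the primary back online, calling `drain()` to replay the queued transactions, and calling `switch_active()` again, so the primary is resynchronised rather than rebuilt.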
Sybase Replication Server is fully interoperable with ASE (HA) and supports the high
availability features in a cluster configuration. For example, in the event of a node
failure (or process failure), the replication server process is migrated over to the
companion node without administrator intervention, thereby presenting the
organisation with cost and time benefits.
OpenSwitch – facilitating continuous availability
Adaptive Server Enterprise’s Companion Server option provides the fastest failover
functionality and can also guarantee that users will not be disconnected in the event
of a system failure. However, organisations that wish to provide continuous
availability to a variety of clients over extended geographic regions will benefit from
the enhanced “connection transparency” facilities of Sybase OpenSwitch.
Sybase OpenSwitch is a solution that provides transparent client connection failover
as well as transparent client connection routing, load balancing and central
management.
OpenSwitch sits between client applications and the servers to which they connect
and works either manually, in response to an administrative request, or automatically
when it detects a server failure. It transparently transfers incoming connections to any
Sybase server product, or another instance of OpenSwitch.
The true beauty of OpenSwitch is that it monitors server availability on an ongoing
basis and, when a server becomes unavailable, transfers the client connection to
another server without disturbing the connection. By monitoring and restoring the
transaction state, communications state and connection state of each connection,
OpenSwitch is able to transfer client connections without the need to disconnect and
reconnect. From an end-user’s perspective, nothing has changed, while behind the
scenes a great deal may be happening. The user may notice a pause or short delay,
but the connection and the customer are not lost. This functionality is available to
existing applications without requiring any programming changes and highlights the
continuous availability available from Sybase solutions.
Advantages of OpenSwitch include:
• load balancing and routing – allows organisations to transfer connections based
on the type of users coming into pools of servers. For example, different users may
have different performance requirements for their applications, ranging from the
need for sub-second query responses to the need to run less time sensitive batch
reports. This allocation process is performed on the fly without having to bring any
servers down;
• central management – provides tools designed to make the end-user experience
as simple as possible and to simplify the challenges IT faces behind the scenes.
For example, consider an organization in which the administrator needs to perform a
data load involving hundreds of gigabytes. Such loads can take hours and are
normally done in the middle of the night. Using OpenSwitch, the load can be
performed during the day and once the load is completed, users can be moved over
to the updated server with a simple flick of a switch.
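The routing and "flick of a switch" ideas above can be sketched as a small connection router. This is a generic illustration, not OpenSwitch's configuration or API: the `PoolRouter` class, the workload names and the server names are all hypothetical.

```python
from typing import Dict, List, Sequence


class PoolRouter:
    """Assigns connections to server pools by workload type, round-robin."""

    def __init__(self, pools: Dict[str, Sequence[str]]) -> None:
        self._pools: Dict[str, List[str]] = {
            workload: list(servers) for workload, servers in pools.items()
        }
        self._next: Dict[str, int] = {workload: 0 for workload in pools}

    def server_for(self, workload: str) -> str:
        """Pick the next server in the pool that serves this workload."""
        servers = self._pools[workload]
        if not servers:
            raise LookupError(f"no servers available for {workload!r}")
        index = self._next[workload] % len(servers)
        self._next[workload] = index + 1
        return servers[index]

    def move_pool(self, workload: str, servers: Sequence[str]) -> None:
        """Repoint a workload at different servers without stopping the router."""
        self._pools[workload] = list(servers)
        self._next[workload] = 0
```

In the data load example, the load would run against a server outside the pool, and a single `move_pool` call would then direct new connections to the updated server.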
Why Sybase?
• builds on existing investments – in mirroring, clustered hardware, cluster control
software, remote DR recovery, etc. but adds the database and application level
resilience that enhances basic data protection and hardware duplications;
• active–active – configuration for the quickest failover, and a secondary server that
is always running;
• secure, physical and logical warm standby – of database copies, providing
insurance against lurking corruption and provision of an offsite copy that is secure
against severe emergencies like fires, floods and terrorist attacks;
• triple level resilience – architected with active-active and warm standby
combined. Provides integrated, automated protection against service interruptions,
database corruption and severe emergencies;
• connection transparency – provides protection and failover without applications
or users noticing. Provides real assurance that each part of an STP, reporting or
customer service chain will continue smoothly. The whole value chain will continue
to operate correctly, both in parts and as a whole;
• multiple benefits from each investment – remote DR sites also support failover
operations if hardware failure occurs during planned maintenance and OpenSwitch
ensures this is invisible to end-users, while also ensuring minimum disruption to
applications during any normal high availability failover;
• ease of switching back to normal – well defined procedures are inherent in the
Sybase architecture and allow for ‘returning to normal’ configuration. Also, the
Sybase solution set works at the database and the application level, providing
logical integrity of the system at all times.
The true value of the Sybase offering arises when the products work as one
to provide continuously available BCP and DR solutions. The benefits of such an
integrated solution include:
• fast failover – locally provides resilience against any hardware breakdowns;
• secure logical copy – provides a highly up to date logical copy of all data at the
remote site, providing resilience against any site disasters, or inaccessibility of the
main site to the key staff;
• invisibility – any maintenance activities to hardware or applications are invisible to
the end-users;
• triple layer resilience – possible breakdowns during planned maintenance periods
go unnoticed by users through the secure triple layer resilience;
• highly automated – procedures providing rapid failover and failback are provided
through the automation of the solution;
• immune to corruptions – the solution provides a remote copy which is immune to
any database corruptions originating in the original copy;
• intelligent – the data replication functionality is closely related to the individual
databases supporting individual applications, meaning that the failback procedures
are less error prone than if all the data is mirrored indiscriminately.
Conclusion
Today’s world of eBusiness means that it is essential for businesses to have some
form of business continuity planning and disaster recovery solution in place. This is
not only limited to the financial service sector, which is dependent on high volume
and value transactions to run its business, but to businesses in all industry sectors.
Now it is not only internal employees that are dependent on system availability,
but also business partners, customers and suppliers. Enterprise applications and
collaborative initiatives have further highlighted the need for BCP and DR.
The cost of not having a robust continuity solution in place could be catastrophic –
lost revenues, bad press coverage, loss of customers and competitive mindshare to
name but a few.
The level of continuity needed is obviously dependent on the business need, however
BCP and DR demands a certain level of skill, expertise and experience that may be
lacking in an enterprise. Using an outside vendor for the writing, testing and
implementation stages of any plan may be an expensive option. On the other hand, a
‘build it yourself’ approach could prove complicated and time consuming, especially
when multiple organizations and sites are involved.
Therefore, it is essential that businesses select a continuity vendor with a proven
track record and a comprehensive solution offering that meets the business need.
This is where Sybase comes in with its various continuity offerings. Together, the
Adaptive Server Enterprise, Replication Server and OpenSwitch solutions provide a
continuously available continuity solution for all areas of the
business, on a 24x7x52 basis.
Appendix: Business Continuity Planning and Disaster
Recovery in practice
Case study – large European clearing bank
The business challenge:
A large European clearing bank wanted to offer one of its largest customers an extranet
web-based system allowing frequent and high volume transactions to be performed.
These transactions would be aggregated, calculations performed based on
external data feeds, and the aggregated transactional feeds sent to existing bank
foreign exchange systems. As this business offering was to support the customer’s
own web sites, 24x7x52 availability of the transactional receipt process was essential
for customer satisfaction. In order to avoid backlogs, only minimal downtime in any part of
the system could be tolerated. Also, to meet regulatory and auditor requirements, the
ability to switch processing to an alternate site was essential.
The business solution:
This was a greenfield deployment and the bank opted for a three tier solution, using a
well known application server for the middle tier and web clients at the top. The bank’s
appreciation of Sybase’s high availability solution in ASE 12.5 led it to insist on
this architecture, in preference to other database vendors’ offerings. Also, the bank’s
long track record with Replication Server gave it faith in this as the most reliable
way to transport data rapidly and safely offsite, while filtering out any low level
physical database corruptions.
The Sybase High Availability Solution:
This customer used all elements of the Sybase continuity offering bar one, Sybase
OpenSwitch. The customer felt that connection transparency could be built into the
J2EE based middle tier. No other parts of the system would require direct connection
to the database. For a site with many existing 2 tier applications, the decision would
have been made differently. This example indicates that, while comprehensive, the
Sybase architecture is flexible and capable of working with 3rd party products where
necessary.
To fulfil the business and technology requirements, the bank opted for the ASE 12.5
solution, providing dynamic memory management features for administration without
frequent restarts, in addition to the High Availability, fast failover support. Replication
servers were used onsite and offsite to avoid physical low level data corruptions.
The benefits:
The bank is able to provide the ‘always available’ solution demanded by the end
customers using the web sites of the bank’s own customers, and comfortably meet
the regulatory requirements and service level agreements (SLAs) in place. Use of
such a scalable solution will allow the bank to expand these operations easily to
include the growing number of customers who are interested in this business offering.
Case study – European branch of a major Asian bank
The business challenge:
A European branch of a major Asian bank was running a significant number of
business critical applications and Sybase based systems that were supported by a
variety of technical architectures. The day-to-day operational management and
administrative workload generated by these various architectures / hosts was
becoming an unacceptable cost to the business. Furthermore, the resilience
strategy for a number of these systems and applications (notably Summit and
Fidessa) was not going to meet the latest demands of the business user
communities, both in terms of loss of data and time to recover in the event of a
significant primary and / or host failure.
The bank was facing a situation where the operational risks that its business systems
infrastructure presented were now considered unacceptable and needed to be
addressed as a matter of urgency.
The business solution:
To address these problems, the bank chose a Sybase systems consolidation and
resilience solution that addressed two key criteria:
• migration / consolidation of these disparate Sybase based systems to a single
technical architecture based on two ASE servers;
• incorporation of the migrated solution into a single standardised high availability
solution, based on proven / complementary Sybase technologies.
The project produced a new high resilience architecture for
the bank’s operations.
The Sybase High Availability Solution:
For each of the systems / applications that were migrated to the new architecture,
exactly the same high availability solution was deployed. Even if the SLA
requirements for a given system did not require the level of resilience the solution
provided, it was considered that taking a uniform approach would be most cost
effective in terms of operational support, control and management.
The high availability solution deployed utilised two Sybase ASE (HA) servers, one on the
primary site and the other on the DR site. Warm standby functionality was provided
by the replication server and allowed near real time replication of all database data
between the two ASE (HA) servers. The replication server was deployed on the DR site host.
This bank made use of the Sybase OpenSwitch technology, providing user and
application transparency in the event of a systems failure. OpenSwitch is managed
through the OpenSwitch Coordination Module, which monitors all other Sybase
components in the architecture and runs alongside
the OpenSwitch server on the DR host. Its primary role is to detect significant failure
in the primary database server environment and then coordinate the switching of user
connections within OpenSwitch, with management activities being performed by the
replication server.
The benefits:
In addition to solving the critical business problem, the project aimed to maximise the
return on investment, by ensuring that best use was made of the new host
architecture. The migration of a given system / application to the new architecture /
high availability solution would be transparent to business users and in house
application development teams. Disruption to 3rd party systems was nonexistent,
thereby allowing the data movement and interfaces to remain the same.
The resulting high availability solution is platform independent and therefore allows
for flexibility in client choices, particularly when it comes to platform upgrades such as
operating systems or disk sub-systems. This also allows for a scalable architecture,
both in terms of hardware and software performance. Using OpenSwitch has allowed
the bank to perform essential maintenance tasks by seamlessly switching business
users between the primary and disaster recovery database servers.