How to Leverage Disk to Improve Recovery Plans
Introduction
If you are still using tape to back up your production servers, you are likely experiencing
some problems around backup windows, data loss on recovery, recovery time, or recovery
reliability. For enterprises that need to retain backup data for long periods of time to meet
business and/or regulatory mandates, tape still has a place in the backup infrastructure. But
disk offers significant advantages when it is integrated into the backup infrastructure in the
right locations, and provides ready access to a number of other next generation recovery
technologies that are critical in meeting business recovery requirements.
As we recover from the industry downturn, cost-cutting and cost-containment are still key
issues for most enterprises. Being able to provide the right recovery capabilities for your
business extends to more than just data – it must cover applications as well. If you’re like
most information technology (IT) shops, this means multiple products and multiple vendors.
Interestingly, the strategic deployment of disk in your backup infrastructure can help you
resolve this issue, simplifying your environment and actually lowering costs relative to what
you’re doing now.
Assuming disk is a well-integrated part of the backup infrastructure, this white paper will
discuss technologies that are critical in designing a comprehensive and scalable recovery
capability for your business that covers data and applications, both to meet everyday needs
as well as more infrequent disaster recovery (DR) requirements.
Understanding Why Disk Is a Strategic Data Protection Technology
Data is growing at 50% - 60% a year for most enterprises. Using legacy tape backup-based approaches, providing adequate protection for mushrooming data stores that are in many cases already multiple terabytes in size puts a huge load on servers, networks, storage devices, and administrators. As the data grows, these loads increase. With its performance and serial access characteristics, tape is a poor choice as a backup target for large data sets. It just takes too long to backup and/or restore data – activities that your shop is doing on a daily basis.

By designating disk as the initial backup target, its performance and random access characteristics can be leveraged to address backup window and recovery time issues. Conventional backups can be completed much faster against disk-based targets than they can against tape-based targets, and this can in many cases cut backup times considerably. Given that most restores are done from the most recent backups, keeping the last couple of backups on disk ensures that data can be found and restored faster. And for the types of random access activities that typify initial backups and most object-level restores, disk is much better suited to it and therefore performs much more reliably.

DISK ADVANTAGES
– Ability to operate at varying speeds shortens backups
– On-line searchability lets admins find data faster
– Disk is a more reliable media than tape for backup tasks
– Disk provides access to key next generation recovery technologies
– Data does not need to be converted to disk format prior to restore

TAPE ADVANTAGES
– Tape has a lower $/GB cost than disk (10x – 100x)
– Tape can dump large amounts of data faster than disk (if network infrastructure can support it)
– Each tape device operates at a single rated speed

Figure 1. Comparing the relevant "data protection" characteristics of disk and tape.
Figure 2. To improve data protection and recovery, leverage the advantages of disk and tape in the appropriate backup "tiers". Protected servers back up to Backup Tier 1 (SATA disk), which handles initial "backups" and feeds all near term restores; Backup Tier 2 (tape) provides low cost, long term retention and meets regulatory and compliance requirements.
The use of disk as a backup target opens up the use of disk-based snapshots. For most recovery scenarios, it is best to use application-consistent recovery points. Relative to crash-consistent recovery points (the only other kind), application-consistent recovery points allow data and/or an entire server to be recovered faster and more reliably. When the objective is to restore data fast or to bring an application service (such as Exchange) back up as quickly as possible, you want application-consistent recovery options. Many key enterprise applications offer APIs that allow third party products to interact with them to create application-consistent recovery points which can be kept on disk (called snapshots). Windows Volume Shadow Copy Service (VSS) and Oracle Recovery Manager (RMAN) are probably the two most widely known of these interfaces, but most enterprise data service platforms (databases, file systems, etc.) offer them.
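As a small illustration of driving one of these interfaces from a script, the sketch below requests a VSS shadow copy from the command line using Python. It is a minimal sketch, not part of any backup product: it assumes a Windows Server host (where the "vssadmin create shadow" command is available; client editions of Windows omit it) and administrative privileges, and real backup software would instead act as a VSS requestor through the APIs directly.

```python
# Minimal sketch: request an application-consistent VSS shadow copy of a volume.
# Assumes Windows Server (where "vssadmin create shadow" exists) and admin rights.
import subprocess

def create_vss_snapshot(volume: str = "C:") -> str:
    """Ask the Volume Shadow Copy Service for a shadow copy of `volume`.

    VSS coordinates with VSS-aware writers (Exchange, SQL Server, etc.) so the
    resulting snapshot is application-consistent rather than merely crash-consistent.
    Returns the raw vssadmin output, which includes the new shadow copy ID.
    """
    result = subprocess.run(
        ["vssadmin", "create", "shadow", f"/for={volume}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(create_vss_snapshot("C:"))
```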
There are two very viable options for integrating disk into your
current backup infrastructure that require minimal changes.
Many enterprise backup software products will optionally
support disk as a backup target, allowing you to stay with your
current backup schedules while you leverage some of the
advantages of disk. The other option is to buy a virtual tape
library (VTL). VTLs use disk as the storage media but present it to the backup software as a tape library.
While this is all interesting, these are only incremental improvements that are based almost entirely on maintaining the same
processes you have today (scheduled backups that occur once
a day) and merely using disk as the backup target. Because data
will continue to grow at high rates, many of the same problems
you have with tape today (e.g. backup window, data protection
overhead) will resurface at some point in the future with
these two approaches.
What really makes disk a strategic play in data protection
is the access it provides to a number of newer technologies that can transform how data protection occurs.
Given current data growth rates, within 5-7 years almost
all enterprises will have had to make the transition to a
different recovery model. Many enterprises are already
at that point today, and need to move away from point-in-time oriented backup approaches so that they can rein in the growth of server overhead, network utilization, and storage capacity.
This coming transformation will be a move from scheduled, point-in-time based data protection schemes to
more continuous approaches that spread the capture
of “backup” data out across the entire day (a 24 hour period)
rather than attempting to capture it once a day in a scheduled,
"off hours" period. With businesses operating 24x7 in most industries, there are no more "off hours" periods. And there just isn't
enough bandwidth in many cases to complete backups within
some pre-defined window. Massively scalable, disk-based
platforms built around scale out architectures designed to
handle very large data sets have already been recommending
“continuous” approaches to provide data protection for their
platforms for years. It’s just a matter of time before enterprises
run into the same “scale” problems with their backup data, if
they are not already there.
Dealing With the Cost of Disk
The one huge negative of disk relative to tape has been its much
higher cost. Two key developments give end users options for managing disk-based backup costs down. The widespread availability of large-capacity SATA disks with enterprise-class reliability has brought the cost of a viable disk target down to somewhere in the $5/GB range (raw storage capacity
cost) for many products. Tape, however, is still much cheaper
coming in at $.40 - $.50/GB (raw storage capacity cost) for many
enterprise class tape libraries. Compression technologies,
included on most enterprise class tape libraries, generally can
provide a 2:1 reduction in storage capacity requirements for
most types of data, resulting in a $/GB cost for tape in the
$.20 - $.25 range.
The increasing use of storage capacity optimization technologies
is also helping to reduce the cost of disk significantly. Storage
capacity optimization includes technologies like compression,
file-level single instancing, delta differencing, and data deduplication which can provide a 20:1 or more reduction in storage
capacity requirements for backup data. When storage capacity
optimization is applied to a SATA-based disk array being used
for backup purposes, the raw storage capacity cost of that array
may be reduced to the $.25 - $.50 range, making it only slightly
more expensive than tape. Storage capacity optimization ratios
vary significantly based on algorithms used and data set types
on which it operates, but when used together with SATA disk it
can narrow the price difference between disk and tape for
backup purposes.
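To make that comparison concrete, the short calculation below works through the effective cost per GB using the round numbers cited above (roughly 2:1 compression for tape, roughly 20:1 storage capacity optimization for disk). The raw costs and ratios are the illustrative figures from the text, not guarantees for any particular environment.

```python
# Back-of-the-envelope effective cost per GB of protected backup data,
# using the round-number figures cited in the text.
def effective_cost_per_gb(raw_cost_per_gb: float, reduction_ratio: float) -> float:
    """Raw media cost divided by the achievable data reduction ratio."""
    return raw_cost_per_gb / reduction_ratio

tape = effective_cost_per_gb(raw_cost_per_gb=0.45, reduction_ratio=2.0)   # ~2:1 compression
disk = effective_cost_per_gb(raw_cost_per_gb=5.00, reduction_ratio=20.0)  # ~20:1 optimization

print(f"Tape: ${tape:.3f}/GB effective")  # ~$0.225/GB
print(f"Disk: ${disk:.3f}/GB effective")  # ~$0.250/GB
```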
Keep in mind, however, that storage capacity optimization is
not inherently a “backup” solution. It is a technology which
increases the amount of information that can be stored within
a given storage array, based on the achievable storage capacity
optimization ratio. The use of it may have some implications for
backup and DR in certain environments, however, so look for it
to be included as a feature in many mid range and larger disk
arrays within the next 1-2 years.
Decision Points: Disk vs Tape
While tape is poor at handling initial backup and object-level
restore requirements, "poor" is a relative term. If you can complete your backups within the allotted time, the amount of data you lose on recovery is acceptable (assuming one or two backups per day), you are meeting recovery time requirements, and recovery reliability is not an issue affecting your restores, then
by all means stay with tape.
If you are looking for a relatively easy fix for the backup window, recovery time, and recovery reliability issues you are having with tape, then you may consider simply inserting disk as a backup target into your existing backup infrastructure and managing your backup data so that most of your restores are served from disk. Realize, however, that
if you take this approach, it is a tactical rather than a strategic
use of disk, postponing a problem that you (or your successor)
will ultimately have to deal with anyway within a year or two
(depending on your data growth rates).
Other pressing problems, such as data protection overhead, available network bandwidth, data loss on recovery, the need for better root cause analysis of data corruption problems, and complex environments with multiple application-specific recovery solutions like log shipping, may drive you to consider the strategic use
of disk.
Applying Disk Strategically
If you have decided that you want to address backup problems
rather than symptoms, the key transformation that needs to
occur is a change from scheduled, point-in-time backups to a
“continuous” approach. With scheduled backups, it became
clear long ago that you can’t perform a full backup each time –
there’s not enough time, network bandwidth, or storage capacity to do so. Incremental and differential backups which focus
on just backing up the changes since the last backup have been
in mainstream use for at least a decade. But given the size of
today’s data sets and the change rates, even just backing up the
changes has now run into the same time, network bandwidth,
and storage capacity limitations.
The way to address this is through “continuous” backup. These
technologies capture only change data, they capture it in real
time as it is created, and they spread the transmission of that
data across 24 hours each day. The difference is between bundling up 50GB of daily changes from your database and trying to
send them all at once versus spreading the transmission of that
50GB out across a 24 hour period. This immediately addresses
two of the three key problems mentioned above. It completely eliminates point-in-time backups, replacing them with a transparent approach that puts negligible overhead on the system being backed up at any given point in time. Backup is effectively occurring all the time, but the impact on the servers being backed up is so low that it's not noticeable. Interestingly, with continuous approaches data becomes recoverable the instant it is created, not just once it's backed up. And it significantly reduces peak
bandwidth requirements. By spreading the transmission of that
50GB out across the day, the instantaneous bandwidth usage is
so low that it is barely noticeable.
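The arithmetic behind that claim is straightforward. The sketch below compares the sustained rate needed to trickle 50GB of daily changes across 24 hours with the rate needed to push the same 50GB through a nightly backup window; the 4-hour window is an illustrative assumption, not a figure from the text.

```python
# Compare the sustained network rate for continuous capture of 50 GB of daily
# changes with the burst rate needed to move the same data in a backup window.
BITS_PER_GB = 8 * 1000**3  # decimal gigabytes expressed in bits

def required_mbps(gigabytes: float, hours: float) -> float:
    """Average rate in megabits per second to move `gigabytes` within `hours`."""
    return gigabytes * BITS_PER_GB / (hours * 3600) / 1e6

daily_change_gb = 50
print(f"Continuous (24 hours): {required_mbps(daily_change_gb, 24):.1f} Mbit/s")  # ~4.6 Mbit/s
print(f"4-hour backup window:  {required_mbps(daily_change_gb, 4):.1f} Mbit/s")   # ~27.8 Mbit/s
```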
Figure 3. Point in time backups generate spikes in server overhead and network bandwidth utilization that continuous backup does not. The chart plots resource usage over time for backup as a discrete operation versus "backup" as a continuous operation; backup as a point in time operation generates spikes in resource utilization that impact production performance.
How continuous backup addresses the storage problem is not
necessarily intuitive. After an initial synchronization (i.e. creating
a copy of the original state of the production data), continuous
backup just collects and stores the changes. So the amount of
storage required will be heavily dependent on the change rate
and how long the data is kept on disk. Think about how just
capturing and storing changes compares to what data deduplication (discussed earlier in the white paper) does. Data deduplication takes a full backup and processes it to remove redundancies during each scheduled backup, then stores it in its compacted form. Continuous backup never operates with a full backup (after the initial sync); it only ever captures and stores change data. Continuous backup approaches can come very
close to achieving the same capacity optimization ratios that
deduplication does without having to repeatedly process full
backups. This points up another important distinction between
continuous backup and data deduplication: you are working
with “capacity optimized” data sets and enjoying the benefits
it provides in terms of minimizing network bandwidth utilization
across every network hop (LAN or WAN). With deduplication,
you must consider the impacts of source-based vs. target-based deduplication on server overhead and network bandwidth
requirements.
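A rough sizing rule falls out of that description: after the initial synchronization, the continuous backup store grows with the daily change rate multiplied by how long change data is retained on disk. The sketch below applies that rule; the 500GB protected set, 5% daily change rate, and 14-day retention are illustrative assumptions rather than figures from the text.

```python
# Rough continuous-backup store sizing: the initial copy plus change data
# retained for the length of the disk-based retention window.
def continuous_store_gb(protected_gb: float, daily_change_rate: float,
                        retention_days: int) -> float:
    """Initial sync copy + (daily changes x days retained). Ignores any
    compression or capacity optimization applied to the change log."""
    daily_change_gb = protected_gb * daily_change_rate
    return protected_gb + daily_change_gb * retention_days

# Illustrative assumptions: 500 GB protected, 5% daily change, 14-day window.
print(f"{continuous_store_gb(500, 0.05, 14):.0f} GB on disk")  # 500 + (25 * 14) = 850 GB
```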
There are two forms of continuous backup available today:
replication and continuous data protection (CDP). Replication
basically keeps designated source and target volumes which are
connected across a network in sync. But replication by itself is
not a data protection solution, because it only tracks the latest
data state. If the source volume somehow becomes corrupted,
that corruption will be transferred to the target volume, leaving
you without a way to recover. So replication is often combined
with snapshot technology so that if the latest data state is
not available from the recovery volume, other good recovery
points are.
CDP, on the other hand, captures and stores all changes in a
disk-based log, allowing administrators to choose any data point
within the log to generate a disk-based recovery point. If corruption occurs, it is still transferred to the CDP log, but all points
prior to that within the log are still good and can be used as
recovery points. CDP can also be integrated with application
snapshot APIs so that application-consistent recovery points can
be marked (using “bookmarks”) in the CDP data stream. CDP
by itself is a data protection solution, and it is often combined
with replication so that the CDP log can be replicated to remote
sites to support DR operations in exactly the same way as it
supports local recovery operations.
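To make the recovery point idea concrete, the sketch below models a CDP log as an ordered list of change records, some of which carry application-consistent "bookmarks". The record layout, field names, and helper function are hypothetical and for illustration only; actual CDP products expose this capability through their own interfaces.

```python
# Hypothetical illustration of selecting a recovery point from a CDP log.
# The record layout and helper below are illustrative, not any product's API.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ChangeRecord:
    timestamp: float                 # when the write was captured
    data: bytes                      # the captured change
    bookmark: Optional[str] = None   # e.g. "VSS: Exchange consistent" if app-consistent

def latest_point_before(log: List[ChangeRecord], cutoff: float,
                        app_consistent_only: bool = False) -> Optional[ChangeRecord]:
    """Return the most recent usable recovery point at or before `cutoff`.

    With app_consistent_only=True, only bookmarked (application-consistent)
    points are considered; otherwise any point in the log can be used, which
    is what lets CDP roll back to just before a corruption event.
    """
    candidates = [r for r in log
                  if r.timestamp <= cutoff
                  and (r.bookmark is not None or not app_consistent_only)]
    return max(candidates, key=lambda r: r.timestamp, default=None)
```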
CDP offers the highest value for initial backups and recovery operations, not cost-effective long term retention. Most customers will size the CDP log to create a "retention window" of two to four weeks. The retention window size is generally driven by the frequency of accessing the data. CDP makes it easy to access the data, since an administrator merely has to "point and click" to select and generate a disk-based recovery point. Once the data has aged to a point where it is not likely to be accessed, it can be moved to tape for long term retention purposes. CDP integrates well with existing tape-based infrastructure to do this: to migrate data to tape from a CDP system you generate a recovery point, mount that volume on a backup server, and back up the server represented by that volume (or volumes) just like you would have in the past.

Figure 4. In addition to being the lowest overhead way to perform data protection, CDP can minimize data loss on recovery, offer reliable application-consistent recovery options, and can re-create one or more recovery points anywhere along the timeline represented by the CDP log for root cause analysis.

Note also that disk-based images of application-consistent data states can be generated for purposes other than recovery. Because recovery images can effectively be mounted on any other server, they can be used for test, development, reporting and other analysis purposes, enhancing the value of CDP beyond just data protection and recovery. Think about how many other "copy creation" tools and products it may be able to replace within your existing infrastructure.
Virtual Servers Argue for the Strategic Use of Disk
As enterprises deploy virtual server technology, it is generally
used in server consolidation efforts to decrease energy and
floorspace costs. While physical servers are often configured at
25% - 35% utilization rates to provide headroom for data and
application growth, most virtual servers are configured at utilization rates of 85% or higher to maximize the benefits of server
consolidation.
The lack of headroom available on most virtual servers has key implications for data protection strategy that many enterprises do not realize until after they've tried to stay with the old method of deploying a backup agent in each server. Once you've come to the realization that you need a low overhead approach that supports application-consistent recovery options and can work across not just physical servers but also any virtual server platform (VMware, Microsoft, Citrix), you can really start to appreciate what CDP technology has to offer.
Virtual servers offer significant opportunities for improved recovery and lower costs in DR scenarios. Restarting applications on virtual servers at a remote site removes all the "bare metal restore" problems that exist with physical servers, and enables server consolidation on the DR side that significantly cuts infrastructure requirements, lowering energy and floorspace costs. The use of disk as a backup medium provides access not only to CDP and application snapshot API integration (for application-consistent recovery options), but also to asynchronous replication. Asynchronous replication enables long distance DR configurations that will not impact production application performance, and has already been integrated with many CDP offerings available in the market today. When replication technologies are not tied to a particular disk array vendor, they can provide a lot of flexibility to use a single product to
replicate data stored on any kind of storage architecture (SAN,
NAS, or DAS). Figure 5 indicates that most IT shops with virtual
servers have very heterogeneous storage environments where
such a feature may be valuable.
Figure 5. An IDC survey of 168 respondents done in October 2009 indicates significant
heterogeneity in storage architectures in virtual server environments.
DR Implications of Disk-Based Backup
Data sitting on tapes can’t be replicated. To move that data to
remote locations for DR purposes, the standard approach has
been to clone the tapes (create a copy of each) and ship them
via ground transportation to the alternate site. Typically this has
not been done for daily tape backups, just weekly full backups.
And it takes several days to ship tapes to the remote location.
So if recovery from the remote site is required, data is quite old.
Recovering from old data means lots of data loss.
If data is sitting on disk, however, it can be replicated to remote
locations automatically, either continuously or through scheduled replication. This gets the data to the remote locations much
faster, on the order of hours or even minutes (if continuous replication is used), so the achievable RPOs from data at the remote site are much better. Once at the remote site, data can always be migrated to tape for more cost-effective long term retention, so many enterprises using replication may keep the last couple
of days of “replicated” data on disk, migrating it to tape
thereafter to minimize costs.
Any time replication is considered, network bandwidth costs
must also be considered. While LAN bandwidths are often 100
Mbit/sec or greater, WAN bandwidth is much more expensive
and is generally much lower (as much as 100 times lower).
Trying to send “backup” data across the WAN can take up much
of that very limited bandwidth, imposing performance impacts
on other production applications that are using the WAN. If data
is on disk, however, storage capacity optimization technologies
can be applied to it. WAN optimization, another form of storage
capacity optimization, can be effectively deployed to minimize
the amount of data that has to be sent across WANs to make
the information it represents (your database tables, files, etc.)
recoverable at the remote site. When considering replication,
consider also how much bandwidth you will need
to meet your RPO requirements at the remote site
without unduly impacting other applications using
that network. Bandwidth throttling, often available
as part of replication products, can let you limit
the amount of bandwidth that replication takes up
throughout the day, thus guaranteeing a certain
percentage of your network bandwidth for other
critical applications. The use of bandwidth throttling
may, however, impact your RPO at the remote site.
When deploying replicated configurations, you will
also want to quantify your resource utilization and
capacity requirements as much as possible up front.
This lets you accurately predict recovery performance and costs associated with your selected
configuration. Look for tools from vendors that will
collect quantitative data about your environment before full
product installation so you know these answers – just using
backup logs to gauge these requirements can under-report
resource requirements by 20% - 40%.
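As a rough planning aid, the sketch below checks whether a bandwidth throttle can keep up with a steady change rate and how long a burst of changes would take to drain under the cap. It is a simplified steady-state model with illustrative numbers (50GB of daily changes, a 10 Mbit/s cap, a 5GB burst) that ignores WAN optimization and protocol overhead; it is not a sizing tool.

```python
# Rough WAN sizing check: does a bandwidth throttle keep up with the daily
# change rate, and how long would a burst of changes take to drain?
def avg_change_mbps(change_gb_per_day: float) -> float:
    """Average rate (Mbit/s) at which change data is produced."""
    return change_gb_per_day * 8e9 / 86400 / 1e6

def drain_minutes(burst_gb: float, cap_mbps: float, background_mbps: float) -> float:
    """Minutes to replicate a burst of `burst_gb` under a throttle cap while the
    steady background change stream keeps flowing."""
    spare_mbps = cap_mbps - background_mbps
    if spare_mbps <= 0:
        return float("inf")  # replica falls further behind; the RPO grows without bound
    return burst_gb * 8e9 / (spare_mbps * 1e6) / 60

# Illustrative assumptions, not figures from the text.
background = avg_change_mbps(50)  # ~4.6 Mbit/s of steady change
print(f"Background change rate: {background:.1f} Mbit/s against a 10 Mbit/s cap")
print(f"A 5 GB burst drains in ~{drain_minutes(5, 10, background):.0f} minutes")
```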
Application Recovery
While data must be recovered every day, applications must be
recovered as well, just generally less frequently. But when applications are down, the impacts to the business can be much
larger than the impact of a few lost files. For most IT shops, the
pressure is on to recover failed applications quickly and reliably.
When applications must be recovered manually, the process can
be time consuming, very dependent upon the skill set of the
administrator, and inherently risky. Application recovery generally incorporates a set of well-known steps, however, which
lend themselves very well to automation for most applications
(as long as those applications are considered to be "crash tolerant"). At a high level, those steps are as follows (a simplified orchestration sketch appears after the list):
• Identify that a failure has occurred and an application service must be re-started
• Determine whether to re-start the application service on the same or a different physical server
• Mount the data volumes representing the desired recovery point on the "target" recovery server
• Re-start the application
• Re-direct any network clients that were using that application service on the "old" server to the "new" server
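The sketch below turns those steps into a small orchestration routine. Every callable it accepts (failure detection, target selection, volume mounting, service restart, client redirection) is a hypothetical placeholder for product- and application-specific logic; it illustrates the shape of the automation, not any vendor's implementation.

```python
# Hypothetical failover orchestration following the steps listed above.
# All helper callables are placeholders for product/application specific logic.
from typing import Callable, Optional

def automated_failover(
    service_is_healthy: Callable[[], bool],        # step 1: detect the failure
    choose_target_server: Callable[[], str],       # step 2: same or different server
    mount_recovery_point: Callable[[str], None],   # step 3: mount the desired recovery point
    start_application: Callable[[str], None],      # step 4: re-start the application
    redirect_clients: Callable[[str], None],       # step 5: re-point network clients
) -> Optional[str]:
    """Run the recovery steps in order; return the new server, or None if no failover was needed."""
    if service_is_healthy():
        return None
    target = choose_target_server()
    mount_recovery_point(target)
    start_application(target)
    redirect_clients(target)
    return target
```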
Failover addresses the issue of recovering from the initial failure,
but administrators will generally want to “fail back” to the original server at some point. In a disaster recovery scenario,
depending on the scope of the initial “disaster”, it may take
several weeks or more to get the primary site ready again to
support the IT infrastructure needed for business operations.
During that time, business operations may be running out of
the “disaster recovery site”. When it comes time to fail back,
administrators will want to fail back with all of the latest data
generated from business operations since the original disaster.
The same issues that apply to failover apply to failback as well.
If failback is not automated, it can be a lengthy, risky undertaking. When evaluating application recovery options, selecting a
solution that can automate both failover and failback will minimize both the risk and the downtime associated with
failure scenarios.
Figure 6. Shared disk cluster architecture vs. shared nothing cluster architecture. In shared disk architectures, the same set of physical disks is connected to all cluster nodes, with access to the data on the disks controlled by software. In shared nothing architectures, each cluster node owns its own disk.

For the last 20 years, application failover products built around shared "disk" architectures comprised the lion's share of the server "high availability" market. But in the last several years, HA configurations built around "shared nothing" architectures have started to come to the fore. Shared nothing architectures can support automated failover and failback processes between servers, but are much simpler to configure and manage than shared disk architectures. Source and target servers are connected over a network (LAN or WAN), and some form of replication technology is used to keep the source and target disks in sync. If data corruption is a concern, end users can look for replication products that are integrated with CDP technology.
Solutions that can integrate data and application recovery into
a single, centrally managed product offer some distinct advantages in terms of deployment and ease of use. If they incorporate CDP, application snapshot API integration, asynchronous
replication, and application failover/failback, then they can offer a
comprehensive set of benefits:
• CDP provides a low overhead way to capture data transparently from physical servers that applies extremely well to virtual servers; it can meet stringent RPO and RTO as well as recovery reliability requirements, and it offers the industry's best approach to recovering from logical corruption
• Application snapshot API integration allows the CDP product to "bookmark" application-consistent recovery points within the CDP data stream, giving administrators a variety of application-consistent and crash-consistent recovery options to use for recovery, root cause analysis, or other administrative purposes
• Asynchronous replication extends all the benefits that CDP provides for local data recovery over long distances to remote locations for disaster recovery purposes, and does so without impacting the performance of production applications like synchronous replication would
• Application failover and failback extend rapid, reliable recovery capabilities to applications, as well as offering easy application migration options to address maintenance and change management issues while minimizing production downtime
Data and application recovery are really just different points
along the recovery continuum. If you are going to make a decision to limit your recovery capabilities to just data, make that
decision consciously with a full understanding of the implications.
Approaches that cover data and applications provide more comprehensive solutions that ultimately support faster recovery and
higher overall availability across the range of failure scenarios
likely to be encountered. When recovery capabilities are automated, they become more predictable, easier to incrementally
improve, and ultimately make it easier to manage to service
level agreements that may be in place due to either business or
regulatory mandates.
Conclusion
Due to industry trends like high data growth rates, tougher RPO
and RTO requirements, and the increasing penetration of server
virtualization technology, disk has become a strategic data protection technology that all enterprises should evaluate. Whether
it is deployed tactically or strategically, it offers clear advantages
over backup infrastructures based solely on tape. Tactical deployments will improve data protection operations and capabilities
over the short term, but all enterprises should be considering
how and when they will deploy it strategically to make the move
to more continuous data protection operations. Disk is a required
foundation for the next generation recovery technologies like
CDP, application snapshot API integration, asynchronous replication, and storage capacity optimization that will ultimately
become data protection prerequisites for most enterprises, and
for some already have. When performing the strategic planning
for the recovery architecture that will be required to meet your
needs in the future, don’t forget to extend your definition of
“recovery” to include both data and applications. Rapid, reliable
recovery, both locally and remotely, for both data and applications, is really the baseline requirement going forward to keep
your business running optimally, regardless of what challenges
may lie ahead.
100 Century Center Court, #705, San Jose, CA 95112 | p: 1.800.646.3617 | p: 408.200.3840 | f: 408.588.1590
Web: www.inmage.com
Copyright 2012, InMage Systems, Inc. All Rights Reserved. This document is provided for information purposes only, and the contents hereof are subject to change without notice. The information
contained herein is the proprietary and confidential information of InMage and may not be reproduced or transmitted in any form for any purpose without InMage's prior written permission. InMage is a
registered trademark of InMage Systems, Inc.
2012V1
Email: [email protected] |