How-To: Disaster Recovery Basics Data Sheet: Scalable Website/Application Big Data Solution

Overview
Systems administration and IT management boil down to that proverbial 3:00am phone call. Your
application is down. How do you respond? Having the proper plan and appropriate recovery assets in
place is the key to surviving this all-too-real scenario. How current are your backups? Do you have
standby servers already in place? If not, how quickly can you bring new ones online?
There are three basic strategies you can implement today on GoGrid to make your application better
able to recover from a data center outage:
1. Cold standby
2. Warm standby
3. Full geographic redundancy with multiple active data centers
Recovery Planning
Let’s start off with a definition:
Redundancy: (noun) the ability of an application or system to resist the failure of one or more
constituent parts, or recover quickly from such failure.
GoGrid offers two products that make redundancy easy to implement:
• Cloud Link is a redundant, dark-fiber connection (separate from the public Internet) between
GoGrid’s US data centers. It’s available on a flat monthly subscription basis for a given amount
of bandwidth, starting at 10 Mbps and scaling all the way up to 1 Gbps.
• Cloud Storage is redundant, scalable NAS-based storage available in all three GoGrid data
centers.
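To get a feel for how much Cloud Link bandwidth you might need, it helps to estimate how long a backup takes to cross the link. The sketch below is only a rough calculation: the function name `transfer_hours`, the 500 GB example size, and the 80% efficiency factor are assumptions for illustration, not GoGrid figures. Only the 10 Mbps and 1 Gbps endpoints come from the text above.

```python
def transfer_hours(backup_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to copy a backup across a dedicated link.

    efficiency is an assumed fraction of the nominal bandwidth that the copy
    actually achieves (protocol overhead, contention, and so on).
    """
    if backup_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("sizes and rates must be positive; efficiency in (0, 1]")
    megabits = backup_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 500 GB backup at three subscription levels.
    for mbps in (10, 100, 1000):
        print(f"{mbps:>5} Mbps: {transfer_hours(500, mbps):.1f} hours")
```

If the estimate for a full backup is longer than the gap between backups, that suggests either a larger subscription or a schedule that ships smaller differential backups more often.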
With these building blocks and the goal of having a plan and recovery assets in place to make a smooth
recovery from an outage, let’s look at those three basic disaster recovery solutions, which are defined
according to the state of the data being stored in the secondary site: cold, warm, or hot.
Strategy #1: Cold Standby
Cold standby is the most basic form of geo-redundancy. It involves executing a backup strategy
appropriate to your application—weekly full and daily “diffs” (differential backups), for example—then
copying them across Cloud Link to the secondary data center and storing them in Cloud Storage. This
strategy is called “cold” standby because the data exists in backup form and must be restored to bring
the database online.
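As a minimal sketch of the “weekly full and daily diffs” example, the code below decides which backup to run on a given day and which backups you would restore after an outage. It assumes full backups run on Sundays and that the most recent usable backup is the one taken the day before the outage. Both are illustrative assumptions, and the names `backup_type_for` and `restore_chain` are hypothetical.

```python
from datetime import date, timedelta

FULL_BACKUP_WEEKDAY = 6  # Sunday in date.weekday() terms (Monday is 0); assumed


def backup_type_for(day: date) -> str:
    """Return which kind of backup the assumed schedule runs on this day."""
    return "full" if day.weekday() == FULL_BACKUP_WEEKDAY else "differential"


def restore_chain(outage_day: date) -> list:
    """List the backups to restore, oldest first, after an outage on outage_day.

    A differential backup holds every change since the last full backup, so the
    chain is at most two items: the latest full plus the latest differential.
    Assumes the most recent usable backup was taken the day before the outage.
    """
    last_backup_day = outage_day - timedelta(days=1)
    full_day = last_backup_day
    while backup_type_for(full_day) != "full":
        full_day -= timedelta(days=1)
    chain = [(full_day, "full")]
    if last_backup_day != full_day:
        chain.append((last_backup_day, "differential"))
    return chain


if __name__ == "__main__":
    for backup_day, kind in restore_chain(date(2013, 11, 27)):
        print(backup_day.isoformat(), kind)
```

The short restore chain is the reason differential backups suit a cold standby: you copy and restore at most two files per database, no matter how many days have passed since the last full backup.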
You’ll need:
1. A small Cloud Server in the secondary data center to create the link to the remote Cloud
Storage.
2. A simple web page (stored on the same server) with “Under Maintenance” messaging to be
displayed while you’re bringing your application back online.
3. “Gold Masters” for each server type in your application. Use GoGrid Server Image (GSI)
functionality to create these GSIs and store them in Cloud Storage. You can use them later to
templatize the deployment of Cloud Servers.
Figure 1: Cold Standby – Primary environment in West data center; backups shipped via Cloud Link to East data center.
Here’s how it works:
1. To execute the failover, change your application’s public DNS to point to the secondary data
center. This process will take some time to propagate, but as end users get the updated DNS
record, they’ll see the “Under Maintenance” page rather than an error.
2. In the meantime, you’re spinning up a database server and application servers from your GSIs,
restoring the data, and reconstituting your environment.
Figure 2: Recovery in Cold Standby – Spin up servers from GSIs and restore the database from backup
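How long the DNS change takes to propagate depends largely on the time-to-live (TTL) of the record you change. The sketch below is a deliberately simple model: it assumes every resolver honors the TTL and that cached copies were fetched at evenly spread times before the change. Real resolvers may behave differently, and the function name `fraction_switched` and the 300-second TTL are hypothetical.

```python
def fraction_switched(elapsed_s: float, ttl_s: float) -> float:
    """Estimate the share of users who already see the updated DNS record.

    Under the assumed model, cached copies of the old record expire at a
    steady rate, so the share grows linearly and reaches 1.0 after one TTL.
    """
    if ttl_s <= 0:
        raise ValueError("ttl_s must be positive")
    if elapsed_s <= 0:
        return 0.0
    return min(1.0, elapsed_s / ttl_s)


if __name__ == "__main__":
    ttl = 300  # hypothetical 5-minute TTL on the application's record
    for elapsed in (60, 150, 300):
        share = fraction_switched(elapsed, ttl)
        print(f"after {elapsed:>3} s: about {share:.0%} of users switched")
```

Under this model, lowering the TTL on your public record ahead of time shortens the window during which users still reach the failed data center.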
Pluses
• Speed – Recovery from a catastrophic outage can occur within a single day for most of your
customers.
© Copyright 2013 GoGrid. All rights reserved. Various trademarks held by their respective owners.
Minuses
• Downtime – Depending on the complexity of the application, the volume of data to be restored,
and DNS propagation time, application downtime could stretch to 12–48 hours. However, this is
definitely less time than if you needed to reconstitute from traditional offsite backups in a
totally new data center. And if you didn’t have offsite backups at all, it might take weeks to get
back online and you might never be able to fully recover from the outage.
• Data Loss – The biggest “gotcha” in this scenario is data loss. The database restore is to the last
backup. Any data captured between the time of the last backup and the outage will be lost if the
original data at the primary data center is unrecoverable.
Despite these limitations, cold standby is a viable, entry-level disaster recovery strategy.
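To make the data-loss window concrete, the sketch below measures the gap between the last completed backup and the outage. The timestamps are made up for illustration, and `data_loss_window` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def data_loss_window(backup_times: Iterable[datetime],
                     outage_time: datetime) -> Optional[timedelta]:
    """Return how much recent data is at risk if the primary is unrecoverable.

    The window runs from the last backup completed at or before the outage up
    to the outage itself. Returns None when no backup predates the outage.
    """
    completed = [t for t in backup_times if t <= outage_time]
    if not completed:
        return None
    return outage_time - max(completed)


if __name__ == "__main__":
    # Hypothetical nightly backups at 02:00 and an outage the next afternoon.
    backups = [datetime(2013, 11, 25, 2, 0), datetime(2013, 11, 26, 2, 0)]
    outage = datetime(2013, 11, 26, 15, 30)
    print("Data at risk:", data_loss_window(backups, outage))
```

With daily backups the window can approach 24 hours, which is the main reason to consider a warm standby.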
Strategy #2: Warm Standby
Warm standby is a substantially better, but only slightly more advanced, form of geo-redundancy. In this
scenario, you synchronize live data to a standby database server in the secondary data center via
replication, log-shipping, or database mirroring. This process is called “warm” standby because you have
warm data available—data that is ready to go. You still need to save GSIs to Cloud Storage for the
application servers that would need to be spun up. It’s also a good idea to have simple “Under
Maintenance” messaging in place and ready to display while you’re bringing your application back
online.
Figure 3: Warm Standby – Primary environment in West data center; live data synchronized to standby database server
in East via Cloud Link.
Here’s how it works:
1. To execute the failover, change your application’s public DNS to point to the secondary data
center. The DNS change will take some time to propagate, but as end users get the updated DNS
record, they’ll see the “Under Maintenance” page (if you created one), rather than an error in
their browser.
2. In the meantime, you’re spinning up application servers from your GSIs and reconstituting your
environment. Spinning up only application servers is going to be much faster than if you needed
to provision a database server and restore data as well, making your recovery time
correspondingly faster.
3. If your public DNS provider supports it, you could even set up an automatic failover upon loss of
connectivity to your primary data center. In this scenario, rather than simple “Under
Maintenance” messaging, you could have a portion of the application environment in place to
deliver your application in an over-subscribed state on a short-term basis until it is augmented
with additional application servers, spun up from GSIs, to the point where it can comfortably
handle the full application load.
Figure 4: Recovery in Warm Standby – Spin up application servers from GSIs; database is already in place
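The automatic failover described in step 3 usually rests on a simple rule: declare the primary down only after several health checks in a row have failed, so that a single dropped probe does not trigger a switch. The sketch below illustrates that rule only. The class name `FailoverMonitor` and the threshold of three are assumptions, and a real DNS provider's monitoring will have its own settings.

```python
class FailoverMonitor:
    """Track health-check results and decide when to fail over.

    Failover triggers once, on the probe that brings the run of consecutive
    failures up to the threshold. Any successful probe resets the run.
    """

    def __init__(self, failure_threshold: int = 3) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.failed_over = False

    def record(self, healthy: bool) -> bool:
        """Record one probe; return True if failover should start now."""
        if self.failed_over:
            return False  # already switched to the secondary data center
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.failed_over = True
            return True
        return False


if __name__ == "__main__":
    monitor = FailoverMonitor(failure_threshold=3)
    # Hypothetical probe results: one blip, then a sustained outage.
    for probe in (True, False, True, False, False, False):
        if monitor.record(probe):
            print("Primary unreachable: point DNS at the secondary data center")
```

A higher threshold reduces false alarms but delays the switch by a few probe intervals, so the right value depends on how often you probe.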
Pluses
• Speed – Recovery from a catastrophic outage can occur in as little as 1–2 hours, with minimal
data loss. With DNS failover and a partial application environment in the standby data center,
recovery could take only minutes.
Minuses
• Downtime – There will still be downtime, but far less than in the cold standby model, even with
a manual DNS switch-over.
• Cost – The cost to implement warm standby is greater, in both dollars and engineering
resources, but not excessively so. As discussed, there are different levels of warm standby, so
there is a tradeoff between cost/complexity and recovery time.
Strategy #3: Full Geo-Redundancy (Hot Secondary)
The gold standard for geographically redundant disaster recovery is to have an active/active data center
deployment. With DNS load balancing, both application environments serve end users simultaneously.
Users are routed to the nearest available data center, which should provide improved application
performance. The databases are both active and taking application data simultaneously, so they must be
synchronized via master-master replication to keep each environment aware of the other’s data changes.
Figure 5: Full geographic redundancy – Active/active data centers with public traffic DNS-balanced between them and
live data being synchronized on the back end bi-directionally via Cloud Link.
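The routing rule described above, “nearest available data center,” can be sketched in a few lines. Here “nearest” is approximated by measured latency, which is an assumption, as are the function name `pick_data_center` and the sample latencies. A DNS load-balancing service would make this decision for you.

```python
from typing import Dict, Optional, Set


def pick_data_center(latency_ms: Dict[str, float], healthy: Set[str]) -> Optional[str]:
    """Return the healthy data center with the lowest latency for a user.

    Returns None when no data center is healthy.
    """
    candidates = {dc: ms for dc, ms in latency_ms.items() if dc in healthy}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical latencies for a user on the US West Coast.
    user_latency = {"west": 20.0, "east": 80.0}
    print(pick_data_center(user_latency, {"west", "east"}))  # both sites up
    print(pick_data_center(user_latency, {"east"}))          # West is offline
```

The same rule covers normal operation and failover: when a data center drops out of the healthy set, its users are simply routed to the next-nearest one.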
Here’s how it works:
1. In this scenario, you store “Gold Master” GSIs in Cloud Storage in both data centers. In the event
one data center goes offline, you can use the GSIs to spin up additional capacity in the remaining
one to serve the increased application load in that data center.
2. Failover is automatic. The DNS provider runs “keep-alive” monitors that detect when a data
center has gone offline, at which point the DNS servers stop sending traffic to it.
Figure 6: Automatic recovery in full geographic-redundancy – DNS detects outage at West data center and directs traffic
to East; spin up additional application servers from GSIs, as needed.
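To judge how many application servers to spin up from GSIs in the surviving data center, a back-of-the-envelope calculation is often enough. The sketch below assumes you know your total request rate and roughly what one server can handle. The function name `extra_servers_needed`, the 25% headroom, and the sample numbers are all hypothetical.

```python
import math


def extra_servers_needed(total_rps: float, per_server_rps: float,
                         servers_running: int, headroom: float = 0.25) -> int:
    """Return how many more servers the surviving data center should add.

    total_rps is the full application load now landing on one data center,
    per_server_rps is the load one application server can handle, and
    headroom is an assumed safety margin on top of the measured load.
    """
    if per_server_rps <= 0 or total_rps < 0 or servers_running < 0 or headroom < 0:
        raise ValueError("per_server_rps must be positive; other inputs non-negative")
    required = math.ceil(total_rps * (1 + headroom) / per_server_rps)
    return max(0, required - servers_running)


if __name__ == "__main__":
    # Hypothetical: 1,000 requests/s in total, 100 requests/s per server,
    # and 5 servers already running in the surviving data center.
    print("Servers to add:", extra_servers_needed(1000, 100, 5))
```

Running this calculation in advance also tells you whether the surviving data center can carry the load in an over-subscribed state while the new servers come online.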
Pluses
• Availability – With this scenario, you enjoy the highest-possible availability and automatic
recovery from an outage with no (or nearly no) downtime.
Minuses
• Cost & Resources – A solution with multiple active data centers is going to cost more than the
two standby strategies discussed previously (cold and warm), and it will be more demanding of
engineering resources to implement. For applications that cannot tolerate downtime, however,
the tradeoff is worthwhile: higher levels of availability and redundancy simply cost more.
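One way to weigh the three strategies is to start from the downtime your application can tolerate and pick the least costly option that fits. The sketch below uses the worst-case recovery times quoted in this document (48 hours for cold standby, 2 hours for warm standby) and treats full geo-redundancy as zero downtime, which is a simplification of “no (or nearly no) downtime.” The cost ordering follows the text, and the function name `cheapest_strategy` is hypothetical.

```python
# (strategy, worst-case recovery time in hours), ordered from least to most costly.
STRATEGIES = [
    ("cold standby", 48.0),
    ("warm standby", 2.0),
    ("full geo-redundancy", 0.0),  # simplification: treated as no downtime
]


def cheapest_strategy(max_downtime_hours: float) -> str:
    """Return the least costly strategy whose worst case fits the target."""
    if max_downtime_hours < 0:
        raise ValueError("max_downtime_hours cannot be negative")
    for name, worst_case_hours in STRATEGIES:
        if worst_case_hours <= max_downtime_hours:
            return name
    return STRATEGIES[-1][0]  # not reached while the last entry is 0.0


if __name__ == "__main__":
    for target in (72, 4, 0.5):
        print(f"Tolerate {target} h of downtime: {cheapest_strategy(target)}")
```

This sketch looks only at downtime. Your tolerance for data loss matters as well, and it may push you toward warm standby or full geo-redundancy even when a longer outage would be acceptable.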
Conclusion
With the appropriate plan and recovery assets in place, you can smoothly recover from an outage and
minimize or even eliminate downtime altogether. This document outlined three strategies—from the
entry-level to the gold standard—for implementing a geographically redundant disaster-recovery
solution. GoGrid has provided the tools to add geographic redundancy, with Cloud Link linking its two US
data centers via redundant dark fiber connections and Cloud Storage providing a secure, scalable
storage repository for recovery assets. GoGrid customer Martini Media implemented a disaster
recovery solution as part of its Big Data implementation. You can read more about this success story in
the case study.
Note: This document is based on a blog post by Scott Pankonin.
About GoGrid
GoGrid enables companies to evaluate and run multiple, on-demand big data solutions quickly, simply,
reliably, securely, and cost-effectively. As the leader in Open Data Services (ODS), GoGrid is committed
to delivering purpose-built, non-opinionated Big Data solutions and services for the management and
integration of open source, commercial, and proprietary technologies across multiple platforms. With
over 15,000 customers and over 600,000 VMs deployed, GoGrid has pioneered cloud infrastructure for
more than a decade for companies like Condé Nast, Merkle, and Preventice. For more information,
please visit www.GoGrid.com.
HT_DR-Basics_20131127