Big Data Solution
How-To: Disaster Recovery Basics
Data Sheet: Scalable Website/Application

Overview

Systems administration and IT management boil down to that proverbial 3:00 a.m. phone call. Your application is down. How do you respond? Having the proper plan and appropriate recovery assets in place is the key to surviving this all-too-real scenario. How current are your backups? Do you have standby servers already in place? If not, how quickly can you bring new ones online?

There are three basic strategies you can implement today on GoGrid to make your application better able to recover from a data center outage:

1. Cold standby
2. Warm standby
3. Full geographic redundancy with multiple active data centers

Recovery Planning

Let’s start off with a definition:

Redundancy: (noun) the ability of an application or system to resist the failure of one or more constituent parts, or recover quickly from such failure.

GoGrid offers two products that make this process easy to implement:

• Cloud Link is a redundant, dark-fiber connection (separate from the public Internet) between GoGrid’s US data centers. It’s available on a flat monthly subscription basis for a given amount of bandwidth, starting at 10 Mbps and scaling all the way up to 1 Gbps.
• Cloud Storage is redundant, scalable NAS-based storage available in all three GoGrid data centers.

With these building blocks and the goal of having a plan and recovery assets in place to make a smooth recovery from an outage, let’s look at those three basic disaster recovery solutions, which are defined according to the state of the data being stored in the secondary site: cold, warm, or hot.

Strategy #1: Cold Standby

Cold standby is the most basic form of geo-redundancy. It involves executing a backup strategy appropriate to your application (weekly full backups and daily differential backups, or “diffs,” for example), then copying the backups across Cloud Link to the secondary data center and storing them in Cloud Storage.
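To make the weekly-full-plus-daily-diff example concrete, here is a minimal sketch of how you might work out which backup files to restore after an outage. It assumes the usual meaning of a differential backup (all changes since the most recent full backup), so a restore needs only the latest full backup plus the latest diff taken after it. The names Backup, restore_chain, and data_loss_window are hypothetical and are not part of any GoGrid tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Backup:
    kind: str           # "full" or "diff" (hypothetical labels)
    taken_at: datetime  # when the backup finished


def restore_chain(backups: List[Backup], outage_at: datetime) -> List[Backup]:
    """Return the backups to restore, in order: latest full, then latest diff."""
    # Only backups completed before the outage are usable.
    usable = [b for b in backups if b.taken_at <= outage_at]
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup available before the outage")
    latest_full = max(fulls, key=lambda b: b.taken_at)

    # A differential holds every change since the last full backup,
    # so only the newest diff taken after that full is needed.
    diffs = [
        b for b in usable
        if b.kind == "diff" and b.taken_at > latest_full.taken_at
    ]
    chain = [latest_full]
    if diffs:
        chain.append(max(diffs, key=lambda b: b.taken_at))
    return chain


def data_loss_window(chain: List[Backup], outage_at: datetime) -> timedelta:
    """Time between the last restorable backup and the outage."""
    return outage_at - chain[-1].taken_at
```

The second function returns the span of data that cannot be recovered from backups alone, which is the main cost of relying on backups rather than live replication.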
This strategy is called “cold” standby because the data exists in backup form and must be restored to bring the database online.

You’ll need:

1. A small Cloud Server in the secondary data center to create the link to the remote Cloud Storage.
2. A simple web page (stored on the same server) with “Under Maintenance” messaging to be displayed while you’re bringing your application back online.
3. “Gold Masters” for each server type in your application. Use GoGrid Server Image (GSI) functionality to create these images and store them in Cloud Storage. You can use them later as templates for deploying Cloud Servers.

Figure 1: Cold Standby – Primary environment in West data center; backups shipped via Cloud Link to East data center.

Here’s how it works:

1. To execute the failover, change your application’s public DNS to point to the secondary data center. This change will take some time to propagate, but as end users get the updated DNS record, they’ll see the “Under Maintenance” page rather than an error.
2. In the meantime, you spin up a database server and application servers from your GSIs, restore the data, and reconstitute your environment.

Figure 2: Recovery in Cold Standby – Spin up servers from GSIs and restore the database from backup

Pluses

• Speed – Recovery from a catastrophic outage can occur within a single day for most of your customers.

© Copyright 2013 GoGrid. All rights reserved. Various trademarks held by their respective owners.

Minuses

• Downtime – Depending on the complexity of the application, the volume of data to be restored, and DNS propagation time, application downtime could stretch to 12–48 hours. However, this is still less time than you would need to reconstitute from traditional offsite backups in a totally new data center. And if you didn’t have offsite backups at all, it might take weeks to get back online, and you might never be able to fully recover from the outage.
• Data Loss – The biggest “gotcha” in this scenario is data loss. The database is restored only to the point of the last backup. Any data captured between the last backup and the outage will be lost if the original data at the primary data center is unrecoverable.

Despite these limitations, cold standby is a viable, entry-level disaster recovery strategy.

Strategy #2: Warm Standby

Warm standby is a substantially better, but only slightly more advanced, form of geo-redundancy. In this scenario, you synchronize live data to a standby database server in the secondary data center via replication, log shipping, or database mirroring.

This process is called “warm” standby because you have warm data available: data that is ready to go. You still need to save GSIs to Cloud Storage for the application servers that would need to be spun up. It’s also a good idea to have simple “Under Maintenance” messaging in place and ready to display while you’re bringing your application back online.

Figure 3: Warm Standby – Primary environment in West data center; live data synchronized to standby database server in East via Cloud Link.

Here’s how it works:

1. To execute the failover, change your application’s public DNS to point to the secondary data center. The DNS change will take some time to propagate, but as end users get the updated DNS record, they’ll see the “Under Maintenance” page (if you created one) rather than an error in their browser.
2. In the meantime, you spin up application servers from your GSIs and reconstitute your environment. Spinning up only application servers is much faster than provisioning a database server and restoring data as well, so your recovery time is correspondingly shorter.
3. If your public DNS provider supports it, you could even set up an automatic failover upon loss of connectivity to your primary data center.
In this scenario, rather than simple “Under Maintenance” messaging, you could have a portion of the application environment in place to deliver your application in an over-subscribed state on a short-term basis. You would then augment it with additional application servers, spun up from GSIs, until it can comfortably handle the full application load.

Figure 4: Recovery in Warm Standby – Spin up application servers from GSIs; database is already in place

Pluses

• Speed – Recovery from a catastrophic outage can occur in as little as 1–2 hours, with minimal data loss. With DNS failover and a partial application environment in the standby data center, recovery could take only minutes.

Minuses

• Downtime – There will still be downtime, but far less than in the cold standby model, even with a manual DNS switch-over.
• Cost – The cost to implement warm standby is greater, in both dollars and engineering resources, but not excessively so. As discussed, there are different levels of warm standby, so there is a tradeoff between cost/complexity and recovery time.

Strategy #3: Full Geo-Redundancy (Hot Secondary)

The gold standard for geographically redundant disaster recovery is an active/active data center deployment. With DNS load balancing, both application environments serve end users simultaneously. Users are routed to the nearest available data center, which should provide improved application performance. The databases are both active and taking application data simultaneously, so they must be synchronized via master-master replication to keep each environment aware of the other’s data changes.

Figure 5: Full geographic redundancy – Active/active data centers with public traffic DNS-balanced between them and live data being synchronized on the back end bi-directionally via Cloud Link.

Here’s how it works:

1.
In this scenario, you store “Gold Master” GSIs in Cloud Storage in both data centers. If one data center goes offline, you can use the GSIs to spin up additional capacity in the remaining one to serve the increased application load there.
2. Failover is automatic. The DNS provider has “keep-alive” monitors that detect when a data center has gone offline, and the DNS servers then stop sending traffic to it.

Figure 6: Automatic recovery in full geographic redundancy – DNS detects outage at West data center and directs traffic to East; spin up additional application servers from GSIs, as needed.

Pluses

• Availability – With this scenario, you enjoy the highest possible availability and automatic recovery from an outage with no (or nearly no) downtime.

Minuses

• Cost & Resources – A solution with multiple active data centers is going to cost more than the two standby strategies discussed previously (cold and warm), and it will be more demanding of engineering resources to implement. The tradeoff is definitely worthwhile, however; greater levels of availability and redundancy are going to cost more.

Conclusion

With the appropriate plan and recovery assets in place, you can smoothly recover from an outage and minimize or even eliminate downtime altogether. This document outlined three strategies, from entry level to gold standard, for implementing a geographically redundant disaster recovery solution. GoGrid has provided the tools to add geographic redundancy, with Cloud Link linking its two US data centers via redundant dark-fiber connections and Cloud Storage providing a secure, scalable storage repository for recovery assets.

GoGrid customer Martini Media implemented a disaster recovery solution as part of its Big Data implementation. You can read more about this success story in the case study.
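As a rough way to put the three strategies side by side, the sketch below picks the least expensive one whose worst-case downtime fits what your application can tolerate. The downtime figures are the estimates quoted in this document (up to 48 hours for cold standby, up to 2 hours for warm standby with a manual DNS change), and full geo-redundancy is treated as zero downtime, which is a simplification of “no (or nearly no)”. The cost ranks only encode the ordering described above, not actual prices, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_case_downtime_hours: float  # estimate taken from this document
    cost_rank: int                    # 1 = least expensive (ordering only)


# Assumed figures: the upper ends of the ranges quoted in the text.
STRATEGIES: List[Strategy] = [
    Strategy("cold standby", 48.0, 1),
    Strategy("warm standby", 2.0, 2),
    Strategy("full geo-redundancy", 0.0, 3),
]


def cheapest_strategy(max_downtime_hours: float) -> Strategy:
    """Pick the lowest-cost strategy whose worst case fits the tolerance."""
    candidates = [
        s for s in STRATEGIES
        if s.worst_case_downtime_hours <= max_downtime_hours
    ]
    if not candidates:
        raise ValueError("no strategy meets the requested downtime tolerance")
    return min(candidates, key=lambda s: s.cost_rank)
```

For example, an application that can tolerate four hours of downtime would land on warm standby, while one that can tolerate only minutes would need full geo-redundancy. A real decision would also weigh how much data loss is acceptable, which this sketch leaves out.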
Note: This document is based on a blog post by Scott Pankonin.

About GoGrid

GoGrid enables companies to evaluate and run multiple, on-demand big data solutions quickly, simply, reliably, securely, and cost-effectively. As the leader in Open Data Services (ODS), GoGrid is committed to delivering purpose-built, non-opinionated Big Data solutions and services for the management and integration of open source, commercial, and proprietary technologies across multiple platforms. With over 15,000 customers and over 600,000 VMs deployed, GoGrid has pioneered cloud infrastructure for more than a decade for companies like Condé Nast, Merkle, and Preventice. For more information, please visit www.GoGrid.com.

HT_DR-Basics_20131127