NEODBUG – November 21st, 2013 Bad Things Happen to Good People: How to Minimize the Pain DB2 for Linux, UNIX, and Windows Kelly Schlamb ([email protected]) Executive IT Specialist, Worldwide Information Management Technical Sales IBM Canada Ltd. © 2013 IBM Corporation Disclaimer © Copyright IBM Corporation 2013. All rights reserved. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM'S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE. IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. IBM, the IBM logo, ibm.com, Information Management, DB2, DB2 Connect, DB2 OLAP Server, pureScale, System Z, Cognos, solidDB, Informix, Optim, InfoSphere, and z/OS are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml Other company, product, or service names may be trademarks or service marks of others. 2 © 2013 IBM Corporation Logging and Recovery October 28, 2013 © 2013 IBM Corporation What Could Possibly Go Wrong? Famous Last Words. As the Boy Scouts say … "Be prepared!" Database server goes down (planned or unplanned) Underlying storage becomes unusable Site disaster Database is accidentally dropped Table is accidentally dropped or data deleted 4 Hardware component failure Table space is accidentally dropped © 2013 IBM Corporation Develop a Recovery Plan Some questions to ask yourself when developing a recovery plan – Does the database need to be recoverable? – What are your RPO and RTO requirements? • Recovery Point Objective – how much data loss is acceptable, if any, if a major incident occurs? • Recovery Time Objective – what is an acceptable length of time to perform the recovery while the system is unavailable? – – – – How frequent do the backup operations need to be? How much storage space can be used for backups and archived logs? Where do you want the backups and archived logs to go? Will table space level backups be sufficient, or will full database backups be necessary? – Do the backups need to be automated? – Is high availability (HA) a consideration? – Is off-site disaster recovery (DR) a consideration? 5 © 2013 IBM Corporation Logging & Log File Management Recommendations Use archive logging (not circular logging) for production environments – Provides better recovery characteristics (lower RPO) – Permits use of online backups (better availability) and table space level backups (more granular) – Can setup two archive log paths for best protection Include logs in backup images (default behavior) – Allows restoring of an online backup if all you have is the backup image (e.g. disaster recovery) Configure mirrored logging on separate file systems – Protects against file system corruption or accidental deletion of log files For archive logging, consider using automated log file management – Choose how long to retain recovery objects like backups and archived logs and when to automatically prune them – See NUM_DB_BACKUPS, REC_HIS_RETENTN, and AUTO_DEL_REC_OBJ database configuration parameters 6 © 2013 IBM Corporation Backup & Recovery Various methods and options available for backing up and recovering the data in your databases – – – – – – Offline or online backups Database or table space backups Split mirror / flash copy backups Incremental backups Rebuild database from table space backup images Backup compression As database size increases, consider using more frequent, online table space level backups Built-in autonomics for backup command provides optimal values for number of buffers, buffer size, and parallelism – Values calculated based on amount of utility heap memory available, the number of processors available, and the database configuration Automatic backup maintenance can be enabled 7 © 2013 IBM Corporation Automatic Backups Automatic database backups simplify backup management by ensuring that recent full backup of database is performed The need to perform a backup is based on one or more of the following criteria – You have never created a full database backup – The time elapsed since the last full backup is more than a specified number of hours – The transaction log space consumed since the last backup is more than a specified number of 4 KB pages (in archive logging mode only). Backups can be configured to be offline or online Supports disk, tape, TSM, and vendor DLL media types Feature is enabled/disabled by using auto_db_backup and auto_maint database configuration parameters Policies can be defined via SYSPROC.AUTOMAINT_SET_POLICY and SYSPROC.AUTOMAINT_SET_POLICYFILE stored procedures 8 © 2013 IBM Corporation Incremental Backups Incremental backups contain data changed since previous backups (depending on the type of backup – incremental or delta) – In addition to data, each incremental backup image also contains all of the database metadata Incremental (cumulative) backup image – Contains all database data that has changed since the most recent, successful, full backup operation – Cumulative because a series of incremental backups taken over time will each have the contents of the previous incremental backup image Delta backup image – Contains a copy of all database data that has changed since the last successful backup (full, incremental, or delta) of the table space in question Combinations of database and table space incremental backups are permitted, in both online and offline modes of operation To restore a database or table space to a consistent state, the recovery process must begin with a consistent image of the object to be restored, followed by the application of the appropriate incremental backup images TRACKMOD database configuration parameter must be set to YES 9 © 2013 IBM Corporation Incremental Backups: Examples Weekly full backups, daily incremental (cumulative) backups: Weekly full backups, daily delta backups: Weekly full backups (Sunday), incremental (cumulative) backup half-way through week, delta backups all other days: Recovery required (prior to Saturday's delta being taken): 10 © 2013 IBM Corporation Database Rebuild Automatic restoring of necessary backup images Rebuild a database from table space backup images – Means no longer having to take as many full database backups, which is becoming less possible as databases grow in size – Instead, take more frequent table space backups In a recovery situation, if you need to bring a subset of table spaces online faster than others, you can do a partial database rebuild – May also be used for • Creating a separate database for QA purposes • Data recovery purposes Can choose which table spaces to restore as part of the rebuild – – – – 11 All table spaces in the database at a time that the backup was taken All table spaces included in a selected table space backup A specific list of table spaces specified as part of the restore command All table spaces except those in the specific list provided © 2013 IBM Corporation Database Disaster Recovery using Rebuild Necessary table spaces will be restored automatically based on recovery history data Recovery History File SalesDB SYSCATSPACE 2 SMS01 DMS01 3 Backup Database SALESDB SUN Backup TS DMS01 MON TUE Backup TS DMS02 WED THU DMS02 1 Backup TS DMS01 FRI Database Fails ! SAT 1. RESTORE DB SALESDB REBUILD WITH ALL TABLESPACES IN DATABASE TAKEN AT <Friday> (DB2 restores Friday, Sunday and Thursday) 2. ROLLFORWARD DB SALESDB TO END OF LOGS AND STOP 12 © 2013 IBM Corporation Database Recovery - Rebuild from Table Space Backups Database recovery using multiple table space backups Recovery History File SalesDB SYSCATSPACE 2 Backup TS DMS02 DMS01 SUN TUE DMS01 3 DMS02 1 Backup TS SMS01 SYSCATSPACE MON SMS01 Backup TS DMS02 DMS01 WED THU Backup TS DMS01 FRI Database Fails ! SAT 1. RESTORE DB SALESDB REBUILD WITH ALL TABLESPACES IN DATABASE TAKEN AT <Friday> (DB2 restores Friday, Tuesday and Thursday) 2. ROLLFORWARD DB SALESDB TO END OF LOGS AND STOP 13 © 2013 IBM Corporation Database Partial Copy using Rebuild Can create a database copy with a subset of table spaces SalesDB SalesDB SYSCATSPACE 2 DMS01 1. DMS01 DMS02 1 Backup Database SALESDB SUN SYSCATSPACE SMS01 Backup TS DMS01 MON TUE Backup TS DMS02 WED THU Backup TS DMS01 FRI Copy for testing SAT RESTORE DB SALESDB REBUILD WITH TABLESPACE(SYSCATSPACE,DMS01) TAKEN AT Friday (DB2 Restores Friday, and Sunday (just SYSCATSPACE TS) OR RESTORE DB SALESDB REBUILD WITH ALL TABLESPACES IN DATABASE EXCEPT TABLESPACE(SMS01,DMS02) TAKEN AT Friday 2. ROLLFORWARD DB SALESDB TO END OF LOGS AND STOP 14 © 2013 IBM Corporation Transportable Schemas Efficient schema movement between databases Using a backup image as the source, allows you to copy a set of table spaces and SQL schemas from one database into another A database schema must be transported in its entirety – If a table space contains both the schema you want to transport, as well as another schema, you must transport all data objects from both schemas – These self contained (from a table space perspective) sets of schemas that have no references to other database schemas are called transportable sets Restore will do multiple operations under the covers – – – – Restore SYSCATSPACE and specified table spaces from backup image Roll them forward to a point of consistency Validate the schemas specified Transfer ownership of the specified table spaces (including containers) to the target database – Recreate the schema in the target database 15 © 2013 IBM Corporation Transportable Schemas: Example Not a valid transportable set Valid transportable set The database contains the following valid transportable sets: mydata1: schema1 + schema2 mydata2 + myindex: schema3 multidata1 + multiuser2 + multiindex1: schema4 + schema5 Any combination of the above transportable sets To Move All Table Spaces: restore db old_db \ tablespace (“mydata1”,”mydata2”,”myindex”,”multidata1”,”multiindex1”,”multiuser2”)\ schema (“schema1”,”schema2”,”schema3”,”schema4”,”schema5”) transport into new_db 16 © 2013 IBM Corporation DB2's Dropped Table Recovery Feature Recover the contents of a dropped table using DB2's table space restore and rollforward operations – When rolling forward through the drop of the table, the data is exported prior to the replay of the drop Requires that the table space be enabled for dropped table recovery – Enabled by default at table space creation time When a table is dropped, an entry is made in the transaction log files as well as in the recovery history file You can recover a dropped table by doing the following: 1. Identify the dropped table by invoking the LIST HISTORY DROPPED TABLE command 2. Restore a database- or table space-level backup image taken before the table was dropped 3. Create an export directory to which files containing the table data are to be written 4. Roll forward to a point in time after the table was dropped (or to end of logs) by using the RECOVER DROPPED TABLE parameter on the ROLLFORWARD DATABASE command 5. Re-create the table by using the CREATE TABLE statement from the recovery history file 6. Import the table data that was exported during the rollforward operation into the table 17 © 2013 IBM Corporation Proactive Checking of Database and Recovery Objects db2dart – Database analysis and reporting tool – Examines database for architectural correctness INSPECT – Database inspection command – Inspects database for architectural integrity – Can be run while the database is online db2ckbkp – Check backup command – Test the integrity of a backup image db2cklog – Check archive log file validity command – Check the validity of archive log files in order to determine whether or not the log files can be used during rollforward recovery of a database or table space 18 © 2013 IBM Corporation IBM DB2 Recovery Expert Granular and Flexible Data Recovery Part of the DB2 Advanced Recovery Feature Faster – Simplifies and optimizes database recovery by reducing disruption during the recovery process • DBAs can quickly restore or correct erroneous data – Log Analysis enables organizations to monitor changes that allow for quick recovery Smarter – Provides intelligent analysis of DB2 and DB2 recovery assets to find the most efficient recovery path Simpler – Facilitates process of rebuilding data assets to a specified point in time, often without taking operations offline “AFS is establishing a disaster recovery policy with our Vision Application. DB2 Recovery Expert provides us with the functionality to roll back both databases to a point where the tables are consistent. This will help us meet 100% of our needs for this project. The product itself is awesome and WEB UI is very nice.” -Kirk B. Spadt, Principal Architect, Automated Financial Systems 19 © 2013 IBM Corporation IBM DB2 Merge Backup (MBK) Part of the DB2 Advanced Recovery Feature Merge full DB2 backups with DB2 incremental/delta backups to build a new up-to-date full DB2 backup image Eliminates the need to take DB2 full backups Use online or offline, and table space or database backup images Split out table space backup images from a full backup image Run MBK on the database server or on a standalone remote machine Benefits include – Reduce backup intensive resources on the database server – Reduce backup storage footprint • Eliminate full DB2 backups by backing up only what needs to be backed up – Reduce number of objects required during recovery • Simplify recovery process through restore of up-to-date full backup images • Speed up recovery with up-to-date more recent full backup images 20 © 2013 IBM Corporation Merge Backup Overview Time Full DB2 Backup from DB2 DB2 Incremental Backups Merge Backup Processing Full DB2 Backup from MBK 21 © 2013 IBM Corporation Eliminate the Need to Take Full DB2 Backups MBK Full DB2 Full DB2 Backup Backup MBK MBK ... 22 Full DB2 Full DB2 Backup Backup © 2013 IBM Corporation High Availability and Disaster Recovery October 28, 2013 © 2013 IBM Corporation What Does Downtime Mean To Your Business? Damaged DamagedReputation Reputation Negative NegativePress Press Loss Lossof ofMarket MarketShare Share to Competitors to Competitors Customer Customer Dissatisfaction Dissatisfaction Lost LostRevenue Revenue Availability Downtime per Year 99.999% 5 minutes 99.99% 52 minutes 99.9% 8 hours, 45 minutes 99% 3.65 days 95% 18.25 days 64% of organizations require that their databases deliver a minimum of 99.99% or better uptime for their most mission critical applications * 24 Breach Breachof of SLA/Contract SLA/Contract Reduced ReducedIT/Employee IT/Employee Productivity Productivity Reduced ReducedEnd-User End-User Productivity Productivity Reduced ReducedCompany/ Company/ Shareholder ShareholderValue Value * ITIC Paper "Two-Thirds of Corporations Now Require 99.99% Database Uptime, Reliability ", Laura DiDio, July 10th, 2013 © 2013 IBM Corporation DB2 Has the Right Solution to Meet Your HA Needs HADR HADR 25 CF CF pureScale Integrated Clustering © 2013 IBM Corporation High Availability and/or Disaster Recovery Options Server failover – Shared disk or remote disk mirroring HADR – – – – High Availability and/or Disaster Recovery Easy to set up and manage Automatic failover with TSA integration Fast failover Q-Replication – Flexible – can handle database subsets – Active/Active – Asynchronous pureScale (Active / Active) – Continuous Availability – Load Balancing – Easy to set up and manage 26 © 2013 IBM Corporation Server-Based Failover DB2 ships with an integrated TSA cluster manager – – – – tx tx Node Failure Detection Disk takeover IP takeover Restart DB2 Management framework included to keep the cluster topology in sync Active Server 27 © 2013 IBM Corporation Storage Replication Uses remote disk mirroring technology – Maximum distance between sites is typically 100s of km (for synchronous, 1000s of km for asynchronous) – For example: IBM Metro Mirror, EMC SRDF Transactions run against primary site only, DR site is passive – If primary site fails, database at DR site can be brought online All data and logs must be mirrored to the DR site – Synchronous replication guarantees no data loss – Writes are synchronous and therefore ordered, but “consistency groups” are still needed • If failure to update one volume, don’t want other volumes to get updated (leaving data and logs out of sync) 28 © 2013 IBM Corporation High Availability Disaster Recovery (HADR) Provides local High Availability and/or Disaster Recovery – Keeps two copies of a database in sync with each other on two different servers Simple to setup and manage DB2 9.5 adds an integrated cluster manager for automatic failover DB2 10 adds multiple standbys, time delay and log buffering to handle network spikes 29 © 2013 IBM Corporation HADR: DB2 Delivers Fast Failover at Low Cost Redundant copy of the database to protect against site or storage failure Support for Rolling Upgrades Failover typically under 15 seconds Example: Real SAP workload, 600 SAP users – database available in 11 sec. 100% performance after primary failure Automatic Client Reroute Client application transparently resumes on Standby tx Network Connection tx tx tx HADR TSA for server monitoring Built in cluster manager Monitors primary & performs takeover Keeps the two servers in sync Standby Server Server Primary 30 Standby Server © 2013 IBM Corporation HADR Synchronization Modes Synchronous, Near Synchronous, Asynchronous and Super Asynchronous HADR HADR send() receive() ro ch yn As log writer us no ra e N Co m m log file it R Sup er A syn c hr ro h nc y S u no s s on ou r h c n Sy eq ue st log file ono us Commit Succeeded 31 © 2013 IBM Corporation HADR Failover Single command called "TAKEOVER" – Change the standby into a primary – Switch the roles of a healthy primary-standby pair – No db2start / restart database / rollforward etc. Integrated TSA provides heartbeat monitoring & automated “TAKEOVER” – Set up for you during DB2 installation – Use a network tiebreaker to avoid split brain scenarios – Configuration is available in this whitepaper Automatic client re-route (ACR) provides transparent failover – And will rerun the statement that was running when the failure occurred as long as it’s the first statement of a transaction with no data yet returned 32 © 2013 IBM Corporation HADR Multiple Standbys Principal Standby Any Primary e mod c n sy Super async mode Super as ync mod e Auxiliary Standby Auxiliary Standby Allows for one standby for high availability and up to two other standbys for disaster recovery – Rolling fix pack updates of standbys and primary without losing HA Reads on standby supported on all standbys TSA Automation for takeover is only to Principal Standby Takeover (forced and non-forced) supported from any standby – After takeover, configuration parameters on new primary’s standbys will be changed automatically so they point to the new primary 33 © 2013 IBM Corporation HADR Log Spooling on the Standby When enabled, allows the standby to spool log records arriving from the primary Decouples log replay on the standby from receiving of the log data from the primary Supported with any synchronization mode Primary Standby Any sync mode Spooled logs on standby 34 34 © 2013 IBM Corporation HADR Time-Delayed Apply on the Standby Helps recover from application errors – For example, accidental deletion of important table data – Must be noticed before time delay on standby results in change being replayed Enabled via the new HADR_REPLAY_DELAY database configuration parameter – Specifies a delay in seconds for applying changes on a standby – A value of 0 means no time delay (the default) Standby Primary super async mode Spooled logs on standby 35 © 2013 IBM Corporation Q Replication High-throughput, low latency logical data replication – Distance between sites can be up to thousands of km Asynchronous replication Includes support for: – – – – Delayed apply Multiple targets Replicating a subset of data Data transformation DR site can be active – Bi-directional replication is supported for updates on both primary and DR sites 36 © 2013 IBM Corporation Q Replication Each message represents a transaction Highly parallel apply process Differentiated conflict detection and resolution 37 © 2013 IBM Corporation Continuous Availability Using Q Replication Q Q Replication Replication provides provides aa solution solution for for continuous continuous availability availability where where the the active active secondary secondary system system is is also also available available for for other other applications applications 38 © 2013 IBM Corporation Introducing DB2 pureScale Extreme capacity – Buy only what you need, add capacity as your needs grow Application transparency – Avoid the risk and cost of application changes Continuous availability – Deliver uninterrupted access to your data with consistent performance Learning from the undisputed Gold Standard... System z 39 © 2013 IBM Corporation DB2 pureScale Scalability, Performance, and Always Available Transactions DB2 pureScale – Robust infrastructure for OLTP workloads – Provides improved availability, performance and scalability – Application transparency – Scales to >100 members – Leverages z/OS cluster technology Highlights of pureScale enhancements in DB2 10.5 – Rich disaster recovery options, now including integrated HADR support – Backup and restore between pureScale and non-pureScale environments – Online fix pack updates – Add members online for additional capacity – Included in Advanced Workgroup and Advanced Enterprise editions 40 © 2013 IBM Corporation DB2 pureScale Architected for extreme scale and availability • Multiple DB2 members for scalable and available database environment • Client application connects into any DB2 member to execute transactions Clients • CS CS CS CS Member Member Member Member • Shared storage for database data and transaction logs • Cluster caching facilities (CF) provide centralized global locking and page cache management for highest levels of availability and scalability Cluster Interconnect CFCS Logs Logs Logs Logs Primary CF CFCS Secondary CF Automatic workload balancing • Duplexed, for no single point of failure • High speed, low latency interconnect for efficient and scalable communication between members and CFs • DB2 Cluster Services provides integrated failure detection, recovery automation and the clustered file system Database Shared Storage DB2 pureScale Cluster (Instance) 41 © 2013 IBM Corporation Scale with Ease Scale up or out… without changing your applications Add member online – Efficient coherency protocols designed to scale without application changes – Applications automatically and transparently workload balanced across members – Up to 128 members Member Without impacting availability – Members can be added while cluster remains online Log CF Member Log Member Log Member Member Log Log CF Without administrative complexity – No data redistribution required “DB2 pureScale is the only solution we found that provided near linear scalability... It scales 100 percent, which means when I add servers and resources to the cluster, I get 100 percent of the benefit. Before, we had to ‘oversize’ our servers, and used only 50 - 60 percent of the available capacity so we could scale them when we needed.” -- Robert M. Collins Jr. (Kent), Database Engineer, BNSF Railway Inc. 42 Online Recovery from Failures DB2 pureScale design point is to maximize availability during failure recovery processing X When a database member fails, only in-flight data remains locked until member recovery completes CF CF – In-flight = data being updated on the failed member at the time it failed Database member failure “We pulled cards, we powered off systems, we uninstalled devices, we did everything we could do to make the cluster go out of service, and we couldn’t make it happen.” -- Robert M. Collins Jr. (Kent), Database Engineer, BNSF Railway Inc. Only data in-flight updates locked during recovery % of Data Available Target time to availability of rows associated with in-flight updates on failed member in seconds 100 50 Time (~seconds) Stealth System Maintenance Allows DBAs to apply system maintenance without negotiating an outage window Example: Upgrade the OS in a rolling fashion across the cluster Single Database View Procedure: 1. Drain (a.k.a. Quiesce) Wait for transactions to end their life naturally; new transactions routed to other members 2. Remove & maintain 3. Reintegrate into cluster Workload balancing starts sending it work as a least loaded machine 4. Repeat until done 44 DB2 DB2 DB2 DB2 © 2013 IBM Corporation Rolling Database Fix Pack Updates New in DB2 10.5 Transparently install pureScale fix packs or perform system maintenance in an online rolling fashion No outage experienced by applications Single installFixPack command run on each member/CF for fix pack update – Quiesces member • Existing transactions allowed to finish (configurable timeout) • New transactions sent to other members CF CF – Installs binaries – Updates instance • Member still behaves as if running on previous fix pack level – Unquiesces member Final installFixPack command to complete and commit updates – Instance now running at new fix pack level 45 © 2013 IBM Corporation Rolling Fix Pack Updates (cont.) Cluster is effectively running at: GA FP1 Transactions routed away from member undergoing maintenance, so no application outages experienced. Workload balancing brings work back after maintenance finished Member Code level: 3 >> installFixPack installFixPack –online –online CF S Code level: Code level: GA FP1 4 >> installFixPack installFixPack –online –online Member Code level: GA FP1 5 >> installFixPack installFixPack –online –online CF P GA FP1 1 >> installFixPack installFixPack –online –online 46 6 >> installFixPack installFixPack –check_commit –check_commit > installFixPack –commit_level 7 Member GA FP1 Cluster not running at new level until commit is performed Code level: GA FP1 2 >> installFixPack installFixPack –online –online © 2013 IBM Corporation Disaster Recovery Options for pureScale Variety of disaster recovery options to meet your needs – HADR (new in DB2 10.5) – Storage Replication – Q Replication – InfoSphere Change Data Capture (CDC) – Geographically Dispersed pureScale Cluster (GDPC) – Manual Log Shipping 47 © 2013 IBM Corporation Geographically Dispersed pureScale Clusters (GDPC) A “stretch” or geographically dispersed pureScale cluster spans two sites – – – – At distances of tens of km Active/active DR, where half of the cluster is at site A, other half at site B Enables a level of DR support suitable for many types of disasters Supported for AIX (using InfiniBand) and RedHat Linux (using 10 Gigabit Ethernet) Both sites active and available for transactions during normal operation On failures, client connections are automatically redirected to surviving members – Applies to both individual members within sites and total site failure M1 M3 CFP Tens of km CFS M2 Site A M4 Site B Workload fully balanced M1 M3 CFP Site A Tens of km CFS M2 M4 Site B Workload rebalanced on hardware failure 48 M1 M3 CFP Site A Tens of km CFS M2 M4 Site B Workload rebalanced on site failure © 2013 IBM Corporation HA & DR Scenarios October 28, 2013 © 2013 IBM Corporation Local Cluster Failover DB2 automation with built-in cluster manager HA Cluster Local Standby Database Primary Database DB1 Pros: Cons: Inexpensive local failover solution No protection from disk failure Protection from software and server failure No protection from site failure DB2 9.5 integrated TSA cluster manager Failover times vary from 5 to 15 minutes 50 © 2013 IBM Corporation HADR Local or Remote with Read on Standby Primary Connection HADR Cluster Standby Database Primary Database DB1 Pros: Inexpensive local failover or DR solution Protection from software, server, storage, and site failures Simple to setup and monitor Failover time in the range of 30 sec Reporting on standby without increase in failover time 51 DB1a Read only Cons: Two full copies of the database (a plus from a redundancy perspective) Only read transactions can run on standby © 2013 IBM Corporation HADR With Disk Mirroring to Remote DR Site Primary Connection HADR Cluster Automatic client reroute Disaster Recovery Site Local Standby Database Primary Database Remote Disk Mirror Technology DB1 Pros: Very fast local failover with DR capability Protection from software, server, storage, and site failures Local failover time in the range 30 seconds 52 DB1aa DB1a Cons: Three full copies of the database (a plus from a redundancy perspective) More costly than HADR for just DR © 2013 IBM Corporation HADR With Multiple Standby’s (DB2 10) Primary Connection HADR Cluster Automatic client reroute Local Standby Database Primary Database DB1 DB1a Pros: Very fast local failover with DR capability Protection from software, server, storage, and site failures Allows for time delay on auxiliary standbys Local failover time in the range 30 seconds 53 Disaster Recovery Site DB1aa Remote Standby Cons: Three full copies of the database (a plus from a redundancy perspective) Super Async only for DR site © 2013 IBM Corporation Q Replication Multiple alternate standby servers … Read/Write write on standby Primary Connection Site A Primary Database Site C Site B Remote Standby DB3 Q-based SQL replication to logical standby’s DB2 DB1 Pros: Protected from software, server, storage, and site failures Failover time is “instant” Standby can be full or subset and is fully accessible (read and/or write) Multiple standby servers 54 Cons: More complex to setup and monitor (but more flexibility) vs. HADR Asynchronous © 2013 IBM Corporation High Availability Disaster Recovery (HADR) Options Local HA – fast failover with server and storage protection Site A Server A Server B Primary Standby Disaster Recovery – server, storage and site protection Site A Any Distance Server A Primary Server B Site B Standby Both – fast local failover with server, storage and site protection Any Distance Site A 55 Server A Server B Server C Primary Standby Standby Site B © 2013 IBM Corporation HADR with Replication – Best Practice for HA and DR HADR Pairs with Replication Site A Server A Server B Primary Standby Q Replication Any Distance Server C Server D Active Standby Site B Delivers: – – – – – – Fast Local Failover Active / Active DR Rolling patch upgrades Rolling version upgrades Online database on-disk modifications Schema modifications online/rolling Can replace HADR at each site with pureScale for even better HA 56 © 2013 IBM Corporation DB2 pureScale Availability Options Local only – Online recovery, active-active, protection from server failure Site A Member 1 Member 2 CF CF Geographically dispersed cluster – Online recovery, active-active, protection from server, storage and site failure Site A Member 1 Member 3 Member 2 CF Member 4 CF < 80km Site B Local pureScale plus DR replication – Online recovery, active-active, protection from server, storage and site failure Site A 57 Member 1 CF Member 2 Member 1 Any distance CF CF Disk Replication or IBM Replication Server Member 2 CF Site B © 2013 IBM Corporation DB2 pureScale Availability Option in DB2 10.5 DB2 10.5 supports HADR with pureScale – Online recovery, protection from server, storage and site failure – Easy to set up and manage – Any distance (ASYNC or SuperAsync) Site A Member 1 CF Member 2 CF CF HADR 58 Member 1 Any distance Member 2 CF Site B © 2013 IBM Corporation HA & DR Summary When it comes to HA and DR, one size does not fit all There are many availability options, each with their own advantages – – – – – Server failover HADR Q-Replication pureScale More likely a combination of several of the above Choose the one that best suits your deployment – Determine the right solution considering • • • • 59 Cost (hardware/software/network/site) Availability requirements Management costs Application requirements © 2013 IBM Corporation Kelly Schlamb ([email protected]) Executive IT Specialist, Worldwide Information Management Technical Sales IBM Canada Ltd. 60 © 2013 IBM Corporation