Operating systems
High Availability
Högskolan Karlskrona Ronneby
Department of Software Engineering and Computer Science (Institutionen för Programvaruteknik och Datavetenskap)

What is Availability?
❐ The relation between the time the system is working and the total time.
❐ When is the system not working?
    ❐ "If a user cannot get his/her job done on time, the system is down"
❐ When is when?
    ❐ Availability is not affected if the system is "down" 2.00-4.00 when the users are in their offices 8.00-17.00.
    ❐ Web services usually demand 7*24*365.

How to measure Availability
$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} = \frac{\text{Time in service}}{\text{Total time}}$$
❐ MTBF = Mean time between failures
❐ MTTR = Mean time to repair
❐ Standard deviation is important!
    ❐ Hard disk vendors like to state MTBF figures, but seldom the standard deviation.
    ❐ [9.99, 10.01, 9.999, 10.002, 9.9999] -> Mean = 10
    ❐ [1, 1, 1, 1, 46] -> Mean = 10

Common measure of Availability
❐ Availability figures are commonly stated as the "number of 9s"

    Level   Availability   Downtime / year
    1       90 %           1 month
    2       99 %           3.65 days
    3       99.9 %         8.75 hours
    4       99.99 %        52 minutes
    5       99.999 %       5.25 minutes
    6       99.9999 %      31.5 seconds

❐ Areas of use: PC, well maintained server, telecom.

Availability index
❐ Regular Availability - Do nothing special.
    ❐ Basic system, backup data.
❐ Increased Availability - Protect the data.
    ❐ Redundant data, hardware and/or software RAID.
❐ High Availability - Protect the system.
    ❐ System failover, cluster.
❐ Disaster Recovery - Protect the organisation.
    ❐ Complete backup system at remote site.

Causes for downtime
❐ 40 % - Software failure.
    ❐ 30 % server, 5 % client, 5 % network.
❐ 30 % - Planned downtime.
    ❐ Add hardware, upgrade OS / software, preventive reboot.
❐ 15 % - People.
    ❐ Careless mistakes.
❐ 10 % - Hardware.
❐ 5 % - Environment.
    ❐ Power supply, fire, cable cut, dust.

What to target
❐ Focus on components that:
    ❐ Fail the most often (low MTBF).
    ❐ Are the hardest to replace (high MTTR).
    ❐ Have the greatest impact when they fail.

Why Availability?
❐ Downtime costs
    ❐ Lost production -> overtime, delays.
    ❐ Lost customers -> bad reputation -> more lost customers.
❐ Availability costs
    ❐ About 10 times more per "9" added.

20 Design Principles
❐ #1 Keep it simple.
    ❐ Hard to have control of a complex environment as it is (networks, disks, applications, ...)
    ❐ Eliminate unnecessary hardware.
    ❐ Slim servers down, run only critical applications.
    ❐ Remove ambiguity (Who has authority to take the network down?)
❐ #2 One problem, one solution.
    ❐ Often complex problems.
    ❐ Examine and fix each subproblem.
❐ #3 Exploit external resources.
    ❐ Do not reinvent the wheel.
    ❐ Consult vendors on the implementation of your site.
❐ #4 Reuse configurations.
    ❐ Easier to support, less to learn.
    ❐ Well tested by use.
❐ #5 Select reliable and serviceable hardware.
    ❐ How often will it fail?
    ❐ How easy is it to fix once it fails?
❐ #6 Choose mature software.
    ❐ Tested by others, probably fewer bugs.
    ❐ Better support.
❐ #7 Build for growth.
    ❐ Avoid downtime for hardware upgrades.
❐ #8 Examine the history of the system.
    ❐ What causes downtime?
❐ #9 Invest in failure isolation.
    ❐ Keeps failures from spreading.
    ❐ e.g. RAID can handle failure of a disk, so operation of the database is not interrupted.
❐ #10 Maintain separate environments. (Examples)
    ❐ Production. Controlled changes.
    ❐ Production mirror. Controlled changes, update after the production update is proven to work.
    ❐ Quality Assurance. Test environment, treat it as a production environment, but failures are not fatal.
    ❐ Development.
    ❐ Laboratory. Place to learn new technology.
    ❐ Disaster recovery. Production environment on another site.
❐ #11 Test everything.
    ❐ What if this disk fails? What if we unplug this network connection?
❐ #12 Plan ahead.
    ❐ When is the best time to install the new compiler? (Last week before project deadlines?)
❐ #13 Establish Service Level Agreements (SLA)
    ❐ So users know what to expect. A good SLA will get the users to work with you instead of against you.
    ❐ What percentage of the time is the system up? During which hours is the system actually critical?
    ❐ What systems get priority?
    ❐ What to do if the SLA is not fulfilled?
    ❐ Do not agree to an SLA that cannot be fulfilled. Think about issues that are out of your control, e.g. do not promise to have a failed disk replaced in one day when you do not know the delivery time of the disk vendor (or keep one in spare).
❐ #14 Document everything.
    ❐ History for new administrators.
    ❐ As memory, e.g. How did I install that software?
    ❐ Management can be shown what has been done.
    ❐ Keep documentation on paper as well!!!
❐ #15 Automate common tasks.
    ❐ Saves time.
    ❐ Less error prone.
❐ #16 Consolidate your servers.
    ❐ Fewer servers to monitor.
    ❐ Expensive Availability solutions can be applied.
    ❐ Simpler environment.
❐ #17 Maintain tight security.
❐ #18 Remove Single Points Of Failure (SPOFs)
    ❐ RAID, Cluster.
    ❐ Multiple ISPs.
    ❐ Uninterruptible Power Supply (UPS).
❐ #19 Assume nothing
    ❐ Just because one component worked in one environment does not mean it works in this.
❐ #20 Spend money... but not blindly
    ❐ Get the most bang for the buck.

Data Management, Disk storage crucial
❐ Disks are the most likely component to fail.
    ❐ Approximation of time to first failure:
$$\text{Time to first component failure} = \frac{\text{MTBF}}{\text{Number of components}}$$
    ❐ Disk with 200 000 hours MTBF / 100 disks = 2000 hours to first failure.
    ❐ Power supply with 30 000 hours MTBF / 6 supplies = 5000 hours to first failure.
    ❐ The probability that one out of many components should fail is higher than one out of few.
❐ Disks contain data
    ❐ Single most important asset. The data must be protected.
    ❐ Hardware can be replaced; data is harder, if possible at all, to replace (recreate).

Disk terms
❐ Disk Array - Many (or slots for) disks in one cabinet.
    ❐ JBOD (Just a Bunch Of Disks).
    ❐ RAID.
❐ Hot pluggable disk.
    ❐ Add / remove disk from cabinet without downtime or impact on the other disks while system is running.
❐ Warm pluggable disk.
    ❐ Add / remove disk from cabinet with downtime or impact on a few of the other disks while system is running.
❐ Hot spare.
    ❐ Reserve, ready to kick in when necessary.
❐ Multihost - Disk (array) is physically connected to more than one host.
    ❐ Only one host can be active in accessing the data.
❐ Multipath - More than one cable to disk (array).
    ❐ More than one controller.
❐ Write cache
    ❐ Hardware to buffer writes to disk.
    ❐ Increased performance.
    ❐ Need proper battery backup in order not to lose data when power fails.
❐ Storage Area Network (SAN)
    ❐ Storage pool that hosts can access.
    ❐ Centralized management and allocation.
    ❐ Availability easier to address, one place to improve.
❐ RAID (Redundant Array of Independent Disks)
    ❐ Software and/or hardware.
    ❐ RAID 0 - Striping. Write operations are spread over multiple disks, no redundancy, availability worse than a single disk (more components that can fail).
    ❐ RAID 1 - Mirroring. Everything is stored on more than one disk. Backup still necessary, does not protect against removal of files.
    ❐ RAID 5 - Parity. Redundancy with few extra disks. If one disk fails, its data can be recalculated from the other disks.

Failover Requirements (for Clusters)
❐ Servers
    ❐ Primary and Secondary.
    ❐ Configured to the largest extent in the same way.
    ❐ When primary fails, service fails over to secondary.
❐ Network connections
    ❐ Interconnect, between primary and secondary, used by heartbeat protocol.
    ❐ Public.
    ❐ Administrative.
❐ Disks
    ❐ Internal, unshared disks. Used for OS and applications.
    ❐ Shared (multihost) - For critical application data.
    ❐ Shared nothing - Data is replicated (over the interconnect) between the servers. Must also be used in a shared environment if more than one server wants access to the data.
❐ Application portability
    ❐ Licensing issues.
❐ No SPOFs!

More on heartbeats
❐ Servers ping each other regularly. If a couple of heartbeats are missed, that server may be failing.
❐ Why is a heartbeat missed?
    ❐ Server is down.
    ❐ Network interface card (NIC) used for the heartbeat has failed -> use two NICs.
    ❐ Heartbeat cable is unplugged / broken -> use two.
    ❐ Network storm on interconnect -> use two, allow for a couple of missed heartbeats, use separate network.
    ❐ Network hub / switch has failed -> use two.
    ❐ Heartbeat process has failed -> implement heartbeat in OS.
    ❐ Server runs too slow (no time to send heartbeat) -> implement heartbeat in OS.
❐ You do not, under any circumstances, want a heartbeat to be missed for any reason other than that the server is down.
    ❐ Split-brain syndrome. Both servers come to the conclusion (missed heartbeats) that the other is down and continue / take over the operation -> data corruption!!!

Cluster configurations
❐ Asymmetric
    ❐ Primary doing all the work.
    ❐ Secondary idle, waiting to take over.
❐ Symmetric
    ❐ The servers act as primary and secondary for each other.
    ❐ Both run critical applications.
    ❐ Each server must be capable of accepting the load of the other.
❐ N to 1 asymmetric - One machine acts as secondary for a number of primaries.

Disaster Recovery
❐ Backup server(s) on different site.
❐ Separate resources
    ❐ Cannot share resources with the main site; when they are needed they might not be there (disaster, remember).
❐ Clients might be affected too.
    ❐ Should they be duplicated as well?
❐ Complex return to normal operations.
❐ Do you need disaster recovery?
    ❐ Single system failure -> no, use cluster.
    ❐ Power failure -> use UPS and generator.
    ❐ Flood, fire, tornado, earthquake -> yes.
    ❐ War / terrorist attack -> yes, in another country / hidden.
    ❐ Atomic bomb -> probably not (why bother, nobody left who can use the system anyway).

Final words of wisdom
❐ Monitor your resources - SNMP (Simple Network Management Protocol).
❐ Do not store backup tapes in the same place as the original data.
❐ Beware of SPOFs
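
Worked example for "How to measure Availability" and the "number of 9s" table. The sketch below applies Availability = MTBF / (MTBF + MTTR) and converts an availability figure into downtime per year. The function names and the 365-day year are assumptions made for illustration, not part of the slides.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected downtime per year, in hours, for an availability fraction."""
    return (1.0 - avail) * HOURS_PER_YEAR


def availability_for_nines(nines: int) -> float:
    """Availability fraction for a given number of 9s (2 -> 0.99)."""
    return 1.0 - 10.0 ** -nines


# Hypothetical server: fails every 1000 hours on average, 1 hour to repair.
example = availability(1000.0, 1.0)

# Downtime per year for each level in the table (1 to 6 nines).
table = {n: downtime_hours_per_year(availability_for_nines(n)) for n in range(1, 7)}
```

Three nines give roughly 8.76 hours of downtime per year, which is in line with the rounded figure in the table.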
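
Worked example for "Standard deviation is important!". The two sample sets from the slide have (almost) the same mean but very different spread. The sketch uses Python's standard `statistics` module; the variable names are chosen here for illustration, and the population standard deviation is an assumed choice since the slide does not say which one is meant.

```python
import statistics

# The two sample sets from the slide (the slide gives no unit).
steady = [9.99, 10.01, 9.999, 10.002, 9.9999]
erratic = [1, 1, 1, 1, 46]


def summarize(samples):
    """Return (mean, population standard deviation) of the samples."""
    return statistics.mean(samples), statistics.pstdev(samples)


steady_mean, steady_spread = summarize(steady)
erratic_mean, erratic_spread = summarize(erratic)
```

Both means are about 10, but the spread of the second set is several orders of magnitude larger, which is why an MTBF figure without a standard deviation says little.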
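
Worked example for the "time to first failure" approximation in "Data Management, Disk storage crucial". The sketch simply divides the MTBF of one component by the number of identical components; the function name is an assumption made for illustration.

```python
def time_to_first_failure(mtbf_hours: float, components: int) -> float:
    """Approximate hours until the first of several identical components fails."""
    if components < 1:
        raise ValueError("need at least one component")
    return mtbf_hours / components


# The two cases from the slide.
disks = time_to_first_failure(200_000, 100)      # 100 disks
power_supplies = time_to_first_failure(30_000, 6)  # 6 power supplies
```

With these numbers the disks are expected to produce a first failure after about 2000 hours, well before the power supplies at about 5000 hours, even though a single disk has the higher MTBF.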
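
Illustration of the RAID 5 bullet "If one disk fails, its data can be recalculated from the other disks". This is a simplified sketch, assuming XOR parity over equally sized blocks with the parity kept as one separate block; real RAID 5 implementations distribute parity across the disks and work on stripes. All names and the sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical data blocks, one per data disk.
data = [b"disk", b"fail", b"over"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data)


def rebuild(surviving_blocks, parity_block):
    """Recalculate the single missing block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


# Simulate losing the second disk and rebuilding its block.
recovered = rebuild([data[0], data[2]], parity)
```

Only one missing block can be recovered this way; losing two blocks at once leaves the parity insufficient.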
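
Illustration of "allow for a couple of missed heartbeats" from "More on heartbeats". This is a minimal sketch under stated assumptions: the class name, the one-second interval and the threshold of three missed heartbeats are all hypothetical, and time is passed in explicitly to keep the example deterministic. It does not address the split-brain problem.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    interval: float = 1.0   # assumed seconds between expected heartbeats
    max_missed: int = 3     # assumed number of misses that is still tolerated
    last_seen: float = 0.0  # time of the most recent heartbeat from the peer

    def beat(self, now: float) -> None:
        """Record a heartbeat received from the peer at time `now`."""
        self.last_seen = now

    def missed(self, now: float) -> int:
        """Number of whole heartbeat intervals elapsed since the last heartbeat."""
        return max(0, int((now - self.last_seen) / self.interval))

    def peer_suspected_down(self, now: float) -> bool:
        """True once more heartbeats are missed than the tolerated maximum."""
        return self.missed(now) > self.max_missed


monitor = HeartbeatMonitor()
monitor.beat(10.0)
```

Tolerating a few misses guards against a brief network storm, at the cost of a slower reaction when the primary really is down.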
© Copyright 2025