What is Availability? Operating systems

Operating systems: High Availability
Institutionen för Programvaruteknik och Datavetenskap

What is Availability?
❐ The relation between the time the system is working and the total time.
❐ When is the system not working?
    ❐ "If a user cannot get his/her job done on time, the system is down"
❐ When is when?
    ❐ Availability is not affected if the system is "down" 02.00-04.00 when the users are in their offices 08.00-17.00.
    ❐ Web services usually demand availability 24 hours a day, 7 days a week, 365 days a year (7*24*365).
How to measure Availability
$$\text{Availability} = \frac{\text{Time in service}}{\text{Total time}} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$
❐ MTBF = Mean time between failures
❐ MTTR = Mean time to repair
❐ Standard deviation is important!
    ❐ Hard disk vendors like to state MTBF figures, but seldom the standard deviation.
    ❐ [9.99, 10.01, 9.999, 10.002, 9.9999] -> Mean = 10
    ❐ [1, 1, 1, 1, 46] -> Mean = 10

Common measure of Availability
❐ Availability figures are commonly stated as the "number of 9s".

Level   Availability   Downtime / year
1       90 %           1 month
2       99 %           3.65 days
3       99.9 %         8.75 hours
4       99.99 %        52 minutes
5       99.999 %       5.25 minutes
6       99.9999 %      31.5 seconds

❐ Areas of use: PC, well maintained server, telecom.
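The availability formula and the "number of 9s" figures can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names and the example MTBF/MTTR values are made up for illustration, and a year is assumed to be 365 days. The last part shows why the standard deviation matters, using the two series from the slide.

```python
import statistics

HOURS_PER_YEAR = 365 * 24  # assumption: 8760 hours, leap years ignored


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def nines(level: int) -> float:
    """Availability for a given number of 9s, e.g. 3 -> 0.999."""
    return 1.0 - 10.0 ** (-level)


def downtime_hours_per_year(avail: float) -> float:
    """Expected downtime per year for an availability between 0 and 1."""
    return (1.0 - avail) * HOURS_PER_YEAR


# The two series from the slide: same mean, very different spread.
steady = [9.99, 10.01, 9.999, 10.002, 9.9999]
erratic = [1, 1, 1, 1, 46]

if __name__ == "__main__":
    # Hypothetical system: one failure per 1000 hours, 1 hour to repair.
    print(f"availability: {availability(1000.0, 1.0):.5f}")
    for level in range(1, 7):
        hours = downtime_hours_per_year(nines(level))
        print(f"{level} nines: {hours:.4f} hours of downtime per year")
    for series in (steady, erratic):
        print(statistics.mean(series), statistics.stdev(series))
```

Both series report a mean of about 10, but only the standard deviation reveals that the second one is dominated by a single outlier.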
Why Availability?
❐ Downtime costs
    ❐ Lost production -> overtime, delays.
    ❐ Lost customers -> bad reputation -> more lost customers.
❐ Availability costs
    ❐ About 10 times more per "9" added.

Availability index
❐ Regular Availability - Do nothing special.
    ❐ Basic system, backup data.
❐ Increased Availability - Protect the data.
    ❐ Redundant data, hardware and/or software RAID.
❐ High Availability - Protect the system.
    ❐ System failover, cluster.
❐ Disaster Recovery - Protect the organisation.
    ❐ Complete backup system at remote site.

Causes for downtime
❐ 40 % - Software failure.
    ❐ 30 % server, 5 % client, 5 % network.
❐ 30 % - Planned downtime.
    ❐ Add hardware, upgrade OS / software, preventive reboot.
❐ 15 % - People.
    ❐ Careless mistakes.
❐ 10 % - Hardware.
❐ 5 % - Environment.
    ❐ Power supply, fire, cable cut, dust.

What to target
❐ Focus on components that:
    ❐ Fail the most often (low MTBF).
    ❐ Are the hardest to replace (high MTTR).
    ❐ Have the greatest impact when they fail.
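One way to act on the "What to target" criteria is to combine them into a single figure per component. The sketch below is an assumption, not a method from the slides: it weights the expected yearly repair time (from MTBF and MTTR) by an `impact` factor between 0 and 1. The `Component` class, the scoring function and all inventory figures are hypothetical.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 365 * 24  # assumption: leap years ignored


@dataclass
class Component:
    name: str
    mtbf_hours: float  # mean time between failures
    mttr_hours: float  # mean time to repair
    impact: float      # 0..1, share of the service lost when it fails


def expected_lost_hours(c: Component) -> float:
    """Impact-weighted downtime per year attributed to one component."""
    failures_per_year = HOURS_PER_YEAR / (c.mtbf_hours + c.mttr_hours)
    return failures_per_year * c.mttr_hours * c.impact


def rank(components):
    """Components sorted so the best targets for improvement come first."""
    return sorted(components, key=expected_lost_hours, reverse=True)


# All figures below are made up for illustration.
inventory = [
    Component("disk", 200_000, 4, 1.0),
    Component("power supply", 30_000, 2, 1.0),
    Component("application software", 2_000, 1, 0.5),
]

if __name__ == "__main__":
    for c in rank(inventory):
        print(f"{c.name}: {expected_lost_hours(c):.3f} lost hours per year")
```

With these made-up numbers the software ranks first even though each failure is short and only partial, because it fails far more often than the hardware.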
20 Design Principles
❐ #1 Keep it simple.
    ❐ Hard to have control of a complex environment as it is (networks, disks, applications, ...)
    ❐ Eliminate unnecessary hardware.
    ❐ Slim servers down, run only critical applications.
    ❐ Remove ambiguity (Who has authority to take the network down?)
❐ #2 One problem, one solution.
    ❐ Often complex problems.
    ❐ Examine and fix each subproblem.
❐ #3 Exploit external resources.
    ❐ Do not reinvent the wheel.
    ❐ Consult vendors on the implementation of your site.
❐ #4 Reuse configurations.
    ❐ Easier to support, less to learn.
    ❐ Well tested by use.
❐ #5 Select reliable and serviceable hardware.
    ❐ How often will it fail?
    ❐ How easy is it to fix once it fails?
❐ #6 Choose mature software.
    ❐ Tested by others, probably fewer bugs.
    ❐ Better support.
❐ #7 Build for growth.
    ❐ Avoid downtime for hardware upgrades.
❐ #8 Examine the history of the system.
    ❐ What causes downtime?
❐ #9 Invest in failure isolation.
    ❐ Keeps failures from spreading.
    ❐ e.g. RAID can handle failure of a disk, so operation of the database is not interrupted.
❐ #10 Maintain separate environments. (Examples)
    ❐ Production. Controlled changes.
    ❐ Production mirror. Controlled changes, update after the production update is proven to work.
    ❐ Quality Assurance. Test environment, treat it as a production environment, but failures are not fatal.
    ❐ Development.
    ❐ Laboratory. Place to learn new technology.
    ❐ Disaster recovery. Production environment on another site.
❐ #11 Test everything.
    ❐ What if this disk fails? What if we unplug this network connection?
❐ #12 Plan ahead.
    ❐ When is the best time to install the new compiler? (The last week before project deadlines?)
❐ #13 Establish Service Level Agreements (SLA).
    ❐ So users know what to expect. A good SLA will get the users to work with you instead of against you.
    ❐ What percentage of the time is the system up? During which hours is the system actually critical?
    ❐ What systems get priority?
    ❐ What to do if the SLA is not fulfilled?
    ❐ Do not agree to an SLA that cannot be fulfilled; think about issues that are out of your control. E.g. do not promise to have a failed disk replaced in one day when you do not know the delivery time of the disk vendor (or keep one in spare).
❐ #14 Document everything.
    ❐ History for new administrators.
    ❐ As memory, e.g. How did I install that software?
    ❐ Management can be shown what has been done.
    ❐ Keep documentation on paper as well!!!
❐ #15 Automate common tasks.
    ❐ Saves time.
    ❐ Less error prone.
❐ #16 Consolidate your servers.
    ❐ Fewer servers to monitor.
    ❐ Expensive Availability solutions can be applied.
    ❐ Simpler environment.
❐ #17 Maintain tight security.
❐ #18 Remove Single Points Of Failure (SPOFs).
    ❐ RAID, cluster.
    ❐ Multiple ISPs.
    ❐ Uninterruptible Power Supply (UPS).
❐ #19 Assume nothing.
    ❐ Just because a component worked in one environment does not mean it works in this one.
❐ #20 Spend money... but not blindly.
    ❐ Get the most bang for the buck.

Data Management, Disk storage crucial
❐ Disks contain data.
    ❐ Single most important asset.
    ❐ The data must be protected.
    ❐ Hardware can be replaced; data is harder, if possible at all, to replace (recreate).
❐ Disks are the most likely component to fail.
❐ Approximation of time to first failure:
$$\text{Time to first component failure} = \frac{\text{MTBF}}{\text{Number of components}}$$
    ❐ Disk with 200 000 hours MTBF / 100 disks = 2000 hours to first failure.
    ❐ Power supply with 30 000 hours MTBF / 6 supplies = 5000 hours to first failure.
    ❐ The probability that one out of many components fails is higher than that one out of few fails.
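The time-to-first-failure approximation above is a single division, but writing it out makes the two slide examples easy to reproduce. A minimal sketch; the function name is made up for illustration, and it assumes all components are identical and fail independently, as the approximation itself does.

```python
def time_to_first_failure(mtbf_hours: float, count: int) -> float:
    """Approximate hours until the first of `count` identical components fails."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return mtbf_hours / count


if __name__ == "__main__":
    # The two examples from the slide.
    print(time_to_first_failure(200_000, 100))  # 100 disks
    print(time_to_first_failure(30_000, 6))     # 6 power supplies
```

Note that the 100 disks, despite a much higher MTBF per unit, are expected to produce a failure sooner than the 6 power supplies.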
Disk terms
❐ Disk Array - Many (or slots for) disks in one cabinet.
    ❐ JBOD (Just a Bunch Of Disks).
    ❐ RAID.
❐ Hot spare.
    ❐ Reserve, ready to kick in when necessary.
❐ Hot pluggable disk.
    ❐ Add / remove disk from cabinet without downtime or impact on the other disks while system is running.
❐ Warm pluggable disk.
    ❐ Add / remove disk from cabinet with downtime or impact on a few of the other disks while system is running.
❐ Multihost - Disk (array) is physically connected to more than one host.
    ❐ Only one host can be active in accessing the data.
❐ Multipath - More than one cable to disk (array).
    ❐ More than one controller.
❐ Write cache
    ❐ Hardware to buffer writes to disk.
    ❐ Increased performance.
    ❐ Need proper battery backup in order not to lose data when power fails.
❐ RAID (Redundant Array of Independent Disks)
    ❐ Software and/or hardware.
    ❐ RAID 0 - Striping. Write operations are spread over multiple disks, no redundancy, availability worse than a single disk (more components that can fail).
    ❐ RAID 1 - Mirroring. Everything is stored on two or more disks. Backup still necessary, does not protect against removal of files.
    ❐ RAID 5 - Parity. Redundancy with few extra disks. If one disk fails, its data can be recalculated from the other disks.
❐ Storage Area Network (SAN)
    ❐ Storage pool that hosts can access.
    ❐ Centralized management and allocation.
    ❐ Availability easier to address, one place to improve.
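How RAID 5 can recalculate the data of a failed disk is easiest to see with XOR parity. The sketch below is an illustration only: it keeps one parity block for three data blocks and rebuilds a lost block from the survivors. Real RAID 5 implementations also distribute the parity across all disks, which is not modelled here; the function name and the sample data are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three data "disks", one block each (sample data made up for illustration).
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data)

# Simulate losing disk 1: XOR of the survivors and the parity restores it.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

if __name__ == "__main__":
    print(rebuilt == data[1])
```

The same calculation works whichever single block is lost, but not if two blocks are lost at once, which is why a hot spare and a quick rebuild matter.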
Failover Requirements (for Clusters)
❐ Servers
    ❐ Primary and Secondary.
    ❐ Configured to the largest extent in the same way.
    ❐ When the primary fails, service fails over to the secondary.
❐ Network connections
    ❐ Interconnect, between primary and secondary, used by the heartbeat protocol.
    ❐ Public.
    ❐ Administrative.
❐ Disks
    ❐ Internal, unshared disks. Used for OS and applications.
    ❐ Shared (multihost) - For critical application data.
    ❐ Shared nothing - Data is replicated (over the interconnect) between the servers. Must also be used in a shared environment if more than one server wants access to the data.
❐ Application portability
    ❐ Licensing issues.
❐ No SPOFs!
More on heartbeats
❐ Servers ping each other regularly. If a couple of heartbeats are missed, that server may be failing.
❐ Why is a heartbeat missed?
    ❐ Server is down.
    ❐ Network interface card (NIC) used for the heartbeat has failed -> use two NICs.
    ❐ Heartbeat cable is unplugged / broken -> use two.
    ❐ Network storm on interconnect -> use two, allow for a couple of missed heartbeats, use separate network.
    ❐ Network hub / switch has failed -> use two.
    ❐ Heartbeat process has failed -> implement heartbeat in OS.
    ❐ Server runs too slowly (no time to send heartbeat) -> implement heartbeat in OS.
❐ A heartbeat should never, under any circumstances, be missed for any reason other than that the server is down.
❐ Split-brain syndrome. Both servers come to the conclusion (missed heartbeats) that the other is down and continue / take over the operation -> data corruption!!!
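The rule "allow for a couple of missed heartbeats" can be sketched as a small monitor that counts missed intervals before it suspects the peer. This is a hypothetical illustration, not a real cluster product's API: the class, its parameters and the threshold of three are assumptions, and timestamps are passed in explicitly so the logic is easy to test.

```python
class HeartbeatMonitor:
    """Tracks heartbeats from one peer and flags it after too many misses."""

    def __init__(self, interval_s: float = 1.0, allowed_misses: int = 3):
        self.interval_s = interval_s          # expected time between heartbeats
        self.allowed_misses = allowed_misses  # tolerated misses (assumption)
        self.last_seen = 0.0                  # timestamp of the last heartbeat

    def beat(self, now: float) -> None:
        """Record a heartbeat received at time `now` (seconds)."""
        self.last_seen = now

    def missed(self, now: float) -> int:
        """Number of whole heartbeat intervals elapsed since the last beat."""
        return int((now - self.last_seen) // self.interval_s)

    def peer_suspected_down(self, now: float) -> bool:
        """True once more heartbeats are missed than the tolerated number."""
        return self.missed(now) > self.allowed_misses


if __name__ == "__main__":
    monitor = HeartbeatMonitor(interval_s=1.0, allowed_misses=3)
    monitor.beat(now=10.0)
    print(monitor.peer_suspected_down(now=12.5))  # two misses, still tolerated
    print(monitor.peer_suspected_down(now=14.5))  # four misses, suspected down
```

A monitor like this only raises a suspicion. Because the cause may be the interconnect rather than the peer, a takeover decision based on it alone risks the split-brain situation described above.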
Cluster configurations
❐ Asymmetric
    ❐ Primary doing all the work.
    ❐ Secondary idle, waiting to take over.
❐ Symmetric
    ❐ The servers act as primary and secondary for each other.
    ❐ Both run critical applications.
    ❐ Each server must be capable of accepting the load of the other.
    ❐ Complex return to normal operations.
❐ N to 1 asymmetric - One machine acts as secondary for a number of primaries.

Disaster Recovery
❐ Backup server(s) on a different site.
❐ Separate resources
    ❐ Cannot share resources with the main site; when they are needed they might not be there (disaster, remember).
❐ Clients might be affected too.
    ❐ Should they be duplicated as well?

Do you need disaster recovery?
❐ Single system failure -> No, use a cluster.
❐ Power failure -> Use UPS and generator.
❐ Flood, fire, tornado, earthquake -> Yes.
❐ War / terrorist attack -> Yes, in another country / hidden.
❐ Atomic bomb -> Probably not (why bother, nobody left who can use the system anyway).

Final words of wisdom
❐ Monitor your resources - SNMP (Simple Network Management Protocol).
❐ Do not store backup tapes in the same place as the original data.
❐ Beware of SPOFs.