Coordination in distributed systems

q Why needed, sources of problems
Ø for resource sharing: concurrent updates of
§ records in a database (record locking)
§ files (file locks in stateless file servers)
§ a shared bulletin board
Ø to agree on actions: whether to
§ commit/abort a database transaction
§ agree on readings from a group of sensors
Ø to dynamically re-assign the role of master
§ choose a primary time server after a crash
§ choose a coordinator after network reconfiguration

Mutual Exclusion

q The problem:
Ø N asynchronous processes, for simplicity no failures
Ø guaranteed message delivery (reliable links)
Ø to execute the critical section (CS), each process calls:
§ request()
§ resourceAccess()
§ exit()
q Requirements
Ø At most one process is in the CS at the same time.
Ø Requests to enter and exit are eventually granted.
Ø (Optional, stronger) Requests to enter are granted according to causality order.

[Figure: processes P1, P2, P3 contending for a critical section]

Synchronization and Coordination
Why difficult?

q Centralized solutions not appropriate
Ø communications bottleneck
q Fixed master-slave arrangements not appropriate
Ø process crashes
q Varying network topologies
Ø ring, tree, arbitrary; connectivity problems
q Failures must be tolerated if possible
Ø link failures
Ø process crashes
q Impossibility results
Ø in the presence of failures, especially in the asynchronous model

Mutual Exclusion

Two requirements:
q Safety: at most one process can be in the critical section.
q Liveness: a process requesting entry to the critical section will eventually succeed.

[Figure: processes P1, P2, P3 contending for a critical section]
Coordination problems

q Mutual exclusion
Ø distributed form of critical section problems
Ø must use message passing
q Leader elections
Ø after a crash failure has occurred
Ø after network reconfiguration
q Consensus (also called Agreement): next lecture
Ø similar to the coordinated attack problem
Ø some based on multicast communication
Ø variants depending on type of failure, network, etc.

Safety vs. Liveness

q A safety property describes a property that always holds; we sometimes put it this way: "nothing 'bad' will happen".
q A liveness property describes a property that will eventually hold; we sometimes put it this way: "something 'good' will eventually happen".

Where do the following properties belong?
deadlock freedom, mutual exclusion, bounded delay
Some Solutions

q Use a centralized server
q Ricart and Agrawala's distributed algorithm
q Tree
q Quorum
q Token ring

Token Rings - Discussion

q continuous use of network bandwidth
q delay to enter depends on the size of the ring
q causality order of requests not respected - why?
A Centralized Solution

q Use a centralized coordinator to maintain a queue of requests, which are ordered by physical timestamps.
q A process wishing to enter the CS sends a request to the coordinator, and enters the CS when the coordinator grants its request.

[Figure: processes p1–p5 send request messages to the coordinator, which queues them and replies with grant]

Problems:
The coordinator becomes a bottleneck and a single point of failure.

Ricart and Agrawala's Distributed Algorithm

1. A process requesting entry to the CS sends a request to every other process in the system, and enters the CS when it obtains permission from every other process.
2. When does a process grant another process's request?
Ø conflicts are resolved by the logical timestamps of the requests
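The deferral rule in Ricart and Agrawala's algorithm can be sketched as follows. This is a minimal sketch of the decision a process makes when a request arrives, not the authors' code; the function name, state names, and argument layout are mine.

```python
# Hypothetical sketch of the Ricart & Agrawala deferral rule. A process grants
# a request immediately unless it is in the CS, or it is itself requesting with
# a lexicographically smaller (timestamp, id) pair.

def should_defer(my_state, my_ts, my_id, req_ts, req_id):
    """Return True if the request (req_ts, req_id) must wait for our grant."""
    if my_state == "HELD":
        return True
    if my_state == "WANTED":
        # ties on the logical timestamp are broken by process id
        return (my_ts, my_id) < (req_ts, req_id)
    return False  # RELEASED: grant at once

# a process in the CS defers the request; an idle process grants immediately
print(should_defer("HELD", None, 1, 5, 2))      # True
print(should_defer("RELEASED", None, 3, 5, 2))  # False
# two concurrent requesters: the smaller (timestamp, id) pair wins
print(should_defer("WANTED", 4, 1, 5, 2))       # True  (our request precedes)
print(should_defer("WANTED", 6, 1, 5, 2))       # False (their request precedes)
```

Deferred requests are queued and granted upon exiting the CS, which is what makes the liveness requirement hold.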
Token Rings

q It is trivial to solve mutual exclusion over a ring---by using a unique token.
q For an ordinary network, a logical ring has to be constructed.

[Figure: processes P0–P8 arranged in a ring, passing the token]

How to implement the timestamps?

q Physical clocks?
Ø How to synchronize physical clocks?
Ø Will it work without a perfect clock synchronization scheme?
q Logical clocks?
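The token-ring scheme above is simple enough to capture in a few lines. The following is an assumed, single-threaded rendering (the function and variable names are mine): the token circulates around the ring, and a process may enter the CS only while holding it.

```python
# Minimal token-ring mutual exclusion sketch (illustrative, not from the slides).
# The token circulates P0 -> P1 -> ... -> P8 -> P0.

def circulate(ring, start, wants_cs, rounds=1):
    """Return the processes that enter the CS as the token passes each node."""
    entered = []
    n = len(ring)
    pos = start
    for _ in range(rounds * n):
        holder = ring[pos]
        if holder in wants_cs:
            entered.append(holder)   # enter the CS, then release the token
        pos = (pos + 1) % n          # pass the token to the next process
    return entered

ring = [f"P{i}" for i in range(9)]
print(circulate(ring, start=0, wants_cs={"P2", "P7"}))  # ['P2', 'P7']
```

Note how the delay to enter is visible here: a process must wait for the token to travel around to it, which is the "delay depends on the size of the ring" drawback discussed earlier.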
Token-Based on Trees
[Raymond 1989]

1. The tree is dynamically structured so that the root always holds the token.
2. Each process maintains a FIFO queue of requests for the token from its successors, and a pointer to its immediate predecessor.
3. A process requesting the token, or receiving a request from a successor, appends the request to its queue, and then requests the token from its predecessor if it does not hold the token.

Token-Based on Trees (contd.) -- example

1. P5 and P6 request the token from P3, and suppose P5's request arrives first.
2. Since P3 does not hold the token, it requests the token from its predecessor P1.
3. P3 receives the token. It then removes P5 from its queue and sends the token to P5, which becomes the new predecessor of P3. Since P3's queue is still not empty, it also sends a request to the new predecessor.

[Figures: the tree over P1–P6 at each step, with each node's request queue shown beside it]

Quorum Systems
-- [Garcia-Molina & Barbara, 1985]

A quorum system is a collection of sets of processes called quora. In resource allocation, a process must acquire a quorum (i.e., lock all the quorum members) in order to access a resource. Resource allocation algorithms that use quora typically have the following advantages:
o Less message complexity
o Fault tolerance
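The P5/P6 example above can be replayed in a compact, single-threaded simulation. This is an assumed rendering of the slides' description, not Raymond's code: messages are delivered instantly in FIFO order through one queue, and all names (`Node`, `want_cs`, `rq`, etc.) are mine.

```python
from collections import deque

class Node:
    def __init__(self, name, holder):
        self.name, self.holder = name, holder  # holder == name: "I hold the token"
        self.rq, self.asked, self.in_cs = deque(), False, False

net = deque()                                  # pending (dest, kind, sender) messages
nodes = {}
entered = []

def _assign_and_request(n):
    if n.holder == n.name and not n.in_cs and n.rq:
        head = n.rq.popleft()
        n.asked = False
        if head == n.name:
            n.in_cs = True                     # our own request is at the head
            entered.append(n.name)
        else:
            n.holder = head                    # the predecessor pointer reverses
            net.append((head, "TOKEN", n.name))
    if n.holder != n.name and n.rq and not n.asked:
        n.asked = True                         # forward one request toward the token
        net.append((n.holder, "REQUEST", n.name))

def want_cs(name):
    n = nodes[name]; n.rq.append(name); _assign_and_request(n)

def release(name):
    n = nodes[name]; n.in_cs = False; _assign_and_request(n)

def run():
    while net:
        dest, kind, sender = net.popleft()
        n = nodes[dest]
        if kind == "REQUEST":
            n.rq.append(sender)
        else:
            n.holder = n.name                  # the token has arrived
        _assign_and_request(n)

# The slides' tree: P1 holds the token; P3 points to P1; P5, P6 point to P3.
for name, holder in [("P1", "P1"), ("P2", "P1"), ("P3", "P1"),
                     ("P4", "P3"), ("P5", "P3"), ("P6", "P3")]:
    nodes[name] = Node(name, holder)

want_cs("P5"); want_cs("P6"); run()            # P5's request reaches P3 first
release("P5"); run(); release("P6"); run()
print(entered)                                 # ['P5', 'P6']
```

Tracing the run reproduces the slides' steps: P3 forwards a single request to P1, the token travels P1 → P3 → P5, and since P3's queue still holds P6 it immediately asks its new predecessor P5 for the token again.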
Formal Definition

Let P = {p0, p1, p2, … , pn-1} be a set of processes.
A coterie C is a subset of 2^P such that
n Intersection: ∀ Qi, Qj ∈ C, Qi ∩ Qj ≠ ∅
n Minimality: ∀ Qi, Qj ∈ C, Qi ≠ Qj ⇒ Qi ⊄ Qj
Each set in C is called a quorum.

Some Quorum Systems

q Majority
q Tree quora
q Grid
q Finite Projective Plane

[Figure: a 5×5 grid of processes p1,1 … p5,5]

Projective Planes
-- [Garcia-Molina & Barbara, 1985]

q A projective plane is a plane satisfying the following:
Ø Any line has at least two points.
Ø Any two points are on precisely one line.
Ø Any two lines meet.
Ø There exists a set of four points, no three of which are collinear.
q A projective plane is said to be of order n if a line contains exactly n+1 points.

Projective Planes (contd.)

q A projective plane of order n has the following properties:
Ø Every line contains exactly n+1 points.
Ø Every point is on exactly n+1 lines.
Ø There are exactly n²+n+1 points.
Ø There are exactly n²+n+1 lines.

[Figure: the Fano plane (the projective plane of order 2)]
Fully Distributed Quorum Systems

A quorum system C = {Q1, Q2, … , Qm} over P that additionally satisfies the following conditions:
q Uniform: ∀ 1 ≤ i, j ≤ m: |Qi| = |Qj|
q Regular: ∀ p, q ∈ P: |np| = |nq|, where np is the set {Qi | 1 ≤ i ≤ m, p ∈ Qi}, and similarly for nq.
E.g., finite projective planes of order p^k, where p is a prime.

Examples:
Q1 = {1, 2}   Q2 = {1, 3}   Q3 = {2, 3}

Q1 = {1, 2, 3}   Q4 = {2, 4, 6}   Q6 = {3, 4, 7}
Q2 = {1, 4, 5}   Q5 = {2, 5, 7}   Q7 = {3, 5, 6}
Q3 = {1, 6, 7}
(the quora of the Fano plane, the projective plane of order 2)

Maekawa's Algorithm

q A process p wishing to enter the CS chooses a quorum Q and sends lock requests to all nodes of the quorum.
q It enters the CS only when it has locked all nodes of the quorum.
q Upon exiting the CS, p unlocks the nodes.
q A node can be locked by only one process at a time.
q Conflicting lock requests to a node are resolved by priorities (e.g., timestamps). The loser must yield the lock to the higher-priority request if it cannot successfully obtain all the locks it needs.
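The coterie conditions and the Uniform/Regular conditions can be checked by brute force on small examples. The sketch below (function names are mine) verifies them for the Fano-plane quora listed above.

```python
from itertools import combinations

# Brute-force checks of the coterie conditions (Intersection, Minimality) and
# of the Uniform/Regular conditions for fully distributed quorum systems.

def is_coterie(quora):
    inter = all(a & b for a, b in combinations(quora, 2))          # Intersection
    minim = all(not a < b for a in quora for b in quora if a != b) # Minimality
    return inter and minim

def is_fully_distributed(quora, processes):
    uniform = len({len(q) for q in quora}) == 1                    # |Qi| = |Qj|
    loads = [sum(p in q for q in quora) for p in processes]        # |np| per p
    return uniform and len(set(loads)) == 1                        # Regular

fano = [{1, 2, 3}, {1, 4, 5}, {1, 6, 7}, {2, 4, 6},
        {2, 5, 7}, {3, 4, 7}, {3, 5, 6}]
print(is_coterie(fano))                          # True
print(is_fully_distributed(fano, range(1, 8)))   # True
```

As the projective-plane properties predict, every point of the Fano plane lies on exactly 3 lines, so every process carries the same load.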
Message Complexity

Maekawa's algorithm needs 3c to 6c messages per entry to the CS, where c is the size of the quorum a process chooses.

Best case: 3c
[Figure: p sends "request locks" to each quorum node, receives "grant locks", and sends "release locks" on exit]

Worst case: 6c
[Figure: p sends "request locks"; a quorum node sends "inquire" to the current lock holder, which replies "yield locks"; the node then sends "grant locks"; p finally sends "return locks" and "release locks"]

Read/Write Quorums

For database concurrency control,
– every read quorum must intersect with every write quorum,
– every two write quora must intersect.

Comparison

Algorithm     Messages per entry/exit              Delay before entry      Drawbacks
                                                   (in message times)
Centralized   3                                    2                       single point of failure
Distributed   2(n−1); 2 if multicast is supported  2                       crash of any process
Token Ring    1 to ∞                               1 to n−1                token loss, process crash
Tree          O(log n)                             O(log n)                token loss, process crash
Voting        3c to 6c (c = quorum size)           2                       need to determine a suitable coterie

The Election Problem

Many distributed algorithms require one process to act as coordinator, initiator, sequencer, or otherwise perform some special role, and therefore one process must be elected to take the job, even if processes may fail.

Requirements:
• Safety: at most one process can be elected at any time.
• Liveness: some process is eventually elected.

Assumptions:
Each process has a unique id.

[Figure: a process announcing "I am the leader"]

The Bully Algorithm

When a process P notices that the current coordinator is no longer responding to requests, it initiates an election, as follows:
1. P sends an ELECTION message to every process with a larger id.
2. If after some timeout period no one responds, P wins the election and becomes the coordinator.
3. If one of the higher-ups answers, it takes over the election, and P's job is done. (A process must answer an ELECTION message if it is alive.)
4. When a process is ready to take over the coordinator's role, it sends a COORDINATOR message to every process to announce this.
5. When a previously crashed coordinator recovers, it assumes the job by sending a COORDINATOR message to every process.
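The outcome of the bully algorithm's steps can be sketched as a synchronous round: every live higher-up answers and takes over, and the largest live id ends up coordinator. This is an assumed, simplified rendering (no timeouts or message loss); the function name is mine.

```python
# Round-based sketch of the bully algorithm's outcome (illustrative; not a
# faithful asynchronous implementation with timeouts).

def bully(ids, alive, initiator):
    """Return the id that becomes coordinator when `initiator` starts an election."""
    higher = [p for p in ids if p > initiator and p in alive]
    if not higher:
        return initiator              # no bigger process answers: initiator wins
    # every live higher-up answers and takes over; the largest of them wins
    return bully(ids, alive, max(higher))

ids = [1, 2, 3, 4, 5]
print(bully(ids, alive={1, 2, 3}, initiator=1))   # 3 (4 and 5 have crashed)
print(bully(ids, alive=set(ids), initiator=2))    # 5
```

The name "bully" reflects exactly this: the biggest live process always pushes everyone else aside.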
Message Complexity

Best case: n−2; worst case: O(n²).

A Ring-Based Algorithm

Assumptions:
1. Processes do not know each other's ids.
2. Each process can only communicate with its neighbors.
3. All processes remain functional and reachable.

[Figure: processes with ids 0–7 on a ring; an ELECTION message circulates carrying the max id seen; processes are marked as participants or non-participants]

1. Each process is either a participant or a non-participant of the election. Initially, all processes are non-participants.
2. When a process wishes to initiate an election, it marks itself as a participant, and then sends an ELECTION message (bearing its own id) to its left neighbor.
3. When a process P receives an ELECTION message, it compares its id with the id carried in the message. If the message's id is larger, P forwards the message. Otherwise, if P is not a participant, it substitutes its own id in the message and forwards it to its left neighbor; otherwise it simply discards the message. On forwarding an ELECTION message, a process marks itself as a participant.
4. When a process P receives an ELECTION message bearing its own id, it becomes the coordinator. It announces this by issuing a COORDINATOR message (bearing its id) to its left neighbor, and marks itself as a non-participant.
5. When a process other than the coordinator receives a COORDINATOR message, it also marks itself as a non-participant, and then forwards the message to its left neighbor.

Complexity

If only a single process (say p) starts an election then, in the worst case, n−1 messages are required to "wake up" the process with the largest id (which resides on p's right side); another 2n messages are needed for that process to elect itself as coordinator.

When Processes May Fail…

q Things get a little complicated when processes may fail in the above ring-based algorithm, as it relies on a ring topology that may be destroyed when processes fail.
q How to cope with the problem?
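The circulation of the surviving ELECTION message in the ring algorithm above can be simulated directly. This is a sketch under the stated assumptions (no failures, one initiator); it tracks only the message that is not discarded, and the names are mine.

```python
# Synchronous simulation of the ring-based election: only the surviving
# ELECTION message is tracked; smaller-id messages would be discarded.

def ring_election(ring, initiator):
    """ids arranged on a ring; messages travel to the 'left' (next) neighbor."""
    n = len(ring)
    i = ring.index(initiator)
    best = initiator
    hops = 0
    while True:
        i = (i + 1) % n               # forward to the next process on the ring
        hops += 1
        if ring[i] == best:           # the message came back bearing my own id
            break
        best = max(best, ring[i])     # substitute the larger id and forward
    return best, hops

ring = [3, 17, 24, 1, 28, 15, 9]
winner, hops = ring_election(ring, initiator=17)
print(winner)                          # 28
```

The hop count illustrates the complexity argument above: the message must first reach the largest id, and the largest id's own message must then travel a full lap before the COORDINATOR announcement.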
Leader Election in Mobile Ad Hoc Networks

q Assumptions:
Ø Each node has a unique ID
Ø Nodes do not know the total number of nodes in the system
Ø Nodes may move, fail, or rejoin the network
q Goal:
Ø Design an efficient distributed algorithm for the nodes to elect a leader so that
§ if the system is stable, then eventually there is a unique leader in every connected component, and for every other node in the component there is a (unique) path to the leader
The system needs to be self-stabilizing!

Mutual Exclusion [Dijkstra 1965]

Only one process can access a resource at a time.

k-Exclusion [Fischer, Lynch, Burns, & Borodin 1979]

At most k processes can be in the critical section at a time.

Solutions for k-exclusion

q Token-Based
Ø Extension of Raymond's token-based algorithm for mutual exclusion?
q Permission-Based
Ø Extension of Ricart and Agrawala's algorithm for mutual exclusion?
Ø Design of quorum systems?
§ Can the definition of ordinary quorum systems be used?
§ What's the new definition?

k-coteries

q A quorum system S for k-exclusion (called a k-coterie) is a collection of subsets of processes satisfying:
Ø Intersection: ∀ R ⊆ S, |R| = k+1 ⇒ ∃ distinct Qi, Qj ∈ R, Qi ∩ Qj ≠ ∅
Ø Minimality: ∀ Qi, Qj ∈ S, Qi ≠ Qj ⇒ Qi ⊄ Qj
Are the above conditions enough?
We need a non-intersection property!
Examples: k-majority, cohorts, degree-k tree quorum, ...

Group Mutual Exclusion (GME) [Joung 1998]

A resource can be shared by processes of the same group, but not by processes of different groups. (Think of a CD jukebox.)
Requirements:
p mutual exclusion
p lockout freedom
p concurrent entering
Variations:
§ Limit the number of processes that can be in the CS.
§ Increase the number of groups that can be in the CS.
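The k-coterie Intersection condition can be checked by brute force on the k-majority construction: if every quorum has size ⌊n/(k+1)⌋ + 1, then k+1 pairwise disjoint quora would need more than n processes, so two of any k+1 quora must intersect. The sketch below (an illustration with my own function names, not from the slides) verifies this on a small instance.

```python
from itertools import combinations
from math import floor

# k-majority: quora are all subsets of size floor(n/(k+1)) + 1, so among any
# k+1 quora two must share a process (a pigeonhole argument).

def k_majority(n, k):
    size = floor(n / (k + 1)) + 1
    return [frozenset(c) for c in combinations(range(n), size)]

def satisfies_k_intersection(quora, k):
    """Check: in every collection of k+1 quora, some two intersect."""
    return all(any(a & b for a, b in combinations(group, 2))
               for group in combinations(quora, k + 1))

quora = k_majority(n=5, k=2)                  # quorum size 2 over 5 processes
print(satisfies_k_intersection(quora, k=2))   # True
```

Up to k processes can thus hold pairwise disjoint quora simultaneously, which is exactly the non-intersection behavior k-exclusion needs.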
Solutions for group mutual exclusion

q Token-Based
Ø Extension of Raymond's token-based algorithm for mutual exclusion?
q Permission-Based
Ø Extension of Ricart and Agrawala's algorithm for mutual exclusion?
Ø Design of quorum systems?
§ Can ordinary quorum systems or k-coteries be used?
§ What's the new definition?

Group Quorum Systems

Let P = {1, 2, … , n} be a set of nodes.
An m-group quorum system is a tuple S = (C1, C2, … , Cm), where each Ci ⊆ 2^P satisfies
n Intersection: ∀ 1 ≤ i ≠ j ≤ m, ∀ Q1 ∈ Ci, ∀ Q2 ∈ Cj : Q1 ∩ Q2 ≠ ∅
n Minimality: ∀ 1 ≤ i ≤ m, ∀ Q1, Q2 ∈ Ci, Q1 ≠ Q2 : Q1 ⊄ Q2
We call each Ci a cartel, and each Q ∈ Ci a quorum.
The degree of a cartel C is the maximum number of pairwise disjoint quora in C.

The Surficial Group Quorum System Sm

q It is balanced, uniform, and regular.
q It minimizes each process's load by letting |np| = 2 for all p ∈ P.
q Each cartel has degree √(2n / (m(m−1))).
q Each quorum has size √(2n(m−1) / m).

Construction of Sm

[Figures: constructions of S1, S2, S3, S4, S5 over processes p0, p1, p2, …, and a 5×5 grid of processes P0,0 … P4,4 illustrating the construction]
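The group-quorum-system conditions and the degree of a cartel can be checked mechanically. The example cartels below are made up for illustration (they are not the surficial Sm), and the function names are mine.

```python
from itertools import combinations

# Brute-force checks of the m-group quorum system conditions: quora in
# different cartels must intersect; quora within a cartel need not.

def is_group_quorum_system(cartels):
    inter = all(q1 & q2                               # cross-cartel Intersection
                for c1, c2 in combinations(cartels, 2)
                for q1 in c1 for q2 in c2)
    minim = all(not q1 < q2                           # Minimality within a cartel
                for c in cartels for q1 in c for q2 in c if q1 != q2)
    return inter and minim

def degree(cartel):
    """Max number of pairwise disjoint quora in the cartel."""
    best = 0
    for r in range(1, len(cartel) + 1):
        for group in combinations(cartel, r):
            if all(not a & b for a, b in combinations(group, 2)):
                best = max(best, r)
    return best

# 2 groups over 4 processes; each cartel has two disjoint quora (degree 2),
# so two processes of the same group can be "in" concurrently.
c1 = [frozenset({1, 2}), frozenset({3, 4})]
c2 = [frozenset({1, 3}), frozenset({2, 4})]
print(is_group_quorum_system([c1, c2]))   # True
print(degree(c1))                          # 2
```

The degree is what bounds the concurrency within one group: disjoint quora of the same cartel can be acquired simultaneously, while the cross-cartel intersection keeps different groups out.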
Time

q Time is important in computer systems
Ø Every file is stamped with the time it was created, modified, and accessed.
Ø Every email, transaction, … is also timestamped.
Ø Setting timeouts and measuring latencies
q Sometimes we need precise physical time, and sometimes we only need relative time.

Synchronizing Physical Time

Observations:
q In some systems (e.g., real-time systems) actual times are important, and we typically equip every computer host with one physical clock.
q Computer clocks are unlikely to tick at the same rate, whether or not they are of the 'same' physical construction.
Ø E.g., a quartz crystal clock has a drift rate of 10⁻⁶ (ordinary), or 10⁻⁷ to 10⁻⁸ (high precision).
Ø Cf. an atomic clock, with a drift rate of 10⁻¹³.
Questions:
1. How do we synchronize computer clocks with real-world clocks?
2. How do we synchronize computer clocks with each other?

The Network Time Protocol (NTP)

q Provides a service enabling clients across the Internet to be synchronized accurately to UTC, despite the large and variable message delays encountered in Internet communication.
q The NTP servers are connected in a logical hierarchy, where servers in level n are synchronized directly to those in level n−1 (which have a higher accuracy). The hierarchy can be reconfigured as servers become unreachable or fail.
q NTP servers synchronize with one another in one of three modes (in order of increasing accuracy):
Ø Multicast on high-speed LANs
Ø Procedure call mode (a la Cristian's algorithm)
Ø Symmetric mode (for achieving the highest accuracy)

Compensation for clock drift

q A computer clock can usually be adjusted forward but not backward.
Ø Typical example: Y2K problem.
q Common terminology:
Ø Skew (offset): the instantaneous difference between (the readings of) two clocks.
Ø Drift rate: the difference between the clock and a nominal perfect reference clock per unit of time.
q Linear adjustment:
Ø Let C be the software reading of a hardware clock H. The operating system usually produces C in terms of H as follows: C(t) = αH(t) + β

Cristian's Algorithm

[Figure: P sends "request a new time" to time server S, which replies "it's time t"]

When P receives the message, it should set its time to t + Ttrans, where Ttrans is the time to transmit the message.
Ttrans ≈ Tround/2, where Tround is the round-trip time.

Accuracy:
Let min be the minimum time to transmit a message one way. Then P could receive S's message any time within [t + min, t + Tround − min].
So the accuracy is ±(Tround/2 − min).

Symmetric Mode

A pair of servers exchange timing information.

[Figure: server A sends m at Ti−3 and receives m′ at Ti; server B receives m at Ti−2 and sends m′ at Ti−1]

Assume m takes t to transfer and m′ takes t′ to transfer.
Let the offset between A's clock and B's clock be o; i.e., B(t) = A(t) + o.
Then Ti−2 = Ti−3 + t + o and Ti = Ti−1 − o + t′.
Assuming that t ≈ t′, the offset o can be estimated as follows:
oi = (Ti−2 − Ti−3 + Ti−1 − Ti) / 2
Since Ti−2 − Ti−3 + Ti − Ti−1 = t + t′ (let's say t + t′ equals di),
o = oi + (t′ − t)/2.
Given that t, t′ ≥ 0, the accuracy of the estimate oi is:
oi − di/2 ≤ o ≤ oi + di/2
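Both estimates above are plain arithmetic over the exchanged timestamps, and can be written down directly. This is a sketch of the formulas only (variable names are mine); real protocols add filtering on top.

```python
# Offset estimates from the formulas above.
# Cristian: set the clock to t + T_round/2, accurate to ±(T_round/2 − min).
# Symmetric mode: o_i = (T_{i-2} − T_{i-3} + T_{i-1} − T_i) / 2, and the true
# offset o lies within o_i ± d_i/2, where d_i = t + t'.

def cristian(t_server, t_round, t_min):
    estimate = t_server + t_round / 2
    accuracy = t_round / 2 - t_min
    return estimate, accuracy

def symmetric_offset(t3, t2, t1, t0):
    """t3: A sends m; t2: B receives m; t1: B sends m'; t0: A receives m'."""
    o_i = ((t2 - t3) + (t1 - t0)) / 2
    d_i = (t2 - t3) + (t0 - t1)       # = t + t', the two one-way delays
    return o_i, d_i

# times in ms: server says 100, round trip 20, minimum one-way delay 5
print(cristian(t_server=100.0, t_round=20.0, t_min=5.0))   # (110.0, 5.0)
# B's clock is 5 ahead of A's; each one-way delay is 1
print(symmetric_offset(t3=10, t2=16, t1=17, t0=13))        # (5.0, 2)
```

In the second call the estimate recovers the true offset exactly because t = t′; with asymmetric delays the estimate would be off by (t′ − t)/2, exactly as the accuracy bound states.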
Symmetric Mode (contd.)

q The eight most recent pairs <oi, di> are retained; the value of oi that corresponds to the minimum di is chosen to estimate o.
q Timing messages are delivered using UDP.

Logical Time

Motivation
q Event ordering is linked with the concept of causality.
Ø Saying that event a happened before event b is the same as saying that event a could have affected the outcome of event b.
Ø If events a and b happen at processes that do not exchange any data, their exact ordering is not important.
Observation
q If two events occurred at the same process, then they occurred in the order in which it observes them.
q Whenever a message is sent between processes, the event of sending the message occurred before the event of receiving it.

Causal ordering (happened-before relation)

1. If process p executes x before y, then x → y.
2. For any message m, send(m) → rcv(m).
3. If x → y and y → z, then x → z.
Two events a and b are said to be concurrent if neither a → b nor b → a.

[Figure: events a, c, d, b, e at processes p, q, r, connected by messages]

Logical Clocks

A logical clock Cp of a process p is a software counter that is used to timestamp the events executed by p so that the happened-before relation is respected by the timestamps.
The rules for increasing the counter are as follows:
• LC1: Cp is incremented before each event issued at process p.
• LC2: When a process q sends a message m to p, it piggybacks on m the current value t of Cq; on receiving m, p advances its Cp to max(t, Cp) (and then applies LC1 to the receive event).

Illustration of Timestamps

P1:                    timestamp    P2:                    timestamp
x:=1;                  1            y:=2;                  1
y:=0;                  2            receive(x) from P1;    4
send (x) to P2;        3            x:=4;                  5
receive(y) from P2;    7            send (x+y) to P1;      6
x:=x+y;                8

Reasoning about timestamps

Consequence: if a → b, then C(a) < C(b).
Does C(a) < C(b) imply a → b?
The partial ordering can be made total by additionally considering process ids.
Suppose event a is issued by process p, and event b by process q. Then the total ordering →t can be defined as follows:
a →t b iff C(a) < C(b), or C(a) = C(b) and ID(p) < ID(q).
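Rules LC1/LC2 and the two-process illustration can be reproduced with a small clock class. This is a sketch (the class and method names are mine); the event log below follows the illustration exactly.

```python
# Lamport logical clock following rules LC1/LC2 above.

class LamportClock:
    def __init__(self):
        self.c = 0
    def tick(self):                 # LC1: increment before each local event
        self.c += 1
        return self.c
    def send(self):
        return self.tick()          # timestamp piggybacked on the message
    def recv(self, t):              # LC2: advance to max(t, C_p), then LC1
        self.c = max(t, self.c)
        return self.tick()

p1, p2 = LamportClock(), LamportClock()
ts = {}
ts["x:=1"] = p1.tick()              # 1
ts["y:=0"] = p1.tick()              # 2
m1 = ts["send x"] = p1.send()       # 3
ts["y:=2"] = p2.tick()              # 1
ts["recv x"] = p2.recv(m1)          # 4
ts["x:=4"] = p2.tick()              # 5
m2 = ts["send x+y"] = p2.send()     # 6
ts["recv y"] = p1.recv(m2)          # 7
ts["x:=x+y"] = p1.tick()            # 8
print([ts[e] for e in ("send x", "recv x", "send x+y", "recv y")])  # [3, 4, 6, 7]
```

Note that `y:=2` at P2 and `y:=0` at P1 both get timestamp values that do not order them causally: equal or interleaved timestamps across processes are exactly why the total ordering needs process ids as a tie-breaker.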
Total Ordering of Events

q Happened-before defines a partial ordering of events (arising from causal relationships).
q We can use logical clocks satisfying the Clock Condition to place a total ordering on the set of all system events.
Ø Simply order the events by the times at which they occur
Ø To break ties, Lamport proposed the use of any arbitrary total ordering of the processes, e.g., by process id
q Using this method, we can assign a unique timestamp to each event in a distributed system to provide a total ordering of all events
q Very useful in distributed systems
Ø e.g., solving the mutual exclusion problem

Vector Timestamps

Each process Pi maintains a vector of clocks VTi such that VTi[k] represents a count of the events that have occurred at Pk and that are known at Pi.
The vector is updated as follows:
1. Every process Pi initializes its VTi to all zeros.
2. When Pi generates a new event, it increments VTi[i] by 1; VTi is assigned as the timestamp of the event. Message-sending events are timestamped, and the timestamp is piggybacked on the message.
3. When Pj receives a message with timestamp vt, it updates its vector clock as follows: VTj[k] := max(VTj[k], vt[k]) for every k (and, the receive itself being an event, increments VTj[j]).

Illustration of Vector Timestamps

P1:                    timestamp    P2:                    timestamp
x:=1;                  〈1,0〉        y:=2;                  〈0,1〉
y:=0;                  〈2,0〉        receive(x) from P1;    〈3,2〉
send (x) to P2;        〈3,0〉        x:=4;                  〈3,3〉
receive(y) from P2;    〈4,4〉        send (x+y) to P1;      〈3,4〉
x:=x+y;                〈5,4〉        x:=x+y;                〈3,5〉

Reasoning about vector timestamps

Partial orders '≤' and '<' on two vector timestamps u, v are defined as follows:
u ≤ v iff u[k] ≤ v[k] for all k, and u < v iff u ≤ v and u ≠ v.
Property: e happened-before f if, and only if, vt(e) < vt(f).
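The vector-clock rules and the illustration above can likewise be reproduced in a few lines. This is a sketch with my own names; the final comparison shows the property that Lamport clocks lack: vector timestamps detect concurrency.

```python
# Vector clock following the update rules above, replaying the illustration.

class VectorClock:
    def __init__(self, i, n):
        self.i, self.v = i, [0] * n
    def tick(self):                       # local or send event: own entry + 1
        self.v[self.i] += 1
        return tuple(self.v)
    def recv(self, vt):                   # merge elementwise, then count the
        self.v = [max(a, b) for a, b in zip(self.v, vt)]
        return self.tick()                # receive itself as an event

def leq(u, v):                            # u <= v iff u[k] <= v[k] for all k
    return all(a <= b for a, b in zip(u, v))

p1, p2 = VectorClock(0, 2), VectorClock(1, 2)
p1.tick(); p1.tick()                      # x:=1, y:=0
m1 = p1.tick()                            # send(x):    (3, 0)
e_y2 = p2.tick()                          # y:=2:       (0, 1)
e_rx = p2.recv(m1)                        # receive(x): (3, 2)
p2.tick()                                 # x:=4
m2 = p2.tick()                            # send(x+y):  (3, 4)
e_ry = p1.recv(m2)                        # receive(y): (4, 4)
print(e_rx, e_ry)                         # (3, 2) (4, 4)
print(leq(m1, e_rx), leq(e_y2, m1))       # True False
```

`send(x)` happened-before `receive(x)`, and the timestamps reflect it; `y:=2` and `send(x)` are incomparable in both directions, so the vectors correctly report them as concurrent.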