Sample Solution for Midterm Examination

ECE 419S: Distributed Systems
Instructor: Cristiana Amza
Department of Electrical and Computer Engineering
University of Toronto

Problem number    1    2    3    4    5    6    total
Maximum Score     6    8    18   28   20   20   100
Your Score

This exam is open textbook and open lecture notes. You have two hours to complete the exam. Use of computing and/or communicating devices is NOT permitted. You should not need any such devices. You can use a basic calculator if you feel it is absolutely necessary. Work independently. Do not remove any sheets from this test book. Answer all questions in the space provided. No additional sheets are permitted. Scratch space is available at the end of the exam. Write your name and student number in the space below. Do the same on the top of each sheet of this exam book.

Your Student Number
Your First Name
Your Last Name

Student Number:        Name:

Problem 1. Basic Facts (6 points)

(a) (3 points) How do idempotent operations simplify failure recovery and the state which the server needs to keep in case of crashes? Please give an example of use in the case of recovery from a failure or crash.

Answer: An idempotent operation has the same effect whether it is executed once or many times. Hence, the server does not need to keep state on whether an operation was already executed, and does not need to filter duplicates. This simplifies the implementation of at-least-once semantics. For example, after a crash or a lost reply, a client can simply resend a request that writes a given block at a fixed file offset, and the server can safely re-execute it.

(b) (3 points) Briefly state all fault tolerance mechanisms that are necessary to implement at-least-once RPC/RMI invocation semantics over an underlying TCP/IP network protocol.

Answer: TCP/IP provides reliable and in-order delivery. Hence, as long as the connection is not broken, the procedure will be executed exactly once. If the connection is broken, the client needs to set up another TCP/IP connection and retransmit the RPC request. The server needs to re-execute the request.

Problem 2.
The End-to-End Argument (8 Points)

In the paper by Saltzer et al., titled “End-to-End Arguments in System Design”, the following example is presented:

One network system involving several local networks connected by gateways used a packet checksum on each hop from one gateway to the next, on the assumption that the primary threat to correct communication was corruption of bits during transmission. Application programmers, aware of this checksum, assumed that the network was providing reliable transmission of files. However, files being transmitted were sometimes not correctly received.

(a) (4 points) Give an explanation of why this problem could have occurred, and discuss how this problem can be solved. Briefly discuss the overheads that your solution would introduce.

Answer: The problem may be due to dropped packets, or packets arriving out of order. The application can compute and verify a checksum for the whole file. The overhead lies in the redundancy of checksumming efforts at both the application level and the network level.

(b) (4 points) From this example, could we conclude that the lower levels should not play a part in achieving reliability? Please justify your answer.

Answer: No. Reliability functions implemented at the lower level may be useful for performance reasons. If an error on a given packet is detected by a gateway, it can ask for selective retransmission of that packet. If instead the application were solely in charge of reliability, then any error would result in resending the whole file, which could be hundreds of MB (and thousands of packets) in size.

Problem 3. Applications of Physical Clocks. (18 points)

A scheme for implementing at-most-once message delivery uses physically synchronized clocks to reject duplicate messages. Processes place their local clock value (a physical clock timestamp) in the messages they send.
Each receiver keeps a table giving, for each sending process, the largest message timestamp the receiver has seen from that sender (one entry per sender). Assume that clocks are synchronized to within 100 ms, and that messages can arrive at most 50 ms after transmission. For all parts of this problem, you can assume no failures of sender or receiver occur.

(a) (8 points) When should a receiving process ignore a message bearing a timestamp T, if the timestamp of the last message received from the corresponding sender process that was recorded in its table was T′?

Answer: If T ≤ T′ then the message should be ignored/rejected.

(b) (10 points) Assume that each receiver wants to selectively discard (e.g., for garbage collection purposes) sender entries from its table, but still maintain the at-most-once message delivery guarantee. When may a receiver remove an entry with timestamp 175,000 (ms) from its table? (Hint: use the receiver’s local clock value, and give the earliest receiver time when this can be done.)

Answer: When the receiver’s clock reads r, the earliest message timestamp that could still arrive is r - 100 - 50 = r - 150, because the maximum transmission time is 50 ms and clocks are synchronized to within 100 ms of each other. From then on, any incoming timestamp from the sender should be greater than r - 150, so the receiver does not need to keep an entry with timestamp ≤ r - 150 in order to reject duplicate messages. Hence, if a receiver is going to remove an entry with timestamp 175,000 (so that it cannot mistakenly accept a duplicate), we need r - 150 = 175,000, i.e., the earliest receiver time is r = 175,150.

Problem 4. Distributed Shared Memory (28 points)

Part A (20 points)

Assume that we use one of the Ivy or Munin Distributed Shared Memory systems to implement memory consistency between processes P1 and P2, running on two nodes, respectively. Ivy is a page-based DSM that supports sequential consistency.
It implements a single-writer, multiple-reader, page-invalidate protocol. Munin is the first DSM implementing a form of Weak consistency called Eager Release Consistency and also the first to implement a multiple writer protocol based on twinning and diffing. Assume a page invalidate protocol for Munin as well.

(a) (8 points) What is false sharing and what type of overheads, in terms of computer resources used, can false sharing cause for each of these systems? Please briefly describe the type of overheads and how/when they occur.

(2 points) False sharing definition:
Answer: False sharing: concurrent accesses to different parts of the same page by different processes (nodes), where at least one access is a write (read-write and write-write false sharing).

(3 points) Ivy overheads:
Answer: Network overheads (useless messages consume network bandwidth): invalidations and page fetches due to the single writer protocol for write-write false sharing, and useless page fetches for read-write false sharing.

(3 points) Munin overheads:
Answer: Network overheads (reduced due to Eager Release Consistency, but still present): invalidations sent at synchronization for falsely shared pages together with the ones for the truly shared pages. CPU and memory overheads: due to the multiple writer protocol, for computing and (temporarily) storing twins and diffs.

(b) (6 points) Consider the two pieces of code below and assume that they execute at the same time (concurrently) on the two different nodes running processes P1 and P2, respectively. Assume that the array a fits entirely on one memory page and that the two nodes are homogeneous.

    P1: for (int i = 0; i < n; i+=2) a[i] = i;
    P2: for (int i = 1; i < n; i+=2) a[i] = i;

What is the maximum number of messages that could be sent by the two processes in total during the concurrent execution of the above pieces of code in the case of each of the two DSM systems (Ivy and Munin)? Please specify the message types.
You can assume that no other code executes on either node, besides what is shown.

(3 points) Ivy:
Answer: Count each invalidation as 1 message (assume no invalidation ACKs) and each page fetch as 2 messages. The maximum number of messages occurs when the page ping-pongs n times, once with each write access, i.e., after the writes to a[0] (page ownership goes from P1 → P2), a[1] (P2 → P1), a[2] (P1 → P2), a[3] (P2 → P1), etc. This results in 3n messages. Other interleavings are also valid under Sequential Consistency and possible when actually running the protocol, depending on how long a message takes on the network to get from one node to the other. For example: a[0], a[2], a[4], a[6] (P1 → P2), a[1], a[3], a[5] (P2 → P1), a[8], ... Such interleavings result in fewer messages.

(3 points) Munin:
Answer: No messages are exchanged in Munin until a synchronization actually occurs, because of the Weak Consistency (Eager Release Consistency) design. Since the code contains no synchronization, there are zero messages. It should be clear to the programmer that no variable/location is accessed by both nodes. The code is an initialization of the array, where each node does half of it, for maximum parallelism. Variable i, the loop index, is allocated on the stack (declared inside the loop), hence cannot be shared even across threads inside a node, let alone across nodes.

(c) (6 points) What enhancements would you need to add to the implementation of Munin in order to support Distributed Shared Memory over heterogeneous computers?

Answer: Similar to RPC stubs, we would need to do marshalling and unmarshalling when creating and transferring diffs. These operations need to take into account the little/big endian differences in representation. We also need to agree beforehand on a common page size for all nodes.

Part B (8 points)

Given the following code segments, list all results that are possible (or not possible) under sequential consistency (SC).
Assume that all variables are initialized to 0 before this code is reached.

    P1        P2        P3
    A=1       x=A       y=B
              B=1       z=A

Answer: 7 possible results: A=1 and B=1 in every case, with every combination of x=0/1, y=0/1, z=0/1 except x=1, y=1, z=0. That combination is impossible because x=1 means A=1 precedes B=1, and y=1 means B=1 precedes the read z=A, so z must read 1.

Problem 5. Mutual Exclusion (20 points)

Ricart and Agrawala’s is a fully distributed mutual exclusion algorithm making use of multicast and Lamport clocks. It guarantees that requests to enter a critical section are granted according to the happens-before order that is defined by timestamps read from the logical Lamport clock. The algorithm assumes reliable and in-order delivery. In a certain system, each process typically uses a critical section many times, by re-entering almost immediately after exit and before another process requires that critical section.

(a) (6 points) How many messages does the Ricart and Agrawala algorithm require per access to the critical section, in this case, if N is the number of nodes in this distributed system?

Answer: We still need 2(N-1) messages, i.e., N-1 requests and N-1 replies, for re-acquiring a lock (gaining entry to the critical section), even if no one else is requesting the lock upon our previous release.

(b) (14 points) Describe how you would modify the algorithm in order to improve its performance in this case. Explain any additional states or actions of your protocol.

Answer: Add a state to the existing algorithm to clearly delineate this case: released-and-no-one-wants (renow). Then, there are three changes to the existing algorithm:
i. On local lock release, if the request queue is empty, enter the renow state.
ii. On local lock request, if in the renow state, enter the critical section without multicasting requests.
iii. If in the renow state and the node receives a multicast request, release the lock (mark yourself as no longer in the renow state) and reply immediately.

Problem 6. Applications of Logical Clocks. (20 points)

Imagine a distributed implementation of a war game in which tanks fire at each other.
Each participant uses a separate computer that communicates with others over the network, and local events are multicast to all game participants. The implementation is such that when a player A fires at B, A multicasts the fire event. Player B, upon receiving the fire event, evaluates the damage, and multicasts the event destroyed, if necessary. This implementation may result in a temporal anomaly due to network delays. As illustrated in the figure, player C may observe the damage (“destroyed”) on B before the firing from A. Design and explain a solution to avoid temporal anomalies by displaying the events in a consistent order, respecting causality, on all nodes. You can assume reliable message delivery in your solution. You should not use any additional assumptions about the underlying network protocol. However, if you do, please state them clearly.

(a) (10 points) Specify the actions of your protocol, e.g., what state is maintained by each player, what events are significant, how you update the state on each event, any additional information transmitted on messages, etc.

Answer:
• On initialization, each process i sets its vector timestamp V_i[j] = 0 (j = 1, 2, ..., N).
• Process i: before multicasting each fire or destroyed event, set V_i[i] = V_i[i] + 1 and include the new timestamp in the message.
• Process i: for each multicast message received from process j with timestamp V_j, place the message in the hold-back queue; wait until V_j[j] == V_i[j] + 1 and V_j[k] ≤ V_i[k] (k ≠ j); then deliver the message to the application (display the event) and set V_i[j] = V_i[j] + 1.

(b) (10 points) Based on the actions of your protocol explained above, please make sure to answer the following question (if not already answered): How does a player decide when to deliver an event to its display engine?

Answer: Wait until there is no gap between the V_j[j] of the message and the V_i[j] of this player.
As stated above, this happens when the respective positions in the two vector timestamps (VTSs) are consecutive, i.e., V_j[j] == V_i[j] + 1 and V_j[k] ≤ V_i[k] (k ≠ j). The player can deliver this event to its display engine because all previous events of j, and all events of other players that j had delivered before sending, have already been delivered.
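
The delivery rule in Problem 6 can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the class name Player, the methods multicast and receive, and the choice to have a sender display its own event immediately are all assumptions made for the example. Players are indexed from 0 here, and the network is simulated by passing message tuples by hand.

```python
class Player:
    """One game participant holding a vector timestamp and a hold-back queue."""

    def __init__(self, pid, n):
        self.pid = pid            # index of this player (0-based, an assumption)
        self.clock = [0] * n      # V_i, one entry per player
        self.holdback = []        # received but not yet deliverable messages
        self.displayed = []       # events handed to the display engine, in order

    def multicast(self, event):
        # Increment own entry, display locally (assumption), and stamp the message.
        self.clock[self.pid] += 1
        self.displayed.append(event)
        return (self.pid, list(self.clock), event)

    def _deliverable(self, sender, ts):
        # V_j[j] == V_i[j] + 1 and V_j[k] <= V_i[k] for every k != j
        if ts[sender] != self.clock[sender] + 1:
            return False
        return all(ts[k] <= self.clock[k]
                   for k in range(len(ts)) if k != sender)

    def receive(self, msg):
        # Queue the message, then deliver everything that has become deliverable.
        self.holdback.append(msg)
        progress = True
        while progress:
            progress = False
            for held in list(self.holdback):
                sender, ts, event = held
                if self._deliverable(sender, ts):
                    self.holdback.remove(held)
                    self.displayed.append(event)
                    self.clock[sender] += 1
                    progress = True


# Scenario from the problem: C receives "destroyed" before "fire".
a, b, c = Player(0, 3), Player(1, 3), Player(2, 3)
fire = a.multicast("A fires at B")
b.receive(fire)
destroyed = b.multicast("B destroyed")
c.receive(destroyed)   # held back: C has not yet seen A's fire event
c.receive(fire)        # delivers fire, which then releases destroyed
a.receive(destroyed)
print(c.displayed)
```

In this run, player C holds the destroyed event in its hold-back queue until the fire event arrives, so its display order matches the causal order even though the network delivered the messages in the opposite order.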