Distributed Hash Tables (DHTs): Chord & CAN
CS 699/IT 818, Sanjeev Setia

Acknowledgements
- The following slides are borrowed or adapted from talks by Robert Morris (MIT) and Sylvia Ratnasamy (ICSI)

Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
- Robert Morris, Ion Stoica, David Karger, M. Frans Kaashoek, Hari Balakrishnan
- MIT and Berkeley
- SIGCOMM Proceedings, 2001

Chord Simplicity
- Resolution entails participation by O(log(N)) nodes
- Resolution is efficient when each node has accurate information about O(log(N)) other nodes
- Resolution is still possible when each node has accurate information about only 1 other node
- "Degrades gracefully"

Chord Properties
- Efficient: O(log(N)) messages per lookup, where N is the total number of servers
- Scalable: O(log(N)) state per node
- Robust: survives massive failures
- Proofs are in the paper / tech report, assuming no malicious participants

Chord Algorithms
- Basic lookup
- Node joins
- Stabilization
- Failures and replication

Chord IDs
- Key identifier = SHA-1(key)
- Node identifier = SHA-1(IP address)
- Both are uniformly distributed
- Both exist in the same ID space
- How to map key IDs to node IDs?

Consistent Hashing [Karger 97]
- Target: web page caching
- Like normal hashing, assigns items to buckets so that each bucket receives roughly the same number of items
- Unlike normal hashing, a small change in the bucket set does not induce a total remapping of items to buckets

Consistent Hashing [Karger 97]
- [Figure: circular 7-bit ID space with nodes N32, N90, N105 and keys K5, K20, K80; Key 5 is stored at Node 105]
- A key is stored at its successor: the node with the next-higher ID

Basic lookup
- [Figure: "Where is key 80?" is forwarded around the ring from N120 through N10, N32, N60, N80 until N90 answers "N90 has K80"]

Simple lookup algorithm

    Lookup(my-id, key-id)
      n = my successor
      if my-id < n < key-id
        call Lookup(key-id) on node n   // next hop
      else
        return my successor             // done

- Correctness depends only on successors

"Finger table" allows log(N)-time lookups
- [Figure: node N80 keeps fingers at distances 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128 of the ID space]

Finger i points to successor of n + 2^(i-1)
- [Figure: N120 is N80's finger for target 112]

Lookup with fingers

    Lookup(my-id, key-id)
      look in local finger table for
        highest node n s.t. my-id < n < key-id
      if n exists
        call Lookup(key-id) on node n   // next hop
      else
        return my successor             // done

Lookups take O(log(N)) hops
- [Figure: Lookup(K19) on a ring with nodes N5, N10, N20, N32, N40, N60, N80, N99, N110]
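To make the lookup rule on these slides concrete, here is a minimal Python sketch of Chord-style lookup over a 7-bit ID ring. It is an illustration only, not the MIT implementation: the node IDs, the global NODES list, and the helpers (successor, in_open_interval, fingers) are assumptions chosen for this example, and finger tables are recomputed from a global view rather than stored per node.

```python
# Minimal sketch of Chord-style lookup over a 7-bit ID ring (not the MIT code).
# Node IDs and helper names are illustrative assumptions.

M = 7                      # bits in the ID space
SPACE = 1 << M             # 128 possible IDs

NODES = sorted([5, 10, 20, 32, 40, 60, 80, 99, 110])  # example ring

def successor(ident):
    """Node responsible for `ident`: first node clockwise at or after it."""
    for n in NODES:
        if n >= ident:
            return n
    return NODES[0]        # wrap around the ring

def in_open_interval(x, a, b):
    """True if x lies strictly between a and b going clockwise (mod 2^M)."""
    if a < b:
        return a < x < b
    return x > a or x < b  # interval wraps past zero

def fingers(n):
    """Finger i of node n points to successor(n + 2^(i-1)), i = 1..M.
    A real node stores this table; here it is derived from the global view."""
    return [successor((n + (1 << (i - 1))) % SPACE) for i in range(1, M + 1)]

def lookup(my_id, key_id, hops=0):
    """Follow the 'highest finger preceding the key' rule from the slides."""
    succ = successor((my_id + 1) % SPACE)
    if in_open_interval(key_id, my_id, succ) or key_id == succ:
        return succ, hops + 1                   # successor owns the key: done
    for f in reversed(fingers(my_id)):          # highest finger first
        if in_open_interval(f, my_id, key_id):
            return lookup(f, key_id, hops + 1)  # next hop
    return succ, hops + 1                       # fall back to successor

if __name__ == "__main__":
    owner, hops = lookup(32, 19)
    print(f"key 19 is stored at N{owner} after {hops} hops")  # expect N20
```

Each finger hop roughly halves the remaining ID-space distance to the key, which is where the O(log(N)) hop bound above comes from.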
Node Join - Linked List Insert
- [Figure: new node N36 joining a ring containing N25 and N40, which stores K30 and K38]
- 1. N36 performs Lookup(36) to find its place on the ring

Node Join (2)
- 2. N36 sets its own successor pointer

Node Join (3)
- 3. Copy keys 26..36 from N40 to N36

Node Join (4)
- 4. Set N25's successor pointer
- Update finger pointers in the background
- Correct successors produce correct lookups

Stabilization
- Case 1: finger tables are reasonably fresh
- Case 2: successor pointers are correct; fingers are inaccurate
- Case 3: successor pointers are inaccurate or key migration is incomplete
- The stabilization algorithm periodically verifies and refreshes node knowledge:
  - successor pointers
  - predecessor pointers
  - finger tables

Failures and Replication
- [Figure: ring with N10, N80, N85, N102, N113, N120; after failures, N80 doesn't know its correct successor, so Lookup(90) is incorrect]

Solution: successor lists
- Each node knows its r immediate successors
- After a failure, it will know the first live successor
- Correct successors guarantee correct lookups
- The guarantee is with some probability

Choosing the successor list length
- Assume 1/2 of the nodes fail
- P(successor list all dead) = (1/2)^r, i.e. P(this node breaks the Chord ring)
  - depends on independent failure
- P(no broken nodes) = (1 - (1/2)^r)^N
- r = 2 log(N) makes prob. = 1 - 1/N

Chord status
- Working implementation as part of CFS
- Chord library: 3,000 lines of C++
- Deployed in a small Internet testbed
- Includes:
  - correct concurrent join/fail
  - proximity-based routing for low delay
  - load control for heterogeneous nodes
  - resistance to spoofed node IDs

Experimental overview
- Quick lookup in large systems
- Low variation in lookup costs
- Robust despite massive failure
- See paper for more results
- Experiments confirm theoretical results

Chord lookup cost is O(log N)
- [Figure: average messages per lookup vs. number of nodes; the constant is 1/2]

Failure experimental setup
- Start 1,000 CFS/Chord servers
  - successor list has 20 entries
- Wait until they stabilize
- Insert 1,000 key/value pairs
  - five replicas of each
- Stop X% of the servers
- Immediately perform 1,000 lookups

Massive failures have little impact
- [Figure: failed lookups (percent) vs. failed nodes (percent, 5 to 50)]
- (1/2)^6 is 1.6%

Latency Measurements
- 180 nodes at 10 sites in a US testbed
- 16 queries per physical site for random keys
- Maximum < 600 ms
- Minimum > 40 ms
- Median = 285 ms
- Expected value = 300 ms (5 round trips)

Chord Summary
- Chord provides peer-to-peer hash lookup
- Efficient: O(log(N)) messages per lookup
- Robust as nodes fail and join
- Good primitive for peer-to-peer systems
- http://www.pdos.lcs.mit.edu/chord

A Scalable, Content-Addressable Network
- Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker

Content-Addressable Network (CAN)
- CAN: Internet-scale hash table
- Interface
  - insert(key, value)
  - value = retrieve(key)
- Properties
  - scalable
  - operationally simple
  - good performance
- Related systems: Chord/Pastry/Tapestry/Buzz/Plaxton ...

Problem Scope
- Design a system that provides the interface
  - scalability
  - robustness
  - performance
  - security
- Application-specific, higher-level primitives
  - keyword searching
  - mutable content
  - anonymity

CAN: basic idea
- [Figures: a set of nodes, each holding (K,V) pairs; insert(K1,V1) is routed to one node, which stores (K1,V1); retrieve(K1) is routed to that same node]

CAN: solution
- Virtual Cartesian coordinate space
- The entire space is partitioned amongst all the nodes
  - every node "owns" a zone in the overall space
- Abstraction
  - can store data at "points" in the space
  - can route from one "point" to another
  - point = the node that owns the enclosing zone

CAN: simple example
- [Figures: a 2-d space first owned entirely by node 1, then repeatedly split as nodes 2, 3, 4, ... join, each taking half of an existing zone]

CAN: simple example (insert)

    node I::insert(K,V)
      (1) a = hx(K); b = hy(K)
      (2) route (K,V) -> (a,b)
      (3) (a,b) stores (K,V)

CAN: simple example (retrieve)

    node J::retrieve(K)
      (1) a = hx(K); b = hy(K)
      (2) route "retrieve(K)" to (a,b)

CAN
- Data stored in the CAN is addressed by name (i.e. key), not location (i.e. IP address)
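The insert/retrieve steps above can be sketched in a few lines of Python. This is a toy illustration, not the CAN implementation: the 2-d unit square, the SHA-1-based hx/hy point hashes, and the hard-coded zone map standing in for real CAN routing are all assumptions; the slides do not specify these details.

```python
# Minimal sketch of CAN-style insert/retrieve in a 2-d unit square.
# hx/hy built from SHA-1 are an assumption; the slides do not specify them.
import hashlib

def _h(key: str, salt: str) -> float:
    """Hash a key to a coordinate in [0, 1) using SHA-1 (illustrative choice)."""
    digest = hashlib.sha1((salt + key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def hx(key): return _h(key, "x")
def hy(key): return _h(key, "y")

# Example partition of the unit square: (x_lo, x_hi, y_lo, y_hi) -> node.
# In a real CAN each node knows only its own zone and its neighbors.
ZONES = {
    (0.0, 0.5, 0.0, 1.0): "node-A",
    (0.5, 1.0, 0.0, 0.5): "node-B",
    (0.5, 1.0, 0.5, 1.0): "node-C",
}
STORE = {node: {} for node in ZONES.values()}

def owner(a, b):
    """Find the node whose zone encloses point (a, b); stands in for CAN routing."""
    for (x_lo, x_hi, y_lo, y_hi), node in ZONES.items():
        if x_lo <= a < x_hi and y_lo <= b < y_hi:
            return node
    raise ValueError("point outside the space")

def insert(key, value):
    a, b = hx(key), hy(key)          # (1) hash the key to a point
    node = owner(a, b)               # (2) route to the zone owner
    STORE[node][key] = value         # (3) the owner stores (K, V)
    return node

def retrieve(key):
    a, b = hx(key), hy(key)          # same point as insert
    return STORE[owner(a, b)].get(key)

if __name__ == "__main__":
    n = insert("song.mp3", "data-bytes")
    print(n, retrieve("song.mp3"))   # retrieval is by name, not by location
```

In a real CAN, owner() would be replaced by hop-by-hop greedy routing through neighboring zones, as described on the routing slides that follow.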
CAN: routing table
- [Figure: a node and the neighboring zones it keeps state for]

CAN: routing
- [Figure: a message is forwarded through neighboring zones from (a,b) toward (x,y)]
- A node only maintains state for its immediate neighboring nodes

CAN: node insertion
- Bootstrap node, new node
- 1) Discover some node "I" already in the CAN
- 2) Pick a random point (p,q) in the space
- 3) I routes to (p,q) and discovers node J
- 4) Split J's zone in half... the new node owns one half

CAN: node insertion
- Inserting a new node affects only a single other node and its immediate neighbors

CAN: node failures
- Need to repair the space
  - recover the database
    - soft-state updates
    - use replication; rebuild the database from replicas
  - repair routing
    - takeover algorithm

CAN: takeover algorithm
- Simple failures
  - know your neighbor's neighbors
  - when a node fails, one of its neighbors takes over its zone
- More complex failure modes
  - simultaneous failure of multiple adjacent nodes
  - scoped flooding to discover neighbors
  - hopefully, a rare event

CAN: node failures
- Only the failed node's immediate neighbors are required for recovery

Design recap
- Basic CAN
  - completely distributed
  - self-organizing
  - nodes only maintain state for their immediate neighbors
- Additional design features
  - multiple, independent spaces (realities)
  - background load-balancing algorithm
  - simple heuristics to improve performance

Evaluation
- Scalability
- Low latency
- Load balancing
- Robustness

CAN: scalability
- For a uniformly partitioned space with n nodes and d dimensions
  - per node, the number of neighbors is 2d
  - average routing path is (d/4)(n^(1/d)) hops
  - simulations show that the above results hold in practice
- Can scale the network without increasing per-node state
- Chord/Plaxton/Tapestry/Buzz: log(n) neighbors with log(n) hops

CAN: low-latency
- Problem
  - latency stretch = (CAN routing delay) / (IP routing delay)
  - application-level routing may lead to high stretch
- Solution
  - increase dimensions
  - heuristics
    - RTT-weighted routing
    - multiple nodes per zone (peer nodes)
    - deterministically replicate entries

CAN: low-latency
- [Figures: latency stretch vs. number of nodes (16K to 131K), with and without heuristics, for 2 and 10 dimensions]

CAN: load balancing
- Two pieces
  - dealing with hot-spots
    - popular (key, value) pairs
    - nodes cache recently requested entries
    - an overloaded node replicates popular entries at its neighbors
  - uniform coordinate space partitioning
    - uniformly spreads (key, value) entries
    - uniformly spreads out routing load

Uniform Partitioning
- Added check
  - at join time, pick a zone
  - check the neighboring zones
  - pick the largest zone and split that one

Uniform Partitioning
- [Figure: 65,000 nodes, 3 dimensions; percentage of nodes vs. zone volume (V/16 to 8V, where V = total volume / n), with and without the check]

CAN: Robustness
- Completely distributed
  - no single point of failure
- Not exploring database recovery
- Resilience of routing
  - can route around trouble

Routing resilience
- [Figures: a route from source to destination; after node failures, the message is routed around the failed region]

Routing resilience

    node X::route(D)
      if X cannot make progress toward D
        - check if any neighbor of X can make progress
        - if yes, forward the message to one such neighbor

Routing resilience
- [Figures: CAN size = 16K nodes; Pr(successful routing) vs. number of dimensions for Pr(node failure) = 0.25, and vs. Pr(node failure) for 10 dimensions]
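The node X::route(D) rule above is greedy forwarding plus a one-step detour. Below is a small Python sketch of that behavior on a toy 4x4 grid of zones; the grid layout, the set of failed zones, and the Manhattan-distance progress metric are assumptions made for illustration, not the simulator behind the plots.

```python
# Greedy CAN-style routing on a 4x4 grid of unit zones (illustrative assumptions:
# the grid, the failure set, and the distance metric are not from the slides).

FAILED = {(1, 1), (2, 1)}            # zones whose owners have failed

def neighbors(z):
    """Grid neighbors that share a border with zone z and are still alive."""
    x, y = z
    cand = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [c for c in cand
            if 0 <= c[0] < 4 and 0 <= c[1] < 4 and c not in FAILED]

def dist(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def route(src, dst, max_hops=32):
    """Forward greedily toward dst; if stuck, try a neighbor that can make progress."""
    path = [src]
    cur = src
    while cur != dst and len(path) < max_hops:
        nbrs = neighbors(cur)
        progress = [n for n in nbrs if dist(n, dst) < dist(cur, dst)]
        if progress:
            cur = min(progress, key=lambda n: dist(n, dst))
        else:
            # X cannot make progress: forward to a neighbor that can (if any)
            detour = [n for n in nbrs
                      if any(dist(m, dst) < dist(n, dst) for m in neighbors(n))]
            if not detour:
                return None          # routing failed
            cur = detour[0]
        path.append(cur)
    return path if cur == dst else None

if __name__ == "__main__":
    print(route((0, 1), (3, 1)))     # detours around the failed zones
```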
Summary
- CAN
  - an Internet-scale hash table
  - potential building block in Internet applications
- Scalability
  - O(d) per-node state
- Low-latency routing
  - simple heuristics help a lot
- Robust
  - decentralized, can route around trouble
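As a quick numeric check on the scaling claims (2d neighbors per node and roughly (d/4)(n^(1/d)) hops per route, from the scalability slide), the short Python sketch below tabulates both quantities for a few example sizes; the particular n and d values are arbitrary, not measurements from the talk.

```python
# Tabulate CAN's per-node state (2d neighbors) and expected path length
# ((d/4) * n^(1/d) hops) for a few example sizes; values are illustrative only.

def can_neighbors(d: int) -> int:
    return 2 * d

def can_path_hops(n: int, d: int) -> float:
    return (d / 4) * n ** (1 / d)

if __name__ == "__main__":
    for n in (16_000, 131_000, 1_000_000):
        for d in (2, 10):
            print(f"n={n:>9,} d={d:>2}: "
                  f"{can_neighbors(d):>2} neighbors, "
                  f"{can_path_hops(n, d):6.1f} hops")
```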