Grappa: enabling next-generation analytics tools via latency-tolerant distributed shared memory
Jacob Nelson, Brandon Myers, Brandon Holt, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin
{nelson, bholt, bdmyers, preston, luisceze, skahan, oskin}@cs.washington.edu
University of Washington (Sampa group)
More info, tech report, downloads: http://sampa.cs.washington.edu  http://grappa.io

Grappa is a modern take on software distributed shared memory (DSM), tailored to exploit the parallelism inherent in data-intensive applications in order to overcome their poor locality and input-dependent load distribution.

Grappa differs from traditional DSMs in three ways:
• Instead of minimizing per-operation latency, Grappa tolerates latency with concurrency (latency-sensitive apps need not apply!).
• Grappa moves computation to data instead of caching data at the computation.
• Grappa operates at byte granularity rather than page granularity.
We provide sequential consistency for race-free programs using RPC-like atomic delegate operations, along with standard multithreaded synchronization primitives.

Programming example
Grappa's familiar single-system multithreaded C++ programming model enables easier development of analysis tools for terabyte-scale data. Here is an example that uses Grappa's C++11 library to build a distributed parallel wordcount-like application with a simple hash table:

    // distributed input array
    GlobalAddress<char> chars = load_input();

    // distributed hash table:
    using Cell = std::map<char,int>;
    GlobalAddress<Cell> cells = global_alloc<Cell>(ncells);

    forall(chars, nchars, [=](char& c) {
      // hash the char to determine destination
      size_t idx = hash(c) % ncells;
      delegate(cells + idx, [=](Cell& cell) { // runs atomically
        if (cell.count(c) == 0) cell[c] = 1;
        else cell[c] += 1;
      });
    });

Figure 1: "Character count" with a simple hash table implemented using Grappa's distributed shared memory. The cells are spread across the global heap on all nodes; each update is hashed to its destination cell and applied there by an atomic delegate.
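Reading results back uses the same mechanism: a delegate blocks the issuing task and can return a value computed at the data's home core. The helper below is a hypothetical follow-up to Figure 1, written in the same simplified API style as the example above (count_of itself is not part of the original code):

    // Hypothetical helper: look up one character's count in the table built above.
    // The lambda runs atomically on the cell's home core; the calling task
    // suspends until the result is returned.
    int count_of(GlobalAddress<Cell> cells, size_t ncells, char c) {
      size_t idx = hash(c) % ncells;   // same placement rule as the insert path
      return delegate(cells + idx, [=](Cell& cell) -> int {
        auto it = cell.find(c);
        return (it == cell.end()) ? 0 : it->second;
      });
    }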
System overview
Figure 2: Grappa design overview. Analytics frameworks (in-memory MapReduce, a relational query engine, a GraphLab-like engine) and irregular applications or native code run on top of three main components: a distributed shared memory built over each node's memory, a lightweight multithreading system with a global task pool, and a message aggregation layer over a communication layer with user-level access to the InfiniBand network. Visit http://grappa.io for more information on memory allocation, data distribution, message aggregation, and hotspot avoidance via combining.

Distributed shared memory
Grappa may address memory in two ways: locally and globally, and the programmer sees a single global memory abstraction. Local memory is local to a single core within a node; accesses occur through conventional pointers, and applications use them for data known to be core-private, such as the stack associated with a task or global memory accessed from its own home core. Local pointers cannot access memory on other cores and are valid only on their home core. Grappa also allows memory on a core's stacks or heap to be exported to the global address space and made accessible to other cores across the system. This uses a traditional PGAS (partitioned global address space [29]) addressing model, where each global address encodes a rank in the job (or global process ID) and an address within that process.

Access to Grappa's distributed shared memory is provided through delegate operations [48, 52]: short operations performed at the memory location's home node. When the data access pattern has low locality, it is more efficient to modify the data on its home core than to bring a copy to the requesting core and return it after modification.

Grappa also supports symmetric allocations, which allocate space for a copy (or proxy) of an object on every core in the system. The behavior is identical to performing a local allocation on all cores, but the local addresses are guaranteed to be identical, so the object can be found on any core without communication. Symmetric objects are often treated as proxies to a global object, holding copies of constant data or transparently buffering operations; a separate publication [40] explores how they were used to implement Grappa's synchronized global data structures, including a vector and a hash map.

Putting it all together, the global, local, and symmetric heaps can all be used for a simple graph data structure. Vertices are allocated from the global heap, which automatically distributes them across nodes. Symmetric pointers are used for the objects that hold local information about the graph, such as the base pointer to the vertices, so that any core can reach them without communication. Finally, each vertex holds a vector allocated from its core's local heap, which other cores access by going through the vertex.
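Below is a minimal sketch of that layout in the simplified API style of Figure 1. The Vertex and Graph types, the symmetric_alloc call, and the example sizes are illustrative assumptions, not Grappa's exact API:

    const int64_t nv = 1000000;        // example vertex count

    struct Vertex {
      std::vector<int64_t> out_edges;  // lives in the owning core's local heap
      double rank;
    };

    struct Graph {                     // per-core metadata, one copy on every core
      GlobalAddress<Vertex> vs;        // base pointer to the distributed vertex array
      int64_t nv;                      // replicated so it can be read without communication
    };

    GlobalAddress<Vertex> vs = global_alloc<Vertex>(nv);  // global heap: spread across nodes
    GlobalAddress<Graph>  g  = symmetric_alloc<Graph>();  // symmetric heap: identical address on all cores
    // ... initialize each core's copy of g with { vs, nv } ...

    // The loop body runs where each vertex lives; other cores reach a vertex's
    // local-heap edge vector only by going through the vertex via a delegate.
    forall(vs, nv, [=](Vertex& v) {
      v.rank = 1.0 / nv;
    });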
Tasking system
The tasking system supports lightweight multithreading and global distributed work stealing: tasks can be stolen by any node in the system, which provides automated load balancing. Concurrency is expressed through cooperatively scheduled user-level threads, and Grappa runs thousands of threads per core to expose parallelism and mask network latencies. Threads that perform long-latency operations (i.e., remote memory accesses) automatically suspend while the operation is executing and wake up when the operation completes.

Communication layer
Key feature: message aggregation. Commodity networks have a limited message injection rate, so building larger packets out of requests from unrelated tasks is essential for small-message throughput (fine-grained random access to global memory). The main goal of the communication layer is therefore to aggregate small messages into large ones; the process is invisible to the application programmer, and its interface is based on active messages [69]. Outgoing messages are queued in per-destination lists on each core; the sending core serializes them locally into a buffer, the buffer moves over the network via MPI/RDMA to the destination, and the receiving core deserializes it and runs the message handlers on their home cores. Since aggregation and deaggregation must be very efficient, the process runs in parallel and carefully uses lock-free synchronization operations. Direct, user-level access to the InfiniBand network reduces software overhead; for portability, MPI [49] is used as the underlying messaging library and for process setup and tear-down.
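To make the interaction between tasking and aggregation concrete, here is a GUPS-style sketch in the simplified API style of Figure 1 (the table size N, num_updates, and next_random are illustrative assumptions):

    GlobalAddress<int64_t> A = global_alloc<int64_t>(N);   // table spread over all nodes

    forall(0, num_updates, [=](int64_t i) {
      size_t idx = next_random(i) % N;                     // deliberately poor locality
      // The delegate suspends this task until the remote increment completes;
      // thousands of peer tasks per core keep issuing requests in the meantime,
      // giving the aggregator enough traffic per destination to build large packets.
      delegate(A + idx, [](int64_t& x) { x++; });          // atomic increment on the home core
    });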
Three prototype data analytics tools
• In-memory MapReduce: ~150 lines of code, implemented as a forall loop over the inputs followed by a forall over the keys; iterative jobs are driven by the application programmer with a while loop around the map and reduce phases.
• Backend for the Raco relational query compiler: ~700 lines of code, translating physical query plans into C++11 code built from Grappa forall loops.
• GraphLab-like API: ~60 lines of code, implementing a synchronous engine with delta caching and random graph partitioning with no replication.

Evaluation
Experiments were run at Pacific Northwest National Laboratory on the PAL cluster (1.9 GHz AMD Interlagos processors, Mellanox ConnectX-2 InfiniBand network).
• Relational queries: the SP2Bench queries (Q1, Q2, Q3a-c, Q4, Q5a-b, Q9) on 16 nodes, compared with the Shark distributed query engine; Q4 is a large workload, so it was run with 64 nodes. A per-query breakdown splits execution time into application compute, network, data serialization, iteration, and other overheads to show where Grappa's speedup over Shark comes from.
• Graph analytics: PageRank, connected components, SSSP, and BFS on the 1.8-billion-edge Friendster social network graph and the 1.4-billion-edge Twitter follower graph, run on 31 nodes and compared with GraphLab under two partitioning strategies (pds and random). PageRank and BFS scale out to 128 nodes, with weak scaling measured on synthetic power-law graphs scaled proportionally to the node count. A BFS implemented natively on Grappa with a bottom-up optimization performs even better than the GraphLab-API version.
• Microbenchmarks: a GUPS-style workload of random atomic increments to global memory characterizes the communication layer, comparing aggregated and unaggregated messaging and the effect of prefetching on the achieved increment rate.
• Data-parallel (MapReduce): we pick k-means clustering as a test workload because it exercises all-to-all communication and iteration, clustering an 8.9 GB SeaFlow flow cytometry dataset on 64 nodes with two values of K (10 and 10000). As a reference point we compare against the SparkKMeans implementation; both versions use the same algorithm (map the points, then reduce to produce new cluster centers), with the iteration implemented by the application programmer as a while loop. Grappa-MapReduce is nearly an order of magnitude faster than the comparable Spark implementation: for example, 17.3 s per iteration at K = 10000 versus 170 s for Spark. Traces of concurrent tasks and mean aggregated packet size over an iteration show concurrency spiking as delegates are performed and the resulting aggregation; at these packet sizes (30-80 kB), most of the network stack overhead is amortized for both systems. A sketch of one iteration in this style appears below.
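Here is a rough sketch of one k-means iteration in the "forall over inputs, then forall over keys" style described above, using the simplified API of Figure 1. Point, Accum, load_points, nearest_center, new_centers, and the index-carrying forall form are illustrative assumptions; the real Grappa-MapReduce code differs in detail:

    struct Accum { Point sum; int64_t n; };                 // per-center reduction state

    GlobalAddress<Point> points = load_points();            // distributed input points
    GlobalAddress<Accum> accums = global_alloc<Accum>(K);   // one bucket per cluster center

    // Map: assign each point to its nearest center and emit into that center's bucket.
    forall(points, npoints, [=](Point& p) {
      int64_t k = nearest_center(p);                        // reads a replicated copy of the centers
      delegate(accums + k, [=](Accum& a) { a.sum += p; a.n += 1; });  // runs atomically
    });

    // Reduce: one task per key (center) recomputes that center from its bucket.
    forall(accums, K, [=](int64_t k, Accum& a) {
      if (a.n > 0) new_centers[k] = a.sum / a.n;
    });
    // The surrounding while loop, written by the application, broadcasts the new
    // centers and repeats until convergence.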
Performance breakdown (or "why is Grappa faster?")
• A more efficient network layer: kernel-bypass communication with lower serialization cost.
• High-concurrency program phases enable aggregation, and thus high message rates.
• Cache-aware data placement.

Limitations: Grappa currently targets small (~128-node) clusters and does not yet implement fault tolerance; work is ongoing.


Dynamic Concurrency Discovery for Very Large Windows of Execution
Jacob Nelson, Luis Ceze
University of Washington (Sampa group) — http://sampa.cs.washington.edu

How can we parallelize a sequential program, and how can we identify the opportunities? Limit studies have shown that sequential programs have untapped opportunities for parallelization. Dynamic thread-level speculation has been applied successfully in hardware at nanosecond latencies (e.g., in superscalar processors and GPUs); this project explores applying the idea at distributed-system scales, with millisecond latencies.

Approach
• Start with a sequential trace of dynamic instructions and build a dynamic dependence graph. Tasks are groups of instructions that will execute together; dependences impose an order on task execution, and the critical dependences dictate the minimum execution time.
• Identify dependences to find concurrency.
• Select tasks for parallel execution, balancing speedup potential against overhead.
• Execute tasks in parallel, enforcing dependences.

Bloom Map concept
• Goal: capture critical dependences.
• A Bloom filter answers membership queries with "no" or "yes". The Bloom Map extends it by storing dependence sources (Task IDs) in the imprecise map instead of just bits, so a query returns "no" or "yes, and the source is X".

Bloom Map structure and allowed imprecision
• A write stores the writing Task ID at each hashed location, overwriting any previous value.
• A read collects the Task IDs at the address's hashed locations and chooses one. A query may return the correct dependence source, may return an incorrect source, or may find a spurious dependence; the Bloom Map configuration determines the likelihood of imprecision.

Choosing between Bloom Map results
• Task IDs increase monotonically, and writes are inserted in order.
• One option: choose the newest Task ID. With no overwritten locations this yields the correct source, but if a location has been overwritten it returns a later, incorrect source.
• Better option: choose the oldest Task ID. As long as the correct Task ID is still visible in one bit vector we can find it, and the oldest visible ID can never be stale, because a later write to the same address would have updated all of that address's hashed locations.
• Example: task 1 stores A, task 2 stores B, task 3 loads A; the query for A should report task 1 as the dependence source even if one of A's locations was overwritten by task 2.

Results
Edge error for SPECINT 2006 (harmonic mean): overall edge error and edge error on critical paths, at bit budgets of 1 Mb, 16 Mb, 129 Mb, and 1 Gb. As the bit budget increases the Bloom Map becomes more effective, and it finds the dependences on critical paths.
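The Bloom Map itself is easy to sketch in ordinary C++. The sizes, the hash mixing, the single flat slot array, and the use of Task ID 0 as "empty" below are illustrative choices, not the poster's exact design:

    #include <array>
    #include <cstdint>
    #include <functional>
    #include <optional>

    using TaskId = uint64_t;                         // task IDs increase monotonically; 0 = empty

    template <size_t Slots, size_t Hashes>
    class BloomMap {
      std::array<TaskId, Slots> slot{};              // zero-initialized: no writer recorded
      size_t h(uint64_t addr, size_t i) const {      // i-th hash of an address
        return std::hash<uint64_t>{}(addr * 0x9e3779b97f4a7c15ULL + i) % Slots;
      }
    public:
      // A write stores the writing task's ID at every hashed location,
      // overwriting whatever was there before.
      void record_write(uint64_t addr, TaskId writer) {
        for (size_t i = 0; i < Hashes; i++) slot[h(addr, i)] = writer;
      }
      // A read returns "no" (nullopt) or "yes, and the source is X".
      // We choose the oldest surviving ID: locations not holding the true source
      // were overwritten by newer writes, and a later write to this same address
      // would have refreshed every one of its locations.
      std::optional<TaskId> query(uint64_t addr) const {
        TaskId oldest = UINT64_MAX;
        for (size_t i = 0; i < Hashes; i++) {
          TaskId t = slot[h(addr, i)];
          if (t == 0) return std::nullopt;           // an empty slot means no prior write
          if (t < oldest) oldest = t;
        }
        return oldest;                               // possibly imprecise, as described above
      }
    };

    // Trace from the example above: task 1 stores A, task 2 stores B, task 3 loads A.
    //   BloomMap<1024, 3> m;
    //   m.record_write(addrA, 1);
    //   m.record_write(addrB, 2);
    //   auto src = m.query(addrA);   // expected: 1, even if one of A's slots now holds 2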