Grappa: enabling next-generation analytics tools via latency-tolerant distributed shared memory
Jacob Nelson, Brandon Myers, Brandon Holt, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin
{nelson, bholt, bdmyers, preston, luisceze, skahan, oskin}@cs.washington.edu
University of Washington (Sampa group)
More info, tech report, downloads: http://sampa.cs.washington.edu  http://grappa.io

Grappa is a modern take on software distributed shared memory (DSM), tailored to exploit the parallelism inherent in data-intensive applications in order to overcome their poor locality and input-dependent load distribution.

Grappa differs from traditional DSMs in three ways:
• Instead of minimizing per-operation latency, Grappa tolerates latency with concurrency (latency-sensitive apps need not apply!).
• Grappa moves computation to data instead of caching data at the computation.
• Grappa operates at byte granularity rather than page granularity.
We provide sequential consistency for race-free programs using RPC-like atomic delegate operations, along with standard multithreaded synchronization primitives.

Programming example
Grappa's familiar single-system multithreaded C++ programming model enables easier development of analysis tools for terabyte-scale data. Here is an example that uses Grappa's C++11 library to build a distributed parallel wordcount-like application with a simple hash table:

    // distributed input array
    GlobalAddress<char> chars = load_input();

    // distributed hash table:
    using Cell = std::map<char,int>;
    GlobalAddress<Cell> cells = global_alloc<Cell>(ncells);

    forall(chars, nchars, [=](char& c) {
      // hash the char to determine destination
      size_t idx = hash(c) % ncells;
      delegate(cells + idx, [=](Cell& cell) { // runs atomically
        if (cell.count(c) == 0) cell[c] = 1;
        else cell[c] += 1;
      });
    });

Figure 1: "Character count" with a simple hash table implemented using Grappa's distributed shared memory. The cells are spread across the global heap on all nodes; each update is hashed to its destination cell and applied there by an atomic delegate.
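Reading results back uses the same mechanism: a delegate blocks the issuing task and can return a value computed at the data's home core. The helper below is a hypothetical follow-up to Figure 1, written in the same simplified API style as the example above (count_of itself is not part of the original code):

    // Hypothetical helper: look up one character's count in the table built above.
    // The lambda runs atomically on the cell's home core; the calling task
    // suspends until the result is returned.
    int count_of(GlobalAddress<Cell> cells, size_t ncells, char c) {
      size_t idx = hash(c) % ncells;   // same placement rule as the insert path
      return delegate(cells + idx, [=](Cell& cell) -> int {
        auto it = cell.find(c);
        return (it == cell.end()) ? 0 : it->second;
      });
    }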
System overview
Figure 2: Grappa design overview. Analytics frameworks (in-memory MapReduce, a relational query engine, a GraphLab-like engine) and irregular applications or native code run on top of three main components: a distributed shared memory built over each node's memory, a lightweight multithreading system with a global task pool, and a message aggregation layer over a communication layer with user-level access to the InfiniBand network. Visit http://grappa.io for more information on memory allocation, data distribution, message aggregation, and hotspot avoidance via combining.

Distributed shared memory
Grappa may address memory in two ways: locally and globally, and the programmer sees a single global memory abstraction. Local memory is local to a single core within a node; accesses occur through conventional pointers, and applications use them for data known to be core-private, such as the stack associated with a task or global memory accessed from its own home core. Local pointers cannot access memory on other cores and are valid only on their home core. Grappa also allows memory on a core's stacks or heap to be exported to the global address space and made accessible to other cores across the system. This uses a traditional PGAS (partitioned global address space [29]) addressing model, where each global address encodes a rank in the job (or global process ID) and an address within that process.

Access to Grappa's distributed shared memory is provided through delegate operations [48, 52]: short operations performed at the memory location's home node. When the data access pattern has low locality, it is more efficient to modify the data on its home core than to bring a copy to the requesting core and return it after modification.

Grappa also supports symmetric allocations, which allocate space for a copy (or proxy) of an object on every core in the system. The behavior is identical to performing a local allocation on all cores, but the local addresses are guaranteed to be identical, so the object can be found on any core without communication. Symmetric objects are often treated as proxies to a global object, holding copies of constant data or transparently buffering operations; a separate publication [40] explores how they were used to implement Grappa's synchronized global data structures, including a vector and a hash map.

Putting it all together, the global, local, and symmetric heaps can all be used for a simple graph data structure. Vertices are allocated from the global heap, which automatically distributes them across nodes. Symmetric pointers are used for the objects that hold local information about the graph, such as the base pointer to the vertices, so that any core can reach them without communication. Finally, each vertex holds a vector allocated from its core's local heap, which other cores access by going through the vertex.
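Below is a minimal sketch of that layout in the simplified API style of Figure 1. The Vertex and Graph types, the symmetric_alloc call, and the example sizes are illustrative assumptions, not Grappa's exact API:

    const int64_t nv = 1000000;        // example vertex count

    struct Vertex {
      std::vector<int64_t> out_edges;  // lives in the owning core's local heap
      double rank;
    };

    struct Graph {                     // per-core metadata, one copy on every core
      GlobalAddress<Vertex> vs;        // base pointer to the distributed vertex array
      int64_t nv;                      // replicated so it can be read without communication
    };

    GlobalAddress<Vertex> vs = global_alloc<Vertex>(nv);  // global heap: spread across nodes
    GlobalAddress<Graph>  g  = symmetric_alloc<Graph>();  // symmetric heap: identical address on all cores
    // ... initialize each core's copy of g with { vs, nv } ...

    // The loop body runs where each vertex lives; other cores reach a vertex's
    // local-heap edge vector only by going through the vertex via a delegate.
    forall(vs, nv, [=](Vertex& v) {
      v.rank = 1.0 / nv;
    });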
Tasking system
The tasking system supports lightweight multithreading and global distributed work stealing: tasks can be stolen by any node in the system, which provides automated load balancing. Concurrency is expressed through cooperatively scheduled user-level threads, and Grappa runs thousands of threads per core to expose parallelism and mask network latencies. Threads that perform long-latency operations (i.e., remote memory accesses) automatically suspend while the operation is executing and wake up when the operation completes.

Communication layer
Key feature: message aggregation. Commodity networks have a limited message injection rate, so building larger packets out of requests from unrelated tasks is essential for small-message throughput (fine-grained random access to global memory). The main goal of the communication layer is therefore to aggregate small messages into large ones; the process is invisible to the application programmer, and its interface is based on active messages [69]. Outgoing messages are queued in per-destination lists on each core; the sending core serializes them locally into a buffer, the buffer moves over the network via MPI/RDMA to the destination, and the receiving core deserializes it and runs the message handlers on their home cores. Since aggregation and deaggregation must be very efficient, the process runs in parallel and carefully uses lock-free synchronization operations. Direct, user-level access to the InfiniBand network reduces software overhead; for portability, MPI [49] is used as the underlying messaging library and for process setup and tear-down.
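To make the interaction between tasking and aggregation concrete, here is a GUPS-style sketch in the simplified API style of Figure 1 (the table size N, num_updates, and next_random are illustrative assumptions):

    GlobalAddress<int64_t> A = global_alloc<int64_t>(N);   // table spread over all nodes

    forall(0, num_updates, [=](int64_t i) {
      size_t idx = next_random(i) % N;                     // deliberately poor locality
      // The delegate suspends this task until the remote increment completes;
      // thousands of peer tasks per core keep issuing requests in the meantime,
      // giving the aggregator enough traffic per destination to build large packets.
      delegate(A + idx, [](int64_t& x) { x++; });          // atomic increment on the home core
    });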
Three prototype data analytics tools
• In-memory MapReduce: ~150 lines of code, implemented as a forall loop over the inputs followed by a forall over the keys; iterative jobs are driven by the application programmer with a while loop around the map and reduce phases.
• Backend for the Raco relational query compiler: ~700 lines of code, translating physical query plans into C++11 code built from Grappa forall loops.
• GraphLab-like API: ~60 lines of code, implementing a synchronous engine with delta caching and random graph partitioning with no replication.

Evaluation
Experiments were run at Pacific Northwest National Laboratory on the PAL cluster (1.9 GHz AMD Interlagos processors, Mellanox ConnectX-2 InfiniBand network).
• Relational queries: the SP2Bench queries (Q1, Q2, Q3a-c, Q4, Q5a-b, Q9) on 16 nodes, compared with the Shark distributed query engine; Q4 is a large workload, so it was run with 64 nodes. A per-query breakdown splits execution time into application compute, network, data serialization, iteration, and other overheads to show where Grappa's speedup over Shark comes from.
• Graph analytics: PageRank, connected components, SSSP, and BFS on the 1.8-billion-edge Friendster social network graph and the 1.4-billion-edge Twitter follower graph, run on 31 nodes and compared with GraphLab under two partitioning strategies (pds and random). PageRank and BFS scale out to 128 nodes, with weak scaling measured on synthetic power-law graphs scaled proportionally to the node count. A BFS implemented natively on Grappa with a bottom-up optimization performs even better than the GraphLab-API version.
• Microbenchmarks: a GUPS-style workload of random atomic increments to global memory characterizes the communication layer, comparing aggregated and unaggregated messaging and the effect of prefetching on the achieved increment rate.
• Data-parallel (MapReduce): we pick k-means clustering as a test workload because it exercises all-to-all communication and iteration, clustering an 8.9 GB SeaFlow flow cytometry dataset on 64 nodes with two values of K (10 and 10000). As a reference point we compare against the SparkKMeans implementation; both versions use the same algorithm (map the points, then reduce to produce new cluster centers), with the iteration implemented by the application programmer as a while loop. Grappa-MapReduce is nearly an order of magnitude faster than the comparable Spark implementation: for example, 17.3 s per iteration at K = 10000 versus 170 s for Spark. Traces of concurrent tasks and mean aggregated packet size over an iteration show concurrency spiking as delegates are performed and the resulting aggregation; at these packet sizes (30-80 kB), most of the network stack overhead is amortized for both systems. A sketch of one iteration in this style appears below.
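Here is a rough sketch of one k-means iteration in the "forall over inputs, then forall over keys" style described above, using the simplified API of Figure 1. Point, Accum, load_points, nearest_center, new_centers, and the index-carrying forall form are illustrative assumptions; the real Grappa-MapReduce code differs in detail:

    struct Accum { Point sum; int64_t n; };                 // per-center reduction state

    GlobalAddress<Point> points = load_points();            // distributed input points
    GlobalAddress<Accum> accums = global_alloc<Accum>(K);   // one bucket per cluster center

    // Map: assign each point to its nearest center and emit into that center's bucket.
    forall(points, npoints, [=](Point& p) {
      int64_t k = nearest_center(p);                        // reads a replicated copy of the centers
      delegate(accums + k, [=](Accum& a) { a.sum += p; a.n += 1; });  // runs atomically
    });

    // Reduce: one task per key (center) recomputes that center from its bucket.
    forall(accums, K, [=](int64_t k, Accum& a) {
      if (a.n > 0) new_centers[k] = a.sum / a.n;
    });
    // The surrounding while loop, written by the application, broadcasts the new
    // centers and repeats until convergence.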
Performance breakdown (or "why is Grappa faster?")
• A more efficient network layer: kernel-bypass communication with lower serialization cost.
• High-concurrency program phases enable aggregation, and thus high message rates.
• Cache-aware data placement.

Limitations: Grappa currently targets small (~128-node) clusters and does not yet implement fault tolerance; work is ongoing.


Dynamic Concurrency Discovery for Very Large Windows of Execution
Jacob Nelson, Luis Ceze
University of Washington (Sampa group) — http://sampa.cs.washington.edu

How can we parallelize a sequential program, and how can we identify the opportunities? Limit studies have shown that sequential programs have untapped opportunities for parallelization. Dynamic thread-level speculation has been applied successfully in hardware at nanosecond latencies (e.g., in superscalar processors and GPUs); this project explores applying the idea at distributed-system scales, with millisecond latencies.

Approach
• Start with a sequential trace of dynamic instructions and build a dynamic dependence graph. Tasks are groups of instructions that will execute together; dependences impose an order on task execution, and the critical dependences dictate the minimum execution time.
• Identify dependences to find concurrency.
• Select tasks for parallel execution, balancing speedup potential against overhead.
• Execute tasks in parallel, enforcing dependences.

Bloom Map concept
• Goal: capture critical dependences.
• A Bloom filter answers membership queries with "no" or "yes". The Bloom Map extends it by storing dependence sources (Task IDs) in the imprecise map instead of just bits, so a query returns "no" or "yes, and the source is X".

Bloom Map structure and allowed imprecision
• A write stores the writing Task ID at each hashed location, overwriting any previous value.
• A read collects the Task IDs at the address's hashed locations and chooses one. A query may return the correct dependence source, may return an incorrect source, or may find a spurious dependence; the Bloom Map configuration determines the likelihood of imprecision.

Choosing between Bloom Map results
• Task IDs increase monotonically, and writes are inserted in order.
• One option: choose the newest Task ID. With no overwritten locations this yields the correct source, but if a location has been overwritten it returns a later, incorrect source.
• Better option: choose the oldest Task ID. As long as the correct Task ID is still visible in one bit vector we can find it, and the oldest visible ID can never be stale, because a later write to the same address would have updated all of that address's hashed locations.
• Example: task 1 stores A, task 2 stores B, task 3 loads A; the query for A should report task 1 as the dependence source even if one of A's locations was overwritten by task 2.

Results
Edge error for SPECINT 2006 (harmonic mean): overall edge error and edge error on critical paths, at bit budgets of 1 Mb, 16 Mb, 129 Mb, and 1 Gb. As the bit budget increases the Bloom Map becomes more effective, and it finds the dependences on critical paths.
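The Bloom Map itself is easy to sketch in ordinary C++. The sizes, the hash mixing, the single flat slot array, and the use of Task ID 0 as "empty" below are illustrative choices, not the poster's exact design:

    #include <array>
    #include <cstdint>
    #include <functional>
    #include <optional>

    using TaskId = uint64_t;                         // task IDs increase monotonically; 0 = empty

    template <size_t Slots, size_t Hashes>
    class BloomMap {
      std::array<TaskId, Slots> slot{};              // zero-initialized: no writer recorded
      size_t h(uint64_t addr, size_t i) const {      // i-th hash of an address
        return std::hash<uint64_t>{}(addr * 0x9e3779b97f4a7c15ULL + i) % Slots;
      }
    public:
      // A write stores the writing task's ID at every hashed location,
      // overwriting whatever was there before.
      void record_write(uint64_t addr, TaskId writer) {
        for (size_t i = 0; i < Hashes; i++) slot[h(addr, i)] = writer;
      }
      // A read returns "no" (nullopt) or "yes, and the source is X".
      // We choose the oldest surviving ID: locations not holding the true source
      // were overwritten by newer writes, and a later write to this same address
      // would have refreshed every one of its locations.
      std::optional<TaskId> query(uint64_t addr) const {
        TaskId oldest = UINT64_MAX;
        for (size_t i = 0; i < Hashes; i++) {
          TaskId t = slot[h(addr, i)];
          if (t == 0) return std::nullopt;           // an empty slot means no prior write
          if (t < oldest) oldest = t;
        }
        return oldest;                               // possibly imprecise, as described above
      }
    };

    // Trace from the example above: task 1 stores A, task 2 stores B, task 3 loads A.
    //   BloomMap<1024, 3> m;
    //   m.record_write(addrA, 1);
    //   m.record_write(addrB, 2);
    //   auto src = m.query(addrA);   // expected: 1, even if one of A's slots now holds 2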