MALT: Distributed Data-Parallelism for Existing ML Applications
Hao Li, Asim Kadav, Erik Kruus, Cristian Ungureanu
ML transforms data into insights

Large amounts of data are being generated by user-software interactions, social networks, and hardware devices. Timely insights depend on providing accurate and updated machine learning (ML) models using this data. Large learning models, trained on large datasets, often improve model accuracy [1].

Properties of ML applications

Machine learning tasks have all of the following properties:
• Fine-Grained and Incremental: ML tasks perform repeated model updates over new input data.
• Asynchronous: ML tasks may communicate asynchronously, e.g. when exchanging model information or back-propagation updates.
• Approximate: ML applications are stochastic, and often an approximation of the trained model is sufficient.
• Need Rich Developer Environment: Developing ML applications requires a rich set of libraries, tools, and graphing abilities, which is often missing in many highly scalable systems.
Our Solution: MALT
Goal: Provide an efficient library for providing data-parallelism to existing ML applications.

Network-efficient learning

In peer-to-peer learning, instead of sending model information to all replicas, MALT sends model updates to log(N) nodes, such that (i) the graph of all nodes is connected and (ii) the model updates are disseminated uniformly across all nodes.
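To make the node-selection rule concrete, here is a small illustrative sketch in C++ (ours, not the MALT API; halton2 and halton_peers are hypothetical names). It picks the roughly log(N) peers that replica i sends updates to, following the base-2 Halton sequence 1/2, 1/4, 3/4, 1/8, ... used for model propagation in the figure further below.

// Illustrative sketch (not the MALT API): choose the log(N) peers that node i
// sends its model updates to, following the base-2 Halton (van der Corput)
// sequence 1/2, 1/4, 3/4, 1/8, 3/8, ...
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// k-th element (0-indexed) of the base-2 van der Corput sequence.
static double halton2(unsigned k) {
  double f = 1.0, r = 0.0;
  for (unsigned n = k + 1; n > 0; n /= 2) {
    f /= 2.0;
    r += f * (n % 2);
  }
  return r;
}

// Destination nodes for replica i out of N: roughly N/2+i, N/4+i, 3N/4+i, ...
std::vector<int> halton_peers(int i, int N) {
  std::vector<int> peers;
  int fanout = std::max(1, (int)std::ceil(std::log2((double)N)));
  for (int k = 0; k < fanout; ++k)
    peers.push_back((i + (int)(halton2(k) * N)) % N);
  return peers;
}

int main() {
  // For N = 16, node 0 sends to nodes 8, 4, 12, 2 (log2(16) = 4 peers).
  for (int dst : halton_peers(0, 16)) std::printf("%d ", dst);
  std::printf("\n");
  return 0;
}

With this fan-out, each node transmits O(log N) updates per round instead of O(N), so the network-wide traffic grows as O(N log N) rather than O(N^2).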
[Figure: MALT replicas 1..n. Each replica trains SGD using its own primary model parameter V_i on its local data, and keeps per-sender receive queues V1, V2, ..., Vn for the updates arriving from its peers.]
MALT performs peer-to-peer machine learning. It provides abstractions for fine-grained in-memory updates using one-sided RDMA, limiting data movement costs when training models. MALT allows machine learning developers to specify the dataflow and apply communication and representation optimizations.
[Figure: Traditional approach. Models 1..6 perform an all-reduce exchange of model information. As the number of nodes (N) increases, the total number of updates transmitted in the network increases as O(N^2).]
Application (Dataset)                  Model                  # Parameters   Dataset size (uncompressed)
Document classification (RCV1)         SVM                    47K            480 MB
Image classification (PASCAL alpha)    SVM                    500            1 GB
DNA detection (DNA)                    SVM                    800            10 GB
Genome detection (splice-site)         SVM                    11M            250 GB
Webspam detection (webspam)            SVM                    16.6M          10 GB
Collaborative filtering (netflix)      Matrix Factorization   14.9M          1.6 GB
Ad prediction (KDD 2012)               Neural networks        12.8M          3.1 GB
MALT model propagation: Each machine sends updates to log(N) nodes (to N/2 + i and N/4 + i for node i). As N increases, the set of outbound nodes follows the Halton sequence (N/2, N/4, 3N/4, N/8, 3N/8, ...), and the total number of updates transmitted increases as O(N log N).

[Figure 4: Time to convergence for the RCV1 workload (all-to-all, BSP, gradient averaging, 10 ranks) for MALT_all versus single-rank SGD (goal loss 0.145; cb=5000 reaches it 6.7X-7.3X faster). MALT converges more quickly to the desired accuracy.]

[Figure 9: MALT_Halton compared with a parameter server (PS) for distributed SVM on the webspam workload under asynchronous training, showing wait versus compute time and the achieved loss values for 20 ranks.]

[Figure 10: Convergence for bulk-synchronous (BSP), asynchronous (ASP), and bounded-staleness (SSP) processing for the splice-site workload (model averaging, cb=5000, 8 ranks; goal loss 0.01245; legend reports ASP 6X and SSP 7.2X).]
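The BSP/ASP/SSP trade-off in Figure 10 can be illustrated with a minimal bounded-staleness gate. This sketch is ours and assumes a hypothetical iteration-tracking array (iter_of); it is not MALT code. A replica may only run ahead of the slowest peer by STALENESS_BOUND iterations: a bound of 0 behaves like BSP, while a very large bound behaves like fully asynchronous (ASP) execution.

// Illustrative bounded-staleness (SSP) gate; not MALT code.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// iter_of[r] = last iteration number heard from replica r (updated as their
// model updates arrive). STALENESS_BOUND = 0 gives BSP; a very large bound
// behaves like fully asynchronous (ASP) execution.
constexpr int64_t STALENESS_BOUND = 4;

bool may_proceed(int64_t my_iter, const std::vector<int64_t>& iter_of) {
  int64_t slowest = *std::min_element(iter_of.begin(), iter_of.end());
  return my_iter - slowest <= STALENESS_BOUND;
}

int main() {
  std::vector<int64_t> iter_of = {10, 12, 9, 14};  // peers' last-seen iterations
  std::printf("%s\n", may_proceed(13, iter_of) ? "proceed" : "wait");  // 13-9=4 <= 4
  std::printf("%s\n", may_proceed(14, iter_of) ? "proceed" : "wait");  // 14-9=5 > 4
  return 0;
}

In a real system, the peers' iteration counters would arrive alongside their model updates.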
1: procedure SERIALSGD
2:   Gradient g;
3:   Parameter W;
4:
5:   for epoch = 1 : maxEpochs do
6:
7:     for i = 1 : maxData do
8:       g = cal_gradient(data[i]);
9:       W = W + g;
10:
11: return W

Transforming serial SGD (Stochastic Gradient Descent) to data-parallel SGD.
Design requirements and MALT's solution:

• Efficient model communication: MALT provides a scatter-gather API. scatter allows sending of model updates to the peer replicas. A local gather function applies any user-defined function on the received values.
• Asynchronous: Models train and scatter updates to per-sender receive queues. This mechanism, when used with one-sided RDMA writes, ensures no interruption to the receiver CPU.
• Approximate: MALT allows different consistency models to trade off consistency and training time.
• Re-use developer environment: Works with existing applications. Currently integrated with SVM-SGD, HogWild-MF, and NEC RAPID.
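As a rough illustration of the per-sender receive queues and the user-defined gather described above, the sketch below models one receive slot per remote sender and an averaging gather. The names (RecvSlot, gather_avg) are ours; in MALT the slots would be filled by one-sided RDMA writes so the receiver CPU is not interrupted, whereas here they are filled by ordinary assignments.

// Illustrative sketch of per-sender receive queues and a user-defined gather;
// names (RecvSlot, gather_avg) are ours. Real MALT fills these buffers via
// one-sided RDMA writes; here we only model the layout: one slot per remote
// sender, each with a single writer.
#include <cstddef>
#include <cstdio>
#include <vector>

struct RecvSlot {                 // written only by one remote sender
  std::vector<float> update;      // latest model/gradient update from that sender
  bool valid = false;             // becomes true once the sender has written
};

// gather(AVG): average whatever updates have arrived, plus our own gradient.
std::vector<float> gather_avg(const std::vector<float>& mine,
                              const std::vector<RecvSlot>& slots) {
  std::vector<float> avg = mine;
  std::size_t count = 1;
  for (const RecvSlot& s : slots) {
    if (!s.valid) continue;                       // peer hasn't sent yet (async)
    for (std::size_t j = 0; j < avg.size(); ++j) avg[j] += s.update[j];
    ++count;
  }
  for (float& v : avg) v /= static_cast<float>(count);
  return avg;
}

int main() {
  std::vector<RecvSlot> slots(3);                 // three peer senders
  slots[0] = {{1.0f, 1.0f}, true};                // "RDMA-written" by peer 0
  slots[2] = {{3.0f, 5.0f}, true};                // peer 1 hasn't written yet
  std::vector<float> g = gather_avg({2.0f, 3.0f}, slots);
  std::printf("%.2f %.2f\n", g[0], g[1]);         // averages over 3 contributors: 2.00 3.00
  return 0;
}

Because each slot has exactly one writer, the receiver can average whatever has arrived without coordinating with the senders, which is what keeps asynchronous gathering cheap.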
1: procedure PARALLELSGD
2:   maltGradient g(SPARSE, ALL);
3:   Parameter W;
4:
5:   for epoch = 1 : maxEpochs do
6:
7:     for i = 1 : maxData/totalMachines do
8:       g = cal_gradient(data[i]);
9:       g.scatter(ALL);
10:      g.gather(AVG);
11:      W = W + g;
12:
13: return W

Data-parallel SGD with MALT.

[Figure 5: Speedup by iterations for the PASCAL alpha workload (all-to-all, BSP, model averaging, 10 ranks) for MR-SVM and MALT_all. MR-SVM is implemented using the MALT library over infiniBand. We achieve super-linear speedup for some workloads because of the averaging effect from parallel replicas [52].]

[Figure 11: Convergence (loss vs. time in seconds) for the RCV1 dataset for MALT_all (left) and MALT_Halton (right) for different communication batch sizes (cb=1000, 5000, 10000; roughly 5X-8X faster than single-rank SGD to the goal loss of 0.145). MALT_Halton converges faster than MALT_all.]

We compare the benefit of different synchronization methods by running the systems under test until they reach the same loss value, and we compare the total time and the number of iterations (passes over the data) per machine. Distributed training requires fewer iterations per machine since examples are processed in parallel. For each of our experiments, we pick the desired final optimization goal as achieved by running single-rank SGD [6, 12]. Figure 4 compares the speedup of MALT_all with a single machine for the RCV1 dataset [12], for a communication batch size (cb size) of 5000, i.e. every model replica communicates its updates after processing 5000 examples.
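To show the serial-to-parallel transformation end to end, here is a self-contained toy simulation of the PARALLELSGD pattern in C++. This is our sketch only: scatter/gather are emulated with an in-memory table instead of MALT's RDMA-backed maltGradient, and the model is one-dimensional least squares on synthetic data.

// Self-contained toy simulation of the PARALLELSGD pattern above.
// Our sketch only: scatter/gather are emulated with an in-memory table instead
// of MALT's RDMA-backed maltGradient, and the model is 1-D least squares.
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
  const int R = 4;                 // number of replicas (ranks)
  const double lr = 0.1;
  // Synthetic data y = 3x, sharded across replicas.
  std::vector<std::vector<double>> xs(R), ys(R);
  for (int i = 0; i < 40; ++i) {
    xs[i % R].push_back(i * 0.1);
    ys[i % R].push_back(3.0 * i * 0.1);
  }

  std::vector<double> w(R, 0.0);          // one model replica per rank
  std::vector<double> scattered(R, 0.0);  // stands in for per-sender receive queues

  for (int epoch = 0; epoch < 200; ++epoch) {
    // Each rank computes a local gradient on its shard and "scatters" it.
    for (int r = 0; r < R; ++r) {
      double g = 0.0;
      for (std::size_t i = 0; i < xs[r].size(); ++i) {
        double err = w[r] * xs[r][i] - ys[r][i];
        g += err * xs[r][i];
      }
      scattered[r] = g / xs[r].size();
    }
    // gather(AVG) at every rank, then the SGD update (the pseudocode writes
    // W = W + g; here g is the raw gradient, so we subtract lr * g).
    double avg = 0.0;
    for (int r = 0; r < R; ++r) avg += scattered[r];
    avg /= R;
    for (int r = 0; r < R; ++r) w[r] -= lr * avg;
  }
  std::printf("learned w at rank 0: %.3f (true value 3.0)\n", w[0]);
  return 0;
}

Each rank computes a gradient on its own shard, the averaged gradient is applied everywhere, and all replicas converge to the same model, mirroring the scatter/gather/update steps of PARALLELSGD.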
[Figure 12: Convergence (loss vs. time in seconds) for the splice-site dataset for MALT_all and MALT_Halton (model averaging, cb=5000, 8 ranks; goal loss 0.01245; legend: serial SGD, BSP all, ASYNC all 6X, ASYNC Halton 11X). MALT_Halton converges faster than MALT_all.]

[Figure 13: Network costs (total data sent over the whole network) for MALT_all, MALT_Halton, and the parameter server for the webspam workload. MALT_Halton reduces network communication costs and provides fast convergence. MALT sends and receives gradients, while the parameter server sends gradients but needs to receive whole models.]

MALT_Halton converges faster for two reasons. First, each node sends updates to fewer machines. Second, each node performs model averaging of fewer (log(N)) incoming models. Hence, even though MALT_Halton may require more iterations than MALT_all, the overall time required for every iteration is less, and the overall convergence time to reach the desired accuracy is less. Finally, since MALT_Halton spreads out its updates across all nodes, that aids faster convergence.

Figure 12 shows the model convergence for the splice-site dataset and the speedup over BSP-all in reaching the desired goal with 8 nodes. From the figure, we see that MALT_Halton converges faster than MALT_all. Furthermore, we find that until the model converges to the desired goal, each node in MALT_all sends out 370 GB of updates, while each node in MALT_Halton only sends 34 GB of data. As the number of nodes increases, the logarithmic fan-out of MALT_Halton should result in lower amounts of data transferred and faster convergence.

Figure 13 shows the data sent by MALT_all, MALT_Halton, and the parameter server over the entire network, for the webspam workload. We find that MALT_Halton is the most network efficient. Webspam is a high-dimensional workload, and MALT_Halton only sends updates to log(N) nodes. The parameter server sends gradients but needs to receive the whole model from the central server. We note that other optimizations such as compression and other filters can further reduce the network costs, as noted in [36]. Furthermore, when the parameter server is replicated for high availability, there is more network traffic for the additional N (asynchronous) messages for N-way chain replication of the parameters. To summarize, we find that MALT provides sending gradients (instead of sending the whole model), which saves network costs. Furthermore, MALT_Halton is network efficient and achieves speedup over MALT_all.

Network saturation tests: We perform infiniBand network throughput tests and measure the time to scatter updates in the MALT_all case with the SVM workload. In the synchronous case, we find that all ranks operate in a lock-step fashion, and during the scatter phase all machines send models at the line rate (about 5 GB/s). Specifically, for the webspam workload, we see about 5.1 GB/s (about 40 Gb/s) during scatter. In the asynchronous case, to saturate the network, we run multiple replicas on every machine. When running three ranks on every machine, we find that each machine sends model updates at 4.2 GB/s (about 33 Gb/s) for the webspam dataset. These tests demonstrate that using the network bandwidth is beneficial for training models with a large number of parameters, and that using network-efficient techniques such as MALT_Halton can improve performance.

MALT_Halton trades off the freshness of updates at peer replicas for savings in network communication time. For workloads where the model is dense and network communication costs are small compared to the update costs, the MALT_all configuration may provide similar or better results than MALT_Halton. For example, for the SSI workload, which is over a fully connected neural network, we only see a 1.1X speedup for MALT_Halton over MALT_all. However, as the number of nodes and model sizes increase, the cost of communication begins to dominate, and using MALT_Halton is beneficial.

The MR-SVM algorithm is based on common Hadoop implementations and communicates gradients after every partition epoch [56]. We implement the MR-SVM algorithm over the MALT library and run it over our infiniBand cluster. MR-SVM uses one-shot averaging at the end of every epoch to communicate parameters (cb size = 25K). MALT is designed for low-latency communication and communicates parameters more often (cb size = 1K). Figure 5 shows the speedup by iterations for the PASCAL alpha workload for MR-SVM (implemented over MALT) and MALT_all. We find that both workloads achieve super-linear speedup over a single machine on the PASCAL alpha dataset. This happens because the averaging effect of gradients provides super-linear speedup for certain datasets [52]. In addition, we find that MALT provides a 3X speedup (by iterations; about 1.5X by time) over MR-SVM. MALT converges faster since it is designed over low-latency communication and sends gradients more frequently. This result shows that existing algorithms designed for map-reduce may not provide the most optimal speedup for low-latency frameworks such as MALT.

Figure 6 shows the speedup with time for convergence for ad-click prediction implemented using a fully connected, three layer neural network. This three layer network needs to synchronize parameters at each layer. Furthermore, these networks have dense parameters, and there is computation in the forward and the reverse direction. Hence, fully connected neural networks are harder to scale than convolution networks [23]. We show the speedup by using MALT_all to train over KDD-2012 data on 8 processes over a single machine. We obtain up to 1.5X speedup with 8 ranks.

[Figure 6: AUC (Area Under Curve) vs. time (in seconds) for a three layer neural network for text learning (click prediction) using KDD 2012 data (all-to-all, BSP, model averaging, 8 ranks; desired AUC goal 0.7; legend: cb=15000 1.13X, cb=20000 1.5X, cb=25000 1.24X).]

6.3 Developer Effort
We evaluate the ease of implementing parallel learning in MALT by adding support to the four applications listed in Table 3. For each application, we show the amount of code we modified as well as the number of new lines added. In Section 4, we described the specific changes required. The new code adds support for creating MALT objects and for scatter-gather of the model updates. In comparison, implementing a whole new algorithm takes hundreds of lines of new code, assuming underlying data parsing and arithmetic libraries are provided by the processing framework. On average, we moved 87 lines of code and added 106 lines, representing about 15% of overall code.

6.4 Fault Tolerance
We evaluate the time required for convergence when a node fails. When the MALT fault monitor in a specific node re[...]

[Figure: Fault tolerance. Time taken to converge (in seconds) for a three layer neural network over the DNA dataset, for the fault-free case and a single rank failure case. MALT is able to recover from the failure and train the model correctly.]
[Figure: MALT software stack at each replica. Training data is loaded from HDFS/NFS; dstorm provides distributed one-sided remote memory; a vector object library sits above dstorm and below the application's model and data.]
We integrate MALT with three applications: SVM [3], matrix factorization [4], and neural networks [5]. MALT requires reasonable developer effort and provides speedup over existing methods.

Results
We demonstrate that MALT outperforms single
machine performance for small workloads and can
efficiently train models over large datasets that span
multiple machines (See our paper in EuroSys 2015
for more results).
References and Related Work

[1] A. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, 2009.
[2] J. Dean et al. Large scale distributed deep networks. NIPS 2012.
[3] L. Bottou. Large scale machine learning with SGD. COMPSTAT 2010.
[4] B. Recht et al. HogWild: A lock-free approach to parallelizing stochastic gradient descent. NIPS 2011.
[5] B. Bai et al. SSI: Supervised Semantic Indexing. ACM CIKM 2009.