
CoCoA vs. CoCoA+: Adding vs. Averaging in
Distributed Optimization
Martin Takáč
SAS, 10th of March, 2015
Outline
Problem Formulation
Serial and Parallel Coordinate Descent Method (CDM)
Distributed CDM
Original CoCoA Framework
CoCoA+ Framework
Computation vs. Communication Trade-off
Spark
Some Numerical Experiments
The Problem - Regularized Empirical Loss Minimization
Let $\{(x_i, y_i)\}_{i=1}^n$ be our training data, $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.
"
#
n
1X
λ
min P(w ) :=
`i (w T xi ) + kw k2
n
2
w ∈Rd
(P)
i=1
where
λ > 0 is a regularization parameter
$\ell_i(\cdot)$ is a convex loss function which can depend on the label $y_i$
Examples:
Logistic loss: $\ell_i(\zeta) = \log(1 + \exp(-y_i \zeta))$
Hinge loss: $\ell_i(\zeta) = \max\{0, 1 - y_i \zeta\}$
The dual problem
$$\max_{\alpha \in \mathbb{R}^n} \left[ D(\alpha) := -\frac{\lambda}{2} \|A\alpha\|^2 - \frac{1}{n} \sum_{i=1}^n \ell_i^*(-\alpha_i) \right] \qquad (D)$$
where
$A = \frac{1}{\lambda n} X^T$ and $X^T = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{d \times n}$
$\ell_i^*$ is the convex conjugate of $\ell_i$
w.l.o.g. $\|x_i\| \le 1$
Duality
Primal-Dual mapping
For any $\alpha \in \operatorname{dom}(D)$ we can define
$$w(\alpha) := A\alpha \qquad (1)$$
By strong duality, $w^* = w(\alpha^*)$ is an optimal solution of (P) if $\alpha^*$ is an optimal solution of (D).
Gap function
$$G(\alpha) = P(w(\alpha)) - D(\alpha)$$
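A minimal sketch of these quantities in code, assuming the quadratic loss $\ell_i(\zeta) = \frac{1}{2}(\zeta - y_i)^2$ (so $\ell_i^*(-\alpha_i) = \frac{1}{2}\alpha_i^2 - \alpha_i y_i$); the function name and array layout are ours, not part of the talk:

```python
import numpy as np

def primal_dual_gap(alpha, X, y, lam):
    """Evaluate P(w(alpha)), D(alpha) and the gap G(alpha) for the
    quadratic loss l_i(z) = (z - y_i)^2 / 2, whose conjugate gives
    l_i^*(-a) = a^2/2 - a*y_i.  X stores one sample x_i per row."""
    n = X.shape[0]
    w = X.T @ alpha / (lam * n)            # primal-dual mapping w(alpha) = A @ alpha
    P = np.mean(0.5 * (X @ w - y) ** 2) + 0.5 * lam * (w @ w)
    D = -0.5 * lam * (w @ w) - np.mean(0.5 * alpha ** 2 - alpha * y)
    return P, D, P - D                     # weak duality guarantees P - D >= 0
```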
The Setting & Challenges
The size of matrix $A$ is huge (e.g., TBs of data)
We want to use many nodes of a computer cluster (or cloud) to speed up the computation
Challenges
distributed data: no single machine can load the whole instance
expensive communication (latency): RAM ≈ 100 nanoseconds; a standard network connection ≈ 250,000 nanoseconds
unreliable nodes: a node can die at any point during the computation (we want a fault-tolerant solution)
The Serial/Parallel/Distributed SDCA Algorithm
assume we have K nodes (computers) each with parallel processing power
we partition the coordinates $\{1, 2, \dots, n\}$ into $K$ balanced sets $P_1, \dots, P_K$: $\forall k \in \{1, \dots, K\}$ we have $|P_k| = \frac{n}{K}$

Stochastic Dual Coordinate Ascent (serial / parallel / distributed; all three variants share the same coordinate step, sketched in code below)

Serial SDCA:
choose $\alpha^{(0)} \in \mathbb{R}^n$
repeat: set $\alpha^{(t+1)} = \alpha^{(t)}$, pick a random coordinate $i \in \{1, \dots, n\}$,
compute the update $h_i^t(\alpha^{(t)}) := \arg\max_h D(\alpha^{(t)} + h e_i)$,
and apply it: $\alpha_i^{(t+1)} = \alpha_i^{(t)} + h_i^t(\alpha^{(t)})$

Parallel SDCA:
as above, but pick a random subset $S \subset \{1, \dots, n\}$ with $|S| = H$ and compute/apply the update for each $i \in S$ in parallel

Distributed SDCA:
as above, but for each computer $k \in \{1, \dots, K\}$ in parallel: pick a random subset $S_k \subset P_k$ with $|S_k| = H \le \frac{n}{K}$ and compute/apply the update for each $i \in S_k$ in parallel
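For simple losses the coordinate step has a closed form. A sketch for the quadratic loss, with $A = \frac{1}{\lambda n} X^T$ as defined earlier (names and interface are ours; hinge and logistic losses need their own 1D solves):

```python
import numpy as np

def sdca_step(i, alpha, w, X, y, lam):
    """One exact coordinate step h = argmax_h D(alpha + h * e_i), in closed
    form for the quadratic loss l_i(z) = (z - y_i)^2 / 2.
    X holds one sample x_i per row; w = A @ alpha is maintained incrementally."""
    n = X.shape[0]
    x_i = X[i]
    h = (y[i] - alpha[i] - w @ x_i) / (1.0 + (x_i @ x_i) / (lam * n))
    alpha[i] += h
    w += h * x_i / (lam * n)        # keeps the invariant w = A @ alpha
    return alpha, w
```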
Distributed CDM
[Figure] Illustration: $K = 4$ and $H = 2$
Disadvantages of Distributed CDM
we cannot choose $H > |P_k|$!
each step is very cheap to compute (usually a closed-form solution or a slightly more involved 1D problem)
after taking $H$ steps, the objective function usually doesn't change much
it is almost impossible to balance computation and communication
Data Distribution
Vector $\alpha$ and the columns of matrix $A$ are partitioned according to $\{P_k\}_{k=1}^K$.
Notation: For $k \in \{1, 2, \dots, K\}$, $\alpha_k \in \mathbb{R}^{|P_k|}$ denotes the subvector of $\alpha$ with coordinates in $P_k$.
Vector $\alpha_{[k]} \in \mathbb{R}^n$ is obtained from $\alpha$ by setting all coordinates $\notin P_k$ to zero.
Example: $\alpha_1 = (*, *, *, *)^T$, $\alpha_{[1]} = (*, *, *, *, 0, 0, \dots, 0)^T$.
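A toy illustration of the two notations in code (the arrays and the partition are ours, chosen to mirror the example above):

```python
import numpy as np

n, K = 8, 2
P = [np.arange(0, 4), np.arange(4, 8)]   # balanced partition P_1, P_2
alpha = np.arange(1.0, 9.0)              # alpha = (1, 2, ..., 8)

alpha_1 = alpha[P[0]]                    # alpha_k in R^{|P_k|}: [1. 2. 3. 4.]
alpha_b1 = np.zeros(n)                   # alpha_[k] in R^n: zero-padded
alpha_b1[P[0]] = alpha[P[0]]             # [1. 2. 3. 4. 0. 0. 0. 0.]
print(alpha_1, alpha_b1)
```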
Local Problem
CoCoA subproblem
At iteration $t$ on node $k$:
$$(\Delta\alpha^*)^{(t)}_{[k]} = \arg\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} D(\alpha^{(t)} + \Delta\alpha_{[k]}) = \arg\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} \left( -\frac{\lambda}{2} \big\|A(\alpha^{(t)} + \Delta\alpha_{[k]})\big\|^2 - \frac{1}{n} \sum_{i=1}^n \ell_i^*\big(-(\alpha^{(t)} + \Delta\alpha_{[k]})_i\big) \right)$$
node $k$ cannot solve this subproblem on its own, as it depends on the full $\alpha^{(t)}$ and the full matrix $A$
however, if we know $w^{(t)} = A\alpha^{(t)}$, then
$$(\Delta\alpha^*)^{(t)}_{[k]} = \arg\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} \left( -\frac{\lambda}{2} \big\|w^{(t)} + A\Delta\alpha_{[k]}\big\|^2 - \frac{1}{n} \sum_{i \in P_k} \ell_i^*\big(-(\alpha^{(t)} + \Delta\alpha_{[k]})_i\big) \right)$$
so if we know $w^{(t)}$, we can compute $(\Delta\alpha^*)^{(t)}_{[k]}$ locally
The CoCoA Framework
Communication-Efficient Distributed Dual Coordinate Ascent
Input: $T \ge 1$
Data: $\{(x_i, y_i)\}_{i=1}^n$ distributed over $K$ machines
Initialize: $\alpha_{[k]}^{(0)} \leftarrow 0$ for all machines $k$, and $w^{(0)} \leftarrow 0$
for $t = 1, 2, \dots, T$
  for all machines $k = 1, 2, \dots, K$ in parallel (computation):
    solve the local problem approximately to obtain $\Delta\alpha_{[k]}$
    $\alpha_{[k]}^{(t)} \leftarrow \alpha_{[k]}^{(t-1)} + \frac{1}{K} \Delta\alpha_{[k]}$
    $\Delta w_k \leftarrow \frac{1}{K} A \Delta\alpha_{[k]}$
  reduce (communication): $w^{(t)} \leftarrow w^{(t-1)} + \sum_{k=1}^K \Delta w_k$

In the worst case, the performance of this method can be the same as if we randomly picked a single node $k$, solved the corresponding subproblem, and replaced $\frac{1}{K}$ by $1$
How accurately do we need to solve the local subproblem?
How should we change the local problem to avoid averaging (i.e., to just add the local solutions)?
Can we prove it will be better?
Smarter Subproblem
Local Subproblem for CoCoA+
$$\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w^{(t)}) \qquad (2)$$
where
$$\mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w^{(t)}) = -\frac{1}{n} \sum_{i \in P_k} \ell_i^*\big(-(\alpha_{[k]}^{(t)} + \Delta\alpha_{[k]})_i\big) - \frac{1}{K} \frac{\lambda}{2} \big\|w^{(t)}\big\|^2 - \lambda (w^{(t)})^T A \Delta\alpha_{[k]} - \frac{\lambda \sigma'}{2} \big\|A \Delta\alpha_{[k]}\big\|^2.$$
Compare with:
$$\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} \left( -\frac{\lambda}{2} \big\|w^{(t)} + A\Delta\alpha_{[k]}\big\|^2 - \frac{1}{n} \sum_{i \in P_k} \ell_i^*\big(-(\alpha^{(t)} + \Delta\alpha_{[k]})_i\big) \right) \qquad (3)$$
If $\sigma' = 1$, the optimal solutions of (2) and (3) coincide.
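To make (2) concrete, here is a sketch that evaluates $\mathcal{G}_k^{\sigma'}$ term by term, again assuming the quadratic loss (so $\ell_i^*(-a) = \frac{a^2}{2} - a y_i$); the function name and interface are ours:

```python
import numpy as np

def local_subproblem_value(dalpha_k, Pk, alpha, w, X, y, lam, K, sigma_p):
    """Evaluate G_k^{sigma'}(dalpha_[k]; w) for the quadratic loss -- a sketch
    for checking candidate local updates, not the solver itself.
    dalpha_k holds the |P_k| nonzero entries of dalpha_[k]."""
    n = X.shape[0]
    a = alpha[Pk] + dalpha_k                  # (alpha + dalpha_[k]) on P_k
    conj = 0.5 * a ** 2 - a * y[Pk]           # l_i^*(-(alpha + dalpha_[k])_i)
    Ada = X[Pk].T @ dalpha_k / (lam * n)      # A @ dalpha_[k]
    return (- conj.sum() / n
            - (lam / 2) * (w @ w) / K
            - lam * (w @ Ada)
            - (lam * sigma_p / 2) * (Ada @ Ada))
```

Note that everything here is local to node $k$ except $w$, which is exactly what the framework communicates.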
The CoCoA+ Framework
Communication-Efficient Distributed Dual Coordinate Ascent
Input: $T \ge 1$, $\gamma \in [\frac{1}{K}, 1]$, $\sigma' \in [1, \infty)$
Data: $\{(x_i, y_i)\}_{i=1}^n$ distributed over $K$ machines
Initialize: $\alpha_{[k]}^{(0)} \leftarrow 0$ for all machines $k$, and $w^{(0)} \leftarrow 0$
for $t = 1, 2, \dots, T$
  for all machines $k = 1, 2, \dots, K$ in parallel (computation):
    approximately maximize $\mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w^{(t-1)})$ to obtain $\Delta\alpha_{[k]}$
    $\alpha_{[k]}^{(t)} \leftarrow \alpha_{[k]}^{(t-1)} + \gamma \Delta\alpha_{[k]}$
    $\Delta w_k \leftarrow \gamma A \Delta\alpha_{[k]}$
  reduce (communication): $w^{(t)} \leftarrow w^{(t-1)} + \sum_{k=1}^K \Delta w_k$

If $\gamma = \frac{1}{K}$ we obtain CoCoA
If $\gamma = \frac{1}{K}$ then $\sigma' = 1$ is a "safe" value
What about other values of $\gamma$? (we want $\gamma = 1$)
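A sketch of this outer loop; $\gamma = \frac{1}{K}$, $\sigma' = 1$ gives CoCoA (averaging), while $\gamma = 1$, $\sigma' = K$ is the adding variant. The `local_solver` argument stands for any Θ-approximate solver of the local subproblem; its interface is our assumption:

```python
import numpy as np

def cocoa_plus(X, y, lam, parts, T, local_solver, gamma, sigma_p):
    """Outer loop sketch of CoCoA(+).  `parts` is a list of K index arrays
    P_1, ..., P_K; `local_solver` returns the |P_k| nonzero entries of
    dalpha_[k] (e.g. a few epochs of SDCA on G_k^{sigma'})."""
    n, d = X.shape
    alpha, w = np.zeros(n), np.zeros(d)
    for t in range(T):
        dw = np.zeros(d)
        for Pk in parts:                   # runs on K machines in parallel
            dalpha_k = local_solver(Pk, alpha, w, X, y, lam, sigma_p)
            alpha[Pk] += gamma * dalpha_k  # local dual update
            dw += gamma * (X[Pk].T @ dalpha_k) / (lam * n)  # gamma * A @ dalpha_[k]
        w += dw                            # the single reduce step per round
    return alpha, w
```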
CoCoA+ Parameters - $\sigma'$ and $\gamma$
$\sigma'$ measures the difficulty of the given data partition
it must be chosen not smaller than
$$\sigma' \ge \sigma'_{\min} \stackrel{\text{def}}{=} \gamma \max_{\alpha \in \mathbb{R}^n} \frac{\|A\alpha\|^2}{\sum_{k=1}^K \|A\alpha_{[k]}\|^2} \qquad (4)$$
Lemma
For any $\alpha \in \mathbb{R}^n$ ($\alpha \ne 0$) we have
$$\frac{\|A\alpha\|^2}{\sum_{k=1}^K \|A\alpha_{[k]}\|^2} \le K$$
We can take the safe value $\sigma' = K \cdot \gamma$
Again: if $\gamma = \frac{1}{K}$ then $\sigma' = K \cdot \frac{1}{K} = 1$ is a safe value
New: if $\gamma = 1$ then $\sigma' = K \cdot 1 = K$ is a safe value
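The lemma is easy to check numerically; a quick illustrative sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 12, 5, 3
A = rng.standard_normal((d, n))
alpha = rng.standard_normal(n)
parts = np.array_split(np.arange(n), K)   # a balanced partition

lhs = np.linalg.norm(A @ alpha) ** 2
rhs = sum(np.linalg.norm(A[:, Pk] @ alpha[Pk]) ** 2 for Pk in parts)
print(lhs <= K * rhs)   # the lemma guarantees True for any alpha != 0
```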
How Accurately?
Assumption: Θ-approximate solution
We assume that there exists $\Theta \in [0, 1)$ such that $\forall k \in [K]$, the local solver at any iteration $t$ produces a (possibly) randomized approximate solution $\Delta\alpha_{[k]}$ which satisfies
$$\mathbb{E}\big[ \mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w) - \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w) \big] \le \Theta \big( \mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w) - \mathcal{G}_k^{\sigma'}(0; w) \big), \qquad (5)$$
where
$$\Delta\alpha^* \in \arg\max_{\Delta\alpha \in \mathbb{R}^n} \sum_{k=1}^K \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w). \qquad (6)$$
because the subproblem is not really what one wants to solve, in practice $\Theta \approx 0.9$ (depending on the cluster and the problem)
what about convergence guarantees?
how do we get a Θ-approximate solution?
Iteration Complexity - Smooth Loss
Theorem
Assume the loss functions $\ell_i$ are $(1/\mu)$-smooth for $i \in \{1, 2, \dots, n\}$. We define
$$\sigma_k \stackrel{\text{def}}{=} \max_{\alpha_{[k]} \in \mathbb{R}^n} \frac{\|A\alpha_{[k]}\|^2}{\|\alpha_{[k]}\|^2} \le |P_k| \qquad (7)$$
and $\sigma_{\max} = \max_{k \in [K]} \sigma_k$.
Then after $T$ iterations of CoCoA+, with
$$T \ge \frac{1}{\gamma(1-\Theta)} \cdot \frac{\lambda\mu n + \sigma_{\max}\sigma'}{\lambda\mu n} \log\frac{1}{\epsilon},$$
it holds that $\mathbb{E}[D(\alpha^*) - D(\alpha^{(T)})] \le \epsilon$.
Furthermore, after $T$ iterations with
$$T \ge \frac{1}{\gamma(1-\Theta)} \cdot \frac{\lambda\mu n + \sigma_{\max}\sigma'}{\lambda\mu n} \log\left( \frac{1}{\gamma(1-\Theta)} \cdot \frac{\lambda\mu n + \sigma_{\max}\sigma'}{\lambda\mu n} \cdot \frac{1}{\epsilon} \right),$$
we have the expected duality gap
$$\mathbb{E}[P(w(\alpha^{(T)})) - D(\alpha^{(T)})] \le \epsilon.$$
Averaging vs. Adding
The leading term is
$$\frac{1}{\gamma(1-\Theta)} \cdot \frac{\lambda\mu n + \sigma_{\max}\sigma'}{\lambda\mu n}.$$
Let us assume that $\forall k: |P_k| = \frac{n}{K}$ (so that $\sigma_{\max} \le \frac{n}{K}$).

Averaging ($\gamma = \frac{1}{K}$, $\sigma' = 1$):
$$\frac{K}{1-\Theta} \cdot \frac{\lambda\mu n + \frac{n}{K}}{\lambda\mu n} = \frac{1}{1-\Theta} \cdot \frac{\lambda\mu K + 1}{\lambda\mu}$$
Adding ($\gamma = 1$, $\sigma' = K$):
$$\frac{1}{1-\Theta} \cdot \frac{\lambda\mu n + \frac{n}{K} K}{\lambda\mu n} = \frac{1}{1-\Theta} \cdot \frac{\lambda\mu + 1}{\lambda\mu}$$
Note: this is in the worst case (for the worst-case example)
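Plugging the two settings into the bound from the previous theorem makes the gap tangible; a small calculator sketch (the constants below are illustrative, not from the talk):

```python
import math

def T_bound(lam, mu, n, K, theta, gamma, sigma_p, eps):
    """Iteration bound T >= (1/(gamma*(1-theta))) *
    (lam*mu*n + sigma_max*sigma') / (lam*mu*n) * log(1/eps),
    with the balanced-partition value sigma_max = n/K from the slide."""
    sigma_max = n / K
    return (1.0 / (gamma * (1 - theta))) \
        * (lam * mu * n + sigma_max * sigma_p) / (lam * mu * n) \
        * math.log(1 / eps)

lam, mu, n, K, theta, eps = 1e-6, 1.0, 500_000, 8, 0.9, 1e-4
print(T_bound(lam, mu, n, K, theta, gamma=1 / K, sigma_p=1, eps=eps))  # averaging
print(T_bound(lam, mu, n, K, theta, gamma=1.0, sigma_p=K, eps=eps))    # adding
```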
Iteration Complexity - General Convex Loss
Theorem
Consider CoCoA+ starting with $\alpha^{(0)} = 0 \in \mathbb{R}^n$, let $\ell_i(\cdot)$ be $L$-Lipschitz continuous for all $i \in \{1, 2, \dots, n\}$, and let $\epsilon > 0$ be the desired duality gap. Then after $T$ iterations, where
$$T \ge T_0 + \max\left\{ \left\lceil \frac{1}{\gamma(1-\Theta)} \right\rceil,\; \frac{4 L^2 \sigma \sigma'}{\lambda n^2 \epsilon \,\gamma(1-\Theta)} \right\},$$
$$T_0 \ge t_0 + \left\lceil \frac{2}{\gamma(1-\Theta)} \left( \frac{8 L^2 \sigma \sigma'}{\lambda n^2 \epsilon} - 1 \right) \right\rceil_+,$$
$$t_0 \ge \max\left( 0,\; \frac{1}{\gamma(1-\Theta)} \log\left( \frac{2 \lambda n^2 (D(\alpha^*) - D(\alpha^{(0)}))}{4 L^2 \sigma \sigma'} \right) \right),$$
we have that the expected duality gap satisfies $\mathbb{E}[P(w(\bar\alpha)) - D(\bar\alpha)] \le \epsilon$ at the averaged iterate
$$\bar\alpha := \frac{1}{T - T_0} \sum_{t=T_0+1}^{T-1} \alpha^{(t)},$$
where $\sigma = \sum_{k=1}^K |P_k| \sigma_k$.
SDCA as a Local Solver
SDCA
Input: $\alpha_{[k]}$, $w = w(\alpha)$
Data: Local $\{(x_i, y_i)\}_{i \in P_k}$
Initialize: $\Delta\alpha_{[k]}^{(0)} = 0 \in \mathbb{R}^n$
for $h = 0, 1, \dots, H-1$ do
  choose $i \in P_k$ uniformly at random
  $\delta_i^* = \arg\max_{\delta_i \in \mathbb{R}} \mathcal{G}_k^{\sigma'}\big(\Delta\alpha_{[k]}^{(h)} + \delta_i e_i, w\big)$
  $\Delta\alpha_{[k]}^{(h+1)} = \Delta\alpha_{[k]}^{(h)} + \delta_i^* e_i$
end for
Output: $\Delta\alpha_{[k]}^{(H)}$

Theorem
Assume the functions $\ell_i$ are $(1/\mu)$-smooth for $i \in \{1, 2, \dots, n\}$. If
$$H \ge n_k \cdot \frac{\sigma' + \lambda n \mu}{\lambda n \mu} \log\frac{1}{\Theta}, \qquad (8)$$
then SDCA will produce a Θ-approximate solution.
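For the quadratic loss the $\arg\max$ over $\delta_i$ is again closed-form, so the local solver can be sketched as follows (names and defaults are ours; this fits the `local_solver` slot of the outer-loop sketch above):

```python
import numpy as np

def local_sdca(Pk, alpha, w, X, y, lam, sigma_p, H=1000, rng=None):
    """H steps of SDCA on the local subproblem G_k^{sigma'}; a sketch for the
    quadratic loss, where the 1D argmax over delta_i is closed-form."""
    rng = rng or np.random.default_rng()
    n = X.shape[0]
    dalpha = np.zeros(len(Pk))
    v = np.zeros(X.shape[1])              # v = A @ dalpha_[k], kept incrementally
    for _ in range(H):
        j = rng.integers(len(Pk))         # uniformly random local coordinate
        i = Pk[j]
        x_i = X[i]
        delta = (y[i] - alpha[i] - dalpha[j] - w @ x_i - sigma_p * (v @ x_i)) \
                / (1.0 + sigma_p * (x_i @ x_i) / (lam * n))
        dalpha[j] += delta
        v += delta * x_i / (lam * n)
    return dalpha
```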
Total Runtime
To get accuracy $\epsilon$ we need $O\!\left( \frac{1}{1-\Theta} \log\frac{1}{\epsilon} \right)$ outer iterations.
Recall $\Theta = \left( 1 - \frac{\lambda n \gamma}{1 + \lambda n \gamma} \cdot \frac{K}{n} \right)^H$.
Let
$\tau_o$ be the duration of communication per iteration
$\tau_c$ be the duration of ONE coordinate update during the inner iteration
Total runtime
$$O\left( \frac{1}{1-\Theta} (\tau_o + H\tau_c) \right) = O\Bigg( \frac{\tau_o}{1-\Theta} \Big( 1 + H \underbrace{\tfrac{\tau_c}{\tau_o}}_{r_{c/o}} \Big) \Bigg)$$
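The trade-off can be explored numerically. A sketch that sweeps $H$ for a given ratio $r = \tau_c/\tau_o$, using the expression for $\Theta$ above (all constants illustrative, not from the talk):

```python
import numpy as np

def total_time(H, r, lam, n, gamma, K):
    """Runtime proxy (1/(1-Theta)) * (1 + H * r), with
    Theta(H) = (1 - (lam*n*gamma/(1+lam*n*gamma)) * K/n)**H from the slide,
    and r = tau_c / tau_o the computation-to-communication ratio."""
    theta = (1 - (lam * n * gamma / (1 + lam * n * gamma)) * K / n) ** H
    return (1.0 / (1 - theta)) * (1 + H * r)

lam, n, gamma, K, r = 1e-6, 500_000, 1.0, 8, 1e-3
Hs = np.unique(np.logspace(0, 6, 200).astype(int))
best = min(Hs, key=lambda H: total_time(H, r, lam, n, gamma, K))
print(best)   # cheap local updates relative to communication (small r) favor large H
```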
$H(\tau_c/\tau_o)$, $\Theta(\tau_c/\tau_o)$
[Figure: four panels showing the optimal $H$ vs. the computation-to-communication ratio $r$ (for accuracies 1e-3 and 1e-6) and the optimal $\Theta$ vs. $r$ (for accuracies 1e-3 and 1e-6).]
Apache Spark
Apache Spark
is a fast and general engine for large-scale data processing
runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse
data sources including HDFS, Cassandra, HBase, S3
is slower than our C++ code for CoCoA+
we run it on Amazon Elastic Compute Cloud (Amazon EC2)
Numerical Experiments
Datasets
Dataset   | Training (n) | Features (d) | nnz/(n·d) | λ    | Workers (K)
----------|--------------|--------------|-----------|------|------------
cov       | 522,911      | 54           | 22.22%    | 1e-6 | 4
rcv1      | 677,399      | 47,236       | 0.16%     | 1e-6 | 8
imagenet  | 32,751       | 160,000      | 100%      | 1e-5 | 32
Dependence of Primal Suboptimality on H
[Figure: log primal suboptimality vs. time (s) on COV, for $H \in \{10^5, 10^4, 10^3, 100, 1\}$.]
Comparison with Different Algorithms
COCOA
minibatch-CD
  Tianbao Yang. Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent. In NIPS 2013.
  Martin Takáč, Avleen Bijral, Peter Richtárik, and Nathan Srebro. Mini-Batch Primal and Dual Methods for SVMs. In ICML, March 2013.
local-SGD
  Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming, 127(1):3–30, October 2010.
batch-SGD
  Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming, 127(1):3–30, October 2010.
[Figure: log primal suboptimality vs. time (s) on Cov (COCOA H=1e5, minibatch-CD H=100, local-SGD H=1e5, batch-SGD H=1), RCV1 (COCOA H=1e5, minibatch-CD H=100, local-SGD H=1e4, batch-SGD H=100), and Imagenet (COCOA H=1e3, mini-batch-CD H=1, local-SGD H=1e3, mini-batch-SGD H=10).]
[Figure: log primal suboptimality vs. number of communicated vectors on Cov, RCV1, and Imagenet, for the same methods and H values as above.]
CoCoA vs. CoCoA+
[Figure: duality gap vs. number of communications and vs. elapsed time (s) on Covertype (1e-4 and 1e-5), comparing CoCoA and CoCoA+ with $H \in \{10^4, 10^5, 10^6\}$.]
CoCoA vs. CoCoA+
[Figure: duality gap vs. number of communications and vs. elapsed time (s) on RCV1 (1e-4 and 1e-5), comparing CoCoA and CoCoA+ with $H \in \{10^4, 10^5, 10^6\}$.]
Scaling up
[Figure: time to reach a fixed accuracy vs. the number of machines K: on RCV1 (K = 2, ..., 16; time to an e-3 accurate primal and to an e-2 / e-4 duality gap) for CoCoA+, CoCoA, and Mini-batch SGD, and on Epsilon (K = 20, ..., 100) for CoCoA+ and CoCoA.]
Effect of $\sigma'$
[Figure: duality gap vs. number of communications and vs. elapsed time (s) for $\gamma = 1$ (adding), with $\sigma' \in \{1, 2, 4, 6, 8\, (= K)\}$.]
References
1. Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtárik and Martin Takáč: Adding vs. Averaging in Distributed Primal-Dual Optimization, arXiv:1502.03508, 2015.
2. Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Thomas Hofmann and Michael I. Jordan: Communication-Efficient Distributed Dual Coordinate Ascent, NIPS 2014.
3. Richtárik, P. and Takáč, M.: On optimal probabilities in stochastic coordinate descent methods, arXiv:1310.3438, 2013.
4. Richtárik, P. and Takáč, M.: Parallel coordinate descent methods for big data optimization, arXiv:1212.0873, 2012.
5. Richtárik, P. and Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, Mathematical Programming, 2012.
6. Takáč, M., Bijral, A., Richtárik, P. and Srebro, N.: Mini-batch primal and dual methods for SVMs, In ICML, 2013.
7. Qu, Z., Richtárik, P. and Zhang, T.: Randomized dual coordinate ascent with arbitrary sampling, arXiv:1411.5873, 2014.
8. Qu, Z., Richtárik, P., Takáč, M. and Fercoq, O.: SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization, arXiv:1502.02268, 2015.
9. Tappenden, R., Takáč, M. and Richtárik, P.: On the Complexity of Parallel Coordinate Descent, arXiv:1503.03033, 2015.