Jacobi ADMM

Parallel Multi-Block ADMM with o(1/k) Convergence
Wotao Yin (UCLA Math)
W. Deng, M.-J. Lai, Z. Peng, and W. Yin, Parallel Multi-Block ADMM with
o(1/k) Convergence, UCLA CAM 13-64, 2013.
Advanced Workshop at Shanghai U.
Linearly Constrained Separable Problem
minimize    f_1(x_1) + ⋯ + f_N(x_N)
subject to  A_1 x_1 + ⋯ + A_N x_N = c,
            x_1 ∈ X_1, …, x_N ∈ X_N.
• f_i : R^{n_i} → (−∞, +∞] are convex functions; N ≥ 2.
• a.k.a. extended monotropic programming [Bertsekas, 2008]
• Examples:
  • Linear programming
  • Multi-agent network optimization
  • Exchange problem
  • Regularization model
Parallel and Distributed Algorithms
Motivation:
• Data may be collected and stored in a distributed way
• Often difficult to minimize all the f_i's jointly
Strategy:
• Decompose the problem into N simpler and smaller subproblems
• Solve subproblems in parallel
• Coordinate by passing some information
Dual Decomposition
(x_1^{k+1}, x_2^{k+1}, …, x_N^{k+1}) = argmin_{x_1,…,x_N} L(x_1, …, x_N, λ^k),

λ^{k+1} = λ^k − α_k ( Σ_{i=1}^N A_i x_i^{k+1} − c ),   α_k > 0.
• Lagrangian:
  L(x_1, …, x_N, λ) = Σ_{i=1}^N f_i(x_i) − λ^T ( Σ_{i=1}^N A_i x_i − c )
• x-step has N decoupled x_i-subproblems, parallelizable:
  x_i^{k+1} = argmin_{x_i} f_i(x_i) − ⟨λ^k, A_i x_i⟩,   for i = 1, 2, …, N,
• Convergence rate: O(1/√k) (for general convex problems)
• Often slow convergence in practice (a toy sketch of the method follows below)
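For concreteness, here is a minimal NumPy sketch of dual decomposition on a toy instance with quadratic blocks f_i(x_i) = (1/2)‖x_i − d_i‖²; the quadratic choice, the random data, and the step-size rule are assumptions for illustration, not part of the slides:

import numpy as np

np.random.seed(0)
N, n, m = 4, 5, 3                          # number of blocks, block size, number of constraints
A = [np.random.randn(m, n) for _ in range(N)]
d = [np.random.randn(n) for _ in range(N)]
c = np.random.randn(m)

# constant step size 1/L, where L is the Lipschitz constant of the dual gradient
L = np.linalg.norm(sum(Ai @ Ai.T for Ai in A), 2)
alpha = 1.0 / L

lam = np.zeros(m)
for k in range(1000):
    # x-step: N decoupled subproblems argmin_xi 0.5*||xi - di||^2 - <lam, Ai xi>,
    # whose closed-form solution is xi = di + Ai^T lam
    x = [d[i] + A[i].T @ lam for i in range(N)]
    # multiplier step
    residual = sum(A[i] @ x[i] for i in range(N)) - c
    lam = lam - alpha * residual

print("constraint violation:", np.linalg.norm(residual))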
Distributed ADMM
[Bertsekas and Tsitsiklis, 1997, Boyd et al., 2010, Wang et al., 2013]
• Variable splitting:

  min_{{x_i},{z_i}}  Σ_{i=1}^N f_i(x_i)
  s.t.   A_i x_i − z_i = c/N,  i = 1, 2, …, N,
         Σ_{i=1}^N z_i = 0.
• Apply ADMM: alternately update {x_i} and {z_i}, then the multipliers {λ_i}:
  z_i^{k+1} = A_i x_i^k − c/N − λ_i^k/ρ − (1/N) Σ_{j=1}^N ( A_j x_j^k − c/N − λ_j^k/ρ ),   ∀i;

  x_i^{k+1} = argmin_{x_i} f_i(x_i) + (ρ/2) ‖ A_i x_i − z_i^{k+1} − c/N − λ_i^k/ρ ‖²,   ∀i;

  λ_i^{k+1} = λ_i^k − ρ ( A_i x_i^{k+1} − z_i^{k+1} − c/N ),   ∀i.
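The z-step above is simply the projection of the points v_i := A_i x_i^k − c/N − λ_i^k/ρ onto the subspace {Σ_i z_i = 0}; a minimal NumPy sketch (the function name and dimensions are illustrative assumptions):

import numpy as np

def vsadmm_z_step(Ax, lam, c, rho):
    """Ax: list of A_i x_i^k; lam: list of lambda_i^k. Returns the list of z_i^{k+1}."""
    N = len(Ax)
    v = [Ax[i] - c / N - lam[i] / rho for i in range(N)]   # points to be projected
    v_bar = sum(v) / N
    return [v[i] - v_bar for i in range(N)]                # projection onto {sum_i z_i = 0}

# tiny sanity check: the returned z_i sum to zero
np.random.seed(1)
N, m = 3, 4
Ax = [np.random.randn(m) for _ in range(N)]
lam = [np.random.randn(m) for _ in range(N)]
z = vsadmm_z_step(Ax, lam, np.random.randn(m), rho=1.0)
print(np.linalg.norm(sum(z)))   # ≈ 0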
Jacobi ADMM
• Augmented Lagrangian:
  L_ρ(x_1, …, x_N, λ) = Σ_{i=1}^N f_i(x_i) − λ^T ( Σ_{i=1}^N A_i x_i − c ) + (ρ/2) ‖ Σ_{i=1}^N A_i x_i − c ‖²
• Do not introduce {z_i}; directly apply Jacobi-type block minimization:

  x_i^{k+1} = argmin_{x_i} L_ρ(x_1^k, …, x_{i−1}^k, x_i, x_{i+1}^k, …, x_N^k, λ^k)
            = argmin_{x_i} f_i(x_i) + (ρ/2) ‖ A_i x_i + Σ_{j≠i} A_j x_j^k − c − λ^k/ρ ‖²

  for i = 1, …, N in parallel;

  λ^{k+1} = λ^k − ρ ( Σ_{i=1}^N A_i x_i^{k+1} − c ).
• Not necessarily convergent (even if N = 2)
• Need either conditions or modifications to converge
A Sufficient Condition for Convergence
Theorem
Suppose that there exists δ > 0 such that
‖A_i^T A_j‖ ≤ δ,   ∀ i ≠ j,
and
λ_min(A_i^T A_i) > 3(N − 1)δ,   ∀ i.
Then Jacobi ADMM converges to a solution.
The assumption basically says:
• {Ai , i = 1, 2, . . . , N } are mutually “near-orthogonal”
• every Ai has full column rank and is sufficiently strong.
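As an illustration, a small NumPy routine that checks this condition for a given set of blocks; the routine and the test data are assumptions, not part of the slides (nearly orthogonal, well-conditioned tall blocks tend to pass, while generic blocks usually do not):

import numpy as np

def satisfies_jacobi_condition(A_list):
    """Check: ||A_i^T A_j|| <= delta for all i != j, and
    lambda_min(A_i^T A_i) > 3*(N-1)*delta for all i,
    with delta taken as the largest cross-product norm."""
    N = len(A_list)
    delta = max(np.linalg.norm(A_list[i].T @ A_list[j], 2)
                for i in range(N) for j in range(N) if i != j)
    lam_min = min(np.linalg.eigvalsh(A.T @ A)[0] for A in A_list)
    return lam_min > 3 * (N - 1) * delta

np.random.seed(0)
# three tall blocks with orthonormal columns in a high-dimensional space
A_list = [np.linalg.qr(np.random.randn(10000, 5))[0] for _ in range(3)]
print(satisfies_jacobi_condition(A_list))   # expected True with high probability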
Proximal Jacobi ADMM
1. for i = 1, …, N in parallel,

   x_i^{k+1} = argmin_{x_i} f_i(x_i) + (ρ/2) ‖ A_i x_i + Σ_{j≠i} A_j x_j^k − b − λ^k/ρ ‖² + (1/2) ‖ x_i − x_i^k ‖²_{P_i} ;

2. λ^{k+1} = λ^k − γρ ( Σ_{i=1}^N A_i x_i^{k+1} − b ),   γ > 0.
• The added proximal term is critical to convergence.
• Some forms of P_i ⪰ 0 make the subproblems easier to solve and more stable.
• Global o(1/k) convergence if P_i and γ are properly chosen.
• Suitable for parallel and distributed computing (a toy NumPy sketch follows below).
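A minimal NumPy sketch of this iteration with P_i = τ_i I and quadratic blocks f_i(x_i) = (1/2)‖C_i x_i − d_i‖²; the random data, the quadratic f_i, and the specific τ_i (taken from the simplified condition given two slides later) are assumptions for illustration:

import numpy as np

np.random.seed(0)
N, n, m, p = 4, 8, 6, 10
A = [np.random.randn(m, n) for _ in range(N)]
C = [np.random.randn(p, n) for _ in range(N)]
d = [np.random.randn(p) for _ in range(N)]
b = np.random.randn(m)

rho, gamma = 1.0, 1.0
# tau_i per the simplified sufficient condition (standard proximal choice), with a small margin
tau = [1.01 * rho * (N / (2 - gamma) - 1) * np.linalg.norm(Ai, 2) ** 2 for Ai in A]

x = [np.zeros(n) for _ in range(N)]
lam = np.zeros(m)
for k in range(1000):
    Ax = [A[i] @ x[i] for i in range(N)]
    x_new = []
    for i in range(N):                      # this loop is parallelizable across blocks
        s = sum(Ax[j] for j in range(N) if j != i) - b - lam / rho
        # closed-form solution of the quadratic x_i-subproblem
        H = C[i].T @ C[i] + rho * A[i].T @ A[i] + tau[i] * np.eye(n)
        rhs = C[i].T @ d[i] - rho * A[i].T @ s + tau[i] * x[i]
        x_new.append(np.linalg.solve(H, rhs))
    x = x_new
    lam = lam - gamma * rho * (sum(A[i] @ x[i] for i in range(N)) - b)

# the constraint violation decreases toward 0 as k grows
print("constraint violation:", np.linalg.norm(sum(A[i] @ x[i] for i in range(N)) - b))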
Little-o convergence
Lemma
If a sequence {a_k} ⊂ R obeys

  a_k ≥ 0   and   Σ_{t=1}^∞ a_t < ∞,

then we have:
1. (convergence) lim_{k→∞} a_k = 0;
2. (ergodic convergence) (1/k) Σ_{t=1}^k a_t = O(1/k);
3. (running best) min_{t≤k} a_t = o(1/k);
4. (non-ergodic convergence) if a_k is monotonically nonincreasing, then a_k = o(1/k).
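A one-line justification of the running-best claim (item 3), added here since the slide states the lemma without proof:

  min_{t≤k} a_t ≤ ( 1/(⌊k/2⌋+1) ) Σ_{t=⌈k/2⌉}^{k} a_t ≤ (2/k) Σ_{t≥⌈k/2⌉} a_t = o(1/k),

since the minimum is bounded by the average over the last ⌊k/2⌋+1 ≥ k/2 terms, and the tail of a convergent series of nonnegative terms tends to zero.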
Convergence of Proximal Jacobi ADMM
Sufficient condition: there exist ε_i > 0, i = 1, 2, …, N, such that

(C1)   P_i ⪰ ρ (1/ε_i − 1) A_i^T A_i,  i = 1, 2, …, N,   and   Σ_{i=1}^N ε_i < 2 − γ.

Simplification of (C1): taking ε_i slightly below (2 − γ)/N, it suffices that

   P_i ≻ ρ ( N/(2 − γ) − 1 ) A_i^T A_i,   i = 1, 2, …, N.

• P_i = τ_i I (standard proximal method): τ_i > ρ ( N/(2 − γ) − 1 ) ‖A_i‖²
• P_i = τ_i I − ρ A_i^T A_i (prox-linear method): τ_i > ρ N ‖A_i‖² / (2 − γ)
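A small helper illustrating the two choices; the function names and the safety margin are ad hoc, the formulas mirror the bullets above:

import numpy as np

def tau_standard(A_i, rho, gamma, N, margin=1.01):
    """P_i = tau_i * I: requires tau_i > rho * (N/(2-gamma) - 1) * ||A_i||^2."""
    return margin * rho * (N / (2 - gamma) - 1.0) * np.linalg.norm(A_i, 2) ** 2

def tau_prox_linear(A_i, rho, gamma, N, margin=1.01):
    """P_i = tau_i * I - rho * A_i^T A_i: requires tau_i > rho * N * ||A_i||^2 / (2-gamma)."""
    return margin * rho * N * np.linalg.norm(A_i, 2) ** 2 / (2 - gamma)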
o(1/k) Convergence Rate
Notation: let A := [A_1, A_2, …, A_N] and

  G_x := diag( P_1 + ρ A_1^T A_1, …, P_N + ρ A_N^T A_N )  (block diagonal),    G′_x := G_x − ρ A^T A.
Theorem
If G′_x ≻ 0 and condition (C1) holds, then

  ‖x^k − x^{k+1}‖²_{G′_x} = o(1/k)   and   ‖λ^k − λ^{k+1}‖² = o(1/k).

Note: (x^{k+1}, λ^{k+1}) is optimal if ‖x^k − x^{k+1}‖²_{G′_x} = 0 and ‖λ^k − λ^{k+1}‖² = 0, so the
quantity ‖u^k − u^{k+1}‖²_{G′} serves as a measure of the convergence rate. The proof is similar
to He and Yuan [2012], He et al. [2013].
Adaptive Parameter Tuning
• Condition (C1) may be rather conservative.
• Adaptively adjust the matrices {P_i} while preserving guaranteed convergence:
1: Initialize with small P_i^0 ⪰ 0 (i = 1, 2, …, N) and a small η > 0;
2: for k = 1, 2, … do
3:   if h(u^{k−1}, u^k) > η · ‖u^{k−1} − u^k‖²_G then
4:     P_i^{k+1} ← P_i^k, ∀i;
5:   else
6:     increase P_i: P_i^{k+1} ← α_i P_i^k + β_i Q_i  (α_i > 1, β_i ≥ 0, Q_i ⪰ 0), ∀i;
7:     restart: u^k ← u^{k−1};
Note: h(u^k, u^{k+1}) can be computed at little extra cost in the algorithm.
• Often yields much smaller parameters {P_i} than those required by condition (C1), leading to
  substantially faster convergence in practice (a sketch of the control logic follows below).
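A sketch of the control logic above; the function h and the G-norm come from the paper and are treated here as given values, so this is only a scaffold under those assumptions, not the authors' implementation:

def adapt_P(P_list, h_val, diff_G_sq, eta, alphas, betas, Q_list):
    """One step of the adaptive rule.
    Keep every P_i if h(u^{k-1}, u^k) > eta * ||u^{k-1} - u^k||_G^2;
    otherwise enlarge every P_i and signal a restart (u^k <- u^{k-1})."""
    if h_val > eta * diff_G_sq:
        return P_list, False                               # keep P_i, no restart
    P_new = [alphas[i] * P_list[i] + betas[i] * Q_list[i]  # alpha_i > 1, beta_i >= 0, Q_i PSD
             for i in range(len(P_list))]
    return P_new, True                                     # increased P_i, restart the iterate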
Numerical Experiments
Compare several parallel splitting algorithms:
• Prox-JADMM: Proximal Jacobi ADMM [this work]
• VSADMM: distributed ADMM, variable splitting [Bertsekas, 2008]
• Corr-JADMM: Jacobi ADMM with correction steps [He et al., 2013]
They have roughly the same per-iteration cost (in terms of both computation
and communication).
Exchange Problem
Consider a network of N agents that exchange n commodities.
  min_{{x_i}}  Σ_{i=1}^N f_i(x_i)   s.t.   Σ_{i=1}^N x_i = 0.
• x_i ∈ R^n (i = 1, 2, …, N): quantities of commodities exchanged by agent i.
• f_i : R^n → R: cost function of agent i.
Numerical Result
Let f_i(x_i) := (1/2)‖C_i x_i − d_i‖², where C_i ∈ R^{p×n} and d_i ∈ R^p.
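For this quadratic f_i the Prox-JADMM x_i-subproblem has a closed form; a hedged sketch assuming A_i = I and c = 0 (the exchange constraint) and P_i = τ I, with an illustrative function name and signature:

import numpy as np

def exchange_x_update(Ci, di, xi_k, sum_others, lam, rho, tau):
    """Closed-form prox-Jacobi update of block i for the exchange problem:
    argmin_x 0.5*||Ci x - di||^2 + (rho/2)*||x + sum_others - lam/rho||^2 + (tau/2)*||x - xi_k||^2,
    where sum_others = sum_{j != i} x_j^k."""
    n = xi_k.size
    H = Ci.T @ Ci + (rho + tau) * np.eye(n)
    rhs = Ci.T @ di - rho * (sum_others - lam / rho) + tau * xi_k
    return np.linalg.solve(H, rhs)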
[Figure: two panels plotting Objective Value and Residual versus Iteration (0–200) for Prox-JADMM, VSADMM, and Corr-JADMM.]
Figure: Exchange problem (n = 100, N = 100, p = 80).
Basis Pursuit
Finding sparse solutions of an under-determined linear system:
  min_x ‖x‖_1   s.t.   Ax = c

• x ∈ R^n, A ∈ R^{m×n} (m < n)
• Partition the data into N blocks:
  x = [x_1, x_2, …, x_N],  A = [A_1, A_2, …, A_N],  f_i(x_i) = ‖x_i‖_1
• YALL1: a dual-ADMM solver for the basis pursuit problem.
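With the prox-linear choice P_i = τ_i I − ρ A_i^T A_i, each block's ℓ1 subproblem reduces to one soft-thresholding step; a minimal sketch (the derivation follows the slides' formulas, but the function names and interface are assumptions, not the Prox-JADMM or YALL1 code):

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bp_block_update(Ai, xi_k, residual, lam, rho, tau):
    """Prox-linear Jacobi update for f_i(x_i) = ||x_i||_1:
    x_i^{k+1} = shrink( x_i^k - (rho/tau) * A_i^T (A x^k - c - lam/rho), 1/tau ),
    where residual = A x^k - c is shared by all blocks."""
    v = xi_k - (rho / tau) * Ai.T @ (residual - lam / rho)
    return soft_threshold(v, 1.0 / tau)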
Numerical Result
[Figure: Relative Error versus Iteration for Prox-JADMM, YALL1, VSADMM, and Corr-JADMM; panel (a) noise-free (σ = 0), panel (b) noise added (σ = 10^{-3}).]

Figure: ℓ1-problem (n = 1000, m = 300, k = 60).
Amazon EC2
Tested on two basis pursuit problems.
              m           n           k           Size
dataset 1     1.0 × 10^5  2.0 × 10^5  2.0 × 10^3  150GB
dataset 2     1.5 × 10^5  3.0 × 10^5  3.0 × 10^3  337GB
Environment:
• C code uses GSL and MPI, about 300 lines
• 10 instances from Amazon, each with 8 cores and 68GB RAM
• price: $17 per hour
                          150GB Test                   337GB Test
                     Itr   Time(s)   Cost($)      Itr   Time(s)   Cost($)
Data generation       –      44.4      0.21        –      99.5      0.5
CPU per iteration     –       1.32      –          –       2.85      –
Comm. per iteration   –       0.07      –          –       0.15      –
Reach 10^{-1}        23      30.4      0.14       27      79.08     0.37
Reach 10^{-2}        30      39.4      0.18       39     113.68     0.53
Reach 10^{-3}        86     112.7      0.53       84     244.49     1.15
Reach 10^{-4}       234     307.9      1.45       89     259.24     1.22
Summary
• It is feasible to extend ADMM from 2 blocks to 3 or more blocks
• Jacobi ADMM is good for problems with large and distributed data
• Gauss-Seidel ADMM is good for 3 or a few more blocks,
Jacobi ADMM is good for many blocks
• Asynchronous subproblems become a real need (talk to Dong Qian)
References
D. P. Bertsekas. Extended monotropic programming and duality. Journal of
optimization theory and applications, 139(2):209–225, 2008.
D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical
Methods, Second Edition. Athena Scientific, 1997.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and
statistical learning via the alternating direction method of multipliers. Foundations and
Trends in Machine Learning, 3(1):1–122, 2010.
X. F. Wang, M. Y. Hong, S. Q. Ma, and Z.-Q. Luo. Solving multiple-block separable
convex minimization problems using two-block alternating direction method of
multipliers. arXiv preprint arXiv:1308.5294, 2013.
B. S. He and X. M. Yuan. On non-ergodic convergence rate of Douglas-Rachford
alternating direction method of multipliers. 2012.
B. S. He, L. S. Hou, and X. M. Yuan. On full Jacobian decomposition of the
augmented Lagrangian method for separable convex programming. 2013.