Parallel Multi-Block ADMM with o(1/k) Convergence

Wotao Yin (UCLA Math)

W. Deng, M.-J. Lai, Z. Peng, and W. Yin, Parallel Multi-Block ADMM with o(1/k) Convergence, UCLA CAM 13-64, 2013.

Advanced Workshop at Shanghai U.


Linearly Constrained Separable Problem

    minimize    f_1(x_1) + ··· + f_N(x_N)
    subject to  A_1 x_1 + ··· + A_N x_N = c,
                x_1 ∈ X_1, ..., x_N ∈ X_N.

• f_i : R^{n_i} → (−∞, +∞] are convex functions. N ≥ 2.
• a.k.a. extended monotropic programming [Bertsekas, 2008]
• Examples:
  • Linear programming
  • Multi-agent network optimization
  • Exchange problem
  • Regularization models


Parallel and Distributed Algorithms

Motivation:
• Data may be collected and stored in a distributed way
• Often difficult to minimize all the f_i's jointly

Strategy:
• Decompose the problem into N simpler and smaller subproblems
• Solve the subproblems in parallel
• Coordinate by passing some information


Dual Decomposition

    (x_1^{k+1}, ..., x_N^{k+1}) = argmin_{x_1,...,x_N} L(x_1, ..., x_N, λ^k),
    λ^{k+1} = λ^k − α_k (Σ_{i=1}^N A_i x_i^{k+1} − c),   α_k > 0.

• Lagrangian: L(x_1, ..., x_N, λ) = Σ_{i=1}^N f_i(x_i) − λ^T (Σ_{i=1}^N A_i x_i − c)
• The x-step has N decoupled x_i-subproblems, parallelizable:

    x_i^{k+1} = argmin_{x_i} f_i(x_i) − ⟨λ^k, A_i x_i⟩,   for i = 1, 2, ..., N

• Convergence rate: O(1/√k) (for general convex problems)
• Often slow convergence in practice


Distributed ADMM [Bertsekas and Tsitsiklis, 1997, Boyd et al., 2010, Wang et al., 2013]

• Variable splitting:

    min_{{x_i},{z_i}}  Σ_{i=1}^N f_i(x_i)
    s.t.  A_i x_i − z_i = c/N,  i = 1, 2, ..., N,
          Σ_{i=1}^N z_i = 0.

• Apply ADMM: alternately update {z_i} and {x_i}, then the multipliers {λ_i}:

    z_i^{k+1} = (A_i x_i^k − c/N − λ_i^k/ρ) − (1/N) Σ_{j=1}^N (A_j x_j^k − c/N − λ_j^k/ρ),  ∀i;
    x_i^{k+1} = argmin_{x_i} f_i(x_i) + (ρ/2) ‖A_i x_i − z_i^{k+1} − c/N − λ_i^k/ρ‖²,  ∀i;
    λ_i^{k+1} = λ_i^k − ρ (A_i x_i^{k+1} − z_i^{k+1} − c/N),  ∀i.


Jacobi ADMM

• Augmented Lagrangian:

    L_ρ(x_1, ..., x_N, λ) = Σ_{i=1}^N f_i(x_i) − λ^T (Σ_{i=1}^N A_i x_i − c) + (ρ/2) ‖Σ_{i=1}^N A_i x_i − c‖²

• Do not introduce {z_i}; directly apply Jacobi-type block minimization:

    x_i^{k+1} = argmin_{x_i} L_ρ(x_1^k, ..., x_{i−1}^k, x_i, x_{i+1}^k, ..., x_N^k, λ^k)
              = argmin_{x_i} f_i(x_i) + (ρ/2) ‖A_i x_i + Σ_{j≠i} A_j x_j^k − c − λ^k/ρ‖²

  for i = 1, ..., N in parallel;

    λ^{k+1} = λ^k − ρ (Σ_{i=1}^N A_i x_i^{k+1} − c).

• Not necessarily convergent (even if N = 2)
• Needs either conditions or modifications to converge


A Sufficient Condition for Convergence

Theorem. Suppose that there exists δ > 0 such that

    ‖A_i^T A_j‖ ≤ δ, ∀ i ≠ j,   and   λ_min(A_i^T A_i) > 3(N − 1)δ, ∀ i.

Then Jacobi ADMM converges to a solution.

The assumption basically says:
• {A_i, i = 1, 2, ..., N} are mutually "near-orthogonal";
• every A_i has full column rank and is sufficiently strong.


Proximal Jacobi ADMM

1. for i = 1, ..., N in parallel,

    x_i^{k+1} = argmin_{x_i} f_i(x_i) + (ρ/2) ‖A_i x_i + Σ_{j≠i} A_j x_j^k − c − λ^k/ρ‖² + (1/2) ‖x_i − x_i^k‖²_{P_i};

2. λ^{k+1} = λ^k − γρ (Σ_{i=1}^N A_i x_i^{k+1} − c),   γ > 0.

• The added proximal term is critical to convergence.
• Some choices of P_i ≽ 0 make the subproblems easier to solve and more stable.
• Global o(1/k) convergence if P_i and γ are properly chosen.
• Suitable for parallel and distributed computing.
• A small numerical sketch of this iteration appears after the lemma below.


Little-o Convergence

Lemma. If a sequence {a_k} ⊂ R obeys a_k ≥ 0 and Σ_{t=1}^∞ a_t < ∞, then:
1. (convergence) lim_{k→∞} a_k = 0;
2. (ergodic convergence) (1/k) Σ_{t=1}^k a_t = O(1/k);
3. (running best) min_{t≤k} {a_t} = o(1/k);
4. (non-ergodic convergence) if {a_k} is monotonically nonincreasing, then a_k = o(1/k).
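To make the Proximal Jacobi ADMM iteration above concrete, here is a minimal single-process Python/NumPy sketch for quadratic f_i(x_i) = (1/2)‖C_i x_i − d_i‖² and P_i = τ_i I, so each block subproblem has a closed-form solution. The synthetic data, the sizes, and the conservative τ_i (taken from the simplified sufficient condition on the next slide) are illustrative assumptions; this is not the authors' implementation.

    # Hedged sketch (not the authors' code): Proximal Jacobi ADMM with P_i = tau_i*I
    # on a synthetic instance with quadratic f_i(x_i) = (1/2)||C_i x_i - d_i||^2.
    import numpy as np

    np.random.seed(0)
    N, m, n = 4, 30, 10                               # blocks, rows of c, size of each x_i
    A = [np.random.randn(m, n) for _ in range(N)]     # constraint blocks A_i
    C = [np.random.randn(n, n) for _ in range(N)]     # data defining f_i
    d = [np.random.randn(n) for _ in range(N)]
    c = np.random.randn(m)

    rho, gamma = 1.0, 1.0
    # conservative tau_i > rho*(N/(2-gamma) - 1)*||A_i||^2 (simplified condition, next slide)
    tau = [rho * (N / (2 - gamma) - 1) * np.linalg.norm(Ai, 2) ** 2 + 1e-3 for Ai in A]

    x = [np.zeros(n) for _ in range(N)]
    lam = np.zeros(m)
    for k in range(1000):
        Ax = [A[i] @ x[i] for i in range(N)]
        s = sum(Ax) - c                                # Sum_j A_j x_j^k - c
        new_x = []
        for i in range(N):                             # the N subproblems are independent -> parallelizable
            r = s - Ax[i] - lam / rho                  # Sum_{j != i} A_j x_j^k - c - lam^k/rho
            # closed-form minimizer of f_i + (rho/2)||A_i x_i + r||^2 + (tau_i/2)||x_i - x_i^k||^2
            H = C[i].T @ C[i] + rho * A[i].T @ A[i] + tau[i] * np.eye(n)
            g = C[i].T @ d[i] - rho * A[i].T @ r + tau[i] * x[i]
            new_x.append(np.linalg.solve(H, g))
        x = new_x
        lam = lam - gamma * rho * (sum(A[i] @ x[i] for i in range(N)) - c)

    print("constraint residual:", np.linalg.norm(sum(A[i] @ x[i] for i in range(N)) - c))

In a distributed setting each worker would own one (A_i, C_i, d_i) and only the m-vector Σ_j A_j x_j would need to be communicated once per iteration.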
Convergence of Proximal Jacobi ADMM

Sufficient condition: there exist ε_i > 0, i = 1, 2, ..., N, such that

    (C1)   P_i ≽ ρ (1/ε_i − 1) A_i^T A_i,  i = 1, 2, ..., N,   and   Σ_{i=1}^N ε_i < 2 − γ.

Simplification of (C1): set ε_i < (2 − γ)/N, which gives

    P_i ≽ ρ (N/(2 − γ) − 1) A_i^T A_i,  i = 1, 2, ..., N.

• P_i = τ_i I (standard proximal method): τ_i > ρ (N/(2 − γ) − 1) ‖A_i‖²
• P_i = τ_i I − ρ A_i^T A_i (prox-linear method): τ_i > ρ N ‖A_i‖² / (2 − γ)


o(1/k) Convergence Rate

Notation:

    G_x := diag(P_1 + ρ A_1^T A_1, ..., P_N + ρ A_N^T A_N)  (block diagonal),
    G_x' := G_x − ρ A^T A,   where A := [A_1, ..., A_N].

Theorem. If G_x' ≻ 0 and condition (C1) holds, then

    ‖x^k − x^{k+1}‖²_{G_x'} = o(1/k)   and   ‖λ^k − λ^{k+1}‖² = o(1/k).

Note: (x^{k+1}, λ^{k+1}) is optimal if ‖x^k − x^{k+1}‖²_{G_x'} = 0 and ‖λ^k − λ^{k+1}‖² = 0, so the quantity ‖u^k − u^{k+1}‖²_{G'} serves as a measure of the convergence rate.

The proof is similar to He and Yuan [2012], He et al. [2013].


Adaptive Parameter Tuning

• Condition (C1) may be rather conservative.
• Adaptively adjust the matrices {P_i} while keeping guaranteed convergence:

1  Initialize with small P_i^0 ≻ 0 (i = 1, 2, ..., N) and a small η > 0;
2  for k = 1, 2, ... do
3    if h(u^{k−1}, u^k) > η · ‖u^{k−1} − u^k‖²_G then
4      P_i^{k+1} ← P_i^k, ∀i;
5    else
6      Increase P_i: P_i^{k+1} ← α_i P_i^k + β_i Q_i (α_i > 1, β_i ≥ 0, Q_i ≻ 0), ∀i;
7      Restart: u^k ← u^{k−1};

Note: h(u^k, u^{k+1}) can be computed at little extra cost in the algorithm.

• This often yields much smaller parameters {P_i} than those required by condition (C1), leading to substantially faster convergence in practice.


Numerical Experiments

Compare several parallel splitting algorithms:
• Prox-JADMM: Proximal Jacobi ADMM [this work]
• VSADMM: distributed ADMM with variable splitting [Bertsekas, 2008]
• Corr-JADMM: Jacobi ADMM with correction steps [He et al., 2013]

They have roughly the same per-iteration cost (in terms of both computation and communication).


Exchange Problem

Consider a network of N agents that exchange n commodities.

    min_{x_i}  Σ_{i=1}^N f_i(x_i)   s.t.  Σ_{i=1}^N x_i = 0.

• x_i ∈ R^n (i = 1, 2, ..., N): quantities of commodities exchanged by agent i.
• f_i : R^n → R: cost function of agent i.


Numerical Result

Let f_i(x_i) := (1/2) ‖C_i x_i − d_i‖², with C_i ∈ R^{p×n} and d_i ∈ R^p.

Figure: Exchange problem (n = 100, N = 100, p = 80). [Plots of objective value and residual versus iteration for Prox-JADMM, VSADMM, and Corr-JADMM.]


Basis Pursuit

Finding sparse solutions of an under-determined linear system:

    min_x ‖x‖_1   s.t.  Ax = c

• x ∈ R^n, A ∈ R^{m×n} (m < n)
• Partition the data into N blocks: x = [x_1, x_2, ..., x_N], A = [A_1, A_2, ..., A_N], f_i(x_i) = ‖x_i‖_1
• YALL1: a dual-ADMM solver for the basis pursuit problem.


Numerical Result

Figure: ℓ1-problem (n = 1000, m = 300, k = 60). [Plots of relative error versus iteration for Prox-JADMM, YALL1, VSADMM, and Corr-JADMM; (a) noise-free (σ = 0), (b) noise added (σ = 10⁻³).]


Amazon EC2

Solved two basis pursuit problems:

               m          n          k          Size
    dataset 1  1.0 × 10⁵  2.0 × 10⁵  2.0 × 10³  150GB
    dataset 2  1.5 × 10⁵  3.0 × 10⁵  3.0 × 10³  337GB

Environment:
• C code using GSL and MPI, about 300 lines
• 10 instances from Amazon, each with 8 cores and 68GB RAM
• price: $17 per hour

Timing and cost results:

                          150GB Test                  337GB Test
                        Itr   Time(s)  Cost($)      Itr   Time(s)  Cost($)
    Data generation      –      44.4     0.21        –      99.5     0.5
    CPU per iteration    –      1.32      –          –      2.85      –
    Comm. per iteration  –      0.07      –          –      0.15      –
    Reach 10⁻¹          23      30.4     0.14       27     79.08     0.37
    Reach 10⁻²          30      39.4     0.18       39    113.68     0.53
    Reach 10⁻³          86     112.7     0.53       84    244.49     1.15
    Reach 10⁻⁴         234     307.9     1.45       89    259.24     1.22
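To connect the prox-linear choice P_i = τ_i I − ρ A_i^T A_i with the basis pursuit experiments above, here is a hedged single-process Python/NumPy sketch: with this P_i each block update reduces to soft-thresholding, and only the m-dimensional residual Σ_j A_j x_j − c has to be shared per iteration (one allreduce in an MPI setting). The synthetic data, the penalty ρ, and the loop length are illustrative assumptions; this is not the authors' C/GSL/MPI implementation.

    # Hedged sketch (not the authors' code): prox-linear Prox-JADMM for basis pursuit,
    # P_i = tau_i*I - rho*A_i^T A_i, so each block update is a soft-thresholding step.
    import numpy as np

    np.random.seed(1)
    n, m, N, s = 100, 40, 4, 5                          # dimensions, number of blocks, sparsity
    x_true = np.zeros(n)
    x_true[np.random.choice(n, s, replace=False)] = np.random.randn(s)
    A = np.random.randn(m, n)
    c = A @ x_true
    blocks = np.split(np.arange(n), N)                   # column indices owned by each block/worker

    rho, gamma = 10.0 / np.linalg.norm(c, 1), 1.0        # heuristic penalty, illustrative
    tau = [rho * N * np.linalg.norm(A[:, b], 2) ** 2 / (2 - gamma) + 1e-8 for b in blocks]

    x, lam = np.zeros(n), np.zeros(m)
    for k in range(5000):
        resid = A @ x - c                                # Sum_j A_j x_j^k - c (one allreduce if distributed)
        new_x = x.copy()
        for i, b in enumerate(blocks):                   # independent -> runs in parallel across workers
            v = x[b] - (rho / tau[i]) * A[:, b].T @ (resid - lam / rho)
            new_x[b] = np.sign(v) * np.maximum(np.abs(v) - 1.0 / tau[i], 0.0)   # soft-threshold
        x = new_x
        lam = lam - gamma * rho * (A @ x - c)

    print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))

The prox-linear form avoids solving a linear system per block; each update needs only matrix-vector products with A_i and A_i^T, which is what keeps the per-iteration cost low in the large distributed runs.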
Summary

• It is feasible to extend ADMM from 2 blocks to 3 or more blocks.
• Jacobi ADMM is good for problems with large and distributed data.
• Gauss-Seidel ADMM is good for 3 or a few more blocks; Jacobi ADMM is good for many blocks.
• Asynchronous subproblems become a real need (talk to Dong Qian).


References

D. P. Bertsekas. Extended monotropic programming and duality. Journal of Optimization Theory and Applications, 139(2):209–225, 2008.

D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, Second Edition. Athena Scientific, 1997.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2010.

X. F. Wang, M. Y. Hong, S. Q. Ma, and Z.-Q. Luo. Solving multiple-block separable convex minimization problems using two-block alternating direction method of multipliers. arXiv preprint arXiv:1308.5294, 2013.

B. S. He and X. M. Yuan. On non-ergodic convergence rate of Douglas-Rachford alternating direction method of multipliers. 2012.

B. S. He, L. S. Hou, and X. M. Yuan. On full Jacobian decomposition of the augmented Lagrangian method for separable convex programming. 2013.