Chapter 8: Proximal Point Method and Augmented Lagrangian Method
Shiqian Ma, SEEM 5121, Dept. of SEEM, CUHK
Linearly Constrained Problem
Consider the linearly constrained problem:
min f(x)
s.t. Ax = b,
where A ∈ R^{p×n}.
A natural thought is gradient projection method:
x^{k+1} := P_{x|Ax=b}(x^k − t_k ∇f(x^k))
But, projection onto the affine set Ax = b is not easy:
P_{x|Ax=b}(u) = argmin_{x: Ax=b} ‖x − u‖₂² = u + A^T(AA^T)^{−1}(b − Au)
inexpensive only if p ≪ n, or AA^T = I, ...
Usually, we want to avoid computing projection onto affine set.
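For illustration, a minimal numpy sketch of this projection formula (data and function name are hypothetical, not from the slides); the p × p solve with AA^T is exactly the step that gets expensive when p is large:

```python
import numpy as np

def project_affine(u, A, b):
    """Projection onto {x | Ax = b}: u + A^T (A A^T)^{-1} (b - A u)."""
    # solve the p x p system (A A^T) y = b - A u rather than forming an inverse
    y = np.linalg.solve(A @ A.T, b - A @ u)
    return u + A.T @ y

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 10))      # p = 3, n = 10 (full row rank with prob. 1)
b = rng.standard_normal(3)
u = rng.standard_normal(10)
x = project_affine(u, A, b)
print(np.linalg.norm(A @ x - b))      # ~ 0: the projection is feasible
```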
Augmented Lagrangian Method
For the linearly constrained problem, the augmented Lagrangian method
(ALM) is usually used. Define the augmented Lagrangian function:
L_t(x; λ) := f(x) − λ^T(Ax − b) + (t/2) ‖Ax − b‖₂²
where λ is called the Lagrange multiplier or the dual variable
The augmented Lagrangian method is (starting with λ^0 = 0)
x^{k+1} := argmin_x L_t(x; λ^k)
λ^{k+1} := λ^k − t(Ax^{k+1} − b)
ALM is in fact the proximal point method applied to the dual problem
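As a toy illustration (not part of the slides), here are the two ALM steps for the quadratic objective f(x) = (1/2)‖x − c‖₂², for which the x-update has a closed form; all names and data below are hypothetical:

```python
import numpy as np

def alm_quadratic(A, b, c, t=1.0, iters=50):
    """ALM for min 0.5*||x - c||^2  s.t.  Ax = b (x-step solvable in closed form)."""
    p, n = A.shape
    lam = np.zeros(p)                      # lambda^0 = 0
    M = np.eye(n) + t * (A.T @ A)          # matrix of the x-subproblem
    for _ in range(iters):
        # x^{k+1} = argmin_x L_t(x; lambda^k): (I + t A^T A) x = c + A^T lam + t A^T b
        x = np.linalg.solve(M, c + A.T @ lam + t * A.T @ b)
        # lambda^{k+1} = lambda^k - t (A x^{k+1} - b)
        lam = lam - t * (A @ x - b)
    return x, lam

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 8)); b = rng.standard_normal(3)
c = rng.standard_normal(8)
x, lam = alm_quadratic(A, b, c)
print(np.linalg.norm(A @ x - b))           # feasibility residual, ~ 0
```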
8.1. Proximal Point Method
min_x f(x)
where f(x) is a closed convex function.
The proximal point method (PPM) iterates:
x^k = prox_{t_k f}(x^{k−1}) = argmin_u { f(u) + (1/(2t_k)) ‖u − x^{k−1}‖₂² }
• can be viewed as proximal gradient method with g(x) = 0
• of interest if prox operator is much easier than minimizing f directly
• a practical algorithm if inexact prox evaluations are used
• step size t_k > 0 affects the number of iterations
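A minimal sketch of PPM (not from the slides) for f(u) = (1/2)‖Au − b‖₂², where the prox subproblem reduces to a linear system; data and names are hypothetical:

```python
import numpy as np

def ppm_least_squares(A, b, x0, t=10.0, iters=100):
    """Proximal point method on f(u) = 0.5*||Au - b||^2 (prox has a closed form)."""
    n = A.shape[1]
    M = A.T @ A + np.eye(n) / t            # matrix of the prox subproblem
    x = x0.copy()
    for _ in range(iters):
        # x^k = prox_{t f}(x^{k-1}) = argmin_u f(u) + 1/(2t) ||u - x^{k-1}||^2
        x = np.linalg.solve(M, A.T @ b + x / t)
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5)); b = rng.standard_normal(20)
x = ppm_least_squares(A, b, x0=np.zeros(5))
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x - x_ls))            # ~ 0: PPM converges to the minimizer
```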
Convergence
assumptions
• f is closed and convex (hence prox_{tf}(x) is uniquely defined for all x)
• optimal value f* is finite and attained at x*
result
f(x^k) − f* ≤ ‖x^0 − x*‖₂² / (2 Σ_{i=1}^k t_i),   for k ≥ 1
• implies convergence if Σ_i t_i → ∞
• rate is 1/k if t_i is fixed or variable but bounded away from 0
• t_i is arbitrary; however, the number of iterations depends on t_i
Proof: apply analysis of proximal gradient method with g(x) = 0
• since g is zero, inequality (1) on page 7-4 holds for any t > 0
• from page 7-6, f(x^i) is nonincreasing and
t_i (f(x^i) − f*) ≤ (1/2) ( ‖x^{i−1} − x*‖₂² − ‖x^i − x*‖₂² )
• combine the inequalities for i = 1 to i = k to get
( Σ_{i=1}^k t_i ) (f(x^k) − f*) ≤ Σ_{i=1}^k t_i (f(x^i) − f*) ≤ (1/2) ‖x^0 − x*‖₂²
Accelerated proximal point algorithms
FISTA (take g(x) = 0 on p.7-20): choose x^0 = x^{−1} and for k ≥ 1
x^k = prox_{t_k f}( x^{k−1} + θ_k (1 − θ_{k−1})/θ_{k−1} · (x^{k−1} − x^{k−2}) )
Nesterov's 2nd method (p. 7-34): choose x^0 = v^0 and for k ≥ 1
v^k = prox_{(t_k/θ_k) f}(v^{k−1}),   x^k = (1 − θ_k) x^{k−1} + θ_k v^k
possible choices of parameters
• fixed steps: t_k = t and θ_k = 2/(k + 1)
• variable steps: choose any t_k > 0, θ_1 = 1, and for k > 1, solve θ_k from
(1 − θ_k) t_k / θ_k² = t_{k−1} / θ_{k−1}²
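A sketch of the fixed-step FISTA variant above (θ_k = 2/(k+1), constant t_k = t), reusing the closed-form prox of the least-squares example; names and data are hypothetical:

```python
import numpy as np

def prox_f(v, A, b, t):
    """prox_{t f}(v) for f(u) = 0.5*||Au - b||^2 (closed form)."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + np.eye(n) / t, A.T @ b + v / t)

def accelerated_ppm(A, b, x0, t=10.0, iters=50):
    """FISTA with g = 0 (accelerated PPM), fixed steps t_k = t, theta_k = 2/(k+1)."""
    x_prev, x = x0.copy(), x0.copy()       # x^{-1} = x^0
    theta_prev = 1.0
    for k in range(1, iters + 1):
        theta = 2.0 / (k + 1)
        # extrapolate, then take a prox step
        y = x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)
        x_prev, x = x, prox_f(y, A, b, t)
        theta_prev = theta
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 5)); b = rng.standard_normal(20)
x = accelerated_ppm(A, b, x0=np.zeros(5))
print(np.linalg.norm(A.T @ (A @ x - b)))   # gradient of f at x, small
```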
Convergence
assumptions
• f is closed and convex
• optimal value f* is finite and attained at x*
result
f(x^k) − f* ≤ 2‖x^0 − x*‖₂² / ( 2√t_1 + Σ_{i=2}^k √t_i )²,   for k ≥ 1
• implies convergence if Σ_i √t_i → ∞
• rate is 1/k² if t_i fixed or variable but bounded away from zero
Proof: follows from analysis in lecture 7 with g(x) = 0
• since g is zero, first inequalities on p.7-28 and p.7-36 hold for any t > 0
• therefore the conclusion on p.7-29 and p.7-37 holds:
f(x^k) − f* ≤ (θ_k² / (2t_k)) ‖x^0 − x*‖₂²
• for fixed step size t_k = t, θ_k = 2/(k + 1),
θ_k² / (2t_k) = 2 / ((k + 1)² t)
• for variable step size, we proved that
θ_k² / (2t_k) ≤ 2 / ( 2√t_1 + Σ_{i=2}^k √t_i )²
8.2. Augmented Lagrangian Method
primal problem
min f (x)
s.t. Ax = b
Lagrangian function
L(x; λ) = f(x) − λ^T(Ax − b)
dual problem
max_λ inf_x L(x; λ) = max_λ [ −f*(A^Tλ) + b^Tλ ]
optimality conditions: x, λ are optimal if
• x ∈ dom f, Ax = b
• A^Tλ ∈ ∂f(x)
augmented Lagrangian method: proximal point method applied
to dual problem
Proximal mapping of dual function
proximal mapping of h(λ) = f*(A^Tλ) − b^Tλ is
prox_{th}(λ) = argmin_u { f*(A^Tu) − b^Tu + (1/(2t)) ‖u − λ‖₂² }
dual expression: prox_{th}(λ) = λ − t(Ax̂ − b), where
x̂ = argmin_x { f(x) − λ^T(Ax − b) + (t/2) ‖Ax − b‖₂² }
i.e., x̂ minimizes the augmented Lagrangian function
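A small numerical check of this dual expression (not from the slides), using f(x) = (1/2)‖x − c‖₂², for which both prox_{th}(λ) and x̂ reduce to linear systems; the data below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, t = 3, 7, 2.0
A = rng.standard_normal((p, n)); b = rng.standard_normal(p)
c = rng.standard_normal(n);      lam = rng.standard_normal(p)

# f(x) = 0.5*||x - c||^2, so f*(y) = 0.5*||y||^2 + c^T y and h(u) = f*(A^T u) - b^T u.
# prox_{t h}(lam) then solves the p x p linear system (A A^T + I/t) u = b - A c + lam/t:
u = np.linalg.solve(A @ A.T + np.eye(p) / t, b - A @ c + lam / t)

# x_hat minimizes the augmented Lagrangian f(x) - lam^T(Ax - b) + t/2 ||Ax - b||^2:
x_hat = np.linalg.solve(np.eye(n) + t * (A.T @ A), c + A.T @ lam + t * A.T @ b)

# dual expression: prox_{t h}(lam) = lam - t (A x_hat - b)
print(np.linalg.norm(u - (lam - t * (A @ x_hat - b))))   # ~ 1e-15
```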
The 1st proof
min_u f*(A^Tu) − b^Tu + (1/(2t)) ‖u − λ‖₂²
⇔ min_u max_x ⟨x, A^Tu⟩ − f(x) − b^Tu + (1/(2t)) ‖u − λ‖₂²
⇔ max_x min_u ⟨x, A^Tu⟩ − f(x) − b^Tu + (1/(2t)) ‖u − λ‖₂²
The inner problem gives u = λ − t(Ax − b). Thus, the outer problem becomes
min_x f(x) − ⟨λ, Ax − b⟩ + (t/2) ‖Ax − b‖₂²
The 2nd proof
• write the augmented Lagrangian minimization as
min_{x,z} f(x) + (t/2) ‖z‖₂²
s.t. Ax − b − λ/t = z
• optimality conditions (u is the multiplier for the equality)
tz + u = 0,   Ax − b − λ/t = z,   A^Tu ∈ ∂f(x)
• eliminating x, z gives u = λ − t(Ax − b) and
0 ∈ A ∂f*(A^Tu) − b + (u − λ)/t
this is the optimality condition for u = prox_{th}(λ)
Augmented Lagrangian Method
choose initial λ^0 and repeat:
1. minimize the augmented Lagrangian function
x^{k+1} = argmin_x f(x) − ⟨λ^k, Ax − b⟩ + (t/2) ‖Ax − b‖₂²
2. update Lagrange multiplier (dual variable)
λ^{k+1} = λ^k − t(Ax^{k+1} − b)
• also known as method of multipliers, Bregman iteration
• proximal point method applied to the dual problem
• as variants, can apply the fast proximal point methods to the dual
• usually implemented with inexact minimization in step 1
Example: basis pursuit
A variant of ℓ1 minimization in compressed sensing and the Lasso.
min ‖x‖₁
s.t. Ax = b,
where A ∈ R^{m×n} and b ∈ R^m.
Augmented Lagrangian method:
1. minimize the augmented Lagrangian function
x^{k+1} = argmin_x ‖x‖₁ − ⟨λ^k, Ax − b⟩ + (t/2) ‖Ax − b‖₂²
2. update Lagrange multiplier (dual variable)
λ^{k+1} = λ^k − t(Ax^{k+1} − b)
• How to solve the problem in step 1?
• note that it is equivalent to
min_x ‖x‖₁ + (t/2) ‖Ax − b − λ^k/t‖₂²
• so one can use the proximal gradient method (ISTA) to solve it to a certain accuracy; a minimal sketch is given below
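A minimal sketch of this scheme (all data and names below are hypothetical): the ALM outer loop with a fixed number of ISTA iterations on the step-1 subproblem min_x ‖x‖₁ + (t/2)‖Ax − b − λ^k/t‖₂²:

```python
import numpy as np

def soft_threshold(v, s):
    """prox of s*||.||_1: componentwise shrinkage."""
    return np.sign(v) * np.maximum(np.abs(v) - s, 0.0)

def alm_basis_pursuit(A, b, t=1.0, outer=100, inner=50):
    """ALM for min ||x||_1 s.t. Ax = b, with ISTA on the step-1 subproblem."""
    m, n = A.shape
    x, lam = np.zeros(n), np.zeros(m)
    L = t * np.linalg.norm(A, 2) ** 2         # Lipschitz constant of the smooth part
    for _ in range(outer):
        d = b + lam / t                       # subproblem: min ||x||_1 + t/2 ||Ax - d||^2
        for _ in range(inner):                # inexact minimization by ISTA
            grad = t * A.T @ (A @ x - d)
            x = soft_threshold(x - grad / L, 1.0 / L)
        lam = lam - t * (A @ x - b)           # multiplier (dual) update
    return x

rng = np.random.default_rng(5)                # hypothetical sparse recovery instance
A = rng.standard_normal((30, 100))
x_true = np.zeros(100)
x_true[rng.choice(100, 5, replace=False)] = rng.standard_normal(5)
b = A @ x_true
x = alm_basis_pursuit(A, b)
print(np.linalg.norm(A @ x - b), np.linalg.norm(x - x_true))   # both should be small
```

In practice the inner accuracy is usually tightened as the outer iterations proceed; a fixed inner iteration count is used here only to keep the sketch short.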
ALM vs PGM
Three variants of the ℓ1 minimization problem:
1. ℓ1-norm regularized problem
min τ‖x‖₁ + (1/2) ‖Ax − b‖₂²
2. ℓ1-ball constrained problem
min ‖Ax − b‖₂²,   s.t. ‖x‖₁ ≤ η
3. basis pursuit problem
min ‖x‖₁,   s.t. Ax = b
solvers:
• use proximal gradient method to solve 1 (involves computing the prox operator of the ℓ1 norm)
• use gradient projection method to solve 2 (involves computing the projection onto the ℓ1 ball)
• use ALM to solve 3 (involves inexactly minimizing the augmented Lagrangian function)
advantages:
• the basis pursuit problem requires the equality Ax = b to hold (important for many applications)
• ALM converges such that Ax^k − b → 0
• the other two models cannot guarantee this
8.3. Moreau-Yosida smoothing
Moreau-Yosida regularization (Moreau envelope) of closed convex f is
(with t > 0)
f_t(x) = inf_u { f(u) + (1/(2t)) ‖u − x‖₂² }
       = f(prox_{tf}(x)) + (1/(2t)) ‖prox_{tf}(x) − x‖₂²
immediate properties
• f_t is convex (infimum over u of a convex function of (x, u))
• domain of f_t is R^n (recall that prox_{tf}(x) is defined for all x)
Examples
indicator function: smoothed f is squared Euclidean distance
f(x) = I_C(x),   f_t(x) = (1/(2t)) d(x)²   (d(x) is the Euclidean distance from x to C)
1-norm: smoothed function is Huber penalty
f(x) = ‖x‖₁,   f_t(x) = Σ_{k=1}^n φ_t(x_k)
where
φ_t(z) = z²/(2t)  if |z| ≤ t,   φ_t(z) = |z| − t/2  if |z| ≥ t
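A quick numerical check of the ℓ1 example (the function names are mine, not the slides'): evaluating the envelope through prox_{tf} (soft-thresholding) reproduces the Huber formula:

```python
import numpy as np

def soft_threshold(v, s):
    """prox_{s*||.||_1}(v)"""
    return np.sign(v) * np.maximum(np.abs(v) - s, 0.0)

def moreau_env_l1(x, t):
    """f_t(x) = f(prox_{tf}(x)) + 1/(2t) ||prox_{tf}(x) - x||^2 for f = ||.||_1."""
    p = soft_threshold(x, t)
    return np.sum(np.abs(p)) + np.sum((p - x) ** 2) / (2 * t)

def huber_sum(x, t):
    """Sum of scalar Huber penalties phi_t(x_k)."""
    return np.sum(np.where(np.abs(x) <= t, x ** 2 / (2 * t), np.abs(x) - t / 2))

x = np.array([-3.0, -0.4, 0.0, 0.7, 2.5]); t = 1.0
print(moreau_env_l1(x, t), huber_sum(x, t))   # identical values
```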
Conjugate of Moreau Envelope
f_t(x) = inf_u { f(u) + (1/(2t)) ‖u − x‖₂² }
• f_t can be equivalently written as
f_t(x) = inf_{u+v=x} { f(u) + (1/(2t)) ‖v‖₂² }
• calculus rule for conjugate functions: if f(x) = inf_{u+v=x} (g(u) + h(v)), then f*(y) = g*(y) + h*(y)
• thus,
(f_t)*(y) = f*(y) + (t/2) ‖y‖₂²
• hence, the conjugate is strongly convex with parameter t
Gradient of Moreau envelope
f_t(x) = sup_y { x^Ty − f*(y) − (t/2) ‖y‖₂² }
• maximizer in definition is unique and satisfies
x − ty ∈ ∂f*(y) ⇔ y ∈ ∂f(x − ty)
• maximizing y is the gradient of f_t:
∇f_t(x) = (1/t)(x − prox_{tf}(x)) = prox_{(1/t)f*}(x/t)
• gradient ∇f_t is Lipschitz continuous with constant 1/t (follows from the non-expansiveness of prox)
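Continuing the ℓ1 example, the two expressions for ∇f_t can be compared numerically: (x − prox_{tf}(x))/t should coincide with prox_{(1/t)f*}(x/t), which for f = ‖·‖₁ is the projection of x/t onto the unit ∞-norm ball (since f* is the indicator of that ball). A small sketch:

```python
import numpy as np

def soft_threshold(v, s):
    return np.sign(v) * np.maximum(np.abs(v) - s, 0.0)

x = np.array([-3.0, -0.4, 0.0, 0.7, 2.5]); t = 1.0
grad1 = (x - soft_threshold(x, t)) / t     # (x - prox_{tf}(x)) / t
grad2 = np.clip(x / t, -1.0, 1.0)          # prox_{(1/t)f*}(x/t): projection onto [-1, 1]^n
print(grad1, grad2)                        # the two expressions coincide
```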
Interpretation of proximal point algorithm
apply gradient method to minimize Moreau envelope
min_x f_t(x) = inf_u { f(u) + (1/(2t)) ‖u − x‖₂² }
this is an exact smooth reformulation of the problem of minimizing f(x):
• solution x is a minimizer of f
• f_t is differentiable with Lipschitz continuous gradient (L = 1/t)
gradient update: with fixed t_k = 1/L = t,
x^{k+1} = x^k − t ∇f_t(x^k) = prox_{tf}(x^k)
this is the proximal point update with constant step size t_k = t
Interpretation of ALM
x^{k+1} = argmin_x f(x) − ⟨λ^k, Ax − b⟩ + (t/2) ‖Ax − b‖₂²
λ^{k+1} = λ^k − t(Ax^{k+1} − b)
• dual problem is max_λ inf_x L(x; λ), i.e.,
min_λ − inf_x L(x; λ)
• dual function (to be minimized) is − inf_x L(x; λ)
• smoothed dual function is
f_t(λ) = inf_u { − inf_x L(x; u) + (1/(2t)) ‖u − λ‖₂² }
• gradient of f_t is
∇f_t(λ) = (λ − u)/t,
where u is the optimal solution of
max_u { inf_x L(x; u) − (1/(2t)) ‖u − λ‖₂² }
i.e.,
−(Ax − b) − (u − λ)/t = 0
and x minimizes
f(x) − ⟨u, Ax − b⟩ − (1/(2t)) ‖u − λ‖₂²
• Thus, with fixed t, the dual update of ALM is a gradient step applied to the smoothed dual
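Spelling out the last bullet: the optimal u above equals λ^k − t(Ax^{k+1} − b), where x^{k+1} minimizes the augmented Lagrangian at λ^k (this is the dual expression for prox_{th} derived earlier). Hence a gradient step of length t on the smoothed dual gives
λ^k − t ∇f_t(λ^k) = λ^k − (λ^k − u) = u = λ^k − t(Ax^{k+1} − b) = λ^{k+1},
which is precisely the multiplier update of ALM.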