Convex Optimization Lecture Notes (Incomplete)
[email protected]
January 21, 2015
Contents

1 Introduction
2 Convex Sets
  2.1 Types of Sets
    2.1.1 Affine Sets
    2.1.2 Convex Sets
    2.1.3 Cones
  2.2 Important Sets
    2.2.1 Norm Balls / Norm Cones
    2.2.2 Positive Semidefinite Cone
  2.3 Operations that Preserve Convexity
    2.3.1 Linear-fractional and Perspective Functions
  2.4 Generalized Inequalities
    2.4.1 Positive Semidefinite Cone
    2.4.2 Minimum and Minimal Elements
  2.5 Supporting Hyperplane Theorem
  2.6 Dual Cones
    2.6.1 Dual Generalized Inequality
    2.6.2 Dual Generalized Inequality and Minimum/Minimal Elements
3 Convex Functions
  3.1 Basic Properties and Definitions
    3.1.1 Extended-value Extensions
    3.1.2 Equivalent Conditions
    3.1.3 Examples
    3.1.4 Epigraph
    3.1.5 Jensen's Inequality
  3.2 Operations That Preserve Convexity
    3.2.1 Composition
  3.3 The Conjugate Function
  3.4 Quasiconvex Functions
    3.4.1 Properties
    3.4.2 Operations That Preserve Quasiconvexity
  3.5 Log-concavity and Log-convexity
4 Convex Optimization Problems
  4.1 Optimization Problems
    4.1.1 Feasibility Problem
    4.1.2 Transformations and Equivalent Problems
  4.2 Convex Optimization
    4.2.1 Optimality Conditions
    4.2.2 Equivalent Convex Problems
  4.3 Quasiconvex Optimization
    4.3.1 Solving Quasiconvex Optimization Problems
    4.3.2 Example: Convex over Concave
  4.4 Linear Optimization Problems
    4.4.1 Examples
    4.4.2 Linear-fractional Programming
  4.5 Quadratic Optimization
    4.5.1 Examples
    4.5.2 Second Order Cone Programming
    4.5.3 SOCP: Robust Linear Programming
    4.5.4 SOCP: Stochastic Inequality Constraints
  4.6 Geometric Programming
    4.6.1 Monomials and Posynomials
    4.6.2 Geometric Programming
    4.6.3 Convex Transformation
  4.7 Generalized Inequality Constraints
    4.7.1 Conic Form Problems
    4.7.2 SDP: Semidefinite Programming
  4.8 Vector Optimization
    4.8.1 Optimal Values
    4.8.2 Pareto Optimality
    4.8.3 Scalarization for Pareto Optimality
    4.8.4 Examples
5 Duality
  5.1 The Lagrangian And The Dual
    5.1.1 The Lagrangian
    5.1.2 The Lagrangian Dual Function
    5.1.3 Lagrangian Dual and Lower Bounds
    5.1.4 Intuitions Behind Lagrangian Dual
    5.1.5 LP example and Finite Dual Conditions
    5.1.6 Conjugate Functions and Lagrange Dual
  5.2 The Lagrange Dual Problem
    5.2.1 Duality Gap
    5.2.2 Strong Duality And Slater's Constraint Qualification
    5.2.3 Examples
  5.3 Geometric Interpretation
    5.3.1 Strong Duality of Convex Problems
  5.4 Optimality Conditions
    5.4.1 Certificate of Suboptimality and Stopping Criteria
    5.4.2 Complementary Slackness
    5.4.3 KKT Optimality Conditions
  5.5 Solving The Primal Problem via The Dual
  5.6 Sensitivity Analysis
    5.6.1 Global Sensitivity
    5.6.2 Local Sensitivity
  5.7 Examples and Reformulating Problems
    5.7.1 Introducing variables
    5.7.2 Making explicit constraints implicit
    5.7.3 Transforming the objective
  5.8 Generalized Inequalities
6 Approximation and Fitting
  6.1 Norm Approximation
    6.1.1 Examples
    6.1.2 Different Penalty Functions and Their Consequences
    6.1.3 Outliers and Robustness
    6.1.4 Least-norm Problems
  6.2 Regularized Approximations
    6.2.1 Bi-criterion Formulation
    6.2.2 Regularization
    6.2.3 ℓ1 Regularization
    6.2.4 Signal Reconstruction Problem
  6.3 Robust Approximation
    6.3.1 Stochastic Formulation
    6.3.2 Worst-case Robust Approximation
  6.4 Function Fitting
    6.4.1 Constraints
    6.4.2 Sparse Descriptions and Basis Pursuit
    6.4.3 Checking Model Consistency
7 Statistical Estimation
  7.1 Parametric Distribution Estimation
    7.1.1 Logistic Regression Example
    7.1.2 MAP Estimation
  7.2 Nonparametric Distribution Estimation
    7.2.1 Priors
    7.2.2 Objectives
  7.3 Optimal Detector Design And Hypothesis Testing
  7.4 Chebyshev and Chernoff Bounds
    7.4.1 Chebyshev Bounds
    7.4.2 Chernoff Bounds
  7.5 Experiment Design
    7.5.1 Further Modeling
8 Geometric Problems
  8.1 Point-to-Set Distance
    8.1.1 PCA Example
  8.2 Distance between Sets
  8.3 Euclidean Distance and Angle Problems
    8.3.1 Expressing Constraints in Terms of G
    8.3.2 Well-Condition Constraints
    8.3.3 Examples
  8.4 Extremal Volume Ellipsoids
    8.4.1 Lowner-John Ellipsoid
    8.4.2 Maximum Volume Inscribed Ellipsoid
    8.4.3 Affine Invariance
  8.5 Centering
    8.5.1 Chebyshev Center
    8.5.2 Maximum Volume Ellipsoid Center
    8.5.3 Analytic Center
  8.6 Classification
    8.6.1 Linear Discrimination
    8.6.2 Robust Linear Discrimination
    8.6.3 Nonlinear Discrimination
  8.7 Placement and Location
  8.8 Floor Planning
9 Numerical Linear Algebra Background
10 Unconstrained Optimization
  10.1 Unconstrained Minimization Problems
    10.1.1 Strong Convexity
    10.1.2 Condition Number of Sublevel Sets
  10.2 Descent Methods
  10.3 Gradient Descent
    10.3.1 Performance Analysis on Toy Problems
  10.4 Steepest Descent
    10.4.1 Steepest Descent With an ℓ1 Norm
    10.4.2 Performance and Choice of Norm
  10.5 Newton's Method
    10.5.1 The Newton Decrement
    10.5.2 Newton's Method
    10.5.3 Convergence Analysis
    10.5.4 Summary
  10.6 Self-Concordant Functions
1 Introduction

2 Convex Sets

This chapter introduces numerous convex sets.
2.1 Types of Sets
2.1.1 Affine Sets
A set is affine when, for any two points in the set, it contains the entire line through them. Examples:
• Hyperplanes
• Solution set of linear equation systems.
An affine combination is defined to be a linear combination where the coefficients sum up to 1. An affine
set is closed under affine combination.
2.1.2 Convex Sets
A set is convex when, for any two points in the set, it contains the line segment between them.
A convex combination is defined to be an affine combination with nonnegative coefficients. A convex set
is closed under convex combination.
All affine sets are convex.
2.1.3 Cones
A set is a cone when a nonnegative scalar multiple of any element belongs to the set. A cone feels like an
ice-cream cone; it starts at the origin and gets wider going away from the origin. It could be analogous
to a pie slice. Given a point in a cone, the cone contains the ray (half-line) originating from the origin
and passing through that point.
A conic combination is a linear combination with nonnegative coefficients. A convex cone is closed
under conic combinations.
2.2 Important Sets
• Hyperplane is affine.
• Halfspaces are convex.
• Polyhedra/polyhedron are intersections of halfspaces, so are convex.
• Euclidean balls are defined by the ℓ2 norm; they are convex. Ellipsoids generalize balls; they are related to balls
by a positive definite matrix P.
2.2.1 Norm Balls / Norm Cones
Given a norm, a norm ball is the analogue of the Euclidean ball with the ℓ2 norm replaced by that norm. A norm cone is a different beast; it is the set

C = {(x, t) | ‖x‖ ≤ t, t ≥ 0} ⊆ R^{n+1}

where x ∈ R^n. The norm cone with the ℓ2 norm is called the second-order cone.
2.2.2 Positive Semidefinite Cone
The following notations are used:
• S^n is the set of all symmetric n × n matrices.
• S^n_+ is the set of all positive semidefinite n × n matrices.
• S^n_{++} is the set of all positive definite n × n matrices.
Because

θ1, θ2 ≥ 0, A, B ∈ S^n_+ =⇒ θ1 A + θ2 B ∈ S^n_+

the set S^n_+ is a convex cone.
2.3 Operations that Preserve Convexity
Some operations preserve convexity of sets.
• Intersections
• Taking image under an affine function
2.3.1 Linear-fractional and Perspective Functions
The perspective function

P(z, t) = z/t

preserves convexity. (The domain requires t > 0.) Similarly, a linear-fractional function,

f(x) = (Ax + b) / (c^T x + d)

which is formed by composing the perspective function with an affine function, preserves convexity as
well.
2.4 Generalized Inequalities
A cone K can be used to define inequalities, if it meets certain criteria (it should be a proper cone). The definition is:

x ⪯_K y ⟺ y − x ∈ K

It certainly makes sense: since K is a convex cone (closed under addition), we get transitivity. Since K is pointed
(contains no line), we get antisymmetry. Since K contains the origin, we get reflexivity. This is a useful concept
that will be exploited later in the course a lot.
2.4.1 Positive Semidefinite Cone
Actually, using the positive semidefinite cone to compare matrices is a standard practice, so for the rest of
the book, matrix inequalities are implicitly taken with respect to the PSD cone.
2.4.2 Minimum and minimal Elements
Generalized inequalities do not always give you a single element that is the minimum; we sometimes get
a set of minimal elements, elements that no other element is smaller than, which are incomparable to each other.
2.5 Supporting Hyperplane Theorem
This has not proved to be useful in the course yet; when it is needed I will come back and fill it in...
2.6 Dual Cones

Let K be a cone. Then the dual cone is

K* = {y | x^T y ≥ 0 for all x ∈ K}

The idea of it is hard to express intuitively. When the cone is sharp, the dual cone will be obtuse, and
vice versa. Also, when K is a proper cone, K** = K.
2.6.1 Dual Generalized Inequality
The relation ⪯_{K*} is the generalized inequality induced by the dual cone K*. In some ways it can relate
to the original inequality ⪯_K. Most notably,

x ⪯_K y ⟺ y − x ∈ K ⟺ λ^T (y − x) ≥ 0 for all λ ∈ K* ⟺ λ^T x ≤ λ^T y for all λ ∈ K*

The takeaway of this is that we can use dual cones to compare values with respect to the original cone. See below for how this is used.
2.6.2 Dual Generalized Inequality and Minimum/Minimal Elements
Minimum Element
From the above, x ∈ S is the minimum element of S with respect to ⪯_K when it is the unique minimizer of λ^T z over z ∈ S for every λ ≻_{K*} 0. Geometrically, when x is the minimum element, the hyperplane
passing through x with λ as a normal vector, {z | λ^T (z − x) = 0}, is a strict supporting hyperplane; it touches
S only at x. To see this: take y ∈ S with y ≠ x. Since λ^T x < λ^T y by our assumption, we have λ^T (y − x) > 0.

Minimal Elements
Here, we have a gap between the sufficient and the necessary condition.

Sufficiency: If x minimizes λ^T z over z ∈ S for some λ ≻_{K*} 0, then x is a minimal element.

Necessity: Even if x is a minimal element, it is possible that x is not a minimizer of λ^T z over S for any such λ.

However, if S is convex, every minimal element does minimize λ^T z over S for some nonzero λ ⪰_{K*} 0.
3 Convex Functions

3.1 Basic Properties and Definitions
A function f is convex if the domain is convex, and for any x, y ∈ domf and θ ∈ [0, 1] we have
θ · f (x) + (1 − θ) · f (y) ≥ f (θx + (1 − θ) y)
Colloquially, you can say “the chord is above the curve”.
3.1.1 Extended-value Extensions
We can augment a function f:

f̃(x) = { f(x)   if x ∈ dom f
       { ∞      otherwise
which is called the extended-value extension of f . Care must be taken in ensuring the extension is still
convex/concave: the extension can break such properties.
3.1.2 Equivalent Conditions

Each of the following conditions, paired with the domain being convex, is equivalent to convexity.

First Order Condition

f(y) ≥ f(x) + ∇f(x)^T (y − x)

Geometrically, the tangent hyperplane that meets a convex function at any point is a global lower bound on the function over the entire domain. This has outstanding consequences; by
examining a single point, we get information about the entire function.

Second Order Condition

∇²f(x) ⪰ 0

which says the Hessian is positive semidefinite.
3.1.3 Examples
• ln x is concave on R++
• e^{ax} is convex on R
• x^a is convex on R++ if a ≥ 1, concave if 0 ≤ a ≤ 1
• |x|^a is convex on R for a ≥ 1
• Negative entropy: x log x is convex on R++
• Norms: every norm is convex on R^n
• Max functions: max{x1, x2, x3, ···} is convex on R^n
• Quadratic-over-linear: x²/y is convex on {y > 0}
• Log-sum-exp: as a "soft" approximation of the max function, f(x) = log(Σ_i e^{x_i}) is convex on R^n
• Geometric mean: (∏_{i=1}^n x_i)^{1/n} is concave on R^n_{++}
• Log-determinant: log det X is concave on positive definite X
3.1.4 Epigraph
The epigraph of a function f: R^n → R is the set of points "above" the graph. Naturally, it's a subset of R^{n+1}
and is sometimes a useful way to think about convex functions. A function is convex iff its epigraph is
convex!
3.1.5 Jensen’s Inequality
The definition of a convex function extends to multiple, even infinite, sums:

f(Σ_i θ_i x_i) ≤ Σ_i θ_i f(x_i)

as long as Σ_i θ_i = 1 and θ_i ≥ 0. This can be used to prove a swath of other inequalities.

3.2 Operations That Preserve Convexity
• Nonnegative weighted sums
• Composition with an affine mapping: if f is convex/concave, f (Ax + b) also is.
• Pointwise maximum/supremum: f(x) = max_i f_i(x) is convex if all f_i are convex. This can be used to
prove that:
  – Sum of the k largest elements is convex: it's the max of the (n choose k) possible sums.
• The maximum eigenvalue of a symmetric matrix is convex.
• Minimization:

g(x) = inf_{y∈C} f(x, y)

is convex if f is convex (and C is convex). You can prove this by using epigraphs – the epigraph of g is a projection of the epigraph of f.
• Perspective function
3.2.1 Composition
Setup: h: Rk → R and g: Rn → Rk . The composition f (x) = h (g (x)).
Then:
• f is convex:
– if h is convex and nondecreasing, and g is convex, or
– if h is convex and nonincreasing, and g is concave
• f is concave:
– if h is concave and nondecreasing, and g is concave, or
– if h is concave and nonincreasing, and g is convex
To summarize: if h is nondecreasing, and h and g have the same curvature (convexity/concavity), f follows. If
h is nonincreasing, and h and g have differing curvature, f follows h.
When determining nonincreasing/nondecreasing property of h, use its extended-value extension. Since
extended-value extension can break nonincreasing/nondecreasing properties, you might want to come up
with alternative definitions which are defined everywhere.
Composition Examples
• exp g (x) is convex if g is.
• log g (x) is concave if g is concave and positive.
• 1/g(x) is convex if g is positive and concave.
• g(x)^p is convex if p ≥ 1 and g is convex and nonnegative.
Vector Composition
If
f (x) = h (g1 (x) , g2 (x) , g3 (x) , · · · )
the above rules still hold, except:
• All the g_i need to have the same curvature.
• h needs to be nonincreasing/nondecreasing with respect to every input.
3.3 The Conjugate Function

Given f: R^n → R, the conjugate function f*: R^n → R is

f*(y) = sup_{x ∈ dom f} (y^T x − f(x))
It is closely related to Lagrange Duals. I will omit further materials on this.
3.4 Quasiconvex Functions
Quasiconvex functions are defined by having all their sublevel sets convex. Colloquially, they are "unimodal"
functions. Quasiconvex problems are solved by the bisection method plus solving feasibility problems. The
linear-fractional function

f(x) = (a^T x + b) / (c^T x + d)

is quasiconvex. (To prove it: set f(x) = α and work out the definition of the sublevel set.)
Remember how we solved this by bisection methods? :-)
3.4.1 Properties
• First-order condition: f(y) ≤ f(x) =⇒ ∇f(x)^T (y − x) ≤ 0. Note: the gradient defines a supporting
hyperplane. Since f is quasiconvex, all points y with f(y) ≤ f(x) must lie on one side of the hyperplane.
• Second-order condition (in one dimension): f''(x) ≥ 0 wherever f'(x) = 0.
3.4.2 Operations That Preserve Quasiconvexity
• Nonnegative weighted sum
• Composition: if h is nondecreasing and g is quasiconvex, then h (g (x)) is quasiconvex.
• Composition with affine/linear fractional transformation.
• Minimization along an axis: g(x) = min_y f(x, y) is quasiconvex if f is.
3.5 Log-concavity and Log-convexity
The function is log-concave or log-convex if its log is concave/convex.
• The pdf of a Gaussian distribution is log-concave.
• The gamma function is log-concave.
4 Convex Optimization Problems

4.1 Optimization Problems
A typical optimization problem formulation looks like:
minimize f_0(x)
subject to f_i(x) ≤ 0   (i = 1, ..., m)
           h_i(x) = 0   (i = 1, ..., p)
4.1.1 Feasibility Problem
There are cases where you want to find a single x which satisfies all equality and inequality constraints.
These are feasibility problems.
4.1.2 Transformations and Equivalent Problems
Each problem can have multiple representations which are the same in nature but expressed differently, and different expressions can have different properties.
• Nonzero equality constraints: move everything to the LHS.
• Minimization/maximization: flip signs.
• Transformation of objective/constraint functions: transforming objective functions through monotonic functions can yield an equivalent problem.
• Slack variables: f(x) ≤ 0 is replaced by f(x) + s = 0 and s ≥ 0.
• Swapping implicit/explicit constraints: an explicit constraint can be made implicit by using the extended-value
extension, and vice versa.
4.2 Convex Optimization

A convex optimization problem looks just like the above definition, but has a few differences:
• All f_i are convex.
• All h_i are affine. Thus, the set of equality constraints can be expressed as a_i^T x = b_i, i.e. Ax = b.
Note that convex optimization problems have no locally optimal points that are not global; all optima are global. Also,
another important property arises from this: the feasible set of a convex optimization problem is convex,
because it is an intersection of convex and affine sets – a sublevel set of a convex function is convex, and
the feasible set is an intersection of sublevel sets and an affine set.
Another important thing to note is that convexity is a property of the problem description; different
formulations can make a non-convex problem convex and vice versa. It's one of the most important points
of the course.
4.2.1 Optimality Conditions
If the objective function f_0 is differentiable, a feasible x is optimal if and only if for all feasible y we have

∇f_0(x)^T (y − x) ≥ 0

This follows from the first-order condition of a quasiconvex function (a convex function is always
quasiconvex).
Optimality conditions for some special (mostly trivial) cases of problems are discussed:
• Unconstrained problem: Set gradient to 0.
• Only equality constraints: We can derive the condition below from the general optimality condition described
above.

∇f_0(x) + A^T ν = 0   for some ν ∈ R^p

Here's a brief outline: for any feasible y, we need ∇f_0(x)^T (y − x) ≥ 0. Note that y − x ∈ N(A)
(the null space) because Ax = b = Ay =⇒ A(x − y) = 0. Since the inner product of ∇f_0(x) with every element of the subspace N(A) is nonnegative, it must be zero, so ∇f_0(x) ∈ N(A)^⊥ = R(A^T),
the last term being the column space of A^T. Now, we can let

∇f_0(x) = A^T(−ν)

for some ν and we are done.
• Minimization over the nonnegative orthant: for each i, we need to have

(∇f_0(x))_i = 0   if x_i > 0
(∇f_0(x))_i ≥ 0   if x_i = 0

This is both intuitive, and relates to a concept (KKT conditions) which is discussed later. If x_i is at
the boundary, the gradient can be positive along that axis; we can still be optimal because
decreasing x_i would make the point infeasible. Otherwise, we need a zero gradient component.
4.2.2 Equivalent Convex Problems
Just like what we discussed for general optimization problems, but now specifically for convex problems.
• Eliminating the affine constraint: Ax = b is equivalent to x = F z + x0 where the column space of F is
N(A) and x0 is a particular solution of Ax = b. So we can just put f_0(F z + x0) in place of f_0(x).
• “Uneliminating” affine constraint: going in the opposite direction; if we are dealing with f0 (Ai x + bi ),
let Ai x + bi = yi .
– On eliminating/uneliminating the affine constraint: on a naive view, eliminating the affine constraint
always seems like a good idea. However, it isn't so; it is usually better to keep the affine constraint,
and only do the elimination if it is immediately computationally advantageous. (This will be
discussed later in the course.. but I don't remember where.)
• Slack variables
• Epigraph form: minimize t subject to f0 (x) − t ≤ 0 is effectively minimizing f0 (x). This seems stupid,
but this gives us a convenient framework, because we can make objectives linear.
4.3 Quasiconvex Optimization
When the objective function is quasiconvex, the problem is called a quasiconvex optimization problem. The biggest
difference is that we may now have locally optimal points that are not global; quasiconvex functions are allowed to have "flat"
portions, which give rise to such local optima.
4.3.1 Solving Quasiconvex Optimization Problems

Quasiconvex optimization problems are solved by bisection: at each iteration we ask whether the sublevel set is empty for a given threshold, which we can answer by solving a convex feasibility problem.
4.3.2 Example: Convex over Concave
Say p(x) is convex and nonnegative, and q(x) is concave and positive. Then f(x) = p(x)/q(x) is quasiconvex! How do you know? Consider
the sublevel set (for t ≥ 0):

{x : f(x) ≤ t} = {x : p(x)/q(x) ≤ t} = {x : p(x) − t·q(x) ≤ 0}

and p(x) − t·q(x) is convex! So the sublevel sets are convex.
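Below is a minimal sketch of the bisection procedure from 4.3.1 applied to a convex-over-concave ratio, assuming the cvxpy package is available. The particular p, q, feasible set, and bracketing interval [lo, hi] are invented for illustration and are not from the lecture.

import cvxpy as cp
import numpy as np

np.random.seed(0)
n = 5
c = np.abs(np.random.randn(n)) + 1.0

x = cp.Variable(n)
p = cp.sum_squares(x) + 1          # p(x): convex and positive
q = c @ x                          # q(x): affine (hence concave), positive on the feasible set
feasible = [x >= 0.1, x <= 10]

lo, hi = 0.0, 100.0                # assume the optimal value lies in [lo, hi]
for _ in range(40):
    t = (lo + hi) / 2
    # Convex feasibility problem: is the sublevel set {x : p(x) - t*q(x) <= 0} nonempty?
    prob = cp.Problem(cp.Minimize(0), feasible + [p - t * q <= 0])
    prob.solve()
    if prob.status == cp.OPTIMAL:
        hi = t                     # nonempty: the optimal value is at most t
    else:
        lo = t                     # empty: the optimal value is larger than t
print("p(x)/q(x) is approximately", hi)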
4.4 Linear Optimization Problems
In an LP problem, the objective and constraints are all affine functions. LP algorithms are very, very advanced,
and all these problems are readily solvable on today's computers. It is a very mature technology.
4.4.1 Examples
• Chebyshev center of a polyhedron (a sketch in cvxpy follows this list): note that the ball of radius r centered at x_c lying inside the halfspace a_i^T x ≤ b_i can be represented as

‖u‖_2 ≤ r =⇒ a_i^T (x_c + u) ≤ b_i

Since

sup_{‖u‖_2 ≤ r} a_i^T u = r ‖a_i‖_2

we can rewrite the constraint as

a_i^T x_c + r ‖a_i‖_2 ≤ b_i

which is a linear constraint on x_c and r. Therefore, having this inequality constraint for all sides of
the polyhedron gives an LP problem.
• Piecewise-linear minimization. Minimize:

max_i (a_i^T x + b_i)

This is equivalent to the LP: minimize t subject to a_i^T x + b_i ≤ t for all i. This can be a quick, dirty, cheap way to
solve convex optimization problems.
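A minimal sketch of the Chebyshev-center LP, assuming cvxpy; the polyhedron data A and b below are invented example values.

import cvxpy as cp
import numpy as np

# Halfspace data a_i^T x <= b_i (rows of A are the a_i)
A = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0,  0.0],
              [ 0.0, -1.0],
              [ 1.0,  1.0]])
b = np.array([1.0, 1.0, 1.0, 1.0, 1.5])

xc = cp.Variable(2)   # center of the ball
r  = cp.Variable()    # radius

# a_i^T xc + r * ||a_i||_2 <= b_i  -- linear in (xc, r)
constraints = [A[i] @ xc + r * np.linalg.norm(A[i]) <= b[i] for i in range(len(b))]
prob = cp.Problem(cp.Maximize(r), constraints)
prob.solve()
print("Chebyshev center:", xc.value, "radius:", r.value)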
4.4.2 Linear-fractional Programming
If the objective is linear-fractional while the constraints are affine, it becomes an LFP problem. This is a
quasiconvex problem, but it can also be translated into an LP. I will skip the formulation here.
4.5 Quadratic Optimization
QP is a special kind of convex optimization where the objective is a convex quadratic function, and the
constraint functions are affine:

minimize (1/2) x^T P x + q^T x + r
subject to Gx ⪯ h
           Ax = b

with P ∈ S^n_+. When the inequality constraints are quadratic as well, it becomes a QCQP (Quadratically Constrained Quadratic Programming) problem.
4.5.1 Examples
• Least squares: needs no more introduction. When linear inequality constraints are added, it is no
longer analytically solvable, but still is very tractable.
• Isotonic regression: we add the constraint x_1 ≤ x_2 ≤ ··· ≤ x_n to a least-squares problem.
This is still very easy as a QP!
• Distance between polyhedra: minimizing the Euclidean distance is a QP problem, and the constraints
(the two polyhedra) are convex.
• Classic Markowitz portfolio optimization (a sketch follows this list):
  – Given an expected return vector p̄ and the covariance matrix Σ, find the minimum-variance
portfolio with expected return greater than or equal to r_min. This is trivially representable as a
QP.
  – Many extensions are possible; allow short positions, transaction costs, etc.
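A minimal sketch of the Markowitz QP, assuming cvxpy; the expected returns, covariance matrix, and r_min are invented example data.

import cvxpy as cp
import numpy as np

np.random.seed(0)
n = 4
p_bar = np.array([0.10, 0.08, 0.05, 0.12])        # expected returns
F = np.random.randn(n, n)
Sigma = F @ F.T / n + 0.01 * np.eye(n)            # a positive definite covariance
r_min = 0.08

w = cp.Variable(n)                                # portfolio weights
prob = cp.Problem(cp.Minimize(cp.quad_form(w, Sigma)),   # portfolio variance
                  [p_bar @ w >= r_min,            # expected return constraint
                   cp.sum(w) == 1,
                   w >= 0])                       # long-only (no short positions)
prob.solve()
print("weights:", w.value, "variance:", prob.value)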
4.5.2 Second Order Cone Programming
SOCP is closely related to QP. It has a linear objective, but second-order cone inequality constraints:

minimize f^T x
subject to ‖A_i x + b_i‖_2 ≤ c_i^T x + d_i   (i = 1, ..., m)
           Fx = g

The i-th inequality constraint forces the tuple (A_i x + b_i, c_i^T x + d_i) to lie in the second-order cone in R^{n_i + 1}.
When c_i = 0 for all i, the constraints become regular quadratic constraints and this becomes a QCQP. So basically, we are
using second-order cones instead of the polyhedron of an LP.
Note that the linear objective does not make SOCP weaker than QCQP: you can always minimize t subject to f_0(x) ≤ t.
4.5.3 SOCP: Robust Linear Programming
Suppose we have an LP

minimize c^T x
subject to a_i^T x ≤ b_i

but the numbers given in the problem could be inaccurate. As an example, let's assume that the
true value of a_i can lie in an ellipsoid defined by P_i, centered at the given value:

a_i ∈ E_i = { ā_i + P_i u : ‖u‖_2 ≤ 1 }

and the other values (c and b_i) are fixed. We want the inequalities to hold for all possible values of a_i. The
inequality constraint can be cast as

sup_{‖u‖_2 ≤ 1} (ā_i + P_i u)^T x = ā_i^T x + ‖P_i^T x‖_2 ≤ b_i

which is actually an SOCP constraint. Note that the additional norm term ‖P_i^T x‖_2 acts as a regularization
term; it prevents x from being large in directions with considerable uncertainty in the parameters a_i.
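A minimal sketch of the robust LP as an SOCP, assuming cvxpy; the nominal data ā_i, b, c and the ellipsoid shapes P_i are invented, and a box constraint is added just to keep the toy problem bounded.

import cvxpy as cp
import numpy as np

np.random.seed(0)
m, n = 4, 2
c = np.array([1.0, 2.0])
a_bar = np.random.randn(m, n)
b = np.abs(np.random.randn(m)) + 1.0
P = [0.1 * np.random.randn(n, n) for _ in range(m)]   # ellipsoid shape for each a_i

x = cp.Variable(n)
# Worst-case version of a_i^T x <= b_i:  a_bar_i^T x + ||P_i^T x||_2 <= b_i
robust = [a_bar[i] @ x + cp.norm(P[i].T @ x, 2) <= b[i] for i in range(m)]
box = [x >= -5, x <= 5]            # keeps this invented instance bounded
prob = cp.Problem(cp.Minimize(c @ x), robust + box)
prob.solve()
print("robust solution:", x.value)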
4.5.4 SOCP: Stochastic Inequality Constraints
When the a_i are normally distributed vectors with mean ā_i and covariance matrix Σ_i, the constraint

P(a_i^T x ≤ b_i) ≥ η

says that the linear inequality will hold with probability at least η. This can be cast as an SOCP
constraint as well. For fixed x, we have a_i^T x ∼ N(ū_i, σ²) with ū_i = ā_i^T x and σ² = x^T Σ_i x. Then

P(a_i^T x ≤ b_i) = P((u_i − ū_i)/σ ≤ (b_i − ū_i)/σ) = P(Z ≤ (b_i − ū_i)/σ) ≥ η
  ⟺ Φ((b_i − ū_i)/σ) ≥ η ⟺ (b_i − ū_i)/σ ≥ Φ^{-1}(η)

The last condition can be rephrased as

ā_i^T x + Φ^{-1}(η) ‖Σ_i^{1/2} x‖_2 ≤ b_i

which is an SOCP constraint (for η ≥ 1/2, so that Φ^{-1}(η) ≥ 0).
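A minimal sketch of the chance constraint written as an SOCP constraint, assuming cvxpy and scipy are available; all data are invented.

import cvxpy as cp
import numpy as np
from scipy.stats import norm
from scipy.linalg import sqrtm

np.random.seed(0)
n = 2
a_bar = np.array([1.0, 1.0])
Sigma = np.array([[0.2, 0.05],
                  [0.05, 0.1]])
b, eta = 4.0, 0.95
c = np.array([-1.0, -2.0])

x = cp.Variable(n)
Sigma_half = np.real(sqrtm(Sigma))
# a_bar^T x + Phi^{-1}(eta) * ||Sigma^{1/2} x||_2 <= b
chance = a_bar @ x + norm.ppf(eta) * cp.norm(Sigma_half @ x, 2) <= b
prob = cp.Problem(cp.Minimize(c @ x), [chance, x >= 0])
prob.solve()
print("x:", x.value)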
4.6 Geometric Programming
Geometric programming problems involve products of powers of variables, not weighted sums of variables.
4.6.1 Monomials and Posynomials
A monomial function f is a product of powers of variables in the form

f(x) = c x_1^{a_1} x_2^{a_2} ··· x_n^{a_n}

where c > 0. A sum of monomials is called a posynomial, which looks like

f(x) = Σ_{k=1}^K c_k x_1^{a_{1k}} x_2^{a_{2k}} ··· x_n^{a_{nk}}
4.6.2 Geometric Programming
A GP problem looks like:

minimize f_0(x)
subject to f_i(x) ≤ 1   (i = 1, 2, 3, ..., m)
           h_i(x) = 1   (i = 1, 2, 3, ..., p)

where the f_i are posynomials and the h_i are monomials. The domain of this problem is R^n_{++}.
4.6.3 Convex Transformation
GP problems are not convex in general, but a change of variables turns a GP into a convex optimization
problem. Letting

y_i = log x_i ⟺ x_i = e^{y_i}

turns a monomial f(x) into

f(x_1, x_2, x_3, ···) = f(e^{y_1}, e^{y_2}, e^{y_3}, ···) = c · e^{a_1 y_1} · e^{a_2 y_2} · e^{a_3 y_3} ··· = exp(a^T y + b)

where b = log c, which is an exponential of an affine function. Similarly, a posynomial is converted into a sum
of exponentials of affine functions. Now, take the log of the objective and the constraints: the posynomials
turn into log-sum-exp functions (which are convex), and the monomials become affine. Thus, this is our regular
convex problem now.
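A minimal sketch of this transformation, assuming cvxpy. The tiny GP (minimize 1/(x1·x2) subject to x1 + 2·x2 ≤ 1, x ≻ 0) is an invented example, written directly in the transformed variables y = log x.

import cvxpy as cp
import numpy as np

y = cp.Variable(2)                         # y = log x

# Objective monomial 1/(x1*x2): exponents (-1, -1), coefficient 1
A0 = np.array([[-1.0, -1.0]])
b0 = np.log(np.array([1.0]))
objective = cp.log_sum_exp(A0 @ y + b0)    # a single-term "sum"

# Constraint posynomial x1 + 2*x2 <= 1 becomes log-sum-exp(...) <= 0
A1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])
b1 = np.log(np.array([1.0, 2.0]))
constraints = [cp.log_sum_exp(A1 @ y + b1) <= 0]

cp.Problem(cp.Minimize(objective), constraints).solve()
print("x =", np.exp(y.value))              # expect roughly [0.5, 0.25]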
4.7 Generalized Inequality Constraints
4.7.1 Conic Form Problems
A conic form problem is a generalization of LP, replacing the componentwise inequality with a generalized
inequality with respect to a cone K:

minimize c^T x
subject to Fx + g ⪯_K 0
           Ax = b

The SOCP can be expressed as a conic form problem if we set each K_i to be the second-order cone in R^{n_i + 1}:

minimize c^T x
subject to −(A_i x + b_i, c_i^T x + d_i) ⪯_{K_i} 0   (i = 1, ..., m)
           Fx = g

which is where the name SOCP comes from.
4.7.2 SDP: Semidefinite Programming
A special form of conic program, where K is S^n_+, the cone of positive semidefinite matrices, is called
an SDP. It has the form:

minimize c^T x
subject to x_1 F_1 + x_2 F_2 + ··· + x_n F_n ⪯ 0
           Ax = b
4.8 Vector Optimization
We can generalize the regular convex optimization by letting the objective function take vector values;
we can now use proper cones and generalized inequalities to find the best vector value. These are called
vector optimization problems.
4.8.1 Optimal Values
When a point x* is better than or equal to every other point in the domain of the problem, x* is called
optimal. In a vector optimization problem, if an optimal value exists, it is unique. (Why? Vector optimization
requires a proper cone; proper cones are pointed – they do not contain lines. But if f_0(x_1) and f_0(x_2) are
both optimal, then p = f_0(x_1) − f_0(x_2) and −p are both in the cone, which forces p = 0 by pointedness.)
4.8.2 Pareto Optimality
In many problems we do not have a minimum achievable value, but a set of minimal values, which are
incomparable to each other. A point x ∈ D is Pareto-optimal when, for every feasible y, f_0(y) ⪯_K f_0(x) implies
f_0(y) = f_0(x). Note that there can be multiple Pareto-optimal points with the same objective value. Also note that every
Pareto-optimal value has to lie on the boundary of the set of achievable values.
4.8.3 Scalarization for Pareto Optimality
A standard technique for finding Pareto-optimal points is to scalarize the vector objective by taking a weighted
sum. This can be explained in terms of dual generalized inequalities. Pick any λ ≻_{K*} 0 and solve the following
problem:

minimize λ^T f_0(x)
subject to f_i(x) ≤ 0
           h_i(x) = 0

By what we discussed in 2.6.2, any minimizer of this scalarized objective with λ ≻_{K*} 0 is a Pareto-optimal point,
and each such λ will typically give us a different Pareto point.
Now what happens when the problem is convex? Then (essentially) every Pareto-optimal point can be obtained this way
by varying λ ⪰_{K*} 0. Note, however, that λ ⪰_{K*} 0 (rather than ≻) does not give us the earlier guarantee: some
minimizers might not be Pareto-optimal.
4.8.4 Examples
• Regularized linear regression tries to minimize the RMSE and the norm of the coefficients at the same time,
so we optimize ‖Ax − b‖² + λ x^T x. Changing λ lets us explore the Pareto-optimal points.
5 Duality

This chapter explores many important ideas. Duality is introduced and used as a tool to derive optimality
conditions. KKT conditions are explained.

5.1 The Lagrangian And The Dual
5.1.1 The Lagrangian
The Lagrangian L associated with an optimization problem in the standard form

minimize f_0(x)
subject to f_i(x) ≤ 0   (i = 1, ..., m)
           h_i(x) = 0   (i = 1, ..., p)

is a function taking x and the weights as input, and returning a weighted sum of the objective and
constraints:

L(x, λ, ν) = f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{i=1}^p ν_i h_i(x)
So positive values of fi (x) are going to penalize the objective function. The weights are called dual
variables or Lagrangian multiplier vectors.
5.1.2 The Lagrangian Dual Function
The Lagrangian dual function takes λ and ν, and minimizes L over all possible x:

g(λ, ν) = inf_{x∈D} L(x, λ, ν) = inf_{x∈D} ( f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{i=1}^p ν_i h_i(x) )
5.1.3 Lagrangian Dual and Lower Bounds
It is easy to see that for any elementwise nonnegative λ, the Lagrangian dual function provides a lower bound
on the optimal value p* of the original problem. If x̃ is any feasible point, then the f_i(x̃) are nonpositive and the
h_i(x̃) are zero, so L(x̃, λ, ν) ≤ f_0(x̃). In particular, taking x̃ to be a feasible optimal point x*,

g(λ, ν) = inf_{x∈D} L(x, λ, ν) ≤ L(x*, λ, ν) ≤ f_0(x*) = p*
5.1.4 Intuitions Behind Lagrangian Dual
An alternative way to express constraints is to introduce indicator functions in the objective:

minimize f_0(x) + Σ_{i=1}^m I_−(f_i(x)) + Σ_{i=1}^p I_0(h_i(x))
the indicator functions will have a value of 0 when the constraint is met, ∞ otherwise. Now, these
represent how much you are irritated by a violated constraint. We can replace them with a linear function
- just a different set of preferences. Instead of hard constraints, we are imposing soft constraints.
5.1.5 LP example and Finite Dual Conditions
A linear program in standard form (minimize c^T x subject to Ax = b, x ⪰ 0) has the Lagrange dual function

g(λ, ν) = inf_x L(x, λ, ν) = −b^T ν + inf_x (c + A^T ν − λ)^T x

The infimum can be found analytically, since the expression is an affine function of x. Whenever any element of
c + A^T ν − λ is nonzero, we can manipulate x to drive the value to −∞. So the dual function is finite only where
c + A^T ν − λ = 0, which is a surprisingly common occurrence.
5.1.6 Conjugate Functions and Lagrange Dual
The two functions are closely related, and Lagrangian dual can be expressed in terms of the conjugate
function of the objective function, which makes duals easier to derive if the conjugate is readily known.
5.2 The Lagrange Dual Problem
There's one more thing named after Lagrange: the dual problem. The Lagrange dual problem is the optimization problem

maximize g(λ, ν)
subject to λ ⪰ 0

A pair (λ, ν) is called dual feasible if it is a feasible point of this problem. The solution of this problem,
(λ*, ν*), is called dual optimal or the optimal Lagrange multipliers. The dual problem is always convex, whether
or not the primal problem is convex. Why? g is a pointwise infimum of affine functions of λ and ν, hence concave,
and we are maximizing it.
Note that the Lagrange dual function for many problems is finite only on a subset of its domain. We can
bake this restriction into the dual problem explicitly, as constraints.
5.2.1 Duality Gap
The Lagrangian dual's optimal value d* is related to the optimal value p* of the primal problem, notably:

d* ≤ p*

regardless of whether the original problem is convex. This inequality is called weak duality, and the difference
p* − d* is called the duality gap. Duality can therefore be used to provide a lower bound for the primal problem.
5.2.2 Strong Duality And Slater’s Constraint Qualification
When the duality gap is 0, we say strong duality holds for the problem. That means the lower bound
obtained from the dual equals the optimal value of the primal; therefore solving the dual is (sort
of) the same as solving the primal. For obvious reasons, strong duality is very desirable, but it doesn't hold in
general. For convex problems, however, we usually (but not always) have strong duality.
Given a convex problem, how do we know if strong duality holds? There are many constraint qualifications which
ensure strong duality when they are satisfied. The text discusses one such qualification: Slater's
constraint qualification. The condition is quite simple: if there exists x ∈ relint D such that all inequality
constraints hold strictly, we have strong duality. Put another way:

f_i(x) < 0 (i = 1, ..., m),  Ax = b

Also, it is noted that affine inequality constraints are allowed to hold non-strictly.
5.2.3 Examples
• Least-squares: since there are no inequality constraints, Slater's condition reduces to feasibility; so
as long as the primal problem is feasible, strong duality holds.
• QCQP: The Lagrangian is a quadratic form. When all λ_i are nonnegative, we have a positive semidefinite form and we can minimize over x analytically.
• Nonconvex example: minimizing a nonconvex quadratic function over the unit ball has strong duality.
5.3 Geometric Interpretation
This section introduces some ways to think about Lagrange dual functions, which offer some intuition
about why Slater’s condition works, and why most convex problems have strong duality.
5.3.1 Strong Duality of Convex Problems
Let's try to explain figures 5.3 to 5.5 from the book. Consider the following set G:

G = {(f_1(x), ..., f_m(x), h_1(x), ..., h_p(x), f_0(x)) | x ∈ D} = {(u, v, t)}

where u_i = f_i(x), v_i = h_i(x) and t = f_0(x). Now the Lagrangian of this problem,

L(λ, ν, x) = Σ_i λ_i u_i + Σ_i ν_i v_i + t

can be interpreted via a hyperplane passing through the point (u, v, t) with normal vector (λ, ν, 1) (there is some
notation abuse here, since λ and ν are themselves vectors): that hyperplane meets the t-axis at the value of the
Lagrangian. (See figure 5.3 from the book.)
Now, the Lagrange dual function

g(λ, ν) = inf_{x∈D} L(λ, ν, x)

will find x on the border of D: intuitively, the Lagrangian can still be decreased by wiggling x if x ∈
relint D. Therefore, the value of the Lagrange dual function can be interpreted as where a supporting hyperplane
of G with normal vector (λ, ν, 1) hits the t-axis.
Next, we solve the Lagrange dual problem, which maximizes the position where the hyperplane hits the
t-axis. Can we hit p*, the optimal value? When G is convex, the feasible portion of G (i.e. u ⪯ 0 and v = 0)
is convex again, and we can find a supporting hyperplane that meets G at the optimal point! But when G
is not convex, p* can hide in a "nook" inside G and the supporting hyperplane might not reach p* at all.
5.4 Optimality Conditions
5.4.1 Certificate of Suboptimality and Stopping Criteria
We know, without assuming strong duality,

g(λ, ν) ≤ p* ≤ f_0(x)

for any primal feasible x and dual feasible (λ, ν). Now, f_0(x) − g(λ, ν) gives an upper bound on f_0(x) − p*,
the quantity which shows how suboptimal x is. This gives us a stopping criterion for iterative algorithms: when
f_0(x) − g(λ, ν) ≤ ε, it is a certificate that x is at most ε-suboptimal. The quantity will never drop below the
duality gap, so if you want this to work for arbitrarily small ε we need strong duality.
5.4.2 Complementary Slackness
Suppose the primal and dual optimal values are attained and equal. Then,

f_0(x*) = g(λ*, ν*)                                              (assumed zero duality gap)
        = inf_x ( f_0(x) + Σ_i λ*_i f_i(x) + Σ_i ν*_i h_i(x) )   (definition of the Lagrangian dual function)
        ≤ f_0(x*) + Σ_i λ*_i f_i(x*) + Σ_i ν*_i h_i(x*)          (the infimum is ≤ the value at any particular x)
        ≤ f_0(x*)                                                (λ*_i are nonnegative, f_i values are nonpositive, h_i values are 0)

So all inequalities can be replaced by equalities! In particular, it means two things. First, x* minimizes
the Lagrangian L(x, λ*, ν*). Next,

Σ_i λ*_i f_i(x*) = 0

Since each term in this sum is nonpositive, we can conclude all terms are 0: so for each i ∈ [1, m] we have

either λ*_i = 0 or f_i(x*) = 0

This condition is called complementary slackness.
5.4.3 KKT Optimality Conditions
KKT is a set of conditions on a tuple (x*, λ*, ν*) of primal and dual feasible points. It is a necessary condition for x* and (λ*, ν*) being optimal points for their respective problems with zero duality gap; that is,
all such optimal points must satisfy these conditions.
The KKT conditions are:
• x* is primal feasible: f_i(x*) ≤ 0 for all i, h_i(x*) = 0 for all i.
• (λ*, ν*) is dual feasible: λ*_i ≥ 0.
• Complementary slackness: λ*_i f_i(x*) = 0.
• The gradient of the Lagrangian vanishes: ∇f_0(x*) + Σ_i λ*_i ∇f_i(x*) + Σ_i ν*_i ∇h_i(x*) = 0.
Note the last condition is something we didn't see before. It makes intuitive sense though: x* must minimize the
Lagrangian L(x, λ*, ν*). When the primal problem is convex, the Lagrangian (with λ* ⪰ 0) is convex in x, and a
point with zero gradient is a minimizer.
KKT and Convex Problems
When the primal problem is convex, the KKT conditions are sufficient for optimality (and, under a constraint qualification such as Slater's, necessary as well). This has immense importance. We can frame solving convex optimization problems as solving the KKT conditions. Sometimes the KKT conditions might be solvable analytically, giving us a closed-form solution for the optimization
problem.
The text also mentions that when Slater's condition is satisfied for a convex problem, we can say that an
arbitrary x is primal optimal iff there are (λ, ν) that satisfy KKT along with x. I was initially unsure why
Slater's condition is needed for this claim; the reason is that Slater's condition guarantees strong duality and
that the dual optimum is attained, which is what makes the KKT conditions necessary.
5.5 Solving The Primal Problem via The Dual
When we have strong duality and the dual problem is easier to solve (due to some exploitable structure or an
analytical solution), one might solve the dual first to find the dual optimal point (λ*, ν*), and then find the x
that minimizes the Lagrangian L(x, λ*, ν*). If this x is primal feasible, we have a solution! Otherwise, what do we do? If
the Lagrangian is strictly convex in x, then there is a unique minimizer; if this minimizer is infeasible, then we
conclude the primal optimum is not attained.
5.6 Sensitivity Analysis
The Lagrange multipliers of the dual problem can be used to infer the sensitivity of the optimal value
with respect to perturbations of the constraints. What kind of perturbations? We can tighten or relax the
constraints of an arbitrary optimization problem by changing them to:

f_i(x) ≤ u_i   (i = 1, 2, ..., m)
h_i(x) = v_i   (i = 1, 2, ..., p)

Letting u_i > 0 means we have more freedom regarding the value of f_i; u_i < 0 tightens the constraint.
5.6.1 Global Sensitivity
The Lagrange multipliers will give you information about how the optimal value changes when we do
this. Let's denote the optimal value of the perturbed problem as a function of u and v: p*(u, v). When strong
duality holds, we have the lower bound

p*(u, v) ≥ p*(0, 0) − λ*^T u − ν*^T v

which can be obtained by manipulating the definitions.
Using this lower bound, we can make some inferences about how the optimal value changes with respect to u and
v. Basically, when the lower bound increases greatly, we infer that the optimal value increases
greatly. However, when the lower bound decreases, we don't have such an assurance. (Actually I'm a bit curious
about the first inference as well: an increasing lower bound might not increase the optimal value if the optimal
value was well above the lower bound to begin with.) Examples:
• When λ*_i is large and we tighten the constraint (u_i < 0), this increases the lower bound a lot; the
optimal value will increase greatly.
• When λ*_i is small and we loosen the constraint (u_i > 0), this decreases the lower bound a bit,
but it might not decrease the optimal value by much.
5.6.2 Local Sensitivity
The text shows an interesting identity (when strong duality holds and p*(u, v) is differentiable at (0, 0)):

λ*_i = − ∂p*(0, 0) / ∂u_i

Now λ*_i gives you the slope of the optimal value with respect to the particular constraint. All this, along
with complementary slackness, can be used to interpret Lagrange multipliers: they tell you how "tight" a
given inequality constraint is. Suppose we found λ*_1 = 0.1 and λ*_2 = 100 after solving a problem. By
complementary slackness, we know f_1(x*) = f_2(x*) = 0 and both constraints are tight. However, if we
decrease u_2, we know p* will move much more abruptly, because of the slope interpretation above. On
the other hand, what happens when we increase u_2? Locally we know p* will start to decrease fast; but this
doesn't tell us how it will behave as we keep increasing u_2.
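A minimal numerical illustration of this identity, assuming cvxpy. The problem (a least-squares objective with a norm-ball constraint that is assumed to be active at the optimum) and all data are invented; the multiplier reported by the solver should roughly match a finite-difference estimate of −∂p*/∂u.

import cvxpy as cp
import numpy as np

np.random.seed(0)
A = np.random.randn(5, 3)
b = 5 * np.random.randn(5)

def solve(u):
    x = cp.Variable(3)
    constr = [cp.sum_squares(x) <= 1 + u]          # perturbed constraint f_1(x) <= u
    prob = cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b)), constr)
    prob.solve()
    return prob.value, constr[0].dual_value

p0, lam = solve(0.0)
eps = 1e-4
p_eps, _ = solve(eps)
print("lambda* from the solver:  ", lam)
print("-(p*(eps) - p*(0)) / eps: ", -(p_eps - p0) / eps)   # should roughly match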
5.7 Examples and Reformulating Problems
As different formulations can change convex problems to non-convex and vice versa, dual problems are
affected by how the problem is exactly formulated. Because of this, a problem that looks unnecessarily
complicated might end up being a better representation. The text gives some examples of this.
5.7.1 Introducing variables
5.7.2 Making explicit constraints implicit
5.7.3 Transforming the objective
5.8 Generalized Inequalities
How does the idea of the Lagrangian dual extend to problems with vector inequalities? Well, it generalizes
pretty well - we can define everything pretty similarly, except that the nonnegativity restriction on λ becomes
nonnegativity with respect to the dual cone. Here is some intuition behind this difference. Say we
have the following problem:

minimize f_0(x)
s.t.  f_i(x) ⪯_{K_i} 0   (i = 1, 2, ..., m)
      h_i(x) = 0          (i = 1, 2, ..., p)

Now each Lagrange multiplier λ_i is vector valued. The Lagrangian becomes:

f_0(x) + Σ_i λ_i^T f_i(x) + Σ_i ν_i h_i(x)

That f_i(x) is nonpositive with respect to K_i means that −f_i(x) ∈ K_i. Remember we want each product
λ_i^T f_i(x) to be nonpositive - otherwise this dual won't be a lower bound anymore. So we try to find the
set of λ_i that makes λ_i^T y nonpositive for all −y ∈ K_i, i.e. λ_i^T y nonnegative for all y ∈ K_i. What is
this set? The dual cone K_i*.
The dual of the SDP is also given as an example. The actual derivation involves more linear algebra than I
am comfortable with (shameful) so I'm skipping things here.
6 Approximation and Fitting

With this chapter begins part II of the book on applications of convex optimization. Hopefully, I will be
less confused/frustrated by materials in this part. :-)

6.1 Norm Approximation
This section discusses various forms of the linear approximation problem

min_x ‖Ax − b‖

with different norms and constraints. Without doubt, this is one of the most important optimization
problems.
6.1.1 Examples
• ℓ2 norm: we get least squares.
• ℓ∞ norm: the Chebyshev approximation problem. Reduces to an LP which is as easy as least squares, but
no one discusses it!
• ℓ1 norm: sum of absolute residuals. Also reduces to an LP; extremely interesting, as we will discuss
further in this chapter.
6.1.2 Different Penalty Functions and Their Consequences
The shape of the norm used in the approximation affects the results tremendously. The most common
norms are the ℓp norms: given a residual vector r,

‖r‖_p = ( Σ_i |r_i|^p )^{1/p}

We can ignore the powering by 1/p and just minimize the base of the exponentiation. Now we can think
of ℓp norms as giving separate penalties to each component of the residual vector. Note that most norms do the
same - so we can think in terms of a penalty function φ(r) when we think about norms.
The text examines a few notable penalty functions:
• Linear: sum of absolute values; associated with the ℓ1 norm.
• Quadratic: sum of squared errors; associated with the ℓ2 norm.
• Deadzone-linear: zero penalty for small enough residuals; grows linearly after the barrier.
• Log-barrier: grows infinitely as we get near the preset barrier.
Now how do these affect our solution? The penalty function measures our level of irritation with regard
to the residual. When φ(r) grows rapidly as r becomes large, we are immensely irritated. When φ(r) shrinks
rapidly as r becomes small, we don't care as much.
This simple description actually explains the stark difference between the ℓ1 and ℓ2 norms. With an ℓ1 norm,
the slope of the penalty does not change when the residual gets smaller. Therefore, we still have enough
urge to shrink the residual until it becomes 0. On the other hand, with an ℓ2 norm the penalty quickly
gets smaller when the residual drops below 1. Once we go below 1, we do not have as much
motivation to shrink it further - the penalty does not decrease as much. What happens when the residual
is large? Then ℓ1 is actually less irritated than ℓ2; the ℓ2 penalty grows much more rapidly.
These explanations let us predict how the residuals from both penalty functions will be distributed. ℓ1
will give us a lot of zeros, and a handful of very large residuals. ℓ2 will only have a very small number of
large residuals, and it won't have as many zeros - many residuals will be "near" zero, but not exactly zero. The
figures in the text confirm this theory. This actually was one of the most valuable intuitions I got out of
this course. Awesome.
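A minimal sketch that makes these predicted residual distributions concrete, assuming cvxpy; A and b are random invented data. The ℓ1 fit should produce many (roughly n) residuals that are essentially zero, the ℓ2 fit almost none.

import cvxpy as cp
import numpy as np

np.random.seed(0)
A = np.random.randn(100, 30)
b = np.random.randn(100)

x1 = cp.Variable(30)
cp.Problem(cp.Minimize(cp.norm(A @ x1 - b, 1))).solve()   # l1 approximation
x2 = cp.Variable(30)
cp.Problem(cp.Minimize(cp.norm(A @ x2 - b, 2))).solve()   # l2 (least squares)

r1 = A @ x1.value - b
r2 = A @ x2.value - b
print("near-zero residuals (|r| < 1e-4):  l1 =", np.sum(np.abs(r1) < 1e-4),
      "  l2 =", np.sum(np.abs(r2) < 1e-4))
print("largest residual:                  l1 =", np.max(np.abs(r1)),
      "  l2 =", np.max(np.abs(r2)))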
Another little gem discussed in the lecture is that, contrary to the classic approach to fitting problems,
the actual algorithms that find the x are not your tools anymore - they are standard now. The penalty
function is your tool - you shape your problem to fit your actual needs. This is a very interesting, and at
the same time very powerful perspective!
6.1.3 Outliers and Robustness
Different penalty functions behave differently when outliers are present. As we can guess, quadratic loss
functions are affected much worse than linear losses. When a penalty function is not sensitive to outliers,
it is called robust. Linear loss function is an obvious example of this. The text introduces another robust
penalty function, which is the Huber penalty function. It is a hybrid between quadratic and linear losses.
φ_hub(u) = u²              (|u| ≤ M)
           M(2|u| − M)     (otherwise)
The Huber function grows linearly beyond the preset threshold M. It is the closest thing to a "constant-beyond-threshold" loss function without losing convexity. When all the residuals are small, we get the exact least-squares result - but if there are large residuals, we don't go nuts over them. It is said in the lecture that 80%
of all applications of linear regression could benefit from this. A bold, but very interesting claim.
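A quick sketch of Huber-penalty regression in CVXPY (cp.huber is exactly the penalty above with threshold M; the data and outliers below are synthetic):

# Robust regression with the Huber penalty vs. plain least squares (synthetic data).
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 10))
x_true = rng.standard_normal(10)
b = A @ x_true + 0.1 * rng.standard_normal(200)
b[:5] += 10.0                                  # a few gross outliers

x = cp.Variable(10)
M = 1.0                                        # threshold between quadratic and linear regimes
cp.Problem(cp.Minimize(cp.sum(cp.huber(A @ x - b, M)))).solve()
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
# the Huber fit is typically much closer to x_true than the least-squares fit
print(np.linalg.norm(x.value - x_true), np.linalg.norm(x_ls - x_true))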
6.1.4 Least-norm Problems
A closely related problem is the least-norm problem, which has the following form:

minimize ‖x‖
subject to Ax = b

which obviously is meaningful only when Ax = b is underdetermined. This can be cast as a norm
approximation problem by noting that the solution set is a particular solution plus the null
space of A. Let Z be a matrix whose columns form a basis for N(A); then we minimize over u:

‖x₀ + Zu‖
Two concrete examples are discussed in the lecture.
• If we use the ℓ2 norm, we have a closed-form solution via the KKT conditions.
• If we use the ℓ1 norm, it can be modeled as an LP. This approach has been in vogue for, say, the last 10 years or so; we
are now looking for a sparse x.
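For the ℓ2 case the minimum-norm solution has the closed form x = Aᵀ(AAᵀ)⁻¹b; a tiny NumPy check (random underdetermined system, purely illustrative):

# Minimum Euclidean-norm solution of an underdetermined system Ax = b.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 40))          # 10 equations, 40 unknowns: underdetermined
b = rng.standard_normal(10)

x_ln = A.T @ np.linalg.solve(A @ A.T, b)   # x = A^T (A A^T)^{-1} b
print(np.allclose(A @ x_ln, b), np.linalg.norm(x_ln))
# np.linalg.lstsq and np.linalg.pinv return this same minimum-norm solution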
6.2 Regularized Approximations
Regularization is the practice of minimizing the norm of the coefficient vector ‖x‖ as well as the norm of the residual.
It is a popular practice in multiple disciplines.
Why do it? The text introduces a few examples. First of all, it can be a way to express our prior knowledge or preference towards smaller coefficients. There might be cases where our model is not a good
approximation of reality when x gets larger.
Personally, this made the most sense to me: it can be a way of taking variations/errors of the matrix A
into account. For example, say we assume an error ∆ in our matrix A. Then we are minimizing ‖(A + ∆)x − b‖ =
‖Ax − b + ∆x‖; the error is multiplied by x! We don't want a large x.
6.2.1 Bi-criterion Formulation
Regularization can be cast as a bi-criterion problem, as we have two objectives to minimize. We can trace
the optimal trade-off curve between the two objectives. On one end, where ‖x‖ = 0, we have Ax = 0 and the
residual norm is ‖b‖. At the other end, there can be multiple Pareto-optimal points which minimize ‖Ax − b‖.
(When both norms are ℓ2, it is unique.)
6.2.2 Regularization
The actual practice of regularization is more concrete than merely trying to minimize the two objectives;
it is a scalarization method. We minimize
‖Ax − b‖ + γ‖x‖

where γ is a problem parameter (which, in practice, is typically set by cross-validation or manual intervention). Practically, γ is the knob we turn to solve the problem.
Another common practice is taking the weighted sum of squared norms:
‖Ax − b‖² + δ‖x‖²
Note it is not obvious that the two problems sweep out the same tradeoff curve. (They do, and you can
find the mapping between γ and δ given a specific problem).
The most prominent scheme is Tikhonov regularization/ridge regression. We minimize:
‖Ax − b‖₂² + δ‖x‖₂²
which even has an analytic solution.
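The analytic solution is x = (AᵀA + δI)⁻¹Aᵀb; a small NumPy sketch (sizes and δ arbitrary):

# Analytic Tikhonov / ridge-regression solution on synthetic data.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
delta = 0.1

x_ridge = np.linalg.solve(A.T @ A + delta * np.eye(20), A.T @ b)
print(np.linalg.norm(A @ x_ridge - b), np.linalg.norm(x_ridge))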
The text also mentions a “smoothing” regularization scheme - the penalty is on Dx instead of x. D can
change depending on your criteria of fitness of solutions. For example, if we want x to be smooth, we can
roughly penalize its second derivative by setting D as the Toeplitz matrix:

D = [ 1 −2  1  0  ⋯  0  0  0
      0  1 −2  1  ⋯  0  0  0
      ⋮              ⋱
      0  0  0  0  ⋯  1 −2  1 ]
So the elements of Dx approximate the second derivative: (Dx)_i = x_i − 2x_{i+1} + x_{i+2}.
6.2.3 ℓ1 Regularization
ℓ1 regularization is introduced as a heuristic for finding sparse solutions. We minimize:

‖Ax − b‖₂ + γ‖x‖₁

The optimal tradeoff curve here can be an approximation of the optimal tradeoff curve between ‖Ax − b‖₂
and the cardinality card x, which is the number of nonzero elements of x. ℓ1 regularization can be solved
as an SOCP.
6.2.4 Signal Reconstruction Problem
An important class of problems is introduced: signal reconstruction. There is an underlying signal x which
is observed with some noise, resulting in a corrupted observation. x is assumed to be smooth. What is the
most plausible guess for the time series x?
This can be cast as a bicriterion problem: first we want to minimize ‖x̂ − x_cor‖₂, where x̂ is our guess
and x_cor is the corrupted observation. On the other hand, we think a smooth x̂ is more likely, so we also minimize a penalization function φ(x̂). Different penalization schemes are introduced: quadratic smoothing
and total variation smoothing. In short, they are ℓ2 and ℓ1 penalizers, respectively. When the underlying process has some jumps, as you can expect, total variation smoothing preserves those jumps, while quadratic
smoothing tries to smooth out the transition.
Some more insights are shared in the lecture videos. Recall ℓ1 regularization gives you a small number
of nonzero regularized terms. So if you are penalizing

φ(x̂) = Σ_i |x̂_{i+1} − x̂_i|

the first derivative is going to be sparse. What does the resulting function look like? Piecewise constant.
Similarly, say we take the approximate second derivative |2x̂_i − x̂_{i−1} − x̂_{i+1}|? We get piecewise linear! The
theme goes on - if we take the third difference, we get piecewise quadratic (actually, splines).
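A sketch of the two smoothers in CVXPY (synthetic piecewise-constant signal; the first differences x̂_{i+1} − x̂_i stand in for the derivative):

# Quadratic smoothing vs. total variation smoothing of a noisy step signal.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n = 200
x_true = np.concatenate([np.zeros(n // 2), np.ones(n - n // 2)])  # one jump
x_cor = x_true + 0.2 * rng.standard_normal(n)

gamma = 5.0
xh = cp.Variable(n)
diffs = xh[1:] - xh[:-1]                      # approximate first derivative
quad = cp.Problem(cp.Minimize(cp.sum_squares(xh - x_cor) + gamma * cp.sum_squares(diffs)))
tv = cp.Problem(cp.Minimize(cp.sum_squares(xh - x_cor) + gamma * cp.norm(diffs, 1)))
quad.solve()
x_quad = xh.value.copy()
tv.solve()
x_tv = xh.value.copy()
# x_tv keeps a sharp jump in the middle; x_quad smears the transition out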
6.3 Robust Approximation
How do we solve approximation when A is noisy? Let's say

A = Ā + U

where Ā represents the componentwise mean, and U represents the random component with zero
mean. How do we handle this? The prevalent method is to ignore that A has possible errors. It is okay,
as long as you do a “posterior analysis” on the method: try changing A by a small amount and try the
approximation again, see how it changes.
6.3.1 Stochastic Formulation
A reasonable formulation for an approximation problem is to minimize the expected error:

minimize E‖Ax − b‖

This is intractable in general, but tractable in some special cases, including when we minimize the ℓ2
norm:

minimize E‖Ax − b‖₂²
Then:

E‖Ax − b‖₂² = E(Āx − b + Ux)ᵀ(Āx − b + Ux)
            = (Āx − b)ᵀ(Āx − b) + E xᵀUᵀUx
            = ‖Āx − b‖₂² + xᵀPx
            = ‖Āx − b‖₂² + ‖P^{1/2}x‖₂²

where P = E UᵀU.
Tada, we got Tikhonov regularization! This makes perfect sense: increasing the magnitude of x will increase the variation of Ax, which in turn increases the average value of ‖Ax − b‖ by Jensen's inequality. So
we are trying to balance making ‖Āx − b‖ small with making the variance small. This is a nice interpretation
of Tikhonov regularization as well.
6.3.2 Worst-case Robust Approximation
Instead of taking the expected value of the error, we can try to minimize the supremum of the error across a set A
of possible values of A. The text describes several types of A we can use to come up with explicit
solutions. The following are those examples:
• When A is a finite set.
• When A = Ā + U, where the error U lies in a norm ball.
• When each row of A is its mean plus an error, where Pᵢ describes an ellipsoid of possible values for that row.
• More examples...
Worst-case robust least squares is mentioned in the lecture. This is not a convex problem, but it can be
solved exactly. In fact, any optimization problem with two quadratic functions can be solved exactly (see
appendix of the book).
6.4 Function Fitting
In a function fitting problem, we try to approximate an unknown function by a linear combination of basis
functions. We determine the coefficient vector x which yields the following function:
f(u) = Σ_i x_i f_i(u)
where f_i() is the i-th basis function. Typical basis functions are powers of u, so that the possible set of f is the
set of polynomials. You can also use piecewise-linear and piecewise-polynomial functions; using piecewise polynomials
will give you spline functions.
6.4.1 Constraints
We can impose various constraints on the function being fitted. The text introduces some tractable set of
constraints.
• Function value interpolation: the function value at a given point, f(v) = Σ x_i f_i(v), is a linear function
of x. Therefore, equality and inequality constraints on it are actually linear constraints.
• Derivative constraints: the derivative value at a given point, ∇f(v) = Σ x_i ∇f_i(v), is also a linear function
of x.
6.4.2 Sparse Descriptions and Basis Pursuit
In basis pursuit problems, we want to find a sparse f out of a very large number of basis functions. By
a sparse f, we mean there are only a few nonzero entries in the coefficient vector x. Mathematically, this is
equivalent to the regressor selection problem (quite unsurprisingly), so a similar set of heuristics can be
used. First, we can use ℓ1 regularization to approximate optimizing for card x.
6.4.3 Checking Model Consistency
The text introduces an interesting problem - given a set of data points, is there a convex function that
satisfies all those data? Fortunately, recall the first order convexity condition from 3.1.2 - using this, ensuring the convexity of a function is as easy as finding the gradients at the data points so that the first order
condition is satisfied. We want to find g1 , · · · , gm so that:
y_j ≥ y_i + g_iᵀ(u_j − u_i)
for any pair of i and j.
Fitting Convex Function To The Data
We can “fit” a convex function to the data by finding fitted values of y, and ensuring the above condition
holds for the fitted values. Formally, solve:
minimize   Σ_i (y_i − ŷ_i)²
subject to ŷ_j ≥ ŷ_i + g_iᵀ(u_j − u_i) for any pair of i, j
This is a regular QP. Note the result of this problem is not a functional representation of the fitted
function, as in regular regression problems. Rather, we get the value of the function - so it’s a point-value
representation.
Bounding Values
Say we want to find out if a new data point is "irregular" or not - is it consistent with what we saw earlier?
In other words, given a new u_new, what is the range of values possible given the previous data? We can
minimize/maximize ŷ_new subject to the first-order constraints to find the range. These problems are LPs.
7 Statistical Estimation
I was stuck in this chapter for too long. It’s time to finish this chapter no matter what. This chapter shows
some example applications of convex optimization in statistical settings.
7.1 Parametric Distribution Estimation
The first example is MLE fitting - the most obvious, but the most useful. We of course require the constraints
on x to be convex optimization friendly. A linear model with IID noise is discussed:
y_i = a_iᵀx + v_i
The MLE is of course
x_ml = argmax_x l(x) = argmax_x log p_x(y)

p_x(y) depends on the distribution of the v_i. Different assumptions on this distribution lead to different
fitting methods:
• Gaussian noise gives you OLS.
• Laplacian noise gives you ℓ1-norm (sum of absolute residuals) regression. (Of course, the Laplacian distribution has a sharp
peak at 0, which equates to having a high incentive to reduce the residual when the residual is really small.)
• Uniform noise on [−a, a] gives a feasibility problem: any x with ‖Ax − b‖∞ ≤ a is an MLE.
Also note that we need log p_x(y) to be concave in x, not y; exponential families of distributions
meet this criterion. In many cases, your natural choice of parameters might not yield a log-likelihood
function that is concave. Usually we can achieve this with a change of variables.
Also, we discuss that these noise distributions are equivalent to different penalty schemes - as demonstrated
by the correspondence of ℓ2 with Gaussian and ℓ1 with Laplacian. There is a 1:1 correspondence: if you have a
penalty function p(v), the corresponding noise distribution is e^{−p(v)}, normalized!
7.1.1 Logistic Regression Example
We model p = S(aᵀu + b) = exp(aᵀu + b) / (1 + exp(aᵀu + b)), where u contains the explanatory variables, and a and b are model parameters. Say we have n = q + m examples, the first q of them having y_i = 1 and the next m of them having y_i = 0.
Then the likelihood function is:

∏_{i=1}^{q} p_i · ∏_{i=q+1}^{n} (1 − p_i)

Take the log and plug in the above expression for p, and we get the following concave function:

Σ_{i=1}^{q} (aᵀu_i + b) − Σ_{i=1}^{n} log(1 + exp(aᵀu_i + b))
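A sketch of maximizing this log-likelihood in CVXPY (cp.logistic(z) is log(1 + e^z); the data and labels below are synthetic):

# Logistic regression by maximizing the concave log-likelihood.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(6)
n, d = 100, 5
U = rng.standard_normal((n, d))                   # explanatory variables, one row per example
y = (rng.uniform(size=n) < 0.5).astype(float)     # synthetic 0/1 labels

a = cp.Variable(d)
b = cp.Variable()
z = U @ a + b
loglik = cp.sum(cp.multiply(y, z)) - cp.sum(cp.logistic(z))   # sum_i y_i z_i - sum_i log(1 + e^{z_i})
prob = cp.Problem(cp.Maximize(loglik))
prob.solve()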
7.1.2 MAP Estimation
MAP is the Bayesian equivalent of MLE. The underlying philosophy is vastly different, but the optimization
technicalities remain more or less the same, except for an extra term that describes the prior distribution.
7.2 Nonparametric Distribution Estimation
A nonparametric distribution is one for which we don't have any closed formula. So we will estimate a
vector p ∈ Rⁿ where

prob(x = α_k) = p_k
7.2.1 Priors
• Expected values of functions are just linear functions of p, so equality constraints on them are easy to express.
• The variance of the random variable is a concave function of p. Therefore, a lower bound on the
variance can be expressed within the convex setting.
• The entropy of X is concave in p, so we can express a lower bound on it as well.
• The KL-divergence between p and q is convex, so we can impose an upper bound here.
7.2.2 Objectives
• We can minimize/maximize expected values because they are affine in p.
• We can find the MLE because the log-likelihood for p in this setting is always concave.
• We can find the maximum entropy distribution (a small sketch follows below).
• We can find the minimum KL-divergence between p and q.
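As an illustration of the maximum entropy case, a small CVXPY sketch (hypothetical support α and a single moment constraint; cp.entr(p) is −p log p):

# Maximum entropy distribution subject to a known mean.
import cvxpy as cp
import numpy as np

alpha = np.arange(1, 7)                 # support: a six-sided die, say
p = cp.Variable(6, nonneg=True)
constraints = [cp.sum(p) == 1,
               alpha @ p == 4.5]        # prior information: E[X] = 4.5
prob = cp.Problem(cp.Maximize(cp.sum(cp.entr(p))), constraints)
prob.solve()
print(p.value)                          # mass is tilted toward the larger faces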
7.3 Optimal Detector Design And Hypothesis Testing
I will only cover this section briefly.
Problem setup: the parameter θ can take m values. For each value of θ, we have a nonparametric
distribution over n possible values α_1, …, α_n. The probabilities can be represented by a matrix P ∈ R^{n×m}.
We call each θ a different hypothesis. We want to find which θ generated a given sample. So the detector we
want to design is a function from a sample to θ.
We can create either deterministic or probabilistic detectors; like in game theory, introducing extra
randomness can improve the detector in many ways. For a simple and convincing example, say we have
a binary problem. Draw an ROC curve which shows the tradeoff between false positive and false negative
errors. A deterministic detector might not be able to hit the sweet spot where p_fn = p_fp, depending on the
θs - but probabilistic detectors can.
7.4 Chebyshev and Chernoff Bounds
7.4.1 Chebyshev Bounds
Chebyshev bounds give an upper bound on a probability of a set based on known quantities; many inequalities follow this form. For example, Markov’s inequality says: If X ∈ R+ has EX = µ then we have
prob (X ≥ 1) ≤ µ. (Of course, this inequality is completely useless when µ > 1 but that’s how all these
inequalities are.) This section looks at cases where we can find such bounds using convex optimization.
In this setup, our prior knowledge is represented as a pair of functions and their expected values. The
set whose probability we want to find bounds for is given as C. We want something like:
prob (X ∈ C) ≤ Ef (X)
for some function f whose expectation we can take.
The recipe is to concoct an f which is a linear combination of the prior-knowledge functions. Then Ef(X) is
simply a linear combination of the known expectations. How do we ensure this expected value is above prob(X ∈ C)? We
can impose that f(x) ≥ 1_C(x) pointwise, where 1_C is the indicator function of C. We can now state the
following problem:
minimize   Σ_i a_i x_i = Σ_i (E f_i(X)) x_i = E f(X)
subject to f(z) = Σ_i x_i f_i(z) ≥ 1   if z ∈ C
           f(z) = Σ_i x_i f_i(z) ≥ 0   if z ∈ S\C
This is a convex optimization problem, since the constraints are convex. For example, the first constraint
can be recast as
g₁(x) = 1 − inf_{z∈C} f(z) ≤ 0

which is surely convex. There is another formulation for the case where the first two moments are specified, but I am omitting it.
7.4.2 Chernoff Bounds
This section deals with Chernoff bounds, which have a different form but follow the same concept.
7.5 Experiment Design
We discuss various solutions to the experiment design problem as an application. The setup is as follows.
We have a fixed menu of p different experiments which is represented by ai (1 ≤ i ≤ p). We will perform
m experiments, each taken from the menu. For each experiment, we get a result y_i, which is

y_i = a_iᵀx + w_i

where the w_i are independent unit Gaussian noise. The maximum likelihood estimate is of course given by
least squares. Then, the associated error e = x̂ − x has zero mean and covariance matrix E:

E = E eeᵀ = ( Σ_i a_i a_iᵀ )⁻¹
How do we minimize E? What kind of metrics do we use?
7.5.1 Further Modeling
First, this is an offline problem and we don't actually care about the order in which we perform the experiments. So the only thing
we care about is, for each experiment on the menu, how many times we perform it. So the optimization
variables are a list of nonnegative integers m_i which sum up to m. Of course, this problem is
combinatorially hard, so we relax it a bit by modeling the fraction of the m runs allotted to each experiment.
Still, the objective E is not a scalar (it is a matrix), so we need some scalarization scheme to minimize
it. The text discusses some strategies, including:
• D-optimal design: minimize the determinant of E. Since the determinant measures volume, we are
in effect minimizing the volume of the confidence ellipsoid.
• E-optimal design: we minimize the largest eigenvalue of E. Rationale: the diameter of the confidence
ellipsoid is proportional to the norm of the matrix.
• A-optimal design: we minimize the trace. This is, effectively, minimizing the expected squared error.
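A sketch of the relaxed D-optimal problem in CVXPY (random menu of experiment vectors; λ are the fractions, and maximizing log det of the information matrix is the same as minimizing log det E):

# Relaxed D-optimal experiment design over a random menu.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(7)
p, n = 20, 5
a = rng.standard_normal((p, n))                  # menu of p candidate experiments in R^n
lam = cp.Variable(p, nonneg=True)                # fraction of the budget spent on each
info = sum(lam[i] * np.outer(a[i], a[i]) for i in range(p))   # sum_i lam_i a_i a_i^T
prob = cp.Problem(cp.Maximize(cp.log_det(info)),
                  [cp.sum(lam) == 1])
prob.solve()
print(lam.value)                                 # typically only a few experiments get weight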
8 Geometric Problems
8.1 Point-to-Set Distance
The projection of a point x₀ onto a closed set C is defined as the point in C that minimizes the distance
from x₀. When C is closed and convex, and the norm is strictly convex (e.g. Euclidean), we can prove the
projection is unique.
When the set C is convex, finding the projection is a convex optimization problem. Some examples are
discussed - planes, halfplanes, and a proper cone.
Finding a separating hyperplane between a point and a convex set is discussed as well. When we use the
Euclidean norm, we have a geometric, intuitive way to find one: take x₀ and its projection p(x₀), and use
the hyperplane which is normal to p(x₀) − x₀ and passes through the midpoint of the two points. However, for other norms,
we have to construct such a hyperplane using the dual problem; if we find a particular Lagrange multiplier
for which the dual problem is feasible, we know that multiplier constitutes a separating hyperplane.
8.1.1 PCA Example
Suppose C is the set of m × n matrices with rank at most k. The projection of X₀ onto C which minimizes the
Euclidean norm is achieved by a truncated SVD - yes, PCA!
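A NumPy sketch of that projection (rank-k truncation of the SVD; the matrix and k are arbitrary):

# Projection of a matrix onto the set of rank-<=k matrices via truncated SVD.
import numpy as np

rng = np.random.default_rng(8)
X0 = rng.standard_normal((8, 6))
k = 2

U, s, Vt = np.linalg.svd(X0, full_matrices=False)
X_proj = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # keep only the k largest singular values
print(np.linalg.matrix_rank(X_proj))              # 2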
8.2 Distance between Sets
Finding the distance between two convex sets is a convex optimization problem, of course. The dual of this problem
can be interpreted as the problem of finding a separating hyperplane between the two sets. The argument can
be made: if strong duality holds, a positive distance implies the existence of a separating hyperplane.
8.3 Euclidean Distance and Angle Problems
This section deals with problems where Euclidean distances and angles between vectors are constrained.
Setup: n vectors a_1, …, a_n in Rⁿ, for which we assume the Euclidean lengths are known: l_i = ‖a_i‖₂.
Distance and angular constraints can be cast as constraints on G, the Gram matrix of the matrix A which
has the a_i as its columns:

G = AᵀA

G will be our optimization variable; after the optimization we can back out the vectors of interest by
Cholesky factorization. This is an SDP since G is always positive semidefinite.
8.3.1 Expressing Constraints in Terms of G
• Diagonal entries give the squared lengths: G_ii = l_i².
• The distance d_ij between vectors i and j can be written as

  d_ij = ‖a_i − a_j‖₂ = (l_i² + l_j² − 2a_iᵀa_j)^{1/2} = (l_i² + l_j² − 2G_ij)^{1/2}

  which means G_ij is an affine function of d_ij²: G_ij = (l_i² + l_j² − d_ij²)/2. This means range constraints on d_ij² can be a pair of linear constraints on G_ij.
• G_ij is an affine function of the correlation coefficient ρ_ij.
• G_ij is also an affine function of the cosine of the angle between the two vectors, cos α_ij. Since cos⁻¹ is monotonic, we can use this to constrain the range of α_ij.
8.3.2 Well-Conditionedness Constraints
The condition number of A, σ₁/σ_n, is a quasiconvex function of G. So we can impose a maximum value, or
try to minimize it using quasiconvex optimization.
Two additional approaches to well-conditionedness are discussed: dual basis, and maximizing log det G.
8.3.3 Examples
• When we only care about angles between vectors (or correlations), we can set l_i = 1 for all i.
• When we only care about distances between vectors, we can assume that the mean of the vectors is 0.
This case can be solved using the squared lengths as the optimization variable. Since G_ij = (l_i² + l_j² − d_ij²)/2,
we get

  G = (z1ᵀ + 1zᵀ − D)/2

  which should be PSD, where z_i = l_i² and D is the matrix of squared distances d_ij².
8.4 Extremal Volume Ellipsoids
This section deals with problems which "approximate" given sets with ellipsoids.
8.4.1 Lowner-John Ellipsoid
The LJ ellipsoid ε_lj for a set C is defined as the minimum-volume ellipsoid that contains C. This can be
cast as a convex optimization problem, though it is only tractable when C is tractable. (Of course, if C is described by an
infinite number of points or constraints, it's not going to be tractable...) We set our optimization variables A
and b such that:

ε_lj = {v : ‖Av + b‖₂ ≤ 1}

The volume of the LJ ellipsoid is proportional to det A⁻¹, so that's what we optimize for. We minimize

log det A⁻¹

subject to sup_{v∈C} ‖Av + b‖₂ ≤ 1. As a trivial example, consider when C is a finite set of size m; then the
constraint translates into m convex constraints on A and b.
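A CVXPY sketch for the finite-point case (my own illustration; A is constrained to be PSD and the points are random):

# Minimum-volume (Lowner-John) ellipsoid covering a finite point set.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(9)
pts = rng.standard_normal((30, 2))                # m points in R^2

A = cp.Variable((2, 2), PSD=True)
b = cp.Variable(2)
cons = [cp.norm(A @ v + b, 2) <= 1 for v in pts]  # every point inside the ellipsoid
prob = cp.Problem(cp.Minimize(-cp.log_det(A)), cons)   # volume is proportional to det A^{-1}
prob.solve()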
A notable feature of the LJ ellipsoid is that its efficiency can be bounded: if you shrink an LJ ellipsoid by a
factor of n (the dimension), it is guaranteed to fit inside C (of course, when C is bounded and has nonempty
interior). So roughly we have a factor of n approximation. (Argh... the proof is tricky. It uses the KKT conditions of a modified
problem.)
When the set is symmetric about a point x₀, the factor 1/n can be improved to 1/√n.
8.4.2 Maximum Volume Inscribed Ellipsoid
A related problem tries to find the maximum volume ellipsoid which lies inside a bounded, convex set C
with nonempty interrior. We use a different formulation of the ellipsoid now; it’s a forward projection of
a unit ball.
36
ε = {Bu + d| kuk2 ≤ 1}
Now its volume is proportional to det B. The constraint would be:
sup IC (Bu + d) ≤ 0
kuk2 ≤1
Max Ellipsoid Inside A Polyhedron    A polyhedron is described by a set of m linear inequalities:

C = {x : a_iᵀx ≤ b_i, i = 1, …, m}

We can now optimize over B and d. We can translate the constraint as:

sup_{‖u‖₂≤1} I_C(Bu + d) ≤ 0  ⇔  sup_{‖u‖₂≤1} a_iᵀ(Bu + d) ≤ b_i for all i  ⇔  ‖Ba_i‖₂ + a_iᵀd ≤ b_i for all i

which is a convex constraint on B and d.
8.4.3 Affine Invariance
If T is an invertible matrix, it is stated that the LJ ellipsoid of the transformed set is the transformed LJ ellipsoid,
so it still covers the set after the transformation. The same holds for the maximum volume inscribed ellipsoid.
8.5 Centering
8.5.1 Chebyshev Center
Given a bounded set C ⊆ Rⁿ with nonempty interior, the Chebyshev centering problem finds a point where the
depth is maximized, where the depth is defined as:

depth(x, C) = dist(x, Rⁿ\C)

So it's the point which is farthest from the exterior of C. This is not always tractable; suppose C is defined
by a set of convex inequalities f_i(x) ≤ 0. Then the Chebyshev center could be found by solving:

maximize   R
subject to g_i(x, R) ≤ 0   (i = 0, 1, ⋯)

where g_i(x, R) is the pointwise supremum of f_i(x + Ru) over ‖u‖₂ ≤ 1. Since f_i is convex and x + Ru is affine
in x and R, g_i is a convex function. However, it's hard to evaluate g_i, since we have to find a pointwise
supremum of convex functions. Therefore, the Chebyshev center problem is tractable only for specific classes
of C - for example, when C is a polyhedron (an LP can solve this case).
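A sketch of the polyhedral case in CVXPY: requiring the ball {x + Ru : ‖u‖₂ ≤ 1} to satisfy a_iᵀx ≤ b_i becomes a_iᵀx + R‖a_i‖₂ ≤ b_i, which is an LP (the polyhedron below is made up):

# Chebyshev center of a polyhedron {x : a_i^T x <= b_i}.
import cvxpy as cp
import numpy as np

# a small synthetic polyhedron in R^2
A = np.array([[1.0, 1.0], [-1.0, 2.0], [0.0, -1.0]])
b = np.array([2.0, 3.0, 0.0])

x = cp.Variable(2)
R = cp.Variable(nonneg=True)
cons = [A[i] @ x + R * np.linalg.norm(A[i]) <= b[i] for i in range(len(b))]
prob = cp.Problem(cp.Maximize(R), cons)
prob.solve()                                      # an LP; x.value is the deepest point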
8.5.2 Maximum Volume Ellipsoid Center
A generalization of the Chebyshev center is the MVE center: the center of the maximum volume inscribed ellipsoid.
When the MVE problem is solvable, the MVE center comes for free.
8.5.3 Analytic Center
An analytic center works with the logarithmic barrier − log x. If C is defined as
fi (x) ≤ 0
for all i, the analytic center minimizes
−Σ_i log(−f_i(x))
This makes sense; when x is feasible, the absolute value of fi (x) kind of denotes the margin between x
and infeasible regions. Analytic center tries to maximize the product of those margins. The analytic center
is not invariant under different representations of the same set C, obviously.
8.6 Classification
This section deals with two sets of data {x1 , x2 , · · · , xN } and {y1 , y2 , · · · , yM }. We want to find a function
f (x) such that f (xi ) > 0 and f (yi ) < 0.
8.6.1 Linear Discrimination
Linear discrimination finds an affine function f(x) = aᵀx − b which satisfies the above requirements.
Since these requirements are homogeneous in a and b, we can scale them arbitrarily so that the following are
satisfied:

aᵀx_i − b ≥ 1,   aᵀy_i − b ≤ −1
8.6.2 Robust Linear Discrimination
If two sets can be linearly discriminated, there will always be multiple functions that separate them. One
way to choose among them is to maximize the minimum distance from the line to each sample; in other
words, maximum margin or the “thickest slab”. This leads to the following problem:
maximize   t
subject to aᵀx_i − b ≥ t
           aᵀy_i − b ≤ −t
           ‖a‖₂ ≤ 1

Note the last constraint: we normalize a, since otherwise we could scale a and b to increase t arbitrarily.
Support Vector Classifier
When two sets cannot be linearly separated, we can relax the constraints
f (xi ) > 1 and f (yi ) < −1 by rewriting them as:
f (xi ) > 1 − ui , and f (yi ) < −1 + vi
where u_i and v_i are nonnegative. These numbers can be interpreted as measures of how much each
constraint is violated. We can try to make them sparse by optimizing their sum; this is an ℓ1 norm, and u
and v will (hopefully) be sparse.
Support Vector Machine    The above are two approaches to robust linear discrimination. The first tries to
maximize the width of the slab; the second tries to minimize the number of misclassified points (actually, it
optimizes a proxy for it). We can consider the trade-off between the two. Note the width of the slab

{z : −1 ≤ aᵀz − b ≤ 1}

can be calculated as the distance between the two hyperplanes aᵀz = b − 1 and aᵀz = b + 1. Let
aᵀz₁ = b − 1 and aᵀz₂ = b + 1; then aᵀ(z₂ − z₁) = 2, and it follows that the slab width is ‖z₂ − z₁‖₂ = 2/‖a‖₂ (taking z₂ − z₁ parallel to a). Now we can solve the
following multicriterion optimization problem (scalarized with a weight γ):

minimize   ‖a‖₂ + γ(1ᵀu + 1ᵀv)
subject to aᵀx_i − b ≥ 1 − u_i
           aᵀy_i − b ≤ −1 + v_i
           u ⪰ 0, v ⪰ 0

We have the SVM!
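The whole thing as a CVXPY sketch (synthetic two-class data; γ trades slab width against constraint violations):

# Soft-margin SVM as a convex program.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(10)
N = M = 50
X = rng.standard_normal((N, 2)) + np.array([2.0, 2.0])    # class +1 samples
Y = rng.standard_normal((M, 2)) - np.array([2.0, 2.0])    # class -1 samples

a = cp.Variable(2)
b = cp.Variable()
u = cp.Variable(N, nonneg=True)
v = cp.Variable(M, nonneg=True)
gamma = 1.0
obj = cp.norm(a, 2) + gamma * (cp.sum(u) + cp.sum(v))
cons = [X @ a - b >= 1 - u,
        Y @ a - b <= -1 + v]
prob = cp.Problem(cp.Minimize(obj), cons)
prob.solve()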
Logistic Regression
Another way to do approximate linear discrimination is logistic regression. This should be very familiar
now; the negative log likelihood function is convex.
8.6.3 Nonlinear Discrimination
We can create nonlinear separation boundaries by introducing quadratic and polynomial features. For polynomial discrimination, we can do a bisection on the degree to find the smallest-degree polynomial that can separate
the input.
8.7 Placement and Location
The placement problem deals with n points in R^k, where some locations are given and the rest of the points
are the optimization variables. The treatment is rather basic. In essence,
• You can minimize the sum of distances between connected nodes when the distance metric is convex.
• You can place upper bounds on distance between pairs of points, or lengths of certain paths.
• When the underlying connectivity represents a DAG, you can also minimize the max distance from a
source node to a sink node using a DP-like argument.
8.8 Floor Planning
A floor planning problem tries to place a number of axis-aligned rectangles without overlaps, optimizing for the size
of the bounding rectangle. This is a hard combinatorial optimization problem in general, but specifying the
relative positioning of the boxes can make these problems convex. A relative positioning constraint specifies
how individual pairs of rectangles are positioned. For example, rectangle i must be either above, below,
to the left of, or to the right of rectangle j. These can be cast as linear inequalities. For example, we can specify that
rectangle i is to the left of rectangle j by requiring:

x_i + w_i ≤ x_j
Some other constraints we can use:
• Minimum area for each rectangle
• Aspect ratio constraints are simple linear (in)equalities.
• Alignment constraints: for example, two rectangles are centered at the same line
• Symmetry constraints
• Distance constraints: given relative positioning constraints, ℓ1 or ℓ∞ constraints can be cast pretty
easily.
Optimizing for the area of the bounding box gives you a geometric programming problem.
9 Numerical Linear Algebra Background
10 Unconstrained Optimization
Welcome to part 3! For the rest of the material I plan to skip over the theoretical parts, only covering the
motivation and rationale of the algorithms.
10.1 Unconstrained Minimization Problems
An unconstrained minimization doesn’t have any constraints but a function f () we need to minimize. We
assume f to be convex and differentiable, so the optimality can be checked by looking at the gradient ∇f .
This can sometimes be solved analytically (for example, least squares), but in general we need to resort to
an iterative method (for example, geometric programming or analytic center).
10.1.1 Strong Convexity
For most of this chapter we assume that f is strongly convex, which means that there exists m > 0 such
that

∇²f(x) ⪰ mI

for all x (in the relevant sublevel set). This feels like an analogue of having a positive second-order coefficient - is this different
from having a posdef Hessian? (Hopefully the video lecture provides some insights.)
Anyways, this is an extremely strong assumption, and we can’t, in general, expect our functions to be
strongly convex. Then why assume this? We are looking at theoretical convergence, which is already not
attainable anyway (because no algorithm is going to run forever). The professor says it's more of a "feel good" thing, so let's make assumptions that can shorten the proof.
Strong convexity has interesting consequences; the usual convexity bound can be improved so we have:

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖₂²

We can analytically find the minimizing y of the RHS of this inequality, and plug it back into the RHS to
get a lower bound on f(y):

f(y) ≥ f(x) − (1/(2m))‖∇f(x)‖₂²

So this practically means that we have a near-optimal point when the gradient is small. When we know m
this can be a hard guarantee, but m is not attainable in general. Therefore, we resort to making ‖∇f(x)‖₂²
small enough so that we have a high chance of being near optimal.
10.1.2 Condition Number of Sublevel Sets
Condition numbers of sublevel sets have a strong effect on the efficiency of some algorithms. The condition
number of a set is defined as the ratio between the maximum and minimum width of the set, where the width of a
convex set C along a direction q (‖q‖₂ = 1) is defined by:

W(C, q) = sup_{z∈C} qᵀz − inf_{z∈C} qᵀz
10.2 Descent Methods
Descent methods are the family of iterative optimization algorithms which generate a new iterate x^(k+1) from x^(k) by taking

x^(k+1) = x^(k) + t^(k) ∆x^(k)

where t^(k) is called the step size, and ∆x^(k) is called the search direction. Depending on how we choose t
and ∆x, we get different algorithms. There are two popular ways of choosing t:
• Exact line search minimizes g(t) = f(x^(k) + t·∆x^(k)) exactly, by either analytic or iterative means.
This is used when this one-dimensional minimization can be solved efficiently.
• Backtracking line search tries to find a t where the objective function sufficiently decreases. The exact
details aren't very important; it is employed when the minimization problem is harder to solve. The
algorithm is governed by two parameters which, in practice, do not drastically change the performance of the search.
10.3 Gradient Descent
Taking ∆x^(k) = −∇f(x^(k)) gives the gradient descent algorithm. Some results from the convergence analysis are displayed. The number of iterations required is bounded by

log((f(x⁰) − p*) / ε) / log(1/c)

where p* is the optimal value and we stop when we have f(x^(k)) − p* < ε. The numerator is intuitive; the
denominator involves the condition number and is roughly equal to m/M. Therefore, as the condition number
increases, the number of required iterations grows linearly with it. Given a constant condition number, this
bound shows that the error decreases exponentially with the iteration count. For some reason this is called linear convergence
in the optimization context.
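A bare-bones NumPy sketch of gradient descent with backtracking (the toy quadratic and the parameters alpha, beta are arbitrary choices):

# Gradient descent with backtracking line search on a toy quadratic.
import numpy as np

def grad_descent(f, grad, x, alpha=0.3, beta=0.7, tol=1e-8, max_iter=1000):
    for _ in range(max_iter):
        g = grad(x)
        if np.dot(g, g) <= tol:          # stop when the gradient is small
            break
        t = 1.0
        # backtrack until the decrease is sufficient (Armijo condition)
        while f(x - t * g) > f(x) - alpha * t * np.dot(g, g):
            t *= beta
        x = x - t * g
    return x

P = np.diag([1.0, 50.0])                 # ill-conditioned quadratic f(x) = 0.5 x^T P x
f = lambda x: 0.5 * x @ P @ x
grad = lambda x: P @ x
print(grad_descent(f, grad, np.array([1.0, 1.0])))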
10.3.1 Performance Analysis on Toy Problems
Exact line search and backtracking line search are compared on toy problems; the number of iterations can differ
by a factor of 2 or something like that. Also, we look at an example where we play with the condition
number of the Hessian of f, and the number of iterations can really blow up.
10.4 Steepest Descent
The steepest descent algorithm generalizes gradient descent by employing a different norm. Given a norm, the
normalized steepest descent direction is given by

∆x = argmin { ∇f(x)ᵀv : ‖v‖ = 1 }
Geometrically, we look at all directions in a unit ball around the current x and pick the one that decreases (the linear approximation of) f the most. When
we use the Euclidean norm, we regain gradient descent. Also, in some cases, we can think of SD as GD after a
change of coordinates (intuitively this makes sense, because using a different norm is essentially employing
a different view of the coordinate system).
10.4.1 Steepest Descent With an ℓ1 Norm
When we use the ℓ1 norm, SD essentially becomes the coordinate descent algorithm. It can be shown trivially:
take the basis vector with the largest (in magnitude) gradient component, and minimize along that direction. Since we
are using the ℓ1 norm, we can never take a steeper descent.
10.4.2 Performance and Choice of Norm
Without any problem-specific assumptions, this is essentially the same as GD. However, remember that the condition number greatly affects the performance of GD - and a change of coordinates can change the sublevel
sets' condition number. Therefore, if we can choose a norm such that the sublevel sets approximate
an ellipsoid/sphere, SD works very well. The Hessian at the optimal point, if attainable, is a norm choice that
reduces the condition number greatly.
10.5 Newton’s Method
Newton’s method is the workhorse of convex optimization. The major motivation is that it tries to minimize
the quadratic approximation of f () at x. To do this, we choose a Newton step ∆xnt :
∆x_nt = −∇²f(x)⁻¹ ∇f(x)
Several properties and interpretations are discussed.
• The Newton step minimizes the second-order Taylor approximation of f . So, when f () roughly follows
a quadratic form, Newton’s method is tremendously efficient.
• It’s the steepest descent direction for the quadratic norm defined by the Hessian. Recall that the
Hessian at the optimal point is a great choice for a norm for SD - so when we have a near-optimal
point, this choice minimizes the condition number greatly.
• Solution of linearized optimality condition: we want to find v such that ∇f (x + v) = 0. And approximately:
∇f(x + v) ≈ ∇f(x) + ∇²f(x)v = 0
and the Newton update is a solution for this.
• The Newton step is affinely invariant, so multiplying a single coordinate by a constant factor will
not change convergence. This is a big advantage over the usual gradient descent: Newton's method is much more resistant to high-condition-number sublevel sets. In practice, an extremely
high condition number can still hinder us because of finite precision arithmetic, yet it is still a big
improvement.
10.5.1 The Newton Decrement
The Newton decrement is a closely related scalar value; it is used as a stopping criterion as well:

λ(x) = ( ∇f(x)ᵀ ∇²f(x)⁻¹ ∇f(x) )^{1/2}

It is related to our estimate of the error f(x) − p* by the following relationship (f̂ being the second-order approximation of f):

f(x) − p* ≈ f(x) − inf_y f̂(y) = λ²/2

We stop when this value (λ²/2) is less than ε.
10.5.2 Newton’s Method
Newton’s method closesly follows the gradient descent algorithm, except it uses the Newton decrement for
stopping criterion, which is checked before making the update.
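A NumPy sketch of the damped Newton iteration with the decrement as the stopping rule (toy function; the backtracking parameters are arbitrary):

# Newton's method with backtracking and the Newton decrement stopping rule.
import numpy as np

def newton(f, grad, hess, x, eps=1e-10, alpha=0.3, beta=0.7, max_iter=50):
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)          # Newton step
        lam2 = -g @ dx                       # lambda^2 = g^T H^{-1} g
        if lam2 / 2 <= eps:                  # stop before making the update
            break
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta                        # damped phase: t < 1 possible
        x = x + t * dx
    return x

# toy smooth convex function: f(x) = log(e^{x1} + e^{-x1}) + x2^2
f = lambda x: np.logaddexp(x[0], -x[0]) + x[1] ** 2
grad = lambda x: np.array([np.tanh(x[0]), 2 * x[1]])
hess = lambda x: np.diag([1.0 / np.cosh(x[0]) ** 2, 2.0])
print(newton(f, grad, hess, np.array([2.0, 1.0])))   # close to [0, 0]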
10.5.3 Convergence Analysis
The story told by the convergence analysis is interesting. There exists a threshold for ‖∇f(x)‖₂ which, once
crossed, makes the algorithm converge quadratically. This condition (gradient below the threshold), once
attained, will hold in all further iterations. Therefore, the algorithm works in two separate stages.
• In the damped Newton phase, line search can give us a step size t < 1, and f will decrease by at
least γ, another constant.
• The pure Newton phase follows, where we only take full steps (t = 1) and we get quadratic
convergence.
10.5.4 Summary
• Very fast convergence: in particular, quadratic convergence once we get near the optimal point.
• Affine invariant: much more resistant to high condition numbers.
• Performance does not depend much on the correct choice of parameters, unlike SD.
10.6 Self-Concordant Functions
This section covers an alternative assumption on f which allows a better (or more elegant) analysis of
the performance of Newton's method. This seems like more of an aesthetic, theoretical result, so unless some
insights come up in the video lectures, I am going to skip it.