Orthogonal Rank-One Matrix Pursuit for Matrix Completion

Jieping Ye
University of Michigan
Joint work with Zheng Wang, Ming-Jun Lai, Zhaosong Lu, Wei Fan, and Hasan Davulcu
First International Workshop on Machine Learning Methods for Recommender
Systems, Vancouver, British Columbia, Canada, May 2nd, 2015.
Outline
• Background
• Trace Norm Formulation
• Matrix Factorization
• Orthogonal Rank-One Matrix Pursuit
• Evaluation
• Summary
Matrix Completion
• Image inpainting
• Microarray data imputation
• Video recovery
• Collaborative filtering
Collaborative Filtering
[Figure: a Customers × Items rating matrix in which most entries are unknown ("?").]
• Customers are asked to rank items
• Not all customers ranked all items
• Predict the missing rankings (98.9% is missing)
The Netflix Problem
[Figure: a Users × Movies rating matrix with sparsely observed ratings.]
• About a million users and 25,000 movies
• Known ratings are sparsely distributed
• Preferences of users are determined by a small number of factors → low rank
Matrix Rank
• The rank is the number of linearly independent rows or columns.
• Via the singular value decomposition (SVD), X = U Σ V^T, the rank equals the number of nonzero singular values.
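As a quick illustration (a minimal NumPy sketch, not part of the original slides), the rank can be read off from the SVD as the number of singular values above a numerical tolerance:

```python
import numpy as np

# Build a 6 x 5 matrix of rank 2 as the product of two thin factors.
A = np.random.randn(6, 2) @ np.random.randn(2, 5)

# SVD: A = U @ diag(s) @ Vt; the rank is the number of singular
# values that are nonzero up to a small numerical tolerance.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
rank = int(np.sum(s > 1e-10 * s[0]))
print(rank)                      # 2
print(np.linalg.matrix_rank(A))  # NumPy's built-in estimate agrees
```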
Low Rank Matrix Completion
• Low rank matrix completion with incomplete observations can be formulated as

    min_X   rank(X)
    s.t.    P_Ω(X) = P_Ω(Y)

  with the projection operator defined as

    [P_Ω(X)]_ij = x_ij  if (i, j) ∈ Ω,
                  0     if (i, j) ∉ Ω.
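The projection operator simply keeps the observed entries and zeros the rest. A minimal NumPy sketch (the function name P_Omega and the boolean-mask encoding of Ω are illustrative choices, not from the slides):

```python
import numpy as np

def P_Omega(X, mask):
    """Projection onto the observed entries: keep X[i, j] when
    (i, j) is observed (mask[i, j] is True) and zero it otherwise."""
    return np.where(mask, X, 0.0)

# Example: a 3 x 3 matrix with four observed entries.
Y = np.arange(9, dtype=float).reshape(3, 3)
mask = np.zeros((3, 3), dtype=bool)
mask[[0, 1, 2, 2], [0, 2, 1, 2]] = True
print(P_Omega(Y, mask))
```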
Other Low-Rank Problems
• Multi-task/multi-class learning
• Image compression
• System identification in control theory
• Structure-from-motion problem in computer vision
• Low rank metric learning in machine learning
• Other settings:
  • a low-degree statistical model for a random process
  • a low-order realization of a linear system
  • a low-order controller for a plant
  • a low-dimensional embedding of data in Euclidean space
Two Formulations for Rank Minimization

    min_X   loss(X) + λ · rank(X)

    min_X   rank(X)   subject to   loss(X) ≤ ε

where loss(X) = (1/2) ||P_Ω(X) − P_Ω(Y)||_F².
Rank minimization is NP-hard.
Trace Norm (Nuclear Norm)
• The trace norm is the 1-norm of the vector of singular values.
• The trace norm is the convex envelope of the rank function over the unit ball of the spectral norm ⇒ a convex relaxation.
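As a small numerical check (a NumPy sketch following the definition above, not from the slides), the trace norm is simply the sum of the singular values:

```python
import numpy as np

X = np.random.randn(8, 5)
s = np.linalg.svd(X, compute_uv=False)    # singular values of X

trace_norm = s.sum()                      # 1-norm of the singular values
print(trace_norm)
print(np.linalg.norm(X, ord='nuc'))       # NumPy's nuclear norm agrees
```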
Two Convex Formulations

    min_X   loss(X) + λ ||X||_*

    min_X   ||X||_*   subject to   loss(X) ≤ ε

Trace norm minimization is convex:
•  Can be solved by semi-definite programming
•  Computationally expensive
•  Recent, more efficient solvers:
   •  Singular value thresholding (Cai et al., 2008)
   •  Fixed point method (Ma et al., 2009)
   •  Accelerated gradient descent (Toh & Yun, 2009; Ji & Ye, 2009)
Trace Norm Minimization
•  Trace norm convex relaxation:

    min_X   ||X||_*
    s.t.    P_Ω(X) = P_Ω(Y)

   and, in the noisy case,

    min_X   (1/2) ||P_Ω(X) − P_Ω(Y)||_F² + λ ||X||_*

•  It can be solved by the sub-gradient method, the proximal gradient method, or the conditional gradient method.
•  Convergence speed: sub-linear.
•  Cost per iteration: a truncated SVD, or a top-SVD for the conditional gradient (Frank-Wolfe) method.

Ref: 1. Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
2. Jaggi, M. and Sulovský, M. A simple algorithm for nuclear norm regularized problems. In ICML, 2010.
Gradient Descent for the Composite Model
(Nesterov, 2007; Beck and Teboulle, 2009)

    min_x   f(x) = loss(x) + λ · penalty(x)

•  Repeat until convergence: replace the smooth loss by its first-order Taylor expansion plus a quadratic term, keep the nonsmooth regularization term, and solve the resulting proximal subproblem

    x_{k+1} = argmin_x  loss(x_k) + ⟨∇loss(x_k), x − x_k⟩ + (L/2) ||x − x_k||² + λ · penalty(x)

•  Convergence rate: O(1/k²) for the accelerated variant.
•  Key question: can the proximal operator be computed efficiently?
Proximal Operator Associated with Trace Norm
•  Optimization problem:

    min_X   (1/2) ||X − Z||_F² + λ ||X||_*

•  Associated proximal operator: prox_{λ||·||_*}(Z).
•  Closed form solution: with the SVD Z = U Σ V^T,

    prox_{λ||·||_*}(Z) = U · max(Σ − λI, 0) · V^T,

   i.e., soft-thresholding the singular values of Z by λ.
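The sketch below (NumPy; the helper names prox_trace_norm and proximal_gradient_mc are placeholders, not from the slides) implements the singular value thresholding prox and plugs it into a plain proximal gradient loop for the noisy matrix completion objective:

```python
import numpy as np

def prox_trace_norm(Z, lam):
    """Proximal operator of lam * ||X||_*: soft-threshold the
    singular values of Z by lam (singular value thresholding)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def proximal_gradient_mc(Y, mask, lam=0.1, step=1.0, iters=200):
    """Proximal gradient sketch for
    min_X 0.5 * ||P_Omega(X) - P_Omega(Y)||_F^2 + lam * ||X||_*."""
    X = np.zeros_like(Y)
    for _ in range(iters):
        grad = np.where(mask, X - Y, 0.0)            # gradient of the smooth part
        X = prox_trace_norm(X - step * grad, step * lam)
    return X
```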
A Non-convex Formulation via Matrix Factorization
•  A rank-r matrix X (m × n) can be written as a product of two smaller matrices U (m × r) and V (n × r):

    X = U V^T

•  The trace norm has the variational characterization

    ||X||_* = min_{X = U V^T}  (1/2) (||U||_F² + ||V||_F²).
Alternating Optimization

    min_{U,V}   (1/2) ||P_Ω(U V^T) − P_Ω(Y)||_F² + (λ/2) (||U||_F² + ||V||_F²)

The problem is non-convex.
•  Can be solved via
   •  Alternating minimization (Jain et al., 2012)
   •  Augmented Lagrangian (Wen et al., 2007)
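A minimal NumPy sketch of the alternating minimization idea (assuming the regularized objective above; the function name alt_min and the ridge-style row updates are illustrative, not the authors' implementation): with one factor fixed, each row of the other factor is a small regularized least squares problem over that row's observed entries.

```python
import numpy as np

def alt_min(Y, mask, r=5, lam=0.1, iters=20):
    """Alternating minimization sketch for
    min_{U,V} 0.5*||P_Omega(U V^T - Y)||_F^2 + (lam/2)*(||U||_F^2 + ||V||_F^2)."""
    m, n = Y.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((m, r))
    V = rng.standard_normal((n, r))
    for _ in range(iters):
        for i in range(m):                            # update row i of U
            obs = mask[i]
            A = V[obs].T @ V[obs] + lam * np.eye(r)
            U[i] = np.linalg.solve(A, V[obs].T @ Y[i, obs])
        for j in range(n):                            # update row j of V
            obs = mask[:, j]
            A = U[obs].T @ U[obs] + lam * np.eye(r)
            V[j] = np.linalg.solve(A, U[obs].T @ Y[obs, j])
    return U, V
```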
Summary of Two Approaches
•  Trace norm convex relaxation:

    min_X   ||X||_*   s.t.   P_Ω(X) = P_Ω(Y)

   and, in the noisy case,

    min_X   (1/2) ||P_Ω(X) − P_Ω(Y)||_F² + λ ||X||_*

   with the projection operator

    [P_Ω(X)]_ij = x_ij  if (i, j) ∈ Ω,  0 otherwise.

•  Bilinear non-convex relaxation with X = U V^T (U: m × r, V: n × r):

    min_{U,V}   ||P_Ω(U V^T) − P_Ω(Y)||_F²
Rank-One Matrix Space
•  SVD: X = Σ_{i=1..r} σ_i u_i v_i^T
•  More generally, a matrix can be expanded over rank-one matrices: X = Σ_{i∈I} θ_i M_i
•  Rank-one matrices with unit norm serve as atoms: M = u v^T ∈ ℝ^{n×m} with u ∈ ℝ^n, v ∈ ℝ^m.
Matrix Completion in Rank-One Matrix Space
•  Matrix completion in rank-one matrix space:

    min_{θ ∈ ℝ^I, {M_i}}   ||θ||_0
    s.t.                   P_Ω(X(θ)) = P_Ω(Y)

   with the estimated matrix in the rank-one matrix space given by X(θ) = Σ_{i∈I} θ_i M_i.

•  Reformulation in the noisy case:

    min_θ   ||P_Ω(X(θ)) − P_Ω(Y)||_F²
    s.t.    ||θ||_0 ≤ r

We solve this problem using an orthogonal matching pursuit type greedy algorithm. The candidate set is an infinite set composed of all rank-one matrices M ∈ ℝ^{n×m}.
Orthogonal Matching Pursuit
•  A greedy algorithm that iteratively solves an optimization problem whose solution is spanned by bases from a given (over-complete) dictionary D = {d^(1), d^(2), …, d^(T)}:

    min_x̂   ||x − x̂||²   s.t.   x̂ = Σ_{i=1..r} θ_i d_i

Iteration k:
  Step 1 (basis selection): with residual r = x − Σ_{i=1..k−1} θ_i d_i,

    d_k = argmax_{d ∈ D} |⟨r, d⟩|

  Step 2 (orthogonal projection):

    θ = argmin_θ ||x − Σ_{i=1..k} θ_i d_i||²,   x̂ = Σ_{i=1..k} θ_i d_i
Compressive Sensing
•  When data is sparse/compressible, one can directly acquire a condensed representation: a small number of measurements y = Φx of a sparse signal x with few nonzero entries.
(Baraniuk et al., 2012)
Convex Formulation
•  Given random measurements y = Φx of a sparse signal x with few nonzero entries, signal recovery is possible via ℓ1 optimization [Candès, Romberg, Tao; Donoho].
(Baraniuk et al., 2012)
Greedy Algorithms
•  From random measurements of a sparse signal, recovery is also possible via iterative greedy algorithms:
   •  (orthogonal) matching pursuit [Gilbert, Tropp]
   •  iterated thresholding [Nowak, Figueiredo; Kingsbury, Reeves; Daubechies, Defrise, De Mol; Blumensath, Davies; …]
   •  CoSaMP [Needell and Tropp]
(Baraniuk et al., 2012)
Greedy Recovery Algorithm (1)
•  Consider the measurement model y = Φx, where x is a 1-sparse signal.
•  Can we recover the support?
(Baraniuk et al., 2012)

Greedy Recovery Algorithm (2)
•  If Φ = [φ_1, φ_2, …, φ_N], then argmax_i |⟨φ_i, y⟩| gives the support of x.
•  How to extend to K-sparse signals?
(Baraniuk et al., 2012)
Greedy Recovery Algorithm (3)
•  For a K-sparse signal x with y = Φx, iterate:
   •  residual: r = y − Φ x̂_{k−1}
   •  find atom: k = argmax_i |⟨φ_i, r⟩|
   •  add atom to support: S = S ∪ {k}
   •  signal estimate: x̂_k = (Φ_S)† y
(Baraniuk et al., 2012)
Orthogonal Matching Pursuit
goal: given y = Φx, recover a sparse x (the columns of Φ are unit-norm)
initialize: x̂_0 = 0, r = y, Λ = {}, i = 0
iteration:
  i = i + 1
  b = Φ^T r
  k = argmax_j |b(j)|                    (find the atom most correlated with the residual)
  Λ = Λ ∪ {k}
  x̂_i|_Λ = (Φ|_Λ)† y,  x̂_i|_Λᶜ = 0       (update signal estimate)
  r = y − Φ x̂_i                          (update residual)
(Baraniuk et al., 2012)
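A compact NumPy sketch of the OMP loop above (the function name omp and the test signal are illustrative, not from the slides):

```python
import numpy as np

def omp(Phi, y, K):
    """Orthogonal matching pursuit: recover a K-sparse x from y = Phi @ x,
    assuming the columns of Phi are (approximately) unit-norm."""
    N = Phi.shape[1]
    support, r = [], y.copy()
    x_hat = np.zeros(N)
    for _ in range(K):
        b = Phi.T @ r                             # correlate residual with every atom
        k = int(np.argmax(np.abs(b)))             # atom most correlated with the residual
        if k not in support:
            support.append(k)
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        x_hat[:] = 0.0
        x_hat[support] = coef                     # orthogonal projection onto chosen atoms
        r = y - Phi @ x_hat                       # update the residual
    return x_hat

# Usage: random Gaussian measurements of a 3-sparse signal.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((60, 200))
Phi /= np.linalg.norm(Phi, axis=0)
x = np.zeros(200)
x[[5, 42, 137]] = [1.0, -2.0, 0.5]
x_rec = omp(Phi, Phi @ x, 3)
print(np.max(np.abs(x_rec - x)))                  # close to zero when recovery succeeds
```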
Orthogonal Rank-One Matrix Pursuit for Matrix Completion
•  Matrix completion in rank-one matrix space:

    min_θ   ||P_Ω(X(θ)) − P_Ω(Y)||_F²
    s.t.    ||θ||_0 ≤ r,   with   X(θ) = Σ_{i∈I} θ_i M_i

We solve this problem using an orthogonal matching pursuit type greedy algorithm. The candidate set is an infinite set composed of all rank-one matrices.
Top-SVD: Rank-One Matrix Basis
Step 1 (basis construction): with the residual matrix R = Y_Ω − X_Ω,

    [u*, v*] = argmax_{||u||=1, ||v||=1} ⟨R, u v^T⟩ = argmax_{||u||=1, ||v||=1} u^T R v,
    M = u* v*^T.

The basis M is selected from the infinitely large set of all rank-one matrices with unit norm; the maximization is exactly a top-SVD of R.
Rank-One Matrix Pursuit Algorithm
•  Step 1: construct the optimal rank-one matrix basis

    M_{k+1} = u* v*^T,   [u*, v*] = argmax_{u,v} ⟨(Y − X_k)_Ω, u v^T⟩

   This is the top singular vector pair, which can be computed efficiently by the power method. It generalizes OMP to the infinite dictionary of all rank-one matrices M ∈ ℝ^{n×m}.

•  Step 2: calculate the optimal weights for the current bases

    θ^k = argmin_{θ ∈ ℝ^k} ||Σ_i θ_i (M_i)_Ω − Y_Ω||²

   This is a least squares problem, which can be solved incrementally.
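A minimal NumPy sketch of these two steps (the names top_singular_pair and r1mp are placeholders; for brevity the weights are refit with a full least squares call rather than the incremental update used in the paper):

```python
import numpy as np

def top_singular_pair(R, iters=50):
    """Power iteration for the top singular vector pair of R."""
    v = np.random.default_rng(0).standard_normal(R.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = R @ v
        u /= np.linalg.norm(u)
        v = R.T @ u
        v /= np.linalg.norm(v)
    return u, v

def r1mp(Y, mask, rank=10):
    """Rank-one matrix pursuit sketch: greedily pursue the top rank-one
    basis of the current residual, then refit all weights by least
    squares on the observed entries."""
    y_obs = Y[mask]
    bases, bases_obs = [], []
    X = np.zeros_like(Y)
    for _ in range(rank):
        R = np.where(mask, Y - X, 0.0)        # residual on the observed entries
        u, v = top_singular_pair(R)
        M = np.outer(u, v)                    # new unit-norm rank-one basis
        bases.append(M)
        bases_obs.append(M[mask])
        A = np.column_stack(bases_obs)        # least squares over all current weights
        theta, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
        X = sum(t * Mi for t, Mi in zip(theta, bases))
    return X
```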
Convergence Analysis
•  Linear convergence (R1MP): the residual admits a linearly decreasing upper bound.

Theorem 3.1. The rank-one matrix pursuit algorithm satisfies

    ||R_k|| ≤ γ^{k−1} ||Y||_Ω,   ∀ k ≥ 1,

where γ is a constant in [0, 1).

•  A key property used in the proof: R_{k+1} is perpendicular to all previously generated bases, i.e., ⟨R_{k+1}, M_i⟩ = 0 for i = 1, …, k, which follows from the first-order optimality condition of the weight-fitting step.
•  This is significantly different from the standard MP/OMP algorithms with a finite dictionary, which are known to have a sub-linear convergence speed in the worst case.
•  At each iteration, we guarantee a significant reduction of the residual, which depends on the top singular vector pair pursuit step.

Z. Wang et al., ICML'14; SIAM J. Scientific Computing 2015.
Efficiency and Scalability
•  An efficient and scalable algorithm for matrix completion: Rank-One Matrix Pursuit
   •  Scalability: only a top-SVD per iteration
   •  Convergence: linear convergence rate

Z. Wang et al., ICML'14; SIAM J. Scientific Computing 2015.
Related Work
•  Atomic decomposition, X = Σ_{i∈I} θ_i M_i, can be solved by matching pursuit type algorithms.
•  Vs. the Frank-Wolfe algorithm (FW):
   Similarity: top-SVD per iteration.
   Difference: linear convergence vs. sub-linear convergence.
•  Vs. the existing greedy approach (ADMiRA):
   Similarity: linear convergence.
   Difference: (1) top-SVD vs. truncated SVD; (2) no extra condition needed for linear convergence.

Ref: Lee, K. and Bresler, Y. ADMiRA: atomic decomposition for minimum rank approximation. IEEE Trans. on Information Theory, 56(9):4402–4416, 2010.
Time and Storage Complexity
•  Time complexity:

                  R1MP              ADMiRA & AltMin    JS (FW)      Proximal       SVT
   Each iter.     O(|Ω|)            O(r|Ω|)            O(|Ω|)       O(r|Ω|)        O(r|Ω|)
   Iterations     O(log(1/ε))       O(log(1/ε))        O(1/ε)       O(1/√ε)        O(1/ε)
   Total          O(|Ω| log(1/ε))   O(r|Ω| log(1/ε))   O(|Ω|/ε)     O(r|Ω|/√ε)     O(r|Ω|/ε)

   R1MP combines the minimum per-iteration cost with linear convergence.

•  Storage complexity: R1MP keeps all pursued bases, which costs O(k|Ω|) and grows as k increases; a storage cost of O(|Ω|) is more suitable for large-scale problems.
Economic Rank-One Matrix Pursuit (ER1MP)
•  Step 1: find the optimal rank-one matrix basis

    M_{k+1} = u* v*^T,   [u*, v*] = argmax_{u,v} ⟨(Y − X_k)_Ω, u v^T⟩

•  Step 2: calculate the weights for only two matrices, the previous estimate and the new basis:

    α = argmin_{α ∈ ℝ²} ||α_1 (X_k)_Ω + α_2 (M_{k+1})_Ω − Y_Ω||²

   then set X_{k+1} = α_1 X_k + α_2 M_{k+1}, with θ_{k+1} = α_2 and θ_i ← α_1 θ_i for i ≤ k.

•  ER1MP uses the same amount of storage as the greedy algorithms (Jaggi & Sulovský, 2010; Tewari et al., 2011), which is significantly smaller than that required by the R1MP algorithm.
•  It retains the linear convergence rate:

Theorem 4.1. The economic rank-one matrix pursuit algorithm satisfies

    ||R_k|| ≤ γ̃^{k−1} ||Y||_Ω,   ∀ k ≥ 1,

where γ̃ is a constant in [0, 1).

(Wang et al., 2014)
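The economic variant replaces the growing least squares refit with a two-variable fit. A minimal NumPy sketch of that weight step (the function name er1mp_weight_update is a placeholder, not from the slides):

```python
import numpy as np

def er1mp_weight_update(X_prev, M_new, Y, mask):
    """ER1MP-style weight step: fit two scalars (a1, a2) so that
    a1 * X_prev + a2 * M_new matches Y on the observed entries,
    i.e. a least squares problem in only two variables."""
    A = np.column_stack([X_prev[mask], M_new[mask]])
    alpha, *_ = np.linalg.lstsq(A, Y[mask], rcond=None)
    X_new = alpha[0] * X_prev + alpha[1] * M_new
    return X_new, alpha
```

Because only the running estimate and scalar weights are updated (past weights are rescaled by α_1, as in Step 2 above), the storage stays at O(|Ω|) instead of O(k|Ω|).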
Experiments
•  Experiments
   •  Collaborative filtering
   •  Image recovery
   •  Convergence property
•  Competing algorithms
   •  singular value projection (SVP)
   •  trace norm minimization
   •  spectral regularization algorithm (SoftImpute)
   •  low rank matrix fitting (LMaFit)
   •  alternating minimization (AltMin)
   •  alternating optimization
   •  boosting type accelerated matrix-norm penalized solver (Boost)
   •  Jaggi's fast algorithm for trace norm constraint (JS)
   •  greedy efficient component optimization (GECO)
   •  atomic decomposition
   •  Rank-one matrix pursuit (R1MP)
   •  Economic rank-one matrix pursuit (ER1MP)
Convergence
[Figure 1: Residual curves of the Lenna image for R1MP and ER1MP in log scale, illustrating the linear convergence of the proposed rank-one matrix pursuit algorithms; the x-axis is the iteration (rank) and the y-axis is the RMSE on a logarithmic scale.]
Collaborative Filtering
Prediction accuracy in terms of RMSE (Boost fails on MovieLens10M):

Dataset          SVP      SoftImpute  LMaFit   AltMin   Boost    JS       GECO     R1MP     ER1MP
Jester1          4.7311   5.1113      4.7623   4.8572   5.1746   4.4713   4.3680   4.3418   4.3384
Jester2          4.7608   5.1646      4.7500   4.8616   5.2319   4.5102   4.3967   4.3649   4.3546
Jester3          8.6958   5.4348      9.4275   9.7482   5.3982   4.6866   5.1790   4.9783   5.0145
MovieLens100K    0.9683   1.0354      1.2308   1.0042   1.1244   1.0146   1.0243   1.0168   1.0261
MovieLens1M      0.9085   0.8989      0.9232   0.9382   1.0850   1.0439   0.9290   0.9595   0.9462
MovieLens10M     0.8611   0.8534      0.8625   0.9007   –        0.8728   0.8668   0.8621   0.8692
Running time for different algorithms
Running time (in seconds) of all methods on all recommendation datasets:

Dataset          SVP      SoftImpute  LMaFit   AltMin   Boost    JS       GECO     R1MP     ER1MP
Jester1          18.35    161.49      3.68     11.14    93.91    29.68    > 10^4   1.83     0.99
Jester2          16.85    152.96      2.42     10.47    261.70   28.52    > 10^4   1.68     0.91
Jester3          16.58    10.55       8.45     12.23    245.79   12.94    > 10^3   0.93     0.34
MovieLens100K    1.32     128.07      2.76     3.23     2.87     2.86     10.83    0.04     0.04
MovieLens1M      18.90    59.56       30.55    68.77    93.91    13.10    > 10^4   0.87     0.54
MovieLens10M     > 10^3   > 10^3      154.38   310.82   –        130.13   > 10^5   23.05    13.79
Summary
•  Matrix completion background
•  Trace norm convex formulation
•  Matrix factorization: non-convex formulation
•  Orthogonal rank-one matrix pursuit
   •  Efficient update: top-SVD
   •  Fast convergence: linear rate
•  Extensions
   •  Tensor completion
   •  Screening for matrices