Rank-One Matrix Pursuit for Matrix Completion
Jieping Ye, University of Michigan
Joint work with Zheng Wang, Ming-Jun Lai, Zhaosong Lu, Wei Fan, and Hasan Davulcu
First International Workshop on Machine Learning Methods for Recommender Systems, Vancouver, British Columbia, Canada, May 2nd, 2015.

Outline
- Background
- Trace norm formulation
- Matrix factorization
- Orthogonal rank-one matrix pursuit
- Evaluation
- Summary

Matrix Completion
Applications include image inpainting, microarray data imputation, video recovery, collaborative filtering, and SDN.

Collaborative Filtering
[Figure: an items-by-customers rating matrix in which most entries are unknown ("?").]
- Customers are asked to rank items.
- Not all customers ranked all items.
- Predict the missing rankings (98.9% of the entries are missing).

The Netflix Problem
[Figure: a movies-by-users rating matrix with sparsely observed entries.]
- About a million users and 25,000 movies.
- Known ratings are sparsely distributed.
- Preferences of users are determined by a small number of factors → low rank.

Matrix Rank
- The rank of a matrix is the number of linearly independent rows or columns.
- Via the singular value decomposition (SVD) X = U Σ V^T, the rank equals the number of nonzero singular values.

Low Rank Matrix Completion
Low rank matrix completion with incomplete observations can be formulated as

    min_X  rank(X)   s.t.  P_Ω(X) = P_Ω(Y),

with the projection operator defined as

    [P_Ω(X)]_ij = x_ij  if (i, j) ∈ Ω,   and   0  if (i, j) ∉ Ω.

Other Low-Rank Problems
- Multi-task / multi-class learning
- Image compression
- System identification in control theory
- The structure-from-motion problem in computer vision
- Low rank metric learning in machine learning
- Other settings:
  - a low-degree statistical model for a random process
  - a low-order realization of a linear system
  - a low-order controller for a plant
  - a low-dimensional embedding of data in Euclidean space

Two Formulations for Rank Minimization

    min_X  loss(X) + λ · rank(X)        or        min_X  rank(X)   s.t.  loss(X) ≤ ε,

where loss(X) = (1/2) ||P_Ω(X) − P_Ω(Y)||_F^2. Rank minimization is NP-hard.

Trace Norm (Nuclear Norm)
- The trace norm equals the 1-norm of the vector of singular values.
- The trace norm is the convex envelope of the rank function over the unit ball of the spectral norm ⇒ a convex relaxation.

Two Convex Formulations

    min_X  loss(X) + λ ||X||_*        or        min_X  ||X||_*   s.t.  loss(X) ≤ ε.

Trace norm minimization is convex:
- It can be solved by semi-definite programming, but that is computationally expensive.
- Recent, more efficient solvers:
  - singular value thresholding (Cai et al., 2008)
  - fixed point method (Ma et al., 2009)
  - accelerated gradient descent (Toh & Yun, 2009; Ji & Ye, 2009)

Trace Norm Minimization
Trace norm convex relaxation, in the noiseless and noisy cases:

    min_X  ||X||_*   s.t.  P_Ω(X) = P_Ω(Y)        and        min_X  (1/2) ||P_Ω(X) − P_Ω(Y)||_F^2 + λ ||X||_*.

It can be solved by the sub-gradient method, the proximal gradient method, or the conditional gradient method.
- Convergence speed: sub-linear.
- Per iteration: a truncated SVD (proximal methods) or a top-SVD (Frank-Wolfe).
Ref:
1. Candes, E. J. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
2. Jaggi, M. and Sulovsky, M. A simple algorithm for nuclear norm regularized problems. In ICML, 2010.

Gradient Descent for the Composite Model (Nesterov, 2007; Beck and Teboulle, 2009)

    min_x  f(x) = loss(x) + λ · penalty(x)

Repeat until "convergence": replace the smooth loss by its first-order Taylor expansion plus a quadratic proximal term, keep the nonsmooth regularization term, and solve the resulting subproblem, i.e., take a proximal step at every iteration. The achievable convergence rate hinges on one question: can the proximal operator be computed efficiently?
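As a concrete illustration of that proximal step (a minimal numpy sketch, not code from the talk), the snippet below performs one proximal gradient update for the trace-norm-regularized completion objective. It relies on the standard closed form recalled on the next slide, namely that the proximal operator of λ||·||_* soft-thresholds the singular values; the function names and the 0/1 `mask` encoding of Ω are illustrative choices.

    import numpy as np

    def prox_trace_norm(Z, tau):
        # Proximal operator of tau * ||.||_*: soft-threshold the singular values of Z.
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        return (U * np.maximum(s - tau, 0.0)) @ Vt

    def proximal_gradient_step(X, Y, mask, lam, step):
        # One step for min_X 0.5 * ||P_Omega(X - Y)||_F^2 + lam * ||X||_*.
        # The gradient of the smooth part is the residual restricted to Omega.
        grad = mask * (X - Y)
        return prox_trace_norm(X - step * grad, step * lam)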
Proximal Operator Associated with Trace Norm
- Optimization problem: min_X (1/2) ||X − Z||_F^2 + λ ||X||_*, i.e., the proximal operator of λ||·||_* evaluated at Z.
- Closed form solution: if Z = U Σ V^T is the SVD of Z, the minimizer is U max(Σ − λI, 0) V^T, obtained by soft-thresholding the singular values.

A Non-convex Formulation via Matrix Factorization
- A rank-r matrix X ∈ ℝ^{m×n} can be written as a product of two smaller matrices, X = U V^T with U ∈ ℝ^{m×r} and V ∈ ℝ^{n×r}.
- The trace norm admits the variational form

    ||X||_* = min_{X = U V^T} (1/2) (||U||_F^2 + ||V||_F^2).

Alternating Optimization

    min_{U,V}  (1/2) ||P_Ω(U V^T) − P_Ω(Y)||_F^2 + (λ/2) (||U||_F^2 + ||V||_F^2)

This formulation is non-convex. It can be solved via
- alternating minimization (Jain et al., 2012), or
- an augmented Lagrangian method (Wen et al., 2007).

Summary of Two Approaches
- Trace norm convex relaxation, in the noiseless and noisy cases:

    min_X  ||X||_*   s.t.  P_Ω(X) = P_Ω(Y)        and        min_X  (1/2) ||P_Ω(X) − P_Ω(Y)||_F^2 + λ ||X||_*,

  with the projection operator [P_Ω(X)]_ij = x_ij if (i, j) ∈ Ω and 0 if (i, j) ∉ Ω.
- Bilinear non-convex relaxation with X = U V^T (U ∈ ℝ^{m×r}, V ∈ ℝ^{n×r}):

    min_{U,V}  ||P_Ω(U V^T) − P_Ω(Y)||_F^2.

Rank-One Matrix Space
The SVD X = Σ_{i=1}^r σ_i u_i v_i^T expresses X as a combination of rank-one matrices; more generally we write X = Σ_{i∈I} θ_i M_i, using rank-one matrices with unit norm as atoms: M = u v^T ∈ ℝ^{n×m} with u ∈ ℝ^n and v ∈ ℝ^m.

Matrix Completion in Rank-One Matrix Space
Matrix completion in the rank-one matrix space:

    min_{θ, {M_i}}  ||θ||_0   s.t.  P_Ω(X(θ)) = P_Ω(Y),

with the estimated matrix expressed in the rank-one matrix space as X(θ) = Σ_{i∈I} θ_i M_i.
Reformulation in the noisy case:

    min_θ  ||P_Ω(X(θ)) − P_Ω(Y)||_F^2   s.t.  ||θ||_0 ≤ r.

We solve this problem using an orthogonal matching pursuit type greedy algorithm. The candidate set is an infinite set composed of all rank-one matrices M ∈ ℝ^{n×m}.

Orthogonal Matching Pursuit
A greedy algorithm that iteratively solves an optimization problem whose solution is spanned by the bases in a given (over-complete) dictionary D = {d^(1), d^(2), ..., d^(T)}:

    min_x̂  ||x − x̂||   s.t.  x̂ = Σ_{i=1}^r θ_i d_i.

Iteration k:
- Step 1 (basis selection): d_k = argmax_{d ∈ D} ⟨r, d⟩, where r = x − Σ_{i=1}^{k−1} θ_i d_i is the current residual.
- Step 2 (orthogonal projection): θ = argmin_θ ||x − Σ_{i=1}^k θ_i d_i||, and update x̂ = Σ_{i=1}^k θ_i d_i.

Compressive Sensing
When data is sparse/compressible, one can directly acquire a condensed representation: a small number of measurements of a sparse signal with few nonzero entries. (Baraniuk et al., 2012)

Convex Formulation
Recovery of a sparse signal from random measurements via convex (ℓ1) optimization [Candes, Romberg, Tao; Donoho]. (Baraniuk et al., 2012)

Greedy Algorithms
Recovery of a sparse signal from random measurements via iterative greedy algorithms:
- (orthogonal) matching pursuit [Gilbert, Tropp]
- iterated thresholding [Nowak, Figueiredo; Kingsbury, Reeves; Daubechies, Defrise, De Mol; Blumensath, Davies; ...]
- CoSaMP [Needell and Tropp]
(Baraniuk et al., 2012)

Greedy Recovery Algorithm (1)
Consider the measurement model y = Φx, where the signal x is 1-sparse. Can we recover the support? (Baraniuk et al., 2012)

Greedy Recovery Algorithm (2)
If Φ = [φ_1, φ_2, ..., φ_N], then argmax_i |⟨φ_i, y⟩| gives the support of x. How to extend this to K-sparse signals? (Baraniuk et al., 2012)

Greedy Recovery Algorithm (3)
For a K-sparse signal, iterate:
- residual: r = y − Φ x̂_{k−1}
- find atom: k = argmax_i |⟨φ_i, r⟩|
- add atom to support: S = S ∪ {k}
- signal estimate: x̂_k = (Φ_S)† y
(Baraniuk et al., 2012)

Orthogonal Matching Pursuit
Goal: given y = Φx, recover a sparse x; the columns of Φ are unit-norm.
Initialize: x̂_0 = 0, r = y, Λ = {}, i = 0.
Iteration:
- i = i + 1
- b = Φ^T r, k = argmax{|b(1)|, |b(2)|, ..., |b(N)|}  (find the atom most correlated with the residual)
- Λ = Λ ∪ {k}
- (x̂_i)|_Λ = (Φ_Λ)† y, (x̂_i)|_{Λ^c} = 0  (update the signal estimate)
- r = y − Φ x̂_i  (update the residual)
(Baraniuk et al., 2012)
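To make the OMP recursion above concrete, here is a minimal numpy sketch of the vector-case algorithm (an illustration, not code taken from the slides); `Phi`, `y`, and the sparsity level `K` are assumed inputs, and the columns of `Phi` are assumed to be unit-norm.

    import numpy as np

    def omp(Phi, y, K):
        # Recover a K-sparse x from y = Phi @ x.
        n = Phi.shape[1]
        support, residual = [], y.copy()
        x_hat = np.zeros(n)
        for _ in range(K):
            # Atom most correlated with the current residual.
            k = int(np.argmax(np.abs(Phi.T @ residual)))
            if k not in support:
                support.append(k)
            # Orthogonal projection: least-squares fit on the selected support.
            coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
            x_hat = np.zeros(n)
            x_hat[support] = coef
            residual = y - Phi @ x_hat
        return x_hat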
Orthogonal Rank-One Matrix Pursuit for Matrix Completion
Matrix completion in the rank-one matrix space:

    min_θ  ||P_Ω(X(θ)) − P_Ω(Y)||_F^2   s.t.  ||θ||_0 ≤ r,    with  X(θ) = Σ_{i∈I} θ_i M_i.

We solve this problem using an orthogonal matching pursuit type greedy algorithm. The candidate set is an infinite set composed of all rank-one matrices.

Top-SVD: Rank-One Matrix Basis
Step 1 (basis construction): with the residual matrix R = Y_Ω − X_Ω,

    [u*, v*] = argmax_{||u||=1, ||v||=1} ⟨R, u v^T⟩ = argmax_{||u||=1, ||v||=1} u^T R v,    M = u* v*^T,

so M is selected from the infinite set of all rank-one matrices with unit norm; [u*, v*] is exactly the top singular vector pair of R (the top-SVD).
[Figure: picking the unit-norm rank-one matrix M = u*v*^T most correlated with the residual R out of the infinite set of rank-one matrices, via the top-SVD.]

Rank-One Matrix Pursuit Algorithm (R1MP)
- Step 1: construct the optimal rank-one matrix basis

    M_{k+1} = u* v*^T,    [u*, v*] = argmax_{u,v} ⟨(Y − X_k)_Ω, u v^T⟩.

  This is the top singular vector pair of the residual, which can be computed efficiently by the power method. It generalizes OMP to the infinite dictionary of all rank-one matrices M ∈ ℝ^{n×m}.
- Step 2: calculate the optimal weights for the current bases

    θ^k = argmin_{θ ∈ ℝ^k} ||Σ_i θ_i M_i − Y||_Ω^2.

  This is a least squares problem, which can be solved incrementally.
- Output: the constructed matrix Ŷ = Σ_{i=1}^k θ_i^k M_i.

Linear Convergence
There is a linear upper bound on the residual of the algorithm. Theorem 3.1: the rank-one matrix pursuit algorithm satisfies

    ||R_k|| ≤ γ^{k−1} ||Y||_Ω   for all k ≥ 1,

where γ is a constant in [0, 1). A key property behind the proof is that R_{k+1} is perpendicular to all previously generated bases, ⟨R_{k+1}, M_i⟩ = 0 for i = 1, ..., k, which follows from the first-order optimality condition of the weight-update step. This is significantly different from the standard MP/OMP algorithm with a finite dictionary, which is only known to converge sub-linearly in the worst case: here, each top singular vector pair pursuit step guarantees a significant reduction of the residual.
Ref: Z. Wang et al., ICML'14; SIAM J. Scientific Computing, 2015.

Efficiency and Scalability
Rank-One Matrix Pursuit is an efficient and scalable algorithm for matrix completion:
- Scalability: only a top-SVD per iteration.
- Convergence: linear rate.
Ref: Z. Wang et al., ICML'14; SIAM J. Scientific Computing, 2015.

Related Work
- The atomic decomposition X = Σ_{i∈I} θ_i M_i can be solved by matching pursuit type algorithms.
- Vs. the Frank-Wolfe algorithm (FW): similarity: a top-SVD per iteration; difference: linear convergence vs. sub-linear convergence.
- Vs. the existing greedy approach ADMiRA: similarity: linear convergence; differences: (1) top-SVD vs. truncated SVD per iteration; (2) no extra condition is needed for linear convergence.
Ref: Lee, K. and Bresler, Y. ADMiRA: atomic decomposition for minimum rank approximation. IEEE Trans. on Information Theory, 56(9):4402–4416, 2010.

Time and Storage Complexity
Time complexity:

                 R1MP               ADMiRA & AltMin     JS (FW)       Proximal       SVT
    Per iter.    O(|Ω|)             O(r|Ω|)             O(|Ω|)        O(r|Ω|)        O(r|Ω|)
    Iterations   O(log(1/ε))        O(log(1/ε))         O(1/ε)        O(1/√ε)        O(1/ε)
    Total        O(|Ω| log(1/ε))    O(r|Ω| log(1/ε))    O(|Ω|/ε)      O(r|Ω|/√ε)     O(r|Ω|/ε)

R1MP combines the minimum per-iteration cost with linear convergence. Its storage complexity, however, is O(k|Ω|), which grows as the number of pursued bases k increases; a method with O(|Ω|) storage is more suitable for large-scale problems.
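Putting the two R1MP steps together, here is a minimal dense-matrix sketch (illustrative, not the authors' implementation): the top singular vector pair of the observed residual is approximated with a simple power method, and the weights are refit by least squares over the observed entries. A production implementation would exploit the sparsity of the residual and update the least-squares solution incrementally, as noted above; the helper names and the 0/1 `mask` encoding of Ω are assumptions.

    import numpy as np

    def top_singular_pair(R, iters=50):
        # Approximate the top singular vector pair of R by the power method.
        v = np.random.default_rng(0).standard_normal(R.shape[1])
        v /= np.linalg.norm(v)
        u = R @ v
        for _ in range(iters):
            u = R @ v
            u /= np.linalg.norm(u)
            v = R.T @ u
            v /= np.linalg.norm(v)
        return u, v

    def r1mp(Y, mask, rank):
        # Rank-one matrix pursuit sketch: Y is observed only where mask == 1.
        bases, X = [], np.zeros(Y.shape)
        obs = mask > 0
        for _ in range(rank):
            # Step 1: best rank-one basis of the observed residual (top-SVD).
            u, v = top_singular_pair(mask * (Y - X))
            bases.append(np.outer(u, v))
            # Step 2: least-squares weights for all selected bases, on Omega only.
            A = np.stack([M[obs] for M in bases], axis=1)
            theta, *_ = np.linalg.lstsq(A, Y[obs], rcond=None)
            X = sum(t * M for t, M in zip(theta, bases))
        return X

Keeping every selected basis (the `bases` list) is what drives the O(k|Ω|) storage noted above, which motivates the economic variant that follows.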
Economic Rank-One Matrix Pursuit (ER1MP)
- Step 1: find the optimal rank-one matrix basis, exactly as in R1MP:

    M_{k+1} = u* v*^T,    [u*, v*] = argmax_{u,v} ⟨(Y − X_k)_Ω, u v^T⟩.

- Step 2: calculate the weights for only two matrices, the current estimate and the new basis:

    α^k = argmin_{α ∈ ℝ^2} ||α_1 X_k + α_2 M_{k+1} − Y||_Ω^2,

  then set X_{k+1} = α_1^k X_k + α_2^k M_{k+1}, which corresponds to the weight update θ_{k+1} = α_2^k and θ_i ← α_1^k θ_i for i ≤ k.
- The proposed economic rank-one matrix pursuit algorithm (ER1MP) uses the same amount of storage as the greedy algorithms of Jaggi & Sulovsky (2010) and Tewari et al. (2011), which is significantly smaller than that required by the R1MP algorithm.
- ER1MP still retains the linear convergence rate. Theorem 4.1: the economic rank-one matrix pursuit algorithm satisfies ||R_k|| ≤ γ̃^{k−1} ||Y||_Ω for all k ≥ 1, where γ̃ is a constant in [0, 1); the proof is provided in the long version of the paper (Wang et al., 2014).
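A minimal sketch of the ER1MP weight update just described (illustrative, not the authors' code): each iteration solves only a two-variable least squares problem over the observed entries, and all previous weights are implicitly rescaled by α_1 because the running estimate X_k already aggregates the earlier bases. The 0/1 `mask` encoding of Ω and the function name are assumptions.

    import numpy as np

    def er1mp_update(Xk, Mk1, Y, mask):
        # Solve min over (a1, a2) of || a1 * Xk + a2 * Mk1 - Y ||_Omega^2
        # and return the updated estimate a1 * Xk + a2 * Mk1 with (a1, a2).
        obs = mask > 0
        A = np.stack([Xk[obs], Mk1[obs]], axis=1)   # |Omega| x 2 design matrix
        alpha, *_ = np.linalg.lstsq(A, Y[obs], rcond=None)
        a1, a2 = alpha
        return a1 * Xk + a2 * Mk1, (a1, a2)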
Experiments
- Tasks: collaborative filtering, image recovery, and convergence behavior.
- Competing algorithms:
  - singular value projection (SVP)
  - spectral regularization algorithm for trace norm minimization (SoftImpute)
  - low rank matrix fitting (LMaFit)
  - alternating minimization (AltMin)
  - boosting type accelerated matrix-norm penalized solver (Boost)
  - Jaggi's fast algorithm for the trace norm constraint (JS)
  - greedy efficient component optimization (GECO)
  - rank-one matrix pursuit (R1MP)
  - economic rank-one matrix pursuit (ER1MP)

Convergence
[Figure 1: residual (RMSE) curves of the Lenna image for R1MP and ER1MP, plotted in log scale against the iteration (rank), illustrating the linear convergence of both rank-one matrix pursuit algorithms.]

Collaborative Filtering
Prediction accuracy in terms of RMSE (Table 3; Boost fails on MovieLens10M):

    Dataset         SVP     SoftImpute  LMaFit  AltMin  Boost   JS      GECO    R1MP    ER1MP
    Jester1         4.7311  5.1113      4.7623  4.8572  5.1746  4.4713  4.3680  4.3418  4.3384
    Jester2         4.7608  5.1646      4.7500  4.8616  5.2319  4.5102  4.3967  4.3649  4.3546
    Jester3         8.6958  5.4348      9.4275  9.7482  5.3982  4.6866  5.1790  4.9783  5.0145
    MovieLens100K   0.9683  1.0354      1.2308  1.0042  1.1244  1.0146  1.0243  1.0168  1.0261
    MovieLens1M     0.9085  0.8989      0.9232  0.9382  1.0850  1.0439  0.9290  0.9595  0.9462
    MovieLens10M    0.8611  0.8534      0.8625  0.9007  –       0.8728  0.8668  0.8621  0.8692

Running time for different algorithms (Table 4, measured in seconds, on all recommendation datasets):

    Dataset         SVP      SoftImpute  LMaFit  AltMin  Boost   JS      GECO    R1MP   ER1MP
    Jester1         18.35    161.49      3.68    11.14   93.91   29.68   > 10^4  1.83   0.99
    Jester2         16.85    152.96      2.42    10.47   261.70  28.52   > 10^4  1.68   0.91
    Jester3         16.58    10.55       8.45    12.23   245.79  12.94   > 10^3  0.93   0.34
    MovieLens100K   1.32     128.07      2.76    3.23    2.87    2.86    10.83   0.04   0.04
    MovieLens1M     18.90    59.56       30.55   68.77   93.91   13.10   > 10^4  0.87   0.54
    MovieLens10M    > 10^3   > 10^3      154.38  310.82  –       130.13  > 10^5  23.05  13.79

Summary
- Matrix completion background
- Trace norm convex formulation
- Matrix factorization: non-convex formulation
- Orthogonal rank-one matrix pursuit
  - Efficient update: top-SVD
  - Fast convergence: linear rate
- Extensions
  - Tensor completion
  - Screening for matrices