Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent
KDD 2011
Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis
Presenter: Jiawen Yao, Dept. CSE, UT Arlington

Outline
• Matrix Factorization
• Stochastic Gradient Descent
• Distributed SGD
• Experiments
• Summary

Matrix Completion

Matrix Factorization
• As Web 2.0 and enterprise-cloud applications proliferate, data mining becomes increasingly important.
• A typical application involves:
  • a set of users
  • a set of items (movies, books, products, …)
  • feedback (ratings, purchases, tags, …)
• Goal: predict additional items a user may like.
• Assumption: similar feedback implies similar taste.
• This leads to matrix factorization.

Matrix Factorization: Example (Netflix competition)
• 500k users, 20k movies, 100M movie ratings, 3M question marks

              Avatar   The Matrix   Up
  Alice         ?          4         2
  Bob           3          2         ?
  Charlie       5          ?         3

• The goal is to predict the missing entries (denoted by ?).

Matrix Factorization
• A general machine learning problem: recommender systems, text indexing, face recognition, …
• Training data:
  • V: m × n input matrix (e.g., the rating matrix)
  • Z: training set of indexes in V (e.g., the subset of known ratings)
• Output: an approximation V ≈ WH with the smallest loss,
  argmin_{W,H} L(V, W, H)

The Loss Function
• Nonzero squared loss:
  L_NZSL = Σ_{(i,j): V_ij ≠ 0} (V_ij − [WH]_ij)²
• More generally, a sum of local losses over the entries of V in the training set:
  L = Σ_{(i,j)∈Z} l(V_ij, W_i*, H_*j)
• Focus on the class of nonzero decompositions:
  Z = { (i,j) : V_ij ≠ 0 }

The Loss Function
• Find the best model:
  argmin_{W,H} Σ_{(i,j)∈Z} L_ij(W_i*, H_*j)
  where W_i* is row i of W and H_*j is column j of H.
• To avoid trivialities, assume there is at least one training point in every row and in every column.

Prior Work
• Specialized algorithms
  • designed for a small class of loss functions, e.g., the GKL (generalized KL-divergence) loss
• Generic algorithms
  • handle all differentiable loss functions that decompose into summation form
  • distributed gradient descent (DGD), partitioned SGD (PSGD)
  • the proposed method (DSGD) belongs to this class

Successful Applications
• Movie recommendation: >12M users, >20k movies, 2.4B ratings; 36GB data, 9.2GB model
• Website recommendation: 51M users, 15M URLs, 1.2B clicks; 17.8GB data, 161GB metadata, 49GB model
• News personalization

Stochastic Gradient Descent
• Find a minimum θ* of a function L.
• Pick a starting point θ_0 and approximate the gradient L'(θ_0).
• Iterate the stochastic difference equation
  θ_{n+1} = θ_n − ε_n L'(θ_n)
• Under certain conditions, this asymptotically approximates (continuous) gradient descent.

Stochastic Gradient Descent for Matrix Factorization
• Set θ = (W, H) and use
  L(θ) = Σ_{(i,j)∈Z} L_ij(W_i*, H_*j)
  L'(θ) = Σ_{(i,j)∈Z} L'_ij(W_i*, H_*j)
  L'(θ, z) = N · L'_z(W_{i_z}*, H_*{j_z})
  where N = |Z| and the training point z is chosen randomly from the training set Z = { (i,j) : V_ij ≠ 0 }.

SGD for Matrix Factorization
Input: a training set Z and initial values W_0 and H_0
1. Pick a random entry z ∈ Z.
2. Compute the approximate gradient L'(θ, z).
3. Update the parameters: θ_{n+1} = θ_n − ε_n L'(θ_n, z).
4. Repeat N times.
In practice, an additional projection
  θ_{n+1} = Π_H [ θ_n − ε_n L'(θ_n, z) ]
is used to keep the iterate in a given constraint set, e.g., H = {θ : θ ≥ 0}.
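To make the loop concrete, here is a minimal Python/NumPy sketch of the algorithm above for the nonzero squared loss. The rank r, the step size, the entry format, and applying the nonnegativity projection after every update are illustrative assumptions rather than the authors' implementation; the scaling factor N = |Z| is absorbed into the step size.

```python
import numpy as np

def sgd_matrix_factorization(Z, m, n, r=10, epsilon=0.01, num_steps=100_000, seed=0):
    """Plain SGD for V ≈ WH under nonzero squared loss (illustrative sketch only).

    Z    : list of observed entries (i, j, V_ij), i.e. the training set.
    m, n : dimensions of the input matrix V.
    r    : rank of the factorization (assumed hyperparameter).
    """
    rng = np.random.default_rng(seed)
    W = rng.random((m, r))          # initial value W_0
    H = rng.random((r, n))          # initial value H_0

    for _ in range(num_steps):
        # 1. Pick a random entry z = (i, j, v) from the training set Z.
        i, j, v = Z[rng.integers(len(Z))]

        # 2. Approximate gradient of the local loss (V_ij - W_i* H_*j)^2;
        #    it involves only row W_i* and column H_*j.
        err = v - W[i, :] @ H[:, j]
        grad_W_row = -2.0 * err * H[:, j]
        grad_H_col = -2.0 * err * W[i, :]

        # 3. Update the parameters and project onto H = {θ : θ ≥ 0}.
        W[i, :] = np.maximum(W[i, :] - epsilon * grad_W_row, 0.0)
        H[:, j] = np.maximum(H[:, j] - epsilon * grad_H_col, 0.0)

    return W, H
```

Note that each step reads and writes only W_i* and H_*j; this locality is exactly what the distributed algorithm exploits later.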
SGD for Matrix Factorization: Why is stochastic good?
• Stochastic gradients are easy to obtain.
• The noise may help in escaping local minima.
• SGD exploits repetition within the data.

Distributed SGD (DSGD)
• SGD steps depend on each other:
  θ_{n+1} = θ_n − ε_n L'(θ_n, z)
• How can the computation be distributed?
• Parameter mixing (ISGD):
  • Map: run independent instances of SGD on subsets of the data.
  • Reduce: average the results once at the end.
  • Does not converge to the correct solution.
• Iterative parameter mixing (PSGD):
  • Map: run independent instances of SGD on subsets of the data (for some time).
  • Reduce: average the results after each pass over the data.
  • Converges slowly.

Stratified SGD
• The authors propose stratified SGD (SSGD) to obtain an efficient DSGD for matrix factorization:
  L(θ) = w_1 L_1(θ) + w_2 L_2(θ) + … + w_q L_q(θ)
  a weighted sum of per-stratum loss functions L_s, where each stratum s is a part (partition) of the dataset.
• SSGD runs standard SGD on a single stratum at a time, but switches strata in a way that guarantees correctness.

Stratified SGD Algorithm
• Given a stratum sequence {γ_n}, where each γ_n takes values in {1, …, q}, iterate
  θ_{n+1} = Π_H [ θ_n − ε_n L'_{γ_n}(θ_n) ]
• Appropriate sufficient conditions for the convergence of SSGD can be obtained from stochastic approximation theory.

Distributing SSGD: Problem Structure
• SGD steps depend on each other:
  θ_{n+1} = Π_H [ θ_n − ε_n L'_{γ_n}(θ_n) ]
• An SGD step on example z ∈ Z
  1. reads W_{i_z}* and H_*{j_z},
  2. performs the gradient computation L'_{i_z j_z}(W_{i_z}*, H_*{j_z}),
  3. updates W_{i_z}* and H_*{j_z}.
• However, not all steps are dependent on each other.

Interchangeability
• Definition 1. Two elements z_1, z_2 ∈ Z are interchangeable if they share neither a row nor a column.
• When z_n and z_{n+1} are interchangeable, the SGD steps satisfy
  θ_{n+1} = θ_n − ε L'(θ_n, z_n)
  θ_{n+2} = θ_n − ε L'(θ_n, z_n) − ε L'(θ_{n+1}, z_{n+1})
          = θ_n − ε L'(θ_n, z_n) − ε L'(θ_n, z_{n+1})
  i.e., the second step sees the same parameters whether or not the first step has already been applied, so the two steps can be swapped or run in parallel.
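As a quick sanity check of Definition 1 (a toy example with made-up entries, not taken from the slides or paper): two interchangeable training points touch disjoint rows of W and columns of H, so applying the two SGD steps in either order yields identical factors.

```python
import numpy as np

def sgd_step(W, H, entry, epsilon=0.01):
    """One SGD step for the nonzero squared loss; touches only W_i* and H_*j."""
    i, j, v = entry
    W, H = W.copy(), H.copy()
    w_row, h_col = W[i, :].copy(), H[:, j].copy()
    err = v - w_row @ h_col
    W[i, :] = w_row + 2.0 * epsilon * err * h_col   # θ_{n+1} = θ_n − ε L'(θ_n, z)
    H[:, j] = h_col + 2.0 * epsilon * err * w_row
    return W, H

rng = np.random.default_rng(0)
W0, H0 = rng.random((4, 2)), rng.random((2, 4))

z1 = (0, 1, 4.0)   # hypothetical rating V[0, 1] = 4
z2 = (2, 3, 3.0)   # shares neither row 0 nor column 1 with z1 → interchangeable

Wa, Ha = sgd_step(*sgd_step(W0, H0, z1), z2)   # apply z1, then z2
Wb, Hb = sgd_step(*sgd_step(W0, H0, z2), z1)   # apply z2, then z1
assert np.allclose(Wa, Wb) and np.allclose(Ha, Hb)   # identical results
```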
A Simple Case
• Denote by Z^b the set of training points in block Z_b. Suppose we run T steps of SSGD on Z, starting from some initial point θ_0 = (W_0, H_0) and using a fixed step size ε.
• Describe an instance of the SGD process by a training sequence ω = (z_0, z_1, …, z_{T−1}) of T training points, so that
  θ_{n+1}(ω) = θ_n(ω) + ε Y_n(ω)
  where Y_n(ω) is the update term of step n, and hence
  θ_T(ω) = θ_0 + ε Σ_{n=0}^{T−1} Y_n(ω)
• Consider the subsequence σ_b(ω) of training points drawn from block Z^b; its length is T_b(ω) = |σ_b(ω)|.
• The following theorem suggests that we can run SGD on each block independently and then sum up the results.
• Theorem 3. Using the definitions above,
  θ_T(ω) = θ_0 + ε Σ_{b=1}^{d} Σ_{k=0}^{T_b(ω)−1} Y_k(σ_b(ω))
• If we divide the work into d independent map tasks Γ_1, …, Γ_d, task Γ_b is responsible for subsequence σ_b(ω): it takes Z^b, W^b, and H^b as input and performs the block-local updates prescribed by σ_b(ω).

The General Case
• Theorem 3 can also be applied in a more general setting using "d-monomial" strata: a stratum consisting of d blocks that share no rows or columns (one block per row-block and per column-block), so that its blocks are pairwise interchangeable.

Exploitation
• Block and distribute the input matrix V.
• High-level approach (map only):
  1. Pick a "diagonal" (a stratum of interchangeable blocks).
  2. Run SGD on the diagonal (in parallel).
  3. Merge the results.
  4. Move on to the next "diagonal".
  Steps 1–3 form a cycle.
• Step 2 simulates sequential SGD:
  1. The blocks of a diagonal are interchangeable.
  2. Throw dice on how many iterations to run per block.
  3. Throw dice on which step sizes to use per block.
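To connect the pieces, here is a minimal, cluster-free sketch of the map-only scheme above. The round-robin choice of diagonals, the data layout, and the run_sgd_on_block callback are illustrative assumptions, not the authors' Hadoop implementation; the slides' "throw dice" steps (randomizing the number of iterations and the step size per block) are omitted for brevity.

```python
def dsgd_epoch(blocks, W_blocks, H_blocks, epsilon, run_sgd_on_block):
    """One DSGD cycle over the d 'diagonal' strata of a d × d blocking of V.

    blocks[a][b] : list of training points (i, j, v) falling into block Z^{ab}
    W_blocks[a]  : rows of W for row-block a;  H_blocks[b] : columns of H for column-block b
    run_sgd_on_block(block, Wb, Hb, epsilon) -> (Wb, Hb) : block-local SGD, e.g. the
        per-entry update from the earlier sketch restricted to one block
    """
    d = len(W_blocks)
    for s in range(d):
        # 1. Pick a "diagonal": one block per row-block and per column-block,
        #    so the d chosen blocks are pairwise interchangeable.
        stratum = [(a, (a + s) % d) for a in range(d)]

        # 2. Run SGD on the diagonal. The list comprehension below is sequential,
        #    but because the blocks touch disjoint parts of W and H, the d calls
        #    could execute on d machines in parallel (map-only step).
        results = [run_sgd_on_block(blocks[a][b], W_blocks[a], H_blocks[b], epsilon)
                   for (a, b) in stratum]

        # 3. Merge the results: copy the updated factor blocks back.
        for (a, b), (Wb, Hb) in zip(stratum, results):
            W_blocks[a], H_blocks[b] = Wb, Hb
        # 4. Move on to the next diagonal.
    return W_blocks, H_blocks
```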
Experiments
• Compared against PSGD, DGD, and ALS.
• Data:
  • Netflix competition: 100M ratings from 480k customers on 18k movies.
  • Synthetic dataset: 10M rows, 1M columns, 1B nonzero entries.

Experiments
• Tested on two well-known loss functions:
  L_NZSL = Σ_{(i,j): V_ij ≠ 0} (V_ij − [WH]_ij)²
  L_L2 = L_NZSL + λ ( ‖W‖_F² + ‖H‖_F² )
• The method works well on a variety of loss functions; results for other loss functions can be found in the authors' technical report.

Experiments
• The proposed DSGD converges faster and achieves better results than the other methods.
• The processing time remains constant as the data size increases.
• When scaling to very large datasets on larger clusters, the overall running time increases by a modest 30%.

Summary
• Matrix factorization
  • widely applicable via customized loss functions
  • large instances (millions × millions matrices with billions of entries)
• Distributed stochastic gradient descent
  • simple and versatile
• Achieves
  • fully distributed data and model
  • fully distributed processing
  • the same or better loss
  • faster convergence
  • good scalability

Thank you!