Matrix Factorization

Large-Scale Matrix Factorization
with Distributed Stochastic Gradient Descent
KDD 2011
Rainer Gemulla, Peter J. Haas, Erik Nijkamp and Yannis Sismanis
Presenter: Jiawen Yao
Dept. CSE, UT Arlington
Outline
• Matrix Factorization
• Stochastic Gradient Descent
• Distributed SGD
• Experiments
• Summary
Matrix completion
Matrix Factorization
• As Web 2.0 and enterprise-cloud applications proliferate, data mining becomes more important
• Real application:
  • Set of users
  • Set of items (movies, books, products, …)
  • Feedback (ratings, purchases, tags, …)
• Predict additional items a user may like
  • Assumption: similar feedback ⇒ similar taste
• Matrix Factorization
Matrix Factorization
• Example - Netflix competition
• 500k users, 20k movies, 100M movie ratings, 3M question marks

              Avatar   The Matrix   Up
  Alice          ?          4        2
  Bob            3          2        ?
  Charlie        5          ?        3
• The goal is to predict missing entries (denoted by ?)
Matrix Factorization
• A general machine learning problem
• Recommender systems, text indexing, face recognition,…
• Training data
𝑉: 𝑚 × 𝑛 input matrix (e.g., rating matrix)
Z: training set of indexes in 𝑉 (e.g., subset of known ratings)
• Output
  Find an approximation 𝑉 ≈ 𝑊𝐻 (with 𝑊: 𝑚 × 𝑟 and 𝐻: 𝑟 × 𝑛) that has the smallest loss:
  $\operatorname{argmin}_{W,H} L(V, W, H)$
The loss function
• Loss function
– Nonzero squared loss
  $L_{NZSL} = \sum_{(i,j):\, V_{ij} \neq 0} \left(V_{ij} - [WH]_{ij}\right)^2$
• A sum of local losses over the entries $V_{ij}$:
  $L = \sum_{(i,j) \in Z} l\left(V_{ij}, W_{i*}, H_{*j}\right)$
• Focus on the class of nonzero decompositions:
  $Z = \{(i,j) : V_{ij} \neq 0\}$
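To make the notation concrete, here is a minimal sketch (not from the paper) that evaluates $L_{NZSL}$ for the small rating matrix from the earlier slide, using random rank-2 factors; the variable names V, W, H, Z follow the slides, everything else is illustrative.

```python
# A minimal sketch (not from the paper): evaluate L_NZSL for the small rating
# matrix from the earlier slide with random rank-2 factors.
import numpy as np

V = np.array([[0., 4., 2.],     # Alice:   ?  4  2   (missing entries stored as 0)
              [3., 2., 0.],     # Bob:     3  2  ?
              [5., 0., 3.]])    # Charlie: 5  ?  3
rng = np.random.default_rng(0)
W = rng.random((3, 2))          # m x r row factors
H = rng.random((2, 3))          # r x n column factors

Z = list(zip(*np.nonzero(V)))   # training set: indexes of the known ratings
L_nzsl = sum((V[i, j] - W[i] @ H[:, j]) ** 2 for (i, j) in Z)
print(f"L_NZSL = {L_nzsl:.3f}")
```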
The loss function
• Find best model
$\operatorname{argmin}_{W,H} \sum_{(i,j) \in Z} L_{ij}(W_{i*}, H_{*j})$
– $W_{i*}$: row i of matrix W
– $H_{*j}$: column j of matrix H
– To avoid trivialities, we assume there is at least one training point in every row and in every column.
Prior Work
• Specialized algorithms
– Designed for a small class of loss functions
– e.g., GKL loss (generalized Kullback–Leibler divergence)
• Generic algorithms
– Handle all differentiable loss functions that decompose into summation form
– Distributed gradient descent (DGD), Partitioned SGD (PSGD)
– The proposed DSGD (this work)
Successful applications
• Movie recommendation
>12M users, >20k movies, 2.4B ratings
36GB data, 9.2GB model
• Website recommendation
51M users, 15M URLs, 1.2B clicks
17.8GB data, 161GB metadata, 49GB model
• News personalization
Stochastic Gradient Descent
Find the minimum 𝜃* of function L:
• Pick a starting point 𝜃0
• Approximate the gradient 𝐿′(𝜃0)
• Iterate the stochastic difference equation
  $\theta_{n+1} = \theta_n - \epsilon_n L'(\theta_n)$
Under certain conditions, this asymptotically approximates (continuous) gradient descent.
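As a toy illustration of the stochastic difference equation (not tied to the paper's setting), the following sketch minimizes $L(\theta) = \sum_i (\theta - x_i)^2$ by sampling one term per step with a decreasing step size; the data and the step-size schedule are made up for illustration.

```python
# Toy illustration (not from the paper): SGD via the stochastic difference
# equation theta_{n+1} = theta_n - eps_n * grad, minimizing
# L(theta) = sum_i (theta - x_i)^2, whose minimizer is the mean of the data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(3.0, 1.0, size=1000)          # synthetic data, minimizer ~ 3.0
theta = 0.0                                   # starting point theta_0
for n in range(5000):
    xi = x[rng.integers(len(x))]              # pick one term of the sum at random
    grad = 2.0 * (theta - xi)                 # noisy gradient estimate
    eps_n = 0.01 / (1.0 + n / 1000.0)         # decreasing step size
    theta -= eps_n * grad
print(theta, x.mean())                        # both close to 3.0
```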
Stochastic Gradient Descent for Matrix Factorization
Set 𝜃 = (𝑊, 𝐻) and use
$L(\theta) = \sum_{(i,j) \in Z} L_{ij}\left(W_{i*}, H_{*j}\right)$
$L'(\theta) = \sum_{(i,j) \in Z} L'_{ij}\left(W_{i*}, H_{*j}\right)$
$L'(\theta, z) = N\, L'_{z}\left(W_{i_z *}, H_{* j_z}\right)$
where N = |Z| and the training point z is chosen randomly from the training set
$Z = \{(i,j) : V_{ij} \neq 0\}$
Stochastic Gradient Descent for Matrix Factorization
SGD for Matrix Factorization
Input: A training set Z, initial values W0 and H0
1. Pick a random entry 𝑧 ∈ 𝑍
2. Compute the approximate gradient 𝐿′(𝜃, 𝑧)
3. Update the parameters: $\theta_{n+1} = \theta_n - \epsilon_n L'(\theta_n, z)$
4. Repeat N times
In practice, an additional projection is used,
$\theta_{n+1} = \Pi_H\left[\theta_n - \epsilon_n L'(\theta_n, z)\right]$
to keep the iterates in a given constraint set 𝐻 = {𝜃 : 𝜃 ≥ 0}.
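The following is a runnable sketch of the algorithm above for the NZSL loss; it is an assumed minimal implementation, not the authors' code, and the function name sgd_mf, the rank r, the step size, and the optional nonnegativity projection are illustrative choices.

```python
# A runnable sketch of SGD for matrix factorization with the NZSL loss
# (assumed minimal implementation, not the authors' code).
import numpy as np

def sgd_mf(V, Z, r=10, steps=100_000, eps=0.005, nonneg=False, seed=0):
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W, H = rng.random((m, r)), rng.random((r, n))   # initial values W0, H0
    for _ in range(steps):
        i, j = Z[rng.integers(len(Z))]          # 1. pick a random entry z in Z
        e = V[i, j] - W[i] @ H[:, j]            # 2. residual of (V_ij - W_i* H_*j)^2 ...
        gW = -2.0 * e * H[:, j]                 #    ... gradient w.r.t. W_i*
        gH = -2.0 * e * W[i]                    #    ... gradient w.r.t. H_*j
        W[i] -= eps * gW                        # 3. update only the touched parameters
        H[:, j] -= eps * gH
        if nonneg:                              # optional projection onto {theta >= 0}
            np.maximum(W[i], 0.0, out=W[i])
            np.maximum(H[:, j], 0.0, out=H[:, j])
    return W, H

# Example: complete the 3 x 3 rating matrix from the earlier slide.
V = np.array([[0., 4., 2.], [3., 2., 0.], [5., 0., 3.]])
Z = list(zip(*np.nonzero(V)))
W, H = sgd_mf(V, Z, r=2, steps=20_000)
print(np.round(W @ H, 1))                       # observed entries are reproduced closely
```

Note that each step reads and writes only row $W_{i*}$ and column $H_{*j}$; this locality is what the blocking argument later in the talk exploits.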
Stochastic Gradient Descent for Matrix Factorization
Why is stochastic good?
• Gradients are easy to obtain
• May help in escaping local minima
• Exploits repetition within the data
Distributed SGD (DSGD)
SGD steps depend on each other
𝜃𝑛+1 = 𝜃𝑛 − 𝜖𝑛 𝐿′ (𝜃𝑛 , 𝑧)
How to distribute?
• Parameter mixing (ISGD)
  – Map: Run independent instances of SGD on subsets of the data
  – Reduce: Average the results once at the end
  – Does not converge to the correct solution
• Iterative parameter mixing (PSGD)
  – Map: Run independent instances of SGD on subsets of the data (for some time)
  – Reduce: Average the results after each pass over the data
  – Converges slowly
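For concreteness, here is a rough sketch (assumed, not from the paper) of iterative parameter mixing (PSGD) for matrix factorization with the NZSL loss: each "map" runs an independent SGD pass on its partition of Z, and the "reduce" averages the parameter copies; averaging only once at the very end would correspond to ISGD. The helper names and the static partitioning are illustrative.

```python
# Rough sketch (assumed, not from the paper) of iterative parameter mixing (PSGD).
import numpy as np

def sgd_pass(V, Z_part, W, H, eps):
    """One SGD pass over the training points in Z_part (modifies W, H in place)."""
    for (i, j) in Z_part:
        e = V[i, j] - W[i] @ H[:, j]
        gW, gH = -2.0 * e * H[:, j], -2.0 * e * W[i]
        W[i] -= eps * gW
        H[:, j] -= eps * gH
    return W, H

def psgd(V, Z, W, H, workers=4, epochs=10, eps=0.005):
    parts = np.array_split(np.array(Z), workers)                        # static data partition
    for _ in range(epochs):
        results = [sgd_pass(V, p, W.copy(), H.copy(), eps) for p in parts]  # map
        W = np.mean([w for w, _ in results], axis=0)                    # reduce: average W copies
        H = np.mean([h for _, h in results], axis=0)                    # reduce: average H copies
    return W, H
```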
Stratified SGD
Propose stratified SGD (SSGD) to obtain an efficient DSGD for matrix factorization:
$L(\theta) = \omega_1 L_1(\theta) + \omega_2 L_2(\theta) + \dots + \omega_q L_q(\theta)$
i.e., the loss is a weighted sum of per-stratum losses $L_s$, where each stratum s is a part (partition) of the dataset.
SSGD runs standard SGD on a single stratum at a time, but switches strata in a way that guarantees correctness.
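For matrix factorization, the strata can be pictured as generalized diagonals of a d × d blocking of V. The sketch below is an assumed illustration, not the paper's code; it assigns each training point to one of d strata.

```python
# Assumed illustration (not the paper's code): block V into d x d blocks and
# assign each training point to one of d strata; stratum s collects the d
# pairwise interchangeable blocks (b, (b + s) mod d), i.e. a generalized diagonal.
def strata(Z, m, n, d):
    row_block = lambda i: i * d // m
    col_block = lambda j: j * d // n
    S = [[] for _ in range(d)]
    for (i, j) in Z:
        s = (col_block(j) - row_block(i)) % d
        S[s].append((i, j))
    return S
# With weights w_s = |Z_s| / |Z| and per-stratum losses rescaled by |Z| / |Z_s|,
# the total loss decomposes as the weighted sum L = sum_s w_s * L_s shown above.
```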
Stratified SGD algorithm
Suppose a stratum sequence {𝛾𝑛}, where each 𝛾𝑛 takes values in {1, …, q}. SSGD iterates
$\theta_{n+1} = \Pi_H\left[\theta_n - \epsilon_n L'_{\gamma_n}(\theta_n)\right]$
Appropriate sufficient conditions for the convergence of SSGD can be obtained from stochastic approximation theory.
Distributing SSGD
SGD steps depend on each other:
$\theta_{n+1} = \Pi_H\left[\theta_n - \epsilon_n L'_{\gamma_n}(\theta_n)\right]$
An SGD step on example 𝑧 ∈ 𝑍:
1. Reads $W_{i_z *}$ and $H_{* j_z}$
2. Performs the gradient computation $L'_{i_z j_z}\left(W_{i_z *}, H_{* j_z}\right)$
3. Updates $W_{i_z *}$ and $H_{* j_z}$
Problem Structure
Not all steps are dependent
Interchangeability
Definition 1. Two elements 𝑧1, 𝑧2 ∈ 𝑍 are interchangeable if they share neither a row nor a column.
When 𝑧𝑛 and 𝑧𝑛+1 are interchangeable, the two SGD steps can be swapped:
$\theta_{n+1} = \theta_n - \epsilon L'(\theta_n, z_n)$
$\theta_{n+2} = \theta_n - \epsilon L'(\theta_n, z_n) - \epsilon L'(\theta_{n+1}, z_{n+1}) = \theta_n - \epsilon L'(\theta_n, z_n) - \epsilon L'(\theta_n, z_{n+1})$
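Definition 1 translated directly into code (illustrative):

```python
# Two training points are interchangeable if they share neither row nor column,
# so their SGD steps touch disjoint rows of W and disjoint columns of H.
def interchangeable(z1, z2):
    (i1, j1), (i2, j2) = z1, z2
    return i1 != i2 and j1 != j2

assert interchangeable((0, 1), (2, 2))       # different row and column: steps commute
assert not interchangeable((0, 1), (0, 2))   # same row: both steps update W_0*
```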
A simple case
Denote by $Z^b$ the set of training points in block $\mathbf{Z}^b$. Suppose we run T steps of SSGD on Z, starting from some initial point 𝜃0 = (𝑊0, 𝐻0) and using a fixed step size 𝜖. Describe an instance of the SGD process by a training sequence 𝜔 = (𝑧0, 𝑧1, …, 𝑧𝑇−1) of T training points. Writing $Y_n(\omega)$ for the n-th SGD update term, the final state is
$\theta_T(\omega) = \theta_0 + \epsilon \sum_{n=0}^{T-1} Y_n(\omega)$
A simple case
$\theta_T(\omega) = \theta_0 + \epsilon \sum_{n=0}^{T-1} Y_n(\omega)$
Consider the subsequence $\sigma_b(\omega)$ of training points from block $\mathbf{Z}^b$, of length $T_b(\omega) = |\sigma_b(\omega)|$.
The following theorem suggests we can run SGD on each block independently and then sum up the results.
Theorem 3. Using the definitions above,
$\theta_T(\omega) = \theta_0 + \epsilon \sum_{b=1}^{d} \sum_{k=0}^{T_b(\omega)-1} Y_k\left(\sigma_b(\omega)\right)$
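A small numeric check of Theorem 3 in the simple case (two interchangeable diagonal blocks, fixed step size, NZSL loss); this is an illustrative script, not the authors' code, and it omits the scaling factor N in the stochastic gradient since it does not affect the identity.

```python
# Illustrative numeric check of Theorem 3 in the simple case (not the authors' code).
import numpy as np

rng = np.random.default_rng(1)
m = n = 4; r = 2; eps = 0.01
V = rng.random((m, n))
W0, H0 = rng.random((m, r)), rng.random((r, n))
# Block 1 covers rows/columns {0, 1}, block 2 covers rows/columns {2, 3};
# only the two diagonal blocks carry training points.
Z = [(i, j) for i in range(m) for j in range(n) if (i < 2) == (j < 2)]
omega = [Z[k] for k in rng.permutation(len(Z))]        # interleaved training sequence

def run(seq, W, H):
    """Run fixed-step SGD over the given training sequence (in place)."""
    for (i, j) in seq:
        e = V[i, j] - W[i] @ H[:, j]
        gW, gH = -2.0 * e * H[:, j], -2.0 * e * W[i]
        W[i] -= eps * gW
        H[:, j] -= eps * gH
    return W, H

W_seq, H_seq = run(omega, W0.copy(), H0.copy())        # sequential SGD on omega
W_sum, H_sum = W0.copy(), H0.copy()
for b in (1, 2):                                       # block-local runs, each from theta_0
    sigma_b = [z for z in omega if (z[0] < 2) == (b == 1)]
    Wb, Hb = run(sigma_b, W0.copy(), H0.copy())
    W_sum += Wb - W0                                   # add up the block-local updates
    H_sum += Hb - H0
print(np.allclose(W_seq, W_sum) and np.allclose(H_seq, H_sum))   # True
```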
A simple case
Divide the work into d independent map tasks Γ1, …, Γ𝑑. Task Γ𝑏 is responsible for subsequence $\sigma_b(\omega)$: it takes $Z^b$, $W^b$, and $H^b$ as input and performs the corresponding block-local updates.
The General Case
Theorem 3 can also be applied in the general case.
“d-monomial”
Exploitation
Block and distribute the input matrix 𝑉.
High-level approach (map only):
1. Pick a “diagonal” (a set of d pairwise interchangeable blocks)
2. Run SGD on the diagonal (in parallel)
3. Merge the results
4. Move on to the next “diagonal”
Steps 1–3 form a cycle.
Step 2: simulate sequential SGD
1. Use interchangeable blocks
2. Throw dice for how many iterations per block
3. Throw dice for which step sizes per block
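Below is a compact sketch of one DSGD cycle as described above (an assumed implementation, not the authors' code). The d tasks of a subepoch are executed sequentially here for clarity; on a cluster each block of a "diagonal" would run as its own map task. The per-block iteration-count and step-size randomization from Step 2 are omitted and a fixed step size is used instead.

```python
# Compact sketch of one DSGD cycle (assumed implementation, not the authors' code).
import numpy as np

def dsgd_epoch(V, Z, W, H, d, eps, rng):
    m, n = V.shape
    rb = lambda i: i * d // m                      # row-block index of row i
    cb = lambda j: j * d // n                      # column-block index of column j
    blocks = {}
    for (i, j) in Z:                               # group training points by block
        blocks.setdefault((rb(i), cb(j)), []).append((i, j))
    for s in map(int, rng.permutation(d)):         # 1./4. cycle through the d diagonals
        for b in range(d):                         # 2. blocks (b, (b+s) mod d) are
            for (i, j) in blocks.get((b, (b + s) % d), []):   # interchangeable
                e = V[i, j] - W[i] @ H[:, j]
                gW, gH = -2.0 * e * H[:, j], -2.0 * e * W[i]
                W[i] -= eps * gW
                H[:, j] -= eps * gH
        # 3. merge: on a cluster, the updated W- and H-blocks are collected here
    return W, H

# Tiny end-to-end run on the 3 x 3 example matrix from the earlier slides.
V = np.array([[0., 4., 2.], [3., 2., 0.], [5., 0., 3.]])
Z = list(zip(*np.nonzero(V)))
rng = np.random.default_rng(0)
W, H = rng.random((3, 2)), rng.random((2, 3))
for _ in range(2000):
    W, H = dsgd_epoch(V, Z, W, H, d=3, eps=0.01, rng=rng)
print(np.round(W @ H, 1))                          # observed ratings are recovered closely
```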
Experiments
• Compared with PSGD, DGD, and ALS methods
• Data
Netflix Competition
100M ratings from 480k customers on 18k movies
Synthetic dataset
10M rows, 1M columns, 1B nonzero entries
Experiments
Test on two well-known loss functions:
$L_{NZSL} = \sum_{(i,j):\, V_{ij} \neq 0} \left(V_{ij} - [WH]_{ij}\right)^2$
$L_{L2} = L_{NZSL} + \lambda\left(\|W\|_F^2 + \|H\|_F^2\right)$
DSGD works well on a variety of loss functions; results for other loss functions can be found in the authors' technical report.
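For reference, the two loss functions written as code (a sketch with assumed names nzsl and l2_loss; λ is a hyperparameter and its value here is arbitrary):

```python
import numpy as np

def nzsl(V, W, H):
    """Nonzero squared loss: squared error summed over the observed entries (V != 0)."""
    mask = V != 0
    return float(np.sum(((V - W @ H) ** 2)[mask]))

def l2_loss(V, W, H, lam=0.05):
    """L2-regularized variant: L_NZSL + lambda * (||W||_F^2 + ||H||_F^2)."""
    return nzsl(V, W, H) + lam * (np.sum(W ** 2) + np.sum(H ** 2))
```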
Experiments
The proposed DSGD converges faster and achieves better loss than the other methods.
Experiments
1. The processing time remains roughly constant as the data size increases
2. When scaling to very large datasets on larger clusters, the overall running time increases by a modest 30%
Summary
• Matrix factorization
• Widely applicable via customized loss functions
• Large instances (millions × millions matrices with billions of entries)
• Distributed Stochastic Gradient Descent
• Simple and versatile
• Achieves
• Fully distributed data/model
• Fully distributed processing
• Same or better loss
• Faster
• Good scalability
Thank you!