Recommender Systems

Content-based recommendations
with Poisson factorization
Joanna Misztal
Recommender systems
• Recommender Systems (RSs) are software tools and techniques providing suggestions for items likely to be of use to a user, taking into account:
– The user’s preferences
– Constraints
Netflix Prize
• Predict user ratings for films based on previous ratings, without any other information about the users or films
• Training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies
• Grand prize of US$1,000,000, awarded in 2009 (the planned 2010 sequel was cancelled after a privacy lawsuit)
Recommender systems
• Collaborative filtering – recommends items that were liked by users with similar preferences; based on ratings history
• Content-based – recommends items similar to those that the user liked in the past; based on item attributes
Content-based recommender systems
• Build a model of the user’s preferences based on the features of the items that they rated
• User’s interest in an object: match the user’s profile against item attributes
Content-based recommender systems
Advantages (+):
• User independence (from other users)
• Transparency (explanations)
• New item recommendation
Disadvantages (−):
• Limited content analysis
• Over-specialization (no unexpected results)
• New user problem
Collaborative filtering –
neighbourhood approach
• User-based approach:
– User Eric has to decide whether or not to rent the movie
“Titanic” that he has not yet seen.
– He knows that Lucy has very similar tastes when it comes to
movies, as both of them hated “The Matrix” and loved “Forrest
Gump”, so he asks her opinion on this movie.
– On the other hand, Eric finds out that he and Diane have different tastes: Diane likes action movies while he does not, so he discards her opinion or considers the opposite in his decision.
Collaborative filtering –
neighbourhood approach
• Item-based approach:
– Instead of consulting with his peers, Eric instead determines
whether the movie “Titanic” is right for him by considering the
movies that he has already seen.
– He notices that people who have rated this movie have given similar ratings to the movies “Forrest Gump” and “Wall-E”.
– Since Eric liked those two movies, he concludes that he will also like the movie “Titanic”.
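A minimal sketch of how such item-based scoring can be computed. The toy matrix and the choice of cosine similarity are illustrative assumptions, not the slides’ exact method:

```python
import numpy as np

# Rows are users, columns are items; 0 means "unrated".
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

def cosine_item_similarity(R):
    """Cosine similarity between item columns (zeros contribute nothing)."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0              # avoid division by zero
    return (R.T @ R) / np.outer(norms, norms)

def predict(R, user, item):
    """Similarity-weighted average of the user's ratings on other items."""
    sims = cosine_item_similarity(R)[item]
    rated = R[user] > 0
    rated[item] = False
    weights = sims[rated]
    if weights.sum() == 0:
        return R[R > 0].mean()           # fall back to the global mean
    return (weights @ R[user, rated]) / weights.sum()

print(predict(R, user=1, item=1))        # predicted score for U2 on D2
```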
Matrix factorization
      D1   D2   D3   D4
U1     5    3    -    1
U2     4    -    -    1
U3     1    1    -    5
U4     1    -    -    4
U5     -    1    5    4
• Task: fill in the missing entries
• Assume the ratings are explained by a small number of latent features (fewer than the number of users and items)
Matrix factorization
• R – the |U| × |D| matrix of ratings
• We want to discover K latent features
• For K = 2: P is a |U| × K matrix, Q is a |D| × K matrix, and R ≈ P × Qᵀ
• Each row of P – how strongly a user exhibits each feature
• Each row of Q – how strongly an item exhibits each feature
• Non-negative MF:
– All elements of P and Q are > 0
– Gives the factors an intuitive meaning
Matrix Factorization
• Gradient descent:
– Initialize P and Q with some (e.g. random) values
– Iteratively minimize the difference between P × Qᵀ and R
– Error on an observed rating: e_ud = r_ud − Σ_k p_uk q_dk
– Update rules (learning rate α, regularization β):
p_uk ← p_uk + α (2 e_ud q_dk − β p_uk)
q_dk ← q_dk + α (2 e_ud p_uk − β q_dk)
– Overall error: E = Σ_observed e_ud² + (β/2) (Σ_u,k p_uk² + Σ_d,k q_dk²)
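A compact sketch of this procedure on the toy matrix above. The hyperparameters K, α, β and the fixed step count are illustrative choices:

```python
import numpy as np

def matrix_factorization(R, K=2, alpha=0.002, beta=0.02, steps=5000):
    """Factor R (0 = missing) into P (|U| x K) and Q (|D| x K)."""
    U, D = R.shape
    P = np.random.rand(U, K)
    Q = np.random.rand(D, K)
    for _ in range(steps):
        for u in range(U):
            for d in range(D):
                if R[u, d] > 0:                  # observed entries only
                    e = R[u, d] - P[u] @ Q[d]    # current error e_ud
                    P[u] += alpha * (2 * e * Q[d] - beta * P[u])
                    Q[d] += alpha * (2 * e * P[u] - beta * Q[d])
    return P, Q

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
P, Q = matrix_factorization(R)
print(np.round(P @ Q.T, 2))   # the product fills in the missing entries
```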
Singular Value Decomposition
• Find lower-dimensional features that represent concepts: decompose A = U Σ Vᵀ and keep only the largest singular values
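A minimal numpy illustration of keeping only the k dominant concepts (the matrix here is random, purely for demonstration):

```python
import numpy as np

A = np.random.rand(5, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # best rank-k approximation of A
print(np.linalg.norm(A - A_k))             # reconstruction error
```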
Collaborative topic Poisson
factorization (CTPF)
• Identification of latent article topics
• Readers’ topic preferences
• How might documents on one topic interest other users?
• Handles massive, sparse, long-tailed data
• Solves the ‘cold start’ problem
• Organizes articles according to their topics
[Diagram: users are connected, through their preferences, to TOPICS, which in turn are connected to articles]
A case study: EM paper impact
• “Maximum likelihood from incomplete data via the EM algorithm” (1977)
• Black bars – the topics that the EM paper is about
• Red bars – the preferences of the readers who have the EM paper
in their libraries
• CTPF has uncovered the interdisciplinary impact of the EM paper
Algorithm basics
1. Poisson factorization generative process:
– Latent features initialization
– Observed features initialization
– Rating of unread documents
2. Approximate posterior inference – finding the latent features given the observations:
– Posterior approximation by variational inference
– Coordinate ascent algorithm
CTPF – model
• Document model: D documents × V words, with observed word counts W and latent attributes
• User preferences model: U users × D documents, with observed ratings R and latent preferences
• Ratings are binary:
r_ud = 1 if user u consulted document d, 0 otherwise
CTPF – Poisson Factorization
• Word counts W (D documents × V words) are drawn from Poisson distributions
• User ratings R (U users × D documents) are drawn from Poisson distributions
• The latent factors have Gamma priors with shape/rate hyperparameters (a, b) and (c, d)
Poisson distribution
• Discrete probability distribution
• P(X = k) = λᵏ e^(−λ) / k!, with mean and variance both equal to λ
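A quick empirical check of these properties (pure numpy; λ = 3 is an arbitrary choice):

```python
import numpy as np

# For X ~ Poisson(lam), the sample mean and variance both approach lam.
rng = np.random.default_rng(0)
lam = 3.0
x = rng.poisson(lam, size=100_000)
print(x.mean(), x.var())   # both approximately 3.0
```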
CTPF – latent features
• Document model:
– θ (topic intensities): D documents × K topics
– β (word intensities): V words × K topics
– Observed word counts W
• User preferences model:
– η (topic preferences): U users × K topics
– ϵ (topic offsets): D documents × K topics
– Observed user ratings R
CTPF – generative process
1. Document model:
– Draw word intensities β_vk ~ Gamma(a, b)
– Draw topic intensities θ_dk ~ Gamma(c, d)
– Draw word counts w_dv ~ Poisson(θ_d · β_v)
2. User preferences model:
– Draw topic preferences η_uk ~ Gamma(e, f)
– Draw topic offsets ϵ_dk ~ Gamma(g, h)
– Draw ratings r_ud ~ Poisson(η_u · (θ_d + ϵ_d))
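A sketch of sampling from this generative process. The sizes and the single shared Gamma shape/rate are illustrative simplifications (the model uses separate hyperparameter pairs for each factor):

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, U, K = 100, 500, 50, 10      # documents, vocabulary, users, topics
a = b = 0.3                        # shared Gamma shape/rate for simplicity

beta  = rng.gamma(a, 1 / b, size=(V, K))   # word intensities
theta = rng.gamma(a, 1 / b, size=(D, K))   # document topic intensities
eta   = rng.gamma(a, 1 / b, size=(U, K))   # user topic preferences
eps   = rng.gamma(a, 1 / b, size=(D, K))   # document topic offsets

W = rng.poisson(theta @ beta.T)            # w_dv ~ Poisson(theta_d . beta_v)
R = rng.poisson(eta @ (theta + eps).T)     # r_ud ~ Poisson(eta_u . (theta_d + eps_d))
print(W.shape, R.shape)                    # (100, 500) (50, 100)
```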
Recommending old and new
documents
In-matrix documents: rated by at least one user
Out-matrix documents: new to the system
• Scoring a user’s unread documents:
– No reader data – the score depends only on the topics: η_u · θ_d
– Both reader and article data – also use the topic offsets: η_u · (θ_d + ϵ_d)
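In code, the two cases differ only in whether learned offsets are available. A sketch, reusing the arrays from the sampler above:

```python
import numpy as np

def score(eta_u, theta_d, eps_d=None):
    """Predictive score for one user-document pair."""
    if eps_d is None:                  # out-matrix (cold start): content only
        return eta_u @ theta_d
    return eta_u @ (theta_d + eps_d)   # in-matrix: content + readership offset

# in-matrix:  score(eta[u], theta[d], eps[d])
# out-matrix: score(eta[u], theta_new)   # theta inferred from the text alone
```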
Approximate posterior inference
• Posterior approximation by variational inference
• Coordinate ascent algorithm – iterate over:
– Non-zero document–word counts
– Non-zero user–document ratings
Variational inference
• Approximate the posterior density with a (simpler) density over new variational parameters:
q(z_1:m | ν)
• Find the parameters ν that minimize the KL divergence to the true posterior
• Use q for predictions about future data
Auxiliary variables
• For each word count w_dv, add K latent variables (integers):
z_dvk ~ Poisson(θ_dk β_vk), with w_dv = Σ_k z_dvk
• For each observed rating r_ud, add K latent variables:
y_udk ~ Poisson(η_uk (θ_dk + ϵ_dk)), with r_ud = Σ_k y_udk
• Consider auxiliary variables for the non-zero values only.
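These decompositions are valid because a sum of independent Poisson variables is itself Poisson with the summed rate. A quick numerical check:

```python
import numpy as np

# If z_k ~ Poisson(lam_k) independently, then sum_k z_k ~ Poisson(sum_k lam_k).
rng = np.random.default_rng(0)
lams = np.array([0.5, 1.2, 2.3])
z = rng.poisson(lams, size=(100_000, 3))
total = z.sum(axis=1)
print(total.mean(), lams.sum())   # both approximately 4.0
```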
Variational family
• Defined over independent latent variables (mean-field):
q(β, θ, η, ϵ, z, y) = Π q(β_vk) · Π q(θ_dk) · Π q(η_uk) · Π q(ϵ_dk) · Π q(z_dv) · Π q(y_ud)
– Gamma factors for β, θ, η, ϵ and multinomial factors for z, y
Optimal coordinate updates
• Iteratively optimize each variational parameter while holding the others fixed
• Set the variational parameter equal to the expected natural parameter:
– the expectation of a function of the other random variables and the observations
– Example update (the multinomial parameter of the word auxiliary variables):
φ_dvk ∝ exp{ E[log θ_dk] + E[log β_vk] }
Coordinate ascent algorithm
Initialize the topics β_1:K and the topic intensities θ_1:D (using LDA).
Repeat until convergence:
1. For each word count w_dv > 0, set φ_dv to the expected conditional parameter of z_dv.
2. For each rating r_ud > 0, set ξ_ud to the expected conditional parameter of y_ud.
3. For each document d and each k, update the block of variational topic intensities θ_dk to their expected conditional parameters. Perform similar block updates for β_vk, η_uk and ϵ_dk, in sequence.
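A sketch of these coordinate updates for the document half of the model alone (plain Poisson factorization of the word counts). The full CTPF algorithm adds the analogous ξ, η and ϵ updates, and real implementations iterate only over the non-zero counts rather than forming a dense φ:

```python
import numpy as np
from scipy.special import digamma

def cavi_pf(W, K=10, a=0.3, b=0.3, iters=100, seed=0):
    """Coordinate-ascent variational inference for W ~ Poisson(theta beta^T)."""
    rng = np.random.default_rng(seed)
    D, V = W.shape
    # Gamma variational parameters (shape, rate) for theta and beta.
    g_shp = a + rng.random((D, K)); g_rte = b + rng.random((D, K))
    l_shp = a + rng.random((V, K)); l_rte = b + rng.random((V, K))
    for _ in range(iters):
        # Expected log factors, used by the multinomial update for z.
        elog_t = digamma(g_shp) - np.log(g_rte)   # E[log theta]
        elog_b = digamma(l_shp) - np.log(l_rte)   # E[log beta]
        # phi[d, v, k] proportional to exp(E[log theta_dk] + E[log beta_vk]).
        phi = np.exp(elog_t[:, None, :] + elog_b[None, :, :])
        phi /= phi.sum(axis=2, keepdims=True)
        # Block updates: set each Gamma parameter to its expected conditional.
        g_shp = a + (W[:, :, None] * phi).sum(axis=1)
        g_rte = b + (l_shp / l_rte).sum(axis=0)   # broadcast over documents
        l_shp = a + (W[:, :, None] * phi).sum(axis=0)
        l_rte = b + (g_shp / g_rte).sum(axis=0)   # broadcast over words
    return g_shp / g_rte, l_shp / l_rte           # E[theta], E[beta]
```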
Latent Dirichlet allocation
Document-topic model:
• W – observable (words)
• Z – topics for the words in documents
• M – number of documents
• N – number of words in a document
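A sketch of the LDA generative process used for initialization (sizes and the symmetric Dirichlet parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, V, K = 20, 50, 100, 5        # documents, words/doc, vocabulary, topics
phi = rng.dirichlet(np.full(V, 0.1), size=K)    # topic-word distributions
docs = []
for _ in range(M):
    theta = rng.dirichlet(np.full(K, 0.5))      # per-document topic mixture
    z = rng.choice(K, size=N, p=theta)          # topic Z for each word slot
    w = np.array([rng.choice(V, p=phi[zi]) for zi in z])  # observed words W
    docs.append(w)
```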
Empirical results
• Predictive approach to evaluating model fitness
• Comparing the predictive accuracy of CTPF to collaborative topic regression (CTR)
• Datasets:
– Mendeley dataset of scientific articles - a binary matrix of 80,000 users
and 260,000 articles with 5 million observations
– arXiv - a matrix of 120,297 users and 825,707 articles, with 43 million
observations
• Competing methods (topics and topic intensities initialized with LDA):
– CTPF
– Decoupled Poisson Factorization
– Content Only (CTPF)
– Ratings Only (Poisson)
– CTR
Evaluation
• Test set: 20% of ratings and 1% of documents
in each data set
• Validation set: 1% of ratings (20% for arXiv)
• Testing:
– generate the top M recommendations for each
user (items with the highest predictive score
under each method)
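A minimal sketch of this top-M step, assuming a precomputed user-by-item score matrix and a boolean mask of already-seen items:

```python
import numpy as np

def top_m(scores, seen, M=10):
    """Indices of the M highest-scoring unseen items for each user."""
    scores = np.where(seen, -np.inf, scores)    # never recommend seen items
    return np.argsort(-scores, axis=1)[:, :M]
```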
Comparison
[Figure: predictive accuracy of the competing methods on both datasets]
Top recommendations
[Table: example top recommendations]
Conclusions
• Combines the text of the article with user behavior data
• Cold start: new articles are recommended based on their text
• Popular articles: recommended based on their readership
• Organizes documents by topic