FPT Technology Research Institute

Topic modeling: an update
Khoat Than
Hanoi University of Science and Technology
FPT, March 31, 2015
Contents
¡  About me
¡  Introduction to topic modeling
¡  Some challenges
¡  Our recent research
About me
A short biography
¡  B.S.: Applied Mathematics and Informatics, University of Science, VNU Hanoi (2004)
¡  M.S.: Information Technology, Hanoi University of Science and Technology (2009)
¡  Ph.D.: Knowledge Science, Japan Advanced Institute of Science and Technology (2013)
Academic activity
¡  Program committee member:
¨  PAKDD (2015)
¨  ACML (2015, 2014)
¨  KSE (2015, 2014)
¡  PC co-chair: PhD colloquium at DASFAA-2015
¡  Director of the Laboratory of Knowledge and Data Engineering at HUST.
Some projects
¡  NAFOSTED (2015-2017, VN): director
¨  Title: Inference methods for analyzing the hidden semantics in big data
¨  Area: Machine learning, Big data
¡  AFOSR (2015-2017, USA): director
¨  Title: Inferring the hidden structures in big heterogeneous data
¨  Area: Machine learning, Big data
¡  AFOSR (2013-2014, USA): member
¨  Title: Methods of sparse modeling and dimensionality reduction to deal with big data
¨  Area: Machine learning, Big data
Research interests
¡  Topic modeling (mô hình hóa chủ đề).
¡  Probabilistic graphical models (mô hình đồ thị).
¡  Sparse modeling (mô hình thưa),
sparse coding (mã hóa thưa).
¡  Stochastic inference, SGD,
Online learning (học trực tuyến).
¡  Manifold learning (học đa tạp).
¡  Dimensionality reduction (giảm chiều dữ liệu).
Introduction to Topic Modeling
Topic modeling
¡  One of the main ways to automatically understand the meanings of texts.
¡  Efficient tools to organize, understand, and uncover useful knowledge from huge amounts of data.
¡  Efficient tools to discover the hidden semantics/structures in data.
Hidden semantics (1)
Hidden semantics (2)
¡  Hidden evolutions
Hidden semantics (3)
¡  Meanings of pictures
Hidden semantics (4)
¡  Objects in pictures
Hidden semantics (5)
¡  Activities
Hidden semantics (6)
¡  Contents of medical images
Hidden semantics (7)
¡  Interactions of hidden entities
Hidden semantics (8)
¡  Communities in social networks
Recent applications (1)
¡  Boosting the performance of search engines over the baseline [Wang et al., ACM TIST 2014].
[Figure from Wang et al. ("Peacock: Learning Long-Tail Topic Features for Industrial Applications"): retrieval MAP versus the number of topics, from 10^2 to 10^5, panels (A) and (B).]
Recent applications (2)
¡  Boosting the performance of online advertisement over the baseline [Wang et al., ACM TIST 2014].
[Figure from Wang et al., Fig. 11: AUC improvement (%) of pCTR versus the number of topics, from 10^2 to 10^5. Caption: "Topic features improve the pCTR performance in online advertising systems."]
Topic models: some concepts (1)
¡  Topic: a set of semantically related words.
¡  Document: a mixture of a few topics [Blei et al., JMLR 2003].
¡  Topic mixture: the proportions of the topics in a document (see the toy sketch below).
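A minimal toy sketch of these three concepts; the topics, words, and proportions below are made up purely for illustration.

```python
# Toy illustration of topic-model concepts (all numbers are made up).
# A "topic" is a distribution over semantically related words;
# a "document" mixes a few topics with given proportions.

topics = {
    "sports":   {"match": 0.30, "team": 0.25, "goal": 0.20, "coach": 0.15, "fans": 0.10},
    "politics": {"election": 0.35, "party": 0.25, "vote": 0.20, "policy": 0.20},
}

# Topic mixture of one document: proportions over topics (sums to 1).
doc_mixture = {"sports": 0.7, "politics": 0.3}

def word_prob(word):
    """Probability of a word in the document under the mixture."""
    return sum(p * topics[t].get(word, 0.0) for t, p in doc_mixture.items())

print(word_prob("goal"))  # 0.7 * 0.20 = 0.14
print(word_prob("vote"))  # 0.3 * 0.20 = 0.06
```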
Topic models: some concepts (2)
¡  In reality, we only observe the documents.
¡  The other structures (topics, mixtures, ...) are hidden.
¡  Those structures compose a Topic Model.
Topic models: learning
¡  The main aim is to infer the hidden variables, e.g., topics, relations, interactions, ...
Topic models: posterior inference
Rockets strike Kabul -- AP, August 8, 1990.
More than a dozen rockets slammed into Afghanistan's capital of Kabul today, killing 14 people and injuring 10, Afghan state radio reported. No one immediately claimed responsibility for the attack. But the Radio Kabul broadcast, monitored in Islamabad, blamed "extremists," presumably referring to U.S.-backed guerrillas headquartered in Pakistan. Moslem insurgents have been fighting for more than a decade to topple Afghanistan's Communist-style government. In the past year, hundreds of people have died and thousands more injured in rocket assaults on the Afghan capital.
How much do the topics contribute to this news article?
[Figure: bar chart of the topic proportions inferred for the article, over topics 2, 10, 17, 18, 28, 31, 35, 38, and 50.]
Some topics previously learned from a collection of news:
Topic 2: police, students, palestinians, curfew, sikh, gaza, rangoon, moslem, israeli, militants
Topic 10: sinhalese, tamil, iranian, dam, khomeini, sri, cemetery, accord, wppss, guerrillas
Topic 18: contra, sandinistas, chamorro, ortega, rebels, sandinista, aid, nicaragua, managua, ceasefire
Topic 31: fire, winds, firefighters, mph, blaze, brush, homes, acres, water, weather
Topic 38: shuttle, nasa, space, launch, magellan, mars, spacecraft, telescope, venus, astronauts
Topic 40: gorbachev, soviet, republics, politburo, yeltsin, moscow, tass, party, treaty, grigoryants
Topic 42: index, stock, yen, points, market, shares, trading, dow, unchanged, volume
Topic 50: beirut, hezbollah, lebanon, aoun, syrian, militia, lebanese, amal, troops, wounded
¡  Infer the hidden variables for a given document, e.g.,
¨  Which topics/objects appear in it?
¨  What are their contributions?
Recent trends in topic modeling
¡  Large-scale learning: learn models from huge corpora (e.g., 100 million documents).
¡  Sparse modeling: respect the sparse nature of texts.
¡  Nonparametric models: automatically grow the model size.
¡  Theoretical foundation: provide guarantees for learning and posterior inference.
¡  Incorporating meta-data: encode meta-data into a model.
Some challenges and lessons learnt
Challenges: first
¡  Can we develop a fast inference method with provable theoretical guarantees on quality?
¡  Inference on each data instance:
¨  Which topics appear in a document?
¨  What are they talking about?
¨  Which animals appear in a picture?
¡  Inference plays a vital role in many probabilistic models:
¨  It enables us to design fast algorithms for massive/streaming data.
¨  It ensures high confidence and reliability when using topic models in practice.
¡  But: inference is often intractable (NP-hard).
Challenges: second
¡  How can we learn a big topic model from big data?
¡  Big model:
¨  Billions of variables/parameters,
¨  which might not fit in the memory of a supercomputer.
¡  Many applications lead to this problem:
¨  Exploration of a century of literature
¨  Exploration of online forums/networks
¨  Analyzing political opinions
¨  Tracking objects in videos
¡  But this is largely unexplored in the literature.
Challenges: third
¡  Can we develop methods with provable guarantees on quality for handling streaming/dynamic text collections?
¡  Many practical applications:
¨  Analyzing political opinions in online forums
¨  Analyzing behaviors & interests of online users
¨  Identifying entities and temporal structures from news
¡  But: existing methods often lack a theoretical guarantee on inference quality.
Lessons: learnability
¡  In theory:
¨  A model can be recovered exactly if the number of documents is sufficiently large [Anandkumar et al., NIPS 2012; Arora et al., FOCS 2012; Tang et al., ICML 2014].
¨  It is impossible to guarantee learnability of a model when having few documents.
¡  In practice [Tang et al., ICML 2014]:
¨  Once there are sufficiently many documents, further increasing their number may not significantly improve performance.
¨  Documents should be long, but need not be too long.
¨  A model performs well when the topics are well separated.
Lessons: practical effectiveness
¡  Collapsed Gibbs sampling (CGS):
¨  Efficient
¨  Better than VB and BP in large-scale applications [Wang et al., TIST 2014]
¡  Variational Bayes (VB) [Jiang et al., PAKDD 2015]:
¨  Often slow
¨  And inaccurate
¡  Belief propagation (BP):
¨  Memory-intensive
Lessons: posterior inference
¡  Inference for individual texts:
¨  Variational method (VB) [Blei et al., JMLR 2003]
¨  Collapsed VB (CVB) [Teh et al., NIPS 2007]
¨  CVB0 [Asuncion et al., UAI 2009]
¨  Gibbs sampling [Griffiths & Steyvers, PNAS 2004]
¨  Online Frank-Wolfe (OFW) [Than & Doan, ACML 2014]
¡  Inference is often intractable in theory [Sontag & Roy, NIPS 2011].
¡  But it might be tractable in practice [Than & Doan, ACML 2014].
¡  Online Frank-Wolfe is an efficient algorithm that has provable guarantees on quality.
Our recent research
(Than & Doan, ACML 2014)
Latent Dirichlet Allocation
¡  Latent Dirichlet Allocation (LDA) [Blei et al., JMLR 2003] is a widely used class of Bayesian networks.
¨  It provides an efficient tool to analyze hidden themes in data.
¨  It helps us recover hidden structures/evolutions in big text collections and streaming data [Blei, Comm. ACM 2012; Mimno, JOCCH 2012].
¡  LDA is the core of a large family of probabilistic models.
Posterior inference in LDA
¡  Learning (Bayesian inference) from a corpus C:
¨  Estimate the posterior distribution of the hidden variables given C.
¨  β are the hidden topics.
¨  Θ are the topic mixtures of the documents.
¡  Posterior inference for a document d:
¨  Estimate the joint distribution of d and its hidden variables (both distributions are written out below).
¡  Those problems are intractable [Sontag & Roy, NIPS 2011].
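In standard LDA notation these distributions can be written as follows. This is the textbook form (with η the Dirichlet prior on the topics, which the slide leaves implicit); the per-document joint the slide mentions is the numerator of the second expression.

```latex
% Learning: posterior over topics, mixtures, and topic assignments given the corpus C
p(\beta, \Theta, Z \mid C, \alpha, \eta)
  = \frac{p(\beta \mid \eta) \prod_{d \in C} p(\theta_d \mid \alpha)
          \prod_{n} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta)}
         {p(C \mid \alpha, \eta)}

% Posterior inference for a single document d (topics \beta fixed)
p(\theta_d, z_d \mid d, \beta, \alpha)
  = \frac{p(\theta_d \mid \alpha) \prod_{n} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta)}
         {p(d \mid \beta, \alpha)}
```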
Posterior inference: approaches
¡  Posterior inference for a document d:
¨  Variational method (VB) [Blei et al., JMLR 2003]: approximates p(z_d, θ_d, d | β, α)
¨  Collapsed VB (CVB) [Teh et al., NIPS 2007]: approximates p(z_d, d | β, α)
¨  CVB0 [Asuncion et al., UAI 2009]: approximates p(z_d, d | β, α)
¨  Gibbs sampling [Griffiths & Steyvers, PNAS 2004]: approximates p(z_d, d | β, α)
¡  Our work: approximate p(θ_d, d | β, α)
Posterior inference: tractability
¡  Theoretical results for MAP inference: θ* = arg max_{θ_d} Pr(θ_d, d | β, α)
¨  Intractable (NP-hard) in the worst case [Sontag & Roy, NIPS 2011]
¨  Non-concave in general
¡  Our work: tractable (concave) under some conditions (see the objective written out below):
¨  High dimensionality (fits text modeling well)
¨  Long documents (fits stream/online environments well; similar to [Tang et al., ICML 2014])
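Writing the MAP objective out makes the concavity issue visible; this is the standard expansion of the objective above, not taken verbatim from the slides.

```latex
\theta^{*} = \arg\max_{\theta_d \in \Delta_K}\;
  \underbrace{\sum_{j} d_j \log \sum_{k=1}^{K} \theta_{dk}\, \beta_{kj}}_{\text{log-likelihood: concave in } \theta_d}
  \; + \;
  \underbrace{(\alpha - 1) \sum_{k=1}^{K} \log \theta_{dk}}_{\text{log Dirichlet prior}}
  \; + \; \text{const}
```

For α ≥ 1 the prior term is also concave, so the whole problem is concave; for α < 1 (the usual sparsity-inducing setting) it is non-concave in general, which is exactly where the long-document and high-dimensionality conditions of the next slide come in.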
Posterior inference: tractability
From [Than & Doan, ACML 2014], "Dual online inference for latent Dirichlet allocation":

Corollary 7 (Concavity for long documents). Using the assumptions in Theorem 4:
¨  If n_d ≥ (√V − √K − 1)^4 C^4 (1 − α)^2, then problem (3) is concave with probability at least 1 − n_d^{−(V−K+1)/4} − e^{−cV}.
¨  As n_d → +∞, problem (3) is concave with probability at least 1 − e^{−cV}.
Proof. The first statement can be derived from Theorem 6 by choosing ε = C^{−1} n_d^{−1/4}. The second statement then follows.

Corollary 8 (Concavity for high dimensionality). Using the notations and assumptions in Theorem 4, let K and n_d be fixed. Then the MAP problem (3) is concave over Δ_K with probability 1 as V → +∞.
Implications in a broader context
¡  LDA is the core of a large family of probabilistic models:
¨  MAP inference is very likely tractable in practice,
¨  and hence might be solved easily.
Posterior inference: algorithms
¡  Posterior inference for a document d:
¨  Variational method (VB), Collapsed VB (CVB), CVB0, Gibbs sampling
¨  But their quality and convergence rate are unknown.
¡  Our work: consider θ* = arg max_{θ_d} Pr(θ_d, d | β, α)
¨  Online Frank-Wolfe (OFW) algorithm, using stochastic zero- and first-order information
¨  Has a theoretical guarantee on quality & convergence rate
Posterior inference: OFW
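The original slide showed the OFW algorithm itself. Below is a minimal Python sketch of a Frank-Wolfe-style MAP inference over the topic simplex; it is only an illustration of the idea, not the exact OFW algorithm of [Than & Doan, ACML 2014], and the toy topics, document, and step-size schedule are assumptions.

```python
import numpy as np

def fw_map_inference(doc_counts, beta, alpha=0.1, iters=50, rng=None):
    """Frank-Wolfe-style MAP estimate of a document's topic mixture theta.

    doc_counts: length-V array of word counts of the document.
    beta:       K x V matrix of topic-word probabilities (rows sum to 1).
    Illustrative sketch only, not the exact OFW of the paper.
    """
    rng = rng or np.random.default_rng(0)
    K, V = beta.shape
    theta = np.full(K, 1.0 / K)              # start at the center of the simplex
    words = np.nonzero(doc_counts)[0]        # word types present in the document

    for t in range(iters):
        # Stochastic flavor: use one randomly picked word of the document per step.
        j = rng.choice(words)
        pj = beta[:, j] @ theta              # p(word j | theta, beta)
        grad = doc_counts[j] * beta[:, j] / pj + (alpha - 1.0) / np.maximum(theta, 1e-12)

        # Frank-Wolfe step: move toward the simplex vertex with the largest gradient.
        vertex = np.zeros(K)
        vertex[np.argmax(grad)] = 1.0
        step = 2.0 / (t + 2.0)
        theta = (1.0 - step) * theta + step * vertex
    return theta

# Toy usage with made-up topics and a short document:
beta = np.array([[0.50, 0.40, 0.05, 0.05],   # topic 0 favors words 0-1
                 [0.05, 0.05, 0.50, 0.40]])  # topic 1 favors words 2-3
doc = np.array([3, 2, 0, 1])
print(fw_map_inference(doc, beta))
```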
Implications in a broader context
¡  A large family of probabilistic models:
¨  Posterior inference can be done efficiently by OFW,
¨  and with a theoretical guarantee on quality (which VB, CVB, CVB0, and CGS do not have).
Large-scale learning for LDA
¡  Learning LDA from a massive dataset C:
¨  The hidden topics β are often of practical interest.
¡  Approaches:
¨  Parallel/distributed algorithms [Smola et al., VLDB 2010; Asuncion et al., Stat. Methodol. 2011]
¨  Online learning [Hoffman et al., JMLR 2013; Mimno et al., ICML 2012; Foulds et al., KDD 2013; Patterson et al., NIPS 2013]
¨  Streaming learning [Broderick et al., NIPS 2013]
Online learning: schemes
¡  Existing schemes for online learning:
¨  The global variables β are learnt online (stochastically).
¨  The local variables (z or θ) are approximated by
u  a variational method, or
u  Gibbs sampling [Mimno et al., ICML 2012].
¡  Our algorithm (DOLDA):
¨  Both global and local variables are learnt online (stochastically), as in the sketch below.
¨  There is a provable guarantee on quality when inferring the local variables.
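A minimal sketch of the online learning loop these bullets describe. The mini-batch size, learning-rate schedule, and the reuse of fw_map_inference from the earlier sketch for the local step are illustrative assumptions; this is not the exact DOLDA algorithm.

```python
import numpy as np

def online_lda(corpus_stream, K, V, alpha=0.1, batch_size=64, kappa=0.7, tau=64):
    """Online learning of LDA topics beta from a stream of documents.

    corpus_stream yields length-V word-count vectors. Local mixtures theta are
    inferred per document (here with fw_map_inference from the earlier sketch);
    the global topics beta are updated stochastically from each mini-batch.
    """
    rng = np.random.default_rng(0)
    beta = rng.dirichlet(np.ones(V), size=K)          # K x V topic-word matrix
    batch, t = [], 0
    for doc in corpus_stream:
        batch.append(doc)
        if len(batch) < batch_size:
            continue
        # 1) Local step: infer theta for each document of the mini-batch.
        stats = np.zeros((K, V))
        for d in batch:
            theta = fw_map_inference(d, beta, alpha)
            # Soft assignment of each word count to the topics under theta:
            resp = theta[:, None] * beta
            resp /= resp.sum(axis=0, keepdims=True) + 1e-12
            stats += resp * d[None, :]
        # 2) Global step: move beta toward the mini-batch estimate with a decaying rate.
        rho = (t + tau) ** (-kappa)
        batch_beta = stats / (stats.sum(axis=1, keepdims=True) + 1e-12)
        beta = (1.0 - rho) * beta + rho * batch_beta
        batch, t = [], t + 1
    return beta
```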
Large-scale experiments
¡  Algorithms in comparison:
¨  Stochastic variational inference (SVI) [Hoffman et al., JMLR 2013]
¨  Streaming variational Bayes (SSU) [Broderick et al., NIPS 2013]
¨  Dual online algorithm (DOLDA) [our work]
¡  Data:
¨  PubMed with 8 million documents
¨  New York Times with 200K news articles
¡  Measures (see the coherence sketch below):
¨  Coherence: semantic quality of a model
¨  Predictive probability: predictiveness and generalization on new data
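Topic coherence is computed from word co-occurrence statistics over documents. Below is a minimal sketch of a UMass-style coherence score; the choice of this particular variant and the helper names are assumptions, since the slides do not say which coherence measure was used.

```python
import math
from itertools import combinations

def umass_coherence(top_words, doc_word_sets):
    """UMass-style coherence of one topic.

    top_words:     the topic's top words, ordered by probability.
    doc_word_sets: list of sets, each holding the distinct words of one document.
    Higher (closer to zero) means a more coherent topic. Illustrative sketch only.
    """
    def doc_freq(*words):
        return sum(1 for d in doc_word_sets if all(w in d for w in words))

    score = 0.0
    for w_high, w_low in combinations(top_words, 2):   # w_high is ranked above w_low
        score += math.log((doc_freq(w_high, w_low) + 1.0) / max(doc_freq(w_high), 1))
    return score

# Toy usage with made-up documents:
docs = [{"space", "nasa", "launch"}, {"nasa", "shuttle"}, {"vote", "party"}]
print(umass_coherence(["nasa", "space", "launch"], docs))
```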
Experimental results
¡  DOLDA performed better than SVI and SSU in both generalization and semantic quality.
[Figures: log predictive probability and topic coherence on PubMed and New York Times, plotted against the number of documents seen (in millions) and against learning hours, for DOLDA, SVI, and SSU.]
References
¡  Anandkumar, Anima, et al. "A spectral algorithm for latent Dirichlet allocation." In NIPS, 2012.
¡  Arora, Sanjeev, Rong Ge, and Ankur Moitra. "Learning topic models -- going beyond SVD." In FOCS, 2012.
¡  A. Asuncion, P. Smyth, and Max Welling. Asynchronous distributed estimation of topic models for document analysis.
Statistical Methodology, 8(1):3–17, 2011.
¡  David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. JMLR, 3(3):993–1022, 2003.
¡  Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael Jordan. Streaming variational bayes. In
NIPS, pages 1727–1735, 2013.
¡  J. Foulds, L. Boyles, C. DuBois, P. Smyth, and Max Welling. Stochastic collapsed variational bayesian inference for latent
dirichlet allocation. In KDD, pages 446–454. ACM, 2013.
¡  T.L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United
States of America, 101(Suppl 1):5228, 2004.
¡  Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of
Machine Learning Research, 14(1):1303–1347, 2013.
¡  David Mimno. Computational historiography: Data mining in a century of classics journals. Journal on Computing and
Cultural Heritage, 5(1):3, 2012.
¡  Alexander Smola and Shravan Narayanamurthy. An architecture for parallel topic models. Proceedings of the VLDB
Endowment, 3(1-2):703–710, 2010.
¡  David Sontag and Daniel M. Roy. Complexity of inference in latent dirichlet allocation. In NIPS, 2011.
¡  Jian Tang, Zhaoshi Meng, Xuanlong Nguyen, Qiaozhu Mei, and Ming Zhang. Understanding the limiting factors of topic
modeling via posterior contraction analysis. In ICML, pages 190–198, 2014.
¡  Y.W. Teh, D. Newman, and M. Welling. A collapsed variational bayesian inference algorithm for latent dirichlet
allocation. In NIPS, volume 19, page 1353, 2007.
¡  Wang, Y., Zhao, X., Sun, Z., Yan, H., Wang, L., Jin, Z., ... & Zeng, J. Peacock: Learning long-tail topic features for
industrial applications. ACM Transactions on Intelligent Systems and Technology, Vol. 9, No. 4, Article 39, 2014.
¡  Zeng et al. “A Comparative Study on Parallel LDA Algorithms in MapReduce Framework”. In PAKDD, 2015.
Thank you