
A reunification of optimization and spectral methods: optimal learning in partially observable settings
Dustin Tran
(draft) April 30, 2015
Harvard University, School of Engineering and Applied Sciences
How can we best learn parameters in a latent variable model?
Motivation
In statistical estimation theory, the ideal estimator satisfies four
requirements:
• consistency
• statistical efficiency
• computational efficiency
• numerical stability
Motivation
The EM algorithm satisfies:
• consistency ✗
• statistical efficiency ✓ (assuming global)
• computational efficiency ✗
• numerical stability ✓
Motivation
Variational inference satisfies:
• consistency ✗
• statistical efficiency ✓ (assuming global; SVI with averaging)
• computational efficiency ✓ (SVI)
• numerical stability ✓ (implicit/proximal SVI)
Motivation
Markov chain Monte Carlo satisfies:
• consistency ✓
• statistical efficiency ✓
• computational efficiency ✗
• numerical stability ✗ (??)
What are spectral methods?
They are matrix (tensor) decompositions which generate sample
moments, i.e., observable representations that are functions of the
parameters.
• Can solve for them and asymptotically recover the ground truth in expectation, i.e., lead to consistent estimators
• Are consistent estimators for (certain) nonconvex problems!
Motivation
Spectral methods satisfy:
• consistency ✓
• statistical efficiency ✗
• computational efficiency ✓
• numerical stability ✗
Can we leverage the optimization viewpoint to obtain statistical efficiency and numerical stability?
Motivation
Given a cost function ℓ(θ; Y), we can:
• Add regularization/priors f(θ)
• Add a weighted distance (generalized method of moments)
• Use efficient and numerically stable optimization routines (stochastic gradient methods)
Background: Generalized method of moments (GMMs)
Let Y_1, . . . , Y_N be (d + 1)-dimensional observations generated from some model with unknown parameters θ ∈ Θ.
The k moment conditions for a vector-valued function g(Y, ·): Θ → R^k are

    m(θ∗) := E[g(Y, θ∗)] = 0_{k×1}

The observable representations, or sample moments, are

    m̂(θ) := (1/N) ∑_{n=1}^{N} g(Y_n, θ)
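To make the sample-moment construction concrete, here is a minimal numpy sketch under an assumed toy model: g_gaussian matches the first two moments of a N(μ, σ²) distribution, and all names and numbers are illustrative rather than anything from the slides.

```python
import numpy as np

def m_hat(theta, Y, g):
    """Sample moments: average g(Y_n, theta) over the N observations."""
    return np.mean([g(y, theta) for y in Y], axis=0)

# Toy moment conditions for a N(mu, sigma^2) model: match the first two moments.
def g_gaussian(y, theta):
    mu, sigma2 = theta
    return np.array([y - mu, (y - mu) ** 2 - sigma2])

rng = np.random.default_rng(0)
Y = rng.normal(loc=1.0, scale=2.0, size=1000)       # data from N(1, 2^2)
print(m_hat((1.0, 4.0), Y, g_gaussian))             # approximately [0, 0] at the true parameters
```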
Background: Generalized method of moments (GMMs)
We aim to minimize the normed distance ∥m̂(θ)∥ for some choice of ∥·∥. Define the weighted Frobenius norm as

    ∥m̂(θ)∥²_W := m̂(θ)^T W m̂(θ),

where W is a positive definite matrix. The generalized method of moments (GMM) estimator is

    θ_gmm := arg min_{θ∈Θ} ∥m̂(θ)∥²_W
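A hedged sketch of the GMM estimator as a plain optimization problem, continuing the toy Gaussian moments above. Using scipy's Nelder-Mead and W = I is an illustrative assumption, not the talk's method.

```python
import numpy as np
from scipy.optimize import minimize

def g_gaussian(y, theta):                      # toy moment conditions (hypothetical)
    mu, sigma2 = theta
    return np.array([y - mu, (y - mu) ** 2 - sigma2])

def gmm_objective(theta, Y, W):
    """Weighted GMM objective m_hat(theta)^T W m_hat(theta)."""
    m = np.mean([g_gaussian(y, theta) for y in Y], axis=0)
    return m @ W @ m

rng = np.random.default_rng(0)
Y = rng.normal(loc=1.0, scale=2.0, size=1000)  # data from N(1, 2^2), so theta* = (1, 4)

W = np.eye(2)                                  # start with the unweighted case
theta_gmm = minimize(gmm_objective, x0=np.array([0.0, 1.0]),
                     args=(Y, W), method="Nelder-Mead").x
print(theta_gmm)                               # roughly [1.0, 4.0]
```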
Background: Generalized method of moments (GMMs)
Under standard assumptions∗, the GMM estimator θ_gmm is consistent and asymptotically normal. Moreover, if

    W := E[g(Y_n, θ∗) g(Y_n, θ∗)^T]^{−1}

then θ_gmm is statistically efficient in the class of asymptotically normal estimators (using only the information in the moment conditions!).
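One standard way to approximate this efficient weighting is a two-step recipe: estimate θ with W = I, plug the fitted moments into W = E[g gᵀ]⁻¹, and re-minimize. The sketch below applies this to the same toy Gaussian moments; note that with exactly identified moments (k equal to the number of parameters) the weighting does not change the solution, so the example only illustrates the mechanics.

```python
import numpy as np
from scipy.optimize import minimize

def g_gaussian(y, theta):                      # same toy moment conditions as above
    mu, sigma2 = theta
    return np.array([y - mu, (y - mu) ** 2 - sigma2])

def gmm_objective(theta, Y, W):
    m = np.mean([g_gaussian(y, theta) for y in Y], axis=0)
    return m @ W @ m

rng = np.random.default_rng(0)
Y = rng.normal(loc=1.0, scale=2.0, size=1000)

# Step 1: a consistent (but possibly inefficient) first stage with W = I.
theta_1 = minimize(gmm_objective, np.array([0.0, 1.0]),
                   args=(Y, np.eye(2)), method="Nelder-Mead").x

# Step 2: plug theta_1 into W = E[g g^T]^{-1} and re-minimize with that weight.
G = np.stack([g_gaussian(y, theta_1) for y in Y])   # N x k moment evaluations
W_hat = np.linalg.inv(G.T @ G / len(Y))
theta_2 = minimize(gmm_objective, theta_1,
                   args=(Y, W_hat), method="Nelder-Mead").x
print(theta_2)
```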
Hidden Markov model
Time t ∈ {1, 2, . . .}
• Hidden states h_t ∈ {1, . . . , m}
• Observations x_t ∈ {1, . . . , n} (m ≤ n)
Given full-rank matrices T ∈ R^{m×m} and O ∈ R^{n×m}, the dynamical system for HMMs is given by

    T_ij := P[h_{t+1} = i | h_t = j]  ⟺  P[h_{t+1}] = T P[h_t]
    O_ij := P[x_t = i | h_t = j]  ⟺  P[x_t] = O P[h_t]

and π := P[h_1] ∈ R^m is the initial state distribution.
Goal: Estimate the joint distribution P[x_{1:t}], which also allows one to make predictions: P[x_{t+1} | x_{1:t}] = P[x_{1:t+1}] / P[x_{1:t}].
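A small numerical illustration of the dynamical system above: propagate P[h_t] with T and read off P[x_t] with O. The matrices T, O, and π are made-up toy values.

```python
import numpy as np

# Toy HMM with m = 2 hidden states and n = 3 observation symbols (made-up numbers).
T = np.array([[0.9, 0.2],        # T[i, j] = P[h_{t+1} = i | h_t = j]  (columns sum to 1)
              [0.1, 0.8]])
O = np.array([[0.7, 0.1],        # O[i, j] = P[x_t = i | h_t = j]      (columns sum to 1)
              [0.2, 0.3],
              [0.1, 0.6]])
pi = np.array([0.5, 0.5])        # pi = P[h_1]

h = pi                           # P[h_t], starting at t = 1
for t in range(1, 4):
    x_marginal = O @ h           # P[x_t] = O P[h_t]
    print(f"t={t}: P[x_t] =", x_marginal)
    h = T @ h                    # P[h_{t+1}] = T P[h_t]
```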
Hidden Markov model
[figure]
Hidden Markov model
Notation: P[x, y, · · ·]_{ij···} := P(x = i, y = j, · · ·)

Theorem
The joint probability matrix P[x_{t+1}, x_t] satisfies the following for all columns j ∈ {1, . . . , n}:

    P[x_{t+1}, x_t]_{·j} = Φ_j P[x_t]
    Φ_j := O T diag(O_{j·}) O†

Furthermore,

    P[x_{t+1}, x_t, x_{1:t−1}]_{·j k⃗} = Φ_j P[x_t, x_{1:t−1}]_{·k⃗}

where k⃗ represents a sequence of observed values for x_{1:t−1}, i.e., x_1 = k_1, . . . , x_{t−1} = k_{t−1}.
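A quick numerical check of the theorem on the toy HMM from the previous sketch: build the exact joint P[x_{t+1}, x_t] = O T diag(P[h_t]) Oᵀ and verify that each column equals Φ_j P[x_t] with Φ_j = O T diag(O_{j·}) O†. All numbers are made up; the check relies on O having full column rank.

```python
import numpy as np

# Same toy HMM as in the previous sketch (made-up numbers).
T = np.array([[0.9, 0.2], [0.1, 0.8]])
O = np.array([[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]])
pi = np.array([0.5, 0.5])

h_t = pi                                   # P[h_t] (take t = 1 for concreteness)
x_t = O @ h_t                              # P[x_t]

# Exact joint: P[x_{t+1} = i, x_t = j] = [O T diag(P[h_t]) O^T]_{ij}.
P_joint = O @ T @ np.diag(h_t) @ O.T

# Observable operator Phi_j = O T diag(O_{j,:}) O^dagger, checked column by column.
O_pinv = np.linalg.pinv(O)
for j in range(O.shape[0]):
    Phi_j = O @ T @ np.diag(O[j, :]) @ O_pinv
    assert np.allclose(P_joint[:, j], Phi_j @ x_t)
print("Phi_j reproduces every column of P[x_{t+1}, x_t]")
```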
Hidden Markov model
We aim to solve

    min_{Φ_j} ∥P̂[x_{t+1}, x_t, x_{1:t−1}]_{·j k⃗} − Φ_j P̂[x_t, x_{1:t−1}]_{·k⃗}∥_F

such that rank(Φ_j) ≤ m for all j ∈ {1, . . . , n}. By the Eckart-Young-Mirsky theorem, this is equivalent to solving

    min_{Φ_j} ∥U_j^T P̂[x_{t+1}, x_t, x_{1:t−1}]_{·j k⃗} − Φ_j U_j^T P̂[x_t, x_{1:t−1}]_{·k⃗}∥_F

where U_j is the matrix of m left-singular vectors of

    E[P̂[x_{t+1}, x_t, x_{1:t−1}]_{·j k⃗} P̂[x_t, x_{1:t−1}]_{·k⃗}^T]
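A sketch of the rank-constrained least-squares idea with synthetic stand-ins for the empirical probability matrices. The least-squares-then-truncate step is a simple heuristic chosen for illustration (the truncation itself is exactly the Eckart-Young-Mirsky best rank-m approximation); it is not claimed to be the estimator on the slide.

```python
import numpy as np

def truncated_svd(M, m):
    """Best rank-m approximation of M in Frobenius norm (Eckart-Young-Mirsky)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :m] @ np.diag(s[:m]) @ Vt[:m, :]

def rank_constrained_lstsq(A, B, m):
    """Heuristic for  min_Phi ||A - Phi @ B||_F  s.t. rank(Phi) <= m:
    ordinary least squares followed by a rank-m truncation."""
    Phi_ls = A @ np.linalg.pinv(B)            # unconstrained least-squares solution
    return truncated_svd(Phi_ls, m)

# Synthetic stand-ins for the empirical probability matrices (illustrative only).
rng = np.random.default_rng(0)
B = rng.normal(size=(5, 200))                                  # plays the role of P_hat[x_t, x_{1:t-1}]
Phi_true = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 5))   # rank-2 ground truth
A = Phi_true @ B + 0.01 * rng.normal(size=(5, 200))            # plays the role of P_hat[x_{t+1}, x_t, x_{1:t-1}]
print(np.linalg.norm(rank_constrained_lstsq(A, B, 2) - Phi_true))  # small recovery error
```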
Hidden Markov model
• It's convex! (optimization leads to a consistent estimator)
• This recovers the same (???) parameters as the method of moments estimation derived in Hsu et al. (2009).
Hidden Markov model
Use the weighted Frobenius norm instead for GMM estimation:

    min_{Φ_j} ∥P̂[x_{t+1}, x_t, x_{1:t−1}]_{·j k⃗} − Φ_j P̂[x_t, x_{1:t−1}]_{·k⃗}∥²_W

such that rank(Φ_j) ≤ m.
Hidden Markov model
And add your favorite regularizers/priors!

    min_{Φ_j} ∥P̂[x_{t+1}, x_t, x_{1:t−1}]_{·j k⃗} − Φ_j P̂[x_t, x_{1:t−1}]_{·k⃗}∥²_W + α∥Φ_j∥_1 + (1 − α)∥Φ_j∥²_F

such that rank(Φ_j) ≤ m.
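One way such a regularized objective could be attacked is proximal gradient descent, with soft-thresholding for the ℓ1 term and a truncated-SVD projection for the rank constraint. The sketch below is a heuristic under assumptions not stated on the slide (W = I, a standard Lipschitz step size, hypothetical function names); it is not the talk's algorithm.

```python
import numpy as np

def soft_threshold(X, tau):
    """Proximal operator of tau * ||X||_1 (entrywise soft-thresholding)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def regularized_phi(A, B, m, alpha=0.5, n_iters=500):
    """Minimize ||A - Phi B||_F^2 + alpha*||Phi||_1 + (1 - alpha)*||Phi||_F^2
    by proximal gradient steps, with rank(Phi) <= m enforced by a truncated-SVD
    projection after each step (a heuristic; the slide gives no algorithm)."""
    Phi = np.zeros((A.shape[0], B.shape[0]))
    L = 2.0 * (np.linalg.norm(B, 2) ** 2 + (1.0 - alpha))   # Lipschitz bound for the smooth part
    step = 1.0 / L
    for _ in range(n_iters):
        grad = 2.0 * (Phi @ B - A) @ B.T + 2.0 * (1.0 - alpha) * Phi
        Phi = soft_threshold(Phi - step * grad, step * alpha)
        U, s, Vt = np.linalg.svd(Phi, full_matrices=False)  # project onto rank <= m
        Phi = U[:, :m] @ np.diag(s[:m]) @ Vt[:m, :]
    return Phi

# Random stand-ins for the empirical matrices (illustrative only).
rng = np.random.default_rng(0)
B = rng.normal(size=(5, 200))
A = (rng.normal(size=(5, 2)) @ rng.normal(size=(2, 5))) @ B
print(np.round(regularized_phi(A, B, m=2, alpha=0.1), 2))
```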
Hidden Markov model

    min_{Φ_j} ∥P̂[x_{t+1}, x_t, x_{1:t−1}]_{·j k⃗} − Φ_j P̂[x_t, x_{1:t−1}]_{·k⃗}∥²_W
    subject to rank(Φ_j) ≤ m

1. This extends naturally to predictive state representations (Boots and Gordon, 2010) (??? work in progress).
2. Remark on local optima in the optimization:
   • In the unweighted case (MoM), any local optimum is necessarily global.
   • In the weighted case (GMM), this no longer holds (Srebro and Jaakkola, 2003).
    min_{Φ_j} ∥P̂[x_{t+1}, x_t, x_{1:t−1}]_{·j k⃗} − Φ_j P̂[x_t, x_{1:t−1}]_{·k⃗}∥²_W
    subject to rank(Φ_j) ≤ m

If the weighting introduces local optima, then what's the point?
Conclusion
Reframing moment estimation from spectral methods as optimization
leads to an alternative viewpoint on improving estimation:
• consistency ✓ (???)
• statistical efficiency ✓ (GMM estimator, using SGD with averaging)
• computational efficiency ✓ (SGD)
• numerical stability ✓ (implicit/proximal SGD)
Questions?