A reunification of optimization and spectral methods: optimal learning in partially observable settings

Dustin Tran (draft)
April 30, 2015
Harvard University, School of Engineering and Applied Sciences


How can we best learn parameters in a latent variable model?


Motivation

In statistical estimation theory, the ideal estimator satisfies four requirements:
• consistency
• statistical efficiency
• computational efficiency
• numerical stability


Motivation

The EM algorithm satisfies:
• consistency ✗
• statistical efficiency ✓ (assuming a global optimum)
• computational efficiency ✗
• numerical stability ✓


Motivation

Variational inference satisfies:
• consistency ✗
• statistical efficiency ✓ (assuming a global optimum; SVI with averaging)
• computational efficiency ✓ (SVI)
• numerical stability ✓ (implicit/proximal SVI)


Motivation

Markov chain Monte Carlo satisfies:
• consistency ✓
• statistical efficiency ✓
• computational efficiency ✗
• numerical stability ✗ (??)


What are spectral methods?

They are matrix (tensor) decompositions which generate sample moments, i.e., observable representations that are functions of the parameters.
• One can solve these for the parameters and asymptotically recover the ground truth in expectation, i.e., they lead to consistent estimators.
• They are consistent estimators for (certain) nonconvex problems!


Motivation

Spectral methods satisfy:
• consistency ✓
• statistical efficiency ✗
• computational efficiency ✓
• numerical stability ✗


Can we leverage the optimization viewpoint to obtain statistical efficiency and numerical stability?


Motivation

Given a cost function ℓ(θ; Y), one can:
• add regularization/priors f(θ);
• add a weighted distance (generalized method of moments);
• use efficient and numerically stable optimization routines (stochastic gradient methods).


Background: Generalized method of moments (GMM)

Let Y_1, ..., Y_N be (d+1)-dimensional observations generated from some model with unknown parameters θ ∈ Θ. The k moment conditions for a vector-valued function g(Y, ·) : Θ → R^k are

    m(θ*) := E[g(Y, θ*)] = 0_{k×1}.

The observable representations, or sample moments, are

    m̂(θ) := (1/N) ∑_{n=1}^N g(Y_n, θ).


Background: Generalized method of moments (GMM)

We aim to minimize the normed distance ‖m̂(θ)‖ for some choice of ‖·‖. Define the weighted Frobenius norm as

    ‖m̂(θ)‖²_W := m̂(θ)ᵀ W m̂(θ),

where W is a positive definite matrix. The generalized method of moments (GMM) estimator is

    θ_gmm = arg min_{θ ∈ Θ} ‖m̂(θ)‖²_W.


Background: Generalized method of moments (GMM)

Under standard assumptions*, the GMM estimator θ_gmm is consistent and asymptotically normal. Moreover, if

    W = E[g(Y_n, θ*) g(Y_n, θ*)ᵀ]⁻¹,

then θ_gmm is statistically efficient in the class of asymptotically normal estimators (using only the information in the moment conditions!).


Hidden Markov model

• Time t ∈ {1, 2, ...}
• Hidden states h_t ∈ {1, ..., m}
• Observations x_t ∈ {1, ..., n} (m ≤ n)

Given full-rank matrices T ∈ R^{m×m} and O ∈ R^{n×m}, the dynamical system for HMMs is given by

    T_ij := P[h_{t+1} = i | h_t = j]  ⟺  P[h_{t+1}] = T P[h_t],
    O_ij := P[x_t = i | h_t = j]      ⟺  P[x_t] = O P[h_t],

and π := P[h_1] ∈ R^m is the initial state distribution.

Goal: Estimate the joint distribution P[x_{1:t}], which also allows one to make predictions: P[x_{t+1} | x_{1:t}] = P[x_{1:t+1}] / P[x_{1:t}].


Hidden Markov model

[figure]


Hidden Markov model

Notation: P[x, y, ···]_{ij···} := P(x = i, y = j, ···).

Theorem. The joint probability matrix P[x_{t+1}, x_t] satisfies the following for all columns j ∈ {1, ..., n}:

    P[x_{t+1}, x_t]_{·j} = Φ_j P[x_t],    Φ_j := O T diag(O_{j·}) O†.

Furthermore,

    P[x_{t+1}, x_t, x_{1:t−1}]_{·jk⃗} = Φ_j P[x_t, x_{1:t−1}]_{·k⃗},

where k⃗ represents a sequence of states for x_{1:t−1}, i.e., x_1 = k_1, ..., x_{t−1} = k_{t−1}.
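To make the observable operator Φ_j concrete, here is a minimal numerical sketch (not part of the original slides) that builds a small random HMM and checks the column identity from the theorem above. The matrices T and O follow the slides' notation; the particular distribution used for P[h_t] is an arbitrary assumption of the sketch, since the identity holds for any state distribution.

```python
# Sketch: verify P[x_{t+1}, x_t]_{.j} = Phi_j P[x_t] with Phi_j = O T diag(O_{j.}) O^+
# on a randomly generated HMM (assumed setup, for illustration only).
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5                                   # hidden states, observations (m <= n)

T = rng.random((m, m)); T /= T.sum(axis=0)    # column-stochastic transition matrix
O = rng.random((n, m)); O /= O.sum(axis=0)    # column-stochastic emission matrix
h = rng.random(m); h /= h.sum()               # an arbitrary P[h_t] at some fixed time t

Px = O @ h                                    # P[x_t] = O P[h_t]
# Joint matrix: P[x_{t+1} = i, x_t = j] = sum_{k,l} O_{ik} T_{kl} O_{jl} P[h_t = l]
Pjoint = O @ T @ np.diag(h) @ O.T

for j in range(n):
    Phi_j = O @ T @ np.diag(O[j, :]) @ np.linalg.pinv(O)
    assert np.allclose(Pjoint[:, j], Phi_j @ Px)
print("observable-operator identity verified column by column")
```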
Hidden Markov model

We aim to solve

    min_{Φ_j} ‖P̂[x_{t+1}, x_t, x_{1:t−1}]_{·jk⃗} − Φ_j P̂[x_t, x_{1:t−1}]_{·k⃗}‖_F
    such that rank(Φ_j) ≤ m

for all j ∈ {1, ..., n}. By the Eckart–Young–Mirsky theorem, this is equivalent to solving

    min_{Φ_j} ‖U_jᵀ P̂[x_{t+1}, x_t, x_{1:t−1}]_{·jk⃗} − Φ_j U_jᵀ P̂[x_t, x_{1:t−1}]_{·k⃗}‖_F,

where U_j is the matrix of the m left-singular vectors of

    E[P̂[x_{t+1}, x_t, x_{1:t−1}]_{·jk⃗} P̂[x_t, x_{1:t−1}]_{·k⃗}ᵀ].

(A code sketch of this unweighted step appears after the closing slide.)


Hidden Markov model

• It's convex! (The optimization leads to a consistent estimator.)
• This recovers the same (???) parameters as the method-of-moments estimation derived in Hsu et al. (2009).


Hidden Markov model

Use the weighted Frobenius norm instead for GMM estimation:

    min_{Φ_j} ‖P̂[x_{t+1}, x_t, x_{1:t−1}]_{·jk⃗} − Φ_j P̂[x_t, x_{1:t−1}]_{·k⃗}‖²_W
    such that rank(Φ_j) ≤ m.


Hidden Markov model

And add your favorite regularizers/priors!

    min_{Φ_j} ‖P̂[x_{t+1}, x_t, x_{1:t−1}]_{·jk⃗} − Φ_j P̂[x_t, x_{1:t−1}]_{·k⃗}‖²_W + α‖Φ_j‖_1 + (1 − α)‖Φ_j‖²_F
    such that rank(Φ_j) ≤ m.


Hidden Markov model

    min_{Φ_j} ‖P̂[x_{t+1}, x_t, x_{1:t−1}]_{·jk⃗} − Φ_j P̂[x_t, x_{1:t−1}]_{·k⃗}‖²_W
    rank(Φ_j) ≤ m

1. This extends naturally to predictive state representations (Boots and Gordon, 2010) (??? work in progress).
2. Remark on local optima in the optimization:
   • In the unweighted case (MoM), any local optimum is necessarily global.
   • In the weighted case (GMM), this no longer holds (Srebro and Jaakkola, 2003).


    min_{Φ_j} ‖P̂[x_{t+1}, x_t, x_{1:t−1}]_{·jk⃗} − Φ_j P̂[x_t, x_{1:t−1}]_{·k⃗}‖²_W
    rank(Φ_j) ≤ m

If the weighting introduces local optima, then what's the point?


Conclusion

Reframing moment estimation from spectral methods as optimization leads to an alternative viewpoint on improving estimation:
• consistency ✓ (???)
• statistical efficiency ✓ (GMM estimator, using SGD with averaging)
• computational efficiency ✓ (SGD)
• numerical stability ✓ (implicit/proximal SGD)


Questions?
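As a companion to the unweighted "We aim to solve" slide, here is a minimal sketch (not part of the original talk) of the rank-constrained least-squares step, solved by classical reduced-rank regression: an ordinary least-squares fit followed by an Eckart–Young projection onto the top-m left-singular subspace of the fitted values. The matrices A and B stand in for the empirical moments P̂[x_{t+1}, x_t, x_{1:t−1}]_{·jk⃗} and P̂[x_t, x_{1:t−1}]_{·k⃗}; the synthetic data (dimensions, noise level) are assumptions made purely for illustration.

```python
# Sketch:  min_Phi ||A - Phi B||_F  s.t.  rank(Phi) <= m,
# via reduced-rank regression (OLS fit + Eckart-Young projection).
import numpy as np

def reduced_rank_fit(A, B, m):
    Phi_ls = A @ np.linalg.pinv(B)            # unconstrained least-squares solution
    fitted = Phi_ls @ B                       # OLS fitted values
    U, _, _ = np.linalg.svd(fitted, full_matrices=False)
    Proj = U[:, :m] @ U[:, :m].T              # projector onto top-m left-singular subspace
    return Proj @ Phi_ls                      # rank-m estimate of Phi

# Illustrative synthetic "moments" (assumed, not from the slides).
rng = np.random.default_rng(0)
n, m, K = 8, 3, 2000                          # observations, rank, number of histories k
Phi_true = rng.random((n, m)) @ rng.random((m, n))        # rank-m ground truth
B = rng.random((n, K))
A = Phi_true @ B + 0.01 * rng.standard_normal((n, K))     # noisy future moments

Phi_hat = reduced_rank_fit(A, B, m)
print("rank:", np.linalg.matrix_rank(Phi_hat),
      "relative error:", np.linalg.norm(Phi_hat - Phi_true) / np.linalg.norm(Phi_true))
```

The weighted and regularized variants on the later slides would replace this closed-form projection with an iterative solver (e.g., implicit/proximal SGD), since the weighted low-rank problem no longer admits a closed-form solution and may have non-global local optima.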