Estimating DSGE Models with Stan Ben Goodrich∗ and Carlos Montes-Galdón† 1 Introduction This paper is markedly incomplete, but we will show how relatively new Bayesian software called Stan can be used to efficiently draw from the posterior distribution of challenging Dynamic Stochastic General Equilibrium (DSGE) models and similarly challenging posterior distributions in economics and other fields. Most attempts to estimate the parameters of a DSGE model have used a random-walk Metropolis-Hastings approach, which we strongly believe is inadequate in the sense that the Markov Chain does not mix well enough to confidently obtain reliable inferences. There have been some recent attempts to impove the mixing of Metropolis-Hastings in DGSE models (see [Chib and Ramamurthy(2010), Herbst(2011), Herbst and Schorfheide(2014)]). A few of these have started to use the gradient and Hessian of the posterior distribution to make “better” proposals in the sense that the proposals can be relatively far from the current state of the Markov Chain but are still accepted with a relatively high probability. The most advanced implementation of this idea to date is the Hamiltonian Monte Carlo algorithm implement in Stan. Stan has several important features that make it appropriate even for difficult posterior distributions: • All the derivatives are handled by automatic (sometimes called algorithmic) differentiation, so the user only has to specify the kernel of a (log) posterior distribution and does not need to worry about symbolic derivatives at all. Automatic differentiation is much faster and more accurate than numerical differentiation and is feasible for any posterior distribution that is differentiable everywhere. • Stan automatically adjusts the tuning parameters for Hamiltonian Monte Carlo during the warmup period. These tuning parameters include the stepsize and the innovation variances in order to achieve a target acceptance probability. By default, the target acceptance probability is 0.8, which is much higher than other Metropolis-Hastings algorithms, and yet the proposals are relatively far from the current state of the Markov Chain because they are governed primarily by the curvature of the posterior distribution. ∗ Lecturer, Columbia University European Central Bank † Researcher, 1 2 • Hamiltonian Monte Carlo output includes several additional quantities that make it easier to diagnose when the Markov Chain is not working well. These diagnostics are especially important in difficult problems like DSGE models. We have written (and are in the process of cleaning up) MATLAB code that will write the text of a Stan program given the solution to a DSGE model in symbolic form. We hope to eventually do the same with Dynare++. These tools will hopefully make it easier to estimate DSGE models with state of the art Markov Chain Monte Carlo algorithms. Stan is in the process of implementing Riemannian Manifold Monte Carlo, which generalizes Hamiltonian Monte Carlo to utilize the second derivatives of the (log) posterior distribution rather than merely the gradient. Riemannian Manifold Monte Carlo may well be essential for estimating DSGE models whose posterior distributions change to rapidly over the parameter space to be succesfully sampled by Hamiltonian Monte Carlo schemes. The outline of the paper is as follows. We briefly discuss the pertubation approach to DSGE models that has been thoroughly explained by [Klein(2000)] or [Schmitt-Grohé and Uribe(2004)]. A major stumbling block that we only recently overcame is implementing the approach of [Klein(2000)] or [Schmitt-Grohé and Uribe(2004)] using real numbers only. We then discuss various frequentist and Bayesian ways of estimating a linearized DSGE model and explain in some detail why we are optimistic that Stan will sample from difficult posterior distributions more efficiently than MetropolisHastings but with little to no additional burden on the researcher. The last section has an example of a Real Business Cycle model that we can estimate with Stan and will be expanded to include additional examples. There are no actual results yet, but our code has reached “proof of concept” stage and we expect it to be in “beta” stage by early May. 2 DSGE models and their solution In this paper, we are not trying to innovate the theory or the specification of DSGE models. Our innovation pertains to the estimation of DSGE models. Nevertheless, in this section we briefly explain a conventional approach to specifying and solving a DSGE model. A DSGE model consists of a rich set of structural difference equations that describe an economy in a coherent way, which considers economic agents as optimizing entities, who form their own expectations about the future, and that takes into account market clearing conditions. In the past years, computational advances have allowed economic researchers to build medium and large scale DSGE models that have become the standard tool in policy analysis (especially in Central Banks). The fact that solution and estimation methods of DSGE models have developed very fast in the last years has helped to their rapid development. Moreover, compared to reduced form models, DSGE models have proved to fit the data as well or better than Vector Autoregressions, they have better forecasting power, and they provide a stylized theory about the economy which helps us to understand better the interaction among different variables and the transmission mechanisms of economic shocks. 3 However, finding the solution of the set of equations that form a DSGE model is not a trivial task. Most of the difference equations are highly non linear, and it is not possible to find a closed form solution for the dynamic system that they describe. Most of the literature considers linearization methods to transform the non linear model to a linear one, and to find the solution in a relatively easy way. Once the model is written as a linear approximation, then it is also possible to find a procedure to estimate the deep parameters of the model that determine the equilibrium relationships among the different variables using a likelihood approach. While it is possible to solve some models using non linear methods (value function iteration), this approach is generally limited to small scale models, and in principle a likelihood approach is not available to estimate them. Following [DeJong and Dave(2011)], the DSGE model can be written a system of nonlinear difference equations that depend on a vector of parameters Θ, EΨ(zt+1 , zt ; Θ) = 0 (1) ¯ Θ) = 0, where z¯ is where E stands for the expectation operator. The system has a steady state, Ψ(z; a function of the deep parameters of the system. Defining wt = zt − z¯ (deviations of variables from its steady state value), our aim is to find a linearised solution such that, AEwt+1 = Bwt . (2) Following [Judd(1998)] or [Schmitt-Grohé and Uribe(2004)], we can find 2 using a Taylor series expansion of the original non linear system around the steady state value1 , Ψ(z¯ ) + ∂Ψ ∂Ψ (z¯)(zt+1 − z¯) ≈ 0 (z¯)(zt − z¯) + ∂zt ∂zt+1 and we can define A= ∂Ψ (z¯), ∂zt+1 B=− ∂Ψ (z¯). ∂zt Finally, once the model has been linearised we want to find a representation of the dynamic behavior of the variables as, wt+1 = F (Θ, w¯ )wt + G (Θ, w¯ )vt+1 (3) where vt is a vector of exogenous structural disturbances in the model, and F (Θ, w¯ ), G (Θ, w¯ ) are matrices formed by the deep parameters. As in [Schmitt-Grohé and Uribe(2004)], we include two types of variables in a DSGE model: predetermined variables (xt2 ) and control variables (xt1 ). Then, we can partition 3 as, xt1 xt2+1 = gx xt2 = h x xt2 + Σvt+1 (4) 1 We show here a first order approximation. However, higher order approximations can be performed as in [Andreasen et al.(2013)Andreasen, Fernández-Villaverde, and Rubio-Ramírez], [Fernández-Villaverde and Rubio-Ramírez(2007)] or [Schmitt-Grohé and Uribe(2004)] 4 In order to solve for gx and h x from the matrices A and B it is possible to follow the approach in [Klein(2000)] or [Schmitt-Grohé and Uribe(2004)], who use a generalized Schur (QZ) decomposition of A and B in order to find the stable solution of the model. As noted in [Klein(2000)], it is somewhat easier to use a complex Schur (QZ) decomposition of A = QSZ and B = QTZ, where both Q and Z are unitary and both S and T are upper triangular, and there is code in a variety of languages to do so that have been contributed by Chris Sims, Paul Cline, Martín Uribe and Stephanie Schmitt-Grohé that has been incorporated and extended by the Dynare team. However, the software we use to estimate DSGE models cannot handle derivatives of complex functions and thus we are forced to use a real Schur (QZ) decomposition of B = QSZ and A = QTZ, where both Q and Z are orthonormal but only T is upper triangular whereas S is upper Hessenberg, meaning it may have some non-zero entries immediately below its diagonal. [Klein(2000)] glosses over the implementation details of the real Schur decomposition so we will state them here. There are many implementations of the real Schur decomposition that all trace their roots to [Moler and Stewart(1973)]. We use the C++ implementation in Eigen, which supplies a good portion of the supporting code for Stan. Eigen includes (although it is not exposed) a method to rotate the decomposition with Givens rotations that preserve the orthogonality of Q and Z such that any zeros along the diagonal of T are pushed to the lower right corner. This “zerochasing” scheme may be originally attributable to [Golub and Van Loan(2012)]. Zeros along the diagonal of T imply that A is singular and thus chasing them to the lower-right corner makes it possible to partition the decomposition as " B=Q S11 S12 0 S22 # " Z and A=Q T11 T12 0 T22 # Z, where T22 have have all zeros along its diagonal (but may have non-zero elements above the diagonal) and T11 has no zeros on its diagonal. If A−1 exists, our task would be more straightforward, but in many DSGE models it does not. With some abuse of notation we use T −1 for the “inverse” of −1 T even when there is no unique solution to T22 T22 = I. As we shall see, T22 becomes irrelevant for our purposes. In other words, at least the upper-left corner of A−1 B = Z > T −1 Q> QSZ = Z > T −1 SZ is welldefined and a diagonal matrix of generalized eigenvalues (λ) could be constructed from the diagonal blocks of T −1 S such that AVλ = BV and V is unitary. However, λ would, in general, entail complex numbers that we need to avoid, so we convert to what Eigen calls a “pseudo eigen˙ = BV˙ and D is block diagonal with 1 × 1 or 2 × 2 blocks on its didecomposition” where AVD agonal.2 We can then specify a permutation matrix P in Eigen, such that PP> = I, implying > DPP> V ˙ V˙ > = VPP ˙ ˙ > = V¨ D ¨ V¨ > and the diagonal blocks of D ¨ = P> DP are increasing in the VD absolute value of the generalized eigenvalues. The generalized eigenvalues in the bottom right ¨ may be infinite, but it is more important to classify the finite generalized eigenvalues corner of D depending on whether they are less than or greater than unity in absolute value. We can repartition 2 Technically, Eigen uses the phrase pseudo eigen-decomposition to refer to the standard eigenvalue problem where B = I , but the internal Eigen code is was fairly easy to adapt to this generalized eigenvalue problem. 5 the decomposition as A h V¨1 V¨2 i " ¨ 11 D 0 0 ¨ D22 # = B h V¨1 V¨2 i , ¨ 11 are less than unity in absolute values and all the eigenvalues of where all the eigenvalues of D # " V¨11 ¨ ¨ can be further D22 are greater than unity (and possibly infinite) in absolute value. V1 = V¨12 ¨ 11 V¨ −1 and gx = V¨12 V¨ −1 in equation 4. partitioned such that h x = V¨11 D 11 11 3 Ways of estimating DSGE models DSGE models can generally be written as a state space form. Suppose that we have a set of observable variables that have a theoretical counterpart in the model, yt . Note that many times we can not observe all of the economic variables defined in the model (for example, total factor productivity which is a theoretical measure that does not have an empirical counterpart such as output or consumption growth). Given the set of observable variables, we can write the state space form of the model as, yt w t +1 = Swt + ε t F (Θ, w¯ )wt + G (Θ, w¯ )vt+1 = where S is a selection matrix that relates the variables in the model to the observed data, and ε t are measurement errors. The first equation is referred as the measurement equation, and the second one is the transition equation. The main goal of this section is to describe how to estimate Θ, the diverse methods that have been used to date, and our proposal using Hamiltonian Monte Carlo methods. The first step is to construct the likelihood function for the data, p(y T |Θ). Once it is constructed, the likelihood can be maximized using numerical methods or it can be combined with a prior distribution for Θ, p(Θ), to form a posterior distribution p(Θ|y T ) that can be sampled from using Markov Chain Monte Carlo (MCMC). However, building the likelihood function is not a trivial task, since the latent states have to be integrated out. That is, p(y T |Θ) T = ∏ p ( y t | y t −1 | Θ ) t =1 T ˆ = ∏ p(yt |wt , Θ) p(wt |yt−1 , Θ)dwt . (5) t =1 Under the assumption that both ε t and vt+1 follow a Gaussian distribution, the model is linear, and 6 the likelihood function can be evaluated recursively using the the Kalman Filter (see for example, [Hamilton(1994)] or [Harvey(1990)]) When the DSGE model is linearised to an order higher than one, then the transition equation is no longer linear, and therefore, the Kalman Filter can not be used to evaluate the conditional likelihood 5. In this case, the literature uses a Particle Filter to approximate the integrals (for a textbook exposition, see for example [Fernández-Villaverde and Rubio-Ramírez(2007)]). 3.1 Frequentist approach In the frequentist approach, the likelihood function 5 can be maximized numerically using any optimization technique. However, this seemingly easy approach has several pitfalls, which are pointed out in [Fernández-Villaverde(2010)]. The likelihood function has a substantial number of parameters in a medium or large-scale DSGE model, and therefore, it might be difficult to find a global maximum among a considerable number of local maxima. The likelihood function can also be flat or essentially flat in many areas of the parameter space, which raises additional computational difficulties and may preclude a unique estimator of the parameters. Moreover, the estimated standard errors based on multivariate normal approximations may be an inaccurate measure of uncretainty about the parameters. The observable macroeconomic time series are not especially long, and thus the favorable asymptotic properties of maximum likelihood estimators may not hold. For these reasons, the literature and central banks are increasingly turning to Bayesian teachniques to estimate DSGE models. 3.2 Previous Bayesian approaches We refer to [Herbst and Schorfheide(2014)], which explains the Bayesian approach to DSGE models in book form. In general, the likelihood function is multiplied by a prior distribution over the parameters, and Bayes’ Theorem implies a posterior distribution p(Θ|y T ) ∝ p(y T |Θ) p(Θ) The prior distribution is usually chosen using economic theory or results from previously-estimated models. The prior distribution helps to concentrate the posterior distribution in areas of the parameter space where there is a unique, stable solution to the DSGE model. For example, in a standard New Keynesian model, the conduct of monetary policy is typically specified as a Taylor rule, so that, in its simplest form, the nominal interest rate is a function of the inflation rate. In this situation, it is well known that the semielasticity of the interest rate of the inflation rate should be greater than unity in order to bring about a unique, stable equilibrium in the dynamic model that is consistent with the relatively stable monetary policy we observe. In general, in economics and other fields, a (hybrid) Gibbs sampler is created by deriving the fullconditional distributions of as many of the parameters as possible. The Gibbs sampler proceeds by 7 drawing from each full-conditional distribution in turn and falling back to some other sampling technique to draw parameters whose full-conditional distributions are not known or is known but is not easy to draw from. However, in the case of a DSGE model, F (Θ, w¯ ) and G (Θ, w¯ ) are highly non-linear combinations of the structural parameters, Θ, and it is essentially impossible to derive the full-conditional distribution of any parameter of the model. Therefore, the majority of the Bayesian literature on DSGE models uses the Random Walk Metropolis Hastings (RWMH) approach to explore the posterior distribution of Θ, which can be summarized as follows. Suppose that we have performed j simulations. Then, 1. Draw a vector of proposed parameters, Θ j using any of the methods outlined below. (a) Solve for the matrices of the linearised form of the model, given Θ j , and the implied steady state. (b) Evaluate the likelihood function, using the Kalman Filter if a first order Taylor expansion is used, or a Particle Filter otherwise. (c) Evaluate the prior distribution p(Θ j ), which when combined with the evaluated likelihood is proportional to the posterior distribution at Θ j 2. Accept the proposed draw Θ j with probability p(Θ j |y T ) p(Θ|y T ) × q(Θ|Θ j ) , q(Θ j |Θ) where q () is the density of the proposal distribution. However, the RWMH algorithm only mitigates many of the problems that plague the maximum likelihood estimator. The Markov Chain may become stuck in a flat area or local maximum of the posterior distribution for an indefinite number of iterations and thus not fully explore the posterior distribution in a remotely reasonable amount of time. In addition, the RWMH algorithm may not mix throughout the parameter space at a fast enough rate for the Markov Chain Monte Carlo Central Limit Theorem (MCMC-CLT) to hold, in which case it is difficult to characterize our uncertainty over how closely the empirical means of functions of the Markov Chain adhere to the corresponding analytical means (if it were even possible to derive them in closed-form). At the extreme, a RWMH algorithm could easily not have converged to the posterior distribution when the researcher stops to analyze the output. [Chib and Ramamurthy(2010)] recognizes these problems with the standard RWMH and proposes a variation of it. The researcher first specifies a number of subvectors that Θ can be partitioned into. If there are B mutually exclusive and exhaustive blocks, then we can write Θ = SB b =1 Θb . The difference in this “tailored” Metropolis-Hastings appoach is that at each iteration, the parameters are randomly allocated into B blocks — without regard to the current state of the Markov Chain — j and only Θb is proposed, leaving the other blocks at their current values. This process diminishes the change that the Markov Chain will become stuck in a region of the parameter space for an indefinite number of iterations. [Chib and Ramamurthy(2010)] reports that the autocorrelation of the draws dissapates after 30 or 40 iterations for most of the 36 parameters in a DSGE model. 8 [Herbst(2011)] uses a blocking scheme where the parameters are allocated into B blocks based on posterior correlations estimated during the burnin period and reports improved performance. Other alternatives to RWMH use the information in the shape of the posterior distribution to generate better proposals for the parameters, rather than drawing them blindly from a multivariate normal distribution. The Metropolis-Adjusted Langevin (MAL) approach utilizes the gradient and Hessian (evaluated at the posterior mode) to shift the mean vector of the multivariate normal proposal distribution. Another variant on this approach is to evaluate the Hessian at the current state of the Markov Chain. 3.3 Our approach [Herbst and Schorfheide(2014)] concludes its survey of Metropolis-Hastings based approaches by referring to Hamiltonian and Riemannian Monte Carlo Preliminary trials with both of these algorithms suggested, that, yes, they represented a statistical improvement from the algorithms presented above. However, the computational burden associated with them outweighed any gains. For example, computing the information matrix — a requirement for the Riemann-based method — of a DSGE is still too cumbersome to be used within an MH algorithm. Computational gains in these kinds of calculations, a subject of on-going research, might make them more feasible, and thus, preferred. (72) In this subsection, we argue that Hamiltonian Monte Carlo (HMC) is feasible today using Stan and Stan’s implementation of Riemannian Manifold Monte Carlo (RMMC) will likely become feasible in 2015 or 2016. Stan implements automatic differentiation to evaluate derivatives of the (log) posterior distribution. This section borrows from the description of Stan in [Goodrich et al.(2012)Goodrich, Katznelson, and Wawro]. For a comprehensive review of HMC, see [Neal(2011)]. The theoretical justification for Stan is given in [Hoffman and Gelman(2014)] and [Carpeter(2015)]. Physicists can model the evolution over time of a closed system given a particle’s position and momentum using functions measuring its potential and kinetic energy. It turns out that the same intuition and mathematics can be applied to sample from a posterior distribution via MCMC, which is the characteristic that distinguishes HMC from other MCMC algorithms. HMC augments the usual expresion for Bayes’ Theorem p ( Θ| data) ∝ p (Θ) f ( data| Θ) with an auxiliary vector (ρ) of “momentum” variables — one for each of the parameters — to specify the joint posterior distribution of Θ and ρ, which is p ( Θ, ρ| data) ∝ p (Θ, ρ) p ( data| Θ, ρ). These auxiliary variables are computationally important, but they are independent of Θ and the data, allowing us to write p ( Θ, ρ| data) ∝ p (ρ) p (Θ) p ( data| Θ). The researcher must specify priors for the parameters Θ and ρ but unlike Gibbs sampling, it is not necessary to specify the priors in such a way 9 Table 1: Hamiltonian Dynamics Concepts Mapped Between Physics and MCMC Concept Object Dimensions Position Momentum Potential Energy Kinetic Energy Physics Particle 2D / 3D q p U (q) K (p) Bayesian Statistics Parameters Any dimensionality Θ ρ − ln [ p ( Θ, ρ| data)] − ln [N (0, M)] w/ M diagonal that the full-conditional distributions of the parameters are known in closed forms. This feature is important in a DSGE context where it is impossible to derive full-conditional distributions of the elements of Θ anyway. The most convenient prior for the momentum variables is a multivariate normal with mean zero and covariance matrix M. HMC algorithms typically specify a fixed M at the outset, which is often M = mI for some chosen value of the tuning parameter, m. By default, Stan assumes M is diagonal and the diagonal entries are tuned automatically during the warmup phase of the MCMC algorithm so that the diagonal entries are approximately equal to the posterior variances of the elements of Θ. Since ρ is independent of Θ a priori and does not enter the likelihood, the fullconditional distribution of ρ is just its prior distribution, namely multivariate normal. Thus, ρ can be updated with a Gibbs-step, which is why HMC is also sometimes taken as an abbreviation for Hybrid Monte Carlo because Θ is updated in an entirely different way, conditional on ρ. The relationship between Hamiltonian dynamics in physics and Hamiltonian dynamics in MCMC is outlined in Table 1. Physics specifies a position vector (q) of a particle, a momentum vector (p) for that particle, a function for the potential energy (U (q)) of the particle, and a function for the kinetic energy (K (p)) of the particle. The Hamiltonian function, which measures total energy, is typically additive, H (p, q) = U (q) + K (p), which leads to the following system of differential equations, called the “Hamiltonian equations”: dqi dt dpi dt = = h i ∂H (q, p) = M −1 p , ∂t i ∂H (q, p) ∂U (q) − =− , ∂t ∂qi where M is interpreted as a “mass” matrix for the particle. These differential equations characterize where the particle moves in continuous time (t), but it is typically necessary to make a discrete-time approximation of this system in order to solve for the new position, momentum, potential energy, and kinetic energy. HMC utilizes the same discrete-time approximation to construct a MCMC algorithm, where the “potential energy” function is the negative log posterior distribution, U (Θ, ρ) = − ln [ p ( Θ, ρ| data)], and the “kinetic energy” function h i is the negative log prior 1 ρ > M −1 ρ − distribution for the momentum variables, K (ρ) = − ln c × e 2 . For both of these distributions, normalizing constants (c) can be included or ignored because they have no effect on the dynamics. 10 Each iteration of the MCMC algorithm requires a solution to the Hamiltonian equations and con sists of three distinct steps. First, a proposed momentum vector ρ j is drawn from its multivariate normal distribution. Second, a proposed location vector Θ j is drawn, given a realization of the proposed momentum vector that “pushes” the location of the parameters in accordance with the gradient of the log posterior. Third, a Metropolis-Hastings test is conducted to determine whether to accept the proposed momentum and location or to stay at the current location (Θ, ρ). There are several variants of the HMC algorithm, which differ primarily in how they implement the second step. A simple implementation of the second step uses the so-called “leapfrogging” algorithm to solve a discrete-time approximation of the Hamiltonian equations. The leapfrogging algorithm specifies a positive integer for the number of steps ( L) in time and a positive stepsize (e). The leapfrogging process presents three challenges for a HMC algorithm. First, it requires the gradient of the negative log-posterior, which is often difficult or impossible to calculate in closed form. Numeric gradients can be inaccurate and are computationally expensive when there are many parameters. Stan overcomes this problem by utilizing reverse-mode automatic differentiation and C++ template metaprogramming.3 Automatic differentiation exploits the fact that a software program only needs to know a few rules for differentiation, and then a compiler can construct a (potentially large) expression tree that calculates the gradient via the chain rule. Thus, applied researchers do not need to specify any derivatives to use Stan. Programmers also do not need to specify any derivatives to contribute to Stan, although it is often possible to make substantial improvements in computation time if some partial derivatives are known analytically and can be inserted into the expression tree that evaluates to the gradient. The second challenge is how to choose L in a HMC algorithm. If L is too small, Θ j will be too close to Θ, and the Markov Chain will mix slowly through the parameter space. If L is too large, a lot of computation time is wasted generating a proposal that is no more likely to be accepted. Stan allows the user to specify L, but by default, Stan endogenizes L with an algorithm called No U-Turn Sampling (NUTS), which is described in more detail in [Hoffman and Gelman(2014)]. The intuition of NUTS is to move Θ j both “forward” and “backward” (i.e. ±e) , repeatedly doubling the number of leapfrog steps until some proposal makes a U-turn (and, if not stopped, would start moving roughly toward Θ again). Stan then utilizes a slice sampling mechanism to randomly select some point along the path of proposals it has traced and decides whether to accept this proposal with a Metropolis-Hastings acceptance test, although the proposal distribution is symmetric so technically it is just a Metropolis test. The slice sampling mechanism adds a layer of complication but preserves the essential properties of a Markov Chain. The third challenge is how to choose e. As e ↓ 0, the discrete-time approximation of the solution to the Hamiltonian equations approaches a continuous-time process and all proposed locations will be accepted. However, if e is too small, Θ j will be too close to Θ, and the Markov Chain will mix slowly. If e is too large, the approximation can become dislodged from the true Hamiltonian dynamics and the resulting Θ j is unlikely to be accepted, although it is still a valid proposal. Stan 3 Forward-mode automatic differentiation will be used to implement Riemannian Manifold Monte Carlo. 11 allows the user to specify e at the outset, but by default, Stan utilizes the warmup period of the Markov Chain to tune e. Essentially, Stan tries to tune e so that 80% of proposals are accepted by default, which is optimal under certain assumptions and contrasts to an optimal acceptance rate of 23.4% in Random Walk Metropolis algorithms. Thus, Stan makes it possible for users to simply define a log posterior distribution (with or without normalizing constants) and then obtain samples from that posterior distribution. By default, M and e are tuned automatically during the warmup phase of the Markov Chain and frozen thereafter. L is chosen endogenously at every iteration of the Markov Chain. Experts can specify M, e, and L at the outset and possibly obtain a somewhat more efficient algorithm, but Stan’s defaults seem to work reasonably well for a wide range of statistical problems. By relying on automatic differentiation, Stan obviates the need for the user to define a function that evaluates the gradient of the log posterior, and the computational cost of automatic differentiation is roughly proportional to the cost of evaluating the log posterior. There is one additional complication where the developers of Stan have expended considerable effort to make life easier for the user. The above sketch of Stan works if the posterior has positive mass throughout the parameter space. However, if a Markov Chain wanders into a region of the parameter space with no mass, it would get stuck because the gradient is either zero or undefined. Many statistical models are well-defined only in certain regions of the parameter space; e.g. variance parameters cannot be negative. There are two main approaches in the HMC literature to deal with this problem. The first is to erect walls that define the admissible region of the parameter space and “bounce” if the Markov Chain runs into a wall. Intuitively, this involves negating the momentum, which is legal because doing so does not affect the multivariate normal density. Stan has not implemented bouncing yet. Instead, Stan takes the second approach of reparameterizing Θ as a function of unbounded primitive parameters. For example, if a user defines σ as the standard deviation of the errors in a linear regression model, Stan automatically specifies σ = exp (e σ) where e σ is an unbounded primitive parameter that is not even seen by the user. Technically, Stan samples e σ but saves samples of σ for the user to analyze. This change-of-variables requires that the posterior distribution be multiplied by the absolute value of the determinant of the Jacobian for the transformation from the bounded parameter to the unbounded parameter, which is simply exp (e σ ) in this case. For a multivariate change-of-variables, the Jacobian determinant can be tedious to define and / or costly to evaluate, but wherever possible Stan uses transformations that yield triangular Jacobians, whose determinants are relatively easy to evaluate. In almost all cases — including those discussed in this paper — users do not need to worry about derivates at all. 4 Examples The examples in this section are empirically incomplete. As of April 1, we have just gotten the real Schur (QZ) decomposition code to work (modulo a few patches that have not been upstreamed yet) with a personal fork of the Stan library. Our procedure 12 does not crash (any more), but we have not yet had time to even preliminarily analyze the results. It appears as if it will be challenging to get DSGE models to converge within a few hours of computation time, but we have not even attempted to optimize our code yet. We will have a much more complete set of results by the May 8 conference presentation. 4.1 A Real Business Cycle (RBC) Model In this subsection, we consider a small scalle RBC model, similar to the one in [Schmitt-Grohé and Uribe(2004)]. The model considers a representative agent that maximizes an infinite period expected utility function: ∞ ∑ max E0 t =0 1− γ βt ct 1−γ where c stands for consumption, E is the expectation operator, and β, γ are two deep parameters of the model, measuring the subjective value on the future by the agent and her risk aversion, respectively. The agent is constrained by the fact that, At kαt = ct + k t+1 − (1 − δ)k t which is other way to put that total output per capita (yt = At kαt ) is equal to consumption and investment per capita (k t+1 − (1 − δ)k t ). The variable k t represents the stock of capital, At is the productivity level, α is a parameter of the model related to the use of capital in production and δ is the depreciation rate of capital. In the model, we introduce stochastic exogenous technological change as follows, At+1 − 1 = ρ( At − 1) + ηε t+1 with ε ∼ N (0, 1) and ρ an extra parameter to be estimated. The solution to the model consists on the first order conditions of the maximization problem, plus the resource constraint, the definition of output, consumption growth and output growth, and the exogenous process for productivity: −γ ct At kαt yt = −γ βEt ct+1 [αAt+1 kαt+−11 + 1 − δ] = c t + k t +1 − (1 − δ ) k t = At kαt 13 ln At+1 gc gy = ρ ln At + σε t+1 ct = c t −1 yt = y t −1 Note that this set of equations can easily be cast in the form of equation 1. In this case, we can define the vector of structural parameters of the model as Θ = {γ, β, α, δ, ρ, σ } and the vector of variables as wt = { gyt , gct , yt , ct , k t , At }. The previous set of equations is easily cast in the form of 1, and the steady state of the model is found as the solution of the system when ¯ given a vector of parameters. Then, we can find it as, wt+1 = wt = w, A¯ k¯ = 1 1 β = −1+δ ! 1−1 α α c¯ = k¯ α − δk¯ y¯ = k¯ α g¯ c = 1 g¯ y = 1 Given this steady state and a parameter vector, we can solve for the linearised version of the model using the method described in section 2. Given a vector of observable variables, the state space model can be written as before, yt w t +1 = Swt + ε t F (Θ, w¯ )wt + G (Θ, w¯ )vt+1 = In this case, we will use two variables as observed data, output growth and consumption per capita o , go ]). For the estimation we will use either artificial data or true quarterly data growth (yt = [ gyt ct from the United States, spanning the period 1948Q1-2014Q4. With this two observable variables, and using the identifiability test in [Iskrev(2010)], in principle all the parameters in Θ could be identified using the priors that we will impose in the estimation of the model. Finally, in this case, the selection matrix is given by, " S= 1 0 0 0 0 0 0 1 0 0 0 0 # 14 4.2 More Examples Coming References [Andreasen et al.(2013)Andreasen, Fernández-Villaverde, and Rubio-Ramírez] Martin dreasen, Jesús Fernández-Villaverde, and Juan Rubio-Ramírez. M An- The pruned state-space system for non-linear dsge models: Theory and empirical applications. Technical report, National Bureau of Economic Research, 2013. [Carpeter(2015)] Bob et al. Carpeter. Stan: A probabalistic programming language. 2015. [Chib and Ramamurthy(2010)] Siddhartha Chib and Srikanth Ramamurthy. Tailored randomized block mcmc methods with application to dsge models. Journal of Econometrics, 155(1):19–38, 2010. [DeJong and Dave(2011)] David N DeJong and Chetan Dave. Structural macroeconometrics. Princeton University Press, 2011. [Fernández-Villaverde(2010)] Jesús Fernández-Villaverde. The econometrics of dsge models. SERIEs, 1(1-2):3–49, 2010. [Fernández-Villaverde and Rubio-Ramírez(2007)] Jesús Fernández-Villaverde and Juan F RubioRamírez. Estimating macroeconomic models: A likelihood approach. The Review of Economic Studies, 74(4):1059–1087, 2007. [Golub and Van Loan(2012)] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press, 2012. [Goodrich et al.(2012)Goodrich, Katznelson, and Wawro] Ben Goodrich, Ira Katznelson, and Greg Wawro. Designing quantitative historical social inquiry:an introduction to stan. 2012. [Hamilton(1994)] James Douglas Hamilton. Time series analysis, volume 2. Princeton university press Princeton, 1994. [Harvey(1990)] Andrew C Harvey. Forecasting, structural time series models and the Kalman filter. Cambridge university press, 1990. [Herbst(2011)] Edward Herbst. Gradient and Hessian-based MCMC for DSGE Models. Unpublished manuscript, 2011. [Herbst and Schorfheide(2014)] Edward Herbst and Frank Schorfheide. Bayesian Estimation of DSGE Models. Book manuscript, 2014. [Hoffman and Gelman(2014)] Matthew D Hoffman and Andrew Gelman. The no-u-turn sampler: Adaptively setting path lengths in hamiltonian monte carlo. The Journal of Machine Learning Research, 15(1):1593–1623, 2014. 15 [Iskrev(2010)] Nikolay Iskrev. Local identification in dsge models. Journal of Monetary Economics, 57(2):189–202, 2010. [Judd(1998)] Kenneth L Judd. Numerical methods in economics. MIT press, 1998. [Klein(2000)] Paul Klein. Using the generalized schur form to solve a multivariate linear rational expectations model. Journal of Economic Dynamics and Control, 24(10):1405–1423, 2000. [Moler and Stewart(1973)] C.B. Moler and G.W Stewart. An algorithm for generalized eigenvalue problems. SIAM Journal of Numerical Analysis, 10:241, 1973. [Neal(2011)] Radford M Neal. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2011. [Schmitt-Grohé and Uribe(2004)] Stephanie Schmitt-Grohé and Martın Uribe. Solving dynamic general equilibrium models using a second-order approximation to the policy function. Journal of economic dynamics and control, 28(4):755–775, 2004.
© Copyright 2024