Estimating DSGE Models with Stan

Estimating DSGE Models with Stan
Ben Goodrich∗ and Carlos Montes-Galdón†
1
Introduction
This paper is markedly incomplete, but we will show how relatively new Bayesian software called
Stan can be used to efficiently draw from the posterior distribution of challenging Dynamic Stochastic General Equilibrium (DSGE) models and similarly challenging posterior distributions in economics and other fields. Most attempts to estimate the parameters of a DSGE model have used a
random-walk Metropolis-Hastings approach, which we strongly believe is inadequate in the sense
that the Markov Chain does not mix well enough to confidently obtain reliable inferences.
There have been some recent attempts to impove the mixing of Metropolis-Hastings in DGSE
models (see [Chib and Ramamurthy(2010), Herbst(2011), Herbst and Schorfheide(2014)]). A few
of these have started to use the gradient and Hessian of the posterior distribution to make “better”
proposals in the sense that the proposals can be relatively far from the current state of the Markov
Chain but are still accepted with a relatively high probability. The most advanced implementation
of this idea to date is the Hamiltonian Monte Carlo algorithm implement in Stan. Stan has several
important features that make it appropriate even for difficult posterior distributions:
• All the derivatives are handled by automatic (sometimes called algorithmic) differentiation,
so the user only has to specify the kernel of a (log) posterior distribution and does not need
to worry about symbolic derivatives at all. Automatic differentiation is much faster and more
accurate than numerical differentiation and is feasible for any posterior distribution that is
differentiable everywhere.
• Stan automatically adjusts the tuning parameters for Hamiltonian Monte Carlo during the
warmup period. These tuning parameters include the stepsize and the innovation variances
in order to achieve a target acceptance probability. By default, the target acceptance probability is 0.8, which is much higher than other Metropolis-Hastings algorithms, and yet the
proposals are relatively far from the current state of the Markov Chain because they are governed primarily by the curvature of the posterior distribution.
∗ Lecturer,
Columbia University
European Central Bank
† Researcher,
1
2
• Hamiltonian Monte Carlo output includes several additional quantities that make it easier
to diagnose when the Markov Chain is not working well. These diagnostics are especially
important in difficult problems like DSGE models.
We have written (and are in the process of cleaning up) MATLAB code that will write the text of
a Stan program given the solution to a DSGE model in symbolic form. We hope to eventually do
the same with Dynare++. These tools will hopefully make it easier to estimate DSGE models with
state of the art Markov Chain Monte Carlo algorithms. Stan is in the process of implementing Riemannian Manifold Monte Carlo, which generalizes Hamiltonian Monte Carlo to utilize the second
derivatives of the (log) posterior distribution rather than merely the gradient. Riemannian Manifold Monte Carlo may well be essential for estimating DSGE models whose posterior distributions
change to rapidly over the parameter space to be succesfully sampled by Hamiltonian Monte Carlo
schemes.
The outline of the paper is as follows. We briefly discuss the pertubation approach to DSGE models
that has been thoroughly explained by [Klein(2000)] or [Schmitt-Grohé and Uribe(2004)]. A major
stumbling block that we only recently overcame is implementing the approach of [Klein(2000)] or
[Schmitt-Grohé and Uribe(2004)] using real numbers only. We then discuss various frequentist and
Bayesian ways of estimating a linearized DSGE model and explain in some detail why we are optimistic that Stan will sample from difficult posterior distributions more efficiently than MetropolisHastings but with little to no additional burden on the researcher. The last section has an example
of a Real Business Cycle model that we can estimate with Stan and will be expanded to include
additional examples. There are no actual results yet, but our code has reached “proof of concept”
stage and we expect it to be in “beta” stage by early May.
2
DSGE models and their solution
In this paper, we are not trying to innovate the theory or the specification of DSGE models. Our
innovation pertains to the estimation of DSGE models. Nevertheless, in this section we briefly
explain a conventional approach to specifying and solving a DSGE model.
A DSGE model consists of a rich set of structural difference equations that describe an economy
in a coherent way, which considers economic agents as optimizing entities, who form their own
expectations about the future, and that takes into account market clearing conditions. In the past
years, computational advances have allowed economic researchers to build medium and large scale
DSGE models that have become the standard tool in policy analysis (especially in Central Banks).
The fact that solution and estimation methods of DSGE models have developed very fast in the
last years has helped to their rapid development. Moreover, compared to reduced form models,
DSGE models have proved to fit the data as well or better than Vector Autoregressions, they have
better forecasting power, and they provide a stylized theory about the economy which helps us to
understand better the interaction among different variables and the transmission mechanisms of
economic shocks.
3
However, finding the solution of the set of equations that form a DSGE model is not a trivial task.
Most of the difference equations are highly non linear, and it is not possible to find a closed form
solution for the dynamic system that they describe. Most of the literature considers linearization
methods to transform the non linear model to a linear one, and to find the solution in a relatively
easy way. Once the model is written as a linear approximation, then it is also possible to find a
procedure to estimate the deep parameters of the model that determine the equilibrium relationships among the different variables using a likelihood approach. While it is possible to solve some
models using non linear methods (value function iteration), this approach is generally limited to
small scale models, and in principle a likelihood approach is not available to estimate them.
Following [DeJong and Dave(2011)], the DSGE model can be written a system of nonlinear difference equations that depend on a vector of parameters Θ,
EΨ(zt+1 , zt ; Θ) = 0
(1)
¯ Θ) = 0, where z¯ is
where E stands for the expectation operator. The system has a steady state, Ψ(z;
a function of the deep parameters of the system. Defining wt = zt − z¯ (deviations of variables from
its steady state value), our aim is to find a linearised solution such that,
AEwt+1 = Bwt .
(2)
Following [Judd(1998)] or [Schmitt-Grohé and Uribe(2004)], we can find 2 using a Taylor series
expansion of the original non linear system around the steady state value1 ,
Ψ(z¯ ) +
∂Ψ
∂Ψ
(z¯)(zt+1 − z¯) ≈ 0
(z¯)(zt − z¯) +
∂zt
∂zt+1
and we can define
A=
∂Ψ
(z¯),
∂zt+1
B=−
∂Ψ
(z¯).
∂zt
Finally, once the model has been linearised we want to find a representation of the dynamic behavior of the variables as,
wt+1 = F (Θ, w¯ )wt + G (Θ, w¯ )vt+1
(3)
where vt is a vector of exogenous structural disturbances in the model, and F (Θ, w¯ ), G (Θ, w¯ ) are
matrices formed by the deep parameters.
As in [Schmitt-Grohé and Uribe(2004)], we include two types of variables in a DSGE model: predetermined variables (xt2 ) and control variables (xt1 ). Then, we can partition 3 as,
xt1
xt2+1
=
gx xt2
= h x xt2 + Σvt+1
(4)
1 We
show here a first order approximation.
However,
higher order approximations can
be
performed
as
in
[Andreasen et al.(2013)Andreasen, Fernández-Villaverde, and Rubio-Ramírez],
[Fernández-Villaverde and Rubio-Ramírez(2007)] or [Schmitt-Grohé and Uribe(2004)]
4
In order to solve for gx and h x from the matrices A and B it is possible to follow the approach in
[Klein(2000)] or [Schmitt-Grohé and Uribe(2004)], who use a generalized Schur (QZ) decomposition of A and B in order to find the stable solution of the model.
As noted in [Klein(2000)], it is somewhat easier to use a complex Schur (QZ) decomposition of
A = QSZ and B = QTZ, where both Q and Z are unitary and both S and T are upper triangular,
and there is code in a variety of languages to do so that have been contributed by Chris Sims, Paul
Cline, Martín Uribe and Stephanie Schmitt-Grohé that has been incorporated and extended by the
Dynare team. However, the software we use to estimate DSGE models cannot handle derivatives
of complex functions and thus we are forced to use a real Schur (QZ) decomposition of B = QSZ
and A = QTZ, where both Q and Z are orthonormal but only T is upper triangular whereas S is
upper Hessenberg, meaning it may have some non-zero entries immediately below its diagonal.
[Klein(2000)] glosses over the implementation details of the real Schur decomposition so we will
state them here. There are many implementations of the real Schur decomposition that all trace
their roots to [Moler and Stewart(1973)]. We use the C++ implementation in Eigen, which supplies
a good portion of the supporting code for Stan. Eigen includes (although it is not exposed) a
method to rotate the decomposition with Givens rotations that preserve the orthogonality of Q and
Z such that any zeros along the diagonal of T are pushed to the lower right corner. This “zerochasing” scheme may be originally attributable to [Golub and Van Loan(2012)]. Zeros along the
diagonal of T imply that A is singular and thus chasing them to the lower-right corner makes it
possible to partition the decomposition as
"
B=Q
S11
S12
0
S22
#
"
Z
and
A=Q
T11
T12
0
T22
#
Z,
where T22 have have all zeros along its diagonal (but may have non-zero elements above the diagonal) and T11 has no zeros on its diagonal. If A−1 exists, our task would be more straightforward,
but in many DSGE models it does not. With some abuse of notation we use T −1 for the “inverse” of
−1
T even when there is no unique solution to T22
T22 = I. As we shall see, T22 becomes irrelevant for
our purposes.
In other words, at least the upper-left corner of A−1 B = Z > T −1 Q> QSZ = Z > T −1 SZ is welldefined and a diagonal matrix of generalized eigenvalues (λ) could be constructed from the diagonal blocks of T −1 S such that AVλ = BV and V is unitary. However, λ would, in general,
entail complex numbers that we need to avoid, so we convert to what Eigen calls a “pseudo eigen˙ = BV˙ and D is block diagonal with 1 × 1 or 2 × 2 blocks on its didecomposition” where AVD
agonal.2 We can then specify a permutation matrix P in Eigen, such that PP> = I, implying
> DPP> V
˙ V˙ > = VPP
˙
˙ > = V¨ D
¨ V¨ > and the diagonal blocks of D
¨ = P> DP are increasing in the
VD
absolute value of the generalized eigenvalues. The generalized eigenvalues in the bottom right
¨ may be infinite, but it is more important to classify the finite generalized eigenvalues
corner of D
depending on whether they are less than or greater than unity in absolute value. We can repartition
2 Technically, Eigen uses the phrase pseudo eigen-decomposition to refer to the standard eigenvalue problem where B = I
, but the internal Eigen code is was fairly easy to adapt to this generalized eigenvalue problem.
5
the decomposition as
A
h
V¨1
V¨2
i
"
¨ 11
D
0
0
¨
D22
#
=
B
h
V¨1
V¨2
i
,
¨ 11 are less than unity in absolute values and all the eigenvalues of
where all the eigenvalues of D
#
"
V¨11
¨
¨
can be further
D22 are greater than unity (and possibly infinite) in absolute value. V1 =
V¨12
¨ 11 V¨ −1 and gx = V¨12 V¨ −1 in equation 4.
partitioned such that h x = V¨11 D
11
11
3
Ways of estimating DSGE models
DSGE models can generally be written as a state space form. Suppose that we have a set of observable variables that have a theoretical counterpart in the model, yt . Note that many times we can
not observe all of the economic variables defined in the model (for example, total factor productivity which is a theoretical measure that does not have an empirical counterpart such as output or
consumption growth). Given the set of observable variables, we can write the state space form of
the model as,
yt
w t +1
= Swt + ε t
F (Θ, w¯ )wt + G (Θ, w¯ )vt+1
=
where S is a selection matrix that relates the variables in the model to the observed data, and ε t are
measurement errors. The first equation is referred as the measurement equation, and the second
one is the transition equation. The main goal of this section is to describe how to estimate Θ, the
diverse methods that have been used to date, and our proposal using Hamiltonian Monte Carlo
methods.
The first step is to construct the likelihood function for the data, p(y T |Θ). Once it is constructed,
the likelihood can be maximized using numerical methods or it can be combined with a prior
distribution for Θ, p(Θ), to form a posterior distribution p(Θ|y T ) that can be sampled from using
Markov Chain Monte Carlo (MCMC). However, building the likelihood function is not a trivial
task, since the latent states have to be integrated out. That is,
p(y T |Θ)
T
=
∏ p ( y t | y t −1 | Θ )
t =1
T ˆ
=
∏
p(yt |wt , Θ) p(wt |yt−1 , Θ)dwt .
(5)
t =1
Under the assumption that both ε t and vt+1 follow a Gaussian distribution, the model is linear, and
6
the likelihood function can be evaluated recursively using the the Kalman Filter (see for example,
[Hamilton(1994)] or [Harvey(1990)])
When the DSGE model is linearised to an order higher than one, then the transition equation is no
longer linear, and therefore, the Kalman Filter can not be used to evaluate the conditional likelihood
5. In this case, the literature uses a Particle Filter to approximate the integrals (for a textbook
exposition, see for example [Fernández-Villaverde and Rubio-Ramírez(2007)]).
3.1
Frequentist approach
In the frequentist approach, the likelihood function 5 can be maximized numerically using any optimization technique. However, this seemingly easy approach has several pitfalls, which are pointed
out in [Fernández-Villaverde(2010)]. The likelihood function has a substantial number of parameters in a medium or large-scale DSGE model, and therefore, it might be difficult to find a global
maximum among a considerable number of local maxima. The likelihood function can also be flat
or essentially flat in many areas of the parameter space, which raises additional computational difficulties and may preclude a unique estimator of the parameters. Moreover, the estimated standard
errors based on multivariate normal approximations may be an inaccurate measure of uncretainty
about the parameters. The observable macroeconomic time series are not especially long, and thus
the favorable asymptotic properties of maximum likelihood estimators may not hold. For these reasons, the literature and central banks are increasingly turning to Bayesian teachniques to estimate
DSGE models.
3.2
Previous Bayesian approaches
We refer to [Herbst and Schorfheide(2014)], which explains the Bayesian approach to DSGE models
in book form. In general, the likelihood function is multiplied by a prior distribution over the
parameters, and Bayes’ Theorem implies a posterior distribution
p(Θ|y T ) ∝ p(y T |Θ) p(Θ)
The prior distribution is usually chosen using economic theory or results from previously-estimated
models. The prior distribution helps to concentrate the posterior distribution in areas of the parameter space where there is a unique, stable solution to the DSGE model. For example, in a standard
New Keynesian model, the conduct of monetary policy is typically specified as a Taylor rule, so
that, in its simplest form, the nominal interest rate is a function of the inflation rate. In this situation, it is well known that the semielasticity of the interest rate of the inflation rate should be
greater than unity in order to bring about a unique, stable equilibrium in the dynamic model that
is consistent with the relatively stable monetary policy we observe.
In general, in economics and other fields, a (hybrid) Gibbs sampler is created by deriving the fullconditional distributions of as many of the parameters as possible. The Gibbs sampler proceeds by
7
drawing from each full-conditional distribution in turn and falling back to some other sampling
technique to draw parameters whose full-conditional distributions are not known or is known but
is not easy to draw from. However, in the case of a DSGE model, F (Θ, w¯ ) and G (Θ, w¯ ) are highly
non-linear combinations of the structural parameters, Θ, and it is essentially impossible to derive
the full-conditional distribution of any parameter of the model.
Therefore, the majority of the Bayesian literature on DSGE models uses the Random Walk Metropolis Hastings (RWMH) approach to explore the posterior distribution of Θ, which can be summarized as follows. Suppose that we have performed j simulations. Then,
1. Draw a vector of proposed parameters, Θ j using any of the methods outlined below.
(a) Solve for the matrices of the linearised form of the model, given Θ j , and the implied
steady state.
(b) Evaluate the likelihood function, using the Kalman Filter if a first order Taylor expansion
is used, or a Particle Filter otherwise.
(c) Evaluate the prior distribution p(Θ j ), which when combined with the evaluated likelihood is proportional to the posterior distribution at Θ j
2. Accept the proposed draw Θ j with probability
p(Θ j |y T )
p(Θ|y T )
×
q(Θ|Θ j )
,
q(Θ j |Θ)
where q () is the density of
the proposal distribution.
However, the RWMH algorithm only mitigates many of the problems that plague the maximum
likelihood estimator. The Markov Chain may become stuck in a flat area or local maximum of the
posterior distribution for an indefinite number of iterations and thus not fully explore the posterior distribution in a remotely reasonable amount of time. In addition, the RWMH algorithm may
not mix throughout the parameter space at a fast enough rate for the Markov Chain Monte Carlo
Central Limit Theorem (MCMC-CLT) to hold, in which case it is difficult to characterize our uncertainty over how closely the empirical means of functions of the Markov Chain adhere to the
corresponding analytical means (if it were even possible to derive them in closed-form). At the
extreme, a RWMH algorithm could easily not have converged to the posterior distribution when
the researcher stops to analyze the output.
[Chib and Ramamurthy(2010)] recognizes these problems with the standard RWMH and proposes
a variation of it. The researcher first specifies a number of subvectors that Θ can be partitioned
into. If there are B mutually exclusive and exhaustive blocks, then we can write Θ =
SB
b =1
Θb . The
difference in this “tailored” Metropolis-Hastings appoach is that at each iteration, the parameters
are randomly allocated into B blocks — without regard to the current state of the Markov Chain —
j
and only Θb is proposed, leaving the other blocks at their current values. This process diminishes
the change that the Markov Chain will become stuck in a region of the parameter space for an
indefinite number of iterations. [Chib and Ramamurthy(2010)] reports that the autocorrelation of
the draws dissapates after 30 or 40 iterations for most of the 36 parameters in a DSGE model.
8
[Herbst(2011)] uses a blocking scheme where the parameters are allocated into B blocks based on
posterior correlations estimated during the burnin period and reports improved performance.
Other alternatives to RWMH use the information in the shape of the posterior distribution to generate better proposals for the parameters, rather than drawing them blindly from a multivariate
normal distribution. The Metropolis-Adjusted Langevin (MAL) approach utilizes the gradient and
Hessian (evaluated at the posterior mode) to shift the mean vector of the multivariate normal proposal distribution. Another variant on this approach is to evaluate the Hessian at the current state
of the Markov Chain.
3.3
Our approach
[Herbst and Schorfheide(2014)] concludes its survey of Metropolis-Hastings based approaches by
referring to Hamiltonian and Riemannian Monte Carlo
Preliminary trials with both of these algorithms suggested, that, yes, they represented
a statistical improvement from the algorithms presented above. However, the computational burden associated with them outweighed any gains. For example, computing
the information matrix — a requirement for the Riemann-based method — of a DSGE is
still too cumbersome to be used within an MH algorithm. Computational gains in these
kinds of calculations, a subject of on-going research, might make them more feasible,
and thus, preferred. (72)
In this subsection, we argue that Hamiltonian Monte Carlo (HMC) is feasible today using Stan and
Stan’s implementation of Riemannian Manifold Monte Carlo (RMMC) will likely become feasible
in 2015 or 2016.
Stan implements automatic differentiation to evaluate derivatives of the (log) posterior distribution.
This section borrows from the description of Stan in [Goodrich et al.(2012)Goodrich, Katznelson, and Wawro].
For a comprehensive review of HMC, see [Neal(2011)]. The theoretical justification for Stan is given
in [Hoffman and Gelman(2014)] and [Carpeter(2015)].
Physicists can model the evolution over time of a closed system given a particle’s position and
momentum using functions measuring its potential and kinetic energy. It turns out that the same
intuition and mathematics can be applied to sample from a posterior distribution via MCMC, which
is the characteristic that distinguishes HMC from other MCMC algorithms.
HMC augments the usual expresion for Bayes’ Theorem p ( Θ| data) ∝ p (Θ) f ( data| Θ) with an
auxiliary vector (ρ) of “momentum” variables — one for each of the parameters — to specify the
joint posterior distribution of Θ and ρ, which is p ( Θ, ρ| data) ∝ p (Θ, ρ) p ( data| Θ, ρ). These auxiliary variables are computationally important, but they are independent of Θ and the data, allowing
us to write p ( Θ, ρ| data) ∝ p (ρ) p (Θ) p ( data| Θ). The researcher must specify priors for the parameters Θ and ρ but unlike Gibbs sampling, it is not necessary to specify the priors in such a way
9
Table 1: Hamiltonian Dynamics Concepts Mapped Between Physics and MCMC
Concept
Object
Dimensions
Position
Momentum
Potential Energy
Kinetic Energy
Physics
Particle
2D / 3D
q
p
U (q)
K (p)
Bayesian Statistics
Parameters
Any dimensionality
Θ
ρ
− ln [ p ( Θ, ρ| data)]
− ln [N (0, M)] w/ M diagonal
that the full-conditional distributions of the parameters are known in closed forms. This feature is
important in a DSGE context where it is impossible to derive full-conditional distributions of the
elements of Θ anyway.
The most convenient prior for the momentum variables is a multivariate normal with mean zero
and covariance matrix M. HMC algorithms typically specify a fixed M at the outset, which is
often M = mI for some chosen value of the tuning parameter, m. By default, Stan assumes M is
diagonal and the diagonal entries are tuned automatically during the warmup phase of the MCMC
algorithm so that the diagonal entries are approximately equal to the posterior variances of the
elements of Θ. Since ρ is independent of Θ a priori and does not enter the likelihood, the fullconditional distribution of ρ is just its prior distribution, namely multivariate normal. Thus, ρ can
be updated with a Gibbs-step, which is why HMC is also sometimes taken as an abbreviation for
Hybrid Monte Carlo because Θ is updated in an entirely different way, conditional on ρ.
The relationship between Hamiltonian dynamics in physics and Hamiltonian dynamics in MCMC
is outlined in Table 1. Physics specifies a position vector (q) of a particle, a momentum vector (p)
for that particle, a function for the potential energy (U (q)) of the particle, and a function for the
kinetic energy (K (p)) of the particle. The Hamiltonian function, which measures total energy, is
typically additive, H (p, q) = U (q) + K (p), which leads to the following system of differential
equations, called the “Hamiltonian equations”:
dqi
dt
dpi
dt
=
=
h
i
∂H (q, p)
= M −1 p ,
∂t
i
∂H (q, p)
∂U (q)
−
=−
,
∂t
∂qi
where M is interpreted as a “mass” matrix for the particle. These differential equations characterize where the particle moves in continuous time (t), but it is typically necessary to make a
discrete-time approximation of this system in order to solve for the new position, momentum, potential energy, and kinetic energy. HMC utilizes the same discrete-time approximation to construct
a MCMC algorithm, where the “potential energy” function is the negative log posterior distribution, U (Θ, ρ) = − ln [ p ( Θ, ρ| data)], and the “kinetic energy”
function
h
i is the negative log prior
1 ρ > M −1 ρ
−
distribution for the momentum variables, K (ρ) = − ln c × e 2
. For both of these distributions, normalizing constants (c) can be included or ignored because they have no effect on the
dynamics.
10
Each iteration of the MCMC algorithm requires a solution to the Hamiltonian equations and con
sists of three distinct steps. First, a proposed momentum vector ρ j is drawn from its multivariate
normal distribution. Second, a proposed location vector Θ j is drawn, given a realization of the
proposed momentum vector that “pushes” the location of the parameters in accordance with the
gradient of the log posterior. Third, a Metropolis-Hastings test is conducted to determine whether
to accept the proposed momentum and location or to stay at the current location (Θ, ρ). There are
several variants of the HMC algorithm, which differ primarily in how they implement the second
step.
A simple implementation of the second step uses the so-called “leapfrogging” algorithm to solve
a discrete-time approximation of the Hamiltonian equations. The leapfrogging algorithm specifies
a positive integer for the number of steps ( L) in time and a positive stepsize (e). The leapfrogging process presents three challenges for a HMC algorithm. First, it requires the gradient of the
negative log-posterior, which is often difficult or impossible to calculate in closed form. Numeric
gradients can be inaccurate and are computationally expensive when there are many parameters.
Stan overcomes this problem by utilizing reverse-mode automatic differentiation and C++ template
metaprogramming.3 Automatic differentiation exploits the fact that a software program only needs
to know a few rules for differentiation, and then a compiler can construct a (potentially large) expression tree that calculates the gradient via the chain rule. Thus, applied researchers do not need
to specify any derivatives to use Stan. Programmers also do not need to specify any derivatives to
contribute to Stan, although it is often possible to make substantial improvements in computation
time if some partial derivatives are known analytically and can be inserted into the expression tree
that evaluates to the gradient.
The second challenge is how to choose L in a HMC algorithm. If L is too small, Θ j will be too
close to Θ, and the Markov Chain will mix slowly through the parameter space. If L is too large,
a lot of computation time is wasted generating a proposal that is no more likely to be accepted.
Stan allows the user to specify L, but by default, Stan endogenizes L with an algorithm called No
U-Turn Sampling (NUTS), which is described in more detail in [Hoffman and Gelman(2014)]. The
intuition of NUTS is to move Θ j both “forward” and “backward” (i.e. ±e) , repeatedly doubling
the number of leapfrog steps until some proposal makes a U-turn (and, if not stopped, would start
moving roughly toward Θ again). Stan then utilizes a slice sampling mechanism to randomly select
some point along the path of proposals it has traced and decides whether to accept this proposal
with a Metropolis-Hastings acceptance test, although the proposal distribution is symmetric so
technically it is just a Metropolis test. The slice sampling mechanism adds a layer of complication
but preserves the essential properties of a Markov Chain.
The third challenge is how to choose e. As e ↓ 0, the discrete-time approximation of the solution
to the Hamiltonian equations approaches a continuous-time process and all proposed locations
will be accepted. However, if e is too small, Θ j will be too close to Θ, and the Markov Chain will
mix slowly. If e is too large, the approximation can become dislodged from the true Hamiltonian
dynamics and the resulting Θ j is unlikely to be accepted, although it is still a valid proposal. Stan
3 Forward-mode
automatic differentiation will be used to implement Riemannian Manifold Monte Carlo.
11
allows the user to specify e at the outset, but by default, Stan utilizes the warmup period of the
Markov Chain to tune e. Essentially, Stan tries to tune e so that 80% of proposals are accepted by
default, which is optimal under certain assumptions and contrasts to an optimal acceptance rate of
23.4% in Random Walk Metropolis algorithms.
Thus, Stan makes it possible for users to simply define a log posterior distribution (with or without
normalizing constants) and then obtain samples from that posterior distribution. By default, M
and e are tuned automatically during the warmup phase of the Markov Chain and frozen thereafter. L is chosen endogenously at every iteration of the Markov Chain. Experts can specify M, e,
and L at the outset and possibly obtain a somewhat more efficient algorithm, but Stan’s defaults
seem to work reasonably well for a wide range of statistical problems. By relying on automatic
differentiation, Stan obviates the need for the user to define a function that evaluates the gradient
of the log posterior, and the computational cost of automatic differentiation is roughly proportional
to the cost of evaluating the log posterior.
There is one additional complication where the developers of Stan have expended considerable
effort to make life easier for the user. The above sketch of Stan works if the posterior has positive
mass throughout the parameter space. However, if a Markov Chain wanders into a region of the
parameter space with no mass, it would get stuck because the gradient is either zero or undefined.
Many statistical models are well-defined only in certain regions of the parameter space; e.g. variance parameters cannot be negative. There are two main approaches in the HMC literature to deal
with this problem. The first is to erect walls that define the admissible region of the parameter
space and “bounce” if the Markov Chain runs into a wall. Intuitively, this involves negating the
momentum, which is legal because doing so does not affect the multivariate normal density.
Stan has not implemented bouncing yet. Instead, Stan takes the second approach of reparameterizing Θ as a function of unbounded primitive parameters. For example, if a user defines σ as the standard deviation of the errors in a linear regression model, Stan automatically specifies σ = exp (e
σ)
where e
σ is an unbounded primitive parameter that is not even seen by the user. Technically, Stan
samples e
σ but saves samples of σ for the user to analyze. This change-of-variables requires that
the posterior distribution be multiplied by the absolute value of the determinant of the Jacobian
for the transformation from the bounded parameter to the unbounded parameter, which is simply exp (e
σ ) in this case. For a multivariate change-of-variables, the Jacobian determinant can be
tedious to define and / or costly to evaluate, but wherever possible Stan uses transformations that
yield triangular Jacobians, whose determinants are relatively easy to evaluate. In almost all cases
— including those discussed in this paper — users do not need to worry about derivates at all.
4
Examples
The examples in this section are empirically incomplete. As of April 1, we have just
gotten the real Schur (QZ) decomposition code to work (modulo a few patches that
have not been upstreamed yet) with a personal fork of the Stan library. Our procedure
12
does not crash (any more), but we have not yet had time to even preliminarily analyze
the results. It appears as if it will be challenging to get DSGE models to converge within
a few hours of computation time, but we have not even attempted to optimize our
code yet. We will have a much more complete set of results by the May 8 conference
presentation.
4.1
A Real Business Cycle (RBC) Model
In this subsection, we consider a small scalle RBC model, similar to the one in [Schmitt-Grohé and Uribe(2004)].
The model considers a representative agent that maximizes an infinite period expected utility function:
∞
∑
max E0
t =0
1− γ
βt
ct
1−γ
where c stands for consumption, E is the expectation operator, and β, γ are two deep parameters
of the model, measuring the subjective value on the future by the agent and her risk aversion,
respectively.
The agent is constrained by the fact that,
At kαt = ct + k t+1 − (1 − δ)k t
which is other way to put that total output per capita (yt = At kαt ) is equal to consumption and
investment per capita (k t+1 − (1 − δ)k t ). The variable k t represents the stock of capital, At is the
productivity level, α is a parameter of the model related to the use of capital in production and δ
is the depreciation rate of capital. In the model, we introduce stochastic exogenous technological
change as follows,
At+1 − 1 = ρ( At − 1) + ηε t+1
with ε ∼ N (0, 1) and ρ an extra parameter to be estimated. The solution to the model consists on
the first order conditions of the maximization problem, plus the resource constraint, the definition
of output, consumption growth and output growth, and the exogenous process for productivity:
−γ
ct
At kαt
yt
=
−γ
βEt ct+1 [αAt+1 kαt+−11 + 1 − δ]
= c t + k t +1 − (1 − δ ) k t
=
At kαt
13
ln At+1
gc
gy
= ρ ln At + σε t+1
ct
=
c t −1
yt
=
y t −1
Note that this set of equations can easily be cast in the form of equation 1. In this case, we can
define the vector of structural parameters of the model as Θ = {γ, β, α, δ, ρ, σ }
and the vector of variables as wt = { gyt , gct , yt , ct , k t , At }. The previous set of equations is easily
cast in the form of 1, and the steady state of the model is found as the solution of the system when
¯ given a vector of parameters. Then, we can find it as,
wt+1 = wt = w,
A¯
k¯
= 1
1
β
=
−1+δ
! 1−1 α
α
c¯ =
k¯ α − δk¯
y¯
= k¯ α
g¯ c
= 1
g¯ y
= 1
Given this steady state and a parameter vector, we can solve for the linearised version of the model
using the method described in section 2. Given a vector of observable variables, the state space
model can be written as before,
yt
w t +1
= Swt + ε t
F (Θ, w¯ )wt + G (Θ, w¯ )vt+1
=
In this case, we will use two variables as observed data, output growth and consumption per capita
o , go ]). For the estimation we will use either artificial data or true quarterly data
growth (yt = [ gyt
ct
from the United States, spanning the period 1948Q1-2014Q4. With this two observable variables,
and using the identifiability test in [Iskrev(2010)], in principle all the parameters in Θ could be
identified using the priors that we will impose in the estimation of the model. Finally, in this case,
the selection matrix is given by,
"
S=
1
0
0
0
0
0
0
1
0
0
0
0
#
14
4.2
More Examples Coming
References
[Andreasen et al.(2013)Andreasen, Fernández-Villaverde, and Rubio-Ramírez] Martin
dreasen, Jesús Fernández-Villaverde, and Juan Rubio-Ramírez.
M
An-
The pruned state-space
system for non-linear dsge models: Theory and empirical applications. Technical report,
National Bureau of Economic Research, 2013.
[Carpeter(2015)] Bob et al. Carpeter. Stan: A probabalistic programming language. 2015.
[Chib and Ramamurthy(2010)] Siddhartha Chib and Srikanth Ramamurthy. Tailored randomized
block mcmc methods with application to dsge models. Journal of Econometrics, 155(1):19–38,
2010.
[DeJong and Dave(2011)] David N DeJong and Chetan Dave. Structural macroeconometrics. Princeton University Press, 2011.
[Fernández-Villaverde(2010)] Jesús Fernández-Villaverde. The econometrics of dsge models. SERIEs, 1(1-2):3–49, 2010.
[Fernández-Villaverde and Rubio-Ramírez(2007)] Jesús Fernández-Villaverde and Juan F RubioRamírez. Estimating macroeconomic models: A likelihood approach. The Review of Economic
Studies, 74(4):1059–1087, 2007.
[Golub and Van Loan(2012)] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press, 2012.
[Goodrich et al.(2012)Goodrich, Katznelson, and Wawro] Ben Goodrich, Ira Katznelson, and Greg
Wawro. Designing quantitative historical social inquiry:an introduction to stan. 2012.
[Hamilton(1994)] James Douglas Hamilton. Time series analysis, volume 2. Princeton university
press Princeton, 1994.
[Harvey(1990)] Andrew C Harvey. Forecasting, structural time series models and the Kalman filter.
Cambridge university press, 1990.
[Herbst(2011)] Edward Herbst. Gradient and Hessian-based MCMC for DSGE Models. Unpublished
manuscript, 2011.
[Herbst and Schorfheide(2014)] Edward Herbst and Frank Schorfheide.
Bayesian Estimation of
DSGE Models. Book manuscript, 2014.
[Hoffman and Gelman(2014)] Matthew D Hoffman and Andrew Gelman. The no-u-turn sampler:
Adaptively setting path lengths in hamiltonian monte carlo. The Journal of Machine Learning
Research, 15(1):1593–1623, 2014.
15
[Iskrev(2010)] Nikolay Iskrev. Local identification in dsge models. Journal of Monetary Economics,
57(2):189–202, 2010.
[Judd(1998)] Kenneth L Judd. Numerical methods in economics. MIT press, 1998.
[Klein(2000)] Paul Klein. Using the generalized schur form to solve a multivariate linear rational
expectations model. Journal of Economic Dynamics and Control, 24(10):1405–1423, 2000.
[Moler and Stewart(1973)] C.B. Moler and G.W Stewart. An algorithm for generalized eigenvalue
problems. SIAM Journal of Numerical Analysis, 10:241, 1973.
[Neal(2011)] Radford M Neal. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte
Carlo, 2, 2011.
[Schmitt-Grohé and Uribe(2004)] Stephanie Schmitt-Grohé and Martın Uribe. Solving dynamic
general equilibrium models using a second-order approximation to the policy function. Journal of economic dynamics and control, 28(4):755–775, 2004.