
Scalable MCMC
Hong Ge
Jes Frellsen
Department of Engineering
University of Cambridge
Machine Learning Reading Group
07 May 2015
Bayesian inference and MCMC
In Bayesian inference we rely on the posterior
p(θ|D) ∝ p(D|θ)p(θ)
In many applications the posterior is intractable
and we have to rely on approximations.
A standard approach is to use MCMC where we
• Construct a Markov chain with stationary distribution p(θ|D):
θ1, θ2, θ3, …
where p(θt | θt−1, …, θ1) = τ(θt | θt−1)
• Calculate posterior expectations using a Monte Carlo estimate (unbiased):
Ep(θ|D)[f] ≈ (1/T) ∑_{t=1}^{T} f(θt)
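To make the estimator concrete, here is a minimal NumPy sketch (illustrative, not from the slides); the chain `samples` and the test function f are assumed:

import numpy as np

def mc_estimate(f, samples):
    # Monte Carlo estimate of E_{p(theta|D)}[f] from the chain theta_1..theta_T
    return np.mean([f(theta) for theta in samples], axis=0)

# e.g., the posterior mean: mc_estimate(lambda th: th, samples)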
Computational bottleneck in MCMC
If the data points D = {x1, …, xN} are conditionally independent given θ, we can write the posterior as
p(θ|D) ∝ p(θ)p(D|θ) = p(θ) ∏_{i=1}^{N} p(xi|θ)
If the data set is large (N ≫ 1), evaluating
p(D|θ) or ∇p(D|θ)
is the computational bottleneck for most standard MCMC
methods, including
• Metropolis-Hastings (MH) Algorithm
• Langevin Monte Carlo (LMC)
• Hamiltonian Monte Carlo (HMC)
Outline
We are going to cover some recent developments in scalable
approaches to
• Metropolis-Hastings (MH) algorithm
• Langevin Monte Carlo (LMC)
• Hamiltonian Monte Carlo (HMC)
We are not going to cover distributed approaches (e.g., Ahn,
Shahbaba, et al. (2014) and Xu et al. (2014))
Scaling up Metropolis-Hastings
Metropolis-Hastings (MH) algorithm
• Probably the most widely used MCMC algorithm
• Algorithm for constructing a Markov chain θ1 , θ2 , θ3 , . . . with
stationary distribution p(θ|D)
At time t:
1. Propose a state θ′ ∼ q(·|θt−1)
2. Accept θ′ as θt with probability
α(θ′|θt−1) = min{1, [p(θ′|D) q(θt−1|θ′)] / [p(θt−1|D) q(θ′|θt−1)]}
• In the MH algorithm the transition operator is given by
τ(θt|θt−1) = q(θt|θt−1) α(θt|θt−1)                               [acceptance]
           + δθt−1(θt) [1 − ∫ q(θ|θt−1) α(θ|θt−1) dθ]            [rejection]
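As a concrete reference point for what follows, a minimal NumPy sketch of random-walk MH (a symmetric Gaussian proposal, so the q terms cancel); `log_post` and the step size are illustrative assumptions:

import numpy as np

def metropolis_hastings(log_post, theta0, n_iter, step=0.1, rng=None):
    # Random-walk MH: each iteration evaluates log_post once, which for a
    # Bayesian posterior means touching all N data points.
    rng = rng or np.random.default_rng()
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    samples = []
    for _ in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(prop)
        # accept with probability min(1, p(prop|D)/p(theta|D))
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        samples.append(theta.copy())
    return np.array(samples)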
Scaling up Metropolis-Hastings by subsampling
“For each proposed sample, the MH rule needs to examine
the likelihood of all data-items. When the number of
data-cases is large this is an awful lot of computation for one
bit of information, namely whether to accept or reject a
proposal.”
(Korattikara et al. 2014)
We are going to consider three approaches to addressing this
problem by only using a subset of the data:
• Firefly Monte Carlo (FlyMC) (Maclaurin and Adams 2014)
• The Austerity framework (Korattikara et al. 2014)
• Metropolis-Hastings with subsampled likelihoods, MHSubLhd (Bardenet et al. 2014)
Firefly Monte Carlo
Maclaurin and Adams (2014)
• An exact MCMC procedure that only uses a subset of the data
• Leaves the full-data posterior invariant
• Introduces a set of binary auxiliary variables that can
effectively turn on/off the data points (hence the name firefly)
• Requires a lower bound for each likelihood term
Firefly Monte Carlo
Maclaurin and Adams (2014)
• Recall that we assumed the posterior
p(θ|D) ∝ p(θ)p(D|θ) = p(θ) ∏_{i=1}^{N} p(xi|θ)
and write Li(θ) = p(xi|θ) for the i-th likelihood term
• We introduce
• A set of binary auxiliary variables zi ∈ {0, 1}
• Bi(θ), a lower bound on Li(θ), i.e., 0 < Bi(θ) ≤ Li(θ)
• Each zi is Bernoulli distributed:
p(zi|xi, θ) = [(Li(θ) − Bi(θ))/Li(θ)]^zi [Bi(θ)/Li(θ)]^(1−zi)
• Augment the posterior distribution with z1:N without changing the marginal p(θ|D):
p(θ, z1:N|x1:N) ∝ p(θ) p(x1:N, z1:N|θ) = p(θ) ∏_{i=1}^{N} p(xi|θ) p(zi|xi, θ)
Firefly Monte Carlo
Maclaurin and Adams (2014)
• Consider the i-th factor in the augmented posterior:
p(xi|θ) p(zi|xi, θ) = Li(θ) [(Li(θ) − Bi(θ))/Li(θ)]^zi [Bi(θ)/Li(θ)]^(1−zi)
                    = { Li(θ) − Bi(θ)   if zi = 1
                      { Bi(θ)           if zi = 0
• {xi |zi = 1} forms a minibatch for which we only need to
evaluate the likelihood
• Bi(θ) needs to have a convenient form, such that ∏_{i=1}^{N} Bi(θ) can be computed in O(1) (e.g., exponential-family distributions):
p(θ, z1:N|x1:N) ∝ p(θ) [∏_{i=1}^{N} Bi(θ)] ∏_{i: zi=1} (Li(θ) − Bi(θ))/Bi(θ)
• We can use Gibbs sampling (zi ∼ P(zi |xi , θ)) to resample a
random (fixed-size) subset of {zi }
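To make the sampling scheme concrete, a schematic NumPy sketch of one FlyMC sweep under the augmented posterior above (illustrative, not the authors' implementation); `log_L`, `log_B`, `log_prod_B`, `log_prior`, and a symmetric `propose` are assumed user-supplied functions:

import numpy as np

def flymc_sweep(theta, z, X, log_L, log_B, log_prod_B, log_prior,
                propose, n_resample, rng):
    # One FlyMC sweep. log_L(th, x) and log_B(th, x) are the per-datum
    # log likelihood and log lower bound (B_i <= L_i); log_prod_B(th)
    # returns sum_i log B_i(th) cheaply (e.g. an exponential-family bound).
    N = len(X)
    # 1) Gibbs-resample z_i ~ Bernoulli(1 - B_i/L_i) for a random fixed-size subset
    for i in rng.choice(N, size=n_resample, replace=False):
        p_bright = 1.0 - np.exp(log_B(theta, X[i]) - log_L(theta, X[i]))
        z[i] = rng.uniform() < p_bright
    bright = np.flatnonzero(z)

    # 2) MH step on theta under the augmented posterior
    #    p(theta) * prod_i B_i(theta) * prod_{bright} (L_i - B_i)/B_i;
    #    only the bright points' likelihoods are evaluated.
    def log_aug(th):
        s = log_prior(th) + log_prod_B(th)
        for i in bright:
            # log((L_i - B_i)/B_i) = log(expm1(log L_i - log B_i))
            s += np.log(np.expm1(log_L(th, X[i]) - log_B(th, X[i])))
        return s

    prop = propose(theta, rng)  # assumed symmetric, so q terms cancel
    if np.log(rng.uniform()) < log_aug(prop) - log_aug(theta):
        theta = prop
    return theta, z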
Illustration of FlyMC on synthetic toy data
Maclaurin and Adams (2014)
Illustration of the FlyMC algorithm on a logistic regression model: a two-class classification problem in two dimensions (and one bias dimension).
Results from empirical evaluations of FlyMC
Maclaurin and Adams (2014)
For the MAP-tuned results, stochastic gradient descent optimisation was used to make the bound tight at the MAP parameters.
The Austerity framework and MHSubLhd
Bardenet et al. (2014) and Korattikara et al. (2014)
• Speed up MH by performing an approximate accept/reject step
• We will be sampling from p̃(θ|D) and not the correct p(θ|D)
• The Monte Carlo estimator Î = T⁻¹ ∑_{t=1}^{T} f(θt) computed using samples from p̃(θ|D) will be biased
• The MSE of Î can be decomposed into bias B and variance V:
E[(I − Î)²] = B² + V
where V = O(1/T)
• Despite the bias introduced by p̃(θ|D), we can (hope to) obtain a lower MSE on Î with the approximate accept/reject step, since we can collect more samples (larger T) on the same computational budget
Rewriting the Metropolis-Hastings algorithm
Bardenet et al. (2014) and Korattikara et al. (2014)
1: for t ∈ {0, …, M − 1} do
2:   θ′ ∼ q(·|θt)
3:   u ∼ U(0, 1)
4:   α(θ′|θt) = [p(θ′) p(D|θ′) q(θt|θ′)] / [p(θt) p(D|θt) q(θ′|θt)]
5:   if u < α(θ′|θt) then θt+1 ← θ′
6:   else θt+1 ← θt

By rearranging the terms in u < α(θ′|θt) we get the algorithm

1: for t ∈ {0, …, M − 1} do
2:   θ′ ∼ q(·|θt)
3:   u ∼ U(0, 1)
4:   ψ ← (1/N) log[ u p(θt) q(θ′|θt) / (p(θ′) q(θt|θ′)) ]
5:   μN ← (1/N) ∑_{i=1}^{N} log[ p(xi|θ′) / p(xi|θt) ]
6:   if ψ < μN then θt+1 ← θ′
7:   else θt+1 ← θt
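A direct NumPy transcription of the rearranged test (full-data version, before any subsampling); `log_prior`, `log_lik`, and `log_q` are assumed illustrative callables:

import numpy as np

def mh_test_full(theta_t, theta_prop, u, X, log_prior, log_lik, log_q):
    # Rearranged MH test: accept iff psi < mu_N.
    # log_q(a, b) denotes log q(a|b).
    N = len(X)
    psi = (np.log(u) + log_prior(theta_t) - log_prior(theta_prop)
           + log_q(theta_prop, theta_t) - log_q(theta_t, theta_prop)) / N
    mu_N = np.mean([log_lik(x, theta_prop) - log_lik(x, theta_t) for x in X])
    return psi < mu_N

The point of the rearrangement is that μN is an average over per-datum terms, so it can be estimated from a subsample.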
Approximating the MH test
Bardenet et al. (2014) and Korattikara et al. (2014)
• We have reformulated the MH test as
(1/N) log[ u p(θ) q(θ′|θ) / (p(θ′) q(θ|θ′)) ] < (1/N) ∑_{i=1}^{N} log[ p(xi|θ′) / p(xi|θ) ]
where the RHS is the average log-likelihood ratio
μN = (1/N) ∑_{i=1}^{N} log[ p(xi|θ′) / p(xi|θ) ] = (1/N) ∑_{i=1}^{N} li
• Can we approximate the MH test by using only a subsample (Monte Carlo estimate) of D = {x1, …, xN}?
μ*n = (1/n) ∑_{i=1}^{n} log[ p(x*i|θ′) / p(x*i|θ) ] = (1/n) ∑_{i=1}^{n} l*i
where {x*1, …, x*n} are sampled w/o replacement from D
• The Austerity paper and the MHSubLhd paper provide two different approaches to this
Austerity framework
Korattikara et al. (2014)
• Reformulate the MH test as a statistical decision problem:
ψ(θ, θ′, u) < (1/N) ∑_{i=1}^{N} li
• Let {l*1, …, l*n} be the log-likelihood ratios of the subsample of D
• Test whether the difference between the subsample mean l̄ = (1/n) ∑i l*i and ψ(θ, θ′, u) is significant relative to the standard deviation of l̄, at a user-chosen confidence level
• If so, we can use l̄ to perform the MH test (with a small chance of error)
• Else we need to draw more data
• Keep drawing until we reach the required confidence level or n = N
• If n is large enough then l̄ ∼ N (central limit theorem) and we can use a sequential one-sample t-test
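A minimal sketch of one such sequential decision (illustrative; for clarity the full array of log likelihood ratios is passed in, whereas in practice the l*i are computed on demand as the subsample grows); the tolerance `eps` and batch size are assumptions:

import numpy as np
from scipy import stats

def austerity_test(psi, lratios, eps, batch=100, rng=None):
    # Sequential t-test: enlarge the subsample until we can decide
    # psi < mean(l) with estimated error probability below eps.
    rng = rng or np.random.default_rng()
    N = len(lratios)
    perm = rng.permutation(N)  # sampling w/o replacement
    n = 0
    while n < N:
        n = min(n + batch, N)
        l = lratios[perm[:n]]
        lbar, s = l.mean(), l.std(ddof=1)
        # std. error of lbar, with finite-population correction
        se = s / np.sqrt(n) * np.sqrt(1.0 - (n - 1) / (N - 1))
        if se == 0:
            break
        # probability that the sign of (lbar - psi) is wrong
        delta = 1.0 - stats.t.cdf(abs(lbar - psi) / se, df=n - 1)
        if delta < eps:
            break
    return lbar > psi  # accept the proposal iff the estimated mu_N exceeds psi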
Austerity framework
Korattikara et al. (2014)
[Figure: log risk vs. wall-clock time at tolerances ε ∈ {0, 0.01, 0.05, 0.1, 0.2}; larger ε yields more samples T in the same budget (T ranges from 5992 at ε = 0 to 1894000 at ε = 0.2 in one experiment, and from 75484 to 422978 in the other).
Panel 1: logistic regression on MNIST for classifying 7 vs. 9 (risk in estimating the predictive mean).
Panel 2: independent component analysis on synthetic data with 4 sources (risk in mean Amari distance).]
MHSubLhd
Bardenet et al. (2014)
• Approximate the MH test ψ(θ, θ′, u) < μN by use of concentration inequalities
• Obtain exact confidence intervals for the Monte Carlo approximation μ*n of μN:
μ*n = (1/n) ∑_{i=1}^{n} log[ p(x*i|θ′) / p(x*i|θ) ]
μN = (1/N) ∑_{i=1}^{N} log[ p(xi|θ′) / p(xi|θ) ]
where {x*1, …, x*n} are sampled w/o replacement from {x1, …, xN}
• The precision of the estimate μ*n can be quantified through concentration inequalities, i.e., for δn > 0
P(|μ*n(θ, θ′) − μN(θ, θ′)| ≤ cn) ≥ 1 − δn
Hoeffding’s inequality
Bardenet et al. (2014)
• Hoeffding's inequality without replacement tells us that
P(|μ*n(θ, θ′) − μN(θ, θ′)| ≤ cn) ≥ 1 − δn
where
cn = Cθ,θ′ √( 2 (1 − (n−1)/N) log(2/δn) / n )
Cθ,θ′ = max_{1≤i≤N} |log p(xi|θ′) − log p(xi|θ)|
• Main challenge is to find Cθ,θ′ without looking at the data
• Gaussian likelihood: Cθ,θ′ ∝ maxi |xi|
• Logistic regression likelihood: Cβ,β′ = ‖β − β′‖ maxi |xi|
• In practice tighter bounds are used
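A small helper computing cn under the bound as reconstructed above (a sketch, assuming the stated form of the constant):

import numpy as np

def hoeffding_bound(n, N, delta_n, C):
    # c_n = C * sqrt(2 * (1 - (n-1)/N) * log(2/delta_n) / n)
    return C * np.sqrt(2.0 * (1.0 - (n - 1) / N) * np.log(2.0 / delta_n) / n)

# The MHSubLhd step keeps growing n until |mu*_n - psi| > c_n, at which
# point the accept/reject decision is taken.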
Using the concentration inequality
Bardenet et al. (2014)
• The concentration inequality
P(|μ*n(θ, θ′) − μN(θ, θ′)| ≤ cn) ≥ 1 − δn
tells us that
• If |μ*n − ψ| > cn we can take the correct decision for the MH test (ψ(θ, θ′, u) < μN) with probability at least 1 − δn
• Else we need to increase n until |μ*n − ψ| > cn
• The authors have a rule for setting δn such that the total probability of making an error in each decision is less than a user-specified δ > 0
Austerity vs. MHSubLhd
Bardenet et al. (2014)
[Figure: posterior histograms comparing the approximate confidence-interval test (Austerity) with MHSubLhd; percentages are the average fraction of the n data points used per MH step.
(a) D ∼ N(0, 0.1²), initial m = 2: Approx. CI uses 3% of n, MHSubLhd 54% of n.
(b) D ∼ N(0, 0.1²), initial m = 500: Approx. CI uses 16% of n, MHSubLhd 54% of n.
(c) D ∼ ln N(0, 2), initial m = 500: Approx. CI uses 21% of n, MHSubLhd 99% of n.]
Austerity vs. MHSubLhd vs. FlyMC
Bardenet et al. (2014) and Korattikara et al. (2014)
• The main differences between the three methods are
• Austerity uses an approximate confidence interval
• MHSubLhd uses an exact confidence interval (but we need to know Cθ,θ′)
• Only FlyMC will produce unbiased estimates Î = T⁻¹ ∑_{t=1}^{T} f(θt)
• However
• In MHSubLhd the probability of making a wrong accept/reject decision is directly bounded by δ
• In Austerity there is a computational overhead to set ε and the batch size m based on a required tolerance δ on the average or worst-case error
• The main challenge in applying Firefly Monte Carlo is to find a good lower bound on the likelihood
Scalable Langevin Monte Carlo (LMC)
Langevin Monte Carlo (LMC)
Kennedy (1990) and R. Neal (2011)
Stochastic diffusion as adaptive proposal process
• For θ ∈ R^D with density p(θ), define U(θ) ≡ − log p(θ) and the Langevin diffusion
dθ(t) = −½ ∇θU(θ(t)) dt + db(t)   (1)
where b denotes a D-dimensional Brownian motion.
• First-order Euler discretisation of the diffusion:
θ(τ + ε) = θ(τ) − (ε²/2) ∇θU(θ(τ)) + ε z(τ),   z(τ) ∼ N(0, I)   (2)
• Equivalent to the proposal
q(θ′ | θ) = N(· | μ(θ, ε), ε² I)   with   μ(θ, ε) = θ − (ε²/2) ∇θU(θ)
• Use an MH acceptance step to correct for the discretisation bias
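A minimal NumPy sketch of one LMC step with the MH correction (i.e., the Metropolis-adjusted Langevin algorithm); `log_p` and `grad_log_p` are assumed user-supplied:

import numpy as np

def mala_step(theta, log_p, grad_log_p, eps, rng):
    # One Langevin proposal (Euler step of the diffusion) plus MH correction.
    def mu(th):  # proposal mean: th + (eps^2/2) * grad log p(th)
        return th + 0.5 * eps**2 * grad_log_p(th)

    prop = mu(theta) + eps * rng.standard_normal(theta.shape)

    def log_q(a, b):  # log N(a | mu(b), eps^2 I), up to a common constant
        return -0.5 * np.sum((a - mu(b))**2) / eps**2

    log_alpha = (log_p(prop) + log_q(theta, prop)
                 - log_p(theta) - log_q(prop, theta))
    if np.log(rng.uniform()) < log_alpha:
        return prop
    return theta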
Stochastic Gradient Langevin Dynamics (SGLD)
Welling and Teh (2011)
• Idea: Langevin dynamics with stochastic gradients.
θ(τ+1) = θ(τ) + (ετ/2) [ ∇θ log p(θ) + (N/n) ∑_{i=1}^{n} ∇θ log p(xti | θ) ] + z(τ)
z(τ) ∼ N(0, ετ I)
• Update is just SGD plus Gaussian noise.
• Noise variance is balanced with gradient step sizes.
• ετ decreases to 0 slowly (step-size requirement).
Stochastic Gradient Langevin Dynamics (SGLD)
Welling and Teh (2011)
θ(τ+1) = θ(τ) + (ετ/2) [ ∇θ log p(θ) + (N/n) ∑_{i=1}^{n} ∇θ log p(xti | θ) ] + z(τ)
z(τ) ∼ N(0, ετ I)
• The only computationally expensive part of Langevin dynamics is the gradient computation. If the gradient can be well approximated on small minibatches, the algorithm will work well.
• As ετ → 0
• the MH acceptance probability approaches 1, so we can ignore the expensive MH accept/reject step;
• ετ approaches 0 slowly enough that the discretised Langevin dynamics is still able to explore the whole parameter space.
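A minimal NumPy sketch of one SGLD update, assuming user-supplied `grad_log_prior` and `grad_log_lik` and a user-chosen step size ετ:

import numpy as np

def sgld_step(theta, X, grad_log_prior, grad_log_lik, eps_t, n, rng):
    # One SGLD update: minibatch stochastic gradient step plus N(0, eps_t I) noise.
    N = len(X)
    batch = X[rng.choice(N, size=n, replace=False)]  # X assumed an ndarray
    grad = grad_log_prior(theta) + (N / n) * sum(
        grad_log_lik(x, theta) for x in batch)
    noise = np.sqrt(eps_t) * rng.standard_normal(theta.shape)
    return theta + 0.5 * eps_t * grad + noise

Note the noise standard deviation √ετ dominates the O(ετ) gradient step as ετ → 0, which is what makes the injected noise (rather than gradient noise) drive the stationary behaviour.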
Scalable Hamiltonian Monte Carlo (HMC)
Hamiltonian Monte Carlo (HMC)
Duane et al. (1987), R. M. Neal (1993), and R. Neal (2011)
• For θ ∈ R^D with density p(θ), U(θ) ≡ − log p(θ) − log Z, where U(θ) is the “potential energy” at “position” θ.
• Introduce extra momentum variables p, of the same dimension as θ, and define a “kinetic energy” K(p). Typically, K(p) = pᵀp/2.
• Define the “Hamiltonian” to be H(θ, p) = U(θ) + K(p), and let
(θ, p) ∼ (1/Z) exp(−H(θ, p))
Note: θ and p are independent, and the marginal density for θ
is p(θ).
Hamiltonian Monte Carlo (HMC)
Duane et al. (1987), R. M. Neal (1993), and R. Neal (2011)
• Repeatedly do the following:
• Sample p from its density, which is N(0, I).
• Find a proposal (θ′, p′) from (θ, p) by simulating Hamiltonian dynamics for some amount of fictitious time.
• Accept or reject (θ′, p′) as the next state according to the Metropolis acceptance probability
min{1, exp(−H(θ′, p′) + H(θ, p))}.
Hamiltonian Monte Carlo (HMC)
Hamiltonian Dynamics
• Hamiltonian dynamics is defined by the following differential equations for how the state (θ, p) evolves with “time” t:
dθ/dt = ∂H/∂p,   dp/dt = −∂H/∂θ   (3)
• When H(θ, p) = U(θ) + K(p) and K(p) = pᵀp/2, we see that
dθi/dt = pi,   dpi/dt = −∂U/∂θi   (4)
which is equivalent to
dθi = pi dt,   dpi = −(∂U/∂θi) dt   (5)
so the “position” variables get pushed by the “momentum” variables, which are themselves controlled by the gradient of the potential energy.
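A minimal NumPy sketch of one HMC transition using the standard leapfrog integrator (a sketch, not tuned; `log_p`, `grad_log_p`, the step size and path length are assumptions):

import numpy as np

def hmc_step(theta, log_p, grad_log_p, eps, L, rng):
    # One HMC transition: leapfrog-simulate the dynamics for L steps of
    # size eps, then Metropolis accept/reject. Here U = -log p, so
    # dp/dt = -dU/dtheta = grad log p.
    p = rng.standard_normal(theta.shape)        # resample momentum ~ N(0, I)
    H0 = -log_p(theta) + 0.5 * p @ p
    th, q = theta.copy(), p.copy()
    q += 0.5 * eps * grad_log_p(th)             # initial half step (momentum)
    for i in range(L):
        th = th + eps * q                       # full step (position)
        if i != L - 1:
            q += eps * grad_log_p(th)           # full step (momentum)
    q += 0.5 * eps * grad_log_p(th)             # final half step (momentum)
    H1 = -log_p(th) + 0.5 * q @ q
    if np.log(rng.uniform()) < H0 - H1:         # min(1, exp(-H1 + H0))
        return th
    return theta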
Stochastic gradient HMC
SGHMC; Chen et al. (2014)
Key ideas of SGHMC:
• Simulating Hamiltonian dynamics requires gradients of the target distribution.
• Replace the exact gradient with a stochastic gradient.
• The MH correction step requires evaluating likelihoods over the full data.
• Ignore the expensive MH accept/reject step.
Stochastic gradient HMC
SGHMC; Chen et al. (2014)
• In SGHMC, the exact gradient is replaced with a stochastic gradient calculated using a subset of the data S ⊂ {x1, …, xN}:
Ũ(θ) = −(N/|S|) ∑_{xi∈S} log p(xi | θ) − log p(θ)   (6)
• Assume that
∇θŨ(θ) ≈ ∇θU(θ) + N(0, 2B(θ))   (7)
• Assuming unit mass, a naive construction of SGHMC would be
dθ = p dt,   dp = −∇θŨ(θ) dt + N(0, 2B(θ) dt)   (8)
Stochastic gradient HMC
SGHMC; Chen et al. (2014)
dθ = p dt,   dp = −∇θŨ(θ) dt + N(0, 2B(θ) dt)   (9)
• However, the equilibrium distribution of the dynamical system (9) (without an MH step) is not the model posterior.
• SGHMC overcomes this by introducing a friction term B(θ)p into the Hamiltonian dynamics:
dθ = p dt,   dp = −∇θŨ(θ) dt − B(θ)p dt + N(0, 2B(θ) dt)   (10)
• With this friction term, the model posterior is preserved.
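A minimal sketch of the discretised dynamics (10), assuming, as on the slide, that the gradient-noise scale B is known and scalar; the N(0, 2B dt) term enters implicitly through the noisy stochastic gradient (eq. 7), so no extra noise is injected here:

import numpy as np

def sghmc_step(theta, p, grad_U_tilde, eps, B):
    # Discretised eq. (10), no MH correction. grad_U_tilde is the noisy
    # minibatch gradient of U (eq. 6), whose noise supplies the N(0, 2B dt)
    # term; B is the (assumed known) friction / noise scale.
    p = p - eps * grad_U_tilde(theta) - eps * B * p
    theta = theta + eps * p
    return theta, p

In practice Chen et al. use a user-chosen friction C at least as large as an estimate B̂ of the gradient noise, injecting N(0, 2(C − B̂)ε) noise to make up the difference.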
Stochastic gradient HMC
SGHMC; Chen et al. (2014)
• So far we have assumed the gradient noise B(θ) is known.
• In practice, it needs to be estimated, which is a troublesome issue for practitioners.
• An alternative solution is to treat B(θ) as part of the dynamical system.
• This leads to stochastic gradient Nosé-Hoover thermostat dynamics (SGNHT).
Stochastic gradient Nosé-Hoover thermostats
SGNHT; Ding et al. (2014)
• Using Nosé-Hoover thermostat dynamics as a proposal, we have:
dθ = p dt   (11)
dp = −∇θŨ(θ) dt − ξp dt + N(0, 2C dt)   (12)
dξ = ((1/n) pᵀp − 1) dt   (13)
• The equations without the stochastic term N(0, 2C dt) are known as the Nosé-Hoover thermostat.
• This thermostat dynamics has the model posterior as its equilibrium distribution (proved via the Fokker-Planck equation).
Stochastic gradient Nosé-Hoover thermostats
SGNHT; Ding et al. (2014)
Algorithm 1: Stochastic Gradient Nosé-Hoover Thermostat
Input: parameters h, A.
Initialise θ(0) ∈ Rⁿ, p(0) ∼ N(0, I), and ξ(0) = A
for t = 1, 2, … do
  Evaluate ∇Ũ(θ(t−1)) from (6)
  p(t) = p(t−1) − ξ(t−1) p(t−1) h − ∇Ũ(θ(t−1)) h + √(2A) N(0, h)
  θ(t) = θ(t−1) + p(t) h
  ξ(t) = ξ(t−1) + ((1/n) p(t)ᵀ p(t) − 1) h
end
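A NumPy transcription of Algorithm 1 (a sketch under the same assumptions; `grad_U_tilde` is the stochastic gradient from (6)):

import numpy as np

def sgnht(grad_U_tilde, theta0, h, A, n_iter, rng):
    # Direct transcription of Algorithm 1 (SGNHT).
    theta = np.asarray(theta0, dtype=float)
    n = theta.size
    p = rng.standard_normal(n)
    xi = A
    samples = []
    for _ in range(n_iter):
        g = grad_U_tilde(theta)                       # stochastic gradient
        p = (p - xi * p * h - g * h
             + np.sqrt(2 * A) * np.sqrt(h) * rng.standard_normal(n))
        theta = theta + p * h
        xi = xi + (p @ p / n - 1.0) * h               # thermostat update
        samples.append(theta.copy())
    return np.array(samples)

The thermostat variable ξ adapts so that the mean kinetic energy stays at its target value, absorbing the unknown gradient noise.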
Stochastic gradient Nosé-Hoover thermostats
SGNHT; Ding et al. (2014)
Results on the double-well problem:
U(θ) = (θ + 4)(θ + 1)(θ − 1)(θ − 3)/14 + 0.5
Stochastic gradient Nosé-Hoover thermostats
SGNHT; Ding et al. (2014)
More results on Bayesian NN (MNIST), PMF (Netflix, Movielens1m), and LDA (ICML):
Further reading
Articles not covered in this presentation:
• M. Xu et al. (2014). “Distributed Bayesian posterior sampling via moment
sharing”. In: Advances in Neural Information Processing Systems,
pp. 3356–3364
• S. Ahn, B. Shahbaba, et al. (2014). “Distributed stochastic gradient MCMC”. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1044–1052
• H. Strathmann et al. (2015). “Unbiased Bayes for Big Data: Paths of Partial
Posteriors”. In: arXiv preprint arXiv:1501.03326
• S. Ahn, A. K. Balan, et al. (2012). “Bayesian Posterior Sampling via Stochastic
Gradient Fisher Scoring”. In: Proceedings of the 29th International Conference
on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1,
2012
• W. Neiswanger et al. (2014). “Asymptotically Exact, Embarrassingly Parallel MCMC”. In: Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI)
• D. N. VanDerwerken and S. C. Schmidler (2013). “Parallel Markov Chain Monte
Carlo”. In: arXiv preprint arXiv:1312.7479
• D. Wilkinson (2005). “Parallel Bayesian Computation”. In: Handbook of Parallel
Computing and Statistics. Chapman and Hall/CRC, pp. 477–508
References I
Ahn, S., A. K. Balan, and M. Welling (2012). “Bayesian Posterior Sampling
via Stochastic Gradient Fisher Scoring”. In: Proceedings of the 29th
International Conference on Machine Learning, ICML 2012, Edinburgh,
Scotland, UK, June 26 - July 1, 2012.
Ahn, S., B. Shahbaba, and M. Welling (2014). “Distributed stochastic gradient
MCMC”. In: Proceedings of the 31st International Conference on Machine
Learning (ICML-14), pp. 1044–1052.
Bardenet, R., A. Doucet, and C. Holmes (2014).
“Towards scaling up Markov chain Monte Carlo: an adaptive subsampling
approach”. In: Proceedings of The 31st International Conference on
Machine Learning. Beijing, China, pp. 405–413.
Chen, T., E. Fox, and C. Guestrin (2014). “Stochastic Gradient Hamiltonian
Monte Carlo”. In: Proceedings of the 31st International Conference on
Machine Learning (ICML-14), pp. 1683–1691.
Ding, N., Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven (2014).
“Bayesian Sampling Using Stochastic Gradient Thermostats”. In:
Advances in Neural Information Processing Systems 27. Ed. by
Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger.
Curran Associates, Inc., pp. 3203–3211.
References II
Duane, S., A. Kennedy, B. J. Pendleton, and D. Roweth (1987). “Hybrid
Monte Carlo”. In: Physics Letters B 195.2, pp. 216–222.
Kennedy, A. (1990). “The Theory of Hybrid Stochastic Algorithms”. English.
In: Probabilistic Methods in Quantum Field Theory and Quantum Gravity.
Ed. by P. Damgaard, H. Hüffel, and A. Rosenblum. Vol. 224. NATO ASI
Series. Springer, pp. 209–223.
Korattikara, A., Y. Chen, and M. Welling (2014). “Austerity in MCMC Land:
Cutting the Metropolis-Hastings Budget”. In: Proceedings of The 31st
International Conference on Machine Learning. Beijing, China,
pp. 181–189.
Maclaurin, D. and R. P. Adams (2014). “Firefly Monte Carlo: Exact MCMC
with Subsets of Data”. In: Thirtieth Conference on Uncertainty in Artificial
Intelligence (UAI).
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo
methods. Tech. rep. CRG-TR-93-1. Department of Computer Science,
University of Toronto, Toronto, Canada.
Neal, R. (2011). “MCMC Using Hamiltonian Dynamics”. In: Handbook of
Markov Chain Monte Carlo. Chapman and Hall/CRC.
References III
Neiswanger, W., E. Xing, and C. Wang (2014). “Asymptotically Exact,
Embarrassingly Parallel MCMC”. In: Thirtieth Conference on Uncertainty
in Artificial Intelligence (UAI).
Strathmann, H., D. Sejdinovic, and M. Girolami (2015). “Unbiased Bayes for
Big Data: Paths of Partial Posteriors”. In: arXiv preprint arXiv:1501.03326.
VanDerwerken, D. N. and S. C. Schmidler (2013). “Parallel Markov Chain
Monte Carlo”. In: arXiv preprint arXiv:1312.7479.
Welling, M. and Y. W. Teh (2011). “Bayesian learning via stochastic gradient
Langevin dynamics”. In: Proceedings of the 28th International Conference
on Machine Learning (ICML-11), pp. 681–688.
Wilkinson, D. (2005). “Parallel Bayesian Computation”. In: Handbook of
Parallel Computing and Statistics. Chapman and Hall/CRC, pp. 477–508.
Xu, M., B. Lakshminarayanan, Y. W. Teh, J. Zhu, and B. Zhang (2014).
“Distributed Bayesian posterior sampling via moment sharing”. In:
Advances in Neural Information Processing Systems, pp. 3356–3364.