How to inefficiently solve any problem
A tutorial on Bayesian numerical methods

Rémi Bardenet
LAL, LRI, Univ. Paris-Sud XI

May 30th, 2012
Summary

1. A very generic problem
2. First sampling methods
3. MCMC algorithms
4. A taste of a monster MCMC sampler for Auger
1. A very generic problem
Today's talk: sampling for Bayesian Inference

When
- a model p(data|θ) of an experiment has been written,
- a prior p(θ) has been set on the parameters,
then by Bayes' theorem:

    π(θ) := p(θ|data) = p(data|θ) p(θ) / ∫_Θ p(data|θ) p(θ) dθ.
The generative process of an air shower

(Figure slide.)
Looking at Auger tank signals

Figure: FADC traces (dN/dΔV in VEM-peak units per time bin of 25 ns) and ΔV distributions for stations 612 (1684 m) and 622 (2031 m) of event 1524469, E = 10.6 EeV, θ = 39.6°.
Modelling tank signals: the single muon case

Figure: muonic time response model p_{τ,t_d} (density p vs. t [ns]).

Mean number of photo-electrons per bin:

    n̄_i(A_µ, t_µ) = A_µ ∫_{t_{i−1}}^{t_i} p_{τ,t_d}(t − t_µ) dt.
Figure: simulated tank signal (#PE per bin vs. t [ns]).

n_i is Poisson with mean

    n̄_i(A_µ, t_µ) = Σ_{j=1}^{N_µ} n̄_i(A_µj, t_µj),

where A_µ = (A_µ1, …, A_µN_µ) and t_µ = (t_µ1, …, t_µN_µ) collect the amplitudes and arrival times of the N_µ muons.
So what is what in the case N_µ = 1?

- Parameters to infer are θ = (t_µ, A_µ).
- The likelihood is fixed by the model:

      p(data|θ) = ∏_{i=1}^{N_bins} n̄_i(θ)^{n_i} e^{−n̄_i(θ)} / n_i!.

- Choose e.g. a uniform prior:

      p(θ) = (1/C) χ_[0,C](t_µ) · (1/D) χ_[0,D](A_µ).

- Then the posterior reads

      p(θ|data) ∝ χ_[0,C](t_µ) χ_[0,D](A_µ) ∏_{i=1}^{N_bins} n̄_i(θ)^{n_i} e^{−n̄_i(θ)}.
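To make the N_µ = 1 model concrete, here is a small sketch (my own code, not the talk's) of the unnormalized log-posterior; the response density response_pdf, the bin width, and the prior bounds C, D are hypothetical placeholder choices.

```python
# Sketch of the single-muon Poisson log-posterior (illustrative assumptions only).
import numpy as np
from scipy import stats, integrate

BIN = 25.0                               # ns per FADC bin
C, D = 300 * BIN, 500.0                  # assumed uniform prior bounds on t_mu, A_mu

def response_pdf(t, tau=60.0, td=10.0):
    # hypothetical muon time response: exponential decay of scale tau after delay td
    dt = t - td
    return np.exp(-np.maximum(dt, 0.0) / tau) / tau * (dt > 0)

def nbar(i, A_mu, t_mu):
    """Mean photo-electron count in bin i: A_mu * int_{t_{i-1}}^{t_i} p(t - t_mu) dt."""
    lo, hi = (i - 1) * BIN, i * BIN
    val, _ = integrate.quad(lambda t: response_pdf(t - t_mu), lo, hi)
    return A_mu * val

def log_posterior(theta, counts):
    t_mu, A_mu = theta
    if not (0.0 <= t_mu <= C and 0.0 <= A_mu <= D):
        return -np.inf                   # outside the uniform prior
    means = np.array([nbar(i, A_mu, t_mu) for i in range(1, len(counts) + 1)])
    return np.sum(stats.poisson.logpmf(counts, np.maximum(means, 1e-12)))

# Toy usage: simulate a 40-bin signal from (t_mu, A_mu) = (200 ns, 50 PE) and score it.
rng = np.random.default_rng(0)
true_theta = (200.0, 50.0)
means = [nbar(i, true_theta[1], true_theta[0]) for i in range(1, 41)]
counts = rng.poisson(means)
print(log_posterior(true_theta, counts))
```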
Bayesian inference requires computing integrals

- MEP estimate (aka Bayes'): compute

      θ̂_MEP := ∫_Θ θ p(θ|data) dθ,

- credible intervals: find I such that

      ∫_I p(θ|data) dθ ≥ 1 − α,

- Bayes factors: require evidence computations

      p(data|M) = ∫ p(data|θ, M) p(θ|M) dθ.

The MAP estimate

      θ_MAP := arg max p(θ|data)

with "Hessian" credible interval does not require integrals.
Averaging whenever necessary

Most Bayesian inference tasks require integrals wrt. the posterior

    ∫ h(θ) p(θ|data) dθ.

The Monte Carlo principle

    I := ∫ h(x) π(x) dx ≈ (1/N) Σ_{i=1}^N h(X_i) =: Î_N,   where X_{1:N} ∼ π i.i.d.

- Î_N is unbiased: E Î_N = I.
- Error bars shrink at speed 1/√N: Var(Î_N) = Var(h(X_1)) / N.
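As a minimal illustration of the Monte Carlo principle (my own toy example, not from the slides): estimate I = E_π[h(X)] with π = N(0, 1) and h(x) = x², whose true value is 1, together with its 1/√N error bar.

```python
# Plain Monte Carlo estimate with a standard error bar (toy example).
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(h, sample_pi, N):
    """Return (I_hat, std_error) from N i.i.d. draws of pi."""
    x = sample_pi(N)
    hx = h(x)
    return hx.mean(), hx.std(ddof=1) / np.sqrt(N)

I_hat, err = mc_estimate(lambda x: x**2, lambda N: rng.standard_normal(N), 10_000)
print(f"I_hat = {I_hat:.3f} +/- {err:.3f}")   # close to the true value 1
```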
Pros & Cons

Pros
- Numerical integration (grid methods) is too costly and imprecise when d ≥ 6.
- MC should concentrate the effort on places where π is big.
- Many, many applications!

Cons
- One must be able to sample from π!

→ Need for generic, clever sampling methods!
Non-Bayesian applications of sampling algorithms: Auger

"A complete Monte Carlo hybrid simulation has been performed to study the trigger efficiency and the detector performance. The simulation sample consists of about 6000 proton and 3000 iron CORSIKA [19] showers"

from "The exposure of the hybrid detector of the P. Auger Observatory", the Auger Collab., Astroparticle Phys., 2010.
Non-Bayesian applications of sampling algorithms: EPOS

"The difficulty with Monte Carlo generation of interaction configurations arises from the fact that the configuration space is huge and rather nontrivial [...] the only way to proceed amounts to employing dynamical MC methods"

from "Parton-based Gribov-Regge theory", Drescher et al., Phys. Rept., 2008.
2. First sampling methods
The inverse function method

Principle: if U ∼ U(0,1) and F is a cdf, then F^(−1)(U) ∼ F.

- It assumes we know how to sample from U(0,1)!
- It needs a known and convenient cdf.
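A minimal sketch of the inverse-cdf method, assuming we trust numpy's uniform generator; the exponential target and its closed-form inverse cdf are my own choices.

```python
# Inverse-cdf sampling of Exp(rate): F(x) = 1 - exp(-rate*x), F^{-1}(u) = -log(1-u)/rate.
import numpy as np

rng = np.random.default_rng(1)

def sample_exponential(rate, size):
    u = rng.uniform(size=size)          # U ~ U(0,1)
    return -np.log1p(-u) / rate         # F^{-1}(U) ~ Exp(rate)

x = sample_exponential(rate=2.0, size=100_000)
print(x.mean())                         # close to 1/rate = 0.5
```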
Rejection Sampling 1/2

Assumptions
- Target π is known up to a multiplicative constant,
- an easy-to-sample distribution q s.t. π ≤ Mq is known.

RejectionSampling(π, q, M, N)
1  for i ← 1 to N,
2    Sample θ^(i) ∼ q and u ∼ U(0,1).
3    Form the acceptance ratio ρ = π(θ^(i)) / (M q(θ^(i))).
4    if u < ρ, then accept θ^(i),
5    else reject.
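A small rejection-sampling sketch along the lines of the pseudocode above (my own toy target, not the talk's): π is a Beta(2, 5) density known only up to a constant, q = U(0, 1), and M = 0.085 is an assumed bound on the unnormalized π/q.

```python
# Rejection sampling from an unnormalized Beta(2,5) target with a uniform proposal.
import numpy as np

rng = np.random.default_rng(2)

def pi_unnorm(x):
    return x * (1.0 - x) ** 4               # Beta(2,5) up to a constant; max ~ 0.082

def rejection_sampling(N, M=0.085):
    theta = rng.uniform(size=N)              # theta^(i) ~ q = U(0,1)
    u = rng.uniform(size=N)
    rho = pi_unnorm(theta) / (M * 1.0)       # q(theta) = 1 on [0,1]
    return theta[u < rho]                    # keep only accepted draws

samples = rejection_sampling(100_000)
print(len(samples), samples.mean())          # roughly 39% accepted; mean close to 2/7
```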
Rejection Sampling 2/2

Remarks & Drawbacks
- One needs to know a good q and M,
- lots of wasted samples.
Importance Sampling 1/2

Assumptions & principle
- Target π is fully known,
- an easy-to-sample distribution q is known s.t. Supp(π) ⊂ Supp(q).

Then

    Î_N := (1/N) Σ_{i=1}^N h(θ^(i)) π(θ^(i)) / q(θ^(i)) → ∫_Θ h(θ) π(θ) dθ,   θ^(i) ∼ q i.i.d.

ImportanceSampling(π, q, N)
1  Sample independent θ^(i) ∼ q, i = 1..N,
2  Form the weights w_i = (1/N) π(θ^(i)) / q(θ^(i)),
3  π is approximated by Σ_{i=1}^N w_i δ_{θ^(i)}.
Importance Sampling 2/2

- If |h| ≤ M, then

      Var(Î_N) ≤ (M²/N) ( ∫ (π − q)²/q dθ + 1 ),

  thus q has to be close to π and have heavier tails than π.
- Further: the q achieving the smallest variance is

      q = |h| π / ∫ |h| π dθ.

- Better and achievable: for smaller variance, or if π is known up to a constant, use

      Ĩ_N := [ Σ_{i=1}^N h(θ^(i)) π(θ^(i))/q(θ^(i)) ] / [ Σ_{i=1}^N π(θ^(i))/q(θ^(i)) ]   (biased).

- One needs to know a good, heavy-tailed q.
- Adaptive strategies for tuning q are possible [WKB+09].
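A sketch of both estimators (my own toy example): π = N(0, 1), a heavier-tailed Student-t proposal q, the unbiased Î_N, and the self-normalized Ĩ_N that only needs π up to a constant.

```python
# Importance sampling of E_pi[h] with a Student-t proposal (heavier tails than pi).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def importance_sampling(h, N, df=3):
    theta = stats.t.rvs(df, size=N, random_state=rng)    # theta^(i) ~ q
    w = stats.norm.pdf(theta) / stats.t.pdf(theta, df)   # weights pi/q
    I_hat = np.mean(w * h(theta))                        # unbiased estimator
    I_tilde = np.sum(w * h(theta)) / np.sum(w)           # self-normalized (biased)
    return I_hat, I_tilde

print(importance_sampling(lambda x: x**2, 100_000))       # both close to 1
```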
Going to high dimensions

Benchmark: let π(θ) = N(0, I_d) and q(θ) = N(0, σ I_d).

- Rejection sampling
  - needs σ ≥ 1,
  - the fraction of accepted proposals goes as σ^(−d).
- Importance sampling
  - yields infinite variance when σ ≤ 1/√2,
  - the variance of the weights goes as (σ² / (2 − 1/σ²))^(d/2).
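A quick numerical illustration of this benchmark (mine, not the slides'): the effective sample size of the importance weights π/q collapses as d grows; the value σ = 1.5 and the use of ESS rather than the raw weight variance are my own choices.

```python
# ESS of importance weights pi/q for pi = N(0, I_d), q with per-coordinate std sigma.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sigma, N = 1.5, 100_000

for d in (1, 5, 10, 20):
    x = sigma * rng.standard_normal((N, d))                  # draws from q
    logw = (stats.norm.logpdf(x).sum(axis=1)
            - stats.norm.logpdf(x, scale=sigma).sum(axis=1)) # log(pi/q)
    w = np.exp(logw - logw.max())
    ess = w.sum() ** 2 / (w ** 2).sum()                       # effective sample size
    print(d, round(ess))                                      # ESS shrinks as d grows
```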
So far, so good?

We have seen that
- sums & integrals are ubiquitous in Statistics,
- numerical integration is limited to small dimensions,
- "basic" sampling is limited to full-information easy cases,
- RS and IS, well-tuned, are efficient and versatile.

Now what do we do when
- it is not possible to sample from π directly, but only to evaluate it pointwise, possibly up to a multiplicative constant:

      π(θ) = p(x|θ) p(θ) / ∫_Θ p(x|θ) p(θ) dθ,

- we don't know a good approximation q of π,
- Θ is high-dimensional?

Well, MCMC brings both answers and new problems!
3. MCMC algorithms
Metropolis' algorithm 1/2

Goal is to explore Θ, spending more time in places where π is high.

Figure: taken from [Bis06].
Metropolis' algorithm 2/2

- Metropolis' algorithm builds a random walk with symmetric steps q that mimics independent draws from π.
- Symmetry means q(x|y) = q(y|x), e.g. q(y|x) = N(x, σ).

MetropolisSampler(π, q, T, θ^(0))
1  S ← ∅.
2  for t ← 1 to T,
3    Sample θ* ∼ q(·|θ^(t−1)) and u ∼ U(0,1).
4    Form the acceptance ratio ρ = 1 ∧ π(θ*) / π(θ^(t−1)).
5    if u < ρ, then θ^(t) ← θ* else θ^(t) ← θ^(t−1).
6    S ← S ∪ {θ^(t)}.
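A minimal random-walk Metropolis sketch following the pseudocode above; the Gaussian target, starting point, and step size σ are my own choices.

```python
# Random-walk Metropolis with a symmetric Gaussian proposal (toy target).
import numpy as np

rng = np.random.default_rng(4)

def log_pi(theta):
    return -0.5 * np.sum(theta ** 2)          # unnormalized log N(0, I_d)

def metropolis(log_pi, theta0, T, sigma=1.0):
    theta = np.asarray(theta0, dtype=float)
    chain = np.empty((T, len(theta)))
    for t in range(T):
        proposal = theta + sigma * rng.standard_normal(theta.shape)  # symmetric q
        if np.log(rng.uniform()) < log_pi(proposal) - log_pi(theta):
            theta = proposal                   # accept
        chain[t] = theta                       # on rejection, repeat current state
    return chain

chain = metropolis(log_pi, theta0=[3.0, -3.0], T=20_000)
print(chain[5_000:].mean(axis=0))              # close to (0, 0) after burn-in
```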
The Metropolis-Hastings algorithm

- When no symmetry is assumed, change the acceptance ratio to

      ρ(x, y) = 1 ∧ [π(y) q(x|y)] / [π(x) q(y|x)].

- Note the exploration/exploitation trade-off.

MetropolisHastingsSampler(π, q, T, θ^(0))
1  S ← ∅.
2  for t ← 1 to T,
3    Sample θ* ∼ q(·|θ^(t−1)) and u ∼ U(0,1).
4    Form the acceptance ratio ρ = 1 ∧ [π(θ*) q(θ^(t−1)|θ*)] / [π(θ^(t−1)) q(θ*|θ^(t−1))].
5    if u < ρ, then θ^(t) ← θ* else θ^(t) ← θ^(t−1).
6    S ← S ∪ {θ^(t)}.
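To see the Hastings correction at work, here is a short sketch (my example, not the talk's) with an asymmetric log-normal random walk targeting an Exp(1) density; for this proposal the ratio q(x|y)/q(y|x) reduces to y/x.

```python
# Metropolis-Hastings with an asymmetric log-normal proposal on a positive parameter.
import numpy as np

rng = np.random.default_rng(8)

def mh_lognormal(T, s=0.5):
    x, chain = 1.0, []
    for _ in range(T):
        y = x * np.exp(s * rng.standard_normal())     # y ~ LogNormal(log x, s^2)
        log_ratio = (x - y) + np.log(y / x)           # log pi ratio + log q(x|y)/q(y|x)
        if np.log(rng.uniform()) < log_ratio:
            x = y
        chain.append(x)
    return np.array(chain)

print(mh_lognormal(50_000)[10_000:].mean())           # close to E[Exp(1)] = 1
```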
All the maths in two slides 1/2

- A (stationary) Markov chain (θ^(t)) is defined through a kernel:

      P(θ^(t+1) ∈ dy | θ^(t) = x) = K(x, dy).

- If θ^(0) = x, then

      P(θ^(t) ∈ dy | θ^(0) = x) = ∫…∫ K(x, dx_1) … K(dx_{t−1}, dy) =: K^t(x, dy).
All the maths in two slides 2/2

- Under technical conditions, the MH kernel K satisfies

      ‖π − K^t(x, ·)‖ → 0,   ∀x.

- This justifies a burn-in period after which samples are discarded.
- Further results like a Law of Large Numbers guarantee that

      1/(T+1) Σ_{t=0}^T h(θ^(t)) → ∫ h(θ) π(θ) dθ

  for h bounded.
- Central Limit Theorem-type results also exist, see Section 6.7 of [RC04].
Autocorrelation in equilibrium

- For a scalar chain, define the integrated autocorrelation time

      τ_int = 1 + 2 Σ_{k>0} Corr(θ^(t), θ^(t+k)).

- One can show that

      Var( 1/T Σ_{t=1}^T h(θ^(t)) ) = (τ_int / T) Var(h(θ^(0))).

Rule of thumb for the proposal
Optimizing similar criteria leads to choosing q(·|x) = N(x, σ²) s.t.
- the acceptance rate is ≈ 0.5 for d = 1, 2,
- the acceptance rate is ≈ 0.25 for d ≥ 3.
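A rough sketch (my own helper, not from the slides) of estimating τ_int by summing empirical autocorrelations up to a cutoff K, a common if crude choice; it can be applied to the chain produced by the Metropolis sketch earlier.

```python
# Crude estimate of the integrated autocorrelation time of a scalar chain.
import numpy as np

def tau_int(x, K=200):
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = np.mean(x * x)
    rho = [np.mean(x[:-k] * x[k:]) / var for k in range(1, K + 1)]
    return 1.0 + 2.0 * np.sum(rho)

# Effective sample size of a chain: roughly T / tau_int independent draws, e.g.
# print(len(chain) / tau_int(chain[:, 0]))   # with `chain` from the Metropolis sketch
```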
A practical wrap-up

We have justified the acceptance ratio, the burn-in period, and the optimization of the proposal.

MetropolisHastingsSampler(π, q, T, θ^(0), Σ_0)
1  S ← ∅.
2  for t ← 1 to T,
3    Sample θ* ∼ N(·|θ^(t−1), σΣ_0) and u ∼ U(0,1).
4    Form the acceptance ratio ρ = 1 ∧ [π(θ*) q(θ^(t−1)|θ*)] / [π(θ^(t−1)) q(θ*|θ^(t−1))].
5    if u < ρ, then θ^(t) ← θ* else θ^(t) ← θ^(t−1).
6    if t ≤ T_b,
7      σ ← σ + (1/t^0.6)(acc. rate − 0.25/0.50),
8    else if t > T_b, then S ← S ∪ {θ^(t)}.
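A sketch of the Robbins-Monro step-size adaptation of line 7, adapting only during burn-in; the target acceptance rate of 0.25, the toy target, and the clamping of σ away from zero are my own choices.

```python
# Metropolis with step-size adaptation during burn-in, sampling only afterwards.
import numpy as np

rng = np.random.default_rng(5)

def adaptive_scale_metropolis(log_pi, theta0, T, Tb, target=0.25):
    theta = np.asarray(theta0, dtype=float)
    sigma, accepted, samples = 1.0, 0, []
    for t in range(1, T + 1):
        proposal = theta + sigma * rng.standard_normal(theta.shape)
        if np.log(rng.uniform()) < log_pi(proposal) - log_pi(theta):
            theta, accepted = proposal, accepted + 1
        if t <= Tb:                                    # adapt only during burn-in
            sigma += (accepted / t - target) / t**0.6  # Robbins-Monro update
            sigma = max(sigma, 1e-3)                   # keep the step size positive
        else:
            samples.append(theta.copy())
    return np.array(samples), sigma

log_pi = lambda th: -0.5 * np.sum(th ** 2)
S, sigma = adaptive_scale_metropolis(log_pi, [0.0, 0.0], T=20_000, Tb=5_000)
print(sigma, S.mean(axis=0))
```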
No maths on this slide

Feel free to experiment with Laird Breyer's applet on http://www.lbreyer.com/classic.html.
A case study from particle physics

- Consider again our muonic signal reconstruction task, with

      p(data|θ) = ∏_{i=1}^{N_bins} n̄_i(θ)^{n_i} e^{−n̄_i(θ)} / n_i!.

- The model (the physics) suggests using specific independent priors for A_µ and t_µ.

Figure: prior on A_µ for a given zenith angle (density p vs. #PE).
Case study: simulation results

Figure: two log(A) chain traces over 10000 iterations (marginal chain and true value shown).

- One can often detect bad mixing by eye.
- Acceptance for the left panel chain is only 3%.
Case study: simulation results

Figure: marginal histograms of log A for the two chains (marginal histogram, true value, and marginal mean shown).

- Marginals are simply component histograms!
- Even the marginals look ugly when mixing is bad.
Case study: simulation results

Figure: marginal histograms of log A and log t (marginal histogram, true value, and marginal mean shown).

- From now on, we only consider the good mixing case.
- Try to reproduce your marginals with different starting values!
- Producing the chain was the hard part. Now everything is easy: estimation, credible intervals, ...
4. A taste of a monster MCMC sampler for Auger
   - A generative model for the tank signals [BK12]
   - Reconstruction/Inference
Let's go for N_µ > 1

- Recall

      n̄_i(A_µ, t_µ) = A_µ ∫_{t_{i−1}}^{t_i} p_{τ,t_d}(t − t_µ) dt.

Figure: simulated tank signal with several muons (#PE per bin vs. t [ns]).

- Now n_i is Poisson with mean

      n̄_i(A_µ, t_µ) = Σ_{j=1}^{N_µ} n̄_i(A_µj, t_µj).
MCMC 101: Metropolis

- Bayesian inference: obtain

      π(A_µ, t_µ) = p(A_µ, t_µ|signal) ∝ p(signal|A_µ, t_µ) p(A_µ, t_µ).

- MCMC methods sample from π.
Metropolis (B) and adaptive Metropolis (B+G) algorithms

M(π(x), X_0, Σ_0, T)
1  S ← ∅, Σ ← Σ_0
2  for t ← 1 to T
3    X̃ ∼ N(·|X_{t−1}, Σ)                       ▷ proposal
4    if π(X̃)/π(X_{t−1}) > U[0,1] then
5      X_t ← X̃                                 ▷ accept
6    else
7      X_t ← X_{t−1}                            ▷ reject
8    S ← S ∪ {X_t}                              ▷ update posterior sample
9  return S
Metropolis (B) and adaptive Metropolis (B+G) algorithms

AM(π(x), X_0, Σ_0, T, µ_0, c)
1   S ← ∅
2   for t ← 1 to T
3     Σ ← cΣ_{t−1}                              ▷ scaled adaptive covariance
4     X̃ ∼ N(·|X_{t−1}, Σ)                       ▷ proposal
5     if π(X̃)/π(X_{t−1}) > U[0,1] then
6       X_t ← X̃                                 ▷ accept
7     else
8       X_t ← X_{t−1}                            ▷ reject
9     S ← S ∪ {X_t}                              ▷ update posterior sample
10    µ_t ← µ_{t−1} + (1/t)(X_t − µ_{t−1})       ▷ update running mean and covariance
11    Σ_t ← Σ_{t−1} + (1/t)((X_t − µ_{t−1})(X_t − µ_{t−1})^T − Σ_{t−1})
12    c ← c + (1/t^0.6)(acc. rate − 0.25/0.50)
13  return S
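A sketch of the AM recursion above (my own toy target and tuning); lines 10-12 of the pseudocode correspond to the running mean, covariance, and scale updates in the loop.

```python
# Adaptive Metropolis: proposal covariance learned from the chain's running covariance.
import numpy as np

rng = np.random.default_rng(6)

def adaptive_metropolis(log_pi, x0, T, target=0.25):
    d = len(x0)
    x, mu, Sigma, c = np.asarray(x0, float), np.asarray(x0, float), np.eye(d), 1.0
    accepted, chain = 0, []
    for t in range(1, T + 1):
        prop = rng.multivariate_normal(x, c * Sigma)          # proposal
        if np.log(rng.uniform()) < log_pi(prop) - log_pi(x):
            x, accepted = prop, accepted + 1                  # accept
        chain.append(x.copy())
        diff = x - mu
        mu = mu + diff / t                                    # running mean
        Sigma = Sigma + (np.outer(diff, diff) - Sigma) / t    # running covariance
        c = max(c + (accepted / t - target) / t**0.6, 1e-6)   # scale adaptation
    return np.array(chain)

log_pi = lambda th: -0.5 * th @ np.diag([1.0, 25.0]) @ th      # anisotropic toy target
chain = adaptive_metropolis(log_pi, [2.0, 2.0], T=30_000)
print(chain[10_000:].mean(axis=0))                             # close to (0, 0)
```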
Three problems, many "solutions"

- Possibly high dimensions but also highly correlated model.
  - Use adaptive proposals.
- The number of muons N_µ is unknown.
  - Use a nonparametric prior [Nea00] or
  - use a Reversible Jump sampler [Gre95].
- Likelihood P(n|A_µ, t_µ) is permutation invariant.
  - Indeed, if N_µ = 2, p(n|A_1, A_2, t_1, t_2) = p(n|A_2, A_1, t_2, t_1).
  - Marginals are useless, a problem known as label-switching [Ste00].
Label-Switching

- If the prior is also permutation invariant, then so is π(A_µ, t_µ).

Figure: MCMC trace of a component parameter over 30 000 iterations.

Figure: the corresponding simulated tank signal (#PE per bin vs. t [ns]).
AMOR(π(x), X_0, T, µ_0, Σ_0, c)
1   S ← ∅
2   for t ← 1 to T
3     Σ ← cΣ_{t−1}                                      ▷ scaled adaptive covariance
4     X̃ ∼ N(·|X_{t−1}, Σ)                               ▷ proposal
5     P̃ ∼ arg min_{P∈𝒫} L_{(µ_{t−1},Σ_{t−1})}(PX̃)      ▷ pick an optimal permutation
6     X̃ ← P̃X̃                                           ▷ permute
7     if [π(X̃) Σ_{P∈𝒫} N(PX_{t−1}|X̃, Σ)] / [π(X_{t−1}) Σ_{P∈𝒫} N(PX̃|X_{t−1}, Σ)] > U[0,1] then
8       X_t ← X̃                                         ▷ accept
9     else
10      X_t ← X_{t−1}                                    ▷ reject
11    S ← S ∪ {X_t}                                      ▷ update posterior sample
12    µ_t ← µ_{t−1} + (1/t)(X_t − µ_{t−1})               ▷ update running mean and covariance
13    Σ_t ← Σ_{t−1} + (1/t)((X_t − µ_{t−1})(X_t − µ_{t−1})^T − Σ_{t−1})
14    c ← c + (1/t^0.6)(acc. rate − 0.25/0.50)
15  return S

(𝒫 denotes the set of permutations of the component labels.)
AMOR results on the previous example

Figure: the previous tank signal (#PE per bin vs. t [ns]).
The integration methods ladder (adapted from Iain Murray's)

Growing item number means higher complexity and either higher efficiency or wider applicability. Check [RC04] when no further indication is given.

1. Quadrature,
2. Quasi-MC,
3. Simple MC, Importance Sampling,
4. MCMC (MH, Slice sampling, etc.),
5. Adaptive MCMC [AFMP11], Hybrid MC [Nea10], Tempering methods [GRS96], SMC [DdFG01], particle MCMC [ADH10],
6. ABC [FP12].
Conclusions

Take-home message
- MC provides generic integration methods.
- Potential applications in Physics are numerous:
  - in forward sampling (aka "simulation"),
  - in Bayesian inference tasks.
- Producing a good mixing MCMC chain can be difficult.
- Higher efficiency can result from:
  - learning dependencies,
  - exploiting existing/added structure of the problem.
- A broad range of methods allows you to find the level of sophistication required by the difficulty of your problem.
"Introductory" references

- Tutorials: lots on videolectures.net.
  - I. Murray's video lessons: videolectures.net/mlss09uk_murray_mcmc/
  - A. Sokal's tutorial, MCMC from a physicist's point of view + applications to Statistical Physics [Sok96].
- The Monte Carlo bible: C. Robert & G. Casella's "Monte Carlo Methods" [RC04], and references within.
- For more detailed information, please bug me.
Glossary

In order of appearance:

MAP    Maximum posterior (estimate)
MEP    Mean posterior (estimate)
MC     Monte Carlo
RS     Rejection sampling
IS     Importance sampling
MCMC   Markov chain Monte Carlo
MH     Metropolis-Hastings (algorithm)
AMOR   Adaptive Metropolis with online relabeling
SMC    Sequential Monte Carlo
ABC    Approximate Bayesian computation
References I

Christophe Andrieu, Arnaud Doucet, and Roman Holenstein, Particle Markov chain Monte Carlo methods, Journal of the Royal Statistical Society B (2010).

Y. Atchadé, G. Fort, E. Moulines, and P. Priouret, Bayesian time series models, ch. Adaptive Markov chain Monte Carlo: theory and methods, pp. 33–53, Cambridge Univ. Press, 2011.

R. Bardenet, O. Cappé, G. Fort, and B. Kégl, An adaptive Metropolis algorithm with online relabeling, International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR Workshop and Conference Proceedings, vol. 22, April 2012, pp. 91–99.

C. M. Bishop, Pattern recognition and machine learning, Springer, 2006.
References II

R. Bardenet and B. Kégl, An adaptive Monte Carlo Markov chain algorithm for inference from mixture signals, Proceedings of ACAT'11, Journal of Physics: Conference Series, 2012.

A. Doucet, N. de Freitas, and N. Gordon, Sequential Monte Carlo methods in practice, Springer, 2001.

P. Fearnhead and D. Prangle, Constructing summary statistics for approximate Bayesian computation: semi-automatic ABC, Journal of the Royal Statistical Society B (2012).

P. J. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82 (1995), no. 4, 711–732.

W. R. Gilks, S. Richardson, and D. Spiegelhalter (eds.), Markov chain Monte Carlo in practice, Chapman & Hall, 1996.
References III

R. M. Neal, Markov chain sampling methods for Dirichlet process mixture models, Journal of Computational and Graphical Statistics 9 (2000), 249–265.

R. M. Neal, Handbook of Markov Chain Monte Carlo, ch. MCMC using Hamiltonian dynamics, Chapman & Hall / CRC Press, 2010.

C. P. Robert and G. Casella, Monte Carlo statistical methods, Springer-Verlag, New York, 2004.

A. D. Sokal, Monte Carlo methods in statistical mechanics: foundations and new algorithms, lecture notes of the Cargèse summer school, 1996.

M. Stephens, Dealing with label switching in mixture models, Journal of the Royal Statistical Society, Series B 62 (2000), 795–809.
References IV

D. Wraith, M. Kilbinger, K. Benabed, O. Cappé, J.-F. Cardoso, G. Fort, S. Prunet, and C. P. Robert, Estimation of cosmological parameters using adaptive importance sampling, Phys. Rev. D 80 (2009), no. 2.
Thanks for your attention