Monte Carlo - Why
• Need for calculation of integrals
  I = ∫_x h(x)f(x)dx,
  where f(x) is a density.
• Simulation studies - comparing methods
• Bayesian settings - posterior expectations
• Many non-statistical problems
• If x is high-dimensional, alternative (deterministic) methods perform badly.
Geir Storvik: STK4050 Fall 09 – p. 1
Monte Carlo - What
• Idea: Approximate expectations by averages.
• Assume x^1, ..., x^M are samples from f(x).
• h̄ = M^{-1} Σ_{m=1}^M h(x^m).
• Properties:
  • Unbiased
  • Consistent (if h(x) has finite variance)
  • If the samples are iid, Var[h̄] = M^{-1} Var[h(x)].
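The averaging idea above can be sketched in a few lines of NumPy. The target N(0, 1) and the choice h(x) = x² (so E[h(X)] = 1) are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(h, sampler, M=100_000):
    """Approximate E[h(X)] by the average of h over M draws from f."""
    x = sampler(M)
    return np.mean(h(x))

# Example: E[X^2] = 1 for X ~ N(0, 1)
est = mc_estimate(lambda x: x**2, lambda M: rng.standard_normal(M))
```

The estimator is unbiased, and for iid draws its standard error shrinks as M^{-1/2}.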
Monte Carlo - How
• Need for algorithms for simulating from f(·).
• Low-dimensional/simple distributions:
  • Inversion method
  • Accept-reject
  • ···
• High-dimensional/complex distributions:
  • Markov chain Monte Carlo
  • Importance sampling
  • Sequential importance sampling
• Variance reduction
Inversion method
• F(x) = ∫_{-∞}^x f(u)du.
• Generate U ∼ Unif[0, 1], put X = F^{-1}(U).
• Pr(X ≤ x) = Pr(F^{-1}(U) ≤ x) = Pr(U ≤ F(x)) = F(x).
• Exact method.
• Very efficient when applicable.
• Difficult to generalize to higher dimensions.
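A minimal sketch of the inversion method, using the exponential distribution as an example (its inverse CDF F^{-1}(u) = -log(1-u)/λ is available in closed form; the choice of target is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def exponential_inversion(rate, M):
    """Draw Exp(rate) variates via X = F^{-1}(U) with U ~ Unif[0, 1].

    For the exponential, F(x) = 1 - exp(-rate * x), so
    F^{-1}(u) = -log(1 - u) / rate.
    """
    u = rng.uniform(size=M)
    return -np.log(1.0 - u) / rate

samples = exponential_inversion(2.0, 200_000)
```

The sample mean should be close to 1/rate = 0.5 here.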
Accept-reject
• Choose proposal distribution g(·).
• Find M = max_x f(x)/g(x).
• Generate x ∼ g(·).
• Accept with probability f(x)/[M g(x)].
• The acceptance probability corrects for drawing from the wrong distribution.
• Exact method.
• Efficiency depends on how close g is to f.
• Difficult to use in high dimensions.
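The steps above can be sketched as follows. The Beta(2, 2) target with a Uniform(0, 1) proposal (giving M = max f/g = 1.5 at x = 1/2) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

def accept_reject(f, g_sample, g_pdf, M, n):
    """Draw n samples from density f via proposal g and bound M >= max f/g."""
    out = []
    while len(out) < n:
        x = g_sample()                      # x ~ g
        if rng.uniform() <= f(x) / (M * g_pdf(x)):
            out.append(x)                   # accept with prob f(x)/[M g(x)]
    return np.array(out)

# Example: Beta(2,2) density f(x) = 6x(1-x), Uniform(0,1) proposal, M = 1.5
f = lambda x: 6.0 * x * (1.0 - x)
samples = accept_reject(f, lambda: rng.uniform(), lambda x: 1.0, 1.5, 20_000)
```

On average 1/M = 2/3 of the proposals are accepted, illustrating how efficiency depends on how close g is to f.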
Variance reduction
• Common random numbers
  • Use the same random numbers when comparing methods.
• Control variates
  • Assume E[Y] = μ_Y with μ_Y known. Of interest: μ_X = E[X].
    μ̂_X = (1/M) Σ_{m=1}^M x^m + β (1/M) Σ_{m=1}^M (y^m − μ_Y)
• Rao-Blackwellization
  E[X] = E[E[X|Y]] ≈ (1/M) Σ_{m=1}^M E[X|y^m]
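A sketch of the control variate estimator above, assuming (for illustration, not from the slides) X = exp(U) with U ∼ Unif[0, 1], control Y = U with known mean μ_Y = 1/2, and the variance-minimizing β = −Cov(X, Y)/Var(Y) estimated from the same draws:

```python
import numpy as np

rng = np.random.default_rng(3)

# Estimate mu_X = E[exp(U)] = e - 1, using Y = U (mu_Y = 1/2) as control.
M = 100_000
u = rng.uniform(size=M)
x, y = np.exp(u), u
beta = -np.cov(x, y)[0, 1] / np.var(y)       # variance-minimizing choice of beta
mu_hat = x.mean() + beta * (y.mean() - 0.5)  # mu_hat_X from the slide formula
```

Because X and Y are highly correlated here, the control variate removes most of the Monte Carlo variance compared with the plain average of x.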
Markov chain Monte Carlo (MCMC)
• Idea: Generate a sequence x^1, x^2, ... such that x^s converges (in distribution) to f(·).
• x^s depends only on x^{s-1} and not on earlier samples ⇒ Markov chain.
• Requirements from Markov chain theory:
  • f is an invariant distribution for the Markov chain
  • Irreducible Markov chain (all states can be reached in a finite number of iterations)
  • Aperiodicity
• Not exact, but still consistent (ergodicity).
• Dependent samples, so the variance increases.
Invariant distribution
• Assume K(x, x′) is the transition kernel.
• Need f(x′) = ∫_x K(x, x′)f(x)dx.
• Sufficient criterion: Detailed balance
  f(x)K(x, x′) = f(x′)K(x′, x)
  Usually easier to show, but not always satisfied.
• Note: Many transition kernels can have the same invariant distribution.
Types of MCMC algorithms
• Metropolis-Hastings (MH)
  • Independent MH
  • Random walk MH
• Slice sampler
• Gibbs sampler
Metropolis-Hastings
• Choose proposal distribution q(y|x).
• Generate y^{s+1} ∼ q(·|x^s).
• Put x^{s+1} = y^{s+1} with prob r(x^s, y^{s+1}), and x^{s+1} = x^s with prob 1 − r(x^s, y^{s+1}), where
  r(x, y) = min{1, [f(y)q(x|y)] / [f(x)q(y|x)]}
• Difficult part: Choosing a good proposal q.
Independent MH
• Choose q(y|x) = q(y), i.e. not depending on the current state.
• If q is close to f, can be very efficient.
• Want the acceptance rate as high as possible.
• Very difficult to find a good q (active research theme!).
• Very similar to accept-reject.
• No need to find the maximum M; typically gives a higher acceptance rate.
Random walk MH
• Put y^{s+1} = x^s + ε^{s+1}, ε^{s+1} ∼ g(·).
• g(·) typically symmetric around zero.
• Typically need to tune the variance in g to get reasonable acceptance.
• Rule-of-thumb: Acceptance rate ∈ [0.2, 0.8].
• Too low acceptance: Not enough new values.
• Too high acceptance: Too small moves.
• Most popular choice.
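A sketch of random walk MH, assuming (for illustration) a standard normal target; for a symmetric proposal the ratio r(x, y) reduces to min{1, f(y)/f(x)}, and the step size 2.4 is chosen to land in the rule-of-thumb acceptance band:

```python
import numpy as np

rng = np.random.default_rng(4)

def rw_metropolis(log_f, x0, step, S):
    """Random walk MH: propose y = x + eps with symmetric Gaussian noise,
    accept with probability min(1, f(y)/f(x))."""
    x = x0
    chain = np.empty(S)
    accepted = 0
    for s in range(S):
        y = x + step * rng.standard_normal()
        if np.log(rng.uniform()) < log_f(y) - log_f(x):  # q symmetric: cancels
            x = y
            accepted += 1
        chain[s] = x
    return chain, accepted / S

# Example: target N(0, 1) via its log-density -x^2/2 (up to a constant)
chain, acc_rate = rw_metropolis(lambda x: -0.5 * x * x, 0.0, 2.4, 50_000)
```

Shrinking `step` raises the acceptance rate but makes smaller moves; increasing it does the opposite, matching the trade-off on the slide.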
Componentwise moves
• For random walk MH: Difficult to move all components of x simultaneously.
• Can choose to change only one (or a few) components at a time.
• The proposal q(y|x) then changes only part of x.
  • Randomly pick which component(s) to change, or
  • Systematically scan through all components.
Gibbs sampler
• Assume x = (x_1, x_2, ..., x_p).
• Repeat:
  Draw x_1^{s+1} ∼ f(x_1 | x_2^s, ..., x_p^s)
  Draw x_2^{s+1} ∼ f(x_2 | x_1^{s+1}, x_3^s, ..., x_p^s)
  ...
  Draw x_p^{s+1} ∼ f(x_p | x_1^{s+1}, x_2^{s+1}, ..., x_{p-1}^{s+1})
• Need to calculate all conditional distributions.
• No need for tuning parameters.
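The scan above can be sketched for a bivariate normal target with correlation ρ (an illustrative choice, not from the slides), where both full conditionals are known in closed form: X_1 | X_2 = x_2 ∼ N(ρ x_2, 1 − ρ²), and symmetrically for X_2:

```python
import numpy as np

rng = np.random.default_rng(5)

def gibbs_bivariate_normal(rho, S, burnin=1_000):
    """Gibbs sampler for a standard bivariate normal with correlation rho,
    alternating draws from the two full conditionals."""
    x1, x2 = 0.0, 0.0
    sd = np.sqrt(1.0 - rho**2)
    draws = np.empty((S, 2))
    for s in range(S + burnin):
        x1 = rho * x2 + sd * rng.standard_normal()  # x1 | x2
        x2 = rho * x1 + sd * rng.standard_normal()  # x2 | x1 (updated)
        if s >= burnin:
            draws[s - burnin] = (x1, x2)
    return draws

draws = gibbs_bivariate_normal(0.8, 20_000)
```

The empirical correlation of the draws should be close to ρ; the stronger the correlation, the slower the chain mixes.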
Slice sampler
• Assume f(x) > 0 for x ∈ X.
• Define f(x, u) = I(x ∈ X, 0 ≤ u ≤ f(x)).
• ∫_u f(x, u)du = f(x), i.e. the marginal distribution is f(x).
• Gibbs sampling on (x, u):
  Draw u^{s+1} ∼ f(u|x^s) = Uniform[0, f(x^s)]
  Draw x^{s+1} ∼ f(x|u^{s+1}) = Uniform{x : u^{s+1} ≤ f(x)}
• Can potentially make large moves in x.
• Efficiency depends on how easy it is to draw from f(x|u^{s+1}).
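A sketch of the two Gibbs steps, assuming (for illustration) the unnormalized target f(x) ∝ exp(−x²/2), for which the slice {x : u ≤ f(x)} is an interval with closed-form endpoints:

```python
import numpy as np

rng = np.random.default_rng(6)

def slice_sampler_normal(S, x0=0.0):
    """Slice sampler for f(x) = exp(-x^2/2) (unnormalized): the slice
    {x : u <= f(x)} is [-b, b] with b = sqrt(-2 log u)."""
    x = x0
    chain = np.empty(S)
    for s in range(S):
        u = rng.uniform(0.0, np.exp(-0.5 * x * x))  # u | x ~ Unif[0, f(x)]
        b = np.sqrt(-2.0 * np.log(u))               # slice endpoints
        x = rng.uniform(-b, b)                      # x | u ~ Unif on the slice
        chain[s] = x
    return chain

chain = slice_sampler_normal(50_000)
```

Here the slice is available analytically; in general, finding {x : u ≤ f(x)} is the hard part, as the efficiency remark on the slide notes.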
Auxiliary variables
• The slice sampler is an example of extending the space so that sampling becomes easier.
• The extra variables are called auxiliary variables.
• Many other examples in the literature.
• Finding appropriate auxiliary variables can be difficult.
• (This theme we have not discussed much in the course.)
Convergence diagnostics
• E[X] ≈ (1/T) Σ_{m=B+1}^{B+T} x^m
• Need to specify B (burn-in) and T.
  • B: time for convergence to the right distribution.
• Informal methods:
  • Trace plots
  • Multiple chains, different starting points
• Formal methods:
  • Test for stationarity (ex: Kolmogorov-Smirnov)
  • Multiple chains, comparing between- and within-chain variability
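The between/within comparison for multiple chains is commonly summarized by the Gelman-Rubin potential scale reduction factor; the version below is a standard formulation added here as a sketch (it is not spelled out on the slides):

```python
import numpy as np

rng = np.random.default_rng(7)

def gelman_rubin(chains):
    """Potential scale reduction factor from K chains of length T:
    compares between-chain and within-chain variability; values near 1
    suggest the chains have reached the same distribution."""
    chains = np.asarray(chains)
    K, T = chains.shape
    means = chains.mean(axis=1)
    B = T * means.var(ddof=1)              # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()  # within-chain variance
    var_hat = (T - 1) / T * W + B / T      # pooled variance estimate
    return np.sqrt(var_hat / W)

# Four independent chains already at stationarity: R should be close to 1
chains = rng.standard_normal((4, 5_000))
R = gelman_rubin(chains)
```

With chains started from dispersed points, R clearly above 1 indicates the burn-in B was too short.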
Importance sampling
• Rewriting the expectation wrt a new density g:
  I = ∫_x h(x)f(x)dx = ∫_x [h(x)f(x)/g(x)] g(x)dx
• Sample x^1, ..., x^M ∼ g(·).
  Î = (1/M) Σ_{m=1}^M h(x^m)f(x^m)/g(x^m) = (1/M) Σ_{m=1}^M w(x^m)h(x^m)
  where w(x) = f(x)/g(x).
• Correcting for the wrong distribution through weights.
• Sample from simpler distributions, and/or
• ···
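The weighted average Î can be sketched directly. The choice of target f = N(0, 1), wider proposal g = N(0, 2²), and h(x) = x² (so I = 1) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(8)

# Importance sampling: I = E_f[h(X)] with f = N(0,1), proposal g = N(0, 2^2).
M = 200_000
x = 2.0 * rng.standard_normal(M)                        # x^m ~ g
f = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)          # target density f(x)
g = np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2.0 * np.pi))  # proposal g(x)
w = f / g                                               # weights w(x) = f(x)/g(x)
I_hat = np.mean(w * x**2)                               # estimates E_f[X^2] = 1
```

Choosing g with heavier tails than f keeps the weights bounded; a proposal narrower than the target can make the weight variance infinite.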
State space models
• Assume dynamic model:
  x_t ∼ p(x_t|x_{t-1})   (latent process)
  y_t ∼ p(y_t|x_t)       (observations)
• Aim: Finding p(x_t|y_{1:t}) for each t.
• Requirement: Fast updating.
Sequential importance sampling (SIS)
• Problem: p(x_t|y_{1:t}) is difficult to sample from and also to evaluate!
• p(x_{1:t}|y_{1:t}) is easier to evaluate; p(x_t|y_{1:t}) is a marginal of this extended distribution.
• p(x_t|x_{t-1}) is usually easy to sample from.
• Importance sampling:
  • Proposal q(x_{1:t}) = p(x_1) ∏_{s=2}^t p(x_s|x_{s-1})
  • Weights w_t(x_{1:t}) = p(x_{1:t}|y_{1:t})/q(x_{1:t})
SIS - continued
• Sequential algorithm:
  • x_{1:t} generated sequentially
  • w_t(x_{1:t}) ∝ p(y_{1:t}|x_{1:t}) ∝ w_{t-1}(x_{1:t-1})p(y_t|x_t)
• Problems:
  • The new proposal does not use the new observation.
  • The variance of the weights increases; a few samples dominate (in weight) the others: degeneracy.
SIS - avoiding degeneracy
• Resample:
  • Sample M samples from {x_t^1, ..., x_t^M} with probabilities proportional to {w_t(x_{1:t}^1), ..., w_t(x_{1:t}^M)}.
  • Put all weights equal to 1/M.
• Avoids degeneracy.
• Introduces a (small) bias.
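SIS with resampling (the bootstrap particle filter) can be sketched on a toy linear-Gaussian model; the model x_t = 0.9 x_{t-1} + N(0, 1), y_t = x_t + N(0, 0.5²) and all parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(9)

def bootstrap_filter(y, M=2_000, phi=0.9, obs_sd=0.5):
    """SIS with resampling for x_t = phi*x_{t-1} + N(0,1), y_t = x_t + N(0, obs_sd^2).
    Propose from p(x_t | x_{t-1}), weight by p(y_t | x_t), then resample.
    Returns the filtering means E[x_t | y_{1:t}]."""
    T = len(y)
    x = rng.standard_normal(M)                    # particles for x_1 ~ N(0, 1)
    means = np.empty(T)
    for t in range(T):
        if t > 0:
            x = phi * x + rng.standard_normal(M)  # propose from p(x_t | x_{t-1})
        logw = -0.5 * ((y[t] - x) / obs_sd) ** 2  # weight by p(y_t | x_t)
        w = np.exp(logw - logw.max())             # stabilize before normalizing
        w /= w.sum()
        means[t] = np.sum(w * x)
        idx = rng.choice(M, size=M, p=w)          # multinomial resampling
        x = x[idx]                                # weights reset to 1/M
    return means

# Simulate a short series from the same model and filter it
T = 50
x_true = np.empty(T)
x_true[0] = rng.standard_normal()
for t in range(1, T):
    x_true[t] = 0.9 * x_true[t - 1] + rng.standard_normal()
y = x_true + 0.5 * rng.standard_normal(T)
means = bootstrap_filter(y)
```

Resampling after every step keeps the particle set from degenerating, at the cost of the small bias noted above; in practice one often resamples only when the effective sample size drops.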