Monte Carlo - Why
• Need to calculate integrals
  $I = \int_x h(x) f(x)\,dx$
  where $f(x)$ is a density.
• Simulation studies - comparing methods
• Bayesian settings - posterior expectations
• Many non-statistical problems
• If $x$ is high-dimensional, alternative methods perform poorly.

Monte Carlo - What
• Idea: Approximate expectations by averages.
• Assume $x_1, \ldots, x_M$ are samples from $f(x)$.
• $\bar h = M^{-1} \sum_{m=1}^M h(x_m)$
• Properties:
  • Unbiased
  • Consistent (if $h(x)$ has finite variance)
  • If the samples are iid, $\mathrm{Var}[\bar h] = M^{-1} \mathrm{Var}[h(x)]$.

Monte Carlo - How
• Need algorithms for simulating from $f(\cdot)$.
• Low-dimensional/simple distributions:
  • Inversion method
  • Accept-reject
  • ···
• High-dimensional/complex distributions:
  • Markov chain Monte Carlo
  • Importance sampling
  • Sequential importance sampling
• Variance reduction

Inversion method
• $F(x) = \int_{-\infty}^x f(u)\,du$.
• Generate $U \sim \mathrm{Unif}[0,1]$, put $X = F^{-1}(U)$.
• $\Pr(X \le x) = \Pr(F^{-1}(U) \le x) = \Pr(U \le F(x)) = F(x)$.
• Exact method.
• Very efficient when applicable.
• Difficult to generalize to higher dimensions.

Accept-reject
• Choose a proposal distribution $g(\cdot)$.
• Find $M = \max_x f(x)/g(x)$.
• Generate $x \sim g(\cdot)$.
• Accept with probability $f(x)/[M g(x)]$.
• The acceptance probability corrects for drawing from the wrong distribution.
• Exact method.
• Efficiency depends on how close $g$ is to $f$.
• Difficult to use in high dimensions.

Variance reduction
• Common random numbers
  • Use the same random numbers when comparing methods.
• Control variates
  • Assume $E[Y] = \mu_Y$, with $\mu_Y$ known. Of interest: $\mu_X = E[X]$.
    $\hat\mu_X = \frac{1}{M}\sum_{m=1}^M x_m + \beta\Big(\frac{1}{M}\sum_{m=1}^M y_m - \mu_Y\Big)$
• Rao-Blackwellization
  $E[X] = E[E[X|Y]] \approx \frac{1}{M}\sum_{m=1}^M E[X|y_m]$

Markov chain Monte Carlo (MCMC)
• Idea: Generate a sequence $x^1, x^2, \ldots$ such that $x^s$ converges (in distribution) to $f(\cdot)$.
• $x^s$ depends only on $x^{s-1}$ and not on earlier samples ⇒ Markov chain.
• Requirements from Markov chain theory:
  • $f$ is an invariant distribution of the Markov chain.
  • Irreducible Markov chain (all states can be reached in a finite number of iterations).
  • Aperiodicity.
• Not exact, but still consistent (ergodicity).
• Dependent samples, so the variance increases.

Invariant distribution
• Assume $K(x, x')$ is the transition kernel.
• Need $f(x') = \int_x K(x, x') f(x)\,dx$.
• Sufficient criterion: Detailed balance,
  $f(x) K(x, x') = f(x') K(x', x)$.
  Usually easier to show, but not always satisfied.
• Note: Many transition kernels can have the same invariant distribution.

Types of MCMC algorithms
• Metropolis-Hastings (MH)
  • Independent MH
  • Random walk MH
• Slice sampler
• Gibbs sampler

Metropolis-Hastings
• Choose a proposal distribution $q(y|x)$.
• Generate $y^{s+1} \sim q(\cdot|x^s)$.
• Put
  $x^{s+1} = \begin{cases} y^{s+1} & \text{with prob } r(x^s, y^{s+1}) \\ x^s & \text{with prob } 1 - r(x^s, y^{s+1}) \end{cases}$
  where
  $r(x, y) = \min\Big\{1, \frac{f(y)\,q(x|y)}{f(x)\,q(y|x)}\Big\}$
• Difficult part: Choosing a good proposal $q$.

Independent MH
• Choose $q(y|x) = q(y)$, i.e. not depending on the current state.
• If $q$ is close to $f$, can be very efficient.
• Want the acceptance rate as high as possible.
• Very difficult to find a good $q$ (active research theme!).
• Very similar to accept-reject.
• No need to find the maximum $M$, and typically a higher acceptance rate.

Random walk MH
• Put $y^{s+1} = x^s + \varepsilon^{s+1}$, $\varepsilon^{s+1} \sim g(\cdot)$.
• $g(\cdot)$ is typically symmetric around zero.
• Typically need to tune the variance in $g$ to get reasonable acceptance.
• Rule-of-thumb: Acceptance rate ∈ [0.2, 0.8].
  • Too low acceptance: Not enough new values.
  • Too high acceptance: Too small moves.
• Most popular choice (see the sketch below).
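To make the random walk MH recipe concrete, here is a minimal Python sketch. It is not part of the original slides; the standard normal target and the step size `scale=2.4` are purely illustrative assumptions, and any log-density with any symmetric $g$ would do.

```python
import numpy as np

def rw_metropolis(log_f, x0, n_iter=10_000, scale=1.0, seed=None):
    """Random walk MH: propose y = x + eps with eps ~ N(0, scale^2).

    The proposal is symmetric, so q(x|y) = q(y|x) cancels and
    r(x, y) = min{1, f(y)/f(x)}.
    """
    rng = np.random.default_rng(seed)
    x, log_fx = x0, log_f(x0)
    samples = np.empty(n_iter)
    accepted = 0
    for s in range(n_iter):
        y = x + scale * rng.standard_normal()        # y^{s+1} = x^s + eps^{s+1}
        log_fy = log_f(y)
        if np.log(rng.uniform()) < log_fy - log_fx:  # accept with prob r(x, y)
            x, log_fx = y, log_fy
            accepted += 1
        samples[s] = x                               # on rejection, keep x^s
    print(f"acceptance rate: {accepted / n_iter:.2f}")  # aim for [0.2, 0.8]
    return samples

# Illustration: standard normal target, log f(x) = -x^2/2 up to a constant
samples = rw_metropolis(lambda x: -0.5 * x**2, x0=0.0, scale=2.4)
```

The printed acceptance rate is exactly what the rule-of-thumb above refers to; `scale` is the tuning parameter one adjusts to land in [0.2, 0.8].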
Componentwise moves
• For random walk MH: Difficult to move all components of $x$ simultaneously.
• Can choose to change only one (or a few) components at a time.
• The proposal $q(y|x)$ then only changes part of $x$.
• Randomly pick which component(s) to change, or
• Systematically scan through all components.

Gibbs sampler
• Assume $x = (x_1, x_2, \ldots, x_p)$.
• Repeat:
  Draw $x_1^{s+1} \sim f(x_1 \mid x_2^s, \ldots, x_p^s)$
  Draw $x_2^{s+1} \sim f(x_2 \mid x_1^{s+1}, x_3^s, \ldots, x_p^s)$
  $\vdots$
  Draw $x_p^{s+1} \sim f(x_p \mid x_1^{s+1}, x_2^{s+1}, \ldots, x_{p-1}^{s+1})$
• Need to calculate all conditional distributions.
• No need for tuning parameters.

Slice sampler
• Assume $f(x) > 0$ for $x \in \mathcal{X}$.
• Define $f(x, u) = I(x \in \mathcal{X},\ 0 \le u \le f(x))$.
• $\int_u f(x, u)\,du = f(x)$, i.e. the marginal distribution is $f(x)$.
• Gibbs sampling on $(x, u)$:
  Draw $u^{s+1} \sim f(u|x^s) = \mathrm{Uniform}[0, f(x^s)]$
  Draw $x^{s+1} \sim f(x|u^{s+1}) = \mathrm{Uniform}\{x : u^{s+1} \le f(x)\}$
• Can potentially make large moves in $x$.
• Efficiency depends on how easy it is to draw from $f(x|u^{s+1})$.

Auxiliary variables
• The slice sampler is an example of extending the space so that sampling becomes easier.
• The extra variables are called auxiliary variables.
• Many other examples in the literature.
• Finding appropriate auxiliary variables can be difficult.
• (A theme we have not discussed much in this course.)

Convergence diagnostics
• $E[X] \approx \frac{1}{T}\sum_{m=B+1}^{B+T} x_m$
• Need to specify $B$ (burn-in) and $T$.
  • $B$: time for convergence to the right distribution.
• Informal methods:
  • Trace plots
  • Multiple chains, different starting points
• Formal methods:
  • Tests for stationarity (ex: Kolmogorov-Smirnov)
  • Multiple chains, comparing between- and within-chain variability

Importance sampling
• Rewrite the expectation with respect to a new density $g$:
  $I = \int_x h(x) f(x)\,dx = \int_x \frac{h(x) f(x)}{g(x)}\, g(x)\,dx$
• Sample $x_1, \ldots, x_M \sim g(\cdot)$:
  $\hat I = \frac{1}{M}\sum_{m=1}^M \frac{h(x_m) f(x_m)}{g(x_m)} = \frac{1}{M}\sum_{m=1}^M w(x_m) h(x_m)$
  where $w(x) = f(x)/g(x)$.
• Correcting for the wrong distribution through the weights (see the sketch below).
• Sample from simpler distributions, and/or ···
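As a concrete illustration of the estimator $\hat I$, here is a minimal sketch (again not from the slides). The choices $f = N(0,1)$, $g = N(0, 2^2)$ and $h(x) = x^2$ are assumptions made purely for illustration; $g$ is deliberately heavier-tailed than $f$ so that the weights $w(x) = f(x)/g(x)$ stay bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 100_000

# Assumed for illustration: target f = N(0,1), proposal g = N(0, 2^2), h(x) = x^2
x = rng.normal(0.0, 2.0, size=M)                            # x_1, ..., x_M ~ g
f_x = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)              # target density f(x)
g_x = np.exp(-0.5 * (x / 2)**2) / (2 * np.sqrt(2 * np.pi))  # proposal density g(x)
w = f_x / g_x                                               # weights w(x) = f(x)/g(x)
I_hat = np.mean(w * x**2)                                   # estimates E_f[X^2] = 1
print(I_hat)
```

If $f$ is known only up to a normalizing constant, one instead normalizes the weights to sum to one (self-normalized importance sampling).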
State space models
• Assume the dynamic model
  $x_t \sim p(x_t | x_{t-1})$  (latent process)
  $y_t \sim p(y_t | x_t)$  (observations)
• Aim: Finding $p(x_t | y_{1:t})$ for each $t$.
• Requirement: Fast updating.

Sequential importance sampling (SIS)
• Problem: $p(x_t | y_{1:t})$ is difficult to sample from and also to evaluate!
• $p(x_{1:t} | y_{1:t})$ is easier to evaluate; $p(x_t | y_{1:t})$ is a marginal of this extended distribution.
• $p(x_t | x_{t-1})$ is usually easy to sample from.
• Importance sampling:
  • Proposal $q(x_{1:t}) = p(x_1) \prod_{s=2}^t p(x_s | x_{s-1})$
  • Weights $w_t(x_{1:t}) = p(x_{1:t} | y_{1:t}) / q(x_{1:t})$

SIS - continued
• Sequential algorithm:
  • $x_{1:t}$ generated sequentially
  • $w_t(x_{1:t}) \propto p(y_{1:t} | x_{1:t}) \propto w_{t-1}(x_{1:t-1})\, p(y_t | x_t)$
• Problems:
  • The new proposal does not use the new observation.
  • The variance of the weights increases; a few samples dominate the others (in weight): degeneracy.

SIS - avoiding degeneracy
• Resample (see the sketch below):
  • Sample $M$ samples from $\{x_t^1, \ldots, x_t^M\}$ with probabilities proportional to $\{w_t(x_{1:t}^1), \ldots, w_t(x_{1:t}^M)\}$.
  • Put the weights equal to $1/M$.
• Avoids degeneracy.
• Introduces a (small) bias.
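The sketch below combines the SIS recursion with the resampling step. It is not from the slides, and the linear Gaussian model ($x_t = 0.9\,x_{t-1} + N(0,1)$, $y_t = x_t + N(0, 0.5^2)$) is an assumption chosen so that every step is easy to write down.

```python
import numpy as np

rng = np.random.default_rng(1)
T, M = 50, 1000

# Assumed model (for illustration only):
#   x_t = 0.9 x_{t-1} + N(0, 1)      latent process p(x_t | x_{t-1})
#   y_t = x_t + N(0, 0.5^2)          observations   p(y_t | x_t)
x_true = np.zeros(T)
for t in range(1, T):
    x_true[t] = 0.9 * x_true[t - 1] + rng.standard_normal()
y = x_true + 0.5 * rng.standard_normal(T)

particles = rng.standard_normal(M)   # x_1^m ~ p(x_1), here N(0, 1)
estimates = np.zeros(T)
for t in range(T):
    if t > 0:
        # Propose from the prior p(x_t | x_{t-1}): the SIS proposal q above
        particles = 0.9 * particles + rng.standard_normal(M)
    # Weight update w_t ∝ w_{t-1} p(y_t | x_t); after resampling, w_{t-1} = 1/M
    log_w = -0.5 * ((y[t] - particles) / 0.5) ** 2
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    estimates[t] = np.sum(w * particles)          # estimate of E[x_t | y_{1:t}]
    # Resample with probabilities proportional to the weights, reset to 1/M
    particles = rng.choice(particles, size=M, p=w)
```

Resampling at every step, as here, is the simplest variant; resampling only when the weights start to degenerate (e.g. monitored through the effective sample size) is a common refinement.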