
Introduction to Bayesian Statistics
Lecture 5: Monte Carlo Approximation
Rung-Ching Tsai
Department of Mathematics
National Taiwan Normal University
April 1, 2015
Monte Carlo Approximation
• We are often interested not only in the posterior distribution of
the parameter(s) itself, but also in other aspects of the posterior
distribution. For example:
1. Pr(θ ∈ A | y_1, …, y_n) for arbitrary sets A.
2. the posterior distribution of |θ_1 − θ_2|
3. the posterior distribution of max{θ_1, …, θ_m}
• Obtaining exact values for these posterior quantities can be
difficult or impossible.
• Instead, we can generate random samples of the parameters from
their posterior distribution; all of these posterior quantities of
interest can then be approximated to an arbitrary degree of
precision using the Monte Carlo method.
Monte Carlo approximation
Once we have iid samples θ^(1), …, θ^(S) from p(θ | y_1, …, y_n):
• The empirical distribution of {θ^(1), …, θ^(S)} is known as a Monte
Carlo approximation to p(θ | y_1, …, y_n).
• θ̄ = Σ_{s=1}^S θ^(s)/S → E[θ | y_1, …, y_n];
• Σ_{s=1}^S (θ^(s) − θ̄)²/(S − 1) → Var[θ | y_1, …, y_n];
• #(θ^(s) ≤ c)/S → Pr(θ ≤ c | y_1, …, y_n);
• the median of {θ^(1), …, θ^(S)} → θ_{1/2};
• the α-percentile of {θ^(1), …, θ^(S)} → θ_α.
• For any function g(θ),
  (1/S) Σ_{s=1}^S g(θ^(s)) → E[g(θ) | y_1, …, y_n] = ∫ g(θ) p(θ | y_1, …, y_n) dθ
  as S → ∞.
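All of these approximations are one-liners once the draws are in hand. A
minimal R sketch (using, for concreteness, the gamma(68, 45) posterior from
the example that follows; the vector name theta.mc is just for illustration):

  S <- 10000
  theta.mc <- rgamma(S, 68, 45)          # S iid draws from p(theta | y)
  mean(theta.mc)                         # approximates E[theta | y]
  var(theta.mc)                          # approximates Var[theta | y] (divides by S - 1)
  mean(theta.mc <= 1.75)                 # approximates Pr(theta <= 1.75 | y)
  quantile(theta.mc, c(.5, .025, .975))  # approximates the median and other quantiles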
Example: Y_1, …, Y_n | θ ∼ iid Poisson(θ)
Y_1, …, Y_n: numbers of children for the n women without college
degrees; Y_1, …, Y_n | θ ∼ iid Poisson(θ).
• Data: n = 44, Σ_{i=1}^n y_i = 66, ȳ = 1.50
• Prior: θ ∼ gamma(α = 2, β = 1)
• Posterior: θ | y ∼ gamma(α + Σ y_i, β + n)
  θ | {n = 44, Σ_{i=1}^n y_i = 66} ∼ gamma(68, 45)
• Quantities of interest:
  ◦ Probability: Pr(θ < c | y)
  ◦ Expectation: E(θ | y)
  ◦ Quantiles: 100(1 − α)% quantile-based credible interval
• Methods
  ◦ Numerical integration
  ◦ Monte Carlo method
Example (cont.): θ | y ∼ gamma(68, 45)
• Probabilities: the posterior probability that {θ < 1.75}
  ◦ Numerical integration:
    Pr(θ < 1.75 | y) = ∫_0^{1.75} p(θ | y_1, …, y_n) dθ
                     = ∫_0^{1.75} (45^{68}/Γ(68)) θ^{67} e^{−45θ} dθ
                     = pgamma(1.75, 68, 45) = 0.8998
  ◦ Monte Carlo, with S = 10, 100, 1000 (R code):
    a <- 2; b <- 1; sy <- 66; n <- 44
    theta.mc10   <- rgamma(10,   a + sy, b + n)
    theta.mc100  <- rgamma(100,  a + sy, b + n)
    theta.mc1000 <- rgamma(1000, a + sy, b + n)
    > mean(theta.mc10 < 1.75)
    [1] 0.9
    > mean(theta.mc100 < 1.75)
    [1] 0.94
    > mean(theta.mc1000 < 1.75)
    [1] 0.899
Example (cont.): θ | y ∼ gamma(68, 45)
• Expectation:
  ◦ Numerical integration: posterior mean = (α + Σ y_i)/(β + n) = 68/45 = 1.51
  ◦ Monte Carlo:
    > mean(theta.mc10)
    [1] 1.532794
    > mean(theta.mc100)
    [1] 1.513947
    > mean(theta.mc1000)
    [1] 1.501015
• Quantiles: 95% quantile-based credible interval
  ◦ Posterior quantiles: qgamma(c(.025, .975), 68, 45) = (1.173, 1.891)
  ◦ Monte Carlo:
    > quantile(theta.mc10, c(.025, .975))
        2.5%    97.5%
    1.260291 1.750068
    > quantile(theta.mc100, c(.025, .975))
        2.5%    97.5%
    1.231646 1.813752
    > quantile(theta.mc1000, c(.025, .975))
        2.5%    97.5%
    1.180194 1.892473
Posterior inference for arbitrary functions g(θ)
Quantity of interest: a function of θ, g(θ)

  sample θ^(1) ∼ p(θ | y_1, …, y_n), compute γ^(1) = g(θ^(1))
  sample θ^(2) ∼ p(θ | y_1, …, y_n), compute γ^(2) = g(θ^(2))
  ⋮
  sample θ^(S) ∼ p(θ | y_1, …, y_n), compute γ^(S) = g(θ^(S))
  (all S draws made independently)

• γ^(1), …, γ^(S) are iid samples from p(γ | y_1, …, y_n)
  ◦ γ̄ = Σ_{s=1}^S γ^(s)/S → E[γ | y_1, …, y_n] as S → ∞;
  ◦ Σ_{s=1}^S (γ^(s) − γ̄)²/(S − 1) → Var[γ | y_1, …, y_n] as S → ∞.
Example: Log-odds
• Data: 54% of the respondents in the 1998 General Social Survey
reported their religious preference as Protestant, leaving
non-Protestants in the minority. Respondents were also asked if
they agreed with a Supreme Court ruling that prohibited state or
local governments from requiring the reading of religious texts in
public schools. Of the n = 860 individuals in the religious minority
(non-Protestant), y = 441 (51%) said they agreed with the
Supreme Court ruling, whereas 353 of the 1011 Protestants (35%)
agreed with the ruling.
• θ: the population proportion agreeing with the ruling in the
minority population.
• Prior: θ ∼ uniform(0, 1)
• Posterior: θ | y ∼ beta(442, 420).
• Quantity of interest: the log-odds γ = log(θ/(1 − θ))
Example: Log-odds γ = log(θ/(1 − θ)) (cont.)
• Posterior: θ | y ∼ beta(442, 420)
• Monte Carlo (R code):
  ◦ samples from the prior
    a <- 1; b <- 1
    theta.prior.mc <- rbeta(10000, a, b)
    gamma.prior.mc <- log(theta.prior.mc / (1 - theta.prior.mc))
    plot(density(gamma.prior.mc))
  ◦ samples from the posterior
    n0 <- 860 - 441; n1 <- 441
    theta.post.mc <- rbeta(10000, a + n1, b + n0)
    gamma.post.mc <- log(theta.post.mc / (1 - theta.post.mc))
    plot(density(gamma.post.mc))
Example: Functions of two parameters
• Goal: to compare the birthrates of women with college degrees to
those without, in terms of their numbers of children.
  ◦ Y_{1,1}, …, Y_{n_1,1}: numbers of children for the n_1 women without
    college degrees; Y_{1,1}, …, Y_{n_1,1} | θ_1 ∼ iid Poisson(θ_1).
  ◦ Y_{1,2}, …, Y_{n_2,2}: numbers of children for the n_2 women with
    college degrees; Y_{1,2}, …, Y_{n_2,2} | θ_2 ∼ iid Poisson(θ_2).
• Data:
  Less than bachelor's: n_1 = 111, Σ_{i=1}^{n_1} y_{i,1} = 217, ȳ_1 = 1.95
  Bachelor's or higher: n_2 = 44, Σ_{i=1}^{n_2} y_{i,2} = 66, ȳ_2 = 1.50
• (Conjugate) Prior: θ_1, θ_2 ∼ iid gamma(α = 2, β = 1)
• Posterior distributions: θ | y ∼ gamma(α + Σ y_i, β + n)
  θ_1 | {n_1 = 111, Σ_{i=1}^{n_1} y_{i,1} = 217} ∼ gamma(219, 112)
  θ_2 | {n_2 = 44, Σ_{i=1}^{n_2} y_{i,2} = 66} ∼ gamma(68, 45)
Monte Carlo method
• Samples of (θ_1, θ_2):

  sample θ_1^(1) ∼ p(θ_1 | Σ_{i=1}^{111} y_{i,1} = 217), sample θ_2^(1) ∼ p(θ_2 | Σ_{i=1}^{44} y_{i,2} = 66),
  sample θ_1^(2) ∼ p(θ_1 | Σ_{i=1}^{111} y_{i,1} = 217), sample θ_2^(2) ∼ p(θ_2 | Σ_{i=1}^{44} y_{i,2} = 66),
  ⋮
  sample θ_1^(S) ∼ p(θ_1 | Σ_{i=1}^{111} y_{i,1} = 217), sample θ_2^(S) ∼ p(θ_2 | Σ_{i=1}^{44} y_{i,2} = 66)

  {(θ_1^(1), θ_2^(1)), …, (θ_1^(S), θ_2^(S))} are iid samples from the joint
  posterior distribution of θ_1 and θ_2.
• R code
  a <- 2; b <- 1; sy1 <- 217; n1 <- 111; sy2 <- 66; n2 <- 44
  theta1.mc <- rgamma(10000, a + sy1, b + n1)
  theta2.mc <- rgamma(10000, a + sy2, b + n2)
Example: Functions of two parameters
• Quantity of interest: Pr(θ_1 > θ_2 | y_{1,1}, …, y_{n_1,1}, y_{1,2}, …, y_{n_2,2})
• Methods
  ◦ Numerical integration:
    Pr(θ_1 > θ_2 | y) = ∫_0^∞ ∫_0^{θ_1} p(θ_1, θ_2 | y_{1,1}, …, y_{n_1,1}, y_{1,2}, …, y_{n_2,2}) dθ_2 dθ_1
                      = ∫_0^∞ ∫_0^{θ_1} (112^{219} 45^{68} / [Γ(219)Γ(68)]) θ_1^{218} e^{−112θ_1} θ_2^{67} e^{−45θ_2} dθ_2 dθ_1.
  ◦ Monte Carlo:
    Pr(θ_1 > θ_2 | y) ≈ (1/S) Σ_{s=1}^S 1(θ_1^(s) > θ_2^(s))
    > mean(theta1.mc > theta2.mc)
    [1] 0.9708
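Because θ_1 and θ_2 are independent a posteriori, the double integral above
reduces to a one-dimensional integral that R can evaluate directly. A minimal
sketch (the helper name f is just for illustration):

  # Pr(theta1 > theta2 | y): integrate the gamma(68, 45) density of theta2
  # against the gamma(219, 112) upper tail of theta1
  f <- function(t2) dgamma(t2, 68, 45) * pgamma(t2, 219, 112, lower.tail = FALSE)
  integrate(f, 0, Inf)   # approx. 0.97, in line with the Monte Carlo estimate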
Example: Functions of two parameters
• Quantity of interest: the posterior distribution of θ_1/θ_2
• Monte Carlo: we can use the empirical distribution of
  {θ_1^(1)/θ_2^(1), θ_1^(2)/θ_2^(2), …, θ_1^(S)/θ_2^(S)} to approximate the
  posterior distribution of θ_1/θ_2.
• R code
  a <- 2; b <- 1; sy1 <- 217; n1 <- 111; sy2 <- 66; n2 <- 44
  theta1.mc <- rgamma(10000, a + sy1, b + n1)
  theta2.mc <- rgamma(10000, a + sy2, b + n2)
  ratiooftheta <- theta1.mc / theta2.mc
  plot(density(ratiooftheta))
Sampling from predictive distributions
• Prior predictive distribution: p(Ỹ = ỹ) = ∫ p(ỹ | θ) p(θ) dθ
• Posterior predictive distribution:
  p(Ỹ = ỹ | y_1, …, y_n) = ∫ p(ỹ | θ) p(θ | y_1, …, y_n) dθ
  ◦ Samples of (θ, ỹ):
    sample θ^(1) ∼ p(θ | y_1, …, y_n), sample ỹ^(1) ∼ p(ỹ | θ^(1)),
    sample θ^(2) ∼ p(θ | y_1, …, y_n), sample ỹ^(2) ∼ p(ỹ | θ^(2)),
    ⋮
    sample θ^(S) ∼ p(θ | y_1, …, y_n), sample ỹ^(S) ∼ p(ỹ | θ^(S))
  ◦ {(θ^(1), ỹ^(1)), …, (θ^(S), ỹ^(S))} are iid samples from the joint
    posterior distribution of θ and ỹ
  ◦ {ỹ^(1), …, ỹ^(S)} are samples from the posterior predictive
    distribution of ỹ (see the sketch below).
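A minimal R sketch of this two-step scheme, using the gamma(68, 45)
posterior from the earlier Poisson example (each ỹ^(s) is drawn from a
Poisson with its own θ^(s)):

  S <- 10000
  theta.mc  <- rgamma(S, 68, 45)   # theta^(s) ~ p(theta | y)
  ytilde.mc <- rpois(S, theta.mc)  # ytilde^(s) ~ Poisson(theta^(s)), one draw per theta
  table(ytilde.mc) / S             # approximate predictive probabilities Pr(Ytilde = y | y)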
Example: Poisson model
• Y_{1,1}, …, Y_{n_1,1}: numbers of children for the n_1 women without college degrees;
  Y_{1,2}, …, Y_{n_2,2}: numbers of children for the n_2 women with college degrees;
  Y_{1,1}, …, Y_{n_1,1} | θ_1 ∼ iid Poisson(θ_1); Y_{1,2}, …, Y_{n_2,2} | θ_2 ∼ iid Poisson(θ_2).
• Data:
  Less than bachelor's: n_1 = 111, Σ_{i=1}^{n_1} y_{i,1} = 217, ȳ_1 = 1.95
  Bachelor's or higher: n_2 = 44, Σ_{i=1}^{n_2} y_{i,2} = 66, ȳ_2 = 1.50
• (Conjugate) Prior: θ_1, θ_2 ∼ iid gamma(α = 2, β = 1)
• Posterior distributions: θ | y ∼ gamma(α + Σ y_i, β + n)
  θ_1 | {n_1 = 111, Σ_{i=1}^{n_1} y_{i,1} = 217} ∼ gamma(219, 112)
  θ_2 | {n_2 = 44, Σ_{i=1}^{n_2} y_{i,2} = 66} ∼ gamma(68, 45)
• Question: Consider two randomly sampled individuals, one from
  each of the two populations. To what extent do we expect the one
  without the bachelor's degree to have more children than the
  other? I.e., find Pr(Ỹ_1 > Ỹ_2 | y).
Posterior predictive distribution of Ỹ
• Prior of θ: θ ∼ gamma(α, β)
• Likelihood: Y_1, …, Y_n | θ ∼ Poisson(θ)
• Posterior of θ given y: θ | y ∼ gamma(α + Σ y_i, β + n).
• Posterior predictive of Ỹ given y (the fourth step uses the gamma
  integral ∫_0^∞ θ^{a−1} e^{−cθ} dθ = Γ(a)/c^a):

  p(ỹ | y_1, …, y_n)
    = ∫_0^∞ p(ỹ | θ, y) p(θ | y) dθ = ∫_0^∞ p(ỹ | θ) p(θ | y_1, …, y_n) dθ
    = ∫_0^∞ { θ^{ỹ} e^{−θ} / ỹ! } { (β + n)^{α+Σy_i} / Γ(α + Σy_i) · θ^{α+Σy_i−1} e^{−(β+n)θ} } dθ
    = (β + n)^{α+Σy_i} / [Γ(ỹ + 1) Γ(α + Σy_i)] · ∫_0^∞ θ^{α+Σy_i+ỹ−1} e^{−(β+n+1)θ} dθ
    = (β + n)^{α+Σy_i} / [Γ(ỹ + 1) Γ(α + Σy_i)] · Γ(α + Σy_i + ỹ) / (β + n + 1)^{α+Σy_i+ỹ}
    = Γ(α + Σy_i + ỹ) / [Γ(ỹ + 1) Γ(α + Σy_i)] · ((β + n)/(β + n + 1))^{α+Σy_i} · (1/(β + n + 1))^{ỹ}
Posterior predictive distribution of Ỹ (cont.)
• Posterior predictive of Ỹ given y:
  p(ỹ | y_1, …, y_n) = Γ(α + Σy_i + ỹ) / [Γ(ỹ + 1) Γ(α + Σy_i)] · ((β + n)/(β + n + 1))^{α+Σy_i} · (1/(β + n + 1))^{ỹ}
  i.e., Ỹ | y ∼ Negative Binomial(α + Σ y_i, β + n)
  ◦ E(Ỹ | y) = (α + Σ y_i)/(β + n)
  ◦ Var(Ỹ | y) = (α + Σ y_i)(β + n + 1)/(β + n)² = [(α + Σ y_i)/(β + n)] · [(β + n + 1)/(β + n)]
• Remark on the Negative Binomial distribution: if X ∼ Neg-bin(a, b), then
  p(x) = (x + a − 1 choose a − 1) (b/(b + 1))^a (1/(b + 1))^x
       = Γ(x + a) / [Γ(x + 1) Γ(a)] · (b/(b + 1))^a (1/(b + 1))^x
  E(X) = a/b,  Var(X) = a(b + 1)/b².
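As a quick numerical check of this parameterization (a minimal sketch; R's
dnbinom with prob = b/(b + 1) corresponds to Neg-bin(a, b) as defined here):

  a <- 68; b <- 45   # e.g. alpha + sum(y_i) and beta + n for theta2
  x <- 0:5
  manual  <- exp(lgamma(x + a) - lgamma(x + 1) - lgamma(a) +
                 a * log(b / (b + 1)) + x * log(1 / (b + 1)))
  builtin <- dnbinom(x, size = a, prob = b / (b + 1))
  all.equal(manual, builtin)    # TRUE: same pmf
  c(a / b, a * (b + 1) / b^2)   # mean a/b and variance a(b+1)/b^2 from above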
Posterior predictive distribution of Ỹ (cont.)
Monte Carlo method:
• Directly from the known Negative Binomial form:
  a <- 2; b <- 1
  sy1 <- 217; n1 <- 111; sy2 <- 66; n2 <- 44
  tildey1.mc <- rnbinom(10000, size = a + sy1, mu = (a + sy1) / (b + n1))
  tildey2.mc <- rnbinom(10000, size = a + sy2, mu = (a + sy2) / (b + n2))
  mean(tildey1.mc > tildey2.mc)
• As a mixture of Poissons (sample θ, then ỹ | θ):
  a <- 2; b <- 1
  sy1 <- 217; n1 <- 111; sy2 <- 66; n2 <- 44
  theta1.mc <- rgamma(10000, a + sy1, b + n1)
  theta2.mc <- rgamma(10000, a + sy2, b + n2)
  tildey1.mc <- rpois(10000, theta1.mc)
  tildey2.mc <- rpois(10000, theta2.mc)
  > mean(tildey1.mc > tildey2.mc)
  [1] 0.4823
Both approaches sample from the same posterior predictive distributions,
so their estimates of Pr(Ỹ_1 > Ỹ_2 | y) agree up to Monte Carlo error.