Bayesian Methods with Monte Carlo Markov Chains I Henry Horng-Shing Lu

Bayesian Methods with Monte
Carlo Markov Chains I
Henry Horng-Shing Lu
Institute of Statistics
National Chiao Tung University
[email protected]
http://tigpbp.iis.sinica.edu.tw/courses.htm
1
Part 1
Introduction to
Bayesian Methods
2
Bayes' Theorem


P( A, B)
Conditional Probability: P( A | B) 
P( B)
One Derivation:
P( B | A) P( A)  P( A, B)  P( A | B) P( B)
P( A | B) P( B)
P( B | A) 
P( A)

Alternative Derivation:
P( B | A) P( A)
P( A | B) 
P( B | A) P( A)  P( B | Ac ) P( Ac )

http://en.wikipedia.org/wiki/Bayes'_theore
m
3
False Positive and Negative
Medical diagnosis:
 Type I and II Errors: hypothesis testing in
statistical inference
 http://en.wikipedia.org/wiki/False_positive

Actual Status
Positive
(Reject
H0)
Diagnosis
Test
Negative
Result
(Accept
H0)
Disease (H1)
Normal (H0)
True Positive
(Power, 1-β)
False Positive (Type I
Error, α)
False Negative (Type II
Error, β)
True Negative
(Confidence Level,
1-α)
4
Bayesian Inference (1)
False positives in a medical test
 Test accuracy by conditional probabilities:
P  Test Positive Disease   P  R H1   1    

P  Test Negative Normal   P  A H 0   1  a  5

Prior probabilities:
P  Disease   P  H1   001
P  Normal   P  H0   999
5
Bayesian Inference (2)

Posterior probabilities by Bayes' theorem:
True Positive Probability  P  Disease Test Positive 
0.99  0.001
 P  H1 R  
 0.019
0.99  0.001  0.05  0.999
False Positive Probability  P  Normal Test Positive 
 P  H0 R   1  0.019  0.981
6
Bayesian Inference (3)
Equal Prior probabilities:
P  Disease   P  H1   P  Normal   P  H0   0.5
 Posterior probabilities by Bayes’ theorem:

True Positive Probability  P  Disease Test Positive 
0.99  0.5
 P  H1 R  
 0.952
0.99  0.5  0.05  0.5

http://en.wikipedia.org/wiki/Bayesian_infer
ence
7
Bayesian Inference (4)

In the courtroom:
P  Evidence of DNA Match Guilty   1 and
P  Evidence of DNA Match Innocent   106
Based on the evidence other than the DNA
match, P  Guilty   0.3 and P  Innocent   0.7
 By the Bayes Theorem,

P  Guilty Evidence of DNA Match 
0.3 1.0

 0.99999766667
6
0.3 1.0  0.7 10
8
Naive Bayes Classifier
Naive Bayes Classifier is a simple
probabilistic classifier based on applying
Bayes' theorem with strong (naive)
independence assumptions.
 http://en.wikipedia.org/wiki/Naive_Bayes_c
lassifier

9
Naive Bayes Probabilistic Model (1)
The probability model for a classifier is a
conditional model P  C F1 , , Fn 
where C is a dependent class variable and
F1 , , Fn are several feature variables.
 By Bayes’ theorem,

P(C ) P( F1 ,..., Fn | C )
P(C | F1 ,..., Fn ) 
P( F1 ,..., Fn )
10
Naive Bayes Probabilistic Model (2)

Use repeated applications of the definition
of conditional probability:
P  C , F1 ,
, Fn   P  C  P  F1 ,
, Fn | C 
 P  C  P  F1 | C  P  F2 ,
, Fn | C , F1 
 P  C  P  F1 | C  P  F2 | C  P  F3 ,
, Fn | C , F1, F2 
and so forth.
 Assume that each Fi is conditionally
independent of every other F j for i  j , this
means that P  Fi | C , Fj   P  Fi | C 
11
Naive Bayes Probabilistic Model (3)

So P  C, F1 , , Fn  can be expressed as
n
P(C )  P( Fi | C ).
i 1

So P  C | F1 , , Fn  can be expressed like
n
1
P(C )  P( Fi | C ),
i 1
Z
where Z is constant if the values of the
feature variables are known.
 Constructing a classifier from the
probability model:
n
Classify( f1 ,.., f n )  arg max c P(C  c)  P( Fi  fi | C  c). 12
i 1
Bayesian Spam Filtering (1)
Bayesian spam filtering, a form of e-mail
filtering, is the process of using a Naive
Bayes classifier to identify spam email.
 References:
http://en.wikipedia.org/wiki/Spam_%28email%29
http://en.wikipedia.org/wiki/Bayesian_spa
m_filtering
http://www.gfi.com/whitepapers/whybayesian-filtering.pdf

13
Bayesian Spam Filtering (2)

Probabilistic model:
P(spam) P(words | spam)
P(spam | words) 
P(words)
where {words} mean {certain words in
spam emails}.
 Particular words have particular
probabilities of occurring in spam emails
and in legitimate emails. For instance, most
email users will frequently encounter the
word “Viagra” in spam emails, but will
seldom see it in other emails.
14
Bayesian Spam Filtering (3)
Before mails can be filtered using this
method, the user needs to generate a
database with words and tokens (such as
the $ sign, IP addresses and domains, and
so on), collected from a sample of spam
mails and valid mails.
 After generating, each word in the email
contributes to the email's spam probability.
This contribution is called the posterior
probability and is computed using Bayes’
theorem.

15
Bayesian Spam Filtering (4)

Then, the email's spam probability is
computed over all words in the email, and if
the total exceeds a certain threshold (say
95%), the filter will mark the email as a
spam.
16
Bayesian Network (1)
Bayesian network is compact
representation of probability distributions
via conditional independence.
 For example, a Bayesian network could
represent the probabilistic relationships
between diseases and symptoms.


http://en.wikipedia.org/wiki/Bayesian_network
http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.ht
ml
http://www.cs.huji.ac.il/~nirf/Nips01Tutorial/index.html
17
Bayesian Network (2)
Conditional independencies & graphical
language capture structure of many realworld distributions
 Graph structure provides much insight into
domain


Allows “knowledge discovery”
Cloudy
Data
+
Prior Information
Learner
Sprinkler
Wet Grass
Rain
S R
P(W | S,R)
T T
1.0
0.0
T F
0.1
0.9
F T
0.1
0.9
F F
0.01 0.99
18
Bayesian Network (3)
Qualitative part:
Cloudy
Directed acyclic graph
(DAG)
Sprinkler
 Nodes - random
variables
 Edges - direct influence
Rain
Wet Grass
Together:
Define a unique distribution
in a factored form
S R P(W | S,R)
T T 1.0 0.0
T F 0.1 0.9
F T
F F
0.1 0.9
0.01 0.99
Quantitative part:
Set of conditional
probability distributions
P(C , S , R,W )  P(C ) P( S | C ) P( R | C ) P(W | S , R)
19
Inference

Posterior probabilities


Most likely explanation


Scenario that explains evidence
Rational decision making



Probability of any event given any evidence
Maximize expected utility
Value of Information
Effect of intervention
Earthquake
Radio
Burglary
Alarm
Call
20
Example 1 (1)
Cloudy
Rain
Sprinkler
Wet Grass
21
Example 1 (2)

By the chain rule of probability, the joint
probability of all the nodes in the graph
above is
P  C, S , R,W   P C  P  S | C  P  R | C, S  P W | C, S , R 

By using conditional independence
relationships, we can rewrite this as
P  C, S , R,W   P C  P  S | C  P  R | C  P W | S , R 
where we were allowed to simplify the third
term because R is independent of S given
its parent C, and the last term because W is
independent of C given its parents S and R.
22
Example 1 (3)

Bayes theorem:
P(W  T )   P(C  c, S  s, R  r ,W  T )
c , s ,r
 PC , S , R ,W ( F , F , T , T )  PC , S , R ,W ( F , T , T , T )  PC , S , R ,W (T , F , T , T )  PC , S , R ,W (T , T , T , T )
 PC , S , R ,W ( F , F , F , T )  PC , S , R ,W ( F , T , F , T )  PC , S , R ,W (T , F , F , T )  PC , S , R ,W (T , T , F , T )
 0.5  0.5  0.2  0.9  0.5  0.5  0.2  0.99  0.5  0.9  0.8  0.9  0.5  0.1 0.8  0.99
0.5  0.5  0.8  0  0.5  0.5  0.8  0.9  0.5  0.9  0.2  0  0.5  0.1 0.2  0.9
 0.4581  0.189
 0.6471
is a normalizing constant, equal to the
probability (likelihood) of the data.
23
Example 1 (4)

The posterior probability of each
explanation
P( R  T ,W  T )  c ,s P(C  c, S  s, R  T ,W  T ) 0.4581
P( R  T | W  T ) 


P(W  T )
P(W  T )
0.6471
P(S  T ,W  T )  r , s P(C  c, S  T , R  r ,W  T ) 0.2781
P( S  T | W  T ) 


P(W  T )
P(W  T )
0.6471

So we see that it is more likely that the
grass is wet because it is raining: the
likelihood ratio is 0.708 / 0.430  1.647.
24
Part 2
MLE vs. Bayesian Methods
25
Maximum Likelihood Estimates
(MLEs) vs. Bayesian Methods

Binomial Experiments:
http://www.math.tau.ac.il/~nin/Courses/ML
04/ml2.ppt

More Explanations and Examples:
http://www.dina.dk/phd/s/s6/learning2.pdf
26
MLE (1)

Binomial Experiments: suppose we toss
coin N times and the random variable is
1, if the ith trial is head
Xi  
0, if the ith trial is tail

We denote by  the (unknown) probability
P  Head  .
Estimation task:
 Given a sequence of toss samples x1 , x2 , , xN
we want to estimate the probabilities P  H   
and P T   1   .
27
MLE (2)

The number of heads we see has a
binomial distribution
N k
P( X  k )     (1   ) N k
k
and thus
E[ X ]  N

Clearly, the MLE of  is
equal to MME of  .
N
x
i 1
N
i
and is also
28
MLE (3)

Suppose we observe the sequence

H, H.
MLE estimate is P  H   1, P T   0.
 Should we really believe that tails are
impossible at this stage?
 Such an estimate can have disastrous
effect.


If we assume that P(T)=0, then we are willing to
act as though this outcome is impossible.
29
Bayesian Reasoning

In Bayesian reasoning we represent our
uncertainty about the unknown parameter
 by a probability distribution.

This probability distribution can be viewed
as subjective probability

This is a personal judgment of uncertainty.
30
Bayesian Inference





P   -prior distribution about the values of 
P  x1 , , xN |   -likelihood of binomial
experiment given a known value 
Given x1 , , xN , we can compute posterior
distribution on 
P( x1 , , xN |  ) P( )
P( | x1 , , xN ) 
P( x1 , , xN )
The marginal likelihood is
P( x1 , , xN )   P( x1 , , xN |  ) P( )d
http://www.dina.dk/phd/s/s6/learning2.pdf
31
Binomial Example (1)
In binomial experiment, the unknown
parameter is   P  H 
 Simplest prior: P    1 for 0    (Uniform
prior)
k
N k
 Likelihood: P ( x1 , , xN |  )   (1   )
where k is number of heads in the
sequence
1
k
nk
 Marginal Likelihood: P ( x1 , , xN )    (1   ) d

0
32
Binomial Example (2)

Using integration by parts, we have:
1
P( x1 ,
, xN )    k (1   ) N k d
0
1 k 1
N  k k 1
N k 1
N  k 1

 (1   ) 0 

(1


)
d

k 1
k 1 0
1
N  k k 1
N  k 1


(1


)
d

k 1 0
1

Multiply both side by N choose k , we have
1
N
N1 k


N k
k 1
N  k 1

(1


)
d



(1


)
d
 


 k 0
 k  1 0
33
Binomial Example (3)

The recursion terminates when k  N,
1
N1 N
1
N N
N
    (1   ) d    d 
N 1
N0
0
Thus,
1
P( x1 ,
, xN )    (1   )
k
0

N k
1 N
d 
 
N 1 k 
1
We conclude that the posterior is
P( | x1 ,
N k
, xN )  ( N  1)    (1   ) N k
k
34
Binomial Example (4)
How do we predict (estimate ) using the
posterior?
 We can think of this as computing the
probability of the next element in the
sequence

P( xN 1 | x1 ,
, xN )   P( xN 1 ,  | x1 ,
  P( xN 1 |  , x1 ,
, xN ) P( | x1 ,
  P( xN 1 |  ) P( | x1 ,

, x N ) d
, x N ) d
, x N ) d
Assumption: if we know  , the probability of xN 1
is independent x1 , , xN
35
P( xN 1 |  , x1 , , xN )  P( xN 1 |  )
Binomial Example (5)

Thus, we conclude that
ˆ  P( xN 1  H | x1 ,
   P( | x1 ,
, xN )   P( xN 1 |  ) P( | x1 ,
, x N ) d
, xN )d
 N  k 1
 ( N  1)     (1   ) N  k d
k
1
 N  1  N  1
k 1
 ( N  1)  
 

N 2
 k  N  2  k 1 
36
Beta Prior (1)

The uniform priori distribution is a
particular case of the Beta Distribution. Its
general form is:
( s )
f ( ) 
 1 1 (1   )0 1
(1 )( 0 )
Where s  1   0 and show as Beta(1 ,  0 ).
1
 The expected value of the parameter is:
1   0
 The uniform is Beta (1,1)
37
Beta Prior (2)
There are important theoretical reasons for
using the Beta prior distribution?
 One of them has also important practical
consequences: it is the conjugate
distribution of binomial sampling.
 If the prior is Beta 1 , 0  and we have
observed some data with N1 and N 0 cases for
the two possible values of the variable,
then the posterior is also Beta with
parameters Beta(1  N1 ,  0  N0 )

38
Beta Prior (3)

The expected value for the posterior
1  N1
distribution is   N    N
1
1
0
0
0
1
(
,
) represent the prior
 The value
1   0 1   0
probabilities for the value of the variables
based in our past experience.
 The value s  1   0 is called equivalent
sample size measure the importance of our
past experience.
 Larger values make that prior probabilities
have more importance.
39
Beta Prior (4)

When  0 , 1  0, then we have maximum
likelihood estimation
40
Multinomial Experiments
Now, assume that we have a variable X
taking values on a finite set a1 , , an  and we
have a serious of independent observations
of this distribution,  x1 , x2 , , xm  and we want
to estimate the value i  P  ai  , i  1, , n .
 Let N i be the number of cases in the sample
in which we have obtained the value
ai  i  1, , n 

Ni
ˆ
 The MLE of  i is  i 
m

The problems with small samples are
completely analogous.
41
Dirichlet Prior (1)
We can also follow the Bayesian
approach, but the prior distribution is the
Dirichlet distribution, a generalization of
the Beta distribution for more than 2
cases: 1 , ,n  .
 The expression of D 1 , ,  n  is

f (1 ,
n
( s )
, n ) 
 ii 1
(1 ) ( n ) i 1
n
where s    i is the equivalent sample size.
i 1
42
Dirichlet Prior (2)

The expected vector is
E (1 ,

,n )  (
1
s
,
,
n
s
)
Greater value of s makes this distribution
more concentrated around the mean
vector.
43
Dirichlet Posterior


If we have a set of data with counts  N1 , , Nn 
, then the posterior distribution is also
Dirichlet with parameters
D(1  N1 ,...,  n  N n )
The Bayesian estimation of probabilities
 n  Nn
1  N1
, ,
)
are: (
sm
sm
n
n
i 1
i 1
where m   N i , s    i .
44
Multinomial Example (1)

Imagine that we have an urn with balls of
different colors: red(R), blue(B) and
green(G); but on an unknown quantity.

Assume that we picked up balls with
replacement, with the following sequence:
 B, B, R, R, B .
45
Multinomial Example (2)

If we assume a Dirichlet prior distribution
with parameters: D 1,1,1 , then the
estimated frequencies for red,blue and
3 4 1
green:  , , 
8 8 8

Observe, as green has a positive
probability, even if never appears in the
sequence.
46
Part 3
An Example in Genetics
47
Example 1 in Genetics (1)

Two linked loci with alleles A and a, and B
and b



A, B: dominant
a, b: recessive
A double heterozygote AaBb will produce
gametes of four types: AB, Ab, aB, ab
A
A
a
B
b
1/2
1/2
A
B
A
a
a
B
b
b
1/2
b
a
1/2
B
48
Example 1 in Genetics (2)

Probabilities for genotypes in gametes
No Recombination
Recombination
Male
1-r
r
Female
1-r’
r’
A
A
a
B
b
1/2
1/2
A
B
A
a
a
B
b
1/2
a
1/2
b
AB
ab
aB
Ab
Male
(1-r)/2
(1-r)/2
r/2
r/2
Female
(1-r’)/2
(1-r’)/2
r’/2
r’/2
b
B
49
Example 1 in Genetics (3)
Fisher, R. A. and Balmukand, B. (1928).
The estimation of linkage from the offspring
of selfed heterozygotes. Journal of Genetics,
20, 79–92.
 More:
http://en.wikipedia.org/wiki/Genetics
http://www2.isye.gatech.edu/~brani/isyeba
yes/bank/handout12.pdf

50
Example 1 in Genetics (4)
MALE
F
E
M
A
L
E
AB
(1-r)/2
ab
(1-r)/2
aB
r/2
Ab
r/2
AB
(1-r’)/2
AABB
(1-r) (1-r’)/4
aABb
(1-r) (1-r’)/4
aABB
r (1-r’)/4
AABb
r (1-r’)/4
ab
(1-r’)/2
AaBb
(1-r) (1-r’)/4
aabb
(1-r) (1-r’)/4
aaBb
r (1-r’)/4
Aabb
r (1-r’)/4
aB
r’/2
AaBB
(1-r) r’/4
aabB
(1-r) r’/4
aaBB
r r’/4
AabB
r r’/4
Ab
r’/2
AABb
(1-r) r’/4
aAbb
(1-r) r’/4
aABb
r r’/4
AAbb
r r’/4
51
Example 1 in Genetics (5)










Four distinct phenotypes:
A*B*, A*b*, a*B* and a*b*.
A*: the dominant phenotype from (Aa, AA, aA).
a*: the recessive phenotype from aa.
B*: the dominant phenotype from (Bb, BB, bB).
b*: the recessive phenotype from bb.
A*B*: 9 gametic combinations.
A*b*: 3 gametic combinations.
a*B*: 3 gametic combinations.
a*b*: 1 gametic combination.
Total: 16 combinations.
52
Example 1 in Genetics (6)

Let   (1  r )(1  r ') , then
2 
P( A * B*) 
4
1
P( A * b*)  P(a * B*) 
4
P(a * b*) 

4
53
Example 1 in Genetics (7)

Hence, the random sample of n from the
offspring of selfed heterozygotes will follow
a multinomial distribution:
 2   1 1  
Multinomial  n;
,
,
, 
4
4
4 4

We know that
  (1  r )(1  r '), 0  r  1/ 2, and 0  r '  1/ 2
So
1/ 4    1
54
Bayesian for Example 1 in Genetics (1)

To simplify computation, we let
P( A * B*)  1 , P( A * b*)  2
P(a * B*)  3 , P(a * b*)  4

The random sample of n from the offspring
of selfed heterozygotes will follow a
multinomial distribution:
Multinomial n;1 , 2 , 3 , 4 
55
Bayesian for Example 1 in Genetics (2)
If we assume a Dirichlet prior distribution
with parameters: D(1 ,  2 , 3 ,  4 ) to estimate
probabilities for A*B*, A*b*, a*B* and
a*b*.
 Recall that

A*B*: 9 gametic combinations.
 A*b*: 3 gametic combinations.
 a*B*: 3 gametic combinations.
 a*b*: 1 gametic combination.
We consider (1 ,  2 , 3 ,  4 )  (9,3,3,1).

56
Bayesian for Example 1 in Genetics (3)

Suppose that we observe the data of
y   y1 , y2 , y3 , y4   125,18, 20, 24 .

So the posterior distribution is also
Dirichlet with parameters D 134, 21, 23, 25

The Bayesian estimation for probabilities
are (1 , 2 , 3 , 4 )   0.660,0.103,0.113,0.123
57
Bayesian for Example 1 in Genetics (4)

Consider the original model,
2 
1

P( A * B*) 
, P( A * b*)  P(a * B*) 
, P(a * b*)  .
4
4
4

The random sample of n also follow a
multinomial distribution:
 2   1 1  
( y1 , y2 , y3 , y4 ) ~ Multinomial  n;
,
,
, .
4
4
4 4


We will assume a Beta prior distribution:
Beta(1 , 2 ).
58
Bayesian for Example 1 in Genetics (5)

The posterior distribution becomes
P( y1 , y2 , y3 , y4 |  ) P( )
P( | y1 , y2 , y3 , y4 ) 
.
 P( y1, y2 , y3 , y4 |  )P( )d

The integration in the above denominator,
 P( y1 , y2 , y3 , y4 |  ) P( )d
does not have a close form.
59
Bayesian for Example 1 in Genetics (6)


How to solve this problem?
Monte Carlo Markov Chains (MCMC)
Method!
What value is appropriate for ( 1 , 2 ) ?
60
Part 4
Monte Carlo Methods
61
Monte Carlo Methods (1)
Consider the game of
solitaire: what’s the
chance of winning with a
properly shuffled deck?
 http://en.wikipedia.org/
wiki/Monte_Carlo_meth
od
 http://nlp.stanford.edu/l
ocal/talks/mcmc_2004_
07_01.ppt

?
Lose
Lose
Win
Lose
Chance of winning is 1 in 4!
62
Monte Carlo Methods (2)
Hard to compute analytically because
winning or losing depends on a complex
procedure of reorganizing cards.
 Insight: why not just play a few hands, and
see empirically how many do in fact win?
 More generally, can approximate a
probability density function using only
samples from that density.

63
Monte Carlo Methods (3)
Given a very large set X and a distribution
f  x  over it.
 We draw a set of N i.i.d. random samples.
 We can then approximate the distribution
using these samples.

f(x)
X
1
f N ( x) 
N
N
1( x
i 1
(i )
 x)  f ( x)
N 
64
Monte Carlo Methods (4)

We can also use these samples to compute
expectations:
1
EN ( g ) 
N

N
(i )
g
(
x
)  E ( g )   g ( x) f ( x)

i 1
N 
x
And even use them to find a maximum:
xˆ  arg max[ f ( x(i ) )]
x( i )
65
Monte Carlo Example
, X n be i.i.d. N  0,1, find E ( X i 4 )  ?

X1 ,

Solution:

E( X i 4 )   x4
-

1
x2
4!
24
exp( )dx 

3
2
8
4/2 4
2
2 ( )!
2
Use Monte Carlo method to approximation
> x <- rnorm(100000) # 100000 samples from N(0,1)
> x <- x^4
> mean(x)
[1] 3.034175
66
Exercises
Write your own programs similar to those
examples presented in this talk.
 Write programs for those examples
mentioned at the reference web pages.
 Write programs for the other examples that
you know.

67