Bayesian Belief Networks (Bayes Nets)

Interesting because:
- The Naive Bayes assumption of full conditional independence is too restrictive
- But inference is intractable without some such assumptions...
- Bayesian belief networks describe conditional independence among subsets of variables
  → this allows combining prior knowledge about (in)dependencies among variables with observed training data

Conditional Probability and the Joint Distribution

Suppose we have a bag with four balls: 3 red and 1 blue (r, r, r, b).
Drawing a ball is the event B: B1 is the first draw, B2 the second (without replacement).

  P(B1 = r) = 3/4 ;  P(B1 = b) = 1/4

Conditional probability P(B2 | B1):

             B1 = r   B1 = b
  B2 = r      2/3       1
  B2 = b      1/3       0

Joint probability P(B1, B2) = P(B2 | B1) P(B1):

             B1 = r   B1 = b
  B2 = r      1/2      1/4
  B2 = b      1/4       0

Talking about Probabilities: the Joint Probability Table

Let's assume that drink influences your driving style, and that the condition of the road also influences your driving style.
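Returning briefly to the ball example above, its numbers can be checked mechanically. This is a minimal sketch using exact fractions; the probabilities are the ones derived above.

```python
from fractions import Fraction as F

# Bag with 3 red and 1 blue ball, drawn without replacement.
P_B1 = {"r": F(3, 4), "b": F(1, 4)}
# Conditional P(B2 | B1): remove the first ball, renormalize what's left.
P_B2_given_B1 = {"r": {"r": F(2, 3), "b": F(1, 3)},
                 "b": {"r": F(1, 1), "b": F(0, 1)}}

# Joint P(B1, B2) = P(B2 | B1) P(B1)
joint = {(b1, b2): P_B1[b1] * P_B2_given_B1[b1][b2]
         for b1 in "rb" for b2 in "rb"}

assert joint[("r", "r")] == F(1, 2)
assert joint[("r", "b")] == joint[("b", "r")] == F(1, 4)
# Marginalizing B1 out shows the second draw is also red with probability 3/4:
assert sum(joint[(b1, "r")] for b1 in "rb") == F(3, 4)
```

Note how marginalizing the joint recovers P(B2 = r) = 3/4, the same as P(B1 = r): without information about the first draw, the second draw has the same distribution.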
Now, drink ∈ {legal, illegal}; road ∈ {dry, wet}; style ∈ {excellent, fair, reckless}.
One can build a joint probability table like this:

  drink   road   style       prob.
  legal   dry    excellent   0.11
  legal   dry    fair        0.23
  legal   dry    reckless    0.05
  legal   wet    excellent   0.08
  legal   wet    fair        0.10
  legal   wet    reckless    0.01
  ill     dry    excellent   0.08
  ill     dry    fair        0.08
  ill     dry    reckless    0.13
  ill     wet    excellent   0.02
  ill     wet    fair        0.03
  ill     wet    reckless    0.08

This is the joint probability distribution P(drink, road, style); the probabilities sum to 1.

Talking about Probabilities: Conditioning

Now suppose we know that road = dry. Keeping only the road = dry rows of the table gives P(drink, style | road = dry)? Not quite: those entries sum to 0.68, not 1!!! We need to normalize to get probabilities:

  P(drink, style | road = dry) = Q(drink, style | road = dry) / Σ_d Σ_s Q(d, s | road = dry)

where Q denotes the unnormalized entries taken from the joint table.

Talking about Probabilities: Marginalization

Marginalizing out style:

  drink   road   prob.
  legal   dry    0.39
  legal   wet    0.19
  ill     dry    0.29
  ill     wet    0.13

  P(drink, road) = Σ_{s ∈ {e, f, r}} P(drink, road, style = s)

Probabilities sum to 1.

Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

  (∀ x_i, y_j, z_k)  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

More compactly, we write P(X | Y, Z) = P(X | Z).

Example: Thunder is conditionally independent of Rain, given Lightning:

  P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Naive Bayes uses conditional independence
to justify P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(X | Z) P(Y | Z)

Bayesian Belief Networks: Example

I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls

Network topology reflects "causal" knowledge:
- A burglar can set the alarm off
- An earthquake can set the alarm off
- The alarm can cause Mary to call
- The alarm can cause John to call

Conditional Probability Tables (CPTs)

[Figure: the burglary network. Burglary and Earthquake point to Alarm; Alarm points to JohnCalls and MaryCalls.]

  P(B) = .001        P(E) = .002

  B  E | P(A | B, E)
  T  T |    .95
  T  F |    .94
  F  T |    .29
  F  F |    .001

  A | P(J | A)       A | P(M | A)
  T |   .90          T |   .70
  F |   .05          F |   .01

Compactness: Joint Distributions

A joint distribution for a network with n Boolean nodes has 2^n − 1 independent entries. For the five variables B, E, A, J, M, that is 2^5 = 32 rows of value combinations... ok, 31 independent numbers, since the last is fixed because they must sum to 1.

Compactness: Conditional Probability Tables (CPTs)

A CPT for a Boolean variable X_i with k Boolean parents has 2^k rows, one for each combination of parent values. Each row requires one number p for X_i = true (the number for X_i = false is just 1 − p).

If each variable has no more than k parents, the complete network of n nodes requires O(n · 2^k) numbers, i.e., it grows linearly with n, vs. O(2^n) for the full joint distribution.

For the burglary net: 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31).

Bayesian Belief Network

[Figure: Storm and BusTourGroup point to Campfire; Storm points to Lightning; Lightning points to Thunder; Lightning and Campfire point to ForestFire.]

CPT for Campfire:

        S,B   S,¬B   ¬S,B   ¬S,¬B
   C    0.4   0.1    0.8    0.2
  ¬C    0.6   0.9    0.2    0.8

The network represents a set of conditional independence assertions:
- Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
- The graph is a directed acyclic graph (DAG).

Bayesian Belief Network

The network represents the joint probability distribution over all variables, e.g. P(Storm, BusTourGroup, ..., ForestFire). In general,

  P(y_1, ..., y_n) = Π_{i=1}^{n} P(y_i | Parents(Y_i))

where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph. So the joint distribution is fully defined by the graph plus the CPTs P(y_i | Parents(Y_i)).

Example

I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes, but I didn't feel any. Is there a burglar?

Choose:  argmax_{b ∈ {t,f}} P(j = t, m = f, a = t, b, e = f)

  P(j = t, m = f, a = t, b = t, e = f)
    = P(j = t | a = t) P(m = f | a = t) P(a = t | b = t, e = f) P(b = t) P(e = f)

vs.

  P(j = t, m = f, a = t, b = f, e = f)
    = P(j = t | a = t) P(m = f | a = t) P(a = t | b = f, e = f) P(b = f) P(e = f)

D-Separation: What Evidence Does to Dependence

- Chain X1 → X2 → X3: X1 is d-separated from X3 given hard evidence for X2.
- Common cause X1 ← X2 → X3: X1 is d-separated from X3 given hard evidence for X2.
- Common effect X1 → X2 ← X3: given hard evidence for X2, X1 becomes d-connected to X3, since X2 depends on both.

Semantics

Local semantics: each node is conditionally independent of its nondescendants given its parents.

Each node is also conditionally independent of all other nodes given its Markov blanket: its parents, its children, and its children's other parents.

[Figure: node X with parents U1 ... Um, children Y1 ... Yn, and children's parents Z1j ... Znj.]
Example: Car Diagnosis

- Initial evidence: the car won't start
- Testable variables (green), "broken, so fix it" variables (orange)
- Hidden variables (gray) ensure sparse structure and reduce parameters

[Figure: diagnosis network with nodes battery age, battery dead, fanbelt broken, alternator broken, no charging, battery flat, battery meter, lights, oil light, gas gauge, no oil, no gas, fuel line blocked, starter broken, dipstick, and car won't start.]

Inference

Consider the probability of an event A given some evidence E. Here we both condition and marginalize:

  P(A | E) = P(E, A) / P(E) = P(E | A) P(A) / P(E) = α P(E, A),  where α = 1 / P(E)

Bayesian Belief Networks Example

Consider the burglary network again (same CPTs as before).

Inference by Enumeration: Example

What is the probability distribution of Burglary given that John and Mary call, P(B | j = t, m = t)?
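This query can be answered by brute-force enumeration over the hidden variables. A minimal sketch using the CPT values from the burglary network above:

```python
# Enumeration for P(B | j = t, m = t) in the burglary network (slide CPTs).
P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.90, False: 0.05}   # P(JohnCalls = t | Alarm)
P_m = {True: 0.70, False: 0.01}   # P(MaryCalls = t | Alarm)

def unnormalized(b):
    # Sum the full joint P(b, e, a, j=t, m=t) over the hidden vars e and a.
    total = 0.0
    for e in (True, False):
        for a in (True, False):
            pa = P_a[(b, e)] if a else 1 - P_a[(b, e)]
            total += (P_e if e else 1 - P_e) * pa * P_j[a] * P_m[a]
    return (P_b if b else 1 - P_b) * total

scores = {b: unnormalized(b) for b in (True, False)}
alpha = 1 / sum(scores.values())
posterior = {b: alpha * s for b, s in scores.items()}
print(posterior[True])  # ≈ 0.284
```

The posterior P(B = t | j = t, m = t) ≈ 0.284 matches the classic result for this network.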
  P(B | j, m) = P(B, j, m) / P(j, m)
             = α P(B, j, m)
             = α Σ_e Σ_a P(B, e, a, j, m)
             = α Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)

Expanding the sum for B = t:

  if a = t, e = t :  P(b) P(e) P(a | b, e) P(j | a) P(m | a)
  if a = t, e = f :  P(b) P(e = f) P(a | b, e = f) P(j | a) P(m | a)
  if a = f, e = t :  P(b) P(e) P(a = f | b, e) P(j | a = f) P(m | a = f)
  if a = f, e = f :  P(b) P(e = f) P(a = f | b, e = f) P(j | a = f) P(m | a = f)

Inference by Enumeration: Example, cont'd... doing it smarter.
Rewrite the full joint entries as products of CPT entries, then move factors out of the sums they do not depend on:

  P(B | j, m) = α Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)
             = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)

(each underbraced factor comes from one CPT: B, E, A, J, M)

Inference: Irrelevant Variables

Consider the query P(JohnCalls | Burglary = true). What can be taken out of the summations? What cancels (goes to 1)?

  P(J | b) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(J | a) Σ_m P(m | a)

The sum over m is identically 1, so M is irrelevant to the query.

In general, Y is irrelevant unless Y ∈ Ancestors({X} ∪ E). Here X = JohnCalls, E = {Burglary}, and Ancestors({X} ∪ E) = {Alarm, Earthquake}, so MaryCalls is irrelevant.

Compact Conditional Distributions

- A CPT grows exponentially with the number of parents
- A CPT becomes infinite with a continuous-valued parent or child

Solution: canonical distributions that are defined compactly.

Deterministic nodes are the simplest case: X = f(Parents(X)) for some function f.
E.g., Boolean functions:  NorthAmerican = Canadian OR US OR Mexican
E.g., numerical relationships among continuous variables:
  ∂Level/∂t = inflow + precipitation − outflow − evaporation

Compact Conditional Distributions: Noisy-OR

Noisy-OR models multiple non-interacting causes:
- Parents U1 . . . Uk include all causes (one can add a leak node)
- Independent failure probability q_i for each cause alone
  ⇒ P(X | U1 . . . Uj, ¬Uj+1 . . .
¬Uk) = 1 − Π_{i=1}^{j} q_i

Example with q_Cold = 0.6, q_Flu = 0.2, q_Malaria = 0.1:

  Cold  Flu  Malaria   P(Fever)   P(¬Fever)
  F     F    F         0.0        1.0
  F     F    T         0.9        0.1
  F     T    F         0.8        0.2
  F     T    T         0.98       0.02  = 0.2 × 0.1
  T     F    F         0.4        0.6
  T     F    T         0.94       0.06  = 0.6 × 0.1
  T     T    F         0.88       0.12  = 0.6 × 0.2
  T     T    T         0.988      0.012 = 0.6 × 0.2 × 0.1

The number of parameters is linear in the number of parents.

Hybrid (Discrete + Continuous) Networks

Discrete variables (Subsidy? and Buys?); continuous variables (Harvest and Cost).

- Option 1: discretization — possibly large errors and large CPTs
- Option 2: probability density functions per discrete parent

Two cases to handle:
- Continuous variable with discrete + continuous parents (e.g., Cost)
- Discrete variable with continuous parents (e.g., Buys?)

[Figure: Subsidy? and Harvest point to Cost; Cost points to Buys?.]

Continuous Child Variables

We need one conditional density function for the child variable given its continuous parents, for each possible assignment of the discrete parents. In other words:
- the child node has a function f(parents_discrete, parents_continuous)
- the child incorporates a linear (ax + b) function of the continuous parents
- the function "forks" into different alternatives for the discrete parents

Continuous Child Variables: Linear Gaussian

The most common function f is the linear Gaussian model — one linear Gaussian per value of the discrete parent:

  P(Cost = c | Harvest = h, Subsidy? = true) = N(a_t h + b_t, σ_t)(c)
    = (1 / (σ_t √(2π))) exp( −(1/2) ((c − (a_t h + b_t)) / σ_t)² )

The mean of Cost varies linearly with Harvest; the variance is fixed. Similarly,

  P(Cost = c | Harvest = h, Subsidy? = false)
    = (1 / (σ_f √(2π))) exp( −(1/2) ((c − (a_f h + b_f)) / σ_f)² )

Linear variation is unreasonable over the full range, but works OK if the likely range of Harvest is narrow.

[Figure: surface plot of P(c | h, Subsidy? = true) over Harvest h and Cost c.]

An all-continuous network with linear Gaussian (LG) distributions ⇒ the full joint distribution is a multivariate Gaussian.

A discrete + continuous LG network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values.

Discrete Variable with Continuous Parents

The probability of Buys?
given Cost should be a "soft" threshold.

[Figure: P(Buys? = false | Cost = c) rising smoothly from 0 to 1 as c grows.]

The probit distribution uses the integral of the Gaussian:

  Φ(x) = ∫_{−∞}^{x} N(0, 1)(t) dt
  P(Buys? = true | Cost = c) = Φ((−c + µ)/σ)

The sigmoid (or logit) distribution, also used in neural networks:

  P(Buys? = true | Cost = c) = 1 / (1 + exp(−2 (−c + µ)/σ))

The sigmoid has a similar shape to the probit but much longer tails.

Inference in Bayesian Networks

How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
- The Bayes net contains all the information needed for this inference
- If only one variable has an unknown value, it is easy to infer
- In the general case, the problem is NP-hard

In practice, we can succeed in many cases:
- Exact inference methods work well for some network structures
- Monte Carlo methods "simulate" the network randomly to calculate approximate solutions

Approximate Inference: Sampling

One idea: given a query, if you can sample from the joint probability distribution, all you need to do is count.

Example (sprinkler network):
- sample P(Cloudy) = <0.5, 0.5>: choose true
- sample P(Sprinkler | Cloudy = true) = <0.1, 0.9>: choose false
- sample P(Rain | Cloudy = true) = <0.8, 0.2>: choose true
- sample P(WetGrass | Sprinkler = false, Rain = true) = <0.9, 0.1>: choose true
- return <true, false, true, true>

Approximate Inference: Prior Sampling

prior_sample takes in a Bayesian network bn specifying P(X1, X2, ..., Xn):

function prior_sample(bn)
    X = vector with the vars.
        of the BN  // <X1, X2, ..., Xn>, parents before children
    sample = an array of n elements
    for each var in X:
        sample[i] = random sample from P(var | Parents(var))
    return sample

Approximate Inference: Prior Sampling

[Tables of 10 and 29 sampled (c, s, r, w) tuples omitted.]

  P(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 (using the CPTs)
  P(t, f, t, t) ≈ 5/10  = 0.5  with 10 samples
  P(t, f, t, t) ≈ 12/29 = 0.41 with 29 samples
  P(t, f, t, t) ≈ 33/100 = 0.33 with 100 samples
  ... 0.325 with 1000 samples

Each sampling run will give you a slightly different number, but in general

  lim_{N→∞} N_PS(x1, x2, ..., xn) / N = P(x1, x2, ..., xn)

with N the total number of samples and N_PS(x1, ..., xn) the number of times the event (x1, ..., xn) appears in the samples. So,

  P(x1, x2, ..., xn) ≈ N_PS(x1, x2, ..., xn) / N

Approximate Inference: Rejection Sampling

What if you know something about the variables? You want to find P(X = x | e), where X is the unknown variable and e the evidence (all variables you know the value of).
If P̂(X | e) is the distribution that the algorithm returns, then

  P̂(X | e) = α N_PS(X, e) = N_PS(X, e) / N_PS(e) ≈ P(X, e) / P(e) = P(X | e)

What's the probability P(Rain | Sprinkler = t)?

Approximate Inference: Rejection Sampling

/* bn:  a Bayesian network
   X:   the query variable
   e:   observed values for the evidence variables
   num: the number of samples to generate */
function rejection_sample(bn, X, e, num)
    N = vector of counts for each value of X
    for 1 to num:
        sample = prior_sample(bn)
        if sample is consistent with e:
            N[x] = N[x] + 1   // where x is the value of X in sample
    return normalize(N)       // divide each N[i] by the sum of the N's

Problem: how many iterations does it take to get, say, 100 consistent samples?

Approximate Inference: Importance Sampling

For Bayesian networks we can use likelihood weighting: fix the values of the evidence variables and sample only the non-evidence variables.

But wait! We may end up generating events that are very unlikely, or even impossible. For example, P(Rain | Cloudy = t, WetGrass = t) can generate [c = t, s = f, r = f, w = t], but P(WetGrass = t | Sprinkler = f, Rain = f) = 0!!! We have to weight samples by their likelihood.

Example of Likelihood Weighting: P(Rain | Cloudy = t, WetGrass = t)

Let w be a weight assigned to the sample, initially 1.
- Cloudy = t because it is evidence. Therefore w = w × P(Cloudy = t) = 0.5
- Sprinkler is not evidence. Sample P(Sprinkler | Cloudy = t) = <0.1, 0.9>: pick f
- Rain is not evidence. Sample P(Rain | Cloudy = t) = <0.8, 0.2>: pick t
- WetGrass = t is evidence. Therefore w = w × P(WetGrass = t | Sprinkler = f, Rain = t) = 0.5 × 0.9 = 0.45

Return [t, f, t, t] with a weight of 0.45 and tally it under Rain = true.

Approximate Inference: Likelihood Weighting

/* bn:  a Bayesian network specifying P(X1, X2, . . .
, Xn)
   X:   the query variable
   e:   observed values for the evidence variables
   num: the number of samples to generate */
function likelihood_weighting(bn, X, e, num)
    W = vector of weighted counts for each value of X
    for 1 to num:
        sample, w = weighted_sample(bn, e)
        W[x] = W[x] + w   // where x is the value of X in sample
    return normalize(W)   // divide each W[i] by the sum of the W's

function weighted_sample(bn, e)
    w = 1
    sample = a vector with n elems. initialized with the values from e, rest unknown
    X = vector with the vars. of the BN  // <X1, X2, ..., Xn>
    for each var in X:
        if var is an evidence variable with value x in e:
            w = w * P(var = x | Parents(var))
        else:
            i = index of var in X
            sample[i] = a random sample from P(var | Parents(var))
    return (sample, w)

Inference by Markov Chain Monte Carlo

Definitions:
- Monte Carlo (as in Monaco): estimation by random simulation
- Markov chain: each new state comes from a random change to the previous one, P(X_t | X_{t−1})

Gibbs Sampling

- start with an arbitrary state (evidence is fixed)
- generate the next state by sampling one non-evidence variable
- pick the next non-evidence variable and sample it using the rest of the configuration

Note that sampling a non-evidence variable is conditioned on its Markov blanket:

  P(x_i | MarkovBlanket(X_i)) = α P(x_i | Parents(X_i)) × Π_{Y_j ∈ Children(X_i)} P(y_j | Parents(Y_j))

Gibbs Sampling Example: P(Rain | Sprinkler = t, WetGrass = t)

- start with [t, t, f, t] (picked at random, with the evidence fixed)
- Cloudy is sampled given its MB: P(Cloudy | Sprinkler = true, Rain = false). Choose false.
- Rain is sampled given its MB: P(Rain | Cloudy = false, Sprinkler = true, WetGrass = true). Pick true.
- current state = [f, t, t, t]. Go back to sampling Cloudy...

Say the process visited 20 states where Rain = t and 60 where Rain = f. Then

  P(Rain | Sprinkler = t, WetGrass = t) = α <20, 60> = <20/80, 60/80> = <0.25, 0.75>

Gibbs Sampling Algorithm

/* bn:  a Bayesian network specifying P(X1, X2, . . .
, Xn)
   X:   the query variable
   e:   observed values for evidence variables
   num: the number of samples to generate */
function gibbs_ask(bn, X, e, num)
    N = vector of counts for each value of X
    Z = non-evidence vars in bn
    sample = initialized from e; non-evidence vars set at random
    for 1 to num:
        for var in Z:
            i = index of var in Z
            sample[i] = sample from P(var | mb(var))
            N[x] = N[x] + 1   // where x is the value of X in sample
    return normalize(N)       // divide each N[i] by the sum of the N's

Learning in Bayesian Networks

Several variants of this learning task:
- The network structure might be known or unknown
- The training examples might provide values of all network variables, or just some

If the structure is known and we observe all variables, then it's as easy as training a Naive Bayes classifier. But what if the structure is known and only some variables can be observed?

Learning Parameters in Bayes Nets: Expectation Maximization (EM)

- We have a network with a node whose parameters we do not know, but we know how it should behave (its distribution)
- We have observations that depend on the values (parameters) of that node

EM Algorithm: The Case of Two Conditions with Success or Failure

Imagine there are two coins, A and B. One is more likely to come up heads, the other more likely to come up tails. You pick one at random and toss it. Which one was it?

A Tutorial¹: well, let's do this five times:
- pick a coin randomly
- toss it 10 times
- record the number of heads and tails

Then compute the average number of heads for each coin.

¹ Cuong B. Do and Serafim Batzoglou (2008)

That was easy: Coin A yields heads 80% of the time, Coin B 45% of the time.

But what if we are given ONLY the results of our coin tosses? Can we guess the percentage of heads that each coin yields? Can we guess which coin was picked for each set of 10 coin tosses?

One way to think about this is:
1. Assign random averages to both coins
2.
For each of the 5 rounds of 10 coin tosses:
   - check the percentage of heads
   - find the probability of it coming from each coin
   - compute the expected number of heads: using that probability as a weight, multiply it by the number of heads
   - record those numbers
   - re-compute new means for coins A and B
3. With these new means, go back to step 2.

How Do Coin Tosses Behave?

Coin tosses follow the binomial distribution.² ³

² Math Spoken Here: Binomial Distribution
³ PennState Eberly College of Science, Statistics Online

So, the five rounds of 10 coin tosses yield:

  1:  H T T T H H T H T H
  2:  H H H H H T H H H H
  3:  H T H H H H H T H H
  4:  H T H T T T H H T T
  5:  T H H H T H H H T H

Let's take the first round: 5/10 heads and 5/10 tails. We compute the likelihood that it was coin "A" and that it was coin "B" using the binomial distribution with mean probability θ on n trials with k successes:

  p(k) = C(n, k) θ^k (1 − θ)^(n−k)

where θ_i is the average probability of heads for coin i (initially assigned at random).

E-Step

So, we have θ_A = 0.6 and θ_B = 0.5. For the first round (5 heads, 5 tails), comparing the likelihoods (the binomial coefficient is common to both and cancels when we normalize):

  likelihood of "A" = θ_A^5 (1 − θ_A)^5 = 0.0007962624
  likelihood of "B" = θ_B^5 (1 − θ_B)^5 = 0.0009765625

Normalizing, we get probabilities 0.45 and 0.55.
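The round-1 likelihoods just computed, and the full EM loop, can be checked in a few lines. The heads counts are taken from the five rounds of tosses above; the converged values match the Do and Batzoglou tutorial (roughly 0.80 and 0.52).

```python
# EM for the two-coin tutorial. heads[i] = number of heads in round i (of 10).
heads = [5, 9, 8, 4, 7]
n = 10
theta_a, theta_b = 0.6, 0.5     # the initial guesses from the slides

def em_step(theta_a, theta_b):
    # E-step: responsibility of coin A for each round (binomial coeffs cancel).
    ha = ta = hb = tb = 0.0      # expected heads/tosses attributed to A and B
    for h in heads:
        like_a = theta_a**h * (1 - theta_a)**(n - h)
        like_b = theta_b**h * (1 - theta_b)**(n - h)
        w_a = like_a / (like_a + like_b)
        ha += w_a * h
        ta += w_a * n
        hb += (1 - w_a) * h
        tb += (1 - w_a) * n
    # M-step: new maximum likelihood estimates, expected H / (H + T).
    return ha / ta, hb / tb

for _ in range(20):
    theta_a, theta_b = em_step(theta_a, theta_b)
print(round(theta_a, 2), round(theta_b, 2))  # ≈ 0.80 and 0.52
```

The first iteration already moves the estimates to about 0.71 and 0.58, and the loop then converges quickly.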
M-Step

Recap: for round 1, P(Coin = A) = 0.45 and P(Coin = B) = 0.55.

Estimating the likely number of heads and tails in round 1:
- "A": H = 0.45 × 5 heads ≈ 2.2 heads; T = 0.45 × 5 tails ≈ 2.2 tails
- "B": H = 0.55 × 5 heads ≈ 2.8 heads; T = 0.55 × 5 tails ≈ 2.8 tails

Do the same for all five rounds. Then compute the new probability of heads for each coin, H / (H + T), summing the expected counts over all rounds. That gives the new maximized parameter θ for each coin.

Repeat the E-step and M-step until convergence.

Expectation Maximization (EM)

When to use:
- Data is only partially observable
- Unsupervised clustering (target value unobservable)
- Supervised learning (some instance attributes unobservable)

Some uses:
- Train Bayesian belief networks
- Unsupervised clustering (AUTOCLASS)
- Learning Hidden Markov Models

Generating Data from a Mixture of k Gaussians

[Figure: density p(x) over x formed by overlapping Gaussians.]

Each instance x is generated by:
1. Choosing one of the k Gaussians with uniform probability
2. Generating an instance at random according to that Gaussian

EM for Estimating k Means

Given:
- Instances from X generated by a mixture of k Gaussian distributions
- Unknown means <µ1, ..., µk> of the k Gaussians
- We don't know which instance x_i was generated by which Gaussian

Determine:
- Maximum likelihood estimates of <µ1, ..., µk>

Think of the full description of each instance as y_i = <x_i, z_i1, z_i2>, where
- z_ij is 1 if x_i was generated by the j-th Gaussian
- x_i is observable
- z_ij is unobservable

EM Algorithm: pick a random initial h = <µ1, µ2>, then iterate.

E step: Calculate the expected value E[z_ij] of each hidden variable z_ij, assuming the current hypothesis h = <µ1, µ2> holds.
  E[z_ij] = p(x = x_i | µ = µ_j) / Σ_{n=1}^{2} p(x = x_i | µ = µ_n)
          = exp(−(x_i − µ_j)² / (2σ²)) / Σ_{n=1}^{2} exp(−(x_i − µ_n)² / (2σ²))

M step: Calculate a new maximum likelihood hypothesis h' = <µ'1, µ'2>, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated above. Then replace h = <µ1, µ2> by h' = <µ'1, µ'2>:

  µ_j ← Σ_{i=1}^{m} E[z_ij] x_i / Σ_{i=1}^{m} E[z_ij]

Estimating k Means: Example (Mixture of Gaussians)

Say we know that on cloudy days the temperature is generally lower than on sunny days. Both cloudy and sunny days can experience variations of around 10 degrees. Now, given the following temperatures for 10 random days:

  Days = {70, 62, 89, 54, 97, 75, 82, 56, 32, 78}

what's the mean temperature for sunny and for cloudy days?

Set up:
- We have sunny and cloudy days ⇒ k = 2
- Assign random means to sunny and cloudy: 80 and 55 respectively
- The standard deviation for each day is 10 (given).
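Given this setup, the E-step/M-step loop is only a few lines (σ = 10 is fixed; 80 and 55 are the initial guesses above):

```python
import math

# EM for the two-Gaussian temperature example (σ = 10 for both components).
days = [70, 62, 89, 54, 97, 75, 82, 56, 32, 78]
mu_sunny, mu_cloudy = 80.0, 55.0
sigma = 10.0

def em_step(mu1, mu2):
    # E-step: responsibility of the "sunny" Gaussian for each day.
    w = []
    for x in days:
        a = math.exp(-((x - mu1) ** 2) / (2 * sigma**2))
        b = math.exp(-((x - mu2) ** 2) / (2 * sigma**2))
        w.append(a / (a + b))
    # M-step: new means, weighted by the responsibilities.
    new1 = sum(wi * x for wi, x in zip(w, days)) / sum(w)
    new2 = sum((1 - wi) * x for wi, x in zip(w, days)) / sum(1 - wi for wi in w)
    return new1, new2

while True:
    new_s, new_c = em_step(mu_sunny, mu_cloudy)
    done = abs(new_s - mu_sunny) < 0.1 and abs(new_c - mu_cloudy) < 0.1
    mu_sunny, mu_cloudy = new_s, new_c
    if done:
        break
print(round(mu_sunny, 1), round(mu_cloudy, 1))
```

The loop settles near 81.8 for sunny and 52.9 for cloudy, close to the hand-computed iterations shown next (small differences come from the slides rounding the responsibilities to two decimals).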
Estimating k Means: EM

Data: {70, 62, 89, 54, 97, 75, 82, 56, 32, 78}

E step (responsibilities):
  Sunny (80):  {0.65, 0.20, 0.99, 0.03, 0.99, 0.86, 0.97, 0.05, 0.00, 0.93}
  Cloudy (55): {0.34, 0.79, 0.00, 0.96, 0.00, 0.13, 0.02, 0.94, 0.99, 0.06}

M step:
  Sunny:  (45.59 + 12.51 + 88.58 + 1.78 + 96.93 + 65.02 + 79.87 + 2.99 + 0.00 + 72.73) / 5.70 = 81.75
  Cloudy: (24.4 + 49.48 + 0.41 + 52.21 + 0.06 + 9.97 + 2.12 + 53.00 + 31.99 + 5.26) / 4.29 = 53.35

Iterate until the difference between successive means is less than 0.1:
  Iteration 2 — Sunny: 81.79, Cloudy: 53.00
  Iteration 3 — Sunny: 81.77, Cloudy: 52.91

EM Algorithm

Converges to a local maximum likelihood h and provides estimates of the hidden variables z_ij. In fact, it finds a local maximum of E[ln P(Y | h)], where
- Y is the complete (observable plus unobservable) data
- the expected value is taken over the possible values of the unobserved variables in Y

General EM Problem

Given:
- Observed data X = {x1, ..., xm}
- Unobserved data Z = {z1, ..., zm}
- A parameterized probability distribution P(Y | h), where
  - Y = {y1, ..., ym} is the full data, y_i = x_i ∪ z_i
  - h are the parameters

Determine:
- h that (locally) maximizes E[ln P(Y | h)]

Many uses:
- Train Bayesian belief networks
- Unsupervised clustering (e.g., k means)
- Hidden Markov Models

General EM Method

Define a likelihood function Q(h' | h) which calculates over Y = X ∪ Z, using the observed X and the current parameters h to estimate Z:

  Q(h' | h) ← E[ln P(Y | h') | h, X]

EM Algorithm:

Estimation (E) step: Calculate Q(h' | h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:

  Q(h' | h) ← E[ln P(Y | h') | h, X]

Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function:
  h ← argmax_{h'} Q(h' | h)

Gibbs Sampling for Gaussian Mixtures

Given some estimates of µ1, µ2, repeat until convergence:
- for i = 1 to N: sample π_i according to the E-step of the EM algorithm
- update

    µ̂_j = Σ_{i=1}^{N} (1 − π_i^t) x_i / Σ_{i=1}^{N} (1 − π_i^t)

- sample from the Gaussians with these estimates and produce new means

Learning the Net's Structure

When the structure is unknown...
- Algorithms use greedy search to add/subtract edges and nodes
- Active research topic

Summary: Bayesian Belief Networks

- Combine prior knowledge with observed data
- The impact of prior knowledge (when correct!) is to lower the sample complexity

Active research area:
- Extend from Boolean to real-valued variables
- Parameterized distributions instead of tables
- Extend to first-order instead of propositional systems
- More effective inference methods
- ...
© Copyright 2024