
Probabilistic Reasoning I
Chapter 14.1–3, Bayesian Networks
Warm-up: People are bad at probabilities
Suppose that you are worried that you might have a rare disease. You decide
to get tested, and suppose that the testing methods for this disease are
correct 99 percent of the time (in other words, if you have the disease, it
shows that you do with 99 percent probability, and if you don’t have the
disease, it shows that you do not with 99 percent probability). Suppose this
disease is actually quite rare, occurring randomly in the general population
in only one of every 10,000 people.
If your test results come back positive, what are your chances that you
actually have the disease?
Do you think it is approximately: (a) .99, (b) .90, (c) .10, or (d) .01?
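If you want to check your intuition afterwards, the answer follows from a one-line application of Bayes' rule, using exactly the numbers stated above. A minimal sketch in Python:

```python
# Bayes' rule check for the warm-up: P(disease | positive test).
p_disease = 1 / 10_000          # prior: the disease occurs in 1 of every 10,000 people
p_pos_given_disease = 0.99      # test is correct 99% of the time if you have the disease
p_pos_given_healthy = 0.01      # false-positive rate: 1% if you do not have the disease

# Total probability of a positive test, then Bayes' rule.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 4))   # ~0.0098, i.e. roughly 0.01: answer (d)
```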
Outline
♦ Syntax of Bayesian Networks
♦ Semantics of Bayesian Networks
♦ Efficient representation of conditional distributions
Example
[Network: Cavity → Toothache, Cavity → Catch; Weather unconnected]
Weather is independent of the other variables
Toothache and Catch are conditionally independent given Cavity
Topology of network encodes conditional independence assertions
Each node is annotated with probability information
The combination of the topology and the conditional distributions suffices
to specify the full joint distribution for all the variables
Bayesian networks
A Bayesian network is a directed graph in which each node
is annotated with quantitative probability information.
Syntax:
a set of nodes, one per random variable
a directed, acyclic graph (link ≈ “directly influences”)
a conditional distribution for each node given its parents, P(Xi | Parents(Xi)),
that quantifies the effects of the parents on Xi
Example: I’m at work, neighbor John calls to say my alarm is ringing, but
neighbor Mary doesn’t call. Sometimes it’s set off by minor earthquakes. Is
there a burglar?
Given the evidence of who has or has not called, we would like to estimate
the probability of a burglary
Random variables:
Causal knowledge:
Example
I’m at work, neighbor John calls to say my alarm is ringing, but neighbor
Mary doesn’t call. Sometimes it’s set off by minor earthquakes. Is there a
burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects “causal” knowledge:
– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call
Laziness and ignorance: we leave out less salient causes such as MaryListeningMusic, JohnTelefonRings, ConfusingJohn
Conditional probability table
[Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls]

P(B) = .001        P(E) = .002

B   E  | P(A|B,E)
T   T  |   .95
T   F  |   .94
F   T  |   .29
F   F  |   .001

A  | P(J|A)        A  | P(M|A)
T  |  .90          T  |  .70
F  |  .05          F  |  .01
Summarising uncertainty: e.g., the value 0.90 for P(J|A) also absorbs the uncertainty about JohnTelefonRings and ConfusingJohn
How many probabilities do we need?
14.2 Semantics of Bayesian Networks
What does a Bayesian network mean?
representation of the joint probability distribution (global semantics)
collection of conditional independence statements (local semantics)
Global semantics defines the full joint distribution
as the product of the local conditional distributions:
P(x1, …, xn) = ∏ i=1..n P(xi | parents(Xi))

[Network: B → A ← E; A → J, A → M]
Recall: a generic entry in the joint distribution is the probability of a conjunction of particular assignments to each variable, P(X1 = x1 ∧ … ∧ Xn = xn).
e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = ?
Global semantics
“Global” semantics defines the full joint distribution
as the product of the local conditional distributions:
P(x1, …, xn) = ∏ i=1..n P(xi | parents(Xi))

[Network: B → A ← E; A → J, A → M]

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
    = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
    = 0.90 × 0.70 × 0.001 × 0.999 × 0.998
    ≈ 0.00063
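A minimal Python sketch of this product, using the CPT numbers from the table above (the dictionary layout is just one convenient encoding, not a prescribed one):

```python
# CPTs of the burglary network, encoded as plain dictionaries.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(j | A)
P_M = {True: 0.70, False: 0.01}                       # P(m | A)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as the product of the local conditionals."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_j = P_J[a] if j else 1 - P_J[a]
    p_m = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * p_a * p_j * p_m

# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) from the slide:
print(joint(b=False, e=False, a=True, j=True, m=True))   # ≈ 0.00063
```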
Constructing Bayesian networks
Need a method such that a series of locally testable assertions of
conditional independence guarantees the required global semantics
1. Choose an ordering of variables X1, . . . , Xn
2. For i = 1 to n
add Xi to the network
select parents from X1, …, Xi−1 such that
P(Xi | Parents(Xi)) = P(Xi | X1, …, Xi−1)
This choice of parents guarantees the global semantics (equivalence between the
full joint distribution and the Bayesian network):
P(X1, …, Xn) = ∏ i=1..n P(Xi | X1, …, Xi−1)      (chain rule)
             = ∏ i=1..n P(Xi | Parents(Xi))       (by construction)
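A sketch of this construction loop in Python. The function `is_independent` stands in for a conditional-independence test (in practice supplied by domain knowledge rather than computed), so this is an illustration of the procedure, not a complete algorithm:

```python
def build_network(ordering, is_independent):
    """Greedy parent selection following the construction procedure above.

    `is_independent(x, p, rest)` is a hypothetical oracle answering whether
    variable x is independent of p given the variables in `rest`.
    """
    parents = {}
    for i, x in enumerate(ordering):
        chosen = list(ordering[:i])            # start with all predecessors
        for p in ordering[:i]:
            rest = [q for q in chosen if q != p]
            if is_independent(x, p, rest):     # p adds nothing once `rest` is known
                chosen = rest
        parents[x] = chosen
    return parents

# With the causal ordering B, E, A, J, M and a suitable oracle, this would
# return {B: [], E: [], A: [B, E], J: [A], M: [A]}.
```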
Example
Suppose we choose the ordering M , J, A, B, E
[Nodes so far: MaryCalls, JohnCalls]
P(J|M) = P(J)?
Example
Suppose we choose the ordering M , J, A, B, E
[Nodes so far: MaryCalls, JohnCalls, Alarm]
P(J|M) = P(J)? No
P(A|J,M) = P(A|J)?  P(A|J,M) = P(A)?
Example
Suppose we choose the ordering M , J, A, B, E
[Nodes so far: MaryCalls, JohnCalls, Alarm, Burglary]
P(J|M) = P(J)? No
P(A|J,M) = P(A|J)?  P(A|J,M) = P(A)? No
P(B|A,J,M) = P(B|A)?
P(B|A,J,M) = P(B)?
Example
Suppose we choose the ordering M , J, A, B, E
[Nodes so far: MaryCalls, JohnCalls, Alarm, Burglary, Earthquake]
P(J|M) = P(J)? No
P(A|J,M) = P(A|J)?  P(A|J,M) = P(A)? No
P(B|A,J,M) = P(B|A)? Yes
P(B|A,J,M) = P(B)? No
P(E|B,A,J,M) = P(E|A)?
P(E|B,A,J,M) = P(E|A,B)?
Example
Suppose we choose the ordering M , J, A, B, E
[Nodes: MaryCalls, JohnCalls, Alarm, Burglary, Earthquake]
P(J|M) = P(J)? No
P(A|J,M) = P(A|J)?  P(A|J,M) = P(A)? No
P(B|A,J,M) = P(B|A)? Yes
P(B|A,J,M) = P(B)? No
P(E|B,A,J,M) = P(E|A)? No
P(E|B,A,J,M) = P(E|A,B)? Yes
Compactness and node ordering
[Network built with ordering M, J, A, B, E: MaryCalls, JohnCalls, Alarm, Burglary, Earthquake]
Assessing conditional probabilities is hard in noncausal directions
(e.g., P(Earthquake | Burglary, Alarm))
Any order will work, but the resulting network will be more compact if the
variables are ordered such that causes precede effects
Network is less compact:
For burglary net:
For full joint distribution:
Compactness
Network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed
For burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers
For the full joint distribution: 2^5 − 1 = 31
What about the node ordering: M,J,E,B,A? How many probabilities are
required?
A Conditional Probability Table (CPT) for a Boolean Xi with k
Boolean parents has 2^k rows, one for each combination of parent values
Each row requires one number p for Xi = true
(the number for Xi = false is just 1 − p)

[Burglary network: B → A ← E; A → J, A → M]

If each variable has no more than k parents,
the complete network requires O(n · 2^k) numbers
I.e., grows linearly with n, vs. O(2^n) for the full joint distribution (k ≪ n)
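As a quick check of these counts, a small helper that sums the CPT entries for Boolean nodes; the parent counts below are those of the two orderings discussed above:

```python
def num_parameters(parent_counts):
    """Total CPT numbers for Boolean nodes: one number per row, 2**k rows per node."""
    return sum(2 ** k for k in parent_counts)

# Causal ordering B, E, A, J, M: parent counts 0, 0, 2, 1, 1  ->  1 + 1 + 4 + 2 + 2
print(num_parameters([0, 0, 2, 1, 1]))   # 10
# Ordering M, J, A, B, E: parent counts 0, 1, 2, 1, 2        ->  1 + 2 + 4 + 2 + 4
print(num_parameters([0, 1, 2, 1, 2]))   # 13
# Full joint distribution over 5 Boolean variables:
print(2 ** 5 - 1)                        # 31
```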
Local semantics
Local semantics: each node is conditionally independent
of its non-descendants given its parents
[Generic network: parents U1 … Um of node X, children Y1 … Yn, children’s other parents Z1j … Znj]
JohnCalls is independent of:
Markov blanket
Each node is conditionally independent of all others given its
Markov blanket: parents + children + children’s parents
[Generic network: parents U1 … Um of node X, children Y1 … Yn, children’s other parents Z1j … Znj]
Burglary is independent of:
Example: Car diagnosis
Initial evidence: car won’t start
Testable variables (green), “broken, so fix it” variables (orange)
Hidden variables (gray) ensure sparse structure, reduce parameters
[Network nodes: battery age, battery dead, battery meter, lights, fanbelt broken, alternator broken, no charging, battery flat, oil light, no oil, gas gauge, no gas, car won’t start, fuel line blocked, starter broken, dipstick]
Example: Car insurance
[Network nodes: SocioEcon, Age, GoodStudent, ExtraCar, Mileage, RiskAversion, VehicleYear, SeniorTrain, MakeModel, DrivingSkill, DrivingHist, Antilock, DrivQuality, Airbag, Ruggedness, CarValue, HomeBase, AntiTheft, Accident, Theft, OwnDamage, Cushioning, MedicalCost, OtherCost, LiabilityCost, OwnCost, PropertyCost]
14.3 Efficient representation of cond. distribution
CPT grows exponentially with number of parents
A CPT cannot even be written down for a continuous-valued parent or child (it would need infinitely many entries)
Deterministic nodes are the simplest case: the node’s value is specified exactly by
the values of its parents,
X = f(Parents(X)) for some function f
E.g., logical relationships
NorthAmerican ⇔ Canadian ∨ US ∨ Mexican
E.g., numerical relationships among continuous variables
∂Level/∂t = inflow + precipitation − outflow − evaporation
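As a minimal sketch of the logical case above, a deterministic node is just a function of its parents:

```python
def north_american(canadian: bool, us: bool, mexican: bool) -> bool:
    """Deterministic node: NorthAmerican ⇔ Canadian ∨ US ∨ Mexican."""
    return canadian or us or mexican

print(north_american(False, True, False))   # True
```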
Efficient representation of cond. distribution
Noisy-OR distributions model multiple noninteracting causes
Example: Fever ⇔ Cold ∨ Flu ∨ Malaria
Allows for uncertainty about the ability of each parent to cause the child to
be true (the causal relationship between parent and child may be inhibited)
Example: a patient could have a Cold, but not exhibit a Fever
Assumptions:
1) Parents U1 . . . Uk include all causes (can add leak node)
2) Independent failure probability qi for each cause alone
⇒ P(X | U1 … Uj, ¬Uj+1 … ¬Uk) = 1 − ∏ i=1..j qi
Efficient representation of cond. distribution
Fever is false if and only if all of its true parents are inhibited. The probability of
this is the product of the inhibition probabilities qi for each parent that is true
q_cold = P(¬fever | cold, ¬flu, ¬malaria) = 0.6
q_flu = P(¬fever | ¬cold, flu, ¬malaria) = 0.2
q_malaria = P(¬fever | ¬cold, ¬flu, malaria) = 0.1
Cold  Flu  Malaria | P(Fever)   P(¬Fever)
F     F    F       | 0.0        1.0
F     F    T       | 0.9        0.1
F     T    F       | 0.8        0.2
F     T    T       | 0.98       0.02 = 0.2 × 0.1
T     F    F       | 0.4        0.6
T     F    T       | 0.94       0.06 = 0.6 × 0.1
T     T    F       | 0.88       0.12 = 0.6 × 0.2
T     T    T       | 0.988      0.012 = 0.6 × 0.2 × 0.1
Number of parameters linear in number of parents
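A short sketch that regenerates this table from the three inhibition probabilities above:

```python
from itertools import product

# Inhibition probabilities from the slide.
q = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}

def p_fever(causes):
    """Noisy-OR: Fever is false only if every cause that is present is inhibited."""
    inhibit = 1.0
    for cause, present in causes.items():
        if present:
            inhibit *= q[cause]
    return 1.0 - inhibit

for cold, flu, malaria in product([False, True], repeat=3):
    row = {"Cold": cold, "Flu": flu, "Malaria": malaria}
    print(row, round(p_fever(row), 3))   # reproduces the P(Fever) column
```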
Hybrid (discrete+continuous) networks
Discrete (Subsidy? and Buys?); continuous (Harvest and Cost)
[Network: Subsidy? → Cost ← Harvest; Cost → Buys?]
Option 1: discretization—possibly large errors, large CPTs
Option 2: finitely parameterized canonical families
Option 3: define conditional distribution with a collection of instances
1) Continuous variable, discrete+continuous parents (e.g., Cost)
2) Discrete variable, continuous parents (e.g., Buys?)
Continuous child variables
Need one conditional density function for child variable given continuous
parents, for each possible assignment to discrete parents
Most common is the linear Gaussian model, e.g.,:
P(Cost = c | Harvest = h, Subsidy? = true) = N(a_t h + b_t, σ_t)(c)
    = (1 / (σ_t √(2π))) · exp( −(1/2) ((c − (a_t h + b_t)) / σ_t)² )
Cost decreases as supply increases
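A direct transcription of the density above as a sketch; the parameters a_t, b_t and σ_t are illustrative placeholders (a negative a_t encodes “cost decreases as supply increases”), not values from the slides:

```python
import math

def linear_gaussian(c, h, a_t=-1.0, b_t=10.0, sigma_t=1.0):
    """Density of Cost = c given Harvest = h and Subsidy? = true:
    a Gaussian in c with mean a_t*h + b_t and standard deviation sigma_t.
    Parameter values here are illustrative only."""
    mean = a_t * h + b_t
    z = (c - mean) / sigma_t
    return math.exp(-0.5 * z * z) / (sigma_t * math.sqrt(2.0 * math.pi))

print(linear_gaussian(c=7.0, h=3.0))   # density of cost 7 when harvest is 3
```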
Discrete variable with continuous parents
Probability of Buys? given Cost should be a “soft” threshold:
[Plot: P(Buys? = false | Cost = c) as a soft threshold in Cost c, for c from 0 to 12]
Probit distribution uses integral of Gaussian:
Φ(x) = ∫ from −∞ to x of N(0, 1)(t) dt
P(Buys? = true | Cost = c) = Φ((−c + µ)/σ)
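The standard normal CDF can be written with the error function, so the probit model is a few lines; the values of µ and σ below are illustrative, not from the slides:

```python
import math

def phi(x):
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_buys_probit(c, mu=6.0, sigma=1.0):
    """Probit model: P(Buys? = true | Cost = c) = Phi((-c + mu) / sigma)."""
    return phi((-c + mu) / sigma)

print(p_buys_probit(4.0), p_buys_probit(8.0))   # cheap: likely to buy; expensive: unlikely
```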
Why the probit?
1. It’s sort of the right shape
2. Can view as hard threshold whose location is subject to noise
[Diagram: Cost, plus Gaussian noise, feeding a hard threshold that determines Buys?]
Discrete variable contd.
Sigmoid (or logit) distribution also used in neural networks:
P(Buys? = true | Cost = c) = 1 / (1 + exp(−2(−c + µ)/σ))
Sigmoid has similar shape to probit but much longer tails:
[Plot: logit curve for P(Buys? = false | Cost = c), for c from 0 to 12]
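For comparison with the probit sketch above, the logit model transcribed directly from the formula (again with illustrative µ and σ):

```python
import math

def p_buys_logit(c, mu=6.0, sigma=1.0):
    """Logit model: P(Buys? = true | Cost = c) = 1 / (1 + exp(-2*(-c + mu)/sigma))."""
    return 1.0 / (1.0 + math.exp(-2.0 * (-c + mu) / sigma))

print(p_buys_logit(4.0), p_buys_logit(8.0))   # same qualitative behaviour, longer tails
```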
Summary
A Bayesian network is a directed acyclic graph whose nodes correspond to
random variables; each node has a conditional distribution for the node,
given its parents
Bayesian networks provide a concise way to represent conditional independence relationships in the domain
A Bayesian network specifies a full joint distribution; each joint entry is
defined as the product of the corresponding entries in the local conditional
distributions. A Bayesian network is often exponentially smaller than an
explicitly enumerated joint distribution.
Hybrid Bayesian networks, which include both discrete and continuous variables, use a variety of canonical distributions.
Detecting pregnancy based on three tests
The first is a scanning test which has a false positive of 1% and a false
negative of 10%.
The second is a blood test, which detects progesterone with a false positive
of 10% and a false negative of 30%.
The third test is a urine test, which also detects progesterone with a false
positive of 10% and a false negative of 20%.
The probability of a detectable progesterone level is 90% given pregnancy,
and 1% given no pregnancy.
The probability of pregnancy after two positive tests is 0.364%.
Construct the Bayesian network and specify the conditional probability tables.
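As a starting point, one plausible reading of the structure is Pregnancy → Scan, Pregnancy → Progesterone, and Progesterone → BloodTest, UrineTest; the sketch below encodes only the numbers stated above (the prior on pregnancy is not given and is left as a placeholder), and is not presented as the unique answer:

```python
# One possible encoding of the exercise, as a sketch.
# Assumed structure: Pregnancy -> Scan, Pregnancy -> Progesterone,
#                    Progesterone -> BloodTest, Progesterone -> UrineTest.
P_scan_pos  = {True: 0.90, False: 0.01}   # P(scan+ | pregnancy): 10% false neg, 1% false pos
P_progest   = {True: 0.90, False: 0.01}   # P(detectable progesterone | pregnancy)
P_blood_pos = {True: 0.70, False: 0.10}   # P(blood+ | progesterone): 30% false neg, 10% false pos
P_urine_pos = {True: 0.80, False: 0.10}   # P(urine+ | progesterone): 20% false neg, 10% false pos
P_pregnant  = None                        # prior probability of pregnancy: not stated in the exercise
```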