Reasoning with Uncertainty

Topics:
- Why is uncertainty important?
- How do we represent and reason with uncertain knowledge?

Progress in research:
- 1980s: rule-based representation (MYCIN, Prospector)
- 1990s to present: network models (Bayesian networks – Munin)
- latest developments: integration of network models and logic (Markov logic)

Why important: biomedical

Have you got Mexican Flu?

M: Mexican flu; C: chills; S: sore throat

P(m, c, s) = 0.009215       P(¬m, c, s) = 9.9 · 10⁻⁶
P(m, ¬c, s) = 0.000485      P(¬m, ¬c, s) = 0.0098901
P(m, c, ¬s) = 0.000285      P(¬m, c, ¬s) = 0.0009801
P(m, ¬c, ¬s) = 1.5 · 10⁻⁵   P(¬m, ¬c, ¬s) = 0.97912

Probability of Mexican flu given sore throat?
Why important: embedded systems

- Control of behaviour of a large production printer
- Speed v given available power P and required energy

Why important: agents

- Agents (robots) perceive an incomplete image of the world using sensors that are inherently unreliable
- Partially observable state
Representation of uncertainty

Representation of uncertainty is clearly important! How to do it? For example, rule based:

  e1 ∧ · · · ∧ en → h_x

with e: evidence and h: hypothesis. If e1, e2, . . ., en are true (observed), then conclusion h is true with certainty x.

How to proceed when the ei, i = 1, . . ., n, are uncertain?
⇒ uncertainty propagation/inference/reasoning

Rule-based uncertain knowledge

Early, simple approach – certainty-factor calculus:

  fever ∧ myalgia → flu_{CF=0.8}

Example of how it works, with CF(fever, B) = 0.6 and CF(myalgia, B) = 1 (B is background knowledge).

Combination functions:

  CF(flu, {fever, myalgia} ∪ B)
    = 0.8 · max{0, min{CF(fever, B), CF(myalgia, B)}}
    = 0.8 · max{0, min{0.6, 1}} = 0.48
Propagation rules

You need some basic ‘theory’ (e.g. certainty-factor calculus).

Define for this theory combination functions f∧, f∨, fprop, fco, where
- f∧(x, y): combine uncertainty w.r.t. x and y using the ∧-rule
- f∨(x, y): similar for ∨
- . . .

Examples:
- Certainty-factor model (MYCIN)
- Subjective Bayesian method (Prospector) – not discussed
- Ishizuka’s application of Dempster-Shafer theory – not discussed

Propagation rules (continued)

Define for this theory combination functions f∧, f∨, fprop, fco, where
- fprop(x, y): propagation of uncertain evidence e_x to h given the rule e → h_y
- fco(x, y): for two co-concluding (same conclusion) rules
    e1 → h_x   (e.g. contact chicken → flu_{0.01})
  and
    e2 → h_y   (e.g. train contact humans → flu_{0.1})
Certainty factor calculus

- Developed by E.H. Shortliffe and B.G. Buchanan for rule-based expert systems
- Applied in MYCIN, the expert system for the diagnosis of infectious disease
- There is a relationship to probability theory
- Certainty factors (CFs): subjective estimates of uncertainty with CF(x, y) ∈ [−1, 1]
  (CF(x, y) = −1 false, CF(x, y) = 0 unknown, and CF(x, y) = 1 true)
- The CF calculus offers a fill-in for the combination functions f∧, f∨, fprop, fco

Combination functions

fprop:
  rule e → h_{CF(h,e)}, and
  uncertain evidence w.r.t. e, i.e. CF(e, e′) (e′ includes all evidence so far), then:
    CF(h, e′) = CF(h, e) · max{0, CF(e, e′)}

f∧ (or f∨):
  rule e1 ∧ e2 → h_{CF(h,e)} with
  uncertain evidence CF(e1, e′) and CF(e2, e′), then:
    CF(e1 ∧ e2, e′) = min{CF(e1, e′), CF(e2, e′)}
    CF(e1 ∨ e2, e′) = max{CF(e1, e′), CF(e2, e′)}
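As a minimal sketch (our own Python, not MYCIN's actual code), these combination functions can be written down directly; the example reproduces the CF(flu, {fever, myalgia} ∪ B) = 0.48 computation from the earlier slide:

```python
def f_prop(cf_rule, cf_evidence):
    """fprop: CF(h, e') = CF(h, e) * max(0, CF(e, e'))."""
    return cf_rule * max(0.0, cf_evidence)

def f_and(*cfs):
    """f_and: CF(e1 ∧ ... ∧ en, e') = minimum of the individual CFs."""
    return min(cfs)

def f_or(*cfs):
    """f_or: CF(e1 ∨ ... ∨ en, e') = maximum of the individual CFs."""
    return max(cfs)

# fever ∧ myalgia → flu with CF = 0.8, given CF(fever, B) = 0.6 and CF(myalgia, B) = 1
cf_antecedent = f_and(0.6, 1.0)       # min{0.6, 1} = 0.6
cf_flu = f_prop(0.8, cf_antecedent)   # 0.8 * 0.6 = 0.48
print(cf_flu)
```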
Combination functions (continued)

fco:
  two rules e1 → h_{CF(h,e1)} and e2 → h_{CF(h,e2)} with
  uncertain evidence CF(e1, e′) and CF(e2, e′);
  with CF(h, e′1) = x and CF(h, e′2) = y (obtained using fprop):

    CF(h, e′1 co e′2) =
      x + y(1 − x)                    if x, y > 0
      (x + y) / (1 − min{|x|, |y|})   if −1 < x·y ≤ 0
      x + y(1 + x)                    if x, y < 0

Example

R = {R1: flu → fever_{CF(fever,flu)=0.8},
     R2: common-cold → fever_{CF(fever,common-cold)=0.3}}

Evidence: CF(flu, e′) = 0.6 and CF(common-cold, e′) = 1

For rule R1:
  CF(fever, e′1) = CF(fever, flu) · max{0, CF(flu, e′)} = 0.8 · 0.6 = 0.48
For rule R2 this yields CF(fever, e′2) = 0.3

Application of fco:
  CF(fever, e′1 co e′2) = CF(fever, e′1) + CF(fever, e′2)(1 − CF(fever, e′1))
                        = 0.48 + 0.3(1 − 0.48) = 0.636
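A hedged sketch of the three-case fco rule (again our own illustration) that reproduces the 0.636 above:

```python
def f_co(x, y):
    """Combine the CFs of two co-concluding rules for the same hypothesis h."""
    if x > 0 and y > 0:
        return x + y * (1 - x)
    if x < 0 and y < 0:
        return x + y * (1 + x)
    # mixed signs (undefined when one of the CFs is ±1)
    return (x + y) / (1 - min(abs(x), abs(y)))

cf1 = 0.8 * max(0.0, 0.6)   # rule R1: CF(fever, e1') = 0.48
cf2 = 0.3 * max(0.0, 1.0)   # rule R2: CF(fever, e2') = 0.3
print(f_co(cf1, cf2))       # 0.48 + 0.3 * (1 - 0.48) = 0.636
```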
However · · ·

fever ∧ myalgia → flu_{CF=0.8}

- How likely is the occurrence of fever or myalgia given that the patient has flu?
- How likely is the occurrence of fever or myalgia in the absence of flu?
- How likely is the presence of flu when just fever is present?
- How likely is the presence of no flu when just fever is present?

Bayesian networks

[Figure: Bayesian network with nodes flu (fl: yes/no), pneumonia (pn: yes/no), myalgia (my: yes/no), fever (fe: yes/no) and temp (≤ 37.5 / > 37.5); flu and pneumonia are parents of fever, flu is a parent of myalgia, and fever is a parent of temp.]
Specification of probabilities

Without independences:
  P(temp, fe, my, fl, pn) = P(temp | fe, my, fl, pn) × P(fe | my, fl, pn)
                            × P(my | fl, pn) × P(fl | pn) × P(pn)
Number of probabilities: 31 (×2)

With independences:
  P(temp, fe, my, fl, pn) = P(temp | fe) × P(fe | fl, pn) × P(my | fl) × P(fl) × P(pn)
Number of probabilities: 10 (×2)

General:
  P(X1, . . ., Xn) = ∏_{i=1}^{n} P(Xi | π(Xi))
(π(Xi): parents of node Xi)

Specification of probabilities 2

[Figure: the network of the previous slide, annotated with its conditional probability tables:]

P(fl = y) = 0.1
P(pn = y) = 0.05
P(my = y | fl = y) = 0.96
P(my = y | fl = n) = 0.20
P(fe = y | fl = y, pn = y) = 0.95
P(fe = y | fl = y, pn = n) = 0.88
P(fe = y | fl = n, pn = y) = 0.80
P(fe = y | fl = n, pn = n) = 0.001
P(temp ≤ 37.5 | fe = y) = 0.1
P(temp ≤ 37.5 | fe = n) = 0.99
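With these tables the factorised joint can be evaluated directly. The sketch below (the helper names and the example query P(fl = y | temp > 37.5) are our own) answers queries by brute-force enumeration; the efficient algorithms mentioned later avoid this exponential sum.

```python
from itertools import product

P_fl = {True: 0.1, False: 0.9}
P_pn = {True: 0.05, False: 0.95}
P_my = {True: {True: 0.96, False: 0.04},            # P(my | fl)
        False: {True: 0.20, False: 0.80}}
P_fe = {(True, True): 0.95, (True, False): 0.88,    # P(fe = y | fl, pn)
        (False, True): 0.80, (False, False): 0.001}
P_temp_low = {True: 0.1, False: 0.99}               # P(temp <= 37.5 | fe)

def joint(temp_low, fe, my, fl, pn):
    """P(temp, fe, my, fl, pn) = P(temp|fe) P(fe|fl,pn) P(my|fl) P(fl) P(pn)."""
    p = P_temp_low[fe] if temp_low else 1 - P_temp_low[fe]
    p *= P_fe[(fl, pn)] if fe else 1 - P_fe[(fl, pn)]
    p *= P_my[fl][my]
    return p * P_fl[fl] * P_pn[pn]

# P(fl = y | temp > 37.5) by marginalising the joint (temp_low = False means temp > 37.5)
num = sum(joint(False, fe, my, True, pn) for fe, my, pn in product([True, False], repeat=3))
den = sum(joint(False, fe, my, fl, pn) for fe, my, fl, pn in product([True, False], repeat=4))
print(num / den)   # ≈ 0.66
```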
Inference: evidence propagation

Nothing known:
[Figure: the flu network with probability bars (no/yes) for MYALGIA, FLU, FEVER, TEMP and PNEUMONIA before any evidence is entered.]

Temperature > 37.5 degrees Celsius:
[Figure: the same network after entering the evidence TEMP > 37.5; the probability bars of FLU, PNEUMONIA, FEVER and MYALGIA are updated.]

Inference: evidence propagation

Nothing known:
[Figure: the flu network with its prior probability bars, as above.]

Which symptoms belong to flu?
[Figure: the same network after entering evidence about FLU, showing the updated probability bars for MYALGIA, FEVER and TEMP.]
Probabilistic reasoning

We are interested in conditional probabilities:

  P(Xi | E) = P_E(Xi)

for (possibly empty) evidence E.

Examples:
  P(FLU = yes | TEMP ≤ 37.5)
  P(FLU = yes | MYALGIA = yes, TEMP ≤ 37.5)

Mostly (marginal) probability distributions of single variables.

There are many efficient algorithms.

Probabilistic reasoning: general

Joint probability distribution P(V):

  P(V) = P(X1, X2, . . ., Xn) = ∏_{X∈V} P(X | π(X))

Marginalisation:

  P(W) = Σ_{V\W} P(V)

Conditional probabilities and Bayes’ rule:

  P(Y | X, Z) = P(Y, X, Z) / P(X, Z) = P(X, Z | Y) P(Y) / P(X, Z)
Have you got Mexican Flu?

M: Mexican flu; C: chills; S: sore throat

P(m, c, s) = 0.009215       P(¬m, c, s) = 9.9 · 10⁻⁶
P(m, ¬c, s) = 0.000485      P(¬m, ¬c, s) = 0.0098901
P(m, c, ¬s) = 0.000285      P(¬m, c, ¬s) = 0.0009801
P(m, ¬c, ¬s) = 1.5 · 10⁻⁵   P(¬m, ¬c, ¬s) = 0.97912

Probability of Mexican flu and sore throat? 0.0097
Probability of Mexican flu given sore throat? 0.495
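Both numbers follow from the joint distribution by marginalisation and conditioning, as worked out on the next slide; a minimal sketch:

```python
# Joint distribution over (M: Mexican flu, C: chills, S: sore throat); keys are (m, c, s)
P = {(1, 1, 1): 0.009215,  (1, 0, 1): 0.000485,
     (1, 1, 0): 0.000285,  (1, 0, 0): 1.5e-5,
     (0, 1, 1): 9.9e-6,    (0, 0, 1): 0.0098901,
     (0, 1, 0): 0.0009801, (0, 0, 0): 0.97912}

p_m_and_s = sum(p for (m, c, s), p in P.items() if m == 1 and s == 1)   # ≈ 0.0097
p_s = sum(p for (m, c, s), p in P.items() if s == 1)                    # ≈ 0.0196
print(p_m_and_s, p_m_and_s / p_s)                                       # ≈ 0.0097, ≈ 0.495
```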
Have you got Mexican Flu?

How did we compute these probabilities?

Marginalisation:
  P(m, s) = P(m, c, s) + P(m, ¬c, s) = 0.009215 + 0.000485 = 0.0097

Conditional probability:
  P(m | s) = P(m, s) / P(s) = 0.0097 / 0.0196 = 0.495

Naive probabilistic reasoning

[Figure: network with binary (y/n) variables X1, X2, X3, X4; X1 and X2 are parents of X3, and X3 is the parent of X4.]

P(x1) = 0.6
P(x2) = 0.2
P(x3 | x1, x2) = 0.3    P(x3 | ¬x1, x2) = 0.5
P(x3 | x1, ¬x2) = 0.7   P(x3 | ¬x1, ¬x2) = 0.9
P(x4 | x3) = 0.4
P(x4 | ¬x3) = 0.1

P_E(x2) = P(x2 | x4) = P(x4 | x2) P(x2) / P(x4)    (Bayes’ rule)

  = [ Σ_{X3} P(x4 | X3) Σ_{X1} P(X3 | X1, x2) P(X1) P(x2) ]
    / [ Σ_{X3} P(x4 | X3) Σ_{X1,X2} P(X3 | X1, X2) P(X1) P(X2) ]
  ≈ 0.14
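The ≈ 0.14 can be checked by brute-force enumeration of the joint P(X1) P(X2) P(X3 | X1, X2) P(X4 | X3); a small sketch (helper names are our own):

```python
from itertools import product

P_x1 = {1: 0.6, 0: 0.4}
P_x2 = {1: 0.2, 0: 0.8}
P_x3 = {(1, 1): 0.3, (0, 1): 0.5, (1, 0): 0.7, (0, 0): 0.9}   # P(x3 = 1 | X1, X2)
P_x4 = {1: 0.4, 0: 0.1}                                       # P(x4 = 1 | X3)

def joint(x1, x2, x3, x4):
    p = P_x1[x1] * P_x2[x2]
    p *= P_x3[(x1, x2)] if x3 else 1 - P_x3[(x1, x2)]
    return p * (P_x4[x3] if x4 else 1 - P_x4[x3])

num = sum(joint(x1, 1, x3, 1) for x1, x3 in product([0, 1], repeat=2))      # P(x2, x4)
den = sum(joint(x1, x2, x3, 1) for x1, x2, x3 in product([0, 1], repeat=3)) # P(x4)
print(num / den)   # P(x2 | x4) ≈ 0.14
```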
Judea Pearl’s algorithm

[Figure: network fragments G1, . . ., G4 connected through the nodes V0, . . ., V4; π-messages (π(V0), π(V1), π(V2)) and λ-messages (λ(V0), λ(V1), λ(V2)) are passed along the arcs.]

- Object-oriented: nodes are objects that contain local information and carry out local computations
- Updating via message passing: arrows are communication channels

Probabilistic interpretation of the CF calculus?

Rule-based uncertainty: e → h_x
- propagation from antecedent e to conclusion h (fprop)
- combination of ∧ and ∨ evidence in e (f∧ and f∨)
- co-concluding rules (fco):
    e1 → h_x
    e2 → h_y

Bayesian networks: joint probability distribution P(X1, . . ., Xn) with marginalisation Σ_Y P(Y, Z) and conditioning P(Y | Z)
Propagation

fprop (propagation):

  e′ --CF(e, e′)--> e --CF(h, e)--> h

  CF(h, e′) = CF(h, e) · max{0, CF(e, e′)}

Corresponding Bayesian network (with P(e′) extra):

  E′ --P(E | E′)--> E --P(H | E)--> H

NB:
  P(h | e′) = P(h | e, e′) P(e | e′) + P(h | ¬e, e′) P(¬e | e′)
            = P(h | e) P(e | e′) + P(h | ¬e) P(¬e | e′)
  ⇒ P(h | ¬e) = 0 (assumption of the CF model), so that
  P(h | e′) = P(h | e) · P(e | e′), mirroring fprop

Co-concluding

fco (co-concluding):

  e′1 --CF(h, e′1)--> h <--CF(h, e′2)-- e′2

Idea: see this as an uncertain deterministic interaction ⇒ causal independence model

[Figure: causal independence network: E′1 → I1 and E′2 → I2, with I1 and I2 combined by an interaction function f into H.]
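A quick numeric check of this correspondence, with toy numbers of our own: once P(h | ¬e) = 0, the network computation collapses to the product used by fprop.

```python
p_h_given_e     = 0.8   # plays the role of CF(h, e)
p_h_given_not_e = 0.0   # the CF-model assumption P(h | ¬e) = 0
p_e_given_e1    = 0.6   # plays the role of CF(e, e')

p_h_given_e1 = p_h_given_e * p_e_given_e1 + p_h_given_not_e * (1 - p_e_given_e1)
print(p_h_given_e1)     # 0.48, the same as CF(h, e') = 0.8 * max(0, 0.6)
```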
Causal Independence

[Figure: causes C1, C2, . . ., Cn, each with an intermediate variable I1, I2, . . ., In (conditionally independent given their cause), combined by an interaction function f into the effect E.]

  P(e | C1, . . ., Cn) = Σ_{I1,...,In} P(e | I1, . . ., In) ∏_{k=1}^{n} P(Ik | Ck)
                       = Σ_{f(I1,...,In)=e} ∏_{k=1}^{n} P(Ik | Ck)

Boolean functions: P(E | I1, . . ., In) ∈ {0, 1} with f(I1, . . ., In) = 1 if P(e | I1, . . ., In) = 1

Example: noisy OR

[Figure: causes C1 and C2 with intermediate variables I1 and I2, combined by a logical OR into the effect E.]

- Interactions between ‘causes’: logical OR
- Meaning: presence of the intermediate causes Ik produces the effect e (i.e. E = true)

  P(e | C1, C2) = Σ_{I1∨I2=e} P(e | I1, I2) ∏_{k=1,2} P(Ik | Ck)
                = P(i1 | C1) P(i2 | C2) + P(¬i1 | C1) P(i2 | C2) + P(i1 | C1) P(¬i2 | C2)
Noisy OR and fco

fco:
  CF(h, e′1 co e′2) = CF(h, e′1) + CF(h, e′2)(1 − CF(h, e′1))
  for CF(h, e′1) ∈ [0, 1] and CF(h, e′2) ∈ [0, 1]

Causal independence with logical OR (noisy OR):
  P(e | C1, C2) = Σ_{I1∨I2=e} P(e | I1, I2) ∏_{k=1,2} P(Ik | Ck)
                = P(i1 | C1) P(i2 | C2) + P(¬i1 | C1) P(i2 | C2) + P(i1 | C1) P(¬i2 | C2)
                = P(i1 | C1) + P(i2 | C2)(1 − P(i1 | C1))

Example

The consequences of ‘flu’ and ‘common cold’ on ‘fever’ are modelled by the variables I1 and I2:
  P(i1 | flu) = 0.8, and
  P(i2 | common-cold) = 0.3
Furthermore, P(ik | w) = 0, k = 1, 2, if w ∈ {¬flu, ¬common-cold}

Interaction between FLU and COMMON-COLD as noisy OR:

  P(fever | I1, I2) = 0 if I1 = false and I2 = false
                      1 otherwise
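Assuming priors P(flu) = 0.6 and P(common-cold) = 1 (matching the evidence CF(flu, e′) = 0.6 and CF(common-cold, e′) = 1 used earlier), a brute-force sketch of this noisy-OR model reproduces the CF result shown on the next slide:

```python
from itertools import product

p_flu, p_cold = 0.6, 1.0
p_i1 = {True: 0.8, False: 0.0}   # P(i1 | FLU)
p_i2 = {True: 0.3, False: 0.0}   # P(i2 | COMMON-COLD)

p_fever = 0.0
for flu, cold, i1, i2 in product([True, False], repeat=4):
    p_c = (p_flu if flu else 1 - p_flu) * (p_cold if cold else 1 - p_cold)
    p_i = (p_i1[flu] if i1 else 1 - p_i1[flu]) * (p_i2[cold] if i2 else 1 - p_i2[cold])
    if i1 or i2:                 # noisy OR: fever iff at least one intermediate cause is present
        p_fever += p_c * p_i

print(p_fever)                   # 0.636, identical to fco(0.48, 0.3)
```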
Result

Bayesian network:

[Figure: the noisy-OR network with marginal probabilities: FLU true 0.600, COMMON-COLD true 1.000, I1 true 0.480, I2 true 0.300, FEVER true 0.636 (false 0.364).]

Fragment of the CF model:
  CF(fever, e′1 co e′2) = CF(fever, e′1) + CF(fever, e′2)(1 − CF(fever, e′1))
                        = 0.48 + 0.3(1 − 0.48) = 0.636

Conclusions

- Bayesian networks and other probabilistic graphical models (Markov networks, chain graphs) are the state of the art for reasoning with uncertainty
- Earlier rule-based uncertainty reasoning can be mapped (at least partially) to Bayesian networks
- Imposed restrictions of early methods become apparent, and new ideas for modelling may emerge
Logic and graphical models

The structure of a joint probability distribution P can also be described by undirected graphs (instead of directed graphs as in Bayesian networks).

[Figure: undirected graph over the nodes X1, . . ., X7.]

Together with P(V) = P(X1, X2, X3, X4, X5, X6, X7): Markov network

Marginalisation (example):
  P(¬x2) = Σ_{X1,X3,X4,X5,X6,X7} P(X1, ¬x2, X3, X4, X5, X6, X7)

Decomposition

Probability distribution of a Markov network:

  P(V) = (1/Z) ∏_k φ_k(X_{k}) = (1/Z) exp( Σ_k wk fk(V) )

with X_{k} ⊆ V, wk ∈ ℝ a weight, and feature fk(V) ∈ {0, 1}.

The functions φ_k are called potentials.

When a Markov network is chordal (i.e. given three subsequent connected nodes a − b − c, a and c are always connected by a link):

  φ_k(X_{k}) = P(Rk | Sk)

with Rk called the residual and Sk the separator, i.e. φ is a conditional probability distribution (and Z = 1).
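A toy illustration of this log-linear form (a two-variable example of our own, not from the slides): a single feature that fires when X1 and X2 agree, with weight w = 1.

```python
from itertools import product
from math import exp

w = 1.0
def feature(x1, x2):
    return 1 if x1 == x2 else 0             # f(V) ∈ {0, 1}

Z = sum(exp(w * feature(x1, x2)) for x1, x2 in product([0, 1], repeat=2))
P = {(x1, x2): exp(w * feature(x1, x2)) / Z for x1, x2 in product([0, 1], repeat=2)}
print(P[(1, 1)], P[(1, 0)])                 # agreeing states are e^w times as likely
```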
Markov logic network (MLN)

A Markov logic network (MLN) is a set of pairs:

  L = {(Fk, wk) | k = 1, . . ., n}

where Fk is a formula in first-order logic and wk ∈ ℝ.

Example (smoking causes cancer; if one friend smokes, the other smokes as well):

  0.8  ∀x (S(x) → C(x))
  0.3  ∀x∀y (F(x, y) → (S(x) ↔ S(y)))

with S: Smoking, C: Cancer, F: Friends.

Semantics of MLN

If C = {c1, . . ., cn} is a set of constants, then the corresponding Markov network M_{L,C}:
- M_{L,C} contains a node with a corresponding binary variable for each ground atom in B_L (the Herbrand base)
- the value of the variable is 1 if the ground atom is true; otherwise 0
- M_{L,C} contains, for each instance of formula Fk, a feature fk over the corresponding ground atoms (which form a complete subgraph)

For the probability distribution it holds that:

  P(V) = (1/Z) exp( Σ_k wk nk(V) ) = (1/Z) ∏_k φ_k(X_{k})^{nk(V)}

where nk(V) is the number of true instances (groundings) of Fk in the world V.
Example

Formula F ≡ w ∀x (S(x) → C(x)), with S ‘smoking’ and C ‘cancer’, and weight w

Constants C = {a, b} (interpretations of x)

Some combined interpretations of F (worlds):

  {S(a), C(a), S(b), C(b)}        2 models
  {S(a), ¬C(a), S(b), ¬C(b)}      0 models
  {S(a), ¬C(a), S(b), C(b)}       1 model
  ...

  P(S(a), C(a), S(b), C(b)) = (1/Z) e^{w·2}

Markov network:
[Figure: ground Markov network with an edge between S(a) and C(a) and an edge between S(b) and C(b).]

Alchemy system

Input:
  C = {A,B}
  Friends(C,C)
  Smoking(C)
  Cancer(C)
  0.8 Smoking(x) => Cancer(x)
  0.3 Friends(x,y) => (Smoking(x) <=> Smoking(y))

Output:
  Friends(A,A) 0.484002
  Friends(A,B) 0.476002
  Friends(B,A) 0.458004
  Friends(B,B) 0.519998
  Cancer(A) 0.569993
  Cancer(B) 0.566993
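A brute-force sketch of this semantics, restricted to the smoking formula only (leaving out the Friends formula is our simplification to keep the world space small): enumerate the worlds over the ground atoms S(a), C(a), S(b), C(b), count the true groundings nk(V), and normalise.

```python
from itertools import product
from math import exp

w = 0.8
worlds = list(product([False, True], repeat=4))      # worlds over (S(a), C(a), S(b), C(b))

def n_true(sa, ca, sb, cb):
    """Number of true groundings of S(x) -> C(x) for x in {a, b}."""
    return int((not sa) or ca) + int((not sb) or cb)

Z = sum(exp(w * n_true(*v)) for v in worlds)

def prob(v):
    return exp(w * n_true(*v)) / Z

print(n_true(True, True, True, True))                # 2 (both groundings true: "2 models")
print(prob((True, True, True, True)))                # (1/Z) e^{2w}
print(sum(prob(v) for v in worlds if v[1]))          # P(C(a)) ≈ 0.58, close to Alchemy's 0.57
```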
Alchemy system

Input:
  C = {A,B}
  Friends(C,C)
  Smoking(C)
  Cancer(C)
  0.8 Smoking(x) => Cancer(x)
  0.3 Friends(x,y) => (Smoking(x) <=> Smoking(y))

  Initializing MC-SAT with MaxWalksat on hard clauses...
  Running MC-SAT sampling...
  Sample (per pred) 100, time elapsed = 0 secs, num. clauses = 6
  Done burning. 100 samples.
  Sample (per pred) 100, time elapsed = 0.01 secs, num. clauses = 6
  Sample (per pred) 200, time elapsed = 0.01 secs, num. clauses = 6
  Sample (per pred) 300, time elapsed = 0.01 secs, num. clauses = 6
  Sample (per pred) 400, time elapsed = 0.02 secs, num. clauses = 6
  Sample (per pred) 500, time elapsed = 0.02 secs, num. clauses = 6
  Sample (per pred) 600, time elapsed = 0.02 secs, num. clauses = 6
  Sample (per pred) 700, time elapsed = 0.03 secs, num. clauses = 6
  Sample (per pred) 800, time elapsed = 0.03 secs, num. clauses = 6
  Sample (per pred) 900, time elapsed = 0.03 secs, num. clauses = 6
  Sample (per pred) 1000, time elapsed = 0.04 secs, num. clauses = 6
  Done MC-SAT sampling. 1000 samples.

Exact calculations

P(Cancer(A)) ≡ P(C(A)):
  P(C(A)) = P(S(A), C(A)) + P(¬S(A), C(A))
  Z = 1 + 3 exp(0.8), as {S(A), ¬C(A)} is inconsistent (it violates S(A) → C(A))
  ⇒ P(C(A)) = 2 · (1/Z) · exp(0.8 · 1) ≈ 4.45 / 7.68 ≈ 0.58

P(Friends(A, A)) = 2 · (1/Z) · exp(0.3 · 1), with Z = 4 exp(0.3 · 1), since there are 4 (A, A) models
  ⇒ P(Friends(A, A)) = 0.5
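The restricted calculations above are easy to verify numerically; a short check (using the same restriction to the relevant ground atoms as on the slide):

```python
from itertools import product
from math import exp

# Cancer(A): worlds over (S(A), C(A)) with the single formula S(A) -> C(A), weight 0.8
weights = {(s, c): exp(0.8 * int((not s) or c)) for s, c in product([False, True], repeat=2)}
Z = sum(weights.values())                                     # 1 + 3 e^0.8 ≈ 7.68
print(sum(wt for (s, c), wt in weights.items() if c) / Z)     # 2 e^0.8 / Z ≈ 4.45 / 7.68 ≈ 0.58

# Friends(A,A): F(A,A) -> (S(A) <-> S(A)) holds in every world, so Z = 4 e^0.3
fw = {(f, s): exp(0.3) for f, s in product([False, True], repeat=2)}
print(sum(wt for (f, s), wt in fw.items() if f) / sum(fw.values()))   # 0.5
```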
Summary

- The early rule-based (logical) approach to reasoning with uncertainty was attractive
- However, it had severe limitations (which ones?)
- Therefore: exploitation of probability theory
- However, probability theory also has limitations (which ones?)
- “Best of all worlds” (Leibniz): probabilistic logics (Markov logic, Bayesian logic, etc.)