Bayesian Belief Networks (Bayes Nets)

Interesting because:
- The Naive Bayes assumption of full conditional independence is too restrictive
- But inference is intractable without some such assumptions...
- Bayesian belief networks describe conditional independence among subsets of variables
  → this allows combining prior knowledge about (in)dependencies among variables with observed training data

Conditional Probability and the Joint Distribution

Suppose we have a bag with four balls: 3 red and 1 blue (r, r, r, b).
Drawing a ball is the event B: B1 is the first draw, B2 the second (without replacement).

  P(B1 = r) = 3/4 ;  P(B1 = b) = 1/4

Conditional probability P(B2 | B1):

             B1 = r   B1 = b
  B2 = r      2/3       1
  B2 = b      1/3       0

Joint probability P(B1, B2) = P(B2 | B1) P(B1):

             B1 = r   B1 = b
  B2 = r      1/2      1/4
  B2 = b      1/4       0

Talking about Probabilities: the Joint Probability Table

Let's assume that drink influences your driving style, and that the condition of the road also influences your driving style.
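Returning briefly to the ball example above, its numbers can be checked mechanically. This is a minimal sketch using exact fractions; the probabilities are the ones derived above.

```python
from fractions import Fraction as F

# Bag with 3 red and 1 blue ball, drawn without replacement.
P_B1 = {"r": F(3, 4), "b": F(1, 4)}
# Conditional P(B2 | B1): remove the first ball, renormalize what's left.
P_B2_given_B1 = {"r": {"r": F(2, 3), "b": F(1, 3)},
                 "b": {"r": F(1, 1), "b": F(0, 1)}}

# Joint P(B1, B2) = P(B2 | B1) P(B1)
joint = {(b1, b2): P_B1[b1] * P_B2_given_B1[b1][b2]
         for b1 in "rb" for b2 in "rb"}

assert joint[("r", "r")] == F(1, 2)
assert joint[("r", "b")] == joint[("b", "r")] == F(1, 4)
# Marginalizing B1 out shows the second draw is also red with probability 3/4:
assert sum(joint[(b1, "r")] for b1 in "rb") == F(3, 4)
```

Note how marginalizing the joint recovers P(B2 = r) = 3/4, the same as P(B1 = r): without information about the first draw, the second draw has the same distribution.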
Now, drink ∈ {legal, illegal}; road ∈ {dry, wet}; style ∈ {excellent, fair, reckless}.
One can build a joint probability table like this:

  drink   road   style       prob.
  legal   dry    excellent   0.11
  legal   dry    fair        0.23
  legal   dry    reckless    0.05
  legal   wet    excellent   0.08
  legal   wet    fair        0.10
  legal   wet    reckless    0.01
  ill     dry    excellent   0.08
  ill     dry    fair        0.08
  ill     dry    reckless    0.13
  ill     wet    excellent   0.02
  ill     wet    fair        0.03
  ill     wet    reckless    0.08

This is the joint probability distribution P(drink, road, style); the probabilities sum to 1.

Talking about Probabilities: Conditioning

Now suppose we know that road = dry. Keeping only the road = dry rows of the table gives P(drink, style | road = dry)? Not quite: those entries sum to 0.68, not 1!!! We need to normalize to get probabilities:

  P(drink, style | road = dry) = Q(drink, style | road = dry) / Σ_d Σ_s Q(d, s | road = dry)

where Q denotes the unnormalized entries taken from the joint table.

Talking about Probabilities: Marginalization

Marginalizing out style:

  drink   road   prob.
  legal   dry    0.39
  legal   wet    0.19
  ill     dry    0.29
  ill     wet    0.13

  P(drink, road) = Σ_{s ∈ {e, f, r}} P(drink, road, style = s)

Probabilities sum to 1.

Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

  (∀ x_i, y_j, z_k)  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

More compactly, we write P(X | Y, Z) = P(X | Z).

Example: Thunder is conditionally independent of Rain, given Lightning:

  P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Naive Bayes uses conditional independence
to justify P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(X | Z) P(Y | Z)

Bayesian Belief Networks: Example

I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls

Network topology reflects "causal" knowledge:
- A burglar can set the alarm off
- An earthquake can set the alarm off
- The alarm can cause Mary to call
- The alarm can cause John to call

Conditional Probability Tables (CPTs)

[Figure: the burglary network. Burglary and Earthquake point to Alarm; Alarm points to JohnCalls and MaryCalls.]

  P(B) = .001        P(E) = .002

  B  E | P(A | B, E)
  T  T |    .95
  T  F |    .94
  F  T |    .29
  F  F |    .001

  A | P(J | A)       A | P(M | A)
  T |   .90          T |   .70
  F |   .05          F |   .01

Compactness: Joint Distributions

A joint distribution for a network with n Boolean nodes has 2^n − 1 independent entries. For the five variables B, E, A, J, M, that is 2^5 = 32 rows of value combinations... ok, 31 independent numbers, since the last is fixed because they must sum to 1.

Compactness: Conditional Probability Tables (CPTs)

A CPT for a Boolean variable X_i with k Boolean parents has 2^k rows, one for each combination of parent values. Each row requires one number p for X_i = true (the number for X_i = false is just 1 − p).

If each variable has no more than k parents, the complete network of n nodes requires O(n · 2^k) numbers, i.e., it grows linearly with n, vs. O(2^n) for the full joint distribution.

For the burglary net: 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31).

Bayesian Belief Network

[Figure: Storm and BusTourGroup point to Campfire; Storm points to Lightning; Lightning points to Thunder; Lightning and Campfire point to ForestFire.]

CPT for Campfire:

        S,B   S,¬B   ¬S,B   ¬S,¬B
   C    0.4   0.1    0.8    0.2
  ¬C    0.6   0.9    0.2    0.8

The network represents a set of conditional independence assertions:
- Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
- The graph is a directed acyclic graph (DAG).

Bayesian Belief Network

The network represents the joint probability distribution over all variables, e.g. P(Storm, BusTourGroup, ..., ForestFire). In general,

  P(y_1, ..., y_n) = Π_{i=1}^{n} P(y_i | Parents(Y_i))

where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph. So the joint distribution is fully defined by the graph plus the CPTs P(y_i | Parents(Y_i)).

Example

I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes, but I didn't feel any. Is there a burglar?

Choose:  argmax_{b ∈ {t,f}} P(j = t, m = f, a = t, b, e = f)

  P(j = t, m = f, a = t, b = t, e = f)
    = P(j = t | a = t) P(m = f | a = t) P(a = t | b = t, e = f) P(b = t) P(e = f)

vs.

  P(j = t, m = f, a = t, b = f, e = f)
    = P(j = t | a = t) P(m = f | a = t) P(a = t | b = f, e = f) P(b = f) P(e = f)

D-Separation: What Evidence Does to Dependence

- Chain X1 → X2 → X3: X1 is d-separated from X3 given hard evidence for X2.
- Common cause X1 ← X2 → X3: X1 is d-separated from X3 given hard evidence for X2.
- Common effect X1 → X2 ← X3: given hard evidence for X2, X1 becomes d-connected to X3, since X2 depends on both.

Semantics

Local semantics: each node is conditionally independent of its nondescendants given its parents.

Each node is also conditionally independent of all other nodes given its Markov blanket: its parents, its children, and its children's other parents.

[Figure: node X with parents U1 ... Um, children Y1 ... Yn, and children's parents Z1j ... Znj.]
Example: Car Diagnosis

- Initial evidence: the car won't start
- Testable variables (green), "broken, so fix it" variables (orange)
- Hidden variables (gray) ensure sparse structure and reduce parameters

[Figure: diagnosis network with nodes battery age, battery dead, fanbelt broken, alternator broken, no charging, battery flat, battery meter, lights, oil light, gas gauge, no oil, no gas, fuel line blocked, starter broken, dipstick, and car won't start.]

Inference

Consider the probability of an event A given some evidence E. Here we both condition and marginalize:

  P(A | E) = P(E, A) / P(E) = P(E | A) P(A) / P(E) = α P(E, A),  where α = 1 / P(E)

Bayesian Belief Networks Example

Consider the burglary network again (same CPTs as before).

Inference by Enumeration: Example

What is the probability distribution of Burglary given that John and Mary call, P(B | j = t, m = t)?
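This query can be answered by brute-force enumeration over the hidden variables. A minimal sketch using the CPT values from the burglary network above:

```python
# Enumeration for P(B | j = t, m = t) in the burglary network (slide CPTs).
P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.90, False: 0.05}   # P(JohnCalls = t | Alarm)
P_m = {True: 0.70, False: 0.01}   # P(MaryCalls = t | Alarm)

def unnormalized(b):
    # Sum the full joint P(b, e, a, j=t, m=t) over the hidden vars e and a.
    total = 0.0
    for e in (True, False):
        for a in (True, False):
            pa = P_a[(b, e)] if a else 1 - P_a[(b, e)]
            total += (P_e if e else 1 - P_e) * pa * P_j[a] * P_m[a]
    return (P_b if b else 1 - P_b) * total

scores = {b: unnormalized(b) for b in (True, False)}
alpha = 1 / sum(scores.values())
posterior = {b: alpha * s for b, s in scores.items()}
print(posterior[True])  # ≈ 0.284
```

The posterior P(B = t | j = t, m = t) ≈ 0.284 matches the classic result for this network.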
  P(B | j, m) = P(B, j, m) / P(j, m)
             = α P(B, j, m)
             = α Σ_e Σ_a P(B, e, a, j, m)
             = α Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)

Expanding the sum for B = t:

  if a = t, e = t :  P(b) P(e) P(a | b, e) P(j | a) P(m | a)
  if a = t, e = f :  P(b) P(e = f) P(a | b, e = f) P(j | a) P(m | a)
  if a = f, e = t :  P(b) P(e) P(a = f | b, e) P(j | a = f) P(m | a = f)
  if a = f, e = f :  P(b) P(e = f) P(a = f | b, e = f) P(j | a = f) P(m | a = f)

Inference by Enumeration: Example, cont'd... doing it smarter.
Rewrite the full joint entries as products of CPT entries, then move factors out of the sums they do not depend on:

  P(B | j, m) = α Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)
             = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)

(each underbraced factor comes from one CPT: B, E, A, J, M)

Inference: Irrelevant Variables

Consider the query P(JohnCalls | Burglary = true). What can be taken out of the summations? What cancels (goes to 1)?

  P(J | b) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(J | a) Σ_m P(m | a)

The sum over m is identically 1, so M is irrelevant to the query.

In general, Y is irrelevant unless Y ∈ Ancestors({X} ∪ E). Here X = JohnCalls, E = {Burglary}, and Ancestors({X} ∪ E) = {Alarm, Earthquake}, so MaryCalls is irrelevant.

Compact Conditional Distributions

- A CPT grows exponentially with the number of parents
- A CPT becomes infinite with a continuous-valued parent or child

Solution: canonical distributions that are defined compactly.

Deterministic nodes are the simplest case: X = f(Parents(X)) for some function f.
E.g., Boolean functions:  NorthAmerican = Canadian OR US OR Mexican
E.g., numerical relationships among continuous variables:
  ∂Level/∂t = inflow + precipitation − outflow − evaporation

Compact Conditional Distributions: Noisy-OR

Noisy-OR models multiple non-interacting causes:
- Parents U1 . . . Uk include all causes (one can add a leak node)
- Independent failure probability q_i for each cause alone
  ⇒ P(X | U1 . . . Uj, ¬Uj+1 . . .
¬Uk) = 1 − Π_{i=1}^{j} q_i

Example with q_Cold = 0.6, q_Flu = 0.2, q_Malaria = 0.1:

  Cold  Flu  Malaria   P(Fever)   P(¬Fever)
  F     F    F         0.0        1.0
  F     F    T         0.9        0.1
  F     T    F         0.8        0.2
  F     T    T         0.98       0.02  = 0.2 × 0.1
  T     F    F         0.4        0.6
  T     F    T         0.94       0.06  = 0.6 × 0.1
  T     T    F         0.88       0.12  = 0.6 × 0.2
  T     T    T         0.988      0.012 = 0.6 × 0.2 × 0.1

The number of parameters is linear in the number of parents.

Hybrid (Discrete + Continuous) Networks

Discrete variables (Subsidy? and Buys?); continuous variables (Harvest and Cost).

- Option 1: discretization — possibly large errors and large CPTs
- Option 2: probability density functions per discrete parent

Two cases to handle:
- Continuous variable with discrete + continuous parents (e.g., Cost)
- Discrete variable with continuous parents (e.g., Buys?)

[Figure: Subsidy? and Harvest point to Cost; Cost points to Buys?.]

Continuous Child Variables

We need one conditional density function for the child variable given its continuous parents, for each possible assignment of the discrete parents. In other words:
- the child node has a function f(parents_discrete, parents_continuous)
- the child incorporates a linear (ax + b) function of the continuous parents
- the function "forks" into different alternatives for the discrete parents

Continuous Child Variables: Linear Gaussian

The most common function f is the linear Gaussian model — one linear Gaussian per value of the discrete parent:

  P(Cost = c | Harvest = h, Subsidy? = true) = N(a_t h + b_t, σ_t)(c)
    = (1 / (σ_t √(2π))) exp( −(1/2) ((c − (a_t h + b_t)) / σ_t)² )

The mean of Cost varies linearly with Harvest; the variance is fixed. Similarly,

  P(Cost = c | Harvest = h, Subsidy? = false)
    = (1 / (σ_f √(2π))) exp( −(1/2) ((c − (a_f h + b_f)) / σ_f)² )

Linear variation is unreasonable over the full range, but works OK if the likely range of Harvest is narrow.

[Figure: surface plot of P(c | h, Subsidy? = true) over Harvest h and Cost c.]

An all-continuous network with linear Gaussian (LG) distributions ⇒ the full joint distribution is a multivariate Gaussian.

A discrete + continuous LG network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values.

Discrete Variable with Continuous Parents

The probability of Buys?
given Cost should be a "soft" threshold.

[Figure: P(Buys? = false | Cost = c) rising smoothly from 0 to 1 as c grows.]

The probit distribution uses the integral of the Gaussian:

  Φ(x) = ∫_{−∞}^{x} N(0, 1)(t) dt
  P(Buys? = true | Cost = c) = Φ((−c + µ)/σ)

The sigmoid (or logit) distribution, also used in neural networks:

  P(Buys? = true | Cost = c) = 1 / (1 + exp(−2 (−c + µ)/σ))

The sigmoid has a similar shape to the probit but much longer tails.

Inference in Bayesian Networks

How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
- The Bayes net contains all the information needed for this inference
- If only one variable has an unknown value, it is easy to infer
- In the general case, the problem is NP-hard

In practice, we can succeed in many cases:
- Exact inference methods work well for some network structures
- Monte Carlo methods "simulate" the network randomly to calculate approximate solutions

Approximate Inference: Sampling

One idea: given a query, if you can sample from the joint probability distribution, all you need to do is count.

Example (sprinkler network):
- sample P(Cloudy) = <0.5, 0.5>: choose true
- sample P(Sprinkler | Cloudy = true) = <0.1, 0.9>: choose false
- sample P(Rain | Cloudy = true) = <0.8, 0.2>: choose true
- sample P(WetGrass | Sprinkler = false, Rain = true) = <0.9, 0.1>: choose true
- return <true, false, true, true>

Approximate Inference: Prior Sampling

prior_sample takes in a Bayesian network bn specifying P(X1, X2, ..., Xn):

function prior_sample(bn)
    X = vector with the vars.
        of the BN  // <X1, X2, ..., Xn>, parents before children
    sample = an array of n elements
    for each var in X:
        sample[i] = random sample from P(var | Parents(var))
    return sample

Approximate Inference: Prior Sampling

[Tables of 10 and 29 sampled (c, s, r, w) tuples omitted.]

  P(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 (using the CPTs)
  P(t, f, t, t) ≈ 5/10  = 0.5  with 10 samples
  P(t, f, t, t) ≈ 12/29 = 0.41 with 29 samples
  P(t, f, t, t) ≈ 33/100 = 0.33 with 100 samples
  ... 0.325 with 1000 samples

Each sampling run will give you a slightly different number, but in general

  lim_{N→∞} N_PS(x1, x2, ..., xn) / N = P(x1, x2, ..., xn)

with N the total number of samples and N_PS(x1, ..., xn) the number of times the event (x1, ..., xn) appears in the samples. So,

  P(x1, x2, ..., xn) ≈ N_PS(x1, x2, ..., xn) / N

Approximate Inference: Rejection Sampling

What if you know something about the variables? You want to find P(X = x | e), where X is the unknown variable and e the evidence (all variables you know the value of).
If P̂(X | e) is the distribution that the algorithm returns, then

  P̂(X | e) = α N_PS(X, e) = N_PS(X, e) / N_PS(e) ≈ P(X, e) / P(e) = P(X | e)

What's the probability P(Rain | Sprinkler = t)?

Approximate Inference: Rejection Sampling

/* bn:  a Bayesian network
   X:   the query variable
   e:   observed values for the evidence variables
   num: the number of samples to generate */
function rejection_sample(bn, X, e, num)
    N = vector of counts for each value of X
    for 1 to num:
        sample = prior_sample(bn)
        if sample is consistent with e:
            N[x] = N[x] + 1   // where x is the value of X in sample
    return normalize(N)       // divide each N[i] by the sum of the N's

Problem: how many iterations does it take to get, say, 100 consistent samples?

Approximate Inference: Importance Sampling

For Bayesian networks we can use likelihood weighting: fix the values of the evidence variables and sample only the non-evidence variables.

But wait! We may end up generating events that are very unlikely, or even impossible. For example, P(Rain | Cloudy = t, WetGrass = t) can generate [c = t, s = f, r = f, w = t], but P(WetGrass = t | Sprinkler = f, Rain = f) = 0!!! We have to weight samples by their likelihood.

Example of Likelihood Weighting: P(Rain | Cloudy = t, WetGrass = t)

Let w be a weight assigned to the sample, initially 1.
- Cloudy = t because it is evidence. Therefore w = w × P(Cloudy = t) = 0.5
- Sprinkler is not evidence. Sample P(Sprinkler | Cloudy = t) = <0.1, 0.9>: pick f
- Rain is not evidence. Sample P(Rain | Cloudy = t) = <0.8, 0.2>: pick t
- WetGrass = t is evidence. Therefore w = w × P(WetGrass = t | Sprinkler = f, Rain = t) = 0.5 × 0.9 = 0.45

Return [t, f, t, t] with a weight of 0.45 and tally it under Rain = true.

Approximate Inference: Likelihood Weighting

/* bn:  a Bayesian network specifying P(X1, X2, . . .
, Xn)
   X:   the query variable
   e:   observed values for the evidence variables
   num: the number of samples to generate */
function likelihood_weighting(bn, X, e, num)
    W = vector of weighted counts for each value of X
    for 1 to num:
        sample, w = weighted_sample(bn, e)
        W[x] = W[x] + w   // where x is the value of X in sample
    return normalize(W)   // divide each W[i] by the sum of the W's

function weighted_sample(bn, e)
    w = 1
    sample = a vector with n elems. initialized with the values from e, rest unknown
    X = vector with the vars. of the BN  // <X1, X2, ..., Xn>
    for each var in X:
        if var is an evidence variable with value x in e:
            w = w * P(var = x | Parents(var))
        else:
            i = index of var in X
            sample[i] = a random sample from P(var | Parents(var))
    return (sample, w)

Inference by Markov Chain Monte Carlo

Definitions:
- Monte Carlo (as in Monaco): estimation by random simulation
- Markov chain: each new state comes from a random change to the previous one, P(X_t | X_{t−1})

Gibbs Sampling

- start with an arbitrary state (evidence is fixed)
- generate the next state by sampling one non-evidence variable
- pick the next non-evidence variable and sample it using the rest of the configuration

Note that sampling a non-evidence variable is conditioned on its Markov blanket:

  P(x_i | MarkovBlanket(X_i)) = α P(x_i | Parents(X_i)) × Π_{Y_j ∈ Children(X_i)} P(y_j | Parents(Y_j))

Gibbs Sampling Example: P(Rain | Sprinkler = t, WetGrass = t)

- start with [t, t, f, t] (picked at random, with the evidence fixed)
- Cloudy is sampled given its MB: P(Cloudy | Sprinkler = true, Rain = false). Choose false.
- Rain is sampled given its MB: P(Rain | Cloudy = false, Sprinkler = true, WetGrass = true). Pick true.
- current state = [f, t, t, t]. Go back to sampling Cloudy...

Say the process visited 20 states where Rain = t and 60 where Rain = f. Then

  P(Rain | Sprinkler = t, WetGrass = t) = α <20, 60> = <20/80, 60/80> = <0.25, 0.75>

Gibbs Sampling Algorithm

/* bn:  a Bayesian network specifying P(X1, X2, . . .
, Xn)
   X:   the query variable
   e:   observed values for evidence variables
   num: the number of samples to generate */
function gibbs_ask(bn, X, e, num)
    N = vector of counts for each value of X
    Z = non-evidence vars in bn
    sample = initialized from e; non-evidence vars set at random
    for 1 to num:
        for var in Z:
            i = index of var in Z
            sample[i] = sample from P(var | mb(var))
            N[x] = N[x] + 1   // where x is the value of X in sample
    return normalize(N)       // divide each N[i] by the sum of the N's

Learning in Bayesian Networks

Several variants of this learning task:
- The network structure might be known or unknown
- The training examples might provide values of all network variables, or just some

If the structure is known and we observe all variables, then it's as easy as training a Naive Bayes classifier. But what if the structure is known and only some variables can be observed?

Learning Parameters in Bayes Nets: Expectation Maximization (EM)

- We have a network with a node whose parameters we do not know, but we know how it should behave (its distribution)
- We have observations that depend on the values (parameters) of that node

EM Algorithm: The Case of Two Conditions with Success or Failure

Imagine there are two coins, A and B. One is more likely to come up heads, the other more likely to come up tails. You pick one at random and toss it. Which one was it?

A Tutorial¹: well, let's do this five times:
- pick a coin randomly
- toss it 10 times
- record the number of heads and tails

Then compute the average number of heads for each coin.

¹ Cuong B. Do and Serafim Batzoglou (2008)

That was easy: Coin A yields heads 80% of the time, Coin B 45% of the time.

But what if we are given ONLY the results of our coin tosses? Can we guess the percentage of heads that each coin yields? Can we guess which coin was picked for each set of 10 coin tosses?

One way to think about this is:
1. Assign random averages to both coins
2.
For each of the 5 rounds of 10 coin tosses:
   - check the percentage of heads
   - find the probability of it coming from each coin
   - compute the expected number of heads: using that probability as a weight, multiply it by the number of heads
   - record those numbers
   - re-compute new means for coins A and B
3. With these new means, go back to step 2.

How Do Coin Tosses Behave?

Coin tosses follow the binomial distribution.² ³

² Math Spoken Here: Binomial Distribution
³ PennState Eberly College of Science, Statistics Online

So, the five rounds of 10 coin tosses yield:

  1:  H T T T H H T H T H
  2:  H H H H H T H H H H
  3:  H T H H H H H T H H
  4:  H T H T T T H H T T
  5:  T H H H T H H H T H

Let's take the first round: 5/10 heads and 5/10 tails. We compute the likelihood that it was coin "A" and that it was coin "B" using the binomial distribution with mean probability θ on n trials with k successes:

  p(k) = C(n, k) θ^k (1 − θ)^(n−k)

where θ_i is the average probability of heads for coin i (initially assigned at random).

E-Step

So, we have θ_A = 0.6 and θ_B = 0.5. For the first round (5 heads, 5 tails), comparing the likelihoods (the binomial coefficient is common to both and cancels when we normalize):

  likelihood of "A" = θ_A^5 (1 − θ_A)^5 = 0.0007962624
  likelihood of "B" = θ_B^5 (1 − θ_B)^5 = 0.0009765625

Normalizing, we get probabilities 0.45 and 0.55.
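The round-1 likelihoods just computed, and the full EM loop, can be checked in a few lines. The heads counts are taken from the five rounds of tosses above; the converged values match the Do and Batzoglou tutorial (roughly 0.80 and 0.52).

```python
# EM for the two-coin tutorial. heads[i] = number of heads in round i (of 10).
heads = [5, 9, 8, 4, 7]
n = 10
theta_a, theta_b = 0.6, 0.5     # the initial guesses from the slides

def em_step(theta_a, theta_b):
    # E-step: responsibility of coin A for each round (binomial coeffs cancel).
    ha = ta = hb = tb = 0.0      # expected heads/tosses attributed to A and B
    for h in heads:
        like_a = theta_a**h * (1 - theta_a)**(n - h)
        like_b = theta_b**h * (1 - theta_b)**(n - h)
        w_a = like_a / (like_a + like_b)
        ha += w_a * h
        ta += w_a * n
        hb += (1 - w_a) * h
        tb += (1 - w_a) * n
    # M-step: new maximum likelihood estimates, expected H / (H + T).
    return ha / ta, hb / tb

for _ in range(20):
    theta_a, theta_b = em_step(theta_a, theta_b)
print(round(theta_a, 2), round(theta_b, 2))  # ≈ 0.80 and 0.52
```

The first iteration already moves the estimates to about 0.71 and 0.58, and the loop then converges quickly.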
M-Step

Recap: for round 1, P(Coin = A) = 0.45 and P(Coin = B) = 0.55.

Estimating the likely number of heads and tails in round 1:
- "A": H = 0.45 × 5 heads ≈ 2.2 heads; T = 0.45 × 5 tails ≈ 2.2 tails
- "B": H = 0.55 × 5 heads ≈ 2.8 heads; T = 0.55 × 5 tails ≈ 2.8 tails

Do the same for all five rounds. Then compute the new probability of heads for each coin, H / (H + T), summing the expected counts over all rounds. That gives the new maximized parameter θ for each coin.

Repeat the E-step and M-step until convergence.

Expectation Maximization (EM)

When to use:
- Data is only partially observable
- Unsupervised clustering (target value unobservable)
- Supervised learning (some instance attributes unobservable)

Some uses:
- Train Bayesian belief networks
- Unsupervised clustering (AUTOCLASS)
- Learning Hidden Markov Models

Generating Data from a Mixture of k Gaussians

[Figure: density p(x) over x formed by overlapping Gaussians.]

Each instance x is generated by:
1. Choosing one of the k Gaussians with uniform probability
2. Generating an instance at random according to that Gaussian

EM for Estimating k Means

Given:
- Instances from X generated by a mixture of k Gaussian distributions
- Unknown means <µ1, ..., µk> of the k Gaussians
- We don't know which instance x_i was generated by which Gaussian

Determine:
- Maximum likelihood estimates of <µ1, ..., µk>

Think of the full description of each instance as y_i = <x_i, z_i1, z_i2>, where
- z_ij is 1 if x_i was generated by the j-th Gaussian
- x_i is observable
- z_ij is unobservable

EM Algorithm: pick a random initial h = <µ1, µ2>, then iterate.

E step: Calculate the expected value E[z_ij] of each hidden variable z_ij, assuming the current hypothesis h = <µ1, µ2> holds.
  E[z_ij] = p(x = x_i | µ = µ_j) / Σ_{n=1}^{2} p(x = x_i | µ = µ_n)
          = exp(−(x_i − µ_j)² / (2σ²)) / Σ_{n=1}^{2} exp(−(x_i − µ_n)² / (2σ²))

M step: Calculate a new maximum likelihood hypothesis h' = <µ'1, µ'2>, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated above. Then replace h = <µ1, µ2> by h' = <µ'1, µ'2>:

  µ_j ← Σ_{i=1}^{m} E[z_ij] x_i / Σ_{i=1}^{m} E[z_ij]

Estimating k Means: Example (Mixture of Gaussians)

Say we know that on cloudy days the temperature is generally lower than on sunny days. Both cloudy and sunny days can experience variations of around 10 degrees. Now, given the following temperatures for 10 random days:

  Days = {70, 62, 89, 54, 97, 75, 82, 56, 32, 78}

what's the mean temperature for sunny and for cloudy days?

Set up:
- We have sunny and cloudy days ⇒ k = 2
- Assign random means to sunny and cloudy: 80 and 55 respectively
- The standard deviation for each day is 10 (given).
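Given this setup, the E-step/M-step loop is only a few lines (σ = 10 is fixed; 80 and 55 are the initial guesses above):

```python
import math

# EM for the two-Gaussian temperature example (σ = 10 for both components).
days = [70, 62, 89, 54, 97, 75, 82, 56, 32, 78]
mu_sunny, mu_cloudy = 80.0, 55.0
sigma = 10.0

def em_step(mu1, mu2):
    # E-step: responsibility of the "sunny" Gaussian for each day.
    w = []
    for x in days:
        a = math.exp(-((x - mu1) ** 2) / (2 * sigma**2))
        b = math.exp(-((x - mu2) ** 2) / (2 * sigma**2))
        w.append(a / (a + b))
    # M-step: new means, weighted by the responsibilities.
    new1 = sum(wi * x for wi, x in zip(w, days)) / sum(w)
    new2 = sum((1 - wi) * x for wi, x in zip(w, days)) / sum(1 - wi for wi in w)
    return new1, new2

while True:
    new_s, new_c = em_step(mu_sunny, mu_cloudy)
    done = abs(new_s - mu_sunny) < 0.1 and abs(new_c - mu_cloudy) < 0.1
    mu_sunny, mu_cloudy = new_s, new_c
    if done:
        break
print(round(mu_sunny, 1), round(mu_cloudy, 1))
```

The loop settles near 81.8 for sunny and 52.9 for cloudy, close to the hand-computed iterations shown next (small differences come from the slides rounding the responsibilities to two decimals).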
Estimating k Means: EM

Data: {70, 62, 89, 54, 97, 75, 82, 56, 32, 78}

E step (responsibilities):
  Sunny (80):  {0.65, 0.20, 0.99, 0.03, 0.99, 0.86, 0.97, 0.05, 0.00, 0.93}
  Cloudy (55): {0.34, 0.79, 0.00, 0.96, 0.00, 0.13, 0.02, 0.94, 0.99, 0.06}

M step:
  Sunny:  (45.59 + 12.51 + 88.58 + 1.78 + 96.93 + 65.02 + 79.87 + 2.99 + 0.00 + 72.73) / 5.70 = 81.75
  Cloudy: (24.4 + 49.48 + 0.41 + 52.21 + 0.06 + 9.97 + 2.12 + 53.00 + 31.99 + 5.26) / 4.29 = 53.35

Iterate until the difference between successive means is less than 0.1:
  Iteration 2 — Sunny: 81.79, Cloudy: 53.00
  Iteration 3 — Sunny: 81.77, Cloudy: 52.91

EM Algorithm

Converges to a local maximum likelihood h and provides estimates of the hidden variables z_ij. In fact, it finds a local maximum of E[ln P(Y | h)], where
- Y is the complete (observable plus unobservable) data
- the expected value is taken over the possible values of the unobserved variables in Y

General EM Problem

Given:
- Observed data X = {x1, ..., xm}
- Unobserved data Z = {z1, ..., zm}
- A parameterized probability distribution P(Y | h), where
  - Y = {y1, ..., ym} is the full data, y_i = x_i ∪ z_i
  - h are the parameters

Determine:
- h that (locally) maximizes E[ln P(Y | h)]

Many uses:
- Train Bayesian belief networks
- Unsupervised clustering (e.g., k means)
- Hidden Markov Models

General EM Method

Define a likelihood function Q(h' | h) which calculates over Y = X ∪ Z, using the observed X and the current parameters h to estimate Z:

  Q(h' | h) ← E[ln P(Y | h') | h, X]

EM Algorithm:

Estimation (E) step: Calculate Q(h' | h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:

  Q(h' | h) ← E[ln P(Y | h') | h, X]

Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function:
  h ← argmax_{h'} Q(h' | h)

Gibbs Sampling for Gaussian Mixtures

Given some estimates of µ1, µ2, repeat until convergence:
- for i = 1 to N: sample π_i according to the E-step of the EM algorithm
- update

    µ̂_j = Σ_{i=1}^{N} (1 − π_i^t) x_i / Σ_{i=1}^{N} (1 − π_i^t)

- sample from the Gaussians with these estimates and produce new means

Learning the Net's Structure

When the structure is unknown...
- Algorithms use greedy search to add/subtract edges and nodes
- Active research topic

Summary: Bayesian Belief Networks

- Combine prior knowledge with observed data
- The impact of prior knowledge (when correct!) is to lower the sample complexity

Active research area:
- Extend from Boolean to real-valued variables
- Parameterized distributions instead of tables
- Extend to first-order instead of propositional systems
- More effective inference methods
- ...
© Copyright 2024