Bayesian Methods with Monte Carlo Markov Chains I

Henry Horng-Shing Lu
Institute of Statistics, National Chiao Tung University
http://tigpbp.iis.sinica.edu.tw/courses.htm

Part 1: Introduction to Bayesian Methods

Bayes' Theorem

Conditional probability:
$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$

One derivation: since $P(A, B) = P(A \mid B) P(B) = P(B \mid A) P(A)$,
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$

Alternative derivation, expanding $P(B)$ by the law of total probability:
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B \mid A) P(A) + P(B \mid A^c) P(A^c)}$$

http://en.wikipedia.org/wiki/Bayes'_theorem

False Positive and Negative

False positives and negatives in medical diagnosis correspond to Type I and Type II errors of hypothesis testing in statistical inference.
http://en.wikipedia.org/wiki/False_positive

                          Actual status
Test result               Disease (H1)                         Normal (H0)
Positive (reject H0)      True positive (power, 1-β)           False positive (Type I error, α)
Negative (accept H0)      False negative (Type II error, β)    True negative (confidence level, 1-α)

Bayesian Inference (1)

False positives in a medical test.

Test accuracy by conditional probabilities:
$$P(\text{Test Positive} \mid \text{Disease}) = P(R \mid H_1) = 1 - \beta = 0.99$$
$$P(\text{Test Negative} \mid \text{Normal}) = P(A \mid H_0) = 1 - \alpha = 0.95$$

Prior probabilities:
$$P(\text{Disease}) = P(H_1) = 0.001, \qquad P(\text{Normal}) = P(H_0) = 0.999$$

Bayesian Inference (2)

Posterior probabilities by Bayes' theorem.

True positive probability:
$$P(\text{Disease} \mid \text{Test Positive}) = P(H_1 \mid R) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.05 \times 0.999} \approx 0.019$$

False positive probability:
$$P(\text{Normal} \mid \text{Test Positive}) = P(H_0 \mid R) = 1 - 0.019 = 0.981$$

Bayesian Inference (3)

Equal prior probabilities:
$$P(\text{Disease}) = P(H_1) = P(\text{Normal}) = P(H_0) = 0.5$$

Posterior probabilities by Bayes' theorem.

True positive probability:
$$P(\text{Disease} \mid \text{Test Positive}) = P(H_1 \mid R) = \frac{0.99 \times 0.5}{0.99 \times 0.5 + 0.05 \times 0.5} \approx 0.952$$

http://en.wikipedia.org/wiki/Bayesian_inference

Bayesian Inference (4)

In the courtroom:
$$P(\text{Evidence of DNA Match} \mid \text{Guilty}) = 1 \quad \text{and} \quad P(\text{Evidence of DNA Match} \mid \text{Innocent}) = 10^{-6}$$

Based on the evidence other than the DNA match,
$$P(\text{Guilty}) = 0.3 \quad \text{and} \quad P(\text{Innocent}) = 0.7$$

By Bayes' theorem,
$$P(\text{Guilty} \mid \text{Evidence of DNA Match}) = \frac{0.3 \times 1.0}{0.3 \times 1.0 + 0.7 \times 10^{-6}} \approx 0.99999766667$$

Naive Bayes Classifier

The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions.
http://en.wikipedia.org/wiki/Naive_Bayes_classifier

Naive Bayes Probabilistic Model (1)

The probability model for a classifier is a conditional model $P(C \mid F_1, \ldots, F_n)$, where $C$ is a dependent class variable and $F_1, \ldots, F_n$ are several feature variables.

By Bayes' theorem,
$$P(C \mid F_1, \ldots, F_n) = \frac{P(C)\, P(F_1, \ldots, F_n \mid C)}{P(F_1, \ldots, F_n)}$$

Naive Bayes Probabilistic Model (2)

Use repeated applications of the definition of conditional probability:
$$
\begin{aligned}
P(C, F_1, \ldots, F_n) &= P(C)\, P(F_1, \ldots, F_n \mid C) \\
&= P(C)\, P(F_1 \mid C)\, P(F_2, \ldots, F_n \mid C, F_1) \\
&= P(C)\, P(F_1 \mid C)\, P(F_2 \mid C, F_1)\, P(F_3, \ldots, F_n \mid C, F_1, F_2)
\end{aligned}
$$
and so forth.

Assume that each $F_i$ is conditionally independent of every other $F_j$ for $i \neq j$. This means that
$$P(F_i \mid C, F_j) = P(F_i \mid C)$$

Naive Bayes Probabilistic Model (3)

So $P(C, F_1, \ldots, F_n)$ can be expressed as
$$P(C) \prod_{i=1}^{n} P(F_i \mid C)$$

and $P(C \mid F_1, \ldots, F_n)$ can be expressed as
$$\frac{1}{Z}\, P(C) \prod_{i=1}^{n} P(F_i \mid C)$$
where $Z$ is constant if the values of the feature variables are known.

Constructing a classifier from the probability model:
$$\text{Classify}(f_1, \ldots, f_n) = \arg\max_{c}\; P(C = c) \prod_{i=1}^{n} P(F_i = f_i \mid C = c)$$

Bayesian Spam Filtering (1)

Bayesian spam filtering, a form of e-mail filtering, is the process of using a Naive Bayes classifier to identify spam email.

References:
http://en.wikipedia.org/wiki/Spam_%28email%29
http://en.wikipedia.org/wiki/Bayesian_spam_filtering
http://www.gfi.com/whitepapers/whybayesian-filtering.pdf

Bayesian Spam Filtering (2)

Probabilistic model:
$$P(\text{spam} \mid \text{words}) = \frac{P(\text{spam})\, P(\text{words} \mid \text{spam})}{P(\text{words})}$$
where {words} means {certain words in spam emails}.

Particular words have particular probabilities of occurring in spam emails and in legitimate emails. For instance, most email users will frequently encounter the word "Viagra" in spam emails, but will seldom see it in other emails.
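The classifier rule above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the word probabilities and the equal class priors are made-up numbers, not values from the talk, and the model multiplies only the probabilities of words that are present in the message. The names `naive_bayes_posterior` and `classify` are hypothetical helpers.

```r
# Hypothetical per-word probabilities (illustrative numbers, not from the slides).
p_word_given_spam <- c(viagra = 0.30, meeting = 0.02, free = 0.40)
p_word_given_ham  <- c(viagra = 0.001, meeting = 0.20, free = 0.05)
prior <- c(spam = 0.5, ham = 0.5)   # assumed class priors P(C)

# Posterior P(C | words) under the naive independence assumption.
# The sums of logs replace the product over features to avoid underflow.
naive_bayes_posterior <- function(words) {
  log_joint <- c(
    spam = log(prior[["spam"]]) + sum(log(p_word_given_spam[words])),
    ham  = log(prior[["ham"]])  + sum(log(p_word_given_ham[words]))
  )
  joint <- exp(log_joint - max(log_joint))
  joint / sum(joint)                # dividing by the sum plays the role of 1/Z
}

# arg max over classes, as in the Classify(f1, ..., fn) rule.
classify <- function(words) names(which.max(naive_bayes_posterior(words)))

post <- naive_bayes_posterior(c("viagra", "free"))
print(round(post, 4))
print(classify("meeting"))
```

With these assumed numbers, a message containing "viagra" and "free" gets a spam posterior close to 1, while a message containing only "meeting" is classified as legitimate.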
Bayesian Spam Filtering (3)

Before mail can be filtered using this method, the user needs to generate a database of words and tokens (such as the $ sign, IP addresses and domains, and so on), collected from a sample of spam mails and valid mails.

Once the database is built, each word in an email contributes to the email's spam probability. This contribution is called the posterior probability and is computed using Bayes' theorem.

Bayesian Spam Filtering (4)

Then the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter marks the email as spam.

Bayesian Network (1)

A Bayesian network is a compact representation of probability distributions via conditional independence.

For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms.

http://en.wikipedia.org/wiki/Bayesian_network
http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
http://www.cs.huji.ac.il/~nirf/Nips01Tutorial/index.html

Bayesian Network (2)

Conditional independencies and graphical language capture the structure of many real-world distributions.
The graph structure provides much insight into the domain.
It allows "knowledge discovery".

[Figure: data plus prior information are fed to a learner, which outputs a network over Cloudy, Sprinkler, Rain and Wet Grass together with the conditional probability table P(W | S, R)]

Bayesian Network (3)

Qualitative part: a directed acyclic graph (DAG).
  Nodes: random variables.
  Edges: direct influence.

[Figure: Cloudy -> Sprinkler, Cloudy -> Rain, Sprinkler -> Wet Grass, Rain -> Wet Grass]

Quantitative part: a set of conditional probability distributions, for example

  S   R   P(W = F | S, R)   P(W = T | S, R)
  F   F   1.0               0.0
  T   F   0.1               0.9
  F   T   0.1               0.9
  T   T   0.01              0.99

Together, they define a unique distribution in a factored form:
$$P(C, S, R, W) = P(C)\, P(S \mid C)\, P(R \mid C)\, P(W \mid S, R)$$

Inference

Posterior probabilities: the probability of any event given any evidence.
Most likely explanation: the scenario that explains the evidence.
Rational decision making: maximize expected utility; value of information.
Effect of intervention.

[Figure: example network with nodes Earthquake, Burglary, Radio, Alarm and Call]

Example 1 (1)

[Figure: the network Cloudy -> Sprinkler, Cloudy -> Rain, Sprinkler -> Wet Grass, Rain -> Wet Grass]

Example 1 (2)

By the chain rule of probability, the joint probability of all the nodes in the graph above is
$$P(C, S, R, W) = P(C)\, P(S \mid C)\, P(R \mid C, S)\, P(W \mid C, S, R)$$

By using conditional independence relationships, we can rewrite this as
$$P(C, S, R, W) = P(C)\, P(S \mid C)\, P(R \mid C)\, P(W \mid S, R)$$
where we were allowed to simplify the third term because R is independent of S given its parent C, and the last term because W is independent of C given its parents S and R.

Example 1 (3)

Bayes' theorem needs the normalizing constant $P(W = T)$, obtained by summing the joint distribution over the unobserved variables. The products below use $P(C = T) = 0.5$, $P(S = T \mid C = T) = 0.1$, $P(S = T \mid C = F) = 0.5$, $P(R = T \mid C = T) = 0.8$ and $P(R = T \mid C = F) = 0.2$; the arguments of $P_{C,S,R,W}$ are in the order $(c, s, r, w)$.
$$
\begin{aligned}
P(W = T) &= \sum_{c, s, r} P(C = c, S = s, R = r, W = T) \\
&= P_{C,S,R,W}(F,F,T,T) + P_{C,S,R,W}(F,T,T,T) + P_{C,S,R,W}(T,F,T,T) + P_{C,S,R,W}(T,T,T,T) \\
&\quad + P_{C,S,R,W}(F,F,F,T) + P_{C,S,R,W}(F,T,F,T) + P_{C,S,R,W}(T,F,F,T) + P_{C,S,R,W}(T,T,F,T) \\
&= 0.5 \times 0.5 \times 0.2 \times 0.9 + 0.5 \times 0.5 \times 0.2 \times 0.99 + 0.5 \times 0.9 \times 0.8 \times 0.9 + 0.5 \times 0.1 \times 0.8 \times 0.99 \\
&\quad + 0.5 \times 0.5 \times 0.8 \times 0 + 0.5 \times 0.5 \times 0.8 \times 0.9 + 0.5 \times 0.9 \times 0.2 \times 0 + 0.5 \times 0.1 \times 0.2 \times 0.9 \\
&= 0.4581 + 0.189 = 0.6471
\end{aligned}
$$
$P(W = T)$ is a normalizing constant, equal to the probability (likelihood) of the data.

Example 1 (4)

The posterior probability of each explanation:
$$P(R = T \mid W = T) = \frac{P(R = T, W = T)}{P(W = T)} = \frac{\sum_{c, s} P(C = c, S = s, R = T, W = T)}{P(W = T)} = \frac{0.4581}{0.6471} \approx 0.708$$
$$P(S = T \mid W = T) = \frac{P(S = T, W = T)}{P(W = T)} = \frac{\sum_{c, r} P(C = c, S = T, R = r, W = T)}{P(W = T)} = \frac{0.2781}{0.6471} \approx 0.430$$

So we see that it is more likely that the grass is wet because it is raining: the likelihood ratio is 0.708 / 0.430 ≈ 1.647.

Part 2: MLE vs. Bayesian Methods

Maximum Likelihood Estimates (MLEs) vs. Bayesian Methods

Binomial experiments:
http://www.math.tau.ac.il/~nin/Courses/ML04/ml2.ppt
More explanations and examples:
http://www.dina.dk/phd/s/s6/learning2.pdf

MLE (1)

Binomial experiments: suppose we toss a coin N times, and the random variable is
$$X_i = \begin{cases} 1, & \text{if the } i\text{th trial is a head} \\ 0, & \text{if the } i\text{th trial is a tail} \end{cases}$$

We denote by $\theta$ the (unknown) probability $P(\text{Head})$.

Estimation task: given a sequence of toss samples $x_1, x_2, \ldots, x_N$, we want to estimate the probabilities $P(H) = \theta$ and $P(T) = 1 - \theta$.
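Returning to Example 1 (the sprinkler network), the enumeration behind P(W = T) and the two posteriors can be checked with a short R sketch. The conditional probabilities are the ones read off the products in Example 1 (3); the variable names and the helper `bern` are illustrative choices, not part of the talk.

```r
# Conditional probabilities read off the products in Example 1 (3).
p_c          <- c(T = 0.5, F = 0.5)     # P(C = c)
p_s_given_c  <- c(T = 0.1, F = 0.5)     # P(S = T | C = c)
p_r_given_c  <- c(T = 0.8, F = 0.2)     # P(R = T | C = c)
p_w_given_sr <- c(TT = 0.99, TF = 0.9, FT = 0.9, FF = 0.0)  # P(W = T | S, R)

# All 8 configurations of (C, S, R), with W fixed at T.
grid <- expand.grid(C = c("T", "F"), S = c("T", "F"), R = c("T", "F"),
                    stringsAsFactors = FALSE)

# Probability of a binary value given the probability of "T".
bern <- function(p_true, value) ifelse(value == "T", p_true, 1 - p_true)

# Factored joint: P(C) P(S | C) P(R | C) P(W = T | S, R).
grid$joint <- p_c[grid$C] *
  bern(p_s_given_c[grid$C], grid$S) *
  bern(p_r_given_c[grid$C], grid$R) *
  p_w_given_sr[paste0(grid$S, grid$R)]

p_w         <- sum(grid$joint)                          # normalizing constant
p_r_given_w <- sum(grid$joint[grid$R == "T"]) / p_w     # P(R = T | W = T)
p_s_given_w <- sum(grid$joint[grid$S == "T"]) / p_w     # P(S = T | W = T)

print(c(p_w = p_w, p_r_given_w = p_r_given_w, p_s_given_w = p_s_given_w))
```

Brute-force enumeration like this is only practical for small networks, since the number of configurations doubles with every binary variable.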
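MLE (2)

The number of heads we see has a binomial distribution
$$P(X = k) = \binom{N}{k} \theta^{k} (1 - \theta)^{N - k}$$
and thus $E[X] = N\theta$.

Clearly, the MLE of $\theta$ is
$$\hat{\theta} = \frac{\sum_{i=1}^{N} x_i}{N}$$
and it is also equal to the method-of-moments estimate (MME) of $\theta$.

MLE (3)

Suppose we observe the sequence H, H.
The MLE is $P(H) = 1$, $P(T) = 0$.
Should we really believe that tails are impossible at this stage?
Such an estimate can have a disastrous effect: if we assume that $P(T) = 0$, then we are willing to act as though this outcome is impossible.

Bayesian Reasoning

In Bayesian reasoning we represent our uncertainty about the unknown parameter $\theta$ by a probability distribution.
This probability distribution can be viewed as a subjective probability: it is a personal judgment of uncertainty.

Bayesian Inference

$P(\theta)$: the prior distribution over the values of $\theta$.
$P(x_1, \ldots, x_N \mid \theta)$: the likelihood of the binomial experiment given a known value $\theta$.

Given $x_1, \ldots, x_N$, we can compute the posterior distribution of $\theta$:
$$P(\theta \mid x_1, \ldots, x_N) = \frac{P(x_1, \ldots, x_N \mid \theta)\, P(\theta)}{P(x_1, \ldots, x_N)}$$

The marginal likelihood is
$$P(x_1, \ldots, x_N) = \int P(x_1, \ldots, x_N \mid \theta)\, P(\theta)\, d\theta$$

http://www.dina.dk/phd/s/s6/learning2.pdf

Binomial Example (1)

In the binomial experiment, the unknown parameter is $\theta = P(H)$.

Simplest prior: $P(\theta) = 1$ for $0 < \theta < 1$ (uniform prior).

Likelihood:
$$P(x_1, \ldots, x_N \mid \theta) = \theta^{k} (1 - \theta)^{N - k}$$
where $k$ is the number of heads in the sequence.

Marginal likelihood:
$$P(x_1, \ldots, x_N) = \int_0^1 \theta^{k} (1 - \theta)^{N - k}\, d\theta$$

Binomial Example (2)

Using integration by parts, we have
$$
\begin{aligned}
P(x_1, \ldots, x_N) &= \int_0^1 \theta^{k} (1 - \theta)^{N - k}\, d\theta \\
&= \left[ \frac{1}{k + 1} \theta^{k + 1} (1 - \theta)^{N - k} \right]_0^1 + \frac{N - k}{k + 1} \int_0^1 \theta^{k + 1} (1 - \theta)^{N - k - 1}\, d\theta \\
&= \frac{N - k}{k + 1} \int_0^1 \theta^{k + 1} (1 - \theta)^{N - k - 1}\, d\theta
\end{aligned}
$$

Multiplying both sides by $\binom{N}{k}$ and using $\binom{N}{k} \frac{N - k}{k + 1} = \binom{N}{k + 1}$, we have
$$\binom{N}{k} \int_0^1 \theta^{k} (1 - \theta)^{N - k}\, d\theta = \binom{N}{k + 1} \int_0^1 \theta^{k + 1} (1 - \theta)^{N - k - 1}\, d\theta$$

Binomial Example (3)

The recursion terminates when $k = N$:
$$\binom{N}{N} \int_0^1 \theta^{N} (1 - \theta)^{0}\, d\theta = \int_0^1 \theta^{N}\, d\theta = \frac{1}{N + 1}$$

Thus,
$$P(x_1, \ldots, x_N) = \int_0^1 \theta^{k} (1 - \theta)^{N - k}\, d\theta = \frac{1}{N + 1} \binom{N}{k}^{-1}$$

We conclude that the posterior is
$$P(\theta \mid x_1, \ldots, x_N) = (N + 1) \binom{N}{k} \theta^{k} (1 - \theta)^{N - k}$$

Binomial Example (4)

How do we predict (estimate $\theta$) using the posterior?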
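We can think of this as computing the probability of the next element in the sequence:
$$
\begin{aligned}
P(x_{N+1} \mid x_1, \ldots, x_N) &= \int P(x_{N+1}, \theta \mid x_1, \ldots, x_N)\, d\theta \\
&= \int P(x_{N+1} \mid \theta, x_1, \ldots, x_N)\, P(\theta \mid x_1, \ldots, x_N)\, d\theta \\
&= \int P(x_{N+1} \mid \theta)\, P(\theta \mid x_1, \ldots, x_N)\, d\theta
\end{aligned}
$$

Assumption: if we know $\theta$, the probability of $x_{N+1}$ is independent of $x_1, \ldots, x_N$:
$$P(x_{N+1} \mid \theta, x_1, \ldots, x_N) = P(x_{N+1} \mid \theta)$$

Binomial Example (5)

Thus, we conclude that
$$
\begin{aligned}
\hat{\theta} = P(x_{N+1} = H \mid x_1, \ldots, x_N) &= \int P(x_{N+1} = H \mid \theta)\, P(\theta \mid x_1, \ldots, x_N)\, d\theta \\
&= \int \theta\, P(\theta \mid x_1, \ldots, x_N)\, d\theta \\
&= (N + 1) \binom{N}{k} \int_0^1 \theta^{k + 1} (1 - \theta)^{N - k}\, d\theta \\
&= (N + 1) \binom{N}{k} \cdot \frac{1}{N + 2} \binom{N + 1}{k + 1}^{-1} \\
&= \frac{k + 1}{N + 2}
\end{aligned}
$$

Beta Prior (1)

The uniform prior distribution is a particular case of the Beta distribution. Its general form is
$$f(\theta) = \frac{\Gamma(s)}{\Gamma(\alpha_1)\, \Gamma(\alpha_0)}\, \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_0 - 1}$$
where $s = \alpha_1 + \alpha_0$. It is denoted $\text{Beta}(\alpha_1, \alpha_0)$.

The expected value of the parameter is
$$\frac{\alpha_1}{\alpha_1 + \alpha_0}$$

The uniform distribution is $\text{Beta}(1, 1)$.

Beta Prior (2)

There are important theoretical reasons for using the Beta prior distribution. One of them also has important practical consequences: it is the conjugate distribution of binomial sampling.

If the prior is $\text{Beta}(\alpha_1, \alpha_0)$ and we have observed some data with $N_1$ and $N_0$ cases for the two possible values of the variable, then the posterior is also Beta, with parameters $\text{Beta}(\alpha_1 + N_1, \alpha_0 + N_0)$.

Beta Prior (3)

The expected value of the posterior distribution is
$$\frac{\alpha_1 + N_1}{\alpha_1 + N_1 + \alpha_0 + N_0}$$

The values $\left( \frac{\alpha_1}{\alpha_1 + \alpha_0}, \frac{\alpha_0}{\alpha_1 + \alpha_0} \right)$ represent the prior probabilities for the values of the variable, based on our past experience.

The value $s = \alpha_1 + \alpha_0$ is called the equivalent sample size and measures the importance of our past experience. Larger values give the prior probabilities more weight.

Beta Prior (4)

When $\alpha_0, \alpha_1 \to 0$, the posterior expected value reduces to $N_1 / (N_1 + N_0)$, and we recover maximum likelihood estimation.

Multinomial Experiments

Now assume that we have a variable X taking values in a finite set $\{a_1, \ldots, a_n\}$, that we have a series of independent observations of this distribution, $(x_1, x_2, \ldots, x_m)$, and that we want to estimate the values $\theta_i = P(a_i)$, $i = 1, \ldots, n$.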
Let $N_i$ be the number of cases in the sample in which we have obtained the value $a_i$ $(i = 1, \ldots, n)$.

The MLE of $\theta_i$ is
$$\hat{\theta}_i = \frac{N_i}{m}$$

The problems with small samples are completely analogous.

Dirichlet Prior (1)

We can also follow the Bayesian approach, but the prior distribution is now the Dirichlet distribution, a generalization of the Beta distribution to more than 2 cases: $(\theta_1, \ldots, \theta_n)$.

The expression of $D(\alpha_1, \ldots, \alpha_n)$ is
$$f(\theta_1, \ldots, \theta_n) = \frac{\Gamma(s)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_n)} \prod_{i=1}^{n} \theta_i^{\alpha_i - 1}$$
where $s = \sum_{i=1}^{n} \alpha_i$ is the equivalent sample size.

Dirichlet Prior (2)

The expected vector is
$$E(\theta_1, \ldots, \theta_n) = \left( \frac{\alpha_1}{s}, \ldots, \frac{\alpha_n}{s} \right)$$

A greater value of $s$ makes this distribution more concentrated around the mean vector.

Dirichlet Posterior

If we have a set of data with counts $(N_1, \ldots, N_n)$, then the posterior distribution is also Dirichlet, with parameters $D(\alpha_1 + N_1, \ldots, \alpha_n + N_n)$.

The Bayesian estimates of the probabilities are
$$\left( \frac{\alpha_1 + N_1}{s + m}, \ldots, \frac{\alpha_n + N_n}{s + m} \right)$$
where $m = \sum_{i=1}^{n} N_i$ and $s = \sum_{i=1}^{n} \alpha_i$.

Multinomial Example (1)

Imagine that we have an urn with balls of different colors, red (R), blue (B) and green (G), in unknown proportions.

Assume that we pick balls with replacement and obtain the following sequence: (B, B, R, R, B).

Multinomial Example (2)

If we assume a Dirichlet prior distribution with parameters $D(1, 1, 1)$, then the estimated frequencies for red, blue and green are
$$\left( \frac{3}{8}, \frac{4}{8}, \frac{1}{8} \right)$$

Observe that green has a positive probability, even though it never appears in the sequence.

Part 3: An Example in Genetics

Example 1 in Genetics (1)

Two linked loci with alleles A and a, and B and b.
A, B: dominant.
a, b: recessive.
A double heterozygote AaBb will produce gametes of four types: AB, Ab, aB, ab.

[Figure: the chromosome pairs of the double heterozygote AaBb and the gametes they produce, each branch with probability 1/2]

Example 1 in Genetics (2)

Probabilities for genotypes in gametes:

            No recombination    Recombination
  Male      1-r                 r
  Female    1-r'                r'

            AB          ab          aB      Ab
  Male      (1-r)/2     (1-r)/2     r/2     r/2
  Female    (1-r')/2    (1-r')/2    r'/2    r'/2

Example 1 in Genetics (3)

Fisher, R. A. and Balmukand, B. (1928). The estimation of linkage from the offspring of selfed heterozygotes. Journal of Genetics, 20, 79–92.

More:
http://en.wikipedia.org/wiki/Genetics
http://www2.isye.gatech.edu/~brani/isyebayes/bank/handout12.pdf

Example 1 in Genetics (4)

  Female \ Male    AB: (1-r)/2           ab: (1-r)/2           aB: r/2           Ab: r/2
  AB: (1-r')/2     AABB (1-r)(1-r')/4    aABb (1-r)(1-r')/4    aABB r(1-r')/4    AABb r(1-r')/4
  ab: (1-r')/2     AaBb (1-r)(1-r')/4    aabb (1-r)(1-r')/4    aaBb r(1-r')/4    Aabb r(1-r')/4
  aB: r'/2         AaBB (1-r)r'/4        aabB (1-r)r'/4        aaBB rr'/4        AabB rr'/4
  Ab: r'/2         AABb (1-r)r'/4        aAbb (1-r)r'/4        aABb rr'/4        AAbb rr'/4

Example 1 in Genetics (5)

Four distinct phenotypes: A*B*, A*b*, a*B* and a*b*.
A*: the dominant phenotype from (Aa, AA, aA).
a*: the recessive phenotype from aa.
B*: the dominant phenotype from (Bb, BB, bB).
b*: the recessive phenotype from bb.
A*B*: 9 gametic combinations.
A*b*: 3 gametic combinations.
a*B*: 3 gametic combinations.
a*b*: 1 gametic combination.
Total: 16 combinations.

Example 1 in Genetics (6)

Let $\varphi = (1 - r)(1 - r')$. Then
$$P(A^*B^*) = \frac{2 + \varphi}{4}, \qquad P(A^*b^*) = P(a^*B^*) = \frac{1 - \varphi}{4}, \qquad P(a^*b^*) = \frac{\varphi}{4}$$

Example 1 in Genetics (7)

Hence, a random sample of size n from the offspring of selfed heterozygotes will follow a multinomial distribution:
$$\text{Multinomial}\left( n;\; \frac{2 + \varphi}{4}, \frac{1 - \varphi}{4}, \frac{1 - \varphi}{4}, \frac{\varphi}{4} \right)$$

We know that $\varphi = (1 - r)(1 - r')$, $0 \le r \le 1/2$ and $0 \le r' \le 1/2$, so $1/4 \le \varphi \le 1$.

Bayesian for Example 1 in Genetics (1)

To simplify computation, we let
$$P(A^*B^*) = \theta_1, \quad P(A^*b^*) = \theta_2, \quad P(a^*B^*) = \theta_3, \quad P(a^*b^*) = \theta_4$$

The random sample of size n from the offspring of selfed heterozygotes will follow a multinomial distribution:
$$\text{Multinomial}(n;\; \theta_1, \theta_2, \theta_3, \theta_4)$$

Bayesian for Example 1 in Genetics (2)

We assume a Dirichlet prior distribution with parameters $D(\alpha_1, \alpha_2, \alpha_3, \alpha_4)$ to estimate the probabilities of A*B*, A*b*, a*B* and a*b*.

Recall that
A*B*: 9 gametic combinations.
A*b*: 3 gametic combinations.
a*B*: 3 gametic combinations.
a*b*: 1 gametic combination.

We consider $(\alpha_1, \alpha_2, \alpha_3, \alpha_4) = (9, 3, 3, 1)$.
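The Dirichlet update is simple enough to sketch in R. The block below reproduces the urn example from Multinomial Example (1) and (2) with the D(1, 1, 1) prior; the function name `dirichlet_posterior_mean` is a hypothetical helper, and the same call would apply to any prior vector and count vector of equal length, such as the (9, 3, 3, 1) prior above.

```r
# Posterior mean under a Dirichlet prior D(alpha_1, ..., alpha_n)
# after observing multinomial counts N_1, ..., N_n.
dirichlet_posterior_mean <- function(alpha, counts) {
  stopifnot(length(alpha) == length(counts), all(alpha > 0), all(counts >= 0))
  (alpha + counts) / (sum(alpha) + sum(counts))   # (alpha_i + N_i) / (s + m)
}

# Urn example: draws with replacement, three possible colors.
draws  <- c("B", "B", "R", "R", "B")
counts <- as.vector(table(factor(draws, levels = c("R", "B", "G"))))

alpha <- c(R = 1, B = 1, G = 1)               # the D(1, 1, 1) prior
bayes <- dirichlet_posterior_mean(alpha, counts)
mle   <- counts / length(draws)               # N_i / m, for comparison

print(bayes)   # green keeps a positive probability
print(mle)     # the MLE assigns green probability 0
```

Comparing the two printed vectors shows the small-sample issue discussed earlier: the MLE rules out green entirely, while the Bayesian estimate does not.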
Bayesian for Example 1 in Genetics (3)

Suppose that we observe the data $y = (y_1, y_2, y_3, y_4) = (125, 18, 20, 24)$.

Then the posterior distribution is also Dirichlet, with parameters $D(134, 21, 23, 25)$.

The Bayesian estimates of the probabilities are
$$(\theta_1, \theta_2, \theta_3, \theta_4) = (0.660, 0.103, 0.113, 0.123)$$

Bayesian for Example 1 in Genetics (4)

Consider the original model,
$$P(A^*B^*) = \frac{2 + \varphi}{4}, \qquad P(A^*b^*) = P(a^*B^*) = \frac{1 - \varphi}{4}, \qquad P(a^*b^*) = \frac{\varphi}{4}$$

The random sample of size n also follows a multinomial distribution:
$$(y_1, y_2, y_3, y_4) \sim \text{Multinomial}\left( n;\; \frac{2 + \varphi}{4}, \frac{1 - \varphi}{4}, \frac{1 - \varphi}{4}, \frac{\varphi}{4} \right)$$

We will assume a Beta prior distribution for $\varphi$: $\text{Beta}(\nu_1, \nu_2)$.

Bayesian for Example 1 in Genetics (5)

The posterior distribution becomes
$$P(\varphi \mid y_1, y_2, y_3, y_4) = \frac{P(y_1, y_2, y_3, y_4 \mid \varphi)\, P(\varphi)}{\int P(y_1, y_2, y_3, y_4 \mid \varphi)\, P(\varphi)\, d\varphi}$$

The integral in the denominator,
$$\int P(y_1, y_2, y_3, y_4 \mid \varphi)\, P(\varphi)\, d\varphi$$
does not have a closed form.

Bayesian for Example 1 in Genetics (6)

How do we solve this problem? With the Monte Carlo Markov Chains (MCMC) method!
What value is appropriate for $(\nu_1, \nu_2)$?

Part 4: Monte Carlo Methods

Monte Carlo Methods (1)

Consider the game of solitaire: what is the chance of winning with a properly shuffled deck?

http://en.wikipedia.org/wiki/Monte_Carlo_method
http://nlp.stanford.edu/local/talks/mcmc_2004_07_01.ppt

[Figure: four simulated hands with outcomes Lose, Lose, Win, Lose]

Chance of winning is 1 in 4!

Monte Carlo Methods (2)

This is hard to compute analytically, because winning or losing depends on a complex procedure of reorganizing cards.

Insight: why not just play a few hands, and see empirically how many do in fact win?

More generally, we can approximate a probability density function using only samples from that density.

Monte Carlo Methods (3)

We are given a very large set X and a distribution $f(x)$ over it.
We draw a set of N i.i.d. random samples.
We can then approximate the distribution using these samples.
[Figure: a density f(x) over the set X]
$$f_N(x) = \frac{1}{N} \sum_{i=1}^{N} 1\left( x^{(i)} = x \right) \xrightarrow[N \to \infty]{} f(x)$$

Monte Carlo Methods (4)

We can also use these samples to compute expectations:
$$E_N(g) = \frac{1}{N} \sum_{i=1}^{N} g\left( x^{(i)} \right) \xrightarrow[N \to \infty]{} E(g) = \sum_{x} g(x) f(x)$$

And we can even use them to find a maximum:
$$\hat{x} = \arg\max_{x^{(i)}} \left[ f\left( x^{(i)} \right) \right]$$

Monte Carlo Example

Let $X_1, \ldots, X_n$ be i.i.d. $N(0, 1)$. Find $E(X_i^4)$.

Solution:
$$E(X_i^4) = \int_{-\infty}^{\infty} x^4\, \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx = \frac{4!}{2^{4/2} \left( \frac{4}{2} \right)!} = \frac{24}{8} = 3$$

Use the Monte Carlo method to approximate it:

  > x <- rnorm(100000) # 100000 samples from N(0,1)
  > x <- x^4
  > mean(x)
  [1] 3.034175

Exercises

Write your own programs similar to the examples presented in this talk.
Write programs for the examples mentioned at the reference web pages.
Write programs for other examples that you know.
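As one possible starting point for the first exercise, the three uses of samples described in Monte Carlo Methods (3) and (4) can be sketched in R on a small discrete distribution. The six-point distribution `f`, the function `g` and the seed are arbitrary assumptions made for illustration only.

```r
set.seed(1)                               # fixed seed so the run is repeatable

support <- 1:6
f <- c(0.1, 0.1, 0.1, 0.1, 0.2, 0.4)      # assumed target distribution f(x)
N <- 100000

# N i.i.d. draws x^(1), ..., x^(N) from f.
x <- sample(support, size = N, replace = TRUE, prob = f)

# Empirical distribution: f_N(x) = (1/N) * sum of 1(x^(i) = x).
f_N <- as.vector(table(factor(x, levels = support))) / N

# Expectation: E_N(g) = (1/N) * sum of g(x^(i)), here with g(x) = x^2.
g <- function(x) x^2
E_N     <- mean(g(x))
E_exact <- sum(g(support) * f)            # exact value, available in this toy case

# Maximum: the sampled point with the largest f. Indexing f by x works
# only because the support is exactly 1:6.
x_hat <- x[which.max(f[x])]

print(rbind(f = f, f_N = f_N))
print(c(E_N = E_N, E_exact = E_exact, x_hat = x_hat))
```

Increasing N should bring `f_N` and `E_N` closer to their exact counterparts, which is the convergence stated on the slides.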