Lecture 9: Probabilistic IR
The Binary Independence Model and Okapi BM25
Trevor Cohn (Slide credits: William Webber)
COMP90042, 2015, Semester 1

What we'll learn in this lecture

Probabilistic models for IR
- As a model of relevance, $P(R \mid d, q)$
- Binary independence model
- Okapi BM25

Probabilistic vs. geometric models

Fundamental calculations:
Geometric: How similar, $\mathrm{sim}(d, q)$, is document $d$ to query $q$?
Probabilistic: What is the probability $P(R = 1 \mid d, q)$ that $d$ is relevant to $q$?
Different starting points, but leading to similar formulations.

Probabilistic models

Probabilistic models:
- Clearer theoretical basis than geometric
  - Particularly when considering extensions & modifications
- Probabilistic models initially not very successful, but now widespread
- We look at "classical" probabilistic models up to BM25
- Next lecture, language models

(Discrete) Probability Recap

Random variable: a variable that can take on a value, e.g., $A \in \{0, 1\}$.
Probability distribution: likelihood of taking on a value, e.g., $P(A = 1)$, $P(A = 0)$. Shorthand: $P(A) \Rightarrow P(A = 1)$, $P(\bar{A}) \Rightarrow P(A = 0)$.
Joint probability distribution: the likelihood of several RVs each taking on specific values, e.g., $P(A, B)$.
Conditional probability distribution: given some side information, the likelihood of the RV taking on certain values, e.g., $P(A \mid B)$, $P(\bar{A} \mid B)$, $P(A \mid \bar{B})$, $P(\bar{A} \mid \bar{B})$.

Representing Probability Distributions as vectors and matrices

Volcanic eruption, $P(V)$:

  V       0        1
  P(V)    0.999    0.001

Cloudy given volcanic eruption, $P(C \mid V)$:

  C       V=0      V=1
  0       0.7      0.05
  1       0.3      0.95

Cloudy and volcanic eruption, $P(C, V)$:

  C       V=0      V=1
  0       0.699    0.00005
  1       0.300    0.00095

Basic Rules of Probability

Conditioning:
$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$
and, if $A$ and $B$ are independent,
$$P(A \mid B) = \frac{P(A) \times P(B)}{P(B)} = P(A)$$

Chain rule:
$$P(A, B, C) = P(A)\,P(B \mid A)\,P(C \mid A, B) = P(C)\,P(B \mid C)\,P(A \mid B, C) = \dots$$
Marginalisation property:
$$P(A) = \sum_{b} P(A, B = b) = P(A, B) + P(A, \bar{B})$$

Bayes Rule

The conditioning formulation and chain rule imply:
$$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(B \mid A)}{\sum_{X \in \{A, \bar{A}\}} P(B \mid X)\,P(X)} \cdot P(A)$$
Commonly referred to as "Bayes rule" or "Bayes theorem."

Volcano Example

If we know it's cloudy, was this caused by a volcano? I.e., what is $P(V \mid C)$?
$$P(V \mid C) = \frac{0.95 \times 0.001}{0.3 \times 0.999 + 0.95 \times 0.001} = 0.0032$$
Now 3× more likely than before!

Bayes theorem

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$
- $P(A)$ is the prior probability (distribution) of $A$
- We then observe $B$
- $P(B \mid A) / P(B)$ is the support that $B$ provides for $A$ (numerator and denominator are known as the likelihood and the evidence, resp.)
  - $P(B \mid A) / P(B) = 1$ if $A$ and $B$ are independent
- $P(A \mid B)$ is the posterior probability of $A$

Bayes theorem for relevance

$$P(R \mid d, q) = \frac{P(d \mid R, q) \cdot P(R \mid q)}{P(d \mid q)} \tag{1}$$
- $P(R \mid q)$ can be understood as the proportion of documents in the collection that are relevant to the query
- $P(d \mid R, q)$ is the probability that a (retrieved) document relevant to $q$ looks like $d$
- $P(d \mid q) = P(d \mid R, q)\,P(R \mid q) + P(d \mid \bar{R}, q)\,P(\bar{R} \mid q)$ is the probability of observing the (retrieved) document, regardless of relevance
  - Effectively a normaliser such that (1) is valid, i.e., $P(R \mid d, q) + P(\bar{R} \mid d, q) = 1$.

OK, but how do we go about estimating these values?

Rank-equivalence given query

Probability Ranking Principle (PRP)
- Assume output is a ranking
- Assume that the relevance of each document is independent
- Then the optimal ranking is by decreasing probability of relevance
- For ranking, we only care about
  - Relative probability for a given query
- This allows various simplifications to Equation 1
  - Provided they are monotonic, i.e., for transformation $f()$, $P(A) > P(B) \Rightarrow f(P(A)) > f(P(B))$, e.g., $f = \log$, $\pm$ constant, $\times$ a positive constant, ...
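
The following is a minimal Python sketch of Equation 1 and of rank-equivalence, not part of the original slides. The helper name `posterior`, the document identifiers, the per-document likelihoods and the prior `prior_rel` are all hypothetical values chosen for illustration; only the volcano numbers come from the lecture. The same function reproduces the volcano posterior and then shows that ranking by $P(R \mid d, q)$ and ranking by its logarithm give the same order.

```python
import math


def posterior(lik_pos, lik_neg, prior):
    """Bayes' rule for a binary hypothesis H given evidence e.

    lik_pos = P(e | H), lik_neg = P(e | not H), prior = P(H).
    The denominator is the marginal P(e), obtained by marginalisation.
    """
    evidence = lik_pos * prior + lik_neg * (1.0 - prior)
    return lik_pos * prior / evidence


# Volcano example from the slides: P(C|V) = 0.95, P(C|not V) = 0.3, P(V) = 0.001.
p_v_given_c = posterior(0.95, 0.3, 0.001)

# Hypothetical (P(d|R,q), P(d|not R,q)) pairs for three documents.
docs = {"d1": (0.020, 0.001), "d2": (0.005, 0.004), "d3": (0.010, 0.002)}
prior_rel = 0.01  # assumed P(R|q), identical for every document

scores = {d: posterior(lr, ln, prior_rel) for d, (lr, ln) in docs.items()}

# Ranking by probability and by a monotonic transform (log) of it.
by_prob = sorted(scores, key=scores.get, reverse=True)
by_log = sorted(scores, key=lambda d: math.log(scores[d]), reverse=True)
```

Because `log` is monotonic, `by_prob` and `by_log` contain the documents in the same order, which is what licenses the simplifications applied to Equation 1.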
Odds-based matching score

Take the odds ratio between relevance and irrelevance:
$$O(R \mid d, q) = \frac{P(R \mid d, q)}{P(\bar{R} \mid d, q)} = \frac{P(R \mid q)\,P(d \mid R, q) / P(d \mid q)}{P(\bar{R} \mid q)\,P(d \mid \bar{R}, q) / P(d \mid q)} = \frac{P(R \mid q)}{P(\bar{R} \mid q)} \cdot \frac{P(d \mid R, q)}{P(d \mid \bar{R}, q)} = O(R \mid q) \times \frac{P(d \mid R, q)}{P(d \mid \bar{R}, q)}$$
where $O(R \mid q)$ captures the first term, which does not involve $d$ (and can be ignored in ranking). We have removed 2 of the 3 terms from Equation 1.

Binary independence model

- How to estimate $P(d \mid R, q)$ and $P(d \mid \bar{R}, q)$?
- Must be based on attributes of $d$ and $q$

Binary independence model
Binary: Doc attributes are presence of terms (not frequency)
Independence: Term appearances independent given relevance

Represent:
- Document as binary vector $\vec{d}$
- Query as binary vector $\vec{q}$
over words.

BIM odds ratio

Under BIM, the odds ratio resolves to:
$$O(R \mid d, q) \propto \prod_{t=1}^{|T|} \frac{P(d_t \mid R, q)}{P(d_t \mid \bar{R}, q)} = \prod_{t: d_t = 1} \frac{P(d_t \mid R, q)}{P(d_t \mid \bar{R}, q)} \cdot \prod_{t: d_t = 0} \frac{P(\bar{d}_t \mid R, q)}{P(\bar{d}_t \mid \bar{R}, q)} = \prod_{t: d_t = 1} \frac{p_t}{u_t} \cdot \prod_{t: d_t = 0} \frac{1 - p_t}{1 - u_t}$$
where $p_t = P(d_t \mid R, q)$ and $u_t = P(d_t \mid \bar{R}, q)$.
Note similarity to Naive Bayes.

Query terms only

Assume non-query terms are equally likely to occur in relevant as in irrelevant documents
- presence or absence of these terms is irrelevant
- many factors cancel, leaving
$$O(R \mid d, q) \propto \prod_{t: q_t = d_t = 1} \frac{p_t}{u_t} \cdot \prod_{t: q_t = 1 \wedge d_t = 0} \frac{1 - p_t}{1 - u_t} = \prod_{t: d_t = q_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)} \cdot \prod_{t: q_t = 1} \frac{1 - p_t}{1 - u_t} \propto \prod_{t: d_t = q_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)}$$
where the last step drops the constant scaling factor.

Retrieval status value

The retrieval status value ($RSV_d$) is defined as the log transformation of the odds ratio (up to a scaling constant):
$$RSV_d = \log \prod_{t: d_t = q_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)} = \sum_{t: d_t = q_t = 1} \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)}$$
The RSV can now be used for ranking. Note the similarity with geometric vector space models; here the weight of term $t$ is
$$w_t = \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t}$$
as used in
$$RSV_d = \sum_{t: d_t = q_t = 1} w_t$$

Assessment-time estimation

- Term weights $w_t$ still depend upon the unknown $p_t = P(d_t \mid R, q)$
  and $u_t = P(d_t \mid \bar{R}, q)$.
- If we have relevance judgements, $p_t$ and $u_t$ can be estimated as
$$\hat{p}_t = \frac{1}{|R|} \sum_{d \in R} d_t, \qquad \hat{u}_t = \frac{1}{|R'|} \sum_{d \in R'} d_t$$
  with $R$ and $R'$ here denoting the sets of judged relevant and non-relevant documents.

But generally $R$ is unknown at retrieval time. How to estimate?

Retrieval-time estimation: $u_t$

Probability of term occurrence in a non-relevant document
- Assume relevant documents are rare
- If all documents in the collection are not relevant then $u_t = f_t / N$, where $f_t$ is the number of documents containing $t$ and $N$ is the collection size
- Overall, this gives
$$\log \frac{1 - u_t}{u_t} \approx \log \frac{N - f_t}{f_t} \approx \log \frac{N}{f_t}$$
- Look familiar?

Retrieval-time estimation: $p_t$

- Setting $\log p_t / (1 - p_t)$ to a constant
  - E.g., $p_t = 0.5$ removes the $p_t / (1 - p_t)$ term entirely
  - Relevance score of a doc is just the sum of IDFs
  - Plausible for binary model
- Or based on analysis of empirical data, e.g., Greiff, "A theory of term weighting", SIGIR, 1998

Summary: Binary independence model

- Document attributes in BIM are term occurrences $\in \{0, 1\}$
- Models $p_t = P(d_t = 1 \mid R, q)$, the chance of a relevant document containing $t$, and $u_t = P(d_t = 1 \mid \bar{R}, q)$, the chance of an irrelevant document containing $t$
- Weight $w_t$ of query term $t$ occurring in document $d$ is then:
$$w_t = \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t}$$
If $p_t = 0.5$, this approximates IDF.

Okapi BM25

- Inspired by the BIM probabilistic formulation
- Why not try other alternatives for term weighting? Can we capture various aspects in a simple formula?
  - idf
  - tf
  - document length
  - query tf
- Then seek to tune each component ...
- Results in effective and widely used models, Okapi BM25

$$w_t = \underbrace{\log \frac{N - f_t + 0.5}{f_t + 0.5}}_{\text{idf}} \times \underbrace{\frac{(k_1 + 1)\, f_{d,t}}{k_1 \left( (1 - b) + b \frac{L_d}{L_{ave}} \right) + f_{d,t}}}_{\text{tf and doc. length}} \times \underbrace{\frac{(k_3 + 1)\, f_{q,t}}{k_3 + f_{q,t}}}_{\text{query tf}}$$

- Parameters $k_1$, $b$, and $k_3$ need to be tuned ($k_3$ only for very long queries).
  - $k_1 = 1.5$ and $b = 0.75$ are common defaults.
- BM25 is highly effective, and the most widely used weighting in IR

Robertson, Walker et al. (1994, 1998).

What have we achieved?
Pros
- Started from a plausible probabilistic model of term distribution
- Shown how it can be made to fit something like TF*IDF
- Providing a probabilistic justification for TF*IDF-like approaches

Cons
- Directly trying to estimate $P(f_{d,t} \mid R)$ is not practicable in retrieval (too many parameters, not enough evidence)
- Such approaches end up as ad hoc as the geometric model
- Progress requires letting the query tell us what relevance looks like
- This is the approach of language models

Looking back and forward

Back
- Probabilistic IR models estimate $P(R \mid d, q)$ (or a monotonic function thereof)
- Probability derived from attributes (term occurrences) of documents
- BIM 'naive Bayes' assumptions
  - Binary attributes (term occurs or doesn't)
  - Term occurrences independent
- Provides probabilistic justification for IDF
- BM25 builds on this using a heuristic weighting formula to account for term frequency, document length, query term frequency, etc.

Forward
- Language models: an alternative probabilistic IR framework

Further reading

- Chapter 11, "Probabilistic information retrieval" of Manning, Raghavan, and Schütze, Introduction to Information Retrieval. http://nlp.stanford.edu/IR-book/pdf/11prob.pdf
- Sparck Jones, Walker, and Robertson, "A Probabilistic Model of Information Retrieval", IPM, 2000.
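
As a closing illustration that is not part of the original slides, the Okapi BM25 weight from this lecture can be sketched in Python roughly as follows. The function name `bm25_score`, the toy collection and the query are hypothetical. The defaults `k1=1.5` and `b=0.75` follow the slides, while `k3=0.0` is an assumption made here so that the query-tf component equals 1 for short queries. The score of a document is taken as the sum of $w_t$ over query terms that occur in it, mirroring the RSV sum.

```python
import math
from collections import Counter


def bm25_score(query, doc, doc_freq, n_docs, avg_len, k1=1.5, b=0.75, k3=0.0):
    """Sum of BM25 term weights over query terms present in the document."""
    f_d = Counter(doc)    # within-document term frequencies f_{d,t}
    f_q = Counter(query)  # within-query term frequencies f_{q,t}
    score = 0.0
    for t, fqt in f_q.items():
        fdt = f_d.get(t, 0)
        if fdt == 0:
            continue  # only terms occurring in both query and document contribute
        ft = doc_freq[t]
        idf = math.log((n_docs - ft + 0.5) / (ft + 0.5))
        tf = (k1 + 1) * fdt / (k1 * ((1 - b) + b * len(doc) / avg_len) + fdt)
        qtf = (k3 + 1) * fqt / (k3 + fqt)
        score += idf * tf * qtf
    return score


# Hypothetical toy collection, tokenised by whitespace.
docs = {
    "d1": "okapi bm25 ranking function".split(),
    "d2": "probabilistic ranking principle for retrieval".split(),
    "d3": "binary independence model".split(),
    "d4": "language models for retrieval".split(),
    "d5": "vector space model".split(),
}
n_docs = len(docs)
avg_len = sum(len(d) for d in docs.values()) / n_docs
# Document frequency f_t: number of documents containing term t.
doc_freq = Counter(t for d in docs.values() for t in set(d))

query = "ranking retrieval".split()
scores = {name: bm25_score(query, d, doc_freq, n_docs, avg_len)
          for name, d in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Note that the idf component becomes negative when a term occurs in more than half of the documents; implementations differ in how they handle that case, and this sketch leaves it as is.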