Lecture 9: Probabilistic IR
The Binary Independence Model and Okapi
BM25
Trevor Cohn
(Slide credits: William Webber)
COMP90042, 2015, Semester 1
What we’ll learn in this lecture

Probabilistic models for IR
- As a model of relevance, P(R|d, q)
- Binary independence model
- Okapi BM25
Probabilistic vs. geometric models

Fundamental calculations:
- Geometric: how similar, sim(d, q), is document d to query q?
- Probabilistic: what is the probability P(R = 1|d, q) that d is relevant to q?

Different starting points, but leading to similar formulations.
Probabilistic models

Probabilistic models:
- Clearer theoretical basis than geometric models
- Particularly when considering extensions & modifications
- Probabilistic models initially not very successful, but now widespread
- We look at “classical” probabilistic models up to BM25
- Next lecture, language models
(Discrete) Probability Recap

Random variable: a variable that can take on a value, e.g., A ∈ {0, 1}
Probability distribution: likelihood of taking on each value, e.g., P(A = 1), P(A = 0)
    Shorthand: P(A) ⇒ P(A = 1), P(Ā) ⇒ P(A = 0).
Joint probability distribution: the likelihood of several RVs each taking on specific values. E.g., P(A, B).
Conditional probability distribution: given some side information, the likelihood of the RV taking on certain values. E.g., P(A|B), P(Ā|B), P(A|B̄), P(Ā|B̄).
Representing Probability Distributions as vectors and matrices

Volcanic eruption (V):

    V    P(V)
    0    0.999
    1    0.001

Cloudy given volcanic eruption, P(C|V):

    C    V=0    V=1
    0    0.7    0.05
    1    0.3    0.95

Cloudy and volcanic eruption, P(C, V):

    C    V=0      V=1
    0    0.699    0.00005
    1    0.300    0.00095
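The joint table above follows from the prior and conditional by the chain rule. A minimal sketch (not from the slides) checking this numerically, using the values in the tables:

```python
p_v = {0: 0.999, 1: 0.001}          # prior P(V)
p_c_given_v = {                     # conditional P(C|V), indexed [c][v]
    0: {0: 0.7, 1: 0.05},
    1: {0: 0.3, 1: 0.95},
}

# Joint by the chain rule: P(C = c, V = v) = P(C = c | V = v) * P(V = v)
p_cv = {(c, v): p_c_given_v[c][v] * p_v[v]
        for c in (0, 1) for v in (0, 1)}

print(round(p_cv[(0, 0)], 4))   # 0.6993 (0.699 in the table)
print(round(p_cv[(1, 1)], 5))   # 0.00095
```

Summing the four joint entries gives 1, as a valid distribution must.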
Basic Rules of Probability

Conditioning

    P(A|B) = P(A, B) / P(B)
           = P(A) × P(B) / P(B) = P(A)    if A and B are independent

Chain rule

    P(A, B, C) = P(A) P(B|A) P(C|A, B)
               = P(C) P(B|C) P(A|B, C)
               = ...

Marginalisation property

    P(A) = Σ_b P(A, B = b) = P(A, B) + P(A, B̄)
Bayes Rule

The conditioning formulation and chain rule imply:

    P(A|B) = P(A, B) / P(B)
           = P(B|A) P(A) / P(B)
           = P(B|A) / (Σ_{X ∈ {A, Ā}} P(B|X) P(X)) · P(A)

Commonly referred to as “Bayes rule” or “Bayes theorem.”
Volcano Example

If we know it’s cloudy, was this caused by a volcano? I.e., what is P(V|C)?

    P(V|C) = (0.95 × 0.001) / (0.3 × 0.999 + 0.95 × 0.001) = 0.0032

Now 3× more likely than before!
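The volcano calculation can be sketched directly from Bayes rule (a minimal sketch, not from the slides; the probabilities are those in the tables above):

```python
p_v = 0.001            # prior: P(V = 1)
p_c_given_v = 0.95     # P(C = 1 | V = 1)
p_c_given_not_v = 0.3  # P(C = 1 | V = 0)

# Evidence by marginalisation: P(C = 1) = sum over v of P(C = 1 | V = v) P(V = v)
evidence = p_c_given_v * p_v + p_c_given_not_v * (1 - p_v)

# Posterior by Bayes rule: P(V = 1 | C = 1)
posterior = p_c_given_v * p_v / evidence

print(round(posterior, 4))        # 0.0032
print(round(posterior / p_v, 1))  # 3.2: about 3x the prior
```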
Bayes theorem

    P(A|B) = P(B|A) / P(B) · P(A)

- P(A) is the prior probability (distribution) of A
- We then observe B
- P(B|A)/P(B) is the support that B provides for A (known as the likelihood and the evidence, resp.)

    P(B|A) / P(B) = 1    if A and B are independent

- P(A|B) is the posterior probability of A
Bayes theorem for relevance

    P(R|d, q) = P(d|R, q) / P(d|q) · P(R|q)    (1)

- P(R|q) can be understood as the proportion of documents in the collection that are relevant to the query
- P(d|R, q) is the probability that a (retrieved) document relevant to q looks like d
- P(d|q) = P(d|R, q) P(R|q) + P(d|R̄, q) P(R̄|q) is the probability of observing a (retrieved) document, regardless of relevance
  - Effectively a normaliser such that (1) is valid, P(R|d, q) + P(R̄|d, q) = 1.

OK, but how do we go about estimating these values?
Rank-equivalence given query

Probability Ranking Principle (PRP)
- Assume output is a ranking
- Assume that the relevance of each document is independent
- Then the optimal ranking is by decreasing probability of relevance

For ranking, we only care about
- Relative probability for a given query
- This allows various simplifications to Equation 1
  - Provided they are monotonic, i.e., for transformation f(),
    P(A) > P(B) ⇒ f(P(A)) > f(P(B))
    e.g., f = log, ± a constant, × a positive constant, ...
Odds-based matching score

Take the odds ratio between relevance and irrelevance:

    O(R|d, q) = P(R|d, q) / P(R̄|d, q)
              = [P(R|q) P(d|R, q) / P(d|q)] / [P(R̄|q) P(d|R̄, q) / P(d|q)]
              = P(R|q) / P(R̄|q) · P(d|R, q) / P(d|R̄, q)
              = O(R|q) × P(d|R, q) / P(d|R̄, q)

where O(R|q) captures the first term, which does not involve d (and can be ignored in ranking). We have removed 2 of the 3 terms from Equation 1.
Binary independence model

- How to estimate P(d|R, q) and P(d|R̄, q)?
- Must be based on attributes of d and q

Binary independence model:
- Binary: doc attributes are the presence of terms (not frequency)
- Independence: term appearances are independent given relevance

Represent:
- Document as a binary vector ~d
- Query as a binary vector ~q
over words.
BIM odds ratio

Under BIM, the odds ratio resolves to:

    O(R|d, q) ∝ ∏_{t=1}^{|T|} P(d_t|R, q) / P(d_t|R̄, q)
              = ∏_{t: d_t=1} P(d_t|R, q) / P(d_t|R̄, q) × ∏_{t: d_t=0} P(d̄_t|R, q) / P(d̄_t|R̄, q)
              = ∏_{t: d_t=1} p_t / u_t × ∏_{t: d_t=0} (1 − p_t) / (1 − u_t)

where p_t = P(d_t|R, q) and u_t = P(d_t|R̄, q).
Note the similarity to Naive Bayes.
Query terms only

Assume non-query terms are equally likely to occur in relevant as in irrelevant documents:
- presence or absence of these terms is irrelevant
- many factors cancel, leaving

    O(R|d, q) ∝ ∏_{t: q_t=d_t=1} p_t / u_t × ∏_{t: q_t=1 ∧ d_t=0} (1 − p_t) / (1 − u_t)
              = ∏_{t: q_t=d_t=1} p_t(1 − u_t) / (u_t(1 − p_t)) × ∏_{t: q_t=1} (1 − p_t) / (1 − u_t)
              ∝ ∏_{t: q_t=d_t=1} p_t(1 − u_t) / (u_t(1 − p_t))

where the last step drops the constant scaling factor.
Retrieval status value

The retrieval status value (RSV_d) is defined as the log transformation of the odds ratio¹:

    RSV_d = log ∏_{t: d_t=q_t=1} p_t(1 − u_t) / (u_t(1 − p_t))
          = Σ_{t: d_t=q_t=1} log [p_t(1 − u_t) / (u_t(1 − p_t))]

The RSV can now be used for ranking. Note the similarity with geometric vector space models; here the weight of term t is:

    w_t = log p_t/(1 − p_t) + log (1 − u_t)/u_t

as used in RSV_d = Σ_{t: d_t=q_t=1} w_t.

¹ up to a scaling constant
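The RSV is just a sum of per-term log-odds weights over terms shared by query and document. A minimal sketch (not from the slides); the p_t and u_t values are made up for illustration, since estimating them is the subject of the following slides:

```python
import math

def rsv(query_terms, doc_terms, p, u):
    """Sum the log-odds weights w_t over terms present in both query and doc."""
    score = 0.0
    for t in query_terms & doc_terms:
        # w_t = log p_t/(1 - p_t) + log (1 - u_t)/u_t
        w_t = math.log(p[t] / (1 - p[t])) + math.log((1 - u[t]) / u[t])
        score += w_t
    return score

# Illustrative (made-up) estimates for two query terms
p = {"volcano": 0.6, "eruption": 0.5}    # p_t = P(d_t = 1 | R, q)
u = {"volcano": 0.01, "eruption": 0.02}  # u_t = P(d_t = 1 | not-R, q)

doc = {"volcano", "ash", "cloud"}
print(round(rsv({"volcano", "eruption"}, doc, p, u), 3))  # only "volcano" matches
```

Only matching terms contribute, so documents sharing no query terms score 0.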
Assessment-time estimation

- Term weight w_t still depends upon the unknown p_t = P(d_t|R, q) and u_t = P(d_t|R̄, q).
- If we have relevance judgements, p_t and u_t can be estimated as

    p̂_t = (1/|R|) Σ_{d∈R} d_t
    û_t = (1/|R̄|) Σ_{d∈R̄} d_t

But generally R is unknown at retrieval time. How to estimate?
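The estimates above are simple proportions: the fraction of judged-relevant (resp. judged-non-relevant) documents containing the term. A minimal sketch (not from the slides), with documents represented as sets of terms, matching the BIM's binary attributes:

```python
def estimate(term, relevant, non_relevant):
    """Plain (unsmoothed) estimates of p_t and u_t from relevance judgements."""
    p_hat = sum(term in d for d in relevant) / len(relevant)
    u_hat = sum(term in d for d in non_relevant) / len(non_relevant)
    return p_hat, u_hat

# Tiny made-up judged collection
relevant = [{"volcano", "ash"}, {"volcano", "cloud"}]
non_relevant = [{"cloud"}, {"rain"}, {"cloud", "rain"}]

print(estimate("volcano", relevant, non_relevant))  # (1.0, 0.0)
```

Note that estimates of 0 or 1 (as here) make the log-odds weight undefined; in practice the counts are smoothed, e.g. by adding a small constant such as 0.5, as in the BM25 idf below.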
Retrieval-time estimation: u_t

Probability of term occurrence in a non-relevant document:
- Assume relevant documents are rare
- If all documents in the collection are treated as non-relevant, then

    u_t = f_t / N

- Overall, this gives

    log (1 − u_t)/u_t = log (N − f_t)/f_t ≈ log N/f_t

- Look familiar?
Retrieval-time estimation: p_t

- Setting log p_t/(1 − p_t) to a constant
  - E.g., p_t = 0.5 removes the p_t/(1 − p_t) term entirely
  - Relevance score of a document is then just the sum of IDFs
  - Plausible for a binary model
- Or based on analysis of empirical data, e.g., Greiff, “A theory of term weighting”, SIGIR, 1998
Summary: Binary independence model

- Document attributes in BIM are term occurrences ∈ {0, 1}
- Models p_t = P(d_t = 1|R, q), the chance of a relevant document containing t, and u_t = P(d_t = 1|R̄, q), the chance of an irrelevant document containing t
- Weight w_t of query term t occurring in document d is then:

    w_t = log p_t/(1 − p_t) + log (1 − u_t)/u_t

If p_t = 0.5, this approximates IDF.
Okapi BM25

- Inspired by the BIM probabilistic formulation
- Why not try other alternatives for term weighting?
- Can we capture various aspects in a simple formula?
  - idf
  - tf
  - document length
  - query tf
- Then seek to tune each component ...
- Results in an effective and widely used model, Okapi BM25
Okapi BM25

    w_t = log [(N − f_t + 0.5) / (f_t + 0.5)]                                (idf)
          × (k1 + 1) f_{d,t} / (k1((1 − b) + b L_d/L_ave) + f_{d,t})         (tf and doc. length)
          × (k3 + 1) f_{q,t} / (k3 + f_{q,t})                                (query tf)

- Parameters k1, b, and k3 need to be tuned (k3 only for very long queries).
- k1 = 1.5 and b = 0.75 are common defaults.
- BM25 is highly effective, and the most widely used weighting in IR
- Robertson, Walker et al. (1994, 1998).
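The formula above can be sketched directly (a minimal sketch, not from the slides), with the default parameters k1 = 1.5 and b = 0.75 and the k3 (query tf) factor omitted, as is usual for short queries; the collection statistics are made up for illustration:

```python
import math

def bm25_weight(N, f_t, f_dt, L_d, L_ave, k1=1.5, b=0.75):
    """BM25 weight of term t in a document: idf x length-normalised tf."""
    idf = math.log((N - f_t + 0.5) / (f_t + 0.5))
    tf = (k1 + 1) * f_dt / (k1 * ((1 - b) + b * L_d / L_ave) + f_dt)
    return idf * tf

def bm25_score(query, doc_tf, doc_len, N, df, L_ave):
    """Score a document against a query: sum of weights over query terms."""
    return sum(bm25_weight(N, df[t], doc_tf.get(t, 0), doc_len, L_ave)
               for t in query if t in df)

# Toy collection statistics (made up): N docs, per-term document frequencies
N, L_ave = 1000, 100
df = {"volcano": 10, "cloud": 300}

# One document: term frequencies and length
doc_tf = {"volcano": 3, "cloud": 1}
print(round(bm25_score(["volcano", "cloud"], doc_tf, 120, N, df, L_ave), 3))
```

The rare term ("volcano") dominates the score: its idf is much larger, and its higher within-document frequency is boosted further by the tf factor, saturating as f_{d,t} grows.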
What have we achieved?

Pros
- Started from a plausible probabilistic model of term distribution
- Shown how it can be made to fit something like TF*IDF
- Providing a probabilistic justification for TF*IDF-like approaches

Cons
- Directly trying to estimate P(f_{d,t}|R) is not practicable in retrieval (too many parameters, not enough evidence)
- Such approaches end up as ad hoc as the geometric model
- Progress requires letting the query tell us what relevance looks like
- This is the approach of language models
Looking back and forward

Back
- Probabilistic IR models estimate P(R|d, q) (or a monotonic function thereof)
- Probability derived from attributes (term occurrences) of documents
- BIM ‘naive Bayes’ assumptions
  - Binary attributes (term occurs or doesn’t)
  - Term occurrences independent
- Provides a probabilistic justification for IDF
- BM25 builds on this using a heuristic weighting formula to account for term frequency, document length, query term frequency etc.
Looking back and forward

Forward
- Language models: an alternative probabilistic IR framework

Further reading
- Chapter 11, “Probabilistic information retrieval”, of Manning, Raghavan, and Schütze, Introduction to Information Retrieval. http://nlp.stanford.edu/IR-book/pdf/11prob.pdf
- Sparck Jones, Walker, and Robertson, “A Probabilistic Model of Information Retrieval”, IPM, 2000.