
Introduction to Markov Models

But first, a few preliminaries:
Estimating the probability of phrases of words, sentences, etc.…

Corrected 3/30
What’s a Word? Tokenization Is Tricky…
How to find Sentences??
Q1: How to estimate the probability of a given sentence?

A crucial step in speech recognition (and lots of other applications)

First guess: products of unigrams
   $\hat{P}(W) = \prod_{w \in W} P(w)$

Proposed word lattice:
   form   subsidy     for
   farm   subsidies   far

Unigram counts (in 1.7 * 10^6 words of AP text):
   form 183   subsidy 15     for 18185
   farm 74    subsidies 55   far 570

Not quite right…

Predicting a word sequence II

Next guess: products of bigrams
 • For W = w1w2w3… wn,  $\hat{P}(W) = \prod_{i=1}^{n-1} P(w_i w_{i+1})$

Proposed word lattice:
   form   subsidy     for
   farm   subsidies   far

Bigram counts (in 1.7 * 10^6 words of AP text):
   form subsidy 0     subsidy for 2
   form subsidies 0   subsidy far 0
   farm subsidy 0     subsidies for 6
   farm subsidies 4   subsidies far 0

Just right…
But the counts are tiny! Why?
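To make the two guesses concrete, here is a minimal Python sketch (my own illustration, not course code) that scores two lattice paths with products of unigrams and products of bigrams, using the AP counts shown above and the 1.7 * 10^6 token total as the denominator:

```python
# Sketch: score lattice paths with products of unigrams vs. products of bigrams,
# using the AP counts shown on the slide (about 1.7 million tokens).
N = 1.7e6  # total tokens in the AP sample

unigram = {"form": 183, "farm": 74, "subsidy": 15,
           "subsidies": 55, "for": 18185, "far": 570}
bigram = {("form", "subsidy"): 0, ("form", "subsidies"): 0,
          ("farm", "subsidy"): 0, ("farm", "subsidies"): 4,
          ("subsidy", "for"): 2, ("subsidy", "far"): 0,
          ("subsidies", "for"): 6, ("subsidies", "far"): 0}

def p_unigram_product(words):
    """First guess: P-hat(W) = product over w in W of P(w)."""
    p = 1.0
    for w in words:
        p *= unigram[w] / N
    return p

def p_bigram_product(words):
    """Next guess: P-hat(W) = product over adjacent pairs of P(w_i w_{i+1})."""
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bigram[(w1, w2)] / N
    return p

for path in [("farm", "subsidies", "for"), ("form", "subsidy", "for")]:
    print(path, p_unigram_product(path), p_bigram_product(path))
```

The unigram product always prefers the globally frequent "for" over "far" no matter what came before it; the bigram product does use adjacency, but most of the bigram counts are already 0.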
How can we estimate P correctly?

Problem: Naïve Bayes model for bigrams violates independence assumptions.

Let's do this right….
Let W = w1w2w3… wn. Then, by the chain rule,
   $P(W) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \ldots w_{n-1})$
We can estimate P(w2|w1) by the Maximum Likelihood Estimator
   $\frac{Count(w_1 w_2)}{Count(w_1)}$
and P(w3|w1w2) by
   $\frac{Count(w_1 w_2 w_3)}{Count(w_1 w_2)}$
and so on…

and finally, Estimating P(wn|w1w2…wn-1)

Again, we can estimate P(wn|w1w2…wn-1) with the MLE
   $\frac{Count(w_1 w_2 \ldots w_n)}{Count(w_1 w_2 \ldots w_{n-1})}$
So to decide pat vs. pot in "Heat up the oil in a large p?t", compute for pot
   $\frac{Count(\text{"Heat up the oil in a large pot"})}{Count(\text{"Heat up the oil in a large"})} = \frac{0}{0}$
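A short sketch of the chain-rule decomposition with full-history MLE estimates; `ngram_counts` and `chain_rule_mle` are hypothetical helper names of mine. The point is that with long histories the counts (and hence the ratios) collapse to 0, as in the 0/0 example above:

```python
# Sketch: chain-rule P(W) with full-history MLE estimates.
from collections import Counter

def ngram_counts(corpus_tokens, max_n):
    """Count all n-grams up to length max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(corpus_tokens) - n + 1):
            counts[tuple(corpus_tokens[i:i + n])] += 1
    return counts

def chain_rule_mle(sentence, counts, total_tokens):
    """P(W) = P(w1) * P(w2|w1) * ... * P(wn|w1..wn-1), each term an MLE ratio."""
    p = counts[(sentence[0],)] / total_tokens
    for i in range(1, len(sentence)):
        history, full = tuple(sentence[:i]), tuple(sentence[:i + 1])
        if counts[history] == 0:
            return 0.0   # in practice the ratio is undefined (0/0) -- sparse data
        p *= counts[full] / counts[history]
    return p
```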
Hmm.. The Web Changes Things (2008 or so)

Statistics and the Web II

So, P("pot" | "heat up the oil in a large ___") = 8/49 ≈ 0.16
Even the web in 2008 yields low counts!

But the web has grown!!!
…. 165/891 ≈ 0.185
So….

A larger corpus won't help much unless it's HUGE ….
   but the web is!!!
What if we only have 100 million words for our estimates??

A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with 100 million words of training data??
Assuming (for now) a uniform distribution over only 5000 words
So even with 10^8 words of data, for even trigrams we encounter the sparse data problem…..
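A quick back-of-the-envelope check under the slide's assumptions (5000-word vocabulary, 10^8 training tokens); the comparison itself is my own arithmetic:

```python
# BOTEC: number of n-gram parameters vs. amount of training data,
# under the slide's assumption of a uniform 5000-word vocabulary.
V = 5_000               # vocabulary size assumed on the slide
tokens = 100_000_000    # 10^8 words of training data

for n in (1, 2, 3):
    params = V ** n     # distinct n-grams whose probabilities we must estimate
    print(f"{n}-grams: {params:.3g} parameters, "
          f"{tokens / params:.3g} training tokens per parameter")
# trigrams: 1.25e+11 parameters, i.e. roughly 0.0008 tokens per parameter --
# most trigrams can never be observed even once: the sparse data problem.
```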
The Markov Assumption: Only the Immediate Past Matters
The Markov Assumption: Estimation

We estimate the probability of each wi given the previous context by
   $P(w_i \mid w_1 w_2 \ldots w_{i-1}) = P(w_i \mid w_{i-1})$
which can be estimated by
   $\frac{Count(w_{i-1} w_i)}{Count(w_{i-1})}$
So we're back to counting only unigrams and bigrams!!
AND we have a correct practical estimation method for P(W) given the Markov assumption!
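A minimal sketch of this estimation recipe: count unigrams and bigrams once, then estimate P(wi | wi-1) as Count(wi-1 wi) / Count(wi-1). Function names are mine:

```python
# Sketch: bigram MLE under the Markov assumption.
from collections import Counter

def train_bigram_mle(tokens):
    """Count unigrams and bigrams once over a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_cond(w, prev, unigrams, bigrams):
    """P(w | prev) estimated as Count(prev w) / Count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_sentence(sentence, unigrams, bigrams, total_tokens):
    """P(W) = P(w1) * product over i of P(wi | wi-1), under the Markov assumption."""
    p = unigrams[sentence[0]] / total_tokens
    for prev, w in zip(sentence, sentence[1:]):
        p *= p_cond(w, prev, unigrams, bigrams)
    return p
```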
Markov Models

Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method

To generate a sequence of n words given unigram estimates:
 • Fix some ordering of the vocabulary v1 v2 v3 … vk.
 • For each word wi, 1 ≤ i ≤ n
   — Choose a random value ri between 0 and 1
   — wi = the first vj such that $\sum_{m=1}^{j} P(v_m) \ge r_i$
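A sketch of this unigram procedure (names are mine): draw r uniformly between 0 and 1 and take the first vj whose cumulative probability reaches r.

```python
# Sketch: Shannon/Miller/Selfridge generation from unigram estimates.
import random

def generate_unigram(n, vocab, prob):
    """vocab: fixed ordering v1..vk; prob[v]: estimated unigram probability P(v)."""
    words = []
    for _ in range(n):
        r = random.random()          # random value between 0 and 1
        cumulative = 0.0
        for v in vocab:
            cumulative += prob[v]
            if cumulative >= r:      # first vj with sum_{m<=j} P(vm) >= r
                words.append(v)
                break
        else:                        # guard against floating-point round-off
            words.append(vocab[-1])
    return words
```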
Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method

To generate a sequence of n words given a 1st order Markov model (i.e. conditioned on one previous word):
 • Fix some ordering of the vocabulary v1 v2 v3 … vk.
 • Use the unigram method to generate an initial word w1
 • For each remaining wi, 2 ≤ i ≤ n
   — Choose a random value ri between 0 and 1
   — wi = the first vj such that $\sum_{m=1}^{j} P(v_m \mid w_{i-1}) \ge r_i$
The Shannon/Miller/Selfridge method trained on Shakespeare
(This and the next two slides are from Jurafsky)

Wall Street Journal just isn't Shakespeare
Shakespeare as corpus

N = 884,647 tokens, V = 29,066
Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
 • So 99.96% of the possible bigrams were never seen (have zero entries in the table)
Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare
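A quick check of the slide's arithmetic (my own verification):

```python
# Check the slide's numbers: possible bigrams, and the fraction never seen.
V = 29_066
seen_bigram_types = 300_000
possible = V ** 2                                              # ~844.8 million
print(f"{possible:,} possible bigrams")                        # 844,832,356
print(f"{1 - seen_bigram_types / possible:.2%} never seen")    # ~99.96%
```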
The Sparse Data Problem Again
English word frequencies well described by Zipf's Law

Zipf (1949) characterized the relation between word frequency and rank* as:
   $f \cdot r \approx C$ (for constant C)
   $r \approx C / f$
   $\log(r) \approx \log(C) - \log(f)$
Purely Zipfian data plots as a straight line on a log-log scale
How likely is a 0 count? Much more likely than I let on!!!

*Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f).
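A sketch (names are mine) of how one might eyeball Zipf's law on a corpus: rank words by decreasing frequency and check that rank times frequency stays roughly constant.

```python
# Sketch: check whether rank * frequency stays roughly constant (Zipf's law).
from collections import Counter

def zipf_check(tokens, top=20):
    freqs = sorted(Counter(tokens).values(), reverse=True)
    # rank r = position in the list sorted by decreasing frequency f
    for r, f in enumerate(freqs[:top], start=1):
        print(f"rank {r:3d}   freq {f:7d}   r*f = {r * f}")
    # For Zipfian data, a plot of log(f) against log(r) is close to a straight line.
```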
Word frequency & rank in Brown Corpus vs Zipf
(From: Interactive Mathematics, http://www.intmath.com)

Zipf's law for the Brown corpus
Lots of area under the tail of this curve!
Smoothing
At least one unknown word is likely per sentence given Zipf!!
 To fix 0’s caused by this, we can smooth the data.
• Assume we know how many types never occur in the data.
• Steal probability mass from types that occur at least once.
• Distribute this probability mass over the types that never occur.
Smoothing

"This black art is why NLP is taught in the engineering school." – Jason Eisner
Smoothing

….is like Robin Hood:
 • it steals from the rich
 • and gives to the poor

Solution: Add-One Smoothing (again)

Pro: Very simple technique
Cons:
 • Probability of frequent n-grams is underestimated
 • Probability of rare (or unseen) n-grams is overestimated
 • Therefore, too much probability mass is shifted towards unseen n-grams
 • All unseen n-grams are smoothed in the same way
Using a smaller added count improves things, but only somewhat
More advanced techniques (Kneser-Ney, Witten-Bell) use properties of the component (n-1)-grams and the like...
   (Hint for this homework)
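A minimal sketch of add-one smoothing for bigram probabilities, with the smaller-added-count ("add-delta") variant the slide alludes to; names and the toy corpus are mine, and the more advanced techniques mentioned above are not shown:

```python
# Sketch: add-one (and add-delta) smoothed bigram probability.
from collections import Counter

def p_add_delta(w, prev, unigrams, bigrams, V, delta=1.0):
    """P(w | prev) = (Count(prev w) + delta) / (Count(prev) + delta * V).

    delta = 1 is classic add-one smoothing; a smaller delta shifts less
    probability mass onto unseen bigrams, but every unseen bigram still
    gets exactly the same share (the weakness noted above).
    """
    return (bigrams[(prev, w)] + delta) / (unigrams[prev] + delta * V)

# Toy example (hypothetical mini-corpus):
tokens = "heat up the oil in a large pot and heat up the pan".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)                       # vocabulary size seen in the corpus
print(p_add_delta("pot", "large", unigrams, bigrams, V))   # seen bigram
print(p_add_delta("pat", "large", unigrams, bigrams, V))   # unseen bigram, nonzero now
```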