
Hidden Markov Models
for
Information Extraction
CSE 454
Course Overview
Info Extraction
Data Mining
E-commerce
P2P
Security
Web Services
Semantic Web
Case Studies: Nutch, Google, AltaVista
Information Retrieval
Precision vs Recall
Inverted Indices
Crawler Architecture
Synchronization & Monitors
Systems Foundation: Networking & Clusters
© Daniel S. Weld
2
What is Information Extraction (IE)?
• The task of populating database slots with
corresponding phrases from text
Slide by Okan Basegmez © Daniel S. Weld
3
What are HMMs?
• An HMM is a finite state automaton with
stochastic transitions and symbol emissions.
(Rabiner 1989)
© Daniel S. Weld
4
Why use HMMs for IE?
• Strong statistical foundations
• Well suited to natural language domains
• Handle new data robustly
• Computationally efficient to develop
Disadvantages
• Require an a priori notion of model topology
• Require large amounts of training data
Slide by Okan Basegmez © Daniel S. Weld
5
Defn: Markov Model
Q: set of states
π: initial probability distribution
A: transition probability distribution
[Figure: transition matrix A with rows and columns s0 … s6;
entry p12 is the probability of transitioning from s1 to s2;
each row of A sums to 1]
© Daniel S. Weld
6
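To make the (Q, π, A) definition concrete, here is a minimal Python/NumPy sketch of a three-state Markov model; the state names and probabilities are made up for illustration and are not from the slides.

```python
import numpy as np

# Hypothetical three-state Markov model: Q, initial distribution pi,
# and transition matrix A (row i = Pr(next state | current state = si)).
states = ["s0", "s1", "s2"]
pi = np.array([0.6, 0.3, 0.1])
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])   # each row sums to 1

def sample_states(length, rng=None):
    """Sample a state sequence s_1 ... s_length from (pi, A)."""
    rng = rng or np.random.default_rng(0)
    seq = [rng.choice(len(states), p=pi)]
    for _ in range(length - 1):
        seq.append(rng.choice(len(states), p=A[seq[-1]]))
    return [states[i] for i in seq]

print(sample_states(5))   # e.g. ['s0', 's0', 's1', ...]
```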
E.g. Predict Web Behavior
When will a visitor leave the site?
Q: set of states (pages)
π: initial probability distribution (likelihood of site entry point)
A: transition probability distribution (user navigation model)
© Daniel S. Weld
7
Diversion: Relational Markov Models
[Figure: average negative log likelihood vs. number of training examples
(10 to 1,000,000) for RMM-uniform, RMM-rank, RMM-PET, and PMM]
© Daniel S. Weld
8
Probability Distribution, A
• Forward Causality
  The probability of s_t does not depend directly
  on values of future states.
• The probability of a new state could depend on
  the entire history of states visited:
  Pr(s_t | s_{t-1}, s_{t-2}, …, s_0)
• Markovian Assumption
  Pr(s_t | s_{t-1}, s_{t-2}, …, s_0) = Pr(s_t | s_{t-1})
• Stationary Model Assumption
  Pr(s_t | s_{t-1}) = Pr(s_k | s_{k-1}) for all k
© Daniel S. Weld
9
Defn: Hidden Markov Model
Q: set of states (hidden!)
π: initial probability distribution
A: transition probability distribution
O: set of possible observations
b_i(o_t): probability of s_i emitting o_t
© Daniel S. Weld
10
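The definition above maps directly onto a small data structure. A purely illustrative sketch in Python/NumPy (the field names are mine, not from the slides):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class HMM:
    """Container mirroring the slide's definition: lambda = (A, B, pi) over Q and O."""
    states: list        # Q: hidden states
    symbols: list       # O: possible observation symbols
    pi: np.ndarray      # pi[i]   = P(q_1 = s_i)
    A: np.ndarray       # A[i, j] = P(q_{t+1} = s_j | q_t = s_i)
    B: np.ndarray       # B[i, k] = b_i(v_k), probability of s_i emitting symbol v_k
```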
HMMs and their Usage
• HMMs are very common in Computational Linguistics:
Speech recognition
• (observed: acoustic signal, hidden: words)
Handwriting recognition
• (observed: image, hidden: words)
Part-of-speech tagging
• (observed: words, hidden: part-of-speech tags)
Machine translation
• (observed: foreign words, hidden: words in target language)
Information Extraction
• (observed: words of the document, hidden: target vs. non-target states)
Slide by Bonnie Dorr © Daniel S. Weld
11
Information Extraction with HMMs
• Example - Research Paper Headers
Slide by Okan Basegmez © Daniel S. Weld
12
The Three Basic HMM Problems
• Problem 1 (Evaluation): Given the observation
sequence O = o_1, …, o_T and an HMM model
λ = (A, B, π), how do we compute the probability
of O given the model?
Slide by Bonnie Dorr © Daniel S. Weld
13
The Three Basic HMM Problems
• Problem 2 (Decoding): Given the observation
sequence O = o_1, …, o_T and an HMM model
λ = (A, B, π), how do we find the state sequence
that best explains the observations?
Slide by Bonnie Dorr © Daniel S. Weld
14
The Three Basic HMM Problems
• Problem 3 (Learning): How do we adjust the
model parameters λ = (A, B, π) to maximize
P(O | λ)?
Slide by Bonnie Dorr © Daniel S. Weld
15
Information Extraction with HMMs
• Given a model M and its parameters,
Information Extraction is performed by
determining the state sequence that was most
likely to have generated the entire document
• This sequence can be recovered by dynamic
programming with the Viterbi algorithm
Slide by Okan Basegmez © Daniel S. Weld
16
Information Extraction with HMMs
• Probability of a string x being emitted by an
HMM M
• State sequence V(x|M) that has the
highest probability of having produced the
observation sequence
Slide by Okan Basegmez © Daniel S. Weld
17
Simple Example
[Figure: chain Rain_{t-1} → Rain_t → Rain_{t+1}, with Rain_t → Umbrella_t at each step]

Transition model:        Sensor model:
R_{t-1}  P(R_t)          R_t  P(U_t)
  t       0.7              t    0.9
  f       0.3              f    0.2

© Daniel S. Weld
18
Simple Example
           t=1    t=2    t=3    t=4
Rain:      true   true   true   true    (possible values at each step)
           false  false  false  false
Umbrella:  true   true   false  true    (observed)

© Daniel S. Weld
19
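Written out as HMM parameters, the umbrella example above looks roughly like this (Python/NumPy). The uniform prior over Rain_1 is my assumption; the slides only give the transition and sensor tables.

```python
import numpy as np

states = ["rain", "no_rain"]            # hidden
symbols = ["umbrella", "no_umbrella"]   # observed

pi = np.array([0.5, 0.5])               # assumed uniform prior over Rain_1 (not on the slide)
A = np.array([[0.7, 0.3],               # P(Rain_t = true/false | Rain_{t-1} = true)
              [0.3, 0.7]])              # P(Rain_t = true/false | Rain_{t-1} = false)
B = np.array([[0.9, 0.1],               # P(Umbrella = true/false | Rain = true)
              [0.2, 0.8]])              # P(Umbrella = true/false | Rain = false)

# Evidence from the slide: umbrella on days 1, 2, and 4; no umbrella on day 3.
observations = [0, 0, 1, 0]             # indices into `symbols`
```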
Forward Probabilities
• What is the probability that, given an HMM λ,
at time t the state is s_i and the partial
observation o_1 … o_t has been generated?

  α_t(i) = P(o_1 … o_t, q_t = s_i | λ)
Slide by Bonnie Dorr © Daniel S. Weld
20
Problem 1: Probability of an Observation Sequence
• What is P(O | λ)?
• The probability of an observation sequence is the
sum of the probabilities of all possible state
sequences in the HMM.
• Naïve computation is very expensive. Given T
observations and N states, there are N^T possible
state sequences (see the brute-force sketch below).
• Even for small HMMs, e.g. T=10 and N=10, there are 10
billion different paths
• The solution to this and to Problem 2 is to use dynamic
programming
Slide by Bonnie Dorr © Daniel S. Weld
21
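Just to illustrate why the naïve sum over all N^T state sequences blows up, here is a brute-force sketch in Python/NumPy (for demonstration only; never use this beyond tiny T):

```python
import itertools
import numpy as np

def naive_likelihood(pi, A, B, obs):
    """P(O | lambda) by explicitly summing over all N**T state sequences."""
    N, T = len(pi), len(obs)
    total = 0.0
    for path in itertools.product(range(N), repeat=T):   # N**T paths!
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, T):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total
```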
Forward Probabilities

α_t(i) = P(o_1 … o_t, q_t = s_i | λ)

α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) a_ij ] · b_j(o_t)

Slide by Bonnie Dorr © Daniel S. Weld
22
Forward Algorithm
• Initialization:
  α_1(i) = π_i b_i(o_1),   1 ≤ i ≤ N
• Induction:
  α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) a_ij ] · b_j(o_t),   2 ≤ t ≤ T, 1 ≤ j ≤ N
• Termination:
  P(O | λ) = Σ_{i=1..N} α_T(i)
Slide by Bonnie Dorr © Daniel S. Weld
23
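A direct NumPy transcription of the forward recursion above; this sketch omits the scaling (or log-space arithmetic) that a real implementation needs to avoid numerical underflow on long sequences.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns the T x N alpha table and P(O | lambda)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):                             # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # sum over predecessor states i
    return alpha, alpha[-1].sum()                     # termination: P(O | lambda)
```

With the umbrella parameters sketched earlier, `forward(pi, A, B, observations)` returns the likelihood of the four-day evidence sequence.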
Forward Algorithm Complexity
• In the naïve approach to solving Problem 1, it
takes on the order of 2T·N^T computations
• The forward algorithm takes on the order of
N²·T computations
Slide by Bonnie Dorr © Daniel S. Weld
24
Backward Probabilities
• Analogous to the forward probability, just in
the other direction
• What is the probability that, given an HMM λ
and given that the state at time t is s_i, the partial
observation o_{t+1} … o_T is generated?

  β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ)
Slide by Bonnie Dorr © Daniel S. Weld
25
Backward Probabilities

β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ)

β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j)

Slide by Bonnie Dorr © Daniel S. Weld
26
Backward Algorithm
• Initialization:
  β_T(i) = 1,   1 ≤ i ≤ N
• Induction:
  β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j),   t = T-1, …, 1, 1 ≤ i ≤ N
• Termination:
  P(O | λ) = Σ_{i=1..N} π_i b_i(o_1) β_1(i)
Slide by Bonnie Dorr © Daniel S. Weld
27
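The mirror-image sketch for the backward recursion, under the same caveat (no scaling):

```python
import numpy as np

def backward(pi, A, B, obs):
    """Backward algorithm: returns the T x N beta table and P(O | lambda)."""
    N, T = len(pi), len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                    # induction, t = T-1 ... 1
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta, (pi * B[:, obs[0]] * beta[0]).sum()  # termination
```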
Problem 2: Decoding
• The solution to Problem 1 (Evaluation) gives us the
sum of the probabilities of all paths through an HMM efficiently.
• For Problem 2, we want to find the single path with the
highest probability.
• We want to find the state sequence Q = q_1 … q_T such that
  Q = argmax_{Q'} P(Q' | O, λ)
Slide by Bonnie Dorr © Daniel S. Weld
28
Viterbi Algorithm
• Similar to computing the forward
probabilities, but instead of summing over
transitions from incoming states, compute
the maximum
• Forward:
  α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) a_ij ] · b_j(o_t)
• Viterbi recursion:
  δ_t(j) = [ max_{1≤i≤N} δ_{t-1}(i) a_ij ] · b_j(o_t)
Slide by Bonnie Dorr © Daniel S. Weld
29
Viterbi Algorithm
• Initialization:
  δ_1(i) = π_i b_i(o_1),   1 ≤ i ≤ N
• Induction:
  δ_t(j) = [ max_{1≤i≤N} δ_{t-1}(i) a_ij ] · b_j(o_t)
  ψ_t(j) = argmax_{1≤i≤N} δ_{t-1}(i) a_ij,   2 ≤ t ≤ T, 1 ≤ j ≤ N
• Termination:
  p* = max_{1≤i≤N} δ_T(i)
  q_T* = argmax_{1≤i≤N} δ_T(i)
• Read out path (backtrace):
  q_t* = ψ_{t+1}(q_{t+1}*),   t = T-1, …, 1
Slide by Bonnie Dorr © Daniel S. Weld
30
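A compact sketch of the Viterbi recursion and backtrace above (again without the log-space arithmetic practical IE systems use):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi algorithm: returns the most likely state sequence and its probability."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):                             # induction
        scores = delta[t - 1][:, None] * A            # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)                # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]                  # termination: q_T*
    for t in range(T - 1, 0, -1):                     # read out path by backtracking
        path.append(int(psi[t][path[-1]]))
    return list(reversed(path)), float(delta[-1].max())
```

For information extraction, `obs` would be the word indices of a document, and the returned path labels each word with a target or non-target state.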
Information Extraction
• We want specific info from text documents
• For example, from colloquium emails, we want:
Speaker name
Location
Start time
© Daniel S. Weld
31
Simple HMM for Job Titles
© Daniel S. Weld
32
HMMs for Info Extraction
For sparse extraction tasks:
• Separate HMM for each type of target
• Each HMM should
Model entire document
Consist of target and non-target states
Not necessarily fully connected
Given an HMM, how do we extract info?
Slide by Okan Basegmez © Daniel S. Weld
33
How Do We Learn an HMM?
• Two questions: structure & parameters
© Daniel S. Weld
34
Simplest Case
• Fix structure
• Learn transition & emission probabilities
• Training data…?
  Label each word as target or non-target
• Challenges
  Sparse training data
  Unseen words have zero estimated probability
  Smoothing! (see the counting sketch below)
© Daniel S. Weld
35
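A minimal sketch of this "simplest case": count transitions and emissions from word sequences whose states are already labeled, with add-one (Laplace) smoothing so unseen words and transitions keep nonzero probability. The function and argument names are mine, and Laplace smoothing is just one possible choice; the papers cited at the end use more sophisticated schemes such as shrinkage.

```python
import numpy as np

def train_supervised(sequences, n_states, vocab, alpha=1.0):
    """Estimate (pi, A, B) from labeled sequences [(word, state), ...],
    with add-one (Laplace) smoothing controlled by alpha."""
    word_id = {w: k for k, w in enumerate(vocab)}
    pi = np.full(n_states, alpha)
    A = np.full((n_states, n_states), alpha)
    B = np.full((n_states, len(vocab)), alpha)
    for seq in sequences:
        pi[seq[0][1]] += 1                            # which state starts the sequence
        for (word, s), (_, s_next) in zip(seq, seq[1:]):
            A[s, s_next] += 1                         # transition count
            B[s, word_id[word]] += 1                  # emission count
        B[seq[-1][1], word_id[seq[-1][0]]] += 1       # emission of the final word
    return (pi / pi.sum(),
            A / A.sum(axis=1, keepdims=True),
            B / B.sum(axis=1, keepdims=True))
```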
Problem 3: Learning
• So far: assumed we know the underlying model
  λ = (A, B, π)
• Often these parameters are estimated on
annotated training data, which has two drawbacks:
  Annotation is difficult and/or expensive
  Training data is different from the current data
• We want to maximize the parameters with
respect to the current data, i.e., we’re looking for
a model λ' such that λ' = argmax_λ P(O | λ)
Slide by Bonnie Dorr © Daniel S. Weld
36
Problem 3: Learning
• Unfortunately, there is no known way to analytically
find a global maximum, i.e., a model λ' such that
  λ' = argmax_λ P(O | λ)
• But it is possible to find a local maximum!
• Given an initial model λ, we can always find a model
  λ' such that P(O | λ') ≥ P(O | λ)
Slide by Bonnie Dorr © Daniel S. Weld
37
Parameter Re-estimation
• Use the forward-backward (or Baum-Welch)
algorithm, which is a hill-climbing algorithm
• Using an initial parameter instantiation, the
forward-backward algorithm iteratively re-estimates
the parameters and improves the probability that the
given observations are generated by the new parameters
Slide by Bonnie Dorr © Daniel S. Weld
38
Parameter Re-estimation
• Three parameters need to be re-estimated:
  Initial state distribution: π_i
  Transition probabilities: a_{i,j}
  Emission probabilities: b_i(o_t)
Slide by Bonnie Dorr © Daniel S. Weld
39
Re-estimating Transition Probabilities
• What’s the probability of being in state s_i at
time t and going to state s_j, given the
current model and parameters?

  ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)
Slide by Bonnie Dorr © Daniel S. Weld
40
Re-estimating Transition Probabilities

ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)

ξ_t(i, j) = α_t(i) a_{i,j} b_j(o_{t+1}) β_{t+1}(j)
            / [ Σ_{i=1..N} Σ_{j=1..N} α_t(i) a_{i,j} b_j(o_{t+1}) β_{t+1}(j) ]

Slide by Bonnie Dorr © Daniel S. Weld
41
Re-estimating Transition Probabilities
• The intuition behind the re-estimation
equation for transition probabilities is

  â_{i,j} = expected number of transitions from state s_i to state s_j
            / expected number of transitions from state s_i

• Formally:

  â_{i,j} = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} Σ_{j'=1..N} ξ_t(i, j')

Slide by Bonnie Dorr © Daniel S. Weld
42
Re-estimating Transition Probabilities
• Defining

  γ_t(i) = Σ_{j=1..N} ξ_t(i, j)

as the probability of being in state s_i, given
the complete observation O
• We can say:

  â_{i,j} = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i)

Slide by Bonnie Dorr © Daniel S. Weld
43
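Given the alpha and beta tables from the forward and backward sketches earlier, the ξ and γ quantities above can be computed as follows (an illustrative sketch; the per-time-step normalization matches the formula on slide 41):

```python
import numpy as np

def xi_gamma(alpha, beta, A, B, obs):
    """Compute xi[t, i, j] and gamma[t, i] from the forward/backward tables."""
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        num = (alpha[t][:, None] * A
               * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :])
        xi[t] = num / num.sum()        # denominator = sum over all (i, j), as on the slide
    gamma = xi.sum(axis=2)             # gamma[t, i] = sum_j xi[t, i, j], for t = 1..T-1
    return xi, gamma
```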
Review of Probabilities
• Forward probability: α_t(i)
  The probability of the partial observation o_1, …, o_t
  and being in state s_i at time t
• Backward probability: β_t(i)
  The probability of the partial observation o_{t+1}, …, o_T,
  given that the state at time t is s_i
• Transition probability: ξ_t(i, j)
  The probability of going from state s_i to state s_j, given
  the complete observation o_1, …, o_T
• State probability: γ_t(i)
  The probability of being in state s_i, given the complete
  observation o_1, …, o_T
Slide by Bonnie Dorr © Daniel S. Weld
44
Re-estimating Initial State Probabilities
• Initial state distribution: π_i is the
probability that s_i is a start state
• Re-estimation is easy:

  π̂_i = expected number of times in state s_i at time 1

• Formally:

  π̂_i = γ_1(i)

Slide by Bonnie Dorr © Daniel S. Weld
45
Re-estimation of Emission Probabilities
• Emission probabilities are re-estimated as

  b̂_i(k) = expected number of times in state s_i observing symbol v_k
           / expected number of times in state s_i

• Formally:

  b̂_i(k) = Σ_{t=1..T} δ(o_t, v_k) γ_t(i) / Σ_{t=1..T} γ_t(i)

  where δ(o_t, v_k) = 1 if o_t = v_k, and 0 otherwise

Note that δ here is the Kronecker delta function and is not
related to the δ in the discussion of the Viterbi algorithm!
Slide by Bonnie Dorr © Daniel S. Weld
46
The Updated Model
• Coming from λ = (A, B, π), we get to
λ' = (Â, B̂, π̂) by the following update rules:

  â_{i,j} = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i)

  b̂_i(k) = Σ_{t=1..T} δ(o_t, v_k) γ_t(i) / Σ_{t=1..T} γ_t(i)

  π̂_i = γ_1(i)

Slide by Bonnie Dorr © Daniel S. Weld
47
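Putting the pieces together, here is a compact sketch of one full re-estimation (Baum-Welch) step for a single observation sequence, following the three update rules above. It recomputes the forward/backward tables inline; a production implementation would add scaling or log-space arithmetic and would sum the expected counts over many training sequences.

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One forward-backward re-estimation step; returns (pi', A', B')."""
    N, T, M = len(pi), len(obs), B.shape[1]

    # Forward and backward passes (same recursions as the earlier sketches).
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()                        # P(O | lambda)

    gamma = alpha * beta / likelihood                   # gamma[t, i] for all t = 1..T
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A
                 * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]) / likelihood

    # Update rules from the slide.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    obs = np.asarray(obs)
    new_B = np.stack([gamma[obs == k].sum(axis=0) for k in range(M)], axis=1) \
            / gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B
```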
Expectation Maximization
• The forward-backward algorithm is an
instance of the more general EM algorithm
The E Step: Compute the forward and backward
probabilities for a given model
The M Step: Re-estimate the model parameters
Slide by Bonnie Dorr © Daniel S. Weld
48
Importance of HMM Topology
• Certain structures better capture the
observed phenomena in the prefix, target
and suffix sequences
• Building structures by hand does not scale to
large corpora
• Human intuitions don’t always correspond to
structures that make the best use of HMM
potential
Slide by Okan Basegmez © Daniel S. Weld
49
How Do We Learn Structure?
© Daniel S. Weld
50
Conclusion
• IE is performed by recovering the most
likely state sequence
(Viterbi)
• Transition and Emission Parameters can be
learned from training data
(Baum-Welch)
• Shrinkage improves parameter estimation
• Task-specific state-transition structure can
be automatically discovered
Slide by Okan Basegmez © Daniel S. Weld
51
References
• Information Extraction with HMM Structures Learned by
Stochastic Optimization, Dayne Freitag and Andrew McCallum
• Information Extraction with HMMs and Shrinkage, Dayne
Freitag and Andrew McCallum
• Learning Hidden Markov Model Structure for Information
Extraction, Kristie Seymore, Andrew McCallum, and Roni Rosenfeld
• Inducing Probabilistic Grammars by Bayesian Model Merging,
Andreas Stolcke and Stephen Omohundro
• Information Extraction Using Hidden Markov Models, T. R. Leek,
Master's thesis, UC San Diego
Slide by Okan Basegmez © Daniel S. Weld
52