Maximum Entropy Model
LING 572
Fei Xia
02/07-02/09/06
History
• The concept of Maximum Entropy can be traced back along multiple threads to Biblical times.
• Introduced to the NLP area by Berger et al. (1996).
• Used in many NLP tasks: MT, tagging, parsing, PP attachment, LM, …
Outline
• Modeling: intuition, basic concepts, …
• Parameter training
• Feature selection
• Case study
Reference papers
• (Ratnaparkhi, 1997)
• (Ratnaparkhi, 1996)
• (Berger et al., 1996)
• (Klein and Manning, 2003)
→ These papers use different notations.
Modeling
The basic idea
• Goal: estimate p
• Choose the p with maximum entropy (or "uncertainty") subject to the constraints (or "evidence"):

$$H(p) = -\sum_{x \in A \times B} p(x) \log p(x)$$

where $x = (a, b)$, $a \in A$, $b \in B$.
Setting
• From training data, collect (a, b) pairs:
  – a: the thing to be predicted (e.g., a class in a classification problem)
  – b: the context
  – Ex: POS tagging:
    • a = NN
    • b = the words in a window and the previous two tags
• Learn the probability of each (a, b): p(a, b)
Features in POS tagging
(Ratnaparkhi, 1996)
[Table: each feature pairs a condition on the context (a.k.a. history) with the allowable classes.]
Maximum Entropy
• Why maximum entropy?
• Maximize entropy = minimize commitment
• Model all that is known and assume nothing about what is unknown:
  – Model all that is known: satisfy a set of constraints that must hold
  – Assume nothing about what is unknown: choose the most "uniform" distribution
  → choose the one with maximum entropy
Ex1: Coin-flip example
(Klein & Manning 2003)
• Toss a coin: p(H) = p1, p(T) = p2.
• Constraint: p1 + p2 = 1
• Question: what's your estimate of p = (p1, p2)?
• Answer: choose the p that maximizes H(p)

$$H(p) = -\sum_x p(x) \log p(x)$$

[Plot: H as a function of p1, with p1 = 0.3 marked.]
Coin-flip example (cont)
[Plot: H over (p1, p2), with the constraint line p1 + p2 = 1 and the point p1 = 0.3 marked.]
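To make the coin-flip intuition concrete, here is a small numeric sketch (not from the original slides; all names are illustrative). It evaluates H(p) over distributions satisfying p1 + p2 = 1, showing that entropy peaks at the uniform distribution (0.5, 0.5); adding a further constraint such as p1 = 0.3 pins the distribution down completely.

```python
import math

# Entropy of a discrete distribution: H(p) = -sum_x p(x) log p(x)
def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# Search the constraint line p1 + p2 = 1 on a grid: the maximum is at p1 = 0.5.
best = max(((p1, entropy((p1, 1 - p1))) for p1 in (i / 100 for i in range(101))),
           key=lambda t: t[1])
print(best)                 # (0.5, 0.693...) -- the uniform distribution
print(entropy((0.3, 0.7)))  # entropy of the single distribution allowed by p1 = 0.3
```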
Ex2: An MT example
(Berger et al., 1996)
Possible translations for the word "in":
Constraint:
Intuitive answer:
An MT example (cont)
Constraints:
Intuitive answer:
An MT example (cont)
Constraints:
Intuitive answer:
??
Ex3: POS tagging
(Klein and Manning, 2003)
Ex3 (cont)
Ex4: overlapping features
(Klein and Manning, 2003)
Modeling the problem
• Objective function: H(p)
• Goal: among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p):

$$p^* = \arg\max_{p \in P} H(p)$$

• Question: how to represent constraints?
Features
• A feature (a.k.a. feature function, indicator function) is a binary-valued function on events:

$$f_j : \varepsilon \to \{0, 1\}, \qquad \varepsilon = A \times B$$

• A: the set of possible classes (e.g., tags in POS tagging)
• B: the space of contexts (e.g., neighboring words/tags in POS tagging)
• Ex:

$$f_j(a, b) = \begin{cases} 1 & \text{if } a = \text{DET and } curWord(b) = \text{``that''} \\ 0 & \text{otherwise} \end{cases}$$
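As a concrete (hypothetical) illustration of such an indicator feature in code, assuming the context b is represented as a dict with a "curWord" field:

```python
# A minimal sketch of the example feature above; the representation of the
# context b (a dict with a "curWord" key) is an assumption for illustration.

def f_j(a, b):
    """Return 1 if the class a is DET and the current word in context b is "that"."""
    return 1 if a == "DET" and b.get("curWord") == "that" else 0

print(f_j("DET", {"curWord": "that"}))  # 1
print(f_j("NN",  {"curWord": "that"}))  # 0
```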
Some notations
Finite training sample of events: $S$

Observed probability of x in S: $\tilde{p}(x)$

The model p's probability of x: $p(x)$

The jth feature: $f_j$

Observed expectation of $f_j$ (empirical count of $f_j$):
$$E_{\tilde{p}} f_j = \sum_x \tilde{p}(x) f_j(x)$$

Model expectation of $f_j$:
$$E_p f_j = \sum_x p(x) f_j(x)$$
Constraints
• Model's feature expectation = observed feature expectation:

$$E_p f_j = E_{\tilde{p}} f_j$$

• How to calculate $E_{\tilde{p}} f_j$?

$$E_{\tilde{p}} f_j = \sum_x \tilde{p}(x) f_j(x) = \frac{1}{N} \sum_{i=1}^{N} f_j(x_i)$$

$$f_j(a, b) = \begin{cases} 1 & \text{if } a = \text{DET and } curWord(b) = \text{``that''} \\ 0 & \text{otherwise} \end{cases}$$

Training data → observed events
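A minimal sketch of computing the observed expectation from the training sample, assuming the events are stored as a list of (a, b) pairs (names are illustrative):

```python
# Observed expectation E_p~[f_j] = (1/N) * sum_{i=1..N} f_j(x_i), i.e. the
# empirical count of the feature divided by the number of training events.

def observed_expectation(events, f):
    """events: list of (a, b) pairs from the training data; f: binary feature f(a, b)."""
    return sum(f(a, b) for a, b in events) / len(events)

# Example with the f_j sketched earlier:
# d_j = observed_expectation(training_events, f_j)
```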
Restating the problem
The task: find p* s.t.

$$p^* = \arg\max_{p \in P} H(p)$$

where $P = \{p \mid E_p f_j = E_{\tilde{p}} f_j, \; j \in \{1, \ldots, k\}\}$

Objective function: $-H(p)$

Constraints:
$$E_p f_j = E_{\tilde{p}} f_j = d_j, \quad j \in \{1, \ldots, k\}$$
$$\sum_x p(x) = 1$$

Add a feature $f_0$:
$$f_0(a, b) = 1 \quad \forall (a, b)$$
$$E_p f_0 = E_{\tilde{p}} f_0 = 1$$
Questions
• Is P empty?
• Does p* exist?
• Is p* unique?
• What is the form of p*?
• How to find p*?
What is the form of p*?
(Ratnaparkhi, 1997)
$$P = \{p \mid E_p f_j = E_{\tilde{p}} f_j, \; j \in \{1, \ldots, k\}\}$$

$$Q = \{p \mid p(x) = \pi \prod_{j=1}^{k} \alpha_j^{f_j(x)}, \; \alpha_j > 0\}$$

Theorem: if $p^* \in P \cap Q$ then $p^* = \arg\max_{p \in P} H(p)$.
Furthermore, p* is unique.
Using Lagrangian multipliers
$$A(p) = -H(p) - \sum_{j=0}^{k} \lambda_j (E_p f_j - d_j)$$

Minimize A(p): set $A'(p) = 0$:

$$\frac{\partial}{\partial p(x)} \Big( \sum_x p(x) \log p(x) - \sum_{j=0}^{k} \lambda_j \big( \big(\sum_x p(x) f_j(x)\big) - d_j \big) \Big) = 0$$

$$\Rightarrow \; 1 + \log p(x) - \sum_{j=0}^{k} \lambda_j f_j(x) = 0$$

$$\Rightarrow \; \log p(x) = \sum_{j=0}^{k} \lambda_j f_j(x) - 1$$

$$\Rightarrow \; p(x) = e^{\sum_{j=0}^{k} \lambda_j f_j(x) - 1} = e^{\sum_{j=1}^{k} \lambda_j f_j(x)} \, e^{\lambda_0 - 1}$$

$$\Rightarrow \; p(x) = \frac{e^{\sum_{j=1}^{k} \lambda_j f_j(x)}}{Z}, \quad \text{where } Z = e^{1 - \lambda_0}$$
Two equivalent forms
$$p(x) = \pi \prod_{j=1}^{k} \alpha_j^{f_j(x)}$$

$$p(x) = \frac{e^{\sum_{j=1}^{k} \lambda_j f_j(x)}}{Z}$$

$$\pi = \frac{1}{Z}, \qquad \lambda_j = \ln \alpha_j$$
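The exponential form is easy to compute directly when the event space is small enough to enumerate. The sketch below (illustrative names, not code from the papers) builds p(x) from a weight vector:

```python
import math

# p(x) = exp(sum_j lambda_j * f_j(x)) / Z, with Z summing over the whole event space.

def model_probs(event_space, features, lambdas):
    """event_space: iterable of events x = (a, b); features: list of f_j(a, b); lambdas: weights."""
    scores = {x: math.exp(sum(lam * f(*x) for lam, f in zip(lambdas, features)))
              for x in event_space}
    Z = sum(scores.values())              # the normalizer Z
    return {x: s / Z for x, s in scores.items()}
```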
Relation to Maximum Likelihood
The log-likelihood of the empirical distribution $\tilde{p}$ as predicted by a model q is defined as

$$L(q) = \sum_x \tilde{p}(x) \log q(x)$$

Theorem: if $p^* \in P \cap Q$ then $p^* = \arg\max_{q \in Q} L(q)$.
Furthermore, p* is unique.
Summary (so far)
Goal: find p* in P, which maximizes H(p), where

$$P = \{p \mid E_p f_j = E_{\tilde{p}} f_j, \; j \in \{1, \ldots, k\}\}$$

It can be proved that, when p* exists, it is unique.

The model p* in P with maximum entropy is the model in Q that maximizes the likelihood of the training sample $\tilde{p}$, where

$$Q = \{p \mid p(x) = \pi \prod_{j=1}^{k} \alpha_j^{f_j(x)}, \; \alpha_j > 0\}$$
Summary (cont)
• Adding constraints (features):
(Klein and Manning, 2003)
– Lower maximum entropy
– Raise maximum likelihood of data
– Bring the distribution further from uniform
– Bring the distribution closer to data
Parameter estimation
Algorithms
• Generalized Iterative Scaling (GIS):
(Darroch and Ratcliff, 1972)
• Improved Iterative Scaling (IIS): (Della
Pietra et al., 1995)
GIS: setup
Requirements for running GIS:
• Obey the form of the model and the constraints:

$$p(x) = \frac{e^{\sum_{j=1}^{k} \lambda_j f_j(x)}}{Z}, \qquad E_p f_j = d_j$$

• An additional constraint:

$$\forall x \in \varepsilon: \; \sum_{j=1}^{k} f_j(x) = C$$

Let $C = \max_x \sum_{j=1}^{k} f_j(x)$ and add a new feature $f_{k+1}$:

$$\forall x \in \varepsilon: \; f_{k+1}(x) = C - \sum_{j=1}^{k} f_j(x)$$
GIS algorithm
• Compute $d_j$, j = 1, …, k+1
• Initialize $\lambda_j^{(1)}$ (any values, e.g., 0)
• Repeat until convergence:
  – For each j:
    • Compute

$$E_{p^{(n)}} f_j = \sum_x p^{(n)}(x) f_j(x), \qquad \text{where } p^{(n)}(x) = \frac{e^{\sum_{j=1}^{k+1} \lambda_j^{(n)} f_j(x)}}{Z}$$

    • Update

$$\lambda_j^{(n+1)} = \lambda_j^{(n)} + \frac{1}{C} \log \frac{d_j}{E_{p^{(n)}} f_j}$$
Approximation for calculating
feature expectation
$$E_p f_j = \sum_x p(x) f_j(x) = \sum_{a \in A, b \in B} p(a, b) f_j(a, b) = \sum_{a \in A, b \in B} p(b) \, p(a \mid b) \, f_j(a, b)$$

$$\approx \sum_{a \in A, b \in B} \tilde{p}(b) \, p(a \mid b) \, f_j(a, b) = \frac{1}{N} \sum_{i=1}^{N} \sum_{a \in A} p(a \mid b_i) \, f_j(a, b_i)$$
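Putting the algorithm and the approximation together, here is a rough GIS sketch for a conditional MaxEnt model p(a | b). All names and the data layout (a list of (a, b) training events, binary features f(a, b), constant feature sum C) are assumptions for illustration, not code from the papers.

```python
import math

# GIS with the empirical-context approximation of E_p[f_j] shown above.
def gis(events, classes, features, C, iterations=50):
    N = len(events)
    # Observed expectations d_j = (1/N) * sum_i f_j(a_i, b_i)
    d = [sum(f(a, b) for a, b in events) / N for f in features]
    lam = [0.0] * len(features)

    def p_a_given_b(b):
        scores = {a: math.exp(sum(l * f(a, b) for l, f in zip(lam, features)))
                  for a in classes}
        Z = sum(scores.values())
        return {a: s / Z for a, s in scores.items()}

    for _ in range(iterations):
        # Model expectations E_p[f_j] ~ (1/N) sum_i sum_a p(a | b_i) f_j(a, b_i)
        E = [0.0] * len(features)
        for _, b in events:
            pb = p_a_given_b(b)
            for j, f in enumerate(features):
                E[j] += sum(pb[a] * f(a, b) for a in classes) / N
        # GIS update: lambda_j += (1/C) * log(d_j / E_p[f_j])
        lam = [l + math.log(dj / Ej) / C if dj > 0 else l
               for l, dj, Ej in zip(lam, d, E)]
    return lam
```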
Properties of GIS
• $L(p^{(n+1)}) \geq L(p^{(n)})$
• The sequence is guaranteed to converge to p*.
• The convergence can be very slow.
• The running time of each iteration is O(NPA):
  – N: the training set size
  – P: the number of classes
  – A: the average number of features that are active for a given event (a, b)
IIS algorithm
• Compute $d_j$, j = 1, …, k+1, and $f^{\#}(x) = \sum_{j=1}^{k} f_j(x)$
• Initialize $\lambda_j^{(1)}$ (any values, e.g., 0)
• Repeat until convergence:
  – For each j:
    • Let $\delta_j$ be the solution to

$$\sum_x p^{(n)}(x) \, f_j(x) \, e^{\delta_j f^{\#}(x)} = d_j$$

    • Update

$$\lambda_j^{(n+1)} = \lambda_j^{(n)} + \delta_j$$
Calculating $\delta_j$

If $\forall x \in \varepsilon: \; \sum_{j=1}^{k} f_j(x) = C$, then

$$\delta_j = \frac{1}{C} \log \frac{d_j}{E_{p^{(n)}} f_j}$$

→ GIS is the same as IIS.

Else $\delta_j$ must be calculated numerically.
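One common way to compute $\delta_j$ numerically is Newton's method on the one-dimensional equation above. The sketch below assumes the event space is enumerable and uses illustrative names (p as a dict of probabilities, fj and fsharp as functions); it is not from the IIS papers.

```python
import math

# Solve sum_x p(x) f_j(x) exp(delta * f#(x)) = d_j for delta with Newton's method.
# g(delta) is monotonically non-decreasing (all terms are non-negative), so the
# iteration is well behaved whenever a solution exists.

def solve_delta(p, fj, fsharp, d_j, iterations=50, tol=1e-10):
    delta = 0.0
    for _ in range(iterations):
        g = sum(p[x] * fj(x) * math.exp(delta * fsharp(x)) for x in p) - d_j
        g_prime = sum(p[x] * fj(x) * fsharp(x) * math.exp(delta * fsharp(x)) for x in p)
        if abs(g) < tol or g_prime == 0:
            break
        delta -= g / g_prime        # Newton step
    return delta
```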
Feature selection
Feature selection
• Throw in many features and let the machine select the weights
  – Manually specify feature templates
• Problem: too many features
• An alternative: a greedy algorithm
  – Start with an empty set S
  – Add a feature at each iteration
Notation
With the feature set S: the model $p_S$
After adding a feature f: the model $p_{S \cup f}$
The gain in the log-likelihood of the training data: $\Delta L(S, f) = L(p_{S \cup f}) - L(p_S)$
Feature selection algorithm
(Berger et al., 1996)
• Start with S being empty; thus $p_S$ is uniform.
• Repeat until the gain is small enough:
  – For each candidate feature f:
    • Compute the model $p_{S \cup f}$ using IIS
    • Calculate the log-likelihood gain
  – Choose the feature with maximal gain, and add it to S
→ Problem: too expensive
Approximating gains
(Berger et al., 1996)
• Instead of recalculating all the weights, calculate only the weight of the new feature.
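A rough sketch of the greedy loop with this kind of approximation: keep the current weights fixed and try only a few values for the candidate's weight (a crude stand-in for the one-dimensional optimization described by Berger et al.; all helper names, the grid, and the data layout are illustrative assumptions).

```python
import math

# Greedy feature selection with approximate gains: score each candidate by how
# much the average log-likelihood improves when only its own weight is tuned.

def avg_log_likelihood(events, classes, features, lam):
    ll = 0.0
    for a, b in events:
        scores = {c: math.exp(sum(l * f(c, b) for l, f in zip(lam, features)))
                  for c in classes}
        ll += math.log(scores[a] / sum(scores.values()))
    return ll / len(events)

def select_features(events, classes, candidates, n_select,
                    grid=(-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)):
    S, lam = [], []
    base = avg_log_likelihood(events, classes, S, lam)   # uniform model
    for _ in range(n_select):
        best = None                                      # (gain, feature, weight)
        for f in candidates:
            if f in S:
                continue
            w, ll = max(((w, avg_log_likelihood(events, classes, S + [f], lam + [w]))
                         for w in grid), key=lambda t: t[1])
            if best is None or ll - base > best[0]:
                best = (ll - base, f, w)
        gain, f, w = best
        S.append(f)
        lam.append(w)   # a full system would retrain all weights (GIS/IIS) here
        base = avg_log_likelihood(events, classes, S, lam)
    return S, lam
```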
Training a MaxEnt Model
Scenario #1:
• Define feature templates
• Create the feature set
• Determine the optimum feature weights via GIS or IIS
Scenario #2:
• Define feature templates
• Create candidate feature set S
• At every iteration, choose the feature from S (with max
gain) and determine its weight (or choose top-n features
and their weights).
Case study
POS tagging
(Ratnaparkhi, 1996)
• Notation variation:
  – $f_j(a, b)$: a: class, b: context
  – $f_j(h_i, t_i)$: $h_i$: history for the ith word, $t_i$: tag for the ith word
• History:

$$h_i = \{w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-1}, t_{i-2}\}$$

• Training data:
  – Treat it as a list of $(h_i, t_i)$ pairs.
  – How many pairs are there? (One pair per word token.)
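A minimal sketch of building the history $h_i$ for word i, padding at sentence boundaries (the dictionary keys and the PAD token are illustrative choices, not from Ratnaparkhi's code):

```python
# h_i = {w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-1}, t_{i-2}}

def make_history(words, tags, i):
    """words: the sentence tokens; tags: tags already assigned to words[0..i-1]."""
    def w(k): return words[k] if 0 <= k < len(words) else "<PAD>"
    def t(k): return tags[k] if 0 <= k < len(tags) else "<PAD>"
    return {"w_i": w(i), "w_i-1": w(i - 1), "w_i-2": w(i - 2),
            "w_i+1": w(i + 1), "w_i+2": w(i + 2),
            "t_i-1": t(i - 1), "t_i-2": t(i - 2)}
```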
Using a MaxEnt Model
• Modeling:
• Training:
– Define feature templates
– Create the feature set
– Determine the optimum feature weights via
GIS or IIS
• Decoding:
Modeling
$$P(t_1, \ldots, t_n \mid w_1, \ldots, w_n) = \prod_{i=1}^{n} p(t_i \mid w_1^n, t_1^{i-1}) \approx \prod_{i=1}^{n} p(t_i \mid h_i)$$

$$p(t \mid h) = \frac{p(h, t)}{\sum_{t' \in T} p(h, t')}$$
Training step 1:
define feature templates
[Table of feature templates from (Ratnaparkhi, 1996): conditions on the history $h_i$ paired with the tag $t_i$.]
Step 2: Create feature set
Collect all the features from the training data
Throw away features that appear less than 10 times
Step 3: determine the feature
weights
• GIS
• Training time:
– Each iteration: O(NTA):
• N: the training set size
• T: the number of allowable tags
• A: average number of features that are active for a (h, t).
– About 24 hours on an IBM RS/6000 Model 380.
• How many features?
Decoding: Beam search
• Generate tags for w1, find the top N, and set s_1j accordingly, j = 1, 2, …, N
• For i = 2 to n (n is the sentence length):
  – For j = 1 to N:
    • Generate tags for w_i, given s_(i-1)j as the previous tag context
    • Append each tag to s_(i-1)j to make a new sequence
  – Find the N highest-probability sequences generated above, and set s_ij accordingly, j = 1, …, N
• Return the highest-probability sequence s_n1.
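A compact sketch of this beam search, assuming a conditional scorer p_tag(t, h) = p(t | h) and the make_history helper from the earlier sketch (both illustrative assumptions):

```python
import math

# Beam search: keep the N best partial tag sequences after each word.

def beam_search(words, tagset, p_tag, beam_size=5):
    beam = [(0.0, [])]                       # (log probability, tag sequence)
    for i in range(len(words)):
        expanded = []
        for logp, seq in beam:
            h = make_history(words, seq, i)  # previous tags come from seq
            for t in tagset:
                expanded.append((logp + math.log(p_tag(t, h)), seq + [t]))
        # Keep the N highest-probability sequences
        beam = sorted(expanded, key=lambda x: x[0], reverse=True)[:beam_size]
    return beam[0][1]                        # the highest-probability sequence
```

In practice the candidate tags for each word come from a tag dictionary for known words and from the full tagset for unknown words, as described on the next slide.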
[Figures: illustrations of beam search vs. Viterbi search.]
Decoding (cont)
• Tags for words:
  – Known words: use a tag dictionary
  – Unknown words: try all possible tags
• Ex: "time flies like an arrow"
• Running time: O(NTAB)
  – N: sentence length
  – B: beam size
  – T: tagset size
  – A: average number of features that are active for a given event
Experiment results
Comparison with other learners
• HMM: MaxEnt uses more context
• SDT: MaxEnt does not split data
• TBL: MaxEnt is statistical and it provides
probability distributions.
MaxEnt Summary
• Concept: choose the p* that maximizes entropy while
satisfying all the constraints.
• Max likelihood: p* is also the model within a model family
that maximizes the log-likelihood of the training data.
• Training: GIS or IIS, which can be slow.
• MaxEnt handles overlapping features well.
• In general, MaxEnt achieves good performance on many NLP tasks.
Additional slides
Ex4 (cont)
??