WSTA Lecture 14
Part-of-speech Tagging
•  Tags
-  introduction
-  tagged corpora, tagsets
•  Tagging
-  motivation
-  Simple unigram tagger
-  Markov model tagging
-  Rule based tagging
•  Evaluation
Slide credits: Steven Bird
COMP90042
Trevor Cohn
NLP versus IR
•  Covered predominantly IR up until now
-  processing, stemming, indexing, querying, etc
-  mostly bag of words and vector space models
-  word order unimportant*
-  word inflections unimportant*
•  What do we mean by “natural language processing”?
-  and how does this differ from / overlap with IR?
Tags 1: ambiguity
•  time flies like an arrow
•  fruit flies like a banana
•  ambiguous headlines
-  http://www.snopes.com/humor/nonsense/head97.htm
-  British Left Waffles on Falkland Islands
-  Juvenile Court to Try Shooting Defendant
Tags 2: Representations to resolve ambiguity
Exercise: tag some headlines
•  British Left Waffles on Falkland Islands
•  Juvenile Court to Try Shooting Defendant
Tags 3: Tagged Corpora
•  The/DT limits/NNS to/TO legal/JJ absurdity/NN
stretched/VBD another/DT notch/NN this/DT week/NN
when/WRB the/DT Supreme/NNP Court/NNP refused/VBD
to/TO hear/VB an/DT appeal/NN from/IN a/DT case/NN
that/WDT says/VBZ corporate/JJ defendants/NNS must/MD
pay/VB damages/NNS even/RB after/IN proving/VBG that/IN
they/PRP could/MD not/RB possibly/RB have/VB
caused/VBN the/DT harm/NN ./.
•  Source: Penn Treebank Corpus
(nltk/data/treebank/wsj_0130)
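The word/TAG notation above is easy to read back into (word, tag) pairs. A minimal sketch in plain Python follows; the helper name parse_tagged is made up for illustration, and NLTK provides nltk.tag.str2tuple for the same job on a single token.

```python
def parse_tagged(text):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash, so the token './.' becomes ('.', '.').
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sample = "The/DT limits/NNS to/TO legal/JJ absurdity/NN ./."
tagged = parse_tagged(sample)
print(tagged)
```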
Another kind of tagging: Sense Tagging
•  The Pantheon's interior/a , still in its original/a form/a ,
•  interior: (a) inside a space; (b) inside a country and at a distance from
the coast or border; (c) domestic; (d) private.
•  original: (a) relating to the beginning of something; (b) novel; (c) that
from which a copy is made; (d) mentally ill or eccentric.
•  form: (a) definite shape or appearance; (b) body; (c) mould; (d)
particular structural character exhibited by something; (e) a style as in
music, art or literature; (f) homogeneous polynomial in two or more
variables; ...
Significance of Parts of Speech
•  a word's POS tells us a lot about the word and its neighbors
-  limits the range of meanings (deal), pronunciations (OBject vs obJECT), or both (wind)
-  helps in stemming
-  limits the range of following words for ASR
-  helps select nouns from a document for IR
•  More advanced uses (these won't make sense yet):
-  basis for chunk parsing
-  parsers can build trees directly on the POS tags instead of maintaining a lexicon
-  first step for many different NLP tasks
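As one concrete case, the "select nouns for IR" use listed above amounts to a filter over tagged tokens. This is a rough sketch only: the function name select_nouns and the hand-tagged input are assumptions for illustration, and it relies on the Penn Treebank convention that noun tags begin with NN.

```python
def select_nouns(tagged_tokens):
    """Keep words whose Penn Treebank tag starts with 'NN' (e.g. NN, NNS, NNP)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]


# Hand-tagged toy input, not output from a real tagger.
tagged = [("the", "DT"), ("Supreme", "NNP"), ("Court", "NNP"),
          ("refused", "VBD"), ("an", "DT"), ("appeal", "NN")]
nouns = select_nouns(tagged)
print(nouns)
```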
What does Tagging do?
1.  Collapses Distinctions
•  Lexical identity may be discarded
•  e.g. all personal pronouns tagged with PRP
2.  Introduces Distinctions
•  Ambiguities may be removed
•  e.g. deal tagged with NN or VB; deal tagged with DEAL1 or DEAL2
3.  Helps classification and prediction
There are many tagsets. This is due to:
-  the different ways to define a tag
-  the need to balance classification and prediction
-  harder/easier classification task vs more/less information about context
Tagged Corpora
•  Brown Corpus:
-  The first digital corpus (1961), Francis and Kucera, Brown U
-  Contents: 500 texts, each 2000 words long
   -  from American books, newspapers, magazines, representing 15 genres:
   -  science fiction, romance fiction, press reportage, scientific writing, popular lore.
-  See nltk/data/brown/
-  See reading for definition of Brown tags
•  Penn Treebank:
-  First syntactically annotated corpus
-  Contents: 1 million words from WSJ; POS tags, syntax trees
-  See nltk/data/treebank/ (5% sample)
Tagged Corpora in other languages
•  Parsed treebanks in many other languages
-  Basque, Bulgarian, Chinese, Czech, Finnish, French
-  German, Greek, Hebrew, Hungarian, Irish, Italian
-  Japanese, Korean, Persian, Romanian, Spanish
-  Swedish … and many more!
•  All with part-of-speech annotation
-  language specific tag sets
-  recent work on mapping to common tag set
   -  https://code.google.com/p/universal-pos-tags/
   -  http://universaldependencies.github.io/docs/
Application of tagged corpora: genre classification
Important Treebank Tags
•  NN    noun                        JJ    adjective
•  NNP   proper noun                 CC    coord conjunc (and/or/..)
•  DT    determiner (the/a/..)       CD    cardinal number
•  IN    preposition (in/of/..)      PRP   personal pronoun (I/you/..)
•  VB    verb                        RB    adverb (gently, now)
•  -R    comparative (better)
•  -S    superlative (bravest) or plural
•  -$    possessive (my)
Verb Tags
•  VBP   base present         take
•  VB    infinitive           take
•  VBD   past                 took
•  VBG   present participle   taking
•  VBN   past participle      taken
•  VBZ   present 3sg          takes
•  MD    modal                can, would
Simple Tagging in NLTK
•  Reading Tagged Corpora:
-  >>> from nltk.corpus import treebank
   >>> treebank.fileids()
   >>> treebank.tagged_sents('wsj_0001.mrg')[0]
   [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'),
   ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ...]
-  see also Brown corpus, CoNLL 2000, Alpino and more
•  Tagging a string
-  >>> import nltk
   >>> nltk.tag.pos_tag('Fruit flies like a banana'.split())
   [('Fruit', 'NN'), ('flies', 'NNS'), ('like', 'IN'), ('a', 'DT'), ('banana', 'NN')]
   (N.b. this used a maximum entropy tagger when the slides were written; recent NLTK
   releases default to an averaged perceptron tagger, so the output may differ)
Tagging Algorithms
•  rule based taggers
-  original methods, based on layers of rules about how to tag words based on their context (e.g., Brill tagger)
•  unigram tagger
-  assign the tag which is the most probable for the word in question, based on frequency in a training corpus
•  bigram tagger, n-gram tagger
-  inspect one or more tags in the context (usually, immediate left context)
•  Maximum entropy and HMM taggers (next lecture)
Unigram Tagging
•  Unigram = table of tag frequencies for each word
-  e.g. in tagged WSJ sample (from Penn Treebank):
   -  deal: NN (11); VB (1); VBP (1)
•  Training
-  load a corpus
-  count the occurrences of each (word, tag) in the corpus
•  Tagging
-  look up the most common tag for each word to tag
•  Gets 90% accuracy!
•  See the code in nltk.tag.UnigramTagger
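The training and tagging steps above can be sketched in a few lines of plain Python. This is not the nltk.tag.UnigramTagger implementation; the function names and the three-sentence toy corpus are invented for illustration, and unknown words simply get None.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Count (word, tag) occurrences and keep the most frequent tag per word."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_unigram(model, words, default=None):
    """Look up each word's most common tag; unseen words get `default`."""
    return [(word, model.get(word, default)) for word in words]


# Toy training data (invented, not taken from the treebank).
train = [
    [("the", "DT"), ("deal", "NN"), ("closed", "VBD")],
    [("a", "DT"), ("deal", "NN")],
    [("they", "PRP"), ("deal", "VBP"), ("cards", "NNS")],
]
model = train_unigram(train)
result = tag_unigram(model, ["they", "deal", "the", "cards", "slowly"])
print(result)
```

Note that "deal" comes out as NN even after "they", because the lookup never sees the surrounding words.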
The problem with unigram taggers
•  what evidence do they consider when assigning a tag?
•  when does this method fail?
Fixing the problem using a bigram tagger
•  construct sentences involving a word which can have two different parts of speech
-  e.g. wind: noun, verb
-  The wind blew forcefully
-  I wind up the clock
•  gather statistics for current tag, based on:
-  (i) current word; (ii) previous tag
-  result: a 2-D array of frequency distributions
-  what does this look like?
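One possible answer to the question above, as a sketch: index a table by (previous tag, current word) and store a frequency distribution over tags in each cell. The tags on the two example sentences are assigned by hand here, and the start symbol "<s>" is an assumption of this sketch.

```python
from collections import Counter, defaultdict


def train_bigram(tagged_sents, start="<s>"):
    """Build a table: (previous tag, current word) -> Counter of current tags."""
    table = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = start
        for word, tag in sent:
            table[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return table


# The two 'wind' sentences from the slide, tagged by hand for illustration.
train = [
    [("The", "DT"), ("wind", "NN"), ("blew", "VBD"), ("forcefully", "RB")],
    [("I", "PRP"), ("wind", "VBP"), ("up", "RP"), ("the", "DT"), ("clock", "NN")],
]
table = train_bigram(train)
for context in [("DT", "wind"), ("PRP", "wind")]:
    print(context, dict(table[context]))
```

The same word now lands in different cells depending on the previous tag, which is exactly the evidence the unigram tagger lacks.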
Generalizing the context
Bigram & n-gram taggers
•  n-gram tagger: consider n-1 previous tags
-  how big does the model get?
-  how much data do we need to train it?
•  Sparse-data problem:
-  As n gets large, the chances of having seen all possible patterns of tags during training diminish (large: >3)
•  Approaches:
-  Combine taggers (backoff, weighted average)
-  statistical estimation of the probability of unseen events
•  See nltk.tag.sequential.NgramTagger
-  and various others in nltk.tag package
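The backoff idea above can be illustrated with a hand-rolled sketch: try the bigram context first, fall back to the word's unigram tag, and finally to a default tag. This is not NLTK's implementation; the function names, the "<s>" start symbol, the default tag NN and the toy corpus are all assumptions. Tagging is greedy, so each decision uses the tag just predicted for the previous word.

```python
from collections import Counter, defaultdict


def train(tagged_sents, start="<s>"):
    """Collect bigram-context counts and plain unigram counts in one pass."""
    bigram = defaultdict(Counter)   # (previous tag, word) -> tag counts
    unigram = defaultdict(Counter)  # word -> tag counts
    for sent in tagged_sents:
        prev = start
        for word, tag in sent:
            bigram[(prev, word)][tag] += 1
            unigram[word][tag] += 1
            prev = tag
    return bigram, unigram


def tag_with_backoff(words, bigram, unigram, default="NN", start="<s>"):
    """Greedy left-to-right tagging: bigram, else unigram, else default."""
    tagged, prev = [], start
    for word in words:
        if (prev, word) in bigram:
            tag = bigram[(prev, word)].most_common(1)[0][0]
        elif word in unigram:
            tag = unigram[word].most_common(1)[0][0]
        else:
            tag = default
        tagged.append((word, tag))
        prev = tag  # the predicted tag becomes the next word's context
    return tagged


# Toy corpus (hand-tagged, invented for illustration).
train_sents = [
    [("the", "DT"), ("wind", "NN"), ("blew", "VBD")],
    [("the", "DT"), ("wind", "NN"), ("howled", "VBD")],
    [("I", "PRP"), ("wind", "VBP"), ("the", "DT"), ("clock", "NN")],
]
bigram, unigram = train(train_sents)
result = tag_with_backoff(["slowly", "I", "wind", "the", "clock"], bigram, unigram)
print(result)
```

In this run "slowly" falls through to the default, "I" is resolved by the unigram table, and "wind" is tagged VBP from its bigram context even though its most common unigram tag is NN.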
Markov Model Taggers
•  Recall n-gram language model
-  similar problem of modelling next word given previous words, similar issues with sparsity and estimation
-  here we focus on generating tag sequences rather than words
•  both are instances of a Markov model
-  tag sequence modelled as a Markov chain
-  each tag is linked to word sequence
•  Can we just predict each tag in sequence?
-  need to know the preceding tag(s)
-  but these are unknown…
•  Next lecture, we’ll explore this further using Hidden Markov Models
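To make the parallel above concrete, here is the first-order (bigram) case written out for both models. The notation is assumed here rather than taken from the slides: w_i are words, t_i are tags, and w_0, t_0 stand for a start-of-sentence symbol.

```latex
P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
\qquad \text{and} \qquad
P(t_1, \dots, t_n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})
```

The two factorisations have the same form; the difference for tagging is that the t_i are not observed at test time.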
The Brill Rule-Based Tagger
•  The Linguistic Complaint:
-  where is the linguistic knowledge of a tagger?
-  just a massive table of numbers
-  aren't there any linguistic insights that could emerge from the data?
•  Transformation-Based Tagging / Brill Tagging:
-  Tag each word with its most likely tag
-  Repeatedly correct tags based on context
-  Example rule: NN VB PREVTAG TO
   -  to/TO race/NN -> to/TO race/VB
-  Other contexts:
   -  PREV1OR2TAG, PREV1OR2WD, WDNEXTTAG, ...
•  See nltk.tag.brill.BrillTagger
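Applying a single transformation such as "NN VB PREVTAG TO" is simple to sketch. The function below is an illustrative stand-in, not the nltk.tag.brill API, and it covers only the PREVTAG context; learning which rules to apply is the harder part and is not shown.

```python
def apply_prevtag_rule(tagged, from_tag, to_tag, prev_tag):
    """Rewrite from_tag as to_tag wherever the preceding tag is prev_tag."""
    result = list(tagged)
    for i in range(1, len(result)):
        word, tag = result[i]
        # Changes take effect immediately, so later positions see updated tags.
        if tag == from_tag and result[i - 1][1] == prev_tag:
            result[i] = (word, to_tag)
    return result


# Initial tags as a most-likely-tag baseline might produce them (hand-written here).
initial = [("to", "TO"), ("race", "NN"), ("the", "DT"), ("race", "NN")]
corrected = apply_prevtag_rule(initial, "NN", "VB", "TO")
print(corrected)
```

Only the first "race" changes, since the second one follows DT rather than TO.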
Evaluating Tagger Performance
•  Need an objective measure of performance
•  Commonly use per-token accuracy
-  measured against heldout ‘gold standard’ data
-  fraction of words tagged correctly
•  Simple methods get ~90% performance
-  1 and 2-gram
-  Brill tagger
•  HMMs get ~95% and CRFs get ~97% performance
-  see nltk.tag.{hmm,tnt,crf,stanford,senna,…}
•  Why can't we get 100%?
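Per-token accuracy, as described above, is the fraction of tokens whose predicted tag matches the gold tag. A minimal sketch follows, assuming gold and predicted sentences are aligned token for token; the function name and the example tags are chosen for illustration.

```python
def token_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag equals the gold-standard tag."""
    correct = total = 0
    for gold, pred in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, pred_tag) in zip(gold, pred):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0


# Gold tags written by hand; the prediction tags 'like' as IN instead of VBP.
gold = [[("fruit", "NN"), ("flies", "NNS"), ("like", "VBP"), ("a", "DT"), ("banana", "NN")]]
pred = [[("fruit", "NN"), ("flies", "NNS"), ("like", "IN"), ("a", "DT"), ("banana", "NN")]]
acc = token_accuracy(gold, pred)
print(acc)  # 4 of 5 tags match: 0.8
```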
Tagging: broader lessons
•  Tagging has several properties that are typical of NLP
-  classification (words have properties)
-  disambiguation through representation
-  sequence learning from annotated corpora
-  simple, general methods:
   -  conditional frequency distributions
•  Cool things you can do now: elementary NLU, NLG
•  Review:
-  tokenization + tagging = segmentation and annotation of words
-  chunking = segmentation and annotation of word sequences
Readings
•  One of:
-  Jurafsky & Martin, chapter 5
-  Manning & Schütze, chapter 10
•  NLTK tagging tutorial
-  http://www.nltk.org/book/ch05.html
•  Next lecture
-  tagging with (hidden) Markov models
-  other sequence tagging tasks
   -  named entity tagging
   -  shallow parsing