Natural Language Processing with Python

Natural Language Processing
with Python
CS372: Spring, 2015
Lecture 12
Categorizing and Tagging Words
Jong C. Park
Department of Computer Science
Korea Advanced Institute of Science and Technology
CATEGORIZING AND
TAGGING WORDS
Using a Tagger
Tagged Corpora
Mapping Words to Properties Using Python Dictionaries
Automatic Tagging
N-Gram Tagging
Transformation-based Tagging
How to Determine the Category of a Word
2015-04-09
CS372: NLP with Python
2
Introduction

Questions
• What are lexical categories, and how are they
used in natural language processing?
• What is a good Python data structure for
storing words and their categories?
• How can we automatically tag each word of a
text with its word class?
2015-04-09
CS372: NLP with Python
3
Mapping Words to Properties
Using Python Dictionaries
Indexing Lists Versus Dictionaries
 Dictionaries in Python
 Defining Dictionaries
 Default Dictionaries
 Incrementally Updating a Dictionary
 Complex Keys and Values
 Inverting a Dictionary

dictionary data type
2015-04-09
CS372: NLP with Python
4
Indexing Lists Versus Dictionaries

List
• A text is treated in Python as a list of words.
• We can look up a particular item by giving its
index.
• text1[100]
 Figure 5-2. List lookup.
2015-04-09
CS372: NLP with Python
5
Indexing Lists Versus Dictionaries

With frequency distributions, we specify a
word and get back a number.
• fdist[‘monstrous’]
 Figure 5-3. Dictionary lookup.
Other names for dictionary are map, hashmap, hash, and associative array.
2015-04-09
CS372: NLP with Python
6
Indexing Lists Versus Dictionaries
In Figure 5-3, we mapped from names to
numbers, unlike with a list.
 Table 5-4. Linguistic objects as mappings
from keys to values.

The mapping is from a “word” to some structured object.
2015-04-09
CS372: NLP with Python
7
Dictionaries in Python

Python provides a dictionary data type
that can be used for mapping between
arbitrary types.
pos is defined as an empty dictionary.
2015-04-09
CS372: NLP with Python
8
Dictionaries in Python

We can employ the keys to retrieve values.

Question:
• How do we work out the legal keys for a
dictionary, where in the case of lists and strings
we can use len() to work out which integers will
be legal indexes?
If the dictionary is not big, we can simply inspect its contents by evaluating the variable pos.
2015-04-09
CS372: NLP with Python
9
Dictionaries in Python

To just find the keys, we can either convert
the dictionary to a list or use the dictionary
in a context where a list is expected, as the
parameter of sorted() or in a for loop.
2015-04-09
CS372: NLP with Python
10
Dictionaries in Python

The dictionary methods keys(), values(),
and items() allow us to access the keys,
values, and key-value pairs as separate
lists.
2015-04-09
CS372: NLP with Python
11
Dictionaries in Python

When we look something up in a dictionary, we
get only one value for each key.

However, there is a way of storing multiple
values in an entry.

We may use a list value, e.g., pos[‘sleep’] = [‘N’,
‘V’].
Cf. the CMU Pronouncing Dictionary
2015-04-09
CS372: NLP with Python
12
Defining Dictionaries

We can use the same key-value pair format to
create a dictionary.

Dictionary keys must be immutable types, such
as strings and tuples.
2015-04-09
CS372: NLP with Python
13
Default Dictionaries

If we try to access a key that is not in a dictionary,
we get an error.

Since Python 2.5, a special kind of dictionary
called a defaultdict has been available.
int, float, str, list, dict, tuple
When we access a non-existent entry, it is automatically added to the dictionary.
2015-04-09
CS372: NLP with Python
14
Default Dictionaries

We can use default dictionaries to deal
with hapaxes and low frequency words.
We can replace low frequency words with
a special “out of vocabulary” token.
2015-04-09
CS372: NLP with Python
15
Incrementally Updating a Dictionary

Example 5-3. Incrementally updating a
dictionary, and sorting by value.
2015-04-09
CS372: NLP with Python
16
Incrementally Updating a Dictionary
2015-04-09
CS372: NLP with Python
17
Incrementally Updating a Dictionary
itemgetter(n) returns a function that can be called on some other
sequence object to obtain the nth element.
2015-04-09
CS372: NLP with Python
18
Incrementally Updating a Dictionary

Useful programming idiom:
• We initialize a defaultdict and then use a for
loop to update its values.
2015-04-09
CS372: NLP with Python
19
Incrementally Updating a Dictionary

The following example uses the same
pattern to create an anagram dictionary.

NLTK provides a convenient way of
accumulating words through nltk.Index().
2015-04-09
CS372: NLP with Python
20
Complex Keys and Values

Default dictionaries can have complex
keys and values.
2015-04-09
CS372: NLP with Python
21
Inverting a Dictionary

Dictionaries support efficient lookup.
• However, finding a key given a value is slower
and more cumbersome.
• If we expect to do this kind of “reverse lookup”
often, it helps to construct a dictionary that
maps values to keys.
2015-04-09
CS372: NLP with Python
22
Inverting a Dictionary

Examples of reverse lookup
2015-04-09
CS372: NLP with Python
23
Inverting a Dictionary

Table 5-5. Python’s dictionary methods.
2015-04-09
CS372: NLP with Python
24
Automatic Tagging
The Default Tagger
 The Regular Expression Tagger
 The Lookup Tagger
 Evaluation

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories=‘news’)
>>> brown_sents = brown.sents(categories=‘news’)
2015-04-09
CS372: NLP with Python
25
The Default Tagger

The simplest possible tagger assigns the
same tag to each token.
• It establishes an important baseline.
• In order to get the best result, we tag each
word with the most likely tag.
Pros?
2015-04-09
CS372: NLP with Python
26
The Regular Expression Tagger

Assign tags to tokens on the basis of
matching patterns.
2015-04-09
CS372: NLP with Python
27
The Lookup Tagger

Find the hundred most frequent words
and store their most likely tag.
• Use it as the model for a “lookup tagger”.
2015-04-09
CS372: NLP with Python
28
The Lookup Tagger

Example 5-4. Lookup tagger performance
with varying model size.
2015-04-09
CS372: NLP with Python
29
The Lookup Tagger
2015-04-09
CS372: NLP with Python
30
Evaluation

We evaluate the performance of a tagger
relative to the tags a human expert would
assign.
• Since we usually don’t have access to an
expert and impartial human judge we make
do instead with gold standard test data.
• The tagger is regarded as being correct if the
tag it guesses for a given word is the same as
the gold standard tag.
2015-04-09
CS372: NLP with Python
31
Summary

Mapping Words to Properties Using
Python Dictionaries
•
•
•
•
•
•
•
Indexing Lists Versus Dictionaries
Dictionaries in Python
Defining Dictionaries
Default Dictionaries
Incrementally Updating a Dictionary
Complex Keys and Values
Inverting a Dictionary
2015-04-09
CS372: NLP with Python
32
Summary

Automatic Tagging
•
•
•
•
The Default Tagger
The Regular Expression Tagger
The Lookup Tagger
Evaluation
2015-04-09
CS372: NLP with Python
33
Project: First Presentation
30 April, 2015 (in class)
 Prepare a 5 minute presentation for your
term project (approximately 7 slides).

• The project must have a clear I/O.
• Explain the measure of the quality of the
output.
• Give a measure of how good your system is
(together with a prediction of your system’s
performance against this measure).
2015-04-09
CS372: NLP with Python
34