Python for NLP and the Natural Language Toolkit
CS1573: AI Application Development, Spring 2003
(modified from Edward Loper’s notes)
Outline
• Review: Introduction to NLP (knowledge of language, ambiguity, representations and algorithms, applications)
• HW 2 discussion
• Tutorials: Basics, Probability
Python and Natural Language Processing
• Python is a great language for NLP:
– Simple
– Easy to debug:
• Exceptions
• Interpreted language
– Easy to structure
• Modules
• Object oriented programming
– Powerful string manipulation
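• As a quick sketch of that last point, Python's built-in string methods alone handle many common text operations (plain Python, no NLTK needed):
>>> s = 'my dog likes his dog'
>>> s.split()
['my', 'dog', 'likes', 'his', 'dog']
>>> s.replace('dog', 'cat')
'my cat likes his cat'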
Modules and Packages
• Python modules “package program code and data for reuse.” (Lutz)
  – Similar to a library in C or a package in Java.
• Python packages are hierarchical modules (i.e., modules that contain other modules).
• Three commands for accessing modules:
  1. import
  2. from…import
  3. reload
Modules and Packages: import
• The import command loads a module:
# Load the regular expression module
>>> import re
• To access the contents of a module, use dotted
names:
# Use the search function from the re module
>>> re.search(r'\w+', text)
• To list the contents of a module, use dir:
>>> dir(re)
['DOTALL', 'I', 'IGNORECASE', …]
Modules and Packages: from…import
• The from…import command loads individual
functions and objects from a module:
# Load the search function from the re module
>>> from re import search
• Once an individual function or object is loaded
with from…import, it can be used directly:
# Use the search function from the re module
>>> search(r'\w+', text)
Import vs. from…import
• import:
  – Keeps module functions separate from user functions.
  – Requires the use of dotted names.
  – Works with reload.
• from…import:
  – Puts module functions and user functions together.
  – More convenient names.
  – Does not work with reload.
Modules and Packages: reload
• If you edit a module, you must use the reload
command before the changes become visible in
Python:
>>> import mymodule
...
>>> reload(mymodule)
• The reload command only affects modules that
have been loaded with import; it does not
update individual functions and objects loaded
with from...import.
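• A minimal sketch of this caveat, assuming a local module mymodule that defines a function myfunc (both hypothetical names):
>>> from mymodule import myfunc
>>> import mymodule
>>> reload(mymodule)
# mymodule.myfunc is now the updated function, but the name
# myfunc bound by from...import still refers to the old object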
Introduction to NLTK
• The Natural Language Toolkit (NLTK) provides:
– Basic classes for representing data relevant to natural
language processing.
– Standard interfaces for performing tasks, such as
tokenization, tagging, and parsing.
– Standard implementations of each task, which can be
combined to solve complex problems.
NLTK: Example Modules
• nltk.token: processing individual elements of text,
such as words or sentences.
• nltk.probability: modeling frequency distributions
and probabilistic systems.
• nltk.tagger: tagging tokens with supplemental information, such as parts of speech or WordNet sense tags.
• nltk.parser: high-level interface for parsing texts.
• nltk.chartparser: a chart-based implementation of
the parser interface.
• nltk.chunkparser: a regular-expression based
surface parser.
NLTK: Top-Level Organization
• NLTK is organized as a flat hierarchy of packages
and modules.
• Each module provides the tools necessary to address a specific task.
• Modules contain two types of classes:
– Data-oriented classes are used to represent information
relevant to natural language processing.
– Task-oriented classes encapsulate the resources and
methods needed to perform a specific task.
To the First Tutorials
• Tokens and Tokenization
• Frequency Distributions
The Token Module
• It is often useful to think of a text in terms of
smaller elements, such as words or sentences.
• The nltk.token module defines classes for
representing and processing these smaller
elements.
• What might be other useful smaller elements?
Tokens and Types
• The term word can be used in two different ways:
  1. To refer to an individual occurrence of a word
  2. To refer to an abstract vocabulary item
• For example, the sentence “my dog likes his dog” contains five occurrences of words, but four vocabulary items.
• To avoid confusion, use more precise terminology:
  1. Word token: an occurrence of a word
  2. Word type: a vocabulary item
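• The distinction is easy to check with plain Python (a quick sketch, independent of NLTK):
>>> words = 'my dog likes his dog'.split()
>>> len(words)        # number of word tokens
5
>>> len(set(words))   # number of word types
4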
Tokens and Types (continued)
• In NLTK, tokens are constructed from their types using the Token constructor:
>>> from nltk.token import *
>>> my_word_type = 'dog'
>>> my_word_token = Token(my_word_type)
>>> my_word_token
'dog'@[?]
• Token member functions include type and loc.
Text Locations
• A text location @[s:e] specifies a region of a text:
  – s is the start index
  – e is the end index
• The text location @[s:e] specifies the text beginning at s, and including everything up to (but not including) the text at e.
• This definition is consistent with Python slices.
• Think of indices as appearing between elements:
  0  I  1  saw  2  a  3  man  4
• Shorthand notation when location width = 1.
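• Compare Python slices, which treat start and end indices the same way:
>>> words = ['I', 'saw', 'a', 'man']
>>> words[1:3]
['saw', 'a']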
Text Locations (continued)
• Indices can be based on different units:
– character
– word
– sentence
• Locations can be tagged with sources (files, other
text locations – e.g., the first word of the first
sentence in the file)
• Location member functions:
  – start
  – end
  – unit
  – source
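• A sketch of how these members might be used, assuming the WSTokenizer introduced below (the exact return values shown are illustrative):
>>> tokens = WSTokenizer().tokenize('I saw a man')
>>> loc = tokens[1].loc()    # location of 'saw'
>>> loc.start(), loc.end(), loc.unit()
(1, 2, 'w')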
Tokenization
• The simplest way to represent a text is with a
single string.
• Difficult to process text in this format.
• Often, it is more convenient to work with a list of
tokens.
• The task of converting a text from a single string
to a list of tokens is known as tokenization.
Tokenization (continued)
• Tokenization is harder than it seems:
I’ll see you in New York.
The aluminum-export ban.
• The simplest approach is to use “graphic words”
(i.e., separate words using whitespace)
• Another approach is to use regular expressions to
specify which substrings are valid words.
• NLTK provides a generic tokenization interface:
TokenizerI
TokenizerI
• Defines a single method, tokenize, which takes a string and returns a list of tokens.
• tokenize is independent of the level of tokenization and of the implementation algorithm.
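• As a sketch, a custom tokenizer might implement this interface as follows (the TokenizerI import path and the hypothetical MyLineTokenizer are assumptions based on this tutorial, not verified NLTK code):
>>> from nltk.token import TokenizerI, Token
>>> class MyLineTokenizer(TokenizerI):
...     # Return one Token per line of the string
...     def tokenize(self, s):
...         return [Token(line) for line in s.split('\n')]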
Example
• from nltk.token import WSTokenizer
from nltk.draw.plot import Plot

# Extract a list of words from the corpus
corpus = open('corpus.txt').read()
tokens = WSTokenizer().tokenize(corpus)

# Count up how many times each word length occurs
wordlen_count_list = []
for token in tokens:
    wordlen = len(token.type())
    # Add zeros until wordlen_count_list is long enough
    while wordlen >= len(wordlen_count_list):
        wordlen_count_list.append(0)
    # Increment the count for this word length
    wordlen_count_list[wordlen] += 1
Plot(wordlen_count_list)
Next Tutorial: Probability
• An experiment is any process which leads to
a well-defined outcome
• A sample is any possible outcome of a
given experiment
• Rolling a die?
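• For example, rolling a die is an experiment; its samples are the faces 1 through 6. A sketch using the FreqDist class introduced below:
>>> import random
>>> from nltk.probability import FreqDist
>>> freq_dist = FreqDist()
>>> for i in range(100):
...     freq_dist.inc(random.randint(1, 6))   # record each roll
>>> freq_dist.N()
100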
Outline
• Review Basics
• Probability
  – Experiments and Samples
  – Frequency Distributions
  – Conditional Frequency Distributions
Review: NLTK Goals
• Classes for NLP data
• Interfaces for NLP tasks
• Implementations, easily combined (what is an example?)
Accessing NLTK
• What is the relation to Python?
Words
• Types and Tokens
• Text Locations
• Member Functions
Tokenization
• TokenizerI
• Implementations:
>>> tokenizer = WSTokenizer()
>>> tokenizer.tokenize(text_str)
['Hello'@[0w], 'world.'@[1w], 'This'@[2w],
 'is'@[3w], 'a'@[4w], 'test'@[5w],
 'file.'@[6w]]
Word Length Freq. Distribution Example
• from nltk.token import WSTokenizer
from nltk.probability import SimpleFreqDist

# Extract a list of words from the corpus
corpus = open('corpus.txt').read()
tokens = WSTokenizer().tokenize(corpus)

# Construct a frequency distribution of word lengths
wordlen_freqs = SimpleFreqDist()
for token in tokens:
    wordlen_freqs.inc(len(token.type()))

# Extract the set of word lengths found in the corpus
wordlens = wordlen_freqs.samples()
Frequency Distributions
• A frequency distribution records the
number of times each outcome of an
experiment has occurred
• >>> freq_dist = FreqDist()
>>> for token in document:
...     freq_dist.inc(token.type())
• Pattern: construct an empty distribution, then populate it by recording experimental outcomes
Methods
• The freq method returns the frequency of a given sample.
• We can find the number of times a given sample occurred with the count method.
• We can find the total number of sample outcomes recorded by a frequency distribution with the N method.
• The samples method returns a list of all samples that have been recorded as outcomes by a frequency distribution.
• We can find the sample with the greatest number of outcomes with the max method.
Examples of Methods
• >>> freq_dist.count('the')
6
• >>> freq_dist.freq('the')
0.012
• >>> freq_dist.N()
500
• >>> freq_dist.max()
'the'
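• Note that freq is just count divided by N: 6 / 500 = 0.012.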
Simple Word Length Example
• >>> from nltk.token import WSTokenizer
>>> from nltk.probability import FreqDist
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)
# What is the distribution of word lengths in a corpus?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     freq_dist.inc(len(token.type()))
– What is the "outcome" for our experiment?
Simple Word Length Example
• >>> from nltk.token import WSTokenizer
>>> from nltk.probability import FreqDist
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)
# What is the distribution of word lengths in a corpus?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     freq_dist.inc(len(token.type()))
– This length is the "outcome" for our experiment, so we use inc() to increment its count in a frequency distribution.
Complex Word Length Example
• # Define vowels as "a", "e", "i", "o", and "u"
>>> VOWELS = ('a', 'e', 'i', 'o', 'u')
# Distribution of lengths for words ending in vowels?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     if token.type()[-1].lower() in VOWELS:
...         freq_dist.inc(len(token.type()))
• What is the condition?
More Complex Example
• # What is the distribution of word lengths for
# words following words that end in vowels?
>>> ended_in_vowel = 0   # Did the last word end in a vowel?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     if ended_in_vowel:
...         freq_dist.inc(len(token.type()))
...     ended_in_vowel = token.type()[-1].lower() in VOWELS
Conditional Frequency Distributions
• A condition specifies the context in which an
experiment is performed
• A conditional frequency distribution is a collection of frequency distributions for the same experiment, run under different conditions
• The individual frequency distributions are indexed
by the condition.
• NLTK ConditionalFreqDist class
• >>> cfdist = ConditionalFreqDist()
>>> cfdist
<ConditionalFreqDist with 0 conditions>
Conditional Frequency Distributions (continued)
• To access the frequency distribution for a condition, use the indexing operator:
>>> cfdist['a']
<FreqDist with 0 outcomes>
• # Record lengths of some words starting with 'a'
>>> for word in 'apple and arm'.split():
...     cfdist['a'].inc(len(word))
• # How many are 3 characters long?
>>> cfdist['a'].freq(3)
0.66667
• To list accessed conditions, use the conditions method:
>>> cfdist.conditions()
['a']
Example: Conditioning on a Word’s Initial Letter
• >>> from nltk.token import WSTokenizer
>>> from nltk.probability import ConditionalFreqDist
>>> from nltk.draw.plot import Plot
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)
>>> cfdist = ConditionalFreqDist()
Example (continued)
• # How does the initial letter affect word length?
>>> for token in tokens:
...     outcome = len(token.type())
...     condition = token.type()[0].lower()
...     cfdist[condition].inc(outcome)
• What are the condition and the outcome?
Example (continued)
• # How does the initial letter affect word length?
>>> for token in tokens:
...     outcome = len(token.type())
...     condition = token.type()[0].lower()
...     cfdist[condition].inc(outcome)
• What are the condition and the outcome?
• Condition = the initial letter of the token
• Outcome = its word length
Prediction
• Prediction is the problem of deciding a likely
outcome for a given run of an experiment.
• To predict the outcome, we first examine a
training corpus.
• Training corpus:
  – The context and outcome for each run are known
  – Given a new run, we choose the outcome that occurred most frequently for its context
  – A conditional frequency distribution finds this most frequent occurrence
Prediction Example: Outline
• Record each outcome in the training corpus, using the context that the experiment was run under as the condition
• Access the frequency distribution for a given context with the indexing operator
• Use the max() method to find the most likely outcome
Example: Predicting Words
• Predict a word's type, based on the preceding word's type
• >>> from nltk.token import WSTokenizer
>>> from nltk.probability import ConditionalFreqDist
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)
>>> cfdist = ConditionalFreqDist()   # empty
Example (continued)
• >>> context = None   # The type of the preceding word
>>> for token in tokens:
...     outcome = token.type()
...     cfdist[context].inc(outcome)
...     context = token.type()
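• Note that the first token is recorded under the context None, since no word precedes it.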
Example (continued)
• >>> cfdist['prediction'].max()
'problems'
>>> cfdist['problems'].max()
'in'
>>> cfdist['in'].max()
'the'
• What are we predicting here?
Example (continued)
We predict the most likely word for any context.
Generation application:
>>> word = 'prediction'
>>> for i in range(15):
...     print word,
...     word = cfdist[word].max()
prediction problems in the frequency distribution of the frequency distribution of the frequency distribution of
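Note how always choosing max() soon traps the generator in a cycle ("the frequency distribution of"), since each word in the cycle predicts the next.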
For Next Time
• HW3
• To run NLTK from unixs.cis.pitt.edu, you should add /afs/cs.pitt.edu/projects/nltk/bin to your search path
• Regular Expressions (J&M handout, NLTK tutorial)