Analyzing Text Documents with Perl

CSC105, Introduction to Computer Science
Lab06: Analyzing Text Documents with Perl

[NOTE: This material assumes that you have reviewed Chapter 3, “Lists and Hashes,” in Beginning Perl, especially pages 104–9.]

Directions. In this lab, you will explore analyzing text using Perl programs in a series of separate activities. Create a submission folder named with your last name + Lab06. As before, this will serve as the home location for all of your programs or scripts. Upload this folder electronically when you complete the lab activities.

You will need the folder lab06Files, which is in the lab materials folder in the csc121 folder in your Box account. Download a copy to your desktop. It contains a text file that will be used in these activities. It is a good idea to move the test text file into your submission folder as before. Otherwise, you may have problems opening the file when executing your programs.

I. ACTIVITY: Understanding Hashes.

1. Hashes. A hash data structure is like an array or list in that it can hold any number of values that you can retrieve later when needed. But instead of indexing the values by number (i.e., position), you look up the values by name. Technically, a hash is an associative array. Think of it as two arrays that are linked side-by-side. The first array is a list of keys used to index another list of values. In Figure 1, the hash has five elements. The keys are always string values. Each one references a specific value associated with it. Hash values can be any legal scalar value. In Figure 1, there are both numeric and text instances.

Figure 1: Picturing a Perl hash as an associative array.

2. Keys. The keys are arbitrary. You can use any string expression as a key; moreover, they do not have to be organized in any particular order. The only constraint is that they must be unique. The reason for this restriction is that keys uniquely identify their associated values. Thus, the key acts like a tag.
Whenever we are searching for a particular value, we identify it by its tag (key). It is common to use a variable named $key for denoting keys.

The advantage of using hashes is that you do not have to know where a value is stored, as you would in a list or array. You can find it just by indicating the key.

3. Values. As mentioned, values may be numbers, text, or mixtures of both. In addition, they do not have to be unique. For example, if you were storing the heights of a group of people, it is certainly possible that two or more might be the same height. It is also common to use a variable named $value for denoting values.

4. Here is a simple example that defines a hash called coins. Notice that the % sign is used to signify that the variable coins refers to a hash, just as $ denotes scalar values and @ refers to lists.

#!/usr/bin/perl
# example1.pl
# define the hash
%coins = ("Quarter", 25, "Dime", 10, "Nickel", 5);
# print the hash
print %coins;
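
Sections 1 to 3 describe looking up a value by its key. The short sketch below, which is separate from example1.pl, shows what that might look like with the same coins data. The file name and the variables $dime, $has_dollar, and $count are made up for illustration, and the use strict, use warnings, and my declarations are additions that the lab's own examples do not use.

```perl
#!/usr/bin/perl
# lookup_sketch.pl (hypothetical name; not one of the lab's required programs)
use strict;
use warnings;

my %coins = ("Quarter", 25, "Dime", 10, "Nickel", 5);

# Look up a single value by its key.
my $dime = $coins{"Dime"};
print "A dime is worth $dime cents\n";

# Add a new pair; assigning to an existing key would overwrite its value.
$coins{"Penny"} = 1;

# exists tests whether a key is present without fetching its value.
my $has_dollar = exists $coins{"Dollar"} ? "yes" : "no";
print "Is there a Dollar key? $has_dollar\n";

# In scalar context, keys returns the number of pairs in the hash.
my $count = keys %coins;
print "The hash now holds $count pairs\n";
```

Note that a single element is written with $ (as in $coins{"Dime"}) because one value is a scalar, while % names the hash as a whole.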
Create your own version of example1.pl and observe the output.

5. A more civilized version is given below as example2.pl.

#!/usr/bin/perl
# example2.pl
# define hash
%coins = (
    'Quarter' => 25,
    'Dime'    => 10,
    'Nickel'  => 5,
    'Penny'   => 1
);
# loop through the hash printing each pair
while (($key, $value) = each %coins) {
    $line = $key . ', ' . $value;
    print "$line\n";
}
The big arrow operator (=>) can be used for pairing keys with values. The formatting, however, is optional. Try this version and note the difference in output. The loop uses the each function, which returns the next key/value pair on every call and so makes it possible to iterate over the entire hash structure. In practice, each is almost always used within a while loop, as in the example. Note that each returns the pairs in no particular order.

NOTE: The $line variable demonstrates how you can concatenate additional information or formatting along with the hash keys and values.

6. This next version, example3.pl, shows how to organize the keys using the sort( ) function.

#!/usr/bin/perl
# example3.pl
# define hash
%coins = (
    'Quarter' => 25,
    'Dime'    => 10,
    'Nickel'  => 5,
    'Penny'   => 1
);
# loop through the hash printing each pair
foreach $x (sort keys %coins) {
    print "$x: $coins{$x}\n";
}
Try this version and compare the output with example2.pl. The expression $coins{$x} looks up the value stored under the key $x. The usual Perl idiom would be $coins{$key}, but this example shows that the name of the scalar variable has no special significance. Whatever variable you designate in the foreach loop is the one to use inside the braces.

7. If we wanted to sort the hash by values instead of keys, some changes would be necessary to the hash values. Examine this last version, example4.pl.

#!/usr/bin/perl
# example4.pl
# define hash
%coins = (
    'Quarter' => 0.25,
    'Dime'    => 0.10,
    'Nickel'  => 0.05,
    'Penny'   => 0.01
);
# loop through the hash printing each pair
# ($key holds each key in turn; the sort block compares the values)
foreach $key (sort {$coins{$a} cmp $coins{$b}} keys %coins) {
    print "$key: $coins{$key}\n";
}
The numeric values had to be changed so that the comparisons come out correctly. Try this version. Is the output sorted properly? How does it work? HINT: Recall the cmp operator.

8. Include copies of example1.pl, example2.pl, example3.pl, and example4.pl in your submission folder.

II. ACTIVITY: Counting Word Frequencies.

1. Create a program that opens a file, reads it line by line, processes the words in each line (counting the number of times that each word occurs), and outputs a list of the words with their total number of occurrences. Use the file gettysburg2.txt for testing purposes. A copy is available in the folder lab06files.

HINTS:
• You will need to create an empty hash initially. Here is an example: %words = ();
• To count the words properly, it will be necessary to convert every line to lowercase and strip all the punctuation from it. (These are tasks that you have done before.)
• Finally, the hash structure can be used to count the frequency of words by using an assignment like $words{$x} = $words{$x} + 1;
where the variable $x refers to the current word in the word list, which is also the key in the hash structure. Specifically, in a loop that processes each word in the current line, the assignment statement either (a) adds a new word to the hash structure with a value of 1, or (b) increments the value for a key already found in the hash structure.

Name your program frequency.pl and include it in the submission folder. Here is a brief excerpt of how the output might look.

shall 3
should 1
so 3
struggled 1
take 1
task 1
testing 1
that 13
the 11
their 1
these 2
they 3
this 4
those 1
thus 1
to 8
under 1
unfinished 1
us 3
vain 1
war 2
we 10
what 2
whether 1
which 2
who 3
will 1
work 1
world 1
years 1
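
The counting assignment from the hints can be tried on its own before any file handling is added. The sketch below is not the full frequency.pl: it assumes the words have already been lowercased and stripped of punctuation, and it uses a hard-coded list (the hypothetical @list) instead of reading gettysburg2.txt. It also uses the ++ shorthand in place of the longer assignment, along with use strict and use warnings, which the lab's examples leave out.

```perl
#!/usr/bin/perl
# count_sketch.pl (hypothetical name; a fragment, not the full frequency.pl)
use strict;
use warnings;

# Assumption: these words are already lowercase and free of punctuation.
my @list = qw(we can not dedicate we can not consecrate we can not hallow);

my %words = ();
foreach my $x (@list) {
    # Same effect as $words{$x} = $words{$x} + 1, but ++ treats a missing
    # entry as 0 without triggering an "uninitialized value" warning.
    $words{$x}++;
}

# Print the words in alphabetical order with their counts.
foreach my $x (sort keys %words) {
    print "$x $words{$x}\n";
}
```

With warnings enabled, the longer form $words{$x} = $words{$x} + 1 still counts correctly, but it reports an uninitialized value the first time each word is seen; that is the main reason for choosing ++ here.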
III. ACTIVITY: Building a Basic Concordance

1. A concordance is an alphabetical listing of the principal words in a text document along with their immediate contexts. Concordances are used for studying word usage, word frequencies, analyzing idioms, creating indices, and a host of other applications. Create a Perl program called concord.pl that constructs a basic concordance of its input text file. Use gettysburg2.txt for testing purposes. Here is an excerpt of how the output might look. (Color has been added for emphasis only.)

can
4 conceived and so dedicated can long endure. We are met on a great
battlefield of
10 The world will little note nor long remember what we say here, but
it can never forget
cannot
7 fitting and proper that we should do this. But in a larger sense, we
cannot dedicate,
8 we cannot consecrate, we cannot hallow this ground. The brave men,
living and dead
8 we cannot consecrate, we cannot hallow this ground. The brave men,
living and dead
cause
14 these honored dead we take increased devotion to that cause for
which they gave the
civil
3 Now we are engaged in a great civil war, testing whether that nation
or any nation so
come
5 that war. We have come to dedicate a portion of that field as a
final resting-place
conceived
2 conceived in liberty and dedicated to the proposition that all men
are created equal.
4 conceived and so dedicated can long endure. We are met on a great
battlefield of
consecrate
8 we cannot consecrate, we cannot hallow this ground. The brave men,
living and dead
consecrated
9 who struggled here have consecrated it far above our poor power to
add or detract.
continent
1 Fourscore and seven years ago our fathers brought forth on this
continent a new nation,
created
2 conceived in liberty and dedicated to the proposition that all men
are created equal.
dead
8 we cannot consecrate, we cannot hallow this ground. The brave men,
living and dead
14 these honored dead we take increased devotion to that cause for
which they gave the
15 last full measure of devotion--that we here highly resolve that
these dead shall not
As you can see, each block lists
• the word (in alphabetical order), and
• each of the original lines in which it occurs;
• the lines are also numbered to signify where they occur in the file;
• a line is repeated if the word occurs in it more than once. (A smarter version might notice this, but we won’t.)

As before, the basic process involves reading in the file line by line and processing it. Afterwards, the results are displayed. Here is a pseudocoded version of the algorithm.

# create a hash structure of words in the text (keys) and the lines they occur in (values)
while there are more lines to read
    increment the line number counter
    remove the newline from the line
    copy the original line so that it can be reused
    convert the line to lowercase
    strip out all punctuation
    split the line into individual words
    for each word in the line
        add it to a hash structure along with the line number and line copy
    end for
end while (lines)
# display the results
for each key in the sorted hash structure
    split the values (i.e., line occurrences) into separate units in a list
    for each unit in the list
        print each unit (line occurrence)
    end for (units)
end for (keys)
The trick is to add each line occurrence to the hash structure by concatenating it to the value. Here is one way to do it.

foreach $x (@word) {
    $words{$x} = $words{$x} . $lineNum . ' ' . $copy . '#';
}

@word is the list of words from the current line. $words{$x} is the variable signifying the value of the hash whose key is $x. $lineNum is the variable that tracks the current line’s number in the text, and $copy is the copy of the original line of text from the file. As before, the assignment either adds a new <key, value> pair to the hash, if $x is not yet in the hash structure, or appends the line number + a space + a copy of the original line + the # symbol to the existing hash value for $x.

Later, when it is time to print out these lines, the ‘#’ symbol can be used to separate the individual line occurrences (using the split( ) function) for printing. Specifically,

@lines = split("#", $words{$x});
where @lines is a list of strings separated from the current hash value. So, the ‘#’ symbol is just a convenient marker for distinguishing one line occurrence from another.
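
The append-then-split round trip can be checked on a tiny in-memory case before it is wired into concord.pl. The sketch below is only an illustration under stated assumptions: the word "nation", the two line numbers, and the line texts are made up rather than read from gettysburg2.txt, and the .= operator stands in for the longer concatenating assignment. The use strict, use warnings, and my declarations are also additions.

```perl
#!/usr/bin/perl
# marker_sketch.pl (hypothetical name; a fragment, not the full concord.pl)
use strict;
use warnings;

my %words = ();
my $x = "nation";

# First made-up occurrence: the key is new, so .= starts a fresh value.
my ($lineNum, $copy) = (1, "a new nation,");
$words{$x} .= $lineNum . ' ' . $copy . '#';

# Second made-up occurrence: the key exists, so .= appends to its value.
($lineNum, $copy) = (3, "whether that nation or any nation so");
$words{$x} .= $lineNum . ' ' . $copy . '#';

# split drops the empty field after the final '#', so two strings come back.
my @lines = split("#", $words{$x});
foreach my $entry (@lines) {
    print "$entry\n";
}
```

One limitation worth keeping in mind: if a line of the input text itself contained a # character, split would treat it as a separator too, so the marker only works for text that does not use that symbol.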